Class BasicURLNormalizer


  • public class BasicURLNormalizer
    extends URLFilter
    Code borrowed from Apache Nutch. Converts URLs to a normal form:
    • remove dot segments in path: /./ or /../
    • remove default ports, e.g. 80 for protocol http://
    • normalize percent-encoding in URL paths
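The first two steps in the list above can be illustrated with the JDK alone. The following is a minimal sketch, not the crawler-commons or Nutch implementation: the class name `UrlNormalizeSketch` and its methods are hypothetical, only `http` and `https` default ports are handled, and the input is assumed to be an absolute URL with a host.

```java
import java.net.URI;
import java.util.Locale;

/** Illustrative sketch only; hypothetical class, not part of crawler-commons. */
final class UrlNormalizeSketch {

    /** Returns the default port for the schemes this sketch knows, or -1 otherwise. */
    static int defaultPort(String scheme) {
        switch (scheme) {
            case "http":
                return 80;
            case "https":
                return 443;
            default:
                return -1;
        }
    }

    /** Removes dot segments and a default port; returns the input unchanged if it has no scheme or host. */
    static String normalize(String url) {
        // URI.normalize() removes "." segments and resolves ".." segments in the path.
        URI uri = URI.create(url).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            return url;
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        if (port == defaultPort(scheme)) {
            port = -1; // drop the port when it is the scheme's default
        }
        // Rebuild from the raw (still percent-encoded) components to avoid re-encoding.
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(uri.getHost());
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append(uri.getRawPath());
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

Under these assumptions, `UrlNormalizeSketch.normalize("http://example.com:80/a/./b/../c")` returns `http://example.com/a/c`, while a non-default port such as `:8443` is left in place.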
    • Field Detail

      • LOG

        public static final org.slf4j.Logger LOG
    • Constructor Detail

      • BasicURLNormalizer

        public BasicURLNormalizer()
    • Method Detail

      • filter

        public String filter(String urlString)
        Description copied from class: URLFilter
        Returns a modified version of the input URL, or null if the URL should be removed.
        Specified by:
        filter in class URLFilter
        Parameters:
        urlString - a URL string to check against filter(s)
        Returns:
        a filtered URL
      • parseQueryParameters

        public static List<crawlercommons.filters.basic.BasicURLNormalizer.NameValuePair> parseQueryParameters(String s,
                int queryStartIdx,
                Set<String> queryElementsToRemove)
        Receives the URL query string and parses it into a list of name-value pairs. Optionally removes the query parameters whose names are listed in queryElementsToRemove.
        Parameters:
        s - a String containing the URL file (as per java.net.URL.getFile(), i.e., the path followed by the query, if any)
        queryStartIdx - the index position of the query part in the string s.
        queryElementsToRemove - a set of query parameter names to be ignored while parsing the query parameters.
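The parsing described above can be sketched with the JDK alone. This is an illustration, not the library's code: `QueryParseSketch` is a hypothetical class, `Map.Entry` stands in for the library's `NameValuePair` type, `queryStartIdx` is assumed to point at the first character after the `?`, and `s` is assumed to contain no fragment.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; hypothetical class, not part of crawler-commons. */
final class QueryParseSketch {

    /**
     * Splits the query part of s into name-value pairs, skipping names in namesToRemove.
     * Assumption: queryStartIdx is the index of the first character after '?'.
     */
    static List<Map.Entry<String, String>> parse(String s, int queryStartIdx, Set<String> namesToRemove) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        if (queryStartIdx < 0 || queryStartIdx >= s.length()) {
            return pairs; // no query part
        }
        String query = s.substring(queryStartIdx);
        for (String part : query.split("&")) {
            if (part.isEmpty()) {
                continue; // ignore empty elements such as "a=1&&b=2"
            }
            int eq = part.indexOf('=');
            String name = eq < 0 ? part : part.substring(0, eq);
            // A parameter without '=' gets a null value in this sketch.
            String value = eq < 0 ? null : part.substring(eq + 1);
            if (namesToRemove != null && namesToRemove.contains(name)) {
                continue; // drop parameters the caller asked to remove
            }
            pairs.add(new SimpleImmutableEntry<>(name, value));
        }
        return pairs;
    }
}
```

For example, parsing `"/path?b=2&utm_source=x&a=1"` with `queryStartIdx` 6 and a removal set containing `utm_source` yields the pairs `b=2` and `a=1`, in input order.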
      • formatQueryParameters

        public static String formatQueryParameters(List<crawlercommons.filters.basic.BasicURLNormalizer.NameValuePair> parameters)
        Formats a list of query parameter name-value pairs into a query parameter string.
        Parameters:
        parameters - the query parameter name-value pairs
        Returns:
        a URL query string
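Formatting is the inverse step: joining name-value pairs with `&` and `=`. The sketch below is a hypothetical stand-in that uses `Map.Entry` in place of `NameValuePair`; it keeps the input order and does no escaping. Whether the real method sorts or re-encodes parameters is not stated here.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; hypothetical class, not part of crawler-commons. */
final class QueryFormatSketch {

    /** Joins the pairs as name=value elements separated by '&'. */
    static String format(List<Map.Entry<String, String>> parameters) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> p : parameters) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(p.getKey());
            // A null value is written as a bare name, an empty value as "name=".
            if (p.getValue() != null) {
                sb.append('=').append(p.getValue());
            }
        }
        return sb.toString();
    }
}
```

With the pairs `("a", "1")`, `("b", null)` and `("c", "")`, this sketch produces `a=1&b&c=`.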
      • unescapePath

        public static String unescapePath(String path)
        Removes percent-encoding from a URL path segment for characters that should be unescaped according to RFC 3986.
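As an illustration of this kind of unescaping, the sketch below decodes a `%XX` escape only when it stands for an RFC 3986 "unreserved" character (letters, digits, `-`, `.`, `_`, `~`) and leaves every other escape untouched. This is an assumption about the character set; the real method may unescape a different set. `UnescapePathSketch` is a hypothetical class.

```java
/** Illustrative sketch only; hypothetical class, not part of crawler-commons. */
final class UnescapePathSketch {

    /** RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~". */
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    /** Returns the value of an ASCII hex digit, or -1 if c is not one. */
    static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes %XX escapes of unreserved characters; all other input is copied as-is. */
    static String unescape(String path) {
        StringBuilder sb = new StringBuilder(path.length());
        int i = 0;
        while (i < path.length()) {
            char ch = path.charAt(i);
            if (ch == '%' && i + 2 < path.length()) {
                int hi = hex(path.charAt(i + 1));
                int lo = hex(path.charAt(i + 2));
                if (hi >= 0 && lo >= 0 && isUnreserved(hi * 16 + lo)) {
                    sb.append((char) (hi * 16 + lo));
                    i += 3; // consume the whole escape
                    continue;
                }
            }
            sb.append(ch); // reserved or malformed escapes stay encoded
            i++;
        }
        return sb.toString();
    }
}
```

For instance, `UnescapePathSketch.unescape("/%7Euser/a%2Fb/%41%2d")` returns `/~user/a%2Fb/A-`: the escaped slash `%2F` is reserved and stays encoded, so the path structure does not change.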