Package crawlercommons.filters.basic
Class BasicURLNormalizer
- java.lang.Object
  - crawlercommons.filters.URLFilter
    - crawlercommons.filters.basic.BasicURLNormalizer
public class BasicURLNormalizer extends URLFilter
Code borrowed from Apache Nutch. Converts URLs to a normal form:
- remove dot segments in path: /./ or /../
- remove default ports, e.g. 80 for protocol http://
- normalize percent-encoding in URL paths

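
The first two normalization steps listed above can be illustrated with the JDK alone. What follows is a minimal sketch, not the code of BasicURLNormalizer: the class NormalFormSketch and its method normalize are hypothetical names. It relies on java.net.URI.normalize() for dot-segment removal, which may handle edge cases differently from the library. It also treats 443 as the default port for https, which the text above does not mention.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; not part of crawler-commons.
class NormalFormSketch {

    // Assumes an absolute http or https URL with a plain host name.
    static String normalize(String url) {
        // URI.normalize() resolves "/./" and "/../" segments in the path.
        URI uri = URI.create(url).normalize();
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);

        // Drop the port when it is the default one for the scheme.
        int port = uri.getPort();
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);

        StringBuilder sb = new StringBuilder(scheme).append("://").append(uri.getHost());
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }

        // An empty path becomes "/"; the fragment is dropped for brevity.
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }
}
```

With this sketch, normalize("http://example.com:80/a/./b/../c") yields "http://example.com/a/c", while a non-default port such as 8080 is left in place.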
Nested Class Summary
Modifier and Type | Class | Description
static class | BasicURLNormalizer.Builder | A builder class for the BasicURLNormalizer.
static class | BasicURLNormalizer.IdnNormalization |

Field Summary
Modifier and Type | Field | Description
static org.slf4j.Logger | LOG |

Constructor Summary
Constructor | Description
BasicURLNormalizer() |
BasicURLNormalizer(BasicURLNormalizer.Builder builder) |

Method Summary
Modifier and Type | Method | Description
static String | escapePath(String path) | Convert path segment of URL from Unicode to UTF-8 and escape all characters which should be escaped according to RFC3986.
static String | escapePath(String path, boolean[] extraEscapedBytes) |
String | filter(String urlString) | Returns a modified version of the input URL or null if the URL should be removed.
static String | formatQueryParameters(List<crawlercommons.filters.basic.BasicURLNormalizer.NameValuePair> parameters) | Formats a list of query parameter name-value pairs into a query parameter string.
static void | main(String[] args) |
static BasicURLNormalizer.Builder | newBuilder() | Create a new builder object for creating a customized BasicURLNormalizer object.
static List<crawlercommons.filters.basic.BasicURLNormalizer.NameValuePair> | parseQueryParameters(String s, int queryStartIdx, Set<String> queryElementsToRemove) | Receives the URL query string and parses it into a list of name-value pairs.
static String | unescapePath(String path) | Remove % encoding from path segment in URL for characters which should be unescaped according to RFC3986.

Constructor Detail

BasicURLNormalizer
public BasicURLNormalizer()

BasicURLNormalizer
public BasicURLNormalizer(BasicURLNormalizer.Builder builder)

Method Detail

filter
public String filter(String urlString)
Description copied from class: URLFilter
Returns a modified version of the input URL or null if the URL should be removed.
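
Because filter returns either a modified URL or null, a caller has to check for null before using the result. The sketch below shows one way to do that. It does not call the library: standInFilter is a hypothetical placeholder that merely mimics the "modified URL or null" contract, and FilterContractSketch and applyFilter are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of handling a "modified URL or null" contract.
class FilterContractSketch {

    // Applies the filter to every URL and keeps only the non-null results.
    static List<String> applyFilter(List<String> urls, UnaryOperator<String> filter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String result = filter.apply(url);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    // Stand-in for a real filter (an assumption, not the library's logic):
    // trims whitespace and rejects anything that is not an http(s) URL.
    static String standInFilter(String url) {
        if (url == null) {
            return null;
        }
        String trimmed = url.trim();
        return trimmed.startsWith("http") ? trimmed : null;
    }
}
```

In real code the stand-in would be replaced by a method reference to the filter method of a BasicURLNormalizer instance.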

parseQueryParameters
public static List<crawlercommons.filters.basic.BasicURLNormalizer.NameValuePair> parseQueryParameters(String s, int queryStartIdx, Set<String> queryElementsToRemove)
Receives the URL query string and parses it into a list of name-value pairs. Optionally removes the query parameters named in queryElementsToRemove.
Parameters:
s - a String containing the URL file (as per java.net.URL.getFile(), i.e., the path + query + fragment)
queryStartIdx - the index position of the query part in the string s
queryElementsToRemove - a set of query parameter names to be ignored while parsing the query parameters
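
The parsing step described above can be sketched without the library. This is an illustration under stated assumptions, not the actual implementation: QueryParseSketch and parse are hypothetical names, Map.Entry stands in for the NameValuePair type, queryStartIdx is assumed to point at the first character after '?', and parsing is assumed to stop at a '#' fragment marker.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the crawler-commons implementation.
class QueryParseSketch {

    static List<Map.Entry<String, String>> parse(String s, int queryStartIdx,
            Set<String> namesToRemove) {
        // Assumption: the query ends at the fragment marker, if there is one.
        int end = s.indexOf('#', queryStartIdx);
        if (end < 0) {
            end = s.length();
        }

        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String part : s.substring(queryStartIdx, end).split("&")) {
            if (part.isEmpty()) {
                continue; // skip empty elements such as "a=1&&b=2"
            }
            int eq = part.indexOf('=');
            String name = eq < 0 ? part : part.substring(0, eq);
            String value = eq < 0 ? null : part.substring(eq + 1);
            // Ignore parameters whose name is in the removal set.
            if (namesToRemove != null && namesToRemove.contains(name)) {
                continue;
            }
            pairs.add(new SimpleEntry<>(name, value));
        }
        return pairs;
    }
}
```

For the input "/search?q=java&utm_source=x&page=2#top" with a start index of 8 and a removal set containing "utm_source", the sketch returns the pairs q=java and page=2.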

formatQueryParameters
public static String formatQueryParameters(List<crawlercommons.filters.basic.BasicURLNormalizer.NameValuePair> parameters)
Formats a list of query parameter name-value pairs into a query parameter string.
Parameters:
parameters - the query parameter name-value pairs
Returns:
a URL query string
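
The inverse step, turning name-value pairs back into a query string, might look like the following sketch. It is not the library's code: QueryFormatSketch, format and pair are hypothetical names, Map.Entry again stands in for NameValuePair, and the sketch assumes that pairs are joined with '&' in list order, without re-encoding, and that a pair with a null value is written as a bare name.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the crawler-commons implementation.
class QueryFormatSketch {

    // Convenience factory for a name-value pair (illustrative helper).
    static Map.Entry<String, String> pair(String name, String value) {
        return new SimpleEntry<>(name, value);
    }

    // Joins the pairs as name=value elements separated by '&'.
    static String format(List<Map.Entry<String, String>> parameters) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> p : parameters) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(p.getKey());
            if (p.getValue() != null) {
                sb.append('=').append(p.getValue());
            }
        }
        return sb.toString();
    }
}
```

Formatting the pairs ("q", "java") and ("page", "2") with this sketch gives "q=java&page=2".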

unescapePath
public static String unescapePath(String path)
Remove percent-encoding from the path segment of a URL for characters which should be unescaped according to RFC 3986.
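
To make the idea concrete, here is a minimal sketch that decodes only percent-encoded unreserved characters (letters, digits, '-', '.', '_' and '~' in RFC 3986). Whether the library unescapes exactly this set is an assumption. UnescapeSketch and unescapeUnreserved are hypothetical names. Escapes that stay in place are rewritten with upper-case hex digits.

```java
// Hypothetical illustration only; not the crawler-commons implementation.
class UnescapeSketch {

    private static final String HEX = "0123456789ABCDEF";

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Unreserved characters as defined in RFC 3986, section 2.3.
    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    static String unescapeUnreserved(String path) {
        StringBuilder sb = new StringBuilder(path.length());
        int i = 0;
        while (i < path.length()) {
            char ch = path.charAt(i);
            if (ch == '%' && i + 2 < path.length()) {
                int hi = hexValue(path.charAt(i + 1));
                int lo = hexValue(path.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int value = hi * 16 + lo;
                    if (isUnreserved(value)) {
                        sb.append((char) value); // decode, e.g. %7E to ~
                    } else {
                        // keep the escape, with upper-case hex digits
                        sb.append('%').append(HEX.charAt(hi)).append(HEX.charAt(lo));
                    }
                    i += 3;
                    continue;
                }
            }
            sb.append(ch); // ordinary character or malformed escape
            i++;
        }
        return sb.toString();
    }
}
```

Applied to "/%7Euser/a%2fb/%41", the sketch returns "/~user/a%2Fb/A": the tilde and the letter are decoded, while the encoded slash stays escaped.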

escapePath
public static String escapePath(String path)
Convert the path segment of a URL from Unicode to UTF-8 and escape all characters which should be escaped according to RFC 3986.
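
A minimal sketch of this kind of escaping: encode the path as UTF-8 and percent-encode every byte that is not allowed to appear literally. The set of bytes kept as-is here (RFC 3986 unreserved characters, sub-delimiters, ':', '@', '/', and '%' so that existing escapes survive) is an assumption, and the library may draw the line elsewhere. EscapeSketch and escape are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the crawler-commons implementation.
class EscapeSketch {

    private static final String HEX = "0123456789ABCDEF";

    // Non-alphanumeric ASCII characters left unescaped in this sketch.
    private static final String KEEP = "-._~/!$&'()*+,;=:@%";

    static String escape(String path) {
        StringBuilder sb = new StringBuilder();
        // Work on UTF-8 bytes so that non-ASCII characters become %XX sequences.
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (alnum || KEEP.indexOf(c) >= 0) {
                sb.append((char) c);
            } else {
                sb.append('%').append(HEX.charAt(c >> 4)).append(HEX.charAt(c & 0xF));
            }
        }
        return sb.toString();
    }
}
```

For example, escape("/café menu") produces "/caf%C3%A9%20menu", since "é" is the two UTF-8 bytes C3 A9 and the space is byte 20.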

newBuilder
public static BasicURLNormalizer.Builder newBuilder()
Create a new builder object for creating a customized BasicURLNormalizer object.
Returns:
a BasicURLNormalizer.Builder ready to use

main
public static void main(String[] args) throws IOException
Throws:
IOException