Package crawlercommons.sitemaps
Class SiteMapParser
- java.lang.Object
-
- crawlercommons.sitemaps.SiteMapParser
-
public class SiteMapParser extends Object
-
-
Field Summary
Fields (modifier and type, field, description):
- protected Set<String> acceptedNamespaces: Set of namespaces (if strictNamespace) accepted by the parser.
- protected Map<String,Extension> extensionNamespaces: Map of sitemap extension namespaces required to find the right extension handler.
- static org.slf4j.Logger LOG
- static int MAX_BYTES_ALLOWED: Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. 2016 the limit was 10MB).
- protected boolean strict: True by default, meaning that invalid URLs are rejected, because the official documentation allows sitemap URLs only under the base URL: https://www.sitemaps.org/protocol.html#location
- protected boolean strictNamespace: Indicates whether the parser should work with the namespace from the specifications or any namespace.
-
Constructor Summary
Constructors (constructor, description):
- SiteMapParser(): SiteMapParser with strict location validation (isStrict()) and not allowing partially parsed content.
- SiteMapParser(boolean strict): SiteMapParser with configurable location validation, not allowing partially parsed content.
- SiteMapParser(boolean strict, boolean allowPartial)
-
Method Summary
Methods (modifier and type, method, description):
- void addAcceptedNamespace(String namespaceUri): Add namespace URI to set of accepted namespaces.
- void addAcceptedNamespace(String[] namespaceUris): Add namespace URIs to set of accepted namespaces.
- void enableExtension(Extension extension): Enable support for a sitemap extension in the parser.
- void enableExtensions(): Enable all supported sitemap extensions in the parser.
- boolean isStrict()
- boolean isStrictNamespace()
- AbstractSiteMap parseSiteMap(byte[] content, URL url): Parse a sitemap, given the content bytes and the URL.
- AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap): Returns a processed copy of an unprocessed sitemap object, i.e. the value of getLastModified() is transferred.
- AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url): Parse a sitemap, given the MIME type, the content bytes, and the URL.
- AbstractSiteMap parseSiteMap(URL onlineSitemapUrl): Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
- protected AbstractSiteMap processGzippedXML(URL url, byte[] response): Decompress the gzipped content and process the resulting XML Sitemap.
- protected SiteMap processText(URL sitemapUrl, byte[] content): Process a text-based Sitemap.
- protected SiteMap processText(URL sitemapUrl, InputStream stream): Process a text-based Sitemap.
- protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent): Parse the given XML content.
- protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is): Parse the given XML content.
- void setAllowDocTypeDefinitions(boolean allowDocTypeDefinitions): Sets whether the parser allows a DTD in sitemaps or feeds.
- void setStrictNamespace(boolean s): Sets the parser to allow any XML namespace or just the one from the specification, or any accepted namespace (see addAcceptedNamespace(String)).
- void setURLFilter(URLFilter filter): Use URLFilter to filter URLs, e.g. to configure that URLs found in sitemaps are normalized by BasicURLNormalizer.
- void setURLFilter(Function<String,String> filter): Set URL filter function to normalize URLs found in sitemaps or filter URLs away if the function returns null.
- static boolean urlIsValid(String sitemapBaseUrl, String testUrl): See if testUrl is under sitemapBaseUrl.
- void walkSiteMap(AbstractSiteMap sitemap, Consumer<SiteMapURL> action): Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
- void walkSiteMap(URL onlineSitemapUrl, Consumer<SiteMapURL> action): Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
-
-
-
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
MAX_BYTES_ALLOWED
public static final int MAX_BYTES_ALLOWED
Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. 2016 the limit was 10MB).
- See Also:
- Constant Field Values
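The quoted figure is 50 × 1024 × 1024 bytes. The following is a minimal sketch of a size check built on that arithmetic. The class SitemapSizeCheck and the method withinLimit are hypothetical names used for illustration only; the constant here is a local copy of the documented value, not a reference to the library field.

```java
final class SitemapSizeCheck {
    // Local copy of the documented limit: 50 * 1024 * 1024 = 52,428,800 bytes.
    static final int MAX_BYTES_ALLOWED = 50 * 1024 * 1024;

    /** Returns true if the fetched content is non-null and within the documented limit. */
    static boolean withinLimit(byte[] content) {
        return content != null && content.length <= MAX_BYTES_ALLOWED;
    }
}
```

A caller could run such a check before handing fetched bytes to a parser, so that oversized responses are rejected early.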
-
strict
protected boolean strict
True by default, meaning that invalid URLs are rejected, because the official documentation allows sitemap URLs only under the base URL: https://www.sitemaps.org/protocol.html#location
-
strictNamespace
protected boolean strictNamespace
Indicates whether the parser should work with the namespace from the specifications or any namespace. Defaults to false.
-
acceptedNamespaces
protected Set<String> acceptedNamespaces
Set of namespaces (if strictNamespace) accepted by the parser. URLs from other namespaces are ignored.
-
-
Constructor Detail
-
SiteMapParser
public SiteMapParser()
SiteMapParser with strict location validation (isStrict()) and not allowing partially parsed content.
-
SiteMapParser
public SiteMapParser(boolean strict)
SiteMapParser with configurable location validation, not allowing partially parsed content.
- Parameters:
strict - see isStrict()
-
SiteMapParser
public SiteMapParser(boolean strict, boolean allowPartial)
- Parameters:
strict - see isStrict()
allowPartial - if true, allow URLs from sitemaps that were only partially parsed because of format errors or truncated (incompletely fetched) content. If false, any parser error will cause an UnknownFormatException.
-
-
Method Detail
-
setAllowDocTypeDefinitions
public void setAllowDocTypeDefinitions(boolean allowDocTypeDefinitions)
Sets whether the parser allows a DTD in sitemaps or feeds.
- Parameters:
allowDocTypeDefinitions - true if allowed. Default is false.
-
isStrict
public boolean isStrict()
- Returns:
- whether invalid URLs will be rejected (where invalid means that the URL is not under the base URL, see sitemap file location)
-
isStrictNamespace
public boolean isStrictNamespace()
- Returns:
- true if the parser accepts only the namespace from the specification (or any accepted namespace, see addAcceptedNamespace(String)), false if it allows any namespace
-
setStrictNamespace
public void setStrictNamespace(boolean s)
Sets the parser to allow any XML namespace or just the one from the specification, or any accepted namespace (see addAcceptedNamespace(String)). Note that enabling strict namespace checking always adds the namespace defined by the current sitemap specification (Namespace.SITEMAP) to the list of accepted namespaces.
- Parameters:
s - if true, enable strict namespace checking; if false, disable it
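The interaction between setStrictNamespace, addAcceptedNamespace and the accepted-namespace set can be modeled as below. This is a minimal sketch of the documented behavior only, not the parser's implementation: the class NamespaceCheckSketch and the method isAccepted are hypothetical, and SITEMAP_NS is assumed to match Namespace.SITEMAP (it is the namespace URI of the sitemaps.org 0.9 protocol).

```java
import java.util.HashSet;
import java.util.Set;

final class NamespaceCheckSketch {
    // Namespace URI of the sitemaps.org 0.9 protocol (assumed equal to Namespace.SITEMAP).
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    private final Set<String> acceptedNamespaces = new HashSet<>();
    private boolean strictNamespace = false; // documented default

    void setStrictNamespace(boolean s) {
        strictNamespace = s;
        if (s) {
            // Enabling strict checking always adds the specification namespace.
            acceptedNamespaces.add(SITEMAP_NS);
        }
    }

    void addAcceptedNamespace(String namespaceUri) {
        acceptedNamespaces.add(namespaceUri);
    }

    /** Hypothetical helper: would an element in this namespace be processed? */
    boolean isAccepted(String namespaceUri) {
        return !strictNamespace || acceptedNamespaces.contains(namespaceUri);
    }
}
```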
-
addAcceptedNamespace
public void addAcceptedNamespace(String namespaceUri)
Add namespace URI to set of accepted namespaces.
- Parameters:
namespaceUri - URI of the accepted XML namespace
-
addAcceptedNamespace
public void addAcceptedNamespace(String[] namespaceUris)
Add namespace URIs to set of accepted namespaces.
- Parameters:
namespaceUris - array of accepted XML namespace URIs
-
enableExtension
public void enableExtension(Extension extension)
Enable support for a sitemap extension in the parser.
- Parameters:
extension - sitemap extension (news, images, videos, etc.)
-
enableExtensions
public void enableExtensions()
Enable all supported sitemap extensions in the parser.
-
setURLFilter
public void setURLFilter(Function<String,String> filter)
Set URL filter function to normalize URLs found in sitemaps or filter URLs away if the function returns null.
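A filter of this shape is an ordinary java.util.function.Function<String,String>. The following is a minimal sketch of one possible filter; the class UrlFilterSketch and its rules (keep only http/https, strip the fragment) are assumptions chosen for illustration, not rules applied by the library.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.function.Function;

final class UrlFilterSketch {
    /**
     * Returns the URL without its fragment, or null (meaning "filter away")
     * if the URL is null, malformed, or not http/https.
     */
    static final Function<String, String> FILTER = url -> {
        if (url == null) {
            return null;
        }
        String trimmed = url.trim();
        try {
            String scheme = new URI(trimmed).getScheme();
            if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
                return null;
            }
        } catch (URISyntaxException e) {
            return null; // malformed URLs are filtered away
        }
        int hash = trimmed.indexOf('#');
        return hash >= 0 ? trimmed.substring(0, hash) : trimmed;
    };
}
```

Given a parser instance, such a function could be passed as sitemapParser.setURLFilter(UrlFilterSketch.FILTER).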
-
setURLFilter
public void setURLFilter(URLFilter filter)
Use URLFilter to filter URLs, e.g. to configure that URLs found in sitemaps are normalized by BasicURLNormalizer:
    sitemapParser.setURLFilter(new BasicURLNormalizer());
-
parseSiteMap
public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl) throws UnknownFormatException, IOException
Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
- Parameters:
onlineSitemapUrl - URL of the online sitemap
- Returns:
- Extracted SiteMap/SiteMapIndex or null if the onlineSitemapUrl is null
- Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
-
parseSiteMap
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) throws UnknownFormatException, IOException
Returns a processed copy of an unprocessed sitemap object, i.e. the value of getLastModified() is transferred. The sitemap input stays unchanged. Note that contentType is assumed to be correct; in general it is more robust to use the method that doesn't take a contentType, but instead detects this using Tika.
- Parameters:
contentType - MIME type of content
content - raw bytes of sitemap file
sitemap - an AbstractSiteMap implementation
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
-
parseSiteMap
public AbstractSiteMap parseSiteMap(byte[] content, URL url) throws UnknownFormatException, IOException
Parse a sitemap, given the content bytes and the URL.
- Parameters:
content - raw bytes of sitemap file
url - URL to sitemap file
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
-
parseSiteMap
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url) throws UnknownFormatException, IOException
Parse a sitemap, given the MIME type, the content bytes, and the URL. Note that contentType is assumed to be correct; in general it is more robust to use the method that doesn't take a contentType, but instead detects this using Tika.
- Parameters:
contentType - MIME type of content
content - raw bytes of sitemap file
url - URL to sitemap file
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
-
walkSiteMap
public void walkSiteMap(URL onlineSitemapUrl, Consumer<SiteMapURL> action) throws UnknownFormatException, IOException
Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception. This method is a convenience method for a user who has a sitemap URL and wants a simple way to traverse it.
Exceptions thrown by the action are relayed to the caller.
- Parameters:
onlineSitemapUrl - URL of the online sitemap
action - The action to be performed for each element
- Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error fetching the content of any URL
-
walkSiteMap
public void walkSiteMap(AbstractSiteMap sitemap, Consumer<SiteMapURL> action) throws UnknownFormatException, IOException
Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception. This method is a convenience method for a user who has a sitemap and wants a simple way to traverse it.
Exceptions thrown by the action are relayed to the caller.
- Parameters:
sitemap - The sitemap to traverse
action - The action to be performed for each element
- Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error fetching the content of any URL
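The traversal order described for both walkSiteMap variants (descend into each enclosed sitemap of an index, then apply the action to every URL) can be illustrated with an in-memory model. This is a minimal sketch under stated assumptions: the classes WalkSketch and Node are hypothetical stand-ins for AbstractSiteMap and its subclasses, URLs are plain strings rather than SiteMapURL objects, and no network fetching takes place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class WalkSketch {
    /** Hypothetical stand-in: a node is an index if it has children, otherwise a plain sitemap. */
    static final class Node {
        final List<Node> children = new ArrayList<>();
        final List<String> urls = new ArrayList<>();

        boolean isIndex() {
            return !children.isEmpty();
        }
    }

    /** Depth-first walk; an exception thrown by the action propagates to the caller. */
    static void walk(Node node, Consumer<String> action) {
        if (node.isIndex()) {
            for (Node child : node.children) {
                walk(child, action); // the real method would fetch and parse the child here
            }
        } else {
            for (String url : node.urls) {
                action.accept(url);
            }
        }
    }
}
```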
-
processXml
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent) throws UnknownFormatException
Parse the given XML content.
- Parameters:
sitemapUrl - URL to sitemap file
xmlContent - the byte[] backing the sitemapUrl
- Returns:
- The site map
- Throws:
UnknownFormatException - if there is an error parsing the sitemap
-
processText
protected SiteMap processText(URL sitemapUrl, byte[] content) throws IOException
Process a text-based Sitemap. Text sitemaps list only URLs, without priorities, last-modified dates, etc.
- Parameters:
sitemapUrl - URL to sitemap file
content - the byte[] backing the sitemapUrl
- Returns:
- The site map
- Throws:
IOException - if there is an error reading in the site map content
-
processText
protected SiteMap processText(URL sitemapUrl, InputStream stream) throws IOException
Process a text-based Sitemap. Text sitemaps list only URLs, without priorities, last-modified dates, etc.
- Parameters:
sitemapUrl - URL to sitemap file
stream - content stream
- Returns:
- The site map
- Throws:
IOException - if there is an error reading in the site map content
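In the text format defined by the Sitemaps protocol, the file is UTF-8 encoded and contains one URL per line. The following is a minimal sketch of reading such content with the standard library alone. The class TextSitemapSketch and the method readUrls are hypothetical; unlike processText, the sketch returns plain strings, performs no location validation, and wraps IOException in an unchecked exception.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TextSitemapSketch {
    /** Reads one URL per line, skipping blank lines. */
    static List<String> readUrls(InputStream stream) {
        List<String> urls = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    urls.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    /** Byte-array variant, mirroring the two documented overloads. */
    static List<String> readUrls(byte[] content) {
        return readUrls(new ByteArrayInputStream(content));
    }
}
```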
-
processGzippedXML
protected AbstractSiteMap processGzippedXML(URL url, byte[] response) throws IOException, UnknownFormatException
Decompress the gzipped content and process the resulting XML Sitemap.
- Parameters:
url - URL of the gzipped content
response - gzipped content
- Returns:
- the site map
- Throws:
UnknownFormatException - if there is an error parsing the gzip
IOException - if there is an error reading in the gzip URL
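The decompression step can be sketched with java.util.zip.GZIPInputStream. This is an illustration under stated assumptions, not the library's code: the class GunzipSketch, the maxBytes parameter (a guard against oversized decompressed output) and the gzip helper used for the round trip are all hypothetical, and IOException is wrapped in an unchecked exception for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GunzipSketch {
    /** Decompresses gzipped bytes, failing if the output would exceed maxBytes. */
    static byte[] gunzip(byte[] gzipped, int maxBytes) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IOException("decompressed content exceeds " + maxBytes + " bytes");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for demonstration: compresses raw bytes with gzip. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The resulting bytes would then be handed to an XML parsing step such as processXml.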
-
processXml
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is) throws UnknownFormatException
Parse the given XML content.
- Parameters:
sitemapUrl - a sitemap URL
is - an InputSource backing the sitemap
- Returns:
- the site map
- Throws:
UnknownFormatException - if there is an error parsing the InputSource
-
urlIsValid
public static boolean urlIsValid(String sitemapBaseUrl, String testUrl)
See if testUrl is under sitemapBaseUrl. Only URLs under sitemapBaseUrl are valid.
- Parameters:
sitemapBaseUrl - the base URL of the sitemap
testUrl - the URL to be tested
- Returns:
- true if testUrl is under sitemapBaseUrl, false otherwise
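"Under the base URL" follows the location rule of the Sitemaps protocol: a sitemap at https://example.com/catalog/sitemap.xml may list URLs starting with https://example.com/catalog/ but not, for example, https://example.com/images/. The following is a minimal sketch of such a check as a plain prefix comparison. The class UrlScopeSketch and the methods baseOf and isUnder are hypothetical, and the library's own comparison may differ in detail.

```java
final class UrlScopeSketch {
    /** Derives a base URL by keeping everything up to and including the last '/'. */
    static String baseOf(String sitemapUrl) {
        return sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
    }

    /** Returns true if testUrl starts with baseUrl; null arguments are never valid. */
    static boolean isUnder(String baseUrl, String testUrl) {
        if (baseUrl == null || testUrl == null) {
            return false;
        }
        return testUrl.startsWith(baseUrl);
    }
}
```

With the library itself, the equivalent call would be the static SiteMapParser.urlIsValid(sitemapBaseUrl, testUrl).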
-
-