Package crawlercommons.sitemaps
Class SiteMapParser
- java.lang.Object
  - crawlercommons.sitemaps.SiteMapParser

public class SiteMapParser extends Object

Field Summary

Fields (Modifier and Type, Field: Description)
- protected Set<String> acceptedNamespaces: Set of namespaces (if strictNamespace) accepted by the parser.
- protected Map<String,Extension> extensionNamespaces: Map of sitemap extension namespaces required to find the right extension handler.
- static org.slf4j.Logger LOG
- static int MAX_BYTES_ALLOWED: Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. 2016 the limit was 10MB).
- protected boolean strict: True (by default), meaning that invalid URLs should be rejected, as the official docs allow a sitemap to list only URLs under its base URL: https://www.sitemaps.org/protocol.html#location
- protected boolean strictNamespace: Indicates whether the parser should work with the namespace from the specifications or any namespace.
 - 
Constructor Summary

Constructors (Constructor: Description)
- SiteMapParser(): SiteMapParser with strict location validation (isStrict()) and not allowing partially parsed content.
- SiteMapParser(boolean strict): SiteMapParser with configurable location validation, not allowing partially parsed content.
- SiteMapParser(boolean strict, boolean allowPartial)
 - 
Method Summary

Methods (Modifier and Type, Method: Description)
- void addAcceptedNamespace(String namespaceUri): Add namespace URI to set of accepted namespaces.
- void addAcceptedNamespace(String[] namespaceUris): Add namespace URIs to set of accepted namespaces.
- void enableExtension(Extension extension): Enable support for a sitemap extension in the parser.
- void enableExtensions(): Enable all supported sitemap extensions in the parser.
- boolean isStrict()
- boolean isStrictNamespace()
- AbstractSiteMap parseSiteMap(byte[] content, URL url): Parse a sitemap, given the content bytes and the URL.
- AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap): Returns a processed copy of an unprocessed sitemap object, i.e. transfers the value of getLastModified().
- AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url): Parse a sitemap, given the MIME type, the content bytes, and the URL.
- AbstractSiteMap parseSiteMap(URL onlineSitemapUrl): Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
- protected AbstractSiteMap processGzippedXML(URL url, byte[] response): Decompress the gzipped content and process the resulting XML sitemap.
- protected SiteMap processText(URL sitemapUrl, byte[] content): Process a text-based sitemap.
- protected SiteMap processText(URL sitemapUrl, InputStream stream): Process a text-based sitemap.
- protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent): Parse the given XML content.
- protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is): Parse the given XML content.
- void setAllowDocTypeDefinitions(boolean allowDocTypeDefinitions): Sets whether the parser allows a DTD in sitemaps or feeds.
- void setStrictNamespace(boolean s): Sets the parser to allow any XML namespace or just the one from the specification, or any accepted namespace (see addAcceptedNamespace(String)).
- void setURLFilter(URLFilter filter): Use URLFilter to filter URLs, e.g. to configure that URLs found in sitemaps are normalized by BasicURLNormalizer.
- void setURLFilter(Function<String,String> filter): Set URL filter function to normalize URLs found in sitemaps or filter URLs away if the function returns null.
- static boolean urlIsValid(String sitemapBaseUrl, String testUrl): See if testUrl is under sitemapBaseUrl.
- void walkSiteMap(AbstractSiteMap sitemap, Consumer<SiteMapURL> action): Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
- void walkSiteMap(URL onlineSitemapUrl, Consumer<SiteMapURL> action): Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
 
- 
- 
- 
Field Detail

LOG
public static final org.slf4j.Logger LOG
 - 
MAX_BYTES_ALLOWED
public static final int MAX_BYTES_ALLOWED
Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. 2016 the limit was 10MB).
- See Also:
- Constant Field Values
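
The quoted limit is 50 * 1024 * 1024 bytes. As a rough illustration of how a caller might apply such a limit before parsing, here is a minimal stand-alone sketch. The class name `SitemapSizeCheck` and the method `withinLimit` are hypothetical and not part of crawler-commons.

```java
// Sketch only: a stand-alone size check, not code from crawler-commons.
class SitemapSizeCheck {
    // 50 MB as quoted above: 50 * 1024 * 1024 = 52,428,800 bytes.
    static final int LIMIT_BYTES = 50 * 1024 * 1024;

    // Returns true if the fetched content is present and fits within the limit.
    static boolean withinLimit(byte[] content) {
        return content != null && content.length <= LIMIT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(LIMIT_BYTES);
        System.out.println(withinLimit(new byte[1024]));
    }
}
```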
 
 - 
strict
protected boolean strict
True (by default), meaning that invalid URLs should be rejected, as the official docs allow a sitemap to list only URLs under its base URL: https://www.sitemaps.org/protocol.html#location
 - 
strictNamespace
protected boolean strictNamespace
Indicates whether the parser should work with the namespace from the specifications or any namespace. Defaults to false.
 - 
acceptedNamespaces
protected Set<String> acceptedNamespaces
Set of namespaces accepted by the parser if strictNamespace is enabled. URLs from other namespaces are ignored.
 
- 
 - 
Constructor Detail

SiteMapParser
public SiteMapParser()
SiteMapParser with strict location validation (isStrict()) and not allowing partially parsed content.
 - 
SiteMapParser
public SiteMapParser(boolean strict)
SiteMapParser with configurable location validation, not allowing partially parsed content.
- Parameters:
- strict - see isStrict()
 
 - 
SiteMapParser
public SiteMapParser(boolean strict, boolean allowPartial)
- Parameters:
- strict - see isStrict()
- allowPartial - if true, allow URLs from sitemaps that were only partially parsed because of format errors or truncated (incompletely fetched) content. If false, any parser error will cause an UnknownFormatException.
 
 
- 
 - 
Method Detail

setAllowDocTypeDefinitions
public void setAllowDocTypeDefinitions(boolean allowDocTypeDefinitions)
Sets whether the parser allows a DTD in sitemaps or feeds.
- Parameters:
- allowDocTypeDefinitions - true if allowed. Default is false.
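
To illustrate what disallowing a DTD means in practice, the following sketch uses only the JDK's built-in XML parser and its `disallow-doctype-decl` feature. It is an assumption-laden illustration of the general technique, not a description of how SiteMapParser applies the setting internally; the class `DoctypeSketch` and the method `parses` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Sketch only: shows the effect of rejecting DOCTYPE declarations with the JDK parser.
class DoctypeSketch {
    // Returns true if the XML parses under the given DTD policy, false on any parse error.
    static boolean parses(String xml, boolean allowDocTypeDefinitions) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Feature of the JDK's built-in (Xerces-based) parser:
            // when set to true, any DOCTYPE declaration is a fatal error.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl",
                    !allowDocTypeDefinitions);
            factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Rejecting DTDs by default is a common precaution against entity-expansion and external-entity attacks when parsing XML from untrusted sources.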
 
 - 
isStrict
public boolean isStrict()
- Returns:
- whether invalid URLs will be rejected (where invalid means that the URL is not under the base URL, see sitemap file location)
 
 - 
isStrictNamespace
public boolean isStrictNamespace()
- Returns:
- true if the parser accepts only the namespace from the specification (or any accepted namespace, see addAcceptedNamespace(String)), false if it allows any namespace
 
 - 
setStrictNamespace
public void setStrictNamespace(boolean s)
Sets the parser to allow any XML namespace or just the one from the specification, or any accepted namespace (see addAcceptedNamespace(String)). Note that enabling strict namespace checking always adds the namespace defined by the current sitemap specification (Namespace.SITEMAP) to the list of accepted namespaces.
- Parameters:
- s - if true, enable strict namespace checking; if false, disable it
 
 - 
addAcceptedNamespace
public void addAcceptedNamespace(String namespaceUri)
Add namespace URI to set of accepted namespaces.
- Parameters:
- namespaceUri - URI of the accepted XML namespace
 
 - 
addAcceptedNamespace
public void addAcceptedNamespace(String[] namespaceUris)
Add namespace URIs to set of accepted namespaces.
- Parameters:
- namespaceUris - array of accepted XML namespace URIs
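
The interplay between strict namespace checking and the set of accepted namespaces, as described for setStrictNamespace and addAcceptedNamespace, can be modelled with a small stand-alone class. This is a sketch of the documented behaviour only; `NamespaceCheckSketch` and `isAccepted` are hypothetical names, and the constant is assumed to match the sitemaps.org 0.9 namespace that Namespace.SITEMAP refers to.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: models the documented rules, not the library's implementation.
class NamespaceCheckSketch {
    // Namespace of the sitemaps.org protocol, version 0.9 (assumed value of Namespace.SITEMAP).
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    private boolean strictNamespace = false; // documented default
    private final Set<String> accepted = new HashSet<>();

    void setStrictNamespace(boolean s) {
        strictNamespace = s;
        if (s) {
            // Enabling strict checking always accepts the specification's namespace.
            accepted.add(SITEMAP_NS);
        }
    }

    void addAcceptedNamespace(String namespaceUri) {
        accepted.add(namespaceUri);
    }

    // Non-strict mode lets every namespace pass; strict mode only the accepted ones.
    boolean isAccepted(String namespaceUri) {
        return !strictNamespace || accepted.contains(namespaceUri);
    }
}
```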
 
 - 
enableExtension
public void enableExtension(Extension extension)
Enable support for a sitemap extension in the parser.
- Parameters:
- extension - sitemap extension (news, images, videos, etc.)
 
 - 
enableExtensions
public void enableExtensions()
Enable all supported sitemap extensions in the parser.
 - 
setURLFilter
public void setURLFilter(Function<String,String> filter)
Set URL filter function to normalize URLs found in sitemaps or filter URLs away if the function returns null.
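
As an illustration of the "normalize or return null" contract, here is a minimal sketch of a `Function<String,String>` that lower-cases the scheme and drops anything that is not http(s). The class `UrlFilterSketch` and the constant `HTTP_ONLY` are hypothetical, and the filtering policy itself is just an example.

```java
import java.util.Locale;
import java.util.function.Function;

// Sketch only: one possible filter function, not part of crawler-commons.
class UrlFilterSketch {
    // Trims whitespace, lower-cases the scheme, and returns null
    // (meaning "filter this URL away") for anything that is not http or https.
    static final Function<String, String> HTTP_ONLY = url -> {
        if (url == null) return null;
        String trimmed = url.trim();
        int colon = trimmed.indexOf(':');
        if (colon < 0) return null;
        String scheme = trimmed.substring(0, colon).toLowerCase(Locale.ROOT);
        if (!scheme.equals("http") && !scheme.equals("https")) return null;
        return scheme + trimmed.substring(colon);
    };
}
```

A function of this shape could then be passed to setURLFilter(Function<String,String>) as documented above.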
 - 
setURLFilter
public void setURLFilter(URLFilter filter)
Use URLFilter to filter URLs, e.g. to configure that URLs found in sitemaps are normalized by BasicURLNormalizer:
    sitemapParser.setURLFilter(new BasicURLNormalizer());
 - 
parseSiteMap
public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl) throws UnknownFormatException, IOException
Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
- Parameters:
- onlineSitemapUrl - URL of the online sitemap
- Returns:
- Extracted SiteMap/SiteMapIndex or null if the onlineSitemapUrl is null
- Throws:
- UnknownFormatException - if there is an error parsing the sitemap
- IOException - if there is an error reading in the sitemap URL
 
 - 
parseSiteMap
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) throws UnknownFormatException, IOException
Returns a processed copy of an unprocessed sitemap object, i.e. transfers the value of getLastModified(). Note that the sitemap input stays unchanged. The contentType is assumed to be correct; in general it is more robust to use the method that does not take a contentType and instead detects it using Tika.
- Parameters:
- contentType - MIME type of content
- content - raw bytes of sitemap file
- sitemap - an AbstractSiteMap implementation
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
- UnknownFormatException - if there is an error parsing the sitemap
- IOException - if there is an error reading in the sitemap URL
 
 - 
parseSiteMap
public AbstractSiteMap parseSiteMap(byte[] content, URL url) throws UnknownFormatException, IOException
Parse a sitemap, given the content bytes and the URL.
- Parameters:
- content - raw bytes of sitemap file
- url - URL to sitemap file
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
- UnknownFormatException - if there is an error parsing the sitemap
- IOException - if there is an error reading in the sitemap URL
 
 - 
parseSiteMap
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url) throws UnknownFormatException, IOException
Parse a sitemap, given the MIME type, the content bytes, and the URL. Note that contentType is assumed to be correct; in general it is more robust to use the method that does not take a contentType and instead detects it using Tika.
- Parameters:
- contentType - MIME type of content
- content - raw bytes of sitemap file
- url - URL to sitemap file
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
- UnknownFormatException - if there is an error parsing the sitemap
- IOException - if there is an error reading in the sitemap URL
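
To show why inspecting the bytes can be more robust than trusting a declared MIME type, here is a deliberately naive content sniffer. It is not Tika and not what the parser does; it only distinguishes gzip, XML and plain text by looking at the first bytes. The names `ContentSniffSketch`, `Kind` and `sniff` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: a naive stand-in for real MIME detection such as Tika.
class ContentSniffSketch {
    enum Kind { GZIP, XML, TEXT }

    static Kind sniff(byte[] content) {
        // Gzip streams start with the magic bytes 0x1f 0x8b.
        if (content.length >= 2
                && (content[0] & 0xff) == 0x1f
                && (content[1] & 0xff) == 0x8b) {
            return Kind.GZIP;
        }
        // Inspect only the head of the content, skipping a UTF-8 byte order mark.
        int n = Math.min(content.length, 256);
        String head = new String(content, 0, n, StandardCharsets.UTF_8);
        if (head.startsWith("\uFEFF")) {
            head = head.substring(1);
        }
        // XML documents begin with '<' after optional whitespace.
        return head.trim().startsWith("<") ? Kind.XML : Kind.TEXT;
    }
}
```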
 
 - 
walkSiteMap
public void walkSiteMap(URL onlineSitemapUrl, Consumer<SiteMapURL> action) throws UnknownFormatException, IOException
Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception. This method is a convenience method for a user who has a sitemap URL and wants a simple way to traverse it. Exceptions thrown by the action are relayed to the caller.
- Parameters:
- onlineSitemapUrl - URL of the online sitemap
- action - The action to be performed for each element
- Throws:
- UnknownFormatException - if there is an error parsing the sitemap
- IOException - if there is an error fetching the content of any URL
 
 - 
walkSiteMap
public void walkSiteMap(AbstractSiteMap sitemap, Consumer<SiteMapURL> action) throws UnknownFormatException, IOException
Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception. This method is a convenience method for a user who has a sitemap and wants a simple way to traverse it. Exceptions thrown by the action are relayed to the caller.
- Parameters:
- sitemap - The sitemap to traverse
- action - The action to be performed for each element
- Throws:
- UnknownFormatException - if there is an error parsing the sitemap
- IOException - if there is an error fetching the content of any URL
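
The recursive traversal described for both walkSiteMap variants can be pictured with an in-memory model. The sketch below replaces fetching with a hypothetical `Node` type (an index holds child nodes, a plain sitemap holds URLs) and uses plain strings instead of SiteMapURL objects, so it shows only the traversal shape and the way exceptions from the action reach the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: an in-memory model of the traversal, with no network access.
class WalkSketch {
    // Hypothetical stand-in for a sitemap: an index has children, a plain sitemap has URLs.
    static final class Node {
        final List<Node> children = new ArrayList<>();
        final List<String> urls = new ArrayList<>();
    }

    // Depth-first traversal. Any exception thrown by the action is not caught here,
    // so it stops the walk and propagates to the caller.
    static void walk(Node node, Consumer<String> action) {
        for (String url : node.urls) {
            action.accept(url);
        }
        for (Node child : node.children) {
            walk(child, action);
        }
    }
}
```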
 
 - 
processXml
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent) throws UnknownFormatException
Parse the given XML content.
- Parameters:
- sitemapUrl - URL to sitemap file
- xmlContent - the byte[] backing the sitemapUrl
- Returns:
- The sitemap
- Throws:
- UnknownFormatException - if there is an error parsing the sitemap
 
 - 
processText
protected SiteMap processText(URL sitemapUrl, byte[] content) throws IOException
Process a text-based sitemap. Text sitemaps list only URLs, without priorities, last-modified dates, etc.
- Parameters:
- sitemapUrl - URL to sitemap file
- content - the byte[] backing the sitemapUrl
- Returns:
- The sitemap
- Throws:
- IOException - if there is an error reading in the sitemap content
 
 - 
processText
protected SiteMap processText(URL sitemapUrl, InputStream stream) throws IOException
Process a text-based sitemap. Text sitemaps list only URLs, without priorities, last-modified dates, etc.
- Parameters:
- sitemapUrl - URL to sitemap file
- stream - content stream
- Returns:
- The sitemap
- Throws:
- IOException - if there is an error reading in the sitemap content
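
A text sitemap is a UTF-8 file with one URL per line, which is why it carries no per-URL metadata. The following sketch reads such content using only the JDK; it performs no validation or filtering, and `TextSitemapSketch` and `readUrls` are hypothetical names rather than the library's processText implementation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: reads one URL per line, with no location or format checks.
class TextSitemapSketch {
    // Reads the stream line by line, trimming whitespace and skipping blank lines.
    static List<String> readUrls(InputStream stream) {
        List<String> urls = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    urls.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    // Convenience overload mirroring the byte[] variant documented above.
    static List<String> readUrls(byte[] content) {
        return readUrls(new ByteArrayInputStream(content));
    }
}
```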
 
 - 
processGzippedXML
protected AbstractSiteMap processGzippedXML(URL url, byte[] response) throws IOException, UnknownFormatException
Decompress the gzipped content and process the resulting XML sitemap.
- Parameters:
- url - URL of the gzipped content
- response - Gzipped content
- Returns:
- the sitemap
- Throws:
- UnknownFormatException - if there is an error parsing the gzipped content
- IOException - if there is an error reading in the gzipped content
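
The decompression step can be illustrated with the JDK's `GZIPInputStream`. This is a generic sketch under the assumption that the whole response is held in memory; it does not cap the decompressed size or parse the resulting XML, and `GunzipSketch`, `gunzip` and `gzip` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch only: in-memory gzip round trip, no size limit and no XML parsing.
class GunzipSketch {
    // Decompresses gzipped bytes into the original content.
    static byte[] gunzip(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for building sample input; closing the stream writes the gzip trailer.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In real use it would be sensible to stop reading once the decompressed size exceeds a limit such as MAX_BYTES_ALLOWED, since a small gzip file can expand to a very large document.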
 
 - 
processXml
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is) throws UnknownFormatException
Parse the given XML content.
- Parameters:
- sitemapUrl - a sitemap URL
- is - an InputSource backing the sitemap
- Returns:
- the sitemap
- Throws:
- UnknownFormatException - if there is an error parsing the InputSource
 
 - 
urlIsValid
public static boolean urlIsValid(String sitemapBaseUrl, String testUrl)
See if testUrl is under sitemapBaseUrl. Only URLs under sitemapBaseUrl are valid.
- Parameters:
- sitemapBaseUrl - the base URL of the sitemap
- testUrl - the URL to be tested
- Returns:
- true if testUrl is under sitemapBaseUrl, false otherwise
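
Under the sitemaps.org location rule, a sitemap at http://example.com/catalog/sitemap.xml may list URLs starting with http://example.com/catalog/ but not, say, http://example.com/images/. The sketch below expresses that idea as a plain prefix test. It is a simplification and an assumption about the general approach, not the library's urlIsValid code: it performs no normalization of case, ports or escapes, and `UrlScopeSketch`, `baseOf` and `isUnder` are hypothetical names.

```java
// Sketch only: a simplified "is this URL under the sitemap's base URL" check.
class UrlScopeSketch {
    // Base of a sitemap URL: everything up to and including the last '/'.
    // Assumes the URL has a path component, e.g. ".../catalog/sitemap.xml".
    static String baseOf(String sitemapUrl) {
        int slash = sitemapUrl.lastIndexOf('/');
        return slash < 0 ? sitemapUrl : sitemapUrl.substring(0, slash + 1);
    }

    // Plain prefix test; null arguments are treated as "not under the base".
    static boolean isUnder(String sitemapBaseUrl, String testUrl) {
        return sitemapBaseUrl != null && testUrl != null && testUrl.startsWith(sitemapBaseUrl);
    }
}
```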
 
 