Package crawlercommons.sitemaps
Class SiteMapParser
- java.lang.Object
-
- crawlercommons.sitemaps.SiteMapParser
-
public class SiteMapParser extends Object
-
-
Field Summary
Fields (each entry: modifier and type, field, description)
protected Set<String> acceptedNamespaces
- Set of namespaces accepted by the parser (if strictNamespace is enabled).
protected Map<String,Extension> extensionNamespaces
- Map of sitemap extension namespaces required to find the right extension handler.
static org.slf4j.Logger LOG
static int MAX_BYTES_ALLOWED
- Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. 2016 the limit was 10MB).
protected boolean strict
- If true (the default), invalid URLs are rejected. The sitemaps protocol only allows a sitemap to list URLs located under its base URL, see https://www.sitemaps.org/protocol.html#location
protected boolean strictNamespace
- Indicates whether the parser should work with the namespace from the specifications or any namespace.
-
Constructor Summary
Constructors (each entry: constructor, description)
SiteMapParser()
- SiteMapParser with strict location validation (see isStrict()) and not allowing partially parsed content.
SiteMapParser(boolean strict)
- SiteMapParser with configurable location validation, not allowing partially parsed content.
SiteMapParser(boolean strict, boolean allowPartial)
-
Method Summary
Methods (each entry: modifier and type, method, description)
void addAcceptedNamespace(String namespaceUri)
- Add namespace URI to set of accepted namespaces.
void addAcceptedNamespace(String[] namespaceUris)
- Add namespace URIs to set of accepted namespaces.
void enableExtension(Extension extension)
- Enable support for a sitemap extension in the parser.
void enableExtensions()
- Enable all supported sitemap extensions in the parser.
boolean isStrict()
boolean isStrictNamespace()
AbstractSiteMap parseSiteMap(byte[] content, URL url)
- Parse a sitemap, given the content bytes and the URL.
AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap)
- Returns a processed copy of an unprocessed sitemap object, i.e. the value of getLastModified() is transferred to the copy.
AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url)
- Parse a sitemap, given the MIME type, the content bytes, and the URL.
AbstractSiteMap parseSiteMap(URL onlineSitemapUrl)
- Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
protected AbstractSiteMap processGzippedXML(URL url, byte[] response)
- Decompress the gzipped content and process the resulting XML sitemap.
protected SiteMap processText(URL sitemapUrl, byte[] content)
- Process a text-based sitemap.
protected SiteMap processText(URL sitemapUrl, InputStream stream)
- Process a text-based sitemap.
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent)
- Parse the given XML content.
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is)
- Parse the given XML content.
void setAllowDocTypeDefinitions(boolean allowDocTypeDefinitions)
- Sets whether the parser allows a DTD in sitemaps or feeds.
void setStrictNamespace(boolean s)
- Sets the parser to allow any XML namespace or just the one from the specification, or any accepted namespace (see addAcceptedNamespace(String)).
void setURLFilter(URLFilter filter)
- Use a URLFilter to filter URLs, e.g. to configure that URLs found in sitemaps are normalized by BasicURLNormalizer.
void setURLFilter(Function<String,String> filter)
- Set URL filter function to normalize URLs found in sitemaps or filter URLs away if the function returns null.
static boolean urlIsValid(String sitemapBaseUrl, String testUrl)
- See if testUrl is under sitemapBaseUrl.
void walkSiteMap(AbstractSiteMap sitemap, Consumer<SiteMapURL> action)
- Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
void walkSiteMap(URL onlineSitemapUrl, Consumer<SiteMapURL> action)
- Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
-
-
-
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
MAX_BYTES_ALLOWED
public static final int MAX_BYTES_ALLOWED
Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. 2016 the limit was 10MB).
- See Also:
- Constant Field Values
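The arithmetic behind the limit, and how a caller might guard against oversized content before handing it to the parser, can be sketched as follows. SizeGuard, LIMIT_BYTES and withinLimit are hypothetical names chosen for illustration. This page does not describe how SiteMapParser itself enforces MAX_BYTES_ALLOWED, so the check below is only an assumption about one reasonable use of the constant.

```java
// Illustrative sketch only; SizeGuard is a hypothetical helper, not part of crawler-commons.
final class SizeGuard {
    /** 50 MB as stated by the sitemaps protocol: 50 * 1024 * 1024 = 52,428,800 bytes. */
    static final int LIMIT_BYTES = 50 * 1024 * 1024;

    /** Returns true if content of the given length is "no larger than" the limit. */
    static boolean withinLimit(long contentLength) {
        return contentLength >= 0 && contentLength <= LIMIT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(LIMIT_BYTES);                    // 52428800
        System.out.println(withinLimit(10L * 1024 * 1024)); // true (size of the pre-2016 limit)
        System.out.println(withinLimit(LIMIT_BYTES + 1L));  // false
    }
}
```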
-
strict
protected boolean strict
If true (the default), invalid URLs are rejected. The sitemaps protocol only allows a sitemap to list URLs located under its base URL, see https://www.sitemaps.org/protocol.html#location
-
strictNamespace
protected boolean strictNamespace
Indicates whether the parser should work with the namespace from the specifications or any namespace. Defaults to false.
-
acceptedNamespaces
protected Set<String> acceptedNamespaces
Set of namespaces accepted by the parser (if strictNamespace is enabled). URLs from other namespaces are ignored.
-
-
Constructor Detail
-
SiteMapParser
public SiteMapParser()
SiteMapParser with strict location validation (see isStrict()) and not allowing partially parsed content.
-
SiteMapParser
public SiteMapParser(boolean strict)
SiteMapParser with configurable location validation, not allowing partially parsed content.
- Parameters:
strict
- see isStrict()
-
SiteMapParser
public SiteMapParser(boolean strict, boolean allowPartial)
- Parameters:
strict
- see isStrict()
allowPartial
- if true, allow URLs from sitemaps that were only partially parsed because of format errors or truncated (incompletely fetched) content. If false, any parser error will cause an UnknownFormatException.
-
-
Method Detail
-
setAllowDocTypeDefinitions
public void setAllowDocTypeDefinitions(boolean allowDocTypeDefinitions)
Sets whether the parser allows a DTD (document type definition) in sitemaps or feeds.
- Parameters:
allowDocTypeDefinitions
- true if allowed. Default is false.
-
isStrict
public boolean isStrict()
- Returns:
- whether invalid URLs will be rejected (where invalid means that the URL is not under the base URL, see sitemap file location)
-
isStrictNamespace
public boolean isStrictNamespace()
- Returns:
- whether the parser allows any namespace or just the one from the specification (or any accepted namespace, see addAcceptedNamespace(String))
-
setStrictNamespace
public void setStrictNamespace(boolean s)
Sets the parser to allow any XML namespace or just the one from the specification, or any accepted namespace (see addAcceptedNamespace(String)). Note that enabling strict namespace checking always adds the namespace defined by the current sitemap specification (Namespace.SITEMAP) to the list of accepted namespaces.
- Parameters:
s
- if true, enable strict namespace checking; if false, disable it
-
addAcceptedNamespace
public void addAcceptedNamespace(String namespaceUri)
Add namespace URI to set of accepted namespaces.
- Parameters:
namespaceUri
- URI of the accepted XML namespace
-
addAcceptedNamespace
public void addAcceptedNamespace(String[] namespaceUris)
Add namespace URIs to set of accepted namespaces.
- Parameters:
namespaceUris
- array of accepted XML namespace URIs
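The interplay between strict namespace checking and the set of accepted namespaces can be modelled in a few lines. This is a sketch of the documented behaviour only, not the library's implementation: NamespaceSketch and isAccepted are hypothetical names, and SITEMAP_NS is the namespace URI of the sitemaps.org 0.9 schema, assumed here to be what Namespace.SITEMAP refers to.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative model; NamespaceSketch is hypothetical and not part of crawler-commons.
final class NamespaceSketch {
    /** Namespace URI of the sitemaps.org 0.9 schema (assumed to match Namespace.SITEMAP). */
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    private final Set<String> accepted = new HashSet<>();
    private boolean strict = false; // documented default of strictNamespace

    /** Enabling strict checking always adds the specification namespace. */
    void setStrictNamespace(boolean s) {
        strict = s;
        if (s) {
            accepted.add(SITEMAP_NS);
        }
    }

    void addAcceptedNamespace(String namespaceUri) {
        accepted.add(namespaceUri);
    }

    /** Without strict checking any namespace passes; otherwise only accepted ones do. */
    boolean isAccepted(String namespaceUri) {
        return !strict || accepted.contains(namespaceUri);
    }

    public static void main(String[] args) {
        NamespaceSketch ns = new NamespaceSketch();
        ns.setStrictNamespace(true);
        System.out.println(ns.isAccepted(SITEMAP_NS));                 // true
        System.out.println(ns.isAccepted("http://example.com/other")); // false
    }
}
```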
-
enableExtension
public void enableExtension(Extension extension)
Enable support for a sitemap extension in the parser.
- Parameters:
extension
- sitemap extension (news, images, videos, etc.)
-
enableExtensions
public void enableExtensions()
Enable all supported sitemap extensions in the parser.
-
setURLFilter
public void setURLFilter(Function<String,String> filter)
Set a URL filter function that normalizes URLs found in sitemaps. A URL is filtered away if the function returns null.
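A filter function of this shape might look like the sketch below. UrlFilterSketch, HTTP_ONLY and apply are hypothetical names; the normalization rule (lower-casing the scheme and dropping anything that is not HTTP or HTTPS) is an arbitrary example, and apply only imitates the "null means drop" contract described above rather than the parser's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Illustrative sketch; UrlFilterSketch is hypothetical and not part of crawler-commons.
final class UrlFilterSketch {
    /** Lower-cases the scheme of HTTP(S) URLs and returns null for everything else. */
    static final Function<String, String> HTTP_ONLY = url -> {
        if (url == null) {
            return null;
        }
        String trimmed = url.trim();
        String lower = trimmed.toLowerCase(Locale.ROOT);
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            int schemeEnd = trimmed.indexOf(':');
            return lower.substring(0, schemeEnd) + trimmed.substring(schemeEnd);
        }
        return null; // null signals "filter this URL away"
    };

    /** Applies a filter to each URL, keeping normalized results and skipping nulls. */
    static List<String> apply(Function<String, String> filter, List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String normalized = filter.apply(url);
            if (normalized != null) {
                kept.add(normalized);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                " HTTP://Example.com/A ", "ftp://example.com/file", "https://example.com/b");
        System.out.println(apply(HTTP_ONLY, input));
        // [http://Example.com/A, https://example.com/b]
    }
}
```

Given the signature setURLFilter(Function<String,String> filter), such a function would presumably be registered with sitemapParser.setURLFilter(UrlFilterSketch.HTTP_ONLY).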
-
setURLFilter
public void setURLFilter(URLFilter filter)
Use a URLFilter to filter URLs, e.g. to configure that URLs found in sitemaps are normalized by BasicURLNormalizer:
sitemapParser.setURLFilter(new BasicURLNormalizer());
-
parseSiteMap
public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl) throws UnknownFormatException, IOException
Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
- Parameters:
onlineSitemapUrl
- URL of the online sitemap
- Returns:
- Extracted SiteMap/SiteMapIndex or null if the onlineSitemapUrl is null
- Throws:
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the sitemap URL
-
parseSiteMap
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) throws UnknownFormatException, IOException
Returns a processed copy of an unprocessed sitemap object, i.e. the value of getLastModified() is transferred to the copy. Note that the sitemap passed as input stays unchanged. Note also that contentType is assumed to be correct; in general it is more robust to use the method that doesn't take a contentType, but instead detects this using Tika.
- Parameters:
contentType
- MIME type of content
content
- raw bytes of sitemap file
sitemap
- an AbstractSiteMap implementation
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the sitemap URL
-
parseSiteMap
public AbstractSiteMap parseSiteMap(byte[] content, URL url) throws UnknownFormatException, IOException
Parse a sitemap, given the content bytes and the URL.
- Parameters:
content
- raw bytes of sitemap file
url
- URL to sitemap file
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the sitemap URL
-
parseSiteMap
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url) throws UnknownFormatException, IOException
Parse a sitemap, given the MIME type, the content bytes, and the URL. Note that contentType is assumed to be correct; in general it is more robust to use the method that doesn't take a contentType, but instead detects this using Tika.
- Parameters:
contentType
- MIME type of content
content
- raw bytes of sitemap file
url
- URL to sitemap file
- Returns:
- Extracted SiteMap/SiteMapIndex
- Throws:
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the sitemap URL
-
walkSiteMap
public void walkSiteMap(URL onlineSitemapUrl, Consumer<SiteMapURL> action) throws UnknownFormatException, IOException
Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
This method is a convenience method for a user who has a sitemap URL and wants a simple way to traverse it.
Exceptions thrown by the action are relayed to the caller.
- Parameters:
onlineSitemapUrl
- URL of the online sitemap
action
- The action to be performed for each element
- Throws:
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error fetching the content of any URL
-
walkSiteMap
public void walkSiteMap(AbstractSiteMap sitemap, Consumer<SiteMapURL> action) throws UnknownFormatException, IOException
Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
This method is a convenience method for a user who has a sitemap and wants a simple way to traverse it.
Exceptions thrown by the action are relayed to the caller.
- Parameters:
sitemap
- The sitemap to traverse
action
- The action to be performed for each element
- Throws:
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error fetching the content of any URL
-
processXml
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent) throws UnknownFormatException
Parse the given XML content.
- Parameters:
sitemapUrl
- URL to sitemap file
xmlContent
- the byte[] backing the sitemapUrl
- Returns:
- The sitemap
- Throws:
UnknownFormatException
- if there is an error parsing the sitemap
-
processText
protected SiteMap processText(URL sitemapUrl, byte[] content) throws IOException
Process a text-based sitemap. Text sitemaps only list URLs but no priorities, last modification dates, etc.
- Parameters:
sitemapUrl
- URL to sitemap file
content
- the byte[] backing the sitemapUrl
- Returns:
- The sitemap
- Throws:
IOException
- if there is an error reading in the sitemap content
-
processText
protected SiteMap processText(URL sitemapUrl, InputStream stream) throws IOException
Process a text-based sitemap. Text sitemaps only list URLs but no priorities, last modification dates, etc.
- Parameters:
sitemapUrl
- URL to sitemap file
stream
- content stream
- Returns:
- The sitemap
- Throws:
IOException
- if there is an error reading in the sitemap content
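In the sitemaps protocol a text sitemap is a UTF-8 file with one URL per line. A minimal reader for that format might look like the following; TextSitemapSketch and readUrls are hypothetical names, and the sketch returns plain strings rather than a SiteMap, so it shows the input format only and says nothing about how processText validates or filters the URLs.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; TextSitemapSketch is hypothetical and not part of crawler-commons.
final class TextSitemapSketch {
    /** Reads one URL per line, skipping blank lines and surrounding whitespace. */
    static List<String> readUrls(InputStream stream) throws IOException {
        List<String> urls = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    urls.add(trimmed);
                }
            }
        }
        return urls;
    }

    /** Convenience overload for content that is already in memory. */
    static List<String> readUrls(byte[] content) {
        try {
            return readUrls(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "http://example.com/a\n\nhttp://example.com/b\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(readUrls(content)); // [http://example.com/a, http://example.com/b]
    }
}
```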
-
processGzippedXML
protected AbstractSiteMap processGzippedXML(URL url, byte[] response) throws IOException, UnknownFormatException
Decompress the gzipped content and process the resulting XML sitemap.
- Parameters:
url
- URL of the gzipped content
response
- Gzipped content
- Returns:
- the sitemap
- Throws:
UnknownFormatException
- if there is an error parsing the gzip
IOException
- if there is an error reading in the gzip URL
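The decompression step on its own can be done with the JDK's java.util.zip classes. The sketch below is an assumption about what that step could look like, not the library's code: GzipSketch and its methods are hypothetical names, and the XML parsing that processGzippedXML performs afterwards is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch; GzipSketch is hypothetical and not part of crawler-commons.
final class GzipSketch {
    /** Decompresses gzipped bytes; malformed input surfaces as UncheckedIOException. */
    static byte[] gunzip(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text as UTF-8; only used to produce sample input. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses and decodes the result as UTF-8 text. */
    static String gunzipToString(byte[] gzipped) {
        return new String(gunzip(gzipped), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<urlset/>");
        System.out.println(gunzipToString(compressed)); // <urlset/>
    }
}
```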
-
processXml
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is) throws UnknownFormatException
Parse the given XML content.
- Parameters:
sitemapUrl
- a sitemap URL
is
- an InputSource backing the sitemap
- Returns:
- the sitemap
- Throws:
UnknownFormatException
- if there is an error parsing the InputSource
-
urlIsValid
public static boolean urlIsValid(String sitemapBaseUrl, String testUrl)
See if testUrl is under sitemapBaseUrl. Only URLs under sitemapBaseUrl are valid.
- Parameters:
sitemapBaseUrl
- the base URL of the sitemap
testUrl
- the URL to be tested
- Returns:
- true if testUrl is under sitemapBaseUrl, false otherwise
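What "under the base URL" means can be illustrated with the example from the sitemaps protocol: a sitemap at http://example.com/catalog/sitemap.xml may list URLs starting with http://example.com/catalog/ but not URLs starting with http://example.com/images/. The sketch below assumes a plain string-prefix comparison; UrlLocationSketch and isUnder are hypothetical names, and the actual comparison performed by urlIsValid may differ in detail.

```java
// Illustrative sketch; UrlLocationSketch is hypothetical and not part of crawler-commons.
final class UrlLocationSketch {
    /** Returns true if testUrl starts with baseUrl (assumed meaning of "under"). */
    static boolean isUnder(String baseUrl, String testUrl) {
        if (baseUrl == null || testUrl == null) {
            return false;
        }
        return testUrl.startsWith(baseUrl);
    }

    public static void main(String[] args) {
        String base = "http://example.com/catalog/";
        System.out.println(isUnder(base, "http://example.com/catalog/show?item=1")); // true
        System.out.println(isUnder(base, "http://example.com/images/logo.png"));     // false
    }
}
```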
-
-