public class SiteMapParser extends Object
Modifier and Type | Field and Description |
---|---|
protected Set<String> | acceptedNamespaces: Set of namespaces (if strictNamespace) accepted by the parser. |
protected Map<String,Extension> | extensionNamespaces: Map of sitemap extension namespaces required to find the right extension handler. |
static org.slf4j.Logger | LOG |
static int | MAX_BYTES_ALLOWED: Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. |
protected boolean | strict: True (by default), meaning that invalid URLs should be rejected, as the official docs allow the siteMapURLs to be only under the base URL: https://www.sitemaps.org/protocol.html#location |
protected boolean | strictNamespace: Indicates whether the parser should work with the namespace from the specifications or any namespace. |
Constructor and Description |
---|
SiteMapParser(): SiteMapParser with strict location validation (isStrict()) and not allowing partially parsed content. |
SiteMapParser(boolean strict): SiteMapParser with configurable location validation, not allowing partially parsed content. |
SiteMapParser(boolean strict, boolean allowPartial) |
Modifier and Type | Method and Description |
---|---|
void | addAcceptedNamespace(String namespaceUri): Add namespace URI to set of accepted namespaces. |
void | addAcceptedNamespace(String[] namespaceUris): Add namespace URIs to set of accepted namespaces. |
void | enableExtension(Extension extension): Enable support for a sitemap extension in the parser. |
void | enableExtensions(): Enable all supported sitemap extensions in the parser. |
boolean | isStrict() |
boolean | isStrictNamespace() |
AbstractSiteMap | parseSiteMap(byte[] content, URL url): Parse a sitemap, given the content bytes and the URL. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap): Returns a processed copy of an unprocessed sitemap object, i.e. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, URL url): Parse a sitemap, given the MIME type, the content bytes, and the URL. |
AbstractSiteMap | parseSiteMap(URL onlineSitemapUrl): Returns a SiteMap or SiteMapIndex given an online sitemap URL. Please note that this method goes online and fetches the sitemap, then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it. |
protected AbstractSiteMap | processGzippedXML(URL url, byte[] response): Decompress the gzipped content and process the resulting XML Sitemap. |
protected SiteMap | processText(URL sitemapUrl, byte[] content): Process a text-based Sitemap. |
protected SiteMap | processText(URL sitemapUrl, InputStream stream): Process a text-based Sitemap. |
protected AbstractSiteMap | processXml(URL sitemapUrl, byte[] xmlContent): Parse the given XML content. |
protected AbstractSiteMap | processXml(URL sitemapUrl, InputSource is): Parse the given XML content. |
void | setStrictNamespace(boolean s): Sets the parser to allow any XML namespace or just the one from the specification, or any accepted namespace (see addAcceptedNamespace(String)). |
void | setURLFilter(java.util.function.Function<String,String> filter): Set URL filter function to normalize URLs found in sitemaps or filter URLs away if the function returns null. |
void | setURLFilter(URLFilter filter): Use URLFilter to filter URLs, e.g. |
static boolean | urlIsValid(String sitemapBaseUrl, String testUrl): See if testUrl is under sitemapBaseUrl. |
void | walkSiteMap(AbstractSiteMap sitemap, java.util.function.Consumer<SiteMapURL> action): Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception. |
void | walkSiteMap(URL onlineSitemapUrl, java.util.function.Consumer<SiteMapURL> action): Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception. |
public static final org.slf4j.Logger LOG
public static final int MAX_BYTES_ALLOWED
protected boolean strict
protected boolean strictNamespace
protected Set<String> acceptedNamespaces
Set of namespaces (if strictNamespace) accepted by the parser. URLs from other namespaces are ignored.
public SiteMapParser()
SiteMapParser with strict location validation (isStrict()) and not allowing partially parsed content.
public SiteMapParser(boolean strict)
strict
- see isStrict()
public SiteMapParser(boolean strict, boolean allowPartial)
strict
- see isStrict()
allowPartial
- if true, allow URLs from sitemaps that were only partially parsed because of format errors or truncated (incompletely fetched) content. If false, any parser error will cause an UnknownFormatException.
public boolean isStrict()
public boolean isStrictNamespace()
See addAcceptedNamespace(String).
public void setStrictNamespace(boolean s)
Sets the parser to allow any XML namespace or just the one from the specification, or any accepted namespace (see addAcceptedNamespace(String)). Note that enabling strict namespace checking always adds the namespace defined by the current sitemap specification (Namespace.SITEMAP) to the list of accepted namespaces.
s
- if true, enable strict namespace checking; disable it if false
public void addAcceptedNamespace(String namespaceUri)
namespaceUri
- URI of the accepted XML namespace
public void addAcceptedNamespace(String[] namespaceUris)
namespaceUris
- array of accepted XML namespace URIs
public void enableExtension(Extension extension)
extension
- sitemap extension (news, images, videos, etc.)
public void enableExtensions()
public void setURLFilter(java.util.function.Function<String,String> filter)
public void setURLFilter(URLFilter filter)
Use URLFilter to filter URLs, e.g. to configure that URLs found in sitemaps are normalized by BasicURLNormalizer:
sitemapParser.setURLFilter(new BasicURLNormalizer());
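The `Function<String,String>` overload of setURLFilter takes a function that either returns a (possibly normalized) URL or returns null to filter the URL away. The following is a minimal sketch of such a function using only the JDK. The class name `UrlFilterSketch`, the constant `HTTP_ONLY`, and the helper `applyFilter` are hypothetical names chosen for illustration, and the normalization rules (lowercase scheme and host, drop fragment and user info, keep only http and https) are assumptions, not the behavior of BasicURLNormalizer.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

final class UrlFilterSketch {

    /**
     * Hypothetical filter: lowercases scheme and host, drops fragment and
     * user info, and returns null for URLs that do not parse or are not http(s).
     */
    static final Function<String, String> HTTP_ONLY = raw -> {
        try {
            URI uri = new URI(raw.trim());
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return null; // relative or opaque URI: filter away
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            if (!scheme.equals("http") && !scheme.equals("https")) {
                return null; // unsupported scheme: filter away
            }
            StringBuilder sb = new StringBuilder(scheme)
                    .append("://")
                    .append(host.toLowerCase(Locale.ROOT));
            if (uri.getPort() != -1) {
                sb.append(':').append(uri.getPort());
            }
            if (uri.getRawPath() != null) {
                sb.append(uri.getRawPath());
            }
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            return null; // unparsable: filter away
        }
    };

    /** Applies a filter the way a caller might: null results are dropped. */
    static List<String> applyFilter(List<String> urls, Function<String, String> filter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String filtered = filter.apply(url);
            if (filtered != null) {
                kept.add(filtered);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "HTTP://Example.COM/a?b=1#frag", "ftp://example.com/x", "not a url");
        System.out.println(applyFilter(input, HTTP_ONLY));
        // With the parser from this page, the function would be registered as:
        // sitemapParser.setURLFilter(UrlFilterSketch.HTTP_ONLY);
    }
}
```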
public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl) throws UnknownFormatException, IOException
onlineSitemapUrl
- URL of the online sitemap
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) throws UnknownFormatException, IOException
contentType
- MIME type of content
content
- raw bytes of sitemap file
sitemap
- an AbstractSiteMap implementation
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(byte[] content, URL url) throws UnknownFormatException, IOException
content
- raw bytes of sitemap file
url
- URL to sitemap file
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url) throws UnknownFormatException, IOException
contentType
- MIME type of content
content
- raw bytes of sitemap file
url
- URL to sitemap file
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the site map URL
public void walkSiteMap(URL onlineSitemapUrl, java.util.function.Consumer<SiteMapURL> action) throws UnknownFormatException, IOException
This method is a convenience method for a user who has a sitemap URL and wants a simple way to traverse it.
Exceptions thrown by the action are relayed to the caller.
onlineSitemapUrl
- URL of the online sitemap
action
- the action to be performed for each element
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error fetching the content of any URL
public void walkSiteMap(AbstractSiteMap sitemap, java.util.function.Consumer<SiteMapURL> action) throws UnknownFormatException, IOException
This method is a convenience method for a user who has a sitemap and wants a simple way to traverse it.
Exceptions thrown by the action are relayed to the caller.
sitemap
- the sitemap to traverse
action
- the action to be performed for each element
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error fetching the content of any URL
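The traversal contract described for walkSiteMap (descend into any enclosed sitemap index, apply the action to every sitemap URL, and relay exceptions thrown by the action) can be illustrated with an in-memory sketch. Everything here is hypothetical: `WalkSketch` and `Node` are stand-ins invented for the example, URLs are plain strings rather than SiteMapURL objects, and no fetching takes place, whereas the real method fetches enclosed sitemaps online.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class WalkSketch {

    /** Hypothetical stand-in for either a sitemap (urls) or a sitemap index (children). */
    static final class Node {
        final List<String> urls = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
    }

    /**
     * Depth-first traversal: enclosed sitemaps are walked recursively and the
     * action is applied to each URL. Nothing is caught, so an exception thrown
     * by the action stops the traversal and reaches the caller.
     */
    static void walk(Node node, Consumer<String> action) {
        for (Node child : node.children) {
            walk(child, action);
        }
        for (String url : node.urls) {
            action.accept(url);
        }
    }

    public static void main(String[] args) {
        Node first = new Node();
        first.urls.add("https://example.com/a");
        Node second = new Node();
        second.urls.add("https://example.com/b");
        Node index = new Node();
        index.children.add(first);
        index.children.add(second);

        walk(index, System.out::println);
    }
}
```

Because the action is a plain `Consumer`, a caller that wants to stop early can throw an unchecked exception from it and catch that exception around the walk call.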
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent) throws UnknownFormatException
sitemapUrl
- URL to sitemap file
xmlContent
- the byte[] backing the sitemapUrl
UnknownFormatException
- if there is an error parsing the sitemap
protected SiteMap processText(URL sitemapUrl, byte[] content) throws IOException
sitemapUrl
- URL to sitemap file
content
- the byte[] backing the sitemapUrl
IOException
- if there is an error reading in the site map content
protected SiteMap processText(URL sitemapUrl, InputStream stream) throws IOException
sitemapUrl
- URL to sitemap file
stream
- content stream
IOException
- if there is an error reading in the site map content
protected AbstractSiteMap processGzippedXML(URL url, byte[] response) throws IOException, UnknownFormatException
url
- URL of the gzipped content
response
- gzipped content
UnknownFormatException
- if there is an error parsing the gzip
IOException
- if there is an error reading in the gzip URL
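As an illustration of the decompression step, here is a minimal JDK-only sketch that inflates gzipped bytes while refusing to produce more than a given number of bytes. The class `GzipSketch` and its methods are hypothetical. The cap value is the 52,428,800-byte sitemap limit quoted for MAX_BYTES_ALLOWED, but applying that limit to the decompressed size in this way is an assumption of the sketch, not a statement about how processGzippedXML is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {

    /** 50 MB, the sitemap size limit quoted for MAX_BYTES_ALLOWED. */
    static final int MAX_BYTES = 52_428_800;

    /** Decompresses gzipped bytes, failing once the output exceeds maxBytes. */
    static byte[] gunzip(byte[] gzipped, int maxBytes) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
                if (total > maxBytes) {
                    throw new IOException("decompressed content exceeds " + maxBytes + " bytes");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Helper for the demo: compresses bytes with gzip. */
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<urlset/>".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(xml), MAX_BYTES);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

Checking the running total inside the read loop means an oversized or malicious archive is rejected before the whole payload is held in memory.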
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is) throws UnknownFormatException
sitemapUrl
- a sitemap URL
is
- an InputSource backing the sitemap
UnknownFormatException
- if there is an error parsing the InputSource
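To make the combination of XML parsing and namespace checking more concrete, the sketch below reads `loc` entries from an InputSource with the JDK's DOM parser and, in strict namespace mode, ignores elements whose namespace is not in an accepted set. `XmlSitemapSketch`, `extractLocs`, and `sourceOf` are hypothetical names, and the DOM-based approach is an assumption made for brevity; it is not presented as how processXml works internally. The namespace URI is the one defined by the sitemaps.org protocol.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XmlSitemapSketch {

    /** Namespace of the sitemaps.org 0.9 protocol. */
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /**
     * Collects the text of loc elements that sit directly under url or sitemap
     * elements. With strictNamespace set, elements outside the accepted
     * namespaces are skipped.
     */
    static List<String> extractLocs(InputSource is, boolean strictNamespace, Set<String> accepted)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Reject DOCTYPE declarations to avoid external entity processing.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(is);

        List<String> result = new ArrayList<>();
        NodeList locs = doc.getElementsByTagNameNS("*", "loc");
        for (int i = 0; i < locs.getLength(); i++) {
            Element loc = (Element) locs.item(i);
            String parent = loc.getParentNode().getLocalName();
            if (!"url".equals(parent) && !"sitemap".equals(parent)) {
                continue; // e.g. a loc that belongs to an extension element
            }
            String ns = loc.getNamespaceURI();
            if (strictNamespace && (ns == null || !accepted.contains(ns))) {
                continue; // namespace not accepted: ignore this URL
            }
            result.add(loc.getTextContent().trim());
        }
        return result;
    }

    /** Helper for the demo: wraps an XML string in an InputSource. */
    static InputSource sourceOf(String xml) {
        return new InputSource(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"" + SITEMAP_NS + "\">"
                + "<url><loc>https://example.com/a</loc></url></urlset>";
        System.out.println(extractLocs(sourceOf(xml), true, Collections.singleton(SITEMAP_NS)));
    }
}
```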
public static boolean urlIsValid(String sitemapBaseUrl, String testUrl)
sitemapBaseUrl
- the base URL of the sitemap
testUrl
- the URL to be tested

Copyright © 2009–2021 Crawler-Commons. All rights reserved.