Class SiteMapParser


  • public class SiteMapParser
    extends Object
    • Field Detail

      • LOG

        public static final org.slf4j.Logger LOG
      • MAX_BYTES_ALLOWED

        public static final int MAX_BYTES_ALLOWED
        Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format. Before Nov. 2016 the limit was 10MB.
        See Also:
        Constant Field Values
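
        To make the arithmetic behind the limit concrete, here is a minimal sketch of a size guard. The class SitemapSizeCheck and its members are hypothetical names for illustration only; they are not part of this API, and the parser's own check may differ.

```java
class SitemapSizeCheck {
    // 50 MiB expressed in bytes (50 * 1024 * 1024 = 52,428,800), matching the limit quoted above.
    // Hypothetical constant; it only mirrors the documented value.
    static final int MAX_BYTES = 50 * 1024 * 1024;

    /** Returns true if the raw sitemap content is non-null and fits within the limit. */
    static boolean withinLimit(byte[] content) {
        return content != null && content.length <= MAX_BYTES;
    }
}
```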
      • strict

        protected boolean strict
        If true (the default), invalid URLs are rejected. The sitemap protocol only allows a sitemap to list URLs located under its base URL: https://www.sitemaps.org/protocol.html#location
      • strictNamespace

        protected boolean strictNamespace
        Indicates whether the parser should work with the namespace from the specifications or any namespace. Defaults to false.
      • acceptedNamespaces

        protected Set<String> acceptedNamespaces
        Set of namespaces accepted by the parser when strictNamespace is enabled. URLs from other namespaces are ignored.
      • extensionNamespaces

        protected Map<String,​Extension> extensionNamespaces
        Map of sitemap extension namespaces required to find the right extension handler.
    • Constructor Detail

      • SiteMapParser

        public SiteMapParser()
        SiteMapParser with strict location validation (isStrict()) and not allowing partially parsed content.
      • SiteMapParser

        public SiteMapParser​(boolean strict)
        SiteMapParser with configurable location validation, not allowing partially parsed content.
        Parameters:
        strict - see isStrict()
      • SiteMapParser

        public SiteMapParser​(boolean strict,
                             boolean allowPartial)
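        SiteMapParser with configurable location validation and configurable handling of partially parsed content.
        Parameters: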
        strict - see isStrict()
        allowPartial - if true: allow URLs from sitemaps only partially parsed because of format errors or truncated (incompletely fetched) content. If false any parser error will cause an UnknownFormatException.
    • Method Detail

      • setAllowDocTypeDefinitions

        public void setAllowDocTypeDefinitions​(boolean allowDocTypeDefinitions)
        Sets whether the parser allows a DTD in sitemaps or feeds.
        Parameters:
        allowDocTypeDefinitions - true if allowed. Default is false.
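
        For readers unfamiliar with what disallowing a DTD means in practice, the following sketch shows one common way to do it with the JDK's built-in JAXP parser. It is an illustration under assumptions, not a description of how this class is implemented: DtdPolicySketch and parses are hypothetical names, and the feature URI is the Xerces-style feature supported by the JDK parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DtdPolicySketch {
    /**
     * Returns true if the XML parses under the given DTD policy.
     * With allowDtd == false, any DOCTYPE declaration makes parsing fail.
     */
    static boolean parses(String xml, boolean allowDtd) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Reject documents that contain a DOCTYPE unless DTDs are explicitly allowed.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", !allowDtd);
            factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Malformed XML or a forbidden DOCTYPE ends up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

        Rejecting DTDs by default is a common defensive choice because DOCTYPE declarations can be used for entity-expansion and external-entity attacks.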
      • isStrict

        public boolean isStrict()
        Returns:
        whether invalid URLs will be rejected (where invalid means that the URL is not under the base URL, see sitemap file location)
      • isStrictNamespace

        public boolean isStrictNamespace()
        Returns:
        true if the parser accepts only the namespace from the specification and any namespace added via addAcceptedNamespace(String); false if it allows any namespace
      • setStrictNamespace

        public void setStrictNamespace​(boolean s)
        Sets the parser to allow any XML namespace, or only the one from the specification plus any accepted namespace (see addAcceptedNamespace(String)). Note that enabling strict namespace checking always adds the namespace defined by the current sitemap specification (Namespace.SITEMAP) to the list of accepted namespaces.
        Parameters:
        s - if true enable strict namespace-checking, disable if false
      • addAcceptedNamespace

        public void addAcceptedNamespace​(String namespaceUri)
        Add namespace URI to set of accepted namespaces.
        Parameters:
        namespaceUri - URI of the accepted XML namespace
      • addAcceptedNamespace

        public void addAcceptedNamespace​(String[] namespaceUris)
        Add namespace URIs to set of accepted namespaces.
        Parameters:
        namespaceUris - array of accepted XML namespace URIs
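
        The interplay between strict namespace checking and the accepted set can be sketched as follows. NamespaceGate is a hypothetical stand-in written only from the descriptions above; it is not the parser's code, and the constant assumes the standard sitemap 0.9 namespace URI.

```java
import java.util.HashSet;
import java.util.Set;

class NamespaceGate {
    // Namespace URI of the sitemap 0.9 schema (assumed to be what Namespace.SITEMAP refers to).
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    private boolean strict = false;                  // mirrors the documented default
    private final Set<String> accepted = new HashSet<>();

    /** Enabling strict mode always adds the specification namespace to the accepted set. */
    void setStrict(boolean s) {
        strict = s;
        if (s) {
            accepted.add(SITEMAP_NS);
        }
    }

    void addAccepted(String namespaceUri) {
        accepted.add(namespaceUri);
    }

    /** In non-strict mode every namespace passes; in strict mode only accepted ones do. */
    boolean accepts(String namespaceUri) {
        return !strict || accepted.contains(namespaceUri);
    }
}
```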
      • enableExtension

        public void enableExtension​(Extension extension)
        Enable support for a sitemap extension in the parser.
        Parameters:
        extension - sitemap extension (news, images, videos, etc.)
      • enableExtensions

        public void enableExtensions()
        Enable all supported sitemap extensions in the parser.
      • setURLFilter

        public void setURLFilter​(Function<String,​String> filter)
        Set URL filter function to normalize URLs found in sitemaps or filter URLs away if the function returns null.
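
        As an illustration of the contract described above (return a normalized URL, or null to drop it), here is a small filter function built only on the JDK. SitemapUrlFilters and HTTP_ONLY_NO_FRAGMENT are hypothetical names, and the specific policy (keep http/https only, strip the fragment) is an assumption chosen for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.function.Function;

class SitemapUrlFilters {
    /** Hypothetical filter: keeps only syntactically valid http(s) URLs and strips any fragment. */
    static final Function<String, String> HTTP_ONLY_NO_FRAGMENT = url -> {
        if (url == null) {
            return null;
        }
        String trimmed = url.trim();
        // Normalization step: drop everything from '#' onwards.
        int hash = trimmed.indexOf('#');
        String noFragment = hash >= 0 ? trimmed.substring(0, hash) : trimmed;
        try {
            String scheme = new URI(noFragment).getScheme();
            boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            // Returning null signals that the URL should be filtered away.
            return http ? noFragment : null;
        } catch (URISyntaxException e) {
            return null;
        }
    };
}
```

        A function of this shape matches the Function<String,String> parameter type, so it could be passed to setURLFilter.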
      • setURLFilter

        public void setURLFilter​(URLFilter filter)
        Use URLFilter to filter URLs, e.g. to configure that URLs found in sitemaps are normalized by BasicURLNormalizer:
         sitemapParser.setURLFilter(new BasicURLNormalizer());
         
      • parseSiteMap

        public AbstractSiteMap parseSiteMap​(URL onlineSitemapUrl)
                                     throws UnknownFormatException,
                                            IOException
        Returns a SiteMap or SiteMapIndex given an online sitemap URL. Please note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
        Parameters:
        onlineSitemapUrl - URL of the online sitemap
        Returns:
        Extracted SiteMap/SiteMapIndex or null if the onlineSitemapUrl is null
        Throws:
        UnknownFormatException - if there is an error parsing the sitemap
        IOException - if there is an error reading in the site map URL
      • parseSiteMap

        public AbstractSiteMap parseSiteMap​(String contentType,
                                            byte[] content,
                                            AbstractSiteMap sitemap)
                                     throws UnknownFormatException,
                                            IOException
        Returns a processed copy of an unprocessed sitemap object, i.e. the value of getLastModified() is transferred to the copy. Please note that the sitemap input stays unchanged. Note that contentType is assumed to be correct; in general it is more robust to use the method that doesn't take a contentType, but instead detects this using Tika.
        Parameters:
        contentType - MIME type of content
        content - raw bytes of sitemap file
        sitemap - an AbstractSiteMap implementation
        Returns:
        Extracted SiteMap/SiteMapIndex
        Throws:
        UnknownFormatException - if there is an error parsing the sitemap
        IOException - if there is an error reading in the site map URL
      • parseSiteMap

        public AbstractSiteMap parseSiteMap​(String contentType,
                                            byte[] content,
                                            URL url)
                                     throws UnknownFormatException,
                                            IOException
        Parse a sitemap, given the MIME type, the content bytes, and the URL. Note that contentType is assumed to be correct; in general it is more robust to use the method that doesn't take a contentType, but instead detects this using Tika.
        Parameters:
        contentType - MIME type of content
        content - raw bytes of sitemap file
        url - URL to sitemap file
        Returns:
        Extracted SiteMap/SiteMapIndex
        Throws:
        UnknownFormatException - if there is an error parsing the sitemap
        IOException - if there is an error reading in the site map URL
      • walkSiteMap

        public void walkSiteMap​(URL onlineSitemapUrl,
                                Consumer<SiteMapURL> action)
                         throws UnknownFormatException,
                                IOException
        Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.

        This method is a convenience method for a user who has a sitemap URL and wants a simple way to traverse it.

        Exceptions thrown by the action are relayed to the caller.

        Parameters:
        onlineSitemapUrl - URL of the online sitemap
        action - The action to be performed for each element
        Throws:
        UnknownFormatException - if there is an error parsing the sitemap
        IOException - if there is an error fetching the content of any URL
      • walkSiteMap

        public void walkSiteMap​(AbstractSiteMap sitemap,
                                Consumer<SiteMapURL> action)
                         throws UnknownFormatException,
                                IOException
        Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.

        This method is a convenience method for a user who has a sitemap and wants a simple way to traverse it.

        Exceptions thrown by the action are relayed to the caller.

        Parameters:
        sitemap - The sitemap to traverse
        action - The action to be performed for each element
        Throws:
        UnknownFormatException - if there is an error parsing the sitemap
        IOException - if there is an error fetching the content of any URL
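
        The traversal described for both walkSiteMap variants can be pictured with a small self-contained model. SitemapWalkSketch and its Node type are hypothetical and use plain strings instead of the library's sitemap classes; fetching is left out, so this only illustrates the recursion and the per-URL action.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class SitemapWalkSketch {
    /** Hypothetical node: an index holds child sitemaps, a leaf sitemap holds URLs. */
    static final class Node {
        final List<Node> children = new ArrayList<>();
        final List<String> urls = new ArrayList<>();

        boolean isIndex() {
            return !children.isEmpty();
        }
    }

    /**
     * Depth-first walk: descend into indexes, apply the action to every URL of a leaf.
     * Any exception thrown by the action propagates to the caller and stops the walk.
     */
    static void walk(Node node, Consumer<String> action) {
        if (node.isIndex()) {
            for (Node child : node.children) {
                walk(child, action);
            }
        } else {
            for (String url : node.urls) {
                action.accept(url);
            }
        }
    }
}
```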
      • processXml

        protected AbstractSiteMap processXml​(URL sitemapUrl,
                                             byte[] xmlContent)
                                      throws UnknownFormatException
        Parse the given XML content.
        Parameters:
        sitemapUrl - URL to sitemap file
        xmlContent - the byte[] backing the sitemapUrl
        Returns:
        The site map
        Throws:
        UnknownFormatException - if there is an error parsing the sitemap
      • processText

        protected SiteMap processText​(URL sitemapUrl,
                                      byte[] content)
                               throws IOException
        Process a text-based Sitemap. Text sitemaps only list URLs but no priorities, last mods, etc.
        Parameters:
        sitemapUrl - URL to sitemap file
        content - the byte[] backing the sitemapUrl
        Returns:
        The site map
        Throws:
        IOException - if there is an error reading in the site map content
      • processText

        protected SiteMap processText​(URL sitemapUrl,
                                      InputStream stream)
                               throws IOException
        Process a text-based Sitemap. Text sitemaps only list URLs but no priorities, last mods, etc.
        Parameters:
        sitemapUrl - URL to sitemap file
        stream - content stream
        Returns:
        The site map
        Throws:
        IOException - if there is an error reading in the site map content
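
        Since a text sitemap is just a list of URLs, its core can be sketched in a few lines. TextSitemapSketch is a hypothetical helper that assumes UTF-8 input with one URL per line; it performs no validation and is not the parser's implementation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

class TextSitemapSketch {
    /** Splits the content into lines, trims them, and drops empty lines. */
    static List<String> parseLines(byte[] content) {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(content), StandardCharsets.UTF_8));
        return reader.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }
}
```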
      • processGzippedXML

        protected AbstractSiteMap processGzippedXML​(URL url,
                                                    byte[] response)
                                             throws IOException,
                                                    UnknownFormatException
        Decompress the gzipped content and process the resulting XML Sitemap.
        Parameters:
        url - URL of the gzipped content
        response - Gzipped content
        Returns:
        the site map
        Throws:
        UnknownFormatException - if there is an error parsing the gzip
        IOException - if there is an error reading in the gzip URL
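
        The decompression step can be illustrated with the JDK's java.util.zip classes. GzipSketch is a hypothetical helper; the maxBytes guard is an assumption added to show how a size limit such as MAX_BYTES_ALLOWED could be enforced while inflating, and the gzip method exists only so the example can produce its own input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipSketch {
    /** Inflates gzipped bytes, failing if the decompressed size would exceed maxBytes. */
    static byte[] gunzip(byte[] gzipped, int maxBytes) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IOException("decompressed content exceeds " + maxBytes + " bytes");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: compresses raw bytes so the sketch can be exercised end to end. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```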
      • urlIsValid

        public static boolean urlIsValid​(String sitemapBaseUrl,
                                         String testUrl)
        See if testUrl is under sitemapBaseUrl. Only URLs under sitemapBaseUrl are valid.
        Parameters:
        sitemapBaseUrl - the base URL of the sitemap
        testUrl - the URL to be tested
        Returns:
        true if testUrl is under sitemapBaseUrl, false otherwise
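
        What "under the base URL" means can be shown with a simple prefix comparison. SitemapLocation, baseOf and isUnder are hypothetical names; the sketch assumes the sitemap URL contains a path, and it is meant to illustrate the location rule from sitemaps.org rather than to reproduce urlIsValid exactly.

```java
class SitemapLocation {
    /** Derives the base URL by cutting the sitemap URL after its last '/'. */
    static String baseOf(String sitemapUrl) {
        int lastSlash = sitemapUrl.lastIndexOf('/');
        return sitemapUrl.substring(0, lastSlash + 1);
    }

    /** A URL counts as "under" the base if it starts with the base URL. */
    static boolean isUnder(String baseUrl, String testUrl) {
        return baseUrl != null && testUrl != null && testUrl.startsWith(baseUrl);
    }
}
```

        For a sitemap at http://example.com/catalog/sitemap.xml the derived base is http://example.com/catalog/, so http://example.com/catalog/show?item=23 passes the check while http://example.com/images/logo.png does not.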