Class SiteMapCrossSubmitValidator


  • public class SiteMapCrossSubmitValidator
    extends Object
    Validator for sitemap cross-submits. The sitemap protocol defines strict requirements regarding the location of a sitemap:

    The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but can not include URLs starting with http://example.com/images/.

    If you have the permission to change http://example.org/path/sitemap.xml, it is assumed that you also have permission to provide information for URLs with the prefix http://example.org/path/.

    However, when the sitemap location (on host A) is specified in the robots.txt file of host B, this proves ownership, and the sitemap is allowed to cross-submit URLs on host B.

    Note: to use the validator, you need to create a sitemap parser without strict validation; see SiteMapParser.isStrict(), SiteMapParser.urlIsValid(String, String) and the constructors of SiteMapParser.
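The location rule quoted above can be illustrated with a minimal, library-independent sketch. The class SitemapPrefixRule and its methods are hypothetical names introduced here for illustration only; they are not part of this API, and the plain string comparison is a simplification (for example, it does not normalize the case of the host name).

```java
import java.net.URI;

/** Hypothetical helper illustrating the sitemap location rule; not part of this library. */
class SitemapPrefixRule {

    /** Returns the URL prefix covered by a sitemap: everything up to and including the last '/' of its path. */
    static String basePrefix(String sitemapUrl) {
        URI uri = URI.create(sitemapUrl);
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // a sitemap at the root covers the whole host
        }
        String directory = path.substring(0, path.lastIndexOf('/') + 1);
        return uri.getScheme() + "://" + uri.getAuthority() + directory;
    }

    /** Returns true if the URL starts with the prefix implied by the sitemap location. */
    static boolean isCovered(String sitemapUrl, String url) {
        return url.startsWith(basePrefix(sitemapUrl));
    }
}
```

With the sitemap at http://example.com/catalog/sitemap.xml, this sketch accepts URLs under http://example.com/catalog/ and rejects URLs under http://example.com/images/.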
    • Field Detail

      • LOG

        public static final org.slf4j.Logger LOG
    • Method Detail

      • validateSiteMapURLs

        public static void validateSiteMapURLs(SiteMap sitemap)
        Validate whether the URLs submitted in a sitemap are valid, that is, located below the same URL prefix as the sitemap itself. Invalid URLs are removed. Calling this method on a sitemap has the same effect as using a sitemap parser with strict validation, see SiteMapParser.isStrict().
        Parameters:
        sitemap - sitemap holding the URLs and the base URL (the URL prefix)
      • validateSiteMapURLs

        public static void validateSiteMapURLs(SiteMap sitemap,
                                               String host)
        Validate the URLs in a sitemap against a single cross-submit host. Invalid URLs are removed. This method implements the typical cross-submit check for a sitemap that is announced in the robots.txt of one host but located on a different host. As a consequence, it also enforces that all URLs in the sitemap are from a single host.
        Parameters:
        sitemap - sitemap holding the URLs to be validated
        host - host name proved for cross-submits. Usually the host of the robots.txt file in which the sitemap was announced.
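The single-host check described above can be sketched without the library as follows. SingleHostFilter and keepUrlsOnHost are hypothetical names, and the sketch works on plain URL strings rather than on a SiteMap object; how the real method compares host names internally is not specified here, so the case-insensitive comparison is an assumption.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of a single-host cross-submit check; not part of this library. */
class SingleHostFilter {

    /** Returns only those URLs whose host equals the given host (compared case-insensitively). */
    static List<String> keepUrlsOnHost(List<String> urls, String host) {
        String expected = host.toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String urlHost;
            try {
                urlHost = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                urlHost = null; // unparsable URLs are treated as invalid
            }
            if (urlHost != null && urlHost.toLowerCase(Locale.ROOT).equals(expected)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Because every surviving URL must match the one given host, the result necessarily contains URLs from a single host only.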
      • validateSiteMapURLs

        public static void validateSiteMapURLs(SiteMap sitemap,
                                               Collection<String> hosts)
        Validate the URLs in a sitemap against a set of cross-submit hosts. Invalid URLs are removed. This method is useful if the same sitemap location is found in the robots.txt files of multiple hosts.
        Parameters:
        sitemap - sitemap holding the URLs to be validated
        hosts - set of host names proved for cross-submits
      • validate

        protected static boolean validate(URL url,
                                          Collection<String> domains,
                                          SiteMapCrossSubmitValidator.CrossSubmitValidationLevel domainValidationLevel)
        Validate whether the host, ICANN domain, or private domain of a single URL is contained in a list of domain names.
        Parameters:
        url - URL to validate
        domains - set of domain names proved for cross-submits
        domainValidationLevel - validation level for the domain names
      • validateSiteMapURLs

        public static void validateSiteMapURLs(SiteMap sitemap,
                                               Collection<String> domains,
                                               SiteMapCrossSubmitValidator.CrossSubmitValidationLevel domainValidationLevel)
        Validate the URLs in a sitemap against a set of cross-submit domains. Invalid URLs are removed. This method is useful if a site owner is not sufficiently precise regarding the sitemap location. It allows ownership to be proven by verifying that the sitemap host and the cross-submit host share the same domain holder. The public suffix list is used to determine the shared part below the suffix, cf. EffectiveTldFinder.
        Parameters:
        sitemap - sitemap holding the URLs to be validated
        domains - set of domain names proved for cross-submits
        domainValidationLevel - validation level for the domain names
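The idea of comparing hosts by the domain registered below a public suffix can be sketched as follows. This is only a toy model: ToyDomainMatcher and its three-entry suffix set are assumptions made for illustration, standing in for the real public suffix list and for EffectiveTldFinder, whose API is not shown here. The sketch also does not model the validation levels.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical toy model of a domain-level cross-submit check; not part of this library. */
class ToyDomainMatcher {

    // Toy stand-in for the public suffix list (the real list has thousands of entries).
    static final Set<String> TOY_SUFFIXES = new HashSet<>(Arrays.asList("com", "org", "co.uk"));

    /** Returns the longest matching toy suffix plus one more label, or null if no suffix matches. */
    static String registrableDomain(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Start at index 1 so that at least one label remains in front of the suffix;
        // smaller indexes correspond to longer suffix candidates, which are tried first.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (TOY_SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return null;
    }

    /** Returns true if both hosts fall under the same registrable domain. */
    static boolean sameDomainHolder(String hostA, String hostB) {
        String domainA = registrableDomain(hostA);
        return domainA != null && domainA.equals(registrableDomain(hostB));
    }
}
```

In this toy model, sitemaps.example.co.uk and www.example.co.uk share the domain example.co.uk, while other.co.uk does not, because co.uk itself is a suffix rather than a registered domain.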
      • validateSiteMapURLs

        public static void validateSiteMapURLs(SiteMap sitemap,
                                               Predicate<URL> validator)
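A custom Predicate<URL> for this overload might be built as in the following sketch. The description of this overload is not shown here, so the assumption that URLs rejected by the predicate are removed (as with the other overloads) is exactly that, an assumption. HostSetPredicate and its methods are hypothetical names, and the sketch filters a plain list of URLs instead of calling the library.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical illustration of a URL predicate based on a set of hosts; not part of this library. */
class HostSetPredicate {

    /** Builds a predicate accepting URLs whose host is in the given set (entries assumed lowercase). */
    static Predicate<URL> forHosts(Set<String> hosts) {
        return url -> hosts.contains(url.getHost().toLowerCase(Locale.ROOT));
    }

    /** Mimics "invalid URLs are removed" on a copy of the list; the input list is left unchanged. */
    static List<URL> keepValid(List<URL> urls, Predicate<URL> validator) {
        List<URL> kept = new ArrayList<>(urls);
        kept.removeIf(validator.negate());
        return kept;
    }

    /** Convenience parser that turns the checked exception into an unchecked one. */
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

A predicate built this way could express the multi-host check, and the same shape accommodates other criteria such as restricting the URL scheme.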