Package crawlercommons.sitemaps
Class SiteMapCrossSubmitValidator
java.lang.Object
  crawlercommons.sitemaps.SiteMapCrossSubmitValidator
public class SiteMapCrossSubmitValidator extends Object
Validator for sitemap cross-submits. The sitemap protocol defines strict requirements regarding the location of a sitemap:
"The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but can not include URLs starting with http://example.com/images/."
"If you have the permission to change http://example.org/path/sitemap.xml, it is assumed that you also have permission to provide information for URLs with the prefix http://example.org/path/."
Exceptions to this rule are possible: if a sitemap located on host A is announced in the robots.txt of host B, this proves the ownership and the sitemap is allowed to cross-submit URLs on host B.
Note: in order to use the validator, you need to create a sitemap parser without strict validation, see SiteMapParser.isStrict(), SiteMapParser.urlIsValid(String, String) and the constructors of SiteMapParser.
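The location rule quoted above boils down to a URL prefix comparison. The following is a minimal standalone sketch of that rule, not the code used by this class or by SiteMapParser; the class and method names are hypothetical, and it assumes that the sitemap location contains a path (at least one '/' after the host).

```java
final class SitemapPrefixSketch {

    // Base prefix of a sitemap location: everything up to and including the last '/'.
    // Assumption: the location has a path component, e.g. ".../catalog/sitemap.xml".
    static String basePrefix(String sitemapUrl) {
        return sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
    }

    // True if the URL starts with the base prefix of the sitemap location.
    static boolean isBelowSitemap(String sitemapUrl, String url) {
        return url.startsWith(basePrefix(sitemapUrl));
    }

    public static void main(String[] args) {
        String sitemap = "http://example.com/catalog/sitemap.xml";
        // Below http://example.com/catalog/ : prints true
        System.out.println(isBelowSitemap(sitemap, "http://example.com/catalog/item1.html"));
        // Outside the prefix: prints false
        System.out.println(isBelowSitemap(sitemap, "http://example.com/images/logo.png"));
    }
}
```

A plain string comparison is case-sensitive and does not normalize URLs, so a real check may need to treat scheme and host case-insensitively.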
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description:
static class SiteMapCrossSubmitValidator.CrossSubmitValidationLevel
-
Field Summary
Fields
Modifier and Type, Field, Description:
static org.slf4j.Logger LOG
-
Method Summary
Modifier and Type, Method, Description:
protected static boolean validate(URL url, Collection<String> domains, SiteMapCrossSubmitValidator.CrossSubmitValidationLevel domainValidationLevel)
    Check whether the host, ICANN domain, or private domain of a single URL is part of a list of domain names.
static void validateSiteMapURLs(AbstractSiteMap sitemap, Collection<String> domains, SiteMapCrossSubmitValidator.CrossSubmitValidationLevel domainValidationLevel)
    Validation of a sitemap or recursive validation of a sitemap index.
static void validateSiteMapURLs(SiteMap sitemap)
    Check whether the URLs submitted in a sitemap are valid, that is, below the same URL prefix as the location of the sitemap.
static void validateSiteMapURLs(SiteMap sitemap, String host)
    Validate the URLs in a sitemap against a single cross-submit host.
static void validateSiteMapURLs(SiteMap sitemap, Collection<String> hosts)
    Validate the URLs in a sitemap against a set of cross-submit hosts.
static void validateSiteMapURLs(SiteMap sitemap, Collection<String> domains, SiteMapCrossSubmitValidator.CrossSubmitValidationLevel domainValidationLevel)
    Validate the URLs in a sitemap against a set of cross-submit domains.
static void validateSiteMapURLs(SiteMap sitemap, Predicate<URL> validator)
Method Detail
-
validateSiteMapURLs
public static void validateSiteMapURLs(SiteMap sitemap)
Check whether the URLs submitted in a sitemap are valid, that is, below the same URL prefix as the location of the sitemap. Invalid URLs are removed. Calling this method on a sitemap has the same effect as using a sitemap parser with strict validation, see SiteMapParser.isStrict().
Parameters:
sitemap - sitemap holding the URLs and the base URL (the URL prefix)
-
validateSiteMapURLs
public static void validateSiteMapURLs(SiteMap sitemap, String host)
Validate the URLs in a sitemap against a single cross-submit host. Invalid URLs are removed. This method implements the typical cross-submit check for a sitemap announced in the robots.txt of one host while located on a different host. Naturally, it also verifies that "all URLs in a Sitemap must be from a single host".
Parameters:
sitemap -
host - host name proved for cross-submits. Usually the host of the robots.txt file in which the sitemap was announced.
-
validateSiteMapURLs
public static void validateSiteMapURLs(SiteMap sitemap, Collection<String> hosts)
Validate the URLs in a sitemap against a set of cross-submit hosts. Invalid URLs are removed. This method is useful if the same sitemap location is found in the robots.txt files of multiple hosts.
Parameters:
sitemap - sitemap holding the URLs to be validated
hosts - set of host names proved for cross-submits
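The host-based checks can be pictured as filtering a collection of URLs by host name. The following standalone sketch illustrates the idea with plain strings instead of SiteMap objects; the class and method names are hypothetical and this is not the library's implementation. The single-host variant corresponds to passing a one-element collection. It assumes syntactically valid URLs, because URI.create throws an IllegalArgumentException otherwise.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class CrossSubmitHostSketch {

    // Removes every URL whose host is not one of the hosts proved for cross-submits.
    // Host names are compared case-insensitively; URLs without a host are removed.
    static void retainProvedHosts(Collection<String> urls, Collection<String> hosts) {
        Set<String> proved = new HashSet<>();
        for (String host : hosts) {
            proved.add(host.toLowerCase(Locale.ROOT));
        }
        urls.removeIf(url -> {
            String host = URI.create(url).getHost();
            return host == null || !proved.contains(host.toLowerCase(Locale.ROOT));
        });
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>(Arrays.asList(
                "http://www.example.com/page.html",
                "http://shop.example.com/item.html",
                "http://other.example.org/index.html"));
        // The URL on other.example.org is dropped, the other two are kept.
        retainProvedHosts(urls, Arrays.asList("www.example.com", "shop.example.com"));
        System.out.println(urls);
    }
}
```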
-
validate
protected static boolean validate(URL url, Collection<String> domains, SiteMapCrossSubmitValidator.CrossSubmitValidationLevel domainValidationLevel)
Check whether the host, ICANN domain, or private domain of a single URL is part of a list of domain names.
Parameters:
url - URL to validate
domains - set of domain names proved for cross-submits
domainValidationLevel - validation level for the domain names
-
validateSiteMapURLs
public static void validateSiteMapURLs(SiteMap sitemap, Collection<String> domains, SiteMapCrossSubmitValidator.CrossSubmitValidationLevel domainValidationLevel)
Validate the URLs in a sitemap against a set of cross-submit domains. Invalid URLs are removed. This method is useful if a site owner is not sufficiently precise regarding the sitemap location. It makes it possible to prove ownership by verifying that both the sitemap and the cross-submit host share the same domain holder. The public suffix list is used to determine the shared part below the suffix, cf. EffectiveTldFinder.
Parameters:
sitemap - sitemap holding the URLs to be validated
domains - set of domain names proved for cross-submits
domainValidationLevel - validation level for the domain names
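To illustrate what "the shared part below the suffix" means, the following standalone sketch derives a domain name from a host name using a tiny, hard-coded suffix set. It is only an illustration under stated assumptions: TOY_SUFFIXES is a hypothetical stand-in for the real public suffix list, the class and method names are invented for this example, and the actual lookup is delegated to EffectiveTldFinder, whose API is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SharedDomainSketch {

    // Hypothetical, incomplete stand-in for the public suffix list.
    static final Set<String> TOY_SUFFIXES = new HashSet<>(Arrays.asList("com", "org", "co.uk"));

    // Returns the matching suffix plus one label to its left, or null if no suffix matches.
    static String domainBelowSuffix(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Candidates run from the longest possible suffix to the shortest.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (TOY_SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return null;
    }

    // True if both hosts resolve to the same domain below a known suffix.
    static boolean sameDomain(String hostA, String hostB) {
        String domainA = domainBelowSuffix(hostA);
        return domainA != null && domainA.equals(domainBelowSuffix(hostB));
    }

    public static void main(String[] args) {
        // Sitemap host and cross-submit host share example.com: prints true
        System.out.println(sameDomain("sitemaps.example.com", "www.example.com"));
        // Different suffixes, hence different domains: prints false
        System.out.println(sameDomain("www.example.com", "www.example.org"));
    }
}
```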
-
validateSiteMapURLs
public static void validateSiteMapURLs(SiteMap sitemap, Predicate<URL> validator)
-
validateSiteMapURLs
public static void validateSiteMapURLs(AbstractSiteMap sitemap, Collection<String> domains, SiteMapCrossSubmitValidator.CrossSubmitValidationLevel domainValidationLevel)
Validation of a sitemap or recursive validation of a sitemap index. See validateSiteMapURLs(SiteMap, Collection, CrossSubmitValidationLevel).
Parameters:
sitemap - sitemap or sitemap index, holding the URLs to be validated
domains - set of domain names proved for cross-submits
domainValidationLevel - validation level for the domain names
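Recursive validation of a sitemap index can be sketched as a depth-first walk that filters the URLs of every sitemap it reaches. The Node type below is a hypothetical stand-in for AbstractSiteMap and its subclasses, and the domain check is replaced by a generic predicate on URL strings; the sketch also assumes that the index forms a tree without cycles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class RecursiveValidationSketch {

    // Hypothetical stand-in for a sitemap (urls) or a sitemap index (children).
    static final class Node {
        final List<String> urls = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
    }

    // Removes invalid URLs from this node, then descends into all child sitemaps.
    static void validate(Node node, Predicate<String> isValid) {
        node.urls.removeIf(isValid.negate());
        for (Node child : node.children) {
            validate(child, isValid);
        }
    }

    public static void main(String[] args) {
        Node sitemap = new Node();
        sitemap.urls.add("http://example.com/a.html");
        sitemap.urls.add("http://example.net/b.html");

        Node index = new Node();
        index.children.add(sitemap);

        // Keep only URLs on example.com; the example.net URL is removed.
        validate(index, url -> url.startsWith("http://example.com/"));
        System.out.println(sitemap.urls);
    }
}
```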