Package crawlercommons.robots
Class SimpleRobotRules
- java.lang.Object
  - crawlercommons.robots.BaseRobotRules
    - crawlercommons.robots.SimpleRobotRules
- All Implemented Interfaces:
Serializable
public class SimpleRobotRules extends BaseRobotRules
Result from parsing a single robots.txt file – a set of allow/disallow rules to check whether a given URL is allowed, and optionally a Crawl-delay and Sitemap URLs.

Allow/disallow rules are matched following the Robots Exclusion Protocol, RFC 9309. This includes Google's robots.txt extensions to the original RFC draft: the Allow directive, the special characters $ and *, and precedence of longer (more specific) patterns.

See also: Robots Exclusion on Wikipedia
- See Also:
- Serialized Form
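
A minimal usage sketch: robots.txt content is typically turned into rules via the companion parser SimpleRobotRulesParser in the same package. The example URL, user-agent token, and robots.txt content below are illustrative, and the exact parseContent signature may differ between crawler-commons versions:

    import java.nio.charset.StandardCharsets;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheckExample {
        public static void main(String[] args) {
            byte[] content = ("User-agent: *\n"
                    + "Disallow: /private/\n").getBytes(StandardCharsets.UTF_8);

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            // Arguments: URL of the robots.txt, its raw bytes, the content type,
            // and the crawler's user-agent token (signature may vary by version).
            BaseRobotRules rules = parser.parseContent(
                    "https://example.com/robots.txt", content, "text/plain",
                    "mycrawler");

            System.out.println(rules.isAllowed("https://example.com/private/a.html")); // false
            System.out.println(rules.isAllowed("https://example.com/index.html"));     // true
        }
    }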
-
-
Nested Class Summary
Nested Classes
- static class SimpleRobotRules.RobotRule
  Single rule that maps from a path prefix to an allow flag.
- static class SimpleRobotRules.RobotRulesMode
-
Field Summary
Fields
- protected SimpleRobotRules.RobotRulesMode _mode
- protected ArrayList<SimpleRobotRules.RobotRule> _rules
- protected static boolean[] specialCharactersPathMatching
  Special characters which require percent-encoding for path matching
Fields inherited from class crawlercommons.robots.BaseRobotRules
UNSET_CRAWL_DELAY
-
-
Constructor Summary
Constructors
- SimpleRobotRules()
- SimpleRobotRules(SimpleRobotRules.RobotRulesMode mode)
-
Method Summary
Methods
- void addRule(String prefix, boolean allow)
  Add an allow/disallow rule to the ruleset.
- void clearRules()
- boolean equals(Object obj)
- static String escapePath(String urlPathQuery, boolean[] additionalEncodedBytes)
  Encode/decode (using percent-encoding) all characters where necessary: encode Unicode/non-ASCII characters and decode printable ASCII characters without special semantics.
- List<SimpleRobotRules.RobotRule> getRobotRules()
- int hashCode()
- boolean isAllowAll()
  Is our ruleset set up to allow all access?
- boolean isAllowed(String url)
  Check whether a URL is allowed to be fetched according to the robots rules.
- boolean isAllowed(URL url)
  Check whether a URL is allowed to be fetched according to the robots rules.
- boolean isAllowNone()
  Is our ruleset set up to disallow all access?
- void sortRules()
  Sort and deduplicate robot rules.
- String toString()
  Returns a string with the crawl delay as well as a list of sitemaps if they exist (and aren't more than 10).
Methods inherited from class crawlercommons.robots.BaseRobotRules
addSitemap, getCrawlDelay, getSitemaps, isDeferVisits, setCrawlDelay, setDeferVisits
-
-
-
-
Field Detail
-
_rules
protected ArrayList<SimpleRobotRules.RobotRule> _rules
-
_mode
protected SimpleRobotRules.RobotRulesMode _mode
-
specialCharactersPathMatching
protected static final boolean[] specialCharactersPathMatching
Special characters which require percent-encoding for path matching
-
-
Constructor Detail
-
SimpleRobotRules
public SimpleRobotRules()
-
SimpleRobotRules
public SimpleRobotRules(SimpleRobotRules.RobotRulesMode mode)
-
-
Method Detail
-
clearRules
public void clearRules()
-
addRule
public void addRule(String prefix, boolean allow)
Add an allow/disallow rule to the ruleset.
- Parameters:
  prefix - path prefix or pattern
  allow - whether to allow the URLs matching the prefix or pattern
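
For illustration, a hand-built ruleset. This is a sketch assuming the no-argument constructor yields a mode that honors individual rules (as rulesets produced by the parser do):

    SimpleRobotRules rules = new SimpleRobotRules();
    rules.addRule("/cgi-bin/", false); // Disallow: /cgi-bin/
    rules.addRule("/", true);          // Allow: /
    rules.sortRules();                 // required before matching, see sortRules()
    rules.isAllowed("https://example.com/cgi-bin/run"); // false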
-
getRobotRules
public List<SimpleRobotRules.RobotRule> getRobotRules()
- Returns:
- the list of allow/disallow rules
-
isAllowed
public boolean isAllowed(String url)
Check whether a URL is allowed to be fetched according to the robots rules. Note that the URL must be properly normalized, otherwise the URL path may not be matched against the robots rules. In order to normalize the URL, BasicURLNormalizer can be used:

    BasicURLNormalizer normalizer = new BasicURLNormalizer();
    String urlNormalized = normalizer.filter(urlNotNormalized);
- Specified by:
  isAllowed in class BaseRobotRules
- Parameters:
  url - URL string to be checked
- Returns:
- true if the URL is allowed
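
Putting the normalization note together with the check, a sketch (BasicURLNormalizer lives in crawlercommons.filters.basic; rules is assumed to be a previously built SimpleRobotRules instance, and the null check assumes filter may reject unparseable URLs):

    import crawlercommons.filters.basic.BasicURLNormalizer;

    BasicURLNormalizer normalizer = new BasicURLNormalizer();
    String normalized = normalizer.filter("https://example.com/a/../private/x.html");
    if (normalized != null && rules.isAllowed(normalized)) {
        // safe to fetch according to robots.txt
    }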
-
isAllowed
public boolean isAllowed(URL url)
Check whether a URL is allowed to be fetched according to the robots rules.
- Specified by:
  isAllowed in class BaseRobotRules
- Parameters:
  url - URL to be checked
- Returns:
- true if the URL is allowed
- See Also:
isAllowed(String)
-
escapePath
public static String escapePath(String urlPathQuery, boolean[] additionalEncodedBytes)
Encode/decode (using percent-encoding) all characters where necessary: encode Unicode/non-ASCII characters and decode printable ASCII characters without special semantics.
- Parameters:
  urlPathQuery - path and query component of the URL
  additionalEncodedBytes - boolean array to request bytes (ASCII characters) to be percent-encoded in addition to other characters requiring encoding (Unicode/non-ASCII and characters not allowed in URLs)
- Returns:
- properly percent-encoded URL path and query
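
A hypothetical sketch of the additionalEncodedBytes parameter, assuming the array is indexed by ASCII byte value with true marking bytes to encode (mirroring the specialCharactersPathMatching field used internally):

    boolean[] extra = new boolean[128];
    extra[','] = true; // additionally percent-encode commas (assumed indexing)

    // Non-ASCII characters such as 'é' are always percent-encoded;
    // ',' is encoded here only because it was requested above.
    String escaped = SimpleRobotRules.escapePath("/wiki/Café?q=a,b", extra);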
-
sortRules
public void sortRules()
Sort and deduplicate robot rules. This method must be called after the robots.txt has been processed and before rule matching. The ordering is implemented in SimpleRobotRules.RobotRule.compareTo(RobotRule) and defined by RFC 9309, section 2.2.2: "The most specific match found MUST be used. The most specific match is the match that has the most octets. Duplicate rules in a group MAY be deduplicated."
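
A sketch of the longest-match precedence this ordering enables (again assuming the default constructor yields a mode that honors individual rules):

    SimpleRobotRules rules = new SimpleRobotRules();
    rules.addRule("/fish", false);       // Disallow: /fish
    rules.addRule("/fish/salmon", true); // Allow: /fish/salmon (more octets)
    rules.sortRules(); // longer, more specific patterns are matched first

    rules.isAllowed("https://example.com/fish/salmon.html"); // true: Allow wins
    rules.isAllowed("https://example.com/fish/trout.html");  // false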
-
isAllowAll
public boolean isAllowAll()
Is our ruleset set up to allow all access?
Note: This is decided only based on the SimpleRobotRules.RobotRulesMode without inspecting the set of allow/disallow rules.
- Specified by:
  isAllowAll in class BaseRobotRules
- Returns:
- true if all URLs are allowed.
-
isAllowNone
public boolean isAllowNone()
Is our ruleset set up to disallow all access?
Note: This is decided only based on the SimpleRobotRules.RobotRulesMode without inspecting the set of allow/disallow rules.
- Specified by:
  isAllowNone in class BaseRobotRules
- Returns:
- true if no URLs are allowed.
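
Both checks reflect only the mode the instance was constructed with. A short sketch, assuming RobotRulesMode exposes ALLOW_ALL and ALLOW_NONE constants:

    SimpleRobotRules allowAll =
            new SimpleRobotRules(SimpleRobotRules.RobotRulesMode.ALLOW_ALL);
    allowAll.isAllowAll();  // true
    allowAll.isAllowNone(); // false

    SimpleRobotRules allowNone =
            new SimpleRobotRules(SimpleRobotRules.RobotRulesMode.ALLOW_NONE);
    allowNone.isAllowed("https://example.com/any"); // false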
-
hashCode
public int hashCode()
- Overrides:
  hashCode in class BaseRobotRules
-
equals
public boolean equals(Object obj)
- Overrides:
  equals in class BaseRobotRules
-
toString
public String toString()
Description copied from class: BaseRobotRules
Returns a string with the crawl delay as well as a list of sitemaps if they exist (and aren't more than 10).
- Overrides:
  toString in class BaseRobotRules
-
-