Package crawlercommons.robots
Class SimpleRobotRules
java.lang.Object
  crawlercommons.robots.BaseRobotRules
    crawlercommons.robots.SimpleRobotRules
- All Implemented Interfaces:
Serializable
public class SimpleRobotRules extends BaseRobotRules
Result from parsing a single robots.txt file – a set of allow/disallow rules to check whether a given URL is allowed, and optionally a Crawl-delay and Sitemap URLs. Allow/disallow rules are matched following the Robots Exclusion Protocol (RFC 9309). This includes Google's robots.txt extensions to the original RFC draft: the Allow directive, the $ and * special characters, and precedence of longer (more specific) patterns.
See also: Robots Exclusion on Wikipedia
- See Also:
- Serialized Form
-
-
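Typical usage (a minimal, hedged sketch): instances of this class are normally produced by a robots.txt parser rather than assembled by hand. The parser class SimpleRobotRulesParser and the parseContent signature below are assumptions based on the crawler-commons library; they are not documented on this page.

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheckExample {
    public static void main(String[] args) {
        // robots.txt content as it might be fetched from a site (illustrative)
        byte[] content = ("User-agent: *\n"
                + "Disallow: /private/\n"
                + "Allow: /private/public-docs/\n"
                + "Crawl-delay: 5\n").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "https://www.example.com/robots.txt", // URL the file was fetched from
                content,
                "text/plain",   // Content-Type reported by the server
                "mycrawler");   // our robot (user-agent) name

        // Longer (more specific) patterns take precedence, so the Allow rule
        // wins over the shorter Disallow for URLs under /private/public-docs/.
        System.out.println(rules.isAllowed("https://www.example.com/private/notes.html"));         // false
        System.out.println(rules.isAllowed("https://www.example.com/private/public-docs/a.html")); // true
    }
}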
Nested Class Summary
static class SimpleRobotRules.RobotRule
    Single rule that maps from a path prefix to an allow flag.
static class SimpleRobotRules.RobotRulesMode
-
Field Summary
protected SimpleRobotRules.RobotRulesMode _mode
protected ArrayList<SimpleRobotRules.RobotRule> _rules
protected static boolean[] specialCharactersPathMatching
    Special characters which require percent-encoding for path matching
Fields inherited from class crawlercommons.robots.BaseRobotRules
UNSET_CRAWL_DELAY
-
-
Constructor Summary
SimpleRobotRules()
SimpleRobotRules(SimpleRobotRules.RobotRulesMode mode)
-
Method Summary
void addRule(String prefix, boolean allow)
void clearRules()
boolean equals(Object obj)
static String escapePath(String urlPathQuery, boolean[] additionalEncodedBytes)
    Encode/decode (using percent-encoding) all characters where necessary: encode Unicode/non-ASCII characters and decode printable ASCII characters without special semantics.
List<SimpleRobotRules.RobotRule> getRobotRules()
int hashCode()
boolean isAllowAll()
    Is our ruleset set up to allow all access?
boolean isAllowed(String url)
boolean isAllowNone()
    Is our ruleset set up to disallow all access?
void sortRules()
    Sort and deduplicate robot rules.
String toString()
    Returns a string with the crawl delay as well as a list of sitemaps if they exist (and there are no more than 10).
Methods inherited from class crawlercommons.robots.BaseRobotRules
addSitemap, getCrawlDelay, getSitemaps, isDeferVisits, setCrawlDelay, setDeferVisits
-
Field Detail
-
_rules
protected ArrayList<SimpleRobotRules.RobotRule> _rules
-
_mode
protected SimpleRobotRules.RobotRulesMode _mode
-
specialCharactersPathMatching
protected static final boolean[] specialCharactersPathMatching
Special characters which require percent-encoding for path matching
-
-
Constructor Detail
-
SimpleRobotRules
public SimpleRobotRules()
-
SimpleRobotRules
public SimpleRobotRules(SimpleRobotRules.RobotRulesMode mode)
-
-
Method Detail
-
clearRules
public void clearRules()
-
addRule
public void addRule(String prefix, boolean allow)
-
getRobotRules
public List<SimpleRobotRules.RobotRule> getRobotRules()
- Returns:
- the list of allow/disallow rules
-
isAllowed
public boolean isAllowed(String url)
- Specified by:
isAllowed in class BaseRobotRules
-
escapePath
public static String escapePath(String urlPathQuery, boolean[] additionalEncodedBytes)
Encode/decode (using percent-encoding) all characters where necessary: encode Unicode/non-ASCII characters and decode printable ASCII characters without special semantics.
- Parameters:
urlPathQuery - path and query component of the URL
additionalEncodedBytes - boolean array to request bytes (ASCII characters) to be percent-encoded in addition to other characters requiring encoding (Unicode/non-ASCII and characters not allowed in URLs)
- Returns:
- properly percent-encoded URL path and query
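A hedged sketch of calling escapePath: based on the specialCharactersPathMatching field above, the boolean array is assumed to hold one flag per ASCII byte value, with true marking bytes to percent-encode in addition to the default set; this layout is not spelled out on this page.

import crawlercommons.robots.SimpleRobotRules;

public class EscapePathExample {
    public static void main(String[] args) {
        // Assumption: one flag per ASCII byte value (0-127); true marks bytes
        // to percent-encode in addition to the default set.
        boolean[] alsoEncode = new boolean[128];
        alsoEncode[','] = true; // hypothetical: additionally encode commas

        // Unicode/non-ASCII is percent-encoded; printable ASCII without
        // special semantics stays (or is decoded back to) literal.
        String escaped = SimpleRobotRules.escapePath("/caf\u00e9/menu,2024?lang=fr", alsoEncode);
        System.out.println(escaped); // e.g. /caf%C3%A9/menu%2C2024?lang=fr
    }
}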
-
sortRules
public void sortRules()
Sort and deduplicate robot rules. This method must be called after the robots.txt has been processed and before rule matching. The ordering is implemented in SimpleRobotRules.RobotRule.compareTo(RobotRule) and defined by RFC 9309, section 2.2.2: "The most specific match found MUST be used. The most specific match is the match that has the most octets. Duplicate rules in a group MAY be deduplicated."
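A short sketch of the required call order (build, sort, then match); the no-argument constructor is assumed here to produce a ruleset that matches against the rules added via addRule:

import crawlercommons.robots.SimpleRobotRules;

public class SortRulesExample {
    public static void main(String[] args) {
        SimpleRobotRules rules = new SimpleRobotRules();
        rules.addRule("/private/", false);            // Disallow: /private/
        rules.addRule("/private/public-docs/", true); // Allow: /private/public-docs/

        // Required after all rules are added and before any matching: sorting
        // puts longer (more specific) prefixes first so they win.
        rules.sortRules();

        System.out.println(rules.isAllowed("https://www.example.com/private/notes.html"));         // false
        System.out.println(rules.isAllowed("https://www.example.com/private/public-docs/a.html")); // true
    }
}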
-
isAllowAll
public boolean isAllowAll()
Is our ruleset set up to allow all access? Note: this is decided only based on the SimpleRobotRules.RobotRulesMode without inspecting the set of allow/disallow rules. (A combined sketch for both modes follows the isAllowNone description below.)
- Specified by:
isAllowAll in class BaseRobotRules
- Returns:
- true if all URLs are allowed.
-
isAllowNone
public boolean isAllowNone()
Is our ruleset set up to disallow all access? Note: this is decided only based on the SimpleRobotRules.RobotRulesMode without inspecting the set of allow/disallow rules.
- Specified by:
isAllowNone in class BaseRobotRules
- Returns:
- true if no URLs are allowed.
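A combined sketch for both mode-based checks; the ALLOW_ALL and ALLOW_NONE constants are assumptions about SimpleRobotRules.RobotRulesMode, whose constants are not listed on this page:

import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class RulesModeExample {
    public static void main(String[] args) {
        // Assumed enum constants ALLOW_ALL and ALLOW_NONE (not listed on this page).
        SimpleRobotRules allowAll = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        System.out.println(allowAll.isAllowAll());                             // true
        System.out.println(allowAll.isAllowed("https://www.example.com/any")); // true

        SimpleRobotRules allowNone = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
        System.out.println(allowNone.isAllowNone());                            // true
        System.out.println(allowNone.isAllowed("https://www.example.com/any")); // false
    }
}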
-
hashCode
public int hashCode()
- Overrides:
hashCode in class BaseRobotRules
-
equals
public boolean equals(Object obj)
- Overrides:
equals in class BaseRobotRules
-
toString
public String toString()
Description copied from class: BaseRobotRules
Returns a string with the crawl delay as well as a list of sitemaps if they exist (and there are no more than 10).
- Overrides:
toString in class BaseRobotRules
-
-