Package crawlercommons.robots
Class SimpleRobotRules
- java.lang.Object
-
- crawlercommons.robots.BaseRobotRules
-
- crawlercommons.robots.SimpleRobotRules
-
- All Implemented Interfaces:
Serializable
public class SimpleRobotRules extends BaseRobotRules
Result from parsing a single robots.txt file – a set of allow/disallow rules to check whether a given URL is allowed, and optionally a Crawl-delay and Sitemap URLs.
Allow/disallow rules are matched following the Robots Exclusion Protocol, RFC 9309. This includes Google's robots.txt extensions to the original RFC draft: the Allow directive, the $ and * special characters, and precedence of longer (more specific) patterns.
See also: Robots Exclusion on Wikipedia
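The precedence of longer patterns can be sketched in plain Java. The following is an illustrative reimplementation of the RFC 9309 rule-selection idea (most octets wins, Allow preferred on ties), not the crawler-commons code; the `Rule` record and method names here are hypothetical:

```java
import java.util.List;
import java.util.Optional;

public class PrecedenceSketch {
    // Hypothetical stand-in for SimpleRobotRules.RobotRule: a path prefix
    // plus an allow flag.
    record Rule(String prefix, boolean allow) {}

    static boolean isAllowed(List<Rule> rules, String path) {
        Optional<Rule> best = rules.stream()
                .filter(r -> path.startsWith(r.prefix()))
                // longest (most specific) prefix first; on equal length,
                // Allow sorts before Disallow
                .sorted((a, b) -> {
                    int byLen = Integer.compare(b.prefix().length(), a.prefix().length());
                    return byLen != 0 ? byLen : Boolean.compare(!a.allow(), !b.allow());
                })
                .findFirst();
        // no matching rule: allowed by default
        return best.map(Rule::allow).orElse(true);
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule("/", false),        // Disallow: /
                new Rule("/public/", true)); // Allow: /public/
        System.out.println(isAllowed(rules, "/public/page.html")); // true
        System.out.println(isAllowed(rules, "/private/page.html")); // false
    }
}
```

The longer `/public/` prefix overrides the blanket `Disallow: /`, which is the "most specific match" behavior the class documents.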
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
- static class SimpleRobotRules.RobotRule
  Single rule that maps from a path prefix to an allow flag.
- static class SimpleRobotRules.RobotRulesMode
-
Field Summary
Fields
- protected SimpleRobotRules.RobotRulesMode _mode
- protected ArrayList<SimpleRobotRules.RobotRule> _rules
- protected static boolean[] specialCharactersPathMatching
  Special characters which require percent-encoding for path matching
Fields inherited from class crawlercommons.robots.BaseRobotRules
UNSET_CRAWL_DELAY
-
-
Constructor Summary
Constructors
- SimpleRobotRules()
- SimpleRobotRules(SimpleRobotRules.RobotRulesMode mode)
-
Method Summary
- void addRule(String prefix, boolean allow)
- void clearRules()
- boolean equals(Object obj)
- static String escapePath(String urlPathQuery, boolean[] additionalEncodedBytes)
  Encode/decode (using percent-encoding) all characters where necessary: encode Unicode/non-ASCII characters and decode printable ASCII characters without special semantics.
- List<SimpleRobotRules.RobotRule> getRobotRules()
- int hashCode()
- boolean isAllowAll()
  Is our ruleset set up to allow all access?
- boolean isAllowed(String url)
- boolean isAllowNone()
  Is our ruleset set up to disallow all access?
- void sortRules()
  Sort and deduplicate robot rules.
- String toString()
  Returns a string with the crawl delay as well as a list of sitemaps if they exist (and aren't more than 10).
Methods inherited from class crawlercommons.robots.BaseRobotRules
addSitemap, getCrawlDelay, getSitemaps, isDeferVisits, setCrawlDelay, setDeferVisits
-
-
-
-
Field Detail
-
_rules
protected ArrayList<SimpleRobotRules.RobotRule> _rules
-
_mode
protected SimpleRobotRules.RobotRulesMode _mode
-
specialCharactersPathMatching
protected static final boolean[] specialCharactersPathMatching
Special characters which require percent-encoding for path matching
-
-
Constructor Detail
-
SimpleRobotRules
public SimpleRobotRules()
-
SimpleRobotRules
public SimpleRobotRules(SimpleRobotRules.RobotRulesMode mode)
-
-
Method Detail
-
clearRules
public void clearRules()
-
addRule
public void addRule(String prefix, boolean allow)
-
getRobotRules
public List<SimpleRobotRules.RobotRule> getRobotRules()
- Returns:
- the list of allow/disallow rules
-
isAllowed
public boolean isAllowed(String url)
- Specified by:
  isAllowed in class BaseRobotRules
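Since the class description says patterns may use the $ and * special characters, the matching that isAllowed relies on can be sketched as follows. This is a simplified illustration of wildcard/anchor matching under those assumptions, not the library's implementation; the method names are hypothetical:

```java
public class WildcardSketch {
    // Match a robots.txt pattern against a URL path. '*' matches any
    // sequence of characters; a trailing '$' anchors the pattern to the
    // end of the path. Without '$', a prefix match suffices.
    static boolean matches(String pattern, String path) {
        boolean anchored = pattern.endsWith("$");
        if (anchored) {
            pattern = pattern.substring(0, pattern.length() - 1);
        }
        return matchFrom(pattern, 0, path, 0, anchored);
    }

    private static boolean matchFrom(String pat, int pi, String path, int si, boolean anchored) {
        while (pi < pat.length()) {
            char c = pat.charAt(pi);
            if (c == '*') {
                // try every possible expansion of the wildcard
                for (int k = si; k <= path.length(); k++) {
                    if (matchFrom(pat, pi + 1, path, k, anchored)) {
                        return true;
                    }
                }
                return false;
            }
            if (si >= path.length() || path.charAt(si) != c) {
                return false;
            }
            pi++;
            si++;
        }
        // '$' requires the pattern to consume the whole path
        return !anchored || si == path.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("/private/*.pdf$", "/private/report.pdf"));     // true
        System.out.println(matches("/private/*.pdf$", "/private/report.pdf?x=1")); // false
        System.out.println(matches("/api/", "/api/v1/users"));                     // true (prefix match)
    }
}
```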
-
escapePath
public static String escapePath(String urlPathQuery, boolean[] additionalEncodedBytes)
Encode/decode (using percent-encoding) all characters where necessary: encode Unicode/non-ASCII characters and decode printable ASCII characters without special semantics.
- Parameters:
  urlPathQuery - path and query component of the URL
  additionalEncodedBytes - boolean array to request bytes (ASCII characters) to be percent-encoded in addition to other characters requiring encoding (Unicode/non-ASCII and characters not allowed in URLs)
- Returns:
  properly percent-encoded URL path and query
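The encode/decode idea behind this normalization can be sketched with the Java standard library: percent-encode non-ASCII (UTF-8) bytes, and decode %XX sequences that stand for unreserved ASCII characters. This is a simplified illustration under those assumptions, not the library's exact rules, and `normalize` is a hypothetical name:

```java
import java.nio.charset.StandardCharsets;

public class EscapeSketch {
    // RFC 3986 "unreserved" characters never need percent-encoding.
    static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                || (b >= '0' && b <= '9')
                || b == '-' || b == '.' || b == '_' || b == '~';
    }

    static String normalize(String pathQuery) {
        StringBuilder out = new StringBuilder();
        byte[] bytes = pathQuery.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if (b == '%' && i + 2 < bytes.length) {
                int hi = Character.digit(bytes[i + 1], 16);
                int lo = Character.digit(bytes[i + 2], 16);
                if (hi >= 0 && lo >= 0 && isUnreserved(hi * 16 + lo)) {
                    // decode a %XX that encodes an unreserved character
                    out.append((char) (hi * 16 + lo));
                    i += 2;
                    continue;
                }
            }
            if (b > 0x7F) {
                // percent-encode non-ASCII (UTF-8 continuation/lead) bytes
                out.append(String.format("%%%02X", b));
            } else {
                out.append((char) b);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'é' becomes %C3%A9, while %7E decodes back to the unreserved '~'
        System.out.println(normalize("/caf\u00e9/%7Efile")); // /caf%C3%A9/~file
    }
}
```

Normalizing both the rule and the URL this way lets two spellings of the same path (literal vs. percent-encoded) compare equal during matching.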
-
sortRules
public void sortRules()
Sort and deduplicate robot rules. This method must be called after the robots.txt has been processed and before rule matching. The ordering is implemented in SimpleRobotRules.RobotRule.compareTo(RobotRule) and defined by RFC 9309, section 2.2.2: "The most specific match found MUST be used. The most specific match is the match that has the most octets. Duplicate rules in a group MAY be deduplicated."
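The ordering sortRules() establishes can be sketched over plain prefix strings: duplicates removed, longer (more specific) prefixes first. This is an illustration of the RFC 9309 ordering, not the compareTo implementation itself, and `sortPrefixes` is a hypothetical name:

```java
import java.util.List;

public class SortSketch {
    static List<String> sortPrefixes(List<String> prefixes) {
        return prefixes.stream()
                .distinct() // duplicate rules in a group MAY be deduplicated
                .sorted((a, b) -> {
                    // most octets first; tie-break lexicographically for a
                    // deterministic order
                    int byLen = Integer.compare(b.length(), a.length());
                    return byLen != 0 ? byLen : a.compareTo(b);
                })
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(sortPrefixes(List.of("/a", "/a/b/c", "/a/b", "/a")));
        // [/a/b/c, /a/b, /a]
    }
}
```

With this order in place, matching can simply take the first rule whose prefix matches, since no shorter rule can precede a longer one.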
-
isAllowAll
public boolean isAllowAll()
Is our ruleset set up to allow all access?
Note: This is decided only based on the SimpleRobotRules.RobotRulesMode without inspecting the set of allow/disallow rules.
- Specified by:
  isAllowAll in class BaseRobotRules
- Returns:
- true if all URLs are allowed.
-
isAllowNone
public boolean isAllowNone()
Is our ruleset set up to disallow all access?
Note: This is decided only based on the SimpleRobotRules.RobotRulesMode without inspecting the set of allow/disallow rules.
- Specified by:
  isAllowNone in class BaseRobotRules
- Returns:
- true if no URLs are allowed.
-
hashCode
public int hashCode()
- Overrides:
  hashCode in class BaseRobotRules
-
equals
public boolean equals(Object obj)
- Overrides:
  equals in class BaseRobotRules
-
toString
public String toString()
Description copied from class: BaseRobotRules
Returns a string with the crawl delay as well as a list of sitemaps if they exist (and aren't more than 10).
- Overrides:
  toString in class BaseRobotRules
-
-