Package crawlercommons.robots
Class SimpleRobotRules
- java.lang.Object
  - crawlercommons.robots.BaseRobotRules
    - crawlercommons.robots.SimpleRobotRules

All Implemented Interfaces:
- Serializable
public class SimpleRobotRules extends BaseRobotRules
Result from parsing a single robots.txt file: a set of rules, an optional crawl delay, and an optional sitemap URL. Note that we support Google's extensions (the Allow directive and the '$' and '*' special characters) plus the more widely used Sitemap directive.

See https://en.wikipedia.org/wiki/Robots_exclusion_standard
See https://developers.google.com/search/reference/robots_txt

See Also:
- Serialized Form
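A minimal usage sketch, building a ruleset by hand with the methods documented below; in practice the rules would come from parsing a fetched robots.txt file, and the example.com URLs are illustrative:

    import crawlercommons.robots.SimpleRobotRules;

    public class RobotRulesExample {
        public static void main(String[] args) {
            SimpleRobotRules rules = new SimpleRobotRules();
            rules.addRule("/private/", false); // Disallow: /private/
            rules.addRule("/", true);          // Allow: /

            // Match longer (more specific) rules first, per Google's convention.
            rules.sortRules();

            // isAllowed() takes the full URL; its path is matched against the rules.
            System.out.println(rules.isAllowed("http://example.com/index.html"));     // true
            System.out.println(rules.isAllowed("http://example.com/private/a.html")); // false
        }
    }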
Nested Class Summary

- static class SimpleRobotRules.RobotRule
  Single rule that maps from a path prefix to an allow flag.
- static class SimpleRobotRules.RobotRulesMode
Field Summary

- protected SimpleRobotRules.RobotRulesMode _mode
- protected ArrayList<SimpleRobotRules.RobotRule> _rules

Fields inherited from class crawlercommons.robots.BaseRobotRules:
- UNSET_CRAWL_DELAY
Constructor Summary

- SimpleRobotRules()
- SimpleRobotRules(SimpleRobotRules.RobotRulesMode mode)
Method Summary

- void addRule(String prefix, boolean allow)
- void clearRules()
- boolean equals(Object obj)
- List<SimpleRobotRules.RobotRule> getRobotRules()
- int hashCode()
- boolean isAllowAll()
  Is our ruleset set up to allow all access?
- boolean isAllowed(String url)
- boolean isAllowNone()
  Is our ruleset set up to disallow all access?
- void sortRules()
  In order to match up with Google's convention, we want to match rules from longest to shortest.
- String toString()
  Returns a string with the crawl delay as well as a list of sitemaps, if they exist (and there aren't more than 10).

Methods inherited from class crawlercommons.robots.BaseRobotRules:
- addSitemap, getCrawlDelay, getSitemaps, isDeferVisits, setCrawlDelay, setDeferVisits
Field Detail

_rules
protected ArrayList<SimpleRobotRules.RobotRule> _rules

_mode
protected SimpleRobotRules.RobotRulesMode _mode
Constructor Detail

SimpleRobotRules
public SimpleRobotRules()

SimpleRobotRules
public SimpleRobotRules(SimpleRobotRules.RobotRulesMode mode)
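A hedged sketch of the one-argument constructor. The RobotRulesMode constant names ALLOW_ALL and ALLOW_NONE are assumptions (they are not listed on this page); such blanket rulesets are typically used as fallbacks when no robots.txt is available:

    import crawlercommons.robots.SimpleRobotRules;
    import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

    public class ModeExample {
        public static void main(String[] args) {
            // Blanket policies with no per-path rules (constant names assumed).
            SimpleRobotRules allowAll = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
            SimpleRobotRules allowNone = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);

            System.out.println(allowAll.isAllowAll());   // true: every URL is allowed
            System.out.println(allowNone.isAllowNone()); // true: every URL is disallowed
        }
    }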
Method Detail
clearRules
public void clearRules()

addRule
public void addRule(String prefix, boolean allow)

getRobotRules
public List<SimpleRobotRules.RobotRule> getRobotRules()

isAllowed
public boolean isAllowed(String url)
- Specified by:
  isAllowed in class BaseRobotRules
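A short sketch of checking a URL, assuming that a URL matching no rule defaults to allowed (the example.com URLs are illustrative):

    import crawlercommons.robots.SimpleRobotRules;

    public class IsAllowedExample {
        public static void main(String[] args) {
            SimpleRobotRules rules = new SimpleRobotRules();
            rules.addRule("/tmp/", false); // Disallow: /tmp/
            rules.sortRules();

            // The full URL is passed in; only its path component is matched.
            System.out.println(rules.isAllowed("https://example.com/tmp/cache.html"));  // false
            System.out.println(rules.isAllowed("https://example.com/docs/guide.html")); // true (assumed default: allow)
        }
    }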
sortRules
public void sortRules()
In order to match up with Google's convention, we want to match rules from longest to shortest. So sort the rules.
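A sketch of why ordering matters when Allow and Disallow prefixes overlap, assuming longest-match-wins semantics after sorting (the paths are illustrative):

    import crawlercommons.robots.SimpleRobotRules;

    public class SortRulesExample {
        public static void main(String[] args) {
            SimpleRobotRules rules = new SimpleRobotRules();
            rules.addRule("/shop", false);        // Disallow: /shop
            rules.addRule("/shop/public/", true); // Allow: /shop/public/

            // After sorting, the longer (more specific) Allow rule is tried first.
            rules.sortRules();

            System.out.println(rules.isAllowed("https://example.com/shop/public/item.html")); // true
            System.out.println(rules.isAllowed("https://example.com/shop/cart.html"));        // false
        }
    }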
isAllowAll
public boolean isAllowAll()
Is our ruleset set up to allow all access?
- Specified by:
  isAllowAll in class BaseRobotRules
- Returns:
  true if all URLs are allowed.
isAllowNone
public boolean isAllowNone()
Is our ruleset set up to disallow all access?
- Specified by:
  isAllowNone in class BaseRobotRules
- Returns:
  true if no URLs are allowed.
hashCode
public int hashCode()
- Overrides:
  hashCode in class BaseRobotRules
equals
public boolean equals(Object obj)
- Overrides:
  equals in class BaseRobotRules
toString
public String toString()
Description copied from class: BaseRobotRules
Returns a string with the crawl delay as well as a list of sitemaps, if they exist (and there aren't more than 10).
- Overrides:
  toString in class BaseRobotRules