Class SimpleRobotRulesParser

- java.lang.Object
  - crawlercommons.robots.BaseRobotsParser
    - crawlercommons.robots.SimpleRobotRulesParser

All Implemented Interfaces:
- Serializable
 
public class SimpleRobotRulesParser extends BaseRobotsParser

Robots.txt parser following RFC 9309, supporting the Sitemap and Crawl-delay extensions.

This implementation of BaseRobotsParser retrieves a set of rules for an agent with the given name from the robots.txt file of a given domain. The implementation follows RFC 9309. The following robots.txt extensions are supported:

- the Crawl-delay directive
- the Sitemap directive

The class fulfills two tasks. The first one is the parsing of the robots.txt file, done in parseContent(String, byte[], String, Collection). During the parsing process the parser searches for the provided agent name(s). If the parser finds a matching name, the set of rules for this name is parsed and added to the list of rules returned later. If another group of rules is matched in the robots.txt file, the rules of this group are also added to the returned list of rules. See RFC 9309, section 2.2.1 about the merging of groups of rules.

By default, and following RFC 9309, section 2.2.1, the user-agent name ("product token") is matched literally but case-insensitively over the full name. See setExactUserAgentMatching(boolean) for details of agent name matching and a legacy substring prefix matching mode.

SimpleRobotRulesParser allows passing multiple agent names as a collection. All rule groups matching any of the agent names are followed and merged into one set of rules. The order of the agent names in the collection does not affect the selection of rules.

The parser always parses the entire file to select all matching rule groups, and also to collect all of the sitemap directives. If no rule set matches any of the provided user-agent names, or if an empty collection of agent names is passed, the rule set for the '*' agent is returned. If there is no such rule set inside the robots.txt file, a rule set allowing all resources to be crawled is returned.

The crawl-delay is parsed and added to the rules. Note that if the crawl delay inside the file exceeds a maximum value, the crawling of all resources is prohibited. The default maximum value is defined with DEFAULT_MAX_CRAWL_DELAY=300000L milliseconds. The default value can be changed using the constructor (SimpleRobotRulesParser(long, int)) or via setMaxCrawlDelay(long).

The second task of this class is to generate a set of rules if the fetching of the robots.txt file fails. The failedFetch(int) method returns a predefined set of rules based on the given error code. If the status code indicates a client error (status code 4xx), we can assume that the robots.txt file is not there and crawling of all resources is allowed. If the status code is a different error code (3xx or 5xx), the parser assumes a temporary error and a set of rules prohibiting any crawling is returned.

Note that fetching of the robots.txt file is outside the scope of this class. It must be implemented in the calling code, including the following of "at least five consecutive redirects" as required by RFC 9309, section 2.3.1.2.

- See Also:
  - Serialized Form
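
A minimal usage sketch: the robots.txt bytes are assumed to have been fetched by the caller, and the accessors on the returned rules object (isAllowed, getCrawlDelay, getSitemaps) are assumed to be inherited from BaseRobotRules rather than documented on this page.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsRulesExample {
    public static void main(String[] args) {
        // robots.txt content of https://www.example.com/robots.txt, fetched by the
        // caller (fetching and redirect handling are outside the scope of the parser).
        byte[] content = String.join("\n",
                "User-agent: mybot",
                "Disallow: /private/",
                "Crawl-delay: 5",
                "Sitemap: /sitemap.xml").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        SimpleRobotRules rules = parser.parseContent(
                "https://www.example.com/robots.txt", content, "text/plain",
                Collections.singleton("mybot"));

        // Accessors below are assumed to be inherited from BaseRobotRules.
        System.out.println(rules.isAllowed("https://www.example.com/private/page.html")); // false
        System.out.println(rules.getCrawlDelay()); // crawl delay (assumed to be in milliseconds)
        System.out.println(rules.getSitemaps());   // sitemap URLs, resolved against the robots.txt URL
    }
}
```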
 
Field Summary

Fields:
- static long DEFAULT_MAX_CRAWL_DELAY: Default max Crawl-Delay in milliseconds, see setMaxCrawlDelay(long)
- static int DEFAULT_MAX_WARNINGS: Default max number of warnings logged during parse of any one robots.txt file, see setMaxWarnings(int)
- protected static Pattern USER_AGENT_PRODUCT_TOKEN_MATCHER: Pattern to match valid user-agent product tokens as defined in RFC 9309, section 2.2.1
Constructor Summary

Constructors:
- SimpleRobotRulesParser()
- SimpleRobotRulesParser(long maxCrawlDelay, int maxWarnings)
Method Summary

Methods:
- SimpleRobotRules failedFetch(int httpStatusCode): The fetch of robots.txt failed, so return rules appropriate for the given HTTP status code.
- long getMaxCrawlDelay(): Get the configured max crawl delay.
- int getMaxWarnings(): Get the max number of logged warnings per robots.txt.
- int getNumWarnings(): Get the number of warnings due to invalid rules/lines in the latest processed robots.txt file (see parseContent(String, byte[], String, String)).
- boolean isExactUserAgentMatching()
- protected static boolean isValidUserAgentToObey(String userAgent): Validate a user-agent product token as defined in RFC 9309, section 2.2.1.
- static void main(String[] args)
- SimpleRobotRules parseContent(String url, byte[] content, String contentType, String robotNames): Deprecated.
- SimpleRobotRules parseContent(String url, byte[] content, String contentType, Collection<String> robotNames): Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent.
- void setExactUserAgentMatching(boolean exactMatching): Set how the user-agent names in the robots.txt (User-agent: lines) are matched with the provided robot names.
- void setMaxCrawlDelay(long maxCrawlDelay): Set the max value in milliseconds accepted for the Crawl-Delay directive.
- void setMaxWarnings(int maxWarnings): Set the max number of warnings about parse errors logged per robots.txt.
- protected String[] splitRobotNames(String robotNames): Split a string listing user-agent / robot names into tokens.
- protected boolean userAgentProductTokenPartialMatch(String agentName, Collection<String> targetTokens)

Field Detail

USER_AGENT_PRODUCT_TOKEN_MATCHER

protected static final Pattern USER_AGENT_PRODUCT_TOKEN_MATCHER

Pattern to match valid user-agent product tokens as defined in RFC 9309, section 2.2.1.

DEFAULT_MAX_WARNINGS

public static final int DEFAULT_MAX_WARNINGS

Default max number of warnings logged during parse of any one robots.txt file, see setMaxWarnings(int).

- See Also:
  - Constant Field Values
 
DEFAULT_MAX_CRAWL_DELAY

public static final long DEFAULT_MAX_CRAWL_DELAY

Default max Crawl-Delay in milliseconds, see setMaxCrawlDelay(long).

- See Also:
  - Constant Field Values
 
 
Constructor Detail

SimpleRobotRulesParser

public SimpleRobotRulesParser()

SimpleRobotRulesParser

public SimpleRobotRulesParser(long maxCrawlDelay, int maxWarnings)

- Parameters:
  - maxCrawlDelay - see setMaxCrawlDelay(long)
  - maxWarnings - see setMaxWarnings(int)
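
A brief construction sketch; presumably this is equivalent to calling setMaxCrawlDelay(long) and setMaxWarnings(int) on a default-constructed parser.

```java
import crawlercommons.robots.SimpleRobotRulesParser;

public class ParserConstruction {
    public static void main(String[] args) {
        // Configure a max crawl delay of 30 seconds (in milliseconds) and
        // at most 5 logged warnings per robots.txt file.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser(30_000L, 5);
        System.out.println(parser.getMaxCrawlDelay()); // 30000
        System.out.println(parser.getMaxWarnings());   // 5
    }
}
```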
 
 
Method Detail

isValidUserAgentToObey

protected static boolean isValidUserAgentToObey(String userAgent)

Validate a user-agent product token as defined in RFC 9309, section 2.2.1.

- Parameters:
  - userAgent - user-agent token to verify
- Returns:
  - true if the product token is valid
 
failedFetch

public SimpleRobotRules failedFetch(int httpStatusCode)

The fetch of robots.txt failed, so return rules appropriate for the given HTTP status code. Rules are set for response status codes following RFC 9309, section 2.3.1 (see also the sketch after this block):

- Success (HTTP 200-299): throws IllegalStateException because the response content needs to be parsed
- "Unavailable" status (HTTP 400-499): allow all
- "Unreachable" status (HTTP 500-599): disallow all
- every other HTTP status code is treated as "allow all", but further visits on the server are deferred (see BaseRobotRules.setDeferVisits(boolean))

- Specified by:
  - failedFetch in class BaseRobotsParser
- Parameters:
  - httpStatusCode - a failure status code (NOT 2xx)
- Returns:
  - robot rules
 
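A sketch of how calling code might use this fallback; the isAllowAll() and isAllowNone() accessors are assumed to be inherited from BaseRobotRules.

```java
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class FailedFetchExample {
    public static void main(String[] args) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        SimpleRobotRules notFound = parser.failedFetch(404);    // "unavailable": allow all
        SimpleRobotRules serverError = parser.failedFetch(503); // "unreachable": disallow all

        // isAllowAll()/isAllowNone() are assumed from BaseRobotRules.
        System.out.println(notFound.isAllowAll());     // expected: true
        System.out.println(serverError.isAllowNone()); // expected: true

        // A 2xx status code is a programming error here: the response body should be
        // passed to parseContent(...) instead.
        // parser.failedFetch(200); // throws IllegalStateException
    }
}
```
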
parseContent

@Deprecated
public SimpleRobotRules parseContent(String url, byte[] content, String contentType, String robotNames)

Deprecated.

Description copied from class: BaseRobotsParser

Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent. Note that multiple agent names may be provided as comma-separated values. How agent names are matched against user-agent lines in the robots.txt depends on the implementing class. Also note that names are lower-cased before comparison, and that any robot name you pass shouldn't contain commas or spaces; if the name has spaces, it will be split into multiple names, each of which will be compared against agent names in the robots.txt file. An agent name is considered a match if it's a prefix match on the provided robot name. For example, if you pass in "Mozilla Crawlerbot-super 1.0", this would match "crawlerbot" as the agent name, because of splitting on spaces, lower-casing, and the prefix match rule.

- Specified by:
  - parseContent in class BaseRobotsParser
- Parameters:
  - url - URL that robots.txt content was fetched from. A complete and valid URL (e.g., https://example.com/robots.txt) is expected. Used to resolve relative sitemap URLs and for logging/reporting purposes.
  - content - raw bytes from the site's robots.txt file
  - contentType - HTTP response header (MIME type)
  - robotNames - name(s) of crawler, to be used when processing file contents (just the name portion, without version or other details)
- Returns:
  - robot rules
 
splitRobotNames

protected String[] splitRobotNames(String robotNames)

Split a string listing user-agent / robot names into tokens. Splitting is done at commas and/or whitespace, and the tokens are converted to lower-case.

- Parameters:
  - robotNames - robot / user-agent string
- Returns:
  - array of user-agent / robot tokens
 
parseContent

public SimpleRobotRules parseContent(String url, byte[] content, String contentType, Collection<String> robotNames)

Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent.

Multiple agent names can be passed as a collection. See setExactUserAgentMatching(boolean) for details on how agent names are matched. If multiple agent names are passed, all matching rule groups are followed and merged into one set of rules. The order of the agent names in the collection does not affect the selection of rules.

If none of the provided agent names is matched, rules addressed to the wildcard user-agent (*) are selected.

- Specified by:
  - parseContent in class BaseRobotsParser
- Parameters:
  - url - URL that robots.txt content was fetched from. A complete and valid URL (e.g., https://example.com/robots.txt) is expected. Used to resolve relative sitemap URLs and for logging/reporting purposes.
  - content - raw bytes from the site's robots.txt file
  - contentType - content type (MIME type) from HTTP response header
  - robotNames - crawler (user-agent) name(s), used to select rules from the robots.txt file by matching the names against the user-agent lines in the robots.txt file. Robot names should be single-token names, without version or other parts. Names must be lower-case, as the user-agent line is also converted to lower-case for matching. If the collection is empty, the rules for the wildcard user-agent (*) are selected. The wildcard user-agent name should not be contained in robotNames.
- Returns:
  - robot rules
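
A sketch of passing multiple agent names, assuming the default exact matching mode and an isAllowed accessor inherited from BaseRobotRules.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class MultiAgentExample {
    public static void main(String[] args) {
        byte[] content = String.join("\n",
                "User-agent: examplebot",      // matched by the first robot name
                "Disallow: /api/",
                "",
                "User-agent: examplebot-news", // matched by the second robot name
                "Disallow: /drafts/",
                "",
                "User-agent: *",               // ignored: more specific groups matched
                "Disallow: /").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // Both matching groups are merged into one rule set.
        SimpleRobotRules rules = parser.parseContent(
                "https://www.example.com/robots.txt", content, "text/plain",
                Arrays.asList("examplebot", "examplebot-news"));

        // isAllowed is assumed from BaseRobotRules.
        System.out.println(rules.isAllowed("https://www.example.com/api/v1"));     // false
        System.out.println(rules.isAllowed("https://www.example.com/drafts/x"));   // false
        System.out.println(rules.isAllowed("https://www.example.com/index.html")); // true
    }
}
```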
 
userAgentProductTokenPartialMatch

protected boolean userAgentProductTokenPartialMatch(String agentName, Collection<String> targetTokens)

getNumWarnings

public int getNumWarnings()

Get the number of warnings due to invalid rules/lines in the latest processed robots.txt file (see parseContent(String, byte[], String, String)). Note: an incorrect value may be returned if the robots.txt was processed in a thread different from the current one.

- Returns:
  - number of warnings
 
getMaxWarnings

public int getMaxWarnings()

Get the max number of logged warnings per robots.txt.

setMaxWarnings

public void setMaxWarnings(int maxWarnings)

Set the max number of warnings about parse errors logged per robots.txt.

getMaxCrawlDelay

public long getMaxCrawlDelay()

Get the configured max crawl delay.

- Returns:
  - the configured max crawl delay, see setMaxCrawlDelay(long)
 
setMaxCrawlDelay

public void setMaxCrawlDelay(long maxCrawlDelay)

Set the max value in milliseconds accepted for the Crawl-Delay directive. If the value in the robots.txt is greater than the max value, all pages are skipped, to avoid overly long Crawl-Delays blocking fetch queues and slowing down the crawl. Note: the value is in milliseconds because some sites use floating point numbers to define the delay.
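
A short configuration sketch; note that the limit is given in milliseconds, while Crawl-delay values in robots.txt files are commonly given in seconds.

```java
import crawlercommons.robots.SimpleRobotRulesParser;

public class CrawlDelayLimit {
    public static void main(String[] args) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // Accept Crawl-delay values of at most 30 seconds (the unit here is milliseconds).
        parser.setMaxCrawlDelay(30_000L);
        System.out.println(parser.getMaxCrawlDelay()); // 30000

        // A robots.txt requesting a longer delay (e.g. "Crawl-delay: 60") would then
        // yield a rule set that disallows crawling for the matched agent.
    }
}
```
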
setExactUserAgentMatching

public void setExactUserAgentMatching(boolean exactMatching)

Set how the user-agent names in the robots.txt (User-agent: lines) are matched with the provided robot names:

- (with exact matching) follow the Robots Exclusion Protocol RFC 9309 and match the user agent literally but case-insensitively over the full string length:

  Crawlers set their own name, which is called a product token, to find relevant groups. The product token MUST contain only upper and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-"). [...] Crawlers MUST use case-insensitive matching to find the group that matches the product token and then obey the rules of the group.

- (without exact matching) split the user-agent and robot names at whitespace into words and perform a prefix match: one of the user-agent words must be a prefix of one of the robot words, e.g. the robot name WebCrawler/3.0 matches the robots.txt directive User-agent: webcrawler. This prefix matching on words allows crawler developers to lazily reuse the HTTP User-Agent string for the robots.txt parser. It does not cover the case where the full HTTP User-Agent string is used in the robots.txt (see the sketch below).

- Parameters:
  - exactMatching - if true, configure exact user-agent name matching. If false, disable exact matching and do prefix matching on user-agent words.
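
A sketch contrasting the two matching modes, assuming an isAllowed accessor inherited from BaseRobotRules.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class AgentMatchingModes {
    public static void main(String[] args) {
        byte[] content = "User-agent: webcrawler\nDisallow: /private/\n"
                .getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // Exact matching (default, RFC 9309): "webcrawler/3.0" is not the product token
        // "webcrawler", so no group matches and an allow-all rule set is returned
        // (this robots.txt has no wildcard group).
        parser.setExactUserAgentMatching(true);
        SimpleRobotRules exact = parser.parseContent("https://example.org/robots.txt",
                content, "text/plain", Collections.singleton("webcrawler/3.0"));

        // Legacy prefix matching on words: the user-agent word "webcrawler" is a prefix
        // of the robot name "webcrawler/3.0", so the group is selected.
        parser.setExactUserAgentMatching(false);
        SimpleRobotRules prefix = parser.parseContent("https://example.org/robots.txt",
                content, "text/plain", Collections.singleton("webcrawler/3.0"));

        // isAllowed is assumed from BaseRobotRules.
        System.out.println(exact.isAllowed("https://example.org/private/a"));  // true
        System.out.println(prefix.isAllowed("https://example.org/private/a")); // false
    }
}
```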
 
isExactUserAgentMatching

public boolean isExactUserAgentMatching()

- Returns:
  - whether exact user-agent matching is configured, see setExactUserAgentMatching(boolean)
 
main

public static void main(String[] args) throws MalformedURLException, IOException

- Throws:
  - MalformedURLException
  - IOException
 
 