public class SimpleRobotRulesParser extends BaseRobotsParser
This implementation of BaseRobotsParser retrieves a set of
rules for an agent with the given name from the
robots.txt file of a given domain.
The class fulfills two tasks. The first is parsing the
robots.txt file, done in
parseContent(String, byte[], String, String). During parsing,
the parser searches for the provided agent name(s). If it
finds a matching name, the set of rules for that name is parsed and returned
as the result. Note that if more than one agent name is given to the
parser, it uses the rules of the first agent group in the file that matches
any of the names and skips all following user-agent groups, regardless of
whether the other given names would match later groups. Thus,
if more than one agent name is given to the parser, the result can be
influenced by the order of the rule sets inside the robots.txt file.
Note that the parser always parses the entire file, even if a matching agent name group has been found, as it needs to collect all of the sitemap directives.
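For illustration, here is a minimal sketch of invoking parseContent on bytes that have already been fetched. The URL, agent names, and file content are invented for the example, and multiple agent names are assumed to be passed as a single comma-separated string:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsParseSketch {
    public static void main(String[] args) {
        // Hypothetical robots.txt content; normally these bytes come from an HTTP fetch.
        byte[] content = ("User-agent: mybot\n"
                + "Disallow: /private/\n"
                + "\n"
                + "Sitemap: https://www.example.com/sitemap.xml\n").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // Several agent names may be supplied; the first group in the file that
        // matches any of them provides the rules.
        BaseRobotRules rules = parser.parseContent("https://www.example.com/robots.txt",
                content, "text/plain", "mybot,otherbot");

        System.out.println(rules.isAllowed("https://www.example.com/private/page.html")); // false
        System.out.println(rules.getSitemaps()); // sitemap URLs are collected from the whole file
    }
}
```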
If no rule set matches any of the provided agent names, the rule set for the
'*' agent is returned. If there is no such rule set inside the
robots.txt file, a rule set allowing all resources to be crawled
is returned.
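To illustrate the fallback, a short sketch reusing the imports from the previous example (agent name and file content are invented):

```java
// No group matches "mybot", so the rules of the '*' group are returned.
byte[] content = ("User-agent: googlebot\n"
        + "Disallow: /\n"
        + "\n"
        + "User-agent: *\n"
        + "Disallow: /private/\n").getBytes(StandardCharsets.UTF_8);
BaseRobotRules rules = new SimpleRobotRulesParser()
        .parseContent("https://www.example.com/robots.txt", content, "text/plain", "mybot");
System.out.println(rules.isAllowed("https://www.example.com/private/page")); // false
System.out.println(rules.isAllowed("https://www.example.com/public/page"));  // true
```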
The crawl delay is parsed and added to the rules. Note that if the crawl
delay inside the file exceeds a maximum value, the crawling of all resources
is prohibited. The maximum value is defined by the constant MAX_CRAWL_DELAY.
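As a sketch of how the parsed delay surfaces on the returned rules (the directive value is invented, and the returned delay is assumed to be expressed in milliseconds):

```java
byte[] content = ("User-agent: mybot\n"
        + "Crawl-delay: 5\n").getBytes(StandardCharsets.UTF_8);
BaseRobotRules rules = new SimpleRobotRulesParser()
        .parseContent("https://www.example.com/robots.txt", content, "text/plain", "mybot");
long delay = rules.getCrawlDelay(); // 5 seconds from the file, assumed to be exposed as 5000 ms
// Had the file declared a delay above MAX_CRAWL_DELAY, the returned rules
// would prohibit crawling of all resources.
```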
The second task of this class is to generate a set of rules if the fetching
of the robots.txt file fails. The failedFetch(int)
method returns a predefined set of rules based on the given error code. If
the status code indicates a client error (4xx), we can
assume that the robots.txt file does not exist, and crawling of all
resources is allowed. For any other error code (3xx
or 5xx), the parser assumes a temporary error and returns a set of rules
prohibiting any crawling.
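A short sketch of both failure cases (status codes chosen for illustration, reusing the imports from the first example):

```java
SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

// 4xx: robots.txt is assumed to be absent, so everything may be crawled.
BaseRobotRules afterNotFound = parser.failedFetch(404);
System.out.println(afterNotFound.isAllowAll());  // true

// 5xx (or 3xx): treated as a temporary error, so nothing may be crawled.
BaseRobotRules afterServerError = parser.failedFetch(503);
System.out.println(afterServerError.isAllowNone()); // true
```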
| Constructor and Description |
|---|
| SimpleRobotRulesParser() |
| Modifier and Type | Method and Description |
|---|---|
| BaseRobotRules | failedFetch(int httpStatusCode) The fetch of robots.txt failed, so return rules appropriate given the HTTP status code. |
| int | getNumWarnings() |
| BaseRobotRules | parseContent(String url, byte[] content, String contentType, String robotNames) Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent. |
public BaseRobotRules failedFetch(int httpStatusCode)

Specified by: failedFetch in class BaseRobotsParser
Parameters: httpStatusCode - a failure status code (NOT 2xx)

public BaseRobotRules parseContent(String url, byte[] content, String contentType, String robotNames)

Specified by: parseContent in class BaseRobotsParser
Parameters:
url - URL that content was fetched from (for reporting purposes)
content - raw bytes from the site's robots.txt file
contentType - HTTP response header (mime-type)
robotNames - name(s) of crawler, to be used when processing file contents (just the name portion, w/o version or other details)

public int getNumWarnings()
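getNumWarnings() exposes how many problems the parser flagged while processing a file. A minimal sketch, reusing the imports from the first example; the misspelled directive is invented and is assumed to be counted as a warning about an unknown line:

```java
byte[] content = ("User-agent: *\n"
        + "Dissallow: /tmp/\n").getBytes(StandardCharsets.UTF_8);
SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
parser.parseContent("https://www.example.com/robots.txt", content, "text/plain", "mybot");
System.out.println(parser.getNumWarnings()); // expected to be > 0 for the unknown directive
```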
Copyright © 2009–2016 Crawler-Commons. All rights reserved.