public class SimpleRobotRulesParser extends BaseRobotsParser
This implementation of BaseRobotsParser retrieves a set of rules for an agent with the given name from the robots.txt file of a given domain.
The class fulfills two tasks. The first is parsing the robots.txt file, which is done in parseContent(String, byte[], String, String). During parsing, the parser searches for the provided agent name(s). If it finds a matching name, the set of rules for that name is parsed and returned as the result. Note that if more than one agent name is given to the parser, it parses the rules for the first matching agent name inside the file and skips all following user agent groups; it does not matter whether any of the other given agent names would match additional rules later in the file. Thus, if more than one agent name is given to the parser, the result can be influenced by the order of rule sets inside the robots.txt file.

Note that the parser always parses the entire file, even after a matching agent name group has been found, because it needs to collect all of the sitemap directives.
If no rule set matches any of the provided agent names, the rule set for the '*' agent is returned. If there is no such rule set inside the robots.txt file, a rule set allowing all resources to be crawled is returned.
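As an illustration, the following sketch (with a hypothetical agent name and example.com URLs, assuming the crawler-commons classes are on the classpath) shows how the first matching agent group wins over a later, more restrictive '*' group:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class AgentMatchingExample {
    public static void main(String[] args) {
        // Two groups: a specific one for "mybot" and a catch-all '*' group.
        String robotsTxt =
                "User-agent: mybot\n" +
                "Disallow: /private/\n" +
                "\n" +
                "User-agent: *\n" +
                "Disallow: /\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mybot");

        // The "mybot" group matched first, so only /private/ is disallowed;
        // the more restrictive '*' group is skipped.
        System.out.println(rules.isAllowed("http://www.example.com/index.html")); // expected: true
        System.out.println(rules.isAllowed("http://www.example.com/private/x"));  // expected: false
    }
}
```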
The crawl delay is parsed and added to the rules. Note that if the crawl delay inside the file exceeds a maximum value, crawling of all resources is prohibited. The maximum value is defined by MAX_CRAWL_DELAY.
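A minimal sketch of reading the parsed delay (the agent name, URLs, and delay value are made up; the exact unit reported by getCrawlDelay() is not specified here, so treat it as an assumption to verify):

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class CrawlDelayExample {
    public static void main(String[] args) {
        String robotsTxt =
                "User-agent: mybot\n" +
                "Disallow: /tmp/\n" +
                "Crawl-delay: 10\n";

        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "http://www.example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mybot");

        // The parsed delay is exposed on the rules object.
        System.out.println("crawl delay: " + rules.getCrawlDelay());

        // Had the delay exceeded MAX_CRAWL_DELAY, the parser would instead
        // return rules that prohibit all crawling, i.e. isAllowNone() == true.
        System.out.println("allow none: " + rules.isAllowNone());
    }
}
```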
The second task of this class is to generate a set of rules if fetching the robots.txt file fails. The failedFetch(int) method returns a predefined set of rules based on the given error code. If the status code indicates a client error (4xx), the parser assumes that the robots.txt file does not exist and that crawling of all resources is allowed. For any other error code (3xx or 5xx), the parser assumes a temporary error and returns a set of rules prohibiting any crawling.
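A brief sketch of this behavior (the status codes 404 and 503 are chosen only for illustration):

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class FailedFetchExample {
    public static void main(String[] args) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // 404 (client error): robots.txt is assumed to be missing, everything is crawlable.
        BaseRobotRules notFound = parser.failedFetch(404);
        System.out.println(notFound.isAllowAll());     // expected: true

        // 503 (server error): treated as a temporary problem, nothing is crawlable.
        BaseRobotRules unavailable = parser.failedFetch(503);
        System.out.println(unavailable.isAllowNone()); // expected: true
    }
}
```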
| Constructor and Description |
| --- |
| SimpleRobotRulesParser() |
| Modifier and Type | Method and Description |
| --- | --- |
| BaseRobotRules | failedFetch(int httpStatusCode): The fetch of robots.txt failed, so return rules appropriate given the HTTP status code. |
| int | getNumWarnings() |
| BaseRobotRules | parseContent(String url, byte[] content, String contentType, String robotNames): Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent. |
public BaseRobotRules failedFetch(int httpStatusCode)

Specified by: failedFetch in class BaseRobotsParser

Parameters:
httpStatusCode - a failure status code (NOT 2xx)

public BaseRobotRules parseContent(String url, byte[] content, String contentType, String robotNames)

Specified by: parseContent in class BaseRobotsParser

Parameters:
url - URL that content was fetched from (for reporting purposes)
content - raw bytes from the site's robots.txt file
contentType - HTTP response header (mime-type)
robotNames - name(s) of crawler, to be used when processing file contents (just the name portion, w/o version or other details)

public int getNumWarnings()
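The following sketch ties parseContent and getNumWarnings together. The robots.txt body and agent name are hypothetical, and the assumption that an unrecognized directive is counted as a warning should be verified against the parser's behavior:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class WarningCountExample {
    public static void main(String[] args) {
        // The second line is not a directive the parser is expected to recognize.
        String robotsTxt =
                "User-agent: mybot\n" +
                "Not-a-directive: whatever\n" +
                "Disallow: /cgi-bin/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mybot");

        // Problems encountered while parsing (e.g. the unknown directive above)
        // are expected to show up in the warning count.
        System.out.println("warnings: " + parser.getNumWarnings());
        System.out.println("allowed: " + rules.isAllowed("http://www.example.com/cgi-bin/test"));
    }
}
```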
Copyright © 2009–2016 Crawler-Commons. All rights reserved.