public class SimpleRobotRulesParser extends BaseRobotsParser

This implementation of BaseRobotsParser retrieves a set of rules for an agent with the given name from the robots.txt file of a given domain.
The class fulfills two tasks. The first is parsing the robots.txt file, done in parseContent(String, byte[], String, String). During parsing, the parser searches for the provided agent name(s). If it finds a matching name, the set of rules for that name is parsed and returned as the result. Note that if more than one agent name is given to the parser, it parses the rules for the first matching agent name inside the file and skips all following user-agent groups, regardless of whether any of the other given agent names would match additional rules inside the file. Thus, if more than one agent name is given, the result can be influenced by the order of rule sets inside the robots.txt file. The parser nevertheless always reads the entire file, even after a matching agent-name group has been found, because it needs to collect all of the sitemap directives.
If no rule set matches any of the provided agent names, the rule set for the '*' agent is returned. If there is no such rule set inside the robots.txt file, a rule set allowing all resources to be crawled is returned.
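For illustration, the following is a minimal sketch of how such a lookup could look in client code; the agent name "mybot", the URL and the robots.txt content are made up for this example, and the path checks use the inherited BaseRobotRules.isAllowed(String) accessor:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsParseExample {
    public static void main(String[] args) {
        // A small robots.txt with a group for "mybot" and a catch-all group.
        String robotsTxt = "User-agent: mybot\n"
                + "Disallow: /private/\n"
                + "\n"
                + "User-agent: *\n"
                + "Disallow: /\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        SimpleRobotRules rules = parser.parseContent(
                "https://example.com/robots.txt",           // URL the content was fetched from
                robotsTxt.getBytes(StandardCharsets.UTF_8), // raw robots.txt bytes
                "text/plain",                               // content type of the response
                "mybot");                                   // agent name(s) to match

        // The "mybot" group matches, so only /private/ is disallowed.
        System.out.println(rules.isAllowed("https://example.com/index.html")); // expected: true
        System.out.println(rules.isAllowed("https://example.com/private/x"));  // expected: false
    }
}
```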
The crawl-delay is parsed and added to the rules. Note that if the crawl delay inside the file exceeds a maximum value, the crawling of all resources is prohibited. The default maximum value is DEFAULT_MAX_CRAWL_DELAY = 300000 milliseconds. It can be changed using the constructor SimpleRobotRulesParser(long, int) or via setMaxCrawlDelay(long).
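As a sketch of how this limit could play out in practice (the agent name and file content below are made up; the checks assume the inherited isAllowNone() and getCrawlDelay() accessors of the returned rules, and that the parser stores the delay in milliseconds):

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class CrawlDelayExample {
    public static void main(String[] args) {
        // robots.txt asking for a 600-second delay, which exceeds the
        // default maximum of 300000 ms (300 s).
        String robotsTxt = "User-agent: *\n"
                + "Crawl-delay: 600\n"
                + "Disallow: /private/\n";
        byte[] content = robotsTxt.getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        SimpleRobotRules rules = parser.parseContent(
                "https://example.com/robots.txt", content, "text/plain", "mybot");
        // The delay exceeds the maximum, so the rules prohibit all crawling.
        System.out.println(rules.isAllowNone()); // expected: true

        // Raise the accepted maximum to 900 s; the same file now yields usable rules.
        parser.setMaxCrawlDelay(900_000L);
        rules = parser.parseContent(
                "https://example.com/robots.txt", content, "text/plain", "mybot");
        System.out.println(rules.getCrawlDelay()); // the parsed delay, in milliseconds
    }
}
```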
The second task of this class is to generate a set of rules when the fetching of the robots.txt file fails. The failedFetch(int) method returns a predefined set of rules based on the given error code. If the status code indicates a client error (4xx), we can assume that the robots.txt file does not exist and crawling of all resources is allowed. For any other error code (3xx or 5xx), the parser assumes a temporary error and returns a set of rules prohibiting any crawling.
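A short hypothetical sketch of how failedFetch(int) could be used; the status codes are examples, and the isAllowAll()/isAllowNone() checks on the returned rules follow the behavior described above:

```java
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class FailedFetchExample {
    public static void main(String[] args) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // 404: robots.txt does not exist, so everything may be crawled.
        SimpleRobotRules notFound = parser.failedFetch(404);
        System.out.println(notFound.isAllowAll());     // expected: true

        // 503: treated as a temporary error, so nothing may be crawled.
        SimpleRobotRules serverError = parser.failedFetch(503);
        System.out.println(serverError.isAllowNone()); // expected: true
    }
}
```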
Modifier and Type | Field and Description
---|---
static long | DEFAULT_MAX_CRAWL_DELAY: Default max Crawl-Delay in milliseconds, see setMaxCrawlDelay(long)
static int | DEFAULT_MAX_WARNINGS: Default max number of warnings logged during parse of any one robots.txt file, see setMaxWarnings(int)
Constructor and Description
---
SimpleRobotRulesParser()
SimpleRobotRulesParser(long maxCrawlDelay, int maxWarnings)
Modifier and Type | Method and Description
---|---
SimpleRobotRules | failedFetch(int httpStatusCode): The fetch of robots.txt failed, so return rules appropriate given the HTTP status code.
long | getMaxCrawlDelay(): Get configured max crawl delay.
int | getMaxWarnings(): Get max number of logged warnings per robots.txt.
int | getNumWarnings(): Get the number of warnings due to invalid rules/lines in the latest processed robots.txt file (see parseContent(String, byte[], String, String)).
static void | main(String[] args)
SimpleRobotRules | parseContent(String url, byte[] content, String contentType, String robotNames): Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent.
void | setMaxCrawlDelay(long maxCrawlDelay): Set the max value in milliseconds accepted for the Crawl-Delay directive.
void | setMaxWarnings(int maxWarnings): Set the max number of warnings about parse errors logged per robots.txt.
public static final int DEFAULT_MAX_WARNINGS

See Also:
setMaxWarnings(int)

public static final long DEFAULT_MAX_CRAWL_DELAY

See Also:
setMaxCrawlDelay(long)
public SimpleRobotRulesParser()

public SimpleRobotRulesParser(long maxCrawlDelay, int maxWarnings)

Parameters:
maxCrawlDelay - see setMaxCrawlDelay(long)
maxWarnings - see setMaxWarnings(int)
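For illustration, a brief sketch of constructing a parser with custom limits; the values 90000 ms and 5 warnings are arbitrary examples:

```java
import crawlercommons.robots.SimpleRobotRulesParser;

public class ConstructorExample {
    public static void main(String[] args) {
        // Accept Crawl-Delay values up to 90 s and log at most 5 parse warnings
        // per robots.txt file (both limits are arbitrary example values).
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser(90_000L, 5);

        System.out.println(parser.getMaxCrawlDelay()); // 90000
        System.out.println(parser.getMaxWarnings());   // 5
    }
}
```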
public SimpleRobotRules failedFetch(int httpStatusCode)

Specified by:
failedFetch in class BaseRobotsParser

Parameters:
httpStatusCode - a failure status code (NOT 2xx)

public SimpleRobotRules parseContent(String url, byte[] content, String contentType, String robotNames)
Specified by:
parseContent in class BaseRobotsParser

Parameters:
url - URL that robots.txt content was fetched from. A complete and valid URL (e.g., https://example.com/robots.txt) is expected. Used to resolve relative sitemap URLs and for logging/reporting purposes.
content - raw bytes from the site's robots.txt file
contentType - HTTP response header (mime-type)
robotNames - name(s) of crawler, to be used when processing file contents (just the name portion, w/o version or other details)
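To illustrate how these parameters fit together, here is a hedged sketch that fetches a robots.txt with the standard java.net.http.HttpClient (Java 11+) and hands the response to parseContent; the target URL and the agent name "mybot" are placeholders, and the sitemap accessor is inherited from BaseRobotRules:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class FetchAndParseExample {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "https://example.com/robots.txt"; // hypothetical target
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(robotsUrl)).GET().build();
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        SimpleRobotRules rules;
        if (response.statusCode() == 200) {
            String contentType =
                    response.headers().firstValue("Content-Type").orElse("text/plain");
            rules = parser.parseContent(robotsUrl, response.body(), contentType, "mybot");
        } else {
            // Non-2xx status: fall back to the predefined rules for failed fetches.
            rules = parser.failedFetch(response.statusCode());
        }

        System.out.println(rules.getSitemaps()); // sitemap URLs collected during parsing
    }
}
```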
public int getNumWarnings()

Get the number of warnings due to invalid rules/lines in the latest processed robots.txt file (see parseContent(String, byte[], String, String)).
Note: an incorrect value may be returned if the robots.txt was processed in a different thread than the current one.

public int getMaxWarnings()
public void setMaxWarnings(int maxWarnings)
public long getMaxCrawlDelay()

See Also:
setMaxCrawlDelay(long)
public void setMaxCrawlDelay(long maxCrawlDelay)
public static void main(String[] args) throws MalformedURLException, IOException

Throws:
MalformedURLException
IOException