Class SimpleRobotRulesParser
- java.lang.Object
-
- crawlercommons.robots.BaseRobotsParser
-
- crawlercommons.robots.SimpleRobotRulesParser
-
- All Implemented Interfaces:
Serializable
public class SimpleRobotRulesParser extends BaseRobotsParser
This implementation of BaseRobotsParser retrieves a set of rules for an agent with the given name from the robots.txt file of a given domain.

The class fulfills two tasks. The first is parsing the robots.txt file, done in parseContent(String, byte[], String, String). During parsing, the parser searches for the provided agent name(s). If it finds a matching name, the set of rules for this name is parsed and returned as the result. Note that if more than one agent name is given to the parser, it parses the rules for the first matching agent name inside the file and skips all following user-agent groups, regardless of whether any of the other given agent names would match additional rules inside the file. Thus, if more than one agent name is given to the parser, the result can be influenced by the order of rule sets inside the robots.txt file. Note that the parser always parses the entire file, even if a matching agent-name group has been found, as it needs to collect all of the sitemap directives.

If no rule set matches any of the provided agent names, the rule set for the '*' agent is returned. If there is no such rule set inside the robots.txt file, a rule set allowing all resources to be crawled is returned.

The crawl-delay is parsed and added to the rules. Note that if the crawl delay inside the file exceeds a maximum value, the crawling of all resources is prohibited. The default maximum value is defined by DEFAULT_MAX_CRAWL_DELAY = 300000 milliseconds. The default can be changed using the constructor SimpleRobotRulesParser(long, int) or via setMaxCrawlDelay(long).

The second task of this class is to generate a set of rules if the fetching of the robots.txt file fails. The failedFetch(int) method returns a predefined set of rules based on the given error code. If the status code indicates a client error (4xx), we can assume that the robots.txt file is not there and crawling of all resources is allowed. For any other error code (3xx or 5xx), the parser assumes a temporary error and returns a set of rules prohibiting any crawling.
- See Also:
- Serialized Form
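The failure handling described above can be sketched in isolation. This is an illustrative, simplified model of the documented failedFetch(int) behavior; the class and method names below are hypothetical and not part of crawler-commons.

```java
// Hypothetical sketch, not library code: models the documented
// failedFetch(int) decision for a failed robots.txt fetch.
public class RobotsFetchFailureSketch {

    /**
     * Returns true when a failed robots.txt fetch should be treated as
     * "allow everything": per the description above, a 4xx client error
     * suggests the file simply does not exist.
     */
    public static boolean allowAllOnFailure(int httpStatusCode) {
        return httpStatusCode >= 400 && httpStatusCode < 500;
    }

    /**
     * Any other failure (e.g. 3xx, 5xx) is treated as temporary:
     * crawling is prohibited until robots.txt can be fetched.
     */
    public static boolean disallowAllOnFailure(int httpStatusCode) {
        return !allowAllOnFailure(httpStatusCode);
    }
}
```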
-
-
Field Summary
Fields
Modifier and Type    Field    Description
static long
DEFAULT_MAX_CRAWL_DELAY
Default max Crawl-Delay in milliseconds, see setMaxCrawlDelay(long)
static int
DEFAULT_MAX_WARNINGS
Default max number of warnings logged during parse of any one robots.txt file, see setMaxWarnings(int)
-
Constructor Summary
Constructors
Constructor    Description
SimpleRobotRulesParser()
SimpleRobotRulesParser(long maxCrawlDelay, int maxWarnings)
-
Method Summary
All Methods  Static Methods  Instance Methods  Concrete Methods
Modifier and Type    Method    Description
SimpleRobotRules
failedFetch(int httpStatusCode)
The fetch of robots.txt failed, so return rules appropriate given the HTTP status code.
long
getMaxCrawlDelay()
Get the configured max crawl delay.
int
getMaxWarnings()
Get the max number of logged warnings per robots.txt file.
int
getNumWarnings()
Get the number of warnings due to invalid rules/lines in the latest processed robots.txt file, see parseContent(String, byte[], String, String).
static void
main(String[] args)
SimpleRobotRules
parseContent(String url, byte[] content, String contentType, String robotNames)
Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent.
void
setMaxCrawlDelay(long maxCrawlDelay)
Set the max value in milliseconds accepted for the Crawl-Delay directive.
void
setMaxWarnings(int maxWarnings)
Set the max number of warnings about parse errors logged per robots.txt file.
-
-
-
Field Detail
-
DEFAULT_MAX_WARNINGS
public static final int DEFAULT_MAX_WARNINGS
Default max number of warnings logged during parse of any one robots.txt file, see setMaxWarnings(int)
- See Also:
- Constant Field Values
-
DEFAULT_MAX_CRAWL_DELAY
public static final long DEFAULT_MAX_CRAWL_DELAY
Default max Crawl-Delay in milliseconds, see setMaxCrawlDelay(long)
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
SimpleRobotRulesParser
public SimpleRobotRulesParser()
-
SimpleRobotRulesParser
public SimpleRobotRulesParser(long maxCrawlDelay, int maxWarnings)
- Parameters:
maxCrawlDelay
- see setMaxCrawlDelay(long)
maxWarnings
- see setMaxWarnings(int)
-
-
Method Detail
-
failedFetch
public SimpleRobotRules failedFetch(int httpStatusCode)
Description copied from class: BaseRobotsParser
The fetch of robots.txt failed, so return rules appropriate given the HTTP status code.
- Specified by:
failedFetch
in class BaseRobotsParser
- Parameters:
httpStatusCode
- a failure status code (NOT 2xx)
- Returns:
- robot rules
-
parseContent
public SimpleRobotRules parseContent(String url, byte[] content, String contentType, String robotNames)
Description copied from class: BaseRobotsParser
Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent. Note that multiple agent names may be provided as comma-separated values; the order of these shouldn't matter, as the file is parsed in order, and each agent name found in the file will be compared to every agent name found in robotNames. Also note that names are lower-cased before comparison, and that any robot name you pass shouldn't contain commas or spaces; if the name has spaces, it will be split into multiple names, each of which will be compared against agent names in the robots.txt file. An agent name is considered a match if it's a prefix match on the provided robot name. For example, if you pass in "Mozilla Crawlerbot-super 1.0", this would match "crawlerbot" as the agent name, because of splitting on spaces, lower-casing, and the prefix match rule.
- Specified by:
parseContent
in class BaseRobotsParser
- Parameters:
url
- URL that robots.txt content was fetched from. A complete and valid URL (e.g., https://example.com/robots.txt) is expected. Used to resolve relative sitemap URLs and for logging/reporting purposes.
content
- raw bytes from the site's robots.txt file
contentType
- HTTP response header (mime-type)
robotNames
- name(s) of crawler, to be used when processing file contents (just the name portion, w/o version or other details)
- Returns:
- robot rules.
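The splitting, lower-casing, and prefix-match rule described above can be sketched as follows. This is a simplified, hypothetical model of the documented matching, not the library's actual implementation.

```java
// Hypothetical sketch of the documented agent-name matching, not library code.
public class AgentNameMatchSketch {

    /**
     * Returns true if an agent name found in a robots.txt file matches the
     * configured robot name(s): both sides are lower-cased, robotNames is
     * split on commas and spaces, and the file's agent name must be a
     * prefix of one of the resulting tokens.
     */
    public static boolean matches(String fileAgentName, String robotNames) {
        String agent = fileAgentName.toLowerCase();
        for (String token : robotNames.toLowerCase().split("[, ]+")) {
            if (token.startsWith(agent)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, passing "Mozilla Crawlerbot-super 1.0" matches the file agent name "crawlerbot": the name is split on spaces, and the token "crawlerbot-super" has "crawlerbot" as a prefix.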
-
getNumWarnings
public int getNumWarnings()
Get the number of warnings due to invalid rules/lines in the latest processed robots.txt file, see parseContent(String, byte[], String, String). Note: an incorrect value may be returned if the robots.txt was processed in a thread other than the current one.
- Returns:
- number of warnings
-
getMaxWarnings
public int getMaxWarnings()
Get the max number of logged warnings per robots.txt file.
-
setMaxWarnings
public void setMaxWarnings(int maxWarnings)
Set the max number of warnings about parse errors logged per robots.txt file.
-
getMaxCrawlDelay
public long getMaxCrawlDelay()
Get the configured max crawl delay.
- Returns:
- the configured max crawl delay, see setMaxCrawlDelay(long)
-
setMaxCrawlDelay
public void setMaxCrawlDelay(long maxCrawlDelay)
Set the max value in milliseconds accepted for the Crawl-Delay directive. If the value in the robots.txt is greater than the max value, all pages are skipped to avoid overly long Crawl-Delays blocking fetch queues and slowing down the crawl. Note: the value is in milliseconds, as some sites use floating-point numbers to define the delay.
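Since Crawl-Delay values appear in robots.txt as (possibly fractional) seconds, the millisecond conversion and the cap check might look like the following. This is a hypothetical sketch under that assumption, not the library's parsing code.

```java
// Hypothetical sketch, not library code: convert a robots.txt Crawl-Delay
// value (seconds, possibly fractional) to milliseconds and apply a cap.
public class CrawlDelaySketch {

    // Default cap from the documentation above: 300000 ms (5 minutes).
    public static final long DEFAULT_MAX_CRAWL_DELAY = 300_000L;

    /** Parse a Crawl-Delay value in seconds into milliseconds; -1 if invalid. */
    public static long toMillis(String crawlDelayValue) {
        try {
            double seconds = Double.parseDouble(crawlDelayValue.trim());
            return Math.round(seconds * 1000.0);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Per the description above, a delay beyond the cap means: skip all pages. */
    public static boolean exceedsCap(long delayMillis, long maxCrawlDelayMillis) {
        return delayMillis > maxCrawlDelayMillis;
    }
}
```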
-
main
public static void main(String[] args) throws MalformedURLException, IOException
- Throws:
MalformedURLException
IOException
-
-