public class SimpleRobotRulesParser extends BaseRobotsParser
This implementation of BaseRobotsParser retrieves a set of rules for an agent with the given name from the robots.txt file of a given domain.
The class fulfills two tasks. The first is parsing the robots.txt file, which is done in parseContent(String, byte[], String, String). During parsing, the parser searches for the provided agent name(s). If it finds a matching name, the set of rules for that name is parsed and returned as the result. Note that if more than one agent name is given to the parser, it parses the rules for the first matching agent name inside the file and skips all following user agent groups; it does not matter whether any of the other given agent names would match additional rules later in the file. Thus, if more than one agent name is given to the parser, the result can be influenced by the order of rule sets inside the robots.txt file.

Note that the parser always parses the entire file, even after a matching agent name group has been found, because it needs to collect all of the sitemap directives.
If no rule set matches any of the provided agent names, the rule set for the '*' agent is returned. If there is no such rule set inside the robots.txt file, a rule set allowing all resources to be crawled is returned.
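As an illustration, the following sketch (with a hypothetical agent name and example.com URLs, assuming the crawler-commons classes are on the classpath) shows how the first matching agent group wins over a later, more restrictive '*' group:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class AgentMatchingExample {
    public static void main(String[] args) {
        // Two groups: a specific one for "mybot" and a catch-all '*' group.
        String robotsTxt =
                "User-agent: mybot\n" +
                "Disallow: /private/\n" +
                "\n" +
                "User-agent: *\n" +
                "Disallow: /\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mybot");

        // The "mybot" group matched first, so only /private/ is disallowed;
        // the more restrictive '*' group is skipped.
        System.out.println(rules.isAllowed("http://www.example.com/index.html")); // expected: true
        System.out.println(rules.isAllowed("http://www.example.com/private/x"));  // expected: false
    }
}
```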
The crawl delay is parsed and added to the rules. Note that if the crawl delay inside the file exceeds a maximum value, crawling of all resources is prohibited. The maximum value is defined by MAX_CRAWL_DELAY.
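A minimal sketch of reading the parsed delay (the agent name, URLs, and delay value are made up; the exact unit reported by getCrawlDelay() is not specified here, so treat it as an assumption to verify):

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class CrawlDelayExample {
    public static void main(String[] args) {
        String robotsTxt =
                "User-agent: mybot\n" +
                "Disallow: /tmp/\n" +
                "Crawl-delay: 10\n";

        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "http://www.example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mybot");

        // The parsed delay is exposed on the rules object.
        System.out.println("crawl delay: " + rules.getCrawlDelay());

        // Had the delay exceeded MAX_CRAWL_DELAY, the parser would instead
        // return rules that prohibit all crawling, i.e. isAllowNone() == true.
        System.out.println("allow none: " + rules.isAllowNone());
    }
}
```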
The second task of this class is to generate a set of rules if fetching the robots.txt file fails. The failedFetch(int) method returns a predefined set of rules based on the given error code. If the status code indicates a client error (4xx), the parser assumes that the robots.txt file does not exist and that crawling of all resources is allowed. For any other error code (3xx or 5xx), the parser assumes a temporary error and returns a set of rules prohibiting any crawling.
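A brief sketch of this behavior (the status codes 404 and 503 are chosen only for illustration):

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class FailedFetchExample {
    public static void main(String[] args) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // 404 (client error): robots.txt is assumed to be missing, everything is crawlable.
        BaseRobotRules notFound = parser.failedFetch(404);
        System.out.println(notFound.isAllowAll());     // expected: true

        // 503 (server error): treated as a temporary problem, nothing is crawlable.
        BaseRobotRules unavailable = parser.failedFetch(503);
        System.out.println(unavailable.isAllowNone()); // expected: true
    }
}
```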
| Constructor and Description |
| --- |
| SimpleRobotRulesParser() |
| Modifier and Type | Method and Description |
| --- | --- |
| BaseRobotRules | failedFetch(int httpStatusCode): The fetch of robots.txt failed, so return rules appropriate given the HTTP status code. |
| int | getNumWarnings() |
| BaseRobotRules | parseContent(String url, byte[] content, String contentType, String robotNames): Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent. |
public BaseRobotRules failedFetch(int httpStatusCode)

Specified by: failedFetch in class BaseRobotsParser

Parameters:
httpStatusCode - a failure status code (NOT 2xx)

public BaseRobotRules parseContent(String url, byte[] content, String contentType, String robotNames)

Specified by: parseContent in class BaseRobotsParser

Parameters:
url - URL that content was fetched from (for reporting purposes)
content - raw bytes from the site's robots.txt file
contentType - HTTP response header (mime-type)
robotNames - name(s) of crawler, to be used when processing file contents (just the name portion, w/o version or other details)

public int getNumWarnings()
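The following sketch ties parseContent and getNumWarnings together. The robots.txt body and agent name are hypothetical, and the assumption that an unrecognized directive is counted as a warning should be verified against the parser's behavior:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class WarningCountExample {
    public static void main(String[] args) {
        // The second line is not a directive the parser is expected to recognize.
        String robotsTxt =
                "User-agent: mybot\n" +
                "Not-a-directive: whatever\n" +
                "Disallow: /cgi-bin/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mybot");

        // Problems encountered while parsing (e.g. the unknown directive above)
        // are expected to show up in the warning count.
        System.out.println("warnings: " + parser.getNumWarnings());
        System.out.println("allowed: " + rules.isAllowed("http://www.example.com/cgi-bin/test"));
    }
}
```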
Copyright © 2009–2016 Crawler-Commons. All rights reserved.