java.lang.Object
- crawlercommons.robots.BaseRobotsParser

All Implemented Interfaces:

Serializable

Direct Known Subclasses:

SimpleRobotRulesParser
```
public abstract class BaseRobotsParser
extends Object
implements Serializable
```
Robots.txt parser definition.

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor Description

BaseRobotsParser()

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `abstract BaseRobotRules` | `failedFetch(int httpStatusCode)` | The fetch of robots.txt failed, so return rules appropriate for the given HTTP status code. |
| `abstract BaseRobotRules` | `parseContent(String url, byte[] content, String contentType, String robotNames)` | Deprecated. since 1.4 - replaced by `parseContent(java.lang.String,byte[],java.lang.String,java.util.Collection<java.lang.String>)`. |
| `abstract BaseRobotRules` | `parseContent(String url, byte[] content, String contentType, Collection<String> robotNames)` | Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - BaseRobotsParser
```
public BaseRobotsParser()
```
- Method Detail
  - parseContent
```
@Deprecated
public abstract BaseRobotRules parseContent(String url,
                                            byte[] content,
                                            String contentType,
                                            String robotNames)
```
    Deprecated.
    since 1.4 - replaced by `parseContent(java.lang.String,byte[],java.lang.String,java.util.Collection<java.lang.String>)`. Passing a collection of robot names gives users more control over how user-agent and robot names are matched. Passing a collection of names is also more efficient, because the robot name string does not have to be split again for every robots.txt file that is parsed.
    
    Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent. Multiple agent names may be provided as comma-separated values. How agent names are matched against user-agent lines in the robots.txt depends on the implementing class. Names are lower-cased before comparison, and an individual robot name shouldn't contain commas or spaces; a name with spaces is split into multiple names, each of which is compared against the agent names in the robots.txt file. An agent name is considered a match if it is a prefix of a provided robot name. For example, if you pass in "Mozilla Crawlerbot-super 1.0", this would match "crawlerbot" as the agent name, because of splitting on spaces, lower-casing, and the prefix match rule.
    
    Parameters:
    
    url - URL that robots.txt content was fetched from. A complete and valid URL (e.g., https://example.com/robots.txt) is expected. Used to resolve relative sitemap URLs and for logging/reporting purposes.
    
    content - raw bytes from the site's robots.txt file
    
    contentType - content type (MIME type) from HTTP response header
    
    robotNames - name(s) of crawler, to be used when processing file contents (just the name portion, without version or other details)
    
    Returns:
    
    robot rules.
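    
    The matching rule described for this deprecated method (split on commas and spaces, lower-case, prefix match) can be restated as a small stand-alone sketch. It does not use crawler-commons and is not the library's implementation; the class name `RobotNameMatchSketch` and its methods are hypothetical, and special cases such as the `*` wildcard agent are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: restates the documented matching rule, not library code.
final class RobotNameMatchSketch {

    // Split the caller-supplied names on commas and whitespace, then lower-case.
    static List<String> tokenize(String robotNames) {
        List<String> tokens = new ArrayList<>();
        for (String token : robotNames.split("[,\\s]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // True if the agent name taken from a User-agent line is a prefix of
    // any of the tokens derived from the supplied robot names.
    static boolean matches(String agentName, String robotNames) {
        String agent = agentName.trim().toLowerCase(Locale.ROOT);
        if (agent.isEmpty()) {
            return false;
        }
        for (String token : tokenize(robotNames)) {
            if (token.startsWith(agent)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String robotNames = "Mozilla Crawlerbot-super 1.0";
        System.out.println(tokenize(robotNames));              // [mozilla, crawlerbot-super, 1.0]
        System.out.println(matches("crawlerbot", robotNames)); // true
        System.out.println(matches("otherbot", robotNames));   // false
    }
}
```

    The example from the description falls out directly: "Crawlerbot-super" becomes the token "crawlerbot-super", which starts with "crawlerbot".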
  - parseContent
```
public abstract BaseRobotRules parseContent(String url,
                                            byte[] content,
                                            String contentType,
                                            Collection<String> robotNames)
```
    Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent. Multiple agent names can be provided as a collection. How agent names are matched against user-agent lines in the robots.txt depends on the implementing class.
    
    Parameters:
    
    url - URL that robots.txt content was fetched from. A complete and valid URL (e.g., https://example.com/robots.txt) is expected. Used to resolve relative sitemap URLs and for logging/reporting purposes.
    
    content - raw bytes from the site's robots.txt file
    
    contentType - content type (MIME type) from HTTP response header
    
    robotNames - name(s) of crawler, used to select rules from the robots.txt file by matching the names against the user-agent lines in the robots.txt file.
    
    Returns:
    
    robot rules.
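    
    When migrating from the deprecated String-based method, the comma-separated name string can be converted to a collection once and reused for every robots.txt file. The following is a minimal sketch using only the JDK; the helper `toRobotNames` is hypothetical, and lower-casing the names here is an assumption about what an implementing class expects, so check the implementation you use.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: prepares the robotNames collection once, up front.
final class RobotNamesSketch {

    // Hypothetical helper: split a legacy comma-separated string, trim and
    // lower-case each name (assumption), and drop empty or duplicate entries.
    static Collection<String> toRobotNames(String commaSeparated) {
        Set<String> names = new LinkedHashSet<>();
        for (String name : commaSeparated.split(",")) {
            String cleaned = name.trim().toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                names.add(cleaned);
            }
        }
        return Collections.unmodifiableSet(names);
    }

    public static void main(String[] args) {
        Collection<String> robotNames = toRobotNames("MyBot, mybot-images");
        System.out.println(robotNames); // [mybot, mybot-images]

        // With crawler-commons on the classpath, the same collection would be
        // passed to each call (variable names here are assumed):
        // BaseRobotRules rules = parser.parseContent(url, content, contentType, robotNames);
    }
}
```

    Building the collection once avoids the repeated string splitting that the deprecation note mentions.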
  - failedFetch
```
public abstract BaseRobotRules failedFetch(int httpStatusCode)
```
    The fetch of robots.txt failed, so return rules appropriate for the given HTTP status code.
    
    Parameters:
    
    httpStatusCode - a failure status code (any status other than 2xx)
    
    Returns:
    
    robot rules
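    
    What "appropriate" means is left to the implementing class. As an illustration only, the sketch below shows one plausible policy: treat 4xx as "no robots.txt, allow everything" and treat other failures as temporary, disallowing crawling until a later retry. The `Policy` enum and `policyFor` method are hypothetical stand-ins and are not part of this API.

```java
// Illustrative only: one possible mapping from a failed robots.txt fetch
// to a crawl policy. Not the behavior of any specific implementing class.
final class FailedFetchSketch {

    // Hypothetical stand-in for the rule sets an implementation might return.
    enum Policy { ALLOW_ALL, ALLOW_NONE_AND_RETRY_LATER }

    static Policy policyFor(int httpStatusCode) {
        if (httpStatusCode >= 200 && httpStatusCode < 300) {
            // The parameter is documented as a failure status code.
            throw new IllegalArgumentException("2xx is not a failed fetch: " + httpStatusCode);
        }
        if (httpStatusCode >= 400 && httpStatusCode < 500) {
            // Assumption: a missing or forbidden robots.txt means no restrictions.
            return Policy.ALLOW_ALL;
        }
        // Assumption: redirects, server errors and anything else are temporary.
        return Policy.ALLOW_NONE_AND_RETRY_LATER;
    }

    public static void main(String[] args) {
        System.out.println(policyFor(404)); // ALLOW_ALL
        System.out.println(policyFor(503)); // ALLOW_NONE_AND_RETRY_LATER
    }
}
```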
