Class BaseRobotsParser

    • Constructor Detail

      • BaseRobotsParser

        public BaseRobotsParser()
    • Method Detail

      • parseContent

        public abstract BaseRobotRules parseContent(String url,
                                                    byte[] content,
                                                    String contentType,
                                                    String robotNames)
        Parses the robots.txt file in content and returns the rules that apply to the crawler(s) named in robotNames. Multiple agent names may be provided as comma-separated values. Their order shouldn't matter: the file is parsed in order, and each agent name found in the file is compared to every agent name found in robotNames.
        Names are lower-cased before comparison. Any robot name you pass shouldn't contain commas or spaces; a name with spaces is split into multiple names, each of which is compared against the agent names in the robots.txt file.
        An agent name from the file is considered a match if it is a prefix of a provided robot name. For example, if you pass in "Mozilla Crawlerbot-super 1.0", this matches "crawlerbot" as the agent name, because of splitting on spaces, lower-casing, and the prefix match rule.
        Parameters:
        url - URL that robots.txt content was fetched from. A complete and valid URL (e.g., https://example.com/robots.txt) is expected. Used to resolve relative sitemap URLs and for logging/reporting purposes.
        content - raw bytes from the site's robots.txt file
        contentType - value of the HTTP Content-Type response header (MIME type)
        robotNames - name(s) of the crawler, used when processing the file contents (just the name portion, without version or other details)
        Returns:
        robot rules.
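
        The name-matching rule described above (split on commas and spaces, lower-case, prefix match) can be sketched as follows. This is only an illustration of the documented rule, not the library's implementation; the class RobotNameMatchSketch and its matches method are hypothetical names, and special cases such as the "*" wildcard agent are not handled.

```java
import java.util.Locale;

// Hypothetical helper illustrating the documented matching rule.
// It is NOT part of the library and does not handle the "*" wildcard agent.
final class RobotNameMatchSketch {

    // Returns true if agentName (as found in a robots.txt User-agent line)
    // is a prefix of any name obtained by splitting robotNames on commas
    // and whitespace, after lower-casing both sides.
    static boolean matches(String robotNames, String agentName) {
        String agent = agentName.trim().toLowerCase(Locale.ROOT);
        if (agent.isEmpty()) {
            return false; // an empty agent name would trivially prefix-match everything
        }
        for (String name : robotNames.toLowerCase(Locale.ROOT).split("[\\s,]+")) {
            if (name.startsWith(agent)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "crawlerbot-super" starts with "crawlerbot", so this matches.
        System.out.println(matches("Mozilla Crawlerbot-super 1.0", "crawlerbot"));
        // The agent name is longer than the robot name, so no prefix match.
        System.out.println(matches("crawlerbot", "crawlerbot-super"));
    }
}
```

        Under these assumptions the first call prints true and the second prints false, which is why passing only the bare crawler name (no version, no spaces) is the safer choice.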
      • failedFetch

        public abstract BaseRobotRules failedFetch(int httpStatusCode)
        The fetch of robots.txt failed, so return rules appropriate for the given HTTP status code.
        Parameters:
        httpStatusCode - a failure status code (NOT 2xx)
        Returns:
        robot rules
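
        Because failedFetch is abstract, the mapping from status code to rules is left to the subclass. A minimal sketch of one possible policy is shown below. The mapping itself is an assumption (4xx treated as "no robots.txt, nothing restricted", everything else as "hold off and retry later"), and the FailedFetchPolicy enum and FailedFetchSketch class are hypothetical stand-ins for a real BaseRobotRules subclass.

```java
// Hypothetical stand-in for a rules object; not the library's BaseRobotRules.
enum FailedFetchPolicy {
    ALLOW_ALL,        // crawl everything
    ALLOW_NONE_DEFER  // crawl nothing for now and retry robots.txt later
}

// One possible policy, shown for illustration only.
final class FailedFetchSketch {

    static FailedFetchPolicy failedFetch(int httpStatusCode) {
        if (httpStatusCode >= 200 && httpStatusCode < 300) {
            // The method is documented as accepting failure codes only.
            throw new IllegalArgumentException(
                "failedFetch expects a non-2xx status code: " + httpStatusCode);
        }
        if (httpStatusCode >= 400 && httpStatusCode < 500) {
            // Assumption: a client error means the site publishes no robots.txt.
            return FailedFetchPolicy.ALLOW_ALL;
        }
        // Assumption: redirects that never resolved, server errors, and any
        // other code are treated as temporary, so visits are deferred.
        return FailedFetchPolicy.ALLOW_NONE_DEFER;
    }
}
```

        A concrete parser may well choose a different mapping; the point is only that the method takes a non-2xx code and must still return a usable rules object.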