Class SimpleRobotRules

    • Method Detail

      • clearRules

        public void clearRules()
      • addRule

        public void addRule​(String prefix,
                            boolean allow)
        Add an allow/disallow rule to the ruleset.
        Parameters:
        prefix - path prefix or pattern
        allow - whether to allow the URLs matching the prefix or pattern
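A minimal sketch of assembling a ruleset with addRule (assuming the crawler-commons package `crawlercommons.robots`; the paths and URLs are invented for illustration):

```java
import crawlercommons.robots.SimpleRobotRules;

public class AddRuleExample {
    public static void main(String[] args) {
        SimpleRobotRules rules = new SimpleRobotRules();
        rules.addRule("/private/", false); // disallow this path prefix
        rules.addRule("/", true);          // allow everything else explicitly
        rules.sortRules();                 // sort before matching (see sortRules)

        // "/private/a.html" matches both prefixes; the longer disallow rule wins
        System.out.println(rules.isAllowed("http://example.com/private/a.html"));
        System.out.println(rules.isAllowed("http://example.com/index.html"));
    }
}
```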
      • isAllowed

        public boolean isAllowed​(String url)
        Check whether a URL is allowed to be fetched according to the robots rules. Note that the URL must be properly normalized, otherwise the URL path may not be matched against the robots rules. In order to normalize the URL, BasicURLNormalizer can be used:
         BasicURLNormalizer normalizer = new BasicURLNormalizer();
         String urlNormalized = normalizer.filter(urlNotNormalized);
         
        Specified by:
        isAllowed in class BaseRobotRules
        Parameters:
        url - URL string to be checked
        Returns:
        true if the URL is allowed
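Building on the normalization snippet above, a sketch of the normalize-then-check flow (assuming BasicURLNormalizer lives in `crawlercommons.filters.basic`; the rule path and URL are invented for illustration):

```java
import crawlercommons.filters.basic.BasicURLNormalizer;
import crawlercommons.robots.SimpleRobotRules;

public class NormalizeAndCheck {
    public static void main(String[] args) {
        SimpleRobotRules rules = new SimpleRobotRules();
        rules.addRule("/search", false); // hypothetical disallow rule
        rules.sortRules();

        BasicURLNormalizer normalizer = new BasicURLNormalizer();
        // The raw path "/./search" only matches the "/search" prefix
        // after normalization removes the dot segment
        String urlNormalized = normalizer.filter("http://example.com/./search?q=x");
        System.out.println(rules.isAllowed(urlNormalized));
    }
}
```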
      • isAllowed

        public boolean isAllowed​(URL url)
        Check whether a URL is allowed to be fetched according to the robots rules.
        Specified by:
        isAllowed in class BaseRobotRules
        Parameters:
        url - URL to be checked
        Returns:
        true if the URL is allowed
        See Also:
        isAllowed(String)
      • escapePath

        public static String escapePath​(String urlPathQuery,
                                        boolean[] additionalEncodedBytes)
        Encode/decode (using percent-encoding) all characters where necessary: encode Unicode/non-ASCII characters and decode printable ASCII characters without special semantics.
        Parameters:
        urlPathQuery - path and query component of the URL
        additionalEncodedBytes - boolean array used to request additional bytes (ASCII characters) to be percent-encoded, beyond the characters that always require encoding (Unicode/non-ASCII characters and characters not allowed in URLs)
        Returns:
        properly percent-encoded URL path and query
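A hedged sketch of escapePath (the sample path is invented; passing null for additionalEncodedBytes, i.e. requesting no extra bytes, is an assumption here):

```java
import crawlercommons.robots.SimpleRobotRules;

public class EscapePathExample {
    public static void main(String[] args) {
        // Non-ASCII characters (here "é") are percent-encoded as UTF-8 bytes;
        // null means no additional ASCII bytes are requested for encoding (assumption)
        String escaped = SimpleRobotRules.escapePath("/wiki/San José?q=1", null);
        System.out.println(escaped);

        // Percent-encoded printable ASCII without special semantics
        // (e.g. "%41" for "A") may be decoded back to the plain character
        System.out.println(SimpleRobotRules.escapePath("/%41bc", null));
    }
}
```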
      • sortRules

        public void sortRules()
        Sort and deduplicate robot rules. This method must be called after the robots.txt has been processed and before rule matching. The ordering is implemented in SimpleRobotRules.RobotRule.compareTo(RobotRule) and defined by RFC 9309, section 2.2.2:
        The most specific match found MUST be used. The most specific match is the match that has the most octets. Duplicate rules in a group MAY be deduplicated.
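The most-specific-match semantics can be illustrated with a small sketch (rule paths are invented for illustration):

```java
import crawlercommons.robots.SimpleRobotRules;

public class SortRulesExample {
    public static void main(String[] args) {
        SimpleRobotRules rules = new SimpleRobotRules();
        rules.addRule("/page", false);        // shorter disallow prefix
        rules.addRule("/page/allowed", true); // longer, more specific allow prefix
        rules.sortRules();                    // required before matching

        // Both prefixes match this URL; the allow rule has more octets and wins
        System.out.println(rules.isAllowed("http://example.com/page/allowed/x.html"));
        // Only the disallow prefix matches here
        System.out.println(rules.isAllowed("http://example.com/page/other.html"));
    }
}
```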
      • isAllowAll

        public boolean isAllowAll()
        Is our ruleset set up to allow all access?

        Note: This is decided only based on the SimpleRobotRules.RobotRulesMode without inspecting the set of allow/disallow rules.

        Specified by:
        isAllowAll in class BaseRobotRules
        Returns:
        true if all URLs are allowed.
      • isAllowNone

        public boolean isAllowNone()
        Is our ruleset set up to disallow all access?

        Note: This is decided only based on the SimpleRobotRules.RobotRulesMode without inspecting the set of allow/disallow rules.

        Specified by:
        isAllowNone in class BaseRobotRules
        Returns:
        true if no URLs are allowed.
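A sketch of the mode-based checks (assuming the nested enum `SimpleRobotRules.RobotRulesMode` with constants ALLOW_ALL and ALLOW_NONE, as referenced above):

```java
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class ModeExample {
    public static void main(String[] args) {
        SimpleRobotRules allowAll = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        SimpleRobotRules allowNone = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);

        // Both checks are decided by the mode alone, not by any rule list
        System.out.println(allowAll.isAllowAll());
        System.out.println(allowNone.isAllowNone());
    }
}
```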
      • toString

        public String toString()
        Description copied from class: BaseRobotRules
        Returns a string with the crawl delay as well as a list of sitemaps if they exist (and aren't more than 10).
        Overrides:
        toString in class BaseRobotRules