Crawler-commons 1.5 API

Packages 
crawlercommons  
crawlercommons.domains
Classes contained within the domains package relate to the definition of "paid-level domains" or "effective top-level domains", that is, Internet domain names one level below a public suffix as defined in the public suffix list.
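For illustration, a minimal sketch using EffectiveTldFinder from this package to look up the paid-level domain of a host name; the host name and expected output are made-up examples, and actual results depend on the public suffix list bundled with the release.

    import crawlercommons.domains.EffectiveTldFinder;

    public class DomainExample {
        public static void main(String[] args) {
            // "co.uk" is a public suffix, so the paid-level domain is
            // one level below it: "example.co.uk"
            String domain = EffectiveTldFinder.getAssignedDomain("www.example.co.uk");
            System.out.println(domain); // expected: example.co.uk
        }
    }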
crawlercommons.filters
The filters package contains code and resources for URL filtering.
crawlercommons.filters.basic
URL normalizer performing basic normalizations applicable to http:// and https:// URLs.
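A minimal sketch of the normalizer in this package; the input URL is a made-up example and the exact set of normalizations applied may vary between releases.

    import crawlercommons.filters.basic.BasicURLNormalizer;

    public class NormalizerExample {
        public static void main(String[] args) {
            BasicURLNormalizer normalizer = new BasicURLNormalizer();
            // lower-case scheme and host, drop the default port, resolve path segments
            String normalized = normalizer.filter("HTTP://www.Example.com:80/a/../b.html");
            System.out.println(normalized); // expected: http://www.example.com/b.html
        }
    }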
crawlercommons.mimetypes
Utilities for detecting MIME types relevant in the context of crawler-commons.
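A hedged sketch of MIME type detection; a MimeTypeDetector with a no-argument constructor and a detect(byte[]) method is an assumption based on the class's use by the sitemap parser, so the exact signatures should be checked against the package Javadoc.

    import java.nio.charset.StandardCharsets;

    import crawlercommons.mimetypes.MimeTypeDetector;

    public class MimeExample {
        public static void main(String[] args) {
            // detect(byte[]) is assumed here; verify against the Javadoc
            MimeTypeDetector detector = new MimeTypeDetector();
            byte[] content = "<?xml version=\"1.0\"?><urlset/>"
                    .getBytes(StandardCharsets.UTF_8);
            System.out.println(detector.detect(content)); // e.g. an XML MIME type
        }
    }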
crawlercommons.robots
The robots package contains Crawler-Commons' robots.txt parsing, rule inference, and related utilities.
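A minimal sketch of parsing a fetched robots.txt with SimpleRobotRulesParser from this package; the robots.txt content, URLs, and the agent name "mybot" are made-up, and the parseContent overload taking a collection of agent names is assumed to match the 1.5 API.

    import java.nio.charset.StandardCharsets;
    import java.util.Collections;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsExample {
        public static void main(String[] args) {
            byte[] robotsTxt = "User-agent: mybot\nDisallow: /private/\n"
                    .getBytes(StandardCharsets.UTF_8);
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            // parse the rules that apply to the agent "mybot"
            BaseRobotRules rules = parser.parseContent(
                    "https://www.example.com/robots.txt", robotsTxt,
                    "text/plain", Collections.singleton("mybot"));
            System.out.println(rules.isAllowed("https://www.example.com/private/x")); // false
            System.out.println(rules.isAllowed("https://www.example.com/index.html")); // true
        }
    }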
crawlercommons.sitemaps
Classes focused on parsing and processing sitemaps and holding the resulting set of URLs with crawling-related metadata, such as the change frequency of a page.
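A minimal sketch of parsing a sitemap with SiteMapParser and reading the per-URL metadata; the inline sitemap content and URL are made-up examples.

    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    import crawlercommons.sitemaps.AbstractSiteMap;
    import crawlercommons.sitemaps.SiteMap;
    import crawlercommons.sitemaps.SiteMapParser;
    import crawlercommons.sitemaps.SiteMapURL;

    public class SitemapExample {
        public static void main(String[] args) throws Exception {
            byte[] content = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                    + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                    + "<url><loc>https://www.example.com/</loc>"
                    + "<changefreq>daily</changefreq></url></urlset>")
                    .getBytes(StandardCharsets.UTF_8);
            SiteMapParser parser = new SiteMapParser();
            AbstractSiteMap sm = parser.parseSiteMap(content,
                    new URL("https://www.example.com/sitemap.xml"));
            if (!sm.isIndex()) { // a sitemap index would point to further sitemaps instead
                for (SiteMapURL u : ((SiteMap) sm).getSiteMapUrls()) {
                    System.out.println(u.getUrl() + " " + u.getChangeFrequency());
                }
            }
        }
    }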
crawlercommons.sitemaps.extension
Extensions to the sitemaps protocol for additional attributes and links to alternate media formats, for example image, video and news sitemaps.
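A sketch of parsing an image sitemap via the extension support; enableExtension and getAttributesForExtension are assumed to follow the extension API of the 1.x releases, and the sitemap content is a made-up example.

    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    import crawlercommons.sitemaps.SiteMap;
    import crawlercommons.sitemaps.SiteMapParser;
    import crawlercommons.sitemaps.SiteMapURL;
    import crawlercommons.sitemaps.extension.Extension;
    import crawlercommons.sitemaps.extension.ExtensionMetadata;

    public class ImageSitemapExample {
        public static void main(String[] args) throws Exception {
            SiteMapParser parser = new SiteMapParser();
            parser.enableExtension(Extension.IMAGE); // also parse <image:image> elements
            byte[] content = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                    + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\""
                    + " xmlns:image=\"http://www.google.com/schemas/sitemap-image/1.1\">"
                    + "<url><loc>https://www.example.com/page.html</loc>"
                    + "<image:image><image:loc>https://www.example.com/img.jpg</image:loc>"
                    + "</image:image></url></urlset>")
                    .getBytes(StandardCharsets.UTF_8);
            SiteMap sm = (SiteMap) parser.parseSiteMap(content,
                    new URL("https://www.example.com/sitemap.xml"));
            for (SiteMapURL u : sm.getSiteMapUrls()) {
                ExtensionMetadata[] images = u.getAttributesForExtension(Extension.IMAGE);
                if (images != null) {
                    for (ExtensionMetadata img : images) {
                        System.out.println(img); // image attributes, e.g. the image location
                    }
                }
            }
        }
    }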
crawlercommons.sitemaps.sax
SAX handlers to parse specific elements of XML sitemaps or Atom/RSS feeds.
crawlercommons.sitemaps.sax.extension
SAX handlers to parse extensions of XML sitemaps.