crawlercommons

crawlercommons.domains
Classes contained within the domains package relate to the definition of
"paid-level" domains or "effective top-level domains", that is, Internet
domain names one level below a public suffix defined in the public suffix list.
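
For illustration, a minimal sketch using the EffectiveTldFinder and
PaidLevelDomain classes from this package (the host name is made up):

    import crawlercommons.domains.EffectiveTldFinder;
    import crawlercommons.domains.PaidLevelDomain;

    public class DomainExample {
        public static void main(String[] args) {
            // "co.uk" is a public suffix, so the name one level below
            // it ("example.co.uk") is the paid-level domain.
            String host = "www.example.co.uk";
            System.out.println(PaidLevelDomain.getPLD(host));
            // EffectiveTldFinder resolves the same host against the
            // public suffix list bundled with crawler-commons.
            System.out.println(EffectiveTldFinder.getAssignedDomain(host));
        }
    }

Both calls print "example.co.uk" for this host.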

crawlercommons.filters
The filters package contains code and resources for URL filtering.

crawlercommons.filters.basic
URL normalizer performing basic normalizations applicable to
http:// and https:// URLs.
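
A minimal usage sketch (the input URL is made up); filter() returns the
normalized URL, or null if the URL should be discarded:

    import crawlercommons.filters.basic.BasicURLNormalizer;

    public class NormalizerExample {
        public static void main(String[] args) {
            BasicURLNormalizer normalizer = new BasicURLNormalizer();
            // Lower-cases scheme and host, removes the default port and
            // resolves relative path elements such as "/../" and "/./".
            String normalized = normalizer.filter(
                    "HTTP://www.Example.com:80/a/../b/./page.html");
            System.out.println(normalized); // http://www.example.com/b/page.html
        }
    }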

crawlercommons.mimetypes
Utilities for detecting MIME types relevant in the context of
crawler-commons.
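
A short sketch of a possible use of the MimeTypeDetector class from this
package; treat the detect(byte[]) signature as an assumption rather than a
guaranteed API:

    import java.nio.charset.StandardCharsets;
    import crawlercommons.mimetypes.MimeTypeDetector;

    public class MimeExample {
        public static void main(String[] args) {
            MimeTypeDetector detector = new MimeTypeDetector();
            byte[] content = "<?xml version=\"1.0\"?><urlset/>"
                    .getBytes(StandardCharsets.UTF_8);
            // Assumption: detect(byte[]) sniffs the leading bytes and
            // returns a MIME type string such as "application/xml".
            System.out.println(detector.detect(content));
        }
    }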

crawlercommons.robots
The robots package contains all of the robots.txt parsing, rule inference
and related utilities contained within Crawler-Commons.
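
A minimal sketch of parsing a fetched robots.txt with SimpleRobotRulesParser
and checking URLs against the resulting rules (the agent name and URLs are
made up):

    import java.nio.charset.StandardCharsets;
    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsExample {
        public static void main(String[] args) {
            byte[] robotsTxt = ("User-agent: *\n"
                    + "Disallow: /private/\n").getBytes(StandardCharsets.UTF_8);

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            // "mycrawler" is the agent name matched against User-agent lines.
            BaseRobotRules rules = parser.parseContent(
                    "https://www.example.com/robots.txt",
                    robotsTxt, "text/plain", "mycrawler");

            System.out.println(rules.isAllowed("https://www.example.com/private/x")); // false
            System.out.println(rules.isAllowed("https://www.example.com/public/x"));  // true
        }
    }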

crawlercommons.sitemaps
Classes focused on parsing and processing sitemaps and holding the resulting
set of URLs with crawling-related metadata, such as the change frequency of
a page.
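
A minimal sketch parsing an XML sitemap with SiteMapParser and iterating over
the extracted URLs and their metadata (the sitemap content is made up):

    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import crawlercommons.sitemaps.AbstractSiteMap;
    import crawlercommons.sitemaps.SiteMap;
    import crawlercommons.sitemaps.SiteMapParser;
    import crawlercommons.sitemaps.SiteMapURL;

    public class SitemapExample {
        public static void main(String[] args) throws Exception {
            byte[] content = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                    + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                    + "<url><loc>https://www.example.com/</loc>"
                    + "<changefreq>daily</changefreq></url>"
                    + "</urlset>").getBytes(StandardCharsets.UTF_8);

            SiteMapParser parser = new SiteMapParser();
            AbstractSiteMap asm = parser.parseSiteMap(content,
                    new URL("https://www.example.com/sitemap.xml"));

            // A sitemap index would require recursing into its sub-sitemaps.
            if (!asm.isIndex()) {
                for (SiteMapURL u : ((SiteMap) asm).getSiteMapUrls()) {
                    System.out.println(u.getUrl() + " " + u.getChangeFrequency());
                }
            }
        }
    }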

crawlercommons.sitemaps.extension
Extensions to the sitemaps protocol for additional attributes and links to
alternate media formats, for example image, video and news sitemaps.
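
A sketch of parsing an image sitemap with extension support enabled; the
enableExtensions() and getAttributesForExtension() calls follow the
crawler-commons API, but treat the details as assumptions:

    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import crawlercommons.sitemaps.AbstractSiteMap;
    import crawlercommons.sitemaps.SiteMap;
    import crawlercommons.sitemaps.SiteMapParser;
    import crawlercommons.sitemaps.SiteMapURL;
    import crawlercommons.sitemaps.extension.Extension;
    import crawlercommons.sitemaps.extension.ExtensionMetadata;

    public class ImageSitemapExample {
        public static void main(String[] args) throws Exception {
            byte[] content = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                    + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\""
                    + " xmlns:image=\"http://www.google.com/schemas/sitemap-image/1.1\">"
                    + "<url><loc>https://www.example.com/page.html</loc>"
                    + "<image:image><image:loc>https://www.example.com/pic.jpg"
                    + "</image:loc></image:image></url>"
                    + "</urlset>").getBytes(StandardCharsets.UTF_8);

            SiteMapParser parser = new SiteMapParser();
            parser.enableExtensions(); // parse image/video/news/links extensions

            AbstractSiteMap asm = parser.parseSiteMap(content,
                    new URL("https://www.example.com/sitemap.xml"));
            for (SiteMapURL u : ((SiteMap) asm).getSiteMapUrls()) {
                ExtensionMetadata[] images = u.getAttributesForExtension(Extension.IMAGE);
                if (images != null) {
                    for (ExtensionMetadata img : images) {
                        System.out.println(img);
                    }
                }
            }
        }
    }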

crawlercommons.sitemaps.sax
SAX handlers to parse specific elements of XML sitemaps or Atom/RSS feeds.

crawlercommons.sitemaps.sax.extension
SAX handlers to parse extensions of XML sitemaps.