A B C D E F G H I K L M N P R S T U V W _ 

A

abort() - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
Terminate any async request being processed.
abort() - Method in class crawlercommons.fetcher.file.SimpleFileFetcher
Deprecated.
 
abort() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
AbortedFetchException - Exception in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
AbortedFetchException() - Constructor for exception crawlercommons.fetcher.AbortedFetchException
Deprecated.
 
AbortedFetchException(String, AbortedFetchReason) - Constructor for exception crawlercommons.fetcher.AbortedFetchException
Deprecated.
 
AbortedFetchException(String, String, AbortedFetchReason) - Constructor for exception crawlercommons.fetcher.AbortedFetchException
Deprecated.
 
AbortedFetchReason - Enum in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
AbstractSiteMap - Class in crawlercommons.sitemaps
SiteMap or SiteMapIndex
AbstractSiteMap() - Constructor for class crawlercommons.sitemaps.AbstractSiteMap
 
AbstractSiteMap.SitemapType - Enum in crawlercommons.sitemaps
Various Sitemap types
addCookie(Cookie) - Method in class crawlercommons.fetcher.http.LocalCookieStore
Deprecated.
Adds an HTTP cookie, replacing any existing equivalent cookies.
addCookies(Cookie[]) - Method in class crawlercommons.fetcher.http.LocalCookieStore
Deprecated.
Adds an array of HTTP cookies.
addRule(String, boolean) - Method in class crawlercommons.robots.SimpleRobotRules
 
addSitemap(String) - Method in class crawlercommons.robots.BaseRobotRules
 
addSiteMapUrl(SiteMapURL) - Method in class crawlercommons.sitemaps.SiteMap
 
addUrlIntoSitemap(String, SiteMap, String, String, String, int) - Method in class crawlercommons.sitemaps.SiteMapParser
Adds the given URL to the given sitemap while showing the relevant logs
addValidMimeType(String) - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
addValidMimeTypes(Set<String>) - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 

B

BadProtocolFetchException - Exception in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
BadProtocolFetchException() - Constructor for exception crawlercommons.fetcher.BadProtocolFetchException
Deprecated.
 
BadProtocolFetchException(String) - Constructor for exception crawlercommons.fetcher.BadProtocolFetchException
Deprecated.
 
BaseFetcher - Class in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
BaseFetcher() - Constructor for class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
BaseFetchException - Exception in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
BaseFetchException() - Constructor for exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
BaseFetchException(String) - Constructor for exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
BaseFetchException(String, String) - Constructor for exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
BaseFetchException(String, Exception) - Constructor for exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
BaseFetchException(String, String, Exception) - Constructor for exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
BaseHttpFetcher - Class in crawlercommons.fetcher.http
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
BaseHttpFetcher(int, UserAgent) - Constructor for class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
BaseHttpFetcher.RedirectMode - Enum in crawlercommons.fetcher.http
Deprecated.
 
BaseRobotRules - Class in crawlercommons.robots
Result from parsing a single robots.txt file - which means we get a set of rules, and a crawl-delay.
BaseRobotRules() - Constructor for class crawlercommons.robots.BaseRobotRules
 
BaseRobotsParser - Class in crawlercommons.robots
 
BaseRobotsParser() - Constructor for class crawlercommons.robots.BaseRobotsParser
 
BasicURLNormalizer - Class in crawlercommons.filters.basic
Code borrowed from Apache Nutch.
BasicURLNormalizer() - Constructor for class crawlercommons.filters.basic.BasicURLNormalizer
 

C

clear() - Method in class crawlercommons.fetcher.http.LocalCookieStore
Deprecated.
Clears all cookies.
clear() - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
clearExpired(Date) - Method in class crawlercommons.fetcher.http.LocalCookieStore
Deprecated.
Removes all cookies in this HTTP state that have expired by the specified date.
clearRules() - Method in class crawlercommons.robots.SimpleRobotRules
 
COMMENT - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
compareTo(SimpleRobotRules.RobotRule) - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
compareToBase(BaseFetchException) - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
containsKey(Object) - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
containsValue(Object) - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
convertToDate(String) - Static method in class crawlercommons.sitemaps.AbstractSiteMap
Convert the given date (given in an acceptable DateFormat); returns null if the date is not in the correct format.
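A minimal sketch of the "parse or return null" contract described above, assuming a single hypothetical pattern (`yyyy-MM-dd`) and a hypothetical helper name `toDateOrNull`. The formats actually accepted by `convertToDate` are not listed in this index.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class DateSketch {
    // Hypothetical helper: parses "yyyy-MM-dd" in UTC and returns null on a mismatch.
    static Date toDateOrNull(String text) {
        if (text == null) {
            return null;
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setLenient(false); // reject impossible dates such as 2014-02-30
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(text);
        } catch (ParseException e) {
            return null; // not in the expected format
        }
    }

    public static void main(String[] args) {
        System.out.println(toDateOrNull("2014-01-01"));
        System.out.println(toDateOrNull("not a date"));
    }
}
```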
crawlercommons - package crawlercommons
 
CrawlerCommons - Class in crawlercommons
 
CrawlerCommons() - Constructor for class crawlercommons.CrawlerCommons
 
crawlercommons.domains - package crawlercommons.domains
Classes contained within the domains package relate to the definition of Top Level Domains, various domain registrars and the effective handling of such domains.
crawlercommons.fetcher - package crawlercommons.fetcher
The main fetching package within Crawler Commons. This package defines base fetching and encoding classes, enums describing the reasons behind typical fetching behaviour, and the base exceptions which may be used.
crawlercommons.fetcher.file - package crawlercommons.fetcher.file
This package includes the SimpleFileFetcher code which extends the BaseFetcher.
crawlercommons.fetcher.http - package crawlercommons.fetcher.http
This package concerns the fetching of files over the HTTP protocol: Extending from BaseHttpFetcher (which itself extends BaseFetcher) the SimpleHttpFetcher provides the Crawler Commons HTTP fetching implementation.
crawlercommons.filters - package crawlercommons.filters
The filters package contains code and resources for URL filtering.
crawlercommons.filters.basic - package crawlercommons.filters.basic
 
crawlercommons.robots - package crawlercommons.robots
The robots package contains all of the robots.txt rule inference, parsing and utilities contained within Crawler Commons.
crawlercommons.sitemaps - package crawlercommons.sitemaps
The sitemaps package provides all classes relevant to focused sitemap parsing, URL definition and processing.
createFetcher(BaseHttpFetcher) - Static method in class crawlercommons.robots.RobotUtils
 
createFetcher(UserAgent, int) - Static method in class crawlercommons.robots.RobotUtils
 

D

DEFAULT_ACCEPT_LANGUAGE - Static variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
DEFAULT_BROWSER_VERSION - Static variable in class crawlercommons.fetcher.http.UserAgent
Deprecated.
 
DEFAULT_CRAWLER_VERSION - Static variable in class crawlercommons.fetcher.http.UserAgent
Deprecated.
 
DEFAULT_MAX_CONNECTIONS_PER_HOST - Static variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
DEFAULT_MAX_CONTENT_SIZE - Static variable in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
DEFAULT_MAX_REDIRECTS - Static variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
DEFAULT_MIN_RESPONSE_RATE - Static variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
DEFAULT_PRIORITY - Static variable in class crawlercommons.sitemaps.SiteMapURL
 
DEFAULT_REDIRECT_MODE - Static variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
DOT - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
DOT_REGEX - Static variable in class crawlercommons.domains.EffectiveTldFinder
 

E

EffectiveTLD(String) - Constructor for class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
EffectiveTldFinder - Class in crawlercommons.domains
Given a URL's hostname, determining the actual domain requires knowledge of the various domain registrars and their assignment policies.
EffectiveTldFinder.EffectiveTLD - Class in crawlercommons.domains
 
EncodingUtils - Class in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
EncodingUtils() - Constructor for class crawlercommons.fetcher.EncodingUtils
Deprecated.
 
EncodingUtils.ExpandedResult - Class in crawlercommons.fetcher
Deprecated.
 
entrySet() - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
equals(Object) - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
equals(Object) - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
equals(Object) - Method in class crawlercommons.robots.BaseRobotRules
 
equals(Object) - Method in class crawlercommons.robots.SimpleRobotRules
 
equals(Object) - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
equals(Object) - Method in class crawlercommons.sitemaps.SiteMapURL
 
ETLD_DATA - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
EXCEPTION - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
ExpandedResult(byte[], boolean) - Constructor for class crawlercommons.fetcher.EncodingUtils.ExpandedResult
Deprecated.
 

F

failedFetch(int) - Method in class crawlercommons.robots.BaseRobotsParser
The fetch of robots.txt failed, so return rules appropriate for the given HTTP status code.
failedFetch(int) - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
fetch(String) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
fetch(HttpRequestBase, String, Payload) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
FetchedResult - Class in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
FetchedResult(String, String, long, Metadata, byte[], String, int, Payload, String, int, String, int, String) - Constructor for class crawlercommons.fetcher.FetchedResult
Deprecated.
 
filter(String) - Method in class crawlercommons.filters.basic.BasicURLNormalizer
 
filter(String) - Method in class crawlercommons.filters.URLFilter
Returns a modified version of the input URL or null if the URL should be removed
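The contract above (return a modified URL, or null to drop it) can be illustrated with a standalone sketch. The class `FragmentStrippingFilter` and its policy (keep only http/https, strip the fragment) are assumptions made for illustration and do not extend the library's `URLFilter`.

```java
import java.util.Locale;

class FragmentStrippingFilter {
    // Returns the URL without its fragment, or null when the URL should be removed.
    static String filter(String url) {
        if (url == null) {
            return null;
        }
        String lower = url.toLowerCase(Locale.ROOT);
        if (!lower.startsWith("http://") && !lower.startsWith("https://")) {
            return null; // assumed policy: drop anything that is not http(s)
        }
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/a#section"));
        System.out.println(filter("mailto:someone@example.com"));
    }
}
```

Callers are expected to check for null before using the result.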
finalize() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 

G

get(String) - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
get(String, Payload) - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
Get the content stored in the resource referenced by the 'url' parameter.
get(String, Payload) - Method in class crawlercommons.fetcher.file.SimpleFileFetcher
Deprecated.
 
get(String, Payload) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
get(Object) - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
getAbortReason() - Method in exception crawlercommons.fetcher.AbortedFetchException
Deprecated.
 
getAcceptLanguage() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
getAgentName() - Method in class crawlercommons.fetcher.http.UserAgent
Deprecated.
Obtain just the user agent name.
getAssignedDomain(String) - Static method in class crawlercommons.domains.EffectiveTldFinder
This method uses the effective TLD to determine which component of a FQDN is the NIC-assigned domain name.
getBaseUrl() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getBaseUrl() - Method in class crawlercommons.sitemaps.SiteMap
 
getCause() - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
getChangeFrequency() - Method in class crawlercommons.sitemaps.SiteMapURL
Return the URL's change frequency
getConnectionRequestTimeout() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
getConnectionTimeout() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
getContent() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getContentLength() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getContentType() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getCookies() - Method in class crawlercommons.fetcher.http.LocalCookieStore
Deprecated.
Returns an immutable array of cookies that this HTTP state currently contains.
getCrawlDelay() - Method in class crawlercommons.robots.BaseRobotRules
 
getDefaultMaxContentSize() - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
getDomain() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
getEffectiveTLD(String) - Static method in class crawlercommons.domains.EffectiveTldFinder
 
getEffectiveTLDs() - Static method in class crawlercommons.domains.EffectiveTldFinder
 
getElementAttributeValue(Element, String, String) - Method in class crawlercommons.sitemaps.SiteMapParser
Get the element's attribute value.
getElementValue(Element, String) - Method in class crawlercommons.sitemaps.SiteMapParser
Get the element's textual content.
getError() - Method in exception crawlercommons.sitemaps.UnknownFormatException
public method, callable by exception catcher.
getExpanded() - Method in class crawlercommons.fetcher.EncodingUtils.ExpandedResult
Deprecated.
 
getFetchedUrl() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getFetchTime() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getFullDateFormat() - Static method in class crawlercommons.sitemaps.AbstractSiteMap
 
getHeaders() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getHostAddress() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getHttpHeaders() - Method in exception crawlercommons.fetcher.HttpFetchException
Deprecated.
 
getHttpStatus() - Method in exception crawlercommons.fetcher.HttpFetchException
Deprecated.
 
getHttpVersion() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
getInstance() - Static method in class crawlercommons.domains.EffectiveTldFinder
 
getKeepAliveDuration(HttpResponse, HttpContext) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher.MyConnectionKeepAliveStrategy
Deprecated.
 
getLastModified() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getLastModified() - Method in class crawlercommons.sitemaps.SiteMapURL
Return when this URL was last modified.
getLocalizedMessage() - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
getMaxConnectionsPerHost() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
getMaxContentSize(String) - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
getMaxFetchTime() - Static method in class crawlercommons.robots.RobotUtils
 
getMaxRedirects() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
getMaxRetryCount() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
getMaxThreads() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
getMessage() - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
getMessage() - Method in exception crawlercommons.fetcher.HttpFetchException
Deprecated.
 
getMimeTypeFromContentType(String) - Static method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
getMinResponseRate() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
Return the minimum response rate.
getNewBaseUrl() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getNumRedirects() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getNumWarnings() - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
getPayload() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getPLD(String) - Static method in class crawlercommons.domains.PaidLevelDomain
Extract the PLD (paid-level domain) from the hostname.
getPLD(URL) - Static method in class crawlercommons.domains.PaidLevelDomain
Extract the PLD (paid-level domain) from the URL.
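As a rough illustration of paid-level domain extraction, the sketch below assumes a tiny hand-written suffix set (`com`, `org`, `co.uk`) in place of a full effective-TLD list. The class `PldSketch`, the method `paidLevelDomain` and the fallback of returning the hostname unchanged are all assumptions, not the behaviour of `getPLD`.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

class PldSketch {
    // Assumption: a toy suffix set; real data would come from a full effective-TLD list.
    static final Set<String> SUFFIXES = Set.of("com", "org", "co.uk");

    // Returns the registrable suffix plus one more label, e.g. "example.co.uk".
    static String paidLevelDomain(String hostname) {
        String host = hostname.toLowerCase(Locale.ROOT);
        String[] labels = host.split("\\.");
        // Try the longest candidate suffix first, then progressively shorter ones.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return host; // no known suffix: return the (lowercased) hostname as-is
    }

    public static void main(String[] args) {
        System.out.println(paidLevelDomain("www.example.co.uk"));
        System.out.println(paidLevelDomain("a.b.example.com"));
    }
}
```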
getPriority() - Method in class crawlercommons.sitemaps.SiteMapURL
Return this URL's priority (a value between [0.0 - 1.0]).
getReason() - Method in exception crawlercommons.fetcher.RedirectFetchException
Deprecated.
 
getReasonPhrase() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getRedirectedUrl() - Method in exception crawlercommons.fetcher.RedirectFetchException
Deprecated.
 
getRedirectMode() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
getResponseRate() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getRobotRules(BaseHttpFetcher, BaseRobotsParser, URL) - Static method in class crawlercommons.robots.RobotUtils
Externally visible, static method for use in tools and for testing.
getSitemap(URL) - Method in class crawlercommons.sitemaps.SiteMapIndex
Returns the Sitemap that has the given URL.
getSitemaps() - Method in class crawlercommons.robots.BaseRobotRules
 
getSitemaps() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
getSiteMapUrls() - Method in class crawlercommons.sitemaps.SiteMap
 
getSocketTimeout() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
getStackTrace() - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
getStatusCode() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
getType() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getUrl() - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
getUrl() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getUrl() - Method in class crawlercommons.sitemaps.SiteMapURL
Return the URL.
getUserAgent() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
getUserAgentString() - Method in class crawlercommons.fetcher.http.UserAgent
Deprecated.
Obtain a String representing the user agent characteristics.
getValidMimeTypes() - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
getVersion() - Static method in class crawlercommons.CrawlerCommons
 

H

hashCode() - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
hashCode() - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
hashCode() - Method in class crawlercommons.robots.BaseRobotRules
 
hashCode() - Method in class crawlercommons.robots.SimpleRobotRules
 
hashCode() - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
hashCode() - Method in class crawlercommons.sitemaps.SiteMapURL
 
hasUnprocessedSitemap() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
HttpFetchException - Exception in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
HttpFetchException() - Constructor for exception crawlercommons.fetcher.HttpFetchException
Deprecated.
 
HttpFetchException(String, String, int, Metadata) - Constructor for exception crawlercommons.fetcher.HttpFetchException
Deprecated.
 

I

IdleConnectionMonitorThread(HttpClientConnectionManager) - Constructor for class crawlercommons.fetcher.http.SimpleHttpFetcher.IdleConnectionMonitorThread
Deprecated.
 
initCause(Throwable) - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
initialize(InputStream) - Method in class crawlercommons.domains.EffectiveTldFinder
 
IOFetchException - Exception in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
IOFetchException() - Constructor for exception crawlercommons.fetcher.IOFetchException
Deprecated.
 
IOFetchException(String, IOException) - Constructor for exception crawlercommons.fetcher.IOFetchException
Deprecated.
 
isAllowAll() - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowAll() - Method in class crawlercommons.robots.SimpleRobotRules
Is our ruleset set up to allow all access?
isAllowed(String) - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowed(String) - Method in class crawlercommons.robots.SimpleRobotRules
 
isAllowNone() - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowNone() - Method in class crawlercommons.robots.SimpleRobotRules
Is our ruleset set up to disallow all access?
isConfigured() - Method in class crawlercommons.domains.EffectiveTldFinder
 
isDeferVisits() - Method in class crawlercommons.robots.BaseRobotRules
 
isEmpty() - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
isException() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
isIndex() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
isIndex() - Method in class crawlercommons.sitemaps.SiteMap
 
isIndex() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
isProcessed() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
isStrict() - Method in class crawlercommons.sitemaps.SiteMapParser
 
isTruncated() - Method in class crawlercommons.fetcher.EncodingUtils.ExpandedResult
Deprecated.
 
isValid() - Method in class crawlercommons.sitemaps.SiteMapURL
Is the siteMapURL under the base URL?
isWild() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 

K

keySet() - Method in class crawlercommons.fetcher.Payload
Deprecated.
 

L

LocalCookieStore - Class in crawlercommons.fetcher.http
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
LocalCookieStore() - Constructor for class crawlercommons.fetcher.http.LocalCookieStore
Deprecated.
 
LOG - Static variable in class crawlercommons.filters.basic.BasicURLNormalizer
 
LOG - Static variable in class crawlercommons.sitemaps.SiteMapParser
 

M

main(String[]) - Static method in class crawlercommons.filters.basic.BasicURLNormalizer
 
main(String[]) - Static method in class crawlercommons.sitemaps.SiteMapTester
 
MAX_BYTES_ALLOWED - Static variable in class crawlercommons.sitemaps.SiteMapParser
Sitemap docs must be limited to 10MB (10,485,760 bytes)
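The byte figure above follows from 10 × 1024 × 1024. A small sketch of a size guard built on that arithmetic, where the class `SitemapSizeSketch` and the method `withinLimit` are hypothetical names:

```java
class SitemapSizeSketch {
    // 10 MB expressed in bytes: 10 * 1024 * 1024 = 10,485,760
    static final int MAX_BYTES = 10 * 1024 * 1024;

    // True when the content does not exceed the limit.
    static boolean withinLimit(byte[] content) {
        return content.length <= MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(MAX_BYTES);
        System.out.println(withinLimit(new byte[1024]));
    }
}
```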
MyConnectionKeepAliveStrategy() - Constructor for class crawlercommons.fetcher.http.SimpleHttpFetcher.MyConnectionKeepAliveStrategy
Deprecated.
 

N

nextUnprocessedSitemap() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
NO_MIN_RESPONSE_RATE - Static variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
NO_REDIRECTS - Static variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 

P

PaidLevelDomain - Class in crawlercommons.domains
Routines to extract the PLD (paid-level domain, as per the IRLbot paper) from a hostname or URL.
PaidLevelDomain() - Constructor for class crawlercommons.domains.PaidLevelDomain
 
parseAtom(SiteMap, Element, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the XML document which is assumed to be in Atom format.
parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.BaseRobotsParser
Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent.
parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
parseRSS(SiteMap, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse XML document which is assumed to be in RSS format.
parseSiteMap(URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Returns a SiteMap or SiteMapIndex given an online sitemap URL
parseSiteMap(String, byte[], AbstractSiteMap) - Method in class crawlercommons.sitemaps.SiteMapParser
Returns a processed copy of an unprocessed sitemap object, i.e.
parseSiteMap(byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse a sitemap, given the content bytes and the URL.
parseSiteMap(String, byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse a sitemap, given the MIME type, the content bytes, and the URL.
parseSitemapIndex(URL, NodeList) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse XML that contains a Sitemap Index.
parseSyndicationFormat(URL, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the XML document, looking for a feed element to determine if it's an Atom doc, or an rss element to determine if it's an RSS doc.
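One way to sketch this kind of detection with the JDK's DOM parser is shown below. As a simplification it inspects only the root element's local name, and the class `FeedTypeSketch`, the method `detect` and the returned labels are assumptions for illustration, not the parser's actual logic.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class FeedTypeSketch {
    // Returns "ATOM" for a <feed> root, "RSS" for an <rss> root, otherwise "UNKNOWN".
    static String detect(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so getLocalName() ignores any prefix
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            String root = doc.getDocumentElement().getLocalName();
            if ("feed".equals(root)) {
                return "ATOM";
            }
            if ("rss".equals(root)) {
                return "RSS";
            }
            return "UNKNOWN";
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(detect("<rss version=\"2.0\"><channel/></rss>"));
    }
}
```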
parseXmlSitemap(URL, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse XML that contains a valid Sitemap.
Payload - Class in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
Payload() - Constructor for class crawlercommons.fetcher.Payload
Deprecated.
 
printStackTrace() - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
printStackTrace(PrintStream) - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
printStackTrace(PrintWriter) - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
processDeflateEncoded(byte[]) - Static method in class crawlercommons.fetcher.EncodingUtils
Deprecated.
 
processDeflateEncoded(byte[], int) - Static method in class crawlercommons.fetcher.EncodingUtils
Deprecated.
 
processGzip(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Decompress the gzipped content and process the resulting XML Sitemap.
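The decompression step can be sketched with the JDK's `java.util.zip` classes. The class `GzipSketch` and its helpers are hypothetical, and the `gzip` method exists only to produce test input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipSketch {
    // Decompresses gzipped bytes fully into memory.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used only to build sample input for the sketch.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<urlset/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(gunzip(gzip(xml)), StandardCharsets.UTF_8));
    }
}
```

A production reader would also cap the number of decompressed bytes, since a small gzip file can expand far beyond the size limit that applies to sitemaps.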
processGzipEncoded(byte[]) - Static method in class crawlercommons.fetcher.EncodingUtils
Deprecated.
 
processGzipEncoded(byte[], int) - Static method in class crawlercommons.fetcher.EncodingUtils
Deprecated.
 
processText(String, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Process a text-based Sitemap.
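A text sitemap lists one URL per line. The sketch below reads such content under the assumptions that it is UTF-8 and that blank or malformed lines are simply skipped. `TextSitemapSketch` and `parse` are hypothetical names, and the skipping policy is not taken from `processText`.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TextSitemapSketch {
    // Collects the absolute URLs found one per line in the given content.
    static List<String> parse(byte[] content) {
        List<String> urls = new ArrayList<>();
        String text = new String(content, StandardCharsets.UTF_8);
        for (String line : text.split("\\R")) {
            String candidate = line.trim();
            if (candidate.isEmpty()) {
                continue;
            }
            try {
                URI uri = new URI(candidate);
                if (uri.isAbsolute() && uri.getHost() != null) {
                    urls.add(candidate);
                }
            } catch (URISyntaxException e) {
                // assumed policy: ignore lines that are not valid URIs
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String sample = "http://example.com/a\n\nnot a url\nhttp://example.com/b\n";
        System.out.println(parse(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```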
processXml(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the given XML content.
processXml(URL, InputSource) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the given XML content.
put(String, Object) - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
putAll(Map<? extends String, ? extends Object>) - Method in class crawlercommons.fetcher.Payload
Deprecated.
 

R

readBaseFields(DataInput) - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
RedirectFetchException - Exception in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
RedirectFetchException() - Constructor for exception crawlercommons.fetcher.RedirectFetchException
Deprecated.
 
RedirectFetchException(String, String, RedirectFetchException.RedirectExceptionReason) - Constructor for exception crawlercommons.fetcher.RedirectFetchException
Deprecated.
 
RedirectFetchException.RedirectExceptionReason - Enum in crawlercommons.fetcher
Deprecated.
 
remove(Object) - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
report() - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
Produces a neat report containing everything from a FetchedResult.
RobotRule(String, boolean) - Constructor for class crawlercommons.robots.SimpleRobotRules.RobotRule
 
RobotUtils - Class in crawlercommons.robots
 
RobotUtils() - Constructor for class crawlercommons.robots.RobotUtils
 
run() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher.IdleConnectionMonitorThread
Deprecated.
 

S

setAcceptLanguage(String) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
setChangeFrequency(SiteMapURL.ChangeFrequency) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's change frequency
setChangeFrequency(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's change frequency. In case of a bad ChangeFrequency, the current frequency in this instance will be set to NULL.
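The "bad value becomes null" behaviour can be sketched with a plain enum lookup. The enum below uses the change-frequency values defined by the sitemaps protocol; `ChangeFrequencySketch`, `Frequency` and `parse` are hypothetical names rather than the library's own types.

```java
import java.util.Locale;

class ChangeFrequencySketch {
    // Values defined by the sitemaps protocol for <changefreq>.
    enum Frequency { ALWAYS, HOURLY, DAILY, WEEKLY, MONTHLY, YEARLY, NEVER }

    // Returns the matching constant, or null for a missing or unknown value.
    static Frequency parse(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            return Frequency.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("daily"));
        System.out.println(parse("fortnightly"));
    }
}
```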
setConnectionRequestTimeout(int) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
setConnectionTimeout(int) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
setCrawlDelay(long) - Method in class crawlercommons.robots.BaseRobotRules
 
setDefaultMaxContentSize(int) - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
setDeferVisits(boolean) - Method in class crawlercommons.robots.BaseRobotRules
 
setExpanded(byte[]) - Method in class crawlercommons.fetcher.EncodingUtils.ExpandedResult
Deprecated.
 
setHttpVersion(HttpVersion) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
setLastModified(Date) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setLastModified(String) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setLastModified(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set when this URL was last modified.
setLastModified(Date) - Method in class crawlercommons.sitemaps.SiteMapURL
Set when this URL was last modified.
setMaxConnectionsPerHost(int) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
setMaxContentSize(String, int) - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
setMaxRedirects(int) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
setMaxRetryCount(int) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
setMinResponseRate(int) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
setPayload(Payload) - Method in class crawlercommons.fetcher.FetchedResult
Deprecated.
 
setPriority(double) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's priority to a value between [0.0 - 1.0] (Default Priority is used if the given priority is out of range).
setPriority(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's priority to a value between [0.0 - 1.0] (Default Priority is used if the given priority is missing or out of range).
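A minimal sketch of the range check with a fallback, assuming the default is 0.5 (the default priority in the sitemaps protocol; the value of `DEFAULT_PRIORITY` is not shown in this index). `PrioritySketch` and `normalize` are hypothetical names.

```java
class PrioritySketch {
    // Assumption: 0.5, the default priority given by the sitemaps protocol.
    static final double DEFAULT_PRIORITY = 0.5;

    // Returns the parsed priority, or the default when missing, unparsable or out of range.
    static double normalize(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_PRIORITY;
        }
        try {
            double value = Double.parseDouble(raw.trim());
            // NaN fails both comparisons, so it also falls back to the default
            return (value >= 0.0 && value <= 1.0) ? value : DEFAULT_PRIORITY;
        } catch (NumberFormatException e) {
            return DEFAULT_PRIORITY;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("0.8"));
        System.out.println(normalize("1.5"));
    }
}
```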
setProcessed(boolean) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setRedirectMode(BaseHttpFetcher.RedirectMode) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
setSocketTimeout(int) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
setStackTrace(StackTraceElement[]) - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
setTruncated(boolean) - Method in class crawlercommons.fetcher.EncodingUtils.ExpandedResult
Deprecated.
 
setType(AbstractSiteMap.SitemapType) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setUrl(URL) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL.
setUrl(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL.
setValid(boolean) - Method in class crawlercommons.sitemaps.SiteMapURL
Valid means that the URL follows the official guideline that a siteMapURL must be under the base URL.
setValidMimeTypes(Set<String>) - Method in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
SimpleFileFetcher - Class in crawlercommons.fetcher.file
Deprecated.
As of release 0.6.
SimpleFileFetcher() - Constructor for class crawlercommons.fetcher.file.SimpleFileFetcher
Deprecated.
 
SimpleHttpFetcher - Class in crawlercommons.fetcher.http
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
SimpleHttpFetcher(UserAgent) - Constructor for class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
SimpleHttpFetcher(int, UserAgent) - Constructor for class crawlercommons.fetcher.http.SimpleHttpFetcher
Deprecated.
 
SimpleHttpFetcher.IdleConnectionMonitorThread - Class in crawlercommons.fetcher.http
Deprecated.
 
SimpleHttpFetcher.MyConnectionKeepAliveStrategy - Class in crawlercommons.fetcher.http
Deprecated.
 
SimpleRobotRules - Class in crawlercommons.robots
Result from parsing a single robots.txt file - which means we get a set of rules, and a crawl-delay.
SimpleRobotRules() - Constructor for class crawlercommons.robots.SimpleRobotRules
 
SimpleRobotRules(SimpleRobotRules.RobotRulesMode) - Constructor for class crawlercommons.robots.SimpleRobotRules
 
SimpleRobotRules.RobotRule - Class in crawlercommons.robots
Single rule that maps from a path prefix to an allow flag.
SimpleRobotRules.RobotRulesMode - Enum in crawlercommons.robots
 
SimpleRobotRulesParser - Class in crawlercommons.robots
This implementation of BaseRobotsParser retrieves a set of rules for an agent with the given name from the robots.txt file of a given domain.
SimpleRobotRulesParser() - Constructor for class crawlercommons.robots.SimpleRobotRulesParser
 
SiteMap - Class in crawlercommons.sitemaps
 
SiteMap() - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(URL) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(String) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(URL, Date) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(String, String) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMapIndex - Class in crawlercommons.sitemaps
 
SiteMapIndex() - Constructor for class crawlercommons.sitemaps.SiteMapIndex
 
SiteMapIndex(URL) - Constructor for class crawlercommons.sitemaps.SiteMapIndex
 
SiteMapParser - Class in crawlercommons.sitemaps
 
SiteMapParser() - Constructor for class crawlercommons.sitemaps.SiteMapParser
 
SiteMapParser(boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParser
 
SiteMapTester - Class in crawlercommons.sitemaps
Sitemap tool for recursively fetching all URLs from a sitemap (and all of its children).
SiteMapTester() - Constructor for class crawlercommons.sitemaps.SiteMapTester
 
SiteMapURL - Class in crawlercommons.sitemaps
The SiteMapURL class represents a URL found in a Sitemap.
SiteMapURL(String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(URL, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(String, String, String, String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(URL, Date, SiteMapURL.ChangeFrequency, double, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL.ChangeFrequency - Enum in crawlercommons.sitemaps
Allowed change frequencies
size() - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
sortRules() - Method in class crawlercommons.robots.SimpleRobotRules
In order to match up with Google's convention, we want to match rules from longest to shortest.
strict - Variable in class crawlercommons.sitemaps.SiteMapParser
True (the default) means that invalid URLs are rejected, because the official docs allow siteMapURLs only under the base URL: http://www.sitemaps.org/protocol.html#location
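
The sortRules() entry above describes matching rules from longest to shortest. A minimal sketch of that idea follows, assuming a rule is just a path prefix plus an allow flag. The class LongestPrefixRules, its nested Rule type, and isAllowed are hypothetical names used for illustration; this is not the crawler-commons implementation, and treating unmatched paths as allowed is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not the crawler-commons SimpleRobotRules class.
class LongestPrefixRules {

    // A path prefix and whether paths starting with it may be crawled.
    static final class Rule {
        final String prefix;
        final boolean allow;

        Rule(String prefix, boolean allow) {
            this.prefix = prefix;
            this.allow = allow;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(String prefix, boolean allow) {
        rules.add(new Rule(prefix, allow));
    }

    // Order the rules so that longer (more specific) prefixes come first.
    void sortRules() {
        rules.sort(Comparator.comparingInt((Rule r) -> r.prefix.length()).reversed());
    }

    // The first matching rule wins; a path that matches no rule is allowed (assumption).
    boolean isAllowed(String path) {
        for (Rule r : rules) {
            if (path.startsWith(r.prefix)) {
                return r.allow;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        LongestPrefixRules rules = new LongestPrefixRules();
        rules.addRule("/docs/", false);
        rules.addRule("/docs/public/", true);
        rules.sortRules();
        System.out.println(rules.isAllowed("/docs/public/a.html"));
        System.out.println(rules.isAllowed("/docs/private/a.html"));
    }
}
```

Sorting once after all rules are added keeps the lookup a simple first-match scan: the more specific "/docs/public/" rule is checked before the broader "/docs/" rule.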

T

toString() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
toString() - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 
toString() - Method in class crawlercommons.fetcher.http.LocalCookieStore
Deprecated.
 
toString() - Method in class crawlercommons.sitemaps.SiteMap
 
toString() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
toString() - Method in class crawlercommons.sitemaps.SiteMapURL
 

U

UnknownFormatException - Exception in crawlercommons.sitemaps
 
UnknownFormatException() - Constructor for exception crawlercommons.sitemaps.UnknownFormatException
Default constructor - initializes instance variable to unknown
UnknownFormatException(String) - Constructor for exception crawlercommons.sitemaps.UnknownFormatException
Constructor receives some kind of message that is saved in an instance variable.
UNSET_CRAWL_DELAY - Static variable in class crawlercommons.robots.BaseRobotRules
 
url - Variable in class crawlercommons.sitemaps.AbstractSiteMap
 
UrlFetchException - Exception in crawlercommons.fetcher
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
UrlFetchException() - Constructor for exception crawlercommons.fetcher.UrlFetchException
Deprecated.
 
UrlFetchException(String, String) - Constructor for exception crawlercommons.fetcher.UrlFetchException
Deprecated.
 
URLFilter - Class in crawlercommons.filters
 
URLFilter() - Constructor for class crawlercommons.filters.URLFilter
 
urlIsValid(String, String) - Method in class crawlercommons.sitemaps.SiteMapParser
See if testUrl is under sitemapBaseUrl.
UserAgent - Class in crawlercommons.fetcher.http
Deprecated.
As of release 0.6. We recommend directly using Apache HttpClient, async-http-client, or any other robust, industrial-strength HTTP clients.
UserAgent(String, String, String) - Constructor for class crawlercommons.fetcher.http.UserAgent
Deprecated.
Set user agent characteristics
UserAgent(String, String, String, String) - Constructor for class crawlercommons.fetcher.http.UserAgent
Deprecated.
Set user agent characteristics
UserAgent(String, String, String, String, String) - Constructor for class crawlercommons.fetcher.http.UserAgent
Deprecated.
Set user agent characteristics
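
The urlIsValid(String, String) entry above checks whether testUrl is under sitemapBaseUrl. A minimal sketch of such a check follows, assuming the base is the sitemap's own URL up to and including its last '/' and that the comparison is a case-sensitive prefix match. SitemapLocationCheck, baseOf, and isUnderBase are hypothetical names, the parameter order is an assumption, and the library's actual logic may differ.

```java
// Hypothetical illustration only; not the crawler-commons SiteMapParser logic.
class SitemapLocationCheck {

    // Directory part of the sitemap URL: everything up to and including the last '/'.
    // Assumes the sitemap URL contains a path component, e.g. ".../catalog/sitemap.xml".
    static String baseOf(String sitemapUrl) {
        int lastSlash = sitemapUrl.lastIndexOf('/');
        return lastSlash < 0 ? sitemapUrl : sitemapUrl.substring(0, lastSlash + 1);
    }

    // True if testUrl starts with sitemapBaseUrl (case-sensitive prefix match).
    static boolean isUnderBase(String sitemapBaseUrl, String testUrl) {
        if (sitemapBaseUrl == null || testUrl == null) {
            return false;
        }
        return testUrl.startsWith(sitemapBaseUrl);
    }

    public static void main(String[] args) {
        String base = baseOf("http://example.com/catalog/sitemap.xml");
        System.out.println(isUnderBase(base, "http://example.com/catalog/show?item=23"));
        System.out.println(isUnderBase(base, "http://example.com/images/logo.png"));
    }
}
```

A parser running in strict mode could use a check like this to decide whether to reject a URL or merely mark it as not valid.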

V

valueOf(String) - Static method in enum crawlercommons.fetcher.AbortedFetchReason
Deprecated.
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.fetcher.http.BaseHttpFetcher.RedirectMode
Deprecated.
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.fetcher.RedirectFetchException.RedirectExceptionReason
Deprecated.
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.robots.SimpleRobotRules.RobotRulesMode
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.sitemaps.AbstractSiteMap.SitemapType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.sitemaps.SiteMapURL.ChangeFrequency
Returns the enum constant of this type with the specified name.
values() - Static method in enum crawlercommons.fetcher.AbortedFetchReason
Deprecated.
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum crawlercommons.fetcher.http.BaseHttpFetcher.RedirectMode
Deprecated.
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Method in class crawlercommons.fetcher.Payload
Deprecated.
 
values() - Static method in enum crawlercommons.fetcher.RedirectFetchException.RedirectExceptionReason
Deprecated.
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum crawlercommons.robots.SimpleRobotRules.RobotRulesMode
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum crawlercommons.sitemaps.AbstractSiteMap.SitemapType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum crawlercommons.sitemaps.SiteMapURL.ChangeFrequency
Returns an array containing the constants of this enum type, in the order they are declared.

W

WILD_CARD - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
writeBaseFields(DataOutput) - Method in exception crawlercommons.fetcher.BaseFetchException
Deprecated.
 

_

_acceptLanguage - Variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
_defaultMaxContentSize - Variable in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
_maxConnectionsPerHost - Variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
_maxContentSizes - Variable in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 
_maxRedirects - Variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
_maxThreads - Variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
_minResponseRate - Variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
_redirectMode - Variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
_userAgent - Variable in class crawlercommons.fetcher.http.BaseHttpFetcher
Deprecated.
 
_validMimeTypes - Variable in class crawlercommons.fetcher.BaseFetcher
Deprecated.
 

Copyright © 2009–2016 Crawler-Commons. All rights reserved.