- get(String) - Method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
- get(String, Payload) - Method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
Get the content stored in the resource referenced by the 'url' parameter.
- get(String, Payload) - Method in class crawlercommons.fetcher.file.SimpleFileFetcher
-
Deprecated.
- get(String, Payload) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- get(Object) - Method in class crawlercommons.fetcher.Payload
-
Deprecated.
- getAbortReason() - Method in exception crawlercommons.fetcher.AbortedFetchException
-
Deprecated.
- getAcceptLanguage() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- getAgentName() - Method in class crawlercommons.fetcher.http.UserAgent
-
Deprecated.
Obtain just the user agent name.
- getAssignedDomain(String) - Static method in class crawlercommons.domains.EffectiveTldFinder
-
This method uses the effective TLD to determine which component of a FQDN
is the NIC-assigned domain name.
- getBaseUrl() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getBaseUrl() - Method in class crawlercommons.sitemaps.SiteMap
-
- getCause() - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- getChangeFrequency() - Method in class crawlercommons.sitemaps.SiteMapURL
-
Return the URL's change frequency.
- getConnectionRequestTimeout() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- getConnectionTimeout() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- getContent() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getContentLength() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getContentType() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getCookies() - Method in class crawlercommons.fetcher.http.LocalCookieStore
-
Deprecated.
Returns an immutable array of cookies that this HTTP state currently contains.
- getCrawlDelay() - Method in class crawlercommons.robots.BaseRobotRules
-
- getDefaultMaxContentSize() - Method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
- getDomain() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
-
- getEffectiveTLD(String) - Static method in class crawlercommons.domains.EffectiveTldFinder
-
- getEffectiveTLDs() - Static method in class crawlercommons.domains.EffectiveTldFinder
-
- getElementAttributeValue(Element, String, String) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Get the element's attribute value.
- getElementValue(Element, String) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Get the element's textual content.
- getError() - Method in exception crawlercommons.sitemaps.UnknownFormatException
-
Public method, callable by the exception catcher.
- getExpanded() - Method in class crawlercommons.fetcher.EncodingUtils.ExpandedResult
-
Deprecated.
- getFetchedUrl() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getFetchTime() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getFullDateFormat() - Static method in class crawlercommons.sitemaps.AbstractSiteMap
-
- getHeaders() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getHostAddress() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getHttpHeaders() - Method in exception crawlercommons.fetcher.HttpFetchException
-
Deprecated.
- getHttpStatus() - Method in exception crawlercommons.fetcher.HttpFetchException
-
Deprecated.
- getHttpVersion() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- getInstance() - Static method in class crawlercommons.domains.EffectiveTldFinder
-
- getKeepAliveDuration(HttpResponse, HttpContext) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher.MyConnectionKeepAliveStrategy
-
Deprecated.
- getLastModified() - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- getLastModified() - Method in class crawlercommons.sitemaps.SiteMapURL
-
Return when this URL was last modified.
- getLocalizedMessage() - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- getMaxConnectionsPerHost() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- getMaxContentSize(String) - Method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
- getMaxFetchTime() - Static method in class crawlercommons.robots.RobotUtils
-
- getMaxRedirects() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- getMaxRetryCount() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- getMaxThreads() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- getMessage() - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- getMessage() - Method in exception crawlercommons.fetcher.HttpFetchException
-
Deprecated.
- getMimeTypeFromContentType(String) - Static method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
- getMinResponseRate() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
Return the minimum response rate.
- getNewBaseUrl() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getNumRedirects() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getNumWarnings() - Method in class crawlercommons.robots.SimpleRobotRulesParser
-
- getPayload() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getPLD(String) - Static method in class crawlercommons.domains.PaidLevelDomain
-
Extract the PLD (paid-level domain) from the hostname.
- getPLD(URL) - Static method in class crawlercommons.domains.PaidLevelDomain
-
Extract the PLD (paid-level domain) from the URL.
- getPriority() - Method in class crawlercommons.sitemaps.SiteMapURL
-
Return this URL's priority (a value in the range [0.0, 1.0]).
- getReason() - Method in exception crawlercommons.fetcher.RedirectFetchException
-
Deprecated.
- getReasonPhrase() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getRedirectedUrl() - Method in exception crawlercommons.fetcher.RedirectFetchException
-
Deprecated.
- getRedirectMode() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- getResponseRate() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getRobotRules(BaseHttpFetcher, BaseRobotsParser, URL) - Static method in class crawlercommons.robots.RobotUtils
-
Externally visible, static method for use in tools and for testing.
- getSitemap(URL) - Method in class crawlercommons.sitemaps.SiteMapIndex
-
Returns the Sitemap that has the given URL.
- getSitemaps() - Method in class crawlercommons.robots.BaseRobotRules
-
- getSitemaps() - Method in class crawlercommons.sitemaps.SiteMapIndex
-
- getSiteMapUrls() - Method in class crawlercommons.sitemaps.SiteMap
-
- getSocketTimeout() - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- getStackTrace() - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- getStatusCode() - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- getType() - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- getUrl() - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- getUrl() - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- getUrl() - Method in class crawlercommons.sitemaps.SiteMapURL
-
Return the URL.
- getUserAgent() - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- getUserAgentString() - Method in class crawlercommons.fetcher.http.UserAgent
-
Deprecated.
Obtain a String representing the user agent characteristics.
- getValidMimeTypes() - Method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
- getVersion() - Static method in class crawlercommons.CrawlerCommons
-
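The getAssignedDomain(String) and getPLD(String) entries above both describe reducing a hostname to its registered domain. As a rough illustration only, and not the library's implementation, the sketch below shows the general idea of matching the longest known suffix and keeping one extra label. The class AssignedDomainSketch, the method assignedDomain, and the four-entry suffix table are all assumptions made for this example; a real finder works from the full effective-TLD list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class AssignedDomainSketch {
    // Hypothetical, deliberately tiny suffix table (assumption for this sketch only).
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "uk", "co.uk"));

    /**
     * Returns the longest known suffix plus the one label to its left,
     * or null if the hostname is itself a suffix or matches no suffix.
     */
    static String assignedDomain(String hostname) {
        String host = hostname.toLowerCase(Locale.ROOT);
        if (SUFFIXES.contains(host)) {
            return null; // the hostname is a bare suffix, nothing is assigned below it
        }
        String[] labels = host.split("\\.");
        // Try candidate suffixes from longest to shortest, always leaving one label in front.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(assignedDomain("www.example.co.uk")); // example.co.uk
        System.out.println(assignedDomain("blog.example.com"));  // example.com
    }
}
```

Checking longer suffixes first matters here: with "uk" and "co.uk" both in the table, a shortest-first scan would wrongly report "co.uk" as the assigned domain.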
- PaidLevelDomain - Class in crawlercommons.domains
-
Routines to extract the PLD (paid-level domain, as per the IRLbot paper) from
a hostname or URL.
- PaidLevelDomain() - Constructor for class crawlercommons.domains.PaidLevelDomain
-
- parseAtom(SiteMap, Element, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse the XML document which is assumed to be in Atom format.
- parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.BaseRobotsParser
-
Parse the robots.txt file in content, and return rules appropriate
for processing paths by userAgent.
- parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.SimpleRobotRulesParser
-
- parseRSS(SiteMap, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse the XML document, which is assumed to be in RSS format.
- parseSiteMap(URL) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Returns a SiteMap or SiteMapIndex given an online sitemap URL.
- parseSiteMap(String, byte[], AbstractSiteMap) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Returns a processed copy of an unprocessed sitemap object.
- parseSiteMap(byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse a sitemap, given the content bytes and the URL.
- parseSiteMap(String, byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse a sitemap, given the MIME type, the content bytes, and the URL.
- parseSitemapIndex(URL, NodeList) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse XML that contains a Sitemap Index.
- parseSyndicationFormat(URL, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse the XML document, looking for a feed element to determine if
it's an Atom document, or an rss element to determine if it's an RSS
document.
- parseXmlSitemap(URL, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse XML that contains a valid Sitemap.
- Payload - Class in crawlercommons.fetcher
-
- Payload() - Constructor for class crawlercommons.fetcher.Payload
-
Deprecated.
- printStackTrace() - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- printStackTrace(PrintStream) - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- printStackTrace(PrintWriter) - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- processDeflateEncoded(byte[]) - Static method in class crawlercommons.fetcher.EncodingUtils
-
Deprecated.
- processDeflateEncoded(byte[], int) - Static method in class crawlercommons.fetcher.EncodingUtils
-
Deprecated.
- processGzip(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Decompress the gzipped content and process the resulting XML Sitemap.
- processGzipEncoded(byte[]) - Static method in class crawlercommons.fetcher.EncodingUtils
-
Deprecated.
- processGzipEncoded(byte[], int) - Static method in class crawlercommons.fetcher.EncodingUtils
-
Deprecated.
- processText(String, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Process a text-based Sitemap.
- processXml(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse the given XML content.
- processXml(URL, InputSource) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse the given XML content.
- put(String, Object) - Method in class crawlercommons.fetcher.Payload
-
Deprecated.
- putAll(Map<? extends String, ? extends Object>) - Method in class crawlercommons.fetcher.Payload
-
Deprecated.
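The processGzip(URL, byte[]) entry above mentions decompressing gzipped content before the XML is parsed. A minimal sketch of that decompression step, using only the JDK's java.util.zip classes, might look like the following. The class GzipSketch and its gzip and gunzip helpers are names invented for this example and are not part of crawler-commons; the sketch also ignores size limits that a crawler would normally enforce.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {
    /** Decompresses gzip-encoded bytes into the original bytes. */
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses bytes with gzip; used here only to build test input. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<urlset></urlset>".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = gunzip(gzip(xml));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```

The decompressed bytes could then be handed to an XML parsing step such as the processXml entries listed above.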
- setAcceptLanguage(String) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- setChangeFrequency(SiteMapURL.ChangeFrequency) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL's change frequency.
- setChangeFrequency(String) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL's change frequency. If the given ChangeFrequency is invalid, the
current frequency in this instance is set to null.
- setConnectionRequestTimeout(int) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- setConnectionTimeout(int) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- setCrawlDelay(long) - Method in class crawlercommons.robots.BaseRobotRules
-
- setDefaultMaxContentSize(int) - Method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
- setDeferVisits(boolean) - Method in class crawlercommons.robots.BaseRobotRules
-
- setExpanded(byte[]) - Method in class crawlercommons.fetcher.EncodingUtils.ExpandedResult
-
Deprecated.
- setHttpVersion(HttpVersion) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- setLastModified(Date) - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- setLastModified(String) - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- setLastModified(String) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set when this URL was last modified.
- setLastModified(Date) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set when this URL was last modified.
- setMaxConnectionsPerHost(int) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- setMaxContentSize(String, int) - Method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
- setMaxRedirects(int) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- setMaxRetryCount(int) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- setMinResponseRate(int) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- setPayload(Payload) - Method in class crawlercommons.fetcher.FetchedResult
-
Deprecated.
- setPriority(double) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL's priority to a value in the range [0.0, 1.0] (the default priority
is used if the given priority is out of range).
- setPriority(String) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL's priority to a value in the range [0.0, 1.0] (the default priority
is used if the given priority is missing or out of range).
- setProcessed(boolean) - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- setRedirectMode(BaseHttpFetcher.RedirectMode) - Method in class crawlercommons.fetcher.http.BaseHttpFetcher
-
Deprecated.
- setSocketTimeout(int) - Method in class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- setStackTrace(StackTraceElement[]) - Method in exception crawlercommons.fetcher.BaseFetchException
-
Deprecated.
- setTruncated(boolean) - Method in class crawlercommons.fetcher.EncodingUtils.ExpandedResult
-
Deprecated.
- setType(AbstractSiteMap.SitemapType) - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- setUrl(URL) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL.
- setUrl(String) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL.
- setValid(boolean) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Valid means that it follows the official guideline that the siteMapURL
must be under the base URL.
- setValidMimeTypes(Set<String>) - Method in class crawlercommons.fetcher.BaseFetcher
-
Deprecated.
- SimpleFileFetcher - Class in crawlercommons.fetcher.file
-
- SimpleFileFetcher() - Constructor for class crawlercommons.fetcher.file.SimpleFileFetcher
-
Deprecated.
- SimpleHttpFetcher - Class in crawlercommons.fetcher.http
-
- SimpleHttpFetcher(UserAgent) - Constructor for class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- SimpleHttpFetcher(int, UserAgent) - Constructor for class crawlercommons.fetcher.http.SimpleHttpFetcher
-
Deprecated.
- SimpleHttpFetcher.IdleConnectionMonitorThread - Class in crawlercommons.fetcher.http
-
Deprecated.
- SimpleHttpFetcher.MyConnectionKeepAliveStrategy - Class in crawlercommons.fetcher.http
-
Deprecated.
- SimpleRobotRules - Class in crawlercommons.robots
-
Result from parsing a single robots.txt file: a set of rules and a
crawl delay.
- SimpleRobotRules() - Constructor for class crawlercommons.robots.SimpleRobotRules
-
- SimpleRobotRules(SimpleRobotRules.RobotRulesMode) - Constructor for class crawlercommons.robots.SimpleRobotRules
-
- SimpleRobotRules.RobotRule - Class in crawlercommons.robots
-
Single rule that maps from a path prefix to an allow flag.
- SimpleRobotRules.RobotRulesMode - Enum in crawlercommons.robots
-
- SimpleRobotRulesParser - Class in crawlercommons.robots
-
This implementation of BaseRobotsParser retrieves a set of rules for an agent
with the given name from the robots.txt file of a given domain.
- SimpleRobotRulesParser() - Constructor for class crawlercommons.robots.SimpleRobotRulesParser
-
- SiteMap - Class in crawlercommons.sitemaps
-
- SiteMap() - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMap(URL) - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMap(String) - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMap(URL, Date) - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMap(String, String) - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMapIndex - Class in crawlercommons.sitemaps
-
- SiteMapIndex() - Constructor for class crawlercommons.sitemaps.SiteMapIndex
-
- SiteMapIndex(URL) - Constructor for class crawlercommons.sitemaps.SiteMapIndex
-
- SiteMapParser - Class in crawlercommons.sitemaps
-
- SiteMapParser() - Constructor for class crawlercommons.sitemaps.SiteMapParser
-
- SiteMapParser(boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParser
-
- SiteMapTester - Class in crawlercommons.sitemaps
-
Sitemap tool for recursively fetching all URLs from a sitemap (and all of
its children).
- SiteMapTester() - Constructor for class crawlercommons.sitemaps.SiteMapTester
-
- SiteMapURL - Class in crawlercommons.sitemaps
-
The SiteMapURL class represents a URL found in a Sitemap.
- SiteMapURL(String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
-
- SiteMapURL(URL, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
-
- SiteMapURL(String, String, String, String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
-
- SiteMapURL(URL, Date, SiteMapURL.ChangeFrequency, double, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
-
- SiteMapURL.ChangeFrequency - Enum in crawlercommons.sitemaps
-
Allowed change frequencies
- size() - Method in class crawlercommons.fetcher.Payload
-
Deprecated.
- sortRules() - Method in class crawlercommons.robots.SimpleRobotRules
-
In order to match up with Google's convention, we want to match rules
from longest to shortest.
- strict - Variable in class crawlercommons.sitemaps.SiteMapParser
-
True by default, meaning that invalid URLs are rejected, since the
official docs allow siteMapURLs only under the base URL:
http://www.sitemaps.org/protocol.html#location
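The strict and setValid(boolean) entries above both refer to the rule that a sitemap may only list URLs under its base URL. A minimal sketch of such a check, assuming the base URL is simply the sitemap URL truncated after its last '/', could look like this. The class LocationCheckSketch and its methods are hypothetical names for this example; the library's own validation may differ in details such as case handling or port normalization.

```java
final class LocationCheckSketch {
    /**
     * Base URL of a sitemap: everything up to and including the last '/'.
     * Assumes the sitemap URL contains a path, e.g. http://example.com/catalog/sitemap.xml.
     */
    static String baseUrl(String sitemapUrl) {
        return sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
    }

    /** True if the candidate URL starts with the sitemap's base URL. */
    static boolean isUnderBase(String sitemapUrl, String candidateUrl) {
        return candidateUrl.startsWith(baseUrl(sitemapUrl));
    }

    public static void main(String[] args) {
        String sitemap = "http://example.com/catalog/sitemap.xml";
        System.out.println(isUnderBase(sitemap, "http://example.com/catalog/show?item=23")); // true
        System.out.println(isUnderBase(sitemap, "http://example.com/images/logo.png"));      // false
    }
}
```

In a strict mode, entries failing a check like this would be dropped; in a lenient mode they might instead be kept and flagged as not valid.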
- valueOf(String) - Static method in enum crawlercommons.fetcher.AbortedFetchReason
-
Deprecated.
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum crawlercommons.fetcher.http.BaseHttpFetcher.RedirectMode
-
Deprecated.
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum crawlercommons.fetcher.RedirectFetchException.RedirectExceptionReason
-
Deprecated.
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum crawlercommons.robots.SimpleRobotRules.RobotRulesMode
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum crawlercommons.sitemaps.AbstractSiteMap.SitemapType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum crawlercommons.sitemaps.SiteMapURL.ChangeFrequency
-
Returns the enum constant of this type with the specified name.
- values() - Static method in enum crawlercommons.fetcher.AbortedFetchReason
-
Deprecated.
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum crawlercommons.fetcher.http.BaseHttpFetcher.RedirectMode
-
Deprecated.
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Method in class crawlercommons.fetcher.Payload
-
Deprecated.
- values() - Static method in enum crawlercommons.fetcher.RedirectFetchException.RedirectExceptionReason
-
Deprecated.
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum crawlercommons.robots.SimpleRobotRules.RobotRulesMode
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum crawlercommons.sitemaps.AbstractSiteMap.SitemapType
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum crawlercommons.sitemaps.SiteMapURL.ChangeFrequency
-
Returns an array containing the constants of this enum type, in
the order they are declared.
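The valueOf(String) and values() entries above are the standard methods the Java compiler generates for every enum. Because valueOf throws IllegalArgumentException for an unknown name, callers that want a null fallback, as the setChangeFrequency(String) entry describes, need to catch it. The sketch below uses a stand-in enum named Frequency and a helper named parse, both invented for this example; whether the library normalizes case or whitespace in the same way is an assumption.

```java
import java.util.Locale;

final class EnumLookupSketch {
    // Stand-in enum for illustration; not the library's type.
    enum Frequency { ALWAYS, HOURLY, DAILY, WEEKLY, MONTHLY, YEARLY, NEVER }

    /** Returns the matching constant, or null if the text names no constant. */
    static Frequency parse(String text) {
        if (text == null) {
            return null;
        }
        try {
            // valueOf is case-sensitive, so normalize before the lookup.
            return Frequency.valueOf(text.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return null; // unknown name: fall back to null instead of propagating
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("daily"));     // DAILY
        System.out.println(parse("sometimes")); // null
        // values() returns the constants in declaration order.
        for (Frequency f : Frequency.values()) {
            System.out.println(f.ordinal() + " " + f);
        }
    }
}
```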