public class SiteMapParser extends Object
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
static int | MAX_BYTES_ALLOWED: Sitemap docs must be limited to 10MB (10,485,760 bytes) |
protected boolean | strict: True (by default) meaning that invalid URLs should be rejected, as the official docs allow the siteMapURLs to be only under the base url: http://www.sitemaps.org/protocol.html#location |
Constructor and Description |
---|
SiteMapParser() |
SiteMapParser(boolean strict) |
Modifier and Type | Method and Description |
---|---|
protected void | addUrlIntoSitemap(String urlStr, SiteMap siteMap, String lastMod, String changeFreq, String priority, int urlIndex): Adds the given URL to the given sitemap while showing the relevant logs |
protected String | getElementAttributeValue(Element elem, String elementName, String attributeName): Get the element's attribute value. |
protected String | getElementValue(Element elem, String elementName): Get the element's textual content. |
boolean | isStrict() |
protected void | parseAtom(SiteMap sitemap, Element elem, Document doc): Parse the XML document which is assumed to be in Atom format. |
protected void | parseRSS(SiteMap sitemap, Document doc): Parse XML document which is assumed to be in RSS format. |
AbstractSiteMap | parseSiteMap(byte[] content, URL url): Parse a sitemap, given the content bytes and the URL. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap): Returns a processed copy of an unprocessed sitemap object, i.e. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, URL url): Parse a sitemap, given the MIME type, the content bytes, and the URL. |
AbstractSiteMap | parseSiteMap(URL onlineSitemapUrl): Returns a SiteMap or SiteMapIndex given an online sitemap URL |
protected SiteMapIndex | parseSitemapIndex(URL url, NodeList nodeList): Parse XML that contains a Sitemap Index. |
protected SiteMap | parseSyndicationFormat(URL sitemapUrl, Document doc): Parse the XML document, looking for a feed element to determine if it's an Atom doc, or an rss element to determine if it's an RSS doc. |
protected SiteMap | parseXmlSitemap(URL sitemapUrl, Document doc): Parse XML that contains a valid Sitemap. |
protected AbstractSiteMap | processGzip(URL url, byte[] response): Decompress the gzipped content and process the resulting XML Sitemap. |
protected SiteMap | processText(String sitemapUrl, byte[] content): Process a text-based Sitemap. |
protected AbstractSiteMap | processXml(URL sitemapUrl, byte[] xmlContent): Parse the given XML content. |
protected AbstractSiteMap | processXml(URL sitemapUrl, InputSource is): Parse the given XML content. |
protected boolean | urlIsValid(String sitemapBaseUrl, String testUrl): See if testUrl is under sitemapBaseUrl. |
public static final org.slf4j.Logger LOG
public static final int MAX_BYTES_ALLOWED
protected boolean strict
public SiteMapParser()
public SiteMapParser(boolean strict)
public boolean isStrict()
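The field summary says that in strict mode, URLs outside the sitemap's base URL are rejected, and the method summary describes urlIsValid as checking whether testUrl is under sitemapBaseUrl. The following is a minimal sketch of that kind of check, not the library's implementation. The class and method names are hypothetical, and it assumes the base URL is the sitemap URL with its file name removed, as the sitemaps.org location rules describe.

```java
/** Sketch of a "strict" location check; names are hypothetical, not crawler-commons API. */
class StrictUrlCheckSketch {

    /** Returns the sitemap URL up to and including its last '/', dropping the file name. */
    static String baseOf(String sitemapUrl) {
        int lastSlash = sitemapUrl.lastIndexOf('/');
        return lastSlash < 0 ? sitemapUrl : sitemapUrl.substring(0, lastSlash + 1);
    }

    /** True if testUrl starts with the directory that holds the sitemap. */
    static boolean isUnderBase(String sitemapUrl, String testUrl) {
        return testUrl.startsWith(baseOf(sitemapUrl));
    }

    public static void main(String[] args) {
        String sitemap = "http://www.example.com/catalog/sitemap.xml";
        System.out.println(isUnderBase(sitemap, "http://www.example.com/catalog/item1")); // true
        System.out.println(isUnderBase(sitemap, "http://www.example.com/images/a.png"));  // false
    }
}
```

The sketch assumes the sitemap URL contains a path; a bare host such as `http://www.example.com` would need separate handling.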
public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl) throws UnknownFormatException, IOException
Returns a SiteMap or SiteMapIndex given an online sitemap URL.
Note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
onlineSitemapUrl - URL of the online sitemap
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
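The field summary states that sitemap documents must be limited to 10MB (10,485,760 bytes). If you fetch the content yourself and pass the bytes to one of the byte[] overloads, a similar cap can be applied while reading. The sketch below shows one way to do that with plain JDK streams. It does not show how the library itself fetches or limits content, and every name in it is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Sketch of a size-capped read; not crawler-commons code. */
class BoundedReadSketch {
    /** Same value as the documented limit; the constant itself is local to this sketch. */
    static final int MAX_BYTES = 10_485_760;

    /** Reads the whole stream, failing as soon as more than maxBytes would be buffered. */
    static byte[] readAtMost(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IllegalStateException("content exceeds " + maxBytes + " bytes");
                }
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] small = readAtMost(new ByteArrayInputStream(new byte[100]), MAX_BYTES);
        System.out.println(small.length); // 100
    }
}
```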
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) throws UnknownFormatException, IOException
contentType - MIME type of content
content - raw bytes of sitemap file
sitemap - an AbstractSiteMap implementation
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(byte[] content, URL url) throws UnknownFormatException, IOException
content - raw bytes of sitemap file
url - URL to sitemap file
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url) throws UnknownFormatException, IOException
contentType - MIME type of content
content - raw bytes of sitemap file
url - URL to sitemap file
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent) throws UnknownFormatException
sitemapUrl - URL to sitemap file
xmlContent - the byte[] backing the sitemapUrl
UnknownFormatException - if there is an error parsing the sitemap
protected SiteMap processText(String sitemapUrl, byte[] content) throws IOException
sitemapUrl - a string sitemap URL
content - the byte[] backing the sitemapUrl
IOException - if there is an error reading in the site map String
protected AbstractSiteMap processGzip(URL url, byte[] response) throws IOException, UnknownFormatException
url - URL of the gzipped content
response - gzipped content
UnknownFormatException - if there is an error parsing the gzip
IOException - if there is an error reading in the gzip URL
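processGzip is described as decompressing the gzipped content before the resulting XML is processed. The decompression step on its own can be sketched with java.util.zip as shown below. The class and its helper names are hypothetical and are not how the library is necessarily written. The gzip helper exists only so the example can produce its own input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch of gzip decompression for sitemap bytes; not crawler-commons code. */
class GzipSitemapSketch {

    /** Decompresses gzipped bytes fully into memory. */
    static byte[] gunzip(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Covers both read failures and input that is not in gzip format.
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses bytes; used here only to build test input. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<urlset/>".getBytes(StandardCharsets.UTF_8);
        String roundTrip = new String(gunzip(gzip(xml)), StandardCharsets.UTF_8);
        System.out.println(roundTrip); // <urlset/>
    }
}
```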
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is) throws UnknownFormatException
sitemapUrl - a sitemap URL
is - an InputSource backing the sitemap
UnknownFormatException - if there is an error parsing the InputSource
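The class offers separate parsers for Sitemaps, Sitemap Indexes, Atom feeds, and RSS feeds, so something has to decide which one applies to a parsed document. One plausible way to make that decision is to inspect the root element, as sketched below with the JDK DOM parser. The enum, class, and method names are hypothetical, and the assumption that the root element alone determines the format is mine rather than something this page states.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch of choosing a parser from the XML root element; not crawler-commons code. */
class RootElementDispatchSketch {
    enum Kind { SITEMAP, SITEMAP_INDEX, ATOM, RSS, UNKNOWN }

    /** Parses the source and classifies it by the local name of its root element. */
    static Kind classify(InputSource source) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().parse(source);
        } catch (Exception e) {
            throw new IllegalArgumentException("content is not well-formed XML", e);
        }
        String root = doc.getDocumentElement().getLocalName();
        switch (root) {
            case "urlset":
                return Kind.SITEMAP;
            case "sitemapindex":
                return Kind.SITEMAP_INDEX;
            case "feed":
                return Kind.ATOM;
            case "rss":
                return Kind.RSS;
            default:
                return Kind.UNKNOWN;
        }
    }

    /** Convenience overload for XML held in a String. */
    static Kind classify(String xml) {
        return classify(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) {
        System.out.println(classify("<rss version=\"2.0\"><channel/></rss>")); // RSS
    }
}
```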
protected SiteMap parseXmlSitemap(URL sitemapUrl, Document doc)
Parse XML that contains a valid Sitemap. Example Sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
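To make the structure of the example above concrete, here is a minimal sketch that reads each url element and its optional lastmod, changefreq, and priority children using the JDK DOM API. It is an illustration only, not the body of parseXmlSitemap, and the Entry class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch of reading <url> entries from a urlset document; not crawler-commons code. */
class UrlsetSketch {
    /** Raw string values of one <url> entry; absent children are null. */
    static final class Entry {
        final String loc;
        final String lastMod;
        final String changeFreq;
        final String priority;

        Entry(String loc, String lastMod, String changeFreq, String priority) {
            this.loc = loc;
            this.lastMod = lastMod;
            this.changeFreq = changeFreq;
            this.priority = priority;
        }
    }

    /** Text of the first descendant element with the given name, or null if absent. */
    static String childText(Element parent, String name) {
        NodeList nodes = parent.getElementsByTagName(name);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    static List<Entry> readUrls(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList urls = doc.getElementsByTagName("url");
            List<Entry> entries = new ArrayList<>();
            for (int i = 0; i < urls.getLength(); i++) {
                Element url = (Element) urls.item(i);
                entries.add(new Entry(childText(url, "loc"), childText(url, "lastmod"),
                        childText(url, "changefreq"), childText(url, "priority")));
            }
            return entries;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse sitemap XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://www.example.com/</loc><changefreq>monthly</changefreq></url>"
                + "</urlset>";
        for (Entry e : readUrls(xml)) {
            System.out.println(e.loc + " " + e.changeFreq); // http://www.example.com/ monthly
        }
    }
}
```

Note that the ampersand in the second loc of the example must be written as `&amp;` in the XML; the parser returns it as a plain `&`.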
protected SiteMapIndex parseSitemapIndex(URL url, NodeList nodeList)
Parse XML that contains a Sitemap Index. Example Sitemap Index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2005-01-01</lastmod>
  </sitemap>
</sitemapindex>
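The two lastmod values in the example above use different shapes: a full timestamp with a UTC offset and a date without a time. A minimal sketch of accepting both with java.time follows. It is not the library's date handling. The class name is hypothetical, and treating a date-only value as midnight UTC is an assumption made for this example.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;

/** Sketch of parsing lastmod values; not crawler-commons code. */
class LastModSketch {

    /**
     * Parses a full timestamp with offset, or a date-only value.
     * Date-only values are treated as midnight UTC (an assumption of this sketch).
     * Returns null when neither shape matches.
     */
    static OffsetDateTime parse(String lastMod) {
        try {
            return OffsetDateTime.parse(lastMod);
        } catch (DateTimeParseException e) {
            // Not a full timestamp; try the date-only form next.
        }
        try {
            return LocalDate.parse(lastMod).atStartOfDay().atOffset(ZoneOffset.UTC);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2004-10-01T18:23:17+00:00").getHour()); // 18
        System.out.println(parse("2005-01-01").getYear());                // 2005
        System.out.println(parse("not a date"));                          // null
    }
}
```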
url - URL of Sitemap Index
nodeList - a NodeList backing the sitemap
protected SiteMap parseSyndicationFormat(URL sitemapUrl, Document doc) throws UnknownFormatException
sitemapUrl - the URL location of the Sitemap
doc - XML document to parse
UnknownFormatException - if XML does not appear to be Atom or RSS
protected void parseAtom(SiteMap sitemap, Element elem, Document doc)
Parse the XML document which is assumed to be in Atom format. Atom 1.0 example:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <subtitle>A subtitle.</subtitle>
  <link href="http://example.org/feed/" rel="self"/>
  <link href="http://example.org/"/>
  <modified>2003-12-13T18:30:02Z</modified>
  <author>
    <name>John Doe</name>
    <email>johndoe@example.com</email>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</id>
  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
  ...
</feed>
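In the Atom example above, each entry carries its URL in the href attribute of a link element rather than in element text. The sketch below shows how such attribute values can be collected with the JDK DOM API. It is only an illustration of the idea behind an attribute lookup, not the code of parseAtom or getElementAttributeValue, and all names in it are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch of collecting entry links from an Atom feed; not crawler-commons code. */
class AtomLinksSketch {

    /** Attribute value of the first descendant element with the given name, or null if absent. */
    static String attributeOfChild(Element parent, String childName, String attributeName) {
        NodeList nodes = parent.getElementsByTagName(childName);
        if (nodes.getLength() == 0) {
            return null;
        }
        return ((Element) nodes.item(0)).getAttribute(attributeName);
    }

    /** Returns the href of the first link in each entry, skipping entries without one. */
    static List<String> entryLinks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList entries = doc.getElementsByTagName("entry");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                String href = attributeOfChild((Element) entries.item(i), "link", "href");
                if (href != null && !href.isEmpty()) {
                    links.add(href);
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse Atom XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
                + "<link href=\"http://example.org/\"/>"
                + "<entry><link href=\"http://example.org/2003/12/13/atom03\"/></entry>"
                + "</feed>";
        System.out.println(entryLinks(xml)); // [http://example.org/2003/12/13/atom03]
    }
}
```

Feed-level link elements are ignored here because the lookup starts from each entry, not from the feed root.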
protected void parseRSS(SiteMap sitemap, Document doc)
Parse XML document which is assumed to be in RSS format. RSS 2.0 example:
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Lift Off News</title>
    <link>http://liftoff.msfc.nasa.gov/</link>
    <description>Liftoff to Space Exploration.</description>
    <language>en-us</language>
    <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
    <lastBuildDate>Tue, 10 Jun 2003 09:41:01 GMT</lastBuildDate>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <generator>Weblog Editor 2.0</generator>
    <managingEditor>editor@example.com</managingEditor>
    <webMaster>webmaster@example.com</webMaster>
    <ttl>5</ttl>
    <item>
      <title>Star City</title>
      <link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link>
      <description>How do Americans get ready to work with Russians aboard the
        International Space Station? They take a crash course in culture,
        language and protocol at Russia's Star City.
      </description>
      <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
      <guid>http://liftoff.msfc.nasa.gov/2003/06/03.html#item573</guid>
    </item>
    <item>
      <title>Space Exploration</title>
      <link>http://liftoff.msfc.nasa.gov/</link>
      <description>Sky watchers in Europe, Asia, and parts of Alaska and Canada
        will experience a partial eclipse of the Sun on Saturday, May 31.
      </description>
      <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
      <guid>http://liftoff.msfc.nasa.gov/2003/05/30.html#item572</guid>
    </item>
  </channel>
</rss>
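In the RSS example above, each item holds its URL as the text of a link element and its date as an RFC 1123 style pubDate. A minimal sketch of reading both with the JDK follows. It is not the body of parseRSS, the names are hypothetical, and parsing pubDate with DateTimeFormatter.RFC_1123_DATE_TIME is a choice made for this example.

```java
import java.io.StringReader;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch of reading item links and dates from an RSS 2.0 feed; not crawler-commons code. */
class RssItemsSketch {

    /** Returns the text of the first link element inside each item. */
    static List<String> itemLinks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                NodeList linkNodes = ((Element) items.item(i)).getElementsByTagName("link");
                if (linkNodes.getLength() > 0) {
                    links.add(linkNodes.item(0).getTextContent().trim());
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse RSS XML", e);
        }
    }

    /** Parses a pubDate such as "Tue, 03 Jun 2003 09:39:21 GMT". */
    static ZonedDateTime parsePubDate(String pubDate) {
        return ZonedDateTime.parse(pubDate.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    public static void main(String[] args) {
        String xml = "<rss version=\"2.0\"><channel>"
                + "<link>http://liftoff.msfc.nasa.gov/</link>"
                + "<item><link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link></item>"
                + "</channel></rss>";
        System.out.println(itemLinks(xml).size()); // 1
        System.out.println(parsePubDate("Tue, 03 Jun 2003 09:39:21 GMT").getHour()); // 9
    }
}
```

The channel-level link is not returned because the lookup starts from each item element.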
protected String getElementValue(Element elem, String elementName)
elem -
elementName -
protected String getElementAttributeValue(Element elem, String elementName, String attributeName)
elem -
elementName -
attributeName -
protected void addUrlIntoSitemap(String urlStr, SiteMap siteMap, String lastMod, String changeFreq, String priority, int urlIndex)
urlStr - a URL string to add to the SiteMap
siteMap - the sitemap to add URL(s) to
lastMod - last time the SiteMapURL was modified
changeFreq - the SiteMapURL change frequency
priority - priority of this SiteMapURL
urlIndex - index position to which this entry has been added
Copyright © 2009–2016 Crawler-Commons. All rights reserved.