

Cobra: Java HTML Parser
The all-Java Cobra HTML Toolkit includes an HTML DOM parser that can be used independently of the rendering engine. Its features include the following. It supports the innerHTML property of an element. It is Javascript-aware: DOM modifications that occur during parsing are reflected in the resulting DOM. However, Javascript can be disabled. It is CSS2-aware.
Cobra Version
The information on this page has been updated to apply to Cobra 0.98.1+. Cobra may be downloaded from the SourceForge download area for this project.
API Documentation
See the Cobra API Documentation.
Basic Usage
The recommended way to use the Cobra HTML parser is via the DocumentBuilderImpl class, roughly as follows:
import org.lobobrowser.html.parser.*;
import org.lobobrowser.html.test.*;
import org.lobobrowser.html.*;
import org.w3c.dom.*;
...
UserAgentContext context = new SimpleUserAgentContext();
DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
// A document URI and a charset should be provided.
Document document = dbi.parse(new InputSourceImpl(inputStream, documentURI, charset));
The HtmlParser class can be used directly as well. In particular, it can be used to parse an HTML document into a third-party DOM implementation, or to parse HTML below a particular DOM node (which is how the innerHTML property is implemented).
import org.lobobrowser.html.parser.*;
import org.lobobrowser.html.test.*;
import org.lobobrowser.html.*;
import org.w3c.dom.*;
import org.w3c.dom.html2.*;
...
UserAgentContext context = new SimpleUserAgentContext();
DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
HTMLDocument document = (HTMLDocument) dbi.createDocument();
...
HtmlParser parser = new HtmlParser(context, document);
parser.parse(myReader, someParentNode);
Incremental Notifications
A document notification listener can be added to an HTMLDocumentImpl instance by calling addDocumentNotificationListener(). The DocumentNotificationListener implementation is notified of several types of document modifications as the document is parsed. Notifications (intended to allow incremental rendering) can also occur when styles are modified or when the document is modified programmatically with Javascript.
Performance Tips
Parser performance is typically affected by the loading of remote scripts and CSS documents. There are generally two ways to deal with this: (1) disable Javascript and/or CSS, and (2) implement some sort of caching mechanism.
All Cobra requests are processed through UserAgentContext.createHttpRequest(), so the way Cobra processes requests can be changed by either implementing the UserAgentContext and HttpRequest interfaces, or by extending simple implementations of these interfaces provided with Cobra.
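As a rough illustration of the caching option, the following is a minimal sketch of a small least-recently-used cache for fetched resources, keyed by URL. It uses only the JDK. The ResourceCache class and its loader parameter are hypothetical names introduced here, not part of Cobra; wiring such a cache into a custom UserAgentContext or HttpRequest implementation is left out because it depends on the Cobra version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Cobra): a small LRU cache that maps a
 * resource URL to its downloaded bytes.
 */
final class ResourceCache {
    private final Map<String, byte[]> entries;

    public ResourceCache(final int capacity) {
        // accessOrder=true keeps entries ordered from least to most recently used,
        // so the "eldest" entry is the best candidate for eviction.
        this.entries = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > capacity;
            }
        };
    }

    /**
     * Returns the cached bytes for the URL, calling the loader only on a miss.
     * The loader stands in for whatever actually performs the remote request.
     */
    public synchronized byte[] get(String url, Function<String, byte[]> loader) {
        byte[] cached = entries.get(url);
        if (cached == null) {
            cached = loader.apply(url);
            entries.put(url, cached);
        }
        return cached;
    }

    public synchronized int size() {
        return entries.size();
    }
}
```

A request implementation could consult such a cache before opening a connection for a script or stylesheet URL, and fall through to the network only on a miss. Expiry and HTTP cache headers are deliberately ignored in this sketch.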
Enabling of Javascript is controlled by the UserAgentContext.isScriptingEnabled() method, so it is straightforward to disable Javascript by simply extending SimpleUserAgentContext. Similarly, remote CSS document loading is controlled by the UserAgentContext.isExternalCSSEnabled() method.
Disabling Arbitrary Elements
Before Cobra explicitly supported disabling CSS or Javascript, a general-purpose technique could be used to achieve the same result. Essentially, the HTMLDocumentImpl class can be extended and its createElement, createText and other such methods overridden to provide custom node instances.
Examples
Use the getImages() method of HTMLDocumentImpl to get a list of image elements in a page. This is equivalent to using the document.images property in Javascript.
HTMLDocumentImpl is a class that implements the standard W3C Document interface. So utilities that use Document, e.g. XPath, will work with documents parsed by Cobra. But it is also possible to use a standard XML Document instance in conjunction with the Cobra HTML parser, as this example illustrates. We will use XPath to retrieve all the "A" links from a page.
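The XPath step described above can be sketched with the standard javax.xml.xpath API, which works against any W3C Document. This is a minimal sketch, not the project's own example: the LinkExtractor class name is hypothetical, and the demo parses a small well-formed snippet with the JDK's XML parser so that it runs without Cobra. With Cobra, the Document would instead come from DocumentBuilderImpl as shown under Basic Usage. The expression matches both "a" and "A" because the case of element names in the parsed DOM is an assumption here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: collects link targets from any W3C Document. */
final class LinkExtractor {
    private LinkExtractor() {}

    /** Returns the href value of every "a"/"A" element that has one, in document order. */
    public static List<String> extractLinks(Document document) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Match either case of the element name, and require an href attribute.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//*[name()='a' or name()='A'][@href]",
                    document, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(((Element) nodes.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath expression", e);
        }
    }

    /** Demo-only stand-in for the Cobra parser: parses well-formed markup with the JDK. */
    public static Document parseXml(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse demo markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml(
                "<html><body><a href=\"http://example.com/one\">One</a>"
                + "<p><A href=\"two.html\">Two</A></p>"
                + "<a name=\"anchor\">no href</a></body></html>");
        System.out.println(extractLinks(doc));
    }
}
```

Because extractLinks only depends on the Document interface, the same method should apply unchanged to a document produced by the Cobra parser.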
Forms can be submitted programmatically, as with form.submit() in Javascript. This example illustrates how to retrieve the main page from the MetaCrawler search engine, populate its search form, submit it, and list the first-page results of the query. It also disables Javascript and remote CSS.