XML has great benefits, but before you can realize those benefits, you must be able to read and write XML documents. There are two programming-language-independent paradigms for reading XML documents: the Simple API for XML (SAX) and the Document Object Model (DOM).
SAX parsing involves streaming an XML document through a parser and sending notifications to a SAX handler. SAX is very fast and only needs to maintain information it deems important in memory. The DOM, in contrast, reads an entire XML document into memory and constructs a tree-like structure through which you access the information you are interested in.
An open source, Java-specific representation of XML documents, known as JDOM, represents an XML document as a collection of Elements pieced together using the Collection classes that Java programmers already know. JDOM is easy to use and fast, but the code and its concepts are not portable to other programming languages.
Before looking into the specifics of each implementation, it's beneficial to look at a simple XML document. Listing 1 shows a simple XML document that represents books.
Listing 1. books.xml
<?xml version="1.0"?>
<books>
<book category="computer-programming">
<author>Steven Haines</author>
<title>Java 2 Primer Plus</title>
<price>44.95</price>
</book>
<book category="fiction">
<author>Tim LaHaye</author>
<title>Left Behind</title>
<price>15.95</price>
</book>
</books>
In this example, the root element is the books element. It has two children that are book elements. The book elements have a category attribute and three subelements. Each subelement has text-based content; for example, the author of "Left Behind" is "Tim LaHaye".
SAX Parsing
Using a SAX Parser has the following benefits:
- It can parse files of any size
- It is useful when you want to build your own data structure
- It is useful when you only want a small subset of the information
- It is simple
- It is fast
But it also has its drawbacks:
- No random access to the document
- Complex searches can be difficult to implement
- Document Type Definition (DTD) is not available
- Lexical information is not available
- SAX is read-only
- SAX is not supported in current browsers
Your choice of using SAX or DOM will be dictated by the requirements of your system. If you need to quickly extract information from a large XML file, then SAX is the best option. But if you need extensive validation and want to perform complex searches on a document small enough to fit into memory, then you may choose to use the DOM.
The process of using a SAX parser is shown below:
[Figure: SAX Parser Sequence Diagram]
The steps are outlined here:
- Download a SAX parser (e.g. com.sun.xml.parser.Parser or com.sun.xml.parser.ValidatingParser)
- Import org.xml.sax.* into your source code
- Create an instance of your parser
- Implement DocumentHandler (extend HandlerBase)
- Set your DocumentHandler class as the parser's document handler (setDocumentHandler())
- Ask the parser to parse the document
- Handle the DocumentHandler interface methods (callbacks)
Listing 2 shows sample Java code to parse the books.xml document and return the number of books that it finds.
Listing 2. BookCounter.java
import org.xml.sax.*;

public class BookCounter extends HandlerBase {
    private int count = 0;

    // Create the parser, register this object as its handler, and parse the file
    public void countBooks() throws Exception {
        Parser p = new com.sun.xml.parser.Parser();
        p.setDocumentHandler( this );
        p.parse( "books.xml" );
    }

    // Called for every start tag; count only <book> elements
    public void startElement(String name, AttributeList atts) throws SAXException {
        if( name.equals( "book" ) ) { count++; }
    }

    // Called once the whole document has been processed
    public void endDocument() throws SAXException {
        System.out.println( "There are " + count + " books" );
    }

    // countBooks() declares a checked exception, so main must propagate it
    public static void main( String[] args ) throws Exception {
        BookCounter counter = new BookCounter();
        counter.countBooks();
    }
}
The BookCounter class extends HandlerBase, which implements DocumentHandler and provides default implementations of its methods. BookCounter overrides the startElement() and endDocument() methods: startElement() is called when the parser finds a new element (the parser also calls endElement() if you are interested in knowing when it has finished processing the element), and endDocument() is called when the SAX parser has finished processing the entire document. Listing 2 is a simple example that demonstrates how you can easily extract information from an XML document and perform some processing.
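Listing 2 uses the original SAX 1 interfaces and a Sun parser class that you may not have available. As a rough sketch only, the same counting idea can be written against the SAX 2 DefaultHandler class and the JAXP SAXParserFactory that ship with current JDKs. The class name Sax2BookCounter and the countBooks(String) helper are assumptions made for this illustration and are not part of the original article.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX 2 counterpart of BookCounter; a sketch, not the article's code.
class Sax2BookCounter extends DefaultHandler {
    private int count = 0;

    // Called once for every start tag the parser encounters.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("book")) {
            count++;
        }
    }

    // Parses the given XML text and returns the number of <book> elements.
    static int countBooks(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Sax2BookCounter handler = new Sax2BookCounter();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books>"
                   + "<book category=\"computer-programming\"/>"
                   + "<book category=\"fiction\"/>"
                   + "</books>";
        System.out.println("There are " + countBooks(xml) + " books");
    }
}
```

The sketch parses from a string so it runs without a books.xml file on disk; SAXParser also offers parse() overloads that accept a File or an InputStream.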
Document Object Model (DOM)
The Document Object Model (DOM) loads an XML document in its entirety into memory and builds a tree-like structure to hold it. The tree-like structure mimics the structure of the XML document itself, so it is a logical way of navigating the data in an XML document. For example, here again is the books.xml file from Listing 1:
Listing 1 (repeated). books.xml
<?xml version="1.0"?>
<books>
<book category="computer-programming">
<author>Steven Haines</author>
<title>Java 2 Primer Plus</title>
<price>44.95</price>
</book>
<book category="fiction">
<author>Tim LaHaye</author>
<title>Left Behind</title>
<price>15.95</price>
</book>
</books>
In this document, the root of the document is the books node. It has two book children, and each book node has three children: author, title, and price. The book nodes also maintain a category attribute.
Document Object Model
In DOM terms, each element of the XML document is represented by an Element node. An Element can contain other Element nodes, Attr (attribute) nodes, and Text nodes, the specifics of which we will see shortly.
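To make these node kinds concrete, here is a small sketch that labels the nodes found around a single book element. The class NodeTypeSketch and its describe() and parse() helpers are hypothetical names introduced for this illustration. Note that, without a DTD, the whitespace between tags also shows up as Text nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; a sketch for illustration only.
class NodeTypeSketch {
    // Returns a short label for the node kinds discussed above.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:   return "Element";
            case Node.ATTRIBUTE_NODE: return "Attr";
            case Node.TEXT_NODE:      return "Text";
            default:                  return "Other";
        }
    }

    // Parses XML held in a String into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<book category=\"fiction\">\n  <title>Left Behind</title>\n</book>");
        Node book = doc.getDocumentElement();
        System.out.println("book is a(n) " + describe(book));
        System.out.println("category is a(n) "
            + describe(book.getAttributes().getNamedItem("category")));

        // The children include whitespace Text nodes as well as the <title> Element.
        NodeList children = book.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            System.out.println("child " + i + " is a(n) " + describe(children.item(i)));
        }
    }
}
```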
While SAX streams through an XML document, the infrastructure provided by the DOM offers the following benefits:
- Allows the developer to read, search, modify, add to, and delete from a document
- Ensures proper grammar and well-formedness
- Abstracts content away from grammar
- Simplifies internal document manipulation
Probably the most important of these features is the ability to determine whether an XML document conforms to a defined structure: this is called validation. The structure of an XML document can be defined in another document, external to the source document, that comes in one of two forms: a Document Type Definition (DTD) or an XML Schema.
DTDs are older technology, but still very much in use. Schemas are a newer technology and hold about the same market share as DTDs. Regardless of the implementation, the rules define how an XML document should look. They do not verify that the data in the XML file is valid (that's your job), just that the document has the correct elements and attributes and that they are in the right places.
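As a minimal sketch of validation, the following example turns on DTD validation through DocumentBuilderFactory.setValidating(true) and counts the errors reported to an ErrorHandler. The DTD shown is an assumption written for this illustration (the article does not supply one for books.xml), and it is placed inline rather than in an external file so the example is self-contained. The class and method names are likewise hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical sketch of DTD validation; the DTD below is an assumed one.
class DtdValidationSketch {
    static final String DTD =
        "<!DOCTYPE books [\n"
      + "  <!ELEMENT books (book*)>\n"
      + "  <!ELEMENT book (author, title, price)>\n"
      + "  <!ATTLIST book category CDATA #REQUIRED>\n"
      + "  <!ELEMENT author (#PCDATA)>\n"
      + "  <!ELEMENT title (#PCDATA)>\n"
      + "  <!ELEMENT price (#PCDATA)>\n"
      + "]>\n";

    // Parses DTD + body with validation enabled and returns the number of
    // validity errors the parser reported.
    static int countValidationErrors(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        final int[] errors = {0};
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { errors[0]++; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        String good = "<books><book category=\"fiction\"><author>Tim LaHaye</author>"
                    + "<title>Left Behind</title><price>15.95</price></book></books>";
        String bad = "<books><book><author>Tim LaHaye</author></book></books>";
        System.out.println("good document errors: " + countValidationErrors(good));
        System.out.println("bad document errors: " + countValidationErrors(bad));
    }
}
```

Consistent with the point above, the DTD checks structure only: a price of "abc" would still pass, because the content is declared as plain character data.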
To use the DOM, the first thing you must do is create an instance of the javax.xml.parsers.DocumentBuilderFactory class, obtain a javax.xml.parsers.DocumentBuilder from it, and call one of the builder's parse() methods. DocumentBuilder defines the following parse() methods:
Document parse(java.io.File f)
Document parse(org.xml.sax.InputSource is)
Document parse(java.io.InputStream is)
Document parse(java.io.InputStream is, String systemId)
Document parse(String uri)
The only difference is the source of the XML document. The following code snippet creates a DocumentBuilderFactory, obtains a DocumentBuilder from it, and parses the books.xml file:
File f = new File( "books.xml" );
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse( f );
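The other overloads work the same way. As a sketch, assuming the XML is already held in memory rather than in a file, the InputSource and InputStream variants can be used as shown below; the class and method names are hypothetical. One point worth noting is that parse(String uri) treats its argument as a location to load from, not as XML text, so a document held in a String needs to be wrapped as in this example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch showing two of the parse() overloads listed above.
class ParseSourcesSketch {
    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Uses parse(InputSource): wraps character data held in a String.
    static String rootFromString(String xml) throws Exception {
        Document doc = newBuilder().parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTagName();
    }

    // Uses parse(InputStream): reads raw bytes, here encoded as UTF-8.
    static String rootFromBytes(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        Document doc = newBuilder().parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book category=\"fiction\"/></books>";
        System.out.println("root via InputSource: " + rootFromString(xml));
        System.out.println("root via InputStream: " + rootFromBytes(xml));
    }
}
```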
The parse() methods return an org.w3c.dom.Document object that you will use to start navigating your XML document. It defines the following methods:
DocumentType getDoctype() The DTD (see org.w3c.dom.DocumentType) associated with this document
Element getDocumentElement() Returns the root element of the document
NodeList getElementsByTagName(String tagname) Returns a NodeList of all the Elements with a given tag name
The getDocumentElement() method returns an org.w3c.dom.Element representing the root of the document. The Element interface (together with its super interface, org.w3c.dom.Node) contains the following methods:
String getAttribute(String name) Retrieves an attribute value by name
Attr getAttributeNode(String name) Retrieves an attribute node by name
NodeList getElementsByTagName(String name) Returns a NodeList of all descendant Elements with a given tag name
String getTagName() The name of the element
NamedNodeMap getAttributes() A NamedNodeMap containing the attributes of this node or null otherwise
NodeList getChildNodes() A NodeList that contains all children of this node
Node getFirstChild() The first child of this node
Node getLastChild() The last child of this node
String getNodeName() The name of this node, depending on its type
short getNodeType() A code representing the type of the underlying object
String getNodeValue() The value of this node, depending on its type
Combining these methods allows you to walk an XML document's elements and attributes. Listing 3 puts all of this together into an example that parses the books.xml file.
Listing 3. DOMSample.java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DOMSample
{
    public static void main( String[] args )
    {
        try
        {
            File file = new File( "books.xml" );
            if( !file.exists() )
            {
                System.out.println( "Couldn't find file..." );
                return;
            }

            // Parse the document
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse( file );

            // Walk the document
            Element root = document.getDocumentElement();
            System.out.println( "root=" + root.getTagName() );

            // List the children of <books>; a set of <book> elements
            // (whitespace between tags appears as Text nodes and is skipped)
            NodeList list = root.getChildNodes();
            for( int i=0; i<list.getLength(); i++ )
            {
                Node node = list.item( i );
                if( node.getNodeType() == Node.ELEMENT_NODE )
                {
                    // Found a <book> element
                    System.out.println( "Handling node: " +
                        node.getNodeName() );
                    Element element = ( Element )node;
                    System.out.println( "\tCategory Attribute: " +
                        element.getAttribute( "category" ) );

                    // Get its children: <author>, <title>, <price>
                    NodeList childList = element.getChildNodes();
                    for( int j=0; j<childList.getLength(); j++ )
                    {
                        // Once one of these nodes is attained, the next
                        // step is locating its text element
                        Node childNode = childList.item( j );
                        if( childNode.getNodeType() == Node.ELEMENT_NODE )
                        {
                            NodeList childNodeList =
                                childNode.getChildNodes();
                            for( int k=0; k<childNodeList.getLength(); k++ )
                            {
                                Node innerChildNode = childNodeList.item( k );
                                System.out.println( "\t\tNode=" +
                                    innerChildNode.getNodeValue() );
                            }
                        }
                    }
                }
            }
        } catch( Exception e )
        {
            e.printStackTrace();
        }
    }
}
The output of listing 3 should be similar to the following:
root=books
Handling node: book
    Category Attribute: computer-programming
        Node=Steven Haines
        Node=Java 2 Primer Plus
        Node=44.95
Handling node: book
    Category Attribute: fiction
        Node=Tim LaHaye
        Node=Left Behind
        Node=15.95
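The nested loops in the DOM listing can often be shortened with getElementsByTagName(), which searches all descendants and so skips the whitespace Text nodes automatically. The following is only a sketch under a few assumptions: the class BookTitles and its titles() helper are hypothetical names, the XML is passed as a String for convenience, and getTextContent() is a DOM Level 3 method on Node that is not in the method tables above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; a compact alternative to walking getChildNodes() by hand.
class BookTitles {
    // Returns one "category: title" string per <book> element, in document order.
    static List<String> titles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> result = new ArrayList<>();
        NodeList books = doc.getElementsByTagName("book");
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            String category = book.getAttribute("category");
            // Assumes every <book> has at least one <title> child.
            String title = book.getElementsByTagName("title").item(0).getTextContent();
            result.add(category + ": " + title);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books>"
            + "<book category=\"computer-programming\"><title>Java 2 Primer Plus</title></book>"
            + "<book category=\"fiction\"><title>Left Behind</title></book>"
            + "</books>";
        for (String line : titles(xml)) {
            System.out.println(line);
        }
    }
}
```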
The DOM is more difficult to use than a SAX parser. However, because it maintains the XML document in memory and its navigation mimics the structure of the document, it is a great tool. When deciding between SAX and DOM, the primary consideration is the memory required to hold the entire document at one time. If your document is small enough to keep in memory, then working with it through the DOM is more natural, whereas if your document is too large, you may have no choice but to use SAX.
Summary
The power behind XML stems from its OS and programming language independence and its widespread adoption across various standards committees. There are two standard ways to read and manipulate XML documents: the SAX Parser and the DOM; each has its own set of benefits and costs so the choice will be dependent on your business needs. An open source project, known as JDOM, provides a very Java-centric approach to manipulating XML documents.
The choice of implementation will be dictated by your business requirements, but for all Java projects where I can afford the memory requirement, I use JDOM. I encourage you to look through the online resources available here on InformIT.com and start implementing XML solutions. Some day you may be required to work with programs running on different operating systems and written in different programming languages, so if you are XML-based you will be ahead of the curve!