Saturday, April 4, 2009

Implementation of SAX Parser in Java using SAX2 APIs


If you have reached this article directly, you may like to refresh your understanding of SAX before we move on to the implementation of a sample SAX-based XML parser in Java. Please refer to this article - Evolution of Java and XML combo. SAX, DOM, JAXP, JDOM >>

Implementation of a SAX2-based XML parser in Java

We will start by looking at the steps involved in writing a SAX-based XML parser in Java, and then we'll see the code listing and the output. The implementation can be broken down into the following six steps:

(1) Inheriting DefaultHandler:
If you're using SAX2, you can inherit from the class DefaultHandler, which is the base class for SAX2 event handlers. It provides default implementations for all the callbacks of the four core SAX2 handler interfaces: EntityResolver, DTDHandler, ContentHandler, and ErrorHandler. In most cases we need to override only the methods of the ContentHandler interface. If you are using SAX1, you would use the HandlerBase class in place of DefaultHandler. The signatures of some SAX1 methods differ from their SAX2 counterparts, so you would need to adjust your overriding method definitions accordingly.

public class SAXXMLParserImpl extends DefaultHandler{

(2) New instance of SAXParserFactory: SAX parsers are obtained from a factory class named 'SAXParserFactory', so one must get an instance of the factory first.
//Getting a new instance of the SAX Parser Factory
SAXParserFactory factory = SAXParserFactory.newInstance();

(3) New instance of SAX Parser: once you have got a factory instance then you can simply use the API to get a new instance of the SAX Parser.
//Getting a parser from the factory
SAXParser saxParser = factory.newSAXParser();

(4) Parsing the XML document: now that you have a SAX Parser instance, you just need to pass the XML document and a DefaultHandler instance for parsing the XML document.
//Parsing the XML document using the parser
saxParser.parse( new File(XML_FILE_TO_BE_PARSED), new SAXXMLParserImpl() );

(5) Implementing the required handlers: inheriting from the DefaultHandler class gives you a default implementation of all the SAX2 callbacks, but the default implementation (at least in some cases) can be as good as nothing. You would need to override at least some of the methods to make the processing of XML documents possible.
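As a minimal sketch of this step, the hypothetical handler below overrides just one ContentHandler callback and inherits the do-nothing defaults for everything else. The class name ElementCounter, the helper countElements and the inline XML string are assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

//Hypothetical handler: overrides a single callback of the ContentHandler interface
class ElementCounter extends DefaultHandler {

    //Number of elements seen so far
    private int count = 0;

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        count++; //the parser calls this once for every element
    }

    //Parses the given XML text and returns the number of elements in it
    static int countElements(String xml) {
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            ElementCounter handler = new ElementCounter();
            saxParser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
            return handler.count;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML text", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<LanguageList><Language id=\"1\"><Name>Java</Name>"
                   + "</Language></LanguageList>";
        System.out.println("Elements: " + countElements(xml));
    }
}
```

All the other callbacks (endElement, characters, the error handlers and so on) fall back to the empty implementations inherited from DefaultHandler.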

(6) Cosmetic/Admin Stuff: you may like to have private members to keep references to the XML file path and the output stream. These members must of course be set correctly before they are used. Additionally, you may like to define a few simple helper methods for routine tasks to make the code more readable and maintainable.

Source Code of the Implementation


SAXXMLParserImpl.java


import java.io.*;

import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

public class SAXXMLParserImpl extends DefaultHandler{

//Path of the XML File to be parsed - private as we don't want it outside
//and 'final' as once it's assigned to a value (path), it doesn't require any change
private static final String XML_FILE_TO_BE_PARSED = "C:\\LanguageList.xml";

//Reference to the output stream
private static Writer out;

public static void main (String argv [])
{
//Getting a new instance of the SAX Parser Factory
SAXParserFactory factory = SAXParserFactory.newInstance();

try {

//Setting up the output stream - in this case System.out with UTF8 encoding
out = new OutputStreamWriter(System.out, "UTF8");

//Getting a parser from the factory
SAXParser saxParser = factory.newSAXParser();

//Parsing the XML document using the parser
saxParser.parse( new File(XML_FILE_TO_BE_PARSED), new SAXXMLParserImpl() );

} catch (Throwable throwable) { //Throwable as it can be either Error or Exception
throwable.printStackTrace ();
}
System.exit (0);
}

//Implementation of the required methods of the ContentHandler interface

public void startDocument()throws SAXException
{
printData("XML File being parsed: " + XML_FILE_TO_BE_PARSED);
printNewLine();printNewLine();
printData("INFO: ### Parsing of the XML Doc started ###");
printNewLine();printNewLine();

printData ("<?xml version='1.0' encoding='UTF-8'?>");
printNewLine();
}

public void endDocument()throws SAXException
{
try {
printNewLine();
printNewLine();
printData("INFO: ### Parsing of the XML Doc completed ###");

out.flush ();
} catch(IOException ioe) {
throw new SAXException ("ERROR: I/O Exception thrown while parsing XML", ioe);
}
}

public void startElement(String namespaceURI, String localName, String qName, Attributes atts)throws SAXException
{

printData ("<" + qName);

if (atts != null) {
for (int i = 0; i < atts.getLength (); i++) {
printData (" ");
printData (atts.getQName(i) + "=\"" + atts.getValue(i) + "\"");
}
}

printData (">");
}

public void endElement(String namespaceURI, String localName, String qName)throws SAXException
{
printData ("</" + qName + ">");
}

public void characters(char buffer [], int offset, int length)throws SAXException
{
String string = new String(buffer, offset, length);
printData(string);
}

//Definition of helper methods

//printData: accepts a String and prints it on the assigned output stream
private void printData(String string)throws SAXException
{
try {

out.write(string);
out.flush();

} catch (IOException ioe) {
throw new SAXException ("ERROR: I/O Exception thrown while printing the data", ioe);
}
}

//printNewLine: prints a new line on the underlying platform
//end of line character may vary from one platform to another
private void printNewLine()throws SAXException
{
//Getting the line separator of the underlying platform
String endOfLine = System.getProperty("line.separator");

try {

out.write (endOfLine);

} catch (IOException ioe) {
throw new SAXException ("ERROR: I/O Exception thrown while printing a new line", ioe);
}
}

}

LanguageList.xml
<?xml version="1.0" encoding="UTF-8"?>
<LanguageList>
<Language id = "1">
<Name>Java</Name>
<Description>Arguably the most widely used language for Application Dev</Description>
</Language>
<Language id = "2">
<Name>C</Name>
<Description>Arguably the most widely used language for System Soft Dev</Description>
</Language>
</LanguageList>

Output
XML File being parsed: C:\LanguageList.xml

INFO: ### Parsing of the XML Doc started ###

<?xml version='1.0' encoding='UTF-8'?>
<LanguageList>
<Language id="1">
<Name>Java</Name>
<Description>Arguably the most widely used language for Application Dev</Description>
</Language>
<Language id="2">
<Name>C</Name>
<Description>Arguably the most widely used language for System Soft Dev</Description>
</Language>
</LanguageList>

INFO: ### Parsing of the XML Doc completed ###

Liked the article? Subscribe to this blog for regular updates. Wanna follow it to tell the world that you enjoy GeekExplains? Please find the 'Followers' widget in the rightmost sidebar.





Wednesday, April 1, 2009

SAX v/s DOM. How to choose between DOM and SAX?


Differences between DOM and SAX. When to use what?

Before going through the differences, if you need a refresh of what SAX and DOM are, please refer to this article - SAX, DOM, JAXP, & JDOM >>.

When comparing two entities, we tend to see them as competitors and compare them in order to find a winner. This of course does not apply in every case - at least not in the case of SAX and DOM. Both have their own pros and cons, and they are certainly not in direct competition with each other.


SAX v/s DOM

The main differences between SAX and DOM, the two most popular APIs for processing XML documents in Java, are:
  • Read v/s Read/Write: SAX can be used only for reading XML documents and not for manipulating the underlying XML data, whereas DOM can be used for both reading and writing the data in an XML document.
  • Sequential Access v/s Random Access: SAX can be used only for sequential processing of an XML document, whereas DOM can be used for random processing of XML docs. So what do you do if you want random access to the underlying XML data while using SAX? You have to store and manage that information yourself so that you can retrieve it when you need it.
  • Call back v/s Tree: SAX uses a call back mechanism and event streams to read chunks of XML data into memory in a sequential manner, whereas DOM uses a tree representation of the underlying XML document and facilitates random access/manipulation of the underlying XML data.
  • XML-Dev mailing list v/s W3C: SAX was developed by the XML-Dev mailing list whereas DOM was developed by W3C (World Wide Web Consortium).
  • Information Set: SAX doesn't retain all the info of the underlying XML document (comments, for example), whereas DOM retains almost all the info. New versions of SAX are trying to extend their coverage of information.
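The read/write and random-access points can be illustrated with a small DOM sketch. The class name DomAccessSketch and its helper methods are hypothetical; the element and attribute names are borrowed from the LanguageList.xml sample, and the document is parsed from an inline string rather than a file to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

//Hypothetical sketch: random access and modification through the DOM tree
class DomAccessSketch {

    //Builds an in-memory DOM tree from the given XML text
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not build the DOM tree", e);
        }
    }

    //Random access: fetches the Language element at the given position
    static Element language(Document doc, int index) {
        return (Element) doc.getElementsByTagName("Language").item(index);
    }

    //Returns the text of the first Name element inside a Language element
    static String nameOf(Element language) {
        return language.getElementsByTagName("Name").item(0).getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<LanguageList>"
          + "<Language id=\"1\"><Name>Java</Name></Language>"
          + "<Language id=\"2\"><Name>C</Name></Language>"
          + "</LanguageList>");

        //Visit the second entry first and then go back to the first one
        System.out.println(nameOf(language(doc, 1)));
        System.out.println(nameOf(language(doc, 0)));

        //Write access: change an attribute in the in-memory tree
        language(doc, 0).setAttribute("id", "10");
        System.out.println(language(doc, 0).getAttribute("id"));
    }
}
```

With SAX, neither going back to an earlier element nor changing an attribute would be possible without keeping your own copy of the data.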
Usual Misconceptions
  • SAX is always faster: this is a very common misunderstanding. SAX may not always be faster, because depending upon the situation it is used in, the cost of the call backs can outweigh its storage-size advantage.
  • DOM always keeps the whole XML doc in memory: it's not always true. DOM implementations vary not only in their code size and performance but also in their memory requirements, and a few of them don't keep the entire XML doc in memory all the time. Otherwise, processing/manipulation of very large XML docs would become virtually impossible using DOM, which is of course not the case.

How to choose one between the two?

It primarily depends upon the requirement. If the underlying XML data requires manipulation, then DOM will almost always be used, as SAX doesn't allow that. Similarly, if the nature of access is random (for example, if you need contextual info at every stage), then DOM will be the way to go in most cases. But if the XML document only needs to be read, and that too sequentially, then SAX will probably be the better alternative in most cases. SAX was developed mainly for parsing XML documents and it's certainly good at it. So, if you need to process an XML document, maybe to update a datasource, SAX will probably make a good alternative.

Requirements may certainly fall between the two extremes discussed above, and in any such situation you should weigh both alternatives before picking one of the two. There are applications where a combination of SAX and DOM is used for XML processing, so that might also be an alternative in your case. But basically it would be a design decision, and evidently it would require a thorough analysis of the pros and cons of all possible approaches in that situation.

Read Next: A step-by-step implementation (with explanation of the code) of a SAX Parser in Java using SAX2 APIs - Simple SAX Parser Impl in Java >>



SAX, DOM, JAXP, & JDOM. Evolution of Java-XML combo.


Evolution of the XML Parsing/Manipulation using Java

The combination of Java and XML has been one of the most attractive things to have happened in the field of software development in the 21st century. There are mainly two reasons for this - Java is arguably the most widely used programming language, and XML is almost unarguably the best mechanism for data description and transfer.

Since these two were different technologies, a developer initially needed a sound understanding of both before he could make the best use of the combination. Since then there has been a paradigm shift towards Java, and we have seen a few interesting technologies evolve to make this happen. Some of them are:

SAX - Simple API for XML

It was the first to come on the scene and, interestingly, it was developed on the XML-Dev mailing list. Evidently the people who developed it were XML gurus, and that is quite visible in the usage of this API. You have to have a fair understanding of XML, but at least Java developers got something to combine the two worlds - Java and XML - in a structured way. It instantly became a hit for obvious reasons.

Being the first in the evolution ladder, it obviously had only the basic support for XML processing. It is an event-based technology, which uses callbacks to load the parts of the XML document in a sequential way. This effectively means you can't go back to some part which was read/processed previously - if you do have such a requirement then you would need to store/manage the relevant data yourself.
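Storing the relevant data yourself could look like the following sketch. The class name NameCollector and the helper collectNames are hypothetical; the element name Name is borrowed from the LanguageList.xml sample used in the SAX parser implementation article.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

//Hypothetical handler: keeps the text of every Name element for later use,
//because SAX itself forgets each part of the document once it has reported it
class NameCollector extends DefaultHandler {

    private final List<String> names = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean insideName = false;

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        if ("Name".equals(qName)) {
            insideName = true;
            text.setLength(0); //start collecting a fresh value
        }
    }

    @Override
    public void characters(char[] buffer, int offset, int length) {
        //the parser may deliver the text of one element in several chunks
        if (insideName) {
            text.append(buffer, offset, length);
        }
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName) {
        if ("Name".equals(qName)) {
            names.add(text.toString());
            insideName = false;
        }
    }

    //Parses the given XML text and returns the collected names in document order
    static List<String> collectNames(String xml) {
        try {
            NameCollector handler = new NameCollector();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
            return handler.names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML text", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collectNames(
            "<LanguageList>"
          + "<Language id=\"1\"><Name>Java</Name></Language>"
          + "<Language id=\"2\"><Name>C</Name></Language>"
          + "</LanguageList>"));
    }
}
```

Once parsing is over, the list can be read in any order, which is the kind of access SAX alone does not give you.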

Since this API does not require loading the entire XML doc, and because it offers only sequential processing of the doc, it is quite fast. Another reason for it being faster is that it does not allow modification of the underlying XML data.

Interested in going through a step-by-step implementation (with explanation of the complete source code) of a simple SAX Parser in Java using SAX2 APIs? Here it is for you - SAX Parser Implementation in Java >>

DOM - Document Object Model

The Java binding for DOM provided a tree-based representation of the XML documents - allowing random access and modification of the underlying XML data. Not very difficult to deduce that it would be slower as compared to SAX.

The event-based callback methodology was replaced by an object-oriented in-memory representation of the XML documents. Whether the entire document or only a part of it is kept in memory at a particular instant differs from one implementation to another, but Java developers are kept out of all that hassle and get the entire tree readily available whenever they wish.

JAXP - Java API for XML Parsing

The creators and designers of Java realized that Java developers should not need to be XML gurus to use XML in Java applications. The first step towards making this possible was the evolution of JAXP, which made it easier to obtain either a DOM Document or a SAX-compliant parser via a factory class. This reduced the dependence of Java developers on the numerous vendors supplying parsers of either type. Additionally, JAXP made sure that switching between parsers required minimal code changes.
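The factory approach can be sketched as follows. The class name JaxpFactorySketch and its two helper methods are hypothetical; the point is only that the code never names a vendor's parser class, and which implementation it ends up with depends on the JAXP configuration of the runtime.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;

//Hypothetical sketch: obtaining both parser types through the JAXP factories
class JaxpFactorySketch {

    //Asks JAXP for a SAX parser from whichever implementation is configured
    static SAXParser newSaxParser() {
        try {
            return SAXParserFactory.newInstance().newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("No usable SAX parser found", e);
        }
    }

    //Asks JAXP for a DOM builder from whichever implementation is configured
    static DocumentBuilder newDomBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DOM builder found", e);
        }
    }

    public static void main(String[] args) {
        //The concrete class names printed here vary with the Java runtime in use
        System.out.println("SAX parser:  " + newSaxParser().getClass().getName());
        System.out.println("DOM builder: " + newDomBuilder().getClass().getName());
    }
}
```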

JDOM - Java Document Object Model

Even though JAXP reduced the need to care about the different parser implementations, it still required developers to use either DOM or SAX for manipulating the XML data. JDOM evolved as the designers of Java APIs thought of moving more towards Java and Java-like constructs while processing XML documents, and it supported moving away from non-Java structures like Attributes (in SAX) and NamedNodeMap (in DOM). Now Java developers can use the much more familiar Java Collection classes to manipulate XML data. Moving towards the customary Java constructs also helped make the processing faster - almost at par with SAX.

So, now that we are aware of what SAX and DOM are, let's move towards discussing the differences between the two. As is the case with most of the other technological comparisons, neither of the two is an absolute favourite and the choice would more often than not depend upon your requirement. SAX v/s DOM. When to use what?
