What is SAX, Part 4

by Richard G. Baldwin
baldwin@austin.cc.tx.us
Baldwin's Home Page

Dateline: 06/21/99

prolog

In Part 3 of this series of articles on SAX, I provided a general discussion and showed you the output of a Java program that uses XML4J to parse and display the various parts of a simple XML document. I promised to continue the discussion of the program in this article, and to show you the Java code that was used to produce the output shown.

tighten your technical seat belt

You may need to tighten your technical seat belt for this article. It delves much more deeply into technical material than is normally the case in these articles.

If the Java technology used in this article is unfamiliar to you, see my online Java tutorials for an explanation of these and other Object-Oriented Programming concepts using Java.

the XML File, a short book of poems

As a review, the following listing shows the XML file used with this program. The XML file represents the rudimentary aspects of a book of poems. It contains one verse each from two well-known poems.

The XML markup for the first poem is correct from a syntax viewpoint.

A syntax error was purposely introduced into the second poem to illustrate the error-handling capability of SAX and the IBM parser.

The line with the error is the second line of the second poem (the one that begins with "And all through the house,"). That element is missing its end tag (</line>).

<?xml version="1.0"?>

<bookOfPoems>

<poem PoemNumber="1" 
      DummyAttribute="dummy value">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>

<poem PoemNumber="2"
      DummyAttribute="dummy value">
<line>Twas the night before Christmas,</line>
<line>And all through the house,
<line>Not a creature was stirring,</line>
<line>Not even a mouse.</line>
</poem>

</bookOfPoems>

The manner in which the sample program processes this XML file is described in the Java code fragments in the following sections.

some required import directives

The first fragment shows import directives. These directives are shown here simply to illustrate that the program imports packages that are part of the IBM parser library and are not part of the standard Java API.

import org.xml.sax.*;
import org.xml.sax.helpers.ParserFactory;

identifying the parser package

The next fragment shows the controlling class and the beginning of the main method.

The class begins by defining a String that identifies the class from which the parser will be instantiated. The particular string used here identifies the IBM parser. I believe that this is the only statement that would need to be modified in order to use this program with a SAX-based parser from a different vendor, but I haven't verified that.

class Sax01 {
  static final String parserClass = 
               "com.ibm.xml.parsers.SAXParser";

  public static void main(String args[])
                              throws Exception{
    Parser parser = 
         ParserFactory.makeParser(parserClass);

a SAX factory method

The first statement inside the main method shown above uses a SAX factory method, along with the identification of the parser vendor, to create an object of type Parser. More precisely, this is an object of a vendor-supplied class that implements the interface org.xml.sax.Parser.

All SAX parsers must implement this interface. It allows applications to register handlers for different types of events and to initiate a parse from a URI or a character stream.

a short side trip

Completely as an aside, in case you, like many others, are having difficulty separating URI, URL, URN, and URC in your mind, here is a quote from a W3C document that explains the differences in the terms.

URI -- Uniform Resource Identifier. The generic
set of all names/addresses that are short 
strings that refer to resources. (specified 
1994; ratified as Internet Draft Standard 1998)

URL -- Uniform Resource Locator. The set of URI
schemes that have explicit instructions on how 
to access the resource on the internet. Full 
definition is given in the URL specification. 

URN -- Uniform Resource Name.
1.An URI that has an institutional commitment 
to persistence, availability, etc. Note that 
this sort of URI may also be a URL. See, for 
example, PURLs.

2.A particular scheme which is currently 
(1991,2,3,4,5,6,7) under development in the 
IETF (see discussion forums below), which 
should provide for the resolution using 
internet protocols of names which have a 
greater persistence than that currently 
associated with internet host names or 
organizations. When defined, a URN(2) will
be an example of a URI.

URC -- Uniform Resource Citation, or Uniform 
Resource Characteristics. A set of 
attribute/value pairs describing a resource. 
Some of the values may be URIs of various 
kinds. Others may include, for example,
authorship, publisher, datatype, date, 
copyright status and shoe size. Not normally 
discussed as a short string, but a set of 
fields and values with some defined free
formatting.

now back to the main road

Here is what the documentation has to say about the class named org.xml.sax.helpers.ParserFactory.

Java-specific class for dynamically loading SAX parsers.

This class is not part of the platform-independent definition of SAX; it is an additional convenience class designed specifically for Java XML application writers. SAX applications can use the static methods in this class to allocate a SAX parser dynamically at run-time based either on the value of the `org.xml.sax.parser' system property or on a string containing the class name.

Here is what Clifford J. Berg, author of Advanced Java Development for Enterprise Applications, has to say about factory methods in general:

A class you have defined that has a method createInstance() -- or any method -- that has the function of creating an instance based on runtime or configuration criteria such as property settings.

the bottom line on makeParser()

The bottom line is that the makeParser() method of the ParserFactory class creates an instance (object) of a class that implements the Parser interface.

The class of the object is determined by a String that names the parser class in the class libraries provided by the vendor of the SAX-based parser software.

This parser object can then be used to perform the routine processing of the XML file, generating a series of document events and potentially error events based on the information in the file.
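To make the idea of a factory method more concrete, here is a minimal sketch of a method that creates an object from a class name held in a String. This is not the source code of ParserFactory.makeParser(). The names FactorySketch and makeInstance are hypothetical, and java.util.ArrayList merely stands in for a vendor's parser class so that the sketch runs without any parser library.

```java
import java.util.List;

class FactorySketch {

    // Hypothetical factory method: loads the named class at run time and
    // returns a new instance, checked against the type the caller expects.
    static <T> T makeInstance(String className, Class<T> expectedType) {
        try {
            Class<?> loaded = Class.forName(className);
            Object instance = loaded.getDeclaredConstructor().newInstance();
            // cast() throws ClassCastException if the loaded class does
            // not implement the expected type
            return expectedType.cast(instance);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot create an instance of " + className, e);
        }
    }

    public static void main(String[] args) {
        // The class name could just as easily come from a property setting
        List<?> list = makeInstance("java.util.ArrayList", List.class);
        System.out.println(list.getClass().getName());
    }
}
```

The point of the pattern is that the calling code names only the interface type at compile time. The concrete class is chosen by a String at run time, which is what lets one program work with parsers from different vendors.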

a DocumentHandler object

The next fragment instantiates an object of the DocumentHandler type to handle events and errors. Note that DocumentHandler is an interface and is not a class.

I will explain how this object performs its work in conjunction with a discussion of the EventHandler class later.

DocumentHandler handler = new EventHandler();

something to confuse you

The two statements in the next fragment can be confusing to persons who have become used to Java Beans design patterns. Generally the design patterns indicate:

Methods that are used to register event listeners should begin with the word add.
Methods that provide mutable access to properties should begin with the word set.

However, the two statements in the next fragment invoke methods that begin with the word set to register two different listeners on the Parser object.

One of those handlers listens for document events such as the start or end of an element. The other handler listens for events caused by errors in the XML data.

different interfaces for events and errors

Document event methods and error event methods are declared in two different interfaces. The handler object instantiated above is of the class EventHandler, whose superclass implements both interfaces, so a single object can listen for both types of events. However, the reference variable named handler was declared as type DocumentHandler. That is why the reference must be cast to type ErrorHandler before the object is registered on the parser as an error handler.

parser.setDocumentHandler(handler);
parser.setErrorHandler((ErrorHandler)handler);
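The need for the cast comes from the declared type of the variable, not from the object itself. The following sketch assumes the SAX 1 classes that ship with the JDK and uses a plain HandlerBase object in place of EventHandler. It shows that declaring the variable with a type that implements both interfaces removes the need for the cast. The class name CastSketch and its method are hypothetical.

```java
import org.xml.sax.DocumentHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;

class CastSketch {

    // Returns true when each style ends up referring to a single object
    // through both interface types.
    static boolean sameObjectBothWays() {
        // Declared as DocumentHandler: the compiler only knows about
        // that interface, so an explicit cast is required.
        DocumentHandler handler = new HandlerBase();
        ErrorHandler viaCast = (ErrorHandler) handler;

        // Declared as HandlerBase: both interfaces are visible to the
        // compiler, so no cast is needed.
        HandlerBase both = new HandlerBase();
        DocumentHandler asDocumentHandler = both;
        ErrorHandler asErrorHandler = both;

        return viaCast == handler && asDocumentHandler == asErrorHandler;
    }

    public static void main(String[] args) {
        System.out.println(sameObjectBothWays());
    }
}
```

Either style registers the same object for both kinds of events. The choice is only a matter of how much the compiler is told about the object.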

generating events

The single executable statement in the next fragment is where the action is centered. This statement executes the parse() method on the object of type Parser to make a pass through the XML document specified by the parameter (Sax01.xml).

parser.parse("Sax01.xml");

While making the pass through the document, this method generates a variety of document events and error events as the various tags, attributes, and data values in that document are encountered.

handling events

This, in turn, causes event and error handling methods overridden by the application programmer to be executed, providing the functional behavior of the program.

The above statement ends the main() method and also ends the controlling class.
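For readers who want to watch the events arrive without installing XML4J, here is a minimal sketch under some stated assumptions. It obtains a SAX 1 Parser object from the JAXP parser bundled with the JDK (through javax.xml.parsers.SAXParserFactory) instead of from ParserFactory.makeParser(), and it reads the XML from a String rather than from Sax01.xml. The class names ParseSketch and Recorder are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

class ParseSketch {

    // Records a description of each document event in the order received.
    static class Recorder extends HandlerBase {
        final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("Start Document"); }
        @Override public void endDocument() { events.add("End Document"); }
        @Override public void startElement(String name, AttributeList atts) {
            events.add("Start element: " + name);
        }
        @Override public void endElement(String name) {
            events.add("End element: " + name);
        }
    }

    static List<String> parse(String xml) {
        try {
            // Assumption: the JDK's own parser stands in for XML4J
            Parser parser =
                    SAXParserFactory.newInstance().newSAXParser().getParser();
            Recorder handler = new Recorder();
            parser.setDocumentHandler(handler);
            parser.setErrorHandler(handler);
            // A character stream is used here instead of a file name
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<poem><line>Roses are red,</line></poem>";
        parse(xml).forEach(System.out::println);
    }
}
```

The single call to parse() drives everything. The handler methods are never called by the application itself. They are called back by the parser as it moves through the document.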

the EventHandler class

The next fragment begins the definition of the class containing overridden methods for handling document events and error events.

class EventHandler extends HandlerBase{

  //handle startDocument event
  public void startDocument(){
    System.out.println("Start Document");
  }//end startDocument()
    
  //handle endDocument event
  public void endDocument(){
    System.out.println("End Document");
  }//end endDocument()

This class extends the class named HandlerBase. The class named HandlerBase, which is the default base class for handlers, implements the default behavior for four different SAX interfaces:

DocumentHandler
ErrorHandler
EntityResolver
DTDHandler

The first two of these interfaces are of interest to us in this article. If I have the time, I will pursue the other two interfaces in subsequent articles.

use of HandlerBase is optional

The use of the HandlerBase class is optional. Application writers can extend this class when they need to implement only part of an interface.

Parser writers can instantiate this class to provide default handlers when the application has not supplied its own.

overriding methods to provide functionality

The EventHandler class overrides the event handling methods of the DocumentHandler interface and the ErrorHandler interface to provide the desired functionality for the program.

The above fragment shows the beginning of the class along with the first two overridden event-handling methods.

start and end document events

The Parser object invokes these two overridden methods when the parse process encounters the beginning and the end of the XML document.

The default versions of these two methods return quietly doing nothing. Application writers can override the startDocument() method to take specific actions at the beginning of a document (such as creating an output file).

Similarly the application writer can override endDocument() to take specific action at the end of a document (such as closing a file).

Note that these methods don't receive any parameters.

In this sample program, these overridden methods simply announce the beginning and the end of the document.
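As an illustration of using these two methods for setup and cleanup, the following sketch resets a counter in startDocument() and records the final count in endDocument(). It is not part of the sample program. It assumes the parser bundled with the JDK, and the class name ElementCounter and the method countElements are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

class ElementCounter extends HandlerBase {
    private int count;
    private int total = -1;   // stays -1 until a document has been completed

    // Set up: start every document with a clean counter
    @Override public void startDocument() { count = 0; }

    @Override public void startElement(String name, AttributeList atts) {
        count++;
    }

    // Clean up: the count is final only when the document has ended
    @Override public void endDocument() { total = count; }

    static int countElements(String xml) {
        try {
            ElementCounter counter = new ElementCounter();
            // Assumption: the JDK's own parser stands in for XML4J
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), counter);
            return counter.total;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements(
                "<poem><line>Roses are red,</line>"
                + "<line>Violets are blue.</line></poem>"));
    }
}
```

Because the class extends HandlerBase, it needs to override only the three methods that matter to it. All of the other handler methods keep their do-nothing defaults.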

the start element event

The next overridden handler method is more complicated than most in this article. This method is invoked at the start of every element.

For review, the start or beginning of an element might look like this in an XML document:

<poem PoemNumber="1" DummyAttribute="dummy value">

The portions PoemNumber="1" and DummyAttribute="dummy value" are commonly referred to as attributes. An element can contain no attributes or multiple attributes.

In this case, the element named poem contains two attributes named PoemNumber and DummyAttribute (the name of the attribute is unrelated to the name of the element).

Each attribute also has a value, which is enclosed in double quotation marks. In this case, the values for the two attributes are 1 and dummy value.

the startElement() event handler method

The event handler method that gets called when the parser encounters a new element is startElement(), as shown in the next fragment.

This method receives two parameters. The first parameter is a String containing the name of the element. The second parameter is a reference to an object of type AttributeList containing information about the attributes.

The code in the following fragment iterates through the AttributeList object, extracting and displaying information about each of the attributes described by that object.

public void startElement(
               String name,AttributeList atts){
  System.out.println("Start element: " + name);
  if (atts != null) {
    int len = atts.getLength();
    //process all attributes
    for (int i = 0; i < len; i++) {
      String attName = atts.getName(i);
      String type = atts.getType(i);
      String value = atts.getValue(i);
      System.out.println(
                        "Attribute: " + attName 
                         + ", Value = " + value 
                         + ", Type = " + type);
    }//end for loop on attributes
  }//end if
}//end startElement()

the AttributeList interface

Note that AttributeList is an interface that is implemented by the parser vendor.

An AttributeList object includes only attributes that have been specified or defaulted: #IMPLIED attributes are not included.

Here is some information about attributes of the #IMPLIED type:

The XML document may provide a value for the attribute but is not required to do so. In this case, if no value is provided, an application-dependent value will be used. For example, for an IMPLIED attribute named backgroundColor, an XML processor might accept a value if provided in the XML document, and might cause the background color to be green if an attribute value is not provided. A different XML processor might cause the same default background color to be red. That is what I mean by "application-dependent value."
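The difference between a defaulted attribute and an #IMPLIED attribute can be sketched with a small document that carries its own DTD. The DTD below is hypothetical and is not part of Sax01.xml. It declares PoemNumber as #IMPLIED and gives an invented attribute named Mood a default value. The sketch assumes the parser bundled with the JDK, and the class name ImpliedSketch is also hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

class ImpliedSketch {

    // Hypothetical DTD: PoemNumber is #IMPLIED, Mood has a default value.
    // The poem element itself specifies neither attribute.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE poem [\n"
        + "  <!ELEMENT poem (#PCDATA)>\n"
        + "  <!ATTLIST poem PoemNumber CDATA #IMPLIED\n"
        + "                 Mood       CDATA \"cheerful\">\n"
        + "]>\n"
        + "<poem>Roses are red,</poem>";

    // Returns name=value for every attribute the parser reports.
    static List<String> reportedAttributes(String xml) {
        final List<String> seen = new ArrayList<>();
        HandlerBase handler = new HandlerBase() {
            @Override
            public void startElement(String name, AttributeList atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    seen.add(atts.getName(i) + "=" + atts.getValue(i));
                }
            }
        };
        try {
            // Assumption: the JDK's own parser stands in for XML4J
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(reportedAttributes(XML));
    }
}
```

With this parser I would expect the list to contain the defaulted Mood attribute and nothing for PoemNumber, which matches the rule stated above. Other parsers may differ in how much of the DTD they read.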

getting information from the AttributeList

There are two ways for the application to obtain information from the AttributeList.

First, it can iterate through the entire list as in the above fragment.

Second, the application can request the value or type of specific attributes as in the following code where the name of the attribute is passed as a parameter.

Note that this formulation is not used in this sample program.

public void startElement(
              String name, AttributeList atts){
  String identifier = atts.getValue("id");
  String label = atts.getValue("label");
   [...]
}//end startElement()
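One caution about lookup by name: getValue(String) returns null when the element has no attribute with that name, so the result should be checked before it is used. The following sketch shows one way to do that. It builds the attribute list by hand with the SAX 1 helper class org.xml.sax.helpers.AttributeListImpl so that it runs without a parser. The class name LookupSketch and its methods are hypothetical.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.helpers.AttributeListImpl;

class LookupSketch {

    // Returns the named attribute's value, or the fallback when the
    // attribute is absent (getValue(String) returns null in that case).
    static String valueOrDefault(
            AttributeList atts, String name, String fallback) {
        String value = atts.getValue(name);
        return (value != null) ? value : fallback;
    }

    // Builds a list by hand so the sketch runs without a parser.
    static AttributeList sampleAttributes() {
        AttributeListImpl atts = new AttributeListImpl();
        atts.addAttribute("PoemNumber", "CDATA", "1");
        atts.addAttribute("DummyAttribute", "CDATA", "dummy value");
        return atts;
    }

    public static void main(String[] args) {
        AttributeList atts = sampleAttributes();
        System.out.println(valueOrDefault(atts, "PoemNumber", "none"));
        System.out.println(valueOrDefault(atts, "id", "none"));
    }
}
```

The first lookup finds the attribute and yields its value. The second asks for an attribute that is not in the list and falls back to the supplied default.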

a portion of the program output

The output produced for the first element and the attributes of that element for each of the poems in this article is shown in the following box. Note that line breaks and spaces were manually inserted to force the material to fit in this format.

Start element: poem
Attribute: PoemNumber,
           Value = 1,
           Type = CDATA
Attribute: DummyAttribute,
           Value = dummy value,
           Type = CDATA
...

Start element: poem
Attribute: PoemNumber,
           Value = 2,
           Type = CDATA
Attribute: DummyAttribute,
           Value = dummy value,
           Type = CDATA

what is type CDATA?

The name and value of the attribute are pretty obvious, but what about the type CDATA? Here is some information about this, but I doubt that it will mean much when taken out of context:

CDATA means that the value of this attribute may be any string of characters (as well as an empty string) and should be ignored by the parser. CDATA is used in situations where it is impossible to force more strict limitations on the attribute value with one of the following keywords...

There are three allowable types for an attribute:

string type, such as CDATA
tokenized types
enumerated types, such as (true | false)

I'm going to drop this discussion at this point. If you would like to pursue it further, read my tutorial that contains detailed information about DTDs.

the endElement() handler method is much simpler

Because it doesn't need to deal with attributes, the overridden endElement() event handler is much simpler. This method is invoked when the parser encounters an end tag for an element.

This method receives a single parameter that is the name of the element. This overridden version simply announces that the event has occurred and displays the name of the element.

  public void endElement (String name){
    System.out.println("End element: " + name);
  }//end endElement()

the content of an element

The content of an XML element is the text that appears between the beginning and ending tags. The next fragment shows the event handler that is invoked by the parser when the parser encounters content. The name of the content handler method is characters().

Here is what the documentation has to say about the characters() method:

public void characters(char[] ch,
                         int start,
                         int length)
                  throws SAXException

Receive notification of character data. 

The Parser will call this method to report each
chunk of character data. SAX parsers may return
all contiguous character data in a single 
chunk, or they may split it into several 
chunks; however, all of the characters in any 
single event must come from the same external 
entity, so that the Locator provides useful 
information.

The application must not attempt to read from 
the array outside of the specified range.

Note that some parsers will report whitespace 
using the ignorableWhitespace() method rather 
than this one (validating parsers must do so).

Parameters:
     ch - The characters from the XML document.
     start - The start position in the array.
     length - The number of characters to 
       read from the array.

the characters() method in a nutshell

This method receives a character array containing the content of an element. The overridden version of the method in this sample program simply converts the array to a String object and displays it.

  public void characters(
               char[] ch,int start,int length){
    System.out.println(new String(
                           ch, start, length));
  }//end characters()

This overridden method produced the lines of poem text (the lines between each Start element and End element pair) in the following output for the first poem.

Start element: line
Roses are red,
End element: line

Start element: line
Violets are blue.
End element: line

Start element: line
Sugar is sweet,
End element: line

Start element: line
and so are you.
End element: line
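Because the documentation allows a parser to deliver the content of one element in several chunks, a handler that needs the complete text should accumulate the chunks and use the result only when the end of the element is reached. Here is a minimal sketch of that idea. It is not how the sample program works, and the class name TextCollector is hypothetical. The demo() method calls the handler methods directly to simulate a parser that splits one line into two chunks.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

class TextCollector extends HandlerBase {
    private final StringBuilder buffer = new StringBuilder();
    private String lastText = "";

    @Override
    public void startElement(String name, AttributeList atts) {
        buffer.setLength(0);              // start fresh for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element. Only the range
        // from start to start + length - 1 may be read.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String name) {
        lastText = buffer.toString();     // content is complete only here
    }

    // Simulates a parser that splits one line of text into two chunks.
    static String demo() {
        TextCollector collector = new TextCollector();
        char[] first = "Roses ".toCharArray();
        char[] second = "are red,".toCharArray();
        collector.startElement("line", null);
        collector.characters(first, 0, first.length);
        collector.characters(second, 0, second.length);
        collector.endElement("line");
        return collector.lastText;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This simple version suits elements such as line that contain only text. An element that mixes text with child elements would need a buffer per nesting level.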

additional event handler methods

That completes my discussion of overridden methods of the DocumentHandler interface. The above examples have shown all of the methods of this interface except for the following:

ignorableWhitespace(char[] ch, int start, int length)
processingInstruction(java.lang.String target, java.lang.String data)
setDocumentLocator(Locator locator)

I will leave it as an exercise for the reader to investigate the first two methods in this list. The third method will be used later in this sample program.

the ErrorHandler interface

That brings us to the methods that are declared in the interface named ErrorHandler. This interface, which declares three different handler methods, is the basic interface for SAX error handlers.

A SAX application that needs to implement customized error handling must implement this interface. Then it must register an object of the interface type with the SAX parser using the parser's setErrorHandler() method. The parser will then report all errors and warnings through this interface.

avoiding exceptions

When the handler object is registered on the parser, the parser will use this interface instead of throwing an exception. It is then up to the application to decide what to do about the problem, including whether to throw an exception for different types of errors and warnings.

Note that there is no requirement for the parser to continue to provide useful information after a call to the fatalError() method.

a default error handling implementation

The HandlerBase class provides a default implementation of this interface, ignoring warnings and recoverable errors and throwing a SAXParseException for fatal errors. An application can extend that class, as was done in this sample program, rather than implementing the complete interface itself.

overridden error handler methods

The overridden versions of all three of the error handler methods are shown in the next fragment. All three of the methods make a call to the method named getLocationString() to determine the location of the problem in the XML document and to display that information along with the nature of the message.

The getLocationString() method is discussed later.

In addition, the fatalError() method terminates the program after displaying a termination message.

  public void warning(SAXParseException ex){
    System.out.println("[Warning] " 
                  + getLocationString(ex)+ ": "
                            + ex.getMessage());
  }//end warning()
  //-----------------------------------------//

  public void error(SAXParseException ex) {
    System.out.println("[Error] "
                  + getLocationString(ex)+ ": "
                            + ex.getMessage());
  }//end error()
  //-----------------------------------------//

  public void fatalError(SAXParseException ex)
                         throws SAXException {
    System.out.println("[Fatal Error] "
                 + getLocationString(ex)+ ": "
                           + ex.getMessage());
    System.out.println("Terminating");
    System.exit(1);
  }//end fatalError()
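Calling System.exit() is reasonable in a small demonstration, but it is only one of the choices open to the application. As an alternative, the following sketch records each problem in a list and rethrows the fatal error so that the caller can decide what to do. It assumes the parser bundled with the JDK rather than XML4J, and the class names ErrorSketch and Collector are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ErrorSketch {

    // Records problems and rethrows fatal errors rather than exiting.
    static class Collector extends HandlerBase {
        final List<String> problems = new ArrayList<>();

        @Override public void warning(SAXParseException ex) {
            problems.add("[Warning] " + ex.getMessage());
        }
        @Override public void error(SAXParseException ex) {
            problems.add("[Error] " + ex.getMessage());
        }
        @Override public void fatalError(SAXParseException ex)
                throws SAXException {
            problems.add("[Fatal Error] " + ex.getMessage());
            throw ex;   // stop the parse and let the caller decide
        }
    }

    // Returns the problems reported while parsing the given XML text.
    static List<String> check(String xml) {
        Collector handler = new Collector();
        try {
            // Assumption: the JDK's own parser stands in for XML4J
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // A fatal error ends the parse; it was recorded by the handler
        } catch (Exception e) {
            throw new IllegalStateException("Could not run the parser", e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        // The second line element is missing its end tag
        System.out.println(
                check("<poem>\n<line>a</line>\n<line>b\n</poem>"));
    }
}
```

The wording of the recorded message comes from the parser, so it will differ from the XML4J message shown in this article.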

the getLocationString() method

The next fragment shows the beginning of a private utility method named getLocationString().

This method is called by each of the error handling methods to determine the location in the XML file where the error was detected by the parser.

constructing an information String

The method declares a StringBuffer object that is later used to construct a String containing the desired information to return to the calling method.

  private String getLocationString(
                           SAXParseException ex){
    StringBuffer str = new StringBuffer();

getting the name of the XML file

The first task undertaken by this method is to determine the name of the XML file being processed when the error occurred. This information, and other useful information as well, is contained in the SAXParseException object received by the error handler and passed on to this method as a parameter.

some methods of the SAXParseException class

The following methods of the SAXParseException class are of interest in this article:

getColumnNumber() -- Get the column number (int) of the end of the text where the exception occurred.

getLineNumber() -- Get the line number (int) of the end of the text where the exception occurred.

getSystemId() -- Get the system identifier (String) of the entity where the exception occurred.

error message output

I'm going to begin the discussion by showing you the output produced on my computer by purposely omitting an end tag from one of the lines. Note that I manually inserted line breaks to force the material to fit in this format.

systemID: 
file:/G:/Baldwin/AA-School/JavaProg/Combined
    /Java/Sax01.xml
[Fatal Error] 
Sax01.xml:17:7: "</line>" expected.
Terminating

The beginning portions of the code that produced this output are shown below.

    String systemId = ex.getSystemId();
    if(systemId != null){
      System.out.println("systemID: " + systemId);
      //get file name from end of systemID
      int index = systemId.lastIndexOf('/');
      if(index != -1){
        systemId = systemId.substring(index + 1);
      }//end if(index..
      str.append(systemId);
    }//end if(systemID...

The complete output shown above was produced by a combination of this method and the fatalError() method shown earlier. Part of the output was produced by the fatalError() method using the String object returned by this method.

getSystemId() returns a URL

As you can see, the String that was returned by the getSystemId() method is the URL for the XML file on the local drive (G:).

Although there is quite a bit of code involved, all that it does is extract the filename from the end of the URL and append it at the beginning of the StringBuffer object being constructed for return to the calling method.

getting line and column numbers

The next fragment completes the construction of the StringBuffer object by getting the line and column number of the location of the problem in the XML file using the two methods described earlier.

This information is appended onto the StringBuffer object with some colons added for cosmetic purposes.

returning the String

Then the StringBuffer object is converted to a String object and returned to the calling error handler method where it is displayed on the screen.

    str.append(':');
    str.append(ex.getLineNumber());
    str.append(':');
    str.append(ex.getColumnNumber());

    return str.toString();

  }//end getLocationString()
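The pieces of this method can be tried out without a parser by building a SAXParseException by hand. The following sketch is a condensed variation on the method, not a copy of it. It uses StringBuilder in place of StringBuffer and omits the display of the full system ID. The class name LocationSketch and the system ID in main() are made up for illustration.

```java
import org.xml.sax.SAXParseException;

class LocationSketch {

    // Builds "file:line:column" from the exception's location fields.
    static String locationString(SAXParseException ex) {
        StringBuilder str = new StringBuilder();
        String systemId = ex.getSystemId();
        if (systemId != null) {
            // Keep only the text after the last '/' (the file name).
            // With no '/', lastIndexOf returns -1 and the whole
            // string is kept.
            str.append(systemId.substring(systemId.lastIndexOf('/') + 1));
        }
        str.append(':').append(ex.getLineNumber());
        str.append(':').append(ex.getColumnNumber());
        return str.toString();
    }

    public static void main(String[] args) {
        // Hand-built exception so the sketch runs without a parser.
        // Arguments: message, public ID, system ID, line, column.
        SAXParseException ex = new SAXParseException(
                "\"</line>\" expected.", null,
                "file:/C:/example/Sax01.xml", 17, 7);
        System.out.println(locationString(ex));
    }
}
```

When the system ID is null, the result begins with a colon, which is also what the original method would produce.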

once again: what is SAX?

So, there you have it: a four-part series of articles that answers the burning question: What is SAX?

Now you know what SAX is, and why it is important to Java programmers writing applications to process XML documents.

coming attractions...

Subsequent articles will provide more useful illustrations of the capability provided by SAX.

Subsequent articles will also provide a similar treatment for DOM, the Document Object Model.

the XML octopus

Trying to wrap your brain around XML is sort of like trying to put an octopus in a bottle. Every time you think you have it under control, a new tentacle shows up. XML has many tentacles, reaching out in all directions. But, that's what makes it fun. As your XML host, I will do my best to lead you to the information that you need to keep the XML octopus under control.

Credits

This HTML page was produced using the WYSIWYG features of Microsoft Word 97. The images on this page were used with permission from the Microsoft Word 97 Clipart Gallery.

About the author

Richard Baldwin is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two. He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.
