Java XML Parsing
Presentation Transcript

  • XML Prepared By Srinivasan Jayakumar
  • Briefly: The Power of XML
    • XML is Extensible Markup Language
      • Text-based representation for describing data structure
        • Both human and machine readable
      • Originated from Standard Generalized Markup Language (SGML)
      • Became a World Wide Web Consortium (W3C) standard in 1998
    • XML is a great choice for exchanging data between disparate systems
  • Synergy between Java and XML
    • Java+XML=Portable language+Portable Data
    • Java can be used to generate and process XML data
      • Use Java to access SQL databases
      • Use Java to format data in XML
      • Use Java to parse data
      • Use Java to validate data
      • Use Java to transform data
  • HTML and XML
    • HTML and XML look similar, because they are both SGML languages
      • use elements enclosed in tags (e.g. <body>This is an element</body> )
      • use tag attributes (e.g., <font face="Verdana" size="+1" color="red"> )
    • More precisely,
      • HTML is defined in SGML
      • XML is a (very small) subset of SGML
  • HTML and XML
    • HTML is for humans
      • HTML describes web pages
      • Browsers ignore and/or correct many HTML errors, so HTML is often sloppy
    • XML is for computers
      • XML describes data
      • The rules are strict and errors are not allowed
        • In this way, XML is like a programming language
      • Current versions of most browsers display XML
  • Example XML document
      <?xml version="1.0"?>
      <weatherReport>
        <date>7/14/97</date>
        <city>North Place</city>, <state>NX</state> <country>USA</country>
        High Temp: <high scale="F">103</high>
        Low Temp: <low scale="F">70</low>
        Morning: <morning>Partly cloudy, Hazy</morning>
        Afternoon: <afternoon>Sunny &amp; hot</afternoon>
        Evening: <evening>Clear and Cooler</evening>
      </weatherReport>
  • Overall structure
    • An XML document may start with one or more processing instructions or directives:
      • <?xml version="1.0"?>
        <?xml-stylesheet type="text/css" href="ss.css"?>
    • Following the directives, there must be exactly one root element containing all the rest of the XML:
      • <weatherReport> ... </weatherReport>
  • XML building blocks
    • Aside from the directives, an XML document is built from:
      • elements: the element named high in <high scale="F">103</high>
      • tags, in pairs: <high scale="F"> ... </high>
      • attributes: scale="F" in <high scale="F">103</high>
      • entities: &amp; in <afternoon>Sunny &amp; hot</afternoon>
      • data: 103 in <high scale="F">103</high>
  • Elements and attributes
    • Attributes and elements are often interchangeable
    • Example:
      • As elements:
          <name>
            <first>David</first>
            <last>Smith</last>
          </name>
      • As attributes:
          <name first="David" last="Smith"></name>
    • Elements are easier to use from Java
    • Attributes may contain elaborate metadata, such as unique IDs
  • Well-formed XML
    • In XML, every element must have both a start tag and an end tag, e.g. <name> ... </name>
      • Empty elements can be abbreviated: <break /> .
      • XML tags are case sensitive and may not begin with the letters xml , in any combination of cases
    • Elements must be properly nested
      • e.g. not <b><i>bold and italic</b></i>
    • XML document must have one and only one root element
    • The values of attributes must be enclosed in quotes
      • e.g. <time unit="days">
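  • Illustration: the well-formedness rules above can be checked by handing the text to any XML parser and seeing whether it reports a fatal error. The sketch below is not part of the original slides; the class name WellFormedCheck and the method isWellFormed are hypothetical, and it uses the JAXP DocumentBuilder that is covered later in this presentation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler's fatalError() rethrows the parse exception,
            // so a violation surfaces as a SAXException below.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<b><i>bold and italic</i></b>")); // properly nested
        System.out.println(isWellFormed("<b><i>bold and italic</b></i>")); // improperly nested
        System.out.println(isWellFormed("<a/><b/>"));                      // two root elements
        System.out.println(isWellFormed("<time unit=days/>"));             // unquoted attribute value
    }
}
```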
  • XML as a tree
    • An XML document represents a hierarchy
    • A hierarchy is a tree
    [Figure: tree diagram of a novel document. Node labels: novel, foreword, chapter number="1", and three paragraph nodes; paragraph text: "This is the great American novel.", "It was a dark and stormy night.", "Suddenly, a shot rang out!"]
  • Viewing XML
    • XML is designed to be processed by computer programs, not to be displayed to humans
    • Nevertheless, almost all current Web browsers can display XML documents
      • They do not all display it the same way
      • They may not display it at all if it has errors
    • This is just an added value. Remember: HTML is designed to be viewed, XML is designed to be used
  • XML Parsers
  • Stream Model
    • Stream seen by parser is a sequence of elements
    • As each XML element is seen, an event occurs
      • Some code registered with the parser (the event handler) is executed
    • This approach is popularized by the Simple API for XML (SAX)
    • Problem:
      • Hard to get a global view of the document
      • Parsing state represented by global variables set by the event handlers
  • Data Model
    • The XML data is transformed into a navigable data structure in memory
      • Because of the nesting of XML elements, a tree data structure is used
      • The tree is navigated to discover the XML document
    • This approach is popularized by the Document Object Model (DOM)
    • Problem:
      • May require large amounts of memory
      • May not be as fast as stream approach
        • Some DOM parsers use SAX to build the tree
  • SAX and DOM
    • SAX and DOM are standards for XML parsers
      • DOM is a W3C standard
      • SAX is an ad-hoc (but very popular) standard
    • There are various implementations available
    • Java implementations are provided as part of JAXP ( Java API for XML Processing )
    • JAXP package is included in JDK starting from JDK 1.4
      • Is available separately for Java 1.3
  • Difference between SAX and DOM
    • DOM reads the entire document into memory and stores it as a tree data structure
    • SAX reads the document and calls handler methods for each element or block of text that it encounters
    • Consequences:
      • DOM provides "random access" into the document
      • SAX provides only sequential access to the document
      • DOM is slower and requires much more memory, so it is a poor fit for very large documents
      • SAX is fast and requires very little memory, so it can be used for huge documents
        • This makes SAX much more popular for web sites
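  • Illustration: the same task, counting the elements in a document, done both ways. This is a minimal sketch rather than material from the original slides; the class name SaxVsDomCount and both method names are hypothetical. The SAX version keeps only a counter while events stream past, whereas the DOM version builds the whole tree before querying it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxVsDomCount {
    /** SAX: count start-tag events as they arrive; nothing else is retained. */
    static int countWithSax(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    /** DOM: build the complete tree in memory, then ask it for all elements. */
    static int countWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("*").getLength(); // "*" matches every element
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b/><c><d/></c></a>";
        System.out.println(countWithSax(xml) + " " + countWithDom(xml));
    }
}
```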
  • SAX Parsing
  • Parsing with SAX
    • SAX uses the source-listener-delegate model for parsing XML documents
      • The source is XML data consisting of XML elements
      • A listener written in Java is attached to the parser and waits for parsing events
      • When an event occurs, the parser delegates its handling to the corresponding listener method
  • SAX Parsing: process XML as Stream
  • Simple SAX program
    • The program consists of two classes:
      • Sample -- This class contains the main method; it
        • Gets a factory to make parsers
        • Gets a parser from the factory
        • Creates a Handler object to handle callbacks from the parser
        • Tells the parser which handler to send its callbacks to
        • Reads and parses the input XML file
      • Handler -- This class contains handlers for three kinds of callbacks:
        • startElement callbacks, generated when a start tag is seen
        • endElement callbacks, generated when an end tag is seen
        • characters callbacks, generated for the contents of an element
  • The Sample class
    • import javax.xml.parsers.*; // for both SAX and DOM
      import org.xml.sax.*;
      import org.xml.sax.helpers.*;
    • // For simplicity, we let the operating system handle exceptions
      // In "real life" this is poor programming practice
      public class Sample {
        public static void main(String args[]) throws Exception {
    • // Create a parser factory
      SAXParserFactory factory = SAXParserFactory.newInstance();
    • // Tell factory that the parser must understand namespaces
      factory.setNamespaceAware(true);
    • // Make the parser
      SAXParser saxParser = factory.newSAXParser();
      XMLReader parser = saxParser.getXMLReader();
  • The Sample class
    • // Create a handler
      Handler handler = new Handler();
      • // Tell the parser to use this handler
        parser.setContentHandler(handler);
      • // Finally, read and parse the document
        parser.parse("hello.xml");
      •   } // end of main
        } // end of Sample class
    • The parser reads the file hello.xml
    • The argument is a system identifier (URI); a relative name such as hello.xml is normally resolved against the directory the program is run from
  • The Handler class
    • public class Handler extends DefaultHandler {
      • DefaultHandler is an adapter class that defines empty methods to be overridden
    • We define 3 methods to handle (1) start tags, (2) contents, and (3) end tags.
      • The methods will just print a line
      • Each of these 3 methods throws a SAXException
    • // SAX calls this when it encounters a start tag
      public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes attributes) throws SAXException {
        System.out.println("startElement: " + qualifiedName);
      }
  • The Handler class
    • // SAX calls this method to pass in character data
      public void characters(char ch[], int start, int length) throws SAXException {
        System.out.println("characters: \"" + new String(ch, start, length) + "\"");
      }
    • // SAX calls this method when it encounters an end tag
      public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException {
        System.out.println("endElement: /" + qualifiedName);
      }
      } // End of Handler class
  • Results
    • If the file hello.xml contains:
        <?xml version="1.0"?>
        <display>Hello World!</display>
    • Then the output from running java Sample will be:
        startElement: display
        characters: "Hello World!"
        endElement: /display
  • More results
    • Now suppose the file hello.xml contains:
      • <?xml version="1.0"?>
        <display>
          <i>Hello</i> World!
        </display>
    • Notice that the root element, <display> , contains a nested element <i> and whitespace (including newlines)
    • The result will be (the exact chunking of character data may vary between parsers):
        startElement: display
        characters: ""            // empty string
        characters: "(newline)"   // newline
        characters: "  "          // spaces
        startElement: i
        characters: "Hello"
        endElement: /i
        characters: "World!"
        characters: "(newline)"   // another newline
        endElement: /display
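  • Illustration: because a parser may deliver an element's text in several characters callbacks, handlers commonly buffer the text and process it when the next tag arrives. The following is a sketch under that assumption, not code from the original slides; the class name BufferedTextHandler, the method eventsOf, and the "start:", "text:", "end:" labels are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Buffers character data so that chunked or whitespace-only callbacks
// do not show up as separate events.
class BufferedTextHandler extends DefaultHandler {
    final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        events.add("start:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be only one chunk of the text
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("end:" + qName);
    }

    // Emit the buffered text once, trimmed; whitespace-only text is dropped.
    private void flushText() {
        String s = text.toString().trim();
        if (!s.isEmpty()) {
            events.add("text:" + s);
        }
        text.setLength(0);
    }

    static List<String> eventsOf(String xml) {
        BufferedTextHandler handler = new BufferedTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(eventsOf("<display>\n  <i>Hello</i> World!\n</display>"));
    }
}
```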
  • Factories
    • SAX uses a parser factory
      • A factory is a design pattern alternative to constructors
    • Factories allow the programmer to:
      • Decide whether or not to create a new object
      • Decide what kind of object to create
      • class TrustMe {
          private TrustMe() { } // private constructor
          public static TrustMe makeTrust() { // factory method
            if ( /* test of some sort */ )
              return new TrustMe();
            return null; // the factory may decline to create an object
          }
        }
  • Parser factories
    • To create a SAX parser factory, call static method: SAXParserFactory.newInstance()
      • Returns an object of type SAXParserFactory
      • It may throw a FactoryConfigurationError
    • Then, the parser can be customized:
      • public void setNamespaceAware(boolean awareness)
        • Call this with true if you are using namespaces
        • The default (if you don’t call this method) is false
      • public void setValidating(boolean validating)
        • Call this with true if you want to validate against a DTD
        • The default (if you don’t call this method) is false
        • Validation will give an error if you do not have a DTD
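  • Illustration: a sketch of how the factory settings above affect what a handler sees. It is not from the original slides; the class name NamespaceAwareDemo, the method describeRoot, and the namespace URI urn:x are hypothetical. With namespace awareness on, startElement receives the namespace URI and local name; with it off, those two arguments are typically empty and only the qualified name is reliable.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceAwareDemo {
    /** Returns "namespaceURI|localName" as reported for the root element. */
    static String describeRoot(String xml, boolean namespaceAware) {
        final String[] seen = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (seen[0] == null) {
                    seen[0] = uri + "|" + localName; // record the root only
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.setValidating(false); // the default, shown for completeness
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        String xml = "<p:a xmlns:p=\"urn:x\"/>";
        System.out.println(describeRoot(xml, true));
        System.out.println(describeRoot(xml, false));
    }
}
```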
  • Getting a parser
    • Once a SAXParserFactory has been set up, parsers can be created with:
        SAXParser saxParser = factory.newSAXParser();
        XMLReader parser = saxParser.getXMLReader();
    • Note: SAXParser is not thread-safe
    • If a parser will be used in multiple threads, create a separate SAXParser object for each thread
  • Declaring which handler to use
    • Since the SAX parser will call the handlers, we need to supply these methods
    • Binding the parser with a handler: Handler handler = new Handler(); parser.setContentHandler(handler);
    • These statements could be combined: parser.setContentHandler(new Handler());
    • Finally, the parser is invoked on the file to parse: parser.parse("hello.xml");
    • Everything else is done in the handler methods
  • SAX handlers
    • A complete callback handler implements 4 interfaces:
      • interface ContentHandler
        • Handles basic parsing callbacks, e.g., element starts and ends
      • interface DTDHandler
        • Handles only notation and unparsed entity declarations
      • interface EntityResolver
        • Does customized handling for external entities
      • interface ErrorHandler
        • Must be implemented or parsing errors will be ignored!
    • Implementing all these interfaces is a lot of work
      • It is easier to use an adapter class
  • Class DefaultHandler
    • DefaultHandler is an adapter class from package org.xml.sax.helpers
    • DefaultHandler implements ContentHandler , DTDHandler , EntityResolver , and ErrorHandler
    • DefaultHandler provides empty methods for every method declared in each of the interfaces
    • To use this class, extend it and override the methods that are important to the application
  • ContentHandler methods
    • public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException
    • This method is called at the beginning of elements
    • When SAX calls startElement , it passes in a parameter of type Attributes
    • The following methods look up attributes by name rather than by index:
      • public int getIndex(String qualifiedName)
      • public int getIndex(String uri, String localName)
      • public String getValue(String qualifiedName)
      • public String getValue(String uri, String localName)
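  • Illustration: a sketch of looking up an attribute by name inside startElement, using the getIndex and getValue methods listed above. It is not from the original slides; the class name AttributeLookupDemo and the method attributeOf are hypothetical, and the sample data is taken from the weather report example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeLookupDemo {
    /** Returns the value of attrName on the first elementName element, or null. */
    static String attributeOf(String xml, String elementName, String attrName) {
        final String[] found = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // getIndex returns -1 when the attribute is absent
                if (found[0] == null && qName.equals(elementName)
                        && atts.getIndex(attrName) >= 0) {
                    found[0] = atts.getValue(attrName); // lookup by qualified name
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found[0];
    }

    public static void main(String[] args) {
        String xml = "<weatherReport><high scale=\"F\">103</high>"
                + "<low scale=\"F\">70</low></weatherReport>";
        System.out.println(attributeOf(xml, "high", "scale"));
    }
}
```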
  • ContentHandler methods
    • public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException
    • The parameters to endElement are the same as those to startElement , except that the Attributes parameter is omitted
    • public void characters(char[] ch, int start, int length) throws SAXException
    • ch is an array of characters
      • Only length characters, starting from ch[start] , are the contents of the element
  • Error Handling
    • SAX error handling is unusual
    • Most errors are ignored unless an error handler (org.xml.sax.ErrorHandler) is registered
      • Ignored errors can cause unexpected behavior
    • The ErrorHandler interface declares:
      • public void fatalError (SAXParseException exception) throws SAXException // XML is not well-formed
      • public void error (SAXParseException exception) throws SAXException // XML validation error
      • public void warning (SAXParseException exception) throws SAXException // minor problem
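  • Illustration: a sketch of an ErrorHandler that records validation problems instead of letting them pass unnoticed. It is not from the original slides; the class name ValidationErrorCollector, the method validate, and the inline DTD are hypothetical. Validation is switched on through the factory, and the document in main violates its own DTD because the required child element b is missing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ValidationErrorCollector implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage()); // minor problem, keep parsing
    }

    @Override
    public void error(SAXParseException e) {
        problems.add("error: " + e.getMessage());   // validation error, keep parsing
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        throw e;                                    // not well-formed, stop parsing
    }

    static List<String> validate(String xml) {
        ValidationErrorCollector collector = new ValidationErrorCollector();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // validate against the document's DTD
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.setErrorHandler(collector);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return collector.problems;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>";
        System.out.println(validate(dtd + "<a/>")); // <a> lacks the required <b>
    }
}
```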
  • External parsers
    • Alternatively, you can use an existing parser:
      • Xerces, Electric XML, Expat, MSXML, CMarkup
    • Stages of the parsing
      • Get the URL object for the source
      • Create InputSource object encapsulating the data source
      • Create the parser
      • Launch the parser on the data source
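  • Illustration: the four stages above, sketched with the JAXP parser bundled in the JDK rather than one of the external parsers named on this slide. This is not code from the original slides; the class name ParsingStagesDemo and the methods rootElementOf and demo are hypothetical, and demo writes a temporary file only so that there is a URL to parse.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParsingStagesDemo {
    /** Parses the document at the given URL and returns its root element name. */
    static String rootElementOf(URL url) {
        final String[] root = new String[1];
        try {
            // Stage 2: wrap the data source in an InputSource
            InputSource source = new InputSource(url.toString());
            // Stage 3: create the parser
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = qName;
                    }
                }
            });
            // Stage 4: launch the parser on the data source
            reader.parse(source);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return root[0];
    }

    /** Writes a small temporary file and parses it through its URL. */
    static String demo() {
        try {
            Path file = Files.createTempFile("hello", ".xml");
            try {
                Files.write(file, "<?xml version=\"1.0\"?><display>Hello World!</display>"
                        .getBytes(StandardCharsets.UTF_8));
                URL url = file.toUri().toURL(); // Stage 1: get the URL for the source
                return rootElementOf(url);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```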
  • Problems with SAX
    • SAX provides only sequential access to the document being processed
    • SAX has only a local view of the current element being processed
      • Global knowledge of parsing must be stored in global variables
      • A single startElement() method for all elements
        • In startElement() there are many “if-then-else” tests for checking a specific element
        • When an element is seen, a global flag is set
        • When finished with the element global flag must be set to false
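  • Illustration: a sketch of the flag technique described above, extracting the text of the city element from the weather report example. It is not from the original slides; the class name CityFlagHandler and the method cityOf are hypothetical. The handler's field inCity plays the role of the "global flag": it is set in startElement, consulted in characters, and cleared in endElement.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CityFlagHandler extends DefaultHandler {
    private boolean inCity = false;                 // parsing state kept between callbacks
    private final StringBuilder city = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("city".equals(qName)) {
            inCity = true;                          // entering the element of interest
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCity) {
            city.append(ch, start, length);         // only keep text inside <city>
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("city".equals(qName)) {
            inCity = false;                         // leaving the element, reset the flag
        }
    }

    static String cityOf(String xml) {
        CityFlagHandler handler = new CityFlagHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.city.toString();
    }

    public static void main(String[] args) {
        System.out.println(cityOf("<weatherReport><date>7/14/97</date>"
                + "<city>North Place</city><state>NX</state></weatherReport>"));
    }
}
```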
  • DOM Parsing
  • DOM
    • DOM represents the XML document as a tree
      • Hierarchical nature of tree maps well to hierarchical nesting of XML elements
      • Tree contains a global view of the document
        • Makes navigation of document easy
        • Allows any subtree to be modified
        • Easier processing than SAX but memory intensive!
    • Like SAX, DOM is only an API
      • Does not specify a parser
      • Lists the API and requirements for the parser
    • DOM parsers typically use SAX parsing
  • DOM Parsing: process entire document
  • Simple DOM program
    • First we need to create a DOM parser, called a DocumentBuilder
    • The parser is created, not by a constructor, but by calling a static factory method
    • DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    • DocumentBuilder builder = factory.newDocumentBuilder();
  • Simple DOM program
    • An XML file hello.xml will be parsed:
        <?xml version="1.0"?>
        <display>Hello World!</display>
    • To read this file, we add the following line: Document document = builder.parse("hello.xml");
    • document contains the entire XML file as a tree
    • The following code finds the content of the root element and prints it
    • Element root = document.getDocumentElement();
      Node textNode = root.getFirstChild();
      System.out.println(textNode.getNodeValue());
    • The output of the program is: Hello World!
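  • Illustration: the fragments above assembled into one self-contained sketch. It is not from the original slides; the class name DomHello and the method rootText are hypothetical, and the document is read from a string instead of the file hello.xml so that the example needs no external file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomHello {
    /** Returns the value of the first child node of the root element. */
    static String rootText(String xml) {
        try {
            // The parser comes from a static factory method, not a constructor
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Read the whole document into a tree
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            // Navigate: root element, then its first child (the text node)
            Element root = document.getDocumentElement();
            Node textNode = root.getFirstChild();
            return textNode.getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText("<?xml version=\"1.0\"?><display>Hello World!</display>"));
    }
}
```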
  • Reading in the tree
    • The parse method reads in the entire XML document and represents it as a tree in memory
      • For a large document, parsing could take a while
      • If you want to interact with your program while it is parsing, you need to run the parser in a separate thread
    • In practice, an XML parse tree may require up to 10 times as much memory as the original XML document
      • If you have a lot of tree manipulation to do, DOM is much more convenient than SAX
      • If you do not have a lot of tree manipulation to do, consider using SAX instead
  • Structure of the DOM tree
    • The DOM tree is composed of Node objects
    • Node is an interface
      • Some of the more important sub-interfaces are Element , Attr , and Text
        • An Element node may have children
        • Attr and Text nodes are the leaves of the tree
    • Hence, every object in the DOM tree can be handled through the Node interface
      • Node objects can be downcast into specific types if needed
  • Operations on Nodes
    • The results returned by getNodeName(), getNodeValue(), getNodeType() and getAttributes() depend on the subtype of the node, as follows:

                          Element        Text            Attr
        getNodeName()     tag name       "#text"         name of attribute
        getNodeValue()    null           text contents   value of attribute
        getNodeType()     ELEMENT_NODE   TEXT_NODE       ATTRIBUTE_NODE
        getAttributes()   NamedNodeMap   null            null
  • Distinguishing Node types
    • An easy way to handle different types of nodes:
      • switch (node.getNodeType()) {
          case Node.ELEMENT_NODE:
            Element element = (Element) node; ...; break;
          case Node.TEXT_NODE:
            Text text = (Text) node; ...; break;
          case Node.ATTRIBUTE_NODE:
            Attr attr = (Attr) node; ...; break;
          default: ...
        }
  • Operations on Nodes
    • Tree-walking methods that return a Node :
      • getParentNode()
      • getFirstChild()
      • getNextSibling()
      • getPreviousSibling()
      • getLastChild()
    • Test methods that return a boolean :
      • hasAttributes()
      • hasChildNodes()
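  • Illustration: a sketch that combines the node-type switch with the tree-walking methods to list a document's nodes recursively. It is not from the original slides; the class name DomWalker, the methods walk and describe, and the output format are hypothetical. Attributes are reached through getAttributes() rather than through the child links.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalker {
    /** Appends one line per element, attribute, and text node, indented by depth. */
    static void walk(Node node, String indent, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                out.append(indent).append("element ").append(node.getNodeName()).append('\n');
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Attr attr = (Attr) attrs.item(i);
                    out.append(indent).append("  attr ").append(attr.getName())
                       .append('=').append(attr.getValue()).append('\n');
                }
                break;
            }
            case Node.TEXT_NODE:
                out.append(indent).append("text ").append(node.getNodeValue()).append('\n');
                break;
            default:
                break; // other node types are skipped in this sketch
        }
        // Visit children left to right using the tree-walking methods
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, indent + "  ", out);
        }
    }

    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), "", out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(describe("<a x=\"1\"><b>hi</b></a>"));
    }
}
```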
  • Operations for Elements
    • String getTagName()
      • Returns the name of the tag
    • boolean hasAttribute(String name)
      • Returns true if this Element has the named attribute
    • String getAttribute(String name)
      • Returns the value of the named attribute
    • boolean hasAttributes()
      • Returns true if this Element has any attributes
    • NamedNodeMap getAttributes()
      • Returns a NamedNodeMap of all the Element’s attributes
  • Operations on Texts
    • Text is a subinterface of CharacterData and inherits the following operations (among others):
      • public String getData() throws DOMException
        • Returns the text contents of this Text node
      • public int getLength()
        • Returns the number of Unicode characters in the text
      • public String substringData(int offset, int count) throws DOMException
        • Returns a substring of the text contents
  • Operations on Attributes
    • String getName()
      • Returns the name of this attribute.
    • Element getOwnerElement()
      • Returns the Element node this attribute is attached to
    • String getValue()
      • Returns the value of the attribute as a String
  • Overview
    • DOM, unlike SAX, allows you to create and modify XML trees
    • There are three basic kinds of operations:
      • Creating a new DOM
      • Modifying the structure of a DOM
      • Modifying the content of a DOM
    • Creating a new DOM requires a few extra methods just to get started
      • Afterwards, you can add elements through modifying its structure and contents
  • Creating a new DOM
      import javax.xml.parsers.*;
      import org.w3c.dom.Document;
      …
      try {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.newDocument();
      } catch (ParserConfigurationException e) { ... }
  • Creating structure
    • The following are instance methods of Document :
      • public Element createElement(String tagName)
      • public Element createElementNS(String namespaceURI, String qualifiedName)
      • public Attr createAttribute(String name)
      • public Attr createAttributeNS(String namespaceURI, String qualifiedName)
      • public ProcessingInstruction createProcessingInstruction (String target, String data)
      • public EntityReference createEntityReference(String name)
      • public Text createTextNode(String data)
      • public Comment createComment(String data)
  • Methods of Node
    • public Node appendChild(Node newChild)
    • public Node insertBefore(Node newChild, Node refChild)
    • public Node removeChild(Node oldChild)
    • public Node replaceChild(Node newChild, Node oldChild)
    • public void setNodeValue(String nodeValue)
      • Functionality depends on the type of the node
  • Methods of Element
    • public void setAttribute(String name, String value)
    • public Attr setAttributeNode(Attr newAttr)
    • public void setAttributeNS(String namespaceURI, String qualifiedName, String value)
    • public Attr setAttributeNodeNS(Attr newAttr)
    • public void removeAttribute(String name)
    • public void removeAttributeNS(String namespaceURI, String localName)
    • public Attr removeAttributeNode(Attr oldAttr)
  • Method of Attribute
    • public void setValue(String value)
    • This is the only method that modifies an Attribute
      • The rest just retrieve information
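  • Illustration: a sketch that builds a small document from scratch with the creation and modification methods listed above, then serializes it. It is not from the original slides; the class name DomBuildDemo, the method buildName, and the id attribute are hypothetical. Serialization uses the JDK's javax.xml.transform API, which these slides do not cover, purely to make the result visible.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomBuildDemo {
    /** Builds a name element with first and last children and returns it as text. */
    static String buildName() {
        try {
            // Creating a new, empty DOM
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // Modifying the structure: create nodes, then attach them
            Element name = doc.createElement("name");
            doc.appendChild(name);
            Element first = doc.createElement("first");
            first.appendChild(doc.createTextNode("David"));
            name.appendChild(first);
            Element last = doc.createElement("last");
            last.appendChild(doc.createTextNode("Smith"));
            name.appendChild(last);

            // Modifying the content: set an attribute, then change it through Attr
            name.setAttribute("id", "n1");
            Attr id = name.getAttributeNode("id");
            id.setValue("n2");

            // Serialize with an identity transform (assumption: not covered in the slides)
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildName());
    }
}
```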
  • Queries ?