Java XML Parsing
    Java XML Parsing: Presentation Transcript

    • XML Prepared By Srinivasan Jayakumar
    • Briefly: The Power of XML
      • XML is Extensible Markup Language
        • Text-based representation for describing data structure
          • Both human and machine readable
        • Originated from Standard Generalized Markup Language (SGML)
        • Became a World Wide Web Consortium (W3C) standard in 1998
      • XML is a great choice for exchanging data between disparate systems
    • Synergy between Java and XML
      • Java+XML=Portable language+Portable Data
      • Java can be used to generate and process XML data
        • Use Java to access SQL databases
        • Use Java to format data in XML
        • Use Java to parse data
        • Use Java to validate data
        • Use Java to transform data
    • HTML and XML
      • HTML and XML look similar, because they are both SGML languages
        • use elements enclosed in tags (e.g. <body>This is an element</body> )
        • use tag attributes (e.g., <font face="Verdana" size="+1" color="red"> )
      • More precisely,
        • HTML is defined in SGML
        • XML is a (very small) subset of SGML
    • HTML and XML
      • HTML is for humans
        • HTML describes web pages
        • Browsers ignore and/or correct many HTML errors, so HTML is often sloppy
      • XML is for computers
        • XML describes data
        • The rules are strict and errors are not allowed
          • In this way, XML is like a programming language
        • Current versions of most browsers display XML
    • Example XML document
        <?xml version="1.0"?>
        <weatherReport>
          <date>7/14/97</date>
          <city>North Place</city>, <state>NX</state> <country>USA</country>
          High Temp: <high scale="F">103</high>
          Low Temp: <low scale="F">70</low>
          Morning: <morning>Partly cloudy, Hazy</morning>
          Afternoon: <afternoon>Sunny &amp; hot</afternoon>
          Evening: <evening>Clear and Cooler</evening>
        </weatherReport>
    • Overall structure
      • An XML document may start with one or more processing instructions or directives:
        • <?xml version="1.0"?>
        • <?xml-stylesheet type="text/css" href="ss.css"?>
      • Following the directives, there must be exactly one root element containing all the rest of the XML:
        • <weatherReport> ... </weatherReport>
    • XML building blocks
      • Aside from the directives, an XML document is built from:
        • elements: high in <high scale="F">103</high>
        • tags, in pairs: <high scale="F"> and </high> in <high scale="F">103</high>
        • attributes: scale="F" in <high scale="F">103</high>
        • entities: &amp; in <afternoon>Sunny &amp; hot</afternoon>
        • data: 103 in <high scale="F">103</high>
    • Elements and attributes
      • Attributes and elements are often interchangeable
      • Example:
        • <name> <first>David</first> <last>Smith</last> </name>
        • <name first="David" last="Smith"></name>
      • Elements are easier to use from Java
      • Attributes may contain elaborate metadata, such as unique IDs
    • Well-formed XML
      • In XML, every element must have both a start tag and an end tag, e.g. <name> ... </name>
        • Empty elements can be abbreviated: <break /> .
        • XML names are case sensitive; names beginning with the letters xml , in any combination of cases, are reserved
      • Elements must be properly nested
        • e.g. not <b><i>bold and italic</b></i>
      • XML document must have one and only one root element
      • The values of attributes must be enclosed in quotes
        • e.g. <time unit="days">
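The well-formedness rules above can be checked mechanically, because a conforming parser must reject a document that breaks them. The following is a minimal sketch, assuming the JDK's built-in JAXP parser; the class name `WellFormedCheck` and the method `isWellFormed` are hypothetical helpers, not part of the original slides.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class used only for illustration.
class WellFormedCheck {

    // Returns true if the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness violation is reported as a fatal error
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<b><i>bold and italic</i></b>")); // true
        System.out.println(isWellFormed("<b><i>bold and italic</b></i>")); // false: bad nesting
        System.out.println(isWellFormed("<time unit=days>3</time>"));      // false: unquoted attribute
    }
}
```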
    • XML as a tree
      • An XML document represents a hierarchy
      • A hierarchy is a tree
      [Figure: tree diagram. Root node: novel. Internal nodes: foreword, chapter number="1". Leaf nodes: three paragraph elements with the texts "This is the great American novel.", "It was a dark and stormy night.", and "Suddenly, a shot rang out!"]
    • Viewing XML
      • XML is designed to be processed by computer programs, not to be displayed to humans
      • Nevertheless, almost all current Web browsers can display XML documents
        • They do not all display it the same way
        • They may not display it at all if it has errors
      • This is just an added value. Remember: HTML is designed to be viewed, XML is designed to be used
    • XML Parsers
    • Stream Model
      • Stream seen by parser is a sequence of elements
      • As each XML element is seen, an event occurs
        • Some code registered with the parser (the event handler) is executed
      • This approach is popularized by the Simple API for XML (SAX)
      • Problem:
        • Hard to get a global view of the document
        • Parsing state represented by global variables set by the event handlers
    • Data Model
      • The XML data is transformed into a navigable data structure in memory
        • Because of the nesting of XML elements, a tree data structure is used
        • The tree is navigated to discover the XML document
      • This approach is popularized by the Document Object Model (DOM)
      • Problem:
        • May require large amounts of memory
        • May not be as fast as stream approach
          • Some DOM parsers use SAX to build the tree
    • SAX and DOM
      • SAX and DOM are standards for XML parsers
        • DOM is a W3C standard
        • SAX is a de facto (but very popular) standard
      • There are various implementations available
      • Java implementations are provided as part of JAXP ( Java API for XML Processing )
      • JAXP package is included in JDK starting from JDK 1.4
        • Is available separately for Java 1.3
    • Difference between SAX and DOM
      • DOM reads the entire document into memory and stores it as a tree data structure
      • SAX reads the document and calls handler methods for each element or block of text that it encounters
      • Consequences:
        • DOM provides "random access" into the document
        • SAX provides only sequential access to the document
        • DOM is slower and needs much more memory, so it is often impractical for very large documents
        • SAX is fast and needs very little memory, so it can be used for huge documents
          • This makes SAX much more popular for web sites
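To make the contrast concrete, the sketch below counts elements with a given tag name in both styles: SAX increments a counter as events stream past, while DOM builds the whole tree first and then queries it. The class and method names (`SaxVersusDom`, `countWithSax`, `countWithDom`) are illustrative assumptions, and the XML is passed in as a string only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical comparison class used only for illustration.
class SaxVersusDom {

    // SAX: sequential access; the count is accumulated while events arrive.
    static int countWithSax(String xml, String tag) {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    // DOM: the whole tree is in memory, so it can be queried after parsing.
    static int countWithDom(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<novel><chapter><paragraph>a</paragraph>"
                + "<paragraph>b</paragraph></chapter></novel>";
        System.out.println(countWithSax(xml, "paragraph")); // 2
        System.out.println(countWithDom(xml, "paragraph")); // 2
    }
}
```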
    • SAX Parsing
    • Parsing with SAX
      • SAX uses the source-listener-delegate model for parsing XML documents
        • The source is XML data consisting of XML elements
        • A listener written in Java is attached to the parser and listens for events
        • When an event is fired, a method of the listener is called to handle it
    • SAX Parsing: process XML as Stream
    • Simple SAX program
      • The program consists of two classes:
        • Sample -- This class contains the main method; it
          • Gets a factory to make parsers
          • Gets a parser from the factory
          • Creates a Handler object to handle callbacks from the parser
          • Tells the parser which handler to send its callbacks to
          • Reads and parses the input XML file
        • Handler -- This class contains handlers for three kinds of callbacks:
          • startElement callbacks, generated when a start tag is seen
          • endElement callbacks, generated when an end tag is seen
          • characters callbacks, generated for the contents of an element
    • The Sample class
      • import javax.xml.parsers.*; // for both SAX and DOM
        import org.xml.sax.*;
        import org.xml.sax.helpers.*;
      • // For simplicity, we let the operating system handle exceptions
        // In "real life" this is poor programming practice
        public class Sample {
            public static void main(String args[]) throws Exception {
      • // Create a parser factory
        SAXParserFactory factory = SAXParserFactory.newInstance();
      • // Tell factory that the parser must understand namespaces
        factory.setNamespaceAware(true);
      • // Make the parser
        SAXParser saxParser = factory.newSAXParser();
        XMLReader parser = saxParser.getXMLReader();
    • The Sample class
      • // Create a handler
        Handler handler = new Handler();
        • // Tell the parser to use this handler
          parser.setContentHandler(handler);
        • // Finally, read and parse the document
          parser.parse("hello.xml");
        • } // end of main
          } // end of Sample class
      • The parser reads the file hello.xml
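      • The argument is a system identifier (URI), not a classpath resource
        • With the JDK's default parser, a relative name such as hello.xml is typically resolved against the current working directory
        • Placing the file in a classpath directory does not by itself make it visible to parse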
    • The Handler class
      • import org.xml.sax.*;
        import org.xml.sax.helpers.DefaultHandler;
        public class Handler extends DefaultHandler {
        • DefaultHandler is an adapter class that defines empty methods to be overridden
      • We define 3 methods to handle (1) start tags, (2) contents, and (3) end tags.
        • The methods will just print a line
        • Each of these 3 methods throws a SAXException
      • // SAX calls this when it encounters a start tag
        public void startElement(String namespaceURI, String localName,
                                 String qualifiedName, Attributes attributes)
                throws SAXException {
            System.out.println("startElement: " + qualifiedName);
        }
    • The Handler class
      • // SAX calls this method to pass in character data
        public void characters(char ch[], int start, int length) throws SAXException {
            System.out.println("characters: \"" + new String(ch, start, length) + "\"");
        }
      • // SAX calls this method when it encounters an end tag
        public void endElement(String namespaceURI, String localName, String qualifiedName)
                throws SAXException {
            System.out.println("endElement: /" + qualifiedName);
        }
        } // End of Handler class
    • Results
      • If the file hello.xml contains:
          <?xml version="1.0"?>
          <display>Hello World!</display>
      • Then the output from running java Sample will be:
          startElement: display
          characters: "Hello World!"
          endElement: /display
    • More results
      • Now suppose the file hello.xml contains :
        • <?xml version="1.0"?>
          <display>
              <i>Hello</i> World!
          </display>
      • Notice that the root element, <display> , contains a nested element <i> and whitespace (including newlines)
      • The result will be:
          startElement: display
          characters: ""          // empty string
          characters: " "         // newline
          characters: " "         // spaces
          startElement: i
          characters: "Hello"
          endElement: /i
          characters: "World!"
          characters: " "         // another newline
          endElement: /display
    • Factories
      • SAX uses a parser factory
        • A factory is a design pattern alternative to constructors
      • Factories allow the programmer to:
        • Decide whether or not to create a new object
        • Decide what kind of object to create
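        • class TrustMe {
              private TrustMe() { }                  // private constructor
              public static TrustMe makeTrust() {    // factory method; static, because the constructor is private
                  if ( /* test of some sort */ )
                      return new TrustMe();
                  return null;                       // the factory decided not to create an object
              }
          }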
    • Parser factories
      • To create a SAX parser factory, call static method: SAXParserFactory.newInstance()
        • Returns an object of type SAXParserFactory
        • It may throw a FactoryConfigurationError
      • Then, the parser can be customized:
        • public void setNamespaceAware(boolean awareness)
          • Call this with true if you are using namespaces
          • The default (if you don’t call this method) is false
        • public void setValidating(boolean validating)
          • Call this with true if you want to validate against a DTD
          • The default (if you don’t call this method) is false
          • Validation will give an error if you do not have a DTD
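The effect of `setNamespaceAware` can be seen in what the parser passes to `startElement`. The sketch below is an illustration under stated assumptions: the class name `NamespaceAwareDemo`, the method `rootName`, and the namespace URI `http://example.com/weather` are made up for the example. With awareness switched on, the namespace URI and local name of the root element are reported; with it switched off, those two parameters may be empty, depending on the parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class used only for illustration.
class NamespaceAwareDemo {

    // Returns "namespaceURI|localName" as reported for the root element.
    static String rootName(String xml, boolean namespaceAware) {
        StringBuilder seen = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware); // must be set before newSAXParser()
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            if (seen.length() == 0) { // record the root element only
                                seen.append(uri).append('|').append(localName);
                            }
                        }
                    });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        String xml = "<w:report xmlns:w=\"http://example.com/weather\">hot</w:report>";
        System.out.println(rootName(xml, true)); // http://example.com/weather|report
    }
}
```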
    • Getting a parser
      • Once a SAXParserFactory has been set up, parsers can be created with:
          SAXParser saxParser = factory.newSAXParser();
          XMLReader parser = saxParser.getXMLReader();
      • Note: SAXParser is not thread-safe
      • If parsing is done in multiple threads, create a separate SAXParser object for each thread
    • Declaring which handler to use
      • Since the SAX parser will call the handlers, we need to supply these methods
      • Binding the parser with a handler: Handler handler = new Handler(); parser.setContentHandler(handler);
      • These statements could be combined: parser.setContentHandler(new Handler());
      • Finally, the parser is invoked on the file to parse: parser.parse("hello.xml");
      • Everything else is done in the handler methods
    • SAX handlers
      • SAX defines 4 callback interfaces; a complete callback handler implements all of them:
        • interface ContentHandler
          • Handles basic parsing callbacks, e.g., element starts and ends
        • interface DTDHandler
          • Handles only notation and unparsed entity declarations
        • interface EntityResolver
          • Does customized handling for external entities
        • interface ErrorHandler
          • Should be implemented and registered; otherwise errors and warnings are silently ignored!
      • Implementing all these interfaces is a lot of work
        • It is easier to use an adapter class
    • Class DefaultHandler
      • DefaultHandler is an adapter class in package org.xml.sax.helpers
      • DefaultHandler implements ContentHandler , DTDHandler , EntityResolver , and ErrorHandler
      • DefaultHandler provides a default implementation (in most cases an empty method) for every method declared in these interfaces
      • To use this class, extend it and override the methods that are important to the application
    • ContentHandler methods
      • public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException
      • This method is called at the beginning of elements
      • When SAX calls startElement , it passes in a parameter of type Attributes
      • The following methods look up attributes by name rather than by index:
        • public int getIndex(String qualifiedName)
        • public int getIndex(String uri, String localName)
        • public String getValue(String qualifiedName)
        • public String getValue(String uri, String localName)
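As an illustration of the name-based lookups listed above, the following sketch reads the `scale` attribute of a chosen element from the weather report example. The class name `AttributeLookup` and the method `scaleOf` are hypothetical; `getIndex` returns -1 when the attribute is absent.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class used only for illustration.
class AttributeLookup {

    // Returns the "scale" attribute of the first element with the given name,
    // or null if the element or the attribute is missing.
    static String scaleOf(String xml, String element) {
        String[] result = {null};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // getIndex returns -1 when no attribute has this qualified name
                if (result[0] == null && qName.equals(element)
                        && atts.getIndex("scale") >= 0) {
                    result[0] = atts.getValue("scale");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String xml = "<weatherReport><high scale=\"F\">103</high>"
                + "<low scale=\"F\">70</low><city>North Place</city></weatherReport>";
        System.out.println(scaleOf(xml, "high")); // F
        System.out.println(scaleOf(xml, "city")); // null
    }
}
```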
    • ContentHandler methods
      • public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException
      • The parameters to endElement are the same as those to startElement , except that the Attributes parameter is omitted
      • public void characters(char[] ch, int start, int length) throws SAXException
      • ch is an array of characters
        • Only length characters, starting from ch[start] , are the contents of the element
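A parser is allowed to deliver the text of one element in several `characters` calls, so a handler usually accumulates the pieces rather than treating each call as the complete content. The sketch below shows one way to do that, assuming a hypothetical `TextCollector` handler that gathers all character data in the document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler used only for illustration.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Use only the slice ch[start .. start+length-1]; the rest of the
        // array belongs to the parser's internal buffer.
        text.append(ch, start, length);
    }

    // Parses the XML text and returns all character data, concatenated.
    static String textOf(String xml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        // The entity reference may cause the text to arrive in several chunks.
        System.out.println(textOf("<afternoon>Sunny &amp; hot</afternoon>")); // Sunny & hot
    }
}
```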
    • Error Handling
      • SAX error handling is unusual
      • Most errors are ignored unless an error handler ( org.xml.sax.ErrorHandler ) is registered
        • Ignored errors can cause unexpected behavior
      • The ErrorHandler interface declares:
        • public void fatalError (SAXParseException exception) throws SAXException // XML not well structured
        • public void error (SAXParseException exception) throws SAXException // XML validation error
        • public void warning (SAXParseException exception) throws SAXException // minor problem
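A registered `ErrorHandler` makes validation problems visible. The following is a minimal sketch, assuming a hypothetical `CollectingErrorHandler` that records warnings and recoverable errors in a list and rethrows fatal errors; the small documents with an internal DTD are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler used only for illustration.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {       // validation error, recoverable
        problems.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        throw e;                                   // not well-formed: stop parsing
    }

    // Parses with DTD validation switched on and returns the reported problems.
    static List<String> validate(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>";
        System.out.println(validate(dtd + "<note>hi</note>"));        // no problems
        System.out.println(validate(dtd + "<note><extra/></note>"));  // validation errors
    }
}
```

Without the call to `setErrorHandler`, the second document would be parsed to the end with no indication to the application that it is invalid.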
    • External parsers
      • Alternatively, you can use an existing parser:
        • Xerces, Electric XML, Expat, MSXML, CMarkup
      • Stages of the parsing
        • Get the URL object for the source
        • Create InputSource object encapsulating the data source
        • Create the parser
        • Launch the parser on the data source
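The four stages listed above can be sketched with the JDK's bundled parser as follows. This is an illustration, not a recipe for any of the external parsers named on the slide: the class `UrlSourceParsing` and its methods are hypothetical, and a temporary file stands in for the real data source so that the example runs without network access.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class used only for illustration.
class UrlSourceParsing {

    // Writes the XML text to a temporary file and returns its URL (stage 1).
    static URL writeTempXml(String xml) {
        try {
            Path file = Files.createTempFile("hello", ".xml");
            file.toFile().deleteOnExit();
            Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
            return file.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the name of the root element of the document at the given URL.
    static String rootName(URL source) {
        String[] root = {null};
        try (InputStream in = source.openStream()) {
            InputSource input = new InputSource(in);     // stage 2: wrap the data source
            input.setSystemId(source.toString());        // base for relative references
            XMLReader parser = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();      // stage 3: create the parser
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = qName;
                    }
                }
            });
            parser.parse(input);                         // stage 4: launch the parser
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        URL url = writeTempXml("<?xml version=\"1.0\"?><display>Hello World!</display>");
        System.out.println(rootName(url)); // display
    }
}
```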
    • Problems with SAX
      • SAX provides only sequential access to the document being processed
      • SAX has only a local view of the current element being processed
        • Global knowledge of parsing must be stored in global variables
        • A single startElement() method for all elements
          • In startElement() there are many “if-then-else” tests for checking a specific element
          • When an element is seen, a global flag is set
          • When the element ends, the flag must be reset to false
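The flag technique described above looks roughly like this in practice. The sketch assumes a hypothetical `CityHandler` that extracts the text of the `<city>` element from the weather report example; the flag is an instance field of the handler, which plays the role of the "global" state.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler used only for illustration.
class CityHandler extends DefaultHandler {
    private boolean inCity = false;                 // parsing state shared by the callbacks
    private final StringBuilder city = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (qName.equals("city")) {                 // one method serves all elements
            inCity = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCity) {                               // keep text only inside <city>
            city.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("city")) {
            inCity = false;                         // reset the flag when the element ends
        }
    }

    // Parses the XML text and returns the content of its <city> element.
    static String cityOf(String xml) {
        CityHandler handler = new CityHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.city.toString();
    }

    public static void main(String[] args) {
        String xml = "<weatherReport><date>7/14/97</date>"
                + "<city>North Place</city><state>NX</state></weatherReport>";
        System.out.println(cityOf(xml)); // North Place
    }
}
```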
    • DOM Parsing
    • DOM
      • DOM represents the XML document as a tree
        • Hierarchical nature of tree maps well to hierarchical nesting of XML elements
        • Tree contains a global view of the document
          • Makes navigation of document easy
          • Allows any subtree to be modified
          • Easier processing than SAX but memory intensive!
      • Like SAX, DOM is only an API
        • Does not specify a parser
        • Lists the API and requirements for the parser
      • DOM parsers typically use SAX parsing
    • DOM Parsing: process entire document
    • Simple DOM program
      • First we need to create a DOM parser, called a DocumentBuilder
      • The parser is created, not by a constructor, but by calling a static factory method
      • DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      • DocumentBuilder builder = factory.newDocumentBuilder();
    • Simple DOM program
      • An XML file hello.xml will be parsed:
          <?xml version="1.0"?>
          <display>Hello World!</display>
      • To read this file, we add the following line: Document document = builder.parse("hello.xml");
      • document contains the entire XML file as a tree
      • The following code finds the content of the root element and prints it
      • Element root = document.getDocumentElement();
        Node textNode = root.getFirstChild();
        System.out.println(textNode.getNodeValue());
      • The output of the program is: Hello World!
    • Reading in the tree
      • The parse method reads in the entire XML document and represents it as a tree in memory
        • For a large document, parsing could take a while
        • If you want to interact with your program while it is parsing, run the parser in a separate thread
      • In practice, an XML parse tree may require up to 10 times as much memory as the original XML document
        • If you have a lot of tree manipulation to do, DOM is much more convenient than SAX
        • If you do not have a lot of tree manipulation to do, consider using SAX instead
    • Structure of the DOM tree
      • The DOM tree is composed of Node objects
      • Node is an interface
        • Some of the more important sub-interfaces are Element , Attr , and Text
          • An Element node may have children
          • Attr and Text nodes are the leaves of the tree
      • Hence, the DOM tree is composed of Node objects
        • Node objects can be downcast into specific types if needed
    • Operations on Node s
      • The results returned by getNodeName() , getNodeValue() , getNodeType() and getAttributes() depend on the subtype of the node, as follows:

                          Element         Text             Attr
        getNodeName()     tag name        "#text"          name of attribute
        getNodeValue()    null            text contents    value of attribute
        getNodeType()     ELEMENT_NODE    TEXT_NODE        ATTRIBUTE_NODE
        getAttributes()   NamedNodeMap    null             null
    • Distinguishing Node types
      • An easy way to handle different types of nodes:
        • switch (node.getNodeType()) {
              case Node.ELEMENT_NODE:
                  Element element = (Element) node;
                  ...;
                  break;
              case Node.TEXT_NODE:
                  Text text = (Text) node;
                  ...
                  break;
              case Node.ATTRIBUTE_NODE:
                  Attr attr = (Attr) node;
                  ...
                  break;
              default: ...
          }
    • Operations on Node s
      • Tree-walking methods that return a Node :
        • getParentNode()
        • getFirstChild()
        • getNextSibling()
        • getPreviousSibling()
        • getLastChild()
      • Test methods that return a boolean :
        • hasAttributes()
        • hasChildNodes()
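The tree-walking methods above are enough to visit every node. A minimal sketch, assuming a hypothetical `TreeWalk` class: it recurses with `getFirstChild()` and `getNextSibling()` and records each element together with its depth in the tree.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class used only for illustration.
class TreeWalk {

    // Adds "depth:name" for the node (if it is an element) and all elements below it.
    static void collect(Node node, int depth, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(depth + ":" + node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, depth + 1, out);
        }
    }

    // Parses the XML text and returns an outline of its elements in document order.
    static List<String> outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), 0, out);
            return out;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<novel><foreword><paragraph>x</paragraph></foreword>"
                + "<chapter number=\"1\"><paragraph>y</paragraph></chapter></novel>";
        System.out.println(outline(xml));
        // [0:novel, 1:foreword, 2:paragraph, 1:chapter, 2:paragraph]
    }
}
```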
    • Operations for Element s
      • String getTagName()
        • Returns the name of the tag
      • boolean hasAttribute(String name)
        • Returns true if this Element has the named attribute
      • String getAttribute(String name)
        • Returns the value of the named attribute
      • boolean hasAttributes()
        • Returns true if this Element has any attributes
      • NamedNodeMap getAttributes()
        • Returns a NamedNodeMap of all the Element’s attributes
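A short sketch of these Element operations, using the `<high scale="F">` element from the earlier weather report; the class name `ElementAttributes` and the method `describe` are hypothetical. Note that `getAttribute` returns an empty string, not null, for an attribute that is not present, which is why `hasAttribute` is useful.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class used only for illustration.
class ElementAttributes {

    // Returns the tag name of the root element followed by its name=value pairs.
    static String describe(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            StringBuilder sb = new StringBuilder(root.getTagName());
            if (root.hasAttributes()) {
                NamedNodeMap attrs = root.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Attr a = (Attr) attrs.item(i);
                    sb.append(' ').append(a.getName()).append('=').append(a.getValue());
                }
            }
            return sb.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<high scale=\"F\">103</high>")); // high scale=F
        System.out.println(describe("<city>North Place</city>"));     // city
    }
}
```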
    • Operations on Text s
      • Text is a subinterface of CharacterData and inherits the following operations (among others):
        • public String getData() throws DOMException
          • Returns the text contents of this Text node
        • public int getLength()
          • Returns the number of Unicode characters in the text
        • public String substringData(int offset, int count) throws DOMException
          • Returns a substring of the text contents
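Applied to the hello.xml document used earlier, the three operations behave as sketched below. `TextOps` and `summarize` are hypothetical names; the method joins the three results with `|` purely so that they can be shown in one line.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class used only for illustration.
class TextOps {

    // Returns "data|length|first five characters" for the root element's text.
    // Assumes the first child of the root element is a Text node.
    static String summarize(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            Text text = (Text) root.getFirstChild();
            return text.getData() + "|" + text.getLength() + "|"
                    + text.substringData(0, 5);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(summarize("<display>Hello World!</display>"));
        // Hello World!|12|Hello
    }
}
```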
    • Operations on Attribute s
      • String getName()
        • Returns the name of this attribute.
      • Element getOwnerElement()
        • Returns the Element node this attribute is attached to
      • String getValue()
        • Returns the value of the attribute as a String
    • Overview
      • DOM, unlike SAX, allows XML trees to be created and modified
      • There are three basic kinds of operations:
        • Creating a new DOM
        • Modifying the structure of a DOM
        • Modifying the content of a DOM
      • Creating a new DOM requires a few extra methods just to get started
        • Afterwards, you can add elements through modifying its structure and contents
    • Creating a new DOM
        import javax.xml.parsers.*;
        import org.w3c.dom.Document;
        …
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.newDocument();
        } catch (ParserConfigurationException e) { ... }
    • Creating structure
      • The following are instance methods of Document :
        • public Element createElement(String tagName)
        • public Element createElementNS(String namespaceURI, String qualifiedName)
        • public Attr createAttribute(String name)
        • public Attr createAttributeNS(String namespaceURI, String qualifiedName)
        • public ProcessingInstruction createProcessingInstruction (String target, String data)
        • public EntityReference createEntityReference(String name)
        • public Text createTextNode(String data)
        • public Comment createComment(String data)
    • Methods of Node
      • public Node appendChild(Node newChild)
      • public Node insertBefore(Node newChild, Node refChild)
      • public Node removeChild(Node oldChild)
      • public Node replaceChild(Node newChild, Node oldChild)
      • public void setNodeValue(String nodeValue)
        • Functionality depends on the type of the node
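The Node methods above can be combined to restructure a parsed tree. The sketch below, with hypothetical names `ModifyTree`, `rearrange`, and `childSummary`, starts from a `<name>` element whose children are in the wrong order, moves one child, adds a new one, changes its text, and removes another. It relies on the DOM rule that inserting a node which is already in the tree moves it rather than copying it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class used only for illustration.
class ModifyTree {

    // Lists the children of a node as "name=text" pairs separated by spaces.
    static String childSummary(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.getNodeName()).append('=').append(c.getTextContent());
        }
        return sb.toString();
    }

    static String rearrange() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(
                            "<name><last>Smith</last><first>David</first></name>")));
            Element name = doc.getDocumentElement();
            Node last = name.getFirstChild();
            Node first = name.getLastChild();

            name.insertBefore(first, last);              // moves <first> ahead of <last>

            Element title = doc.createElement("title");  // new nodes come from the Document
            title.appendChild(doc.createTextNode("Dr"));
            name.appendChild(title);

            title.getFirstChild().setNodeValue("Prof");  // for a Text node: replaces the text
            name.removeChild(last);                      // detaches <last> from the tree

            return childSummary(name);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rearrange()); // first=David title=Prof
    }
}
```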
    • Methods of Element
      • public void setAttribute(String name, String value)
      • public Attr setAttributeNode(Attr newAttr)
      • public void setAttributeNS(String namespaceURI, String qualifiedName, String value)
      • public Attr setAttributeNodeNS(Attr newAttr)
      • public void removeAttribute(String name)
      • public void removeAttributeNS(String namespaceURI, String localName)
      • public Attr removeAttributeNode(Attr oldAttr)
    • Method of Attribute
      • public void setValue(String value)
      • This is the only method that modifies an Attribute
        • The rest just retrieve information
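Putting the creation and modification methods together, the following sketch builds a small document from scratch and writes it out as text. The slides do not cover serialization; the use of `javax.xml.transform.Transformer` here is an assumption about how one might print the result, and `BuildAndSerialize`, `build`, and `serialize` are hypothetical names.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class used only for illustration.
class BuildAndSerialize {

    // Builds <weatherReport><high scale="F">103</high></weatherReport> in memory.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element report = doc.createElement("weatherReport");
            doc.appendChild(report);                     // the single root element

            Element high = doc.createElement("high");
            high.setAttribute("scale", "F");
            high.appendChild(doc.createTextNode("103"));
            report.appendChild(high);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Writes the tree as XML text using an identity transformation.
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(build()));
    }
}
```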
    • Queries?