Java XML Parsing
Presentation Transcript

  • XML Prepared By Srinivasan Jayakumar
  • Briefly: The Power of XML
    • XML is Extensible Markup Language
      • Text-based representation for describing data structure
        • Both human and machine readable
      • Originated from Standard Generalized Markup Language (SGML)
      • Became a World Wide Web Consortium (W3C) standard in 1998
    • XML is a great choice for exchanging data between disparate systems
  • Synergy between Java and XML
    • Java+XML=Portable language+Portable Data
    • Java can be used to generate and process XML data
      • Use Java to access SQL databases
      • Use Java to format data in XML
      • Use Java to parse data
      • Use Java to validate data
      • Use Java to transform data
  • HTML and XML
    • HTML and XML look similar, because they are both SGML languages
      • use elements enclosed in tags (e.g. <body>This is an element</body> )
      • use tag attributes (e.g., <font face="Verdana" size="+1" color="red"> )
    • More precisely,
      • HTML is defined in SGML
      • XML is a (very small) subset of SGML
  • HTML and XML
    • HTML is for humans
      • HTML describes web pages
      • Browsers ignore and/or correct many HTML errors, so HTML is often sloppy
    • XML is for computers
      • XML describes data
      • The rules are strict and errors are not allowed
        • In this way, XML is like a programming language
      • Current versions of most browsers display XML
  • Example XML document
      <?xml version="1.0"?>
      <weatherReport>
        <date>7/14/97</date>
        <city>North Place</city>, <state>NX</state>
        <country>USA</country>
        High Temp: <high scale="F">103</high>
        Low Temp: <low scale="F">70</low>
        Morning: <morning>Partly cloudy, Hazy</morning>
        Afternoon: <afternoon>Sunny &amp; hot</afternoon>
        Evening: <evening>Clear and Cooler</evening>
      </weatherReport>
  • Overall structure
    • An XML document may start with one or more processing instructions or directives:
      • <?xml version="1.0"?>
      • <?xml-stylesheet type="text/css" href="ss.css"?>
    • Following the directives, there must be exactly one root element containing all the rest of the XML:
      • <weatherReport> ... </weatherReport>
  • XML building blocks
    • Aside from the directives, an XML document is built from:
      • elements: high in <high scale="F">103</high>
      • tags, in pairs: <high scale="F"> and </high>
      • attributes: scale="F" in <high scale="F">103</high>
      • entities: &amp; in <afternoon>Sunny &amp; hot</afternoon>
      • data: 103 in <high scale="F">103</high>
  • Elements and attributes
    • Attributes and elements are often interchangeable
    • Example:
      • <name> <first>David</first> <last>Smith</last> </name>
      • <name first="David" last="Smith"></name>
    • Elements are easier to use from Java
    • Attributes may contain elaborate metadata, such as unique IDs
  • Well-formed XML
    • In XML, every element must have both a start tag and an end tag, e.g. <name> ... </name>
      • Empty elements can be abbreviated: <break /> .
      • XML tags are case sensitive and may not begin with the letters xml , in any combination of cases
    • Elements must be properly nested
      • e.g. not <b><i>bold and italic</b></i>
    • An XML document must have one and only one root element
    • The values of attributes must be enclosed in quotes
      • e.g. <time unit="days">
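The well-formedness rules above can be checked from Java by handing a document to a non-validating parser. The following is a minimal sketch, not part of the original slides; the class name WellFormedCheck and the method isWellFormed are hypothetical, and the check relies on the standard JAXP DocumentBuilder reporting well-formedness violations as fatal errors.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: reports whether a string is well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler (which prints to stderr) so that
            // problems are reported only through the return value.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                       // not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<name><first>David</first></name>")); // true
        System.out.println(isWellFormed("<b><i>bold and italic</b></i>"));     // false: bad nesting
        System.out.println(isWellFormed("<time unit=days></time>"));           // false: unquoted attribute
        System.out.println(isWellFormed("<a/><b/>"));                          // false: two root elements
    }
}
```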
  • XML as a tree
    • An XML document represents a hierarchy
    • A hierarchy is a tree
    (Figure: tree diagram rooted at novel, with nodes foreword, chapter number="1", and three paragraph nodes holding the text "This is the great American novel.", "It was a dark and stormy night." and "Suddenly, a shot rang out!")
  • Viewing XML
    • XML is designed to be processed by computer programs, not to be displayed to humans
    • Nevertheless, almost all current Web browsers can display XML documents
      • They do not all display it the same way
      • They may not display it at all if it has errors
    • This is just an added value. Remember: HTML is designed to be viewed, XML is designed to be used
  • XML Parsers
  • Stream Model
    • The parser sees the document as a stream of elements
    • As each XML element is seen, an event occurs
      • Some code registered with the parser (the event handler) is executed
    • This approach is popularized by the Simple API for XML (SAX)
    • Problem:
      • Hard to get a global view of the document
      • Parsing state represented by global variables set by the event handlers
  • Data Model
    • The XML data is transformed into a navigable data structure in memory
      • Because of the nesting of XML elements, a tree data structure is used
      • The tree is navigated to discover the XML document
    • This approach is popularized by the Document Object Model (DOM)
    • Problem:
      • May require large amounts of memory
      • May not be as fast as stream approach
        • Some DOM parsers use SAX to build the tree
  • SAX and DOM
    • SAX and DOM are standards for XML parsers
      • DOM is a W3C standard
      • SAX is a de facto (but very popular) standard
    • There are various implementations available
    • Java implementations are provided as part of JAXP ( Java API for XML Processing )
    • JAXP is included in the JDK starting with JDK 1.4
      • It is available separately for Java 1.3
  • Difference between SAX and DOM
    • DOM reads the entire document into memory and stores it as a tree data structure
    • SAX reads the document and calls handler methods for each element or block of text that it encounters
    • Consequences:
      • DOM provides "random access" into the document
      • SAX provides only sequential access to the document
      • DOM is slower and needs far more memory, so it is often impractical for very large documents
      • SAX is fast and requires very little memory, so it can be used for huge documents
        • This makes SAX a popular choice for web sites
  • SAX Parsing
  • Parsing with SAX
    • SAX uses the source-listener-delegate model for parsing XML documents
      • The source is XML data consisting of XML elements
      • A listener written in Java is attached to the parser and listens for events
      • When an event is fired, handling is delegated to the matching listener method
  • SAX Parsing: process XML as Stream
  • Simple SAX program
    • The program consists of two classes:
      • Sample -- This class contains the main method; it
        • Gets a factory to make parsers
        • Gets a parser from the factory
        • Creates a Handler object to handle callbacks from the parser
        • Tells the parser which handler to send its callbacks to
        • Reads and parses the input XML file
      • Handler -- This class contains handlers for three kinds of callbacks:
        • startElement callbacks, generated when a start tag is seen
        • endElement callbacks, generated when an end tag is seen
        • characters callbacks, generated for the contents of an element
  • The Sample class
    • import javax.xml.parsers.*;   // for both SAX and DOM
      import org.xml.sax.*;
      import org.xml.sax.helpers.*;

      // For simplicity, we let exceptions propagate out of main
      // In "real life" this is poor programming practice
      public class Sample {
          public static void main(String args[]) throws Exception {
              // Create a parser factory
              SAXParserFactory factory = SAXParserFactory.newInstance();
              // Tell factory that the parser must understand namespaces
              factory.setNamespaceAware(true);
              // Make the parser
              SAXParser saxParser = factory.newSAXParser();
              XMLReader parser = saxParser.getXMLReader();
  • The Sample class
              // Create a handler
              Handler handler = new Handler();
              // Tell the parser to use this handler
              parser.setContentHandler(handler);
              // Finally, read and parse the document
              parser.parse("hello.xml");
          } // end of main
      } // end of Sample class
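    • The parser reads the file hello.xml
    • The argument is a system identifier (a URI or file name), not a classpath resource
      • A relative name such as hello.xml is typically resolved against the directory the program is run from
      • The classpath is not searched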
  • The Handler class
    • public class Handler extends DefaultHandler {
      • DefaultHandler is an adapter class that defines empty methods to be overridden
    • We define 3 methods to handle (1) start tags, (2) contents, and (3) end tags.
      • The methods will just print a line
      • Each of these 3 methods throws a SAXException
    • // SAX calls this when it encounters a start tag
      public void startElement(String namespaceURI, String localName,
                               String qualifiedName, Attributes attributes)
              throws SAXException {
          System.out.println("startElement: " + qualifiedName);
      }
  • The Handler class
    • // SAX calls this method to pass in character data
      public void characters(char ch[], int start, int length) throws SAXException {
          System.out.println("characters: \"" + new String(ch, start, length) + "\"");
      }
    • // SAX calls this method when it encounters an end tag
      public void endElement(String namespaceURI, String localName,
                             String qualifiedName) throws SAXException {
          System.out.println("endElement: /" + qualifiedName);
      }
      } // End of Handler class
  • Results
    • If the file hello.xml contains:
        <?xml version="1.0"?>
        <display>Hello World!</display>
    • Then the output from running java Sample will be:
        startElement: display
        characters: "Hello World!"
        endElement: /display
  • More results
    • Now suppose the file hello.xml contains :
      • <?xml version="1.0"?>
        <display>
            <i>Hello</i> World!
        </display>
    • Notice that the root element, <display> , contains a nested element <i> and whitespace (including newlines)
    • The result will be:
        startElement: display
        characters: ""            // empty string
        characters: "
        "                         // newline
        characters: "    "        // spaces
        startElement: i
        characters: "Hello"
        endElement: /i
        characters: " World!"
        characters: "
        "                         // another newline
        endElement: /display
    • How the text is split across characters callbacks depends on the parser
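Because a parser may deliver the text of one element in several characters callbacks, handlers commonly buffer the pieces and process them when the next tag arrives. A minimal sketch of that idea follows; the class TextCollector and its event labels are hypothetical names chosen for illustration, and whitespace-only text is simply dropped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: joins character chunks and ignores pure whitespace.
class TextCollector extends DefaultHandler {
    final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // Emit the buffered text (if any) as one event, then clear the buffer.
    private void flushText() {
        String s = text.toString().trim();
        if (!s.isEmpty()) {
            events.add("text: \"" + s + "\"");
        }
        text.setLength(0);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        events.add("start: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);   // may be called several times per text run
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("end: " + qName);
    }

    static List<String> parse(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String event : parse("<display>\n    <i>Hello</i> World!\n</display>")) {
            System.out.println(event);
        }
    }
}
```

With this handler the whitespace-only callbacks from the example disappear, and each run of text is reported once regardless of how the parser chunks it.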
  • Factories
    • SAX uses a parser factory
      • A factory is a design pattern alternative to constructors
    • Factories allow the programmer to:
      • Decide whether or not to create a new object
      • Decide what kind of object to create
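      • class TrustMe {
            private TrustMe() { }                 // private constructor
            public static TrustMe makeTrust() {   // factory method; static, so no instance is needed to call it
                if ( /* test of some sort */ )
                    return new TrustMe();
                return null;                      // the factory declined to create an object
            }
        }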
  • Parser factories
    • To create a SAX parser factory, call static method: SAXParserFactory.newInstance()
      • Returns an object of type SAXParserFactory
      • It may throw a FactoryConfigurationError
    • Then, the parser can be customized:
      • public void setNamespaceAware(boolean awareness)
        • Call this with true if you are using namespaces
        • The default (if you don’t call this method) is false
      • public void setValidating(boolean validating)
        • Call this with true if you want to validate against a DTD
        • The default (if you don’t call this method) is false
        • Validation will give an error if you do not have a DTD
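To illustrate what setNamespaceAware changes, the sketch below parses a document with a prefixed root element and reports what startElement receives. The class NamespaceDemo, the method describeRoot and the namespace URI urn:example:weather are assumptions made for the example, not part of the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: shows the startElement arguments for the root element.
class NamespaceDemo {

    static String describeRoot(String xml, boolean namespaceAware) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);   // default is false
            final String[] result = new String[1];
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            if (result[0] == null) {     // record the root element only
                                result[0] = "uri=" + uri + " local=" + localName
                                        + " qName=" + qName;
                            }
                        }
                    });
            return result[0];
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:report xmlns:w=\"urn:example:weather\"/>";
        System.out.println(describeRoot(xml, true));
        System.out.println(describeRoot(xml, false));
    }
}
```

With namespace awareness on, the handler receives the namespace URI and the local name report. With it off, only the qualified name w:report can be relied upon, which is why the earlier Handler class prints qualifiedName.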
  • Getting a parser
    • Once a SAXParserFactory factory has been set up, parsers can be created with:
        SAXParser saxParser = factory.newSAXParser();
        XMLReader parser = saxParser.getXMLReader();
    • Note: SAXParser is not thread-safe
    • If parsing will be done in multiple threads, create a separate SAXParser object for each thread
  • Declaring which handler to use
    • Since the SAX parser will call the handlers, we need to supply these methods
    • Binding the parser with a handler: Handler handler = new Handler(); parser.setContentHandler(handler);
    • These statements could be combined: parser.setContentHandler(new Handler());
    • Finally, the parser is invoked on the file to parse: parser.parse("hello.xml");
    • Everything else is done in the handler methods
  • SAX handlers
    • A callback handler can implement up to 4 interfaces:
      • interface ContentHandler
        • Handles basic parsing callbacks, e.g., element starts and ends
      • interface DTDHandler
        • Handles only notation and unparsed entity declarations
      • interface EntityResolver
        • Does customized handling for external entities
      • interface ErrorHandler
        • Should be implemented; otherwise many parsing errors are silently ignored!
    • Implementing all these interfaces is a lot of work
      • It is easier to use an adapter class
  • Class DefaultHandler
    • DefaultHandler is an adapter class from package org.xml.sax.helpers
    • DefaultHandler implements ContentHandler , DTDHandler , EntityResolver , and ErrorHandler
    • DefaultHandler provides default (mostly empty) implementations of every method declared in each of the interfaces
    • To use this class, extend it and override the methods that are important to the application
  • ContentHandler methods
    • public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException
    • This method is called at the beginning of elements
    • When SAX calls startElement , it passes in a parameter of type Attributes
    • The following methods look up attributes by name rather than by index:
      • public int getIndex(String qualifiedName)
      • public int getIndex(String uri, String localName)
      • public String getValue(String qualifiedName)
      • public String getValue(String uri, String localName)
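A short sketch of looking up an attribute by name inside startElement, using the scale attribute from the weather report example. The class AttributeLookup and the method scaleOf are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: returns the value of the "scale" attribute, or null.
class AttributeLookup {

    static String scaleOf(String xml) {
        try {
            final String[] found = new String[1];
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            // getIndex returns -1 when the attribute is absent;
                            // getValue would return null in that case.
                            if (atts.getIndex("scale") >= 0) {
                                found[0] = atts.getValue("scale");
                            }
                        }
                    });
            return found[0];
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(scaleOf("<high scale=\"F\">103</high>")); // F
        System.out.println(scaleOf("<high>103</high>"));             // null
    }
}
```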
  • ContentHandler methods
    • public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException
    • The parameters to endElement are the same as those to startElement , except that the Attributes parameter is omitted
    • public void characters(char[] ch, int start, int length) throws SAXException
    • ch is an array of characters
      • Only length characters, starting from ch[start] , are the contents of the element
  • Error Handling
    • SAX error handling is unusual
    • Most errors are ignored unless an error handler ( org.xml.sax.ErrorHandler ) is registered
      • Ignored errors can cause unexpected behavior
    • The ErrorHandler interface declares:
      • public void fatalError (SAXParseException exception) throws SAXException // XML not well-formed
      • public void error (SAXParseException exception) throws SAXException // XML validation error
      • public void warning (SAXParseException exception) throws SAXException // minor problem
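One way to make parsing problems visible is to register an ErrorHandler that records every callback. The sketch below is an illustration under assumptions: CollectingErrorHandler and check are hypothetical names, and the message format is invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler: records each problem instead of ignoring it.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    public void warning(SAXParseException e) {
        problems.add("warning at line " + e.getLineNumber());
    }

    public void error(SAXParseException e) {
        problems.add("error at line " + e.getLineNumber());   // validation error
    }

    public void fatalError(SAXParseException e) throws SAXException {
        problems.add("fatal at line " + e.getLineNumber());   // not well-formed
        throw e;                                              // stop parsing
    }

    static List<String> check(String xml, boolean validate) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validate);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // already recorded by fatalError
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        System.out.println(check("<a/>", false));       // no problems recorded
        System.out.println(check("<a><b></a>", false)); // one fatal entry
        System.out.println(check("<a/>", true));        // validating without a DTD
    }
}
```

The last call shows the earlier point about validation: with validation switched on and no DTD present, the parser bundled with the JDK typically reports the missing grammar through error().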
  • External parsers
    • Alternatively, you can use an existing parser:
      • Xerces, Electric XML, Expat, MSXML, CMarkup
    • Stages of the parsing
      • Get the URL object for the source
      • Create InputSource object encapsulating the data source
      • Create the parser
      • Launch the parser on the data source
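The four stages above can be sketched with the JAXP parser as follows. This is an illustration only: the class UrlParseStages and its methods are hypothetical, and a temporary local file stands in for a real URL source so that the example does not depend on the network.

```java
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo of the stages: URL, InputSource, parser, parse.
class UrlParseStages {

    static String rootName(URL source) throws Exception {
        // Stage 2: wrap the data source in an InputSource (by system identifier)
        InputSource input = new InputSource(source.toString());
        // Stage 3: create the parser
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final String[] root = new String[1];
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName;   // remember the first (root) element
                }
            }
        });
        // Stage 4: launch the parser on the data source
        parser.parse(input);
        return root[0];
    }

    // Writes the XML to a temporary file so that a file: URL can be parsed.
    static String rootNameViaTempFile(String xml) {
        try {
            Path file = Files.createTempFile("hello", ".xml");
            try {
                Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
                URL url = file.toUri().toURL();   // Stage 1: the URL of the source
                return rootName(url);
            } finally {
                Files.delete(file);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootNameViaTempFile(
                "<?xml version=\"1.0\"?><display>Hello World!</display>"));
    }
}
```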
  • Problems with SAX
    • SAX provides only sequential access to the document being processed
    • SAX has only a local view of the current element being processed
      • Global knowledge of parsing must be stored in global variables
      • A single startElement() method for all elements
        • In startElement() there are many “if-then-else” tests for checking a specific element
        • When an element is seen, a global flag is set
        • When finished with the element global flag must be set to false
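The flag-based bookkeeping described above looks roughly like this in practice. The sketch pulls the text of the city element out of a weather report; CityHandler and findCity are hypothetical names, and the flag is kept as a field of the handler rather than a true global.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers whether the parser is inside <city>.
class CityHandler extends DefaultHandler {
    private boolean inCity = false;                       // parsing state flag
    private final StringBuilder city = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("city".equals(qName)) {       // one method handles every element
            inCity = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCity) {
            city.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("city".equals(qName)) {
            inCity = false;               // reset the flag when the element ends
        }
    }

    static String findCity(String xml) {
        try {
            CityHandler handler = new CityHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.city.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findCity(
                "<weatherReport><city>North Place</city><state>NX</state></weatherReport>"));
    }
}
```

Each additional element of interest needs another flag and another branch in all three callbacks, which is the maintenance cost the slide is pointing at.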
  • DOM Parsing
  • DOM
    • DOM represents the XML document as a tree
      • Hierarchical nature of tree maps well to hierarchical nesting of XML elements
      • Tree contains a global view of the document
        • Makes navigation of document easy
        • Allows any subtree to be modified
        • Easier processing than SAX but memory intensive!
    • Like SAX, DOM is only an API
      • Does not specify a parser
      • Lists the API and requirements for the parser
    • DOM parsers are often built on top of a SAX parser
  • DOM Parsing: process entire document
  • Simple DOM program
    • First we need to create a DOM parser, called a DocumentBuilder
    • The parser is created, not by a constructor, but by a factory that is itself obtained from a static method
    • DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    • DocumentBuilder builder = factory.newDocumentBuilder();
  • Simple DOM program
    • An XML file hello.xml will be parsed:
        <?xml version="1.0"?>
        <display>Hello World!</display>
    • To read this file, we add the following line:
        Document document = builder.parse("hello.xml");
    • document contains the entire XML file as a tree
    • The following code finds the content of the root element and prints it
    • Element root = document.getDocumentElement();
      Node textNode = root.getFirstChild();
      System.out.println(textNode.getNodeValue());
    • The output of the program is: Hello World!
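Put together, the fragments above form a complete program. The sketch below is one possible assembly; it parses from a string instead of hello.xml so that it is self-contained, adds a null check that the slides omit, and uses the hypothetical names DomHello and rootText.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo: returns the value of the root element's first child.
class DomHello {

    static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            Element root = document.getDocumentElement();
            Node first = root.getFirstChild();
            // An empty root element has no children, so guard against null.
            return first == null ? null : first.getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText("<?xml version=\"1.0\"?><display>Hello World!</display>"));
    }
}
```

Note that getFirstChild returns whatever node comes first. If the root starts with a nested element or with whitespace, the result is not the text one might expect.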
  • Reading in the tree
    • The parse method reads in the entire XML document and represents it as a tree in memory
      • For a large document, parsing could take a while
      • If you want to interact with your program while it is parsing, you need to run the parser in a separate thread
    • In practice, an XML parse tree may require up to 10 times as much memory as the original XML document
      • If you have a lot of tree manipulation to do, DOM is much more convenient than SAX
      • If you do not have a lot of tree manipulation to do, consider using SAX instead
  • Structure of the DOM tree
    • The DOM tree is composed of Node objects
    • Node is an interface
      • Some of the more important sub-interfaces are Element , Attr , and Text
        • An Element node may have children
        • Attr and Text nodes are the leaves of the tree
    • Hence, every object in the DOM tree can be handled as a Node
      • Node objects can be downcast into specific types if needed
  • Operations on Node s
    • The results returned by getNodeName() , getNodeValue() , getNodeType() and getAttributes() depend on the subtype of the node, as follows:

                          Element        Text            Attr
      getNodeName()       tag name       "#text"         name of attribute
      getNodeValue()      null           text contents   value of attribute
      getNodeType()       ELEMENT_NODE   TEXT_NODE       ATTRIBUTE_NODE
      getAttributes()     NamedNodeMap   null            null
  • Distinguishing Node types
    • An easy way to handle different types of nodes:
      • switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                Element element = (Element) node; ...; break;
            case Node.TEXT_NODE:
                Text text = (Text) node; ...; break;
            case Node.ATTRIBUTE_NODE:
                Attr attr = (Attr) node; ...; break;
            default: ...
        }
  • Operations on Node s
    • Tree-walking methods that return a Node :
      • getParentNode()
      • getFirstChild()
      • getNextSibling()
      • getPreviousSibling()
      • getLastChild()
    • Test methods that return a boolean :
      • hasAttributes()
      • hasChildNodes()
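The tree-walking methods combine naturally with the node-type switch into a recursive traversal. The following sketch prints an indented outline of a document; DomWalker, walk and outline are hypothetical names, and whitespace-only text nodes are skipped to keep the output short.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo: depth-first walk using getFirstChild / getNextSibling.
class DomWalker {

    static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE && node.getNodeValue().trim().isEmpty()) {
            return;                                   // ignore whitespace-only text
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                out.append("element ").append(node.getNodeName());
                break;
            case Node.TEXT_NODE:
                out.append("text \"").append(node.getNodeValue().trim()).append('"');
                break;
            default:
                out.append("other ").append(node.getNodeName());
        }
        out.append('\n');
        // Visit the children from first to last.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(
                "<novel><foreword><paragraph>Hi</paragraph></foreword></novel>"));
    }
}
```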
  • Operations for Element s
    • String getTagName()
      • Returns the name of the tag
    • boolean hasAttribute(String name)
      • Returns true if this Element has the named attribute
    • String getAttribute(String name)
      • Returns the value of the named attribute
    • boolean hasAttributes()
      • Returns true if this Element has any attributes
    • NamedNodeMap getAttributes()
      • Returns a NamedNodeMap of all the Element’s attributes
  • Operations on Text s
    • Text is a subinterface of CharacterData and inherits the following operations (among others):
      • public String getData() throws DOMException
        • Returns the text contents of this Text node
      • public int getLength()
        • Returns the length of the text in 16-bit units (UTF-16 code units)
      • public String substringData(int offset, int count) throws DOMException
        • Returns a substring of the text contents
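A brief sketch of the three Text operations listed above, applied to the hello.xml example. TextOps and describe are hypothetical names; the instanceof test guards against a root whose first child is not text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical demo: getData, getLength and substringData on a Text node.
class TextOps {

    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node first = doc.getDocumentElement().getFirstChild();
            if (!(first instanceof Text)) {
                return "no text";
            }
            Text text = (Text) first;
            return text.getData() + "|" + text.getLength() + "|" + text.substringData(0, 5);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<display>Hello World!</display>")); // Hello World!|12|Hello
    }
}
```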
  • Operations on Attribute s
    • String getName()
      • Returns the name of this attribute.
    • Element getOwnerElement()
      • Returns the Element node this attribute is attached to
    • String getValue()
      • Returns the value of the attribute as a String
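The Element and Attr operations can be combined to list every attribute of an element. The sketch below is illustrative; AttrOps and listAttributes are hypothetical names and the output format is invented. Note that a NamedNodeMap does not guarantee any particular attribute order.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical demo: formats each attribute of the root as owner@name=value;
class AttrOps {

    static String listAttributes(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            if (root.hasAttributes()) {
                NamedNodeMap attrs = root.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Attr attr = (Attr) attrs.item(i);
                    out.append(attr.getOwnerElement().getTagName())
                       .append('@').append(attr.getName())
                       .append('=').append(attr.getValue())
                       .append(';');
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(listAttributes("<high scale=\"F\">103</high>")); // high@scale=F;
    }
}
```

For a single known attribute, root.getAttribute("scale") is shorter; it returns an empty string rather than null when the attribute is missing, so hasAttribute is the reliable presence test.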
  • Overview
    • DOM, unlike SAX, allows you to create and modify XML trees
    • There are three basic kinds of operations:
      • Creating a new DOM
      • Modifying the structure of a DOM
      • Modifying the content of a DOM
    • Creating a new DOM requires a few extra methods just to get started
      • Afterwards, you can add elements through modifying its structure and contents
  • Creating a new DOM
      import javax.xml.parsers.*;
      import org.w3c.dom.Document;
      …
      try {
          DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
          DocumentBuilder builder = factory.newDocumentBuilder();
          Document doc = builder.newDocument();
      } catch (ParserConfigurationException e) {
          ...
      }
  • Creating structure
    • The following are instance methods of Document :
      • public Element createElement(String tagName)
      • public Element createElementNS(String namespaceURI, String qualifiedName)
      • public Attr createAttribute(String name)
      • public Attr createAttributeNS(String namespaceURI, String qualifiedName)
      • public ProcessingInstruction createProcessingInstruction (String target, String data)
      • public EntityReference createEntityReference(String name)
      • public Text createTextNode(String data)
      • public Comment createComment(String data)
  • Methods of Node
    • public Node appendChild(Node newChild)
    • public Node insertBefore(Node newChild, Node refChild)
    • public Node removeChild(Node oldChild)
    • public Node replaceChild(Node newChild, Node oldChild)
    • setNodeValue(String nodeValue)
      • Functionality depends on the type of the node
  • Methods of Element
    • public void setAttribute(String name, String value)
    • public Attr setAttributeNode(Attr newAttr)
    • public void setAttributeNS(String namespaceURI, String qualifiedName, String value)
    • public Attr setAttributeNodeNS(Attr newAttr)
    • public void removeAttribute(String name)
    • public void removeAttributeNS(String namespaceURI, String localName)
    • public Attr removeAttributeNode(Attr oldAttr)
  • Method of Attribute
    • public void setValue(String value)
    • This is the only method that modifies an Attribute
      • The rest just retrieve information
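The creation and modification methods from the last few slides can be combined to build a small document from scratch. The sketch below rebuilds the name example and writes it out as text. BuildDom and buildName are hypothetical names, the id attribute is invented for illustration, and the serialization step uses the standard javax.xml.transform identity transformer, which the slides do not cover.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo: creates <name> with <first> and <last> children.
class BuildDom {

    static String buildName(String first, String last) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // Structure: create elements and attach them to the tree
            Element name = doc.createElement("name");
            doc.appendChild(name);
            Element firstEl = doc.createElement("first");
            firstEl.appendChild(doc.createTextNode(first));
            name.appendChild(firstEl);
            Element lastEl = doc.createElement("last");
            lastEl.appendChild(doc.createTextNode(last));
            name.appendChild(lastEl);

            // Content: set an attribute, then change it through the Attr node
            name.setAttribute("id", "n1");                 // "id" is an invented attribute
            name.getAttributeNode("id").setValue("n2");

            // Serialize the tree with an identity transformation
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildName("David", "Smith"));
    }
}
```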
  • Questions?