Your SlideShare is downloading. ×
ApacheCon 2000 Everything you ever wanted to know about XML Parsing
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply

ApacheCon 2000 Everything you ever wanted to know about XML Parsing

611
views

Published on

Published in: Technology

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
611
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
8
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Everything you always wanted to know about XML parsing Ted Leung CTO, Enkubator, LLC ted@enkubator.com
  • 2. Thank You IBM n ASF n Xerces-Dev n
  • 3. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 4. Overview Focus on server side XML n Not document processing n Benefits to servers / e-business n Model-View separation for *ML n Ubiquitous format for data exchange n Hope for network effects from data availability n
  • 5. xml.apache.org Xerces XML parser n Java n C++ n Perl n Xalan XSLT processor n Java n C++ (soon) n FOP n Cocoon n
  • 6. Xerces-J Why Java? n Unicode n Code motion n Server side cross platform support n C++ version has lagged Java version n
  • 7. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 8. Basic XML Concepts Well formedness n Validity n DTD’s n Schemas n Entities n
  • 9. Example: RSS <?xml version=quot;1.0quot; encoding=quot;ISO-8859-1quot;?> <!DOCTYPE rss PUBLIC quot;-//Netscape Communications//DTD RSS .91//ENquot; quot;http://my.netscape.com/publish/formats/rss-0.91.dtdquot;> <rss version=quot;0.91quot;> <channel> <title>freshmeat.net</title> <link>http://freshmeat.net</link> <description>the one-stop-shop for all your Linux software needs</description> <language>en</language> <rating>(PICS-1.1 quot;http://www.classify.org/safesurf/quot; 1 r (SS~~000 1))</rating> <copyright>Copyright 1999, Freshmeat.net</copyright> <pubDate>Thu, 23 Aug 1999 07:00:00 GMT</pubDate> <lastBuildDate>Thu, 23 Aug 1999 16:20:26 GMT</lastBuildDate> <docs>http://www.blahblah.org/fm.cdf</docs>
  • 10. RSS 2 <image> <title>freshmeat.net</title> <url>http://freshmeat.net/images/fm.mini.jpg</url> <link>http://freshmeat.net</link> <width>88</width> <height>31</height> <description>This is the Freshmeat image stupid</description> </image> <item> <title>kdbg 1.0beta2</title> <link>http://www.freshmeat.net/news/1999/08/23/935449823.html</link> <description>This is the Freshmeat image stupid</description> </item> <item> <title>HTML-Tree 1.7</title> <link>http://www.freshmeat.net/news/1999/08/23/935449856.html</link> <description>This is the Freshmeat image stupid</description> </item>
  • 11. RSS 3 <textinput> <title>quick finder</title> <description>Use the text input below to search freshmeat</description> <name>query</name> <link>http://core.freshmeat.net/search.php3</link> </textinput> <skipHours> <hour>2</hour> </skipHours> <skipDays> <day>1</day> </skipDays> </channel> </rss>
  • 12. RSS DTD <!ELEMENT rss (channel)> <!ATTLIST rss version CDATA #REQUIRED> <!-- must be quot;0.91quot;> --> <!ELEMENT channel (title | description | link | language | item+ | rating? | image? | textinput? | copyright? | pubDate? | lastBuildDate? | docs? | managingEditor? | webMaster? | skipHours? | skipDays?)*> <!ELEMENT title (#PCDATA)> <!ELEMENT description (#PCDATA)> <!ELEMENT link (#PCDATA)> <!ELEMENT language (#PCDATA)> <!ELEMENT rating (#PCDATA)> <!ELEMENT copyright (#PCDATA)> <!ELEMENT pubDate (#PCDATA)> <!ELEMENT lastBuildDate (#PCDATA)> <!ELEMENT docs (#PCDATA)> <!ELEMENT managingEditor (#PCDATA)> <!ELEMENT webMaster (#PCDATA)> <!ELEMENT hour (#PCDATA)> <!ELEMENT day (#PCDATA)> <!ELEMENT skipHours (hour+)> <!ELEMENT skipDays (day+)>
  • 13. RSS DTD 2 <!ELEMENT item (title | link | description)*> <!ELEMENT textinput (title | description | name | link)*> <!ELEMENT name (#PCDATA)> <!ELEMENT image (title | url | link | width? | height? | description?)*> <!ELEMENT url (#PCDATA)> <!ELEMENT width (#PCDATA)> <!ELEMENT height (#PCDATA)>
  • 14. XML Application Tasks Convert XML into application object n Convert application object to XML n Read or write XML into a database n XSLT conversion to XHTML or HTML n
  • 15. RSS Objects Logical View RSSChannel RSSItem +getTitle() +getTitle() +getDescription() +getDescription() +getLink() +getLink() +getItems() +getImage() +getTextInput() RSS Image RSSTextInput +getTitle() +getDescription() +getLink() +getTitle() +getURL() +getDescription() +getHeight() +getLInk() +getWidth() +getName()
  • 16. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 17. Parser API’s What is the job of a parser API? n Make the structure and contents of a n document available Report errors n
  • 18. SAX API Event callback style n Representation of elements & atts n must build own stack n Development model n Reasonably open n xml-dev n http://www.megginson.com/SAX/SAX2 n
  • 19. Handler Strategies Single Monolithic n Requires the handler to know entirely about the n grammar One per application object and multiplexer n Each application object has its own n DocumentHandler The Multiplexer handler is registered with the n parser Multiplexer is responsible for swapping SAX n handlers as the context changes.
  • 20. SAX 1 RSS Handler public void startElement(String name,AttributeList atts) throws SAXException { currentText = new StringBuffer(); textStack.push(currentText); if (name.equals(quot;rssquot;) { elementStack.push(name); } else if (name.equals(quot;channelquot;)) { elementStack.push(name); currentChannel = new RSSChannel(); currentItems = new Vector(); currentChannel.setItems(currentItems); currentImage = new RSSImage(); currentChannel.setImage(currentImage); currentTextInput = new RSSTextInput(); currentChannel.setTextInput(currentTextInput); currentChannel.setSkipHours(new Vector()); currentChannel.setSkipDays(new Vector()); } else if (name.equals(quot;itemquot;)) { elementStack.push(name); currentItem = new RSSItem(); } else if (name.equals(quot;imagequot;)) { elementStack.push(name); } else if (name.equals(quot;textinputquot;)) { elementStack.push(name); } else if (name.equals(quot;skipHoursquot;)) { elementStack.push(name); } else if (name.equals(quot;skipDaysquot;)) { elementStack.push(name); } else { } }
  • 21. SAX 1 RSS Handler 2 public void endElement(String name) throws SAXException { String stackValue = (String) elementStack.peek(); String text = ((StringBuffer)textStack.pop()).toString(); if (name.equals(quot;rssquot;)) { elementStack.pop(); } else if (name.equals(quot;channelquot;)) { elementStack.pop(); } else if (name.equals(quot;titlequot;)) { if (stackValue.equals(quot;channelquot;)) { currentChannel.setTitle(text); } else if (stackValue.equals(quot;imagequot;)) { currentImage.setTitle(text); } else if (stackValue.equals(quot;itemquot;)) { currentItem.setTitle(text); } else if (stackValue.equals(quot;textinputquot;)) { currentTextInput.setTitle(text); } else { } ...
  • 22. SAX 1 RSS Handler 3 } else if (name.equals(quot;descriptionquot;)) { if (stackValue.equals(quot;channelquot;)) { currentChannel.setDescription(text); } else if (stackValue.equals(quot;imagequot;)) { currentImage.setDescription(text); } else if (stackValue.equals(quot;itemquot;)) { currentItem.setDescription(text); } else if (stackValue.equals(quot;textinputquot;)) { currentTextInput.setDescription(text); } } else if (name.equals(quot;languagequot;)) { currentChannel.setLanguage(text); } else if (name.equals(quot;itemquot;)) { currentItems.add(currentItem); elementStack.pop(); } else if (name.equals(quot;textinputquot;)) { currentChannel.setTextInput(currentTextInput); elementStack.pop(); } }
  • 23. SAX1 RSS Handler 4 public void characters(char[] ch,int start,int length) throws SAXException { currentText.append(ch, start, length); }
  • 24. SAX 1 RSS Driver public class SAX1RSSParse extends Object { public static void main(String args[]) { Parser p = new SAXParser(); DocumentHandler h = new SAX1RSSHandler(); p.setDocumentHandler(h); try { p.parse(quot;file:///Work/ApacheCon/fm0.91_full.rdfquot;); } catch (SAXException se) { System.out.println(quot;Error during parsingquot;+se.getMessage()); se.printStackTrace(); } catch (IOException e) { System.out.println(quot;I/O Error during parsing+e.getMessage()); e.printStackTrace(); } System.out.println(((SAX1RSSHandler)h).getChannel().toString()); } }
  • 25. SAX 1 Basic Event Handling Basic Event Handling n attribute handling in startElement n characters() for content n character data handling in endElement n Multiple characters calls n
  • 26. SAX 1 RSS ErrorHandler public class SAX1RSSErrorHandler implements ErrorHandler { public void warning(SAXParseException ex) throws SAXException { System.err.println(quot;[Warning] ”+getLocationString(ex)+quot;:”+ ex.getMessage()); } public void error(SAXParseException ex) throws SAXException { System.err.println(quot;[Error] quot;+getLocationString(ex)+quot;:”+ ex.getMessage()); } public void fatalError(SAXParseException ex) throws SAXException { System.err.println(quot;[Fatal Error]quot;+getLocationString(ex)+quot;:” +ex.getMessage()); throw ex; }
  • 27. SAX1 RSS ErrorHandler 2 private String getLocationString(SAXParseException ex) { StringBuffer str = new StringBuffer(); String systemId = ex.getSystemId(); if (systemId != null) { int index = systemId.lastIndexOf('/'); if (index != -1) systemId = systemId.substring(index + 1); str.append(systemId); } str.append(':'); str.append(ex.getLineNumber()); str.append(':'); str.append(ex.getColumnNumber()); return str.toString(); } } }
  • 28. ErrorDriver public class SAX1RSSParse extends Object { public static void main(String args[]) { Parser p = new SAXParser(); DocumentHandler h = new SAX1RSSHandler(); p.setDocumentHandler(h); ErrorHandler eh = new SAX1RSSErrorHandler(); p.setErrorHandler(eh); try { p.parse(quot;file:///Work/ApacheCon/fm0.91_full.rdfquot;); } catch (SAXException se) { System.out.println(quot;Error during parsingquot;+se.getMessage()); se.printStackTrace(); } catch (IOException e) { System.out.println(quot;I/O Error during parsing+e.getMessage()); e.printStackTrace(); } System.out.println(((SAX1RSSHandler)h).getChannel().toString()); } }
  • 29. SAX 1 Error Handling Error Handling n warning n error – parser may recover n fatalError – XML 1.0 spec errors n Locators n
  • 30. Entity Handling <!DOCTYPE rss PUBLIC quot;-//Netscape Communications//DTD RSS91//ENquot; n quot;http://my.netscape.com/publish/formats/rss-0.91.dtdquot;> This will do a network fetch! n public class SAX1RSSEntityHandler implements EntityResolver { public SAX1RSSEntityHandler() { } public InputSource resolveEntity(String publicId,String systemId) throws SAXException, IOException { if (publicId.equals(quot;-//Netscape Communications//DTD RSS 91//ENquot;)) { FileReader r = new FileReader(quot;/usr/share/xml/rss91.dtdquot;); return new InputSource(r); } return null; } }
  • 31. EntityDriver public class SAX1RSSParse extends Object { public static void main(String args[]) { Parser p = new SAXParser(); DocumentHandler h = new SAX1RSSHandler(); p.setDocumentHandler(h); ErrorHandler eh = new SAX1RSSErrorHandler(); p.setErrorHandler(eh); EntityResolver r = new SAX1RSSEntityResolver(); p.setEntityResolver(r); try { p.parse(quot;file:///Work/ApacheCon/fm0.91_full.rdfquot;); } catch (SAXException se) { System.out.println(quot;Error during parsingquot;+se.getMessage()); se.printStackTrace(); } catch (IOException e) { System.out.println(quot;I/O Error during parsing+e.getMessage()); e.printStackTrace(); } System.out.println(((SAX1RSSHandler)h).getChannel().toString()); } }
  • 32. HandlerBase The kitchen sink n Lets you do DocumentHandler, n ErrorHandler, EntityResolver and DTD Handler A good way to go if you are only n overriding a small number of methods
  • 33. SAX 2 Beta Configurable n DTD n ContentHandler replaces n DocumentHandler (NS Aware) Attributes replaces AttributeList (NS n Aware) XMLReader replaces Parser n XMLFilter n DefaultHandler replaces HandlerBase n
  • 34. SAX 2 Beta Optional LexicalHandler n Entity boundaries n DTD bounaries n CDATA sections n Comments n Optional DeclHandler n ElementDecls and AttributeDecls n Internal and External Entity Decls n NO parameter entities n Compatibility Adapters n
  • 35. SAX 2 Configurable Uses strings (URI’s) as lookup keys n Features - booleans n validation n namespaces n Properties - Objects n Extensibility by returning an object that n implements additional function
  • 36. Xerces and Configurable Standard way to set all parser n configuration settings. Applies to both SAX and DOM Parsers n
  • 37. Xerces Features Inclusion of general entities n Inclusion of parameter entities n Dynamic validation n Extra warnings n Allow use of java encoding names n Lazy evaluating DOM n DOM EntityReference creation n Inclusion of igorable whitespace in DOM n
  • 38. Xerces Properties Name of DOM implementation class n DOM node currently being parsed n
  • 39. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 40. DOM API Object representation of elements & n atts Development Model n W3C Recommendations and process n http:///www.w3.org/DOM n
  • 41. DOM Architecture DocumentType Node +getNodeName() +getNodeValue() ProcessingInstruction +getChildNodes() +getFirstChild() Document +getLastChild() +getNextSibling() +getPreviousSibling() +insertBefore() +appendChild() +replaceChild() +removeChild() Element Attr +getAttribute() +setAttribute() +removeAttribute() +getTagName() CharacterData EntityReference Entity Notation +getAttributeNode() +setAttributeNode() +removeAttributeNode() +normalize() NodeList NamedNodeMap Comment Text CDataSection
  • 42. DOM Instance (XML) <?xml version=quot;1.0quot; encoding=quot;US-ASCIIquot;?> <a> <b>in b</b> <c> text in c <d/> </c> </a>
  • 43. DOM Tree Document Element a Text Element Text Element Text [lf] b [lf] c [lf] Text Text Element Text in b [lf]text in c[lf] d [lf]
  • 44. DOM RSS public static void main(String args[]) { DOMParser p = new DOMParser(); try { p.parse(quot;file:///Work/ApacheCon/fm0.91_full.rdfquot;); System.out.println(dom2Channel(p.getDocument()).toString()); } catch (SAXException se) { System.out.println(quot;Error during parsing quot;+se.getMessage()); se.printStackTrace(); } catch (IOException e) { System.out.println(quot;I/O Error during parsing “+e.getMessage()); e.printStackTrace(); } }
  • 45. DOM2Channel public static RSSChannel dom2Channel(Document d) { RSSChannel c = new RSSChannel(); Element channel = null; NodeList nl = d.getElementsByTagName(quot;channelquot;); channel = (Element) nl.item(0); for (Node n = channel.getFirstChild(); n != null; n = n.getNextSibling()) { if (n.getNodeType() != Node.ELEMENT_NODE) continue; Element e = (Element) n; e.normalize(); String text = e.getFirstChild().getNodeValue(); if (e.getTagName().equals(quot;titlequot;)) { c.setTitle(text); } else if (e.getTagName().equals(quot;descriptionquot;)) { c.setDescription(text); } else if (e.getTagName().equals(quot;linkquot;)) { c.setLink(text); } else if (e.getTagName().equals(quot;languagequot;)) { ...
  • 46. DOM2Channel 2 … } nl = channel.getElementsByTagName(quot;itemquot;); c.setItems(dom2Items(nl)); nl = channel.getElementsByTagName(quot;imagequot;); c.setImage(dom2Image(nl.item(0))); nl = channel.getElementsByTagName(quot;textinputquot;); c.setTextInput(dom2TextInput(nl.item(0))); nl = channel.getElementsByTagName(quot;skipHoursquot;); c.setSkipHours(dom2SkipHours(nl.item(0))); nl = channel.getElementsByTagName(quot;skipDaysquot;); c.setSkipDays(dom2SkipDays(nl.item(0))); return c; }
  • 47. DOM Level 1 Attr n API for result as objects n API for result as strings n No DTD support n Need Import - copying between doms is n bad
  • 48. DOM Level 2 Core / Namespaces n Import n Traversal n Events n Range n Stylesheets n
  • 49. Xerces DOM extensions Feature controls insertion of ignorable n whitespace Feature controls creation of n EntityReference Nodes User data object provided on n implementation class org.apache.xml.dom.NodeImpl
  • 50. Xerces Lazy DOM Delays object creation n a node is accessed - its object is created n all siblings are also created. n Reduces time to complete parsing n Reduces memory usage n Good if you only access part of the n DOM
  • 51. Tricky DOM stuff item() based access n use getNextSibling() n Entity and EntityReference nodes – n depends on parser settings Ignorable whitespace n Subclassing the DOM n Use the Document factory class n
  • 52. DOM vs SAX DOM n creates a document object tree n tree must the be reparsed & converted to n BO’s Could subclass BO’s from DOM classes n SAX n event handler based API n stack management n no intermediate data structures generated n
  • 53. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 54. Validation When to use validation n non-validating != wf checking n Not required to read external entities n validation dial settings n
  • 55. Namespaces Purpose n Syntactic discrimination only n They don’t point to anything n They don’t imply schema composition/combination n Universal Names n URI + Local Name n {http://www.mydomain.com}Element Declarations n xmlns “attribute” n Prefixes n Scoping n Validation and DTDs n
  • 56. Namespace Example <xsl:stylesheet version=quot;1.0quot; xmlns:xsl=quot;http://www.w3.org/1999/XSL/Transformquot; xmlns:html=quot;http://www.w3.org/TR/xhtml1/strictquot;> <xsl:template match=quot;/quot;> <html:html> <html:head> <html:title>Title</html:title> </html:head> <html:body> <xsl:apply-templates/> </html:body> </xsl:template> ... </xsl:stylesheet>
  • 57. Default Namespace Example <xsl:stylesheet version=quot;1.0quot; xmlns:xsl=quot;http://www.w3.org/1999/XSL/Transformquot; xmlns=quot;http://www.w3.org/TR/xhtml1/strictquot;> <xsl:template match=quot;/quot;> <html> <head> <title>Title</title> </head> <body> <xsl:apply-templates/> </body> </xsl:template> ... </xsl:stylesheet>
  • 58. SAX2 ContentHandler public class NamespaceDriver extends DefaultHandler implements ContentHandler { public void startPrefixMapping(String prefix,String uri) throws SAXException { System.out.println(quot;startMapping quot;+prefix+quot;, uri = quot;+uri); } public void endPrefixMapping(String prefix) throws SAXException { System.out.println(quot;endMapping quot;+prefix); } public void startElement(String namespaceURI,String localName,String rawName,Attributes atts) throws SAXException { System.out.println(quot;Start: nsURI=quot;+namespaceURI+quot; localName= quot;+localName+quot; rawName = quot;+rawName); printAttributes(atts); } public void endElement(String namespaceURI,String localName,String rawName) throws SAXException { System.out.println(quot;End: nsURI=quot;+namespaceURI+quot; localName= quot;+localName+quot; rawName = quot;+rawName); }
  • 59. SAX2 ContentHandler 2 public void printAttributes(Attributes atts) { for (int i = 0; i < atts.getLength(); i++) { System.out.println(quot;tAttribute quot;+i+quot; URI=quot;+atts.getURI(i)+ quot; LocalName=quot;+atts.getLocalName(i)+ quot; Type=quot;+atts.getType(i)+quot; Value=quot;+atts.getValue(i)); } }
  • 60. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 61. XML Schema Richer grammar specification n W3C Working Draft n Structures n XML Instance Syntax n Namespaces n Weird content models n Datatypes n Lots n Prototype Support in Xerces n
  • 62. RSS Schema <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> <!DOCTYPE schema PUBLIC quot;-//W3C//DTD XMLSCHEMA 20000225//ENquot; quot;http://www.w3.org/TR/2000/WD-xmlschema-1- 20000225/structures.dtdquot;> <xsd:schema targetNamespace=quot;http://my.netscape.com/publish/formats/rss-0.91quot;> <xsd:element name=quot;rssquot;> <xsd:complexType> <xsd:element ref=quot;channelquot;/> <xsd:attribute name=quot;versionquot; fixed=quot;0.91quot; minOccurs=quot;1quot;/> </xsd:complexType> </xsd:element>
  • 63. RSS Schema 2 <xsd:element name=quot;channelquot;> <xsd:complexType> <xsd:element ref=quot;titlequot;/> <xsd:element ref=quot;descriptionquot;/> <xsd:element ref=quot;linkquot;/> <xsd:element ref=quot;languagequot;/> <xsd:element ref=quot;itemquot; minOccurs=quot;1quot; maxOccurs=quot;*quot;/> <xsd:element ref=quot;ratingquot; minOccurs=quot;0quot;/> <xsd:element ref=quot;imagequot; minOccurs=quot;0quot;/> <xsd:element ref=quot;textinputquot; minOccurs=quot;0quot;/> <xsd:element ref=quot;copyrightquot; minOccurs=quot;0quot;/> <xsd:element ref=quot;pubDatequot; minOccurs=quot;0quot;/> <xsd:element ref=quot;lastBuildDatequot; minOccurs=quot;0quot;/> <xsd:element ref=quot;docsquot; minOccurs=quot;0quot;/> <xsd:element ref=quot;managingEditorquot; minOccurs=quot;0quot;/> <xsd:element ref=quot;webMasterquot; minOccurs=quot;0quot;/> <xsd:element ref=quot;skipHoursquot; minOccurs=quot;0quot;/> <xsd:element ref=quot;skipDaysquot; minOccurs=quot;0quot;/> </xsd:complexType> </xsd:element>
  • 64. RSS Schema 3 ... <xsd:element name=quot;namequot; type=quot;xsd:stringquot;/> <xsd:element name=quot;ratingquot; type=quot;xsd:stringquot;/> <xsd:element name=quot;languagequot; type=quot;xsd:stringquot;/> <xsd:element name=quot;widthquot; type=quot;xsd:integerquot;/> <xsd:element name=quot;heightquot; type=quot;xsd:integerquot;/> <xsd:element name=quot;copyrightquot; type=quot;xsd:stringquot;/> <xsd:element name=quot;pubDatequot; type=quot;xsd:datequot;/> <xsd:element name=quot;lastBuildDatequot; type=quot;xsd:datequot;/> <xsd:element name=quot;docsquot; type=quot;xsd:stringquot;/> <xsd:element name=quot;managingEditorquot; type=quot;xsd:stringquot;/> <xsd:element name=quot;webMasterquot; type=quot;xsd:stringquot;/> <xsd:element name=quot;hourquot; type=quot;xsd:timequot;/> <xsd:element name=quot;dayquot; type=quot;xsd:stringquot;/> … </xsd:schema>
  • 65. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 66. Grammar Access No DOM standard n SAX 2 n Common Applications n Editors n Stub generators n Also experimental API in Xerces n
  • 67. DTD Access Example public class DTDDumper implements DeclHandler { public void elementDecl (String name, String model) throws SAXException { System.out.println(quot;Element quot;+name+quot; has this content model: quot;+model); } public void attributeDecl (String eName, String aName, String type, String valueDefault, String value) throws SAXException { System.out.print(quot;Attribute quot;+aName+quot; on Element quot;+eName+quot; has type quot;+type); if (valueDefault != null) System.out.print(quot;, it has a default type of quot;+valueDefault); if (value != null) System.out.print(quot;, and a default value of quot;+value); System.out.println(); }
  • 68. DTD Access Example 2 public void internalEntityDecl (String name, String value) throws SAXException { String peText = name.startsWith(quot;%quot;) ? quot;Parameter quot; : quot;quot;; System.out.println(quot;Internal quot;+peText+quot;Entity quot;+name+quot; has value: quot;+value); } public void externalEntityDecl (String name, String publicId, String systemId) throws SAXException { String peText = name.startsWith(quot;%quot;) ? quot;Parameter quot; : quot;quot;; System.out.print(quot;External quot;+peText+quot;Entity quot;+name); System.out.print(quot;is available at quot;+systemId); if (publicId != null) System.out.print(quot; which is known as quot;+publicId); System.out.println(); }
  • 69. DTD Access Driver public static void main(String args[]) { try { XMLReader r = new SAXParser(); r.setProperty(quot;http://xml.org/sax/properties/declaration-handlerquot;, new DTDDumper()); r.parse(args[0]); } catch (Exception e) { e.printStackTrace(); } }
  • 70. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 71. Round Tripping Use SAX2 - DOM can’t do it n XML Decl n Comments (SAX2) n PE Decls n
  • 72. “Serializing” Several choices n XMLSerializer n TextSerializer n HTMLSerializer n XHTMLSerializer n Work as either: n SAX DocumentHandlers n DOM serializers n
  • 73. Serializer Example DOMParser p = new DOMParser(); p.parse(quot;file:///twl/Work/ApacheCon/fm0.91_full.rdfquot;); Document d = p.getDocument(); // XML OutputFormat format = new OutputFormat(quot;xmlquot;, quot;UTF-8quot;, true); XMLSerializer serializer = new XMLSerializer(System.out, format); serializer.serialize(d); // XHTML format = new OutputFormat(quot;xhtmlquot;, quot;UTF-8quot;, true); serializer = new XHTMLSerializer(System.out, format); serializer.serialize(d);
  • 74. Entity Handling Entity / Catalog handling n Catalog handling is mostly for document n processing (SGML) applications Pluggable entity handling is important for n server applications. Provides access point to DBMS, LDAP, etc. n predefined character entities n amp, lt, gt, apos, quot n
  • 75. Concurrency “Thread safety” n Xerces allows multiple parser instances, n one per thread Xerces does not allow multiple threads to n enter a single parser instance. org.apache.xerces.framework.XMLParser#parseSome() n
  • 76. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 77. Grammar Design Attributes vs Elements n performance tradeoff vs programming n tradeoff At most 1 attribute, many subelements n ID & IDREF n NMTOKEN vs CDATA n CDATASECTIONs n Ignorable Whitespace n
  • 78. Grammar Design 2 Everything is Strings n Data modelling n Relationships as elements n modelling inheritance n Plan to use schema datatypes n avoid weird types n
  • 79. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 80. Performance Tips skip the DOM n reuse parser instances (code) n reset() method n defaulting is slow n external anythings are slower, entities n are slower only validate if you have to, validate n once for all time use fast encoding (US-ASCII) n
  • 81. Outline Overview n xml.apache.org project n Basic XML Concepts n SAX 1 / 2 n DOM Parsing n Namespaces n XML Schema n Grammar Access n Round Tripping n Grammar Design n Performance tips n Xerces Architecture n
  • 82. Xerces Architecture Use SAX based infrastructure where n possible to avoid reinventing the wheel Modular / Framework design n Prefer composition for system n components
  • 83. Xerces Architecture Diagram SAXParser DOMParser DOM Implementations Error Handling XMLParser XMLDTDScanner DTDValidator XMLDocumentScanner XSchemaValidator Entity Handling Readers
  • 84. Things we don’t do HTML parsing (yet) n java Tidy n http://www3.sympatico.ca/ac.quick/jtidy.ht ml Grammar caching (yet) n Parse several documents in a stream n