XML Processing by Asfak

Slides from my seminar on XML processing, given on 2011-12-20 at KAZ Software Ltd.

Transcript

  • 1. XML Processing Md. Asfak Mahamud KAZ Software Ltd.
  • 2. XML and Other Markup Languages
    • SGML (1973), HTML (1989), XML (1996)
    • “XML has several favorable attributes that distinguish it from other competing technologies. Programmers find XML easy to learn because it is human-readable. The downside, however, is that an XML document needs to be parsed for it to become machine-readable.”
    Ref: “XML on a Chip?”, a document specially prepared for Sun Microsystems by XimpleWare (6/9/2003)
  • 3. Regular Language
    • Regular languages are languages which can be recognized by a computer with finite (i.e. fixed) memory.
    • Such a computer corresponds to a DFA.
    • For example, L = {1^n | n is even}
    • However, many languages cannot be recognized using only finite memory. A simple example is the language
    • L = {0^n 1^n | n ∈ ℕ}
    • i.e. the language of words that start with some number of 0s followed by the same number of 1s
    Ref: http://www.cs.nott.ac.uk/~txa/g51mal/notes-3x.pdf
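
The contrast between the two languages on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example. The first recognizer needs just two states, while the second needs a counter that can grow without bound, which is exactly what a finite automaton lacks.

```python
def accepts_even_ones(word: str) -> bool:
    """Two-state DFA for {1^n | n is even}: the only memory is the current state."""
    state = "even"  # start state, also the accepting state
    for symbol in word:
        if symbol != "1":
            return False  # the alphabet of this language is {1}
        state = "odd" if state == "even" else "even"
    return state == "even"


def accepts_zeros_then_ones(word: str) -> bool:
    """Recognizer for {0^n 1^n}: relies on an unbounded counter, so it is not a DFA."""
    count = 0
    i = 0
    while i < len(word) and word[i] == "0":  # count the leading 0s
        count += 1
        i += 1
    while i < len(word) and word[i] == "1":  # cancel one 0 per 1
        count -= 1
        i += 1
    # Accept only if the whole word was consumed and the counts match.
    return i == len(word) and count == 0
```
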
  • 4. XML is not regular
    • “Well-formed XML is not a regular language, and it cannot be parsed by a finite-state automaton, but rather requires at least a push-down automaton (PDA).”
    Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan
    • This can be proved with the pumping lemma. A proof: http://welbog.homeip.net/glue/53/XML-is-not-regular
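
A minimal sketch of why a stack is needed: matching each closing tag against the most recently opened element is the push-down part of the job. The regular expression and the function name are assumptions made for this example, and the sketch handles only tags without attributes, so it is far from a full well-formedness check.

```python
import re

# Simplified tag pattern: <name>, </name> or <name/>, with no attributes.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")


def tags_balanced(text: str) -> bool:
    """Return True if every start tag is closed in the right order."""
    stack = []  # names of the elements that are currently open
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            continue  # <name/> opens and closes at once
        if closing:
            # A closing tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # anything left on the stack was never closed
```

The stack depth is bounded only by the nesting depth of the document, which is why no fixed number of states is enough.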
  • 5. Typical XML Processing
    • Stages shown: Parsing, Semantic Analysis (input XML in, output XML out)
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
  • 6. Typical XML Processing
    • input XML → Parsing → Access → Modification → Serialization → output XML
    • Annotation: Semantic Analysis
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
  • 7. Typical XML Processing
    • input XML → Parsing → Access → Modification → Serialization → output XML
    • Annotations: Semantic Analysis; Performance Bottleneck
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
  • 8. Typical XML Processing
    • input XML → Parsing → Access → Modification → Serialization → output XML
    • Annotations: Semantic Analysis; Performance Bottleneck; Performance affected by parsing models
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
  • 9. Steps in Parsing
    • Character Conversion: bit sequence (3C 61 3E) → character sequence (‘<’ ‘a’ ‘>’)
    • Lexical Analysis (FSM): character sequence → token sequence (‘<a>’ ‘X’ ‘</a>’)
    • Syntactic Analysis (PDA): token sequence → data representation (tree, event, integer array)
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
  • 10. Steps in Parsing
    • Same three steps as on slide 9
    • Character Conversion and Lexical Analysis: invariant among different parsing models
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
  • 11. Steps in Parsing
    • Same three steps as on slide 9
    • Character Conversion and Lexical Analysis: invariant among different parsing models
    • Syntactic Analysis and the resulting data representation: parsing model dependent (different among different parsing models)
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
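
The three parsing steps can be sketched end to end in Python. This is a toy illustration, not how any real parser is written: the function names, the token pattern (tags without attributes, plus text) and the tuple-based tree are all assumptions for this example. The bytes 3C 61 3E are the UTF-8/ASCII encoding of `<a>`.

```python
import re

# Step 1: character conversion (bytes to characters).
def convert_characters(raw: bytes, encoding: str = "utf-8") -> str:
    return raw.decode(encoding)


# Step 2: lexical analysis. A regular expression plays the role of the FSM.
TOKEN = re.compile(r"</?[A-Za-z_][\w.-]*>|[^<]+")

def tokenize(text: str) -> list:
    return TOKEN.findall(text)


# Step 3: syntactic analysis. The explicit stack plays the role of the PDA.
# The data representation chosen here is a tree of (name, children) tuples.
def build_tree(tokens):
    root = ("#document", [])
    stack = [root]
    for tok in tokens:
        if tok.startswith("</"):
            if len(stack) == 1 or stack[-1][0] != tok[2:-1]:
                raise ValueError("unexpected closing tag " + tok)
            stack.pop()
        elif tok.startswith("<"):
            node = (tok[1:-1], [])
            stack[-1][1].append(node)
            stack.append(node)
        else:
            stack[-1][1].append(tok)  # text content
    if len(stack) != 1:
        raise ValueError("unclosed element " + stack[-1][0])
    return root


raw = bytes([0x3C, 0x61, 0x3E, 0x58, 0x3C, 0x2F, 0x61, 0x3E])  # <a>X</a>
tree = build_tree(tokenize(convert_characters(raw)))
```

Only `build_tree` would change if the output were events or an integer array instead of a tree; the first two functions would stay as they are.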
  • 12. XML Processing: DOM & SAX or StAX
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
  • 13. Why is DOM memory intensive?
    • Overhead of allocating small memory blocks
      • The OS pre-divides the heap into linked lists of small, fixed-size free memory blocks, also known as buckets. A request for a small memory block is served with the smallest pre-allocated block that fits the size of the request. For instance, a request to allocate a single byte returns a 16-byte chunk (an 8-byte memory block plus 8 bytes for boundary tags). When the OS has to allocate many small memory blocks, the overhead can become very significant.
    • Unnecessary decoupling between a node object and its name
      • A node object is a small memory block containing a pointer to the node name in the form of a string object, which is another small block. The binding between node object and node name plays right into the weakness of the OS: as if the overhead of small memory blocks weren't bad enough, DOM "knowingly" creates as many small blocks as possible to take advantage of the "overhead."
    Ref: “XML on a Chip?”, a document specially prepared for Sun Microsystems by XimpleWare (6/9/2003)
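
The allocation arithmetic from this slide can be worked through in a short sketch. It assumes the figures quoted above (8-byte block granularity and 8 bytes of boundary tags per chunk); real allocators differ, and the function names are invented for this example.

```python
def chunk_size(request_bytes: int, granularity: int = 8, tag_bytes: int = 8) -> int:
    """Bytes actually consumed by one allocation under the slide's assumptions."""
    blocks = max(1, -(-request_bytes // granularity))  # ceiling division, at least one block
    return blocks * granularity + tag_bytes


def overhead_ratio(request_sizes) -> float:
    """Allocated bytes divided by requested bytes for a series of allocations."""
    requested = sum(request_sizes)
    allocated = sum(chunk_size(n) for n in request_sizes)
    return allocated / requested
```

Under these assumptions `chunk_size(1)` is 16, matching the single-byte example on the slide, so a thousand one-byte allocations consume sixteen times the memory that was asked for.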
  • 14. Efficiency Problems of DOM and SAX/StAX Parsing Models
    • Extractive
    Ref: VTD-XML-based Design and Implementation of GML Parsing Project Lan Xiaoji, Su Jianqiang, Cai Jinbao
  • 15. Efficiency Problems of DOM and SAX/StAX Parsing Models (contd.)
    • Encoding
    • Even when the DOM model makes only a small change to an XML document, it must first decode the entire document and then build the structure, which is virtually pure overhead.
    Ref: VTD-XML-based Design and Implementation of GML Parsing Project, Lan Xiaoji, Su Jianqiang, Cai Jinbao
  • 16. XML Processing: VTD (Virtual Token Descriptor)
    • Developed by XimpleWare.
    • Dual-licensed under the GPL and a proprietary license.
    • Originally written in Java, now also available in C, C++ and C#.
    • Latest version: 2.10 (February 2011)
  • 17. VTD-XML
    • Non-Extractive, Document-Centric Parsing
      • Traditionally, a lexical analyzer represents tokens (the small units of indivisible character values) as discrete string objects. This approach is designated extractive parsing. In contrast, non-extractive tokenization mandates that one keeps the source text intact, and uses offsets and lengths to describe those tokens.
    • Virtual Token Descriptor
      • Virtual Token Descriptor (VTD) applies the concept of non-extractive, document-centric parsing to XML processing. A VTD record uses a 64-bit integer to encode the offset, length, token type and nesting depth of a token in an XML document. Because all VTD records are 64-bit in length, they can be stored efficiently and managed as an array.
    • Location Cache
      • Location Caches (LC) build on VTD records to provide efficient random access. Organized as tables, with one table per nesting depth level, LCs contain entries modeling an XML document's element hierarchy. An LC entry is a 64-bit integer encoding a pair of 32-bit values. The upper 32 bits identify the VTD record for the corresponding element. The lower 32 bits identify that element's first child in the LC at the next lower nesting level.
    Ref: http://en.wikipedia.org/wiki/VTD-XML
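
To make the idea of a 64-bit record concrete, here is a small sketch of packing and unpacking the four fields. The bit widths (4-bit type, 8-bit depth, 20-bit length, 32-bit offset), the field order and the type code are assumptions chosen for illustration and should not be read as VTD-XML's actual layout. The LC entry follows the description above: upper 32 bits for the VTD record index, lower 32 bits for the first child's index.

```python
OFFSET_BITS, LENGTH_BITS, DEPTH_BITS, TYPE_BITS = 32, 20, 8, 4  # 64 bits in total


def pack_record(offset: int, length: int, depth: int, token_type: int) -> int:
    """Pack the four fields into one 64-bit integer (illustrative layout)."""
    for value, bits in ((offset, OFFSET_BITS), (length, LENGTH_BITS),
                        (depth, DEPTH_BITS), (token_type, TYPE_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its bit width")
    return (token_type << 60) | (depth << 52) | (length << 32) | offset


def unpack_record(record: int):
    """Return (offset, length, depth, token_type) from a packed record."""
    offset = record & 0xFFFFFFFF
    length = (record >> 32) & 0xFFFFF
    depth = (record >> 52) & 0xFF
    token_type = record >> 60
    return offset, length, depth, token_type


def pack_lc_entry(vtd_index: int, first_child_index: int) -> int:
    """Upper 32 bits: VTD record index. Lower 32 bits: first child in the next LC."""
    return (vtd_index << 32) | first_child_index


def unpack_lc_entry(entry: int):
    return entry >> 32, entry & 0xFFFFFFFF


# Non-extractive token: the text "red" is described, not copied.
doc = b"<color>red</color>"
TEXT = 5  # hypothetical token type code
record = pack_record(offset=7, length=3, depth=0, token_type=TEXT)
offset, length, _, _ = unpack_record(record)
token_bytes = doc[offset:offset + length]
```

Because every record is a fixed-width integer, a whole document's tokens fit in one flat integer array instead of many small string objects.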
  • 18. VTD: inside VTD record Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
  • 19. XML Processing: VTD
    Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)
  • 20. VTD-XML Parsed Representation of XML. Image: http://vtd-xml.sourceforge.net/technical/2.html
  • 21. VTD-XML Resolving child elements using Location Cache. Image: http://vtd-xml.sourceforge.net/technical/2.html
  • 22. James Clark (in 2002)
    • “Improve XML processing models.
    • Right now, developers are generally caught between the inefficiencies of DOM and the unfamiliar feel of SAX.
    • An API that offers the best of both is needed.”
    Ref: Keeping pace with James Clark, https://www.ibm.com/developerworks/xml/library/x-jclark.html?dwzone=xml
    Ref: http://www.jclark.com/bio.htm
  • 23. VTD-XML has both DOM-like and SAX-like features.
    • “After the parser finishes processing XML, the processing model provides two views of the underlying XML data.
    • The first is a flat view of all VTD records corresponding to all tokens in XML in document order, it can be thought of as a view of cached SAX events.
    • The second is a hierarchical view enabled by a cursor-based navigation API allowing for DOM-like random access within the document. And the cursor always points to the VTD record of the current element.”
    Ref: http://vtd-xml.sourceforge.net/technical/3.html
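
The two views can be illustrated with a toy model in Python. Everything here is hypothetical: the `Record` and `Cursor` classes are invented for this sketch and are not the VTD-XML API, and the cursor finds children and siblings by scanning the record array linearly rather than through location caches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    kind: str    # "element" or "text"
    depth: int   # nesting depth (text carries the depth of its enclosing element)
    value: str


# Records for <a><b>x</b><c/></a> in document order.
RECORDS = [
    Record("element", 0, "a"),
    Record("element", 1, "b"),
    Record("text", 1, "x"),
    Record("element", 1, "c"),
]

# Flat view: walk the array in document order, like replaying cached SAX events.
flat_view = [r.value for r in RECORDS]


class Cursor:
    """Hierarchical view: a cursor that always points at one element record."""

    def __init__(self, records):
        self.records = records
        self.index = 0  # assumes records[0] is the root element

    @property
    def current(self) -> Record:
        return self.records[self.index]

    def _move(self, target_depth: int, stop_depth: int) -> bool:
        # Scan forward for the next element at target_depth, giving up as soon
        # as an element at stop_depth or shallower shows the subtree has ended.
        for i in range(self.index + 1, len(self.records)):
            rec = self.records[i]
            if rec.kind != "element":
                continue
            if rec.depth <= stop_depth:
                return False
            if rec.depth == target_depth:
                self.index = i
                return True
        return False

    def to_first_child(self) -> bool:
        depth = self.current.depth
        return self._move(depth + 1, depth)

    def to_next_sibling(self) -> bool:
        depth = self.current.depth
        return self._move(depth, depth - 1)
```

Both views read the same array; neither builds a tree of node objects.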
  • 24.
    • Demo
  • 25. VTD
    • Most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.
    • Memory usage estimate:
      • n1 = total number of tokens (including ending tags)
      • n2 = number of tokens for starting tags
      • s = document size in bytes
      • 8 × (n1 − n2) = total size of VTD records in bytes (without ending tags)
      • 8 × n2 = total size of LCs in bytes (fully indexed, i.e. one LC entry per element)
      • Memory usage in bytes: s + 8 × (n1 − n2) + 8 × n2 = s + 8 × n1
    Ref: http://vtd-xml.sourceforge.net/benchmark4.html
    Ref: http://vtd-xml.sourceforge.net/technical/2.html
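
The memory estimate above is easy to check with a worked example. The function name and the token counts below are made up for illustration; only the formula comes from the slide.

```python
def vtd_memory_bytes(s: int, n1: int, n2: int) -> int:
    """s = document size, n1 = all tokens incl. ending tags, n2 = starting tags."""
    vtd_records = 8 * (n1 - n2)   # one 8-byte record per token, ending tags excluded
    location_caches = 8 * n2      # one 8-byte LC entry per element
    return s + vtd_records + location_caches


# Hypothetical 1 MB document with 50,000 tokens, 20,000 of them starting tags.
s, n1, n2 = 1_000_000, 50_000, 20_000
total = vtd_memory_bytes(s, n1, n2)
ratio = total / s
```

With these assumed counts the total is 1,400,000 bytes, a ratio of 1.4, which falls inside the 1.3x~1.5x range quoted on the slide.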
  • 26. VTD
    • Fastest XML parser
    • Fastest XPath 1.0 implementation
    Ref: http://vtd-xml.sourceforge.net/benchmark4.html
  • 27. VTD
    • World's only incremental-update capable XML parser, capable of cutting, pasting, splitting and assembling XML documents with maximum efficiency.
      • Ref: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html
    • World's only XML parser that allows you to use XPath to process 256 GB XML documents.
    Ref: http://vtd-xml.sourceforge.net
  • 28. Incremental Update (do not touch content that does not need to change)
    • Problem: change ‘red’ to ‘blue’
    <color> red </color>
    • Human approach:
    • open the file with a simple text editor such as Notepad,
    • move the cursor to the start of the text node,
    • replace "red" with "blue"
    • DOM approach:
    • 1. Build the DOM tree
    • 2. Navigate to and then update the text node
    • 3. Write the updated structure back into XML
    • “if we humans can edit XML like this, why can't XML parsers” - Jimmy Zhang, JavaWorld.com, 07/24/06
    Ref: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html
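
The "human approach" amounts to splicing bytes at a known offset and leaving everything else untouched. A minimal sketch, assuming the offset and length of the text token are already known (a non-extractive parser would supply them; here `bytes.index` stands in for that lookup, and `splice` is a name invented for this example):

```python
def splice(document: bytes, offset: int, length: int, replacement: bytes) -> bytes:
    """Replace `length` bytes at `offset`, copying the rest of the document as-is."""
    return document[:offset] + replacement + document[offset + length:]


doc = b"<color> red </color>"
offset = doc.index(b"red")  # stand-in for the offset stored in a token record
updated = splice(doc, offset, len(b"red"), b"blue")
```

No tree is built and nothing is re-serialized: the bytes before and after the token are copied verbatim, whitespace included.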
  • 29.
    • Demo: Incremental Update
  • 30. VTD on Android Platform
    Ref: Analyzing XML Parsers Performance for Android Platform, M V Uttam Tej, Dhanaraj Cheelu, M. Rajasekhara Babu, P Venkata Krishna (SCSE, VIT University, Vellore, Tamil Nadu)
  • 31. Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
  • 32. Comparisons (contd.) Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
  • 33. Comparisons (contd.) Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
  • 34. VTD-XML’s Limitations
    • As a file format, it increases the document size by about 30% to 50%.
    • As an API, it is not compatible with DOM or SAX.
    • It is difficult to support certain validation techniques, employed by DTD and XML Schema (e.g., default attributes and elements), that require modifications to the XML instances being parsed.
    Ref: http://en.wikipedia.org/wiki/VTD-XML
  • 35. Parallel Approach to XML Parsing
    Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan
  • 36. Parallel Approach to XML Parsing (cont.)
    Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan
  • 37. Limitations of PXP
    • “First, the skeleton requires extra memory that is proportional to the number of nodes in the DOM tree. Further, the partitioning scheme based on subtrees can cause load imbalance on processing cores for XML documents with irregular or deep tree structures (e.g., TREEBANK with parts-of-speech tagging [29]). This scheme severely limits the granularity of parallelism that can be achieved, and thus cannot scale with increasing core count.”
    Ref: Section 2.2, “Prior Work on Parallel XML Parsing”, in “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah and Praveen R. Rao (University of Missouri-Kansas City), Bongki Moon (University of Arizona), Mohan Rajagopalan (Intel Research Labs)
  • 38. ParDOM
    Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah and Praveen R. Rao (University of Missouri-Kansas City), Bongki Moon (University of Arizona), Mohan Rajagopalan (Intel Research Labs)
  • 39. ParDOM (contd.)
    Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah and Praveen R. Rao (University of Missouri-Kansas City), Bongki Moon (University of Arizona), Mohan Rajagopalan (Intel Research Labs)
  • 40. ParDOM (contd.)
    Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah and Praveen R. Rao (University of Missouri-Kansas City), Bongki Moon (University of Arizona), Mohan Rajagopalan (Intel Research Labs)
  • 41.
    • Thank you.