Efficient Querying of XML Data Using Structural Joins
Stack-Tree-Anc  (father-son example) a1 d1 a2 d2 . . . . an dn dn+1 dn+2 . . d2n a1 (a1,d1) a2 (a2,d2) . . . (an-1,dn-1) a...
Analysis of Stack-Tree-Anc <ul><li>O(|Alist| + |Dlist| + |OutputList|) For  ancestor-descendant as well as father-son stru...
Experiment workload <ul><li>Experimented with real XML data as well as synthetic data generated by IBM XML data generator ...
Experiment results <ul><li>Implemented the structural join algorithms, as well as bottom-up and top-down, on the TIMBER na...
Experiment results   (continued) <ul><li>Implemented the STJ-D as an application program interfacing to a commercial RDBMS...
Experiment results   (continued)
Holistic Twig Joins: Optimal XML Pattern Matching Bruno, Koudas, Srivastava ACM SIGMOD  2002
Twig patterns <ul><li>book[title = “XML” AND year = 2000] </li></ul><ul><li>book[title = “XML”]//author[Fn = “jane” AND Ln...
Twig pattern matching <ul><li>Given a twig pattern  Q  and an XML database  D , a  match  is a mapping from nodes in  Q  t...
Twig pattern matching approaches <ul><li>Decompose the twig into a series of  binary structural joins , compute each (usin...
PathStack-Desc <ul><li>go to start of all lists;  OutputList = NULL; </li></ul><ul><li>while (lists are not empty) { </li>...
PathStack-Desc  (example) a1 b1 a2 b2 c1 b3 c2 ? e.startPos > stack->top.endPos a1 b1 a2 b2 (a2,b2,c1) (a1,b2,c1) (a1,b1,c...
PathStack-Desc experimental results <ul><li>Implemented the binary join algorithms, as well as the StackPath, in C++ using...
Final remarks <ul><li>What we did not do (partial list): </li></ul><ul><ul><li>Look at using B+-Trees with the stack algor...
Appendix:  TwigStack  (in a nutshell) <ul><li>getNext(q) returns a query node such that the  head  of its list satisfies: ...
Appendix  (continued) <ul><li>Note that as long as 09 succeeds we return the node whose head has the smallest startPos (L)...
Appendix  (continued) <ul><li>Both used ternary trees. </li></ul><ul><ul><li>Left sub-tree in (a) has only A1=A2=A3=A4 pat...
Bibliography <ul><li>Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, “Efficient Struc...
 
Structural joins on XML documents

  1. 1. Efficient Querying of XML Data Using Structural Joins
  2. 2. Content <ul><li>A quick look at XML query languages </li></ul><ul><li>Lore - an example of a native XML database </li></ul><ul><li>DB2 - an example of RDBMS’s support for XML </li></ul><ul><li>On supporting containment queries in RDBMS </li></ul><ul><li>The Tree-Merge and Stack-Tree algorithms </li></ul><ul><li>The StackPath algorithm </li></ul>
  3. 3. XML <ul><li>Replacement for HTML </li></ul><ul><ul><li>Focus is on storing and processing. </li></ul></ul><ul><li>Electronic Data Interchange </li></ul><ul><ul><li>Querying becomes desirable. </li></ul></ul><ul><li>People with many XML documents actually have an XML database. </li></ul>
  4. 4. XML query languages <ul><li>XML-QL </li></ul><ul><ul><li>Influenced by SQL </li></ul></ul><ul><ul><li>Submitted to W3C (lost favor to XQuery) </li></ul></ul><ul><li>XPath </li></ul><ul><ul><li>used in XSLT </li></ul></ul><ul><ul><li>the basis for path expressions in XQuery </li></ul></ul><ul><li>XQuery </li></ul><ul><ul><li>A W3C working draft (version 1.0) </li></ul></ul><ul><ul><li>Based on Quilt (which in turn was mainly influenced by XML-QL and Lorel) </li></ul></ul><ul><ul><li>No updates, limited IR features </li></ul></ul>
  5. 5. XPath <ul><li>*/para </li></ul><ul><ul><li>selects all para grandchildren of the context node </li></ul></ul><ul><li>/doc/chapter[5]/section[2] </li></ul><ul><ul><li>selects the second section of the fifth chapter of the doc </li></ul></ul><ul><li>chapter//para </li></ul><ul><ul><li>selects the para element descendants of the chapter element children of the context node </li></ul></ul><ul><li>para[@type="warning"] </li></ul><ul><ul><li>selects all para children of the context node that have a type attribute with value warning </li></ul></ul><ul><li>chapter[title="Introduction"] </li></ul><ul><ul><li>selects the chapter children of the context node that have one or more title children with string-value equal to Introduction </li></ul></ul>
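The path expressions above can be tried out with a few lines of Python. This is only a sketch: the sample document is invented for the illustration, and the standard-library `xml.etree.ElementTree` module supports just a subset of XPath (for example, it rejects absolute paths such as `/doc/chapter[5]/section[2]` when called on an element), so only the relative expressions are shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
DOC = """
<doc>
  <chapter>
    <title>Introduction</title>
    <section><para type="warning">w1</para></section>
    <para>p1</para>
  </chapter>
  <chapter>
    <title>Details</title>
    <para type="warning">w2</para>
  </chapter>
</doc>
"""

root = ET.fromstring(DOC)  # the context node is <doc>


def texts(elements):
    """Return the text of each element, in result order."""
    return [e.text for e in elements]


# */para: para grandchildren of the context node
grandchildren = texts(root.findall("*/para"))

# chapter//para: para descendants of the chapter children
descendants = texts(root.findall("chapter//para"))

# para[@type="warning"], evaluated with the second chapter as context node
second_chapter = root.find("chapter[2]")
warnings = texts(second_chapter.findall("para[@type='warning']"))

# chapter[title="Introduction"]: chapters that have a matching title child
intro = root.findall("chapter[title='Introduction']")
```

Note how `*/para` skips the para nested inside a section, while `chapter//para` reaches it.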
  6. 6. XQuery <ul><li>document("books.xml")//chapter/title </li></ul><ul><ul><li>Finds all titles of chapters in document books.xml </li></ul></ul><ul><li>document("bib.xml")//book[publisher = "Addison-Wesley" AND @year > "1991"] </li></ul><ul><ul><li>Finds all books in document bib.xml published by Addison-Wesley after 1991 </li></ul></ul><ul><li><results> { </li></ul><ul><li>FOR $t IN distinct(document("prices.xml")/prices/book/title) </li></ul><ul><li>LET $p := avg(document("prices.xml")/prices/book[title=$t]/price) </li></ul><ul><li>WHERE (document("bib.xml")/book[title=$t]/publisher) = "Addison-Wesley" </li></ul><ul><li>RETURN </li></ul><ul><li><result> { $t } <avg> { $p } </avg> </result> </li></ul><ul><li>} </results> </li></ul><ul><ul><li>Returns the title and average price of all books published by Addison-Wesley </li></ul></ul>
  7. 7. XML documents as trees <ul><li><book year="2000"> </li></ul><ul><ul><li><title> XML </title> </li></ul></ul><ul><ul><li><authors> </li></ul></ul><ul><ul><li><author> Bill </author> </li></ul></ul><ul><ul><li><author> Jake </author> </li></ul></ul><ul><ul><li></authors> </li></ul></ul><ul><ul><li><chapter> </li></ul></ul><ul><ul><li><head> History </head> </li></ul></ul><ul><ul><li><section> </li></ul></ul><ul><ul><li> <head> … </head> </li></ul></ul><ul><ul><li> <section> … </section> </li></ul></ul><ul><ul><li></section> </li></ul></ul><ul><ul><li><section> … </section> </li></ul></ul><ul><ul><li></chapter> </li></ul></ul><ul><ul><li><chapter> … </chapter> </li></ul></ul><ul><li></book> </li></ul><ul><li>Order of nodes is important </li></ul>[Figure: the same document drawn as a tree, with nodes book, year (2000), title (XML), authors, author (Bill, Jake), chapter, head (History) and section.]
  8. 8. XML documents as trees <ul><li><book year="2000"> </li></ul><ul><ul><li><title id="id1"> XML </title> </li></ul></ul><ul><ul><li><authors> </li></ul></ul><ul><ul><li><author> Bill </author> </li></ul></ul><ul><ul><li><author> Jake </author> </li></ul></ul><ul><ul><li></authors> </li></ul></ul><ul><ul><li><chapter> </li></ul></ul><ul><ul><li><head> History </head> </li></ul></ul><ul><ul><li><section> </li></ul></ul><ul><ul><li> <head> … </head> </li></ul></ul><ul><ul><li> <section idref="id1"> … </section> </li></ul></ul><ul><ul><li></section> </li></ul></ul><ul><ul><li><section> … </section> </li></ul></ul><ul><ul><li></chapter> </li></ul></ul><ul><ul><li><chapter> … </chapter> </li></ul></ul><ul><li></book> </li></ul><ul><li>Order of nodes is important </li></ul>[Figure: the same tree with nodes book, year (2000), title (XML), authors, author (Bill, Jake), chapter, head (History) and section; the idref attribute of the nested section refers to the id of the title.]
  9. 9. Executing queries <ul><li>How does one execute a complex query: </li></ul><ul><ul><li>Parse the query (i.e. break it down to basic operations). </li></ul></ul><ul><ul><li>Let a query optimizer devise a corresponding physical query plan. </li></ul></ul><ul><ul><li>Execute the required basic operations combining the intermediate results as you go. </li></ul></ul><ul><li>The most common basic operations are: </li></ul><ul><ul><li>Finding nodes satisfying a given predicate on their value. </li></ul></ul><ul><ul><li>Finding nodes satisfying a given structural relationship. </li></ul></ul>
  10. 10. XML databases <ul><li>XML is semi-structured; data items may have missing elements or multiple occurrences of the same element. A document may not even have a DTD. </li></ul><ul><li>Native semi-structured databases </li></ul><ul><ul><li>X-Hive, Lore </li></ul></ul><ul><li>RDBMS </li></ul><ul><ul><li>Oracle </li></ul></ul><ul><ul><li>SQL-Server </li></ul></ul><ul><ul><li>DB2 </li></ul></ul><ul><ul><li>All added support for XML </li></ul></ul>
  11. 11. Semi-structured XML databases <ul><li>There aren’t many around </li></ul><ul><li>Store XML files plus indexes </li></ul><ul><li>Usually build (and store) most or all of the tree </li></ul><ul><li>Usually solve path expressions by pointer-chasing </li></ul>
  12. 12. LORE An example of a native semi-structured database
  13. 13. Lore - sample database <ul><ul><ul><ul><li>Select x </li></ul></ul></ul></ul><ul><ul><ul><ul><li>From DBGroup.Member x </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Where exists y in x.age: y<30 </li></ul></ul></ul></ul>
  14. 14. Lore - data model <ul><li>Called the Object Exchange Model </li></ul><ul><li>The data model is a graph (though the reference edges are marked as such). </li></ul><ul><li>Each vertex is an object with a unique object identifier. </li></ul><ul><li>Atomic objects have no outgoing edges and contain values (like strings, gifs, audio etc.) </li></ul><ul><li>All other objects may have outgoing edges. </li></ul><ul><li>Tag-Names (labels) are attached to the edges, not the vertices. </li></ul><ul><li>Objects may optionally have aliases (names). </li></ul><ul><li>As is obvious this is just another view of our XML tree </li></ul>
  15. 15. Lore - indexes <ul><li>Vindex (value index) - implemented as a B+-tree </li></ul><ul><ul><li>Supports finding all atomic objects with a given incoming edge label satisfying a given predicate. </li></ul></ul><ul><li>Lindex (label index) - implemented using extendible hashing </li></ul><ul><ul><li>Supports finding all parents of a given object via an edge with a given label. </li></ul></ul><ul><li>Bindex (edge index) </li></ul><ul><ul><li>Supports finding all parent-child pairs connected via a given label. This is useful for locating edges with rare labels. </li></ul></ul><ul><li>In addition there are some other indexes (not important to us). </li></ul><ul><li>Note that we need more indexes than in a relational database </li></ul>
  16. 16. Lore - statistics ( partial list ) <ul><li>For each labeled path p of length <= k (usually k=1): </li></ul><ul><ul><li>The total number of instances of p , denoted $|p|$ </li></ul></ul><ul><ul><li>The total number of distinct objects reachable via p , denoted $|p|_d$ </li></ul></ul><ul><ul><li>The total number of l -labeled edges going out of p , denoted $|p^l|$ </li></ul></ul><ul><ul><li>The total number of l -labeled edges coming into p , denoted $|p_l|$ </li></ul></ul>
  17. 17. Lore - path expressions ( simplified ) <ul><li>Simple path expressions </li></ul><ul><ul><li>x.l y </li></ul></ul><ul><li>Path expressions </li></ul><ul><ul><li>an ordered list of simple path expressions </li></ul></ul><ul><ul><li>x.l y, y.l2 z </li></ul></ul><ul><li>Path expressions logical plan: </li></ul><ul><ul><ul><li>x.B y, y.C z, z.D v </li></ul></ul></ul>
  18. 18. Lore - basic physical operators ( slightly edited ) <ul><li>Scan(father, label, son ) </li></ul><ul><ul><li>Finds all the sons of a given father (through a given label). </li></ul></ul><ul><ul><li>Does pointer-chasing </li></ul></ul><ul><li>Lindex( father , label, son) </li></ul><ul><ul><li>Finds all the fathers of a given son (through a given label). </li></ul></ul><ul><ul><li>Uses the Lindex </li></ul></ul><ul><li>Bindex(label, father , son ) </li></ul><ul><ul><li>Finds all the father-son pairs connected by a given label. </li></ul></ul><ul><ul><li>Uses the Bindex </li></ul></ul><ul><li>Vindex(label, operator, value, atomic-object ) </li></ul><ul><ul><li>Finds all the atomic objects with a given incoming label that satisfy the given predicate. </li></ul></ul><ul><ul><li>Uses the Vindex </li></ul></ul><ul><li>Name(alias, node) </li></ul><ul><ul><li>Verifies that the specified node has the given alias. </li></ul></ul>
  19. 19. Lore - physical path subplans [Figure: three physical subplans, captioned "x and y are unbound", "y is bound" and "x and y are unbound".] <ul><li>The estimated hit-rate (per x) of Scan(x, "C", y) is: $|B^C| / |B|_d$ </li></ul><ul><li>The estimated hit-rate (per y) of Lindex(x, "C", y) is: $|C_B| / |C|_d$ </li></ul>
  20. 20. Lore - sample logical plan <ul><ul><li>Select x From DBGroup.Member x Where exists y in x.age: y<30 </li></ul></ul><ul><ul><li>Glue nodes are pivot points: each one recursively estimates the cost of evaluating its sons in left-right or right-left order. </li></ul></ul>
  21. 21. Lore - sample physical subplans <ul><li>(a) corresponds to a possible left-right plan of the top “glue” </li></ul><ul><li>(b) corresponds to a possible left-right plan of the right “glue” </li></ul><ul><li>(c) corresponds to a possible right-left plan of the right “glue” </li></ul><ul><li>(d) corresponds to a possible right-left plan of the top “glue”, using (c) </li></ul>
  22. 22. Lore - path expressions strategies <ul><li>A higher level view of path expressions solving </li></ul><ul><li>Top-Down </li></ul><ul><ul><li>Look for all Member objects in DBGroup and for each one look for Age subobjects with a value < 30. </li></ul></ul><ul><ul><li>uses scan </li></ul></ul><ul><li>Bottom-up </li></ul><ul><ul><li>Look for all atomic objects with value < 30 and for each one walk up the tree using only Age -labeled followed by Member -labeled edges. </li></ul></ul><ul><ul><li>uses Vindex and then Lindex </li></ul></ul><ul><li>Hybrid </li></ul><ul><ul><li>Do Top-Down part of the way and Bottom-Up part of the way. </li></ul></ul><ul><li>Select x From DBGroup.Member x Where exists y in x.age: y<30 </li></ul>
  23. 23. Lore - path strategies (continued) <ul><li>Top-Down is better when there are few paths satisfying the required structure, but many objects satisfying the predicate. </li></ul><ul><li>Bottom-Up is better when there are few objects satisfying the predicate, but many paths satisfying the required structure. </li></ul><ul><li>Hybrid is better when the fan-out degree (going down) and the fan-in degree (going up) both increase. </li></ul>
  24. 24. DB2 An example of RDBMS support for XML
  25. 25. DB2 - XML support <ul><li>XML column </li></ul><ul><ul><li>An entire XML document is stored as a column in a table. </li></ul></ul><ul><ul><li>The column type may be XMLCLOB, XMLVARCHAR or XMLFile. </li></ul></ul><ul><ul><li>You define which XML elements or attributes should be extracted to indexed columns in side tables. </li></ul></ul><ul><ul><li>UDFs are provided for inserting, updating and selecting fragments of a document. </li></ul></ul><ul><li>XML collection </li></ul><ul><ul><li>Compose an XML document from existing DB2 tables. </li></ul></ul><ul><ul><li>Decompose an XML document and retrieve some of it into a set of DB2 tables. </li></ul></ul><ul><ul><li>Basically a conversion mechanism. </li></ul></ul><ul><ul><li>Stored procedures automate most of the work. </li></ul></ul>
  26. 26. DB2 - a nice diagram...
  27. 27. DB2 - example Data Access Definition
  28. 28. DB2 - example DAD (continued)
  29. 29. DB2 - searching XML documents <ul><li>Whatever is in the side tables is queried using SQL. </li></ul><ul><li>What about things not in any side table? </li></ul><ul><ul><li>A loosely coupled IR engine (part of the DB2 Text Extender) is called using a UDF to take care of this. </li></ul></ul><ul><ul><li>The UDFs use a syntax compatible with XPath. </li></ul></ul>
  30. 30. DB2 - conclusions (in a nutshell) <ul><li>Pros </li></ul><ul><ul><li>Integrated solution which automates a lot of work. </li></ul></ul><ul><ul><li>We can ask queries that mix data from XML and the regular database tables (aka “web-supported database queries” and “database-supported web queries”). </li></ul></ul><ul><li>Cons </li></ul><ul><ul><li>One has to manually define the mappings between the XML documents and the tables. </li></ul></ul><ul><ul><li>Is it fast enough? </li></ul></ul>
  31. 31. On Supporting Containment Queries in RDBMS Zhang, Naughton, DeWitt, Luo, Lohman ACM SIGMOD 2001
  32. 32. Article goals <ul><li>Given that a lot of XML data is (and will probably be) stored in RDBMSs, what is the best way to support containment queries ? </li></ul><ul><ul><li>Using a loosely coupled IR engine? </li></ul></ul><ul><ul><li>OR </li></ul></ul><ul><ul><li>Using the native tables and query mechanisms of the RDBMS? </li></ul></ul>
  33. 33. Structural relationships in trees <ul><li>Note that x is a descendant of y if and only if: preorder( x ) > preorder( y ) and postorder( x ) < postorder( y ) </li></ul><ul><li>y is the father of x if in addition: level( x ) = level( y ) + 1 </li></ul>[Figure: the book tree (book, year, title, authors, chapter, 2000, XML, author, head, section, Bill, Jake, author, History, head) drawn twice, once with its nodes numbered 1 to 15 in pre-order and once in post-order.]
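The two tests on this slide can be checked on a small tree. The following is a minimal sketch, not code from the paper: the function names and the sample document are made up for the illustration, and only element nodes are numbered (the slide's figure also numbers text and attribute values).

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Map each element to (preorder, postorder, level) using one depth-first walk."""
    numbers = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, level):
        counters["pre"] += 1
        pre = counters["pre"]          # assigned when the node is entered
        for child in node:
            visit(child, level + 1)
        counters["post"] += 1          # assigned when the node is left
        numbers[node] = (pre, counters["post"], level)

    visit(root, 1)
    return numbers


def is_descendant(numbers, x, y):
    """x is a descendant of y iff pre(x) > pre(y) and post(x) < post(y)."""
    pre_x, post_x, _ = numbers[x]
    pre_y, post_y, _ = numbers[y]
    return pre_x > pre_y and post_x < post_y


def is_child(numbers, x, y):
    """y is the father of x if, in addition, level(x) = level(y) + 1."""
    return is_descendant(numbers, x, y) and numbers[x][2] == numbers[y][2] + 1


book = ET.fromstring(
    "<book><title>XML</title>"
    "<authors><author>Bill</author><author>Jake</author></authors></book>"
)
title, authors = book.find("title"), book.find("authors")
bill = authors.find("author")
numbers = number_nodes(book)
```

No pointer chasing is needed once the numbers exist: each test is a constant number of integer comparisons.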
  34. 34. Structural relationships in XML <ul><li>The previous observations are true even if we look at any monotone functions of the preorder and the postorder numbers. </li></ul><ul><li>The start and end position of an element in an XML document are exactly such monotone functions. </li></ul><ul><li>In other words, we can use a small extension of the regular IR inverted-index to also solve structural relationships! </li></ul><ul><li>Note that the numbers have to be adapted if the document changes. </li></ul>
  35. 35. The inverted indexes <ul><li>An Elements index (E-Index): Holds for each XML element, the docno , begin , end and level of every occurrence of that element. </li></ul><ul><li>A Text index (T-Index): Holds for each text word, the docno , wordno and level of every occurrence of the word. </li></ul>
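As a rough illustration of what the two indexes hold, the sketch below builds them for one small document. The numbering scheme is an assumption made for the example: every start tag, end tag and text word consumes one position, and a word is given the level of its enclosing element plus one. The function name `build_indexes` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(xml_text, docno):
    """Return (E-index, T-index) for one document.

    E-index: tag  -> sorted list of (docno, begin, end, level)
    T-index: word -> list of (docno, wordno, level)
    """
    e_index = defaultdict(list)
    t_index = defaultdict(list)
    position = 0

    def next_position():
        nonlocal position
        position += 1
        return position

    def add_words(text, level):
        for word in (text or "").split():
            t_index[word].append((docno, next_position(), level))

    def visit(elem, level):
        begin = next_position()               # position of the start tag
        add_words(elem.text, level + 1)       # words before the first child
        for child in elem:
            visit(child, level + 1)
            add_words(child.tail, level + 1)  # words that follow the child
        end = next_position()                 # position of the end tag
        e_index[elem.tag].append((docno, begin, end, level))

    visit(ET.fromstring(xml_text), 1)
    for postings in e_index.values():
        postings.sort()                       # ascending (docno, begin)
    return dict(e_index), dict(t_index)


E, T = build_indexes(
    "<book><title>XML basics</title><author>Bill</author></book>", docno=1
)
```

With these positions, the word "Bill" (wordno 7) lies strictly between the begin and end of author (6, 8) and of book (1, 9), but not of title (2, 5).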
  36. 36. Experiment plan <ul><li>Compare the following two systems: </li></ul><ul><li>An inverted list engine supporting containment queries on XML data. </li></ul><ul><ul><li>The engine was built (due to lack of a commercial one). </li></ul></ul><ul><ul><li>The code was written in C++ and the inverted-indexes were stored in a B+-tree with each list stored as a record . </li></ul></ul><ul><ul><li>Each list is in ascending order of docno, begin (or wordno ). </li></ul></ul><ul><ul><li>An in-house algorithm was developed for evaluating simple containment queries. </li></ul></ul><ul><li>A full RDBMS approach (tried DB2 7.1 and SQL-Server 7.0) </li></ul><ul><ul><li>The E-index and T-index are stored as the following tables: ELEMENTS(term, docno, begin, end, level) TEXTS(term, docno, wordno, level) </li></ul></ul><ul><li>Note that we do not use the IR engine of the RDBMS. </li></ul>
  37. 37. Using the inverted indexes tables <ul><li>E//"T" </li></ul><ul><li>select * from ELEMENTS e, TEXTS t </li></ul><ul><li>where e.term = 'E' and t.term = 'T' </li></ul><ul><li>and e.docno = t.docno </li></ul><ul><li>and e.begin < t.wordno and t.wordno < e.end </li></ul><ul><li>E="T" </li></ul><ul><li>select * from ELEMENTS e, TEXTS t </li></ul><ul><li>where e.term = 'E' and t.term = 'T' </li></ul><ul><li>and e.docno = t.docno </li></ul><ul><li>and e.begin + 1 = t.wordno and t.wordno + 1 = e.end </li></ul><ul><li>In a similar fashion we solve Elements only queries, father-son, and words distance queries. </li></ul><ul><li>(How will this look for E//E?) </li></ul>
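The containment queries can be run against an in-memory SQLite database as a quick experiment. This is a sketch under assumptions: SQLite stands in for DB2 or SQL-Server, the rows are invented sample data, and the column names `begin` and `end` are quoted because they are SQL keywords in SQLite. The second function is one plausible answer to the E//E question: a self-join of ELEMENTS in which the descendant's interval lies strictly inside the ancestor's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ELEMENTS(term TEXT, docno INTEGER, "begin" INTEGER, "end" INTEGER, level INTEGER);
CREATE TABLE TEXTS(term TEXT, docno INTEGER, wordno INTEGER, level INTEGER);
""")
# Invented sample rows for <book><title>XML basics</title><author>Bill</author></book>
conn.executemany(
    "INSERT INTO ELEMENTS VALUES (?, ?, ?, ?, ?)",
    [("book", 1, 1, 9, 1), ("title", 1, 2, 5, 2), ("author", 1, 6, 8, 2)],
)
conn.executemany(
    "INSERT INTO TEXTS VALUES (?, ?, ?, ?)",
    [("XML", 1, 3, 3), ("basics", 1, 4, 3), ("Bill", 1, 7, 3)],
)


def element_contains_word(element, word):
    """E//"T": occurrences of the word inside occurrences of the element."""
    return conn.execute(
        '''SELECT e.docno, e."begin", t.wordno
           FROM ELEMENTS e, TEXTS t
           WHERE e.term = ? AND t.term = ?
             AND e.docno = t.docno
             AND e."begin" < t.wordno AND t.wordno < e."end"
           ORDER BY e.docno, e."begin", t.wordno''',
        (element, word),
    ).fetchall()


def element_contains_element(ancestor, descendant):
    """E//E: the descendant interval is nested in the ancestor interval."""
    return conn.execute(
        '''SELECT a.docno, a."begin", d."begin"
           FROM ELEMENTS a, ELEMENTS d
           WHERE a.term = ? AND d.term = ?
             AND a.docno = d.docno
             AND a."begin" < d."begin" AND d."end" < a."end"
           ORDER BY a.docno, a."begin", d."begin"''',
        (ancestor, descendant),
    ).fetchall()
```

For the father-son variants one would add a condition on `level`, along the lines of the level test on the earlier slide.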
  38. 38. Experiment setup <ul><li>The data sets: </li></ul>
  39. 39. Experiment setup (continued) <ul><li>The queries are all simple queries of the form: E//T, E//E, E/T or E/E </li></ul>
  40. 40. Experiment results
  41. 41. Results analysis <ul><li>Why did DB2 perform better in QS4, QD4 and QG5? </li></ul><ul><li>Remember that each list in the inverted engine is stored as one record! </li></ul><ul><li>Why did DB2 perform worse in all the other queries? </li></ul><ul><ul><li>Bad optimizer decisions? </li></ul></ul><ul><ul><li>Is I/O more expensive (locking, security, etc.)? </li></ul></ul><ul><ul><li>Other factors? </li></ul></ul><ul><ul><li>It turns out that the queries are CPU-bound ! </li></ul></ul><ul><ul><li>Further investigation found out that it was the merge algorithm. </li></ul></ul>
  42. 42. DB2 merge algorithms <ul><li>When joining on: </li></ul><ul><li>a.docno = d.docno and a.begin < d.wordno and d.wordno < a.end </li></ul><ul><li>Standard Merge-Join only uses the a.docno = d.docno predicate (since it does one comparison, using one index per table), and applies the rest of the condition to each matching pair. </li></ul><ul><li>Hash-Join only uses the a.docno = d.docno predicate (since it cannot handle inequalities anyway), and thus performs similarly to the standard merge join. </li></ul><ul><li>Index nested-loop join looks, for each row in the outer table, for all rows in the inner table index that lie between a start-key and a stop-key. </li></ul><ul><li>Assuming the outer table is ELEMENTS and the inner table is TEXTS: </li></ul><ul><ul><li>start-key: term = value and docno = outer.docno and wordno > outer.begin </li></ul></ul><ul><ul><li>stop-key: term = value and docno = outer.docno and wordno < outer.end </li></ul></ul>
  43. 43. The Multi-Predicate Merge Join <ul><li>begin_desc = Dlist->firstNode; OutputList = NULL; </li></ul><ul><li>for (a = Alist->firstNode; a != NULL; a = a->nextNode) { </li></ul><ul><li>d = begin_desc; </li></ul><ul><li>while (d != NULL && d->docno < a->docno) d = d->nextNode; // skip earlier documents </li></ul><ul><li>if (d == NULL) break; </li></ul><ul><li>if (a->docno < d->docno) continue; // no candidates in a's document </li></ul><ul><li>while (d != NULL && d->docno == a->docno && d->begin <= a->begin) d = d->nextNode; </li></ul><ul><li>begin_desc = d; // later ancestors never need anything before this point </li></ul><ul><li>while (d != NULL && d->begin < a->end) { // implies d->end < a->end </li></ul><ul><li>if (a->docno < d->docno) break; </li></ul><ul><li>append (a,d) to OutputList; </li></ul><ul><li>d = d->nextNode; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>Example Alist (doc, begin, end): (5, 7, 20), (5, 14, 19), (5, 21, 28), (5, 22, 27), (5, 29, 31), (5, 32, 40) </li></ul><ul><li>Example Dlist (doc, begin): (5, 2), (5, 23), (5, 24), (5, 33), (5, 37), (5, 42) </li></ul>
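For readers who want to run the merge, here is a minimal Python sketch of the same idea, using the example lists from the slide. It is an illustration rather than the authors' implementation: the lists are plain Python lists of tuples, and the two skipping loops of the pseudocode are folded into one comparison on (docno, begin).

```python
def mpmgjn(alist, dlist):
    """Multi-predicate merge join sketch.

    alist: (docno, begin, end) tuples sorted by (docno, begin)
    dlist: (docno, begin) tuples sorted by (docno, begin)
    Returns (ancestor, descendant) pairs sorted by ancestor.
    """
    output = []
    start = 0  # plays the role of begin_desc
    for a_doc, a_begin, a_end in alist:
        # Entries at or before a's start can match neither a nor any later ancestor.
        while start < len(dlist) and dlist[start] <= (a_doc, a_begin):
            start += 1
        # Scan forward, without moving start, while still inside a.
        i = start
        while i < len(dlist) and dlist[i][0] == a_doc and dlist[i][1] < a_end:
            output.append(((a_doc, a_begin, a_end), dlist[i]))
            i += 1
    return output


ALIST = [(5, 7, 20), (5, 14, 19), (5, 21, 28), (5, 22, 27), (5, 29, 31), (5, 32, 40)]
DLIST = [(5, 2), (5, 23), (5, 24), (5, 33), (5, 37), (5, 42)]
pairs = mpmgjn(ALIST, DLIST)
```

On this input the entries 23 and 24 are visited twice, once for each of the nested ancestors (21, 28) and (22, 27); this rescanning is what distinguishes the algorithm from a single-pass merge.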
  44. 44. Comparison of the merge algorithms <ul><li>It seems like the NLJ algorithm will usually compare fewer items, BUT </li></ul><ul><li>It has to spend time on index seeks! </li></ul><ul><li>It uses random access, so cache utilization is poor. </li></ul>
  45. 45. MPMGJN & traditional joins - statistics Note: DB2 did not choose NLJ for QG4
  46. 46. Structural Joins: A Primitive for Efficient XML Query Pattern Matching Al-Khalifa, Jagadish, Koudas, Patel, Srivastava, Wu ICDE 2002
  47. 47. Structural-Join algorithms <ul><li>Based on the ( docId, startPos, endPos, level) information of XML elements and attributes. </li></ul><ul><li>Given two lists of potential ancestors and potential descendants, both in ascending order of docId+startPos , the following structural join algorithms are presented: </li></ul><ul><ul><li>Tree-Merge-Anc (aka MPMGJN) </li></ul></ul><ul><ul><li>Tree-Merge-Desc </li></ul></ul><ul><ul><li>Stack-Tree-Desc </li></ul></ul><ul><ul><li>Stack-Tree-Anc </li></ul></ul><ul><li>The ?-?-Anc algorithms produce the output sorted by the ancestors. </li></ul><ul><li>The ?-?-Desc algorithms produce the output sorted by the descendants. </li></ul><ul><ul><li>The sorting variant to use depends on the way an optimizer chooses to compose a complex query. </li></ul></ul>
  48. 48. Tree-Merge-Anc <ul><li>begin_desc = Dlist->firstNode; OutputList = NULL; </li></ul><ul><li>for (a = Alist->firstNode; a != NULL; a = a->nextNode) { </li></ul><ul><li>d = begin_desc; </li></ul><ul><li>while (d != NULL && d->startPos <= a->startPos) d = d->nextNode; </li></ul><ul><li>begin_desc = d; </li></ul><ul><li>while (d != NULL && d->startPos < a->endPos) { // implies d->endPos < a->endPos </li></ul><ul><li>if (a->level + 1 == d->level) // father-son only; drop this test for ancestor-descendant </li></ul><ul><li>append (a,d) to OutputList; </li></ul><ul><li>d = d->nextNode; // always advance, even when the level test fails </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>Note: For ease of exposition, we assume that Alist and Dlist have the same docId. </li></ul>
  49. 49. Analysis of Tree-Merge-Anc <ul><li>Ancestor-Descendant structural relationships: </li></ul><ul><ul><li>O(|Alist| + |Dlist| + |OutputList|) </li></ul></ul><ul><ul><li>The first while loop only advances begin_desc, and every iteration of the second while loop either appends an output pair or ends the scan for the current a. </li></ul></ul><ul><li>Father-Son structural relationships: </li></ul><ul><ul><li>O(|Alist| * |Dlist|) </li></ul></ul><ul><ul><li>Can sub-sorting on levelNum help? </li></ul></ul><ul><li>Worst-case example (figure): a1 ... an are nested and every di lies inside every aj. </li></ul><ul><ul><li>Alist (begin, end): a1 (1, 4n), a2 (2, 4n-1), a3 (3, 4n-2), ..., an (n, 3n+1) </li></ul></ul><ul><ul><li>Dlist (begin): d1 n+1, d2 n+3, d3 n+5, ..., dn 3n-1 </li></ul></ul>
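The gap between the two bounds can be observed by counting inner-loop iterations on the worst-case input from the slide. The sketch below is illustrative only: the comparison counter, the `worst_case` helper, the descendants' end positions and the level numbers (ai at level i, every di a child of an) are assumptions added for the experiment.

```python
def tree_merge_anc(alist, dlist, parent_child=False):
    """Tree-Merge-Anc sketch for one document.

    Nodes are (startPos, endPos, level) tuples; both lists are sorted by startPos.
    Returns (output pairs, number of inner-loop iterations).
    """
    output, comparisons, start = [], 0, 0
    for a in alist:
        while start < len(dlist) and dlist[start][0] <= a[0]:
            start += 1
        i = start
        while i < len(dlist) and dlist[i][0] < a[1]:
            comparisons += 1
            d = dlist[i]
            if not parent_child or a[2] + 1 == d[2]:
                output.append((a, d))
            i += 1  # advance whether or not the pair qualified
    return output, comparisons


def worst_case(n):
    """Nested a1..an with d1..dn inside all of them (positions as on the slide)."""
    alist = [(i, 4 * n + 1 - i, i) for i in range(1, n + 1)]
    dlist = [(n + 2 * j - 1, n + 2 * j, n + 1) for j in range(1, n + 1)]
    return alist, dlist


n = 4
alist, dlist = worst_case(n)
ad_pairs, ad_comparisons = tree_merge_anc(alist, dlist)
pc_pairs, pc_comparisons = tree_merge_anc(alist, dlist, parent_child=True)
```

Both runs perform n * n iterations, but only the ancestor-descendant join can charge them to its n * n output pairs; the father-son join produces just n pairs for the same work.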
  50. 50. Tree-Merge-Desc <ul><li>begin_anc = Alist->firstNode; OutputList = NULL; </li></ul><ul><li>for (d = Dlist->firstNode; d != NULL; d = d->nextNode) { </li></ul><ul><li>a = begin_anc; </li></ul><ul><li>while (a != NULL && a->endPos <= d->startPos) a = a->nextNode; </li></ul><ul><li>begin_anc = a; </li></ul><ul><li>while (a != NULL && a->startPos < d->startPos) { </li></ul><ul><li>if (d->endPos < a->endPos && a->level + 1 == d->level) // level test for father-son only </li></ul><ul><li>append (a,d) to OutputList; </li></ul><ul><li>a = a->nextNode; // always advance, even when the tests fail </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>Note: For ease of exposition, we assume that Alist and Dlist have the same docId. </li></ul>
  51. 51. Analysis of Tree-Merge-Desc <ul><li>Ancestor-Descendant and Father-Son structural relationships: </li></ul><ul><ul><li>O(|Alist| * |Dlist|). </li></ul></ul><ul><ul><li>Works in linear time on most real data. </li></ul></ul><ul><li>Worst-case example (figure): a0 contains a1 ... an, and each di lies inside ai. </li></ul><ul><ul><li>Alist (begin, end): a0 (1, 4n+2), a1 (2, 5), a2 (6, 9), a3 (10, 13), ..., an (4n-2, 4n+1) </li></ul></ul><ul><ul><li>Dlist (begin): d1 3, d2 7, d3 11, ..., dn 4n-1 </li></ul></ul>
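A small experiment on the slide's example shows where the quadratic behaviour comes from: because a0 never ends before any descendant starts, the scan for every di restarts at a0. The following sketch counts inner-loop iterations; the counter, the `worst_case` helper, the descendants' end positions and the level numbers are assumptions added for the illustration.

```python
def tree_merge_desc(alist, dlist, parent_child=False):
    """Tree-Merge-Desc sketch for one document.

    Nodes are (startPos, endPos, level) tuples; both lists are sorted by startPos.
    Returns (output pairs sorted by descendant, number of inner-loop iterations).
    """
    output, comparisons, start = [], 0, 0
    for d in dlist:
        # Ancestors that end before d starts cannot contain d or any later descendant.
        while start < len(alist) and alist[start][1] <= d[0]:
            start += 1
        i = start
        while i < len(alist) and alist[i][0] < d[0]:
            comparisons += 1
            a = alist[i]
            if d[1] < a[1] and (not parent_child or a[2] + 1 == d[2]):
                output.append((a, d))
            i += 1
    return output, comparisons


def worst_case(n):
    """a0 spans everything; a1..an are its children; each di sits inside ai."""
    alist = [(1, 4 * n + 2, 1)] + [(4 * i - 2, 4 * i + 1, 2) for i in range(1, n + 1)]
    dlist = [(4 * i - 1, 4 * i, 3) for i in range(1, n + 1)]
    return alist, dlist


n = 3
alist, dlist = worst_case(n)
pairs, comparisons = tree_merge_desc(alist, dlist)
```

Here the scan for dk visits a0, a1, ..., ak, so the total is n(n+3)/2 iterations for only 2n output pairs.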
  52. 52. Stack-Tree algorithms <ul><li>Motivation </li></ul><ul><ul><li>A depth-first traversal of a tree can be performed in linear time, using a stack as large as the height of the tree. </li></ul></ul><ul><ul><li>An ancestor-descendant structural relationship shows up as the ancestor already being on the stack when the descendant is visited. </li></ul></ul><ul><ul><li>Unfortunately, a depth-first traversal requires going over the whole tree. </li></ul></ul>
  53. 53. Stack-Tree-Desc <ul><li>a = Alist->firstNode; d = Dlist->firstNode; OutputList = NULL; </li></ul><ul><li>while (lists are not empty) { </li></ul><ul><li>e = (a.startPos < d.startPos) ? a : d; </li></ul><ul><li>while (!stack->isEmpty() && e.startPos > stack->top.endPos) stack->pop(); </li></ul><ul><li>if (e == a) { // remember that e.startPos > stack->top.startPos </li></ul><ul><li>stack->push(a); </li></ul><ul><li>a = a->nextNode; </li></ul><ul><li>} else { // e == d </li></ul><ul><li>for each a’ in stack { // Father-Son: replace the loop with: if (stack->top.level + 1 == d.level) append (stack->top, d) </li></ul><ul><li>append (a’, d) to OutputList; </li></ul><ul><li>} </li></ul><ul><li>d = d->nextNode; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>Note: For ease of exposition, we assume that Alist and Dlist have the same docId. </li></ul>
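A minimal Python sketch of the pseudocode above, under a few assumptions: elements are `Node` tuples (a hypothetical type), the lists are Python lists sorted by start position, and the loop stops once Dlist is exhausted because no further pair can be produced after that.

```python
from typing import NamedTuple


class Node(NamedTuple):
    start: int
    end: int
    level: int


def stack_tree_desc(alist, dlist, father_son=False):
    """Stack-based structural join; the output is ordered by descendant."""
    out, stack = [], []
    i = j = 0
    while j < len(dlist):
        # Take whichever list head starts first in document order.
        take_a = i < len(alist) and alist[i].start < dlist[j].start
        e = alist[i] if take_a else dlist[j]
        # Pop ancestors that ended before e starts.
        while stack and e.start > stack[-1].end:
            stack.pop()
        if take_a:
            stack.append(e)
            i += 1
        else:
            if father_son:
                # Only the top of the stack can be the father of e.
                if stack and stack[-1].level + 1 == e.level:
                    out.append((stack[-1], e))
            else:
                # Everything left on the stack contains e.
                out.extend((a, e) for a in stack)
            j += 1
    return out


if __name__ == "__main__":
    a1, a2 = Node(1, 12, 1), Node(4, 9, 2)
    ds = [Node(2, 3, 2), Node(5, 6, 3), Node(7, 8, 3), Node(10, 11, 2)]
    for pair in stack_tree_desc([a1, a2], ds, father_son=True):
        print(pair)
```

The usage example is the n = 2 instance of the nested layout a1 d1 a2 d2 d3 d4, with positions chosen for illustration.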
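  54. 54. Stack-Tree-Desc (father-son example) <ul><li>Figure: a chain a1, a2, ..., an in which each ai is the father of di and d2n+1-i (so a1 has sons d1 and d2n, and an has sons dn and dn+1). </li></ul><ul><li>Document order: a1 d1 a2 d2 ... an dn dn+1 dn+2 ... d2n. </li></ul><ul><li>Stack while descending: a1, a2, a3, ..., an; entries are popped when e.startPos > stack->top.endPos. </li></ul><ul><li>Output (ordered by descendant): (a1,d1), (a2,d2), ..., (an-1,dn-1), (an,dn), (an,dn+1), (an-1,dn+2), ..., (a3,d2n-2), (a2,d2n-1), (a1,d2n). </li></ul>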
  55. 55. Analysis of Stack-Tree-Desc <ul><li>O(|Alist| + |Dlist| + |OutputList|) for ancestor-descendant as well as father-son structural relationships. </li></ul><ul><ul><li>Each Alist element is pushed once and popped once, so stack operations take O(|Alist|). </li></ul></ul><ul><ul><li>The inner “for loop” outputs a new pair each time, so its total time is O(|OutputList|). </li></ul></ul><ul><ul><li>When doing father-son structural joins, we do not even have a “for loop”. </li></ul></ul><ul><li>The algorithm is non-blocking. </li></ul><ul><li>IO complexity is O(|Alist|/P + |Dlist|/P + |OutputList|/P) where P is the page size. </li></ul><ul><ul><li>Each input page is read just once (and output sent as soon as it is computed). </li></ul></ul><ul><ul><li>The stack is as large as the tree height, so it is very reasonable to assume that it fits in RAM. </li></ul></ul>
  56. 56. Stack-Tree-Anc <ul><li>a = Alist->firstNode; d = Dlist->firstNode; OutputList = NULL; </li></ul><ul><li>while (lists are not empty) { </li></ul><ul><li>e = (a.startPos < d.startPos) ? a : d; </li></ul><ul><li>while (!stack->isEmpty() && e.startPos > stack->top.endPos) { </li></ul><ul><li>temp = stack->pop(); </li></ul><ul><li>if (stack->isEmpty()) { </li></ul><ul><li>append temp->selfList to OutputList; append temp->inheritList to OutputList; </li></ul><ul><li> } else { </li></ul><ul><li>append temp->inheritList to temp->selfList; append temp->selfList to stack->top->inheritList; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>if (e == a) { // remember that e.startPos > stack->top.startPos </li></ul><ul><li>stack->push(a); a = a->nextNode; </li></ul><ul><li>} else { // e == d </li></ul><ul><li>for each a’ in stack { </li></ul><ul><li>if (a’ == stack->bottom) append (a’, d) to OutputList; </li></ul><ul><li>else append (a’, d) to the selfList associated with a’; </li></ul><ul><li>} </li></ul><ul><li>d = d->nextNode; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>if (!stack->isEmpty()) flush the lists held on the stack to the output; </li></ul><ul><li>Note: For ease of exposition, we assume that Alist and Dlist have the same docId. </li></ul>
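The self-list and inherit-list bookkeeping can be sketched in Python as below. This is an illustration under assumptions: `Node` and the `father_son` flag are hypothetical, and Python lists stand in for the linked lists, so list concatenation here copies elements instead of being a constant-time pointer splice.

```python
from typing import NamedTuple


class Node(NamedTuple):
    start: int
    end: int
    level: int


def stack_tree_anc(alist, dlist, father_son=False):
    """Stack-based structural join; the output is ordered by ancestor."""
    out = []
    stack = []  # each entry is [node, self_list, inherit_list]
    i = j = 0

    def pop_top():
        _, self_list, inherit_list = stack.pop()
        merged = self_list + inherit_list
        if stack:
            # Pairs of the popped node must follow all pairs of the new top.
            stack[-1][2].extend(merged)
        else:
            out.extend(merged)

    while j < len(dlist):
        take_a = i < len(alist) and alist[i].start < dlist[j].start
        e = alist[i] if take_a else dlist[j]
        while stack and e.start > stack[-1][0].end:
            pop_top()
        if take_a:
            stack.append([e, [], []])
            i += 1
        else:
            for depth, (a, self_list, _) in enumerate(stack):
                if father_son and a.level + 1 != e.level:
                    continue
                # The bottom entry has no open ancestor, so emit its pairs now.
                (out if depth == 0 else self_list).append((a, e))
            j += 1
    while stack:  # flush whatever is still held on the stack
        pop_top()
    return out


if __name__ == "__main__":
    a1, a2 = Node(1, 12, 1), Node(4, 9, 2)
    ds = [Node(2, 3, 2), Node(5, 6, 3), Node(7, 8, 3), Node(10, 11, 2)]
    for pair in stack_tree_anc([a1, a2], ds, father_son=True):
        print(pair)
```

Pairs for the bottom of the stack are emitted immediately, while pairs for deeper entries wait until their ancestors are popped; this is what makes the algorithm only partially blocking.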
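  57. 57. Stack-Tree-Anc (father-son example) <ul><li>Figure: a chain a1, a2, ..., an in which each ai is the father of di and d2n+1-i; document order a1 d1 a2 d2 ... an dn dn+1 dn+2 ... d2n. </li></ul><ul><li>(a1,d1) is output directly because a1 is the stack bottom; (a2,d2), ..., (an-1,dn-1), (an,dn), (an,dn+1) go to the self lists of a2 ... an. </li></ul><ul><li>Each pop (when e.startPos > stack->top.endPos) hands the popped lists to the new top: an passes (an,dn), (an,dn+1); a3 passes (a3,d3), (a3,d2n-2), ..., (an,dn), (an,dn+1); a2 passes (a2,d2), (a2,d2n-1), ..., (an,dn), (an,dn+1). </li></ul><ul><li>Output (ordered by ancestor): (a1,d1), (a1,d2n), (a2,d2), (a2,d2n-1), ..., (an,dn), (an,dn+1). </li></ul>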
  58. 58. Analysis of Stack-Tree-Anc <ul><li>O(|Alist| + |Dlist| + |OutputList|) for ancestor-descendant as well as father-son structural relationships. </li></ul><ul><ul><li>Assuming the lists are maintained as linked lists with head and tail pointers. </li></ul></ul><ul><li>The algorithm is blocking (but only partially). </li></ul><ul><li>IO complexity is O(|Alist|/P + |Dlist|/P + |OutputList|/P) where P is the page size. </li></ul><ul><ul><li>We cannot assume that all the lists fit in RAM. </li></ul></ul><ul><ul><li>All that we do with lists (except output) is appending. </li></ul></ul><ul><ul><li>We can page out a list and keep only its tail in RAM, so we need two extra pages in memory per stack entry, which is still a reasonable assumption. </li></ul></ul><ul><ul><li>We only need to know the address of the head of a list. </li></ul></ul><ul><ul><li>Each list page is thus paged out at most once, and paged back in only for output. </li></ul></ul>
  59. 59. Experiment workload <ul><li>Experimented with real XML data as well as synthetic data generated by the IBM XML data generator (with similar results). </li></ul><ul><li>Presented the results for the largest data set: 6.3 million elements (800 MB of data). </li></ul>
  60. 60. Experiment results <ul><li>Implemented the structural join algorithms, as well as bottom-up and top-down, on the TIMBER native XML query engine (built on top of SHORE). </li></ul><ul><li>Bottom-up and top-down performed poorly: </li></ul><ul><ul><li>Even on 10% of the data it took bottom-up 283.5 seconds to run QS1, and 717.8 seconds for top-down to do it. </li></ul></ul><ul><ul><li>It took less than 15 seconds for any of the join algorithms to complete QS1 on the full data set! </li></ul></ul>
  61. 61. Experiment results (continued) <ul><li>Implemented STJ-D as an application program interfacing to a commercial RDBMS through a set of cursors. </li></ul><ul><li>Also ran the queries using the RDBMS join mechanisms. </li></ul><ul><li>Results table for QS1 (not reproduced). Legend: “Combined” is an index on startPos, endPos; “Small” is up to 10% selectivity, “Medium” is up to 25%. </li></ul>
  62. 62. Experiment results (continued)
  63. 63. Holistic Twig Joins: Optimal XML Pattern Matching (Bruno, Koudas, Srivastava; ACM SIGMOD 2002)
  64. 64. Twig patterns <ul><li>book[title = “XML” AND year = 2000] </li></ul><ul><li>book[title = “XML”]//author[Fn = “jane” AND Ln = “doe”] </li></ul><ul><li>Figure: a sample book document tree (year 2000, title XML, authors jane doe, john moe and john doe, chapters with head, section and title nodes) shown next to the two twig patterns above drawn as trees. </li></ul>
  65. 65. Twig pattern matching <ul><li>Given a twig pattern Q and an XML database D , a match is a mapping from nodes in Q to nodes in D, satisfying: </li></ul><ul><ul><li>Query node predicates are satisfied by their images. </li></ul></ul><ul><ul><li>The structural relationships between the query nodes are satisfied by their images. </li></ul></ul><ul><ul><li>If Q has k nodes, the result may be represented by a relation with k columns. </li></ul></ul>
  66. 66. Twig pattern matching approaches <ul><li>Decompose the twig into a series of binary structural joins , compute each (using STJ-D for example) and join the results. </li></ul><ul><ul><li>Note that one may have intermediate results that are very big. Consider for example: book[title = “XML” ] </li></ul></ul><ul><li>Decompose the twig into a series of rooted path-expressions , compute each one independently and merge-join the results. </li></ul><ul><ul><li>Note that one may have intermediate results that are very big (but only in different branches). Consider for example: book//author[Fn = “jane” AND Ln = “doe”] </li></ul></ul><ul><li>Decompose the twig into a series of rooted path-expressions, compute them simultaneously taking interdependencies into account, and merge-join the results. </li></ul>
  67. 67. PathStack-Desc <ul><li>go to start of all lists; OutputList = NULL; </li></ul><ul><li>while (lists are not empty) { </li></ul><ul><li>e = element with minimum startPos in all lists; </li></ul><ul><li>i = the list e was taken from; advance list i; </li></ul><ul><li>for (int j = 1; j < numLists; j++) { </li></ul><ul><li>while (!stack_j->isEmpty() && e.startPos > stack_j->top.endPos) stack_j->pop(); </li></ul><ul><li>} </li></ul><ul><li>if (e is not from the leaf list) { // remember that for every stack e.startPos > stack->top.startPos </li></ul><ul><li>stack_i->push(e, &stack_(i-1)->top); // if stack i-1 is not empty, of course </li></ul><ul><li>} else { // e is the path query leaf </li></ul><ul><li>let (x_1, x_2, …, x_(numLists-1)) be the linked list whose head is the top of stack numLists-1. </li></ul><ul><li>For each (y_1, y_2, …, y_(numLists-1), e) such that y_(numLists-1) is x_(numLists-1) or below it, and every other y_j is the element that y_(j+1) points to or below it, do: </li></ul><ul><li>append (y_1, y_2, …, y_(numLists-1), e) to OutputList; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>Note: For ease of exposition, we assume that all lists have the same docId. </li></ul>
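A compact Python sketch of the pseudocode above for a linear path query q1//q2//...//qk with k >= 2. It is illustrative only: `Node`, the stack-entry layout (node, index of the parent stack's top at push time) and the decision to skip elements whose parent stack is empty are assumptions, not details confirmed by the slides.

```python
from typing import NamedTuple


class Node(NamedTuple):
    start: int
    end: int
    name: str


def path_stack(lists):
    """lists[i] holds the elements for query node i, sorted by start (k >= 2)."""
    k = len(lists)
    heads = [0] * k
    stacks = [[] for _ in range(k - 1)]  # the leaf query node needs no stack
    out = []

    def expand(level, max_idx):
        # Enumerate partial matches ending in entries 0..max_idx of stacks[level].
        for idx in range(max_idx + 1):
            node, ptr = stacks[level][idx]
            if level == 0:
                yield (node,)
            else:
                for prefix in expand(level - 1, ptr):
                    yield prefix + (node,)

    while heads[-1] < len(lists[-1]):  # no leaf left means no more output
        i = min((q for q in range(k) if heads[q] < len(lists[q])),
                key=lambda q: lists[q][heads[q]].start)
        e = lists[i][heads[i]]
        heads[i] += 1
        for s in stacks:  # drop entries that ended before e starts
            while s and e.start > s[-1][0].end:
                s.pop()
        if i < k - 1:
            if i == 0:
                stacks[0].append((e, -1))
            elif stacks[i - 1]:  # an element without an open parent cannot match
                stacks[i].append((e, len(stacks[i - 1]) - 1))
        elif stacks[-1]:
            out.extend(p + (e,) for p in expand(k - 2, len(stacks[-1]) - 1))
    return out


if __name__ == "__main__":
    a = [Node(1, 14, "a1"), Node(3, 8, "a2")]
    b = [Node(2, 9, "b1"), Node(4, 7, "b2"), Node(10, 13, "b3")]
    c = [Node(5, 6, "c1"), Node(11, 12, "c2")]
    for match in path_stack([a, b, c]):
        print(tuple(n.name for n in match))
```

The usage example encodes a document with element order a1 b1 a2 b2 c1 b3 c2, using positions chosen for illustration. The stored index keeps each b entry from being combined with an a entry that was pushed after it, which is how the stacks encode all matches compactly.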
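  68. 68. PathStack-Desc (example) <ul><li>Figure: a path query over the labels a, b, c; document order of the elements: a1 b1 a2 b2 c1 b3 c2. </li></ul><ul><li>When c1 is read the stacks hold a1, a2 and b1, b2; output: (a2,b2,c1), (a1,b2,c1), (a1,b1,c1). </li></ul><ul><li>When b3 is read, the entries with e.startPos > stack->top.endPos (a2, b1, b2) are popped and b3 is pushed; when c2 is read the output is (a1,b3,c2). </li></ul>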
  69. 69. PathStack-Desc experimental results <ul><li>Implemented the binary join algorithms, as well as PathStack, in C++ using the file system as the storage engine. </li></ul><ul><li>Used a synthetic data set made of 1 million nodes with 6 different labels (A1, A2, … A6) uniformly distributed (no information regarding other parameters). </li></ul>
  70. 70. Final remarks <ul><li>What we did not do (partial list): </li></ul><ul><ul><li>Look at using B+-Trees with the stack algorithms. </li></ul></ul><ul><ul><li>Look at the TwigStack algorithm. </li></ul></ul><ul><ul><li>Look at Kleene-closure evaluation. </li></ul></ul><ul><li>Conclusions: </li></ul><ul><ul><li>There is a lot more work to be done by everybody. </li></ul></ul>
  71. 71. Appendix: TwigStack (in a nutshell) <ul><li>getNext(q) returns a query node such that the head of its list satisfies: </li></ul><ul><ul><li>It has the smallest startPos (L) of all the heads of its descendant and sibling lists. </li></ul></ul><ul><ul><li>It participates in a solution to the sub-query rooted at that query node. </li></ul></ul><ul><ul><li>If it is part of a solution involving its ancestors they were already read. </li></ul></ul>
  72. 72. Appendix (continued) <ul><li>Note that as long as line 09 succeeds we return the node whose head has the smallest startPos (L) of all the heads of lists in the sub-tree of q. When line 09 fails we “float up” a node whose list head has the smallest startPos (L) of all the heads of lists in its descendant or sibling lists. </li></ul><ul><li>Once a node floats up, its father node’s list does not contain any more ancestors of its list head (otherwise line 09 would not fail). Applying the same logic to the father, the grandfather and so on leads us by induction to the conclusion that if this node’s list head is part of a solution involving its ancestors, these ancestors are already out of their lists. </li></ul>
  73. 73. Appendix (continued) <ul><li>Both used ternary trees. </li></ul><ul><ul><li>Left sub-tree in (a) has only A1=A2=A3=A4 paths. </li></ul></ul><ul><ul><li>Middle sub-tree in (a) has only A1=A5=A6=A7 paths. </li></ul></ul><ul><ul><li>Right sub-tree in (a) has solutions. Its size varies (8% to 24% of the tree). </li></ul></ul><ul><ul><li>(b): left has no A2 or A3, middle has no A4 or A5, right has no A6 or A7. </li></ul></ul>
  74. 74. Bibliography <ul><li>Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, “Efficient Structural Joins on Indexed XML Documents”, Proc. of VLDB 2002 </li></ul><ul><li>Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002 </li></ul><ul><li>Nicolas Bruno, Nick Koudas, Divesh Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002 </li></ul><ul><li>Shu-Yao Chien, Vassilis J. Tsotras, Carlo Zaniolo, Donghui Zhang, “Efficient Complex Query Support for Multiversion XML Documents”, Proc. of VLDB 2001 </li></ul><ul><li>Jason McHugh, Jennifer Widom, “Query Optimization for XML”, Proc. of VLDB 1999 </li></ul><ul><li>Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, Guy Lohman, “On Supporting Containment Queries in Relational Database Management Systems”, ACM SIGMOD 2001 </li></ul><ul><li>Quanzhong Li, Bongki Moon, “Indexing and Querying XML Data for Regular Path Expressions”, Proc. of VLDB 2001 </li></ul><ul><li>IBM DB2 web site: http://www-3.ibm.com/software/data/db2/ </li></ul><ul><li>www.w3.org site (on XPath and XQuery) </li></ul>
