Bitmap Indexes for Relational XML Twig Query Processing - Presentation Transcript
Slide 1: Bitmap Indexes for Relational XML Twig Query Processing
Kyong-Ha Lee and Bongki Moon, The University of Arizona
CIKM'09, Hong Kong
Slide 2: XML Data and Queries
Sample document: <a> <a> <b>t1</b> <c> <d>t2</d> <e>t3</e> </c> </a> <a> <b> <e>t4</e> </b> <d> <c>t5</c> </d> </a> . . . </a>
[Figure: the sample document drawn as a tree. Every element carries a document-order position and a three-number label. The root a1 is at position 0 with label (1,32,1); its children a2, a3, a4 are at positions 1, 6, 11 with labels (2,11,2), (12,21,2), (22,31,2). Third-level labels: (3,4,3), (5,10,3), (13,16,3), (17,20,3), (23,28,3), (29,30,3). Fourth-level labels: (6,7,4), (8,9,4), (14,15,4), (18,19,4), (24,25,4), (26,27,4).]
Example twig queries: //A/B/C, //A[//B]//C, //A[./B/C]//E
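The three-number labels in the tree figure, such as (1,32,1), look like (start, end, level) region labels. The following is a minimal sketch of how such labels can decide ancestor and parent relationships, assuming that reading is correct. The names Label, is_ancestor, and is_parent are illustrative, and the element-to-label mapping is inferred from the sample XML rather than stated in the transcript.

```python
from typing import NamedTuple


class Label(NamedTuple):
    start: int  # position where the element opens
    end: int    # position where the element closes
    level: int  # depth in the tree, root = 1


def is_ancestor(a: Label, d: Label) -> bool:
    """True if a's region strictly contains d's region."""
    return a.start < d.start and d.end < a.end


def is_parent(a: Label, d: Label) -> bool:
    """A parent is an ancestor exactly one level above."""
    return is_ancestor(a, d) and d.level == a.level + 1


# Root, first <a> subtree, and the second <a>; mapping inferred (assumption).
labels = {
    "a1": Label(1, 32, 1),
    "a2": Label(2, 11, 2),
    "b1": Label(3, 4, 3),
    "c1": Label(5, 10, 3),
    "d1": Label(6, 7, 4),
    "e1": Label(8, 9, 4),
    "a3": Label(12, 21, 2),
}

print(is_ancestor(labels["a1"], labels["d1"]))  # a1 contains d1
print(is_parent(labels["c1"], labels["d1"]))    # c1 is d1's parent
print(is_parent(labels["a2"], labels["d1"]))    # two levels apart
```

A check like this needs the labeled values, which is why an index that avoids reading them is attractive.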
Slide 3: XML Stored in RDB
[Tables: a NODE table and a PATH table; their contents were not captured in the transcript.]
Slide 4: Twig Join
To answer a twig query, the twig pattern is decomposed into several path patterns, and the path solutions are joined together to compose the final result.
The Holistic Twig Join (HTJ) algorithm is a specialized multi-way sort-merge join. It guarantees I/O optimality for a certain subset of XML queries, and that optimality depends on how the elements are partitioned. It uses stacks and streams in which the elements are kept in sorted order.
[Figure: a twig pattern over query nodes A, B, C, and E, with one stack (SA, SB, SC, SE) and one stream per query node.]
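The decomposition step can be sketched as follows. This is only an illustration of splitting a twig pattern into root-to-leaf path patterns; it is not the HTJ algorithm itself, and QueryNode and root_to_leaf_paths are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class QueryNode:
    tag: str
    axis: str = "/"  # axis leading into this node: "/" or "//"
    children: list[QueryNode] = field(default_factory=list)


def root_to_leaf_paths(node: QueryNode, prefix: str = "") -> list[str]:
    """Return one path pattern per leaf of the twig."""
    path = f"{prefix}{node.axis}{node.tag}"
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


# The twig //A[./B/C]//E from the example queries.
twig = QueryNode("A", "//", [
    QueryNode("B", "/", [QueryNode("C", "/")]),
    QueryNode("E", "//"),
])

paths = root_to_leaf_paths(twig)
print(paths)  # ['//A/B/C', '//A//E']
```

The two path patterns share the branching node A, which is where their solutions would be joined.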
Slide 5: Motivation
There is a discrepancy between XML stored in an RDB and conventional HTJ algorithms. Logically, HTJ reads streams while an RDB exposes tables. Physically, streams are partitioned while tables are record-oriented. Supporting actual data, including large volumes of text, requires references to records.
Two questions follow. How do we feed tuples to an HTJ algorithm? What is the best partitioning scheme for XML stored in an RDB?
The bitmap index is a conventional index in an RDBMS. It is an efficient way to indicate tuples, and it supports logical operations efficiently. Can we use the bitmap index to support HTJ?
Slide 6: HTJ on Different Partitioning Schemes
Tag-based partitioning: simple, and a skipping technique can be used to read only the useful elements. Only one stream is accessed per query node.
Tag+Level partitioning: better I/O optimality, suitable for deep XML. Several streams may be accessed for a single query node.
Path-based partitioning: better I/O optimality, suitable for shallow XML. A path with //-axes may require accessing many streams for a single query node.
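A small sketch of the three partitioning schemes, using the first eleven elements of the sample document. The row layout (position, tag, level, path) and the partition helper are assumptions made for illustration; they are not the talk's NODE table schema.

```python
from collections import defaultdict

# (position, tag, level, path) for the first two <a> subtrees of the sample.
ROWS = [
    (0, "A", 1, "/A"),
    (1, "A", 2, "/A/A"),
    (2, "B", 3, "/A/A/B"),
    (3, "C", 3, "/A/A/C"),
    (4, "D", 4, "/A/A/C/D"),
    (5, "E", 4, "/A/A/C/E"),
    (6, "A", 2, "/A/A"),
    (7, "B", 3, "/A/A/B"),
    (8, "E", 4, "/A/A/B/E"),
    (9, "D", 3, "/A/A/D"),
    (10, "C", 4, "/A/A/D/C"),
]


def partition(rows, key):
    """Group element positions into streams, keeping document order."""
    streams = defaultdict(list)
    for row in sorted(rows):
        streams[key(row)].append(row[0])
    return dict(streams)


by_tag = partition(ROWS, lambda r: r[1])
by_tag_level = partition(ROWS, lambda r: (r[1], r[2]))
by_path = partition(ROWS, lambda r: r[3])

# Streams that a query node C would have to read under each scheme.
c_tag = [k for k in by_tag if k == "C"]
c_tag_level = [k for k in by_tag_level if k[0] == "C"]
c_path = [k for k in by_path if k.endswith("/C")]
print(len(c_tag), len(c_tag_level), len(c_path))
```

Under tag-based partitioning the query node reads one stream, while the finer schemes split the same elements over several smaller streams.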
Slide 7: Bitmap Index
How should the tuples in the NODE table be partitioned? By building a bitmap index on certain column(s) of the table: bitTag on tagName, bitTag+ on (tagName, Level), and bitPath on the pathId column. The choice determines the I/O optimality of holistic twig join algorithms. During the twig join process, useful tuples are accessed via the bitmap index.
[Figure: bit-vectors for tags A, B, E, . . . (1100001000, 0010000100, 0000010000), with each bit referring to a tuple stored in the disk blocks.]
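As a rough sketch of a bitTag-style index, the code below builds one bit-vector per distinct tag name over the first ten tuples of the sample document, using Python integers as bit-vectors. The function names are hypothetical, and a real RDBMS bitmap index would also store and compress the vectors on disk.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i has it."""
    index = {}
    for i, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def to_bits(vec, n):
    """Render the first n bits, row 0 on the left."""
    return "".join("1" if (vec >> i) & 1 else "0" for i in range(n))


def positions(vec):
    """Row numbers of the tuples a bit-vector points to."""
    return [i for i in range(vec.bit_length()) if (vec >> i) & 1]


# tagName column for rows 0..9, in document order.
tag_column = ["A", "A", "B", "C", "D", "E", "A", "B", "E", "D"]
bit_tag = build_bitmap_index(tag_column)

print(to_bits(bit_tag["A"], 10))               # 1100001000
print(to_bits(bit_tag["B"], 10))               # 0010000100
print(positions(bit_tag["A"] | bit_tag["B"]))  # rows tagged A or B
```

Only the rows whose bits are set need to be fetched, and combining vectors is a single bitwise operation.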
Slide 8: Additional Indexes
bitAnc: a bit-vector that represents the terminal elements corresponding to a certain path and all their ancestors.
bitDesc: a bit-vector that represents the terminal elements corresponding to a certain path and all their descendants.
[Figure: the subtree covered by the 3 bit-vectors bitPath, bitAnc, and bitDesc for PathId=2, i.e. /A/A/B. It contains a1 (0), a2 (1), a3 (6), a4 (11), b1 (2), b2 (7), b3 (12), e2 (8), and d3 and c3 (13, 14).]
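The definitions of bitAnc and bitDesc can be tried out on the sample tree. This sketch assumes the tree shape below, which is reconstructed from the figure's positions and labels (the third subtree and the order of d3 and c3 are guesses), and it computes the vectors by brute force rather than in the way the authors build their indexes.

```python
# (tag, parent position) in document order; None marks the root.
NODES = [
    ("A", None), ("A", 0), ("B", 1), ("C", 1), ("D", 3), ("E", 3),
    ("A", 0), ("B", 6), ("E", 7), ("D", 6), ("C", 9),
    ("A", 0), ("B", 11), ("D", 12), ("C", 12), ("E", 11),
]


def path_of(pos):
    tag, parent = NODES[pos]
    return ("" if parent is None else path_of(parent)) + "/" + tag


def bit_path(path):
    """Bits for the terminal elements of the given path."""
    vec = 0
    for pos in range(len(NODES)):
        if path_of(pos) == path:
            vec |= 1 << pos
    return vec


def bit_anc(path):
    """Terminal elements of the path plus all of their ancestors."""
    vec = 0
    for pos in range(len(NODES)):
        if path_of(pos) == path:
            cur = pos
            while cur is not None:
                vec |= 1 << cur
                cur = NODES[cur][1]
    return vec


def bit_desc(path):
    """Terminal elements of the path plus all of their descendants."""
    vec = bit_path(path)
    # Parents precede children in document order, so one pass is enough.
    for pos, (_, parent) in enumerate(NODES):
        if parent is not None and (vec >> parent) & 1:
            vec |= 1 << pos
    return vec


def ones(vec):
    return [i for i in range(vec.bit_length()) if (vec >> i) & 1]


print(ones(bit_path("/A/A/B")))
print(ones(bit_anc("/A/A/B")))
print(ones(bit_desc("/A/A/B")))
```

The union of the three vectors gives positions 0, 1, 2, 6, 7, 8, 11, 12, 13, and 14, which are the elements the slide's figure marks as covered.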
Slide 9: Two Types of Indexes
Basic index: bit-vectors are built on a single column or a group of columns. It requires labeled values, and therefore reading records.
Hybrid index: a combination of two different indexes. descTag: bitDesc & bitTag. bitTwig: bitPath & bitAnc. bitTwig does not require labeled values to compute a twig solution.
Slide 11: Advancing Cursors
Choose the minimum position value among the current 1s as the current element for a query node.
To check whether a 1 exists in the interval between pos(a) and pos(d), look ahead at the next 1.
[Figure: query node q: //A with bit-vectors P0: /A and P1: /A/A, the cursors Currq, Current1, and Next1, and an eov marker.]
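A minimal sketch of the cursor step described above, for a query node q: //A that matches two path bit-vectors. The vectors and the helper names next_one, current_element, and has_one_between are assumptions for illustration; the bit positions follow the sample tree, where /A is at position 0 and /A/A at positions 1, 6, and 11.

```python
def next_one(vec, start):
    """Position of the first 1 at or after start, or None at end of vector."""
    shifted = vec >> start
    if shifted == 0:
        return None
    lowest = shifted & -shifted  # isolate the lowest set bit
    return start + lowest.bit_length() - 1


def current_element(vectors, cursors):
    """Pick the minimum position among the current 1s of all vectors."""
    best = None
    for pid, vec in vectors.items():
        pos = next_one(vec, cursors[pid])
        if pos is not None and (best is None or pos < best[0]):
            best = (pos, pid)
    return best


def has_one_between(vec, lo, hi):
    """Look ahead: is there a 1 strictly between positions lo and hi?"""
    pos = next_one(vec, lo + 1)
    return pos is not None and pos < hi


vectors = {"P0": 1 << 0, "P1": (1 << 1) | (1 << 6) | (1 << 11)}
cursors = {"P0": 0, "P1": 0}

first = current_element(vectors, cursors)   # the root A at position 0
cursors["P0"] = first[0] + 1
second = current_element(vectors, cursors)  # the next A comes from P1
print(first, second)
```

Because every vector is scanned in position order, taking the minimum merges several streams into one sorted stream for the query node.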
Slide 12: Optimizations
Early detection through the absence of a bit-vector.
Condensing query nodes.
For path-based partitioning; reduces |INDEX| and |RECORD|.
Skipping the reading of obsolete records with advance(k).
For tag-based and (tag, level)-based partitioning; reduces |RECORD|.
Moving cursors over compressed bit-vectors with no decompression: a composite cursor moves over a bit-vector compressed by a run-length encoding scheme.
Reduces |INDEX|.
[Figure: for P: //A/B/C with cursors CA = 11 and CB = 4, advance(11) moves the cursor on B's bit-vector. Bit-vectors shown: 10000000000100000 and 00001000010000100.]
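The advance(k) example on this slide can be replayed with plain bit strings. This is only a sketch: the two strings are modelled on the slide's example (an A at position 11, a B at position 4), and the exact semantics of advance(k) in the authors' system are assumed to be "move to the first 1 at or after position k".

```python
def advance(bits: str, k: int) -> int:
    """Position of the first 1 at or after k, or -1 if there is none."""
    return bits.find("1", k)


# Built piecewise so the positions of the 1s are easy to verify.
bits_a = "1" + "0" * 10 + "1" + "0" * 5   # 1s at positions 0 and 11
bits_b = "00001" + "00001" + "00001" + "00"  # 1s at positions 4, 9, 14

c_a = advance(bits_a, 1)        # cursor on A reaches position 11
c_b = advance(bits_b, 0)        # cursor on B starts at position 4
new_c_b = advance(bits_b, c_a)  # advance(11) jumps straight to position 14

# Candidate B positions passed over without fetching their records.
passed_over = bits_b.count("1", c_b, new_c_b)
print(c_a, c_b, new_c_b, passed_over)
```

The jump is decided on the index alone, which is where the reduction in |RECORD| comes from.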
Slide 13: Compressed Bit-vector
(a) An original bit-vector with 8,000 bits: a group of 31 bits, then 256 * 31 zero bits, then another group of 31 bits, then 2 remaining bits.
(b) The bits are grouped in units of 31 bits, and identical groups are merged.
(c) Each group is encoded as 1 word (4 bytes on a 32-bit machine). An uncompressed word holds 31 literal bits, a compressed word holds a run length (256 here), and the remaining word holds the leftover bits.
Cursor C = { C.position, // integer position value (logical address)
             C.word,     // the current word C is located at
             C.bit,      // the position of the bit C is visiting, in C.word
             C.rest }    // the bit position in the remaining word
Slide 14: Moving a Cursor over a Compressed Bit-vector
(a) Get the position of the next 1. The cursor starts at C = {31, 0, 31, 0}. The compressed word lets it skip examining 31 * 256 bits, so it arrives at C = {7998, 2, 31, 0}.
(b) Check the bit value at position 3,000. The cursor starts at C = {31, 0, 31, 0}, and the distance to move is 3000 - 31 = 2,969. Since 31 * 256 = 7,936 > 2,969, the bit we are looking for is within word 1, the compressed word.
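The two cursor moves can be sketched with a simplified word list. This is an assumption-laden illustration, not the authors' encoding: words are Python tuples instead of packed 32-bit words, positions are counted from 0 (the slides count from 1, so position 7998 on the slide is 7997 here), and the first 31-bit group is made up.

```python
GROUP = 31


def compress(bits: str):
    """Encode full 31-bit groups as ('lit', group) or ('fill', bit, n_groups).

    The trailing partial group, if any, is returned separately."""
    full = len(bits) - len(bits) % GROUP
    words = []
    for i in range(0, full, GROUP):
        group = bits[i:i + GROUP]
        if group in ("0" * GROUP, "1" * GROUP):
            if words and words[-1][0] == "fill" and words[-1][1] == group[0]:
                words[-1] = ("fill", group[0], words[-1][2] + 1)
            else:
                words.append(("fill", group[0], 1))
        else:
            words.append(("lit", group))
    return words, bits[full:]


def span(word):
    return GROUP if word[0] == "lit" else GROUP * word[2]


def next_one(words, rest, start):
    """First 1 at or after start, skipping whole fill words at once."""
    base = 0
    for word in words:
        if start < base + span(word):
            if word[0] == "lit":
                i = word[1].find("1", max(start - base, 0))
                if i >= 0:
                    return base + i
            elif word[1] == "1":
                return max(start, base)
        base += span(word)
    i = rest.find("1", max(start - base, 0))
    return base + i if i >= 0 else -1


def bit_at(words, rest, pos):
    """Value of one bit, found by comparing distances with word spans."""
    base = 0
    for word in words:
        if pos < base + span(word):
            return word[1][pos - base] if word[0] == "lit" else word[1]
        base += span(word)
    return rest[pos - base]


first_group = "0001000000001" + "0" * 16 + "11"  # made-up literal group
bits = first_group + "0" * (GROUP * 256) + "0" * 30 + "1" + "00"
words, rest = compress(bits)

print(len(bits), len(words), words[1])
print(next_one(words, rest, 31))  # skips the 256-group run in one step
print(bit_at(words, rest, 2999))  # falls inside the compressed word
```

Both lookups only compare offsets against word spans, so the 7,936 zero bits of the run are never materialized.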
Slide 15: Experiments
Datasets: synthetic (XMark) and real (DBLP, Treebank, Swiss-prot).
Query sets.
Slide 16: Statistics of Dataset and Indexes
The number of distinct paths varies widely.
The number of distinct tag names does not differ much.
Index build time is largely affected by attribute cardinality.
Index size is smaller than the labeled value size in most cases.
Slide 17: Query Execution Time
Slide 18: Input Data Size
Slide 19: Other Features on bitPath
The bit-vectors used for a path pattern with //-axes can be merged, and the result put into the bitmap index for the next time. For example, for a given path //A//B, the bit-vectors of P:/A/A/B and P:/A/B are merged. The merged bit-vector acts like a pre-computed join index, so a path pattern with //-axes can be represented by a single bit-vector.
Logical operations such as OR and NOT are supported simply by bitwise logical operations: &, |, ^.
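The merging idea can be sketched as an OR over every path bit-vector that matches the pattern, with the result cached under the pattern itself. The bitPath contents below are hypothetical, and pattern_to_regex and merged_vector are illustrative helpers that handle only simple tag steps joined by / and //.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate a pattern such as //A//B into a regex over full paths."""
    steps = re.findall(r"(//|/)([^/]+)", pattern)
    body = "".join(
        ("(?:/[^/]+)*/" if axis == "//" else "/") + re.escape(tag)
        for axis, tag in steps
    )
    return re.compile("^" + body + "$")


def merged_vector(bit_path: dict, pattern: str, cache: dict) -> int:
    """OR together the vectors of all matching paths; reuse a cached result."""
    if pattern not in cache:
        regex = pattern_to_regex(pattern)
        vec = 0
        for path, path_vec in bit_path.items():
            if regex.match(path):
                vec |= path_vec
        cache[pattern] = vec
    return cache[pattern]


# Hypothetical bitPath index: one bit-vector per distinct path.
bit_path = {
    "/A":     0b0000001,
    "/A/A":   0b0000010,
    "/A/A/B": 0b0000100,
    "/A/A/C": 0b0001000,
    "/A/B":   0b0100000,
    "/A/B/C": 0b1000000,
}

cache = {}
a_desc_b = merged_vector(bit_path, "//A//B", cache)
print(format(a_desc_b, "07b"))  # bits of /A/A/B and /A/B combined
```

A second lookup of //A//B is served from the cache, which mirrors storing the merged vector in the index for later queries.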
Slide 20: Twig Queries with Logical Operations
//A[./B/C or ./B/D]//E: the twig uses a query node X that stands for (C|D). Paths: P//A, P//A//E, and P//A//B//X ≡ P//A//B//C ∨ P//A//B//D.
//A[./B/not(C)]//E: the twig uses a negated query node ¬C. Paths: P//A, P//A//E, and P//A/B ⓧ (P//A/B ⊙ A//A/B/C).
Slide 21: Conclusions
We investigated the possibilities of bitmap indexes for XML query processing: partitioning XML stored in an RDB in various ways, and moving cursors without decompressing the bit-vectors.
We devised a way to identify element relationships with only a bitmap index, bitTwig.
Our experiments showed that bitTwig was best for queries against shallow XML documents. For deep XML documents, bitTag with advance(k) showed the best performance.
Future work: evaluating our system with more HTJ algorithms and other indexes.