Bitmap Indexes for Relational XML Twig Query Processing

The slides I presented at CIKM'09

Transcript

  • 1. Kyong-Ha Lee and Bongki Moon
    The University of Arizona
    Bitmap Indexes For Relational XML Twig Query Processing
  • 2. CIKM'09, Hong Kong
    XML Data and Queries
    Example XML document:
    <a>
      <a>
        <b>t1</b>
        <c>
          <d>t2</d>
          <e>t3</e>
        </c>
      </a>
      <a>
        <b>
          <e>t4</e>
        </b>
        <d>
          <c>t5</c>
        </d>
      </a>
      . . . . .
    </a>
    [Figure: the document drawn as a tree of elements a1 to a4, b1 to b3, c1 to c3, d1 to d3, and e1 to e3. Each node carries a position number (a1 = 0, a2 = 1, a3 = 6, a4 = 11, ...) and a triple label: (1, 32,1) for a1; (2,11,2), (12,21,2), (22,31,2) for a2, a3, a4; (3,4,3), (5,10,3), (13,16,3), (17,20,3), (23,28,3), (29,30,3) at level 3; (6,7,4), (8,9,4), (14,15,4), (18,19,4), (24,25,4), (26,27,4) at level 4.]
    Example twig queries: //A/B/C, //A[//B]//C, //A[./B/C]//E
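The triple labels in the figure look like region-encoded values. Assuming each triple is (start, end, level), which the slide does not state explicitly, structural relationships can be checked by interval containment. The `Label` type and both helper functions are illustrative names, and the labels assigned to b1 and e1 are inferred from the XML fragment.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """Assumed layout of a label triple: (start, end, level)."""
    start: int
    end: int
    level: int


def is_ancestor(a: Label, d: Label) -> bool:
    # d lies strictly inside a's interval.
    return a.start < d.start and d.end < a.end


def is_parent(a: Label, d: Label) -> bool:
    # A parent is an ancestor exactly one level above.
    return is_ancestor(a, d) and d.level == a.level + 1


a1 = Label(1, 32, 1)
a2 = Label(2, 11, 2)
b1 = Label(3, 4, 3)   # inferred: first child of a2
e1 = Label(8, 9, 4)   # inferred: last grandchild of a2

print(is_ancestor(a1, e1), is_parent(a2, b1), is_parent(a1, b1))
# prints: True True False
```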
  • 3. XML Stored in RDB
    [Figure: a NODE table and a PATH table.]
  • 4. Twig Join
    To answer a twig query:
    A twig pattern is decomposed into several path patterns.
    Path solutions are joined together to compose a final result.
    Holistic Twig Join (HTJ) algorithm:
    A specialized multi-way sort-merge join.
    Guarantees I/O optimality for a certain subset of XML queries.
    The optimality depends on how the elements are partitioned.
    Uses stacks and streams in which elements are kept in sorted order.
    [Figure: a twig pattern over query nodes A, B, C, and E, with stacks and streams (labels SA, SB, SC, SE).]
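The decomposition step can be sketched as enumerating the root-to-leaf paths of the twig. This is only an illustration: the nested-tuple representation `(axis, tag, children)` and the function name `root_to_leaf_paths` are assumptions, not part of the slides.

```python
def root_to_leaf_paths(node, prefix=""):
    """Return every root-to-leaf path pattern of a twig as a string."""
    axis, tag, children = node
    path = prefix + axis + tag
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


# The twig //A[./B/C]//E from slide 2.
twig = ("//", "A", [
    ("/", "B", [("/", "C", [])]),
    ("//", "E", []),
])

print(root_to_leaf_paths(twig))
# prints: ['//A/B/C', '//A//E']
```

Solutions to the individual paths would then be joined on the shared branching node, here A.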
  • 5. Motivation
    Discrepancy between XML in RDB and conventional HTJ algorithms
    Logical: streams vs. tables
    Physical: partitioned vs. record-oriented
    Supporting actual data, including a large volume of text, requires references to records.
    How do we feed tuples to an HTJ algorithm?
    What is the best partitioning scheme for XML stored in an RDB?
    Bitmap index, a conventional index in RDBMSs
    An efficient way to indicate tuples.
    Efficient support for logical operations.
    Can we use the bitmap index to support HTJ?
  • 6. HTJ on Different Partitioning Schemes
    Tag-based partitioning
    Simple; a skipping technique can be used to read only useful elements.
    For a query node, only one stream is accessed.
    Tag+Level partitioning
    Better I/O optimality, suitable for deep XML.
    Several streams may be accessed for a single query node.
    Path-based partitioning
    Better I/O optimality, suitable for shallow XML.
    A path with //-axes may require accessing many streams for a single query node.
  • 7. Bitmap Index
    How to partition tuples in the NODE table:
    By building a bitmap index on certain column(s) in the table.
    bitTag for the tagName column,
    bitTag+ for (tagName, Level),
    bitPath for the pathId column.
    The choice determines the I/O optimality of holistic twig join algorithms.
    During the twig join process, useful tuples are accessed via the bitmap index.
    [Figure: bit-vectors for tags A (11000010000), B (0010000100), E (0000010000), ... pointing to tuples in disk blocks.]
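A minimal sketch of the three partitioning indexes, assuming a toy NODE table that holds only the visible part of the example document, one row per element in document order. The row layout `(tag_name, level, path_id)`, the function `build_bitmap_index`, and path ids above 2 are hypothetical; path ids 0 to 2 follow the slides (P0: /A, P1: /A/A, P2: /A/A/B).

```python
from collections import defaultdict

# Hypothetical NODE rows: (tag_name, level, path_id), in document order.
NODE = [
    ("a", 1, 0),  # a1  /a
    ("a", 2, 1),  # a2  /a/a
    ("b", 3, 2),  # b1  /a/a/b
    ("c", 3, 3),  # c1  /a/a/c
    ("d", 4, 4),  # d1  /a/a/c/d
    ("e", 4, 5),  # e1  /a/a/c/e
    ("a", 2, 1),  # a3  /a/a
    ("b", 3, 2),  # b2  /a/a/b
    ("e", 4, 6),  # e2  /a/a/b/e
    ("d", 3, 7),  # d2  /a/a/d
    ("c", 4, 8),  # c2  /a/a/d/c
]


def build_bitmap_index(rows, key):
    """Map each distinct key value to a bit string with one bit per row."""
    index = defaultdict(lambda: [0] * len(rows))
    for pos, row in enumerate(rows):
        index[key(row)][pos] = 1
    return {k: "".join(map(str, bits)) for k, bits in index.items()}


bit_tag = build_bitmap_index(NODE, key=lambda r: r[0])               # bitTag
bit_tag_plus = build_bitmap_index(NODE, key=lambda r: (r[0], r[1]))  # bitTag+
bit_path = build_bitmap_index(NODE, key=lambda r: r[2])              # bitPath

print(bit_tag["a"])             # 11000010000
print(bit_tag_plus[("a", 2)])   # 01000010000
print(bit_path[2])              # 00100001000
```

A real bitmap index would store packed, usually compressed, words rather than character strings; strings are used here only to keep the bits readable.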
  • 8. Additional Indexes
    bitAnc: a bit-vector that represents the terminal elements corresponding to a certain path and all their ancestors.
    bitDesc: a bit-vector that represents the terminal elements corresponding to a certain path and all their descendants.
    [Figure: the example tree (a1 = 0, a2 = 1, a3 = 6, a4 = 11, b1 = 2, b2 = 7, b3 = 12, ...), highlighting the subtree covered by the three bit-vectors bitPath, bitAnc, and bitDesc for PathId = 2, i.e. /A/A/B.]
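The two definitions can be made concrete with a small sketch. It assumes parent pointers are available while the index is built; the `PARENT` and `PATH_ID` arrays describe the first eleven elements of the example document, and path ids other than 0, 1, and 2 are made up for illustration.

```python
# Position of each element's parent (None for the root), in document order:
#          a1   a2 b1 c1 d1 e1 a3 b2 e2 d2 c2
PARENT = [None, 0, 1, 1, 3, 3, 0, 6, 7, 6, 9]
PATH_ID = [0, 1, 2, 3, 4, 5, 1, 2, 6, 7, 8]


def bit_path(path_id):
    """Terminal elements of the path only."""
    return [int(p == path_id) for p in PATH_ID]


def bit_anc(path_id):
    """Terminal elements of the path plus all their ancestors."""
    bits = [0] * len(PARENT)
    for pos, pid in enumerate(PATH_ID):
        if pid == path_id:
            node = pos
            while node is not None:  # walk up to the root
                bits[node] = 1
                node = PARENT[node]
    return bits


def bit_desc(path_id):
    """Terminal elements of the path plus all their descendants."""
    bits = bit_path(path_id)
    # Parents precede children in document order, so one pass suffices.
    for pos, parent in enumerate(PARENT):
        if parent is not None and bits[parent]:
            bits[pos] = 1
    return bits


print(bit_path(2))  # [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
print(bit_anc(2))   # [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
print(bit_desc(2))  # [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0]
```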
  • 9. Two Types of Indexes
    Basic index
    Bit-vectors are built on a single column or a group of columns.
    Requires labeled values and reading records.
    Hybrid index
    A combination of two different indexes.
    descTag: bitDesc & bitTag
    bitTwig: bitPath & bitAnc
    Does not require labeled values to compute twig solutions.
  • 10. Identifying Element Relationship with Bit-vectors
    For a query //A//B, can the pairs (a1, b1) and (a2, b2) be solutions?
    [Figure: the example tree (a1 = 0, a2 = 1, a3 = 6, a4 = 11, b1 = 2, b2 = 7, b3 = 12) with paths P0: /A, P1: /A/A, and P2: /A/A/B, shown next to bit-vectors over positions 0 to 15: 1110001100011000, 1000000000000000, and 1100001000010000.]
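One way to answer the question with bit-vectors alone, offered as a reading of the slides rather than the authors' exact procedure: elements on the same path sit at the same level, so their subtrees are disjoint. If a's path is a proper prefix of d's path, then a is an ancestor of d exactly when no other element of a's path occurs between them in document order. The function name and the string representation are assumptions.

```python
# bitPath vectors for the example tree, positions 0 to 15.
P0 = "1000000000000000"  # /A      (a1)
P1 = "0100001000010000"  # /A/A    (a2, a3, a4)
P2 = "0010000100001000"  # /A/A/B  (b1, b2, b3)


def is_ancestor(bit_path_a: str, pos_a: int, pos_d: int) -> bool:
    """Assumes a's path is a proper prefix of d's path.

    a is an ancestor of d iff no other element of a's path starts
    strictly between pos_a and pos_d.
    """
    return pos_a < pos_d and "1" not in bit_path_a[pos_a + 1:pos_d]


print(is_ancestor(P0, 0, 2))  # (a1, b1) -> True
print(is_ancestor(P1, 1, 7))  # (a2, b2) -> False, a3 at position 6 intervenes
print(is_ancestor(P1, 6, 7))  # (a3, b2) -> True
```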
  • 11. Advancing Cursors
    Choose the minimum position value among the current 1's as the current element for a query node.
    Check whether a 1 exists in the interval between pos(a) and pos(d).
    Look ahead at the next 1.
    [Figure: query node q: //A with bit-vectors for P0: /A and P1: /A/A; cursors Current1 and Next1, the current element Currq, and the end-of-vector marker eov.]
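A sketch of the first rule: when a query node such as //A maps to several bit-vectors (here P0 and P1), each vector keeps its own cursor and the smallest current position becomes the current element. The names `EOV`, `next_one`, and `scan_query_node` are illustrative, and bit-vectors are plain strings for readability.

```python
EOV = None  # end-of-vector marker (assumed representation)


def next_one(bits: str, start: int):
    """Position of the first 1 at or after start, or EOV."""
    pos = bits.find("1", start)
    return EOV if pos == -1 else pos


def scan_query_node(bit_vectors):
    """Yield element positions for one query node in document order."""
    cursors = [next_one(bits, 0) for bits in bit_vectors]
    while any(c is not EOV for c in cursors):
        # Pick the vector whose current 1 has the minimum position.
        _, i = min((c, idx) for idx, c in enumerate(cursors) if c is not EOV)
        yield cursors[i]
        cursors[i] = next_one(bit_vectors[i], cursors[i] + 1)


P0 = "1000000000000000"  # /A
P1 = "0100001000010000"  # /A/A

print(list(scan_query_node([P0, P1])))
# prints: [0, 1, 6, 11]
```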
  • 12. Optimizations
    Early detection with a bit-vector absence
    Condensing query nodes
    For path-based partitioning
    Reduces |INDEX| and |RECORD|
    Skipping obsolete records with advance(k)
    For tag-based and (tag, level)-based partitioning
    Reduces |RECORD|
    Moving cursors over compressed bit-vectors with no decompression
    A composite cursor moves over a bit-vector compressed by a run-length encoding scheme
    Reduces |INDEX|
    [Figure: twig patterns over A, B, C, and E and the path P: //A/B/C. Cursor CA = 11 is on bit-vector 10000000000100000, cursor CB = 4 is on bit-vector 00001000010000100, and advance(11) is applied.]
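The figure's numbers suggest the following behavior for advance(k), sketched here under the assumption that it moves a cursor to the first 1 at or after position k. With the A cursor at 11, any B element before position 11 cannot be a descendant of that A, so its record need not be read.

```python
def advance(bits: str, k: int):
    """Return the position of the first 1 at or after k, or None."""
    pos = bits.find("1", k)
    return None if pos == -1 else pos


bit_b = "00001000010000100"  # B elements at positions 4, 9, and 14
c_a = 11                     # current position of the A cursor
c_b = 4                      # current position of the B cursor

c_b = advance(bit_b, c_a)    # skips the B elements at 4 and 9
print(c_b)
# prints: 14
```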
  • 13. Compressed Bit-vector
    (a) An original bit-vector with 8,000 bits:
        000100000000100000000000000011 00000000000 . . . 00000000000000 0000000000000000000000000000001 00
    (b) Grouping into units of 31 bits and merging identical groups: 31 bits | 256 * 31 bits | 31 bits | 2 bits
    (c) Encoding each group as one word (4 bytes on a 32-bit machine):
        000010…010…011    uncompressed word, 31 literal bits
        100… 0100000000   compressed word, run-length is 256
        000…001           uncompressed word, 31 literal bits
        000…000           remaining word
    Cursor C = { C.position, // integer position value (logical address)
                 C.word,     // the current word C is located at
                 C.bit,      // the position of the bit C is visiting, in C.word
                 C.rest }    // the bit position in the remaining word
  • 14. Moving A Cursor over A Compressed Bit-vector
    Words: 000010…010…011 | 100… 0100000000 (run-length is 256) | 000…001 | 000…000 (remaining word)
    a) Get the position of the next 1
    C = {31, 0, 31, 0}
    Skip examining 31 * 256 bits.
    C = {7998, 2, 31, 0}
    b) Check the bit value at position 3,000
    C = {31, 0, 31, 0}
    The distance to move is 2,969 = (3000 - 31).
    Since 31 * 256 > 2,969, the bit we are looking for is within word 1.
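The idea of moving over a compressed vector without decompressing it can be sketched as follows. This is a simplification: words are Python tuples instead of packed 32-bit words, the cursor is reduced to a single integer, and positions are 0-indexed, so the slide's position 7998 corresponds to 7997 here. The names `compress`, `next_one`, and `bit_at` are assumptions.

```python
GROUP = 31  # payload bits per word, as on the slide


def compress(bits: str):
    """Encode full 31-bit groups as literal or fill words; return (words, tail)."""
    words = []
    full = len(bits) - len(bits) % GROUP
    for i in range(0, full, GROUP):
        group = bits[i:i + GROUP]
        if group in ("0" * GROUP, "1" * GROUP):
            if words and words[-1][0] == "fill" and words[-1][1] == group[0]:
                words[-1] = ("fill", group[0], words[-1][2] + 1)  # extend run
            else:
                words.append(("fill", group[0], 1))
        else:
            words.append(("lit", group))
    return words, bits[full:]  # tail plays the role of the remaining word


def span(word) -> int:
    """Number of original bits a word stands for."""
    return GROUP * word[2] if word[0] == "fill" else GROUP


def next_one(words, tail, start):
    """Smallest position >= start that holds a 1, or None."""
    base = 0
    for word in words:
        if start < base + span(word):
            if word[0] == "lit":
                offset = word[1].find("1", max(start - base, 0))
                if offset != -1:
                    return base + offset
            elif word[1] == "1":
                return max(start, base)
            # A zero fill is skipped by arithmetic alone.
        base += span(word)
    offset = tail.find("1", max(start - base, 0))
    return None if offset == -1 else base + offset


def bit_at(words, tail, pos):
    """Bit value at pos, found by stepping over whole words."""
    base = 0
    for word in words:
        if pos < base + span(word):
            return word[1] if word[0] == "fill" else word[1][pos - base]
        base += span(word)
    return tail[pos - base]


# An 8,000-bit vector shaped like the slide's example.
literal = "0" * 30 + "1"
vector = literal + "0" * (GROUP * 256) + literal + "00"
words, tail = compress(vector)

print(len(words), words[1])       # 3 ('fill', '0', 256)
print(next_one(words, tail, 31))  # 7997
print(bit_at(words, tail, 2999))  # 0
```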
  • 15. Experiments
    Datasets
    Synthetic: XMark
    Real: DBLP, Treebank, Swiss-Prot
    Query sets
  • 16. Statistics of Dataset and Indexes
    • The number of distinct paths varies widely.
    • The number of distinct tag names does not differ much.
    • Index build time is largely affected by attribute cardinality.
    • Index size is smaller than the labeled value size in most cases.
  • 19. Query Execution Time
  • 20. Input Data Size
  • 21. Other Features on bitPath
    Merging the bit-vectors used for a path pattern with //-axes and putting the result into the bitmap index for next time
    For a given path //A//B: P:/A/A/B, P:/A/B
    Acts like a pre-computed join index.
    A path pattern with //-axes can be represented by a single bit-vector.
    Logical operations: OR, NOT
    Simply supported by bitwise logical operations: &, |, ^
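A sketch of the merge-and-cache idea: the bit-vectors of all concrete paths matching a pattern with //-axes are ORed once and the result is stored under the pattern. The toy vectors, the `matches` helper, and the cache dictionary are assumptions for illustration; bit-vectors are Python integers whose leftmost printed bit is position 0.

```python
import re


def matches(pattern: str, path: str) -> bool:
    """True if a concrete root-to-node path matches a /, // path pattern."""
    regex = ""
    for axis, tag in re.findall(r"(//|/)([^/]+)", pattern):
        # '//' allows any number of intermediate steps before the tag.
        regex += ("(?:/[^/]+)*/" if axis == "//" else "/") + re.escape(tag)
    return re.fullmatch(regex, path) is not None


def bitvector_for(pattern, bit_path, cache):
    """OR together the vectors of all matching paths, computing them once."""
    if pattern not in cache:
        merged = 0
        for path, bits in bit_path.items():
            if matches(pattern, path):
                merged |= bits
        cache[pattern] = merged
    return cache[pattern]


# Hypothetical 8-bit bitPath vectors.
bit_path = {
    "/A": 0b10000000,
    "/A/A": 0b01000100,
    "/A/A/B": 0b00100010,
    "/A/B": 0b00000001,
}
cache = {}

print(format(bitvector_for("//A//B", bit_path, cache), "08b"))
# prints: 00100011
```

A second lookup of //A//B returns the stored vector without touching the individual paths, which is what makes it behave like a pre-computed join index.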
  • 22. Twig Queries with Logical Operations
    //A[./B/C or ./B/D]//E
    P//A,
    P//A//B//X ≡ P//A//B//C ∨ P//A//B//D,
    P//A//E
    [Figure: twig pattern over A, B, and E with X = (C|D).]
    //A[./B/not(C)]//E
    P//A,
    P//A//E,
    P//A/B ⓧ (P//A/B ⊙ A//A/B/C)
    [Figure: twig pattern over A, B, and E with ¬C.]
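The two rewritings can be illustrated with bitwise operators. This sketch assumes that the ⓧ and ⊙ symbols stand for XOR and AND, and that the A prefix in A//A/B/C denotes the bitAnc vector of that path; the slide does not spell this out. All vectors are hypothetical 8-bit integers whose leftmost printed bit is position 0.

```python
WIDTH = 8  # toy vector length


def show(bits: int) -> str:
    """Render a vector with position 0 on the left."""
    return format(bits, f"0{WIDTH}b")


p_abc = 0b00010000  # bitPath for //A/B/C: one C at position 3
p_abd = 0b00000010  # bitPath for //A/B/D: one D at position 6
p_ab = 0b00100100   # bitPath for //A/B: B elements at positions 2 and 5
a_abc = 0b11110000  # bitAnc for //A/B/C: the C at 3 and its ancestors 0, 1, 2

# (C or D): an element qualifies if it is on either path.
c_or_d = p_abc | p_abd

# not(C): B elements that are not an ancestor of any C.
b_without_c = p_ab ^ (p_ab & a_abc)

print(show(c_or_d))       # 00010010
print(show(b_without_c))  # 00000100
```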
  • 23. Conclusions
    We investigated the possibilities of bitmap indexes for XML query processing.
    Partitioning XML stored in an RDB in various ways.
    Cursor movements do not require decompression of bit-vectors.
    We devised a way to identify element relationships with only a bitmap index, bitTwig.
    Our experiments showed that bitTwig was best for queries against shallow XML documents.
    For deep XML documents, bitTag with advance(k) showed the best performance.
    Future work: evaluating our system with more HTJ algorithms and other indexes.
  • 24. Thanks! Questions?