Loading…

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

Like this presentation? Why not share!

All About Bitmap Indexes... And Sorting Them

on

  • 17,242 views

A review of bitmap index from an academic perspective. Several theoretical results are presented. The talk also discuss technical issues regarding sorting the tables prior to indexing, as a way to ...

A review of bitmap index from an academic perspective. Several theoretical results are presented. The talk also discuss technical issues regarding sorting the tables prior to indexing, as a way to improve the indexes.

Much of the talk is based on the following preprint:

Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes.

http://arxiv.org/abs/0901.3751

Statistics

Views

Total Views
17,242
Views on SlideShare
17,177
Embed Views
65

Actions

Likes
16
Downloads
608
Comments
4

6 Embeds 65

http://www.slideshare.net 58
https://twitter.com 3
http://twitter.com 1
http://a0.twimg.com 1
http://www.linkedin.com 1
https://www.linkedin.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

All About Bitmap Indexes... And Sorting Them All About Bitmap Indexes... And Sorting Them Presentation Transcript

  • All About Bitmap Indexes. . . And Sorting Them Daniel Lemire http://www.daniel-lemire.com/ Joint work (presented at BDA’08 and DOLAP’08) with Owen Kaser (UNB) and Kamel Aouiche (post-doc). February 20, 2009 Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Database Indexes Databases use precomputed indexes (auxiliary data structures) to speed processing. An index costs memory, can hurt update speed. Improving indexes is practically important. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • What make indexes fast? We are going to use these three ideas: Expect specific queries? Avoid a full scan! Data is not random? Compress it! A specific computer architecture? taylor your code for it! Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Bitmap indexes Bitmap indexes have a long SELECT * FROM history. (1972 at IBM.) T WHERE x=a Long history with DW & OLAP. AND y=b; (Sybase IQ since mid 1990s). Main competition: B-trees. Above, compute {r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b} Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Bitmaps and fast AND/OR operations Computing the union of two sets of integers between 1 and 64 (eg row ids, trivial table). . . E.g., {1, 5, 8} ∪ {1, 3, 5}? Can be done in one operation by a CPU: BitwiseOR( 10001001, 10101000) Extend to sets from 1..N using N/64 operations. To compute [a0 , . . . , aN−1 ] ∨ [b0 , b1 , . . . , bN−1 ] : a0 , . . . , a63 BitwiseOR b0 , . . . , b63 ; a64 , . . . , a127 BitwiseOR b64 , . . . , b127 ; a128 , . . . , a192 BitwiseOR b128 , . . . , b192 ; ... Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Common applications of the bitmaps The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.) Search engines use bitmaps to filter queries, e.g. Apache Lucene Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Bitmap compression A column with n rows and L distinct column index bitmaps values ⇒ nL bits x=3 x=1 x=2 x E.g., n = 106 , L = 104 → 10 Gbits 1 1 0 0 Uncompressed bitmaps are often 3 1 0 0 impractical n 1 1 0 0 Moreover, bitmaps often contain long 2 1 0 0 streams of zeroes. . . ... ... ... ... Logical operations over these zeroes is a L waste of CPU cycles. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • How to compress bitmaps? Must handle long streams of zeroes efficiently ⇒ Run-length encoding? (RLE) Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, . . . So just encode the run lengths, e.g., 0001111100010111 → 3, 5, 3, 1,1,3 Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Compressing better with delta codes RLE can make things worse. E.g., Use 8-bit counters, then 11 may become 000000101. How many bits to use for the counters? Universal coding like delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes. Has two steps: x = 2N + (x mod 2N ). Write N − 1 as gamma code; write x mod 2N as an N − 1-bit number. E.g. 17 = 24 + 1, 0010001 Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • RLE with delta codes is pretty good In some (weak) sense, RLE compression with delta codes is optimal! Theorem A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses O(n log N) bits. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Is the compression rate what matters? There is endless debate about whether more compression is better: Solid-State Drives (SSD) have 10× the bandwidth? All problems are CPU-bound! Multi-core CPUs? All problems I/O-bound! Store your indexes in RAM? All problems are CPU-bound! ... No definitive answer on whether more compression is better. It depends! Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Byte/Word-aligned RLE RLE variants can focus on runs that align with machine-word boundaries. Trade compression for speed. That is what Oracle is doing. Variants: BBC (byte aligned), WAH Our EWAH extends Wu et al.’s (was known to Wu as WBC) word-aligned hybrid. 0101000000000000 000. . . 000 000. . . 000 0011111111111100 . . . ⇒ dirty word, run of 2 “clean 0” words, dirty word. . . Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Computational and storage bounds n → number of rows, c → number of 1s per row; Model storage cost as #(dirty words) + #(clean words, 0x00) Storage is in O(nc); Bounds do not depend on the number of bitmaps. (Assuming O(n) bitmaps). Construction time is proportional to index size. (Data is written sequentially on disk.) Implementation scales to millions of bitmaps. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • What about other compression types? Why not compress using other techniques (Huffman, LZ77, Arithmetic Coding, . . . )? With RLE-like compression we have B1 ∨ B2 or B1 ∧ B2 in time O(|B1 | + |B2 |). We don’t know how to do this using the other compression techniques! Hence, with RLE, compress saves both storage and CPU cycles!!!! Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • What happens when you have many bitmaps? Consider B1 ∨ B2 ∨ . . . ∨ BN . First compute the first two : B1 ∨ B2 in time O(|B1 | + |B2 |). |B3 ∨ B4 | is in O(|B3 | + |B4 |). Thus (B1 ∨ B2 ) ∨ (B3 ∨ B4 ) takes O(2 |Bi |). . . i N i=1 |Bi | log N) Total is in O( [Lemire et al., 2009]. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Improving compression by sorting the table RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better; But finding the best row ordering is NP-hard [Lemire et al., 2009]. Lexicographic row sorting is fast, even for very large tables. easy: sort is a Unix staple. Substantial index-size reductions (often 2.5 times) Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Improving compression via k-of-N encoding With L bitmaps, you can represent L values by mapping each value to one bitmap; Alternatively, you can represent L 1-of-N value 2-of-N 2 = L(L − 1)/2 values by mapping each 100000 cat 1100 value to a pair of bitmaps; 010000 dog 1010 L More generally, you can represent k values 001000 dish 1001 by mapping each value to a k-tuple of 000100 fish 0110 bitmaps; 000010 cow 0101 At query time, you need to load k bitmaps 100000 cat 1100 in a look-up for one value; 000001 pony 0011 You trade query-time performance for fewer bitmaps; Often, fewer bitmaps translates into a smaller index, created faster. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Encode then sort? Or vice versa? Two different conceptual approaches: Encode attributes in table, obtaining an uncompressed index 1 Sort the index rows Compress each column Sort the table rows 2 Encode attributes in table, build compressed index on-the-fly. paint maker 1 1 0 1 0 1 0 1 1 1 0 1 red ford ⇒ ⇒ 1 0 1 0 1 1 1 0 1 0 1 1 blue honda 0 1 1 1 0 1 1 1 0 1 0 1 green ford ... ... ... ... ... ... paint maker paint maker 1 0 1 0 1 1 red ford blue honda ⇒ ⇒ 1 1 0 1 0 1 blue honda green ford 0 1 1 1 0 1 green ford red ford ... ... ... ... ... ... Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Gray-code order Gray-code (GC) order is an Gray-code Lex. order alternative to lexicographical 011 order (defined only for bit 011 011 arrays); 011 101 110 May improve compression more 101 110 than lex. sort (k > 1); 110 111 [Pinar et al., 2005] process an 110 111 uncompressed bitmap index. 111 111 Slow, if uncompressed index 111 101 does not fit in RAM. 111 101 GC order is not supported by DBMSes or Unix utilities. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Gray-code sorting, cheaply Size improvement is small (usually < 4%), but it’s essentially free: What Pinar et al. do: expensive GC sort after encoding 1 eg: [Tax, Cat, Girl, Cat] → sort([1100, 0110, 1001, 0110]); Instead, sort the table lexicographically—comparing values 2 alphabetically or by frequency (easy); eg: [Tax, Cat, Girl, Cat] → [Cat, Cat, Girl, Tax] Map ordered values to k-tuples of bitmaps ordered as Gray 3 codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100; Lex ascending sequence: Cat, Dog, Girl, Tax. GC ascending sequence: 0011, 0110, 0101, 1100 for codes eg: [Cat, Cat, Girl, Tax] → [0011, 0011, 0101, 1100] (generates a GC-sorted result without expensive GC sorting). Easily extended for > 1 columns. 4 In our tests, this is as good as a Gray-code bitmap index sort [Pinar et al., 2005], but technically much easier. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • What about “other” Gray-codes? Define Gray-code to be a way to list all bitvectors while minimizing Hamming distances [Knuth, 2005, § 7.2.1.1] There are other alternatives [Goddyn and Gvozdjak, 2003, Savage and Winkler, 1995]. Our tests suggest traditional Gray codes are best. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Test data sets Previous studies used data sets where the uncompressed index would fit in RAM. Do their results apply to more realistic data sets? Our tests: Mix of real and synthetic data, up to 877 M rows, 22 GB, 4 M attribute values. using 4–10 columns/dimensions Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • When sorting, column order matters The first column(s) gain Compression on TWEED-4d more from the sort (column 1 is primary sort 1 Gray Random-sort key); 0.8 Its bitmaps (first 11 in example) are compressed 0.6 1-C/N well, compared to a 0.4 “randomsort”. (Red above green) 0.2 Least important column’s 0 11 18 43 49 bitmaps (43–49) don’t rang des bitmaps gain much (red vs green) Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • When sorting, column order matters Netflix: 24 column orderings 5.5e+08 1-of-N encoding Conceptually, we may 4-of-N encoding 5e+08 4.5e+08 wish to reorder columns, 4e+08 index size eg swap columns 1 & 3. 3.5e+08 Column order is crucial 3e+08 2.5e+08 (to successful sorting). 2e+08 Finding the best ordering 1.5e+08 1e+08 quickly remains open. 432 4311 4232 4211 4133 4122 3423 3411 3242 3211 3144 3122 2434 2411 2343 2311 2144 2133 1434 1422 1343 1322 1244 1233 4 column permutation Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Progress toward choosing column order Paper models “gain” of putting a given column first. Idea: order columns greedily (by max gain). Experimentally, this approach is not promising: the best orderings don’t seem to depend on gain. Factors: skews of columns number of distinct values k density of column’s bitmaps Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • What usually works for dimension ordering?: k=1 For 1-of-N bitmaps, a density-based approach was okay: Ordering rule, k = 1 : “sparse but not too sparse” Order columns by decreasing 1 1 − 1/ni min , , where ni 4w − 1 50 100 150 200 250 300 distinct values in column ni → the number of distinct values in column i, w → the word size. See 30–40% size reduction, merely knowing dimension sizes (ni ). Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • What usually works for dimension ordering?: k > 1 √ Density formula (ni → k ni ) recommends poorly when k > 1. Our experiments on synthetic data give some guidance: When k > 1, order columns by descending skew 1 descending size 2 (And do the reverse when k = 1.) Open issues, k > 1 How do we balance skew & size factors? 1 What other properties of the histograms are needed? 2 Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Bitmap-by-bitmap reordering One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006]. Our best implementation of this is ≈ 100 times slower, cannot handle larger data sets. We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%. Canahaute suggests ordering does not matter much, but we see factor-of-2 differences (??) Seems sufficient (and much faster) to work with groups of bitmaps (reorder attributes, not bitmaps) Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Index size versus block-wise sorting Netflix Instead of fully sorting the 900 k=1 table, we sorted it k=2 800 k=3 block-wise; k=4 700 taille de l’index (Mo) Fewer blocks means a 600 500 more complete sort; 400 Larger k means smaller 300 index (in this case); 200 Index size diminishes 100 0 100 200 300 400 500 600 700 drastically with sorting. # de blocs Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words; Only 32-bit and 64-bit are efficient; 64-bit indexes are nearly twice as large; 64-bit indexes are between 5%-40% faster (despite higher I/O costs). Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Open Source Software? Lemur Bitmap Index C++ Library: http://code.google.com/p/lemurbitmapindex/. JavaEWAH: A compressed alternative to the Java BitSet class http://code.google.com/p/javaewah/. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Future direction? Need better mathematical modelling of bitmap compressed size in sorted tables; Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Questions? ? Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006). Improving bitmap index compression by data reorganization. http: //hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf (checked 2008-12-15). Goddyn, L. and Gvozdjak, P. (2003). Binary gray codes with long bit runs. Electronic Journal of Combinatorics, 10(R27):1–10. Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley. Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes. available from http://arxiv.org/abs/0901.3751. Pinar, A., Tao, T., and Ferhatosmanoglu, H. (2005). Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
  • Compressing bitmap indices by data reorganization. In ICDE’05, pages 310–321. Savage, C. and Winkler, P. (1995). Monotone gray codes and the middle levels problem. Journal of Combinatorial Theory, A, 70(2):230–248. Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1–38. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them