
License: CC Attribution License


- 1. Compressing column-oriented indexes Daniel Lemire http://www.professeurs.uqam.ca/pages/lemire.daniel.htm blog: http://www.daniel-lemire.com/ Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc). November 19, 2009 Daniel Lemire Compressing column-oriented indexes
- 2. Row Stores: each record is stored contiguously, e.g. (name, date, age, sex, salary), (name, date, age, sex, salary), ... Dominant paradigm. Transactional: quick append and delete.
- 3. Column Stores: each column (name | date | age | sex | salary) is stored separately. Goes back to StatCan in the seventies [Turner et al., 1979]. Made fashionable again in data warehousing by Stonebraker [Stonebraker et al., 2005]. New: Oracle Exadata hybrid columnar compression. Favors run-length encoding (compression).
- 4. Main column-oriented indexes: (1) Bitmap indexes [O’Neil, 1989]; (2) Projection indexes [O’Neil and Quass, 1997]. Both are compressible.
- 5. Bitmap indexes: Bitmap indexes have a long history (1972 at IBM). Long history with DW & OLAP (Sybase IQ since mid 1990s). Main competition: B-trees. Example query: SELECT * FROM T WHERE x=a AND y=b; To answer it, compute {r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}.
- 6. Bitmaps and fast AND/OR operations: Computing the union of two sets of integers between 1 and 64 (e.g., row ids in a trivial table). E.g., {1, 5, 8} ∪ {1, 3, 5}? Can be done in one operation by a CPU: BitwiseOR(10001001, 10101000). Extend to sets from 1..N using N/64 operations. To compute [a_0, ..., a_{N−1}] ∨ [b_0, b_1, ..., b_{N−1}]: a_0, ..., a_63 BitwiseOR b_0, ..., b_63; a_64, ..., a_127 BitwiseOR b_64, ..., b_127; a_128, ..., a_191 BitwiseOR b_128, ..., b_191; ... It is a form of vectorization.
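
The word-by-word union described in slide 6 can be sketched in Python as follows. This is only an illustration: the helper names (`to_words`, `union`, `to_ids`) are hypothetical, ids are 0-based here rather than 1-based as on the slide, and Python integers are masked to 64 bits to mimic machine words.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1  # keeps each Python int within one 64-bit word


def to_words(ids, universe_size):
    """Pack a set of integers from 0..universe_size-1 into a list of 64-bit words."""
    n_words = (universe_size + WORD_BITS - 1) // WORD_BITS
    words = [0] * n_words
    for i in ids:
        words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words


def union(words_a, words_b):
    """Union of two bitmaps of equal length: one bitwise OR per pair of words."""
    return [(a | b) & WORD_MASK for a, b in zip(words_a, words_b)]


def to_ids(words):
    """Unpack a list of words back into a sorted list of integers."""
    return [
        w * WORD_BITS + b
        for w, word in enumerate(words)
        for b in range(WORD_BITS)
        if (word >> b) & 1
    ]


a = to_words({1, 5, 8, 130}, 200)
b = to_words({1, 3, 5, 70}, 200)
print(to_ids(union(a, b)))
```

With a universe of 200 ids, each bitmap occupies four words, so the union costs four OR operations regardless of how many ids are set.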
- 7. Common applications of the bitmaps: The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.) Search engines use bitmaps to filter queries, e.g. Apache Lucene.
- 8. Bitmap compression: A column with n rows and L distinct values ⇒ index bitmaps take nL bits. E.g., n = 10^6, L = 10^4 → 10 Gbits. Uncompressed bitmaps are often impractical. Moreover, bitmaps often contain long streams of zeroes, and logical operations over these zeroes are a waste of CPU cycles. Example with n rows and L bitmaps (value of column x, then the bits of index bitmaps x=1, x=2, x=3): x=1 → 1 0 0; x=3 → 0 0 1; x=1 → 1 0 0; x=2 → 0 1 0; ...
- 9. How to compress bitmaps? Must handle long streams of zeroes efficiently ⇒ Run-length encoding (RLE)? Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, ... So just encode the run lengths, e.g., 0001111100010111 → 3, 5, 3, 1, 1, 3.
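
A minimal Python sketch of the run-length idea in slide 9, assuming bitmaps are given as strings of '0' and '1'. The function names are hypothetical, and so is the convention that the first run always counts 0s (so a bitmap starting with 1 gets an empty leading run).

```python
from itertools import groupby


def run_lengths(bits):
    """Lengths of the alternating runs in a bit string, starting with a run of 0s."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    if bits.startswith("1"):
        runs.insert(0, 0)  # assumed convention: empty leading run of 0s
    return runs


def expand(runs):
    """Inverse of run_lengths: rebuild the bit string from the run lengths."""
    pieces = []
    bit = 0
    for length in runs:
        pieces.append(str(bit) * length)
        bit ^= 1  # runs alternate between 0s and 1s
    return "".join(pieces)


print(run_lengths("0001111100010111"))  # the example bitmap from the slide
```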
- 10. Compressing better with delta codes: RLE can make things worse. E.g., use 8-bit counters, then 11 may become 000000101. How many bits to use for the counters? Universal codes such as delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes. Write x = 2^N + (x mod 2^N), then encode in two steps: write N − 1 as a gamma code; write x mod 2^N as an N-bit number. E.g., 17 = 2^4 + 1 gives 001 followed by 0001, i.e. 0010001.
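
For reference, here is a sketch of the textbook Elias gamma and Elias delta codes in Python. Note the assumption: these are the standard definitions, which are not bit-for-bit the simplified codes shown on slide 10 (the slide writes 17 as 0010001, whereas standard Elias delta gives a 9-bit code). The logarithmic growth of the code length is the same.

```python
def elias_gamma(x):
    """Standard Elias gamma code: (bit length - 1) zeros, then x in binary."""
    if x < 1:
        raise ValueError("Elias codes are defined for positive integers")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_delta(x):
    """Standard Elias delta code: gamma code of the bit length, then x without its leading 1."""
    if x < 1:
        raise ValueError("Elias codes are defined for positive integers")
    binary = bin(x)[2:]
    return elias_gamma(len(binary)) + binary[1:]


for value in (1, 2, 17, 1 << 20):
    print(value, elias_delta(value), len(elias_delta(value)))
```

A run of about a million bits costs 29 bits under this code, while a run of length 1 costs a single bit, which is why such codes suit run lengths of unknown magnitude.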
- 11. RLE with delta codes is pretty good: In some (weak) sense, RLE compression with delta codes is optimal! Theorem: A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses O(n log N) bits.
- 12. Byte/Word-aligned RLE: RLE variants can focus on runs that align with machine-word boundaries. Trade compression for speed. That is what Oracle is doing. Variants: BBC (byte aligned), WAH. Our EWAH extends Wu et al.’s (was known to Wu as WBC) word-aligned hybrid. Example: 0101000000000000 000...000 000...000 0011111111111100 ... ⇒ dirty word, run of 2 “clean 0” words, dirty word, ...
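
The dirty-word / clean-run split in slide 12 can be illustrated with a short Python sketch. This is a simplified classification over 16-bit words, not the actual on-disk layout of BBC, WAH or EWAH; the function name and the tuple representation are assumptions made for the example.

```python
def word_aligned_runs(words, word_bits=16):
    """Group words into runs of clean words (all 0s or all 1s) and verbatim dirty words."""
    all_ones = (1 << word_bits) - 1
    out = []
    for w in words:
        if w == 0:
            kind = "clean0"
        elif w == all_ones:
            kind = "clean1"
        else:
            kind = "dirty"
        if kind == "dirty":
            out.append(("dirty", w))  # dirty words are stored as-is
        elif out and out[-1][0] == kind:
            out[-1] = (kind, out[-1][1] + 1)  # extend the current clean run
        else:
            out.append((kind, 1))  # start a new clean run
    return out


words = [0b0101000000000000, 0, 0, 0b0011111111111100]
print(word_aligned_runs(words))
```

Because runs are counted in whole words, logical operations can skip a clean run in one step instead of touching each bit.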
- 13. What are bitmap indexes for? Construction time is proportional to index size. (Data is written sequentially on disk.) Implementation scales to millions of bitmaps. Myth: bitmap indexes are for low cardinality columns. “The Bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality” [Zaker et al., 2008].
- 14. What about other compression types? With RLE-like compression we have B1 ∨ B2 or B1 ∧ B2 in time O(|B1| + |B2|). Hence, with RLE, compression saves both storage and CPU cycles! Not always true with other techniques such as Huffman, LZ77, Arithmetic Coding, ...
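
To see why the O(|B1| + |B2|) bound in slide 14 holds, consider this Python sketch of a bitwise OR computed directly on run-length encoded bitmaps. It assumes each bitmap is a list of (bit, length) pairs with positive lengths and that both bitmaps cover the same number of bits; the representation and the name `rle_or` are illustrative choices, not the layout used by any particular library.

```python
def rle_or(a, b):
    """OR of two run-length encoded bitmaps without decompressing them."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0  # bits left in the current run of a
    rem_b = b[0][1] if b else 0  # bits left in the current run of b
    while i < len(a) and j < len(b):
        bit = a[i][0] | b[j][0]
        step = min(rem_a, rem_b)  # both runs are constant over this many bits
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)  # merge with the previous run
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            if i < len(a):
                rem_a = a[i][1]
        if rem_b == 0:
            j += 1
            if j < len(b):
                rem_b = b[j][1]
    return out


# 00011111 OR 11000001 = 11011111
print(rle_or([(0, 3), (1, 5)], [(1, 2), (0, 5), (1, 1)]))
```

Each loop iteration finishes at least one input run, so the number of iterations is at most the total number of runs in the two inputs.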
- 15. What happens when you have many bitmaps? Consider B1 ∨ B2 ∨ ... ∨ BN. First compute the first two: B1 ∨ B2 in time O(|B1| + |B2|). |B3 ∨ B4| is in O(|B3| + |B4|). Thus (B1 ∨ B2) ∨ (B3 ∨ B4) takes O(2 Σ_i |Bi|)... Total is in O(Σ_{i=1}^{N} |Bi| log N); can be generalized [Lemire et al., 2009].
- 16. How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words. Only 32-bit and 64-bit are efficient. 64-bit indexes are nearly twice as large. 64-bit indexes are between 5% and 40% faster (despite higher I/O costs).
- 17. Open Source Software? Lemur Bitmap Index C++ Library: http://code.google.com/p/lemurbitmapindex/. JavaEWAH: A compressed alternative to the Java BitSet class http://code.google.com/p/javaewah/.
- 18. Projection indexes: Simply write out the values sequentially. Ideal for low selectivity queries on few columns, e.g. SELECT sum(number*price) FROM T; Compressible with RLE.
- 19. Improving compression by sorting the table: RLE is order-sensitive: it compresses sorted tables better. But finding the best row ordering is NP-hard [Lemire et al., 2009]. So we sort: lexicographically, with Gray codes, with Hilbert, ...
- 20. How many ways to sort? (1) Lexicographic row sorting is fast, even for very large tables. It is easy: sort is a Unix staple. Substantial index-size reductions (often 2.5 times; benefits grow with table size).
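
The effect of lexicographic sorting on run counts can be checked on a toy table with a few lines of Python. The table below is made up for illustration, and `column_runs` is a hypothetical helper that counts runs of identical values in each column and sums them.

```python
from itertools import groupby


def column_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    return sum(len(list(groupby(column))) for column in zip(*rows))


# Made-up two-column table, in arrival order.
table = [(1, "a"), (2, "b"), (1, "a"), (2, "a"), (1, "b"), (2, "b")]

print(column_runs(table))          # runs before sorting
print(column_runs(sorted(table)))  # runs after a lexicographic sort
```

On this table the sort brings the total from 10 runs down to 6, all of the gain coming from the first column.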
- 21. How many ways to sort? (2) A Gray code is a list of tuples in which successive tuples are at (Hamming) distance 1 [Knuth, 2005, § 7.2.1.1]. Reflected Gray code order is sometimes slightly better than lexicographical... but the benefit goes as ≈ 1/N with column cardinality N, and it is poorly supported by existing software.
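
Since reflected Gray-code order is rarely available off the shelf, here is one way to obtain it in Python with a sort key. The sketch assumes every column holds integer ranks 0..card-1 with known cardinalities; `gray_key` is a hypothetical helper. It reverses a column whenever the values in the columns before it sum to an odd number, so that an ordinary lexicographic sort on the transformed tuples yields the reflected order.

```python
from itertools import product


def gray_key(row, cards):
    """Sort key turning a lexicographic sort into reflected Gray-code order."""
    key = []
    parity = 0  # parity of the sum of the original values seen so far
    for value, card in zip(row, cards):
        key.append(card - 1 - value if parity else value)
        parity ^= value & 1
    return tuple(key)


cards = (2, 2, 2)
rows = list(product(*(range(c) for c in cards)))
ordered = sorted(rows, key=lambda r: gray_key(r, cards))
print(ordered)
```

On the eight binary 3-tuples this gives the familiar sequence 000, 001, 011, 010, 110, 111, 101, 100, where each tuple differs from its predecessor in exactly one column.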
- 22. How many ways to sort? (3) Reflected Gray code order is not the only Gray code. Knuth also presents modular Gray code. But alternatives to reflected are never better?
- 23. How many ways to sort? (4) Can also try esoteric orders. Hilbert Index [Hamilton and Rau-Chaplin, 2007]. Gives very bad results for column-oriented indexes.
- 24. Modelling the size of an index: Any formal result? Tricky: there are many variations on RLE. Use as a proxy: the number of runs of identical values in a column.
- 25. Recursive orders: Lexicographical, reflected Gray code and modular Gray code belong to a larger class. Definition: A recursive order over c-tuples is such that it generates a recursive order over (c − 1)-tuples. All orders over 1-tuples are recursive. This is a recursive order: (1, 0, 0), (1, 0, 1), (0, 1, 1). This is not recursive: (1, 0, 0), (0, 1, 1), (1, 0, 1).
- 26. When sorting, column order matters. Question: Given a phone directory, to minimize the number of runs, should we sort by first or last names?
- 27. When sorting, column order matters. Setting: c columns, any recursive order. In practice, column order is very significant (factor of two or more). Proposition: The number of column runs varies by a factor of ≈ c under the permutation of the columns.
- 28. But column reordering fails to buy optimality. For some tables... Lemma: No recursive order minimizes the number of runs, even after reordering the columns. Open problem: how far from optimality?
- 29. Best column order? We almost have this result [Lemire and Kaser, ]. Setting: any recursive order; order the columns by increasing cardinality (small to LARGE). Proposition: The expected number of runs is minimized. Truth is complicated: this assumes uniformly distributed tables.
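
The small-to-large heuristic of slide 29 can be tried on a toy example in Python. The proposition concerns the expected number of runs under uniformly distributed tables, so a single hand-made table like the one below only illustrates the mechanics; `column_runs` and `reorder_by_cardinality` are hypothetical helpers.

```python
from itertools import groupby


def column_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    return sum(len(list(groupby(column))) for column in zip(*rows))


def reorder_by_cardinality(rows):
    """Permute the columns so that cardinality (number of distinct values) increases."""
    n_cols = len(rows[0])
    cardinality = [len({row[c] for row in rows}) for c in range(n_cols)]
    order = sorted(range(n_cols), key=lambda c: cardinality[c])
    return [tuple(row[c] for c in order) for row in rows]


# Made-up table: first column has 8 distinct values, second column has 2.
table = [(i, i % 2) for i in range(8)]

print(column_runs(sorted(table)))                          # large column first
print(column_runs(sorted(reorder_by_cardinality(table))))  # small column first
```

Here sorting with the high-cardinality column first leaves 16 runs, while putting the low-cardinality column first leaves 10.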
- 30. What about non-uniform or dependent columns? Real columns have skewed distributions [Missaoui et al., 2007] and they are statistically dependent. It can impact column ordering in unpredictable ways.
- 31. Take away messages: Column stores are good because of RLE and sorting. Lexicographical sort with the right column order is good. More exotic sorting (such as Hilbert) might be bad.
- 32. Future direction? Need better mathematical modelling of skewed and dependent columns. New column-oriented indexes? Better ways to sort?
- 33. Questions?
- 34. Hamilton, C. H. and Rau-Chaplin, A. (2007). Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105(5):155–163. Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley. Lemire, D. and Kaser, O. Reordering columns for smaller indexes. In preparation, available from http://arxiv.org/abs/0909.1346. Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes. To appear in Data & Knowledge Engineering, preprint available from http://arxiv.org/abs/0901.3751.
- 35. Missaoui, R., Goutte, C., Choupo, A. K., and Boujenoui, A. (2007). A probabilistic model for data cube compression and query approximation. In DOLAP, pages 33–40. O’Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In SIGMOD ’97, pages 38–49. O’Neil, P. E. (1989). Model 204 architecture and performance. In 2nd International Workshop on High Performance Transaction Systems, pages 40–59. Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. (2005). C-store: a column-oriented DBMS. In VLDB’05, pages 553–564.
- 36. Turner, M. J., Hammond, R., and Cotton, P. (1979). A DBMS for large statistical databases. In VLDB’79, pages 319–327. Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1–38. Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008). An adequate design for large data warehouse systems: Bitmap index versus b-tree index. IJCC, 2(2).
