Faster Column-Oriented Indexes

                        Daniel Lemire

  http://www.professeurs.uqam.ca/pages/lemire.danie...

Recent research results in optimizing column-oriented indexes for faster data warehousing. This talk aims to answer the following question: when is sorting the table a sufficiently good optimization?


Faster Column-Oriented Indexes

  1. Faster Column-Oriented Indexes
     Daniel Lemire
     http://www.professeurs.uqam.ca/pages/lemire.daniel.htm
     blog: http://www.daniel-lemire.com/
     Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc).
     February 10, 2010
  2. Some trends in business intelligence (BI)
     Low-latency BI, Complex Event Processing [Hyde, 2010]
     Commoditization, open source software: Pentaho, LucidDB (http://www.luciddb.org/)
     Column-oriented databases (figure source: gooddata.com)
  3. Row Stores
     [Figure: records stored one after another, each as name, date, age, sex, salary]
     Dominant paradigm
     Transactional: quick append and delete
  4. Column Stores
     [Figure: the columns name, date, age, sex, salary stored separately]
     Goes back to StatCan in the seventies [Turner et al., 1979]
     Made fashionable again in data warehousing by Stonebraker [Stonebraker et al., 2005]
     New: Oracle Exadata hybrid columnar compression
  5. Vectorization
     Modern superscalar CPUs support vectorization (SSE).
     This code is four times faster with -ftree-vectorize (GNU GCC):
         const int N = 2048;
         int a[N], b[N];
         int i = 0;
         for (; i < N; i++)
             a[i] += b[i];
     Need long streams, same data type, and no branching.
     Columns are good candidates!
  6. Main column-oriented indexes
     (1) Bitmap indexes [O’Neil, 1989]
     (2) Projection indexes [O’Neil and Quass, 1997]
     Both are compressible.
  7. Bitmap indexes
     SELECT * FROM T WHERE x=a AND y=b;
     Bitmaps are vectors of booleans.
     To answer the query above, compute
     {r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}
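The intersection on the bitmap-index slide can be sketched in C as follows. This is a minimal illustration, not the author's implementation: the function names `build_bitmap` and `bitmap_and` are hypothetical, columns are assumed to hold plain `int` values, and a real bitmap index would precompute one bitmap per distinct value rather than scan the column at query time.

```c
#include <stddef.h>
#include <stdint.h>

/* Build the bitmap for the predicate col[r] == value (hypothetical helper).
   Bit (r % 64) of word (r / 64) is set when row r matches.
   out must have room for (nrows + 63) / 64 words. */
void build_bitmap(const int *col, size_t nrows, int value, uint64_t *out) {
    size_t nwords = (nrows + 63) / 64;
    for (size_t w = 0; w < nwords; w++)
        out[w] = 0;
    for (size_t r = 0; r < nrows; r++)
        if (col[r] == value)
            out[r / 64] |= (uint64_t)1 << (r % 64);
}

/* Intersect two bitmaps word by word: out = a AND b. */
void bitmap_and(const uint64_t *a, const uint64_t *b,
                uint64_t *out, size_t nwords) {
    for (size_t w = 0; w < nwords; w++)
        out[w] = a[w] & b[w];
}
```

With x = {1, 2, 1, 1, 3} and y = {7, 7, 8, 7, 7}, intersecting the bitmaps for x = 1 and y = 7 leaves the bits for rows 0 and 3 set.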
  8. Other applications of the bitmaps/bitsets
     The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.)
     Search engines use bitmaps to filter queries, e.g. Apache Lucene: org.apache.lucene.util.OpenBitSet.java.
  9. Bitmaps and fast AND/OR operations
     Computing the union of two sets of integers between 1 and 64 (e.g., row ids, trivial table)...
     E.g., {1, 5, 8} ∪ {1, 3, 5}?
     Can be done in one operation by a CPU: BitwiseOR(10001001, 10101000)
     Extend to sets from 1..N using ⌈N/64⌉ operations.
     To compute [a0, ..., aN−1] ∨ [b0, b1, ..., bN−1]:
         a0, ..., a63 BitwiseOR b0, ..., b63;
         a64, ..., a127 BitwiseOR b64, ..., b127;
         a128, ..., a191 BitwiseOR b128, ..., b191;
         ...
     It is a form of vectorization.
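A small C sketch of the word-by-word union described on this slide. The helper names (`bitset_add`, `bitset_contains`, `bitset_union`) are made up for illustration, and elements are numbered from 1 as on the slide, so element k lives in bit (k - 1) % 64 of word (k - 1) / 64.

```c
#include <stddef.h>
#include <stdint.h>

/* Add element k (1-based) to the set stored in words. */
void bitset_add(uint64_t *words, size_t k) {
    words[(k - 1) / 64] |= (uint64_t)1 << ((k - 1) % 64);
}

/* Return 1 if element k (1-based) is in the set, 0 otherwise. */
int bitset_contains(const uint64_t *words, size_t k) {
    return (int)((words[(k - 1) / 64] >> ((k - 1) % 64)) & 1);
}

/* Union of two sets: one bitwise OR per 64-bit word. */
void bitset_union(const uint64_t *a, const uint64_t *b,
                  uint64_t *out, size_t nwords) {
    for (size_t w = 0; w < nwords; w++)
        out[w] = a[w] | b[w];
}
```

For {1, 5, 8} ∪ {1, 3, 5}, both sets fit in a single word, so the union costs a single OR and yields {1, 3, 5, 8}. The loop has no branches and touches one data type, which is why compilers can vectorize it further.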
  10. What are bitmap indexes for?
      Myth: bitmap indexes are for low cardinality columns (e.g., SEX).
      "the Bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality" [Zaker et al., 2008].
  11. Projection indexes
      [Figure: the columns name, date, city, each written out on its own]
      Write out the (normalized) column values sequentially.
      It is a projection of the table on a single column.
      Best for low selectivity queries on few columns: SELECT sum(number*price) FROM T;
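As a rough sketch of why projection indexes suit such queries, the C function below evaluates sum(number*price) by scanning two column arrays. It assumes each projection index is simply a contiguous array (`int` for number, `double` for price); the function name and the types are illustrative choices, not taken from the talk.

```c
#include <stddef.h>

/* Evaluate SELECT sum(number*price) FROM T over two projection indexes,
   each assumed to be a plain array with one entry per row. */
double sum_number_times_price(const int *number, const double *price,
                              size_t nrows) {
    double total = 0.0;
    for (size_t r = 0; r < nrows; r++)
        total += number[r] * price[r];
    return total;
}
```

Only the two columns named in the query are read; the other columns of the table are never touched.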
  12. How to compress column indexes?
      Must handle long streams of identical values efficiently ⇒ Run-length encoding (RLE)?
      Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, ...
      So just encode the run lengths, e.g., 0001111100010111 → 3, 5, 3, 1, 1, 3
      It is a bit more complicated (more another day)
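The run-length example on this slide can be reproduced with a few lines of C. This is only a sketch of plain RLE over a string of '0' and '1' characters; the function name `rle_encode` is hypothetical, and the word-aligned schemes used in practice are more involved, as the slide notes.

```c
#include <stddef.h>

/* Write the lengths of the successive runs in the '0'/'1' string bits
   into runs and return the number of runs.
   runs must have room for one entry per character of bits. */
size_t rle_encode(const char *bits, size_t *runs) {
    size_t nruns = 0;
    for (size_t i = 0; bits[i] != '\0'; i++) {
        if (i > 0 && bits[i] == bits[i - 1])
            runs[nruns - 1]++;      /* same value: extend the current run */
        else
            runs[nruns++] = 1;      /* value changed: start a new run */
    }
    return nruns;
}
```

Called on "0001111100010111", it produces the six run lengths 3, 5, 3, 1, 1, 3 shown on the slide.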
  13. What about other compression types?
      With RLE, we can often process the data in compressed form.
      Hence, with RLE, compression saves both storage and CPU cycles!
      Not always true with other techniques such as Huffman, LZ77, Arithmetic Coding, ...
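One simple case of processing data in compressed form is counting the set bits of a run-length-encoded bitmap without expanding it. The sketch below assumes a convention the slides do not spell out: runs alternate between 0s and 1s and the first run is a run of 0s (of length zero if the bitmap starts with a 1). The name `rle_count_ones` is hypothetical.

```c
#include <stddef.h>

/* Count the 1 bits of an RLE bitmap directly from its run lengths.
   Assumed convention: runs[0] is a run of 0s, runs[1] a run of 1s,
   runs[2] a run of 0s, and so on. */
size_t rle_count_ones(const size_t *runs, size_t nruns) {
    size_t ones = 0;
    for (size_t i = 1; i < nruns; i += 2)
        ones += runs[i];
    return ones;
}
```

For the run lengths 3, 5, 3, 1, 1, 3 the count is 5 + 1 + 3 = 9, computed from six numbers instead of sixteen bits; the saving grows with the length of the runs.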
  14. How do we improve performance?
      Smaller indexes are faster.
      In data warehousing: data is often updated in batches.
      So spend time at construction time optimizing the index.
  15. Modelling the size of an index
      Any formal result?
      Tricky: There are many variations on RLE.
      Use: number of runs of identical values in a column
      AAABBBCCAA has 4 runs
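The cost model on this slide is easy to state in code. A minimal C sketch, assuming a column of single-character values and using the hypothetical name `count_runs`:

```c
#include <stddef.h>

/* Number of runs of identical values in a column of n values. */
size_t count_runs(const char *col, size_t n) {
    if (n == 0)
        return 0;
    size_t runs = 1;                 /* the first value starts a run */
    for (size_t i = 1; i < n; i++)
        if (col[i] != col[i - 1])    /* each change of value starts a new run */
            runs++;
    return runs;
}
```

On the column AAABBBCCAA the function returns 4, matching the slide.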
  16. Improving compression by reordering the rows
      RLE-compressed indexes are order-sensitive: they compress sorted tables better.
      But finding the best row ordering is NP-hard [Lemire et al., 2010].
      It is actually an instance of the Traveling Salesman Problem (TSP).
      So we use heuristics: lexicographic order, Gray codes, Hilbert, ...
  17. How many ways to sort? (1)
      Lexicographic row sorting is
          fast, even for very large tables;
          easy: sort is a Unix staple.
      Substantial index-size reductions (often 2.5 times, benefits grow with table size)
      Example order: (a, a), (a, b), (a, c), (b, a), (b, b), (b, c)
  18. How many ways to sort? (2)
      Gray codes are lists of tuples with successive (Hamming) distance of 1 [Knuth, 2005].
      Reflected Gray code order is sometimes slightly better than lexicographical...
      Example order: (a, a), (a, b), (a, c), (b, c), (b, b), (b, a)
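To compare the lexicographic and reflected Gray-code orders on the small two-column table from these slides, one can sort with two different comparators and count runs. The C sketch below makes several assumptions for illustration: values are encoded as small non-negative integers (a = 0, b = 1, c = 2), the number of columns is fixed by the macro `NCOLS`, and all function names are hypothetical. The Gray-code comparator uses the recursive definition of the reflected order: the direction of a column flips each time an earlier column holds an odd value.

```c
#include <stddef.h>
#include <stdlib.h>

#define NCOLS 2   /* number of columns, fixed for this sketch */

/* Lexicographic comparison of two rows of NCOLS integers. */
int cmp_lex(const void *pa, const void *pb) {
    const int *a = pa, *b = pb;
    for (int c = 0; c < NCOLS; c++)
        if (a[c] != b[c])
            return a[c] < b[c] ? -1 : 1;
    return 0;
}

/* Reflected Gray-code comparison: like lexicographic order, except that
   the direction is reversed when the values of the earlier (equal)
   columns contain an odd number of odd entries. */
int cmp_gray(const void *pa, const void *pb) {
    const int *a = pa, *b = pb;
    int flip = 0;
    for (int c = 0; c < NCOLS; c++) {
        if (a[c] != b[c]) {
            int lt = a[c] < b[c] ? -1 : 1;
            return flip ? -lt : lt;
        }
        flip ^= a[c] & 1;
    }
    return 0;
}

/* Total number of runs, summed over all columns, in the current row order. */
size_t total_runs(int rows[][NCOLS], size_t nrows) {
    size_t runs = 0;
    for (int c = 0; c < NCOLS; c++)
        for (size_t r = 0; r < nrows; r++)
            if (r == 0 || rows[r][c] != rows[r - 1][c])
                runs++;
    return runs;
}
```

Sorting the six rows with `qsort(t, 6, sizeof t[0], cmp_lex)` gives 2 + 6 = 8 runs, while `cmp_gray` gives 2 + 5 = 7 runs, a small gain consistent with the "sometimes slightly better" remark.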
  19. How many ways to sort? (3)
      Reflected Gray code order is not the only Gray code.
      Knuth also presents Modular Gray code.
      Example order: (a, a), (a, b), (a, c), (b, c), (b, a), (b, b)
  20. How many ways to sort? (4)
      Hilbert Index [Hamilton and Rau-Chaplin, 2007].
      Also a Gray code (conditionally)
      Gives very bad results for column-oriented indexes.
  21. Recursive orders
      Lexicographical, reflected Gray code and modular Gray code belong to a larger class: recursive orders.
      They sort on the first column, then the second and so on.
      Not all Gray codes are recursive orders: Hilbert is not.
  22. Best column order?
      Column order is important for recursive orders.
      We almost have this result [Lemire and Kaser, 2009]:
      Proposition. For any recursive order, ordering the columns by increasing cardinality (small to LARGE) minimizes the expected number of runs (among all possible column orders).
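A toy illustration of why column order matters, under a strong simplifying assumption that is not part of the proposition: the table is complete (every combination of values occurs exactly once), every column has at least two distinct values, and the rows are sorted lexicographically. In that special case column k has exactly N1 × ... × Nk runs, so the total is a sum of prefix products. The function name `runs_complete_table` is hypothetical.

```c
#include <stddef.h>

/* Total number of runs in a complete, lexicographically sorted table
   whose columns have cardinalities card[0], card[1], ... in sort order.
   Assumes every cardinality is at least 2 and the products do not overflow. */
unsigned long runs_complete_table(const unsigned long *card, size_t ncols) {
    unsigned long prefix = 1, total = 0;
    for (size_t k = 0; k < ncols; k++) {
        prefix *= card[k];   /* runs in column k: card[0] * ... * card[k] */
        total += prefix;
    }
    return total;
}
```

With cardinalities 2 and 3, putting the small column first gives 2 + 6 = 8 runs, whereas putting the large column first gives 3 + 6 = 9 runs.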
  23. How do you know when the lexicographical order is good enough?
      Even though row reordering is NP-hard, we find it hard to improve over recursive orders.
      Sometimes, fancier alternatives (to be discussed another day) work better, but not always.
  24. Thankfully, we can detect cases where recursive orders are good enough
      We can bound the suboptimality of all recursive orders.
      Proposition. Consider a table with $n$ distinct rows and column cardinalities $N_i$ for $i = 1, \ldots, c$. Recursive ordering is $\mu$-optimal for the problem of minimizing the runs where
      $$\mu = \frac{\min(N_1, n) + \min(N_1 N_2, n) + \cdots + \min(N_1 N_2 \cdots N_c, n)}{n}.$$
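The bound µ from this proposition is cheap to evaluate once n and the column cardinalities are known. A minimal C sketch, with the hypothetical name `mu_bound` and `double` arithmetic chosen to sidestep integer overflow in the products:

```c
#include <stddef.h>

/* Evaluate mu = (min(N1, n) + min(N1*N2, n) + ... + min(N1*...*Nc, n)) / n
   for column cardinalities card[0..ncols-1] and n distinct rows (n > 0). */
double mu_bound(const double *card, size_t ncols, double n) {
    double prefix = 1.0, sum = 0.0;
    for (size_t i = 0; i < ncols; i++) {
        prefix *= card[i];
        if (prefix > n)
            prefix = n;   /* min(N1*...*Ni, n); once clamped it stays at n */
        sum += prefix;
    }
    return sum / n;
}
```

For example, with cardinalities 2 and 3 and n = 6 distinct rows, µ = (2 + 6)/6 ≈ 1.33, so in that case no row ordering can have fewer than about three quarters of the runs produced by a recursive order.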
  25. Bounding the optimality of sorting: the computation
      How do you compute µ very fast so you know lexicographical sort is good enough?
      Trick is to determine n, the number of distinct rows, without sorting the table.
      Thankfully: n can be estimated quickly with probabilistic methods [Aouiche and Lemire, 2007].
  26. Bounding the optimality of sorting: actual numbers
      Data set             columns   µ
      Census-Income 4-D    4         2.63
      DBGEN 4-D            4         1.02
      Netflix              4         2.00
      Census1881           7         5.09
  27. Take away message
      Column stores are good because of vectorization and RLE/sorting.
      Sorting is sometimes nearly optimal, but not always.
      But we can sometimes tell when sorting is optimal.
  28. Future direction?
      Minimizing the number of runs is the wrong problem! We want to maximize long runs!
      Must study fancier row-reordering heuristics.
  29. Questions?
  30. References
      Aouiche, K. and Lemire, D. (2007). A comparison of five probabilistic view-size estimation techniques in OLAP. In DOLAP’07, pages 17–24.
      Hamilton, C. H. and Rau-Chaplin, A. (2007). Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105(5):155–163.
      Hyde, J. (2010). Data in flight. Commun. ACM, 53(1):48–52.
      Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley.
      Lemire, D. and Kaser, O. (2009). Reordering columns for smaller indexes. In preparation, available from http://arxiv.org/abs/0909.1346.
      Lemire, D., Kaser, O., and Aouiche, K. (2010). Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering, 69(1):3–28.
      O’Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In SIGMOD ’97, pages 38–49.
      O’Neil, P. E. (1989). Model 204 architecture and performance. In 2nd International Workshop on High Performance Transaction Systems, pages 40–59.
      Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. (2005). C-store: a column-oriented DBMS. In VLDB’05, pages 553–564.
      Turner, M. J., Hammond, R., and Cotton, P. (1979). A DBMS for large statistical databases. In VLDB’79, pages 319–327.
      Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008). An adequate design for large data warehouse systems: Bitmap index versus B-Tree index. IJCC, 2(2).
