Faster Column-Oriented Indexes

                        Daniel Lemire

  http://www.professeurs.uqam.ca/pages/lemire.daniel.htm
            blog: http://www.daniel-lemire.com/

Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc).




                    February 10, 2010




                   Daniel Lemire   Faster Column-Oriented Indexes
Some trends in business intelligence (BI)




                                Low-latency BI, Complex Event
                                Processing [Hyde, 2010]
                                Commoditization, open source software:
                                Pentaho, LucidDB
                                (http://www.luciddb.org/)
                                Column-oriented databases ←
source: gooddata.com




Row Stores




    name, date, age, sex, salary
    name, date, age, sex, salary
    name, date, age, sex, salary
    name, date, age, sex, salary
    name, date, age, sex, salary

    Dominant paradigm
    Transactional: Quick append and delete




Column Stores




    name   date   age   sex   salary

    Goes back to StatCan in the seventies [Turner et al., 1979]
    Made fashionable again in Data Warehousing by
    Stonebraker [Stonebraker et al., 2005]
    New: Oracle Exadata hybrid columnar compression




Vectorization




    const int N = 2048;
    int a[N], b[N];
    int i = 0;
    for (; i < N; i++)
        a[i] += b[i];

    Modern superscalar CPUs support vectorization (SSE)
    This code is four times faster with -ftree-vectorize (GNU GCC)
    Need long streams, same data type, and no branching.
    Columns are good candidates!




Main column-oriented indexes




     (1) Bitmap indexes [O’Neil, 1989]
     (2) Projection indexes [O’Neil and Quass, 1997]
     Both are compressible.




Bitmap indexes




  SELECT * FROM T
  WHERE x=a AND y=b;

  Vectors of booleans
  Above, compute
   {r | r is the row id of a row where x = a} ∩
   {r | r is the row id of a row where y = b}
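The intersection above can be sketched in C, assuming each bitmap is stored uncompressed as an array of 64-bit words, with bit r set when row r matches. The function names (`bitmap_and`, `bitmap_test`) and the 0-based row ids are illustrative assumptions, not part of the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Intersect two uncompressed bitmaps of nwords 64-bit words each.
   Bit r of x_eq_a is set when row r satisfies x = a; likewise for y_eq_b. */
void bitmap_and(const uint64_t *x_eq_a, const uint64_t *y_eq_b,
                uint64_t *out, size_t nwords) {
    assert(x_eq_a && y_eq_b && out);
    for (size_t w = 0; w < nwords; w++)
        out[w] = x_eq_a[w] & y_eq_b[w];   /* 64 rows per operation */
}

/* Return 1 if row r (0-based) is set in the bitmap, 0 otherwise. */
int bitmap_test(const uint64_t *bm, size_t r) {
    return (int)((bm[r / 64] >> (r % 64)) & 1u);
}
```

The rows satisfying both predicates are exactly those for which `bitmap_test` returns 1 on the output bitmap.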




Other applications of the bitmaps/bitsets




      The Java language has had a bitmap class since the
      beginning: java.util.BitSet. (Sun’s implementation is based
      on 64-bit words.)
      Search engines use bitmaps to filter queries, e.g. Apache
      Lucene: org.apache.lucene.util.OpenBitSet.java.




Bitmaps and fast AND/OR operations


     Computing the union of two sets of integers between 1 and 64
     (e.g., row ids of a trivial table). . .
     E.g., {1, 5, 8} ∪ {1, 3, 5}?
     Can be done in one operation by a CPU:
     BitwiseOR(10001001, 10101000) = 10101001, i.e., {1, 3, 5, 8}
     Extend to sets from 1..N using ⌈N/64⌉ operations.
     To compute [a0, . . . , aN−1] ∨ [b0, b1, . . . , bN−1]:
     a0, . . . , a63 BitwiseOR b0, . . . , b63;
     a64, . . . , a127 BitwiseOR b64, . . . , b127;
     a128, . . . , a191 BitwiseOR b128, . . . , b191;
     ...
     It is a form of vectorization.
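A minimal C sketch of the word-by-word union, under the assumption that integer i (counted from 1, as on the slide) is stored in bit i−1 of an array of 64-bit words. The helper names are hypothetical. Note that the slide writes bits left to right, whereas this sketch puts element 1 in the least significant bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add integer i (1-based) to the set. */
void set_add(uint64_t *words, size_t i) {
    assert(i >= 1);
    words[(i - 1) / 64] |= UINT64_C(1) << ((i - 1) % 64);
}

/* Return 1 if integer i (1-based) is in the set, 0 otherwise. */
int set_contains(const uint64_t *words, size_t i) {
    assert(i >= 1);
    return (int)((words[(i - 1) / 64] >> ((i - 1) % 64)) & 1u);
}

/* Union of two sets over 1..n: ceil(n/64) word-wise OR operations. */
void set_union(const uint64_t *a, const uint64_t *b, uint64_t *out, size_t n) {
    size_t nwords = (n + 63) / 64;
    for (size_t w = 0; w < nwords; w++)
        out[w] = a[w] | b[w];
}
```

With {1, 5, 8} and {1, 3, 5} stored in one word each, a single OR yields {1, 3, 5, 8}.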


What are bitmap indexes for?




     Myth: bitmap indexes are for low cardinality columns (e.g.,
     SEX).
              the Bitmap index is the conclusive choice for data
              warehouse design for columns with high or low
              cardinality [Zaker et al., 2008].




Projection indexes




     [Figure: a table with columns name, date, city, and the same columns stored separately]

     Write out the (normalized) column values sequentially.
     It is a projection of the table on a single column.
     Best for low selectivity queries on few columns:
     SELECT sum(number*price) FROM T;
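As a sketch of why projection indexes suit such queries: if the `number` and `price` columns are each written out sequentially as arrays in row order, the query reduces to one pass over two arrays. The column types (`int` and `double`) and the function name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* SELECT sum(number*price) FROM T, computed from two projection indexes:
   number[] and price[] hold the column values in row order. */
double sum_number_times_price(const int *number, const double *price,
                              size_t nrows) {
    assert(nrows == 0 || (number && price));
    double total = 0.0;
    for (size_t r = 0; r < nrows; r++)
        total += number[r] * price[r];   /* no other column is touched */
    return total;
}
```

The loop reads long streams of a single data type with no branching, which is the access pattern described on the vectorization slide.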




How to compress column indexes?




     Must handle long streams of identical values efficiently ⇒
     Run-length encoding? (RLE)
     Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, . . .
     So just encode the run lengths, e.g.,
     0001111100010111 →
      3, 5, 3, 1, 1, 3
     It is a bit more complicated (more another day)
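The simple scheme above can be sketched as follows, assuming the bitmap is given as a string of '0' and '1' characters and that the first run always counts 0s (so a bitmap starting with 1 gets a leading run of length 0). This is only the textbook variant; the function name is hypothetical, and practical schemes are word-aligned and more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a string of '0'/'1' characters as alternating run lengths,
   starting with the run of 0s (which may have length 0).
   Writes at most max_runs lengths into runs[] and returns the count. */
size_t rle_encode_bits(const char *bits, size_t *runs, size_t max_runs) {
    size_t nruns = 0;
    char current = '0';      /* by convention the first run counts 0s */
    size_t length = 0;
    for (const char *p = bits; *p != '\0'; p++) {
        assert(*p == '0' || *p == '1');
        if (*p == current) {
            length++;
        } else {             /* the run ends: store it, start the next */
            assert(nruns < max_runs);
            runs[nruns++] = length;
            current = *p;
            length = 1;
        }
    }
    if (length > 0) {        /* store the final run */
        assert(nruns < max_runs);
        runs[nruns++] = length;
    }
    return nruns;
}
```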




What about other compression types?




     With RLE, we can often process the data in compressed form.
     Hence, with RLE, compression saves both storage and
     CPU cycles!
     Not always true with other techniques such as Huffman,
     LZ77, Arithmetic Coding, . . .
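One small illustration of processing data in compressed form: counting the 1s of a run-length encoded bitmap without decompressing it. The sketch assumes the alternating run-length layout of the earlier example (0s first, then 1s, and so on); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count the 1s of a bitmap stored as alternating run lengths
   (0s first, then 1s, then 0s, ...), without decompressing it. */
size_t rle_count_ones(const size_t *runs, size_t nruns) {
    assert(nruns == 0 || runs);
    size_t ones = 0;
    for (size_t k = 1; k < nruns; k += 2)   /* odd positions are runs of 1s */
        ones += runs[k];
    return ones;
}
```

The cost is proportional to the number of runs rather than the number of rows, which is where the CPU savings come from.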




How do we improve performance?




     Smaller indexes are faster.
     In data warehousing: data is often updated in batches.
     So spend time at construction time optimizing the index.




Modelling the size of an index




      Any formal result?
      Tricky: There are many variations on RLE.
      Use: number of runs of identical value in a column
      AAABBBCCAA has 4 runs
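The run count used as the size model can be computed in one pass. The sketch below assumes a column of single-character values; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of runs of identical values in a column of nrows values. */
size_t count_runs(const char *column, size_t nrows) {
    assert(nrows == 0 || column);
    if (nrows == 0)
        return 0;
    size_t runs = 1;                       /* the first value starts a run */
    for (size_t r = 1; r < nrows; r++)
        if (column[r] != column[r - 1])    /* a change of value starts a new run */
            runs++;
    return runs;
}
```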




Improving compression by reordering the rows



      RLE is order-sensitive:
      it compresses sorted tables better.
      But finding the best row ordering is
      NP-hard [Lemire et al., 2010].
      Actually an instance of the Traveling Salesman Problem
      (TSP)
      So we use heuristic orders:
          lexicographic
          Gray codes
          Hilbert, . . .




How many ways to sort? (1)




      Lexicographic row sorting is
          fast, even for very large tables.
          easy: sort is a Unix staple.
      Substantial index-size reductions
      (often 2.5 times, benefits grow with table size)

          a     a
          a     b
          a     c
          b     a
          b     b
          b     c
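A lexicographic row sort can be sketched with the C standard library's `qsort`, assuming rows are fixed-width arrays of integer codes (a = 0, b = 1, c = 2). `NCOLS` and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

enum { NCOLS = 2 };  /* number of columns; fixed here for illustration */

/* qsort comparator: compare rows column by column, first column first. */
int lex_cmp(const void *pa, const void *pb) {
    const int *a = pa, *b = pb;
    for (int j = 0; j < NCOLS; j++) {
        if (a[j] != b[j])
            return (a[j] < b[j]) ? -1 : 1;
    }
    return 0;
}

/* Sort the rows of a table in lexicographic order. */
void sort_rows_lex(int (*rows)[NCOLS], size_t nrows) {
    assert(nrows == 0 || rows);
    qsort(rows, nrows, sizeof rows[0], lex_cmp);
}
```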




How many ways to sort? (2)




      Gray codes are lists of tuples
      with successive (Hamming)
      distance of 1 [Knuth, 2005].
      Reflected Gray Code order is
          sometimes slightly better
          than lexicographical. . .

          a     a
          a     b
          a     c
          b     c
          b     b
          b     a
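One way to sort rows in reflected Gray-code order is a comparator that reverses the direction of a column whenever the sum of the values in the preceding columns is odd. The sketch below assumes rows are fixed-width arrays of non-negative integer codes (a = 0, b = 1, c = 2). `NCOLS` and the function names are hypothetical, and this is an illustration rather than the implementation used in the cited work.

```c
#include <assert.h>
#include <stdlib.h>

enum { NCOLS = 2 };  /* number of columns; fixed here for illustration */

/* qsort comparator for reflected Gray-code order.
   Values must be non-negative integer codes. */
int gray_cmp(const void *pa, const void *pb) {
    const int *a = pa, *b = pb;
    int reversed = 0;   /* 1 when the prefix seen so far has an odd sum */
    for (int j = 0; j < NCOLS; j++) {
        if (a[j] != b[j]) {
            int ascending = (a[j] < b[j]) ? -1 : 1;
            return reversed ? -ascending : ascending;
        }
        reversed ^= a[j] & 1;   /* an odd value flips the direction */
    }
    return 0;
}

/* Sort the rows of a table in reflected Gray-code order. */
void sort_rows_gray(int (*rows)[NCOLS], size_t nrows) {
    assert(nrows == 0 || rows);
    qsort(rows, nrows, sizeof rows[0], gray_cmp);
}
```

On the six rows of the example, this produces the order shown on the slide: the second column ascends under a and descends under b.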




How many ways to sort? (3)




      Reflected Gray Code order is not
      the only Gray code.
      Knuth also presents Modular
      Gray-code.

          a     a
          a     b
          a     c
          b     c
          b     a
          b     b




How many ways to sort? (4)




                                       Hilbert Index
                                       [Hamilton and Rau-Chaplin, 2007].
                                       Also a Gray code
                                       (conditionally)
                                       Gives very bad results for
                                       column-oriented indexes.




Recursive orders




      Lexicographical, reflected Gray code and modular Gray
      code belong to a larger class: recursive orders.
      They sort on the first column, then the second and so on.
      Not all Gray codes are recursive orders: Hilbert is not.




Best column order?


  Column order is important for recursive orders.
  We almost have this result [Lemire and Kaser, 2009]:
      for any recursive order,
      order the columns by increasing cardinality (small to
      LARGE)

  Proposition

  The expected number of runs is then minimized (among all possible
  column orders).




How do you know when the lexicographical order is good
enough?




     Even though row reordering is NP-hard, we find it hard to
     improve over recursive orders.
     Sometimes, fancier alternatives (to be discussed another day)
     work better, but not always.




Thankfully, we can detect cases where recursive orders are
good enough



  We can bound the suboptimality of all recursive orders.
  Proposition

  Consider a table with n distinct rows and column cardinalities N_i
  for i = 1, . . . , c. Recursive ordering is µ-optimal for the problem of
  minimizing the runs where

    \mu = \frac{\min(N_1, n) + \min(N_1 N_2, n) + \cdots + \min(N_1 N_2 \cdots N_c, n)}{n}.
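Given the column cardinalities and the number of distinct rows n, the bound µ can be evaluated in a single pass. The sketch below is an illustration with a hypothetical function name; it clamps each partial product at n so that the running product cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mu = (min(N1, n) + min(N1*N2, n) + ... + min(N1*...*Nc, n)) / n,
   where card[j] is the cardinality of column j (in column order)
   and n is the number of distinct rows. */
double mu_bound(const uint64_t *card, size_t c, uint64_t n) {
    assert(n > 0);
    uint64_t prod = 1;   /* holds min(N1*...*Nj, n) */
    uint64_t sum = 0;
    for (size_t j = 0; j < c; j++) {
        assert(card[j] > 0);
        /* smallest multiplier of card[j] that reaches n: ceil(n / card[j]) */
        uint64_t reach = n / card[j] + (n % card[j] != 0);
        if (prod >= reach)
            prod = n;            /* product would reach n: clamp */
        else
            prod *= card[j];     /* still below n: no overflow possible */
        sum += prod;
    }
    return (double)sum / (double)n;
}
```

For example, with cardinalities 2 and 3 and n = 6 distinct rows, µ = (2 + 6)/6.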




Bounding the optimality of sorting: the computation




      How do you compute µ very fast so you know lexicographical
      sort is good enough?
      Trick is to determine n, the number of distinct rows without
      sorting the table.
      Thankfully: n can be estimated quickly with probabilistic
      methods [Aouiche and Lemire, 2007].




Bounding the optimality of sorting: actual numbers




                          columns     µ
   Census-Income 4-D         4       2.63
   DBGEN 4-D                 4       1.02
   Netflix                   4       2.00
   Census1881                7       5.09



Take away message




     Column stores are good because of vectorization and
     RLE/sorting
     Sorting is sometimes nearly optimal, but not always; we
     can sometimes tell when it is.




Future direction?




      Minimizing the number of runs is the wrong problem! We
      want to maximize long runs!
      Must study fancier row-reordering heuristics.




Questions?




                             ?




Aouiche, K. and Lemire, D. (2007).
A comparison of five probabilistic view-size estimation techniques in OLAP.
In DOLAP’07, pages 17–24.

Hamilton, C. H. and Rau-Chaplin, A. (2007).
Compact Hilbert indices: Space-filling curves for domains with unequal side lengths.
Information Processing Letters, 105(5):155–163.

Hyde, J. (2010).
Data in flight.
Commun. ACM, 53(1):48–52.

Knuth, D. E. (2005).
The Art of Computer Programming, volume 4, fascicle 2.
Addison Wesley.

Lemire, D. and Kaser, O. (2009).
Reordering columns for smaller indexes.
In preparation, available from http://arxiv.org/abs/0909.1346.

Lemire, D., Kaser, O., and Aouiche, K. (2010).
Sorting improves word-aligned bitmap indexes.
Data & Knowledge Engineering, 69(1):3–28.

O’Neil, P. and Quass, D. (1997).
Improved query performance with variant indexes.
In SIGMOD ’97, pages 38–49.

O’Neil, P. E. (1989).
Model 204 architecture and performance.
In 2nd International Workshop on High Performance Transaction Systems, pages 40–59.

Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. (2005).
C-store: a column-oriented DBMS.
In VLDB’05, pages 553–564.

Turner, M. J., Hammond, R., and Cotton, P. (1979).
A DBMS for large statistical databases.
In VLDB’79, pages 319–327.

Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008).
An adequate design for large data warehouse systems: Bitmap index versus B-Tree index.
IJCC, 2(2).

GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 

Faster Column-Oriented Indexes

  • 1. Faster Column-Oriented Indexes Daniel Lemire http://www.professeurs.uqam.ca/pages/lemire.daniel.htm blog: http://www.daniel-lemire.com/ Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc). February 10, 2010 Daniel Lemire Faster Column-Oriented Indexes
  • 2. Some trends in business intelligence (BI): low-latency BI and Complex Event Processing [Hyde, 2010]; commoditization and open source software such as Pentaho and LucidDB (http://www.luciddb.org/); column-oriented databases (the topic here). Image source: gooddata.com
  • 3. Row Stores. Each row is stored as one record: (name, date, age, sex, salary), (name, date, age, sex, salary), ... Dominant paradigm. Transactional: quick append and delete.
  • 4. Column Stores. Each column (name, date, age, sex, salary) is stored separately. Goes back to StatCan in the seventies [Turner et al., 1979]. Made fashionable again in data warehousing by Stonebraker [Stonebraker et al., 2005]. New: Oracle Exadata hybrid columnar compression.
  • 5. Vectorization. Modern superscalar CPUs support vectorization (SSE). The code below is four times faster with -ftree-vectorize (GNU GCC). Vectorization needs long streams, the same data type, and no branching. Columns are good candidates!
      const int N = 2048;
      int a[N], b[N];
      int i = 0;
      for (; i < N; i++)
          a[i] += b[i];
  • 6. Main column-oriented indexes: (1) bitmap indexes [O’Neil, 1989]; (2) projection indexes [O’Neil and Quass, 1997]. Both are compressible.
  • 7. Bitmap indexes: vectors of booleans. To answer SELECT * FROM T WHERE x=a AND y=b; compute {r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}.
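The intersection on this slide can be sketched in C. This is only an illustration under stated assumptions: the helper names build_bitmap and and_query are hypothetical, columns are plain int arrays, row ids are 0-based, and the bitmaps are uncompressed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bit r of the bitmap is set when col[r] == value.
   bits must have room for (n + 63) / 64 words. */
void build_bitmap(const int *col, size_t n, int value, uint64_t *bits) {
    size_t nwords = (n + 63) / 64;
    for (size_t w = 0; w < nwords; w++)
        bits[w] = 0;
    for (size_t r = 0; r < n; r++)
        if (col[r] == value)
            bits[r / 64] |= UINT64_C(1) << (r % 64);
}

/* Hypothetical helper: AND the two bitmaps word by word and write the
   row ids of the surviving bits to out. Returns how many rows matched. */
size_t and_query(const uint64_t *bx, const uint64_t *by, size_t n, size_t *out) {
    size_t count = 0;
    size_t nwords = (n + 63) / 64;
    for (size_t w = 0; w < nwords; w++) {
        uint64_t word = bx[w] & by[w];   /* one AND covers 64 rows */
        for (unsigned bit = 0; bit < 64; bit++)
            if ((word >> bit) & 1)
                out[count++] = w * 64 + bit;
    }
    return count;
}
```

With x = {1, 2, 1, 1, 3} and y = {7, 7, 8, 7, 7}, the query x = 1 AND y = 7 keeps rows 0 and 3.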
  • 8. Other applications of the bitmaps/bitsets. The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.) Search engines use bitmaps to filter queries, e.g. Apache Lucene: org.apache.lucene.util.OpenBitSet.java.
  • 9. Bitmaps and fast AND/OR operations. Computing the union of two sets of integers between 1 and 64 (e.g., row ids of a trivial table), such as {1, 5, 8} ∪ {1, 3, 5}, can be done in one operation by a CPU: BitwiseOR(10001001, 10101000). Extend to sets from 1..N using N/64 operations. To compute [a0, ..., aN−1] ∨ [b0, b1, ..., bN−1]: a0, ..., a63 BitwiseOR b0, ..., b63; a64, ..., a127 BitwiseOR b64, ..., b127; a128, ..., a191 BitwiseOR b128, ..., b191; ... It is a form of vectorization.
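A minimal C sketch of the word-by-word union described on this slide. The function names are hypothetical, row ids are 1-based as on the slide, and row id r is mapped to bit (r − 1) mod 64 of word (r − 1) / 64, which is one possible layout among several.

```c
#include <stddef.h>
#include <stdint.h>

/* Set the bit for row id r (1-based). */
void bitmap_set(uint64_t *bits, size_t r) {
    bits[(r - 1) / 64] |= UINT64_C(1) << ((r - 1) % 64);
}

/* Return 1 when row id r (1-based) is in the set, 0 otherwise. */
int bitmap_get(const uint64_t *bits, size_t r) {
    return (int)((bits[(r - 1) / 64] >> ((r - 1) % 64)) & 1);
}

/* Union of two bitmaps: one bitwise OR per 64-bit word. */
void bitmap_or(const uint64_t *a, const uint64_t *b, uint64_t *out, size_t nwords) {
    for (size_t k = 0; k < nwords; k++)
        out[k] = a[k] | b[k];
}
```

The loop in bitmap_or has no branches and a single data type, so it is also a natural candidate for the compiler vectorization mentioned earlier in the talk.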
  • 10. What are bitmap indexes for? Myth: bitmap indexes are for low cardinality columns (e.g., SEX). Counterpoint: "the Bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality" [Zaker et al., 2008].
  • 11. Projection indexes. Write out the (normalized) column values sequentially. It is a projection of the table on a single column (for a table with columns name, date, city: one sequence for name, one for date, one for city). Best for low selectivity queries on few columns: SELECT sum(number*price) FROM T;
  • 12. How to compress column indexes? Must handle long streams of identical values efficiently ⇒ run-length encoding (RLE)? Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, ... So just encode the run lengths, e.g., 0001111100010111 → 3, 5, 3, 1, 1, 3. It is a bit more complicated (more another day).
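The encoding on this slide can be sketched as follows. This is a simplified illustration, not the word-aligned scheme real bitmap indexes use: the bitmap is given as a string of '0' and '1' characters, the function name rle_encode is hypothetical, and by assumption the first run always counts 0s (so a bitmap starting with 1 gets a leading run of length 0).

```c
#include <stddef.h>

/* Encode a string of '0'/'1' characters as alternating run lengths,
   starting with a run of 0s. Writes at most max_runs entries to runs
   and returns the total number of runs, so the caller can detect a
   buffer that was too small. Input is assumed to contain only '0' and '1'. */
size_t rle_encode(const char *bits, size_t *runs, size_t max_runs) {
    size_t nruns = 0;
    size_t len = 0;
    char current = '0';
    for (const char *p = bits; *p != '\0'; p++) {
        if (*p == current) {
            len++;
        } else {
            if (nruns < max_runs)
                runs[nruns] = len;   /* close the current run */
            nruns++;
            current = *p;
            len = 1;
        }
    }
    if (nruns < max_runs)
        runs[nruns] = len;           /* close the final run */
    nruns++;
    return nruns;
}
```

Applied to the slide's bitmap 0001111100010111, this produces the six run lengths 3, 5, 3, 1, 1, 3.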
  • 13. What about other compression types? With RLE, we can often process the data in compressed form. Hence, with RLE, compression saves both storage and CPU cycles! This is not always true with other techniques such as Huffman, LZ77, arithmetic coding, ...
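As one small example of working on compressed data, the number of 1s in a bitmap (the number of matching rows) can be read off the run lengths without decompressing. The sketch below assumes the run lengths alternate between 0s and 1s and start with a run of 0s; the function name rle_count_ones is hypothetical.

```c
#include <stddef.h>

/* Count the 1 bits of a run-length encoded bitmap.
   runs[0] is a run of 0s, runs[1] a run of 1s, and so on,
   so the answer is the sum of the odd-indexed entries. */
size_t rle_count_ones(const size_t *runs, size_t nruns) {
    size_t ones = 0;
    for (size_t i = 1; i < nruns; i += 2)
        ones += runs[i];
    return ones;
}
```

The cost is proportional to the number of runs rather than to the number of rows, which is why fewer runs can mean both a smaller index and less CPU work.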
  • 14. How do we improve performance? Smaller indexes are faster. In data warehousing: data is often updated in batches. So spend time at construction time optimizing the index.
  • 15. Modelling the size of an index. Any formal result? Tricky: there are many variations on RLE. Use: the number of runs of identical values in a column. For example, AAABBBCCAA has 4 runs.
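Counting runs takes a single pass over the column. A minimal sketch, assuming a column of single-character values and a hypothetical function name count_runs:

```c
#include <stddef.h>

/* Number of maximal runs of identical values in a column of n values. */
size_t count_runs(const char *column, size_t n) {
    if (n == 0)
        return 0;
    size_t runs = 1;                      /* the first value opens a run */
    for (size_t i = 1; i < n; i++)
        if (column[i] != column[i - 1])   /* a change of value opens a new run */
            runs++;
    return runs;
}
```

For the slide's column AAABBBCCAA the runs are AAA, BBB, CC, AA, so the function returns 4.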
  • 16. Improving compression by reordering the rows. RLE is order-sensitive: it compresses sorted tables better. But finding the best row ordering is NP-hard [Lemire et al., 2010]. It is actually an instance of the Traveling Salesman Problem (TSP). So we use heuristics: lexicographic order, Gray codes, Hilbert, ...
  • 17. How many ways to sort? (1) Lexicographic row sorting is fast, even for very large tables, and easy: sort is a Unix staple. Substantial index-size reductions (often 2.5 times, benefits grow with table size). Example order: (a, a), (a, b), (a, c), (b, a), (b, b), (b, c).
  • 18. How many ways to sort? (2) Gray codes are lists of tuples with successive (Hamming) distance of 1 [Knuth, 2005]. Reflected Gray code order is sometimes slightly better than lexicographical order. Example order: (a, a), (a, b), (a, c), (b, c), (b, b), (b, a).
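The two orders seen so far can be compared on the six-row example with a short C sketch. Everything here is an illustrative assumption rather than the authors' implementation: the table has a fixed width of two columns, values are lowercase letters ranked from 'a', and the reflected Gray code comparator reverses the direction of a column whenever the ranks of the preceding columns have an odd sum.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { NCOLS = 2 };                       /* hypothetical fixed table width */
typedef struct { char v[NCOLS]; } Row;

/* Lexicographic order: compare column by column, always ascending. */
int cmp_lexico(const void *x, const void *y) {
    return memcmp(((const Row *)x)->v, ((const Row *)y)->v, NCOLS);
}

/* Reflected Gray code order: at the first differing column, sort
   ascending if the preceding ranks sum to an even number, descending
   otherwise. Ranks are taken as value - 'a'. */
int cmp_reflected_gray(const void *x, const void *y) {
    const Row *r = x;
    const Row *s = y;
    int parity = 0;
    for (int j = 0; j < NCOLS; j++) {
        if (r->v[j] != s->v[j]) {
            int c = (r->v[j] < s->v[j]) ? -1 : 1;
            return parity ? -c : c;
        }
        parity ^= (r->v[j] - 'a') & 1;
    }
    return 0;
}

/* Total number of runs, summed over all columns. */
size_t total_runs(const Row *t, size_t n) {
    size_t runs = 0;
    for (int j = 0; j < NCOLS; j++)
        for (size_t i = 0; i < n; i++)
            if (i == 0 || t[i].v[j] != t[i - 1].v[j])
                runs++;
    return runs;
}
```

Sorting the six rows with qsort and cmp_lexico gives 8 runs in total (2 in the first column, 6 in the second), while cmp_reflected_gray gives 7 (2 and 5), in line with the "sometimes slightly better" remark.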
  • 19. How many ways to sort? (3) Reflected Gray code order is not the only Gray code. Knuth also presents the modular Gray code. Example order: (a, a), (a, b), (a, c), (b, c), (b, a), (b, b).
  • 20. How many ways to sort? (4) Hilbert index [Hamilton and Rau-Chaplin, 2007]. Also a Gray code (conditionally). Gives very bad results for column-oriented indexes.
  • 21. Recursive orders. Lexicographical, reflected Gray code and modular Gray code belong to a larger class: recursive orders. They sort on the first column, then the second and so on. Not all Gray codes are recursive orders: Hilbert is not.
  • 22. Best column order? Column order is important for recursive orders. We almost have this result [Lemire and Kaser, 2009]. Proposition: for any recursive order, ordering the columns by increasing cardinality (small to LARGE) minimizes the expected number of runs (among all possible column orders).
  • 23. How do you know when the lexicographical order is good enough? Even though row reordering is NP-hard, we find it hard to improve over recursive orders. Sometimes, fancier alternatives (to be discussed another day) work better, but not always.
  • 24. Thankfully, we can detect cases where recursive orders are good enough. We can bound the suboptimality of all recursive orders. Proposition: consider a table with $n$ distinct rows and column cardinalities $N_i$ for $i = 1, \ldots, c$. Recursive ordering is $\mu$-optimal for the problem of minimizing the runs, where $\mu = \frac{\min(N_1, n) + \min(N_1 N_2, n) + \cdots + \min(N_1 N_2 \cdots N_c, n)}{n}$.
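The bound µ is cheap to evaluate once n and the column cardinalities are known. A minimal sketch, assuming the cardinalities are passed in column order as doubles and using a hypothetical function name mu_bound:

```c
#include <stddef.h>

/* mu = (min(N1, n) + min(N1*N2, n) + ... + min(N1*...*Nc, n)) / n
   card[i] is the cardinality of column i+1, c the number of columns,
   n the number of distinct rows (n > 0, cardinalities >= 1). */
double mu_bound(const double *card, size_t c, double n) {
    double prod = 1.0;
    double sum = 0.0;
    for (size_t i = 0; i < c; i++) {
        prod *= card[i];
        if (prod > n)
            prod = n;   /* later terms stay at n, and this avoids overflow */
        sum += prod;    /* prod now equals min(N1*...*N(i+1), n) */
    }
    return sum / n;
}
```

For the six-row example table (cardinalities 2 and 3, n = 6), this gives µ = (2 + 6)/6 ≈ 1.33.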
  • 25. Bounding the optimality of sorting: the computation. How do you compute µ very fast so you know lexicographical sort is good enough? The trick is to determine n, the number of distinct rows, without sorting the table. Thankfully, n can be estimated quickly with probabilistic methods [Aouiche and Lemire, 2007].
  • 26. Bounding the optimality of sorting: actual numbers
      dataset             columns   µ
      Census-Income 4-D   4         2.63
      DBGEN 4-D           4         1.02
      Netflix             4         2.00
      Census1881          7         5.09
  • 27. Take away message. Column stores are good because of vectorization and RLE/sorting. Sorting is sometimes nearly optimal, but not always; we can sometimes tell when sorting is optimal.
  • 28. Future direction? Minimizing the number of runs is the wrong problem! We want to maximize long runs! Must study fancier row-reordering heuristics.
  • 29. Questions?
  • 30-32. References
      Aouiche, K. and Lemire, D. (2007). A comparison of five probabilistic view-size estimation techniques in OLAP. In DOLAP’07, pages 17–24.
      Hamilton, C. H. and Rau-Chaplin, A. (2007). Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105(5):155–163.
      Hyde, J. (2010). Data in flight. Commun. ACM, 53(1):48–52.
      Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley.
      Lemire, D. and Kaser, O. (2009). Reordering columns for smaller indexes. In preparation, available from http://arxiv.org/abs/0909.1346.
      Lemire, D., Kaser, O., and Aouiche, K. (2010). Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering, 69(1):3–28.
      O’Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In SIGMOD ’97, pages 38–49.
      O’Neil, P. E. (1989). Model 204 architecture and performance. In 2nd International Workshop on High Performance Transaction Systems, pages 40–59.
      Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. (2005). C-store: a column-oriented DBMS. In VLDB’05, pages 553–564.
      Turner, M. J., Hammond, R., and Cotton, P. (1979). A DBMS for large statistical databases. In VLDB’79, pages 319–327.
      Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008). An adequate design for large data warehouse systems: Bitmap index versus B-Tree index. IJCC, 2(2).