Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems
             Part 1: Stavros Harizopoulos (HP Labs)
             Part 2: Daniel Abadi (Yale)
             Part 3: Peter Boncz (CWI)




                                                                  VLDB 2009 Tutorial                                   1
                                                           Column-Oriented Database Systems
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   What is a column-store?

                      row-store                                                                  column-store
                                                                        Date         Store         Product   Customer   Price
 Date       Store Product Customer Price




    + easy to add/modify a record                                          + only need to read in relevant data

    - might read in unnecessary data                                       - tuple writes require multiple accesses


                => suitable for read-mostly, read-intensive, large data repositories



            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                      2
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Are these two fundamentally different?
   l     The only fundamental difference is the storage layout
   l     However: we need to look at the big picture

              different storage layouts proposed
       row-stores                        row-stores++                                row-stores++           converge?


    ‘70s                ‘80s                  ‘90s               ‘00s                 today
                                                                                  column-stores
                 new applications
             new bottleneck in hardware


  l     How did we get here, and where we are heading Part 1
  l     What are the column-specific optimizations?          Part 2
  l     How do we improve CPU efficiency when operating on Cs
                                                                                                               Part 3
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems            3
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Outline
   l     Part 1: Basic concepts — Stavros
         l      Introduction to key features
         l      From DSM to column-stores and performance tradeoffs
         l      Column-store architecture overview
         l      Will rows and columns ever converge?

   l     Part 2: Column-oriented execution — Daniel

   l     Part 3: MonetDB/X100 and CPU efficiency — Peter



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       4
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




    Telco Data Warehousing example
l    Typical DW installation                                            dimension tables
                                                                           account                     fact table
                                                                                or RAM
l    Real-world example                                                                                     usage        source

“One Size Fits All? - Part 2: Benchmarking                                        toll
Results” Stonebraker et al. CIDR 2007                                                                                 star schema

                      QUERY 2
         SELECT account.account_number,
         sum (usage.toll_airtime),
         sum (usage.toll_price)                                                              Column-store           Row-store
         FROM usage, toll, source, account
         WHERE usage.toll_id = toll.toll_id
                                                                             Query 1         2.06                   300
         AND usage.source_id = source.source_id                              Query 2         2.20                   300
         AND usage.account_id = account.account_id
         AND toll.type_ind in (‘AE’. ‘AA’)
                                                                             Query 3         0.09                   300
         AND usage.toll_price > 0                                            Query 4         5.24                   300
         AND source.type != ‘CIBER’
         AND toll.rating_method = ‘IS’
                                                                             Query 5         2.88                   300
         AND usage.invoice_date = 20051013
         GROUP BY account.account_number
                                                                                   Why? Three main factors (next slides)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                           5
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

   Telco example explained (1/3):
   read efficiency
                       row store                                                                        column store




 read pages containing entire rows                                                    read only columns needed

 one row = 212 columns!                                                               in this example: 7 columns

 is this typical? (it depends)                                                        caveats:
                                                                                      • “select * ” not any faster
                                                                                      • clever disk prefetching
    What about vertical partitioning?
                                                                                      • clever tuple reconstruction
    (it does not work with ad-hoc
    queries)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                  6
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

   Telco example explained (2/3):
   compression efficiency
   l     Columns compress better than rows
         l      Typical row-store compression ratio 1 : 3
         l      Column-store 1 : 10

   l     Why?
         l     Rows contain values from different domains
               => more entropy, difficult to dense-pack
         l     Columns exhibit significantly less entropy
         l     Examples:                 Male, Female, Female, Female, Male
                                                                    1998, 1998, 1999, 1999, 1999, 2000

         l      Caveat: CPU cost (use lightweight compression)


             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       7
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

   Telco example explained (3/3):
   sorting & indexing efficiency
   l     Compression and dense-packing free up space
         l      Use multiple overlapping column collections
         l      Sorted columns compress better
         l      Range queries are faster
         l      Use sparse clustered indexes


          What about heavily-indexed row-stores?
          (works well for single column access,
          cross-column joins become increasingly expensive)




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       8
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Additional opportunities for column-stores
   l     Block-tuple / vectorized processing
         l      Easier to build block-tuple operators
                l   Amortizes function-call cost, improves CPU cache performance
         l      Easier to apply vectorized primitives
                l   Software-based: bitwise operations
                l   Hardware-based: SIMD              Part 3

   l     Opportunities with compressed columns
         l      Avoid decompression: operate directly on compressed
         l      Delay decompression (and tuple reconstruction)
                                                                                                                      more
                l   Also known as: late materialization                                                             in Part 2

   l     Exploit columnar storage in other DBMS components
         l      Physical design (both static and dynamic)                                                   See: Database
                                                                                                            Cracking, from CWI
             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                     9
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
                                                                                                        “Column-Stores vs Row-Stores:
                                                                                                        How Different are They
 Effect on C-Store performance                                                                          Really?” Abadi, Hachem, and
                                                                                                        Madden. SIGMOD 2008.

                                              Average for SSBM queries on C-store


                                                                                                                   original
                    Time (sec)




                                                                                                                   C-store




                                                                                                                     enable
                                                                                                                       late
                                 column-oriented                                  enable                          materialization
                                  join algorithm                              compression &
                                                                          operate on compressed


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                          10
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Summary of column-store key features
                                                                                 columnar storage
                                                                                                                   Part 1
                                                                                    header/ID elimination
   l     Storage layout
                                                                                        compression         Part 2      Part 3

                                                                                            multiple sort orders



                                                                               column operators               Part 1        Part 2

                                                                                  avoid decompression
   l     Execution engine                                                                                          Part 2
                                                                                       late materialization

                                                                                          vectorized operations         Part 3

   l     Design tools, optimizer
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                        11
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Outline
   l     Part 1: Basic concepts — Stavros
         l      Introduction to key features
         l      From DSM to column-stores and performance tradeoffs
         l      Column-store architecture overview
         l      Will rows and columns ever converge?

   l     Part 2: Column-oriented execution — Daniel

   l     Part 3: MonetDB/X100 and CPU efficiency — Peter



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       12
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   From DSM to Column-stores
                                       TOD: Time Oriented Database – Wiederhold et al.
   70s -1985:                          "A Modular, Self-Describing Clinical Databank
                                       System," Computers and Biomedical Research, 1975
                                       More 1970s: Transposed files, Lorie, Batory,
                                       Svensson.
                                       “An overview of cantor: a new system for data analysis”
                                        Karasalo, Svensson, SSDBM 1983

   1985: DSM paper                                  “A decomposition storage model”
                                                    Copeland and Khoshafian. SIGMOD 1985.

   1990s: Commercialization through SybaseIQ
   Late 90s – 2000s: Focus on main-memory performance
         l      DSM “on steroids” [1997 – now]                                              CWI: MonetDB

         l      Hybrid DSM/NSM [2001 – 2004]                                               Wisconsin: PAX, Fractured Mirrors

                                                                              Michigan: Data Morphing                  CMU: Clotho
   2005 – : Re-birth of read-optimized DSM as “column-store”
                            MIT: C-Store                    CWI: MonetDB/X100                               10+ startups
             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                         13
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


                                                                               “A decomposition storage
The original DSM paper                                                         model” Copeland and
                                                                               Khoshafian. SIGMOD 1985.
         l      Proposed as an alternative to NSM
         l      2 indexes: clustered on ID, non-clustered on value
         l      Speeds up queries projecting few columns
         l      Requires more storage                              value
                                                                                          ID                0100 0962 1000 ..
                                                                             1 2 3 4 ..




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                   14
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Memory wall and PAX
  l     90s: Cache-conscious research
         “Cache Conscious Algorithms for
   from: Relational Query Processing.”
         Shatdal, Kant, Naughton. VLDB 1994.
                                                                                         “DBMSs on a modern processor:
    “Database Architecture Optimized for                                            and: Where does time go?” Ailamaki,
to: the New Bottleneck: Memory Access.”                                                  DeWitt, Hill, Wood. VLDB 1999.
    Boncz, Manegold, Kersten. VLDB 1999.


  l     PAX: Partition Attributes Across
        l     Retains NSM I/O pattern
        l     Optimizes cache-to-RAM communication
          “Weaving Relations for Cache Performance.”
          Ailamaki, DeWitt, Hill, Skounakis, VLDB 2001.


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems             15
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   More hybrid NSM/DSM schemes
   l     Dynamic PAX: Data Morphing
                                                               “Data morphing: an adaptive, cache-conscious
                                                               storage technique.” Hankins, Patel, VLDB 2003.

   l     Clotho: custom layout using scatter-gather I/O
                            “Clotho: Decoupling Memory Page Layout from Storage Organization.”
                            Shao, Schindler, Schlosser, Ailamaki, and Ganger. VLDB 2004.

   l     Fractured mirrors
         l      Smart mirroring with both NSM/DSM copies
                                                                                                            “A Case For Fractured
                                                                                                            Mirrors.” Ramamurthy,
                                                                                                            DeWitt, Su, VLDB 2002.




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                        16
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   MonetDB (more in Part 3)
   l     Late 1990s, CWI: Boncz, Manegold, and Kersten
   l     Motivation:
         l      Main-memory
         l      Improve computational efficiency by avoiding expression
                interpreter
         l      DSM with virtual IDs natural choice
         l      Developed new query execution algebra
   l     Initial contributions:
         l      Pointed out memory-wall in DBMSs
         l      Cache-conscious projections and joins
         l      …

             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       17
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   2005: the (re)birth of column-stores
   l     New hardware and application realities
         l      Faster CPUs, larger memories, disk bandwidth limit
         l      Multi-terabyte Data Warehouses
   l     New approach: combine several techniques
         l     Read-optimized, fast multi-column access,
               disk/CPU efficiency, light-weight compression

   l     C-store paper:
         l      First comprehensive design description of a column-store
   l     MonetDB/X100
         l      “proper” disk-based column store
   l     Explosion of new products
             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       18
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Performance tradeoffs: columns vs. rows
    DSM traditionally was not favored by technology trends
    How has this changed?

    l     Optimized DSM in “Fractured Mirrors,” 2002
    l     “Apples-to-apples” comparison “Performance Tradeoffs in Read-
                                                                                        Optimized Databases”
                                                                                        Harizopoulos, Liang, Abadi,
                                                                                        Madden, VLDB’06
    l     Follow-up study                           “Read-Optimized Databases, In-
                                                    Depth” Holloway, DeWitt, VLDB’08
    l     Main-memory DSM vs. NSM
                                             “DSM vs. NSM: CPU performance tradeoffs in block-oriented
                                             query processing” Boncz, Zukowski, Nes, DaMoN’08
    l     Flash-disks: a come-back for PAX?                        “Query Processing Techniques
             “Fast Scans and Joins Using Flash                     for Solid State Drives”
             Drives” Shah, Harizopoulos,                           Tsirogiannis, Harizopoulos,
             Wiener, Graefe. DaMoN’08
            VLDB 2009 Tutorial
                                                                   Shah, Wiener, Graefe,
                                            Column-Oriented Database Systems                   19
                                                                   SIGMOD’09
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Fractured mirrors: a closer look
   l     Store DSM relations inside a B-tree                                                                “A Case For Fractured
                                                                                                            Mirrors” Ramamurthy,
         l       Leaf nodes contain values                                                                  DeWitt, Su, VLDB 2002.
         l       Eliminate IDs, amortize header overhead
         l       Custom implementation on Shore
                                                                                                 sparse
        Tuple           TID       Column
        Header                     Data                                                       B-tree on ID
                        1         a1                                                                 3
                        2         a2
                        3         a3                          1      a1        2      a2        3      a3        4       a4     5   a5

                        4         a4                                            a1      a2 a3                                 a4 a5
                                                                         1                                           4
                        5         a5
                                                          Similar: storage density “Efficient columnar
                                                          comparable               storage in B-trees” Graefe.
                                                          to column stores         Sigmod Record 03/2007.
             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                               20
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




     Fractured mirrors: performance
     From PAX paper:                                                                                        column?

                                                                                               time
                                                                                                                      row
                                                            regular DSM
                                                                                                                      column?



                                                                                                            columns projected:
                                                                                                            1    2    3 4 5


                                                                                                                            optimized
 l     Chunk-based tuple merging                                                                                              DSM
       l     Read in segments of M pages
       l     Merge segments in memory
       l     Becomes CPU-bound after 5 pages

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                           21
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
                                                                       “Performance Tradeoffs in Read-
   Column-scanner                                                      Optimized Databases”
   implementation                                                      Harizopoulos, Liang, Abadi,
                                                                       Madden, VLDB’06
      row scanner                                                                             column scanner

                                                                                                            Joe 45
                                                                                                            …   …
                      Joe 45
                      …   …                         SELECT name, age
                                                     WHERE age > 40
 apply                                                                                                           S
 predicate(s)
                            S                                                                                        #POS 45
                                                                                 Joe                                 #POS …
                                    Direct I/O                                   Sue
                                                                                 …
                              prefetch ~100ms                                                                 apply
  1 Joe 45                      worth of data                                                           predicate #1    S
  2 Sue 37
  ……     …
                                                                                                            45
                                                                                                            37
                                                                                                            …
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                          22
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Scan performance
   l     Large prefetch hides disk seeks in columns
   l     Column-CPU efficiency with lower selectivity
   l     Row-CPU suffers from memory stalls                                                                     not shown,
   l     Memory stalls disappear in narrow tuples                                                           details in the paper

   l     Compression: similar to narrow




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                     23
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


                                                                      “Read-Optimized Databases, In-
      Even more results                                               Depth” Holloway, DeWitt, VLDB’08
                                                                                       35
  •     Same engine as before                                           narrow & compressed tuple:
  •     Additional findings                                                  30
                                                                                CPU-bound!
                                                                                                   C-25%
                                                                                       25
                                                                                                   C-10%
                                                                                                   R-50%
                                                                                       20




                                                                                Time (s)
                                                                                       15



                                                                                       10



                                                                                           5
                                 wide attributes:
                                 same as before                                            0
                                                                                               1   2   3    4       5    6    7    8   9   10
                                                                                                                Columns Returned




      Non-selective queries, narrow tuples, favor well-compressed rows
      Materialized views are a win
                                                        Column-joins are
      Scan times determine early materialized joins     covered in part 2!
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                      24
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Speedup of columns over rows
                                                                                             “Performance Tradeoffs in Read-
                                                                                             Optimized Databases”
                                                                                             Harizopoulos, Liang, Abadi,
                           cycles per disk byte   144                                        Madden, VLDB’06


                                                  72

          (cpdb)                                  36
                                                                                                  +++
                                                  18
                                                            _ = + ++
                                                   9
                                                        8   12   16       20        24         28           32   36
                                                                      tuple width
   l     Rows favored by narrow tuples and low cpdb
         l      Disk-bound workloads have higher cpdb
             VLDB 2009 Tutorial                                        Column-Oriented Database Systems                25
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




    Varying prefetch size
                           no competing
                            disk traffic
                40
                                                                                                     Column 2
   time (sec)




                30
                                                                                                     Column 8
                20                                                                                   Column 16
                                                                                                     Column 48 (x 128KB)
                10
                                                                                                 Row (any prefetch size)
                 0
                     4         8      12          16         20          24         28          32
                               selected bytes per tuple

    l           No prefetching hurts columns in single scans

                 VLDB 2009 Tutorial                                  Column-Oriented Database Systems                      26
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Varying prefetch size
                                                                           with competing disk traffic
                                                             40            Column, 48                    40



                                                time (sec)
                                                             30            Row, 48                       30

                                                             20                                          20

                                                             10                                          10              Column, 8
                                                                                                                         Row, 8
                                                              0                                             0
                                                                  4     12        20        28                  4   12   20    28
                                                                      selected bytes per tuple

   l     No prefetching hurts columns in single scans
   l     Under competing traffic, columns outperform rows for
         any prefetch size
            VLDB 2009 Tutorial                                        Column-Oriented Database Systems                         27
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                              “DSM vs. NSM: CPU performance trade
   CPU Performance                                            offs in block-oriented query processing”
                                                              Boncz, Zukowski, Nes, DaMoN’08
         l      Benefit in on-the-fly conversion between NSM and DSM
         l      DSM: sequential access (block fits in L2), random in L1
         l      NSM: random access, SIMD for grouped Aggregation




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       28
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   New storage technology: Flash SSDs
      l     Performance characteristics
            l     very fast random reads, slow random writes
            l     fast sequential reads and writes
      l     Price per bit (capacity follows)
            l     cheaper than RAM, order of magnitude more expensive than Disk
      l     Flash Translation Layer introduces unpredictability
            l     avoid random writes!
      l     Form factors not ideal yet
            l     SSD (Ł small reads still suffer from SATA overhead/OS limitations)
            l     PCI card (Ł high price, limited expandability)


      l     Boost Sequential I/O in a simple package
            l     Flash RAID: very tight bandwidth/cm3 packing (4GB/sec inside the box)
      l     Column Store Updates
            l     useful for delta structures and logs
      l     Random I/O on flash fixes unclustered index access
            l     still suboptimal if I/O block size > record size
            l     therefore column stores profit mush less than horizontal stores
      l     Random I/O useful to exploit secondary, tertiary table orderings
            l     the larger the data, the deeper clustering one can exploit

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       29
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




  Even faster column scans on flash SSDs
                                                                                        30K Read IOps, 3K Write Iops
   l     New-generation SSDs                                                            250MB/s Read BW, 200MB/s Write
         l      Very fast random reads, slower random writes
         l      Fast sequential RW, comparable to HDD arrays
   l     No expensive seeks across columns
   l     FlashScan and Flashjoin: PAX on SSDs, inside Postgres
                                                                                        “Query Processing Techniques for
                                                                                        Solid State Drives” Tsirogiannis,
                                                                                        Harizopoulos, Shah, Wiener, Graefe,
                                                                                        SIGMOD’09

                                                                                                      mini-pages with no
                                                                                                      qualified attributes are
                                                                                                      not accessed



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                            30
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Column-scan performance over time
       regular DSM (2001)
                                                                                   column-store (2006)      ..to 1.2x slower

                                               from 7x slower




                                              ..to 2x slower
                                                                                                            ..to same




                                                                 and 3x faster!

     optimized DSM (2002)                                                SSD Postgres/PAX (2009)

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                   31
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Outline
   l     Part 1: Basic concepts — Stavros
         l      Introduction to key features
         l      From DSM to column-stores and performance tradeoffs
         l      Column-store architecture overview
         l      Will rows and columns ever converge?

   l     Part 2: Column-oriented execution — Daniel

   l     Part 3: MonetDB/X100 and CPU efficiency — Peter



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       32
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Architecture of a column-store
                            storage layout
   l     read-optimized: dense-packed, compressed
   l     organize in extends, batch updates
   l     multiple sort orders
   l     sparse indexes                         engine
                                      l block-tuple operators

                                      l new access methods

              system-level            l optimized relational operators

    l     system-wide column support
    l     loading / updates
    l     scaling through multiple nodes
    l     transactions / redundancy
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       33
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
                                                                            “C-Store: A Column-Oriented
                                                                            DBMS.” Stonebraker et al.
   C-Store                                                                  VLDB 2005.

   l     Compress columns
   l     No alignment
   l     Big disk blocks
   l     Only materialized views (perhaps many)
   l     Focus on Sorting not indexing
   l     Data ordered on anything, not just time
   l     Automatic physical DBMS design
   l     Optimize for grid computing
   l     Innovative redundancy
   l     Xacts – but no need for Mohan
   l     Column optimizer and executor

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       34
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   C-Store: only materialized views (MVs)
   l     Projection (MV) is some number of columns from a fact table
   l     Plus columns in a dimension table – with a 1-n join between
         Fact and Dimension table
   l     Stored in order of a storage key(s)
   l     Several may be stored!
   l     With a permutation, if necessary, to map between them
   l     Table (as the user specified it and sees it) is not stored!
   l     No secondary indexes (they are a one column sorted MV plus
         a permutation, if you really want one)
        User view:                                                   Possible set of MVs:
        EMP (name, age, salary, dept)                                MV-1 (name, dept, floor) in floor order
        Dept (dname, floor)                                          MV-2 (salary, age) in age order
                                                                     MV-3 (dname, salary, name) in salary order

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       35
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Continuous Load and Query (Vertica)
                   Hybrid Storage Architecture



                          > Write Optimized                                                       > Read Optimized
                            Store (WOS)                                                             Store (ROS)
  Trickle                                                                                           • On disk
   Load                                                                                             • Sorted / Compressed
                                                                      TUPLE MOVER
                                                                       Asynchronous                 • Segmented
                                  A         B         C
                                                                       Data Transfer                • Large data loaded direct



                           §Memory based
                                                                                                            A      B      C
                           §Unsorted / Uncompressed
                           §Segmented
                           §Low latency / Small quick                                                           (A B C | A)
                            inserts
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                            36
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Loading Data (Vertica)

       > INSERT, UPDATE, DELETE                                                                             Write-Optimized
                                                                                                             Store (WOS)
       > Bulk and Trickle Loads                                                                               In-memory
             §COPY
                                                                                                       Automatic
             §COPY DIRECT                                                                              Tuple Mover

       > User loads data into logical Tables
       > Vertica loads atomically into
         storage
                                                                                                             Read-Optimized
                                                                                                               Store (ROS)
                                                                                                                 On-disk




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                         37
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Applications for column-stores
         l      Data Warehousing
                l   High end (clustering)
                l   Mid end/Mass Market
                l   Personal Analytics
         l      Data Mining
                l   E.g. Proximity
         l      Google BigTable
         l      RDF
                l   Semantic web data management
         l      Information retrieval
                l   Terabyte TREC
         l      Scientific datasets
                l   SciDB initiative
                l   SLOAN Digital Sky Survey on MonetDB




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       38
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   List of column-store systems
         l      Cantor (history)
         l      Sybase IQ
         l      SenSage (former Addamark Technologies)
         l      Kdb
         l      1010data
         l      MonetDB
         l      C-Store/Vertica
         l      X100/VectorWise
         l      KickFire
         l      SAP Business Accelerator
         l      Infobright
         l      ParAccel
         l      Exasol
             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       39
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Outline
   l     Part 1: Basic concepts — Stavros
         l      Introduction to key features
         l      From DSM to column-stores and performance tradeoffs
         l      Column-store architecture overview
         l      Will rows and columns ever converge?

   l     Part 2: Column-oriented execution — Daniel

   l     Part 3: MonetDB/X100 and CPU efficiency — Peter



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       40
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Simulate a Column-Store inside a Row-Store
                                          Date     Store     Product Customer             Price


                                          01/01 BOS Table                     Mesa        $20

                                          01/01 NYC Chair                     Lutz        $13                    Option B:
                                                                                                            Index Every Column
                                          01/01 BOS            Bed          Mudd          $79
       Option A:                                                                                                 Date Index
  Vertical Partitioning

          Date              Store            Product           Customer                  Price
       TID    Value       TID     Value     TID   Value        TID    Value        TID    Value

        1 01/01           1     BOS         1 Table             1    Mesa          1      $20
                                                                                                                 Store Index
        2 01/01           2     NYC         2 Chair             2     Lutz         2      $13

        3 01/01           3     BOS         3 Bed               3    Mudd          3      $79



                                                                                                                    …
             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                          41
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Simulate a Column-Store inside a Row-Store
                                          Date     Store     Product Customer              Price


                                          01/01 BOS Table                   Mesa          $20

                                          01/01 NYC Chair                    Lutz         $13                    Option B:
                                                                                                            Index Every Column
                                          01/01 BOS            Bed          Mudd          $79
       Option A:                                                                                                 Date Index
  Vertical Partitioning

              Date                    Store          Product        Customer               Price
     Value     StartPos Length      TID    Value    TID    Value     TID   Value     TID     Value

    01/01          1        3       1     BOS       1 Table 1              Mesa       1     $20
                                                                                                                 Store Index
                                    2     NYC       2 Chair 2               Lutz      2     $13

 Can explicitly run-                3     BOS       3 Bed 3                Mudd       3     $79
 length encode date
 “Teaching an Old Elephant New Tricks.”
 Bruno, CIDR 2009.
                                                                                                                    …
             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                          42
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                 Experiments
                 l    Star Schema Benchmark (SSBM)                             Adjoined Dimension Column Index (ADC Index)
                                                                               to Improve Star Schema Query Performance”.
                                                                               O’Neil et. al. ICDE 2008.

                 l    Implemented by professional DBA
                 l    Original row-store plus 2 column-store
                      simulations on same row-store product
                     250.0
                                                                                                     “Column-Stores vs Row-Stores:
                     200.0
                                                                                                     How Different are They Really?”
                                                                                                     Abadi, Hachem, and Madden.
Time (seconds)




                     150.0                                                                           SIGMOD 2008.

                     100.0

                      50.0

                       0.0
                                                   Vertically Partitioned      Row-Store With All
                                Normal Row-Store
                                                         Row-Store                  Indexes
                 Average                25.7                79.9                        221.2

                         VLDB 2009 Tutorial                          Column-Oriented Database Systems                         43
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   What’s Going On? Vertical Partitions
   l     Vertical partitions in row-stores:
         l      Work well when workload is known
         l      ..and queries access disjoint sets of columns
         l      See automated physical design
                                                                                                            Tuple    TID   Column
                                                                                                            Header          Data

                                                                                                                     1
   l     Do not work well as full-columns
                                                                                                                     2
         l      TupleID overhead significant
                                                                                                                     3
         l      Excessive joins
                                                                          Queries touch 3-4 foreign keys in fact table,
                                                                          1-2 numeric columns
   “Column-Stores vs. Row-Stores:                                         Complete fact table takes up ~4 GB
   How Different Are They Really?”                                        (compressed)
   Abadi, Madden, and Hachem.                                             Vertically partitioned tables take up 0.7-1.1
   SIGMOD 2008.                                                           GB (compressed)

             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                          44
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




    What’s Going On? All Indexes Case
l    Tuple construction
     l     Common type of query:                                      SELECT store_name, SUM(revenue)
                                                                      FROM Facts, Stores
                                                                      WHERE fact.store_id = stores.store_id
                                                                         AND stores.country = “Canada”
                                                                      GROUP BY store_name


     l     Result of lower part of query plan is a set of TIDs that passed
           all predicates
     l     Need to extract SELECT attributes at these TIDs
            l    BUT: index maps value to TID
            l    You really want to map TID to value (i.e., a vertical partition)
            Tuple construction is SLOW

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems         45
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   So….
   l     All indexes approach is a poor way to simulate a column-store
   l     Problems with vertical partitioning are NOT fundamental
           l    Store tuple header in a separate partition
           l    Allow virtual TIDs
           l    Combine clustered indexes, vertical partitioning
   l     So can row-stores simulate column-stores?
           l    Might be possible, BUT:
                 l Need better support for vertical partitioning at the storage layer
                 l Need support for column-specific optimizations at the executer level
                 l Full integration: buffer pool, transaction manager, ..

           l    When will this happen?                                                        See Part 2, Part 3
                  l   Most promising features = soon                                          for most promising
                                                                                                   features
                  l   ..unless new technology / new objectives change the game
                       (SSDs, Massively Parallel Platforms, Energy-efficiency)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems              46
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   End of Part 1
   l     Basic concepts — Stavros
         l      Introduction to key features
         l      From DSM to column-stores and performance tradeoffs
         l      Column-store architecture overview
         l      Will rows and columns ever converge?

   l     Part 2: Column-oriented execution — Daniel

   l     Part 3: MonetDB/X100 and CPU efficiency — Peter



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       47
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Part 2 Outline

   l         Compression

   l         Tuple Materialization

   l         Joins




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       48
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems

                                                                                    Compression
 “Super-Scalar RAM-CPU Cache Compression”
 Zukowski, Heman, Nes, Boncz, ICDE’06

 “Integrating Compression and Execution in Column-
 Oriented Database Systems” Abadi, Madden, and
 Ferreira, SIGMOD ’06

 •Query optimization in compressed database
 systems” Chen, Gehrke, Korn, SIGMOD’01
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Compression
      l     Trades I/O for CPU
      l     Increased column-store opportunities:
              l    Higher data value locality in column stores
              l    Techniques such as run length encoding far more
                   useful
              l    Can use extra space to store multiple copies of
                   data in different sort orders




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       50
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Run-length Encoding


    Quarter Product ID Price                                                   Quarter                         Product ID               Price
                                                                          (value, start_pos, run_length)    (value, start_pos, run_length)
         Q1                 1                  5
                                                                           (Q1, 1, 300)                            (1, 1, 5)                 5
         Q1                 1                  7
                                                                                                                   (2, 6, 2)                 7
         Q1                 1                  2                           (Q2, 301, 350)                                                    2
         Q1                 1                  9                                                                       …
                                                                           (Q3, 651, 500)                                                    9
         Q1                 1                  6                                                                (1, 301, 3)                  6
         Q1                 2                  8                           (Q4, 1151, 600)                      (2, 304, 1)                  8
         Q1                 2                  5
         …                  …                  …                                                                       …                     5
                                                                                                                                             …
         Q2                 1                  3
                                                                                                                                             3
         Q2                 1                  8
                                                                                                                                             8
         Q2                 1                  1
                                                                                                                                             1
         Q2                 2                  4
         …                  …                  …                                                                                             4
                                                                                                                                             …
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                        51
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
                                           “Integrating Compression and Execution in Column-
                                           Oriented Database Systems” Abadi et. al, SIGMOD ’06

   Bit-vector Encoding

     l     For each unique      Product ID                                                  ID: 1           ID: 2   ID: 3   …
           value, v, in column
           c, create bit-vector    1                                                          1             0        0      0
           b                       1                                                          1             0        0      0
             l   b[i] = 1 if c[i] = v                          1                              1             0        0      0
                                                               1                              1             0        0      0
     l     Good for columns
                                                               1                              1             0        0      0
           with few unique
                                                               2                              0             1        0      0
           values
                                                               2                              0             1        0      0
     l     Each bit-vector                                     …                              …             …        …      …
           can be further                                      1                              1             0        0      0
           compressed if                                       1                              1             0        0      0
           sparse                                              2                              0             1        0      0
                                                               3                              0             0        1      0
                                                               …                              …             …        …      …

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                           52
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
                                           “Integrating Compression and Execution in Column-
                                           Oriented Database Systems” Abadi et. al, SIGMOD ’06

   Dictionary Encoding

                                                                                 Quarter
  l     For each                                   Quarter
        unique value                                                                  0                         Quarter
                                                                                      1
        create                                          Q1                            3
        dictionary entry                                                              0                            24
                                                        Q2                            2
  l     Dictionary can                                                                0                           128
                                                        Q4                            0
        be per-block or                                                               0                           122
                                                        Q1                            1
        per-column                                                                    3

  l     Column-stores
                                                        Q3
                                                        Q1
                                                                                      2
                                                                                      2             OR             +
        have the                                        Q1                            +                        Dictionary
        advantage that                                                         Dictionary
        dictionary                                      Q1
                                                                                                             24: Q1, Q2, Q4, Q1
        entries may                                     Q2                         0: Q1                             …
        encode multiple                                 Q4                         1: Q2                    122: Q2, Q4, Q3, Q3
        values at once                                  Q3                         2: Q3                             …
                                                        Q3                         3: Q4                    128: Q3, Q1, Q1, Q1
                                                        …

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                       53
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Frame Of Reference Encoding
                                                                               Price                        Price
     l     Encodes values as b bit                                                                          Frame: 50
           offset from chosen frame                                             45
                                                                                                              -5
                                                                                54
           of reference                                                                                        4
                                                                                48
     l     Special escape code (e.g.                                                                          -2
                                                                                55                                      4 bits per
           all bits set to 1) indicates                                                                        5        value
                                                                                51
                                                                                                               1
           a difference larger than                                             53
                                                                                                               3
           can be stored in b bits                                              40
                                                                                                              ∞
             l   After escape code,                                             50
                                                                                                              40
                 original (uncompressed)                                        49                                        Exceptions (see
                                                                                                               0          part 3 for a better
                 value is written                                               62                                        way to deal with
                                                                                                              -1          exceptions)
                                                                                52
                                                                                                              ∞
   “Compressing Relations and Indexes                                           50
                                                                                …                             62
   ” Goldstein, Ramakrishnan, Shaft,                                                                           2
   ICDE’98                                                                                                     0
                                                                                                              …
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                   54
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Differential Encoding
                                                                                    Time                    Time
     l     Encodes values as b bit offset
           from previous value
     l     Special escape code (just like                                            5:00                    5:00
           frame of reference encoding)                                              5:02                     2
           indicates a difference larger than
           can be stored in b bits                                                   5:03                     1
                                                                                                                    2 bits per
             l   After escape code, original                                         5:03                     0     value
                 (uncompressed) value is written                                     5:04                     1
     l     Performs well on columns
           containing increasing/decreasing                                          5:06                     2
           sequences                                                                 5:07                     1
             l   inverted lists                                                      5:08                     1
             l   timestamps
             l   object IDs                                                          5:10                     2
                                                                                                                    Exception (see
             l   sorted / clustered columns                                          5:15                    ∞      part 3 for a better
                                                                                                                    way to deal with
                                                                                     5:16                   5:15    exceptions)
      “Improved Word-Aligned Binary                                                  5:16                     1
      Compression for Text Indexing”                                                  …                       0
      Ahn, Moffat, TKDE’06


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                           55
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   What Compression Scheme To Use?
                                                Does column appear in the sort key?

                                             yes                                    no
                 Is the average                                                                  Are number of unique
                 run-length > 2                                                                     values < ~50000
               yes                no

                                                                                                            yes
                                 Differential
           RLE
                                 Encoding
                                                                      no                 Does this column appear frequently
                                                                                               in selection predicates?


                                                                                                     yes          no
                           Is the data numerical
                         and exhibit good locality?

                        yes                       no
                                                                                            Bit-vector             Dictionary
                                                                                           Compression            Compression
         Frame of Reference                       Leave Data
             Encoding                            Uncompressed

                                                       OR
                                                  Heavyweight
                                                  Compression


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                           56
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                        “Super-Scalar RAM-CPU Cache Compression”
                                                        Zukowski, Heman, Nes, Boncz, ICDE’06
   Heavy-Weight Compression Schemes




   l     Modern disk arrays can achieve > 1GB/s
   l     1/3 CPU for decompression Ł 3GB/s needed
     Ł     Lightweight compression schemes are better
     Ł     Even better: operate directly on compressed data

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       57
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                           “Integrating Compression and Execution in Column-
                                           Oriented Database Systems” Abadi et. al, SIGMOD ’06

     Operating Directly on Compressed Data


     l     I/O - CPU tradeoff is no longer a tradeoff
     l     Reduces memory–CPU bandwidth requirements
     l     Opens up possibility of operating on multiple
           records at once




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       58
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                           “Integrating Compression and Execution in Column-
                                           Oriented Database Systems” Abadi et. al, SIGMOD ’06

     Operating Directly on Compressed Data
           Quarter                                       Product ID
                                                       1        2         3
                                                       …        …         …
                                                                                                            ProductID, COUNT(*))
      (Q1, 1, 300)                                     1        0         0
                                    301-306            0        0         1                                       (1, 3)
      (Q2, 301, 6)                                     0        1         0
                                                                                                                  (2, 1)
      (Q3, 307, 500)                                   1        0         0
                                                       1        0         0                                       (3, 2)
      (Q4, 807, 600)                                   0        0         1
                                                       0        1         0
                                                                                               SELECT ProductID, Count(*)
                                                       1        0         0
                                                                                               FROM table
                                                       0        0         1                    WHERE (Quarter = Q2)
                                                       0        1         0                    GROUP BY ProductID
         Index Lookup + Offset jump                    0        0         1
                                                       …        …         …

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                              59
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                           “Integrating Compression and Execution in Column-
                                           Oriented Database Systems” Abadi et. al, SIGMOD ’06

     Operating Directly on Compressed Data
                                                                                                            SELECT ProductID, Count(*)
                                                                     Block API                              FROM table
                                                                                                            WHERE (Quarter = Q2)
                                                         Data                                               GROUP BY ProductID
          Aggregation                                    isOneValue()
           Operator                                      isValueSorted()
                                                         isPosContiguous()
                                                         isSparse()

             Selection                                   getNext()
             Operator                                    decompressIntoArray()                                  (Q1, 1, 300)
                                                         getValueAtPosition(pos)
                                                                                                                (Q2, 301, 6)
                                                         getMin()
         Compression-                                    getMax()                                               (Q3, 307, 500)
          Aware Scan                                     getSize()
                                                                                                                (Q4, 807, 600)
           Operator

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                            60
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems
                                    Tuple Materialization and
                              Column-Oriented Join Algorithms
  “Materialization Strategies in a Column-                       “Query Processing Techniques for
  Oriented DBMS” Abadi, Myers, DeWitt,                           Solid State Drives” Tsirogiannis,
  and Madden. ICDE 2007.                                         Harizopoulos Shah, Wiener, and
                                                                 Graefe. SIGMOD 2009.
  “Self-organizing tuple reconstruction in
  column-stores“, Idreos, Manegold,                              “Cache-Conscious Radix-Decluster
  Kersten, SIGMOD’09                                             Projections”, Manegold, Boncz, Nes,
                                                                 VLDB’04
  “Column-Stores vs Row-Stores: How
  Different are They Really?” Abadi,
  Madden, and Hachem. SIGMOD 2008.
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




             When should columns be projected?
             l    Where should column projection operators be placed in a query
                  plan?
                    l    Row-store:
                            l    Column projection involves removing unneeded columns from
                                 tuples
                            l    Generally done as early as possible
                    l    Column-store:
                            l    Operation is almost completely opposite from a row-store
                            l    Column projection involves reading needed columns from storage
                                 and extracting values for a listed set of tuples
                                    §   This process is called “materialization”
                            l    Early materialization: project columns at beginning of query plan
                                    §   Straightforward since there is a one-to-one mapping across columns
                            l    Late materialization: wait as long as possible for projecting columns
                                    §   More complicated since selection and join operators on one column obfuscates
                                        mapping to other columns from same table
                            l    Most column-stores construct tuples and column projection time
                                    §   Many database interfaces expect output in regular tuples (rows)
                                    §   Rest of discussion will focus on this case


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems           62
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




             When should tuples be constructed?

                                     Select + Aggregate                                                 QUERY:

                         4       2    2 7                                                               SELECT custID,SUM(price)
                                                                                                        FROM table
                         4       1    3 13                                                              WHERE (prodID = 4) AND
                         4       3    3 42                                                                     (storeID = 1) AND
                                                                                                        GROUP BY custID
                         4       1    3 80

                             Construct                           l     Solution 1: Create rows first (EM).
                                                                       But:
                                                                         l   Need to construct ALL tuples
         (4,1,4)             2        2         7
                             1        3         13                       l   Need to decompress data
                             3        3         42                       l   Poor memory bandwidth
                             1        3         80                           utilization
          prodID        storeID      custID price


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                       63
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Solution 2: Operate on columns

                                                                                                        QUERY:
                                                                       1          0
                                                                                                        SELECT custID,SUM(price)
                                                                       1          1
                             AGG                                                                        FROM table
                                                                       1          0                     WHERE (prodID = 4) AND
                                                                       1          1                            (storeID = 1) AND
                     Data               Data                                                            GROUP BY custID
                    Source             Source
                    custID              price
                                                                    Data        Data
                                                                   Source      Source

                              AND
                                                                                                              4      2      2      7
                                                                       4          2                           4      1      3      13
                       Data          Data
                                                                       4          1                           4      3      3      42
                      Source        Source                             4          3                           4      1      3      80
                      prodID        storeID
                                                                       4          1                         prodID storeID custID price
                                                                   prodID      storeID



            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                 64
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Solution 2: Operate on columns

                                                                                                        QUERY:

                                                                            0                           SELECT custID,SUM(price)
                             AGG                                                                        FROM table
                                                                            1
                                                                                                        WHERE (prodID = 4) AND
                                                                            0                                  (storeID = 1) AND
                     Data               Data                                1                           GROUP BY custID
                    Source             Source
                    custID              price

                                                                         AND
                              AND
                                                                                                              4      2      2      7
                                                                     1           0                            4      1      3      13
                       Data          Data
                                                                     1           1                            4      3      3      42
                      Source        Source                           1           0                            4      1      3      80
                      prodID        storeID
                                                                     1           1                          prodID storeID custID price




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                 65
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Solution 2: Operate on columns

                                                                                                        QUERY:
                                                                     3              13
                                                                                    1                   SELECT custID,SUM(price)
                             AGG                                     3              80
                                                                                    1                   FROM table
                                                                                                        WHERE (prodID = 4) AND
                                                                                                               (storeID = 1) AND
                                                           2
                                                           0                                  7
                                                                                              0         GROUP BY custID
                     Data               Data
                    Source             Source              3
                                                           1       Data            Data       13
                                                                                              1
                    custID              price                     Source          Source
                                                           3
                                                           0                                  42
                                                                                              0
                                                           3
                                                           1                                  80
                                                                                              1
                              AND                       custID                                price

                                                                                                              4      2      2      7
                                                                             0                                4      1      3      13
                       Data          Data
                                                                             1                                4      3      3      42
                      Source        Source                                   0                                4      1      3      80
                      prodID        storeID
                                                                             1                              prodID storeID custID price




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                 66
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Solution 2: Operate on columns

                                                                                                        QUERY:
                                                                                                        SELECT custID,SUM(price)
                             AGG                                                                        FROM table
                                                                                                        WHERE (prodID = 4) AND
                                                                         3 1
                                                                         1 93                                  (storeID = 1) AND
                     Data               Data                                                            GROUP BY custID
                    Source             Source
                    custID              price


                                                                          AGG
                              AND
                                                                                                              4      2      2      7
                                                                                                              4      1      3      13
                       Data          Data
                                                                     3
                                                                     1              13
                                                                                    1                         4      3      3      42
                      Source        Source                           3
                                                                     1              80
                                                                                    1                         4      1      3      80
                      prodID        storeID
                                                                                                            prodID storeID custID price




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                 67
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                    “Materialization Strategies in a Column-Oriented DBMS”
                                                    Abadi, Myers, DeWitt, and Madden. ICDE 2007.

      For plans without joins, late materialization
      is a win

                           10                                                                   QUERY:
                            9
                            8                                                                   SELECT C1, SUM(C2)
         Time (seconds)




                            7                                                                   FROM table
                            6                                                                   WHERE (C1 < CONST) AND
                            5
                            4
                                                                                                      (C2 < CONST)
                            3                                                                   GROUP BY C1
                            2
                            1
                                                                                                l    Ran on 2 compressed
                            0
                                 Low selectivity       Medium        High selectivity
                                                                                                     columns from TPC-H
                                                      selectivity                                    scale 10 data
                                      Early Materialization   Late Materialization




                          VLDB 2009 Tutorial                                Column-Oriented Database Systems          68
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                  “Materialization Strategies in a Column-Oriented DBMS”
                                                  Abadi, Myers, DeWitt, and Madden. ICDE 2007.

      Even on uncompressed data, late
      materialization is still a win

                           14                                                               QUERY:
                           12                                                               SELECT C1, SUM(C2)
          Time (seconds)




                           10                                                               FROM table
                           8                                                                WHERE (C1 < CONST) AND
                           6                                                                      (C2 < CONST)
                           4
                                                                                            GROUP BY C1
                           2

                           0                                                                  l    Materializing late still
                                Low selectivity      Medium        High selectivity
                                                    selectivity
                                                                                                   works best

                                    Early Materialization   Late Materialization




                  VLDB 2009 Tutorial                                    Column-Oriented Database Systems                69
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)



             What about for plans with joins?
   Select R1.B, R1.C, R2.E, R2.H, R3.F                                                                             B, C, E, H, F
   From R1, R2, R3
                                                                                                            Project
   Where R1.A = R2.D AND R2.G = R3.K

                      B, C, E, H, F                                                                             Join
                                                                                                                G=K           F
                                                                          B, C
                                          Join 2
                                          G=K                                                 G                          K
        B, C, E, G, H                                                                                                  Scan
                                                        F, K                              Project
                    Join 1                                                                                         R3 (F, K, L)
                    A=D                             Scan                                    Join
                                                                                            A=D
                                               R3 (F, K, L)                                                        G          E, H
   A, B, C                         D, E, G, H                                   A                           D
         Scan                      Scan                                         Scan                    Scan
   R1 (A, B, C) R2 (D, E, G, H)                                           R1 (A, B, C) R2 (D, E, G, H)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                70   70
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)



             What about for plans with joins?
   Select R1.B, R1.C, R2.E, R2.H, R3.F                                                                             B, C, E, H, F
   From R1, R2, R3
                                                                                                            Project
   Where R1.A = R2.D AND R2.G = R3.K

                      B, C, E, H, F                                                                             Join
                                                                                                                G=K           F
                                                                          B, C
                                          Join 2
                                          G=K                                                 G                          K
        B, C, E, G, H
                 Early                                                                         Late                    Scan
                                                        F, K                              Project
            materialization
              Join 1
                                                                                       materialization (F, K, L)
                                                                                                    R3
                     A=D                            Scan                                    Join
                                                                                            A=D
                                               R3 (F, K, L)                                                        G          E, H
   A, B, C                         D, E, G, H                                   A                           D
         Scan                      Scan                                         Scan                    Scan
   R1 (A, B, C) R2 (D, E, G, H)                                           R1 (A, B, C) R2 (D, E, G, H)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                71   71
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




             Early Materialization Example
                                  2 7                                                       QUERY:

                                  3 13                                                      SELECT C.lastName,SUM(F.price)
                                                                     1 Green
                                                                                            FROM facts AS F, customers AS C
                                  3 42                               2 White                WHERE F.custID = C.custID
                                  3 80                                                      GROUP BY C.lastName
                                                                     3 Brown

                           Construct                                 Construct

   (4,1,4)         12         6          2        7
                   1          1          3        13                1          Green
                   11         2          3        42                2          White
                   1          1          3        80                3          Brown
      prodID storeID quantity custID price                     custID          lastName

                          Facts                                  Customers
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                  72
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




             Early Materialization Example
                                                                                            QUERY:
                                                      7      White
                                                                                            SELECT C.lastName,SUM(F.price)
                                                      13 Brown
                                                                                            FROM facts AS F, customers AS C
                                                      42 Brown                              WHERE F.custID = C.custID
                                                                                            GROUP BY C.lastName
                                                      80 Brown


                                                       Join

                                 2 7
                                 3 13                                1 Green
                                 3 42                                2 White
                                 3 80                                3 Brown


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                  73
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




             Late Materialization Example
                                                                                            QUERY:
                                                    1           2
                                                    2           3                           SELECT C.lastName,SUM(F.price)
                                                                                            FROM facts AS F, customers AS C
                                                    3           1
                                                                                            WHERE F.custID = C.custID
                                                    4           3                           GROUP BY C.lastName


                                                        Join                                            Late materialized join
                                                                                                        causes out of order
                                                                                                        probing of projected
   (4,1,4)         12         6          2        7                                                     columns from the
                   1          1          3        13                1          Green                    inner relation
                   11         2          1        42                2          White
                   1          1          3        80                3          Brown
   prodID       storeID quantity custID price                  custID          lastName

                          Facts                                  Customers
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                        74
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Late Materialized Join Performance
   l     Naïve LM join about 2X slower than EM join on typical
         queries (due to random I/O)
         l      This number is very dependent on
                l   Amount of memory available
                l   Number of projected attributes
                l   Join cardinality
   l     But we can do better
         l      Invisible Join
         l      Jive/Flash Join
         l      Radix cluster/decluster join



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       75
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


                                                              “Column-Stores vs Row-Stores: How
   Invisible Join                                             Different are They Really?” Abadi,
                                                              Madden, and Hachem. SIGMOD 2008.




         l     Designed for typical joins when data is modeled using a star
               schema
                 l   One (“fact”) table is joined with multiple dimension tables
         l     Typical query:
                     select c_nation, s_nation, d_year,
                            sum(lo_revenue) as revenue
                     from customer, lineorder, supplier, date
                     where lo_custkey = c_custkey
                        and lo_suppkey = s_suppkey
                        and lo_orderdate = d_datekey
                        and c_region = 'ASIA‘
                        and s_region = 'ASIA‘
                        and d_year >= 1992 and d_year <= 1997
                     group by c_nation, s_nation, d_year
                     order by d_year asc, revenue desc;


             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       76
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


                                                              “Column-Stores vs Row-Stores: How
                                                              Different are They Really?” Abadi,
             Invisible Join                                   Madden, and Hachem. SIGMOD 2008.


    Apply “region = ‘Asia’” On Customer Table
               custkey region  nation                         …
                  1     ASIA   CHINA                          …                                       Hash Table (or bit-map)
                  2     ASIA   INDIA                          …                                      Containing Keys 1, 2 and 3
                  3     ASIA   INDIA                          …
                  4    EUROPE FRANCE                          …


     Apply “region = ‘Asia’” On Supplier Table
               suppkey region  nation                         …
                  1     ASIA  RUSSIA                          …                                             Hash Table (or bit-map)
                  2   EUROPE SPAIN                            …                                              Containing Keys 1, 3
                  3     ASIA  JAPAN                           …


    Apply “year in [1992,1997]” On Date Table
                      dateid          year            …
                     01011997         1997            …                                                Hash Table Containing
                     01021997         1997            …                                               Keys 01011997, 01021997,
                     01031997         1997            …
                                                                                                            and 01031997

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                           77
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
                    Original Fact Table                                                “Column-Stores vs Row-Stores:
   orderkey custkey suppkey orderdate revenue                                          How Different are They Really?”
      1        3       1    01011997 43256                                             Abadi et. al. SIGMOD 2008.
      2        3       2    01011997 33333                                                  1           1          1        1
      3        4       3    01021997 12121                                                  1           0          1        0
                                                                                            0           1          1        0
      4
      5
      6
      7
               1
               4
               1
               3
                       1
                       2
                       2
                       2
                            01021997 23233
                            01021997 45456
                            01031997 43251
                            01031997 34235
                                                                                            1
                                                                                            0
                                                                                            1
                                                                                                  & & = 1
                                                                                                        0
                                                                                                        0
                                                                                                                   1
                                                                                                                   1
                                                                                                                   1
                                                                                                                            1
                                                                                                                            0
                                                                                                                            0
                                                                                            1           0          1        0




             Hash Table                                      Hash Table                            Hash Table Containing
              Containing                                     Containing                               Keys 01011997,
            Keys 1, 2 and 3                                 Keys 1 and 3                          01021997, and 01031997

                     +
                   custkey
                                                                 +
                                                               suppkey
                                                                                                              +
                                                                                                            orderdate
                      3                        1                  1                        1                01011997        1
                      3                        1                  2                        0                01011997        1
                      4                        0                  3                        1                01021997        1
                      1                        1                  1                        1                01021997        1
                      4                        0                  2                        0                01021997        1
                      1                        1                  2                        0                01031997        1
                      3                        1                  2                        0                01031997        1

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                      78
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

          custkey
             3                     1
             3                     0
             4
             1
             4
                         +         0
                                   1
                                   0
                                                           3
                                                           1          +               CHINA
                                                                                      INDIA
                                                                                      INDIA
                                                                                     FRANCE
                                                                                                      =             INDIA
                                                                                                                    CHINA

             1                     0
             3                     0


         suppkey
            1                      1
            2                      0
            3
            1
            2
                         +
                                   0
                                   1
                                   0
                                   0
                                                           1
                                                           1
                                                                      +             RUSSIA
                                                                                    SPAIN
                                                                                    JAPAN
                                                                                                      =            RUSSIA
                                                                                                                   RUSSIA

            2
            2                      0


         orderdate
         01011997                  1
         01011997                  0                                                      01011997          1997
         01021997
         01021997
         01021997
                         +         0
                                   1
                                   0
                                                     01011997
                                                     01021997
                                                                        JOIN              01021997
                                                                                          01031997
                                                                                                            1997
                                                                                                            1997      =     1997
                                                                                                                            1997

         01031997                  0
         01031997                  0
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                         79
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                              “Column-Stores vs Row-Stores: How
                                                              Different are They Really?” Abadi,
                                                              Madden, and Hachem. SIGMOD 2008.
             Invisible Join
    Apply “region = ‘Asia’” On Customer Table
               custkey region  nation                         …
                  1     ASIA   CHINA                          …                                       Hash Table (or bit-map)
                  2     ASIA   INDIA                          …                                      Containing Keys 1, 2 and 3
                  3     ASIA   INDIA                          …
                  4    EUROPE FRANCE                          …
                                                                                                               Range [1-3]
                                                                                                               (between-predicate rewriting)
     Apply “region = ‘Asia’” On Supplier Table
               suppkey region  nation                         …
                  1     ASIA  RUSSIA                          …                                             Hash Table (or bit-map)
                  2   EUROPE SPAIN                            …                                              Containing Keys 1, 3
                  3     ASIA  JAPAN                           …


    Apply “year in [1992,1997]” On Date Table
                      dateid          year            …
                     01011997         1997            …                                                Hash Table Containing
                     01021997         1997            …                                               Keys 01011997, 01021997,
                     01031997         1997            …
                                                                                                            and 01031997

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                          80
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Invisible Join

             l    Bottom Line
                    l    Many data warehouses model data using star/snowflake
                         schemes
                    l    Joins of one (fact) table with many dimension tables is
                         common
                    l    Invisible join takes advantage of this by making sure
                         that the table that can be accessed in position order is
                         the fact table for each join
                    l    Position lists from the fact table are then intersected (in
                         position order)
                    l    This reduces the amount of data that must be accessed
                         out of order from the dimension tables
                    l    “Between-predicate rewriting” trick not relevant for this
                         discussion

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       81
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

          custkey
             3                     1
             3                     0
             4
             1
             4
                         +         0
                                   1
                                   0
                                                           3
                                                           1          +               CHINA
                                                                                      INDIA
                                                                                      INDIA
                                                                                     FRANCE
                                                                                                      =             INDIA
                                                                                                                    CHINA

             1                     0
             3                     0


         suppkey                                                                  Still accessing table out of order
            1                      1
            2                      0
            3
            1
            2
                         +
                                   0
                                   1
                                   0
                                   0
                                                           1
                                                           1
                                                                      +             RUSSIA
                                                                                    SPAIN
                                                                                    JAPAN
                                                                                                      =            RUSSIA
                                                                                                                   RUSSIA

            2
            2                      0


         orderdate
         01011997                  1
         01011997                  0                                                      01011997          1997
         01021997
         01021997
         01021997
                         +         0
                                   1
                                   0
                                                     01011997
                                                     01021997
                                                                        JOIN              01021997
                                                                                          01031997
                                                                                                            1997
                                                                                                            1997      =     1997
                                                                                                                            1997

         01031997                  0
         01031997                  0
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                         82
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Jive/Flash Join

                                                                    3
                                                                    1           +              CHINA
                                                                                               INDIA
                                                                                               INDIA
                                                                                              FRANCE
                                                                                                            =   INDIA
                                                                                                                CHINA



   “Fast Joins using Join
   Indices”. Li and Ross,
   VLDBJ 8:1-24, 1999.                                                                     Still accessing table out of order




   “Query Processing
   Techniques for Solid State
   Drives”. Tsirogiannis,
   Harizopoulos et. al.
   SIGMOD 2009.



            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                   83
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Jive/Flash Join
   1.       Add column with dense
            ascending integers                                                                          CHINA



   2.
            from 1
            Sort new position list
                                                                    1
                                                                    2
                                                                             3
                                                                             1
                                                                                         +              INDIA
                                                                                                        INDIA
                                                                                                       FRANCE
                                                                                                                =
                                                                                                                        INDIA
                                                                                                                        CHINA



            by second column
   3.       Probe projected
            column in order using
            new sorted position
            list, keeping first
            column from position
                                                                    2
                                                                    1
                                                                             1
                                                                             3          +               CHINA
                                                                                                        INDIA
                                                                                                        INDIA
                                                                                                       FRANCE
                                                                                                                =   2
                                                                                                                    1
                                                                                                                        CHINA
                                                                                                                        INDIA



            list around
   4.       Sort new result by first
                                                                                                                    1   INDIA
            column                                                                                                  2   CHINA




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                      84
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Jive/Flash Join

             l    Bottom Line
                    l    Instead of probing projected columns from inner table out of
                         order:
                            l    Sort join index
                            l    Probe projected columns in order
                            l    Sort result using an added column
                    l    LM vs EM tradeoffs:
                            l    LM has the extra sorts (EM accesses all columns in order)
                            l    LM only has to fit join columns into memory (EM needs join
                                 columns and all projected columns)
                                    §   Results in big memory and CPU savings (see part 3 for why there is
                                        CPU savings)
                            l    LM only has to materialize relevant columns
                            l    In many cases LM advantages outweigh disadvantages
                    l    LM would be a clear winner if not for those pesky sorts … can
                         we do better?


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems        85
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Radix Cluster/Decluster
   l     The full sort from the Jive join is actually overkill
         l      We just want to access the storage blocks in order (we
                don’t mind random access within a block)
         l      So do a radix sort and stop early
         l      By stopping early, data within each block is accessed out of
                order, but in the order specified in the original join index
                l   Use this pseudo-order to accelerate the post-probe sort as well

          •“Database Architecture Optimized for the
          New Bottleneck: Memory Access”
                                                                                         “Cache-Conscious Radix-Decluster
          VLDB’99                                                                        Projections”, Manegold, Boncz, Nes,
          •“Generic Database Cost Models for                                             VLDB’04
          Hierarchical Memory Systems”, VLDB’02
          (all Manegold, Boncz, Kersten)


             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                          86
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Radix Cluster/Decluster
   l     Bottom line
         l      Both sorts from the Jive join can be significantly reduced in
                overhead
         l      Only been tested when there is sufficient memory for the
                entire join index to be stored three times
                l   Technique is likely applicable to larger join indexes, but utility will
                    go down a little
         l      Only works if random access within a storage block
                l   Don’t want to use radix cluster/decluster if you have variable-
                    width column values or compression schemes that can only be
                    decompressed starting from the beginning of the block




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       87
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   LM vs EM joins
   l     Invisible, Jive, Flash, Cluster, Decluster techniques
         contain a bag of tricks to improve LM joins
   l     Research papers show that LM joins become 2X faster
         than EM joins (instead of 2X slower) for a wide array of
         query types




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       88
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Tuple Construction Heuristics
             l    For queries with selective predicates,
                  aggregations, or compressed data, use late
                  materialization
             l    For joins:
                    l    Research papers:
                            l    Always use late materialization
                    l    Commercial systems:
                            l    Inner table to a join often materialized before join
                                 (reduces system complexity):
                            l    Some systems will use LM only if columns from
                                 inner table can fit entirely in memory




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       89
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Outline

   l          Computational Efficiency of DB on modern hardware
         l        how column-stores can help here
         l        Keynote revisited: MonetDB & VectorWise in more depth


   l          CPU efficient column compression
         l        vectorized decompression

   l          Conclusions
         l        future work




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       90
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems
                                    40 years of hardware evolution
                                                               vs.
                                    DBMS computational efficiency
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   CPU Architecture
   Elements:
   l Storage
         l      CPU caches L1/L2/L3
   l     Registers
   l     Execution Unit(s)
         l      Pipelined
         l      SIMD




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       92
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   CPU Metrics




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       93
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   CPU Metrics




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       94
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   DRAM Metrics




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       95
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Super-Scalar Execution (pipelining)




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       96
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Hazards
   l     Data hazards                                                           l     Control Hazards
         l      Dependencies between instructions                                     l     Branch mispredictions
         l      L1 data cache misses                                                  l     Computed branches (late binding)
                                                                                      l     L1 instruction cache misses

   Result: bubbles in the pipeline




       Out-of-order execution addresses data hazards
       l control hazards typically more expensive

             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                  97
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   SIMD




         l     Single Instruction Multiple Data
               l     Same operation applied on a vector of values
               l     MMX: 64 bits, SSE: 128bits, AVX: 256bits
               l     SSE, e.g. multiply 8 short integers

             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       98
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   A Look at the Query Pipeline


                           102      ivan       350


              next()                         37 *–50
                                              7 30                                         SELECT id, name
                                                                                                  (age-30)*50 AS bonus
                           102      ivan       37 7       350
                                                                                           FROM   employee
                                                                                           WHERE age > 30
                                                       PROJECT
             next()                          37
                                             22 > 30 ?
                           101
                           102      alice
                                    ivan       22 TRUE
                                               37 FALSE
                                                          SELECT
             next()
                           102
                           101      ivan
                                    alice      37
                                               22             SCAN



            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                    99
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   A Look at the Query Pipeline


                           102      ivan       350


              next()                         37 *–50
                                              7 30
                                                                                      Operators
                                                       PROJECT
                                                                                      Iterator interface
             next()                          37
                                             22 > 30 ?                                -open()
                                                                                      -next(): tuple
                                                          SELECT                      -close()
             next()
                                                              SCAN



            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       100
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   A Look at the Query Pipeline


                           102      ivan       350


              next()                         37 *–50
                                              7 30

                           102      ivan       37 7       350
                                                                                      Primitives
                                                       PROJECT
                                                                                      Provide computational
             next()                          37
                                             22 > 30 ?                                functionality
                           101
                           102      alice
                                    ivan       22 TRUE
                                               37 FALSE
                                                          SELECT                      All arithmetic allowed in
                                                                                      expressions,
             next()
                                                                                      e.g. Multiplication
                           102
                           101      ivan
                                    alice      37
                                               22             SCAN
                                                                                       7 * 50
                                                                                      mult(int,int) Ł       int

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems             101
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Database Architecture causes Hazards
   l     Data hazards                                                           l     Control Hazards
         l      Dependencies between instructions                                     l     Branch mispredictions
         l      L1 data cache misses                                                  l     Computed branches (late binding)
       SIMD                       Out-of-order Execution                              l     L1 instruction cache misses

         work on one tuple at a time

         Large Tree/Hash Structures                                    PROJECT
                                                                                                            Operators
         Code footprint of all
         operators in query plan
         exceeds L1 cache                                              SELECT                               Iterator interface
                                                                                                            -open()
         Data-dependent conditions
                                                                                                            -next(): tuple
        Next() late binding                SCAN                                                             -close()
        method calls
       “DBMSs On A Modern Processor: Where Does Time Go? ”
        Tree, List, Hash traversal  Complex NSM record navigation
        Ailamaki, DeWitt, Hill, Wood, VLDB’99

             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                      102
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                   DBMS Computational Efficiency
   TPC-H 1GB, query 1
   l selects 98% of fact table, computes net prices and
     aggregates all
   l Results:
           l    C program:                        ?
           l    MySQL:                            26.2s
           l    DBMS “X”:                         28.1s




                                                                          “MonetDB/X100: Hyper-Pipelining Query
                                                                          Execution ” Boncz, Zukowski, Nes, CIDR’05

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems         103
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                   DBMS Computational Efficiency
   TPC-H 1GB, query 1
   l selects 98% of fact table, computes net prices and
     aggregates all
   l Results:
           l    C program:                         0.2s
           l    MySQL:                            26.2s
           l    DBMS “X”:                         28.1s




                                                                          “MonetDB/X100: Hyper-Pipelining Query
                                                                          Execution ” Boncz, Zukowski, Nes, CIDR’05

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems         104
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems
                                                                                               MonetDB
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                               a column-store
   l     “save disk I/O when scan-intensive queries                                                         need a few
         columns”
   l     “avoid an expression interpreter to improve
         computational efficiency”




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems             106
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




      RISC Database Algebra




   SELECT         id, name, (age-30)*50 as bonus
   FROM           people
   WHERE          age > 30




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       107
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




      RISC Database Algebra

CPU happy? Give it “nice” code !

- few dependencies (control,data)
- CPU gets out-of-order execution
- compiler can e.g. generate SIMD

One loop for an entire column                                     Simple, hard-
- no per-tuple interpretation
batcalc_minus_int(int* res,                                      coded semantics
                       int* col,
- arrays: no record navigation
                       int val,                                    in operators
- better instruction cache locality
                                     int n)
{
     for(i=0; i<n; i++) as bonus
  SELECT  id, name, (age-30)*50
  FROM   res[i] = col[i] - val;
          people
} WHERE   age > 30




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       108
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




      RISC Database Algebra

CPU happy? Give it “nice” code !

- few dependencies (control,data)
- CPU gets out-of-order execution
- compiler can e.g. generate SIMD

One loop for an entire column
- no per-tuple interpretation
batcalc_minus_int(int* res,
                       int* col,
- arrays: no record navigation
                       int val,
- better instruction cache locality
                                     int n)
{                                                                                MATERIALIZED
     for(i=0; i<n; i++) as bonus
  SELECT  id, name, (age-30)*50                                                   intermediate
  FROM   res[i] = col[i] - val;
          people
} WHERE   age > 30
                                                                                     results


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       109
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                               a column-store
   l     “save disk I/O when scan-intensive queries                                                         need a few columns”
   l     “avoid an expression interpreter to improve computational
         efficiency”
       How?
           l    RISC query algebra: hard-coded semantics
                   l   Decompose complex expressions in multiple operations
           l    Operators only handle simple arrays
                   l   No code that handles slotted buffered record layout
           l    Relational algebra becomes array manipulation language
                   l   Often SIMD for free


                   l   Plus: use of cache-conscious algorithms for Sort/Aggr/Join

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                      110
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                               a Faustian pact
   l     You want efficiency
           l    Simple hard-coded operators
   l     I take scalability
           l    Result materialization




                               n    C program:                        0.2s
                               n    MonetDB:                          3.7s
                               n    MySQL:                            26.2s
                               n    DBMS “X”:                         28.1s


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       111
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems

                                                           as a research platform
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   SIGMOD 1985

                                                                                     B
                                                                               MonetD bra
                                                                                    lge
                                                                               BAT A
                                 ts
                         B suppor
                  MonetD          G,
                         M L, ODM
                  SQL, X
                   ..RDF              •“MIL Primitives for Querying a Fragmented World”,
                                      Boncz, Kersten, VLDBJ’98
                                      •“Flattening an Object Algebra to Provide Performance”
                                      Boncz, Wilschut, Kersten, ICDE’98
                                      •“MonetDB/XQuery: a fast XQuery processor powered
                                   C- by a relational engine” Boncz, Grust, vanKeulen,
                           pport on e
                     RDF su SW-Stor Rittinger, Teubner, SIGMOD’06
                            /
                     STORE            •“SW-Store: a vertically partitioned DBMS for Semantic
                                      Web data management“ Abadi, Marcus, Madden,
                                      Hollenbach, VLDBJ’09

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       113
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




               The                                                     Software Stack

   Extensible query lang
    frontend frameworkXQuery                                               SQL 03                           RDF   Arrays
      Extensible Dynamic
          .. SGL?
        Runtime QOPT                                                    Optimizers
        Framework!
                SOAP                MonetDB 4                            MonetDB 5                      OGIS
                                                                                                            Extensible
                                                                                                     Architecture-Conscious
                                                      MonetDB kernel                                 compile
                                                                                                       Execution platform




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                      114
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
                                    optimizer                          frontend                             backend
                               as a research platform
                                                               •“Database Architecture Optimized for the New
   l     Cache-Conscious Joins              Bottleneck: Memory Access” VLDB’99
         l      Cost Models, Radix-cluster, •“Generic Database Cost Models for Hierarchical Memory
                Radix-decluster             Systems”, VLDB’02 (all Manegold, Boncz, Kersten)
                                            •“Cache-Conscious Radix-Decluster Projections”,
                                            Manegold, Boncz, Nes, VLDB’04
   l     MonetDB/XQuery:
         l      structural joins exploiting                        “MonetDB/XQuery: a fast XQuery processor powered
                positional column access                           by a relational engine” Boncz, Grust, vanKeulen,
                                                                   Rittinger, Teubner, SIGMOD’06
   l     Cracking:                                                    “Database Cracking”, CIDR’07
         l      on-the-fly automatic indexing                         “Updating a cracked database “, SIGMOD’07
                without workload knowledge                            “Self-organizing tuple reconstruction in column-
                                                                      stores“, SIGMOD’09 (all Idreos, Manegold, Kersten)
   l     Recycling:                              “An architecture for recycling intermediates in a
         l      using materialized intermediates column-store”, Ivanova, Kersten, Nes, Goncalves,
                                                 SIGMOD’09
   l     Run-time Query Optimization:                                      “ROX: run-time optimization of XQueries”,
         l      correlation-aware run-time                                 Abdelkader, Boncz, Manegold, vanKeulen,
                optimization without cost model                            SIGMOD’09
             VLDB 2009 Tutorial                                      Column-Oriented Database Systems                  115
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems

                                                       “MonetDB/X100”
                                            vectorized query processing
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




             MonetDB spin-off: MonetDB/X100

                        Materialization vs Pipelining




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       117
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


  “MonetDB/X100: Hyper-Pipelining Query
  Execution ” Boncz, Zukowski, Nes, CIDR’05




                                                                                     next()


                                                                                                                     PROJECT
                                                                                     next()


                                                                                                                      SELECT
                                                                                     next()
                                                                                                101    alice    22
                                                                                                102    ivan     37
                                                                                                104    peggy    45
                                                                                                105    victor   25     SCAN


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                     118
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05




                                                                                     next()


                                                                                                                             PROJECT
                                                                                                                          > 30 ?
                                                                                     next()      101        alice    22   FALSE
                                                                                                 102        ivan     37   TRUE
                                                                                                 104        peggy    45   TRUE
                                                                                                 105        victor   25   FALSESELECT
                                                                                     next()
                                                                                                101    alice         22
                                                                                                102    ivan          37
                                                                                                104    peggy         45
                                                                                                105    victor        25            SCAN


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                119
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05




     “Vectorized
    Observations:                          In Cache                                             102 ivan  350
                                                                                                104 peggy 750
       Processing”
    next() called much less often Ł                                                                                       - 30 * 50
    more time spent in primitives                                                    next()
      vector = array of ~100                                                                    102 ivan  37 7 350
    less in overhead                                                                            104 peggy 45 15 750
     processed in a tight loop
    primitive calls process an array of                                                                                        PROJECT
                                                                                                                           > 30 ?
     CPU in a loop:
    values cache Resident                                                            next()      101        alice    22    FALSE
                                                                                                 102        ivan     37    TRUE
                                                                                                 104        peggy    45    TRUE
                                                                                                 105        victor   25    FALSESELECT
                                                                                     next()
                                                                                                101    alice         22
                                                                                                102    ivan          37
                                                                                                104    peggy         45
                                                                                                105    victor        25               SCAN


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                   120
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05




                                                                                                102 ivan  350
    Observations:
                                                                                                104 peggy 750

    next() called much less often Ł                                                                                       - 30 * 50
    more time spent in primitives                                                    next()
                                                                                                102 ivan  37 7 350
    less in overhead                                                                            104 peggy 45 15 750

    primitive calls process an array of                                                                                        PROJECT
                                                                                                                           > 30 ?
    values in a loop:                                                                next()      101        alice    22    FALSE
      CPU Efficiency depends on “nice” code                                                      102        ivan     37    TRUE
      - out-of-order execution                                                                   104        peggy    45    TRUE
      - few dependencies (control,data)                                                          105        victor   25    FALSESELECT
      - compiler support
                                                                                     next()
                                                                                                101    alice         22
      Compilers like simple loops over arrays                                                   102    ivan          37
      - loop-pipelining                                                                         104    peggy         45
      - automatic SIMD                                                                          105    victor        25               SCAN


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                   121
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05




    Observations:                                                                 > 30 ?
                                                                                 FALSE           for(i=0; i<n; i++)
    next() called much less often Ł                                              TRUE                 res[i] = (col[i] > x)
                                                                                 TRUE                             * 50
    more time spent in primitives                                                FALSE                           350
    less in overhead                                                                                             750
                                                                                   - 30                          PROJECT
    primitive calls process an array of                                                         for(i=0; i<n; 30 ?
                                                                                                            > i++)
    values in a loop:                                                               7
                                                                                    15                        FALSE
                                                                                                    res[i] = (col[i] - x)
      CPU Efficiency depends on “nice” code                                                                 TRUE
      - out-of-order execution                                                                              TRUE
      - few dependencies (control,data)                                            * 50                     FALSESELECT
                                                                                                for(i=0; i<n; i++)
      - compiler support
                                                                                  350               res[i] = (col[i] * x)
                                                                                  750
      Compilers like simple loops over arrays
      - loop-pipelining
      - automatic SIMD                                                                                                 SCAN


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                    122
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05




    Tricks being played:

    - Late materialization

    - Materialization avoidance
    using selection vectors
                                                                                                                          PROJECT
                                                                                                                     > 30 ?
                                                                                     next()                           1
                                                                                                                      2

                                                                                                                           SELECT
                                                                                     next()
                                                                                                101    alice    22
                                                                                                102    ivan     37
                                                                                                104    peggy    45
                                                                                                105    victor   25            SCAN


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                           123
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05

     map_mul_flt_val_flt_col(
       float *res,
       int* sel,
    Tricks being played:
       float val,
       float *col, int n)
    - Late materialization                                                                                           - 30 * 50
                                                                                     next()
     {
    - Materialization avoidance                                                                                      7        350
       for(int i=0; i<n; i++)                                                                                        15       750
    using selection vectors
              res[i] = val * col[sel[i]];                                                                                     PROJECT
                                                                                                                      > 30 ?
     }
                                                                                     next()                               1
     selection vectors used to reduce                                                                                     2

     vector copying                                                                                                             SELECT

     contain selected positions                                                      next()
                                                                                                101    alice    22
                                                                                                102    ivan     37
                                                                                                104    peggy    45
                                                                                                105    victor   25                  SCAN


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                 124
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)


“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05

     map_mul_flt_val_flt_col(
       float *res,
       int* sel,
    Tricks being played:
       float val,
       float *col, int n)
    - Late materialization                                                                                           - 30 * 50
                                                                                     next()
     {
    - Materialization avoidance                                                                                      7        350
       for(int i=0; i<n; i++)                                                                                        15       750
    using selection vectors
              res[i] = val * col[sel[i]];                                                                                     PROJECT
                                                                                                                      > 30 ?
     }
                                                                                     next()                               1
     selection vectors used to reduce                                                                                     2

     vector copying                                                                                                             SELECT

     contain selected positions                                                      next()
                                                                                                101    alice    22
                                                                                                102    ivan     37
                                                                                                104    peggy    45
                                                                                                105    victor   25                  SCAN


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                                 125
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                      “MonetDB/X100: Hyper-Pipelining Query
                                                      Execution ” Boncz, Zukowski, Nes, CIDR’05
   MonetDB/X100
   l     Both efficiency
           l    Vectorized primitives
   l     and scalability..
           l    Pipelined query evaluation



                               n    C program:                        0.2s
                               n    MonetDB/X100:                     0.6s
                               n    MonetDB:                          3.7s
                               n    MySQL:                            26.2s
                               n    DBMS “X”:                         28.1s


            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       126
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                               “MonetDB/X100: Hyper-Pipelining Query
                                                               Execution ” Boncz, Zukowski, Nes, CIDR’05
   Memory Hierarchy
                        X100 query engine




       CPU
      cache




         RAM                     ColumnBM
                              (buffer manager)


                                             (raid)
                                            Disk(s)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       127
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                               “MonetDB/X100: Hyper-Pipelining Query
                                                               Execution ” Boncz, Zukowski, Nes, CIDR’05
   Memory Hierarchy
                        X100 query engine


                                                                                      Vectors are only the in-cache
                                                                                      representation

                                                                                      RAM & disk representation might
                                                                                      actually be different
       CPU
      cache                                                                           (vectorwise uses both PAX & DSM)




         RAM                     ColumnBM
                              (buffer manager)


                                             (raid)
                                            Disk(s)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems             128
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                               “MonetDB/X100: Hyper-Pipelining Query
                                                               Execution ” Boncz, Zukowski, Nes, CIDR’05
   Optimal Vector size?
                        X100 query engine



                                                                                       All vectors together should fit
                                                                                       the CPU cache

                                                                                       Optimizer should tune this,
                                                                                       given the query characteristics.
       CPU
      cache




         RAM                     ColumnBM
                              (buffer manager)


                                             (raid)
                                            Disk(s)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                129
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                               “MonetDB/X100: Hyper-Pipelining Query
                                                               Execution ” Boncz, Zukowski, Nes, CIDR’05
   Varying the Vector size



                                                                         Less and less iterator.next()
                                                                                      and
                                                                            primitive function calls
                                                                         (“interpretation overhead”)




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       130
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                               “MonetDB/X100: Hyper-Pipelining Query
                                                               Execution ” Boncz, Zukowski, Nes, CIDR’05
   Varying the Vector size

                                 Vectors start to exceed the
                                     CPU cache, causing
                                  additional memory traffic




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       131
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                               “MonetDB/X100: Hyper-Pipelining Query
                                                               Execution ” Boncz, Zukowski, Nes, CIDR’05
   Varying the Vector size




                            The benefit of selection
                                   vectors




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       132
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                               “MonetDB/X100: Hyper-Pipelining Query
                                                               Execution ” Boncz, Zukowski, Nes, CIDR’05
   MonetDB/MIL materializes columns




       CPU
      cache




         RAM                     ColumnBM
                              (buffer manager)


                                             (raid)
                                            Disk(s)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       133
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                               “MonetDB/X100: Hyper-Pipelining Query
                                                               Execution ” Boncz, Zukowski, Nes, CIDR’05

   Benefits of Vectorized Processing
                 l     100x less Function Calls                             “Buffering Database Operations for
                                                                            Enhanced Instruction Cache Performance”
                         l    iterator.next(), primitives                   Zhou, Ross, SIGMOD’04
                 l     No Instruction Cache Misses
                         l
                                                         “Block oriented processing of
                              High locality in the primitives
                                                         relational database operations in
                 l     Less Data Cache Misses            modern computer architectures”
                        l Cache-conscious data placement Padmanabhan, Malkemus, Agarwal,

                 l     No Tuple Navigation               ICDE’01
                         l    Primitives are record-oblivious, only see arrays
                 l     Vectorization allows algorithmic optimization
                         l    Move activities out of the loop (“strength reduction”)
                 l     Compiler-friendly function bodies
                         l    Loop-pipelining, automatic SIMD




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems         134
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                “Balancing Vectorized Query Execution with
                                                Bandwidth Optimized Storage ” Zukowski, CWI 2008

     Vectorizing Relational Operators
   l     Project
   l     Select
           l    Exploit selectivities, test buffer overflow
   l     Aggregation
           l    Ordered, Hashed
   l     Sort
           l    Radix-sort nicely vectorizes
   l     Join
           l    Merge-join + Hash-join




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       135
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems
                      Efficient Column Store Compression
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06
   Key Ingredients
   l     Compress relations on a per-column basis
           l    Columns compress well
   l     Decompress small vectors of tuples from a column into
         the CPU cache
           l    Minimize main-memory overhead
   l     Use light-weight, CPU-efficient algorithms
           l    Exploit processing power of modern CPUs




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       137
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06
   Key Ingredients
   l     Compress relations on a per-column basis
           l    Columns compress well
   l     Decompress small vectors of tuples from a column into
         the CPU cache
           l    Minimize main-memory overhead




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       138
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Vectorized Decompression
                        X100 query engine

                                                                                   Idea:

                                                                                   decompress a vector only

                                                                                   compression:
                                                                                   -between CPU and RAM
       CPU                                                                         -Instead of disk and RAM (classic)
      cache




         RAM                     ColumnBM
                              (buffer manager)


                                             (raid)
                                            Disk(s)
            VLDB 2009 Tutorial                                       Column-Oriented Database Systems           139
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06

   RAM-Cache Decompression

                                                                                                   l    Decompress vectors
                                                                                                        on-demand into the
                                                                                                        cache
                                                                                                   l    RAM-Cache boundary
                                                                                                        only crossed once
                                                                                                   l    More (compressed)
                                                                                                        data cached in RAM
                                                                                                   l    Less bandwidth use




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems                 140
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




     Multi-Core Bandwidth & Compression
                                                       Performance Degradation with Concurrent Queries
                                        1.2
   avg query speed (clock normalized)




                                         1




                                        0.8




                                        0.6




                                        0.4



                                                         Intel® Xeon E5410 (Harpertown) - Q6b Normal
                                        0.2
                                                         Intel® Xeon E5410 (Harpertown) - Q6b Compressed
                                                         Intel® Xeon X5560 (Nehalem) - Q6b Normal
                                                         Intel® Xeon X5560 (Nehalem) - Q6b Compressed
                                         0
                                                   1               2         3            4             5           6               7   8
                                                                         number of concurrent queries


                                              VLDB 2009 Tutorial                                 Column-Oriented Database Systems           141
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06

   CPU Efficient Decompression
              l     Decoding loop over cache-resident vectors of code words
              l     Avoid control dependencies within decoding loop
                     l    no if-then-else constructs in loop body
              l      Avoid data dependencies between loop iterations




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       142
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06
   Disk Block Layout
                                                                                           l     Forward growing
                                                                                                 section of arbitrary
                                                                                                 size code words
                                                                                                 (code word size fixed
                                                                                                 per block)




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems             143
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06

   Disk Block Layout
                                                                                              l     Forward growing
                                                                                                    section of arbitrary
                                                                                                    size code words
                                                                                                    (code word size
                                                                                                    fixed per block)
                                                                                              l     Backwards growing
                                                                                                    exception list




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems               144
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06

   Naïve Decompression Algorithm
              l     Use reserved value from code word domain (MAXCODE) to mark
                    exception positions




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       145
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06
   Deterioriation With Exception%




                 l     1.2GB/s deteriorates to 0.4GB/s

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       146
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06

   Deterioriation With Exception%




                 l     Perf Counters: CPU mispredicts if-then-else

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       147
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06

   Patching
                                                                                  l     Maintain a patch-list
                                                                                        through code word
                                                                                        section that links
                                                                                        exception positions




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems           148
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06

   Patching
                                                                                  l     Maintain a patch-list
                                                                                        through code word
                                                                                        section that links
                                                                                        exception positions
                                                                                  l     After decoding, patch up
                                                                                        the exception positions
                                                                                        with the correct values




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems           149
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06
   Patched Decompression




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       150
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06
   Patched Decompression




            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       151
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

                                                          “Super-Scalar RAM-CPU Cache Compression”
                                                          Zukowski, Heman, Nes, Boncz, ICDE’06
   Decompression Bandwidth
             Patching can be applied to:
             • Frame of Reference (PFOR)
             • Delta Compression (PFOR-DELTA)
             • Dictionary Compression (PDICT)


             Makes these methods much more
             applicable to noisy data
             Ł without additional cost




                    l     Patching makes two passes, but is faster!

            VLDB 2009 Tutorial                                       Column-Oriented Database Systems       152
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




                                                                                                             VLDB
                                        Column-Oriented                                                      2009
                                                                                                            Tutorial
                                       Database Systems
                                                                                          Conclusion
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Summary (1/2)
   l     Columns and Row-Stores: different?
         l      No fundamental differences
         l      Can current row-stores simulate column-stores now?
                l     not efficiently: row-stores need change
         l      On disk layout vs execution layout
                l   actually independent issues, on-the-fly conversion pays off
                l   column favors sequential access, row random
         l      Mixed Layout schemes
                l   Fractured mirrors
                l   PAX, Clotho
                l   Data morphing



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       154
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Summary (2/2)
   l     Crucial Columnar Techniques
         l      Storage
                l   Lean headers, sparse indices, fast positional access
         l      Compression
                l   Operating on compressed data
                l   Lightweight, vectorized decompression
         l      Late vs Early materialization
                l   Non-join: LM always wins
                l   Naïve/Invisible/Jive/Flash/Radix Join (LM often wins)
         l      Execution
                l   Vectorized in-cache execution
                l   Exploiting SIMD

             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       155
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Future Work
   l     looking at write/load tradeoffs in column-stores
         l      read-only vs batch loads vs trickle updates vs OLTP




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       156
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Updates (1/3)
   l     Column-stores are update-in-place averse
         l      In-place: I/O for each column
         l      + re-compression
         l      + multiple sorted replicas
         l      + sparse tree indices

   Update-in-place is infeasible!




             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       157
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Updates (2/3)
   l     Column-stores use differential mechanisms instead
         l      Differential lists/files or more advanced (e.g. PDTs)
         l      Updates buffered in RAM, merged on each query
         l      Checkpointing merges differences in bulk sequentially
                l   I/O trends favor this anyway
                    §    trade RAM for converting random into sequential I/O
                    §    this trade is also needed in Flash (do not write randomly!)
                l   How high loads can it sustain?
                    §    Depends on available RAM for buffering (how long until full)
                         § Checkpoint must be done within that time
                         § The longer it can run, the less it molests queries
                    §    Using Flash for buffering differences buys a lot of time
                         § Hundreds of GBs of differences per server



             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       158
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Updates (3/3)
   l     Differential transactions favored by hardware trends
   l     Snapshot semantics accepted by the user community
         l      can always convert to serialized
                 “Serializable Isolation For Snapshot Databases” Alomari,
                 Cahill, Fekete, Roehm, SIGMOD’08

   Ł     Row stores could also use differential transactions and be efficient!
         Ł      Implies a departure from ARIES
         Ł      Implies a full rewrite

         My conclusion:
         a system that combines row- and columns needs differentially
         implemented transactions.
         Starting from a pure column-store, this is a limited add-on.
         Starting from a pure row-store, this implies a full rewrite.

             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       159
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Future Work
   l     looking at write/load tradeoffs in column-stores
         l      read-only vs batch loads vs trickle updates vs OLTP
   l     database design for column-stores
   l     column-store specific optimizers
         l      compression/materialization/join tricks Ł cost models?
   l     hybrid column-row systems
         l       can row-stores learn new column tricks?
                l   Study of the minimal number changes one needs to make to a
                    row store to get the majority of the benefits of a column-store
                l   Alternative: add features to column-stores that make them more
                    like row stores


             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       160
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)




   Conclusion
   l     Columnar techniques provide clear benefits for:
         l      Data warehousing, BI
         l      Information retrieval, graphs, e-science
   l     A number of crucial techniques make them effective
         l      Without these, existing row systems do not benefit
   l     Row-Stores and column-stores could be combined
         l      Row-stores may adopt some column-store techniques
         l      Column-stores add row-store (or PAX) functionality


   l     Many open issues to do research on!


             VLDB 2009 Tutorial                                      Column-Oriented Database Systems       161

VLDB 2009 Tutorial on Column-Stores

  • 1.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems Part 1: Stavros Harizopoulos (HP Labs) Part 2: Daniel Abadi (Yale) Part 3: Peter Boncz (CWI) VLDB 2009 Tutorial 1 Column-Oriented Database Systems
  • 2.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What is a column-store? row-store column-store Date Store Product Customer Price Date Store Product Customer Price + easy to add/modify a record + only need to read in relevant data - might read in unnecessary data - tuple writes require multiple accesses => suitable for read-mostly, read-intensive, large data repositories VLDB 2009 Tutorial Column-Oriented Database Systems 2
  • 3.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Are these two fundamentally different? l The only fundamental difference is the storage layout l However: we need to look at the big picture different storage layouts proposed row-stores row-stores++ row-stores++ converge? ‘70s ‘80s ‘90s ‘00s today column-stores new applications new bottleneck in hardware l How did we get here, and where we are heading Part 1 l What are the column-specific optimizations? Part 2 l How do we improve CPU efficiency when operating on Cs Part 3 VLDB 2009 Tutorial Column-Oriented Database Systems 3
  • 4.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Part 1: Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 4
  • 5.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Telco Data Warehousing example l Typical DW installation dimension tables account fact table or RAM l Real-world example usage source “One Size Fits All? - Part 2: Benchmarking toll Results” Stonebraker et al. CIDR 2007 star schema QUERY 2 SELECT account.account_number, sum (usage.toll_airtime), sum (usage.toll_price) Column-store Row-store FROM usage, toll, source, account WHERE usage.toll_id = toll.toll_id Query 1 2.06 300 AND usage.source_id = source.source_id Query 2 2.20 300 AND usage.account_id = account.account_id AND toll.type_ind in (‘AE’. ‘AA’) Query 3 0.09 300 AND usage.toll_price > 0 Query 4 5.24 300 AND source.type != ‘CIBER’ AND toll.rating_method = ‘IS’ Query 5 2.88 300 AND usage.invoice_date = 20051013 GROUP BY account.account_number Why? Three main factors (next slides) VLDB 2009 Tutorial Column-Oriented Database Systems 5
  • 6.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Telco example explained (1/3): read efficiency row store column store read pages containing entire rows read only columns needed one row = 212 columns! in this example: 7 columns is this typical? (it depends) caveats: • “select * ” not any faster • clever disk prefetching What about vertical partitioning? • clever tuple reconstruction (it does not work with ad-hoc queries) VLDB 2009 Tutorial Column-Oriented Database Systems 6
  • 7.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Telco example explained (2/3): compression efficiency l Columns compress better than rows l Typical row-store compression ratio 1 : 3 l Column-store 1 : 10 l Why? l Rows contain values from different domains => more entropy, difficult to dense-pack l Columns exhibit significantly less entropy l Examples: Male, Female, Female, Female, Male 1998, 1998, 1999, 1999, 1999, 2000 l Caveat: CPU cost (use lightweight compression) VLDB 2009 Tutorial Column-Oriented Database Systems 7
  • 8.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Telco example explained (3/3): sorting & indexing efficiency l Compression and dense-packing free up space l Use multiple overlapping column collections l Sorted columns compress better l Range queries are faster l Use sparse clustered indexes What about heavily-indexed row-stores? (works well for single column access, cross-column joins become increasingly expensive) VLDB 2009 Tutorial Column-Oriented Database Systems 8
  • 9.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Additional opportunities for column-stores l Block-tuple / vectorized processing l Easier to build block-tuple operators l Amortizes function-call cost, improves CPU cache performance l Easier to apply vectorized primitives l Software-based: bitwise operations l Hardware-based: SIMD Part 3 l Opportunities with compressed columns l Avoid decompression: operate directly on compressed l Delay decompression (and tuple reconstruction) more l Also known as: late materialization in Part 2 l Exploit columnar storage in other DBMS components l Physical design (both static and dynamic) See: Database Cracking, from CWI VLDB 2009 Tutorial Column-Oriented Database Systems 9
  • 10.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Column-Stores vs Row-Stores: How Different are They Effect on C-Store performance Really?” Abadi, Hachem, and Madden. SIGMOD 2008. Average for SSBM queries on C-store original Time (sec) C-store enable late column-oriented enable materialization join algorithm compression & operate on compressed VLDB 2009 Tutorial Column-Oriented Database Systems 10
  • 11.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Summary of column-store key features columnar storage Part 1 header/ID elimination l Storage layout compression Part 2 Part 3 multiple sort orders column operators Part 1 Part 2 avoid decompression l Execution engine Part 2 late materialization vectorized operations Part 3 l Design tools, optimizer VLDB 2009 Tutorial Column-Oriented Database Systems 11
  • 12.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Part 1: Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 12
  • 13.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) From DSM to Column-stores TOD: Time Oriented Database – Wiederhold et al. 70s -1985: "A Modular, Self-Describing Clinical Databank System," Computers and Biomedical Research, 1975 More 1970s: Transposed files, Lorie, Batory, Svensson. “An overview of cantor: a new system for data analysis” Karasalo, Svensson, SSDBM 1983 1985: DSM paper “A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985. 1990s: Commercialization through SybaseIQ Late 90s – 2000s: Focus on main-memory performance l DSM “on steroids” [1997 – now] CWI: MonetDB l Hybrid DSM/NSM [2001 – 2004] Wisconsin: PAX, Fractured Mirrors Michigan: Data Morphing CMU: Clotho 2005 – : Re-birth of read-optimized DSM as “column-store” MIT: C-Store CWI: MonetDB/X100 10+ startups VLDB 2009 Tutorial Column-Oriented Database Systems 13
  • 14.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “A decomposition storage The original DSM paper model” Copeland and Khoshafian. SIGMOD 1985. l Proposed as an alternative to NSM l 2 indexes: clustered on ID, non-clustered on value l Speeds up queries projecting few columns l Requires more storage value ID 0100 0962 1000 .. 1 2 3 4 .. VLDB 2009 Tutorial Column-Oriented Database Systems 14
  • 15.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Memory wall and PAX l 90s: Cache-conscious research “Cache Conscious Algorithms for from: Relational Query Processing.” Shatdal, Kant, Naughton. VLDB 1994. “DBMSs on a modern processor: “Database Architecture Optimized for and: Where does time go?” Ailamaki, to: the New Bottleneck: Memory Access.” DeWitt, Hill, Wood. VLDB 1999. Boncz, Manegold, Kersten. VLDB 1999. l PAX: Partition Attributes Across l Retains NSM I/O pattern l Optimizes cache-to-RAM communication “Weaving Relations for Cache Performance.” Ailamaki, DeWitt, Hill, Skounakis, VLDB 2001. VLDB 2009 Tutorial Column-Oriented Database Systems 15
  • 16.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) More hybrid NSM/DSM schemes l Dynamic PAX: Data Morphing “Data morphing: an adaptive, cache-conscious storage technique.” Hankins, Patel, VLDB 2003. l Clotho: custom layout using scatter-gather I/O “Clotho: Decoupling Memory Page Layout from Storage Organization.” Shao, Schindler, Schlosser, Ailamaki, and Ganger. VLDB 2004. l Fractured mirrors l Smart mirroring with both NSM/DSM copies “A Case For Fractured Mirrors.” Ramamurthy, DeWitt, Su, VLDB 2002. VLDB 2009 Tutorial Column-Oriented Database Systems 16
  • 17.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) MonetDB (more in Part 3) l Late 1990s, CWI: Boncz, Manegold, and Kersten l Motivation: l Main-memory l Improve computational efficiency by avoiding expression interpreter l DSM with virtual IDs natural choice l Developed new query execution algebra l Initial contributions: l Pointed out memory-wall in DBMSs l Cache-conscious projections and joins l … VLDB 2009 Tutorial Column-Oriented Database Systems 17
  • 18.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) 2005: the (re)birth of column-stores l New hardware and application realities l Faster CPUs, larger memories, disk bandwidth limit l Multi-terabyte Data Warehouses l New approach: combine several techniques l Read-optimized, fast multi-column access, disk/CPU efficiency, light-weight compression l C-store paper: l First comprehensive design description of a column-store l MonetDB/X100 l “proper” disk-based column store l Explosion of new products VLDB 2009 Tutorial Column-Oriented Database Systems 18
  • 19.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Performance tradeoffs: columns vs. rows DSM traditionally was not favored by technology trends How has this changed? l Optimized DSM in “Fractured Mirrors,” 2002 l “Apples-to-apples” comparison “Performance Tradeoffs in Read- Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06 l Follow-up study “Read-Optimized Databases, In- Depth” Holloway, DeWitt, VLDB’08 l Main-memory DSM vs. NSM “DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN’08 l Flash-disks: a come-back for PAX? “Query Processing Techniques “Fast Scans and Joins Using Flash for Solid State Drives” Drives” Shah, Harizopoulos, Tsirogiannis, Harizopoulos, Wiener, Graefe. DaMoN’08 VLDB 2009 Tutorial Shah, Wiener, Graefe, Column-Oriented Database Systems 19 SIGMOD’09
  • 20.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Fractured mirrors: a closer look l Store DSM relations inside a B-tree “A Case For Fractured Mirrors” Ramamurthy, l Leaf nodes contain values DeWitt, Su, VLDB 2002. l Eliminate IDs, amortize header overhead l Custom implementation on Shore sparse Tuple TID Column Header Data B-tree on ID 1 a1 3 2 a2 3 a3 1 a1 2 a2 3 a3 4 a4 5 a5 4 a4 a1 a2 a3 a4 a5 1 4 5 a5 Similar: storage density “Efficient columnar comparable storage in B-trees” Graefe. to column stores Sigmod Record 03/2007. VLDB 2009 Tutorial Column-Oriented Database Systems 20
  • 21.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Fractured mirrors: performance From PAX paper: column? time row regular DSM column? columns projected: 1 2 3 4 5 optimized l Chunk-based tuple merging DSM l Read in segments of M pages l Merge segments in memory l Becomes CPU-bound after 5 pages VLDB 2009 Tutorial Column-Oriented Database Systems 21
  • 22.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Performance Tradeoffs in Read- Column-scanner Optimized Databases” implementation Harizopoulos, Liang, Abadi, Madden, VLDB’06 row scanner column scanner Joe 45 … … Joe 45 … … SELECT name, age WHERE age > 40 apply S predicate(s) S #POS 45 Joe #POS … Direct I/O Sue … prefetch ~100ms apply 1 Joe 45 worth of data predicate #1 S 2 Sue 37 …… … 45 37 … VLDB 2009 Tutorial Column-Oriented Database Systems 22
  • 23.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Scan performance l Large prefetch hides disk seeks in columns l Column-CPU efficiency with lower selectivity l Row-CPU suffers from memory stalls not shown, l Memory stalls disappear in narrow tuples details in the paper l Compression: similar to narrow VLDB 2009 Tutorial Column-Oriented Database Systems 23
  • 24.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Read-Optimized Databases, In- Even more results Depth” Holloway, DeWitt, VLDB’08 35 • Same engine as before narrow & compressed tuple: • Additional findings 30 CPU-bound! C-25% 25 C-10% R-50% 20 Time (s) 15 10 5 wide attributes: same as before 0 1 2 3 4 5 6 7 8 9 10 Columns Returned Non-selective queries, narrow tuples, favor well-compressed rows Materialized views are a win Column-joins are Scan times determine early materialized joins covered in part 2! VLDB 2009 Tutorial Column-Oriented Database Systems 24
  • 25.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Speedup of columns over rows “Performance Tradeoffs in Read- Optimized Databases” Harizopoulos, Liang, Abadi, cycles per disk byte 144 Madden, VLDB’06 72 (cpdb) 36 +++ 18 _ = + ++ 9 8 12 16 20 24 28 32 36 tuple width l Rows favored by narrow tuples and low cpdb l Disk-bound workloads have higher cpdb VLDB 2009 Tutorial Column-Oriented Database Systems 25
  • 26.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Varying prefetch size no competing disk traffic 40 Column 2 time (sec) 30 Column 8 20 Column 16 Column 48 (x 128KB) 10 Row (any prefetch size) 0 4 8 12 16 20 24 28 32 selected bytes per tuple l No prefetching hurts columns in single scans VLDB 2009 Tutorial Column-Oriented Database Systems 26
  • 27.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Varying prefetch size with competing disk traffic 40 Column, 48 40 time (sec) 30 Row, 48 30 20 20 10 10 Column, 8 Row, 8 0 0 4 12 20 28 4 12 20 28 selected bytes per tuple l No prefetching hurts columns in single scans l Under competing traffic, columns outperform rows for any prefetch size VLDB 2009 Tutorial Column-Oriented Database Systems 27
  • 28.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “DSM vs. NSM: CPU performance trade CPU Performance offs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN’08 l Benefit in on-the-fly conversion between NSM and DSM l DSM: sequential access (block fits in L2), random in L1 l NSM: random access, SIMD for grouped Aggregation VLDB 2009 Tutorial Column-Oriented Database Systems 28
  • 29.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) New storage technology: Flash SSDs l Performance characteristics l very fast random reads, slow random writes l fast sequential reads and writes l Price per bit (capacity follows) l cheaper than RAM, order of magnitude more expensive than Disk l Flash Translation Layer introduces unpredictability l avoid random writes! l Form factors not ideal yet l SSD (Ł small reads still suffer from SATA overhead/OS limitations) l PCI card (Ł high price, limited expandability) l Boost Sequential I/O in a simple package l Flash RAID: very tight bandwidth/cm3 packing (4GB/sec inside the box) l Column Store Updates l useful for delta structures and logs l Random I/O on flash fixes unclustered index access l still suboptimal if I/O block size > record size l therefore column stores profit mush less than horizontal stores l Random I/O useful to exploit secondary, tertiary table orderings l the larger the data, the deeper clustering one can exploit VLDB 2009 Tutorial Column-Oriented Database Systems 29
  • 30.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Even faster column scans on flash SSDs 30K Read IOps, 3K Write Iops l New-generation SSDs 250MB/s Read BW, 200MB/s Write l Very fast random reads, slower random writes l Fast sequential RW, comparable to HDD arrays l No expensive seeks across columns l FlashScan and Flashjoin: PAX on SSDs, inside Postgres “Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD’09 mini-pages with no qualified attributes are not accessed VLDB 2009 Tutorial Column-Oriented Database Systems 30
  • 31.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-scan performance over time regular DSM (2001) column-store (2006) ..to 1.2x slower from 7x slower ..to 2x slower ..to same and 3x faster! optimized DSM (2002) SSD Postgres/PAX (2009) VLDB 2009 Tutorial Column-Oriented Database Systems 31
  • 32.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Part 1: Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 32
  • 33.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Architecture of a column-store storage layout l read-optimized: dense-packed, compressed l organize in extends, batch updates l multiple sort orders l sparse indexes engine l block-tuple operators l new access methods system-level l optimized relational operators l system-wide column support l loading / updates l scaling through multiple nodes l transactions / redundancy VLDB 2009 Tutorial Column-Oriented Database Systems 33
  • 34.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “C-Store: A Column-Oriented DBMS.” Stonebraker et al. C-Store VLDB 2005. l Compress columns l No alignment l Big disk blocks l Only materialized views (perhaps many) l Focus on Sorting not indexing l Data ordered on anything, not just time l Automatic physical DBMS design l Optimize for grid computing l Innovative redundancy l Xacts – but no need for Mohan l Column optimizer and executor VLDB 2009 Tutorial Column-Oriented Database Systems 34
  • 35.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) C-Store: only materialized views (MVs) l Projection (MV) is some number of columns from a fact table l Plus columns in a dimension table – with a 1-n join between Fact and Dimension table l Stored in order of a storage key(s) l Several may be stored! l With a permutation, if necessary, to map between them l Table (as the user specified it and sees it) is not stored! l No secondary indexes (they are a one column sorted MV plus a permutation, if you really want one) User view: Possible set of MVs: EMP (name, age, salary, dept) MV-1 (name, dept, floor) in floor order Dept (dname, floor) MV-2 (salary, age) in age order MV-3 (dname, salary, name) in salary order VLDB 2009 Tutorial Column-Oriented Database Systems 35
  • 36.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Continuous Load and Query (Vertica) Hybrid Storage Architecture > Write Optimized > Read Optimized Store (WOS) Store (ROS) Trickle • On disk Load • Sorted / Compressed TUPLE MOVER Asynchronous • Segmented A B C Data Transfer • Large data loaded direct §Memory based A B C §Unsorted / Uncompressed §Segmented §Low latency / Small quick (A B C | A) inserts VLDB 2009 Tutorial Column-Oriented Database Systems 36
  • 37.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Loading Data (Vertica) > INSERT, UPDATE, DELETE Write-Optimized Store (WOS) > Bulk and Trickle Loads In-memory §COPY Automatic §COPY DIRECT Tuple Mover > User loads data into logical Tables > Vertica loads atomically into storage Read-Optimized Store (ROS) On-disk VLDB 2009 Tutorial Column-Oriented Database Systems 37
  • 38.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Applications for column-stores l Data Warehousing l High end (clustering) l Mid end/Mass Market l Personal Analytics l Data Mining l E.g. Proximity l Google BigTable l RDF l Semantic web data management l Information retrieval l Terabyte TREC l Scientific datasets l SciDB initiative l SLOAN Digital Sky Survey on MonetDB VLDB 2009 Tutorial Column-Oriented Database Systems 38
  • 39.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) List of column-store systems l Cantor (history) l Sybase IQ l SenSage (former Addamark Technologies) l Kdb l 1010data l MonetDB l C-Store/Vertica l X100/VectorWise l KickFire l SAP Business Accelerator l Infobright l ParAccel l Exasol VLDB 2009 Tutorial Column-Oriented Database Systems 39
  • 40.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Part 1: Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 40
  • 41.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Simulate a Column-Store inside a Row-Store Date Store Product Customer Price 01/01 BOS Table Mesa $20 01/01 NYC Chair Lutz $13 Option B: Index Every Column 01/01 BOS Bed Mudd $79 Option A: Date Index Vertical Partitioning Date Store Product Customer Price TID Value TID Value TID Value TID Value TID Value 1 01/01 1 BOS 1 Table 1 Mesa 1 $20 Store Index 2 01/01 2 NYC 2 Chair 2 Lutz 2 $13 3 01/01 3 BOS 3 Bed 3 Mudd 3 $79 … VLDB 2009 Tutorial Column-Oriented Database Systems 41
  • 42.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Simulate a Column-Store inside a Row-Store Date Store Product Customer Price 01/01 BOS Table Mesa $20 01/01 NYC Chair Lutz $13 Option B: Index Every Column 01/01 BOS Bed Mudd $79 Option A: Date Index Vertical Partitioning Date Store Product Customer Price Value StartPos Length TID Value TID Value TID Value TID Value 01/01 1 3 1 BOS 1 Table 1 Mesa 1 $20 Store Index 2 NYC 2 Chair 2 Lutz 2 $13 Can explicitly run- 3 BOS 3 Bed 3 Mudd 3 $79 length encode date “Teaching an Old Elephant New Tricks.” Bruno, CIDR 2009. … VLDB 2009 Tutorial Column-Oriented Database Systems 42
  • 43.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Experiments l Star Schema Benchmark (SSBM) Adjoined Dimension Column Index (ADC Index) to Improve Star Schema Query Performance”. O’Neil et. al. ICDE 2008. l Implemented by professional DBA l Original row-store plus 2 column-store simulations on same row-store product 250.0 “Column-Stores vs Row-Stores: 200.0 How Different are They Really?” Abadi, Hachem, and Madden. Time (seconds) 150.0 SIGMOD 2008. 100.0 50.0 0.0 Vertically Partitioned Row-Store With All Normal Row-Store Row-Store Indexes Average 25.7 79.9 221.2 VLDB 2009 Tutorial Column-Oriented Database Systems 43
  • 44.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What’s Going On? Vertical Partitions l Vertical partitions in row-stores: l Work well when workload is known l ..and queries access disjoint sets of columns l See automated physical design Tuple TID Column Header Data 1 l Do not work well as full-columns 2 l TupleID overhead significant 3 l Excessive joins Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns “Column-Stores vs. Row-Stores: Complete fact table takes up ~4 GB How Different Are They Really?” (compressed) Abadi, Madden, and Hachem. Vertically partitioned tables take up 0.7-1.1 SIGMOD 2008. GB (compressed) VLDB 2009 Tutorial Column-Oriented Database Systems 44
  • 45.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What’s Going On? All Indexes Case l Tuple construction l Common type of query: SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.country = “Canada” GROUP BY store_name l Result of lower part of query plan is a set of TIDs that passed all predicates l Need to extract SELECT attributes at these TIDs l BUT: index maps value to TID l You really want to map TID to value (i.e., a vertical partition) Tuple construction is SLOW VLDB 2009 Tutorial Column-Oriented Database Systems 45
  • 46.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) So…. l All indexes approach is a poor way to simulate a column-store l Problems with vertical partitioning are NOT fundamental l Store tuple header in a separate partition l Allow virtual TIDs l Combine clustered indexes, vertical partitioning l So can row-stores simulate column-stores? l Might be possible, BUT: l Need better support for vertical partitioning at the storage layer l Need support for column-specific optimizations at the executer level l Full integration: buffer pool, transaction manager, .. l When will this happen? See Part 2, Part 3 l Most promising features = soon for most promising features l ..unless new technology / new objectives change the game (SSDs, Massively Parallel Platforms, Energy-efficiency) VLDB 2009 Tutorial Column-Oriented Database Systems 46
  • 47.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) End of Part 1 l Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 47
  • 48.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Part 2 Outline l Compression l Tuple Materialization l Joins VLDB 2009 Tutorial Column-Oriented Database Systems 48
  • 49.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems Compression “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi, Madden, and Ferreira, SIGMOD ’06 •Query optimization in compressed database systems” Chen, Gehrke, Korn, SIGMOD’01
  • 50.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Compression l Trades I/O for CPU l Increased column-store opportunities: l Higher data value locality in column stores l Techniques such as run length encoding far more useful l Can use extra space to store multiple copies of data in different sort orders VLDB 2009 Tutorial Column-Oriented Database Systems 50
  • 51.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Run-length Encoding Quarter Product ID Price Quarter Product ID Price (value, start_pos, run_length) (value, start_pos, run_length) Q1 1 5 (Q1, 1, 300) (1, 1, 5) 5 Q1 1 7 (2, 6, 2) 7 Q1 1 2 (Q2, 301, 350) 2 Q1 1 9 … (Q3, 651, 500) 9 Q1 1 6 (1, 301, 3) 6 Q1 2 8 (Q4, 1151, 600) (2, 304, 1) 8 Q1 2 5 … … … … 5 … Q2 1 3 3 Q2 1 8 8 Q2 1 1 1 Q2 2 4 … … … 4 … VLDB 2009 Tutorial Column-Oriented Database Systems 51
  • 52.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06 Bit-vector Encoding l For each unique Product ID ID: 1 ID: 2 ID: 3 … value, v, in column c, create bit-vector 1 1 0 0 0 b 1 1 0 0 0 l b[i] = 1 if c[i] = v 1 1 0 0 0 1 1 0 0 0 l Good for columns 1 1 0 0 0 with few unique 2 0 1 0 0 values 2 0 1 0 0 l Each bit-vector … … … … … can be further 1 1 0 0 0 compressed if 1 1 0 0 0 sparse 2 0 1 0 0 3 0 0 1 0 … … … … … VLDB 2009 Tutorial Column-Oriented Database Systems 52
  • 53.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06 Dictionary Encoding Quarter l For each Quarter unique value 0 Quarter 1 create Q1 3 dictionary entry 0 24 Q2 2 l Dictionary can 0 128 Q4 0 be per-block or 0 122 Q1 1 per-column 3 l Column-stores Q3 Q1 2 2 OR + have the Q1 + Dictionary advantage that Dictionary dictionary Q1 24: Q1, Q2, Q4, Q1 entries may Q2 0: Q1 … encode multiple Q4 1: Q2 122: Q2, Q4, Q3, Q3 values at once Q3 2: Q3 … Q3 3: Q4 128: Q3, Q1, Q1, Q1 … VLDB 2009 Tutorial Column-Oriented Database Systems 53
  • 54.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Frame Of Reference Encoding Price Price l Encodes values as b bit Frame: 50 offset from chosen frame 45 -5 54 of reference 4 48 l Special escape code (e.g. -2 55 4 bits per all bits set to 1) indicates 5 value 51 1 a difference larger than 53 3 can be stored in b bits 40 ∞ l After escape code, 50 40 original (uncompressed) 49 Exceptions (see 0 part 3 for a better value is written 62 way to deal with -1 exceptions) 52 ∞ “Compressing Relations and Indexes 50 … 62 ” Goldstein, Ramakrishnan, Shaft, 2 ICDE’98 0 … VLDB 2009 Tutorial Column-Oriented Database Systems 54
  • 55.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Differential Encoding Time Time l Encodes values as b bit offset from previous value l Special escape code (just like 5:00 5:00 frame of reference encoding) 5:02 2 indicates a difference larger than can be stored in b bits 5:03 1 2 bits per l After escape code, original 5:03 0 value (uncompressed) value is written 5:04 1 l Performs well on columns containing increasing/decreasing 5:06 2 sequences 5:07 1 l inverted lists 5:08 1 l timestamps l object IDs 5:10 2 Exception (see l sorted / clustered columns 5:15 ∞ part 3 for a better way to deal with 5:16 5:15 exceptions) “Improved Word-Aligned Binary 5:16 1 Compression for Text Indexing” … 0 Ahn, Moffat, TKDE’06 VLDB 2009 Tutorial Column-Oriented Database Systems 55
  • 56.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What Compression Scheme To Use? Does column appear in the sort key? yes no Is the average Are number of unique run-length > 2 values < ~50000 yes no yes Differential RLE Encoding no Does this column appear frequently in selection predicates? yes no Is the data numerical and exhibit good locality? yes no Bit-vector Dictionary Compression Compression Frame of Reference Leave Data Encoding Uncompressed OR Heavyweight Compression VLDB 2009 Tutorial Column-Oriented Database Systems 56
  • 57.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Heavy-Weight Compression Schemes l Modern disk arrays can achieve > 1GB/s l 1/3 CPU for decompression Ł 3GB/s needed Ł Lightweight compression schemes are better Ł Even better: operate directly on compressed data VLDB 2009 Tutorial Column-Oriented Database Systems 57
  • 58.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06 Operating Directly on Compressed Data l I/O - CPU tradeoff is no longer a tradeoff l Reduces memory–CPU bandwidth requirements l Opens up possibility of operating on multiple records at once VLDB 2009 Tutorial Column-Oriented Database Systems 58
  • 59.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06 Operating Directly on Compressed Data Quarter Product ID 1 2 3 … … … ProductID, COUNT(*)) (Q1, 1, 300) 1 0 0 301-306 0 0 1 (1, 3) (Q2, 301, 6) 0 1 0 (2, 1) (Q3, 307, 500) 1 0 0 1 0 0 (3, 2) (Q4, 807, 600) 0 0 1 0 1 0 SELECT ProductID, Count(*) 1 0 0 FROM table 0 0 1 WHERE (Quarter = Q2) 0 1 0 GROUP BY ProductID Index Lookup + Offset jump 0 0 1 … … … VLDB 2009 Tutorial Column-Oriented Database Systems 59
  • 60.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06 Operating Directly on Compressed Data SELECT ProductID, Count(*) Block API FROM table WHERE (Quarter = Q2) Data GROUP BY ProductID Aggregation isOneValue() Operator isValueSorted() isPosContiguous() isSparse() Selection getNext() Operator decompressIntoArray() (Q1, 1, 300) getValueAtPosition(pos) (Q2, 301, 6) getMin() Compression- getMax() (Q3, 307, 500) Aware Scan getSize() (Q4, 807, 600) Operator VLDB 2009 Tutorial Column-Oriented Database Systems 60
  • 61.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems Tuple Materialization and Column-Oriented Join Algorithms “Materialization Strategies in a Column- “Query Processing Techniques for Oriented DBMS” Abadi, Myers, DeWitt, Solid State Drives” Tsirogiannis, and Madden. ICDE 2007. Harizopoulos Shah, Wiener, and Graefe. SIGMOD 2009. “Self-organizing tuple reconstruction in column-stores“, Idreos, Manegold, “Cache-Conscious Radix-Decluster Kersten, SIGMOD’09 Projections”, Manegold, Boncz, Nes, VLDB’04 “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
  • 62.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) When should columns be projected? l Where should column projection operators be placed in a query plan? l Row-store: l Column projection involves removing unneeded columns from tuples l Generally done as early as possible l Column-store: l Operation is almost completely opposite from a row-store l Column projection involves reading needed columns from storage and extracting values for a listed set of tuples § This process is called “materialization” l Early materialization: project columns at beginning of query plan § Straightforward since there is a one-to-one mapping across columns l Late materialization: wait as long as possible for projecting columns § More complicated since selection and join operators on one column obfuscates mapping to other columns from same table l Most column-stores construct tuples and column projection time § Many database interfaces expect output in regular tuples (rows) § Rest of discussion will focus on this case VLDB 2009 Tutorial Column-Oriented Database Systems 62
  • 63.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) When should tuples be constructed? Select + Aggregate QUERY: 4 2 2 7 SELECT custID,SUM(price) FROM table 4 1 3 13 WHERE (prodID = 4) AND 4 3 3 42 (storeID = 1) AND GROUP BY custID 4 1 3 80 Construct l Solution 1: Create rows first (EM). But: l Need to construct ALL tuples (4,1,4) 2 2 7 1 3 13 l Need to decompress data 3 3 42 l Poor memory bandwidth 1 3 80 utilization prodID storeID custID price VLDB 2009 Tutorial Column-Oriented Database Systems 63
  • 64.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Solution 2: Operate on columns QUERY: 1 0 SELECT custID,SUM(price) 1 1 AGG FROM table 1 0 WHERE (prodID = 4) AND 1 1 (storeID = 1) AND Data Data GROUP BY custID Source Source custID price Data Data Source Source AND 4 2 2 7 4 2 4 1 3 13 Data Data 4 1 4 3 3 42 Source Source 4 3 4 1 3 80 prodID storeID 4 1 prodID storeID custID price prodID storeID VLDB 2009 Tutorial Column-Oriented Database Systems 64
  • 65.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Solution 2: Operate on columns QUERY: 0 SELECT custID,SUM(price) AGG FROM table 1 WHERE (prodID = 4) AND 0 (storeID = 1) AND Data Data 1 GROUP BY custID Source Source custID price AND AND 4 2 2 7 1 0 4 1 3 13 Data Data 1 1 4 3 3 42 Source Source 1 0 4 1 3 80 prodID storeID 1 1 prodID storeID custID price VLDB 2009 Tutorial Column-Oriented Database Systems 65
  • 66.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Solution 2: Operate on columns QUERY: 3 13 1 SELECT custID,SUM(price) AGG 3 80 1 FROM table WHERE (prodID = 4) AND (storeID = 1) AND 2 0 7 0 GROUP BY custID Data Data Source Source 3 1 Data Data 13 1 custID price Source Source 3 0 42 0 3 1 80 1 AND custID price 4 2 2 7 0 4 1 3 13 Data Data 1 4 3 3 42 Source Source 0 4 1 3 80 prodID storeID 1 prodID storeID custID price VLDB 2009 Tutorial Column-Oriented Database Systems 66
  • 67.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Solution 2: Operate on columns QUERY: SELECT custID,SUM(price) AGG FROM table WHERE (prodID = 4) AND 3 1 1 93 (storeID = 1) AND Data Data GROUP BY custID Source Source custID price AGG AND 4 2 2 7 4 1 3 13 Data Data 3 1 13 1 4 3 3 42 Source Source 3 1 80 1 4 1 3 80 prodID storeID prodID storeID custID price VLDB 2009 Tutorial Column-Oriented Database Systems 67
  • 68.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007. For plans without joins, late materialization is a win 10 QUERY: 9 8 SELECT C1, SUM(C2) Time (seconds) 7 FROM table 6 WHERE (C1 < CONST) AND 5 4 (C2 < CONST) 3 GROUP BY C1 2 1 l Ran on 2 compressed 0 Low selectivity Medium High selectivity columns from TPC-H selectivity scale 10 data Early Materialization Late Materialization VLDB 2009 Tutorial Column-Oriented Database Systems 68
  • 69.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007. Even on uncompressed data, late materialization is still a win 14 QUERY: 12 SELECT C1, SUM(C2) Time (seconds) 10 FROM table 8 WHERE (C1 < CONST) AND 6 (C2 < CONST) 4 GROUP BY C1 2 0 l Materializing late still Low selectivity Medium High selectivity selectivity works best Early Materialization Late Materialization VLDB 2009 Tutorial Column-Oriented Database Systems 69
  • 70.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What about for plans with joins? Select R1.B, R1.C, R2.E, R2.H, R3.F B, C, E, H, F From R1, R2, R3 Project Where R1.A = R2.D AND R2.G = R3.K B, C, E, H, F Join G=K F B, C Join 2 G=K G K B, C, E, G, H Scan F, K Project Join 1 R3 (F, K, L) A=D Scan Join A=D R3 (F, K, L) G E, H A, B, C D, E, G, H A D Scan Scan Scan Scan R1 (A, B, C) R2 (D, E, G, H) R1 (A, B, C) R2 (D, E, G, H) VLDB 2009 Tutorial Column-Oriented Database Systems 70 70
  • 71.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What about for plans with joins? Select R1.B, R1.C, R2.E, R2.H, R3.F B, C, E, H, F From R1, R2, R3 Project Where R1.A = R2.D AND R2.G = R3.K B, C, E, H, F Join G=K F B, C Join 2 G=K G K B, C, E, G, H Early Late Scan F, K Project materialization Join 1 materialization (F, K, L) R3 A=D Scan Join A=D R3 (F, K, L) G E, H A, B, C D, E, G, H A D Scan Scan Scan Scan R1 (A, B, C) R2 (D, E, G, H) R1 (A, B, C) R2 (D, E, G, H) VLDB 2009 Tutorial Column-Oriented Database Systems 71 71
  • 72.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Early Materialization Example 2 7 QUERY: 3 13 SELECT C.lastName,SUM(F.price) 1 Green FROM facts AS F, customers AS C 3 42 2 White WHERE F.custID = C.custID 3 80 GROUP BY C.lastName 3 Brown Construct Construct (4,1,4) 12 6 2 7 1 1 3 13 1 Green 11 2 3 42 2 White 1 1 3 80 3 Brown prodID storeID quantity custID price custID lastName Facts Customers VLDB 2009 Tutorial Column-Oriented Database Systems 72
  • 73.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Early Materialization Example QUERY: 7 White SELECT C.lastName,SUM(F.price) 13 Brown FROM facts AS F, customers AS C 42 Brown WHERE F.custID = C.custID GROUP BY C.lastName 80 Brown Join 2 7 3 13 1 Green 3 42 2 White 3 80 3 Brown VLDB 2009 Tutorial Column-Oriented Database Systems 73
  • 74.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Late Materialization Example QUERY: 1 2 2 3 SELECT C.lastName,SUM(F.price) FROM facts AS F, customers AS C 3 1 WHERE F.custID = C.custID 4 3 GROUP BY C.lastName Join Late materialized join causes out of order probing of projected (4,1,4) 12 6 2 7 columns from the 1 1 3 13 1 Green inner relation 11 2 1 42 2 White 1 1 3 80 3 Brown prodID storeID quantity custID price custID lastName Facts Customers VLDB 2009 Tutorial Column-Oriented Database Systems 74
  • 75.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Late Materialized Join Performance l Naïve LM join about 2X slower than EM join on typical queries (due to random I/O) l This number is very dependent on l Amount of memory available l Number of projected attributes l Join cardinality l But we can do better l Invisible Join l Jive/Flash Join l Radix cluster/decluster join VLDB 2009 Tutorial Column-Oriented Database Systems 75
  • 76.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Column-Stores vs Row-Stores: How Invisible Join Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. l Designed for typical joins when data is modeled using a star schema l One (“fact”) table is joined with multiple dimension tables l Typical query: select c_nation, s_nation, d_year, sum(lo_revenue) as revenue from customer, lineorder, supplier, date where lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_orderdate = d_datekey and c_region = 'ASIA‘ and s_region = 'ASIA‘ and d_year >= 1992 and d_year <= 1997 group by c_nation, s_nation, d_year order by d_year asc, revenue desc; VLDB 2009 Tutorial Column-Oriented Database Systems 76
  • 77.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Invisible Join Madden, and Hachem. SIGMOD 2008. Apply “region = ‘Asia’” On Customer Table custkey region nation … 1 ASIA CHINA … Hash Table (or bit-map) 2 ASIA INDIA … Containing Keys 1, 2 and 3 3 ASIA INDIA … 4 EUROPE FRANCE … Apply “region = ‘Asia’” On Supplier Table suppkey region nation … 1 ASIA RUSSIA … Hash Table (or bit-map) 2 EUROPE SPAIN … Containing Keys 1, 3 3 ASIA JAPAN … Apply “year in [1992,1997]” On Date Table dateid year … 01011997 1997 … Hash Table Containing 01021997 1997 … Keys 01011997, 01021997, 01031997 1997 … and 01031997 VLDB 2009 Tutorial Column-Oriented Database Systems 77
  • 78.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Original Fact Table “Column-Stores vs Row-Stores: orderkey custkey suppkey orderdate revenue How Different are They Really?” 1 3 1 01011997 43256 Abadi et. al. SIGMOD 2008. 2 3 2 01011997 33333 1 1 1 1 3 4 3 01021997 12121 1 0 1 0 0 1 1 0 4 5 6 7 1 4 1 3 1 2 2 2 01021997 23233 01021997 45456 01031997 43251 01031997 34235 1 0 1 & & = 1 0 0 1 1 1 1 0 0 1 0 1 0 Hash Table Hash Table Hash Table Containing Containing Containing Keys 01011997, Keys 1, 2 and 3 Keys 1 and 3 01021997, and 01031997 + custkey + suppkey + orderdate 3 1 1 1 01011997 1 3 1 2 0 01011997 1 4 0 3 1 01021997 1 1 1 1 1 01021997 1 4 0 2 0 01021997 1 1 1 2 0 01031997 1 3 1 2 0 01031997 1 VLDB 2009 Tutorial Column-Oriented Database Systems 78
  • 79.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) custkey 3 1 3 0 4 1 4 + 0 1 0 3 1 + CHINA INDIA INDIA FRANCE = INDIA CHINA 1 0 3 0 suppkey 1 1 2 0 3 1 2 + 0 1 0 0 1 1 + RUSSIA SPAIN JAPAN = RUSSIA RUSSIA 2 2 0 orderdate 01011997 1 01011997 0 01011997 1997 01021997 01021997 01021997 + 0 1 0 01011997 01021997 JOIN 01021997 01031997 1997 1997 = 1997 1997 01031997 0 01031997 0 VLDB 2009 Tutorial Column-Oriented Database Systems 79
  • 80.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. Invisible Join Apply “region = ‘Asia’” On Customer Table custkey region nation … 1 ASIA CHINA … Hash Table (or bit-map) 2 ASIA INDIA … Containing Keys 1, 2 and 3 3 ASIA INDIA … 4 EUROPE FRANCE … Range [1-3] (between-predicate rewriting) Apply “region = ‘Asia’” On Supplier Table suppkey region nation … 1 ASIA RUSSIA … Hash Table (or bit-map) 2 EUROPE SPAIN … Containing Keys 1, 3 3 ASIA JAPAN … Apply “year in [1992,1997]” On Date Table dateid year … 01011997 1997 … Hash Table Containing 01021997 1997 … Keys 01011997, 01021997, 01031997 1997 … and 01031997 VLDB 2009 Tutorial Column-Oriented Database Systems 80
  • 81.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Invisible Join l Bottom Line l Many data warehouses model data using star/snowflake schemes l Joins of one (fact) table with many dimension tables is common l Invisible join takes advantage of this by making sure that the table that can be accessed in position order is the fact table for each join l Position lists from the fact table are then intersected (in position order) l This reduces the amount of data that must be accessed out of order from the dimension tables l “Between-predicate rewriting” trick not relevant for this discussion VLDB 2009 Tutorial Column-Oriented Database Systems 81
  • 82.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) custkey 3 1 3 0 4 1 4 + 0 1 0 3 1 + CHINA INDIA INDIA FRANCE = INDIA CHINA 1 0 3 0 suppkey Still accessing table out of order 1 1 2 0 3 1 2 + 0 1 0 0 1 1 + RUSSIA SPAIN JAPAN = RUSSIA RUSSIA 2 2 0 orderdate 01011997 1 01011997 0 01011997 1997 01021997 01021997 01021997 + 0 1 0 01011997 01021997 JOIN 01021997 01031997 1997 1997 = 1997 1997 01031997 0 01031997 0 VLDB 2009 Tutorial Column-Oriented Database Systems 82
  • 83.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Jive/Flash Join 3 1 + CHINA INDIA INDIA FRANCE = INDIA CHINA “Fast Joins using Join Indices”. Li and Ross, VLDBJ 8:1-24, 1999. Still accessing table out of order “Query Processing Techniques for Solid State Drives”. Tsirogiannis, Harizopoulos et. al. SIGMOD 2009. VLDB 2009 Tutorial Column-Oriented Database Systems 83
  • 84.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Jive/Flash Join 1. Add column with dense ascending integers CHINA 2. from 1 Sort new position list 1 2 3 1 + INDIA INDIA FRANCE = INDIA CHINA by second column 3. Probe projected column in order using new sorted position list, keeping first column from position 2 1 1 3 + CHINA INDIA INDIA FRANCE = 2 1 CHINA INDIA list around 4. Sort new result by first 1 INDIA column 2 CHINA VLDB 2009 Tutorial Column-Oriented Database Systems 84
  • 85.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Jive/Flash Join l Bottom Line l Instead of probing projected columns from inner table out of order: l Sort join index l Probe projected columns in order l Sort result using an added column l LM vs EM tradeoffs: l LM has the extra sorts (EM accesses all columns in order) l LM only has to fit join columns into memory (EM needs join columns and all projected columns) § Results in big memory and CPU savings (see part 3 for why there is CPU savings) l LM only has to materialize relevant columns l In many cases LM advantages outweigh disadvantages l LM would be a clear winner if not for those pesky sorts … can we do better? VLDB 2009 Tutorial Column-Oriented Database Systems 85
  • 86.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Radix Cluster/Decluster l The full sort from the Jive join is actually overkill l We just want to access the storage blocks in order (we don’t mind random access within a block) l So do a radix sort and stop early l By stopping early, data within each block is accessed out of order, but in the order specified in the original join index l Use this pseudo-order to accelerate the post-probe sort as well •“Database Architecture Optimized for the New Bottleneck: Memory Access” “Cache-Conscious Radix-Decluster VLDB’99 Projections”, Manegold, Boncz, Nes, •“Generic Database Cost Models for VLDB’04 Hierarchical Memory Systems”, VLDB’02 (all Manegold, Boncz, Kersten) VLDB 2009 Tutorial Column-Oriented Database Systems 86
  • 87.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Radix Cluster/Decluster l Bottom line l Both sorts from the Jive join can be significantly reduced in overhead l Only been tested when there is sufficient memory for the entire join index to be stored three times l Technique is likely applicable to larger join indexes, but utility will go down a little l Only works if random access within a storage block l Don’t want to use radix cluster/decluster if you have variable- width column values or compression schemes that can only be decompressed starting from the beginning of the block VLDB 2009 Tutorial Column-Oriented Database Systems 87
  • 88.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) LM vs EM joins l Invisible, Jive, Flash, Cluster, Decluster techniques contain a bag of tricks to improve LM joins l Research papers show that LM joins become 2X faster than EM joins (instead of 2X slower) for a wide array of query types VLDB 2009 Tutorial Column-Oriented Database Systems 88
  • 89.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Tuple Construction Heuristics l For queries with selective predicates, aggregations, or compressed data, use late materialization l For joins: l Research papers: l Always use late materialization l Commercial systems: l Inner table to a join often materialized before join (reduces system complexity): l Some systems will use LM only if columns from inner table can fit entirely in memory VLDB 2009 Tutorial Column-Oriented Database Systems 89
  • 90.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Computational Efficiency of DB on modern hardware l how column-stores can help here l Keynote revisited: MonetDB & VectorWise in more depth l CPU efficient column compression l vectorized decompression l Conclusions l future work VLDB 2009 Tutorial Column-Oriented Database Systems 90
  • 91.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems 40 years of hardware evolution vs. DBMS computational efficiency
  • 92.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) CPU Architecture Elements: l Storage l CPU caches L1/L2/L3 l Registers l Execution Unit(s) l Pipelined l SIMD VLDB 2009 Tutorial Column-Oriented Database Systems 92
  • 93.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) CPU Metrics VLDB 2009 Tutorial Column-Oriented Database Systems 93
  • 94.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) CPU Metrics VLDB 2009 Tutorial Column-Oriented Database Systems 94
  • 95.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) DRAM Metrics VLDB 2009 Tutorial Column-Oriented Database Systems 95
  • 96.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Super-Scalar Execution (pipelining) VLDB 2009 Tutorial Column-Oriented Database Systems 96
  • 97.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Hazards l Data hazards l Control Hazards l Dependencies between instructions l Branch mispredictions l L1 data cache misses l Computed branches (late binding) l L1 instruction cache misses Result: bubbles in the pipeline Out-of-order execution addresses data hazards l control hazards typically more expensive VLDB 2009 Tutorial Column-Oriented Database Systems 97
  • 98.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SIMD l Single Instruction Multiple Data l Same operation applied on a vector of values l MMX: 64 bits, SSE: 128bits, AVX: 256bits l SSE, e.g. multiply 8 short integers VLDB 2009 Tutorial Column-Oriented Database Systems 98
  • 99.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) A Look at the Query Pipeline 102 ivan 350 next() 37 *–50 7 30 SELECT id, name (age-30)*50 AS bonus 102 ivan 37 7 350 FROM employee WHERE age > 30 PROJECT next() 37 22 > 30 ? 101 102 alice ivan 22 TRUE 37 FALSE SELECT next() 102 101 ivan alice 37 22 SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 99
  • 100.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) A Look at the Query Pipeline 102 ivan 350 next() 37 *–50 7 30 Operators PROJECT Iterator interface next() 37 22 > 30 ? -open() -next(): tuple SELECT -close() next() SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 100
  • 101.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) A Look at the Query Pipeline 102 ivan 350 next() 37 *–50 7 30 102 ivan 37 7 350 Primitives PROJECT Provide computational next() 37 22 > 30 ? functionality 101 102 alice ivan 22 TRUE 37 FALSE SELECT All arithmetic allowed in expressions, next() e.g. Multiplication 102 101 ivan alice 37 22 SCAN 7 * 50 mult(int,int) Ł int VLDB 2009 Tutorial Column-Oriented Database Systems 101
  • 102.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Database Architecture causes Hazards l Data hazards l Control Hazards l Dependencies between instructions l Branch mispredictions l L1 data cache misses l Computed branches (late binding) SIMD Out-of-order Execution l L1 instruction cache misses work on one tuple at a time Large Tree/Hash Structures PROJECT Operators Code footprint of all operators in query plan exceeds L1 cache SELECT Iterator interface -open() Data-dependent conditions -next(): tuple Next() late binding SCAN -close() method calls “DBMSs On A Modern Processor: Where Does Time Go? ” Tree, List, Hash traversal Complex NSM record navigation Ailamaki, DeWitt, Hill, Wood, VLDB’99 VLDB 2009 Tutorial Column-Oriented Database Systems 102
  • 103.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) DBMS Computational Efficiency TPC-H 1GB, query 1 l selects 98% of fact table, computes net prices and aggregates all l Results: l C program: ? l MySQL: 26.2s l DBMS “X”: 28.1s “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 VLDB 2009 Tutorial Column-Oriented Database Systems 103
  • 104.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) DBMS Computational Efficiency TPC-H 1GB, query 1 l selects 98% of fact table, computes net prices and aggregates all l Results: l C program: 0.2s l MySQL: 26.2s l DBMS “X”: 28.1s “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 VLDB 2009 Tutorial Column-Oriented Database Systems 104
  • 105.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems MonetDB
  • 106.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) a column-store l “save disk I/O when scan-intensive queries need a few columns” l “avoid an expression interpreter to improve computational efficiency” VLDB 2009 Tutorial Column-Oriented Database Systems 106
  • 107.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) RISC Database Algebra SELECT id, name, (age-30)*50 as bonus FROM people WHERE age > 30 VLDB 2009 Tutorial Column-Oriented Database Systems 107
  • 108.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) RISC Database Algebra CPU happy? Give it “nice” code ! - few dependencies (control,data) - CPU gets out-of-order execution - compiler can e.g. generate SIMD One loop for an entire column Simple, hard- - no per-tuple interpretation batcalc_minus_int(int* res, coded semantics int* col, - arrays: no record navigation int val, in operators - better instruction cache locality int n) { for(i=0; i<n; i++) as bonus SELECT id, name, (age-30)*50 FROM res[i] = col[i] - val; people } WHERE age > 30 VLDB 2009 Tutorial Column-Oriented Database Systems 108
  • 109.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) RISC Database Algebra CPU happy? Give it “nice” code ! - few dependencies (control,data) - CPU gets out-of-order execution - compiler can e.g. generate SIMD One loop for an entire column - no per-tuple interpretation batcalc_minus_int(int* res, int* col, - arrays: no record navigation int val, - better instruction cache locality int n) { MATERIALIZED for(i=0; i<n; i++) as bonus SELECT id, name, (age-30)*50 intermediate FROM res[i] = col[i] - val; people } WHERE age > 30 results VLDB 2009 Tutorial Column-Oriented Database Systems 109
  • 110.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) a column-store l “save disk I/O when scan-intensive queries need a few columns” l “avoid an expression interpreter to improve computational efficiency” How? l RISC query algebra: hard-coded semantics l Decompose complex expressions in multiple operations l Operators only handle simple arrays l No code that handles slotted buffered record layout l Relational algebra becomes array manipulation language l Often SIMD for free l Plus: use of cache-conscious algorithms for Sort/Aggr/Join VLDB 2009 Tutorial Column-Oriented Database Systems 110
  • 111.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) a Faustian pact l You want efficiency l Simple hard-coded operators l I take scalability l Result materialization n C program: 0.2s n MonetDB: 3.7s n MySQL: 26.2s n DBMS “X”: 28.1s VLDB 2009 Tutorial Column-Oriented Database Systems 111
  • 112.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems as a research platform
  • 113.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SIGMOD 1985 B MonetD bra lge BAT A ts B suppor MonetD G, M L, ODM SQL, X ..RDF •“MIL Primitives for Querying a Fragmented World”, Boncz, Kersten, VLDBJ’98 •“Flattening an Object Algebra to Provide Performance” Boncz, Wilschut, Kersten, ICDE’98 •“MonetDB/XQuery: a fast XQuery processor powered C- by a relational engine” Boncz, Grust, vanKeulen, pport on e RDF su SW-Stor Rittinger, Teubner, SIGMOD’06 / STORE •“SW-Store: a vertically partitioned DBMS for Semantic Web data management“ Abadi, Marcus, Madden, Hollenbach, VLDBJ’09 VLDB 2009 Tutorial Column-Oriented Database Systems 113
  • 114.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) The Software Stack Extensible query lang frontend frameworkXQuery SQL 03 RDF Arrays Extensible Dynamic .. SGL? Runtime QOPT Optimizers Framework! SOAP MonetDB 4 MonetDB 5 OGIS Extensible Architecture-Conscious MonetDB kernel compile Execution platform VLDB 2009 Tutorial Column-Oriented Database Systems 114
  • 115.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) optimizer frontend backend as a research platform •“Database Architecture Optimized for the New l Cache-Conscious Joins Bottleneck: Memory Access” VLDB’99 l Cost Models, Radix-cluster, •“Generic Database Cost Models for Hierarchical Memory Radix-decluster Systems”, VLDB’02 (all Manegold, Boncz, Kersten) •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04 l MonetDB/XQuery: l structural joins exploiting “MonetDB/XQuery: a fast XQuery processor powered positional column access by a relational engine” Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD’06 l Cracking: “Database Cracking”, CIDR’07 l on-the-fly automatic indexing “Updating a cracked database “, SIGMOD’07 without workload knowledge “Self-organizing tuple reconstruction in column- stores“, SIGMOD’09 (all Idreos, Manegold, Kersten) l Recycling: “An architecture for recycling intermediates in a l using materialized intermediates column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09 l Run-time Query Optimization: “ROX: run-time optimization of XQueries”, l correlation-aware run-time Abdelkader, Boncz, Manegold, vanKeulen, optimization without cost model SIGMOD’09 VLDB 2009 Tutorial Column-Oriented Database Systems 115
  • 116.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems “MonetDB/X100” vectorized query processing
  • 117.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) MonetDB spin-off: MonetDB/X100 Materialization vs Pipelining VLDB 2009 Tutorial Column-Oriented Database Systems 117
  • 118.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 next() PROJECT next() SELECT next() 101 alice 22 102 ivan 37 104 peggy 45 105 victor 25 SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 118
  • 119.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 next() PROJECT > 30 ? next() 101 alice 22 FALSE 102 ivan 37 TRUE 104 peggy 45 TRUE 105 victor 25 FALSESELECT next() 101 alice 22 102 ivan 37 104 peggy 45 105 victor 25 SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 119
  • 120.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 “Vectorized Observations: In Cache 102 ivan 350 104 peggy 750 Processing” next() called much less often Ł - 30 * 50 more time spent in primitives next() vector = array of ~100 102 ivan 37 7 350 less in overhead 104 peggy 45 15 750 processed in a tight loop primitive calls process an array of PROJECT > 30 ? CPU in a loop: values cache Resident next() 101 alice 22 FALSE 102 ivan 37 TRUE 104 peggy 45 TRUE 105 victor 25 FALSESELECT next() 101 alice 22 102 ivan 37 104 peggy 45 105 victor 25 SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 120
  • 121.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 102 ivan 350 Observations: 104 peggy 750 next() called much less often Ł - 30 * 50 more time spent in primitives next() 102 ivan 37 7 350 less in overhead 104 peggy 45 15 750 primitive calls process an array of PROJECT > 30 ? values in a loop: next() 101 alice 22 FALSE CPU Efficiency depends on “nice” code 102 ivan 37 TRUE - out-of-order execution 104 peggy 45 TRUE - few dependencies (control,data) 105 victor 25 FALSESELECT - compiler support next() 101 alice 22 Compilers like simple loops over arrays 102 ivan 37 - loop-pipelining 104 peggy 45 - automatic SIMD 105 victor 25 SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 121
  • 122.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Observations: > 30 ? FALSE for(i=0; i<n; i++) next() called much less often Ł TRUE res[i] = (col[i] > x) TRUE * 50 more time spent in primitives FALSE 350 less in overhead 750 - 30 PROJECT primitive calls process an array of for(i=0; i<n; 30 ? > i++) values in a loop: 7 15 FALSE res[i] = (col[i] - x) CPU Efficiency depends on “nice” code TRUE - out-of-order execution TRUE - few dependencies (control,data) * 50 FALSESELECT for(i=0; i<n; i++) - compiler support 350 res[i] = (col[i] * x) 750 Compilers like simple loops over arrays - loop-pipelining - automatic SIMD SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 122
  • 123.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Tricks being played: - Late materialization - Materialization avoidance using selection vectors PROJECT > 30 ? next() 1 2 SELECT next() 101 alice 22 102 ivan 37 104 peggy 45 105 victor 25 SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 123
  • 124.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 map_mul_flt_val_flt_col( float *res, int* sel, Tricks being played: float val, float *col, int n) - Late materialization - 30 * 50 next() { - Materialization avoidance 7 350 for(int i=0; i<n; i++) 15 750 using selection vectors res[i] = val * col[sel[i]]; PROJECT > 30 ? } next() 1 selection vectors used to reduce 2 vector copying SELECT contain selected positions next() 101 alice 22 102 ivan 37 104 peggy 45 105 victor 25 SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 124
  • 125.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 map_mul_flt_val_flt_col( float *res, int* sel, Tricks being played: float val, float *col, int n) - Late materialization - 30 * 50 next() { - Materialization avoidance 7 350 for(int i=0; i<n; i++) 15 750 using selection vectors res[i] = val * col[sel[i]]; PROJECT > 30 ? } next() 1 selection vectors used to reduce 2 vector copying SELECT contain selected positions next() 101 alice 22 102 ivan 37 104 peggy 45 105 victor 25 SCAN VLDB 2009 Tutorial Column-Oriented Database Systems 125
  • 126.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 MonetDB/X100 l Both efficiency l Vectorized primitives l and scalability.. l Pipelined query evaluation n C program: 0.2s n MonetDB/X100: 0.6s n MonetDB: 3.7s n MySQL: 26.2s n DBMS “X”: 28.1s VLDB 2009 Tutorial Column-Oriented Database Systems 126
  • 127.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Memory Hierarchy X100 query engine CPU cache RAM ColumnBM (buffer manager) (raid) Disk(s) VLDB 2009 Tutorial Column-Oriented Database Systems 127
  • 128.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Memory Hierarchy X100 query engine Vectors are only the in-cache representation RAM & disk representation might actually be different CPU cache (vectorwise uses both PAX & DSM) RAM ColumnBM (buffer manager) (raid) Disk(s) VLDB 2009 Tutorial Column-Oriented Database Systems 128
  • 129.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Optimal Vector size? X100 query engine All vectors together should fit the CPU cache Optimizer should tune this, given the query characteristics. CPU cache RAM ColumnBM (buffer manager) (raid) Disk(s) VLDB 2009 Tutorial Column-Oriented Database Systems 129
  • 130.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Varying the Vector size Less and less iterator.next() and primitive function calls (“interpretation overhead”) VLDB 2009 Tutorial Column-Oriented Database Systems 130
  • 131.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Varying the Vector size Vectors start to exceed the CPU cache, causing additional memory traffic VLDB 2009 Tutorial Column-Oriented Database Systems 131
  • 132.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Varying the Vector size The benefit of selection vectors VLDB 2009 Tutorial Column-Oriented Database Systems 132
  • 133.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 MonetDB/MIL materializes columns CPU cache RAM ColumnBM (buffer manager) (raid) Disk(s) VLDB 2009 Tutorial Column-Oriented Database Systems 133
  • 134.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05 Benefits of Vectorized Processing l 100x less Function Calls “Buffering Database Operations for Enhanced Instruction Cache Performance” l iterator.next(), primitives Zhou, Ross, SIGMOD’04 l No Instruction Cache Misses l “Block oriented processing of High locality in the primitives relational database operations in l Less Data Cache Misses modern computer architectures” l Cache-conscious data placement Padmanabhan, Malkemus, Agarwal, l No Tuple Navigation ICDE’01 l Primitives are record-oblivious, only see arrays l Vectorization allows algorithmic optimization l Move activities out of the loop (“strength reduction”) l Compiler-friendly function bodies l Loop-pipelining, automatic SIMD VLDB 2009 Tutorial Column-Oriented Database Systems 134
  • 135.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Balancing Vectorized Query Execution with Bandwidth Optimized Storage ” Zukowski, CWI 2008 Vectorizing Relational Operators l Project l Select l Exploit selectivities, test buffer overflow l Aggregation l Ordered, Hashed l Sort l Radix-sort nicely vectorizes l Join l Merge-join + Hash-join VLDB 2009 Tutorial Column-Oriented Database Systems 135
  • 136.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems Efficient Column Store Compression
  • 137.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Key Ingredients l Compress relations on a per-column basis l Columns compress well l Decompress small vectors of tuples from a column into the CPU cache l Minimize main-memory overhead l Use light-weight, CPU-efficient algorithms l Exploit processing power of modern CPUs VLDB 2009 Tutorial Column-Oriented Database Systems 137
  • 138.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Key Ingredients l Compress relations on a per-column basis l Columns compress well l Decompress small vectors of tuples from a column into the CPU cache l Minimize main-memory overhead VLDB 2009 Tutorial Column-Oriented Database Systems 138
  • 139.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Vectorized Decompression X100 query engine Idea: decompress a vector only compression: -between CPU and RAM CPU -Instead of disk and RAM (classic) cache RAM ColumnBM (buffer manager) (raid) Disk(s) VLDB 2009 Tutorial Column-Oriented Database Systems 139
  • 140.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 RAM-Cache Decompression l Decompress vectors on-demand into the cache l RAM-Cache boundary only crossed once l More (compressed) data cached in RAM l Less bandwidth use VLDB 2009 Tutorial Column-Oriented Database Systems 140
  • 141.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Multi-Core Bandwidth & Compression Performance Degradation with Concurrent Queries 1.2 avg query speed (clock normalized) 1 0.8 0.6 0.4 Intel® Xeon E5410 (Harpertown) - Q6b Normal 0.2 Intel® Xeon E5410 (Harpertown) - Q6b Compressed Intel® Xeon X5560 (Nehalem) - Q6b Normal Intel® Xeon X5560 (Nehalem) - Q6b Compressed 0 1 2 3 4 5 6 7 8 number of concurrent queries VLDB 2009 Tutorial Column-Oriented Database Systems 141
  • 142.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 CPU Efficient Decompression l Decoding loop over cache-resident vectors of code words l Avoid control dependencies within decoding loop l no if-then-else constructs in loop body l Avoid data dependencies between loop iterations VLDB 2009 Tutorial Column-Oriented Database Systems 142
  • 143.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Disk Block Layout l Forward growing section of arbitrary size code words (code word size fixed per block) VLDB 2009 Tutorial Column-Oriented Database Systems 143
  • 144.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Disk Block Layout l Forward growing section of arbitrary size code words (code word size fixed per block) l Backwards growing exception list VLDB 2009 Tutorial Column-Oriented Database Systems 144
  • 145.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Naïve Decompression Algorithm l Use reserved value from code word domain (MAXCODE) to mark exception positions VLDB 2009 Tutorial Column-Oriented Database Systems 145
  • 146.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Deterioriation With Exception% l 1.2GB/s deteriorates to 0.4GB/s VLDB 2009 Tutorial Column-Oriented Database Systems 146
  • 147.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Deterioriation With Exception% l Perf Counters: CPU mispredicts if-then-else VLDB 2009 Tutorial Column-Oriented Database Systems 147
  • 148.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Patching l Maintain a patch-list through code word section that links exception positions VLDB 2009 Tutorial Column-Oriented Database Systems 148
  • 149.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Patching l Maintain a patch-list through code word section that links exception positions l After decoding, patch up the exception positions with the correct values VLDB 2009 Tutorial Column-Oriented Database Systems 149
  • 150.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Patched Decompression VLDB 2009 Tutorial Column-Oriented Database Systems 150
  • 151.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Patched Decompression VLDB 2009 Tutorial Column-Oriented Database Systems 151
  • 152.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 Decompression Bandwidth Patching can be applied to: • Frame of Reference (PFOR) • Delta Compression (PFOR-DELTA) • Dictionary Compression (PDICT) Makes these methods much more applicable to noisy data Ł without additional cost l Patching makes two passes, but is faster! VLDB 2009 Tutorial Column-Oriented Database Systems 152
  • 153.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB Column-Oriented 2009 Tutorial Database Systems Conclusion
  • 154.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Summary (1/2) l Columns and Row-Stores: different? l No fundamental differences l Can current row-stores simulate column-stores now? l not efficiently: row-stores need change l On disk layout vs execution layout l actually independent issues, on-the-fly conversion pays off l column favors sequential access, row random l Mixed Layout schemes l Fractured mirrors l PAX, Clotho l Data morphing VLDB 2009 Tutorial Column-Oriented Database Systems 154
  • 155.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Summary (2/2) l Crucial Columnar Techniques l Storage l Lean headers, sparse indices, fast positional access l Compression l Operating on compressed data l Lightweight, vectorized decompression l Late vs Early materialization l Non-join: LM always wins l Naïve/Invisible/Jive/Flash/Radix Join (LM often wins) l Execution l Vectorized in-cache execution l Exploiting SIMD VLDB 2009 Tutorial Column-Oriented Database Systems 155
  • 156.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Future Work l looking at write/load tradeoffs in column-stores l read-only vs batch loads vs trickle updates vs OLTP VLDB 2009 Tutorial Column-Oriented Database Systems 156
  • 157.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Updates (1/3) l Column-stores are update-in-place averse l In-place: I/O for each column l + re-compression l + multiple sorted replicas l + sparse tree indices Update-in-place is infeasible! VLDB 2009 Tutorial Column-Oriented Database Systems 157
  • 158.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Updates (2/3) l Column-stores use differential mechanisms instead l Differential lists/files or more advanced (e.g. PDTs) l Updates buffered in RAM, merged on each query l Checkpointing merges differences in bulk sequentially l I/O trends favor this anyway § trade RAM for converting random into sequential I/O § this trade is also needed in Flash (do not write randomly!) l How high loads can it sustain? § Depends on available RAM for buffering (how long until full) § Checkpoint must be done within that time § The longer it can run, the less it molests queries § Using Flash for buffering differences buys a lot of time § Hundreds of GBs of differences per server VLDB 2009 Tutorial Column-Oriented Database Systems 158
  • 159.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Updates (3/3) l Differential transactions favored by hardware trends l Snapshot semantics accepted by the user community l can always convert to serialized “Serializable Isolation For Snapshot Databases” Alomari, Cahill, Fekete, Roehm, SIGMOD’08 Ł Row stores could also use differential transactions and be efficient! Ł Implies a departure from ARIES Ł Implies a full rewrite My conclusion: a system that combines row- and columns needs differentially implemented transactions. Starting from a pure column-store, this is a limited add-on. Starting from a pure row-store, this implies a full rewrite. VLDB 2009 Tutorial Column-Oriented Database Systems 159
  • 160.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Future Work l looking at write/load tradeoffs in column-stores l read-only vs batch loads vs trickle updates vs OLTP l database design for column-stores l column-store specific optimizers l compression/materialization/join tricks Ł cost models? l hybrid column-row systems l can row-stores learn new column tricks? l Study of the minimal number changes one needs to make to a row store to get the majority of the benefits of a column-store l Alternative: add features to column-stores that make them more like row stores VLDB 2009 Tutorial Column-Oriented Database Systems 160
  • 161.
    Re-use permitted whenacknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Conclusion l Columnar techniques provide clear benefits for: l Data warehousing, BI l Information retrieval, graphs, e-science l A number of crucial techniques make them effective l Without these, existing row systems do not benefit l Row-Stores and column-stores could be combined l Row-stores may adopt some column-store techniques l Column-stores add row-store (or PAX) functionality l Many open issues to do research on! VLDB 2009 Tutorial Column-Oriented Database Systems 161