SlideShare a Scribd company logo
VLDB 2009 Tutorial
Column-Oriented Database Systems
1
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
Part 1: Stavros Harizopoulos (HP Labs)
Part 2: Daniel Abadi (Yale)
Part 3: Peter Boncz (CWI)
VLDB
2009
Tutorial
VLDB 2009 Tutorial Column-Oriented Database Systems 2
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
What is a column-store?
VLDB 2009 Tutorial Column-Oriented Database Systems 2
row-store column-store
Date CustomerProductStore
+ easy to add/modify a record
- might read in unnecessary data
+ only need to read in relevant data
- tuple writes require multiple accesses
=> suitable for read-mostly, read-intensive, large data repositories
Date Store Product Customer Price
Price
VLDB 2009 Tutorial Column-Oriented Database Systems 3
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Are these two fundamentally different?
l The only fundamental difference is the storage layout
l However: we need to look at the big picture
VLDB 2009 Tutorial Column-Oriented Database Systems 3
‘70s ‘80s ‘90s ‘00s today
row-stores row-stores++ row-stores++
different storage layouts proposed
new applications
new bottleneck in hardware
column-stores
converge?
l How did we get here, and where we are heading
l What are the column-specific optimizations?
l How do we improve CPU efficiency when operating on Cs
Part 2
Part 1
Part 3
VLDB 2009 Tutorial Column-Oriented Database Systems 4
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l Part 1: Basic concepts — Stavros
l Introduction to key features
l From DSM to column-stores and performance tradeoffs
l Column-store architecture overview
l Will rows and columns ever converge?
l Part 2: Column-oriented execution — Daniel
l Part 3: MonetDB/X100 and CPU efficiency — Peter
VLDB 2009 Tutorial Column-Oriented Database Systems 4
VLDB 2009 Tutorial Column-Oriented Database Systems 5
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Telco Data Warehousing example
l Typical DW installation
l Real-world example
VLDB 2009 Tutorial Column-Oriented Database Systems 5
usage source
toll
account
star schema
fact table
dimension tables
or RAM
QUERY 2
SELECT account.account_number,
sum (usage.toll_airtime),
sum (usage.toll_price)
FROM usage, toll, source, account
WHERE usage.toll_id = toll.toll_id
AND usage.source_id = source.source_id
AND usage.account_id = account.account_id
AND toll.type_ind in (‘AE’. ‘AA’)
AND usage.toll_price > 0
AND source.type != ‘CIBER’
AND toll.rating_method = ‘IS’
AND usage.invoice_date = 20051013
GROUP BY account.account_number
Column-store Row-store
Query 1 2.06 300
Query 2 2.20 300
Query 3 0.09 300
Query 4 5.24 300
Query 5 2.88 300
Why? Three main factors (next slides)
“One Size Fits All? - Part 2: Benchmarking
Results” Stonebraker et al. CIDR 2007
VLDB 2009 Tutorial Column-Oriented Database Systems 6
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Telco example explained (1/3):
read efficiency
read pages containing entire rows
one row = 212 columns!
is this typical? (it depends)
VLDB 2009 Tutorial Column-Oriented Database Systems 6
row store column store
read only columns needed
in this example: 7 columns
caveats:
• “select * ” not any faster
• clever disk prefetching
• clever tuple reconstruction
What about vertical partitioning?
(it does not work with ad-hoc
queries)
VLDB 2009 Tutorial Column-Oriented Database Systems 7
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Telco example explained (2/3):
compression efficiency
l Columns compress better than rows
l Typical row-store compression ratio 1 : 3
l Column-store 1 : 10
l Why?
l Rows contain values from different domains
=> more entropy, difficult to dense-pack
l Columns exhibit significantly less entropy
l Examples:
l Caveat: CPU cost (use lightweight compression)
VLDB 2009 Tutorial Column-Oriented Database Systems 7
Male, Female, Female, Female, Male
1998, 1998, 1999, 1999, 1999, 2000
VLDB 2009 Tutorial Column-Oriented Database Systems 8
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Telco example explained (3/3):
sorting & indexing efficiency
l Compression and dense-packing free up space
l Use multiple overlapping column collections
l Sorted columns compress better
l Range queries are faster
l Use sparse clustered indexes
VLDB 2009 Tutorial Column-Oriented Database Systems 8
What about heavily-indexed row-stores?
(works well for single column access,
cross-column joins become increasingly expensive)
VLDB 2009 Tutorial Column-Oriented Database Systems 9
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Additional opportunities for column-stores
l Block-tuple / vectorized processing
l Easier to build block-tuple operators
l Amortizes function-call cost, improves CPU cache performance
l Easier to apply vectorized primitives
l Software-based: bitwise operations
l Hardware-based: SIMD
l Opportunities with compressed columns
l Avoid decompression: operate directly on compressed
l Delay decompression (and tuple reconstruction)
l Also known as: late materialization
l Exploit columnar storage in other DBMS components
l Physical design (both static and dynamic)
VLDB 2009 Tutorial Column-Oriented Database Systems 9
Part 3
more
in Part 2
See: Database
Cracking, from CWI
VLDB 2009 Tutorial Column-Oriented Database Systems 10
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Effect on C-Store performance
VLDB 2009 Tutorial Column-Oriented Database Systems 10
“Column-Stores vs Row-Stores:
How Different are They
Really?” Abadi, Hachem, and
Madden. SIGMOD 2008.
Time(sec) Average for SSBM queries on C-store
enable
late
materializationenable
compression &
operate on compressed
original
C-store
column-oriented
join algorithm
VLDB 2009 Tutorial Column-Oriented Database Systems 11
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Summary of column-store key features
l Storage layout
l Execution engine
l Design tools, optimizer
VLDB 2009 Tutorial Column-Oriented Database Systems 11
columnar storage
header/ID elimination
compression
multiple sort orders
column operators
avoid decompression
late materialization
vectorized operations Part 3
Part 2
Part 2
Part 1
Part 1
Part 2
Part 3
VLDB 2009 Tutorial Column-Oriented Database Systems 12
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l Part 1: Basic concepts — Stavros
l Introduction to key features
l From DSM to column-stores and performance tradeoffs
l Column-store architecture overview
l Will rows and columns ever converge?
l Part 2: Column-oriented execution — Daniel
l Part 3: MonetDB/X100 and CPU efficiency — Peter
VLDB 2009 Tutorial Column-Oriented Database Systems 12
VLDB 2009 Tutorial Column-Oriented Database Systems 13
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
From DSM to Column-stores
70s -1985:
1985: DSM paper
1990s: Commercialization through SybaseIQ
Late 90s – 2000s: Focus on main-memory performance
l DSM “on steroids” [1997 – now]
l Hybrid DSM/NSM [2001 – 2004]
2005 – : Re-birth of read-optimized DSM as “column-store”
VLDB 2009 Tutorial Column-Oriented Database Systems 13
“A decomposition storage model”
Copeland and Khoshafian. SIGMOD 1985.
CWI: MonetDB
Wisconsin: PAX, Fractured Mirrors
Michigan: Data Morphing CMU: Clotho
MIT: C-Store CWI: MonetDB/X100 10+ startups
TOD: Time Oriented Database – Wiederhold et al.
"A Modular, Self-Describing Clinical Databank
System," Computers and Biomedical Research, 1975
More 1970s: Transposed files, Lorie, Batory,
Svensson.
“An overview of cantor: a new system for data analysis”
Karasalo, Svensson, SSDBM 1983
VLDB 2009 Tutorial Column-Oriented Database Systems 14
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
The original DSM paper
l Proposed as an alternative to NSM
l 2 indexes: clustered on ID, non-clustered on value
l Speeds up queries projecting few columns
l Requires more storage
VLDB 2009 Tutorial Column-Oriented Database Systems 14
“A decomposition storage
model” Copeland and
Khoshafian. SIGMOD 1985.
1 2 3 4 ..
ID
value
0100 0962 1000 ..
VLDB 2009 Tutorial Column-Oriented Database Systems 15
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Memory wall and PAX
l 90s: Cache-conscious research
l PAX: Partition Attributes Across
l Retains NSM I/O pattern
l Optimizes cache-to-RAM communication
VLDB 2009 Tutorial Column-Oriented Database Systems 15
“DBMSs on a modern processor:
Where does time go?” Ailamaki,
DeWitt, Hill, Wood. VLDB 1999.
“Weaving Relations for Cache Performance.”
Ailamaki, DeWitt, Hill, Skounakis, VLDB 2001.
“Cache Conscious Algorithms for
Relational Query Processing.”
Shatdal, Kant, Naughton. VLDB 1994.
from:
“Database Architecture Optimized for
the New Bottleneck: Memory Access.”
Boncz, Manegold, Kersten. VLDB 1999.
to:
and:
VLDB 2009 Tutorial Column-Oriented Database Systems 16
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
More hybrid NSM/DSM schemes
l Dynamic PAX: Data Morphing
l Clotho: custom layout using scatter-gather I/O
l Fractured mirrors
l Smart mirroring with both NSM/DSM copies
VLDB 2009 Tutorial Column-Oriented Database Systems 16
“Data morphing: an adaptive, cache-conscious
storage technique.” Hankins, Patel, VLDB 2003.
“Clotho: Decoupling Memory Page Layout from Storage Organization.”
Shao, Schindler, Schlosser, Ailamaki, and Ganger. VLDB 2004.
“A Case For Fractured
Mirrors.” Ramamurthy,
DeWitt, Su, VLDB 2002.
VLDB 2009 Tutorial Column-Oriented Database Systems 17
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
MonetDB (more in Part 3)
l Late 1990s, CWI: Boncz, Manegold, and Kersten
l Motivation:
l Main-memory
l Improve computational efficiency by avoiding expression
interpreter
l DSM with virtual IDs natural choice
l Developed new query execution algebra
l Initial contributions:
l Pointed out memory-wall in DBMSs
l Cache-conscious projections and joins
l …
VLDB 2009 Tutorial Column-Oriented Database Systems 17
VLDB 2009 Tutorial Column-Oriented Database Systems 18
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
2005: the (re)birth of column-stores
l New hardware and application realities
l Faster CPUs, larger memories, disk bandwidth limit
l Multi-terabyte Data Warehouses
l New approach: combine several techniques
l Read-optimized, fast multi-column access,
disk/CPU efficiency, light-weight compression
l C-store paper:
l First comprehensive design description of a column-store
l MonetDB/X100
l “proper” disk-based column store
l Explosion of new products
VLDB 2009 Tutorial Column-Oriented Database Systems 18
VLDB 2009 Tutorial Column-Oriented Database Systems 19
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Performance tradeoffs: columns vs. rows
DSM traditionally was not favored by technology trends
How has this changed?
l Optimized DSM in “Fractured Mirrors,” 2002
l “Apples-to-apples” comparison
l Follow-up study
l Main-memory DSM vs. NSM
l Flash-disks: a come-back for PAX?
VLDB 2009 Tutorial Column-Oriented Database Systems 19
“Performance Tradeoffs in Read-
Optimized Databases”
Harizopoulos, Liang, Abadi,
Madden, VLDB’06
“Read-Optimized Databases, In-
Depth” Holloway, DeWitt, VLDB’08
“Query Processing Techniques
for Solid State Drives”
Tsirogiannis, Harizopoulos,
Shah, Wiener, Graefe,
SIGMOD’09
“Fast Scans and Joins Using Flash
Drives” Shah, Harizopoulos,
Wiener, Graefe. DaMoN’08
“DSM vs. NSM: CPU performance tradeoffs in block-oriented
query processing” Boncz, Zukowski, Nes, DaMoN’08
VLDB 2009 Tutorial Column-Oriented Database Systems 20
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Fractured mirrors: a closer look
l Store DSM relations inside a B-tree
l Leaf nodes contain values
l Eliminate IDs, amortize header overhead
l Custom implementation on Shore
VLDB 2009 Tutorial Column-Oriented Database Systems 20
3
sparse
B-tree on ID
1 a1
“Efficient columnar
storage in B-trees” Graefe.
Sigmod Record 03/2007.
Similar: storage density
comparable
to column stores
“A Case For Fractured
Mirrors” Ramamurthy,
DeWitt, Su, VLDB 2002.
11
22
33
TIDTID
a1a1
a2a2
a3a3
Column
Data
Column
Data
Tuple
Header
Tuple
Header
44 a4a4
55 a5a5
2 a2 3 a3 4 a4 5 a5
1 a1 a3a2 4 a4 a5
VLDB 2009 Tutorial Column-Oriented Database Systems 21
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Fractured mirrors: performance
l Chunk-based tuple merging
l Read in segments of M pages
l Merge segments in memory
l Becomes CPU-bound after 5 pages
VLDB 2009 Tutorial Column-Oriented Database Systems 21
From PAX paper:
optimized
DSM
columns projected:
1 2 3 4 5
time
row
column?
column?
regular DSM
VLDB 2009 Tutorial Column-Oriented Database Systems 22
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-scanner
implementation
VLDB 2009 Tutorial Column-Oriented Database Systems 22
1 Joe 45
2 Sue 37
… … …
Joe
Sue
45
37
…
…
prefetch ~100ms
worth of data
S
apply
predicate(s)
Joe 45
… …
S
#POS 45
#POS …
S
Joe 45
… …
apply
predicate #1
SELECT name, age
WHERE age > 40
Direct I/O
row scanner column scanner
“Performance Tradeoffs in Read-
Optimized Databases”
Harizopoulos, Liang, Abadi,
Madden, VLDB’06
VLDB 2009 Tutorial Column-Oriented Database Systems 23
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Scan performance
l Large prefetch hides disk seeks in columns
l Column-CPU efficiency with lower selectivity
l Row-CPU suffers from memory stalls
l Memory stalls disappear in narrow tuples
l Compression: similar to narrow
VLDB 2009 Tutorial Column-Oriented Database Systems 23
not shown,
details in the paper
VLDB 2009 Tutorial Column-Oriented Database Systems 24
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Even more results
Non-selective queries, narrow tuples, favor well-compressed rows
Materialized views are a win
Scan times determine early materialized joins
VLDB 2009 Tutorial Column-Oriented Database Systems 24
“Read-Optimized Databases, In-
Depth” Holloway, DeWitt, VLDB’08
0
5
10
15
20
25
30
35
1 2 3 4 5 6 7 8 9 10
Time(s)
Columns Returned
C-25%
C-10%
R-50%
Column-joins are
covered in part 2!
• Same engine as before
• Additional findings
wide attributes:
same as before
narrow & compressed tuple:
CPU-bound!
VLDB 2009 Tutorial Column-Oriented Database Systems 25
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Speedup of columns over rows
l Rows favored by narrow tuples and low cpdb
l Disk-bound workloads have higher cpdb
VLDB 2009 Tutorial Column-Oriented Database Systems 25
tuple width
cyclesperdiskbyte
(cpdb)
8 12 16 20 24 28 32 36
9
18
36
72
144
_ + ++=
+++
“Performance Tradeoffs in Read-
Optimized Databases”
Harizopoulos, Liang, Abadi,
Madden, VLDB’06
VLDB 2009 Tutorial Column-Oriented Database Systems 26
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Varying prefetch size
l No prefetching hurts columns in single scans
VLDB 2009 Tutorial Column-Oriented Database Systems 26
0
10
20
30
40
4 8 12 16 20 24 28 32
time(sec)
selected bytes per tuple
Row (any prefetch size)
Column 48 (x 128KB)
Column 16
Column 8
Column 2
no competing
disk traffic
VLDB 2009 Tutorial Column-Oriented Database Systems 27
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Varying prefetch size
l No prefetching hurts columns in single scans
l Under competing traffic, columns outperform rows for
any prefetch size
VLDB 2009 Tutorial Column-Oriented Database Systems 27
with competing disk traffic
0
10
20
30
40
4 12 20 28
Column, 48
Row, 48
0
10
20
30
40
4 12 20 28
Column, 8
Row, 8
selected bytes per tuple
time(sec)
VLDB 2009 Tutorial Column-Oriented Database Systems 28
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Performance
l Benefit in on-the-fly conversion between NSM and DSM
l DSM: sequential access (block fits in L2), random in L1
l NSM: random access, SIMD for grouped Aggregation
VLDB 2009 Tutorial Column-Oriented Database Systems 28
“DSM vs. NSM: CPU performance trade
offs in block-oriented query processing”
Boncz, Zukowski, Nes, DaMoN’08
VLDB 2009 Tutorial Column-Oriented Database Systems 29
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
VLDB 2009 Tutorial Column-Oriented Database Systems 29
New storage technology: Flash SSDs
l Performance characteristics
l very fast random reads, slow random writes
l fast sequential reads and writes
l Price per bit (capacity follows)
l cheaper than RAM, order of magnitude more expensive than Disk
l Flash Translation Layer introduces unpredictability
l avoid random writes!
l Form factors not ideal yet
l SSD (Ł small reads still suffer from SATA overhead/OS limitations)
l PCI card (Ł high price, limited expandability)
l Boost Sequential I/O in a simple package
l Flash RAID: very tight bandwidth/cm3 packing (4GB/sec inside the box)
l Column Store Updates
l useful for delta structures and logs
l Random I/O on flash fixes unclustered index access
l still suboptimal if I/O block size > record size
l therefore column stores profit mush less than horizontal stores
l Random I/O useful to exploit secondary, tertiary table orderings
l the larger the data, the deeper clustering one can exploit
VLDB 2009 Tutorial Column-Oriented Database Systems 30
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Even faster column scans on flash SSDs
l New-generation SSDs
l Very fast random reads, slower random writes
l Fast sequential RW, comparable to HDD arrays
l No expensive seeks across columns
l FlashScan and Flashjoin: PAX on SSDs, inside Postgres
VLDB 2009 Tutorial Column-Oriented Database Systems 30
“Query Processing Techniques for
Solid State Drives” Tsirogiannis,
Harizopoulos, Shah, Wiener, Graefe,
SIGMOD’09
mini-pages with no
qualified attributes are
not accessed
30K Read IOps, 3K Write Iops
250MB/s Read BW, 200MB/s Write
VLDB 2009 Tutorial Column-Oriented Database Systems 31
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-scan performance over time
VLDB 2009 Tutorial Column-Oriented Database Systems 31
from 7x slower
..to 1.2x slower
..to same
and 3x faster!
regular DSM (2001)
optimized DSM (2002)
column-store (2006)
SSD Postgres/PAX (2009)
..to 2x slower
VLDB 2009 Tutorial Column-Oriented Database Systems 32
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l Part 1: Basic concepts — Stavros
l Introduction to key features
l From DSM to column-stores and performance tradeoffs
l Column-store architecture overview
l Will rows and columns ever converge?
l Part 2: Column-oriented execution — Daniel
l Part 3: MonetDB/X100 and CPU efficiency — Peter
VLDB 2009 Tutorial Column-Oriented Database Systems 32
VLDB 2009 Tutorial Column-Oriented Database Systems 33
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Architecture of a column-store
storage layout
l read-optimized: dense-packed, compressed
l organize in extends, batch updates
l multiple sort orders
l sparse indexes
VLDB 2009 Tutorial Column-Oriented Database Systems 33
engine
l block-tuple operators
l new access methods
l optimized relational operatorssystem-level
l system-wide column support
l loading / updates
l scaling through multiple nodes
l transactions / redundancy
VLDB 2009 Tutorial Column-Oriented Database Systems 34
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
C-Store
l Compress columns
l No alignment
l Big disk blocks
l Only materialized views (perhaps many)
l Focus on Sorting not indexing
l Data ordered on anything, not just time
l Automatic physical DBMS design
l Optimize for grid computing
l Innovative redundancy
l Xacts – but no need for Mohan
l Column optimizer and executor
VLDB 2009 Tutorial Column-Oriented Database Systems 34
“C-Store: A Column-Oriented
DBMS.” Stonebraker et al.
VLDB 2005.
VLDB 2009 Tutorial Column-Oriented Database Systems 35
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
C-Store: only materialized views (MVs)
l Projection (MV) is some number of columns from a fact table
l Plus columns in a dimension table – with a 1-n join between
Fact and Dimension table
l Stored in order of a storage key(s)
l Several may be stored!
l With a permutation, if necessary, to map between them
l Table (as the user specified it and sees it) is not stored!
l No secondary indexes (they are a one column sorted MV plus
a permutation, if you really want one)
VLDB 2009 Tutorial Column-Oriented Database Systems 35
User view:
EMP (name, age, salary, dept)
Dept (dname, floor)
Possible set of MVs:
MV-1 (name, dept, floor) in floor order
MV-2 (salary, age) in age order
MV-3 (dname, salary, name) in salary order
VLDB 2009 Tutorial Column-Oriented Database Systems 36
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Asynchronous
Data Transfer
TUPLE MOVER
> Read Optimized
Store (ROS)
• On disk
• Sorted / Compressed
• Segmented
• Large data loaded direct
Continuous Load and Query (Vertica)
Hybrid Storage Architecture
(A B C | A)
A B C
Trickle
Load
> Write Optimized
Store (WOS)
§Memory based
§Unsorted / Uncompressed
§Segmented
§Low latency / Small quick
inserts
A B C
VLDB 2009 Tutorial Column-Oriented Database Systems 36
VLDB 2009 Tutorial Column-Oriented Database Systems 37
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Loading Data (Vertica)
Write-Optimized
Store (WOS)
In-memory
Read-Optimized
Store (ROS)
On-disk
Automatic
Tuple Mover
> INSERT, UPDATE, DELETE
> Bulk and Trickle Loads
§COPY
§COPY DIRECT
> User loads data into logical Tables
> Vertica loads atomically into
storage
VLDB 2009 Tutorial Column-Oriented Database Systems 38
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Applications for column-stores
l Data Warehousing
l High end (clustering)
l Mid end/Mass Market
l Personal Analytics
l Data Mining
l E.g. Proximity
l Google BigTable
l RDF
l Semantic web data management
l Information retrieval
l Terabyte TREC
l Scientific datasets
l SciDB initiative
l SLOAN Digital Sky Survey on MonetDB
VLDB 2009 Tutorial Column-Oriented Database Systems 38
VLDB 2009 Tutorial Column-Oriented Database Systems 39
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
List of column-store systems
l Cantor (history)
l Sybase IQ
l SenSage (former Addamark Technologies)
l Kdb
l 1010data
l MonetDB
l C-Store/Vertica
l X100/VectorWise
l KickFire
l SAP Business Accelerator
l Infobright
l ParAccel
l Exasol
VLDB 2009 Tutorial Column-Oriented Database Systems 39
VLDB 2009 Tutorial Column-Oriented Database Systems 40
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l Part 1: Basic concepts — Stavros
l Introduction to key features
l From DSM to column-stores and performance tradeoffs
l Column-store architecture overview
l Will rows and columns ever converge?
l Part 2: Column-oriented execution — Daniel
l Part 3: MonetDB/X100 and CPU efficiency — Peter
VLDB 2009 Tutorial Column-Oriented Database Systems 40
VLDB 2009 Tutorial Column-Oriented Database Systems 41
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
VLDB 2009 Tutorial Column-Oriented Database Systems 41
Simulate a Column-Store inside a Row-Store
Date Store Product Customer Price
01/01
01/01
01/01
1
2
3
1
2
3
1
2
3
Option A:
Vertical Partitioning
…
Option B:
Index Every Column
Date Index
Store Index
1
2
3
1
2
3
01/01
01/01
01/01
BOS
NYC
BOS
Table
Chair
Bed
Mesa
Lutz
Mudd
$20
$13
$79
BOS
NYC
BOS
Table
Chair
Bed
Mesa
Lutz
Mudd
$20
$13
$79
TID Value
Store
TID Value
Product
TID Value
Customer
TID Value
Price
TID Value
Date
VLDB 2009 Tutorial Column-Oriented Database Systems 42
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
VLDB 2009 Tutorial Column-Oriented Database Systems 42
Simulate a Column-Store inside a Row-Store
Date Store Product Customer Price
1
2
3
1
2
3
Option A:
Vertical Partitioning
…
Option B:
Index Every Column
Date Index
Store Index
1
2
3
1
2
3
01/01
01/01
01/01
BOS
NYC
BOS
Table
Chair
Bed
Mesa
Lutz
Mudd
$20
$13
$79
BOS
NYC
BOS
Table
Chair
Bed
Mesa
Lutz
Mudd
$20
$13
$79
TID Value
Store
TID Value
Product
TID Value
Customer
TID Value
Price
01/01 1 3
StartPosValue
Date
Length
Can explicitly run-
length encode date
“Teaching an Old Elephant New Tricks.”
Bruno, CIDR 2009.
VLDB 2009 Tutorial Column-Oriented Database Systems 43
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
VLDB 2009 Tutorial Column-Oriented Database Systems 43
Experiments
0.0
50.0
100.0
150.0
200.0
250.0
Time(seconds)
Average 25.7 79.9 221.2
Normal Row-Store
Vertically Partitioned
Row-Store
Row-Store With All
Indexes
“Column-Stores vs Row-Stores:
How Different are They Really?”
Abadi, Hachem, and Madden.
SIGMOD 2008.
l Star Schema Benchmark (SSBM)
l Implemented by professional DBA
l Original row-store plus 2 column-store
simulations on same row-store product
Adjoined Dimension Column Index (ADC Index)
to Improve Star Schema Query Performance”.
O’Neil et. al. ICDE 2008.
VLDB 2009 Tutorial Column-Oriented Database Systems 44
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
What’s Going On? Vertical Partitions
l Vertical partitions in row-stores:
l Work well when workload is known
l ..and queries access disjoint sets of columns
l See automated physical design
l Do not work well as full-columns
l TupleID overhead significant
l Excessive joins
VLDB 2009 Tutorial Column-Oriented Database Systems 44
“Column-Stores vs. Row-Stores:
How Different Are They Really?”
Abadi, Madden, and Hachem.
SIGMOD 2008.
Queries touch 3-4 foreign keys in fact table,
1-2 numeric columns
Complete fact table takes up ~4 GB
(compressed)
Vertically partitioned tables take up 0.7-1.1
GB (compressed)
11
22
33
TID Column
Data
Tuple
Header
VLDB 2009 Tutorial Column-Oriented Database Systems 45
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
VLDB 2009 Tutorial Column-Oriented Database Systems 45
What’s Going On? All Indexes Case
l Tuple construction
l Common type of query:
l Result of lower part of query plan is a set of TIDs that passed
all predicates
l Need to extract SELECT attributes at these TIDs
l BUT: index maps value to TID
l You really want to map TID to value (i.e., a vertical partition)
Tuple construction is SLOW
SELECT store_name, SUM(revenue)
FROM Facts, Stores
WHERE fact.store_id = stores.store_id
AND stores.country = “Canada”
GROUP BY store_name
VLDB 2009 Tutorial Column-Oriented Database Systems 46
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
VLDB 2009 Tutorial Column-Oriented Database Systems 46
So….
l All indexes approach is a poor way to simulate a column-store
l Problems with vertical partitioning are NOT fundamental
l Store tuple header in a separate partition
l Allow virtual TIDs
l Combine clustered indexes, vertical partitioning
l So can row-stores simulate column-stores?
l Might be possible, BUT:
l Need better support for vertical partitioning at the storage layer
l Need support for column-specific optimizations at the executer level
l Full integration: buffer pool, transaction manager, ..
l When will this happen?
l Most promising features = soon
l ..unless new technology / new objectives change the game
(SSDs, Massively Parallel Platforms, Energy-efficiency)
See Part 2, Part 3
for most promising
features
VLDB 2009 Tutorial Column-Oriented Database Systems 47
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
End of Part 1
l Basic concepts — Stavros
l Introduction to key features
l From DSM to column-stores and performance tradeoffs
l Column-store architecture overview
l Will rows and columns ever converge?
l Part 2: Column-oriented execution — Daniel
l Part 3: MonetDB/X100 and CPU efficiency — Peter
VLDB 2009 Tutorial Column-Oriented Database Systems 47
VLDB 2009 Tutorial Column-Oriented Database Systems 48
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Part 2 Outline
l Compression
l Tuple Materialization
l Joins
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
Compression
VLDB
2009
Tutorial
“Integrating Compression and Execution in Column-
Oriented Database Systems” Abadi, Madden, and
Ferreira, SIGMOD ’06
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
•Query optimization in compressed database
systems” Chen, Gehrke, Korn, SIGMOD’01
VLDB 2009 Tutorial Column-Oriented Database Systems 50
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Compression
l Trades I/O for CPU
l Increased column-store opportunities:
l Higher data value locality in column stores
l Techniques such as run length encoding far more
useful
l Can use extra space to store multiple copies of
data in different sort orders
VLDB 2009 Tutorial Column-Oriented Database Systems 51
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Run-length Encoding
Q1
Q1
Q1
Q1
Q1
Q1
Q1
Q2
Q2
Q2
Q2
…
…
1
1
1
1
1
2
2
1
1
1
2
…
…
Product IDQuarter
(value, start_pos, run_length)
(1, 1, 5)
…
…
Product IDQuarter
(Q2, 301, 350)
(Q3, 651, 500)
(Q4, 1151, 600)
(2, 6, 2)
(1, 301, 3)
(2, 304, 1)
5
7
2
9
6
8
5
3
8
1
4
…
…
Price
5
7
2
9
6
8
5
3
8
1
4
…
…
Price
(Q1, 1, 300)
(value, start_pos, run_length)
VLDB 2009 Tutorial Column-Oriented Database Systems 52
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Bit-vector Encoding
1
1
1
1
1
2
2
1
1
2
3
…
…
Product ID
1
1
1
1
1
0
0
1
1
0
0
…
…
ID: 1 ID: 2 ID: 3
0
0
0
0
0
0
0
0
0
0
0
…
…
…
0
0
0
0
0
0
0
0
0
0
1
…
…
0
0
0
0
0
1
1
0
0
1
0
…
…
“Integrating Compression and Execution in Column-
Oriented Database Systems” Abadi et. al, SIGMOD ’06
l For each unique
value, v, in column
c, create bit-vector
b
l b[i] = 1 if c[i] = v
l Good for columns
with few unique
values
l Each bit-vector
can be further
compressed if
sparse
VLDB 2009 Tutorial Column-Oriented Database Systems 53
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Q1
Q2
Q4
Q1
Q3
Q1
Q1
Q2
Q4
Q3
Q3
…
Quarter
Q1
0
1
3
0
2
0
0
1
3
2
2
…
Quarter
0
0: Q1
1: Q2
2: Q3
3: Q4
Dictionary Encoding
Dictionary
+
OR
24
128
122
Quarter
24: Q1, Q2, Q4, Q1
128: Q3, Q1, Q1, Q1
122: Q2, Q4, Q3, Q3
Dictionary
+
“Integrating Compression and Execution in Column-
Oriented Database Systems” Abadi et. al, SIGMOD ’06
…
l For each
unique value
create
dictionary entry
l Dictionary can
be per-block or
per-column
l Column-stores
have the
advantage that
dictionary
entries may
encode multiple
values at once
VLDB 2009 Tutorial Column-Oriented Database Systems 54
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
45
54
48
55
51
53
40
49
62
52
50
…
Price
50
Frame: 50
4
-2
5
1
3
2
0
0
…
Price
-1
Frame Of Reference Encoding
-5
∞
40
∞
62
4 bits per
value
Exceptions (see
part 3 for a better
way to deal with
exceptions)
l Encodes values as b bit
offset from chosen frame
of reference
l Special escape code (e.g.
all bits set to 1) indicates
a difference larger than
can be stored in b bits
l After escape code,
original (uncompressed)
value is written
“Compressing Relations and Indexes
” Goldstein, Ramakrishnan, Shaft,
ICDE’98
VLDB 2009 Tutorial Column-Oriented Database Systems 55
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Differential Encoding
5:00
5:02
5:03
5:03
5:04
5:06
5:07
5:10
5:15
5:16
5:16
…
Time
5:08
2
1
0
1
2
1
1
0
Time
2
5:00
1
∞
5:15
2 bits per
value
Exception (see
part 3 for a better
way to deal with
exceptions)
l Encodes values as b bit offset
from previous value
l Special escape code (just like
frame of reference encoding)
indicates a difference larger than
can be stored in b bits
l After escape code, original
(uncompressed) value is written
l Performs well on columns
containing increasing/decreasing
sequences
l inverted lists
l timestamps
l object IDs
l sorted / clustered columns
“Improved Word-Aligned Binary
Compression for Text Indexing”
Ahn, Moffat, TKDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 56
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
What Compression Scheme To Use?
Does column appear in the sort key?
Is the average
run-length > 2
Are number of unique
values < ~50000
Differential
Encoding
RLE
Is the data numerical
and exhibit good locality?
Frame of Reference
Encoding
Heavyweight
Compression
Leave Data
Uncompressed
OR
Does this column appear frequently
in selection predicates?
Bit-vector
Compression
Dictionary
Compression
yes
yes
yes
yes
yes
no
no
no
no
no
VLDB 2009 Tutorial Column-Oriented Database Systems 57
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Heavy-Weight Compression Schemes
l Modern disk arrays can achieve > 1GB/s
l 1/3 CPU for decompression Ł 3GB/s needed
Ł Lightweight compression schemes are better
Ł Even better: operate directly on compressed data
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 58
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l I/O - CPU tradeoff is no longer a tradeoff
l Reduces memory–CPU bandwidth requirements
l Opens up possibility of operating on multiple
records at once
Operating Directly on Compressed Data
“Integrating Compression and Execution in Column-
Oriented Database Systems” Abadi et. al, SIGMOD ’06
VLDB 2009 Tutorial Column-Oriented Database Systems 59
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
1
0
0
1
1
0
0
1
0
0
0
…
…
Product IDQuarter
(1, 3)
(2, 1)
(3, 2)
1
0
0
1
0
0
0
1
0
0
1
0
…
…
2
0
1
0
0
0
1
0
0
1
0
1
…
…
3
ProductID, COUNT(*))
(Q1, 1, 300)
(Q2, 301, 6)
(Q3, 307, 500)
(Q4, 807, 600)
Index Lookup + Offset jump
SELECT ProductID, Count(*)
FROM table
WHERE (Quarter = Q2)
GROUP BY ProductID
301-306
Operating Directly on Compressed Data
“Integrating Compression and Execution in Column-
Oriented Database Systems” Abadi et. al, SIGMOD ’06
VLDB 2009 Tutorial Column-Oriented Database Systems 60
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
(Q1, 1, 300)
(Q2, 301, 6)
(Q3, 307, 500)
(Q4, 807, 600)
Data
isOneValue()
isValueSorted()
isPosContiguous()
isSparse()
getNext()
decompressIntoArray()
getValueAtPosition(pos)
getMin()
getMax()
getSize()
Block API
Compression-
Aware Scan
Operator
Selection
Operator
Aggregation
Operator
SELECT ProductID, Count(*)
FROM table
WHERE (Quarter = Q2)
GROUP BY ProductID
“Integrating Compression and Execution in Column-
Oriented Database Systems” Abadi et. al, SIGMOD ’06
Operating Directly on Compressed Data
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
Tuple Materialization and
Column-Oriented Join Algorithms
VLDB
2009
Tutorial
“Materialization Strategies in a Column-
Oriented DBMS” Abadi, Myers, DeWitt,
and Madden. ICDE 2007.
“Query Processing Techniques for
Solid State Drives” Tsirogiannis,
Harizopoulos Shah, Wiener, and
Graefe. SIGMOD 2009.
“Column-Stores vs Row-Stores: How
Different are They Really?” Abadi,
Madden, and Hachem. SIGMOD 2008.
“Cache-Conscious Radix-Decluster
Projections”, Manegold, Boncz, Nes,
VLDB’04
“Self-organizing tuple reconstruction in
column-stores“, Idreos, Manegold,
Kersten, SIGMOD’09
VLDB 2009 Tutorial Column-Oriented Database Systems 62
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
When should columns be projected?
l Where should column projection operators be placed in a query
plan?
l Row-store:
l Column projection involves removing unneeded columns from
tuples
l Generally done as early as possible
l Column-store:
l Operation is almost completely opposite from a row-store
l Column projection involves reading needed columns from storage
and extracting values for a listed set of tuples
§ This process is called “materialization”
l Early materialization: project columns at beginning of query plan
§ Straightforward since there is a one-to-one mapping across columns
l Late materialization: wait as long as possible for projecting columns
§ More complicated since selection and join operators on one column obfuscates
mapping to other columns from same table
l Most column-stores construct tuples and column projection time
§ Many database interfaces expect output in regular tuples (rows)
§ Rest of discussion will focus on this case
VLDB 2009 Tutorial Column-Oriented Database Systems 63
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
When should tuples be constructed?
l Solution 1: Create rows first (EM).
But:
l Need to construct ALL tuples
l Need to decompress data
l Poor memory bandwidth
utilization
2
1
3
1
2
3
3
3
7
13
42
80
(4,1,4)
Construct
2
3
3
3
7
13
42
80
Select + Aggregate
2
1
3
1
4
4
4
4
prodID storeID custID price
QUERY:
SELECT custID,SUM(price)
FROM table
WHERE (prodID = 4) AND
(storeID = 1) AND
GROUP BY custID
VLDB 2009 Tutorial Column-Oriented Database Systems 64
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Data
Source
prodID
Data
Source
storeID
AND
AGG
Data
Source
custID
Data
Source
price
1
1
1
1
4
4
4
4
0
1
0
1
2
1
3
1
prodID storeID
Data
Source
Data
Source
QUERY:
SELECT custID,SUM(price)
FROM table
WHERE (prodID = 4) AND
(storeID = 1) AND
GROUP BY custID
2
1
3
1
2
3
3
3
7
13
42
80
4
4
4
4
prodID storeID custID price
Solution 2: Operate on columns
VLDB 2009 Tutorial Column-Oriented Database Systems 65
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Solution 2: Operate on columns
1
1
1
1
0
1
0
1
AND
0
1
0
1
QUERY:
SELECT custID,SUM(price)
FROM table
WHERE (prodID = 4) AND
(storeID = 1) AND
GROUP BY custID
AND
AGG
Data
Source
custID
Data
Source
price
2
1
3
1
2
3
3
3
7
13
42
80
4
4
4
4
prodID storeID custID price
Data
Source
prodID
Data
Source
storeID
VLDB 2009 Tutorial Column-Oriented Database Systems 66
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
3
3
Solution 2: Operate on columns
0
1
0
1
0
1
0
1
Data
Source
Data
Source
0
1
0
1
7
13
42
80
0
1
0
1
2
3
3
3
1
1
13
80
custID price
QUERY:
SELECT custID,SUM(price)
FROM table
WHERE (prodID = 4) AND
(storeID = 1) AND
GROUP BY custID
AND
AGG
Data
Source
custID
Data
Source
price
2
1
3
1
2
3
3
3
7
13
42
80
4
4
4
4
prodID storeID custID price
Data
Source
prodID
Data
Source
storeID
VLDB 2009 Tutorial Column-Oriented Database Systems 67
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Solution 2: Operate on columns
1
1
3
3
1
1
13
80
AGG
13 193
QUERY:
SELECT custID,SUM(price)
FROM table
WHERE (prodID = 4) AND
(storeID = 1) AND
GROUP BY custID
AND
AGG
Data
Source
custID
Data
Source
price
2
1
3
1
2
3
3
3
7
13
42
80
4
4
4
4
prodID storeID custID price
Data
Source
prodID
Data
Source
storeID
VLDB 2009 Tutorial Column-Oriented Database Systems 68
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
For plans without joins, late materialization
is a win
l Ran on 2 compressed
columns from TPC-H
scale 10 data
QUERY:
SELECT C1, SUM(C2)
FROM table
WHERE (C1 < CONST) AND
(C2 < CONST)
GROUP BY C1
0
1
2
3
4
5
6
7
8
9
10
Low selectivity Medium
selectivity
High selectivity
Time(seconds)
Early Materialization Late Materialization
“Materialization Strategies in a Column-Oriented DBMS”
Abadi, Myers, DeWitt, and Madden. ICDE 2007.
VLDB 2009 Tutorial Column-Oriented Database Systems 69
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l Materializing late still
works best
QUERY:
SELECT C1, SUM(C2)
FROM table
WHERE (C1 < CONST) AND
(C2 < CONST)
GROUP BY C1
0
2
4
6
8
10
12
14
Low selectivity Medium
selectivity
High selectivity
Time(seconds)
Early Materialization Late Materialization
Even on uncompressed data, late
materialization is still a win
“Materialization Strategies in a Column-Oriented DBMS”
Abadi, Myers, DeWitt, and Madden. ICDE 2007.
VLDB 2009 Tutorial Column-Oriented Database Systems 70
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
What about for plans with joins?
70
Select R1.B, R1.C, R2.E, R2.H, R3.F
From R1, R2, R3
Where R1.A = R2.D AND R2.G = R3.K
R1 (A, B, C) R2 (D, E, G, H)
Scan Scan
Join 1
A = D
Join 2
G = K
Scan
R3 (F, K, L)
B, C, E, H, F
A, B, C
B, C, E, G, H
D, E, G, H
F, K
R1 (A, B, C) R2 (D, E, G, H)
Scan Scan
Join
A = D
A D
Scan
R3 (F, K, L)
Project
G
Join
G = K
Project
B, C, E, H, F
KG
E, H
F
B, C
VLDB 2009 Tutorial Column-Oriented Database Systems 71
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
What about for plans with joins?
71
Select R1.B, R1.C, R2.E, R2.H, R3.F
From R1, R2, R3
Where R1.A = R2.D AND R2.G = R3.K
R1 (A, B, C) R2 (D, E, G, H)
Scan Scan
Join 1
A = D
Join 2
G = K
Scan
R3 (F, K, L)
B, C, E, H, F
A, B, C
B, C, E, G, H
D, E, G, H
F, K
R1 (A, B, C) R2 (D, E, G, H)
Scan Scan
Join
A = D
A D
Scan
R3 (F, K, L)
Project
G
Join
G = K
Project
B, C, E, H, F
KG
E, H
F
B, C
Early
materialization
Late
materialization
VLDB 2009 Tutorial Column-Oriented Database Systems 72
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Early Materialization Example
12
1
11
1
6
1
2
1
2
3
3
3
7
13
42
80
Construct
2
3
3
3
7
13
42
80
prodID storeID quantity custID price
1
2
3
Construct
1
2
3
custID lastName
Green
White
Brown
Green
White
Brown
QUERY:
SELECT C.lastName,SUM(F.price)
FROM facts AS F, customers AS C
WHERE F.custID = C.custID
GROUP BY C.lastName
Facts Customers
(4,1,4)
VLDB 2009 Tutorial Column-Oriented Database Systems 73
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Early Materialization Example
2
3
3
3
7
13
42
80
1
2
3
Green
White
Brown
QUERY:
SELECT C.lastName,SUM(F.price)
FROM facts AS F, customers AS C
WHERE F.custID = C.custID
GROUP BY C.lastName
Join
7
13
42
80
White
Brown
Brown
Brown
VLDB 2009 Tutorial Column-Oriented Database Systems 74
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Late Materialization Example
12
1
11
1
6
1
2
1
2
3
1
3
7
13
42
80
prodID storeID quantity custID price
1
2
3
custID lastName
Green
White
Brown
QUERY:
SELECT C.lastName,SUM(F.price)
FROM facts AS F, customers AS C
WHERE F.custID = C.custID
GROUP BY C.lastName
Facts Customers
Join
2
3
1
3
1
2
3
4
(4,1,4)
Late materialized join
causes out of order
probing of projected
columns from the
inner relation
VLDB 2009 Tutorial Column-Oriented Database Systems 75
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Late Materialized Join Performance
l Naïve LM join about 2X slower than EM join on typical
queries (due to random I/O)
l This number is very dependent on
l Amount of memory available
l Number of projected attributes
l Join cardinality
l But we can do better
l Invisible Join
l Jive/Flash Join
l Radix cluster/decluster join
VLDB 2009 Tutorial Column-Oriented Database Systems 76
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Invisible Join
l Designed for typical joins when data is modeled using a star
schema
l One (“fact”) table is joined with multiple dimension tables
l Typical query:
select c_nation, s_nation, d_year,
sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and c_region = 'ASIA‘
and s_region = 'ASIA‘
and d_year >= 1992 and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;
“Column-Stores vs Row-Stores: How
Different are They Really?” Abadi,
Madden, and Hachem. SIGMOD 2008.
VLDB 2009 Tutorial Column-Oriented Database Systems 77
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
custkey region nation …
1 ASIA CHINA …
2 ASIA INDIA …
3 ASIA INDIA …
Apply “region = ‘Asia’” On Customer Table
Hash Table (or bit-map)
Containing Keys 1, 2 and 3
suppkey region nation …
1 ASIA RUSSIA …
2 EUROPE SPAIN …
Apply “region = ‘Asia’” On Supplier Table
Hash Table (or bit-map)
Containing Keys 1, 3
dateid year …
01011997 1997 …
01021997 1997 …
01031997 1997 …
Apply “year in [1992,1997]” On Date Table
Hash Table Containing
Keys 01011997, 01021997,
and 01031997
Invisible Join
“Column-Stores vs Row-Stores: How
Different are They Really?” Abadi,
Madden, and Hachem. SIGMOD 2008.
4 EUROPE FRANCE …
3 ASIA JAPAN …
VLDB 2009 Tutorial Column-Oriented Database Systems 78
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
“Column-Stores vs Row-Stores:
How Different are They Really?”
Abadi et. al. SIGMOD 2008.
custkey suppkey orderdate revenueorderkey
3 1 01011997 432561
3 2 01011997 333332
4 3 01021997 121213
1 1 01021997 232334
4 2 01021997 454565
1 2 01031997 432516
3 2 01031997 342357
Original Fact Table
custkey
3
3
4
1
4
1
3
Hash Table
Containing
Keys 1, 2 and 3
+
1
1
0
1
0
1
1
Hash Table
Containing
Keys 1 and 3
1
0
1
1
0
0
0
+suppkey
1
2
3
1
2
2
2
Hash Table Containing
Keys 01011997,
01021997, and 01031997
1
1
1
1
1
1
1
+orderdate
01011997
01011997
01021997
01021997
01021997
01031997
01031997
1
1
0
1
0
1
1
1
0
1
1
0
0
0
1
1
1
1
1
1
1
& & =
1
0
0
1
0
0
0
VLDB 2009 Tutorial Column-Oriented Database Systems 79
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
custkey
3
3
4
1
4
1
3
+
suppkey
1
2
3
1
2
2
2
orderdate
01011997
01011997
01021997
01021997
01021997
01031997
01031997
1
0
0
1
0
0
0
3
1
1
0
0
1
0
0
0
1
1
1
0
0
1
0
0
0
01011997
01021997
+
+
+
+
JOIN
CHINA
INDIA
INDIA = CHINA
INDIA
RUSSIA
SPAIN = RUSSIA
RUSSIA
1997
1997
1997 =
01011997
01021997
01031997
1997
1997
FRANCE
JAPAN
VLDB 2009 Tutorial Column-Oriented Database Systems 80
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
custkey region nation …
1 ASIA CHINA …
2 ASIA INDIA …
3 ASIA INDIA …
Apply “region = ‘Asia’” On Customer Table
Hash Table (or bit-map)
Containing Keys 1, 2 and 3
suppkey region nation …
1 ASIA RUSSIA …
2 EUROPE SPAIN …
Apply “region = ‘Asia’” On Supplier Table
Hash Table (or bit-map)
Containing Keys 1, 3
dateid year …
01011997 1997 …
01021997 1997 …
01031997 1997 …
Apply “year in [1992,1997]” On Date Table
Hash Table Containing
Keys 01011997, 01021997,
and 01031997
Invisible Join
“Column-Stores vs Row-Stores: How
Different are They Really?” Abadi,
Madden, and Hachem. SIGMOD 2008.
4 EUROPE FRANCE …
3 ASIA JAPAN …
Range [1-3]
(between-predicate rewriting)
VLDB 2009 Tutorial Column-Oriented Database Systems 81
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Invisible Join
l Bottom Line
l Many data warehouses model data using star/snowflake
schemes
l Joins of one (fact) table with many dimension tables is
common
l Invisible join takes advantage of this by making sure
that the table that can be accessed in position order is
the fact table for each join
l Position lists from the fact table are then intersected (in
position order)
l This reduces the amount of data that must be accessed
out of order from the dimension tables
l “Between-predicate rewriting” trick not relevant for this
discussion
VLDB 2009 Tutorial Column-Oriented Database Systems 82
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
custkey
3
3
4
1
4
1
3
+
suppkey
1
2
3
1
2
2
2
orderdate
01011997
01011997
01021997
01021997
01021997
01031997
01031997
1
0
0
1
0
0
0
3
1
1
0
0
1
0
0
0
1
1
1
0
0
1
0
0
0
01011997
01021997
+
+
+
+
JOIN
CHINA
INDIA
INDIA = CHINA
INDIA
RUSSIA
SPAIN = RUSSIA
RUSSIA
1997
1997
1997 =
01011997
01021997
01031997
1997
1997
FRANCE
JAPAN
Still accessing table out of order
VLDB 2009 Tutorial Column-Oriented Database Systems 83
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Jive/Flash Join
3
1 + CHINA
INDIA
INDIA = CHINA
INDIA
FRANCE
Still accessing table out of order
“Fast Joins using Join
Indices”. Li and Ross,
VLDBJ 8:1-24, 1999.
“Query Processing
Techniques for Solid State
Drives”. Tsirogiannis,
Harizopoulos et. al.
SIGMOD 2009.
VLDB 2009 Tutorial Column-Oriented Database Systems 84
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Jive/Flash Join
1. Add column with dense
ascending integers
from 1
2. Sort new position list
by second column
3. Probe projected
column in order using
new sorted position
list, keeping first
column from position
list around
4. Sort new result by first
column
3
1
+
CHINA
INDIA
INDIA = CHINA
INDIA
FRANCE
1
2
1
3
2
1 + CHINA
INDIA
INDIA = INDIA
CHINA
FRANCE
2
1
CHINA
INDIA1
2
VLDB 2009 Tutorial Column-Oriented Database Systems 85
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Jive/Flash Join
l Bottom Line
l Instead of probing projected columns from inner table out of
order:
l Sort join index
l Probe projected columns in order
l Sort result using an added column
l LM vs EM tradeoffs:
l LM has the extra sorts (EM accesses all columns in order)
l LM only has to fit join columns into memory (EM needs join
columns and all projected columns)
§ Results in big memory and CPU savings (see part 3 for why there is
CPU savings)
l LM only has to materialize relevant columns
l In many cases LM advantages outweigh disadvantages
l LM would be a clear winner if not for those pesky sorts … can
we do better?
VLDB 2009 Tutorial Column-Oriented Database Systems 86
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Radix Cluster/Decluster
l The full sort from the Jive join is actually overkill
l We just want to access the storage blocks in order (we
don’t mind random access within a block)
l So do a radix sort and stop early
l By stopping early, data within each block is accessed out of
order, but in the order specified in the original join index
l Use this pseudo-order to accelerate the post-probe sort as well
“Cache-Conscious Radix-Decluster
Projections”, Manegold, Boncz, Nes,
VLDB’04
•“Database Architecture Optimized for the
New Bottleneck: Memory Access”
VLDB’99
•“Generic Database Cost Models for
Hierarchical Memory Systems”, VLDB’02
(all Manegold, Boncz, Kersten)
VLDB 2009 Tutorial Column-Oriented Database Systems 87
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Radix Cluster/Decluster
l Bottom line
l Both sorts from the Jive join can be significantly reduced in
overhead
l Only been tested when there is sufficient memory for the
entire join index to be stored three times
l Technique is likely applicable to larger join indexes, but utility will
go down a little
l Only works if random access within a storage block
l Don’t want to use radix cluster/decluster if you have variable-
width column values or compression schemes that can only be
decompressed starting from the beginning of the block
VLDB 2009 Tutorial Column-Oriented Database Systems 88
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
LM vs EM joins
l Invisible, Jive, Flash, Cluster, Decluster techniques
contain a bag of tricks to improve LM joins
l Research papers show that LM joins become 2X faster
than EM joins (instead of 2X slower) for a wide array of
query types
VLDB 2009 Tutorial Column-Oriented Database Systems 89
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Tuple Construction Heuristics
l For queries with selective predicates,
aggregations, or compressed data, use late
materialization
l For joins:
l Research papers:
l Always use late materialization
l Commercial systems:
l Inner table to a join often materialized before join
(reduces system complexity):
l Some systems will use LM only if columns from
inner table can fit entirely in memory
VLDB 2009 Tutorial Column-Oriented Database Systems 90
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l Computational Efficiency of DB on modern hardware
l how column-stores can help here
l Keynote revisited: MonetDB & VectorWise in more depth
l CPU efficient column compression
l vectorized decompression
l Conclusions
l future work
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
40 years of hardware evolution
vs.
DBMS computational efficiency
VLDB
2009
Tutorial
VLDB 2009 Tutorial Column-Oriented Database Systems 92
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Architecture
Elements:
l Storage
l CPU caches L1/L2/L3
l Registers
l Execution Unit(s)
l Pipelined
l SIMD
VLDB 2009 Tutorial Column-Oriented Database Systems 93
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Metrics
VLDB 2009 Tutorial Column-Oriented Database Systems 94
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Metrics
VLDB 2009 Tutorial Column-Oriented Database Systems 95
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
DRAM Metrics
VLDB 2009 Tutorial Column-Oriented Database Systems 96
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Super-Scalar Execution (pipelining)
VLDB 2009 Tutorial Column-Oriented Database Systems 97
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Hazards
l Data hazards
l Dependencies between instructions
l L1 data cache misses
Result: bubbles in the pipeline
l Control Hazards
l Branch mispredictions
l Computed branches (late binding)
l L1 instruction cache misses
Out-of-order execution addresses data hazards
l control hazards typically more expensive
VLDB 2009 Tutorial Column-Oriented Database Systems 98
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SIMD
l Single Instruction Multiple Data
l Same operation applied on a vector of values
l MMX: 64 bits, SSE: 128bits, AVX: 256bits
l SSE, e.g. multiply 8 short integers
VLDB 2009 Tutorial Column-Oriented Database Systems 99
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
alice 22101
next()
next()
next()
ivan 37102
ivan 37102
ivan 37102
ivan 350102
alice 22101
SELECT id, name
(age-30)*50 AS bonus
FROM employee
WHERE age > 30
350
FALSETRUE
22 > 30 ?37 > 30 ?
37 – 307 * 50
7
A Look at the Query Pipeline
VLDB 2009 Tutorial Column-Oriented Database Systems 100
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
next()
next()
next()
ivan 350102
22 > 30 ?37 > 30 ?
37 – 307 * 50
A Look at the Query Pipeline
Operators
Iterator interface
-open()
-next(): tuple
-close()
VLDB 2009 Tutorial Column-Oriented Database Systems 101
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
alice 22101
next()
next()
next()
ivan 37102
ivan 37102
ivan 37102
ivan 350102
alice 22101
350
FALSETRUE
22 > 30 ?37 > 30 ?
37 – 307 * 50
7
A Look at the Query Pipeline
Primitives
Provide computational
functionality
All arithmetic allowed in
expressions,
e.g. Multiplication
mult(int,int) Ł int
7 * 50
VLDB 2009 Tutorial Column-Oriented Database Systems 102
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Operators
Iterator interface
-open()
-next(): tuple
-close()SCAN
SELECT
PROJECT
Database Architecture causes Hazards
l Data hazards
l Dependencies between instructions
l L1 data cache misses
l Control Hazards
l Branch mispredictions
l Computed branches (late binding)
l L1 instruction cache misses
Complex NSM record navigation
Large Tree/Hash Structures
Code footprint of all
operators in query plan
exceeds L1 cache
Data-dependent conditions
Next() late binding
method calls
Tree, List, Hash traversal
SIMD Out-of-order Execution
work on one tuple at a time
“DBMSs On A Modern Processor: Where Does Time Go? ”
Ailamaki, DeWitt, Hill, Wood, VLDB’99
VLDB 2009 Tutorial Column-Oriented Database Systems 103
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
DBMS Computational Efficiency
TPC-H 1GB, query 1
l selects 98% of fact table, computes net prices and
aggregates all
l Results:
l C program: ?
l MySQL: 26.2s
l DBMS “X”: 28.1s
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 104
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
DBMS Computational Efficiency
TPC-H 1GB, query 1
l selects 98% of fact table, computes net prices and
aggregates all
l Results:
l C program: 0.2s
l MySQL: 26.2s
l DBMS “X”: 28.1s
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
MonetDB
VLDB
2009
Tutorial
VLDB 2009 Tutorial Column-Oriented Database Systems 106
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l “save disk I/O when scan-intensive queries need a few
columns”
l “avoid an expression interpreter to improve
computational efficiency”
a column-store
VLDB 2009 Tutorial Column-Oriented Database Systems 107
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SELECT id, name, (age-30)*50 as bonus
FROM people
WHERE age > 30
RISC Database Algebra
VLDB 2009 Tutorial Column-Oriented Database Systems 108
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SELECT id, name, (age-30)*50 as bonus
FROM people
WHERE age > 30
RISC Database Algebra
batcalc_minus_int(int* res,
int* col,
int val,
int n)
{
for(i=0; i<n; i++)
res[i] = col[i] - val;
}
CPU happy? Give it “nice” code !
- few dependencies (control,data)
- CPU gets out-of-order execution
- compiler can e.g. generate SIMD
One loop for an entire column
- no per-tuple interpretation
- arrays: no record navigation
- better instruction cache locality
Simple, hard-
coded semantics
in operators
VLDB 2009 Tutorial Column-Oriented Database Systems 109
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SELECT id, name, (age-30)*50 as bonus
FROM people
WHERE age > 30
RISC Database Algebra
batcalc_minus_int(int* res,
int* col,
int val,
int n)
{
for(i=0; i<n; i++)
res[i] = col[i] - val;
}
CPU happy? Give it “nice” code !
- few dependencies (control,data)
- CPU gets out-of-order execution
- compiler can e.g. generate SIMD
One loop for an entire column
- no per-tuple interpretation
- arrays: no record navigation
- better instruction cache locality
MATERIALIZED
intermediate
results
VLDB 2009 Tutorial Column-Oriented Database Systems 110
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l “save disk I/O when scan-intensive queries need a few columns”
l “avoid an expression interpreter to improve computational
efficiency”
How?
l RISC query algebra: hard-coded semantics
l Decompose complex expressions in multiple operations
l Operators only handle simple arrays
l No code that handles slotted buffered record layout
l Relational algebra becomes array manipulation language
l Often SIMD for free
l Plus: use of cache-conscious algorithms for Sort/Aggr/Join
a column-store
VLDB 2009 Tutorial Column-Oriented Database Systems 111
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l You want efficiency
l Simple hard-coded operators
l I take scalability
l Result materialization
n C program: 0.2s
n MonetDB: 3.7s
n MySQL: 26.2s
n DBMS “X”: 28.1s
a Faustian pact
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
as a research platform
VLDB
2009
Tutorial
VLDB 2009 Tutorial Column-Oriented Database Systems 113
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SIGMOD 1985
MonetDB supports
SQL, XML, ODMG,
..RDF
RDF support on C-
STORE / SW-Store
MonetDB
BAT Algebra
•“MIL Primitives for Querying a Fragmented World”,
Boncz, Kersten, VLDBJ’98
•“Flattening an Object Algebra to Provide Performance”
Boncz, Wilschut, Kersten, ICDE’98
•“MonetDB/XQuery: a fast XQuery processor powered
by a relational engine” Boncz, Grust, vanKeulen,
Rittinger, Teubner, SIGMOD’06
•“SW-Store: a vertically partitioned DBMS for Semantic
Web data management“ Abadi, Marcus, Madden,
Hollenbach, VLDBJ’09
VLDB 2009 Tutorial Column-Oriented Database Systems 114
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
XQuery
MonetDB 4 MonetDB 5
MonetDB kernel
SQL 03
Optimizers
RDF
SOAP
Arrays
The Software Stack
Extensible query lang
frontend framework
.. SGL?Extensible Dynamic
Runtime QOPT
Framework!
OGIS
compile
Extensible
Architecture-Conscious
Execution platform
VLDB 2009 Tutorial Column-Oriented Database Systems 115
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l Cache-Conscious Joins
l Cost Models, Radix-cluster,
Radix-decluster
l MonetDB/XQuery:
l structural joins exploiting
positional column access
l Cracking:
l on-the-fly automatic indexing
without workload knowledge
l Recycling:
l using materialized intermediates
l Run-time Query Optimization:
l correlation-aware run-time
optimization without cost model
as a research platform
“MonetDB/XQuery: a fast XQuery processor powered
by a relational engine” Boncz, Grust, vanKeulen,
Rittinger, Teubner, SIGMOD’06
•“Database Architecture Optimized for the New
Bottleneck: Memory Access” VLDB’99
•“Generic Database Cost Models for Hierarchical Memory
Systems”, VLDB’02 (all Manegold, Boncz, Kersten)
•“Cache-Conscious Radix-Decluster Projections”,
Manegold, Boncz, Nes, VLDB’04
“Database Cracking”, CIDR’07
“Updating a cracked database “, SIGMOD’07
“Self-organizing tuple reconstruction in column-
stores“, SIGMOD’09 (all Idreos, Manegold, Kersten)
“An architecture for recycling intermediates in a
column-store”, Ivanova, Kersten, Nes, Goncalves,
SIGMOD’09
“ROX: run-time optimization of XQueries”,
Abdelkader, Boncz, Manegold, vanKeulen,
SIGMOD’09
optimizer frontend backend
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
“MonetDB/X100”
vectorized query processing
VLDB
2009
Tutorial
VLDB 2009 Tutorial Column-Oriented Database Systems 117
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Materialization vs Pipelining
MonetDB spin-off: MonetDB/X100
VLDB 2009 Tutorial Column-Oriented Database Systems 118
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
next()
next()
101
102
104
105
alice
ivan
peggy
victor
22
37
45
25
next()
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 119
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
next()
next()
101
102
104
105
alice
ivan
peggy
victor
22
37
45
25
FALSE
TRUE
TRUE
FALSE
> 30 ?
22
37
45
25
alice
ivan
peggy
victor
101
102
104
105
next()
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 120
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
next()
next()
101
102
104
105
alice
ivan
peggy
victor
22
37
45
25
7
15
FALSE
TRUE
TRUE
FALSE
37
45
ivan
peggy
102
104
350
750
ivan
peggy
102
104
350
750
Observations:
next() called much less often Ł
more time spent in primitives
less in overhead
primitive calls process an array of
values in a loop:
> 30 ?
- 30 * 50
22
37
45
25
alice
ivan
peggy
victor
101
102
104
105
“Vectorized In Cache
Processing”
vector = array of ~100
processed in a tight loop
CPU cache Resident next()
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 121
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
next()
next()
101
102
104
105
alice
ivan
peggy
victor
22
37
45
25
7
15
FALSE
TRUE
TRUE
FALSE
37
45
ivan
peggy
102
104
350
750
ivan
peggy
102
104
350
750
Observations:
next() called much less often Ł
more time spent in primitives
less in overhead
primitive calls process an array of
values in a loop:
> 30 ?
- 30 * 50
CPU Efficiency depends on “nice” code
- out-of-order execution
- few dependencies (control,data)
- compiler support
Compilers like simple loops over arrays
- loop-pipelining
- automatic SIMD
22
37
45
25
alice
ivan
peggy
victor
101
102
104
105
next()
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 122
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
FALSE
TRUE
TRUE
FALSE
350
750
Observations:
next() called much less often Ł
more time spent in primitives
less in overhead
primitive calls process an array of
values in a loop:
> 30 ?
* 50
CPU Efficiency depends on “nice” code
- out-of-order execution
- few dependencies (control,data)
- compiler support
Compilers like simple loops over arrays
- loop-pipelining
- automatic SIMD
FALSE
TRUE
TRUE
FALSE
> 30 ?
7
15
- 30
350
750
* 50
for(i=0; i<n; i++)
res[i] = (col[i] > x)
for(i=0; i<n; i++)
res[i] = (col[i] - x)
for(i=0; i<n; i++)
res[i] = (col[i] * x)
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 123
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
next()
101
102
104
105
alice
ivan
peggy
victor
22
37
45
25
1
2
Tricks being played:
- Late materialization
- Materialization avoidance
using selection vectors
> 30 ?
next()
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 124
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
next()
next()
101
102
104
105
alice
ivan
peggy
victor
22
37
45
25
7
15
1
2
350
750
Tricks being played:
- Late materialization
- Materialization avoidance
using selection vectors
> 30 ?
- 30 * 50
next()
map_mul_flt_val_flt_col(
float *res,
int* sel,
float val,
float *col, int n)
{
for(int i=0; i<n; i++)
res[i] = val * col[sel[i]];
}
selection vectors used to reduce
vector copying
contain selected positions
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 125
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SCAN
SELECT
PROJECT
next()
next()
101
102
104
105
alice
ivan
peggy
victor
22
37
45
25
7
15
1
2
350
750
Tricks being played:
- Late materialization
- Materialization avoidance
using selection vectors
> 30 ?
- 30 * 50
next()
map_mul_flt_val_flt_col(
float *res,
int* sel,
float val,
float *col, int n)
{
for(int i=0; i<n; i++)
res[i] = val * col[sel[i]];
}
selection vectors used to reduce
vector copying
contain selected positions
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 126
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
MonetDB/X100
l Both efficiency
l Vectorized primitives
l and scalability..
l Pipelined query evaluation
n C program: 0.2s
n MonetDB/X100: 0.6s
n MonetDB: 3.7s
n MySQL: 26.2s
n DBMS “X”: 28.1s
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 127
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Memory Hierarchy
ColumnBM
(buffer manager)
X100 query engine
CPU
cache
(raid)
Disk(s)
RAM
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 128
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Memory Hierarchy
Vectors are only the in-cache
representation
RAM & disk representation might
actually be different
(vectorwise uses both PAX & DSM)
ColumnBM
(buffer manager)
X100 query engine
CPU
cache
(raid)
Disk(s)
RAM
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 129
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Optimal Vector size?
All vectors together should fit
the CPU cache
Optimizer should tune this,
given the query characteristics.
ColumnBM
(buffer manager)
X100 query engine
CPU
cache
(raid)
Disk(s)
RAM
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 130
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Varying the Vector size
Less and less iterator.next()
and
primitive function calls
(“interpretation overhead”)
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 131
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Varying the Vector size
Vectors start to exceed the
CPU cache, causing
additional memory traffic
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 132
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Varying the Vector size
The benefit of selection
vectors
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 133
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
MonetDB/MIL materializes columns
ColumnBM
(buffer manager)
CPU
cache
(raid)
Disk(s)
RAM
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 134
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Benefits of Vectorized Processing
l 100x less Function Calls
l iterator.next(), primitives
l No Instruction Cache Misses
l High locality in the primitives
l Less Data Cache Misses
l Cache-conscious data placement
l No Tuple Navigation
l Primitives are record-oblivious, only see arrays
l Vectorization allows algorithmic optimization
l Move activities out of the loop (“strength reduction”)
l Compiler-friendly function bodies
l Loop-pipelining, automatic SIMD
“Buffering Database Operations for
Enhanced Instruction Cache Performance”
Zhou, Ross, SIGMOD’04
“Block oriented processing of
relational database operations in
modern computer architectures”
Padmanabhan, Malkemus, Agarwal,
ICDE’01
“MonetDB/X100: Hyper-Pipelining Query
Execution ” Boncz, Zukowski, Nes, CIDR’05
VLDB 2009 Tutorial Column-Oriented Database Systems 135
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Vectorizing Relational Operators
l Project
l Select
l Exploit selectivities, test buffer overflow
l Aggregation
l Ordered, Hashed
l Sort
l Radix-sort nicely vectorizes
l Join
l Merge-join + Hash-join
“Balancing Vectorized Query Execution with
Bandwidth Optimized Storage ” Zukowski, CWI 2008
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
Efficient Column Store Compression
VLDB
2009
Tutorial
VLDB 2009 Tutorial Column-Oriented Database Systems 137
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Key Ingredients
l Compress relations on a per-column basis
l Columns compress well
l Decompress small vectors of tuples from a column into
the CPU cache
l Minimize main-memory overhead
l Use light-weight, CPU-efficient algorithms
l Exploit processing power of modern CPUs
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 138
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Key Ingredients
l Compress relations on a per-column basis
l Columns compress well
l Decompress small vectors of tuples from a column into
the CPU cache
l Minimize main-memory overhead
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 139
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Vectorized Decompression
ColumnBM
(buffer manager)
X100 query engine
CPU
cache
(raid)
Disk(s)
RAM
Idea:
decompress a vector only
compression:
-between CPU and RAM
-Instead of disk and RAM (classic)
VLDB 2009 Tutorial Column-Oriented Database Systems 140
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
RAM-Cache Decompression
l Decompress vectors
on-demand into the
cache
l RAM-Cache boundary
only crossed once
l More (compressed)
data cached in RAM
l Less bandwidth use
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 141
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Multi-Core Bandwidth & Compression
Performance Degradation with Concurrent Queries
0
0.2
0.4
0.6
0.8
1
1.2
1 2 3 4 5 6 7 8
number of concurrent queries
avgqueryspeed(clocknormalized)
Intel® Xeon E5410 (Harpertown) - Q6b Normal
Intel® Xeon E5410 (Harpertown) - Q6b Compressed
Intel® Xeon X5560 (Nehalem) - Q6b Normal
Intel® Xeon X5560 (Nehalem) - Q6b Compressed
VLDB 2009 Tutorial Column-Oriented Database Systems 142
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Efficient Decompression
l Decoding loop over cache-resident vectors of code words
l Avoid control dependencies within decoding loop
l no if-then-else constructs in loop body
l Avoid data dependencies between loop iterations
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 143
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Disk Block Layout
l Forward growing
section of arbitrary
size code words
(code word size fixed
per block)
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 144
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Disk Block Layout
l Forward growing
section of arbitrary
size code words
(code word size
fixed per block)
l Backwards growing
exception list
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 145
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Naïve Decompression Algorithm
l Use reserved value from code word domain (MAXCODE) to mark
exception positions
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 146
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Deterioriation With Exception%
l 1.2GB/s deteriorates to 0.4GB/s
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 147
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Deterioriation With Exception%
l Perf Counters: CPU mispredicts if-then-else
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 148
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Patching
l Maintain a patch-list
through code word
section that links
exception positions
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 149
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Patching
l Maintain a patch-list
through code word
section that links
exception positions
l After decoding, patch up
the exception positions
with the correct values
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 150
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Patched Decompression
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 151
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Patched Decompression
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
VLDB 2009 Tutorial Column-Oriented Database Systems 152
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Decompression Bandwidth
l Patching makes two passes, but is faster!
Patching can be applied to:
• Frame of Reference (PFOR)
• Delta Compression (PFOR-DELTA)
• Dictionary Compression (PDICT)
Makes these methods much more
applicable to noisy data
Ł without additional cost
“Super-Scalar RAM-CPU Cache Compression”
Zukowski, Heman, Nes, Boncz, ICDE’06
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented
Database Systems
Conclusion
VLDB
2009
Tutorial
VLDB 2009 Tutorial Column-Oriented Database Systems 154
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Summary (1/2)
l Columns and Row-Stores: different?
l No fundamental differences
l Can current row-stores simulate column-stores now?
l not efficiently: row-stores need change
l On disk layout vs execution layout
l actually independent issues, on-the-fly conversion pays off
l column favors sequential access, row random
l Mixed Layout schemes
l Fractured mirrors
l PAX, Clotho
l Data morphing
VLDB 2009 Tutorial Column-Oriented Database Systems 155
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Summary (2/2)
l Crucial Columnar Techniques
l Storage
l Lean headers, sparse indices, fast positional access
l Compression
l Operating on compressed data
l Lightweight, vectorized decompression
l Late vs Early materialization
l Non-join: LM always wins
l Naïve/Invisible/Jive/Flash/Radix Join (LM often wins)
l Execution
l Vectorized in-cache execution
l Exploiting SIMD
VLDB 2009 Tutorial Column-Oriented Database Systems 156
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Future Work
l looking at write/load tradeoffs in column-stores
l read-only vs batch loads vs trickle updates vs OLTP
VLDB 2009 Tutorial Column-Oriented Database Systems 157
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Updates (1/3)
l Column-stores are update-in-place averse
l In-place: I/O for each column
l + re-compression
l + multiple sorted replicas
l + sparse tree indices
Update-in-place is infeasible!
VLDB 2009 Tutorial Column-Oriented Database Systems 158
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Updates (2/3)
l Column-stores use differential mechanisms instead
l Differential lists/files or more advanced (e.g. PDTs)
l Updates buffered in RAM, merged on each query
l Checkpointing merges differences in bulk sequentially
l I/O trends favor this anyway
§ trade RAM for converting random into sequential I/O
§ this trade is also needed in Flash (do not write randomly!)
l How high loads can it sustain?
§ Depends on available RAM for buffering (how long until full)
§ Checkpoint must be done within that time
§ The longer it can run, the less it molests queries
§ Using Flash for buffering differences buys a lot of time
§ Hundreds of GBs of differences per server
VLDB 2009 Tutorial Column-Oriented Database Systems 159
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Updates (3/3)
l Differential transactions favored by hardware trends
l Snapshot semantics accepted by the user community
l can always convert to serialized
Ł Row stores could also use differential transactions and be efficient!
Ł Implies a departure from ARIES
Ł Implies a full rewrite
My conclusion:
a system that combines row- and columns needs differentially
implemented transactions.
Starting from a pure column-store, this is a limited add-on.
Starting from a pure row-store, this implies a full rewrite.
“Serializable Isolation For Snapshot Databases” Alomari,
Cahill, Fekete, Roehm, SIGMOD’08
VLDB 2009 Tutorial Column-Oriented Database Systems 160
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Future Work
l looking at write/load tradeoffs in column-stores
l read-only vs batch loads vs trickle updates vs OLTP
l database design for column-stores
l column-store specific optimizers
l compression/materialization/join tricks Ł cost models?
l hybrid column-row systems
l can row-stores learn new column tricks?
l Study of the minimal number changes one needs to make to a
row store to get the majority of the benefits of a column-store
l Alternative: add features to column-stores that make them more
like row stores
VLDB 2009 Tutorial Column-Oriented Database Systems 161
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Conclusion
l Columnar techniques provide clear benefits for:
l Data warehousing, BI
l Information retrieval, graphs, e-science
l A number of crucial techniques make them effective
l Without these, existing row systems do not benefit
l Row-Stores and column-stores could be combined
l Row-stores may adopt some column-store techniques
l Column-stores add row-store (or PAX) functionality
l Many open issues to do research on!

More Related Content

What's hot

Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...Altinity Ltd
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
PostgreSql query planning and tuning
PostgreSql query planning and tuningPostgreSql query planning and tuning
PostgreSql query planning and tuningFederico Campoli
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing conceptspcherukumalla
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
 
Cassandra presentation at NoSQL
Cassandra presentation at NoSQLCassandra presentation at NoSQL
Cassandra presentation at NoSQLEvan Weaver
 
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017Amazon Web Services
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introductionleanderlee2
 
Cassandra sharding and consistency (lightning talk)
Cassandra sharding and consistency (lightning talk)Cassandra sharding and consistency (lightning talk)
Cassandra sharding and consistency (lightning talk)Federico Razzoli
 
Dd and atomic ddl pl17 dublin
Dd and atomic ddl pl17 dublinDd and atomic ddl pl17 dublin
Dd and atomic ddl pl17 dublinStåle Deraas
 
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Altinity Ltd
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowDataWorks Summit
 
Testing Strategies for Data Lake Hosted on Hadoop
Testing Strategies for Data Lake Hosted on HadoopTesting Strategies for Data Lake Hosted on Hadoop
Testing Strategies for Data Lake Hosted on HadoopCitiusTech
 

What's hot (20)

Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
OLTP vs OLAP
OLTP vs OLAPOLTP vs OLAP
OLTP vs OLAP
 
PostgreSql query planning and tuning
PostgreSql query planning and tuningPostgreSql query planning and tuning
PostgreSql query planning and tuning
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
Cassandra presentation at NoSQL
Cassandra presentation at NoSQLCassandra presentation at NoSQL
Cassandra presentation at NoSQL
 
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
 
Get to know PostgreSQL!
Get to know PostgreSQL!Get to know PostgreSQL!
Get to know PostgreSQL!
 
Cassandra sharding and consistency (lightning talk)
Cassandra sharding and consistency (lightning talk)Cassandra sharding and consistency (lightning talk)
Cassandra sharding and consistency (lightning talk)
 
Agreggates i
Agreggates iAgreggates i
Agreggates i
 
Dd and atomic ddl pl17 dublin
Dd and atomic ddl pl17 dublinDd and atomic ddl pl17 dublin
Dd and atomic ddl pl17 dublin
 
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 
Testing Strategies for Data Lake Hosted on Hadoop
Testing Strategies for Data Lake Hosted on HadoopTesting Strategies for Data Lake Hosted on Hadoop
Testing Strategies for Data Lake Hosted on Hadoop
 

Similar to Database system

Database system
Database systemDatabase system
Database systemNYversity
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresDaniel Abadi
 
Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Serkan Özal
 
Best Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar DatabasesBest Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar DatabasesDATAVERSITY
 
SQL, Oracle, Joins
SQL, Oracle, JoinsSQL, Oracle, Joins
SQL, Oracle, JoinsGaurish Goel
 
Database High Availability Using SHADOW Systems
Database High Availability Using SHADOW SystemsDatabase High Availability Using SHADOW Systems
Database High Availability Using SHADOW SystemsJaemyung Kim
 
Presentation on NoSQL Database related RDBMS
Presentation on NoSQL Database related RDBMSPresentation on NoSQL Database related RDBMS
Presentation on NoSQL Database related RDBMSabdurrobsoyon
 
Everything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDBEverything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDBjhugg
 
Building services using windows azure
Building services using windows azureBuilding services using windows azure
Building services using windows azureSuliman AlBattat
 
Data Bases - Introduction to data science
Data Bases - Introduction to data scienceData Bases - Introduction to data science
Data Bases - Introduction to data scienceFrank Kienle
 
Sql a practical introduction
Sql   a practical introductionSql   a practical introduction
Sql a practical introductionHasan Kata
 
Sql a practical introduction
Sql   a practical introductionSql   a practical introduction
Sql a practical introductionsanjaychauhan689
 
DDD - 5 - Domain Driven Design_ Repositories.pdf
DDD - 5 - Domain Driven Design_ Repositories.pdfDDD - 5 - Domain Driven Design_ Repositories.pdf
DDD - 5 - Domain Driven Design_ Repositories.pdfEleonora Ciceri
 
Using database object relational storage
Using database object relational storageUsing database object relational storage
Using database object relational storageDalibor Blazevic
 
C-Store-s553-stonebraker.ppt
C-Store-s553-stonebraker.pptC-Store-s553-stonebraker.ppt
C-Store-s553-stonebraker.pptJinwenZhong1
 
Melhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftMelhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftAmazon Web Services LATAM
 
(STG403) Amazon EBS: Designing for Performance
(STG403) Amazon EBS: Designing for Performance(STG403) Amazon EBS: Designing for Performance
(STG403) Amazon EBS: Designing for PerformanceAmazon Web Services
 

Similar to Database system (20)

Database system
Database systemDatabase system
Database system
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-Stores
 
Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...
 
Best Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar DatabasesBest Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar Databases
 
Lecture3.ppt
Lecture3.pptLecture3.ppt
Lecture3.ppt
 
SQL, Oracle, Joins
SQL, Oracle, JoinsSQL, Oracle, Joins
SQL, Oracle, Joins
 
Database High Availability Using SHADOW Systems
Database High Availability Using SHADOW SystemsDatabase High Availability Using SHADOW Systems
Database High Availability Using SHADOW Systems
 
Presentation on NoSQL Database related RDBMS
Presentation on NoSQL Database related RDBMSPresentation on NoSQL Database related RDBMS
Presentation on NoSQL Database related RDBMS
 
Everything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDBEverything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDB
 
Building services using windows azure
Building services using windows azureBuilding services using windows azure
Building services using windows azure
 
Data Bases - Introduction to data science
Data Bases - Introduction to data scienceData Bases - Introduction to data science
Data Bases - Introduction to data science
 
Sql a practical introduction
Sql   a practical introductionSql   a practical introduction
Sql a practical introduction
 
Sql a practical introduction
Sql   a practical introductionSql   a practical introduction
Sql a practical introduction
 
No SQL and MongoDB - Hyderabad Scalability Meetup
No SQL and MongoDB - Hyderabad Scalability MeetupNo SQL and MongoDB - Hyderabad Scalability Meetup
No SQL and MongoDB - Hyderabad Scalability Meetup
 
NoSql databases
NoSql databasesNoSql databases
NoSql databases
 
DDD - 5 - Domain Driven Design_ Repositories.pdf
DDD - 5 - Domain Driven Design_ Repositories.pdfDDD - 5 - Domain Driven Design_ Repositories.pdf
DDD - 5 - Domain Driven Design_ Repositories.pdf
 
Using database object relational storage
Using database object relational storageUsing database object relational storage
Using database object relational storage
 
C-Store-s553-stonebraker.ppt
C-Store-s553-stonebraker.pptC-Store-s553-stonebraker.ppt
C-Store-s553-stonebraker.ppt
 
Melhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftMelhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon Redshift
 
(STG403) Amazon EBS: Designing for Performance
(STG403) Amazon EBS: Designing for Performance(STG403) Amazon EBS: Designing for Performance
(STG403) Amazon EBS: Designing for Performance
 

Recently uploaded

The AI Powered Organization-Intro to AI-LAN.pdf
The AI Powered Organization-Intro to AI-LAN.pdfThe AI Powered Organization-Intro to AI-LAN.pdf
The AI Powered Organization-Intro to AI-LAN.pdfSiskaFitrianingrum
 
Article writing on excessive use of internet.pptx
Article writing on excessive use of internet.pptxArticle writing on excessive use of internet.pptx
Article writing on excessive use of internet.pptxabhinandnam9997
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesSanjeev Rampal
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxGal Baras
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shoplaozhuseo02
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...JeyaPerumal1
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEHimani415946
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptxJungkooksNonexistent
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxlaozhuseo02
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxnatyesu
 
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxLiving-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxTristanJasperRamos
 
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理aagad
 

Recently uploaded (12)

The AI Powered Organization-Intro to AI-LAN.pdf
The AI Powered Organization-Intro to AI-LAN.pdfThe AI Powered Organization-Intro to AI-LAN.pdf
The AI Powered Organization-Intro to AI-LAN.pdf
 
Article writing on excessive use of internet.pptx
Article writing on excessive use of internet.pptxArticle writing on excessive use of internet.pptx
Article writing on excessive use of internet.pptx
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAE
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxLiving-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
 
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
 

Database system

  • 1. VLDB 2009 Tutorial Column-Oriented Database Systems 1 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems Part 1: Stavros Harizopoulos (HP Labs) Part 2: Daniel Abadi (Yale) Part 3: Peter Boncz (CWI) VLDB 2009 Tutorial
  • 2. VLDB 2009 Tutorial Column-Oriented Database Systems 2 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What is a column-store? VLDB 2009 Tutorial Column-Oriented Database Systems 2 row-store column-store Date CustomerProductStore + easy to add/modify a record - might read in unnecessary data + only need to read in relevant data - tuple writes require multiple accesses => suitable for read-mostly, read-intensive, large data repositories Date Store Product Customer Price Price
  • 3. VLDB 2009 Tutorial Column-Oriented Database Systems 3 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Are these two fundamentally different? l The only fundamental difference is the storage layout l However: we need to look at the big picture VLDB 2009 Tutorial Column-Oriented Database Systems 3 ‘70s ‘80s ‘90s ‘00s today row-stores row-stores++ row-stores++ different storage layouts proposed new applications new bottleneck in hardware column-stores converge? l How did we get here, and where we are heading l What are the column-specific optimizations? l How do we improve CPU efficiency when operating on Cs Part 2 Part 1 Part 3
  • 4. VLDB 2009 Tutorial Column-Oriented Database Systems 4 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Part 1: Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 4
  • 5. VLDB 2009 Tutorial Column-Oriented Database Systems 5 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Telco Data Warehousing example l Typical DW installation l Real-world example VLDB 2009 Tutorial Column-Oriented Database Systems 5 usage source toll account star schema fact table dimension tables or RAM QUERY 2 SELECT account.account_number, sum (usage.toll_airtime), sum (usage.toll_price) FROM usage, toll, source, account WHERE usage.toll_id = toll.toll_id AND usage.source_id = source.source_id AND usage.account_id = account.account_id AND toll.type_ind in (‘AE’. ‘AA’) AND usage.toll_price > 0 AND source.type != ‘CIBER’ AND toll.rating_method = ‘IS’ AND usage.invoice_date = 20051013 GROUP BY account.account_number Column-store Row-store Query 1 2.06 300 Query 2 2.20 300 Query 3 0.09 300 Query 4 5.24 300 Query 5 2.88 300 Why? Three main factors (next slides) “One Size Fits All? - Part 2: Benchmarking Results” Stonebraker et al. CIDR 2007
  • 6. VLDB 2009 Tutorial Column-Oriented Database Systems 6 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Telco example explained (1/3): read efficiency read pages containing entire rows one row = 212 columns! is this typical? (it depends) VLDB 2009 Tutorial Column-Oriented Database Systems 6 row store column store read only columns needed in this example: 7 columns caveats: • “select * ” not any faster • clever disk prefetching • clever tuple reconstruction What about vertical partitioning? (it does not work with ad-hoc queries)
  • 7. VLDB 2009 Tutorial Column-Oriented Database Systems 7 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Telco example explained (2/3): compression efficiency l Columns compress better than rows l Typical row-store compression ratio 1 : 3 l Column-store 1 : 10 l Why? l Rows contain values from different domains => more entropy, difficult to dense-pack l Columns exhibit significantly less entropy l Examples: l Caveat: CPU cost (use lightweight compression) VLDB 2009 Tutorial Column-Oriented Database Systems 7 Male, Female, Female, Female, Male 1998, 1998, 1999, 1999, 1999, 2000
  • 8. VLDB 2009 Tutorial Column-Oriented Database Systems 8 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Telco example explained (3/3): sorting & indexing efficiency l Compression and dense-packing free up space l Use multiple overlapping column collections l Sorted columns compress better l Range queries are faster l Use sparse clustered indexes VLDB 2009 Tutorial Column-Oriented Database Systems 8 What about heavily-indexed row-stores? (works well for single column access, cross-column joins become increasingly expensive)
  • 9. VLDB 2009 Tutorial Column-Oriented Database Systems 9 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Additional opportunities for column-stores l Block-tuple / vectorized processing l Easier to build block-tuple operators l Amortizes function-call cost, improves CPU cache performance l Easier to apply vectorized primitives l Software-based: bitwise operations l Hardware-based: SIMD l Opportunities with compressed columns l Avoid decompression: operate directly on compressed l Delay decompression (and tuple reconstruction) l Also known as: late materialization l Exploit columnar storage in other DBMS components l Physical design (both static and dynamic) VLDB 2009 Tutorial Column-Oriented Database Systems 9 Part 3 more in Part 2 See: Database Cracking, from CWI
  • 10. VLDB 2009 Tutorial Column-Oriented Database Systems 10 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Effect on C-Store performance VLDB 2009 Tutorial Column-Oriented Database Systems 10 “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Hachem, and Madden. SIGMOD 2008. Time(sec) Average for SSBM queries on C-store enable late materializationenable compression & operate on compressed original C-store column-oriented join algorithm
  • 11. VLDB 2009 Tutorial Column-Oriented Database Systems 11 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Summary of column-store key features l Storage layout l Execution engine l Design tools, optimizer VLDB 2009 Tutorial Column-Oriented Database Systems 11 columnar storage header/ID elimination compression multiple sort orders column operators avoid decompression late materialization vectorized operations Part 3 Part 2 Part 2 Part 1 Part 1 Part 2 Part 3
  • 12. VLDB 2009 Tutorial Column-Oriented Database Systems 12 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Part 1: Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 12
  • 13. VLDB 2009 Tutorial Column-Oriented Database Systems 13 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) From DSM to Column-stores 70s -1985: 1985: DSM paper 1990s: Commercialization through SybaseIQ Late 90s – 2000s: Focus on main-memory performance l DSM “on steroids” [1997 – now] l Hybrid DSM/NSM [2001 – 2004] 2005 – : Re-birth of read-optimized DSM as “column-store” VLDB 2009 Tutorial Column-Oriented Database Systems 13 “A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985. CWI: MonetDB Wisconsin: PAX, Fractured Mirrors Michigan: Data Morphing CMU: Clotho MIT: C-Store CWI: MonetDB/X100 10+ startups TOD: Time Oriented Database – Wiederhold et al. "A Modular, Self-Describing Clinical Databank System," Computers and Biomedical Research, 1975 More 1970s: Transposed files, Lorie, Batory, Svensson. “An overview of cantor: a new system for data analysis” Karasalo, Svensson, SSDBM 1983
  • 14. VLDB 2009 Tutorial Column-Oriented Database Systems 14 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) The original DSM paper l Proposed as an alternative to NSM l 2 indexes: clustered on ID, non-clustered on value l Speeds up queries projecting few columns l Requires more storage VLDB 2009 Tutorial Column-Oriented Database Systems 14 “A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985. 1 2 3 4 .. ID value 0100 0962 1000 ..
  • 15. VLDB 2009 Tutorial Column-Oriented Database Systems 15 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Memory wall and PAX l 90s: Cache-conscious research l PAX: Partition Attributes Across l Retains NSM I/O pattern l Optimizes cache-to-RAM communication VLDB 2009 Tutorial Column-Oriented Database Systems 15 “DBMSs on a modern processor: Where does time go?” Ailamaki, DeWitt, Hill, Wood. VLDB 1999. “Weaving Relations for Cache Performance.” Ailamaki, DeWitt, Hill, Skounakis, VLDB 2001. “Cache Conscious Algorithms for Relational Query Processing.” Shatdal, Kant, Naughton. VLDB 1994. from: “Database Architecture Optimized for the New Bottleneck: Memory Access.” Boncz, Manegold, Kersten. VLDB 1999. to: and:
  • 16. VLDB 2009 Tutorial Column-Oriented Database Systems 16 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) More hybrid NSM/DSM schemes l Dynamic PAX: Data Morphing l Clotho: custom layout using scatter-gather I/O l Fractured mirrors l Smart mirroring with both NSM/DSM copies VLDB 2009 Tutorial Column-Oriented Database Systems 16 “Data morphing: an adaptive, cache-conscious storage technique.” Hankins, Patel, VLDB 2003. “Clotho: Decoupling Memory Page Layout from Storage Organization.” Shao, Schindler, Schlosser, Ailamaki, and Ganger. VLDB 2004. “A Case For Fractured Mirrors.” Ramamurthy, DeWitt, Su, VLDB 2002.
  • 17. VLDB 2009 Tutorial Column-Oriented Database Systems 17 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) MonetDB (more in Part 3) l Late 1990s, CWI: Boncz, Manegold, and Kersten l Motivation: l Main-memory l Improve computational efficiency by avoiding expression interpreter l DSM with virtual IDs natural choice l Developed new query execution algebra l Initial contributions: l Pointed out memory-wall in DBMSs l Cache-conscious projections and joins l … VLDB 2009 Tutorial Column-Oriented Database Systems 17
  • 18. VLDB 2009 Tutorial Column-Oriented Database Systems 18 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) 2005: the (re)birth of column-stores l New hardware and application realities l Faster CPUs, larger memories, disk bandwidth limit l Multi-terabyte Data Warehouses l New approach: combine several techniques l Read-optimized, fast multi-column access, disk/CPU efficiency, light-weight compression l C-store paper: l First comprehensive design description of a column-store l MonetDB/X100 l “proper” disk-based column store l Explosion of new products VLDB 2009 Tutorial Column-Oriented Database Systems 18
  • 19. VLDB 2009 Tutorial Column-Oriented Database Systems 19 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Performance tradeoffs: columns vs. rows DSM traditionally was not favored by technology trends How has this changed? l Optimized DSM in “Fractured Mirrors,” 2002 l “Apples-to-apples” comparison l Follow-up study l Main-memory DSM vs. NSM l Flash-disks: a come-back for PAX? VLDB 2009 Tutorial Column-Oriented Database Systems 19 “Performance Tradeoffs in Read- Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06 “Read-Optimized Databases, In- Depth” Holloway, DeWitt, VLDB’08 “Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD’09 “Fast Scans and Joins Using Flash Drives” Shah, Harizopoulos, Wiener, Graefe. DaMoN’08 “DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN’08
  • 20. VLDB 2009 Tutorial Column-Oriented Database Systems 20 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Fractured mirrors: a closer look l Store DSM relations inside a B-tree l Leaf nodes contain values l Eliminate IDs, amortize header overhead l Custom implementation on Shore VLDB 2009 Tutorial Column-Oriented Database Systems 20 3 sparse B-tree on ID 1 a1 “Efficient columnar storage in B-trees” Graefe. Sigmod Record 03/2007. Similar: storage density comparable to column stores “A Case For Fractured Mirrors” Ramamurthy, DeWitt, Su, VLDB 2002. 11 22 33 TIDTID a1a1 a2a2 a3a3 Column Data Column Data Tuple Header Tuple Header 44 a4a4 55 a5a5 2 a2 3 a3 4 a4 5 a5 1 a1 a3a2 4 a4 a5
  • 21. VLDB 2009 Tutorial Column-Oriented Database Systems 21 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Fractured mirrors: performance l Chunk-based tuple merging l Read in segments of M pages l Merge segments in memory l Becomes CPU-bound after 5 pages VLDB 2009 Tutorial Column-Oriented Database Systems 21 From PAX paper: optimized DSM columns projected: 1 2 3 4 5 time row column? column? regular DSM
  • 22. VLDB 2009 Tutorial Column-Oriented Database Systems 22 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-scanner implementation VLDB 2009 Tutorial Column-Oriented Database Systems 22 1 Joe 45 2 Sue 37 … … … Joe Sue 45 37 … … prefetch ~100ms worth of data S apply predicate(s) Joe 45 … … S #POS 45 #POS … S Joe 45 … … apply predicate #1 SELECT name, age WHERE age > 40 Direct I/O row scanner column scanner “Performance Tradeoffs in Read- Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06
  • 23. VLDB 2009 Tutorial Column-Oriented Database Systems 23 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Scan performance l Large prefetch hides disk seeks in columns l Column-CPU efficiency with lower selectivity l Row-CPU suffers from memory stalls l Memory stalls disappear in narrow tuples l Compression: similar to narrow VLDB 2009 Tutorial Column-Oriented Database Systems 23 not shown, details in the paper
  • 24. VLDB 2009 Tutorial Column-Oriented Database Systems 24 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Even more results Non-selective queries, narrow tuples, favor well-compressed rows Materialized views are a win Scan times determine early materialized joins VLDB 2009 Tutorial Column-Oriented Database Systems 24 “Read-Optimized Databases, In- Depth” Holloway, DeWitt, VLDB’08 0 5 10 15 20 25 30 35 1 2 3 4 5 6 7 8 9 10 Time(s) Columns Returned C-25% C-10% R-50% Column-joins are covered in part 2! • Same engine as before • Additional findings wide attributes: same as before narrow & compressed tuple: CPU-bound!
  • 25. VLDB 2009 Tutorial Column-Oriented Database Systems 25 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Speedup of columns over rows l Rows favored by narrow tuples and low cpdb l Disk-bound workloads have higher cpdb VLDB 2009 Tutorial Column-Oriented Database Systems 25 tuple width cyclesperdiskbyte (cpdb) 8 12 16 20 24 28 32 36 9 18 36 72 144 _ + ++= +++ “Performance Tradeoffs in Read- Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06
  • 26. VLDB 2009 Tutorial Column-Oriented Database Systems 26 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Varying prefetch size l No prefetching hurts columns in single scans VLDB 2009 Tutorial Column-Oriented Database Systems 26 0 10 20 30 40 4 8 12 16 20 24 28 32 time(sec) selected bytes per tuple Row (any prefetch size) Column 48 (x 128KB) Column 16 Column 8 Column 2 no competing disk traffic
  • 27. VLDB 2009 Tutorial Column-Oriented Database Systems 27 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Varying prefetch size l No prefetching hurts columns in single scans l Under competing traffic, columns outperform rows for any prefetch size VLDB 2009 Tutorial Column-Oriented Database Systems 27 with competing disk traffic 0 10 20 30 40 4 12 20 28 Column, 48 Row, 48 0 10 20 30 40 4 12 20 28 Column, 8 Row, 8 selected bytes per tuple time(sec)
  • 28. VLDB 2009 Tutorial Column-Oriented Database Systems 28 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) CPU Performance l Benefit in on-the-fly conversion between NSM and DSM l DSM: sequential access (block fits in L2), random in L1 l NSM: random access, SIMD for grouped Aggregation VLDB 2009 Tutorial Column-Oriented Database Systems 28 “DSM vs. NSM: CPU performance trade offs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN’08
  • 29. VLDB 2009 Tutorial Column-Oriented Database Systems 29 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB 2009 Tutorial Column-Oriented Database Systems 29 New storage technology: Flash SSDs l Performance characteristics l very fast random reads, slow random writes l fast sequential reads and writes l Price per bit (capacity follows) l cheaper than RAM, order of magnitude more expensive than Disk l Flash Translation Layer introduces unpredictability l avoid random writes! l Form factors not ideal yet l SSD (Ł small reads still suffer from SATA overhead/OS limitations) l PCI card (Ł high price, limited expandability) l Boost Sequential I/O in a simple package l Flash RAID: very tight bandwidth/cm3 packing (4GB/sec inside the box) l Column Store Updates l useful for delta structures and logs l Random I/O on flash fixes unclustered index access l still suboptimal if I/O block size > record size l therefore column stores profit mush less than horizontal stores l Random I/O useful to exploit secondary, tertiary table orderings l the larger the data, the deeper clustering one can exploit
  • 30. VLDB 2009 Tutorial Column-Oriented Database Systems 30 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Even faster column scans on flash SSDs l New-generation SSDs l Very fast random reads, slower random writes l Fast sequential RW, comparable to HDD arrays l No expensive seeks across columns l FlashScan and Flashjoin: PAX on SSDs, inside Postgres VLDB 2009 Tutorial Column-Oriented Database Systems 30 “Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD’09 mini-pages with no qualified attributes are not accessed 30K Read IOps, 3K Write Iops 250MB/s Read BW, 200MB/s Write
  • 31. VLDB 2009 Tutorial Column-Oriented Database Systems 31 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-scan performance over time VLDB 2009 Tutorial Column-Oriented Database Systems 31 from 7x slower ..to 1.2x slower ..to same and 3x faster! regular DSM (2001) optimized DSM (2002) column-store (2006) SSD Postgres/PAX (2009) ..to 2x slower
  • 32. VLDB 2009 Tutorial Column-Oriented Database Systems 32 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Part 1: Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 32
  • 33. VLDB 2009 Tutorial Column-Oriented Database Systems 33 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Architecture of a column-store storage layout l read-optimized: dense-packed, compressed l organize in extends, batch updates l multiple sort orders l sparse indexes VLDB 2009 Tutorial Column-Oriented Database Systems 33 engine l block-tuple operators l new access methods l optimized relational operatorssystem-level l system-wide column support l loading / updates l scaling through multiple nodes l transactions / redundancy
  • 34. VLDB 2009 Tutorial Column-Oriented Database Systems 34 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) C-Store l Compress columns l No alignment l Big disk blocks l Only materialized views (perhaps many) l Focus on Sorting not indexing l Data ordered on anything, not just time l Automatic physical DBMS design l Optimize for grid computing l Innovative redundancy l Xacts – but no need for Mohan l Column optimizer and executor VLDB 2009 Tutorial Column-Oriented Database Systems 34 “C-Store: A Column-Oriented DBMS.” Stonebraker et al. VLDB 2005.
  • 35. VLDB 2009 Tutorial Column-Oriented Database Systems 35 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) C-Store: only materialized views (MVs) l Projection (MV) is some number of columns from a fact table l Plus columns in a dimension table – with a 1-n join between Fact and Dimension table l Stored in order of a storage key(s) l Several may be stored! l With a permutation, if necessary, to map between them l Table (as the user specified it and sees it) is not stored! l No secondary indexes (they are a one column sorted MV plus a permutation, if you really want one) VLDB 2009 Tutorial Column-Oriented Database Systems 35 User view: EMP (name, age, salary, dept) Dept (dname, floor) Possible set of MVs: MV-1 (name, dept, floor) in floor order MV-2 (salary, age) in age order MV-3 (dname, salary, name) in salary order
  • 36. VLDB 2009 Tutorial Column-Oriented Database Systems 36 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Asynchronous Data Transfer TUPLE MOVER > Read Optimized Store (ROS) • On disk • Sorted / Compressed • Segmented • Large data loaded direct Continuous Load and Query (Vertica) Hybrid Storage Architecture (A B C | A) A B C Trickle Load > Write Optimized Store (WOS) §Memory based §Unsorted / Uncompressed §Segmented §Low latency / Small quick inserts A B C VLDB 2009 Tutorial Column-Oriented Database Systems 36
  • 37. VLDB 2009 Tutorial Column-Oriented Database Systems 37 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Loading Data (Vertica) Write-Optimized Store (WOS) In-memory Read-Optimized Store (ROS) On-disk Automatic Tuple Mover > INSERT, UPDATE, DELETE > Bulk and Trickle Loads §COPY §COPY DIRECT > User loads data into logical Tables > Vertica loads atomically into storage
  • 38. VLDB 2009 Tutorial Column-Oriented Database Systems 38 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Applications for column-stores l Data Warehousing l High end (clustering) l Mid end/Mass Market l Personal Analytics l Data Mining l E.g. Proximity l Google BigTable l RDF l Semantic web data management l Information retrieval l Terabyte TREC l Scientific datasets l SciDB initiative l SLOAN Digital Sky Survey on MonetDB VLDB 2009 Tutorial Column-Oriented Database Systems 38
  • 39. VLDB 2009 Tutorial Column-Oriented Database Systems 39 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) List of column-store systems l Cantor (history) l Sybase IQ l SenSage (former Addamark Technologies) l Kdb l 1010data l MonetDB l C-Store/Vertica l X100/VectorWise l KickFire l SAP Business Accelerator l Infobright l ParAccel l Exasol VLDB 2009 Tutorial Column-Oriented Database Systems 39
  • 40. VLDB 2009 Tutorial Column-Oriented Database Systems 40 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Part 1: Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 40
  • 41. VLDB 2009 Tutorial Column-Oriented Database Systems 41 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB 2009 Tutorial Column-Oriented Database Systems 41 Simulate a Column-Store inside a Row-Store Date Store Product Customer Price 01/01 01/01 01/01 1 2 3 1 2 3 1 2 3 Option A: Vertical Partitioning … Option B: Index Every Column Date Index Store Index 1 2 3 1 2 3 01/01 01/01 01/01 BOS NYC BOS Table Chair Bed Mesa Lutz Mudd $20 $13 $79 BOS NYC BOS Table Chair Bed Mesa Lutz Mudd $20 $13 $79 TID Value Store TID Value Product TID Value Customer TID Value Price TID Value Date
  • 42. VLDB 2009 Tutorial Column-Oriented Database Systems 42 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB 2009 Tutorial Column-Oriented Database Systems 42 Simulate a Column-Store inside a Row-Store Date Store Product Customer Price 1 2 3 1 2 3 Option A: Vertical Partitioning … Option B: Index Every Column Date Index Store Index 1 2 3 1 2 3 01/01 01/01 01/01 BOS NYC BOS Table Chair Bed Mesa Lutz Mudd $20 $13 $79 BOS NYC BOS Table Chair Bed Mesa Lutz Mudd $20 $13 $79 TID Value Store TID Value Product TID Value Customer TID Value Price 01/01 1 3 StartPosValue Date Length Can explicitly run- length encode date “Teaching an Old Elephant New Tricks.” Bruno, CIDR 2009.
  • 43. VLDB 2009 Tutorial Column-Oriented Database Systems 43 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB 2009 Tutorial Column-Oriented Database Systems 43 Experiments 0.0 50.0 100.0 150.0 200.0 250.0 Time(seconds) Average 25.7 79.9 221.2 Normal Row-Store Vertically Partitioned Row-Store Row-Store With All Indexes “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Hachem, and Madden. SIGMOD 2008. l Star Schema Benchmark (SSBM) l Implemented by professional DBA l Original row-store plus 2 column-store simulations on same row-store product Adjoined Dimension Column Index (ADC Index) to Improve Star Schema Query Performance”. O’Neil et. al. ICDE 2008.
  • 44. VLDB 2009 Tutorial Column-Oriented Database Systems 44 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What’s Going On? Vertical Partitions l Vertical partitions in row-stores: l Work well when workload is known l ..and queries access disjoint sets of columns l See automated physical design l Do not work well as full-columns l TupleID overhead significant l Excessive joins VLDB 2009 Tutorial Column-Oriented Database Systems 44 “Column-Stores vs. Row-Stores: How Different Are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns Complete fact table takes up ~4 GB (compressed) Vertically partitioned tables take up 0.7-1.1 GB (compressed) 11 22 33 TID Column Data Tuple Header
  • 45. VLDB 2009 Tutorial Column-Oriented Database Systems 45 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB 2009 Tutorial Column-Oriented Database Systems 45 What’s Going On? All Indexes Case l Tuple construction l Common type of query: l Result of lower part of query plan is a set of TIDs that passed all predicates l Need to extract SELECT attributes at these TIDs l BUT: index maps value to TID l You really want to map TID to value (i.e., a vertical partition) Tuple construction is SLOW SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.country = “Canada” GROUP BY store_name
  • 46. VLDB 2009 Tutorial Column-Oriented Database Systems 46 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) VLDB 2009 Tutorial Column-Oriented Database Systems 46 So…. l All indexes approach is a poor way to simulate a column-store l Problems with vertical partitioning are NOT fundamental l Store tuple header in a separate partition l Allow virtual TIDs l Combine clustered indexes, vertical partitioning l So can row-stores simulate column-stores? l Might be possible, BUT: l Need better support for vertical partitioning at the storage layer l Need support for column-specific optimizations at the executer level l Full integration: buffer pool, transaction manager, .. l When will this happen? l Most promising features = soon l ..unless new technology / new objectives change the game (SSDs, Massively Parallel Platforms, Energy-efficiency) See Part 2, Part 3 for most promising features
  • 47. VLDB 2009 Tutorial Column-Oriented Database Systems 47 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) End of Part 1 l Basic concepts — Stavros l Introduction to key features l From DSM to column-stores and performance tradeoffs l Column-store architecture overview l Will rows and columns ever converge? l Part 2: Column-oriented execution — Daniel l Part 3: MonetDB/X100 and CPU efficiency — Peter VLDB 2009 Tutorial Column-Oriented Database Systems 47
  • 48. VLDB 2009 Tutorial Column-Oriented Database Systems 48 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Part 2 Outline l Compression l Tuple Materialization l Joins
  • 49. Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems Compression VLDB 2009 Tutorial “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi, Madden, and Ferreira, SIGMOD ’06 “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06 •Query optimization in compressed database systems” Chen, Gehrke, Korn, SIGMOD’01
  • 50. VLDB 2009 Tutorial Column-Oriented Database Systems 50 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Compression l Trades I/O for CPU l Increased column-store opportunities: l Higher data value locality in column stores l Techniques such as run length encoding far more useful l Can use extra space to store multiple copies of data in different sort orders
  • 51. VLDB 2009 Tutorial Column-Oriented Database Systems 51 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Run-length Encoding Q1 Q1 Q1 Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 … … 1 1 1 1 1 2 2 1 1 1 2 … … Product IDQuarter (value, start_pos, run_length) (1, 1, 5) … … Product IDQuarter (Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600) (2, 6, 2) (1, 301, 3) (2, 304, 1) 5 7 2 9 6 8 5 3 8 1 4 … … Price 5 7 2 9 6 8 5 3 8 1 4 … … Price (Q1, 1, 300) (value, start_pos, run_length)
  • 52. VLDB 2009 Tutorial Column-Oriented Database Systems 52 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Bit-vector Encoding 1 1 1 1 1 2 2 1 1 2 3 … … Product ID 1 1 1 1 1 0 0 1 1 0 0 … … ID: 1 ID: 2 ID: 3 0 0 0 0 0 0 0 0 0 0 0 … … … 0 0 0 0 0 0 0 0 0 0 1 … … 0 0 0 0 0 1 1 0 0 1 0 … … “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06 l For each unique value, v, in column c, create bit-vector b l b[i] = 1 if c[i] = v l Good for columns with few unique values l Each bit-vector can be further compressed if sparse
  • 53. VLDB 2009 Tutorial Column-Oriented Database Systems 53 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Q1 Q2 Q4 Q1 Q3 Q1 Q1 Q2 Q4 Q3 Q3 … Quarter Q1 0 1 3 0 2 0 0 1 3 2 2 … Quarter 0 0: Q1 1: Q2 2: Q3 3: Q4 Dictionary Encoding Dictionary + OR 24 128 122 Quarter 24: Q1, Q2, Q4, Q1 128: Q3, Q1, Q1, Q1 122: Q2, Q4, Q3, Q3 Dictionary + “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06 … l For each unique value create dictionary entry l Dictionary can be per-block or per-column l Column-stores have the advantage that dictionary entries may encode multiple values at once
  • 54. VLDB 2009 Tutorial Column-Oriented Database Systems 54 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) 45 54 48 55 51 53 40 49 62 52 50 … Price 50 Frame: 50 4 -2 5 1 3 2 0 0 … Price -1 Frame Of Reference Encoding -5 ∞ 40 ∞ 62 4 bits per value Exceptions (see part 3 for a better way to deal with exceptions) l Encodes values as b bit offset from chosen frame of reference l Special escape code (e.g. all bits set to 1) indicates a difference larger than can be stored in b bits l After escape code, original (uncompressed) value is written “Compressing Relations and Indexes ” Goldstein, Ramakrishnan, Shaft, ICDE’98
  • 55. VLDB 2009 Tutorial Column-Oriented Database Systems 55 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Differential Encoding 5:00 5:02 5:03 5:03 5:04 5:06 5:07 5:10 5:15 5:16 5:16 … Time 5:08 2 1 0 1 2 1 1 0 Time 2 5:00 1 ∞ 5:15 2 bits per value Exception (see part 3 for a better way to deal with exceptions) l Encodes values as b bit offset from previous value l Special escape code (just like frame of reference encoding) indicates a difference larger than can be stored in b bits l After escape code, original (uncompressed) value is written l Performs well on columns containing increasing/decreasing sequences l inverted lists l timestamps l object IDs l sorted / clustered columns “Improved Word-Aligned Binary Compression for Text Indexing” Ahn, Moffat, TKDE’06
  • 56. VLDB 2009 Tutorial Column-Oriented Database Systems 56 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What Compression Scheme To Use? Does column appear in the sort key? Is the average run-length > 2 Are number of unique values < ~50000 Differential Encoding RLE Is the data numerical and exhibit good locality? Frame of Reference Encoding Heavyweight Compression Leave Data Uncompressed OR Does this column appear frequently in selection predicates? Bit-vector Compression Dictionary Compression yes yes yes yes yes no no no no no
  • 57. VLDB 2009 Tutorial Column-Oriented Database Systems 57 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Heavy-Weight Compression Schemes l Modern disk arrays can achieve > 1GB/s l 1/3 CPU for decompression Ł 3GB/s needed Ł Lightweight compression schemes are better Ł Even better: operate directly on compressed data “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 58. VLDB 2009 Tutorial Column-Oriented Database Systems 58 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) l I/O - CPU tradeoff is no longer a tradeoff l Reduces memory–CPU bandwidth requirements l Opens up possibility of operating on multiple records at once Operating Directly on Compressed Data “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06
  • 59. VLDB 2009 Tutorial Column-Oriented Database Systems 59 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) 1 0 0 1 1 0 0 1 0 0 0 … … Product IDQuarter (1, 3) (2, 1) (3, 2) 1 0 0 1 0 0 0 1 0 0 1 0 … … 2 0 1 0 0 0 1 0 0 1 0 1 … … 3 ProductID, COUNT(*)) (Q1, 1, 300) (Q2, 301, 6) (Q3, 307, 500) (Q4, 807, 600) Index Lookup + Offset jump SELECT ProductID, Count(*) FROM table WHERE (Quarter = Q2) GROUP BY ProductID 301-306 Operating Directly on Compressed Data “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06
  • 60. VLDB 2009 Tutorial Column-Oriented Database Systems 60 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) (Q1, 1, 300) (Q2, 301, 6) (Q3, 307, 500) (Q4, 807, 600) Data isOneValue() isValueSorted() isPosContiguous() isSparse() getNext() decompressIntoArray() getValueAtPosition(pos) getMin() getMax() getSize() Block API Compression- Aware Scan Operator Selection Operator Aggregation Operator SELECT ProductID, Count(*) FROM table WHERE (Quarter = Q2) GROUP BY ProductID “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ’06 Operating Directly on Compressed Data
  • 61. Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems Tuple Materialization and Column-Oriented Join Algorithms VLDB 2009 Tutorial “Materialization Strategies in a Column- Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007. “Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos Shah, Wiener, and Graefe. SIGMOD 2009. “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. “Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04 “Self-organizing tuple reconstruction in column-stores“, Idreos, Manegold, Kersten, SIGMOD’09
  • 62. VLDB 2009 Tutorial Column-Oriented Database Systems 62 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) When should columns be projected? l Where should column projection operators be placed in a query plan? l Row-store: l Column projection involves removing unneeded columns from tuples l Generally done as early as possible l Column-store: l Operation is almost completely opposite from a row-store l Column projection involves reading needed columns from storage and extracting values for a listed set of tuples § This process is called “materialization” l Early materialization: project columns at beginning of query plan § Straightforward since there is a one-to-one mapping across columns l Late materialization: wait as long as possible for projecting columns § More complicated since selection and join operators on one column obfuscates mapping to other columns from same table l Most column-stores construct tuples and column projection time § Many database interfaces expect output in regular tuples (rows) § Rest of discussion will focus on this case
  • 63. VLDB 2009 Tutorial Column-Oriented Database Systems 63 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) When should tuples be constructed? l Solution 1: Create rows first (EM). But: l Need to construct ALL tuples l Need to decompress data l Poor memory bandwidth utilization 2 1 3 1 2 3 3 3 7 13 42 80 (4,1,4) Construct 2 3 3 3 7 13 42 80 Select + Aggregate 2 1 3 1 4 4 4 4 prodID storeID custID price QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID
  • 64. VLDB 2009 Tutorial Column-Oriented Database Systems 64 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Data Source prodID Data Source storeID AND AGG Data Source custID Data Source price 1 1 1 1 4 4 4 4 0 1 0 1 2 1 3 1 prodID storeID Data Source Data Source QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID 2 1 3 1 2 3 3 3 7 13 42 80 4 4 4 4 prodID storeID custID price Solution 2: Operate on columns
  • 65. VLDB 2009 Tutorial Column-Oriented Database Systems 65 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Solution 2: Operate on columns 1 1 1 1 0 1 0 1 AND 0 1 0 1 QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID AND AGG Data Source custID Data Source price 2 1 3 1 2 3 3 3 7 13 42 80 4 4 4 4 prodID storeID custID price Data Source prodID Data Source storeID
  • 66. VLDB 2009 Tutorial Column-Oriented Database Systems 66 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) 3 3 Solution 2: Operate on columns 0 1 0 1 0 1 0 1 Data Source Data Source 0 1 0 1 7 13 42 80 0 1 0 1 2 3 3 3 1 1 13 80 custID price QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID AND AGG Data Source custID Data Source price 2 1 3 1 2 3 3 3 7 13 42 80 4 4 4 4 prodID storeID custID price Data Source prodID Data Source storeID
  • 67. VLDB 2009 Tutorial Column-Oriented Database Systems 67 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Solution 2: Operate on columns 1 1 3 3 1 1 13 80 AGG 13 193 QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID AND AGG Data Source custID Data Source price 2 1 3 1 2 3 3 3 7 13 42 80 4 4 4 4 prodID storeID custID price Data Source prodID Data Source storeID
  • 68. VLDB 2009 Tutorial Column-Oriented Database Systems 68 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) For plans without joins, late materialization is a win l Ran on 2 compressed columns from TPC-H scale 10 data QUERY: SELECT C1, SUM(C2) FROM table WHERE (C1 < CONST) AND (C2 < CONST) GROUP BY C1 0 1 2 3 4 5 6 7 8 9 10 Low selectivity Medium selectivity High selectivity Time(seconds) Early Materialization Late Materialization “Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007.
  • 69. VLDB 2009 Tutorial Column-Oriented Database Systems 69 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) l Materializing late still works best QUERY: SELECT C1, SUM(C2) FROM table WHERE (C1 < CONST) AND (C2 < CONST) GROUP BY C1 0 2 4 6 8 10 12 14 Low selectivity Medium selectivity High selectivity Time(seconds) Early Materialization Late Materialization Even on uncompressed data, late materialization is still a win “Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007.
  • 70. VLDB 2009 Tutorial Column-Oriented Database Systems 70 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What about for plans with joins? 70 Select R1.B, R1.C, R2.E, R2.H, R3.F From R1, R2, R3 Where R1.A = R2.D AND R2.G = R3.K R1 (A, B, C) R2 (D, E, G, H) Scan Scan Join 1 A = D Join 2 G = K Scan R3 (F, K, L) B, C, E, H, F A, B, C B, C, E, G, H D, E, G, H F, K R1 (A, B, C) R2 (D, E, G, H) Scan Scan Join A = D A D Scan R3 (F, K, L) Project G Join G = K Project B, C, E, H, F KG E, H F B, C
  • 71. VLDB 2009 Tutorial Column-Oriented Database Systems 71 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) What about for plans with joins? 71 Select R1.B, R1.C, R2.E, R2.H, R3.F From R1, R2, R3 Where R1.A = R2.D AND R2.G = R3.K R1 (A, B, C) R2 (D, E, G, H) Scan Scan Join 1 A = D Join 2 G = K Scan R3 (F, K, L) B, C, E, H, F A, B, C B, C, E, G, H D, E, G, H F, K R1 (A, B, C) R2 (D, E, G, H) Scan Scan Join A = D A D Scan R3 (F, K, L) Project G Join G = K Project B, C, E, H, F KG E, H F B, C Early materialization Late materialization
  • 72. VLDB 2009 Tutorial Column-Oriented Database Systems 72 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Early Materialization Example 12 1 11 1 6 1 2 1 2 3 3 3 7 13 42 80 Construct 2 3 3 3 7 13 42 80 prodID storeID quantity custID price 1 2 3 Construct 1 2 3 custID lastName Green White Brown Green White Brown QUERY: SELECT C.lastName,SUM(F.price) FROM facts AS F, customers AS C WHERE F.custID = C.custID GROUP BY C.lastName Facts Customers (4,1,4)
  • 73. VLDB 2009 Tutorial Column-Oriented Database Systems 73 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Early Materialization Example 2 3 3 3 7 13 42 80 1 2 3 Green White Brown QUERY: SELECT C.lastName,SUM(F.price) FROM facts AS F, customers AS C WHERE F.custID = C.custID GROUP BY C.lastName Join 7 13 42 80 White Brown Brown Brown
  • 74. VLDB 2009 Tutorial Column-Oriented Database Systems 74 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Late Materialization Example 12 1 11 1 6 1 2 1 2 3 1 3 7 13 42 80 prodID storeID quantity custID price 1 2 3 custID lastName Green White Brown QUERY: SELECT C.lastName,SUM(F.price) FROM facts AS F, customers AS C WHERE F.custID = C.custID GROUP BY C.lastName Facts Customers Join 2 3 1 3 1 2 3 4 (4,1,4) Late materialized join causes out of order probing of projected columns from the inner relation
  • 75. VLDB 2009 Tutorial Column-Oriented Database Systems 75 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Late Materialized Join Performance l Naïve LM join about 2X slower than EM join on typical queries (due to random I/O) l This number is very dependent on l Amount of memory available l Number of projected attributes l Join cardinality l But we can do better l Invisible Join l Jive/Flash Join l Radix cluster/decluster join
  • 76. VLDB 2009 Tutorial Column-Oriented Database Systems 76 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Invisible Join l Designed for typical joins when data is modeled using a star schema l One (“fact”) table is joined with multiple dimension tables l Typical query: select c_nation, s_nation, d_year, sum(lo_revenue) as revenue from customer, lineorder, supplier, date where lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_orderdate = d_datekey and c_region = 'ASIA‘ and s_region = 'ASIA‘ and d_year >= 1992 and d_year <= 1997 group by c_nation, s_nation, d_year order by d_year asc, revenue desc; “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
  • 77. VLDB 2009 Tutorial Column-Oriented Database Systems 77 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) custkey region nation … 1 ASIA CHINA … 2 ASIA INDIA … 3 ASIA INDIA … Apply “region = ‘Asia’” On Customer Table Hash Table (or bit-map) Containing Keys 1, 2 and 3 suppkey region nation … 1 ASIA RUSSIA … 2 EUROPE SPAIN … Apply “region = ‘Asia’” On Supplier Table Hash Table (or bit-map) Containing Keys 1, 3 dateid year … 01011997 1997 … 01021997 1997 … 01031997 1997 … Apply “year in [1992,1997]” On Date Table Hash Table Containing Keys 01011997, 01021997, and 01031997 Invisible Join “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. 4 EUROPE FRANCE … 3 ASIA JAPAN …
  • 78. VLDB 2009 Tutorial Column-Oriented Database Systems 78 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) “Column-Stores vs Row-Stores: How Different are They Really?” Abadi et. al. SIGMOD 2008. custkey suppkey orderdate revenueorderkey 3 1 01011997 432561 3 2 01011997 333332 4 3 01021997 121213 1 1 01021997 232334 4 2 01021997 454565 1 2 01031997 432516 3 2 01031997 342357 Original Fact Table custkey 3 3 4 1 4 1 3 Hash Table Containing Keys 1, 2 and 3 + 1 1 0 1 0 1 1 Hash Table Containing Keys 1 and 3 1 0 1 1 0 0 0 +suppkey 1 2 3 1 2 2 2 Hash Table Containing Keys 01011997, 01021997, and 01031997 1 1 1 1 1 1 1 +orderdate 01011997 01011997 01021997 01021997 01021997 01031997 01031997 1 1 0 1 0 1 1 1 0 1 1 0 0 0 1 1 1 1 1 1 1 & & = 1 0 0 1 0 0 0
  • 79. VLDB 2009 Tutorial Column-Oriented Database Systems 79 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) custkey 3 3 4 1 4 1 3 + suppkey 1 2 3 1 2 2 2 orderdate 01011997 01011997 01021997 01021997 01021997 01031997 01031997 1 0 0 1 0 0 0 3 1 1 0 0 1 0 0 0 1 1 1 0 0 1 0 0 0 01011997 01021997 + + + + JOIN CHINA INDIA INDIA = CHINA INDIA RUSSIA SPAIN = RUSSIA RUSSIA 1997 1997 1997 = 01011997 01021997 01031997 1997 1997 FRANCE JAPAN
  • 80. VLDB 2009 Tutorial Column-Oriented Database Systems 80 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) custkey region nation … 1 ASIA CHINA … 2 ASIA INDIA … 3 ASIA INDIA … Apply “region = ‘Asia’” On Customer Table Hash Table (or bit-map) Containing Keys 1, 2 and 3 suppkey region nation … 1 ASIA RUSSIA … 2 EUROPE SPAIN … Apply “region = ‘Asia’” On Supplier Table Hash Table (or bit-map) Containing Keys 1, 3 dateid year … 01011997 1997 … 01021997 1997 … 01031997 1997 … Apply “year in [1992,1997]” On Date Table Hash Table Containing Keys 01011997, 01021997, and 01031997 Invisible Join “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. 4 EUROPE FRANCE … 3 ASIA JAPAN … Range [1-3] (between-predicate rewriting)
  • 81. VLDB 2009 Tutorial Column-Oriented Database Systems 81 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Invisible Join l Bottom Line l Many data warehouses model data using star/snowflake schemes l Joins of one (fact) table with many dimension tables is common l Invisible join takes advantage of this by making sure that the table that can be accessed in position order is the fact table for each join l Position lists from the fact table are then intersected (in position order) l This reduces the amount of data that must be accessed out of order from the dimension tables l “Between-predicate rewriting” trick not relevant for this discussion
  • 82. VLDB 2009 Tutorial Column-Oriented Database Systems 82 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) custkey 3 3 4 1 4 1 3 + suppkey 1 2 3 1 2 2 2 orderdate 01011997 01011997 01021997 01021997 01021997 01031997 01031997 1 0 0 1 0 0 0 3 1 1 0 0 1 0 0 0 1 1 1 0 0 1 0 0 0 01011997 01021997 + + + + JOIN CHINA INDIA INDIA = CHINA INDIA RUSSIA SPAIN = RUSSIA RUSSIA 1997 1997 1997 = 01011997 01021997 01031997 1997 1997 FRANCE JAPAN Still accessing table out of order
  • 83. VLDB 2009 Tutorial Column-Oriented Database Systems 83 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Jive/Flash Join 3 1 + CHINA INDIA INDIA = CHINA INDIA FRANCE Still accessing table out of order “Fast Joins using Join Indices”. Li and Ross, VLDBJ 8:1-24, 1999. “Query Processing Techniques for Solid State Drives”. Tsirogiannis, Harizopoulos et. al. SIGMOD 2009.
  • 84. VLDB 2009 Tutorial Column-Oriented Database Systems 84 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Jive/Flash Join 1. Add column with dense ascending integers from 1 2. Sort new position list by second column 3. Probe projected column in order using new sorted position list, keeping first column from position list around 4. Sort new result by first column 3 1 + CHINA INDIA INDIA = CHINA INDIA FRANCE 1 2 1 3 2 1 + CHINA INDIA INDIA = INDIA CHINA FRANCE 2 1 CHINA INDIA1 2
  • 85. VLDB 2009 Tutorial Column-Oriented Database Systems 85 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Jive/Flash Join l Bottom Line l Instead of probing projected columns from inner table out of order: l Sort join index l Probe projected columns in order l Sort result using an added column l LM vs EM tradeoffs: l LM has the extra sorts (EM accesses all columns in order) l LM only has to fit join columns into memory (EM needs join columns and all projected columns) § Results in big memory and CPU savings (see part 3 for why there is CPU savings) l LM only has to materialize relevant columns l In many cases LM advantages outweigh disadvantages l LM would be a clear winner if not for those pesky sorts … can we do better?
  • 86. VLDB 2009 Tutorial Column-Oriented Database Systems 86 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Radix Cluster/Decluster l The full sort from the Jive join is actually overkill l We just want to access the storage blocks in order (we don’t mind random access within a block) l So do a radix sort and stop early l By stopping early, data within each block is accessed out of order, but in the order specified in the original join index l Use this pseudo-order to accelerate the post-probe sort as well “Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04 •“Database Architecture Optimized for the New Bottleneck: Memory Access” VLDB’99 •“Generic Database Cost Models for Hierarchical Memory Systems”, VLDB’02 (all Manegold, Boncz, Kersten)
  • 87. VLDB 2009 Tutorial Column-Oriented Database Systems 87 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Radix Cluster/Decluster l Bottom line l Both sorts from the Jive join can be significantly reduced in overhead l Only been tested when there is sufficient memory for the entire join index to be stored three times l Technique is likely applicable to larger join indexes, but utility will go down a little l Only works if random access within a storage block l Don’t want to use radix cluster/decluster if you have variable- width column values or compression schemes that can only be decompressed starting from the beginning of the block
  • 88. VLDB 2009 Tutorial Column-Oriented Database Systems 88 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) LM vs EM joins l Invisible, Jive, Flash, Cluster, Decluster techniques contain a bag of tricks to improve LM joins l Research papers show that LM joins become 2X faster than EM joins (instead of 2X slower) for a wide array of query types
  • 89. VLDB 2009 Tutorial Column-Oriented Database Systems 89 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Tuple Construction Heuristics l For queries with selective predicates, aggregations, or compressed data, use late materialization l For joins: l Research papers: l Always use late materialization l Commercial systems: l Inner table to a join often materialized before join (reduces system complexity): l Some systems will use LM only if columns from inner table can fit entirely in memory
  • 90. VLDB 2009 Tutorial Column-Oriented Database Systems 90 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Outline l Computational Efficiency of DB on modern hardware l how column-stores can help here l Keynote revisited: MonetDB & VectorWise in more depth l CPU efficient column compression l vectorized decompression l Conclusions l future work
  • 91. Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems 40 years of hardware evolution vs. DBMS computational efficiency VLDB 2009 Tutorial
  • 92. VLDB 2009 Tutorial Column-Oriented Database Systems 92 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) CPU Architecture Elements: l Storage l CPU caches L1/L2/L3 l Registers l Execution Unit(s) l Pipelined l SIMD
  • 93. VLDB 2009 Tutorial Column-Oriented Database Systems 93 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) CPU Metrics
  • 94. VLDB 2009 Tutorial Column-Oriented Database Systems 94 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) CPU Metrics
  • 95. VLDB 2009 Tutorial Column-Oriented Database Systems 95 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) DRAM Metrics
  • 96. VLDB 2009 Tutorial Column-Oriented Database Systems 96 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Super-Scalar Execution (pipelining)
  • 97. VLDB 2009 Tutorial Column-Oriented Database Systems 97 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Hazards l Data hazards l Dependencies between instructions l L1 data cache misses Result: bubbles in the pipeline l Control Hazards l Branch mispredictions l Computed branches (late binding) l L1 instruction cache misses Out-of-order execution addresses data hazards l control hazards typically more expensive
  • 98. VLDB 2009 Tutorial Column-Oriented Database Systems 98 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SIMD l Single Instruction Multiple Data l Same operation applied on a vector of values l MMX: 64 bits, SSE: 128bits, AVX: 256bits l SSE, e.g. multiply 8 short integers
  • 99. VLDB 2009 Tutorial Column-Oriented Database Systems 99 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT alice 22101 next() next() next() ivan 37102 ivan 37102 ivan 37102 ivan 350102 alice 22101 SELECT id, name (age-30)*50 AS bonus FROM employee WHERE age > 30 350 FALSETRUE 22 > 30 ?37 > 30 ? 37 – 307 * 50 7 A Look at the Query Pipeline
  • 100. VLDB 2009 Tutorial Column-Oriented Database Systems 100 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT next() next() next() ivan 350102 22 > 30 ?37 > 30 ? 37 – 307 * 50 A Look at the Query Pipeline Operators Iterator interface -open() -next(): tuple -close()
  • 101. VLDB 2009 Tutorial Column-Oriented Database Systems 101 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT alice 22101 next() next() next() ivan 37102 ivan 37102 ivan 37102 ivan 350102 alice 22101 350 FALSETRUE 22 > 30 ?37 > 30 ? 37 – 307 * 50 7 A Look at the Query Pipeline Primitives Provide computational functionality All arithmetic allowed in expressions, e.g. Multiplication mult(int,int) Ł int 7 * 50
  • 102. VLDB 2009 Tutorial Column-Oriented Database Systems 102 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Operators Iterator interface -open() -next(): tuple -close()SCAN SELECT PROJECT Database Architecture causes Hazards l Data hazards l Dependencies between instructions l L1 data cache misses l Control Hazards l Branch mispredictions l Computed branches (late binding) l L1 instruction cache misses Complex NSM record navigation Large Tree/Hash Structures Code footprint of all operators in query plan exceeds L1 cache Data-dependent conditions Next() late binding method calls Tree, List, Hash traversal SIMD Out-of-order Execution work on one tuple at a time “DBMSs On A Modern Processor: Where Does Time Go? ” Ailamaki, DeWitt, Hill, Wood, VLDB’99
  • 103. VLDB 2009 Tutorial Column-Oriented Database Systems 103 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) DBMS Computational Efficiency TPC-H 1GB, query 1 l selects 98% of fact table, computes net prices and aggregates all l Results: l C program: ? l MySQL: 26.2s l DBMS “X”: 28.1s “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 104. VLDB 2009 Tutorial Column-Oriented Database Systems 104 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) DBMS Computational Efficiency TPC-H 1GB, query 1 l selects 98% of fact table, computes net prices and aggregates all l Results: l C program: 0.2s l MySQL: 26.2s l DBMS “X”: 28.1s “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 105. Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems MonetDB VLDB 2009 Tutorial
  • 106. VLDB 2009 Tutorial Column-Oriented Database Systems 106 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) l “save disk I/O when scan-intensive queries need a few columns” l “avoid an expression interpreter to improve computational efficiency” a column-store
  • 107. VLDB 2009 Tutorial Column-Oriented Database Systems 107 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SELECT id, name, (age-30)*50 as bonus FROM people WHERE age > 30 RISC Database Algebra
  • 108. VLDB 2009 Tutorial Column-Oriented Database Systems 108 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SELECT id, name, (age-30)*50 as bonus FROM people WHERE age > 30 RISC Database Algebra batcalc_minus_int(int* res, int* col, int val, int n) { for(i=0; i<n; i++) res[i] = col[i] - val; } CPU happy? Give it “nice” code ! - few dependencies (control,data) - CPU gets out-of-order execution - compiler can e.g. generate SIMD One loop for an entire column - no per-tuple interpretation - arrays: no record navigation - better instruction cache locality Simple, hard- coded semantics in operators
  • 109. VLDB 2009 Tutorial Column-Oriented Database Systems 109 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SELECT id, name, (age-30)*50 as bonus FROM people WHERE age > 30 RISC Database Algebra batcalc_minus_int(int* res, int* col, int val, int n) { for(i=0; i<n; i++) res[i] = col[i] - val; } CPU happy? Give it “nice” code ! - few dependencies (control,data) - CPU gets out-of-order execution - compiler can e.g. generate SIMD One loop for an entire column - no per-tuple interpretation - arrays: no record navigation - better instruction cache locality MATERIALIZED intermediate results
  • 110. VLDB 2009 Tutorial Column-Oriented Database Systems 110 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) l “save disk I/O when scan-intensive queries need a few columns” l “avoid an expression interpreter to improve computational efficiency” How? l RISC query algebra: hard-coded semantics l Decompose complex expressions in multiple operations l Operators only handle simple arrays l No code that handles slotted buffered record layout l Relational algebra becomes array manipulation language l Often SIMD for free l Plus: use of cache-conscious algorithms for Sort/Aggr/Join a column-store
  • 111. VLDB 2009 Tutorial Column-Oriented Database Systems 111 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) l You want efficiency l Simple hard-coded operators l I take scalability l Result materialization n C program: 0.2s n MonetDB: 3.7s n MySQL: 26.2s n DBMS “X”: 28.1s a Faustian pact
  • 112. Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems as a research platform VLDB 2009 Tutorial
  • 113. VLDB 2009 Tutorial Column-Oriented Database Systems 113 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SIGMOD 1985 MonetDB supports SQL, XML, ODMG, ..RDF RDF support on C- STORE / SW-Store MonetDB BAT Algebra •“MIL Primitives for Querying a Fragmented World”, Boncz, Kersten, VLDBJ’98 •“Flattening an Object Algebra to Provide Performance” Boncz, Wilschut, Kersten, ICDE’98 •“MonetDB/XQuery: a fast XQuery processor powered by a relational engine” Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD’06 •“SW-Store: a vertically partitioned DBMS for Semantic Web data management“ Abadi, Marcus, Madden, Hollenbach, VLDBJ’09
  • 114. VLDB 2009 Tutorial Column-Oriented Database Systems 114 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) XQuery MonetDB 4 MonetDB 5 MonetDB kernel SQL 03 Optimizers RDF SOAP Arrays The Software Stack Extensible query lang frontend framework .. SGL?Extensible Dynamic Runtime QOPT Framework! OGIS compile Extensible Architecture-Conscious Execution platform
  • 115. VLDB 2009 Tutorial Column-Oriented Database Systems 115 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) l Cache-Conscious Joins l Cost Models, Radix-cluster, Radix-decluster l MonetDB/XQuery: l structural joins exploiting positional column access l Cracking: l on-the-fly automatic indexing without workload knowledge l Recycling: l using materialized intermediates l Run-time Query Optimization: l correlation-aware run-time optimization without cost model as a research platform “MonetDB/XQuery: a fast XQuery processor powered by a relational engine” Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD’06 •“Database Architecture Optimized for the New Bottleneck: Memory Access” VLDB’99 •“Generic Database Cost Models for Hierarchical Memory Systems”, VLDB’02 (all Manegold, Boncz, Kersten) •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04 “Database Cracking”, CIDR’07 “Updating a cracked database “, SIGMOD’07 “Self-organizing tuple reconstruction in column- stores“, SIGMOD’09 (all Idreos, Manegold, Kersten) “An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09 “ROX: run-time optimization of XQueries”, Abdelkader, Boncz, Manegold, vanKeulen, SIGMOD’09 optimizer frontend backend
  • 116. Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems “MonetDB/X100” vectorized query processing VLDB 2009 Tutorial
  • 117. VLDB 2009 Tutorial Column-Oriented Database Systems 117 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Materialization vs Pipelining MonetDB spin-off: MonetDB/X100
  • 118. VLDB 2009 Tutorial Column-Oriented Database Systems 118 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT next() next() 101 102 104 105 alice ivan peggy victor 22 37 45 25 next() “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 119. VLDB 2009 Tutorial Column-Oriented Database Systems 119 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT next() next() 101 102 104 105 alice ivan peggy victor 22 37 45 25 FALSE TRUE TRUE FALSE > 30 ? 22 37 45 25 alice ivan peggy victor 101 102 104 105 next() “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 120. VLDB 2009 Tutorial Column-Oriented Database Systems 120 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT next() next() 101 102 104 105 alice ivan peggy victor 22 37 45 25 7 15 FALSE TRUE TRUE FALSE 37 45 ivan peggy 102 104 350 750 ivan peggy 102 104 350 750 Observations: next() called much less often Ł more time spent in primitives less in overhead primitive calls process an array of values in a loop: > 30 ? - 30 * 50 22 37 45 25 alice ivan peggy victor 101 102 104 105 “Vectorized In Cache Processing” vector = array of ~100 processed in a tight loop CPU cache Resident next() “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 121. VLDB 2009 Tutorial Column-Oriented Database Systems 121 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT next() next() 101 102 104 105 alice ivan peggy victor 22 37 45 25 7 15 FALSE TRUE TRUE FALSE 37 45 ivan peggy 102 104 350 750 ivan peggy 102 104 350 750 Observations: next() called much less often Ł more time spent in primitives less in overhead primitive calls process an array of values in a loop: > 30 ? - 30 * 50 CPU Efficiency depends on “nice” code - out-of-order execution - few dependencies (control,data) - compiler support Compilers like simple loops over arrays - loop-pipelining - automatic SIMD 22 37 45 25 alice ivan peggy victor 101 102 104 105 next() “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 122. VLDB 2009 Tutorial Column-Oriented Database Systems 122 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT FALSE TRUE TRUE FALSE 350 750 Observations: next() called much less often Ł more time spent in primitives less in overhead primitive calls process an array of values in a loop: > 30 ? * 50 CPU Efficiency depends on “nice” code - out-of-order execution - few dependencies (control,data) - compiler support Compilers like simple loops over arrays - loop-pipelining - automatic SIMD FALSE TRUE TRUE FALSE > 30 ? 7 15 - 30 350 750 * 50 for(i=0; i<n; i++) res[i] = (col[i] > x) for(i=0; i<n; i++) res[i] = (col[i] - x) for(i=0; i<n; i++) res[i] = (col[i] * x) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 123. VLDB 2009 Tutorial Column-Oriented Database Systems 123 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT next() 101 102 104 105 alice ivan peggy victor 22 37 45 25 1 2 Tricks being played: - Late materialization - Materialization avoidance using selection vectors > 30 ? next() “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 124. VLDB 2009 Tutorial Column-Oriented Database Systems 124 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT next() next() 101 102 104 105 alice ivan peggy victor 22 37 45 25 7 15 1 2 350 750 Tricks being played: - Late materialization - Materialization avoidance using selection vectors > 30 ? - 30 * 50 next() map_mul_flt_val_flt_col( float *res, int* sel, float val, float *col, int n) { for(int i=0; i<n; i++) res[i] = val * col[sel[i]]; } selection vectors used to reduce vector copying contain selected positions “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 125. VLDB 2009 Tutorial Column-Oriented Database Systems 125 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) SCAN SELECT PROJECT next() next() 101 102 104 105 alice ivan peggy victor 22 37 45 25 7 15 1 2 350 750 Tricks being played: - Late materialization - Materialization avoidance using selection vectors > 30 ? - 30 * 50 next() map_mul_flt_val_flt_col( float *res, int* sel, float val, float *col, int n) { for(int i=0; i<n; i++) res[i] = val * col[sel[i]]; } selection vectors used to reduce vector copying contain selected positions “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 126. VLDB 2009 Tutorial Column-Oriented Database Systems 126 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) MonetDB/X100 l Both efficiency l Vectorized primitives l and scalability.. l Pipelined query evaluation n C program: 0.2s n MonetDB/X100: 0.6s n MonetDB: 3.7s n MySQL: 26.2s n DBMS “X”: 28.1s “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 127. VLDB 2009 Tutorial Column-Oriented Database Systems 127 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Memory Hierarchy ColumnBM (buffer manager) X100 query engine CPU cache (raid) Disk(s) RAM “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 128. VLDB 2009 Tutorial Column-Oriented Database Systems 128 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Memory Hierarchy Vectors are only the in-cache representation RAM & disk representation might actually be different (vectorwise uses both PAX & DSM) ColumnBM (buffer manager) X100 query engine CPU cache (raid) Disk(s) RAM “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 129. VLDB 2009 Tutorial Column-Oriented Database Systems 129 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Optimal Vector size? All vectors together should fit the CPU cache Optimizer should tune this, given the query characteristics. ColumnBM (buffer manager) X100 query engine CPU cache (raid) Disk(s) RAM “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 130. VLDB 2009 Tutorial Column-Oriented Database Systems 130 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Varying the Vector size Less and less iterator.next() and primitive function calls (“interpretation overhead”) “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 131. VLDB 2009 Tutorial Column-Oriented Database Systems 131 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Varying the Vector size Vectors start to exceed the CPU cache, causing additional memory traffic “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 132. VLDB 2009 Tutorial Column-Oriented Database Systems 132 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Varying the Vector size The benefit of selection vectors “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 133. VLDB 2009 Tutorial Column-Oriented Database Systems 133 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) MonetDB/MIL materializes columns ColumnBM (buffer manager) CPU cache (raid) Disk(s) RAM “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 134. VLDB 2009 Tutorial Column-Oriented Database Systems 134 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Benefits of Vectorized Processing l 100x less Function Calls l iterator.next(), primitives l No Instruction Cache Misses l High locality in the primitives l Less Data Cache Misses l Cache-conscious data placement l No Tuple Navigation l Primitives are record-oblivious, only see arrays l Vectorization allows algorithmic optimization l Move activities out of the loop (“strength reduction”) l Compiler-friendly function bodies l Loop-pipelining, automatic SIMD “Buffering Database Operations for Enhanced Instruction Cache Performance” Zhou, Ross, SIGMOD’04 “Block oriented processing of relational database operations in modern computer architectures” Padmanabhan, Malkemus, Agarwal, ICDE’01 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05
  • 135. VLDB 2009 Tutorial Column-Oriented Database Systems 135 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Vectorizing Relational Operators l Project l Select l Exploit selectivities, test buffer overflow l Aggregation l Ordered, Hashed l Sort l Radix-sort nicely vectorizes l Join l Merge-join + Hash-join “Balancing Vectorized Query Execution with Bandwidth Optimized Storage ” Zukowski, CWI 2008
  • 136. Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems Efficient Column Store Compression VLDB 2009 Tutorial
  • 137. VLDB 2009 Tutorial Column-Oriented Database Systems 137 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Key Ingredients l Compress relations on a per-column basis l Columns compress well l Decompress small vectors of tuples from a column into the CPU cache l Minimize main-memory overhead l Use light-weight, CPU-efficient algorithms l Exploit processing power of modern CPUs “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 138. VLDB 2009 Tutorial Column-Oriented Database Systems 138 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Key Ingredients l Compress relations on a per-column basis l Columns compress well l Decompress small vectors of tuples from a column into the CPU cache l Minimize main-memory overhead “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 139. VLDB 2009 Tutorial Column-Oriented Database Systems 139 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Vectorized Decompression ColumnBM (buffer manager) X100 query engine CPU cache (raid) Disk(s) RAM Idea: decompress a vector only compression: -between CPU and RAM -Instead of disk and RAM (classic)
  • 140. VLDB 2009 Tutorial Column-Oriented Database Systems 140 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) RAM-Cache Decompression l Decompress vectors on-demand into the cache l RAM-Cache boundary only crossed once l More (compressed) data cached in RAM l Less bandwidth use “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 141. VLDB 2009 Tutorial Column-Oriented Database Systems 141 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Multi-Core Bandwidth & Compression Performance Degradation with Concurrent Queries 0 0.2 0.4 0.6 0.8 1 1.2 1 2 3 4 5 6 7 8 number of concurrent queries avgqueryspeed(clocknormalized) Intel® Xeon E5410 (Harpertown) - Q6b Normal Intel® Xeon E5410 (Harpertown) - Q6b Compressed Intel® Xeon X5560 (Nehalem) - Q6b Normal Intel® Xeon X5560 (Nehalem) - Q6b Compressed
  • 142. VLDB 2009 Tutorial Column-Oriented Database Systems 142 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) CPU Efficient Decompression l Decoding loop over cache-resident vectors of code words l Avoid control dependencies within decoding loop l no if-then-else constructs in loop body l Avoid data dependencies between loop iterations “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 143. VLDB 2009 Tutorial Column-Oriented Database Systems 143 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Disk Block Layout l Forward growing section of arbitrary size code words (code word size fixed per block) “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 144. VLDB 2009 Tutorial Column-Oriented Database Systems 144 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Disk Block Layout l Forward growing section of arbitrary size code words (code word size fixed per block) l Backwards growing exception list “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 145. VLDB 2009 Tutorial Column-Oriented Database Systems 145 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Naïve Decompression Algorithm l Use reserved value from code word domain (MAXCODE) to mark exception positions “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 146. VLDB 2009 Tutorial Column-Oriented Database Systems 146 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Deterioriation With Exception% l 1.2GB/s deteriorates to 0.4GB/s “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 147. VLDB 2009 Tutorial Column-Oriented Database Systems 147 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Deterioriation With Exception% l Perf Counters: CPU mispredicts if-then-else “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 148. VLDB 2009 Tutorial Column-Oriented Database Systems 148 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Patching l Maintain a patch-list through code word section that links exception positions “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 149. VLDB 2009 Tutorial Column-Oriented Database Systems 149 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Patching l Maintain a patch-list through code word section that links exception positions l After decoding, patch up the exception positions with the correct values “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 150. VLDB 2009 Tutorial Column-Oriented Database Systems 150 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Patched Decompression “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 151. VLDB 2009 Tutorial Column-Oriented Database Systems 151 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Patched Decompression “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 152. VLDB 2009 Tutorial Column-Oriented Database Systems 152 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Decompression Bandwidth l Patching makes two passes, but is faster! Patching can be applied to: • Frame of Reference (PFOR) • Delta Compression (PFOR-DELTA) • Dictionary Compression (PDICT) Makes these methods much more applicable to noisy data Ł without additional cost “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
  • 153. Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Column-Oriented Database Systems Conclusion VLDB 2009 Tutorial
  • 154. VLDB 2009 Tutorial Column-Oriented Database Systems 154 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Summary (1/2) l Columns and Row-Stores: different? l No fundamental differences l Can current row-stores simulate column-stores now? l not efficiently: row-stores need change l On disk layout vs execution layout l actually independent issues, on-the-fly conversion pays off l column favors sequential access, row random l Mixed Layout schemes l Fractured mirrors l PAX, Clotho l Data morphing
  • 155. VLDB 2009 Tutorial Column-Oriented Database Systems 155 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Summary (2/2) l Crucial Columnar Techniques l Storage l Lean headers, sparse indices, fast positional access l Compression l Operating on compressed data l Lightweight, vectorized decompression l Late vs Early materialization l Non-join: LM always wins l Naïve/Invisible/Jive/Flash/Radix Join (LM often wins) l Execution l Vectorized in-cache execution l Exploiting SIMD
  • 156. VLDB 2009 Tutorial Column-Oriented Database Systems 156 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Future Work l looking at write/load tradeoffs in column-stores l read-only vs batch loads vs trickle updates vs OLTP
  • 157. VLDB 2009 Tutorial Column-Oriented Database Systems 157 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Updates (1/3) l Column-stores are update-in-place averse l In-place: I/O for each column l + re-compression l + multiple sorted replicas l + sparse tree indices Update-in-place is infeasible!
  • 158. VLDB 2009 Tutorial Column-Oriented Database Systems 158 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Updates (2/3) l Column-stores use differential mechanisms instead l Differential lists/files or more advanced (e.g. PDTs) l Updates buffered in RAM, merged on each query l Checkpointing merges differences in bulk sequentially l I/O trends favor this anyway § trade RAM for converting random into sequential I/O § this trade is also needed in Flash (do not write randomly!) l How high loads can it sustain? § Depends on available RAM for buffering (how long until full) § Checkpoint must be done within that time § The longer it can run, the less it molests queries § Using Flash for buffering differences buys a lot of time § Hundreds of GBs of differences per server
  • 159. VLDB 2009 Tutorial Column-Oriented Database Systems 159 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Updates (3/3) l Differential transactions favored by hardware trends l Snapshot semantics accepted by the user community l can always convert to serialized Ł Row stores could also use differential transactions and be efficient! Ł Implies a departure from ARIES Ł Implies a full rewrite My conclusion: a system that combines row- and columns needs differentially implemented transactions. Starting from a pure column-store, this is a limited add-on. Starting from a pure row-store, this implies a full rewrite. “Serializable Isolation For Snapshot Databases” Alomari, Cahill, Fekete, Roehm, SIGMOD’08
  • 160. VLDB 2009 Tutorial Column-Oriented Database Systems 160 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Future Work l looking at write/load tradeoffs in column-stores l read-only vs batch loads vs trickle updates vs OLTP l database design for column-stores l column-store specific optimizers l compression/materialization/join tricks Ł cost models? l hybrid column-row systems l can row-stores learn new column tricks? l Study of the minimal number changes one needs to make to a row store to get the majority of the benefits of a column-store l Alternative: add features to column-stores that make them more like row stores
  • 161. VLDB 2009 Tutorial Column-Oriented Database Systems 161 Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009) Conclusion l Columnar techniques provide clear benefits for: l Data warehousing, BI l Information retrieval, graphs, e-science l A number of crucial techniques make them effective l Without these, existing row systems do not benefit l Row-Stores and column-stores could be combined l Row-stores may adopt some column-store techniques l Column-stores add row-store (or PAX) functionality l Many open issues to do research on!