Daniel Abadi -- Yale University
Column-Stores vs. Row-Stores:
How Different Are They
Really?
Daniel Abadi (Yale),
Samuel M...
Row vs. Column-Stores
Last
Name
First
Name
E-
m
ai
l
Phone
#
Street
Address
Last
Name
First
Name E-mail Phone #
Street
Add...
Column-Stores
• Really good for read-mostly data
warehouses
 Lot’s of column scans and aggregations
 Writes tend to be i...
Data Warehouse DBMS
Software
• $4.5 billion industry (out of total $16 billion
DBMS software industry)
• Growing 10% annua...
Momentum
• Right solution for growing market  $$$$
• Vertica, ParAccel, Kickfire, Calpont,
Infobright, and Exasol new ent...
Paper Looks At Key Question
• How much of the buzz around column-
stores just marketing hype?
 Do you really need to buy ...
Paper Methodology
• Comparing row-store vs. column-store is
dangerous/borderline meaningless
• Instead, compare row-store ...
Simulate Column-Store
Inside Row-Store
Last
Name
First
Name
E-
m
ai
l
Phone
#
Street
Address
Last
Name
First
Name E-mail
1...
Star Schema Benchmark
• Fact table contains 17 columns and
60,000,000 rows
• 4 dimension tables, biggest one has
80,000 ro...
SSBM Averages
0.0
50.0
100.0
150.0
200.0
250.0
Time(seconds)
Average 25.7 79.9 221.2
Normal Row-Store
Vertically Partition...
What’s Going On?
• Vertically Partitioned Case
 Tuple Sizes
 Horizontal Partitioning
• All Indexes Case
 Tuple Reconstr...
Star Schema Benchmark
• Fact table contains 17 columns and
60,000,000 rows
• 4 dimension tables, biggest one has
80,000 ro...
Tuple Size
1
2
3
Column
Data
TID
1
2
3
TID Column
Data
1
2
3
TID Column
Data
Tuple
Header
•Queries touch 3-4 foreign keys ...
Horizontal Partitioning
• Fact table horizontally partitioned on year
 Year is an element of the ‘Date’ dimension
table
...
What’s Going On?
• Vertically Partitioned Case
 Tuple Sizes
 Horizontal Partitioning
• All Indexes Case
 Tuple Construc...
Tuple Construction
• Pretty much all queries require a column
to be extracted (in the SELECT clause)
that has not yet been...
Tuple Construction
• Result of lower part of query plan is a set
of TIDs that passed all predicates
• Need to extract SELE...
So….
• All indexes approach is pretty obviously a
poor way to simulate a column-store
• Problems with vertical partitionin...
Row-Store vs. Column-Store
0.0
5.0
10.0
15.0
20.0
25.0
30.0
Time(seconds)
Average 25.7 11.7 4.4
Row-Store Row-Store (M V) ...
Column-Store Experiments
• Start with column-store (C-Store)
• Remove column-store-specific
performance optimizations
• En...
Compression
• Higher data value locality
in column-stores
 Better ratio  reduced I/O
• Can use schemes like
run-length e...
• Early Materialization:
create rows first. But:
 Poor memory bandwidth
utilization
 Lose opportunity for
vectorized ope...
Other Column-Store
Optimizations
• Invisible join
 Column-store specific join
 Optimizations for star schemas
 Similar ...
Simplified Version of Results
0.0
10.0
20.0
30.0
40.0
50.0
Time(seconds)
Average 4.4 14.9 40.7
Original C-Store
C-Store,No...
Conclusion
• Might be possible to simulate a row-store
in a column-store, BUT:
 Need better support for vertical partitio...
Come Join the Yale DB
Group!
Upcoming SlideShare
Loading in...5
×

Column-Stores vs. Row-Stores: How Different are they Really?

6,028

Published on

SIGMOD 2008 presentation

Published in: Technology, Business
0 Comments
10 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
6,028
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
154
Comments
0
Likes
10
Embeds 0
No embeds

No notes for slide

Column-Stores vs. Row-Stores: How Different are they Really?

  1. 1. Daniel Abadi -- Yale University Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12th , 2008
  2. 2. Row vs. Column-Stores Last Name First Name E- m ai l Phone # Street Address Last Name First Name E-mail Phone # Street Address Row-Store Column-Store − Might read in unnecessary data + Only need to read in relevant data + Easy to add a new record − Tuple writes might require multiple seeks
  3. 3. Column-Stores • Really good for read-mostly data warehouses  Lot’s of column scans and aggregations  Writes tend to be in batch  [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this  Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores  Factor of 5 faster on performance  Factor of 2 superior on price/performance
  4. 4. Data Warehouse DBMS Software • $4.5 billion industry (out of total $16 billion DBMS software industry) • Growing 10% annually
  5. 5. Momentum • Right solution for growing market  $$$$ • Vertica, ParAccel, Kickfire, Calpont, Infobright, and Exasol new entrants • Sybase IQ’s profits rapidly increasing • Yahoo’s world largest (multi-petabyte) data warehouse is a column-store (from Mahat Technologies acquisition)
  6. 6. Paper Looks At Key Question • How much of the buzz around column- stores just marketing hype?  Do you really need to buy Sybase IQ/Vertica/ParAccel?  How far will your current row-store take you?  Can you get column-store performance from a row- store?  Can you simulate a column-store in a row-store
  7. 7. Paper Methodology • Comparing row-store vs. column-store is dangerous/borderline meaningless • Instead, compare row-store vs. row-store and column-store vs. column-store  Simulate a column-store inside of a row-store  Remove column-oriented features from column-store until it behaves like a row-store
  8. 8. Simulate Column-Store Inside Row-Store Last Name First Name E- m ai l Phone # Street Address Last Name First Name E-mail 1 2 3 1 2 3 1 2 3 Option A: Vertical Partitioning … Option B: Index Every Column Last Name Index First Name Index
  9. 9. Star Schema Benchmark • Fact table contains 17 columns and 60,000,000 rows • 4 dimension tables, biggest one has 80,000 rows • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
  10. 10. SSBM Averages 0.0 50.0 100.0 150.0 200.0 250.0 Time(seconds) Average 25.7 79.9 221.2 Normal Row-Store Vertically Partitioned Row-Store Row-Store With All Indexes
  11. 11. What’s Going On? • Vertically Partitioned Case  Tuple Sizes  Horizontal Partitioning • All Indexes Case  Tuple Reconstruction
  12. 12. Star Schema Benchmark • Fact table contains 17 columns and 60,000,000 rows • 4 dimension tables, biggest one has 80,000 rows • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
  13. 13. Tuple Size 1 2 3 Column Data TID 1 2 3 TID Column Data 1 2 3 TID Column Data Tuple Header •Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns •Complete fact table takes up ~4 GB (compressed) •Vertically partitioned tables take up 0.7-1.1 GB (compressed)
  14. 14. Horizontal Partitioning • Fact table horizontally partitioned on year  Year is an element of the ‘Date’ dimension table  Most queries in SSBM have a predicate on year  Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them
  15. 15. What’s Going On? • Vertically Partitioned Case  Tuple Sizes  Horizontal Partitioning • All Indexes Case  Tuple Construction
  16. 16. Tuple Construction • Pretty much all queries require a column to be extracted (in the SELECT clause) that has not yet been accessed, e.g.:  SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.area = “NEW ENGLAND” GROUP BY store_name
  17. 17. Tuple Construction • Result of lower part of query plan is a set of TIDs that passed all predicates • Need to extract SELECT attributes at these TIDs  BUT: index maps value to TID  You really want to map TID to value (i.e., a vertical partition)   Tuple construction is SLOW
  18. 18. So…. • All indexes approach is pretty obviously a poor way to simulate a column-store • Problems with vertical partitioning are NOT fundamental  Store tuple header in a separate partition  Allow virtual TIDs  Allow HP using a foreign key on a different VP • So can row-stores simulate column- stores?
  19. 19. Row-Store vs. Column-Store 0.0 5.0 10.0 15.0 20.0 25.0 30.0 Time(seconds) Average 25.7 11.7 4.4 Row-Store Row-Store (M V) C-Store
  20. 20. Column-Store Experiments • Start with column-store (C-Store) • Remove column-store-specific performance optimizations • End with column-store that behaves like a row-store
  21. 21. Compression • Higher data value locality in column-stores  Better ratio  reduced I/O • Can use schemes like run-length encoding  Easy to operate on directly for improved performance ([AMF06]) Q1 Q1 Q1 Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 … … Quarter (Q1, 1, 300) Quarter (Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600)
  22. 22. • Early Materialization: create rows first. But:  Poor memory bandwidth utilization  Lose opportunity for vectorized operation 2 1 3 1 2 3 3 3 7 13 42 80 Construct 2 3 3 3 7 13 42 80 Select + Aggregate 2 1 3 1 4 4 4 4 prodID storeIDcustID price QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID Early vs. Late Materialization 4 4 4 4
  23. 23. Other Column-Store Optimizations • Invisible join  Column-store specific join  Optimizations for star schemas  Similar to a semi-join • Block Processing
  24. 24. Simplified Version of Results 0.0 10.0 20.0 30.0 40.0 50.0 Time(seconds) Average 4.4 14.9 40.7 Original C-Store C-Store,No Compression C-Store,Early Materialization
  25. 25. Conclusion • Might be possible to simulate a row-store in a column-store, BUT:  Need better support for vertical partitioning at the storage layer  Need support for column-specific optimizations at the executer level • Working with HP-Labs to find out
  26. 26. Come Join the Yale DB Group!
  1. Gostou de algum slide específico?

    Recortar slides é uma maneira fácil de colecionar informações para acessar mais tarde.

×