Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

This presentation was given at ICPE 2013, Prague, on 24/04/2013.

Full paper and additional information available at:
http://msrg.org/papers/ICPE2013Rabl

Abstract:
The Star Schema Benchmark (SSB) has been widely used to evaluate the performance of database management systems when executing star schema queries. SSB, based on the well-known industry standard benchmark TPC-H, shares some of its drawbacks, most notably its uniform data distributions. Today’s systems rely heavily on sophisticated cost-based query optimizers to generate the most efficient query execution plans. A benchmark that evaluates an optimizer’s capability to generate optimal execution plans under all circumstances must provide the rich data set details on which optimizers rely (uniform and non-uniform distributions, data sparsity, etc.). This is also true for other database system parts, such as indices and operators, and ultimately holds for an end-to-end benchmark as well. SSB’s data generator, based on TPC-H’s dbgen, is not easy to adapt to different data distributions because its metadata and actual data generation implementations are not separated. In this paper, we motivate the need for a new revision of SSB that includes non-uniform data distributions. We list the specific modifications to SSB that are required to implement non-uniform data sets, and we demonstrate how to implement these modifications in the Parallel Data Generator Framework to generate both the data and query sets.

  • Data skew is naturally occurring
  • Transcript of "Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance"

    1. Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance
       TILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL
       ICPE 2013, PRAGUE, 24/04/2013
       MIDDLEWARE SYSTEMS RESEARCH GROUP, MSRG.ORG
    2. Real Life Data is Distributed Uniformly… Well, Not Really
       ◦ Customer zip codes are typically clustered around metropolitan areas
       ◦ Seasonal items (lawn mowers, snow shovels, …) are sold mostly during specific periods
       ◦ US retail sales:
         ◦ peak during Holiday Season
         ◦ December sales are 2x January sales
       Source: US Census Data
       RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS
    3. Student Seminar Signup Distribution
    4. How Can Skew Affect Database Systems?
       Data placement
       ◦ Partitioning
       ◦ Indexing
       Data structures
       ◦ Tree balance
       ◦ Bucket fill ratio
       ◦ Histograms
       Optimizer → finding the optimal query plan
       ◦ Index vs. non-index driven plans
       ◦ Hash join vs. merge join
       ◦ Hash group by vs. sort group by
    5. Agenda
       Data Skew in Current Benchmarks
       Star Schema Benchmark (SSB)
       Parallel Data Generation Framework (PDGF)
       Introducing Skew in SSB
    6. Data Skew in Benchmarks
       TPC-D (1994-1999): only uniform data
       ◦ SIGMOD 1997 - “Successor of TPC-D should include data skew”
       ◦ No effect until …
       TPC-DS (released 2012)
       ◦ Contains comparability zones
       ◦ Not fully utilized
       TPC-D/H variations
       ◦ Chaudhuri and Narasayya: Zipfian distribution on all columns
       ◦ Crolotte and Ghazal: comparability zones
       Still lots of open potential
    7. Star Schema Benchmark I
       Star schema version of TPC-H
       ◦ Merged Order and Lineitem
       ◦ Date dimension
       ◦ Dropped Partsupp
       ◦ Selectivity hierarchies
         ◦ C_City → C_Nation → C_Region
         ◦ …
    8. Star Schema Benchmark II
       Completely new set of queries
       4 flights of 3-4 queries
       ◦ Designed for functional coverage and selectivity coverage
       ◦ Drill down in dimension hierarchies
       ◦ Predefined selectivity

       Q1.1
         select sum(lo_extendedprice*lo_discount) as revenue
         from   lineorder, date
         where  lo_orderdate = d_datekey
         and    d_year = 1993
         and    lo_discount between 1 and 3
         and    lo_quantity < 25;

       Q1.2
         select sum(lo_extendedprice*lo_discount) as revenue
         from   lineorder, date
         where  lo_orderdate = d_datekey
         and    d_yearmonthnum = 199301
         and    lo_discount between 1 and 3
         and    lo_quantity between 26 and 35;
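The "predefined selectivity" of Q1.1 can be worked out by hand when all columns are uniform and independent. The sketch below does that arithmetic. The value domains are assumptions taken from the usual SSB definition rather than from the slide: 7 years in the date dimension, lo_discount as integers 0 to 10, and lo_quantity as integers 1 to 50.

```python
from fractions import Fraction

# Assumed SSB value domains (not stated on the slide).
YEARS = 7                   # d_year is assumed to cover 7 distinct years
DISCOUNTS = range(0, 11)    # lo_discount assumed to be an integer in 0..10
QUANTITIES = range(1, 51)   # lo_quantity assumed to be an integer in 1..50


def uniform_fraction(domain, predicate):
    """Fraction of a uniformly distributed domain that satisfies a predicate."""
    values = list(domain)
    return Fraction(sum(1 for v in values if predicate(v)), len(values))


def q11_filter_factor():
    """Combined filter factor of Q1.1, assuming independent uniform columns."""
    year = Fraction(1, YEARS)                                       # d_year = 1993
    discount = uniform_fraction(DISCOUNTS, lambda d: 1 <= d <= 3)   # between 1 and 3
    quantity = uniform_fraction(QUANTITIES, lambda q: q < 25)       # lo_quantity < 25
    return year * discount * quantity
```

Under these assumptions the three predicates multiply to a filter factor of a little under 2% of the fact table. Once any of the columns is skewed, this product no longer holds, which is the effect the later slides examine.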
    9. Parallel Data Generation Framework
       Generic data generation framework
       Relational model
       ◦ Schema specified in configuration file
       ◦ Post-processing stage for alternative representations
       Repeatable computation
       ◦ Based on XORSHIFT random number generators
       ◦ Hierarchical seeding strategy
       Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12.
       Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11.
       Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11.
       Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.
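The idea behind repeatable computation with hierarchical seeding can be sketched in a few lines. This is an illustration of the general technique only, not PDGF's actual code: the function names, the seed-mixing constant, and the project → table → column → row hierarchy are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1


def xorshift64(state):
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def child_seed(parent_seed, index):
    """Derive the seed of the index-th child from its parent's seed.

    The odd multiplier is an arbitrary mixing constant chosen for this sketch.
    """
    mixed = (parent_seed + (index + 1) * 0x9E3779B97F4A7C15) & MASK64
    return xorshift64(mixed or 1)  # xorshift must never start from state 0


def field_value(project_seed, table_id, column_id, row, modulus):
    """Value of one field, computed from its coordinates alone."""
    table_seed = child_seed(project_seed, table_id)
    column_seed = child_seed(table_seed, column_id)
    row_seed = child_seed(column_seed, row)
    return xorshift64(row_seed) % modulus
```

Because every value depends only on the project seed and its own coordinates, rows can be generated in any order and on any worker and still come out identical, which is what makes both parallel generation and later recomputation of a referenced value possible.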
    10. Configuring PDGF
       [Diagram: XML → PDGF → DB]
       Schema configuration
       Relational model
       ◦ Tables, fields
       Properties
       ◦ Table size, characters, …
       Generators
       ◦ Simple generators
       ◦ Metagenerators
       Update definition
       ◦ Insert, update, delete
       ◦ Generated as change data capture

       <table name="SUPPLIER">
         <size>${S}</size>
         <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true">
           <gen_IdGenerator />
         </field>
         <field name="S_NAME" size="25" type="VARCHAR">
           <gen_PrePostfixGenerator>
             <gen_PaddingGenerator>
               <gen_OtherFieldValueGenerator>
                 <reference field="S_SUPPKEY" />
               </gen_OtherFieldValueGenerator>
               <character>0</character>
               <padToLeft>true</padToLeft>
               <size>9</size>
             </gen_PaddingGenerator>
             <prefix>Supplier </prefix>
           </gen_PrePostfixGenerator>
         </field>
         [..]
    11. Opportunities to Inject Data Skew in
       Foreign key relations
       ◦ E.g., L_PARTKEY
       Fact table measures
       ◦ E.g., L_Quantity
       Single dimension hierarchy
       ◦ E.g., P_Brand → P_Category → P_Mfgr
       Multiple dimension hierarchies
       ◦ E.g., City → Nation in Supplier and Customer
       Experimental methodology
       ◦ One experiment series for each of the above
       ◦ Comparison to original SSB
       ◦ Comparison of index-forced, non-index, and automatic optimizer mode
       ◦ SSB scale factor 100 (100 GB), x86 server
    12. Skew in Foreign Key Relations
       Very realistic
       Easy to implement in PDGF
       ◦ Just add a distribution to the reference

         <distribution name="Exponential" lambda="0.26235" />

       But!
       Dimension attributes uniformly distributed
       Dimension keys uncorrelated to dimension attributes
       → Very limited effect on selectivity
       Focus on attributes in selectivity predicates
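To make the effect of an exponential distribution on a foreign key reference concrete, here is a minimal sketch that draws keys by inverse-CDF sampling with the lambda value from the slide. How PDGF itself maps the continuous sample to a key is not shown in the source; truncating to an integer and clamping the tail to the last key are assumptions of this example, as are the function names.

```python
import math
import random


def skewed_foreign_key(rng, num_keys, lam=0.26235):
    """Draw a key in 1..num_keys with exponentially decaying probability."""
    u = rng.random()                       # uniform in [0, 1)
    sample = -math.log(1.0 - u) / lam      # inverse CDF of Exp(lam)
    return min(int(sample) + 1, num_keys)  # truncate; clamp the tail (assumption)


def key_histogram(num_rows, num_keys, seed=1):
    """Count how often each key is referenced; index 0 stays unused."""
    rng = random.Random(seed)
    counts = [0] * (num_keys + 1)
    for _ in range(num_rows):
        counts[skewed_foreign_key(rng, num_keys)] += 1
    return counts
```

With this mapping the first key receives roughly 23% of all references and each following key about 1/1.3 of the previous one, since exp(-0.26235) is approximately 1/1.3. The skew is on the key only, so a predicate on a uniformly distributed, uncorrelated dimension attribute is largely unaffected, which is the limitation the slide points out.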
    13. Skew in Fact Table Measure – Lo_Quantity
       Lo_Quantity distribution
       ◦ Values range between 0 and 50
       ◦ Originally uniform distribution with:
         ◦ P(X=x) = 0.02
         ◦ Coefficient of variation of 0.00000557
       ◦ Proposed skewed distribution with:
         ◦ P(X=x) = 0.3 × 1.3^(-x)
       Query 1.1
       ◦ lo_quantity < x, x ∈ [2, 51]
       Results
       ◦ Switches too early to non-index plan
       ◦ Switches too late to non-index plan
       ◦ Optimizer agnostic to distribution
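The gap between the two Lo_Quantity distributions is easiest to see by computing the selectivity of `lo_quantity < x` under each. The sketch below assumes the skewed distribution reads P(X=x) = 0.3 × 1.3^(-x) and that both distributions are defined on the integers 1 to 50; the second assumption follows from the uniform probability of 0.02 per value. The function names are made up for the example.

```python
def uniform_pmf(x):
    """Original SSB: each of the 50 quantity values is equally likely."""
    return 0.02 if 1 <= x <= 50 else 0.0


def skewed_pmf(x):
    """Skewed variant, assuming P(X=x) = 0.3 * 1.3^(-x) on 1..50."""
    return 0.3 * 1.3 ** (-x) if 1 <= x <= 50 else 0.0


def selectivity(pmf, bound):
    """Fraction of fact rows that satisfy lo_quantity < bound."""
    return sum(pmf(x) for x in range(1, bound))


def selectivity_table(bounds=(2, 5, 10, 25, 51)):
    """Rows of (bound, uniform selectivity, skewed selectivity)."""
    return [(b, selectivity(uniform_pmf, b), selectivity(skewed_pmf, b)) for b in bounds]
```

For the original Q1.1 predicate `lo_quantity < 25`, the uniform data selects 48% of the rows while the skewed data selects more than 99%. An optimizer that estimates selectivity as if the values were uniform will therefore pick the switch point between index and non-index plans at the wrong value of x.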
    14. Skew in Single Dimension Hierarchy - Part
       P_Category distribution
       ◦ Uniform P(X=x)=0.04
       ◦ Skewed P(X=x)= 0.01 - 48.36
       ◦ Probabilities explicitly defined
       Query 2.1
       ◦ Restrictions on two dimensions
       Results uniform case
       ◦ Index driven superior
       ◦ Optimizer chooses non-index driven
       Results skewed case
       ◦ Switches too early to non-index plan
    15. Skew in Multiple Dimension Hierarchies – S_City & C_City
       Skewed S_City & C_City
       ◦ Probabilities exponentially distributed
       Query 3.3
       ◦ Restrictions on 3 dimensions
       ◦ Variation on Supplier and Customer city
       [Charts: Join Cardinality; Elapsed Time]
       Results uniform and skewed cases
       ◦ Automatic plan performs best
       ◦ Crossover between automatic uniform and skewed too late
    16. Conclusion & Future Work
       PDGF implementation of SSB
       Introduction of skew in SSB
       Extensive performance analysis
       ◦ Several interesting optimizer effects
       ◦ Performance impact of skew
       Future Work
       Further analysis on impact of skew
       Skew in query generation
       Complete suite for testing skew effects
    17. Thanks
       Questions?
       Download and try PDGF: http://www.paralleldatageneration.org
       (scripts used in the study available on website above)
    18. Back-up Slides
    19. Configuring PDGF Generation
       Generation configuration
       Defines the output
       ◦ Scheduling
       ◦ Data format
       ◦ Sorting
       ◦ File name and location
       Post processing
       ◦ Filtering of values
       ◦ Merging of tables
       ◦ Splitting of tables
       ◦ Templates (e.g. XML / queries)

       <table name="QUERY_PARAMETERS" exclude="false">
         <output name="CompiledTemplateOutput">
           [..]
           <template><!--
             int y = (fields[0].getPlainValue()).intValue();
             int d = (fields[1].getPlainValue()).intValue();
             int q = (fields[2].getPlainValue()).intValue();
             String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR;
             buffer.append("-- Q1.1" + n);
             buffer.append("select sum(lo_extendedprice *");
             buffer.append(" lo_discount) as revenue" + n);
             buffer.append(" from lineorder, date" + n);
             buffer.append(" where lo_orderdate = d_datekey" + n);
             buffer.append(" and d_year = " + y + n);
             buffer.append(" and lo_discount between " + (d - 1));
             buffer.append(" and " + (d + 1) + n);
             buffer.append(" and lo_quantity < " + q + ";" + n);
           --></template>
         </output>
       </table>