Tractor Pulling on
  Data Warehouses

Martin Kersten, Volker Markl,
Meikel Poess, Kai-Uwe Sattler,
Alfons Kemper, Ani Nica



       DBTest 2011
The good old days
• The early eighties, when
  – Oracle appeared on the scene
  – Ingres was a respected innovator in
    RDBMS technology
  – System R fought the CODASYL battle
  – IMS was still dominating the market


• There was a need for a metric to
  evaluate the solutions
The good old days
• Turned into an organised battle
  – TPC-C, TPC-H, TPC-D, TPC-W…
  – hundreds of benchmarks to prove one’s
    muscles
• We need tools to assess a solution
  space



• We don’t need weapons to win a
  ‘war’
Dagstuhl 2010 Robust Query
        Processing
• With each step in the pull, the tension
  on the Tractor increases
  (exponentially)

• The Tractor driver is throttling and
  changing gears to keep it going
Ingredients of the DBMS
          Tractor Pull
• A tractor pull is a series of workload
  steps for which we measure the
  performance
• Each step is defined by
  – Catalog changes
  – Database load, delete+load+create
    index
  – Query processing, BI grouped statistics
  – Concurrency
  – Act of God operations
A database soil



Generate a small database < RAM
Use a single data type
A database soil




COPY the smaller relation into the larger one




A database soil
Query template
SELECT R0.B0, ..., Ri.Bi, count(*),
       avg(R0.B0), avg(R1.B0), avg(R1.B1), ..., avg(Ri.B0), ...
FROM R0, ..., Ri
WHERE selectpattern(R0, ..., Ri)
  AND joinpattern(R0, ..., Ri)
GROUP BY R0.B0, ..., Ri.Bi
ORDER BY R0.B0, ..., Ri.Bi

Linear, Cyclic, Star-based, Clique query patterns

The n-th query load includes the (n-1)-th query load
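
A minimal sketch of how the template might be instantiated for the four join patterns. The slide does not say which columns are joined or what selectpattern contains, so the equi-join on B0, the default predicate `1 = 1`, and the helper names `join_pairs` and `build_query` are all assumptions made for illustration. Relation Rk is assumed to have columns B0..Bk, as the avg() list in the template suggests.

```python
from itertools import combinations


def join_pairs(pattern: str, n: int) -> list[tuple[int, int]]:
    """Index pairs (a, b) of relations joined under a pattern over R0..R(n-1)."""
    if pattern == "linear":   # chain: R0-R1-R2-...
        return [(k, k + 1) for k in range(n - 1)]
    if pattern == "cyclic":   # chain closed back to R0
        chain = [(k, k + 1) for k in range(n - 1)]
        return chain + ([(n - 1, 0)] if n > 2 else [])
    if pattern == "star":     # R0 is the hub
        return [(0, k) for k in range(1, n)]
    if pattern == "clique":   # every pair of relations is joined
        return list(combinations(range(n), 2))
    raise ValueError(f"unknown pattern: {pattern}")


def build_query(pattern: str, n: int, select_pred: str = "1 = 1") -> str:
    """Fill in the query template for n relations (assumed join column: B0)."""
    rels = [f"R{k}" for k in range(n)]
    group_cols = [f"R{k}.B{k}" for k in range(n)]
    avgs = [f"avg(R{k}.B{j})" for k in range(n) for j in range(k + 1)]
    joins = [f"R{a}.B0 = R{b}.B0" for a, b in join_pairs(pattern, n)]
    where = " AND ".join([select_pred] + joins)
    return (
        f"SELECT {', '.join(group_cols)}, count(*), {', '.join(avgs)}\n"
        f"FROM {', '.join(rels)}\n"
        f"WHERE {where}\n"
        f"GROUP BY {', '.join(group_cols)}\n"
        f"ORDER BY {', '.join(group_cols)}"
    )
```

Calling `build_query("star", 3)` yields a three-relation query with R0 joined to R1 and R2. The patterns differ only in how many join predicates they add, from n-1 (linear, star) up to n(n-1)/2 (clique).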
Scenarios
• Tractor pull workload

• W(N) = <S, L, Pre, Qry, Post, qry, db>
  – S: schema adjustments
  – L: loading the database
  – Pre: pre-optimization
  – Qry: query execution
  – Post: post-optimization
  – qry: query characteristics
  – db: database growth function
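
One way to picture the workload tuple is as a record of per-track callbacks plus a driver that times each phase. This is only a sketch: the class `Workload`, the function `run_pull`, the callback signatures, and the choice of wall-clock timing are assumptions for illustration and are not taken from the toolkit.

```python
import time
from dataclasses import dataclass
from typing import Callable

Phase = Callable[[int], None]  # a phase receives the track number i


@dataclass
class Workload:
    schema: Phase               # S: schema adjustments
    load: Phase                 # L: loading the database
    pre: Phase                  # Pre: pre-optimization
    query: Phase                # Qry: query execution
    post: Phase                 # Post: post-optimization
    qry: Callable[[int], int]   # qry: queries added at track i (assumed type)
    db: Callable[[int], float]  # db: database growth at track i (assumed type)


def run_pull(w: Workload, n_tracks: int) -> list[dict[str, float]]:
    """Run tracks 0..n_tracks-1 and record wall-clock seconds per phase."""
    timings = []
    for i in range(n_tracks):
        row = {}
        for name in ("schema", "load", "pre", "query", "post"):
            start = time.perf_counter()
            getattr(w, name)(i)  # invoke the phase callback for this track
            row[name] = time.perf_counter() - start
        timings.append(row)
    return timings
```

The per-track timing rows returned by `run_pull` are the kind of raw series a robustness metric could later be computed over.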
Hill scenario
• The Hill scenario models a data
  warehouse that grows at a modest
  rate g ∈ (0, 1) (e.g., g =
  0.2).

• It starts out main-memory resident
  and grows until it overflows onto a
  few disks.

• It will highlight a system’s robustness
  to deal with the memory-disk
Hill scenario
A modestly growing warehouse with a
 single user.
The database fits in memory and spills
 over to disk



d ∈ (0%, 100%), g ∈ (0, 1)
Number of connections at track i : 1
db(0) = (d × RAM) × (1 / (2 × dom))
db(i) = g × i × db(0)
qry(0) = 1, qry(i) = 4
|Q(i)| = 1 + 4 × i
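
The sizing formulas can be evaluated directly. The sketch below transcribes them literally; the function names are hypothetical, `dom` is used exactly as it appears on the slide without further interpretation, and db(i) is read as the growth term at track i. That reading is an assumption, suggested by the Meadow slide, where g = 0 gives db(i) = 0.

```python
def hill_db0(d: float, ram: float, dom: float) -> float:
    """db(0) = (d × RAM) × (1 / (2 × dom)), with d the fraction of RAM used."""
    return (d * ram) * (1.0 / (2.0 * dom))


def hill_growth(i: int, g: float, db0: float) -> float:
    """db(i) = g × i × db(0), read here as the growth term at track i."""
    return g * i * db0


# Example with made-up numbers: half of a RAM of 16 units, dom = 4, g = 0.2.
db0 = hill_db0(d=0.5, ram=16.0, dom=4.0)
growth_per_track = [hill_growth(i, g=0.2, db0=db0) for i in range(1, 6)]
```

Because the growth term is linear in i, each track adds more data than the previous one, which is what eventually pushes the database past main memory.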
Meadow scenario
A stable warehouse with multiple users.
Query templates stress complexity

d∈(0%,100%), g=0, C>1
Number of connections at track i : C
db(0) = (d × RAM) × (1 / (2 × dom))
db(i) = 0 (no growth)
qry(0) = 0, qry(i) = C
|Q(i)| = 1 + C × i
Rockies scenario
A growing warehouse with multiple
 users.
Query templates stress complexity

d ∈ (0%, 100%), g ∈ (0, 10)
Number of connections at track i : i
db(0) = (d × RAM) × (1 / (2 × dom))
db(i) = g × i × db(0)
qry(0) = 0, qry(i) = i × 4
|Q(i)| = 1 + 4 × i × (i+1)/2
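
The closed forms for |Q(i)| in the three scenarios can be checked against a running sum of the per-track query counts. In this sketch the helper `cumulative_queries` is hypothetical, and the single initial query is an assumption chosen to match the leading 1 in each closed form, even though the Meadow and Rockies slides list qry(0) = 0.

```python
from typing import Callable


def cumulative_queries(per_track: Callable[[int], int], i: int, initial: int = 1) -> int:
    """Total queries after track i: the initial load plus each track's additions."""
    return initial + sum(per_track(k) for k in range(1, i + 1))


def hill_total(i: int) -> int:
    return 1 + 4 * i                   # four queries added per track


def meadow_total(i: int, c: int) -> int:
    return 1 + c * i                   # C queries added per track


def rockies_total(i: int) -> int:
    return 1 + 4 * i * (i + 1) // 2    # 4 × k queries added at track k
```

The Rockies total grows quadratically because the sum of 4 × k for k = 1..i equals 4 × i × (i+1)/2, whereas Hill and Meadow grow linearly in the number of tracks.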
Robustness metrics
• It is a multi-dimensional metric
  aimed at measuring the deviation
  from the expected norm

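• Robust(N) = <L, S, QO, QOk, QE, QEk, H>
  – L: standard deviation of the loading time
  – S: standard deviation of the storage requirements
  – QO, QOk: standard deviation of query optimization (per track)
  – QE, QEk: standard deviation of query execution (per track)
  – H: standard deviation, holistic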
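
A minimal sketch of computing one standard deviation per dimension from per-track measurements. The slide only says "standard deviation", so using the population form (`statistics.pstdev`), the function name `robustness`, and the input layout are assumptions; how the expected norm or the holistic component H is defined is not covered here.

```python
from statistics import pstdev


def robustness(measurements: dict[str, list[float]]) -> dict[str, float]:
    """Population standard deviation of each dimension's per-track values."""
    return {dim: pstdev(values) for dim, values in measurements.items()}


# Made-up example values: load times and storage sizes over four tracks.
example = robustness({
    "L": [10.0, 12.0, 11.0, 13.0],
    "S": [5.0, 5.0, 5.0, 5.0],
})
```

A dimension whose measurements never change, like "S" in the example, scores 0.0, while larger values indicate more variation across tracks.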
A Hill scenario
A Meadow scenario
A Rockies scenario
Takeaways
• Robustness is all about comparisons.
  We need methods to quickly
  determine differences in behavior.

• If the system reaches the end of the
  field, we are happy. If it blows up, or if
  the queries behave worse
  along the way, it is not robust.
Conclusions
• Tractorpulling is an effective new
  toolkit for robustness testing a DBMS
  in various dimensions

• Refinements for ease of analysis are
  needed (GUIs)

• http://sourceforge.net/projects/tractorpulling