2. What is TPCH
The TPC Benchmark™H (TPC-H) is a decision support benchmark.
It consists of a suite of business oriented ad-hoc queries and
concurrent data modifications. The queries and the data populating
the database have been chosen to have broad industry-wide
relevance while maintaining a sufficient degree of ease of
implementation. This benchmark illustrates decision support systems that:
Examine large volumes of data;
Execute queries with a high degree of complexity;
Give answers to critical business questions.
The performance metric reported by TPC-H is called the TPC-H
Composite Query-per-Hour Performance Metric (QphH@Size), and
reflects multiple aspects of the capability of the system to process
queries. These aspects include the selected database size against
which the queries are executed, the query processing power when
queries are submitted by a single stream and the query throughput
when queries are submitted by multiple concurrent users.
5. TPC-H Schema overview : MPP data distribution

Sample key values per node (each distributed table's rows are spread across
nodes by its distribution key; replicated tables hold all rows, 1..N, on
every node):

Table      Column     Node 1  Node 2  Node 3
LINEITEM   ORDERKEY   1       2       3
           PARTKEY    6       4       8
           SUPPKEY    3       18      5
ORDERS     ORDERKEY   1       2       3
           CUSTKEY    4       2       9
PARTSUPP   PARTKEY    1       2       3
           SUPPKEY    4       5       6
PART       PARTKEY    1       2       3
CUSTOMER   CUSTKEY    1       2       3
SUPPLIER   SUPPKEY    1..N    1..N    1..N
NATION     NATIONKEY  1..N    1..N    1..N
REGION     REGIONKEY  1..N    1..N    1..N

Joins on the distribution column are collocated; joins on any other column
require over-the-network data movement.

Table      Distribution column
LINEITEM   L_ORDERKEY
ORDERS     O_ORDERKEY
PARTSUPP   PS_PARTKEY
PART       P_PARTKEY
CUSTOMER   C_CUSTKEY
SUPPLIER   REPLICATED
NATION     REPLICATED
REGION     REPLICATED
6. TPC-H Schema : Metrics
Power:
Run order:
RF1 (inserts into LINEITEM and ORDERS)
22 read-only queries
RF2 (deletes from LINEITEM and ORDERS)
Metric: query-per-hour rate
TPC-H Power@Size = 3600 * SF / Geomean(22 queries, RF1, RF2)
Geometric mean of all query timings in a run
A performance improvement to any query improves the metric equally
Throughput:
Run order:
N concurrent power query streams with different parameters
N RF1 & RF2 streams; these can run in parallel with the concurrent streams above, or after them
Metric: ratio of the total number of queries executed to the length of the measurement interval
TPC-H Throughput@Size = (S * 22 * 3600 / Ts) * SF
Absolute runtime matters; optimizing the longest-running query helps
[Diagram: the power test runs query stream 00 (queries in a permuted order, e.g. 14, 2, 9, 20, 6, ..., 5, 7, 12) with refresh function 1 (inserts into LINEITEM & ORDERS) before it and refresh function 2 (deletes from LINEITEM & ORDERS) after it; the throughput test runs query streams 01..N in parallel, together with a refresh stream of N pairs of RF1 & RF2.]
Scale factor   Number of streams
100            5
300            6
1000           7
3000           8
10000          9
30000          10
100000         11
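The two component metrics above combine into the published composite, QphH@Size = sqrt(Power@Size * Throughput@Size). As a sketch, here is a small calculator (Python for illustration; the function names are ours, and timings are assumed to be in seconds):

```python
import math

def power_at_size(sf, query_secs, rf1_secs, rf2_secs):
    """TPC-H Power@Size: 3600 * SF over the geometric mean of the
    22 query timings plus the two refresh-function timings (24 values)."""
    timings = list(query_secs) + [rf1_secs, rf2_secs]
    geomean = math.exp(sum(math.log(t) for t in timings) / len(timings))
    return 3600 * sf / geomean

def throughput_at_size(sf, num_streams, interval_secs):
    """TPC-H Throughput@Size: (S * 22 * 3600 / Ts) * SF, where Ts is the
    length of the measurement interval in seconds."""
    return (num_streams * 22 * 3600 / interval_secs) * sf

def qphh_at_size(power, throughput):
    """Composite QphH@Size: geometric mean of the two components."""
    return math.sqrt(power * throughput)
```

For example, if every one of the 24 power-run timings at SF = 100 is 60 seconds, Power@Size = 3600 * 100 / 60 = 6000; this also shows why a fixed-percentage improvement to any single query moves the geometric mean by the same amount.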
8. TPC-H Performance measurements
Invest in tools to analyze plans; some consider plan analysis an art, but breaking the plan down into key metrics helps a lot.
Capture enough information in the execution plan to unveil performance issues:
Estimated vs. actual number of rows
Amount of data spilled per disk
Rows touched vs. rows qualified during scan
Logical vs. physical reads
CPU & memory consumed per plan operator
Skew in the number of rows processed per thread per operator, etc.
Instrument the code to provide cycles per row for key scenarios:
Scan
Aggregate
Join
[Diagram: performance loop: set performance goals → measure performance → look at SMP & MPP plans → check CPU & IO utilization → fix performance issues → repeat.]
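As a sketch of the kind of plan post-processing tool suggested above, the estimate-vs-actual comparison can be automated. The operator fields and the 10x threshold here are illustrative assumptions, not any engine's actual plan format:

```python
def flag_misestimates(plan_operators, ratio_threshold=10.0):
    """Flag plan operators whose actual row count differs from the
    optimizer's estimate by more than ratio_threshold in either direction.
    Each operator is a dict with 'name', 'estimated_rows', 'actual_rows'."""
    flagged = []
    for op in plan_operators:
        est = max(op["estimated_rows"], 1)  # clamp to avoid division by zero
        act = max(op["actual_rows"], 1)
        ratio = max(est / act, act / est)   # symmetric: over- or underestimate
        if ratio > ratio_threshold:
            flagged.append((op["name"], ratio))
    return flagged
```

A join whose estimate is off by three orders of magnitude is exactly the kind of operator that chains, spills, or picks the wrong join strategy, so surfacing it automatically beats eyeballing plans.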
9. TPC-H Performance measurements
Scalability within a single server
Vary the number of processors
Vary the scale factor: 100G, 300G
Identify queries that don't have linear scaling
Capture:
CPU & IO utilization per query, with at least a 1-second sampling rate
Hot functions and waits, if any
CPI, ideally per function
Execution plans
Get busy crunching the data
Scalability across multiple servers
Vary the number of servers in the system
Vary the amount of data per server
Capture:
CPU, disk & network IO
Distributed plans
Look for queries that have excessive cross-node traffic
Identify suboptimal plans where predicates/aggregates are not pushed down
[Diagram: progressively more focused performance effort: SMP scaling → data scaling → MPP scaling.]
11. Partner engagements
Partner engagements can be considered one of the secret sauces of highly performing software.
Partners (HW/infrastructure) tend to have a vested interest in showcasing the performance and scalability of their products.
Engagements allow software companies to leverage HW expertise and gain access to low-level tools that are not publicly available (under NDA).
Partners occasionally provide HW for performance benchmarks, prototype evaluation, and release publications.
Partners can be a great asset for:
Providing low-level analysis
Collaborating on publications, benchmarks, proofs of concept, etc.
Providing HW for performance testing, evaluation, and improvement (large-scale experiments are expensive)
12. Partner engagements
NVRAM: random-access memory that retains its information when power is turned off (non-volatile), in contrast to dynamic random-access memory (DRAM).
"Promises":
Latency within the same order of magnitude as DRAM
Cheaper than SSDs
10+ TB of NVRAM in a 2-socket system within the next 4 years
Still in the prototype phase
Could eliminate the need for spinning disks or SSDs altogether
In-memory databases are likely to be early adopters of such technology
Good reading:
http://research.microsoft.com/en-us/events/trios/trios13-final5.pdf
http://www.hpl.hp.com/techreports/2013/HPL-2013-78R1.pdf
14. Partner engagements
Diablo Technologies: SSD in a DRAM slot
DIMM capacities of 200GB & 400GB; the technology is rebranded by IBM and is VMware Ready
http://www.diablo-technologies.com/
16. TPC-H where is it today
Why do benchmarks?
Stimulate technological advancements
Why TPC-H?
It introduces a set of technological challenges whose resolution will significantly improve the performance of the product
As a benchmark, is it relevant to current DW applications?
Gartner Magic Quadrant references:
"Vectorwise delivered leading 1TB non-clustered TPC Benchmark H (TPC-H) results in 2012"
Big players are Oracle, Vectorwise, Microsoft, Exasol and ParAccel
Most significant innovation came from:
Kickfire, acquired by Teradata: FPGA-based "Query Processor Module" with an instruction set tuned for database operations
ParAccel, acquired by Actian: shared-nothing architecture with columnar orientation, adaptive compression, and a memory-centric design
Exasol: column-oriented storage with proprietary in-memory compression; the database also does automatic self-optimization (creates indexes and stats, distributes tables, etc.)
So where does it come in handy?
Identify system bottlenecks
Push performance-focused features into the product
The TPC-H schema is heavily used for ETL and virtualization benchmarks
It introduces lots of interesting challenges to the DBMS
What about TPC-DS? It has a more realistic ETL process and a snowflake schema, but no one has published a TPC-DS benchmark yet
17. TPC-H where is it today
The number of publications is on the decline.

Number of TPC-H publications per year:
Year          99  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013
Publications   9     1     5    12    31    15    42    31    20    13    15    10    20     5     6

• First cloud-based benchmark? When will we see this?
19. TPC-H challenges : Aggregation
Almost all TPC-H queries do aggregation.
Unless there is a sorted index (B-tree) on the group-by column, aggregating in a hash table makes the most sense, as opposed to ordered aggregation.
Correctly sizing the hash table dictates performance:
If the cardinality estimate underestimates the number of distinct values, lots of chaining occurs and the hash table can eventually spill to disk.
If the CE overestimates, resources are not used optimally.
For low distinct counts, building a hash table per thread (local aggregation) and then doing a global aggregation improves performance.
For a small group-by on strings, represent the group-by expressions as integers (an index into an array) as opposed to using a hash table (reduces the cache footprint).
For a group-by on a primary key (C_CUSTKEY) there is no need to include other columns from CUSTOMER in the hash table.
The main benefit from PK/FK constraints is aggregate optimizations.
Queries sensitive to aggregation performance: 1, 3, 4, 10, 13, 18, 20, 21
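The local-then-global scheme for low-distinct-count aggregation can be sketched like this (Python for illustration; a sum aggregate over (key, value) pairs is assumed):

```python
from collections import defaultdict

def local_aggregate(rows):
    """Per-thread (local) aggregation: each worker builds a small hash
    table over only the rows it scans, with no cross-thread contention."""
    partial = defaultdict(float)
    for key, value in rows:
        partial[key] += value
    return partial

def global_aggregate(partials):
    """Global aggregation: merge the per-thread hash tables. With a low
    distinct count each partial stays tiny, so the merge is cheap."""
    merged = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)
```

The merge cost is proportional to the number of distinct groups times the number of workers, which is negligible when the distinct count is low (Q1 produces only 4 groups); with a high distinct count, as in Q18's group-by on L_ORDERKEY, the partials barely shrink the data and local aggregation usually hurts.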
20. TPC-H challenges : Aggregation
Q1:
Reduces 6 billion rows to 4
Sensitive to string matching
Benefits from doing local aggregation
Q10:
Group by on most CUSTOMER columns
If a PK on C_CUSTKEY exists, C_CUSTKEY alone could be used for aggregation
Further optimization: push down the aggregate on O_CUSTKEY and the TOP
Q18:
Group by on L_ORDERKEY results in 1.5 billion rows (a 4x reduction)
Local aggregation usually hurts performance
The hash table for the aggregation alone can take 25GB of RAM
21. TPC-H challenges : Joins
Select a schema which leverages locality.
Example: ORDERS x LINEITEM on L_ORDERKEY = O_ORDERKEY, by hash-partitioning both tables on ORDERKEY.
Q5, Q9 and Q18 can spill and perform badly if the correct plan is not picked.
Q9 will cause over-the-network communication on MPP systems unless PARTSUPP, PART and SUPPLIER are replicated, which is not feasible for large scale factors.
TPC-H joins are highly selective, hence efficient bloom filters are necessary.
Simplistic guide: find the most selective filter/aggregation, and that is where you start.
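A minimal bloom-filter semi-join sketch (illustrative Python, not any engine's implementation; the sizing constants are arbitrary): the filter is built from the small side's join keys, and probe-side rows that cannot possibly join are dropped before the expensive join or shuffle. False positives are possible but false negatives are not, so the downstream join stays correct.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash probes into an m-bit array. Sized by
    hand here; a real engine derives m and k from the expected build-side
    cardinality and a target false-positive rate."""
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _probes(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.num_bits

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))

def semijoin_filter(build_keys, probe_rows):
    """Build a filter from the small (build) side's join keys and use it
    to discard probe-side rows early; row[0] is assumed to be the key."""
    bf = BloomFilter()
    for k in build_keys:
        bf.add(k)
    return [row for row in probe_rows if bf.might_contain(row[0])]
```

In an MPP setting the same idea pays twice: the filter is small enough to replicate to every node, so non-qualifying LINEITEM rows are dropped before they are shuffled over the network.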
22. TPC-H challenges : Expression evaluation
Arithmetic operation performance:
Store decimals as integers and save some bits (19123 vs. 191.23)
Rebase some of the columns to use fewer bits
Keep data in the most compact form to best exploit SIMD instructions
Detecting common sub-expressions:
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge
Expression filter push-down (Q7, Q19):
Q7: take the superset or UNION of the filters and push it down to the scan
Q19: take the union of the individual predicates
Column projection vs. expression evaluation:
Cardinality estimates should help decide whether to project columns A & B, or (A * (1 - B)), before a filter on C
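The "store decimals as integers" idea can be sketched as fixed-point arithmetic (illustrative Python; the SCALE constant and the truncating division are our assumptions, since a real engine fixes rounding rules per its decimal semantics):

```python
SCALE = 100  # fixed-point with two decimal digits: 191.23 is stored as 19123

def to_fixed(value):
    """Encode a decimal with two fractional digits as a plain integer."""
    return round(value * SCALE)

def disc_price_fixed(extendedprice_f, discount_f):
    """l_extendedprice * (1 - l_discount) entirely in integer arithmetic.
    Multiplying two scale-100 numbers yields scale-10000, so divide once
    by SCALE to bring the result back to scale 100 (truncating here)."""
    return extendedprice_f * (SCALE - discount_f) // SCALE
```

Besides avoiding floating-point entirely, narrow integers are exactly what packs densely into SIMD lanes, which is the "most compact form" point above.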
23. TPC-H challenges : Correlated subqueries
Push down predicates into the subquery when applicable.
When subqueries are flattened, batch processing outperforms row-by-row evaluation.
Buffer overlapped intermediate results.
Partial query reuse.
Challenging for MPP systems (don't redistribute or shuffle the same data twice).
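The row-by-row vs. batch contrast can be sketched on a Q17-style correlated aggregate (l_quantity < 0.2 * avg(l_quantity) over the same part): the naive form re-runs the inner aggregate for every outer row, while the flattened form computes all per-part averages in one pass and joins back (illustrative Python with a hypothetical data shape):

```python
from collections import defaultdict

def correlated_rowbyrow(lineitem, target_parts):
    """Naive evaluation: rescan lineitem per part for the inner aggregate."""
    out = []
    for part in target_parts:
        qtys = [q for (pk, q) in lineitem if pk == part]  # inner "subquery"
        threshold = 0.2 * sum(qtys) / len(qtys)
        out.extend(q for (pk, q) in lineitem if pk == part and q < threshold)
    return sum(out)

def decorrelated_batch(lineitem, target_parts):
    """Flattened form: one pass computes every per-part average, then a
    single filtered pass joins the thresholds back -- the batch shape."""
    total = defaultdict(float)
    count = defaultdict(int)
    for pk, q in lineitem:
        total[pk] += q
        count[pk] += 1
    threshold = {pk: 0.2 * total[pk] / count[pk] for pk in total}
    targets = set(target_parts)
    return sum(q for pk, q in lineitem if pk in targets and q < threshold[pk])
```

The batch form also shuffles each table at most once in an MPP plan, whereas naive per-row evaluation would touch the same distributed data repeatedly.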
24. TPC-H challenges : Parallelism and concurrency
Current 2P servers have 48+ cores, ½+ TB of RAM and 10+ GB/sec of disk IO bandwidth; this means that within a single box the engine needs to provide meaningful scaling.
Further sub-partitioning the data on a single server alleviates single-server scaling problems.
TPC-H queries tend to use lots of workspace memory for joins and aggregations.
Precise and dynamic memory allocation keeps queries from spilling to disk under high concurrency.
25. TPC-H challenges : Scan performance
Disk read performance is crucial; validate that when the system is not CPU-bound, the IO subsystem is efficiently used.
The ability to filter out pages or segments from the scan is crucial.
In-memory scan performance can be increased by decreasing the search scope, and thereby the amount of data that needs to be streamed from main memory to the CPU.
26. TPC-H challenges : Scan performance
Store dictionaries in sorted order or in a BST in order to:
• Compress the filter or predicate and do numeric comparisons, as opposed to decompressing and matching on strings
• Quickly validate whether a value exists in the segment
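A sketch of predicate evaluation against a sorted dictionary (illustrative Python): the string literal is binary-searched to its integer code once, the scan then compares integers only, and an absent literal lets the whole segment be skipped without reading a single row.

```python
import bisect

def encode_predicate(sorted_dictionary, literal):
    """Translate a string literal into its dictionary code via binary
    search (the payoff of keeping the dictionary sorted). Returns None
    when the literal is absent from the segment."""
    i = bisect.bisect_left(sorted_dictionary, literal)
    if i < len(sorted_dictionary) and sorted_dictionary[i] == literal:
        return i
    return None

def scan_equals(codes, sorted_dictionary, literal):
    """Evaluate column = literal on dictionary-encoded data with integer
    comparisons only; the strings themselves are never decompressed."""
    code = encode_predicate(sorted_dictionary, literal)
    if code is None:
        return []            # value not in the segment: skip it entirely
    return [i for i, c in enumerate(codes) if c == code]
```

Sorted codes also preserve order, so range predicates (<, between) translate to integer range checks on the codes with the same one-time lookup.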
27. TPC-H challenges : Scan performance
What do we do for highly selective filters?
Implement paged indexes for the columns of interest:
Partition a column into pages and store a bitmap index per compressed value, where the bits reflect which rows have that value. Instead of scanning the entire segment for the matching rows, we read only the blocks that have matching values, i.e. bits set.
http://db.disi.unitn.eu/pages/VLDBProgram/pdf/IMDM/paper2.pdf
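A toy version of such a paged bitmap index (illustrative Python; the tiny page size is only for readability): per page we keep one bitmap per value, so a lookup touches only the pages whose bitmap mentions the value and skips every other block outright.

```python
PAGE_SIZE = 4  # unrealistically small, just to keep the example readable

def build_paged_bitmaps(codes, page_size=PAGE_SIZE):
    """For each page of the column, record a bitmap per value: bit i set
    means row i of that page holds the value."""
    pages = []
    for start in range(0, len(codes), page_size):
        bitmap = {}
        for i, code in enumerate(codes[start:start + page_size]):
            bitmap[code] = bitmap.get(code, 0) | (1 << i)
        pages.append(bitmap)
    return pages

def lookup(pages, code, page_size=PAGE_SIZE):
    """Return global row positions holding the value, reading only pages
    whose bitmap mentions it; all other pages are skipped untouched."""
    hits = []
    for page_no, bitmap in enumerate(pages):
        bits = bitmap.get(code)
        if not bits:
            continue          # page has no matching row: skip the block
        for i in range(page_size):
            if bits & (1 << i):
                hits.append(page_no * page_size + i)
    return hits
```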
28. TPC-H challenges : Intermediate steps in MPP
In MPP, a single SQL statement results in multiple SQL statements that get executed locally on each node.
Some TPC-DS queries can result in 20+ SQL statements that need to be executed locally on each leaf node.
Streaming data between steps should result in better performance, but there are cases where this strategy fails.
Placing data on disk after each step allows the query optimizer to reevaluate the plan.
29. TPC-H challenges : Improving join performance for incompatible joins
Query:
Select count(*) from PART, PARTSUPP, LINEITEM
where P_BRAND = 'NIKE' and PS_COMMENT like '%bla%'
and P_PARTKEY = PS_PARTKEY and L_PARTKEY = PS_PARTKEY
group by P_BRAND
Schema:
PART distributed on P_PARTKEY
PARTSUPP distributed on PS_PARTKEY
LINEITEM distributed on L_ORDERKEY
Plan: create bloom filter BF1 on PART, push the filter down onto PARTSUPP and create BF2; replicate the bloom filter on all leaf nodes, apply it to LINEITEM and shuffle only the qualifying rows.
The optimizer should choose between semi-join reduction and replicating PART x PARTSUPP.
Multiple copies of a set of columns, each distributed differently, can improve the performance of such queries, but at a high cost.
31. Looking ahead
SQL to map-reduce jobs? Crunching data in a relational database is always faster than Hadoop; bring data from Hadoop into columnar format and perform the analytics with efficient generated code.
Full integration with analytics tools such as SAS, R, Tableau, Excel, etc.
Support PL/SQL syntax (compete with Oracle).
Eliminate the aggregating node to reduce system cost for a small number of nodes; Exasol does this.
34. TPC-H column store
Avoid virtual function calls and branching; use templates.
Scan usually dominates the CPU profile.
Vector/batch processing is a must.
If done correctly, the code is very sensitive to branching and data dependencies; exploit instruction-level parallelism when possible.
Use SIMD instructions; leverage existing libraries, such as Agner Fog's vector class library below, to encapsulate the complexity of SSE intrinsics:
// define and initialize integer vectors a and b
Vec4i a(10,11,12,13);
Vec4i b(20,21,22,23);
// add the two vectors
Vec4i c = a + b;
http://www.agner.org/optimize/vectorclass.pdf
35. TPC-H Plans
Behold the power of the optimizer: if the plan is wrong, you are doomed…
A very good read on TPC-H Q8:
http://www.slideshare.net/GraySystemsLab/pass-summit-2010-keynote-david-dewitt
36. JSON documents
The most efficient way to store JSON documents
Great compression and quick retrieval; ask me how to ….
37. Q1
Used as a benchmark for computational power
Arithmetic operation performance
Aggregating into the same hash buckets
Common sub-expression pattern matching
Sensitive to scan performance
String matching for aggregation (could do the matching on the compressed format)
select l_returnflag, l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
38. Q2
Correlated subquery
Push-down of predicates to the correlated subquery
Highly selective (segment size plays a big role)
Tricky to generate the optimal plan
Depending on which tables are partitioned and which are replicated, plan performance varies a lot
select
s_acctbal,s_name, n_name, p_partkey,
p_mfgr, s_address, s_phone, s_comment
from part, supplier,
partsupp, nation, region
where p_partkey = ps_partkey
and s_suppkey = ps_suppkey and p_size = [SIZE]
and p_type like '%[TYPE]' and s_nationkey = n_nationkey
and n_regionkey = r_regionkey and r_name = '[REGION]'
and ps_supplycost = ( select min(ps_supplycost) from
partsupp, supplier, nation, region
where p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = '[REGION]'
) order by
s_acctbal desc, n_name, s_name, p_partkey;
39. Q3
Collocated join between ORDERS & LINEITEM
Detect the correlation between l_shipdate and o_orderdate
Bitmap filters on LINEITEM are necessary
Replicating (select c_custkey from customer where c_mktsegment = '[SEGMENT]')
select TOP 10 l_orderkey, sum(l_extendedprice*(1-
l_discount)) as revenue,
o_orderdate, o_shippriority
from customer, orders, lineitem
where c_mktsegment = '[SEGMENT]' and c_custkey =
o_custkey
and l_orderkey = o_orderkey and o_orderdate < date
'[DATE]'
and l_shipdate > date '[DATE]'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate;
Editor's Notes
The subcommittee included representatives from Compaq, Data General, Dell, EMC, HP, IBM, Informix, Microsoft, NCR, Oracle, Sequent, SGI, Sun, Sybase, and Unisys
Define main tables, scale factor. Benefits of collocated joins. How tables can be partitioned. Metric: geometric mean, to avoid optimizing individual queries. All tables grow linearly with scale factor except NATION and REGION. Unless you have B-tree indexes there shouldn't be any loop joins ??
If all we have is distributed and replicated tables
It doesn't matter how fast the physical operators are if the generated plan is wrong. Plan changes can make or break performance. If CPU utilization is 100% and performance is still not acceptable, look into CPI (cycles per instruction).
Invest in building tools that do post-processing on plans to identify inefficient plans: average number of rows per operator; joins that don't reduce the number of rows; aggregates that don't reduce the number of rows; over- or underestimation; spilling. It pays off to build profiling into the code to get cycles per row for scans, aggregates, filtering, etc.
http://www.oracle.com/us/corporate/features/database-in-memory-option/index.html 3:30. Response: http://www.youtube.com/watch?v=48_oSIkEJlo#t=77. Poking Oracle: http://www.youtube.com/watch?v=48_oSIkEJlo#t=279. NVRAM? Is that on the horizon? Company X calls this a tectonic change and expects it to be commodity HW by 2016, with capacity up to 10TB per 2-socket server. NVRAM in a box exposed as a SAN equivalent: http://www.diablo-technologies.com/
Q19 (conjunctions and disjunctions):
Select sum(l_extendedprice * (1 - l_discount)) as revenue
From lineitem, part
Where ( p_partkey = l_partkey and p_brand = '[BRAND1]' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' )
or ( p_partkey = l_partkey and p_brand = '[BRAND2]' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' )
or ( p_partkey = l_partkey and p_brand = '[BRAND3]' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' );
Q2, 11, 15, 17 and Q20. Q17: Select sum(l_extendedprice) / 7.0 as avg_yearly From lineitem, part Where p_partkey = l_partkey and p_brand = '[BRAND]' and p_container = '[CONTAINER]' and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );