2. What is TPCH
The TPC Benchmark™H (TPC-H) is a decision support benchmark.
It consists of a suite of business oriented ad-hoc queries and
concurrent data modifications. The queries and the data populating
the database have been chosen to have broad industry-wide
relevance while maintaining a sufficient degree of ease of
implementation. This benchmark illustrates decision support systems that:
Examine large volumes of data;
Execute queries with a high degree of complexity;
Give answers to critical business questions.
The performance metric reported by TPC-H is called the TPC-H
Composite Query-per-Hour Performance Metric (QphH@Size), and
reflects multiple aspects of the capability of the system to process
queries. These aspects include the selected database size against
which the queries are executed, the query processing power when
queries are submitted by a single stream and the query throughput
when queries are submitted by multiple concurrent users.
5. TPC-H Schema overview : MPP data distribution

Sample key values per node (each distributed table's rows are spread across
nodes by its distribution key; replicated tables hold all rows, 1..N, on
every node):

Table      Column     Node 1  Node 2  Node 3
LINEITEM   ORDERKEY   1       2       3
           PARTKEY    6       4       8
           SUPPKEY    3       18      5
ORDERS     ORDERKEY   1       2       3
           CUSTKEY    4       2       9
PARTSUPP   PARTKEY    1       2       3
           SUPPKEY    4       5       6
PART       PARTKEY    1       2       3
CUSTOMER   CUSTKEY    1       2       3
SUPPLIER   SUPPKEY    1..N    1..N    1..N
NATION     NATIONKEY  1..N    1..N    1..N
REGION     REGIONKEY  1..N    1..N    1..N

Joins on the distribution column are collocated; joins on any other column
require over-the-network data movement.

Table      Distribution column
LINEITEM   L_ORDERKEY
ORDERS     O_ORDERKEY
PARTSUPP   PS_PARTKEY
PART       P_PARTKEY
CUSTOMER   C_CUSTKEY
SUPPLIER   REPLICATED
NATION     REPLICATED
REGION     REPLICATED
6. TPC-H Schema : Metrics
Power:
Run order:
RF1 (inserts into LINEITEM and ORDERS)
22 read-only queries
RF2 (deletes from LINEITEM and ORDERS)
Metric: query-per-hour rate
TPC-H Power@Size = 3600 * SF / Geomean(22 queries, RF1, RF2)
Geometric mean of all query timings in a run
A performance improvement to any query improves the metric equally
Throughput:
Run order:
N concurrent power query streams with different parameters
N RF1 & RF2 streams; these can run in parallel with the concurrent streams above, or after them
Metric: ratio of the total number of queries executed to the length of the measurement interval
TPC-H Throughput@Size = (S * 22 * 3600 / Ts) * SF
Absolute runtime matters; optimizing the longest-running query helps
[Diagram: the power test runs query stream 00 (queries in a permuted order, e.g. 14, 2, 9, 20, 6, ..., 5, 7, 12) with refresh function 1 (inserts into LINEITEM & ORDERS) before it and refresh function 2 (deletes from LINEITEM & ORDERS) after it; the throughput test runs query streams 01..N in parallel, together with a refresh stream of N pairs of RF1 & RF2.]
Scale factor   Number of streams
100            5
300            6
1000           7
3000           8
10000          9
30000          10
100000         11
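The two component metrics above combine into the published composite, QphH@Size = sqrt(Power@Size * Throughput@Size). As a sketch, here is a small calculator (Python for illustration; the function names are ours, and timings are assumed to be in seconds):

```python
import math

def power_at_size(sf, query_secs, rf1_secs, rf2_secs):
    """TPC-H Power@Size: 3600 * SF over the geometric mean of the
    22 query timings plus the two refresh-function timings (24 values)."""
    timings = list(query_secs) + [rf1_secs, rf2_secs]
    geomean = math.exp(sum(math.log(t) for t in timings) / len(timings))
    return 3600 * sf / geomean

def throughput_at_size(sf, num_streams, interval_secs):
    """TPC-H Throughput@Size: (S * 22 * 3600 / Ts) * SF, where Ts is the
    length of the measurement interval in seconds."""
    return (num_streams * 22 * 3600 / interval_secs) * sf

def qphh_at_size(power, throughput):
    """Composite QphH@Size: geometric mean of the two components."""
    return math.sqrt(power * throughput)
```

For example, if every one of the 24 power-run timings at SF = 100 is 60 seconds, Power@Size = 3600 * 100 / 60 = 6000; this also shows why a fixed-percentage improvement to any single query moves the geometric mean by the same amount.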
8. TPC-H Performance measurements
Invest in tools to analyze plans; some consider plan analysis an art, but breaking the plan down into key metrics helps a lot.
Capture enough information in the execution plan to unveil performance issues:
Estimated vs. actual number of rows
Amount of data spilled per disk
Rows touched vs. rows qualified during scan
Logical vs. physical reads
CPU & memory consumed per plan operator
Skew in the number of rows processed per thread per operator, etc.
Instrument the code to provide cycles per row for key scenarios:
Scan
Aggregate
Join
[Diagram: performance loop: set performance goals → measure performance → look at SMP & MPP plans → check CPU & IO utilization → fix performance issues → repeat.]
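As a sketch of the kind of plan post-processing tool suggested above, the estimate-vs-actual comparison can be automated. The operator fields and the 10x threshold here are illustrative assumptions, not any engine's actual plan format:

```python
def flag_misestimates(plan_operators, ratio_threshold=10.0):
    """Flag plan operators whose actual row count differs from the
    optimizer's estimate by more than ratio_threshold in either direction.
    Each operator is a dict with 'name', 'estimated_rows', 'actual_rows'."""
    flagged = []
    for op in plan_operators:
        est = max(op["estimated_rows"], 1)  # clamp to avoid division by zero
        act = max(op["actual_rows"], 1)
        ratio = max(est / act, act / est)   # symmetric: over- or underestimate
        if ratio > ratio_threshold:
            flagged.append((op["name"], ratio))
    return flagged
```

A join whose estimate is off by three orders of magnitude is exactly the kind of operator that chains, spills, or picks the wrong join strategy, so surfacing it automatically beats eyeballing plans.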
9. TPC-H Performance measurements
Scalability within a single server
Vary the number of processors
Vary the scale factor: 100G, 300G
Identify queries that don't have linear scaling
Capture:
CPU & IO utilization per query, with at least a 1-second sampling rate
Hot functions and waits, if any
CPI, ideally per function
Execution plans
Get busy crunching the data
Scalability across multiple servers
Vary the number of servers in the system
Vary the amount of data per server
Capture:
CPU, disk & network IO
Distributed plans
Look for queries that have excessive cross-node traffic
Identify suboptimal plans where predicates/aggregates are not pushed down
[Diagram: progressively more focused performance effort: SMP scaling → data scaling → MPP scaling.]
11. Partner engagements
Partner engagements can be considered one of the secret sauces of highly performing software.
Partners (HW/infrastructure) tend to have a vested interest in showcasing the performance and scalability of their products.
Engagements allow software companies to leverage HW expertise and gain access to low-level tools that are not publicly available (under NDA).
Partners occasionally provide HW for performance benchmarks, prototype evaluation, and release publications.
Partners can be a great asset for:
Providing low-level analysis
Collaborating on publications, benchmarks, proofs of concept, etc.
Providing HW for performance testing, evaluation, and improvement (large-scale experiments are expensive)
12. Partner engagements
NVRAM: random-access memory that retains its information when power is turned off (non-volatile), in contrast to dynamic random-access memory (DRAM).
"Promises":
Latency within the same order of magnitude as DRAM
Cheaper than SSDs
10+ TB of NVRAM in a 2-socket system within the next 4 years
Still in the prototype phase
Could eliminate the need for spinning disks or SSDs altogether
In-memory databases are likely to be early adopters of such technology
Good reading:
http://research.microsoft.com/en-us/events/trios/trios13-final5.pdf
http://www.hpl.hp.com/techreports/2013/HPL-2013-78R1.pdf
14. Partner engagements
Diablo Technologies: SSD in a DRAM slot
DIMM capacities of 200GB & 400GB; the technology is rebranded by IBM and is VMware Ready
http://www.diablo-technologies.com/
16. TPC-H where is it today
Why do benchmarks?
Stimulate technological advancements
Why TPC-H?
It introduces a set of technological challenges whose resolution will significantly improve the performance of the product
As a benchmark, is it relevant to current DW applications?
Gartner Magic Quadrant references:
"Vectorwise delivered leading 1TB non-clustered TPC Benchmark H (TPC-H) results in 2012"
Big players are Oracle, Vectorwise, Microsoft, Exasol and ParAccel
Most significant innovation came from:
Kickfire, acquired by Teradata: FPGA-based "Query Processor Module" with an instruction set tuned for database operations
ParAccel, acquired by Actian: shared-nothing architecture with columnar orientation, adaptive compression, and a memory-centric design
Exasol: column-oriented storage with proprietary in-memory compression; the database also does automatic self-optimization (creates indexes and stats, distributes tables, etc.)
So where does it come in handy?
Identify system bottlenecks
Push performance-focused features into the product
The TPC-H schema is heavily used for ETL and virtualization benchmarks
It introduces lots of interesting challenges to the DBMS
What about TPC-DS? It has a more realistic ETL process and a snowflake schema, but no one has published a TPC-DS benchmark yet
17. TPC-H where is it today
The number of publications is on the decline.

Number of TPC-H publications per year:
Year          99  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013
Publications   9     1     5    12    31    15    42    31    20    13    15    10    20     5     6

• First cloud-based benchmark? When will we see this?
19. TPC-H challenges : Aggregation
Almost all TPC-H queries do aggregation.
Unless there is a sorted index (B-tree) on the group-by column, aggregating in a hash table makes the most sense, as opposed to ordered aggregation.
Correctly sizing the hash table dictates performance:
If the cardinality estimate underestimates the number of distinct values, lots of chaining occurs and the hash table can eventually spill to disk.
If the CE overestimates, resources are not used optimally.
For low distinct counts, building a hash table per thread (local aggregation) and then doing a global aggregation improves performance.
For a small group-by on strings, represent the group-by expressions as integers (an index into an array) as opposed to using a hash table (reduces the cache footprint).
For a group-by on a primary key (C_CUSTKEY) there is no need to include other columns from CUSTOMER in the hash table.
The main benefit from PK/FK constraints is aggregate optimizations.
Queries sensitive to aggregation performance: 1, 3, 4, 10, 13, 18, 20, 21
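The local-then-global scheme for low-distinct-count aggregation can be sketched like this (Python for illustration; a sum aggregate over (key, value) pairs is assumed):

```python
from collections import defaultdict

def local_aggregate(rows):
    """Per-thread (local) aggregation: each worker builds a small hash
    table over only the rows it scans, with no cross-thread contention."""
    partial = defaultdict(float)
    for key, value in rows:
        partial[key] += value
    return partial

def global_aggregate(partials):
    """Global aggregation: merge the per-thread hash tables. With a low
    distinct count each partial stays tiny, so the merge is cheap."""
    merged = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)
```

The merge cost is proportional to the number of distinct groups times the number of workers, which is negligible when the distinct count is low (Q1 produces only 4 groups); with a high distinct count, as in Q18's group-by on L_ORDERKEY, the partials barely shrink the data and local aggregation usually hurts.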
20. TPC-H challenges : Aggregation
Q1:
Reduces 6 billion rows to 4
Sensitive to string matching
Benefits from doing local aggregation
Q10:
Group by on most CUSTOMER columns
If a PK on C_CUSTKEY exists, C_CUSTKEY alone could be used for aggregation
Further optimization: push down the aggregate on O_CUSTKEY and the TOP
Q18:
Group by on L_ORDERKEY results in 1.5 billion rows (a 4x reduction)
Local aggregation usually hurts performance
The hash table for the aggregation alone can take 25GB of RAM
21. TPC-H challenges : Joins
Select a schema which leverages locality.
Example: ORDERS x LINEITEM on L_ORDERKEY = O_ORDERKEY, by hash-partitioning both tables on ORDERKEY.
Q5, Q9 and Q18 can spill and perform badly if the correct plan is not picked.
Q9 will cause over-the-network communication on MPP systems unless PARTSUPP, PART and SUPPLIER are replicated, which is not feasible for large scale factors.
TPC-H joins are highly selective, hence efficient bloom filters are necessary.
Simplistic guide: find the most selective filter/aggregation, and that is where you start.
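A minimal bloom-filter semi-join sketch (illustrative Python, not any engine's implementation; the sizing constants are arbitrary): the filter is built from the small side's join keys, and probe-side rows that cannot possibly join are dropped before the expensive join or shuffle. False positives are possible but false negatives are not, so the downstream join stays correct.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash probes into an m-bit array. Sized by
    hand here; a real engine derives m and k from the expected build-side
    cardinality and a target false-positive rate."""
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _probes(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.num_bits

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))

def semijoin_filter(build_keys, probe_rows):
    """Build a filter from the small (build) side's join keys and use it
    to discard probe-side rows early; row[0] is assumed to be the key."""
    bf = BloomFilter()
    for k in build_keys:
        bf.add(k)
    return [row for row in probe_rows if bf.might_contain(row[0])]
```

In an MPP setting the same idea pays twice: the filter is small enough to replicate to every node, so non-qualifying LINEITEM rows are dropped before they are shuffled over the network.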
22. TPC-H challenges : Expression evaluation
Arithmetic operation performance:
Store decimals as integers and save some bits (19123 vs. 191.23)
Rebase some of the columns to use fewer bits
Keep data in the most compact form to best exploit SIMD instructions
Detecting common sub-expressions:
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge
Expression filter push-down (Q7, Q19):
Q7: take the superset or UNION of the filters and push it down to the scan
Q19: take the union of the individual predicates
Column projection vs. expression evaluation:
Cardinality estimates should help decide whether to project columns A & B, or (A * (1 - B)), before a filter on C
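The "store decimals as integers" idea can be sketched as fixed-point arithmetic (illustrative Python; the SCALE constant and the truncating division are our assumptions, since a real engine fixes rounding rules per its decimal semantics):

```python
SCALE = 100  # fixed-point with two decimal digits: 191.23 is stored as 19123

def to_fixed(value):
    """Encode a decimal with two fractional digits as a plain integer."""
    return round(value * SCALE)

def disc_price_fixed(extendedprice_f, discount_f):
    """l_extendedprice * (1 - l_discount) entirely in integer arithmetic.
    Multiplying two scale-100 numbers yields scale-10000, so divide once
    by SCALE to bring the result back to scale 100 (truncating here)."""
    return extendedprice_f * (SCALE - discount_f) // SCALE
```

Besides avoiding floating-point entirely, narrow integers are exactly what packs densely into SIMD lanes, which is the "most compact form" point above.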
23. TPC-H challenges : Correlated subqueries
Push down predicates into the subquery when applicable.
When subqueries are flattened, batch processing outperforms row-by-row evaluation.
Buffer overlapped intermediate results.
Partial query reuse.
Challenging for MPP systems (don't redistribute or shuffle the same data twice).
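The row-by-row vs. batch contrast can be sketched on a Q17-style correlated aggregate (l_quantity < 0.2 * avg(l_quantity) over the same part): the naive form re-runs the inner aggregate for every outer row, while the flattened form computes all per-part averages in one pass and joins back (illustrative Python with a hypothetical data shape):

```python
from collections import defaultdict

def correlated_rowbyrow(lineitem, target_parts):
    """Naive evaluation: rescan lineitem per part for the inner aggregate."""
    out = []
    for part in target_parts:
        qtys = [q for (pk, q) in lineitem if pk == part]  # inner "subquery"
        threshold = 0.2 * sum(qtys) / len(qtys)
        out.extend(q for (pk, q) in lineitem if pk == part and q < threshold)
    return sum(out)

def decorrelated_batch(lineitem, target_parts):
    """Flattened form: one pass computes every per-part average, then a
    single filtered pass joins the thresholds back -- the batch shape."""
    total = defaultdict(float)
    count = defaultdict(int)
    for pk, q in lineitem:
        total[pk] += q
        count[pk] += 1
    threshold = {pk: 0.2 * total[pk] / count[pk] for pk in total}
    targets = set(target_parts)
    return sum(q for pk, q in lineitem if pk in targets and q < threshold[pk])
```

The batch form also shuffles each table at most once in an MPP plan, whereas naive per-row evaluation would touch the same distributed data repeatedly.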
24. TPC-H challenges : Parallelism and concurrency
Current 2P servers have 48+ cores, ½+ TB of RAM and 10+ GB/sec of disk IO bandwidth; this means that within a single box the engine needs to provide meaningful scaling.
Further sub-partitioning the data on a single server alleviates single-server scaling problems.
TPC-H queries tend to use lots of workspace memory for joins and aggregations.
Precise and dynamic memory allocation keeps queries from spilling to disk under high concurrency.
25. TPC-H challenges : Scan performance
Disk read performance is crucial; validate that when the system is not CPU-bound, the IO subsystem is efficiently used.
The ability to filter out pages or segments from the scan is crucial.
In-memory scan performance can be increased by decreasing the search scope, and thereby the amount of data that needs to be streamed from main memory to the CPU.
26. TPC-H challenges : Scan performance
Store dictionaries in sorted order or in a BST in order to:
• Compress the filter or predicate and do numeric comparisons, as opposed to decompressing and matching on strings
• Quickly validate whether a value exists in the segment
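A sketch of predicate evaluation against a sorted dictionary (illustrative Python): the string literal is binary-searched to its integer code once, the scan then compares integers only, and an absent literal lets the whole segment be skipped without reading a single row.

```python
import bisect

def encode_predicate(sorted_dictionary, literal):
    """Translate a string literal into its dictionary code via binary
    search (the payoff of keeping the dictionary sorted). Returns None
    when the literal is absent from the segment."""
    i = bisect.bisect_left(sorted_dictionary, literal)
    if i < len(sorted_dictionary) and sorted_dictionary[i] == literal:
        return i
    return None

def scan_equals(codes, sorted_dictionary, literal):
    """Evaluate column = literal on dictionary-encoded data with integer
    comparisons only; the strings themselves are never decompressed."""
    code = encode_predicate(sorted_dictionary, literal)
    if code is None:
        return []            # value not in the segment: skip it entirely
    return [i for i, c in enumerate(codes) if c == code]
```

Sorted codes also preserve order, so range predicates (<, between) translate to integer range checks on the codes with the same one-time lookup.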
27. TPC-H challenges : Scan performance
What do we do for highly selective filters?
Implement paged indexes for the columns of interest:
Partition a column into pages and store a bitmap index per compressed value, where the bits reflect which rows have that value. Instead of scanning the entire segment for the matching rows, we read only the blocks that have matching values, i.e. bits set.
http://db.disi.unitn.eu/pages/VLDBProgram/pdf/IMDM/paper2.pdf
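A toy version of such a paged bitmap index (illustrative Python; the tiny page size is only for readability): per page we keep one bitmap per value, so a lookup touches only the pages whose bitmap mentions the value and skips every other block outright.

```python
PAGE_SIZE = 4  # unrealistically small, just to keep the example readable

def build_paged_bitmaps(codes, page_size=PAGE_SIZE):
    """For each page of the column, record a bitmap per value: bit i set
    means row i of that page holds the value."""
    pages = []
    for start in range(0, len(codes), page_size):
        bitmap = {}
        for i, code in enumerate(codes[start:start + page_size]):
            bitmap[code] = bitmap.get(code, 0) | (1 << i)
        pages.append(bitmap)
    return pages

def lookup(pages, code, page_size=PAGE_SIZE):
    """Return global row positions holding the value, reading only pages
    whose bitmap mentions it; all other pages are skipped untouched."""
    hits = []
    for page_no, bitmap in enumerate(pages):
        bits = bitmap.get(code)
        if not bits:
            continue          # page has no matching row: skip the block
        for i in range(page_size):
            if bits & (1 << i):
                hits.append(page_no * page_size + i)
    return hits
```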
28. TPC-H challenges : Intermediate steps in MPP
In MPP, a single SQL statement results in multiple SQL statements that get executed locally on each node.
Some TPC-DS queries can result in 20+ SQL statements that need to be executed locally on each leaf node.
Streaming data between steps should result in better performance, but there are cases where this strategy fails.
Placing data on disk after each step allows the query optimizer to reevaluate the plan.
29. TPC-H challenges : Improving join performance for incompatible joins
Query:
Select count(*) from PART, PARTSUPP, LINEITEM
where P_BRAND = 'NIKE' and PS_COMMENT like '%bla%'
and P_PARTKEY = PS_PARTKEY and L_PARTKEY = PS_PARTKEY
group by P_BRAND
Schema:
PART distributed on P_PARTKEY
PARTSUPP distributed on PS_PARTKEY
LINEITEM distributed on L_ORDERKEY
Plan: create bloom filter BF1 on PART, push the filter down onto PARTSUPP and create BF2; replicate the bloom filter on all leaf nodes, apply it to LINEITEM and shuffle only the qualifying rows.
The optimizer should choose between semi-join reduction and replicating PART x PARTSUPP.
Multiple copies of a set of columns, each distributed differently, can improve the performance of such queries, but at a high cost.
31. Looking ahead
SQL to map-reduce jobs? Crunching data in a relational database is always faster than Hadoop; bring data from Hadoop into columnar format and perform the analytics with efficient generated code.
Full integration with analytics tools such as SAS, R, Tableau, Excel, etc.
Support PL/SQL syntax (compete with Oracle).
Eliminate the aggregating node to reduce system cost for a small number of nodes; Exasol does this.
34. TPC-H column store
Avoid virtual function calls and branching; use templates.
Scan usually dominates the CPU profile.
Vector/batch processing is a must.
If done correctly, the code is very sensitive to branching and data dependencies; exploit instruction-level parallelism when possible.
Use SIMD instructions; leverage existing libraries, such as Agner Fog's vector class library below, to encapsulate the complexity of SSE intrinsics:
// define and initialize integer vectors a and b
Vec4i a(10,11,12,13);
Vec4i b(20,21,22,23);
// add the two vectors
Vec4i c = a + b;
http://www.agner.org/optimize/vectorclass.pdf
35. TPC-H Plans
Behold the power of the optimizer: if the plan is wrong, you are doomed…
A very good read on TPC-H Q8:
http://www.slideshare.net/GraySystemsLab/pass-summit-2010-keynote-david-dewitt
36. JSON documents
The most efficient way to store JSON documents
Great compression and quick retrieval; ask me how to ….
37. Q1
Used as a benchmark for computational power
Arithmetic operation performance
Aggregating into the same hash buckets
Common sub-expression pattern matching
Sensitive to scan performance
String matching for aggregation (could do the matching on the compressed format)
select l_returnflag, l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
38. Q2
Correlated subquery
Push-down of predicates to the correlated subquery
Highly selective (segment size plays a big role)
Tricky to generate the optimal plan
Depending on which tables are partitioned and which are replicated, plan performance varies a lot
select
s_acctbal,s_name, n_name, p_partkey,
p_mfgr, s_address, s_phone, s_comment
from part, supplier,
partsupp, nation, region
where p_partkey = ps_partkey
and s_suppkey = ps_suppkey and p_size = [SIZE]
and p_type like '%[TYPE]' and s_nationkey = n_nationkey
and n_regionkey = r_regionkey and r_name = '[REGION]'
and ps_supplycost = ( select min(ps_supplycost) from
partsupp, supplier, nation, region
where p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = '[REGION]'
) order by
s_acctbal desc, n_name, s_name, p_partkey;
39. Q3
Collocated join between ORDERS & LINEITEM
Detect the correlation between l_shipdate and o_orderdate
Bitmap filters on LINEITEM are necessary
Replicating (select c_custkey from customer where c_mktsegment = '[SEGMENT]')
select TOP 10 l_orderkey, sum(l_extendedprice*(1-
l_discount)) as revenue,
o_orderdate, o_shippriority
from customer, orders, lineitem
where c_mktsegment = '[SEGMENT]' and c_custkey =
o_custkey
and l_orderkey = o_orderkey and o_orderdate < date
'[DATE]'
and l_shipdate > date '[DATE]'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate;
Editor's Notes
The subcommittee included representatives from Compaq, Data General, Dell, EMC, HP, IBM, Informix, Microsoft, NCR, Oracle, Sequent, SGI, Sun, Sybase, and Unisys
Define main tables, scale factor. Benefits of collocated joins. How tables can be partitioned. Metric: geometric mean, to avoid optimizing individual queries. All tables grow linearly with scale factor except NATION and REGION. Unless you have B-tree indexes there shouldn't be any loop joins ??
If all we have is distributed and replicated tables
It doesn't matter how fast the physical operators are if the generated plan is wrong. Plan changes can make or break performance. If CPU utilization is 100% and performance is still not acceptable, look into CPI (cycles per instruction).
Invest in building tools that do post-processing on plans to identify inefficient plans: average number of rows per operator; joins that don't reduce the number of rows; aggregates that don't reduce the number of rows; over- or underestimation; spilling. It pays off to build profiling into the code to get cycles per row for scans, aggregates, filtering, etc.
http://www.oracle.com/us/corporate/features/database-in-memory-option/index.html 3:30. Response: http://www.youtube.com/watch?v=48_oSIkEJlo#t=77. Poking Oracle: http://www.youtube.com/watch?v=48_oSIkEJlo#t=279. NVRAM? Is that on the horizon? Company X calls this a tectonic change and expects it to be commodity HW by 2016, with capacity up to 10TB per 2-socket server. NVRAM in a box exposed as a SAN equivalent: http://www.diablo-technologies.com/
Q19 (conjunctions and disjunctions):
Select sum(l_extendedprice * (1 - l_discount)) as revenue
From lineitem, part
Where ( p_partkey = l_partkey and p_brand = '[BRAND1]' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' )
or ( p_partkey = l_partkey and p_brand = '[BRAND2]' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' )
or ( p_partkey = l_partkey and p_brand = '[BRAND3]' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' );
Q2, 11, 15, 17 and Q20. Q17: Select sum(l_extendedprice) / 7.0 as avg_yearly From lineitem, part Where p_partkey = l_partkey and p_brand = '[BRAND]' and p_container = '[CONTAINER]' and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );