
TPC-H Column Store and MPP systems



A TPC-H focused presentation on the challenges introduced by the benchmark, market trends, new technologies, etc.

Published in: Education, Technology

TPC-H Column Store and MPP systems

  1. 1. TPC-H Performance MPP & Column Store
  2. 2. What is TPC-H • The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance while maintaining a sufficient degree of ease of implementation. This benchmark illustrates decision support systems that • Examine large volumes of data; • Execute queries with a high degree of complexity; • Give answers to critical business questions. • The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users.
  3. 3. Overview TPC-H Schema overview TPC-H Performance measurements Partner engagement TPC-H where is it today TPC-H challenges Looking ahead Q&A
  4. 4. TPC-H Schema overview: Relationships between columns
  5. 5. TPC-H Schema overview : MPP data distribution
     Sample row placement per node (joins on the distribution column are collocated; other joins require over-the-network data movement, see the sketch after this slide):
     Table      Column     Node 1   Node 2   Node 3
     LINEITEM   ORDERKEY   1        2        3
                PARTKEY    6        4        8
                SUPPKEY    3        18       5
     ORDERS     ORDERKEY   1        2        3
                CUSTKEY    4        2        9
     PARTSUPP   PARTKEY    1        2        3
                SUPPKEY    4        5        6
     PART       PARTKEY    1        2        3
     CUSTOMER   CUSTKEY    1        2        3
     SUPPLIER   SUPPKEY    1..N     1..N     1..N
     NATION     NATIONKEY  1..N     1..N     1..N
     REGION     REGIONKEY  1..N     1..N     1..N
     Distribution columns:
     Table      Distribution column
     LINEITEM   L_ORDERKEY
     ORDERS     O_ORDERKEY
     PARTSUPP   PS_PARTKEY
     PART       P_PARTKEY
     CUSTOMER   C_CUSTKEY
     SUPPLIER   REPLICATED
     NATION     REPLICATED
     REGION     REPLICATED
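To make the collocation idea concrete, here is a minimal sketch, with illustrative names and a simple modulo-hash scheme that the slides do not specify, of how rows could be assigned to nodes by hashing the distribution column; because LINEITEM and ORDERS both distribute on ORDERKEY, matching rows land on the same node and the join needs no data movement:

    #include <cstdint>
    #include <cstdio>
    #include <functional>

    // Pick the node that owns a row, given the value of its distribution column.
    int node_for_key(uint64_t distribution_key, int num_nodes) {
        return (int)(std::hash<uint64_t>{}(distribution_key) % (uint64_t)num_nodes);
    }

    int main() {
        const int num_nodes = 3;
        // An ORDERS row and its LINEITEM rows share ORDERKEY = 42, so both sides
        // of the ORDERS x LINEITEM join are stored on the same node (collocated).
        int orders_node   = node_for_key(42, num_nodes);   // distributed on O_ORDERKEY
        int lineitem_node = node_for_key(42, num_nodes);   // distributed on L_ORDERKEY
        // A join on PARTKEY, by contrast, would have to move LINEITEM data over
        // the network, because LINEITEM is not distributed on L_PARTKEY.
        std::printf("ORDERS -> node %d, LINEITEM -> node %d\n", orders_node, lineitem_node);
    }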
  6. 6. TPC-H Schema : Metrics
     Power:
     • Run order: RF1 (inserts into LINEITEM and ORDERS), the 22 read-only queries, RF2 (deletes from LINEITEM and ORDERS)
     • Metric: query-per-hour rate
     • TPC-H Power@Size = 3600 * SF / Geomean(22 queries, RF1, RF2)
     • Geometric mean of all query timings in a run, so a performance improvement to any query improves the metric equally
     Throughput:
     • Run order: N concurrent power-style query streams with different substitution parameters, plus N RF1 & RF2 pairs in a refresh stream; the refresh stream can run in parallel with the query streams or after them
     • Metric: ratio of the total number of queries executed to the length of the measurement interval
     • TPC-H Throughput@Size = (S * 22 * 3600) / Ts * SF
     • Absolute runtime matters, so optimizing the longest-running query helps
     (Diagram: the power run is query stream 00 with query order 14, 2, 9, 20, 6, …, 5, 7, 12, bracketed by RF1 and RF2; the throughput run executes query streams 01..N in parallel with a refresh stream of N RF1/RF2 pairs.)
     Number of query streams per scale factor:
     Scale Factor   Number of streams
     100            5
     300            6
     1000           7
     3000           8
     10000          9
     30000          10
     100000         11
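Written out per the TPC-H specification (QI(i,0) and RI(j,0) are the timed query and refresh intervals of the power run, S is the number of query streams and Ts the length of the throughput measurement interval in seconds), the two metrics above combine into the composite QphH@Size mentioned on slide 2:

    \mathrm{Power@Size} = \frac{3600 \cdot SF}{\sqrt[24]{\prod_{i=1}^{22} QI(i,0) \cdot RI(1,0) \cdot RI(2,0)}}

    \mathrm{Throughput@Size} = \frac{S \cdot 22 \cdot 3600}{T_s} \cdot SF

    \mathrm{QphH@Size} = \sqrt{\mathrm{Power@Size} \times \mathrm{Throughput@Size}}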
  7. 7. Outline TPC-H Schema overview TPC-H Performance measurements Partner engagement TPC-H where is it today TPC-H challenges Looking ahead Q&A
  8. 8. TPC-H Performance measurements • Invest in tools to analyze plans; some consider plan analysis an art, and breaking a plan down to key metrics helps a lot • Capture enough information in the execution plan to unveil performance issues: • Estimated vs. actual number of rows, etc. • Amount of data spilled per disk • Rows touched vs. rows qualified during scan • Logical vs. physical reads • CPU & memory consumed per plan operator • Skew in the number of rows processed per thread per operator • Instrument the code to provide cycles per row for key scenarios (see the sketch after this slide): • Scan • Aggregate • Join (Diagram: iterative loop; set performance goals, measure performance, look at SMP & MPP plans, check CPU & IO utilization, fix performance issues, repeat.)
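A minimal sketch of the kind of cycles-per-row instrumentation mentioned above; the filter, column name and counter are illustrative assumptions rather than any real engine's API:

    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include <x86intrin.h>   // __rdtsc() on GCC/Clang; MSVC has it in <intrin.h>

    // Illustrative scan operator: counts rows passing a filter and reports an
    // approximate cycles-per-row figure for the scan loop.
    int main() {
        std::vector<int32_t> l_quantity(1'000'000);          // fake column data
        for (size_t i = 0; i < l_quantity.size(); ++i) l_quantity[i] = (int32_t)(i % 50);

        int64_t qualified = 0;
        uint64_t start = __rdtsc();
        for (int32_t q : l_quantity)
            qualified += (q < 24);                            // filter: l_quantity < 24
        uint64_t cycles = __rdtsc() - start;

        std::printf("rows=%zu qualified=%lld cycles/row=%.2f\n",
                    l_quantity.size(), (long long)qualified,
                    (double)cycles / l_quantity.size());
    }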
  9. 9. TPC-H Performance measurements • Scalability within a single server • Vary the number of processors • Vary the scale factor: 100 GB, 300 GB • Identify queries that don't scale linearly • Capture: • CPU & IO utilization per query, with at least a 1-second sampling rate • Hot functions and waits, if any • CPI, ideally per function • Execution plans • Get busy crunching the data • Scalability across multiple servers • Vary the number of servers in the system • Vary the amount of data per server • Capture: • CPU, disk & network IO • Distributed plans • Look for queries that have excessive cross-node traffic • Identify suboptimal plans where predicates/aggregates are not pushed down (Diagram: SMP scaling, then data scaling, then MPP scaling; an increasingly focused performance effort.)
  10. 10. Outline TPC-H Schema overview TPC-H Performance measurements Partner engagement TPC-H where is it today TPC-H challenges Looking ahead Q&A
  11. 11. Partner engagements • Can be considered one of the secret sauces of high-performing software • Partners (HW/infrastructure) tend to have a vested interest in showcasing the performance and scalability of their products • Allows software companies to leverage HW expertise and gain access to low-level tools that are not publicly available (through NDA) • Partners occasionally provide HW for performance benchmarks, prototype evaluation and release publications • Partners can be a great assist for: • Providing low-level analysis • Collaborating on publications, benchmarks, proofs of concept, etc. • Providing HW for performance testing, evaluation and improvement (large-scale experiments are expensive)
  12. 12. Partner engagements • NVRAM: random-access memory that retains its information when power is turned off (non-volatile), in contrast to dynamic random-access memory (DRAM) • "Promises": • Latency within the same order of magnitude as DRAM • Cheaper than SSDs • 10+ TB of NVRAM in a 2-socket system within the next 4 years • Still in the prototype phase • Could eliminate the need for spinning disks or SSDs altogether • In-memory databases are likely to be early adopters of such technology • Good reading:
  13. 13. Partner engagements Diablo Technologies: SSD in a DRAM slot
  14. 14. Partner engagements Diablo Technologies: SSD in a DRAM slot; DIMM capacities of 200 GB & 400 GB, the technology is rebranded by IBM and is VMware Ready
  15. 15. Outline TPC-H Schema overview TPC-H Performance measurements Partner engagement TPC-H where is it today TPC-H challenges Looking ahead Q&A
  16. 16. TPC-H where is it today • Why do benchmarks? • They stimulate technological advancements • Why TPC-H? • It introduces a set of technological challenges whose resolution significantly improves the performance of the product • As a benchmark, is it relevant to current DW applications? • Gartner Magic Quadrant references it: "Vectorwise delivered leading 1TB non-clustered TPC Benchmark H (TPC-H) results in 2012" • Big players are Oracle, Vectorwise, Microsoft, Exasol and ParAccel • The most significant innovation came from: • Kickfire, acquired by Teradata: an FPGA-based "Query Processor Module" with an instruction set tuned for database operations • ParAccel, acquired by Actian: shared-nothing architecture with a columnar orientation, adaptive compression and a memory-centric design • Exasol: column-oriented storage with proprietary in-memory compression methods; the database also self-optimizes automatically (creates indexes and stats, distributes tables, etc.) • So where does it come in handy? • Identify system bottlenecks • Push performance-focused features into the product • The TPC-H schema is heavily used for ETL and virtualization benchmarks • It introduces lots of interesting challenges to the DBMS • What about TPC-DS? It has a more realistic ETL process and a snowflake schema, but no one has published a TPC-DS benchmark yet
  17. 17. TPC-H where is it today • The number of publications is on the decline
     Number of TPC-H publications per year:
     1999: 9    2000: 1    2001: 5    2002: 12   2003: 31
     2004: 15   2005: 42   2006: 31   2007: 20   2008: 13
     2009: 15   2010: 10   2011: 20   2012: 5    2013: 6
     • First cloud-based benchmark? When will we see this?
  18. 18. Outline TPC-H Schema overview TPC-H Performance measurements Partner engagement TPC-H where is it today TPC-H challenges Looking ahead Q&A
  19. 19. TPC-H challenges : Aggregation • Almost all TPC-H queries do aggregation • Unless there is a sorted index (B-tree) on the group-by column, aggregating in a hash table makes more sense than ordered aggregation • Correctly sizing the hash table dictates performance: • If the cardinality estimate underestimates the number of distinct values, lots of chaining occurs and the hash table can eventually spill to disk • If it overestimates, resources are not used optimally • For a low distinct count, aggregating in a per-thread (local) hash table and then doing a global aggregation improves performance (see the sketch after this slide) • For a small group-by on strings, represent the group-by expressions as integers (an index into an array) instead of using a hash table (reduces the cache footprint) • For a group-by on a primary key (C_CUSTKEY) there is no need to include other CUSTOMER columns in the hash table • The main benefit of PK/FK constraints is aggregate optimizations • Queries sensitive to aggregation performance: 1, 3, 4, 10, 13, 18, 20, 21
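A minimal sketch of the local-then-global aggregation idea using a Q1-like tiny group-by domain; encoding the group key as a small array index (rather than a hash table) and the column/group names are illustrative assumptions:

    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int kGroups = 16;               // e.g. returnflag x linestatus fits easily
    struct Agg { int64_t sum_qty = 0; int64_t count = 0; };

    int main() {
        // Fake column data: a small group code (0..3) and a quantity per row.
        std::vector<uint8_t> group(8'000'000);
        std::vector<int32_t> qty(group.size(), 1);
        for (size_t i = 0; i < group.size(); ++i) group[i] = (uint8_t)(i % 4);

        const int nthreads = 4;
        std::vector<std::array<Agg, kGroups>> local(nthreads);   // one small table per thread
        std::vector<std::thread> pool;
        size_t chunk = group.size() / nthreads;

        for (int t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                size_t begin = t * chunk;
                size_t end = (t == nthreads - 1) ? group.size() : begin + chunk;
                for (size_t i = begin; i < end; ++i) {            // local aggregation: no sharing, no locks
                    Agg& a = local[t][group[i]];
                    a.sum_qty += qty[i];
                    ++a.count;
                }
            });
        for (auto& th : pool) th.join();

        std::array<Agg, kGroups> global{};                        // global merge: tiny amount of work
        for (auto& l : local)
            for (int g = 0; g < kGroups; ++g) {
                global[g].sum_qty += l[g].sum_qty;
                global[g].count   += l[g].count;
            }
        for (int g = 0; g < 4; ++g)
            std::printf("group %d: count=%lld sum_qty=%lld\n", g,
                        (long long)global[g].count, (long long)global[g].sum_qty);
    }

The same structure is exactly what hurts in the Q18 case on the next slide: with 1.5 billion distinct L_ORDERKEY groups, per-thread tables no longer fit in cache and the merge step dominates.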
  20. 20. TPC-H challenges : Aggregation
     Q1: reduces 6 billion rows to 4 groups; sensitive to string matching; benefits from doing local aggregation
     Q10: group-by on most CUSTOMER columns; if a PK on C_CUSTKEY exists, C_CUSTKEY alone could be used for aggregation; a further optimization is to push the aggregate down onto O_CUSTKEY together with the TOP
     Q18: group-by on L_ORDERKEY results in 1.5 billion rows (a 4x reduction); local aggregation usually hurts performance; the hash table for the aggregation alone can take 25 GB of RAM
  21. 21. TPC-H challenges : Joins • Select a schema which leverages locality. Example: ORDERS x LINEITEM on L_ORDERKEY = O_ORDERKEY stays local by hash partitioning both tables on ORDERKEY • Q5, Q9 and Q18 can spill and perform badly if the correct plan is not picked • Q9 causes over-the-network communication on MPP systems unless PARTSUPP, PART and SUPPLIER are replicated, which is not feasible for large scale factors • TPC-H joins are highly selective, hence efficient bloom filters are necessary (see the sketch after this slide) • Simplistic guide: find the most selective filter/aggregation, and that is where you start
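A minimal sketch of a bloom filter used as a semi-join reducer: build it from the keys that survive the selective side of the join, then probe it while scanning the large side so most non-matching rows are dropped before the join or shuffle. The sizing and the two-hash scheme are illustrative assumptions:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Tiny bloom filter over 64-bit join keys: two hash probes into one bit array.
    struct Bloom {
        std::vector<uint64_t> bits;
        uint64_t mask;
        explicit Bloom(size_t log2_bits)
            : bits((1ull << log2_bits) / 64), mask((1ull << log2_bits) - 1) {}
        static uint64_t h1(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }
        static uint64_t h2(uint64_t k) { return k * 0xC2B2AE3D27D4EB4Full + 1; }
        void set(uint64_t pos)       { bits[(pos & mask) >> 6] |= 1ull << (pos & 63); }
        bool get(uint64_t pos) const { return (bits[(pos & mask) >> 6] >> (pos & 63)) & 1; }
        void insert(uint64_t k)      { set(h1(k)); set(h2(k)); }
        bool maybe_contains(uint64_t k) const { return get(h1(k)) && get(h2(k)); }
    };

    int main() {
        // Build side: e.g. O_ORDERKEYs that survived a selective filter on ORDERS.
        Bloom bf(20);                                    // 2^20 bits (~128 KB)
        for (uint64_t k = 0; k < 50'000; k += 5) bf.insert(k);

        // Probe side: scan of LINEITEM; only rows whose L_ORDERKEY might match
        // are passed on to the (expensive) hash join or network shuffle.
        uint64_t passed = 0;
        for (uint64_t k = 0; k < 6'000'000; ++k)
            if (bf.maybe_contains(k)) ++passed;
        std::printf("rows forwarded to the join: %llu (false positives included)\n",
                    (unsigned long long)passed);
    }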
  22. 22. TPC-H challenges : Expression evaluation
     Arithmetic operation performance:
     • Store decimals as scaled integers and save some bits: 19123 vs. 191.23
     • Rebase some of the columns to use fewer bits
     • Keep data in its most compact form to best exploit SIMD instructions (see the sketch after this slide)
     Detect common sub-expressions:
     sum(l_extendedprice) as sum_base_price,
     sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge
     Expression filter push-down (Q7, Q19):
     • Q7: take the superset or UNION of the filters and push it down to the scan
     • Q19: take the union of the individual predicates
     Column projection vs. expression evaluation:
     • Cardinality estimates should help decide whether to project columns A & B, or A * (1 - B), before a filter on C
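A minimal sketch of the scaled-integer and common-subexpression ideas applied to Q1's sum expressions; the fixed-point encoding (cents for price, hundredths for discount and tax) is an illustrative assumption, and l_extendedprice * (1 - l_discount) is computed once and reused for sum_charge:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        // Columns stored as scaled integers: price in cents (scale 2),
        // discount and tax in hundredths (0.06 -> 6).
        std::vector<int64_t> price    = {19123, 250000, 99};   // 191.23, 2500.00, 0.99
        std::vector<int32_t> discount = {6, 0, 10};             // 0.06, 0.00, 0.10
        std::vector<int32_t> tax      = {2, 8, 0};              // 0.02, 0.08, 0.00

        // Accumulators; each multiplication by a 1/100 factor raises the scale by 2.
        // A real engine would use wider (e.g. 128-bit) accumulators for SF-sized data.
        int64_t sum_base = 0, sum_disc = 0, sum_charge = 0;
        for (size_t i = 0; i < price.size(); ++i) {
            int64_t disc_price = price[i] * (100 - discount[i]);   // scale 4
            sum_base   += price[i];                                // scale 2
            sum_disc   += disc_price;                              // scale 4: common subexpression
            sum_charge += disc_price * (100 + tax[i]);             // scale 6: reuses disc_price
        }
        std::printf("sum_base_price=%.2f sum_disc_price=%.4f sum_charge=%.6f\n",
                    (double)sum_base / 1e2, (double)sum_disc / 1e4, (double)sum_charge / 1e6);
    }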
  23. 23. TPC-H challenges : Correlated subqueries • Push predicates down into the subquery when applicable • When subqueries are flattened, batch processing outperforms row-by-row execution • Buffer overlapped intermediate results • Partial query reuse • Challenging for MPP systems (don't redistribute or shuffle the same data twice)
  24. 24. TPC-H challenges : Parallelism and concurrency • Current 2P servers have 48+ cores, ½+ TB of RAM & 10+ GB/sec of disk IO bandwidth, which means that within a single box the engine needs to provide meaningful scaling • Further sub-partitioning the data on a single server alleviates single-server scaling problems • TPC-H queries tend to use lots of workspace memory for joins and aggregations • Precise and dynamic memory allocation keeps queries from spilling to disk under high concurrency
  25. 25. TPC-H challenges : Scan performance • Disk read performance is crucial; validate that when the system is not CPU bound the IO subsystem is used efficiently • The ability to filter out pages or segments from the scan is crucial • In-memory scan performance can be increased by decreasing the search scope and thereby the amount of data that needs to be streamed from main memory to the CPU
  26. 26. TPC-H challenges : Scan performance • Store dictionaries in sorted order or in a BST so that you can: • Compress the filter predicate into a numeric comparison instead of decompressing and matching on strings • Quickly validate whether the value exists in the segment at all (see the sketch after this slide)
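A minimal sketch of compressing a string predicate against a sorted per-segment dictionary: the filter value is translated once into a dictionary code, or the whole segment is skipped if the value is not in its dictionary, and the scan then compares small integer codes instead of strings. The column and values are illustrative assumptions:

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        // Sorted per-segment dictionary for a string column (e.g. an l_shipmode-like column).
        std::vector<std::string> dictionary = {"AIR", "MAIL", "RAIL", "SHIP", "TRUCK"};
        // Column data stored as dictionary codes, not strings.
        std::vector<uint16_t> codes = {3, 0, 4, 3, 1, 3};

        // Compress the predicate l_shipmode = 'SHIP' into a code with one binary search.
        std::string filter = "SHIP";
        auto it = std::lower_bound(dictionary.begin(), dictionary.end(), filter);
        if (it == dictionary.end() || *it != filter) {
            std::puts("value not in this segment's dictionary: skip the whole segment");
            return 0;
        }
        uint16_t filter_code = (uint16_t)(it - dictionary.begin());

        // The scan now compares integers only; no string comparison per row.
        size_t matches = 0;
        for (uint16_t c : codes) matches += (c == filter_code);
        std::printf("matches=%zu\n", matches);
    }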
  27. 27. TPC-H challenges : Scan performance • What do we do for highly selective filters? • Implement paged indexes for the columns of interest • Partition a column into pages and store a bitmap index per compressed value, where the bits reflect which rows have that value; instead of scanning the entire segment for matching rows, we only read the blocks that contain matching values, i.e. have bits set (see the sketch after this slide)
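A minimal sketch of such a paged bitmap index: one bitmap per (page, value) pair, used first to skip pages entirely and then to touch only the rows whose bits are set. The 64-rows-per-page layout is an illustrative assumption:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    constexpr size_t kRowsPerPage = 64;            // toy page size: one 64-bit word per page

    int main() {
        // Column of dictionary codes, conceptually split into pages of 64 rows.
        std::vector<uint16_t> codes(4 * kRowsPerPage, 0);
        codes[70] = 7; codes[200] = 7;             // two rows carry the value we look for

        // Build: for value 7, one bitmap word per page with a bit set per matching row.
        const uint16_t wanted = 7;
        std::vector<uint64_t> bitmap(codes.size() / kRowsPerPage, 0);
        for (size_t i = 0; i < codes.size(); ++i)
            if (codes[i] == wanted)
                bitmap[i / kRowsPerPage] |= 1ull << (i % kRowsPerPage);

        // Scan with a highly selective filter: skip pages whose bitmap word is zero,
        // and within a page visit only the set bits instead of all 64 rows.
        size_t matches = 0;
        for (size_t page = 0; page < bitmap.size(); ++page) {
            uint64_t bits = bitmap[page];
            if (bits == 0) continue;                              // whole page skipped
            while (bits) {
                size_t row = page * kRowsPerPage + (size_t)__builtin_ctzll(bits);
                matches += (codes[row] == wanted);                // touch only candidate rows
                bits &= bits - 1;                                 // clear lowest set bit
            }
        }
        std::printf("matches=%zu\n", matches);
    }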
  28. 28. TPC-H challenges : Intermediate steps in MPP • In MPP, a single SQL statement results in multiple SQL statements that get executed locally on each node • Some TPC-DS queries can result in 20+ SQL statements that need to be executed locally on each leaf node • Streaming the data between steps should result in better performance, but there are cases where this strategy fails • Placing the data on disk after each step allows the query optimizer to reevaluate the plan
  29. 29. TPC-H challenges : Improving join performance for incompatible joins • Query: Select count(*) from PART, PARTSUPP, LINEITEM where P_BRAND = 'NIKE' and PS_COMMENT like '%bla%' and P_PARTKEY = PS_PARTKEY and L_PARTKEY = PS_PARTKEY group by P_BRAND • Schema: • PART distributed on P_PARTKEY • PARTSUPP distributed on PS_PARTKEY • LINEITEM distributed on L_ORDERKEY • Plan: create bloom filter BF1 on PART, push the filter onto PARTSUPP and create BF2, replicate that bloom filter to all leaf nodes, apply the filter on LINEITEM and only shuffle the qualifying rows • The optimizer should choose between semi-join reduction and replicating PART x PARTSUPP • Multiple copies of a set of columns, distributed differently, can improve performance for such cases, but at a high cost
  30. 30. Outline TPC-H Schema overview TPC-H Performance measurements Partner engagement TPC-H where is it today TPC-H challenges Looking ahead Q&A
  31. 31. Looking ahead • SQL to map-reduce jobs? Crunching data in a relational database is always faster than Hadoop; bring the data from Hadoop into columnar format and perform the analytics with efficient generated code • Full integration with analytics tools such as SAS, R, Tableau, Excel, etc. • Support PL/SQL syntax (compete with Oracle) • Eliminate the aggregating node to reduce system cost for a small number of nodes; Exasol does it
  32. 32. Competitive analysis
     TPC-H Q1 analysis: Sec/GB/Thread (lower is better), assuming all processors have the same speed!
     Exasol 1TB (240 threads, 20 processors):          1.4
     Exasol 1TB (768 threads, 64 processors):          1.5
     Exasol 3TB (960 threads, 80 processors):          1.5
     MemSQL 83GB (480 threads, 40 sockets):            46.7
     MS SQL Server 10TB (160 threads, 8 processors):   8.1
     Oracle 11c 10TB (512 threads, 4 processors):      40.7
     References: • • memory-database/
  33. 33. Appendix • GMQ 2013: 1DU2VD4&ct=130131&st=sb • GMQ 2014: 1M9YEHW&ct=131028&st=sb
  34. 34. TPC-H column store • Avoid virtual function calls and branching; use templates • Scan usually dominates the CPU profile • Vector/batch processing is a must • If done correctly, the code is very sensitive to branching and data dependencies; exploit instruction-level parallelism when possible • Use SIMD instructions; leverage existing libraries that encapsulate the complexity of SSE instructions, e.g.:
     #include "vectorclass.h"   // Agner Fog's vector class library, which defines Vec4i
     // define and initialize integer vectors a and b
     Vec4i a(10, 11, 12, 13);
     Vec4i b(20, 21, 22, 23);
     // add the two vectors: one SIMD addition covers all four lanes
     Vec4i c = a + b;
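As an illustration of the same idea one level lower, here is a minimal sketch of a SIMD filter written directly with SSE intrinsics, which is exactly the complexity a wrapper library like the one above hides; the column name and the selectivity test are illustrative assumptions. It counts the rows of a 32-bit integer column below a constant, four values per comparison:

    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include <emmintrin.h>   // SSE2 intrinsics

    int main() {
        std::vector<int32_t> l_quantity(1024);
        for (size_t i = 0; i < l_quantity.size(); ++i) l_quantity[i] = (int32_t)(i % 50);

        const __m128i threshold = _mm_set1_epi32(24);            // filter: l_quantity < 24
        size_t matches = 0;
        size_t i = 0;
        for (; i + 4 <= l_quantity.size(); i += 4) {
            __m128i v   = _mm_loadu_si128((const __m128i*)&l_quantity[i]);
            __m128i cmp = _mm_cmplt_epi32(v, threshold);          // per-lane 0xFFFFFFFF / 0
            int mask    = _mm_movemask_ps(_mm_castsi128_ps(cmp)); // one bit per lane
            matches    += (size_t)__builtin_popcount(mask);       // count matching lanes
        }
        for (; i < l_quantity.size(); ++i)                        // scalar tail
            matches += (l_quantity[i] < 24);
        std::printf("matches=%zu\n", matches);
    }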
  35. 35. TPC-H Plans • Behold the power of the optimizer • If the plan is wrong, you are doomed… • A very good read on TPC-H Q8: 2010-keynote-david-dewitt
  36. 36. JSON documents • The most efficient way to store JSON documents • Great compression and quick retrieval; ask me how to ….
  37. 37. Q1 Challenges: • Used as a benchmark for computational power • Arithmetic operation performance • Aggregating into the same hash buckets • Common sub-expression pattern matching • Sensitive to scan performance • String matching for aggregation (could do the matching on the compressed format) select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice*(1-l_discount)) as sum_disc_price, sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3) group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
  38. 38. Q2 Challenges: • Correlated subquery • Push-down of predicates into the correlated subquery • Highly selective (segment size plays a big role) • Tricky to generate the optimal plan • Depending on which tables are partitioned and which are replicated, plan performance varies a lot select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment from part, supplier, partsupp, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = [SIZE] and p_type like '%[TYPE]' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = '[REGION]' and ps_supplycost = ( select min(ps_supplycost) from partsupp, supplier, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = '[REGION]' ) order by s_acctbal desc, n_name, s_name, p_partkey;
  39. 39. Q3 Challenges: • Collocated join between ORDERS & LINEITEM • Detect the correlation between shipdate and orderdate • Bitmap filters on LINEITEM are necessary • Replicating (select c_custkey from customer where c_mktsegment = '[SEGMENT]') select TOP 10 l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = '[SEGMENT]' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '[DATE]' and l_shipdate > date '[DATE]' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate;