Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~

1,908 views

Published on

General introduction of PG-Strom at Oct-2015

Published in: Software

GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~

  1. 1. 1 GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~ PGconf.EU 2015 Wien NEC Business Creation Division The PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
  2. 2. 2 Today’s topic PGconf.EU 2015 - GPGPU Accelerates PostgreSQL Very powerful computing capability Very functional & well-used database PG-Strom: Glue of the both promising technologies GPGPU
  3. 3. 3 Self Introduction ▌Name: KaiGai Kohei ▌Works for:NEC ▌Role:  Lead of the PG-Strom project  Contribution to PostgreSQL and other OSS projects  Making business opportunity using this technology ▌PG-Strom Project  Mission: To deliver the power of semiconductor evolution for every people who want  Launched at Jan-2012, as a personal development project  Project is now funded by NEC  Fully open source project PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  4. 4. 4 Agenda 1. GPU characteristics and why? 2. What intermediates PostgreSQL and GPU 3. How GPU processes SQL workloads 4. Performance results and analysis 5. Further development
  5. 5. 5 Features of GPU (Graphic Processor Unit) ▌Characteristics  Massive parallel cores  Much higher DRAM bandwidth  Better price / performance ratio  advantages for massive but simple arithmetic operations  Different manner to write software algorithm PGconf.EU 2015 - GPGPU Accelerates PostgreSQL SOURCE: CUDA C Programming Guide GPU CPU Model Nvidia GTX TITAN X Intel Xeon E5-2690 v3 Architecture Maxwell Haswell Launch Mar-2015 Sep-2014 # of transistors 8.0billion 3.84billion # of cores 3072 (simple) 12 (functional) Core clock 1.0GHz 2.6GHz, up to 3.5GHz Peak Flops (single precision) 6.6TFLOPS 998.4GFLOPS (with AVX2) DRAM size 12GB, GDDR5 (384bits bus) 768GB/socket, DDR4 Memory band 336.5GB/s 68GB/s Power consumption 250W 135W Price $999 $2,094
  6. 6. 6 How GPU cores works PGconf.EU 2015 - GPGPU Accelerates PostgreSQL ●item[0] step.1 step.2 step.4step.3 Calculation of 𝑖𝑡𝑒𝑚[𝑖] 𝑖=0…𝑁−1 with GPU cores ◆ ● ▲ ■ ★ ● ◆ ● ● ◆ ▲ ● ● ◆ ● ● ◆ ▲ ■ ● ● ◆ ● ● ◆ ▲ ● ● ◆ ● item[1] item[2] item[3] item[4] item[5] item[6] item[7] item[8] item[9] item[10] item[11] item[12] item[13] item[14] item[15] Sum of items[] by log2N steps Inter-core synchronization by HW functionality
  7. 7. 7 Dark Silicon and Semiconductor Trend ▌Processor shrink also increases power leakage. ▌Dark Silicon: area cannot power-on due to thermal problem. ▌People thought: How to utilize these extra transistors? Domain specific processors, like GPU PGconf.EU 2015 - GPGPU Accelerates PostgreSQL Core Core Core Core Core Core Core Core Dark SiliconCore Core Core Core Dark Silicon Core Core Core Core Core Core Next GenNext Gen Intel Haswell on-chip layout (4core system)
  8. 8. 8 Towards heterogeneous era... Software must be designed to intermediate CPU and GPU PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  9. 9. 9 Agenda 1. GPU characteristics and why? 2. What intermediates PostgreSQL and GPU 3. How GPU processes SQL workloads 4. Performance results and analysis 5. Further development
  10. 10. 10 What is PG-Strom ▌Core ideas ① GPU native code generation on the fly ② Asynchronous massive parallel execution ▌Advantages  Transparent acceleration with 100% query compatibility  Commodity H/W and less system integration cost PGconf.EU 2015 - GPGPU Accelerates PostgreSQL Parser Planner Executor Custom- Scan/Join Interface Query: SELECT * FROM l_tbl JOIN r_tbl on l_tbl.lid = r_tbl.rid; PG-Strom CUDA driver nvrtc DMA Data Transfer CUDA Source code Massive Parallel Execution
  11. 11. 11 How query is processed ▌Custom-Scan/Join interface (v9.5 new!) ▌PG-Strom provides own scan/join logics using GPUs ▌It generates same results, but different way PGconf.EU 2015 - GPGPU Accelerates PostgreSQL Query Parser Query Optimizer Query Executor SQL (text) Parse Tree Query Plan Custom Scan/Join Interface PG-Strom Interactions between PostgreSQL and Extensions CUDA code Data Chunk NVIDIA Run-Time Compiler
  12. 12. 12 definition at planning time (OT) Dive into Custom-Join PGconf.EU 2015 - GPGPU Accelerates PostgreSQL Built-in JOIN Custom JOIN CustomScan (scanrelid==0) Logic implemented by extension Inner Scan Outer Scan table-X table-Y Join • NestedLoop • MergeJoin • HashJoin TupleDesc of table-X TupleDesc of table-Y X-tuple Y-tuple joined-tuple TupleDesc of Join X-Y table-Y Something other data source table-X Data Load with extension’s comfortable way joined-tuple compatible tuple format
  13. 13. 13 SQL-to-GPU native code generation (1/3) postgres=# EXPLAIN VERBOSE SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat; QUERY PLAN ------------------------------------------------------------------------------------- HashAggregate (cost=167646.33..167646.36 rows=3 width=12) Output: cat, pgstrom.count((pgstrom.nrows())), pgstrom.avg((pgstrom.nrows((x IS NOT NULL))), (pgstrom.psum(x))) Group Key: t0.cat -> Custom Scan (GpuPreAgg) (cost=24150.22..163315.53 rows=258 width=48) Output: cat, pgstrom.nrows(), pgstrom.nrows((x IS NOT NULL)), pgstrom.psum(x) Bulkload: On (density: 100.00%) Reduction: Local + Global Device Filter: ((t0.x >= t0.y) AND (t0.x <= (t0.y + '20'::double precision))) Features: format: tuple-slot, bulkload: unsupported Kernel Source: /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu -> Custom Scan (BulkScan) on public.t0 (cost=21150.22..159312.94 rows=10000060 width=12) Output: id, cat, aid, bid, cid, did, eid, x, y, z Features: format: heap-tuple, bulkload: supported (13 rows) PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  14. 14. 14 SQL-to-GPU native code generation (2/3) $ less /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu : STATIC_FUNCTION(bool) gpupreagg_qual_eval(kern_context *kcxt, kern_data_store *kds, kern_data_store *ktoast, size_t kds_index) { pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); /* constant 20.0 */ pg_float8_t KVAR_8 = pg_float8_vref(kds,kcxt,7,kds_index); /* var of X */ pg_float8_t KVAR_9 = pg_float8_vref(kds,kcxt,8,kds_index); /* var of Y */ return EVAL(pgfn_boolop_and_2(kcxt, pgfn_float8ge(kcxt, KVAR_8, KVAR_9), pgfn_float8le(kcxt, KVAR_8, pgfn_float8pl(kcxt, KVAR_9, KPARAM_1)))); } : PGconf.EU 2015 - GPGPU Accelerates PostgreSQL  formula expression based on WHERE clause
  15. 15. 15 SQL-to-GPU native code generation (3/3) PGconf.EU 2015 - GPGPU Accelerates PostgreSQL SQL Query Dynamically constructed GPU code ------ Per Query Variant Statically constructed GPU code ------ Algorithm Core NVIDIA Run-Time Compiler (nvrtc) Cached GPU Binary + PG-Strom GPU code generator For Execution For Next Time GpuScan, GpuHashJoin, GpuPreAgg, GpuSort, ...
  16. 16. 16 PG-Strom functionalities at Sep-2015 ▌Logics  GpuScan ... Simple loop extraction by GPU multithread  GpuHashJoin ... GPU multithread based N-way hash-join  GpuNestLoop ... GPU multithread based N-way nested-loop  GpuPreAgg ... Row reduction prior to CPU aggregation  GpuSort ... GPU bitonic + CPU merge, hybrid sorting ▌Data Types  Numeric ... int2/4/8, float4/8, numeric  Date and Time ... date, time(tz), timestamp(tz), interval  Text ... simple matching only, at this moment  others ... currency ▌Functions  Comparison operator ... <, <=, !=, =, >=, >  Arithmetic operators ... +, -, *, /, %, ...  Mathematical functions ... sqrt, log, exp, ...  Aggregate functions ... min, max, sum, avg, stddev, ... PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  17. 17. 17 Agenda 1. GPU characteristics and why? 2. What intermediates PostgreSQL and GPU 3. How GPU processes SQL workloads 4. Performance results and analysis 5. Further development
  18. 18. 18 Asynchronous & Pipelining Execution (1/3) ▌Karma of discrete GPU device  We have to move the data to be processed.  Thus, people say data intensive tasks are not good for GPU.  Correct, but we can reduce the impact.  Our solution: Pipelining PGconf.EU 2015 - GPGPU Accelerates PostgreSQL GPU CPU GPU kernel execution time DMA Send DMA Recv
  19. 19. 19 Asynchronous & Pipelining Execution (2/3) PGconf.EU 2015 - GPGPU Accelerates PostgreSQL chunk-(i+1) chunk-(i+2) chunk-i chunk-(i+3) tablescan Move to Next DMA Recv GPU Kernel Exec DMA Send Buffer Read DMA Recv GPU Kernel Exec DMA Send Buffer Read GPU Kernel Exec DMA Send Buffer Read DMA Send Buffer Read Step-(i+4) Step-(i+3) Step-(i+2) Step-(i+1) Step-i
  20. 20. 20 Asynchronous & Pipelining Execution (3/3) ▌Karma of discrete GPU device  We have to move the data to be processed.  Thus, people say data intensive tasks are not good for GPU.  Correct, but we can reduce the impact  Our solution: Pipelining ▌Downside of the pipelining  Too small workloads  Pipeline hazard PGconf.EU 2015 - GPGPU Accelerates PostgreSQL GPU CPU GPU kernel execution time DMA Send DMA Recv
  21. 21. 21 SQL Logic on GPU (1/3) – GpuHashJoin PGconf.EU 2015 - GPGPU Accelerates PostgreSQL Inner relation Outer relation Inner relation Outer relation Hash table Hash table Next step Next step All CPU does is just references the result of relations join Sequential Hash table search by CPU Sequential Projection by CPU Parallel Projection by GPU Parallel Hash- table search by GPU Built-in Hash-Join GpuHashJoin of PG-Strom
  22. 22. 22 Inner Hash table OT: Ununiformly Distributed Hash-Table PGconf.EU 2015 - GPGPU Accelerates PostgreSQL GPU Core-0 GPU Core-1 GPU Core-2 GPU Core-3 GPU Core-4 GPU Core-5 GPU Core-M Outer relation Item[0] Item[1] Item[2] Item[3] Item[4] Item[5] Item[N] Item[M] Item Item Item Item Item Item Item Item Item Item Inner relation Hash Value Hash Function Item[3] Item Item[3] Item Item[3] Item Item[3] Item Item[3] Item Item[3] Item Item[3] Item Item[3] Item Item[3] Item Item[3] Item Overflow of Result Buffer
  23. 23. 23 Inner Hash table OT: Ununiformly Distributed Hash-Table PGconf.EU 2015 - GPGPU Accelerates PostgreSQL GPU Core-0 GPU Core-1 GPU Core-2 GPU Core-3 GPU Core-4 GPU Core-5 GPU Core-M Outer relation Item[0] Item[1] Item[2] Item[3] Item[4] Item[5] Item[N] Item[M] Item Item Item Item Item Item Item Item Item Item Inner relation Hash Value Hash Function Item[3] Item Item[3] Item Virtual Partition with Extra Randomness Just fit for Result Buffer x Retry for each virtual partition
  24. 24. 24 SQL Logic on GPU (2/3) – GpuNestLoop (unparametalized) Outer-Relation (Nx:usuallylarger) ※splittochunk-by-chunkon demand ● ● ● ● ● ● ● ● ● ● ●●●●●●●Two dimensional GPU kernel launch blockDim.x blockDim.y Ny Nx Thread (X=2, Y=3) Inner-Relation (Ny: relatively small) Only edge thread references DRAM to fetch values. Nx:32 x Ny:32 = 1024 A matrix can be evaluated with only 64 times DRAM accesses PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  25. 25. 25 SQL Logic on GPU (3/3) – GpuPreAgg ▌GpuPreAgg, prior to Aggregate, reduces number of rows to be processed. ▌Query rewrite to make equivalent results from intermediate aggregation. PGconf.EU 2015 - GPGPU Accelerates PostgreSQL GroupAggregate Sort SeqScan Tbl_1 Result Set several rows several millions rows several millions rows Y, count(*), avg(X) X, Y X, Y X, Y, Z (table definition) GroupAggregate Sort GpuScan Tbl_1 Result Set several rows several hundreds rows several millions rows Y, sum(nrows), avg_ex(nrows, psum) Y, nrows, psum_X X, Y X, Y, Z (table definition) GpuPreAgg Y, nrows, psum_X several hundreds rows SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y; reduction of rows to be processed
  26. 26. 26 Agenda 1. GPU characteristics and why? 2. What intermediates PostgreSQL and GPU 3. How GPU processes SQL workloads 4. Performance results and analysis 5. Further development
  27. 27. 27 Microbenchmark (1/2) – total response time ▌SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 [, ...] GROUP BY cat; measurement of query response time with increasing of inner relations  t0: 100M rows, t1~t10: 100K rows for each, all the data was preloaded. ▌Environment:  PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64)  CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores) PGconf.EU 2015 - GPGPU Accelerates PostgreSQL 58.89 82.51 102.15 125.46 159.83 194.50 241.88 12.44 17.88 17.56 21.98 33.33 38.14 50.29 0 50 100 150 200 250 300 2 3 4 5 6 7 8 QueryResponseTime[sec] # of tables involved Query Response Time towards Numer of Tables Joined PostgreSQL 9.5β PG-Strom
  28. 28. 28 Microbenchmark (2/2) – Time consumption per component ▌Good acceleration results towards JOIN and AGGREGATION ▌Less acceleration ratio when N is larger than 6 By inaccurate problem size estimation, why? PGconf.EU 2015 - GPGPU Accelerates PostgreSQL 0 50 100 150 200 250 300 PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom 2 3 4 5 6 7 8 QueryResponseTime[sec] # of tables involved Time consumption per component (PostgreSQL v9.5β vs PG-Strom) Scan Join Aggregate Others
  29. 29. 29 OT: Why accurate problem size estimation is important ▌GpuJoin results must be smaller than GPU RAM size. ▌Virtual inner-partition saves scale of the Join results. Better size estimation reduces number of retries. We can determine the problem size using dynamic parallelism downside: Supported by only Kepler or later generation. PGconf.EU 2015 - GPGPU Accelerates PostgreSQL innerRel (depth-1) innerRel (depth-2) innerRel (depth-3) innerRel (depth-4) outerRel Results × × × × = × × × × = GpuJoin with virtual inner-partition and iteration Larger than GPU RAM Within GPU RAM Size
  30. 30. 30 TPC-DS Benchmark (1/2) – Total Results ▌What is TPC-DS?  Successor of TPC-H, defines 103 (99+4) kind of queries  Simulation of decision support queries in retail industry ▌About this benchmark  Run the series of TPC-DS queries sequentially.  We could collect 96 results of 103.  Environment: • PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64) • CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores) PGconf.EU 2015 - GPGPU Accelerates PostgreSQL 0 500 1000 1500 2000 QueryResponseTime[sec] PostgreSQL v9.5β PG-Strom anomaly?
  31. 31. 31 TPC-DS Benchmark (2/2) – Total Result ▌Summary  PG-Strom wins to PostgreSQL: 76 of 96  PostgreSQL wins to PG-Strom: 20 of 96  shows advantages on most of simple queries ▌Challenges  Wise plan construction, to avoid GpuXXX node not to be here  Functionality improvement for Aggregation and Scan PGconf.EU 2015 - GPGPU Accelerates PostgreSQL 0 100 200 300 400 500 QueryResponseTime[sec] PostgreSQL v9.5β PG-Strom
  32. 32. 32 Time consumption per workloads (1/3) – PostgreSQL v9.5 ▌Scan, Join and Aggregation provides a clear explanation for the time consumption factor in TPC-DS workloads PGconf.EU 2015 - GPGPU Accelerates PostgreSQL 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Time consumption per workloads (PostgreSQL v9.5beta) Scan Join Aggregate Others
  33. 33. 33 Time consumption per workloads (2/3) – PG-Strom ▌PG-Strom dramatically reduced time for Joins. ▌Time for Joins are: 37%  17% PGconf.EU 2015 - GPGPU Accelerates PostgreSQL 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Time consumption per workloads (PostgreSQL v9.5beta) Scan Join Aggregate Others
  34. 34. 34 Time consumption per workloads (3/3) – Wow! GpuJoin GpuJoin can explain 84% of performance improvement PGconf.EU 2015 - GPGPU Accelerates PostgreSQL y = 0.962x - 7.289 R² = 0.8442 -50.00 0.00 50.00 100.00 150.00 200.00 250.00 -50.00 0.00 50.00 100.00 150.00 200.00 250.00 ImprovementofTotalTime[sec] (TotaltimeofPgSQL)–(TotaltimeofPG-Strom) Improvement of Join Time [sec] (Join time of PgSQL) – (Join time of PG-Strom)
  35. 35. 35 Agenda 1. GPU characteristics and why? 2. What intermediates PostgreSQL and GPU 3. How GPU processes SQL workloads 4. Performance results and analysis 5. Further development
  36. 36. 36 What we learn from TPC-DS ▌GpuJoin may be a killer feature of PG-Strom  ....unless problem size estimation is accurate.  Planner needs to be wiser, not to inject GpuJoin towards the situation not suitable. ▌For Aggregation  GpuPreAgg didn’t appear as we expected  Anything GPU can help CPU aggregation? ▌For Scan  Scan performance == data stream bandwidth of GPU input  Single thread is a restriction for bandwidth of data supply PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  37. 37. 37 Features enhancement / improvement ▌Target-list calculation on GPU  Pre-calculation of hash-value for HashAggregate  Projection off-load if complicated mathematical expression ▌Custom-Scan under Gather node  Allows multiple workers to provide data stream to GPUs ▌Partial Aggregation before Join  Wider utilization of GpuPreAgg to reduce number of rows to be joined by CPU ▌Utilization of dynamic parallelism  Wise approach towards result buffer overflow problem in Join ▌... and more... PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  38. 38. 38 Short term development 1. Software quality improvement 2. Corner case handling 3. Documentation 4. Target-List calculation in GPU 5. Software quality improvement PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  39. 39. 39 WIP Project – In-place Computing Why export entire data to run mathematical algorithm? ▌Runs complicated mathematical expression on GPUs, then CPU just references the result  Algorithms in SQL (even partially) automatically moves to GPUs  WIP: Data-Clustering algorithms, like Min-Max, K-means, ... PGconf.EU 2015 - GPGPU Accelerates PostgreSQL mathematical algorithm Things wants to know Things wants to know result result mathematical algorithm Principle of Near-Data Computing Limited computing capability Data Export
  40. 40. 40 Future Direction of PG-Strom Real-life workloads are the only compass for development PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  41. 41. 41 WE WANT YOU GET INVOLVED ▌PostgreSQL Users  who concern about query execution performance on your workloads ▌Application Developers  who are interested in trying yet another database acceleration methodology ▌PostgreSQL Developers  who can bet the future of heterogeneous era PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  42. 42. 42 Resources ▌Github  https://github.com/pg-strom/devel  https://github.com/pg-strom/devel/issues ▌Documentation  Under construction.... sorry ▌Today’s slides  Uploaded soon, check #pgconfeu on Tw ▌Author’s contact  e-mail: kaigai@ak.jp.nec.com  Tw: @kkaigai PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
  43. 43. 43

×