KaiGai Kohei <kaigai@kaigai.gr.jp> 
(Tw: @kkaigai)
Utilization of GPU computing capability is the key to performance.
Data should be placed in semiconductor memory rather than on magnetic disk.
GPUs are a well-validated, long-standing technology, ready for enterprise usage.
Processor 
Memory 
All major vendors are moving toward heterogeneous architectures because of power consumption and thermal problems.
Stacked DRAM allows the trend of memory-capacity expansion to continue over the next several years.
How will software evolve on the next generation of hardware?
GPU 
Bioinformatics 
Simulation 
GIS 
Nvidia Corporation 
GPUs, which originated as graphics accelerators, have expanded the workloads they support; nowadays they are a leading player in the HPC field.
PG-Strom intends to be a pioneer in bringing this semiconductor trend into the world of RDBMS
CPU 
vanilla PostgreSQL 
PostgreSQL + PG-Strom 
Iteration of row-fetch and evaluation of WHERE clause 
Fetch multiple rows at once
Async DMA send & Async Kernel exec 
OpenCL runtime compiler 
: Fetch a row from shared buffer 
: evaluation of WHERE clause 
Due to parallel execution, evaluation time is shortened
CPU 
GPU 
Synchronization 
SELECT * FROM table WHERE sqrt((x-256)^2 + (y-100)^2) < 10; 
code for GPU/MIC 
Automatic generation of GPU code from the user-given SQL statement, followed by just-in-time compilation (see the kernel sketch below)
Shorter response time than CPU execution 
GPU/MIC device
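As an illustration of the code-generation step, here is a hypothetical sketch of the kind of OpenCL kernel such a generator might emit for the WHERE clause above. It is not PG-Strom's actual generated code: the buffer layout is simplified to two float columns plus a per-row match flag, and all names are made up. One work-item evaluates the qualifier for one row; the source string would be handed to the OpenCL runtime compiler just in time.

/* Hypothetical sketch of generated qualifier code (not PG-Strom's actual
 * output): the column buffers x[], y[] and the matched[] flag array are
 * simplified assumptions. One work-item evaluates the WHERE clause for
 * one row of the chunk. */
__kernel void qual_eval(__global const float *x,
                        __global const float *y,
                        __global char *matched,
                        const uint nrows)
{
    uint row = get_global_id(0);

    if (row < nrows)
    {
        float dx = x[row] - 256.0f;
        float dy = y[row] - 100.0f;

        /* sqrt((x-256)^2 + (y-100)^2) < 10 */
        matched[row] = (sqrt(dx * dx + dy * dy) < 10.0f) ? 1 : 0;
    }
}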
shared memory segment 
heap buffer 
columnar cache 
message queue 
heap 
storage 
custom nodes 
•GpuScan 
•GpuHashJoin 
•GpuPreAgg 
GPU code generator 
OpenCL server 
device manager 
shared memory allocator 
Query 
Executor 
Query 
Planner 
custom-plan 
interface 
Query Parser 
background worker
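The diagram above only names the moving parts, so here is a conceptual, single-process C sketch of the hand-off it implies: a backend puts a chunk-processing request onto a message queue in the shared memory segment, and the OpenCL server (a background worker) picks it up and launches the kernel. The structures and function names below are hypothetical illustrations, not PG-Strom's actual interfaces; real code would live in shared memory and add locking.

/* Conceptual ring-buffer message queue between backend and OpenCL server.
 * All names are hypothetical; this is not PG-Strom source. */
#include <stdio.h>

#define QUEUE_LEN 16

typedef struct {
    int chunk_id;      /* which block of heap / columnar-cache rows to process */
    int num_rows;      /* rows contained in the chunk */
} chunk_request;

typedef struct {
    chunk_request items[QUEUE_LEN];
    int head, tail;    /* in a real system this lives in shared memory,
                        * guarded by locks or atomic operations */
} message_queue;

static int enqueue(message_queue *q, chunk_request req)
{
    if ((q->tail + 1) % QUEUE_LEN == q->head)
        return 0;                          /* queue full */
    q->items[q->tail] = req;
    q->tail = (q->tail + 1) % QUEUE_LEN;
    return 1;
}

static int dequeue(message_queue *q, chunk_request *req)
{
    if (q->head == q->tail)
        return 0;                          /* queue empty */
    *req = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_LEN;
    return 1;
}

int main(void)
{
    message_queue q = { .head = 0, .tail = 0 };

    /* backend side: hand off two chunks for asynchronous GPU processing */
    enqueue(&q, (chunk_request){ .chunk_id = 0, .num_rows = 65536 });
    enqueue(&q, (chunk_request){ .chunk_id = 1, .num_rows = 65536 });

    /* OpenCL-server side: pick up requests and (here) just report them */
    chunk_request req;
    while (dequeue(&q, &req))
        printf("launch kernel for chunk %d (%d rows)\n",
               req.chunk_id, req.num_rows);
    return 0;
}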
                            CPU                      GPU
model                       Xeon E5-2690v2           Nvidia Tesla K20
generation                  Q3-2013                  Q4-2012
# of cores                  10                       2496
core clocks                 3.0GHz                   706MHz
code set                    x86_64 (functional)      nv-kepler (simple)
peak flops (single/double)  480GFlops / 240GFlops    3.52TFlops / 1.17TFlops
memory size                 up to 768GB              5GB
memory band                 59.7GB/s                 208GB/s
TDP                         130W                     225W
price                       $2100                    $2700
Floating-Point Operations per Second for the CPU and GPU 
The GPU Devotes More Transistors to Data Processing 
The CPU devotes more transistors to feature-rich ALUs and cache, while the GPU devotes more transistors to simple ALUs.
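As a rough cross-check of the peak-flops row above (assuming one fused multiply-add, i.e. 2 flops, per GPU core per cycle): 2496 cores × 706MHz × 2 ≈ 3.52 TFlops single precision, and Kepler's 1/3 double-precision rate gives ≈ 1.17 TFlops. On the CPU side, 10 cores × 3.0GHz × 16 single-precision flops per cycle (8-wide AVX multiply plus add) ≈ 480 GFlops, and half that vector width in double precision gives 240 GFlops.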
[Figure: pairwise parallel reduction of item[0]..item[15] over step.1 to step.4 – calculation of Σ(i=0...N-1) item[i] with N GPU cores; the sum of items[] is obtained in log2(N) steps]
The reduction algorithm fits well with the aggregate functionality of an RDBMS
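A minimal OpenCL C sketch of this log2(N)-step reduction follows (not PG-Strom source; the names and the use of a local scratch buffer are assumptions). Each work-group sums its slice of item[] in local memory, halving the number of active work-items at every step; the per-group partial sums would then be reduced again or summed on the CPU.

/* Pairwise reduction within a work-group; assumes the work-group size is a
 * power of two. lbuf is a __local scratch buffer sized to the work-group. */
__kernel void sum_reduction(__global const float *item,
                            __global float *partial_sum,
                            __local  float *lbuf,
                            const uint nitems)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    /* load one element per work-item (0 if out of range) */
    lbuf[lid] = (gid < nitems) ? item[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* pairwise reduction: log2(local_size) steps */
    for (uint stride = get_local_size(0) / 2; stride > 0; stride >>= 1)
    {
        if (lid < stride)
            lbuf[lid] += lbuf[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* work-item 0 writes this work-group's partial sum */
    if (lid == 0)
        partial_sum[get_group_id(0)] = lbuf[0];
}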
Working example – It demonstrates that GPU acceleration in an RDBMS brings significant performance advantages.
Full-scan, Hash-join and Sorting were implemented
3x-23x faster query response time than vanilla PostgreSQL
Things we learned – reducing the number of rows to be processed by the CPU is the key to higher performance
2014CY2Q → 2014CY3Q → 2014CY4Q
Beta development – It develops a minimum viable product for evaluation purposes.
Full-scan, Hash-join and Aggregation will be supported
Things we are learning – materialization is still expensive, and the columnar cache leads to over-consumption of shared memory
Public beta – It tries to gather potential use cases where people can apply PG-Strom to their problems.
More functionality may be implemented depending on the feedback.
Things we want to learn – what kinds of use cases can be expected, and who the prospective customers are.
Tuple materialization was a key factor in the performance bottleneck
Minimizing the number of tuple materializations makes sense
GpuHashJoin loads multiple tables onto a hash table, then joins them at once (see the sketch after the plan comparison below).
Materialization happens only once, even when more than three tables are joined
Vanilla PostgreSQL (a materialization at every join):
  HashJoin                        → result set
    -> HashJoin
         -> SeqScan on Tbl_1
         -> Hash
              -> SeqScan on Tbl_2
    -> Hash
         -> SeqScan on Tbl_3

PostgreSQL + PG-Strom:
  GpuHashJoin                     → result set
    -> GpuScan on Tbl_3
    -> MultiHash
         -> SeqScan on Tbl_1
         -> SeqScan on Tbl_2
Data exchange while keeping the column format
SELECT * FROM Tbl_1, Tbl_2, Tbl_3 WHERE .....; 
Data format transformation only once
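To make the "join at once" idea concrete, here is a conceptual, CPU-only C sketch (not PG-Strom source; tables, keys and payloads are invented): both inner tables are loaded into hash tables up front, and a single pass over the outer table probes them together, so the joined tuple is assembled only once.

/* Conceptual multi-table hash join: build hash tables for two inner tables,
 * then probe both in one scan of the outer table. */
#include <stdio.h>

#define HASH_SIZE 8

typedef struct { int key; const char *payload; int used; } hash_entry;

static hash_entry tbl1_hash[HASH_SIZE];   /* inner table 1 */
static hash_entry tbl2_hash[HASH_SIZE];   /* inner table 2 */

static void hash_insert(hash_entry *tbl, int key, const char *payload)
{
    int idx = key % HASH_SIZE;            /* trivial open addressing */
    while (tbl[idx].used)
        idx = (idx + 1) % HASH_SIZE;
    tbl[idx] = (hash_entry){ key, payload, 1 };
}

static const hash_entry *hash_probe(const hash_entry *tbl, int key)
{
    int idx = key % HASH_SIZE;
    while (tbl[idx].used)
    {
        if (tbl[idx].key == key)
            return &tbl[idx];
        idx = (idx + 1) % HASH_SIZE;
    }
    return NULL;
}

int main(void)
{
    /* build phase: load both inner tables onto hash tables (cf. MultiHash) */
    hash_insert(tbl1_hash, 1, "tbl1-a");
    hash_insert(tbl1_hash, 2, "tbl1-b");
    hash_insert(tbl2_hash, 1, "tbl2-x");
    hash_insert(tbl2_hash, 2, "tbl2-y");

    /* probe phase: one scan of the outer table joins both inners at once,
     * so the result tuple is materialized exactly once */
    int outer_keys[] = { 1, 2, 3 };
    for (int i = 0; i < 3; i++)
    {
        const hash_entry *e1 = hash_probe(tbl1_hash, outer_keys[i]);
        const hash_entry *e2 = hash_probe(tbl2_hash, outer_keys[i]);
        if (e1 && e2)
            printf("key=%d: %s, %s\n", outer_keys[i], e1->payload, e2->payload);
    }
    return 0;
}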
GpuPreAgg intends to reduce the number of rows to be processed by the CPU (a merge sketch follows the example query below).
The GPU's DRAM has less capacity than the CPU's, so running a global sort on the GPU is not a good idea; local grouping and reduction is its best fit.
Because sorting costs O(N log N), handling 1/N of the data takes less than 1/N of the time.
Vanilla PostgreSQL (some million rows reach the CPU):
  GroupAggregate: count(*), avg(X), Y            → result set (some rows)
    -> Sort: X, Y                                (some million rows)
         -> SeqScan on Tbl_1: X, Y               (some million rows)
            Tbl_1: X, Y, Z (table definition)

PostgreSQL + PG-Strom (only some hundred rows reach the CPU):
  GroupAggregate: sum(nrows), avg_ex(nrows, psum), Y   → result set (some rows)
    -> Sort: Y, nrows, psum_X                    (some hundred rows)
         -> GpuPreAgg: Y, nrows, psum_X          (some hundred rows)
              -> GpuScan on Tbl_1: X, Y          (some million rows)
                 Tbl_1: X, Y, Z (table definition)
SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;
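A conceptual C sketch of the final merge step follows (not PG-Strom source; the sample groups and values are invented): the GPU side emits per-chunk partials (nrows, psum_X) per group key Y, and the CPU combines them so that count(*) = sum(nrows) and avg(X) = sum(psum_X) / sum(nrows).

/* Merge per-chunk partial aggregates into the final count(*) and avg(X). */
#include <stdio.h>

typedef struct {
    int    y;        /* GROUP BY key */
    long   nrows;    /* partial row count for this group */
    double psum_x;   /* partial sum of X for this group */
} partial_agg;

int main(void)
{
    /* partial results, as if produced by GpuPreAgg for two chunks */
    partial_agg partials[] = {
        { .y = 1, .nrows = 3, .psum_x = 30.0 },   /* chunk 0 */
        { .y = 2, .nrows = 2, .psum_x = 14.0 },
        { .y = 1, .nrows = 1, .psum_x = 12.0 },   /* chunk 1 */
        { .y = 2, .nrows = 4, .psum_x = 26.0 },
    };

    /* final aggregation on the CPU: merge partials per group key */
    for (int key = 1; key <= 2; key++)
    {
        long   nrows = 0;
        double psum  = 0.0;
        for (int i = 0; i < 4; i++)
        {
            if (partials[i].y == key)
            {
                nrows += partials[i].nrows;
                psum  += partials[i].psum_x;
            }
        }
        printf("Y=%d: count=%ld avg=%.2f\n", key, nrows, psum / nrows);
    }
    return 0;
}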
Jul-2014 – A test implementation 
◦Technical validation of GPU acceleration on RDBMS 
◦Full-scan, Hash-Join, Sorting 
3x-23x faster than vanilla PostgreSQL
Nov-2014 – A working beta 
◦Target of evaluation on a series of PoC efforts 
◦Full-scan, Hash-Join, Aggregate 
Query response time [ms]
              Full-Scan    Hash-Join    Aggregate
PostgreSQL     8237.694    36206.565     7954.771
PG-Strom        545.516     8092.828     2071.513

Test results in the working beta (as of Sep-2014)
Test Environment 
Server: NEC Express5800 HR120b-1 
CPU: Xeon(R) CPU E5-2640 x2 
RAM: 256GB 
GPU: NVIDIA GTX 750 Ti (640 CUDA cores)
Each query runs on a table with 20M records. The hash-join workload joins 3 other tables with 40K rows.
Query response time comes from “execution time” in EXPLAIN ANALYZE
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)