NVME and GPU accelerates PostgreSQL beyond 10GB/s query execution

NVMEandGPUaccelerates PostgreSQL beyondthelimitation
〜Our challenge to the 10GB/s for query execution performance〜
HeteroDB,Inc
Chief Architect & CEO
KaiGai Kohei <kaigai@heterodb.com>

Here are mysterious benchmark results
NVME and GPU accelerates PostgreSQL beyond the limitation - DB Tech Showcase Tokyo 20182
Benchmark conditions:
 By the PostgreSQL v11beta3 + PG-Strom v2.1devel on a single-node server system
 13 queries of Star-schema benchmark onto the 1055GB data set
0
2,000
4,000
6,000
8,000
10,000
12,000
14,000
Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3
QueryExecutionThroughput[MB/s]
Star Schema Benchmark for PostgreSQL 11beta3 + PG-Strom v2.1devel
PG-Strom v2.1devel
max 13.5GB/s in query execution throughput on single-node PostgreSQL

about HeteroDB,Inc
Corporate overview
 Name HeteroDB,Inc
 Established 4th-Jul-2017
 Headcount 2 (KaiGai and Kashiwagi)
 Location Shinagawa, Tokyo, Japan
 Businesses Sales of accelerated database product
Technical consulting on GPU&DB region
By the heterogeneous-computing technology on the database area,
we provides a useful, fast and cost-effective data analytics platform
for all the people who need the power of analytics.
CEO Profile
 KaiGai Kohei – He has contributed for PostgreSQL and Linux kernel
development in the OSS community more than ten years, especially,
for security and database federation features of PostgreSQL.
 Award of “Genius Programmer” by IPA MITOH program (2007)
 The top-5 posters finalist at GPU Technology Conference 2017.

Features of RDBMS
 High-availability / Clustering
 DB administration and backup
 Transaction control
 BI and visualization
 We can use the products that
support PostgreSQL as-is.
Core technology – PG-Strom
PG-Strom: An extension module for PostgreSQL, to accelerate SQL
workloads by the thousands cores and wide-band memory of GPU.
GPU
Big-data Analytics
PG-Strom
Mass data loading from
the storage device rapidly
Machine-learning & Statistics

Characteristics of GPU (1/3)
Highly parallel computing processor with thousands cores and
hundreds GB/s memory band on a single chip
CPU
Like a passenger vehicle; well utilizable
but less transportation capacity.
GPU
Like a high-speed railway; a little bit troublesome to
get in or out, but capable for mass-transportation.
Model Intel Xeon Platinum 8180M NVIDIA Tesla V100
Architecture Skylake-SP Volta
# of cores 28 (functional) 5120 (simple)
Performance (FP32) 2.24 TFLOPS (with AVX2) 15.0TFLOPS
Memory capacity max 1.5TB (DDR4) 16GB (HBM2)
Memory band 127.8GB/s 900GB/s
TDP 205W 300W

Characteristics of GPU (2/3) – Reduction algorithm
●item[0]
step.1 step.2 step.4step.3
Calculation of the total
sum of an array by GPU
Σi=0...N-1item[i]
◆
●
▲ ■ ★
● ◆
●
● ◆ ▲
●
● ◆
●
● ◆ ▲ ■
●
● ◆
●
● ◆ ▲
●
● ◆
●
item[1]
item[2]
item[3]
item[4]
item[5]
item[6]
item[7]
item[8]
item[9]
item[10]
item[11]
item[12]
item[13]
item[14]
item[15]
Total sum of items[]
with log2N steps
Inter-cores synchronization with hardware support
SELECT count(X),
sum(Y),
avg(Z)
FROM my_table;
Same logic is internally used to
implement aggregate function.
DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -6

Characteristics of GPU (3/3)
Over 10years history in HPC, then massive popularization in Machine-Learning
NVIDIA Tesla V100
Super Computer
(TITEC; TSUBAME3.0) Computer Graphics Machine-Learning
Today’s Topic
How I/O workloads are accelerated by GPU that is a computing accelerator?
Simulation

How PostgreSQL utilizes GPU?
〜Architecture of PG-Strom〜

Construction of query execution plan in PostgreSQL (1/2)
Scan
t0 Scan
t1
Scan
t2
Join
t0,t1
Join
(t0,t1),t2
GROUP
BY cat
ORDER
BY score
LIMIT
100

Construction of query execution plan in PostgreSQL (2/2)
Scan
t0 Scan
t1
Join
t0,t1
Statistics)
nrows: 1.2M
width: 80
Index: none
candidate
HashJoin
cost=4000
candidate
MergeJoin
cost=12000
candidate
NestLoop
cost=99999
candidate
Parallel
Hash Join
cost=3000
candidate
GpuJoin
cost=2500
WINNER!
Built-in execution path of PostgreSQLProposition by extensions
（since PostgreSQL v9.5）
（since PostgreSQL v9.6）
GpuJoin
t0,t1
Statistics)
nrows: 4000
width: 120
Index: t1.id
Competition of multiple algorithms, then chosen by the “cost”.

Interactions between PostgreSQL and PG-Strom with CustomScan
As long as consistent results are made, implementation is flexible.
CustomScan (GpuJoin)
(*BeginCustomScan)(...)
(*ExecCustomScan)(...)
(*EndCustomScan)(...)
:
SeqScan
on t0
SeqScan
on t1
GroupAgg
key: cat
ExecInitGpuJoin(...)
 Initialize GPU context
 Kick asynchronous JIT compilation of
the GPU program auto-generated
ExecGpuJoin(...)
 Read records from the t0 and t1, and
copy to the DMA buffer
 Kick asynchronous GPU tasks
 Fetch results from the completed GPU
tasks, then pass them to the next step
(GroupAgg)
ExecEndGpuJoin(...)
 Wait for completion of the
asynchronous tasks (if any)
 Release of GPU resource

Auto generation of GPU code from SQL - Example of WHERE-clause
QUERY: SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
kern_data_store *kds,
size_t kds_index)
{
pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
pgfn_float8le(kcxt, KVAR_3,
pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
} :
E.g) Transformation of the numeric-formula
in WHERE-clause to CUDA C code on demand
Reference to input data
SQL expression in CUDA source code
Run-time
compiler
Parallel
Execution

EXPLAIN shows query execution plan
postgres=# EXPLAIN ANALYZE
SELECT cat,count(*),sum(ax) FROM tbl NATURAL JOIN t1 WHERE cid % 100 < 50 GROUP BY cat;
QUERY PLAN
---------------------------------------------------------------------------------------------------
GroupAggregate (cost=203498.81..203501.80 rows=26 width=20)
(actual time=1511.622..1511.632 rows=26 loops=1)
Group Key: tbl.cat
-> Sort (cost=203498.81..203499.26 rows=182 width=20)
Sort Key: tbl.cat
Sort Method: quicksort Memory: 27kB
-> Custom Scan (GpuPreAgg) (cost=203489.25..203491.98 rows=182 width=20)
Reduction: Local
Combined GpuJoin: enabled
-> Custom Scan (GpuJoin) on tbl (cost=13455.86..220069.26 rows=1797115 width=12)
(never executed)
Outer Scan: tbl (cost=12729.55..264113.41 rows=6665208 width=8)
Outer Scan Filter: ((cid % 100) < 50)
Rows Removed by Outer Scan Filter: 10047462
Depth 1: GpuHashJoin (plan nrows: 6665208...1797115,
actual nrows: 9948078...2473997)
HashKeys: tbl.aid
JoinQuals: (tbl.aid = t1.aid)
KDS-Hash (size plan: 11.54MB, exec: 7125.12KB)
-> Seq Scan on t1 (cost=0.00..2031.00 rows=100000 width=12)
Planning Time: 0.721 ms
Execution Time: 1595.815 ms
(19 rows)
What’s happen?

GpuScan + GpuJoin + GpuPreAgg Combined Kernel (1/3)
Aggregation
GROUP BY
JOIN
SCAN
SELECT cat, count(*), avg(x)
FROM t0 JOIN t1 ON t0.id = t1.id
WHERE y like ‘%abc%’
GROUP BY cat;
count(*), avg(x)
GROUP BY cat
t0 JOIN t1
ON t0.id = t1.id
WHERE y like ‘%abc%’
results
GpuScan
GpuJoin
Agg
+
GpuPreAgg
SeqScan
HashJoin
Agg

GpuScan
kernel
GpuJoin
kernel
GpuPreAgg
kernel
DMA
Buffer
GPU
CPU
Storage
Simple replacement of the logics makes ping-pong of
data-transfer between CPU and GPU
DMA
Buffer
DMA
Buffer
DMA
Buffer
Agg
(PostgreSQL)
results

GpuScan
kernel
GpuJoin
kernel
GpuPreAgg
kernel
DMA
Buffer
GPU
CPU
Storage
Save the data-transfer by data exchange on the GPU device memory
GPU
Buffer
GPU
Buffer results
DMA
Buffer
Agg
(PostgreSQL)
A combined GPU kernel for SCAN + JOIN + GROUP BY
data size
= Large
data size
= Small
Usually, amount of data size to be written back from GPU
is much smaller than the data size sent to GPU

Re-definition of the GPU’s role
〜How GPU accelerates I/O workloads〜

A usual composition of x86_64 server
GPUSSD
CPU
RAM
HDD
N/W

Data flow to process a massive amount of data
CPU RAM
SSD GPU
PCIe
PostgreSQL
Data Blocks
Normal Data Flow
All the records, including junks, must be loaded
onto RAM once, because software cannot check
necessity of the rows prior to the data loading.
So, amount of the I/O traffic over PCIe bus tends
to be large.
Unless records are not loaded to CPU/RAM once, over the PCIe bus,
software cannot check its necessity even if it is “junk”.

Core Feature: SSD-to-GPU Direct SQL
CPU RAM
SSD GPU
PCIe
PostgreSQL
Data Blocks
NVIDIA GPUDirect RDMA
It allows to load the data blocks on NVME-SSD
to GPU using peer-to-peer DMA over PCIe-bus;
bypassing CPU/RAM. WHERE-clause
JOIN
GROUP BY
Run SQL by GPU
to reduce the data size
Data Size: Small
v2.0

Benchmark Results – single-node version
2172.3 2159.6 2158.9 2086.0 2127.2 2104.3
1920.3
2023.4 2101.1 2126.9
1900.0 1960.3
2072.1
6149.4 6279.3 6282.5
5985.6 6055.3 6152.5
5479.3
6051.2 6061.5 6074.2
5813.7 5871.8 5800.1
0
1000
2000
3000
4000
5000
6000
7000
Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3
QueryProcessingThroughput[MB/sec]
Star Schema Benchmark on NVMe-SSD + md-raid0
PgSQL9.6(SSDx3) PGStrom2.0(SSDx3) H/W Spec (3xSSD)
SSD-to-GPU Direct SQL pulls out an awesome performance close to the H/W spec
 Measurement by the Star Schema Benchmark; which is a set of typical batch / reporting workloads.
 CPU: Intel Xeon E5-2650v4, RAM: 128GB, GPU: NVIDIA Tesla P40, SSD: Intel 750 (400GB; SeqRead 2.2GB/s)x3
 Size of dataset is 353GB (sf: 401), to ensure I/O bounds workload

Element technology - GPUDirect RDMA (1/2)
▌P2P data transfer technology between GPU and other PCIe devices, bypass CPU
 Originally designed for multi-nodes MPI over Infiniband
 Infrastructure of Linux kernel driver for other PCIe devices, including NVME-SSDs.
Copyright (c) NVIDIA corporation, 2015

Element technology - GPUDirect RDMA (2/2)
Physical
address space
PCIe BAR1 Area
GPU
device
memory
RAM
NVMe-SSD Infiniband
HBA
PCIe device
GPUDirect RDMA
It enables to map GPU device
memory on physical address
space of the host system
Once “physical address of GPU device memory”
appears, we can use is as source or destination
address of DMA with PCIe devices.
0xf0000000
0xe0000000
DMA Request
SRC: 1200th sector
LEN: 40 sectors
DST: 0xe0200000

SSD-to-GPU Direct SQL - Software Stack
Tesla GPU
NVIDIA
CUDA
Toolkit
Filesystem
(ext4, xfs)
nvme driver (inbox)
nvme_strom
kernel
module
NVMe SSD drives
commodity x86_64 hardware
NVIDIA GPUDirect RDMA
NVIDIA
kernel
driver
PostgreSQL
pg_strom
extension
read(2) ioctl(2)
Hardware
Layer
Operating
System
Software
Layer
Database
Software
Layer
Application Software
SQL Interface
I/O path based on
normal filesystem
I/O path based on
SSD-to-GPU Direct SQL Execution
■ User’s Applications
■ Software developed by others
■ Software developed by HDB
■ Hardware
v2.0

Run faster, beyond the limitation

Approach① – Faster NVME-SSD (1/2)
Intel DC P4600 (2.0TB, HHHL)
SeqRead: 3200MB/s, SeqWrite: 1575MB/s
RandRead: 610k IOPS, RandWrite: 196k IOPS
Interface: PCIe 3.0 (x4)
May I pull out the maximum performance of
them?

Approach① – Faster NVME-SSD (1/2)
Broadwell-EP is capable up to 7.1GB/s for P2P DMA routing performance.

Approach② – The latest CPU generation
Supermicro 1019GP-TT
CPU: Xeon Gold 6126T (2.6GHz, 12C)
RAM: 192GB (32GB DDR4-2666 x6)
GPU: NVIDIA Tesla P40 (3840C, 24GB) x1
SSD: Intel SSD DC P4600 (2.0TB, HHHL) x3
HDD: 2.0TB (SATA, 72krpm) x6
N/W: 10Gb ethernet x2ports

Approach② – The latest CPU generation
Skylake-SP improved the P2P DMA routing performance to 8.5GB/s.
GPU
SSD-1
SSD-2
SSD-3
md-raid0
Xeon
Gold
6126T
routing
by CPU

Consideration for the hardware configuration
① SSD and GPU are connected
to the same PCIe-switch
OK
② CPU controls PCIe-bus, and
SSD and GPU are directly
connected to the same CPU
Workable
③ SSD and GPU are connected
to the different CPUs
Not Supported
CPU CPU
PLX
SSD GPU
PCIe-switch
CPU CPU
SSD GPU
CPU CPU
SSD GPU
QPI
A pair of SSD and GPU must be under a particular CPU or PLX(PCIe-switch).
PLX is more preferable than CPU.
Which kind of the hardware can provide
optimal data path with PCIe-switch?

Simple solution) HPC servers optimized for RDMA
NVIDIA GPUDirect RDMA is originally designed for multi-node MPI.
Some HPC servers are optimized to P2P DMA between GPU and Infiniband HBA
Supermicro SYS-4029GP-TRT2

Practical solution) Utilization of I/O Expansion Box
Reduction of the traffic on PCIe-bus by detachment of data flow from SSD to GPU
NEC ExpEther 40G (4slot)
slot-0
slot-1
slot-2
slot-3
PCIe
switch
slot-0
slot-1
slot-2
slot-3
PCIe
switch
GPU-0
GPU-1
SSD-0
SSD-1
SSD-2
SSD-3
Host
RAM
HBA0
HBA1
CPU
Small data (= less traffic)

Optimization of the Storage Path
with I/O Expansion Box

System configuration with I/O expansion boxes
PCIe I/O Expansion Box
Host System
(x86_64 server)
NVMe SSD
PostgreSQL Tables
PostgreSQL
Data Blocks
Internal
PCIe Switch
SSD-to-GPU P2P DMA
(Large data size)
GPU
WHERE-clause
JOIN
GROUP BY
PCIe over
Ethernet
Pre-processed
small data
A few GB/s
SQL execution
performance
per box
A few GB/s
SQL execution
performance
per box
A few GB/s
SQL execution
performance
per box
NIC / HBA
Simplified DB operations and APP development
by the simple single-node PostgreSQL configuration
Enhancement of capacity & performance
Visible as leafs of partitioned child-tables on PostgreSQL
v2.1

Table Partitioning considering the hardware (1/2)
lineorder
lineorder_p0
lineorder_p1
lineorder_p2
reminder=0
reminder=1
reminder=2
customer date
supplier parts
tablespace: nvme0
tablespace: nvme1
tablespace: nvme2
Associate partition-leafs with tablespaces and I/O expansion boxes
v2.1

Table Partitioning considering the hardware (2/2)
lineorder
lineorder_p0
lineorder_p1
lineorder_p2
reminder=0
reminder=1
reminder=2
customer date
supplier parts
tablespace: nvme0
tablespace: nvme1
tablespace: nvme2
New in PostgreSQL v11: Data distribution by Hash Partitioning
key
INSERT Hashed key
hash = f(key)
hash % 3 = 2
hash % 3 = 0
Raw data
1053GB
Partial data
351GB
Partial data
351GB
Partial data
351GB
v2.1

Partition-wise GpuJoin/GpuPreAgg（1/3）
lineorder
lineorder_p0
lineorder_p1
lineorder_p2
reminder=0
reminder=1
reminder=2
customer date
supplier parts
tablespace: nvme0
tablespace: nvme1
tablespace: nvme2
New in PostgreSQL v11: Parallel scan of the partition leafs
Scan
Scan
Scan
Gather
Join
Agg
Query
Results
Scan
Massive records
makes hard to gather
v2.1

lineorder
lineorder_p0
lineorder_p1
lineorder_p2
reminder=0
reminder=1
reminder=2
customer date
supplier parts
tablespace: nvme0
tablespace: nvme1
tablespace: nvme2
Preferable: Gathering the partition-leafs next to JOIN / GROUP BY
Join
Gather
Agg
Query
Results
Scan
Scan
PreAgg
Join
Scan
PreAgg
Join
Scan
PreAgg
v2.1

ssbm =# EXPLAIN SELECT sum(lo_extendedprice*lo_discount) as revenue
FROM lineorder,date1
WHERE lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount between 1 and 3
AND lo_quantity < 25;
QUERY PLAN
------------------------------------------------------------------------------
Aggregate
-> Gather
Workers Planned: 9
-> Parallel Append
-> Parallel Custom Scan (GpuPreAgg)
Reduction: NoGroup
GPU Preference: GPU2 (Tesla P40)
-> Parallel Custom Scan (GpuJoin) on lineorder_p2
Outer Scan: lineorder_p2
Outer Scan Filter: ((lo_discount >= '1'::numeric) AND (lo_discount <= '3'::numeric)
AND (lo_quantity < '25'::numeric))
Depth 1: GpuHashJoin (nrows 102760469...45490403)
HashKeys: lineorder_p2.lo_orderdate
JoinQuals: (lineorder_p2.lo_orderdate = date1.d_datekey)
KDS-Hash (size: 66.03KB)
NVMe-Strom: enabled
-> Seq Scan on date1
Filter: (d_year = 1993)
-> Parallel Custom Scan (GpuPreAgg)
Reduction: NoGroup
:
Portion to be executed
on the 3rd I/O expansion box.
v2.1

Distance between SSD and GPU (1/2)
lineorder
lineorder_p0
lineorder_p1
lineorder_p2
reminder=0
reminder=1
reminder=2
customer date
supplier parts
tablespace: nvme0
tablespace: nvme1
tablespace: nvme2
GPU selection based on the distance from SSD where PG-Strom tries to scan
Good
Not Good
v2.1

Distance between SSD and GPU (2/2)
$ pg_ctl restart
:
LOG: - PCIe[0000:80]
LOG: - PCIe(0000:80:02.0)
LOG: - PCIe(0000:83:00.0)
LOG: - PCIe(0000:84:00.0)
LOG: - PCIe(0000:85:00.0) nvme0 (INTEL SSDPEDKE020T7)
LOG: - PCIe(0000:84:01.0)
LOG: - PCIe(0000:86:00.0) GPU0 (Tesla P40)
LOG: - PCIe(0000:84:02.0)
LOG: - PCIe(0000:87:00.0) nvme1 (INTEL SSDPEDKE020T7)
LOG: - PCIe(0000:80:03.0)
LOG: - PCIe(0000:c0:00.0)
LOG: - PCIe(0000:c1:00.0)
LOG: - PCIe(0000:c2:00.0) nvme2 (INTEL SSDPEDKE020T7)
LOG: - PCIe(0000:c1:01.0)
LOG: - PCIe(0000:c3:00.0) GPU1 (Tesla P40)
LOG: - PCIe(0000:c1:02.0)
LOG: - PCIe(0000:c4:00.0) nvme3 (INTEL SSDPEDKE020T7)
LOG: - PCIe(0000:80:03.2)
LOG: - PCIe(0000:e0:00.0)
LOG: - PCIe(0000:e1:00.0)
LOG: - PCIe(0000:e2:00.0) nvme4 (INTEL SSDPEDKE020T7)
LOG: - PCIe(0000:e1:01.0)
LOG: - PCIe(0000:e3:00.0) GPU2 (Tesla P40)
LOG: - PCIe(0000:e1:02.0)
LOG: - PCIe(0000:e4:00.0) nvme5 (INTEL SSDPEDKE020T7)
LOG: GPU<->SSD Distance Matrix
LOG: GPU0 GPU1 GPU2
LOG: nvme0 ( 3) 7 7
LOG: nvme5 7 7 ( 3)
LOG: nvme4 7 7 ( 3)
LOG: nvme2 7 ( 3) 7
LOG: nvme1 ( 3) 7 7
LOG: nvme3 7 ( 3) 7
Auto selection of the optimal GPU according to
the distance between PCIe devices
v2.1

Benchmark (1/3) - System configuration
x 1
x 3
x 6
x 3
NEC Express5800/R120h-2m
CPU: Intel Xeon E5-2603 v4 (6C, 1.7GHz)
RAM: 64GB
OS: Red Hat Enterprise Linux 7
(kernel: 3.10.0-862.9.1.el7.x86_64)
CUDA-9.2.148 + driver 396.44
DB: PostgreSQL 11beta3 + PG-Strom v2.1devel
NEC ExpEther 40G (4slots)
I/F: PCIe 3.0 x8 (x16 physical) ... 4slots
with internal PCIe switch
N/W: 40Gb-ethernet
Intel DC P4600 (2.0TB; HHHL)
SeqRead: 3200MB/s, SeqWrite: 1575MB/s
RandRead: 610k IOPS, RandWrite: 196k IOPS
I/F: PCIe 3.0 x4
NVIDIA Tesla P40
# of cores: 3840 (1.3GHz)
Device RAM: 24GB (347GB/s, GDDR5)
CC: 6.1 (Pascal, GP104)
I/F: PCIe 3.0 x16
SPECIAL THANKS FOR
v2.1

Benchmark (2/3) - Result of query execution performance
 13 SSBM queries to 1055GB database in total (a.k.a 351GB per I/O expansion box)
 Raw I/O data transfer without SQL execution was up to 9GB/s.
In other words, SQL execution was faster than simple storage read with raw-I/O.
13,401 13,534 13,536 13,330
12,696
12,965
12,533
11,498
12,312 12,419 12,414 12,622 12,594
2,388 2,477 2,493 2,502 2,739 2,831
1,865
2,268 2,442 2,418
1,789 1,848
2,202
0
2,000
4,000
6,000
8,000
10,000
12,000
14,000
Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3
QueryExecutionThroughput[MB/s]
Star Schema Benchmark for PgSQL v11beta3 / PG-Strom v2.1devel on NEC ExpEther x3
PostgreSQL v11beta3 PG-Strom v2.1devel Raw I/O Limitation
max 13.5GB/s for query execution performance with 3x I/O expansion boxes!!
v2.1

Benchmark (3/3) - Density of I/O per expansion box
0
2000
4000
6000
8000
10000
12000
14000
16000
nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1
I/O workload balances over the I/O expansion boxes, more scaling are expected
 On SQL execution, raw-I/O performance was 5000〜5100MB/s per expansion box, and 2600MB/s per NVME-SSD.
 Overall performance was balanced, so we can expect performance scaling if more expansion boxes.
 4.5GB/s x8 = 36GB/s is expected if 8 expansion box configuration; close to the commercial DWH solutions.
v2.1

Conclusion

Expected usage – Log data processing and analysis
As a data management, analytics and machine-learning platform for log data daily growing up
Manufacturing Logistics Mobile Home electronics
GPU + NVME-SSD
Why PG-Strom?
 It supports nearly 100TB with single node by addition of I/O expansion box.
 It allows to summarize the raw log data as is, more than max performance of H/W.
 Users can continue to use the familiar SQL statement and applications.

Conclusion
 PG-Strom
An extension module for PostgreSQL, to accelerate SQL execution by GPU. It pulls out maximum
potential of hardware to summarize and analyze large data more than terabytes class.
 Core feature: SSD-to-GPU Direct SQL
It directly transfers the data blocks on NVME-SSD to GPU by P2P DMA, and runs SQL workloads on
GPU prior to data loading onto the host system. By reduction of the data to be processed, it
improves the performance of I/O bound jobs.
 Multiple GPU/SSD configuration with I/O expansion box
To avoid saturation of CPU which performs PCIe root complex, it exchanges P2P DMA packets close
to the storage device by I/O expansion box that mounts PCIe switch.
Not only mitigation of CPU loads, but also allows enhancement of database capacity and
performance on demand. We measured 13.5GB/s in SQL execution by 3x expansion box. More
performance is expected according to the investment of hardware.
 Expected use scenario
Database system which stores massive logs, including M2M.
Simpleness of operations by single-node PostgreSQL, and continuity of the skill-set by the familiar
SQL statement and applications.
Adoption targets: small〜middle Hadoop clusters, or entry-class DWH solutions.

NVME and GPU accelerates PostgreSQL beyond 10GB/s query execution

NVME and GPU accelerates PostgreSQL beyond 10GB/s query execution

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to NVME and GPU accelerates PostgreSQL beyond 10GB/s query execution

Similar to NVME and GPU accelerates PostgreSQL beyond 10GB/s query execution (20)

More from Kohei KaiGai

More from Kohei KaiGai (20)

Recently uploaded

Recently uploaded (20)

NVME and GPU accelerates PostgreSQL beyond 10GB/s query execution