20171206 PGconf.ASIA LT gstore_fdw

gstore_fdw - On GPU data exchange frame
for analytics & machine-learning
HeteroDB,Inc
Chief Architect & co-founder
KaiGai Kohei <kaigai@heterodb.com>

PGconf.ASIA 2016
PL/CUDA for Drug Discovery on PostgreSQL

PL/CUDA – A limit breaker of in-database analytics
Introduction of HeteroDB Products (2018-1Q)
Advanced Analytics in-database with User-defined SQL functions
What is PL/CUDA
• Allows a CUDA C code block inside a user-defined SQL function, and runs it on the GPU device.
Pipeline: Scan, Pre-Process, Analytics, Post-Process
CREATE FUNCTION my_logic(matrix)
RETURNS matrix
AS $$
  /* CUDA C code block */
$$ LANGUAGE 'plcuda';
Advanced analytics run inside the database, with pre-/post-processing described flexibly in SQL.
ID NAME Fingerprint (1024bit)
1 CHEMBL153534 0000000000010000001000000000000001000000...
2 CHEMBL405398 0000000000000001001000000000000000100000...
3 CHEMBL503634 0000010000000000000000001000000000000000...
: : :
Database Compounds (~10M items)
Query Compounds (~1,000 items)
Use case: similarity search for drug discovery
Calculates the distance between chemical compounds for about 10 billion combinations
(~1,000 query compounds x ~10M database compounds).
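The slide does not say which distance is computed between fingerprints. The sketch below assumes Tanimoto (Jaccard) similarity, a common choice for bit fingerprints of chemical compounds, and picks the k most similar database compounds for one query. The names `tanimoto` and `k_nearest` and the toy 4-bit fingerprints are hypothetical and are not taken from PL/CUDA.

```python
from typing import Dict, List, Tuple


def tanimoto(a: int, b: int) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # both fingerprints empty: define the similarity as 0
    return bin(a & b).count("1") / union


def k_nearest(query: int, database: Dict[str, int], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k database compounds most similar to `query`, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data: 4-bit fingerprints instead of the 1024-bit ones on the slide.
database = {"cmpd_a": 0b1100, "cmpd_b": 0b0111, "cmpd_c": 0b0001}
queries = [0b0100, 0b0011]

# Every query is compared with every database compound.
combinations = len(queries) * len(database)
nearest = k_nearest(queries[0], database, k=2)
print(combinations, nearest)
```

The number of comparisons grows as (query compounds) x (database compounds), which is how ~1,000 queries against ~10M compounds reaches about 10 billion combinations.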

Indeed, it’s pretty fast.
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics
[Chart] Similarity Search of Chemical Compounds by k-NN method (k=3, D=10M)
Query response time [sec] (lower is better) by number of query compounds [Q]
Q      CPU (E5-2670v3)   GTX980   GTX1080
10               30.25    12.97     13.00
50              145.29    13.46     13.23
100             295.95    13.90     13.59
500            1503.31    18.16     16.01
1000           3034.94    24.65     19.13
About 150 times faster!
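As a quick check of the "x150" callout, the following sketch divides the CPU response time by each GPU response time, using only the numbers printed on the chart. The helper name `speedups` is hypothetical.

```python
# Response times in seconds, copied from the chart (k=3, D=10M).
query_counts = [10, 50, 100, 500, 1000]
cpu = [30.25, 145.29, 295.95, 1503.31, 3034.94]
gtx980 = [12.97, 13.46, 13.90, 18.16, 24.65]
gtx1080 = [13.00, 13.23, 13.59, 16.01, 19.13]


def speedups(baseline, accelerated):
    """Element-wise ratio baseline / accelerated."""
    return [b / a for b, a in zip(baseline, accelerated)]


ratio_980 = speedups(cpu, gtx980)
ratio_1080 = speedups(cpu, gtx1080)

for q, r980, r1080 in zip(query_counts, ratio_980, ratio_1080):
    print(f"Q={q:>4}: GTX980 x{r980:.1f}  GTX1080 x{r1080:.1f}")
```

The ratio is only a little above 2 at Q=10 and exceeds 150 on the GTX1080 at Q=1000, so the headline figure applies to the largest problem size.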

But I’m still concerned about...
[Chart] Similarity Search of Chemical Compounds by k-NN method (k=3, D=10M), vertical axis limited to 0-50 sec
Query response time [sec] (lower is better) by number of query compounds [Q] = 10 / 50 / 100 / 500 / 1000
CPU (E5-2670v3): 30.25 at Q=10 (larger values are off the scale)
GTX980: 12.97 / 13.46 / 13.90 / 18.16 / 24.65
GTX1080: 13.00 / 13.23 / 13.59 / 16.01 / 19.13
About 11-12 sec is consumed regardless of the problem size.

Invocation of PL/CUDA functions
PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value
AS
SELECT row_number() OVER (),
       fp.name,
       similarity
  FROM (SELECT float4_as_int4(key_id) key_id, similarity
          FROM matrix_unnest(
                 (SELECT rbind(knn_gpu_similarity($1, Q.matrix,
                                                  D.matrix))
                    FROM (SELECT cbind(array_matrix(id),
                                       array_matrix(bitmap)) matrix
                            FROM finger_print_query) Q,
                         (SELECT matrix
                            FROM finger_print_10m_matrix) D
                 )
               ) AS sim(key_id real, similarity real)
         ORDER BY similarity DESC) sim,
       finger_print_10m fp
 WHERE fp.id = sim.key_id
 LIMIT 1000;
Callout: time to set up the arguments of PL/CUDA functions

Problem
① 1GB limit on PostgreSQL variable-length values
② Time to set up the arguments of PL/CUDA functions

Solution – gstore_fdw
▌gstore_fdw
• A foreign-data-wrapper for a GPU memory region.
• Allows reading/writing GPU memory using SELECT/INSERT.
• Almost “zero-cost” to set up PL/CUDA function arguments.
• Enables sharing the GPU memory region using CUDA APIs.
• Suitable for machine-learning in cooperation with Python/R scripts.
Diagram labels: Storage; Host-side Shared Buffer; Foreign Table (gstore_fdw); GPU device memory; INSERT; SELECT; IPC Handle; User-written Scripts
v2.0
Device Memory Exporting:
Like mmap(2) on the host side, GPU memory can be shared with other processes using an IPC handle.
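To illustrate the mmap(2) analogy only, here is a host-side sketch in which one side "exports" a memory region and another side opens it through a handle. It uses a file path as the stand-in for the IPC handle and does not touch the GPU. On the device side the equivalent is presumably the CUDA IPC memory-handle API (for example `cuIpcGetMemHandle` / `cuIpcOpenMemHandle`), but the slides do not show which calls gstore_fdw uses. The names `export_region` and `open_region` are hypothetical.

```python
import mmap
import os
import tempfile


def export_region(size: int) -> str:
    """Create a file-backed region; the returned path plays the role of the IPC handle."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, size)  # give the backing file its final size
    os.close(fd)
    return path


def open_region(handle: str) -> mmap.mmap:
    """Map the region identified by `handle` into this process (shared mapping)."""
    with open(handle, "r+b") as f:
        return mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file


handle = export_region(4096)
writer = open_region(handle)   # e.g. the database side
reader = open_region(handle)   # e.g. a user-written script

writer[:5] = b"hello"
seen = bytes(reader[:5])       # the reader sees the write without any copy step
print(seen)

writer.close()
reader.close()
os.remove(handle)
```

The point of the analogy is that only the small handle travels between processes; the data itself stays where it is.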

Example
postgres=# CREATE FOREIGN TABLE ft (
    id int,
    x0 real,
    x1 real,
    x2 real,
    x3 real,
    x4 real,
    x5 real,
    x6 real,
    x7 real,
    x8 real,
    x9 real
) SERVER gstore_fdw OPTIONS (pinning '0',
                             format 'pgstrom');
postgres=# INSERT INTO ft (SELECT x, 100*random(), 100*random(), 100*random(),
                                     100*random(), 100*random(), 100*random(),
                                     100*random(), 100*random(), 100*random(),
                                     100*random()
                                FROM generate_series(1,10000000) x);
LOG: alloc: preserved memory 440000320 bytes
INSERT 0 10000000
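The size in the LOG line can be checked against the table definition. The sketch below assumes 4 bytes per value, which is the size of both `int` and `real` in PostgreSQL. The slides do not explain the remaining bytes; treating them as header or alignment overhead is an assumption.

```python
ROWS = 10_000_000

# Column name -> bytes per value (int and real are both 4 bytes in PostgreSQL).
column_bytes = {"id": 4}
column_bytes.update({f"x{i}": 4 for i in range(10)})

payload = ROWS * sum(column_bytes.values())  # raw column data
logged = 440_000_320                         # from "alloc: preserved memory ..."
overhead = logged - payload                  # assumed header/alignment bytes

print(f"payload  = {payload:,} bytes")
print(f"logged   = {logged:,} bytes")
print(f"overhead = {overhead:,} bytes")
```

Eleven 4-byte columns over 10 million rows account for all but a few hundred bytes of the logged allocation.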

Before INSERT
$ nvidia-smi
Sun Nov 12 00:03:30 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   36C    P0    52W / 250W |    171MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12438      C   ...bgworker: PG-Strom GPU memory keeper      161MiB |
+-----------------------------------------------------------------------------+

After INSERT
$ nvidia-smi
Sun Nov 12 00:06:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   36C    P0    51W / 250W |    591MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12438      C   ...bgworker: PG-Strom GPU memory keeper      581MiB |
+-----------------------------------------------------------------------------+
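The growth of the "GPU memory keeper" process between the two nvidia-smi snapshots can be compared with the allocation reported at INSERT time. This sketch uses only numbers from the slides; rounding up to whole MiB is an assumption about how the figures line up.

```python
import math

MIB = 1024 ** 2

keeper_before_mib = 161   # nvidia-smi before INSERT
keeper_after_mib = 581    # nvidia-smi after INSERT
allocated_bytes = 440_000_320  # from the LOG line of the INSERT

delta_mib = keeper_after_mib - keeper_before_mib
allocated_mib = allocated_bytes / MIB

print(f"nvidia-smi delta : {delta_mib} MiB")
print(f"logged allocation: {allocated_mib:.2f} MiB")
```

The logged allocation rounds up to the same whole-MiB figure as the nvidia-smi delta, which suggests the inserted rows live entirely in device memory held by the keeper process.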

A simple test with PL/CUDA function
postgres=# select gstore_test('ft');
gstore_test
------------------
5000140834.18597
(1 row)
Time: 548.382 ms
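The slides do not show the definition of `gstore_test`. The returned value is, however, consistent with a sum over all x0..x9 values of the 10 million rows, each drawn from 100*random(). The sketch below checks that reading; it is a plausibility check under that assumption, not a description of the function.

```python
import math

ROWS = 10_000_000
COLUMNS = 10            # x0..x9, each 100*random(), i.e. uniform on [0, 100)
MEAN = 50.0             # mean of a uniform [0, 100) value
VARIANCE = 100.0 ** 2 / 12.0  # variance of a uniform [0, 100) value

expected_sum = ROWS * COLUMNS * MEAN
std_dev_of_sum = math.sqrt(ROWS * COLUMNS * VARIANCE)

reported = 5000140834.18597  # value returned by gstore_test('ft')
deviation = abs(reported - expected_sum)

print(f"expected sum : {expected_sum:,.0f}")
print(f"deviation    : {deviation:,.0f}")
print(f"std. dev.    : {std_dev_of_sum:,.0f}")
```

The reported value lies within one standard deviation of the expected sum, so a full scan-and-sum over the foreign table in about half a second is a reasonable interpretation of the test.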

Summary
Integration of data-management and machine-learning:
data management by PostgreSQL, machine-learning by Python/R.
Data centralization and a fast trial-and-error cycle for machine-learning.
Diagram labels: Data Lake / Data Warehouse; postgres_fdw; gstore_fdw (v2.0); Pre-/Post-Process; Device Memory Exporting; Data Scientist

Future works
▌Support of CuPy / cuBLAS internal format
  Expectation: integration with deep learning frameworks
▌Support of incremental INSERT
▌Support of UPDATE commands
▌Support of streaming database mode
  It behaves like a ring buffer; only the latest N items are valid
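The ring-buffer behaviour described for the streaming mode can be sketched on the host side as follows. This is only an illustration of "only the latest N items are valid"; the class name `StreamingStore` and its methods are hypothetical, and the planned gstore_fdw implementation is not shown in the slides.

```python
from collections import deque


class StreamingStore:
    """Keeps only the latest `capacity` rows, like a ring buffer."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently drops the oldest item when full.
        self._rows = deque(maxlen=capacity)

    def insert(self, row):
        self._rows.append(row)

    def select(self):
        """Return the currently valid rows, oldest first."""
        return list(self._rows)


store = StreamingStore(capacity=3)
for i in range(1, 6):
    store.insert((i, i * 10.0))

latest = store.select()
print(latest)
```

After five inserts into a store of capacity three, only the last three rows remain visible.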
