gstore_fdw - On GPU data exchange frame
for analytics & machine-learning
HeteroDB, Inc.
Chief Architect & co-founder
KaiGai Kohei <kaigai@heterodb.com>
PGconf.ASIA 2016
PL/CUDA for Drug Discovery on PostgreSQL
PL/CUDA – A limit breaker of in-database analytics
Introduction of HeteroDB Products (2018-1Q)
Advanced Analytics in-database with User-defined SQL functions
What is PL/CUDA
- Allows a CUDA C code block inside a user-defined SQL function, and runs it on the GPU device.
Scan -> Pre-Process -> Analytics -> Post-Process
CREATE FUNCTION my_logic(matrix)
RETURNS matrix
AS $$
  /* CUDA C code block */
$$ LANGUAGE 'plcuda';
Advanced analytics run inside the database, so pre- and post-processing
can be described flexibly in SQL.
ID NAME Fingerprint (1024bit)
1 CHEMBL153534 0000000000010000001000000000000001000000...
2 CHEMBL405398 0000000000000001001000000000000000100000...
3 CHEMBL503634 0000010000000000000000001000000000000000...
: : :
Database Compounds (~10M items)
Query Compounds (~1,000 items)
Use case: similarity search for drug discovery
Computes the distance between chemical compounds for about 10 billion
combinations (~10M database compounds x ~1,000 query compounds).
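As a rough illustration of this workload, the sketch below scores query fingerprints against database fingerprints and keeps the k most similar database compounds per query. It assumes the Tanimoto (Jaccard) coefficient as the similarity measure, which is common for chemical fingerprints but is not stated on the slide. The function names and the toy 16-bit fingerprints are hypothetical, and this is plain CPU-side Python, not the PL/CUDA implementation.

```python
import heapq

def tanimoto(a: int, b: int) -> float:
    """Tanimoto (Jaccard) similarity of two bit fingerprints stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints: define similarity as 0
    return bin(a & b).count("1") / union

def knn_similarity(k, queries, database):
    """For each query, return the k most similar (similarity, db_id) pairs.

    queries:  dict of query_id -> fingerprint (int)
    database: dict of db_id    -> fingerprint (int)
    Every query is compared with every database entry, so the cost is
    len(queries) * len(database) similarity evaluations.
    """
    result = {}
    for qid, qfp in queries.items():
        scored = ((tanimoto(qfp, dfp), did) for did, dfp in database.items())
        result[qid] = heapq.nlargest(k, scored)
    return result

# Toy data: 16-bit fingerprints instead of the 1024-bit ones on the slide.
database = {1: 0b0000000000010010, 2: 0b0000000100100000, 3: 0b0000010000000010}
queries = {"q1": 0b0000000000010010}

top = knn_similarity(2, queries, database)

# Number of comparisons at the scale quoted on the slide.
combinations = 10_000_000 * 1_000
```

The all-pairs loop is what makes the problem expensive: at the quoted sizes it amounts to 10,000,000 x 1,000 similarity evaluations.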
Indeed, it's pretty fast.
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics
[Figure: Similarity Search of Chemical Compounds by k-NN method (k=3, D=10M)]
Query response time [sec] (lower is better) by number of query compounds [Q]:

   Q   CPU (E5-2670v3)   GTX980   GTX1080
  10             30.25    12.97     13.00
  50            145.29    13.46     13.23
 100            295.95    13.90     13.59
 500           1503.31    18.16     16.01
1000           3034.94    24.65     19.13

About 150 times faster!
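The headline ratio can be checked from the values printed on the chart. The sketch below assumes the three series map to CPU, GTX980 and GTX1080 in legend order, which the flattened chart does not state explicitly.

```python
# Query response times [sec] read off the chart, keyed by number of query compounds.
cpu     = {10: 30.25, 50: 145.29, 100: 295.95, 500: 1503.31, 1000: 3034.94}
gtx980  = {10: 12.97, 50: 13.46, 100: 13.90, 500: 18.16, 1000: 24.65}
gtx1080 = {10: 13.00, 50: 13.23, 100: 13.59, 500: 16.01, 1000: 19.13}

def speedup(baseline, accelerated):
    """Ratio baseline/accelerated for every problem size."""
    return {q: baseline[q] / accelerated[q] for q in baseline}

speedup_980 = speedup(cpu, gtx980)
speedup_1080 = speedup(cpu, gtx1080)

for q in cpu:
    print(f"Q={q:>4}: GTX980 x{speedup_980[q]:6.1f}, GTX1080 x{speedup_1080[q]:6.1f}")
```

Under that reading, the ratio at Q=1000 is roughly 120 to 160 depending on the GPU, while at Q=10 it is only a little over 2, because the GPU times barely change with the problem size.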
But I'm still concerned about...
[Figure: the previous chart redrawn with the response-time axis limited to 0-50 sec. Similarity Search of Chemical Compounds by k-NN method (k=3, D=10M); x-axis: Number of Query Compounds [Q] = 10, 50, 100, 500, 1000; y-axis: Query Response Time [sec] (lower is better); series: CPU (E5-2670v3), GTX980, GTX1080. The labeled GPU values range from 12.97 to 24.65 sec.]
About 11-12 sec is consumed regardless of the problem size.
Invocation of PL/CUDA functions
PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value
AS
SELECT row_number() OVER (),
       fp.name,
       similarity
  FROM (SELECT float4_as_int4(key_id) key_id, similarity
          FROM matrix_unnest(
                 (SELECT rbind( knn_gpu_similarity($1,Q.matrix,
                                                      D.matrix))
                    FROM (SELECT cbind(array_matrix(id),
                                       array_matrix(bitmap)) matrix
                            FROM finger_print_query) Q,
                         (SELECT matrix
                            FROM finger_print_10m_matrix) D
                 )
               ) AS sim(key_id real, similarity real)
         ORDER BY similarity DESC) sim,
       finger_print_10m fp
 WHERE fp.id = sim.key_id
 LIMIT 1000;
Time to set up the arguments of PL/CUDA functions
Problem
① 1GB limitation of PostgreSQL variable-length values
② Time to set up the arguments of a PL/CUDA function
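A back-of-the-envelope check of problem ①, using the sizes quoted earlier in the deck (~10M database compounds, 1024-bit fingerprints). The sketch assumes the fingerprints are packed with no per-row overhead and that the limit is 1 GiB; both are simplifications, and the constant names are hypothetical.

```python
# Assumptions (not stated on the slide): fingerprints are tightly packed
# with no per-row overhead, and the limit is 1 GiB = 2**30 bytes.
N_COMPOUNDS = 10_000_000      # "~10M items" database compounds
FINGERPRINT_BITS = 1024       # "Fingerprint (1024bit)"
VARLENA_LIMIT = 2**30         # 1GB limit of a variable-length value

bytes_per_fingerprint = FINGERPRINT_BITS // 8
matrix_bytes = N_COMPOUNDS * bytes_per_fingerprint

# Minimum number of pieces if each piece must fit under the limit (ceiling division).
min_chunks = -(-matrix_bytes // VARLENA_LIMIT)

print(f"{matrix_bytes:,} bytes = {matrix_bytes / 2**30:.2f} GiB, "
      f"needs at least {min_chunks} chunks")
```

Even before counting the ID column, the fingerprint data alone does not fit into a single variable-length value, so it has to be split across several values.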
Solution – gstore_fdw
▌gstore_fdw
- A foreign data wrapper for a GPU memory region.
- Allows reading and writing GPU memory using SELECT/INSERT.
- Almost "zero-cost" to set up PL/CUDA function arguments.
- Enables sharing the GPU memory region using CUDA APIs.
- Suitable for machine learning in cooperation with Python/R scripts.
[Diagram (v2.0): Storage; host-side Shared Buffer; Foreign Table (gstore_fdw) backed by GPU device memory, accessed with INSERT and SELECT; IPC Handle passed to user-written scripts]
Device Memory Exporting:
Like mmap(2) on the host side, GPU memory can be shared with other
processes using an IPC handle.
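The handle-based sharing pattern can be illustrated with its host-side analogue. The sketch below uses Python's `multiprocessing.shared_memory` as a stand-in: one side creates a memory region and publishes a small handle, the other side attaches through that handle without copying the data. This is only an analogy for the idea on the slide. It does not touch GPU memory, and it says nothing about how gstore_fdw actually exposes its handle. On the GPU side, CUDA provides calls such as `cuIpcGetMemHandle` and `cuIpcOpenMemHandle` for this purpose.

```python
from multiprocessing import shared_memory
import struct

# "Exporter" side: allocate a named region and fill it with data.
# (Stand-in for the GPU memory region behind the foreign table.)
values = [1.5, 2.5, 3.5, 4.5]
exporter = shared_memory.SharedMemory(create=True, size=len(values) * 4)
struct.pack_into(f"{len(values)}f", exporter.buf, 0, *values)

# The "handle" is just a small token that can be passed to another process,
# for example as the result of a query. Here it is the region's name.
handle = exporter.name

# "Importer" side: attach to the same region through the handle.
# No data is copied; both objects map the same memory.
importer = shared_memory.SharedMemory(name=handle)
shared_values = list(struct.unpack_from(f"{len(values)}f", importer.buf, 0))

# A write by the exporter is visible to the importer immediately.
struct.pack_into("f", exporter.buf, 0, 9.0)
first_after_update = struct.unpack_from("f", importer.buf, 0)[0]

importer.close()
exporter.close()
exporter.unlink()  # release the region once nobody needs it
```

In a real setup the importer would be a separate process, such as a Python or R script, and only the handle would travel between the two.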
Example
postgres=# CREATE FOREIGN TABLE ft (
    id int,
    x0 real,
    x1 real,
    x2 real,
    x3 real,
    x4 real,
    x5 real,
    x6 real,
    x7 real,
    x8 real,
    x9 real
) SERVER gstore_fdw OPTIONS (pinning '0',
                             format 'pgstrom');
postgres=# INSERT INTO ft (SELECT x, 100*random(), 100*random(), 100*random(),
                                     100*random(), 100*random(), 100*random(),
                                     100*random(), 100*random(), 100*random(),
                                     100*random()
                             FROM generate_series(1,10000000) x);
LOG: alloc: preserved memory 440000320 bytes
INSERT 0 10000000
Before INSERT
$ nvidia-smi
Sun Nov 12 00:03:30 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   36C    P0    52W / 250W |    171MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12438      C   ...bgworker: PG-Strom GPU memory keeper      161MiB |
+-----------------------------------------------------------------------------+
After INSERT
$ nvidia-smi
Sun Nov 12 00:06:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   36C    P0    51W / 250W |    591MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12438      C   ...bgworker: PG-Strom GPU memory keeper      581MiB |
+-----------------------------------------------------------------------------+
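The numbers in the example are consistent with each other, as the short calculation below shows. It assumes the table is stored as eleven plain 4-byte columns. Reading the leftover 320 bytes as header or alignment overhead is a guess, not something the log states.

```python
N_ROWS = 10_000_000
N_COLUMNS = 11            # id int + x0..x9 real
BYTES_PER_VALUE = 4       # PostgreSQL int and real are both 4 bytes wide

payload_bytes = N_ROWS * N_COLUMNS * BYTES_PER_VALUE
allocated_bytes = 440_000_320          # from the LOG line after INSERT
extra_bytes = allocated_bytes - payload_bytes

MIB = 2**20
allocated_mib = allocated_bytes / MIB
smi_delta_mib = 581 - 161              # "GPU memory keeper" usage, after minus before

print(f"payload   : {payload_bytes:,} bytes")
print(f"extra     : {extra_bytes} bytes")
print(f"allocated : {allocated_mib:.1f} MiB vs nvidia-smi delta {smi_delta_mib} MiB")
```

The allocation reported in the log (about 419.6 MiB) matches the 420 MiB increase that nvidia-smi shows for the GPU memory keeper process.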
A simple test with PL/CUDA function
postgres=# select gstore_test('ft');
gstore_test
------------------
5000140834.18597
(1 row)
Time: 548.382 ms
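The slide does not show what gstore_test computes. The result is, however, close to what one would expect if it summed the ten real columns x0..x9 over all 10M rows. That reading is an assumption, and the sketch below only checks that the number is statistically plausible under it.

```python
import math

N_ROWS = 10_000_000
N_REAL_COLUMNS = 10        # x0..x9, each filled with 100*random()
LOW, HIGH = 0.0, 100.0     # 100*random() is uniform on [0, 100)

n_values = N_ROWS * N_REAL_COLUMNS
expected_sum = n_values * (LOW + HIGH) / 2

# Standard deviation of a sum of independent uniform values.
variance_per_value = (HIGH - LOW) ** 2 / 12
std_of_sum = math.sqrt(n_values * variance_per_value)

observed = 5000140834.18597            # gstore_test('ft') from the slide
deviation_in_sigmas = (observed - expected_sum) / std_of_sum

print(f"expected {expected_sum:.0f}, observed {observed:.0f}, "
      f"{deviation_in_sigmas:.2f} standard deviations apart")
```

Under this assumption the observed value lies about half a standard deviation from the expected 5 x 10^9, which is well within normal random variation.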
Integration of data management and machine learning
Summary
[Diagram labels: Data Lake / Data Warehouse, postgres_fdw, gstore_fdw, Pre-/Post-Process, Device Memory Exporting (v2.0), Data Scientist]
Data management by PostgreSQL, machine learning by Python/R
Data centralization and a fast trial-and-error cycle for machine learning
Future works
▌Support for the CuPy / cuBLAS internal format
Expectation: integration with deep learning frameworks
▌Support for incremental INSERT
▌Support for UPDATE commands
▌Support for a streaming database mode
It behaves like a ring buffer; only the latest N items are valid
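The ring-buffer behavior can be sketched on the host side as follows. `RingStore` is a hypothetical name used only for illustration. The slide lists this mode as future work and gives no design details, so this shows the general idea rather than the planned implementation.

```python
from collections import deque

class RingStore:
    """Keeps only the latest `capacity` items; older ones are overwritten."""

    def __init__(self, capacity: int):
        self._items = deque(maxlen=capacity)

    def insert(self, item) -> None:
        # When the deque is full, appending silently drops the oldest item.
        self._items.append(item)

    def scan(self) -> list:
        """Return the currently valid items, oldest first."""
        return list(self._items)

store = RingStore(capacity=3)
for reading in [10, 20, 30, 40, 50]:
    store.insert(reading)

latest = store.scan()
```

After five inserts into a store of capacity 3, only the last three items remain visible to a scan.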
20171206 PGconf.ASIA LT gstore_fdw
