PL/CUDA allows writing user-defined functions in CUDA C that run on a GPU. This benefits analytics workloads that can exploit thousands of GPU cores and wide memory bandwidth. A sample logistic regression implementation in PL/CUDA showed a 350x speedup over a CPU-based implementation in MADlib. Logistic regression performs binary classification by estimating the weights of the explanatory variables and the intercept through iterative updates, which is well suited to parallelization on a GPU.
KaiGai's talk at PGconf.EU 2018, Lisbon.
It shows how PG-Strom's SSD2GPU Direct SQL accelerates I/O-intensive big-data queries using a GPU, contrary to common expectations about where GPUs help.
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming; in an attempt to minimize this time, our project is a parallel implementation of the K-Means clustering algorithm in CUDA C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering.
Parallel Implementation of K-Means Clustering on CUDA - prithan
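The two steps a CUDA port parallelizes can be sketched in pure Python: the label-assignment loop is independent per point (one GPU thread per point), while the centroid update is a per-cluster reduction. The function name and the 1-D data are illustrative only, not from the project's code.

```python
# Minimal pure-Python sketch of one K-Means iteration over 1-D points.
def kmeans_step(points, centroids):
    """Run one assignment + update step; returns (labels, new_centroids)."""
    # Assignment: each point picks its nearest centroid (embarrassingly parallel).
    labels = [min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
              for p in points]
    # Update: mean of each cluster's members (a reduction per cluster).
    new_centroids = []
    for k in range(len(centroids)):
        members = [p for p, lab in zip(points, labels) if lab == k]
        new_centroids.append(sum(members) / len(members) if members else centroids[k])
    return labels, new_centroids
```

Iterating `kmeans_step` until the centroids stop moving reproduces Lloyd's algorithm; a CUDA version would keep the points resident in device memory across iterations.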
Processing Big Data Quickly and Efficiently with Apache Nemo
- Wonwook Song, Youngseok Yang (Software Platform Lab, Department of Computer Science and Engineering, Seoul National University)
Overview
Apache Nemo is a system that optimizes how big-data applications are executed in a distributed fashion, adapting to diverse resource environments and data characteristics. When handling geo-distributed resources, transient resources, large data shuffles, and skewed data, Apache Nemo delivers significantly higher performance than Apache Spark.
Contents
Optimization case studies of Apache Nemo
Apache Nemo's distributed execution process
Future research directions
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean) - Gruter
Big data analysis using Tajo on AWS (Hands-on session)
- presented by Young-kyong Ko, data analyst at Gruter
- at Gruter TECHDAY 2014 (Oct. 29 Seoul, Korea)
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data - Hitoshi Sato
Presentation Slides for ExaComm2018, Fourth International Workshop on Communication Architectures for HPC, Big Data, Deep Learning and Clouds at Extreme Scale, in conjunction with International Supercomputing Conference (ISC 2018)
http://nowlab.cse.ohio-state.edu/exacomm/
The search for faster computing remains of great importance to the software community. Relatively inexpensive modern hardware, such as GPUs, allows users to run highly parallel code on thousands, or even millions of cores on distributed systems.
Building efficient GPU software is not a trivial task, often requiring a significant number of engineering hours to attain the best performance. Similarly, distributed computing systems are inherently complex. In recent years, several libraries have been developed to solve such problems, but they often target a single aspect of computing, such as GPU computing with libraries like CuPy, or distributed computing with Dask.
Libraries like Dask and CuPy tend to provide great performance while abstracting away the complexity from non-experts, making them great candidates for developers writing software for a variety of applications. Unfortunately, they are often difficult to combine, at least efficiently.
With the recent introduction of NumPy community standards and protocols, it has become much easier to integrate libraries that share the already well-known NumPy API. Such changes allow libraries like Dask, known for its easy-to-use parallelization and distributed computing capabilities, to defer some of that work to other libraries such as CuPy, giving users the benefits of both distributed and GPU computing with little to no change in their existing software built on the NumPy API.
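A purely illustrative sketch of the dispatch principle behind those protocols: algorithms are written once against a shared array API, and each backend supplies its own implementation. The class names below are invented; real code relies on NumPy's `__array_function__`/`__array_ufunc__` protocols and actual Dask/CuPy arrays.

```python
# Toy illustration (invented classes, not the real NumPy protocol machinery).
class CpuArray:
    """A list-backed array exposing a tiny NumPy-like API."""
    def __init__(self, data):
        self.data = list(data)
    def sum(self):
        return sum(self.data)
    def __len__(self):
        return len(self.data)

class TracingArray(CpuArray):
    """Stands in for an accelerator-backed array: same API, different engine."""
    calls = []
    def sum(self):
        TracingArray.calls.append("sum")  # pretend this dispatched to a GPU kernel
        return super().sum()

def mean(arr):
    # Backend-agnostic: relies only on the shared API (sum and len), just as
    # code written to the NumPy API can run on Dask or CuPy arrays unchanged.
    return arr.sum() / len(arr)
```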
Alpine Data Labs presents a deep dive into our implementation of Multinomial Logistic Regression with Apache Spark. Machine Learning Engineer DB Tsai takes us through the technical implementation details step by step. First, he explains how the state of the art of machine learning on Hadoop is not fulfilling the promise of Big Data. Next, he explains how Spark is a perfect match for machine learning through its in-memory caching capability, demonstrating a 100x performance improvement. Third, he takes us through each aspect of a multinomial logistic regression and how it is developed with the Spark APIs. Fourth, he demonstrates an extension of MLOR and its training parameters. Fifth, he benchmarks MLOR with 11M rows, 123 features, and 11% non-zero elements on a 5-node Hadoop cluster. Finally, he shows Alpine's unique visual environment with Spark and verifies the performance with the job tracker. In conclusion, Alpine supports the state-of-the-art Cloudera and Pivotal Hadoop clusters and performs at a level that far exceeds its nearest competitor.
Multinomial Logistic Regression with Apache Spark - DB Tsai
Logistic regression can be used not only for modeling binary outcomes but also multinomial outcomes with some extension. In this talk, DB will walk through the basic idea of binary logistic regression step by step, and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training samples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications in document classification and computational linguistics are of this type. He will talk about how to address this problem by using the L-BFGS optimizer instead of the Newton optimizer.
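The horizontal scaling described here rests on the fact that the logistic-loss gradient is a plain sum over rows, so each cached RDD partition can compute a partial sum that the driver then reduces. A hedged pure-Python sketch of that structure (function names invented, not the Spark API):

```python
import math

def partial_gradient(rows, w):
    """Partition-local sum of per-row gradient terms (z_i - t_i) * x_i."""
    g = [0.0] * len(w)
    for x, t in rows:  # x: feature list (leading 1 for the intercept), t: 0/1 label
        z = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
        for j, xj in enumerate(x):
            g[j] += (z - t) * xj
    return g

def full_gradient(partitions, w):
    """The reduce step: element-wise sum of the partition-local gradients."""
    total = [0.0] * len(w)
    for part in partitions:
        for j, gj in enumerate(partial_gradient(part, w)):
            total[j] += gj
    return total
```

An optimizer such as L-BFGS needs only this gradient (plus the loss value), which is why it slots into the same map-reduce structure.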
Bio:
DB Tsai is a machine learning engineer working at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led the Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
pg_proctab: Accessing System Stats in PostgreSQL - Mark Wong
pg_proctab is a collection of PostgreSQL stored functions that provide access to the operating system process table using SQL. We'll show you which functions are available and where they collect the data, and give examples of their use to collect processor and I/O statistics on SQL queries.
spaGO: A self-contained ML & NLP library in Go - Matteo Grella
Introduction to spaGO, a beautiful and maintainable machine learning library written in Go designed to support relevant neural network architectures in natural language processing tasks.
Github: https://github.com/nlpodyssey/spago
Our fall 12-week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meet-up and learn how easily you can use R for advanced machine learning. In this meet-up, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will do a remote session with us through Google Hangouts.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc and a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the XGBoost R package, one of the most popular and contest-winning tools on kaggle.com nowadays.
Pre-requisite(if any): R /Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XgBoost Demo
Reference:
https://github.com/dmlc/xgboost
We all make mistakes while programming and spend a lot of time fixing them.
One of the methods that allows for quick detection of defects is static analysis of source code.
Data Analytics and Simulation in Parallel with MATLAB* - Intel® Software
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
RAPIDS: Accelerating Pandas and scikit-learn on the GPU - Pavel Klemenkov, NVIDIA - Mail.ru Group
We all know that our beloved Pandas is strictly single-threaded, and scikit-learn models often train slowly even across several processes. So in this talk I will present the RAPIDS project, a set of libraries for data analysis and building predictive models on NVIDIA GPUs. I will open a discussion on Moore's law no longer holding, review the principles of the CUDA architecture, and walk through the cuDF and cuML libraries. I will also try to answer, as honestly as possible, whether to expect a miracle from moving to GPUs, and in which cases the miracle is inevitable.
Globus Connect Server Deep Dive - GlobusWorld 2024 - Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Globus Compute with IRI Workflows - GlobusWorld 2024 - Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
How Recreation Management Software Can Streamline Your Operations - wottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
How to Position Your Globus Data Portal for Success: Ten Good Practices - Globus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Enhancing Research Orchestration Capabilities at ORNL - Globus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Your Digital Assistant.
Making a complex approach simple: a straightforward process saves time, and there is no more waiting to connect with the people who matter to you. Safety first is not a cliché: information is securely protected in cloud storage to prevent any third party from accessing your data.
Would you rather make your visitors feel burdened by making them wait, or choose VizMan for a stress-free experience? VizMan is an automated visitor management system that works for any industry, including factories, housing societies, government institutes, and warehouses. It is a new-age, contactless way of logging information about visitors, employees, packages, and vehicles. As a digital logbook, VizMan deters unnecessary use of paper and space, since there is no need for bundles of registers left to collect dust in a corner of a room. It records visitors' essential details, helps schedule meetings between visitors and employees, and assists in supervising employee attendance. With VizMan, visitors don't need to wait for hours in long queues; VizMan handles visitors with the value they deserve, because we know time is important to you.
Feasible Features
One Subscription, Four Modules – Admin, Employee, Receptionist, and Gatekeeper ensures confidentiality and prevents data from being manipulated
User Friendly – can be easily used on Android, iOS, and Web Interface
Multiple Accessibility – Log in through any device from any place at any time
One app for all industries – a Visitor Management System that works for any organisation.
Stress-free Sign-up
Visitor is registered and checked-in by the Receptionist
Host gets a notification, where they opt to Approve the meeting
Host notifies the Receptionist of the end of the meeting
Visitor is checked-out by the Receptionist
Host enters notes and remarks of the meeting
Customizable Components
Scheduling Meetings – Host can invite visitors for meetings and also approve, reject and reschedule meetings
Single/Bulk invites – Invitations can be sent individually to a visitor or collectively to many visitors
VIP Visitors – Additional security of data for VIP visitors to avoid misuse of information
Courier Management – Keeps a check on deliveries like commodities being delivered in and out of establishments
Alerts & Notifications – Get notified on SMS, email, and application
Parking Management – Manage availability of parking space
Individual log-in – Every user has their own log-in id
Visitor/Meeting Analytics – Evaluate notes and remarks of the meeting stored in the system
Visitor Management System is a secure and user friendly database manager that records, filters, tracks the visitors to your organization.
"Secure Your Premises with VizMan (VMS) – Get It Now"
Accelerate Enterprise Software Engineering with Platformless - WSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
First Steps with Globus Compute Multi-User Endpoints - Globus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Prosigns: Transforming Business with Tailored Technology Solutions - Prosigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
SOCRadar Research Team: Latest Activities of IntelBroker - SOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled for you what has happened over the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar's Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce? - XfilesPro
Worried about document security while sharing them in Salesforce? Fret no more! Here are the top-notch security standards XfilesPro upholds to ensure strong security for your Salesforce documents while sharing with internal or external people.
To learn more, read the blog: https://www.xfilespro.com/how-does-xfilespro-make-document-sharing-secure-and-seamless-in-salesforce/
A Comprehensive Look at Generative AI in Retail App Testing - kalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Providing Globus Services to Users of JASMIN for Environmental Data Analysis - Globus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Strategies for Successful Data Migration Tools - varshanayak241
Data migration is a complex but essential task for organizations aiming to modernize their IT infrastructure and leverage new technologies. By understanding common challenges and implementing these strategies, businesses can achieve a successful migration with minimal disruption. Data migration tools like Ask On Data play a pivotal role in this journey, offering features that streamline the process, ensure data integrity, and maintain security. With the right approach and tools, organizations can turn the challenge of data migration into an opportunity for growth and innovation.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv... - Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
20181212 - PGconfASIA - LT - English
1. In-database Analytics using GPU
~ Tried to implement Logistic Regression Analytics ~
HeteroDB, Inc.
Chief Architect & CEO
KaiGai Kohei <kaigai@heterodb.com>
2. Hello guys,
Are you using PL/CUDA?
Hello guys. Are you using PL/CUDA?
These captions are not generated automatically by machine learning; I wrote them up manually in advance.
PGconf.ASIA 2018 LT - In-database Analytics using GPU
3. PL/CUDA User Defined Function
▌What is PL/CUDA?
PL/CUDA allows UDFs written in CUDA C which are executable on GPU.
▌Characteristics
Extreme optimization of the GPU code by hand; not auto-generated.
Full integration with SQL for pre-/post-processing, with flexible operations.
All in-database analytics: Scan → Pre-Process → Analytics → Post-Process → Result ready

CREATE FUNCTION
my_logic( reggstore, text )
RETURNS matrix
AS $$
  /* custom CUDA C code block (runs on GPU device) */
$$ LANGUAGE 'plcuda';

Manual optimization for statistics and machine-learning; utilization of thousands of cores and wide-band device memory.
PL/CUDA allows a UDF written in a CUDA C program that is executable on GPU. It is valuable due to the integration of manual (extreme) optimization for the GPU with flexible data operations in SQL.
4. PL/CUDA Use Case – Similarity Search on Drug-Discovery
Data structure of chemical compounds:
ID | NAME         | Fingerprint (1024bit)
1  | CHEMBL153534 | 00000000000100000010000000000010001000000...
2  | CHEMBL405398 | 00000000000000010010000000000000000100000...
3  | CHEMBL503634 | 00000100000000000000010000000000000000000...
:  | :            | :
Database compounds (10M items) are checked against query compounds (~1,000 items), i.e. 10 billion combinations to be checked. The DB server runs the similarity-search logic on the query and returns the list of similar chemical compounds.
For similarity search in drug discovery, the GPU calculated 10 billion distances between chemical compounds 150 times faster than a C binary on the CPU. It is a very compute-intensive workload.
[Chart: response time of the similarity search by the k-NN method (k=3, D=10M) vs. number of query compounds [Q], showing "x150 times faster!!"]
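Although the actual similarity logic was proprietary, the per-pair kernel in this kind of fingerprint search is typically a Tanimoto coefficient over bit vectors, which GPUs evaluate for billions of pairs using hardware popcount. A small sketch under that assumption, with Python ints standing in for the 1024-bit fingerprints:

```python
# Assumed per-pair kernel (Tanimoto); the talk's real logic is not public.
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: |A & B| / |A | B| over set bits."""
    both = bin(fp_a & fp_b).count("1")    # popcount of the intersection
    either = bin(fp_a | fp_b).count("1")  # popcount of the union
    return both / either if either else 1.0
```

On a GPU, each thread scores one (database, query) pair with popcount instructions, which is how 10 billion pairs become tractable.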
5. Is there any sample program?
Oh.... this case was proprietary algorithm. Now we have no sample code in public.
Is there any sample programs?
6. I tried to make it.
Theme: Logistic Regression Analytics
7. What is Logistic Regression Analytics (1/2)
A method for binary classification
Logistic regression is a machine-learning method for binary classification.
[Figure: scatter plot of samples labeled True or False, separated by a division line]
8. What is Logistic Regression Analytics (2/2)
Probability of "right" classification follows the logistic function:

σ(α) = 1 / (1 + e^(−α))
9. Estimation of the parameters (1/3)
In general ....
Parameter: w = (w0, w1, ⋯, wm)
Explanatory variables: φi = (1, x1, ⋯, xm)i
Teacher data: ti = 0 or 1
[Figure: the division surface 0 = w0 + w1·x + w2·y; determining it is equivalent to seeking the weights of the explanatory variables and the intercept]
Determining the division surface is equivalent to seeking the weights of the explanatory variables and the
intercept. However, the teacher data only tell us the boolean state for each combination of explanatory variables.
10. Estimation of the parameters (2/3)
Target: Maximize the probability of the training set.
When zi = σ(wᵀφi):

P = Π_{i=1..N} Pi = Π_{i=1..N} zi^ti · (1 − zi)^(1−ti)

Distance from the division surface indicates the certainty of the classification.
We assume the training set is the result of the most feasible probability.
Explanatory variables far from the division surface have a higher probability of being true/false. We assume
the training set is the result of the highest likelihood, maximized over the parameter w.
11. Estimation of the parameters (3/3)
Parameter estimation by iteration of:

w_new = w_old − (ΦᵀRΦ)⁻¹ Φᵀ(z − t)

where:

Φ = [ 1  x11 ⋯ x1m
      ⋮      ⋱   ⋮
      1  xn1 ⋯ xnm ]
t = (t1, …, tn)
z = (z1, …, zn)
R = diag( z1(1 − z1), …, zn(1 − zn) )
For more details, check out the book "The first step of machine-learning theory". Anyway, w is updated in each
iteration, and w_new moves toward a more reasonable parameter than w_old. Eventually, the difference between
w_new and w_old becomes very small.
12. Amount of the calculation
▌# of explanatory variables (small): several to several hundred ... m items
▌# of training data (large): several hundred to several million ... n items
w_new = w_old − wΔ = w_old − (ΦᵀRΦ)⁻¹ Φᵀ(z − t)
Estimation of the amount of calculation: the number of explanatory variables is up to hundreds, but the number
of training data items is more than a million. This is suitable for parallel calculation by GPU.
[Figure: matrix dimensions of the update formula: Φᵀ is m×n and RΦ is n×m, so (ΦᵀRΦ)⁻¹ is m×m; Φᵀ(z − t) is m×1; therefore wΔ is m×1, and the large n dimension (training data) dominates the computation]
13. Example of GPU code for the matrix product ΦᵀRΦ
KERNEL_FUNCTION_MAXTHREADS(void)
logregr_update_P(cl_double **Preg,      /* out */
                 cl_float **Xp,
                 cl_int width,
                 VectorTypeFloat *Z)
{
    cl_double  *P = Preg[0];
    __shared__ cl_float v[MAXTHREADS_PER_BLOCK];    /* shared variables */
    cl_uint     nitems = Z->length;                 /* number of training rows */
    cl_uint     nitems_bs = TYPEALIGN(get_local_size(), nitems);
    cl_uint     nloops = width * width * nitems_bs;
    cl_uint     loop, i, j, k;
    cl_float    sum;

    for (loop = get_global_id();        /* unique identifier of the GPU thread */
         loop < nloops;
         loop += get_global_size())     /* add total number of GPU threads */
    {
        k = loop % nitems_bs;           /* index of R column/row */
        i = (loop / nitems_bs) % width; /* index of Φᵀ column */
        j = loop / (nitems_bs * width); /* index of Φ column */
        if (k < nitems)
        {
            cl_float z  = Z->values[k];
            cl_float x1 = (i == 0 ? 1.0 : Xp[i-1][k]);
            cl_float x2 = (j == 0 ? 1.0 : Xp[j-1][k]);

            v[get_local_id()] = x1 * z * (1.0 - z) * x2;
        }
        else
            v[get_local_id()] = 0.0;
        /* total sum of the elements, calculated by the sibling threads */
        sum = pgstromTotalSum(v, MAXTHREADS_PER_BLOCK);
        if (get_local_id() == 0)
            atomicAdd(&P[i + j * width], sum);
        __syncthreads();
    }
}
14. Calculation by GPU – A case for reduction algorithm
[Figure: tree reduction: the sum Σ_{i=0..N−1} item[i] over item[0..15] is computed pairwise in log2(N) steps (step.1 ... step.4); inter-core synchronization by HW support]
Also used by aggregation:
SELECT count(X), sum(Y), avg(Z) FROM my_table;
Values on shared memory can be accessed by multiple GPU cores simultaneously. Hardware supports inter-core
synchronization, which enables calculating the total sum in log2(N) steps.
15. Sample program of the Logistic Regression Analytics
$ git clone https://github.com/heterodb/toybox.git
$ cd toybox/logistic_regression/
$ make && make install
$ psql postgres
postgres=# create extension logregr;
CREATE EXTENSION
To get the sample code, open "heterodb/toybox" on GitHub, then move to "logistic_regression".
You can install it using CREATE EXTENSION, if PG-Strom is correctly set up.
16. Let’s play (1/4) - Creation of artificial test data
postgres=# CREATE TABLE logreg (
t bool,
x1 float,
x2 float,
x3 float,
x4 float );
CREATE TABLE
-- The training data: all rows with 1 + 2*x1 - 3*x2 + x3 + 0.5*x4 > 0 are classified as true; 40M rows
postgres=# INSERT INTO logreg
(SELECT (1.0+2.0*x1-3.0*x2+x3+0.5*x4) > 0 t, x1, x2, x3, x4
FROM (SELECT random() x1,
random() x2,
random() x3,
random() x4
FROM generate_series(1,40000000)) x);
INSERT 0 40000000
OK, let's run the PL/CUDA function. First of all, make a normal table with 40M rows of random data.
All rows that satisfy 1 + 2·x1 − 3·x2 + x3 + 0.5·x4 > 0 are marked as 'true'.
17. Let’s play (2/4) - Data loading to GPU device memory (part-1)
postgres=# CREATE FOREIGN TABLE ft (
t bool,
x1 real,
x2 real,
x3 real,
x4 real
) SERVER gstore_fdw
OPTIONS (pinning '0');
CREATE FOREIGN TABLE
postgres=# INSERT INTO ft
(SELECT * FROM logreg);
INSERT 0 40000000
Gstore_Fdw is an FDW extension that works on behalf of GPU device memory, specified by the 'pinning' option.
INSERT INTO the Gstore_Fdw table loads the 40M rows of the 'logreg' table.
[Figure: Foreign Table (gstore_fdw) maps onto GPU device memory, handling data format conversion, data compression (if any), and transaction control]
18. Let’s play (3/4) - Data loading to GPU device memory (part-2)
[kaigai@saba src]$ nvidia-smi
Thu Dec 6 12:10:56 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:02:00.0 Off | N/A |
| N/A 42C P0 52W / 250W | 817MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 27650 C ...bgworker: PG-Strom GPU memory keeper 807MiB |
+-----------------------------------------------------------------------------+
807MB of GPU device memory is reserved. The dataset consumes
(sizeof(bool) + 4 * sizeof(float)) * 40M = 680MB,
in addition to about 120MB for device management.
19. Let’s play (4/4)
postgres=# SELECT logregr_train('ft',
attnum_of('ft','t'),
attnums_of('ft','{x1,x2,x3,x4}'));
logregr_train
------------------------------------------
{3376.4,6752.71,-10129.1,3376.3,1688.27}
(1 row)
Time: 3647.059 ms (00:03.647)
The weights of the explanatory variables are estimated. Five elements are returned because there are four
explanatory variables plus the intercept. It takes 3.6 sec.
20. Comparison to CPU implementation (1/3)
logregr_train() function at MADLib
postgres=# SELECT madlib.logregr_train('logreg', 'hoge',
                                       't', 'ARRAY[1,x1,x2,x3,x4]',
                                       NULL, 20);
logregr_train
---------------
(1 row)
Time: 1301307.361 ms (21:41.307)
postgres=# SELECT coef FROM hoge;
coef
------------------------------------------------------
{3041.82722783601,6083.57794939209,-9125.44857123801,3041.73992459095,1520.98287953044}
(1 row)
For the same job, MADLib's logregr_train() took 21min 41sec. The PL/CUDA implementation was about 356 times
faster than the CPU-based implementation:
1301307.36 / 3647.06
= x356.8 times faster
21. Comparison to CPU implementation (2/3) - recalculation
The parameter estimated by logregr_train() is the weight of the division surface, i.e. the weights of the
explanatory variables.

        |   w0    |   w1    |   w2     |   w3    |   w4
--------+---------+---------+----------+---------+---------
PL/CUDA | 3376.4  | 6752.71 | -10129.1 | 3376.3  | 1688.27
MADLib  | 3041.83 | 6083.58 | -9125.45 | 3041.74 | 1520.98
The result of logregr_train() differs from the weights we used when generating the dataset artificially, because it
returns the gradient and intercept of the normal vector of the division surface.
22. Comparison to CPU implementation (3/3) - recalculation
Notice: we usually should not apply the estimated parameters to the training set!
postgres=# SELECT COUNT(*)
FROM (SELECT t, logregr_predict(ARRAY[ 3376.4, 6752.71,
-10129.1, 3376.3,
1688.27]::float[],
ARRAY[x1,x2,x3,x4]) p
FROM logreg) data
WHERE t != p;
count
-------
90
(1 row)
postgres=# SELECT COUNT(*)
FROM (SELECT t, logregr_predict(hoge.coef,
ARRAY[x1,x2,x3,x4]) p
FROM logreg, hoge) data
WHERE t != p;
count
-------
70
(1 row)
Counting the number of incorrect estimations: prediction by our PL/CUDA function got 90 of 40M rows wrong,
and MADLib got 70 of 40M wrong. Note that we usually don't apply prediction to the training set in "actual"
data analytics.
23. Conclusion
▌PL/CUDA sample programs
https://github.com/heterodb/toybox
▌PL/CUDA is fun(ction).
▌Suitable workloads for PL/CUDA
Machine-Learning
Similarity-Search
Anomaly Detection
Image Generation
.... and others
Conclusion: We could make a sample program for PL/CUDA, and it has been published. PL/CUDA is fun.
PL/CUDA will be valuable for machine-learning, similarity-search, anomaly-detection, image generation, and more.