PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics

PL/CUDA
~Fusion of HPC Grade Power with In-Database Analytics~
The PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kaigai@ak.jp.nec.com>

The PG-Strom Project
about myself...
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics2
▌KaiGai Kohei
 tw: @kkaigai
 https://github.com/kaigai
▌PostgreSQL
 SELinux, FDW, CustomScan, ...
▌PG-Strom
 GPU acceleration for PostgreSQL
▌Works
 NEC OSS Promotion Center
 Development of the software and
its business opportunity

PG-Strom Overview (1/2) – Architecture
Application
Storage
Query
Optimizer
Query
Executor
PG-Strom
Extension
SQL Parser
Storage Manager
GPU
No Storage
Changes
No Query
Changes
 Features
• automatic GPU code generation
from the supplied SQL
• asynchronous & massive
parallel execution on GPUs
• WHERE-clause, JOIN, GROUP
BY, and projection are
supported
 Advantages
• transparent acceleration by the
power of thousands cores
• Low cost solution for analytic
processing

Characteristics of GPU (Graphic Processor Unit)
GPU CPU
Model
NVIDIA
Tesla P100
Intel Xeon
E5-2699v4
Architecture Pascal Broadwell
Launch Q2-2016 Q1-2016
# of transistors 15billion 7.2billion
# of cores
3584
(simple)
22
(functional)
core clock
1.126GHz
~1.303GHz
2.20GHz
~3.60GHz
Perk FFLOPS
(FP32)
9.3 TFLOPS
1.2 TFLOPS
(with AVX2)
DRAM Size 16GB (HBM2) max 1.5TB (DDR4)
Memory Band 732GB/s 76.8GB/s
Power
Consumption
250W 145W

PG-Strom Overview (2/2) – GPU binary generation on the fly
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics
QUERY: SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
kern_data_store *kds,
size_t kds_index)
{
pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
pgfn_float8le(kcxt, KVAR_3,
pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
} :
E.g) Transform of arithmetic operations
in the WHERE-clause to CUDA programs
Reference to input data
SQL expression in CUDA source code
Run-time
Compiler
(nvrtc)
Just-in-time
Compile
Parallel
Execution
5

GPU accelerates SQL performance
▌Test Query:
SELECT cat, count(*), avg(x)
FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
GROUP BY cat;
 t0 contains 100M rows, t1...t8 contains 100K rows (like a start schema)
40.44
62.79
79.82
100.50
119.51
144.55
201.12
248.95
9.96 9.93 9.96 9.99 10.02 10.03 9.98 10.00
0
50
100
150
200
250
300
2 3 4 5 6 7 8 9
QueryResponseTime[sec]
Number of joined tables
PG-Strom microbenchmark with JOIN/GROUP BY
PostgreSQL v9.5 PG-Strom v1.0
CPU: Xeon E5-2670v3
GPU: GTX1080
RAM: 384GB
OS: CentOS 7.2
DB: PostgreSQL 9.5 +
PG-Strom v1.0

Feedbacks from users during v1.0 development
Application
Storage
Query
Optimizer
Query
Executor
PG-Strom
Extension
SQL Parser
Storage Manager
GPU
Heavy computing
intensive workloads
• In-database Analytics
• Scientific R&D, Marketing, ...
 by PL/CUDA + Matrix-Array
Heavy I/O
intensive workloads
• Generic large OLAP
• ETL, Reporting, ...
 by SSD-to-GPU P2P DMA

The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics8
Introduction of PL/CUDA

Our Failure (1/3) – Write an algorithm logic in SQL
Apr-2016

Our Failure (2/3) – Performance benefit (?)
Apr-2016

Our Failure (3/3) – Problems
▌Problem.1 – Who writes algorithms in SQL?
 Majority of mathematical algorithms are developed based on
the manner of procedural programming language.
 Although users don’t need to write algorithm logics in CUDA,
they also have to write up the algorithm logic using SQL puzzle.
▌Problem.2 – Performance Benefit
 Yeah, PG-Strom is much faster than PostgreSQL to execute the core
logic of min-max method. It is an excellent result but nobody knows
how many people run the algorithm on PostgreSQL.
 Performance of GpuProjection is almost equivalent to the CPU version
of implementation which is designed to a particular problem. Why?
 Inefficient code due to SQL compatibility
 Inefficient data format due to PostgreSQL’s row-format

Our Answer – PL/CUDA + Matrix-like Array
CREATE FUNCTION my_logic(matrix, matrix)
RETURNS vector
AS $$
$$ LANGUAGE ‘plcuda’;
User define CUDA code block
Storage
GPU Kernel
User defined
CUDA code block
Post SQL Process
 Tables JOIN
 Window
Function
 ORDER BY
 GROUP BY
 etc....
Load function’s
arguments
Write-back
result set
PL/CUDA
method for manual optimization
ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN…
𝑎1 ⋯ 𝑑1
⋮ ⋱ ⋮
𝑎 𝑁 ⋯ 𝑑 𝑁
Matrix of
4cols x Nrows
Matrix-like Array
2D Array without NULL, to represent a matrix

Example of PL/CUDA function definition
CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int, int[], int[])
RETURNS float4[]
AS $$
#plcuda_begin
cl_int k = arg1.value;
MatrixType *Q = (MatrixType *) arg2.value;
MatrixType *D = (MatrixType *) arg3.value;
MatrixType *R = (MatrixType *) results;
:
nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k);
for (loop=0; loop < nloops; loop++) {
/* 1. calculation of the similarity */
for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) {
j = i % part_sz; /* index within partition */
/* index of database matrix (D) */
dindex = part_nums * get_global_index() + (i / part_sz);
/* index of query matrix (Q) */
qindex = loop * (part_sz - k) + (j - k);
values[i] = knn_similarity_compute(D, dindex, Q, qindex);
}
}
:
#plcuda_end
$$ LANGUAGE 'plcuda';
CUDA code block

How GPU Kernels are built
my_cuda_func(float[])
RETURNS int[]
$$
#plcuda_sanity_check func_sc
#plcuda_begin
#plcuda_end
#plcuda_working_bufsz func_wb
$$ LANGUAGE ‘plcuda’;
User define CUDA code block
bool func_sc(float[])
helper function for sanity check
bigint func_wb(float[])
helper function for buffer size estimation
GPU
Binary
GPU
Kernel
Working
Buffer
Input
arguments
Run-time
compiler
input/output
of SQL data
User define
CUDA code block
Common Library
Routines
source program
of GPU code
Load the
arguments

Why automatic generated code was not best for performance
 NULL checks for each variable references
 Overflow checks for each primitive operators
 Function call instead of primitive operators
STATIC_FUNCTION(pg_float4_t)
pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2)
{
pg_float4_t result;
result.isnull = arg1.isnull | arg2.isnull;
if (!result.isnull)
{
result.value = arg1.value * arg2.value;
CHECKFLOATVAL(&kcxt->e, result,
isinf(arg1.value) || isinf(arg2.value),
arg1.value == 0.0 || arg2.value == 0.0);
}
return result;
}
select x*y from c_test;

Disadvantage because of data format (1/2)
▌Row-oriented data
× includes unreferenced values
× many steps for data references
〇 common data format with PostgreSQL
▌Column-oriented data
〇 load only referenced variables
〇 data reference by 1 step
× needs data format exchange
GPU
core
GPU
core
GPU
core
a c d feb
a c d feb
a d feb
a c d feb
GPU
core
eb
GPU
core
GPU
core
GPU
core
GPU
core
b
b
b
b
b
GPU
core
GPU
core
e
e
e
e
e
Usual SQL load cannot justify the cost for data transformation,
Advanced algorithm processing is improved by columnar format.

Disadvantage because of data format (2/2)
▌Case of random memory access
 Increase of memory transaction, but less usage rate of the data-bus
▌Case of coalesced memory access
 Least number of memory transaction, and maximum usage rate of the data-bus
32bit
Memory transaction width: 256bit
32bit 32bit32bit 32bit 32bit
32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit
Memory transaction width: 256bit
32bit x 8 = 256bit is valid data
in 256bit width memory transaction
(Bus usage ratio: 100.0%)
Only 32bit x 1 = 32bit is valid data
in 256bit width memory transaction
(Bus usage ratio: 12.5%)
GPU cores
GPU cores

2D-Array as Matrix
▌datatype[] array_matrix(variadic datatype[])
 An aggregate function to construct a 2D-array from the input stream.
 datatype is either of int2, int4, int8, float4 or float8.
 The 2D-array does not contain NULL.
▌SETOF record matrix_unnest(datatype[])
 Deform a 2D-array into stream of multiple records
▌Downside
 Unable to handle variable-length data type
 1GB limit of varlena data type in PostgreSQL
ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN…
𝑎1 ⋯ 𝑑1
⋮ ⋱ ⋮
𝑎 𝑁 ⋯ 𝑑 𝑁
Matrix of
4cols x Nrows
Matrix-like Array
2D Array without NULL, to represent a matrix

Example to call PL/CUDA function
SELECT row_number() OVER (),
float4_as_int4(R.key_id) key_id,
R.score
FROM matrix_unnest(
(SELECT my_plcuda_function(A.matrix,
B.matrix)
FROM (SELECT cbind(array_matrix(id),
array_matrix(x, y, z)) matrix
FROM normal_table
WHERE tag LIKE ‘%abc%’) A,
(SELECT matrix
FROM matrix_table) B
)
) AS R(key_id real, score real)
ORDER BY score DESC
LIMIT 1000;
Invocation of PL/CUDA
function with two
Matrix arguments
Construct, or load
pre-built array-matrix
Deform array-matrix
to generic records
Post-process by SQL
(JOIN, window-function)

Case Study
similarity search on drug discovery

Background – relationship of disease and chemical compounds
target disease
relevant protein
chemical compounds
(= candidate of drugs)
Discovery of chemical compounds which are “active” to the target protein
inactive active
active
but toxicity
academic papers

k-NN Similarity Search on Chemical Compounds (1/2)
Database chemical compounds set
(D; 10M records scale)
Query chemical
compounds set
(Q; ~1000 records scale)
Search by
Similarity
Target Protein
“similar compounds” will
have higher probability of active
Picks up active
chemical compounds
to the target protein
from academic papers

k-NN Similarity Search on Chemical Compounds (2/2)
Similarity
is
definition of distance
ID NAME Fingerprint (1024bit)
1 CHEMBL153534 000000000001000000100000000000000100000000000001000000...
2 CHEMBL405398 000000000000000100100000000000000000000000000000100000...
3 CHEMBL503634 000001000000000000000000001000000100000000000000000000...
: : :
Data structure of the chemical compounds
Similarity by Jaccard index:
𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝐴, 𝐵 = 𝐴 𝐹𝑃 ∩ 𝐵 𝐹𝑃 𝐴 𝐹𝑃 ∪ 𝐵 𝐹𝑃

Scale of the computing
Database chemical compounds set
(D; 10M records scale)
Q: Query chemical
compounds set
average of
the top-3
𝑑𝑖 of D-compounds
Distance of Q-set
and 𝑑𝑖 compounds
𝑑𝑗 of D-compounds
average of
the top-3
Distance of Q-set
and 𝑑𝑗 compounds
Order of calculation:
𝑂 𝑄 × 𝐷 + 𝑂 𝐷 × 𝑄𝑙𝑜𝑔𝑄
(make distance) (sorting+average)

Implementation of PL/CUDA function (1/3)
Step-1
Split all the logical combination of Q and D
into multiple partitions, then assign them
on SMM; execution unit of GPU.
Step-2
Each GPU core calculates a similarity score
between a Q-compound and a D-compound,
then store the score on “shared memory”;
which is fast and close to SMM.

Step-3
Bitonic-sorting by the similarity score, then
reorder the Q-compounds Step-5
Make an average by the
top-k values, then store it
on the result buffer.
Step-4
Repeat from Step-2, if # of Q-compounds is larger
than shared memory size. Top-k items are kept.

knn_gpu_similarity(int, -- k-value
int[], -- ID+bitmap of Q
int[]) -- ID+bitmap of D
RETURNS float4[] -- result: ID+similarity
AS $$
#plcuda_decl
:
#plcuda_begin
#plcuda_kernel_blocksz ¥
knn_gpu_similarity_main_block_size
#plcuda_num_threads ¥
knn_gpu_similarity_main_num_threads
#plcuda_shmem_blocksz 8192
cl_int k = arg1.value;
MatrixType *Q = (MatrixType *) arg2.value;
MatrixType *D = (MatrixType *) arg3.value;
MatrixType *R = (MatrixType *) results;
:
for (loop=0; loop < nloops; loop++)
{
/* 1. calculation of the similarity */
for (i = get_local_id();
i < part_sz * part_nums;
i += get_local_size()) {
j = i % part_sz; /* index within partition */
dindex = part_nums * get_global_index()
+ (i / part_sz);
qindex = loop * (part_sz - k) + (j - k);
if (dindex < ARRAY_MATRIX_HEIGHT(D) &&
qindex < ARRAY_MATRIX_HEIGHT(Q)) {
values[i] = knn_similarity_compute(D, dindex,
Q, qindex);
}
}
__syncthreads();
/* 2. sorting by the similarity for each partition */
knn_similarity_sorting(values, part_sz, part_nums);
__syncthreads();
:
}
#plcuda_end
#plcuda_sanity_check knn_gpu_similarity_sanity_check
#plcuda_working_bufsz 0
#plcuda_results_bufsz knn_gpu_similarity_results_bufsz
real[] -- ID+Similarity of D-compounds (2xN)
knn_gpu_similarity(int k, -- k-value
int[] Q, -- ID+Fingerprint of Q-compounds (33xM)
int[] D); -- ID+Fingerprint of D-compounds (33xN)

Invocation of PL/CUDA function
PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value
AS
SELECT row_number() OVER (),
fp.name,
similarity
FROM (SELECT float4_as_int4(key_id) key_id, similarity
FROM matrix_unnest(
(SELECT rbind( knn_gpu_similarity($1,Q.matrix,
D.matrix))
FROM (SELECT cbind(array_matrix(id),
array_matrix(bitmap)) matrix
FROM finger_print_query) Q,
(SELECT matrix
FROM finger_print_10m_matrix) D
)
) AS sim(key_id real, similarity real)
ORDER BY similarity DESC) sim,
finger_print_10m fp
WHERE fp.id = sim.key_id
LIMIT 1000;
Post-process by SQL; like lookup of compounds
name by compounds-id (tables JOIN), making
rank of similarity by window function.
Execution of PL/CUDA function
with Q-/D-matrix as argument
Transform the records read from tables
to Array-Matrix type.
(Pre-build is also possible)
Transform the Array-Matrix (3xN),
return value of PL/CUDA function,
into usual record data x Nrows.

Performance results
 For comparison to CPU cases, we implemented an equivalent SQL function by C.
 # of D-compounds is 10M records, # of Q-compounds is 10, 50, 100, 500 and 1000.
 Up to 10B combination search; almost equivalent size for real drug discovery research.
 HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM:384GB
 SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0
30.25
145.29
295.95
1503.31
3034.94
12.97 13.46 13.90 18.16 24.6513.00 13.23 13.59 16.01 19.13
0
500
1000
1500
2000
2500
3000
3500
10 50 100 500 1000
QueryResponseTime[sec]
Number of Query Compounds [Q]
Similarity search of chemical compounds by k-NN method (k=3, D=10M)
CPU(E5-2670v3) GTX980 GTX1080
x150 times
faster!

Another Usage
k-means clustering in database system

Clustering Analysis

k-means clustering algorithm
1. Assign cluster randomly 2. Make centroid for each
cluster
3. Chose the nearest
cluster from the centroid

k-means clustering algorithm
1. Assign cluster randomly 5. Make centroid for each
cluster again
6. All the cluster get fully
converged
4. Repeat until convergence

k-means clustering on PL/CUDA
gpu_kmeans(real[], -- ID + Data Matrix
int, -- k-value (number of clusters)
int = 10, -- max number of iteration
int = 1) -- seed of initial randomness
RETURNS int[]
AS $$
#plcuda_decl
:
KERNEL_FUNCTION_MAXTHREADS(void)
update_centroid(MatrixType *D,
MatrixType *R,
MatrixType *C)
{
:
/* accumulate the local centroid */
for (did = get_global_id();
did < nitems;
did += get_global_size())
{
/* pick up the target cluster */
cid = r_values[nitems + did];
atomicAdd(&l_cent[cid], 1.0);
for (index=1; index < width; index++)
atomicAdd(&l_cent[index * k_value + cid],
d_values[index * nitems + did]);
}
__syncthreads();
/* write back to the global C-matrix */
for (index = get_local_id();
index < width * k_value;
index += get_local_size())
atomicAdd(&c_values[index], l_cent[index]);
}
:
#plcuda_begin
:
status = pgstromLaunchDynamicKernel4((void *)
setup_initial_cluster,
(kern_arg_t)(D),
(kern_arg_t)(R),
(kern_arg_t)(C),
(kern_arg_t)(r_seed),
nitems, 0, 0);
if (status != cudaSuccess)
PLCUDA_RUNTIME_ERROR_RETURN(status);
for (loop=0; loop < nloops; loop++)
{
:
status = pgstromLaunchDynamicKernelMaxThreads3(
(void *)kmeans_update_cluster,
(kern_arg_t)(D),
(kern_arg_t)(R),
(kern_arg_t)(C),
(kern_arg_t)k_value,
nitems, 0,
sizeof(cl_int) + sizeof(cl_float));
if (status != cudaSuccess)
PLCUDA_RUNTIME_ERROR_RETURN(status);
:
}
#plcuda_sanity_check gpu_kmeans_sanity_check
#plcuda_working_bufsz gpu_kmeans_working_bufsz
#plcuda_results_bufsz gpu_kmeans_results_bufsz
#plcuda_end

Test data for k-means clustering
▌Dataset overview
 A collection of datasets of vehicle traffic,
observed between two points for a set duration
of time over a period of 6 months
 449 observation points, at Aarhus, Denmark.
 13.5M records from Feb to June of 2014
▌Data contains
 average speed
 average measured time
 number of vehicles
 latitude/longitude of the observation points
 etc...
▌What we did
 categorize the zone of road into 5-classes
according to the characteristics of vehicle’s
running style.

Invocation of GPU k-means function
SELECT report_id, k, c
FROM (SELECT report_id, k, c,
row_number() OVER (PARTITION BY report_id
ORDER BY c DESC) rank
FROM (SELECT report_id, k, count(*) c
FROM matrix_unnest(
(SELECT gpu_kmeans ( array_matrix(
int4_as_float4(report_id),
avg_measured_time,
avg_speed,
vehicle_count),
5)
FROM tr_rawdata)
) R(report_id int, k int)
GROUP BY report_id, k
) __summary_1
) __summary_2
WHERE rank = 1;
Make a matrix from the raw-data
Run k-means clustering logic
Pick-up most frequent cluster

GPU k-means (1/3) – clustering by all the data
$ wget -O map.png "`psql traffic -At -f ~/traffic.sql`"
bypass highway?
Road towards
downtown?
Beltway?

GPU k-means (2/3) – Daytime and Nighttime
daytime (8-17) nighttime (18-7)

GPU k-means (3/3) – Weekdays and Weekend
weekdays weekend

Invocation of GPU k-means function
SELECT report_id, k, c
FROM (SELECT report_id, k, c,
row_number() OVER (PARTITION BY report_id
ORDER BY c DESC) rank
FROM (SELECT report_id, k, count(*) c
FROM matrix_unnest(
(SELECT gpu_kmeans ( array_matrix(
int4_as_float4(report_id),
avg_measured_time,
avg_speed,
vehicle_count),
5)
FROM tr_rawdata
WHERE extract('hour' from timestamp)
between 7 and 17
)
) R(report_id int, k int)
GROUP BY report_id, k
) __summary_1
) __summary_2
WHERE rank = 1;
Just add a line to select
different input data set.
Flexibility of SQL!

Summary

Summary
▌What is PL/CUDA
 Original concept of PG-Strom is automatic optimization.
 PL/CUDA pulls out full capability of GPU instead of manual optimization.
Likely, nobody has written advanced algorithm in SQL :-)
▌Advantages
 TFLOPS grade computing engine for analytics in-database
 No need to export entire dataset for analytics by external applications
 Allows to utilize SQL flexibility for pre-/post-processing of the core
analytics algorithms
▌Future Challenges
 Data size larger than 1GB, because of varlena restriction in PostgreSQL
 Asynchronous execution. CPU parallel
 Time to construct matrix. If likely static, we can construct preliminary.

Resources
▌Repository
https://github.com/pg-strom/devel
▌Today’s Slides
http://www.slideshare.net/kaigai/pgconfsv2016-plcuda
▌Contact
 kaigai@ak.jp.nec.com
 Tw: @kkaigai

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics

More Related Content

What's hot

Viewers also liked

Similar to PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics

More from Kohei KaiGai

Recently uploaded

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics