PL/CUDA
~Fusion of HPC Grade Power with In-Database Analytics~
The PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kaigai@ak.jp.nec.com>
The PG-Strom Project
about myself...
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics2
▌KaiGai Kohei
 tw: @kkaigai
 https://github.com/kaigai
▌PostgreSQL
 SELinux, FDW, CustomScan, ...
▌PG-Strom
 GPU acceleration for PostgreSQL
▌Works
 NEC OSS Promotion Center
 Development of the software and
its business opportunity
The PG-Strom Project
PG-Strom Overview (1/2) – Architecture
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics3
Application
Storage
Query
Optimizer
Query
Executor
PG-Strom
Extension
SQL Parser
Storage Manager
GPU
No Storage
Changes
No Query
Changes
 Features
• automatic GPU code generation
from the supplied SQL
• asynchronous & massive
parallel execution on GPUs
• WHERE-clause, JOIN, GROUP
BY, and projection are
supported
 Advantages
• transparent acceleration by the
power of thousands cores
• Low cost solution for analytic
processing
The PG-Strom Project
Characteristics of GPU (Graphic Processor Unit)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics4
GPU CPU
Model
NVIDIA
Tesla P100
Intel Xeon
E5-2699v4
Architecture Pascal Broadwell
Launch Q2-2016 Q1-2016
# of transistors 15billion 7.2billion
# of cores
3584
(simple)
22
(functional)
core clock
1.126GHz
~1.303GHz
2.20GHz
~3.60GHz
Perk FFLOPS
(FP32)
9.3 TFLOPS
1.2 TFLOPS
(with AVX2)
DRAM Size 16GB (HBM2) max 1.5TB (DDR4)
Memory Band 732GB/s 76.8GB/s
Power
Consumption
250W 145W
The PG-Strom Project
PG-Strom Overview (2/2) – GPU binary generation on the fly
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics
QUERY: SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
kern_data_store *kds,
size_t kds_index)
{
pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
pgfn_float8le(kcxt, KVAR_3,
pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
} :
E.g) Transform of arithmetic operations
in the WHERE-clause to CUDA programs
Reference to input data
SQL expression in CUDA source code
Run-time
Compiler
(nvrtc)
Just-in-time
Compile
Parallel
Execution
5
The PG-Strom Project
GPU accelerates SQL performance
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics6
▌Test Query:
SELECT cat, count(*), avg(x)
FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
GROUP BY cat;
 t0 contains 100M rows, t1...t8 contains 100K rows (like a start schema)
40.44
62.79
79.82
100.50
119.51
144.55
201.12
248.95
9.96 9.93 9.96 9.99 10.02 10.03 9.98 10.00
0
50
100
150
200
250
300
2 3 4 5 6 7 8 9
QueryResponseTime[sec]
Number of joined tables
PG-Strom microbenchmark with JOIN/GROUP BY
PostgreSQL v9.5 PG-Strom v1.0
CPU: Xeon E5-2670v3
GPU: GTX1080
RAM: 384GB
OS: CentOS 7.2
DB: PostgreSQL 9.5 +
PG-Strom v1.0
The PG-Strom Project
Feedbacks from users during v1.0 development
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics7
Application
Storage
Query
Optimizer
Query
Executor
PG-Strom
Extension
SQL Parser
Storage Manager
GPU
Heavy computing
intensive workloads
• In-database Analytics
• Scientific R&D, Marketing, ...
 by PL/CUDA + Matrix-Array
Heavy I/O
intensive workloads
• Generic large OLAP
• ETL, Reporting, ...
 by SSD-to-GPU P2P DMA
The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics8
Introduction of PL/CUDA
The PG-Strom Project
Our Failure (1/3) – Write an algorithm logic in SQL
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics9
Apr-2016
The PG-Strom Project
Our Failure (2/3) – Performance benefit (?)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics10
Apr-2016
The PG-Strom Project
Our Failure (3/3) – Problems
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics11
▌Problem.1 – Who writes algorithms in SQL?
 Majority of mathematical algorithms are developed based on
the manner of procedural programming language.
 Although users don’t need to write algorithm logics in CUDA,
they also have to write up the algorithm logic using SQL puzzle.
▌Problem.2 – Performance Benefit
 Yeah, PG-Strom is much faster than PostgreSQL to execute the core
logic of min-max method. It is an excellent result but nobody knows
how many people run the algorithm on PostgreSQL.
 Performance of GpuProjection is almost equivalent to the CPU version
of implementation which is designed to a particular problem. Why?
 Inefficient code due to SQL compatibility
 Inefficient data format due to PostgreSQL’s row-format
The PG-Strom Project
Our Answer – PL/CUDA + Matrix-like Array
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics12
CREATE FUNCTION my_logic(matrix, matrix)
RETURNS vector
AS $$
$$ LANGUAGE ‘plcuda’;
User define CUDA code block
Storage
GPU Kernel
User defined
CUDA code block
Post SQL Process
 Tables JOIN
 Window
Function
 ORDER BY
 GROUP BY
 etc....
Load function’s
arguments
Write-back
result set
PL/CUDA
method for manual optimization
ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN…
𝑎1 ⋯ 𝑑1
⋮ ⋱ ⋮
𝑎 𝑁 ⋯ 𝑑 𝑁
Matrix of
4cols x Nrows
Matrix-like Array
2D Array without NULL, to represent a matrix
The PG-Strom Project
Example of PL/CUDA function definition
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics13
CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int, int[], int[])
RETURNS float4[]
AS $$
#plcuda_begin
cl_int k = arg1.value;
MatrixType *Q = (MatrixType *) arg2.value;
MatrixType *D = (MatrixType *) arg3.value;
MatrixType *R = (MatrixType *) results;
:
nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k);
for (loop=0; loop < nloops; loop++) {
/* 1. calculation of the similarity */
for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) {
j = i % part_sz; /* index within partition */
/* index of database matrix (D) */
dindex = part_nums * get_global_index() + (i / part_sz);
/* index of query matrix (Q) */
qindex = loop * (part_sz - k) + (j - k);
values[i] = knn_similarity_compute(D, dindex, Q, qindex);
}
}
:
#plcuda_end
$$ LANGUAGE 'plcuda';
CUDA code block
The PG-Strom Project
How GPU Kernels are built
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics14
CREATE OR REPLACE FUNCTION
my_cuda_func(float[])
RETURNS int[]
$$
#plcuda_sanity_check func_sc
#plcuda_begin
#plcuda_end
#plcuda_working_bufsz func_wb
$$ LANGUAGE ‘plcuda’;
User define CUDA code block
bool func_sc(float[])
helper function for sanity check
bigint func_wb(float[])
helper function for buffer size estimation
GPU
Binary
GPU
Kernel
Working
Buffer
Input
arguments
Run-time
compiler
input/output
of SQL data
User define
CUDA code block
Common Library
Routines
source program
of GPU code
Load the
arguments
The PG-Strom Project
Why automatic generated code was not best for performance
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics15
 NULL checks for each variable references
 Overflow checks for each primitive operators
 Function call instead of primitive operators
STATIC_FUNCTION(pg_float4_t)
pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2)
{
pg_float4_t result;
result.isnull = arg1.isnull | arg2.isnull;
if (!result.isnull)
{
result.value = arg1.value * arg2.value;
CHECKFLOATVAL(&kcxt->e, result,
isinf(arg1.value) || isinf(arg2.value),
arg1.value == 0.0 || arg2.value == 0.0);
}
return result;
}
select x*y from c_test;
The PG-Strom Project
Disadvantage because of data format (1/2)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics16
▌Row-oriented data
× includes unreferenced values
× many steps for data references
〇 common data format with PostgreSQL
▌Column-oriented data
〇 load only referenced variables
〇 data reference by 1 step
× needs data format exchange
GPU
core
GPU
core
GPU
core
a c d feb
a c d feb
a d feb
a c d feb
GPU
core
eb
GPU
core
GPU
core
GPU
core
GPU
core
b
b
b
b
b
GPU
core
GPU
core
e
e
e
e
e
Usual SQL load cannot justify the cost for data transformation,
Advanced algorithm processing is improved by columnar format.
The PG-Strom Project
Disadvantage because of data format (2/2)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics17
▌Case of random memory access
 Increase of memory transaction, but less usage rate of the data-bus
▌Case of coalesced memory access
 Least number of memory transaction, and maximum usage rate of the data-bus
32bit
Memory transaction width: 256bit
32bit 32bit32bit 32bit 32bit
32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit
Memory transaction width: 256bit
32bit x 8 = 256bit is valid data
in 256bit width memory transaction
(Bus usage ratio: 100.0%)
Only 32bit x 1 = 32bit is valid data
in 256bit width memory transaction
(Bus usage ratio: 12.5%)
GPU cores
GPU cores
The PG-Strom Project
2D-Array as Matrix
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics18
▌datatype[] array_matrix(variadic datatype[])
 An aggregate function to construct a 2D-array from the input stream.
 datatype is either of int2, int4, int8, float4 or float8.
 The 2D-array does not contain NULL.
▌SETOF record matrix_unnest(datatype[])
 Deform a 2D-array into stream of multiple records
▌Downside
 Unable to handle variable-length data type
 1GB limit of varlena data type in PostgreSQL
ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN…
𝑎1 ⋯ 𝑑1
⋮ ⋱ ⋮
𝑎 𝑁 ⋯ 𝑑 𝑁
Matrix of
4cols x Nrows
Matrix-like Array
2D Array without NULL, to represent a matrix
The PG-Strom Project
Example to call PL/CUDA function
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics19
SELECT row_number() OVER (),
float4_as_int4(R.key_id) key_id,
R.score
FROM matrix_unnest(
(SELECT my_plcuda_function(A.matrix,
B.matrix)
FROM (SELECT cbind(array_matrix(id),
array_matrix(x, y, z)) matrix
FROM normal_table
WHERE tag LIKE ‘%abc%’) A,
(SELECT matrix
FROM matrix_table) B
)
) AS R(key_id real, score real)
ORDER BY score DESC
LIMIT 1000;
Invocation of PL/CUDA
function with two
Matrix arguments
Construct, or load
pre-built array-matrix
Deform array-matrix
to generic records
Post-process by SQL
(JOIN, window-function)
The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics20
Case Study
similarity search on drug discovery
The PG-Strom Project
Background – relationship of disease and chemical compounds
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics21
target disease
relevant protein
chemical compounds
(= candidate of drugs)
Discovery of chemical compounds which are “active” to the target protein
inactive active
active
but toxicity
academic papers
The PG-Strom Project
k-NN Similarity Search on Chemical Compounds (1/2)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics22
Database chemical compounds set
(D; 10M records scale)
Query chemical
compounds set
(Q; ~1000 records scale)
Search by
Similarity
Target Protein
“similar compounds” will
have higher probability of active
Picks up active
chemical compounds
to the target protein
from academic papers
The PG-Strom Project
k-NN Similarity Search on Chemical Compounds (2/2)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics23
Similarity
is
definition of distance
ID NAME Fingerprint (1024bit)
1 CHEMBL153534 000000000001000000100000000000000100000000000001000000...
2 CHEMBL405398 000000000000000100100000000000000000000000000000100000...
3 CHEMBL503634 000001000000000000000000001000000100000000000000000000...
: : :
Data structure of the chemical compounds
Similarity by Jaccard index:
𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝐴, 𝐵 = 𝐴 𝐹𝑃 ∩ 𝐵 𝐹𝑃 𝐴 𝐹𝑃 ∪ 𝐵 𝐹𝑃
The PG-Strom Project
Scale of the computing
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics24
Database chemical compounds set
(D; 10M records scale)
Q: Query chemical
compounds set
average of
the top-3
𝑑𝑖 of D-compounds
Distance of Q-set
and 𝑑𝑖 compounds
𝑑𝑗 of D-compounds
average of
the top-3
Distance of Q-set
and 𝑑𝑗 compounds
Order of calculation:
𝑂 𝑄 × 𝐷 + 𝑂 𝐷 × 𝑄𝑙𝑜𝑔𝑄
(make distance) (sorting+average)
The PG-Strom Project
Implementation of PL/CUDA function (1/3)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics25
Step-1
Split all the logical combination of Q and D
into multiple partitions, then assign them
on SMM; execution unit of GPU.
Step-2
Each GPU core calculates a similarity score
between a Q-compound and a D-compound,
then store the score on “shared memory”;
which is fast and close to SMM.
The PG-Strom Project
Implementation of PL/CUDA function (2/3)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics26
Step-3
Bitonic-sorting by the similarity score, then
reorder the Q-compounds Step-5
Make an average by the
top-k values, then store it
on the result buffer.
Step-4
Repeat from Step-2, if # of Q-compounds is larger
than shared memory size. Top-k items are kept.
The PG-Strom Project
Implementation of PL/CUDA function (3/3)
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics27
CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int, -- k-value
int[], -- ID+bitmap of Q
int[]) -- ID+bitmap of D
RETURNS float4[] -- result: ID+similarity
AS $$
#plcuda_decl
:
#plcuda_begin
#plcuda_kernel_blocksz ¥
knn_gpu_similarity_main_block_size
#plcuda_num_threads ¥
knn_gpu_similarity_main_num_threads
#plcuda_shmem_blocksz 8192
cl_int k = arg1.value;
MatrixType *Q = (MatrixType *) arg2.value;
MatrixType *D = (MatrixType *) arg3.value;
MatrixType *R = (MatrixType *) results;
:
for (loop=0; loop < nloops; loop++)
{
/* 1. calculation of the similarity */
for (i = get_local_id();
i < part_sz * part_nums;
i += get_local_size()) {
j = i % part_sz; /* index within partition */
dindex = part_nums * get_global_index()
+ (i / part_sz);
qindex = loop * (part_sz - k) + (j - k);
if (dindex < ARRAY_MATRIX_HEIGHT(D) &&
qindex < ARRAY_MATRIX_HEIGHT(Q)) {
values[i] = knn_similarity_compute(D, dindex,
Q, qindex);
}
}
__syncthreads();
/* 2. sorting by the similarity for each partition */
knn_similarity_sorting(values, part_sz, part_nums);
__syncthreads();
:
}
#plcuda_end
#plcuda_sanity_check knn_gpu_similarity_sanity_check
#plcuda_working_bufsz 0
#plcuda_results_bufsz knn_gpu_similarity_results_bufsz
$$ LANGUAGE 'plcuda';
real[] -- ID+Similarity of D-compounds (2xN)
knn_gpu_similarity(int k, -- k-value
int[] Q, -- ID+Fingerprint of Q-compounds (33xM)
int[] D); -- ID+Fingerprint of D-compounds (33xN)
The PG-Strom Project
Invocation of PL/CUDA function
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics28
PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value
AS
SELECT row_number() OVER (),
fp.name,
similarity
FROM (SELECT float4_as_int4(key_id) key_id, similarity
FROM matrix_unnest(
(SELECT rbind( knn_gpu_similarity($1,Q.matrix,
D.matrix))
FROM (SELECT cbind(array_matrix(id),
array_matrix(bitmap)) matrix
FROM finger_print_query) Q,
(SELECT matrix
FROM finger_print_10m_matrix) D
)
) AS sim(key_id real, similarity real)
ORDER BY similarity DESC) sim,
finger_print_10m fp
WHERE fp.id = sim.key_id
LIMIT 1000;
Post-process by SQL; like lookup of compounds
name by compounds-id (tables JOIN), making
rank of similarity by window function.
Execution of PL/CUDA function
with Q-/D-matrix as argument
Transform the records read from tables
to Array-Matrix type.
(Pre-build is also possible)
Transform the Array-Matrix (3xN),
return value of PL/CUDA function,
into usual record data x Nrows.
The PG-Strom Project
Performance results
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics29
 For comparison to CPU cases, we implemented an equivalent SQL function by C.
 # of D-compounds is 10M records, # of Q-compounds is 10, 50, 100, 500 and 1000.
 Up to 10B combination search; almost equivalent size for real drug discovery research.
 HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM:384GB
 SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0
30.25
145.29
295.95
1503.31
3034.94
12.97 13.46 13.90 18.16 24.6513.00 13.23 13.59 16.01 19.13
0
500
1000
1500
2000
2500
3000
3500
10 50 100 500 1000
QueryResponseTime[sec]
Number of Query Compounds [Q]
Similarity search of chemical compounds by k-NN method (k=3, D=10M)
CPU(E5-2670v3) GTX980 GTX1080
x150 times
faster!
The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics30
Another Usage
k-means clustering in database system
The PG-Strom Project
Clustering Analysis
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics31
The PG-Strom Project
k-means clustering algorithm
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics32
1. Assign cluster randomly 2. Make centroid for each
cluster
3. Chose the nearest
cluster from the centroid
The PG-Strom Project
k-means clustering algorithm
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics33
1. Assign cluster randomly 5. Make centroid for each
cluster again
6. All the cluster get fully
converged
4. Repeat until convergence
The PG-Strom Project
k-means clustering on PL/CUDA
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics34
CREATE OR REPLACE FUNCTION
gpu_kmeans(real[], -- ID + Data Matrix
int, -- k-value (number of clusters)
int = 10, -- max number of iteration
int = 1) -- seed of initial randomness
RETURNS int[]
AS $$
#plcuda_decl
:
KERNEL_FUNCTION_MAXTHREADS(void)
update_centroid(MatrixType *D,
MatrixType *R,
MatrixType *C)
{
:
/* accumulate the local centroid */
for (did = get_global_id();
did < nitems;
did += get_global_size())
{
/* pick up the target cluster */
cid = r_values[nitems + did];
atomicAdd(&l_cent[cid], 1.0);
for (index=1; index < width; index++)
atomicAdd(&l_cent[index * k_value + cid],
d_values[index * nitems + did]);
}
__syncthreads();
/* write back to the global C-matrix */
for (index = get_local_id();
index < width * k_value;
index += get_local_size())
atomicAdd(&c_values[index], l_cent[index]);
}
:
#plcuda_begin
:
status = pgstromLaunchDynamicKernel4((void *)
setup_initial_cluster,
(kern_arg_t)(D),
(kern_arg_t)(R),
(kern_arg_t)(C),
(kern_arg_t)(r_seed),
nitems, 0, 0);
if (status != cudaSuccess)
PLCUDA_RUNTIME_ERROR_RETURN(status);
for (loop=0; loop < nloops; loop++)
{
:
status = pgstromLaunchDynamicKernelMaxThreads3(
(void *)kmeans_update_cluster,
(kern_arg_t)(D),
(kern_arg_t)(R),
(kern_arg_t)(C),
(kern_arg_t)k_value,
nitems, 0,
sizeof(cl_int) + sizeof(cl_float));
if (status != cudaSuccess)
PLCUDA_RUNTIME_ERROR_RETURN(status);
:
}
#plcuda_sanity_check gpu_kmeans_sanity_check
#plcuda_working_bufsz gpu_kmeans_working_bufsz
#plcuda_results_bufsz gpu_kmeans_results_bufsz
#plcuda_end
$$ LANGUAGE 'plcuda';
The PG-Strom Project
Test data for k-means clustering
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics35
▌Dataset overview
 A collection of datasets of vehicle traffic,
observed between two points for a set duration
of time over a period of 6 months
 449 observation points, at Aarhus, Denmark.
 13.5M records from Feb to June of 2014
▌Data contains
 average speed
 average measured time
 number of vehicles
 latitude/longitude of the observation points
 etc...
▌What we did
 categorize the zone of road into 5-classes
according to the characteristics of vehicle’s
running style.
The PG-Strom Project
Invocation of GPU k-means function
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics36
SELECT report_id, k, c
FROM (SELECT report_id, k, c,
row_number() OVER (PARTITION BY report_id
ORDER BY c DESC) rank
FROM (SELECT report_id, k, count(*) c
FROM matrix_unnest(
(SELECT gpu_kmeans ( array_matrix(
int4_as_float4(report_id),
avg_measured_time,
avg_speed,
vehicle_count),
5)
FROM tr_rawdata)
) R(report_id int, k int)
GROUP BY report_id, k
) __summary_1
) __summary_2
WHERE rank = 1;
Make a matrix from the raw-data
Run k-means clustering logic
Pick-up most frequent cluster
The PG-Strom Project
GPU k-means (1/3) – clustering by all the data
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics37
$ wget -O map.png "`psql traffic -At -f ~/traffic.sql`"
bypass highway?
Road towards
downtown?
Beltway?
The PG-Strom Project
GPU k-means (2/3) – Daytime and Nighttime
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics38
daytime (8-17) nighttime (18-7)
The PG-Strom Project
GPU k-means (3/3) – Weekdays and Weekend
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics39
weekdays weekend
The PG-Strom Project
Invocation of GPU k-means function
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics40
SELECT report_id, k, c
FROM (SELECT report_id, k, c,
row_number() OVER (PARTITION BY report_id
ORDER BY c DESC) rank
FROM (SELECT report_id, k, count(*) c
FROM matrix_unnest(
(SELECT gpu_kmeans ( array_matrix(
int4_as_float4(report_id),
avg_measured_time,
avg_speed,
vehicle_count),
5)
FROM tr_rawdata
WHERE extract('hour' from timestamp)
between 7 and 17
)
) R(report_id int, k int)
GROUP BY report_id, k
) __summary_1
) __summary_2
WHERE rank = 1;
Just add a line to select
different input data set.
Flexibility of SQL!
The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics41
Summary
The PG-Strom Project
Summary
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics42
▌What is PL/CUDA
 Original concept of PG-Strom is automatic optimization.
 PL/CUDA pulls out full capability of GPU instead of manual optimization.
Likely, nobody has written advanced algorithm in SQL :-)
▌Advantages
 TFLOPS grade computing engine for analytics in-database
 No need to export entire dataset for analytics by external applications
 Allows to utilize SQL flexibility for pre-/post-processing of the core
analytics algorithms
▌Future Challenges
 Data size larger than 1GB, because of varlena restriction in PostgreSQL
 Asynchronous execution. CPU parallel
 Time to construct matrix. If likely static, we can construct preliminary.
The PG-Strom Project
Resources
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics43
▌Repository
https://github.com/pg-strom/devel
▌Today’s Slides
http://www.slideshare.net/kaigai/pgconfsv2016-plcuda
▌Contact
 kaigai@ak.jp.nec.com
 Tw: @kkaigai
Question?

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics

  • 1.
    PL/CUDA ~Fusion of HPCGrade Power with In-Database Analytics~ The PG-Strom Project / NEC OSS Promotion Center KaiGai Kohei <kaigai@ak.jp.nec.com>
  • 2.
    The PG-Strom Project aboutmyself... PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics2 ▌KaiGai Kohei  tw: @kkaigai  https://github.com/kaigai ▌PostgreSQL  SELinux, FDW, CustomScan, ... ▌PG-Strom  GPU acceleration for PostgreSQL ▌Works  NEC OSS Promotion Center  Development of the software and its business opportunity
  • 3.
    The PG-Strom Project PG-StromOverview (1/2) – Architecture PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics3 Application Storage Query Optimizer Query Executor PG-Strom Extension SQL Parser Storage Manager GPU No Storage Changes No Query Changes  Features • automatic GPU code generation from the supplied SQL • asynchronous & massive parallel execution on GPUs • WHERE-clause, JOIN, GROUP BY, and projection are supported  Advantages • transparent acceleration by the power of thousands cores • Low cost solution for analytic processing
  • 4.
    The PG-Strom Project Characteristicsof GPU (Graphic Processor Unit) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics4 GPU CPU Model NVIDIA Tesla P100 Intel Xeon E5-2699v4 Architecture Pascal Broadwell Launch Q2-2016 Q1-2016 # of transistors 15billion 7.2billion # of cores 3584 (simple) 22 (functional) core clock 1.126GHz ~1.303GHz 2.20GHz ~3.60GHz Perk FFLOPS (FP32) 9.3 TFLOPS 1.2 TFLOPS (with AVX2) DRAM Size 16GB (HBM2) max 1.5TB (DDR4) Memory Band 732GB/s 76.8GB/s Power Consumption 250W 145W
  • 5.
    The PG-Strom Project PG-StromOverview (2/2) – GPU binary generation on the fly PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics QUERY: SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat; : STATIC_FUNCTION(bool) gpupreagg_qual_eval(kern_context *kcxt, kern_data_store *kds, size_t kds_index) { pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index); pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index); return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) && pgfn_float8le(kcxt, KVAR_3, pgfn_float8pl(kcxt, KVAR_4, KPARAM_1)))); } : E.g) Transform of arithmetic operations in the WHERE-clause to CUDA programs Reference to input data SQL expression in CUDA source code Run-time Compiler (nvrtc) Just-in-time Compile Parallel Execution 5
  • 6.
    The PG-Strom Project GPUaccelerates SQL performance PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics6 ▌Test Query: SELECT cat, count(*), avg(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;  t0 contains 100M rows, t1...t8 contains 100K rows (like a start schema) 40.44 62.79 79.82 100.50 119.51 144.55 201.12 248.95 9.96 9.93 9.96 9.99 10.02 10.03 9.98 10.00 0 50 100 150 200 250 300 2 3 4 5 6 7 8 9 QueryResponseTime[sec] Number of joined tables PG-Strom microbenchmark with JOIN/GROUP BY PostgreSQL v9.5 PG-Strom v1.0 CPU: Xeon E5-2670v3 GPU: GTX1080 RAM: 384GB OS: CentOS 7.2 DB: PostgreSQL 9.5 + PG-Strom v1.0
  • 7.
    The PG-Strom Project Feedbacksfrom users during v1.0 development PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics7 Application Storage Query Optimizer Query Executor PG-Strom Extension SQL Parser Storage Manager GPU Heavy computing intensive workloads • In-database Analytics • Scientific R&D, Marketing, ...  by PL/CUDA + Matrix-Array Heavy I/O intensive workloads • Generic large OLAP • ETL, Reporting, ...  by SSD-to-GPU P2P DMA
  • 8.
    The PG-Strom ProjectPGconf.SV2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics8 Introduction of PL/CUDA
  • 9.
    The PG-Strom Project OurFailure (1/3) – Write an algorithm logic in SQL PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics9 Apr-2016
  • 10.
    The PG-Strom Project OurFailure (2/3) – Performance benefit (?) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics10 Apr-2016
  • 11.
    The PG-Strom Project OurFailure (3/3) – Problems PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics11 ▌Problem.1 – Who writes algorithms in SQL?  Majority of mathematical algorithms are developed based on the manner of procedural programming language.  Although users don’t need to write algorithm logics in CUDA, they also have to write up the algorithm logic using SQL puzzle. ▌Problem.2 – Performance Benefit  Yeah, PG-Strom is much faster than PostgreSQL to execute the core logic of min-max method. It is an excellent result but nobody knows how many people run the algorithm on PostgreSQL.  Performance of GpuProjection is almost equivalent to the CPU version of implementation which is designed to a particular problem. Why?  Inefficient code due to SQL compatibility  Inefficient data format due to PostgreSQL’s row-format
  • 12.
    The PG-Strom Project OurAnswer – PL/CUDA + Matrix-like Array PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics12 CREATE FUNCTION my_logic(matrix, matrix) RETURNS vector AS $$ $$ LANGUAGE ‘plcuda’; User define CUDA code block Storage GPU Kernel User defined CUDA code block Post SQL Process  Tables JOIN  Window Function  ORDER BY  GROUP BY  etc.... Load function’s arguments Write-back result set PL/CUDA method for manual optimization ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN… 𝑎1 ⋯ 𝑑1 ⋮ ⋱ ⋮ 𝑎 𝑁 ⋯ 𝑑 𝑁 Matrix of 4cols x Nrows Matrix-like Array 2D Array without NULL, to represent a matrix
  • 13.
    The PG-Strom Project Exampleof PL/CUDA function definition PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics13 CREATE OR REPLACE FUNCTION knn_gpu_similarity(int, int[], int[]) RETURNS float4[] AS $$ #plcuda_begin cl_int k = arg1.value; MatrixType *Q = (MatrixType *) arg2.value; MatrixType *D = (MatrixType *) arg3.value; MatrixType *R = (MatrixType *) results; : nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k); for (loop=0; loop < nloops; loop++) { /* 1. calculation of the similarity */ for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) { j = i % part_sz; /* index within partition */ /* index of database matrix (D) */ dindex = part_nums * get_global_index() + (i / part_sz); /* index of query matrix (Q) */ qindex = loop * (part_sz - k) + (j - k); values[i] = knn_similarity_compute(D, dindex, Q, qindex); } } : #plcuda_end $$ LANGUAGE 'plcuda'; CUDA code block
  • 14.
    The PG-Strom Project HowGPU Kernels are built PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics14 CREATE OR REPLACE FUNCTION my_cuda_func(float[]) RETURNS int[] $$ #plcuda_sanity_check func_sc #plcuda_begin #plcuda_end #plcuda_working_bufsz func_wb $$ LANGUAGE ‘plcuda’; User define CUDA code block bool func_sc(float[]) helper function for sanity check bigint func_wb(float[]) helper function for buffer size estimation GPU Binary GPU Kernel Working Buffer Input arguments Run-time compiler input/output of SQL data User define CUDA code block Common Library Routines source program of GPU code Load the arguments
  • 15.
    The PG-Strom Project Whyautomatic generated code was not best for performance PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics15  NULL checks for each variable references  Overflow checks for each primitive operators  Function call instead of primitive operators STATIC_FUNCTION(pg_float4_t) pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2) { pg_float4_t result; result.isnull = arg1.isnull | arg2.isnull; if (!result.isnull) { result.value = arg1.value * arg2.value; CHECKFLOATVAL(&kcxt->e, result, isinf(arg1.value) || isinf(arg2.value), arg1.value == 0.0 || arg2.value == 0.0); } return result; } select x*y from c_test;
  • 16.
    The PG-Strom Project Disadvantagebecause of data format (1/2) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics16 ▌Row-oriented data × includes unreferenced values × many steps for data references 〇 common data format with PostgreSQL ▌Column-oriented data 〇 load only referenced variables 〇 data reference by 1 step × needs data format exchange GPU core GPU core GPU core a c d feb a c d feb a d feb a c d feb GPU core eb GPU core GPU core GPU core GPU core b b b b b GPU core GPU core e e e e e Usual SQL load cannot justify the cost for data transformation, Advanced algorithm processing is improved by columnar format.
  • 17.
    The PG-Strom Project Disadvantagebecause of data format (2/2) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics17 ▌Case of random memory access  Increase of memory transaction, but less usage rate of the data-bus ▌Case of coalesced memory access  Least number of memory transaction, and maximum usage rate of the data-bus 32bit Memory transaction width: 256bit 32bit 32bit32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit Memory transaction width: 256bit 32bit x 8 = 256bit is valid data in 256bit width memory transaction (Bus usage ratio: 100.0%) Only 32bit x 1 = 32bit is valid data in 256bit width memory transaction (Bus usage ratio: 12.5%) GPU cores GPU cores
  • 18.
    The PG-Strom Project 2D-Arrayas Matrix PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics18 ▌datatype[] array_matrix(variadic datatype[])  An aggregate function to construct a 2D-array from the input stream.  datatype is either of int2, int4, int8, float4 or float8.  The 2D-array does not contain NULL. ▌SETOF record matrix_unnest(datatype[])  Deform a 2D-array into stream of multiple records ▌Downside  Unable to handle variable-length data type  1GB limit of varlena data type in PostgreSQL ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN… 𝑎1 ⋯ 𝑑1 ⋮ ⋱ ⋮ 𝑎 𝑁 ⋯ 𝑑 𝑁 Matrix of 4cols x Nrows Matrix-like Array 2D Array without NULL, to represent a matrix
  • 19.
    The PG-Strom Project Exampleto call PL/CUDA function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics19 SELECT row_number() OVER (), float4_as_int4(R.key_id) key_id, R.score FROM matrix_unnest( (SELECT my_plcuda_function(A.matrix, B.matrix) FROM (SELECT cbind(array_matrix(id), array_matrix(x, y, z)) matrix FROM normal_table WHERE tag LIKE ‘%abc%’) A, (SELECT matrix FROM matrix_table) B ) ) AS R(key_id real, score real) ORDER BY score DESC LIMIT 1000; Invocation of PL/CUDA function with two Matrix arguments Construct, or load pre-built array-matrix Deform array-matrix to generic records Post-process by SQL (JOIN, window-function)
  • 20.
    The PG-Strom ProjectPGconf.SV2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics20 Case Study similarity search on drug discovery
  • 21.
    The PG-Strom Project Background– relationship of disease and chemical compounds PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics21 target disease relevant protein chemical compounds (= candidate of drugs) Discovery of chemical compounds which are “active” to the target protein inactive active active but toxicity academic papers
  • 22.
    The PG-Strom Project k-NNSimilarity Search on Chemical Compounds (1/2) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics22 Database chemical compounds set (D; 10M records scale) Query chemical compounds set (Q; ~1000 records scale) Search by Similarity Target Protein “similar compounds” will have higher probability of active Picks up active chemical compounds to the target protein from academic papers
  • 23.
    The PG-Strom Project k-NNSimilarity Search on Chemical Compounds (2/2) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics23 Similarity is definition of distance ID NAME Fingerprint (1024bit) 1 CHEMBL153534 000000000001000000100000000000000100000000000001000000... 2 CHEMBL405398 000000000000000100100000000000000000000000000000100000... 3 CHEMBL503634 000001000000000000000000001000000100000000000000000000... : : : Data structure of the chemical compounds Similarity by Jaccard index: 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝐴, 𝐵 = 𝐴 𝐹𝑃 ∩ 𝐵 𝐹𝑃 𝐴 𝐹𝑃 ∪ 𝐵 𝐹𝑃
  • 24.
    The PG-Strom Project Scaleof the computing PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics24 Database chemical compounds set (D; 10M records scale) Q: Query chemical compounds set average of the top-3 𝑑𝑖 of D-compounds Distance of Q-set and 𝑑𝑖 compounds 𝑑𝑗 of D-compounds average of the top-3 Distance of Q-set and 𝑑𝑗 compounds Order of calculation: 𝑂 𝑄 × 𝐷 + 𝑂 𝐷 × 𝑄𝑙𝑜𝑔𝑄 (make distance) (sorting+average)
  • 25.
    The PG-Strom Project Implementationof PL/CUDA function (1/3) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics25 Step-1 Split all the logical combination of Q and D into multiple partitions, then assign them on SMM; execution unit of GPU. Step-2 Each GPU core calculates a similarity score between a Q-compound and a D-compound, then store the score on “shared memory”; which is fast and close to SMM.
  • 26.
    The PG-Strom Project Implementationof PL/CUDA function (2/3) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics26 Step-3 Bitonic-sorting by the similarity score, then reorder the Q-compounds Step-5 Make an average by the top-k values, then store it on the result buffer. Step-4 Repeat from Step-2, if # of Q-compounds is larger than shared memory size. Top-k items are kept.
  • 27.
    The PG-Strom Project Implementationof PL/CUDA function (3/3) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics27 CREATE OR REPLACE FUNCTION knn_gpu_similarity(int, -- k-value int[], -- ID+bitmap of Q int[]) -- ID+bitmap of D RETURNS float4[] -- result: ID+similarity AS $$ #plcuda_decl : #plcuda_begin #plcuda_kernel_blocksz ¥ knn_gpu_similarity_main_block_size #plcuda_num_threads ¥ knn_gpu_similarity_main_num_threads #plcuda_shmem_blocksz 8192 cl_int k = arg1.value; MatrixType *Q = (MatrixType *) arg2.value; MatrixType *D = (MatrixType *) arg3.value; MatrixType *R = (MatrixType *) results; : for (loop=0; loop < nloops; loop++) { /* 1. calculation of the similarity */ for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) { j = i % part_sz; /* index within partition */ dindex = part_nums * get_global_index() + (i / part_sz); qindex = loop * (part_sz - k) + (j - k); if (dindex < ARRAY_MATRIX_HEIGHT(D) && qindex < ARRAY_MATRIX_HEIGHT(Q)) { values[i] = knn_similarity_compute(D, dindex, Q, qindex); } } __syncthreads(); /* 2. sorting by the similarity for each partition */ knn_similarity_sorting(values, part_sz, part_nums); __syncthreads(); : } #plcuda_end #plcuda_sanity_check knn_gpu_similarity_sanity_check #plcuda_working_bufsz 0 #plcuda_results_bufsz knn_gpu_similarity_results_bufsz $$ LANGUAGE 'plcuda'; real[] -- ID+Similarity of D-compounds (2xN) knn_gpu_similarity(int k, -- k-value int[] Q, -- ID+Fingerprint of Q-compounds (33xM) int[] D); -- ID+Fingerprint of D-compounds (33xN)
  • 28.
    The PG-Strom Project Invocationof PL/CUDA function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics28 PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value AS SELECT row_number() OVER (), fp.name, similarity FROM (SELECT float4_as_int4(key_id) key_id, similarity FROM matrix_unnest( (SELECT rbind( knn_gpu_similarity($1,Q.matrix, D.matrix)) FROM (SELECT cbind(array_matrix(id), array_matrix(bitmap)) matrix FROM finger_print_query) Q, (SELECT matrix FROM finger_print_10m_matrix) D ) ) AS sim(key_id real, similarity real) ORDER BY similarity DESC) sim, finger_print_10m fp WHERE fp.id = sim.key_id LIMIT 1000; Post-process by SQL; like lookup of compounds name by compounds-id (tables JOIN), making rank of similarity by window function. Execution of PL/CUDA function with Q-/D-matrix as argument Transform the records read from tables to Array-Matrix type. (Pre-build is also possible) Transform the Array-Matrix (3xN), return value of PL/CUDA function, into usual record data x Nrows.
  • 29.
    The PG-Strom Project Performanceresults PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics29  For comparison to CPU cases, we implemented an equivalent SQL function by C.  # of D-compounds is 10M records, # of Q-compounds is 10, 50, 100, 500 and 1000.  Up to 10B combination search; almost equivalent size for real drug discovery research.  HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM:384GB  SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0 30.25 145.29 295.95 1503.31 3034.94 12.97 13.46 13.90 18.16 24.6513.00 13.23 13.59 16.01 19.13 0 500 1000 1500 2000 2500 3000 3500 10 50 100 500 1000 QueryResponseTime[sec] Number of Query Compounds [Q] Similarity search of chemical compounds by k-NN method (k=3, D=10M) CPU(E5-2670v3) GTX980 GTX1080 x150 times faster!
  • 30.
    The PG-Strom ProjectPGconf.SV2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics30 Another Usage k-means clustering in database system
  • 31.
    The PG-Strom Project ClusteringAnalysis PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics31
  • 32.
    The PG-Strom Project k-meansclustering algorithm PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics32 1. Assign cluster randomly 2. Make centroid for each cluster 3. Chose the nearest cluster from the centroid
  • 33.
    The PG-Strom Project k-meansclustering algorithm PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics33 1. Assign cluster randomly 5. Make centroid for each cluster again 6. All the cluster get fully converged 4. Repeat until convergence
  • 34.
    The PG-Strom Project k-meansclustering on PL/CUDA PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics34 CREATE OR REPLACE FUNCTION gpu_kmeans(real[], -- ID + Data Matrix int, -- k-value (number of clusters) int = 10, -- max number of iteration int = 1) -- seed of initial randomness RETURNS int[] AS $$ #plcuda_decl : KERNEL_FUNCTION_MAXTHREADS(void) update_centroid(MatrixType *D, MatrixType *R, MatrixType *C) { : /* accumulate the local centroid */ for (did = get_global_id(); did < nitems; did += get_global_size()) { /* pick up the target cluster */ cid = r_values[nitems + did]; atomicAdd(&l_cent[cid], 1.0); for (index=1; index < width; index++) atomicAdd(&l_cent[index * k_value + cid], d_values[index * nitems + did]); } __syncthreads(); /* write back to the global C-matrix */ for (index = get_local_id(); index < width * k_value; index += get_local_size()) atomicAdd(&c_values[index], l_cent[index]); } : #plcuda_begin : status = pgstromLaunchDynamicKernel4((void *) setup_initial_cluster, (kern_arg_t)(D), (kern_arg_t)(R), (kern_arg_t)(C), (kern_arg_t)(r_seed), nitems, 0, 0); if (status != cudaSuccess) PLCUDA_RUNTIME_ERROR_RETURN(status); for (loop=0; loop < nloops; loop++) { : status = pgstromLaunchDynamicKernelMaxThreads3( (void *)kmeans_update_cluster, (kern_arg_t)(D), (kern_arg_t)(R), (kern_arg_t)(C), (kern_arg_t)k_value, nitems, 0, sizeof(cl_int) + sizeof(cl_float)); if (status != cudaSuccess) PLCUDA_RUNTIME_ERROR_RETURN(status); : } #plcuda_sanity_check gpu_kmeans_sanity_check #plcuda_working_bufsz gpu_kmeans_working_bufsz #plcuda_results_bufsz gpu_kmeans_results_bufsz #plcuda_end $$ LANGUAGE 'plcuda';
  • 35.
    The PG-Strom Project Testdata for k-means clustering PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics35 ▌Dataset overview  A collection of datasets of vehicle traffic, observed between two points for a set duration of time over a period of 6 months  449 observation points, at Aarhus, Denmark.  13.5M records from Feb to June of 2014 ▌Data contains  average speed  average measured time  number of vehicles  latitude/longitude of the observation points  etc... ▌What we did  categorize the zone of road into 5-classes according to the characteristics of vehicle’s running style.
  • 36.
    The PG-Strom Project Invocationof GPU k-means function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics36 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT report_id, k, count(*) c FROM matrix_unnest( (SELECT gpu_kmeans ( array_matrix( int4_as_float4(report_id), avg_measured_time, avg_speed, vehicle_count), 5) FROM tr_rawdata) ) R(report_id int, k int) GROUP BY report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Make a matrix from the raw-data Run k-means clustering logic Pick-up most frequent cluster
  • 37.
    The PG-Strom Project GPUk-means (1/3) – clustering by all the data PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics37 $ wget -O map.png "`psql traffic -At -f ~/traffic.sql`" bypass highway? Road towards downtown? Beltway?
  • 38.
    The PG-Strom Project GPUk-means (2/3) – Daytime and Nighttime PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics38 daytime (8-17) nighttime (18-7)
  • 39.
    The PG-Strom Project GPUk-means (3/3) – Weekdays and Weekend PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics39 weekdays weekend
  • 40.
    The PG-Strom Project Invocationof GPU k-means function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics40 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT report_id, k, count(*) c FROM matrix_unnest( (SELECT gpu_kmeans ( array_matrix( int4_as_float4(report_id), avg_measured_time, avg_speed, vehicle_count), 5) FROM tr_rawdata WHERE extract('hour' from timestamp) between 7 and 17 ) ) R(report_id int, k int) GROUP BY report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Just add a line to select different input data set. Flexibility of SQL!
  • 41.
    The PG-Strom ProjectPGconf.SV2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics41 Summary
  • 42.
    The PG-Strom Project Summary PGconf.SV2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics42 ▌What is PL/CUDA  Original concept of PG-Strom is automatic optimization.  PL/CUDA pulls out full capability of GPU instead of manual optimization. Likely, nobody has written advanced algorithm in SQL :-) ▌Advantages  TFLOPS grade computing engine for analytics in-database  No need to export entire dataset for analytics by external applications  Allows to utilize SQL flexibility for pre-/post-processing of the core analytics algorithms ▌Future Challenges  Data size larger than 1GB, because of varlena restriction in PostgreSQL  Asynchronous execution. CPU parallel  Time to construct matrix. If likely static, we can construct preliminary.
  • 43.
    The PG-Strom Project Resources PGconf.SV2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics43 ▌Repository https://github.com/pg-strom/devel ▌Today’s Slides http://www.slideshare.net/kaigai/pgconfsv2016-plcuda ▌Contact  kaigai@ak.jp.nec.com  Tw: @kkaigai
  • 44.