Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics

1,243 views

Published on

My presentation slides at PGconf.SV 2016
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
http://www.pgconfsv.com/

Published in: Technology
  • How is it possible? 80% accuracy? Sports picks directly from the insiders? ♣♣♣ https://tinyurl.com/yxcmgjf5
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics

  1. 1. PL/CUDA ~Fusion of HPC Grade Power with In-Database Analytics~ The PG-Strom Project / NEC OSS Promotion Center KaiGai Kohei <kaigai@ak.jp.nec.com>
  2. 2. The PG-Strom Project about myself... PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics2 ▌KaiGai Kohei  tw: @kkaigai  https://github.com/kaigai ▌PostgreSQL  SELinux, FDW, CustomScan, ... ▌PG-Strom  GPU acceleration for PostgreSQL ▌Works  NEC OSS Promotion Center  Development of the software and its business opportunity
  3. 3. The PG-Strom Project PG-Strom Overview (1/2) – Architecture PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics3 Application Storage Query Optimizer Query Executor PG-Strom Extension SQL Parser Storage Manager GPU No Storage Changes No Query Changes  Features • automatic GPU code generation from the supplied SQL • asynchronous & massive parallel execution on GPUs • WHERE-clause, JOIN, GROUP BY, and projection are supported  Advantages • transparent acceleration by the power of thousands cores • Low cost solution for analytic processing
  4. 4. The PG-Strom Project Characteristics of GPU (Graphic Processor Unit) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics4 GPU CPU Model NVIDIA Tesla P100 Intel Xeon E5-2699v4 Architecture Pascal Broadwell Launch Q2-2016 Q1-2016 # of transistors 15billion 7.2billion # of cores 3584 (simple) 22 (functional) core clock 1.126GHz ~1.303GHz 2.20GHz ~3.60GHz Perk FFLOPS (FP32) 9.3 TFLOPS 1.2 TFLOPS (with AVX2) DRAM Size 16GB (HBM2) max 1.5TB (DDR4) Memory Band 732GB/s 76.8GB/s Power Consumption 250W 145W
  5. 5. The PG-Strom Project PG-Strom Overview (2/2) – GPU binary generation on the fly PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics QUERY: SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat; : STATIC_FUNCTION(bool) gpupreagg_qual_eval(kern_context *kcxt, kern_data_store *kds, size_t kds_index) { pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index); pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index); return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) && pgfn_float8le(kcxt, KVAR_3, pgfn_float8pl(kcxt, KVAR_4, KPARAM_1)))); } : E.g) Transform of arithmetic operations in the WHERE-clause to CUDA programs Reference to input data SQL expression in CUDA source code Run-time Compiler (nvrtc) Just-in-time Compile Parallel Execution 5
  6. 6. The PG-Strom Project GPU accelerates SQL performance PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics6 ▌Test Query: SELECT cat, count(*), avg(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;  t0 contains 100M rows, t1...t8 contains 100K rows (like a start schema) 40.44 62.79 79.82 100.50 119.51 144.55 201.12 248.95 9.96 9.93 9.96 9.99 10.02 10.03 9.98 10.00 0 50 100 150 200 250 300 2 3 4 5 6 7 8 9 QueryResponseTime[sec] Number of joined tables PG-Strom microbenchmark with JOIN/GROUP BY PostgreSQL v9.5 PG-Strom v1.0 CPU: Xeon E5-2670v3 GPU: GTX1080 RAM: 384GB OS: CentOS 7.2 DB: PostgreSQL 9.5 + PG-Strom v1.0
  7. 7. The PG-Strom Project Feedbacks from users during v1.0 development PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics7 Application Storage Query Optimizer Query Executor PG-Strom Extension SQL Parser Storage Manager GPU Heavy computing intensive workloads • In-database Analytics • Scientific R&D, Marketing, ...  by PL/CUDA + Matrix-Array Heavy I/O intensive workloads • Generic large OLAP • ETL, Reporting, ...  by SSD-to-GPU P2P DMA
  8. 8. The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics8 Introduction of PL/CUDA
  9. 9. The PG-Strom Project Our Failure (1/3) – Write an algorithm logic in SQL PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics9 Apr-2016
  10. 10. The PG-Strom Project Our Failure (2/3) – Performance benefit (?) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics10 Apr-2016
  11. 11. The PG-Strom Project Our Failure (3/3) – Problems PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics11 ▌Problem.1 – Who writes algorithms in SQL?  Majority of mathematical algorithms are developed based on the manner of procedural programming language.  Although users don’t need to write algorithm logics in CUDA, they also have to write up the algorithm logic using SQL puzzle. ▌Problem.2 – Performance Benefit  Yeah, PG-Strom is much faster than PostgreSQL to execute the core logic of min-max method. It is an excellent result but nobody knows how many people run the algorithm on PostgreSQL.  Performance of GpuProjection is almost equivalent to the CPU version of implementation which is designed to a particular problem. Why?  Inefficient code due to SQL compatibility  Inefficient data format due to PostgreSQL’s row-format
  12. 12. The PG-Strom Project Our Answer – PL/CUDA + Matrix-like Array PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics12 CREATE FUNCTION my_logic(matrix, matrix) RETURNS vector AS $$ $$ LANGUAGE ‘plcuda’; User define CUDA code block Storage GPU Kernel User defined CUDA code block Post SQL Process  Tables JOIN  Window Function  ORDER BY  GROUP BY  etc.... Load function’s arguments Write-back result set PL/CUDA method for manual optimization ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN… 𝑎1 ⋯ 𝑑1 ⋮ ⋱ ⋮ 𝑎 𝑁 ⋯ 𝑑 𝑁 Matrix of 4cols x Nrows Matrix-like Array 2D Array without NULL, to represent a matrix
  13. 13. The PG-Strom Project Example of PL/CUDA function definition PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics13 CREATE OR REPLACE FUNCTION knn_gpu_similarity(int, int[], int[]) RETURNS float4[] AS $$ #plcuda_begin cl_int k = arg1.value; MatrixType *Q = (MatrixType *) arg2.value; MatrixType *D = (MatrixType *) arg3.value; MatrixType *R = (MatrixType *) results; : nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k); for (loop=0; loop < nloops; loop++) { /* 1. calculation of the similarity */ for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) { j = i % part_sz; /* index within partition */ /* index of database matrix (D) */ dindex = part_nums * get_global_index() + (i / part_sz); /* index of query matrix (Q) */ qindex = loop * (part_sz - k) + (j - k); values[i] = knn_similarity_compute(D, dindex, Q, qindex); } } : #plcuda_end $$ LANGUAGE 'plcuda'; CUDA code block
  14. 14. The PG-Strom Project How GPU Kernels are built PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics14 CREATE OR REPLACE FUNCTION my_cuda_func(float[]) RETURNS int[] $$ #plcuda_sanity_check func_sc #plcuda_begin #plcuda_end #plcuda_working_bufsz func_wb $$ LANGUAGE ‘plcuda’; User define CUDA code block bool func_sc(float[]) helper function for sanity check bigint func_wb(float[]) helper function for buffer size estimation GPU Binary GPU Kernel Working Buffer Input arguments Run-time compiler input/output of SQL data User define CUDA code block Common Library Routines source program of GPU code Load the arguments
  15. 15. The PG-Strom Project Why automatic generated code was not best for performance PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics15  NULL checks for each variable references  Overflow checks for each primitive operators  Function call instead of primitive operators STATIC_FUNCTION(pg_float4_t) pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2) { pg_float4_t result; result.isnull = arg1.isnull | arg2.isnull; if (!result.isnull) { result.value = arg1.value * arg2.value; CHECKFLOATVAL(&kcxt->e, result, isinf(arg1.value) || isinf(arg2.value), arg1.value == 0.0 || arg2.value == 0.0); } return result; } select x*y from c_test;
  16. 16. The PG-Strom Project Disadvantage because of data format (1/2) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics16 ▌Row-oriented data × includes unreferenced values × many steps for data references 〇 common data format with PostgreSQL ▌Column-oriented data 〇 load only referenced variables 〇 data reference by 1 step × needs data format exchange GPU core GPU core GPU core a c d feb a c d feb a d feb a c d feb GPU core eb GPU core GPU core GPU core GPU core b b b b b GPU core GPU core e e e e e Usual SQL load cannot justify the cost for data transformation, Advanced algorithm processing is improved by columnar format.
  17. 17. The PG-Strom Project Disadvantage because of data format (2/2) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics17 ▌Case of random memory access  Increase of memory transaction, but less usage rate of the data-bus ▌Case of coalesced memory access  Least number of memory transaction, and maximum usage rate of the data-bus 32bit Memory transaction width: 256bit 32bit 32bit32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit Memory transaction width: 256bit 32bit x 8 = 256bit is valid data in 256bit width memory transaction (Bus usage ratio: 100.0%) Only 32bit x 1 = 32bit is valid data in 256bit width memory transaction (Bus usage ratio: 12.5%) GPU cores GPU cores
  18. 18. The PG-Strom Project 2D-Array as Matrix PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics18 ▌datatype[] array_matrix(variadic datatype[])  An aggregate function to construct a 2D-array from the input stream.  datatype is either of int2, int4, int8, float4 or float8.  The 2D-array does not contain NULL. ▌SETOF record matrix_unnest(datatype[])  Deform a 2D-array into stream of multiple records ▌Downside  Unable to handle variable-length data type  1GB limit of varlena data type in PostgreSQL ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN… 𝑎1 ⋯ 𝑑1 ⋮ ⋱ ⋮ 𝑎 𝑁 ⋯ 𝑑 𝑁 Matrix of 4cols x Nrows Matrix-like Array 2D Array without NULL, to represent a matrix
  19. 19. The PG-Strom Project Example to call PL/CUDA function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics19 SELECT row_number() OVER (), float4_as_int4(R.key_id) key_id, R.score FROM matrix_unnest( (SELECT my_plcuda_function(A.matrix, B.matrix) FROM (SELECT cbind(array_matrix(id), array_matrix(x, y, z)) matrix FROM normal_table WHERE tag LIKE ‘%abc%’) A, (SELECT matrix FROM matrix_table) B ) ) AS R(key_id real, score real) ORDER BY score DESC LIMIT 1000; Invocation of PL/CUDA function with two Matrix arguments Construct, or load pre-built array-matrix Deform array-matrix to generic records Post-process by SQL (JOIN, window-function)
  20. 20. The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics20 Case Study similarity search on drug discovery
  21. 21. The PG-Strom Project Background – relationship of disease and chemical compounds PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics21 target disease relevant protein chemical compounds (= candidate of drugs) Discovery of chemical compounds which are “active” to the target protein inactive active active but toxicity academic papers
  22. 22. The PG-Strom Project k-NN Similarity Search on Chemical Compounds (1/2) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics22 Database chemical compounds set (D; 10M records scale) Query chemical compounds set (Q; ~1000 records scale) Search by Similarity Target Protein “similar compounds” will have higher probability of active Picks up active chemical compounds to the target protein from academic papers
  23. 23. The PG-Strom Project k-NN Similarity Search on Chemical Compounds (2/2) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics23 Similarity is definition of distance ID NAME Fingerprint (1024bit) 1 CHEMBL153534 000000000001000000100000000000000100000000000001000000... 2 CHEMBL405398 000000000000000100100000000000000000000000000000100000... 3 CHEMBL503634 000001000000000000000000001000000100000000000000000000... : : : Data structure of the chemical compounds Similarity by Jaccard index: 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝐴, 𝐵 = 𝐴 𝐹𝑃 ∩ 𝐵 𝐹𝑃 𝐴 𝐹𝑃 ∪ 𝐵 𝐹𝑃
  24. 24. The PG-Strom Project Scale of the computing PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics24 Database chemical compounds set (D; 10M records scale) Q: Query chemical compounds set average of the top-3 𝑑𝑖 of D-compounds Distance of Q-set and 𝑑𝑖 compounds 𝑑𝑗 of D-compounds average of the top-3 Distance of Q-set and 𝑑𝑗 compounds Order of calculation: 𝑂 𝑄 × 𝐷 + 𝑂 𝐷 × 𝑄𝑙𝑜𝑔𝑄 (make distance) (sorting+average)
  25. 25. The PG-Strom Project Implementation of PL/CUDA function (1/3) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics25 Step-1 Split all the logical combination of Q and D into multiple partitions, then assign them on SMM; execution unit of GPU. Step-2 Each GPU core calculates a similarity score between a Q-compound and a D-compound, then store the score on “shared memory”; which is fast and close to SMM.
  26. 26. The PG-Strom Project Implementation of PL/CUDA function (2/3) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics26 Step-3 Bitonic-sorting by the similarity score, then reorder the Q-compounds Step-5 Make an average by the top-k values, then store it on the result buffer. Step-4 Repeat from Step-2, if # of Q-compounds is larger than shared memory size. Top-k items are kept.
  27. 27. The PG-Strom Project Implementation of PL/CUDA function (3/3) PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics27 CREATE OR REPLACE FUNCTION knn_gpu_similarity(int, -- k-value int[], -- ID+bitmap of Q int[]) -- ID+bitmap of D RETURNS float4[] -- result: ID+similarity AS $$ #plcuda_decl : #plcuda_begin #plcuda_kernel_blocksz ¥ knn_gpu_similarity_main_block_size #plcuda_num_threads ¥ knn_gpu_similarity_main_num_threads #plcuda_shmem_blocksz 8192 cl_int k = arg1.value; MatrixType *Q = (MatrixType *) arg2.value; MatrixType *D = (MatrixType *) arg3.value; MatrixType *R = (MatrixType *) results; : for (loop=0; loop < nloops; loop++) { /* 1. calculation of the similarity */ for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) { j = i % part_sz; /* index within partition */ dindex = part_nums * get_global_index() + (i / part_sz); qindex = loop * (part_sz - k) + (j - k); if (dindex < ARRAY_MATRIX_HEIGHT(D) && qindex < ARRAY_MATRIX_HEIGHT(Q)) { values[i] = knn_similarity_compute(D, dindex, Q, qindex); } } __syncthreads(); /* 2. sorting by the similarity for each partition */ knn_similarity_sorting(values, part_sz, part_nums); __syncthreads(); : } #plcuda_end #plcuda_sanity_check knn_gpu_similarity_sanity_check #plcuda_working_bufsz 0 #plcuda_results_bufsz knn_gpu_similarity_results_bufsz $$ LANGUAGE 'plcuda'; real[] -- ID+Similarity of D-compounds (2xN) knn_gpu_similarity(int k, -- k-value int[] Q, -- ID+Fingerprint of Q-compounds (33xM) int[] D); -- ID+Fingerprint of D-compounds (33xN)
  28. 28. The PG-Strom Project Invocation of PL/CUDA function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics28 PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value AS SELECT row_number() OVER (), fp.name, similarity FROM (SELECT float4_as_int4(key_id) key_id, similarity FROM matrix_unnest( (SELECT rbind( knn_gpu_similarity($1,Q.matrix, D.matrix)) FROM (SELECT cbind(array_matrix(id), array_matrix(bitmap)) matrix FROM finger_print_query) Q, (SELECT matrix FROM finger_print_10m_matrix) D ) ) AS sim(key_id real, similarity real) ORDER BY similarity DESC) sim, finger_print_10m fp WHERE fp.id = sim.key_id LIMIT 1000; Post-process by SQL; like lookup of compounds name by compounds-id (tables JOIN), making rank of similarity by window function. Execution of PL/CUDA function with Q-/D-matrix as argument Transform the records read from tables to Array-Matrix type. (Pre-build is also possible) Transform the Array-Matrix (3xN), return value of PL/CUDA function, into usual record data x Nrows.
  29. 29. The PG-Strom Project Performance results PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics29  For comparison to CPU cases, we implemented an equivalent SQL function by C.  # of D-compounds is 10M records, # of Q-compounds is 10, 50, 100, 500 and 1000.  Up to 10B combination search; almost equivalent size for real drug discovery research.  HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM:384GB  SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0 30.25 145.29 295.95 1503.31 3034.94 12.97 13.46 13.90 18.16 24.6513.00 13.23 13.59 16.01 19.13 0 500 1000 1500 2000 2500 3000 3500 10 50 100 500 1000 QueryResponseTime[sec] Number of Query Compounds [Q] Similarity search of chemical compounds by k-NN method (k=3, D=10M) CPU(E5-2670v3) GTX980 GTX1080 x150 times faster!
  30. 30. The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics30 Another Usage k-means clustering in database system
  31. 31. The PG-Strom Project Clustering Analysis PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics31
  32. 32. The PG-Strom Project k-means clustering algorithm PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics32 1. Assign cluster randomly 2. Make centroid for each cluster 3. Chose the nearest cluster from the centroid
  33. 33. The PG-Strom Project k-means clustering algorithm PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics33 1. Assign cluster randomly 5. Make centroid for each cluster again 6. All the cluster get fully converged 4. Repeat until convergence
  34. 34. The PG-Strom Project k-means clustering on PL/CUDA PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics34 CREATE OR REPLACE FUNCTION gpu_kmeans(real[], -- ID + Data Matrix int, -- k-value (number of clusters) int = 10, -- max number of iteration int = 1) -- seed of initial randomness RETURNS int[] AS $$ #plcuda_decl : KERNEL_FUNCTION_MAXTHREADS(void) update_centroid(MatrixType *D, MatrixType *R, MatrixType *C) { : /* accumulate the local centroid */ for (did = get_global_id(); did < nitems; did += get_global_size()) { /* pick up the target cluster */ cid = r_values[nitems + did]; atomicAdd(&l_cent[cid], 1.0); for (index=1; index < width; index++) atomicAdd(&l_cent[index * k_value + cid], d_values[index * nitems + did]); } __syncthreads(); /* write back to the global C-matrix */ for (index = get_local_id(); index < width * k_value; index += get_local_size()) atomicAdd(&c_values[index], l_cent[index]); } : #plcuda_begin : status = pgstromLaunchDynamicKernel4((void *) setup_initial_cluster, (kern_arg_t)(D), (kern_arg_t)(R), (kern_arg_t)(C), (kern_arg_t)(r_seed), nitems, 0, 0); if (status != cudaSuccess) PLCUDA_RUNTIME_ERROR_RETURN(status); for (loop=0; loop < nloops; loop++) { : status = pgstromLaunchDynamicKernelMaxThreads3( (void *)kmeans_update_cluster, (kern_arg_t)(D), (kern_arg_t)(R), (kern_arg_t)(C), (kern_arg_t)k_value, nitems, 0, sizeof(cl_int) + sizeof(cl_float)); if (status != cudaSuccess) PLCUDA_RUNTIME_ERROR_RETURN(status); : } #plcuda_sanity_check gpu_kmeans_sanity_check #plcuda_working_bufsz gpu_kmeans_working_bufsz #plcuda_results_bufsz gpu_kmeans_results_bufsz #plcuda_end $$ LANGUAGE 'plcuda';
  35. 35. The PG-Strom Project Test data for k-means clustering PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics35 ▌Dataset overview  A collection of datasets of vehicle traffic, observed between two points for a set duration of time over a period of 6 months  449 observation points, at Aarhus, Denmark.  13.5M records from Feb to June of 2014 ▌Data contains  average speed  average measured time  number of vehicles  latitude/longitude of the observation points  etc... ▌What we did  categorize the zone of road into 5-classes according to the characteristics of vehicle’s running style.
  36. 36. The PG-Strom Project Invocation of GPU k-means function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics36 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT report_id, k, count(*) c FROM matrix_unnest( (SELECT gpu_kmeans ( array_matrix( int4_as_float4(report_id), avg_measured_time, avg_speed, vehicle_count), 5) FROM tr_rawdata) ) R(report_id int, k int) GROUP BY report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Make a matrix from the raw-data Run k-means clustering logic Pick-up most frequent cluster
  37. 37. The PG-Strom Project GPU k-means (1/3) – clustering by all the data PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics37 $ wget -O map.png "`psql traffic -At -f ~/traffic.sql`" bypass highway? Road towards downtown? Beltway?
  38. 38. The PG-Strom Project GPU k-means (2/3) – Daytime and Nighttime PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics38 daytime (8-17) nighttime (18-7)
  39. 39. The PG-Strom Project GPU k-means (3/3) – Weekdays and Weekend PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics39 weekdays weekend
  40. 40. The PG-Strom Project Invocation of GPU k-means function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics40 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT report_id, k, count(*) c FROM matrix_unnest( (SELECT gpu_kmeans ( array_matrix( int4_as_float4(report_id), avg_measured_time, avg_speed, vehicle_count), 5) FROM tr_rawdata WHERE extract('hour' from timestamp) between 7 and 17 ) ) R(report_id int, k int) GROUP BY report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Just add a line to select different input data set. Flexibility of SQL!
  41. 41. The PG-Strom ProjectPGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics41 Summary
  42. 42. The PG-Strom Project Summary PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics42 ▌What is PL/CUDA  Original concept of PG-Strom is automatic optimization.  PL/CUDA pulls out full capability of GPU instead of manual optimization. Likely, nobody has written advanced algorithm in SQL :-) ▌Advantages  TFLOPS grade computing engine for analytics in-database  No need to export entire dataset for analytics by external applications  Allows to utilize SQL flexibility for pre-/post-processing of the core analytics algorithms ▌Future Challenges  Data size larger than 1GB, because of varlena restriction in PostgreSQL  Asynchronous execution. CPU parallel  Time to construct matrix. If likely static, we can construct preliminary.
  43. 43. The PG-Strom Project Resources PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics43 ▌Repository https://github.com/pg-strom/devel ▌Today’s Slides http://www.slideshare.net/kaigai/pgconfsv2016-plcuda ▌Contact  kaigai@ak.jp.nec.com  Tw: @kkaigai
  44. 44. Question?

×