Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
PG-StromGPU Accelerated AsynchronousSuper-Parallel Query ExecutionKaiGai Kohei <kaigai@kaigai.gr.jp>(Tw: @kkaigai)
Self Introduction• 名前                     海外 浩平 (KaiGai Kohei)• 所属                     NEC Europe - SAP Global Competence ...
All everyone talks about BIG-DATA       猫                                    杓子      熱い視線                                 ...
Big-Data Database?            ¥¥¥¥¥¥            $$$$$$            €€€€€€20130218 - PG-Strom Workshop, Tokyo   4
Homogeneous / Heterogeneous computing                                                                        KPIs         ...
Design concept of PG-Strom              The world cheapest            The most Cost-Effective              Big-Data Databa...
Characteristics of GPU (1/2)                                                    Nvidia       AMD             Intel        ...
Characteristics of GPU (2/2)  Example)  Zi = Xi + Yi             (0 <= i <= n)   X0       X1        X2              Xn   +...
Play with GPU (1/3)         Memory                                CPU                             GPU                     ...
Play with GPU (2/3)   Host code examplevoid sqrt_float4(int n, float v[]){  /* Acquire device memory and data transfer (ho...
Play with GPU (3/3)    Device code example__kernel void dev_sqrt_float(int length, float x[]){  int i = get_global_id(0); ...
Comparison - CPU vs CPU + GPU     CPU                   CPU + GPU                    • Advantage Storage                 ...
Architecture                     of                 PG-Strom20130218 - PG-Strom Workshop, Tokyo   13
PG-Strom’s Asynchronous Execution model    vanilla PostgreSQL                                                          Pos...
Re-definition of SQL/MED (1/2)• SQL/MED (Management of External Data)   • External data source performing as if regular ta...
Re-definition of SQL/MED (2/2)                     Query                     Parser                                      S...
PG-Strom as SQL/MED driver                     Query                     Parser                           ① 条件句から、GPU用    ...
Overall architecture                                                          World of CPU            regular             ...
So what, How fast is it?  postgres=# SELECT COUNT(*) FROM rtbl             WHERE sqrt((x-256)^2 + (y-128)^2) < 40;   count...
Key Technologies• Automatic GPU code generation & JIT compile• Column-oriented data structure• Asynchronous Execution20130...
Automatic “pseudo” code generation          SELECT * FROM ftbl WHERE          c like ‘%xyz%’ AND sqrt((x-256)^2+(y-100)^2)...
Automatic native code generation - WIP          SELECT * FROM ftbl WHERE          c like ‘%xyz%’ AND sqrt((x-256)^2+(y-100...
Save bandwidth & shared-buffer usage   E.g) SELECT name, tel, email, address FROM address_book          WHERE sqrt((pos_x ...
Column-oriented data structure (1/3)                                                     (shadow) TABLE “public.ft.rowid” ...
Column-oriented data structure (2/3) postgres=# CREATE FOREIGN TABLE example              (a int, b text) SERVER pg_strom;...
Column-oriented data structure (3/3)② Calculation                                                         opcode          ...
Asynchronous Execution (1/2)           IterateForeignScan                                                        Yes      ...
Asynchronous Execution (2/2) - in the future           IterateForeignScan                                 Yes             ...
Eco-System                   in               PostgreSQL              Development20130218 - PG-Strom Workshop, Tokyo   29
PostgreSQL developer’s community                                    PostgreSQL                                    develope...
PostgreSQL development cycle                   2011                            2012                         2013v9.2cycle ...
PostgreSQL CommitFest20130218 - PG-Strom Workshop, Tokyo   32
Key features towards upcoming v9.3 (1/3)• Background Worker   • It enables extensions to manage own background worker proc...
Key features towards upcoming v9.3 (2/3)• Writable-FDW   •   It allows FDW-drivers to modify external data source via fore...
Key features towards upcoming v9.3 (3/3)• Writable-FDW (Pseudo-column support)   • It required to identify a particular re...
Direction                    of               The Future              Development20130218 - PG-Strom Workshop, Tokyo   36
Move to OpenCL - WIP• OpenCL support, instead of CUDA   • multiplatform support   • built-in JIT compiler                 ...
Variable Length Data Support - WIP• Data layout on chunk-buffer is revising, to accept variable-length data.• Older format...
Procedural Language Support• This idea allows users to describe complicated logic   as procedural language to be executed ...
You getting involvedI’d like to know ...• How PG-Strom run on     real-world dataset and workload• How PG-Strom should get...
Summary• Characteristics of GPU & OpenCL  • Inflexible instructions but much higher parallelism  • Cheap and small power c...
Source• Source code   • https://github.com/kaigai/pg_strom• Wikipage   • http://wiki.postgresql.org/wiki/PGStrom      (nee...
Any Questions?20130218 - PG-Strom Workshop, Tokyo   43
Upcoming SlideShare
Loading in …5
×

PG-Strom

5,684 views

Published on

18th-Feb-2013に東京・恵比寿のRedHat社セミナールームで開催した勉強会のスライドです。

Published in: Technology
  • Be the first to comment

PG-Strom

  1. 1. PG-StromGPU Accelerated AsynchronousSuper-Parallel Query ExecutionKaiGai Kohei <kaigai@kaigai.gr.jp>(Tw: @kkaigai)
  2. 2. Self Introduction• 名前 海外 浩平 (KaiGai Kohei)• 所属 NEC Europe - SAP Global Competence Center• 仕事 OSS活用によるイノベーション創出 (20-30%) SAPとのアライアンス、PF製品の拡販 (70-80%) • SAPのIn-memory DB “SAP HANA” の認証作業とか • CLUSTERPROのSAP認定取得、拡販とか 特にコレの 割合がデカい20130218 - PG-Strom Workshop, Tokyo 2
  3. 3. All everyone talks about BIG-DATA 猫 杓子 熱い視線 BIG DATA20130218 - PG-Strom Workshop, Tokyo 3
  4. 4. Big-Data Database? ¥¥¥¥¥¥ $$$$$$ €€€€€€20130218 - PG-Strom Workshop, Tokyo 4
  5. 5. Homogeneous / Heterogeneous computing KPIs Homogeneous Scale-Up • Computing Performance • Power Consumption • System Cost (HW/SW) Heterogeneous • Variety of Applications Scale-Up • Vendor Support • Software Development : + Scale-out (not a topic of today’s talk)20130218 - PG-Strom Workshop, Tokyo 5
  6. 6. Design concept of PG-Strom The world cheapest The most Cost-Effective Big-Data Database• Utilization of open source technology• Utilization of commodity hardware • up-to ?? CPUs まだ、この辺をとやかく • up-to ??? GB RAM 言える段階ではない • up-to ??? Data size• Utilization of heterogeneous computing with GPU20130218 - PG-Strom Workshop, Tokyo 6
  7. 7. Characteristics of GPU (1/2) Nvidia AMD Intel Kepler GCN SandyBridge Model Tesla K20X FirePro S9000 Xeon E5-2690 (Q4/2012) (Q3/2012) (Q1/2012) Number of 7.1billion 4.3billion 2.26billion Transistors Number of 2688 1792 16 Cores Simple Simple Functional Core clock 732MHz 925MHz 2.9GHz Peak FLOPS 3.95Tflops 3.23TFlops 185.6GFlops Memory 6GB, GDDR5 6GB, GDDR5 384GB/socket, Size / TYPE DDR3 Memory ~192GB/s ~264GB/s ~51.2GB/s Bandwidth Power ~235W ~225W ~135W Consumption Price $3199? $2499? $206120130218 - PG-Strom Workshop, Tokyo 7
  8. 8. Characteristics of GPU (2/2) Example) Zi = Xi + Yi (0 <= i <= n) X0 X1 X2 Xn + + + + Y0 Y1 Y2 Yn     Z0 Z1 Z2 Zn Assign a particular “core” Nvidia’s GeForce GTX 680 Block Diagram20130218 - PG-Strom Workshop, Tokyo 8
  9. 9. Play with GPU (1/3) Memory CPU GPU 計算負荷 GPUの仕事 GPGPU on-host DDR3-1600 (non-integrated) buffer (~51.2GB/s) DDR5 通常のI/O負荷 192.2GB/s IO HUB  CPUの仕事 GPU on-device code HDD buffer HBA device DRAM SAS 2.0 (600MB/s) PCI-E 3.0 x16 (~32GB/s) Asynchronous Execution of CPU, GPU and PCI-E Minimization of data transfer between host and device20130218 - PG-Strom Workshop, Tokyo 9
  10. 10. Play with GPU (2/3) Host code examplevoid sqrt_float4(int n, float v[]){ /* Acquire device memory and data transfer (host -> device) */ dev_v = clCreateBuffer(cxt, CL_MEM_READ_WRITE, sizeof(float) * n, NULL, &rv); /* Enqueue data transfer host to device */ clEnqueueWriteBuffer(cmdq, dev_x, CL_TRUE, 0, NULL, v, 0, NULL, NULL); /* Set arguments of kernel code */ clSetKernelArg(kernel, 0, sizeof(int), (void *)&n); clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&dev_v); /* Enqueue invocation of device kernel */ clEnqueueNDRangeKernel(cmdq, kernel, 1, NULL, &g_itemsz, &l_itemsz, 0, NULL, NULL); /* Enqueue data transfer device to host */ clEnqueueReadBuffer(cmdq, dev_x, CL_TRUE, 0, NULL, v, 0, NULL, NULL); /* Release device memory */ clReleaseMemObject(dev_x)}20130218 - PG-Strom Workshop, Tokyo 10
  11. 11. Play with GPU (3/3) Device code example__kernel void dev_sqrt_float(int length, float x[]){ int i = get_global_id(0); if (i < length) x[i] = sqrt(x[i]);} Host code to load kernel/* Load source code of the program */program = clCreateProgramWithSource(cxt, 1, (const char *)&kernel_source, (const size_t *)&kernel_source_len, &rv);/* Run-time build of the program */rv = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);/* Create a device kernel object */kernel = clCreateKernel(program, “dev_sqrt_float”, &rv);20130218 - PG-Strom Workshop, Tokyo 11
  12. 12. Comparison - CPU vs CPU + GPU CPU CPU + GPU • Advantage Storage  Storage  • シンプルな演算の Host buffer Host buffer 超並列処理 • 非同期実行& DMA: host  device パイプライン処理 Storage  Parallel • Disadvantage Host buffer Calculation • ホストデバイス間の Loop & DMA転送コスト DMA: host  device Calculation Host buffer • コードの複雑化  output depends on workloadHost buffer output20130218 - PG-Strom Workshop, Tokyo 12
  13. 13. Architecture of PG-Strom20130218 - PG-Strom Workshop, Tokyo 13
  14. 14. PG-Strom’s Asynchronous Execution model vanilla PostgreSQL PostgreSQL with PG-Strom CPU CPU GPU Asynchronous memory transfer and execution Iteration of scan tuples and evaluation of qualifiers Synchronization Larger “chunk” to scan Earlier than the database at once “Only CPU” scan : Scan tuples on shared-buffers : Execution of the qualifiers Page20130218 - PG-Strom Workshop, Tokyo 14
  15. 15. Re-definition of SQL/MED (1/2)• SQL/MED (Management of External Data) • External data source performing as if regular tables • Not only “management”, but external computing resources also Exec Regular Exec storage Query Executor Table Query Planner Query Parser Foreign MySQL Table FDW SQL Query Foreign Oracle FDW Exec Table Foreign PG-Strom Table FDW Exec Regular storage Table20130218 - PG-Strom Workshop, Tokyo 15
  16. 16. Re-definition of SQL/MED (2/2) Query Parser SQL/MED API construction of Query remote SQL Tree remote SQL Query FDW Planner Planner Remote pgsql connection open remote query Query FDW Executor Executor result set Result connection close Set20130218 - PG-Strom Workshop, Tokyo Extension module 16
  17. 17. PG-Strom as SQL/MED driver Query Parser ① 条件句から、GPU用 Kernel Codeを自動生成 Query Tree WHERE log(x) < 10 ..... Query FDW .... Planner Planner ..... ② Kernel codeの ④ GPU Kernelの JIT-compile 非同期実行 chunk Query FDW buffer Executor Executor ③ Shadow Table Result からのロード Set20130218 - PG-Strom Workshop, Tokyo PG-Strom module shadow tables 17
  18. 18. Overall architecture World of CPU regular shadow tables tables shared_buffer chunk_buffer World of GPU chunk GPU data code GPU Device chunk Memory SeqScan, PG-Strom request data etc... handler JIT Super compile Parallel ForeignScan Event ExecutionResult monitor GPU Kernel Query Executor GPU Function PG-Strom kernel PostgreSQL Backend GPU Server Postmaster background worker 18 20130218 - PG-Strom Workshop, Tokyo
  19. 19. So what, How fast is it? postgres=# SELECT COUNT(*) FROM rtbl WHERE sqrt((x-256)^2 + (y-128)^2) < 40; count -------- 100467 (1 row) Time: 7668.684 ms postgres=# SELECT COUNT(*) FROM ftbl WHERE sqrt((x-256)^2 + (y-128)^2) < 40; count -------- 100467 (1 row) Accelerated! Time: 857.298 ms CPU: Xeon E5-2670 (2.60GHz), GPU: NVidia GeForce GT640, RAM: 384GB Both of regular rtbl and PG-Strom ftbl contain 20milion rows with same value20130218 - PG-Strom Workshop, Tokyo 19
  20. 20. Key Technologies• Automatic GPU code generation & JIT compile• Column-oriented data structure• Asynchronous Execution20130218 - PG-Strom Workshop, Tokyo 20
  21. 21. Automatic “pseudo” code generation SELECT * FROM ftbl WHERE c like ‘%xyz%’ AND sqrt((x-256)^2+(y-100)^2) < 10; contains unsupported operators / functions Translation to pseudo code xreg10 = $(ftbl.x) xreg12 = 256.000000::doublePseudo-code based implementationwill be replaced by native code and xreg8 = (xreg10 - xreg12)JIT-compile approach soon. xreg10 = 2.000000::double xreg6 = pow(xreg8, xreg10) xreg12 = $(ftbl.y) xreg14 = 128.000000::double :20130218 - PG-Strom Workshop, Tokyo 21
  22. 22. Automatic native code generation - WIP SELECT * FROM ftbl WHERE c like ‘%xyz%’ AND sqrt((x-256)^2+(y-100)^2) < 10;OpenCL run-time buildsnative GPU binary __kernel void pgstrom_qual(int nitems, bool result[], float x[], float y[]) { int index = get_global_id(0); if (sqrt(pow(x[i] - 256, 2) + pow(y[i] - 100, 2)) < 10) result[i] = true; else result[i] = false; }20130218 - PG-Strom Workshop, Tokyo 22
  23. 23. Save bandwidth & shared-buffer usage E.g) SELECT name, tel, email, address FROM address_book WHERE sqrt((pos_x – 24.5)^2 + (pos_y – 52.3)^2) < 10;  No sense to fetch columns being not in use CPU GPU CPU GPU Synchronization Synchronization : Scan tuples on the shared-buffers  Save the bandwidth of PCI-E bus : Execution of the qualifiers : Columns being not used the qualifiers  Save the shared-buffer usage20130218 - PG-Strom Workshop, Tokyo 23
  24. 24. Column-oriented data structure (1/3) (shadow) TABLE “public.ft.rowid” rowid nitems isnull FOREIGN TABLE ft 4000 2000 {0,0,0,1,0,0,…} int float text X Y Z 6000 2000 {0,0,0,0,0,0,…} : : : (shadow) TABLE “public.ft.z.cs” 14000 400 {0,0,1,0,0,0,…} rowid nitems isnull values 4000 15 {0,0,…} { ‘hello’, ‘world’, … } 4015 20 {0,0,…} { ‘aaa’, ‘bbb’, ‘ccc’, … } (shadow) TABLE “public.ft.y.cs” rowid nitems : isnull : : values : 4000 25014275 {0,0,…} { 1.38, 6.45, 2.15, …‘yyy’, ‘zzz’, …} 25 {0,0,…} {‘xxx’, } 4250 250 {0,1,…} { 4.32, 5.46, 3.14, … } (shadow): TABLE “public.ft.a.cs” : : : rowid nitems 14200 isnull 100 values {0,0,…} {19, 29, 39, 49, 59, …} 4000 500 {0,0,…} {10, 20, 30, 40, 50, …} 4500 500 {0,1,…} {11, 0, 31, 41, 51, …} : : : : 14200 200 {0,0,…} {19, 29, 39, 49, 59, …}20130218 - PG-Strom Workshop, Tokyo 24
  25. 25. Column-oriented data structure (2/3) postgres=# CREATE FOREIGN TABLE example (a int, b text) SERVER pg_strom; CREATE FOREIGN TABLE postgres=# SELECT * FROM pgstrom_shadow_relations; oid | relname | relkind | relsize -------+----------------------+---------+----------- 16446 | public.example.rowid | r | 0 16449 | public.example.idx | i | 8192 16450 | public.example.a.cs | r | 0 16453 | public.example.a.idx | i | 8192 16454 | public.example.b.cs | r | 0 16457 | public.example.b.idx | i | 8192 16462 | public.example.seq | S | 8192 (9 rows) postgres=# SELECT * FROM pg_strom."public.example.a.cs" ; rowid | nitems | isnull | values -------+--------+--------+-------- (0 rows)20130218 - PG-Strom Workshop, Tokyo 25
  26. 26. Column-oriented data structure (3/3)② Calculation opcode Pseudo Code PgStromChunkBuffer ① Transfer rowmap value a[] <not used> value b[] ③ Write-Back value c[] value d[] <not used>• Less bandwidth consumption Table: my_schema.ft1.b.cs of PCI-Express bus 10100 {2.4, 5.6, 4.95, … } 10300 {10.23, 7.54, 5.43, … }• Less usage of buffer-cache Table: my_schema.ft1.c.cs• Suitable for data 10100 {‘2010-10-21’, …} compression 10200 {‘2011-01-23’, …} 10300 {‘2011-08-17’, …}PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module 26
  27. 27. Asynchronous Execution (1/2) IterateForeignScan Yes free_chunk_list No more rows on current chunk? current chunk Job Queue No next If no chunks are ready yet Load chunk from chunk shadow tables GPU code shadow tables GPU Management current Server Return next chunk TupleTableSlot ready_chunk_list20130218 - PG-Strom Workshop, Tokyo 27
  28. 28. Asynchronous Execution (2/2) - in the future IterateForeignScan Yes free_chunk_list Asynchronous Asynchronous No more rows Data Load Calculation on current chunk? current chunk No next Job Queue Job Queue chunk next chunk GPU shadow tables code Parallel I/O Server Parallel I/O Server Parallel I/O Server GPU Management current Server Return next chunk TupleTableSlot ready_chunk_list20130218 - PG-Strom Workshop, Tokyo 28
  29. 29. Eco-System in PostgreSQL Development20130218 - PG-Strom Workshop, Tokyo 29
  30. 30. PostgreSQL developer’s community PostgreSQL developer’s community contribution, software, feedback, documentation, donation, ... knowledge, ... Service infrastructure, support,20130218 - PG-Strom Workshop, Tokyo consulting, ... 30
  31. 31. PostgreSQL development cycle 2011 2012 2013v9.2cycle v9.2 Release CommitFest 1st~4thv9.3 PGconf2011 &cycle developer meeting CommitFest 1st~4th PGconf2012 & developer meeting v9.3 development schedule • 17th-May developer meeting • 15th-Jun CommitFest:1st • 15th-Sep CommitFest:2nd • 15th-Nov CommitFest:3rd PostgreSQL developer meeting (17th-May-2012) • 15th-Jan CommitFest:4th20130218 - PG-Strom Workshop, Tokyo 31
  32. 32. PostgreSQL CommitFest20130218 - PG-Strom Workshop, Tokyo 32
  33. 33. Key features towards upcoming v9.3 (1/3)• Background Worker • It enables extensions to manage own background worker process • Pre-requisite of PG-Strom’s GPU control server • KaiGai implemented 1st version, then Alvaro revised and committed Shared Resources (DB cluster, shared mem, IPC, ...) Built-in Extra background daemon PostgreSQL PostgreSQL PostgreSQL PostgreSQL Backend Backend Backend (autovacuum, Own main() Backend bgwriter...) manage Extension postmaster20130218 - PG-Strom Workshop, Tokyo 33
  34. 34. Key features towards upcoming v9.3 (2/3)• Writable-FDW • It allows FDW-drivers to modify external data source via foreign-table. • Several new APIs shall be added • Helpful for PG-Strom to modify shadow-tables using standard DML • KaiGai submitted patch, then it is “ready-for-committer” status now SELECT SQL Executor SQL Planner SQL Parser INSERT FDW driver UPDATE DELETE New API External Data Source20130218 - PG-Strom Workshop, Tokyo 34
  35. 35. Key features towards upcoming v9.3 (3/3)• Writable-FDW (Pseudo-column support) • It required to identify a particular remote-row to be written. • “rowid” shall be carried from scan-stage to modify-stage as a value of pseudo-column. • Pseudo-column concept is also available to push-down complex calculation into external computing resource. SELECT X, Y, (X-Y)^2 from ftable;  SELECT X, Y, Pcol_1 from ftable; Just reference to (SELECT X, Y, (X-Y)^2 AS Pcol_1 the calculated result from remote_data_source) in the remote side Remote Query20130218 - PG-Strom Workshop, Tokyo 35
  36. 36. Direction of The Future Development20130218 - PG-Strom Workshop, Tokyo 36
  37. 37. Move to OpenCL - WIP• OpenCL support, instead of CUDA • multiplatform support • built-in JIT compiler OpenCL Source (CString) clCreateProgramWithSource() ○ ○ cl_program clCreateKernel() ○ × cl_kernel ○ × clEnqueueNDRangeKernel20130218 - PG-Strom Workshop, Tokyo 37
  38. 38. Variable Length Data Support - WIP• Data layout on chunk-buffer is revising, to accept variable-length data.• Older format assumed fixed-number of items per chunk.• Newer format assumes fixed-size chunk; consumed from head/tail to consume Direction for fixed-length values X for fixed-length variable A for index of variable-length value B offset: 123 for fixed-length values Y to consume Direction text: ‘hello world’ for contents of variable-lengths for fixed-length values Z Older chunk-buffer layout Newer chunk-buffer layout20130218 - PG-Strom Workshop, Tokyo 38
  39. 39. Procedural Language Support• This idea allows users to describe complicated logic as procedural language to be executed on GPU.• Expected usage: image processing, genome matching, ... CREATE FUNCTION genome_similarity(text,text) RETURNS float AS $$ varlena *genome1 = ARG1; varlena *genome2 = ARG2; : <something complicated logic> : return similarity; $$ LANGUAGE pg_strom; SELECT id, label FROM genome_db WHERE genome_similarity(data, ‘ATGCAGGT....’) > 0.9;20130218 - PG-Strom Workshop, Tokyo 39
  40. 40. You getting involvedI’d like to know ...• How PG-Strom run on real-world dataset and workload• How PG-Strom should get evolved• Which region and problem will fit All the co-development / co-evaluationprojects are always welcome!20130218 - PG-Strom Workshop, Tokyo 40
  41. 41. Summary• Characteristics of GPU & OpenCL • Inflexible instructions but much higher parallelism • Cheap and small power consumption per computing capability• PG-Strom - towards most cost-effective database • Utilization of GPU to off-load CPU jobs • Automatic code generation and JIT compile • Asynchronous execution • Column-oriented data structure• Upcoming development • Move to OpenCL rather then CUDA • Support for variable length values • Support for procedural language• Your involvement will lead future evolution!20130218 - PG-Strom Workshop, Tokyo 41
  42. 42. Source• Source code • https://github.com/kaigai/pg_strom• Wikipage • http://wiki.postgresql.org/wiki/PGStrom (need maintenance...)20130218 - PG-Strom Workshop, Tokyo 42
  43. 43. Any Questions?20130218 - PG-Strom Workshop, Tokyo 43

×