20101030 opencl intro

Published in: Technology, Art & Photos

  1. Brief Introduction to OpenCL
       Hu Zi Ming (hzmangel@gmail.com), 2010-10-30
  2. Outline
       1. Some Background about OpenCL
            • CPU vs. GPU
            • What is OpenCL
            • Advantages & Disadvantages
       2. Programming with OpenCL
       3. Demo about OpenCL
  4. CPU vs. GPU
       • CPU: make a single thread fast; hide latency through large caches
       • GPU: improve throughput; hide latency through parallelism
  7. Before OpenCL...
       • Nvidia CUDA
       • ATI Stream
       • Microsoft DirectCompute
       • ...
       Apple said, "Let there be a standard", and there was OpenCL.
  8. What is OpenCL
       • Open Computing Language
       • Kernel language based on C99; similar to C for CUDA, but slightly lower level
       • Originally developed by Apple
       • Now maintained by the Khronos Group
       • Used for parallel computing
  9. Advantages
       • Supports heterogeneous platforms
       • Task-based (CPU) and data-based (GPU) parallelism for parallel computing
       • Can greatly increase usable memory bandwidth and compute throughput
       • Uses GPU power without being locked into one manufacturer
       • Supports extensions, like OpenGL
       • Supports an embedded profile (ES mode) for mobile devices
  10. Disadvantages
       • Tuning is hardware-specific
       • Algorithms are bound to the shape of the data
       • Recursion is not currently available
       • Function pointers are not currently supported
  11. Outline
       1. Some Background about OpenCL
       2. Programming with OpenCL
            • Prerequisite
            • Main Flow of Host Code
            • Four Models
       3. Demo about OpenCL
  12. Prerequisite
       • A driver that supports OpenCL
       • ATI Stream SDK / NVIDIA CUDA Toolkit / ...
       • Host code: controls the kernel code
       • OpenCL kernel code: written in OpenCL C and run on the devices
  13. Main Flow of Host Code
       1. Get information about the platform and devices
       2. Select the devices to be used in execution
       3. Create an OpenCL context
       4. Create a command queue
       5. Create memory buffer objects
       6. Create a program object
       7. Load the kernel source code and compile it
       8. Create a kernel object
       9. Set the kernel arguments
       10. Execute the kernel
       11. Copy memory from the GPU back to the CPU
  14. OpenCL Summary
  15. Four Models
       • Platform model
       • Execution model
       • Memory model
       • Programming model
  16. Platform Model
       • A host is connected to one or more OpenCL devices
       • Each device is divided into one or more compute units (CUs)
       • Each compute unit is further divided into one or more processing elements (PEs)
       • The application sends commands from the host to the PEs
       • The PEs within a CU execute instructions as SIMD or SPMD units
  17. Platform Model (Cont.)
  21. Execution Model
       • A work-item is the basic unit of work
       • A kernel is the code for a work-item
            • Executed on OpenCL devices; basically a C function
       • The host program is executed on the host
       • The index space is created as an NDRange
       • Work-items are organized into work-groups
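
The relationship between work-groups and work-items in the index space can be sketched in plain C. This is only an illustration of the arithmetic for a 1-D NDRange with a zero global offset; `emulated_global_id` and `emulated_num_groups` are hypothetical helper names, not OpenCL API functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): the value get_global_id(0)
 * would return for a work-item, given its work-group index, the
 * work-group size, and its index inside the group. Assumes a 1-D
 * NDRange with a zero global offset. */
size_t emulated_global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Hypothetical helper: number of work-groups in a 1-D NDRange.
 * Assumes global_size is a multiple of local_size, which OpenCL 1.x
 * requires when a work-group size is given explicitly. */
size_t emulated_num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with 16 work-items split into work-groups of 4, the work-item with local index 1 in group 2 has global index 9.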
  22. Execution Model (Cont.)
  24. Memory Model
       • Global memory: readable and writable by all work-items in all work-groups
       • Constant memory: a region of global memory that remains constant during kernel execution
       • Local memory: local to a work-group
       • Private memory: private to a work-item
       • Data path: host -> global -> local, and back
  25. Memory Model
  26. Programming Model
       • Data parallel programming model
       • Task parallel programming model
       • Synchronization
  27. Outline
       1. Some Background about OpenCL
       2. Programming with OpenCL
       3. Demo about OpenCL
            • Matrix Add
            • Matrix Multiply
  28. Kernel Code: normal add
          __kernel void add(__global int *a, __global int *b, __global int *c)
          {
              int i = get_global_id(0);
              c[i] = a[i] + b[i];
          }
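
To see what the add kernel computes without an OpenCL device, the following plain C sketch runs the kernel body once per global index. It is a sequential stand-in only: `add_work_item` and `add_emulated` are hypothetical names, and a real device may execute the work-items in any order or in parallel.

```c
#include <stddef.h>

/* Body of the add kernel for a single work-item. The gid parameter
 * stands in for the value of get_global_id(0). */
static void add_work_item(size_t gid, const int *a, const int *b, int *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Hypothetical sequential stand-in for running the kernel over a 1-D
 * NDRange of n work-items: one call of the kernel body per index. */
void add_emulated(const int *a, const int *b, int *c, size_t n)
{
    for (size_t gid = 0; gid < n; gid++) {
        add_work_item(gid, a, b, c);
    }
}
```

The loop makes the data-parallel idea explicit: the kernel contains no loop of its own, because the NDRange supplies one work-item per output element.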
  29. Normal Kernel Code: normal multiply
          /* WA, WB and WC are the widths of A, B and C; they are assumed
             to be defined as macros when the program is built. */
          __kernel void mul(__global int *a, __global int *b, __global int *c)
          {
              int x = get_global_id(1);   /* column of C */
              int y = get_global_id(0);   /* row of C */
              int sum = 0;
              for (int i = 0; i < WA; i++) {
                  sum += a[y * WA + i] * b[i * WB + x];
              }
              c[y * WC + x] = sum;
          }
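
The naive multiply can be checked on the host with a plain C sketch that calls the kernel body once per (row, column) pair. The names `mul_work_item` and `mul_emulated` are hypothetical, and the matrix sizes are passed as parameters here instead of the `WA`, `WB` and `WC` macros used on the slide.

```c
#include <stddef.h>

/* One work-item of the naive kernel: computes the element of C at
 * row y, column x. wa is the width of A (and the height of B);
 * wb is the width of B (and of C). All matrices are row-major. */
static void mul_work_item(size_t x, size_t y,
                          const int *a, const int *b, int *c,
                          size_t wa, size_t wb)
{
    int sum = 0;
    for (size_t i = 0; i < wa; i++) {
        sum += a[y * wa + i] * b[i * wb + x];
    }
    c[y * wb + x] = sum;
}

/* Hypothetical sequential stand-in for a 2-D NDRange of ha x wb
 * work-items, where ha is the height of A (and of C). */
void mul_emulated(const int *a, const int *b, int *c,
                  size_t ha, size_t wa, size_t wb)
{
    for (size_t y = 0; y < ha; y++) {
        for (size_t x = 0; x < wb; x++) {
            mul_work_item(x, y, a, b, c, wa, wb);
        }
    }
}
```

Each work-item reads a full row of A and a full column of B from global memory, which is the cost the blocked version tries to reduce.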
  30. Kernel Code with Block Support: multiply with block support
          /* Assumes BLOCK_SIZE x BLOCK_SIZE work-groups, WA divisible by
             BLOCK_SIZE, and as/bs each holding BLOCK_SIZE * BLOCK_SIZE floats. */
          __kernel void mul(__global float *a, __global float *b, __global float *c,
                            __local float *as, __local float *bs)
          {
              int x = get_global_id(1);   /* column of C */
              int y = get_global_id(0);   /* row of C */
              int tx = get_local_id(1);
              int ty = get_local_id(0);
              float tmp_val = 0.0f;
              for (int i = 0; i < WA / BLOCK_SIZE; i++) {
                  /* Each work-item loads one element of the current tiles. */
                  as[ty * BLOCK_SIZE + tx] = a[y * WA + (i * BLOCK_SIZE + tx)];
                  bs[ty * BLOCK_SIZE + tx] = b[(i * BLOCK_SIZE + ty) * WB + x];
                  barrier(CLK_LOCAL_MEM_FENCE);   /* tiles are fully loaded */
                  for (int j = 0; j < BLOCK_SIZE; j++) {
                      tmp_val += as[ty * BLOCK_SIZE + j] * bs[j * BLOCK_SIZE + tx];
                  }
                  barrier(CLK_LOCAL_MEM_FENCE);   /* finish reading before the next load */
              }
              c[y * WC + x] = tmp_val;
          }
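
The blocking idea can also be traced in plain C. The sketch below is a sequential emulation, not the author's code: it assumes square n x n row-major matrices with n divisible by `BLOCK_SIZE`, and `mul_tiled_emulated` is a hypothetical name. The two inner phases correspond to the code before and after the first `barrier` in a blocked kernel, and the small arrays play the role of `__local` memory.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 2   /* tile edge length; an assumption for this sketch */

/* Sequential emulation of a blocked (tiled) matrix multiply.
 * Each (by, bx) pair plays the role of one work-group and each
 * (ty, tx) pair the role of one work-item inside it. */
void mul_tiled_emulated(const float *a, const float *b, float *c, size_t n)
{
    assert(n % BLOCK_SIZE == 0);
    size_t tiles = n / BLOCK_SIZE;

    for (size_t by = 0; by < tiles; by++) {
        for (size_t bx = 0; bx < tiles; bx++) {
            /* One private accumulator per work-item of the group. */
            float acc[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}};

            for (size_t t = 0; t < tiles; t++) {
                float as_tile[BLOCK_SIZE][BLOCK_SIZE];  /* stands in for __local as */
                float bs_tile[BLOCK_SIZE][BLOCK_SIZE];  /* stands in for __local bs */

                /* Phase 1: every work-item loads one element of each tile.
                 * The end of this loop nest corresponds to the first barrier. */
                for (size_t ty = 0; ty < BLOCK_SIZE; ty++) {
                    for (size_t tx = 0; tx < BLOCK_SIZE; tx++) {
                        size_t row = by * BLOCK_SIZE + ty;
                        size_t col = bx * BLOCK_SIZE + tx;
                        as_tile[ty][tx] = a[row * n + (t * BLOCK_SIZE + tx)];
                        bs_tile[ty][tx] = b[(t * BLOCK_SIZE + ty) * n + col];
                    }
                }

                /* Phase 2: every work-item accumulates a partial dot product
                 * from the tiles. The end corresponds to the second barrier. */
                for (size_t ty = 0; ty < BLOCK_SIZE; ty++) {
                    for (size_t tx = 0; tx < BLOCK_SIZE; tx++) {
                        for (size_t j = 0; j < BLOCK_SIZE; j++) {
                            acc[ty][tx] += as_tile[ty][j] * bs_tile[j][tx];
                        }
                    }
                }
            }

            /* Write each work-item's result once, after all tiles are done. */
            for (size_t ty = 0; ty < BLOCK_SIZE; ty++) {
                for (size_t tx = 0; tx < BLOCK_SIZE; tx++) {
                    c[(by * BLOCK_SIZE + ty) * n + (bx * BLOCK_SIZE + tx)] = acc[ty][tx];
                }
            }
        }
    }
}
```

Splitting the work into a load phase and a compute phase shows why two barriers are needed on a device: one so that no work-item reads a tile before it is complete, and one so that no work-item overwrites a tile that others are still reading.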
  31. Q and A
