1. Introduction to Parallel Programming With CUDA & OpenCL
   Moayad H. Almohaishi
   Graduate student, Computer Science
   Louisiana Tech University
   mha023@latech.edu
2. Outline
   • Introduction
   • Introduction to CUDA
     – Hello World
     – Addition application
     – Array Addition
     – CUDA Memories
     – Matrix Multiplication
     – Performance considerations
   • Introduction to OpenCL
     – Addition kernel
     – Differences from the CUDA kernel
     – Setting up the OpenCL host code
   • Sources and additional resources
3. Introduction
   • Why GPU?
     – Available in almost all new desktops and laptops
     – Many-core: 512 cores on a GTX 580
     – High floating-point throughput: the GTX 580 offers a peak of ≈1.5 TFLOPS (single precision)
     – High memory bandwidth: the GTX 580 offers 192.4 GB/sec
4. Introduction to CUDA
   • CUDA architecture
     – The physical technology on the GPU
   • CUDA C
     – The programming language used to harvest the power of the CUDA architecture
     – Based on standard C
5. What do you need to know?
   Today:
   • You will need some knowledge of C.
   • You don't need to know about parallel programming.
   • You don't need to know about the CUDA architecture.
6. Terminology
   • Host
     – The CPU and its dedicated system memory (RAM)
   • Device
     – The GPU and its on-board memory
7. C Hello World

   int main( void ) {
       printf("Hello World!\n");
       return 0;
   }

   This plain C hello-world compiles without problems under the NVIDIA CUDA compiler.
8. CUDA Kernel

   __global__ void kernel( void ) {
   }

   int main( void ) {
       kernel<<<1,1>>>();
       printf("Hello World!\n");
       return 0;
   }
9. CUDA Kernel
   • __global__ is a keyword that defines the function as a CUDA kernel.
   • kernel<<<1,1>>>(); is the command that calls the CUDA kernel from the host code.

   __global__ void kernel( void ) {
   }

   int main( void ) {
       kernel<<<1,1>>>();
       printf("Hello World!\n");
       return 0;
   }
10. Single Addition on the CPU

   float add( float a, float b ) {
       return a + b;
   }

   int main( void ) {
       float a, b, c;
       ... // setting a and b values
       c = add(a, b);
       printf("%f + %f = %f\n", a, b, c);
       return 0;
   }
11. Single Addition on the GPU

   __global__ void add( float *a, float *b, float *c ) {
       *c = *a + *b;
   }

   int main( void ) {
       float *a, *b, *c;
       ... // setting a and b values
       add<<<1,1>>>(a, b, c);
       printf("%f + %f = %f\n", *a, *b, *c);
       return 0;
   }
12. Single Addition on the GPU

   __global__ void add( float *a, float *b, float *c ) {
       *c = *a + *b;
   }

   ?! — this cannot work as written: the kernel needs device memory, and c will need to be copied back to the host.

   int main( void ) {
       float *a, *b, *c;
       ... // setting a and b values
       add<<<1,1>>>(a, b, c);  // c will need to be copied back to the host
       printf("%f + %f = %f\n", *a, *b, *c);
       return 0;
   }
13. CUDA Global Memory
   • To be able to use the GPU memory you will need to:
     – allocate memory on the GPU using cudaMalloc()
     – copy the host memory to the device memory using cudaMemcpy()
     – free the memory using cudaFree()
   • These mirror the original C memory commands: malloc(), memcpy(), free(). A miniature pairing sketch follows below.
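   In miniature, the calls pair up with their C counterparts like this (a sketch of mine, assuming a single float h_x on the host; the full working version appears on the next slides):

   float h_x = 0;
   float *d_x;
   int size = sizeof(float);
   cudaMalloc((void**) &d_x, size);                      // like malloc(), but allocates on the device
   cudaMemcpy(d_x, &h_x, size, cudaMemcpyHostToDevice);  // like memcpy(), with a direction flag
   cudaFree(d_x);                                        // like free()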
14. Single Addition on the GPU
   The kernel is correct and will stay the same:

   __global__ void add( float *a, float *b, float *c ) {
       *c = *a + *b;
   }
15. Single Addition on the GPU
   We need to define different variables for the host and device memories, and allocate the device memory:

   int main( void ) {
       float h_a, h_b, h_c;
       float *d_a, *d_b, *d_c;
       int size = sizeof(float);
       cudaMalloc((void**) &d_a, size);
       cudaMalloc((void**) &d_b, size);
       cudaMalloc((void**) &d_c, size);
       h_a = 150;
       h_b = 89;
16. Single Addition on the GPU
   Copy memory to and from the device, then free the device memory:

       cudaMemcpy(d_a, &h_a, size, cudaMemcpyHostToDevice);
       cudaMemcpy(d_b, &h_b, size, cudaMemcpyHostToDevice);
       add<<<1,1>>>(d_a, d_b, d_c);
       cudaMemcpy(&h_c, d_c, size, cudaMemcpyDeviceToHost);
       printf("%f + %f = %f\n", h_a, h_b, h_c);
       cudaFree(d_a);
       cudaFree(d_b);
       cudaFree(d_c);
       return 0;
   }
17. Is that the right thing to do?
   • The GPU is about massive parallelism, so running this program on the GPU is inefficient and will run slower than the CPU version.
   • You need large data.
18. Array Addition on the CPU
   The add function will stay the same.

   int main( void ) {
       int n = 512; // 2^9
       float a[n], b[n], c[n];
       ... // setting a and b values
       for (int i = 0; i < n; i++) {
           c[i] = add(a[i], b[i]);
           printf("%f + %f = %f\n", a[i], b[i], c[i]);
       }
       return 0;
   }
19. Array Addition on the GPU
   We have to modify the size:

   int main( void ) {
       int n = 512;
       float h_a[n], h_b[n], h_c[n];
       float *d_a, *d_b, *d_c;
       int size = sizeof(float) * n;
       cudaMalloc((void**) &d_a, size);
       cudaMalloc((void**) &d_b, size);
       cudaMalloc((void**) &d_c, size);
       ... // setting the input data h_a and h_b
20. Array Addition on the GPU

       cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
       cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
       add<<<1,1>>>(d_a, d_b, d_c);  // ?! — still a single thread for the whole array
       cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
       ... // print the results
       cudaFree(d_a);
       cudaFree(d_b);
       cudaFree(d_c);
       return 0;
   }
21. Blocks
   • CUDA runs the kernel as blocks on a grid containing n blocks.
   • The maximum value of n can differ from device to device; the current device limit is 65535 blocks per grid dimension.
   • We will use blockIdx.x to access the block ID from inside the kernel.
22. Array Addition on the GPU
   n blocks will run the kernel:

       cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
       cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
       add<<<n,1>>>(d_a, d_b, d_c);
       cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
       ... // print the results
       cudaFree(d_a);
       cudaFree(d_b);
       cudaFree(d_c);
       return 0;
   }
23. Array Addition Kernel 1

   __global__ void add( float *a, float *b, float *c ) {
       int idx = blockIdx.x;
       c[idx] = a[idx] + b[idx];
   }
24. Threads
   • Each block can contain up to 512 parallel threads on the first and second CUDA architectures.
   • On the Fermi architecture, each block can contain up to 1024 parallel threads.
   • We will use threadIdx.x to access the thread ID from inside the kernel.
25. Array Addition on the GPU
   n threads in a single block will run the kernel:

       cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
       cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
       add<<<1,n>>>(d_a, d_b, d_c);
       cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
       ... // print the results
       cudaFree(d_a);
       cudaFree(d_b);
       cudaFree(d_c);
       return 0;
   }
26. Array Addition Kernel
   CUDA schedules threads as half-warps, so it is more efficient to have at least 16 threads per block.

   __global__ void add( float *a, float *b, float *c ) {
       int idx = threadIdx.x;
       c[idx] = a[idx] + b[idx];
   }
27. MORE
   • Is it still massive parallelism?
   • What about more than 512 elements?
28. Terminology
   [Diagram: a 1D grid of three blocks (blockIdx.x = 0, 1, 2), each containing seven threads (threadIdx.x = 0..6); BlockSize = 7]
29. Global memory access
   How do we point each thread to the right global memory address?
   [Diagram: the same three blocks of seven threads mapped onto global memory indices 0..20]
30. Global memory access
   • 1D grid, BlockSize = 7
   • idx = threadIdx.x + blockIdx.x * blockDim.x
   [Diagram: each thread's global index computed from its block and thread IDs]
31. Array Addition on the GPU

       cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
       cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
       int blockSize = 256;
       int blocks = n / blockSize;  // assumes n is a multiple of blockSize
       add<<<blocks,blockSize>>>(d_a, d_b, d_c);
       cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
       // printf("%f + %f = %f\n", a, b, c);
       cudaFree(d_a);
       cudaFree(d_b);
       cudaFree(d_c);
       return 0;
   }
32. Array Addition Kernel

   __global__ void add( float *a, float *b, float *c ) {
       int idx = threadIdx.x + blockIdx.x * blockDim.x;
       c[idx] = a[idx] + b[idx];
   }
33. Exercises
   • What is the maximum number of threads that can be run on a grid?
   • How can we go beyond that limit?
34. Global memory access
   How do we point each thread to the right global memory address, allowing each thread to do 2 computations?
   Hint: you need to find the idx formula that counts one memory index and jumps over the second one. You will access the second index through idx + 1 (sketched below).
   [Diagram: three blocks of seven threads over global memory indices 0..20]
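   A sketch of what this hint points to (my reading, not code from the deck): each thread doubles its global index and handles two adjacent elements:

   __global__ void add( float *a, float *b, float *c ) {
       // each thread owns two adjacent elements, idx and idx + 1
       int idx = 2 * (threadIdx.x + blockIdx.x * blockDim.x);
       c[idx]     = a[idx]     + b[idx];
       c[idx + 1] = a[idx + 1] + b[idx + 1];
   }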
35. Global memory access
   How do we point each thread to the right global memory address, allowing each thread to do 2 computations?
   Hint: you need to find the idx formula that counts one memory index and jumps over the next blockSize. You will access the second index through idx + blockDim.x (sketched below).
   [Diagram: three blocks of seven threads over global memory indices 0..20]
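   Again a sketch of my reading of this hint: each block covers 2 * blockDim.x elements, and a thread's second element sits one block-width further on, so neighboring threads keep touching neighboring addresses (better coalescing than the idx + 1 variant):

   __global__ void add( float *a, float *b, float *c ) {
       // each block covers 2 * blockDim.x elements
       int idx = threadIdx.x + blockIdx.x * blockDim.x * 2;
       c[idx]              = a[idx]              + b[idx];
       c[idx + blockDim.x] = a[idx + blockDim.x] + b[idx + blockDim.x];
   }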
36. What you learned
   • Creating a CUDA kernel
   • Calling the kernel from the host
   • Allocating CUDA memory
   • Copying to/from the device memory
   • Freeing the device memory
   • Controlling the number of threads through the block size and the number of blocks per grid
37. Dot Product
   [Diagram: elementwise A × B, then the products summed (+) into C]
38. If each thread does one multiplication, which thread will perform the addition?
39. Shared Memory
   • Shared memory is very fast memory located on the GPU chip itself.
   • Each block has its own shared memory space.
   • It can be declared using the __shared__ CUDA keyword.
   • To make sure all threads have finished computing, use the CUDA keyword __syncthreads().
40. Dot Product Kernel

   __global__ void dotP( int *a, int *b, int *c ) {
       __shared__ int temp[N];
       temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
       __syncthreads();
       if (threadIdx.x == 0) {
           int sum = 0;
           for (int i = 0; i < N; i++)
               sum += temp[i];
           *c = sum;
       }
   }
41. Exercise
   • In this application the addition runs on thread 0 only. Is that efficient?
   • How can we make it better? (One possible approach is sketched below.)
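   One standard improvement, sketched here under the assumption that blockDim.x (= N) is a power of two (this is not the deck's own solution): a tree reduction in shared memory, which sums in log2(N) steps instead of N:

   __global__ void dotP( int *a, int *b, int *c ) {
       __shared__ int temp[N];
       temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
       __syncthreads();
       // halve the number of active threads on every step
       for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
           if (threadIdx.x < stride)
               temp[threadIdx.x] += temp[threadIdx.x + stride];
           __syncthreads();
       }
       if (threadIdx.x == 0)
           *c = temp[0];
   }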
42. Matrix multiplication
   [Diagram: matrices A and B multiplied to produce C]
43. MatrixMul on the GPU

   int main( void ) {
       int n = 16;
       float h_a[n][n], h_b[n][n], h_c[n][n];
       float *d_a, *d_b, *d_c;
       int size = sizeof(float) * n * n;
       cudaMalloc((void**) &d_a, size);
       cudaMalloc((void**) &d_b, size);
       cudaMalloc((void**) &d_c, size);
       ... // setting the input data h_a and h_b
44. MatrixMul on the GPU

       cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
       cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
       dim3 blockSize(n, n, 1);
       matrixMul<<<1,blockSize>>>(d_a, d_b, d_c);
       cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
       cudaFree(d_a);
       cudaFree(d_b);
       cudaFree(d_c);
       return 0;
   }
45. Simple Matrix Multiplication Kernel

   __global__ void matrixMul( float *a, float *b, float *c ) {
       int x = threadIdx.x;  // column
       int y = threadIdx.y;  // row
       float temp = 0;
       for (int i = 0; i < blockDim.x; i++) {
           // the matrices are stored flat, row by row
           temp += a[y * blockDim.x + i] * b[i * blockDim.x + x];
       }
       c[y * blockDim.x + x] = temp;
   }
46. Exercise
   • Use shared memory to optimize the matrix multiplication algorithm (hint: look at the code in the SDK; a sketch follows below).
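   A minimal sketch of the tiled approach the hint refers to (my version, assuming square n×n matrices stored flat with n a multiple of TILE; the SDK kernel is more general):

   #define TILE 16

   __global__ void matrixMulTiled( float *a, float *b, float *c, int n ) {
       __shared__ float As[TILE][TILE];
       __shared__ float Bs[TILE][TILE];
       int row = blockIdx.y * TILE + threadIdx.y;
       int col = blockIdx.x * TILE + threadIdx.x;
       float temp = 0;
       // stage one TILE x TILE tile of each matrix in shared memory at a
       // time, so each global element is loaded once per tile rather than
       // once per thread
       for (int t = 0; t < n / TILE; t++) {
           As[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
           Bs[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
           __syncthreads();
           for (int i = 0; i < TILE; i++)
               temp += As[threadIdx.y][i] * Bs[i][threadIdx.x];
           __syncthreads();
       }
       c[row * n + col] = temp;
   }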
47. What you learned
   • Using shared memory to share data among the threads in a block
   • Synchronizing the threads
   • Setting a blockSize of more than one dimension using dim3
48. Performance Considerations
   For maximum performance:
   • Reduce global memory accesses.
   • Maximize occupancy (allow scheduling of 1024 threads per streaming multiprocessor):
     – use the right blockSize
     – use the right number of registers
     – use the right amount of shared memory
   • Increase the number of independent instructions.
   • Coalesce the memory accesses.
   • Use the right instruction:byte ratio.
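   One concrete habit that follows from these points (a sketch of mine, not from the deck): round the grid size up so an arbitrary n is fully covered, and guard the kernel against the overshoot; this also keeps blockSize a multiple of the warp size:

   __global__ void add( float *a, float *b, float *c, int n ) {
       int idx = threadIdx.x + blockIdx.x * blockDim.x;
       if (idx < n)                 // guard against the rounded-up overshoot
           c[idx] = a[idx] + b[idx];
   }

   // host side:
   int blockSize = 256;                           // a multiple of the warp size (32)
   int blocks = (n + blockSize - 1) / blockSize;  // round up instead of truncating
   add<<<blocks,blockSize>>>(d_a, d_b, d_c, n);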
49. Introduction to OpenCL
   • OpenCL is an open standard.
   • Cross-platform; it can run on:
     – multi-core CPUs
     – GPUs (NVIDIA, ATI)
     – Cell B/E
     – others
   • Close to CUDA.
50. How the program works
   The host holds the kernel code and its own memory (arrays A[], B[], C[]); the device (GPU) has stream processors and device memory (A[], B[], C[]). The host-side steps are:
   • Allocate the memory on the host and initialize the data in the memory objects.
   • Allocate the memory on the device (GPU).
   • Copy the data from host to device.
   • Run the kernel.
   • Copy the results to the host memory.
   • Clear the memory and free the resources.
51. Basic OpenCL program structure
   • OpenCL kernel
   • Host program containing:
     a. devices context
     b. command queue
     c. memory objects
     d. OpenCL program
     e. kernel memory arguments
52. Creating the Kernel

   #include <stdio.h>
   #include <stdlib.h>
   #include <CL/cl.h>

   const char* OpenCLSource[ ] = {
       "__kernel void VectorAdd(__global int* c, __global int* a,\n",
       "                        __global int* b)\n",
       "{\n",
       "    unsigned int n = get_global_id(0);\n",
       "    c[n] = a[n] + b[n];\n",
       "}\n"
   };
53. Creating the Kernel
   Notice that the whole kernel here is stored as an array of char strings (the OpenCLSource shown on the previous slide).
54. Creating the Kernel
   • The __kernel keyword is equivalent to __global__ in CUDA.
   • get_global_id() is a built-in function, used instead of calculating the global ID manually as in CUDA.
   • The pointer parameters need to be declared __global, which is not needed in CUDA.
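   For comparison, the same vector-add written as a CUDA kernel might look like this (my sketch of the equivalent, not from the deck); note there is no __global on the parameters, and the global index is computed by hand:

   __global__ void VectorAdd( int *c, int *a, int *b ) {
       // CUDA has no get_global_id(); compute the global index manually
       unsigned int n = threadIdx.x + blockIdx.x * blockDim.x;
       c[n] = a[n] + b[n];
   }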
55. Initializing data

   int InitialData1[12] = {62, 48, 20, -53, 39, 83, 19, 47, 13, 88, 38, -92};
   int InitialData2[12] = {-49, 29, 38, 10, 37, 46, -12, 86, 17, 83, -22, 94};
   #define SIZE 2048
56. Creating the main function

   int main (int argc, char **argv)
   {
       int HostVector1[SIZE];
       int HostVector2[SIZE];
       for (int c = 0; c < SIZE; c++) {
           HostVector1[c] = InitialData1[c % 12];
           HostVector2[c] = InitialData2[c % 12];
       }
57. Creating the context

   cl_context clCreateContextFromType(cl_context_properties *properties,
       cl_device_type device_type,
       void (*pfn_notify)(const char *errinfo, const void *private_info,
           size_t cb, void *user_data),
       void *user_data, cl_int *errcode_ret)

   cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
       NULL, NULL, NULL);
58. Creating the context
   You can also use CL_DEVICE_TYPE_CPU as the cl_device_type.

   cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
       NULL, NULL, NULL);
59. Query compute devices

   cl_int clGetContextInfo(cl_context context,
       cl_context_info param_name, size_t param_value_size,
       void *param_value, size_t *param_value_size_ret)

   param_name: CL_CONTEXT_REFERENCE_COUNT, CL_CONTEXT_DEVICES, CL_CONTEXT_PROPERTIES

   size_t ParmDataBytes;
   clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
60. Query compute devices

   cl_device_id* GPUDevices = (cl_device_id*) malloc(ParmDataBytes);
   clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes,
       GPUDevices, NULL);
61. Command queue

   cl_command_queue clCreateCommandQueue(cl_context context,
       cl_device_id device, cl_command_queue_properties properties,
       cl_int *errcode_ret)

   properties: CL_QUEUE_PROFILING_ENABLE, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

   cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext,
       GPUDevices[0], 0, NULL);
62. Allocating the memory

   cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags,
       size_t size, void *host_ptr, cl_int *errcode_ret)

   flags: CL_MEM_READ_WRITE, CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY,
   CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR, CL_MEM_COPY_HOST_PTR

   cl_mem GPUVector1 = clCreateBuffer(GPUContext,
       CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
       sizeof(int) * SIZE, HostVector1, NULL);
63. Allocating the memory

   cl_mem GPUVector2 = clCreateBuffer(GPUContext,
       CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
       sizeof(int) * SIZE, HostVector2, NULL);

   cl_mem GPUOutputVector;
   GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
       sizeof(int) * SIZE, NULL, NULL);
64. Creating the program

   cl_program clCreateProgramWithSource(cl_context context,
       cl_uint count, const char **strings, const size_t *lengths,
       cl_int *errcode_ret)

   cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 6,
       OpenCLSource, NULL, NULL);  // count must match the number of source strings
65. Creating the program

   cl_int clBuildProgram(cl_program program, cl_uint num_devices,
       const cl_device_id *device_list, const char *options,
       void (*pfn_notify)(cl_program program, void *user_data),
       void *user_data)

   clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
66. Creating the kernel

   cl_kernel clCreateKernel(cl_program program, const char *kernel_name,
       cl_int *errcode_ret)

   cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);
67. Matching the GPU memory with the kernel

   cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
       size_t arg_size, const void *arg_value)

   clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
68. Matching the GPU memory with the kernel

   clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
   clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);
69. Launching the kernel

   cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
       cl_kernel kernel, cl_uint work_dim,
       const size_t *global_work_offset, const size_t *global_work_size,
       const size_t *local_work_size, cl_uint num_events_in_wait_list,
       const cl_event *event_wait_list, cl_event *event)

   size_t WorkSize[1] = {SIZE};
   clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
       WorkSize, NULL, 0, NULL, NULL);
70. Copying the output to the host memory

   cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
       cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb,
       void *ptr, cl_uint num_events_in_wait_list,
       const cl_event *event_wait_list, cl_event *event)

   int HostOutputVector[SIZE];
   clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
       SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
71. Cleaning the GPU device

   clReleaseMemObject(GPUVector1);
   clReleaseMemObject(GPUVector2);
   clReleaseMemObject(GPUOutputVector);
   free(GPUDevices);
   for (int c = 0; c < 305; c++)
       printf("%c", (char)HostOutputVector[c]);
   return 0;
   }
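   The slide releases the buffers and frees the device list; a fuller cleanup (an addition of mine, using the standard OpenCL release calls) would also release the kernel, program, queue, and context:

   clReleaseKernel(OpenCLVectorAdd);
   clReleaseProgram(OpenCLProgram);
   clReleaseCommandQueue(GPUCommandQueue);
   clReleaseContext(GPUContext);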
72. What you learned
   • Writing an OpenCL kernel
   • Writing an OpenCL application:
     – setting the context
     – preparing the command queue
     – setting the memory objects
     – setting the program
     – setting the kernel and its arguments
73. Sources and additional resources
   • Jason Sanders, "Introduction to CUDA" book and GTC presentation
   • OpenCL specification document
   • NVIDIA CUDA programming guide
   • NVIDIA OpenCL getting started guide
   • Videos from GTC'10 at the link:
   • http://www.nvidia.com/object/gtc2010-presentation-a