Intro2 CUDA Moayad

Lecture for High Performance Computing and High Availability on CUDA and OpenCL

Transcript

  • 1. Introduction to Parallel Programming with CUDA & OpenCL. Moayad H. Almohaishi, graduate student, Computer Science, Louisiana Tech University. mha023@latech.edu
  • 2. Outline: Introduction; Introduction to CUDA (Hello World, addition application, array addition, CUDA memories, matrix multiplication, performance considerations); Introduction to OpenCL (addition kernel, differences from the CUDA kernel, setting up the OpenCL host code); sources and additional resources.
  • 3. Introduction. Why the GPU? It is available in almost all new desktops and laptops; it is many-core (512 cores on a GTX 580); it offers high floating-point throughput (the GTX 580 has a peak of roughly 1.5 TFLOPS single precision); and it offers high memory bandwidth (192.4 GB/sec on the GTX 580).
  • 4. Introduction to CUDA. CUDA architecture: the physical technology on the GPU. CUDA C: the programming language for harnessing the power of the CUDA architecture, based on standard C.
  • 5. What do you need to know today? You will need some knowledge of C. You don't need to know about parallel programming, and you don't need to know about the CUDA architecture.
  • 6. Terminology. Host: the CPU and its dedicated system memory (RAM). Device: the GPU and its on-board memory.
  • 7. C Hello World

        int main( void ) {
            printf("Hello World!\n");
            return 0;
        }

    This Hello World C code, if compiled with the NVIDIA CUDA compiler, will compile without problems.
  • 8. CUDA Kernel

        __global__ void kernel( void ) {
        }

        int main( void ) {
            kernel<<<1,1>>>();
            printf("Hello World!\n");
            return 0;
        }
  • 9. __global__ is a keyword that defines the function as a CUDA kernel. kernel<<<1,1>>>(); is the command that calls the CUDA kernel from the host code.

        __global__ void kernel( void ) {
        }

        int main( void ) {
            kernel<<<1,1>>>();
            printf("Hello World!\n");
            return 0;
        }
  • 10. Single Addition on the CPU

        float add( float a, float b ) {
            return a + b;
        }

        int main( void ) {
            float a, b, c;
            ... // setting a and b values
            c = add(a, b);
            printf("%f + %f = %f\n", a, b, c);
            return 0;
        }
  • 11. Single Addition on the GPU

        __global__ void add( float *a, float *b, float *c ) {
            *c = *a + *b;
        }

        int main( void ) {
            float *a, *b, *c;
            ... // setting a and b values
            add<<<1,1>>>(a, b, c);
            printf("%f + %f = %f\n", *a, *b, *c);
            return 0;
        }
  • 12. Single Addition on the GPU (?!)

        __global__ void add( float *a, float *b, float *c ) {
            *c = *a + *b;
        }

        int main( void ) {
            float *a, *b, *c;
            ... // setting a and b values
            add<<<1,1>>>(a, b, c);   // c will need to be copied back to the host
            printf("%f + %f = %f\n", *a, *b, *c);
            return 0;
        }
  • 13. CUDA Global Memory. The original C memory commands are malloc(), free(), and memcpy(). To be able to use GPU memory you will need to: allocate memory on the GPU using cudaMalloc(); copy host memory to device memory using cudaMemcpy(); and free the memory using cudaFree().
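    As a minimal sketch of the round trip these three calls give you, with basic error checking added (the slides omit it for brevity), assuming a single float is transferred:

        #include <stdio.h>
        #include <cuda_runtime.h>

        int main( void ) {
            float h_x = 3.5f, h_y = 0.0f;
            float *d_x = NULL;

            // Allocate one float on the device and check the result.
            cudaError_t err = cudaMalloc((void**) &d_x, sizeof(float));
            if (err != cudaSuccess) {
                printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
                return 1;
            }

            // Copy host -> device, then device -> host.
            cudaMemcpy(d_x, &h_x, sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(&h_y, d_x, sizeof(float), cudaMemcpyDeviceToHost);

            printf("round trip: %f\n", h_y);   // expected 3.500000
            cudaFree(d_x);
            return 0;
        }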
  • 14. Single Addition on the GPU. The kernel is correct and will stay the same:

        __global__ void add( float *a, float *b, float *c ) {
            *c = *a + *b;
        }
  • 15. Single Addition on the GPU. Allocating the device memory: we need to define different variables for the host and device memories.

        int main( void ) {
            float h_a, h_b, h_c;
            float *d_a, *d_b, *d_c;
            int size = sizeof(float);

            cudaMalloc((void**) &d_a, size);
            cudaMalloc((void**) &d_b, size);
            cudaMalloc((void**) &d_c, size);

            h_a = 150;
            h_b = 89;
  • 16. Single Addition on the GPU. Copy the memory to and from the device, then free the device memory.

            cudaMemcpy(d_a, &h_a, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, &h_b, size, cudaMemcpyHostToDevice);

            add<<<1,1>>>(d_a, d_b, d_c);

            cudaMemcpy(&h_c, d_c, size, cudaMemcpyDeviceToHost);

            printf("%f + %f = %f\n", h_a, h_b, h_c);

            cudaFree(d_a);
            cudaFree(d_b);
            cudaFree(d_c);
            return 0;
        }
  • 17. Is this the right thing to do? The GPU is about massive parallelism, so running this program on the GPU is inefficient and will be slower than the CPU version. You need large data.
  • 18. Array Addition on the CPU. The add function stays the same.

        int main( void ) {
            int n = 512;   // 2^9
            float a[n], b[n], c[n];
            ... // setting a and b values
            for (int i = 0; i < n; i++) {
                c[i] = add(a[i], b[i]);
                printf("%f + %f = %f\n", a[i], b[i], c[i]);
            }
            return 0;
        }
  • 19. Array Addition on the GPU. We have to modify the size.

        int main( void ) {
            int n = 512;
            float h_a[n], h_b[n], h_c[n];
            float *d_a, *d_b, *d_c;
            int size = sizeof(float) * n;

            cudaMalloc((void**) &d_a, size);
            cudaMalloc((void**) &d_b, size);
            cudaMalloc((void**) &d_c, size);
            ... // setting the input data h_a and h_b
  • 20. Array Addition on the GPU (?!)

            cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

            add<<<1,1>>>(d_a, d_b, d_c);   // ?! still only one thread

            cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

            printf("%f + %f = %f\n", h_a[0], h_b[0], h_c[0]);

            cudaFree(d_a);
            cudaFree(d_b);
            cudaFree(d_c);
            return 0;
        }
  • 21. Blocks. CUDA runs the kernel as blocks on a grid containing n blocks. The maximum value of n can differ from device to device; the limit on current devices is 65,535 blocks per grid dimension. We will use blockIdx.x to access the block ID from inside the kernel.
  • 22. Array Addition on the GPU. n blocks will run the kernel.

            cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

            add<<<n,1>>>(d_a, d_b, d_c);

            cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

            printf("%f + %f = %f\n", h_a[0], h_b[0], h_c[0]);

            cudaFree(d_a);
            cudaFree(d_b);
            cudaFree(d_c);
            return 0;
        }
  • 23. Array Addition Kernel

        __global__ void add( float *a, float *b, float *c ) {
            int idx = blockIdx.x;
            c[idx] = a[idx] + b[idx];
        }
  • 24. Threads. Each block can contain up to 512 parallel threads on the first and second CUDA architectures. On the Fermi architecture each block can contain up to 1024 parallel threads. We will use threadIdx.x to access the thread ID from inside the kernel.
  • 25. Array Addition on the GPU. n threads in a single block will run the kernel.

            cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

            add<<<1,n>>>(d_a, d_b, d_c);

            cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

            printf("%f + %f = %f\n", h_a[0], h_b[0], h_c[0]);

            cudaFree(d_a);
            cudaFree(d_b);
            cudaFree(d_c);
            return 0;
        }
  • 26. Array Addition Kernel. CUDA runs threads as half-warps, so it is more efficient to have at least 16 threads per block.

        __global__ void add( float *a, float *b, float *c ) {
            int idx = threadIdx.x;
            c[idx] = a[idx] + b[idx];
        }
  • 27. More. Is this still massive parallelism? What about more than 512 elements?
  • 28. Terminology: a 1D grid. (Diagram: three blocks, blockIdx.x = 0, 1, 2, each containing seven threads with threadIdx.x = 0..6; BlockSize = 7.)
  • 29. Global memory access. How do we point each thread to the right global memory address? (Diagram: the three blocks of seven threads laid over global memory indices 0 through 20.)
  • 30. Global memory access, 1D grid (BlockSize = 7): idx = threadIdx.x + blockIdx.x * blockDim.x
  • 31. Array Addition on the GPU

            cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

            int blockSize = 256;
            int blocks = n / blockSize;
            add<<<blocks, blockSize>>>(d_a, d_b, d_c);

            cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

            //printf("%f + %f = %f\n", a, b, c);

            cudaFree(d_a);
            cudaFree(d_b);
            cudaFree(d_c);
            return 0;
        }
  • 32. Array Addition Kernel

        __global__ void add( float *a, float *b, float *c ) {
            int idx = threadIdx.x + blockIdx.x * blockDim.x;
            c[idx] = a[idx] + b[idx];
        }
  • 33. Exercises. What is the maximum number of threads that can run on a grid? How can we go beyond that limit?
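    One common way beyond the per-grid limit (a sketch, not the only answer to the exercise) is a grid-stride loop, in which each thread processes several elements; this assumes the array length n is passed to the kernel:

        __global__ void add( float *a, float *b, float *c, int n ) {
            // Start at this thread's global index and jump by the total
            // number of threads in the grid until the end of the array.
            for (int idx = threadIdx.x + blockIdx.x * blockDim.x;
                 idx < n;
                 idx += blockDim.x * gridDim.x) {
                c[idx] = a[idx] + b[idx];
            }
        }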
  • 34. Global memory access: allowing each thread to do 2 computations. How do we point each thread to the right global memory address? Hint: you need an idx formula that covers one memory index and jumps over the next one; you access the second index through idx + 1. (Diagram: three blocks of seven threads over global memory indices 0 through 20.)
  • 35. Global memory access: allowing each thread to do 2 computations. Hint: you need an idx formula that covers one memory index and jumps ahead by the next blockSize; you access the second index through idx + blockDim.x. (Diagram: three blocks of seven threads over global memory indices 0 through 20.)
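    A minimal sketch of the second hint (each thread handles the element at idx and the one a block-width away); it assumes the kernel is launched with half as many blocks so both accesses stay in range:

        __global__ void add2( float *a, float *b, float *c ) {
            // Each block now covers 2 * blockDim.x consecutive elements.
            int idx = threadIdx.x + blockIdx.x * blockDim.x * 2;
            c[idx] = a[idx] + b[idx];
            c[idx + blockDim.x] = a[idx + blockDim.x] + b[idx + blockDim.x];
        }

        // launched, for example, as: add2<<<blocks / 2, blockSize>>>(d_a, d_b, d_c);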
  • 36. What you learned: creating a CUDA kernel; calling the kernel from the host; allocating CUDA memory; copying to/from the device memory; freeing the device memory; controlling the number of threads through the block size and the number of blocks per grid.
  • 37. Dot Product. (Diagram: the elements of A and B are multiplied pairwise and the products are summed into C.)
  • 38. If each thread does one multiplication, which thread will do the addition?
  • 39. Shared Memory. Shared memory is very fast memory on the GPU chip itself. Each block has its own shared memory space. It can be declared using the __shared__ CUDA keyword. To make sure all the threads have finished computing, use the CUDA keyword __syncthreads().
  • 40. Dot Product Kernel

        __global__ void dotP( int *a, int *b, int *c ) {
            __shared__ int temp[N];
            temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
            __syncthreads();
            if (threadIdx.x == 0) {
                int sum = 0;
                for (int i = 0; i < N; i++)
                    sum += temp[i];
                *c = sum;
            }
        }
  • 41. Exercise. In this application the addition runs on thread 0 only. Is that efficient? How can we make it better?
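    One common improvement (a sketch, not the only answer to the exercise) is a tree reduction in shared memory, so the summation is spread across the threads of the block; this assumes the block size N is a power of two:

        __global__ void dotP( int *a, int *b, int *c ) {
            __shared__ int temp[N];
            temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
            __syncthreads();

            // Tree reduction: halve the number of active threads each step.
            for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
                if (threadIdx.x < stride)
                    temp[threadIdx.x] += temp[threadIdx.x + stride];
                __syncthreads();
            }

            if (threadIdx.x == 0)
                *c = temp[0];
        }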
  • 42. Matrix Multiplication. (Diagram: matrices A and B are multiplied to produce C.)
  • 43. MatrixMul on the GPU

        int main( void ) {
            int n = 16;
            float h_a[n][n], h_b[n][n], h_c[n][n];
            float *d_a, *d_b, *d_c;
            int size = sizeof(float) * n * n;

            cudaMalloc((void**) &d_a, size);
            cudaMalloc((void**) &d_b, size);
            cudaMalloc((void**) &d_c, size);
            ... // setting the input data h_a and h_b
  • 44. MatrixMul on the GPU

            cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

            dim3 blockSize(n, n, 1);
            matrixMul<<<1, blockSize>>>(d_a, d_b, d_c);

            cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

            cudaFree(d_a);
            cudaFree(d_b);
            cudaFree(d_c);
            return 0;
        }
  • 45. Simple Matrix Multiplication Kernel

        __global__ void matrixMul( float *a, float *b, float *c ) {
            int x = threadIdx.x;   // row
            int y = threadIdx.y;   // column
            float temp = 0;
            for (int i = 0; i < blockDim.x; i++) {
                temp += a[i * blockDim.x + y] * b[x * blockDim.x + i];
            }
            c[x * blockDim.x + y] = temp;
        }
  • 46. Exercise. Use shared memory to optimize the matrix multiplication algorithm (hint: look at the code in the SDK).
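    A minimal sketch of the usual shared-memory optimization (tiling, similar in spirit to the SDK sample, not necessarily the exercise's intended solution); TILE is an assumed tile width that divides the matrix size n, and the matrices are stored row-major:

        #define TILE 16

        __global__ void matrixMulTiled( float *a, float *b, float *c, int n ) {
            __shared__ float As[TILE][TILE];
            __shared__ float Bs[TILE][TILE];

            int row = blockIdx.y * TILE + threadIdx.y;
            int col = blockIdx.x * TILE + threadIdx.x;
            float temp = 0;

            // Walk over the tiles of a and b that contribute to c[row][col].
            for (int t = 0; t < n / TILE; t++) {
                // Each thread loads one element of each tile into shared memory.
                As[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
                Bs[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
                __syncthreads();

                for (int i = 0; i < TILE; i++)
                    temp += As[threadIdx.y][i] * Bs[i][threadIdx.x];
                __syncthreads();
            }
            c[row * n + col] = temp;
        }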
  • 47. What you learned: using shared memory to share data among the threads in a block; synchronizing the threads; setting a blockSize of more than one dimension using dim3.
  • 48. Performance Considerations. For maximum performance: reduce global memory accesses; maximize occupancy (allow scheduling of 1024 threads per streaming multiprocessor) by using the right blockSize, the right number of registers, and the right amount of shared memory; increase the number of independent instructions; coalesce the memory accesses; and use the right instruction-to-byte ratio.
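    To illustrate the coalescing point, a small sketch contrasting a coalesced access pattern (neighbouring threads touch neighbouring addresses) with a strided one; the exact penalty depends on the hardware:

        // Coalesced: thread 0 reads in[0], thread 1 reads in[1], ..., so a
        // half-warp touches one contiguous segment of global memory.
        __global__ void copyCoalesced( float *out, float *in ) {
            int idx = threadIdx.x + blockIdx.x * blockDim.x;
            out[idx] = in[idx];
        }

        // Strided: neighbouring threads touch addresses `stride` elements apart,
        // so the same half-warp needs many separate memory transactions.
        __global__ void copyStrided( float *out, float *in, int stride ) {
            int idx = (threadIdx.x + blockIdx.x * blockDim.x) * stride;
            out[idx] = in[idx];
        }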
  • 49. Introduction to OpenCL. OpenCL is an open standard. It is cross-platform and can run on: multi-core CPUs, GPUs (NVIDIA, ATI), the Cell B/E, and others. It is close to CUDA.
  • 50. How the program works. On the host: allocate memory on the host, initialize the data in host memory, allocate memory on the device (GPU), copy the data from host to device, run the kernel, copy the results back to host memory, and finally clear the memory and free the resources. On the device (GPU): the stream processors execute the kernel code against the device memory holding A[], B[], and C[].
  • 51. Basic OpenCL program structure. An OpenCL kernel, plus a host program containing: (a) a devices context, (b) a command queue, (c) memory objects, (d) an OpenCL program, and (e) kernel memory arguments.
  • 52. Creating the Kernel

        #include <stdio.h>
        #include <stdlib.h>
        #include <CL/cl.h>

        const char* OpenCLSource[ ] = {
            "__kernel void VectorAdd(__global int* c, __global int* a,\n",
            "                        __global int* b)\n",
            "{\n",
            "    unsigned int n = get_global_id(0);\n",
            "    c[n] = a[n] + b[n];\n",
            "}\n"
        };
  • 53. Creating the Kernel. Notice that the whole kernel is stored as an array of char strings (const char* OpenCLSource[ ]); it is the same code as on the previous slide.
  • 54. Creating the Kernel. The __kernel keyword is equivalent to __global__ in CUDA. get_global_id() is a built-in function, used instead of calculating the global ID as you do in CUDA. The function parameters need to be declared __global, which you don't need in CUDA.
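    For comparison, the equivalent CUDA kernel for the same vector addition would look roughly like this (a sketch; the element count is implied by the launch configuration):

        // CUDA version of the same VectorAdd kernel: __global__ instead of
        // __kernel, no __global qualifier on the pointers, and the global
        // index is computed by hand instead of calling get_global_id(0).
        __global__ void VectorAdd( int *c, int *a, int *b ) {
            unsigned int n = threadIdx.x + blockIdx.x * blockDim.x;
            c[n] = a[n] + b[n];
        }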
  • 55. Initializing data

        int InitialData1[12] = {62, 48, 20, -53, 39, 83, 19, 47, 13, 88, 38, -92};
        int InitialData2[12] = {-49, 29, 38, 10, 37, 46, -12, 86, 17, 83, -22, 94};

        #define SIZE 2048
  • 56. Creating the main function

        int main (int argc, char **argv)
        {
            int HostVector1[SIZE];
            int HostVector2[SIZE];

            for (int c = 0; c < SIZE; c++) {
                HostVector1[c] = InitialData1[c % 12];
                HostVector2[c] = InitialData2[c % 12];
            }
  • 57. Creating the context

        cl_context clCreateContextFromType(cl_context_properties *properties,
            cl_device_type device_type,
            void (*pfn_notify)(const char *errinfo, const void *private_info,
                               size_t cb, void *user_data),
            void *user_data, cl_int *errcode_ret)

        cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
            NULL, NULL, NULL);
  • 58. Creating the context. You can also use CL_DEVICE_TYPE_CPU as the device type.

        cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
            NULL, NULL, NULL);
  • 59. Query compute devices

        cl_int clGetContextInfo(cl_context context, cl_context_info param_name,
            size_t param_value_size, void *param_value,
            size_t *param_value_size_ret)

        param_name: CL_CONTEXT_REFERENCE_COUNT, CL_CONTEXT_DEVICES,
        CL_CONTEXT_PROPERTIES

        size_t ParmDataBytes;
        clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
  • 60. Query compute devices

        cl_device_id* GPUDevices = (cl_device_id*) malloc(ParmDataBytes);
        clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes,
            GPUDevices, NULL);
  • 61. Command queue

        cl_command_queue clCreateCommandQueue(cl_context context,
            cl_device_id device, cl_command_queue_properties properties,
            cl_int *errcode_ret)

        properties: CL_QUEUE_PROFILING_ENABLE,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

        cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext,
            GPUDevices[0], 0, NULL);
  • 62. Allocating the memory

        cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags,
            size_t size, void *host_ptr, cl_int *errcode_ret)

        flags: CL_MEM_READ_WRITE, CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY,
        CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR, CL_MEM_COPY_HOST_PTR

        cl_mem GPUVector1 = clCreateBuffer(GPUContext,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            sizeof(int) * SIZE, HostVector1, NULL);
  • 63. Allocating the memory

        cl_mem GPUVector2 = clCreateBuffer(GPUContext,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            sizeof(int) * SIZE, HostVector2, NULL);

        cl_mem GPUOutputVector;
        GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
            sizeof(int) * SIZE, NULL, NULL);
  • 64. Creating the program

        cl_program clCreateProgramWithSource(cl_context context, cl_uint count,
            const char **strings, const size_t *lengths, cl_int *errcode_ret)

        cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 6,
            OpenCLSource, NULL, NULL);
  • 65. Creating the program

        cl_int clBuildProgram(cl_program program, cl_uint num_devices,
            const cl_device_id *device_list, const char *options,
            void (*pfn_notify)(cl_program, void *user_data), void *user_data)

        clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
  • 66. Creating the program

        cl_kernel clCreateKernel(cl_program program, const char *kernel_name,
            cl_int *errcode_ret)

        cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd",
            NULL);
  • 67. Matching the GPU memory with the kernel

        cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
            size_t arg_size, const void *arg_value)

        clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem),
            (void*) &GPUOutputVector);
  • 68. Matching the GPU memory with the kernel

        clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*) &GPUVector1);
        clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*) &GPUVector2);
  • 69. Launching the kernel

        cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
            cl_kernel kernel, cl_uint work_dim,
            const size_t *global_work_offset, const size_t *global_work_size,
            const size_t *local_work_size, cl_uint num_events_in_wait_list,
            const cl_event *event_wait_list, cl_event *event)

        size_t WorkSize[1] = {SIZE};
        clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
            WorkSize, NULL, 0, NULL, NULL);
  • 70. Copying the output to the host memory

        cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
            cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb,
            void *ptr, cl_uint num_events_in_wait_list,
            const cl_event *event_wait_list, cl_event *event)

        int HostOutputVector[SIZE];
        clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
            SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
  • 71. Cleaning up the GPU device

        clReleaseMemObject(GPUVector1);
        clReleaseMemObject(GPUVector2);
        clReleaseMemObject(GPUOutputVector);
        free(GPUDevices);

        for (int c = 0; c < 305; c++)
            printf("%c", (char) HostOutputVector[c]);

        return 0;
        }
  • 72. What you learned: writing an OpenCL kernel; writing an OpenCL application (setting the context, preparing the command queue, setting the memory objects, setting up the program, and setting the kernel and its arguments).
  • 73. Sources and additional resources: Jason Sanders, "Introduction to CUDA" (book and GTC presentation); the OpenCL specification document; the NVIDIA CUDA programming guide; the NVIDIA OpenCL getting started guide; and videos from GTC'10 at http://www.nvidia.com/object/gtc2010-presentation-a