Griffon Topic2 Presentation (Tia)

Presenter notes:
  • CUDA
  • S2S
  • The measured values fluctuate a lot.
  • Open-source S2S compilers are rare; most projects prefer to build a full compiler, which performs better and is easier to maintain.

    1. Griffon: GPU Programming API for Scientific and General Purpose
       Pisit Makpaisit 4909611727
       Supervisor: Dr. Worawan Diaz Carballo
       Department of Computer Science, Faculty of Science and Technology, Thammasat University
    2-4. Motivation
       • GPU-CPU performance gap
       • GPGPU
       • GPU programming model complexity
    5. GPU-CPU performance gap
       Every PC has a graphics card, and the processing unit on that card is called the "GPU"; therefore every PC has a GPU.
       GPU performance is now pulling away from traditional processors.
       http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
    6. GPGPU
       General-Purpose computation on Graphics Processing Units
       Very high computation and data throughput
       Scalability
    7. GPGPU Applications
       Simulation, finance, fluid dynamics, medical imaging, visualization, signal processing, image processing, optical flow, differential equations, linear algebra, finite element, fast Fourier transform, etc.
    8. Vector Addition
       Vector A + Vector B = Vector C
    9. Vector Addition (Sequential Code)

       #include <stdio.h>
       #define SIZE 500

       /* Declare function */
       void VecAdd(float *A, float *B, float *C){
           int i;
           for(i = 0; i < SIZE; i++)
               C[i] = A[i] + B[i];
       }

       void main(){
           /* Declare variables */
           int i, size = SIZE * sizeof(float);
           float *A, *B, *C;
           /* Allocate memory */
           A = (float*)malloc(size);
           B = (float*)malloc(size);
           C = (float*)malloc(size);
           /* Call function */
           VecAdd(A, B, C);
           /* De-allocate memory */
           free(A);
           free(B);
           free(C);
       }
    10. Vector Addition (Sequential Code) – diagram: the elements of Vector A and Vector B are added one pair at a time to produce Vector C.
    11-13. Improve Performance
       We can improve vector addition with parallel computing.
       Data parallelism – add the elements simultaneously.
       1st choice: multiple cores on the CPU – OpenMP
       2nd choice: many cores on the GPU – CUDA

    Vector Addition (OpenMP)
       1. Start from the sequential code.
       2. Add a compiler directive immediately before the loop in VecAdd:

          #pragma omp parallel for
          for(i = 0; i < SIZE; i++)
              C[i] = A[i] + B[i];

       3. Finished.
    14. Vector Addition (OpenMP) – diagram: the same element-by-element addition, now performed by several CPU threads in parallel.
    15. Speed Up (Amdahl's Law)
       In the sequential execution time, vector addition accounts for roughly 80% of the program.
       Parallel on the CPU (2 cores): new vector-addition time = old time / cores = 80% / 2.
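       The slide's arithmetic is an application of Amdahl's law; a worked form, assuming the 80% parallel fraction stated above (the 16-core case corresponds to slide 20):

          \[
          S(n) = \frac{1}{(1-p) + p/n}, \quad p = 0.8: \qquad
          S(2) = \frac{1}{0.2 + 0.4} \approx 1.67, \qquad
          S(16) = \frac{1}{0.2 + 0.05} = 4
          \]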
    16. OpenMP
       Easy, automatic thread management
       Only a few threads on the CPU
    17. Vector Addition (GPU – CUDA) – diagram: vectors A and B are copied from CPU memory to GPU memory, added in parallel on the GPU, and the result vector C is copied back to CPU memory.
    18. Parallel Vector Addition on GPU (CUDA)

       #include <stdio.h>
       #define SIZE 500

       /* Declare kernel function */
       __global__ void VecAdd(float *A, float *B, float *C){
           int idx = threadIdx.x;
           if(idx < SIZE)
               C[idx] = A[idx] + B[idx];
       }

       void main(){
           /* Declare variables */
           int i, size = SIZE * sizeof(float);
           float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
           /* Allocate CPU memory */
           h_A = (float*)malloc(size);
           h_B = (float*)malloc(size);
           h_C = (float*)malloc(size);
           /* Allocate GPU memory */
           cudaMalloc((void**)&d_A, size);
           cudaMalloc((void**)&d_B, size);
           cudaMalloc((void**)&d_C, size);
    19. Parallel Vector Addition on GPU (CUDA)

           /* Transfer data from CPU to GPU */
           cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
           cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
           /* Kernel call */
           VecAdd<<<1, SIZE>>>(d_A, d_B, d_C);
           /* Transfer data from GPU to CPU */
           cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
           /* De-allocate CPU memory */
           free(h_A);
           free(h_B);
           free(h_C);
           /* De-allocate GPU memory */
           cudaFree(d_A);
           cudaFree(d_B);
           cudaFree(d_C);
       }
    20. Speed Up (Amdahl's Law)
       In the sequential execution time, vector addition accounts for roughly 80% of the program.
       Parallel on the GPU (16 cores): new vector-addition time = old time / cores = 80% / 16.
    21. CUDA
       Speeds the program up, but demands more effort and development time
       Many threads on the GPU
    22-24. CUDA Memory Model (a minimal sketch of these memory spaces follows the next slide)
       • Global Memory – off-chip, large, shared by all threads, slow; the host can read and write it
       • Local Memory – private to one thread, faster than global memory
       • Shared Memory – shared by all threads in a block, faster than global memory

    Griffon
       Simple programming model (OpenMP) + computing performance (GPU – CUDA) = easy and efficient (Griffon)
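       To make the three memory spaces from slides 22-24 concrete, here is a minimal CUDA sketch (an illustration added here, not taken from the slides; the kernel and its names are assumptions):

          /* Global memory: allocated by the host with cudaMalloc and visible to every thread.  */
          /* Shared memory: declared with __shared__ and visible to one thread block.           */
          /* Local memory / registers: ordinary automatic variables inside the kernel.          */
          __global__ void scale(float *g_data, float factor){    /* g_data lives in global memory       */
              __shared__ float tile[256];                        /* shared by the threads of this block */
              int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* idx is private to each thread       */
              tile[threadIdx.x] = g_data[idx];                   /* stage global data in shared memory  */
              __syncthreads();
              g_data[idx] = tile[threadIdx.x] * factor;          /* write back to global memory         */
          }

       Launched, for example, as scale<<<N/256, 256>>>(d_data, 2.0f) with N a multiple of 256.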
    25. Parallel Vector Addition on GPU (Griffon)
       1. Start from the sequential code.
       2. Add a compiler directive immediately before the loop in VecAdd:

          #pragma gfn parallel for
          for(i = 0; i < SIZE; i++)
              C[i] = A[i] + B[i];

       3. Finished. So easy!
    26. Griffon
       Compiler directives for the C language
       Source-to-source compiler
       Automatic data management
       Optimization
    27. Objectives
    28. Objectives (1/2)
       To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises (a) compiler directives and (b) a source-to-source compiler.
       • Simple – the number of compiler directives does not exceed 20 instructions, and the grammar of Griffon directives is similar to OpenMP, the standard shared-memory API.
       • Thread safety – the code generated by Griffon gives correct behavior, i.e. equivalent to that of the sequential code.
    29. Objectives (2/2)
       To demonstrate that Griffon-generated code can gain reasonable performance over the sequential code on two example applications: Pi calculation using numerical integration, and Pi calculation using the Monte Carlo method.
       • Automatic – GPU memory management in the generated code is done automatically by Griffon.
       • Efficient – the generated code achieves the speedup predicted by Amdahl's law, or falls short of it by less than 20%.
    30. Project Constraints
    31. Project Constraints
       • Griffon is a C-language API that supports both Windows and Linux environments.
       • The generated executable program can run only on NVIDIA graphics cards.
       • Users can use Griffon in cooperation with OpenMP.
    32. Related Works
    33. Brook+ & CUDA
       General-purpose computation on the GPU
       Manual kernel management and data transfer across the various GPU memories
       Vendor dependent
    34. OpenCL (Open Computing Language)
       Cross-platform and vendor neutral
       An approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors)
       Data and task parallelism
    35. OpenMP to GPGPU
       Translates OpenMP applications into CUDA-based GPGPU applications
       GPU optimization techniques – parallel loop swap and loop collapsing – to enhance inter-thread locality
    36. hiCUDA
       A directive-based GPU programming language
       Computation model for identifying code regions to be executed on the GPU
       Data model for allocating and de-allocating GPU memory and for data transfer
    37-40. Methodology
       • Software architecture
       • Directives
       • Griffon compilation process
       • Optimization techniques
    41. Software Architecture
       NVCC is part of the Griffon toolchain. The Griffon source-to-source compiler comprises a compile-time memory allocator and an optimizer.
       Pipeline: Griffon C application → Griffon compiler (compile-time memory allocator, optimizer) → CUDA C application → NVCC (NVIDIA CUDA compiler) → PTX code + C code → PTX compiler / GCC (Linux) or CL (MS Windows) → GPU object code + CPU object code → executable
    42. Directives
    43. Griffon Directives
       • Parallel Region – defines a parallel region
       • Control Flow – specifies kernel work flow
       • Synchronization – defines synchronization points
       • GPU/CPU Overlap Compute – defines a region where the CPU computes concurrently with the GPU
    44. Directives
       General form:
          #pragma gfn directive-name [clause[ [,] clause]...] new-line
       Parallel region:
          #pragma gfn parallel for [clause[ [,] clause]...] new-line
          for-loops
       Clauses: kernelname(name), waitfor(kernelname-list), private(var-list), accurate([low,high]), reduction(operator:var-list)
    45. Parallel Region
       Before:
          for(i = 0; i < N; i++){
              C[i] = A[i] + B[i];
          }
       After:
          #pragma gfn parallel for
          for(i = 0; i < N; i++){
              C[i] = A[i] + B[i];
          }
    46-49. Kernel Flow Control
       #pragma gfn parallel for kernelname( A )
       #pragma gfn parallel for kernelname( B ) waitfor( A )
       #pragma gfn parallel for kernelname( C ) waitfor( A )
       #pragma gfn parallel for kernelname( D ) waitfor( B, C )
       This defines the dependence graph A → {B, C} → D; kernels B and C can execute in parallel.
    50. Synchronization
       (Diagram: threads P0-P3 all reach a synchronization point before continuing.)
       Barrier:
          #pragma gfn barrier new-line
       Atomic:
          #pragma gfn atomic new-line
          assignment-statement
       Parallel reduction:
          #pragma gfn parallel for reduction(operator:var-list)
    51. Synchronization
       Sequential code:
          for(i = 1; i < N-1; i++){
              B[i] = A[i-1] + A[i] + A[i+1];
          }
          for(i = 1; i < N-1; i++){
              A[i] = B[i];
              if(A[i] > 7){
                  C[i] += x / 5;
              }
          }
       Option 1 – two parallel regions:
          #pragma gfn parallel for
          for(i = 1; i < N-1; i++){
              B[i] = A[i-1] + A[i] + A[i+1];
          }
          #pragma gfn parallel for
          for(i = 1; i < N-1; i++){
              A[i] = B[i];
              if(A[i] > 7){
                  #pragma gfn atomic
                  C[i] += x / 5;
              }
          }
       Option 2 – one parallel region with a barrier:
          #pragma gfn parallel for
          for(i = 1; i < N-1; i++){
              B[i] = A[i-1] + A[i] + A[i+1];
              #pragma gfn barrier
              A[i] = B[i];
              if(A[i] > 7){
                  #pragma gfn atomic
                  C[i] += x / 5;
              }
          }
    52. Synchronization
       Before:
          for (i = 1; i <= n-1; i++) {
              x = a + (i * h);
              integral = integral + f(x);
          }
       After:
          #pragma gfn parallel for private(x) reduction(+:integral)
          for (i = 1; i <= n-1; i++) {
              x = a + (i * h);
              integral = integral + f(x);
          }
    53. GPU/CPU Overlap Compute
       #pragma gfn overlapcompute(kernelname) new-line
       structured-block
       While many threads run on the GPU, a CPU function executes in parallel; the two then synchronize.
    54. GPU/CPU Overlap Compute
       Before:
          for(i = 0; i < N; i++){
              ...
          }
          independenceCpuFunction();
       After:
          #pragma gfn parallel for kernelname( calA )
          for(i = 0; i < N; i++){
              ...
          }
          #pragma gfn overlapcompute( calA )
          independenceCpuFunction();
    55. Accurate Level
       #pragma gfn parallel for accurate( [low, high] )
       Use low when speed is important; use high when precision is important. The default is high.
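       For illustration, an assumed usage of the clause (the directive grammar is from slide 44; pairing accurate(low) with a sine loop and a faster intrinsic is my assumption, echoing the sin-to-__sinf replacement shown on slide 62):

          /* Assumed example: trade precision for speed in a transcendental-heavy loop. */
          #pragma gfn parallel for accurate(low) private(x)
          for(i = 0; i < N; i++){
              x = sin(A[i]);      /* with accurate(low) the compiler may pick a faster, lower-precision sin */
              C[i] = x;
          }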
    56. Griffon Compilation Process
    57. Create Kernel
       Source with directive:
          int main(){
              int sum = 0;
              int x, y;
              #pragma gfn parallel for private(x, y) reduction(+:sum)
              for(i = 0; i < N; i++){
                  x = sin(A[i]);
                  y = cos(B[i]);
                  C[i] = x + y;
              }
              return 0;
          }
       Generated code:
          __global__ void __kernel_0(..., int __N){
              int __tid = blockIdx.x * blockDim.x + threadIdx.x;
              int i = __tid [* 1 + 0];
              if(__tid < __N){
                  x = sin(A[i]);
                  y = cos(B[i]);
                  C[i] = x + y;
              }
          }
          int main(){
              int sum = 0;
              int x, y;
              /* Inserted kernel call */
              __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(..., (N - 1 - 0) / 1 + 1);
              return 0;
          }
    58. For-Loop Format and Thread Mapping
       The for-loop must be in one of the forms:
          for( index = min ; index <= max ; index += increment ){ ... }
          for( index = max ; index >= min ; index -= increment ){ ... }   /* transformed into the first case */
       Number of threads = (max - min) / increment + 1.
       Iteration index and thread mapping:
          __tid = blockIdx.x * blockDim.x + threadIdx.x;
          index = __tid * increment + min;
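       As a concrete (assumed) instance of this mapping, a loop of the form for(i = 2; i <= 100; i += 3) would translate roughly as follows; the names are illustrative, not actual Griffon output:

          /* Number of threads: (max - min) / increment + 1 = (100 - 2) / 3 + 1 = 33 */
          __global__ void __kernel_example(float *A, int __N){      /* __N = 33 here                       */
              int __tid = blockIdx.x * blockDim.x + threadIdx.x;    /* global thread id                    */
              int i = __tid * 3 + 2;                                /* index = __tid * increment + min     */
              if(__tid < __N)                                       /* guard excess threads in last block  */
                  A[i] = 0.0f;                                      /* loop body (A has >= 99 elements)    */
          }
          /* Launch with enough 512-thread blocks to cover __N threads:   */
          /*   __kernel_example<<<(__N + 511) / 512, 512>>>(d_A, 33);     */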
    59. Private and shared variable management
       Shared variables must be passed to the kernel function.
       Private variables must be declared inside the kernel function.
       A GPU device variable is declared for each shared variable.
       Allocation size:
       • Static – the size is known at declaration, e.g. int A[500];
       • Dynamic – taken from the allocation function: malloc, calloc, realloc
    60. Private and shared variable management
       Source with directive:
          int main(){
              int sum = 0;
              int x, y;
              int A[N], B[N], C[N];
              #pragma gfn parallel for private(x, y) reduction(+:sum)
              for(i = 0; i < N; i++){
                  x = sin(A[i]);
                  y = cos(B[i]);
                  C[i] = x + y;
              }
              return 0;
          }
       Generated code:
          __global__ void __kernel_0(int *A, int *B, int *C, int __N){
              int __tid = blockIdx.x * blockDim.x + threadIdx.x;
              int i = __tid [* 1 + 0];
              int x, y;                      /* private variables declared in the kernel */
              if(__tid < __N){
                  x = sin(A[i]);
                  y = cos(B[i]);
                  C[i] = x + y;
              }
          }
          int main(){
              int sum = 0;
              int x, y;
              int A[N], B[N], C[N];
              int *__d_A, *__d_B, *__d_C;    /* device copies of shared variables */
              cudaMalloc((void**)&__d_C, sizeof(int) * N);
              cudaMalloc((void**)&__d_B, sizeof(int) * N);
              cudaMalloc((void**)&__d_A, sizeof(int) * N);
              __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);
              cudaFree(__d_C); cudaFree(__d_B); cudaFree(__d_A);
              return 0;
          }
    61. Reduction variable management
       Source with directive:
          int main(){
              ...
              #pragma gfn parallel for reduction(+:sum)
              for(i = 0; i < MAX; i++){
                  ...
                  sum += A[i];
                  ...
              }
              ...
          }
       Generated code (complex, because it uses an optimized parallel-reduction implementation):
          __global__ void __kernel_0(float *A, float *global___sum_add){
              int __tid = blockIdx.x * blockDim.x + threadIdx.x;
              int i = __tid;
              int __rtid = threadIdx.x;
              __shared__ int __sum_add[512];
              int sum = 0;

              __sum_add[__rtid] = 0;
              if(__tid < __N){
                  ...
                  sum += A[i];
                  __sum_add[__rtid] = sum;
                  __syncthreads();
                  if(__rtid < 256) __sum_add[__rtid] += __sum_add[__rtid + 256];
                  __syncthreads();
                  if(__rtid < 128) __sum_add[__rtid] += __sum_add[__rtid + 128];
                  __syncthreads();
                  if(__rtid < 64) __sum_add[__rtid] += __sum_add[__rtid + 64];
                  __syncthreads();
                  if(__rtid < 32) __sum_add[__rtid] += __sum_add[__rtid + 32];
                  __syncthreads();
                  if(__rtid < 16) __sum_add[__rtid] += __sum_add[__rtid + 16];
                  if(__rtid < 8)  __sum_add[__rtid] += __sum_add[__rtid + 8];
                  if(__rtid < 4)  __sum_add[__rtid] += __sum_add[__rtid + 4];
                  if(__rtid < 2)  __sum_add[__rtid] += __sum_add[__rtid + 2];
                  if(__rtid < 1)  __sum_add[__rtid] += __sum_add[__rtid + 1];
              }
              if(__rtid == 0)
                  atomicAdd(global___sum_add, __sum_add[0]);
          }
    62. Replace math functions & GPU functions
       Source:
          int f1(int a){
              return ++a;
          }
          int f0(int a){
              return f1(a) + 5;
          }
          #pragma gfn parallel for
          for(i = 0; i < N; i++){
              A[i] = f0(A[i]) + sin(B[i]);
          }
       Generated code:
          __device__ int __device_f1(int a){
              return ++a;
          }
          __device__ int __device_f0(int a){
              return __device_f1(a) + 5;
          }
          __global__ void __kernel_1(int *A, int *B, int N){
              ...
              A[i] = __device_f0(A[i]) + __sinf(B[i]);
          }
    63. Barrier and Atomic
       Kernel with directives:
          __global__ void __kernel_A(...){
              if(tid < __N){
                  B[i] = A[i-1] + A[i] + A[i+1];
                  #pragma gfn barrier
                  A[i] = B[i];
                  #pragma gfn atomic
                  C[i] += x / 5;
              }
          }
       Generated kernel:
          __global__ void __kernel_A(...){
              if(tid < __N){
                  B[i] = A[i-1] + A[i] + A[i+1];
                  __threadfence();
                  A[i] = B[i];
                  atomicAdd(&C[i], x / 5);
              }
          }
    64. Kernel call and data transfer sorting
          __kernel_K<<<((((N - 1) - 1 - 1) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_A, __d_C, ((N - 1) - 1 - 1) / 1 + 1);
          __kernel_0<<<(((N - 1 - 0) / 5 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_D, __d_B, __d_A, (N - 1 - 0) / 5 + 1, global___sum_add);
          cudaMemcpy(&sum, global___sum_add, sizeof(int), cudaMemcpyDeviceToHost);
          cudaMemcpy(A, __d_A, sizeof(int) * N, cudaMemcpyDeviceToHost);
          cudaMemcpy(D, __d_D, sizeof(int) * N, cudaMemcpyDeviceToHost);
       (Details in the optimization section.)
    65. Automatic cache with shared memory
       Source with directive:
          #pragma gfn parallel for
          for(i = 1; i < (MAX-1); i++){
              B[i] = A[i-1] + A[i] + A[i+1];
          }
       Generated kernel (details in the optimization section):
          __global__ void __kernel_0(int *B, int *A, int __N){
              int __tid = blockIdx.x * blockDim.x + threadIdx.x;
              int i = __tid * 1 + 1;
              __shared__ int sa[514];
              if(__tid < __N){
                  sa[threadIdx.x + 0] = A[i + 0 - 1];
                  if(threadIdx.x + 512 < 514)
                      sa[threadIdx.x + 512] = A[i + 512 - 1];
                  __syncthreads();
                  B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
              }
          }
    66-70. Optimization Techniques
       • Maximize the number of threads on the GPU
       • Reduce data transfer with control-flow analysis
       • Reduce data transfer with kernel control flow
       • Overlap kernel execution and data transfer, using asynchronous data transfer
       • Automatic cache with shared memory
    71-72. Reduce data transfer with control-flow analysis
       Classify each variable as used or defined in the parallel region:
       • A and B transfer from CPU to GPU
       • C transfers from GPU to CPU
       • D transfers in both directions

          #pragma gfn parallel for
          for(i = 0; i < N; i++){
              C[i] = A[i] + B[i] + D[i];
              D[i] = C[i] * 0.5;
          }
    73. Reduce data transfer with kernel control flow
       • cudaMemcpy host-to-device for variables that are used in the kernel
       • cudaMemcpy device-to-host for variables that are defined in the kernel

          #pragma gfn parallel for
          for(i = 0; i < N; i++){
              C[i] = A[i] + B[i];
          }

       Generated transfers (graph: A, B → K1 → C):
          cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
          cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);
          Kernel<<< ... , ... >>>( ... );
          cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
    74. Reduce data transfer with kernel control flow
       Use the dependence graph defined by the kernelname and waitfor constructs:

          #pragma gfn parallel for kernelname(k1)
          for(i = 0; i < N; i++){
              C[i] = A[i] + B[i];
          }
          #pragma gfn parallel for kernelname(k2) waitfor(k1)
          for(i = 0; i < N; i++){
              E[i] = A[i] * C[i] - D[i];
              C[i] = E[i] / 3.0;
          }

       (Graph: A, B → K1 → C; A, C, D → K2 → C, E)
    75. Reduce data transfer with kernel control flow
       If there is a path from k1 to k2:
       • If an in-variable of k1 is the same as an in-variable of k2, delete the in-variable transfer of k2.
       • If an out-variable of k1 is the same as an out-variable of k2, delete the out-variable transfer of k1.
       • If an out-variable of k1 is the same as an in-variable of k2, delete the in-variable transfer of k2.
    76. Schedule Kernel and Memcpy for Maximum Overlap
       After the transfers have been reduced, the graph contains kernel nodes K1 → K2 → K3 and transfer nodes A, B, C, D, E. How should it be scheduled?
    77. Schedule with synchronous functions
       With synchronous transfers everything is serialized:
       Total time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E)
       Newer versions of the CUDA API provide asynchronous data-transfer functions.
    78. Schedule Kernel and Memcpy for Maximum Overlap
       Memcpy and kernel execution can be overlapped; the maximum is a 3-way overlap:
       • MemcpyHostToDevice
       • Kernel execution
       • MemcpyDeviceToHost
       A 4-way overlap is possible if CPU computation is added with the overlapcompute directive.
       Example schedule by level (stream numbers in parentheses): Level 1: K1 (1), A (2); Level 2: K2 (1), B (2), C (3); Level 3: K3 (1), D (2); Level 4: E (1).
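       A minimal sketch of what one level of this 3-way overlap looks like with plain CUDA streams (illustrative only: this is standard CUDA API usage rather than Griffon-generated code, and the kernel and array names are assumptions):

          cudaStream_t s1, s2, s3;
          cudaStreamCreate(&s1); cudaStreamCreate(&s2); cudaStreamCreate(&s3);

          /* Level 2 of the example schedule: run kernel K2 in stream 1 while        */
          /* uploading B for a later kernel in stream 2 and downloading the already  */
          /* computed C in stream 3. Host buffers must be pinned (cudaMallocHost)    */
          /* for the async copies, and true 3-way overlap also needs a GPU with two  */
          /* copy engines.                                                           */
          K2<<<64, 512, 0, s1>>>(d_in, d_out);
          cudaMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice, s2);
          cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost, s3);

          cudaDeviceSynchronize();   /* wait for all streams before the next level */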
    79. Schedule Kernel and Memcpy for Maximum Overlap – algorithm
       Set the command queue to empty. Repeat until every node has been deleted:
       1.1 Set level = 1 and stream_num = 1.
       1.2 Find a kernel node with 0 incoming degree, delete the node and its links, and create its command with stream_num.
           1.2.1 If found in 1.2, stream_num += 1.
       1.3 Find a GPU-to-CPU transfer node with 0 incoming degree, delete the node and its links, and create the transfer command with stream_num.
           1.3.1 If found in 1.3, stream_num += 1.
       1.4 Find a CPU-to-GPU transfer node with 0 incoming degree, delete the node and its links, and create the transfer command with stream_num.
           1.4.1 If found in 1.4, stream_num += 1.
       1.5 If nothing was found in 1.2-1.4, find a kernel node with 0 incoming degree and create the transfer commands for its CPU-to-GPU nodes.
       1.6 Insert a synchronization function.
       1.7 Keep the maximum stream_num seen so far.
       1.8 level += 1.
    80. Automatic cache with shared memory
       When a "linear access" pattern is detected in a kernel, the automatic cache is applied: each thread block stages its slice of global memory into shared memory.

          #pragma gfn parallel for
          for(i = 1; i < (MAX-1); i++){
              B[i] = A[i-1] + A[i] + A[i+1];
          }
    81. Automatic cache with shared memory
       Generated kernel: identical to the one shown on slide 65 – each block stages 514 elements of A into the shared array sa, synchronizes, and then computes the stencil from shared memory.
    82. DEMO
    83-84. Evaluation
       • Compiler directives
       • Compiler performance
    85. Compiler Directives
       Evaluated with 5 undergraduate students who had studied the concepts of CUDA, after only 1.5 hours of demonstration.
    86. Compiler Directives
       Benchmarks: calculation of Pi using numerical integration, calculation of Pi using the Monte Carlo method, the trapezoidal rule, vector normalization, and calculating the sine of each vector element.
    87. Compiler Performance
       Measured on the same benchmark set: calculation of Pi using numerical integration, calculation of Pi using the Monte Carlo method, the trapezoidal rule, vector normalization, and calculating the sine of each vector element.
    88. Conclusion
    89. Griffon Instructions
       Total number of instructions (directives + clauses): 9
       Remaining problem: the performance of parallel programs with a high degree of communication.
       Planned improvements: directives for describing the algorithm used in a program (divide and conquer, partial summation, etc.) and new optimization techniques such as shared-memory caching and choosing an appropriate thread count.
    90. Performance factors and speedup
       Computation density has the largest effect on performance.
    91. Building an S2S Compiler
       Source-to-source compilers are not popular; an alternative would be a compiler that transforms Griffon code directly to GPU object code (PTX).
       Although the programs generated by such a PTX compiler could be very efficient, they cannot gain any benefit from manual optimization of the generated source.
    92. Future Work
       Optimization techniques: data structures, loop transformation
       Directives: more OpenMP support, CPU/GPU parallel regions, OpenCL support
       Compiler: support for C++ and other languages, support for popular IDEs
    93. References
       • Brook, http://graphics.stanford.edu/projects/brookgpu
       • Cameron Hughes and Tracey Hughes, Professional Multicore Programming, Wiley Publishing
       • CUDA Zone, http://www.nvidia.com/object/cuda_home.html
       • Dick Grune, Henri E. Bal, Ceriel J.H. Jacobs, and Koen G. Langendoen, Modern Compiler Design, John Wiley & Sons
       • General-Purpose Computation on Graphics Hardware, http://gpgpu.org
       • Ilias Leontiadis and George Tzoumas, OpenMP C Parser
       • Joe Stam, Maximizing GPU Efficiency in Extreme Throughput Applications, GPU Technology Conference
       • Mark Harris, Optimizing Parallel Reduction in CUDA
       • OpenCL, http://www.khronos.org/opencl
       • Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization, PPoPP '09
       • The OpenMP API specification for parallel programming, http://openmp.org/wp
       • Thomas Niemann, A Guide to Lex & Yacc
       • Tianyi David Han and Tarek S. Abdelrahman, hiCUDA: A High-Level Directive-Based Language for GPU Programming, GPGPU '09
       • Wolfe, M. (1996). High Performance Compilers for Parallel Computing. Addison-Wesley
