1. Directive-based approach to heterogeneous computing
   Ruyman Reyes Castro
   High Performance Computing Group
   University of La Laguna
   December 19, 2012
2. TOP500 Performance Development List

3. Applications Used in HPC Centers
   Usage of HECToR by Area of Expertise

4. Real HPC Users
   Most Used Applications in HECToR

   Application          % of total jobs   Language   Prog. Model
   VASP                 17%               Fortran    MPI+OpenMP
   CP2K                 7%                Fortran    MPI+OpenMP
   Unified Model (UM)   7%                Fortran    MPI
   GROMACS              4%                C++        MPI+OpenMP

   Large code-bases
   Complex algorithms implemented
   Mixture of different Fortran flavours

5. Knowledge of Programming
   Survey conducted in the Swiss National Supercomputing Centre (2011)

6. Are application developers using the proper tools?

7. Complexity Arises (I)
8. Directives: Enhancing Legacy Code (I)
   OpenMP Example

   ...
   #pragma omp parallel for default(shared) private(i, j) firstprivate(rmass, dt)
   for (i = 0; i < np; i++) {
     for (j = 0; j < nd; j++) {
       pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
       vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
       a[i][j] = f[i][j] * rmass;
     }
   }
   ...
9. Complexity Arises (II)

10. Re-compiling the code is no longer enough to continue improving the performance
11. Porting Applications To New Architectures
    Programming CUDA (Host Code)

    float a_host[n], b_host[n];
    float *a, *b;   /* device pointers */
    float c;        /* scalar kernel argument */
    // Allocate
    cudaMalloc((void **)&a, n * sizeof(float));
    cudaMalloc((void **)&b, n * sizeof(float));
    // Transfer
    cudaMemcpy(a, a_host, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, b_host, n * sizeof(float), cudaMemcpyHostToDevice);
    // Define grid shape
    int blocks = 100;
    int threads = 128;
    // Execute
    kernel<<<blocks, threads>>>(a, b, c);
    // Copy-back
    cudaMemcpy(a_host, a, n * sizeof(float), cudaMemcpyDeviceToHost);
    // Clean
    cudaFree(a);
    cudaFree(b);
12. Porting Applications To New Architectures
    Programming CUDA (Kernel Source)

    // Kernel code
    __global__ void kernel(float *a, float *b, float c)
    {
      // Get the index of this thread
      unsigned int index = (blockIdx.x * blockDim.x) + threadIdx.x;
      // Do the computation
      b[index] = a[index] * c;
      // Wait for all threads in the block to finish
      __syncthreads();
    }
13. Programmers need faster ways to migrate existing code

14. Why not use directive-based approaches for these new heterogeneous architectures?
15. Overview of Our Work

    We can't solve problems by using the same kind of thinking we used when we created them.
                                                                      Albert Einstein

    The field is undergoing rapid changes: we have to adapt to them
    1. Hybrid MPI+OpenMP (2008)
       → Usage of directives in cluster environments
    2. OpenMP extensions (2009)
       → Extensions of OpenMP/La Laguna C (llc) for heterogeneous architectures
    3. Directives for accelerators (2011)
       → Specific accelerator-oriented directives
       → OpenACC (December 2011)
16. Outline
    Hybrid MPI+OpenMP
      llc and llCoMP
      Hybrid llCoMP
      Computational Results
      Technical Drawbacks
    OpenMP-to-GPU
    Directives for Accelerators
    Conclusions
    Future Work and Final Remarks
17. La Laguna C: llc
    What it is
      Directive-based approach to distributed memory environments
      OpenMP compatible
      Additional set of extensions to address particular features
      Implemented FORALL loops, Pipelines, Farms . . .
    Reference
      [48] Dorta, A. J. Extensión del modelo de OpenMP a memoria distribuida. PhD Thesis, Universidad de La Laguna, December 2008.
18. Chronological Perspective (Late 2008)
    Cores per Socket - System Share | Accelerator - System Share
19. A Hybrid OpenMP+MPI Implementation
    Same llc code, extended llCoMP implementation
      Directives are replaced by a set of parallel patterns
      Improved performance on multicore systems
      → Better usage of inter-core memories (i.e. cache)
      → Lower memory requirements than pure MPI with replicated memory
    Translation (figure)
20. llc Code Example
    llc Implementation of the Mandelbrot Set Computation

    ...
    #pragma omp parallel for default(shared) reduction(+:numoutside) private(i, j, ztemp, z) shared(nt, c)
    #pragma llc reduction_type (int)
    for (i = 0; i < npoints; i++) {
      z.creal = c[i].creal; z.cimag = c[i].cimag;
      for (j = 0; j < MAXITER; j++) {
        ztemp = (z.creal*z.creal) - (z.cimag*z.cimag) + c[i].creal;
        z.cimag = z.creal * z.cimag * 2 + c[i].cimag;
        z.creal = ztemp;
        if (z.creal * z.creal + z.cimag * z.cimag > THRESOLD) {
          numoutside++;
          break;
        }
      }
    ...
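    For illustration, the sketch below shows one way the loop above can be lowered to the hybrid
    MPI+OpenMP pattern described on the previous slide: block-distribute the iteration space among
    MPI processes, keep the OpenMP reduction inside each process, and combine the partial results
    with an MPI reduction. This is a minimal sketch, not the actual llCoMP output; the wrapper
    function count_outside, the dcomplex type and the MAXITER/THRESOLD values are assumptions
    added to make it self-contained.

    #include <mpi.h>

    #ifndef MAXITER
    #define MAXITER 10000     /* assumed value */
    #endif
    #ifndef THRESOLD
    #define THRESOLD 4.0      /* assumed value; name kept from the original code */
    #endif

    typedef struct { double creal, cimag; } dcomplex;

    int count_outside(const dcomplex *c, int npoints, MPI_Comm comm)
    {
      int rank, nprocs, i, j;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);

      /* Block distribution of the iteration space among MPI processes */
      int chunk = (npoints + nprocs - 1) / nprocs;
      int begin = rank * chunk;
      int end = (begin + chunk < npoints) ? (begin + chunk) : npoints;

      int local = 0, numoutside = 0;
      dcomplex z;
      double ztemp;

      /* Local piece of the original loop, still parallelized with OpenMP */
      #pragma omp parallel for default(shared) reduction(+:local) private(i, j, z, ztemp)
      for (i = begin; i < end; i++) {
        z.creal = c[i].creal; z.cimag = c[i].cimag;
        for (j = 0; j < MAXITER; j++) {
          ztemp = (z.creal * z.creal) - (z.cimag * z.cimag) + c[i].creal;
          z.cimag = z.creal * z.cimag * 2 + c[i].cimag;
          z.creal = ztemp;
          if (z.creal * z.creal + z.cimag * z.cimag > THRESOLD) {
            local++;
            break;
          }
        }
      }

      /* Combine the per-process partial reductions */
      MPI_Allreduce(&local, &numoutside, 1, MPI_INT, MPI_SUM, comm);
      return numoutside;
    }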
21. Hybrid MPI+OpenMP performance

22. Technical Drawbacks
    llCoMP
      The original design of the llCoMP source-to-source translator was not flexible enough
      Traditional two-pass compiler
      Excessive effort to implement new features
      More advanced features were needed to implement GPU code generation

23. Back to the Drawing Board
24. Outline
    Hybrid MPI+OpenMP
    OpenMP-to-GPU
      Related Work
      Yet Another Compiler Framework (YaCF)
      Computational Results
      Technical Drawbacks
    Directives for Accelerators
    Conclusions
    Future Work and Final Remarks

25. Chronological Perspective (Late 2009)
    Cores per Socket - System Share | Accelerator - System Share
26. Related Work
    Other OpenMP-to-GPU translators: OpenMPC
      [82] Lee, S., and Eigenmann, R. OpenMPC: Extended OpenMP programming and tuning for GPUs. In SC'10: Proceedings of the 2010 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, pp. 1-11.
    Other compiler frameworks: Cetus, LLVM
      [84] Lee, S., Johnson, T. A., and Eigenmann, R. Cetus - an extensible compiler infrastructure for source-to-source transformation. In Languages and Compilers for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, volume 2958 of LNCS (2003), pp. 539-553.
      [81] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, CGO'04. IEEE Computer Society, pp. 75-86.
27. YaCF: Yet Another Compiler Framework
    The application programmer writes llc code
      Focus on data and algorithm
      Architecture independent
      Only needs to specify where the parallelism is
    The system engineer writes template code
      Focus on non-functional code
      Can reuse code from different patterns (i.e. inheritance)

28. YaCF Software Architecture

29. Main Software Design Patterns
    Implementing search and replacement in the IR
      Filter: looks for a specific pattern in the IR
      → e.g. looks for a pragma omp parallel construct
      Mutator: looks for a node and transforms the IR
      → e.g. applies loop transformations (nesting, flattening, . . . )
      → e.g. replaces a pragma omp for by a CUDA kernel call
      Filters and mutators can be composed to solve more complex problems

30. Dynamic Language and Tools
    Key Idea: Features Should Require Only a Few Lines of Code
31. Template Patterns
    Ease back-end implementation

    <%def name="initialization(var_list, prefix = '', suffix = '')">
    %for var in var_list:
       cudaMalloc((void **) (&${prefix}${var.name}${suffix}),
                  ${var.numelems} * sizeof(${var.type}));
       cudaMemcpy(${prefix}${var.name}${suffix}, ${var.name},
                  ${var.numelems} * sizeof(${var.type}),
                  cudaMemcpyHostToDevice);
    %endfor
    </%def>
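    For instance, instantiated for a single variable a with n float elements and a hypothetical
    dev_ prefix, the template above would expand to roughly the following host code (illustrative
    only; the real back-end chooses the prefix/suffix and the element-count expression):

    /* Illustrative expansion of the template for variable "a" (assumed names) */
    cudaMalloc((void **) (&dev_a), n * sizeof(float));
    cudaMemcpy(dev_a, a, n * sizeof(float), cudaMemcpyHostToDevice);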
32. CUDA Back-end
    Generates a CUDA kernel and memory transfers from the information obtained during the analysis
    Supported syntax
      parallel, for and their condensed form implemented
      New directives to support manual optimizations (e.g. interchange)
      Syntax taken from an OpenMP proposal by BSC, UJI and others (#pragma omp target)
      copy_in, copy_out enable users to provide memory transfer information
      Generated code is human-readable
33. Example
    Update Loop from the Molecular Dynamics Code

    ...
    #pragma omp target device(cuda) copy(pos, vel, f) copy_out(a)
    #pragma omp parallel for default(shared) private(i, j) firstprivate(rmass, dt)
    for (i = 0; i < np; i++) {
      for (j = 0; j < nd; j++) {
        pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
        vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
        a[i][j] = f[i][j] * rmass;
      }
    }

34. Translation process
35. The Jacobi Iterative Method

    error = 0.0;

    {
      for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
          uold[i][j] = u[i][j];

      for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
          resid = ...
          error += resid * resid;
        }
      }
    }
    k++;
    error = sqrt(error) / (double) (n * m);

36. Jacobi OpenMP Source

    error = 0.0;

    #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
    {
      #pragma omp for
      for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
          uold[i][j] = u[i][j];
      #pragma omp for reduction(+:error)
      for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
          resid = ...
          error += resid * resid;
        }
      }
    }
    k++;
    error = sqrt(error) / (double) (n * m);

37. Jacobi llCoMP v1

    error = 0.0;
    #pragma omp target device(cuda)
    #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
    {
      #pragma omp for
      for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
          uold[i][j] = u[i][j];
      #pragma omp for reduction(+:error)
      for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
          resid = ...
          error += resid * resid;
        }
      }
    }
    k++;
    error = sqrt(error) / (double) (n * m);

38. Jacobi llCoMP v2

    error = 0.0;
    #pragma omp target device(cuda) copy_in(u, f) copy_out(f)
    #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
    {
      #pragma omp for
      for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
          uold[i][j] = u[i][j];
      #pragma omp for reduction(+:error)
      for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
          resid = ...
          error += resid * resid;
        }
      }
    }
    k++;
    error = sqrt(error) / (double) (n * m);

39. Jacobi Iterative Method
40. Technical Drawbacks
    Limited to Compile-time Optimizations
      Some features require runtime information
      → Kernel grid configuration
      Orphaned directives were not possible
      → Would require an inter-procedural analysis module
      Some templates were too complex
      → And would need to be replicated to support OpenCL

41. Back to the Drawing Board

42. Outline
    Hybrid MPI+OpenMP
    OpenMP-to-GPU
    Directives for Accelerators
      Related Work
      OpenACC
      Accelerator ULL (accULL)
      Results
    Conclusions
    Future Work and Final Remarks

43. Chronological Perspective (2011)
    Cores per Socket - System Share | Accelerator - System Share
44. Related Work (I)
    hiCUDA
      Translates each directive into a CUDA call
      It is able to use the GPU Shared Memory
      Only works with NVIDIA devices
      The programmer still needs to know hardware details
    Code Example:

    ...
    #pragma hicuda global alloc c [*] [*] copyin

    #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
    #pragma hicuda loop_partition over_tblock over_thread
    for (i = 0; i < N; i++) {
      #pragma hicuda loop_partition over_tblock over_thread
      for (j = 0; j < N; j++) {
        double sum = 0.0;
        ...

45. Related Work (II)
    PGI Accelerator Model
      Higher level (directive-based) approach
      Fortran and C are supported
    Code Example:

    #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
    {
      #pragma acc region
      for (j = 0; j < n; j++)
        for (i = 0; i < l; i++) {
          double sum = 0.0;
          for (k = 0; k < m; k++)
            sum += b[i + k * l] * c[k + j * m];
          a[i + j * l] = sum;
        }
    }
46. Our Ongoing Work at that Time: llcl
    Extending llc with support for heterogeneous platforms
    Compiler + Runtime implementation
    → The compiler generates runtime code
    → The runtime handles memory coherence and drives execution
    Compiler optimizations directed by an XML file
    More generic/higher level approach - not tied to GPUs

47. llcl: Directives

    double *a, *b, *c;
    ...
    #pragma llc context name("mxm") copy_in(a[n * l], b[l * m],
                        c[m * n], l, m, n) copy_out(a[n * l])
    {
      int i, j, k;
      #pragma llc for shared(a, b, c, l, m, n) private(i, j, k)
      for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
          a[i + j * l] = 0.0;
          for (k = 0; k < m; k++)
            a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
    }
    ...

48. llcl: XML Platform Description File

    <xml>
    <platform name="default">
      <region name="compute">
        <element name="compute_1" class="loop">
          <mutator name="Loop.LoopInterchange"/>
          <target device="cuda"/>
          <target device="opencl"/>
        </element>
      </region>
    </platform>
    </xml>
49. OpenACC Announcement

50. OpenACC Announcement
51. OpenACC: Directives

    double *a, *b, *c;
    ...
    #pragma acc data copy_in(a[n * l], b[l * m], c[m * n], l, m, n) copy_out(a[n * l])
    {
      int i, j, k;
      #pragma acc kernels loop private(i, j, k)
      for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
          a[i + j * l] = 0.0;
          for (k = 0; k < m; k++)
            a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
    }
    ...

52. Related Work
    OpenACC Implementations (After Announcement)
      PGI - released in February 2012
      CAPS - released in March 2012
      Cray - to be released
      → Access to a beta release available
    We had a first experimental implementation in January 2012

53. accULL: Our OpenACC Implementation
    accULL = YaCF + Frangollo
    It is a two-layer implementation: Compiler + Runtime Library

54. Frangollo: the Runtime
    Implementation
      Lightweight
      Standard C++ and STL code
      CUDA component written using the CUDA Driver API
      OpenCL component written using the C OpenCL interface
      Experimental features can be enabled/disabled at compile time
    Handles
      1. Device discovery, initialization, . . .
      2. Memory coherence (registered variables)
      3. Kernel execution management (including grid shape)

55. Frangollo Layered Structure
56. Memory Management

    // Creates a context to handle memory coherence
    ctxt_id = FRG__createContext("name", ...)
    ...
    // Register a variable within the context
    FRG__registerVar(ctxt_id, &ptr, offset, size, constraints, ...);
    ...
    // Execute the kernel
    FRG__kernelLaunch(ctxt_id, "kernel", param_list, ...)
    ...
    // Finish the context and reconcile variables
    FRG__destroyContext(ctxt_id);
57. Kernel Execution
    Loading the kernel
      A context may have from zero to N named kernels associated
      The runtime loads different versions of the kernel for each device
      The kernel version is selected depending on the platform where it is executed
    Grid shape
      Grid shape is estimated using the compute intensity (CI): CI = N_mem / (Cost * N_flops)
      → e.g. Fermi: 512 GFlop/s DP, 144 GB/s memory bandwidth, Cost 3.5
      Low CI → favors memory accesses
      High CI → favors computation
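    A minimal sketch of this heuristic is shown below. The formula follows the slide
    (CI = N_mem / (Cost * N_flops), with Cost the device flop/byte ratio); the threshold and the
    block sizes derived from the CI value are placeholders, not the actual Frangollo policy.

    #include <stddef.h>

    /* CI = N_mem / (Cost * N_flops); Cost is the device flop/byte ratio,
       e.g. roughly 3.5 for a Fermi C2050 (512 GFlop/s DP vs 144 GB/s). */
    static double compute_intensity(size_t n_mem, size_t n_flops, double cost)
    {
        return (double) n_mem / (cost * (double) n_flops);
    }

    /* Assumed mapping for the sketch: a low CI (memory bound) gets more threads
       per block to hide memory latency; values and threshold are placeholders. */
    static int choose_threads_per_block(double ci)
    {
        return (ci < 1.0) ? 256 : 128;
    }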
58. Implementing OpenACC
    Putting it all together
    1. The compiler driver generates Frangollo interface calls from OpenACC directives
       → Converts data region directives into context creation
       → Generates host and device synchronization
    2. Extracts the kernel code
    3. Frangollo implements the OpenACC API calls
       → acc_init, acc_malloc/acc_free
    4. Implements some optimizations
       → Compiler: loop invariant, skewing, strip-mining, interchange
       → Kernel extraction: divergence reduction, data-dependency analysis (basic)
       → Runtime: grid shape estimation, optimized reduction kernels
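    To illustrate step 1, the host code the driver emits for a simple data region containing one
    kernels loop might look as follows, written in the same abbreviated style as the Memory
    Management slide (argument lists elided with "..." exactly as there; the kernel name
    mxm_kernel_0 is a placeholder):

    // data region entry -> context creation
    ctxt_id = FRG__createContext("mxm", ...)
    // each data clause -> one variable registration (constraints encode copy/copyin/copyout)
    FRG__registerVar(ctxt_id, &a, offset, size, constraints, ...);
    FRG__registerVar(ctxt_id, &b, offset, size, constraints, ...);
    // each kernels loop -> a named kernel launch; the runtime selects device,
    // kernel version (CUDA/OpenCL) and grid shape
    FRG__kernelLaunch(ctxt_id, "mxm_kernel_0", param_list, ...)
    // data region exit -> coherence resolved, device memory released
    FRG__destroyContext(ctxt_id);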
59. Building an OpenACC Code with accULL

60. Compliance with the OpenACC Standard

    Table: Compliance with the OpenACC 1.0 standard (directives)

    Construct                         Supported by
    kernels                           PGI, HMPP, accULL
    loop                              PGI, HMPP, accULL
    kernels loop                      PGI, HMPP, accULL
    parallel                          PGI, HMPP
    update                            Implemented
    copy, copyin, copyout, . . .      PGI, HMPP, accULL
    pcopy, pcopyin, pcopyout, . . .   PGI, HMPP, accULL
    async                             PGI
    deviceptr clause                  PGI
    host                              accULL
    collapse                          accULL

    Table: Compliance with the OpenACC 1.0 standard (API)

    API Call           Supported by
    acc_init           PGI, HMPP, accULL
    acc_set_device     PGI, HMPP, accULL (no effect)
    acc_get_device     PGI, HMPP, accULL
61. Experimental Platforms
    Garoe: a desktop computer
      Intel Core i7 930 processor (2.80 GHz), 4 GB RAM
      2 GPU devices attached:
        Tesla C1060
        Tesla C2050 (Fermi)
    Peco: a cluster node
      2 quad-core Intel Xeon E5410 (2.25 GHz) processors, 24 GB RAM
      A Tesla C2050 (Fermi) attached
    Drago: a shared memory system
      4 Intel Xeon E7 4850 CPUs, 6 GB RAM
      Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

62. Software
    Compiler versions (Pre-OpenACC)
      PGI Compiler Toolkit 12.2 with the PGI Accelerator Programming Model 1.3
      hiCUDA 0.9
    Compiler versions (OpenACC)
      PGI Compiler Toolkit 12.6
      CAPS HMPP 3.2.3
63. Matrix Multiplication (M × M) (I)

    #pragma acc data name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
    {
    #pragma acc kernels loop private(i, j) collapse(2)
    for (i = 0; i < L; i++)
      for (j = 0; j < N; j++)
        a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
      for (jj = 0; jj < N; jj += tile_size)
        for (kk = 0; kk < M; kk += tile_size) {
          /* Iterates inside a block */
          #pragma acc kernels loop collapse(2) private(i, j, k)
          for (j = jj; j < min(N, jj+tile_size); j++)
            for (i = ii; i < min(L, ii+tile_size); i++)
              for (k = kk; k < min(M, kk+tile_size); k++)
                a[i*L+j] += (b[i*L+k] * c[k*M+j]);
        }
    }

64. Floating Point Performance for M×M in Peco

65. M×M (II)

    #pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N])
    {
    #pragma acc kernels loop private(i)
    for (i = 0; i < L; i++)
      #pragma acc loop private(j)
      for (j = 0; j < N; j++)
        a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
      for (jj = 0; jj < N; jj += tile_size)
        for (kk = 0; kk < M; kk += tile_size) {
          /* Iterates inside a block */
          #pragma acc kernels loop private(i)
          for (j = jj; j < min(N, jj+tile_size); j++)
            #pragma acc loop private(j)
            for (i = ii; i < min(L, ii+tile_size); i++)
              for (k = kk; k < min(M, kk+tile_size); k++)
                a[i * L + j] += (b[i * L + k] * c[k * M + j]);
        }
    }

66. M×M (III)

    #pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N] ...)
    {
    #pragma acc kernels loop private(i) gang(32)
    for (i = 0; i < L; i++)
      #pragma acc loop private(j) worker(32)
      for (j = 0; j < N; j++)
        a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
      for (jj = 0; jj < N; jj += tile_size)
        for (kk = 0; kk < M; kk += tile_size) {
          /* Iterates inside a block */
          #pragma acc kernels loop private(i) gang(32)
          for (j = jj; j < min(N, jj+tile_size); j++)
            #pragma acc loop private(j) worker(32)
            for (i = ii; i < min(L, ii+tile_size); i++)
              for (k = kk; k < min(M, kk+tile_size); k++)
                a[i*L+j] += (b[i*L+k] * c[k*M+j]);
        }
    }
67. About Grid Shape and Loop Scheduling Clauses
    Optimal gang/worker (i.e., grid shape) values vary
      Among OpenACC implementations
      Among platforms (Fermi vs Kepler? NVIDIA vs ATI?)
      What happens if we implement a non-GPU accelerator?
    Our implementation ignores gang/worker and leaves the decision to the runtime
      → The user can influence the decision with an environment variable
    It is possible to enable the gang/worker clauses in our implementation
      → Gang/worker feeds a strip-mining transformation forcing blocks/threads (WIP)

68. Effect of Varying Gang/Worker

69. OpenMP vs Frangollo+OpenCL in Drago
70. Needleman-Wunsch (NW)
    NW is a nonlinear global optimization method for DNA sequence alignments
    The potential pairs of sequences are organized in a 2D matrix
    The method uses Dynamic Programming to find the optimum alignment
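    For reference, a minimal sequential sketch of the NW dynamic-programming recurrence is given
    below (the standard formulation, not the exact benchmark code used here): each cell depends on
    its upper, left and upper-left neighbours, which is why parallel versions sweep the matrix by
    anti-diagonals. The helper names and the row-major layout are assumptions, and the first row
    and column are assumed to be pre-initialized with gap penalties.

    /* score and similarity are (n+1) x (n+1) matrices stored row-major;
       score[i][j] holds the best alignment score of the first i and j symbols. */
    static int max3(int a, int b, int c)
    {
      int m = (a > b) ? a : b;
      return (m > c) ? m : c;
    }

    void nw_fill(int *score, const int *similarity, int n, int gap_penalty)
    {
      for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
          score[i * (n + 1) + j] = max3(
            score[(i - 1) * (n + 1) + (j - 1)] + similarity[i * (n + 1) + j],
            score[(i - 1) * (n + 1) + j] - gap_penalty,
            score[i * (n + 1) + (j - 1)] - gap_penalty);
    }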
71. Performance Comparison of NW in Garoe

72. Overall Comparison

73. Outline
    Hybrid MPI+OpenMP
    OpenMP-to-GPU
    Directives for Accelerators
    Conclusions
    Future Work and Final Remarks
74. Directive-based Programming
    Support for accelerators may be added to the OpenMP standard in the future
    → In the meantime, OpenACC can be used to port codes to GPUs
    → It is possible to combine OpenACC with OpenMP
    Generated code does not always match native-code performance
    → But it reduces the development effort while providing acceptable performance
    accULL is an interesting research-oriented implementation of OpenACC
    → First non-commercial OpenACC implementation
    → It is a flexible framework to explore optimizations, new platforms, . . .
75. Outline
    Hybrid MPI+OpenMP
    OpenMP-to-GPU
    Directives for Accelerators
    Conclusions
    Future Work and Final Remarks

76. Back to the Drawing Board?
77. accULL Still Has Some Opportunities
    Study support for multiple devices (either transparently or in OpenACC)
    Design an MPI component for the runtime
    Integration with other projects
    Improve the performance of the generated code (e.g. using polyhedral models)
    Enhance the support for Extrae/Paraver (experimental tracing already built in)

78. Re-use our Know-how
    Integrate OpenACC and OmpSs?
      The current OmpSs implementation does not automatically generate kernel code
      Integrating OpenACC syntax within tasks would enable automatic code generation
      Improve portability on accelerator platforms
      Reduce development effort
79. Contributions
    Reyes, R. and de Sande, F. Automatic code generation for GPUs in llc. The Journal of Supercomputing 58, 3 (Mar. 2011), pp. 349-356.
    Reyes, R. and de Sande, F. Optimization strategies in different CUDA architectures using llCoMP. Microprocessors and Microsystems - Embedded Hardware Design 36, 2 (Mar. 2012), pp. 78-87.
    Reyes, R., Fumero, J. J., López, I. and de Sande, F. accULL: an OpenACC implementation with CUDA and OpenCL support. In Euro-Par 2012 Parallel Processing - 18th International Conference, vol. 7484 of LNCS, pp. 871-882.
    Reyes, R., Fumero, J. J., López, I. and de Sande, F. A Preliminary Evaluation of OpenACC Implementations. The Journal of Supercomputing (in press).
80. Other Contributions
    accULL has been released as an Open Source project
    → http://cap.pcg.ull.es/accull
    accULL is currently being evaluated by Vector Fabrics
    Provided feedback to CAPS, which appears to have been incorporated into their current version
    Contacted by members of the OpenACC committee
    Two HPC-Europa2 visits by our team's master students
81. Acknowledgements
    Spanish MEC
      Plan Nacional de I+D+i, contracts TIN2008-06570-C04-03 and TIN2011-24598
    Canary Islands Government ACIISI
      Contract SolSubC200801000285
    TEXT Project (FP7-261580)
    HPC-EUROPA2 (project number 228398)
    Universitat Jaume I de Castellón
    Universidad de La Laguna
    All members of GCAP
82. Thank you for your attention!

83. Directive-based approach to heterogeneous computing
    Ruyman Reyes Castro
    High Performance Computing Group
    University of La Laguna
    December 19, 2012
