accULL (HAC Leganés)


  1. accULL: A User-directed Approach to Heterogeneous Programming
     Ruymán Reyes, Iván López-Rodríguez, Juan J. Fumero, Francisco de Sande
     Dept. E.I.O. y Computación, Univ. de La Laguna, 38271 La Laguna, Spain
     International Workshop on Heterogeneous Architectures and Computing (HAC), Leganés, July 13, 2012

  2. Outline
     1 Heterogeneous Architectures
     2 accULL: An Early OpenACC Implementation
     3 Results
     4 Conclusions and Future Work

  3. Outline
     1 Heterogeneous Architectures
     2 accULL: An Early OpenACC Implementation
     3 Results
     4 Conclusions and Future Work

  4. Introduction
     The arrival of GPUs: impressive results

  5. GPUs
     Successfully used for general-purpose computing (GPGPU)

  6. Heterogeneous Architectures
     But... it is not easy!

  7. Heterogeneous Architectures: a GPU is not a CPU
     GPUs are inherently SIMD processors
     CPUs and GPUs tackle tasks differently
     CPUs excel at serial processing
     GPUs are better at handling applications with heavy floating-point computation, and do so at lower power consumption

  8. Parallel Languages: MPI (distributed memory) and OpenMP (shared memory)
     They are not suitable for programming GPUs
     New programming models are required...

  9. GPGPU Programming
     The software stack today (figure)

 10. CUDA from NVIDIA
     Pros: performance, easier than OpenCL
     Con: only for NVIDIA hardware

     CUDA code example:

       __global__ void mmkernel(float *a, float *b, float *c,
                                int n, int m, int p)
       {
           int i = blockIdx.x * 32 + threadIdx.x;
           int j = blockIdx.y;
           float sum = 0.0f;
           for (int k = 0; k < p; ++k)
               sum += b[i + n*k] * c[k + p*j];
           a[i + n*j] = sum;
       }

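     For context, a minimal host-side launch for the kernel above could look as follows. This sketch is not part of the original slide; the device buffer names, the assumption that n is a multiple of 32, and the omission of error checking are ours.

       /* Hedged sketch (not from the slides): host-side setup and launch for
          mmkernel above. Assumes n is a multiple of 32 and omits error checks. */
       float *d_a, *d_b, *d_c;
       cudaMalloc((void **)&d_a, (size_t)n * m * sizeof(float));
       cudaMalloc((void **)&d_b, (size_t)n * p * sizeof(float));
       cudaMalloc((void **)&d_c, (size_t)p * m * sizeof(float));
       cudaMemcpy(d_b, b, (size_t)n * p * sizeof(float), cudaMemcpyHostToDevice);
       cudaMemcpy(d_c, c, (size_t)p * m * sizeof(float), cudaMemcpyHostToDevice);

       dim3 block(32, 1);      /* matches the kernel's blockIdx.x * 32 + threadIdx.x indexing */
       dim3 grid(n / 32, m);   /* one x-block per 32 values of i, one y-block per column j    */
       mmkernel<<<grid, block>>>(d_a, d_b, d_c, n, m, p);

       cudaMemcpy(a, d_a, (size_t)n * m * sizeof(float), cudaMemcpyDeviceToHost);
       cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
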
 11. GPGPU Programming: OpenCL (Open Computing Language)
     A framework developed by the Khronos Group
     A standard
     OpenCL programs execute across heterogeneous platforms: CPUs + GPUs + other processors
     Pros: can be used with any device; it is a standard
     Cons: more complex than CUDA; immature

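     To make the CUDA/OpenCL contrast concrete, here is a hedged sketch of the same matrix-multiply kernel written in OpenCL C; it is not taken from the slides. The kernel body is nearly identical to the CUDA one, while the extra complexity the slide refers to lives in the host-side setup (platform, context, queue, program and buffer management), which is omitted here.

       /* Hedged sketch (not from the slides): the mmkernel of the previous
          slide rewritten in OpenCL C. Host-side setup is omitted; that is
          where most of the additional OpenCL complexity appears. */
       __kernel void mmkernel(__global float *a, __global float *b,
                              __global float *c, int n, int m, int p)
       {
           int i = get_global_id(0);   /* replaces blockIdx.x * 32 + threadIdx.x */
           int j = get_global_id(1);   /* replaces blockIdx.y                    */
           float sum = 0.0f;
           for (int k = 0; k < p; ++k)
               sum += b[i + n * k] * c[k + p * j];
           a[i + n * j] = sum;
       }
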
 12. GPGPU Programming: Common Problems
     1 The programmer needs to know low-level details of the architecture

 13. GPGPU Programming: Common Problems
     1 The programmer needs to know low-level details of the architecture
     2 Source code needs to be rewritten: one version for the CPU, a different version for the GPU
     3 Good performance requires a great effort in parameter tuning
     4 CUDA and OpenCL are new and complex for non-experts

 14. GPGPU Programming
     Our claim: new models and tools are needed if we want to spread the use of GPUs in HPC
     Is there anything new on the horizon?
       hiCUDA
       PGI accelerator model
       CAPS HMPP
       OpenACC

 15. GPGPU Programming: hiCUDA
     Translates each directive into a CUDA call
     It is able to use the GPU shared memory
     Only works with NVIDIA devices
     The programmer still needs to know hardware details

     hiCUDA code example:

       ...
       #pragma hicuda global alloc c[*][*] copyin
       #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
       #pragma hicuda loop_partition over_tblock over_thread
       for (i = 0; i < N; i++) {
           #pragma hicuda loop_partition over_tblock over_thread
           for (j = 0; j < N; j++) {
               double sum = 0.0;
               ...

 16. GPGPU Programming: PGI accelerator model
     A higher-level (directive-based) approach
     Fortran and C are supported
     Precursor to OpenACC

     PGI Accelerator Model code example:

       #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
       {
           #pragma acc region
           {
               #pragma acc loop independent
               for (j = 0; j < n; j++)
               {
                   #pragma acc loop independent
                   for (i = 0; i < l; i++) {
                       double sum = 0.0;
                       for (k = 0; k < m; k++) {
                           sum += b[i + k*l] * c[k + j*m];
                       }
                       a[i + j*l] = sum;
                   }
               }
           }
       }

 17. GPGPU Programming: OpenACC
     Introduced last November at SuperComputing 2011
     A directive-based language
     Aims to become a standard
     Supported by Cray, NVIDIA, PGI and CAPS
     A single source code for CPU/GPU
     Platform independent
     Easier for beginners

 18. GPGPU Programming
     OpenACC code example (shown as an image on the original slide; a hedged stand-in follows below)

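     The code on this slide appears only as an image and did not survive the transcript. As a stand-in, here is a minimal, hedged OpenACC sketch (a simple SAXPY loop) illustrating the directive style; it is our example, not the one shown in the talk.

       /* Hedged stand-in (not the slide's example): a minimal OpenACC SAXPY.
          The directives move x and y to the device, run the loop there, and
          copy y back when the region ends. */
       void saxpy(int n, float alpha, const float *x, float *y)
       {
           #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
           {
               #pragma acc loop independent
               for (int i = 0; i < n; i++)
                   y[i] = alpha * x[i] + y[i];
           }
       }
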
 19. Outline
     1 Heterogeneous Architectures
     2 accULL: An Early OpenACC Implementation
     3 Results
     4 Conclusions and Future Work

 20. accULL: Our OpenACC implementation
     accULL is a framework developed to support OpenACC programs

 21. accULL: Our OpenACC implementation
     accULL = YaCF + Frangollo
     It is a two-layer implementation: compiler + runtime library

 22. YaCF: the compiler
     YaCF (Yet Another Compiler Framework) is the compiler framework we have developed
     Some features:
       It is a source-to-source (StS) compiler
       Written in Python from scratch with an OO approach
       Accepts C99 as input
       It can generate CUDA/OpenCL kernels from annotated code
       A driver for compiling OpenACC directives has been added
       YaCF translates the directives into Frangollo calls
       A public-domain development

 23. Frangollo: the runtime
     Frangollo is a runtime that supports execution on heterogeneous platforms
       1 It encapsulates the hardware issues
       2 It can run on NVIDIA devices using CUDA
       3 It can manage a wider range of devices using OpenCL

 24. Frangollo: the runtime
     Compilation flow (figure)

 25. Frangollo: the runtime
     Its responsibilities:
       1 Manages the memory
       2 Initializes the devices
       3 Launches the kernels

 26. Frangollo: the runtime
     Its responsibilities:
       1 Manages the memory
       2 Initializes the devices
       3 Launches the kernels
     It makes programmers' lives easier!

 27. Frangollo: Memory Management
     A program workflow (figure)

 28. Frangollo: Structure
     Interface layer: a door into Frangollo
     Some functions in the C interface:
       registerVar
       launchKernel
       getNumDevices

 29. Frangollo: Structure
     Abstract layer
       Frangollo uses a class hierarchy
       All classes in this layer are abstract

 30. Frangollo: Structure
     Device layer
       Encapsulates all target-language-related functions
       New platforms could be added in the future

 31. Outline
     1 Heterogeneous Architectures
     2 accULL: An Early OpenACC Implementation
     3 Results
     4 Conclusions and Future Work

 32. Platforms. M1: a desktop computer
     Intel Core i7 930 processor (2.80 GHz)
     1 MB of L2 cache and 8 MB of L3 cache, shared by the four cores
     4 GB RAM
     2 GPU devices attached:
       Tesla C1060 with 3 GB memory (M1a)
       Tesla C2050 (Fermi) with 4 GB memory (M1b)
     Accelerator platform: CUDA 4.0
     M1a/M1b mimic the scenario of an average OpenACC developer, who can purchase a GPU card and plug it into her desktop computer
     It is a relatively cheap platform

 33. Platforms. M2: a cluster node
     2 quad-core Intel Xeon E5410 (2.25 GHz) processors
     24 GB memory
     An attached Fermi C2050 card with 448 CUDA cores and 4 GB memory
     Accelerator platform: CUDA 4.0
     M2 is a node of a typical multi-node cluster
     Clusters nowadays combine multicore processors and GPU devices, so we can take advantage of OpenACC
     This kind of compute node has higher acquisition and maintenance costs than M1

 34. Platforms. M3: a second cluster
     M3 is a shared-memory system
     4 Intel Xeon E7 4850 CPUs
     2.50 MB L2 cache and 24 MB L3 cache (shared by all of its 10 cores)
     6 GB of memory per core
     Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU
     M3 showcases an alternative use of OpenCL
     There are implementations of OpenCL targeting shared-memory systems
     Using CPU-targeted OpenCL platforms together with OpenACC is an interesting alternative to OpenMP programming

 35. Some of our Experiments
     Blocked Matrix Multiplication (M×M)
     Rodinia Benchmark
       The Rodinia benchmark suite comprises compute-heavy applications
       It covers a wide range of applications
       OpenMP, CUDA and OpenCL versions are available for most of the codes in the suite
       From them, we have selected:
         Needleman-Wunsch (NW)
         HotSpot (HS)
         Speckle Reducing Anisotropic Diffusion (SRAD)

 36. Matrix Multiplication
     Sketch of M×M in OpenACC:

       #pragma acc kernels name("mxm") copy(a[L*N]) \
                   copyin(b[L*M], c[M*N] ...)
       {
           #pragma acc loop private(i, j) collapse(2)
           for (i = 0; i < L; i++)
               for (j = 0; j < N; j++)
                   a[i * L + j] = 0.0;
           /* Iterate over blocks */
           for (ii = 0; ii < L; ii += tile_size)
               for (jj = 0; jj < N; jj += tile_size)
                   for (kk = 0; kk < M; kk += tile_size) {
                       /* Iterate inside a block */
                       #pragma acc loop collapse(2) private(i, j, k)
                       for (j = jj; j < min(N, jj + tile_size); j++)
                           for (i = ii; i < min(L, ii + tile_size); i++)
                               for (k = kk; k < min(M, kk + tile_size); k++)
                                   a[i*L + j] += (b[i*L + k] * c[k*M + j]);
                   }
       }

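     For comparison with the OpenMP numbers reported in the following slides, here is a hedged sketch of what an OpenMP counterpart of this loop nest could look like. The OpenMP code actually used in the measurements is not included in the deck; the indexing and the min() helper simply mirror the slide's code.

       /* Hedged sketch (not from the slides): one possible OpenMP version of
          the blocked M-by-M loop nest above. Indexing and min() mirror the
          slide; i, j, k, ii, jj, kk and tile_size are declared elsewhere. */
       #pragma omp parallel for collapse(2) private(i, j)
       for (i = 0; i < L; i++)
           for (j = 0; j < N; j++)
               a[i * L + j] = 0.0;

       #pragma omp parallel for collapse(2) private(i, j, k, kk)
       for (ii = 0; ii < L; ii += tile_size)           /* blocks of rows        */
           for (jj = 0; jj < N; jj += tile_size)       /* blocks of columns     */
               for (kk = 0; kk < M; kk += tile_size)   /* kk stays in one thread */
                   for (j = jj; j < min(N, jj + tile_size); j++)
                       for (i = ii; i < min(L, ii + tile_size); i++)
                           for (k = kk; k < min(M, kk + tile_size); k++)
                               a[i * L + j] += b[i * L + k] * c[k * M + j];
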
 37. Matrix Multiplication
     Floating-point performance for M×M on M2 (figure)

 38. Matrix Multiplication
     Floating-point performance comparison between OpenMP, accULL, PGI and hiCUDA on M1 (figure)

 39. Matrix Multiplication
     Comparison between the OpenMP-gcc implementation and Frangollo+OpenCL on M3 (shared-memory system, 40 cores) (figure)

 40. Needleman-Wunsch
     Performance comparison of NW on M1b (figure)
     accULL performs worse than the native versions

 41. Needleman-Wunsch
     Performance comparison of NW on M3 (shared memory, 40 cores) (figure)
     The OpenMP versions outperform their OpenCL counterparts

 42. HotSpot
     Performance comparison of different implementations, showing efficiency over native CUDA code on M1 (figure)
     In this case, accULL performs similarly to hiCUDA

 43. HotSpot
     Speed-up comparison with native CUDA code on M1b (Fermi) (figure)

 44. HotSpot
     Efficiency w.r.t. Intel OpenMP on M3 (shared memory, 40 cores) (figure)

 45. SRAD
     Speedup over the OpenMP implementation on M1b (figure)

 46. SRAD
     Speedup over the OpenMP implementation on M3 (figure)

 47. Outline
     1 Heterogeneous Architectures
     2 accULL: An Early OpenACC Implementation
     3 Results
     4 Conclusions and Future Work

 48-51. Conclusions I (incremental build, collapsed)
     accULL
       The first OpenACC implementation with support for both CUDA and OpenCL
       It supports most of the standard
       We validated accULL with codes from widely available benchmarks, on both GPUs and CPUs
       It meets the requirements of a non-expert developer

 52-59. Conclusions II (incremental build, collapsed)
     accULL
       YaCF can be used as a fast-prototyping tool to explore optimizations
       Frangollo can be detached from YaCF and combined with a production-ready compiler
       Some issues can be tackled within Frangollo, independently of the compiler:
         Memory allocation
         Kernel scheduling
         Data splitting
         Overlapping of computation and communication
         Parallel reduction implementation

 60-65. Future work (incremental build, collapsed)
     There are plenty of opportunities to improve performance:
       Implement 2D arrays as cudaMatrix or OCLImages to improve non-contiguous memory access
       Complete the implementation of asynchronous calls for better performance
       Multi-GPU support
       Explore different possibilities for integration with MPI
       Integration of Frangollo with a production-ready compiler
       A new backend for FPGAs

 66. Thank you for your attention!
     accULL: A User-directed Approach to Heterogeneous Programming
     http://accull.wordpress.com/
     This work has been partially supported by the EU (FEDER), the Spanish MEC (contracts TIN2008-06570-C04-03 and TIN2011-24598), HPC-EUROPA2 and the Canary Islands Government (ACIISI).
     F. de Sande, fsande@ull.es
