YaCF: The accULL Compiler

YaCF: Yet Another Compiler Framework. A source-to-source compiler that translates OpenACC code into an internal representation handled by the Frangollo runtime.

Transcript

  1. YaCF: The accULL Compiler. Undergraduate Thesis Project. Juan José Fumero Alfonso, Universidad de La Laguna, 22 June 2012.
  2. Outline: 1. Introduction, 2. YaCF, 3. Experiments, 4. Conclusions, 5. Future Work.
  3. Outline (start of the Introduction section).
  4. Moore's Law: every 18 months, the number of transistors that fit on a chip doubles.
  5. Parallel Architectures Nowadays (figure).
  6. Parallel Architectures. The solution: more processors, and more cores per processor.
  7. Parallel Architectures. Real systems are hybrid, combining all of these options.
  8. Parallel Architectures (figure).
  9. OpenMP: Shared Memory Programming
     • An API that supports SMP programming.
     • Multi-platform.
     • A directive-based approach.
     • A set of compiler directives, library routines and environment variables for parallel programming.
     OpenMP example:
        #pragma omp parallel
        {
            #pragma omp master
            {
                nthreads = omp_get_num_threads();
            }
            #pragma omp for private(x) reduction(+:sum) schedule(runtime)
            for (i = 0; i < NUM_STEPS; ++i) {
                x = (i + 0.5) * step;
                sum = sum + 4.0 / (1.0 + x * x);
            }
            #pragma omp master
            {
                pi = step * sum;
            }
        }
  10. MPI: Message Passing Interface
      • A language-independent communications protocol used to program parallel applications.
      • MPI's goals are high performance, scalability and portability.
      MPI example:
        MPI_Comm_size(MPI_COMM_WORLD, &MPI_NUMPROCESSORS);
        MPI_Comm_rank(MPI_COMM_WORLD, &MPI_NAME);
        w = 1.0 / N;
        for (i = MPI_NAME; i < N; i += MPI_NUMPROCESSORS) {
            local = (i + 0.5) * w;
            pi_mpi = pi_mpi + 4.0 / (1.0 + local * local);
        }
        MPI_Allreduce(&pi_mpi, &gpi_mpi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  11. High Performance Computing
      • The most powerful computers of the moment.
      • Systems with a massive number of processors and cores, typically thousands.
      • Very high calculation speed.
      • Very expensive systems that consume a huge amount of energy.
  12. TOP500: High Performance Computing
      • The TOP500 project ranks and details the 500 most powerful (non-distributed) known computer systems in the world.
      • The project publishes an updated list of the supercomputers twice a year.
  13. The Accelerators Era (figure).
  14. Languages for Heterogeneous Programming: CUDA
      Developed by NVIDIA.
      • Pros: good performance; easier to use than OpenCL.
      • Con: works only with NVIDIA hardware.
  15. Languages for Heterogeneous Programming: CUDA
        __global__ void mmkernel(float *a, float *b, float *c,
                                 int n, int m, int p)
        {
            int i = blockIdx.x * 32 + threadIdx.x;
            int j = blockIdx.y;
            float sum = 0.0f;
            for (int k = 0; k < p; ++k)
                sum += b[i + n*k] * c[k + p*j];
            a[i + n*j] = sum;
        }
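     The slide shows only the device code. A kernel like mmkernel is launched from the host with an explicit grid/block configuration; the indexing i = blockIdx.x * 32 + threadIdx.x implies 32 threads per block in x. A minimal launch sketch, assuming device pointers d_a, d_b, d_c and an n that is a multiple of 32 (these names and sizes are assumptions, not from the deck):
        /* d_a, d_b, d_c: device buffers already allocated with cudaMalloc
           and filled with cudaMemcpy. One thread per element of the result. */
        dim3 block(32, 1);        /* matches i = blockIdx.x * 32 + threadIdx.x */
        dim3 grid(n / 32, m);     /* blockIdx.x covers the rows, blockIdx.y the columns */
        mmkernel<<<grid, block>>>(d_a, d_b, d_c, n, m, p);
        cudaDeviceSynchronize();  /* wait for the kernel before reading d_a */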
  16. Languages for Heterogeneous Programming: OpenCL
      A framework developed by the Khronos Group.
      • Pros: can be used with any device; it is a standard.
      • Cons: more complex than CUDA; still immature.
  17. Languages for Heterogeneous Programming: OpenCL
        __kernel void matvecmul(__global float *a,
                                const __global float *b,
                                const __global float *c,
                                const uint N) {
            float R;
            int k;
            int xid = get_global_id(0);
            int yid = get_global_id(1);
            if (xid < N) {
                if (yid < N) {
                    R = 0.0;
                    for (k = 0; k < N; k++)
                        R += b[xid * N + k] * c[k * N + yid];
                    a[xid * N + yid] = R;
                }
            }
        }
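     OpenCL has no <<<...>>> launch syntax; kernels are enqueued through host API calls. A minimal host-side sketch for this kernel, assuming the queue, the kernel object and the device buffers buf_a, buf_b, buf_c were created beforehand (assumed names; error checks omitted):
        #include <CL/cl.h>

        /* Enqueue matvecmul over an N x N index space. */
        void launch_matvecmul(cl_command_queue queue, cl_kernel kernel,
                              cl_mem buf_a, cl_mem buf_b, cl_mem buf_c, cl_uint N)
        {
            size_t global[2] = {N, N};   /* one work-item per output element */
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
            clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
            clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);
            clSetKernelArg(kernel, 3, sizeof(cl_uint), &N);
            /* 2-D NDRange; local size NULL lets the runtime pick the work-group. */
            clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
            clFinish(queue);             /* block until the kernel has completed */
        }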
  18. Languages for Heterogeneous Programming
      Pros:
      1. The programmer can use all of the machine's devices.
      2. GPU and CPU can work in parallel.
  19. Languages for Heterogeneous Programming
      Problems:
      1. The programmer needs to know low-level details of the architecture.
  20. Languages for Heterogeneous Programming
      Cons:
      1. The programmer needs to know low-level details of the architecture.
      2. Source code needs to be rewritten: one version for OpenMP/MPI, a different version for the GPU.
      3. Good performance requires a great effort in parameter tuning.
      4. These languages (CUDA/OpenCL) are complex and new to non-experts.
  21. GPGPU (General-Purpose GPU) Computing. Can we use GPUs for parallel computing? Is it efficient?
  22. The NBody Problem
      • The simulation numerically approximates the evolution of a system of bodies.
      • Each body continuously interacts with every other body.
      • Example application: fluid flow simulations.
  23. NBody description. Acceleration of body i:
         a_i = F_i / m_i ≈ G · Σ_{1 ≤ j ≤ N}  m_j · r_ij / ( ‖r_ij‖² + ε² )^(3/2)
      where r_ij is the vector from body i to body j and ε is a softening factor that avoids the singularity when two bodies come very close.
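     A worked, plain-C version of this formula (a sketch; the function name, the EPS2 value and folding G into the loop are assumptions, not code from the deck):
        #include <math.h>

        /* Acceleration on body i from all N bodies: O(N) per body,
           O(N^2) for the whole system. pos[j] = position of body j,
           mass[j] = m_j, EPS2 = softening factor squared (eps^2). */
        void body_acceleration(int i, int N, double pos[][3],
                               const double *mass, double G, double a[3])
        {
            const double EPS2 = 1e-9;               /* assumed softening */
            a[0] = a[1] = a[2] = 0.0;
            for (int j = 0; j < N; j++) {
                double rx = pos[j][0] - pos[i][0];  /* components of r_ij */
                double ry = pos[j][1] - pos[i][1];
                double rz = pos[j][2] - pos[i][2];
                double d2 = rx*rx + ry*ry + rz*rz + EPS2;  /* ||r_ij||^2 + eps^2 */
                double inv_d3 = 1.0 / (d2 * sqrt(d2));     /* (...)^(-3/2)       */
                double s = G * mass[j] * inv_d3;
                a[0] += s * rx;   /* the j == i term contributes zero,   */
                a[1] += s * ry;   /* since r_ii = 0 and EPS2 keeps the   */
                a[2] += s * rz;   /* denominator finite                  */
            }
        }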
  24. CUDA implementation
      • The method is particle-to-particle (all pairs).
      • Its computational complexity is O(n²).
      • It evaluates all pair-wise interactions, so it is exact.
  25. CUDA implementation: blocks and grids (figure).
  26. CUDA Kernel: tile calculation
        __device__ float3 gravitation(float4 myPos, float3 accel) {
            extern __shared__ float4 sharedPos[];
            unsigned long i = 0;
            /* Accumulate the interaction with each of the blockDim.x bodies
               cached in shared memory; SX(i) reads the i-th cached position. */
            for (unsigned int counter = 0; counter < blockDim.x; counter++)
            {
                accel = bodyBodyInteraction(accel, SX(i++), myPos);
            }
            return accel;
        }
  27. CUDA Kernel: calculate forces
        __global__ void calculate_forces(float4 *globalX, float4 *globalA)
        {
            /* A shared memory buffer to store the body positions. */
            extern __shared__ float4 shPosition[];
            int i, tile;
            float3 acc = {0.0f, 0.0f, 0.0f};
            /* Global thread ID (the unique body index in the simulation). */
            int gtid = blockIdx.x * blockDim.x + threadIdx.x;
            /* The position of the body we are computing the acceleration for. */
            float4 myPosition = globalX[gtid];
            for (i = 0, tile = 0; i < N; i += blockDim.x, tile++)
            {
                int idx = tile * blockDim.x + threadIdx.x;
                shPosition[threadIdx.x] = globalX[idx];  /* each thread loads one body */
                __syncthreads();                         /* tile fully loaded */
                acc = tile_calculation(myPosition, acc);
                __syncthreads();                         /* done reading the tile */
            }
            /* ... (storing acc into globalA is omitted on the slide) ... */
        }
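     Because the kernel declares an extern __shared__ array, the size of that buffer must be passed as the third parameter of the launch configuration. A hedged launch sketch with assumed names (d_X, d_A, numBodies) and an assumed block size:
        int p = 256;                            /* assumed threads per block */
        int blocks = (numBodies + p - 1) / p;   /* one thread per body */
        size_t shBytes = p * sizeof(float4);    /* one cached position per thread */
        calculate_forces<<<blocks, p, shBytes>>>(d_X, d_A);
        cudaDeviceSynchronize();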
  28. Results: setup
      • Tesla C1060 (compute capability 1.3).
      • Sequential source code run on an Intel Core i7 930.
      • NBody SDK.
      • CUDA Runtime / CUDA Driver: 4.0.
      • 400,000 bodies, 200 iterations.

      Device        | Cores | Memory | Performance (GFLOPS)
      Tesla C1060   | 240   | 4 GB   | 933 (single), 78 (double)
      Intel Core i7 | 4     | 4 GB   | 44.8 (11.2 per core)
  29. Results
      • Sequential code: ≈ 147,202,512.4 ms ≈ 41 hours (40.89 hours).
      • Parallel CUDA code: 1,392,029.6 ms ≈ 23.2 minutes.
      • Speedup: 147,202,512.4 / 1,392,029.6 ≈ 105.7 (roughly 105×).
  30. At the Present Time
      • Some applications are accelerated with GPUs.
      • The user needs to learn new programming languages and tools.
      • The CUDA model and its architecture have to be understood.
      • Non-expert users have to write programs for a new model.
  31. GPGPU Languages
      OpenACC: introduced in November 2011 at SuperComputing 2011. A directive-based language.
      • Aims to be a standard.
      • Supported by Cray, NVIDIA, PGI and CAPS.
      • One single source code for all versions.
      • Platform independent.
      • Easier for beginners.
  32. GPGPU Languages. OpenACC: a directive-based language.
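     To make "directive-based" concrete, a minimal OpenACC sketch (not taken from the deck): the pi loop from the earlier OpenMP slide, offloaded with a single directive. The compiler generates the device kernel and the data movement:
        #define NUM_STEPS 1000000

        double pi_acc(void)
        {
            double step = 1.0 / NUM_STEPS, sum = 0.0;
            /* One directive replaces the parallel/master/for scaffolding. */
            #pragma acc parallel loop reduction(+:sum)
            for (int i = 0; i < NUM_STEPS; ++i) {
                double x = (i + 0.5) * step;
                sum += 4.0 / (1.0 + x * x);
            }
            return step * sum;
        }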
  33. A New Dimension for HPC (figure).
  34. accULL: our OpenACC Implementation. accULL = compiler + runtime library.
  35. accULL: our OpenACC Implementation. accULL = compiler + runtime library; accULL = YaCF + Frangollo.
  36. Initial Objectives of this Project
      • To integrate C99 in the YaCF project.
      • To implement a new class hierarchy for new YaCF frontends.
      • To implement an OpenACC frontend.
      • To complete the OpenMP grammar with the OpenMP 3.0 directives.
      • To test the new C99 interface.
  37. Source-to-source Compilers
      • ROSE Compiler Framework.
      • Cetus Compiler.
      • Mercurium.
  38. Outline (start of the YaCF section).
  39-42. accULL: our OpenACC implementation (architecture diagram, revealed incrementally over four slides).
  43. YaCF: Yet Another Compiler Framework (figure).
  44. YaCF
      • A source-to-source compiler that translates C code with OpenMP, llc and OpenACC annotations into code with Frangollo calls (a sketch of this translation follows below).
      • Integrates code analysis tools.
      • Completely written in Python.
      • Based on widely known object-oriented software patterns.
      • Based on the pycparser Python module.
      • Implementing a code transformation is only a matter of writing a few lines of code.
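     To illustrate the kind of translation YaCF performs, a before/after sketch. The FRG_* names below are invented placeholders for illustration only; they are not Frangollo's actual API:
        /* Input accepted by YaCF: a C loop with an OpenACC annotation. */
        #pragma acc kernels copyin(b[N]) copy(a[N])
        for (int i = 0; i < N; i++)
            a[i] += b[i];

        /* Shape of the generated output: the directive becomes runtime
           calls (hypothetical names, for illustration only). */
        FRG_create_context("kernel_0");
        FRG_register_var(a, N * sizeof(double), FRG_COPY);
        FRG_register_var(b, N * sizeof(double), FRG_COPYIN);
        FRG_launch_kernel("kernel_0");   /* runtime picks CUDA or OpenCL */
        FRG_destroy_context("kernel_0");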
  45-52. YaCF: Architecture (diagram, revealed incrementally over eight slides).
  53-56. YaCF: Preprocessor (diagram, revealed incrementally over four slides).
  57-58. YaCF: Architecture (two further diagram slides).
  59. YaCF: Statistics
      • 20,683 lines of Python code.
      • 2,158 functions and methods.
      • My contribution is about 25 % of the YaCF project.
  60. Outline (start of the Experiments section).
  61. Experiments
      • The ScaLAPACK benchmark: testing the C99 support.
      • Block matrix multiplication in accULL.
      • Three problems from the Rodinia benchmark suite: HotSpot, SRAD and Needleman-Wunsch.
  62. ScaLAPACK
      • ScaLAPACK (Scalable LAPACK) is a library that includes a subset of LAPACK routines redesigned for distributed-memory MIMD parallel computers.
      • ScaLAPACK is designed for heterogeneous computing.
      • It is portable to any computer that supports MPI.
      • Its scalability depends on the PBLAS operations.
  63-64. ScaLAPACK: results in YaCF

      Directory        | Total C files | Success | Failures
      PBLAS/SRC        | 123           | 123     | 0
      REDIST/SRC       | 21            | 21      | 0
      PBLAS/SRC/PTOOLS | 102           | 101     | 1
      PBLAS/TESTING    | 2             | 1       | 1
      PBLAS/TIMING     | 2             | 1       | 1
      REDIST/TESTING   | 10            | 0       | 10
      SRC              | 9             | 9       | 0
      TOOLS            | 2             | 2       | 0
      Total            | 271           | 258     | 13

      95 % of the ScaLAPACK C files are parsed correctly by YaCF.
  65. Platforms
      • Garoe: a desktop computer with an Intel Core i7 930 processor (2.80 GHz), with 1 MB of L2 cache and 8 MB of L3 cache shared by the four cores. The system has 4 GB of RAM and an attached Tesla C2050 with 4 GB of memory.
  66. Platforms
      • Drago: a cluster node. It is a shared-memory system with 4 Intel Xeon E7 processors, each with 10 cores. In this case the accelerator platform is the Intel OpenCL SDK 1.5, which runs on the CPU.
  67. MxM in accULL
      • MxM (matrix multiplication) is a basic kernel frequently used to showcase the peak performance of GPU computing.
      • We compare the performance of the accULL implementation with that of OpenMP, CUDA and OpenCL.
  68. MxM in accULL: the OpenACC code
        #pragma acc kernels name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
        {
            #pragma acc loop private(i, j) collapse(2)
            for (i = 0; i < L; i++)
                for (j = 0; j < N; j++)
                    a[i * L + j] = 0.0;
            /* Iterate over blocks */
            for (ii = 0; ii < L; ii += tile_size)
                for (jj = 0; jj < N; jj += tile_size)
                    for (kk = 0; kk < M; kk += tile_size) {
                        /* Iterate inside a block */
                        #pragma acc loop collapse(2) private(i, j, k)
                        for (j = jj; j < min(N, jj + tile_size); j++)
                            for (i = ii; i < min(L, ii + tile_size); i++)
                                for (k = kk; k < min(M, kk + tile_size); k++)
                                    a[i*L + j] += b[i*L + k] * c[k*M + j];
                    }
        }
  69. MxM in accULL on Garoe (performance figure).
  70. MxM in accULL on Drago (performance figure).
  71. SRAD: an image filtering code (figure).
  72. SRAD on Garoe (performance figure). The CUDA code generated through Frangollo performs better than the native CUDA version.
  73. SRAD on Drago (performance figure).
  74. NW: Needleman-Wunsch, a sequence alignment code (figure).
  75. NW on Garoe (performance figure). Poor results, although still better than OpenMP on 4 cores.
  76. NW on Drago (performance figure).
  77. HotSpot: a thermal simulation tool for estimating processor temperature (figure).
  78. HotSpot on Garoe (performance figure). As good as the native versions.
  79. HotSpot on Drago (performance figure).
  80. Outline (start of the Conclusions section).
  81. Conclusions: Compiler Technologies
      • Compiler technology tends to use and optimize source-to-source compilers to generate and transform source code.
      • It is easier to parallelize source code with AST transformations.
      • AST transformations enable programmers to easily generate code for any platform.
  82. Conclusions: Programming Model
      • Directive-based programming languages allow non-expert programmers to abstract away architectural details and write programs more easily.
      • The OpenACC standard is a starting point for programming heterogeneous systems.
      • Future versions of the OpenMP standard will include support for accelerators.
      • The results we are obtaining with accULL, our early OpenACC implementation, are promising.
  83. References
      • Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande. accULL: An OpenACC Implementation with CUDA and OpenCL Support. International European Conference on Parallel and Distributed Computing, 2012.
      • Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande. Directive-based Programming for GPUs: A Comparative Study. The 14th IEEE International Conference on High Performance Computing and Communications.
      • Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande. accULL: A User-directed Approach to Heterogeneous Programming. The 10th IEEE International Symposium on Parallel and Distributed Processing with Applications.
  84. Outline (start of the Future Work section).
  85-90. Future Work (list revealed incrementally over six slides):
      • Add support for MPI combined with CUDA and OpenCL.
      • Perform new experiments with OpenACC.
      • Compare our accULL approach with PGI OpenACC and CAPS HMPP.
      • Add support for vectorization.
      • Explore FPGAs in combination with CUDA and OpenCL.
      • Introduce the LLVM Compiler Framework in the frontend.
  91. Thank you for your attention. Juan José Fumero Alfonso, jfumeroa@ull.edu.es
  92. Closing slide: YaCF: The accULL Compiler. Undergraduate Thesis Project. Juan José Fumero Alfonso, Universidad de La Laguna, 22 June 2012.
