Yacf

YaCF: Yet Another Compiler Framework. It is a source-to-source compiler that translates OpenACC code to an internal representation (Frangollo).

YaCF: Yet Another Compiler Framework. It is a source to source compiler that translates OpenACC code to Internal Representation (Frangollo).

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
386
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. YaCF: The accULL Compiler
    Undergraduate Thesis Project
    Juan José Fumero Alfonso
    Universidad de La Laguna
    June 22, 2012
  • 2. Outline
    1. Introduction
    2. YaCF
    3. Experiments
    4. Conclusions
    5. Future Work
  • 4. Moore's Law
    Roughly every 18 months, the number of transistors on a chip doubles.
  • 5. Parallel Architectures Today
  • 6. Parallel Architectures
    The solution:
    • More processors.
    • More cores per processor.
  • 7. Parallel Architectures
    Modern systems are hybrid, using all of these options.
  • 8. Parallel Architectures
  • 9. OpenMP: Shared Memory Programming
    • An API that supports SMP programming.
    • Multi-platform.
    • A directive-based approach.
    • A set of compiler directives, library routines and environment variables for parallel programming.
    OpenMP example:
      #pragma omp parallel
      {
          #pragma omp master
          {
              nthreads = omp_get_num_threads();
          }
          #pragma omp for private(x) reduction(+:sum) schedule(runtime)
          for (i = 0; i < NUM_STEPS; ++i) {
              x = (i + 0.5) * step;
              sum = sum + 4.0 / (1.0 + x * x);
          }
          #pragma omp master
          {
              pi = step * sum;
          }
      }
  • 10. MPI: Message Passing Interface
    • A language-independent communications protocol used to program parallel applications.
    • MPI's goals are high performance, scalability and portability.
    MPI example:
      MPI_Comm_size(MPI_COMM_WORLD, &MPI_NUMPROCESSORS);
      MPI_Comm_rank(MPI_COMM_WORLD, &MPI_NAME);
      w = 1.0 / N;
      for (i = MPI_NAME; i < N; i += MPI_NUMPROCESSORS) {
          local = (i + 0.5) * w;
          pi_mpi = pi_mpi + 4.0 / (1.0 + local * local);
      }
      MPI_Allreduce(&pi_mpi, &gpi_mpi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  • 11. High Performance Computing
    • The most powerful computers at the moment.
    • Systems with a massive number of processors.
    • High calculation speed.
    • They contain thousands of processors and cores.
    • Very expensive systems that consume a huge amount of energy.
  • 12. TOP500: High Performance Computing
    • The TOP500 project ranks and details the 500 most powerful known (non-distributed) computer systems in the world.
    • The project publishes an updated list of the supercomputers twice a year.
  • 13. Accelerators Era
  • 14. Languages for Heterogeneous Programming
    CUDA: developed by NVIDIA.
    • Pros: high performance; easier than OpenCL.
    • Con: it only works with NVIDIA hardware.
  • 15. Languages for Heterogeneous Programming
    CUDA example:
      __global__ void mmkernel(float *a, float *b, float *c, int n, int m, int p)
      {
          int i = blockIdx.x * 32 + threadIdx.x;
          int j = blockIdx.y;
          float sum = 0.0f;
          for (int k = 0; k < p; ++k)
              sum += b[i + n*k] * c[k + p*j];
          a[i + n*j] = sum;
      }
  • 16. Languages for Heterogeneous Programming
    OpenCL: a framework developed by the Khronos Group.
    • Pros: it can be used with any device; it is a standard.
    • Cons: more complex than CUDA; still immature.
  • 17. Languages for Heterogeneous Programming
    OpenCL example:
      __kernel void matvecmul(__global float *a,
                              const __global float *b, const __global float *c,
                              const uint N) {
          float R;
          int k;
          int xid = get_global_id(0);
          int yid = get_global_id(1);
          if (xid < N) {
              if (yid < N) {
                  R = 0.0;
                  for (k = 0; k < N; k++)
                      R += b[xid*N + k] * c[k*N + yid];
                  a[xid*N + yid] = R;
              }
          }
      }
  • 18. Languages for Heterogeneous Programming
    Pros:
    1. The programmer can use all of the machine's devices.
    2. The GPU and the CPU can work in parallel.
  • 19. Languages for Heterogeneous Programming
    Problems:
    1. The programmer needs to know low-level details of the architecture.
  • 20. Languages for Heterogeneous Programming
    Cons:
    1. The programmer needs to know low-level details of the architecture.
    2. Source code needs to be rewritten:
       • One version for OpenMP/MPI.
       • A different version for the GPU.
    3. Good performance requires a great effort in parameter tuning.
    4. These languages (CUDA/OpenCL) are complex and new for non-experts.
  • 21. GPGPU (General Purpose GPU) Computing
    Can we use GPUs for parallel computing? Is this efficient?
  • 22. The NBody Problem
    • The simulation numerically approximates the evolution of a system of bodies.
    • Each body continuously interacts with the other bodies.
    • Example application: fluid flow simulations.
  • 23. NBody Description
    Acceleration of body i:
      a_i = F_i / m_i ≈ G · Σ_{1 ≤ j ≤ N} (m_j · r_ij) / (||r_ij||² + ε²)^(3/2)
    where r_ij is the vector from body i to body j and ε² is a softening factor.
  • 24. CUDA Implementation
    • The method is particle-to-particle.
    • Its computational complexity is O(n²).
    • It evaluates all pair-wise interactions, so it is exact.
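The particle-to-particle evaluation above can be sketched in a few lines of Python (a plain illustration, not the thesis code; the value of the softening factor ε² is an assumption):

```python
G = 6.674e-11   # gravitational constant
EPS2 = 1e-3     # softening factor eps^2 (assumed value; avoids division by zero)

def acceleration(i, pos, mass):
    """O(n^2) particle-to-particle acceleration on body i (softened gravity)."""
    ax = ay = az = 0.0
    xi, yi, zi = pos[i]
    for j, (xj, yj, zj) in enumerate(pos):
        dx, dy, dz = xj - xi, yj - yi, zj - zi
        # (||r_ij||^2 + eps^2)^(-3/2); the j == i term contributes zero.
        inv_r3 = (dx*dx + dy*dy + dz*dz + EPS2) ** -1.5
        ax += mass[j] * dx * inv_r3
        ay += mass[j] * dy * inv_r3
        az += mass[j] * dz * inv_r3
    return (G * ax, G * ay, G * az)
```

Because every pair is evaluated, the result is exact (up to the softening term), which is why the slide calls the O(n²) method exact.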
  • 25. CUDA Implementation: Blocks and Grids
  • 26. CUDA Kernel: Tile Calculation
      __device__ float3 gravitation(float4 myPos, float3 accel) {
          extern __shared__ float4 sharedPos[];
          unsigned long i = 0;
          for (unsigned int counter = 0; counter < blockDim.x; counter++) {
              accel = bodyBodyInteraction(accel, SX(i++), myPos);
          }
          return accel;
      }
  • 27. CUDA Kernel: Calculate Forces
      __global__ void calculate_forces(float4 *globalX, float4 *globalA)
      {
          // A shared memory buffer to store the body positions.
          extern __shared__ float4 shPosition[];
          int i, tile;
          float3 acc = {0.0f, 0.0f, 0.0f};
          // Global thread ID (the unique body index in the simulation).
          int gtid = blockIdx.x * blockDim.x + threadIdx.x;
          // The position of the body we are computing the acceleration for.
          float4 myPosition = globalX[gtid];
          for (i = 0, tile = 0; i < N; i += blockDim.x, tile++)
          {
              int idx = tile * blockDim.x + threadIdx.x;
              shPosition[threadIdx.x] = globalX[idx];
              __syncthreads();
              acc = tile_calculation(myPosition, acc);
              __syncthreads();
          }
          // return
      }
  • 28. Results
    • Tesla C1060 (compute capability 1.3).
    • Sequential source code: Intel Core i7 930.
    • NBody SDK.
    • CUDA Runtime / CUDA Driver: 4.0.
    • 400000 bodies, 200 iterations.

    Device        | Cores | Memory | Performance (GFLOPS)
    Tesla C1060   | 240   | 4 GB   | 933 (single), 78 (double)
    Intel Core i7 | 4     | 4 GB   | 44.8 (11.2 per core)
  • 29. Results
    • Sequential code: ≈ 147202512.40 ms ≈ 41 hours (40.89 hours).
    • Parallel CUDA code: 1392029.6 ms ≈ 23.2 minutes.
    • The speedup is 105.7 (about 105×).
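The timing claims above are easy to re-derive from the two raw measurements:

```python
seq_ms = 147202512.40    # sequential run, milliseconds
cuda_ms = 1392029.6      # parallel CUDA run, milliseconds

hours = seq_ms / 3.6e6       # ms -> hours
minutes = cuda_ms / 6.0e4    # ms -> minutes
speedup = seq_ms / cuda_ms

print(f"{hours:.1f} h sequential, {minutes:.1f} min CUDA, {speedup:.1f}x speedup")
# → 40.9 h sequential, 23.2 min CUDA, 105.7x speedup
```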
  • 30. At the Present Time
    • Some applications are accelerated with GPUs.
    • The user needs to learn new programming languages and tools.
    • The CUDA model and its architecture have to be understood.
    • Non-expert users have to write programs for a new model.
  • 31. GPGPU Languages
    OpenACC: introduced at SuperComputing 2011 (November 2011).
    A directive-based language.
    • It aims to be a standard.
    • Supported by Cray, NVIDIA, PGI and CAPS.
    • One single source code for all versions.
    • Platform independent.
    • Easier for beginners.
  • 32. GPGPU Languages
    OpenACC: a directive-based language.
  • 33. A New Dimension for HPC
  • 34. accULL: our OpenACC Implementation
    accULL = compiler + runtime library.
  • 35. accULL: our OpenACC Implementation
    accULL = compiler + runtime library.
    accULL = YaCF + Frangollo.
  • 36. Initial Objectives of this Project
    • To integrate C99 in the YaCF project.
    • To implement a new class hierarchy for new YaCF frontends.
    • To implement an OpenACC frontend.
    • To complete the OpenMP grammar with the directives of OpenMP 3.0.
    • To test the new C99 interface.
  • 37. Source-to-source Compilers
    • Rose Compiler Framework.
    • Cetus Compiler.
    • Mercurium.
  • 38. Outline
    1. Introduction
    2. YaCF
    3. Experiments
    4. Conclusions
    5. Future Work
  • 39. accULL: our OpenACC Implementation
  • 43. YaCF: Yet Another Compiler Framework
  • 44. YaCF
    • A source-to-source compiler that translates C code with OpenMP, llc and OpenACC annotations into code with Frangollo calls.
    • Integrates code analysis tools.
    • Completely written in Python.
    • Based on widely known object-oriented software patterns.
    • Based on the pycparser Python module.
    • Implementing a code transformation is only a matter of writing a few lines of code.
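The "few lines of code" claim rests on the visitor pattern used by pycparser-style ASTs: a transformation pass is just a subclass overriding the node types it cares about. A minimal self-contained sketch of that pattern (illustrative node and class names, not YaCF's actual API):

```python
class Node:
    """A tiny AST node: a kind name plus children."""
    def __init__(self, kind, *children):
        self.kind = kind
        self.children = children

class NodeVisitor:
    """Dispatch on node kind, in the style of pycparser's c_ast.NodeVisitor."""
    def visit(self, node):
        method = getattr(self, 'visit_' + node.kind, self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        for child in node.children:
            self.visit(child)

class ForCounter(NodeVisitor):
    """An example analysis pass: count for-loops in the tree."""
    def __init__(self):
        self.count = 0

    def visit_For(self, node):
        self.count += 1
        self.generic_visit(node)   # keep descending into nested loops

# A hand-built tree standing in for a parsed function with two nested loops.
tree = Node('FuncDef', Node('For', Node('For', Node('Assign'))), Node('Return'))
counter = ForCounter()
counter.visit(tree)
print(counter.count)  # → 2
```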
  • 45. YaCF: Architecture
  • 53. YaCF: Preprocessor
  • 57. YaCF: Architecture
  • 59. YaCF: Statistics
    • 20683 lines of Python code.
    • 2158 functions and methods.
    • My contribution amounts to about 25% of the YaCF project.
  • 60. Outline
    1. Introduction
    2. YaCF
    3. Experiments
    4. Conclusions
    5. Future Work
  • 61. Experiments
    • ScaLAPACK benchmark: testing C99.
    • Block matrix multiplication in accULL.
    • Three different problems from the Rodinia benchmark:
      • HotSpot.
      • SRAD.
      • Needleman-Wunsch.
  • 62. ScaLAPACK
    • ScaLAPACK (Scalable LAPACK) is a library that includes a subset of LAPACK routines redesigned for distributed-memory MIMD parallel computers.
    • ScaLAPACK is designed for heterogeneous computing.
    • It is portable to any computer that supports MPI.
    • Scalability depends on the PBLAS operations.
  • 63. ScaLAPACK: Results in YaCF

    Directory        | Total C files | Success | Failures
    PBLAS/SRC        | 123           | 123     | 0
    REDIST/SRC       | 21            | 21      | 0
    PBLAS/SRC/PTOOLS | 102           | 101     | 1
    PBLAS/TESTING    | 2             | 1       | 1
    PBLAS/TIMING     | 2             | 1       | 1
    REDIST/TESTING   | 10            | 0       | 10
    SRC              | 9             | 9       | 0
    TOOLS            | 2             | 2       | 0
    Total            | 271           | 258     | 13

  • 64. ScaLAPACK: Results in YaCF
    95% of the ScaLAPACK C files are correctly parsed by YaCF.
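The table's totals and the 95% figure can be recomputed directly from the per-directory counts:

```python
# Per-directory (total C files, parsed successfully) from the ScaLAPACK run.
results = {
    "PBLAS/SRC": (123, 123),
    "REDIST/SRC": (21, 21),
    "PBLAS/SRC/PTOOLS": (102, 101),
    "PBLAS/TESTING": (2, 1),
    "PBLAS/TIMING": (2, 1),
    "REDIST/TESTING": (10, 0),
    "SRC": (9, 9),
    "TOOLS": (2, 2),
}
total = sum(t for t, _ in results.values())
success = sum(s for _, s in results.values())
print(total, success, round(100.0 * success / total, 1))  # → 271 258 95.2
```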
  • 65. Platforms
    • Garoe: a desktop computer with an Intel Core i7 930 processor (2.80 GHz), with 1 MB of L2 cache and 8 MB of L3 cache shared by the four cores. The system has 4 GB of RAM and a Tesla C2050 with 4 GB of memory attached.
  • 66. Platforms
    • Drago: a second cluster node. It is a shared-memory system with 4 Intel Xeon E7 processors, each with 10 cores. In this case, the accelerator platform is the Intel OpenCL SDK 1.5, which runs on the CPU.
  • 67. MxM in accULL
    • MxM is a basic kernel frequently used to showcase the peak performance of GPU computing.
    • We compare the performance of the accULL implementation with that of:
      • OpenMP.
      • CUDA.
      • OpenCL.
  • 68. MxM in accULL
    MxM OpenACC code:
      #pragma acc kernels name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
      {
          #pragma acc loop private(i, j) collapse(2)
          for (i = 0; i < L; i++)
              for (j = 0; j < N; j++)
                  a[i * L + j] = 0.0;
          /* Iterate over blocks */
          for (ii = 0; ii < L; ii += tile_size)
              for (jj = 0; jj < N; jj += tile_size)
                  for (kk = 0; kk < M; kk += tile_size) {
                      /* Iterate inside a block */
                      #pragma acc loop collapse(2) private(i, j, k)
                      for (j = jj; j < min(N, jj + tile_size); j++)
                          for (i = ii; i < min(L, ii + tile_size); i++)
                              for (k = kk; k < min(M, kk + tile_size); k++)
                                  a[i*L + j] += b[i*L + k] * c[k*M + j];
                  }
      }
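The blocked loop nest above can be checked against a naive multiply with a small Python sketch. This is an illustration, not the benchmark code; the index expressions are written for a row-major L×M by M×N product, which reduces to the slide's square case when L = M = N:

```python
def blocked_matmul(b, c, L, M, N, tile_size=2):
    """Tiled multiply of row-major b (LxM) by c (MxN), mirroring the loop nest above."""
    a = [0.0] * (L * N)
    for ii in range(0, L, tile_size):                          # iterate over blocks
        for jj in range(0, N, tile_size):
            for kk in range(0, M, tile_size):
                for j in range(jj, min(N, jj + tile_size)):    # iterate inside a block
                    for i in range(ii, min(L, ii + tile_size)):
                        for k in range(kk, min(M, kk + tile_size)):
                            a[i*N + j] += b[i*M + k] * c[k*N + j]
    return a
```

Any tile_size yields the same result as the naive triple loop; the tiling only changes the traversal order, which is what lets the inner collapse(2) loops be mapped to device threads.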
  • 69. MxM in accULL (Garoe)
  • 70. MxM in accULL (Drago)
  • 71. SRAD: an Image Filtering Code
  • 72. SRAD (Garoe)
    CUDA through Frangollo performs better than native CUDA.
  • 73. SRAD (Drago)
  • 74. NW: Needleman-Wunsch, a Sequence Alignment Code
  • 75. NW (Garoe)
    Poor results (but still better than OpenMP on 4 cores).
  • 76. NW (Drago)
  • 77. HotSpot: a Thermal Simulation Tool for Estimating Processor Temperature
  • 78. HotSpot (Garoe)
    As good as the native versions.
  • 79. HotSpot (Drago)
  • 80. Outline
    1. Introduction
    2. YaCF
    3. Experiments
    4. Conclusions
    5. Future Work
  • 81. Conclusions: Compiler Technologies
    • Compiler technology tends to use source-to-source compilers to generate, transform and optimize source code.
    • It is easier to parallelize a source code with AST transformations.
    • AST transformations enable programmers to easily generate code for any platform.
  • 82. Conclusions: Programming Model
    • Directive-based programming languages allow non-expert programmers to abstract away architectural details and write programs more easily.
    • The OpenACC standard is a starting point for heterogeneous systems programming.
    • Future versions of the OpenMP standard will include support for accelerators.
    • The results we are obtaining with accULL, our early OpenACC implementation, are promising.
  • 83. References I
    Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande.
    accULL: An OpenACC Implementation with CUDA and OpenCL Support.
    International European Conference on Parallel and Distributed Computing, 2012.

    Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande.
    Directive-based Programming for GPUs: A Comparative Study.
    The 14th IEEE International Conference on High Performance Computing and Communications.

    Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande.
    accULL: A User-directed Approach to Heterogeneous Programming.
    The 10th IEEE International Symposium on Parallel and Distributed Processing with Applications.
  • 84. Outline
    1. Introduction
    2. YaCF
    3. Experiments
    4. Conclusions
    5. Future Work
  • 85. Future Work
    • Add support for MPI combined with CUDA and OpenCL.
    • Perform new experiments with OpenACC.
    • Compare our accULL approach with PGI OpenACC and CAPS HMPP.
    • Add support for vectorization.
    • Explore FPGAs in combination with CUDA and OpenCL.
    • Introduce the LLVM Compiler Framework in the frontend.
  • 91. Thank you for your attention
    Juan José Fumero Alfonso
    jfumeroa@ull.edu.es
  • 92. YaCF: The accULL Compiler
    Undergraduate Thesis Project
    Juan José Fumero Alfonso
    Universidad de La Laguna
    June 22, 2012
