YaCF: The
accULL Compiler

Juan J. Fumero

Introduction

YaCF

Experiments

Conclusions

Future Work
                  YaCF: The accULL Compiler
                     Undergraduate Thesis Project


                     Juan José Fumero Alfonso
                      Universidad de La Laguna



                         June 22, 2012




Outline

1 Introduction
2 YaCF
3 Experiments
4 Conclusions
5 Future Work

Moore’s Law

The number of transistors on a chip doubles roughly every two years (often quoted as every 18 months).

Today's Parallel Architectures

Parallel Architectures

The solution
• More processors
• More cores per processor

Current systems are hybrid: they combine all of these options.

OpenMP: Shared Memory Programming

• An API that supports shared-memory multiprocessing (SMP) programming.
• Multi-platform.
• A directive-based approach.
• A set of compiler directives, library routines and environment variables for parallel programming.

OpenMP example

    #pragma omp parallel
    {
        /* Only the master thread records the size of the thread team. */
        #pragma omp master
        {
            nthreads = omp_get_num_threads();
        }
        /* Iterations are shared among threads; each thread keeps a private
           partial sum that the reduction clause combines at the end. */
        #pragma omp for private(x) reduction(+:sum) schedule(runtime)
        for (i = 0; i < NUM_STEPS; ++i) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        /* The implicit barrier after the loop guarantees sum is complete. */
        #pragma omp master
        {
            pi = step * sum;
        }
    }

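The OpenMP fragment on this slide omits its declarations. A minimal self-contained sketch of the same midpoint-rule computation follows; the function name `estimate_pi` and the parameter `num_steps` are assumptions made for illustration and do not come from the slides.

```c
/* Hypothetical helper: midpoint-rule estimate of pi using num_steps intervals. */
double estimate_pi(long num_steps)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    long i;

    /* Each thread accumulates a private partial sum that the reduction
       clause combines. When OpenMP is not enabled the pragma is ignored
       and the loop simply runs sequentially. */
    #pragma omp parallel for reduction(+:sum) schedule(runtime)
    for (i = 0; i < num_steps; ++i) {
        double x = (i + 0.5) * step;   /* declared inside the loop, so private */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Because of `schedule(runtime)`, the loop schedule can be chosen at run time through the `OMP_SCHEDULE` environment variable, which is one of the environment variables the slide refers to.
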
MPI: Message Passing Interface

• A language-independent message-passing standard used to program parallel applications.
• MPI’s goals are high performance, scalability and portability.

MPI example

    MPI_Comm_size(MPI_COMM_WORLD, &MPI_NUMPROCESSORS);
    MPI_Comm_rank(MPI_COMM_WORLD, &MPI_NAME);
    w = 1.0 / N;
    /* Cyclic distribution: each rank takes every MPI_NUMPROCESSORS-th interval. */
    for (i = MPI_NAME; i < N; i += MPI_NUMPROCESSORS) {
        local = (i + 0.5) * w;
        pi_mpi = pi_mpi + 4.0 / (1.0 + local * local);
    }
    /* Sum the partial results of all ranks into gpi_mpi on every rank;
       gpi_mpi * w then approximates pi. */
    MPI_Allreduce(&pi_mpi, &gpi_mpi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

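The cyclic work distribution in the MPI example can be checked without an MPI installation. The sketch below is plain sequential C, not MPI code: `partial_sum` mimics what one rank computes and `simulated_pi` stands in for the `MPI_Allreduce` sum. Both function names are hypothetical.

```c
/* Partial sum that one rank would compute under the cyclic distribution. */
double partial_sum(int rank, int nprocs, long n)
{
    double w = 1.0 / (double)n;
    double local_sum = 0.0;
    long i;

    for (i = rank; i < n; i += nprocs) {
        double x = (i + 0.5) * w;
        local_sum += 4.0 / (1.0 + x * x);
    }
    return local_sum;
}

/* Stand-in for MPI_Allreduce with MPI_SUM: adds the partial sums of all ranks. */
double simulated_pi(int nprocs, long n)
{
    double global_sum = 0.0;
    int rank;

    for (rank = 0; rank < nprocs; ++rank)
        global_sum += partial_sum(rank, nprocs, n);
    return global_sum / (double)n;   /* scale by the interval width w = 1/n */
}
```

Every interval index is visited by exactly one rank, so the result does not depend on the number of ranks apart from floating-point rounding.
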
High Performance Computing

• The most powerful computers available today.
• Systems with thousands of processors and cores.
• Very high calculation speed.
• Very expensive systems that consume a huge amount of energy.

TOP 500: High Performance Computing

• The TOP500 project ranks and details the 500 most powerful (non-distributed) known computer systems in the world.
• The project publishes an updated list of the supercomputers twice a year.

Accelerators Era

Languages for Heterogeneous Programming

CUDA
Developed by NVIDIA.
• Pros: good performance; easier to use than OpenCL.
• Con: only works with NVIDIA hardware.

Languages for Heterogeneous Programming

CUDA

    __global__ void mmkernel(float *a, float *b, float *c, int n,
                             int m, int p)
    {
        /* One thread per element of a; assumes 32 threads per block in x. */
        int i = blockIdx.x * 32 + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        /* Dot product of row i of b and column j of c (column-major storage). */
        for (int k = 0; k < p; ++k)
            sum += b[i + n * k] * c[k + p * j];
        a[i + n * j] = sum;
    }

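To make the indexing of `mmkernel` easier to follow, here is a sequential C sketch of what the whole grid computes. It assumes the kernel is launched so that `i` covers `0..n-1` and `j` covers `0..m-1`, which the slide does not state; the name `mm_reference` is hypothetical.

```c
/* Sequential reference: a (n x m) = b (n x p) * c (p x m), column-major. */
void mm_reference(float *a, const float *b, const float *c,
                  int n, int m, int p)
{
    for (int j = 0; j < m; ++j) {        /* role of blockIdx.y */
        for (int i = 0; i < n; ++i) {    /* role of blockIdx.x * 32 + threadIdx.x */
            float sum = 0.0f;
            for (int k = 0; k < p; ++k)
                sum += b[i + n * k] * c[k + p * j];
            a[i + n * j] = sum;
        }
    }
}
```

A reference like this is handy for validating the GPU result on small inputs.
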
Languages for Heterogeneous Programming

OpenCL
An open standard maintained by the Khronos Group.
• Pros: runs on devices from many vendors; it is a standard.
• Cons: more complex than CUDA; still immature.

Languages for Heterogeneous Programming

OpenCL

    __kernel void matvecmul(__global float *a,
                            const __global float *b, const __global float *c,
                            const uint N) {
        float R;
        int k;
        int xid = get_global_id(0);
        int yid = get_global_id(1);
        /* Skip work-items that fall outside the N x N matrix. */
        if (xid < N) {
            if (yid < N) {
                R = 0.0f;
                /* Dot product of row xid of b and column yid of c (row-major). */
                for (k = 0; k < N; k++)
                    R += b[xid * N + k] * c[k * N + yid];
                a[xid * N + yid] = R;
            }
        }
    }

Languages for Heterogeneous Programming

Pros
1 The programmer can use all of the machine’s devices.
2 GPU and CPU can work in parallel.

Cons
1 The programmer needs to know low-level details of the architecture.
2 Source code needs to be rewritten:
   • One version for OpenMP/MPI.
   • A different version for GPU.
3 Good performance requires great effort in parameter tuning.
4 These languages (CUDA/OpenCL) are complex and new to non-experts.

GPGPU (General Purpose GPU) Computing

Can we use GPUs for parallel computing? Is this efficient?

The NBody Problem

• The simulation numerically approximates the evolution of a system of bodies.
• Each body continuously interacts with every other body.
• Used, for example, in fluid flow simulations.

NBody description

Acceleration

\[
a_i = \frac{F_i}{m_i}
\]
\[
a_i \approx G \cdot \sum_{1 \le j \le N} \frac{m_j \, r_{ij}}{\left( \lVert r_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}}
\]

CUDA implementation

• The method is particle-to-particle (all pairs).
• Its computational complexity is O(n^2).
• It evaluates all pairwise interactions, so no interaction is approximated.

CUDA implementation: blocks and grids

CUDA Kernel: Tile calculation

    __device__ float3 gravitation(float4 myPos, float3 accel) {
        /* Body positions of the current tile, loaded by the calling kernel. */
        extern __shared__ float4 sharedPos[];
        unsigned long i = 0;

        /* Accumulate the interaction of this body with every body in the tile. */
        for (unsigned int counter = 0; counter < blockDim.x; counter++) {
            /* SX(i) is defined elsewhere; presumably it indexes sharedPos. */
            accel = bodyBodyInteraction(accel, SX(i++), myPos);
        }
        return accel;
    }

CUDA Kernel: calculate forces

    __global__ void calculate_forces(float4 *globalX, float4 *globalA)
    {
        // A shared memory buffer to store the body positions.
        extern __shared__ float4 shPosition[];
        int i, tile;
        float3 acc = {0.0f, 0.0f, 0.0f};
        // Global thread ID (represents the unique body index in the simulation).
        int gtid = blockIdx.x * blockDim.x + threadIdx.x;
        // This is the position of the body we are computing the acceleration for.
        float4 myPosition = globalX[gtid];
        for (i = 0, tile = 0; i < N; i += blockDim.x, tile++)
        {
            int idx = tile * blockDim.x + threadIdx.x;
            // Each thread loads one body of the current tile into shared memory.
            shPosition[threadIdx.x] = globalX[idx];
            __syncthreads();   // wait until the whole tile is loaded
            acc = tile_calculation(myPosition, acc);
            __syncthreads();   // wait before the tile is overwritten
        }
        // Return the result (the store is elided on the slide).
    }

Results

• GPU: Tesla C1060 (compute capability 1.3).
• Sequential source code: Intel Core i7 930.
• NBody SDK.
• CUDA Runtime / CUDA Driver: 4.0.
   • 400000 bodies.
   • 200 interactions.

Device          Cores   Memory   Peak performance (GFLOPS)
Tesla C1060     240     4 GB     933 (single), 78 (double)
Intel Core i7   4       4 GB     44.8 (11.2 per core)

Results

• Sequential code: ≈ 147202512.40 ms ≈ 40.89 hours.
• Parallel CUDA code: 1392029.6 ms ≈ 23.2 minutes.
• The speedup is ≈ 105.7×.

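The speedup follows directly from the two measured times. As a worked check, using the symbols $T_{\text{seq}}$ and $T_{\text{par}}$ (introduced here for the sequential and CUDA run times):

```latex
\[
S = \frac{T_{\text{seq}}}{T_{\text{par}}}
  = \frac{147202512.40\ \text{ms}}{1392029.6\ \text{ms}}
  \approx 105.7
\]
\[
T_{\text{seq}} = \frac{147202512.40\ \text{ms}}{3.6 \times 10^{6}\ \text{ms/h}} \approx 40.89\ \text{h},
\qquad
T_{\text{par}} = \frac{1392029.6\ \text{ms}}{6 \times 10^{4}\ \text{ms/min}} \approx 23.2\ \text{min}
\]
```
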
At the Present Time

• Some applications run faster on GPUs.
• The user needs to learn new programming languages and tools.
• The CUDA model and its architecture have to be understood.
• Non-expert users have to write programs for a new model.

GPGPU Languages

OpenACC: introduced in November 2011 at Supercomputing 2011 (SC11)
A directive-based language.
• Aims to become a standard.
• Supported by Cray, NVIDIA, PGI and CAPS.
• A single source code for all versions.
• Platform independent.
• Easier for beginners.

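As an illustration of the single-source idea, the sketch below annotates an ordinary C loop with standard OpenACC directives. The function `saxpy` and its parameters are chosen for illustration only and do not appear in the slides. A compiler without OpenACC support ignores the pragmas and produces the usual sequential code.

```c
/* y = alpha * x + y. Offloaded by an OpenACC compiler, sequential otherwise. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    /* x is only read on the device; y is copied in and back out. */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }
}
```

The same file therefore serves as both the CPU and the accelerator version, which is the main contrast with maintaining separate CUDA or OpenCL kernels.
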
A New Dimension for HPC

accULL: our OpenACC Implementation

accULL = compiler + runtime library.
accULL = YaCF + Frangollo.

Initial Objectives of this Project

• To add C99 support to the YaCF project.
• To implement a new class hierarchy for new YaCF frontends.
• To implement an OpenACC frontend.
• To complete the OpenMP grammar with the OpenMP 3.0 directives.
• To test the new C99 interface.

Source-to-source Compilers

• ROSE Compiler Framework.
• Cetus Compiler.
• Mercurium.

accULL: our OpenACC implementation

YaCF: Yet Another Compiler Framework

YaCF

• A source-to-source compiler that translates C code with OpenMP, llc and OpenACC annotations into code with Frangollo calls.
• Integrates code analysis tools.
• Written entirely in Python.
• Based on widely known object-oriented software patterns.
• Built on the pycparser Python module.
• Implementing a code transformation only takes a few lines of code.

YaCF: Architecture

YaCF: Preprocessor

YaCF: Statistics

• 20683 lines of Python code.
• 2158 functions and methods.
• My contribution is about 25% of the YaCF project.

Experiments

• ScaLAPACK benchmark: testing C99 support.
• Block matrix multiplication in accULL.
• Three different problems from the Rodinia benchmark suite:
   • HotSpot.
   • SRAD.
   • Needleman–Wunsch.

ScaLAPACK

• ScaLAPACK (Scalable LAPACK) is a library that includes a subset of LAPACK routines redesigned for distributed-memory MIMD parallel computers.
• ScaLAPACK is designed for heterogeneous computing.
• It is portable to any computer that supports MPI.
• ScaLAPACK depends on PBLAS operations.

ScaLAPACK: results in YaCF

Directory           Total C files   Success   Failures
PBLAS/SRC                123          123         0
REDIST/SRC                21           21         0
PBLAS/SRC/PTOOLS         102          101         1
PBLAS/TESTING              2            1         1
PBLAS/TIMING               2            1         1
REDIST/TESTING            10            0        10
SRC                        9            9         0
TOOLS                      2            2         0
Total                    271          258        13

95% of the ScaLAPACK C files (258 of 271) are parsed correctly by YaCF.

Platforms

• Garoe: a desktop computer with an Intel Core i7 930 processor (2.80 GHz), with 1 MB of L2 cache and 8 MB of L3 cache shared by the four cores. The system has 4 GB of RAM and an attached Tesla C2050 with 4 GB of memory.
• Drago: a second cluster node. It is a shared-memory system with four Intel Xeon E7 processors, each with 10 cores. In this case, the accelerator platform is the Intel OpenCL SDK 1.5, which runs on the CPU.

MxM in accULL

• MxM is a basic kernel frequently used to showcase the peak performance of GPU computing.
• We compare the performance of the accULL implementation with that of:
   • OpenMP.
   • CUDA.
   • OpenCL.

YaCF: The
accULL Compiler

Juan J. Fumero
                                                                                                MxM in accULL
Introduction

YaCF

Experiments

Conclusions
                  MxM OpenACC code

                  #pragma acc kernels name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
                  {
                    /* Zero the L x N result (row-major: a and c have row stride N, b has M) */
                    #pragma acc loop private(i, j) collapse(2)
                    for (i = 0; i < L; i++)
                      for (j = 0; j < N; j++)
                        a[i * N + j] = 0.0;
                    /* Iterate over blocks */
                    for (ii = 0; ii < L; ii += tile_size)
                      for (jj = 0; jj < N; jj += tile_size)
                        for (kk = 0; kk < M; kk += tile_size) {
                          /* Iterate inside a block */
                          #pragma acc loop collapse(2) private(i, j, k)
                          for (j = jj; j < min(N, jj + tile_size); j++)
                            for (i = ii; i < min(L, ii + tile_size); i++)
                              for (k = kk; k < min(M, kk + tile_size); k++)
                                a[i * N + j] += b[i * M + k] * c[k * N + j];
                        }
                  }
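
The blocked loop nest above can be checked on the host, without any accelerator, by comparing it against a plain triple loop. The sketch below does that under stated assumptions: the names `MIN`, `mxm_naive`, `mxm_tiled` and `tiled_matches_naive` are hypothetical, the matrices are row-major (row stride N for a and c, M for b), and `MIN` stands in for the `min` helper used on the slide, which is not part of standard C.

```c
#include <assert.h>
#include <stdlib.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Reference product: a (L x N) = b (L x M) * c (M x N), row-major. */
static void mxm_naive(double *a, const double *b, const double *c,
                      int L, int M, int N)
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < M; k++)
                sum += b[i * M + k] * c[k * N + j];
            a[i * N + j] = sum;
        }
}

/* Same blocking scheme as the slide, run sequentially on the CPU. */
static void mxm_tiled(double *a, const double *b, const double *c,
                      int L, int M, int N, int tile)
{
    assert(tile > 0);
    for (int i = 0; i < L * N; i++)
        a[i] = 0.0;
    for (int ii = 0; ii < L; ii += tile)
        for (int jj = 0; jj < N; jj += tile)
            for (int kk = 0; kk < M; kk += tile)
                for (int j = jj; j < MIN(N, jj + tile); j++)
                    for (int i = ii; i < MIN(L, ii + tile); i++)
                        for (int k = kk; k < MIN(M, kk + tile); k++)
                            a[i * N + j] += b[i * M + k] * c[k * N + j];
}

/* Returns 1 if both versions agree on small integer-valued inputs, else 0. */
int tiled_matches_naive(int L, int M, int N, int tile)
{
    double *b = malloc(sizeof *b * L * M);
    double *c = malloc(sizeof *c * M * N);
    double *ref = malloc(sizeof *ref * L * N);
    double *out = malloc(sizeof *out * L * N);
    int ok = (b && c && ref && out);

    if (ok) {
        /* Integer-valued entries keep every partial sum exact in double. */
        for (int i = 0; i < L * M; i++) b[i] = (double)(i % 7);
        for (int i = 0; i < M * N; i++) c[i] = (double)(i % 5);
        mxm_naive(ref, b, c, L, M, N);
        mxm_tiled(out, b, c, L, M, N, tile);
        for (int i = 0; i < L * N; i++)
            if (ref[i] != out[i])
                ok = 0;
    }
    free(b); free(c); free(ref); free(out);
    return ok;
}
```

Using sizes that are not multiples of the tile size, for example `tiled_matches_naive(5, 7, 3, 2)`, exercises the `MIN` bounds that clip the last partial block in each dimension.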




                  MxM in accULL (Garoe)




                  MxM in accULL (Drago)




                  SRAD: an Image Filtering Code




                  SRAD (Garoe)




                  The CUDA version running through Frangollo performs better than the native CUDA version.

                  SRAD (Drago)




                  NW: Needleman-Wunsch, a Sequence Alignment Code




                  NW (Garoe)




                  Poor results, although still better than OpenMP on 4 cores.
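
For context, Needleman-Wunsch fills a dynamic-programming table in which each cell depends on its upper, left and upper-left neighbours. The sketch below is a minimal sequential version that returns only the alignment score. It is not the Rodinia kernel used in the experiments: the function name `nw_score` and the simple match/mismatch/gap scoring parameters are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Global alignment score of s and t (hypothetical scoring scheme:
 * `match` and `mismatch` per aligned pair, `gap` per inserted gap).
 * Only two rows of the table are kept, since row i needs only row i-1. */
int nw_score(const char *s, const char *t, int match, int mismatch, int gap)
{
    assert(s && t);
    size_t n = strlen(s), m = strlen(t);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr) {
        free(prev);
        free(curr);
        abort();
    }

    /* First row: aligning a prefix of t against nothing costs j gaps. */
    for (size_t j = 0; j <= m; j++)
        prev[j] = (int)j * gap;

    for (size_t i = 1; i <= n; i++) {
        curr[0] = (int)i * gap;               /* first column: i gaps */
        for (size_t j = 1; j <= m; j++) {
            int diag = prev[j - 1] + (s[i - 1] == t[j - 1] ? match : mismatch);
            int up   = prev[j] + gap;         /* gap in t */
            int left = curr[j - 1] + gap;     /* gap in s */
            curr[j] = max3(diag, up, left);
        }
        int *tmp = prev;                      /* reuse the two row buffers */
        prev = curr;
        curr = tmp;
    }

    int score = prev[m];
    free(prev);
    free(curr);
    return score;
}
```

Because every cell needs its left and upper neighbours, only cells on the same anti-diagonal of the table are independent of each other, which limits how much of the table can be computed in parallel at any one time.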

                  NW (Drago)




                  HotSpot: a Thermal Simulation Tool for Estimating Processor Temperature




                  HotSpot (Garoe)




                  As good as native versions.

                  HotSpot (Drago)




                  Conclusions: Compiler Technologies




                  • Compiler technologies tend to use and optimize source-to-source
                    compilers to generate and transform source code.
                  • It is easier to parallelize source code with AST transformations.
                  • AST transformations enable programmers to easily generate
                    code for any platform.




                  Conclusions: Programming Model

                  • The use of directive-based programming languages allows
                    non-expert programmers to abstract away architectural details
                    and write programs more easily.
                  • The OpenACC standard is a starting point for heterogeneous
                    systems programming.
                  • Future versions of the OpenMP standard will include support for
                    accelerators.
                  • The results we are obtaining with accULL, our early OpenACC
                    implementation, are promising.




                  References I
                  Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande.
                  accULL: An OpenACC implementation with CUDA and OpenCL support.
                  International European Conference on Parallel and Distributed
                  Computing, 2012.

                  Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande.
                  Directive-based Programming for GPUs: A Comparative Study.
                  The 14th IEEE International Conference on High Performance
                  Computing and Communications.

                  Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande.
                  accULL: an user-directed Approach to Heterogeneous Programming.
                  The 10th IEEE International Symposium on Parallel and
                  Distributed Processing with Applications.


                                                   Future Work
                  • Add support for MPI combined with CUDA and OpenCL.
                  • Perform new experiments with OpenACC.
                  • Compare our accULL approach with PGI-OpenACC and
                    CAPS-HMPP.
                  • Add support for vectorization.
                  • Explore combining FPGAs with CUDA and OpenCL.
                  • Introduce the LLVM compiler framework in the frontend.




                  Thank you for your attention




                    Juan José Fumero Alfonso
                       jfumeroa@ull.edu.es





Yacf

  • 1.
    YaCF: The accULL Compiler JuanJ. Fumero Introduction YaCF Experiments Conclusions Future Work YaCF: The accULL Compiler Undergraduate Thesis Project Juan Jos´ Fumero Alfonso e Universidad de La Laguna 22 de junio de 2012 1 / 85
  • 2.
    YaCF: The accULL Compiler JuanJ. Fumero Outline Introduction YaCF Experiments Conclusions 1 Introduction Future Work 2 YaCF 3 Experiments 4 Conclusions 5 Future Work 2 / 85
  • 3.
    YaCF: The accULL Compiler JuanJ. Fumero Outline Introduction YaCF Experiments Conclusions 1 Introduction Future Work 2 YaCF 3 Experiments 4 Conclusions 5 Future Work 3 / 85
  • 4.
    YaCF: The accULL Compiler JuanJ. Fumero Moore’s Law Introduction YaCF Experiments Conclusions Future Work Every 18 months the number of transistors could be doubled. 4 / 85
  • 5.
    YaCF: The accULL Compiler JuanJ. Fumero Nowadays Parallel Architectures Introduction YaCF Experiments Conclusions Future Work 5 / 85
  • 6.
    YaCF: The accULL Compiler JuanJ. Fumero Parallel Architectures Introduction YaCF Experiments Conclusions Future Work The solution • More processors • More cores per processor 6 / 85
  • 7.
    YaCF: The accULL Compiler JuanJ. Fumero Parallel Architectures Introduction YaCF Experiments Conclusions Future Work The systems are hybrid using all options. 7 / 85
  • 8.
    YaCF: The accULL Compiler JuanJ. Fumero Parallel Architectures Introduction YaCF Experiments Conclusions Future Work 8 / 85
  • 9.
    YaCF: The accULL Compiler JuanJ. Fumero OpenMP: Shared Memory Introduction YaCF Programming Experiments • API that support SMP programming. Conclusions • Multi-platform. Future Work • A directive-based approach. • A set of compiler directives, library routines and environment variables for parallel programming. OpenMP example 1 #pragma omp p a r a l l e l 2 { 3 #pragma omp master 4 { 5 nthreads = o m p _ g e t _ n u m _ t h r e a d s ( ) ; 6 } 7 #pragma omp f o r p r i v a t e ( x ) reduction (+: sum ) schedule ( runtime ) 8 f o r ( i =0; i < NUM_STEPS ; ++i ) { 9 x = ( i +0.5)∗step ; 10 sum = sum + 4 . 0 / ( 1 . 0 + x∗x ) ; 11 } 12 #pragma omp master 13 { 14 pi = step ∗ sum ; 15 } 16 } 9 / 85
  • 10.
    YaCF: The accULL Compiler JuanJ. Fumero MPI: Message Passing Interface Introduction YaCF Experiments Conclusions Future Work • A language-independent communications protocol used to program parallel applications. • MPI’s goals are high performance, scalability and portability. MPI example 1 MPI_Comm_size ( MPI_COMM_WORLD , &M P I _ N U M P R O C E S S O R S ) ; 2 MPI_Comm_rank ( MPI_COMM_WORLD , &MPI_NAME ) ; 3 w = 1.0 / N ; 4 f o r ( i = MPI_NAME ; i < N ; i += M P I _ N U M P R O C E S S O R S ) { 5 local = ( i + 0 . 5 ) ∗ w ; 6 pi_mpi = pi_mpi + 4 . 0 / ( 1 . 0 + local ∗ local ) ; 7 } 8 MPI_Allreduce (&pi_mpi , &gpi_mpi , 1 , MPI_DOUBLE , MPI_SUM , MPI_C OMM_WOR LD ) ; 10 / 85
  • 11.
    YaCF: The accULL Compiler JuanJ. Fumero High Performance Computing Introduction YaCF Experiments • The most powerful computers at the moment. Conclusions • Systems with a massive number of processors. Future Work • High speed of calculation. • It contains thousands of processors and cores. • Systems very expensive and consuming a huge amount of energy. 11 / 85
  • 12.
    YaCF: The accULL Compiler JuanJ. Fumero TOP 500: High Performance Introduction YaCF Computing Experiments Conclusions • The TOP500 project ranks and details the 500 (non-distributed) Future Work most powerful known computer systems in the world. • The project publishes an updated list of the supercomputers twice a year. 12 / 85
  • 13.
    YaCF: The accULL Compiler JuanJ. Fumero Accelerators Era Introduction YaCF Experiments Conclusions Future Work 13 / 85
  • 14.
    YaCF: The accULL Compiler JuanJ. Fumero Languages for Heterogeneous Introduction YaCF Programming Experiments Conclusions CUDA Future Work Developed by NVIDIA. • Pros: its performance, it is easier than OpenCL. • Con: only works with NVIDIA hardware. 14 / 85
  • 15.
    YaCF: The accULL Compiler JuanJ. Fumero Languages for Heterogeneous Introduction YaCF Programming Experiments Conclusions Future Work CUDA 1 __global__ v o i d mmkernel ( f l o a t ∗ a , f l o a t ∗ b , f l o a t ∗ c , i n t n , 2 int m , int p) 3 { 4 i n t i = blockIdx . x∗32 + threadIdx . x ; 5 i n t j = blockIdx . y ; 6 f l o a t sum = 0 . 0 f ; 7 f o r ( i n t k = 0 ; k < p ; ++k ) sum += b [ i+n∗k ] ∗ c [ k+p∗j ] ; 8 a [ i+n∗j ] = sum ; 9 } 15 / 85
  • 16.
    YaCF: The accULL Compiler JuanJ. Fumero Languages for Heterogeneous Introduction YaCF Programming Experiments Conclusions Future Work OpenCL A framework developed by the Khronos Group. • Pros: can be used with any device, it is a standard. • Cons: more complex than CUDA, immature. 16 / 85
  • 17.
    YaCF: The accULL Compiler JuanJ. Fumero Languages for Heterogeneous Introduction YaCF Programming Experiments Conclusions Future Work OpenCL 1 __kernel v o i d matvecmul ( __global f l o a t ∗a , 2 c o n s t __global f l o a t ∗b , c o n s t __global f l o a t ∗c , 3 c o n s t uint N ) { 4 float R; 5 int k; 6 i n t xid = get_global_id ( 0 ) ; 7 i n t yid = get_global_id ( 1 ) ; 8 i f ( xid < N ) { 9 i f ( yid < N ) { 10 R = 0.0; 11 f o r ( k = 0 ; k < N ; k++) 12 R += b [ xid ∗ N + k ] ∗ c [ k∗N + yid ] ; 13 a [ xid∗N+yid ] = R ; 14 } 15 } 16 } 17 / 85
  • 18.
    YaCF: The accULL Compiler JuanJ. Fumero Languages for Heterogeneous Introduction YaCF Programming Experiments Conclusions Pros Future Work 1 The programmer can use all machine’s devices. 2 GPU and CPU could work in parallel. 18 / 85
  • 19.
    YaCF: The accULL Compiler JuanJ. Fumero Languages for Heterogeneous Introduction YaCF Programming Experiments Conclusions Problems Future Work 1 The programmer needs to know low-level details of the architecture. 19 / 85
  • 20.
    YaCF: The accULL Compiler JuanJ. Fumero Languages for Heterogeneous Introduction YaCF Programming Experiments Conclusions Future Work Cons 1 The programmer needs to know low-level details of the architecture. 2 Source codes need to be rewritten: • One version for OpenMP/MPI. • A different version for GPU. 3 Good performance requires a great effort in parameter tuning. 4 These languages (CUDA/OpenCL) are complex and new for non-experts. 20 / 85
  • 21.
    YaCF: The accULL Compiler JuanJ. Fumero GPGPU (General Purpose GPU) Introduction YaCF Computing Experiments Conclusions Future Work Can we use GPUs for parallel computing? Is this efficient? 21 / 85
  • 22.
    YaCF: The accULL Compiler JuanJ. Fumero The NBody Problem Introduction YaCF Experiments Conclusions Future Work • Simulation numerically approximates the evolution of a system of bodies. • Each body continuously interacts with other bodies. • Fluid flow simulations. 22 / 85
  • 23.
    YaCF: The accULL Compiler JuanJ. Fumero NBody description Introduction YaCF Experiments Conclusions Future Work Acceleration Fi ai = mi mj rij ai ≈ G · (||rij ||2 + 2 )3/2 1≤j≤N 23 / 85
  • 24.
    YaCF: The accULL Compiler JuanJ. Fumero CUDA implementation Introduction YaCF Experiments Conclusions Future Work • The method is Particle to Particle. • Its computational complexity is O(n2 ) • Evaluate all pair-wise interactions. It is exact. 24 / 85
  • 25.
    YaCF: The accULL Compiler JuanJ. Fumero CUDA implementation: blocks and Introduction YaCF grids Experiments Conclusions Future Work 25 / 85
  • 26.
    YaCF: The accULL Compiler JuanJ. Fumero CUDA Kernel: Tile calculation Introduction YaCF Experiments Conclusions Future Work 1 __device__ float3 gravitation ( float4 myPos , float3 accel ) { 2 e x t e r n __shared__ float4 sharedPos [ ] ; 3 unsigned long i = 0; 4 5 f o r ( u n s i g n e d i n t counter = 0 ; counter < blockDim . x ; counter++ ) 6 { 7 accel = b o d y B o d y I n t e r a c t i o n ( accel , SX ( i++) , myPos ) ; 8 } 9 r e t u r n accel ; 10 } 26 / 85
  • 27.
    YaCF: The accULL Compiler JuanJ. Fumero CUDA Kernel: calculate forces Introduction YaCF Experiments Conclusions Future Work 1 __global__ v o i d c al c u l a t e _ f o r c es ( float4∗ globalX , float4∗ globalA ) 2 { 3 // A s h a r e d memory b u f f e r t o s t o r e t h e body p o s i t i o n s . 4 e x t e r n __shared__ float4 [ ] shPosition ; 5 float4 myPosition ; 6 i n t i , tile ; 7 float3 a c c = {0.0 f , 0 . 0 f , 0 . 0 f }; 8 // G l o b a l t h r e a d ID ( r e p r e s e n t t h e u n i q u e body i n d e x i n t h e s i m u l a t i o n ) 9 i n t gtid = blockIdx . x ∗ blockDim . x + threadIdx . x ; 10 // T h i s i s t h e p o s i t i o n o f t h e body we a r e c o m p u t i n g t h e a c c e l e r a t i o n f o r . 11 float4 myPosition = globalX [ gtid ] ; 12 f o r ( i = 0 , tile = 0 ; i < N ; i += blockDim . x , tile++) 13 { 14 i n t idx = tile ∗ blockDim . x + threadIdx . x ; 15 shPosition [ threadIdx . x ] = globalX [ idx ] ; 16 __syncthreads ( ) ; 17 a c c = t il e_ ca lc u l a t i on ( myPosition , a c c ) ; 18 __syncthreads ( ) ; 19 } 20 // r e t u r n 21 } 27 / 85
  • 28.
    YaCF: The accULL Compiler JuanJ. Fumero Results Introduction • Tesla C1060 (1.3). YaCF • Sequential source code: Intel Corei7 930. Experiments Conclusions • NBody SDK. Future Work • Cuda Runtime /Cuda Driver: 4.0. • 400000 bodies • 200 interactions. Device Cores Memory Performance (GFLOPS) Tesla C1060 240 4GB 933 (Single), 78 (double) Intel Corei7 4 4GB 44.8 (11.2 per core) 28 / 85
  • 29.
    YaCF: The accULL Compiler JuanJ. Fumero Results Introduction YaCF Experiments Conclusions • Sequential code: ≈ 147202512.40 ms ≈ 41 hours (40.89 hours) Future Work • Parallel CUDA code: 1392029.6 ms = (23.3 minutes) • The speedup is 105.7 (105×). 29 / 85
  • 30.
    YaCF: The accULL Compiler JuanJ. Fumero At the Present Time Introduction YaCF Experiments Conclusions Future Work • Some applications accelerate with GPUs. • The user need to learn new programming languages and tools. • The CUDA model and its architecture have to be understood. • Non-expert users have to write programs for a new model. 30 / 85
  • 31.
    YaCF: The accULL Compiler JuanJ. Fumero GPGPU Languages Introduction YaCF Experiments Conclusions Future Work OpenACC: introduced last November in SuperComputing’2011 A directive based language. • Aimed to be standard. • Supported by: Cray, NVIDIA, PGI and CAPS. • One simple source code for all versions. • Platform independent. • Easier for beginners. 31 / 85
  • 32.
    YaCF: The accULL Compiler JuanJ. Fumero GPGPU Languages Introduction YaCF Experiments OpenACC Conclusions A directive based language. Future Work 32 / 85
  • 33.
    YaCF: The accULL Compiler JuanJ. Fumero A New Dimension for HPC Introduction YaCF Experiments Conclusions Future Work 33 / 85
  • 34.
    YaCF: The accULL Compiler JuanJ. Fumero accULL: our OpenACC Introduction YaCF Implementation Experiments Conclusions Future Work accULL = compiler + runtime library. 34 / 85
  • 35.
    YaCF: The accULL Compiler JuanJ. Fumero accULL: our OpenACC Introduction YaCF Implementation Experiments Conclusions Future Work accULL = compiler + runtime library. accULL = YaCF + Frangollo. 34 / 85
  • 36.
    YaCF: The accULL Compiler JuanJ. Fumero Initial Objectives of this Project Introduction YaCF Experiments Conclusions Future Work • To integrate C99 in the YaCF project. • To implement a new class hierarchy for new YaCF Frontends. • To implement an OpenACC Frontend. • To complete the OpenMP grammar with directives in OpenMP 3.0. • To test the new C99 interface. 35 / 85
  • 37.
    YaCF: The accULL Compiler JuanJ. Fumero Source-to-source Compilers Introduction YaCF Experiments Conclusions Future Work • Rose Compiler Framework. • Cetus Compiler. • Mercurium. 36 / 85
  • 38.
    YaCF: The accULL Compiler JuanJ. Fumero Outline Introduction YaCF Experiments Conclusions 1 Introduction Future Work 2 YaCF 3 Experiments 4 Conclusions 5 Future Work 37 / 85
  • 39.
    YaCF: The accULL Compiler JuanJ. Fumero accULL: our OpenACC Introduction YaCF implementation Experiments Conclusions Future Work 38 / 85
  • 40.
    YaCF: The accULL Compiler JuanJ. Fumero accULL: our OpenACC Introduction YaCF implementation Experiments Conclusions Future Work 39 / 85
  • 41.
    YaCF: The accULL Compiler JuanJ. Fumero accULL: our OpenACC Introduction YaCF implementation Experiments Conclusions Future Work 40 / 85
  • 42.
    YaCF: The accULL Compiler JuanJ. Fumero accULL: our OpenACC Introduction YaCF implementation Experiments Conclusions Future Work 41 / 85
  • 43.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Yet Another Compiler Introduction YaCF Framework Experiments Conclusions Future Work 42 / 85
  • 44.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF Introduction YaCF Experiments Conclusions Future Work • A source-to-source compiler that translates C code with OpenMP, llc and OpenACC annotations into code with Frangollo calls. • Integrates code analysis tools. • Completely written in Python. • Based on widely known object oriented software patterns. • Based on the pycparser Python module. • Implementing code transformation is only a matter of writing a few lines of code. 43 / 85
  • 45.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 44 / 85
  • 46.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 45 / 85
  • 47.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 46 / 85
  • 48.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 47 / 85
  • 49.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 48 / 85
  • 50.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 49 / 85
  • 51.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 50 / 85
  • 52.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 51 / 85
  • 53.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Preprocessor Introduction YaCF Experiments Conclusions Future Work 52 / 85
  • 54.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Preprocessor Introduction YaCF Experiments Conclusions Future Work 53 / 85
  • 55.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Preprocessor Introduction YaCF Experiments Conclusions Future Work 54 / 85
  • 56.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Preprocessor Introduction YaCF Experiments Conclusions Future Work 55 / 85
  • 57.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 56 / 85
  • 58.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Architecture Introduction YaCF Experiments Conclusions Future Work 57 / 85
  • 59.
    YaCF: The accULL Compiler JuanJ. Fumero YaCF: Statistics Introduction YaCF Experiments Conclusions Future Work • 20683 lines of Python code. • 2158 functions and methods. • My contribution has been about 25 % of YaCF project. 58 / 85
  • 60.
    YaCF: The accULL Compiler JuanJ. Fumero Outline Introduction YaCF Experiments Conclusions 1 Introduction Future Work 2 YaCF 3 Experiments 4 Conclusions 5 Future Work 59 / 85
  • 61.
    YaCF: The accULL Compiler JuanJ. Fumero Experiments Introduction YaCF Experiments Conclusions Future Work • Benchmark Scalapack: testing C99. • Block Matrix Multiplication in accULL. • Three different problems from the Rodinia Benchmark: • HotSpot. • SRAD. • Needleman–Wunsch. 60 / 85
  • 62.
    YaCF: The accULL Compiler JuanJ. Fumero ScaLAPACK Introduction YaCF Experiments Conclusions Future Work • The ScaLAPACK (Scalable LAPACK) is a library that includes a subset of LAPACK routines redesigned for distributed memory MIMD parallel computers. • ScaLAPACK is designed for heterogeneous computing. • It is portable to any computer that support MPI. • Scalable depends on PBLAS operations. 61 / 85
  • 63.
    YaCF: The accULL Compiler JuanJ. Fumero ScaLAPACK: results in YaCF Introduction YaCF Experiments Conclusions Directory Total C files Success Failures Future Work PBLAS/SRC 123 123 0 REDIST/SRC 21 21 0 PBLAS/SRC/PTOOLS 102 101 1 PBLAS/TESTING 2 1 1 PBLAS/TIMING 2 1 1 REDIST/TESTING 10 0 10 SRC 9 9 0 TOOLS 2 2 0 Total 271 258 13 62 / 85
  • 64.
    YaCF: The accULL Compiler JuanJ. Fumero ScaLAPACK: results in YaCF Introduction YaCF Experiments Conclusions Directory Total C files Success Failures Future Work PBLAS/SRC 123 123 0 REDIST/SRC 21 21 0 PBLAS/SRC/PTOOLS 102 101 1 PBLAS/TESTING 2 1 1 PBLAS/TIMING 2 1 1 REDIST/TESTING 10 0 10 SRC 9 9 0 TOOLS 2 2 0 Total 271 258 13 95 % of the ScaLAPACK C files are correctly parsed in YaCF. 62 / 85
  • 65.
    YaCF: The accULL Compiler JuanJ. Fumero Platforms Introduction YaCF Experiments Conclusions • Garoe: A desktop computer with an Intel Core i7 930 processor Future Work (2.80 GHz), with 1MB of L2 cache, 8MB of L3 cache, shared by the four cores. The system has 4 GB RAM and a Tesla C2050 with 4 GB of memory attached. 63 / 85
  • 66.
    YaCF: The accULL Compiler JuanJ. Fumero Platforms Introduction YaCF Experiments Conclusions • Drago: A second cluster node. It is a shared memory system Future Work with 4 Intel Xeon E7. Each processor has 10 cores. In this case, the accelerator platform is Intel OpenCL SDK 1.5 which runs on the CPU. 64 / 85
  • 67.
    YaCF: The accULL Compiler JuanJ. Fumero MxM in accULL Introduction YaCF Experiments Conclusions Future Work • MxM is a basic kernel frequently used to showcase the peak performance of GPU computing. • We compare the performance of the accULL implementation with that of: • OpenMP. • CUDA. • OpenCL. 65 / 85
  • 68.
    YaCF: The accULL Compiler JuanJ. Fumero MxM in accULL Introduction YaCF Experiments Conclusions MxM OpenACC code Future Work 1 #pragma a c c k e r n e l s name ( " mxm " ) c o p y ( a [ L∗N ] ) c o p y i n ( b [ L∗M] , c [M∗N ] ) 2 { 3 #pragma a c c l o o p p r i v a t e ( i , j ) c o l l a p s e ( 2 ) 4 f o r ( i = 0 ; i < L ; i++) 5 f o r ( j = 0 ; j < N ; j++) 6 a[i ∗ L + j] = 0.0; 7 /∗ I t e r a t e o v e r b l o c k s ∗/ 8 f o r ( ii = 0 ; ii < L ; ii += tile_size ) 9 f o r ( jj = 0 ; jj < N ; jj += tile_size ) 10 f o r ( kk = 0 ; kk < M ; kk += tile_size ) { 11 /∗ I t e r a t e i n s i d e a b l o c k ∗/ 12 #pragma a c c l o o p collapse ( 2 ) p r i v a t e ( i , j , k ) 13 f o r ( j=jj ; j < min ( N , jj+tile_size ) ; j++) 14 f o r ( i=ii ; i < min ( L , ii+tile_size ) ; i++) 15 f o r ( k=kk ; k < min ( M , kk+tile_size ) ; k++) 16 a [ i∗L+j ] += ( b [ i∗L+k ] ∗ c [ k∗M+j ] ) ; 17 } 18 } 66 / 85
  • 69.
    YaCF: The accULL Compiler JuanJ. Fumero MxM in accULL (Garoe) Introduction YaCF Experiments Conclusions Future Work 67 / 85
  • 70.
    YaCF: The accULL Compiler JuanJ. Fumero MxM in accULL (Drago) Introduction YaCF Experiments Conclusions Future Work 68 / 85
  • 71.
    YaCF: The accULL Compiler JuanJ. Fumero SRAD: an Image Filtering Code Introduction YaCF Experiments Conclusions Future Work 69 / 85
  • 72.
    YaCF: The accULL Compiler JuanJ. Fumero SRAD (Garoe) Introduction YaCF Experiments Conclusions Future Work CUDA in Frangollo performs better than CUDA native. 70 / 85
  • 73.
    YaCF: The accULL Compiler JuanJ. Fumero SRAD (Drago) Introduction YaCF Experiments Conclusions Future Work 71 / 85
  • 74.
    YaCF: The accULL Compiler JuanJ. Fumero NW: Needleman-Wunsch, a Introduction YaCF Sequence Alignment Code Experiments Conclusions Future Work 72 / 85
  • 75.
    YaCF: The accULL Compiler JuanJ. Fumero NW (Garoe) Introduction YaCF Experiments Conclusions Future Work Poor results (but better than OpenMP - 4 cores) 73 / 85
  • 76.
    YaCF: The accULL Compiler JuanJ. Fumero NW (Drago) Introduction YaCF Experiments Conclusions Future Work 74 / 85
  • 77.
    YaCF: The accULL Compiler JuanJ. Fumero HotSpot: a Thermal Simulation Introduction YaCF Tool for Estimating Processor Experiments Temperature Conclusions Future Work 75 / 85
  • 78.
    YaCF: The accULL Compiler JuanJ. Fumero HotSpot (Garoe) Introduction YaCF Experiments Conclusions Future Work As good as native versions. 76 / 85
  • 79.
    YaCF: The accULL Compiler JuanJ. Fumero HotSpot (Drago) Introduction YaCF Experiments Conclusions Future Work 77 / 85
  • 80.
    YaCF: The accULL Compiler JuanJ. Fumero Outline Introduction YaCF Experiments Conclusions 1 Introduction Future Work 2 YaCF 3 Experiments 4 Conclusions 5 Future Work 78 / 85
  • 81.
    YaCF: The accULL Compiler JuanJ. Fumero Conclusions: Compiler Introduction YaCF Technologies Experiments Conclusions Future Work • Compiler technologies tend to use and optimize source-to-source compilers to generate and transform source code. • It is easier to parallelize a source code with AST transformations. • AST transformations enable to programmers to easily generate code for any platform. 79 / 85
  • 82.
    YaCF: The accULL Compiler JuanJ. Fumero Conclusions: Programming Model Introduction YaCF Experiments Conclusions Future Work • The usage of directive-based programming languages allow non-expert programmers to abstract from architectural details and write programs easier. • The OpenACC standard is a start point to heterogeneous systems programming. • Future versions of the OpenMP standard will include support for accelerators. • The results we are obtaining with accULL our early OpenACC implementation are promising. 80 / 85
  • 83.
    References I
      • Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande. accULL: An OpenACC implementation with CUDA and OpenCL support. International European Conference on Parallel and Distributed Computing 2012.
      • Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande. Directive-based Programming for GPUs: A Comparative Study. The 14th IEEE International Conference on High Performance Computing and Communications.
      • Ruymán Reyes, Iván López, Juan J. Fumero, F. de Sande. accULL: an user-directed Approach to Heterogeneous Programming. The 10th IEEE International Symposium on Parallel and Distributed Processing with Applications.
  • 84.
    Outline: 1 Introduction, 2 YaCF, 3 Experiments, 4 Conclusions, 5 Future Work
  • 85.
    Future Work
      • Add support for MPI with CUDA and OpenCL.
      • Perform new experiments with OpenACC.
      • Compare our accULL approach with PGI-OpenACC and CAPS-HMPP.
      • Add support for vectorization.
      • Explore FPGAs in combination with CUDA and OpenCL.
      • Introduce the LLVM compiler framework in the frontend.
  • 91.
    Thank you for your attention
      Juan José Fumero Alfonso
      jfumeroa@ull.edu.es
  • 92.
    YaCF: The accULL Compiler
      Undergraduate Thesis Project
      Juan José Fumero Alfonso
      Universidad de La Laguna
      June 22, 2012