1. RAMSES @CSCS
Kotsalos Christos & Claudio Gheller
Refactoring of the RAMSES code and performance optimisation on CPUs and GPUs
2. RAMSES: modular physics
[Module diagram, executed once per time step: AMR build; Domain decomposition / Load balancing; Gravity; Hydro; MHD; N-body; Cooling; Star formation; RT; Other physics]
3. Our goal
[Module diagram as in slide 2; the physics modules are targeted for the GPU using OpenACC directives]
❖ Recognise the GPU-friendly parts: computational intensity + data independence
❖ Minimise the data transfer: GPU <—> CPU communication
❖ GPU-to-GPU communication through GPUDirect & communication in general
❖ GPU porting and optimisation
(the first three items constitute the infrastructure)
4. Problems to overcome!
[Module diagram as in slide 2]
Dependencies between the modules
The real amr_step is more complex than this simplified representation
Recursive calls
Communicators (OpenACC & memory)
The number of grids depends on the level of refinement, which causes GPU-porting issues: it is difficult to fit the loops to the architecture!
5. 1st part of the Project
Redesign communication for GPUs & CPUs:
❖ Is the CPU communication optimal?
❖ Is the communication suitable for GPU programming?
6. Communication between the subdomains
• Point-to-Point Communication :: ISend <—> IRecv
• The subdomains exchange their solutions with their physical neighbours
• Everything through the communicators
MPI: Send Buffer <—> Recv Buffer
emission communicator (structure / derived data types)
reception communicator (structure / derived data types)
Stored information:
‣ Number of cells
‣ Cell index
‣ Double precision arrays
‣ Single precision arrays
Advantages:
Elegant Structure
Disadvantages:
Data locality issues
Not fully supported by OpenACC
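A minimal sketch of what such a derived-type communicator can look like in Fortran (the field names are illustrative, not the actual RAMSES definitions):

! Hypothetical emission/reception communicator holding the information
! listed above; names are illustrative, not RAMSES' own.
type communicator_t
   integer :: ncell                        ! number of cells to exchange
   integer,      allocatable :: icell(:)   ! cell indices
   real(kind=8), allocatable :: u(:,:)     ! double precision arrays
   real(kind=4), allocatable :: f(:,:)     ! single precision arrays
end type communicator_t

! one emission and one reception communicator per neighbouring PE:
type(communicator_t), allocatable :: emission(:), reception(:)

Because the payload lives in allocatable components, every buffer is a separate allocation: this is the source of the data locality issues and of the incomplete OpenACC support noted above.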
7. Type of Communication: Point-to-Point or Collective ?
Point-to-point: ISend & IRecv
‣ Original implementation in RAMSES
Collective: Alltoall(v)
‣ The data to be communicated are gathered in arrays (one per PE)
‣ These arrays (of intrinsic data types) are scattered from all PEs to all PEs
8. OpenACC & Allocatable Derived data types
• Not fully supported (only by the Cray compiler, and even there partially)
• No GPUDirect support (GPU-to-GPU communication)
Two solutions to overcome this problem
❖ Use collective communication (the buffers are of intrinsic data types): check the performance
❖ Replace the communicators with regular arrays, locally or globally: a global replacement touches the whole code and is not an easy solution
9. GPUDirect
❖ GPU-to-GPU communication
❖ The same calls as the regular MPI but:
export MPICH_RDMA_ENABLED_CUDA=1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

Embed the MPI calls in host_data regions, so that MPI is handed the device addresses of the buffers:

!$acc host_data use_device(send_buf)
call MPI_ISEND( normal arguments, with send_buf )
!$acc end host_data
!$acc host_data use_device(recv_buf)
call MPI_IRECV( normal arguments, with recv_buf )
!$acc end host_data
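A minimal self-contained sketch of the pattern (buffer names, tag and neighbour rank are illustrative):

! Hypothetical GPUDirect exchange with one neighbouring rank; the
! buffers are assumed to be resident on the GPU already.
subroutine exchange_gpu(send_buf, recv_buf, n, neighbour)
   use mpi
   implicit none
   integer, intent(in) :: n, neighbour
   real(kind=8), intent(in)  :: send_buf(n)
   real(kind=8), intent(out) :: recv_buf(n)
   integer :: req(2), ierr
   !$acc host_data use_device(send_buf, recv_buf)
   call MPI_IRECV(recv_buf, n, MPI_DOUBLE_PRECISION, neighbour, 0, &
                  MPI_COMM_WORLD, req(1), ierr)
   call MPI_ISEND(send_buf, n, MPI_DOUBLE_PRECISION, neighbour, 0, &
                  MPI_COMM_WORLD, req(2), ierr)
   !$acc end host_data
   call MPI_WAITALL(2, req, MPI_STATUSES_IGNORE, ierr)
end subroutine exchange_gpu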
10. AlltoAll
[Diagram: each PE fills one fixed-size send buffer per PE; the Alltoall transposes them, so PE i receives the i-th chunk from every PE]
MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
Restriction: sendcount & recvcount are fixed, i.e. the same for every PE
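A minimal sketch of the fixed-count case (the count and the buffer layout are illustrative):

! Hypothetical fixed-count exchange: every PE sends `count` doubles to
! every other PE and receives `count` doubles from each in return.
subroutine alltoall_demo(npes)
   use mpi
   implicit none
   integer, intent(in) :: npes         ! number of PEs in the communicator
   integer, parameter  :: count = 1024 ! fixed chunk size (illustrative)
   real(kind=8) :: sendbuf(count*npes), recvbuf(count*npes)
   integer :: ierr
   sendbuf = 0.0d0
   ! chunk i of sendbuf goes to PE i; chunk i of recvbuf arrives from PE i
   call MPI_ALLTOALL(sendbuf, count, MPI_DOUBLE_PRECISION, &
                     recvbuf, count, MPI_DOUBLE_PRECISION, &
                     MPI_COMM_WORLD, ierr)
end subroutine alltoall_demo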
11. AlltoAllv
[Diagram: PE i's single send buffer is partitioned into n variable-size chunks of lengths sendcnts(1), sendcnts(2), …, sendcnts(n)]

MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, sendtype, recvbuf, recvcnts, rdispls, recvtype, comm, ierr)

The displacements locate each chunk inside the buffer:
sdispls(1) = 0
sdispls(i) = sdispls(i-1) + sendcnts(i-1), for i = 2, …, n

Here the sendcnts & recvcnts are NOT fixed per PE
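A sketch of how the counts and displacements drive the call (all names are illustrative):

! Hypothetical variable-count exchange: the displacements are built
! from the counts exactly as in the recurrence above.
subroutine alltoallv_demo(npes, sendcnts, recvcnts, sendbuf, recvbuf)
   use mpi
   implicit none
   integer, intent(in) :: npes
   integer, intent(in) :: sendcnts(npes), recvcnts(npes)
   real(kind=8), intent(in)  :: sendbuf(*)
   real(kind=8), intent(out) :: recvbuf(*)
   integer :: sdispls(npes), rdispls(npes), i, ierr
   sdispls(1) = 0
   rdispls(1) = 0
   do i = 2, npes
      sdispls(i) = sdispls(i-1) + sendcnts(i-1)
      rdispls(i) = rdispls(i-1) + recvcnts(i-1)
   end do
   call MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, MPI_DOUBLE_PRECISION, &
                      recvbuf, recvcnts, rdispls, MPI_DOUBLE_PRECISION, &
                      MPI_COMM_WORLD, ierr)
end subroutine alltoallv_demo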
12. Experiment: replace the point-to-point communication with collective communication everywhere (CPU version of RAMSES)
Result: point-to-point (ISend/IRecv) is ~ 3 times faster than Alltoallv. Latency issue!
13. Load balancing
The subdomains need to communicate mainly with their neighbours
The collective communication sends and receives too many empty buffers
Latency issue!
14. AlltoAllv_tuned
[Diagram: PE i's send buffer partitioned into chunks of user-specified sizes sendcnts(1), …, sendcnts(n)]
If sendcnts(i) = 0, the chunk is filled with zeros and sent anyway
The AlltoAll is the special case of this scheme
Bandwidth issue! Too much data to communicate!
15. Final solution (communication part)
❖ Point-to-Point communication
❖ Replace locally the communicators with regular arrays
❖ GPU-to-GPU communication through GPUDirect works flawlessly
[Diagram: the emission/reception communicators (derived data types) are copied into emission/reception arrays (intrinsic data types), which serve as the MPI send/recv buffers]
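A sketch of the local replacement, i.e. packing one communicator's payload into a flat intrinsic-type array before the (GPUDirect) MPI call (names are illustrative):

! Hypothetical packing of one communicator's double precision payload
! into a flat emission array of intrinsic type.
subroutine pack_emission(ncell, u, emission, offset)
   implicit none
   integer, intent(in) :: ncell
   real(kind=8), intent(in)    :: u(ncell)     ! payload of one communicator
   real(kind=8), intent(inout) :: emission(:)  ! flat send buffer
   integer, intent(inout)      :: offset       ! running position in the buffer
   emission(offset+1:offset+ncell) = u
   offset = offset + ncell
end subroutine pack_emission

The flat buffer is contiguous and of intrinsic type, so it can live on the GPU and be handed to MPI through host_data use_device, as in slide 9.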
16. 2nd part of the Project
GPU porting of the Poisson Solver
[Module diagram as in slide 2; the Gravity (Poisson) module is the target]
17. GPU porting of the Poisson solver
Communicators caused issues because of data locality:
❖ Implicit synchronisation barriers
❖ Poor performance
Local communicators
Stored information:
‣ Number of grids
‣ Grid index
‣ Double precision arrays
‣ Single precision arrays
These are replaceable by 1D arrays (intrinsic data types), spanning levels, components, cpus and cells.
18. The 1D arrays (intrinsic data types), spanning levels, components, cpus and cells, are built at the beginning of multigrid_fine
Data locality: increased performance, 2 to 3 times faster
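A sketch of the flattened indexing this implies (the layout and the names are assumptions, not the actual RAMSES code):

! Hypothetical mapping of (level, component, cpu, cell) into one
! contiguous 1D array, assuming fixed extents per dimension.
integer function flat_index(ilevel, icomp, icpu, icell, ncomp, ncpu, ncell_max)
   implicit none
   integer, intent(in) :: ilevel, icomp, icpu, icell
   integer, intent(in) :: ncomp, ncpu, ncell_max
   flat_index = (((ilevel-1)*ncomp + (icomp-1))*ncpu + (icpu-1))*ncell_max + icell
end function flat_index

A single contiguous allocation is what gives the data locality, and it maps cleanly onto OpenACC data regions.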
19. 3rd part of the Project
Infrastructure development
❖ Which parts on CPU, which on GPU?
❖ Minimise the CPU <—> GPU interaction
[Module diagram as in slide 2]
20. Infrastructure
Subroutines that update host/device (optimised data transfer):

#if defined(_OPENACC)
call update_globalvar_dp_to_host (var,level)
call update_globalvar_dp_to_device(var,level)
#endif

instead of a plain

!$acc update device/host(var)

These wrappers still use the update directive, but in an optimal way: they communicate only what is needed.
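A minimal sketch of what such a wrapper can look like (the per-level first/last index tables are an assumption about how a level maps to a slice of the array):

! Hypothetical wrapper: update on the host only the slice of the global
! variable that belongs to level ilevel, instead of the whole array.
module update_mod
   implicit none
   integer, allocatable :: first(:), last(:)  ! per-level index range (illustrative)
contains
   subroutine update_globalvar_dp_to_host(var, ilevel)
      real(kind=8), intent(inout) :: var(:)
      integer, intent(in) :: ilevel
      !$acc update host(var(first(ilevel):last(ilevel)))
   end subroutine update_globalvar_dp_to_host
end module update_mod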
21. Manual Profiler (MPROF)
Why?
❖ Bugs in CRAYPAT
❖ Strange SYNC barriers (implicit)
❖ Discrepancy of CRAYPAT and NVIDIA’s tools
It is enabled from the Makefile by the flag MPROF
It uses the MPI_WTIME function
It is adapted for all the GPU-ported subroutines of RAMSES
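A minimal sketch of such a manual timer built on MPI_WTIME (the bookkeeping is illustrative, not the actual MPROF implementation):

! Hypothetical named timers accumulated with MPI_WTIME; the calls would
! be compiled in only when the MPROF flag is set in the Makefile.
module mprof_mod
   use mpi
   implicit none
   integer, parameter :: max_timers = 64
   character(len=32)  :: tname(max_timers)
   real(kind=8)       :: tstart(max_timers), ttotal(max_timers) = 0.0d0
   integer            :: ntimers = 0
contains
   subroutine mprof_start(name)
      character(len=*), intent(in) :: name
      integer :: i
      do i = 1, ntimers
         if (trim(tname(i)) == trim(name)) then
            tstart(i) = MPI_WTIME()
            return
         end if
      end do
      ntimers = ntimers + 1
      tname(ntimers)  = name
      tstart(ntimers) = MPI_WTIME()
   end subroutine mprof_start
   subroutine mprof_stop(name)
      character(len=*), intent(in) :: name
      integer :: i
      do i = 1, ntimers
         if (trim(tname(i)) == trim(name)) &
            ttotal(i) = ttotal(i) + MPI_WTIME() - tstart(i)
      end do
   end subroutine mprof_stop
end module mprof_mod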
22. Until this point
[Module diagram as in slide 2, each module now labelled GPU or CPU]
Recognise the GPU-friendly parts
Construct an optimised infrastructure so as to minimise the data transfer
GPU-to-GPU communication through GPUDirect & communication in general
GPU porting completed (~95%)
Optimisation (ongoing, with daily encouraging results)
OpenMP porting of the non-GPU-ported parts
23. Working environment : Piz Daint
Results
Optimisation target: 1 GPU faster than 8 CPU cores
Test128, time steps 100 to 110
24. 0"
50"
100"
150"
200"
250"
1" 2" 3" 4" 5" 6" 7" 8"
Time%(sec)%
Number%of%PES%(if%ACCyes%then%#%PES%=%#%GPUs%with%1%task%per%node)%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%(if%ACCno%then%#%PES%=%#%nodes%with%8%tasks%per%node)%%
Summary%of%the%current%situaDon%
new_ACCyes"Total"Time"
original_ACCno"Total"Time"
The original RAMSES is still ~ 1.7 times faster; the non-GPU parts must be ported using OpenMP for full comparability!