1. RAMSES @CSCS
Kotsalos Christos & Claudio Gheller
Refactoring of the RAMSES code and performance optimisation on CPUs and GPUs
2. RAMSES: modular physics
[Module diagram, executed once per time step: AMR build; Domain decomposition / Load balancing; Gravity; Hydro; MHD; N-body; Cooling; Star formation; RT; Other physics]
3. Our goal
[Module diagram as in slide 2; the physics modules are targeted for the GPU using OpenACC directives]
❖ Recognise the GPU-friendly parts: computational intensity + data independence
❖ Minimise the data transfer: GPU <—> CPU communication
❖ GPU-to-GPU communication through GPUDirect & communication in general
❖ GPU porting and optimisation
(the first three items constitute the infrastructure)
4. Problems to overcome!
[Module diagram as in slide 2]
Dependencies between the modules
The real amr_step is more complex than this simplified representation
Recursive calls
Communicators (OpenACC & memory)
The number of grids depends on the level of refinement, which causes GPU-porting issues: it is difficult to fit the loops to the architecture!
5. 1st part of the Project
Redesign communication for GPUs & CPUs:
❖ Is the CPU communication optimal?
❖ Is the communication suitable for GPU programming?
6. Communication between the subdomains
• Point-to-Point Communication :: ISend <—> IRecv
• The subdomains exchange their solutions with their physical neighbours
• Everything through the communicators
MPI: Send Buffer <—> Recv Buffer
emission communicator (structure / derived data types)
reception communicator (structure / derived data types)
Stored information:
‣ Number of cells
‣ Cell index
‣ Double precision arrays
‣ Single precision arrays
Advantages:
Elegant Structure
Disadvantages:
Data locality issues
Not fully supported by OpenACC
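A minimal sketch of what such a derived-type communicator can look like in Fortran (the field names are illustrative, not the actual RAMSES definitions):

! Hypothetical emission/reception communicator holding the information
! listed above; names are illustrative, not RAMSES' own.
type communicator_t
   integer :: ncell                        ! number of cells to exchange
   integer,      allocatable :: icell(:)   ! cell indices
   real(kind=8), allocatable :: u(:,:)     ! double precision arrays
   real(kind=4), allocatable :: f(:,:)     ! single precision arrays
end type communicator_t

! one emission and one reception communicator per neighbouring PE:
type(communicator_t), allocatable :: emission(:), reception(:)

Because the payload lives in allocatable components, every buffer is a separate allocation: this is the source of the data locality issues and of the incomplete OpenACC support noted above.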
7. Type of Communication: Point-to-Point or Collective ?
Point-to-point: ISend & IRecv
‣ Original implementation in RAMSES
Collective: Alltoall(v)
‣ The data to be communicated are gathered in arrays (one per PE)
‣ These arrays (of intrinsic data types) are scattered from all PEs to all PEs
8. OpenACC & Allocatable Derived data types
• Not fully supported (only by the Cray compiler, and even there partially)
• No GPUDirect support (GPU-to-GPU communication)
Two solutions to overcome this problem
❖ Use collective communication (the buffers are of intrinsic data types): check the performance
❖ Replace the communicators with regular arrays, locally or globally: a global replacement touches the whole code and is not an easy solution
9. GPUDirect
❖ GPU-to-GPU communication
❖ The same calls as the regular MPI but:
export MPICH_RDMA_ENABLED_CUDA=1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

Embed the MPI calls in host_data regions, so that MPI is handed the device addresses of the buffers:

!$acc host_data use_device(send_buf)
call MPI_ISEND( normal arguments, with send_buf )
!$acc end host_data
!$acc host_data use_device(recv_buf)
call MPI_IRECV( normal arguments, with recv_buf )
!$acc end host_data
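A minimal self-contained sketch of the pattern (buffer names, tag and neighbour rank are illustrative):

! Hypothetical GPUDirect exchange with one neighbouring rank; the
! buffers are assumed to be resident on the GPU already.
subroutine exchange_gpu(send_buf, recv_buf, n, neighbour)
   use mpi
   implicit none
   integer, intent(in) :: n, neighbour
   real(kind=8), intent(in)  :: send_buf(n)
   real(kind=8), intent(out) :: recv_buf(n)
   integer :: req(2), ierr
   !$acc host_data use_device(send_buf, recv_buf)
   call MPI_IRECV(recv_buf, n, MPI_DOUBLE_PRECISION, neighbour, 0, &
                  MPI_COMM_WORLD, req(1), ierr)
   call MPI_ISEND(send_buf, n, MPI_DOUBLE_PRECISION, neighbour, 0, &
                  MPI_COMM_WORLD, req(2), ierr)
   !$acc end host_data
   call MPI_WAITALL(2, req, MPI_STATUSES_IGNORE, ierr)
end subroutine exchange_gpu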
10. AlltoAll
[Diagram: each PE fills one fixed-size send buffer per PE; the Alltoall transposes them, so PE i receives the i-th chunk from every PE]
MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
Restriction: sendcount & recvcount are fixed, i.e. the same for every PE
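A minimal sketch of the fixed-count case (the count and the buffer layout are illustrative):

! Hypothetical fixed-count exchange: every PE sends `count` doubles to
! every other PE and receives `count` doubles from each in return.
subroutine alltoall_demo(npes)
   use mpi
   implicit none
   integer, intent(in) :: npes         ! number of PEs in the communicator
   integer, parameter  :: count = 1024 ! fixed chunk size (illustrative)
   real(kind=8) :: sendbuf(count*npes), recvbuf(count*npes)
   integer :: ierr
   sendbuf = 0.0d0
   ! chunk i of sendbuf goes to PE i; chunk i of recvbuf arrives from PE i
   call MPI_ALLTOALL(sendbuf, count, MPI_DOUBLE_PRECISION, &
                     recvbuf, count, MPI_DOUBLE_PRECISION, &
                     MPI_COMM_WORLD, ierr)
end subroutine alltoall_demo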
11. AlltoAllv
[Diagram: PE i's single send buffer is partitioned into n variable-size chunks of lengths sendcnts(1), sendcnts(2), …, sendcnts(n)]

MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, sendtype, recvbuf, recvcnts, rdispls, recvtype, comm, ierr)

The displacements locate each chunk inside the buffer:
sdispls(1) = 0
sdispls(i) = sdispls(i-1) + sendcnts(i-1), for i = 2, …, n

Here the sendcnts & recvcnts are NOT fixed per PE
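A sketch of how the counts and displacements drive the call (all names are illustrative):

! Hypothetical variable-count exchange: the displacements are built
! from the counts exactly as in the recurrence above.
subroutine alltoallv_demo(npes, sendcnts, recvcnts, sendbuf, recvbuf)
   use mpi
   implicit none
   integer, intent(in) :: npes
   integer, intent(in) :: sendcnts(npes), recvcnts(npes)
   real(kind=8), intent(in)  :: sendbuf(*)
   real(kind=8), intent(out) :: recvbuf(*)
   integer :: sdispls(npes), rdispls(npes), i, ierr
   sdispls(1) = 0
   rdispls(1) = 0
   do i = 2, npes
      sdispls(i) = sdispls(i-1) + sendcnts(i-1)
      rdispls(i) = rdispls(i-1) + recvcnts(i-1)
   end do
   call MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, MPI_DOUBLE_PRECISION, &
                      recvbuf, recvcnts, rdispls, MPI_DOUBLE_PRECISION, &
                      MPI_COMM_WORLD, ierr)
end subroutine alltoallv_demo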
12. Experiment: replace the point-to-point communication with collective communication everywhere (CPU version of RAMSES)
Result: point-to-point (ISend/IRecv) is ~ 3 times faster than Alltoallv. Latency issue!
13. Load balancing
The subdomains need to communicate mainly with their neighbours
The collective communication sends and receives too many empty buffers
Latency issue!
14. AlltoAllv_tuned
[Diagram: PE i's send buffer partitioned into chunks of user-specified sizes sendcnts(1), …, sendcnts(n)]
If sendcnts(i) = 0, the chunk is filled with zeros and sent anyway
The AlltoAll is the special case of this scheme
Bandwidth issue! Too much data to communicate!
15. Final solution (communication part)
❖ Point-to-Point communication
❖ Replace locally the communicators with regular arrays
❖ GPU-to-GPU communication through GPUDirect works flawlessly
[Diagram: the emission/reception communicators (derived data types) are copied into emission/reception arrays (intrinsic data types), which serve as the MPI send/recv buffers]
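A sketch of the local replacement, i.e. packing one communicator's payload into a flat intrinsic-type array before the (GPUDirect) MPI call (names are illustrative):

! Hypothetical packing of one communicator's double precision payload
! into a flat emission array of intrinsic type.
subroutine pack_emission(ncell, u, emission, offset)
   implicit none
   integer, intent(in) :: ncell
   real(kind=8), intent(in)    :: u(ncell)     ! payload of one communicator
   real(kind=8), intent(inout) :: emission(:)  ! flat send buffer
   integer, intent(inout)      :: offset       ! running position in the buffer
   emission(offset+1:offset+ncell) = u
   offset = offset + ncell
end subroutine pack_emission

The flat buffer is contiguous and of intrinsic type, so it can live on the GPU and be handed to MPI through host_data use_device, as in slide 9.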
16. 2nd part of the Project
GPU porting of the Poisson Solver
[Module diagram as in slide 2; the Gravity (Poisson) module is the target]
17. GPU porting of the Poisson solver
Communicators caused issues because of data locality:
❖ Implicit synchronisation barriers
❖ Poor performance
Local communicators
Stored information:
‣ Number of grids
‣ Grid index
‣ Double precision arrays
‣ Single precision arrays
These are replaceable by 1D arrays (intrinsic data types), spanning levels, components, cpus and cells.
18. The 1D arrays (intrinsic data types), spanning levels, components, cpus and cells, are built at the beginning of multigrid_fine
Data locality: increased performance, 2 to 3 times faster
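A sketch of the flattened indexing this implies (the layout and the names are assumptions, not the actual RAMSES code):

! Hypothetical mapping of (level, component, cpu, cell) into one
! contiguous 1D array, assuming fixed extents per dimension.
integer function flat_index(ilevel, icomp, icpu, icell, ncomp, ncpu, ncell_max)
   implicit none
   integer, intent(in) :: ilevel, icomp, icpu, icell
   integer, intent(in) :: ncomp, ncpu, ncell_max
   flat_index = (((ilevel-1)*ncomp + (icomp-1))*ncpu + (icpu-1))*ncell_max + icell
end function flat_index

A single contiguous allocation is what gives the data locality, and it maps cleanly onto OpenACC data regions.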
19. 3rd part of the Project
Infrastructure development
❖ Which parts on CPU, which on GPU?
❖ Minimise the CPU <—> GPU interaction
[Module diagram as in slide 2]
20. Infrastructure
Subroutines that update host/device (optimised data transfer):

#if defined(_OPENACC)
call update_globalvar_dp_to_host (var,level)
call update_globalvar_dp_to_device(var,level)
#endif

instead of a plain

!$acc update device/host(var)

These wrappers still use the update directive, but in an optimal way: they communicate only what is needed.
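A minimal sketch of what such a wrapper can look like (the per-level first/last index tables are an assumption about how a level maps to a slice of the array):

! Hypothetical wrapper: update on the host only the slice of the global
! variable that belongs to level ilevel, instead of the whole array.
module update_mod
   implicit none
   integer, allocatable :: first(:), last(:)  ! per-level index range (illustrative)
contains
   subroutine update_globalvar_dp_to_host(var, ilevel)
      real(kind=8), intent(inout) :: var(:)
      integer, intent(in) :: ilevel
      !$acc update host(var(first(ilevel):last(ilevel)))
   end subroutine update_globalvar_dp_to_host
end module update_mod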
21. Manual Profiler (MPROF)
Why?
❖ Bugs in CRAYPAT
❖ Strange SYNC barriers (implicit)
❖ Discrepancy of CRAYPAT and NVIDIA’s tools
It is enabled from the Makefile by the flag MPROF
It uses the MPI_WTIME function
It is adapted for all the GPU-ported subroutines of RAMSES
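A minimal sketch of such a manual timer built on MPI_WTIME (the bookkeeping is illustrative, not the actual MPROF implementation):

! Hypothetical named timers accumulated with MPI_WTIME; the calls would
! be compiled in only when the MPROF flag is set in the Makefile.
module mprof_mod
   use mpi
   implicit none
   integer, parameter :: max_timers = 64
   character(len=32)  :: tname(max_timers)
   real(kind=8)       :: tstart(max_timers), ttotal(max_timers) = 0.0d0
   integer            :: ntimers = 0
contains
   subroutine mprof_start(name)
      character(len=*), intent(in) :: name
      integer :: i
      do i = 1, ntimers
         if (trim(tname(i)) == trim(name)) then
            tstart(i) = MPI_WTIME()
            return
         end if
      end do
      ntimers = ntimers + 1
      tname(ntimers)  = name
      tstart(ntimers) = MPI_WTIME()
   end subroutine mprof_start
   subroutine mprof_stop(name)
      character(len=*), intent(in) :: name
      integer :: i
      do i = 1, ntimers
         if (trim(tname(i)) == trim(name)) &
            ttotal(i) = ttotal(i) + MPI_WTIME() - tstart(i)
      end do
   end subroutine mprof_stop
end module mprof_mod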
22. Until this point
[Module diagram as in slide 2, each module now labelled GPU or CPU]
Recognise the GPU-friendly parts
Construct an optimised infrastructure so as to minimise the data transfer
GPU-to-GPU communication through GPUDirect & communication in general
GPU porting completed (~95%)
Optimisation (ongoing, with daily encouraging results)
OpenMP porting of the non-GPU-ported parts
23. Working environment : Piz Daint
Results
Optimisation target: 1 GPU faster than 8 CPU cores
Test128, time steps 100 to 110
24. 0"
50"
100"
150"
200"
250"
1" 2" 3" 4" 5" 6" 7" 8"
Time%(sec)%
Number%of%PES%(if%ACCyes%then%#%PES%=%#%GPUs%with%1%task%per%node)%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%(if%ACCno%then%#%PES%=%#%nodes%with%8%tasks%per%node)%%
Summary%of%the%current%situaDon%
new_ACCyes"Total"Time"
original_ACCno"Total"Time"
The original RAMSES is still ~ 1.7 times faster; the non-GPU parts must be ported using OpenMP for full comparability!