RAMSES @CSCS
Kotsalos Christos & Claudio Gheller
Refactoring of the RAMSES code and performance optimisation on CPUs and GPUs
RAMSES: modular physics
Per time step ⤾: AMR build · Domain decomposition / load balancing · Gravity · Hydro · MHD · N-body · Cooling · Star formation · Other physics · RT
Our goal
AMR build · Load balance · Gravity · Hydro · MHD · N-body · Cooling · Star formation · Other physics · RT
GPU porting using OpenACC directives
❖ Recognise the GPU-friendly parts:
❖ computational intensity + data independence
❖ Minimise the data transfer:
❖ GPU <—> CPU communication
❖ GPU-to-GPU communication through GPUDirect & communication in general
❖ GPU porting and optimisation
} Infrastructure
Problems to overcome!
AMR build · Load balance · Gravity · Hydro · MHD · N-body · Cooling · Star formation · Other physics · RT
❖ Dependencies between the modules
❖ The real amr_step is more complex than this simplified representation
❖ Recursive calls
❖ Communicators (OpenACC & memory)
❖ The number of grids depends on the level of refinement: GPU porting issues, difficulty fitting the loops to the architecture!
1st part of the Project
Redesign communication for GPUs & CPUs:
❖ Is the CPU communication optimal?
❖ Is the communication suitable for GPU programming?
Communication between the subdomains
• Point-to-point communication: ISend <—> IRecv
• The subdomains exchange their solutions with their physical neighbours
• Everything goes through the communicators
MPI: send buffer —> receive buffer
Emission communicator (structure / derived data types) on the sending side; reception communicator (structure / derived data types) on the receiving side.
Stored information:
‣ Number of cells
‣ Cell index
‣ Double precision arrays
‣ Single precision arrays
Advantages: elegant structure
Disadvantages: data locality issues; not fully supported by OpenACC
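A minimal Fortran sketch of what such a communicator could look like, based only on the stored information listed above; the module, type and field names are illustrative assumptions, not the actual RAMSES declarations:

! Sketch only: type and field names are assumptions, not the RAMSES ones.
module communicator_sketch
  implicit none
  type communicator
     integer :: ncell                        ! number of cells to exchange
     integer,      allocatable :: icell(:)   ! cell indices
     real(kind=8), allocatable :: u_dp(:,:)  ! double precision payload
     real(kind=4), allocatable :: u_sp(:,:)  ! single precision payload
  end type communicator
  ! one emission and one reception communicator per neighbouring PE and per level
  type(communicator), allocatable :: emission(:,:), reception(:,:)
end module communicator_sketch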
Type of communication: point-to-point or collective?
Point-to-point: ISend & IRecv — the original implementation in RAMSES.
Collective: Alltoall(v) — the data to be communicated are gathered in arrays (one per PE); these arrays (of intrinsic data types) are scattered from all PEs to all PEs.
OpenACC & allocatable derived data types
• Not fully supported (only by the Cray compiler, and even there only partially)
• No GPUDirect support (GPU-to-GPU communication)
Two solutions to overcome this problem:
❖ Use collective communication (the buffers are of intrinsic data types) — and check its performance
❖ Replace the communicators with regular arrays, locally or globally — doing this everywhere in the code is not an easy solution
GPUDirect
❖ GPU-to-GPU communication
❖ The same calls as regular MPI, but:

export MPICH_RDMA_ENABLED_CUDA=1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

Embed the MPI calls in host_data regions:

!$acc host_data use_device(send buffer)
call MPI_ISEND( normal arguments )
!$acc end host_data

!$acc host_data use_device(recv buffer)
call MPI_IRECV( normal arguments )
!$acc end host_data
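For illustration, the same pattern spelled out with hypothetical buffer names and arguments (sendbuf, recvbuf, the counts, tag and neighbour rank are placeholders, not the actual RAMSES variables); the buffers are assumed to be resident on the device already:

! Sketch only: all names are placeholders; the buffers are assumed to live
! on the device already (e.g. inside an enclosing !$acc data region).
subroutine exchange_gpudirect(sendbuf, recvbuf, scount, rcount, ineighbour)
  use mpi
  implicit none
  integer, intent(in) :: scount, rcount, ineighbour
  real(kind=8), intent(in)  :: sendbuf(scount)
  real(kind=8), intent(out) :: recvbuf(rcount)
  integer :: req(2), ierr
  integer, parameter :: tag = 101
  ! host_data hands the *device* addresses of the buffers to MPI;
  ! with MPICH_RDMA_ENABLED_CUDA=1 the transfer stays GPU-to-GPU
  !$acc host_data use_device(sendbuf)
  call MPI_ISEND(sendbuf, scount, MPI_DOUBLE_PRECISION, ineighbour, &
                 tag, MPI_COMM_WORLD, req(1), ierr)
  !$acc end host_data
  !$acc host_data use_device(recvbuf)
  call MPI_IRECV(recvbuf, rcount, MPI_DOUBLE_PRECISION, ineighbour, &
                 tag, MPI_COMM_WORLD, req(2), ierr)
  !$acc end host_data
  call MPI_WAITALL(2, req, MPI_STATUSES_IGNORE, ierr)
end subroutine exchange_gpudirect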
AlltoAll
Each PE packs one block per destination PE into its send buffer; after the call, PE i's receive buffer holds block i from every PE (a block transpose across the PEs).
MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
Restriction: the send and receive counts are fixed, the same for every PE.
AlltoAllv
Each PE packs variable-sized blocks into its send buffer: the block for PE k has length sendcnts(k) and starts at offset sdispls(k).
MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, sendtype, recvbuf, recvcnts, rdispls, recvtype, comm, ierr)
sdispls(1) = 0
sdispls(2) = sdispls(1) + sendcnts(1)
…
sdispls(n) = sdispls(n-1) + sendcnts(n-1)
The send and receive counts are NOT fixed: they may differ per PE.
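A minimal sketch of the displacement bookkeeping described above; the packing of the send buffer is omitted and all names are illustrative (the receive buffer is assumed large enough):

! Sketch only: names are illustrative; sendbuf is assumed already packed
! and recvbuf large enough to hold everything that arrives.
subroutine alltoallv_sketch(sendbuf, sendcnts, recvbuf, npes)
  use mpi
  implicit none
  integer, intent(in) :: npes
  integer, intent(in) :: sendcnts(npes)
  real(kind=8), intent(in)  :: sendbuf(*)
  real(kind=8), intent(out) :: recvbuf(*)
  integer :: recvcnts(npes), sdispls(npes), rdispls(npes)
  integer :: i, ierr
  ! each PE first learns how much it will receive from every other PE
  call MPI_ALLTOALL(sendcnts, 1, MPI_INTEGER, recvcnts, 1, MPI_INTEGER, &
                    MPI_COMM_WORLD, ierr)
  ! displacements are running sums of the counts, exactly as on the slide
  sdispls(1) = 0
  rdispls(1) = 0
  do i = 2, npes
     sdispls(i) = sdispls(i-1) + sendcnts(i-1)
     rdispls(i) = rdispls(i-1) + recvcnts(i-1)
  end do
  call MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, MPI_DOUBLE_PRECISION, &
                     recvbuf, recvcnts, rdispls, MPI_DOUBLE_PRECISION, &
                     MPI_COMM_WORLD, ierr)
end subroutine alltoallv_sketch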
Experiment: Replace the point-to-point with the collective
communication everywhere (CPU version of RAMSES)
[plot: total time, Alltoallv vs ISend/IRecv]
Latency issue! Point-to-point is ~3 times faster.
Load balancing
The subdomains need to communicate mainly with their neighbours.
The collective communication sends and receives too many empty buffers.
Latency issue!
AlltoAllv_tuned (user-specified buffer sizes)
If sendcnts(i) = 0, then sendcnts(i) is set to a user-specified size and the corresponding block of the send buffer is filled with zeros; the AlltoAll is a special case of this.
Bandwidth issue! Too much data to communicate.
Final solution (communication part)
❖ Point-to-Point communication
❖ Replace the communicators locally with regular arrays
❖ GPU-to-GPU communication through GPUDirect works flawlessly
Send buffer / receive buffer: the emission and reception communicators (derived data types) are replaced by emission and reception arrays (intrinsic data types).
2nd part of the Project
GPU porting of the Poisson Solver
AMR build · Load balance · Gravity · Hydro · MHD · N-body · Cooling · Star formation · Other physics · RT
GPU porting of the Poisson solver
Communicators caused issues because of data locality:
❖ Implicit synchronisation barriers
❖ Poor performance
Local communicators — stored information:
‣ Number of grids
‣ Grid index
‣ Double precision arrays
‣ Single precision arrays
↧ replaceable by
1D arrays of intrinsic data types, indexed over level, components, CPUs and cells.
The replacement is done at the beginning of multigrid_fine.
Data locality, increased performance: 2 to 3 times faster.
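As a hedged Fortran sketch of the idea (type, field and routine names are assumptions, not the actual RAMSES layout): the per-CPU communicators of one level are packed into a single contiguous 1D array plus an offset table, so OpenACC can move and index the whole level as one block:

! Sketch only: type, field and routine names are assumptions.
module flatten_sketch
  implicit none
  type comm_sketch
     integer :: ngrid
     real(kind=8), allocatable :: u(:,:)   ! (ngrid, ncomp) payload per remote CPU
  end type comm_sketch
contains
  subroutine flatten(comms, ncpu, ncomp, flat, offset)
    integer, intent(in) :: ncpu, ncomp
    type(comm_sketch), intent(in) :: comms(ncpu)
    real(kind=8), allocatable, intent(out) :: flat(:)
    integer, intent(out) :: offset(ncpu+1)
    integer :: icpu, ig, ic, k
    ! offset(icpu) marks where CPU icpu's block starts in the flat array
    offset(1) = 0
    do icpu = 1, ncpu
       offset(icpu+1) = offset(icpu) + ncomp*comms(icpu)%ngrid
    end do
    allocate(flat(offset(ncpu+1)))
    ! copy every communicator payload into one contiguous buffer
    k = 0
    do icpu = 1, ncpu
       do ic = 1, ncomp
          do ig = 1, comms(icpu)%ngrid
             k = k + 1
             flat(k) = comms(icpu)%u(ig, ic)
          end do
       end do
    end do
  end subroutine flatten
end module flatten_sketch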
3rd part of the Project
Infrastructure development
❖ Which parts on CPU, which on GPU?
❖ Minimise the interaction between CPU and GPU
AMR build · Load balance · Gravity · Hydro · MHD · N-body · Cooling · Star formation · Other physics · RT
Infrastructure
Subroutines that update the host/device (optimised data transfer):

#if defined(_OPENACC)
call update_globalvar_dp_to_host (var,level)
call update_globalvar_dp_to_device(var,level)
#endif

instead of

!$acc update device/host(var)

Communicate only what is needed: use of the update directive, but in an optimal way.
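A possible shape of such a wrapper, as a sketch only: the per-level bound arrays and the variable layout are assumptions; only the subroutine names appear on the slide:

! Sketch only: the per-level bounds and the variable layout are assumptions.
module update_sketch
  implicit none
  ! hypothetical per-level index bounds into the global arrays
  integer, allocatable :: level_first(:), level_last(:)
contains
  subroutine update_globalvar_dp_to_device(var, level)
    real(kind=8), intent(inout) :: var(:)
    integer, intent(in) :: level
    ! move only the slice touched at this level, not the whole array:
    ! that is the point of the optimised wrappers
    !$acc update device(var(level_first(level):level_last(level)))
  end subroutine update_globalvar_dp_to_device

  subroutine update_globalvar_dp_to_host(var, level)
    real(kind=8), intent(inout) :: var(:)
    integer, intent(in) :: level
    !$acc update host(var(level_first(level):level_last(level)))
  end subroutine update_globalvar_dp_to_host
end module update_sketch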
Manual Profiler (MPROF)
Why?
❖ Bugs in CRAYPAT
❖ Strange implicit SYNC barriers
❖ Discrepancies between CRAYPAT and NVIDIA's tools
It is enabled from the Makefile by the flag MPROF.
It uses the MPI_WTIME function.
It is adapted to all the GPU-ported subroutines of RAMSES.
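A minimal sketch of such an MPI_WTIME-based timer; the names and the accumulation scheme are assumptions, only the MPROF Makefile flag and the use of MPI_WTIME come from the slide:

! Sketch only: names and accumulation scheme are assumptions.
module mprof_sketch
  use mpi
  implicit none
  integer, parameter :: nregions = 64
  real(kind=8) :: t_start(nregions) = 0.0d0   ! start time of each timed region
  real(kind=8) :: t_accum(nregions) = 0.0d0   ! accumulated time per region
contains
  subroutine mprof_start(iregion)
    integer, intent(in) :: iregion
#if defined(MPROF)
    t_start(iregion) = MPI_WTIME()
#endif
  end subroutine mprof_start

  subroutine mprof_stop(iregion)
    integer, intent(in) :: iregion
#if defined(MPROF)
    t_accum(iregion) = t_accum(iregion) + (MPI_WTIME() - t_start(iregion))
#endif
  end subroutine mprof_stop
end module mprof_sketch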
Until this point
AMR build · Load balance · Gravity · Hydro · MHD · N-body · Cooling · Star formation · Other physics · RT
(in the original slide each module is marked as either GPU-ported or still CPU-only)
❖ Recognise the GPU-friendly parts
❖ Construct an optimised infrastructure so as to minimise the data transfer
❖ GPU-to-GPU communication through GPUDirect & communication in general
❖ GPU porting completed (~95%)
❖ Optimisation (ongoing, with daily encouraging results)
❖ OpenMP porting of the non-GPU-ported parts
Working environment: Piz Daint
Results
Optimisation target: 1 GPU > 8 cores
Test case: Test128, time steps 100 to 110
0"
50"
100"
150"
200"
250"
1" 2" 3" 4" 5" 6" 7" 8"
Time%(sec)%
Number%of%PES%(if%ACCyes%then%#%PES%=%#%GPUs%with%1%task%per%node)%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%(if%ACCno%then%#%PES%=%#%nodes%with%8%tasks%per%node)%%
Summary%of%the%current%situaDon%
new_ACCyes"Total"Time"
original_ACCno"Total"Time"
The original RAMSES is ~1.7 times faster; the non-GPU parts must be ported using OpenMP for a fully comparable measurement!
CRAY-PAT report
Fair comparison: 216 − 80 ≈ 140 sec → the original RAMSES is only ~1.1 times faster.
Manual Profiler (MPROF): 1 GPU vs 8 cores [chart]
Infrastructure overhead (ACC_COPY) per number of GPUs:
GPUs                    1      2      4      8
ACC_COPY (sec)        10.5    7.7    6.3    6.8
Total time (sec)     216.8  195.2  129.7  111.4
ACC_COPY / Total (%)   4.8    3.9    4.9    6.1
0"
25"
50"
75"
1" 2" 3" 4" 5" 6" 7" 8"
MPI$Time$(sec)$
Number$of$PES$(if$ACCyes$then$#$PES$=$#$GPUs$with$1$task$per$node)$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$(if$ACCno$then$#$PES$=$#$nodes$with$8$tasks$per$node)$$
RAMSES$communicaGon$
new_ACCyes"MPI"Time"
original_ACCno"MPI"Time"
~2 less communication in the new approach
0"
25"
50"
75"
100"
1" 2" 3" 4" 5" 6" 7" 8"
Communica)on*/*Total*Time*(%)*
Number*of*PES*(if*ACCyes*then*#*PES*=*#*GPUs*with*1*task*per*node)*
******************************(if*ACCno*then*#*PES*=*#*nodes*with*8*tasks*per*node)**
new_ACCyes"
original_ACCno"
GPUDirect ~ 1.55 × regular MPI (CPU-to-CPU)
[plot: GPUDirect — ISend+IRecv time (sec) vs number of GPUs (ACCyes) / CPUs (ACCno), up to 64; series: original_GPUDirectNO, new_GPUDirectYES]
Optimisation 15-07: 1 GPU vs 8 cores [chart]
RAMSES @CSCS
Thank you very much for your attention!