Performance Optimization of
CGYRO for Multiscale
Turbulence Simulations
Presented by Igor Sfiligoi, UCSD/SDSC
in collaboration with Emily Belli and Jeff Candy, GA
at the joint
US-Japan Workshop on Exascale Computing Collaboration and
6th workshop of US-Japan Joint Institute for Fusion Theory (JIFT) program
Jan 18th 2022
1
What is CGYRO?
• An Eulerian GyroKinetic Fusion Plasma solver
• Uses a radially spectral formulation
• Implements the complete Sugama electromagnetic gyrokinetic theory
• Re-implemented from scratch with Multiscale Turbulence in mind
• Built on lessons learned from GYRO and NEO
• With heavy reliance on system-provided FFT libraries
• Uses MPI for multi-node/multi-device scaling
• OpenMP for multi-core support (can be combined with MPI; a minimal hybrid sketch follows this slide)
• OpenACC/cuFFT for GPU support
https://gafusion.github.io/doc/cgyro.html
2
Jan 18, 2022
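To make the MPI + OpenMP combination above concrete, here is a minimal hybrid sketch in C. It is illustrative only; CGYRO itself is written in Fortran, and all names and the work loop here are made up.

/* Minimal hybrid MPI+OpenMP sketch (illustrative only, not CGYRO source).
 * One MPI process per node or NUMA domain, OpenMP threads inside it. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Ask for threaded MPI; FUNNELED is enough if only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;
    /* Thread-parallel work inside each MPI process */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i);

    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("ranks=%d sum=%f\n", nranks, global_sum);
    MPI_Finalize();
    return 0;
}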
A 5D mesh
Jan 18, 2022 3
Suitable for most Fusion Plasma studies
• For describing the transport in the core of the Tokamak,
simulating ion-scale turbulence is sufficient
• A coarse-grained mesh is sufficient, making it very fast
• The pedestal region needs proper simulation of multiscale turbulence,
i.e. accounting for the coupling of ion-scale and electron-scale turbulence
• A fine-grained mesh is needed, which is both slower and requires more memory
• And (of course) anything in between
• Cell size can be tuned to minimize simulation cost
while providing the desired accuracy
Jan 18, 2022 4
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• A few compute-intensive kernels
• The most compute-intensive work happens in an external library
• FFT transforms
Jan 18, 2022 5
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• A few compute-intensive kernels
• The most compute-intensive work happens in an external library
• FFT transforms
Jan 18, 2022 6
Multiscale simulations
can effectively use
O(1k) nodes and
O(10k) cores
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• A few compute-intensive kernels
• The most compute-intensive work happens in an external library
• FFT transforms
• Using OpenMP made the transition easy
• We used OpenACC for GPU compute
• Very few changes needed
• Loop reordering for performance
• Explicit movement of memory between system and GPU
(see the sketch after this slide)
• NVIDIA provides GPU-optimized FFT
• Needs batching, but otherwise very similar to FFTW
Jan 18, 2022 7
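As a rough illustration of the changes listed above (explicit host/GPU data movement plus offloaded, reordered loops), here is a minimal hypothetical OpenACC sketch in C. CGYRO itself is Fortran, so this only mirrors the pattern, not the actual kernels; all names and sizes are made up.

/* Hypothetical OpenACC sketch of the pattern described on this slide.
 * Illustrates explicit data movement and loop offload only. */
void scale_field(double *h, const double *rhs, double dt, int n_radial, int n_theta) {
    /* Keep the arrays resident on the GPU for the duration of the region:
     * copy rhs in once, copy h in and back out once. */
    #pragma acc data copyin(rhs[0:n_radial*n_theta]) copy(h[0:n_radial*n_theta])
    {
        /* Collapse the two loops so the GPU sees one large iteration space;
         * loop order chosen so the inner index gives stride-1 access. */
        #pragma acc parallel loop collapse(2)
        for (int ir = 0; ir < n_radial; ir++)
            for (int it = 0; it < n_theta; it++) {
                int idx = ir * n_theta + it;
                h[idx] += dt * rhs[idx];
            }
    }
}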
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• A few compute-intensive kernels
• The most compute-intensive work happens in an external library
• FFT transforms
• Using OpenMP made the transition easy
• We used OpenACC for GPU compute
• Very few changes needed
• Loop reordering for performance
• Explicit movement of memory
between system and GPU
• NVIDIA provides a GPU-optimized FFT library
• Needs batching, but otherwise very similar to FFTW
(see the cuFFT sketch after this slide)
Jan 18, 2022 8
GPU memory size was a problem on Titan; only a fraction of the code ran on GPUs there.
On modern systems (Summit, Perlmutter) virtually no CPU-bound code remains.
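The batching mentioned above is the main practical difference from a plain FFTW-style interface: one cuFFT plan executes many independent transforms in a single call. A minimal hypothetical sketch follows; the sizes, layout, and function names are illustrative, not CGYRO's actual plan setup.

/* Hypothetical sketch of a batched cuFFT plan; sizes are made up. */
#include <cufft.h>
#include <cuda_runtime.h>

cufftHandle make_batched_plan(int nx, int ny, int batch) {
    cufftHandle plan;
    int n[2] = { nx, ny };              /* 2D transform size */
    /* One plan executes `batch` independent 2D complex-to-complex FFTs,
     * laid out back to back in memory. NULL inembed/onembed selects the
     * default packed layout, so the stride/dist arguments are ignored. */
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, nx * ny,     /* input layout  */
                  NULL, 1, nx * ny,     /* output layout */
                  CUFFT_Z2Z, batch);
    return plan;
}

void run_ffts(cufftHandle plan, cufftDoubleComplex *d_in, cufftDoubleComplex *d_out) {
    /* d_in/d_out are device pointers; all `batch` transforms run in one call. */
    cufftExecZ2Z(plan, d_in, d_out, CUFFT_FORWARD);
    cudaDeviceSynchronize();
}

A single plan covering the whole batch keeps enough work in flight to fill the GPU, which a loop over many small individual FFTs would not.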
GPUs are fast
• A100 GPUs are an order of magnitude faster than KNL CPUs
Jan 18, 2022 9
[Chart annotations: nl03 (intermediate-scale benchmark test case), (64 x 512 x 32 x 24 x 8) x 3 species, report time = 0.1 a/c_s: 96 V100 GPUs; 64 A100 GPUs; 6x faster.
sh04 (multiscale benchmark test case), (128 x 1152 x 24 x 18 x 8) x 3 species, report time = 1.0 a/c_s: 576 V100 GPUs; 384 A100 GPUs; 10x faster.
Perlmutter results obtained on Phase 1 setup.]
GPUs are fast
• A100 GPUs are an order of magnitude faster than KNL CPUs
Jan 18, 2022 10
[Chart annotations: nl03 (intermediate-scale benchmark test case), (64 x 512 x 32 x 24 x 8) x 3 species, report time = 0.1 a/c_s: 96 V100 GPUs; 64 A100 GPUs; 6x faster.
sh04 (multiscale benchmark test case), (128 x 1152 x 24 x 18 x 8) x 3 species, report time = 1.0 a/c_s: 576 V100 GPUs; 384 A100 GPUs; 10x faster.
4 A100 GPUs are about as fast as 6 V100 GPUs.
Perlmutter results obtained on Phase 1 setup.]
GPUs are fast
• A100 GPU much faster
than KNL CPU
• But compute now less than
50% of total time!
Jan 18, 2022 11
Perlmutter results obtained
on Phase 1 setup.
4x faster
3x faster
nl03 - Intermediate-scale benchmark test case
sh04 - Multiscale benchmark test case
CGYRO is communication heavy
• CGYRO exchanges a significant amount of data during compute
(numbers are for each process, per timing period; a rough bandwidth estimate follows this slide)
• nl03 - Using 16 Perlmutter nodes
• Data: ~ 28 GB
• Compute: ~ 9 s
• sh04 - Using 96 Perlmutter nodes
• Data: ~140 GB
• Compute: ~ 22 s
Jan 18, 2022 12
5x data volume
2.5x compute time
Most of the communication is AllToAll
Some AllReduce
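A rough, back-of-the-envelope reading of the numbers above (assuming, for simplicity, that the exchanged data is not overlapped with compute): each nl03 process would have to sustain about 28 GB / 9 s ≈ 3.1 GB/s, and each sh04 process about 140 GB / 22 s ≈ 6.4 GB/s, just for the communication time to stay comparable to the compute time. With several processes sharing each node's network link, this is why the network quickly becomes the limiting factor, as the next slides show.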
Networks are the bottleneck
• A single Google Cloud 16x A100 node is
as fast as 8 Perlmutter nodes (4x A100 each)
Jan 18, 2022 13
32 A100 GPUs on Perlmutter (Phase 1)
16 A100 GPUs on Google Cloud (GCP)
For intermediate-scale benchmark test case nl03
(not enough GPU RAM for anything bigger)
Perlmutter results obtained
on Phase 1 setup.
On GCP, using OpenMPI v3
provided by NVIDIA SDK
Networks are the bottleneck
• A single Google Cloud 16x A100 node is
as fast as 8 Perlmutter nodes (4x A100 each)
Jan 18, 2022 14
32 A100 GPUs on Perlmutter (Phase 1)
16 A100 GPUs on Google Cloud (GCP)
For intermediate-scale benchmark test case nl03
(not enough GPU RAM for anything bigger)
Perlmutter results obtained
on Phase 1 setup.
GPU-to-GPU communication is much faster than networking:
NVLINK provides 4.8 Tbps of bandwidth, while
Perlmutter has 200 Gbps Slingshot for 4 GPUs.
The use of GPU-aware MPI communication is essential;
PCIe is almost 10x slower than NVLINK.
(A minimal GPU-aware MPI sketch follows this slide.)
On GCP, using OpenMPI v3
provided by NVIDIA SDK
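A minimal sketch of the GPU-aware MPI pattern referred to above, assuming a CUDA-aware MPI build; buffer names, sizes, and the all-to-all shape are illustrative, not CGYRO's actual exchange.

/* Hypothetical sketch: all-to-all exchange of GPU-resident buffers.
 * Requires a CUDA-aware MPI build; otherwise the buffers would first
 * have to be staged through host memory. */
#include <mpi.h>
#include <cuda_runtime.h>

void alltoall_device(MPI_Comm comm, int count_per_rank) {
    int nranks;
    MPI_Comm_size(comm, &nranks);

    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, (size_t)nranks * count_per_rank * sizeof(double));
    cudaMalloc((void **)&d_recv, (size_t)nranks * count_per_rank * sizeof(double));

    /* ... fill d_send on the GPU ... */

    /* Device pointers passed directly; the MPI library moves the data
     * GPU-to-GPU (NVLINK inside a node, the network between nodes). */
    MPI_Alltoall(d_send, count_per_rank, MPI_DOUBLE,
                 d_recv, count_per_rank, MPI_DOUBLE, comm);

    cudaFree(d_send);
    cudaFree(d_recv);
}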
Minimizing network traffic a must
• Multi-node deployments a requirement for multiscale simulations
• Memory constraints
• Time to solution
• But one can try to
reduce network traffic
by keeping most of the
data inside the node
Jan 18, 2022 15
Intermediate-scale benchmark test case nl03
Perlmutter results obtained
on Phase 1 setup.
2x faster
1.4x faster
CGYRO 2D communication pattern
• Communication happens on 2 orthogonal communicators
• One fixed size, the other increases with #MPI
• Different amounts of data on the two communicators
• First communicator typically
much more “chatty”
• For small to medium simulations
• Keeping most of it inside the node
will reduce network traffic
• But if we increase #MPI, the
other communicator data will increase
Jan 18, 2022 16
-30%
-27%
-12% +10%
CGYRO 2D communication pattern
• Communication happens on 2 orthogonal communicators
• One fixed size, the other increases with #MPI
• Different amounts of data on the two communicators
• First communicator typically
much more “chatty”
• For small to medium simulations
• Keeping most of it inside the node
will reduce network traffic
• But if we increase #MPI, the
other communicator data will increase
Jan 18, 2022 17
-30%
-27%
-12% +10%
Not a solution for Multiscale simulations
MPI_ORDER=2 still beneficial (see the communicator sketch below)
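A minimal sketch of how two orthogonal communicators can be carved out of a 2D view of the ranks, and how the rank ordering decides which of the two groups consecutive (typically intra-node) ranks. This is illustrative only and is not claimed to match CGYRO's actual MPI_ORDER semantics.

/* Hypothetical sketch: ranks viewed as an ni x nj grid, split into two
 * orthogonal communicators. With order == 1 the members of comm1 are
 * consecutive ranks; with order == 2 the same holds for comm2. */
#include <mpi.h>

void make_2d_comms(MPI_Comm world, int ni, int nj, int order,
                   MPI_Comm *comm1, MPI_Comm *comm2) {
    int rank;
    MPI_Comm_rank(world, &rank);

    /* order == 1: index i varies fastest over consecutive ranks;
     * order == 2: index j varies fastest. */
    int i = (order == 1) ? rank % ni : rank / nj;
    int j = (order == 1) ? rank / ni : rank % nj;

    /* comm1 groups ranks sharing j; comm2 groups ranks sharing i. */
    MPI_Comm_split(world, j, i, comm1);
    MPI_Comm_split(world, i, j, comm2);
}

Since schedulers pack consecutive ranks onto the same node, whichever communicator is built from consecutive ranks does most of its traffic over NVLINK/shared memory instead of the network, which is the effect the slide describes.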
Algorithmic data reductions
• When #MPI is a multiple of #species
• One can exchange only per-species data
• Comm2 data volume cut by #species
• Smarter, adaptive time advance
• Reduces both compute time and data volume
• Changes semantics, but good theoretical foundations
Jan 18, 2022 18
Used in sh04
Not yet automatic,
but should be
(VELOCITY_ORDER=2)
Algorithmic data reductions
• When #MPI is a multiple of #species
• One can exchange only per-species data
• Comm2 data volume cut by #species
• Smarter, adaptive time advance
• Reduces both compute time and data volume
• Changes semantics, but good theoretical foundations
• Using single precision for communication instead of double
(a minimal sketch follows this slide)
• Existing option for a subset of Comm2 (upwind)
• Implications for simulation fidelity not yet fully understood
Jan 18, 2022 19
Used in sh04
Not yet automatic,
but should be
(VELOCITY_ORDER=2)
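A minimal sketch of the single-precision-communication idea above: down-cast the exchange buffer to 32-bit floats around the MPI call, halving the bytes on the wire while keeping the solver state in double precision. The host-side loops and names are illustrative; CGYRO's actual option operates on GPU-resident complex fields.

/* Hypothetical sketch: exchange in single precision, compute in double. */
#include <mpi.h>
#include <stdlib.h>

void alltoall_single(const double *send, double *recv, int count_per_rank,
                     MPI_Comm comm) {
    int nranks;
    MPI_Comm_size(comm, &nranks);
    size_t n = (size_t)nranks * count_per_rank;

    float *s32 = malloc(n * sizeof(float));
    float *r32 = malloc(n * sizeof(float));

    for (size_t k = 0; k < n; k++)       /* double -> float before sending */
        s32[k] = (float)send[k];

    MPI_Alltoall(s32, count_per_rank, MPI_FLOAT,
                 r32, count_per_rank, MPI_FLOAT, comm);

    for (size_t k = 0; k < n; k++)       /* float -> double after receiving */
        recv[k] = (double)r32[k];

    free(s32);
    free(r32);
}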
[Figure: nl03 wallclock time (s) vs. number of nodes (16 to 1024) across systems: Skylake Stampede2, Cori KNL, Piz Daint, Titan, Summit, Frontera, Perlmutter]
CGYRO compute over the years
• The nl03 benchmark test has long been representative of
CGYRO simulations
• True multiscale was
considered a heroic run
• Here you can see how
runtimes have shrunk
with the newer systems
Jan 18, 2022 20
Perlmutter results obtained
on Phase 1 setup.
(64 x 512 x 32 x 24 x 8) x 3 species
report time = 0.1 a/c_s
CGYRO Multiscale benchmarks
• We are starting to collect sh04 benchmark data
on modern systems
• Not many points
but scaling looks
good on Summit
and Perlmutter
Jan 18, 2022 21
Perlmutter results obtained
on Phase 1 setup.
(128 x 1152 x 24 x 18 x 8) x 3 species
report time = 1.0 a/c_s
[Figure: two panels of the electron energy flux spectrum Q_e(k)/Q_GB vs. normalized wavenumber (0 to 25); one panel with y-range 0 to 0.10, the other 0 to 1.4]
The importance of multiscale simulations
• Some recent insights
Jan 18, 2022
22
“typical” ion-scale
ITG turbulence
High resolution
multiscale ETG
turbulence
3456 x 574 FFT
Very different
results!
[Figure: two panels of the electron energy flux spectrum Q_e(k)/Q_GB vs. normalized wavenumber (0 to 25); one panel with y-range 0 to 0.10, the other 0 to 1.4]
The importance of multiscale simulations
• Some recent insights
Jan 18, 2022
23
“typical” ion-scale
ITG turbulence
High resolution
multiscale ETG
turbulence
3456 x 574 FFT
Very different
results!
Simulating burning plasmas in
H-mode regimes will be very
complex, requiring high
(multiscale) resolution
The need for multiple species
• Most simulations so far used only 2 or 3 species, typically
• deuterium ions + electrons
• wall impurity (carbon)
• But burning plasmas are much more complex,
we will need many more species
• electrons
• 2 fuel ions (D and T)
• helium ash
• Low-Z and high-Z wall impurities (beryllium and tungsten)
• That will drastically increase the compute cost
• But recent improvements should limit data-size growth
Jan 18, 2022 24
Multi-species scaling results
• Benchmark results from Summit
• Strong scaling with 4 species
• Weak scaling going from 2 to 6 species
Jan 18, 2022 25
[Figure: Multiscale CGYRO Simulation on Summit; wallclock time (s, 1 to 64) vs. number of nodes (128 to 2048); strong-scaling and weak-scaling curves, with species counts 2, 3, 4, 6 marked; annotation: "> 20% Summit"]
These results were without recent
multi-species communication optimizations
1.4x faster
(192 x 2304 x 24 x 18 x 8) x 3 species
report time = 0.16 a/c_s
nl05 benchmark test
Summary and future work
• Simulating burning plasmas in H-mode regimes will be very complex
• Requiring multiscale resolution and multiple species
• CGYRO is well positioned to provide those insights
on current and upcoming HPC systems
• The porting to GPUs drastically reduced the observed compute time
• Algorithmic improvements have reduced communication cost
• Future work
• Ensure the code continues to work on new HW (e.g. AMD GPUs)
• Further algorithmic improvements
Jan 18, 2022 - Joint US-Japan Workshop on Exascale Computing Collaboration and 6th workshop of US-Japan Joint Institute for Fusion Theory (JIFT) program 26
Acknowledgments
• Many people contributed to the development and optimization
of CGYRO, and I would like to explicitly acknowledge
J. Candy, E. Belli, K. Hallatschek, C. Holland, N. Howard, and E. D'Azevedo
• This work was supported by the U.S. Department of Energy under
awards DE-FG02-95ER54309, DE-FC02-06ER54873 (Edge Simulation
Laboratory), and DE-SC0017992 (AToM SciDAC-4 project).
• Computing resources were provided by NERSC and by OLCF through
the INCITE and ALCC programs.
27
Jan 18, 2022
Further reading
• The most comprehensive reference:
• J. Candy et al., “Multiscale-optimized plasma turbulence simulation on
petascale architectures”, Computers & Fluids, vol. 188, 125 (2019)
https://doi.org/10.1016/j.compfluid.2019.04.016
• Additional material:
• I. Sfiligoi et al., “CGYRO Performance on Power9 CPUs and Volta GPUs”,
Lecture Notes in Computer Science, vol 11203 (2018).
https://doi.org/10.1007/978-3-030-02465-9_24
• I. Sfiligoi et al. “Fusion Research Using Azure A100 HPC instances”, Poster 149
at SC21 (2021) https://sc21.supercomputing.org/proceedings/tech_poster/
tech_poster_pages/rpost149.html
28
Jan 18, 2022
