Performance Optimization of
CGYRO for Multiscale
Turbulence Simulations
Presented by Igor Sfiligoi, UCSD/SDSC
in collaboration with Emily Belli and Jeff Candy, GA
at the joint
US-Japan Workshop on Exascale Computing Collaboration and
6th workshop of US-Japan Joint Institute for Fusion Theory (JIFT) program
Jan 18th 2022
1
What is CGYRO?
• An Eulerian GyroKinetic Fusion Plasma solver
• Uses a radially spectral formulation
• Implements the complete Sugama electromagnetic gyrokinetic theory
• Re-implemented from scratch with Multiscale Turbulence in mind
• Built on lessons learned from GYRO and NEO
• With heavy reliance on system-provided FFT libraries
• Uses MPI for multi-node/multi-device scaling
• OpenMP for multi-core support (can be combined with MPI; a minimal hybrid sketch follows this slide)
• OpenACC/cuFFT for GPU support
https://gafusion.github.io/doc/cgyro.html
2
Jan 18, 2022
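To make the MPI + OpenMP combination above concrete, here is a minimal hybrid sketch in C. It is illustrative only; CGYRO itself is written in Fortran, and all names and the work loop here are made up.

/* Minimal hybrid MPI+OpenMP sketch (illustrative only, not CGYRO source).
 * One MPI process per node or NUMA domain, OpenMP threads inside it. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Ask for threaded MPI; FUNNELED is enough if only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;
    /* Thread-parallel work inside each MPI process */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i);

    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("ranks=%d sum=%f\n", nranks, global_sum);
    MPI_Finalize();
    return 0;
}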
A 5D mesh
Jan 18, 2022 3
Suitable for most Fusion Plasma studies
• For describing the transport in the core of the Tokamak,
simulating ion-scale turbulence is sufficient
• A coarse-grained mesh is sufficient, making it very fast
• The pedestal region needs proper simulation of multiscale turbulence,
i.e. accounting for the coupling of ion-scale and electron-scale turbulence
• A fine-grained mesh is needed, which is both slower and requires more memory
• And (of course) anything in between
• Cell size can be tuned to minimize simulation cost
while providing the desired accuracy
Jan 18, 2022 4
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• A few compute-intensive kernels
• The most compute-intensive work happens in an external library
• FFT transforms
Jan 18, 2022 5
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• A few compute-intensive kernels
• The most compute-intensive work happens in an external library
• FFT transforms
Jan 18, 2022 6
Multiscale simulations
can effectively use
O(1k) nodes and
O(10k) cores
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• A few compute-intensive kernels
• The most compute-intensive work happens in an external library
• FFT transforms
• Using OpenMP made the transition easy
• We used OpenACC for GPU compute
• Very few changes needed
• Loop reordering for performance
• Explicit movement of memory between system and GPU
(see the sketch after this slide)
• NVIDIA provides GPU-optimized FFT
• Needs batching, but otherwise very similar to FFTW
Jan 18, 2022 7
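As a rough illustration of the changes listed above (explicit host/GPU data movement plus offloaded, reordered loops), here is a minimal hypothetical OpenACC sketch in C. CGYRO itself is Fortran, so this only mirrors the pattern, not the actual kernels; all names and sizes are made up.

/* Hypothetical OpenACC sketch of the pattern described on this slide.
 * Illustrates explicit data movement and loop offload only. */
void scale_field(double *h, const double *rhs, double dt, int n_radial, int n_theta) {
    /* Keep the arrays resident on the GPU for the duration of the region:
     * copy rhs in once, copy h in and back out once. */
    #pragma acc data copyin(rhs[0:n_radial*n_theta]) copy(h[0:n_radial*n_theta])
    {
        /* Collapse the two loops so the GPU sees one large iteration space;
         * loop order chosen so the inner index gives stride-1 access. */
        #pragma acc parallel loop collapse(2)
        for (int ir = 0; ir < n_radial; ir++)
            for (int it = 0; it < n_theta; it++) {
                int idx = ir * n_theta + it;
                h[idx] += dt * rhs[idx];
            }
    }
}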
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• A few compute-intensive kernels
• The most compute-intensive work happens in an external library
• FFT transforms
• Using OpenMP made the transition easy
• We used OpenACC for GPU compute
• Very few changes needed
• Loop reordering for performance
• Explicit movement of memory
between system and GPU
• NVIDIA provides a GPU-optimized FFT library
• Needs batching, but otherwise very similar to FFTW
(see the cuFFT sketch after this slide)
Jan 18, 2022 8
GPU memory size was a problem on Titan; only a fraction of the code ran on GPUs there.
On modern systems (Summit, Perlmutter) virtually no CPU-bound code remains.
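The batching mentioned above is the main practical difference from a plain FFTW-style interface: one cuFFT plan executes many independent transforms in a single call. A minimal hypothetical sketch follows; the sizes, layout, and function names are illustrative, not CGYRO's actual plan setup.

/* Hypothetical sketch of a batched cuFFT plan; sizes are made up. */
#include <cufft.h>
#include <cuda_runtime.h>

cufftHandle make_batched_plan(int nx, int ny, int batch) {
    cufftHandle plan;
    int n[2] = { nx, ny };              /* 2D transform size */
    /* One plan executes `batch` independent 2D complex-to-complex FFTs,
     * laid out back to back in memory. NULL inembed/onembed selects the
     * default packed layout, so the stride/dist arguments are ignored. */
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, nx * ny,     /* input layout  */
                  NULL, 1, nx * ny,     /* output layout */
                  CUFFT_Z2Z, batch);
    return plan;
}

void run_ffts(cufftHandle plan, cufftDoubleComplex *d_in, cufftDoubleComplex *d_out) {
    /* d_in/d_out are device pointers; all `batch` transforms run in one call. */
    cufftExecZ2Z(plan, d_in, d_out, CUFFT_FORWARD);
    cudaDeviceSynchronize();
}

A single plan covering the whole batch keeps enough work in flight to fill the GPU, which a loop over many small individual FFTs would not.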
GPUs are fast
• A100 GPUs are an order of magnitude faster than KNL CPUs
Jan 18, 2022 9
[Chart annotations: nl03 (intermediate-scale benchmark test case), (64 x 512 x 32 x 24 x 8) x 3 species, report time = 0.1 a/c_s: 96 V100 GPUs; 64 A100 GPUs; 6x faster.
sh04 (multiscale benchmark test case), (128 x 1152 x 24 x 18 x 8) x 3 species, report time = 1.0 a/c_s: 576 V100 GPUs; 384 A100 GPUs; 10x faster.
Perlmutter results obtained on Phase 1 setup.]
GPUs are fast
• A100 GPUs are an order of magnitude faster than KNL CPUs
Jan 18, 2022 10
[Chart annotations: nl03 (intermediate-scale benchmark test case), (64 x 512 x 32 x 24 x 8) x 3 species, report time = 0.1 a/c_s: 96 V100 GPUs; 64 A100 GPUs; 6x faster.
sh04 (multiscale benchmark test case), (128 x 1152 x 24 x 18 x 8) x 3 species, report time = 1.0 a/c_s: 576 V100 GPUs; 384 A100 GPUs; 10x faster.
4 A100 GPUs are about as fast as 6 V100 GPUs.
Perlmutter results obtained on Phase 1 setup.]
GPUs are fast
• A100 GPU much faster
than KNL CPU
• But compute now less than
50% of total time!
Jan 18, 2022 11
Perlmutter results obtained
on Phase 1 setup.
4x faster
3x faster
nl03 - Intermediate-scale benchmark test case
sh04 - Multiscale benchmark test case
CGYRO is communication heavy
• CGYRO exchanges a significant amount of data during compute
(numbers are for each process, per timing period; a rough bandwidth estimate follows this slide)
• nl03 - Using 16 Perlmutter nodes
• Data: ~ 28 GB
• Compute: ~ 9 s
• sh04 - Using 96 Perlmutter nodes
• Data: ~140 GB
• Compute: ~ 22 s
Jan 18, 2022 12
5x data volume
2.5x compute time
Most of the communication is AllToAll
Some AllReduce
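A rough, back-of-the-envelope reading of the numbers above (assuming, for simplicity, that the exchanged data is not overlapped with compute): each nl03 process would have to sustain about 28 GB / 9 s ≈ 3.1 GB/s, and each sh04 process about 140 GB / 22 s ≈ 6.4 GB/s, just for the communication time to stay comparable to the compute time. With several processes sharing each node's network link, this is why the network quickly becomes the limiting factor, as the next slides show.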
Networks are the bottleneck
• A single Google Cloud 16x A100 node is
as fast as 8 Perlmutter nodes (4x A100 each)
Jan 18, 2022 13
32 A100 GPUs on Perlmutter (Phase 1)
16 A100 GPUs on Google Cloud (GCP)
For intermediate-scale benchmark test case nl03
(not enough GPU RAM for anything bigger)
Perlmutter results obtained
on Phase 1 setup.
On GCP, using OpenMPI v3
provided by NVIDIA SDK
Networks are the bottleneck
• A single Google Cloud 16x A100 node is
as fast as 8 Perlmutter nodes (4x A100 each)
Jan 18, 2022 14
32 A100 GPUs on Perlmutter (Phase 1)
16 A100 GPUs on Google Cloud (GCP)
For intermediate-scale benchmark test case nl03
(not enough GPU RAM for anything bigger)
Perlmutter results obtained
on Phase 1 setup.
GPU-to-GPU communication is much faster than networking:
NVLINK provides 4.8 Tbps of bandwidth, while
Perlmutter has 200 Gbps Slingshot for 4 GPUs.
The use of GPU-aware MPI communication is essential;
PCIe is almost 10x slower than NVLINK.
(A minimal GPU-aware MPI sketch follows this slide.)
On GCP, using OpenMPI v3
provided by NVIDIA SDK
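A minimal sketch of the GPU-aware MPI pattern referred to above, assuming a CUDA-aware MPI build; buffer names, sizes, and the all-to-all shape are illustrative, not CGYRO's actual exchange.

/* Hypothetical sketch: all-to-all exchange of GPU-resident buffers.
 * Requires a CUDA-aware MPI build; otherwise the buffers would first
 * have to be staged through host memory. */
#include <mpi.h>
#include <cuda_runtime.h>

void alltoall_device(MPI_Comm comm, int count_per_rank) {
    int nranks;
    MPI_Comm_size(comm, &nranks);

    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, (size_t)nranks * count_per_rank * sizeof(double));
    cudaMalloc((void **)&d_recv, (size_t)nranks * count_per_rank * sizeof(double));

    /* ... fill d_send on the GPU ... */

    /* Device pointers passed directly; the MPI library moves the data
     * GPU-to-GPU (NVLINK inside a node, the network between nodes). */
    MPI_Alltoall(d_send, count_per_rank, MPI_DOUBLE,
                 d_recv, count_per_rank, MPI_DOUBLE, comm);

    cudaFree(d_send);
    cudaFree(d_recv);
}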
Minimizing network traffic a must
• Multi-node deployments a requirement for multiscale simulations
• Memory constraints
• Time to solution
• But one can try to
reduce network traffic
by keeping most of the
data inside the node
Jan 18, 2022 15
Intermediate-scale benchmark test case nl03
Perlmutter results obtained
on Phase 1 setup.
2x faster
1.4x faster
CGYRO 2D communication pattern
• Communication happens on 2 orthogonal communicators
• One fixed size, the other increases with #MPI
• Different amounts of data on the two communicators
• First communicator typically
much more “chatty”
• For small to medium simulations
• Keeping most of it inside the node
will reduce network traffic
• But if we increase #MPI, the
other communicator data will increase
Jan 18, 2022 16
-30%
-27%
-12% +10%
CGYRO 2D communication pattern
• Communication happens on 2 orthogonal communicators
• One fixed size, the other increases with #MPI
• Different amounts of data on the two communicators
• First communicator typically
much more “chatty”
• For small to medium simulations
• Keeping most of it inside the node
will reduce network traffic
• But if we increase #MPI, the
other communicator data will increase
Jan 18, 2022 17
-30%
-27%
-12% +10%
Not a solution for Multiscale simulations
MPI_ORDER=2 still beneficial (see the communicator sketch below)
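A minimal sketch of how two orthogonal communicators can be carved out of a 2D view of the ranks, and how the rank ordering decides which of the two groups consecutive (typically intra-node) ranks. This is illustrative only and is not claimed to match CGYRO's actual MPI_ORDER semantics.

/* Hypothetical sketch: ranks viewed as an ni x nj grid, split into two
 * orthogonal communicators. With order == 1 the members of comm1 are
 * consecutive ranks; with order == 2 the same holds for comm2. */
#include <mpi.h>

void make_2d_comms(MPI_Comm world, int ni, int nj, int order,
                   MPI_Comm *comm1, MPI_Comm *comm2) {
    int rank;
    MPI_Comm_rank(world, &rank);

    /* order == 1: index i varies fastest over consecutive ranks;
     * order == 2: index j varies fastest. */
    int i = (order == 1) ? rank % ni : rank / nj;
    int j = (order == 1) ? rank / ni : rank % nj;

    /* comm1 groups ranks sharing j; comm2 groups ranks sharing i. */
    MPI_Comm_split(world, j, i, comm1);
    MPI_Comm_split(world, i, j, comm2);
}

Since schedulers pack consecutive ranks onto the same node, whichever communicator is built from consecutive ranks does most of its traffic over NVLINK/shared memory instead of the network, which is the effect the slide describes.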
Algorithmic data reductions
• When #MPI is a multiple of #species
• One can exchange only per-species data
• Comm2 data volume cut by #species
• Smarter, adaptive time advance
• Reduces both compute time and data volume
• Changes semantics, but good theoretical foundations
Jan 18, 2022 18
Used in sh04
Not yet automatic,
but should be
(VELOCITY_ORDER=2)
Algorithmic data reductions
• When #MPI is a multiple of #species
• One can exchange only per-species data
• Comm2 data volume cut by #species
• Smarter, adaptive time advance
• Reduces both compute time and data volume
• Changes semantics, but good theoretical foundations
• Using single precision for communication instead of double
(a minimal sketch follows this slide)
• Existing option for a subset of Comm2 (upwind)
• Implications for simulation fidelity not yet fully understood
Jan 18, 2022 19
Used in sh04
Not yet automatic,
but should be
(VELOCITY_ORDER=2)
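A minimal sketch of the single-precision-communication idea above: down-cast the exchange buffer to 32-bit floats around the MPI call, halving the bytes on the wire while keeping the solver state in double precision. The host-side loops and names are illustrative; CGYRO's actual option operates on GPU-resident complex fields.

/* Hypothetical sketch: exchange in single precision, compute in double. */
#include <mpi.h>
#include <stdlib.h>

void alltoall_single(const double *send, double *recv, int count_per_rank,
                     MPI_Comm comm) {
    int nranks;
    MPI_Comm_size(comm, &nranks);
    size_t n = (size_t)nranks * count_per_rank;

    float *s32 = malloc(n * sizeof(float));
    float *r32 = malloc(n * sizeof(float));

    for (size_t k = 0; k < n; k++)       /* double -> float before sending */
        s32[k] = (float)send[k];

    MPI_Alltoall(s32, count_per_rank, MPI_FLOAT,
                 r32, count_per_rank, MPI_FLOAT, comm);

    for (size_t k = 0; k < n; k++)       /* float -> double after receiving */
        recv[k] = (double)r32[k];

    free(s32);
    free(r32);
}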
[Figure: nl03 wallclock time (s) vs. number of nodes (16 to 1024) across systems: Skylake Stampede2, Cori KNL, Piz Daint, Titan, Summit, Frontera, Perlmutter]
CGYRO compute over the years
• The nl03 benchmark test has long been representative of
CGYRO simulations
• True multiscale was
considered a heroic run
• Here you can see how
runtimes have shrunk
with the newer systems
Jan 18, 2022 20
Perlmutter results obtained
on Phase 1 setup.
(64 x 512 x 32 x 24 x 8) x 3 species
report time = 0.1 a/c_s
CGYRO Multiscale benchmarks
• We are starting to collect sh04 benchmark data
on modern systems
• Not many points
but scaling looks
good on Summit
and Perlmutter
Jan 18, 2022 21
Perlmutter results obtained
on Phase 1 setup.
(128 x 1152 x 24 x 18 x 8) x 3 species
report time = 1.0 a/c_s
[Figure: two panels of the electron energy flux spectrum Q_e(k)/Q_GB vs. normalized wavenumber (0 to 25); one panel with y-range 0 to 0.10, the other 0 to 1.4]
The importance of multiscale simulations
• Some recent insights
Jan 18, 2022
22
“typical” ion-scale
ITG turbulence
High resolution
multiscale ETG
turbulence
3456 x 574 FFT
Very different
results!
[Figure: two panels of the electron energy flux spectrum Q_e(k)/Q_GB vs. normalized wavenumber (0 to 25); one panel with y-range 0 to 0.10, the other 0 to 1.4]
The importance of multiscale simulations
• Some recent insights
Jan 18, 2022
23
“typical” ion-scale
ITG turbulence
High resolution
multiscale ETG
turbulence
3456 x 574 FFT
Very different
results!
Simulating burning plasmas in
H-mode regimes will be very
complex, requiring high
(multiscale) resolution
The need for multiple species
• Most simulations so far used only 2 or 3 species, typically
• deuterium ions + electrons
• wall impurity (carbon)
• But burning plasmas are much more complex,
we will need many more species
• electrons
• 2 fuel ions (D and T)
• helium ash
• Low-Z and high-Z wall impurities (beryllium and tungsten)
• That will drastically increase the compute cost
• But recent improvements should limit data-size growth
Jan 18, 2022 24
Multi-species scaling results
• Benchmark results from Summit
• Strong scaling with 4 species
• Weak scaling going from 2 to 6 species
Jan 18, 2022 25
[Figure: Multiscale CGYRO Simulation on Summit; wallclock time (s, 1 to 64) vs. number of nodes (128 to 2048); strong-scaling and weak-scaling curves, with species counts 2, 3, 4, 6 marked; annotation: "> 20% Summit"]
These results were without recent
multi-species communication optimizations
1.4x faster
(192 x 2304 x 24 x 18 x 8) x 3 species
report time = 0.16 a/c_s
nl05 benchmark test
Summary and future work
• Simulating burning plasmas in H-mode regimes will be very complex
• Requiring multiscale resolution and multiple species
• CGYRO is well positioned to provide those insights
on current and upcoming HPC systems
• The porting to GPUs drastically reduced the observed compute time
• Algorithmic improvements have reduced communication cost
• Future work
• Ensure the code continues to work on new HW (e.g. AMD GPUs)
• Further algorithmic improvements
Jan 18, 2022 - Joint US-Japan Workshop on Exascale Computing Collaboration and 6th workshop of US-Japan Joint Institute for Fusion Theory (JIFT) program 26
Acknowledgments
• Many people contributed to the development and optimization
of CGYRO, and I would like to explicitly acknowledge
J. Candy, E. Belli, K. Hallatschek, C. Holland, N. Howard, and E. D'Azevedo
• This work was supported by the U.S. Department of Energy under
awards DE-FG02-95ER54309, DE-FC02-06ER54873 (Edge Simulation
Laboratory), and DE-SC0017992 (AToM SciDAC-4 project).
• Computing resources were provided by NERSC and by OLCF through
the INCITE and ALCC programs.
27
Jan 18, 2022
Further reading
• The most comprehensive reference:
• J. Candy et al., “Multiscale-optimized plasma turbulence simulation on
petascale architectures”, Computers & Fluids, vol. 188, 125 (2019)
https://doi.org/10.1016/j.compfluid.2019.04.016
• Additional material:
• I. Sfiligoi et al., “CGYRO Performance on Power9 CPUs and Volta GPUs”,
Lecture Notes in Computer Science, vol 11203 (2018).
https://doi.org/10.1007/978-3-030-02465-9_24
• I. Sfiligoi et al. “Fusion Research Using Azure A100 HPC instances”, Poster 149
at SC21 (2021) https://sc21.supercomputing.org/proceedings/tech_poster/
tech_poster_pages/rpost149.html
28
Jan 18, 2022
