Performance Optimization of
CGYRO for Multiscale
Turbulence Simulations
Presented by Igor Sfiligoi, UCSD/SDSC
in collaboration with Emily Belli and Jeff Candy, GA
at the joint
US-Japan Workshop on Exascale Computing Collaboration and
6th workshop of US-Japan Joint Institute for Fusion Theory (JIFT) program
Jan 18th 2022
1
What is CGYRO?
• An Eulerian GyroKinetic Fusion Plasma solver
• Uses a radially spectral formulation
• Implements the complete Sugama electromagnetic gyrokinetic theory
• Re-implemented from scratch with Multiscale Turbulence in mind
• Built on lessons learned from GYRO and NEO
• With heavy reliance on system-provided FFT libraries
• Uses MPI for multi-node/multi-device scaling
• OpenMP for multi-core support (can be combined with MPI)
• OpenACC/cuFFT for GPU support
https://gafusion.github.io/doc/cgyro.html
2
Jan 18, 2022
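As a rough illustration of the MPI + OpenMP combination listed above, here is a minimal C sketch of the hybrid pattern (not CGYRO source; the array size and work loops are placeholders):

```c
/* Hedged sketch of the MPI + OpenMP hybrid: MPI across nodes/devices,
 * OpenMP across the cores of each rank. Sizes and work are placeholders. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)   /* per-rank slice of the global mesh (illustrative) */
static double field[N];

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Request threaded MPI, since OpenMP threads coexist with MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Core-level parallelism inside each MPI rank */
    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        field[i] = (double)(rank + i) * 1e-6;
        local += field[i];
    }

    /* Rank-to-rank reduction over MPI */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("ranks=%d threads=%d sum=%g\n",
               nranks, omp_get_max_threads(), global);
    MPI_Finalize();
    return 0;
}
```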
A 5D mesh
Jan 18, 2022 3
Suitable for most Fusion Plasma studies
• For describing the transport in the core of the Tokamak,
simulating ion-scale turbulence is sufficient
• Coarse-grained mesh sufficient, making it very fast
• The pedestal region needs proper simulation of multi-scale turbulence,
i.e. accounting for the coupling of ion-scale and electron-scale turbulence
• A fine-grained mesh is needed, which is both slower and requires more memory
• And (of course) anything in between
• Cell size can be tuned to minimize simulation cost
while providing the desired accuracy
Jan 18, 2022 4
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• Few, compute-intensive kernels
• The most compute-intensive
work happens in an
external library
• FFT transforms
Jan 18, 2022 5
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• Few, compute-intensive kernels
• The most compute-intensive
work happens in an
external library
• FFT transforms
Jan 18, 2022 6
Multiscale simulations
can effectively use
O(1k) nodes and
O(10k cores)
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• Few, compute-intensive kernels
• The most compute-intensive
work happens in an
external library
• FFT transforms
• Using OpenMP made the transition easy
• We used OpenACC for GPU compute (sketch below)
• Very few changes needed
• Loop reordering for performance
• Explicit movement of memory
between system and GPU
• NVIDIA provides GPU-optimized FFTs (cuFFT)
• Needs batching,
but otherwise very similar to FFTW
Jan 18, 2022 7
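A minimal C/OpenACC sketch of the porting pattern described on this slide: loop-level offload plus explicit data movement between system and GPU memory. The routine, array names, and extents are illustrative, not taken from CGYRO:

```c
/* Hedged sketch of the OpenACC porting pattern: offload directives on the
 * hot loops plus explicit host<->GPU data movement. The routine, array
 * names, and extents (nl, nv) are illustrative, not CGYRO's. */
void advance(double *h, const double *rhs, int nl, int nv, double dt) {
    /* Keep both arrays resident on the GPU for the whole kernel;
     * copy h back to the host only once, at the end of the region */
    #pragma acc data copy(h[0:nl*nv]) copyin(rhs[0:nl*nv])
    {
        /* Loops ordered so the innermost index is contiguous in memory */
        #pragma acc parallel loop collapse(2)
        for (int j = 0; j < nv; j++)
            for (int i = 0; i < nl; i++)
                h[j*nl + i] += dt * rhs[j*nl + i];
    }
}
```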
Porting CGYRO to GPUs
• Like most SW, CGYRO initially
developed for CPU compute
• MPI + OpenMP
• Few, compute-intensive kernels
• The most compute-intensive
work happens in an
external library
• FFT transforms
• Using OpenMP made the transition easy
• We used OpenACC for GPU compute
• Very few changes needed
• Loop reordering for performance
• Explicit movement of memory
between system and GPU
• NVIDIA provides GPU-optimized FFTs (cuFFT)
• Needs batching (sketch below),
but otherwise very similar to FFTW
Jan 18, 2022 8
GPU Memory size was a problem on Titan,
only a fraction of code ran on GPUs there.
On modern systems (Summit, Perlmutter)
virtually no CPU-bound code anymore.
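The batching point above refers to grouping many same-size transforms into a single cuFFT plan. A hedged sketch, with placeholder transform shape and batch count (not CGYRO's actual sizes):

```c
/* Hedged sketch of a batched cuFFT plan: many same-size 2D transforms are
 * grouped into one plan and executed with a single call. Sizes are
 * placeholders, not CGYRO's actual configuration. */
#include <cufft.h>
#include <cuda_runtime.h>

int main(void) {
    int n[2] = {128, 1152};          /* transform shape (illustrative)  */
    int batch = 256;                 /* number of transforms per call   */
    size_t len = (size_t)n[0] * n[1] * batch;

    cufftDoubleComplex *d_buf;
    cudaMalloc((void **)&d_buf, len * sizeof(cufftDoubleComplex));

    cufftHandle plan;
    /* One plan covers the whole batch; NULL embeds mean contiguous layout */
    cufftPlanMany(&plan, 2, n, NULL, 1, n[0] * n[1],
                            NULL, 1, n[0] * n[1], CUFFT_Z2Z, batch);

    cufftExecZ2Z(plan, d_buf, d_buf, CUFFT_FORWARD);   /* in-place batch */
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_buf);
    return 0;
}
```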
GPUs are fast
• A100 GPUs an order of magnitude faster than KNL CPUs
Jan 18, 2022 9
Intermediate-scale benchmark test case nl03
96 V100 GPUs
Multiscale benchmark test case sh04
64 A100 GPUs
6x faster
384 A100 GPUs
Perlmutter results obtained
on Phase 1 setup.
576 V100 GPUs
10x faster
(64 x 512 x 32 x 24 x 8) x 3 species
report time = 0.1 a/c_s
(128 x 1152 x 24 x 18 x 8) x 3 species
report time = 1.0 a/c_s
GPUs are fast
• A100 GPUs an order of magnitude faster than KNL CPUs
Jan 18, 2022 10
Intermediate-scale benchmark test case nl03
96 V100 GPUs
Multiscale benchmark test case sh04
64 A100 GPUs
6x faster
384 A100 GPUs
10x faster
4 A100 GPUs
about as fast as
6 V100 GPUs
576 V100 GPUs
Perlmutter results obtained
on Phase 1 setup.
(64 x 512 x 32 x 24 x 8) x 3 species
report time = 0.1 a/c_s
(128 x 1152 x 24 x 18 x 8) x 3 species
report time = 1.0 a/c_s
GPUs are fast
• A100 GPU much faster
than KNL CPU
• But compute now less than
50% of total time!
Jan 18, 2022 11
Perlmutter results obtained
on Phase 1 setup.
4x faster
3x faster
nl03 - Intermediate-scale benchmark test case
sh04 - Multiscale benchmark test case
CGYRO communication heavy
• CGYRO exchanges a significant amount of data during compute
(numbers for each process, per timing period)
• nl03 - Using 16 Perlmutter nodes
• Data: ~ 28 GB
• Compute: ~ 9 s
• sh04 - Using 96 Perlmutter nodes
• Data: ~140 GB
• Compute: ~ 22 s
Jan 18, 2022 12
5x data volume
2.5x compute time
Most of the communication is AllToAll
Some AllReduce
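For reference, the dominant AllToAll exchange looks roughly like the following C/MPI sketch; the per-peer chunk size and buffer contents are placeholders standing in for the distributed field data:

```c
/* Hedged sketch of the dominant AllToAll pattern: every rank exchanges an
 * equal-size chunk with every other rank (e.g. for a distributed data
 * rearrangement). Chunk size and buffer contents are placeholders. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int chunk = 1 << 16;   /* doubles sent to each peer (illustrative) */
    double *sendbuf = malloc((size_t)chunk * nranks * sizeof(double));
    double *recvbuf = malloc((size_t)chunk * nranks * sizeof(double));
    for (int i = 0; i < chunk * nranks; i++) sendbuf[i] = (double)rank;

    /* The exchanged volume grows with both chunk size and rank count */
    MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                 recvbuf, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```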
Networks are the bottleneck
• Single Google Cloud 16x A100 node
as fast as 8 Perlmutter nodes (x4 A100)
Jan 18, 2022 13
32 A100 GPUs on Perlmutter (Phase 1)
16 A100 GPUs on Google Cloud (GCP)
For intermediate-scale benchmark test case nl03
(not enough GPU RAM for anything bigger)
Perlmutter results obtained
on Phase 1 setup.
On GCP, using OpenMPI v3
provided by NVIDIA SDK
Networks are the bottleneck
• Single Google Cloud 16x A100 node
as fast as 8 Perlmutter nodes (x4 A100)
Jan 18, 2022 14
32 A100 GPUs on Perlmutter (Phase 1)
16 A100 GPUs on Google Cloud (GCP)
For intermediate-scale benchmark test case nl03
(not enough GPU RAM for anything bigger)
Perlmutter results obtained
on Phase 1 setup.
GPU-to-GPU communication much faster than networking.
NVLink provides 4.8 Tbps bandwidth.
Perlmutter has 200 Gbps Slingshot for 4 GPUs.
The use of GPU-aware MPI communication is essential.
PCIe almost 10x slower than NVLink.
On GCP, using OpenMPI v3
provided by NVIDIA SDK
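A sketch of what GPU-aware MPI means in practice, per the callout on this slide: device buffers are handed to MPI directly, avoiding an explicit staging copy through host memory. This assumes a CUDA-aware MPI build; sizes are illustrative:

```c
/* Hedged sketch of GPU-aware MPI: GPU-resident buffers are passed directly
 * to MPI, so no explicit cudaMemcpy to the host is needed around the
 * exchange. Assumes a CUDA-aware MPI build; sizes are illustrative. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int chunk = 1 << 16;
    double *d_send, *d_recv;   /* GPU-resident buffers */
    cudaMalloc((void **)&d_send, (size_t)chunk * nranks * sizeof(double));
    cudaMalloc((void **)&d_recv, (size_t)chunk * nranks * sizeof(double));

    /* A CUDA-aware MPI accepts device pointers; fast GPU-to-GPU and
     * GPU-to-NIC paths are then used under the hood */
    MPI_Alltoall(d_send, chunk, MPI_DOUBLE,
                 d_recv, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```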
Minimizing network traffic a must
• Multi-node deployments a requirement for multiscale simulations
• Memory constraints
• Time to solution
• But one can try to
reduce network traffic
by keeping most of the
data inside the node
Jan 18, 2022 15
Intermediate-scale benchmark test case nl03
Perlmutter results obtained
on Phase 1 setup.
2x faster
1.4x faster
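One generic way (not necessarily what CGYRO does internally) to discover which ranks share a node, which is the information needed to keep most of the data inside the node:

```c
/* Generic sketch: group the ranks that share a node using the standard
 * MPI-3 shared-memory split. Not CGYRO source code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm node_comm;
    /* Groups together all ranks that can share memory, i.e. one node */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    printf("global rank %d is local rank %d of %d on its node\n",
           rank, node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```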
CGYRO 2D communication pattern
• Communication happens on 2 orthogonal communicators
• One fixed size, the other increases with #MPI
• Different amounts of data on the two communicators
• First communicator typically
much more “chatty”
• For small to medium simulations
• Keeping most of it inside the node
will reduce network traffic
• But if we increase #MPI, the
other communicator data will increase
Jan 18, 2022 16
-30%
-27%
-12% +10%
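A minimal sketch of such a 2D layout: ranks are viewed as an n1 x n2 grid and split into two orthogonal communicators with MPI_Comm_split. The split factor n1 here is a placeholder, not CGYRO's actual decomposition logic:

```c
/* Hedged sketch of a 2D communicator split: ranks form an n1 x n2 grid and
 * two orthogonal communicators are created. comm1 has fixed size n1, while
 * comm2 (size n2) grows as more MPI ranks are added. n1 is a placeholder. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int n1 = 4;              /* fixed-size direction (illustrative)      */
    int n2 = nranks / n1;    /* grows with the total number of MPI ranks */

    MPI_Comm comm1, comm2;
    /* comm1: the n1 ranks sharing the same rank/n1 group */
    MPI_Comm_split(MPI_COMM_WORLD, rank / n1, rank, &comm1);
    /* comm2: the n2 ranks sharing the same rank%n1 group */
    MPI_Comm_split(MPI_COMM_WORLD, rank % n1, rank, &comm2);

    /* ... the two all-to-all exchanges then run on comm1 and comm2 ... */
    (void)n2;

    MPI_Comm_free(&comm1);
    MPI_Comm_free(&comm2);
    MPI_Finalize();
    return 0;
}
```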
CGYRO 2D communication pattern
• Communication happens on 2 orthogonal communicators
• One fixed size, the other increases with #MPI
• Different amounts of data on the two communicators
• First communicator typically
much more “chatty”
• For small to medium simulations
• Keeping most of it inside the node
will reduce network traffic
• But if we increase #MPI, the
other communicator data will increase
Jan 18, 2022 17
-30%
-27%
-12% +10%
Not a solution for Multiscale simulations
MPI_ORDER=2 still beneficial
Algorithmic data reductions
• When #MPI is a multiple of #species
• One can exchange only per-species data
• Comm2 data volume cut by #species
• Smarter, adaptive time advance
• Reduces both compute time and data volume
• Changes semantics, but good theoretical foundations
Jan 18, 2022 18
Used in sh04
Not yet automatic,
but should be
(VELOCITY_ORDER=2)
Algorithmic data reductions
• When #MPI is a multiple of #species
• One can exchange only per-species data
• Comm2 data volume cut by #species
• Smarter, adaptive time advance
• Reduces both compute time and data volume
• Changes semantics, but good theoretical foundations
• Using single precision for comm instead of double
• Existing option for a subset of Comm2 (upwind)
• Implications for simulation fidelity not fully understood yet
Jan 18, 2022 19
Used in sh04
Not yet automatic,
but should be
(VELOCITY_ORDER=2)
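A hedged sketch of the single-precision communication idea: data is demoted to float only for the exchange itself and promoted back afterwards, halving the exchanged volume. The function and buffer names are illustrative:

```c
/* Hedged sketch of single-precision communication: demote to float only
 * for the wire exchange, then promote back, halving the data volume.
 * Function and buffer names are illustrative, not CGYRO's. */
#include <mpi.h>
#include <stdlib.h>

void exchange_single(const double *send_d, double *recv_d,
                     int chunk, MPI_Comm comm) {
    int nranks;
    MPI_Comm_size(comm, &nranks);
    size_t n = (size_t)chunk * nranks;

    float *send_f = malloc(n * sizeof(float));
    float *recv_f = malloc(n * sizeof(float));

    for (size_t i = 0; i < n; i++) send_f[i] = (float)send_d[i];   /* demote  */
    MPI_Alltoall(send_f, chunk, MPI_FLOAT,
                 recv_f, chunk, MPI_FLOAT, comm);                  /* 2x less data */
    for (size_t i = 0; i < n; i++) recv_d[i] = (double)recv_f[i];  /* promote */

    free(send_f);
    free(recv_f);
}
```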
[Figure: nl03 wallclock time (s) vs. number of nodes (16 to 1024) for Stampede2 Skylake, Cori KNL, Piz Daint, Titan, Summit, Frontera, and Perlmutter]
CGYRO compute over the years
• The nl03 benchmark test has long been representative of
CGYRO simulations
• True multiscale was
considered a heroic run
• Here you can see how
runtimes shrank
with the newer systems
Jan 18, 2022 20
Perlmutter results obtained
on Phase 1 setup.
(64 x 512 x 32 x 24 x 8) x 3 species
report time = 0.1 a/c_s
CGYRO Multiscale benchmarks
• We are starting to collect sh04 benchmark data
on modern systems
• Not many points
but scaling looks
good on Summit
and Perlmutter
Jan 18, 2022 21
Perlmutter results obtained
on Phase 1 setup.
(128 x 1152 x 24 x 18 x 8) x 3 species
report time = 1.0 a/c_s
[Figure: normalized electron energy flux spectra Q_e(k)/Q_GB vs. normalized wavenumber (0 to 25) for the two cases labeled below]
The importance of multiscale simulations
• Some recent insights
Jan 18, 2022
22
“typical” ion-scale
ITG turbulence
High resolution
multiscale ETG
turbulence
3456 x 574 FFT
Very different
results!
[Figure: normalized electron energy flux spectra Q_e(k)/Q_GB vs. normalized wavenumber (0 to 25) for the two cases labeled below]
The importance of multiscale simulations
• Some recent insights
Jan 18, 2022
23
“typical” ion-scale
ITG turbulence
High resolution
multiscale ETG
turbulence
3456 x 574 FFT
Very different
results!
Simulating burning plasmas in
H-mode regimes will be very
complex, requiring high
(multiscale) resolution
The need for multiple species
• Most simulations so far used only 2 or 3 species, typically
• deuterium ions + electrons
• wall impurity (carbon)
• But burning plasmas are much more complex,
so we will need many more species
• electrons
• 2 fuel ions (D and T)
• helium ash
• Low- and high-Z wall impurities (tungsten and beryllium)
• That will drastically increase the compute cost
• But recent improvements should limit data size growth
Jan 18, 2022 24
Multi-species scaling results
• Benchmark results from Summit
• Strong scaling with 4 species
• Weak scaling going from 2 to 6 species
Jan 18, 2022 25
[Figure: wallclock time (s) vs. number of Summit nodes (128 to 2048) for a >20%-of-Summit multiscale CGYRO simulation, showing strong scaling at 4 species and weak scaling across 2, 3, 4, and 6 species]
These results were without recent
multi-species communication optimizations
1.4x faster
(192 x 2304 x 24 x 18 x 8) x 3 species
report time = 0.16 a/c_s
nl05 benchmark test
Summary and future work
• Simulating burning plasmas in H-mode regimes will be very complex
• Requiring multiscale resolution and multiple species
• CGYRO is well positioned to provide those insights
on current and upcoming HPC systems
• The porting to GPUs drastically reduced the observed compute time
• Algorithmic improvements have reduced communication cost
• Future work
• Ensure the code continues to work on new HW (e.g. AMD GPUs)
• Further algorithmic improvements
Jan 18, 2022 - Joint US-Japan Workshop on Exascale Computing Collaboration and 6th workshop of US-Japan Joint Institute for Fusion Theory (JIFT) program 26
Acknowledgments
• Many people contributed to the development and optimization
of CGYRO, and I would like to explicitly acknowledge
J. Candy, E. Belli, K. Hallatschek, C. Holland, N. Howard, and E. D’Azevedo
• This work was supported by the U.S. Department of Energy under
awards DE-FG02-95ER54309, DE-FC02-06ER54873 (Edge Simulation
Laboratory), and DE-SC0017992 (AToM SciDAC-4 project).
• Computing resources were provided by NERSC and by OLCF through
the INCITE and ALCC programs.
27
Jan 18, 2022
Further reading
• The most comprehensive reference:
• J. Candy et al., “Multiscale-optimized plasma turbulence simulation on
petascale architectures”, Computers & Fluids, vol. 188, 125 (2019)
https://doi.org/10.1016/j.compfluid.2019.04.016
• Additional material:
• I. Sfiligoi et al., “CGYRO Performance on Power9 CPUs and Volta GPUs”,
Lecture Notes in Computer Science, vol 11203 (2018).
https://doi.org/10.1007/978-3-030-02465-9_24
• I. Sfiligoi et al., “Fusion Research Using Azure A100 HPC instances”, Poster 149
at SC21 (2021) https://sc21.supercomputing.org/proceedings/tech_poster/
tech_poster_pages/rpost149.html
28
Jan 18, 2022