Preparing Fusion Codes for Perlmutter
Igor Sfiligoi, San Diego Supercomputer Center (under contract with General Atomics)
NUG 2022
This talk focuses on CGYRO
• There are many tools used in fusion research; this talk focuses on CGYRO
• An Eulerian fusion plasma turbulence simulation tool
• Optimized for multi-scale simulations
• Both memory- and compute-heavy
Experimental methods are essential for exploring new operational modes, but simulations are used to validate basic theory, plan experiments, interpret results on present devices, and ultimately to design future devices.
Main authors: E. Belli and J. Candy
https://gafusion.github.io/doc/cgyro.html
CGYRO is inherently parallel
• Operates on a 5+1 dimensional grid
• The simulation loop consists of several steps, where each step
  • Can cleanly partition the problem in at least one dimension
  • But no single dimension is common to all of them
• All dimensions are compute-parallel
  • But some dimensions may rely on neighbor data from the previous step
• Easy to split among many CPU/GPU cores and nodes, using OpenMP + OpenACC + MPI
• Most of the compute-intensive portion is based on small-ish 2D FFTs, so system-optimized libraries can be used
• Requires frequent TRANSPOSE operations between steps, i.e. MPI_Alltoall (we are exploring alternatives, but none are ready yet); a minimal sketch of such a transpose is shown below
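The slides do not show the transpose itself, so here is a minimal, hypothetical sketch in C with MPI (not CGYRO's actual source) of the kind of MPI_Alltoall-based redistribution described above: a complex field partitioned along one dimension is exchanged so that it ends up partitioned along another. All sizes and names are illustrative.

/* Minimal sketch (not CGYRO code): redistribute a complex field that is
 * partitioned along dimension A so that it becomes partitioned along
 * dimension B, using MPI_Alltoall. Sizes are illustrative only. */
#include <mpi.h>
#include <complex.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns NA_LOC x NB points; after the transpose it owns
     * NA x NB_LOC points. Assume NA and NB are divisible by nprocs. */
    const int NA = 64, NB = 64;
    const int na_loc = NA / nprocs, nb_loc = NB / nprocs;
    const int block = na_loc * nb_loc;        /* elements exchanged per rank pair */

    double complex *send = malloc((size_t)block * nprocs * sizeof *send);
    double complex *recv = malloc((size_t)block * nprocs * sizeof *recv);
    for (int i = 0; i < block * nprocs; i++)
        send[i] = rank + i * I;               /* dummy data */

    /* The all-to-all is the communication half of the transpose; a local
     * reshuffle of the received blocks (not shown) completes it. */
    MPI_Alltoall(send, block, MPI_C_DOUBLE_COMPLEX,
                 recv, block, MPI_C_DOUBLE_COMPLEX, MPI_COMM_WORLD);

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}

Because every rank exchanges data with every other rank at each such step, this pattern is what makes CGYRO so sensitive to interconnect quality, as discussed in the networking slides below.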
Cori and Perlmutter
• Cori was long a major CGYRO compute resource
  • And we were very happy with the KNL CPUs
  • Lots of (slower) cores were always better than fewer, marginally faster cores
• CGYRO was first ported to GPUs for ORNL Titan
  • Then improved for ORNL Summit
    (Titan's K80s have severe limitations, like tiny memory and limited communication)
• Deploying on Perlmutter (GPU partition) required just a recompilation
  • It just worked
  • Most of the time since has been spent on environment optimizations, e.g. NVIDIA MPS
  • We already had experience with A100s from Cloud compute
CPU vs GPU code paths
• CGYRO uses an OpenMP+OpenACC(+MPI) parallelization approach
  • Plus native FFT libraries: FFTW/MKL on Cori, cuFFT on Perlmutter
• Most of the code is identical for the two paths
  • OpenMP or OpenACC is enabled by a compile flag (see the first sketch below)
  • A few loops have specialized OpenMP vs OpenACC implementations (but most don't)
  • cuFFT required batch execution (reminder: many small FFTs); see the second sketch below
• Efficient OpenACC requires careful memory handling
  • This was especially a problem while porting pieces of the code to GPU
    (now virtually all compute is on GPU, and partitioned memory just works)
  • Explicit data movement is now mostly needed for interacting with IO / diagnostics printouts
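As an illustration of the single-source approach, here is a minimal, hypothetical sketch in C (not taken from CGYRO's source) of a loop whose directives are selected at compile time; the USE_OPENACC macro is an assumption made only for this example.

/* Minimal sketch (not CGYRO source): the same loop body is decorated with
 * either OpenMP or OpenACC directives, selected at compile time.
 * USE_OPENACC is a hypothetical flag used only for this example. */
#include <stddef.h>

void advance_field(double *h, const double *rhs, double dt, size_t n)
{
#ifdef USE_OPENACC
    /* GPU path: assumes the arrays are already resident on the device,
     * e.g. via an enclosing OpenACC data region. */
    #pragma acc parallel loop present(h[0:n], rhs[0:n])
#else
    /* CPU path */
    #pragma omp parallel for
#endif
    for (size_t i = 0; i < n; i++)
        h[i] += dt * rhs[i];
}

The many small 2D FFTs map naturally onto cuFFT's batched interface. A hedged sketch of creating and executing one batched plan (dimensions, buffer layout, and function name are illustrative, not CGYRO's) might look like this:

/* Minimal sketch: one batched plan executes many small 2D complex-to-complex
 * FFTs in a single call, instead of looping over individual transforms. */
#include <cufft.h>
#include <cuda_runtime.h>

void run_batched_ffts(int nx, int ny, int batch)
{
    cufftDoubleComplex *d_data;
    cudaMalloc((void **)&d_data,
               sizeof(cufftDoubleComplex) * (size_t)nx * ny * batch);

    cufftHandle plan;
    int n[2] = {nx, ny};                   /* size of each 2D transform */
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, nx * ny,        /* input: packed layout */
                  NULL, 1, nx * ny,        /* output: packed layout */
                  CUFFT_Z2Z, batch);

    /* One call transforms all 'batch' planes in place. */
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(d_data);
}

Batching keeps the GPU busy even though each individual transform is small, which is the point the slide is making.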
Importance of great networking
• CGYRO is communication intensive
  • Large memory footprint + frequent MPI_Alltoall
  • Non-negligible MPI_Allreduce, too
• The first experience on Perlmutter, with Slingshot 10, was a mixed bag
  • Great compute speedup, but the simulation was bottlenecked by communication
  [Figure, preview of SC22 poster: benchmark sh04 case on Slingshot 10, with the time split roughly 30% / 70% and communication dominating]
• The updated Slingshot 11 networking makes us much happier
  [Figure, preview of SC22 poster: benchmark sh04 case on Slingshot 11, roughly 30% / 50% split]
• But it brings new problems
  • SS11 does not play well with MPS
    • The code gets drastically slower when mapping multiple MPI processes per GPU,
      something we currently rely on for optimization reasons
  • Not a showstopper, but it slows down our simulations in certain setups
  • A NERSC ticket is open; hopefully it can be fixed
  • But we are also working on alternatives in the CGYRO code
Disk IO light
• CGYRO does not have much disk IO
  • Updates results every O(10 min)
  • Checkpoints every O(1 h)
• Uses MPI-mediated parallel writes (a minimal sketch follows below)
  • Only a couple of files, one per logical data type
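The slides do not show the write pattern itself; the following is a minimal, hypothetical sketch in C of an MPI-mediated parallel write in the spirit described above, with each rank writing its slice of one shared file through a collective MPI-IO call. The file name, sizes, and layout are illustrative, not CGYRO's.

/* Minimal sketch (not CGYRO code) of an MPI-mediated parallel write:
 * every rank writes its own slice of a shared checkpoint file at a
 * rank-dependent offset, using a collective MPI-IO call. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_loc = 1 << 20;                    /* doubles owned by this rank */
    double *h = malloc((size_t)n_loc * sizeof *h);
    for (int i = 0; i < n_loc; i++) h[i] = rank;  /* dummy payload */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "restart.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's slice starts at a deterministic offset in the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * n_loc * sizeof(double);
    MPI_File_write_at_all(fh, offset, h, n_loc, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(h);
    MPI_Finalize();
    return 0;
}

Collective calls such as MPI_File_write_at_all let the MPI library coordinate and aggregate the per-rank writes, which helps keep the IO cost low at scale.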
A comparison to other systems
[Figure: compute-only vs total time comparison across systems (Perlmutter on SS 10); presented at PEARC22, https://doi.org/10.1145/3491418.3535130]
• A single GCP node was faster than 16x Summit nodes
• Looking at compute-only, Perlmutter's A100s are about twice as fast as Summit's V100s
Summary and Conclusions
• Fusion CGYRO users are happy with the transition from Cori to Perlmutter
  • Much faster at an equivalent chip count
  • Porting required just a recompile
• Perlmutter is still in its deployment phase
  • There have been periods when things were not working too well
  • But these were typically transient; hopefully things will stabilize
  • Waiting for quotas to be raised (128 nodes is not a lot for CGYRO)
• The only known remaining annoyance is the SS11+MPS interference
Acknowledgements
• This work was partially supported by:
  • The U.S. Department of Energy under awards DE-FG02-95ER54309, DE-FC02-06ER54873 (Edge Simulation Laboratory), and DE-SC0017992 (AToM SciDAC-4 project)
  • The U.S. National Science Foundation (NSF) grant OAC-1826967
• An award of computer time was provided by the INCITE program.
• This research used resources of the Oak Ridge Leadership Computing Facility, which is an Office of Science User Facility supported under Contract DE-AC05-00OR22725.
• Computing resources were also provided by the National Energy Research Scientific Computing Center, which is an Office of Science User Facility supported under Contract DE-AC02-05CH11231.