ORNL is managed by UT-Battelle
for the US Department of Energy
Towards Exascale
Simulations of Stellar
Explosions with FLASH
J. Austin Harris
Scientific Computing Group
Oak Ridge National Laboratory
Collaborators:
Bronson Messer (ORNL)
Tom Papatheodore (ORNL)
2 J. Austin Harris --- OpenPOWER ADG 2018
• Preparing codes to run on the upcoming (CORAL) Summit supercomputer at
ORNL
• Summit – IBM POWER9 + NVIDIA Volta
• EA System – IBM POWER8 + NVIDIA Pascal
Acknowledgements
FLASH – adaptive-mesh, multi-physics simulation code widely used
in astrophysics
http://flash.uchicago.edu/site/
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National
Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No.
DE-AC05-00OR22725.
3 J. Austin Harris --- OpenPOWER ADG 2018
4 J. Austin Harris --- OpenPOWER ADG 2018
Supernovae
• Brightness rivals that of the host galaxy
• Primarily two physical mechanisms:
– Core-collapse supernova (gravity-induced explosion of massive star)
– Thermonuclear supernova (accretion-induced explosion of white dwarf)
Images: Cassiopeia A (SN 1680), Tycho (SN 1572), “The Crab” (SN 1054), Kepler (SN 1604)
5 J. Austin Harris --- OpenPOWER ADG 2018
FLASH code
• FLASH is a publicly available, component-based, MPI+OpenMP
parallel, adaptive mesh refinement (AMR) code that has been
used on a variety of parallel platforms.
• The code has been used to simulate a variety of phenomena,
including
– thermonuclear and core-collapse supernovae,
– galaxy cluster formation,
– classical novae,
– formation of proto-planetary disks, and
– high-energy-density physics.
• FLASH’s multi-physics and AMR capabilities make it an ideal
numerical laboratory for studying nucleosynthesis in supernovae.
• Targeted for CAAR:
– Nuclear kinetics (burn unit) --- GPU-enabled libraries
– Equation of State (EOS) --- OpenACC
– Hydrodynamics and Gravity module performance
6 J. Austin Harris --- OpenPOWER ADG 2018
Nuclear kinetics
• Nuclear composition evolved at each time-step
– Implicit solution of a coupled set of stiff ODEs requires dense linear solves
– FLOPs ~ n³, where n is the number of species
• Accurate treatment important for nuclear energy
generation and determining final composition
• Computational constraints traditionally limit system to < 14 species
• FLASH-CAAR:
– Replace the small “hard-wired” reaction network in FLASH with a general-purpose
reaction network (XNet) and use GPU-enabled libraries to accelerate the ODE solves
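As a rough scaling illustration (a back-of-the-envelope estimate, not a measured figure), moving from a 13-species to a 150-species network increases the per-zone cost of the dense linear solve by roughly
$$\left(\tfrac{150}{13}\right)^{3} \approx 1.5 \times 10^{3},$$
i.e., about three orders of magnitude more floating-point work per zone.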
7 J. Austin Harris --- OpenPOWER ADG 2018
8 J. Austin Harris --- OpenPOWER ADG 2018
Nuclear kinetics
XNet
• General-purpose thermonuclear reaction network written in modular Fortran 90
$$\dot{Y}_i = \sum_j c_i^j\,\lambda_j Y_j \;+\; \sum_{j,k} c_i^{j,k}\,\rho N_A \langle\sigma v\rangle_{j,k}\, Y_j Y_k \;+\; \sum_{j,k,l} c_i^{j,k,l}\,\rho^2 N_A^2 \langle\sigma v\rangle_{j,k,l}\, Y_j Y_k Y_l$$
• Stiff system of ODEs
– Implicit solver (Backward Euler / Bader-Deuflhard / Gear); see the Newton-iteration sketch after this list:
$$\vec{F}(t+\Delta t) = \frac{\vec{Y}(t+\Delta t) - \vec{Y}(t)}{\Delta t} - \dot{\vec{Y}}(t+\Delta t) = 0$$
• Being implemented in a shared repository of microphysics under development for
AMReX-based codes, including FLASH
– https://starkiller-astro.github.io/Microphysics/
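A minimal sketch of one backward-Euler step written as a Newton iteration (illustrative only, not XNet's actual implementation): the hypothetical routines compute_ydot and compute_jacobian stand in for the network right-hand side and its Jacobian, and each Newton update is a dense LAPACK solve, as in the per-zone CPU loop shown on the next slide.

subroutine backward_euler_step(n, y, dt)
  ! One implicit (backward Euler) step for dY/dt = f(Y):
  !   solve F(Ynew) = (Ynew - Y)/dt - f(Ynew) = 0 by Newton iteration,
  !   where each update solves (I/dt - J) dY = f(Ynew) - (Ynew - Y)/dt.
  implicit none
  integer, intent(in)    :: n          ! number of species
  real(8), intent(inout) :: y(n)       ! abundances at the start/end of the step
  real(8), intent(in)    :: dt
  real(8) :: ynew(n), ydot(n), rhs(n), amat(n,n), jac(n,n)
  integer :: ipiv(n), info, iter, i
  integer, parameter :: max_newton = 10

  ynew = y
  do iter = 1, max_newton
     call compute_ydot(ynew, ydot)      ! hypothetical: ydot = f(ynew)
     call compute_jacobian(ynew, jac)   ! hypothetical: jac = df/dY at ynew
     amat = -jac
     do i = 1, n
        amat(i,i) = amat(i,i) + 1.0d0/dt
     end do
     rhs = ydot - (ynew - y)/dt
     call dgetrf(n, n, amat, n, ipiv, info)               ! LAPACK LU factorization
     call dgetrs('N', n, 1, amat, n, ipiv, rhs, n, info)  ! LAPACK triangular solves
     ynew = ynew + rhs
     if (maxval(abs(rhs)) <= 1.0d-8*maxval(abs(ynew))) exit  ! crude convergence test
  end do
  y = ynew
end subroutine backward_euler_step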
9 J. Austin Harris --- OpenPOWER ADG 2018
XNet in FLASH
• FLASH burner restructured to operate on zones gathered at once from all local
AMR blocks, so that XNet can evolve them simultaneously (original per-zone CPU
loop and restructured GPU-batched loop shown below)
Original CPU version (one LAPACK solve per zone per network timestep):
!$omp parallel shared(…) private(…)
!$omp do
do k = 1, num_zones
do j = 1, num_timesteps
<build linear system>
dgetrf(…)
dgetrs(…)
<check convergence>
end do
end do
!$omp end do
!$omp end parallel
Restructured GPU version (batched cuBLAS solves across zones):
!$omp parallel shared(…) private(…)
!$omp do
do k = 1, num_local_batches
do j = 1, num_timesteps
<CPU operations>
!$acc parallel loop
do ib = 1, nb
<build ib’th linear system>
end do
!$acc end parallel loop
cublasDgetrfBatched(…)
cublasDgetrsBatched(…)
<send results to CPU>
<check convergence>
end do
end do
!$omp end do
!$omp end parallel
10 J. Austin Harris --- OpenPOWER ADG 2018
FLASH AMR
• Currently uses PARAMESH (MacNeice+, 2000)
• Moving to the ECP-supported AMReX (ExaStar ECP project)
11 J. Austin Harris --- OpenPOWER ADG 2018
FLASH AMR Optimization
• Problem: Computational load can be
quite unevenly distributed
• Solution: Weight the Morton space-
filling curve by maximum number of
timesteps taken by any single cell in
a block.
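A minimal sketch of the weighting idea (illustrative only, not FLASH's PARAMESH implementation): blocks are kept in Morton order and split into contiguous chunks of roughly equal total weight, where the hypothetical weight(b) is the per-block work estimate described above.

subroutine assign_blocks(nblocks, nranks, weight, owner)
  ! Greedy split of the Morton-ordered block list into contiguous,
  ! roughly equal-weight chunks, one chunk per MPI rank.
  implicit none
  integer, intent(in)  :: nblocks, nranks
  real(8), intent(in)  :: weight(nblocks)   ! work estimate per block, in Morton order
  integer, intent(out) :: owner(nblocks)    ! rank assigned to each block
  real(8) :: target_per_rank, running
  integer :: b, rank

  target_per_rank = sum(weight) / nranks
  running = 0.0d0
  rank = 0
  do b = 1, nblocks
     owner(b) = rank
     running = running + weight(b)
     ! move on to the next rank once its share of the weighted curve is filled
     if (running >= target_per_rank*(rank + 1) .and. rank < nranks - 1) rank = rank + 1
  end do
end subroutine assign_blocks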
12 J. Austin Harris --- OpenPOWER ADG 2018
FLASH Performance w/ XNet
• Tests performed on a single Summit Phase I node
– 2 IBM POWER9 CPUs (22 cores each), 6 NVIDIA “Volta” V100 GPUs
• 1 3D block (16³ = 4096 zones) per rank per GPU, evolved for 20 FLASH timesteps
13 J. Austin Harris --- OpenPOWER ADG 2018
FLASH Early Summit Results
• CAAR work primarily concerned with increasing physical
fidelity by accelerating the nuclear burning module and
associated load balancing.
• Summit GPU performance fundamentally changes the
potential science impact by enabling large-network (i.e.
160 or more nuclear species) simulations.
– Heaviest elements in the Universe are made in
neutron-rich environments – small networks are
incapable of tracking these neutron-rich nuclei
– Opens up the possibility of producing precision
nucleosynthesis predictions to compare to
observations
– Provides detailed information regarding most
astrophysically important nuclear reactions and
masses to be measured at FRIB
[Image: Crab Nebula; credit NASA, ESA, J. Hester and A. Loll (Arizona St. Univ.)]
[Figure: chart of nuclides (Proton Number vs. Neutron Number, elements n–Zn) showing abundances (Max: 2.35E-01, Min: 1.00E-25) for a neutrino-p-process trajectory (zone_01) at Timestep = 0, Time = 0.000E+00 s, Density = 4.214E+06 g/cm³, Temperature (T9) = 8.240; network coverage compared for Aprox13 (13-species α-chain) and X150 (150-species network); figure via nucastrodata.org]
• Time for the 160-species network (blue) run on Summit is roughly equal to the
13-species “alpha” network (red) run on Titan: >100x the computation for identical cost
• Preliminary results on Summit: GPU+CPU vs. CPU-only performance for a
288-species network is 2.9x
– P9: 24.65 seconds/step
– P9 + Volta: 8.5 seconds/step
• A 288-species network is impossible to run on Titan
14 J. Austin Harris --- OpenPOWER ADG 2018
Equation of State
“Helmholtz EOS” (Timmes & Swesty, 2000)
• Provides closure to thermodynamic system (e.g. P=P(ρ,T,X) )
• Based on Helmholtz free energy formulation
– High-order interpolation from a table of free energy values (quintic Hermite polynomials; see the 1D sketch after this list)
• OpenACC version developed by collaborators at Stony Brook University
– Part of a shared repository of microphysics being developed for AMReX-based
codes, including FLASH (“starkiller”)
• FLASH traditionally only operates on vectors (i.e. rows from AMR blocks)
– Does this expose enough parallelism? No
• How many AMR blocks should we evaluate the EOS for simultaneously
per MPI rank?
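For reference, a minimal 1D sketch of quintic Hermite interpolation (illustrative only; the actual Helmholtz EOS table is two-dimensional in density and temperature and uses the bi-quintic form of Timmes & Swesty 2000). The basis functions reproduce the tabulated value and its first two derivatives at the two bracketing table points:

function quintic_hermite(x, x0, x1, f0, fp0, fpp0, f1, fp1, fpp1) result(f)
  ! Illustrative 1D quintic Hermite interpolation between table points x0 and x1,
  ! given f, df/dx, and d2f/dx2 at both points (hypothetical inputs).
  implicit none
  real(8), intent(in) :: x, x0, x1, f0, fp0, fpp0, f1, fp1, fpp1
  real(8) :: f, h, t, s
  real(8) :: psi0t, psi1t, psi2t, psi0s, psi1s, psi2s

  h = x1 - x0
  t = (x - x0) / h          ! normalized coordinate in [0,1]
  s = 1.0d0 - t

  ! Quintic Hermite basis: psi0 matches the value, psi1 the first derivative,
  ! psi2 the second derivative at one endpoint, with zero contributions at the other.
  psi0t = 1.0d0 - 10.0d0*t**3 + 15.0d0*t**4 - 6.0d0*t**5
  psi1t = t - 6.0d0*t**3 + 8.0d0*t**4 - 3.0d0*t**5
  psi2t = 0.5d0*t**2 - 1.5d0*t**3 + 1.5d0*t**4 - 0.5d0*t**5
  psi0s = 1.0d0 - 10.0d0*s**3 + 15.0d0*s**4 - 6.0d0*s**5
  psi1s = s - 6.0d0*s**3 + 8.0d0*s**4 - 3.0d0*s**5
  psi2s = 0.5d0*s**2 - 1.5d0*s**3 + 1.5d0*s**4 - 0.5d0*s**5

  f =   f0*psi0t + h*fp0*psi1t + h*h*fpp0*psi2t   &
      + f1*psi0s - h*fp1*psi1s + h*h*fpp1*psi2s
end function quintic_hermite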
15 J. Austin Harris --- OpenPOWER ADG 2018
Helmholtz EOS
• To determine the best use of the accelerated EOS in FLASH, we used a mini-app driver:
– Mimics AMR block structure and time stepping in FLASH
• Loops through several time steps
• Change the number of total grid zones
• Fill these zones with new data
• Calculate interpolation in all grid zones
16 J. Austin Harris --- OpenPOWER ADG 2018
Helmholtz EOS
1) Allocate main data arrays (global) on host and device
– Arrays of Fortran derived types
• Each element holds grid data for a single zone
– Persist for the duration of the program
– Used to pass zone data back and forth from host to device
• Reduced set sent from H-to-D
• Full set sent from D-to-H
17 J. Austin Harris --- OpenPOWER ADG 2018
Helmholtz EOS
2) Read in tabulated Helmholtz free energy data and make copy on
device
– This will persist for the duration of the program
– Thermodynamic quantities are interpolated from this table
3) For each time step
– Change number of AMR blocks
• ± 5%, consistent with variation encountered in production simulations at high rank count
– Update device with new grid data
– Launch EOS kernel: calculate all interpolated quantities for all grid zones
– Update host with newly calculated quantities
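A minimal sketch of the driver's outer loop following the steps above (illustrative only; names such as fill_new_data, update_device, eos_all_zones, and update_host are hypothetical placeholders). The EOS launch and transfers correspond to the OpenACC / OpenMP 4.5 excerpts on the next slide.

program eos_driver_sketch
  ! Mimic FLASH time stepping: vary the block count, push new zone data to the
  ! device, evaluate the EOS in every zone, and pull the results back.
  implicit none
  integer, parameter :: nsteps = 10, zones_per_block = 256
  integer :: step, nblocks, nzones
  real(8) :: r

  nblocks = 1000                                  ! nominal block count per rank
  do step = 1, nsteps
     call random_number(r)
     nblocks = nint(nblocks*(0.95d0 + 0.10d0*r))  ! vary block count by up to +/- 5%
     nzones  = nblocks*zones_per_block

     call fill_new_data(nzones)      ! hypothetical: refresh host-side zone data
     call update_device(nzones)      ! H2D: send the reduced state to the GPU
     call eos_all_zones(nzones)      ! launch the EOS kernel over all zones
     call update_host(nzones)        ! D2H: return the full interpolated state
  end do
end program eos_driver_sketch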
18 J. Austin Harris --- OpenPOWER ADG 2018
Helmholtz EOS
Basic Flow of Driver Program
OpenACC version:
!$acc update device(reduced_state(start_element:stop_element)) async(thread_id + 1)
!$acc kernels async(thread_id + 1)
do zone = start_element, stop_element
call eos(state(zone), reduced_state(zone))
end do
!$acc end kernels
!$acc update self(state(start_element:stop_element)) async(thread_id + 1)
!$acc wait
OpenMP 4.5 version:
!$omp target update to(reduced_state(start_element:stop_element))
!$omp target
!$omp teams distribute parallel do thread_limit(128) num_threads(128)
do zone = start_element, stop_element
call eos(state(zone), reduced_state(zone))
end do
!$omp end teams distribute parallel do
!$omp end target
!$omp target update from(state(start_element:stop_element))
19 J. Austin Harris --- OpenPOWER ADG 2018
EOS Experiments
• All experiments carried out on
SummitDev
– Nodes have 2 IBM Power8+ 10-core
CPUs
– peak flop rate of approximately 560 GF
– peak memory bandwidth of 340 GB/sec
• + 4 NVIDIA P100 GPUs
– peak single/double precision flop rate of
10.6/5.3 TF
– peak memory bandwidth of 732 GB/sec
• Number of “AMR” blocks: 1, 10, 100,
1000, 10000 (each with 256 zones)
– Emulates 2D block in FLASH
• Tested with 1, 2, 4, and 10 (CPU)
OpenMP threads for each block count
20 J. Austin Harris --- OpenPOWER ADG 2018
OpenACC vs OpenMP 4.5
• DISCLAIMER: at the time these tests were performed,
– PGI’s OpenACC implementation had a mature API (version 16.10)
– IBM’s XL Fortran implementation of OpenMP 4.5 (version 16.1) was a beta version
of the compiler and did not allow pinned memory or asynchronous data transfers /
kernel execution
21 J. Austin Harris --- OpenPOWER ADG 2018
Results
• For high numbers of AMR blocks, OpenACC is roughly 3x faster
• More complicated behavior for lower block counts
22 J. Austin Harris --- OpenPOWER ADG 2018
OpenACC at low block counts
• At low AMR block counts, kernel launch overhead (~0.1 ms) is large relative to
compute time, and increased work does little to increase total performance.
23 J. Austin Harris --- OpenPOWER ADG 2018
OpenACC kernel overheads continued
• Multiple CPU threads stagger
H2D transfers, exacerbating
kernel overhead delay
24 J. Austin Harris --- OpenPOWER ADG 2018
OpenACC at high block counts
• At higher block counts,
kernel overhead is
negligible; now dominated
by D2H transfers
25 J. Austin Harris --- OpenPOWER ADG 2018
OpenMP at low block counts
• There is no asynchronous GPU
execution, i.e. the work enqueued
by each CPU thread is serialized
on the device.
• Performance is proportionally less
than the OpenACC asynchronous
execution.
26 J. Austin Harris --- OpenPOWER ADG 2018
OpenMP at higher block counts
• Lack of asynchronous execution
becomes less important, as the
device compute capability is
saturated.
• D2H (and H2D) transfers are significantly slower than for OpenACC, since here we
lack the ability to pin CPU memory.
27 J. Austin Harris --- OpenPOWER ADG 2018
Optimal GPU configuration
• Clear advantage from GPUs when >100 AMR blocks in 2D (or >6 in 3D); see the
zone-count arithmetic below
• Can calculate 100 2D blocks with
GPU in roughly the same time as 1
2D block without
• So in FLASH, we should compute
100s to 1000s of 2D blocks per MPI
rank, depending on available memory
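The 2D and 3D thresholds are consistent when expressed in total zones, using the block sizes quoted earlier (256 zones per emulated 2D block, 16³ = 4096 zones per 3D block):
$$100 \times 16^{2} = 25{,}600 \ \text{zones} \;\approx\; 6 \times 16^{3} = 24{,}576 \ \text{zones}$$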
28 J. Austin Harris --- OpenPOWER ADG 2018
EOS Summary
• OpenMP provides an effective path to performance portability, so despite the lower
performance here, we plan to test the OpenMP 4.5 implementation in FLASH production.
– Primary factors affecting current OpenMP performance are the serialization of kernels on the
device and high data transfer times associated with having to use pageable memory when
using OpenMP 4.5. These are technical problems that are certainly surmountable.
• In general, we find that the best balance between CPU threads and block
number occurs at and above 2-4 CPU threads and roughly 1,000 2D blocks. We
can retire all 1,000 of these EOS evaluations in a time less than 10x the fastest
100-block calculation for both OpenACC and OpenMP.
– This mode is congruent with our planned production use of FLASH on the OLCF Summit
machine, where we will place 3 MPI ranks on each CPU socket, each bound to one of the
three available, closely coupled GPUs.
29 J. Austin Harris --- OpenPOWER ADG 2018
Conclusions
• Overall, very positive experience with Summit
– Some issues with parallel HDF5 under investigation
• With the upgrades to the nuclear burning and EOS in FLASH, we find
significant speedup (2x - 3x) relative to the CPU alone
• Still plenty of work to do!
– Hydrodynamics
– Gravity
– Radiation Transport