ORNL is managed by UT-Battelle
for the US Department of Energy
Towards Exascale
Simulations of Stellar
Explosions with FLASH
J. Austin Harris
Scientific Computing Group
Oak Ridge National Laboratory
Collaborators:
Bronson Messer (ORNL)
Tom Papatheodore (ORNL)
2 J. Austin Harris --- OpenPOWER ADG 2018
• Preparing codes to run on the upcoming (CORAL) Summit supercomputer at
ORNL
• Summit – IBM POWER9 + NVIDIA Volta
• EA System – IBM POWER8 + NVIDIA Pascal
Acknowledgements
FLASH – adaptive-mesh, multi-physics simulation code widely used
in astrophysics
http://flash.uchicago.edu/site/
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National
Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No.
DE-AC05-00OR22725.
3 J. Austin Harris --- OpenPOWER ADG 2018
4 J. Austin Harris --- OpenPOWER ADG 2018
Supernovae
• Brightness rivals that of the host galaxy
• Primarily two physical mechanisms:
– Core-collapse supernova (gravity-induced explosion of massive star)
– Thermonuclear supernova (accretion-induced explosion of white dwarf)
Images: Cassiopeia A (SN 1680), Tycho (SN 1572), “The Crab” (SN 1054), Kepler (SN 1604)
5 J. Austin Harris --- OpenPOWER ADG 2018
FLASH code
• FLASH is a publicly available, component-based, MPI+OpenMP
parallel, adaptive mesh refinement (AMR) code that has been
used on a variety of parallel platforms.
• The code has been used to simulate a variety of phenomena,
including
– thermonuclear and core-collapse supernovae,
– galaxy cluster formation,
– classical novae,
– formation of proto-planetary disks, and
– high-energy-density physics.
• FLASH’s multi-physics and AMR capabilities make it an ideal
numerical laboratory for studying nucleosynthesis in supernovae.
• Targeted for CAAR:
– Nuclear kinetics (burn unit) --- GPU-enabled libraries
– Equation of State (EOS) --- OpenACC
– Hydrodynamics and Gravity module performance
6 J. Austin Harris --- OpenPOWER ADG 2018
Nuclear kinetics
• Nuclear composition evolved at each time-step
– Implicit solution of a coupled set of stiff ODEs requires dense linear solves
– FLOPs ~ n³, where n is the number of species
• Accurate treatment important for nuclear energy
generation and determining final composition
• Computational constraints traditionally limit system to < 14 species
• FLASH-CAAR:
– Replace the small “hard-wired” reaction network in FLASH with a general-purpose
reaction network (XNet) and use GPU-enabled libraries to accelerate the ODE solves
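As a rough scaling illustration (a back-of-the-envelope estimate, not a measured figure), moving from a 13-species to a 150-species network increases the per-zone cost of the dense linear solve by roughly
$$\left(\tfrac{150}{13}\right)^{3} \approx 1.5 \times 10^{3},$$
i.e., about three orders of magnitude more floating-point work per zone.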
7 J. Austin Harris --- OpenPOWER ADG 2018
8 J. Austin Harris --- OpenPOWER ADG 2018
Nuclear kinetics
XNet
• General-purpose thermonuclear reaction network written in modular Fortran 90
$$\dot{Y}_i = \sum_j c_i^j\,\lambda_j Y_j \;+\; \sum_{j,k} c_i^{j,k}\,\rho N_A \langle\sigma v\rangle_{j,k}\, Y_j Y_k \;+\; \sum_{j,k,l} c_i^{j,k,l}\,\rho^2 N_A^2 \langle\sigma v\rangle_{j,k,l}\, Y_j Y_k Y_l$$
• Stiff system of ODEs
– Implicit solver (Backward Euler / Bader-Deuflhard / Gear); see the Newton-iteration sketch after this list:
$$\vec{F}(t+\Delta t) = \frac{\vec{Y}(t+\Delta t) - \vec{Y}(t)}{\Delta t} - \dot{\vec{Y}}(t+\Delta t) = 0$$
• Being implemented in a shared repository of microphysics under development for
AMReX-based codes, including FLASH
– https://starkiller-astro.github.io/Microphysics/
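A minimal sketch of one backward-Euler step written as a Newton iteration (illustrative only, not XNet's actual implementation): the hypothetical routines compute_ydot and compute_jacobian stand in for the network right-hand side and its Jacobian, and each Newton update is a dense LAPACK solve, as in the per-zone CPU loop shown on the next slide.

subroutine backward_euler_step(n, y, dt)
  ! One implicit (backward Euler) step for dY/dt = f(Y):
  !   solve F(Ynew) = (Ynew - Y)/dt - f(Ynew) = 0 by Newton iteration,
  !   where each update solves (I/dt - J) dY = f(Ynew) - (Ynew - Y)/dt.
  implicit none
  integer, intent(in)    :: n          ! number of species
  real(8), intent(inout) :: y(n)       ! abundances at the start/end of the step
  real(8), intent(in)    :: dt
  real(8) :: ynew(n), ydot(n), rhs(n), amat(n,n), jac(n,n)
  integer :: ipiv(n), info, iter, i
  integer, parameter :: max_newton = 10

  ynew = y
  do iter = 1, max_newton
     call compute_ydot(ynew, ydot)      ! hypothetical: ydot = f(ynew)
     call compute_jacobian(ynew, jac)   ! hypothetical: jac = df/dY at ynew
     amat = -jac
     do i = 1, n
        amat(i,i) = amat(i,i) + 1.0d0/dt
     end do
     rhs = ydot - (ynew - y)/dt
     call dgetrf(n, n, amat, n, ipiv, info)               ! LAPACK LU factorization
     call dgetrs('N', n, 1, amat, n, ipiv, rhs, n, info)  ! LAPACK triangular solves
     ynew = ynew + rhs
     if (maxval(abs(rhs)) <= 1.0d-8*maxval(abs(ynew))) exit  ! crude convergence test
  end do
  y = ynew
end subroutine backward_euler_step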
9 J. Austin Harris --- OpenPOWER ADG 2018
XNet in FLASH
• FLASH burner restructured to operate on zones gathered at once from all local
AMR blocks, so that XNet can evolve them simultaneously (original per-zone CPU
loop and restructured GPU-batched loop shown below)
Original CPU version (one LAPACK solve per zone per network timestep):
!$omp parallel shared(…) private(…)
!$omp do
do k = 1, num_zones
do j = 1, num_timesteps
<build linear system>
dgetrf(…)
dgetrs(…)
<check convergence>
end do
end do
!$omp end do
!$omp end parallel
Restructured GPU version (batched cuBLAS solves across zones):
!$omp parallel shared(…) private(…)
!$omp do
do k = 1, num_local_batches
do j = 1, num_timesteps
<CPU operations>
!$acc parallel loop
do ib = 1, nb
<build ib’th linear system>
end do
!$acc end parallel loop
cublasDgetrfBatched(…)
cublasDgetrsBatched(…)
<send results to CPU>
<check convergence>
end do
end do
!$omp end do
!$omp end parallel
10 J. Austin Harris --- OpenPOWER ADG 2018
FLASH AMR
• Currently uses PARAMESH (MacNeice+, 2000)
• Moving to the ECP-supported AMReX (ExaStar ECP project)
11 J. Austin Harris --- OpenPOWER ADG 2018
FLASH AMR Optimization
• Problem: Computational load can be
quite unevenly distributed
• Solution: Weight the Morton space-
filling curve by maximum number of
timesteps taken by any single cell in
a block.
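A minimal sketch of the weighting idea (illustrative only, not FLASH's PARAMESH implementation): blocks are kept in Morton order and split into contiguous chunks of roughly equal total weight, where the hypothetical weight(b) is the per-block work estimate described above.

subroutine assign_blocks(nblocks, nranks, weight, owner)
  ! Greedy split of the Morton-ordered block list into contiguous,
  ! roughly equal-weight chunks, one chunk per MPI rank.
  implicit none
  integer, intent(in)  :: nblocks, nranks
  real(8), intent(in)  :: weight(nblocks)   ! work estimate per block, in Morton order
  integer, intent(out) :: owner(nblocks)    ! rank assigned to each block
  real(8) :: target_per_rank, running
  integer :: b, rank

  target_per_rank = sum(weight) / nranks
  running = 0.0d0
  rank = 0
  do b = 1, nblocks
     owner(b) = rank
     running = running + weight(b)
     ! move on to the next rank once its share of the weighted curve is filled
     if (running >= target_per_rank*(rank + 1) .and. rank < nranks - 1) rank = rank + 1
  end do
end subroutine assign_blocks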
12 J. Austin Harris --- OpenPOWER ADG 2018
FLASH Performance w/ XNet
• Tests performed on a single Summit Phase I node
– 2 IBM POWER9 CPUs (22 cores each), 6 NVIDIA “Volta” V100 GPUs
• 1 3D block (16³ = 4096 zones) per rank per GPU, evolved for 20 FLASH timesteps
13 J. Austin Harris --- OpenPOWER ADG 2018
FLASH Early Summit Results
• CAAR work primarily concerned with increasing physical
fidelity by accelerating the nuclear burning module and
associated load balancing.
• Summit GPU performance fundamentally changes the
potential science impact by enabling large-network (i.e.
160 or more nuclear species) simulations.
– Heaviest elements in the Universe are made in
neutron-rich environments – small networks are
incapable of tracking these neutron-rich nuclei
– Opens up the possibility of producing precision
nucleosynthesis predictions to compare to
observations
– Provides detailed information regarding most
astrophysically important nuclear reactions and
masses to be measured at FRIB
[Image: Crab Nebula; credit NASA, ESA, J. Hester and A. Loll (Arizona St. Univ.)]
[Figure: chart of nuclides (Proton Number vs. Neutron Number, elements n–Zn) showing abundances (Max: 2.35E-01, Min: 1.00E-25) for a neutrino-p-process trajectory (zone_01) at Timestep = 0, Time = 0.000E+00 s, Density = 4.214E+06 g/cm³, Temperature (T9) = 8.240; network coverage compared for Aprox13 (13-species α-chain) and X150 (150-species network); figure via nucastrodata.org]
• Time for the 160-species network (blue) run on Summit is roughly equal to the
13-species “alpha” network (red) run on Titan: >100x the computation for identical cost
• Preliminary results on Summit: GPU+CPU vs. CPU-only performance for a
288-species network is 2.9x
– P9: 24.65 seconds/step
– P9 + Volta: 8.5 seconds/step
• A 288-species network is impossible to run on Titan
14 J. Austin Harris --- OpenPOWER ADG 2018
Equation of State
“Helmholtz EOS” (Timmes & Swesty, 2000)
• Provides closure to thermodynamic system (e.g. P=P(ρ,T,X) )
• Based on Helmholtz free energy formulation
– High-order interpolation from a table of free energy values (quintic Hermite polynomials; see the 1D sketch after this list)
• OpenACC version developed by collaborators at Stony Brook University
– Part of a shared repository of microphysics being developed for AMReX-based
codes, including FLASH (“starkiller”)
• FLASH traditionally only operates on vectors (i.e. rows from AMR blocks)
– Does this expose enough parallelism? No
• How many AMR blocks should we evaluate the EOS for simultaneously
per MPI rank?
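For reference, a minimal 1D sketch of quintic Hermite interpolation (illustrative only; the actual Helmholtz EOS table is two-dimensional in density and temperature and uses the bi-quintic form of Timmes & Swesty 2000). The basis functions reproduce the tabulated value and its first two derivatives at the two bracketing table points:

function quintic_hermite(x, x0, x1, f0, fp0, fpp0, f1, fp1, fpp1) result(f)
  ! Illustrative 1D quintic Hermite interpolation between table points x0 and x1,
  ! given f, df/dx, and d2f/dx2 at both points (hypothetical inputs).
  implicit none
  real(8), intent(in) :: x, x0, x1, f0, fp0, fpp0, f1, fp1, fpp1
  real(8) :: f, h, t, s
  real(8) :: psi0t, psi1t, psi2t, psi0s, psi1s, psi2s

  h = x1 - x0
  t = (x - x0) / h          ! normalized coordinate in [0,1]
  s = 1.0d0 - t

  ! Quintic Hermite basis: psi0 matches the value, psi1 the first derivative,
  ! psi2 the second derivative at one endpoint, with zero contributions at the other.
  psi0t = 1.0d0 - 10.0d0*t**3 + 15.0d0*t**4 - 6.0d0*t**5
  psi1t = t - 6.0d0*t**3 + 8.0d0*t**4 - 3.0d0*t**5
  psi2t = 0.5d0*t**2 - 1.5d0*t**3 + 1.5d0*t**4 - 0.5d0*t**5
  psi0s = 1.0d0 - 10.0d0*s**3 + 15.0d0*s**4 - 6.0d0*s**5
  psi1s = s - 6.0d0*s**3 + 8.0d0*s**4 - 3.0d0*s**5
  psi2s = 0.5d0*s**2 - 1.5d0*s**3 + 1.5d0*s**4 - 0.5d0*s**5

  f =   f0*psi0t + h*fp0*psi1t + h*h*fpp0*psi2t   &
      + f1*psi0s - h*fp1*psi1s + h*h*fpp1*psi2s
end function quintic_hermite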
15 J. Austin Harris --- OpenPOWER ADG 2018
Helmholtz EOS
• To determine the best use of the accelerated EOS in FLASH, we used a mini-app driver:
– Mimics AMR block structure and time stepping in FLASH
• Loops through several time steps
• Change the number of total grid zones
• Fill these zones with new data
• Calculate interpolation in all grid zones
16 J. Austin Harris --- OpenPOWER ADG 2018
Helmholtz EOS
1) Allocate main data arrays (global) on host and device
– Arrays of Fortran derived types
• Each element holds grid data for a single zone
– Persist for the duration of the program
– Used to pass zone data back and forth from host to device
• Reduced set sent from H-to-D
• Full set sent from D-to-H
17 J. Austin Harris --- OpenPOWER ADG 2018
Helmholtz EOS
2) Read in tabulated Helmholtz free energy data and make copy on
device
– This will persist for the duration of the program
– Thermodynamic quantities are interpolated from this table
3) For each time step
– Change number of AMR blocks
• ± 5%, consistent with variation encountered in production simulations at high rank count
– Update device with new grid data
– Launch EOS kernel: calculate all interpolated quantities for all grid zones
– Update host with newly calculated quantities
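A minimal sketch of the driver's outer loop following the steps above (illustrative only; names such as fill_new_data, update_device, eos_all_zones, and update_host are hypothetical placeholders). The EOS launch and transfers correspond to the OpenACC / OpenMP 4.5 excerpts on the next slide.

program eos_driver_sketch
  ! Mimic FLASH time stepping: vary the block count, push new zone data to the
  ! device, evaluate the EOS in every zone, and pull the results back.
  implicit none
  integer, parameter :: nsteps = 10, zones_per_block = 256
  integer :: step, nblocks, nzones
  real(8) :: r

  nblocks = 1000                                  ! nominal block count per rank
  do step = 1, nsteps
     call random_number(r)
     nblocks = nint(nblocks*(0.95d0 + 0.10d0*r))  ! vary block count by up to +/- 5%
     nzones  = nblocks*zones_per_block

     call fill_new_data(nzones)      ! hypothetical: refresh host-side zone data
     call update_device(nzones)      ! H2D: send the reduced state to the GPU
     call eos_all_zones(nzones)      ! launch the EOS kernel over all zones
     call update_host(nzones)        ! D2H: return the full interpolated state
  end do
end program eos_driver_sketch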
18 J. Austin Harris --- OpenPOWER ADG 2018
Helmholtz EOS
Basic Flow of Driver Program
OpenACC version:
!$acc update device(reduced_state(start_element:stop_element)) async(thread_id + 1)
!$acc kernels async(thread_id + 1)
do zone = start_element, stop_element
call eos(state(zone), reduced_state(zone))
end do
!$acc end kernels
!$acc update self(state(start_element:stop_element)) async(thread_id + 1)
!$acc wait
OpenMP 4.5 version:
!$omp target update to(reduced_state(start_element:stop_element))
!$omp target
!$omp teams distribute parallel do thread_limit(128) num_threads(128)
do zone = start_element, stop_element
call eos(state(zone), reduced_state(zone))
end do
!$omp end teams distribute parallel do
!$omp end target
!$omp target update from(state(start_element:stop_element))
19 J. Austin Harris --- OpenPOWER ADG 2018
EOS Experiments
• All experiments carried out on
SummitDev
– Nodes have 2 IBM Power8+ 10-core
CPUs
– peak flop rate of approximately 560 GF
– peak memory bandwidth of 340 GB/sec
• + 4 NVIDIA P100 GPUs
– peak single/double precision flop rate of
10.6/5.3 TF
– peak memory bandwidth of 732 GB/sec
• Number of “AMR” blocks: 1, 10, 100,
1000, 10000 (each with 256 zones)
– Emulates 2D block in FLASH
• Tested with 1, 2, 4, and 10 (CPU)
OpenMP threads for each block count
20 J. Austin Harris --- OpenPOWER ADG 2018
OpenACC vs OpenMP 4.5
• DISCLAIMER: at the time these tests were performed,
– PGI’s OpenACC implementation had a mature API (version 16.10)
– IBM’s XL Fortran implementation of OpenMP 4.5 (version 16.1) was a beta version
of the compiler and did not allow pinned memory or asynchronous data transfers /
kernel execution
21 J. Austin Harris --- OpenPOWER ADG 2018
Results
• For high numbers of AMR blocks, OpenACC is roughly 3x faster
• More complicated behavior for lower block counts
22 J. Austin Harris --- OpenPOWER ADG 2018
OpenACC at low block counts
• At low AMR block counts, kernel launch overhead (~0.1 ms) is large relative to
compute time, and increased work does little to increase total performance.
23 J. Austin Harris --- OpenPOWER ADG 2018
OpenACC kernel overheads continued
• Multiple CPU threads stagger
H2D transfers, exacerbating
kernel overhead delay
24 J. Austin Harris --- OpenPOWER ADG 2018
OpenACC at high block counts
• At higher block counts,
kernel overhead is
negligible; now dominated
by D2H transfers
25 J. Austin Harris --- OpenPOWER ADG 2018
OpenMP at low block counts
• There is no asynchronous GPU
execution, i.e. the work enqueued
by each CPU thread is serialized
on the device.
• Performance is proportionally less
than the OpenACC asynchronous
execution.
26 J. Austin Harris --- OpenPOWER ADG 2018
OpenMP at higher block counts
• Lack of asynchronous execution
becomes less important, as the
device compute capability is
saturated.
• D2H (and H2D) transfers are significantly slower than for OpenACC, since here we
lack the ability to pin CPU memory.
27 J. Austin Harris --- OpenPOWER ADG 2018
Optimal GPU configuration
• Clear advantage from GPUs when >100 AMR blocks in 2D (or >6 in 3D); see the
zone-count arithmetic below
• Can calculate 100 2D blocks with
GPU in roughly the same time as 1
2D block without
• So in FLASH, we should compute
100s to 1000s of 2D blocks per MPI
rank, depending on available memory
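The 2D and 3D thresholds are consistent when expressed in total zones, using the block sizes quoted earlier (256 zones per emulated 2D block, 16³ = 4096 zones per 3D block):
$$100 \times 16^{2} = 25{,}600 \ \text{zones} \;\approx\; 6 \times 16^{3} = 24{,}576 \ \text{zones}$$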
28 J. Austin Harris --- OpenPOWER ADG 2018
EOS Summary
• OpenMP provides an effective path to performance portability, so despite the lower
performance here, we plan to test the OpenMP 4.5 implementation in FLASH production.
– Primary factors affecting current OpenMP performance are the serialization of kernels on the
device and high data transfer times associated with having to use pageable memory when
using OpenMP 4.5. These are technical problems that are certainly surmountable.
• In general, we find that the best balance between CPU threads and block
number occurs at and above 2-4 CPU threads and roughly 1,000 2D blocks. We
can retire all 1,000 of these EOS evaluations in a time less than 10x the fastest
100-block calculation for both OpenACC and OpenMP.
– This mode is congruent with our planned production use of FLASH on the OLCF Summit
machine, where we will place 3 MPI ranks on each CPU socket, each bound to one of the
three available, closely coupled GPUs.
29 J. Austin Harris --- OpenPOWER ADG 2018
Conclusions
• Overall, very positive experience with Summit
– Some issues with parallel HDF5 under investigation
• With the upgrades to the nuclear burning and EOS in FLASH, we find
significant speedup (2x - 3x) relative to the CPU alone
• Still plenty of work to do!
– Hydrodynamics
– Gravity
– Radiation Transport