This document discusses recent developments in the HPX and Octo-Tiger software frameworks. It describes how HPX provides uniform APIs for local and remote parallel operations and exposes performance counters. Octo-Tiger is an astrophysics simulation program that uses HPX for parallelization and supports MPI, Kokkos, and CUDA/HIP backends. New features discussed include integrating Kokkos with HPX to enable vectorization, different work aggregation strategies, and measurements showing lightweight threads in HPX have low overhead for large simulations. Scaling tests on supercomputers like Summit demonstrate good strong and weak scaling using HPX's asynchronous communication compared to MPI.
Recent developments in HPX and Octo-Tiger (Patrick Diehl)
In this talk, we will briefly introduce the astrophysics application Octo-Tiger, which simulates the evolution of star systems based on the fast multipole method on adaptive octrees. This application is the most advanced HPX application with support for CUDA and AMD accelerator cards. In the remainder of the talk, we will discuss the pure CUDA integration and the recently added Kokkos integration, which provides portability across heterogeneous accelerator cards, especially AMD GPUs. Here, we showcase scaling results on ORNL's Summit and CSCS's Piz Daint. Another aspect is performance profiling in asynchronous applications: we recently added CUDA profiling to HPX's performance framework, APEX, so we can collect combined distributed CPU and GPU profiles to analyze the performance of HPX and Octo-Tiger. We show runs on Summit and Piz Daint to compare the performance of different architectures and GPUs. The final aspect is the overhead introduced by the performance framework, where we study the overhead of CPU-only profiling and the much more expensive overhead added by CUDA profiling.
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku (Patrick Diehl)
This document summarizes a presentation on simulating stellar mergers using HPX/Kokkos on the Fugaku supercomputer. It discusses the astrophysical application of merging stars, the Octo-Tiger software framework, Kokkos for performance portability, HPX for concurrency and parallelism, porting to the Fugaku supercomputer, node-level and distributed scaling tests, performance optimizations, and conclusions about the success and potential for future work.
Porting our astrophysics application to Arm64FX and adding Arm64FX support us... (Patrick Diehl)
This document discusses porting the astrophysics application Octo-Tiger to Arm64FX processors and adding Arm64FX support using the Kokkos and HPX frameworks. It provides an overview of Octo-Tiger, HPX, and how Kokkos is integrated. Preliminary scaling tests on an Arm64FX node show good node-level scaling and basic distributed scaling across multiple nodes. Future work includes optimizing for SVE and porting more components to Kokkos to improve performance.
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptx (OpenACC)
Stay up-to-date on the latest news, research and resources. This month's edition covers the Princeton GPU Hackathon, OpenACC at SC22, updates from GNU Tools Cauldron, the upcoming UK DPU Hackathon, relevant research and more!
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx (OpenACC)
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. July’s edition covers the 2022 OpenACC and Hackathons Summit, NVIDIA’s Applied Research Accelerator Program, upcoming Open Hackathons and Bootcamps, recent research, new resources, and more!
Stay up-to-date with the OpenACC Monthly Highlights. February's edition covers the updated specification OpenACC 3.2, upcoming GPU Hackathons and Bootcamps, OpenACC's BOF at SC21, recent research, new resources and more!
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers the first remote GPU Hackathons, a complete schedule of upcoming events, using OpenACC for a biophysics problem, NVIDIA HPC SDK, GCC 10, new resources and more!
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers working on applications for the new Frontier supercomputer, using OpenACC for weather forecasting, upcoming GPU Hackathons and Bootcamps, and new resources!
Developing, experimenting, and deploying ML models at scale requires substantial tooling, scripting, tracking, versioning, and monitoring.
Watch full video here: https://cnvrg.io/webinars-and-workshops/scaling-mlops-on-nvidia-dgx-systems/
Data scientists want to do data science – and are slowed down by MLOps and DevOps tasks.
They lack user friendly tools needed to track experiments, attach resources, manage datasets and launch multiple ML pipelines.
In this presentation, cnvrg.io CEO Yochay Ettun hosts a special guest from NVIDIA, Michael Balint, Sr. Product Manager for NVIDIA DGX systems, to discuss how to optimize the use of any NVIDIA DGX and NVIDIA GPU asset, both on-prem and in the cloud, with the cnvrg.io machine learning platform.
We will show best practices to reach high utilization of NVIDIA DGX systems, while conducting meta-scheduling across multiple heterogeneous Kubernetes/OpenShift/Linux server clusters.
In addition, we will introduce the concept of production flows, which automate hundreds of models from the data hub to deployment. We will wrap up with a real-life demo of flows, exercising many experiments across DGX platforms.
What you will learn:
- Creating a data science flow: from data to deployment, while attaching different NVIDIA DGX Kubernetes clusters to each step of the flow
- The concept of a meta-scheduler: scheduling experiments across dispersed resources or other schedulers, achieving high utilization at scale
- How the NVIDIA DGX ecosystem with cnvrg.io makes GPU assets easy to consume with one click, bypassing the complexity of MLOps
- How to leverage NGC containers in ML pipelines
You can watch the full presentation along with audio and video in the link here: https://cnvrg.io/webinars-and-workshops/scaling-mlops-on-nvidia-dgx-systems/
OpenACC and Open Hackathons Monthly Highlights August 2022 (OpenACC)
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. August’s edition covers the 2022 OpenACC and Hackathons Asia-Pacific Summit, NVIDIA’s GTC, upcoming Open Hackathons and Bootcamps, EuroHPC, the launch of Frontier and Polaris supercomputers, recent research, new resources, and more!
Stay up-to-date with the OpenACC Monthly Highlights. June's edition covers the OpenACC Summit 2021, NVIDIA GTC'21 on-demand sessions, upcoming GPU Hackathons and Bootcamps, Intersect360 Research HPC market forecast, recent research, new resources and more!
Stay up-to-date with the OpenACC Monthly Highlights. July's edition covers the OpenACC Summit 2021, upcoming GPU Hackathons and Bootcamps, PEARC21 panel review, recent research, new resources and more!
Evaluating HPX and Kokkos on RISC-V Using an Astrophysics Application Octo-Tiger (Patrick Diehl)
In recent years, computers based on the RISC-V architecture have raised broad interest in the high-performance computing (HPC) community. As the RISC-V community develops the core instruction set architecture (ISA) along with ISA extensions, the HPC community has been actively ensuring HPC applications and environments are supported. In this context, assessing the performance of asynchronous many-task runtime systems (AMT) is important. We describe our experience with porting of a full 3D adaptive mesh-refinement, multi-scale, multi-model, and multi-physics application, Octo-Tiger, that is based on the HPX AMT, and we explore its performance characteristics on different RISC-V systems. The demonstrated results confirm that Octo-Tiger shows good scaling behavior on all tested systems. We, however, expect that exceptional hardware support based on dedicated ISA extensions (such as single-cycle context switches, extended atomic operations, and direct support for HPX's global address space) would allow for even better performance results.
Big Data, Big Computing, AI, and Environmental Science (Ian Foster)
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
At the technology meeting of the Association of Independent Research Centers (http://airi.org): An overview of recent Scientific Computing activities at Fred Hutch, Seattle
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers the newly released PGI 19.7, the upcoming 2019 OpenACC Annual Meeting, GPU Bootcamp at RIKEN R-CCS, a complete schedule of GPU hackathons and more!
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility (inside-BigData.com)
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses accelerating science discovery with AI inference-as-a-service. It describes showcases using this approach for high energy physics and gravitational wave experiments. It outlines the vision of the A3D3 institute to unite domain scientists, computer scientists, and engineers to achieve real-time AI and transform science. Examples are provided of using AI inference-as-a-service to accelerate workflows for CMS, ProtoDUNE, LIGO, and other experiments.
The document provides an overview of grid computing concepts, architecture, and applications. It discusses how grid users like scientists need coordinated sharing of resources for tasks like data-intensive analysis and simulations. The key challenges for grids are establishing trust between participants and enabling secure sharing of applications and data. Standards like Globus and OGSA have evolved to address these through services, virtual organizations, and other architectural components. Example applications described are the NEESgrid for earthquake engineering and the Virtual Observatory for astronomy data sharing.
OpenACC and Open Hackathons Monthly Highlights June 2022.pdf (OpenACC)
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. June’s edition covers the 2022 OpenACC and Hackathons Summit, NSF’s Traineeship Program, NVIDIA’s Academic Hardware Grant program, upcoming Open Hackathons and Bootcamps, recent research, new resources, and more!
4 TeraGrid sites have focal points:
- SDSC – The Data Place: large-scale and high-performance data analysis/handling; every cluster node is directly attached to the SAN
- NCSA – The Compute Place: large-scale, large-FLOPS computation
- Argonne – The Viz Place: scalable viz walls
- Caltech – The Applications Place: data and FLOPS for applications, especially some of the GriPhyN apps
Specific machine configurations reflect this.
Mulvery is a Ruby-based development environment for CPU+FPGA platforms that allows hardware synthesis from Ruby code. The developer created Mulvery to simplify development for CPU+FPGA systems by allowing developers to use only Ruby. Mulvery uses reactive programming concepts and extracts parts of code suitable for hardware implementation, generating hardware designs without requiring hardware knowledge from developers.
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc... (Arghya Kusum Das)
The paper is published in IEEE Cloud 2017 with title 'Augmenting Amdahl's Second Law: A Theoretical Model to Build Cost-Effective Balanced HPC Infrastructure for Data-Driven Science'
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark (Databricks)
Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data and Apache Spark. Increasingly, Spark users want to integrate Spark with distributed deep learning and machine learning frameworks built for state-of-the-art training. On the other side, DL/AI users increasingly want to handle the large and complex data scenarios needed for their production pipelines.
This talk introduces a new project that substantially improves the performance and fault recovery of distributed deep learning and machine learning frameworks on Spark. We will introduce the major directions and provide progress updates, including 1) barrier execution mode for distributed DL training, 2) fast data exchange between Spark and DL frameworks, and 3) accelerator-aware scheduling.
Grid computing involves distributing computing resources across a network to tackle large problems. The Worldwide LHC Computing Grid (WLCG) was established to support the Large Hadron Collider (LHC) experiment, which produces around 15 petabytes of data annually. The WLCG uses a four-tiered model, with raw data stored at Tier-0 (CERN), copies distributed to Tier-1 data centers, computational resources provided by Tier-2 centers, and Tier-3 facilities providing additional analysis capabilities. This distributed model has proven effective in supporting the first year of LHC data collection and analysis through globally shared computing resources.
Low Power High-Performance Computing on the BeagleBoard Platform (a3labdsp)
The ever-increasing energy requirements of supercomputers and server farms are driving the scientific and industrial communities to take the energy efficiency of computing equipment into deeper consideration. This contribution addresses the issue by proposing a cluster of ARM processors for high-performance computing. The cluster is composed of five BeagleBoard-xM boards, with one board managing the cluster and the other boards executing the actual processing. The software platform is based on the Angstrom GNU/Linux distribution and is equipped with a distributed file system to ease sharing data and code among the nodes of the cluster, and with tools for managing tasks and monitoring the status of each node. The computational capabilities of the cluster have been assessed through High-Performance Linpack and a cluster-wide speaker diarization algorithm, while power consumption has been measured using a clamp meter. Experimental results obtained in the speaker diarization task showed that the energy efficiency of the BeagleBoard-xM cluster is comparable to that of a laptop computer equipped with an Intel Core2 Duo T8300 running at 2.4 GHz. Furthermore, removing the bottleneck due to the Ethernet interface, the BeagleBoard-xM cluster is able to achieve a superior energy efficiency.
D-HPC Workshop Panel: S4PST: Stewardship of Programming Systems and Tools (Patrick Diehl)
Patrick Diehl received his Ph.D. in Applied Mathematics from the University of Bonn. He was a postdoctoral fellow at École Polytechnique de Montréal prior to joining Louisiana State University as a research scientist and adjunct faculty. His research interests include scientific high-performance computing, asynchronous many-task runtime systems, and Modern C++. He teaches Modern C++ and advocates for open-source software in science to enhance reproducibility.
Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HP... (Patrick Diehl)
Many scientific high performance codes that simulate e.g. black holes, coastal waves, climate and weather, etc. rely on block-structured meshes and use finite differencing methods to iteratively solve the appropriate systems of differential equations. In this paper we investigate implementations of an extremely simple simulation of this type using various programming systems and languages. We focus on a shared memory, parallelized algorithm that simulates a 1D heat diffusion using asynchronous queues for the ghost zone exchange. We discuss the advantages of the various platforms and explore the performance of this model code on different computing architectures: Intel, AMD, and ARM64FX. As a result, Python was the slowest of the set we compared. Java, Go, Swift, and Julia were the intermediate performers. The higher performing platforms were C++, Rust, Chapel, Charm++, and HPX.
The document discusses two contexts of subtle asynchrony. First, it discusses how to bring asynchronous task parallelism to Fortran without relying on threads. Second, it describes how NWChem achieves asynchronous task parallelism through overdecomposition of work, without programmers explicitly using tasks. This demonstrates that asynchronous many-task execution principles can be achieved without specialized runtime systems or programming abstractions. Quantum chemistry algorithms are provided as an example where overdecomposition leads to implicit asynchronous parallelism through dynamic scheduling of irregularly distributed tasks.
Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran (Patrick Diehl)
Most parallel scientific programs contain compiler directives (pragmas) such as those from OpenMP, explicit calls to runtime library procedures such as those implementing the Message Passing Interface (MPI), or compiler-specific language extensions such as those provided by CUDA. By contrast, the recent Fortran standards empower developers to express parallel algorithms without directly referencing lower-level parallel programming models. Fortran's parallel features place the language within the Partitioned Global Address Space (PGAS) class of programming models. When writing programs that exploit data-parallelism, application developers often find it straightforward to develop custom parallel algorithms. Problems involving complex, heterogeneous, staged calculations, however, pose much greater challenges. Such applications require careful coordination of tasks in a manner that respects dependencies prescribed by a directed acyclic graph. When rolling one's own solution proves difficult, extending a customizable framework becomes attractive. The paper presents the design, implementation, and use of the Framework for Extensible Asynchronous Task Scheduling (FEATS), which we believe to be the first task-scheduling tool written in modern Fortran. We describe the benefits and compromises associated with choosing Fortran as the implementation language, and we propose ways in which future Fortran standards can best support the use case in this paper.
JOSS and FLOSS for science: Examples for promoting open source software and s... (Patrick Diehl)
P. Diehl. JOSS and FLOSS for science: Examples for promoting open source software and science communication. SIGDIUS Seminars, 14.06.2023, Virtual event.
A tale of two approaches for coupling nonlocal and local models (Patrick Diehl)
The document summarizes a presentation on coupling nonlocal peridynamic models with local continuum models for fracture simulations. It discusses two approaches: 1) Enriching the partition of unity method (PUM) with peridynamic models by using a global-local solution, where the global PUM solution identifies regions for local peridynamic solving. 2) Directly coupling classical and peridynamic models. Several numerical examples are shown comparing the two models on problems like beam bending, stationary cracks, and inclined cracks. The results demonstrate the models match when peridynamics is in the linear regime but diverge once bond softening occurs.
Challenges for coupling approaches for classical linear elasticity and bond-b... (Patrick Diehl)
The document presents coupling approaches for combining classical linear elasticity models with non-local peridynamic models for applications in computational mechanics. It describes three coupling methods - matching displacements (MDCM), matching stresses (MSCM), and variable horizon (VHCM). Numerical examples are presented to compare the accuracy of the three methods on manufactured solutions using cubic and quartic polynomials, demonstrating that all methods converge with refinement but VHCM typically has the lowest error.
Quantifying Overheads in Charm++ and HPX using Task Bench (Patrick Diehl)
Asynchronous Many-Task (AMT) runtime systems take advantage of multi-core architectures with light-weight threads, asynchronous executions, and smart scheduling. In this paper, we present a comparison of the AMT systems Charm++ and HPX with the mainstream MPI, OpenMP, and MPI+OpenMP libraries using the Task Bench benchmarks. Charm++ is a parallel programming language based on C++, supporting stackless tasks as well as light-weight threads, along with an adaptive runtime system. HPX is a C++ library for concurrency and parallelism, exposing a C++ standards-conforming API. First, we analyze the commonalities, differences, and advantageous scenarios of Charm++ and HPX in detail. Further, to investigate the potential overheads introduced by the tasking systems of Charm++ and HPX, we utilize an existing parameterized benchmark, Task Bench, in which 15 different programming systems were implemented, e.g., MPI, OpenMP, and MPI + OpenMP, and extend Task Bench by adding HPX implementations. We quantify the overheads of Charm++, HPX, and the mainstream libraries in different scenarios where a single task and multiple tasks are assigned to each core, respectively. We also investigate each system's scalability and its ability to hide communication latency.
Interactive C++ code development using C++Explorer and GitHub Classroom for e... (Patrick Diehl)
The document describes using C++Explorer and GitHub Classroom for interactive C++ code development and teaching parallel programming concepts. C++Explorer allows running C++ code interactively in Jupyter notebooks and includes Cling, HPX, and Blaze libraries. GitHub Classroom is used for version control and collaboration. A survey of students found the environment was mostly well-received, with room for additional features. The materials are open source to allow other educators to use in their teaching.
An asynchronous and task-based implementation of peridynamics utilizing HPX—t... (Patrick Diehl)
On modern supercomputers, asynchronous many-task systems are emerging to address the new architecture of computational nodes. With this shift to increasing core counts per node, a new programming model is needed that focuses on handling the fine-grain parallelism of nodes with many cores. Asynchronous Many Task (AMT) runtime systems represent a paradigm for addressing this fine-grain parallelism: they handle the increasing number of threads per node and the resulting concurrency. HPX, an open source C++ standard library for parallelism and concurrency, is one AMT conforming to the C++ standard. Motivated by the impressive performance of asynchronous task-based parallelism through HPX in N-body problems and astrophysics simulations, in this work we consider its application to the peridynamics theory. Peridynamics is a non-local generalization of continuum mechanics tailored to address discontinuous displacement fields arising in fracture mechanics. Peridynamics requires considerable computing resources, owing to the non-local nature of its formulation, offering scope for improved computing performance via asynchronous task-based parallelism. Our results show that HPX-based peridynamic computation is scalable, and the scalability is in agreement with the theory. In addition to the scalability, we also show validation results and mesh convergence results. For the validation, we consider implicit time integration and compare the result with classical continuum mechanics (CCM) (peridynamics under small deformation should give similar results as CCM). For the mesh convergence, we consider explicit time integration and show that the results are in agreement with theoretical claims in previous works.
Quasistatic Fracture using Nonlinear-Nonlocal Elastostatics with an Analytic T... (Patrick Diehl)
The document discusses a new method for quasistatic fracture simulation using a regularized nonlinear pairwise (RNP) potential. Key points:
1) An analytic tangent stiffness matrix is derived for the RNP potential by taking the derivative of the bond potential, allowing for more efficient simulations.
2) Two loading algorithms are presented - soft loading and hard loading. Soft loading uses bond softening while hard loading applies a prescribed displacement field.
3) Numerical results show the method can capture linear elastic behavior, bond softening prior to crack growth, and eventual stable crack propagation under both soft and hard loading conditions.
A review of benchmark experiments for the validation of peridynamics models (Patrick Diehl)
This document reviews peridynamic models that have been compared to experimental data. It summarizes 39 papers that compared peridynamic simulations to experiments in areas like wave propagation, crack initiation/propagation in materials like composites, steel, aluminum, concrete and glass. It evaluates the confidence of peridynamic models by looking at metrics like relative error of scalar observables and R^2 correlation of observable series. It also discusses two papers on advanced visualization techniques for fracture simulations: physically-based rendering and extraction of fragments. The document concludes with an outlook on future work.
Deploying a Task-based Runtime System on Raspberry Pi Clusters (Patrick Diehl)
This document summarizes research deploying a task-based runtime system on Raspberry Pi clusters. The researchers used Raspberry Pi 3B, 3B+, and 4 models to build small, low-cost clusters. They benchmarked HPX and Phylanx applications, finding best performance on 2 cores due to memory bandwidth limitations. Multi-node codes scaled well but threads provided little gain. The clusters showed modest performance at reasonable cost and could be used for teaching and collecting sensor data in the field.
EMI 2021 - A comparative review of peridynamics and phase-field models for en... (Patrick Diehl)
This document provides a comparative review of peridynamics and phase-field models for engineering fracture mechanics. It discusses the computational aspects of both models, their advantages in modeling crack initiation and propagation, and common challenges. Some specific challenges for each model are also outlined, such as applying boundary conditions for peridynamics and modeling fast crack propagation under dynamic loading for phase-field. The conclusion states that both models can capture microscale fracture physics but comparative validation studies against experimental data are still lacking.
Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ... (Patrick Diehl)
This document outlines the agenda for a session on open science and open source software. The session will discuss why open source software is essential for open science, the role of Google Summer of Code in this process, and how to make students aware of the importance of open science and open source topics through undergraduate and graduate education. The agenda also allows time for general discussion of these issues.
Immersive Learning That Works: Research Grounding and Paths Forward (Leonel Morgado)
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ... (Travis Hills MN)
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita... (Advanced-Concepts-Team)
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial satellites
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
The binding of cosmological structures by massless topological defects (Sérgio Sacani)
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf (Selcen Ozturkcan)
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Recent developments in HPX and Octo-Tiger
1. Recent developments in HPX and Octo-Tiger
Patrick Diehl
Joint work with: Gregor Daiß, Sagiv Schieber, Dominic Marcello, Kevin Huck,
Hartmut Kaiser, Juhan Frank, Geoffrey Clayton, Patrick Motl, Dirk Pflüger,
Orsola De Marco, Mikael Simberg, John Biddiscombe, Srinivas Yadav, and many
more
Center for Computation & Technology, Louisiana State University
Department of Physics & Astronomy
patrickdiehl@lsu.edu
November 2022
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 1 / 44
2. Motivation
Astrophysical event: Merging of two stars – flow on the surface, which corresponds, in layman's terms, to the weather on the stars.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 2 / 44
3. Outline
1 Astrophysical application
2 Software framework
Octo-Tiger
HPX
APEX
3 New features
Kokkos and HPX
Vectorization
Work aggregation
4 Overhead measurements
5 Scaling
Synchronous (MPI) vs asynchronous communication (libfabric)
Scaling on ORNL’s Summit
First experience on Fugaku using A64FX
6 Performance profiling
7 Conclusion and Outlook
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 3 / 44
6. V1309 Scorpii
At peak brightness, the rare 2002 red nova V838 Monocerotis briefly rivalled the most powerful stars in the Galaxy. Credit: NASA/ESA/H. E. Bond (STScI)
A near-infrared (I band) light curve for V1309 Scorpii, plotted from OGLE data
Reference
Tylenda, R., et al. ”V1309 Scorpii: merger of a contact binary.” Astronomy & Astrophysics 528 (2011): A114.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 6 / 44
8. Octo-Tiger
Astrophysics open source program1 simulating the evolution of star systems based on the fast multipole method on adaptive octrees.
Modules
Hydro
Gravity
Radiation (benchmarking)
Supports
Communication: MPI/libfabric
Backends: CUDA, HIP, Kokkos
Reference
Marcello, Dominic C., et al. ”octo-tiger: a new, 3D hydrodynamic code for stellar mergers that uses hpx parallelization.”
Monthly Notices of the Royal Astronomical Society 504.4 (2021): 5345-5382.
1 https://github.com/STEllAR-GROUP/octotiger
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 8 / 44
9. Example of adaptive mesh refinement
Reference
Heller, Thomas, et al. ”Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two
stars.” The International Journal of High Performance Computing Applications 33.4 (2019): 699-715.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 9 / 44
11. HPX
HPX is an open source C++ Standard Library for Concurrency and Parallelism2.
Features
HPX exposes a uniform, standards-oriented API for ease of programming parallel and distributed applications.
HPX provides unified syntax and semantics for local and remote operations.
HPX exposes a uniform, flexible, and extendable performance counter framework which can enable runtime adaptivity.
Reference
Kaiser, Hartmut, et al. ”Hpx-the c++ standard library for parallelism and concurrency.” Journal of Open Source
Software 5.53 (2020): 2352.
2 https://github.com/STEllAR-GROUP/hpx
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 10 / 44
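To make the unified local/remote semantics concrete, here is a minimal sketch (not from the talk; the function and values are illustrative only) of HPX's standards-conforming, future-based API:

#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>
#include <iostream>

int square(int x) { return x * x; }

int main() {
    // hpx::async mirrors std::async but schedules an HPX light-weight thread;
    // registering square as an HPX action would let the same call site target
    // a remote locality with unchanged syntax.
    hpx::future<int> f = hpx::async(square, 21);
    std::cout << f.get() << std::endl;  // prints 441
    return 0;
}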
12. HPX’s architecture
[Architecture diagram: between the application and the operating system, HPX layers the C++2z concurrency/parallelism APIs, the threading subsystem, the Active Global Address Space (AGAS), Local Control Objects (LCOs), the parcel transport layer (networking), the performance counter framework, and the policy engine/policies.]
Reference
Kaiser, Hartmut, et al. ”Hpx-the c++ standard library for parallelism and concurrency.” Journal of Open Source
Software 5.53 (2020): 2352.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 11 / 44
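As a hedged illustration of the performance counter framework in the diagram above: HPX applications can print counters from the command line, e.g. (this counter name follows the HPX documentation; available counters vary by version)

./my_hpx_app --hpx:print-counter=/threads{locality#0/total}/count/cumulative

which reports the cumulative number of HPX tasks executed on locality 0.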
13. APEX
APEX: Autonomous Performance Environment for Exascale, a performance measurement library for distributed, asynchronous multitasking systems.
CUPTI used to capture CUDA events
NVML used to monitor the GPU
OTF2 and Google Trace Events trace
output
Task Graphs and Trees
Scatterplots of timers and counters
Reference
Huck, Kevin A., et al. ”An autonomic performance environment for exascale.” Supercomputing frontiers and innovations
2.3 (2015): 49-66.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 12 / 44
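A hedged usage note (not from the slides): APEX is typically enabled when HPX is built with APEX support and is then controlled through environment variables; assuming APEX's documented switches, a traced run might look like

APEX_SCREEN_OUTPUT=1 APEX_OTF2=1 ./octotiger ...

yielding a screen summary plus an OTF2 trace that trace viewers can display.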
14. APEX
To support performance measurement in systems that employ user-level threading, APEX uses a dependency chain in addition to the call stack to produce traces and task dependency graphs.
CUPTI used to capture CUDA events
NVML used to monitor the GPU
OTF2 and Google Trace Events trace
output
Task Graphs and Trees
Scatterplots of timers and counters
Reference
Huck, Kevin A., et al. ”An autonomic performance environment for exascale.” Supercomputing frontiers and innovations
2.3 (2015): 49-66.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 12 / 44
16. HPX and Kokkos
Reference
Edwards, H. Carter, Christian R. Trott, and Daniel Sunderland. ”Kokkos: Enabling manycore performance portability
through polymorphic memory access patterns.” Journal of parallel and distributed computing 74.12 (2014): 3202-3216.
Daiß, Gregor, et al. ”Beyond Fork-Join: Integration of Performance Portable Kokkos Kernels with HPX.” 2021 IEEE
International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2021.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 14 / 44
17. Overhead
Reference
Daiß, Gregor, et al. ”Beyond Fork-Join: Integration of Performance Portable Kokkos Kernels with HPX.” 2021 IEEE
International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2021.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 15 / 44
19. Vectorization using Kokkos + HPX
We can now easily switch both the SIMD library (Kokkos SIMD or std::experimental::simd) and the SIMD extensions used.
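As a minimal sketch of this pattern (our own illustration using std::experimental::simd, shipped e.g. with GCC 11+; a Kokkos SIMD type could be substituted through the same alias), the kernel is written once against an abstract vector type, and switching the SIMD library or extension only means changing one alias.

    #include <experimental/simd>
    #include <cstddef>

    namespace stdx = std::experimental;

    // Switching the SIMD library or extension means changing this alias only.
    using simd_t = stdx::native_simd<double>;

    void scale(double* data, std::size_t n, double factor)
    {
        std::size_t i = 0;
        // Vectorized main loop: load, multiply, and store
        // simd_t::size() lanes at a time.
        for (; i + simd_t::size() <= n; i += simd_t::size()) {
            simd_t v(&data[i], stdx::element_aligned);
            v *= factor;
            v.copy_to(&data[i], stdx::element_aligned);
        }
        // Scalar remainder loop for the leftover elements.
        for (; i < n; ++i)
            data[i] *= factor;
    }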
Reference
Daiß, Gregor, et al. "From Merging Frameworks to Merging Stars: Experiences using HPX, Kokkos and SIMD Types." arXiv preprint arXiv:2210.06439 (2022). (Accepted to SC22 workshop proceedings)
20. Single node runs on different CPU architectures
21. Work aggregation strategies
Reference
Daiß, Gregor, et al. "From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels." arXiv preprint arXiv:2210.06438 (2022). (Accepted to SC22 workshop proceedings)
24. Task Bench
Task Bench is a configurable benchmark for evaluating the efficiency and performance of parallel and distributed programming models, runtimes, and languages. It is primarily intended for evaluating task-based models, in which the basic unit of execution is a task, but it can be implemented in any parallel system.
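To illustrate what such a task-based workload looks like (our own sketch in HPX, not Task Bench code), the following builds a small 1D-stencil task graph in the spirit of Task Bench's parameterized graphs, with each task depending on its two neighbors from the previous time step:

    #include <hpx/future.hpp>
    #include <hpx/hpx_main.hpp>

    #include <utility>
    #include <vector>

    int main()
    {
        int const width = 8;   // tasks per time step
        int const steps = 4;   // time steps (depth of the task graph)

        std::vector<hpx::shared_future<double>> prev;
        for (int i = 0; i < width; ++i)
            prev.push_back(hpx::make_ready_future(double(i)).share());

        for (int t = 0; t < steps; ++t) {
            std::vector<hpx::shared_future<double>> next;
            for (int i = 0; i < width; ++i) {
                int const l = (i + width - 1) % width;  // periodic boundaries
                int const r = (i + 1) % width;
                // hpx::dataflow schedules the task once both inputs are ready.
                next.push_back(hpx::dataflow(
                    [](hpx::shared_future<double> a,
                       hpx::shared_future<double> b) {
                        return 0.5 * (a.get() + b.get());
                    },
                    prev[l], prev[r]).share());
            }
            prev = std::move(next);
        }
        return prev[0].get() >= 0.0 ? 0 : 1;
    }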
Reference
Slaughter, Elliott, et al. "Task Bench: A parameterized benchmark for evaluating parallel runtime performance." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.
25. Measurements
Single node: 1 task per core
Distributed runs: 16 tasks per core
METG (Minimum Effective Task Granularity): the 50% effective task granularity, i.e. the smallest task granularity at which a system still achieves 50% overall efficiency (sketched below).
Takeaway: Lightweight threads and work stealing come with overhead, but not for large simulations.
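As a hedged illustration of how METG(50%) is read off measured data (a hypothetical helper, not the actual Task Bench tooling): compute the overall efficiency for each tested task granularity and take the smallest granularity that still reaches 50%.

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct Sample
    {
        double granularity_us;  // average task granularity in microseconds
        double efficiency;      // measured overall efficiency in [0, 1]
    };

    double metg_50(std::vector<Sample> samples)
    {
        // Scan from the finest to the coarsest granularity.
        std::sort(samples.begin(), samples.end(),
            [](Sample const& a, Sample const& b) {
                return a.granularity_us < b.granularity_us;
            });
        for (Sample const& s : samples)
            if (s.efficiency >= 0.5)  // first granularity reaching 50%
                return s.granularity_us;
        return std::numeric_limits<double>::quiet_NaN();  // never reached
    }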
Reference
Wu, Nanmiao, et al. "Quantifying Overheads in Charm++ and HPX using Task Bench." arXiv preprint arXiv:2207.12127 (2022). (Accepted to Euro-Par 22 workshop proceedings)
27. Synchronous (MPI) vs asynchronous communication (libfabric)
28. Configuration
Reference
Daiß, Gregor, et al. "From Piz Daint to the stars: Simulation of stellar mergers using high-level abstractions." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.
29. Synchronous vs asynchronous communication
Reference
Daiß, Gregor, et al. "From Piz Daint to the stars: Simulation of stellar mergers using high-level abstractions." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.
30. Scaling on ORNL’s Summit
31. Node level scaling: Hydro
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
32. Distributed scaling: Hydro
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
33. Node level scaling: Hydro + Gravity
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
34. Distributed scaling: Hydro + Gravity
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
35. First experience on Fugaku using A64FX
36. Porting HPX to Arm
Challenges on Fugaku:
Add support for the parallel job manager (PJM).
Cross compilation on the x86 head node for the A64FX compute nodes; some of the dependencies do not support cross compilation.
Reference
Gupta, Nikunj, et al. "Deploying a task-based runtime system on Raspberry Pi clusters." 2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). IEEE, 2020.
37. Node-level scaling
SVE vectorization reduced the computation time by a factor of two
compared to Neon.
38. Distributed scaling
Due to the 28 GB of memory per node, more nodes are required on Fugaku than on other machines.
References: HPCI report submitted and IPDPS workshop paper in preparation.
41. Task trees and task graphs
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
Diehl, Patrick, et al. "Distributed, combined CPU and GPU profiling within HPX using APEX." arXiv preprint arXiv:2210.06437 (2022).
42. Sampled profile of tasks on Piz Daint and Summit
44. Conclusion
HPX and Octo-Tiger
Asynchronous integration of GPUs in HPX:
→ using the CUDA or HIP API
→ using Kokkos for NVIDIA or AMD GPUs
Providing HPX as a backend for Kokkos and integration of asynchronous Kokkos launches
Work aggregation for small tasks on CPU and GPU
Vectorization using std::simd
Alternatives to MPI, like libfabric or LCI
Programming model
Work stealing, overlapping communication and computation, and lightweight threads can improve the performance of irregular workloads.
45. Outlook
Distributed scaling data and optimization for AMD GPUs
Optimization and large scale runs on Fugaku or Ookami
Get ready for Intel GPU support
Test the effect of libfabric on other systems
More astrophysics studies to prepare for the production run with the
light curve
Advanced visualization of the large scale results
Thanks to all my collaborators; without their effort, I could not present all these results.
Thanks for your attention! Questions?
47. Resolution convergence: Double white dwarf merger
Reference
Diehl, Patrick, et al. "Performance Measurements Within Asynchronous Task-Based Runtime Systems: A Double White Dwarf Merger as an Application." Computing in Science & Engineering 23.3 (2021): 73-81.
48. Higher reconstruction in the hydro module
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." arXiv preprint arXiv:2107.10987 (2021). (Accepted to IEEE Cluster 21)
49. Comparison with Flash I
Reference
Marcello, Dominic C., et al. "Octo-Tiger: A new, 3D hydrodynamic code for stellar mergers that uses HPX parallelization." Monthly Notices of the Royal Astronomical Society 504.4 (2021): 5345-5382.
50. Comparison with Flash II
Reference
Marcello, Dominic C., et al. "Octo-Tiger: A new, 3D hydrodynamic code for stellar mergers that uses HPX parallelization." Monthly Notices of the Royal Astronomical Society 504.4 (2021): 5345-5382.
51. This work is licensed under a Creative Commons "Attribution-NonCommercial-NoDerivatives 4.0 International" license.