Evaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-Tiger
1. Evaluating HPX and Kokkos on RISC-V using an
Astrophysics Application Octo-Tiger
Patrick Diehl
Joint work with: Gregor Daiß, Steven R. Brandt, Alireza Kheirkhahan,
Hartmut Kaiser, Christopher Taylor, and John Leidel
Louisiana State University
patrickdiehl@lsu.edu
April 25, 2024
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 1 / 27
2. Motivation
What is RISC-V?
RISC-V was introduced in 2015 as an open standard instruction set
architecture (ISA); RISC-V is an iteration on established reduced
instruction set computer (RISC) principles
Why is it interesting for the HPC community?
The RISC-V ISA is completely open for use by anyone and is
royalty-free.
RISC-V is extensible; processor features can be added to provide
customized capabilities (ie: Cache management, SIMD, and Vector
Machine support are optional)
European Processor Initiative (EPI), which aims to develop a
vendor-independent European CPU for high-performance computing,
has identified RISC-V as a target for future investment.
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 2 / 27
3. Overview
1 Astrophysical application
2 Software stack
Octo-Tiger
Kokkos
HPX
3 Porting the software stack to RISC-V
4 In-house RISC-V Test System
5 Performance measurements
Node level scaling
Distributed scaling
6 Energy consumption
7 Conclusion and Outlook
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 3 / 27
5. Example simulation
Astrophysical event: Merging of two stars – Flow on the surface which
corresponds in layman’s terms to the weather on the stars.
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 5 / 27
7. Octo-Tiger
Astrophysics open source program1 simulating the evolution of star
systems based on the fast multipole method on adaptive Octrees.
Modules
Hydro
Gravity
Radiation
Supports
Communication:
MPI/libfabric/LCI/GASNet +
OpenSHMEM
Backends: CUDA, HIP, SYCL
Reference
Marcello, Dominic C., et al. ”octo-tiger: a new, 3D hydrodynamic code for stellar mergers that uses hpx
parallelization.” Monthly Notices of the Royal Astronomical Society 504.4 (2021): 5345-5382.
1
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 7 / 27
8. Kokkos: C++ Performance Portability Programming
EcoSystem
Kokkos is a C++ library2 for writing performance portable applications
targeting all major HPC platforms
CPU
OpenMP
HPX
GPU
Native: CUDA & HIP
SYCL: CUDA & HIP
Reference
Trott, Christian R., et al. ”Kokkos 3: Programming model extensions for the exascale era.” IEEE Transactions on
Parallel and Distributed Systems 33.4 (2021): 805-817.
2
https://github.com/kokkos/kokkos
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 8 / 27
9. HPX
HPX is a open source C++ Standard Library for Concurrency and
Parallelism3.
Features
HPX exposes a uniform, standards-oriented API for ease of
programming parallel and distributed applications.
HPX provides unified syntax and semantics for local and remote
operations.
HPX exposes a uniform, flexible, and extendable performance counter
framework which can enable runtime adaptivity.
Reference
Kaiser, Hartmut, et al. ”HPX-the C++ standard library for parallelism and concurrency.” Journal of Open Source
Software 5.53 (2020): 2352.
3
https://github.com/STEllAR-GROUP/hpx
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 9 / 27
10. Porting the software stack to RISC-V
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 10 / 27
11. Porting HPX
Most parts of HPX are implemented ISO C++. However, small
portions of the runtime system are implemented using assembly.
The HPX context-switching software implementation can optionally
utilize Boost.Context support or a native independently provided
assembly implementation for a targeted ISA. Note HPX already relies
on Boost.
We had to do some single source code modification within the HPX
timer. The RISC-V HPX port implements timing using the RISC-V
RDTIME instruction. RDTIME is a pseudo-instruction that reads
bits from the time Control and Status Register (CSR).
Recall, that since we had an ISO C++ compiler and had Boost support, the
code changes were minimal.
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 11 / 27
12. Porting Kokkos and Octo-Tiger
Kokkos
Building Kokkos required no changes to the code base and GCC
compiled the Kokkos without any issues.
However, Kokkos’s build system CMake files required some minor
changes. The RISC-V architecture was not detected, and incorrect
compiler flags were added for the architecture and vectorization.
Octo-Tiger
Octo-Tiger needed no porting after HPX and Kokkos were already
ported.
Due to the abstraction levels provided by HPX and Kokkos porting the
software stack was a walk in the park.
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 12 / 27
13. In-house RISC-V Test System
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 13 / 27
14. In-house RISC-V test system I
Image of the in-house cluster using
two VisionFive2 Open Source
RISC-V single board computers with
Quad-core StarFive JH7110 64-bit
CPU and 8GB LPDDR4 System
Memory.
Official image based on an older
Ubuntu version
Ubuntu Linux image based on
23.04 had the versions, we need
or the Slurm integration and
recent compilers.
The Ubuntu image does not
support USB and PCIe on the
VisionFive2.
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 14 / 27
15. In-house RISC-V test system II
Two MILK-V with desktop
computers with 64-core SOPHON
SG2042 64-bit CPU and 128 GB
DDR System Memory.
Linux OS: Fedora Linux 38
Slurm integration
GNU compiler collection
MPI
We have the full HPC stack on
RISC-V. At least to run Kokkos and
HPX!
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 15 / 27
17. Node-level scaling (MILK-V)
20 21 22 23 24 25 26
210
212
214
216
# cores
Processed
sub-grids
per
second
DWD (Initial mesh)
Level 10 Level 10 (optimized)
Level 11 Level 11 (optimized)
0.47
1.79
10.8
61.33
GFLOP/s
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 17 / 27
18. Distributed runs (Single-board computer)
0 200 400 600 800 1,000
1-RISC
1-Fugaku
2-RISC-TCP
2-RISC-MPI
2-Fugaku-MPI
91
168
140
778
1,091
Cells processed per second
Figure: For a comparison, runs on a single and two Supercomputer Fugaku nodes
are shown (each using only four cores out of the 48 available ones for a better
comparison).
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 18 / 27
19. Distributed runs (MILK-V) I
0 200 400 600 800
1
1
2
2
105.92
176.11
163.64
225.14
Processed sub-grids per time step
#
nodes
Level 11 (initial mesh)
RISC-V
A64FX
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 19 / 27
20. Distributed runs (MILK-V) II
0 200 400 600 800 1,000
1
16
740.31
635.74
Processed sub-grids per time step
#
nodes
v1309
RISC-V
A64FX
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 20 / 27
22. How to measure energy consumption?
We want to compare the RISC-V boards and Supercomputer Fugaku for
the astrophysics application.
On Supercomputer Fugaku the power consumption was measured
with the PowerAPI interface provided by Riken.
On the RISC-V boards, no hardware counters for power measurements
are present. Here, we attached a power meter to the USB power
source and measured the power consumption while running the Linux
command stress –cpu 4 and while running Octo-Tiger with four cores.
Would be nice to have hardware counters to get more sophisticated
measurements for RISC-V!
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 22 / 27
23. Energy consumption (Single-board computer)
0 0.5 1 1.5 2
1-RISC
1-Fugaku
2-RISC-TCP
2-RISC-MPI
2-Fugaku-MPI
1.19
1.28
1.53
0.92
1.46
Wh
Figure: On Supercomputer Fugaku, the power consumption was measured using
PowerAPI. Due to missing hardware counters, the power consumption was
measured using a power meter on RISC-V.
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 23 / 27
24. Energy consumption (MILK-V)
0 1,000 2,000 3,000
1
1
2
2
1,854.7
2,230.8
2,000.7
2,908.3
Wh
#
nodes
Level 11 (initial mesh)
RISC-V
A64FX
Recall that an A64FX node has 48 cores and a RISC-V node has 64 cores
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 24 / 27
26. Conclusion and Outlook
Conclusion
Porting the software stack was rather easy due to the advanced C++
compilers on RISC-V.
HPX and Octo-Tiger scaled from one up to four cores. However, more
RAM and more cores are needed for sophisticated benchmarking.
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 26 / 27
27. This work is licensed under a Creative Com-
mons “Attribution-NonCommercial-ShareAlike
3.0 Unported” license.
P. Diehl (CCT/Physics/LSU) HPX and Kokkos on RISC-V April 25, 2024 27 / 27