Many scientific high performance codes that simulate e.g. black holes, coastal waves, climate and weather, etc. rely on block-structured meshes and use finite differencing methods to iteratively solve the appropriate systems of differential equations. In this paper we investigate implementations of an extremely simple simulation of this type using various programming systems and languages. We focus on a shared memory, parallelized algorithm that simulates a 1D heat diffusion using asynchronous queues for the ghost zone exchange. We discuss the advantages of the various platforms and explore the performance of this model code on different computing architectures: Intel, AMD, and ARM64FX. As a result, Python was the slowest of the set we compared. Java, Go, Swift, and Julia were the intermediate performers. The higher performing platforms were C++, Rust, Chapel, Charm++, and HPX.
Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HPX, Go, Julia, Python, Rust, Swift, and Java
Patrick Diehl, Max Morris, Steven R. Brandt, Nikunj Gupta and
Hartmut Kaiser
Center of Computation and Technology
Department of Physics and Astronomy
Louisiana State University
patrickdiehl@lsu.edu
August 28, 2023
P. Diehl and et al. (LSU) August 28, 2023 1 / 26
2. Motivation
Rank  Language  Rating   Change
1     Python    13.33%   -2.30%
3     C++       11.41%   +0.49%
4     Java      10.33%   -2.24%
12    Go         1.16%   +0.20%
18    Swift      0.90%   -0.35%
19    Rust       0.89%   +0.32%
20    Julia      0.85%   +0.41%
Table: TIOBE Index for August 2023
Chapel is not listed in the index.
Charm++ and HPX are based on C++.
How do these languages compare?
3. Overview
1 Model problem
2 Features of the approaches
3 Productivity
4 Performance measurements
5 Conclusion and Outlook
5. Model problem I
The one-dimensional heat equation on a 1-D loop (e.g. a limp noodle) of length $L$ is, for all times $t > 0$, described by

\[ \frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2}, \quad 0 \le x < L, \; t > 0, \tag{1} \]

with $\alpha$ as the material's diffusivity. For the discretization in space, we use the $N$ grid points $x = \{ x_i = i \cdot h \in \mathbb{R} \mid i = 0, \ldots, N - 1 \}$ with the grid spacing $h$, and we use 2nd order finite differencing. For the discretization in time, we use the Euler method, i.e.

\[ u(t + \delta t, x_i) = u(t, x_i) + \delta t \cdot \alpha \, \frac{u(t, x_{i-1}) - 2 \cdot u(t, x_i) + u(t, x_{i+1})}{h^2}, \tag{2} \]

with the initial condition $u(0, x_i) = x_i$. To model a loop, we use periodic boundary conditions, i.e. $u(t, x) = u(t, L + x)$.
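
As an illustration of the discretization, the following is a minimal serial sketch in Python, not the code that was benchmarked. It assumes the standard second-order central difference (denominator h^2) and periodic wrap-around of the indices. The function names `heat_step` and `simulate`, the default parameters, and the time step choice `dt = 0.25 * h^2 / alpha` are assumptions made for this example.

```python
def heat_step(u, alpha, dt, h):
    """Advance u by one explicit Euler step with periodic boundaries."""
    n = len(u)
    c = alpha * dt / (h * h)
    # u[i - 1] wraps to the last point for i = 0 (negative index),
    # and (i + 1) % n wraps to the first point for i = n - 1.
    return [u[i] + c * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def simulate(n, nt, alpha=1.0, length=1.0, dt=None):
    """Run nt Euler steps on n grid points, starting from u(0, x_i) = x_i."""
    h = length / n
    if dt is None:
        # Assumed step size, inside the usual stability limit dt <= h^2 / (2 * alpha).
        dt = 0.25 * h * h / alpha
    u = [i * h for i in range(n)]
    for _ in range(nt):
        u = heat_step(u, alpha, dt, h)
    return u
```

On a periodic grid the stencil contributions cancel when summed over all points, so the sum of `u` should stay constant up to rounding, which gives a simple sanity check for any implementation.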
6. Model problem II
The parallel algorithm was implemented by having multiple threads of
execution each sequentially applying Eq. 2 on a local segment of the grid.
We used queues to communicate ghost zones between the segments. We
note that for this problem, the queues are single-producer, single-consumer
and, therefore, in principle, don’t need synchronization (although
synchronization to suspend/resume threads seemed to help in some cases).
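
The communication pattern can be sketched as follows in Python, using `threading.Thread` and `queue.Queue` from the standard library. This is a sketch under assumptions, not one of the benchmarked implementations: the names `worker` and `parallel_heat`, the segment layout, and the time step are chosen for this example, and the thread-safe `queue.Queue` stands in for the lighter single-producer, single-consumer queues described above. Python threads will not show the parallel speedup discussed in the talk; the sketch only shows the ghost zone exchange.

```python
import queue
import threading


def worker(seg, left_in, right_in, left_out, right_out, c, nt, result, idx):
    """Update one grid segment for nt steps, exchanging ghost values via queues."""
    for _ in range(nt):
        # Send our boundary values first, then wait for the neighbours' values.
        left_out.put(seg[0])
        right_out.put(seg[-1])
        ghost_left = left_in.get()    # last value of the left neighbour
        ghost_right = right_in.get()  # first value of the right neighbour
        padded = [ghost_left] + seg + [ghost_right]
        seg = [padded[i] + c * (padded[i - 1] - 2.0 * padded[i] + padded[i + 1])
               for i in range(1, len(padded) - 1)]
    result[idx] = seg


def parallel_heat(n, nt, nworkers, alpha=1.0, length=1.0):
    """Solve on n points with nworkers threads; requires n >= nworkers."""
    h = length / n
    dt = 0.25 * h * h / alpha  # assumed stable step size
    c = alpha * dt / (h * h)
    u = [i * h for i in range(n)]  # initial condition u(0, x_i) = x_i
    bounds = [k * n // nworkers for k in range(nworkers + 1)]
    # Each worker owns two inbound queues; each queue has exactly one producer.
    from_left = [queue.Queue() for _ in range(nworkers)]
    from_right = [queue.Queue() for _ in range(nworkers)]
    result = [None] * nworkers
    threads = []
    for k in range(nworkers):
        seg = u[bounds[k]:bounds[k + 1]]
        t = threading.Thread(target=worker, args=(
            seg, from_left[k], from_right[k],
            from_right[(k - 1) % nworkers],  # our first value is the left neighbour's right ghost
            from_left[(k + 1) % nworkers],   # our last value is the right neighbour's left ghost
            c, nt, result, k))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return [x for seg in result for x in seg]
```

Because every queue is first-in, first-out and has one producer and one consumer, a neighbour that runs one step ahead cannot reorder the ghost values, and the modulo arithmetic on the worker index implements the periodic boundary.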
7. Features of the approaches
8. Overview
Approach  Async  Coroutine  ParAlg  Win  Linux  Mac  License
C++ 17    X      X          X       X    X      X    GNU
Java      X      X          X       X    X      X    GNU
Swift     X      X          X       X    X      X    Apache
Chapel    X      X          ∼       X    X      X    Apache
Charm++   X      ∼          X       X    X      X    Own
HPX       X      X          X       X    X      X    Boost
Go        X      X          X       X    X      X    BSD
Python    X      X          X       X    X      X    BSD
Julia     X      X          X       X    X      X    MIT
Rust      X      X          X       X    X      X    MIT
Table: Overview of the programming languages: (1) the parallelism approaches
they provide, (2) the supported operating systems, and (3) the license. The
C++ 17 standard was used as a base. The symbol ∼ indicates partial support.
9. Chapel
We had to write our own queue, and the full/empty bit
synchronization mechanism was helpful.
The coforall loop, which assigns a different thread to each iteration,
provided a convenient mechanism for launching the outer loop.
Chapel also lacked a built-in way to append to a file. However,
opening a file, seeking to the end, and writing is possible.
We also add that the support we received from questions asked in the
Chapel Gitter was exceptional.
We found Chapel among the higher performing codes, comparable to Rust
or C++.
10. Go
We use go func to launch worker threads (goroutines) and buffered
channels using make() to facilitate the exchange of ghost zones. For
synchronization of the goroutines, we use sync.WaitGroup and add
threads by calling waitGroup.Add(), and synchronize the threads by
calling waitGroup.Wait().
At the time of this writing, the only HPC-related Go project available
is biogo, a bioinformatics toolkit [1].
Reference
1. Köster, J.: Rust-bio: a fast and safe bioinformatics library. Bioinformatics 32(3), 444–446 (2016)
11. Julia
Julia is clearly inspired by both Python and Fortran. It is a good choice for
Fortran programmers who want to get into scripting, as it will offer
some familiarity in using one as the default start for array indexes
(instead of zero) and its use of end to mark the end of a block.
In our Julia code, we implemented our own queue. Since Julia does
not support classes directly (though it has structs), we found it
convenient to use arrays. For parallelism, we used Julia’s
Threads.@threads macro for for loops.
Julia’s community contacted us and provided some optimized code.
However, these optimizations require confidence in Julia and knowledge
of its internals.
12. Rust
We use std::thread::scope to launch worker threads, and
non-blocking channels from std::sync::mpsc to facilitate the
exchange of ghost zones.
We avoided using unsafe, working only in the safe subset of Rust.
Only two scientific codes (molecular dynamics and bioinformatics)
use Rust.
Because of its guarantees concerning data race conditions and memory
access, as well as its high performance, Rust is a potentially good choice
for new scientific programming projects.
However, Rust has vastly different syntax and semantics than more
traditional languages like C++, Java, and Python, all of which may make
for a steep learning curve.
13. Swift
Swift claims to be safe by design and to produce lightning-fast software.
Unfortunately, we had to disable the safety features to get
performant code.
We used UnsafeMutableBufferPointer<Double> to avoid unnecessary calls of
await when accessing the elements of arrays. These buffers allow
explicit vectorization on newer x86 and Apple Silicon; see, for
example, addingProduct. However, we could not measure a
significant improvement using these functions.
For concurrency, we use await withTaskGroup (body: { group in ... })
to launch chunks of work on each thread, and
for await _ in group {} to wait for their completion.
We found that Swift is designed for application development for iOS or
macOS, but not for numerical applications.
15. Lines of code
Figure: Lines of code (LOC) for each approach (axis range 0 to 200): Python, Swift, HPX, Julia, Go, Rust, Chapel, Charm++, C++ 17, Java.
The numbers were determined with the Linux tool cloc.
16. Productivity metric
Average of the computation time
$T_{\text{average}}(\text{approach}) := \left( T_2(\text{approach}) + T_{20}(\text{approach}) + T_{40}(\text{approach}) \right) / 3$
Constructive Cost Model (COCOMO)
COCOMO does not reflect parallel features
However, the HPC community never proposed its own cost model
We map both metrics to the interval [−1, 1], with the ends labeled
Easy and Difficult for the costs
Slow and Fast for the computation time
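
The two steps above can be sketched in Python as follows. The slides do not state how the mapping to [−1, 1] is done, so the linear min-max rescaling in `to_unit_interval` is an assumption, as are both function names. `average_time` follows the definition of the average computation time given above.

```python
def average_time(t2, t20, t40):
    """Mean of the three measured computation times, as in T_average."""
    return (t2 + t20 + t40) / 3.0


def to_unit_interval(values):
    """Linearly rescale a dict of values to [-1, 1] (assumed min-max mapping)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All approaches are equal, so place them in the middle of the interval.
        return {k: 0.0 for k in values}
    return {k: 2.0 * (v - lo) / (hi - lo) - 1.0 for k, v in values.items()}
```

With this mapping the smallest value lands at −1 and the largest at 1, so for computation time −1 would correspond to Fast and 1 to Slow, and for cost −1 to Easy and 1 to Difficult.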
References
1. Barry, B., et al.: Software engineering economics. New York 197 (1981)
2. Stutzke, R.D., Crosstalk, M.: Software estimating technology: A survey. Los Alamitos, CA: IEEE Computer Society Press (1997)
19. AMD EPYC 7H12
Figure: Time [s] (logarithmic axis) versus #cores (0 to 40) for nx=1000000 and nt=1000. Approaches shown: go, python, swift, rust, chapel, cxx, hpx, julia, charm++, java.
20. Intel® Xeon® Gold 6148 Skylake
Figure: Time [s] (logarithmic axis) versus #cores (0 to 40) for nx=1000000 and nt=1000. Approaches shown: go, python, swift, rust, chapel, cxx, hpx, julia, charm++, java.
21. A64FX
Figure: Time [s] (logarithmic axis) versus #cores (0 to 40) for nx=1000000 and nt=1000. Approaches shown: go, python, rust, chapel, cxx, hpx, julia, charm++, java.
Swift is missing, since no package was available for Rocky Linux.
22. Summary of performance measurements
Table: R² correlation of the fit of the measured data points for all approaches and
architectures, computed using Python NumPy.
Arch   C++   Charm++  Chapel  Rust  Go    Julia  HPX   Swift  Python  Java
Intel  0.49  0.36     0.45    0.52  0.28  0.41   0.52  0.56   0.43    0.03
AMD    0.48  0.45     0.53    0.49  0.75  0.12   0.42  0.02   0.46    0.12
A64FX  0.49  0.52     0.08    0.40  0.52  0.42   0.73  –      0.90    0.32
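
For readers unfamiliar with the metric, the following sketch shows how an R² value can be computed. The slides do not state which model was fitted to the measurements, so the least-squares straight line used here is an assumption, as is the function name `r_squared_linear`. The values in the table were computed with NumPy; this sketch uses only the Python standard library so that it stays self-contained.

```python
def r_squared_linear(xs, ys):
    """R^2 of the least-squares line y = a + b * x through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope of the fitted line
    a = mean_y - b * mean_x  # intercept of the fitted line
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```

A value close to 1 means the fitted line explains most of the variation in the measurements, while a value close to 0 means the measurements scatter widely around the fit.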
Python was the slowest approach.
Swift and Julia are comparable.
For more than 10 threads, Go performs slightly better than Swift and Julia.
For smaller core counts of up to eight cores, the remaining approaches behave
similarly.
However, Chapel gets slower for higher core counts.
The performance of Rust, Charm++, and HPX is comparable. HPX is the fastest for
larger core counts, but has a high variance; see the R² values in the table above.
24. Conclusion and Outlook
Conclusion
We will not name a winner concerning speed.
The higher performing platforms were mostly similar in what they
achieved.
The tests in this paper depend on the hardware, the version of the
interpreters and compilers, the particular problem chosen, the amount of
effort applied, and our level of expertise (which varied by platform).
Outlook
More numerical applications for a more comprehensive comparison
Distributed runs and GPU support
I am happy to answer any of your questions.