Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HPX, Go, Julia, Python, Rust, Swift, and Java

Patrick Diehl, Max Morris, Steven R. Brandt, Nikunj Gupta and
Hartmut Kaiser
Center for Computation and Technology
Department of Physics and Astronomy
Louisiana State University
patrickdiehl@lsu.edu
August 28, 2023

Motivation
Rank  Language  Rating   Change
1     Python    13.33%   -2.30%
3     C++       11.41%   +0.49%
4     Java      10.33%   -2.24%
12    Go         1.16%   +0.20%
18    Swift      0.90%   -0.35%
19    Rust       0.89%   +0.32%
20    Julia      0.85%   +0.41%
Table: TIOBE Index for August 2023
Chapel is not listed in the index.
Charm++ and HPX are based on C++.
How do these languages compare?

Overview
1 Model problem
2 Features of the approaches
3 Productivity
4 Performance measurements
5 Conclusion and Outlook

Model problem

Model problem I
The one-dimensional heat equation on a 1-D loop (e.g. a limp noodle) of
length L is described, for all times t > 0, by
\[
\frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2},
\quad 0 \le x < L,\ t > 0, \tag{1}
\]
with α as the material’s diffusivity. For the discretization in space, we use
the N grid points x = {x_i = i · h ∈ R | i = 0, . . . , N − 1} with the grid
spacing h, and second-order finite differencing. For the discretization in
time, we use the Euler method, i.e.
\[
u(t + \delta t, x_i) = u(t, x_i) + \delta t \cdot \alpha \,
\frac{u(t, x_{i-1}) - 2 \cdot u(t, x_i) + u(t, x_{i+1})}{h^2}, \tag{2}
\]
with the initial condition u(0, x_i) = x_i. To model a loop, we use periodic
boundary conditions, i.e. u(t, x) = u(t, L + x).
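
The update rule can be illustrated with a minimal serial sketch in Python with NumPy. It is not the code that was benchmarked. The function names and the default parameter values are hypothetical, and the stencil is divided by h², the standard second-order central difference for the second derivative.

```python
import numpy as np

def heat_step(u, alpha, dt, h):
    """One explicit Euler step of the 1D heat equation on a periodic grid."""
    # np.roll wraps around the array ends, which implements the
    # periodic boundary condition u(t, x) = u(t, L + x).
    left = np.roll(u, 1)     # u(t, x_{i-1})
    right = np.roll(u, -1)   # u(t, x_{i+1})
    return u + dt * alpha * (left - 2.0 * u + right) / h**2

def solve(nx=100, nt=10, alpha=0.4, dt=1.0, h=1.0):
    """Run nt Euler steps starting from u(0, x_i) = x_i (hypothetical defaults)."""
    x = h * np.arange(nx)
    u = x.copy()             # initial condition u(0, x_i) = x_i
    for _ in range(nt):
        u = heat_step(u, alpha, dt, h)
    return u
```

The defaults keep dt · α / h² below 0.5, the usual stability limit of the explicit Euler scheme for this stencil. On a periodic grid the stencil terms cancel in the sum, so the total heat sum(u) should stay constant up to rounding, which makes a cheap sanity check.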

Model problem II
The parallel algorithm was implemented by having multiple threads of
execution each sequentially applying Eq. 2 on a local segment of the grid.
We used queues to communicate ghost zones between the segments. We
note that for this problem, the queues are single-producer, single-consumer
and, therefore, in principle, don’t need synchronization (although
synchronization to suspend/resume threads seemed to help in some cases).
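
The structure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not one of the benchmarked implementations: the function name parallel_heat and its parameters are hypothetical, and Python's queue.Queue is a synchronized queue, so it does not show the lock-free single-producer, single-consumer variant mentioned above.

```python
import queue
import threading
import numpy as np

def parallel_heat(u0, nt, nthreads, alpha=0.4, dt=1.0, h=1.0):
    """Advance u0 by nt Euler steps using one thread per grid segment."""
    segments = np.array_split(np.asarray(u0, dtype=float), nthreads)
    # from_left[i] receives the ghost value sent to worker i by its left
    # neighbor, from_right[i] the one sent by its right neighbor. Each queue
    # has exactly one producer and one consumer.
    from_left = [queue.Queue() for _ in range(nthreads)]
    from_right = [queue.Queue() for _ in range(nthreads)]
    results = [None] * nthreads

    def worker(i):
        u = segments[i].copy()
        left = (i - 1) % nthreads     # periodic neighbor indices
        right = (i + 1) % nthreads
        for _ in range(nt):
            # Send boundary cells: my first cell is the left neighbor's
            # right ghost, my last cell is the right neighbor's left ghost.
            from_right[left].put(u[0])
            from_left[right].put(u[-1])
            ghost_l = from_left[i].get()    # blocks until the value arrives
            ghost_r = from_right[i].get()
            padded = np.concatenate(([ghost_l], u, [ghost_r]))
            u = u + dt * alpha * (padded[:-2] - 2.0 * u + padded[2:]) / h**2
        results[i] = u

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return np.concatenate(results)
```

Because the queues are unbounded, put never blocks, and each worker can only start step k + 1 after it has received both ghost values for step k, so the FIFO order keeps the time steps aligned without a global barrier. The sketch assumes at least one grid point per thread.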

Features of the approaches

Overview
Approach  Async  Coroutine  ParAlg  Win  Linux  Mac  License
C++ 17    X      X          X       X    X      X    GNU
Java      X      X          X       X    X      X    GNU
Swift     X      X          X       X    X      X    Apache
Chapel    X      X          ∼       X    X      X    Apache
Charm++   X      ∼          X       X    X      X    Own
HPX       X      X          X       X    X      X    Boost
Go        X      X          X       X    X      X    BSD
Python    X      X          X       X    X      X    BSD
Julia     X      X          X       X    X      X    MIT
Rust      X      X          X       X    X      X    MIT
Table: Overview of the programming languages: (1) the parallelism approaches
they provide, (2) the supported operating systems, and (3) the license. The
C++ 17 standard was used as a base. The symbol ∼ indicates partial support.

Chapel
We had to write our own queue; the full/empty bit synchronization
mechanism was helpful.
The coforall loop, which assigns a different thread to each iteration,
provided a convenient mechanism for launching the outer loop.
Chapel also lacked a built-in way to append to a file. However,
opening a file, seeking to the end, and writing is possible.
We also add that the support we received from questions asked in the
Chapel Gitter was exceptional.
We found Chapel among the higher performing codes, comparable to Rust
or C++.

Go
We use go func to launch worker threads (goroutines) and buffered
channels using make() to facilitate the exchange of ghost zones. For
synchronization of the goroutines, we use sync.WaitGroup and add
threads by calling waitGroup.Add(), and synchronize the threads by
calling waitGroup.Wait().
At the time of this writing, biogo, an HPC bioinformatics toolkit [1], is
the only scientific Go code we are aware of.
Reference
1. Köster, J.: Rust-bio: a fast and safe bioinformatics library. Bioinformatics 32(3), 444–446 (2016)

Julia
Both Python and Fortran clearly inspire Julia. It is a good choice for
Fortran programmers who want to get into scripting, as it will offer
some familiarity in using one as the default start for array indexes
(instead of zero) and its use of end to mark the end of a block.
In our Julia code, we implemented our own queue. Since Julia does
not support classes directly (though it has structs), we found it
convenient to use arrays. For parallelism, we used Julia’s
Threads.@threads for-loop macro.
Julia’s community contacted us and provided some optimized code.
However, you need to be proficient in Julia and know its internals to
apply these optimizations.

Rust
We use std::thread::scope to launch worker threads, and
non-blocking channels from std::sync::mpsc to facilitate the
exchange of ghost zones.
We avoided using unsafe, working only in the safe subset of Rust.
Only two scientific codes (molecular dynamics and bioinformatics) are
using Rust.
Because of its guarantees concerning data race conditions and memory
access, as well as its high performance, Rust is a potentially good choice
for new scientific programming projects.
However, Rust has vastly different syntax and semantics than more
traditional languages like C++, Java, and Python, all of which may make
for a steep learning curve.

Swift
Swift claims to be safe by design and to produce lightning-fast software.
Unfortunately, we had to disable the safety features to get performant
code. We used UnsafeMutableBufferPointer<Double> to avoid unnecessary
calls of await when accessing the elements of arrays. These buffers allow
explicit vectorization on newer x86 and Apple Silicon; see, for
example, addingProduct. However, we could not measure a
significant improvement using these functions.
For concurrency, we use await withTaskGroup(of:) { group in ... }
to launch chunks of work on each thread and
for await _ in group {} to wait for them.
We found that Swift is designed for application development for iOS or
macOS, but not for numerical applications.

Productivity

Lines of code
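Figure: Lines of code (LOC) per approach as a bar chart, on an axis from 0 to 200. Approaches in the order listed on the chart: Python, Swift, HPX, Julia, Go, Rust, Chapel, Charm++, C++ 17, Java.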
The numbers were determined with the Linux tool cloc.

Productivity metric
Average of the computation time
T_average(approach) := (T_2(approach) + T_20(approach) + T_40(approach)) / 3
Constructive Cost Model (COCOMO)
COCOMO does not reflect parallel features
However, the HPC community never proposed its own cost model
We map both metrics to the interval [−1, 1], with the ends labeled
Easy and Difficult for the costs
Slow and Fast for the computation time
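
As a worked illustration of the two steps, the following Python sketch computes the average time and rescales a set of values to [−1, 1]. The slides do not state the mapping that was used, so the linear min-max scaling here is an assumption, as are the function names. The subscripts 2, 20, and 40 are read as the core counts of the runs, which is also an assumption.

```python
def average_time(t2, t20, t40):
    """Average of three measured run times (assumed: runs on 2, 20, and 40 cores)."""
    return (t2 + t20 + t40) / 3.0

def to_unit_interval(values):
    """Linearly rescale a dict of numbers so the minimum maps to -1 and the maximum to +1.

    Assumed mapping: the slides only say that both metrics are mapped to [-1, 1].
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All values identical: place everything in the middle of the interval.
        return {name: 0.0 for name in values}
    return {name: 2.0 * (v - lo) / (hi - lo) - 1.0 for name, v in values.items()}
```

For example, to_unit_interval({"a": 1.0, "b": 2.0, "c": 3.0}) places a at −1, b at 0, and c at +1. Which end of each axis corresponds to Fast or Easy is a plotting choice that this sketch leaves open.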
References
1. Barry, B., et al.: Software engineering economics. New York 197 (1981)
2. Stutzke, R.D., Crosstalk, M.: Software estimating technology: A survey. Los Alamitos, CA: IEEE Computer Society Press (1997)

Productivity
Figure: 2D classification using the computational time and the COCOMO model. Axis labels: Easy to Difficult (COCOMO cost) and Slow to Fast (computation time). Approaches plotted: Python, Go, Julia, Rust, Chapel, C++ 17, HPX, Charm++, Swift, Java.

Performance measurements

AMD EPYC 7H12
Figure: Execution time [s] (logarithmic axis) versus number of cores (0 to 40) for nx=1000000 and nt=1000. Approaches shown: go, python, swift, rust, chapel, cxx, hpx, julia, charm++, java.

Intel® Xeon® Gold 6148 Skylake
Figure: Execution time [s] (logarithmic axis) versus number of cores (0 to 40) for nx=1000000 and nt=1000. Approaches shown: go, python, swift, rust, chapel, cxx, hpx, julia, charm++, java.

A64FX
Figure: Execution time [s] (logarithmic axis) versus number of cores (0 to 40) for nx=1000000 and nt=1000. Approaches shown: go, python, rust, chapel, cxx, hpx, julia, charm++, java.
Swift is missing, since no package was available for Rocky Linux.

Summary of performance measurements
Table: R² correlation of the fit of the measured data points for all approaches
and architectures, computed using Python NumPy.
Arch   C++   Charm++  Chapel  Rust  Go    Julia  HPX   Swift  Python  Java
Intel  0.49  0.36     0.45    0.52  0.28  0.41   0.52  0.56   0.43    0.03
AMD    0.48  0.45     0.53    0.49  0.75  0.12   0.42  0.02   0.46    0.12
A64FX  0.49  0.52     0.08    0.40  0.52  0.42   0.73  –      0.90    0.32
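
For readers who want to reproduce this kind of number, here is a minimal sketch of an R² computation with NumPy. The slides do not say which model was fitted, so the polynomial least-squares fit via np.polyfit and the function name r_squared are assumptions.

```python
import numpy as np

def r_squared(x, y, degree=1):
    """Coefficient of determination of a polynomial least-squares fit of y over x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial coefficients
    fitted = np.polyval(coeffs, x)           # model prediction at the data points
    ss_res = np.sum((y - fitted) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A value close to 1 means the fitted curve explains most of the variation in the measurements, and a value close to 0 means the points scatter widely around it.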
Python was the slowest approach.
Swift and Julia are comparable.
For more than 10 threads, Go performs slightly better than Swift and Julia.
For smaller core counts, up to eight cores, the remaining approaches perform
similarly.
However, Chapel gets slower at higher core counts.
The performance of Rust, Charm++, and HPX is comparable. HPX is the fastest
at larger core counts, but has a high variance; see R² in Table 3.

Conclusion and Outlook

Conclusion and Outlook
Conclusion
We will not name a winner concerning speed.
The higher performing platforms were mostly similar in what they
achieved.
The tests in this paper depend on the hardware, the version of the
interpreters and compilers, the particular problem chosen, the amount of
effort applied, and our level of expertise (which varied by platform).
Outlook
More numerical applications for a more comprehensive comparison
Distributed runs and GPU support
I am happy to answer any of your questions.

Special issue
