Asynchronous Many-Task (AMT) runtime systems exploit multi-core architectures with light-weight threads, asynchronous execution, and smart scheduling. In this paper, we present a comparison of the AMT systems Charm++ and HPX with the mainstream MPI, OpenMP, and MPI+OpenMP libraries using the Task Bench benchmarks. Charm++ is a parallel programming language based on C++ that supports stackless tasks as well as light-weight threads, executed asynchronously by an adaptive runtime system. HPX is a C++ library for concurrency and parallelism that exposes a C++ standards-conforming API. First, we analyze the commonalities, differences, and advantageous scenarios of Charm++ and HPX in detail. Further, to investigate the potential overheads introduced by the tasking systems of Charm++ and HPX, we use the existing parameterized benchmark Task Bench, in which 15 different programming systems are implemented, e.g., MPI, OpenMP, and MPI+OpenMP, and extend Task Bench by adding HPX implementations. We quantify the overheads of Charm++, HPX, and the mainstream libraries in scenarios where a single task and multiple tasks are assigned to each core, respectively. We also investigate each system's scalability and its ability to hide communication latency.
Quantifying Overheads in Charm++ and HPX using Task Bench
1. Quantifying Overheads in Charm++ and HPX using Task Bench
Presenter: Patrick Diehl
Center for Computation & Technology
Louisiana State University
August 2022
2. Outline
• Contribution and background of Task Bench
• Introduction to HPX and Charm++
• Illustration using the HPX distributed implementation as an example
• Performance results
• Conclusion and future work
3. Asynchronous Many-Task (AMT) Runtime
• Address performance challenges:
✓ Utilize the whole machine with more parallelism
✓ Handle dynamic workloads with flexible task scheduling
[1] Jeremiah J. Wilke et al., The DARMA Approach to Asynchronous Many-Task (AMT) Programming, 2016.
• Basic idea:
1. Divide an algorithm into units of work
2. Execute these tasks efficiently, taking into account the available resources and the data dependencies between the tasks (see the sketch below).
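To make the two steps concrete, here is a minimal sketch of my own, in plain C++ with std::async rather than any particular AMT runtime: a reduction is divided into independent units of work, and the final result depends on all partial results. An AMT system adds light-weight threads, work stealing, and distributed scheduling on top of this basic idea.

#include <future>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> data(1'000'000, 1);
    std::size_t const n_tasks = 8;                 // units of work
    std::size_t const chunk = data.size() / n_tasks;

    // 1. Divide the algorithm (here: a sum) into independent units of work.
    std::vector<std::future<long>> partial;
    for (std::size_t t = 0; t < n_tasks; ++t)
        partial.push_back(std::async(std::launch::async, [&data, t, chunk, n_tasks] {
            auto first = data.begin() + t * chunk;
            auto last = (t + 1 == n_tasks) ? data.end() : first + chunk;
            return std::accumulate(first, last, 0L);
        }));

    // 2. Execute the dependent task (the final reduction) once its inputs are ready.
    long total = 0;
    for (auto& f : partial)
        total += f.get();

    return total == 1'000'000 ? 0 : 1;
}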
5. HPX - A General Purpose Runtime System
• Widely portable (Platforms / Operating System)
• Unified and standard-conforming C++ API
• Boost-licensed, with an open, active, and thriving developer and user community
• Domains: Astrophysics, Coastal Modeling, Distributed Machine Learning
• Funded through various agencies
The first with a fully C++20 standard-conforming API (a minimal usage sketch follows below)
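As a flavor of that API, here is a minimal sketch of my own (not code from the talk) using hpx::async and a future continuation. Header names follow recent HPX releases and may differ slightly between versions.

#include <hpx/hpx_main.hpp>   // runs main() inside the HPX runtime
#include <hpx/future.hpp>
#include <iostream>

int square(int x) { return x * x; }

int main()
{
    // hpx::async mirrors std::async but schedules a light-weight HPX thread.
    hpx::future<int> f = hpx::async(square, 6);

    // Continuations attach follow-up work without blocking the caller.
    hpx::future<int> g = f.then([](hpx::future<int> r) { return r.get() + 1; });

    std::cout << g.get() << '\n';   // prints 37
    return 0;
}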
6. Charm++ - A parallel programming language
• Charm++ implements a migratable-objects programming model.
• The basic unit is the chare, an object that is typically implemented as a C++ class (a minimal chare sketch follows after this list).
• Functions in a chare can group logically related execution and communication tasks, supporting data encapsulation and locality.
• Charm++ supports overlapping communication and computation, dynamic load balancing, and other capabilities such as fault tolerance, shrinking or expanding the set of nodes assigned to a job in the middle of execution, and power/energy/thermal optimizations.
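For reference, this is a minimal main-chare sketch following the standard Charm++ hello-world layout (my example, not code from the talk). Charm++ additionally needs an interface (.ci) file, shown in the leading comment, which the charmc translator turns into the generated .decl.h/.def.h headers.

// hello.ci (interface file, translated by charmc):
//   mainmodule hello {
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//     };
//   };

#include "hello.decl.h"

class Main : public CBase_Main {
public:
    Main(CkArgMsg* m)
    {
        // Entry methods are invoked asynchronously through proxies; the main
        // chare's constructor acts as the program entry point.
        CkPrintf("Hello from PE %d of %d\n", CkMyPe(), CkNumPes());
        CkExit();
    }
};

#include "hello.def.h"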
7. Task Bench Library
[Figure: N benchmarks (Benchmark 1 … Benchmark N) on one side and M programming systems (System 1 … System M) on the other, connected only through the Task Bench core API]
• For N benchmarks and M programming systems, implementing every benchmark directly in every system takes O(NM) effort; implementing each benchmark and each system once against the Task Bench core API reduces this to O(N+M) effort (a sketch of a per-system driver follows below).
[1] Elliott Slaughter, Task Bench SC20 Talk.
[2] Elliott Slaughter et al., Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance, SC20.
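The sketch below illustrates why a new system implementation costs only a constant amount of extra effort: it merely walks the task graph that the core library describes, executing each point with the inputs produced by its dependencies. All names here (TaskGraph, dependencies, execute_point) are placeholders of my own, not the actual identifiers of the Task Bench core API.

#include <cstdio>
#include <utility>
#include <vector>

// Placeholder task-graph description (the real core API is much richer).
struct TaskGraph {
    long timesteps;
    long width;
};

// Placeholder dependence pattern: a 1D stencil, where each point reads its
// neighbors' outputs from the previous timestep.
std::vector<long> dependencies(TaskGraph const& g, long /*timestep*/, long point)
{
    std::vector<long> deps;
    for (long d = point - 1; d <= point + 1; ++d)
        if (d >= 0 && d < g.width)
            deps.push_back(d);
    return deps;
}

// Placeholder for the core "execute point" kernel.
void execute_point(TaskGraph const&, long /*timestep*/, long /*point*/,
                   double& output, std::vector<double> const& inputs)
{
    double sum = 0.0;
    for (double in : inputs)
        sum += in;
    output = inputs.empty() ? 1.0 : sum / inputs.size() + 1.0;
}

// The per-system driver: this loop is essentially what each system has to provide.
int main()
{
    TaskGraph g{4, 4};
    std::vector<double> prev(g.width, 0.0), curr(g.width, 0.0);
    for (long t = 0; t < g.timesteps; ++t) {
        for (long p = 0; p < g.width; ++p) {
            std::vector<double> inputs;
            for (long dep : dependencies(g, t, p))
                inputs.push_back(prev[dep]);
            execute_point(g, t, p, curr[p], inputs);
        }
        std::swap(prev, curr);   // outputs become the next timestep's inputs
    }
    std::printf("value at point 0 after %ld timesteps: %f\n", g.timesteps, prev[0]);
    return 0;
}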
9. Our contribution
§ Implemented several HPX versions of the benchmark within the Task Bench library.
§ Analyzed the commonalities, differences, and advantageous scenarios of Charm++ and HPX; this is the first work comparing Charm++ and HPX using the same benchmark.
§ Quantified the overheads of Charm++ and HPX, along with several other programming systems, for shared-memory and distributed-memory parallelism in various scenarios.
10. Stencil 1D example
[Figure: 1D stencil task graph over timesteps 0, 1, 2, …; the points col 0 – col 3 are distributed across Locality 0 and Locality 1; neighboring points on the same locality share data on-node, while points at the locality boundary require cross-node communication]
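One way to express this task graph within a single locality in HPX is with futures and hpx::dataflow: each point at timestep t becomes a task that waits for its neighbors from timestep t-1. This is a minimal sketch of my own; the paper's futurized variants and the cross-node exchange (e.g. via HPX channels) are more involved, and header names follow recent HPX releases.

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <utility>
#include <vector>

int main()
{
    std::size_t const width = 4;   // col 0 .. col 3
    std::vector<hpx::shared_future<double>> prev(width), curr(width);
    for (auto& f : prev)
        f = hpx::make_ready_future(1.0);   // timestep 0

    for (int t = 1; t <= 2; ++t)           // timesteps 1 and 2
    {
        for (std::size_t i = 0; i < width; ++i)
        {
            auto left  = prev[i > 0 ? i - 1 : i];
            auto mid   = prev[i];
            auto right = prev[i + 1 < width ? i + 1 : i];

            // The task runs as soon as all three inputs are ready.
            curr[i] = hpx::dataflow(
                [](hpx::shared_future<double> l, hpx::shared_future<double> m,
                   hpx::shared_future<double> r) {
                    return (l.get() + m.get() + r.get()) / 3.0;
                },
                left, mid, right);
        }
        std::swap(prev, curr);
    }

    hpx::wait_all(prev);
    return 0;
}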
11. Explore parallelism: HPX parallel for loop (fork_join_executor)
• Use an HPX parallel for loop (fork_join_executor) to execute each point (a minimal sketch follows below)
• Create HPX threads, one thread per core
[Figure: the same 1D stencil over timesteps 0–2 with col 0 – col 3 on Locality 0 and Locality 1, on-node sharing within a locality and cross-node communication across the boundary]
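A minimal sketch of this pattern (my example, not the Task Bench HPX code): the fork_join_executor keeps one worker thread per core alive across loop invocations, and a parallel for loop updates every point of a timestep. Namespaces and headers follow recent HPX releases, where fork_join_executor lives in hpx::execution::experimental.

#include <hpx/hpx_main.hpp>
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <cstddef>
#include <utility>
#include <vector>

int main()
{
    std::size_t const width = 48;          // one point per core, as in the experiments
    std::vector<double> curr(width, 1.0), next(width, 0.0);

    // The executor spins up its worker threads once (one per core) and reuses
    // them for every parallel loop, avoiding per-timestep thread setup.
    hpx::execution::experimental::fork_join_executor exec;

    for (int t = 0; t < 1000; ++t)         // timesteps
    {
        hpx::experimental::for_loop(hpx::execution::par.on(exec),
            std::size_t(0), width, [&](std::size_t i) {
                double left  = i > 0 ? curr[i - 1] : 0.0;
                double right = i + 1 < width ? curr[i + 1] : 0.0;
                next[i] = (left + curr[i] + right) / 3.0;   // 1D stencil update
            });
        std::swap(curr, next);
    }
    return 0;
}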
12. [Figure: per-core timelines (core 0 – core 3) over time t, with each task split into t_overhead and t_work]
• t_overhead / t_work = 0.5, upper bound = 2 (Amdahl's law)
• t_overhead / t_work = 0.1, upper bound = 10 (Amdahl's law)
• Another limit: sequential code
13. [Figure: per-core timelines (core 0 – core 3) over time t with multiple tasks per core]
• Overdecomposition (multiple tasks per core): inner cores steal work, boundary cores do the communication (see the sketch below)
• Overlap communication with computation
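A sketch of the over-decomposition idea in HPX (my example, again not the Task Bench code): spawning many more tasks than cores lets the work-stealing scheduler keep inner cores busy while boundary tasks wait on communication; the communication itself is omitted here and only the task spawning is shown. Headers follow recent HPX releases.

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <hpx/runtime.hpp>
#include <cstddef>
#include <vector>

int main()
{
    std::size_t const tasks_per_core = 16;   // over-decomposition factor used in the experiments
    std::size_t const n_tasks = hpx::get_num_worker_threads() * tasks_per_core;

    std::vector<hpx::future<void>> tasks;
    tasks.reserve(n_tasks);
    for (std::size_t i = 0; i < n_tasks; ++i)
        tasks.push_back(hpx::async([] {
            // A small unit of work; queued tasks like this one can be stolen
            // by idle cores while other cores wait on communication.
            volatile double x = 0.0;
            for (int k = 0; k < 10000; ++k)
                x += k * 1e-6;
        }));

    hpx::wait_all(tasks);
    return 0;
}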
15. Performance: a single task on each core, 1 node
• width=48
• timestep=1000
• Stencil pattern
• All systems achieve the peak TFLOP/s
16. Performance: a single task on each core, 1 node
• METG (Minimum Effective Task Granularity): the 50% effective task granularity, i.e., the smallest task granularity at which a system still achieves 50% overall efficiency (see the formula below).
• MPI: 3.9 µs
• HPX: 19.3 µs
• MPI + OpenMP: 50.9 µs
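Written as a formula (my phrasing of the slide's definition, not notation taken from the paper), the 50% METG is the smallest task granularity g at which the system still reaches half of its peak efficiency:

METG(50%) = min{ g : efficiency(g) ≥ 0.5 }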
17. Performance: multi-task on each core
• Weak scaling; 16 tasks are assigned to each core
• Dynamic work-stealing policy
18. Conclusion
• The light-weight threads and work-stealing scheduling in HPX and Charm++ incurred some cost. These overheads were negligible when the grain size was large enough; for small grain sizes, however, they limited performance.
• HPX and Charm++ took advantage of the multi-task scenario through dynamic work stealing and by overlapping communication with computation.
19. Future work
• Improvements for HPX: try different communication libraries, e.g., libfabric and LCI
• Improvements for Charm++: support for active messaging in the communication layer (such as UCX)
• Compare HPX and Charm++ with other AMTs