Asynchronous Many-Task (AMT) runtime systems exploit multi-core architectures with light-weight threads, asynchronous execution, and smart scheduling. In this paper, we present a comparison of the AMT systems Charm++ and HPX with the mainstream MPI, OpenMP, and MPI+OpenMP libraries using the Task Bench benchmarks. Charm++ is a parallel programming language based on C++ that supports stackless tasks as well as light-weight threads, executed asynchronously by an adaptive runtime system. HPX is a C++ library for concurrency and parallelism that exposes a C++ standards-conforming API. First, we analyze the commonalities, differences, and advantageous scenarios of Charm++ and HPX in detail. Further, to investigate the potential overheads introduced by the tasking systems of Charm++ and HPX, we use the existing parameterized benchmark Task Bench, in which 15 different programming systems are implemented, e.g., MPI, OpenMP, and MPI+OpenMP, and extend Task Bench by adding HPX implementations. We quantify the overheads of Charm++, HPX, and the mainstream libraries in scenarios where a single task and multiple tasks are assigned to each core, respectively. We also investigate each system's scalability and its ability to hide communication latency.
Quantifying Overheads in Charm++ and HPX using Task Bench
1. Quantifying Overheads in Charm++ and HPX using Task Bench
Presenter: Patrick Diehl
Center for Computation & Technology
Louisiana State University
August 2022
2. Outline
• Contribution and background of Task Bench
• Introduction to HPX and Charm++
• Illustration using the HPX distributed implementation as an example
• Performance results
• Conclusion and future work
3. Asynchronous Many-Task (AMT) Runtime
• Address performance challenges:
✓ Utilize the whole machine with more parallelism
✓ Handle dynamic workloads with flexible task scheduling
[1] Jeremiah J. Wilke et al., The DARMA Approach to Asynchronous Many-Task (AMT) Programming, 2016.
• Basic idea:
1. Divide an algorithm into units of work
2. Execute these tasks efficiently, taking into account the available resources and the data dependencies between the tasks (see the sketch below).
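To make the two steps concrete, here is a minimal sketch of my own, in plain C++ with std::async rather than any particular AMT runtime: a reduction is divided into independent units of work, and the final result depends on all partial results. An AMT system adds light-weight threads, work stealing, and distributed scheduling on top of this basic idea.

#include <future>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> data(1'000'000, 1);
    std::size_t const n_tasks = 8;                 // units of work
    std::size_t const chunk = data.size() / n_tasks;

    // 1. Divide the algorithm (here: a sum) into independent units of work.
    std::vector<std::future<long>> partial;
    for (std::size_t t = 0; t < n_tasks; ++t)
        partial.push_back(std::async(std::launch::async, [&data, t, chunk, n_tasks] {
            auto first = data.begin() + t * chunk;
            auto last = (t + 1 == n_tasks) ? data.end() : first + chunk;
            return std::accumulate(first, last, 0L);
        }));

    // 2. Execute the dependent task (the final reduction) once its inputs are ready.
    long total = 0;
    for (auto& f : partial)
        total += f.get();

    return total == 1'000'000 ? 0 : 1;
}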
5. HPX - A General Purpose Runtime System
• Widely portable (Platforms / Operating System)
• Unified and standard-conforming C++ API
• Boost-licensed, with an open, active, and thriving developer and user community
• Domains: Astrophysics, Coastal Modeling, Distributed Machine Learning
• Funded through various agencies
The first with a fully C++20 standard-conforming API (a minimal usage sketch follows below)
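As a flavor of that API, here is a minimal sketch of my own (not code from the talk) using hpx::async and a future continuation. Header names follow recent HPX releases and may differ slightly between versions.

#include <hpx/hpx_main.hpp>   // runs main() inside the HPX runtime
#include <hpx/future.hpp>
#include <iostream>

int square(int x) { return x * x; }

int main()
{
    // hpx::async mirrors std::async but schedules a light-weight HPX thread.
    hpx::future<int> f = hpx::async(square, 6);

    // Continuations attach follow-up work without blocking the caller.
    hpx::future<int> g = f.then([](hpx::future<int> r) { return r.get() + 1; });

    std::cout << g.get() << '\n';   // prints 37
    return 0;
}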
6. Charm++ - A parallel programming language
• Charm++ implements a migratable-objects programming model.
• The basic unit is the chare, an object that is typically implemented as a C++ class (a minimal chare sketch follows after this list).
• Functions in a chare can group logically related execution and communication tasks, supporting data encapsulation and locality.
• Charm++ supports overlapping communication and computation, dynamic load balancing, and other capabilities such as fault tolerance, shrinking or expanding the set of nodes assigned to a job in the middle of execution, and power/energy/thermal optimizations.
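For reference, this is a minimal main-chare sketch following the standard Charm++ hello-world layout (my example, not code from the talk). Charm++ additionally needs an interface (.ci) file, shown in the leading comment, which the charmc translator turns into the generated .decl.h/.def.h headers.

// hello.ci (interface file, translated by charmc):
//   mainmodule hello {
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//     };
//   };

#include "hello.decl.h"

class Main : public CBase_Main {
public:
    Main(CkArgMsg* m)
    {
        // Entry methods are invoked asynchronously through proxies; the main
        // chare's constructor acts as the program entry point.
        CkPrintf("Hello from PE %d of %d\n", CkMyPe(), CkNumPes());
        CkExit();
    }
};

#include "hello.def.h"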
7. Task Bench Library
[Figure: N benchmarks (Benchmark 1 … Benchmark N) on one side and M programming systems (System 1 … System M) on the other, connected only through the Task Bench core API]
• For N benchmarks and M programming systems, implementing every benchmark directly in every system takes O(NM) effort; implementing each benchmark and each system once against the Task Bench core API reduces this to O(N+M) effort (a sketch of a per-system driver follows below).
[1] Elliott Slaughter, Task Bench SC20 Talk.
[2] Elliott Slaughter et al., Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance, SC20.
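The sketch below illustrates why a new system implementation costs only a constant amount of extra effort: it merely walks the task graph that the core library describes, executing each point with the inputs produced by its dependencies. All names here (TaskGraph, dependencies, execute_point) are placeholders of my own, not the actual identifiers of the Task Bench core API.

#include <cstdio>
#include <utility>
#include <vector>

// Placeholder task-graph description (the real core API is much richer).
struct TaskGraph {
    long timesteps;
    long width;
};

// Placeholder dependence pattern: a 1D stencil, where each point reads its
// neighbors' outputs from the previous timestep.
std::vector<long> dependencies(TaskGraph const& g, long /*timestep*/, long point)
{
    std::vector<long> deps;
    for (long d = point - 1; d <= point + 1; ++d)
        if (d >= 0 && d < g.width)
            deps.push_back(d);
    return deps;
}

// Placeholder for the core "execute point" kernel.
void execute_point(TaskGraph const&, long /*timestep*/, long /*point*/,
                   double& output, std::vector<double> const& inputs)
{
    double sum = 0.0;
    for (double in : inputs)
        sum += in;
    output = inputs.empty() ? 1.0 : sum / inputs.size() + 1.0;
}

// The per-system driver: this loop is essentially what each system has to provide.
int main()
{
    TaskGraph g{4, 4};
    std::vector<double> prev(g.width, 0.0), curr(g.width, 0.0);
    for (long t = 0; t < g.timesteps; ++t) {
        for (long p = 0; p < g.width; ++p) {
            std::vector<double> inputs;
            for (long dep : dependencies(g, t, p))
                inputs.push_back(prev[dep]);
            execute_point(g, t, p, curr[p], inputs);
        }
        std::swap(prev, curr);   // outputs become the next timestep's inputs
    }
    std::printf("value at point 0 after %ld timesteps: %f\n", g.timesteps, prev[0]);
    return 0;
}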
9. Our contribution
§ Implemented several HPX versions of the benchmark within the Task Bench library.
§ Analyzed the commonalities, differences, and advantageous scenarios of Charm++ and HPX; this is the first work comparing Charm++ and HPX using the same benchmark.
§ Quantified the overheads of Charm++ and HPX, along with several other programming systems, for shared-memory and distributed-memory parallelism in various scenarios.
10. Stencil 1D example
[Figure: 1D stencil task graph over timesteps 0, 1, 2, …; the points col 0 – col 3 are distributed across Locality 0 and Locality 1; neighboring points on the same locality share data on-node, while points at the locality boundary require cross-node communication]
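One way to express this task graph within a single locality in HPX is with futures and hpx::dataflow: each point at timestep t becomes a task that waits for its neighbors from timestep t-1. This is a minimal sketch of my own; the paper's futurized variants and the cross-node exchange (e.g. via HPX channels) are more involved, and header names follow recent HPX releases.

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <utility>
#include <vector>

int main()
{
    std::size_t const width = 4;   // col 0 .. col 3
    std::vector<hpx::shared_future<double>> prev(width), curr(width);
    for (auto& f : prev)
        f = hpx::make_ready_future(1.0);   // timestep 0

    for (int t = 1; t <= 2; ++t)           // timesteps 1 and 2
    {
        for (std::size_t i = 0; i < width; ++i)
        {
            auto left  = prev[i > 0 ? i - 1 : i];
            auto mid   = prev[i];
            auto right = prev[i + 1 < width ? i + 1 : i];

            // The task runs as soon as all three inputs are ready.
            curr[i] = hpx::dataflow(
                [](hpx::shared_future<double> l, hpx::shared_future<double> m,
                   hpx::shared_future<double> r) {
                    return (l.get() + m.get() + r.get()) / 3.0;
                },
                left, mid, right);
        }
        std::swap(prev, curr);
    }

    hpx::wait_all(prev);
    return 0;
}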
11. Explore parallelism: HPX parallel for loop (fork_join_executor)
• Use an HPX parallel for loop (fork_join_executor) to execute each point (a minimal sketch follows below)
• Create HPX threads, one thread per core
[Figure: the same 1D stencil over timesteps 0–2 with col 0 – col 3 on Locality 0 and Locality 1, on-node sharing within a locality and cross-node communication across the boundary]
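A minimal sketch of this pattern (my example, not the Task Bench HPX code): the fork_join_executor keeps one worker thread per core alive across loop invocations, and a parallel for loop updates every point of a timestep. Namespaces and headers follow recent HPX releases, where fork_join_executor lives in hpx::execution::experimental.

#include <hpx/hpx_main.hpp>
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <cstddef>
#include <utility>
#include <vector>

int main()
{
    std::size_t const width = 48;          // one point per core, as in the experiments
    std::vector<double> curr(width, 1.0), next(width, 0.0);

    // The executor spins up its worker threads once (one per core) and reuses
    // them for every parallel loop, avoiding per-timestep thread setup.
    hpx::execution::experimental::fork_join_executor exec;

    for (int t = 0; t < 1000; ++t)         // timesteps
    {
        hpx::experimental::for_loop(hpx::execution::par.on(exec),
            std::size_t(0), width, [&](std::size_t i) {
                double left  = i > 0 ? curr[i - 1] : 0.0;
                double right = i + 1 < width ? curr[i + 1] : 0.0;
                next[i] = (left + curr[i] + right) / 3.0;   // 1D stencil update
            });
        std::swap(curr, next);
    }
    return 0;
}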
12. [Figure: per-core timelines (core 0 – core 3) over time t, with each task split into t_overhead and t_work]
• t_overhead / t_work = 0.5, upper bound = 2 (Amdahl's law)
• t_overhead / t_work = 0.1, upper bound = 10 (Amdahl's law)
• Another limit: sequential code
13. [Figure: per-core timelines (core 0 – core 3) over time t with multiple tasks per core]
• Overdecomposition (multiple tasks per core): inner cores steal work, boundary cores do the communication (see the sketch below)
• Overlap communication with computation
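A sketch of the over-decomposition idea in HPX (my example, again not the Task Bench code): spawning many more tasks than cores lets the work-stealing scheduler keep inner cores busy while boundary tasks wait on communication; the communication itself is omitted here and only the task spawning is shown. Headers follow recent HPX releases.

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <hpx/runtime.hpp>
#include <cstddef>
#include <vector>

int main()
{
    std::size_t const tasks_per_core = 16;   // over-decomposition factor used in the experiments
    std::size_t const n_tasks = hpx::get_num_worker_threads() * tasks_per_core;

    std::vector<hpx::future<void>> tasks;
    tasks.reserve(n_tasks);
    for (std::size_t i = 0; i < n_tasks; ++i)
        tasks.push_back(hpx::async([] {
            // A small unit of work; queued tasks like this one can be stolen
            // by idle cores while other cores wait on communication.
            volatile double x = 0.0;
            for (int k = 0; k < 10000; ++k)
                x += k * 1e-6;
        }));

    hpx::wait_all(tasks);
    return 0;
}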
15. Performance: a single task on each core, 1 node
• width=48
• timestep=1000
• Stencil pattern
• All systems achieve the peak TFLOP/s
16. Performance: a single task on each core, 1 node
• METG (Minimum Effective Task Granularity): the 50% effective task granularity, i.e., the smallest task granularity at which a system still achieves 50% overall efficiency (see the formula below).
• MPI: 3.9 µs
• HPX: 19.3 µs
• MPI + OpenMP: 50.9 µs
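Written as a formula (my phrasing of the slide's definition, not notation taken from the paper), the 50% METG is the smallest task granularity g at which the system still reaches half of its peak efficiency:

METG(50%) = min{ g : efficiency(g) ≥ 0.5 }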
17. Performance: multi-task on each core
• Weak scaling; 16 tasks are assigned to each core
• Dynamic work-stealing policy
18. Conclusion
• The light-weight threads and work-stealing scheduling in HPX and Charm++ incurred some cost. These overheads were negligible when the grain size was large enough; for small grain sizes, however, they limited performance.
• HPX and Charm++ took advantage of the multi-task scenario through dynamic work stealing and by overlapping communication with computation.
19. Future work
• Improvements for HPX: try different communication libraries, e.g., libfabric and LCI
• Improvements for Charm++: support for active messaging in the communication layer (such as UCX)
• Compare HPX and Charm++ with other AMTs