Quantifying Overheads in Charm++ and HPX
using Task Bench
Presenter: Patrick Diehl
Center of Computation & Technology
Louisiana State University
August 2022
Outline
• Contribution and background of Task Bench
• Introduction to HPX and Charm++
• Illustration using the distributed HPX implementation as an example
• Performance results
• Conclusion and future work
Asynchronous Many-Task (AMT) Runtime
• Address performance challenges:
✓ Utilize the whole machine with more parallelism
✓ Handle dynamic workloads with flexible task scheduling
• Basic idea:
1. Divide an algorithm into units of work (tasks)
2. Execute these tasks efficiently by considering the resources and the data dependencies between the tasks.
[1] Jeremiah J. Wilke et al., The DARMA Approach to Asynchronous Many-Task (AMT) Programming, 2016.
HPX Runtime System
HPX - A General Purpose Runtime System
• Widely portable (Platforms / Operating System)
• Unified and standard-conforming C++ API
• Released under the Boost license, with an open, active, and thriving developer and user community
• Domains: Astrophysics, Coastal Modeling, Distributed Machine Learning
• Funded through various agencies
• First runtime system with a fully C++20 standard-conforming API
Charm++ - A parallel programming language
• Charm++ implements a migratable-objects programming model.
• The basic unit is the chare, which is typically a class in C++.
• Functions in a chare can group logically-related execution and communication tasks,
supporting data encapsulation and locality.
• Charm++ supports overlapping communication and computation, dynamic load balancing, and other capabilities such as fault tolerance, shrinking or expanding the set of nodes assigned to a job in the middle of execution, and power/energy/thermal optimizations.
Task Bench Library
• For N benchmarks (Benchmark 1, …, Benchmark N) and M programming systems (System 1, …, System M):
• Implementing every benchmark directly in every system: O(NM) effort
• Implementing benchmarks and systems once against the Task Bench core API: O(N+M) effort
[1] Elliott Slaughter, Task Bench, SC20 talk.
[2] Elliott Slaughter et al., Task Bench: A parameterized benchmark for evaluating parallel runtime performance, SC20.
Task Graph
• Node: task
• Edge: dependency
[1] Elliott Slaughter, Task Bench, SC20 talk.
[2] Elliott Slaughter et al., Task Bench: A parameterized benchmark for evaluating parallel runtime performance, SC20.
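A minimal sketch of the node/edge idea, not Task Bench code: tasks are the nodes, each task lists the tasks it depends on (the edges), and a task runs only after its dependencies have finished. The task names, the input values, and the arithmetic are hypothetical. The sketch runs tasks one after another; an AMT runtime would run independent tasks concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph for (a + b) * (c + d).
# Key: task (node). Value: tasks it depends on (incoming edges).
dependencies = {
    "sum_ab": set(),
    "sum_cd": set(),
    "product": {"sum_ab", "sum_cd"},
}

inputs = {"a": 1, "b": 2, "c": 3, "d": 4}


def run_task(name, results):
    """Run one unit of work, reading the results of its predecessors."""
    if name == "sum_ab":
        return inputs["a"] + inputs["b"]
    if name == "sum_cd":
        return inputs["c"] + inputs["d"]
    if name == "product":
        return results["sum_ab"] * results["sum_cd"]
    raise ValueError(f"unknown task {name}")


def execute(dependencies):
    """Execute every task after all of its dependencies have finished."""
    results = {}
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = run_task(name, results)
        order.append(name)
    return results, order


results, order = execute(dependencies)
print(results["product"])  # 21
```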
Our contribution
• Implemented several HPX versions of the Task Bench benchmark.
• Analyzed the commonalities, differences, and advantageous scenarios of Charm++ and HPX. This is the first work comparing Charm++ and HPX using the same benchmark.
• Quantified the overheads of Charm++ and HPX, along with several other programming systems, for shared-memory and distributed-memory parallelism in various scenarios.
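Stencil 1D example
[Figure: 1D stencil task graph over Timestep 0, Timestep 1, Timestep 2, …, with columns col 0 to col 3 distributed across Locality 0 and Locality 1. Dependencies within a locality are labeled "On-node sharing"; dependencies between localities are labeled "Cross-node communication".]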
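A rough sketch of how the dependencies in the 1D stencil pattern above split into on-node sharing and cross-node communication. It assumes a 3-point stencil (each point depends on columns col-1, col, col+1 of the previous timestep) and a block distribution of columns over localities; the function names and the distribution are illustrative assumptions, not the Task Bench or HPX implementation.

```python
def owner(col, width, num_localities):
    """Locality that owns a column under an assumed block distribution."""
    cols_per_locality = -(-width // num_localities)  # ceiling division
    return col // cols_per_locality


def classify_edges(width, timesteps, num_localities):
    """Count stencil dependencies that stay on a node or cross nodes."""
    on_node = 0
    cross_node = 0
    for _ in range(1, timesteps):  # timestep 0 has no dependencies
        for col in range(width):
            for src in (col - 1, col, col + 1):
                if not 0 <= src < width:
                    continue  # no neighbor beyond the border
                same = owner(src, width, num_localities) == owner(
                    col, width, num_localities
                )
                if same:
                    on_node += 1
                else:
                    cross_node += 1
    return on_node, cross_node


# Four columns, three timesteps, two localities (columns 0-1 and 2-3).
on_node, cross_node = classify_edges(width=4, timesteps=3, num_localities=2)
print(on_node, cross_node)  # 16 4
```

Only the dependencies between the two columns next to the locality border need cross-node communication; all others can be served from memory on the same node.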
Exploiting parallelism: HPX for loop (fork_join_executor)
• Use the HPX parallel for loop (fork_join_executor) to execute each point
• Create HPX threads, one thread per core
[Figure: stencil task graph over Timestep 0, Timestep 1, Timestep 2, … with columns col 0 to col 3 across Locality 0 and Locality 1, showing on-node sharing and cross-node communication.]
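The fork-join structure can be sketched with the Python standard library. This is only an analogue of the control flow, not the HPX API: a thread pool stands in for the executor, every timestep forks one task per point and joins before the next timestep starts. The task body (a neighbor average) is hypothetical, and Python threads will not show the parallel speedup that HPX threads give.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def update_point(prev, col):
    """Hypothetical task body: average a point with its stencil neighbors."""
    neighbors = [prev[c] for c in (col - 1, col, col + 1) if 0 <= c < len(prev)]
    return sum(neighbors) / len(neighbors)


def run_stencil(initial, timesteps, num_workers=None):
    """Run each timestep as one fork-join region over all columns."""
    num_workers = num_workers or os.cpu_count() or 1  # one worker per core
    state = list(initial)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(timesteps):
            prev = state
            # Fork: one task per point. Join: list() waits for every result
            # before the next timestep is allowed to start.
            state = list(
                pool.map(lambda col: update_point(prev, col), range(len(prev)))
            )
    return state


result = run_stencil([0.0, 0.0, 4.0, 0.0], timesteps=1, num_workers=4)
print(result)
```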
[Figure: per-core timeline for core 0 to core 3 over time t, with each task split into t_overhead and t_work.]
• Amdahl's law: t_overhead / t_work = 0.5 gives an upper bound of 2
• Amdahl's law: t_overhead / t_work = 0.1 gives an upper bound of 10
• Another limit: sequential code
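The two upper bounds can be reproduced with a simplified model, stated here as an assumption: the overhead is not parallelized, so the running time on p cores is t_overhead + t_work / p, and speedup is measured against t_work alone. As p grows, the speedup approaches t_work / t_overhead.

```python
def speedup(overhead_ratio, cores):
    """Speedup over the pure work time when overhead is not parallelized.

    overhead_ratio is t_overhead / t_work. The running time on `cores`
    cores is modeled as t_overhead + t_work / cores (assumed model).
    """
    return 1.0 / (overhead_ratio + 1.0 / cores)


def speedup_upper_bound(overhead_ratio):
    """Limit of speedup() as the number of cores grows without bound."""
    return 1.0 / overhead_ratio


for ratio in (0.5, 0.1):
    bound = speedup_upper_bound(ratio)
    print(f"t_overhead / t_work = {ratio}: upper bound = {bound:g}")
```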
[Figure: per-core timeline for core 0 to core 3 over time t.]
• Overdecomposition (multiple tasks per core): inner cores steal work
• Boundary cores do communications
• Overlap communication with computation
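A toy simulation of why overdecomposition helps, under stated assumptions: communication is modeled as an extra cost on the two boundary cores, all tasks are available at time 0, and an idle core steals from the queue with the most remaining work. The queue contents and costs are made up for illustration; this is not how the HPX or Charm++ schedulers are implemented.

```python
from collections import deque


def makespan_static(queues):
    """Finish time when every core only runs its own tasks."""
    return max(sum(q) for q in queues)


def makespan_work_stealing(queues):
    """Finish time when idle cores steal from the busiest queue."""
    queues = [deque(q) for q in queues]
    clock = [0.0] * len(queues)  # time at which each core becomes idle
    while any(queues):
        core = min(range(len(queues)), key=lambda i: clock[i])  # next idle core
        if queues[core]:
            task = queues[core].popleft()  # run own work first
        else:
            victim = max(range(len(queues)), key=lambda i: sum(queues[i]))
            task = queues[victim].pop()  # steal from the back of the busiest queue
        clock[core] += task
    return max(clock)


# Cores 0 and 3 are boundary cores: their first entry (2) is communication.
# One coarse task per core: nothing small enough is left to steal.
coarse = [[2, 4], [4], [4], [2, 4]]
# Overdecomposed: the same work split into four tasks per core.
fine = [[2, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [2, 1, 1, 1, 1]]

print(makespan_static(fine), makespan_work_stealing(coarse),
      makespan_work_stealing(fine))
```

With the fine-grained queues the inner cores pick up leftover tasks from the boundary cores, so the communication cost is spread over all four cores.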
Evaluation: Buran nodes on Rostam cluster
Performance: a single task on each core, 1 node
• width=48
• timestep=1000
• Stencil pattern
• All systems achieve the peak TFLOP/s
Performance: a single task on each core, 1 node
• METG (Minimum Effective Task Granularity) at 50%: the smallest task granularity (task duration) at which a system still achieves 50% overall efficiency.
• MPI: 3.9 µs
• HPX: 19.3 µs
• MPI + OpenMP: 50.9 µs
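A small sketch of how a METG(50%) value can be read off a sweep of task granularities. The efficiency model (useful work divided by work plus a fixed per-task overhead), the sweep values, and the overhead are hypothetical assumptions for illustration; none of them are measurements from the slides.

```python
def efficiency(task_granularity_us, overhead_us):
    """Fraction of time spent on useful work (assumed model).

    Each task costs task_granularity_us of work plus a fixed overhead_us.
    """
    return task_granularity_us / (task_granularity_us + overhead_us)


def metg(granularities_us, overhead_us, target=0.5):
    """Smallest sampled granularity whose efficiency reaches the target."""
    qualifying = [
        g for g in sorted(granularities_us)
        if efficiency(g, overhead_us) >= target
    ]
    return qualifying[0] if qualifying else None


# Hypothetical sweep of task granularities (microseconds) and a
# hypothetical per-task overhead of 10 microseconds.
sweep = [1, 2, 4, 8, 16, 32, 64, 128]
print(metg(sweep, overhead_us=10))  # 16
```

Under this model a smaller per-task overhead directly lowers METG, which is why a lower value indicates a system that stays efficient with finer-grained tasks.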
Performance: multi-task on each core
• Weak scaling: 16 tasks are assigned to each core.
• Dynamic work-stealing policy
Conclusion
• The lightweight threads and work-stealing scheduling in HPX and Charm++ incurred some costs. These overheads were negligible when the grain size was large enough, but for small grain sizes they limited performance.
• HPX and Charm++ took advantage of the multi-task scenario through dynamic work stealing and overlap of communication and computation.
Future work
• Improvements for HPX: try different communication libraries, e.g. libfabric and LCI
• Improvements for Charm++: support for active messaging in the communication layer (such as UCX)
• Compare HPX and Charm++ with other AMTs.
