1
Subtle Asynchrony
Jeff Hammond
NVIDIA HPC Group
2
Abstract
I will discuss subtle asynchrony in two contexts.
First, how do we bring asynchronous task parallelism to the Fortran
language, without relying on threads or related concepts?
Second, I will describe how asynchronous task parallelism emerges in
NWChem via overdecomposition, without programmers thinking about
tasks. This example demonstrates that many of the principles of
asynchronous many task execution can be achieved without specialized
runtime systems or programming abstractions.
3
The Concept of Asynchrony
8
“There is no cloud, just someone else’s computer.”
“There is no asynchrony, just some other processor.” (hardware concurrency; forward progress)
“There is no asynchrony, just some other thread.” (software concurrency; forward progress)
“There is no asynchrony, just some other context.” (software concurrency; scheduled)
9
Examples of asynchrony
#pragma omp parallel num_threads(2)
{
    assert( omp_get_num_threads() >= 2 );
    switch( omp_get_thread_num() )
    {
        case 0:
            MPI_Ssend(...);
            break;
        case 1:
            MPI_Recv(...);
            break;
    }
}
10
Examples of asynchrony (or not)
#pragma omp parallel num_threads(2)
#pragma omp master
{
    #pragma omp task
    {
        MPI_Ssend(...);
    }
    #pragma omp task
    {
        MPI_Recv(...);
    }
}
11
#pragma omp parallel num_threads(2)
#pragma omp master
{
    MPI_Request r;
    #pragma omp task
    {
        MPI_Issend(...,&r);
        nicewait(&r);
    }
    #pragma omp task
    {
        MPI_Irecv(...,&r);
        nicewait(&r);
    }
}
static inline
void nicewait(MPI_Request * r)
{
    int flag=0;
    while (!flag) {
        MPI_Test(r, &flag, ..);
        #pragma omp taskyield
    }
}
Totally useless
(or not)
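The difference between the blocking and the polling versions can be illustrated without MPI. The sketch below is a hypothetical single-threaded model, not MPI or OpenMP code: `channel_t`, `send_step`, `recv_step`, `run_cooperative` and `send_recv_passes` are illustrative names. It assumes that a synchronous send can complete only after the matching receive has been posted, and that returning `false` from a step plays the role of `taskyield`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a synchronous send/receive pair. */
typedef struct { bool send_posted, recv_posted; } channel_t;

/* A task is a step function: it returns true once it has finished.
   Returning false hands control back to the scheduler (like taskyield). */
typedef bool (*step_fn)(channel_t *);

bool send_step(channel_t *ch) {
    ch->send_posted = true;   /* like MPI_Issend: post, do not block  */
    return ch->recv_posted;   /* like MPI_Test: done once it is matched */
}

bool recv_step(channel_t *ch) {
    ch->recv_posted = true;   /* like MPI_Irecv */
    return ch->send_posted;
}

/* Round-robin over unfinished tasks on ONE thread. Returns the number of
   scheduler passes needed, or -1 if max_passes is not enough. */
int run_cooperative(const step_fn *tasks, int ntasks, int max_passes) {
    enum { MAX_TASKS = 8 };
    assert(ntasks >= 0 && ntasks <= MAX_TASKS);
    channel_t ch = { false, false };
    bool done[MAX_TASKS] = { false };
    for (int pass = 1; pass <= max_passes; ++pass) {
        int remaining = 0;
        for (int t = 0; t < ntasks; ++t) {
            if (!done[t]) done[t] = tasks[t](&ch);
            if (!done[t]) ++remaining;
        }
        if (remaining == 0) return pass;
    }
    return -1;
}

/* Runs the sender first, then the receiver, as in the slide. */
int send_recv_passes(int max_passes) {
    const step_fn tasks[2] = { send_step, recv_step };
    return run_cooperative(tasks, 2, max_passes);
}
```

Under these assumptions `send_recv_passes(10)` returns 2: the sender has to be revisited after the receiver has posted. Allowing only one pass, which models each task running once without being rescheduled, returns -1.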
12
Analysis
OpenMP tasks are not guaranteed to make independent forward progress and are not schedulable entities.
This allows a trivial implementation (running each task immediately on the
encountering thread) and compiler optimizations like task fusion.
Prescriptive parallelism: programmer decides, e.g. OpenMP threads.
Descriptive parallelism: implementation decides, e.g. OpenMP tasks.
https://asc.llnl.gov/sites/asc/files/2020-09/2-20_larkin.pdf
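As a rough illustration of descriptive tasking, the following C sketch expresses three independent tasks with OpenMP pragmas only. The names `scaled_sum` and `three_tasks` are made up for this example. If the compiler ignores the pragmas, or the runtime runs each task immediately, the result is unchanged, which is the sense in which a trivial implementation is permitted.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of n doubles times a scale factor; stands in for the per-task work. */
static double scaled_sum(const double *x, size_t n, double scale) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += x[i];
    return scale * s;
}

/* Three independent tasks followed by a join. Compiled without OpenMP
   support the pragmas are ignored and the tasks run in program order. */
double three_tasks(const double *a, const double *b, const double *c, size_t n) {
    assert(a && b && c);
    double ra = 0.0, rb = 0.0, rc = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(ra)
        ra = scaled_sum(a, n, 1.0);
        #pragma omp task shared(rb)
        rb = scaled_sum(b, n, 2.0);
        #pragma omp task shared(rc)
        rc = scaled_sum(c, n, 3.0);
        #pragma omp taskwait
    }
    return ra + rb + rc;
}
```

The caller cannot tell from the result whether the three calls overlapped in time; only the dependence on `taskwait` before the final sum is specified.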
13
Fortran Tasks and Asynchrony?
14
Parallelism in Fortran 2018
! coarse-grain parallelism
.., codimension[:] :: X, Y, Z
npes = num_images()
n_local = n / npes
do i=1,n_local
Z(i) = X(i) + Y(i)
end do
sync all
! fine-grain parallelism
! explicit
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! implicit
MATMUL
TRANSPOSE
RESHAPE
...
15
Three trivial tasks
module numerot
contains
  pure real function yksi(X)
    real, intent(in) :: X(100)
    yksi = norm2(X)
  end function yksi
  pure real function kaksi(X)
    real, intent(in) :: X(100)
    kaksi = 2*norm2(X)
  end function kaksi
  pure real function kolme(X)
    real, intent(in) :: X(100)
    kolme = 3*norm2(X)
  end function kolme
end module numerot
program main
  use numerot
  real, dimension(100) :: A, B, C
  real :: RA, RB, RC
  A = 1
  B = 1
  C = 1
  RA = yksi(A)
  RB = kaksi(B)
  RC = kolme(C)
  print*,RA+RB+RC
end program main
16
Tasks with DO CONCURRENT
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1
B = 1
C = 1
do concurrent (k=1:3)
  select case (k)
    case(1); RA = yksi(A)
    case(2); RB = kaksi(B)
    case(3); RC = kolme(C)
  end select
end do
print*,RA+RB+RC
DO CONCURRENT (DC) is descriptive
parallelism. “concurrent” is only a hint and
does not imply any form of concurrency in
the implementation.
Any code based on DC has
implementation-defined asynchrony.
Only PURE procedures may be referenced
inside DC, which further limits this construct
as a mechanism for realizing asynchronous
tasks.
17
Tasks with coarrays
real, dimension(100) :: A
real :: R
A = 1
if (num_images().ne.3) error stop
select case (this_image())
  case(1); R = yksi(A)
  case(2); R = kaksi(A)
  case(3); R = kolme(A)
end select
sync all
call co_sum(R)
if (this_image().eq.1) print*,R
Coarray images are properly concurrent,
usually equivalent to an MPI/OS process.
The number of images is non-increasing
(constant minus failures).
Data is private to images unless explicitly
communicated. Cooperative algorithms
require an MPI-like approach.
18
An explicit Fortran tasking model
real, dimension(100) :: A, B, C
real :: RA, RB, RC
A = 1; B = 1; C = 1
task(1)
RA = yksi(A)
end task
task(2)
RB = kaksi(B)
end task
task(3)
RC = kolme(C)
end task
task_wait([1,2,3])
print*,RA+RB+RC
Like OpenMP, tasks are descriptive and not
required to be asynchronous, to permit
trivial implementations.
Tasks can share data but only in a limited
way, because Fortran lacks a (shared)
memory consistency model.
Is this sufficient for interesting use cases?
19
Motivation for Tasks
[Figure: fork-join timeline on a 4-core CPU (cores 0-3): sequential, fork, parallel, join, sequential]
! sequential
call my_input(X,Y)
! parallel
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential
call my_output(Z)
20
Motivation for Tasks
[Figure: fork-join timeline on a 4-core CPU (cores 0-3): sequential, fork, parallel, join, sequential]
! sequential
call my_input(X,Y)
! parallel
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential
call my_unrelated(A)
21
Motivation for Tasks
[Figure: fork-join timeline on CPU+GPU (core 0 and GPU): sequential, fork, parallel, join, sequential]
! sequential on CPU
call my_input(X,Y)
! parallel on GPU
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential on CPU
call my_unrelated(A)
22
Motivation for Tasks
[Figure: fork-join timeline on CPU+GPU (core 0 and GPU): sequential, fork, parallel, join, sequential]
! sequential on CPU
call my_input(X,Y)
! parallel on GPU w/ async
do concurrent (i=1:n)
Z(i) = X(i) + Y(i)
end do
! sequential on CPU w/ async
call my_unrelated(A)
Savings
23
Motivation for Tasks (synthetic)
call sub1(IN=A,OUT=B)
call sub2(IN=C,OUT=D)
call sub3(IN=E,OUT=F)
call sub4(IN1=B,IN2=D,OUT=G)
call sub5(IN1=F,IN2=G,OUT=H)
! 5 steps require only 3 phases
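[Figure: dependency graph. Phase 1: steps 1, 2, 3 (A to B, C to D, E to F). Phase 2: step 4 (B and D to G). Phase 3: step 5 (F and G to H).]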
Fortran compilers may be able to prove
these procedures are independent, but it
is often impossible to prove that executing
them in parallel is profitable.
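The "5 steps require only 3 phases" claim follows from the data dependencies alone. A minimal sketch, with the dependency table transcribed from the five calls above and all identifiers (`deps`, `count_phases`) chosen for illustration, assigns each step the earliest phase after all of its producers:

```c
#include <assert.h>

enum { NSTEPS = 5, MAX_DEPS = 2 };

/* deps[s] lists the steps whose output step s consumes; -1 means "none".
   Steps 0..4 stand for sub1..sub5 and are listed in a valid order. */
static const int deps[NSTEPS][MAX_DEPS] = {
    { -1, -1 },  /* sub1: A -> B     */
    { -1, -1 },  /* sub2: C -> D     */
    { -1, -1 },  /* sub3: E -> F     */
    {  0,  1 },  /* sub4: B, D -> G  */
    {  2,  3 },  /* sub5: F, G -> H  */
};

/* Fills phase[s] with the earliest 1-based phase in which step s can run
   and returns the total number of phases. */
int count_phases(int phase[NSTEPS]) {
    assert(phase);
    int nphases = 0;
    for (int s = 0; s < NSTEPS; ++s) {
        phase[s] = 1;
        for (int d = 0; d < MAX_DEPS; ++d) {
            int p = deps[s][d];
            if (p >= 0 && phase[p] + 1 > phase[s]) phase[s] = phase[p] + 1;
        }
        if (phase[s] > nphases) nphases = phase[s];
    }
    return nphases;
}
```

This only shows that the overlap is legal. Whether it is profitable depends on the cost of each procedure, which the dependency table does not capture.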
24
Motivation for Tasks (realistic)
https://dl.acm.org/doi/10.1145/2425676.2425687
https://pubs.acs.org/doi/abs/10.1021/ct100584w
25
Describing asynchronous communication
subroutine stuff(A,B,C)
real :: A, B, C
call co_sum(A)
call co_min(B)
call co_max(C)
end subroutine stuff
subroutine stuff(A,B,C)
real :: A, B, C
task co_sum(A)
task co_min(B)
task co_max(C)
task_wait
end subroutine stuff
subroutine stuff(A,B,C)
use mpi_f08
real :: A, B, C
type(MPI_Request) :: R(3)
call MPI_Iallreduce(..A..SUM..R(1))
call MPI_Iallreduce(..B..MIN..R(2))
call MPI_Iallreduce(..C..MAX..R(3))
call MPI_Waitall(3,R,..)
end subroutine stuff
26
Describing asynchronous computation
do i = 1,b
C(i) = MATMUL(A(i),B(i))
end do
do i = 1,b
task
C(i) = MATMUL(A(i),B(i))
end task
end do
task_wait
cudaStreamCreate(s)
cublasCreate(h)
cublasSetStream(h,s)
do i = 1,b
cublasDgemm_v2(h,
cu_op_n,cu_op_n,
n,n,n,
one,A(i),n,B(i),n,
one,C(i),n)
end do
cudaDeviceSynchronize()
27
Describing asynchronous computation
do i = 1,b
C(i) = MATMUL(A(i),B(i))
end do
do i = 1,b
j = mod(i,8)
task j
C(i) = MATMUL(A(i),B(i))
end task
end do
task_wait
do j=1,8
cudaStreamCreate(s(j))
cublasCreate(h(j))
cublasSetStream(h(j), s(j))
end do
do i = 1,b
j = mod(i-1,8) + 1
cublasDgemm_v2(h(j),
cu_op_n,cu_op_n,
n,n,n,
one,A(i),n,B(i),n,
one,C(i),n)
end do
cudaDeviceSynchronize()
https://github.com/nwchemgit/nwchem/blob/master/src/ccsd/ccsd_trpdrv_openacc.F
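The round-robin assignment of work items to a fixed pool of streams can be sketched in a few lines of C. The helper name `stream_for` is hypothetical. It assumes 1-based work indices and 1-based stream indices, matching the `do j=1,8` creation loop.

```c
#include <assert.h>

/* Maps a 1-based work index i to a 1-based stream index in 1..nstreams. */
int stream_for(int i, int nstreams) {
    assert(i >= 1 && nstreams >= 1);
    return (i - 1) % nstreams + 1;
}
```

With 8 streams, items 1 through 8 land on streams 1 through 8 and item 9 wraps around to stream 1, so consecutive items can overlap while items on the same stream stay ordered.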
28
J3/WG5 papers targeting Fortran 2026
https://j3-fortran.org/doc/year/22/22-169.pdf Fortran asynchronous tasks
https://j3-fortran.org/doc/year/23/23-174.pdf Asynchronous Tasks in Fortran
There is consensus that this is a good feature to add to Fortran, but we have a long way
to go to define syntax and semantics. We will not just copy C++, nor specify threads.
29
Overdecomposition and
Implicit Task Parallelism
30
Classic MPI Domain Decomposition
31
Overdecomposition
32
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
! Static Parallelization
MySet = decompose[ (1:N)^4 ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
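The slide does not define `decompose`. One simple choice, assumed here, is a contiguous block split of the flattened task index (the N^4 index tuples numbered 0, 1, 2, ...). The names `range_t` and `block_range` are illustrative.

```c
#include <assert.h>

/* Half-open range [begin, end) of flattened task indices owned by a rank. */
typedef struct { long begin, end; } range_t;

range_t block_range(long ntasks, int nranks, int rank) {
    assert(ntasks >= 0 && nranks > 0 && rank >= 0 && rank < nranks);
    long base = ntasks / nranks;   /* every rank gets at least this many   */
    long extra = ntasks % nranks;  /* the first `extra` ranks get one more */
    range_t r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}
```

A split like this balances the number of tasks per rank, not their cost, which is why it degrades once some index tuples are skipped or some integrals are much more expensive than others.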
33
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End Forall
! Static Parallelization
IJKL = (1:N)^4
MySet = decompose[ NonZero(IJKL) ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
34
Quantum Chemistry Algorithms
! SCF Fock build
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
V = (IJ|KL) ! Variable cost
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End Forall
! Static Parallelization
IJKL = (1:N)^4
MySet = decompose[ Cost(IJKL) ]
Forall I,J,K,L = MySet
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End Forall
35
Quantum Chemistry Algorithms
! Dynamic Parallelization
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
End If
End If
End Forall
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
Task(I,J,K,L)
End If
End If
End Forall
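The slides do not say how `MyTurn()` is implemented. One common approach, assumed here, is a shared counter that hands out tickets: every worker walks the same loop, and a task belongs to the worker whose ticket equals the running task index. The sketch uses C11 atomics inside one process. In a distributed code the counter would be updated with a remote atomic fetch-and-add. All identifiers are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shared ticket counter. */
typedef struct { atomic_long next; } counter_t;

/* Per-worker state: the ticket currently held, if any. */
typedef struct { long ticket; bool has_ticket; } worker_t;

/* Returns true if the task with running index `task` belongs to worker w. */
bool my_turn(counter_t *c, worker_t *w, long task) {
    if (!w->has_ticket) {
        w->ticket = atomic_fetch_add(&c->next, 1);
        w->has_ticket = true;
    }
    if (task != w->ticket) return false;
    w->has_ticket = false;  /* consumed; fetch a new ticket on the next call */
    return true;
}

/* Lockstep simulation on one thread: every worker visits every task index.
   Returns how many tasks were claimed by exactly one worker. */
long run_lockstep(long ntasks, int nworkers) {
    enum { MAX_WORKERS = 16 };
    assert(ntasks >= 0 && nworkers > 0 && nworkers <= MAX_WORKERS);
    counter_t c;
    atomic_init(&c.next, 0);
    worker_t w[MAX_WORKERS] = { { 0, false } };
    long ran_once = 0;
    for (long task = 0; task < ntasks; ++task) {
        int hits = 0;
        for (int p = 0; p < nworkers; ++p)
            if (my_turn(&c, &w[p], task)) ++hits;  /* task body would run here */
        if (hits == 1) ++ran_once;
    }
    return ran_once;
}
```

Because a worker asks for a new ticket only after finishing its previous task, slow tasks automatically hold their worker back while the others keep claiming work. No task list or task object is ever materialized.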
36
Quantum Chemistry Algorithms
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
Forall I,J,K,L = (1:N)^4
If NonZero(I,J,K,L)
If MyTurn()
Task(I,J,K,L)
End If
End If
End Forall
Task(I,J,K,L):
V = (IJ|KL)
F(I,J) = D(K,L) * V
F(I,K) = D(J,L) * V
F(I,L) = D(J,K) * V
F(J,K) = D(I,L) * V
F(J,L) = D(I,K) * V
F(K,L) = D(I,J) * V
FancySystem(NonZeroSet,Task)
37
Summary
NWChem, GAMESS and other QC codes distribute irregular computations
by decoupling work decomposition from processing elements.
The body of a distributed loop is a task.
Efficient when num_tasks >> num_proc and dynamic scheduling is cheap.
Overdecomposition + Dynamic Scheduling = AMT w/o the system
https://www.mcs.anl.gov/papers/P3056-1112_1.pdf
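The effect of overdecomposition with dynamic scheduling can be estimated with a small model. The sketch below is a simplified simulation under stated assumptions: task costs are known integers, scheduling overhead is zero, and the function names are invented for this example. It compares the finish time of an equal-count block split with that of handing each task to whichever processor becomes free first.

```c
#include <assert.h>

/* Finish time when tasks are split into contiguous equal-count blocks. */
long static_makespan(const long *cost, int ntasks, int nproc) {
    assert(cost && ntasks >= 0 && nproc > 0);
    long worst = 0;
    for (int p = 0; p < nproc; ++p) {
        long begin = (long)ntasks * p / nproc;
        long end = (long)ntasks * (p + 1) / nproc;
        long load = 0;
        for (long t = begin; t < end; ++t) load += cost[t];
        if (load > worst) worst = load;
    }
    return worst;
}

/* Finish time when each task goes to the processor that is free first. */
long dynamic_makespan(const long *cost, int ntasks, int nproc) {
    enum { MAX_PROC = 64 };
    assert(cost && ntasks >= 0 && nproc > 0 && nproc <= MAX_PROC);
    long busy_until[MAX_PROC] = { 0 };
    for (int t = 0; t < ntasks; ++t) {
        int first = 0;
        for (int p = 1; p < nproc; ++p)
            if (busy_until[p] < busy_until[first]) first = p;
        busy_until[first] += cost[t];
    }
    long worst = 0;
    for (int p = 0; p < nproc; ++p)
        if (busy_until[p] > worst) worst = busy_until[p];
    return worst;
}
```

For the costs {8, 7, 6, 5, 1, 1, 1, 1} on two processors, the block split finishes at time 26 while the dynamic assignment finishes at 15, which is the ideal value for a total cost of 30. The model ignores the cost of the scheduling step itself, so it only applies when that step is cheap relative to a task.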
38
Summary
• Task parallelism, which may be asynchronous, is under consideration for
Fortran standardization.
• Learn from prior art in OpenMP, OpenACC, Ada, etc.
• Descriptive, not prescriptive, behavior, like DO CONCURRENT.
• Successful distributed memory quantum chemistry codes are implicitly using
AMT concepts, but without explicit tasks or a tasking system.
• Irregular workloads or inhomogeneous system performance are nicely solved by AMT
systems, but not all apps are capable of adopting AMT systems.
• Can we find ways to subtly bring AMT concepts into more “old fashioned” apps?
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 

Subtle Asynchrony by Jeff Hammond

  • 2. Abstract
      I will discuss subtle asynchrony in two contexts. First, how do we bring asynchronous task parallelism to the Fortran language, without relying on threads or related concepts? Second, I will describe how asynchronous task parallelism emerges in NWChem via overdecomposition, without programmers thinking about tasks. This example demonstrates that many of the principles of asynchronous many task execution can be achieved without specialized runtime systems or programming abstractions.
  • 3. The Concept of Asynchrony
  • 4. “There is no cloud, just someone else’s computer.”
      “There is no asynchrony, just some other processor.”
  • 5. “There is no cloud, just someone else’s computer.”
      “There is no asynchrony, just some other processor.”
      “There is no asynchrony, just some other thread.”
  • 6. “There is no cloud, just someone else’s computer.”
      “There is no asynchrony, just some other processor.”
      “There is no asynchrony, just some other thread.”
      “There is no asynchrony, just some other context.”
  • 7. “There is no cloud, just someone else’s computer.”
      “There is no asynchrony, just some other processor.” (Hardware concurrency)
      “There is no asynchrony, just some other thread.” (Software concurrency)
      “There is no asynchrony, just some other context.” (Software concurrency)
  • 8. “There is no cloud, just someone else’s computer.”
      “There is no asynchrony, just some other processor.” (Hardware concurrency; forward progress)
      “There is no asynchrony, just some other thread.” (Software concurrency; forward progress)
      “There is no asynchrony, just some other context.” (Software concurrency; scheduled)
  • 9. Examples of asynchrony
      #pragma omp parallel num_threads(2)
      {
          assert( omp_get_num_threads() >= 2 );
          switch( omp_get_thread_num() )
          {
              case 0:
                  MPI_Ssend(...);
                  break;
              case 1:
                  MPI_Recv(...);
                  break;
          }
      }
  • 10. Examples of asynchrony (or not)
      #pragma omp parallel num_threads(2)
      #pragma omp master
      {
          #pragma omp task
          {
              MPI_Ssend(...);
          }
          #pragma omp task
          {
              MPI_Recv(...);
          }
      }
  • 11. Totally useless (or not)
      /* Poll the request, yielding to other tasks between tests. */
      static inline void nicewait(MPI_Request * r)
      {
          int flag=0;
          while (!flag) {
              MPI_Test(r, &flag, ..);
              #pragma omp taskyield
          }
      }
      #pragma omp parallel num_threads(2)
      #pragma omp master
      {
          MPI_Request r;
          #pragma omp task
          {
              MPI_Issend(...,&r);
              nicewait(&r);
          }
          #pragma omp task
          {
              MPI_Irecv(...,&r);
              nicewait(&r);
          }
      }
  • 12. Analysis
      OpenMP tasks are not guaranteed to make forward progress and are not schedulable. This allows a trivial implementation and compiler optimizations like task fusion.
      Prescriptive parallelism: programmer decides, e.g. OpenMP threads.
      Descriptive parallelism: implementation decides, e.g. OpenMP tasks.
      https://asc.llnl.gov/sites/asc/files/2020-09/2-20_larkin.pdf
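As a hedged illustration of the descriptive model, the following C sketch (not taken from the slides; `sum_to` and `three_tasks` are hypothetical names) expresses three independent OpenMP tasks. The result is the same whether the implementation defers the tasks, runs them immediately on the encountering thread, or ignores the pragmas because OpenMP is disabled.

```c
#include <assert.h>

/* Hypothetical unit of work: sums the integers 1..n. */
static long sum_to(long n)
{
    assert(n >= 0);
    long s = 0;
    for (long i = 1; i <= n; ++i)
        s += i;
    return s;
}

/* Describes three independent tasks and combines their results.
 * The pragmas only describe the available parallelism. The runtime
 * decides whether anything actually runs concurrently, so the code
 * must be correct under purely sequential execution as well. */
long three_tasks(void)
{
    long ra = 0, rb = 0, rc = 0;

    #pragma omp parallel num_threads(2)
    #pragma omp master
    {
        #pragma omp task shared(ra)
        ra = sum_to(100);

        #pragma omp task shared(rb)
        rb = sum_to(200);

        #pragma omp task shared(rc)
        rc = sum_to(300);

        /* Wait for the three child tasks before leaving the block. */
        #pragma omp taskwait
    }
    return ra + rb + rc;
}
```

Calling `three_tasks()` gives 5050 + 20100 + 45150 = 70300 with or without `-fopenmp`, which is the sense in which the tasks are a description rather than a prescription.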
  • 13. Fortran Tasks and Asynchrony?
  • 14. Parallelism in Fortran 2018
      ! coarse-grain parallelism
      .., codimension[:] :: X, Y, Z
      npes = num_images()
      n_local = n / npes
      do i=1,n_local
          Z(i) = X(i) + Y(i)
      end do
      sync all
      ! fine-grain parallelism
      ! explicit
      do concurrent (i=1:n)
          Z(i) = X(i) + Y(i)
      end do
      ! implicit
      MATMUL TRANSPOSE RESHAPE ...
  • 15. Three trivial tasks
      module numerot
      contains
          pure real function yksi(X)
              real, intent(in) :: X(100)
              yksi = norm2(X)
          end function yksi
          pure real function kaksi(X)
              real, intent(in) :: X(100)
              kaksi = 2*norm2(X)
          end function kaksi
          pure real function kolme(X)
              real, intent(in) :: X(100)
              kolme = 3*norm2(X)
          end function kolme
      end module numerot
      program main
          use numerot
          real, dimension(100) :: A, B, C
          real :: RA, RB, RC
          A = 1
          B = 1
          C = 1
          RA = yksi(A)
          RB = kaksi(B)
          RC = kolme(C)
          print*,RA+RB+RC
      end program main
  • 16. Tasks with DO CONCURRENT
      real, dimension(100) :: A, B, C
      real :: RA, RB, RC
      A = 1
      B = 1
      C = 1
      do concurrent (k=1:3)
          select case (k)
              case(1)
                  RA = yksi(A)
              case(2)
                  RB = kaksi(B)
              case(3)
                  RC = kolme(C)
          end select
      end do
      print*,RA+RB+RC
      DO CONCURRENT (DC) is descriptive parallelism. “concurrent” is only a hint and does not imply any form of concurrency in the implementation. Any code based on DC has implementation-defined asynchrony.
      Only PURE tasks are allowed in DC, which further limits this construct as a mechanism for realizing asynchronous tasks.
  • 17. Tasks with coarrays
      real, dimension(100) :: A
      real :: R
      A = 1
      if (num_images().ne.3) error stop
      select case (this_image())
          case(1)
              R = yksi(A)
          case(2)
              R = kaksi(A)
          case(3)
              R = kolme(A)
      end select
      sync all
      call co_sum(R)
      if (this_image().eq.1) print*,R
      Coarray images are properly concurrent, usually equivalent to an MPI/OS process. The number of images is non-increasing (constant minus failures).
      Data is private to images unless explicitly communicated. Cooperative algorithms require an MPI-like approach.
  • 18. An explicit Fortran tasking model
      real, dimension(100) :: A, B, C
      real :: RA, RB, RC
      A = 1; B = 1; C = 1
      task(1)
          RA = yksi(A)
      end task
      task(2)
          RB = kaksi(B)
      end task
      task(3)
          RC = kolme(C)
      end task
      task_wait([1,2,3])
      print*,RA+RB+RC
      Like OpenMP, tasks are descriptive and not required to be asynchronous, to permit trivial implementations.
      Tasks can share data but only in a limited way, because Fortran lacks a (shared) memory consistency model.
      Is this sufficient for interesting use cases?
  • 19. Motivation for Tasks
      [Figure: fork-join timeline on a 4-core CPU (cores 0 to 3): Sequential, Fork, Parallel, Join, Sequential]
      ! sequential
      call my_input(X,Y)
      ! parallel
      do concurrent (i=1:n)
          Z(i) = X(i) + Y(i)
      end do
      ! sequential
      call my_output(Z)
  • 20. Motivation for Tasks
      [Figure: fork-join timeline on a 4-core CPU (cores 0 to 3): Sequential, Fork, Parallel, Join, Sequential]
      ! sequential
      call my_input(X,Y)
      ! parallel
      do concurrent (i=1:n)
          Z(i) = X(i) + Y(i)
      end do
      ! sequential
      call my_unrelated(A)
  • 21. Motivation for Tasks
      [Figure: fork-join timeline on CPU+GPU (core 0 and GPU): Sequential, Fork, Parallel, Join, Sequential]
      ! sequential on CPU
      call my_input(X,Y)
      ! parallel on GPU
      do concurrent (i=1:n)
          Z(i) = X(i) + Y(i)
      end do
      ! sequential on CPU
      call my_unrelated(A)
  • 22. Motivation for Tasks
      [Figure: fork-join timeline on CPU+GPU (core 0 and GPU): Sequential, Fork, Parallel, Join, Sequential, with the Savings from asynchronous execution marked]
      ! sequential on CPU
      call my_input(X,Y)
      ! parallel on GPU w/ async
      do concurrent (i=1:n)
          Z(i) = X(i) + Y(i)
      end do
      ! sequential on CPU w/ async
      call my_unrelated(A)
  • 23. Motivation for Tasks (synthetic)
      call sub1(IN=A,OUT=B)
      call sub2(IN=C,OUT=D)
      call sub3(IN=E,OUT=F)
      call sub4(IN1=B,IN2=D,OUT=G)
      call sub5(IN1=F,IN2=G,OUT=H)
      ! 5 steps require only 3 phases
      [Figure: dependency graph. Steps 1, 2 and 3 take A, C and E and produce B, D and F; step 4 produces G; step 5 consumes G.]
      Fortran compilers may be able to prove these procedures are independent, but it is often impossible to prove that executing them in parallel is profitable.
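The "5 steps require only 3 phases" count can be checked mechanically. The C sketch below is an illustration, not part of the talk: the `dep` table is transcribed from the IN/OUT arguments of the five calls, and `assign_phases` is a hypothetical helper that places each step one phase after its latest producer.

```c
#include <assert.h>

#define NSTEPS 5

/* dep[i][j] != 0 means step i+1 reads an output of step j+1.
 * Transcribed from the IN/OUT arguments of sub1..sub5. */
static const int dep[NSTEPS][NSTEPS] = {
    {0, 0, 0, 0, 0},  /* sub1: IN=A, OUT=B                     */
    {0, 0, 0, 0, 0},  /* sub2: IN=C, OUT=D                     */
    {0, 0, 0, 0, 0},  /* sub3: IN=E, OUT=F                     */
    {1, 1, 0, 0, 0},  /* sub4: IN1=B (sub1), IN2=D (sub2)      */
    {0, 0, 1, 1, 0},  /* sub5: IN1=F (sub3), IN2=G (sub4)      */
};

/* Stores in phase[i] the earliest phase in which step i+1 can run and
 * returns the total number of phases. Assumes the steps are listed in
 * a valid sequential order, so every dependency points to an earlier step. */
int assign_phases(int phase[NSTEPS])
{
    assert(phase != 0);
    int nphases = 0;
    for (int i = 0; i < NSTEPS; ++i) {
        phase[i] = 1;                      /* no inputs from other steps */
        for (int j = 0; j < i; ++j)
            if (dep[i][j] && phase[j] + 1 > phase[i])
                phase[i] = phase[j] + 1;   /* run after the latest producer */
        if (phase[i] > nphases)
            nphases = phase[i];
    }
    return nphases;
}
```

For this table the steps land in phases 1, 1, 1, 2 and 3. Knowing that three phases suffice says nothing about whether running sub1, sub2 and sub3 concurrently is profitable, which is the part a compiler usually cannot prove.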
  • 24. Motivation for Tasks (realistic)
      https://dl.acm.org/doi/10.1145/2425676.2425687
      https://pubs.acs.org/doi/abs/10.1021/ct100584w
  • 25. Describing asynchronous communication
      subroutine stuff(A,B,C)
          real :: A, B, C
          call co_sum(A)
          call co_min(B)
          call co_max(C)
      end subroutine stuff
      subroutine stuff(A,B,C)
          real :: A, B, C
          task co_sum(A)
          task co_min(B)
          task co_max(C)
          task_wait
      end subroutine stuff
      subroutine stuff(A,B,C)
          use mpi_f08
          real :: A, B, C
          type(MPI_Request) :: R(3)
          call MPI_Iallreduce(..A..SUM..R(1))
          call MPI_Iallreduce(..B..MIN..R(2))
          call MPI_Iallreduce(..C..MAX..R(3))
          call MPI_Waitall(3,R,..)
      end subroutine stuff
  • 26. Describing asynchronous computation
      do i = 1,b
          C(i) = MATMUL(A(i),B(i))
      end do
      do i = 1,b
          task
              C(i) = MATMUL(A(i),B(i))
          end task
      end do
      task_wait
      cudaStreamCreate(s)
      cublasCreate(h)
      cublasSetStream(h,s)
      do i = 1,b
          cublasDgemm_v2(h, cu_op_n,cu_op_n, n,n,n, one,A(i),n,B(i),n, one,C(i),n)
      end do
      cudaDeviceSynchronize()
  • 27. Describing asynchronous computation
      do i = 1,b
          C(i) = MATMUL(A(i),B(i))
      end do
      do i = 1,b
          j = mod(i,8) + 1   ! task queue index in 1..8
          task j
              C(i) = MATMUL(A(i),B(i))
          end task
      end do
      task_wait
      do j=1,8
          cudaStreamCreate(s(j))
          cublasCreate(h(j))
          cublasSetStream(h(j), s(j))
      end do
      do i = 1,b
          j = mod(i,8) + 1   ! stream index in 1..8, matching the creation loop
          cublasDgemm_v2(h(j), cu_op_n,cu_op_n, n,n,n, one,A(i),n,B(i),n, one,C(i),n)
      end do
      cudaDeviceSynchronize()
      https://github.com/nwchemgit/nwchem/blob/master/src/ccsd/ccsd_trpdrv_openacc.F
  • 28. J3/WG5 papers targeting Fortran 2026
      https://j3-fortran.org/doc/year/22/22-169.pdf (Fortran asynchronous tasks)
      https://j3-fortran.org/doc/year/23/23-174.pdf (Asynchronous Tasks in Fortran)
      There is consensus that this is a good feature to add to Fortran, but we have a long way to go to define syntax and semantics. We will not just copy C++, nor specify threads.
  • 30. Classic MPI Domain Decomposition
  • 32. Quantum Chemistry Algorithms
      ! SCF Fock build
      Forall I,J,K,L = (1:N)^4
          V = (IJ|KL)
          F(I,J) = D(K,L) * V
          F(I,K) = D(J,L) * V
          F(I,L) = D(J,K) * V
          F(J,K) = D(I,L) * V
          F(J,L) = D(I,K) * V
          F(K,L) = D(I,J) * V
      End Forall
      ! Static Parallelization
      MySet = decompose[ (1:N)^4 ]
      Forall I,J,K,L = MySet
          V = (IJ|KL)
          F(I,J) = D(K,L) * V
          F(I,K) = D(J,L) * V
          F(I,L) = D(J,K) * V
          F(J,K) = D(I,L) * V
          F(J,L) = D(I,K) * V
          F(K,L) = D(I,J) * V
      End Forall
  • 33. Quantum Chemistry Algorithms
      ! SCF Fock build
      Forall I,J,K,L = (1:N)^4
          If NonZero(I,J,K,L)
              V = (IJ|KL)
              F(I,J) = D(K,L) * V
              F(I,K) = D(J,L) * V
              F(I,L) = D(J,K) * V
              F(J,K) = D(I,L) * V
              F(J,L) = D(I,K) * V
              F(K,L) = D(I,J) * V
          End If
      End Forall
      ! Static Parallelization
      IJKL = (1:N)^4
      MySet = decompose[ NonZero(IJKL) ]
      Forall I,J,K,L = MySet
          V = (IJ|KL)
          F(I,J) = D(K,L) * V
          F(I,K) = D(J,L) * V
          F(I,L) = D(J,K) * V
          F(J,K) = D(I,L) * V
          F(J,L) = D(I,K) * V
          F(K,L) = D(I,J) * V
      End Forall
  • 34. Quantum Chemistry Algorithms
      ! SCF Fock build
      Forall I,J,K,L = (1:N)^4
          If NonZero(I,J,K,L)
              V = (IJ|KL) ! Variable cost
              F(I,J) = D(K,L) * V
              F(I,K) = D(J,L) * V
              F(I,L) = D(J,K) * V
              F(J,K) = D(I,L) * V
              F(J,L) = D(I,K) * V
              F(K,L) = D(I,J) * V
          End If
      End Forall
      ! Static Parallelization
      IJKL = (1:N)^4
      MySet = decompose[ Cost(IJKL) ]
      Forall I,J,K,L = MySet
          V = (IJ|KL)
          F(I,J) = D(K,L) * V
          F(I,K) = D(J,L) * V
          F(I,L) = D(J,K) * V
          F(J,K) = D(I,L) * V
          F(J,L) = D(I,K) * V
          F(K,L) = D(I,J) * V
      End Forall
  • 35. Quantum Chemistry Algorithms
      ! Dynamic Parallelization
      Forall I,J,K,L = (1:N)^4
          If NonZero(I,J,K,L)
              If MyTurn()
                  V = (IJ|KL)
                  F(I,J) = D(K,L) * V
                  F(I,K) = D(J,L) * V
                  F(I,L) = D(J,K) * V
                  F(J,K) = D(I,L) * V
                  F(J,L) = D(I,K) * V
                  F(K,L) = D(I,J) * V
              End If
          End If
      End Forall
      Task(I,J,K,L):
          V = (IJ|KL)
          F(I,J) = D(K,L) * V
          F(I,K) = D(J,L) * V
          F(I,L) = D(J,K) * V
          F(J,K) = D(I,L) * V
          F(J,L) = D(I,K) * V
          F(K,L) = D(I,J) * V
      Forall I,J,K,L = (1:N)^4
          If NonZero(I,J,K,L)
              If MyTurn()
                  Task(I,J,K,L)
              End If
          End If
      End Forall
  • 36. Quantum Chemistry Algorithms
      Task(I,J,K,L):
          V = (IJ|KL)
          F(I,J) = D(K,L) * V
          F(I,K) = D(J,L) * V
          F(I,L) = D(J,K) * V
          F(J,K) = D(I,L) * V
          F(J,L) = D(I,K) * V
          F(K,L) = D(I,J) * V
      Forall I,J,K,L = (1:N)^4
          If NonZero(I,J,K,L)
              If MyTurn()
                  Task(I,J,K,L)
              End If
          End If
      End Forall
      Task(I,J,K,L):
          V = (IJ|KL)
          F(I,J) = D(K,L) * V
          F(I,K) = D(J,L) * V
          F(I,L) = D(J,K) * V
          F(J,K) = D(I,L) * V
          F(J,L) = D(I,K) * V
          F(K,L) = D(I,J) * V
      FancySystem(NonZeroSet,Task)
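The slides leave `MyTurn()` unspecified. One plausible realization, sketched here in C with OpenMP threads standing in for processes, is a shared counter: every worker walks the whole loop and executes iteration `t` only when `t` matches the ticket it last drew from the counter. Everything in the sketch is an assumption made for illustration (`claim_next`, `run_task`, `worker`, `run_all`, `NTASKS`), and the task body merely records that it ran.

```c
#include <assert.h>

#define NTASKS 1000

static int next_task;           /* shared ticket counter             */
static int times_run[NTASKS];   /* how often each task body executed */

/* Atomically draws the next ticket from the shared counter. */
static int claim_next(void)
{
    int t;
    #pragma omp atomic capture
    t = next_task++;
    return t;
}

/* Hypothetical task body; a real code would compute integrals here. */
static void run_task(int t)
{
    assert(t >= 0 && t < NTASKS);
    #pragma omp atomic update
    times_run[t]++;
}

/* Every worker iterates over all tasks but executes only the ones
 * whose index equals its current ticket (the "MyTurn()" test). */
static void worker(void)
{
    int mine = claim_next();
    for (int t = 0; t < NTASKS; ++t) {
        if (t == mine) {
            run_task(t);
            mine = claim_next();   /* tickets only grow, so mine > t */
        }
    }
}

/* Runs all tasks and returns how many executed exactly once. */
int run_all(void)
{
    next_task = 0;
    for (int t = 0; t < NTASKS; ++t)
        times_run[t] = 0;

    #pragma omp parallel
    worker();

    int ok = 0;
    for (int t = 0; t < NTASKS; ++t)
        if (times_run[t] == 1)
            ++ok;
    return ok;
}
```

`run_all()` returns `NTASKS` for any number of threads, including the single-threaded case where the pragmas are ignored. No worker is told in advance which tasks it owns, so slow tasks simply delay the next ticket draw of whoever holds them.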
  • 37. Summary
      NWChem, GAMESS and other QC codes distribute irregular computations by decoupling work decomposition from processing elements.
      The body of a distributed loop is a task.
      Efficient when num_tasks >> num_proc and dynamic scheduling is cheap.
      Overdecomposition + Dynamic Scheduling = AMT w/o the system
      https://www.mcs.anl.gov/papers/P3056-1112_1.pdf
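To make the efficiency claim concrete, here is a small C model, offered as a sketch under stated assumptions rather than a measurement: task costs are known numbers, dynamic scheduling has zero overhead, and tasks are handed out in index order to whichever worker becomes idle first. The function names `static_makespan` and `dynamic_makespan` are hypothetical.

```c
#include <assert.h>

/* Finish time when worker p statically owns the contiguous block of
 * tasks [p*n/nproc, (p+1)*n/nproc). */
long static_makespan(const long *cost, int n, int nproc)
{
    assert(n > 0 && nproc > 0);
    long worst = 0;
    for (int p = 0; p < nproc; ++p) {
        int lo = (int)((long)p * n / nproc);
        int hi = (int)((long)(p + 1) * n / nproc);
        long load = 0;
        for (int i = lo; i < hi; ++i)
            load += cost[i];
        if (load > worst)
            worst = load;
    }
    return worst;
}

/* Finish time when each task goes to the worker that becomes idle
 * first (ties go to the lowest worker index). Scheduling overhead is
 * assumed to be zero, which is the "dynamic scheduling is cheap" case. */
long dynamic_makespan(const long *cost, int n, int nproc)
{
    assert(n > 0 && nproc > 0 && nproc <= 64);
    long busy_until[64] = {0};
    for (int i = 0; i < n; ++i) {
        int idle = 0;
        for (int p = 1; p < nproc; ++p)
            if (busy_until[p] < busy_until[idle])
                idle = p;
        busy_until[idle] += cost[i];
    }
    long worst = 0;
    for (int p = 0; p < nproc; ++p)
        if (busy_until[p] > worst)
            worst = busy_until[p];
    return worst;
}
```

With 16 tasks of cost 1, 2, ..., 16 on 4 workers, the static block decomposition finishes at time 58 (the last block costs 13+14+15+16), while the dynamic model finishes at time 40, against an ideal of 136/4 = 34. The gap to the ideal shrinks as the number of tasks per worker grows, which is the num_tasks >> num_proc condition.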
  • 38. Summary
      • Task parallelism, which may be asynchronous, is under consideration for Fortran standardization.
      • Learn from prior art in OpenMP, OpenACC, Ada, etc.
      • Descriptive, not prescriptive, behavior, like DO CONCURRENT.
      • Successful distributed memory quantum chemistry codes are implicitly using AMT concepts, but without explicit tasks or a tasking system.
      • Irregular workloads or inhomogeneous system performance are nicely solved by AMT systems, but not all apps are capable of adopting AMT systems.
      • Can we find ways to subtly bring AMT concepts into more “old fashioned” apps?