Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran

Brad Richardson
Framework for Extensible,
Asynchronous Task
Scheduling (FEATS) in Fortran
Co-contributors: Damian Rouson,
Harris Snyder and Robert Singleterry

Agenda
• Motivations
• Example/Demo Application
• Implementation Details
• Performance Studies
• Conclusions

Motivations
https://portal.nersc.gov/project/m888/nersc10/workload/N10_Workload_Analysis.latest.pdf

Motivations
● Target an under-served user base: Fortran
○ Enable scientists and engineers to develop eﬃcient applications for HPC beyond the
“embarrassingly parallel” problems
● Explore the native parallel features of Fortran
○ Don’t force “reformulating” the problem to be able to interoperate with C/C++ or other
external libraries

if (this_image() == 1) then
write(output_unit, "(A)") &
"Enter values for a, b and c in `a*x**2 + b*x + c`:"
flush(output_unit)
read(input_unit, *) a, b, c
end if
call co_broadcast(a, 1)
call co_broadcast(b, 1)
call co_broadcast(c, 1)
solver = dag_t( &
[ vertex_t([integer::], a_t(a)) &
, vertex_t([integer::], b_t(b)) &
, vertex_t([integer::], c_t(c)) &
, vertex_t([2], b_squared_t()) &
, vertex_t([1, 3], four_ac_t()) &
, vertex_t([4, 5], square_root_t()) &
, vertex_t([2, 6], minus_b_pm_square_root_t()) &
, vertex_t([1], two_a_t()) &
, vertex_t([8, 7], division_t()) &
, vertex_t([9], printer_t()) &
])
Quadratic Solver
-b±√(b2
-4*a*c)
2*a

Enter values for a, b and c in `a*x**2 + b*x + c`:
1 1 -12
c = -12.0000000
b = 1.0000000
a = 1.0000000
2*a = 2.0000000
b**2 = 1.0000000
4*a*c = -48.0000000
sqrt(b**2 - 4*a*c) = 7.0000000 -7.0000000
-b +- sqrt(b**2 - 4*a*c) = 6.0000000 -8.0000000
(-b +- sqrt(b**2 - 4*a*c)) / (2*a) = 3.0000000 -4.0000000
The roots are 3.0000000 -4.0000000
Enter values for a, b and c in `a*x**2 + b*x + c`:
1 -1 -30
b = -1.0000000
c = -30.0000000
a = 1.0000000
b**2 = 1.0000000
4*a*c = -1.2000000E+02
2*a = 2.0000000
sqrt(b**2 - 4*a*c) = 11.0000000 -11.0000000
-b +- sqrt(b**2 - 4*a*c) = 12.0000000 -10.0000000
(-b +- sqrt(b**2 - 4*a*c)) / (2*a) = 6.0000000 -5.0000000
The roots are 6.0000000 -5.0000000
Quadratic Solver

LU Decomposition
In numerical analysis and linear algebra, lower–upper (LU) decomposition or factorization factors a matrix
as the product of a lower triangular matrix and an upper triangular matrix (see matrix decomposition). The
product sometimes includes a permutation matrix as well. LU decomposition can be viewed as the matrix form
of Gaussian elimination.
- Wikipedia
I.e. A = LU =

call read_matrix_in(matrix, matrix_size)
num_tasks = 4 + (matrix_size-1)*2 + sum([((matrix_size-step)*3, step=1,matrix_size-1)])
allocate(vertices(num_tasks))
vertices(1) = vertex_t([integer::], initial_t(matrix))
vertices(2) = vertex_t([1], print_matrix_t(0))
do step = 1, matrix_size-1
do row = step+1, matrix_size
latest_matrix = 1 + sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) ! reconstructed matrix from last step
task_base = sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) + 3*(row-(step+1))
vertices(3+task_base) = vertex_t([latest_matrix], calc_factor_t(row=row, step=step))
vertices(4+task_base) = vertex_t([latest_matrix, 3+task_base], row_multiply_t(step=step))
vertices(5+task_base) = vertex_t([latest_matrix, 4+task_base], row_subtract_t(row=row))
end do
reconstruction_step = 3 + sum([(3*(matrix_size-i), i = 1, step)]) + 2*(step-1)
vertices(reconstruction_step) = vertex_t( &
[ 1 + sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) &
, [(5 + sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) + 3*(row-(step+1)) &
, row=step+1, matrix_size)] &
], &
reconstruct_t(step=step)) ! depends on previous reconstructed matrix and just subtracted rows
! print the just reconstructed matrix
vertices(reconstruction_step+1) = vertex_t([reconstruction_step], print_matrix_t(step))
end do
vertices(num_tasks-1) = vertex_t( &
[([(3 + sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) + 3*(row-(step+1)) &
, row=step+1, matrix_size)] &
, step=1, matrix_size-1)] &
, back_substitute_t(n_rows=matrix_size)) ! depends on all "factors"
vertices(num_tasks) = vertex_t([num_tasks-1], print_matrix_t(matrix_size))
dag = dag_t(vertices)
LU Decomposition

LU Decomposition - 3x3
Step: 0
11.0516795999999999 4.1176995799999997E+02 5.7346350099999995E+02
9.7038173699999994 9.7847460899999999E+02 7.5112048300000004E+02
5.6446673599999997E+02 8.9235162400000002E+02 8.0637854000000004E+02
Step: 1
11.0516795999999999 4.1176995799999997E+02 5.7346350099999995E+02
0.0000000000000000 6.1692409220031186E+02 2.4759655872112285E+02
0.0000000000000000 -2.0138880965761025E+04 -2.8483383952264314E+04
Step: 3
1.0000000000000000 0.0000000000000000 0.0000000000000000
0.8780400555586139 1.0000000000000000 0.0000000000000000
51.0751991036728938 -32.6440176682580301 1.0000000000000000
Step: 2
11.0516795999999999 4.1176995799999997E+02 5.7346350099999995E+02
0.0000000000000000 6.1692409220031186E+02 2.4759655872112285E+02
0.0000000000000000 0.0000000000000000 -2.0400837514772094E+04

Run Implementation (#1)
• One scheduler image and multiple executor images
• Mailbox, and task assignment coarrays
• Events to signal ready for work and task completed
• Directed acyclic graph (DAG) to deﬁne task dependencies

Coarrays Needed
• type(payload_t), allocatable :: mailbox(:)[:]
• type(event_type), allocatable :: ready_for_next_task(:)[:]
• type(event_type) :: task_assigned[*]
• integer :: task_identifier[*]
• integer, allocatable :: task_assignment_history(:)[:]

Startup Procedure
● User:
○ Deﬁne DAG
■ Deﬁne tasks and their dependencies
○ call run(dag)
NOTE: All images must have the same dag to start
● Framework:
○ Allocate coarrays needed for communication
■ If we’re executing on more than one image

Data Layout
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks

Scheduler Steps
• Find an executor that has posted it is ready
• While we do this, we keep track of what tasks have been completed
• Find an uncompleted task with all dependencies completed
• “Wait” for the ready executor (balances posts/waits)
• Assign the task to the executor
• Post that the executor has been assigned a task
• Repeat

Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Find Executor

Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Find Next Task

Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Wait for ready image
event wait(...)

Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Assign task

Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Post task assigned
event post(...)

Executor Steps
• Post ready for a task
• Wait until it has been assigned a task
• Collect payloads from executors that ran dependent tasks
• We access the history kept by the scheduler to determine this
• Execute task and store result in mailbox
• Repeat

Executor Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Post ready for task
event post(...)

Executor Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Wait for task
assignment
event wait(...)

Executor Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Collect inputs

Executor Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Execute task and
store result

Performance - shared memory (nagfor)
Compilation flags= -O4 -Onoteams -Orounding -coarray
On system with: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 16 GB RAM

Performance - single node (OpenCoarrays)
Compilation flags= -O3
On system with: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 16 GB RAM

Performance - crayftn
On: Perlmutter

• Every image attempts to claim a task as soon as possible
• Mailbox, task claimed and num tasks completed coarrays
• Events to signal task completed
• Directed acyclic graph (DAG) to deﬁne task dependencies
Run Implementation (#2)

Coarrays Needed
• integer(atomic_int_kind), allocatable :: taken_on(:)[:]
• integer(atomic_int_kind), allocatable :: num_tasks_completed[:]
• type(event_type), allocatable :: task_completed(:)[:]
• type(payload_t), allocatable :: mailbox(:)[:]

Startup Procedure
● User:
○ Deﬁne DAG
■ Deﬁne tasks and their dependencies
○ call run(dag)
NOTE: All images must have the same dag to start
● Framework:
○ Allocate coarrays needed for communication

Data Layout
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks

Execution Steps
• Find task that hasn’t been claimed, and all of its dependencies have been completed
• Attempt to claim that task
• another image may have claimed it by now
• Collect payloads from images that ran dependent tasks
• “Wait” on event that it was completed
• Execute task and store result in mailbox
• Post to all images that the task has been completed and increment task completed counter
• Repeat

Find Ready Task
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks

Claim Task
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks

Collect Payloads
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks

Execute Task
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks

Notify Task Completion
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks

Performance - OpenCoarrays on Perlmutter
On: Perlmutter

Fortran’s Advantages
• Coarrays and Events a great match
• Coarray to communicate task inputs and outputs
• Events to signal task start and completion
• Teams should allow for scalable implementation
• Partition DAG and have multiple schedulers work on independent regions with separate
teams of executors
• Polymorphism
• Diﬀerent kinds of task can exist that capture diﬀerent kinds of “input” data at startup
• Fortan’s History
• Likely lots of applications that could be easily adapted

Fortran’s Disadvantages
• Can’t use polymorphic objects in coarrays
• A strategic change to the standard could enable this
• Coindexed objects with polymorphic components not allowed in variable deﬁnition
context, but could be allowed in expressions
• See https://github.com/j3-fortran/fortran_proposals/issues/251
• No introspection
• Automatic task detection, fusion or splitting not possible
• Fortran’s History
• Many existing applications have shared global state
• Presents data races in task based execution

Conclusions
• It “works”
• Scaling/performance will depend not only on algorithm, but also compiler and platform
• There are limitations
• Future Work
• Propose changes to Fortran standard to improve utility/ﬂexibility
• Investigate performance issues
• Explore alternative scheduling algorithms
• Find “beta” testers, i.e. target applications
• Questions?
• https://github.com/sourceryinstitute/feats

Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran

More Related Content

Similar to Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran

More from Patrick Diehl

Recently uploaded

Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran