Brad Richardson
Framework for Extensible,
Asynchronous Task
Scheduling (FEATS) in Fortran
Co-contributors: Damian Rouson,
Harris Snyder and Robert Singleterry
Agenda
• Motivations
• Example/Demo Application
• Implementation Details
• Performance Studies
• Conclusions
Motivations
https://portal.nersc.gov/project/m888/nersc10/workload/N10_Workload_Analysis.latest.pdf
Motivations
● Target an under-served user base: Fortran
○ Enable scientists and engineers to develop efficient applications for HPC beyond the
“embarrassingly parallel” problems
● Explore the native parallel features of Fortran
○ Don’t force “reformulating” the problem to be able to interoperate with C/C++ or other
external libraries
Example Applications
if (this_image() == 1) then
write(output_unit, "(A)") &
"Enter values for a, b and c in `a*x**2 + b*x + c`:"
flush(output_unit)
read(input_unit, *) a, b, c
end if
call co_broadcast(a, 1)
call co_broadcast(b, 1)
call co_broadcast(c, 1)
solver = dag_t( &
[ vertex_t([integer::], a_t(a)) &
, vertex_t([integer::], b_t(b)) &
, vertex_t([integer::], c_t(c)) &
, vertex_t([2], b_squared_t()) &
, vertex_t([1, 3], four_ac_t()) &
, vertex_t([4, 5], square_root_t()) &
, vertex_t([2, 6], minus_b_pm_square_root_t()) &
, vertex_t([1], two_a_t()) &
, vertex_t([8, 7], division_t()) &
, vertex_t([9], printer_t()) &
])
Quadratic Solver
-b±√(b2
-4*a*c)
2*a
Enter values for a, b and c in `a*x**2 + b*x + c`:
1 1 -12
c = -12.0000000
b = 1.0000000
a = 1.0000000
2*a = 2.0000000
b**2 = 1.0000000
4*a*c = -48.0000000
sqrt(b**2 - 4*a*c) = 7.0000000 -7.0000000
-b +- sqrt(b**2 - 4*a*c) = 6.0000000 -8.0000000
(-b +- sqrt(b**2 - 4*a*c)) / (2*a) = 3.0000000 -4.0000000
The roots are 3.0000000 -4.0000000
Enter values for a, b and c in `a*x**2 + b*x + c`:
1 -1 -30
b = -1.0000000
c = -30.0000000
a = 1.0000000
b**2 = 1.0000000
4*a*c = -1.2000000E+02
2*a = 2.0000000
sqrt(b**2 - 4*a*c) = 11.0000000 -11.0000000
-b +- sqrt(b**2 - 4*a*c) = 12.0000000 -10.0000000
(-b +- sqrt(b**2 - 4*a*c)) / (2*a) = 6.0000000 -5.0000000
The roots are 6.0000000 -5.0000000
Quadratic Solver
LU Decomposition
In numerical analysis and linear algebra, lower–upper (LU) decomposition or factorization factors a matrix
as the product of a lower triangular matrix and an upper triangular matrix (see matrix decomposition). The
product sometimes includes a permutation matrix as well. LU decomposition can be viewed as the matrix form
of Gaussian elimination.
- Wikipedia
I.e. A = LU =
call read_matrix_in(matrix, matrix_size)
num_tasks = 4 + (matrix_size-1)*2 + sum([((matrix_size-step)*3, step=1,matrix_size-1)])
allocate(vertices(num_tasks))
vertices(1) = vertex_t([integer::], initial_t(matrix))
vertices(2) = vertex_t([1], print_matrix_t(0))
do step = 1, matrix_size-1
do row = step+1, matrix_size
latest_matrix = 1 + sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) ! reconstructed matrix from last step
task_base = sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) + 3*(row-(step+1))
vertices(3+task_base) = vertex_t([latest_matrix], calc_factor_t(row=row, step=step))
vertices(4+task_base) = vertex_t([latest_matrix, 3+task_base], row_multiply_t(step=step))
vertices(5+task_base) = vertex_t([latest_matrix, 4+task_base], row_subtract_t(row=row))
end do
reconstruction_step = 3 + sum([(3*(matrix_size-i), i = 1, step)]) + 2*(step-1)
vertices(reconstruction_step) = vertex_t( &
[ 1 + sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) &
, [(5 + sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) + 3*(row-(step+1)) &
, row=step+1, matrix_size)] &
], &
reconstruct_t(step=step)) ! depends on previous reconstructed matrix and just subtracted rows
! print the just reconstructed matrix
vertices(reconstruction_step+1) = vertex_t([reconstruction_step], print_matrix_t(step))
end do
vertices(num_tasks-1) = vertex_t( &
[([(3 + sum([(3*(matrix_size-i), i = 1, step-1)]) + 2*(step-1) + 3*(row-(step+1)) &
, row=step+1, matrix_size)] &
, step=1, matrix_size-1)] &
, back_substitute_t(n_rows=matrix_size)) ! depends on all "factors"
vertices(num_tasks) = vertex_t([num_tasks-1], print_matrix_t(matrix_size))
dag = dag_t(vertices)
LU Decomposition
LU Decomposition - 1x1
LU Decomposition - 2x2
LU Decomposition - 3x3
LU Decomposition - 4x4
LU Decomposition - 5x5
LU Decomposition - 3x3
LU Decomposition - 3x3
Step: 0
11.0516795999999999 4.1176995799999997E+02 5.7346350099999995E+02
9.7038173699999994 9.7847460899999999E+02 7.5112048300000004E+02
5.6446673599999997E+02 8.9235162400000002E+02 8.0637854000000004E+02
Step: 1
11.0516795999999999 4.1176995799999997E+02 5.7346350099999995E+02
0.0000000000000000 6.1692409220031186E+02 2.4759655872112285E+02
0.0000000000000000 -2.0138880965761025E+04 -2.8483383952264314E+04
Step: 3
1.0000000000000000 0.0000000000000000 0.0000000000000000
0.8780400555586139 1.0000000000000000 0.0000000000000000
51.0751991036728938 -32.6440176682580301 1.0000000000000000
Step: 2
11.0516795999999999 4.1176995799999997E+02 5.7346350099999995E+02
0.0000000000000000 6.1692409220031186E+02 2.4759655872112285E+02
0.0000000000000000 0.0000000000000000 -2.0400837514772094E+04
Implementation
Derived Types
Run Implementation (#1)
• One scheduler image and multiple executor images
• Mailbox, and task assignment coarrays
• Events to signal ready for work and task completed
• Directed acyclic graph (DAG) to define task dependencies
Coarrays Needed
• type(payload_t), allocatable :: mailbox(:)[:]
• type(event_type), allocatable :: ready_for_next_task(:)[:]
• type(event_type) :: task_assigned[*]
• integer :: task_identifier[*]
• integer, allocatable :: task_assignment_history(:)[:]
Startup Procedure
● User:
○ Define DAG
■ Define tasks and their dependencies
○ call run(dag)
NOTE: All images must have the same dag to start
● Framework:
○ Allocate coarrays needed for communication
■ If we’re executing on more than one image
Data Layout
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Scheduler Steps
• Find an executor that has posted it is ready
• While we do this, we keep track of what tasks have been completed
• Find an uncompleted task with all dependencies completed
• “Wait” for the ready executor (balances posts/waits)
• Assign the task to the executor
• Post that the executor has been assigned a task
• Repeat
Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Find Executor
Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Find Next Task
Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Wait for ready image
event wait(...)
Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Assign task
Scheduler Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Post task assigned
event post(...)
Executor Steps
• Post ready for a task
• Wait until it has been assigned a task
• Collect payloads from executors that ran dependent tasks
• We access the history kept by the scheduler to determine this
• Execute task and store result in mailbox
• Repeat
Executor Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Post ready for task
event post(...)
Executor Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Wait for task
assignment
event wait(...)
Executor Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Collect inputs
Executor Steps
n_tasks
mailbox
ready_for_task
task_assigned
task_id
task_assign_hist
scheduler
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
n_tasks
executor
DAG
n_images
n_tasks
Execute task and
store result
Performance
Performance - shared memory (nagfor)
Compilation flags= -O4 -Onoteams -Orounding -coarray
On system with: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 16 GB RAM
Performance - shared memory (nagfor)
Compilation flags= -O4 -Onoteams -Orounding -coarray
On system with: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 16 GB RAM
Performance - single node (OpenCoarrays)
Compilation flags= -O3
On system with: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 16 GB RAM
Performance - crayftn
Compilation flags= -O3
On: Perlmutter
Performance - crayftn
Compilation flags= -O3
On: Perlmutter
• Every image attempts to claim a task as soon as possible
• Mailbox, task claimed and num tasks completed coarrays
• Events to signal task completed
• Directed acyclic graph (DAG) to define task dependencies
Run Implementation (#2)
Coarrays Needed
• integer(atomic_int_kind), allocatable :: taken_on(:)[:]
• integer(atomic_int_kind), allocatable :: num_tasks_completed[:]
• type(event_type), allocatable :: task_completed(:)[:]
• type(payload_t), allocatable :: mailbox(:)[:]
Startup Procedure
● User:
○ Define DAG
■ Define tasks and their dependencies
○ call run(dag)
NOTE: All images must have the same dag to start
● Framework:
○ Allocate coarrays needed for communication
Data Layout
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
Execution Steps
• Find task that hasn’t been claimed, and all of its dependencies have been completed
• Attempt to claim that task
• another image may have claimed it by now
• Collect payloads from images that ran dependent tasks
• “Wait” on event that it was completed
• Execute task and store result in mailbox
• Post to all images that the task has been completed and increment task completed counter
• Repeat
Find Ready Task
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
Claim Task
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
Collect Payloads
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
Execute Task
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
Notify Task Completion
mailbox
task_completed
num_tasks_completed
taken_on
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
n_tasks
image
DAG
n_tasks
n_tasks
Performance
Performance - shared memory (nagfor)
Compilation flags= -O4 -Onoteams -Orounding -coarray
On system with: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 16 GB RAM
Performance - shared memory (nagfor)
Compilation flags= -O4 -Onoteams -Orounding -coarray
On system with: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 16 GB RAM
Performance - single node (OpenCoarrays)
Compilation flags= -O3
On system with: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 16 GB RAM
Performance - crayftn
Compilation flags= -O3
On: Perlmutter
Performance - crayftn
Compilation flags= -O3
On: Perlmutter
Performance - crayftn
Compilation flags= -O3
On: Perlmutter
Performance - OpenCoarrays on Perlmutter
Compilation flags= -O3
On: Perlmutter
Conclusions
Fortran’s Advantages
• Coarrays and Events a great match
• Coarray to communicate task inputs and outputs
• Events to signal task start and completion
• Teams should allow for scalable implementation
• Partition DAG and have multiple schedulers work on independent regions with separate
teams of executors
• Polymorphism
• Different kinds of task can exist that capture different kinds of “input” data at startup
• Fortan’s History
• Likely lots of applications that could be easily adapted
Fortran’s Disadvantages
• Can’t use polymorphic objects in coarrays
• A strategic change to the standard could enable this
• Coindexed objects with polymorphic components not allowed in variable definition
context, but could be allowed in expressions
• See https://github.com/j3-fortran/fortran_proposals/issues/251
• No introspection
• Automatic task detection, fusion or splitting not possible
• Fortran’s History
• Many existing applications have shared global state
• Presents data races in task based execution
Conclusions
• It “works”
• Scaling/performance will depend not only on algorithm, but also compiler and platform
• There are limitations
• Future Work
• Propose changes to Fortran standard to improve utility/flexibility
• Investigate performance issues
• Explore alternative scheduling algorithms
• Find “beta” testers, i.e. target applications
• Questions?
• https://github.com/sourceryinstitute/feats
Thank You

Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran