Parallel Computing
Stanford CS149, Fall 2020
Lecture 5:
Performance Optimization Part 1:
Work Distribution and Scheduling
( how to be l33t )
Stanford CS149, Fall 2020
Programming for high performance
▪ Optimizing the performance of parallel programs is an
iterative process of refining choices for decomposition,
assignment, and orchestration...
▪ Key goals (that are at odds with each other)
- Balance workload onto available execution resources
- Reduce communication (to avoid stalls)
- Reduce extra work (overhead) performed to increase parallelism,
manage assignment, reduce communication, etc.
▪ We are going to talk about a rich space of techniques
Stanford CS149, Fall 2020
TIP #1: Always implement the simplest solution first, then
measure performance to determine if you need to do better.
(Example: if you anticipate only running on low-core-count machines,
it may be unnecessary to implement a complex approach that
creates hundreds or thousands of pieces of independent work)
Stanford CS149, Fall 2020
Balancing the workload
Ideally: all processors are computing all the time during program execution
(they are computing simultaneously, and they finish their portion of the work at the same time)
Only a small amount of load imbalance can
significantly bound the maximum speedup
Time P1 P2 P3 P4
P4 does 2x more work → P4 takes 2x longer to complete
→ 50% of parallel program’s
runtime is serial execution
(work in serialized section here is about 1/5 of the work done by the entire
program, so S=0.2 in Amdahl’s law equation)
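Working this example through Amdahl’s law as a check (not shown on the slide): with S = 0.2 and P = 4 processors, speedup ≤ 1 / (S + (1 − S)/P) = 1 / (0.2 + 0.8/4) = 2.5, well short of the ideal 4x. This matches the figure: 5 units of total work completed in 2 units of parallel runtime.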
Stanford CS149, Fall 2020
Static assignment
▪ Assignment of work to threads is not dependent on runtime factors (like how long it
takes to perform a piece of work)
- Not necessarily determined at compile-time (assignment algorithm may depend on
runtime parameters such as input data size, number of threads, etc.)
▪ Recall grid solver example: assign equal number of grid cells to each thread
- We discussed two static assignments of work to workers (blocked and interleaved)
▪ Good properties of static assignment: simple, essentially zero runtime overhead to
perform assignment (in this example: extra work to implement assignment is a little bit
of indexing math)
Stanford CS149, Fall 2020
When is static assignment applicable?
▪ When the cost (execution time) of work and the amount of work is
predictable, allowing the programmer to work out a good assignment in
advance
▪ Simplest example: it is known up front that all work has the same cost
Time P1 P2 P3 P4
In the example above:
There are 12 tasks, and it is known that each has the same cost.
Static assignment: statically assign three tasks to each of the four processors.
Stanford CS149, Fall 2020
When is static assignment applicable?
▪ When work is predictable, but not all jobs have same cost (see example below)
▪ When statistics about execution time are predictable (e.g., same cost on average)
Time P1 P2 P3 P4
Jobs have unequal, but known cost: assign equal number of tasks to processors to ensure good
load balance (on average)
Stanford CS149, Fall 2020
“Semi-static” assignment
▪ Cost of work is predictable for near-term future
- Idea: recent past is a good predictor of near future
▪ Application periodically profiles its execution and re-adjusts assignment
- Assignment is “static” for the interval between re-adjustments
Adaptive mesh:
Mesh is changed as object moves or flow over object changes,
but changes occur slowly (color indicates assignment of parts
of mesh to processors)
Particle simulation:
Redistribute particles as they move
over course of simulation
(if motion is slow, redistribution need
not occur often)
Image credit: http://typhon.sourceforge.net/spip/spip.php?article22
Stanford CS149, Fall 2020
Dynamic assignment
Program determines assignment dynamically at runtime to ensure a well
distributed load. (The execution time of tasks, or the total number of
tasks, is unpredictable.)
int N = 1024;
int* x = new int[N];
bool* is_prime = new bool[N];
// assume elements of x initialized here
for (int i=0; i<N; i++)
{
// unknown execution time
is_prime[i] = test_primality(x[i]);
}
int N = 1024;
// assume allocations are only executed by 1 thread
int* x = new int[N];
bool* is_prime = new bool[N];
// assume elements of x are initialized here
LOCK counter_lock;
int counter = 0; // shared variable
while (1) {
int i;
lock(counter_lock);
i = counter++;
unlock(counter_lock);
if (i >= N)
break;
is_prime[i] = test_primality(x[i]);
}
Sequential program
(independent loop iterations)
Parallel program
(SPMD execution by multiple threads,
shared address space model)
(Note: the lock / increment / unlock sequence above could instead be expressed as a single atomic operation: atomic_incr(counter);)
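A minimal sketch of that alternative using C++ std::atomic (the worker function and its arguments are illustrative; test_primality is the per-element routine from the slide):

// Hedged sketch: dynamic assignment of loop iterations using an atomic
// grab-counter instead of a lock-protected shared integer.
#include <atomic>

bool test_primality(int v);          // assumed from the slide

std::atomic<int> counter{0};         // shared grab-counter (replaces LOCK + int)

void worker(const int* x, bool* is_prime, int N) {
    while (true) {
        int i = counter.fetch_add(1);        // the atomic_incr(counter) of the note
        if (i >= N)
            break;
        is_prime[i] = test_primality(x[i]);  // unknown, per-element cost
    }
}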
Stanford CS149, Fall 2020
Dynamic assignment using a work queue
Worker threads:
Pull data from shared work queue
Push new work to queue as it is created
T1 T2 T3 T4
Sub-problems
(a.k.a. “tasks”, “work”)
Shared work queue: a list of work to do
(for now, let’s assume each piece of work is independent)
Stanford CS149, Fall 2020
What constitutes a piece of work?
What is a potential performance problem with this implementation?
const int N = 1024;
// assume allocations are only executed by 1 thread
int* x = new int[N];
bool* is_prime = new bool[N];
// assume elements of x are initialized here
LOCK counter_lock;
int counter = 0;
while (1) {
int i;
lock(counter_lock);
i = counter++;
unlock(counter_lock);
if (i >= N)
break;
is_prime[i] = test_primality(x[i]);
}
Fine granularity partitioning: 1 “task” = 1 element
Likely good workload balance (many small tasks)
Potential for high synchronization cost (serialization at critical section)
(Figure annotation compares time in the critical section to time in task 0: the critical-section time is overhead that does not exist in the serial program, and it is serial execution (recall Amdahl’s Law). So... is it a problem?)
Stanford CS149, Fall 2020
Increasing task granularity
const int N = 1024;
const int GRANULARITY = 10;
// assume allocations are only executed by 1 thread
int* x = new int[N];
bool* is_prime = new bool[N];
// assume elements of x are initialized here
LOCK counter_lock;
int counter = 0;
while (1) {
int i;
lock(counter_lock);
i = counter;
counter += GRANULARITY;
unlock(counter_lock);
if (i >= N)
break;
int end = min(i + GRANULARITY, N);
for (int j=i; j<end; j++)
is_prime[j] = test_primality(x[j]);
}
Coarse granularity partitioning: 1 “task” = 10 elements
Decreased synchronization cost (critical section entered 10 times less often)
(Figure annotation compares time in the critical section to time in task 0.)
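For comparison, OpenMP expresses the same coarse-grained dynamic assignment through its schedule clause; a minimal sketch (the chunk size 10 plays the role of GRANULARITY above, and N, x, is_prime, test_primality are the names from the code above):

// Each thread grabs 10 iterations at a time from a shared iteration counter.
#pragma omp parallel for schedule(dynamic, 10)
for (int i = 0; i < N; i++) {
    is_prime[i] = test_primality(x[i]);
}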
Stanford CS149, Fall 2020
Choosing task size
▪ Useful to have many more tasks* than processors
(many small tasks enables good workload balance via dynamic assignment)
- Motivates small granularity tasks
▪ But want as few tasks as possible to minimize overhead of
managing the assignment
- Motivates large granularity tasks
▪ Ideal granularity depends on many factors
(Common theme in this course: must know your workload, and your machine)
* I had to pick a term for a piece of work
Stanford CS149, Fall 2020
Smarter task scheduling
16 Tasks
Time
Consider dynamic scheduling via a shared work queue
What happens if the system assigns these tasks to workers in left-to-right order?
Stanford CS149, Fall 2020
Smarter task scheduling
What happens if scheduler runs the long task last? Potential for load imbalance!
Time P1 P2 P3 P4
One possible solution to imbalance problem:
Divide work into a larger number of smaller tasks
- Hopefully this makes the “long pole” shorter relative to overall execution time
- May increase synchronization overhead
- May not be possible (perhaps long task is fundamentally sequential)
Done!
Stanford CS149, Fall 2020
Smarter task scheduling
Schedule long task first to reduce “slop” at end of computation
P1 P2 P3 P4
Another solution: smarter scheduling
Schedule long tasks first
- Thread performing long task performs fewer overall tasks, but approximately the
same amount of work as the other threads.
- Requires some knowledge of workload (some predictability of cost)
Time
Done!
Stanford CS149, Fall 2020
Decreasing synchronization overhead using
a distributed set of queues
(avoid need for all workers to synchronize on single work queue)
Worker threads:
Pull data from OWN work queue
Push new work to OWN work queue
When local work queue is empty...
STEAL work from another work queue
T1 T2 T3 T4
Set of work queues
(In general, one per worker thread)
Steal!
Subproblems
(a.k.a. “tasks”, “work to do”)
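A minimal sketch of the worker loop this picture implies (WorkQueue, Task, done(), and random_other_thread() are illustrative placeholders, not names from the slides):

#include <vector>

// Worker loop: prefer work from the thread's OWN queue; when it is empty,
// pick a random victim and try to steal from that thread's queue instead.
void worker_loop(int my_id, std::vector<WorkQueue>& queues) {
    while (!done()) {
        Task t;
        if (queues[my_id].try_pop(t)) {
            t.run();                              // running a task may push
        } else {                                  // new work onto the OWN queue
            int victim = random_other_thread(my_id, (int)queues.size());
            if (queues[victim].try_steal(t))      // STEAL from another queue
                t.run();
        }
    }
}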
Stanford CS149, Fall 2020
Work in task queues need not be independent
T1 T2 T3 T4
= application-specified
dependency
A task cannot be assigned to worker thread until all its
task dependencies are satisfied
Workers can submit new tasks (with optional explicit
dependencies) to task system
Task management system:
Scheduler manages dependencies between tasks
foo_handle = enqueue_task(foo); // enqueue task foo (independent of all prior tasks)
bar_handle = enqueue_task(bar, foo_handle); // enqueue task bar, cannot run until foo is complete
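One plausible way a task system could implement this dependency tracking (a sketch under assumptions, not the scheduler of any particular runtime): each task carries a count of unfinished dependencies and a list of dependents, and it becomes runnable when the count reaches zero.

#include <atomic>
#include <functional>
#include <vector>

struct Task {
    std::function<void()> fn;             // the work itself
    std::atomic<int> unfinished_deps{0};  // dependencies not yet completed
    std::vector<Task*> dependents;        // tasks waiting on this one
};

// Called by the scheduler when task t finishes: release its dependents.
// push_runnable is a hypothetical hook that hands a now-eligible task to a work queue.
void on_task_complete(Task* t, void (*push_runnable)(Task*)) {
    for (Task* d : t->dependents) {
        if (d->unfinished_deps.fetch_sub(1) == 1)  // last dependency satisfied
            push_runnable(d);                       // now eligible to run
    }
}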
Stanford CS149, Fall 2020
Summary
▪ Challenge: achieving good workload balance
- Want all processors working all the time (otherwise, resources are idle!)
- But want low-cost solution for achieving this balance
- Minimize computational overhead (e.g., scheduling/assignment logic)
- Minimize synchronization costs
▪ Static assignment vs. dynamic assignment
- Really, it is not an either/or decision, there’s a continuum of choices
- Use up-front knowledge about workload as much as possible to reduce load
imbalance and task management/synchronization costs (in the limit, if the system
knows everything, use fully static assignment)
▪ Issues discussed today span aspects of task decomposition,
assignment, and orchestration
Stanford CS149, Fall 2020
Scheduling fork-join parallelism
Stanford CS149, Fall 2020
Common parallel programming patterns
Data parallelism:
Perform same sequence of operations on many data elements
// openMP parallel for
#pragma omp parallel for
for (int i=0; i<N; i++) {
B[i] = foo(A[i]);
}
// ISPC foreach
foreach (i=0 ... N) {
B[i] = foo(A[i]);
}
// ISPC bulk task launch
launch[numTasks] myFooTask(A, B);
// using higher-order function ‘map’
map(foo, A, B);
foo()
// bulk CUDA thread launch (GPU programming)
foo<<<numBlocks, threadsPerBlock>>>(A, B);
Stanford CS149, Fall 2020
Common parallel programming patterns
Explicit management of parallelism with threads:
Create one thread per execution unit (or per amount of desired concurrency)
- Example below: C code with C++ threads
float* A;
float* B;
// initialize arrays A and B here
void myFunction(float* A, float* B) { … }
std::thread thread[NUM_HW_EXEC_CONTEXTS];
for (int i=0; i<NUM_HW_EXEC_CONTEXTS; i++) {
thread[i] = std::thread(myFunction, A, B);
}
for (int i=0; i<NUM_HW_EXEC_CONTEXTS; i++) {
thread[i].join();
}
Stanford CS149, Fall 2020
Consider divide-and-conquer algorithms
// sort elements from ‘begin’ up to (but not including) ‘end’
void quick_sort(int* begin, int* end) {
if (begin >= end-1)
return;
else {
// choose partition key and partition elements
// by key, return position of key as `middle`
int* middle = partition(begin, end);
quick_sort(begin, middle);
quick_sort(middle+1, end);
}
}
Quick sort:
independent work!
(Call-tree figure: each quick_sort call spawns two independent recursive calls; arrows indicate dependencies.)
Stanford CS149, Fall 2020
Fork-join pattern
▪ Natural way to express the independent work that is inherent in divide-and-
conquer algorithms
▪ This lecture’s code examples will be in Cilk Plus
- C++ language extension
- Originally developed at MIT, now adapted as open standard (in GCC, Intel ICC)
cilk_spawn foo(args);
Semantics: invoke foo, but unlike standard function call, caller may continue
executing asynchronously with execution of foo.
cilk_sync;
Semantics: returns when all calls spawned by current function have completed.
(“sync up” with the spawned calls)
Note: there is an implicit cilk_sync at the end of every function that contains a
cilk_spawn (implication: when a Cilk function returns, all work associated with
that function is complete)
“fork” (create new logical thread of control)
“join”
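A minimal sketch of these primitives in use, the classic Fibonacci example (not from these slides):

#include <cilk/cilk.h>

// The two recursive calls may execute in parallel; cilk_sync waits for the
// spawned call to complete before the results are combined.
int fib(int n) {
    if (n < 2)
        return n;
    int a = cilk_spawn fib(n - 1);  // fork: may run concurrently with caller
    int b = fib(n - 2);             // caller continues with the other half
    cilk_sync;                      // join: wait for the spawned call
    return a + b;
}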
Stanford CS149, Fall 2020
Call-return of a function in C*
void my_func() {
// calling function (part A)
foo();
bar();
// calling function (part B)
}
foo()
bar()
part B
part A
Semantics of a function call:
Control moves to the function that is called
(Thread executes instructions for the function)
When function returns, control returns back to caller
(thread resumes executing instructions from the caller)
my_func()
* And many other languages
Stanford CS149, Fall 2020
Basic Cilk Plus examples
// foo() and bar() may run in parallel
cilk_spawn foo();
bar();
cilk_sync;
// foo() and bar() may run in parallel
cilk_spawn foo();
cilk_spawn bar();
cilk_sync;
// foo, bar, fizz, buzz, may run in parallel
cilk_spawn foo();
cilk_spawn bar();
cilk_spawn fizz();
buzz();
cilk_sync;
(Dependency diagrams for the three examples above.)
Same amount of independent work as the first example, but potentially
higher runtime overhead (due to two spawns vs. one)
Stanford CS149, Fall 2020
Abstraction vs. implementation
▪ Notice that the cilk_spawn abstraction does not specify how
or when spawned calls are scheduled to execute
- Only that they may be run concurrently with caller (and with all other
calls spawned by the caller)
- Question: Is an implementation of Cilk correct if it implements
cilk_spawn foo() the same way as it implements a normal function
call to foo()?
▪ But cilk_sync does serve as a constraint on scheduling
- All spawned calls must complete before cilk_sync returns
Stanford CS149, Fall 2020
Parallel quicksort in Cilk Plus
void quick_sort(int* begin, int* end) {
if (begin >= end - PARALLEL_CUTOFF)
std::sort(begin, end);
else {
int* middle = partition(begin, end);
cilk_spawn quick_sort(begin, middle);
quick_sort(middle+1, end);
}
}
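Note that no explicit cilk_sync appears here: as described earlier, every function containing a cilk_spawn has an implicit cilk_sync at its end, so quick_sort only returns once both halves are sorted.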
(Figure: call tree of quick_sort() invocations, each partition() followed by recursive calls that bottom out in std::sort() on small sub-ranges.)
Sort sequentially if problem size is sufficiently small (overhead of spawn trumps benefits of potential parallelization)
Stanford CS149, Fall 2020
Writing fork-join programs
▪ Main idea: expose independent work (potential parallelism)
to the system using cilk_spawn
▪ Recall parallel programming rules of thumb
- Want at least as much work as parallel execution capability (e.g., program
should probably spawn at least as much work as needed to fill all the machine’s
processing resources)
- Want more independent work than execution capability to allow for good
workload balance of all the work onto the cores
- “parallel slack” = ratio of independent work to machine’s parallel execution
capability (in practice: ~8 is a good ratio)
- But not too much independent work so that granularity of work is too small
(too much slack incurs overhead of managing fine-grained work)
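As a rough worked example of this ratio: on a machine with 32 hardware execution contexts, a parallel slack of ~8 suggests creating on the order of 8 × 32 = 256 independent pieces of work.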
Stanford CS149, Fall 2020
Scheduling fork-join programs
▪ Consider very simple scheduler:
- Launch pthread for each cilk_spawn using pthread_create
- Translate cilk_sync into appropriate pthread_join calls
▪ Potential performance problems?
- Heavyweight spawn operation
- Many more concurrently running threads than cores
- Context switching overhead
- Larger working set than necessary, less cache locality
Note: now we are going to talk about the implementation of Cilk
Stanford CS149, Fall 2020
Pool of worker threads
▪ The Cilk Plus runtime maintains pool of worker threads
- Think: all threads are created at application launch *
- Exactly as many worker threads as execution contexts in the machine
* It’s perfectly fine to think about it this way, but in reality, runtimes tend to be lazy and initialize worker threads on the
first Cilk spawn. (This is a common implementation strategy, ISPC does the same with worker threads that run ISPC tasks.)
Thread 0 Thread 1 Thread 2 Thread 3
Thread 4 Thread 5 Thread 6 Thread 7
Example: Eight thread worker pool for my
quad-core laptop with Hyper-Threading
while (work_exists()) {
work = get_new_work();
work.run();
}
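A minimal sketch of such a pool in C++ (work_exists() and get_new_work() are the slide's placeholders, assumed to be backed by a thread-safe task queue):

#include <thread>
#include <vector>

void run_worker_pool() {
    // One worker thread per hardware execution context, created up front.
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; i++) {
        workers.emplace_back([] {
            while (work_exists()) {        // loop from the slide
                auto work = get_new_work();
                work.run();
            }
        });
    }
    for (auto& t : workers)
        t.join();
}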
Stanford CS149, Fall 2020
Consider execution of the following code
cilk_spawn foo();
bar();
cilk_sync;
Specifically, consider execution from the point foo() is spawned
(Diagram: foo() is the spawned child; bar() is the continuation, i.e., the rest of the calling function.)
What threads should foo() and bar() be executed by?
Thread 0 Thread 1
Stanford CS149, Fall 2020
First, consider a serial implementation
Thread 0
Executing foo()…
Traditional thread call stack
(indicates bar() will be run next
after return from foo())
Thread 1
What if, while executing foo(),
thread 1 goes idle…
Inefficient: thread 1 could be
performing bar() at this time!
Run child first… via a regular function call
- Thread runs foo(), then returns from foo(), then runs bar()
- Continuation is implicit in the thread’s stack
bar()
Stanford CS149, Fall 2020
Per-thread work queues store “work to do”
Thread 0
Thread
call stack
Thread 1
Thread 0 work queue Thread 1 work queue
Empty!
bar()
Executing foo()…
Upon reaching cilk_spawn foo(), thread places continuation in
its work queue, and begins executing foo().
Stanford CS149, Fall 2020
Idle threads “steal” work from busy threads
Thread 0
Thread
call stack
Thread 1
Thread 0 work queue Thread 1 work queue
bar()
Executing foo()…
1. Idle thread looks in busy
thread’s queue for work
If thread 1 goes idle (a.k.a. there is no work in its own queue), then
it looks in thread 0’s queue for work to do.
Stanford CS149, Fall 2020
Idle threads “steal” work from busy threads
Thread 0
Thread
call stack
Thread 1
Thread 0 work queue Thread 1 work queue
bar()
Executing foo()…
1. Idle thread looks in busy
thread’s queue for work
2. Idle thread moves work from busy
thread’s queue to its own queue
If thread 1 goes idle (a.k.a. there is no work in its own queue), then
it looks in thread 0’s queue for work to do.
Stanford CS149, Fall 2020
Idle threads “steal” work from busy threads
Thread 0
Thread
call stack
Thread 1
Thread 0 work queue Thread 1 work queue
Executing foo()…
1. Idle thread looks in busy
thread’s queue for work
2. Idle thread moves work from busy
thread’s queue to its own queue
Executing bar()…
3.Thread resumes execution
If thread 1 goes idle (a.k.a. there is no work in its own queue), then
it looks in thread 0’s queue for work to do.
Stanford CS149, Fall 2020
At spawn, should calling thread run the child or the
continuation?
cilk_spawn foo();
bar();
cilk_sync;
(Diagram: foo() is the spawned child; bar() is the continuation, i.e., the rest of the calling function.)
Run child first: enqueue continuation for later execution
- Continuation is made available for stealing by other threads (“continuation stealing”)
Run continuation first: queue child for later execution
- Child is made available for stealing by other threads (“child stealing”)
Which implementation do we choose?
Stanford CS149, Fall 2020
Consider thread executing the following code
for (int i=0; i<N; i++) {
cilk_spawn foo(i);
}
cilk_sync;
foo(N-1) foo(3) foo(2) foo(1) foo(0)
…
▪ Run continuation first (“child stealing”)
- Caller thread spawns work for all iterations before
executing any of it
- Think: breadth-first traversal of call graph. O(N) space
for spawned work (maximum space)
- If no stealing, execution order is very different than
that of program with cilk_spawn removed
Thread 0
Thread 0 work queue
foo(N-1)
foo(N-2)
foo(0)
…
Stanford CS149, Fall 2020
for (int i=0; i<N; i++) {
cilk_spawn foo(i);
}
cilk_sync;
foo(N-1) foo(3) foo(2) foo(1) foo(0)
…
▪ Run child first (“continuation stealing”)
- Caller thread only creates one item to steal
(continuation that represents all remaining iterations)
- If no stealing occurs, thread continually pops
continuation from work queue, enqueues new
continuation (with updated value of i)
- Order of execution is the same as for program with
spawn removed.
- Think: depth-first traversal of call graph
Thread 0
Thread 0 work queue
cont: i=1
Executing foo(0)…
Consider thread executing the following code
Stanford CS149, Fall 2020
for (int i=0; i<N; i++) {
cilk_spawn foo(i);
}
cilk_sync;
foo(N-1) foo(3) foo(2) foo(1) foo(0)
…
▪ Run child first (“continuation stealing”)
- Enqueues continuation with i advanced by 1
- If continuation is stolen, stealing thread spawns
and executes next iteration
- Can prove that work queue storage for system
with T threads is no more than T times that of
stack storage for single-threaded execution
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
cont: i=1
Executing foo(0)… Executing foo(1)…
Consider thread executing the following code
Stanford CS149, Fall 2020
Scheduling quicksort: assume 200 elements
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Thread 2
Thread 2 work queue
void quick_sort(int* begin, int* end) {
if (begin >= end - PARALLEL_CUTOFF)
std::sort(begin, end);
else {
int* middle = partition(begin, end);
cilk_spawn quick_sort(begin, middle);
quick_sort(middle+1, end);
}
}
cont: 101-200
Working on 0-25…
cont: 51-100
cont: 26-50
…
What work in the queue should
other threads steal?
(e.g., steal from top or bottom)
Stanford CS149, Fall 2020
Implementing work stealing: dequeue per worker
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Thread 2
Thread 2 work queue
cont: 101-200
Working on 0-25…
cont: 51-100
cont: 26-50
…
Steal!
Steal!
Work queue implemented as a dequeue (double ended queue)
- Local thread pushes/pops from the “tail” (bottom)
- Remote threads steal from the “head” (top)
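A simplified, lock-based sketch of such a dequeue (Task is a placeholder type; real runtimes use efficient lock-free versions, as noted on a later slide, and this only illustrates the push/pop-at-tail, steal-at-head discipline):

#include <deque>
#include <mutex>
#include <optional>

struct Task { void run() {} };   // placeholder task type

struct WorkDeque {
    std::deque<Task> q;
    std::mutex m;

    void push(Task t) {                        // owning thread: push to tail
        std::lock_guard<std::mutex> g(m);
        q.push_back(std::move(t));
    }
    std::optional<Task> pop() {                // owning thread: pop from tail
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        Task t = std::move(q.back());
        q.pop_back();
        return t;
    }
    std::optional<Task> steal() {              // thief: take from head
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        Task t = std::move(q.front());
        q.pop_front();
        return t;
    }
};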
Stanford CS149, Fall 2020
Implementing work stealing: dequeue per worker
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Thread 2
Thread 2 work queue
Working on 0-25…
cont: 151-200
cont: 26-50
…
cont: 76-100
Working on 51-75…
Working on 101-150…
Work queue implemented as a dequeue (double ended queue)
- Local thread pushes/pops from the “tail” (bottom)
- Remote threads steal from the “head” (top)
Stanford CS149, Fall 2020
Implementing work stealing: dequeue per worker
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Thread 2
Thread 2 work queue
Working on 0-12…
cont: 114-125
cont: 13-25
…
Working on 51-63…
Working on 101-113…
cont: 64-75
cont: 76-100
cont: 26-50 cont: 126-150
cont: 151-200
Work queue implemented as a dequeue (double ended queue)
- Local thread pushes/pops from the “tail” (bottom)
- Remote threads steal from the “head” (top)
Stanford CS149, Fall 2020
Implementing work stealing: choice of victim
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Thread 2
Thread 2 work queue
Working on 0-12…
cont: 114-125
cont: 13-25
…
▪ Idle threads randomly choose a thread to attempt to steal from
▪ Steal work from top of dequeue:
- Steals largest amount of work (reduce number of steals)
- Maximum locality in work each thread performs (when combined with run child first scheme)
- Stealing thread and local thread do not contend for same elements of dequeue (efficient
lock-free implementations of dequeue exist)
Working on 51-63…
Working on 101-113…
cont: 64-75
cont: 76-100
cont: 26-50 cont: 126-150
cont: 151-200
Stanford CS149, Fall 2020
Child-first work stealing scheduler anticipates
divide-and-conquer parallelism
void recursive_for(int start, int end) {
while (start <= end - GRANULARITY) {
int mid = (start + end) / 2;  // midpoint of the remaining range
cilk_spawn recursive_for(start, mid);
start = mid;
}
for (int i=start; i<end; i++)
foo(i);
}
recursive_for(0, N);
for (int i=0; i<N; i++) {
cilk_spawn foo(i);
}
cilk_sync;
foo(N-1) foo(3) foo(2) foo(1) foo(0)
…
(0, N/2)
(N/2, 3N/4)
(0, N/4)
(N/2, 5N/8)
(N/4, 3N/8)
The recursive_for version generates work in parallel
(the flat cilk_spawn loop does not), so it fills the
parallel machine more quickly
Stanford CS149, Fall 2020
Implementing sync
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Thread 2
Thread 2 work queue
Working on foo(9)…
cont: i=10
Working on foo(7)… Working on foo(8)…
Thread 3
Thread 3 work queue
Working on foo(6)…
State of worker threads
when all work from loop
is nearly complete
Stanford CS149, Fall 2020
Implementing sync: no stealing case
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Working on foo(9), id=A…
cont: i=10 (id=A)
block (id: A)
Sync for all calls spawned within block A
If no work has been stolen by other threads, then
there’s nothing to do at the sync point.
cilk_sync is a no-op.
Stanford CS149, Fall 2020
Implementing sync: stalling join
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Example 1: “stalling” join policy
Thread that initiates the fork must perform the sync.
Therefore it waits for all spawned work to be complete.
In this case, thread 0 is the thread initiating the fork
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Working on foo(0), id=A…
cont: i=0 (id=A)
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: stalling join
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Working on foo(0), id=A…
STOLEN (id=A) cont: i=0, id=A
Steal!
Idle thread 1 steals from busy thread 0
Note: descriptor for block A created
The descriptor tracks the number of outstanding
spawns for the block, and the number of those
spawns that have completed.
The 1 spawn tracked by the descriptor corresponds to
foo(0) being run by thread 0. (Since the continuation
is now owned by thread 1 after the steal.)
id=A
spawn: 1, done: 0
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
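A hypothetical sketch of the block descriptor's bookkeeping (the real runtime's data structure is more involved; the counter names mirror the spawn/done counts shown on these slides):

#include <atomic>

// Per-block sync descriptor, created lazily when a steal first occurs.
struct SyncBlockDescriptor {
    std::atomic<int> spawned{0};   // spawned calls tracked by this block
    std::atomic<int> done{0};      // spawned calls that have completed

    void on_spawn()    { spawned.fetch_add(1, std::memory_order_relaxed); }
    void on_complete() { done.fetch_add(1, std::memory_order_release); }

    // At the cilk_sync: under the stalling policy the forking thread waits
    // until this is true; under the greedy policy the last completing thread
    // observes it and runs the continuation itself.
    bool all_done() const {
        return done.load(std::memory_order_acquire) ==
               spawned.load(std::memory_order_acquire);
    }
};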
Stanford CS149, Fall 2020
Implementing sync: stalling join
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Working on foo(0), id=A…
cont: i=1, id=A
Update count
Working on foo(1), id=A…
id=A
spawn: 2, done: 0 STOLEN (id=A)
Thread 1 is now running foo(1)
Note: spawn count is now 2
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: stalling join
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Idle! Working on foo(1), id=A…
Thread 2
Thread 2 work queue
cont: i=2, id=A
Working on foo(2), id=A…
Steal!
STOLEN (id=A)
id=A
spawn: 3, done: 1
Thread 0 completes foo(0)
(updates spawn descriptor)
Thread 2 now running foo(2)
STOLEN (id=A)
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: stalling join
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Idle! Idle!
Thread 2
Thread 2 work queue
cont: i=10, id=A
Working on foo(9), id=A…
id=A
spawn: 10, done: 9 STOLEN (id=A)
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Computation nearing end…
Only foo(9) remains to be
completed.
Stanford CS149, Fall 2020
Implementing sync: stalling join
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Idle! Idle!
Thread 2
Thread 2 work queue
cont: i=10, id=A
Idle!
Notify
done!
id=A
spawn: 10, done: 10 STOLEN (id=A)
Last spawn completes.
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: stalling join
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Working on bar()… Idle!
Thread 2
Thread 2 work queue
Idle!
Thread 0 now resumes continuation
and executes bar()
Note block A descriptor is now free.
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: greedy policy
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Working on foo(0), id=A…
STOLEN (id=A) cont: i=0, id=A
Steal!
Example 2: “greedy” policy
- When thread that initiates the fork goes idle, it
looks to steal new work
- Last thread to reach the join point continues
execution after sync
id=A
spawn: 0, done: 0
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: greedy policy
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Working on foo(0), id=A…
STOLEN (id=A) cont: i=1, id=A
Idle thread 1 steals from busy thread 0
(as in the previous case)
id=A
spawn: 2, done: 0
Working on foo(1), id=A…
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: greedy policy
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Done with foo(0)!
STOLEN (id=A) cont: i=1, id=A
Thread 0 completes foo(0)
No work to do in local dequeue, so thread 0
looks to steal!
id=A
spawn: 2, done: 0
Working on foo(1), id=A…
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: greedy policy
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
STOLEN (id=A)
cont: i=2, id=A
Thread 0 now working on foo(2)
id=A
spawn: 3, done: 1
Working on foo(1), id=A…
Working on foo(2), id=A…
Steal!
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: greedy policy
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
cont: i=10, id=A
Assume thread 1 is the last to finish
spawned calls for block A.
id=A
spawn: 10, done: 9
Working on foo(9), id=A…
Idle
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Implementing sync: greedy policy
foo(9) foo(3) foo(2) foo(1) foo(0)
…
bar()
Thread 0
Thread 0 work queue
Thread 1
Thread 1 work queue
Thread 1 continues on to run bar()
Note block A descriptor is now free.
Working on bar()
Idle
for (int i=0; i<10; i++) {
cilk_spawn foo(i);
}
cilk_sync;
bar();
block (id: A)
Sync for all calls spawned within block A
Stanford CS149, Fall 2020
Cilk uses greedy join scheduling
▪ Greedy join scheduling policy
- All threads always attempt to steal if there is nothing to do (threads only go
idle if there is no work to steal in the system)
- Worker thread that initiated spawn may not be thread that executes logic after
cilk_sync
▪ Remember:
- Overhead of bookkeeping steals and managing sync points only occurs when
steals occur
- If large pieces of work are stolen, this should occur infrequently
- Most of the time, threads are pushing/popping local work from their local
dequeue
Stanford CS149, Fall 2020
Cilk summary
▪ Fork-join parallelism: a natural way to express divide-and-
conquer algorithms
- Discussed Cilk Plus, but many other systems also have fork/join primitives
- e.g., OpenMP
▪ Cilk Plus runtime implements spawn/sync abstraction with a
locality-aware work stealing scheduler
- Always run spawned child (continuation stealing)
- Greedy behavior at join (threads do not wait at join, immediately look for other
work to steal)

More Related Content

Similar to 05_Performance Optimization.pdf

NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeRizwan Habib
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16MLconf
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Scheduling algorithm in real time system
Scheduling algorithm in real time systemScheduling algorithm in real time system
Scheduling algorithm in real time systemVishalPandat2
 
Weighted Flowtime on Capacitated Machines
Weighted Flowtime on Capacitated MachinesWeighted Flowtime on Capacitated Machines
Weighted Flowtime on Capacitated MachinesSunny Kr
 
Clock driven scheduling
Clock driven schedulingClock driven scheduling
Clock driven schedulingKamal Acharya
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptRubenGabrielHernande
 
multiprocessor real_ time scheduling.ppt
multiprocessor real_ time scheduling.pptmultiprocessor real_ time scheduling.ppt
multiprocessor real_ time scheduling.pptnaghamallella
 
Job Queues Overview
Job Queues OverviewJob Queues Overview
Job Queues Overviewjoeyrobert
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.pptArumugam90
 
Temporal workload analysis and its application to power aware scheduling
Temporal workload analysis and its application to power aware schedulingTemporal workload analysis and its application to power aware scheduling
Temporal workload analysis and its application to power aware schedulingijesajournal
 
Temporal workload analysis and its application to power aware scheduling
Temporal workload analysis and its application to power aware schedulingTemporal workload analysis and its application to power aware scheduling
Temporal workload analysis and its application to power aware schedulingijesajournal
 
OmpSs – improving the scalability of OpenMP
OmpSs – improving the scalability of OpenMPOmpSs – improving the scalability of OpenMP
OmpSs – improving the scalability of OpenMPIntel IT Center
 
Multiprocessor Real-Time Scheduling.pptx
Multiprocessor Real-Time Scheduling.pptxMultiprocessor Real-Time Scheduling.pptx
Multiprocessor Real-Time Scheduling.pptxnaghamallella
 
Wei's notes on hadoop resource awareness
Wei's notes on hadoop resource awarenessWei's notes on hadoop resource awareness
Wei's notes on hadoop resource awarenessLu Wei
 

Similar to 05_Performance Optimization.pdf (20)

NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKee
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
EEDC Programming Models
EEDC Programming ModelsEEDC Programming Models
EEDC Programming Models
 
Scheduling algorithm in real time system
Scheduling algorithm in real time systemScheduling algorithm in real time system
Scheduling algorithm in real time system
 
RTS
RTSRTS
RTS
 
Weighted Flowtime on Capacitated Machines
Weighted Flowtime on Capacitated MachinesWeighted Flowtime on Capacitated Machines
Weighted Flowtime on Capacitated Machines
 
Chapter 2 ds
Chapter 2 dsChapter 2 ds
Chapter 2 ds
 
Clock driven scheduling
Clock driven schedulingClock driven scheduling
Clock driven scheduling
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.ppt
 
multiprocessor real_ time scheduling.ppt
multiprocessor real_ time scheduling.pptmultiprocessor real_ time scheduling.ppt
multiprocessor real_ time scheduling.ppt
 
Job Queues Overview
Job Queues OverviewJob Queues Overview
Job Queues Overview
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.ppt
 
Temporal workload analysis and its application to power aware scheduling
Temporal workload analysis and its application to power aware schedulingTemporal workload analysis and its application to power aware scheduling
Temporal workload analysis and its application to power aware scheduling
 
Temporal workload analysis and its application to power aware scheduling
Temporal workload analysis and its application to power aware schedulingTemporal workload analysis and its application to power aware scheduling
Temporal workload analysis and its application to power aware scheduling
 
OmpSs – improving the scalability of OpenMP
OmpSs – improving the scalability of OpenMPOmpSs – improving the scalability of OpenMP
OmpSs – improving the scalability of OpenMP
 
Design and analysis of algorithms
Design and analysis of algorithmsDesign and analysis of algorithms
Design and analysis of algorithms
 
Multiprocessor Real-Time Scheduling.pptx
Multiprocessor Real-Time Scheduling.pptxMultiprocessor Real-Time Scheduling.pptx
Multiprocessor Real-Time Scheduling.pptx
 
Wei's notes on hadoop resource awareness
Wei's notes on hadoop resource awarenessWei's notes on hadoop resource awareness
Wei's notes on hadoop resource awareness
 
Lecture1
Lecture1Lecture1
Lecture1
 

Recently uploaded

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 

Recently uploaded (20)

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

05_Performance Optimization.pdf

  • 1. Parallel Computing Stanford CS149, Fall 2020 Lecture 5: Performance Optimization Part 1: Work Distribution and Scheduling ( how to be l33t )
  • 2. Stanford CS149, Fall 2020 Programming for high performance ▪ Optimizing the performance of parallel programs is an iterative process of refining choices for decomposition, assignment, and orchestration... ▪ Key goals (that are at odds with each other) - Balance workload onto available execution resources - Reduce communication (to avoid stalls) - Reduce extra work (overhead) performed to increase parallelism, manage assignment, reduce communication, etc. ▪ We are going to talk about a rich space of techniques
  • 3. Stanford CS149, Fall 2020 TIP #1: Always implement the simplest solution first, then measure performance to determine if you need to do better. (Example: if you anticipate only running low-core count machines, it may be unnecessary to implement a complex approach that creates and hundreds or thousands of pieces of independent work)
  • 4. Stanford CS149, Fall 2020 Balancing the workload Ideally: all processors are computing all the time during program execution (they are computing simultaneously, and they finish their portion of the work at the same time) Only small amount of load imbalance can significantly bound maximum speedup Time P1 P2 P3 P4 P4 does 2x more work → P4 takes 2x longer to complete → 50% of parallel program’s runtime is serial execution (work in serialized section here is about 1/5 of the work done by the entire program, so S=0.2 in Amdahl’s law equation)
  • 5. Stanford CS149, Fall 2020 Static assignment ▪ Assignment of work to threads is not dependent on runtime factors (like how long it takes to perform a piece of work) - Not necessarily determined at compile-time (assignment algorithm may depend on runtime parameters such as input data size, number of threads, etc.) ▪ Recall grid solver example: assign equal number of grid cells to each thread - We discussed two static assignments of work to workers (blocked and interleaved) ▪ Good properties of static assignment: simple, essentially zero runtime overhead to perform assignment (in this example: extra work to implement assignment is a little bit of indexing math)
  • 6. Stanford CS149, Fall 2020 When is static assignment applicable? ▪ When the cost (execution time) of work and the amount of work is predictable, allowing the programmer to work out a good assignment in advance ▪ Simplest example: it is known up front that all work has the same cost Time P1 P2 P3 P4 In the example above: There are 12 tasks, and it is known that each have the same cost. Static assignment: statically assign three tasks to each of the four processors.
  • 7. Stanford CS149, Fall 2020 When is static assignment applicable? ▪ When work is predictable, but not all jobs have same cost (see example below) ▪ When statistics about execution time are predictable (e.g., same cost on average) Time P1 P2 P3 P4 Jobs have unequal, but known cost: assign equal number of tasks to processors to ensure good load balance (on average)
  • 8. Stanford CS149, Fall 2020 “Semi-static”assignment ▪ Cost of work is predictable for near-term future - Idea: recent past is a good predictor of near future ▪ Application periodically profiles its execution and re-adjusts assignment - Assignment is“static”for the interval between re-adjustments Adaptive mesh: Mesh is changed as object moves or flow over object changes, but changes occur slowly (color indicates assignment of parts of mesh to processors) Particle simulation: Redistribute particles as they move over course of simulation (if motion is slow, redistribution need not occur often) Image credit: http://typhon.sourceforge.net/spip/spip.php?article22
  • 9. Stanford CS149, Fall 2020 Dynamic assignment Program determines assignment dynamically at runtime to ensure a well distributed load. (The execution time of tasks, or the total number of tasks, is unpredictable.) int N = 1024; int* x = new int[N]; bool* prime = new bool[N]; // assume elements of x initialized here for (int i=0; i<N; i++) { // unknown execution time is_prime[i] = test_primality(x[i]); } int N = 1024; // assume allocations are only executed by 1 thread int* x = new int[N]; bool* is_prime = new bool[N]; // assume elements of x are initialized here LOCK counter_lock; int counter = 0; // shared variable while (1) { int i; lock(counter_lock); i = counter++; unlock(counter_lock); if (i >= N) break; is_prime[i] = test_primality(x[i]); } Sequential program (independent loop iterations) Parallel program (SPMD execution by multiple threads, shared address space model) atomic_incr(counter);
  • 10. Stanford CS149, Fall 2020 Dynamic assignment using a work queue Worker threads: Pull data from shared work queue Push new work to queue as it is created T1 T2 T3 T4 Sub-problems (a.k.a.“tasks”,“work”) Shared work queue: a list of work to do (for now, let’s assume each piece of work is independent)
  • 11. Stanford CS149, Fall 2020 What constitutes a piece of work? What is a potential performance problem with this implementation? const int N = 1024; // assume allocations are only executed by 1 thread float* x = new float[N]; bool* prime = new bool[N]; // assume elements of x are initialized here LOCK counter_lock; int counter = 0; while (1) { int i; lock(counter_lock); i = counter++; unlock(counter_lock); if (i >= N) break; is_prime[i] = test_primality(x[i]); } Fine granularity partitioning: 1“task”= 1 element Likely good workload balance (many small tasks) Potential for high synchronization cost (serialization at critical section) Time in critical section This is overhead that does not exist in serial program And.. it’s serial execution (recall Amdahl’s Law) Time in task 0 So... IS IT a problem?
  • 12. Stanford CS149, Fall 2020 Increasing task granularity const int N = 1024; const int GRANULARITY = 10; // assume allocations are only executed by 1 thread float* x = new float[N]; bool* prime = new bool[N]; // assume elements of x are initialized here LOCK counter_lock; int counter = 0; while (1) { int i; lock(counter_lock); i = counter; counter += GRANULARITY; unlock(counter_lock); if (i >= N) break; int end = min(i + GRANULARITY, N); for (int j=i; j<end; j++) is_prime[i] = test_primality(x[i]); } Coarse granularity partitioning: 1“task”= 10 elements Decreased synchronization cost (Critical section entered 10 times less) Time in critical section Time in task 0
  • 13. Stanford CS149, Fall 2020 Choosing task size ▪ Useful to have many more tasks* than processors (many small tasks enables good workload balance via dynamic assignment) - Motivates small granularity tasks ▪ But want as few tasks as possible to minimize overhead of managing the assignment - Motivates large granularity tasks ▪ Ideal granularity depends on many factors (Common theme in this course: must know your workload, and your machine) * I had to pick a term for a piece of work
  • 14. Stanford CS149, Fall 2020 Smarter task scheduling 16Tasks Time Consider dynamic scheduling via a shared work queue What happens if the system assigns these tasks to workers in left-to-right order?
  • 15. Stanford CS149, Fall 2020 Smarter task scheduling What happens if scheduler runs the long task last? Potential for load imbalance! Time P1 P2 P3 P4 One possible solution to imbalance problem: Divide work into a larger number of smaller tasks - Hopefully this makes the“long pole”shorter relative to overall execution time - May increase synchronization overhead - May not be possible (perhaps long task is fundamentally sequential) Done!
  • 16. Stanford CS149, Fall 2020 Smarter task scheduling Schedule long task first to reduce“slop”at end of computation P1 P2 P3 P4 Another solution: smarter scheduling Schedule long tasks first - Thread performing long task performs fewer overall tasks, but approximately the same amount of work as the other threads. - Requires some knowledge of workload (some predictability of cost) Time Done!
  • 17. Stanford CS149, Fall 2020 Decreasing synchronization overhead using a distributed set of queues (avoid need for all workers to synchronize on single work queue) Worker threads: Pull data from OWN work queue Push new work to OWN work queue When local work queue is empty... STEAL work from another work queue T1 T2 T3 T4 Set of work queues (In general, one per worker thread) Steal! Subproblems (a.k.a.“tasks”,“work to do”)
  • 18. Stanford CS149, Fall 2020 Work in task queues need not be independent T1 T2 T3 T4 = application-specified dependency A task cannot be assigned to worker thread until all its task dependencies are satisfied Workers can submit new tasks (with optional explicit dependencies) to task system Task management system: Scheduler manages dependencies between tasks foo_handle = enqueue_task(foo); // enqueue task foo (independent of all prior tasks) bar_handle = enqueue_task(bar, foo_handle); // enqueue task bar, cannot run until foo is complete
  • 19. Stanford CS149, Fall 2020 Summary ▪ Challenge: achieving good workload balance - Want all processors working all the time (otherwise, resources are idle!) - But want low-cost solution for achieving this balance - Minimize computational overhead (e.g., scheduling/assignment logic) - Minimize synchronization costs ▪ Static assignment vs. dynamic assignment - Really, it is not an either/or decision, there’s a continuum of choices - Use up-front knowledge about workload as much as possible to reduce load imbalance and task management/synchronization costs (in the limit, if the system knows everything, use fully static assignment) ▪ Issues discussed today span aspects of task decomposition, assignment, and orchestration
  • 20. Stanford CS149, Fall 2020 Scheduling fork-join parallelism
  • 21. Stanford CS149, Fall 2020 Common parallel programming patterns Data parallelism: Perform same sequence of operations on many data elements // openMP parallel for #pragma omp parallel for for (int i=0; i<N; i++) { B[i] = foo(A[i]); } // ISPC foreach foreach (i=0 ... N) { B[i] = foo(A[i]); } // ISPC bulk task launch launch[numTasks] myFooTask(A, B); // using higher-order function ‘map’ map(foo, A, B); foo() // bulk CUDA thread launch (GPU programming) foo<<<numBlocks, threadsPerBlock>>>(A, B);
  • 22. Stanford CS149, Fall 2020 Common parallel programming patterns Explicit management of parallelism with threads: Create one thread per execution unit (or per amount of desired concurrency) - Example below: C++ code using std::thread float* A; float* B; // initialize arrays A and B here void myFunction(float* A, float* B) { … } std::thread thread[NUM_HW_EXEC_CONTEXTS]; for (int i=0; i<NUM_HW_EXEC_CONTEXTS; i++) { thread[i] = std::thread(myFunction, A, B); } for (int i=0; i<NUM_HW_EXEC_CONTEXTS; i++) { thread[i].join(); }
  • 23. Stanford CS149, Fall 2020 Consider divide-and-conquer algorithms // sort elements from ‘begin’ up to (but not including) ‘end’ void quick_sort(int* begin, int* end) { if (begin >= end-1) return; else { // choose partition key and partition elements // by key, return position of key as `middle` int* middle = partition(begin, end); quick_sort(begin, middle); quick_sort(middle+1, end); } } Quick sort: independent work! quick_sort quick_sort quick_sort qs qs qs qs Dependencies
  • 24. Stanford CS149, Fall 2020 Fork-join pattern ▪ Natural way to express the independent work that is inherent in divide-and-conquer algorithms ▪ This lecture’s code examples will be in Cilk Plus - C++ language extension - Originally developed at MIT, now adopted as an open standard (supported in GCC, Intel ICC) cilk_spawn foo(args); Semantics: invoke foo, but unlike standard function call, caller may continue executing asynchronously with execution of foo. cilk_sync; Semantics: returns when all calls spawned by current function have completed. (“sync up”with the spawned calls) Note: there is an implicit cilk_sync at the end of every function that contains a cilk_spawn (implication: when a Cilk function returns, all work associated with that function is complete) “fork”(create new logical thread of control) “join”
  • 25. Stanford CS149, Fall 2020 Call-return of a function in C* void my_func() { // calling function (part A) foo(); bar(); // calling function (part B) } foo() bar() part B part A Semantics of a function call: Control moves to the function that is called (Thread executes instructions for the function) When function returns, control returns back to caller (thread resumes executing instructions from the caller) my_func() * And many other languages
  • 26. Stanford CS149, Fall 2020 Basic Cilk Plus examples // foo() and bar() may run in parallel cilk_spawn foo(); bar(); cilk_sync; // foo() and bar() may run in parallel cilk_spawn foo(); cilk_spawn bar(); cilk_sync; // foo, bar, fizz, buzz, may run in parallel cilk_spawn foo(); cilk_spawn bar(); cilk_spawn fizz(); buzz(); cilk_sync; bar() foo() bar() foo() fizz() bar() buzz() foo() Same amount of independent work as the first example, but potentially higher runtime overhead (due to two spawns vs. one)
  • 27. Stanford CS149, Fall 2020 Abstraction vs. implementation ▪ Notice that the cilk_spawn abstraction does not specify how or when spawned calls are scheduled to execute - Only that they may be run concurrently with caller (and with all other calls spawned by the caller) - Question: Is an implementation of Cilk correct if it implements cilk_spawn foo() the same way as it implements a normal function call to foo()? ▪ But cilk_sync does serve as a constraint on scheduling - All spawned calls must complete before cilk_sync returns
  • 28. Stanford CS149, Fall 2020 Parallel quicksort in Cilk Plus void quick_sort(int* begin, int* end) { if (begin >= end - PARALLEL_CUTOFF) std::sort(begin, end); else { int* middle = partition(begin, end); cilk_spawn quick_sort(begin, middle); quick_sort(middle+1, end); } } quick_sort() part() part() std:: sort() part() std:: sort() std:: sort() std:: sort() quick_sort() part() std:: sort() std:: sort() std:: sort() part() part() std:: sort() Sort sequentially if problem size is sufficiently small (overhead of spawn trumps benefits of potential parallelization) part()
  • 29. Stanford CS149, Fall 2020 Writing fork-join programs ▪ Main idea: expose independent work (potential parallelism) to the system using cilk_spawn ▪ Recall parallel programming rules of thumb - Want at least as much work as parallel execution capability (e.g., program should probably spawn at least as much work as needed to fill all the machine’s processing resources) - Want more independent work than execution capability to allow for good workload balance of all the work onto the cores - “parallel slack”= ratio of independent work to machine’s parallel execution capability (in practice: ~8 is a good ratio) - But not so much independent work that the granularity of each piece becomes too small (too much slack incurs overhead of managing fine-grained work)
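As a rough illustration of the slack rule of thumb above, one might size tasks like this. The helper name choose_grain_size and the default slack of 8 are illustrative assumptions, not a prescribed formula.

#include <algorithm>
#include <thread>

// Aim for roughly slack * (number of hardware execution contexts) tasks,
// and derive the per-task grain size from that target.
int choose_grain_size(long num_elements, int slack = 8) {
    int contexts = (int)std::thread::hardware_concurrency();
    long target_tasks = (long)slack * std::max(contexts, 1);   // e.g., 8x the hardware contexts
    long grain = num_elements / std::max(target_tasks, 1L);
    return (int)std::max(grain, 1L);                            // never smaller than one element per task
}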
  • 30. Stanford CS149, Fall 2020 Scheduling fork-join programs ▪ Consider very simple scheduler: - Launch pthread for each cilk_spawn using pthread_create - Translate cilk_sync into appropriate pthread_join calls ▪ Potential performance problems? - Heavyweight spawn operation - Many more concurrently running threads than cores - Context switching overhead - Larger working set than necessary, less cache locality Note: now we are going to talk about the implementation of Cilk
  • 31. Stanford CS149, Fall 2020 Pool of worker threads ▪ The Cilk Plus runtime maintains pool of worker threads - Think: all threads are created at application launch * - Exactly as many worker threads as execution contexts in the machine * It’s perfectly fine to think about it this way, but in reality, runtimes tend to be lazy and initialize worker threads on the first Cilk spawn. (This is a common implementation strategy, ISPC does the same with worker threads that run ISPC tasks.) Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Example: Eight thread worker pool for my quad-core laptop with Hyper-Threading while (work_exists()) { work = get_new_work(); work.run(); }
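A minimal sketch of creating such a pool, assuming a worker_main function that runs the while (work_exists()) loop shown on the slide; sizing the pool with std::thread::hardware_concurrency() gives one worker per hardware execution context (e.g., eight on the quad-core Hyper-Threaded laptop mentioned above).

#include <thread>
#include <vector>

void worker_main(int id);   // assumed: runs the while (work_exists()) loop from the slide

int main() {
    unsigned n = std::thread::hardware_concurrency();   // e.g., 8 on a quad-core laptop with Hyper-Threading
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; i++)
        pool.emplace_back(worker_main, (int)i);          // create all workers once, at startup
    for (auto& t : pool) t.join();
    return 0;
}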
  • 32. Stanford CS149, Fall 2020 Consider execution of the following code cilk_spawn foo(); bar(); cilk_sync; foo() Specifically, consider execution from the point foo() is spawned spawned child continuation (rest of calling function) bar() What threads should foo() and bar() be executed by? Thread 0 Thread 1
  • 33. Stanford CS149, Fall 2020 First, consider a serial implementation Thread 0 Executing foo()… Traditional thread call stack (indicates bar() will be run next after return from foo()) Thread 1 What if, while executing foo(), thread 1 goes idle… Inefficient: thread 1 could be performing bar() at this time! Run child first… via a regular function call - Thread runs foo(), then returns from foo(), then runs bar() - Continuation is implicit in the thread’s stack bar()
  • 34. Stanford CS149, Fall 2020 Per-thread work queues store“work to do” Thread 0 Thread call stack Thread 1 Thread 0 work queue Thread 1 work queue Empty! bar() Executing foo()… Upon reaching cilk_spawn foo(), thread places continuation in its work queue, and begins executing foo().
  • 35. Stanford CS149, Fall 2020 Idle threads“steal”work from busy threads Thread 0 Thread call stack Thread 1 Thread 0 work queue Thread 1 work queue bar() Executing foo()… 1. Idle thread looks in busy thread’s queue for work If thread 1 goes idle (i.e., there is no work in its own queue), then it looks in thread 0’s queue for work to do.
  • 36. Stanford CS149, Fall 2020 Idle threads“steal”work from busy threads Thread 0 Thread call stack Thread 1 Thread 0 work queue Thread 1 work queue bar() Executing foo()… 1. Idle thread looks in busy thread’s queue for work 2. Idle thread moves work from busy thread’s queue to its own queue If thread 1 goes idle (i.e., there is no work in its own queue), then it looks in thread 0’s queue for work to do.
  • 37. Stanford CS149, Fall 2020 Idle threads“steal”work from busy threads Thread 0 Thread call stack Thread 1 Thread 0 work queue Thread 1 work queue Executing foo()… 1. Idle thread looks in busy thread’s queue for work 2. Idle thread moves work from busy thread’s queue to its own queue Executing bar()… 3. Thread resumes execution If thread 1 goes idle (i.e., there is no work in its own queue), then it looks in thread 0’s queue for work to do.
  • 38. Stanford CS149, Fall 2020 At spawn, should calling thread run the child or the continuation? cilk_spawn foo(); bar(); cilk_sync; foo() spawned child continuation (rest of calling function) bar() Run child first: enqueue continuation for later execution - Continuation is made available for stealing by other threads (“continuation stealing”) Run continuation first: queue child for later execution - Child is made available for stealing by other threads (“child stealing”) Which implementation do we choose?
  • 39. Stanford CS149, Fall 2020 Consider thread executing the following code for (int i=0; i<N; i++) { cilk_spawn foo(i); } cilk_sync; foo(N-1) foo(3) foo(2) foo(1) foo(0) … ▪ Run continuation first (“child stealing”) - Caller thread spawns work for all iterations before executing any of it - Think: breadth-first traversal of call graph. O(N) space for spawned work (maximum space) - If no stealing, execution order is very different than that of program with cilk_spawn removed Thread 0 Thread 0 work queue foo(N-1) foo(N-2) foo(0) …
  • 40. Stanford CS149, Fall 2020 for (int i=0; i<N; i++) { cilk_spawn foo(i); } cilk_sync; foo(N-1) foo(3) foo(2) foo(1) foo(0) … ▪ Run child first (“continuation stealing”) - Caller thread only creates one item to steal (continuation that represents all remaining iterations) - If no stealing occurs, thread continually pops continuation from work queue, enqueues new continuation (with updated value of i) - Order of execution is the same as for program with spawn removed. - Think: depth-first traversal of call graph Thread 0 Thread 0 work queue cont: i=1 Executing foo(0)… Consider thread executing the following code
  • 41. Stanford CS149, Fall 2020 for (int i=0; i<N; i++) { cilk_spawn foo(i); } cilk_sync; foo(N-1) foo(3) foo(2) foo(1) foo(0) … ▪ Run child first (“continuation stealing”) - Enqueues continuation with i advanced by 1 - If continuation is stolen, stealing thread spawns and executes next iteration - Can prove that work queue storage for system with T threads is no more than T times that of stack storage for single threaded execution Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue cont: i=1 Executing foo(0)… Executing foo(1)… Consider thread executing the following code
  • 42. Stanford CS149, Fall 2020 Scheduling quicksort: assume 200 elements Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Thread 2 Thread 2 work queue void quick_sort(int* begin, int* end) { if (begin >= end - PARALLEL_CUTOFF) std::sort(begin, end); else { int* middle = partition(begin, end); cilk_spawn quick_sort(begin, middle); quick_sort(middle+1, end); } } cont: 101-200 Working on 0-25… cont: 51-100 cont: 26-50 … What work in the queue should other threads steal? (e.g., steal from top or bottom)
  • 43. Stanford CS149, Fall 2020 Implementing work stealing: deque per worker Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Thread 2 Thread 2 work queue cont: 101-200 Working on 0-25… cont: 51-100 cont: 26-50 … Steal! Steal! Work queue implemented as a deque (double-ended queue) - Local thread pushes/pops from the“tail”(bottom) - Remote threads steal from“head”(top)
  • 44. Stanford CS149, Fall 2020 Implementing work stealing: deque per worker Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Thread 2 Thread 2 work queue Working on 0-25… cont: 151-200 cont: 26-50 … cont: 76-100 Working on 51-75… Working on 101-150… Work queue implemented as a deque (double-ended queue) - Local thread pushes/pops from the“tail”(bottom) - Remote threads steal from“head”(top)
  • 45. Stanford CS149, Fall 2020 Implementing work stealing: deque per worker Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Thread 2 Thread 2 work queue Working on 0-12… cont: 114-125 cont: 13-25 … Working on 51-63… Working on 101-113… cont: 64-75 cont: 76-100 cont: 26-50 cont: 126-150 cont: 151-200 Work queue implemented as a deque (double-ended queue) - Local thread pushes/pops from the“tail”(bottom) - Remote threads steal from“head”(top)
  • 46. Stanford CS149, Fall 2020 Implementing work stealing: choice of victim Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Thread 2 Thread 2 work queue Working on 0-12… cont: 114-125 cont: 13-25 … ▪ Idle threads randomly choose a thread to attempt to steal from ▪ Steal work from top of deque: - Steals largest amount of work (reduce number of steals) - Maximum locality in work each thread performs (when combined with run child first scheme) - Stealing thread and local thread do not contend for same elements of deque (efficient lock-free implementations of deques exist) Working on 51-63… Working on 101-113… cont: 64-75 cont: 76-100 cont: 26-50 cont: 126-150 cont: 151-200
  • 47. Stanford CS149, Fall 2020 Child-first work stealing scheduler anticipates divide-and-conquer parallelism void recursive_for(int start, int end) { while (start <= end - GRANULARITY) { int mid = start + (end - start) / 2; cilk_spawn recursive_for(start, mid); start = mid; } for (int i=start; i<end; i++) foo(i); } recursive_for(0, N); for (int i=0; i<N; i++) { cilk_spawn foo(i); } cilk_sync; foo(N-1) foo(3) foo(2) foo(1) foo(0) … (0, N/2) (N/2, 3N/4) (0, N/4) (N/2, 5N/8) (N/4, 3N/8) The recursive_for version generates work in parallel (the plain cilk_spawn loop does not), so it more quickly fills the parallel machine
  • 48. Stanford CS149, Fall 2020 Implementing sync for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Thread 2 Thread 2 work queue Working on foo(9)… cont: i=10 Working on foo(7)… Working on foo(8)… Thread 3 Thread 3 work queue Working on foo(6)… State of worker threads when all work from loop is nearly complete
  • 49. Stanford CS149, Fall 2020 Implementing sync: no stealing case for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Working on foo(9), id=A… cont: i=10 (id=A) block (id: A) Sync for all calls spawned within block A If no work has been stolen by other threads, then there’s nothing to do at the sync point. cilk_sync is a no-op.
  • 50. Stanford CS149, Fall 2020 Implementing sync: stalling join foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Example 1:“stalling”join policy Thread that initiates the fork must perform the sync. Therefore it waits for all spawned work to be complete. In this case, thread 0 is the thread initiating the fork Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Working on foo(0), id=A… cont: i=0 (id=A) for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 51. Stanford CS149, Fall 2020 Implementing sync: stalling join foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Working on foo(0), id=A… STOLEN (id=A) cont: i=0, id=A Steal! Idle thread 1 steals from busy thread 0 Note: descriptor for block A created The descriptor tracks the number of outstanding spawns for the block, and the number of those spawns that have completed. The 1 spawn tracked by the descriptor corresponds to foo(0) being run by thread 0. (Since the continuation is now owned by thread 1 after the steal.) id=A spawn: 1, done: 0 for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 52. Stanford CS149, Fall 2020 Implementing sync: stalling join foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Working on foo(0), id=A… cont: i=1, id=A Update count Working on foo(1), id=A… id=A spawn: 2, done: 0 STOLEN (id=A) Thread 1 is now running foo(1) Note: spawn count is now 2 for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 53. Stanford CS149, Fall 2020 Implementing sync: stalling join foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Idle! Working on foo(1), id=A… Thread 2 Thread 2 work queue cont: i=2, id=A Working on foo(2), id=A… Steal! STOLEN (id=A) id=A spawn: 3, done: 1 Thread 0 completes foo(0) (updates spawn descriptor) Thread 2 now running foo(2) STOLEN (id=A) for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 54. Stanford CS149, Fall 2020 Implementing sync: stalling join foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Idle! Idle! Thread 2 Thread 2 work queue cont: i=10, id=A Working on foo(9), id=A… id=A spawn: 10, done: 9 STOLEN (id=A) for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A Computation nearing end… Only foo(9) remains to be completed.
  • 55. Stanford CS149, Fall 2020 Implementing sync: stalling join foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Idle! Idle! Thread 2 Thread 2 work queue cont: i=10, id=A Idle! Notify done! id=A spawn: 10, done: 10 STOLEN (id=A) Last spawn completes. for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 56. Stanford CS149, Fall 2020 Implementing sync: stalling join foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Working on bar()… Idle! Thread 2 Thread 2 work queue Idle! Thread 0 now resumes continuation and executes bar() Note block A descriptor is now free. for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 57. Stanford CS149, Fall 2020 Implementing sync: greedy policy foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Working on foo(0), id=A… STOLEN (id=A) cont: i=0, id=A Steal! Example 2:“greedy”policy - When thread that initiates the fork goes idle, it looks to steal new work - Last thread to reach the join point continues execution after sync id=A spawn: 0, done: 0 for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 58. Stanford CS149, Fall 2020 Implementing sync: greedy policy foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Working on foo(0), id=A… STOLEN (id=A) cont: i=1, id=A Idle thread 1 steals from busy thread 0 (as in the previous case) id=A spawn: 2, done: 0 Working on foo(1), id=A… for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 59. Stanford CS149, Fall 2020 Implementing sync: greedy policy foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Done with foo(0)! STOLEN (id=A) cont: i=1, id=A Thread 0 completes foo(0) No work to do in local deque, so thread 0 looks to steal! id=A spawn: 2, done: 0 Working on foo(1), id=A… for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 60. Stanford CS149, Fall 2020 Implementing sync: greedy policy foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue STOLEN (id=A) cont: i=2, id=A Thread 0 now working on foo(2) id=A spawn: 3, done: 1 Working on foo(1), id=A… Working on foo(2), id=A… Steal! for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 61. Stanford CS149, Fall 2020 Implementing sync: greedy policy foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue cont: i=10, id=A Assume thread 1 is the last to finish spawned calls for block A. id=A spawn: 10, done: 9 Working on foo(9), id=A… Idle for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 62. Stanford CS149, Fall 2020 Implementing sync: greedy policy foo(9) foo(3) foo(2) foo(1) foo(0) … bar() Thread 0 Thread 0 work queue Thread 1 Thread 1 work queue Thread 1 continues on to run bar() Note block A descriptor is now free. Working on bar() Idle for (int i=0; i<10; i++) { cilk_spawn foo(i); } cilk_sync; bar(); block (id: A) Sync for all calls spawned within block A
  • 63. Stanford CS149, Fall 2020 Cilk uses greedy join scheduling ▪ Greedy join scheduling policy - All threads always attempt to steal if there is nothing to do (threads only go idle if there is no work to steal in the system) - Worker thread that initiated spawn may not be thread that executes logic after cilk_sync ▪ Remember: - Overhead of bookkeeping steals and managing sync points only occurs when steals occur - If large pieces of work are stolen, this should occur infrequently - Most of the time, threads are pushing/popping local work from their local deque
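A hedged sketch of the per-block sync descriptor used in the preceding walkthrough: a count of spawns and completions for block A. Under the greedy policy, the worker that records the final completion is the last to reach the join and resumes the code after cilk_sync. SyncBlock, note_spawn, and try_finish are hypothetical names, not Cilk runtime internals, and the sketch assumes all spawns for the block have been recorded before the final completion arrives (as in the slide walkthrough, where the spawn count is updated at each steal before that work can finish).

#include <atomic>

struct SyncBlock {
    std::atomic<int> spawned{0};   // spawns recorded for this block (one per steal, as in the slides)
    std::atomic<int> done{0};      // spawned calls that have completed

    void note_spawn() { spawned.fetch_add(1); }

    // Called when a spawned call completes. Under the assumption above, exactly
    // one caller sees done == spawned; that caller is the last to reach the join
    // and, under the greedy policy, runs the continuation after cilk_sync.
    bool try_finish() {
        int completed = done.fetch_add(1) + 1;
        return completed == spawned.load();
    }
};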
  • 64. Stanford CS149, Fall 2020 Cilk summary ▪ Fork-join parallelism: a natural way to express divide-and- conquer algorithms - Discussed Cilk Plus, but many other systems also have fork/join primitives - e.g., OpenMP ▪ Cilk Plus runtime implements spawn/sync abstraction with a locality-aware work stealing scheduler - Always run spawned child (continuation stealing) - Greedy behavior at join (threads do not wait at join, immediately look for other work to steal)