The document discusses Cilk and Cilk++, parallel programming languages that allow spawning concurrent tasks. It covers key language features like spawn and sync, provides examples of Fibonacci implementations, and describes the work-stealing runtime system that dynamically schedules tasks across processors. The runtime uses a decentralized work-stealing approach in which idle processors steal tasks from other processors' task queues to balance the workload.
Cilk-M is a work-stealing runtime system that solves the cactus stack problem using thread-local memory mapping (TLMM). Each worker maintains its own deque of frames and manipulates the bottom of the deque like a stack. When a worker runs out of work, it steals frames from the top of a random victim's deque. This allows Cilk-M to achieve linear speedup and bounded stack space while maintaining serial-parallel reciprocity and interoperability with legacy code.
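The deque discipline described above can be caricatured in a few lines of Python. This is a single-threaded toy model, not the real runtime (actual Cilk workers use concurrent deques and run on separate threads); the `Worker` class and frame names are purely illustrative:

```python
import collections

class Worker:
    """Toy model of a work-stealing worker that owns a deque of frames."""
    def __init__(self):
        self.deque = collections.deque()

    def push(self, frame):
        self.deque.append(frame)            # owner works at the bottom (right end)

    def pop(self):
        return self.deque.pop() if self.deque else None

    def steal_from(self, victim):
        # Thieves take the oldest frame from the top (left end) of the victim.
        return victim.deque.popleft() if victim.deque else None

owner, thief = Worker(), Worker()
for frame in ["fib(10)", "fib(9)", "fib(8)"]:
    owner.push(frame)

stolen = thief.steal_from(owner)    # thief gets the oldest frame
local = owner.pop()                 # owner keeps working on the newest frame
```

The asymmetry is the point: the owner touches only the bottom of its deque and the thief only the top, which is what keeps contention between them rare.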
The document summarizes the passes of the GCC compiler from parsing to code generation. It begins with parsing the source code into an AST, followed by converting the AST to GIMPLE intermediate representation and performing optimizations like constant propagation, copy propagation and dead code elimination on the GIMPLE. Control flow graphs are constructed and optimizations are performed with SSA form. RTL is generated from GIMPLE and undergoes register allocation and instruction selection before assembly code generation. Example dumps of different passes are shown.
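Two of the GIMPLE-level optimizations mentioned above, constant propagation and dead-code elimination, can be sketched as a toy pass in Python (this is a deliberately tiny stand-in for illustration, not GCC's actual machinery; the IR encoding here is invented):

```python
def optimize(ir, live_out):
    """Constant-propagate, fold, and drop assignments nobody uses.

    ir: list of (target, expr), where expr is an int or an ("+", a, b)
    tuple over constants/variable names.  live_out: variables needed later.
    """
    consts, propagated = {}, []
    for target, expr in ir:
        if isinstance(expr, tuple):
            op, a, b = expr
            a, b = consts.get(a, a), consts.get(b, b)   # propagate constants
            if op == "+" and isinstance(a, int) and isinstance(b, int):
                expr = a + b                            # fold the expression
            else:
                expr = (op, a, b)
        if isinstance(expr, int):
            consts[target] = expr
        propagated.append((target, expr))
    # Dead-code elimination: walk backwards, keep only assignments that
    # feed a live variable.
    live, kept = set(live_out), []
    for target, expr in reversed(propagated):
        if target in live:
            kept.append((target, expr))
            if isinstance(expr, tuple):
                live.update(v for v in expr[1:] if isinstance(v, str))
    return list(reversed(kept))

ir = [("a", 2), ("b", 3), ("c", ("+", "a", "b")), ("d", ("+", "a", "a"))]
optimized_ir = optimize(ir, live_out=["c"])   # d is dead; c folds to a constant
```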
Instrumenting Go (Gopherconindia Lightning talk by Bhasker Kode)
Lightning talk by Bhasker Kode from Helpshift on instrumenting your Go code against a statsite-compatible server, with examples, screenshots, and getting-started tips.
This document discusses Return Oriented Programming (ROP), which is a technique for exploiting software vulnerabilities to execute malicious code without injecting new code. It can be done by manipulating return addresses on the program stack to divert execution flow to existing code snippets ("gadgets") that perform the desired task when executed in sequence. The document covers the anatomy of the x86 stack, common ROP attack approaches like stack smashing and return-to-libc, how gadgets work by chaining neutral instructions, and various defenses such as stack canaries, non-executable memory, address space layout randomization, and position-independent executables.
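The gadget-chaining idea can be simulated in a few lines of Python. This is a pure simulation for intuition only, nothing here exploits anything: each "gadget" is a tiny operation that ends by "returning" to whatever address the attacker placed next on the fake stack.

```python
# Each gadget does one small thing; the chain's effect comes entirely
# from the order of "return addresses" the attacker wrote to the stack.
def gadget_load_5(state):  state["eax"] = 5
def gadget_double(state):  state["eax"] *= 2
def gadget_store(state):   state["mem"] = state["eax"]

def run_rop_chain(fake_stack):
    """Pop 'return addresses' and run the gadget each one points to."""
    state = {"eax": 0, "mem": None}
    while fake_stack:
        gadget = fake_stack.pop(0)   # each 'ret' pops the next address
        gadget(state)
    return state

# The "payload": no injected code, just a sequence of existing snippets.
state = run_rop_chain([gadget_load_5, gadget_double, gadget_store])
```

The defense picture follows from the model: non-executable memory does not help (only existing code runs), while ASLR attacks the chain itself by making gadget addresses unpredictable.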
Taming OpenBSD Network Stack Dragons by Martin Pieuchot (EuroBSDCon)
Abstract
After more than 30 years of evolution, the network stack used in OpenBSD still carries a lot from its original architecture.
The ongoing work to make it process network packets on multiple cores led us to reconsider some parts of this architecture after understanding how its data structures and interfaces were really used.
This talk describes some of the non-obvious internals of OpenBSD's network stack, the dragons, and the work that has been done to tame them.
Speaker bio
Martin Pieuchot is an OpenBSD developer and an R&D engineer working for Compumatica secure networks, a Dutch/German networking appliance manufacturer.
This document discusses using ADS-B data from a home plane-spotting setup to send alerts when a specific plane's call sign is detected flying overhead. It considers technologies like the TICK Stack and Kafka for ingesting and analyzing the streaming ADS-B data in real time. The author tests ingesting the data into InfluxDB on AWS and viewing it in Grafana Cloud, but does not get alerts working in the limited time available. Lessons are learned about debugging time-series data and the challenges of free cloud tiers.
St Petersburg R user group meetup 2, Parallel R by Andrew Bzikadze
This document provides an overview of parallel computing techniques in R using various packages like snow, multicore, and parallel. It begins with motivation for parallelizing R given its limitations of being single-threaded and memory-bound. It then covers the snow package which enables explicit parallelism across computer clusters. The multicore package provides implicit parallelism using forking, but is deprecated. The parallel package acts as a wrapper for snow and multicore. It also discusses load balancing, random number generation, and provides examples of using snow and multicore for parallel k-means clustering and lapply.
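The parallel-lapply pattern those R packages provide looks roughly like the following Python analogue. This is only a shape-of-the-API sketch: snow distributes work over cluster nodes and multicore over forked processes, whereas a thread pool stands in here for simplicity, and `slow_square` is an invented placeholder for a real per-chunk computation:

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    return x * x   # stand-in for per-chunk work, e.g. one k-means restart

xs = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # Analogous to snow's parLapply / multicore's mclapply: apply a
    # function to each element, with the pool scheduling the pieces.
    results = list(pool.map(slow_square, xs))
```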
This document summarizes a system called Siphon that is used at Microsoft for streaming and analyzing large volumes of data. Some key points:
- Siphon ingests an average of 3.9 million events per second (800 TB per day) from various data sources and uses over 1,700 Kafka brokers for distribution.
- It provides low latency of 10 seconds for the 99th percentile and is used for real-time analytics and insights for services like Bing, Office 365, and internal tools.
- The document then describes the Siphon architecture with data centers around the world and how it handles streaming, batch processing, and auditing of data.
**Return-oriented programming** refers to a clever IT attack technique that is essentially a generalization of *return-to-libc* attacks, which in turn belong to the family of *stack buffer overflow exploits*.
If none of that means anything to you, don't worry: the talk first explains the basics of buffer overflows and their attack potential and walks through a few historical examples, before building the bridge to **ROP** step by step. To close, some countermeasures are briefly presented and assessed for practicality and effectiveness.
If the demo gods are willing, an example program will, among other things, be cracked live using **ROP** tools.
This talk was given at GTC16 by James Beyer and Jeff Larkin, both members of the OpenACC and OpenMP committees. It's intended to be an unbiased discussion of the differences between the two languages and the tradeoffs of each approach.
Inside LoLA - Experiences from building a state space tool for place transiti... (Universität Rostock)
LoLA is a state space tool for analyzing place/transition nets that was developed starting in 1998. It uses various reduction techniques like stubborn sets, symmetries, and linear algebra to combat state space explosion. LoLA has been applied to problems in areas like model checking, business process verification, and distributed systems. Its core data structures and algorithms keep processing costs low during operations like firing transitions and state space traversal.
A peek on numerical programming in Perl and Python, E. Christopher Dyken, 2005 (Jules Krdenas)
This document compares the numerical programming capabilities and performance of Perl and Python with and without numerical libraries like NumPy and PDL. It implements a trapezoidal quadrature rule to integrate three different functions in standard C, optimized C, Python, Python with NumPy, Python with numarray, Perl, and Perl with PDL. The results show that plain Python and Perl are much slower than C, but with numerical libraries their performance is comparable to optimized C for problems that can be formulated as element-by-element array operations. NumPy performs worse for simple functions, but the gap narrows for more complex functions that use trigonometric operations. So for numerical problems, Python and Perl with add-on libraries can be viable alternatives to C/C++.
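For reference, the composite trapezoidal rule the benchmark uses can be written in a few lines of plain Python (with NumPy or PDL the explicit loop becomes one vectorized array expression, which is where the speedups in the comparison come from):

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))     # endpoints get half weight
    for i in range(1, n):
        total += f(a + i * h)       # interior points get full weight
    return h * total

# Integrate x^2 over [0, 1]; the exact value is 1/3.
approx = trapezoid(lambda x: x * x, 0.0, 1.0, 1000)
```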
Computational Techniques for the Statistical Analysis of Big Data in R (herbps10)
The document describes techniques for improving the computational performance of statistical analysis of big data in R. It uses as a case study the rlme package for rank-based regression of nested effects models. The workflow involves identifying bottlenecks, rewriting algorithms, benchmarking versions, and testing. Examples include replacing sorting with a faster C++ selection algorithm for the Wilcoxon Tau estimator, vectorizing a pairwise function, and preallocating memory for a covariance matrix calculation. The document suggests future directions like parallelization using MPI and GPUs to further optimize R for big data applications.
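The sort-to-selection swap mentioned above (implemented in C++ in the package) rests on a standard idea that can be sketched in Python: when you need only the k-th order statistic, a full O(n log n) sort is wasted work, since expected-O(n) quickselect finds it directly. This sketch is illustrative, not the package's actual code:

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    xs = list(xs)
    while True:
        pivot = xs[random.randrange(len(xs))]
        lo = [x for x in xs if x < pivot]
        eq = [x for x in xs if x == pivot]
        if k < len(lo):
            xs = lo                      # answer is among the smaller items
        elif k < len(lo) + len(eq):
            return pivot                 # pivot is exactly the k-th smallest
        else:
            k -= len(lo) + len(eq)       # discard lo and eq, keep searching
            xs = [x for x in xs if x > pivot]

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
median = quickselect(data, len(data) // 2)   # same answer as sorted(data)[4]
```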
This document summarizes a MATLAB workshop covering various topics:
- MATLAB Central, an official user community for asking and answering questions
- Cipher systems for encrypting text by shifting letters
- Least squares methods for fitting linear models to noisy data
- Dynamic code generation to display customized text
- Techniques for accelerating MATLAB code such as vectorization and memory preallocation
The workshop emphasized best practices like utilizing help functions, saving intermediate work, and having fun with programming.
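The letter-shifting cipher from the workshop's list can be sketched as follows (in Python rather than MATLAB, and with a function name of my own choosing):

```python
def shift_cipher(text, shift):
    """Encrypt by rotating each letter through the alphabet; other
    characters pass through unchanged. Decrypt with the negated shift."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

secret = shift_cipher("Hello, MATLAB!", 3)   # shift every letter forward by 3
plain = shift_cipher(secret, -3)             # shifting back round-trips
```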
Abstract: This research introduces a novel dataflow hardware description abstraction layer that covers the numerous synthesizable uses of RTL constructs and replaces them with higher-level abstractions. We also present DFiant, a Scala-embedded HDL that applies dataflow semantics to decouple design functionality from its constraints. DFiant provides a strong, bit-accurate, type-safe foundation for describing hardware in a very concise and portable fashion. The DFiant compiler can automatically pipeline designs to meet performance requirements as synthesizable RTL code.
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5 by Jeff Larkin
These slides are from an instructor-led tutorial at GTC16. The talk discusses using a pre-release version of Clang with support for OpenMP offloading directives to NVIDIA GPUs to experiment with OpenMP 4.5 target directives.
Runtime Code Generation and Data Management for Heterogeneous Computing in Java by Juan Fumero
This document discusses runtime and data management techniques for heterogeneous computing in Java. It presents an approach that uses three levels of abstraction: parallel skeletons API based on functional programming, a high-level optimizing library that rewrites operations to target specific hardware, and OpenCL code generation and runtime with data management for heterogeneous architectures. It describes how the runtime performs type inference, IR generation, optimizations, and kernel generation to compile Java code into OpenCL kernels. It also discusses how custom array types are used to reduce data marshaling overhead between the Java and OpenCL runtimes.
Recursion & Erlang, FunctionalConf 14, Bangalore by Bhasker Kode
The document discusses the history and design of the Erlang programming language. Some key points:
1) Erlang was designed in 1986 at Ericsson for writing concurrent programs that "run forever." It was created by Joe Armstrong to address the needs of building telephony systems.
2) Concurrency was the primary goal in designing Erlang. This influenced decisions like message passing between processes instead of shared memory, and copying data between processes for isolation.
3) Tail recursion and the actor model were incorporated due to their suitability for implementing concurrent processes and distributed systems. Tail recursion allows processes to be spawned efficiently while preserving state.
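The accumulator-passing style that point 3 alludes to looks like this (sketched in Python for readability; note that Erlang, unlike Python, guarantees that such tail calls run in constant stack space, which is what lets an Erlang process loop forever):

```python
def factorial(n, acc=1):
    """Accumulator-passing style: the recursive call is the very last
    action, so a tail-call-optimizing runtime (like Erlang's) can reuse
    the current frame instead of growing the stack."""
    if n <= 1:
        return acc
    return factorial(n - 1, acc * n)
```

An Erlang server's receive loop has the same shape: the process calls itself tail-recursively with its updated state as the accumulator.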
In these slides, I describe the basics of Python programming and its functions. If you have any doubts about the slides, contact me by mail or LinkedIn. My mail id is mdsathees@gmail.com
Pragmatic Optimization in Modern Programming - Demystifying the Compiler by Marina Kolpakova
This document discusses compiler optimizations. It begins with an outline of topics including compilation trajectory, intermediate languages, optimization levels, and optimization techniques. It then provides more details on each phase of compilation, how compilers use intermediate representations to perform optimizations, and specific optimizations like common subexpression elimination, constant propagation, and instruction scheduling.
The document summarizes key points from a presentation titled "Effective Python Programming - OSCON 2005". It discusses Python fundamentals like namespaces, duck typing and exceptions. It also covers structured programming techniques in Python like iterators, generators, for/else loops and try/finally blocks. The presentation emphasizes writing effective code that makes use of Python features and follows best practices.
GCC is a widely used open source compiler. It consists of frontends for languages like C and C++ and backends that generate code for different CPU architectures. The GCC Extensibility Made Easy (GEM) framework allows dynamically loading modules to extend GCC functionality. Examples include adding new language features, improving security, and facilitating operating system development.
GCC compilers use several stages to compile C/C++ code into executable programs:
1. The preprocessor handles #include, #define, and other preprocessor directives.
2. The front-end parses the code into an abstract syntax tree (AST) and performs type checking and semantic analysis.
3. The middle-end converts the AST into the GIMPLE intermediate representation and performs optimizations like dead code elimination and constant propagation before generating register transfer language (RTL).
4. The back-end selects target-specific instructions, allocates registers, schedules instructions, and outputs assembly code, which is then linked together with other object files by the linker into a final executable.
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations by Marina Kolpakova
This document discusses various compiler optimizations including constant folding, hoisting loop invariant code, scalarization, loop unswitching, peeling and sentinels, strength reduction, loop induction variable elimination, and auto-vectorization. It provides code examples and the generated assembly for each optimization. It explains that many optimizations are performed by compilers automatically at high optimization levels, while some more advanced optimizations like loop peeling and sentinels require manual intervention.
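Two of the listed transformations, loop-invariant hoisting and strength reduction, can be written out by hand to show they preserve results (Python here purely for illustration; an optimizing C compiler applies these automatically at higher optimization levels):

```python
def naive(xs, a, b):
    out = []
    for i, x in enumerate(xs):
        # a + b is recomputed every iteration; i * 8 is a multiply per step
        out.append(x * (a + b) + i * 8)
    return out

def optimized(xs, a, b):
    s = a + b          # loop-invariant code hoisted out of the loop
    offset = 0
    out = []
    for x in xs:
        out.append(x * s + offset)
        offset += 8    # strength reduction: multiply replaced by addition
    return out

xs = [3, 1, 4, 1, 5]
```

Both versions compute the same list; the optimized form simply does less work per iteration, which is exactly what the compiler's versions of these passes buy you.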
NVIDIA joined OpenMP in 2011 to contribute to discussions around parallel accelerators. In 2012, NVIDIA proposed the TEAMS construct for accelerators, which was included in OpenMP 4.0 released in 2013 with support for accelerator directives. NVIDIA supports OpenMP because it is the dominant standard for directive-based parallel programming and allowing applications to easily accelerate using OpenMP provides a way for NVIDIA to reach more developers across multiple domains.
Object Detection Methods using Deep Learning by Sungjoon Choi
The document discusses object detection techniques including R-CNN, SPPnet, Fast R-CNN, and Faster R-CNN. R-CNN uses region proposals and CNN features to classify each region. SPPnet improves efficiency by computing CNN features once for the whole image. Fast R-CNN further improves efficiency by sharing computation and using a RoI pooling layer. Faster R-CNN introduces a region proposal network to generate proposals, achieving end-to-end training. The techniques showed improved accuracy and processing speed over prior methods.
This presentation covers the concepts of data-link access, the BSD Packet Filter, DLPI, Linux SOCK_PACKET, libpcap (the packet capture library), and libnet (the packet creation and injection library).
Detecting Deadlock, Double-Free and Other Abuses in a Million Lines of Linux ... by Peter Breuer
Presentation at the 30th Annual IEEE/NASA Software Engineering Workshop (SEW-30), Loyola College Graduate Center, Columbia, MD, USA, April 25, 2006. The preprint of the paper is at http://www.academia.edu/1413564/Detecting_deadlock_double-free_and_other_abuses_in_a_million_lines_of_linux_kernel_source. DOI 10.1109/SEW.2006.1.
Make static instrumentation great again, High performance fuzzing for Windows... by Lucas Leong
This document discusses making static binary instrumentation great again for high performance fuzzing on Windows systems. It motivates static instrumentation as an alternative to dynamic approaches, describes the implementation of static instrumentation using IDA Pro and modifying PE files, benchmarks showing comparable performance to WinAFL, and case studies finding vulnerabilities through fuzzing kernel drivers and libraries.
Power Up Your Build - Omer van Kloeten @ Wix 2018-04
I was invited to give this talk at the Wix Backend Guild Day, an internal event which was broadcast live internationally on 2018-04-12.
Video: https://youtu.be/cQ7UvUybceA
These days sbt is the de-facto build tool for Scala, but most of us just write the minimum viable build.sbt file, import the libraries we need (and maybe throw in some sbt-assembly) and forget about it.
In this Good Practices session, you will learn how to make your build safer and more robust by making the Scala compiler work for you and by using some sbt plugins.
This talk will be quite high-level. There will be no need for prior knowledge of sbt and it should be beneficial for you even if you don’t use sbt.
With Anaconda (in particular Numba and Dask) you can scale up your NumPy and Pandas stack to many cpus and GPUs as well as scale-out to run on clusters of machines including Hadoop.
The document discusses linkers and loaders, describing their functions in combining object files into executable files. It covers the ELF format, static vs dynamic linking, and how executable files are run using static or dynamic linkers. Key points include how static linkers resolve symbols and perform relocation, while dynamic linkers use shared libraries and handle relocation at runtime via the dynamic linker.
The document summarizes a presentation given by Theo Jungeblut on the topic of clean code. It discusses why clean code is important for maintainability. It also provides an overview of tools like Resharper, FxCop, StyleCop, GhostDoc and Code Contracts that can help write clean code. Principles of clean code like KISS, DRY, SoC and patterns like dependency injection are explained. The presentation emphasizes that maintainability is key to preventing code from bringing a development organization to its knees.
(Costless) Software Abstractions for Parallel ArchitecturesJoel Falcou
Performing large, intensive or non-trivial computing on array like data structures is one of the most common task in scientific computing, video game development and other fields. This matter of fact is backed up by the large number of tools, languages and libraries to perform such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exists from BLAS/LAPACK C++ binding to template meta-programming based Blitz++ or Eigen. If all of these libraries provide good performance or good abstraction, none of them seems to fit the need of so many different user types.
Moreover, as parallel system complexity grows, the need to maintain all those components quickly become unwieldy. This talk explores various software design techniques - like Generative Programming, MetaProgramming and Generic Programming - and their application to the implementation of a parallel computing librariy in such a way that:
- abstraction and expressiveness are maximized - cost over efficiency is minimized
We'll skim over various applications and see how they can benefit from such tools. We will conclude by discussing what lessons were learnt from this kind of implementation and how those lessons can translate into new directions for the language itself.
K-CAI NEURAL API is a Keras based neural network API for machine learning that will allow you to prototype with a lots of possibilities of Tensorflow! Python, Free Pascal and Delphi together in Google Colab, Git or the Community Edition.
Clean Code at Silicon Valley Code Camp 2011 (02/17/2012)Theo Jungeblut
This document provides an overview of clean code principles and practices. It discusses topics like why clean code matters, definitions of clean code, tools that help enable clean code like Resharper and FxCop, principles of clean code development such as KISS, DRY, SoC and SRP, and coding conventions. The presentation aims to demonstrate how writing clean code can improve code maintainability and efficiency. It also provides references to influential books on clean code by authors like Robert Martin.
This document discusses functional programming and its benefits. It begins with an overview of functional programming concepts like pure functions, referential transparency, and immutability. It then covers functional programming techniques like higher order functions, recursion, composition, and pattern matching. Examples are given comparing imperative and functional implementations for quicksort and optional types. The document argues that functional programming leads to cleaner code by improving modularity, testability and adherence to SOLID principles. It recommends starting with functional features in existing languages and learning Haskell to fully embrace the functional paradigm.
The document provides an overview of how Ruby programs are compiled and executed. It discusses how Ruby source code is tokenized and turned into an abstract syntax tree (AST) before being compiled into bytecode. It then describes how the Ruby interpreter implements a virtual machine that maps bytecode instructions to native operations. Key aspects covered include Ruby using a stack-based execution model, the interaction between the C stack, virtual machine stack, and Ruby call stack, and how garbage collection works through mark and sweep to reclaim unused memory.
Jordan Wiens & Peter LaFosse
Modern binary analysis, whether for discovering vulnerabilities or analyzing malware needs automation to deal with the volume of code under inspection. And yet, while Intermediate Languages (ILs) have been used for decades in compiler design and implementation, too few reverse engineers have any experience with them even though many reverse engineering tools (Binary Ninja, Ghidra, IDA) are built on top of ILs. Given that, it's time to demystify this space and make it accessible beyond just computer scientists and researchers. There's many potentially unfamiliar concepts related to ILs: single-static assignment, value-set analysis, three argument form versus tree-based designs, and others. But what matters is how these ILs can help you build better binary analysis tools. This talk not only gives you an overview of existing ILs used in reverse engineering, but more importantly, shows you how your tooling can benefit from them. From cross-platform analysis (follow a botnet from an x86-64 desktop to a mobile arm, to an embedded MIPS), to leveraging existing data-flow capabilities that brings some of the benefits both dynamic and static analysis together, this talk will demonstrate several examples of plugins that leverage ILs to improve your ability to automatically reason over compiled code.
Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016 Alexander Lisachenko
Talk about solving cross-cutting concerns in PHP at DutchPHP Conference.
Discussed questions:
1) OOP features and limitations
2) OOP patterns for solving cross-cutting concerns
3) Aspect-Oriented approach for solving cross-cutting concerns
4) Examples of using AOP for real life application
Implementation and Comparison of Softcore Multiplier Architectures for FPGAsShahid Abbas
The document discusses the implementation and comparison of softcore multiplier architectures for FPGAs. It introduces multiplier architectures like LUT-based multipliers using 3x3, 3x2 and 1x4 LUT structures. It also discusses the FloPoCo library and bit heaps for performing additions. Target specific implementations are explored along with automated and manual methods. Simulation and synthesis results are presented to evaluate the architectures.
The document discusses the STL algorithms in C++. It begins by defining what algorithms and STL algorithms are. It then covers the different classes of STL algorithms including non-modifying sequence operations, mutating sequence operations, sorting operations, general C algorithms, and general numeric operations. Specific algorithms like for_each, transform, all_of, any_of and none_of are discussed in more detail through examples. The document aims to explain what STL algorithms are and how they can be used to operate on sequences and containers in C++.
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...HostedbyConfluent
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson | Current 2022
Imagine having access to metrics, events, and insights without code modification or application redeployment. Imagine visualizing delays and tracking down performance bottlenecks in your Kafka pipeline instantly with minimal performance overhead. In this session, we show all of this is possible with eBPF.
In a live demo, we will introduce an eBPF-based, always-on, CPU profiler to visualize what your Kafka applications are spending time on. We will analyze how much time the Kafka broker spends on handling different requests and responding to polling and how much time a Kafka consumer spends on polling the broker and processing the messages. Furthermore, we will see how to detect issues by measuring consumer lags in both offsets and seconds, and how to correlate the increasing consumer lag with the CPU flame graphs. We demonstrate how not only to detect issues quickly but also to pinpoint performance bottlenecks instantly in the Kafka pipeline: e.g. garbage collection and disk/network IO.
In addition, we will provide some unique insights with eBPF: e.g. topic-centric flow graphs, consumer rebalancing lags, and under-replicated partitions.
Collecting all the data with no instrumentation and low overhead is no easy task. we will conclude by revealing the magic of eBPF and discussing the design choices and technical challenges of our network traffic tracer and Java CPU profiler that empowered deep visibility into Kafka.
This document provides an introduction to static analysis techniques for malware analysis. It begins with an overview of static analysis and the information that can be gleaned without executing code, such as file structure, binary code, related modules, and suspicious strings. Common Linux tools for static analysis like strings, file, hexdump, and objdump are introduced. Disassembly, the process of converting binary machine code to assembly code, is explained. Reverse engineering disassembled code back into C code involves understanding variables, data movement, arithmetic, control flow, functions, and calling conventions. The document concludes by introducing IDA Pro as a popular disassembler and decompiler tool for static analysis.
JIT vs. AOT: Unity And Conflict of Dynamic and Static Compilers Nikita Lipsky
Java had been constantly criticized for poor performance ever since its inception, but not so much in recent years. Thanks to optimizing dynamic native code compilers, Java performance today is very close to the performance of low level languages such as C/C++, and is even better on some classes of applications. Along with dynamic compilers, static compilers for Java have been evolving as well, so there is still no clear winner among these two approaches. It should then come as no surprise that an AOT compiler is finally going to appear even in the HotSpot JVM and OpenJDK via JEP-295, which is officially included in Java 9.
Her, I would like to dispel common myths around the old dispute on whether dynamic or static compilation is better, show that both approaches have their strengths and weaknesses, and explain why the future is the hybrid approach.
Concepts of Functional Programming for Java Brains (2010)Peter Kofler
This document provides a summary of concepts in functional programming. It discusses topics like lambda calculus, pure functions, immutable data, recursion, higher order functions, lists, folding, mapping, filtering. It provides examples in languages like Ruby, Scala, JavaScript. It also mentions ideas like laziness, currying, monads but says they were skipped. The presentation aims to introduce functional programming concepts.
Seattle Cassandra Users: An OSS Java Abstraction Layer for CassandraJosh Turner
Project Casquatch is a database abstraction layer with code generation designed to streamline Cassandra development. Out of the box it comes pre-tuned with high available policies including load balancing, geo-redundancy, connection pooling, etc., sitting on top of the DataStax driver using native APIs. All of this is abstracted behind the ever prevalent POJO. Instead of writing CQL, we utilize generic programming that allows you to simply pass a generated POJO to a save() method or populate with a getById(). This is the same code reportedly used by T-Mobile for multiple national platforms including the activation of the Apple Watch and Galaxy Watch, T-Mobile Payments, Digits, and many others.
As presented at Seattle Cassandra Users Group on June . 26th, 2019.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
2. OUTLINE
• CILK and CILK++ Language Features and Usages
• Work stealing runtime
• CILK++ Reducers
• Conclusions
3. IDEALIZED SHARED MEMORY ARCHITECTURE
• Hardware model
  • Processors
  • Shared global memory
• Software model
  • Threads
  • Shared variables
  • Communication
  • Synchronization
Slide from Comp 422 Rice University Lecture 4
4. CILK AND CILK++ DESIGN GOALS
• Programmer friendly
• Dynamic tasking
• Parallel extension to C
• Scalable performance
• Efficient runtime system
• Minimum program overhead
5. CILK KEYWORDS
• cilk: declares a Cilk procedure
• spawn: the call may execute asynchronously in a concurrent thread
• sync: the current procedure waits for all locally spawned functions to return
6. CILK EXAMPLE
cilk int fib(int n) {
    if (n < 2)
        return n;
    else {
        int n1, n2;
        n1 = spawn fib(n-1);
        n2 = spawn fib(n-2);
        sync;
        return (n1 + n2);
    }
}
Borrowed from Comp 422 Rice University Lecture 4
7. CILK++ EXAMPLE
int fib(int n) {
    if (n < 2)
        return n;
    else {
        int n1, n2;
        n1 = cilk_spawn fib(n-1);
        n2 = fib(n-2);
        cilk_sync;
        return (n1 + n2);
    }
}
Borrowed from Comp 422 Rice University Lecture 4
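A property worth noting alongside these examples (the deck returns to it later when the fast clone is described as the "C elision"): deleting the Cilk keywords leaves a valid serial program that computes the same result. A minimal sketch of the elision of fib:

```cpp
#include <cassert>

// Serial elision of the fib example: spawn/sync removed, result unchanged.
int fib(int n) {
    if (n < 2)
        return n;
    int n1 = fib(n - 1);  // was: n1 = cilk_spawn fib(n-1);
    int n2 = fib(n - 2);
    // cilk_sync elides to nothing in the serial program
    return n1 + n2;
}
```

This serial-parallel reciprocity is what lets the runtime run the fast clone with almost no overhead when no steal occurs.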
8. CILK++ EXAMPLE WITH DAG
Pictures from "Reducers and Other Cilk++ Hyperobjects" talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel)
9. OUTLINE
• CILK and CILK++ Language Features and Usages
• Work stealing runtime
• CILK++ Reducers
• Conclusions
10. WORK FIRST PRINCIPLE
• Work: T1
• Critical path length: T∞
• Number of processors: P
• Expected time: Tp = T1/P + O(T∞)
• Parallel slackness assumption: T1/P >> C∞T∞
11. WORK FIRST PRINCIPLE
• Minimize the scheduling overhead borne by the work, even at the expense of increasing the critical path
• Tp ≤ C1Ts/P + C∞T∞ ≈ C1Ts/P
• Minimize C1 even at the expense of a larger C∞
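To make the bound concrete, here it is evaluated with illustrative numbers (the values of Ts, P, C1, C∞ and T∞ below are assumptions for the sketch, not measurements from the slides). With ample parallel slackness the C∞T∞ term contributes under 1% of the total, which is why the runtime minimizes C1 even at the cost of a larger C∞:

```cpp
#include <cassert>

// Evaluates the slide's bound: Tp <= C1*Ts/P + Cinf*Tinf.
double tp_bound(double c1, double ts, double p, double cinf, double tinf) {
    return c1 * ts / p + cinf * tinf;
}
```

For example, with Ts = 10^6, P = 8, C1 = 1.1, C∞ = 10 and T∞ = 100, the work term is 137500 while the critical-path term adds only 1000, under 1% of the total.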
12. WORK STEALING DESIGN GOALS
• Minimize contention
  • Decentralized task deques
  • Doubly-linked deque
• Minimize communication
  • Steal work rather than push work
  • Load balance across cores
• Lazy task creation
• Steal from the top of the deque
13–26. CILK WORK STEALING SCHEDULER (animation; one scheduler step per slide)
Pictures from "Reducers and Other Cilk++ Hyperobjects" talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel)
27. TWO CLONE STRATEGY
• Fast clone
  • Identical in most respects to the C elision of the Cilk program
  • Very little execution overhead
  • Sync statements compile to no-ops
  • Allocates a continuation (program variables and instruction pointer)
• Slow clone
  • A spawned procedure is converted to the slow clone only when it is stolen
  • Restores program state from an activation frame that contains the local variables, program counter and other parts of the procedure instance
29. SLOW CLONE
Slow_fib(frame *_cilk_frame) {
    /* restore the state of the program */
    switch (_cilk_frame->header.entry) {
        fast_fib(_cilk_frame->n - 1);
        case 1: goto _cilk_sync1;
        fast_fib(_cilk_frame->n - 2);
        case 2: goto _cilk_sync2;
        sync;  /* not a no-op in the slow clone */
        case 3: goto _cilk_sync3;
    }
}
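The entry-switch trick above can be modeled in plain C++. The sketch below is illustrative, not Cilk's real runtime API (Frame, entry and resume are invented names): the frame records how far the procedure got, and re-entering the function dispatches on that saved entry point, jumping past work that already ran, exactly the mechanism the slow clone uses to restart a stolen procedure.

```cpp
#include <cassert>
#include <vector>

// Hypothetical frame: state lives here, not on the C stack, so the
// procedure can be suspended and resumed by a different worker.
struct Frame {
    int entry = 0;  // resume point, recorded before each suspension
    int acc   = 0;  // partial results saved in the frame
};

// Runs one step per call; returns false when the procedure is finished.
bool resume(Frame &f, std::vector<int> &log) {
    switch (f.entry) {
    case 0:
        log.push_back(10); f.acc += 10;
        f.entry = 1; return true;   // suspend: state saved in the frame
    case 1:
        log.push_back(20); f.acc += 20;
        f.entry = 2; return true;
    case 2:
        log.push_back(30); f.acc += 30;
        f.entry = 3; return false;  // done
    }
    return false;
}
```

Each call resumes where the previous one left off, so no step runs twice even though the function is re-entered from the top.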
31. FRAMES
• C++ Main Frame
  • Local variables of the procedure instance
  • Temporary variables
  • Linkage information for return values
32. FRAMES
• CILK++ Stack Frame
  • Everything in a C++ Main Frame
  • Continuation
  • Parent pointer
  • Has exactly one child
  • Used by the fast clone
  • A worker can have multiple stack frames
33. FRAMES
• CILK++ Full Frame (used by the slow clone)
  • Everything in a CILK++ Stack Frame
  • Lock
  • Join counter
  • List of children (may have more than one child)
  • A worker has at most one Full Frame
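The frame hierarchy on slides 31–33 can be sketched as nested structs. These are illustrative declarations, not the real Cilk++ runtime types: a stack frame extends an ordinary activation record with a continuation and a parent link, and a full frame adds the synchronization state that only a stolen procedure needs.

```cpp
#include <cassert>
#include <vector>

struct Continuation { int entry = 0; };     // saved resume point

struct StackFrame {                          // fast-clone bookkeeping
    Continuation cont;
    StackFrame *parent = nullptr;            // exactly one child at a time
};

struct FullFrame : StackFrame {              // slow-clone bookkeeping
    int join_counter = 0;                    // outstanding spawned children
    std::vector<FullFrame *> children;       // may have several children
    // a lock would protect these fields in a real multi-worker runtime
};
```

Keeping the join counter and child list out of the common stack frame is what keeps the fast clone's overhead close to a plain function call.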
34. FUNCTION CALL
34
Stack frame
Full frame
Extended Deque (Before Function Call)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
35. FUNCTION CALL
35
Stack frame
Full frame
Extended Deque (After Function Call)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
New stack
frame
36. SPAWN
36
Stack frame
Full frame
Extended Deque (Before Spawn Call)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
37. SPAWN
37
Stack frame
Full frame
Extended Deque (After Spawn Call)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Set
continuation
in last stack
frame
38. RESUME FULL FRAME
38
Stack frame
Full frame
Extended DequeFunction call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Set the full frame to be the only frame in the
call stack, resume execution on the
continuation
39. RANDOMLY STEAL
39
Stack frame
Full frame
Extended DequeFunction call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Steal this call stack
40. RANDOMLY STEAL
40
Stack frame
Full frame
Extended DequeFunction call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Steal this call stack
1 1 1
41. RANDOMLY STEAL
41
Stack frame
Full frame
Extended Deque
Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
1
1 1
42. PROVABLY GOOD
STEAL
42
Stack frame
Full frame
Extended DequeFunction call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
0
44. FUNCTION CALL
RETURN
44
Stack frame
Full frame
Extended Deque (Before Return from a Call Case1)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
45. FUNCTION CALL
RETURN
45
Stack frame
Full frame
Extended Deque (Return from a Call Case 1)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
46. FUNCTION CALL
RETURN
46
Stack frame
Full frame
Extended Deque (Return from a Call Case2)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Worker executes an
unconditional steal
47. SPAWN RETURN
47
Stack frame
Full frame
Extended Deque (Before Spawn return Case 1)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
48. SPAWN RETURN
48
Stack frame
Full frame
Extended Deque (After Spawn return Case 1)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
49. SPAWN RETURN
49
Extended deque (return from a spawn, Case 2)
Worker executes a provably good steal
50. SYNC
50
Extended deque (sync, Case 1)
Do nothing if it is a stack frame (no-op)
51. SYNC
51
Extended deque (sync, Case 2)
Pop the frame,
provably good steal
52. OUTLINE
• CILK and CILK++ Language Features and
Usages
• Work stealing runtime
• CILK++ Reducers
• Conclusions
52
53. PROBLEMS WITH
NON-LOCAL VARIABLES
bool has_property(Node *);
List<Node *> output_list;
void walk(Node *x)
{
if (x) {
if (has_property(x))
output_list.push_back(x);
cilk_spawn walk(x->left);
walk(x->right);
cilk_sync;
}
}
53
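A hedged Python model (not Cilk++ code) of why the non-local output_list above is a problem: with the spawned strand and the continuation both appending to one global list, the final order depends on how their appends interleave. The strand names and values here are illustrative.

```python
import itertools

# Appends performed by the two parallel strands (hypothetical values):
spawned = ["left1", "left2"]   # appends from the cilk_spawn'd walk(x->left)
continuation = ["right1"]      # appends from the continuation walk(x->right)

def is_subsequence(sub, seq):
    """True if sub appears in seq in order (each strand's own order is preserved)."""
    it = iter(seq)
    return all(x in it for x in sub)

# Enumerate every interleaving that preserves each strand's internal order.
orders = {
    merged
    for merged in itertools.permutations(spawned + continuation)
    if is_subsequence(spawned, merged) and is_subsequence(continuation, merged)
}

print(len(orders))  # 3 possible output orders: the result is nondeterministic
```

Even with only three appends there are multiple legal output orders, which is exactly the nondeterminism that reducers (next slides) remove.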
54. REDUCER
DESIGN GOALS
• Support parallelization of programs
containing global variables
• Enable efficient parallel scaling by
avoiding a single point of contention
• Provide a deterministic result for
associative reduce operations
• Operate independently of any control
constructs
54
55. REDUCER EXAMPLE
bool has_property(Node *);
List_append_reducer<Node *> output_list;
void walk(Node *x)
{
if (x) {
if (has_property(x))
output_list.push_back(x);
cilk_spawn walk(x->left);
walk(x->right);
cilk_sync;
}
}
55
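A hedged Python sketch of the reducer semantics the next slides describe (a model, not the Cilk++ library): the spawned child keeps the parent's view, the continuation gets a fresh identity view, and at the sync the views are combined with an associative REDUCE in serial order, so the output matches the serial execution.

```python
def identity():
    # Identity view for a list-append reducer: an empty list.
    return []

def reduce_views(left, right):
    # Associative combine for list append: concatenation, with the
    # left (earlier-in-program-order) view first.
    return left + right

# Parent appends 'a', then spawns a child that appends 'b' and 'c'.
parent_view = identity()
parent_view.append("a")

child_view = parent_view           # child strand owns the parent's view
child_view.append("b")
child_view.append("c")

continuation_view = identity()     # parent continuation gets a fresh view
continuation_view.append("d")

# At cilk_sync the views are reduced in serial (left-to-right) order:
result = reduce_views(child_view, continuation_view)
print(result)  # ['a', 'b', 'c', 'd'], exactly the serial execution's output
```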
56. HYPEROBJECTS
56
Pictures from "Reducers and Other Cilk++ Hyperobjects" talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).
57. REDUCER
57
58. SEMANTICS OF
REDUCERS
• The child strand owns the view owned by the parent function before the cilk_spawn
• The parent strand owns a new view, initialized to the identity element e
• A special optimization elides the reduction when a view is combined with an identity view, since the view is unchanged
• The parent strand P owns the views from completed child strands
58
59. REDUCING OVER LIST
CONCATENATION
59
60. REDUCING OVER LIST
CONCATENATION
60
61. IMPLEMENTATION OF
REDUCER
• Each worker maintains a hypermap
• A hypermap maps reducers to views:
• User: the view of the current procedure
• Children: the views of completed child procedures
• Right: the view of the right sibling
• Identity: the default value of a view
61
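A hedged Python model of the hypermap bookkeeping above (the names are illustrative, not the runtime's actual API): each frame carries User, Children, and Right maps from reducer to view, and a lookup that misses lazily installs an identity view.

```python
class Frame:
    """A frame's three hypermaps, each mapping reducer name -> view."""
    def __init__(self):
        self.user = {}      # view of the current procedure
        self.children = {}  # accumulated views of completed children
        self.right = {}     # accumulated view of the right sibling

def lookup(frame, reducer, identity):
    # Lookup-failure rule: on a miss, lazily insert an identity view and
    # return it, so an untouched reducer costs nothing.
    if reducer not in frame.user:
        frame.user[reducer] = identity()
    return frame.user[reducer]

f = Frame()
view = lookup(f, "output_list", list)  # miss: creates and returns []
view.append("node1")
same = lookup(f, "output_list", list)  # hit: returns the same view
print(same)  # ['node1']
```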
63. HYPERMAP CREATION
64
64. HYPERMAP CREATION
65
65. HYPERMAP CREATION
66
66. HYPERMAP CREATION
67
67. HYPERMAP CREATION
68
68. LOOK UP FAILURE
• On a lookup failure, the runtime inserts a view containing an identity element for the reducer into the hypermap
• This follows the lazy principle
• The lookup then returns the newly inserted identity view
69
69. RANDOM WORK
STEALING
A random steal operation steals a full frame
P and replaces it with a new full frame C in
the victim.
USER_C ← USER_P
USER_P ← ∅
CHILDREN_P ← ∅
RIGHT_P ← ∅
70
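The steal-time hypermap update above can be sketched in the same Python model (illustrative names, not the runtime's API): the new full frame C inherits the victim frame P's User map, and P's three maps are emptied.

```python
class Frame:
    def __init__(self):
        self.user, self.children, self.right = {}, {}, {}

def random_steal_update(P):
    """Hypermap update when full frame P is stolen: create C with
    USER_C <- USER_P, then empty USER_P, CHILDREN_P, RIGHT_P."""
    C = Frame()
    C.user = P.user          # USER_C <- USER_P
    P.user = {}              # USER_P <- empty
    P.children = {}          # CHILDREN_P <- empty
    P.right = {}             # RIGHT_P <- empty
    return C

P = Frame()
P.user = {"output_list": ["a"]}
C = random_steal_update(P)
print(C.user, P.user)  # {'output_list': ['a']} {}
```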
70. RANDOM WORK
STEALING
71
71. RETURN FROM A CALL
Let C be a child frame of the parent frame P
that originally called C, and suppose that C
returns.
• If C is a stack frame, do nothing
• If C is a full frame, transfer ownership of C's view:
• USER_P ← USER_C
• CHILDREN_C and RIGHT_C are guaranteed to be empty
77
72. RETURN FROM A
SPAWN
Let C be a child frame of the parent frame P that
originally spawned C, and suppose that C returns.
• Always do USER_C ← REDUCE(USER_C, RIGHT_C)
• If C is a stack frame, do nothing further
• If C is a full frame:
• If C has a left sibling L:
• RIGHT_L ← REDUCE(RIGHT_L, USER_C)
• If C is the leftmost child:
• CHILDREN_P ← REDUCE(CHILDREN_P, USER_C)
78
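The spawn-return updates above can be sketched as follows (a hedged Python model; REDUCE here is list concatenation, and the parameter names are illustrative):

```python
def reduce_views(left, right):
    # Associative combine; the left operand is earlier in serial order.
    return left + right

def spawn_return(user_c, right_c, left_sibling=None, parent=None):
    """Model of the spawn-return update for a full frame C. Returns the
    updated (RIGHT of left sibling, CHILDREN of parent); exactly one of
    left_sibling / parent receives the child's view."""
    # Always fold the accumulated right-sibling views into the child first:
    user_c = reduce_views(user_c, right_c)   # USER_C <- REDUCE(USER_C, RIGHT_C)
    if left_sibling is not None:
        # C has a left sibling L: deposit into RIGHT_L.
        return reduce_views(left_sibling, user_c), parent
    # C is the leftmost child: deposit into CHILDREN_P.
    return left_sibling, reduce_views(parent, user_c)

# C (not leftmost) returns: its view lands in its left sibling's RIGHT map.
right_l, _ = spawn_return(["c1"], [], left_sibling=["l1"])
print(right_l)  # ['l1', 'c1']

# A leftmost child returns: its view lands in the parent's CHILDREN map.
_, children_p = spawn_return(["b1"], ["r1"], parent=["p1"])
print(children_p)  # ['p1', 'b1', 'r1']
```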
73. SYNC
A cilk_sync statement waits until all children have completed. When frame P executes a cilk_sync, one of the following two cases applies:
• If P is a stack frame, do nothing.
• If P is a full frame:
• USER_P ← REDUCE(CHILDREN_P, USER_P)
82
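The full-frame sync case can be sketched the same way (a hedged Python model with illustrative values): the accumulated children views are reduced into the parent's view with the children on the left, preserving serial order.

```python
def reduce_views(left, right):
    # Associative combine; left operand is earlier in serial order.
    return left + right

# At cilk_sync on a full frame P: USER_P <- REDUCE(CHILDREN_P, USER_P).
# CHILDREN_P holds the completed children's combined view, which precedes
# the parent's continuation view in serial order.
children_p = ["child1", "child2"]
user_p = ["after_sync_region"]
user_p = reduce_views(children_p, user_p)
print(user_p)  # ['child1', 'child2', 'after_sync_region']
```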
75. OUTLINE
• CILK and CILK++ Language Features and
Usages
• Work stealing runtime
• CILK++ Reducers
• Conclusions
84
76. CONCLUSIONS
• CILK and CILK++ provide a programmer-friendly programming model
• Extension to C
• Incremental parallelism
• Scaling on future machines
• Uncompromising performance
• Work stealing runtime
• Minimizing overheads
• Reducers
85
77. FINAL NOTES
• Designed for an idealized shared memory
model
• Today’s architectures are typically NUMA
• Task creation can be lazier
• http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6012915&tag=1
• cilk_for
• Divide and conquer parallelization
86
Editor's Notes
CILK and CILK++ adopt the shared memory model: a single uniform address space, not sockets, is the abstraction.
If you have taken Comp 322: spawn is very similar to the "async" keyword in Habanero Java, and the sync keyword is similar to the "finish" scope. Cilk++ extends C++.
An example of the Fibonacci sequence computation in Cilk. Spawn two threads at each invocation of the function; notice the cilk keyword is used to denote a Cilk function.
Cilk++ took away the cilk keyword and prefixed spawn and sync with cilk_.
Directed acyclic graph: spawn creates parallel executions B and C; they join together and recombine to execute D.
Work: the time needed to execute the program serially. Parallel slackness assumption: the number of processors is much smaller than the average degree of parallelism.
To support dynamic task creation
The Cilk runtime uses a special work-stealing scheduler. There are two kinds of schedulers. In work sharing, all the workers pull from a unified task queue; it is less efficient for a number of reasons: there is potentially a single lock on the task queue to deal with contention, and the queue can be empty while work still remains elsewhere. The work-stealing runtime solves the problem by building an extended deque for each worker; when a worker is out of work, it steals randomly from other workers. We will demonstrate the process in the next few slides. Decentralized: push work rather than pull work (when necessary). A loop containing a spawn packages the child task on the stack of a single processor (lazy task creation).
Steal from the top to reduce contention. Stealing from the top also gets a bigger subtree (divide and conquer) and larger task granularity, minimizing steals, and it increases the possible locality of the program (cache locality).
The reason
All sync statements compile to no-ops: a fast clone never has any children when it is executing, so we know at compile time that all previously spawned procedures have completed, and no operations are required for a sync statement before it recursively spawns.
Looks a lot like the original fib (highlight the original sequential code); the rest is bookkeeping. A little bit of bookkeeping: sig is the signature, which includes the pointer to the slow clone routine; fibsig represents the slow clone. The entry point is an instruction pointer. This comes back to the principle we described earlier.
Uses fast_fib locally
Set the continuation in the original procedure's stack frame; allocate a stack frame for B; push B's stack frame onto the tail of the deque.
Pick a random victim v, where v ≠ w. Repeat this step while the deque of v is empty. Remove the oldest call stack from the deque of v, and promote all stack frames to full frames. For every promoted frame, increment the join counter of the parent frame (full by Invariant 3). Make every newly created child the rightmost child of its parent. Let loot be the youngest frame that was stolen. Promote the oldest frame now in v's extended deque to a full frame and make it the rightmost child of loot. Increment loot's join counter. Execute a resume-full-frame action on loot.
Join counter: frames left in the heap (0). Assert that the frame A being stolen is a full frame and the extended deque is empty. Decrement the join counter of A. If the join counter is 0 and no worker is working on A, execute a resume-full-frame action on A. Otherwise, begin random work stealing.
Assert that the frame A being stolen is a full frame, the extended deque is empty, and A's join counter is positive. Decrement the join counter of A. Execute a resume-full-frame action on A.
Just removing a stack frame
In this case the full frame has finished execution.
Do nothing if it is a stack frame
Little modification; deterministic output as long as the reduce operation is associative.
Can be used to parallelize many programs containing global (or nonlocal) variables without locking, atomic updating, or the need to logically restructure the code. The programmer can count on a deterministic result as long as the reducer operator is associative; commutativity is not required. Reducers operate independently of any control constructs, such as parallel for loops, and of any data structures that contribute their values to the final result.
Fast clone uses identity view
Example of serial execution
Children of A would be {B, C}; the right sibling of B would be C; User would be the view in A.
We distinguish two cases: the "fast path" when C is a stack frame, and the "slow path" when C is a full frame. The fast path does nothing because both P and C share the view stored in the map at the head of the deque to which both P and C belong. The slow path transfers ownership of child views to the parent; the other two hypermaps of C are guaranteed to be empty and do not participate in the update.
Again we distinguish the “fast path” when C is a stack frame from the “slow path” when C is a full frame:
If proc B finishes first, the results would be in Children of A. If C finishes first, it would be the leftmost, and Children of A would be the combination of the current Children of A and USER_C. Both are leftmost cases.
When C finishes, it has a left sibling B, so the result of C is accumulated into RIGHT_B. When B finishes, Children of A receives USER_B.
1. Doing nothing is correct because all children of P, if any existed, were stack frames, and thus they transferred ownership of their views to P when they completed. Thus, no outstanding child views exist that must be reduced into P's. 2. After P passes the cilk_sync statement, but before executing any client code, we perform the update. This update reduces all reducers of completed children into the parent.
Comparing reducers against mutual exclusion
Future scaling with dynamic parallelism. Provides a simple way to add incremental parallelism (incremental parallelization of programs). Inspired many later works, such as Habanero Java, Habanero C, and X10.
Eagerly saving all the state; the states are gathered using an exception when workers make a steal.