CILK/CILK++ and Reducers

A presentation based on CILK5, CILK++ Reducer papers

Published in: Technology
  • CILK and CILK++ adopt the shared memory model. No uniform address, sockets, abstraction.
  • If you have taken Comp 322: spawn is very similar to the “async” keyword in Habanero Java, and sync is similar to the “finish” scope. Cilk++ extends C++.
  • An example of the Fibonacci sequence computation in Cilk. Each invocation of the function spawns two threads. Notice that the cilk keyword is used to denote a Cilk function.
  • Cilk++ took away the cilk keyword and prefixed cilk_ to spawn and sync.
  • Directed acyclic graph. Spawn creates the parallel executions B and C; they join together and recombine to execute D.
  • Work: the time needed to execute the program serially. Parallel slackness assumption: the number of processors is much smaller than the average degree of parallelism.
  • To support dynamic task creation
  • The Cilk runtime uses a special work-stealing scheduler. There are two kinds of schedulers. In work sharing, all the workers take tasks from a unified task queue, which is less efficient for a number of reasons: there is potentially a single lock on the task queue to deal with contention, and the queue could be empty while there is still work left. The work-stealing runtime solves the problem by building an extended deque for each worker; when a worker is out of work, it steals randomly from other workers. We will demonstrate the process in the next few slides. Decentralized. Steal work rather than push work, and only when necessary. A loop contains a spawn: package the child task, stack, single processor. Lazy task creation.
  • Steal from the top to reduce contention. Stealing from the top also gets a bigger subtree (divide and conquer), which means larger task granularity and fewer steals, and it increases the possible cache locality of the program.
  • The reason
  • All sync statements compile to no-ops. Because a fast clone never has any children when it is executing, we know at compile time that all previously spawned procedures have completed, so no operations are required for a sync statement. Before it recursively spawns,
  • Looks a lot like the original fib (highlight the original sequential code); the rest is a little bit of bookkeeping. Sig is the signature, which includes the pointer to the slow clone routine; fibsig represents the slow clone. Entry point, instruction pointer. Comes back to the principle we described earlier.
  • Uses fast_fib locally
  • Set continuation in the original proc’s stack frame. Allocates a stack frame for B. Pushes B’s stack frame to the tail of the deque.
  • Pick a random victim v, where v ≠ w. Repeat this step while the deque of v is empty. Remove the oldest call stack from the deque of v, and promote all stack frames to full frames. For every promoted frame, increment the join counter of the parent frame (full by Invariant 3). Make every newly created child the rightmost child of its parent. Let loot be the youngest frame that was stolen. Promote the oldest frame now in v’s extended deque to a full frame and make it the rightmost child of loot. Increment loot’s join counter. Execute a resume-full-frame action on loot.
  • Join counter, frames left in heap (0). Assert that the frame A being stolen is a full frame and the extended deque is empty. Decrement the join counter of A. If the join counter is 0 and no worker is working on A, execute a resume-full-frame action on A. Otherwise, begin random work stealing.
  • Assert that the frame A being stolen is a full frame, the extended deque is empty, and A’s join counter is positive. Decrement the join counter of A. Execute a resume-full-frame action on A.
  • Just removing a stack frame.
  • In this case the full frame has finished execution.
  • Do nothing if it is a stack frame.
  • Little modifications. Deterministic output, provided the reduce operation is associative.
  • Can be used to parallelize many programs containing global (or nonlocal) variables without locking, atomic updating, or the need to logically restructure the code. The programmer can count on a deterministic result as long as the reducer operator is associative; commutativity is not required. Reducers operate independently of any control constructs, such as parallel for, and of any data structures that contribute their values to the final result.
  • Fast clone uses identity view
  • Example of serial execution
  • Children of A would be {B, C}. The right sibling of B would be C. User would be the view in A.
  • We distinguish two cases: the “fast path” when C is a stack frame, and the “slow path” when C is a full frame. On the fast path nothing needs to be done, because both P and C share the view stored in the map at the head of the deque to which both P and C belong. On the slow path we perform USER_P ← USER_C, which transfers ownership of child views to the parent. The other two hypermaps of C are guaranteed to be empty and do not participate in the update.
  • Again we distinguish the “fast path” when C is a stack frame from the “slow path” when C is a full frame:
  • If proc B finishes first, the results would be in the children of A. If C finishes, it would be the leftmost, and the children of A would just be a union of the current children of A and User C. Two of them are leftmost cases.
  • When C finishes, C has a left sibling, B, so the result of C is accumulated into Right of B. When B finishes, the children of A have User B.
  • 1. Doing nothing is correct because all children of P, if any exist, were stack frames, and thus they transferred ownership of their views to P when they completed. Thus, no outstanding child views exist that must be reduced into P’s. 2. Then after P passes the cilk_sync statement but before executing any client code, we perform the update. This update reduces all reducers of completed children into the parent.
  • Comparing reducers against mutual exclusion
  • Future scaling with dynamic parallelism. Provides a simple way to add incremental parallelism. Incremental parallelization of programs. Inspired many future works, such as Habanero Java, Habanero C, and X10.
  • Eagerly saving all the state, gather the states using an Exception when they make a steal
  • CILK/CILK++ and Reducers
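
The spawn/sync pattern described in the notes can be approximated in standard C++ for readers without a Cilk compiler. The following is only a sketch under stated assumptions: `std::async` stands in for spawn and `future::get()` stands in for sync, the names `fib_serial`, `fib_parallel`, and the `cutoff` parameter are hypothetical, and each "spawn" here may create a full OS thread, which is far heavier than a Cilk spawn scheduled by work stealing.

```cpp
#include <cassert>
#include <future>

// Plain serial Fibonacci, used below the cutoff.
long fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Hypothetical analogue of the Cilk fib example (not Cilk itself).
// cutoff is an assumed tuning knob: at or below it we run serially,
// so that we do not create a thread for every tiny subproblem.
long fib_parallel(int n, int cutoff = 20) {
    assert(n >= 0);
    if (n < 2) return n;
    if (n <= cutoff) return fib_serial(n);
    // "spawn": the child may run concurrently with the continuation.
    std::future<long> n1 =
        std::async(std::launch::async, fib_parallel, n - 1, cutoff);
    // The continuation computes the second term in the current thread.
    long n2 = fib_parallel(n - 2, cutoff);
    // "sync": wait for the spawned child before combining results.
    return n1.get() + n2;
}
```

As in the Cilk++ version, only the first recursive call is launched asynchronously; the second runs in the calling thread, so the caller always has useful work to do while the child runs.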

    1. CILK/CILK++ AND REDUCERS • YUNMING ZHANG • RICE UNIVERSITY
    2. OUTLINE • CILK and CILK++ Language Features and Usages • Work stealing runtime • CILK++ Reducers • Conclusions
    3. IDEALIZED SHARED MEMORY ARCHITECTURE • Hardware model • Processors • Shared global memory • Software model • Threads • Shared variables • Communication • Synchronization (Slide from Comp 422, Rice University, Lecture 4)
    4. CILK AND CILK++ DESIGN GOALS • Programmer friendly • Dynamic tasking • Parallel extension to C • Scalable performance • Efficient runtime system • Minimum program overhead
    5. CILK KEYWORDS • Cilk: a Cilk function • Spawn: call can execute asynchronously in a concurrent thread • Sync: current thread waits for all locally-spawned functions
    6. CILK EXAMPLE (borrowed from Comp 422, Rice University, Lecture 4)
        cilk int fib(int n) {
            if (n < 2) return n;
            else {
                int n1, n2;
                n1 = spawn fib(n-1);
                n2 = spawn fib(n-2);
                sync;
                return (n1 + n2);
            }
        }
    7. CILK++ EXAMPLE (borrowed from Comp 422, Rice University, Lecture 4)
        int fib(int n) {
            if (n < 2) return n;
            else {
                int n1, n2;
                n1 = cilk_spawn fib(n-1);
                n2 = fib(n-2);
                cilk_sync;
                return (n1 + n2);
            }
        }
    8. CILK++ EXAMPLE WITH DAG (Pictures from the talk “Reducers and Other CILK+ HyperObjects” by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).)
    9. OUTLINE • CILK and CILK++ Language Features and Usages • Work stealing runtime • CILK++ Reducers • Conclusions
    10. WORK FIRST PRINCIPLE • Work: T_1 • Critical path length: T_∞ • Number of processors: P • Expected time: T_P = T_1/P + O(T_∞) • Parallel slackness assumption: T_1/P ≫ C_∞ T_∞
    11. WORK FIRST PRINCIPLE • Minimize scheduling overhead borne by work at the expense of increasing critical path • T_P ≤ C_1 T_S/P + C_∞ T_∞ ≈ C_1 T_S/P • Minimize C_1 even at the expense of a larger C_∞
    12. WORK STEALING DESIGN GOALS • Minimizing contention • Decentralized task deque • Doubly linked deque • Minimizing communication • Steal work rather than push work • Load balance across cores • Lazy task creation • Steal from the top of the deque
    13-26. CILK WORK STEALING SCHEDULER (14 picture slides; pictures from the talk credited on slide 8)
    27. TWO CLONE STRATEGY • Fast clone • Identical in most respects to the C elision of the Cilk program • Very little execution overhead • Sync statements compile to no-ops • Allocates a continuation • Program variables and instruction pointer • Slow clone • Convert a spawn schedule to slow clone only when it is stolen • Restores program state from the activation frame that contains local variables, program counter, and other parts of the procedure instance
    28. FAST CLONE
    29. SLOW CLONE (pseudocode; the slide's annotations are kept as comments)
        slow_fib(frame *_cilk_frame) {
            /* restore states of the program */
            switch (_cilk_frame->header.entry) {
                /* fast_fib(_cilk_frame->n - 1); */
                case 1: goto _cilk_sync1;
                /* fast_fib(_cilk_frame->n - 2); */
                case 2: goto _cilk_sync2;
                /* sync (not a no-op) */
                case 3: goto _cilk_sync3;
            }
        }
    30. EXTENDED DEQUE WITH CALL STACKS (figure labels: stack frame, full frame, extended deque, call stack)
    31. FRAMES • C++ Main Frame • Local variables of the procedure instance • Temporary variables • Linkage information for return values
    32. FRAMES • CILK++ Stack Frame • Everything in C++ Main Frame • Continuation • Parent pointer • Has exactly one child • Used by fast clone • A worker can have multiple stack frames
    33. FRAMES • CILK++ Full Frame (used by slow clone) • Everything in CILK++ Stack Frame • Lock • Join counter • List of children (can have more than one child) • A worker has at most one full frame
    34. FUNCTION CALL (figure: extended deque before the function call, showing stack frames and full frames; each of slides 34-51 also lists the runtime actions: function call, spawn, call return, spawn return, sync, randomly steal, provably good steal, unconditionally steal, resume full frame)
    35. FUNCTION CALL (figure: extended deque after the function call) • New stack frame
    36. SPAWN (figure: extended deque before the spawn call)
    37. SPAWN (figure: extended deque after the spawn call) • Set continuation in last stack frame
    38. RESUME FULL FRAME (figure: extended deque) • Set the full frame to be the only frame in the call stack, resume execution on the continuation
    39. RANDOMLY STEAL (figure: extended deque) • Steal this call stack
    40. RANDOMLY STEAL (figure: extended deque; labels 1, 1, 1) • Steal this call stack
    41. RANDOMLY STEAL (figure: extended deque; labels 1, 1, 1)
    42. PROVABLY GOOD STEAL (figure: extended deque; label 0)
    43. UNCONDITIONALLY STEAL (figure: extended deque; label 2)
    44. FUNCTION CALL RETURN (figure: extended deque before return from a call, case 1)
    45. FUNCTION CALL RETURN (figure: extended deque, return from a call, case 1)
    46. FUNCTION CALL RETURN (figure: extended deque, return from a call, case 2) • Worker executes an unconditional steal
    47. SPAWN RETURN (figure: extended deque before spawn return, case 1)
    48. SPAWN RETURN (figure: extended deque after spawn return, case 1)
    49. SPAWN RETURN (figure: extended deque, return from a spawn, case 2) • Worker executes a provably good steal
    50. SYNC (figure: extended deque, sync case 1) • Do nothing if it is a stack frame (no-op)
    51. SYNC (figure: extended deque, sync case 2) • Pop the frame, provably good steal
    52. OUTLINE • CILK and CILK++ Language Features and Usages • Work stealing runtime • CILK++ Reducers • Conclusions
    53. PROBLEMS WITH NON-LOCAL VARIABLES
        bool has_property(Node *);
        List<Node *> output_list;
        void walk(Node *x) {
            if (x) {
                if (has_property(x))
                    output_list.push_back(x);
                cilk_spawn walk(x->left);
                walk(x->right);
                cilk_sync;
            }
        }
    54. REDUCER DESIGN GOALS • Support parallelization of programs containing global variables • Enable efficient parallel scaling by avoiding a single point of contention • Provide deterministic results for associative reduce operations • Operate independently of any control constructs
    55. REDUCER EXAMPLE
        bool has_property(Node *);
        List_append_reducer<Node *> output_list;
        void walk(Node *x) {
            if (x) {
                if (has_property(x))
                    output_list.push_back(x);
                cilk_spawn walk(x->left);
                walk(x->right);
                cilk_sync;
            }
        }
    56. HYPER OBJECTS (pictures from the talk credited on slide 8)
    57. REDUCER (pictures from the talk credited on slide 8)
    58. SEMANTICS OF REDUCERS • The child strand owns the view owned by the parent function before cilk_spawn • The parent strand owns a new view, initialized to the identity view e • A special optimization covers the case where a view is unchanged when combined with the identity view • The parent strand P owns the view from completed child strands
    59-60. REDUCING OVER LIST CONCATENATION (pictures from the talk credited on slide 8)
    61. IMPLEMENTATION OF REDUCER • Each worker maintains a hypermap • Hypermap: maps reducers to the views • User: the view of the current procedure • Children: the view of the children procedures • Right: the view of the right sibling • Identity: the default value of a view
    62. UNDERSTANDING HYPERMAPS
        bool has_property(Node *);
        List_append_reducer<Node *> output_list;
        void walk(Node *x)                   // proc A
        {
            if (x) {
                if (has_property(x))
                    output_list.push_back(x);
                cilk_spawn walk(x->left);    // proc B
                cilk_spawn walk(x->right);   // proc C
                cilk_sync;
            }
        }
    63-67. HYPERMAP CREATION (pictures from the talk credited on slide 8)
    68. LOOK UP FAILURE • Inserts a view containing an identity element for the reducer into the hypermap • Following the lazy principle • Look up returns the newly inserted identity view
    69. RANDOM WORK STEALING • A random steal operation steals a full frame P and replaces it with a new full frame C in the victim. • USER_C ← USER_P; USER_P ← ∅; CHILDREN_P ← ∅; RIGHT_P ← ∅.
    70. RANDOM WORK STEALING (pictures from the talk credited on slide 8)
    71. RETURN FROM A CALL • Let C be a child frame of the parent frame P that originally called C, and suppose that C returns. • If C is a stack frame, do nothing • If C is a full frame • Transfer ownership of view • Children and Right are empty • USER_P ← USER_C
    72. RETURN FROM A SPAWN • Let C be a child frame of the parent frame P that originally spawned C, and suppose that C returns. • Always do USER_C ← REDUCE(USER_C, RIGHT_C) • If C is a stack frame, do nothing • If C is a full frame • If C has a left sibling L: RIGHT_L ← REDUCE(RIGHT_L, USER_C) • Otherwise C is the leftmost child: CHILDREN_P ← REDUCE(CHILDREN_P, USER_C)
    73. SYNC • A cilk_sync statement waits until all children have completed. When frame P executes a cilk_sync, one of the following two cases applies: • If P is a stack frame, do nothing • If P is a full frame, USER_P ← REDUCE(CHILDREN_P, USER_P)
    74. BENEFITS OF REDUCERS
    75. OUTLINE • CILK and CILK++ Language Features and Usages • Work stealing runtime • CILK++ Reducers • Conclusions
    76. CONCLUSIONS • CILK and CILK++ provide a programmer friendly programming model • Extension to C • Incremental parallelism • Scaling on future machines • Non-compromising performance • Work stealing runtime • Minimizing overheads • Reducers
    77. FINAL NOTES • Designed for an idealized shared memory model • Today’s architectures are typically NUMA • Task creation can be lazier • http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6012915&tag=1 • Cilk_for • Divide and conquer parallelization
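
The reducer semantics in the slides (the spawned child keeps the current view, the continuation gets a fresh identity view, and views are reduced in order at the sync) can be illustrated with a small standard C++ sketch. This is not the Cilk++ reducer library or its hypermap implementation: the `Node` layout, the even-number `has_property` predicate, the `max_depth` thread cap, and the use of `std::thread` in place of cilk_spawn are all assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <thread>

// Illustrative tree node; the slides use Node* with an unspecified layout.
struct Node {
    int value;
    Node* left;
    Node* right;
};

// A "view" of a list-append reducer. The identity view is the empty list.
using View = std::list<int>;

// REDUCE(left, right): left becomes left ++ right, right becomes empty.
// Concatenation is associative but not commutative, so order matters.
inline void reduce(View& left, View& right) {
    assert(&left != &right);
    left.splice(left.end(), right);
}

// Assumed stand-in for the has_property() predicate declared on the slides.
inline bool has_property(const Node* x) { return x->value % 2 == 0; }

// Serial walk: defines the reference output order.
inline void walk_serial(const Node* x, View& out) {
    if (!x) return;
    if (has_property(x)) out.push_back(x->value);
    walk_serial(x->left, out);
    walk_serial(x->right, out);
}

// Parallel walk. The spawned child keeps the current view; the
// continuation gets a fresh identity view that is reduced in at the join.
// max_depth is an assumed cap on how deep new threads are created.
inline void walk_parallel(const Node* x, View& view, int depth, int max_depth) {
    if (!x) return;
    if (has_property(x)) view.push_back(x->value);
    if (depth >= max_depth) {
        walk_serial(x->left, view);
        walk_serial(x->right, view);
        return;
    }
    View continuation_view;  // identity view for the continuation
    std::thread child(walk_parallel, x->left, std::ref(view),
                      depth + 1, max_depth);
    walk_parallel(x->right, continuation_view, depth + 1, max_depth);
    child.join();                     // plays the role of cilk_sync
    reduce(view, continuation_view);  // child's view first, then the continuation's
}
```

Because the child's view is always placed to the left of the continuation's view when they are combined, `walk_parallel` produces the same list as `walk_serial` no matter which thread finishes first, and no lock is needed because the two threads never touch the same view.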

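
The deque discipline behind the work-stealing slides (the owner pushes and pops at the tail, a thief removes the oldest entry from the other end) can be sketched as follows. This is a deliberately simplified illustration, not the Cilk runtime's data structure: the class name `WorkerDeque` and its methods are hypothetical, tasks are plain values instead of frames, and a single mutex replaces whatever low-overhead protocol a real runtime uses to keep the owner's common path cheap.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical sketch of one worker's deque (names are illustrative).
template <typename Task>
class WorkerDeque {
public:
    // Owner: a spawn pushes the newest work at the tail.
    void push_tail(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push_back(std::move(t));
    }
    // Owner: take back the newest work from the tail (LIFO order).
    bool pop_tail(Task& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = std::move(q_.back());
        q_.pop_back();
        return true;
    }
    // Thief: take the oldest work from the head, the opposite end.
    bool steal_head(Task& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<Task> q_;
};

// Small self-check: the thief sees task 1 (oldest), the owner sees 3 then 2.
inline void check_steal_order() {
    WorkerDeque<int> d;
    d.push_tail(1);
    d.push_tail(2);
    d.push_tail(3);
    int t = 0;
    bool ok = d.steal_head(t);
    assert(ok && t == 1);
    ok = d.pop_tail(t);
    assert(ok && t == 3);
    ok = d.pop_tail(t);
    assert(ok && t == 2);
    ok = d.steal_head(t);
    assert(!ok);
    (void)ok;
}
```

Having the owner and the thief work at opposite ends is what keeps contention low, and in a divide-and-conquer program the oldest entry usually represents the largest remaining subtree, so one steal moves a large amount of work.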