Mit cilk

1. The Era of Multicore Is Here 1 Source: www.newegg.com

2. Memory Network … ¢ ¢ ¢ P P P Chip Multiprocessor (CMP) Multicore Architecture* 2 *The first non-embedded multicore microprocessor was the Power4 from IBM (2001).

3. Concurrency Platforms Aconcurrency platform,that provides linguistic support and handles load balancing, can ease the task of parallel programming. User Application Concurrency Platform Operating System 3

4. Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson I-Ting Angelina Lee Computer Science and Artificial Intelligence LaboratoryMassachusetts Institute of Technology March 22, Intel XTRL / USA

5. Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson I-Ting Angelina Lee Computer Science and Artificial Intelligence LaboratoryMassachusetts Institute of Technology March 22, Intel XTRL / USA

6. Three Desirable Criteria Interoperability with serial code, including binaries Serial-ParallelReciprocity GoodPerformance BoundedStack Space Reasonable space usage compared to serial execution Ample parallelism  linear speedup 6

7. Various Strategies Cilk++ TBB Cilk Plus 7 The Cactus-Stack Problem: how to satisfy all three criteriasimultaneously.

8. The Cactus-Stack Problem Customer Engineer SP Reciprocity Space Usage Performance 8

9. The Cactus-Stack Problem Parallelize my software? SP Reciprocity Space Usage Performance 9

10. The Cactus-Stack Problem Sure! Use my concurrency platform! SP Reciprocity Space Usage Performance 10

11. The Cactus-Stack Problem Sure! Use my concurrency platform! SP Reciprocity Space Usage Performance 11

12. The Cactus-Stack Problem Just be sure to recompile all your codebase. Space Usage Performance 12

13. The Cactus-Stack Problem Hm … I use third party binaries … Space Usage Performance 13

14. The Cactus-Stack Problem *Sigh*. Ok fine. SP Reciprocity Space Usage Performance 14

15. The Cactus-Stack Problem Upgrade your RAM then … SP Reciprocity Performance 15

16. The Cactus-Stack Problem … you are gonna need extra memory. SP Reciprocity Performance 16

17. The Cactus-Stack Problem … no? SP Reciprocity Performance 17

18. The Cactus-Stack Problem … no? SP Reciprocity Space Usage Performance 18

19. The Cactus-Stack Problem Well … you didn’t say you want any performance guarantee, did you? ⌃ # SP Reciprocity Space Usage 19

20. The Cactus-Stack Problem Gee … I can get that just by running serially. ⌃ # SP Reciprocity Space Usage 20

21. The Cactus-Stack Problem Interoperability with serial code, including binaries Serial-ParallelReciprocity GoodPerformance BoundedStack Space Reasonable space usage compared to serial execution Ample parallelism  linear speedup 21

22. Legacy Linear Stack An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 22

23. Legacy Linear Stack Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 23

24. Legacy Linear Stack — 1960* Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 24 * Stack-based space management for recursive subroutines developed with compilers for Algol 60.

25. Cactus Stack — 1968* A cactus stack supports multiple views in parallel. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 25 * Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.

26. Heap-Based Cactus Stack A heap-based cactus stack allocates frames off the heap. A Mesa (1979), Ada (1979), Cedar (1986)MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy. heap C B E D 26

28. Cilk-5 (MIT)

29. Cilk-M (MIT)

30. Cilk Plus (Intel)

31. Fortress (Oracle Labs)

32. Habanero (Rice)

33. JCilk (MIT)

34. OpenMP

35. StreamIt (MIT)

36. Task Parallel Library (Microsoft)

37. Threading Building Blocks (Intel)

38. X10 (IBM)

39. …27

40. Heap-Based Cactus Stack A heap-based cactus stack allocates frames off the heap. MIT Cilk-5 (1998) and Intel Cilk++ (2009)use this strategy as well. A heap Good time and space bounds can be obtained … C B E D 28

41. Heap-Based Cactus Stack Heap linkage: call/return via frames in the heap. A Heap linkage parallel functions fail to interoperate with legacy serial code. heap C B E D 29

42. Various Strategies 30 The main constraint: once allocated, a frame’s location in virtual address cannot change.

44. Cilk-M Overview

45. Cilk-M’s Work-Stealing Scheduler

46. TLMM-Based Cactus Stacks

47. The Analysis of Cilk-M

48. OS Support for TLMMSurvey of My Other Work Direction for Future Work 31

49. The Cilk Programming Model The named childfunction may execute in parallel with the continuation of its parent. intfib(intn) { if(n < 2) { return n; } intx = spawn fib(n-1); inty = fib(n-2); sync; return (x + y); } Control cannot pass this point until all spawned children have returned. Cilk keywords grant permissionfor parallel execution. They do not commandparallel execution. 32

50. Cilk-M A work-stealing runtime system based on Cilk that solves the cactus-stack problem by thread-local memory mapping (TLMM). 33

51. Cilk-M Overview High virtual addr stack TLMM Thread-local memory mapped (TLMM) region: A virtual-address range in which each thread can map physical memory independently. heap uninitialized data (bss) shared Idea: Allocate the stacks for each worker in the TLMM region. initialized data code Low virtual addr 34

52. Basic Cilk-M Idea 0x7f000 A A A Workers achieve sharing by mapping the same physical memory at the same virtual address. x: 42 x: 42 x: 42 B C C y: &x y: &x E D y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 35

54. Space bound: Sp /P≤ S1.

55. Does not support SP-reciprocity.36

56. Cilk Depth 37 A C B Cilk depth (3) is not the same as spawn depth (2). E D G F Cilk depth is the max number of Cilk functions nested on the stack during a serial execution

58. SP reciprocity:

59. No longer need to distinguish function types

60. Parallelism or not is dictated only by how a function is invoked (spawn vs. call).38

62. We modified the open-source Linux kernel (2.6.29 running on x86 64-bit CPU’s) to provide support for TLMM (~600 lines of code).

63. We have ported the runtime system to work with the Intel’s Cilk Plus compiler in place of the native Cilk Plus runtime.39

64. Performance Comparison AMD 4 quad-core 2GHz Opteron, 64KB private L1, 512K private L2, 2MB shared L3 Cilk-M running time / Cilk Plus running time Time Bound:Tp= T1 / P + CT∞ , where C = O(S1+D) 40

65. Space Usage Space bound: Sp /P≤ S1+D 41

67. Cilk-M Overview

72. Cilk-M’s Work-Stealing Scheduler Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn spawn call P P P P 43

73. Cilk-M’s Work-Stealing Scheduler Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call call! P P P P 44

74. Cilk-M’s Work-Stealing Scheduler Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call spawn spawn! P P P P 45

75. Cilk-M’s Work-Stealing Scheduler Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call spawn call spawn call! spawn! spawn! spawn P P P P 46

76. Cilk-M’s Work-Stealing Scheduler Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call spawn call spawn spawn spawn call call spawn call spawn return! spawn P P P P 47

77. Cilk-M’s Work-Stealing Scheduler Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn call spawn steal! spawn P P P P When a worker runs out of work, itstealsfrom the top of a randomvictim’s deque. 48

78. Cilk-M’s Work-Stealing Scheduler Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn spawn call spawn spawn! spawn P P P P When a worker runs out of work, itstealsfrom the top of a randomvictim’s deque. 49

79. Cilk-M’s Work-Stealing Scheduler Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn spawn call spawn spawn P P P P Theorem [BL94]: With sufficient parallelism, workers steal infrequently linear speedup. 50

81. Cilk-M Overview

86. TLMM-Based Cactus Stacks 0x7f000 A x: 42 B Use standard linear stack in virtual memory. y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 52

87. TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B Map (not copy) the stolen prefixto the same virtual addresses. y: &x steal A P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 53

88. TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B Subsequent spawns and calls grow down-ward in the thief’s TLMM region. C y: &x y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 54

89. TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B Both workers see the same virtual address value for &x. C y: &x y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 55

90. TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B C Both workers see the same virtual address value for &x. y: &x D y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 56

91. TLMM-Based Cactus Stacks 0x7f000 A A A A x: 42 x: 42 x: 42 x: 42 B C C C Map (not copy) the stolen prefixto the same virtual addresses. y: &x y: &x y: &x D y: &x steal C P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 57

92. TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B C C Subsequent spawns and calls grow down-ward in the thief’s TLMM region. y: &x y: &x D E y: &x z: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 58

93. TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B C C All workers see the same virtual address value for &x. y: &x y: &x E D y: &x z: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 59

94. Handling Page Granularity 0x7f000 A page size B 0x7e000 0x7d000 A C B P3 P1 P2 E D 60

95. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 Map the stolen prefix. 0x7d000 A steal A C B P3 P1 P2 E D 61

96. Handling Page Granularity 0x7f000 A A page size B 0x7e000 Advance the stack pointer fragmentation. 0x7d000 A steal A C B P3 P1 P2 E D 62

97. Handling Page Granularity 0x7f000 A A page size B 0x7e000 C D 0x7d000 A C B P3 P1 P2 E D 63

98. Handling Page Granularity 0x7f000 A A A A page size B 0x7e000 C C C D 0x7d000 A steal C C B P3 P1 P2 E D 64

99. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Advance the stack pointer again additional fragmentation. D 0x7d000 A steal C C B P3 P1 P2 E D 65

100. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Advance the stack pointer again additional fragmentation. D 0x7d000 E A C B P3 P1 P2 E D 66

101. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Space-reclaiming heuristic: reset the stack pointer upon successful sync. D 0x7d000 E A C B P3 P1 P2 E D 67

103. Cilk-M Overview

108. Space Bound with a Heap-Based Cactus Stack Theorem [BL94].Let S1 be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack is at mostSP/P ≤ S1. Proof. The work-stealing algorithm maintains the busy-leaves property: Every active leaf frame has a worker executing on it.■ P = 4 S1 P P P P 69

109. Cilk-M Space Bound Claim.Let S1 be the stack space required by a serial execution of a program. Let D be the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stackis at mostSP/P ≤ S1+D. Proof. The work-stealing algorithm maintains the busy-leaves property: Every active leaf frame has a worker executing on it.■ P = 4 S1 P P P P 70

110. Space Usage Space bound: Sp /P≤ S1+D 71

111. Performance Bound with aHeap-Based Cactus Stack Definition.TP— execution time on P processors T1— work T∞— spanT1 / T∞ — parallelism Theorem [BL94]. A work-stealing scheduler can achieve expected running time TP=T1 / P + O(T∞) on P processors. Corollary. If the computation exhibits sufficient parallelism(P ≪T1 / T∞ ), this bound guarantees near-perfect linear speedup (T1 / Tp ≈ P). 72

112. Cilk-M Performance Bound Definition.TP— execution time on P processors T1— work T∞— spanT1 / T∞ — parallelismD — Cilk depth Claim. A work-stealing scheduler can achieve expected running time Tp= T1 / P + CT∞ onP processors, where C = O(S1+D) . Corollary. If the computation exhibits sufficient parallelism(P ≪T1 / (S1+D)T∞), this bound guarantees near-perfect linear speedup (T1 / Tp ≈ P). 73

114. Cilk-M Overview

120. Workers share a single page table.

121. By default, nothing is shared.

122. By default, everything is shared.

123. Manually (i.e. mmap) share nonstack memory.

124. Reserve a region to be independently mapped.

125. User calls to mmap do not work (which may include malloc).

126. User calls to mmap operate properly.75

127. Page Table for TLMM (Ideally) TLMM 2 TLMM 1 Shared TLMM 0 x86: Hardware walks the page table.Each thread has a single root-page directory! Page 28 Page 12 Page 7 Page 32 76

128. Support for TLMM Thread 0 Thread 1 Must synchronize the root-page directory among threads. Page 32 Page 7 Page 12 77

130. E.g., MCS locks [MCS91]:

131. When a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L.

132. Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic.

133. When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock.

134. If A allocates LA on its stack using TLMM, LA may not be visible to B!78

136. Serial-Parallel Reciprocity

137. Good Performance

138. Bounded and efficient use of memory for the cactus stack

139. Cilk-M employs:

140. TLMM-based cactus stacks

141. OS support for TLMM (~600 lines of code)

142. Legacy compatible linkageCilk-M Summary 79

144. Location-Based Memory Fences

145. Ownership-Aware Transactional Memory Direction for Future Work 80

146. The JCilk Language Joint work with John Danaher and Charles Leiserson Parallel Constructs from Cilk: spawn & sync Java Core Functionalities 81

147. The JCilk Language Joint work with John Danaher and Charles Leiserson Exception Handling Parallel Constructs from Cilk: spawn & sync Java Core Functionalities 82

149. JCilk’s exception semantics include an implicit abort mechanism, which allows speculative parallelism to be expressed succinctly in JCilk.

150. Other researchers [I91, TY00, BM00] pursued new linguistic mechanisms.Exception Handling in a Concurrent Context 83

151. The JCilk System JCilkRuntime System JCilk Compiler JCilk to Java + goto Jgo compiler: GCJ + goto support JVM source Fib.jcilk Fib.jgo Fib.class 84

153. JCilk’s abort mechanism extends Java’s existing exception mechanism in a naturally way to propagate an abort, allowing the programmer to clean-up.What We Discovered 85

157. Initially, L1 = 0 and L2 = 0 Thread 1 Thread 2 L1 = 1; if(L2 == 0) { /* critical section */ … } L1 = 0; L2 = 1; if(L1 == 0) { /* critical section */ … } L2 = 0; Dekker’s Protocol (Simplified) Reads may be reordered with older writes. 87

158. Initially, L1 = 0 and L2 = 0 Thread 1 Thread 2 L1 = 1; mfence(); if(L2 == 0) { /* critical section */ … } L1 = 0; L2 = 1; mfence(); if(L1 == 0) { /* critical section */ … } L2 = 0; Memory fences needed  cause stalling Dekker’s Protocol (Simplified) 88

160. the victim vs. the thief

161. Java Monitors using Quickly Reacquirable Locks or Biased Locking[DMS03] [OKK04]

162. the bias-holding thread vs. a revoker thread

163. JNI reentry barrier in JVM

164. a Java mutator thread vs. the garbage collector

165. Network package processing [VNE10]

166. the owner thread vs. other threadsApplications exhibit asymmetric synchronization patterns. 89

168. Some applications can benefit from a software implementation [DHY03] that uses interrupt.

169. A light-weight hardware mechanism can piggyback on the cache coherence protocol. Location-Based Memory Fences 90 Joint work with EdyaLadan-Mozes and DmitriyVyukov

173. Transactional Memory Rset: w,xWset: w,x Memory atomic {//A x++; } Rset: xWset:x A atomic {//B w = x; } Rset: xWset:w B Transactional Memory (TM) [HM93] provides a transactional interface for accessing memory. 92

174. Transactional Memory Rset: w,xWset: w,x Memory atomic {//A x++; } Rset: xWset:x A atomic {//B w = x; } Rset: xWset:w B TM guarantees that transactions are serializable[P79]. 93

175. Nested Transactions Rset: w,x,y,zWset: w,x,y,z atomic {//A int a = x;... atomic { //B w++; } intb = y; z = x + y; } Memory Rset: xWset: A Rset:wWset: w B Closed-nesting:propagate the changes to A. 94

176. Nested Transactions Rset: w,x,y,zWset: w,x,y,z atomic {//A int a = x;... atomic { //B w++; } intb = y; z = x + y; } Memory Rset: xWset: A Rset:wWset: w B Open-nesting:commit the changes globally. 95

177. Nested Transactions All memories are treated equally – there is only one level of abstraction. 96

179. In OAT, the programmer writes code with transactional modules, and the OAT system uses the concept ofownership types [BLS03] to ensure data encapsulation within a module.

180. The OAT system guarantees abstract serializabilityas long as the program conforms to a set of well-defined constraints on how the modules share data. 97

184. Parallelism Abstraction Aconcurrency platform provides a layer of parallelism abstraction to help load balancing and task scheduling. User Application Concurrency Platform Operating System 99

186. Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object.

187. Transactional Memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically.100 Can a concurrency platform as well mitigate the complexity of synchronization by providing the right memory abstractions?

189. Grace [BYL+09] (deterministic execution)

190. Sammati [PV10] (deadlock avoidance)

191. Cilk-M [LSH+10] (TLMM cactus stack)Can we relax limitation of manipulating virtual memory at page-granularity ? 101

194. Ownership-Aware Transactional Memory Direction for Future Work

195. 103

196. 104

197. Quadratic Stack Growth [Robison08] P : parallel P Assume one linear stack per worker : serial P S S : spawn P S : call P S Depth = d S . . . S . . . . . . P S S S . . . P P P Repeat dtimes . . . S S S S S S 105

198. Quadratic Stack Growth [Robison08] P The green worker repeatedly blocks, then steals, using Θ(d2)stack space. Assume one linear stack per worker P S P S P S Depth = d S . . . S . . . . . . P S S S . . . P P P Repeat dtimes . . . S S S S S S 106

199. Performance Comparison AMD 4 quad-core 2GHz Opteron, 64KB private L1, 512K private L2, 2MB shared L3 Cilk-M running time / Cilk-5 running time Time Bound:Tp= T1 / P + CT∞ , where C = O(S1+D) 107

200. Space Usage (Hand Compiled) Space bound: Sp /P≤ S1+D 108

201. Space Usage 109

202. GCC/Linux C Subroutine Linkage args to A The legacy linear stack obtains efficiency by overlapping frames. A’s return address A’s parent’s base ptr frame forA bp sp A’s local variables linkage region args to B B’s return address B’s local variables A’s base pointer frame forB A args to B’s callees C B E D 110

203. Handling Page Granularity 0x7f000 A A page size B 0x7e000 The thief advances its stack pointer past the next page boundary, reserving space for thelinkage regionfor the next callee. 0x7d000 A steal A C B P3 P1 P2 E D 111

204. Handling Page Granularity 0x7f000 A A page size B 0x7e000 The thief advances its stack pointer past the next page boundary, reserving space for thelinkage regionfor the next callee. C D 0x7d000 A C B P3 P1 P2 E D 112

205. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 The thief advances its stack pointer past the next page boundary, reserving space for thelinkage regionfor the next callee. C C D 0x7d000 A steal C C B P3 P1 P2 E D 113

206. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 The thief advances its stack pointer past the next page boundary, reserving space for thelinkage regionfor the next callee. C C D 0x7d000 E A C B P3 P1 P2 E D 114

207. Key Invocation Invariants Arguments are passed via stack pointer with positive offset. Local variables are referenced via base pointer with negative offset. Live registers are flushed onto the stack immediately before each spawn. Live registers are flushed onto the stack before returning back to runtime if sync fails. When resuming a stolen function after a spawn or sync, live registers are restored from the stack. When returning from a spawn, the return value is flushed from its register onto the stack. The frame size is fixed before any spawn statements. 115

208. GCC/Linux C Subroutine Linkage Legacy linear stacks enable efficient passing of arguments from caller to callee. args to A A’s return address A’s parent’s base ptr frame forA sp bp A’s local variables args to A’s callees A C B E D 116

209. GCC/Linux C Subroutine Linkage linkage region FrameAaccesses its arguments through positive offset indexed from its base pointer. args to A A’s return address A’s parent’s base ptr frame forA sp bp A’s local variables args to A’s callees A C B E D 117

210. GCC/Linux C Subroutine Linkage args to A A’s return address A’s parent’s base ptr frame forA sp bp A’s local variables FrameAaccesses its local variables through negative offset indexed from its base pointer. args to A’s callees A C B E D 118

211. GCC/Linux C Subroutine Linkage Before invoking B, A places the arguments for Binto the reserved linkage region it will share with B, which A indexes using positive offset off its stack pointer. args to A A’s return address A’s parent’s base ptr frame forA sp bp A’s local variables linkage region args to B args to A’s callees A C B E D 119

212. GCC/Linux C Subroutine Linkage args to A Athen makes the call to B, which saves the return address for Band transfers control to B. A’s return address A’s parent’s base ptr frame forA sp bp A’s local variables args to B B’s return address A C B E D 120

213. GCC/Linux C Subroutine Linkage args to A Upon entering, Bsaves A’s base pointer and sets the base pointer to where the stack pointer is. A’s return address A’s parent’s base ptr frame forA bp sp A’s local variables args to B B’s return address bp A’s base pointer A C B E D 121

214. GCC/Linux C Subroutine Linkage args to A B advances the stack pointer to allocate space for local variables and linkage region. A’s return address A’s parent’s base ptr frame forA bp sp A’s local variables args to B B’s return address B’s local variables A’s base pointer frame forB A args to B’s callees C B E D 122

215. GCC/Linux C Subroutine Linkage args to A The legacy linear stack obtains efficiency by overlapping frames. A’s return address A’s parent’s base ptr frame forA bp sp A’s local variables args to B B’s return address B’s local variables A’s base pointer frame forB A args to B’s callees C B E D 123

216. Legacy Linear Stack High Addr An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree. A A B C C B D E E D Low Addr invocation tree 124

217. Legacy Linear Stack High Addr Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent. … A A C C B x: 42 E E D Low Addr invocation tree y: &x 125

218. Legacy Linear Stack High Addr Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent. A A C C B ✗ y: &z E E D Low Addr invocation tree z: 42 126

219. The Queens Problem Given n> 0, search for oneway to arrange n queens on an n-by-n chessboard so that none attacks another. legal configuration illegal configuration 127

220. Exploring the Search Tree for Queens start r0,c1 r0,c2 r0,c3 r0,c0 r1,c3 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c2 r1,c1 r1,c0 r2,c0 r2,c0 r2,c0 r2,c0 . . . . . . Serial strategy: Depth-first search with backtracking. The search tree size grows exponentially as n increases. 128

221. Exploring the Search Tree for Queens start r0,c1 r0,c2 r0,c3 r0,c0 r1,c3 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c2 r1,c1 r1,c0 r2,c0 r2,c0 r2,c0 r2,c0 . . . . . . Parallel strategy: spawn searches in parallel. Speculative computation – some work may be wasted. 129

222. Exploring the Search Tree for Queens start r0,c1 r0,c2 r0,c3 r0,c0 r1,c3 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c2 r1,c1 r1,c0 r2,c0 r2,c0 r2,c0 r2,c0 . . . . . . Parallel strategy: spawn searches in parallel. Abort other parallel searches once a solution is found. 130

223. Various Parallel Programming Models 131

224. 132 Parallelize Your Code using Cilk++ class SAT_Solver { public: int solve( … ); … private: … }; Convert the entire code base to Cilk++ language. 2. Structure the project so that Cilk++ code calls C++ code, but not conversely. Allow C++ functions to call Cilk++ functions, but convert entire subtree to use Cilk++. a. Use C++ wrapper functionsb. Use “extern C++” c. Limited call back to C++ code

225. 133 Parallelize Your Code using TBB class SAT_Solver { public: int solve( … ); … private: … }; Convert the entire project to Cilk++ language. 2. Structure the project so that Cilk++ code calls C++ code, but not conversely. Allow C++ functions to call Cilk++ functions, but convert entire subtree to use Cilk++. a. Use C++ wrapper functionsb. Use “extern C++” c. Limited call back to C++ code Your program may end up using a lot more stack space or fail to get good speedup.

226. Memory Network … ¢ ¢ ¢ P P P Chip Multiprocessor (CMP) Multicore Architecture — 2001* 134 *The first non-embedded multicore microprocessor was the Power4 from IBM (2001).

227. The Era of Multicore IS Here 135 # of CPUS # of Cores Source: www.newegg.com Single core processor is becoming obsolete.

228. My Sister Is Buying a New Laptop … 136 Source: www.apple.com The era of multicore IS here!

Mit cilk

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Mit cilk

Similar to Mit cilk (20)

Mit cilk

Editor's Notes