The Era of Multicore Is Here
Source: www.newegg.com
Multicore Architecture
[figure: a chip multiprocessor (CMP), multiple processors (P) each with a private cache ($), connected by a network to shared memory]
* The first non-embedded multicore microprocessor was the Power4 from IBM (2001).
Concurrency Platforms
A concurrency platform, which provides linguistic support and handles load balancing, can ease the task of parallel programming.
[figure: software stack, with the User Application on top of the Concurrency Platform on top of the Operating System]
Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems
I-Ting Angelina Lee
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson
March 22, Intel XTRL / USA
Three Desirable Criteria
Serial-Parallel Reciprocity: interoperability with serial code, including binaries.
Good Performance: ample parallelism ⇒ linear speedup.
Bounded Stack Space: reasonable space usage compared to serial execution.
Various Strategies
[figure: a comparison of the strategies taken by existing platforms such as Cilk++, TBB, and Cilk Plus]
The Cactus-Stack Problem: how to satisfy all three criteria simultaneously.
The Cactus-Stack Problem
A dialogue between a customer and an engineer, measured against the three criteria (SP reciprocity, space usage, performance):
Customer: Parallelize my software?
Engineer: Sure! Use my concurrency platform! Just be sure to recompile all your codebase. [gives up SP reciprocity]
Customer: Hm … I use third-party binaries … *Sigh*. Ok fine.
Engineer: … you are gonna need extra memory. Upgrade your RAM then … [gives up bounded space usage]
Customer: … no?
Engineer: Well … you didn't say you want any performance guarantee, did you? [gives up good performance]
Customer: Gee … I can get that just by running serially.
The Cactus-Stack Problem
Recall the three criteria:
Serial-Parallel Reciprocity: interoperability with serial code, including binaries.
Good Performance: ample parallelism ⇒ linear speedup.
Bounded Stack Space: reasonable space usage compared to serial execution.
Legacy Linear Stack — 1960*
An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.
Rule for pointers: a parent can pass pointers to its stack variables down to its children, but not the other way around.
[figure: an invocation tree (A calls B and C; C calls D and E) and the corresponding views of the linear stack for each function while it is active; the stack grows downward, and an active function can always see its ancestors' frames]
* Stack-based space management for recursive subroutines was developed with compilers for Algol 60.
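A minimal C sketch of the pointer rule (illustrative code, not from the slides): a parent may safely pass the address of a stack variable down to a child, because the parent's frame outlives the call, whereas a child must never hand a pointer to its own locals back up.

    /* Legal: the pointer flows downward; the parent's frame is still live. */
    void child(int *px) {
        *px += 1;
    }

    void parent(void) {
        int x = 42;          /* lives in parent's stack frame */
        child(&x);           /* OK: parent outlives child */
    }

    /* Illegal: the pointer would flow upward and dangle after the return. */
    int *broken_child(void) {
        int y = 7;
        return &y;           /* y's frame is popped when broken_child returns */
    }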
Cactus Stack — 1968*
A cactus stack supports multiple views in parallel.
[figure: the same invocation tree, where B, D, and E are simultaneously active and each sees its own view of the stack (its own frame plus its ancestors' frames)]
* Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.
Heap-Based Cactus Stack
A heap-based cactus stack allocates frames off the heap.
Mesa (1979), Ada (1979), Cedar (1986), MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy.
[figure: frames A through E allocated in the heap and linked into a tree]
Modern Concurrency Platforms
Cilk++ (Intel)
Cilk-5 (MIT)
Cilk-M (MIT)
Cilk Plus (Intel)
Fortress (Oracle Labs)
Habanero (Rice)
JCilk (MIT)
OpenMP
StreamIt (MIT)
Task Parallel Library (Microsoft)
Threading Building Blocks (Intel)
X10 (IBM)
…
Heap-Based Cactus Stack
A heap-based cactus stack allocates frames off the heap. MIT Cilk-5 (1998) and Intel Cilk++ (2009) use this strategy as well.
Good time and space bounds can be obtained …
[figure: frames A through E allocated in the heap and linked into a tree]
Heap-Based Cactus Stack
Heap linkage: calls and returns are performed via frames in the heap.
Heap linkage ⇒ parallel functions fail to interoperate with legacy serial code.
[figure: the same heap-allocated frames A through E, now showing the heap linkage between them]
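As a rough illustration only (a sketch with made-up names, not Cilk-5's or Cilk++'s actual frame layout), a heap-based cactus stack can be built from heap-allocated frames that carry parent links; a function reaches its ancestors' variables by chasing those links rather than by relying on a contiguous linear stack, which is why legacy serial binaries cannot participate.

    #include <stdlib.h>

    /* Hypothetical activation-frame record for a heap-based cactus stack. */
    struct frame {
        struct frame *parent;    /* link to the caller's frame */
        size_t locals_size;      /* size of the locals stored after the header */
        /* locals follow the header in the same allocation ... */
    };

    /* "Call": allocate a child frame off the heap, linked to its parent. */
    struct frame *push_frame(struct frame *parent, size_t locals_size) {
        struct frame *f = malloc(sizeof *f + locals_size);
        f->parent = parent;
        f->locals_size = locals_size;
        return f;
    }

    /* "Return": free only this frame; sibling branches may still be live. */
    void pop_frame(struct frame *f) {
        free(f);
    }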
Various Strategies
The main constraint: once allocated, a frame's location in virtual memory cannot change.
Outline
Cilk-M:
The Cactus Stack Problem
Cilk-M Overview
Cilk-M’s Work-Stealing Scheduler
TLMM-Based Cactus Stacks
The Analysis of Cilk-M
OS Support for TLMM
Survey of My Other Work
Direction for Future Work
The Cilk Programming Model
The named child function may execute in parallel with the continuation of its parent.

    int fib(int n) {
        if (n < 2) { return n; }
        int x = spawn fib(n-1);
        int y = fib(n-2);
        sync;
        return (x + y);
    }

Control cannot pass the sync until all spawned children have returned.
Cilk keywords grant permission for parallel execution. They do not command parallel execution.
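One consequence worth spelling out (standard Cilk behavior, shown here as a sketch): erasing the Cilk keywords yields the serial elision, an ordinary C function that computes the same result on one processor.

    /* Serial elision of fib: delete spawn and sync and plain C remains. */
    int fib(int n) {
        if (n < 2) { return n; }
        int x = fib(n-1);
        int y = fib(n-2);
        return (x + y);
    }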
Cilk-M
A work-stealing runtime system based on Cilk that solves the cactus-stack problem by thread-local memory mapping (TLMM).
Cilk-M Overview
Thread-local memory mapped (TLMM) region: a virtual-address range in which each thread can map physical memory independently.
Idea: allocate the stacks for each worker in the TLMM region.
[figure: virtual-address-space layout from high to low addresses: stack (the TLMM region), then the shared portion: heap, uninitialized data (bss), initialized data, code]
Basic Cilk-M Idea
Workers achieve sharing by mapping the same physical memory at the same virtual address.
Unreasonable simplification: assume that we can map with arbitrary granularity.
[figure: the invocation tree A through E executed by workers P1, P2, and P3; each worker's TLMM stack starts at 0x7f000, frame A (x: 42) is mapped at the same virtual address in every stack, and the pointers y = &x in the descendant frames remain valid everywhere]
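Ordinary Linux threads share one page table and so cannot map a region independently; providing that per-thread mapping is precisely what the TLMM kernel extension is for. Purely as an approximation of the idea, the sketch below uses two processes (not threads), memfd_create, and MAP_FIXED to map the same physical pages at the same virtual address; the region size and the "frame A" contents are made up for illustration.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    enum { REGION_SIZE = 64 * 4096 };   /* illustrative region size */

    int main(void) {
        /* Shared physical pages, plus a reserved virtual range that the
           child inherits across fork and both processes then reuse. */
        int fd = memfd_create("stolen-prefix", 0);
        ftruncate(fd, REGION_SIZE);
        void *addr = mmap(NULL, REGION_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (fork() == 0) {              /* the "thief" */
            char *p = mmap(addr, REGION_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_FIXED, fd, 0);
            sleep(1);                   /* crude wait for the victim's write */
            printf("thief reads \"%s\" at %p\n", p, (void *)p);
            return 0;
        }

        /* The "victim" maps the very same pages at the very same address. */
        char *p = mmap(addr, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        strcpy(p, "frame A: x = 42");
        wait(NULL);
        return 0;
    }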
Cilk Guarantees with a Heap-Based Cactus Stack
Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution.
Time bound: TP = T1 / P + O(T∞), giving linear speedup when P ≪ T1 / T∞.
Space bound: SP / P ≤ S1.
Does not support SP-reciprocity.
Cilk Depth
Cilk depth is the maximum number of Cilk functions nested on the stack during a serial execution.
[figure: an invocation tree of functions A through G whose Cilk depth (3) is not the same as its spawn depth (2)]
Cilk-M Guarantees
Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution; D — Cilk depth.
Time bound: TP = T1 / P + O((S1 + D) T∞), giving linear speedup when P ≪ T1 / ((S1 + D) T∞).
Space bound: SP / P ≤ S1 + D, where S1 is measured in pages.
SP reciprocity:
No longer need to distinguish function types
Whether a function executes in parallel is dictated only by how it is invoked (spawn vs. call); see the sketch below.
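A small sketch of what SP reciprocity buys, written with the Cilk Plus spelling of the keywords (cilk_spawn / cilk_sync); the routine solve() is a hypothetical, precompiled serial function:

    #include <cilk/cilk.h>

    extern double solve(double *data, int n);   /* hypothetical legacy binary routine */

    void both_ways(double *a, double *b, int n, double *ra, double *rb) {
        *ra = cilk_spawn solve(a, n);   /* invoked with spawn: may run in parallel */
        *rb = solve(b, n);              /* invoked with an ordinary call: runs here */
        cilk_sync;                      /* wait for the spawned child */
    }

The same compiled function is used both ways; no separate "parallel" variant of solve is needed.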
System Overview
We implemented a prototype Cilk-M runtime system based on the open-source Cilk-5 runtime system.
We modified the open-source Linux kernel (2.6.29 running on x86 64-bit CPUs) to provide support for TLMM (~600 lines of code).
We have ported the runtime system to work with Intel's Cilk Plus compiler in place of the native Cilk Plus runtime.
Performance Comparison
[figure: bar chart of Cilk-M running time / Cilk Plus running time across benchmarks, measured on an AMD machine with four quad-core 2GHz Opterons, 64KB private L1, 512KB private L2, and 2MB shared L3]
Time bound: TP = T1 / P + C T∞, where C = O(S1 + D).
Space Usage
[figure: per-worker stack space usage across benchmarks]
Space bound: SP / P ≤ S1 + D.
Outline recap: next up, Cilk-M's Work-Stealing Scheduler.
Cilk-M's Work-Stealing Scheduler
Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]: calls and spawns push frames onto the bottom of the worker's deque, and returns pop them off.
When a worker runs out of work, it steals from the top of a random victim's deque.
Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speedup.
[figure: animation of four workers' deques as frames are pushed by calls and spawns, popped by returns, and stolen from the top]
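A rough, lock-based sketch of a worker's deque (illustrative only; Cilk-5's runtime actually uses the nonblocking THE protocol rather than a mutex): the owner pushes and pops frames at the bottom, while a thief removes a frame from the top.

    #include <pthread.h>
    #include <stddef.h>

    struct frame;                       /* an activation frame (spawned or called) */

    struct deque {
        struct frame *slots[1024];      /* fixed capacity, enough for a sketch */
        int top, bottom;                /* live frames occupy [top, bottom) */
        pthread_mutex_t lock;           /* initialize with PTHREAD_MUTEX_INITIALIZER */
    };

    void push_bottom(struct deque *d, struct frame *f) {   /* owner: call/spawn */
        pthread_mutex_lock(&d->lock);
        d->slots[d->bottom++] = f;
        pthread_mutex_unlock(&d->lock);
    }

    struct frame *pop_bottom(struct deque *d) {            /* owner: return */
        struct frame *f = NULL;
        pthread_mutex_lock(&d->lock);
        if (d->bottom > d->top) f = d->slots[--d->bottom];
        pthread_mutex_unlock(&d->lock);
        return f;
    }

    struct frame *steal_top(struct deque *victim) {        /* thief: steal */
        struct frame *f = NULL;
        pthread_mutex_lock(&victim->lock);
        if (victim->bottom > victim->top) f = victim->slots[victim->top++];
        pthread_mutex_unlock(&victim->lock);
        return f;
    }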
Outline recap: next up, TLMM-Based Cactus Stacks.
TLMM-Based Cactus Stacks
(Unreasonable simplification: assume that we can map with arbitrary granularity.)
Each worker uses a standard linear stack in its TLMM region of virtual memory.
Upon a steal, the thief maps (not copies) the physical memory of the stolen prefix to the same virtual addresses as in the victim.
Subsequent spawns and calls grow downward in the thief's independently mapped TLMM region.
All workers see the same virtual-address value for &x, so pointers into shared frames stay valid.
[figure: animation of workers P1, P2, and P3 executing the invocation tree A through E; after each steal the stolen prefix (e.g., frame A with x: 42) appears at the same addresses starting at 0x7f000 in the thief's stack, and every copy of y holds the same value &x]
Handling Page Granularity
In reality, mapping is done at page granularity:
Upon a steal, map the pages containing the stolen prefix into the thief's TLMM region.
Then advance the thief's stack pointer to the next page boundary, which avoids overwriting other frames on the last shared page at the cost of some fragmentation.
Each subsequent steal advances the stack pointer again, causing additional fragmentation.
Space-reclaiming heuristic: reset the stack pointer upon a successful sync.
[figure: animation of workers P1, P2, and P3 with page-granular stacks at 0x7f000, 0x7e000, and 0x7d000, showing the fragmentation introduced as A and then C are stolen]
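A sketch of the page arithmetic involved (an illustrative helper, assuming 4 KB pages): after the stolen prefix is mapped, the thief's stack pointer is rounded down to a page boundary so that the thief's new frames cannot overwrite victim frames that share the prefix's last page.

    #include <stdint.h>

    #define PAGE_SIZE 4096UL    /* assume 4 KB pages, as on x86-64 */

    /* The stack grows downward.  Given the lowest address occupied by the
       stolen prefix, return the address at which the thief's own frames may
       begin: rounded down to a page boundary, wasting at most one page of
       fragmentation per steal.  (The space-reclaiming heuristic later resets
       the stack pointer upon a successful sync.) */
    uintptr_t thief_stack_start(uintptr_t prefix_low) {
        return prefix_low & ~(PAGE_SIZE - 1);
    }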
Outline recap: next up, The Analysis of Cilk-M.
Space Bound with a Heap-Based Cactus Stack
Theorem [BL94]. Let S1 be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack is at most SP / P ≤ S1.
Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. ■
[figure: an invocation tree with P = 4 workers, each busy on an active leaf whose path to the root uses at most S1 stack space]
Cilk-M Space Bound
Claim. Let S1 be the stack space required by a serial execution of a program. Let D be the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stack is at most SP / P ≤ S1 + D.
Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. The additional D term accounts for fragmentation: the worst case occurs when every Cilk function on a stack that realizes the Cilk depth D is stolen, each steal wasting at most one page. ■
[figure: an invocation tree with P = 4 workers, each busy on an active leaf]
Space Usage
[figure: per-worker stack space usage across benchmarks]
Space bound: SP / P ≤ S1 + D.
Performance Bound with a Heap-Based Cactus Stack
Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism.
Theorem [BL94]. A work-stealing scheduler can achieve expected running time TP = T1 / P + O(T∞) on P processors.
Corollary. If the computation exhibits sufficient parallelism (P ≪ T1 / T∞), this bound guarantees near-perfect linear speedup (T1 / TP ≈ P).
Cilk-M Performance Bound
Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; D — Cilk depth.
Claim. A work-stealing scheduler can achieve expected running time TP = T1 / P + C T∞ on P processors, where C = O(S1 + D).
Corollary. If the computation exhibits sufficient parallelism (P ≪ T1 / ((S1 + D) T∞)), this bound guarantees near-perfect linear speedup (T1 / TP ≈ P).
Outline recap: next up, OS Support for TLMM.
To Be or Not To Be … a Process

A Worker = A Process:
Every worker has its own page table.
By default, nothing is shared.
Manually (i.e., via mmap) share nonstack memory.
User calls to mmap do not work (which may include malloc).

A Worker = A Thread:
Workers share a single page table.
By default, everything is shared.
Reserve a region to be independently mapped.
User calls to mmap operate properly.
Page Table for TLMM (Ideally)
x86: the hardware walks the page table, and each thread has a single root-page directory!
[figure: ideally, each thread's root-page directory would combine its own TLMM entries (TLMM 0, TLMM 1, TLMM 2) with shared entries mapping pages such as 7, 12, 28, and 32]
Support for TLMM
Each thread uses a unique root-page directory, so the shared entries of the root-page directory must be synchronized among threads.
[figure: Thread 0 and Thread 1 root-page directories whose shared entries map the same pages, such as 7, 12, and 32]
Limitation of TLMM Cactus Stacks
TLMM does not work for codes that require one thread to see another thread's stack.
E.g., MCS locks [MCS91]:
When a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L.
Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic.
When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock.
If A allocates LA on its stack using TLMM, LA may not be visible to B! (See the sketch below.)
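For concreteness, a simplified MCS lock sketch in C11 (illustrative, not code from the talk): the queue node is conventionally a stack-allocated local, and the release path writes the waiter's flag, which is exactly the cross-thread stack access that a TLMM stack would hide.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool locked;
    };

    struct mcs_lock {
        _Atomic(struct qnode *) tail;
    };

    void mcs_acquire(struct mcs_lock *L, struct qnode *me) {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        struct qnode *pred = atomic_exchange(&L->tail, me);
        if (pred != NULL) {                      /* lock held: enqueue and spin */
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))     /* spin on a thread-local flag */
                ;
        }
    }

    void mcs_release(struct mcs_lock *L, struct qnode *me) {
        struct qnode *succ = atomic_load(&me->next);
        if (succ == NULL) {
            struct qnode *expected = me;
            if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
                return;                          /* no waiter */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                                /* the waiter is still linking in */
        }
        atomic_store(&succ->locked, false);      /* the RELEASER writes the waiter's
                                                    flag; if that flag lives on a TLMM
                                                    stack, it may not be visible here */
    }

    void critical_section(struct mcs_lock *L) {
        struct qnode me;                         /* stack-allocated queue node */
        mcs_acquire(L, &me);
        /* ... protected work ... */
        mcs_release(L, &me);
    }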
Cilk-M is a C-based concurrency platform that satisfies all three criteria simultaneously:
Serial-Parallel Reciprocity
Good Performance
Bounded Stack Space


Editor's Notes

  • #4 A concurrency platform is a software abstraction layer that manages the processors' resources, schedules the computation over the available processors, and provides an interface for the programmer to specify parallel computations.
  • #8 It turns out that there seems to be a fundamental tradeoff between the three criteria. We and other practitioners have considered various strategies, and all of them fail to satisfy one of the three criteria, except for the TLMM cactus stacks, which is the strategy we employed in this work.
  • #23 I am sure everyone knows what a linear stack is. An execution of a serial language can be viewed as a serial walk of an invocation tree. On the left, we have an invocation tree, where A calls B and C, and C calls D and E. On the right are the corresponding views of the stack for each function when it is active. Throughout the rest of the talk, I will use the convention that the stack grows downward. Note that when a function is active, it can always see its ancestors' frames in the stack.
  • #30 But parallel functions fail to interoperate with legacy serial code, because legacy serial code would allocate its frame off the linear stack, and it does not understand the heap linkage, where the call/return is performed via frames in the heap.
  • #31 It turns out that there seems to be a fundamental tradeoff between the three criteria. We and other practitioners have considered various strategies, and all of them fail to satisfy one of the three criteria, except for the TLMM cactus stacks, which is the strategy we employed in this work. We don't have time to go into all the strategies, but I will go into a little more detail on one strategy to illustrate the challenge in satisfying all three criteria. You are welcome to ask me about the other strategies after the talk, if you are interested.
  • #33 The Cilk work-stealing scheduler then executes the program in a way that respects the logical parallelism specified by the programmer while guaranteeing that programs take full advantage of the processors available at runtime.
  • #35 Thread-local memory mapped (TLMM) region: a virtual-address range in which each thread can map physical memory independently. One main constraint that we are operating under is that once a frame has been allocated, its location in virtual memory cannot be changed. We can get around this if every thread has its own local view of a virtual-address range.
  • #36 Because the stacks are allocated in the TLMM region, we can map the region such that part of the stack is shared. For example, worker one has the stack view of … A frame for a given function refers to the same physical memory in all stacks and is mapped to the same virtual address.
  • #37 The time bound guarantees linear speedup if there is sufficient parallelism. Space bound: each worker does not use more than S1.
  • #41 Note that this is running on 16 cores
  • #42 Across all applications, each worker uses no more than 2 times the serial stack usage.
  • #53 Use a standard linear stack in virtual memory
  • #54 Upon a steal, map the physical memory corresponding to the stolen prefix to the same virtual addresses in the thief as in the victim.
  • #55 Subsequent spawns and calls grow downward in the thief's independently mapped TLMM region.
  • #56 Both the victim and the thief see the same virtual address value for the reference to A’s local variable x
  • #57 Subsequent spawns and calls grow downward in the thief's independently mapped TLMM region.
  • #58 ORALLY: P3 steals C … technically, it first stole A and failed to make progress on it, then it steals C.
  • #59 Subsequent spawns and calls grow downward in the thief's independently mapped TLMM region.
  • #61 Of course, our assumption is not so reasonable --- we can’t really map at arbitrary granularity. Instead, we have to map at page granularity.
  • #63 Advancing the stack pointer avoids overwriting other frames on the page, at the cost of fragmentation.
  • #64 The thief then resumes the stolen frame and executes normally in its own TLMM region.
  • #66 Once again, the stack pointer must be advanced, which causes additional fragmentation.
  • #76 Only the worker that executes it would perform the mmap … we need to synchronize among all workers to perform the mmap as well.
  • #77 Multiple threads' local regions overlap with different pages.
  • #78 Each thread uses a unique root page directory … When a thread maps in the shared region … we need to synchronize, but the synchronization is done only once per shared entry in the root page directory.
  • #84 Tazuneki and Yoshida [TazunekiYo00] and Issarny [Issarny91] have investigated the semantics of concurrent exception-handling, taking different approaches from our work. In particular, these researchers pursue new linguistic mechanisms for concurrent exceptions, rather than extending them faithfully from a serial base language as does JCilk. The treatment of multiple exceptions thrown simultaneously is another point of divergence.
  • #85 The JCilk system consists of two components: the runtime system and the compiler.
  • #88 Critically, there is a duality between the actions of the threads. Modern processors typically employ TSO (Total Store Order) and PO (Processor Ordering). That is: reads are not reordered with other reads; writes are not reordered with older reads; writes are not reordered with other writes; and reads may be reordered with older writes if they have different target locations.
  • #89 Traditional memory barriers are PC-based – the processor inevitably stalls upon execution
  • #90 The lock word associated with a monitor can be biased towards one thread. The bias-holding thread can update the lock word using a regular load-update-store. An unbiased lock word must be updated using CAS. Dekker is used to synchronize between the bias-holding thread and the revoker thread when the revoker attempts to update the bias. Network packet processing applications --- each thread handles a group of source addresses and maintains its own data structure. Occasionally, a processing thread needs to update another thread's data structure. If a collection is in progress, the barrier halts the thread until the collection completes. This prevents the thread from mutating the heap concurrently with the collector. The JNI reentry barrier is commonly implemented with a CAS or a Dekker-like "ST;MEMBAR;LD" sequence to mark the thread as a mutator (the ST) and check for a collection in progress (the LD). JNI calls occur frequently, but collections are relatively infrequent.
  • #93 The TM system enforces atomicity by tracking the memory locations that each transaction accesses, detecting conflicts, and possibly aborting and retrying transactions.
  • #94 TM guarantees that transactions are serializable [Papadimitriou79]. That is, transactions affect global memory as if they were executed one at a time in some order, even if in reality several executed concurrently.
  • #100 A decade ago, much multithreaded software was still written with POSIX or Java threads, where the programmer handled the task decomposition and scheduling explicitly. By providing a parallelism abstraction, a concurrency platform frees the programmer from worrying about load balancing and task scheduling.
  • #101 TLMM cactus stack: each worker gets its own linear local view of the tree-structured call stack. Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object. Transactional memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically. I believe a concurrency platform can likewise mitigate the complexity of synchronization by providing the appropriate memory abstractions. A memory abstraction is an abstraction layer between the program execution and the memory that provides a different "view" of a memory location depending on the execution context in which the memory access is made. What other memory abstractions can we provide?
  • #106 Assume we use one linear stack per worker. Here, we are using the term worker interchangeably with the term persistent thread – think of a Java thread or a POSIX thread. This is a beautiful observation made by Arch Robison, who is the main architect of Intel TBB, which is another concurrency platform. The observation is that, using the strategy of one linear stack per worker, some computations may incur quadratic stack growth compared to their serial execution. An example of such a computation is as follows. Here, I am showing you an invocation tree. The frame marked as P is a parallel function, which may have multiple extant children executing in parallel. The frame marked as S is a serial function. I haven't told you the details of how a work-stealing scheduler operates, but for the purpose of this example, all you need to know is that the execution of the computation typically goes depth-first and left to right. Once a P function spawns the left branch (marked as red), however, the right branch becomes available for execution. In order to guarantee the good time bound, one must allow a worker thread to randomly choose a readily available function to execute.
  • #107 Then we can run into the following scenario. I am using different colors to denote the worker that invoked a given function. One main constraint that we are operating under is that once a frame has been allocated, its location in virtual memory cannot be changed. So the green worker cannot pop off the stack-allocated frames, because the purple worker may have pointers to variables allocated on those frames.
  • #108 Note that this is running on 16 cores
  • #109 This worst case occurs when every Cilk function on a stack that realizes the Cilk depth D is stolen.
  • #110 Show space consumption; mention a couple of tricks we did to recycle stack space.
  • #111 The compact linear-stack representation is possible only because in a serial language, a function has at most one extant child function at any time.
  • #116 Static for P1 and P2. Mention fragmentation: part of the stack is not visible. Issue of fragmentation. Want to use backward-compatible linkage: combine this page and the next page, and insert the linkage block in there. Animate just P3.
  • #117 Mention memory args only when registers are not enough; overlapping of the frames.
  • #118 Mention memory args only when registers are not enough; overlapping of the frames. SAY: can access via the stack pointer if the frame size is known statically.
  • #119 Mention memory args only when registers are not enough; overlapping of the frames.
  • #120 Mention memory args only when registers are not enough; overlapping of the frames.
  • #121 CORRECT this text: A then transfers control to B.
  • #124 The compact linear-stack representation is possible only because in a serial language, a function has at most one extant child function at any time.
  • #125 On the left, I am showing you an invocation tree … Such serial languages admit a simple array-based stack for allocating function activation frames. To allocate an activation frame when a function is called, the stack pointer is advanced, and when the function returns, the original stack pointer is restored. This style of execution is space efficient, because all the children of a given function can use and reuse the same region of the stack.
  • #127 x: 42 on A; pass &x, stored as y in C & E. Make wider. The other way around … should be symmetric. In A, have x: 42; pass that down to C.
  • #132 Static threading / OpenMP / streaming / fork-join parallel programming / message passing / GPU, which is not commonly used for multicore architectures with shared memory. A concurrency platform is a software abstraction layer that manages the processors' resources, schedules the computation over the available processors, and provides an interface for the programmer to specify parallel computations.