The Era of Multicore Is Here
Source: www.newegg.com
Multicore Architecture
[figure: a chip multiprocessor (CMP), multiple processors (P) each with a private cache ($), connected by a network to shared memory]
* The first non-embedded multicore microprocessor was the Power4 from IBM (2001).
Concurrency Platforms
A concurrency platform, which provides linguistic support and handles load balancing, can ease the task of parallel programming.
[figure: software stack, with the User Application on top of the Concurrency Platform on top of the Operating System]
Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems
I-Ting Angelina Lee
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson
March 22, Intel XTRL / USA
Three Desirable Criteria
Serial-Parallel Reciprocity: interoperability with serial code, including binaries.
Good Performance: ample parallelism ⇒ linear speedup.
Bounded Stack Space: reasonable space usage compared to serial execution.
Various Strategies
[figure: a comparison of the strategies taken by existing platforms such as Cilk++, TBB, and Cilk Plus]
The Cactus-Stack Problem: how to satisfy all three criteria simultaneously.
The Cactus-Stack Problem
A dialogue between a customer and an engineer, measured against the three criteria (SP reciprocity, space usage, performance):
Customer: Parallelize my software?
Engineer: Sure! Use my concurrency platform! Just be sure to recompile all your codebase. [gives up SP reciprocity]
Customer: Hm … I use third-party binaries … *Sigh*. Ok fine.
Engineer: … you are gonna need extra memory. Upgrade your RAM then … [gives up bounded space usage]
Customer: … no?
Engineer: Well … you didn't say you want any performance guarantee, did you? [gives up good performance]
Customer: Gee … I can get that just by running serially.
The Cactus-Stack Problem
Recall the three criteria:
Serial-Parallel Reciprocity: interoperability with serial code, including binaries.
Good Performance: ample parallelism ⇒ linear speedup.
Bounded Stack Space: reasonable space usage compared to serial execution.
Legacy Linear Stack — 1960*
An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.
Rule for pointers: a parent can pass pointers to its stack variables down to its children, but not the other way around.
[figure: an invocation tree (A calls B and C; C calls D and E) and the corresponding views of the linear stack for each function while it is active; the stack grows downward, and an active function can always see its ancestors' frames]
* Stack-based space management for recursive subroutines was developed with compilers for Algol 60.
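A minimal C sketch of the pointer rule (illustrative code, not from the slides): a parent may safely pass the address of a stack variable down to a child, because the parent's frame outlives the call, whereas a child must never hand a pointer to its own locals back up.

    /* Legal: the pointer flows downward; the parent's frame is still live. */
    void child(int *px) {
        *px += 1;
    }

    void parent(void) {
        int x = 42;          /* lives in parent's stack frame */
        child(&x);           /* OK: parent outlives child */
    }

    /* Illegal: the pointer would flow upward and dangle after the return. */
    int *broken_child(void) {
        int y = 7;
        return &y;           /* y's frame is popped when broken_child returns */
    }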
Cactus Stack — 1968*
A cactus stack supports multiple views in parallel.
[figure: the same invocation tree, where B, D, and E are simultaneously active and each sees its own view of the stack (its own frame plus its ancestors' frames)]
* Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.
Heap-Based Cactus Stack
A heap-based cactus stack allocates frames off the heap.
Mesa (1979), Ada (1979), Cedar (1986), MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy.
[figure: frames A through E allocated in the heap and linked into a tree]
Modern Concurrency Platforms
Cilk++ (Intel)
Cilk-5 (MIT)
Cilk-M (MIT)
Cilk Plus (Intel)
Fortress (Oracle Labs)
Habanero (Rice)
JCilk (MIT)
OpenMP
StreamIt (MIT)
Task Parallel Library (Microsoft)
Threading Building Blocks (Intel)
X10 (IBM)
…
Heap-Based Cactus Stack
A heap-based cactus stack allocates frames off the heap. MIT Cilk-5 (1998) and Intel Cilk++ (2009) use this strategy as well.
Good time and space bounds can be obtained …
[figure: frames A through E allocated in the heap and linked into a tree]
Heap-Based Cactus Stack
Heap linkage: calls and returns are performed via frames in the heap.
Heap linkage ⇒ parallel functions fail to interoperate with legacy serial code.
[figure: the same heap-allocated frames A through E, now showing the heap linkage between them]
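As a rough illustration only (a sketch with made-up names, not Cilk-5's or Cilk++'s actual frame layout), a heap-based cactus stack can be built from heap-allocated frames that carry parent links; a function reaches its ancestors' variables by chasing those links rather than by relying on a contiguous linear stack, which is why legacy serial binaries cannot participate.

    #include <stdlib.h>

    /* Hypothetical activation-frame record for a heap-based cactus stack. */
    struct frame {
        struct frame *parent;    /* link to the caller's frame */
        size_t locals_size;      /* size of the locals stored after the header */
        /* locals follow the header in the same allocation ... */
    };

    /* "Call": allocate a child frame off the heap, linked to its parent. */
    struct frame *push_frame(struct frame *parent, size_t locals_size) {
        struct frame *f = malloc(sizeof *f + locals_size);
        f->parent = parent;
        f->locals_size = locals_size;
        return f;
    }

    /* "Return": free only this frame; sibling branches may still be live. */
    void pop_frame(struct frame *f) {
        free(f);
    }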
Various Strategies
The main constraint: once allocated, a frame's location in virtual memory cannot change.
Outline
Cilk-M:
The Cactus Stack Problem
Cilk-M Overview
Cilk-M’s Work-Stealing Scheduler
TLMM-Based Cactus Stacks
The Analysis of Cilk-M
OS Support for TLMM
Survey of My Other Work
Direction for Future Work
The Cilk Programming Model
The named child function may execute in parallel with the continuation of its parent.

    int fib(int n) {
        if (n < 2) { return n; }
        int x = spawn fib(n-1);
        int y = fib(n-2);
        sync;
        return (x + y);
    }

Control cannot pass the sync until all spawned children have returned.
Cilk keywords grant permission for parallel execution. They do not command parallel execution.
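One consequence worth spelling out (standard Cilk behavior, shown here as a sketch): erasing the Cilk keywords yields the serial elision, an ordinary C function that computes the same result on one processor.

    /* Serial elision of fib: delete spawn and sync and plain C remains. */
    int fib(int n) {
        if (n < 2) { return n; }
        int x = fib(n-1);
        int y = fib(n-2);
        return (x + y);
    }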
Cilk-M
A work-stealing runtime system based on Cilk that solves the cactus-stack problem by thread-local memory mapping (TLMM).
Cilk-M Overview
Thread-local memory mapped (TLMM) region: a virtual-address range in which each thread can map physical memory independently.
Idea: allocate the stacks for each worker in the TLMM region.
[figure: virtual-address-space layout from high to low addresses: stack (the TLMM region), then the shared portion: heap, uninitialized data (bss), initialized data, code]
Basic Cilk-M Idea
Workers achieve sharing by mapping the same physical memory at the same virtual address.
Unreasonable simplification: assume that we can map with arbitrary granularity.
[figure: the invocation tree A through E executed by workers P1, P2, and P3; each worker's TLMM stack starts at 0x7f000, frame A (x: 42) is mapped at the same virtual address in every stack, and the pointers y = &x in the descendant frames remain valid everywhere]
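Ordinary Linux threads share one page table and so cannot map a region independently; providing that per-thread mapping is precisely what the TLMM kernel extension is for. Purely as an approximation of the idea, the sketch below uses two processes (not threads), memfd_create, and MAP_FIXED to map the same physical pages at the same virtual address; the region size and the "frame A" contents are made up for illustration.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    enum { REGION_SIZE = 64 * 4096 };   /* illustrative region size */

    int main(void) {
        /* Shared physical pages, plus a reserved virtual range that the
           child inherits across fork and both processes then reuse. */
        int fd = memfd_create("stolen-prefix", 0);
        ftruncate(fd, REGION_SIZE);
        void *addr = mmap(NULL, REGION_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (fork() == 0) {              /* the "thief" */
            char *p = mmap(addr, REGION_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_FIXED, fd, 0);
            sleep(1);                   /* crude wait for the victim's write */
            printf("thief reads \"%s\" at %p\n", p, (void *)p);
            return 0;
        }

        /* The "victim" maps the very same pages at the very same address. */
        char *p = mmap(addr, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        strcpy(p, "frame A: x = 42");
        wait(NULL);
        return 0;
    }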
Cilk Guarantees with a Heap-Based Cactus Stack
Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution.
Time bound: TP = T1 / P + O(T∞), giving linear speedup when P ≪ T1 / T∞.
Space bound: SP / P ≤ S1.
Does not support SP-reciprocity.
Cilk Depth
Cilk depth is the maximum number of Cilk functions nested on the stack during a serial execution.
[figure: an invocation tree of functions A through G whose Cilk depth (3) is not the same as its spawn depth (2)]
Cilk-M Guarantees
Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution; D — Cilk depth.
Time bound: TP = T1 / P + O((S1 + D) T∞), giving linear speedup when P ≪ T1 / ((S1 + D) T∞).
Space bound: SP / P ≤ S1 + D, where S1 is measured in pages.
SP reciprocity:
No longer need to distinguish function types
Whether a function executes in parallel is dictated only by how it is invoked (spawn vs. call); see the sketch below.
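A small sketch of what SP reciprocity buys, written with the Cilk Plus spelling of the keywords (cilk_spawn / cilk_sync); the routine solve() is a hypothetical, precompiled serial function:

    #include <cilk/cilk.h>

    extern double solve(double *data, int n);   /* hypothetical legacy binary routine */

    void both_ways(double *a, double *b, int n, double *ra, double *rb) {
        *ra = cilk_spawn solve(a, n);   /* invoked with spawn: may run in parallel */
        *rb = solve(b, n);              /* invoked with an ordinary call: runs here */
        cilk_sync;                      /* wait for the spawned child */
    }

The same compiled function is used both ways; no separate "parallel" variant of solve is needed.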
System Overview
We implemented a prototype Cilk-M runtime system based on the open-source Cilk-5 runtime system.
We modified the open-source Linux kernel (2.6.29 running on x86 64-bit CPUs) to provide support for TLMM (~600 lines of code).
We have ported the runtime system to work with Intel's Cilk Plus compiler in place of the native Cilk Plus runtime.
Performance Comparison
[figure: bar chart of Cilk-M running time / Cilk Plus running time across benchmarks, measured on an AMD machine with four quad-core 2GHz Opterons, 64KB private L1, 512KB private L2, and 2MB shared L3]
Time bound: TP = T1 / P + C T∞, where C = O(S1 + D).
Space Usage
[figure: per-worker stack space usage across benchmarks]
Space bound: SP / P ≤ S1 + D.
Outline recap: next up, Cilk-M's Work-Stealing Scheduler.
Cilk-M's Work-Stealing Scheduler
Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]: calls and spawns push frames onto the bottom of the worker's deque, and returns pop them off.
When a worker runs out of work, it steals from the top of a random victim's deque.
Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speedup.
[figure: animation of four workers' deques as frames are pushed by calls and spawns, popped by returns, and stolen from the top]
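A rough, lock-based sketch of a worker's deque (illustrative only; Cilk-5's runtime actually uses the nonblocking THE protocol rather than a mutex): the owner pushes and pops frames at the bottom, while a thief removes a frame from the top.

    #include <pthread.h>
    #include <stddef.h>

    struct frame;                       /* an activation frame (spawned or called) */

    struct deque {
        struct frame *slots[1024];      /* fixed capacity, enough for a sketch */
        int top, bottom;                /* live frames occupy [top, bottom) */
        pthread_mutex_t lock;           /* initialize with PTHREAD_MUTEX_INITIALIZER */
    };

    void push_bottom(struct deque *d, struct frame *f) {   /* owner: call/spawn */
        pthread_mutex_lock(&d->lock);
        d->slots[d->bottom++] = f;
        pthread_mutex_unlock(&d->lock);
    }

    struct frame *pop_bottom(struct deque *d) {            /* owner: return */
        struct frame *f = NULL;
        pthread_mutex_lock(&d->lock);
        if (d->bottom > d->top) f = d->slots[--d->bottom];
        pthread_mutex_unlock(&d->lock);
        return f;
    }

    struct frame *steal_top(struct deque *victim) {        /* thief: steal */
        struct frame *f = NULL;
        pthread_mutex_lock(&victim->lock);
        if (victim->bottom > victim->top) f = victim->slots[victim->top++];
        pthread_mutex_unlock(&victim->lock);
        return f;
    }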
Outline recap: next up, TLMM-Based Cactus Stacks.
TLMM-Based Cactus Stacks
(Unreasonable simplification: assume that we can map with arbitrary granularity.)
Each worker uses a standard linear stack in its TLMM region of virtual memory.
Upon a steal, the thief maps (not copies) the physical memory of the stolen prefix to the same virtual addresses as in the victim.
Subsequent spawns and calls grow downward in the thief's independently mapped TLMM region.
All workers see the same virtual-address value for &x, so pointers into shared frames stay valid.
[figure: animation of workers P1, P2, and P3 executing the invocation tree A through E; after each steal the stolen prefix (e.g., frame A with x: 42) appears at the same addresses starting at 0x7f000 in the thief's stack, and every copy of y holds the same value &x]
Handling Page Granularity
In reality, mapping is done at page granularity:
Upon a steal, map the pages containing the stolen prefix into the thief's TLMM region.
Then advance the thief's stack pointer to the next page boundary, which avoids overwriting other frames on the last shared page at the cost of some fragmentation.
Each subsequent steal advances the stack pointer again, causing additional fragmentation.
Space-reclaiming heuristic: reset the stack pointer upon a successful sync.
[figure: animation of workers P1, P2, and P3 with page-granular stacks at 0x7f000, 0x7e000, and 0x7d000, showing the fragmentation introduced as A and then C are stolen]
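A sketch of the page arithmetic involved (an illustrative helper, assuming 4 KB pages): after the stolen prefix is mapped, the thief's stack pointer is rounded down to a page boundary so that the thief's new frames cannot overwrite victim frames that share the prefix's last page.

    #include <stdint.h>

    #define PAGE_SIZE 4096UL    /* assume 4 KB pages, as on x86-64 */

    /* The stack grows downward.  Given the lowest address occupied by the
       stolen prefix, return the address at which the thief's own frames may
       begin: rounded down to a page boundary, wasting at most one page of
       fragmentation per steal.  (The space-reclaiming heuristic later resets
       the stack pointer upon a successful sync.) */
    uintptr_t thief_stack_start(uintptr_t prefix_low) {
        return prefix_low & ~(PAGE_SIZE - 1);
    }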
Outline recap: next up, The Analysis of Cilk-M.
Space Bound with a Heap-Based Cactus Stack
Theorem [BL94]. Let S1 be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack is at most SP / P ≤ S1.
Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. ■
[figure: an invocation tree with P = 4 workers, each busy on an active leaf whose path to the root uses at most S1 stack space]
Cilk-M Space Bound
Claim. Let S1 be the stack space required by a serial execution of a program. Let D be the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stack is at most SP / P ≤ S1 + D.
Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. The additional D term accounts for fragmentation: the worst case occurs when every Cilk function on a stack that realizes the Cilk depth D is stolen, each steal wasting at most one page. ■
[figure: an invocation tree with P = 4 workers, each busy on an active leaf]
Space Usage
[figure: per-worker stack space usage across benchmarks]
Space bound: SP / P ≤ S1 + D.
Performance Bound with a Heap-Based Cactus Stack
Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism.
Theorem [BL94]. A work-stealing scheduler can achieve expected running time TP = T1 / P + O(T∞) on P processors.
Corollary. If the computation exhibits sufficient parallelism (P ≪ T1 / T∞), this bound guarantees near-perfect linear speedup (T1 / TP ≈ P).
Cilk-M Performance Bound
Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; D — Cilk depth.
Claim. A work-stealing scheduler can achieve expected running time TP = T1 / P + C T∞ on P processors, where C = O(S1 + D).
Corollary. If the computation exhibits sufficient parallelism (P ≪ T1 / ((S1 + D) T∞)), this bound guarantees near-perfect linear speedup (T1 / TP ≈ P).
Outline recap: next up, OS Support for TLMM.
To Be or Not To Be … a Process

A Worker = A Process:
Every worker has its own page table.
By default, nothing is shared.
Manually (i.e., via mmap) share nonstack memory.
User calls to mmap do not work (which may include malloc).

A Worker = A Thread:
Workers share a single page table.
By default, everything is shared.
Reserve a region to be independently mapped.
User calls to mmap operate properly.
Page Table for TLMM (Ideally)
x86: the hardware walks the page table, and each thread has a single root-page directory!
[figure: ideally, each thread's root-page directory would combine its own TLMM entries (TLMM 0, TLMM 1, TLMM 2) with shared entries mapping pages such as 7, 12, 28, and 32]
Support for TLMM
Each thread uses a unique root-page directory, so the shared entries of the root-page directory must be synchronized among threads.
[figure: Thread 0 and Thread 1 root-page directories whose shared entries map the same pages, such as 7, 12, and 32]
Limitation of TLMM Cactus Stacks
TLMM does not work for codes that require one thread to see another thread's stack.
E.g., MCS locks [MCS91]:
When a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L.
Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic.
When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock.
If A allocates LA on its stack using TLMM, LA may not be visible to B! (See the sketch below.)
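For concreteness, a simplified MCS lock sketch in C11 (illustrative, not code from the talk): the queue node is conventionally a stack-allocated local, and the release path writes the waiter's flag, which is exactly the cross-thread stack access that a TLMM stack would hide.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool locked;
    };

    struct mcs_lock {
        _Atomic(struct qnode *) tail;
    };

    void mcs_acquire(struct mcs_lock *L, struct qnode *me) {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        struct qnode *pred = atomic_exchange(&L->tail, me);
        if (pred != NULL) {                      /* lock held: enqueue and spin */
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))     /* spin on a thread-local flag */
                ;
        }
    }

    void mcs_release(struct mcs_lock *L, struct qnode *me) {
        struct qnode *succ = atomic_load(&me->next);
        if (succ == NULL) {
            struct qnode *expected = me;
            if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
                return;                          /* no waiter */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                                /* the waiter is still linking in */
        }
        atomic_store(&succ->locked, false);      /* the RELEASER writes the waiter's
                                                    flag; if that flag lives on a TLMM
                                                    stack, it may not be visible here */
    }

    void critical_section(struct mcs_lock *L) {
        struct qnode me;                         /* stack-allocated queue node */
        mcs_acquire(L, &me);
        /* ... protected work ... */
        mcs_release(L, &me);
    }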
Cilk-M is a C-based concurrency platform that satisfies all three criteria simultaneously:
Serial-Parallel Reciprocity
Good Performance
Bounded Stack Space


Editor's Notes

  • #4 A concurrency platform is a software abstraction layer that manages the processors' resources, schedules the computation over the available processors, and provides an interface for the programmer to specify parallel computations.
  • #8 It turns out that there seems to be a fundamental tradeoff between the three criteria. We and other practitioners have considered various strategies, and all of them fail to satisfy one of the three criteria, except for the TLMM cactus stacks, which is the strategy we employed in this work.
  • #23 I am sure everyone knows what a linear stack is. An execution of a serial language can be viewed as a serial walk of an invocation tree. On the left, we have an invocation tree, where A calls B and C, and C calls D and E. On the right are the corresponding views of the stack for each function when it is active. Throughout the rest of the talk, I will use the convention that the stack grows downward. Note that when a function is active, it can always see its ancestors' frames in the stack.
  • #30 But parallel functions fail to interoperate with legacy serial code, because legacy serial code would allocate its frame off the linear stack, and it does not understand the heap linkage, where the call/return is performed via frames in the heap.
  • #31 It turns out that there seems to be a fundamental tradeoff between the three criteria. We and other practitioners have considered various strategies, and all of them fail to satisfy one of the three criteria, except for the TLMM cactus stacks, which is the strategy we employed in this work. We don't have time to go into all the strategies, but I will go into a little more detail on one strategy to illustrate the challenge in satisfying all three criteria. You are welcome to ask me about the other strategies after the talk, if you are interested.
  • #33 The Cilk work-stealing scheduler then executes the program in a way that respects the logical parallelism specified by the programmer while guaranteeing that programs take full advantage of the processors available at runtime.
  • #35 Thread-local memory mapped (TLMM) region: a virtual-address range in which each thread can map physical memory independently. One main constraint that we are operating under is that once a frame has been allocated, its location in virtual memory cannot be changed. We can get around this if every thread has its own local view of a virtual-address range.
  • #36 Because the stacks are allocated in the TLMM region, we can map the region such that part of the stack is shared. For example, worker one has the stack view of … A frame for a given function refers to the same physical memory in all stacks and is mapped to the same virtual address.
  • #37 The time bound guarantees linear speedup if there is sufficient parallelism. Space bound: each worker does not use more than S1.
  • #41 Note that this is running on 16 cores
  • #42 Across all applications, each worker uses no more than 2 times the serial stack usage.
  • #53 Use a standard linear stack in virtual memory
  • #54 Upon a steal, map the physical memory corresponding to the stolen prefix to the same virtual addresses in the thief as in the victim.
  • #55 Subsequent spawns and calls grow downward in the thief's independently mapped TLMM region.
  • #56 Both the victim and the thief see the same virtual address value for the reference to A’s local variable x
  • #57 Subsequent spawns and calls grow downward in the thief's independently mapped TLMM region.
  • #58 ORALLY: P3 steals C … technically, it first stole A and failed to make progress on it, then it steals C.
  • #59 Subsequent spawns and calls grow downward in the thief's independently mapped TLMM region.
  • #61 Of course, our assumption is not so reasonable --- we can’t really map at arbitrary granularity. Instead, we have to map at page granularity.
  • #63 Advancing the stack pointer avoids overwriting other frames on the page, at the cost of fragmentation.
  • #64 The thief then resumes the stolen frame and executes normally in its own TLMM region.
  • #66 Once again, the stack pointer must be advanced, which causes additional fragmentation.
  • #76 Only the worker that executes it would perform the mmap … we need to synchronize among all workers to perform the mmap as well.
  • #77 Multiple threads' local regions overlap with different pages.
  • #78 Each thread uses a unique root page directory … When a thread maps in the shared region … we need to synchronize, but the synchronization is done only once per shared entry in the root page directory.
  • #84 Tazuneki and Yoshida [TazunekiYo00] and Issarny [Issarny91] have investigated the semantics of concurrent exception-handling, taking different approaches from our work. In particular, these researchers pursue new linguistic mechanisms for concurrent exceptions, rather than extending them faithfully from a serial base language as does JCilk. The treatment of multiple exceptions thrown simultaneously is another point of divergence.
  • #85 The JCilk system consists of two components: the runtime system and the compiler.
  • #88 Critically, there is a duality between the actions of the threads. Modern processors typically employ TSO (Total Store Order) and PO (Processor Ordering). That is: reads are not reordered with other reads; writes are not reordered with older reads; writes are not reordered with other writes; and reads may be reordered with older writes if they have different target locations.
  • #89 Traditional memory barriers are PC-based – the processor inevitably stalls upon execution
  • #90 The lock word associated with a monitor can be biased towards one thread. The bias-holding thread can update the lock word using a regular load-update-store. An unbiased lock word must be updated using CAS. Dekker is used to synchronize between the bias-holding thread and the revoker thread when the revoker attempts to update the bias. Network packet processing applications --- each thread handles a group of source addresses and maintains its own data structure. Occasionally, a processing thread needs to update another thread's data structure. If a collection is in progress, the barrier halts the thread until the collection completes. This prevents the thread from mutating the heap concurrently with the collector. The JNI reentry barrier is commonly implemented with a CAS or a Dekker-like "ST;MEMBAR;LD" sequence to mark the thread as a mutator (the ST) and check for a collection in progress (the LD). JNI calls occur frequently, but collections are relatively infrequent.
  • #93 The TM system enforces atomicity by tracking the memory locations that each transaction accesses, detecting conflicts, and possibly aborting and retrying transactions.
  • #94 TM guarantees that transactions are serializable [Papadimitriou79]. That is, transactions affect global memory as if they were executed one at a time in some order, even if in reality several executed concurrently.
  • #100 A decade ago, much multithreaded software was still written with POSIX or Java threads, where the programmer handled the task decomposition and scheduling explicitly. By providing a parallelism abstraction, a concurrency platform frees the programmer from worrying about load balancing and task scheduling.
  • #101 TLMM cactus stack: each worker gets its own linear local view of the tree-structured call stack. Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object. Transactional memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically. I believe a concurrency platform can likewise mitigate the complexity of synchronization by providing the appropriate memory abstractions. A memory abstraction is an abstraction layer between the program execution and the memory that provides a different "view" of a memory location depending on the execution context in which the memory access is made. What other memory abstractions can we provide?
  • #106 Assume we use one linear stack per worker. Here, we are using the term worker interchangeably with the term persistent thread – think of a Java thread or a POSIX thread. This is a beautiful observation made by Arch Robison, who is the main architect of Intel TBB, which is another concurrency platform. The observation is that, using the strategy of one linear stack per worker, some computations may incur quadratic stack growth compared to their serial execution. An example of such a computation is as follows. Here, I am showing you an invocation tree. The frame marked as P is a parallel function, which may have multiple extant children executing in parallel. The frame marked as S is a serial function. I haven't told you the details of how a work-stealing scheduler operates, but for the purpose of this example, all you need to know is that the execution of the computation typically goes depth-first and left to right. Once a P function spawns the left branch (marked as red), however, the right branch becomes available for execution. In order to guarantee the good time bound, one must allow a worker thread to randomly choose a readily available function to execute.
  • #107 Then we can run into the following scenario. I am using different colors to denote the worker that invoked a given function. One main constraint that we are operating under is that once a frame has been allocated, its location in virtual memory cannot be changed. So the green worker cannot pop off the stack-allocated frames, because the purple worker may have pointers to variables allocated on those frames.
  • #108 Note that this is running on 16 cores
  • #109 This worst case occurs when every Cilk function on a stack that realizes the Cilk depth D is stolen.
  • #110 Show space consumption; mention a couple of tricks we did to recycle stack space.
  • #111 The compact linear-stack representation is possible only because in a serial language, a function has at most one extant child function at any time.
  • #116 Static for P1 and P2. Mention fragmentation: part of the stack is not visible. Issue of fragmentation. Want to use backward-compatible linkage: combine this page and the next page, and insert the linkage block in there. Animate just P3.
  • #117 Mention memory args only when registers are not enough; overlapping of the frames.
  • #118 Mention memory args only when registers are not enough; overlapping of the frames. SAY: can access via the stack pointer if the frame size is known statically.
  • #119 Mention memory args only when registers are not enough; overlapping of the frames.
  • #120 Mention memory args only when registers are not enough; overlapping of the frames.
  • #121 CORRECT this text: A then transfers control to B.
  • #124 The compact linear-stack representation is possible only because in a serial language, a function has at most one extant child function at any time.
  • #125 On the left, I am showing you an invocation tree … Such serial languages admit a simple array-based stack for allocating function activation frames. To allocate an activation frame when a function is called, the stack pointer is advanced, and when the function returns, the original stack pointer is restored. This style of execution is space efficient, because all the children of a given function can use and reuse the same region of the stack.
  • #127 x: 42 on A; pass &x, stored as y in C & E. Make wider. The other way around … should be symmetric. In A, have x: 42; pass that down to C.
  • #132 Static threading / OpenMP / streaming / fork-join parallel programming / message passing / GPU, which is not commonly used for multicore architectures with shared memory. A concurrency platform is a software abstraction layer that manages the processors' resources, schedules the computation over the available processors, and provides an interface for the programmer to specify parallel computations.