The Era of Multicore Is Here 1 Source: www.newegg.com
Multicore Architecture* [Diagram: a chip multiprocessor (CMP) with processors (P), per-processor caches ($), an interconnection network, and shared memory.] 2 *The first non-embedded multicore microprocessor was the Power4 from IBM (2001).
Concurrency Platforms A concurrency platform, which provides linguistic support and handles load balancing, can ease the task of parallel programming. User Application / Concurrency Platform / Operating System 3
Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson I-Ting Angelina Lee Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology March 22, Intel XTRL / USA
Three Desirable Criteria Serial-Parallel Reciprocity: interoperability with serial code, including binaries. Bounded Stack Space: reasonable space usage compared to serial execution. Good Performance: ample parallelism ⇒ linear speedup. 6
Various Strategies Cilk++ TBB Cilk Plus 7 The Cactus-Stack Problem: how to satisfy all three criteria simultaneously.
The Cactus-Stack Problem Customer Engineer SP Reciprocity Space Usage Performance 8
The Cactus-Stack Problem Parallelize my software? SP Reciprocity Space Usage Performance 9
The Cactus-Stack Problem Sure! Use my concurrency platform! SP Reciprocity Space Usage Performance 10
The Cactus-Stack Problem Just be sure to recompile all your codebase. Space Usage Performance 12
The Cactus-Stack Problem Hm … I use  third party binaries …  Space Usage Performance 13
The Cactus-Stack Problem *Sigh*. Ok fine.  SP Reciprocity Space Usage Performance 14
The Cactus-Stack Problem Upgrade your RAM then …  SP Reciprocity Performance 15
The Cactus-Stack Problem … you are gonna need extra memory. SP Reciprocity Performance 16
The Cactus-Stack Problem … no? SP Reciprocity Performance 17
The Cactus-Stack Problem … no? SP Reciprocity Space Usage Performance 18
The Cactus-Stack Problem Well … you didn’t say you wanted any performance guarantee, did you? SP Reciprocity Space Usage 19
The Cactus-Stack Problem Gee … I can get that just by running serially. SP Reciprocity Space Usage 20
The Cactus-Stack Problem Serial-Parallel Reciprocity: interoperability with serial code, including binaries. Bounded Stack Space: reasonable space usage compared to serial execution. Good Performance: ample parallelism ⇒ linear speedup. 21
Legacy Linear Stack An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 22
Legacy Linear Stack Rule for pointers:  A parent can pass pointers to its stack variables down to its children, but not the other way around. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 23
Legacy Linear Stack — 1960* Rule for pointers:  A parent can pass pointers to its stack variables down to its children, but not the other way around. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 24 * Stack-based space management for recursive subroutines developed with compilers for Algol 60.
Cactus Stack — 1968* A cactus stack supports multiple views in parallel.  C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 25 * Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.
Heap-Based Cactus Stack A heap-based cactus stack allocates frames off the heap. A Mesa (1979), Ada (1979), Cedar (1986), MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy. heap C B E D 26
Modern Concurrency Platforms
Cilk-5 (MIT)
Cilk-M (MIT)
Cilk Plus (Intel)
Fortress (Oracle Labs)
Habanero (Rice)
JCilk (MIT)
OpenMP
StreamIt (MIT)
Task Parallel Library (Microsoft)
Threading Building Blocks (Intel)
X10 (IBM)
… 27
Heap-Based Cactus Stack A heap-based cactus stack allocates frames off the heap. MIT Cilk-5 (1998) and Intel Cilk++ (2009) use this strategy as well. A heap Good time and space bounds can be obtained … C B E D 28
Heap-Based Cactus Stack Heap linkage: call/return via frames in the heap. A Heap linkage ⇒ parallel functions fail to interoperate with legacy serial code. heap C B E D 29
Various Strategies 30 The main constraint: once allocated, a frame’s location in virtual address space cannot change.
Outline Cilk-M:
Cilk-M Overview
Cilk-M’s Work-Stealing Scheduler
TLMM-Based Cactus Stacks
The Analysis of Cilk-M
OS Support for TLMM
Survey of My Other Work
Direction for Future Work 31
The Cilk Programming Model The named child function may execute in parallel with the continuation of its parent. int fib(int n) { if (n < 2) { return n; } int x = spawn fib(n-1); int y = fib(n-2); sync; return (x + y); } Control cannot pass this point until all spawned children have returned. Cilk keywords grant permission for parallel execution. They do not command parallel execution. 32
Cilk-M A work-stealing runtime system based on Cilk that solves the cactus-stack problem by thread-local memory mapping (TLMM). 33
Cilk-M Overview Thread-local memory mapped (TLMM) region: a virtual-address range in which each thread can map physical memory independently. Idea: allocate the stacks for each worker in the TLMM region. [Address-space layout, high to low virtual address: stack, TLMM, heap, uninitialized data (bss), initialized data, code; all regions except TLMM are shared.] 34
Basic Cilk-M Idea 0x7f000 A A A Workers achieve sharing by mapping the same physical memory at the same virtual address. x: 42 x: 42 x: 42 B C C y: &x y: &x E D y: &x P3 P1 P2 A C B Unreasonable simplification:  Assume that we can map with arbitrary granularity. E D 35
Cilk Guarantees with a Heap-Based Cactus Stack Definition. TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution. Time bound: TP = T1 / P + O(T∞).
Space bound: SP / P ≤ S1.
Does not support SP reciprocity. 36
Cilk Depth 37 A C B Cilk depth (3) is not the same as spawn depth (2). E D G F Cilk depth is the max number of Cilk functions nested on the stack during a serial execution.
Cilk-M Guarantees Definition. TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution; D — Cilk depth. Time bound: TP = T1 / P + CT∞, where C = O(S1 + D) ⇒ linear speedup when P ≪ T1 / ((S1 + D)T∞). Space bound: SP / P ≤ S1 + D.
SP reciprocity:
No longer need to distinguish function types
Parallelism or not is dictated only by how a function is invoked (spawn vs. call). 38
System Overview
We modified the open-source Linux kernel (2.6.29 running on x86 64-bit CPUs) to provide support for TLMM (~600 lines of code).
We have ported the runtime system to work with Intel’s Cilk Plus compiler in place of the native Cilk Plus runtime. 39
Performance Comparison AMD 4 quad-core 2GHz Opteron, 64KB private L1, 512KB private L2, 2MB shared L3. Cilk-M running time / Cilk Plus running time. Time bound: TP = T1 / P + CT∞, where C = O(S1 + D). 40
Space Usage Space bound: SP / P ≤ S1 + D. 41
Outline Cilk-M:
Cilk-M Overview
Cilk-M’s Work-Stealing Scheduler
TLMM-Based Cactus Stacks
The Analysis of Cilk-M
OS Support for TLMM
Survey of My Other Work
Direction for Future Work 42
Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn spawn call P P P P 43
Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call call! P P P P 44
Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call spawn spawn! P P P P 45
Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call spawn call spawn call! spawn! spawn! spawn P P P P 46
Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call spawn call spawn spawn spawn call call spawn call spawn return! spawn P P P P 47
Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn call spawn steal! spawn P P P P When a worker runs out of work, it steals from the top of a random victim’s deque. 48
Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn spawn call spawn spawn! spawn P P P P When a worker runs out of work, it steals from the top of a random victim’s deque. 49
Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn spawn call spawn spawn P P P P Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speedup. 50
Outline Cilk-M:
Cilk-M Overview
Cilk-M’s Work-Stealing Scheduler
TLMM-Based Cactus Stacks
The Analysis of Cilk-M
OS Support for TLMM
Survey of My Other Work
Direction for Future Work 51
TLMM-Based Cactus Stacks 0x7f000 A x: 42 B Use a standard linear stack in virtual memory. y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 52
TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B Map (not copy) the stolen prefix to the same virtual addresses. y: &x steal A P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 53
TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B Subsequent spawns and calls grow downward in the thief’s TLMM region. C y: &x y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 54
TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B Both workers see the same virtual address value for &x.  C y: &x y: &x P3 P1 P2 A C B Unreasonable simplification:  Assume that we can map with arbitrary granularity. E D 55
TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B C Both workers see the same virtual address value for &x.  y: &x D y: &x P3 P1 P2 A C B Unreasonable simplification:  Assume that we can map with arbitrary granularity. E D 56
TLMM-Based Cactus Stacks 0x7f000 A A A A x: 42 x: 42 x: 42 x: 42 B C C C Map (not copy) the stolen prefix to the same virtual addresses. y: &x y: &x y: &x D y: &x steal C P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 57
TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B C C Subsequent spawns and calls grow downward in the thief’s TLMM region. y: &x y: &x D E y: &x z: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 58
TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B C C All workers see the same virtual address value for &x.  y: &x y: &x E D y: &x z: &x P3 P1 P2 A C B Unreasonable simplification:  Assume that we can map with arbitrary granularity. E D 59
Handling Page Granularity 0x7f000 A page size B 0x7e000 0x7d000 A C B P3 P1 P2 E D 60
Handling Page Granularity 0x7f000 A A A page size B 0x7e000 Map the stolen prefix. 0x7d000 A steal A C B P3 P1 P2 E D 61
Handling Page Granularity 0x7f000 A A page size B 0x7e000 Advance the stack pointer ⇒ fragmentation. 0x7d000 A steal A C B P3 P1 P2 E D 62
Handling Page Granularity 0x7f000 A A page size B 0x7e000 C D 0x7d000 A C B P3 P1 P2 E D 63
Handling Page Granularity 0x7f000 A A A A page size B 0x7e000 C C C D 0x7d000 A steal C C B P3 P1 P2 E D 64
Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Advance the stack pointer again ⇒ additional fragmentation. D 0x7d000 A steal C C B P3 P1 P2 E D 65
Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Advance the stack pointer again ⇒ additional fragmentation. D 0x7d000 E A C B P3 P1 P2 E D 66
Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Space-reclaiming heuristic: reset the stack pointer upon successful sync. D 0x7d000 E A C B P3 P1 P2 E D 67
Outline Cilk-M:
Cilk-M Overview
Cilk-M’s Work-Stealing Scheduler
TLMM-Based Cactus Stacks
The Analysis of Cilk-M
OS Support for TLMM
Survey of My Other Work
Direction for Future Work 68
Space Bound with a Heap-Based Cactus Stack Theorem [BL94]. Let S1 be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack is at most SP / P ≤ S1. Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. ■ P = 4 S1 P P P P 69
Cilk-M Space Bound Claim. Let S1 be the stack space required by a serial execution of a program. Let D be the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stack is at most SP / P ≤ S1 + D. Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. ■ P = 4 S1 P P P P 70
Space Usage Space bound: SP / P ≤ S1 + D. 71
Performance Bound with a Heap-Based Cactus Stack Definition. TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism. Theorem [BL94]. A work-stealing scheduler can achieve expected running time TP = T1 / P + O(T∞) on P processors. Corollary. If the computation exhibits sufficient parallelism (P ≪ T1 / T∞), this bound guarantees near-perfect linear speedup (T1 / TP ≈ P). 72
Cilk-M Performance Bound Definition. TP — execution time on P processors; T1 — work; T∞ — span; T1 / T∞ — parallelism; D — Cilk depth. Claim. A work-stealing scheduler can achieve expected running time TP = T1 / P + CT∞ on P processors, where C = O(S1 + D). Corollary. If the computation exhibits sufficient parallelism (P ≪ T1 / ((S1 + D)T∞)), this bound guarantees near-perfect linear speedup (T1 / TP ≈ P). 73
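As a sketch of why the corollary follows from the claimed bound, substitute the parallelism condition into the second term:

```latex
T_P = \frac{T_1}{P} + C\,T_\infty, \qquad C = O(S_1 + D).
\quad
P \ll \frac{T_1}{(S_1 + D)\,T_\infty}
\;\Longrightarrow\;
C\,T_\infty \ll \frac{T_1}{P}
\;\Longrightarrow\;
T_P \approx \frac{T_1}{P}
\;\Longrightarrow\;
\frac{T_1}{T_P} \approx P.
```

The only difference from the classic [BL94] bound is that the constant in front of the span term grows from O(1) to O(S1 + D), which shrinks the range of P over which speedup is near-linear.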
Outline Cilk-M:
Cilk-M Overview
Cilk-M’s Work-Stealing Scheduler
TLMM-Based Cactus Stacks
The Analysis of Cilk-M
OS Support for TLMM
Survey of My Other Work
Direction for Future Work 74
To Be or Not To Be … a Process
A Worker = A Process:
Each worker has its own page table.
By default, nothing is shared.
Manually (i.e., mmap) share nonstack memory.
User calls to mmap do not work (which may include malloc).
A Worker = A Thread:
Workers share a single page table.
By default, everything is shared.
Reserve a region to be independently mapped.
User calls to mmap operate properly. 75
Page Table for TLMM (Ideally) TLMM 2 TLMM 1 Shared TLMM 0 x86: Hardware walks the page table. Each thread has a single root-page directory! Page 28 Page 12 Page 7 Page 32 76
Support for TLMM Thread 0 Thread 1 Must synchronize the root-page directory among threads. Page 32 Page 7 Page 12 77
Limitation of TLMM Cactus Stacks A worker’s stack-allocated data may not be visible to other workers.
E.g., MCS locks [MCS91]:
When a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L.
Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic.
When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock.
If A allocates LA on its stack using TLMM, LA may not be visible to B! 78
Serial-Parallel Reciprocity

More Related Content

What's hot

St Petersburg R user group meetup 2, Parallel R
St Petersburg R user group meetup 2, Parallel RSt Petersburg R user group meetup 2, Parallel R
St Petersburg R user group meetup 2, Parallel RAndrew Bzikadze
 
LLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS ProgramsLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS ProgramsAkihiro Hayashi
 
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...Akihiro Hayashi
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerMarina Kolpakova
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5Jeff Larkin
 
Microsoft kafka load imbalance
Microsoft   kafka load imbalanceMicrosoft   kafka load imbalance
Microsoft kafka load imbalanceNitin Kumar
 
Computational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in RComputational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in Rherbps10
 
Hardware Description Beyond Register-Transfer Level (RTL) Languages
Hardware Description Beyond Register-Transfer Level (RTL) LanguagesHardware Description Beyond Register-Transfer Level (RTL) Languages
Hardware Description Beyond Register-Transfer Level (RTL) LanguagesLEGATO project
 
Return oriented programming
Return oriented programmingReturn oriented programming
Return oriented programminghybr1s
 
Python Basis Tutorial
Python Basis TutorialPython Basis Tutorial
Python Basis Tutorialmd sathees
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5Jeff Larkin
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkAlexey Smirnov
 
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaRuntime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaJuan Fumero
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Flink Forward
 
CNIT 141: 9. Elliptic Curve Cryptosystems
CNIT 141: 9. Elliptic Curve CryptosystemsCNIT 141: 9. Elliptic Curve Cryptosystems
CNIT 141: 9. Elliptic Curve CryptosystemsSam Bowne
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)Sławomir Zborowski
 
Object Detection Methods using Deep Learning
Object Detection Methods using Deep LearningObject Detection Methods using Deep Learning
Object Detection Methods using Deep LearningSungjoon Choi
 
Juan josefumeroarray14
Juan josefumeroarray14Juan josefumeroarray14
Juan josefumeroarray14Juan Fumero
 
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...Kalman Graffi
 

What's hot (20)

St Petersburg R user group meetup 2, Parallel R
St Petersburg R user group meetup 2, Parallel RSt Petersburg R user group meetup 2, Parallel R
St Petersburg R user group meetup 2, Parallel R
 
LLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS ProgramsLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS Programs
 
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
Run time
Run timeRun time
Run time
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
 
Microsoft kafka load imbalance
Microsoft   kafka load imbalanceMicrosoft   kafka load imbalance
Microsoft kafka load imbalance
 
Computational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in RComputational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in R
 
Hardware Description Beyond Register-Transfer Level (RTL) Languages
Hardware Description Beyond Register-Transfer Level (RTL) LanguagesHardware Description Beyond Register-Transfer Level (RTL) Languages
Hardware Description Beyond Register-Transfer Level (RTL) Languages
 
Return oriented programming
Return oriented programmingReturn oriented programming
Return oriented programming
 
Python Basis Tutorial
Python Basis TutorialPython Basis Tutorial
Python Basis Tutorial
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions Framework
 
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaRuntime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
 
CNIT 141: 9. Elliptic Curve Cryptosystems
CNIT 141: 9. Elliptic Curve CryptosystemsCNIT 141: 9. Elliptic Curve Cryptosystems
CNIT 141: 9. Elliptic Curve Cryptosystems
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)
 
Object Detection Methods using Deep Learning
Object Detection Methods using Deep LearningObject Detection Methods using Deep Learning
Object Detection Methods using Deep Learning
 
Juan josefumeroarray14
Juan josefumeroarray14Juan josefumeroarray14
Juan josefumeroarray14
 
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...
 

Similar to Mit cilk

Cilk - An Efficient Multithreaded Runtime System
Cilk - An Efficient Multithreaded Runtime SystemCilk - An Efficient Multithreaded Runtime System
Cilk - An Efficient Multithreaded Runtime SystemShareek Ahamed
 
Stephan berg track f
Stephan berg   track fStephan berg   track f
Stephan berg track fAlona Gradman
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
Compiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesCompiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesEelco Visser
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
IOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentIOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentShubham Joshi
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Intel® Software
 
Introduction to simulink (1)
Introduction to simulink (1)Introduction to simulink (1)
Introduction to simulink (1)Memo Love
 
pipeline and vector processing
pipeline and vector processingpipeline and vector processing
pipeline and vector processingAcad
 
Power and Clock Gating Modelling in Coarse Grained Reconfigurable Systems
Power and Clock Gating Modelling in Coarse Grained Reconfigurable SystemsPower and Clock Gating Modelling in Coarse Grained Reconfigurable Systems
Power and Clock Gating Modelling in Coarse Grained Reconfigurable SystemsMDC_UNICA
 
Loop parallelization & pipelining
Loop parallelization & pipeliningLoop parallelization & pipelining
Loop parallelization & pipeliningjagrat123
 
Pragmatic model checking: from theory to implementations
Pragmatic model checking: from theory to implementationsPragmatic model checking: from theory to implementations
Pragmatic model checking: from theory to implementationsUniversität Rostock
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 

Similar to Mit cilk (20)

Cilk - An Efficient Multithreaded Runtime System
Cilk - An Efficient Multithreaded Runtime SystemCilk - An Efficient Multithreaded Runtime System
Cilk - An Efficient Multithreaded Runtime System
 
Lecture12
Lecture12Lecture12
Lecture12
 
Programmable Logic Array
Programmable Logic Array Programmable Logic Array
Programmable Logic Array
 
Machine Learning @NECST
Machine Learning @NECSTMachine Learning @NECST
Machine Learning @NECST
 
Stephan berg track f
Stephan berg   track fStephan berg   track f
Stephan berg track f
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Compiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesCompiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual Machines
 
Compiler unit 4
Compiler unit 4Compiler unit 4
Compiler unit 4
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
IOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentIOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_present
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
Introduction to simulink (1)
Introduction to simulink (1)Introduction to simulink (1)
Introduction to simulink (1)
 
pipeline and vector processing
pipeline and vector processingpipeline and vector processing
pipeline and vector processing
 
Power and Clock Gating Modelling in Coarse Grained Reconfigurable Systems
Power and Clock Gating Modelling in Coarse Grained Reconfigurable SystemsPower and Clock Gating Modelling in Coarse Grained Reconfigurable Systems
Power and Clock Gating Modelling in Coarse Grained Reconfigurable Systems
 
Loop parallelization & pipelining
Loop parallelization & pipeliningLoop parallelization & pipelining
Loop parallelization & pipelining
 
Pragmatic model checking: from theory to implementations
Pragmatic model checking: from theory to implementationsPragmatic model checking: from theory to implementations
Pragmatic model checking: from theory to implementations
 
NoSQL Smackdown!
NoSQL Smackdown!NoSQL Smackdown!
NoSQL Smackdown!
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 

Mit cilk

  • 1. The Era of Multicore Is Here 1 Source: www.newegg.com
  • 2. Memory Network … ¢ ¢ ¢ P P P Chip Multiprocessor (CMP) Multicore Architecture* 2 *The first non-embedded multicore microprocessor was the Power4 from IBM (2001).
  • 3. Concurrency Platforms Aconcurrency platform,that provides linguistic support and handles load balancing, can ease the task of parallel programming. User Application Concurrency Platform Operating System 3
  • 4. Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson I-Ting Angelina Lee Computer Science and Artificial Intelligence LaboratoryMassachusetts Institute of Technology March 22, Intel XTRL / USA
  • 5. Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson I-Ting Angelina Lee Computer Science and Artificial Intelligence LaboratoryMassachusetts Institute of Technology March 22, Intel XTRL / USA
  • 6. Three Desirable Criteria Interoperability with serial code, including binaries Serial-ParallelReciprocity GoodPerformance BoundedStack Space Reasonable space usage compared to serial execution Ample parallelism  linear speedup 6
  • 7. Various Strategies Cilk++ TBB Cilk Plus 7 The Cactus-Stack Problem: how to satisfy all three criteriasimultaneously.
  • 8. The Cactus-Stack Problem Customer Engineer SP Reciprocity Space Usage Performance 8
  • 9. The Cactus-Stack Problem Parallelize my software? SP Reciprocity Space Usage Performance 9
  • 10. The Cactus-Stack Problem Sure! Use my concurrency platform! SP Reciprocity Space Usage Performance 10
  • 12. The Cactus-Stack Problem Just be sure to recompile your entire codebase. Space Usage Performance 12
  • 13. The Cactus-Stack Problem Hm … I use third party binaries … Space Usage Performance 13
  • 14. The Cactus-Stack Problem *Sigh*. Ok fine. SP Reciprocity Space Usage Performance 14
  • 15. The Cactus-Stack Problem Upgrade your RAM then … SP Reciprocity Performance 15
  • 16. The Cactus-Stack Problem … you are gonna need extra memory. SP Reciprocity Performance 16
  • 17. The Cactus-Stack Problem … no? SP Reciprocity Performance 17
  • 18. The Cactus-Stack Problem … no? SP Reciprocity Space Usage Performance 18
  • 19. The Cactus-Stack Problem Well … you didn’t say you wanted any performance guarantee, did you? SP Reciprocity Space Usage 19
  • 20. The Cactus-Stack Problem Gee … I can get that just by running serially. SP Reciprocity Space Usage 20
  • 21. The Cactus-Stack Problem Serial-Parallel Reciprocity: interoperability with serial code, including binaries. Bounded Stack Space: reasonable space usage compared to serial execution. Good Performance: ample parallelism ⇒ linear speedup. 21
  • 22. Legacy Linear Stack An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 22
  • 23. Legacy Linear Stack Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 23
  • 24. Legacy Linear Stack — 1960* Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 24 * Stack-based space management for recursive subroutines developed with compilers for Algol 60.
  • 25. Cactus Stack — 1968* A cactus stack supports multiple views in parallel. C B A D E A A A A A A B C C C C B D E E D invocation tree views of stack 25 * Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.
  • 26. Heap-Based Cactus Stack A heap-based cactus stack allocates frames off the heap. A Mesa (1979), Ada (1979), Cedar (1986), MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy. heap C B E D 26
  • 36. Task Parallel Library (Microsoft)
  • 40. Heap-Based Cactus Stack A heap-based cactus stack allocates frames off the heap. MIT Cilk-5 (1998) and Intel Cilk++ (2009) use this strategy as well. A heap Good time and space bounds can be obtained … C B E D 28
  • 41. Heap-Based Cactus Stack Heap linkage: call/return via frames in the heap. A Heap linkage ⇒ parallel functions fail to interoperate with legacy serial code. heap C B E D 29
  • 42. Various Strategies 30 The main constraint: once allocated, a frame’s location in virtual memory cannot change.
  • 48. OS Support for TLMM / Survey of My Other Work / Direction for Future Work 31
  • 49. The Cilk Programming Model The named child function may execute in parallel with the continuation of its parent. int fib(int n) { if (n < 2) { return n; } int x = spawn fib(n-1); int y = fib(n-2); sync; return (x + y); } Control cannot pass this point until all spawned children have returned. Cilk keywords grant permission for parallel execution. They do not command parallel execution. 32
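The slide's last point has a concrete consequence: deleting `spawn` and `sync` leaves a valid C function with the same result, the so-called serial elision. A minimal sketch in plain C (the Cilk originals are shown in comments):

```c
/* Serial elision of the slide's example: deleting the Cilk keywords
 * `spawn` and `sync` leaves ordinary C with the same meaning, which is
 * why the keywords grant permission for, rather than command,
 * parallel execution. */
int fib(int n) {
    if (n < 2)
        return n;
    int x = fib(n - 1);   /* Cilk: int x = spawn fib(n-1); */
    int y = fib(n - 2);
    /* Cilk: sync;  (a no-op once the spawn is elided) */
    return x + y;
}
```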
  • 50. Cilk-M A work-stealing runtime system based on Cilk that solves the cactus-stack problem by thread-local memory mapping (TLMM). 33
  • 51. Cilk-M Overview High virtual addr stack TLMM Thread-local memory mapped (TLMM) region: A virtual-address range in which each thread can map physical memory independently. heap uninitialized data (bss) shared Idea: Allocate the stacks for each worker in the TLMM region. initialized data code Low virtual addr 34
  • 52. Basic Cilk-M Idea 0x7f000 A A A Workers achieve sharing by mapping the same physical memory at the same virtual address. x: 42 x: 42 x: 42 B C C y: &x y: &x E D y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 35
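Cilk-M's real TLMM mechanism needs kernel support, but the "map, not copy" idea on this slide can be illustrated with ordinary POSIX mmap: the same backing page mapped at two places, so a write through one view is visible through the other. This is an illustration only, not the Cilk-M runtime; `same_physical_page` is a made-up helper name.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one physical page twice via a shared file-backed mapping.
 * Returns 1 if a write through mapping `a` is seen through mapping
 * `b`, -1 on any failure. TLMM goes further: each *thread* maps its
 * region independently at the SAME virtual address. */
int same_physical_page(void) {
    FILE *f = tmpfile();                  /* anonymous backing file */
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, 4096) != 0)
        return -1;
    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return -1;
    a[0] = 'X';                           /* write through one view... */
    int ok = (b[0] == 'X');               /* ...read through the other */
    munmap(a, 4096);
    munmap(b, 4096);
    fclose(f);
    return ok;
}
```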
  • 54. Space bound: SP/P ≤ S1.
  • 55. Does not support SP-reciprocity. 36
  • 56. Cilk Depth 37 A C B Cilk depth (3) is not the same as spawn depth (2). E D G F Cilk depth is the max number of Cilk functions nested on the stack during a serial execution
  • 59. No longer need to distinguish function types.
  • 60. Parallelism or not is dictated only by how a function is invoked (spawn vs. call). 38
  • 62. We modified the open-source Linux kernel (2.6.29 running on x86 64-bit CPUs) to provide support for TLMM (~600 lines of code).
  • 63. We have ported the runtime system to work with the Intel Cilk Plus compiler in place of the native Cilk Plus runtime. 39
  • 64. Performance Comparison AMD 4 quad-core 2GHz Opteron, 64KB private L1, 512K private L2, 2MB shared L3 Cilk-M running time / Cilk Plus running time Time bound: TP = T1/P + C·T∞, where C = O(S1 + D) 40
  • 65. Space Usage Space bound: SP/P ≤ S1 + D 41
  • 71. OS Support for TLMM / Survey of My Other Work / Direction for Future Work 42
  • 72. Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn spawn call P P P P 43
  • 73. Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call call! P P P P 44
  • 74. Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call spawn spawn! P P P P 45
  • 75. Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn call spawn spawn spawn call spawn call spawn call spawn call spawn call! spawn! spawn! spawn P P P P 46
  • 76. Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call spawn call spawn spawn spawn call call spawn call spawn return! spawn P P P P 47
  • 77. Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn call spawn steal! spawn P P P P When a worker runs out of work, it steals from the top of a random victim’s deque. 48
  • 78. Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn spawn call spawn spawn! spawn P P P P When a worker runs out of work, it steals from the top of a random victim’s deque. 49
  • 79. Cilk-M’s Work-Stealing Scheduler Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. spawn spawn call call spawn spawn spawn call call spawn spawn call spawn spawn P P P P Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speedup. 50
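The deque discipline these slides animate can be sketched as a toy model. Real Cilk runtimes make the deque lock-free (the THE protocol); this sequential version with integer frame IDs keeps only the owner/thief asymmetry, and all names are illustrative.

```c
/* Toy work deque: the owner pushes and pops frames at the bottom;
 * a thief steals from the top (the oldest, largest piece of work).
 * Sequential sketch only -- no synchronization. */
#define DEQUE_CAP 64

typedef struct {
    int frames[DEQUE_CAP];
    int top;      /* steal end (oldest work) */
    int bottom;   /* owner's end (newest work) */
} work_deque;

void deque_init(work_deque *d) { d->top = d->bottom = 0; }

void push_bottom(work_deque *d, int frame) {    /* owner spawns */
    d->frames[d->bottom++] = frame;
}

int pop_bottom(work_deque *d) {                 /* owner resumes, depth-first */
    return d->bottom > d->top ? d->frames[--d->bottom] : -1;
}

int steal_top(work_deque *d) {                  /* thief takes the oldest frame */
    return d->bottom > d->top ? d->frames[d->top++] : -1;
}
```

Stealing from the top while working from the bottom is what gives the infrequent-steal behavior the theorem on the slide relies on.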
  • 85. OS Support for TLMM / Survey of My Other Work / Direction for Future Work 51
  • 86. TLMM-Based Cactus Stacks 0x7f000 A x: 42 B Use standard linear stack in virtual memory. y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 52
  • 87. TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B Map (not copy) the stolen prefix to the same virtual addresses. y: &x steal A P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 53
  • 88. TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B Subsequent spawns and calls grow downward in the thief’s TLMM region. C y: &x y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 54
  • 89. TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B Both workers see the same virtual address value for &x. C y: &x y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 55
  • 90. TLMM-Based Cactus Stacks 0x7f000 A A x: 42 x: 42 B C Both workers see the same virtual address value for &x. y: &x D y: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 56
  • 91. TLMM-Based Cactus Stacks 0x7f000 A A A A x: 42 x: 42 x: 42 x: 42 B C C C Map (not copy) the stolen prefix to the same virtual addresses. y: &x y: &x y: &x D y: &x steal C P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 57
  • 92. TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B C C Subsequent spawns and calls grow downward in the thief’s TLMM region. y: &x y: &x D E y: &x z: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 58
  • 93. TLMM-Based Cactus Stacks 0x7f000 A A A x: 42 x: 42 x: 42 B C C All workers see the same virtual address value for &x. y: &x y: &x E D y: &x z: &x P3 P1 P2 A C B Unreasonable simplification: Assume that we can map with arbitrary granularity. E D 59
  • 94. Handling Page Granularity 0x7f000 A page size B 0x7e000 0x7d000 A C B P3 P1 P2 E D 60
  • 95. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 Map the stolen prefix. 0x7d000 A steal A C B P3 P1 P2 E D 61
  • 96. Handling Page Granularity 0x7f000 A A page size B 0x7e000 Advance the stack pointer ⇒ fragmentation. 0x7d000 A steal A C B P3 P1 P2 E D 62
  • 97. Handling Page Granularity 0x7f000 A A page size B 0x7e000 C D 0x7d000 A C B P3 P1 P2 E D 63
  • 98. Handling Page Granularity 0x7f000 A A A A page size B 0x7e000 C C C D 0x7d000 A steal C C B P3 P1 P2 E D 64
  • 99. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Advance the stack pointer again ⇒ additional fragmentation. D 0x7d000 A steal C C B P3 P1 P2 E D 65
  • 100. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Advance the stack pointer again ⇒ additional fragmentation. D 0x7d000 E A C B P3 P1 P2 E D 66
  • 101. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 C C Space-reclaiming heuristic: reset the stack pointer upon successful sync. D 0x7d000 E A C B P3 P1 P2 E D 67
  • 106. The Analysis of Cilk-M
  • 107. OS Support for TLMM / Survey of My Other Work / Direction for Future Work 68
  • 108. Space Bound with a Heap-Based Cactus Stack Theorem [BL94]. Let S1 be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack is at most SP/P ≤ S1. Proof. The work-stealing algorithm maintains the busy-leaves property: Every active leaf frame has a worker executing on it. ■ P = 4 S1 P P P P 69
  • 109. Cilk-M Space Bound Claim. Let S1 be the stack space required by a serial execution of a program. Let D be the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stack is at most SP/P ≤ S1 + D. Proof. The work-stealing algorithm maintains the busy-leaves property: Every active leaf frame has a worker executing on it. ■ P = 4 S1 P P P P 70
  • 110. Space Usage Space bound: SP/P ≤ S1 + D 71
  • 111. Performance Bound with a Heap-Based Cactus Stack Definition. TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism. Theorem [BL94]. A work-stealing scheduler can achieve expected running time TP = T1/P + O(T∞) on P processors. Corollary. If the computation exhibits sufficient parallelism (P ≪ T1/T∞), this bound guarantees near-perfect linear speedup (T1/TP ≈ P). 72
  • 112. Cilk-M Performance Bound Definition. TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism; D — Cilk depth. Claim. A work-stealing scheduler can achieve expected running time TP = T1/P + C·T∞ on P processors, where C = O(S1 + D). Corollary. If the computation exhibits sufficient parallelism (P ≪ T1/((S1 + D)·T∞)), this bound guarantees near-perfect linear speedup (T1/TP ≈ P). 73
  • 117. The Analysis of Cilk-M
  • 118. OS Support for TLMM / Survey of My Other Work / Direction for Future Work 74
  • 120. Workers share a single page table.
  • 121. By default, nothing is shared.
  • 123. Manually (i.e. mmap) share nonstack memory.
  • 124. Reserve a region to be independently mapped.
  • 125. User calls to mmap do not work (which may include malloc).
  • 126. User calls to mmap operate properly. 75
  • 127. Page Table for TLMM (Ideally) TLMM 2 TLMM 1 Shared TLMM 0 x86: Hardware walks the page table. Each thread has a single root-page directory! Page 28 Page 12 Page 7 Page 32 76
  • 128. Support for TLMM Thread 0 Thread 1 Must synchronize the root-page directory among threads. Page 32 Page 7 Page 12 77
  • 130. E.g., MCS locks [MCS91]:
  • 131. When a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L.
  • 132. Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic.
  • 133. When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock.
  • 134. If A allocates LA on its stack using TLMM, LA may not be visible to B! 78
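The protocol in the bullets above can be written out with C11 atomics. This is a textbook rendering of the MCS lock, not code from the talk; each waiter spins on a flag in its own queue node, which is exactly the datum the last bullet warns must not live in TLMM stack memory, because the previous lock holder writes it. The `mcs_demo` driver (4 pthreads bumping a counter) is an added assumption for testing, and the whole thing needs `-pthread` to link.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of an MCS lock [MCS91] with C11 atomics. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock;

void mcs_acquire(mcs_lock *L, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node *prev = atomic_exchange(L, me);     /* enqueue at the tail */
    if (prev != NULL) {
        atomic_store(&prev->next, me);           /* link behind predecessor */
        while (atomic_load(&me->locked))         /* spin on OUR node only */
            sched_yield();
    }
}

void mcs_release(mcs_lock *L, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expect = me;
        if (atomic_compare_exchange_strong(L, &expect, NULL))
            return;                              /* queue empty: done */
        while ((succ = atomic_load(&me->next)) == NULL)
            sched_yield();                       /* successor mid-enqueue */
    }
    atomic_store(&succ->locked, false);          /* hand the lock over */
}

/* Demo driver: 4 threads each take the lock 10000 times. */
static mcs_lock demo_lock;
static long demo_counter;

static void *demo_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        mcs_node node;                 /* ordinary shared stack: fine here */
        mcs_acquire(&demo_lock, &node);
        demo_counter++;
        mcs_release(&demo_lock, &node);
    }
    return NULL;
}

long mcs_demo(void) {
    pthread_t t[4];
    demo_counter = 0;
    atomic_store(&demo_lock, (mcs_node *)NULL);
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, demo_worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```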
  • 138. Bounded and efficient use of memory for the cactus stack
  • 141. OS support for TLMM (~600 lines of code)
  • 145. Ownership-Aware Transactional Memory / Direction for Future Work 80
  • 146. The JCilk Language Joint work with John Danaher and Charles Leiserson Parallel Constructs from Cilk: spawn & sync Java Core Functionalities 81
  • 147. The JCilk Language Joint work with John Danaher and Charles Leiserson Exception Handling Parallel Constructs from Cilk: spawn & sync Java Core Functionalities 82
  • 149. JCilk’s exception semantics include an implicit abort mechanism, which allows speculative parallelism to be expressed succinctly in JCilk.
  • 150. Other researchers [I91, TY00, BM00] pursued new linguistic mechanisms. Exception Handling in a Concurrent Context 83
  • 151. The JCilk System JCilk Runtime System JCilk Compiler JCilk to Java + goto Jgo compiler: GCJ + goto support JVM source Fib.jcilk Fib.jgo Fib.class 84
  • 153. JCilk’s abort mechanism extends Java’s existing exception mechanism in a natural way to propagate an abort, allowing the programmer to clean up. What We Discovered 85
  • 156. Ownership-Aware Transactional Memory / Direction for Future Work 86
  • 157. Initially, L1 = 0 and L2 = 0 Thread 1 Thread 2 L1 = 1; if(L2 == 0) { /* critical section */ … } L1 = 0; L2 = 1; if(L1 == 0) { /* critical section */ … } L2 = 0; Dekker’s Protocol (Simplified) Reads may be reordered with older writes. 87
  • 158. Initially, L1 = 0 and L2 = 0 Thread 1 Thread 2 L1 = 1; mfence(); if(L2 == 0) { /* critical section */ … } L1 = 0; L2 = 1; mfence(); if(L1 == 0) { /* critical section */ … } L2 = 0; Memory fences needed ⇒ cause stalling Dekker’s Protocol (Simplified) 88
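The fenced protocol on this slide can be exercised with C11 atomics, where `atomic_thread_fence(memory_order_seq_cst)` plays the role of `mfence()`. The retry loop, the yield-based backoff, and the `inside`/`violations` cross-check are illustrative additions, not part of the slide; compile with `-pthread`.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Simplified Dekker protocol from the slide. A thread that sees the
 * other's flag set skips the critical section, backs off, and retries. */
#define DEKKER_ITERS 5000

static atomic_int lk1, lk2;      /* the slide's L1 and L2 */
static atomic_int inside;        /* threads currently in the critical section */
static atomic_int violations;    /* nonzero if exclusion was ever broken */
static long count1, count2;

static void critical_section(long *count) {
    if (atomic_fetch_add(&inside, 1) != 0)
        atomic_fetch_add(&violations, 1);
    (*count)++;
    atomic_fetch_sub(&inside, 1);
}

static void *dekker_thread1(void *arg) {
    (void)arg;
    while (count1 < DEKKER_ITERS) {
        atomic_store(&lk1, 1);
        atomic_thread_fence(memory_order_seq_cst);   /* the slide's mfence() */
        if (atomic_load(&lk2) == 0)
            critical_section(&count1);
        atomic_store(&lk1, 0);
        sched_yield();                               /* back off and retry */
    }
    return NULL;
}

static void *dekker_thread2(void *arg) {
    (void)arg;
    while (count2 < DEKKER_ITERS) {
        atomic_store(&lk2, 1);
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load(&lk1) == 0)
            critical_section(&count2);
        atomic_store(&lk2, 0);
        sched_yield();
    }
    return NULL;
}

/* Returns 1 if both threads completed and exclusion was never broken. */
int dekker_demo(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, dekker_thread1, NULL);
    pthread_create(&t2, NULL, dekker_thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&violations) == 0
        && count1 == DEKKER_ITERS && count2 == DEKKER_ITERS;
}
```

The sequentially consistent fence is what prevents the read-after-older-write reordering the previous slide shows; without it, both threads can observe the other's flag as 0 and enter together.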
  • 160. the victim vs. the thief
  • 161. Java Monitors using Quickly Reacquirable Locks or Biased Locking [DMS03] [OKK04]
  • 162. the bias-holding thread vs. a revoker thread
  • 164. a Java mutator thread vs. the garbage collector
  • 166. the owner thread vs. other threads. Applications exhibit asymmetric synchronization patterns. 89
  • 168. Some applications can benefit from a software implementation [DHY03] that uses interrupts.
  • 169. A light-weight hardware mechanism can piggyback on the cache-coherence protocol. Location-Based Memory Fences 90 Joint work with Edya Ladan-Mozes and Dmitriy Vyukov
  • 172. Ownership-Aware Transactional Memory / Direction for Future Work 91
  • 173. Transactional Memory Rset: w,x Wset: w,x Memory atomic { //A x++; } Rset: x Wset: x A atomic { //B w = x; } Rset: x Wset: w B Transactional Memory (TM) [HM93] provides a transactional interface for accessing memory. 92
  • 174. Transactional Memory Rset: w,x Wset: w,x Memory atomic { //A x++; } Rset: x Wset: x A atomic { //B w = x; } Rset: x Wset: w B TM guarantees that transactions are serializable [P79]. 93
  • 175. Nested Transactions Rset: w,x,y,z Wset: w,x,y,z atomic { //A int a = x; ... atomic { //B w++; } int b = y; z = x + y; } Memory Rset: x Wset: A Rset: w Wset: w B Closed nesting: propagate the changes to A. 94
  • 176. Nested Transactions Rset: w,x,y,z Wset: w,x,y,z atomic { //A int a = x; ... atomic { //B w++; } int b = y; z = x + y; } Memory Rset: x Wset: A Rset: w Wset: w B Open nesting: commit the changes globally. 95
  • 177. Nested Transactions All memory is treated equally – there is only one level of abstraction. 96
  • 179. In OAT, the programmer writes code with transactional modules, and the OAT system uses the concept of ownership types [BLS03] to ensure data encapsulation within a module.
  • 180. The OAT system guarantees abstract serializability as long as the program conforms to a set of well-defined constraints on how the modules share data. 97
  • 183. Ownership-Aware Transactional Memory / Direction for Future Work 98
  • 184. Parallelism Abstraction A concurrency platform provides a layer of parallelism abstraction to help load balancing and task scheduling. User Application Concurrency Platform Operating System 99
  • 186. Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object.
  • 187. Transactional Memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically. 100 Can a concurrency platform likewise mitigate the complexity of synchronization by providing the right memory abstractions?
  • 191. Cilk-M [LSH+10] (TLMM cactus stack). Can we relax the limitation of manipulating virtual memory at page granularity? 101
  • 194. Ownership-Aware Transactional Memory / Direction for Future Work
  • 197. Quadratic Stack Growth [Robison08] P : parallel P Assume one linear stack per worker : serial P S S : spawn P S : call P S Depth = d S . . . S . . . . . . P S S S . . . P P P Repeat d times . . . S S S S S S 105
  • 198. Quadratic Stack Growth [Robison08] P The green worker repeatedly blocks, then steals, using Θ(d²) stack space. Assume one linear stack per worker P S P S P S Depth = d S . . . S . . . . . . P S S S . . . P P P Repeat d times . . . S S S S S S 106
  • 199. Performance Comparison AMD 4 quad-core 2GHz Opteron, 64KB private L1, 512K private L2, 2MB shared L3 Cilk-M running time / Cilk-5 running time Time bound: TP = T1/P + C·T∞, where C = O(S1 + D) 107
  • 200. Space Usage (Hand Compiled) Space bound: SP/P ≤ S1 + D 108
  • 202. GCC/Linux C Subroutine Linkage args to A The legacy linear stack obtains efficiency by overlapping frames. A’s return address A’s parent’s base ptr frame for A bp sp A’s local variables linkage region args to B B’s return address B’s local variables A’s base pointer frame for B A args to B’s callees C B E D 110
  • 203. Handling Page Granularity 0x7f000 A A page size B 0x7e000 The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee. 0x7d000 A steal A C B P3 P1 P2 E D 111
  • 204. Handling Page Granularity 0x7f000 A A page size B 0x7e000 The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee. C D 0x7d000 A C B P3 P1 P2 E D 112
  • 205. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee. C C D 0x7d000 A steal C C B P3 P1 P2 E D 113
  • 206. Handling Page Granularity 0x7f000 A A A page size B 0x7e000 The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee. C C D 0x7d000 E A C B P3 P1 P2 E D 114
  • 207. Key Invocation Invariants Arguments are passed via stack pointer with positive offset. Local variables are referenced via base pointer with negative offset. Live registers are flushed onto the stack immediately before each spawn. Live registers are flushed onto the stack before returning back to runtime if sync fails. When resuming a stolen function after a spawn or sync, live registers are restored from the stack. When returning from a spawn, the return value is flushed from its register onto the stack. The frame size is fixed before any spawn statements. 115
  • 208. GCC/Linux C Subroutine Linkage Legacy linear stacks enable efficient passing of arguments from caller to callee. args to A A’s return address A’s parent’s base ptr frame for A sp bp A’s local variables args to A’s callees A C B E D 116
  • 209. GCC/Linux C Subroutine Linkage linkage region Frame A accesses its arguments through positive offsets indexed from its base pointer. args to A A’s return address A’s parent’s base ptr frame for A sp bp A’s local variables args to A’s callees A C B E D 117
  • 210. GCC/Linux C Subroutine Linkage args to A A’s return address A’s parent’s base ptr frame for A sp bp A’s local variables Frame A accesses its local variables through negative offsets indexed from its base pointer. args to A’s callees A C B E D 118
  • 211. GCC/Linux C Subroutine Linkage Before invoking B, A places the arguments for B into the reserved linkage region it will share with B, which A indexes using positive offsets off its stack pointer. args to A A’s return address A’s parent’s base ptr frame for A sp bp A’s local variables linkage region args to B args to A’s callees A C B E D 119
  • 212. GCC/Linux C Subroutine Linkage args to A A then makes the call to B, which saves the return address for B and transfers control to B. A’s return address A’s parent’s base ptr frame for A sp bp A’s local variables args to B B’s return address A C B E D 120
  • 213. GCC/Linux C Subroutine Linkage args to A Upon entering, B saves A’s base pointer and sets the base pointer to where the stack pointer is. A’s return address A’s parent’s base ptr frame for A bp sp A’s local variables args to B B’s return address bp A’s base pointer A C B E D 121
  • 214. GCC/Linux C Subroutine Linkage args to A B advances the stack pointer to allocate space for local variables and the linkage region. A’s return address A’s parent’s base ptr frame for A bp sp A’s local variables args to B B’s return address B’s local variables A’s base pointer frame for B A args to B’s callees C B E D 122
  • 215. GCC/Linux C Subroutine Linkage args to A The legacy linear stack obtains efficiency by overlapping frames. A’s return address A’s parent’s base ptr frame for A bp sp A’s local variables args to B B’s return address B’s local variables A’s base pointer frame for B A args to B’s callees C B E D 123
  • 216. Legacy Linear Stack High Addr An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree. A A B C C B D E E D Low Addr invocation tree 124
  • 217. Legacy Linear Stack High Addr Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent. … A A C C B x: 42 E E D Low Addr invocation tree y: &x 125
  • 218. Legacy Linear Stack High Addr Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent. A A C C B ✗ y: &z E E D Low Addr invocation tree z: 42 126
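The pointer rule on these slides in plain C, as a small illustration (function names are made up): passing a pointer down is safe because the parent's frame outlives the child's call, while passing one up is not, which is what the crossed-out `y: &z` depicts.

```c
/* Passing a pointer DOWN the invocation tree: safe. */
static void child(int *y) {      /* y: &x, borrowed from the parent */
    *y += 1;                     /* writing through it is fine: x is alive */
}

int parent(void) {
    int x = 42;                  /* lives in the parent's stack frame */
    child(&x);                   /* pointer passed down to a child: OK */
    return x;                    /* 43 */
}

/* The forbidden direction, kept commented out:
static int *bad_child(void) {
    int z = 42;
    return &z;                   // dangling the moment bad_child returns
}
*/
```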
  • 219. The Queens Problem Given n > 0, search for one way to arrange n queens on an n-by-n chessboard so that none attacks another. legal configuration illegal configuration 127
  • 220. Exploring the Search Tree for Queens start r0,c1 r0,c2 r0,c3 r0,c0 r1,c3 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c2 r1,c1 r1,c0 r2,c0 r2,c0 r2,c0 r2,c0 . . . . . . Serial strategy: Depth-first search with backtracking. The search tree size grows exponentially as n increases. 128
  • 221. Exploring the Search Tree for Queens start r0,c1 r0,c2 r0,c3 r0,c0 r1,c3 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c2 r1,c1 r1,c0 r2,c0 r2,c0 r2,c0 r2,c0 . . . . . . Parallel strategy: spawn searches in parallel. Speculative computation – some work may be wasted. 129
  • 222. Exploring the Search Tree for Queens start r0,c1 r0,c2 r0,c3 r0,c0 r1,c3 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c3 r1,c2 r1,c1 r1,c0 r1,c2 r1,c1 r1,c0 r2,c0 r2,c0 r2,c0 r2,c0 . . . . . . Parallel strategy: spawn searches in parallel. Abort other parallel searches once a solution is found. 130
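The serial strategy above, depth-first search with backtracking, can be sketched in plain C. This counting variant is an illustration with made-up names, not the talk's code; a parallel Cilk version would spawn the column-placement loop and abort sibling searches once a solution is found.

```c
#include <stdlib.h>

/* Is it safe to put a queen at (row, c), given queens already placed in
 * rows 0..row-1 at columns col[0..row-1]? */
static int safe(const int *col, int row, int c) {
    for (int r = 0; r < row; r++)
        if (col[r] == c || abs(col[r] - c) == row - r)
            return 0;           /* same column or same diagonal */
    return 1;
}

/* Depth-first search with backtracking: try each column in this row,
 * recurse, and count complete placements. */
static int search(int *col, int n, int row) {
    if (row == n)
        return 1;               /* all n queens placed */
    int count = 0;
    for (int c = 0; c < n; c++)
        if (safe(col, row, c)) {
            col[row] = c;
            count += search(col, n, row + 1);   /* backtrack on return */
        }
    return count;
}

int queens(int n) {
    int col[32];
    return (n > 0 && n <= 32) ? search(col, n, 0) : 0;
}
```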
  • 224. 132 Parallelize Your Code using Cilk++ class SAT_Solver { public: int solve( … ); … private: … }; 1. Convert the entire code base to the Cilk++ language. 2. Structure the project so that Cilk++ code calls C++ code, but not conversely. 3. Allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++. a. Use C++ wrapper functions b. Use “extern C++” c. Limited call back to C++ code
  • 225. 133 Parallelize Your Code using TBB class SAT_Solver { public: int solve( … ); … private: … }; 1. Convert the entire project to the Cilk++ language. 2. Structure the project so that Cilk++ code calls C++ code, but not conversely. 3. Allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++. a. Use C++ wrapper functions b. Use “extern C++” c. Limited call back to C++ code Your program may end up using a lot more stack space or fail to get good speedup.
  • 226. Memory Network … ¢ ¢ ¢ P P P Chip Multiprocessor (CMP) Multicore Architecture — 2001* 134 *The first non-embedded multicore microprocessor was the Power4 from IBM (2001).
  • 227. The Era of Multicore IS Here 135 # of CPUs # of Cores Source: www.newegg.com Single-core processors are becoming obsolete.
  • 228. My Sister Is Buying a New Laptop … 136 Source: www.apple.com The era of multicore IS here!

Editor's Notes

  1. A concurrency platform is a software abstraction layer that manages the processors’ resources, schedules the computation over the available processors, and provides an interface for the programmer to specify parallel computations.
  2. It turns out that there seems to be a fundamental tradeoff between the three criteria. We and other practitioners have considered various strategies, and all of them fail to satisfy one of the three criteria, except for the TLMM cactus stacks, which is the strategy we employed in this work.
  3. I am sure everyone knows what a linear stack is. An execution of a serial language can be viewed as a serial walk of an invocation tree. On the left, we have an invocation tree, where A calls B and C, and C calls D and E. On the right is the corresponding view of the stack for each function when it is active. Throughout the rest of the talk, I will use the convention that the stack grows downward. Note that, when a function is active, it can always see its ancestors’ frames in the stack.
  4. But parallel functions fail to interoperate with legacy serial code, because legacy serial code would allocate its frame off the linear stack, and it does not understand the heap linkage, where the call / return is performed via frames in the heap.
  5. It turns out that there seems to be a fundamental tradeoff between the three criteria. We and other practitioners have considered various strategies, and all of them fail to satisfy one of the three criteria, except for the TLMM cactus stacks, which is the strategy we employed in this work. We don’t have time to go into all the strategies, but I will go into a little more detail on one strategy to illustrate the challenge in satisfying all three criteria. You are welcome to ask me about the other strategies after the talk, if you are interested.
  6. The Cilk work-stealing scheduler then executes the program in a way that respects the logical parallelism specified by the programmer while guaranteeing that programs take full advantage of the processors available at runtime.
  7. Thread-local memory mapped (TLMM) region: A virtual-address range in which each thread can map physical memory independently. One main constraint that we are operating under is that once a frame has been allocated, its location in virtual memory cannot be changed. We can get around this if every thread has its own local view of the virtual address range.
  8. Because the stacks are allocated in the TLMM region, we can map the region such that part of the stack is shared. For example, worker one has the stack view of … A frame for a given function refers to the same physical memory in all stacks and is mapped to the same virtual address.
  9. Time bound: guarantees linear speedup if there is sufficient parallelism. Space bound: each worker does not use more than S1.
  10. Note that this is running on 16 cores
  11. Across all apps, each worker uses no more than 2× the serial stack usage.
  12. Use a standard linear stack in virtual memory
  13. Upon a steal, map the physical memory corresponding to the stolen prefix to the same virtual addresses in the thief as in the victim.
  14. Subsequent spawns and calls grow downward in the thief’s independently mapped TLMM region.
  15. Both the victim and the thief see the same virtual address value for the reference to A’s local variable x.
  16. Subsequent spawns and calls grow downward in the thief’s independently mapped TLMM region.
  17. ORALLY: P3 steals C … technically, it first stole A and failed to make progress on it, then it steals C.
  18. Subsequent spawns and calls grow downward in the thief’s independently mapped TLMM region.
  19. Of course, our assumption is not so reasonable --- we can’t really map at arbitrary granularity. Instead, we have to map at page granularity.
  20. Advancing the stack pointer avoids overwriting other frames on the page, at the cost of fragmentation.
  21. The thief then resumes the stolen frame and executes normally in its own TLMM region.
  22. Once again, the stack pointer must be advanced, which causes additional fragmentation.
  23. only the worker who executes it would perform the mmap ... need to synchronize among all workers to perform the mmap as well.
  24. multiple threads’ local overlaps w/ diff pages.
  25. Each thread uses a unique root page directory … When a thread maps in the shared region … need to synchronize, but the synchronization is done only once per shared entry in the root page directory.
  26. Tazuneki and Yoshida [TY00] and Issarny [I91] have investigated the semantics of concurrent exception handling, taking different approaches from our work. In particular, these researchers pursue new linguistic mechanisms for concurrent exceptions, rather than extending them faithfully from a serial base language as does JCilk. The treatment of multiple exceptions thrown simultaneously is another point of divergence.
  27. The JCilk system consists of two components: the runtime system and the compiler.
  28. Critically, there is a duality between the actions of the threads. Modern processors typically employ TSO (Total Store Order) and PO (Processor Ordering). That is: Reads are not reordered with other reads. Writes are not reordered with older reads. Writes are not reordered with other writes. Reads may be reordered with older writes if they have different target locations.
  29. Traditional memory barriers are PC-based – the processor inevitably stalls upon executing one.
  30. The lock word associated with a monitor can be biased toward one thread. The bias-holding thread can update the lock word using a regular load-update-store; an unbiased lock word must be updated using CAS. Dekker-style synchronization is used between the bias-holding thread and the revoker thread when the revoker attempts to update the bias. Example: network packet processing applications – each thread handles a group of source addresses and maintains its own data structure; occasionally, a processing thread needs to update another thread’s data structure. If a collection is in progress, the barrier halts the thread until the collection completes, which prevents the thread from mutating the heap concurrently with the collector. The JNI reentry barrier is commonly implemented with a CAS or a Dekker-like "ST; MEMBAR; LD" sequence to mark the thread as a mutator (the ST) and check for a collection in progress (the LD). JNI calls occur frequently, but collections are relatively infrequent.
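The "ST; MEMBAR; LD" sequence can be sketched with C11 atomics — a toy model of a JNI-style reentry barrier, not any JVM's actual implementation (the flag names are assumptions):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flags modeling the reentry barrier. */
static atomic_bool in_mutator;            /* this thread runs managed code */
static atomic_bool collection_in_progress;

/* Classic "ST; MEMBAR; LD": announce ourselves as a mutator (the ST),
 * fence so the store is ordered before the load, then check for an
 * in-progress collection (the LD).  Returns true if the thread must
 * halt and wait for the collector. */
static bool reenter_managed_code(void) {
    atomic_store_explicit(&in_mutator, true, memory_order_relaxed); /* ST */
    atomic_thread_fence(memory_order_seq_cst);                      /* MEMBAR */
    return atomic_load_explicit(&collection_in_progress,
                                memory_order_relaxed);              /* LD */
}
```

The fence is exactly what TSO makes expensive here: without it, the LD could be reordered before the ST, and the collector could miss a mutator.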
  31. The TM system enforces atomicity by tracking the memory locations that each transaction accesses, detecting conflicts, and possibly aborting and retrying transactions.
  32. TM guarantees that transactions are serializable [Papadimitriou79]. That is, transactions affect global memory as if they were executed one at a time in some order, even if in reality several executed concurrently.
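A toy sketch of conflict detection and retry for a single shared location, using one global version number — far simpler than a real TM system, which tracks full read and write sets (all names are assumptions):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A toy, single-location transactional memory sketch: the transaction
 * snapshots a global version at begin, buffers its write, and commits
 * only if no other commit intervened (conflict detection); otherwise
 * it aborts and the caller retries. */
static atomic_long global_version;
static long shared_x;

typedef struct {
    long start_version;   /* version observed at transaction begin */
    long pending_x;       /* buffered write to shared_x */
} txn_t;

static void txn_begin(txn_t *t) {
    t->start_version = atomic_load(&global_version);
}

static void txn_write_x(txn_t *t, long v) { t->pending_x = v; }

/* Returns true on commit, false on abort (conflict detected). */
static bool txn_commit(txn_t *t) {
    long expected = t->start_version;
    /* Atomically check that no one committed since we began, and
     * claim the next version number if so. */
    if (!atomic_compare_exchange_strong(&global_version, &expected,
                                        t->start_version + 1))
        return false;              /* conflict: abort, caller retries */
    shared_x = t->pending_x;       /* publish the buffered write */
    return true;
}
```

Serializability here is immediate: commits are totally ordered by the version counter, and each successful transaction sees the state left by the previous one.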
  33. A decade ago, much multithreaded software was still written with POSIX or Java threads, where the programmer handled the task decomposition and scheduling explicitly. By providing a parallelism abstraction, a concurrency platform frees the programmer from worrying about load balancing and task scheduling.
  34. TLMM cactus stack: each worker gets its own linear local view of the tree-structured call stack. Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object. Transactional memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically. I believe a concurrency platform can likewise mitigate the complexity of synchronization by providing the appropriate memory abstractions. A memory abstraction is an abstraction layer between the program execution and the memory that provides a different "view" of a memory location depending on the execution context in which the memory access is made. What other memory abstractions can we build?
  35. Assume we use one linear stack per worker. Here, we are using the term worker interchangeably with the term persistent thread – think of a Java thread or POSIX thread. This is a beautiful observation made by Arch Robison, the main architect of Intel TBB, which is another concurrency platform. The observation is that, using the strategy of one linear stack per worker, some computations may incur quadratic stack growth compared to their serial execution. An example of such a computation is as follows. Here, I am showing you an invocation tree. A frame marked P is a parallel function, which may have multiple extant children executing in parallel. A frame marked S is a serial function. I haven’t told you the details of how a work-stealing scheduler operates, but for the purpose of this example, all you need to know is that execution typically proceeds depth-first and left to right. Once a P function spawns the left branch (marked red), however, the right branch becomes available for execution. In order to guarantee the good time bound, one must allow a worker thread to randomly choose a readily available function to execute.
  36. Then we can run into the following scenario. I am using different colors to denote the worker who invoked a given function. One main constraint that we are operating under is that once a frame has been allocated, its location in virtual memory cannot be changed. So the green worker cannot pop off the stack-allocated frames, because the purple worker may have pointers to variables allocated on those frames.
  37. Note that this is running on 16 cores
  38. This worst case occurs when every Cilk function on a stack that realizes the Cilk depth D is stolen.
  39. Show space consumption; mention a couple of tricks we did to recycle stack space.
  40. The compact linear-stack representation is possible only because in a serial language, a function has at most one extant child function at any time.
  41. Of course, our assumption is not so reasonable --- we can’t really map at arbitrary granularity. Instead, we have to map at page granularity.
  45. Static for P1 and P2. Mention that the fragmented part of the stack is not visible. Issue of fragmentation: we want to use backward-compatible linkage – combine this page and the next page, and insert the linkage block in there. Animate just P3.
  46. Mention memory args are used only when registers are not enough; overlapping of the frames.
  47. Mention memory args are used only when registers are not enough; overlapping of the frames. SAY: can access via the stack pointer if the frame size is known statically.
  50. CORRECT this text: A then transfers control to B.
  51. The compact linear-stack representation is possible only because in a serial language, a function has at most one extant child function at any time.
  52. On the left, I am showing you an invocation tree … Such serial languages admit a simple array-based stack for allocating function activation frames. To allocate an activation frame when a function is called, the stack pointer is advanced, and when the function returns, the original stack pointer is restored. This style of execution is space efficient, because all the children of a given function can use and reuse the same region of the stack.
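The array-based serial stack described above can be sketched as a bump-pointer allocator — a minimal model (real stacks grow downward and hold return addresses too; all names here are assumptions):

```c
#include <assert.h>
#include <stddef.h>

/* A serial call stack as a bump pointer over a flat array: calling a
 * function advances the pointer to allocate its activation frame, and
 * returning restores the old pointer, so siblings reuse the same
 * region.  This compact layout works only because a serial function
 * has at most one extant child at a time. */
enum { STACK_BYTES = 1024 };
static char stack_mem[STACK_BYTES];
static size_t sp;  /* offset of the next free byte */

static void *push_frame(size_t frame_size) {
    void *frame = &stack_mem[sp];
    sp += frame_size;             /* "call": advance the stack pointer */
    return frame;
}

static void pop_frame(size_t frame_size) {
    sp -= frame_size;             /* "return": restore the stack pointer */
}
```

This reuse is exactly what breaks once a parallel function can have several extant children: no single linear order of frames works.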
  53. x: 42 on A; pass &x, stored as y in C & E. Make wider. The other way around … should be symmetric. In A, have x: 42, and pass that down to C.
  54. Static threading / OpenMP / streaming / fork-join parallel programming / message passing / GPU (which is not commonly used for multicore architectures with shared memory). A concurrency platform is a software abstraction layer that manages the processors’ resources, schedules the computation over the available processors, and provides an interface for the programmer to specify parallel computations.