MIT Cilk

  • A concurrency platform is a software abstraction layer that manages the processors' resources, schedules the computation over the available processors, and provides an interface for the programmer to specify parallel computations.
  • It turns out that there seems to be a fundamental tradeoff between the three criteria. We and other practitioners have considered various strategies, and all of them fail to satisfy one of the three criteria, except for TLMM cactus stacks, which is the strategy we employ in this work.
  • I am sure everyone knows what a linear stack is. An execution of a serial language can be viewed as a serial walk of an invocation tree. On the left, we have an invocation tree, where A calls B and C, and C calls D and E. On the right are the corresponding views of the stack for each function when it is active. Throughout the rest of the talk, I will use the convention that the stack grows downward. Note that when a function is active, it can always see its ancestors' frames in the stack.
  • But parallel functions fail to interoperate with legacy serial code, because a legacy serial code would allocate its frame off the linear stack, and it does not understand heap linkage, where the call/return is performed via frames in the heap.
  • It turns out that there seems to be a fundamental tradeoff between the three criteria. We and other practitioners have considered various strategies, and all of them fail to satisfy one of the three criteria, except for TLMM cactus stacks, which is the strategy we employ in this work. We don't have time to go into all the strategies, but I will go into a little more detail on one strategy to illustrate the challenge in satisfying all three criteria. You are welcome to ask me about the other strategies after the talk, if you are interested.
  • The Cilk work-stealing scheduler then executes the program in a way that respects the logical parallelism specified by the programmer, while guaranteeing that programs take full advantage of the processors available at runtime.
  • Thread-local memory mapped (TLMM) region: a virtual-address range in which each thread can map physical memory independently. One main constraint that we are operating under is that once a frame has been allocated, its location in virtual memory cannot be changed. We can get around this if every thread has its own local view of the virtual-address range.
  • Because the stacks are allocated in the TLMM region, we can map the region such that part of the stack is shared. For example, worker one has the stack view of ... A frame for a given function refers to the same physical memory in all stacks and is mapped to the same virtual address.
  • Time bound: guarantees linear speedup if there is sufficient parallelism. Space bound: each worker does not use more than S1.
  • Note that this is running on 16 cores
  • Across all applications, each worker uses no more than 2 times the serial stack usage.
  • Use a standard linear stack in virtual memory
  • Upon a steal, map the physical memory corresponding to the stolen prefix to the same virtual addresses in the thief as in the victim.
  • Subsequent spawns and calls grow downward in the thief’s independently mapped TLMM region.
  • Both the victim and the thief see the same virtual address value for the reference to A’s local variable x.
  • Subsequent spawns and calls grow downward in the thief’s independently mapped TLMM region.
  • Orally: P3 steals C … technically it first stole A and failed to make progress on it, then it steals C.
  • Subsequent spawns and calls grow downward in the thief’s independently mapped TLMM region.
  • Of course, our assumption is not so reasonable --- we can’t really map at arbitrary granularity. Instead, we have to map at page granularity.
  • Advancing the stack pointer avoids overwriting other frames on the page, at the cost of fragmentation.
  • The thief then resumes the stolen frame and executes normally in its own TLMM region.
  • Once again, the stack pointer must be advanced, which causes additional fragmentation.
  • Only the worker that executes it would perform the mmap ... otherwise we would need to synchronize among all workers to perform the mmap as well.
  • Multiple threads’ local regions overlap over the same virtual-address range but map different pages.
  • Each thread uses a unique root page directory … When a thread maps in the shared region, it needs to synchronize, but the synchronization is done only once per shared entry in the root page directory.
  • Tazuneki and Yoshida [TazunekiYo00] and Issarny [Issarny91] have investigated the semantics of concurrent exception handling, taking different approaches from our work. In particular, these researchers pursue new linguistic mechanisms for concurrent exceptions, rather than extending them faithfully from a serial base language as does JCilk. The treatment of multiple exceptions thrown simultaneously is another point of divergence.
  • The JCilk system consists of two components: the runtime system and the compiler.
  • Critically, there is a duality between the actions of the threads. Modern processors typically employ TSO (Total Store Order) and PO (Processor Ordering). That is: reads are not reordered with other reads; writes are not reordered with older reads; writes are not reordered with other writes; and reads may be reordered with older writes if they have different target locations.
  • Traditional memory barriers are PC-based: the processor inevitably stalls upon executing one.
  • The lock word associated with a monitor can be biased toward one thread. The bias-holding thread can update the lock word using a regular load-update-store; an unbiased lock word must be updated using CAS. Dekker is used to synchronize between the bias-holding thread and a revoker thread when the revoker attempts to update the bias. Network packet processing applications: each thread handles a group of source addresses and maintains its own data structure; occasionally, a processing thread needs to update another thread’s data structure. If a collection is in progress, the barrier halts the thread until the collection completes; this prevents the thread from mutating the heap concurrently with the collector. The JNI reentry barrier is commonly implemented with a CAS or a Dekker-like "ST; MEMBAR; LD" sequence to mark the thread as a mutator (the ST) and check for a collection in progress (the LD). JNI calls occur frequently, but collections are relatively infrequent.
  • The TM system enforces atomicity by tracking the memory locations that each transaction accesses, detecting conflicts, and possibly aborting and retrying transactions.
  • TM guarantees that transactions are serializable [Papadimitriou79]. That is, transactions affect global memory as if they were executed one at a time in some order, even if, in reality, several executed concurrently.
  • A decade ago, much multithreaded software was still written with POSIX or Java threads, where the programmer handled the task decomposition and scheduling explicitly. By providing a parallelism abstraction, a concurrency platform frees the programmer from worrying about load balancing and task scheduling.
  • TLMM cactus stack: each worker gets its own linear local view of the tree-structured call stack. Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object. Transactional memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically. I believe a concurrency platform can likewise mitigate the complexity of synchronization by providing the appropriate memory abstractions. A memory abstraction is an abstraction layer between the program execution and the memory that provides a different "view" of a memory location depending on the execution context in which the memory access is made. What other memory abstractions can we build?
  • Assume we use one linear stack per worker. Here, we are using the term worker interchangeably with the term persistent thread (think of a Java thread or a POSIX thread). This is a beautiful observation made by Arch Robison, who is the main architect of Intel TBB, another concurrency platform. The observation is that, using the strategy of one linear stack per worker, some computations may incur quadratic stack growth compared to their serial execution. An example of such a computation is as follows. Here, I am showing you an invocation tree. A frame marked P is a parallel function, which may have multiple extant children executing in parallel. A frame marked S is a serial function. I haven’t told you the details of how a work-stealing scheduler operates, but for the purpose of this example, all you need to know is that the execution typically goes depth-first and left to right. Once a P function spawns the left branch (marked in red), however, the right branch becomes available for execution. In order to guarantee the good time bound, one must allow a worker thread to randomly choose a readily available function to execute.
  • Then we can run into the following scenario (I am using different colors to denote the worker that invoked a given function). One main constraint that we are operating under is that once a frame has been allocated, its location in virtual memory cannot be changed. So the green worker cannot pop off the stack-allocated frames, because the purple worker may have pointers to variables allocated on those frames.
  • Note that this is running on 16 cores
  • This worst case occurs when every Cilk function on a stack that realizes the Cilk depth D is stolen.
  • Show space consumption; mention a couple of tricks we did to recycle stack space.
  • The compact linear-stack representation is possible only because, in a serial language, a function has at most one extant child function at any time.
  • Static for P1 and P2. Mention fragmentation; part of the stack is not visible. Issue of fragmentation: we want to use backward-compatible linkage, so we combine this page and the next page and insert the linkage block in there. Animate just P3.
  • Mention that memory arguments are used only when registers are not enough; overlapping of the frames.
  • Mention that memory arguments are used only when registers are not enough; overlapping of the frames. Say: they can be accessed via the stack pointer if the frame size is known statically.
  • Mention that memory arguments are used only when registers are not enough; overlapping of the frames.
  • Mention that memory arguments are used only when registers are not enough; overlapping of the frames.
  • A then transfers control to B.
  • On the left, I am showing you an invocation tree … Such serial languages admit a simple array-based stack for allocating function activation frames. To allocate an activation frame when a function is called, the stack pointer is advanced, and when the function returns, the original stack pointer is restored. This style of execution is space efficient, because all the children of a given function can use and reuse the same region of the stack.
  • x: 42 in A; pass &x, stored as y in C and E. Make the figure wider. The other way around should be symmetric: in A, we have x: 42 and pass that down to C.
  • Static threading / OpenMP / streaming / fork-join parallel programming / message passing / GPU (which is not commonly used for multicore architectures with shared memory). A concurrency platform is a software abstraction layer that manages the processors' resources, schedules the computation over the available processors, and provides an interface for the programmer to specify parallel computations.

    1. The Era of Multicore Is Here. (Source: www.newegg.com)
    2. Multicore Architecture: a chip multiprocessor (CMP) with processors and caches connected over a network to shared memory. (The first non-embedded multicore microprocessor was the Power4 from IBM, 2001.)
    3. Concurrency Platforms: a concurrency platform, which provides linguistic support and handles load balancing, can ease the task of parallel programming. (Layers: User Application / Concurrency Platform / Operating System.)
    4. Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems. I-Ting Angelina Lee, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson. March 22, Intel XTRL / USA.
    5. (Title slide repeated.)
    6. Three Desirable Criteria: Serial-Parallel Reciprocity (interoperability with serial code, including binaries), Good Performance (ample parallelism implies linear speedup), and Bounded Stack Space (reasonable space usage compared to serial execution).
    7. Various Strategies (Cilk++, TBB, Cilk Plus). The Cactus-Stack Problem: how to satisfy all three criteria simultaneously.
    8. The Cactus-Stack Problem: a dialogue between a customer and an engineer over SP reciprocity, space usage, and performance.
    9. Customer: "Parallelize my software?"
    10. Engineer: "Sure! Use my concurrency platform!"
    11. Engineer: "Sure! Use my concurrency platform!"
    12. Engineer: "Just be sure to recompile all your codebase." (SP reciprocity sacrificed.)
    13. Customer: "Hm … I use third-party binaries …"
    14. Customer: "*Sigh*. Ok fine."
    15. Engineer: "Upgrade your RAM then …" (space usage sacrificed.)
    16. Engineer: "… you are gonna need extra memory."
    17. Customer: "… no?"
    18. Customer: "… no?"
    19. Engineer: "Well … you didn’t say you want any performance guarantee, did you?" (performance sacrificed.)
    20. Customer: "Gee … I can get that just by running serially."
    21. Three Desirable Criteria (repeated): Serial-Parallel Reciprocity, Good Performance, Bounded Stack Space.
    22. Legacy Linear Stack: an execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree. (Diagram: invocation tree in which A calls B and C, and C calls D and E, with the corresponding views of the stack.)
    23. Legacy Linear Stack. Rule for pointers: a parent can pass pointers to its stack variables down to its children, but not the other way around.
    24. Legacy Linear Stack — 1960. Same rule for pointers. (Stack-based space management for recursive subroutines was developed with compilers for Algol 60.)
    25. Cactus Stack — 1968: a cactus stack supports multiple views in parallel. (Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.)
    26. Heap-Based Cactus Stack: a heap-based cactus stack allocates frames off the heap. Mesa (1979), Ada (1979), Cedar (1986), MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy.
    27. Modern Concurrency Platforms: Cilk++ (Intel), Cilk-5 (MIT), Cilk-M (MIT), Cilk Plus (Intel), Fortress (Oracle Labs), Habanero (Rice), JCilk (MIT), OpenMP, StreamIt (MIT), Task Parallel Library (Microsoft), Threading Building Blocks (Intel), X10 (IBM), …
    40. Heap-Based Cactus Stack: a heap-based cactus stack allocates frames off the heap. MIT Cilk-5 (1998) and Intel Cilk++ (2009) use this strategy as well. Good time and space bounds can be obtained …
    41. Heap-Based Cactus Stack. Heap linkage: call/return via frames in the heap. With heap linkage, parallel functions fail to interoperate with legacy serial code.
    42. Various Strategies. The main constraint: once allocated, a frame’s location in virtual address space cannot change.
    43. Outline. Cilk-M: The Cactus-Stack Problem; Cilk-M Overview; Cilk-M’s Work-Stealing Scheduler; TLMM-Based Cactus Stacks; The Analysis of Cilk-M; OS Support for TLMM. Survey of My Other Work. Direction for Future Work.
    49. The Cilk Programming Model, illustrated with fib (reassembled below). The named child function (spawn fib(n-1)) may execute in parallel with the continuation of its parent; control cannot pass a sync until all spawned children have returned. Cilk keywords grant permission for parallel execution; they do not command parallel execution.
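The fib code from this slide, reassembled from the transcript into a runnable sketch. The spelling of spawn and sync follows the slide’s Cilk-M/Cilk-5 style; Intel Cilk Plus would write cilk_spawn and cilk_sync instead.

```c
int fib(int n) {
    if (n < 2) return n;
    int x = spawn fib(n - 1);   /* child may run in parallel with the continuation */
    int y = fib(n - 2);         /* ordinary serial call */
    sync;                       /* wait for all spawned children before using x */
    return x + y;
}
```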
    50. Cilk-M: a work-stealing runtime system based on Cilk that solves the cactus-stack problem by thread-local memory mapping (TLMM).
    51. Cilk-M Overview. Thread-local memory mapped (TLMM) region: a virtual-address range in which each thread can map physical memory independently. (Address-space layout, low to high: code, initialized data, uninitialized data (bss), heap, shared region, TLMM, stack.) Idea: allocate the stacks for each worker in the TLMM region.
    52. Basic Cilk-M Idea: workers achieve sharing by mapping the same physical memory at the same virtual address, so a frame such as A (holding x: 42) appears at the same address (e.g., 0x7f000) in every worker’s stack, and pointers such as y: &x stay valid. Unreasonable simplification: assume that we can map with arbitrary granularity.
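A minimal user-space approximation of that idea, as a hedged sketch. Real Cilk-M needs kernel support (TLMM) because ordinary threads in one process share a single page table, so this sketch uses two processes instead; TLMM_BASE is an assumed, hypothetical address and error handling is omitted.

```c
/* Sketch: map the same physical memory at the same virtual address in two
 * address spaces, so &x has the same value in both (compile with -lrt if
 * your libc needs it for shm_open). */
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define TLMM_BASE ((void *)0x200000000000UL)  /* assumed free range; adjust if it clashes */

int main(void) {
    int fd = shm_open("/cactus_demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);

    /* "Victim": map the page at TLMM_BASE and write a frame variable there. */
    int *victim_view = mmap(TLMM_BASE, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_FIXED, fd, 0);
    victim_view[0] = 42;                      /* think of this as A's local x */

    if (fork() == 0) {
        /* "Thief": map the same physical page at the same virtual address. */
        int *thief_view = mmap(TLMM_BASE, 4096, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_FIXED, fd, 0);
        printf("thief sees x = %d at %p\n", thief_view[0], (void *)thief_view);
        _exit(0);
    }
    wait(NULL);
    shm_unlink("/cactus_demo");
    return 0;
}
```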
    53. Cilk Guarantees with a Heap-Based Cactus Stack. Definitions: TP = execution time on P processors; T1 = work; T∞ = span; T1/T∞ = parallelism; SP = stack space on P processors; S1 = stack space of a serial execution. Time bound: TP = T1/P + O(T∞), giving linear speedup when P ≪ T1/T∞. Space bound: SP/P ≤ S1. Does not support SP reciprocity.
    56. Cilk Depth: the maximum number of Cilk functions nested on the stack during a serial execution. Cilk depth (3 in the example) is not the same as spawn depth (2).
    57. Cilk-M Guarantees. Definitions as before, plus D = Cilk depth. Time bound: TP = T1/P + O((S1 + D) T∞), giving linear speedup when P ≪ T1/((S1 + D) T∞). Space bound: SP/P ≤ S1 + D, where S1 is measured in pages. SP reciprocity: there is no longer any need to distinguish function types; whether a function runs in parallel is dictated only by how it is invoked (spawn vs. call).
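Restating the bounds from the two preceding slides in one place (notation as defined there):

```latex
% T_P: time on P processors, T_1: work, T_\infty: span,
% S_1: serial stack space, D: Cilk depth.
\begin{align*}
\text{Heap-based cactus stack [BL94]:}\quad
  & T_P = \frac{T_1}{P} + O(T_\infty),
  & \frac{S_P}{P} &\le S_1 \\
\text{Cilk-M (TLMM cactus stack):}\quad
  & T_P = \frac{T_1}{P} + O\!\big((S_1 + D)\,T_\infty\big),
  & \frac{S_P}{P} &\le S_1 + D
\end{align*}
```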
    61. System Overview. We implemented a prototype Cilk-M runtime system based on the open-source Cilk-5 runtime system. We modified the open-source Linux kernel (2.6.29, running on x86 64-bit CPUs) to provide support for TLMM (~600 lines of code). We have ported the runtime system to work with Intel’s Cilk Plus compiler in place of the native Cilk Plus runtime.
    64. Performance Comparison (four quad-core 2 GHz AMD Opterons; 64 KB private L1, 512 KB private L2, 2 MB shared L3): Cilk-M running time / Cilk Plus running time. Time bound: TP = T1/P + C·T∞, where C = O(S1 + D).
    65. Space Usage. Space bound: SP/P ≤ S1 + D.
    66. Outline (repeated).
    72. 72. Cilk-M’s Work-Stealing Scheduler<br />Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].<br />spawn<br />call<br />spawn<br />spawn<br />spawn<br />call<br />spawn<br />spawn<br />call<br />P<br />P<br />P<br />P<br />43<br />
    73. 73. Cilk-M’s Work-Stealing Scheduler<br />Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].<br />spawn<br />call<br />spawn<br />spawn<br />spawn<br />call<br />spawn<br />call<br />spawn<br />call<br />call!<br />P<br />P<br />P<br />P<br />44<br />
    74. 74. Cilk-M’s Work-Stealing Scheduler<br />Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].<br />spawn<br />call<br />spawn<br />spawn<br />spawn<br />call<br />spawn<br />call<br />spawn<br />call<br />spawn<br />spawn!<br />P<br />P<br />P<br />P<br />45<br />
    75. 75. Cilk-M’s Work-Stealing Scheduler<br />Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].<br />spawn<br />call<br />spawn<br />spawn<br />spawn<br />call<br />spawn<br />call<br />spawn<br />call<br />spawn<br />call<br />spawn<br />call!<br />spawn!<br />spawn!<br />spawn<br />P<br />P<br />P<br />P<br />46<br />
    76. 76. Cilk-M’s Work-Stealing Scheduler<br />Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].<br />spawn<br />spawn<br />call<br />spawn<br />call<br />spawn<br />spawn<br />spawn<br />call<br />call<br />spawn<br />call<br />spawn<br />return!<br />spawn<br />P<br />P<br />P<br />P<br />47<br />
    77. 77. Cilk-M’s Work-Stealing Scheduler<br />Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].<br />spawn<br />spawn<br />call<br />call<br />spawn<br />spawn<br />spawn<br />call<br />call<br />spawn<br />call<br />spawn<br />steal!<br />spawn<br />P<br />P<br />P<br />P<br />When a worker runs out of work, itstealsfrom the top of a randomvictim’s deque.<br />48<br />
    78. 78. Cilk-M’s Work-Stealing Scheduler<br />Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].<br />spawn<br />spawn<br />call<br />call<br />spawn<br />spawn<br />spawn<br />call<br />call<br />spawn<br />spawn<br />call<br />spawn<br />spawn!<br />spawn<br />P<br />P<br />P<br />P<br />When a worker runs out of work, itstealsfrom the top of a randomvictim’s deque.<br />49<br />
    79. 79. Cilk-M’s Work-Stealing Scheduler<br />Each worker maintains awork dequeof frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].<br />spawn<br />spawn<br />call<br />call<br />spawn<br />spawn<br />spawn<br />call<br />call<br />spawn<br />spawn<br />call<br />spawn<br />spawn<br />P<br />P<br />P<br />P<br />Theorem [BL94]: With sufficient parallelism, workers steal infrequently linear speedup.<br />50<br />
    80. Outline (repeated).
    86. TLMM-Based Cactus Stacks: use a standard linear stack in virtual memory. Unreasonable simplification: assume that we can map with arbitrary granularity.
    87. Upon a steal (steal A), map (not copy) the stolen prefix to the same virtual addresses in the thief as in the victim.
    88. Subsequent spawns and calls grow downward in the thief’s TLMM region.
    89. Both workers see the same virtual address value for &x.
    90. (Animation: the victim continues with D; both workers still see the same address for &x.)
    91. Upon another steal (steal C), map (not copy) the stolen prefix to the same virtual addresses.
    92. Subsequent spawns and calls grow downward in the third worker’s TLMM region.
    93. All workers see the same virtual address value for &x.
    94. Handling Page Granularity: the stacks are now laid out on whole pages (e.g., 0x7f000, 0x7e000, 0x7d000).
    95. Upon a steal (steal A), map the stolen prefix.
    96. Advance the stack pointer, which causes fragmentation.
    97. (Animation: the thief calls C and D in its own region.)
    98. Upon another steal (steal C), map the stolen prefix.
    99. Advance the stack pointer again, causing additional fragmentation.
    100. (Animation: the third worker continues with E.)
    101. Space-reclaiming heuristic: reset the stack pointer upon a successful sync.
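A small sketch of the stack-pointer adjustment these slides describe, under the talk’s convention that stacks grow downward. It is an assumption-laden illustration: PAGE_SIZE is fixed here, and the extra space the thief reserves for the legacy linkage region (discussed on later slides) is only noted in a comment, not modeled.

```c
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* After the thief maps the stolen prefix at page granularity, the page that
 * holds the tail of the victim's frames is shared, so the thief must not
 * allocate new frames on it.  Dropping the stack pointer to the page boundary
 * below wastes at most one page per steal: the fragmentation on the slide.
 * (Cilk-M additionally reserves room for the next callee's linkage region.) */
static inline uintptr_t advance_past_page_boundary(uintptr_t sp)
{
    return sp & ~(PAGE_SIZE - 1);   /* lower address = "advancing" a downward stack */
}
```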
    102. Outline (repeated).
    108. Space Bound with a Heap-Based Cactus Stack. Theorem [BL94]: let S1 be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack is at most SP/P ≤ S1. Proof: the work-stealing algorithm maintains the busy-leaves property, namely that every active leaf frame has a worker executing on it. ∎
    109. Cilk-M Space Bound. Claim: let S1 be the stack space required by a serial execution and D the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stack is at most SP/P ≤ S1 + D. Proof: the same busy-leaves property. ∎
    110. Space Usage. Space bound: SP/P ≤ S1 + D.
    111. Performance Bound with a Heap-Based Cactus Stack. Theorem [BL94]: a work-stealing scheduler can achieve expected running time TP = T1/P + O(T∞) on P processors. Corollary: if the computation exhibits sufficient parallelism (P ≪ T1/T∞), this bound guarantees near-perfect linear speedup (T1/TP ≈ P).
    112. Cilk-M Performance Bound. Claim: a work-stealing scheduler can achieve expected running time TP = T1/P + C·T∞ on P processors, where C = O(S1 + D). Corollary: if the computation exhibits sufficient parallelism (P ≪ T1/((S1 + D) T∞)), this bound guarantees near-perfect linear speedup (T1/TP ≈ P).
    113. Outline (repeated).
    119. To Be or Not To Be … a Process. A worker = a process: every worker has its own page table; by default, nothing is shared; nonstack memory must be shared manually (i.e., with mmap); user calls to mmap do not work (which may include malloc). A worker = a thread: workers share a single page table; by default, everything is shared; reserve a region to be independently mapped; user calls to mmap operate properly.
    127. Page Table for TLMM (Ideally): on x86, hardware walks the page table, and each thread has a single root page directory, from which both its TLMM entries and the shared entries hang.
    128. Support for TLMM: must synchronize the root page directory among threads.
    129. Limitation of TLMM Cactus Stacks. TLMM does not work for code that requires one thread to see another thread’s stack. E.g., MCS locks [MCS91]: when a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L. Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic. When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock. If A allocates LA on its stack using TLMM, LA may not be visible to B!
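To make that limitation concrete, here is a hedged C11 sketch of an MCS lock [MCS91]. The qnode ("me") is conventionally allocated in the waiter’s own stack frame; the release path writes into the successor’s qnode, which is exactly the cross-thread stack access that a TLMM stack may not expose. This is an illustrative sketch, not the talk’s code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_qnode {
    _Atomic(struct mcs_qnode *) next;
    atomic_bool locked;
} mcs_qnode;

typedef struct { _Atomic(mcs_qnode *) tail; } mcs_lock;

void mcs_acquire(mcs_lock *lk, mcs_qnode *me) {   /* 'me' is often stack-allocated */
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_qnode *pred = atomic_exchange(&lk->tail, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);            /* the releaser must be able to   */
        while (atomic_load(&me->locked))          /* reach and clear me->locked, so */
            ;                                     /* 'me' must be visible to it     */
    }
}

void mcs_release(mcs_lock *lk, mcs_qnode *me) {
    mcs_qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_qnode *expected = me;
        if (atomic_compare_exchange_strong(&lk->tail, &expected, NULL))
            return;                               /* no waiter: lock is now free    */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                     /* wait for the successor to link */
    }
    atomic_store(&succ->locked, false);           /* writes into the successor's qnode */
}
```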
    135. Cilk-M Summary. Cilk-M is a C-based concurrency platform that satisfies all three criteria simultaneously: serial-parallel reciprocity, good performance, and bounded, efficient use of memory for the cactus stack. Cilk-M employs TLMM-based cactus stacks, OS support for TLMM (~600 lines of code), and legacy-compatible linkage.
    143. Outline. Cilk-M. Survey of My Other Work: The JCilk Language; Location-Based Memory Fences; Ownership-Aware Transactional Memory. Direction for Future Work.
    146. The JCilk Language (joint work with John Danaher and Charles Leiserson): Java core functionality plus the parallel constructs from Cilk, spawn and sync.
    147. The JCilk Language adds exception handling to that combination.
    148. Exception Handling in a Concurrent Context. JCilk provides a faithful extension of Java’s exception mechanism consistent with Cilk’s primitives. JCilk’s exception semantics include an implicit abort mechanism, which allows speculative parallelism to be expressed succinctly in JCilk. Other researchers [I91, TY00, BM00] pursued new linguistic mechanisms.
    151. The JCilk System consists of the JCilk runtime system and the JCilk compiler: the JCilk compiler translates JCilk to Java + goto (Fib.jcilk to Fib.jgo), the Jgo compiler (GCJ with goto support) produces Fib.class, and the result runs on the JVM.
    152. What We Discovered. JCilk’s strategy of integrating multithreading with Java’s exception semantics is synergistic: it obviates the need for Cilk’s inlet and abort. JCilk’s abort mechanism extends Java’s existing exception mechanism in a natural way to propagate an abort, allowing the programmer to clean up.
    154. Outline (repeated): Cilk-M; Survey of My Other Work (The JCilk Language, Location-Based Memory Fences, Ownership-Aware Transactional Memory); Direction for Future Work.
    157. Dekker’s Protocol (Simplified). Initially, L1 = 0 and L2 = 0. Thread 1: L1 = 1; if (L2 == 0) { /* critical section */ } L1 = 0;. Thread 2: L2 = 1; if (L1 == 0) { /* critical section */ } L2 = 0;. Reads may be reordered with older writes.
    158. Dekker’s Protocol (Simplified), with an mfence() inserted in each thread between its write and its read. Memory fences are needed, but they cause stalling.
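The same simplified protocol written with C11 atomics, as a hedged sketch: atomic_thread_fence(memory_order_seq_cst) plays the role of the mfence() on the slide, preventing the store to the flag from being reordered after the load of the other thread’s flag.

```c
#include <stdatomic.h>

atomic_int L1 = 0, L2 = 0;

void thread1_enter(void) {
    atomic_store_explicit(&L1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);          /* the costly full fence */
    if (atomic_load_explicit(&L2, memory_order_relaxed) == 0) {
        /* critical section */
    }
    atomic_store_explicit(&L1, 0, memory_order_release);
}

void thread2_enter(void) {
    atomic_store_explicit(&L2, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&L1, memory_order_relaxed) == 0) {
        /* critical section */
    }
    atomic_store_explicit(&L2, 0, memory_order_release);
}
```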
    159. Applications of Dekker’s Protocol. The THE protocol used by Cilk’s work-stealing scheduler [FLR98]: the victim vs. the thief. Java monitors using Quickly Reacquirable Locks or Biased Locking [DMS03, OKK04]: the bias-holding thread vs. a revoker thread. The JNI reentry barrier in a JVM: a Java mutator thread vs. the garbage collector. Network packet processing [VNE10]: the owner thread vs. other threads. These applications exhibit asymmetric synchronization patterns.
    167. Location-Based Memory Fences (joint work with Edya Ladan-Mozes and Dmitriy Vyukov). We introduce location-based memory fences, which cause a thread’s instruction stream to serialize only when another thread attempts to access the guarded memory location. Some applications can benefit from a software implementation [DHY03] that uses interrupts. A lightweight hardware mechanism can piggyback on the cache-coherence protocol.
    170. Outline (repeated).
    173. Transactional Memory (TM) [HM93] provides a transactional interface for accessing memory: atomic { x++; } (transaction A) and atomic { w = x; } (transaction B), each tracked with a read set and a write set.
    174. TM guarantees that transactions are serializable [P79].
    175. Nested Transactions, closed nesting: the inner transaction’s changes are propagated to the enclosing transaction A.
    176. Nested Transactions, open nesting: the inner transaction’s changes are committed globally.
    177. Nested Transactions: all memory is treated equally; there is only one level of abstraction.
    178. Ownership-Aware Transactions (OAT), joint work with Kunal Agrawal and Jim Sukha. Ownership-aware transactions are a hybrid between open nesting and closed nesting; they provide multiple levels of abstraction. In OAT, the programmer writes code with transactional modules, and the OAT system uses the concept of ownership types [BLS03] to ensure data encapsulation within a module. The OAT system guarantees abstract serializability as long as the program conforms to a set of well-defined constraints on how the modules share data.
    181. Outline (repeated).
    184. Parallelism Abstraction: a concurrency platform provides a layer of parallelism abstraction to help with load balancing and task scheduling. (Layers: User Application / Concurrency Platform / Operating System.)
    185. Memory Abstraction: a memory abstraction provides a different "view" of a memory location depending on the execution context in which the memory access is made. Examples: the TLMM cactus stack (each worker gets its own linear local view of the tree-structured call stack); hyperobjects [FHLL09] (a linguistic mechanism that supports coordinated local views of the same nonlocal object); transactional memory [HM93] (memory accesses dynamically enclosed by an atomic block appear to occur atomically). Can a concurrency platform likewise mitigate the complexity of synchronization by providing the right memory abstractions?
    188. OS / Hardware Support for Memory Abstraction. Recently, researchers have begun to explore ways to enable memory abstractions using page-mapping and page-protection mechanisms: C# with atomic sections [AHM09] (strong atomicity), Grace [BYL+09] (deterministic execution), Sammati [PV10] (deadlock avoidance), Cilk-M [LSH+10] (TLMM cactus stack). Can we relax the limitation of manipulating virtual memory at page granularity?
    192. THANK YOU! (Recap: Cilk-M; Survey of My Other Work: The JCilk Language, Location-Based Memory Fences, Ownership-Aware Transactional Memory; Direction for Future Work.)
    197. Quadratic Stack Growth [Robison08]. Assume one linear stack per worker. (Diagram: an invocation tree of depth d in which each parallel function P exposes a serial chain S alongside further parallel work, repeated d times.)
    198. Quadratic Stack Growth [Robison08]: the green worker repeatedly blocks, then steals, using Θ(d²) stack space.
    199. Performance Comparison (four quad-core 2 GHz AMD Opterons; 64 KB private L1, 512 KB private L2, 2 MB shared L3): Cilk-M running time / Cilk-5 running time. Time bound: TP = T1/P + C·T∞, where C = O(S1 + D).
    200. Space Usage (Hand Compiled). Space bound: SP/P ≤ S1 + D.
    201. Space Usage.
    202. GCC/Linux C Subroutine Linkage: the legacy linear stack obtains efficiency by overlapping frames (args to A; A’s return address; A’s parent’s base pointer; A’s local variables; the linkage region holding args to B; B’s return address; A’s base pointer; B’s local variables; args to B’s callees).
    203. Handling Page Granularity: upon a steal (steal A), the thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.
    204. (Animation: the thief calls C and D below the reserved linkage region.)
    205. (Animation: a second steal, steal C, with the stack pointer advanced the same way.)
    206. (Animation: the third worker continues with E.)
    207. Key Invocation Invariants: (1) arguments are passed via the stack pointer with positive offset; (2) local variables are referenced via the base pointer with negative offset; (3) live registers are flushed onto the stack immediately before each spawn; (4) live registers are flushed onto the stack before returning to the runtime if a sync fails; (5) when resuming a stolen function after a spawn or sync, live registers are restored from the stack; (6) when returning from a spawn, the return value is flushed from its register onto the stack; (7) the frame size is fixed before any spawn statements.
    208. GCC/Linux C Subroutine Linkage: legacy linear stacks enable efficient passing of arguments from caller to callee.
    209. Frame A accesses its arguments through positive offsets indexed from its base pointer.
    210. Frame A accesses its local variables through negative offsets indexed from its base pointer.
    211. Before invoking B, A places the arguments for B into the reserved linkage region it will share with B, which A indexes using positive offsets off its stack pointer.
    212. A then makes the call to B, which saves the return address for B and transfers control to B.
    213. Upon entering, B saves A’s base pointer and sets the base pointer to where the stack pointer is.
    214. B advances the stack pointer to allocate space for its local variables and linkage region.
    215. The legacy linear stack obtains efficiency by overlapping frames.
    216. Legacy Linear Stack: an execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.
    217. Rule for pointers: a parent can pass pointers to its stack variables down to its children (e.g., A passes &x down), but a child cannot pass a pointer to its stack variable up to its parent.
    218. Rule for pointers (continued): a child passing &z up to its parent is invalid.
    219. The Queens Problem: given n > 0, search for one way to arrange n queens on an n-by-n chessboard so that none attacks another.
    220. Exploring the Search Tree for Queens, serial strategy: depth-first search with backtracking. The search-tree size grows exponentially as n increases.
    221. Parallel strategy: spawn searches in parallel. This is speculative computation; some work may be wasted.
    222. Parallel strategy (continued): abort the other parallel searches once a solution is found.
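A hedged sketch of the speculative parallel search these slides describe, written in the talk’s Cilk-style C (spawn/sync as on the fib slide; Cilk Plus would spell them cilk_spawn/cilk_sync). The legality check ok() and the bound MAXQ are assumptions for illustration, each branch gets its own copy of the board to avoid races, and the abort-on-first-solution mechanism from the last slide (and from JCilk) is deliberately omitted, so wasted speculative work remains.

```c
#include <stdlib.h>
#include <string.h>

#define MAXQ 32                              /* assume n <= MAXQ for this sketch */

int ok(int row, int col, const char *board); /* assumed: placement is non-attacking */

int queens(int n, int row, const char *board) {
    if (row == n) return 1;                  /* placed all n queens */
    int  results[MAXQ] = {0};
    char *copies[MAXQ] = {0};
    for (int col = 0; col < n; col++) {
        if (ok(row, col, board)) {
            char *b = copies[col] = malloc(n);   /* private board for this branch */
            memcpy(b, board, row);
            b[row] = (char)col;
            results[col] = spawn queens(n, row + 1, b);
        }
    }
    sync;                                    /* wait for all speculative branches */
    int found = 0;
    for (int col = 0; col < n; col++) {
        found |= results[col];
        free(copies[col]);
    }
    return found;
}
```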
    223. Various Parallel Programming Models.
    224. Parallelize Your Code using Cilk++ (given a C++ class such as SAT_Solver). Options: (1) convert the entire code base to the Cilk++ language; (2) structure the project so that Cilk++ code calls C++ code, but not conversely; (3) allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++, via (a) C++ wrapper functions, (b) "extern C++", or (c) limited callback to C++ code.
    225. Parallelize Your Code using TBB: the same structuring options apply, but your program may end up using a lot more stack space or fail to get good speedup.
    226. Multicore Architecture — 2001: a chip multiprocessor (CMP). (The first non-embedded multicore microprocessor was the Power4 from IBM, 2001.)
    227. The Era of Multicore IS Here (Source: www.newegg.com): number of CPUs vs. number of cores; the single-core processor is becoming obsolete.
    228. My Sister Is Buying a New Laptop … (Source: www.apple.com). The era of multicore IS here!
