Hipeac
Speaker notes

  • One approach to overcoming the problem is speculative execution. A technique called Spec-DOALL boxes up the statements according to the iteration they belong to, removing the edges that do not cross an iteration boundary.
  • It then schedules multiple iterations to execute in parallel. To fit the loop into this rigid execution model, Spec-DOALL must break the edges that go from iteration to iteration. To do this, it first speculates that the loop will iterate many times.
  • That is a reasonable assumption, and it effectively transforms statement A into "while (true)", so the A blocks disappear from the schedule. In practice, though, loops like this one misspeculate frequently, and Spec-DOALL produces a slowdown (as on 197.parser); a sketch of the transformation follows these notes.
  • Spec-DOACROSS and Spec-DSWP differ primarily in how the red edges are handled. Together, these edges constitute the critical path of the loop: work on an iteration cannot start until each such edge is satisfied. Spec-DOACROSS communicates values back and forth between cores to respect these edges; in contrast, speculative pipelining (Spec-DSWP) respects them by executing the participating statements sequentially in a single thread, communicating the dependences that are off the critical path in a unidirectional, pipelined fashion.
  • When the communication latency between cores is one cycle, both techniques yield the same throughput. But when the latency is doubled, Spec-DOACROSS can initiate a new iteration only once every two cycles, because the inter-core latency sits on the critical path of execution. For Spec-DSWP, all that changes is that the pipeline fill time doubles; throughput remains one iteration per cycle.
  • Also in this paper is a description of our work in making TLS suitable for execution on clusters. We created this version of TLS as a strong point of comparison for Spec-DSWP. This graph shows the geomean performance speedup of TLS and Spec-DSWP. The results are as you would expect: TLS scales only to about 15x, reaching 90% of that gain by 64 cores, whereas Spec-DSWP scales to 49x, with even the last additional core posting reasonable gains. We think Spec-DSWP will do better still with improved machine bandwidth; TLS, on the other hand, will probably not improve much, due to fundamental physical limits on communication latency.
  • This graph shows performance speedup relative to the best optimized sequential performance. The Y axis shows speedup; the X axis shows the total number of nodes used to achieve it, each node having 4 cores. Overall we see a 49x geomean speedup on these applications running on a 128-core, 32-node Xeon cluster. Again, the speedup is relative to the best optimized sequential performance, not to the parallel version set to execute only one thread. Through a combination of optimizations described in the paper, we achieve scalable speedup for these originally sequential programs.
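
The note above glosses the key step; as a concrete illustration, here is a minimal runnable sketch of the Spec-DOALL exit speculation in C. It is not the compiler's actual output: misspeculate() and the setjmp-based recovery are hypothetical stand-ins for a real speculation runtime, which would version memory and squash whole iterations.

    #include <stdio.h>
    #include <setjmp.h>

    /* Hypothetical recovery hook, for illustration only. */
    static jmp_buf recovery;
    static void misspeculate(void) { longjmp(recovery, 1); }

    typedef struct node { struct node *next; int val; } node_t;
    static int work(node_t *n) { return n->val * 2; }   /* stand-in for C */
    static void write_res(int r) { printf("%d\n", r); } /* stand-in for D */

    int main(void) {
        node_t nodes[4] = {{&nodes[1], 1}, {&nodes[2], 2},
                           {&nodes[3], 3}, {NULL, 4}};
        node_t *node = nodes;
        if (setjmp(recovery) == 0) {
            /* Statement A is speculated ("the loop iterates many times"),
             * so the exit test becomes a misspeculation check inside
             * while (true). */
            while (1) {
                node = node->next;          /* B */
                if (node == NULL)           /* the exit was actually taken: */
                    misspeculate();         /* squash and jump to recovery  */
                write_res(work(node));      /* C, D */
            }
        }
        /* Recovery path: non-speculative re-execution would resume here. */
        return 0;
    }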

Slides

    1. A Roadmap to Restoring Computing's Former Glory
       David I. August, Princeton University
       (Not speaking for Parakinetics, Inc.)
    2. Era of DIY: Multicore, Reconfigurable, GPUs, Clusters
       10-Core Intel Xeon: "10 Cores!" "Unparalleled Performance"
       [Chart: SPEC CINT performance (log scale) vs. year, 1992-2012, spanning CPU92, CPU95, CPU2000, and CPU2006; the golden era of computer architecture ended, with performance now ~3 years behind the old trend]
    6. P6 Superscalar Architecture (circa 1994)
       [Diagram: Automatic Speculation, Automatic Pipelining, Commit, Automatic Allocation/Scheduling, Parallel Resources]
    7. Multicore Architecture (circa 2010)
       [Diagram: the same elements: Automatic Speculation, Automatic Pipelining, Commit, Automatic Allocation/Scheduling, Parallel Resources]
    8. [Figure-only slide; no extractable text]
    9. Parallel Library Calls
       [Diagram: threads vs. time, contrasting idealized parallel library calls with realizable parallelism; credit: Jack Dongarra]
    10. "Compiler Advances Double Computing Power Every 18 Years!" (Proebsting's Law)
    11. Multicore Needs:
        - Automatic resource allocation/scheduling, speculation/commit, and pipelining.
        - Low-overhead access to programmer insight.
        - Code reuse, ideally including legacy codes as well as new codes.
        - Intelligent automatic parallelization.
        A roadmap to restoring computing's former glory: implicitly parallel programming with critique-based, iterative, occasionally interactive, speculatively pipelined automatic parallelization.
        [Diagram: Parallel Programming, Computer Architecture, Parallel Libraries, Automatic Parallelization]
    12. Multicore Needs (same list as slide 11); One Implementation:
        [Diagram: New or Existing Sequential Code + Insight Annotation and New or Existing Libraries + Insight Annotation flow through the Complainer/Fixer and the DSWP-family, speculative, and other optimizations, yielding Parallelized Code built on Machine-Specific Performance Primitives]
    13. Spec-PS-DSWP vs. the P6 Superscalar Architecture
        [Diagram: Cores 1-4 execute the load (LD), work (W), and commit (C) operations of iterations 1-5 in overlapped, pipelined fashion over time]
    14. Example and its Program Dependence Graph (PDG):
        A: while (node) {
        B:   node = node->next;
        C:   res = work(node);
        D:   write(res);
           }
        [Diagram: the PDG over A, B, C, and D with control- and data-dependence edges, and a schedule of A1, B1, C1, D1, A2, ... on Cores 1-3 over time; the dependences are spelled out in the sketch below]
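
For reference, here is the slide's loop again with the dependences the PDG captures written out as comments (an annotation sketch; the statement labels are the slide's own):

    A: while (node) {           /* control dependence: A decides whether    */
                                /* B, C, and D of this iteration execute    */
    B:     node = node->next;   /* loop-carried data dependence:            */
                                /* B(i) -> B(i+1); also B -> A and B -> C   */
    C:     res = work(node);    /* intra-iteration data dependence: C -> D  */
    D:     write(res);          /* nothing carried into later iterations    */
       }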
    15. Spec-DOALL (step 1)
        [Same example and PDG; statements are boxed by the iteration they belong to, removing edges that do not cross an iteration boundary]
    16. Spec-DOALL (step 2)
        [Schedule: iterations 1-3, each with its own A, B, C, and D, dispatched in parallel across Cores 1-3]
    17. Spec-DOALL performance: speculating the exit turns the loop header "A: while (node)" into "while (true)":
        while (true) {
        B:   node = node->next;
        C:   res = work(node);
        D:   write(res);
           }
        [Schedule: iterations run fully in parallel across Cores 1-3; result on 197.parser: slowdown]
    18. Spec-DOACROSS vs. Spec-DSWP
        Throughput: 1 iter/cycle for both.
        [Schedules on Cores 1-3: Spec-DOACROSS rotates each iteration's B, C, and D across cores; Spec-DSWP pins stage B to Core 1, C to Core 2, and D to Core 3, streaming values forward over time]
    19. Comparison: Spec-DOACROSS and Spec-DSWP
        Comm. latency = 1: 1 iter/cycle for both.
        Comm. latency = 2: Spec-DOACROSS drops to 0.5 iter/cycle; Spec-DSWP stays at 1 iter/cycle, paying only a longer pipeline fill time.
        [Schedules on Cores 1-3 illustrating both cases; a sketch of the pipelined organization follows]
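
To make the pipelined organization concrete, here is a minimal two-stage sketch in C with POSIX threads. It illustrates the communication pattern only and is not the actual Spec-DSWP runtime: stage 1 runs the critical-path pointer chase (A and B) sequentially and forwards each node through a unidirectional queue to stage 2, which runs C and D. Extra latency on the queue lengthens only the pipeline fill; steady-state throughput is unchanged.

    #include <pthread.h>
    #include <stdio.h>

    typedef struct node { struct node *next; int val; } node_t;

    /* Unidirectional bounded queue: the only inter-core communication. */
    #define QCAP 64
    static node_t *queue[QCAP];
    static int qhead, qtail, qcount, done;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t notempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t notfull = PTHREAD_COND_INITIALIZER;

    static void enqueue(node_t *n) {
        pthread_mutex_lock(&qlock);
        while (qcount == QCAP) pthread_cond_wait(&notfull, &qlock);
        queue[qtail] = n; qtail = (qtail + 1) % QCAP; qcount++;
        pthread_cond_signal(&notempty);
        pthread_mutex_unlock(&qlock);
    }

    static node_t *dequeue(void) {      /* NULL means "loop finished" */
        pthread_mutex_lock(&qlock);
        while (qcount == 0 && !done) pthread_cond_wait(&notempty, &qlock);
        node_t *n = NULL;
        if (qcount > 0) {
            n = queue[qhead]; qhead = (qhead + 1) % QCAP; qcount--;
            pthread_cond_signal(&notfull);
        }
        pthread_mutex_unlock(&qlock);
        return n;
    }

    static int work(node_t *n) { return n->val * 2; }  /* stand-in for C */

    /* Stage 1: A and B, the loop-carried pointer chase, kept sequential. */
    static void *stage1(void *arg) {
        for (node_t *n = arg; n; n = n->next)
            enqueue(n);
        pthread_mutex_lock(&qlock);
        done = 1;
        pthread_cond_broadcast(&notempty);
        pthread_mutex_unlock(&qlock);
        return NULL;
    }

    /* Stage 2: C and D, off the critical path. */
    static void *stage2(void *arg) {
        (void)arg;
        for (node_t *n; (n = dequeue()) != NULL; )
            printf("%d\n", work(n));
        return NULL;
    }

    int main(void) {
        node_t nodes[8] = {{0}};
        for (int i = 0; i < 8; i++) {
            nodes[i].val = i;
            nodes[i].next = (i < 7) ? &nodes[i + 1] : NULL;
        }
        pthread_t t1, t2;
        pthread_create(&t1, NULL, stage1, nodes);
        pthread_create(&t2, NULL, stage2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }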
    20. TLS vs. Spec-DSWP [MICRO 2010]
        Geomean of 11 benchmarks on the same cluster.
    21. Multicore Needs (revisited; same list and implementation diagram as slide 12)
    22. Execution Plan
        char *memory;
        void *alloc(int size) {
            void *ptr = memory;
            memory = memory + size;
            return ptr;
        }
        [Schedule: calls alloc1 through alloc6 on Cores 1-3 over time, serialized by the dependence on memory]
    23. Execution Plan, with annotation:
        char *memory;
        @Commutative
        void *alloc(int size) {
            void *ptr = memory;
            memory = memory + size;
            return ptr;
        }
        [Schedule: the alloc calls may now be scheduled freely across Cores 1-3]
    24. Execution Plan (continued): Easily Understood Non-Determinism!
        [Same code and schedule as slide 23; a sketch of why the calls commute follows]
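
A minimal sketch of what @Commutative asserts here: every order of alloc() calls hands out disjoint blocks, so different schedules differ only in which addresses each call receives. The version below is an illustration, not the paper's implementation; the atomic bump is one way to keep the reordered, now-concurrent calls race-free, while the annotation itself leaves the slide's function body unchanged.

    #include <stdatomic.h>
    #include <stdio.h>

    static char arena[1 << 20];          /* bump-pointer heap, as on the slide */
    static _Atomic(char *) memory = arena;

    /* @Commutative: calls may run in any order; each order returns
     * non-overlapping blocks, so the non-determinism is easy to reason about. */
    void *alloc(int size) {
        return atomic_fetch_add(&memory, size);   /* atomic bump */
    }

    int main(void) {
        int *a = alloc(sizeof *a);
        int *b = alloc(sizeof *b);
        *a = 1; *b = 2;                  /* disjoint whatever the call order */
        printf("%d %d\n", *a, *b);
        return 0;
    }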
    25. ~50 of ½ million LOCs modified in SPEC CINT 2000
        Mods also include Non-Deterministic Branch.
        [MICRO '07, Top Picks '08; Automatic: PLDI '11]
    26. Multicore Needs (revisited; same list and implementation diagram as slide 12)
    27. Iterative Compilation [Cooper '05; Almagor '04; Triantafyllis '05]
        [Diagram: candidate transformation sequences built from Rotate, Unroll, and Sum Reduction, with measured speedups ranging from 0.10x to 30.0x; a sketch of the search loop follows]
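
The search behind the slide's numbers can be sketched as a measure-and-keep-best loop. This is a hypothetical harness: measure_runtime() stands in for compiling the loop with a given transformation sequence and timing the resulting binary.

    #include <stdio.h>
    #include <string.h>

    /* Stub: a real harness would invoke the compiler with `seq`
     * and time the produced binary on a training input. */
    static double measure_runtime(const char *seq) {
        return (double)strlen(seq);            /* placeholder metric */
    }

    int main(void) {
        /* candidate sequences over the slide's transformations */
        const char *seqs[] = {
            "rotate",
            "unroll,sum-reduction",
            "rotate,unroll",
            "rotate,unroll,sum-reduction",
        };
        const char *best = NULL;
        double best_time = 1e300;
        for (size_t i = 0; i < sizeof seqs / sizeof seqs[0]; i++) {
            double t = measure_runtime(seqs[i]);   /* compile + run + time */
            if (t < best_time) { best_time = t; best = seqs[i]; }
        }
        printf("best sequence: %s\n", best);
        return 0;
    }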
    28. PS-DSWP Complainer
        [Diagram: the dependence graph the complainer reports for the loop]
    29. PS-DSWP Complainer: "Who can help me?"
        Programmer annotation; Sum Reduction; Unroll; Rotate.
        Red edges: deps between malloc() & free(). Blue edges: deps between rand() calls. Green edges: flow deps inside the inner loop. Orange edges: deps between function calls.
    30. PS-DSWP Complainer, after Sum Reduction
    31. PS-DSWP Complainer, after Sum Reduction + PROGRAMMER Commutative
    32. PS-DSWP Complainer, after Sum Reduction + PROGRAMMER Commutative + LIBRARY Commutative
    33. PS-DSWP Complainer (same annotations applied; remaining dependence graph)
    34. Multicore Needs (revisited; same list and implementation diagram as slide 12)
    35. Performance relative to Best Sequential
        128 cores in 32 nodes with Intel Xeon processors [MICRO 2010]
    36. Restoration of Trend
    37. "Compiler Advances Double Computing Power Every 18 Years!" (Proebsting's Law)
        Compiler Technology vs. Architecture/Devices.
        Era of DIY: Multicore, Reconfigurable, GPUs, Clusters.
        Compiler-technology-inspired class of architectures?
    41. The End
