Speaker notes:
  • One approach to overcoming the problem is speculative execution. A technique called Speculative DOALL boxes up the statements according to the iterations to which they belong, thus removing the edges that do not cross an iteration boundary.
  • It then schedules multiple iterations to execute in parallel. To fit the loop into this rigid execution model, Spec-DOALL needs to break the edges that go from iteration to iteration. To do this, Spec-DOALL first speculates that the loop will iterate many times.
  • A reasonable assumption. This effectively transforms statement A into "while (true)", so the A blocks disappear from the schedule. The remaining loop-carried edges, however, cause frequent conflicts under Spec-DOALL, producing a slowdown. (A minimal C sketch of this control speculation follows these notes.)
  • Spec-DOACROSS vs. Spec-DSWP: the primary difference between these techniques is the way the red edges are handled. These edges together constitute the critical path of the loop: work on an iteration cannot start until the edge is satisfied. Spec-DOACROSS communicates values back and forth between cores to respect these edges; in contrast, speculative pipelining (Spec-DSWP) respects them by executing the statements on those edges sequentially in a single thread, while communicating the dependences off the critical path in a unidirectional, pipelined fashion. (A two-stage pipeline sketch in C follows these notes.)
  • When the communication latency between cores is 1 cycle, both techniques yield the same throughput. But when the communication latency doubles, Spec-DOACROSS can initiate a new iteration only once every two cycles, because the inter-core latency sits on the critical path of execution. In contrast, all that changes for speculative pipelining is that the pipeline fill time doubles; the throughput remains 1 iteration/cycle.
  • Also in this paper is a description of our work making TLS suitable for execution on clusters. We created this version of TLS as a strong point of comparison for Spec-DSWP. This graph shows the geomean performance speedup of TLS and Spec-DSWP. The results are as you would expect: TLS scales only to about 15x, reaching 90% of that gain at 64 cores. Spec-DSWP, however, scales to 49x, with even the last additional core posting reasonable gains. We think Spec-DSWP will do better still with improved machine bandwidth; TLS, on the other hand, will probably not improve much in the future, due to fundamental physical limits on communication latency.
  • This graph shows performance speedup relative to the best optimized sequential performance. The Y axis shows speedup; the X axis shows the total number of nodes used to achieve it, with 4 cores per node. Overall, we see a 49x geomean speedup for these applications on a 128-core, 32-node Xeon cluster. Again, the speedup is relative to the best optimized sequential performance, not to the parallel version run with a single thread. Through the combination of optimizations described in the paper, we achieve scalable speedups for these originally sequential programs.
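
To make the control speculation described above concrete, here is a minimal, runnable C sketch of the Spec-DOALL transformation applied to the loop from slide 14. It is a sketch under stated assumptions: work, write_result, and misspeculate are illustrative stand-ins rather than the paper's runtime API, the statement order is simplified slightly so work() never sees a NULL node, and a real runtime would squash in-flight speculative iterations and restore memory state on misspeculation, where this version merely jumps to a recovery point.

      /* Spec-DOALL control speculation (illustrative sketch). */
      #include <stdio.h>
      #include <setjmp.h>

      struct node { int val; struct node *next; };

      static jmp_buf recovery;                /* where recovery resumes */

      static int  work(struct node *n) { return n->val * 2; }
      static void write_result(int r)  { printf("%d\n", r); }

      /* Toy recovery: a real runtime would squash speculative state. */
      static void misspeculate(void) { longjmp(recovery, 1); }

      static void spec_doall_loop(struct node *node)
      {
          if (setjmp(recovery))               /* recovery point: loop done */
              return;
          for (;;) {                          /* A speculated into while (true) */
              if (!node)                      /* A's exit test, demoted from a */
                  misspeculate();             /* scheduling dependence to a check */
              int res = work(node);           /* C */
              write_result(res);              /* D */
              node = node->next;              /* B: the loop-carried edge that
                                                 still causes conflicts */
          }
      }

      int main(void)
      {
          struct node c = {3, NULL}, b = {2, &c}, a = {1, &b};
          spec_doall_loop(&a);                /* prints 2, 4, 6 */
          return 0;
      }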
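And a companion sketch of the pipelining idea behind Spec-DSWP (minus the speculation): the loop-carried traversal (A and B) stays sequential on one thread, the off-critical-path work (C and D) runs on another, and values flow one way through a queue. The mutex-based queue below is an assumed stand-in for the low-overhead inter-core queues a real DSWP runtime would use; its latency affects only the pipeline fill time, not the steady-state throughput.

      /* Two-stage DSWP-style pipeline (illustrative sketch). */
      #include <pthread.h>
      #include <stdio.h>

      struct node { int val; struct node *next; };

      #define QSIZE 64
      #define DONE  (-1)                      /* end-of-stream sentinel */

      /* Toy bounded single-producer/single-consumer queue. */
      static int q[QSIZE], q_head, q_tail, q_count;
      static pthread_mutex_t q_lock     = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
      static pthread_cond_t  q_nonfull  = PTHREAD_COND_INITIALIZER;

      static void q_push(int v)
      {
          pthread_mutex_lock(&q_lock);
          while (q_count == QSIZE) pthread_cond_wait(&q_nonfull, &q_lock);
          q[q_tail] = v; q_tail = (q_tail + 1) % QSIZE; q_count++;
          pthread_cond_signal(&q_nonempty);
          pthread_mutex_unlock(&q_lock);
      }

      static int q_pop(void)
      {
          pthread_mutex_lock(&q_lock);
          while (q_count == 0) pthread_cond_wait(&q_nonempty, &q_lock);
          int v = q[q_head]; q_head = (q_head + 1) % QSIZE; q_count--;
          pthread_cond_signal(&q_nonfull);
          pthread_mutex_unlock(&q_lock);
          return v;
      }

      static int work(int v) { return v * 2; }

      /* Stage 1: the sequential critical path (A, B). It only sends
       * values downstream; it never waits on the other core. */
      static void *stage1(void *arg)
      {
          for (struct node *n = arg; n; n = n->next)   /* A, B */
              q_push(n->val);
          q_push(DONE);
          return NULL;
      }

      /* Stage 2: off-critical-path work (C, D). */
      static void *stage2(void *arg)
      {
          (void)arg;
          for (int v; (v = q_pop()) != DONE; )
              printf("%d\n", work(v));                 /* C, D */
          return NULL;
      }

      int main(void)
      {
          struct node c = {3, NULL}, b = {2, &c}, a = {1, &b};
          pthread_t t1, t2;
          pthread_create(&t1, NULL, stage1, &a);
          pthread_create(&t2, NULL, stage2, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          return 0;
      }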

    1. A Roadmap to Restoring Computing's Former Glory
       David I. August, Princeton University
       (Not speaking for Parakinetics, Inc.)
    2-5. Era of DIY: Multicore, Reconfigurable, GPUs, Clusters
         10 Cores! 10-Core Intel Xeon: “Unparalleled Performance”
         [Chart: SPEC CINT performance (log scale) vs. year, 1992-2012, across CPU92, CPU95, CPU2000, and CPU2006, annotated “Golden era of computer architecture” and “~3 years behind”]
    6. P6 Superscalar Architecture (circa 1994)
       [Block diagram: Automatic Speculation, Automatic Pipelining, Commit, Automatic Allocation/Scheduling, Parallel Resources]
    7. Multicore Architecture (circa 2010)
       [Block diagram: the same blocks as the P6 slide: Automatic Speculation, Automatic Pipelining, Commit, Automatic Allocation/Scheduling, Parallel Resources]
    8. [image-only slide; no text]
    9. Parallel Library Calls
       [Charts: threads vs. time for parallel library calls, and threads vs. time for realizable parallelism. Credit: Jack Dongarra]
    10. “Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
    11. Multicore Needs:
        1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
        2. Low overhead access to programmer insight.
        3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
        4. Intelligent automatic parallelization.
        Implicitly parallel programming with critique-based, iterative, occasionally interactive, speculatively pipelined automatic parallelization: a roadmap to restoring computing’s former glory.
        [Diagram: Parallel Programming, Computer Architecture, Parallel Libraries, Automatic Parallelization]
    12. One Implementation
        The Multicore Needs list from slide 11, beside the implementation flow:
        [Flow diagram: New or Existing Sequential Code and New or Existing Libraries, each carrying Insight Annotations, feed a Complainer/Fixer; DSWP-family, speculative, and other optimizations, plus machine-specific performance primitives, produce the Parallelized Code]
    13. Spec-PS-DSWP vs. the P6 Superscalar Architecture
        [Pipeline diagram: loads (LD:1-LD:5), work (W:1-W:4), and commits (C:1-C:3) scheduled across Cores 1-4 over time]
    14. Example
        A: while (node) {
        B:   node = node->next;
        C:   res = work(node);
        D:   write(res);
           }
        [Program Dependence Graph over A-D, with control and data dependence edges, beside the sequential schedule of A, B, C, D instances on Cores 1-3 over time]
    15. Spec-DOALL
        [Same example and PDG; boxing the statements by iteration removes the edges that do not cross an iteration boundary]
    16. Spec-DOALL
        [Schedule: iterations 1-3 (A, B, C, D each), boxed per iteration, run in parallel on Cores 1-3]
    17. Spec-DOALL Performance
        A: while (node) {   becomes   while (true) {
        B:   node = node->next;
        C:   res = work(node);
        D:   write(res);
           }
        [Schedule: B, C, D of successive iterations run in parallel across Cores 1-3]
        Result: slowdown on 197.parser.
    18. Spec-DOACROSS vs. Spec-DSWP
        Throughput: 1 iteration/cycle for each.
        [Schedules on Cores 1-3: Spec-DOACROSS rotates B, C, and D of successive iterations around the cores; Spec-DSWP pins B, C, and D each to its own core and streams iterations through the pipeline]
    19. The Latency Problem: Spec-DOACROSS vs. Spec-DSWP
        Communication latency 1: 1 iteration/cycle for both.
        Communication latency 2: 0.5 iterations/cycle for Spec-DOACROSS; Spec-DSWP pays only a longer pipeline fill time and stays at 1 iteration/cycle.
        [Schedule diagrams for both techniques on Cores 1-3]
    20. TLS vs. Spec-DSWP [MICRO 2010]
        Geomean of 11 benchmarks on the same cluster.
    21. Multicore Needs / One Implementation
        [Repeat of slides 11-12: the Multicore Needs list and the One Implementation flow diagram]
    22. Execution Plan
        char *memory;
        void * alloc(int size);

        void * alloc(int size) {
          void * ptr = memory;
          memory = memory + size;
          return ptr;
        }
        [Schedule: calls alloc1 through alloc6 execute serially across Cores 1-3 over time]
    23. Execution Plan
        The same allocator, now annotated:
        @Commutative
        void * alloc(int size) { ... }
        [Schedule: alloc1 through alloc6 across Cores 1-3, with the ordering constraint relaxed]
    24. Execution Plan
        [Schedule: with @Commutative, the alloc calls run in parallel across Cores 1-3]
        Easily Understood Non-Determinism! (A C sketch of the @Commutative idea appears after the transcript.)
    25. ~50 of ½ million LOC modified in SpecINT 2000.
        Mods also include Non-Deterministic Branch.
        [MICRO ’07, Top Picks ’08; Automatic: PLDI ’11]
    26. Multicore Needs / One Implementation
        [Repeat of slides 11-12: the Multicore Needs list and the One Implementation flow diagram]
    27. Iterative Compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05]
        [Search tree over transformation sequences (Rotate, Unroll, Sum Reduction) with measured speedups from 0.10x to 30.0x]
    28. PS-DSWP Complainer
    29. PS-DSWP Complainer: “Who can help me?”
        Candidate fixes: Programmer Annotation, Sum Reduction, Unroll, Rotate.
        Red edges: dependences between malloc() and free().
        Blue edges: dependences between rand() calls.
        Green edges: flow dependences inside the inner loop.
        Orange edges: dependences between function calls.
    30. PS-DSWP Complainer: Sum Reduction applied. (A C sketch of the reduction transformation appears after the transcript.)
    31. PS-DSWP Complainer: Sum Reduction, plus a programmer-supplied Commutative annotation.
    32-33. PS-DSWP Complainer: Sum Reduction, plus programmer-supplied and library-supplied Commutative annotations.
    34. Multicore Needs / One Implementation
        [Repeat of slides 11-12, with all four needs now addressed]
    35. Performance relative to Best Sequential
        128 cores in 32 nodes with Intel Xeon processors [MICRO 2010]
    36. Restoration of Trend
    37-40. “Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
           Compiler Technology vs. Architecture/Devices
           Era of DIY: Multicore, Reconfigurable, GPUs, Clusters
           A compiler-technology-inspired class of architectures?
    41. The End
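
The C sketch referenced on slide 24, illustrating the @Commutative idea: the bump allocator's calls all conflict on the global memory pointer, yet any order of the calls is acceptable, so the annotation lets the parallelizer drop the ordering dependence. This is a hedged sketch, not the paper's mechanism: each call still needs to execute atomically, and the lock below is an assumed implementation detail.

      /* @Commutative allocator (illustrative sketch). */
      #include <pthread.h>
      #include <stdio.h>

      static char  heap[1 << 20];
      static char *memory = heap;
      static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

      /* @Commutative: calls may run in any relative order, as long as
       * each one is atomic and returns a distinct region. */
      static void *alloc(int size)
      {
          pthread_mutex_lock(&alloc_lock);
          void *ptr = memory;
          memory = memory + size;
          pthread_mutex_unlock(&alloc_lock);
          return ptr;
      }

      /* Which addresses each worker receives is non-deterministic,
       * but easily understood: every call gets a fresh region. */
      static void *worker(void *arg)
      {
          (void)arg;
          for (int i = 0; i < 3; i++)
              printf("got %p\n", alloc(16));
          return NULL;
      }

      int main(void)
      {
          pthread_t t[3];
          for (int i = 0; i < 3; i++) pthread_create(&t[i], NULL, worker, NULL);
          for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
          return 0;
      }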
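The C sketch referenced on slide 30, a generic illustration of the Sum Reduction fix (not the tool's generated output): each worker accumulates into a private partial sum, and the partials are combined after the joins, which removes the loop-carried dependence on the shared accumulator that the Complainer reports.

      /* Sum reduction via privatized partial sums (illustrative sketch). */
      #include <pthread.h>
      #include <stdio.h>

      #define N        1000000
      #define NTHREADS 4

      static int data[N];

      struct task { int lo, hi; long partial; };

      static void *sum_range(void *arg)
      {
          struct task *t = arg;
          long s = 0;                       /* privatized accumulator */
          for (int i = t->lo; i < t->hi; i++)
              s += data[i];
          t->partial = s;
          return NULL;
      }

      int main(void)
      {
          for (int i = 0; i < N; i++) data[i] = 1;

          pthread_t th[NTHREADS];
          struct task tasks[NTHREADS];
          for (int i = 0; i < NTHREADS; i++) {
              tasks[i].lo = i * (N / NTHREADS);
              tasks[i].hi = (i + 1) * (N / NTHREADS);
              pthread_create(&th[i], NULL, sum_range, &tasks[i]);
          }

          long sum = 0;                     /* sequential combine step */
          for (int i = 0; i < NTHREADS; i++) {
              pthread_join(th[i], NULL);
              sum += tasks[i].partial;
          }
          printf("sum = %ld\n", sum);       /* prints sum = 1000000 */
          return 0;
      }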