# HiPEAC

• One approach to overcoming the problem is speculative execution. A technique called Spec-DOALL boxes up the statements according to the iterations to which they belong, thus removing edges that do not cross the iteration boundary.
• It then schedules multiple iterations to execute in parallel. To fit the loop into this rigid execution model, Spec-DOALL must break the edges that go from iteration to iteration. To do this, it first speculates that the loop will iterate many times.
• That is a reasonable assumption; it effectively transforms statement A into "while (true)", so the A blocks disappear from the schedule. Unfortunately, frequent misspeculation conflicts with Spec-DOALL produce a slowdown.
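As a concrete sketch of the transformation this note describes, here is a hypothetical, single-threaded rendering of the example loop under Spec-DOALL-style speculation. The names `run_spec_doall` and `work` are illustrative, and the statement order is simplified (work before the pointer advance); real Spec-DOALL runs iterations on separate cores with misspeculation detection and recovery.

```c
#include <stddef.h>

/* Hypothetical sketch: speculate that "A: while (node)" behaves like
 * "while (true)", run iterations as if independent, and squash once the
 * exit condition finally fires (the misspeculation). */
typedef struct node { int val; struct node *next; } node_t;

static int work(node_t *n) { return n->val * 2; }  /* stands in for C */

int run_spec_doall(node_t *node, int *results, int max_iters) {
    int iters = 0;
    while (1) {                          /* A speculated into while(true) */
        if (node == NULL || iters == max_iters)
            break;                       /* misspeculation detected: stop */
        results[iters++] = work(node);   /* C: res = work(node); D: write */
        node = node->next;               /* B: node = node->next */
    }
    return iters;
}
```

Being single-threaded, the sketch shows only where the speculated exit check lands, not the parallel schedule itself.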
• Spec-DOACROSS and Spec-DSWP: the primary difference between these techniques is how the red edges are handled. Together, these edges constitute the critical path of the loop: work on an iteration cannot start until the edge is satisfied. Spec-DOACROSS communicates values back and forth between cores to respect these edges; in contrast, speculative pipelining (Spec-DSWP) respects them by executing the statements on the critical path sequentially in a single thread, while communicating the dependences off the critical path in a unidirectional, pipelined fashion.
• When the communication latency between cores is 1 cycle, both techniques yield the same throughput. But when the latency is doubled, Spec-DOACROSS can initiate a new iteration only once every two cycles, because the inter-core latency is on the critical path of execution. In contrast, all that changes for speculative pipelining is that the pipeline fill time doubles; the throughput remains 1 iteration/cycle.
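The throughput claim can be checked with a toy cycle-count model (the function names and stage counts here are assumptions for illustration, not from the paper): DOACROSS pays the inter-core latency on the critical path once per iteration, while the pipelined schedule pays it only during the fill.

```c
/* Toy schedule model for the Spec-DOACROSS vs. Spec-DSWP comparison.
 * iters   = loop iterations
 * latency = inter-core communication latency in cycles
 * stages  = number of pipeline stages (e.g., B, C, D) */
int doacross_cycles(int iters, int latency, int stages) {
    /* the cross-core dependence is on the critical path, so a new
     * iteration can begin only every `latency` cycles */
    return stages + (iters - 1) * latency;
}

int pipeline_cycles(int iters, int latency, int stages) {
    int fill = (stages - 1) * latency;   /* one-time pipeline fill */
    return fill + iters;                 /* then one iteration per cycle */
}
```

With 3 stages and 100 iterations: at latency 1 both models take 102 cycles; at latency 2, pipelining takes 104 cycles (the fill doubles) while DOACROSS takes 201, i.e., throughput halves to 0.5 iter/cycle.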
• Also in this paper is a description of our work in making TLS suitable for execution on clusters. We created this version of TLS as a strong point of comparison to Spec-DSWP. This graph shows the geomean performance speedup of TLS and Spec-DSWP. The results are as you would expect: TLS scales only to about 15x, getting 90% of that gain at 64 cores. Spec-DSWP, however, scales to 49x, with even the last additional core posting reasonable gains. Again, we think Spec-DSWP will do better with improved machine bandwidth; TLS, on the other hand, will probably not improve much in the future due to fundamental physical limits on communication latency.
• This graph shows performance speedup relative to the best optimized sequential performance. The Y axis shows speedup; the X axis shows the total number of nodes used to achieve it, each node having 4 cores. Overall we see a 49x geomean speedup on these applications running on a 128-core, 32-node Xeon cluster. Again, the speedup is relative to the best optimized sequential performance, not to the parallel version set to execute only one thread. Through a combination of optimizations described in the paper, we achieve scalable speedup for these originally sequential programs.

1. A Roadmap to Restoring Computing's Former Glory
   David I. August, Princeton University
   (Not speaking for Parakinetics, Inc.)
2. Era of DIY: Multicore, Reconfigurable, GPUs, Clusters.
   "10 Cores!" (10-Core Intel Xeon, "Unparalleled Performance")
   Golden era of computer architecture; now ~3 years behind.
   [Chart: SPEC CINT performance (log scale) by year, 1992-2012, spanning the CPU92, CPU95, CPU2000, and CPU2006 suites.]
6. P6 Superscalar Architecture (circa 1994): Automatic Speculation, Automatic Pipelining, Commit, Automatic Allocation/Scheduling, Parallel Resources.
7. Multicore Architecture (circa 2010): the same elements recur: Automatic Speculation, Automatic Pipelining, Commit, Automatic Allocation/Scheduling, Parallel Resources.
8. [Image-only slide.]
9. Parallel Library Calls: realizable parallelism, shown as threads over time. (Credit: Jack Dongarra)
10. "Compiler Advances Double Computing Power Every 18 Years!" – Proebsting's Law
11. Multicore Needs:
    - Automatic resource allocation/scheduling, speculation/commit, and pipelining.
    - Low-overhead access to programmer insight.
    - Code reuse; ideally, this includes support of legacy codes as well as new codes.
    - Intelligent automatic parallelization.
    A roadmap to restoring computing's former glory: implicitly parallel programming with critique-based, iterative, occasionally interactive, speculatively pipelined automatic parallelization, spanning parallel programming, computer architecture, parallel libraries, and automatic parallelization.
12. One implementation: new or existing sequential code and libraries, carrying insight annotations, flow through speculative optimizations, DSWP-family optimizations, and other optimizations, supported by machine-specific performance primitives and a Complainer/Fixer, to produce parallelized code.
13. Spec-PS-DSWP vs. the P6 superscalar architecture. [Diagram: four cores pipelining load (LD), work (W), and commit (C) operations for iterations 1-5.]
14. Example and its Program Dependence Graph (PDG):

        A: while (node) {
        B:   node = node->next;
        C:   res = work(node);
        D:   write(res);
           }

    [Diagram: PDG over A-D with control and data dependences; execution schedule on three cores over time.]
15. Spec-DOALL: the same example and PDG, with statements boxed by iteration. [Diagram.]
16. Spec-DOALL: iterations 1-3 scheduled across three cores. [Diagram.]
17. Spec-DOALL performance: speculating the loop condition rewrites "A: while (node)" as "while (true)":

        while (true) {
        B:   node = node->next;
        C:   res = work(node);
        D:   write(res);
           }

    [Diagram: execution schedule on three cores.] Result on 197.parser: slowdown.
18. Spec-DOACROSS vs. Spec-DSWP: both achieve a throughput of 1 iter/cycle. [Diagram: three-core schedules of B, C, and D operations over time for each technique.]
19. The latency problem: comparison of Spec-DOACROSS and Spec-DSWP.
    Spec-DOACROSS: comm. latency = 1 gives 1 iter/cycle; comm. latency = 2 gives 0.5 iter/cycle.
    Spec-DSWP: comm. latency = 1 gives 1 iter/cycle; comm. latency = 2 still gives 1 iter/cycle; only the pipeline fill time grows.
    [Diagram: three-core schedules over time.]
20. TLS vs. Spec-DSWP [MICRO 2010]: geomean of 11 benchmarks on the same cluster.
21. Multicore Needs and the one-implementation diagram, repeated from slide 12.
22. Execution plan for a shared allocator:

        char *memory;
        void *alloc(int size);

        void *alloc(int size) {
          void *ptr = memory;
          memory = memory + size;
          return ptr;
        }

    [Diagram: calls alloc1-alloc6 serialized across three cores.]
23. The same allocator marked @Commutative:

        @Commutative
        void *alloc(int size) {
          void *ptr = memory;
          memory = memory + size;
          return ptr;
        }

    [Diagram: the alloc calls now overlap across cores.]
24. Same code and annotation; the overlapping schedule yields "Easily Understood Non-Determinism!"
25. ~50 of ½ million LOCs modified in SpecINT 2000. Mods also include Non-Deterministic Branch. [MICRO '07, Top Picks '08; automatic: PLDI '11]
26. Multicore Needs and the one-implementation diagram, repeated from slide 12.
27. Iterative compilation [Cooper '05; Almagor '04; Triantafyllis '05]. [Diagram: candidate transformation sequences of Rotate, Unroll, and Sum Reduction with measured speedups of 0.90x, 0.10x, 30.0x, 1.5x, 1.1x, and 0.8x.]
28. PS-DSWP Complainer.
29. PS-DSWP Complainer: "Who can help me?" Programmer annotation; Sum Reduction, Unroll, Rotate.
    Red edges: deps between malloc() & free(). Blue edges: deps between rand() calls. Green edges: flow deps inside the inner loop. Orange edges: deps between function calls.
30. PS-DSWP Complainer: Sum Reduction.
31. PS-DSWP Complainer: Sum Reduction; PROGRAMMER Commutative.
32. PS-DSWP Complainer: Sum Reduction; PROGRAMMER Commutative; LIBRARY Commutative.
33. PS-DSWP Complainer: Sum Reduction; PROGRAMMER Commutative; LIBRARY Commutative (continued).
34. Multicore Needs and the one-implementation diagram, repeated from slide 12.
35. Performance relative to best sequential: 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].
36. Restoration of trend.
37. "Compiler Advances Double Computing Power Every 18 Years!" – Proebsting's Law. Compiler technology vs. architecture/devices. Era of DIY: Multicore, Reconfigurable, GPUs, Clusters. A compiler-technology-inspired class of architectures?
41. The End