Hipeac

  1. A Roadmap to Restoring Computing's Former Glory
     David I. August, Princeton University
     (Not speaking for Parakinetics, Inc.)
  2–5. Era of DIY:
     - Multicore
     - Reconfigurable
     - GPUs
     - Clusters
     10 Cores! 10-Core Intel Xeon: “Unparalleled Performance”
     [Chart: SPEC CINT performance over time, log scale (CPU92, CPU95, CPU2000, CPU2006; 1992–2012), showing the golden era of computer architecture and recent performance running ~3 years behind the old trend.]
  6. P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994): automatic speculation, automatic pipelining, commit, automatic allocation/scheduling, parallel resources.
  7. Multicore Architecture (circa 2010): automatic speculation, automatic pipelining, commit, automatic allocation/scheduling, parallel resources.
  8. [image-only slide]
  9. Parallel library calls vs. realizable parallelism. [Two threads-over-time diagrams.] Credit: Jack Dongarra.
  10. “Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
  11. Multicore Needs:
     - Automatic resource allocation/scheduling, speculation/commit, and pipelining.
     - Low-overhead access to programmer insight.
     - Code reuse. Ideally, this includes support of legacy codes as well as new codes.
     - Intelligent automatic parallelization.
     A roadmap to restoring computing’s former glory: implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization.
     [Diagram relating Parallel Programming, Computer Architecture, Parallel Libraries, and Automatic Parallelization.]
  12. Multicore Needs (list repeated). One implementation:
     [Diagram: new or existing sequential code and new or existing libraries, each with insight annotations, flow through speculative optimizations, DSWP-family optimizations, and other optimizations, guided by the Complainer/Fixer and machine-specific performance primitives, to produce parallelized code.]
  13. Spec-PS-DSWP vs. the P6 SUPERSCALAR ARCHITECTURE. [Pipeline schedule: load (LD), work (W), and commit (C) operations for iterations 1–5 overlapped across Cores 1–4 over time.]
  14. Example:
        A: while (node) {
        B:   node = node->next;
        C:   res = work(node);
        D:   write(res);
           }
     [Program dependence graph over A–D, with control and data dependences, and the resulting schedule A1, B1, C1, D1, A2, … on Cores 1–3 over time.]
  15–17. Spec-DOALL applied to the example loop.
     [Schedules: speculating the dependences lets whole iterations (A, B, C, D) run in parallel across Cores 1–3; rewriting the loop header as “while (true)” removes the control dependence from the exit test. Result on 197.parser: slowdown.]
  18. Spec-DOACROSS vs. Spec-DSWP; throughput of each: 1 iter/cycle.
     [Schedules across Cores 1–3: Spec-DOACROSS rotates whole iterations (B, C, D) around the cores; Spec-DSWP pipelines stages, with B on Core 1, C on Core 2, and D on Core 3.]
  19. The latency problem: comparison of Spec-DOACROSS and Spec-DSWP.
     With communication latency 1, both achieve 1 iter/cycle. With latency 2, Spec-DOACROSS drops to 0.5 iter/cycle while Spec-DSWP still achieves 1 iter/cycle after the pipeline fill time: communication latency sits on DOACROSS’s cyclic critical path, but for DSWP it only lengthens the one-time pipeline fill.
  20. TLS vs. Spec-DSWP [MICRO 2010]: geomean of 11 benchmarks on the same cluster.
  21. Multicore Needs (recap, with the one-implementation diagram): the first need, automatic resource allocation/scheduling, speculation/commit, and pipelining, is addressed; still open: low-overhead access to programmer insight, code reuse, and intelligent automatic parallelization.
  22–24. Execution plan:
        char *memory;
        void * alloc(int size) {
          void * ptr = memory;
          memory = memory + size;
          return ptr;
        }
     [Schedule: calls alloc1–alloc6 serialize across Cores 1–3 over time.]
     Marking alloc() @Commutative lets alloc1–alloc6 run concurrently, in any order, across the cores: easily understood non-determinism!
  25. ~50 of ½ million LOCs modified in SPEC CINT2000; mods also include a non-deterministic branch. [MICRO ’07, Top Picks ’08; automatic version: PLDI ’11]
  26. Multicore Needs (recap, with the one-implementation diagram): automatic resource allocation/scheduling, speculation/commit, and pipelining; low-overhead access to programmer insight; and code reuse are addressed; intelligent automatic parallelization remains.
  27. Iterative compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05].
     [Search tree over transformation sequences (Rotate, Unroll, Sum Reduction) with measured speedups on candidate orderings ranging from 0.10× to 30.0×.]
  28–33. PS-DSWP Complainer in action.
     The Complainer reports the dependences blocking parallelization and asks “Who can help me?”:
     - Red edges: dependences between malloc() and free()
     - Blue edges: dependences between rand() calls
     - Green edges: flow dependences inside the inner loop
     - Orange edges: dependences between function calls
     Fixes arrive in turn: programmer annotations (Sum Reduction, Unroll, Rotate), a PROGRAMMER Commutative annotation, and a LIBRARY Commutative annotation.
  34. Multicore Needs (recap, with the one-implementation diagram): all four needs addressed, including automatic resource allocation/scheduling, speculation/commit, and pipelining; low-overhead access to programmer insight; code reuse; and intelligent automatic parallelization.
  35. Performance relative to best sequential: 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].
  36. Restoration of trend.
  37–40. “Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
     Compiler technology vs. architecture/devices in the era of DIY (Multicore, Reconfigurable, GPUs, Clusters): a compiler-technology-inspired class of architectures?
  41. The End
