Hipeac
  • One approach to overcome the problem is to use speculative execution. A technique called Speculative DOALL <CLICK> boxes up the statements according to the iterations to which they belong; <CLICK> thus removing edges which do not cross the iteration boundary.<CLICK>
  • It then schedules multiple iterations to execute in parallel. To fit the loop into this RIGID EXECUTION MODEL, Spec-DOALL needs to break edges that go from iteration to iteration. To do this, Spec-DOALL first speculates that the loop will iterate many times. <CLICK>
  • Reasonable assumption. <CLICK> This basically transforms statement A into <CLICK> “while (true)”. <CLICK> Let’s get rid of these A blocks. Describe the reason for frequent conflicts with Spec-DOALL. Slowdown.
  • Spec-DOACROSS <CLICK> Spec-DSWP. The primary difference between these techniques is the way these red edges are handled. These edges together constitute the critical path of the loop: work on an iteration cannot be started until that edge is satisfied. Spec-DOACROSS communicates values back and forth between cores to respect these edges; in contrast, Speculative Pipelining respects these edges by sequentially executing the statements participating in the edges in a single thread, while communicating the dependences off the critical path in a uni-directional pipelined fashion. <CLICK>
  • When the comm. latency between cores is 1 cycle, both techniques yield the same throughput. <CLICK>But when the comm. latency is doubled, <CLICK>Spec-DOACROSS is able to initiate a new iteration only once every two cycles because the inter-core latency is on the critical path of execution. In contrast, all that changes in the case of Spec-Pipelining <CLICK> is that the pipeline fill time doubles. <CLICK> The throughput still remains 1 iter/cycle.
  • Also in this paper is a description of our work in making TLS suitable for execution on clusters. We created this version of TLS for a strong point of comparison to Spec-DSWP. And this graph shows the GEOMEAN performance speedup of TLS and Spec-DSWP. The results are as you would expect. TLS scales only to about 15x, getting 90% of that gain at 64 cores. However, Spec-DSWP scales to 49x, with even the last additional core posting reasonable gains. Again, we think Spec-DSWP will do better with improved machine bandwidth. On the other hand, TLS will probably not improve much in the future due to fundamental physical limits on communication latency.
  • This graph shows performance speedup relative to the best optimized sequential performance. The Y axis shows speedup, and the X axis shows the total number of nodes used to achieve the speedup; each node has 4 cores. Overall we see a 49x GEOMEAN speedup on these applications running on a 128-core, 32-node Xeon cluster. Again, the speedup is relative to the best optimized sequential performance, not the parallel version set to execute only one thread. Through a combination of optimizations described in the paper, we achieve scalable speedup for these originally sequential programs.
  • Transcript

    • 1. A Roadmap to Restoring Computing's Former Glory
      David I. August
      Princeton University
      (Not speaking for Parakinetics, Inc.)
    • 2. Era of DIY:
      10 Cores!
      10-Core Intel Xeon
      “Unparalleled Performance”
      Golden era of computer architecture
      ~ 3 years behind
      [Chart: SPEC CINT performance (log scale) by year, 1992–2012, spanning CPU92, CPU95, CPU2000, and CPU2006]
    • 6. P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994)
      Automatic
      Speculation
      Automatic
      Pipelining
      Commit
      Automatic
      Allocation/Scheduling
      Parallel Resources
    • 7. Multicore Architecture (Circa 2010)
      Automatic
      Speculation
      Automatic
      Pipelining
      Commit
      Automatic
      Allocation/Scheduling
      Parallel Resources
    • 8.
    • 9. Parallel Library Calls
      [Charts: threads over time for parallel library calls vs. realizable parallelism. Credit: Jack Dongarra]
    • 10. “Compiler Advances Double Computing Power Every 18 Years!”
      – Proebsting’s Law
    • 11. Multicore Needs:
      Automatic resource allocation/scheduling, speculation/commit, and pipelining.
      Low overhead access to programmer insight.
      Code reuse. Ideally, this includes support of legacy codes as well as new codes.
      Intelligent automatic parallelization.
      Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization
      A Roadmap to restoring computing’s former glory.
      Parallel Programming
      Computer Architecture
      Parallel Libraries
      Automatic Parallelization
    • 12. Multicore Needs:
      Automatic resource allocation/scheduling, speculation/commit, and pipelining.
      Low overhead access to programmer insight.
      Code reuse. Ideally, this includes support of legacy codes as well as new codes.
      Intelligent automatic parallelization.
      One Implementation
      Machine Specific
      Performance Primitives
      Speculative
      Optis
      New or Existing
      Sequential Code
      Insight
      Annotation
      Other
      Optis
      DSWP Family
      Optis
      Parallelized Code
      New or Existing
      Libraries
      Insight
      Annotation
      Complainer/Fixer
    • 13. Spec-PS-DSWP
      P6 SUPERSCALAR ARCHITECTURE
      [Diagram: pipelined schedule on Cores 1–4 over cycles 0–5; load (LD), work (W), and commit (C) stages of iterations 1–5 overlap across cores]
    • 14. PDG
      Example
      A: while (node) {
      B: node = node->next;
      C: res = work(node);
      D: write(res);
      }
      Program Dependence Graph
      [Diagram: PDG of statements A–D with control dependence and data dependence edges, and a schedule of iterations 1–2 on Cores 1–3 over time]
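The dependence structure on this slide can be written down as data. The following C sketch is illustrative only: the statement letters follow the slide, but the edge list is this sketch's own reading of the diagram, and the type and helper names are invented for it.

```c
#include <stddef.h>

/* One edge of the example loop's Program Dependence Graph. */
typedef struct { char src, dst; int loop_carried; } Dep;

static const Dep pdg[] = {
    {'A', 'B', 0}, {'A', 'C', 0}, {'A', 'D', 0}, /* control deps from "while (node)" */
    {'B', 'C', 0},                               /* node feeds work(node)            */
    {'C', 'D', 0},                               /* res feeds write(res)             */
    {'B', 'B', 1},                               /* node = node->next, next iteration */
    {'B', 'A', 1},                               /* ...and the next loop test         */
};

/* Spec-DOALL "boxes up" each iteration: intra-iteration edges disappear
   inside the boxes, leaving only loop-carried edges to speculate. */
int count_loop_carried(void) {
    int n = 0;
    for (size_t i = 0; i < sizeof pdg / sizeof pdg[0]; i++)
        if (pdg[i].loop_carried) n++;
    return n;
}

int count_intra_iteration(void) {
    int n = 0;
    for (size_t i = 0; i < sizeof pdg / sizeof pdg[0]; i++)
        if (!pdg[i].loop_carried) n++;
    return n;
}
```

In this encoding only the two edges rooted at B's pointer update cross an iteration boundary; the other five stay within one iteration's box.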
    • 15. Spec-DOALL
      Example
      A: while (node) {
      B: node = node->next;
      C: res = work(node);
      D: write(res);
      }
      Program Dependence Graph
      [Diagram: the same PDG and schedule, introducing Spec-DOALL]
    • 16. Spec-DOALL
      Example
      A: while (node) {
      B: node = node->next;
      C: res = work(node);
      D: write(res);
      }
      Program Dependence Graph
      [Diagram: with intra-iteration edges boxed up, iterations 1–3 of B, C, and D execute in parallel on Cores 1–3]
    • 17. Spec-DOALL Performance
      Example
      A: while (node) {
      while (true) {
      B: node = node->next;
      C: res = work(node);
      D: write(res);
      }
      Program Dependence Graph
      [Diagram: speculating the loop exit turns statement A into “while (true)”; schedule of iterations across Cores 1–3]
      [Chart: 197.parser shows a slowdown under Spec-DOALL]
    • 18. Spec-DOACROSS vs. Spec-DSWP
      Spec-DOACROSS (Throughput: 1 iter/cycle)
      Spec-DSWP (Throughput: 1 iter/cycle)
      [Diagrams: schedules of the B, C, and D stages of iterations 1–7 on Cores 1–3 over time; Spec-DOACROSS rotates whole iterations across cores, Spec-DSWP pins each stage to one core]
    • 19. Latency Problem
      Comparison: Spec-DOACROSS and Spec-DSWP
      Spec-DOACROSS: Comm. Latency = 1: 1 iter/cycle; Comm. Latency = 2: 0.5 iter/cycle
      Spec-DSWP: Comm. Latency = 1: 1 iter/cycle; Comm. Latency = 2: 1 iter/cycle (only the pipeline fill time doubles)
      [Diagrams: schedules of iterations 1–7 on Cores 1–3 over time with doubled communication latency]
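The throughput contrast on this slide can be sketched as a toy cost model in C. This is an illustration only, not the paper's model: the assumptions (each of the three stages takes one cycle, cross-core communication takes `lat` cycles) and the function names are this sketch's own.

```c
/* Toy cost model: `iters` loop iterations, `stages` one-cycle pipeline
   stages, `lat` cycles of inter-core communication latency. */

long doacross_cycles(long iters, long stages, long lat) {
    /* In Spec-DOACROSS the loop-carried dependence hops between cores on
       every iteration, so a new iteration can begin only every `lat`
       cycles: the inter-core latency sits on the critical path. */
    long last_start = (iters - 1) * lat;
    return last_start + stages + (stages - 1) * lat; /* last iter drains */
}

long dswp_cycles(long iters, long stages, long lat) {
    /* In Spec-DSWP the loop-carried dependence stays on one core; values
       cross cores only off the critical path, in one direction, so extra
       latency only lengthens the pipeline fill time. */
    long fill = (stages - 1) * (1 + lat);
    return fill + iters; /* afterwards one iteration completes per cycle */
}
```

With `lat = 1` the two models finish 1000 iterations in the same time; doubling `lat` roughly doubles Spec-DOACROSS's total time (0.5 iter/cycle) while Spec-DSWP stays near 1 iter/cycle, matching the slide.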
    • 20. TLS vs. Spec-DSWP [MICRO 2010]: Geomean of 11 benchmarks on the same cluster
    • 21. Multicore Needs:
      Automatic resource allocation/scheduling, speculation/commit, and pipelining. 
      Low overhead access to programmer insight.
      Code reuse. Ideally, this includes support of legacy codes as well as new codes.
      Intelligent automatic parallelization.
      One Implementation
      Machine Specific
      Performance Primitives
      Speculative
      Optis
      New or Existing
      Sequential Code
      Insight
      Annotation
      Other
      Optis
      DSWP Family
      Optis
      Parallelized Code
      New or Existing
      Libraries
      Insight
      Annotation
      Complainer/Fixer
    • 22. Execution Plan
      char *memory;
      void * alloc(int size);
      void * alloc(int size) {
      void * ptr = memory;
      memory = memory + size;
      return ptr;
      }
      [Diagram: calls alloc1–alloc6 serialized across Cores 1–3 over time]
    • 23. Execution Plan
      char *memory;
      void * alloc(int size);
      @Commutative
      void * alloc(int size) {
      void * ptr = memory;
      memory = memory + size;
      return ptr;
      }
      [Diagram: the @Commutative annotation is added to alloc(); calls alloc1–alloc6 across Cores 1–3 over time]
    • 24. Execution Plan
      char *memory;
      void * alloc(int size);
      @Commutative
      void * alloc(int size) {
      void * ptr = memory;
      memory = memory + size;
      return ptr;
      }
      Easily Understood Non-Determinism!
      [Diagram: alloc1–alloc6 now execute in parallel across Cores 1–3, in a non-deterministic order]
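The property the @Commutative annotation claims for this allocator can be checked directly: two calls may run in either order, and each caller still receives a private region; only which region it gets is non-deterministic. A minimal C sketch, assuming the slide's bump-pointer allocator (the `pool` array, `alloc_reset`, and `allocations_commute` helpers are this sketch's own additions):

```c
#include <stddef.h>

static char pool[1024];
static char *memory = pool; /* bump pointer, as in the slide's alloc() */

/* The slide's allocator, unchanged. */
void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

static void alloc_reset(void) { memory = pool; } /* demonstration helper only */

/* Run two allocations in both orders and check that the outcomes are
   interchangeable: same total footprint and disjoint regions either way;
   only the particular addresses handed out differ. */
int allocations_commute(void) {
    alloc_reset();
    char *a1 = alloc(16), *b1 = alloc(32);
    long used_ab = memory - pool;

    alloc_reset();
    char *b2 = alloc(32), *a2 = alloc(16);
    long used_ba = memory - pool;

    return used_ab == 48 && used_ba == 48 && a1 != b1 && a2 != b2;
}
```

This is what makes the non-determinism "easily understood": the flow dependence through `memory` is real, but any interleaving of calls yields an equally valid allocation.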
    • 25. ~50 of ½ Million LOCs modified in SpecINT 2000
      Mods also include Non-Deterministic Branch
      [MICRO ‘07, Top Picks ’08; Automatic: PLDI ‘11]
    • 26. Multicore Needs:
      Automatic resource allocation/scheduling, speculation/commit, and pipelining. 
      Low overhead access to programmer insight. 
      Code reuse. Ideally, this includes support of legacy codes as well as new codes. 
      Intelligent automatic parallelization.
      One Implementation
      Machine Specific
      Performance Primitives
      Speculative
      Optis
      New or Existing
      Sequential Code
      Insight
      Annotation
      Other
      Optis
      DSWP Family
      Optis
      Parallelized Code
      New or Existing
      Libraries
      Insight
      Annotation
      Complainer/Fixer
    • 27. Iterative Compilation
      [Cooper ‘05; Almagor ‘04; Triantafyllis ’05]
      [Diagram: search over transformation sequences (Rotate, Unroll, Sum Reduction) yielding speedups ranging from 0.10X and 0.8X up to 1.5X and 30.0X]
    • 28. PS-DSWP
      Complainer
    • 29. PS-DSWP
      Complainer
      Who can help me?
      Programmer
      Annotation
      Sum
      Reduction
      Unroll
      Rotate
      Red Edges: Deps between malloc() & free()
      Blue Edges: Deps between rand() calls
      Green Edges: Flow Deps inside Inner Loop
      Orange Edges: Deps between function calls
    • 30. PS-DSWP
      Complainer
      Sum
      Reduction
    • 31. PS-DSWP
      Complainer
      Sum
      Reduction
      PROGRAMMER
      Commutative
    • 32. PS-DSWP
      Complainer
      Sum
      Reduction
      PROGRAMMER
      Commutative
      LIBRARY
      Commutative
    • 33. PS-DSWP
      Complainer
      Sum
      Reduction
      PROGRAMMER
      Commutative
      LIBRARY
      Commutative
    • 34. Multicore Needs:
      Automatic resource allocation/scheduling, speculation/commit, and pipelining. 
      Low overhead access to programmer insight. 
      Code reuse. Ideally, this includes support of legacy codes as well as new codes. 
      Intelligent automatic parallelization. 
      One Implementation
      Machine Specific
      Performance Primitives
      Speculative
      Optis
      New or Existing
      Sequential Code
      Insight
      Annotation
      Other
      Optis
      DSWP Family
      Optis
      Parallelized Code
      New or Existing
      Libraries
      Insight
      Annotation
      Complainer/Fixer
    • 35. Performance relative to Best Sequential, 128 Cores in 32 Nodes with Intel Xeon Processors [MICRO 2010]
    • 36. Restoration of Trend
    • 37. “Compiler Advances Double Computing Power Every 18 Years!”
      – Proebsting’s Law
      Compiler Technology
      Architecture/Devices
      Era of DIY:
      Compiler technology inspired class of architectures?
    • 41. The End
