3D Microprocessor Design
                                     Stacking at different granularities


                                              Alberto Villegas Erce

                                            Seminar on Computer Systems
                                                  Turku University


                                                       April 2010




Alberto Villegas Erce (Seminar on Computer Systems Turku University ) Design
                                                   3D Microprocessor           April 2010   1 / 29
Introduction


  Concepts review
  Previously on 3D world...


   Industry trends
   Make it faster, smaller and cuter, but do not forget the price

   3D Design
   Benefits: shorter wire length, speed increase, lower power consumption.
   Challenges: risk of defects, heat problems, design complexity.

   Through Silicon Vias (TSVs)
   Vertical electrical connection passing completely through a silicon die.
           Low power consumption
           Low latency
           Increasing integration level (10k-100k per cm²)
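
   To get a feel for the density figure above, here is a minimal sketch that converts areal density into an average TSV pitch. It assumes the TSVs sit on a uniform square grid, which the slides do not state; the function name is hypothetical.

```python
import math

def tsv_pitch_um(density_per_cm2: float) -> float:
    """Average centre-to-centre TSV spacing in micrometres.

    Assumes a uniform square grid, so each TSV owns 1/density cm^2.
    """
    pitch_cm = 1.0 / math.sqrt(density_per_cm2)
    return pitch_cm * 1e4  # 1 cm = 10,000 micrometres

if __name__ == "__main__":
    for density in (10_000, 100_000):
        print(f"{density} TSVs/cm^2 -> about {tsv_pitch_um(density):.1f} um pitch")
```

   Under that assumption, 10k per cm² corresponds to a pitch of 100 µm and 100k per cm² to roughly 32 µm.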

Introduction


  Today
  Three dimensional Puzzle


                                             How to face 3D design?
                                             2D design decomposition at different
                                             granularities.
                                                1     Entire cores, cache: add functionality
                                                      with high 2D reuse.
                                                2     Functional unit blocks: performance
                                                      improvement and power reduction.
                                                      Must re-floorplan and retime paths.
                                                3     Logic gates (block splitting): reduce
                                                      latency and power on routes at every level.
                                                      Needs new 3D circuit designs,
                                                      methodologies and layout tools.


Introduction


  Index




                                                     1   Stacking Complete Modules
                                                     2   Stacking Functional Unit Blocks
                                                     3   Splitting Functional Unit Blocks
                                                     4   Conclusions




Stacking Complete Modules


  Index




                                                     1   Stacking Complete Modules
                                                     2   Stacking Functional Unit Blocks
                                                     3   Splitting Functional Unit Blocks
                                                     4   Conclusions




Stacking Complete Modules     Idea


  Three-Dimensional Stacked Caches

   Idea
   Break & stack existing modules.




                                                                 Conventional dual-core processor
                                                                 featuring a 4MB L2 cache.
                                                                 Design options for 3D stacking
                                                                        Reduce space.
                                                                        Increase storage.




Stacking Complete Modules     Increasing storage


  L2 cache controller in 3D


                                                                     Objective
                                                                     Add more storage to the L2
                                                                     cache.
                                                                     Stacking a second silicon
                                                                     layer
                                                                               Additional 8MB of cache
                                                                               Nearly no impact on L2
                                                                               access latency
                                                                     Traditional 2D solution
                                                                               Double silicon area.
                                                                               Latency increased.


Stacking Complete Modules     Increasing storage


  L2 cache controller in 3D (cont.)

                                                                                 DRAM Solution
                                                                                      Much greater
                                                                                      storage density.
                                                                                      Greater latency
                                                                                      (50-150 cycles).
                                                                                      Reduces silicon
                                                                                      area by half.
                                                                                 Hybrid solution
                                                                                      SRAM to store
                                                                                      only the tags.
                                                                                      DRAM to store
                                                                                      the actual data.
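
   A toy latency model of why the hybrid solution keeps the tags in SRAM: a miss is detected after a fast tag check instead of a slow DRAM access. All cycle counts are illustrative assumptions (only the DRAM value is chosen from the 50-150 cycle range above), and the model assumes the tag check and the data access happen one after the other.

```python
def access_cycles(hit: bool, tags_in_sram: bool,
                  sram_tag: int = 10,   # assumed SRAM tag-check latency
                  dram: int = 100,      # assumed DRAM access latency (within 50-150)
                  memory: int = 300) -> int:  # assumed main-memory latency
    """Cycles for one L2 access in a toy sequential tag-then-data model."""
    # The tag check costs an SRAM lookup or a full DRAM access,
    # depending on where the tags are stored.
    tag_check = sram_tag if tags_in_sram else dram
    if hit:
        return tag_check + dram      # data always lives in DRAM
    return tag_check + memory        # miss: go to main memory

if __name__ == "__main__":
    print("hit,  SRAM tags:", access_cycles(True, True))
    print("miss, SRAM tags:", access_cycles(False, True))
    print("miss, DRAM tags:", access_cycles(False, False))
```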


Stacking Complete Modules     Increasing storage


  L2 cache controller in 3D (testing)
   Three test programs:
    Program A : small working set that fits in the 4MB SRAM cache.
    Program B : larger working set that does not fit in the 4MB SRAM but does fit
               within the 32MB DRAM cache.
    Program C : streaming memory access patterns. Poor cache hit rate for
               both configurations.
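
   The three behaviours can be reproduced with a small LRU cache simulation. This is a sketch with scaled-down, made-up sizes (4 and 32 cache lines stand in for the 4MB SRAM and 32MB DRAM caches) and synthetic traces; it is not the benchmark setup behind the slides.

```python
from collections import OrderedDict

def hit_rate(trace, capacity_lines):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for line in trace:
        total += 1
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict least recently used
    return hits / total

def looping(working_set, passes):
    """Trace that walks the same working set repeatedly."""
    return [i for _ in range(passes) for i in range(working_set)]

SMALL, LARGE = 4, 32            # hypothetical stand-ins for 4MB and 32MB
program_a = looping(2, 10)      # working set fits in SMALL
program_b = looping(16, 10)     # fits in LARGE only
program_c = list(range(1000))   # streaming: no line is ever reused

if __name__ == "__main__":
    for name, trace in (("A", program_a), ("B", program_b), ("C", program_c)):
        print(name, hit_rate(trace, SMALL), hit_rate(trace, LARGE))
```

   Program A hits equally well in both caches, Program B only benefits from the larger one, and Program C misses everywhere.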




Stacking Complete Modules     3D optionality


  3D Integration
  ... for everyone?



                                     3D Integration:
                                            Increase silicon required for the chip (layers)
                                            =⇒ Increase manufacturing cost
                                            Extra manufacturing steps for bonding.
                                            Impact on yield rates.

                                                       3D is not the general answer!

      A better use of 3D stacking is as a means to optionally augment the processor
                     with some additional functionality.



Stacking Complete Modules     3D optionality


  Introspective 3D Processors
   Objective
   Access to more dynamic information about the internal state of a
   microprocessor.




Stacking Complete Modules     3D optionality


  Reliable 3D Processors

   Problem
   The small transistor sizes of modern processors make them vulnerable to data corruption.

      Solutions
              Redundancy: two/three copies of the
              processor operating in lock-step =⇒
              multiple pipelines increase cost.
              Leading execution/trailing checking
              cores: trailing core re-executes
              instructions (not in lock-step) =⇒ the
              additional pipeline still increases area.

      Stack it!
              Extra wires eliminated.
              Optional checker core.
              Unutilized silicon area.


Stacking Functional Unit Blocks


  Index




                                                     1   Stacking Complete Modules
                                                     2   Stacking Functional Unit Blocks
                                                     3   Splitting Functional Unit Blocks
                                                     4   Conclusions




Stacking Functional Unit Blocks   Introduction


  Stacking Functional Unit Blocks

                                                 Nowadays
                                                 Early stage of development for these
                                                 technologies.
                                                 3D integration will require
                                                        Design automation tools.
                                                        Layout support.
                                                        Verification and validation
                                                        methodologies.

                                                 Future
                                                 Reorganize the processor pipeline in new
                                                 ways.

Stacking Functional Unit Blocks    Removing wires


  Removing Wires
  Pentium III & IV branch misprediction

   Problem
   Wire delays have not improved as fast as transistor speed.



                                                    PIII branch misprediction




                                                    PIV branch misprediction



   Solution
   3D implementation: blocks that were far apart are now stacked vertically on
   top of each other.
Stacking Functional Unit Blocks   Removing wires


  Removing Wires
  Alpha 21264


   Problem
   A superscalar processor with multiple execution units (EUs) requires a bypass
   network to forward results between all of the EUs =⇒ wiring.

      2D Solution
      Divide the EUs into two groups or
      clusters, each with its own bypass
      network, with communication between the clusters.



      3D Solution
               Stack the clusters.
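
   Why clustering cuts wiring can be seen by counting forwarding paths. A minimal sketch, assuming every execution unit forwards to every unit in its own cluster (itself included) and ignoring the inter-cluster links; the unit count of 8 is illustrative, not the Alpha 21264's.

```python
def bypass_paths(num_units: int, clusters: int = 1) -> int:
    """Number of intra-cluster producer-to-consumer forwarding paths."""
    if clusters < 1 or num_units % clusters != 0:
        raise ValueError("units must divide evenly into clusters")
    per_cluster = num_units // clusters
    # Each cluster needs a full per_cluster x per_cluster bypass network.
    return clusters * per_cluster * per_cluster

if __name__ == "__main__":
    print("one network :", bypass_paths(8))      # full bypass
    print("two clusters:", bypass_paths(8, 2))   # half the intra-cluster paths
```

   Stacking the two clusters does not change this count; it shortens the remaining inter-cluster connections.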


Stacking Functional Unit Blocks   Trade-offs


  Removing Wires
  Trade-offs




          Pros
                  Opportunities to optimize the
                  processor pipeline.
                  Physical reduction of the
                  amount of wiring.

          Cons
                  Non-trivial engineering effort.
                          Modify pipeline.
                          Verify and validate new design.
                  Additional costs.




Stacking Functional Unit Blocks   TSV Reality


  Removing Wires
  TSV Reality

   Problem
   After stacking two blocks there may not be enough room for placing TSVs.




   Solution
   Different layouts of the TSVs.
   Wire overhead reintroduction
           Reintroduced wires do not completely cancel the 3D wire reduction benefits.
Splitting Functional Unit Blocks


  Index




                                                     1    Stacking Complete Modules
                                                     2    Stacking Functional Unit Blocks
                                                     3    Splitting Functional Unit Blocks
                                                     4    Conclusions




Splitting Functional Unit Blocks   Introduction


  Splitting Functional Unit Blocks

                                                                  Last level
                                                                  Logic gates
                                                                         Split individual functional units
                                                                         across multiple layers.
                                                                         Reorganize the functional unit
                                                                         block =⇒ more compact 3D
                                                                         arrangement.

                                                                  Benefits
                                                                      Reduce length of intra-block
                                                                      wiring.
                                                                         Improve operating frequencies.

                          We will introduce a starting point for thinking about it.
Splitting Functional Unit Blocks   3D Cache Organizations


  3D Cache Organizations
  First view



                                                 Problem
                                                 L2 cache consumes about half of the overall
                                                 die area.

                                                         Worst case routing distance: 2x+4y

                                             Two stacking possibilities.

    Banks on cores
            Half space.
            Access distance equal.

    Banks on banks
            Half space.
            Access distance reduced.


Splitting Functional Unit Blocks   Splitting the cache


  3D Splitting the cache


      Problem
      Wires within each bank also impact overall
      latency.


                       Split individual cache banks across multiple layers.


       Columns on
       columns
               Best
               latency.
                                                                        Rows on rows
                                                                                Energy reduction.
Splitting Functional Unit Blocks   Splitting the cache


  3D Splitting cache
  Testing


   Experimental results
           SPICE simulation.
           Column on column organization.
           SRAM implementations in 65-nm process.




Splitting Functional Unit Blocks   3D Adders


  3D Adders
  Classic Carry Look-ahead Adder




      Carry Look-ahead Adder
              n = 16 bits
              Critical path runs from bit[0] to bit[n-1]


   Several ways to split the adder

 Based on inputs
         x on bottom layer; y on top layer.
         First level of the propagate logic is split across layers.
         Half the wire length.

 By significance
         Least significant bits on bottom layer;
         most significant bits on top layer.
         TSV between root nodes.
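
   For reference, a sketch of the adder being split. It uses the parallel-prefix (Kogge-Stone style) form of carry look-ahead, chosen here for brevity and not necessarily the exact tree on the slide. The second function mimics the "by significance" split: two 8-bit halves with a single carry crossing between them, standing in for the TSV between root nodes. All names are hypothetical.

```python
def cla_add(x: int, y: int, n: int = 16, carry_in: int = 0):
    """n-bit carry look-ahead addition; returns (sum, carry_out)."""
    mask = (1 << n) - 1
    p = (x ^ y) & mask            # per-bit propagate
    g = (x & y) & mask            # per-bit generate
    g |= p & carry_in             # fold the carry-in into bit 0
    G, P = g, p
    shift = 1
    while shift < n:              # log2(n) prefix levels
        G |= P & (G << shift)     # group generate over a doubled span
        P &= P << shift           # group propagate over a doubled span
        G &= mask
        P &= mask
        shift <<= 1
    carries = ((G << 1) | carry_in) & mask   # carry into each bit position
    return p ^ carries, (G >> (n - 1)) & 1

def add_split_by_significance(x: int, y: int):
    """16-bit add as two 8-bit halves; only one carry crosses the layers."""
    low, carry = cla_add(x & 0xFF, y & 0xFF, n=8)                       # bottom layer
    high, carry_out = cla_add((x >> 8) & 0xFF, (y >> 8) & 0xFF, n=8,
                              carry_in=carry)                           # top layer
    return (high << 8) | low, carry_out

if __name__ == "__main__":
    print(cla_add(1234, 4321))
    print(add_split_by_significance(0xFFFF, 1))
```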




Conclusions


  Index




                                                     1   Stacking Complete Modules
                                                     2   Stacking Functional Unit Blocks
                                                     3   Splitting Functional Unit Blocks
                                                     4   Conclusions




Conclusions


  Conclusions

                                                                         Benefits of organizing
                                                                         components in 3D
                                                                             Can significantly reduce
                                                                             wire lengths.
                                                                               Devices from different
                                                                               technologies can be
                                                                               tightly integrated and
                                                                               combined.

                                                                         3D organizations may be
                                                                         required depending on the
                                                                         exact design constraints and
                                                                         objectives.

Conclusions


  Conclusions


                                                                         Cons
                                                                             Finer granularity ⇒
                                                                             more redesign.
                                                                               Stacking can increase
                                                                               heat.
                                                                               Long road of
                                                                               technological
                                                                               development ahead.

                                                                         Every redesign process leads
                                                                         to a cost increase.


References


  References


          Three-Dimensional Microprocessor Design
          Gabriel H. Loh
          Springer Science 2010
          A Modular 3D Processor for Flexible Product Design and Technology
          Migration
          Gabriel H. Loh
          ACM 2008
          Die-stacking (3D) microarchitecture
          B. Black.
          International Symposium on Microarchitecture, pp. 469-479, 2006



The end     Questions




                    Thank you.
                     Questions?
                                                     Please be nice




