“Bobcat”                              AMD’s New Low Power x86 Core Architecture                              Brad Burgess,...
Two x86 Cores Tuned for Target Markets                       “Bulldozer”                        Performance &             ...
Bobcat Design Goals A small, efficient, low power  x86 core Excellent performance Synthesizable with small  number of c...
Feature Set 64-bit AMD64 x86 ISA SIMD extensions: SSE1, SSE2,  SSE3, SSSE3, SSE4A Virtualization Support for misaligne...
Micro-architecture Overview Dual x86 instruction decode Out-of-Order instruction execution Dual COP retirement Complex...
Bobcat                                ITLB                        32KB                            Branch PredictorMicro-Ar...
Bobcat                                  ITLB                        32KB                            Branch PredictorMicro-...
Bobcat                                   ITLB                        32KB                            Branch PredictorMicro...
Bobcat                                  ITLB                        32KB                            Branch PredictorMicro-...
Bobcat                                   ITLB                        32KB                            Branch PredictorMicro...
Bobcat                                   ITLB                        32KB                            Branch PredictorMicro...
Bobcat                                  ITLB                        32KB                            Branch PredictorMicro-...
Bobcat                                   ITLB                        32KB                            Branch PredictorMicro...
Bobcat                                  ITLB                        32KB                            Branch PredictorMicro-...
Bobcat                                   ITLB                        32KB                            Branch PredictorMicro...
Bobcat Pipeline  0             1           2             3           4      5            6    7         8         9       ...
Core Floor Plan                                                  Floating Point Unit                                      ...
Power Reduction Use of physical Register files Extensive use of non-shifting queues with  pointers Fine grain clock gat...
Bobcat Core OverviewAdvanced Micro-architecture   Dual x86 Decode                                                      IC...
Summary Estimated 90% of the performance of today’s  mainstream notebook CPU in half the area* Sub-one watt capable Hig...
Upcoming SlideShare
Loading in …5
×

Bobcat hotchips final 8 2 10

648 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
648
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
15
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Bobcat hotchips final 8 2 10

  1. 1. “Bobcat” AMD’s New Low Power x86 Core Architecture Brad Burgess, AMD Fellow Chief Architect / Bobcat Core August 24, 20101 | Bobcat | Hot Chips 2010
  2. 2. Two x86 Cores Tuned for Target Markets “Bulldozer” Performance & Scalability Mainstream Client and Server Markets Low Power Small Cloud Clients “Bobcat” Markets Die Area Optimized Flexible, Low Power & Small 2 | Bobcat | Hot Chips 2010
  3. 3. Bobcat Design Goals A small, efficient, low power x86 core Excellent performance Synthesizable with small number of custom arrays Easily Portable across process technologies 3 | Bobcat | Hot Chips 2010
  4. 4. Feature Set 64-bit AMD64 x86 ISA SIMD extensions: SSE1, SSE2, SSE3, SSSE3, SSE4A Virtualization Support for misaligned 128-bit data types Instruction Based Sampling (for dynamic optimization) C6 (with integrated power gating) 4 | Bobcat | Hot Chips 2010
  5. 5. Micro-architecture Overview Dual x86 instruction decode Out-of-Order instruction execution Dual COP retirement Complex microOPs State of the art branch prediction Aggressive OOO load/store engine w/ hazard prediction Advanced Virtualization w/ nested page tables, ASIDs and world switch acceleration Low power C6 state w/ core level power gating and state save acceleration 5 | Bobcat | Hot Chips 2010
  6. 6. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch Queue uCode Dual x86 Decoder Instr Queue FP Decode ROB Int Rename FP Rename FP Sched Scheduler Scheduler FP PRF Int PRF ALU ALU LAGU SAGU MMX Alu MMX Alu Table Walker Mul IntMul St Conv 32KB LdSt FP Logical FP Logical DTLB DCACHE Unit FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 6 | Bobcat | Hot Chips 2010
  7. 7. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueIcache uCode Dual x86 Decoder 32Kbyte 2-way set associative Instr Queue FP Decode ROB 64-byte line Int Rename FP Rename Parity Protected FP Sched 512/8 entry ITLB Scheduler Scheduler (4k/2m) FP PRF Int PRF Fetch up to MMX Alu MMX Alu ALU ALU LAGU SAGU 32-bytes/cycle Table Walker Mul IntMul St Conv 32KB LdSt FP Logical FP Logical DTLB DCACHE Unit FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 7 | Bobcat | Hot Chips 2010
  8. 8. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueBranch Predictor: uCode Dual x86 Decoder Predicts up to two branches per cycle Instr Queue FP Decode Remembers branch ROB instruction locations Int Rename FP Rename Return Stack Address FP Sched Predictor Scheduler Scheduler FP PRF Indirect Dynamic Int PRF Address Predictor MMX Alu MMX Alu ALU ALU LAGU SAGU State of the Art Mul Table Walker IntMul St Conv condition Predictor Only necessary DTLB 32KB LdSt FP Logical FP Logical DCACHE Unit structures are clocked FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 8 | Bobcat | Hot Chips 2010
  9. 9. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueDual x86 Decoder: uCode Dual x86 Decoder Scans up to 22 bytes Decodes up to two x86 Instr Queue FP Decode instructions per cycle ROB Int Rename FP Rename The decoder can directly map 89% of x86 FP Sched instructions to a single Scheduler Scheduler microOp, an additional Int PRF FP PRF 10% to a pair of MMX Alu MMX Alu microOps, and more ALU ALU LAGU SAGU complicated x86 Mul IntMul St Conv Table Walker instructions (<1%) are microcoded. (Dynamic DTLB 32KB LdSt FP Logical FP Logical Instruction Counts) DCACHE Unit FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 9 | Bobcat | Hot Chips 2010
  10. 10. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueInteger Execution: uCode Dual x86 Decoder A dual port integer scheduler feeds two ALUs Instr Queue FP Decode A dual port address ROB scheduler feeds a load Int Rename FP Rename address unit, and a store address unit. Scheduler Scheduler FP Sched Physical Register File uses Int PRF FP PRF maps and pointers to MMX Alu MMX Alu reduce power by ALU ALU LAGU SAGU minimizing data Mul IntMul St Conv Table Walker copying/movement. 32KB LdSt FP Logical FP Logical DTLB DCACHE Unit FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 10 | Bobcat | Hot Chips 2010
  11. 11. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueFloating Point Unit: uCode Dual x86 Decoder A centralized FP scheduler feeds two 64-bit FP Instr Queue execution stacks FP Decode ROB MMX and Logical units are Int Rename FP Rename replicated in both stacks FP Sched The FP Mul Unit can Scheduler Scheduler perform two SP multiplies Int PRF FP PRF per cycle ALU ALU LAGU SAGU MMX Alu MMX Alu The FP Add Unit can perform two SP additions Table Walker Mul IntMul St Conv per cycle 32KB LdSt FP Logical FP Logical DTLB A physical register file is DCACHE Unit used to reduce power Prefetch FPAdd FPMul 512KB BU L2CACHE To/from Northbridge 11 | Bobcat | Hot Chips 2010
  12. 12. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueData Cache: uCode Dual x86 Decoder 32-Kbyte 8-way set associative Instr Queue FP Decode ROB 64-byte line Int Rename FP Rename Parity Protected FP Sched Copyback Scheduler Scheduler 40/8 entry L1DTLB Int PRF FP PRF (4k/2m) MMX Alu MMX Alu ALU ALU LAGU SAGU 512/64 entry L2DTLB Mul IntMul St Conv (4k/2m) Table Walker Advanced 8-stream DTLB 32KB LdSt FP Logical FP Logical prefetcher DCACHE Unit FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 12 | Bobcat | Hot Chips 2010
  13. 13. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueOut-of-Order Load Store Unit: uCode Dual x86 Decoder Loads bypassing loads Instr Queue FP Decode Loads bypassing stores ROB Int Rename FP Rename Stores bypassing loads Bypass tracking and Scheduler Scheduler FP Sched dependency correction FP PRF Int PRF Hazard predictor ALU ALU LAGU SAGU MMX Alu MMX Alu Fast store forwarding Table Walker Mul IntMul St Conv Fast critical word fill forwarding 32KB LdSt FP Logical FP Logical DTLB DCACHE Unit FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 13 | Bobcat | Hot Chips 2010
  14. 14. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueL2 Cache: uCode Dual x86 Decoder 512Kbyte 16-way set associative Instr Queue FP Decode ROB 64 byte lines Int Rename FP Rename ECC Protected FP Sched Half speed clocking for Scheduler Scheduler power reduction FP PRF Int PRF ALU ALU LAGU SAGU MMX Alu MMX Alu Table Walker Mul IntMul St Conv 32KB LdSt FP Logical FP Logical DTLB DCACHE Unit FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 14 | Bobcat | Hot Chips 2010
  15. 15. Bobcat ITLB 32KB Branch PredictorMicro-Architecture ICACHE Branch Locator ConditionPredict or Return Stack Dynamic Target Fetch QueueBus Unit: uCode Dual x86 Decoder 8-outstanding data accesses Instr Queue FP Decode 2-outstanding fetch ROB accesses Int Rename FP Rename Eviction Buffers FP Sched Scheduler Scheduler Fill Buffers FP PRF Int PRF Write combining buffers ALU ALU LAGU SAGU MMX Alu MMX Alu Coherency management Table Walker Mul IntMul St Conv 32KB LdSt FP Logical FP Logical DTLB DCACHE Unit FPAdd FPMul Prefetch 512KB BU L2CACHE To/from Northbridge 15 | Bobcat | Hot Chips 2010
  16. 16. Bobcat Pipeline 0 1 2 3 4 5 6 7 8 9 10 11 12Fetch0 Fetch1 Fetch2 Fetch3 Fetch4 Fetch5 uCode MDec Branch Mispredict Latency ROM 13-cycles Dec0 Dec1 Dec2 Pack FDec Dispatch Schedule RegRead ALU WritebackTransit FpDec RegRen Schedule RegRead EXE Writeback AGU DC1 DC2 EXE EXE Load Use Latency L1 hit: 3-cyclesTransit L2Tag L2Data Load Use Latency L2 hit: 17-cycles 16 | Bobcat | Hot Chips 2010
  17. 17. Core Floor Plan Floating Point Unit Test/Debug Data L2 TLB X86 Decode Bus Unit Instruction Cache L2 Sub Array Inst TLB/Tag L2 TAG Branch PredictUcode ROM ROB Data Cache Integer Unit Data Tag/TLB Load Store Unit 17 | Bobcat | Hot Chips 2010
  18. 18. Power Reduction Use of physical Register files Extensive use of non-shifting queues with pointers Fine grain clock gating Integrated Core Power Gating Only needed arrays are clocked – i.e. Dtag hit before Dcache read – Predicting the type of branch then clocking the appropriate predictor(s) Elimination of instruction marker bits in the Icache Finding the knee of the curve (scrutinize performance gains against power costs) Polishing speed paths to raise the Vt mix and reduce leakage 18 | Bobcat | Hot Chips 2010
  19. 19. Bobcat Core OverviewAdvanced Micro-architecture Dual x86 Decode ICACHE Advanced Branch Predictor Bobcat L2 Full OOO instruction execution Low Fetch Full OOO load/store engine Power High Performance Floating Point Core AMD64 64-bit ISA Decode BU SSE1,2,3, SSSE3 ISA Secure Virtualization 32kb L1s, 512kb L2Low Power Design Integer Address FP Power Optimized Execution Scheduler Scheduler Scheduler Micro-architecture that minimizes data movement and unnecessary reads I I Load Store A M Clock gating, Power gating Pipe Pipe Pipe Pipe Pipe Pipe System Low Power StatesSmall Core DCACHE Area efficient balance of high performance and low power 19 | Bobcat | Hot Chips 2010
  20. 20. Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half the area* Sub-one watt capable Highly portable across designs and manufacturing technologies 20 | Bobcat | Hot Chips 2010 *Based on internal AMD modeling using benchmark simulations

×