Bobcat hotchips final 8 2 10

“Bobcat”
AMD’s New Low Power x86 Core Architecture

Brad Burgess, AMD Fellow
Chief Architect / Bobcat Core

August 24, 2010

1 | Bobcat | Hot Chips 2010

Two x86 Cores Tuned for Target Markets

“Bulldozer”
 Performance & Scalability
 Mainstream Client and Server Markets

“Bobcat”
 Flexible, Low Power & Small
 Low Power Markets, Small Die Area, Cloud Clients Optimized


Bobcat Design Goals

 A small, efficient, low power
x86 core

 Excellent performance

 Synthesizable with small
number of custom arrays

 Easily Portable across process
technologies


Feature Set
 64-bit AMD64 x86 ISA
 SIMD extensions: SSE1, SSE2,
SSE3, SSSE3, SSE4A
 Virtualization
 Support for misaligned 128-bit
data types
 Instruction Based Sampling
(for dynamic optimization)
 C6 (with integrated power gating)


Micro-architecture Overview
 Dual x86 instruction decode
 Out-of-Order instruction execution
 Dual COP retirement
 Complex microOPs
 State of the art branch prediction
 Aggressive OOO load/store engine w/ hazard
prediction
 Advanced Virtualization w/ nested page tables,
ASIDs and world switch acceleration
 Low power C6 state w/ core level power gating and
state save acceleration


Bobcat Micro-Architecture
[Block diagram. Labeled blocks, front to back:]
 Fetch: ITLB, 32KB ICACHE, Branch Predictor (Branch Locator, Condition Predictor, Return Stack, Dynamic Target), Fetch Queue
 Decode: uCode, Dual x86 Decoder, Instr Queue, FP Decode, ROB
 Rename and schedule: Int Rename, FP Rename, Scheduler, Scheduler, FP Sched, Int PRF, FP PRF
 Execution units: ALU, ALU, LAGU, SAGU, Mul, IntMul, St Conv, MMX Alu (x2), FP Logical (x2), FPAdd, FPMul
 Memory: Table Walker, DTLB, 32KB DCACHE, LdSt Unit, Prefetch, BU, 512KB L2CACHE, To/from Northbridge



Micro-Architecture
Icache:
 32Kbyte
 2-way set associative
 64-byte line
 Parity Protected
 512/8 entry ITLB (4k/2m)
 Fetch up to 32-bytes/cycle
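
As a rough illustration of what the Icache parameters imply, the sketch below derives the number of lines, sets, and address bits from the 32 KB size, 2-way associativity, and 64-byte line given on the slide. The helper name `cache_geometry` is made up for this example, and the offset/index split assumes a conventional set-indexed layout, which the slides do not spell out.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Derive basic geometry for a set-associative cache (hypothetical helper)."""
    lines = size_bytes // line_bytes            # total number of cache lines
    sets = lines // ways                        # each set holds `ways` lines
    offset_bits = line_bytes.bit_length() - 1   # log2(line size); power of two assumed
    index_bits = sets.bit_length() - 1          # log2(number of sets); power of two assumed
    return {"lines": lines, "sets": sets,
            "offset_bits": offset_bits, "index_bits": index_bits}


# Slide values: 32 Kbyte, 2-way set associative, 64-byte line.
icache = cache_geometry(32 * 1024, ways=2, line_bytes=64)
print(icache)
```

Under these assumptions the Icache has 512 lines in 256 sets, addressed with 6 offset bits and 8 index bits.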



Branch Predictor:
 Predicts up to two branches per cycle
 Remembers branch instruction locations
 Return Stack Address Predictor
 Indirect Dynamic Address Predictor
 State of the Art condition Predictor
 Only necessary structures are clocked



Dual x86 Decoder:
 Scans up to 22 bytes
 Decodes up to two x86 instructions per cycle
 The decoder can directly map 89% of x86 instructions to a single microOp, an additional 10% to a pair of microOps, and more complicated x86 instructions (<1%) are microcoded. (Dynamic Instruction Counts)
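
The decode mix quoted for this slide can be turned into an approximate average microOp count per x86 instruction. The sketch below is only an estimate: the 89% and 10% figures come from the slide, but the microcoded share is taken as a full 1% (the slide says "<1%"), and `uops_per_microcoded` is a placeholder guess, not an AMD figure.

```python
def avg_uops_per_instruction(single=0.89, double=0.10, microcoded=0.01,
                             uops_per_microcoded=4):
    """Weighted average microOps per x86 instruction (dynamic counts).

    single, double: fractions from the slide.
    microcoded: upper end of the slide's "<1%" (assumption).
    uops_per_microcoded: hypothetical average length of a microcode sequence.
    """
    return single * 1 + double * 2 + microcoded * uops_per_microcoded


print(round(avg_uops_per_instruction(), 2))
```

With these placeholder inputs the average comes out at roughly 1.13 microOps per instruction, and the directly decoded 99% alone contributes 1.09 regardless of the microcode assumption.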



Integer Execution:
 A dual port integer scheduler feeds two ALUs
 A dual port address scheduler feeds a load address unit, and a store address unit.
 Physical Register File uses maps and pointers to reduce power by minimizing data copying/movement.
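
The "maps and pointers" idea can be sketched with a generic, textbook-style rename map: a result is written once into a physical register, and later bookkeeping only moves small pointers rather than the data itself. This is a toy model with invented names (`RenameMap`, `write`, `release`), not a description of Bobcat's actual rename logic or register counts.

```python
class RenameMap:
    """Toy register renamer: architectural names point at physical registers."""

    def __init__(self, num_arch, num_phys):
        assert num_phys > num_arch
        self.prf = [0] * num_phys                      # physical register file (values)
        self.map = list(range(num_arch))               # arch reg -> physical reg index
        self.free = list(range(num_arch, num_phys))    # unallocated physical regs

    def read(self, arch):
        """Follow the pointer; no value is copied between structures."""
        return self.prf[self.map[arch]]

    def write(self, arch, value):
        """Allocate a fresh physical register and repoint the map at it."""
        new = self.free.pop(0)
        old = self.map[arch]
        self.prf[new] = value      # the value is written exactly once
        self.map[arch] = new       # only the pointer changes
        return old                 # caller releases this once the write retires

    def release(self, phys):
        """Return a no-longer-referenced physical register to the free list."""
        self.free.append(phys)


rm = RenameMap(num_arch=4, num_phys=8)
old = rm.write(1, 42)   # architectural r1 now points at a new physical register
rm.release(old)         # the previous mapping is recycled after retirement
print(rm.read(1))
```

The point of the sketch is that retiring the write only frees a pointer; the value 42 never moves after it lands in the physical register file.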



Floating Point Unit:
 A centralized FP scheduler feeds two 64-bit FP execution stacks
 MMX and Logical units are replicated in both stacks
 The FP Mul Unit can perform two SP multiplies per cycle
 The FP Add Unit can perform two SP additions per cycle
 A physical register file is used to reduce power
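
The per-cycle multiply and add rates give a simple upper bound on single-precision throughput. The sketch below assumes the FP Mul and FP Add units can both be busy in the same cycle, which the slide suggests but does not state outright; the clock frequency and core count are inputs the slides do not give.

```python
def peak_sp_flops_per_cycle(sp_mul_per_cycle=2, sp_add_per_cycle=2):
    """Upper bound on single-precision FLOPs per cycle for one core.

    Defaults are the slide's figures; summing them assumes both units
    can issue in the same cycle (assumption).
    """
    return sp_mul_per_cycle + sp_add_per_cycle


def peak_sp_gflops(clock_ghz, cores=1):
    """Peak GFLOP/s for a hypothetical clock (GHz) and core count."""
    return peak_sp_flops_per_cycle() * clock_ghz * cores


# A 64-bit execution stack holds two 32-bit SP lanes, matching "two SP per cycle".
sp_lanes_per_stack = 64 // 32
print(peak_sp_flops_per_cycle(), sp_lanes_per_stack)
```

For example, `peak_sp_gflops(1.0)` gives 4.0 for one core at a hypothetical 1 GHz; real code would land below this bound.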



Data Cache:
 32-Kbyte
 64-byte line
 Parity Protected
 Copyback
 40/8 entry L1DTLB (4k/2m)
 512/64 entry L2DTLB (4k/2m)
 Advanced 8-stream prefetcher
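
The TLB sizes translate into "reach", the amount of memory that can be mapped without a table walk. The sketch below reads "40/8 entry (4k/2m)" as 40 entries for 4 KB pages plus 8 entries for 2 MB pages, which is the natural reading of the slide notation but is still an interpretation; `tlb_reach` is a hypothetical helper.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries_4k, entries_2m):
    """Bytes of address space covered when every entry is valid."""
    return {"4k": entries_4k * 4 * KB, "2m": entries_2m * 2 * MB}


l1_dtlb = tlb_reach(40, 8)      # L1DTLB: 40/8 entries (4k/2m)
l2_dtlb = tlb_reach(512, 64)    # L2DTLB: 512/64 entries (4k/2m)

print(l1_dtlb["4k"] // KB, "KB,", l1_dtlb["2m"] // MB, "MB")
print(l2_dtlb["4k"] // MB, "MB,", l2_dtlb["2m"] // MB, "MB")
```

On this reading the L1DTLB covers 160 KB with 4 KB pages or 16 MB with 2 MB pages, and the L2DTLB covers 2 MB or 128 MB respectively.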



Out-of-Order Load Store Unit:
 Loads bypassing loads
 Loads bypassing stores
 Stores bypassing loads
 Bypass tracking and dependency correction
 Hazard predictor
 Fast store forwarding
 Fast critical word fill forwarding



L2 Cache:
 512Kbyte
 64 byte lines
 ECC Protected
 Half speed clocking for power reduction



Bus Unit:
 8-outstanding data accesses
 2-outstanding fetch accesses
 Eviction Buffers
 Fill Buffers
 Write combining buffers
 Coherency management


Bobcat Pipeline
[Pipeline diagram, cycles numbered 0 to 12. Stage labels by row:]
 Fetch: Fetch0, Fetch1, Fetch2, Fetch3, Fetch4, Fetch5
 Decode: Dec0, Dec1, Dec2, Pack, FDec, Dispatch (with uCode ROM and MDec)
 Integer: Schedule, RegRead, ALU, Writeback
 Load/store: AGU, DC1, DC2
 Floating point: Transit, FpDec, RegRen, Schedule, RegRead, EXE, EXE, EXE, Writeback
 L2 access: Transit, L2Tag, L2Data

 Branch Mispredict Latency: 13 cycles
 Load Use Latency, L1 hit: 3 cycles
 Load Use Latency, L2 hit: 17 cycles
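
The three latencies quoted for the pipeline can be plugged into simple first-order models. The sketch below is illustrative only: the 13, 3, and 17 cycle figures are from the slide, while the hit rates, branch statistics, and memory latency are caller-supplied inputs that the slides do not provide, and the function names are invented for this example.

```python
BRANCH_MISPREDICT_CYCLES = 13   # branch mispredict latency from the slide
L1_HIT_CYCLES = 3               # load-use latency on an L1 hit
L2_HIT_CYCLES = 17              # load-use latency on an L2 hit


def avg_load_use_latency(l1_hit_rate, l2_hit_rate, memory_cycles):
    """Average load-use latency in cycles for a simple three-level model.

    l2_hit_rate is the hit rate among L1 misses; memory_cycles is a
    hypothetical latency for loads that miss both caches.
    """
    l1_miss = 1.0 - l1_hit_rate
    return (l1_hit_rate * L1_HIT_CYCLES
            + l1_miss * l2_hit_rate * L2_HIT_CYCLES
            + l1_miss * (1.0 - l2_hit_rate) * memory_cycles)


def mispredict_cpi(branch_fraction, mispredict_rate):
    """Extra cycles per instruction lost to branch mispredicts."""
    return branch_fraction * mispredict_rate * BRANCH_MISPREDICT_CYCLES


# Example with made-up workload numbers: 20% branches, 5% of them mispredicted.
print(round(mispredict_cpi(0.20, 0.05), 2))
```

With those made-up inputs the mispredict cost is about 0.13 cycles per instruction. Both models ignore overlap from out-of-order execution, so they overstate the real penalty.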


Core Floor Plan
[Annotated die floor plan. Labeled regions: Floating Point Unit, Test/Debug, Data L2 TLB, X86 Decode, Bus Unit, Instruction Cache, Inst TLB/Tag, L2 Sub Array, L2 TAG, Branch Predict, Ucode ROM, ROB, Integer Unit, Data Cache, Data Tag/TLB, Load Store Unit]


Power Reduction
 Use of physical Register files
 Extensive use of non-shifting queues with
pointers
 Fine grain clock gating
 Integrated Core Power Gating
 Only needed arrays are clocked
– i.e. Dtag hit before Dcache read
– Predicting the type of branch then clocking the
appropriate predictor(s)

 Elimination of instruction marker bits in the
Icache
 Finding the knee of the curve (scrutinize
performance gains against power costs)
 Polishing speed paths to raise the Vt mix
and reduce leakage
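
The "Dtag hit before Dcache read" item can be illustrated with a toy count of data-array reads. The sketch below compares a lookup that reads every way's data in parallel with the tags against one that checks the tag first and reads only the hitting way. The associativity and hit rate are hypothetical parameters (the slides do not give the Dcache associativity), and the model counts array reads, not actual power.

```python
def data_way_reads(accesses, hit_rate, ways, tag_first):
    """Count data-array way reads for a set-associative cache (toy model)."""
    if tag_first:
        # Tag is checked first: only the hitting way is read, misses read nothing.
        return accesses * hit_rate
    # Parallel lookup: every way's data is read on every access.
    return accesses * ways


# Hypothetical workload: 1000 loads, 95% hit rate, 8 ways.
parallel = data_way_reads(1000, 0.95, ways=8, tag_first=False)
serial = data_way_reads(1000, 0.95, ways=8, tag_first=True)
print(parallel, serial)
```

With these made-up numbers the tag-first scheme performs roughly 950 way reads instead of 8000. The saving scales with the number of ways.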


Bobcat Core Overview
Advanced Micro-architecture
 Dual x86 Decode
 Advanced Branch Predictor
 Full OOO instruction execution
 Full OOO load/store engine
 High Performance Floating Point
 AMD64 64-bit ISA
 SSE1,2,3, SSSE3 ISA
 Secure Virtualization
 32KB L1s, 512KB L2
Low Power Design
 Power Optimized Execution
 Micro-architecture that minimizes data movement and unnecessary reads
 Clock gating, Power gating
 System Low Power States
Small Core
 Area efficient balance of high performance and low power

[Diagram: Bobcat Low Power Core. ICACHE, Fetch, Decode, L2, BU; Integer Scheduler, Address Scheduler, FP Scheduler; I Pipe, I Pipe, Load Pipe, Store Pipe, A Pipe, M Pipe; DCACHE]


Summary

 Estimated 90% of the performance of today’s
mainstream notebook CPU in half the area*

 Sub-one watt capable

 Highly portable across designs and
manufacturing technologies

*Based on internal AMD modeling using benchmark simulations
