Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core

2.1: Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
© 2020 IEEE International Solid-State Circuits Conference
T. Singh¹, S. Rangarajan¹, D. John¹, R. Schreiber¹, S. Oliver¹, R. Seahra², A. Schaefer¹
¹AMD, Austin, TX; ²AMD, Markham, ON, Canada
Presented at ISSCC 2020

Outline
• Motivation
• Market Segments
• Architecture
• Core Complex
• Technology
• Implementation
• SRAMs
• Power
• Silicon Results
• Conclusion

Motivation
• Zen was a huge lift
• Zen 2 is a compelling successor to Zen
• Goals
– Deliver a generational performance improvement above the industry trend
– Enable 2x cores in the same socket
– Improve single-thread (1T) performance
• How can we do this?
– Technology port
– Architectural changes
– Physical design and methodology changes
• AMD was aggressive and we did all of the above to achieve the goals!!

Zen 2 Market Segments

Zen 2 Architecture
• Changes from Zen
– New TAGE Branch Predictor
– Optimized L1 Instruction Cache: 32K/8-way vs. 64K/4-way
– 2X Op Cache Capacity: 4K vs. 2K ops
– 2X Floating Point Data Path Width: 256b vs 128b
– 3rd Address Generation Unit
– Larger Physical Structures: Integer Scheduler, PRF, ROB, Store Queue, L2DTLB
– 2X L1 Data Cache Read/Write Bandwidth
– 2X L3 Cache: 16MB vs. 8MB per Core Complex (CCX)
• +15%¹ single-thread (1T) IPC over Zen
• ~9% switching capacitance (CAC) improvement over previous generation, technology neutral
¹ AMD "Zen 2" CPU-based system scored an estimated 15% higher than previous generation AMD "Zen" based system using estimated SPECint®_base2006 results. SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org.
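As a rough illustration of the L1 instruction cache change listed above (32K/8-way vs. 64K/4-way), the sketch below computes how many sets each geometry implies. The 64-byte line size and the helper name num_sets are assumptions made for this example; the slide states only capacity and associativity.

```python
KIB = 1024


def num_sets(size_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Return the number of sets in a set-associative cache.

    line_bytes=64 is an assumption; the slide does not give the line size.
    """
    return size_bytes // (ways * line_bytes)


# Geometries taken from the slide: Zen 64K/4-way, Zen 2 32K/8-way.
zen_l1i_sets = num_sets(64 * KIB, ways=4)
zen2_l1i_sets = num_sets(32 * KIB, ways=8)

print("Zen   L1I sets:", zen_l1i_sets)
print("Zen 2 L1I sets:", zen2_l1i_sets)
```

Under that line-size assumption, halving the capacity while doubling the associativity cuts the set count by 4x, so fewer address bits index the cache and more ways are compared per lookup.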

Core Functional Units
• 32KB IC
• 32KB DC
• ~20 blocks, ~400K avg instances
• ROM for uCODE
• 5 L1 RAM variants
• Chip Pervasive Logic (CPL) – clock/test block
[Figure: core floorplan. Labeled blocks: Floating Point, Data Cache, Load/Store, ALU, Scheduler, Branch Prediction, I-Cache, Decode, L2 Cache, uCode, CPL]

L2/L3 Cache Hierarchy
• Only 3 unique custom macros
– Down from 8 on Zen
• Each 4M slice is identical
• Multi-stage clock gating in L3 to keep clock distribution power the same as 8M L3 from Zen
• LDOs incorporated into the L3 to supply VDDM to L2 and L3 arrays
– Loss of package distribution of VDDM meant LDOs had to be moved closer
– Must reduce current on VDDM
[Figure: L2/L3 floorplan. Labels: CTL, L3 Tags, L3 Data, 4M Slice, L2 Data, L2 Tags, L2 Status, 512K L2, LDOs. Annotation: Shadow tag macros for serving external probes]

Zen 2 Core Complex (CCX)
• 4-core complex
• L3 size increases to 16MB
• Design for flexibility
• Maximize # cores for server case

Zen 2 CCX Configs
Config        Cores/CCX   L3/CCX
HEDT/Server   4           16MB
APU           4           4MB
Value         2           4MB

• Zen 2 Core can be used in various configs covering a wide power range
• Multiple CCX can be placed to achieve the desired core count

Cores   Market           TDP
8       Notebook         15 W
6       Desktop          65 W
8       Desktop/Server   65-120 W
24      HEDT/Server      155-280 W
48      Server           200-225 W

Zen vs. Zen 2 Technology Comparison
                        Zen                  Zen 2
Tech                    14nm FinFET          7nm FinFET
Cores/CCX               4 Cores, 8 Threads   4 Cores, 8 Threads
Area/CCX                44 mm²               31.3 mm²
L2/core                 512KB                512KB
L3/CCX                  8MB                  16MB
CPP                     78 nm                57 nm
Fin Pitch               48 nm                30 nm
1x Metal Pitch          64 nm                57 nm
Stdcell Track Library   10.5 track           6 track
Cu Metal Layers         11 w/ MiM            13 w/ MiM
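To make the comparison above easier to read as scaling factors, the short sketch below divides each Zen 2 value by the corresponding Zen value. All numbers are copied from the table; the dictionary keys are names chosen for this example only.

```python
# Values copied from the Zen vs. Zen 2 technology comparison table.
zen = {"area_mm2": 44.0, "l3_mb": 8, "cpp_nm": 78,
       "fin_pitch_nm": 48, "metal_pitch_1x_nm": 64}
zen2 = {"area_mm2": 31.3, "l3_mb": 16, "cpp_nm": 57,
        "fin_pitch_nm": 30, "metal_pitch_1x_nm": 57}

# Zen 2 / Zen ratio for every metric: below 1.0 means it shrank.
ratios = {key: zen2[key] / zen[key] for key in zen}

for key, ratio in ratios.items():
    print(f"{key:>18}: {ratio:.2f}x")
```

Read this way, the CCX area shrinks to roughly 0.71x of the Zen value even though the L3 capacity per CCX doubles.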

Zen vs. Zen 2 Technology Comparison (cont)
Zen (14nm)                        Zen 2 (7nm)
Layer Name              Pitch     Layer Name              Pitch
M0 (StdCell Internal)   n/a       M0 (StdCell Internal)   1.0x
M1 (StdCell Internal)   1.0x      M1 (Stdcell & BEOL)     1.425x
M2-M3                   1.0x      M2-M3                   1.0x-1.1x
M4-M7                   1.25x     M4-M7                   2.0x
M8-M9                   2.0x      M8-M9                   2.0x
---                     ---       M10-M11                 3.15x
M10-M11 (RDL)           11.25x    M12-M13 (RDL)           18.0x

Place and Route Design Optimization
• 7nm FinFET presents unique route challenges
– Lower layer jogs forbidden
– Denser standard cells with reduction in track height
– Increased lower level metal resistance
• Deep collaboration between AMD CAD, foundry, and EDA partners
– Cell density management
– Advanced legalization techniques
– Improved pre-route timing estimates
– Wire Engineering and Via Ladders
[Figure: routing illustration. Labels: Same-Layer Jogs Forbidden; Inter-Layer Jumpers Required]

Placement Restricted by Large Cells
• Multi-row cells benefit power and area, but create placement challenges
• Clustering of flops has many benefits but can cause placement issues
• Resulting small gaps are challenging to use and required innovation to exploit
– New algorithms
– Flexible power grid choices

Design RC Miscorrelation
• Pre-route vs Post-route miscorrelation caused by length and layer assumptions
• Pre-route miscorrelations for resistance and capacitance have differing root causes
– Layer assignment for resistance
– Length estimates for resistance and capacitance
• Based on previously modeled trends, EDA tools may have challenges estimating delay
• Required innovation to tackle

Layer   Normalized Resistance   Normalized Capacitance
M1      1.00                    1.00
M2      3.17                    0.96
M3      2.31                    0.96
M4      0.72                    0.75
M5      0.55                    0.83
M6      0.52                    0.83
M7      0.55                    0.83
M8      0.52                    0.83
M9      0.55                    0.92
M10     0.16                    0.96
M11     0.16                    0.92
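One way to see why layer assignment matters so much for pre-route estimates is to multiply the normalized resistance and capacitance of each layer. The sketch below does that with the table values. It assumes the numbers are per unit length and uses the R·C product as a first-order proxy for the delay of an equal-length wire, ignoring driver strength and via resistance; the names LAYER_RC and rc_product are made up for this example.

```python
# Normalized (resistance, capacitance) per layer, copied from the table above.
LAYER_RC = {
    "M1": (1.00, 1.00), "M2": (3.17, 0.96), "M3": (2.31, 0.96),
    "M4": (0.72, 0.75), "M5": (0.55, 0.83), "M6": (0.52, 0.83),
    "M7": (0.55, 0.83), "M8": (0.52, 0.83), "M9": (0.55, 0.92),
    "M10": (0.16, 0.96), "M11": (0.16, 0.92),
}


def rc_product(layer: str) -> float:
    """First-order wire delay proxy: normalized R times normalized C."""
    r, c = LAYER_RC[layer]
    return r * c


rc_by_layer = {layer: rc_product(layer) for layer in LAYER_RC}
slowest = max(rc_by_layer, key=rc_by_layer.get)
fastest = min(rc_by_layer, key=rc_by_layer.get)

for layer, rc in rc_by_layer.items():
    print(f"{layer:>3}: RC = {rc:.2f}")
print("highest RC:", slowest, "| lowest RC:", fastest)
```

Capacitance varies little between layers while resistance spans roughly 20x, which is consistent with the slide attributing resistance miscorrelation mainly to layer assignment.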

Pre-Route Correlation Improvements
• Plots show clock tree synthesis (CTS) vs. route timing
• Large variance in initial results
– Large number of paths have overly pessimistic delay during pre-route steps. Tools waste resources trying to fix them
– Significant number of paths have optimistic delay estimates. These paths are under-optimized
• Employed timing with targeted capacitance scaling and global route-based layer estimation
– Standard deviation dramatically improved while keeping a slightly pessimistic mean
[Figure: Timing Slack Correlation (cts_vs_route.slack.corr) and Timing Slack Delta (cts_vs_route.slack_delta.hist) plots for Initial Results and Improved Results; the delta histogram axis is labeled Pessimistic and Optimistic]
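The mean and standard deviation mentioned above can be computed from per-path slacks at the two stages. The sketch below is one minimal way to do it, not the flow used by the authors: the function name, the sign convention (route slack minus CTS slack, so a positive delta means the pre-route estimate was pessimistic), and the sample slack values are all hypothetical.

```python
from statistics import mean, pstdev


def slack_delta_stats(cts_slack, route_slack):
    """Return (mean, std dev) of route slack minus CTS slack per path.

    Assumed convention: positive delta = pre-route estimate was pessimistic.
    """
    deltas = [route - cts for cts, route in zip(cts_slack, route_slack)]
    return mean(deltas), pstdev(deltas)


# Hypothetical per-path slacks in ps; these are not data from the slides.
cts = [-12.0, -3.0, 4.0, 9.0]
route = [-8.0, -1.0, 5.0, 12.0]

mean_delta, std_delta = slack_delta_stats(cts, route)
print(f"mean delta = {mean_delta:.2f} ps, std dev = {std_delta:.2f} ps")
```

A small positive mean with a small standard deviation matches the stated goal: estimates stay slightly pessimistic while far fewer paths are badly over- or under-optimized.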

Wire Engineering Challenges
• Lower layers getting more resistive with latest technology nodes
– Very short routes in tight data paths need a buffer
– Routes longer than Steiner due to complex rules
– Challenging for optimization tools to comprehend
• Critical signals need to get to higher layers quickly

Wire Engineering and Via Ladders
• Team used selective layer optimization, buffering, pre-routes, and via ladders to exploit the fast layers for critical signals
• Two types of via ladders
– High Performance: for large buffers driving long wires
– EM: for high-activity gates (e.g., clock drivers)
– Mitigated EM issues on large fanout nodes with high activity
[Figure: via ladder, Top Via Ladder View and Side Via Ladder View]

L2/L3 Cache Changes
• Zen had an on-die LDO to generate VDDM supply for use by cache arrays
• Zen 2’s package choices make using package layers for VDDM distribution impossible
• Moved the bitline precharge from VDDM to VDD to reduce current
[Figure: two SRAM column schematics. Labels: VDDM, BLT[], BLC[], WL[N:0], BLPCX, WRCS[], RDCSX[], SAPCX, SAEN, XCENX, SAT, SAC, WDT_X, WDC_X; one schematic also labels SAT_INT and SAC_INT, the other a NegBL Write Driver]
VDD Precharge Challenges
• Moving bitline precharge to VDD creates both bitcell stability and writeability challenges
• High level of configurability allows for silicon flexibility
[Figure: assist control block diagram. Labels: SRAMs, Assist configurations (WLUdEn, NegBlEn), Fuses, Assist controller, System Management, Programming details, superVminEn, superVmaxEn, Voltage thresholds]
[Figure: VDD ramp from VDDmin to VDDmax relative to VDDM, with regions marked superVminEn=1 and superVmaxEn=1]
– At the VDD where VDDM-VDD=superVminThreshold: controller pauses voltage increase and unsets superVminEn register before continuing to raise voltage
– At the VDD where VDD-VDDM=superVmaxThreshold: controller pauses voltage increase and sets superVmaxEn register before continuing to raise voltage
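The threshold behavior described on this slide can be sketched as a small function that maps the two supply voltages to the assist enables. This is only a model of the stated rule, not AMD's controller: the function and class names, the handling of the exact boundary points, and the voltages and thresholds in the example are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssistState:
    super_vmin_en: bool
    super_vmax_en: bool


def assist_state(vdd: float, vddm: float,
                 vmin_threshold: float, vmax_threshold: float) -> AssistState:
    """Model the enables implied by the slide's two threshold crossings."""
    return AssistState(
        # Enabled while VDD sits more than vmin_threshold below VDDM; the
        # slide says it is unset once VDDM - VDD reaches the threshold.
        super_vmin_en=(vddm - vdd) > vmin_threshold,
        # Set once VDD has risen vmax_threshold above VDDM (boundary choice
        # of >= is an assumption).
        super_vmax_en=(vdd - vddm) >= vmax_threshold,
    )


# Hypothetical values in volts, chosen only to exercise each region.
VDDM, VMIN_TH, VMAX_TH = 0.9, 0.2, 0.1
for vdd in (0.6, 0.8, 0.9, 1.1):
    print(vdd, assist_state(vdd, VDDM, VMIN_TH, VMAX_TH))
```

Sweeping VDD upward with VDDM fixed moves through three regions in this model: superVminEn set, neither enable set, then superVmaxEn set.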

VDD Precharge Timing Challenges
• Moving precharge to VDD reduced our current enough to allow on-die distribution but presents other challenges
• Read before write timing challenges at low VDD, high VDDM
[Figure: waveforms of WL @ constant VDDM, BLPCX @ high VDD, and BLPCX @ low VDD. Panel titles: Power races with WL; Read before write challenges. Annotations: "WL on before Bitline precharge turns off at low VDD!" and "Bitline precharge turns on before WL turns off at high VDD!"]

Solving Timing Challenges
• Solving these multiple voltage timing challenges required a number of techniques
– Dual voltage clock shapers to average two voltage domains
• Can alter the number of these buffers on VDD or VDDM or remove them entirely to make timing more or less dependent on either supply
– False read before write problem can be mitigated by compressing the front end of the WL during a write operation
[Figure: Pseudo-dynamic level shifter. Labels: ISOX@VDDM, Input@VDD, shapedFallInput@VDD, VDDM, LS, VDD]
[Figure: WL shaping. Labels: WREN, WLCLK, WLCLK_shape, WL during read, WL during write]

CAC Comparison
• 3% decrease in flop power allocates more budget for combinational logic

FLOP Palette Improvements
• Rich flop library, balance timing/power needs by driving right flop mix
• Up to 8% Fmax benefit from high speed flops in timing critical loop paths
[Figure: flop palette, ranging from Best for Performance to Best for Power]

Low Power Gater Latch
Energy with AvgApp Activity (fJ)
State   LP Latch   Regular Latch   Ratio
E=1     0.22       0.18            121%
E=0     0.17       1.61            10%
Total   0.38       1.79            22%
[Figure: low power gater latch schematic. Labeled signals: E, TE, CLK, CLKB, CLKBB, Dbar, qf, qf_x, Q]
• 90% power savings in latch for common case of E = 0 through internal self gating
• Clock gater latch power contribution reduced from 22% in Zen to 13% in Zen 2 for an average application

Zen 2 Clock Optimization
• Multi mesh plan for the core supported by configurable clock tree construction
– FP level mesh gating enabled with minimal timing/area overhead
– 15% Mesh power savings in Idle and Average App
• Tight clock skew distribution
• Relocated clock spines and technology shrink (vs. Zen) achieve similar skew profile while reducing CAC

Zen vs. Zen 2 CAC Comparison
• Primary sources of CAC reduction
– 14 nm to 7 nm scaling
– 6 track library
– Aggressive microarchitectural CAC optimizations

Generational Leadership Perf/Watt
• Performance/Watt driven by a combination of technology and design improvements
• Timing
– Improved scalability by optimizing at a wider voltage range compared to Zen
– Multi-corner optimization
• Library choice and optimization
– 6 track library enabled additional CAC/leakage savings in addition to default technology entitlement
• Design CAC
– MBFF, low power clock-gater library optimization
– RTL improvements
– CAC aware downsizing methodology
[Figure: Power Improvements – ISO Frequency. Chart from Zen power @ 100% IPC to Zen2 power @ 115% IPC, with contributions labeled 7nm CAC Savings, 7nm Timing, Design CAC Savings, Library Choice]

Frequency/Power Silicon Results
• 4 cores active with 2 threads per core
• The combined effect of lower Vmin for the same frequency and reduced CAC enabled a 50% reduction in power for a given frequency throughout most of the F(P) curve
• This enables 2x cores in the same socket!!
[Figure: frequency vs. power curves, annotated "50% power reduction"]
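The way lower voltage and lower CAC compound can be seen from the standard first-order dynamic power relation P ≈ CAC · V² · f. The sketch below applies it to ratios between two designs. It ignores leakage, and the CAC and voltage ratios in the example are purely illustrative placeholders, not values reported in the slides.

```python
def relative_dynamic_power(cac_ratio: float, v_ratio: float,
                           f_ratio: float = 1.0) -> float:
    """Dynamic power of a new design relative to an old one.

    Uses P ~ CAC * V^2 * f, so the ratios simply multiply.
    Leakage power is ignored in this first-order model.
    """
    return cac_ratio * v_ratio ** 2 * f_ratio


# Illustrative placeholders only (NOT from the source): 30% lower switched
# capacitance and 15% lower voltage at the same frequency.
example = relative_dynamic_power(cac_ratio=0.70, v_ratio=0.85)
print(f"relative dynamic power at iso-frequency: {example:.2f}")
```

Because voltage enters squared, a modest Vmin reduction at a given frequency contributes as much as a considerably larger CAC reduction, which is why the slide credits both effects together.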

Frequency/Voltage Silicon Results
• 1 core active with two threads per core, 3 cores idle
• F/V curve improved over all voltages
• Design worked to improve the low voltage performance for improved linearity
• Wide voltage range

Conclusion
• Met Goals
– Moved to energy-efficient TSMC 7nm FinFET
– Made huge architectural changes
– Improved PD and methodology
• Results are clear
– Scalable across 15W mobile to 280W Server
– 50% reduced power at iso-frequency
– Enable 2x cores in same-socket
– >15% 1T IPC over previous generation
– ~9% CAC improvement over previous generation, technology neutral
– Enables peak frequencies up to 4.7GHz (+350MHz generationally)
• Zen 2 delivers generational performance uplift!!

Acknowledgements
• We would like to thank our talented AMD design team across
Austin, Fort Collins, Santa Clara, Boston, Markham, and India
who contributed to Zen 2
• Please stay for our chiplet paper next

2.2 : AMD Chiplet Architecture for High-Performance Server and Desktop Products
Disclaimer and Endnotes
DISCLAIMER
The information contained herein is for informational purposes only, and is subject to change without notice. While every
precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and
typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro
Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this
document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or
fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described
herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this
document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed
agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18
All rights reserved. AMD, the AMD Arrow logo, EPYC, RYZEN, Threadripper, Radeon, Infinity Fabric, and combinations
thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification
purposes only and may be trademarks of their respective companies.
