2. October 24, 2007
2
Topics
Introduction
Power Dissipation Basics
Existing Low Power Techniques and Issues for
Advanced LP Techniques (under exploration)
3. October 24, 2007
3
Introduction
[Figure: processor power (Watts, log scale from 0.1 to 1000) versus year (1970 to 2020) for the 8080, 8086, 386, Pentium® and Pentium® 4 processors, with the trend heading toward 1000's of Watts.]
Unconstrained power will reach 1,000’s of watts
4. October 24, 2007
4
Power Density will Get Even Worse
[Figure: power density (W/cm2, log scale from 1 to 10,000) versus year (’70 to ’10) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium® processors, compared against a hot plate, a nuclear reactor, a rocket nozzle and the Sun’s surface.]
5. October 24, 2007
5
Motivation
• Portability
– Extending battery life
• Battery technology scales up slowly: 150 Wh/kg today vs. 75 Wh/kg in 1990
• A 1 kg NiCd battery could power a P4 for 1 hour, or a Centrino for 4 hours
– Low power dissipation as a product feature in itself
– Enabling portable devices to be more powerful and feature-rich
• Packaging
– High power dissipation leads to expensive packaging and cooling systems
• ~ 1W: inexpensive plastic package limit
• ~ 10W: Ceramic package limit
• ~ 10W/cm2: limit for convection cooling
• ~ 50W/cm2: limit for forced-air cooling
• Reliability
– Long product lifetime
6. October 24, 2007
6
Sources of Power Dissipation in CMOS
Power in a CMOS inverter is governed by the three-part equation below
• Dynamic (switching) power
– Currently the largest part, but percentage getting smaller
• Leakage Power
– Subthreshold conduction – getting bigger due to aggressive scaling, temperature, etc.
– Reverse leakage of diodes (relatively small)
– Possible gate tunneling current in future technologies
• Short-circuit (crowbar) current
– Both pull-up and pull-down devices are partially conducting for a small, but finite
amount of time
– Can be modeled as some fraction of dynamic current
$P_{total} = C_L V_{DD}^2 f_{clk} a_{0 \to 1} + V_{DD} I_{short\text{-}circuit} + V_{DD} I_{leakage}$
7. October 24, 2007
7
Sources of Power Dissipation: Switching
• Energy is drawn from the supply when Vout makes a 0→1 transition: one half is consumed in the pull-up network (PMOS) and the remaining half is stored in CL
• During the 1→0 transition the charge stored in CL is dumped via the pull-down network (NMOS)
• Power = (Energy/Transition) × (Transition Rate)
$\text{Power} = C_L V_{DD}^2 f_{0 \to 1} = C_L V_{DD}^2 f_{clk} a_{0 \to 1} = C_{switched} V_{DD}^2 f_{clk}$
where $C_{switched} = C_L a_{0 \to 1}$ and $a_{0 \to 1}$ = probability of a 0→1 transition
• Dynamic power therefore can be reduced by
– Scaling down the supply voltage VDD
– Reducing the switching probability through
architectural means
– Scaling down the frequency as per
throughput demands
– Optimizing/reducing the load capacitance
(Device Scaling)
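The switching-power expression can be explored numerically. The following is a minimal sketch, assuming illustrative values for load capacitance, clock frequency and activity factor (none of these numbers come from the slides); it shows the quadratic effect of supply voltage scaling.

```python
def dynamic_power(c_load, vdd, f_clk, activity):
    """Switching power in watts: C_L * VDD^2 * f_clk * a(0->1)."""
    return c_load * vdd ** 2 * f_clk * activity


# Assumed example values: 1 nF total switched load, 125 MHz clock,
# 10% probability of a 0->1 transition per cycle.
base = dynamic_power(c_load=1e-9, vdd=1.0, f_clk=125e6, activity=0.1)
scaled = dynamic_power(c_load=1e-9, vdd=0.8, f_clk=125e6, activity=0.1)

print(f"1.0 V supply: {base * 1e3:.2f} mW")
print(f"0.8 V supply: {scaled * 1e3:.2f} mW")
print(f"saving from voltage scaling alone: {1 - scaled / base:.0%}")
```

Because power depends on the square of VDD, lowering the supply is usually the strongest of the four levers listed above, at the cost of slower gates.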
8. October 24, 2007
8
Sources of Power Dissipation: Short-Circuit
• Due to the finite input transition time, both NMOS and PMOS conduct for a small but
finite duration, providing a resistive path between VDD and GND
• Typically less than 10% of the total dynamic power
• The short-circuit current Isc depends on the ratio of input to output transition times
(higher the ratio, more is the duration for which both the devices are ON, higher the
dissipation due to short-circuit current)
• Can be minimized by balancing out the input and output rise times
• Can be virtually eliminated by making VDD less than (VTN+|VTP|)
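The last condition follows from the conduction windows of the two devices: the NMOS conducts for Vin above VTN and the PMOS conducts for Vin below VDD - |VTP|, so both are on only if that interval is non-empty. A minimal sketch, with assumed threshold values and a hypothetical helper name:

```python
def short_circuit_window(vdd, vtn, vtp):
    """Return the (low, high) input-voltage range in which both devices
    conduct, or None if no such range exists (VDD <= VTN + |VTP|)."""
    low = vtn                 # NMOS turns on above VTN
    high = vdd - abs(vtp)     # PMOS turns off above VDD - |VTP|
    return (low, high) if high > low else None


# Assumed thresholds: VTN = 0.4 V, VTP = -0.4 V.
print(short_circuit_window(vdd=1.0, vtn=0.4, vtp=-0.4))  # overlap exists
print(short_circuit_window(vdd=0.7, vtn=0.4, vtp=-0.4))  # no overlap
```

With the supply below the sum of the thresholds there is no input voltage at which a direct VDD-to-GND path exists, which is why the short-circuit component disappears.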
9. October 24, 2007
9
Sources of Power Dissipation: Leakage
I1 : pn junction reverse bias current
I2 : Subthreshold conduction due to weak inversion
I3 : Drain-induced barrier lowering (DIBL)
I4 : Gate-induced drain leakage (GIDL)
I5 : Punchthrough
I6 : Narrow width effect
I7 : Gate oxide tunneling
I8 : Hot carrier injection (HCI)
• Significant contributor to standby power
• The dominant component is the subthreshold
leakage current (I2), which grows as VTH is
lowered with scaling (it depends exponentially
on VTH and is also sensitive to temperature)
• There are several techniques to contain this
viz. using dual-VT, multi-VT libraries, using
MTCMOS technology, using VTCMOS
technologies, using Back-biasing etc.
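The exponential dependence on VTH can be illustrated with the textbook subthreshold-swing model. This is a simplified sketch, assuming a slope factor n = 1.5 and ignoring DIBL and the shift of VTH with temperature; the function names are hypothetical.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C


def subthreshold_swing(temp_kelvin, n=1.5):
    """Subthreshold swing in volts per decade: n * (kT/q) * ln(10)."""
    return n * (K_BOLTZMANN * temp_kelvin / Q_ELECTRON) * math.log(10)


def leakage_ratio(delta_vth, temp_kelvin, n=1.5):
    """Factor by which subthreshold leakage grows when VTH is lowered
    by delta_vth volts (simplified model, assumed slope factor n)."""
    return 10 ** (delta_vth / subthreshold_swing(temp_kelvin, n))


swing = subthreshold_swing(300.0)
print(f"swing at 300 K: {swing * 1e3:.1f} mV/decade")
print(f"leakage increase for a 100 mV lower VTH: {leakage_ratio(0.1, 300.0):.1f}x")
```

Under these assumptions a 100 mV reduction in VTH raises subthreshold leakage by roughly an order of magnitude, which is the motivation for the dual-VT and multi-VT libraries mentioned above.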
10. October 24, 2007
10
Low-Power Techniques (Basic and Advanced)

Power reduction technique | Leakage power | Dynamic power | Timing penalty | Area penalty | Verification impact | Implement. impact | Synth, Formal Test Impact
Area optimization | 1.1X | 10% | 0% | -10% | None | None | None
Multi-Vt optimization | 6X | 0% | 0% | 2 to -2% | None | Low | None
Clock gating | 0X | 20% | 0% | 2% | None | Low | Low
Multi-supply voltage (MSV) | 2X | 40-50% | 0% | 10% | Low | Medium | Medium
Power shut-off (PSO) | 10-50X | ~0% | 4-8% | 5-15% | High | Medium-high | High
Dynamic and Adaptive Voltage Frequency Scaling (DVFS and AVS) | 2-3X | 40-70% | 0% | 10% | High | High | High
Substrate Biasing | 10X | - | 10% | 10% | Medium | High | Medium-High
11. October 24, 2007
11
PSO in std cell based design
[Figure: examples of fine-grain and coarse-grain power switches, including buffered and un-buffered µSwitch cells; each switch is controlled by SLEEP and sits between virtual Vss and real VSS; the example gates have inputs A (or A1, A2) and output Z.]

 | Fine grain | Coarse grain
Power gate size | Worst case switching (30% area) | Actual switching (5% area)
Gate control slew rate | Always-on buffer network | Always-on buffer by abutment
Simultaneous switching capacitance | No issue | Needs to be addressed
Power gate leakage | 30% | 5%
12. October 24, 2007
12
PSO in std cell based design (contd..)
[Figure: power domains (D: Active; PD3: Shut down) with Vdd, VddC and Vss rails, flops (FF), state retention (SRP G FF), isolation (Iso., iso_en) and shutoff/PSE control.]
[Figure: column-style switch placement in Switchable Power Domain (PD1) with Column Pitch (200um), Left Offset (150um) and an enable chain from En_in to En_out_1, En_out_2, En_out_3. Note: switch cell has 2 buffers built-in with different directions.]
• Filler
– Forms contiguous ring
– Prevents additional leakage
• Breaker
– Divides into separate gate control groups
– Used with feed-thru enable signal
• Corner
– Acts like corner cell in pad ring
• Buffer-only (no switch) / switch-only (no buffer)
– Allows flexible control of buffer tree
13. October 24, 2007
13
PSO in std cell based design (contd..)
• Ring
– Ring(s) of switches enclose the power domain fully or partially
– Switches placed outside the power domain
– Switch cell treated as hard macro
– Often used with hard macros (not allowed to touch inside)
– More IR drop
– Better current distribution
• Column
– Columns of switches inside power domain
– Switches placed in the standard cell rows
– Switch is a standard cell
– Often used inside hierarchical (soft) blocks
– Lower IR drop
– More prone to rush current issue
– Needs careful EM checks
14. October 24, 2007
14
PSO in std cell based design (contd..)
• Key concerns in PSO design apart from PSO insertion
– Power up and rush current
• Dynamic IR drop analysis is becoming a must
• Optimum number of switches
• Smooth power up
– Verification
• RTL simulations
• Low power insertion checks
• CPF verification
– DFM
• More power rails
• Stacked via requirements
• EM
– Testability
• Coverage on the logic on the Restore and Save signals
– ESD
– IR drop aware timing analysis
15. October 24, 2007
15
PSO in std cell based design (contd…)
Low Power Insertion Checks
[Figure: example power domain crossings; labels: PDM1, PDM2, OFF, ON, LH, ISO, ISO_EN, PD, PMM, iA, iB, 1.2V, 0.8V, Good, Missing.]
Structural/Rule Checking
• User defines rules for crossings, isolation type, and location
• Conformal LP reports missing or redundant isolation/level shifter cells
• Conformal LP reports wrong isolation cell type
• Conformal LP reports bad level shifter direction
• Conformal LP reports wrong isolation cell / level shifter domain location
16. October 24, 2007
16
PSO in std cell based design
• Equivalence checking for Low Power design
– Ensure low power optimizations do not introduce logical errors
– Verify gated clocks, gated signals, de-cloning, and re-cloning of
gated clocks
– Check State Retention mapping from RTL to gate
– Check corresponding presence of Isolation and level shifter during
implementation
[Figure: low-power implementation flow checked with Conformal Low Power. Steps shown: Top-down Single-pass Synthesis; Silicon Virtual Prototype; Power Grid Synthesis; Switch cell Insertion (for MTCMOS); Placement including SRPG/Level shifters/Isol. cells; Power Routing; Low Power Clock Tree Synthesis; Domain-aware Post-CTS Optimization; Domain Aware NanoRoute; IR-Aware Timing/SI Opt.; Decap insertion; Sign-off.]
• Power domain structural and functional checks
– Ensure proper insertion of low power cells
– Ensure proper connectivity of low power cells
– Formally validate isolation function
– Formally validate state retention function
– Supports both logical and physical (power aware) netlists
• Transistor Electrical Verification
– Detect Sneak (leakage) paths across power domain boundaries
Low Power Insertion Checks
17. October 24, 2007
17
PSO in std cell based design (contd...)
Low Power Test
Test the Low Power Design, Reduce Power During Test
• Power Aware DFT
– Insert the required power-aware test DFT
– Test Access Mechanism (PTAM)
– Power-aware scan chains
• Power-Aware Test Model
– Encounter Test Model has test modes that reflect power modes
– Power domains verified for isolation and scan integrity
– ATPG can process each power mode
• Reduce Power during Test
– Low power scan vectors reduce scan-shift power
– Runtime MBIST scheduling reduces memory test power
– Limited pin testing reduces IO power switching
• ATPG for Power Structures
– Level shifters are tested
– Isolation logic stressed
– Retention flops
[Figure: chip (Top) with blocks Mem, PMU and Core in power domains PD1 to PD4, with state retention (SR), isolation (ISO) between domains A and B at voltages v1 and v2, and PTAM.]
18. October 24, 2007
18
PSO in std cell based design (cont…)
Power Analysis
• Power-gating – goals/tasks
– Power-switch on – overall IR drop
• Modelling the power-switch as on
• Running IR drop on entire power-grid, both global and switched at once
– Power-up
• Simulate as the power-switch is turned on
• Capture power-switch current behavior
– IR drop effect on the global grid and neighbors when a block is powering up
• Use captured current behavior from previous step and feed into rail analysis
19. October 24, 2007
19
PSO in std cell based design (cont…)
Power Analysis
• Impact on IR drop and EM
– Power switches modeled as resistors in power grid view
• Solution flags if switches enter saturation (I/Idsat – PI)
• Support steady-state on and off
• Off-state – use leakage value of switch
• Power Consumption of steady-state on and off
– Power savings in different modes
• Power-up analysis
– Fastmos simulator used for power-up simulation (UltraSim)
– Dynamic currents captured through power switches
• Impact of power-up on global grid
– Dynamic VSDG rail analysis uses captured currents from power-up analysis to
show impact of power-up on surrounding logic
20. October 24, 2007
20
How Many Power Switches?
• Two-part approach
1. Steady state analysis
– To monitor IR drop through switches
– VoltageStorm analyzes for IR drop
– VoltageStorm reports power switches
operating in saturation
2. Dynamic analysis
– To monitor control power ramp-up
– VoltageStorm reports block
“power-on” time
• Too fast: risk of latch-up
• Too slow: limits performance
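The steady-state part of this question can be approximated with a first-order estimate before running a rail-analysis tool. The sketch below is an assumption-laden illustration, not the VoltageStorm method: it treats N identical switches as parallel resistors and also keeps each switch below a saturation current. All names and numbers are hypothetical.

```python
import math


def min_switch_count(i_domain, r_on, v_drop_budget, i_sat):
    """Smallest number of parallel switches that keeps the IR drop across
    the switches within budget and each switch out of saturation."""
    by_ir_drop = i_domain * r_on / v_drop_budget   # from I * (R_on / N) <= budget
    by_saturation = i_domain / i_sat               # from I / N <= I_sat
    return math.ceil(max(by_ir_drop, by_saturation))


# Assumed example: 0.5 A domain current, 20 ohm per switch,
# 20 mV allowed drop, 2 mA saturation current per switch.
n = min_switch_count(i_domain=0.5, r_on=20.0, v_drop_budget=0.02, i_sat=0.002)
print(f"switches needed (first-order estimate): {n}")
```

An estimate like this only bounds the steady-state requirement; the dynamic analysis above is still needed to check that turning all of those switches on does not ramp the domain too quickly.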
21. October 24, 2007
21
Block Power-up/Down Analysis and Global Grid Verification
1. Create circuit netlist
– Inputs clamped
– Outputs correctly loaded
2. Simulate with UltraSim
3. Create Dynamic Power Grid Views
– Capture dynamic current in PGV
4. Analyze top level grid in VSDG
– Load full-chip power RC network with PGVs and analyze
[Figure: logic block with Control and VDD connections and its circuit netlist, shown at each step.]
22. October 24, 2007
22
MTCMOS Power-On
• PowerMeter generates data to drive spice
simulation using Ultrasim
– Netlist sensitized to the virtual power domain
• Use existing sub-circuit netlist
• Generate sub-circuit netlist from .cl
– Signal loading dspf (lumped C)
– Voltage Source file
– Template Stimulus file
• DC voltages
• PWL for control logic – derived from TWF file
• QX can generate RC network of Virtual power net
– Potential capacity limitations
– Analyse to see Ton differences
– Not used to date
• QX generates RC network of control signals
– Important to capture delay in controlling switches
• User simulates power-on conditions
– Analyzes ramp-up time to steady-state
• UltraSim also captures current behavior through
power-transistors
– Leverages existing UltraSim commands used
within integration inside VST (.usim_ir)
– Generates binary current data files (.pti)
[Figure: data flow. PowerMeter and QX produce the RC grid, netlist, signal loading, voltage sources and template stimulus; these feed the top-level circuit file simulated by UltraSim, which produces power-transistor dynamic currents (pti) and Spice waveforms as results.]
23. October 24, 2007
23
PSO for memories
• Why memory shut-off
– On-chip memory is increasing
• More memory results in higher leakage
– The activity factor of large memories is low, so active power is low
– Memories already use higher-L devices (lower sub-threshold leakage)
– Below the 65nm process node, junction leakage starts to become the
dominating factor
Reducing standby/average power by powering down is absolutely necessary
24. October 24, 2007
24
PSO for memories
• Memory shut-off can be
– Selective shut-off
– Retention memory
– Complete memory shut-off
• Memory shut-off implemented at SOC level
– Tools are competent enough for implementation
• Key challenges
– Performance hit
– RTL functional verification
– Yield is an issue
– Testing is a big issue
– Support for the IR drop aware timing models
25. October 24, 2007
25
IO power shut-off
• IO’s are to be grouped together based on architecture
– Set of IO voltages can be shut-down
• Issues
– Board design and pad selection
– ESD
26. October 24, 2007
26
Dynamic Voltage Frequency Scaling Requires Multi-Mode Analysis
• Multiple modes need to be analyzed/optimized for multiple corners
– Setup analysis for (WC, 1,125C) corner
• Multiple constraints (.sdc)
– Example: baseline.sdc, ios.sdc, dull.sdc, drowsy.sdc
• Libraries
– stdcell_1.08sl.lib, stdcell_0.9sl.lib, stdcell_1.08fs.lib, stdcell_0.9fs.lib
[Table: supply voltage and clock frequency per mode (Baseline, Slow, Dull, Drowsy, Standby) for the CORE, DROWSY and DULL domains; entries are 1.08V at 125MHz, 0.9V at 66MHz, or 0.0V.]
27. October 24, 2007
27
DVFS: Multi-Mode Multi-Corner Flow
• Create library set
– The library set can be a single library or a collection of libraries (ECSM)
• Define various RC corners
– Specify PVT condition for each corner. Specify spef for each corner
• Define constraint modes
– Specify SDC file for each mode. Same SDC file may be used or specify 1 SDC file per domain
• Create analysis views
– Associate a corner with a mode; design may have 5 corners and 3 modes, but only 10 views
• optDesign/timeDesign
– Run optimization and timing checks for concurrent handling of views
Primary Concerns:
1. Timing Closure
2. Verification
3. Mixed Scenario for Power Saving (DVFS and PSO together)
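The relationship between corners, modes and analysis views can be sketched as data. This is a tool-independent illustration, not the syntax of any implementation tool; the corner names and the choice of which pairs to keep are hypothetical, and only the mode names echo the .sdc files mentioned earlier.

```python
from itertools import product

# Hypothetical corner names; the slides only state that there are 5 corners.
corners = ["wc_125c", "wc_m40c", "bc_125c", "bc_m40c", "typ_25c"]
modes = ["baseline", "dull", "drowsy"]

# Every corner/mode combination that could in principle be analyzed.
all_pairs = list(product(corners, modes))

# Assumed selection: worst-case corners are checked in every mode,
# the other corners only in the modes where they matter.
modes_per_corner = {
    "wc_125c": ["baseline", "dull", "drowsy"],
    "wc_m40c": ["baseline", "dull", "drowsy"],
    "bc_m40c": ["baseline", "dull"],
    "bc_125c": ["baseline"],
    "typ_25c": ["baseline"],
}
views = [(c, m) for c, ms in modes_per_corner.items() for m in ms]

print(f"possible combinations: {len(all_pairs)}")
print(f"analysis views actually created: {len(views)}")
```

Keeping the view list smaller than the full cross product is what makes concurrent optimization over all views tractable.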
28. October 24, 2007
28
Advanced Low Power Techniques
Pulsed Latch Design Methodology
• Traditional FF is replaced with a pulsed latch
• Pulse generator is shared by several pulsed latches
• Dummy clock delay cell is used to balance the clock tree
[Figure: timing waveforms (cp, d, q) for a traditional register and for a pulsed latch driven by a pulse clock; clock tree with pulse generator, pulsed latches, dummy delay and negative edge FF memory.]
29. October 24, 2007
29
Advanced Low Power Techniques (Contd..)
Pulsed Latch: Results
• 25% active power reduction by swapping to pulsed latch
– 50% of active power is consumed by FFs; pulsed latches cut this in half
• Power consumption overhead: ~5%
– Slew control after pulse generator cell
• Slew needs to be faster on the pulse clock tree
– Pulse generator cell insertion (addition)
• Required number of PG cells is controllable
[Figure: clock-tree image showing the general clock-tree structure and the pulse generator insertion point; latency control: slow slew; skew control: fast slew.]
30. October 24, 2007
30
Advanced Low Power Techniques (Contd..)
Low Power Arithmetic Units:
Delay, power, PDP and area of 16-bit adders normalized to the delay, the power, the PDP and the area, respectively, of the Ripple Carry Adder

Topology | Delay | Power | PDP | Area
Ripple Carry | 1 | 1 | 1 | 1
Constant Block Width Carry Skip | 0.56 | 1.06 | 0.59 | 1.27
Variable Block Width Carry Skip | 0.44 | 1.29 | 0.57 | 1.88
Carry Look Ahead | 0.44 | 1.59 | 0.70 | 2.04
Carry Select | 0.36 | 2.24 | 0.81 | 3.38
Conditional Sum | 0.41 | 3.18 | 1.30 | 4.38

Delay, power, PDP and area of 16-bit multipliers normalized to the delay, the power, the PDP and the area, respectively, of the Array Multiplier

Topology | Delay | Power | PDP | Area
Array | 1 | 1 | 1 | 1
Split Array | 0.68 | 0.87 | 0.59 | 1.43
Wallace Tree | 0.58 | 0.74 | 0.43 | 1.93
Modified Booth | 0.49 | 0.95 | 0.47 | 2.02

Source: T. Callaway and E. Swartzlander, ”The power consumption of CMOS adders and multipliers”
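The power-delay product (PDP) figures can be cross-checked and ranked with a few lines of code. The sketch below copies the normalized delay, power, PDP and area values from the adder and multiplier data above; the helper names are hypothetical, and the tolerance only accounts for the two-decimal rounding of the published numbers.

```python
# (delay, power, pdp, area), normalized to the first entry of each group.
ADDERS = {
    "Ripple Carry": (1.00, 1.00, 1.00, 1.00),
    "Constant Block Width Carry Skip": (0.56, 1.06, 0.59, 1.27),
    "Variable Block Width Carry Skip": (0.44, 1.29, 0.57, 1.88),
    "Carry Look Ahead": (0.44, 1.59, 0.70, 2.04),
    "Carry Select": (0.36, 2.24, 0.81, 3.38),
    "Conditional Sum": (0.41, 3.18, 1.30, 4.38),
}
MULTIPLIERS = {
    "Array": (1.00, 1.00, 1.00, 1.00),
    "Split Array": (0.68, 0.87, 0.59, 1.43),
    "Wallace Tree": (0.58, 0.74, 0.43, 1.93),
    "Modified Booth": (0.49, 0.95, 0.47, 2.02),
}


def pdp_consistent(table, tol=0.01):
    """Check that each listed PDP equals delay * power within rounding."""
    return all(abs(d * p - pdp) <= tol for d, p, pdp, _ in table.values())


def lowest_pdp(table):
    """Name of the topology with the smallest power-delay product."""
    return min(table, key=lambda name: table[name][2])


print(pdp_consistent(ADDERS), pdp_consistent(MULTIPLIERS))
print("best adder by PDP:", lowest_pdp(ADDERS))
print("best multiplier by PDP:", lowest_pdp(MULTIPLIERS))
```

Ranking by PDP rather than by power alone shows why a topology that burns more power than the baseline can still be the more energy-efficient choice per operation.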
32. October 24, 2007
32
Advanced Low Power Techniques (Contd..)
Double-Edge Triggered F/Fs
• Double-edge triggered F/Fs (DETFF) can “ideally” save 50% of clock network power
by halving the required clock frequency
• However, the stringent 50% duty-cycle constraint on the clock and the area overhead of
DETFFs can significantly offset the amount of power saved
• Slower than normal F/Fs due to increased internal and/or output node capacitance
[Figure: clock for single-edge F/F with period T; clock for DETFF with period 2T and 50% duty-cycle.]
33. October 24, 2007
33
Advanced Low Power Techniques (Contd..)
Several other techniques are in use or under exploration:
• Thermal Throttling
• Clock Swing Controls
• Clock-on Demand
• Dynamic Threshold
• Generic Bus power reduction IPs
36. October 24, 2007
36
Development goals
• ARM 1136JF-S IC
– Power optimization methodology leverageable to synthesized digital designs
– Collaborative development: Silicon design chain (Applied Materials, ARM,
Cadence, TSMC)
• ARM 1136JF-S IC PSO
– Power switch-off (PSO) enhancement: Methodology and implementation
• ARM 1176JZF-S IC
– PSO and dynamic voltage and frequency scaling (DVFS) enhancement:
Methodology and implementation
– Facilitate comprehensive methodology across design, verification and
implementation
• Power Forward Initiative (Common Power Format, CPF)
• ARM, AMD, ATI, Applied Materials, Cadence, Calypto, Freescale, Fujitsu, Golden Gate
Technology, NEC Electronics, NXP, Sequence, TSMC
37. October 24, 2007
37
ARM1136JF-S IC architecture
• ARM1136JF-S microprocessor
– 16k I+D cache, 16 kB TCM; Tag RAMs, TLBs
– ARM, Thumb, DSP instructions; Java
• ETM11 trace macro, ETB11 trace buffer
• Advanced high-performance bus (AHB)
– Core AHB Lite ports AHB I/F (pin access.)
– Access to 128 KB on-chip test RAM: Enable concurrent data
transfers from any four ports
[Figure: chip floorplan with Trace, Full AHB, Fetch and LSU ports; 1V VDD domain (~100K cells + 44 SRAMs), 0.8V VDD domain (~200K cells) and ~3,400 voltage level-shifting cells.]
• 300 K standard cell instances; 22M
transistors; 44 SRAMs
• IC: 355 MHz typical (90nm standard
CMOS: TSMC 90G)
• Dual VDD domains, dual VT library
38. October 24, 2007
38
Design methodology overview (1)
• Microprocessor verification
– Set microprocessor code,
memory configurations
– Verify RAM functionality in 90nm process
– Verify microprocessor functionality (RTL)
• Test cases (135K vectors)
• Generated vector sets are subsequently used for power
dissipation analysis
• VCD and TCF formats
– Fully verified RTL “golden reference” for Regression
tests / functional verification
• ARM1136JF-S IC
– VDD domain selection and voltage level
shifting cells (VLS) design considerations
– MSV RTL synthesis
– Clock gating
– Timing closure in multi-VDD designs
– Dynamic/static IR drop analysis/optimization
– System-level validation
Timing, Power and Area Optimization
39. October 24, 2007
39
Design methodology overview (2)
• ARM1136JF-S IC PSO
– PSO design, verification
– Structured PSO ring methodology
– VLS/isolation cells insertion in synthesis
– Automated placement / insertion: VLS cells, switch cells,
state retention registers
– Automated power stitching
– Automated multi-domain clocks
– Power switch-off, switch-on voltage drop
and transients analysis
• ARM1176JZF-S IC
– PSO management, verification
– Integrate dynamic voltage and frequency scaling function
(DVFS)
– Physical synthesis / optimization and timing analysis
(DVFS)
– Functional integrity verification and test insertion with
power-optimization features
• Vsoc, Vram 1.0V libraries; Vcore 0.8V libraries;
~800 test cases
Timing, Power and Area Optimization
40. October 24, 2007
40
Multiple Supply Voltage (MSV) RTL synthesis
• Multiple supply voltage synthesis
– Newly-developed technology
– Single-pass concurrent optimization for timing, area and power
– 0.8 and 1.0 VDD domains, dual-VT cell libraries
• Power optimization in synthesis
– Logic restructuring
– Logic resizing (before clock tree synthesis)
• Buffer removal/resizing
– Transition rate buffering (buffer slow transition nets)
• Minimize the duration in which both pFET and nFET conduct
simultaneously
– Pin swapping
• Apply high transition rate signal nets to low capacitance inputs
• ARM1136JF-S IC cells: 62% in 0.8V, 38% in 1.0V
[Figure: pin swapping (CACC) example on gates with inputs A, B, C and output X; buffer introduced to reduce slew.]
41. October 24, 2007
41
VDD domains, clock gating
• 0.8V, 1.0V VDD domains
– Analyze standard cells delay, leakage, standby and
dynamic power (2.5x delta)
– Adequate performance for timing critical nets
– Customization further improvements feasible
[Figure: cell delays (normalized).]
• Architectural clock gating included in uP RTL
• Automated design flow adds additional clock gating
– Inferred from RTL through low-power synthesis
– ~1,000 clock gated cells identified and managed; 85% of
registers gated
– Shuts off dynamic current in quiescent logic
• Clock decloning: 1,112 → 703 cells (1136 IC)
– Move clock gating to the highest hierarchical node of the logic tree
to reduce power and insertion delay
42. October 24, 2007
42
MSV electrical/timing closure (1)
• Automated (VLS) insertion
– For nets traversing VDD domains
– Align cells to avoid n-well spacing violations (domain
perimeter placements)
– Automated multi-VDD power distribution and cell
placements, antenna diode insertion
– ARM1176JZF-S IC: Automated in synthesis
• VLS placement directly affects electrical
performance
– Optimal or detoured routing
– Power-supply-aware timing and multi-VDD supply
constraints drive placement
– ARM1136JF-S IC: Netlist modified to insert VLS cells
where needed
– ARM1136JF-S IC PSO, ARM1176JZF-S IC: Automated
VLS cells insertion, placement, timing
43. October 24, 2007
43
MSV electrical/timing closure (2)
• Cell substitution with timing constraint
– Replace standard-VT with high-VT cells
• Net by net basis; same footprint as original cell
• Signal integrity addressed within PR
– ~10 of 500K nets required post-layout optimization
• Effective current source model (ECSM)
instance-specific multiple VDD delay
calculation
– Standard cell libraries characterized for multiple VDD
values at outset
– Numerical model 2% deviation vs. full circuit
simulation
[Figure: distribution (%) of SPICE vs. ECSM results; test layouts with cells A, B, C, D, segment lengths X mm. and Y mm., plotted against length and Y/X ratio.]
44. October 24, 2007
44
IR drop analysis and optimization
• Grid-specific resistor meshes
• Dynamic power (manage di/dt)
[Figure: dynamic IR drop analysis; labels: 22 mV, 1.0V VDD, 19 mV, 0.8V VDD, VSS.]
45. October 24, 2007
45
ARM1136JF-S IC validation
• ARM RealView® Validation
System (instrumented system)
• Run applications, measure performance
– ~15,000 system-level validation tests
– Linux (2.4.7, 2.4.19, v6 backport and 2.6.x), WinCE
.NET 4.2 and Symbian OS7 operating systems
– Applications: X-windows, Doom, Pocket Word, and
Pocket Explorer, etc.
• ~40% overall and 46% leakage power reduction
Dynamic Power Dissipation (mW/MHz)

IC Block | Sim. Baseline (90nm) | Sim. LP (90nm) | Meas. LP (90nm) | Meas. Power (130 nm; ARM)
Core | 0.28 | 0.14 | 0.10 | 0.60
Other | 0.36 | 0.32 | 0.21 |
Total | 0.64 | 0.46 | 0.31 |

[Figure: normalized power dissipation (%), Std. Power vs. Low Power, from 0% to 100%; series: Leakage (Total), Switching (Total), Leakage (Logic), Switching (Logic).]
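The dynamic power figures can be checked for internal consistency and turned into percentage reductions. The sketch below uses only the simulated and measured 90nm columns of the dynamic power dissipation data above; it covers dynamic power only, so it is not expected to reproduce the ~40% overall figure, which also includes leakage. The dictionary keys are hypothetical labels.

```python
# Dynamic power dissipation in mW/MHz (90nm columns only).
POWER = {
    "Core":  {"sim_baseline": 0.28, "sim_lp": 0.14, "meas_lp": 0.10},
    "Other": {"sim_baseline": 0.36, "sim_lp": 0.32, "meas_lp": 0.21},
    "Total": {"sim_baseline": 0.64, "sim_lp": 0.46, "meas_lp": 0.31},
}


def reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return 1 - after / before


# Core + Other should add up to Total in every column.
for column in ("sim_baseline", "sim_lp", "meas_lp"):
    total = POWER["Core"][column] + POWER["Other"][column]
    assert abs(total - POWER["Total"][column]) < 1e-9, column

for block, row in POWER.items():
    r = reduction(row["sim_baseline"], row["sim_lp"])
    print(f"{block}: simulated dynamic power reduction {r:.0%}")
```

The core accounts for most of the simulated dynamic saving, which is consistent with the low-voltage domain being applied to the core logic.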
46. October 24, 2007
46
ARM1136JF-S IC PSO design
• Automated PSO implementation
– PSO design, functional verification (VLS cells)
– Power, clock distribution
– Static and dynamic power analysis
• Structured ring methodology
– Filler, breaker, corner, switch- or buffer-only
[Figure: PSO domain layout at 1.0V. Switches and fillers form the ring around the switchable power domain with its internal power mesh and PSO switched blocks; the enable chain runs from En_in to En_out. Switch cell has 2 buffers built-in with different directions.]
48. October 24, 2007
48
MSV optimization
• Cross-domain timing optimization
– Automatically handle conditions shown
• Domain-aware clock tree synthesis
– Automatically handle multi-domain clocks
• Automatic insertion of state retention
registers
– RTL synthesis, implement., verification
– Capability not implemented in this work
[Figure: cross-domain paths between Power Domain 0.8V (Libraries A), Power Domain 1.0V (Libraries B) and 0.8V I/O, including don’t touch nets.]
[Figure: shutdown block with FF and SRPG FF cells, supplied by VDD (switched), VDDC (not switched), VSS and VSSC (not switched), with PG and RET control signals.]
49. October 24, 2007
49
ARM1176JZF-S IC architecture
• ARM1176JZF-S microprocessor
– 16k I+D cache, 16 kB TCM; Tag RAMs, TLBs
– ARM, Thumb, DSP instructions; Java, IEM
• ARM1176JZF-S IC
– ETM11 trace macro, ETB11 trace buffer
– AHB bus I/F through AXI to AHB bridge
[Figure: chip floorplan with Vsoc 1.0V, Vcore 0.8V and Vram 1.0V domains and voltage level-shifting cells.]
• 360K standard cell instances; 22M
transistors; 46 SRAMs
• IC: 340MHz typical (90nm standard
CMOS: TSMC 90G)
• 3 power domains defined
• Dual VDD domains, dual VT library
[Figure: ARM 1176_IC block diagram. ARM1176Main contains ARM1176JZFSImp (1176 core, Cache and TCM RAMs on Vram, VLS/Clamps, Dormant Mode Sequencer), ETM11 CS, ETBM11CS, ETB11 RAM, ETB11 MBIST, MBIST, TPIU and Validation Coprocessor; the TestChip level adds RAM, AXI, AXI to AHB Bridge, VIC, PLL, Clock Reset, JTAG/TAP, Boundary Scan and Test Logic. Interfaces: Peripheral AXI, DMA AXI, Data AXI, Instruction AXI, Clocks and resets, Debug interface, TAP I/F, Trace I/F, CP14 I/F.]
IARS: IEM Asynchronous Register Slices
50. October 24, 2007
50
Intelligent energy manager (IEM)
ARM1176JZF-S RTL structure
• ARM1176 IEM: Ease of implementation in present design methodologies
– Asynchronous between voltage domains at different voltages, frequencies
• IEM Asynchronous AXI Register Slices required
– Has logical partitioning for voltage domains
• No logic at the top-level of the design
– Has logical partitioning for level shifters
• Implementer must replace with specific library cells or rely on implementation tools to add
– Has separate clocks and resets per voltage domain
51. October 24, 2007
51
ARM1176JZF-S IEM configuration overview
• RAM Interface
– Clamps for dormant mode support
– Always synchronous
• IEM Register Slices
– Asynchronous for DVS
– Synchronous when Vsoc = Vcore
53. October 24, 2007
53
VLS and standard cells placement and clock design
• Leverage ARM1136JF-S IC PSO design methodology
– Automatic placement (at domain edge)
– Non-integral multiple height rows
• 7, 9, 11-track cells, etc. in the same design
• Clock skew: 122ps (worst-case, global)
[Figure: layout showing VLS cells (VCORE-VSOC), PSO cells and VLS cells (VRAM-VSOC).]
54. October 24, 2007
54
An effective power management solution
• Power Forward Initiative: Common Power Format (CPF)
– New method to capture design and constraint information
– Facilitates comprehensive methodology across design, verification, and implementation
– Enables automation and what-if exploration
– Collaboration/integration across design/supply chain
– Foundation for an integrated methodology
R. Goering, “EDA spec describes power” EETimes, May 22, 2006
55. October 24, 2007
55
Design methodology with CPF
[Figure: CPF-based flow from Spec and CPF through Design Creation (RTL coding, RTL+CPF), Verification (simulation, emulation, acceleration, formal analysis, coverage, testbench automation), Synthesis, Design for Test (DFT, ATPG), equivalence checking (EC), SDC constraint generation and validation, SVP / chip integration prototyping, and Physical Implementation (physical synthesis, routing, analysis, sign-off, LVS/DRC/Ext) to GDSII, with iteration at RTL+CPF and Gate+CPF.]
• Verify low power implementation
• MPD, MSV, DVFS
• Automatic partitioning of physical design
• Multiple supply voltage synthesis
• Level shifter and power gate insertion
• Automatic test scheduling ATPG for power gating cells
• Automatic scan stitching for power domains
56. October 24, 2007
56
Summary
• Power optimization methodology
delivered ~40% overall and
46% leakage power reduction (ARM1136JF-S IC)
– Single-pass synthesis with concurrent optimization (timing,
power, area); multi-VDD, multi-VT designs
• ARM1136JF-S IC PSO implementation
– Normalized ~98.5% (66x) reduction of leakage power in the
low power region (typical conditions)
– Automated PSO implementation
– Structured ring methodology
• ARM1176JZF-S IC development
– Dynamic voltage and frequency scaling enhancement
methodology and implementation
– Power optimization methodology enhancements
• IEM; synthesis, test, formal verification, clocks, timing
closure, electrical/physical design; CPF
[Figure: chip layout showing the PSO region, the VDD 0.8V and VDD 1.0V domains, and VLS cells.]
57. October 24, 2007
57
Acknowledgments and references
• Acknowledgments
– We thank C. Chu, A. Gupta, J. Goodenough, A. Harry, C. Hopkins, L. Jensen, T. Valind, L.
Milano, A. Iyer, P. Mamtora, J. Willis, M. McAweeney, R. Williams,
I. Devereux and the ARM Physical IP team for their contributions
• References
– A. Khan et al., “A 90nm Power Optimization Methodology with Application to the ARM 1136JF-S Microprocessor,”
In IEEE Journal of Solid State Circuits, Vol. 41, No. 8, pp. 1707 – 1717, August 2006
– A. Khan et al., “A 90nm Power Optimization Methodology and its’ Application to the ARM 1136JF-S
Microprocessor,” Proceedings of the IEEE Custom Integrated Circuits Conference, San Jose, CA, September 21,
2005
– Gartner- WW ASIC/ASSP, FPGA/PLD and SLI/SOC App. Fcst., 1Q04
– B. Calhoun, “Ultra-Dynamic Voltage Scaling Using Sub-threshold Operation and Local Voltage Dithering in 90nm
CMOS,” ISSCC, 2/05
– S. Henzler, “Sleep Transistor Circuits for Fine-Grained Power Switch-Off with Short Power-Down Times,” ISSCC,
Feb. 05
– http://www.arm.com/pdfs/DUI0273B_core_tile_user_guide.pdf.
– A. Khan et al., “Design and Development of 130-nanometer ICs for a Multi-Gigabit Switching Network System,”
CICC, Oct. 04
– D. Desharnais, ”Nanometer IC routing requires new approaches,” EEDesign.com, Dec. 03
– A. Khan et al., “A 150 MHz Graphics Rendering Processor with 256Mb Embedded DRAM,“ ISSCC, Feb. 2001
– G. Paul, et al., “A Scalable 160Gb/s Switch Fabric Processor with 320Gb/s Memory Bandwidth,” ISSCC, Feb. 04
58. October 24, 2007
58
PSO in std cell based design
[Figure: flow from RTL Model through Synthesis to Gate Netlist (logical netlist), then Place Route to Gate Netlist (physical netlist), with equivalence checking (EC) after each step.]
• Logical Netlist
– Level shifters: placement, location, connectivity
– Isolation cells: placement, isolation type, isolation function
– State retention cells: placement, retention function
– Miscellaneous: floating nets / pins
• Physical Netlist
– Level shifters: placement/location, power connectivity, level shifter function
– Isolation cells: placement/type, power connectivity, isolation function
– State retention cells: placement, power connectivity, retention function
– Miscellaneous: power switches, shorts between VDD/VSS