We present approaches for generating PVT-independent power abstracts for large IP blocks that are parameterized by multiple clock gating domains. The gating domain parameterization is achieved by attributing switching to each individual domain through a marking and tracing process, which removes the need for separate domain-by-domain simulation.
Experimental results on IP blocks of varying sizes from an industrial microprocessor design show the accuracy impact of the proposed approaches, while the increase in run time and model size stays within an acceptable range. As an extension, we are exploring approaches that preserve each domain independently, using formulations that construct clock gating domain conflict hypergraphs and color them to determine domain interactions.
Per domain power analysis
1. Efficient Techniques for Per Clock Gating Domain Contributor based Power Abstraction of IP Blocks for Hierarchical Power Analysis
Arun Joseph, Nagu Dhanwada, Spandana Rachamalla, William Dungan, Ricardo Nigaglioni
IBM Systems Group
2. Motivation
IP blocks are becoming larger with increased number of clock gating domains.
Much more aggressive solutions are being adopted for improving the clock gating of these very large IP blocks.
There are several workloads for which significant heterogeneity in both clock and data activity is seen across the multiple clock gating domains within an IP block.
3. Background: Contributor based Power Analysis Flow
[Figure: contributor based power analysis flow. Boxes: Cell Library; Library Characterization (Corner 1 ... Corner N); Contributor Power Model; Macro Power Abstract Generation (Macro/IP Block); Contributor based Macro Power Abstract; Chip Level Power Analysis (Chip; Corner 1 ... Corner N, Workload 1 ... Workload N); Input to Wafer Test, System Planning, Power Sorting and Binning.]
4. Background
Accurate and efficient full chip power analysis is an important step in the design of power efficient microprocessor and SoC chips.
A power model abstraction flow based on contributors is PVT-independent and enables efficient hierarchical chip level power analysis.
The dynamic power and leakage power model abstraction primarily targeted for full chip power analysis was presented in [3]. This prior work was based on parameterizing capacitance switching due to clock gating onto a single clock gate control for the entire macro.
This approximation to a single macro wide clock gate control works fairly well for full chip dynamic power analysis but there is a need to explore improved dynamic power abstraction techniques for enabling more accurate power analysis.
5. Main Idea: Multi Clock Gating Domain Abstract
[Figure: IP block containing clock gating domains d1, d2, d3 and d4.]

Single Clock Gate Control Base Power Abstract
Multiple clock gating domains in the IP block. Approximated to a single macro wide clock gate control as a part of the abstraction process.
IP Block Power Abstract Generation:
1. Case Setup
2. Simulation
3. Power Contributor Element Generation and Contributor Accumulation
IP Block Power Abstract:

Name           Weight                Activity factor(s)
AlwaysCeff
GatableCeff    (1 - clock_gating)
PiSfDepCeff                          input_switch_rate
LoSfDepCeff                          latch_output_switch_rate
PiLoXPCeff                           input_switch_rate, latch_output_switch_rate

Chip Level Power Analysis: Weights and activity factors set during activity extraction in chip level power analysis.

Multiple Clock Gate Domain Power Abstract
IP Block Power Abstract Generation:
1. Marking and Domain Identification
2. Case Setup
3. Simulation
4. Power Contributor Element Generation and Accumulation
IP Block Power Abstract:

Name                 Weight                   Activity factor(s)
AlwaysCeff
PiSfDepCeff                                   input_switch_rate
GatableCeff          (1 - clock_gating)
LoSfDepCeff                                   latch_output_switch_rate
PiLoXPCeff                                    input_switch_rate, latch_output_switch_rate
GatableCeff.d1       (1 - clock_gating.d1)
LoSfDepCeff.d1                                latch_output_switch_rate.d1
PiLoXPCeff.d1                                 input_switch_rate, latch_output_switch_rate.d1
GatableCeff.d2       (1 - clock_gating.d2)
LoSfDepCeff.d2                                latch_output_switch_rate.d2
PiLoXPCeff.d2                                 input_switch_rate, latch_output_switch_rate.d2
LoSfDepCeff.d1-d2                             latch_output_switch_rate.d1, latch_output_switch_rate.d2
PiLoXPCeff.d1-d2                              input_switch_rate, latch_output_switch_rate.d1, latch_output_switch_rate.d2

Chip level power analysis: Per clock gating domain activity extraction.
6. Multi Clock gate domain Abstraction: Three Variants
[Figure: abstraction flow. Boxes: Domain identification; Marking of domains; Domains combination list creation; Per case simulations; Per domain and Single domain ceffs computation and accumulation; Per domain IP Power abstract creation; Per domain Bill of materials file generation; No-sim based clock power only abstraction; Domain collapsing. Outputs: Domain parameterized clock power abstract; Domain parameterized clock and data power abstract.]
• Quick tracing based clock power only abstraction,
• Clock and Data Power abstraction based on domain merging using Domain combination lists,
• Domain collapsing for handling large extensively gated designs
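The marking and domain combination list steps can be illustrated with a small sketch. The slides do not describe the data structures or the tracing algorithm, so everything here is an assumption: the netlist is reduced to a hypothetical `fanin` map, latch outputs are marked with their gating domain in a hypothetical `latch_domain` map, and a net is attributed to the set of domains found in its combinational fan-in cone.

```python
from collections import defaultdict


def domain_combinations(latch_domain, fanin):
    """Group nets by the set of clock gating domains that can drive them.

    latch_domain: dict mapping a latch output net to its gating domain
                  (the "marking" step; hypothetical representation).
    fanin:        dict mapping a net to the nets that feed it through
                  combinational logic (assumed acyclic).
    Returns a dict mapping frozenset(domains) -> list of nets.
    """
    memo = {}

    def trace(net):
        # Walk back through the fan-in cone until marked latch outputs
        # (or unmarked primary inputs) are reached.
        if net in memo:
            return memo[net]
        if net in latch_domain:
            result = frozenset([latch_domain[net]])
        else:
            result = frozenset().union(*(trace(src) for src in fanin.get(net, [])))
        memo[net] = result
        return result

    combos = defaultdict(list)
    for net in set(fanin) | set(latch_domain):
        combos[trace(net)].append(net)
    return dict(combos)


# Illustrative netlist: q1 is a latch output in d1, q2 in d2, pi is a primary input.
example = domain_combinations(
    latch_domain={"q1": "d1", "q2": "d2"},
    fanin={"n1": ["q1", "pi"], "n2": ["q1", "q2"], "n3": ["n1", "n2"]},
)
```

In this toy netlist, `n1` lands in the single-domain group `{d1}`, while `n2` and `n3` land in the combination `{d1, d2}`, which corresponds to the `d1-d2` entries in the multi domain abstract.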
7. Experimentation
[Figure: validation flow. Workload driven path: Workload based Simulation; VHDL Sim Data; Waveform File; Activity Files; Gate Level Block (Workload driven model); Workload driven power. Abstraction path: Abstraction based Power Analysis; Gate Level Block Power Abstract; Activity Files (SAIF Like); Unit Level Activity Extraction and Power Rollup; Abstraction based Power. The two paths are kept in sync and their results are compared.]
Comparison of workload driven power simulation with the power abstract based estimation for the three approaches
8. Experimental Results
Comparison for Design D4. D4 has 87 latches, 2 clock gating domains, 4 domain combinations, ~12000 gates and nets.
Comparison for Design D3. D3 has 1200 latches, 21 clock gating domains, 83 domain combinations.
Comparison for Design D1. D1 has 640 latches, 9 clock gating domains, 44 domain combinations, ~13000 standard cell instances and nets.
Comparison for Design D2. D2 has 520 latches, 3 clock gating domains, 3 domain combinations, ~10000 gates and nets.

Approach                         Model Size Increase   Data Power %Error   Clock Power %Error   TAT Benefit
Single domain                    -                     -54.87              -7.26                2.02
No-sim clock                     1.17                  -54.87              -0.02                1.93
Domain combinations              2.03                  -16.30              -0.02                1.58
Domain combinations & collapse   1.13                  -17.41              -0.02                1.78

Approach                         Model Size Increase   Data Power %Error   Clock Power %Error   TAT Benefit
Single domain                    -                     -6.7                4.8                  1.82
No-sim clock                     1.03                  -6.7                0.3                  1.75
Domain combinations              1.12                  -0.7                0.3                  1.61
Domain combinations & collapse   1.12                  -0.7                0.3                  1.61

Approach                         Model Size Increase   Data Power %Error   Clock Power %Error   TAT Benefit
Single domain                    -                     -8.7                4.1                  2.35
No-sim clock                     1.06                  -8.7                0.3                  2.22
Domain combinations              1.98                  -2.3                0.3                  1.64
Domain combinations & collapse   1.08                  -4.4                0.3                  1.96

Approach                         Model Size Increase   Data Power %Error   Clock Power %Error   TAT Benefit
Single domain                    -                     -22.4               0.4                  1.51
No-sim clock                     1.10                  -22.4               0.4                  1.47
Domain combinations              1.25                  2.7                 0.4                  1.41
Domain combinations & collapse   1.25                  2.7                 0.4                  1.41
9. Conclusion
We presented approaches for generating PVT-independent power abstracts for large IP blocks that are parameterized by multiple clock gating domains.
The gating domain parameterization is achieved by attributing switching to each individual domain through a marking and tracing process, which removes the need for separate domain-by-domain simulation.
Experimental results on IP blocks of varying sizes from an industrial microprocessor design show that the domain combination based abstracts reduce the data power error relative to the single domain abstract on all four designs, while the increase in run time and model size stays within an acceptable range.
As an extension, we are exploring approaches that preserve each domain independently, using formulations that construct clock gating domain conflict hypergraphs and color them to determine domain interactions.
Editor's Notes
For instance, such analysis is required for the power sort process, which is used for determining product shipping frequencies.
The key enablers of this flow are the concept of contributor based power models [1, 2], an abstract definition and a method for generating such abstracts for complex IP blocks.
Base (Single Clock Gate Control) Abstraction:
The dynamic power abstraction introduced in [3] is performed in terms of the dynamic power contributors.
It characterizes power as a function of a clock_gating weight factor, input_switch_rate and latch_output_switch_rate activity factors.
The capacitance, together with the weight and activity factors (which are computed during higher level power analysis), is obtained by approximating all clock gating domains as a single macro wide clock gate control, and these values are then used to compute power.
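As an illustration of how the base abstract might be evaluated, the sketch below combines the five contributors from the base abstract table into one effective capacitance. The notes do not give the evaluation formula, so the sum-of-products form, the treatment of `PiLoXPCeff` as a cross product of the two activity factors, the `C * V^2 * f` power expression, and all numeric values are assumptions made for illustration.

```python
def base_effective_capacitance(ceff, clock_gating, input_switch_rate,
                               latch_output_switch_rate):
    """Combine base-abstract contributors into one switched capacitance.

    ceff is a dict keyed by contributor name. Each contributor is scaled
    by the weight or activity factor(s) listed for it in the abstract
    (assumed to multiply; the source does not state the formula).
    """
    return (
        ceff["AlwaysCeff"]
        + ceff["GatableCeff"] * (1.0 - clock_gating)
        + ceff["PiSfDepCeff"] * input_switch_rate
        + ceff["LoSfDepCeff"] * latch_output_switch_rate
        + ceff["PiLoXPCeff"] * input_switch_rate * latch_output_switch_rate
    )


def dynamic_power(ceff_total, vdd, freq_hz):
    """Assumed relation: P = Ceff * Vdd^2 * f (any 1/2 factor folded into Ceff)."""
    return ceff_total * vdd ** 2 * freq_hz


# Illustrative values only, in arbitrary capacitance units.
example_ceff = {
    "AlwaysCeff": 2.0,
    "GatableCeff": 10.0,
    "PiSfDepCeff": 4.0,
    "LoSfDepCeff": 6.0,
    "PiLoXPCeff": 1.0,
}
total = base_effective_capacitance(example_ceff, clock_gating=0.5,
                                   input_switch_rate=0.25,
                                   latch_output_switch_rate=0.5)
# total is 11.125 for these illustrative values
```

Because the contributors carry no voltage or frequency dependence in this form, the same abstract can be reused across corners by changing only `vdd` and `freq_hz`, which is one way to read the PVT-independent claim.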
Proposed Multi-Domain Abstraction:
In the proposed abstraction, there are clock and data Ceff components that correspond to the individual clock gating domains.
The per domain Ceff, along with weight and activity factors (computed on a per clock gating domain basis during higher level analysis) are used for hierarchical per clock gating domain power analysis.
This makes the analysis more efficient and accurate, and makes it usable for driving more aggressive clock gating in the logic design process.
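A per domain evaluation could look like the following sketch. The abstract is represented as a hypothetical dict keyed by `(contributor name, tuple of domains)`, mirroring names such as `GatableCeff.d1` and `LoSfDepCeff.d1-d2`. How the latch output switch rates of several domains are combined for a multi domain entry is not stated in the source, so it is left to a caller-supplied `combine` function, and `max` in the example is an arbitrary choice.

```python
def per_domain_effective_capacitance(abstract, clock_gating, latch_rate,
                                     input_rate, combine):
    """Evaluate a per clock gating domain abstract (illustrative sketch).

    abstract:     dict {(name, domains): ceff}, domains being a tuple such
                  as (), ("d1",) or ("d1", "d2") (hypothetical layout).
    clock_gating: dict domain -> fraction of time the domain is gated.
    latch_rate:   dict domain -> latch_output_switch_rate for that domain.
    input_rate:   input_switch_rate of the block.
    combine:      callable reducing a list of per domain rates to one
                  factor for multi domain entries (an assumption).
    """
    total = 0.0
    for (name, domains), ceff in abstract.items():
        if name == "AlwaysCeff":
            total += ceff
        elif name == "PiSfDepCeff":
            total += ceff * input_rate
        elif name == "GatableCeff":
            (domain,) = domains  # clock entries belong to exactly one domain
            total += ceff * (1.0 - clock_gating[domain])
        elif name == "LoSfDepCeff":
            total += ceff * combine([latch_rate[d] for d in domains])
        elif name == "PiLoXPCeff":
            total += ceff * input_rate * combine([latch_rate[d] for d in domains])
        else:
            raise ValueError(f"unknown contributor: {name}")
    return total


# Illustrative abstract with two domains; d1 is fully gated, d2 is active.
example_abstract = {
    ("AlwaysCeff", ()): 2.0,
    ("PiSfDepCeff", ()): 4.0,
    ("GatableCeff", ("d1",)): 6.0,
    ("GatableCeff", ("d2",)): 4.0,
    ("LoSfDepCeff", ("d1",)): 2.0,
    ("LoSfDepCeff", ("d2",)): 2.0,
    ("LoSfDepCeff", ("d1", "d2")): 2.0,
}
example_total = per_domain_effective_capacitance(
    example_abstract,
    clock_gating={"d1": 1.0, "d2": 0.0},
    latch_rate={"d1": 0.0, "d2": 0.5},
    input_rate=0.25,
    combine=max,
)
```

With a single macro wide control, the gated clock capacitance of `d1` and the active clock capacitance of `d2` would be blended into one weight; here each domain contributes according to its own activity.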
Workload driven gate level simulation (GLS) based power is compared with abstraction (ABS) based power for validation of proposed abstractions.
GLS based power: A netlist for an IP block is simulated for several thousand cycles by applying switching patterns extracted from RTL simulations of higher level realistic workloads. The switching at every net is computed to get an average power dissipated for the simulated switching patterns.
ABS based power: The same netlist is simulated to generate the different power abstracts. For the same switching patterns, the switching activity factors, including the clock gating factor and the switching factors at the primary inputs and latch outputs, are computed on a per gating domain basis. The computed activity factors are applied to the generated power abstract (base, and per clock gating domain) to calculate ABS based power.
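The comparison of the two power numbers can be expressed as a signed percentage error. The sign convention below, ABS minus GLS relative to GLS, is an assumption; the notes do not state which way the reported %Error values are computed. The input values are illustrative, not measured data.

```python
def percent_error(abs_power, gls_power):
    """Signed error of abstraction based power relative to GLS based power.

    Negative values mean the abstract underestimates the GLS reference
    (under the assumed sign convention).
    """
    if gls_power == 0:
        raise ValueError("GLS reference power must be non-zero")
    return 100.0 * (abs_power - gls_power) / gls_power


# Illustrative numbers: the abstract reports 90 units against a GLS value of 100.
example_error = percent_error(abs_power=90.0, gls_power=100.0)
```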
All experiments were run on a 24-core 2.6 GHz Xeon machine running RHEL 5 with 256 GB of memory. Designs from both core and uncore units of the microprocessor were studied. A variant of a thermal design point workload is used for comparison.
“TAT Benefit” is the improvement seen in runtime while using ABS based power computation, when compared against the runtime of GLS based power computation.
“Model size increase” here refers to the ratio of the size of the per clock gating domain power abstract (in bytes when stored on the disk) to the size of the base power abstract.
The domain collapse procedure is triggered when the size of the domain combinations list is greater than a certain threshold (DT); the list is then collapsed to DX% of its original size. Both DT and DX are programmable, and they were empirically chosen as DT=10 and DX=10%.
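The trigger and target size of the collapse step can be sketched as follows. Only DT and DX come from the notes; which combinations are merged is not described, so this sketch assumes that the combinations with the largest accumulated Ceff are kept and the remainder are merged into one entry covering the union of their domains. Rounding the target size up is also an assumption.

```python
import math


def collapse_domain_combinations(combo_ceff, dt=10, dx_percent=10):
    """Collapse a domain combinations list when it exceeds DT entries.

    combo_ceff: dict mapping frozenset(domains) -> accumulated Ceff.
    Returns a new dict with roughly dx_percent of the original entries.
    Total Ceff is preserved; the merge policy is an illustrative assumption.
    """
    if len(combo_ceff) <= dt:
        return dict(combo_ceff)  # below threshold: nothing to collapse

    keep = max(1, math.ceil(len(combo_ceff) * dx_percent / 100))
    ranked = sorted(combo_ceff.items(), key=lambda kv: kv[1], reverse=True)

    # Keep the largest (keep - 1) combinations and merge the rest into one.
    collapsed = dict(ranked[:keep - 1])
    rest = ranked[keep - 1:]
    merged_domains = frozenset().union(*(domains for domains, _ in rest))
    merged_ceff = sum(ceff for _, ceff in rest)
    collapsed[merged_domains] = collapsed.get(merged_domains, 0.0) + merged_ceff
    return collapsed


# Illustrative list of 20 single-domain entries with Ceff 1.0 .. 20.0.
example_combos = {frozenset({f"d{i}"}): float(i + 1) for i in range(20)}
example_collapsed = collapse_domain_combinations(example_combos)
```

With DT=10 and DX=10%, a list of 20 combinations collapses to 2 entries in this sketch, while a list of 10 or fewer is returned unchanged.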