1) The document describes empirically derived power models for uncore elements like the Power Bus and memory controllers of IBM's POWER8 server processor.
2) Using a small set of activity markers like read, write, retry and snoop events along with microbenchmarks, the models can predict uncore power with up to 6% error.
3) These abstract power models allow more accurate dynamic power management by the chip compared to using a constant worst-case uncore power, potentially enabling a 5% CPU frequency boost.
Empirically Derived Uncore Power Models for Server Processor
1. International Symposium on Low Power Electronics and Design 1
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class Processor Chip
Hans Jacobson, Arun Joseph*, Dharmesh Parikh*, Pradip Bose, Alper Buyuktosunoglu
IBM Systems & Technology Group*
IBM T. J. Watson Research
Uncore Power: Overview
• Pre-silicon power modeling has primarily focused on the processor
cores. As designs evolved, attention has shifted to “uncore”.
• Abstractions are needed to characterize power-performance trade-offs.
• We examine the challenge of developing practical abstractions in
uncore power modeling in an industrial setting.
• We report a systematic methodology of abstractions in modeling,
with a focus on key uncore elements of the IBM POWER8 processor.
• We show that uncore elements can be modeled using a few activity
markers and a small set of microbenchmark stress test cases.
Use-case: Digital Power Proxies
• Without an uncore proxy, dynamic power management policy
conservatively estimates a constant, high power for the uncore.
– Reduces opportunity for maximizing performance/watt at chip level.
• An uncore proxy for POWER8 improves the accuracy of chip power estimation by 15%.
– Compared to core, L2 and L3 level proxies with uncore worst-case power.
• For scenarios where the uncore is largely idle this could translate to an
opportunity to boost the frequency by at least 5%.
– For a given chip power cap assuming chip-wide DVFS.
– With per-core DVFS control and the ability to shift power across core domains,
the boost in performance could be much higher.
• More use-cases: inductive noise trend analysis; correct decisions in the
choice of early-stage micro-architectural parameters.
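The frequency-boost argument above can be illustrated with a rough calculation. The sketch below assumes that core power scales with the cube of frequency under chip-wide DVFS (voltage tracking frequency) and that uncore power is independent of core frequency. Both are textbook simplifications, not the model used for POWER8, and the function name and all wattages are hypothetical.

```python
def frequency_boost(power_cap_w, core_power_w, uncore_worst_w, uncore_proxy_w):
    """Return the fractional frequency boost unlocked by an uncore proxy.

    Assumption (not from the slides): core power scales as f**3 under
    chip-wide DVFS, and uncore power does not change with core frequency.
    """
    # Power budget left for the cores under each uncore estimate.
    budget_worst = power_cap_w - uncore_worst_w
    budget_proxy = power_cap_w - uncore_proxy_w
    if budget_worst <= 0 or budget_proxy <= 0 or core_power_w <= 0:
        raise ValueError("budgets and core power must be positive")
    # Frequency scale factor allowed by each budget (P ~ f^3).
    f_worst = (budget_worst / core_power_w) ** (1.0 / 3.0)
    f_proxy = (budget_proxy / core_power_w) ** (1.0 / 3.0)
    return f_proxy / f_worst - 1.0


# Hypothetical numbers: 190 W cap, 150 W of core power at nominal frequency,
# 40 W worst-case uncore versus a 15 W proxy estimate for a mostly idle uncore.
boost = frequency_boost(190.0, 150.0, 40.0, 15.0)
print(f"frequency boost: {boost:.1%}")
```

With a constant worst-case uncore figure the cores get no headroom in this example; with the lower proxy estimate, the freed budget translates into a modest frequency increase because of the cubic relationship.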
Reference Power Modeling
• Based on the detailed reference chip power analysis tool chain used at IBM [1].
• Accuracy validated against POWER7+ hardware power.
• The workload-specific power and event counts form the data points we
use in generating an uncore abstract power model.
[Diagram blocks: Chip RTL, Workloads, RTL Simulator (Core Sim, Uncore Sim), Clock and Data Switching, Standard Cell Library, Contributor Based Cell Power Model Generation, IP Blocks, IP Block Power Abstract Generation, IP Power Abstracts, Chip Netlist, Chip Level Power Analysis, Power]
Figure: Reference power modeling methodology
[1] Dhanwada, N., et al. 2013. Efficient PVT independent abstraction of large IP blocks
for hierarchical power analysis. ICCAD, Nov. 2013.
Reference Power Abstraction
• Can be abstracted along several dimensions:
– The RTL simulator could be an early-stage microarchitecture-level
pipeline timing model;
– The workload could be a suite of representative loop kernels;
– The switching statistics could be reduced to a smaller subset;
– The circuit-level detailed analysis could be approximated by area or
gate count based analytical equations.
• Abstracted power models:
– RTL simulations provide data: power & high level event counts.
– Linear regression techniques are applied to the data from RTL simulations.
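The regression step can be sketched as follows. This is a minimal pure-Python ordinary least-squares fit, not the tool flow used at IBM; the function name `fit_linear_model`, the data layout, and the synthetic numbers are assumptions made for illustration.

```python
def fit_linear_model(event_counts, power):
    """Least-squares fit of power = c0 + sum(c_i * events_i).

    event_counts: one row of event rates per workload (hypothetical layout).
    power: reference power for each workload, e.g. from RTL-based analysis.
    Returns the coefficients [c0, c1, ..., cn].
    """
    rows = [[1.0] + list(r) for r in event_counts]  # prepend intercept column
    n = len(rows[0])
    # Build the normal equations (X^T X) c = X^T y as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * y for r, y in zip(rows, power))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("event columns are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Synthetic workloads: (reads, writes) per cycle, with power generated from
# the made-up relation 2.0 + 3.0 * reads + 5.0 * writes.
events = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (0.25, 0.75)]
power = [2.0 + 3.0 * r + 5.0 * w for r, w in events]
coeffs = fit_linear_model(events, power)
print([round(c, 6) for c in coeffs])
```

Because the synthetic power values are generated from a known linear relation, the fit should recover those coefficients; real simulation data would leave residual error of the kind the later slides report.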
Uncore Power Modeling
• IBM POWER7 L2 and L3 cache uncore elements constitute
20% of power for a TDP workload.
– Further include large macros that are shared by all chiplets.
• We focus on the path taken by memory requests.
– Starting at L3 to chip memory I/O links.
• We propose a seemingly drastic abstraction.
– 4 activity markers: reads, writes, retry and snoop events.
– Small set of carefully crafted micro-benchmarks.
– Average error: 1.4-2.4%.
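One way to write the abstraction described above is a linear model over the four activity markers, consistent with the linear regression mentioned earlier. The notation is illustrative, since the slides do not define symbols: $P_0$ stands for an activity-independent baseline, each $N$ for an event count or rate, and each $c$ for a fitted coefficient.

```latex
P_{\text{unit}} \approx P_0
  + c_{\text{rd}}\,N_{\text{rd}}
  + c_{\text{wr}}\,N_{\text{wr}}
  + c_{\text{retry}}\,N_{\text{retry}}
  + c_{\text{snoop}}\,N_{\text{snoop}}
```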
IBM POWER8
• 12 cores with 8-way simultaneous multi-threading (SMT) per core.
Total on-chip L2+L3 cache capacity is 102 MB.
• Fabricated using a 22nm CMOS SOI process. Die size of 649 mm². 4.2 billion transistors.
• Each core is an aggressive, wide-issue superscalar design, with 16 execution
pipelines for massive data crunching.
• Uncore supports a massive 7.6 Tb/s off-chip bandwidth including memory and SMP
links, PCIe links, an off-chip coherent accelerator interface, as well as on-chip
bus-attached data accelerators.
Figure: POWER8™ chip photomicrograph with superimposed demarcations to indicate
regions occupied by cores, L2/L3 caches, chip interconnect, memory controllers, etc.
IBM POWER8 Uncore
• 512KB private L2 per core, an 8 MB L3 instance per core.
• Memory stack separated into four on-chip MCUs, each partitioned into two
sub-controllers, for a total of 8 memory I/O links and an off-chip Centaur L4
buffer chip per link.
• PowerBus (PB) provides coherent communication support across the
cache-memory subsystem.
Figure: Block-diagram view of the chip, with clearly
defined “uncore” elements (highlighted in blue).
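The totals quoted for POWER8 follow from the per-core and per-controller figures, as the short check below shows. The constant names are illustrative; one memory I/O link per sub-controller is read from the bullet above.

```python
CORES = 12
L2_PER_CORE_MB = 0.5          # 512 KB private L2 per core
L3_PER_CORE_MB = 8.0          # 8 MB L3 instance per core
MCUS = 4
SUB_CONTROLLERS_PER_MCU = 2   # one memory I/O link per sub-controller

total_cache_mb = CORES * (L2_PER_CORE_MB + L3_PER_CORE_MB)
memory_links = MCUS * SUB_CONTROLLERS_PER_MCU
print(total_cache_mb, memory_links)  # prints: 102.0 8
```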
Simulation for Uncore
Table: Characteristics of workloads
used for power abstraction
Figure: Uncore simulation
environment
Power Bus Ramp
• The Power Bus Ramp (PBIEX)
– Interface between a chiplet and the Power Bus Unit.
– Buffers memory requests initiated by L3 (cache miss/flush) until the
Power Bus is available.
• Sends/receives memory requests to and from the Power Bus Unit (PBEH).
– Activity is foremost determined by the read/write bandwidth to and
from its associated chiplet.
• Handles coherence requests on the Power Bus.
– Activity is thus further determined by snooping activity resulting from
reads/writes originating from other chiplets.
Figure: PBIEX regression statistics and bar graph showing
error sources for predicted power for each workload.
Power Bus Unit
• Routing fabric between each chiplet and the memory/network controllers.
Performs coherency checks on each memory request received from the chiplets.
• Each L3 cache miss results in the PBEH broadcasting a request to each chiplet
to check whether some other L3 contains the requested cache line.
– If not, the request is forwarded to the correct memory controller.
• Communicates coherence requests and responses between chiplets as well as
memory requests to and from the MCU.
Figure: PBEH regression statistics and bar graph showing error
sources for predicted power for each workload.
Memory Controller Unit
• Interface between the PBEH and the high speed serial I/O links going to
and from memory.
• Each request is assembled into a transmission frame and sent over the link.
• Activity of the MCU is foremost determined by the read and write
bandwidth of the PBEH.
• MCU must also reject a request if it cannot handle more requests due to
full buffers.
– Activity is therefore also dependent on the number of such retries.
Figure: MCU regression statistics and bar graph showing error
sources for predicted power for each workload.
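The three unit descriptions suggest a per-unit model built from that unit's dominant activity markers. The sketch below shows how such models might be combined. The marker set for PBEH is an assumption (the slides name reads, writes and snoops for PBIEX, and reads, writes and retries for the MCU), and every coefficient is a hypothetical placeholder in arbitrary units, not a fitted value.

```python
# Activity markers per unit, following the unit descriptions above.
# The PBEH marker set and all coefficients are hypothetical placeholders.
UNIT_MODELS = {
    "PBIEX": {"base": 1.0, "reads": 0.8, "writes": 0.9, "snoops": 0.3},
    "PBEH":  {"base": 2.0, "reads": 1.1, "writes": 1.2, "snoops": 0.6},
    "MCU":   {"base": 1.5, "reads": 1.4, "writes": 1.6, "retries": 0.5},
}


def predict_unit_power(unit, activity):
    """Linear model: base power plus coefficient times marker rate."""
    model = UNIT_MODELS[unit]
    power = model["base"]
    for marker, coeff in model.items():
        if marker != "base":
            # Markers missing from the activity record count as idle.
            power += coeff * activity.get(marker, 0.0)
    return power


def predict_uncore_power(activity_by_unit):
    """Sum the per-unit predictions over the units supplied."""
    return sum(predict_unit_power(u, a) for u, a in activity_by_unit.items())


idle = {"PBIEX": {}, "PBEH": {}, "MCU": {}}
busy_mcu = {"MCU": {"reads": 0.5, "writes": 0.0, "retries": 0.2}}
print(predict_uncore_power(idle), round(predict_uncore_power(busy_mcu), 3))
```

Keeping the marker list per unit in one table makes it straightforward to add or drop a marker for a unit and refit without touching the prediction code.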
Abstract Model Conclusions
• Uncore elements can be modeled accurately even with seemingly drastic
abstractions in modern-day processors.
• The models are accurate for the abstraction level at which they are intended to be used.
– < 6% maximum error across the abstract uncore models.
– < 9% power difference observed for the minimum vs. maximum
address and data switching workloads.
• Future work: Model refinements to focus on DS events.
– Capture the degree of bit switching on addresses and data that
move through the uncore units.
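The average and maximum error figures quoted for the abstract models can be computed per workload as sketched below. The slides do not state the exact error definition, so absolute error relative to the reference power is an assumption here, and the power values are hypothetical.

```python
def percent_errors(predicted, reference):
    """Absolute prediction error per workload, as a percentage of reference."""
    return [abs(p - r) / r * 100.0 for p, r in zip(predicted, reference)]


def summarize(predicted, reference):
    """Return (average error %, maximum error %) across workloads."""
    errs = percent_errors(predicted, reference)
    return sum(errs) / len(errs), max(errs)


# Hypothetical reference and predicted power for three workloads.
reference = [10.0, 20.0, 40.0]
predicted = [10.2, 19.0, 40.0]
avg_err, max_err = summarize(predicted, reference)
print(f"average {avg_err:.2f}%, maximum {max_err:.2f}%")
```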
Summary & Conclusions
• Uncore power, and the identification of power reduction opportunities, is
a critical aspect of future power-efficient microprocessor design.
• We present a practical methodology for use in an industrial setting
for deriving abstract analytical power models for selected key
uncore elements.
• We show that even with very few power event markers and a small
set of stress marks, it is possible to develop accurate power models
for uncore elements of a modern-day chip.
• We quantify the accuracy such models have in providing improved
power proxies and predicting worst-case bounds on chip level
inductive noise in future technologies.