Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class Processor Chip

International Symposium on Low Power Electronics and Design
Hans Jacobson, Arun Joseph*, Dharmesh Parikh*, Pradip Bose, Alper Buyuktosunoglu
IBM T. J. Watson Research
* IBM Systems & Technology Group

Uncore Power: Overview
• Pre-silicon power modeling has primarily focused on the processor
cores. As designs have evolved, attention has shifted to the “uncore”.
• Abstractions are needed to characterize power-performance trade-offs.
• We examine the challenge of developing practical abstractions in
uncore power modeling in an industrial setting.
• We report a systematic abstraction methodology for modeling,
with a focus on key uncore elements of the IBM POWER8 processor.
• We show that uncore elements can be modeled using a few activity
markers and a small set of microbenchmark stress test cases.

Use-case: Digital Power Proxies
• Without an uncore proxy, dynamic power management policy
conservatively estimates a constant, high power for the uncore.
– Reduces opportunity for maximizing performance/watt at chip level.
• An uncore proxy for POWER8 improves the accuracy of chip power estimates by 15%.
– Compared with core, L2 and L3 level proxies combined with worst-case uncore power.
• For scenarios where the uncore is largely idle this could translate to an
opportunity to boost the frequency by at least 5%.
– For a given chip power cap assuming chip-wide DVFS.
– With per-core DVFS control and the ability to shift power across core domains,
the boost in performance could be much higher.
• More use-cases: inductive noise trend analysis, and correct decisions in
the choice of early-stage micro-architectural parameters.
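As a rough illustration of the proxy use-case, the sketch below compares the power headroom a management policy sees under a chip power cap when the uncore is charged at a constant worst-case value versus a proxy estimate. The function name and all wattages are hypothetical placeholders, not POWER8 data, and the mapping from headroom to a frequency boost is deliberately left out.

```python
def headroom_watts(power_cap_w, core_cache_proxy_w, uncore_estimate_w):
    """Power left under the cap after the estimated chip power is charged."""
    return power_cap_w - (core_cache_proxy_w + uncore_estimate_w)


# Hypothetical values, for illustration only.
POWER_CAP_W = 200.0
CORE_CACHE_PROXY_W = 150.0   # sum of core, L2 and L3 proxies
UNCORE_WORST_CASE_W = 40.0   # constant, conservative uncore assumption
UNCORE_PROXY_W = 25.0        # proxy estimate while the uncore is largely idle

without_proxy = headroom_watts(POWER_CAP_W, CORE_CACHE_PROXY_W, UNCORE_WORST_CASE_W)
with_proxy = headroom_watts(POWER_CAP_W, CORE_CACHE_PROXY_W, UNCORE_PROXY_W)
print(f"headroom without uncore proxy: {without_proxy:.1f} W")
print(f"headroom with uncore proxy:    {with_proxy:.1f} W")
```

The extra headroom is what a policy could spend on raising frequency. How much frequency it buys depends on the chip's power-frequency behavior, which this sketch does not model.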

Reference Power Modeling
• Based on the detailed reference chip power analysis tool chain used
at IBM [1].
• Accuracy validated against POWER7+ hardware power.
• The workload-specific power and event counts form the data points
we use in generating an uncore abstract power model.
Figure: Reference power modeling methodology. Blocks shown: Chip RTL, IP
Blocks, IP Block Power Abstract Generation, IP Power Abstracts, Standard
Cell Library, Contributor Based Cell Power Model Generation, Chip Netlist,
Workloads, RTL Simulator (Core Sim, Uncore Sim), Clock and Data Switching,
Chip Level Power Analysis, Power.
[1] Dhanwada, N., et al. 2013. Efficient PVT independent abstraction of
large IP blocks for hierarchical power analysis. ICCAD, Nov. 2013.

Reference Power Abstraction
• The reference model can be abstracted along several dimensions:
– The RTL simulator could be an early-stage microarchitecture-level
pipeline timing model;
– The workload could be a suite of representative loop kernels;
– The switching statistics could be reduced to a smaller subset;
– The circuit-level detailed analysis could be approximated by area or
gate count based analytical equations.
• Abstracted power models:
– RTL simulations provide the data: power and high-level event counts.
– Linear regression techniques are applied to the data from RTL simulations.

Uncore Power Modeling
• IBM POWER7 L2 and L3 cache uncore elements constitute
20% of power for a TDP workload.
– The uncore further includes large macros that are shared by all chiplets.
• We focus on the path taken by memory requests.
– Starting at L3 to chip memory I/O links.
• We propose a seemingly drastic abstraction.
– 4 activity markers: read, write, retry and snoop events.
– Small set of carefully crafted micro-benchmarks.
– Average error: 1.4-2.4%.
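The regression step can be sketched as an ordinary least-squares fit of unit power against the four activity markers. This is a minimal pure-Python illustration, not the tool flow used in the work: the workload names, event rates, power values and coefficients below are hypothetical, and the power numbers are synthesized from known coefficients so that the fit can be checked.

```python
def fit_linear_power_model(event_counts, power):
    """Least-squares fit of power = c0 + sum(c_i * marker_i) via normal equations."""
    rows = [[1.0] + [float(v) for v in row] for row in event_counts]
    n = len(rows[0])
    # Augmented matrix for (X^T X) c = X^T y.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * y for r, y in zip(rows, power))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("workloads do not exercise the markers independently")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict_power(coeffs, events):
    """Idle term plus a weighted sum of marker rates."""
    return coeffs[0] + sum(c * e for c, e in zip(coeffs[1:], events))


def percent_errors(coeffs, event_counts, power):
    """Absolute prediction error per workload, in percent of reference power."""
    return [abs(predict_power(coeffs, e) - p) / p * 100.0
            for e, p in zip(event_counts, power)]


# Hypothetical micro-benchmarks: (reads, writes, retries, snoops) per unit time.
WORKLOADS = ["idle", "read", "write", "retry", "snoop", "mixed"]
EVENTS = [(0, 0, 0, 0), (8, 0, 0, 0), (0, 8, 0, 0),
          (4, 4, 8, 0), (0, 0, 0, 16), (4, 4, 2, 8)]
# Synthetic reference power generated from [2.0, 0.5, 0.25, 0.125, 0.0625].
POWER = [2.0, 6.0, 4.0, 6.0, 3.0, 5.75]

coeffs = fit_linear_power_model(EVENTS, POWER)
errors = percent_errors(coeffs, EVENTS, POWER)
for name, err in zip(WORKLOADS, errors):
    print(f"{name:>6}: error {err:.4f}%")
```

With real reference data the residuals would not vanish as they do here. The average and maximum of `errors` across workloads are the kind of figures the slides quote.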

IBM POWER8
• 12 cores with 8-way simultaneous multi-threading (SMT) per core.
Total on-chip L2+L3 cache capacity is 102 MB.
• Fabricated using a 22 nm CMOS SOI process. Die size of 649 mm².
4.2 billion transistors.
• Each core is an aggressive, wide-issue superscalar design, with 16
execution pipelines for massive data crunching.
• Uncore supports a massive 7.6 Tb/s off-chip bandwidth including
memory and SMP links, PCIe links, an off-chip coherent accelerator
interface, as well as on-chip bus-attached data accelerators.
Figure: POWER8™ chip photomicrograph with superimposed demarcations to
indicate regions occupied by cores, L2/L3 caches, chip interconnect,
memory controllers, etc.

IBM POWER8 Uncore
• 512 KB private L2 per core and an 8 MB L3 instance per core.
• Memory stack separated into four on-chip MCUs, each partitioned into
two sub-controllers, for a total of 8 memory I/O links and an off-chip
Centaur L4 buffer chip per link.
• PowerBus (PB) provides coherent communication support across the
cache-memory subsystem.
Figure: Block-diagram view of the chip, with clearly defined “uncore”
elements (highlighted in blue).

Simulation for Uncore
Table: Characteristics of workloads used for power abstraction
Figure: Uncore simulation environment

Power Bus Ramp
• The Power Bus Ramp (PBIEX)
– Interface between a chiplet and the Power Bus Unit.
– Buffers memory requests initiated by L3 (cache miss/flush) until the
Power Bus is available.
• Sends memory requests to, and receives them from, the Power Bus Unit
(PBEH).
– Activity is foremost determined by the read/write bandwidth to and
from its associated chiplet.
• Handles coherence requests on the Power Bus.
– Activity is thus further determined by snooping activity resulting
from reads/writes originating from other chiplets.
Figure: PBIEX regression statistics and bar graph showing error sources
for predicted power for each workload.

Power Bus Unit
• Routing fabric between each chiplet and the memory/network
controllers. Performs coherency checks on each memory request received
from the chiplets.
• Each L3 cache miss results in the PBEH broadcasting a request to each
chiplet to check whether some other L3 contains the requested cache line.
– If not, the request is forwarded to the correct memory controller.
• Communicates coherence requests and responses between chiplets, as
well as memory requests to and from the MCU.
Figure: PBEH regression statistics and bar graph showing error sources
for predicted power for each workload.
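The L3-miss flow described for the PBEH can be sketched as a small behavioral model that also counts the snoops a miss generates. This is only an illustration under stated assumptions: the function name, the dictionary of L3 contents, and the address-modulo choice of memory controller are hypothetical and not taken from the POWER8 design.

```python
def handle_l3_miss(line_addr, requester, l3_lines, num_mcus=4):
    """Resolve one L3 miss; returns (source, index, snoops_sent)."""
    others = [c for c in sorted(l3_lines) if c != requester]
    snoops_sent = len(others)  # the request is broadcast to every other chiplet
    for chiplet in others:
        if line_addr in l3_lines[chiplet]:
            return ("chiplet", chiplet, snoops_sent)
    # No other L3 holds the line, so forward it to a memory controller.
    # Address-modulo selection is an assumption made for this sketch.
    return ("mcu", line_addr % num_mcus, snoops_sent)


# Hypothetical L3 contents per chiplet (sets of cache-line addresses).
L3_LINES = {0: {0x10}, 1: {0x20}, 2: set()}

hit = handle_l3_miss(0x20, requester=0, l3_lines=L3_LINES)
miss = handle_l3_miss(0x31, requester=0, l3_lines=L3_LINES)
print(hit)   # line found in another chiplet's L3
print(miss)  # line forwarded to a memory controller
```

Summing `snoops_sent` over a workload gives a snoop-event count of the kind used as an activity marker.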

Memory Controller Unit
• Interface between the PBEH and the high speed serial I/O links going
to and from memory.
• Each request is assembled into a transmission frame and sent over the
link.
• Activity of the MCU is foremost determined by the read and write
bandwidth of the PBEH.
• MCU must also reject a request if it cannot handle more requests due
to full buffers.
– Activity is therefore also dependent on the number of such retries.
Figure: MCU regression statistics and bar graph showing error sources
for predicted power for each workload.
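The retry behavior can be illustrated with a bounded request buffer that counts reads, writes and rejected requests. This is a minimal sketch: the class name, buffer depth and request sequence are hypothetical, and real MCU buffering is far more involved.

```python
from collections import deque


class McuSketch:
    """Toy memory-controller front end with a fixed number of buffer slots."""

    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots
        self.pending = deque()
        self.counts = {"read": 0, "write": 0, "retry": 0}

    def request(self, kind):
        """Accept a 'read' or 'write', or reject it when the buffer is full."""
        if kind not in ("read", "write"):
            raise ValueError(f"unknown request kind: {kind}")
        if len(self.pending) >= self.buffer_slots:
            self.counts["retry"] += 1
            return False
        self.pending.append(kind)
        self.counts[kind] += 1
        return True

    def send_frame(self):
        """Send the oldest buffered request over the link, freeing a slot."""
        return self.pending.popleft() if self.pending else None


mcu = McuSketch(buffer_slots=2)
results = [mcu.request(k) for k in ("read", "write", "read")]  # third is rejected
mcu.send_frame()                     # one slot frees up
results.append(mcu.request("read"))  # the retried request now fits
print(results, mcu.counts)
```

The three counters correspond to the read, write and retry markers that the abstract model takes as inputs for this unit.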

Abstract Model Conclusions
• Uncore elements in modern-day processors can be modeled accurately
even with seemingly drastic abstractions.
• The models are accurate for the abstraction level at which they are
intended to be used.
– < 6% maximum errors across the abstract uncore models.
– < 9% power difference observed for the minimum vs. maximum
address and data switching workloads.
• Future work: Model refinements to focus on DS events.
– Capture the degree of bit switching on addresses and data that
move through the uncore units.
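One simple way to express the degree of bit switching is the average fraction of bits that toggle between consecutive values on an address or data bus. The sketch below is an assumption about how such a marker could be computed, not the refinement the authors have in mind. The function names and sample values are hypothetical.

```python
def bit_toggles(prev, curr):
    """Number of bit positions that differ between two bus values."""
    return bin(prev ^ curr).count("1")


def switching_factor(values, width):
    """Average fraction of the bus width that toggles per transfer."""
    if len(values) < 2:
        return 0.0
    toggles = sum(bit_toggles(a, b) for a, b in zip(values, values[1:]))
    return toggles / (width * (len(values) - 1))


# Hypothetical 8-bit traces: maximum, minimum and intermediate switching.
max_switching = switching_factor([0x00, 0xFF, 0x00], width=8)
min_switching = switching_factor([0x55, 0x55, 0x55], width=8)
partial = switching_factor([0b0000, 0b0001, 0b0011], width=4)
print(max_switching, min_switching, partial)
```

A factor like this could serve as an additional regression input alongside the event-count markers.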

Summary & Conclusions
• Uncore power, and the identification of power reduction opportunities,
are critical aspects of future power-efficient microprocessor design.
• We present a practical methodology for use in an industrial setting
for deriving abstract analytical power models for selected key
uncore elements.
• We show that even with very few power event markers and a small
set of stress marks, it is possible to develop accurate power models
for uncore elements of a modern-day chip.
• We quantify the accuracy such models have in providing improved
power proxies and predicting worst-case bounds on chip level
inductive noise in future technologies.
