International Symposium on Low Power Electronics and Design 1
Empirically Derived Abstractions in
Uncore Power Modeling for a
Server-Class Processor Chip
Hans Jacobson, Arun Joseph*, Dharmesh
Parikh*, Pradip Bose, Alper Buyuktosunoglu
IBM Systems & Technology Group*
IBM T. J. Watson Research
Uncore Power: Overview
• Pre-silicon power modeling has primarily focused on the processor
cores. As designs have evolved, attention has shifted to the “uncore”.
• Abstractions are needed to characterize power-performance trade-offs.
• We examine the challenge of developing practical abstractions in
uncore power modeling in an industrial setting.
• We report a systematic methodology of abstractions in modeling,
with a focus on key uncore elements of the IBM POWER8 processor.
• We show that uncore elements can be modeled using a few activity
markers and a small set of microbenchmark stress test cases.
Use-case: Digital Power Proxies
• Without an uncore proxy, the dynamic power management policy
conservatively assumes a constant, high power for the uncore.
– This reduces the opportunity for maximizing performance/watt at chip level.
• An uncore proxy for POWER8 improves the accuracy of chip power estimates by 15%.
– Compared to core, L2 and L3 level proxies combined with uncore worst-case power.
• For scenarios where the uncore is largely idle, this could translate to an
opportunity to boost the frequency by at least 5%.
– For a given chip power cap, assuming chip-wide DVFS.
– With per-core DVFS control and the ability to shift power across core domains,
the boost in performance could be much higher.
• More use-cases: inductive noise trend analysis, and correct decisions in the
choice of early-stage micro-architectural parameters.
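The frequency-boost argument above can be illustrated with a rough calculation. The sketch below is not the authors' method: it assumes that the power budget left for the cores is the chip cap minus the uncore estimate, and that core power scales roughly with the cube of frequency under DVFS. The function name, the exponent, and all wattages are hypothetical.

```python
def frequency_boost(power_cap_w: float,
                    uncore_worst_case_w: float,
                    uncore_proxy_w: float,
                    exponent: float = 3.0) -> float:
    """Fractional frequency boost gained by replacing a worst-case uncore
    estimate with a proxy estimate, for a fixed chip power cap.

    Assumption (not from the slides): core power ~ frequency ** exponent.
    """
    # Core budget when the policy assumes constant worst-case uncore power.
    budget_conservative = power_cap_w - uncore_worst_case_w
    # Core budget when the policy uses the (lower) proxy estimate instead.
    budget_with_proxy = power_cap_w - uncore_proxy_w
    if budget_conservative <= 0 or budget_with_proxy <= 0:
        raise ValueError("uncore estimate must be below the power cap")
    # Invert the assumed power/frequency relation to get the frequency ratio.
    return (budget_with_proxy / budget_conservative) ** (1.0 / exponent) - 1.0


# Hypothetical numbers: 190 W cap, 40 W worst-case uncore, 20 W proxy estimate.
boost = frequency_boost(190.0, 40.0, 20.0)
print(f"frequency headroom: {boost:.1%}")
```

When the proxy reports the same power as the worst-case estimate, the function returns zero headroom, which matches the intuition that the proxy only helps when the uncore is less than fully active.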
Reference Power Modeling
• The reference is the detailed chip power analysis tool chain
used at IBM [1].
• Its accuracy has been validated against POWER7+ hardware power.
• The workload-specific power and event counts form the data
points we use in generating an uncore abstract power model.
Figure: Reference power modeling methodology (block labels: IP Blocks, IP Block Power Abstract Generation, IP Power Abstracts, Contributor Based Cell Power Model Generation, Standard Cell Library, Chip Level Power Analysis, Chip Netlist, Core Sim, Uncore Sim, RTL Simulator, Workloads, Clock and Data Switching, Power, Chip RTL)
[1] Dhanwada, N., et al. Efficient PVT independent abstraction of large IP
blocks for hierarchical power analysis. ICCAD, Nov. 2013.
Reference Power Abstraction
• The reference methodology can be abstracted along several dimensions:
– The RTL simulator could be an early-stage microarchitecture-level
pipeline timing model;
– The workload could be a suite of representative loop kernels;
– The switching statistics could be reduced to a smaller subset;
– The circuit-level detailed analysis could be approximated by area or
gate count based analytical equations.
• Abstracted power models:
– RTL simulations provide the data: power and high-level event counts.
– Linear regression techniques are applied to the data from RTL simulations.
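To make the regression step concrete, here is a minimal sketch of fitting a linear power model to per-workload event counts by ordinary least squares. It is an illustration only: the slides do not specify the regression tooling, the function names are hypothetical, and the sample data is synthetic (generated from known coefficients so the fit can be checked).

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: workloads do not separate the events")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_power_model(event_counts: Sequence[Sequence[float]],
                    power: Sequence[float]) -> List[float]:
    """Least-squares fit of power = c0 + sum_i c_i * events_i.

    Returns [c0, c1, ...], where c0 is the activity-independent term.
    """
    rows = [[1.0] + [float(e) for e in ev] for ev in event_counts]
    k = len(rows[0])
    # Normal equations: (X^T X) c = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(rows, power)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_power(coeffs: Sequence[float], events: Sequence[float]) -> float:
    """Evaluate the fitted model for one workload's event counts."""
    return coeffs[0] + sum(c * e for c, e in zip(coeffs[1:], events))


# Synthetic data: columns are (reads, writes) per interval, one row per workload.
events = [(0, 0), (10, 0), (0, 10), (5, 5), (10, 10)]
measured = [2.0, 7.0, 10.0, 8.5, 15.0]  # generated from 2.0 + 0.5*reads + 0.8*writes
coeffs = fit_power_model(events, measured)
print([round(c, 3) for c in coeffs])
```

The example uses two event types to stay short; further markers would simply be additional columns in each row of `events`.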
Uncore Power Modeling
• IBM POWER7 L2 and L3 cache uncore elements constitute
20% of power for a TDP workload.
– The uncore further includes large macros that are shared by all chiplets.
• We focus on the path taken by memory requests.
– Starting at L3 to chip memory I/O links.
• We propose a seemingly drastic abstraction.
– 4 activity markers: reads, writes, retry and snoop events.
– Small set of carefully crafted micro-benchmarks.
– Average error: 1.4-2.4%.
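One plausible form of such a four-marker model, assuming it is linear in the event counts (consistent with the linear regression mentioned on the previous slide, though the slides do not give the exact equation), is sketched below. The symbol names are illustrative: $N$ denotes an event count per interval, $c$ a fitted coefficient, and $c_0$ the activity-independent power of the unit.

```latex
P_{\text{unit}} \approx c_0
  + c_{\text{rd}}\, N_{\text{rd}}
  + c_{\text{wr}}\, N_{\text{wr}}
  + c_{\text{retry}}\, N_{\text{retry}}
  + c_{\text{snoop}}\, N_{\text{snoop}}
```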
IBM POWER8
• 12 cores with 8-way simultaneous multi-threading (SMT) per core.
Total on-chip L2+L3 cache capacity is 102 MB.
• Fabricated in 22 nm CMOS SOI. Die size of 649 mm². 4.2 billion transistors.
• Each core is an aggressive, wide-issue superscalar design, with 16
execution pipelines for massive data crunching.
• The uncore supports a massive 7.6 Tb/s of off-chip bandwidth, including
memory and SMP links, PCIe links, an off-chip coherent accelerator
interface, as well as on-chip bus-attached data accelerators.
Figure: POWER8™ chip photomicrograph with superimposed demarcations to
indicate regions occupied by cores, L2/L3 caches, chip interconnect,
memory controllers, etc.
IBM POWER8 Uncore
• 512 KB private L2 per core and an 8 MB L3 instance per core.
• The memory stack is separated into four on-chip memory controller
units (MCUs), each partitioned into two sub-controllers, for a total of
8 memory I/O links, with an off-chip Centaur L4 buffer chip per link.
• The PowerBus (PB) provides coherent communication support across
the cache-memory subsystem.
Figure: Block-diagram view of the chip, with clearly defined “uncore”
elements (highlighted in blue).
Simulation for Uncore
Table: Characteristics of workloads
used for power abstraction
Figure: Uncore simulation
environment
Power Bus Ramp
• The Power Bus Ramp (PBIEX):
– Interface between a chiplet and the Power Bus Unit.
– Buffers memory requests initiated by L3 (cache miss/flush) until the
Power Bus is available.
• Sends memory requests to, and receives them from, the Power Bus Unit (PBEH).
– Activity is foremost determined by the read/write bandwidth to and
from its associated chiplet.
• Handles coherence requests on the Power Bus.
– Activity is thus further determined by snooping activity resulting from
reads/writes originating from other chiplets.
Figure: PBIEX regression statistics and bar graph showing error sources
for predicted power for each workload.
Power Bus Unit
• Routing fabric between each chiplet
and the memory/network
controllers. Performs coherency
checks on each memory request
received from the chiplets.
• Each L3 cache miss results in the
PBEH broadcasting a request to
each chiplet to check whether some
other L3 contains the requested
cache line.
– If not, the request is forwarded to
the correct memory controller.
• Communicates coherence requests and responses between chiplets, as
well as memory requests to and from the MCU.
Figure: PBEH regression statistics and bar graph showing error
sources for predicted power for each workload.
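The broadcast-and-forward behavior described on this slide can be sketched as a small behavioral model that also tallies the kind of activity markers a power model would consume. This is a simplified illustration, not the PBEH implementation: the class names, counter names, and return strings are hypothetical, and the choice of the "correct" memory controller is left abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Chiplet:
    chiplet_id: int
    # Cache-line addresses currently held in this chiplet's L3 (hypothetical view).
    l3_lines: Set[int] = field(default_factory=set)


@dataclass
class PowerBusModel:
    chiplets: Dict[int, Chiplet]
    counters: Dict[str, int] = field(
        default_factory=lambda: {"reads": 0, "snoops": 0, "memory_forwards": 0}
    )

    def handle_l3_miss(self, requester_id: int, line_addr: int) -> str:
        """Broadcast a snoop to every other chiplet, then pick the data source."""
        self.counters["reads"] += 1
        owner = None
        for cid, chiplet in self.chiplets.items():
            if cid == requester_id:
                continue
            # Every other chiplet is snooped, whether or not it holds the line.
            self.counters["snoops"] += 1
            if owner is None and line_addr in chiplet.l3_lines:
                owner = cid
        if owner is not None:
            return f"chiplet {owner}"
        # No other L3 holds the line: forward the request toward memory.
        self.counters["memory_forwards"] += 1
        return "memory controller"


bus = PowerBusModel({
    0: Chiplet(0),
    1: Chiplet(1, {0x40}),
    2: Chiplet(2),
})
print(bus.handle_l3_miss(0, 0x40))  # line is held by another L3
print(bus.handle_l3_miss(0, 0x80))  # line is held by no L3
print(bus.counters)
```

Note how the snoop count grows with the number of chiplets for every miss, independent of where the line is found, which is why snoop events appear as a separate activity marker.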
Memory Controller Unit
• Interface between the PBEH and the
high speed serial I/O links going to
and from memory.
• Each request is assembled into a
transmission frame and sent over
the link.
• Activity of the MCU is foremost
determined by the read and write
bandwidth of the PBEH.
• The MCU must also reject a request if it cannot handle more requests
due to full buffers.
– Activity therefore also depends on the number of such retries.
Figure: MCU regression statistics and bar graph showing error
sources for predicted power for each workload.
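The reject-on-full behavior, and why retries become an activity marker of their own, can be sketched with a bounded request queue. This is a toy model under stated assumptions: the buffer depth, class name, and counter names are hypothetical, and frame assembly is reduced to popping the oldest request.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class McuModel:
    """Toy memory controller: bounded request buffer with retry counting."""

    def __init__(self, buffer_slots: int) -> None:
        if buffer_slots <= 0:
            raise ValueError("buffer_slots must be positive")
        self.buffer_slots = buffer_slots
        self.pending: Deque[Tuple[str, int]] = deque()
        self.counters: Dict[str, int] = {"reads": 0, "writes": 0, "retries": 0}

    def submit(self, kind: str, line_addr: int) -> bool:
        """Accept a request, or reject it (forcing a retry) if buffers are full."""
        if kind not in ("read", "write"):
            raise ValueError("kind must be 'read' or 'write'")
        if len(self.pending) >= self.buffer_slots:
            # Rejected requests still cost activity, so they are counted separately.
            self.counters["retries"] += 1
            return False
        self.pending.append((kind, line_addr))
        self.counters["reads" if kind == "read" else "writes"] += 1
        return True

    def send_frame(self) -> Optional[Tuple[str, int]]:
        """Drain the oldest request, standing in for frame assembly and transmission."""
        return self.pending.popleft() if self.pending else None


mcu = McuModel(buffer_slots=2)
accepted = [mcu.submit("read", 0x00), mcu.submit("write", 0x40), mcu.submit("read", 0x80)]
print(accepted, mcu.counters)
```

In this model a burst that exceeds the buffer depth raises the retry count without raising the read or write counts, so read/write bandwidth alone would underestimate activity in such phases.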
Abstract Model Conclusions
• Uncore elements of modern-day processors can be modeled accurately even
with seemingly drastic abstractions.
• The models are accurate for the abstraction level at which they are
intended to be used.
– < 6% maximum error across the abstract uncore models.
– < 9% power difference observed between the minimum and maximum
address and data switching workloads.
• Future work: model refinements to focus on DS events.
– Capture the degree of bit switching on addresses and data that
move through the uncore units.
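Average and maximum error figures of the kind quoted above can be computed per workload as follows. The slides do not state the exact error definition, so this sketch assumes absolute percentage error of predicted power relative to the reference power; the function names and the sample values are hypothetical.

```python
from typing import Dict, List, Sequence


def percent_errors(predicted: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Absolute percentage error of each prediction against its reference value."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return [abs(p - r) / r * 100.0 for p, r in zip(predicted, reference)]


def summarize_errors(predicted: Sequence[float], reference: Sequence[float]) -> Dict[str, float]:
    """Average and maximum absolute percentage error across workloads."""
    errs = percent_errors(predicted, reference)
    return {"average_pct": sum(errs) / len(errs), "maximum_pct": max(errs)}


# Hypothetical per-workload power values (arbitrary units).
summary = summarize_errors(predicted=[10.2, 19.0, 30.0], reference=[10.0, 20.0, 30.0])
print({k: round(v, 2) for k, v in summary.items()})
```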
Summary & Conclusions
• Uncore power, and the identification of opportunities to reduce it, is
a critical aspect of future power-efficient microprocessor design.
• We present a practical methodology for use in an industrial setting
for deriving abstract analytical power models for selected key
uncore elements.
• We show that even with very few power event markers and a small
set of stress marks, it is possible to develop accurate power models
for uncore elements of a modern-day chip.
• We quantify the accuracy such models have in providing improved
power proxies and predicting worst-case bounds on chip level
inductive noise in future technologies.

More Related Content

Similar to Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class Processor Chip

Run-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsRun-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environments
NECST Lab @ Politecnico di Milano
 
Low-Power Design and Verification
Low-Power Design and VerificationLow-Power Design and Verification
Low-Power Design and Verification
DVClub
 
Per domain power analysis
Per domain power analysisPer domain power analysis
Per domain power analysis
Arun Joseph
 
Low power network on chip architectures: A survey
Low power network on chip architectures: A surveyLow power network on chip architectures: A survey
Low power network on chip architectures: A survey
CSITiaesprime
 
Performance and Energy evaluation
Performance and Energy evaluationPerformance and Energy evaluation
Performance and Energy evaluation
GIORGOS STAMELOS
 
Low Power System on chip based design methodology
Low Power System on chip based design methodologyLow Power System on chip based design methodology
Low Power System on chip based design methodology
Aakash Patel
 
Instruction level power analysis
Instruction level power analysisInstruction level power analysis
Instruction level power analysis
Radhegovind
 
A verilog based simulation methodology for estimating statistical test for th...
A verilog based simulation methodology for estimating statistical test for th...A verilog based simulation methodology for estimating statistical test for th...
A verilog based simulation methodology for estimating statistical test for th...
ijsrd.com
 
How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration? How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration?
Deepak Shankar
 
Cache
CacheCache
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
Arun Joseph
 
Implementation of Low Power Test Pattern Generator Using LFSR
Implementation of Low Power Test Pattern Generator Using LFSRImplementation of Low Power Test Pattern Generator Using LFSR
Implementation of Low Power Test Pattern Generator Using LFSR
International Journal of Science and Research (IJSR)
 
Modern INTEL Microprocessors' Architecture and Sneak Peak at NVIDIA TEGRA GPU
Modern INTEL Microprocessors' Architecture and Sneak Peak at NVIDIA TEGRA GPUModern INTEL Microprocessors' Architecture and Sneak Peak at NVIDIA TEGRA GPU
Modern INTEL Microprocessors' Architecture and Sneak Peak at NVIDIA TEGRA GPU
abhijeetnawal
 
On chip cache
On chip cacheOn chip cache
On chip cache
Syeda Nasiha
 
Tossim
Tossim Tossim
Tossim
crew1274
 
Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...
Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...
Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...
Michael Gschwind
 
Dark silicon and the end of multicore scaling
Dark silicon and the end of multicore scalingDark silicon and the end of multicore scaling
Dark silicon and the end of multicore scaling
Léia de Sousa
 
How lower power consumption is transforming wearables and enabling new and di...
How lower power consumption is transforming wearables and enabling new and di...How lower power consumption is transforming wearables and enabling new and di...
How lower power consumption is transforming wearables and enabling new and di...
Valencell, Inc
 
Implementation of Area Effective Carry Select Adders
Implementation of Area Effective Carry Select AddersImplementation of Area Effective Carry Select Adders
Implementation of Area Effective Carry Select Adders
Kumar Goud
 
Michael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind, Chip Multiprocessing and the Cell Broadband EngineMichael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind
 

Similar to Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class Processor Chip (20)

Run-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsRun-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environments
 
Low-Power Design and Verification
Low-Power Design and VerificationLow-Power Design and Verification
Low-Power Design and Verification
 
Per domain power analysis
Per domain power analysisPer domain power analysis
Per domain power analysis
 
Low power network on chip architectures: A survey
Low power network on chip architectures: A surveyLow power network on chip architectures: A survey
Low power network on chip architectures: A survey
 
Performance and Energy evaluation
Performance and Energy evaluationPerformance and Energy evaluation
Performance and Energy evaluation
 
Low Power System on chip based design methodology
Low Power System on chip based design methodologyLow Power System on chip based design methodology
Low Power System on chip based design methodology
 
Instruction level power analysis
Instruction level power analysisInstruction level power analysis
Instruction level power analysis
 
A verilog based simulation methodology for estimating statistical test for th...
A verilog based simulation methodology for estimating statistical test for th...A verilog based simulation methodology for estimating statistical test for th...
A verilog based simulation methodology for estimating statistical test for th...
 
How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration? How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration?
 
Cache
CacheCache
Cache
 
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
 
Implementation of Low Power Test Pattern Generator Using LFSR
Implementation of Low Power Test Pattern Generator Using LFSRImplementation of Low Power Test Pattern Generator Using LFSR
Implementation of Low Power Test Pattern Generator Using LFSR
 
Modern INTEL Microprocessors' Architecture and Sneak Peak at NVIDIA TEGRA GPU
Modern INTEL Microprocessors' Architecture and Sneak Peak at NVIDIA TEGRA GPUModern INTEL Microprocessors' Architecture and Sneak Peak at NVIDIA TEGRA GPU
Modern INTEL Microprocessors' Architecture and Sneak Peak at NVIDIA TEGRA GPU
 
On chip cache
On chip cacheOn chip cache
On chip cache
 
Tossim
Tossim Tossim
Tossim
 
Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...
Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...
Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...
 
Dark silicon and the end of multicore scaling
Dark silicon and the end of multicore scalingDark silicon and the end of multicore scaling
Dark silicon and the end of multicore scaling
 
How lower power consumption is transforming wearables and enabling new and di...
How lower power consumption is transforming wearables and enabling new and di...How lower power consumption is transforming wearables and enabling new and di...
How lower power consumption is transforming wearables and enabling new and di...
 
Implementation of Area Effective Carry Select Adders
Implementation of Area Effective Carry Select AddersImplementation of Area Effective Carry Select Adders
Implementation of Area Effective Carry Select Adders
 
Michael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind, Chip Multiprocessing and the Cell Broadband EngineMichael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
 

More from Arun Joseph

Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Arun Joseph
 
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Arun Joseph
 
FVCAG: A framework for formal verification driven power modelling and verific...
FVCAG: A framework for formal verification driven power modelling and verific...FVCAG: A framework for formal verification driven power modelling and verific...
FVCAG: A framework for formal verification driven power modelling and verific...
Arun Joseph
 
FreqLeak
FreqLeakFreqLeak
FreqLeak
Arun Joseph
 
Process synchronization in multi core systems using on-chip memories
Process synchronization in multi core systems using on-chip memoriesProcess synchronization in multi core systems using on-chip memories
Process synchronization in multi core systems using on-chip memories
Arun Joseph
 
FirmLeak
FirmLeakFirmLeak
FirmLeak
Arun Joseph
 
End to End Self-Heating Analysis Methodology and Toolset for High Performance...
End to End Self-Heating Analysis Methodology and Toolset for High Performance...End to End Self-Heating Analysis Methodology and Toolset for High Performance...
End to End Self-Heating Analysis Methodology and Toolset for High Performance...
Arun Joseph
 

More from Arun Joseph (7)

Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
 
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
 
FVCAG: A framework for formal verification driven power modelling and verific...
FVCAG: A framework for formal verification driven power modelling and verific...FVCAG: A framework for formal verification driven power modelling and verific...
FVCAG: A framework for formal verification driven power modelling and verific...
 
FreqLeak
FreqLeakFreqLeak
FreqLeak
 
Process synchronization in multi core systems using on-chip memories
Process synchronization in multi core systems using on-chip memoriesProcess synchronization in multi core systems using on-chip memories
Process synchronization in multi core systems using on-chip memories
 
FirmLeak
FirmLeakFirmLeak
FirmLeak
 
End to End Self-Heating Analysis Methodology and Toolset for High Performance...
End to End Self-Heating Analysis Methodology and Toolset for High Performance...End to End Self-Heating Analysis Methodology and Toolset for High Performance...
End to End Self-Heating Analysis Methodology and Toolset for High Performance...
 

Recently uploaded

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 

Recently uploaded (20)

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 

Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class Processor Chip

  • 1. International Symposium on Low Power Electronics and Design 1 Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class Processor Chip Hans Jacobson, Arun Joseph*, Dharmesh Parikh*, Pradip Bose, Alper Buyuktosunoglu IBM Systems & Technology Group* IBM T. J. Watson Research
  • 2. International Symposium on Low Power Electronics and Design 2 Uncore Power: Overview • Pre-silicon power modeling has primarily focused on the processor cores. As designs evolved, attention has shifted to “uncore”. • Abstractions needed to characterize power-performance trade-offs. • We examine the challenge of developing practical abstractions in uncore power modeling in an industrial setting. • We report a systematic methodology of abstractions in modeling with focus on key uncore elements of IBM POWER8 processor. • We show that uncore elements can be modeled using a few activity markers and a small set of microbenchmark stress test cases.
  • 3. Use-case: Digital Power Proxies
      • Without an uncore proxy, dynamic power management policy conservatively estimates a constant, high power for the uncore.
        – Reduces opportunity for maximizing performance/watt at chip level.
      • Uncore proxy for a POWER8 improves accuracy of chip power by 15%.
        – Compared to core, L2 and L3 level proxies with uncore worst-case power.
      • For scenarios where the uncore is largely idle, this could translate to an opportunity to boost the frequency by at least 5%.
        – For a given chip power cap, assuming chip-wide DVFS.
        – With per-core DVFS control and the ability to shift power across core domains, the boost in performance could be much higher.
      • More use-cases: inductive noise trend analysis, correct decisions in the choice of early-stage micro-architectural parameters.
  • 4. Reference Power Modeling
      • The detailed reference chip power analysis tool chain used at IBM [1].
      • Accuracy validated against POWER7+ hardware power.
      • The workload-specific power and event counts form the data points we use in generating an uncore abstract power model.
      Figure: Reference power modeling methodology. (Diagram labels: IP Blocks, IP Block Power Abstract Generation, IP Power Abstracts, Contributor Based Cell Power Model Generation, Standard Cell Library, Chip Level Power Analysis, Chip Netlist, Core Sim, Uncore Sim, RTL Simulator, Workloads, Clock and Data Switching Power, Chip RTL.)
      [1] Dhanwada, N., et al. 2013. Efficient PVT independent abstraction of large IP blocks for hierarchical power analysis. ICCAD, Nov. 2013.
  • 5. Reference Power Abstraction
      • The reference model can be abstracted along several dimensions:
        – The RTL simulator could be an early-stage microarchitecture-level pipeline timing model;
        – The workload could be a suite of representative loop kernels;
        – The switching statistics could be reduced to a smaller subset;
        – The circuit-level detailed analysis could be approximated by area or gate count based analytical equations.
      • Abstracted power models:
        – RTL simulations provide the data: power and high-level event counts.
        – Linear regression techniques are applied to the data from RTL simulations.
  • 6. Uncore Power Modeling
      • IBM POWER7 L2, L3 cache uncore elements constitute 20% of power for a TDP workload.
        – Uncore elements further include large macros that are shared by all chiplets.
      • We focus on the path taken by memory requests.
        – Starting at L3 and ending at the chip memory I/O links.
      • We propose a seemingly drastic abstraction.
        – 4 activity markers: read, write, retry and snoop events.
        – Small set of carefully crafted micro-benchmarks.
        – Average error: 1.4-2.4%.
  • 7. IBM POWER8
      • 12 cores with 8-way simultaneous multi-threading (SMT) per core. Total on-chip L2+L3 cache capacity is 102 MB.
      • Fabricated in a 22nm CMOS SOI process. Die size of 649 mm². 4.2 billion transistors.
      • Each core is an aggressive, wide-issue superscalar design, with 16 execution pipelines for massive data crunching.
      • Uncore supports a massive 7.6 Tb/s off-chip bandwidth including memory and SMP links, PCIe links, an off-chip coherent accelerator interface, as well as on-chip bus-attached data accelerators.
      Figure: POWER8™ chip photomicrograph with superimposed demarcations to indicate regions occupied by cores, L2/L3 caches, chip interconnect, memory controllers, etc.
  • 8. IBM POWER8 Uncore
      • 512 KB private L2 per core, an 8 MB L3 instance per core.
      • Memory stack separated into four on-chip MCUs, each partitioned into two sub-controllers, for a total of 8 memory I/O links and an off-chip Centaur L4 buffer chip per link.
      • PowerBus (PB) provides coherent communication support across the cache-memory subsystem.
      Figure: Block-diagram view of the chip, with clearly defined “uncore” elements (highlighted in blue).
  • 9. Simulation for Uncore
      Table: Characteristics of workloads used for power abstraction
      Figure: Uncore simulation environment
  • 10. Power Bus Ramp
      • The Power Bus Ramp (PBIEX)
        – Interface between a chiplet and the Power Bus Unit.
        – Buffers memory requests initiated by L3 (cache miss/flush) until the Power Bus is available.
      • Sends/receives memory requests to/from the Power Bus Unit (PBEH).
        – Activity is foremost determined by the read/write bandwidth to and from its associated chiplet.
      • Handles coherence requests on the Power Bus.
        – Activity is thus further determined by snooping activity resulting from reads/writes originating from other chiplets.
      Figure: PBIEX regression statistics and bar graph showing error sources for predicted power for each workload.
  • 11. Power Bus Unit
      • Routing fabric between each chiplet and the memory/network controllers. Performs coherency checks on each memory request received from the chiplets.
      • Each L3 cache miss results in the PBEH broadcasting a request to each chiplet to check whether some other L3 contains the requested cache line.
        – If not, the request is forwarded to the correct memory controller.
      • Communicates coherence requests and responses between chiplets, as well as memory requests to and from the MCU.
      Figure: PBEH regression statistics and bar graph showing error sources for predicted power for each workload.
  • 12. Memory Controller Unit
      • Interface between the PBEH and the high speed serial I/O links going to and from memory.
      • Each request is assembled into a transmission frame and sent over the link.
      • Activity of the MCU is foremost determined by the read and write bandwidth of the PBEH.
      • MCU must also reject a request if it cannot handle more requests due to full buffers.
        – Activity is therefore also dependent on the number of such retries.
      Figure: MCU regression statistics and bar graph showing error sources for predicted power for each workload.
  • 13. Abstract Model Conclusions
      • Uncore elements in modern-day processors can be modeled accurately even with seemingly drastic abstractions.
      • The models are accurate for the abstraction level at which they are intended to be used.
        – < 6% maximum error across the abstract uncore models.
        – < 9% power difference observed for the minimum vs. maximum address and data switching workloads.
      • Future work: model refinements to focus on DS events.
        – Capture the degree of bit switching on addresses and data that move through the uncore units.
  • 14. Summary & Conclusions
      • Uncore power and identification of power reduction opportunities is a critical aspect of future power-efficient microprocessor design.
      • We present a practical methodology for use in an industrial setting for deriving abstract analytical power models for selected key uncore elements.
      • We show that even with very few power event markers and a small set of stress marks, it is possible to develop accurate power models for uncore elements of a modern-day chip.
      • We quantify the accuracy such models have in providing improved power proxies and predicting worst-case bounds on chip level inductive noise in future technologies.
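
The regression step on slide 5 and the four activity markers on slide 6 (reads, writes, retries, snoops) can be illustrated with a minimal sketch. It assumes the abstract model is an idle term plus one linear coefficient per marker, fitted by ordinary least squares; the slides only say that linear regression is applied, so this exact form is an assumption. All event rates and power values below are invented for illustration and are not measurements from the deck, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical per-workload event rates (columns: reads, writes, retries, snoops).
# One row per microbenchmark; the numbers are invented, not POWER8 data.
EVENTS = np.array([
    [0.0, 0.0, 0.0, 0.0],     # idle
    [40.0, 0.0, 0.0, 4.0],    # read stress
    [0.0, 40.0, 0.0, 4.0],    # write stress
    [20.0, 20.0, 0.0, 8.0],   # mixed read/write
    [30.0, 30.0, 12.0, 8.0],  # saturating, causes retries
    [5.0, 5.0, 0.0, 60.0],    # snoop heavy
    [35.0, 10.0, 3.0, 20.0],  # mixed
    [10.0, 35.0, 6.0, 30.0],  # mixed
])

# Hypothetical reference power per workload (arbitrary units), standing in for
# the output of a detailed reference power analysis.
REFERENCE_POWER = np.array([2.02, 4.05, 4.89, 4.58, 6.94, 3.83, 5.14, 6.13])


def fit_power_model(events, power):
    """Least-squares fit of power = c0 + c1*reads + c2*writes + c3*retries + c4*snoops."""
    design = np.column_stack([np.ones(len(events)), events])
    coeffs, *_ = np.linalg.lstsq(design, power, rcond=None)
    return coeffs


def predict_power(coeffs, events):
    """Evaluate the fitted model: idle term plus weighted event rates."""
    return coeffs[0] + events @ coeffs[1:]


def percent_errors(predicted, reference):
    """Absolute error of each prediction as a percentage of the reference value."""
    return 100.0 * np.abs(predicted - reference) / reference


coeffs = fit_power_model(EVENTS, REFERENCE_POWER)
errors = percent_errors(predict_power(coeffs, EVENTS), REFERENCE_POWER)
print("idle term and per-event coefficients:", coeffs)
print("average error %:", errors.mean(), "maximum error %:", errors.max())
```

The average and maximum of `errors` correspond to the kind of summary figures quoted on slides 6 and 13, although the values printed here reflect only the invented data.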
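
How a request stream turns into the four activity markers can be sketched with a toy event counter that follows the behavior described on slides 10 to 12: an L3 miss is broadcast to the other chiplets, forwarded to memory only if no peer L3 holds the line, and rejected when the memory controller buffer is full. The class `ToyUncore`, its methods, and the buffer size are hypothetical names and parameters chosen for illustration; this is not the simulation environment used by the authors.

```python
from collections import deque


class ToyUncore:
    """Toy model that counts read, write, snoop and retry events for a request trace."""

    def __init__(self, num_chiplets, mcu_buffer_slots):
        # One set of cached line addresses per chiplet stands in for each L3.
        self.l3 = [set() for _ in range(num_chiplets)]
        self.mcu_buffer = deque()
        self.mcu_buffer_slots = mcu_buffer_slots
        self.counts = {"reads": 0, "writes": 0, "snoops": 0, "retries": 0}

    def drain_mcu(self, n=1):
        """Retire up to n buffered requests, freeing memory controller slots."""
        for _ in range(min(n, len(self.mcu_buffer))):
            self.mcu_buffer.popleft()

    def _send_to_mcu(self, request):
        # The memory controller rejects a request when its buffer is full.
        if len(self.mcu_buffer) >= self.mcu_buffer_slots:
            self.counts["retries"] += 1
            return False
        self.mcu_buffer.append(request)
        return True

    def read(self, chiplet, addr):
        if addr in self.l3[chiplet]:
            return "l3_hit"  # served locally, no bus activity counted
        # An L3 miss reaches the bus; a re-issued request is counted again.
        self.counts["reads"] += 1
        others = [i for i in range(len(self.l3)) if i != chiplet]
        self.counts["snoops"] += len(others)  # broadcast to every other chiplet
        if any(addr in self.l3[i] for i in others):
            self.l3[chiplet].add(addr)
            return "peer_l3"
        if not self._send_to_mcu(("read", addr)):
            return "retry"
        self.l3[chiplet].add(addr)
        return "memory"

    def write(self, chiplet, addr):
        """Flush a line from the chiplet's L3 towards memory."""
        self.counts["writes"] += 1
        if not self._send_to_mcu(("write", addr)):
            return "retry"
        self.l3[chiplet].discard(addr)
        return "memory"


uncore = ToyUncore(num_chiplets=3, mcu_buffer_slots=1)
results = [
    uncore.read(0, 0x100),  # miss everywhere, goes to memory
    uncore.read(1, 0x100),  # found in chiplet 0's L3 by the snoop
    uncore.read(2, 0x200),  # memory controller buffer is full, rejected
]
uncore.drain_mcu()
results.append(uncore.read(2, 0x200))  # re-issued after a slot frees up
results.append(uncore.read(2, 0x200))  # now a local L3 hit
uncore.drain_mcu()
results.append(uncore.write(0, 0x100))
print(results, uncore.counts)
```

Counts collected this way per workload would form one row of the event matrix that a regression fit takes as input.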