International Symposium on Low Power Electronics and Design 1
Empirically Derived Abstractions in
Uncore Power Modeling for a
Server-Class Processor Chip
Hans Jacobson, Arun Joseph*, Dharmesh
Parikh*, Pradip Bose, Alper Buyuktosunoglu
IBM Systems & Technology Group*
IBM T. J. Watson Research
Uncore Power: Overview
• Pre-silicon power modeling has primarily focused on the processor
cores. As designs evolved, attention has shifted to “uncore”.
• Abstractions needed to characterize power-performance trade-offs.
• We examine the challenge of developing practical abstractions in
uncore power modeling in an industrial setting.
• We report a systematic methodology of abstractions in modeling,
with a focus on key uncore elements of the IBM POWER8 processor.
• We show that uncore elements can be modeled using a few activity
markers and a small set of microbenchmark stress test cases.
Use-case: Digital Power Proxies
• Without an uncore proxy, dynamic power management policy
conservatively estimates a constant, high power for the uncore.
– Reduces opportunity for maximizing performance/watt at chip level.
• An uncore proxy for POWER8 improves the accuracy of chip power estimates by 15%.
– Compared to core, L2 and L3 level proxies with uncore worst-case power.
• For scenarios where the uncore is largely idle this could translate to an
opportunity to boost the frequency by at least 5%.
– For a given chip power cap assuming chip-wide DVFS.
– With per-core DVFS control and the ability to shift power across core domains,
the boost in performance could be much higher.
• More use-cases: inductive noise trend analysis, and correct decisions in the
choice of early-stage micro-architectural parameters.
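The power-cap argument on this slide can be illustrated with a small sketch. It is not the policy used on POWER8: the function name and all wattages are hypothetical, and the conversion from power headroom to frequency assumes the textbook approximation that non-uncore power scales roughly with the cube of frequency under chip-wide DVFS (voltage tracking frequency, leakage ignored).

```python
def frequency_boost(chip_cap_w: float,
                    uncore_worst_case_w: float,
                    uncore_estimate_w: float,
                    exponent: float = 3.0) -> float:
    """Fractional frequency boost made possible by an uncore power proxy.

    Without a proxy, the policy reserves the worst-case uncore power.
    With a proxy, it reserves only the estimated uncore power.
    ASSUMPTION: the remaining (non-uncore) power scales as f**exponent.
    """
    if not 0.0 <= uncore_estimate_w <= uncore_worst_case_w < chip_cap_w:
        raise ValueError("expected 0 <= estimate <= worst case < chip cap")
    budget_conservative = chip_cap_w - uncore_worst_case_w
    budget_with_proxy = chip_cap_w - uncore_estimate_w
    return (budget_with_proxy / budget_conservative) ** (1.0 / exponent) - 1.0


# Illustrative numbers only (not taken from the slides).
boost = frequency_boost(chip_cap_w=100.0,
                        uncore_worst_case_w=20.0,
                        uncore_estimate_w=10.0)
print(f"frequency headroom: {boost:.1%}")
```

The cubic exponent is the main modeling choice here; a smaller exponent (for example when voltage cannot scale further) would yield a larger frequency boost for the same reclaimed power.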
Reference Power Modeling
• The detailed reference chip
power analysis tool chain
used at IBM [1].
• Accuracy validated against
POWER7+ hardware
power.
• The workload-specific
power and event counts
form the data points we
use in generating an
uncore abstract power
model.
Figure: Reference power modeling methodology. Blocks shown: Chip RTL, Workloads, RTL Simulator (Core Sim, Uncore Sim), Clock and Data Switching, IP Blocks, IP Block Power Abstract Generation, IP Power Abstracts, Standard Cell Library, Contributor Based Cell Power Model Generation, Chip Netlist, Chip Level Power Analysis, Power.
[1] Dhanwada, N., et al. Efficient PVT independent
abstraction of large IP blocks for hierarchical power
analysis. ICCAD, Nov. 2013.
Reference Power Abstraction
• The reference methodology can be abstracted along several dimensions:
– The RTL simulator could be an early-stage microarchitecture-level
pipeline timing model;
– The workload could be a suite of representative loop kernels;
– The switching statistics could be reduced to a smaller subset;
– The circuit-level detailed analysis could be approximated by area or
gate count based analytical equations.
• Abstracted power models:
– RTL simulations provide data: power & high level event counts.
– Linear regression techniques are applied to the data from RTL simulations.
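The regression step can be sketched as follows. This is a minimal illustration rather than the IBM tool chain: the function names, the choice of two event markers, and all numbers are hypothetical, and the data is synthetic (generated from known coefficients so the fit can be checked). The sketch fits power ≈ c0 + Σ ci · event_i by ordinary least squares using only the Python standard library.

```python
from typing import List, Sequence


def fit_linear_power_model(event_counts: Sequence[Sequence[float]],
                           power: Sequence[float]) -> List[float]:
    """Ordinary least squares fit of power ~ c0 + sum_i c_i * event_i.

    Returns [c0, c1, ..., cn]. Solves the normal equations
    (X^T X) c = X^T y by Gauss-Jordan elimination with partial pivoting.
    """
    rows = [[1.0, *map(float, r)] for r in event_counts]  # intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * y for r, y in zip(rows, power))]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("workloads do not separate the event markers")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict_power(coeffs: Sequence[float], counts: Sequence[float]) -> float:
    """Evaluate the fitted linear model for one workload."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], counts))


# Synthetic workloads: power = 2.0 + 0.5*reads + 0.8*writes (arbitrary units).
counts = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 2)]
power = [2.0 + 0.5 * r + 0.8 * w for r, w in counts]
coeffs = fit_linear_power_model(counts, power)
print([round(c, 6) for c in coeffs])
```

The `ValueError` branch reflects why the workload set matters: if the stress tests never vary one marker independently of the others, its coefficient cannot be determined. In practice a numerical library would normally replace the hand-written solver.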
Uncore Power Modeling
• IBM POWER7 L2 and L3 cache uncore elements constitute
20% of power for a TDP workload.
– The uncore further includes large macros that are shared by all chiplets.
• We focus on the path taken by memory requests.
– Starting at L3 to chip memory I/O links.
• We propose a seemingly drastic abstraction.
– 4 activity markers: read, write, retry, and snoop events.
– Small set of carefully crafted micro-benchmarks.
– Average error: 1.4-2.4%.
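The slides report average and maximum errors without spelling out the metric. The sketch below assumes the usual definition, absolute percentage error of the abstract model's prediction relative to the reference power of each workload; the function name and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def error_summary(predicted: Sequence[float],
                  reference: Sequence[float]) -> Tuple[float, float]:
    """Return (average, maximum) absolute percentage error.

    ASSUMPTION: error is |predicted - reference| / reference * 100,
    evaluated per workload.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need one prediction per reference value")
    errors = [abs(p - r) / r * 100.0 for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors), max(errors)


# Made-up predictions and reference powers for three workloads.
avg_err, max_err = error_summary([10.2, 19.5, 30.0], [10.0, 20.0, 30.0])
print(f"average {avg_err:.2f}%, maximum {max_err:.2f}%")
```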
IBM POWER8
• 12 cores with 8-way
simultaneous multi-threading
(SMT) per core. Total on-chip
L2+L3 cache capacity is 102
MB.
• Fabricated in 22 nm
CMOS SOI. Die size of 649
mm². 4.2 billion transistors.
• Each core is an aggressive,
wide-issue superscalar
design, with 16 execution
pipelines for massive data
crunching.
• Uncore supports a massive 7.6
Tb/s off-chip bandwidth
including memory and SMP
links, PCIe links, an off-chip
coherent accelerator interface,
as well as on-chip bus-
attached data accelerators.
Figure: POWER8™ chip photomicrograph with
superimposed demarcations to indicate regions occupied by cores,
L2/L3 caches, chip interconnect, memory controllers, etc.
IBM POWER8 Uncore
• 512 KB private L2 per core and an
8 MB L3 instance per core.
• Memory stack is separated into
four on-chip MCUs, each
partitioned into two sub-controllers,
for a total of 8
memory I/O links, with an
off-chip Centaur L4 buffer chip
per link.
• PowerBus (PB) provides
coherent communication
support across the
cache-memory subsystem.
Figure: Block-diagram view of the chip, with clearly
defined “uncore” elements (highlighted in blue).
Simulation for Uncore
Table: Characteristics of workloads
used for power abstraction
Figure: Uncore simulation
environment
Power Bus Ramp
• The Power Bus Ramp (PBIEX)
– Interface between a chiplet and the
Power Bus Unit.
– Buffers memory requests initiated
by L3 (cache miss/flush) until
Power Bus is available.
• Sends/receives memory requests to/from the
Power Bus Unit (PBEH).
– Activity is foremost determined by
the read/write bandwidth to and
from its associated chiplet.
• Handles coherence requests on the
Power Bus.
– Activity is thus further determined
by snooping activity resulting from
reads/writes originating from other
chiplets.
Figure: PBIEX regression statistics and bar graph showing
error sources for predicted power for each workload.
Power Bus Unit
• Routing fabric between each chiplet
and the memory/network
controllers. Performs coherency
checks on each memory request
received from the chiplets.
• Each L3 cache miss results in the
PBEH broadcasting a request to
each chiplet to check whether some
other L3 contains the requested
cache line.
– If not, the request is forwarded to
the correct memory controller.
• Communicates coherence requests
and responses between chiplets, as
well as memory requests to and
from the MCU.
Figure: PBEH regression statistics and bar graph showing error
sources for predicted power for each workload.
Memory Controller Unit
• Interface between the PBEH and the
high speed serial I/O links going to
and from memory.
• Each request is assembled into a
transmission frame and sent over
the link.
• Activity of the MCU is foremost
determined by the read and write
bandwidth of the PBEH.
• MCU must also reject a request if it
cannot handle more requests due to
full buffers.
– Activity is therefore also dependent on
the number of such retries.
Figure: MCU regression statistics and bar graph showing error
sources for predicted power for each workload.
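Taken together, the three unit descriptions suggest a digital power proxy built as a sum of per-unit linear models over the activity markers. The sketch below is illustrative only: the class and function names are hypothetical, every coefficient is a placeholder, and the marker set chosen for the PBEH is an assumption (the slides name read/write bandwidth plus snoops for the PBIEX, and read/write bandwidth plus retries for the MCU).

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class UnitPowerModel:
    """Linear model: power = idle_power + sum(weight * event_rate)."""
    idle_power: float
    weights: Mapping[str, float]

    def estimate(self, event_rates: Mapping[str, float]) -> float:
        # Markers missing from event_rates are treated as zero activity.
        return self.idle_power + sum(
            w * event_rates.get(event, 0.0) for event, w in self.weights.items()
        )


# Placeholder coefficients in arbitrary units; not measured values.
UNCORE_MODELS: Dict[str, UnitPowerModel] = {
    "PBIEX": UnitPowerModel(1.0, {"reads": 0.5, "writes": 0.5, "snoops": 0.25}),
    "PBEH": UnitPowerModel(2.0, {"reads": 0.25, "writes": 0.25, "snoops": 0.5}),
    "MCU": UnitPowerModel(1.5, {"reads": 0.5, "writes": 0.75, "retries": 0.25}),
}


def uncore_power_proxy(event_rates: Mapping[str, float]) -> float:
    """Sum the per-unit estimates for one sampling interval."""
    return sum(model.estimate(event_rates) for model in UNCORE_MODELS.values())


idle = uncore_power_proxy({})
busy = uncore_power_proxy({"reads": 2.0, "writes": 1.0, "snoops": 4.0})
print(idle, busy)
```

Keeping one small model per unit mirrors the way the slides report regression statistics separately for the PBIEX, PBEH, and MCU.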
Abstract Model Conclusions
• Uncore elements in modern-day processors can be modeled accurately
even with seemingly drastic abstractions.
• The models are accurate for the abstraction level at which they are intended to be used.
– < 6% maximum errors across the abstract uncore models.
– < 9% power difference observed for the minimum vs. maximum
address and data switching workloads.
• Future work: Model refinements to focus on DS events.
– Capture the degree of bit switching on addresses and data that
move through the uncore units.
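One way such a bit-switching marker could be computed is sketched below, under the assumption that "degree of bit switching" means the fraction of bits that toggle between consecutive address or data words on a bus. The function names and the sample words are hypothetical.

```python
from typing import Sequence


def bit_toggles(previous: int, current: int) -> int:
    """Number of bit positions that differ between two words."""
    return bin(previous ^ current).count("1")


def switching_factor(words: Sequence[int], width: int) -> float:
    """Average fraction of bus bits toggling per transfer.

    `width` is the bus width in bits. Returns 0.0 for fewer than two
    words, since no transition can be observed.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if len(words) < 2:
        return 0.0
    toggles = sum(bit_toggles(a, b) for a, b in zip(words, words[1:]))
    return toggles / ((len(words) - 1) * width)


# Alternating all-zeros and all-ones on an 8-bit bus toggles every bit.
print(switching_factor([0x00, 0xFF, 0x00], width=8))
```

A marker like this would sit alongside the read, write, retry, and snoop counts as an additional regression input.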
Summary & Conclusions
• Modeling uncore power and identifying power reduction opportunities is
a critical aspect of future power-efficient microprocessor design.
• We present a practical methodology for use in an industrial setting
for deriving abstract analytical power models for selected key
uncore elements.
• We show that even with very few power event markers and a small
set of stress marks, it is possible to develop accurate power models
for uncore elements of a modern day chip.
• We quantify the accuracy such models have in providing improved
power proxies and predicting worst-case bounds on chip level
inductive noise in future technologies.

Empirically Derived Uncore Power Models for Server Processor

  • 1. International Symposium on Low Power Electronics and Design 1 Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class Processor Chip Hans Jacobson, Arun Joseph*, Dharmesh Parikh*, Pradip Bose, Alper Buyuktosunoglu IBM Systems & Technology Group* IBM T. J. Watson Research
  • 2. International Symposium on Low Power Electronics and Design 2 Uncore Power: Overview • Pre-silicon power modeling has primarily focused on the processor cores. As designs evolved, attention has shifted to “uncore”. • Abstractions needed to characterize power-performance trade-offs. • We examine the challenge of developing practical abstractions in uncore power modeling in an industrial setting. • We report a systematic methodology of abstractions in modeling with focus on key uncore elements of IBM POWER8 processor. • We show that uncore elements can be modeled using a few activity markers and a small set of microbenchmark stress test cases.
  • 3. International Symposium on Low Power Electronics and Design 3 Use-case: Digital Power Proxies • Without an uncore proxy, dynamic power management policy conservatively estimates a constant, high power for the uncore. – Reduces opportunity for maximizing performance/watt at chip level. • Uncore proxy for a POWER8 improves accuracy of chip power by 15%. – Compared to core, L2 and L3 level proxies with uncore worst-case power. • For scenarios where the uncore is largely idle this could translate to an opportunity to boost the frequency by at least 5%. – For a given chip power cap assuming chip-wide DVFS. – With per-core DVFS control and the ability to shift power across core domains, the boost in performance could be much higher. • More use-cases: Inductive noise trend analysis, Correct decisions in the choice of early-stage micro-architectural parameters
  • 4. Reference Power Modeling
      • The detailed reference chip power analysis tool chain used at IBM [1].
      • Accuracy validated against POWER7+ hardware power.
      • The workload-specific power and event counts form the data points we use in generating an uncore abstract power model.
      Figure: Reference power modeling methodology (flow diagram with blocks: IP Blocks, IP Block Power Abstract Generation, IP Power Abstracts, Contributor Based Cell Power Model Generation, Standard Cell Library, Chip Level Power Analysis, Chip Netlist, Core Sim, Uncore Sim, RTL Simulator, Workloads, Clock and Data Switching, Power, Chip RTL).
      [1] Dhanwada, N., et al. 2013. Efficient PVT independent abstraction of large IP blocks for hierarchical power analysis. ICCAD, Nov. 2013.
  • 5. Reference Power Abstraction
      • The reference flow can be abstracted along several dimensions:
          – The RTL simulator could be an early-stage microarchitecture-level pipeline timing model;
          – The workload could be a suite of representative loop kernels;
          – The switching statistics could be reduced to a smaller subset;
          – The circuit-level detailed analysis could be approximated by area or gate count based analytical equations.
      • Abstracted power models:
          – RTL simulations provide the data: power and high-level event counts.
          – Linear regression techniques are applied to the data from RTL simulations.
  • 6. Uncore Power Modeling
      • IBM POWER7 L2, L3 cache uncore elements constitute 20% of power for a TDP workload.
          – Further include large macros that are shared by all chiplets.
      • We focus on the path taken by memory requests.
          – Starting at L3 to chip memory I/O links.
      • We propose a seemingly drastic abstraction.
          – 4 activity markers: read, write, retry and snoop events.
          – Small set of carefully crafted micro-benchmarks.
          – Average error: 1.4-2.4%.
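The following is a minimal sketch of how a linear power model over the four activity markers (reads, writes, retries, snoops) could be fitted by ordinary least squares. It is not the authors' tool flow. The workload names, event rates, and "true" coefficients are synthetic values invented for illustration, and the power samples are generated from those coefficients rather than taken from RTL simulation.

```python
def fit_linear_power_model(samples):
    """Ordinary least squares via the normal equations (pure Python).

    samples: list of (features, power), features = (reads, writes, retries, snoops).
    Returns [idle_power, c_read, c_write, c_retry, c_snoop].
    """
    rows = [[1.0] + list(features) for features, _ in samples]
    powers = [power for _, power in samples]
    n = len(rows[0])
    # Augmented normal equations: (X^T X | X^T y).
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * p for r, p in zip(rows, powers))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("workloads do not separate the activity markers")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(coefs, features):
    """Predicted power = idle term + sum(coefficient * event rate)."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], features))


# Synthetic stand-ins for microbenchmark stress cases (events per cycle).
WORKLOADS = {
    "idle":         (0.0, 0.0, 0.00, 0.00),
    "read_stress":  (0.8, 0.0, 0.05, 0.10),
    "write_stress": (0.0, 0.7, 0.10, 0.05),
    "snoop_stress": (0.1, 0.1, 0.00, 0.90),
    "retry_stress": (0.5, 0.5, 0.60, 0.20),
    "mixed":        (0.4, 0.3, 0.10, 0.30),
}
TRUE_COEFS = [2.0, 1.5, 1.8, 0.6, 0.9]  # hypothetical, arbitrary power units

samples = [(f, predict(TRUE_COEFS, f)) for f in WORKLOADS.values()]
coefs = fit_linear_power_model(samples)
errors = [abs(predict(coefs, f) - p) / p * 100.0 for f, p in samples]
print("fitted coefficients:", [round(c, 3) for c in coefs])
print("max error (%):", round(max(errors), 6))
```

Because each stress case emphasizes a different marker, the regression can separate the per-event costs with only a handful of workloads; with nearly collinear workloads the fit would become ill-conditioned.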
  • 7. IBM POWER8
      • 12 cores with 8-way simultaneous multi-threading (SMT) per core. Total on-chip L2+L3 cache capacity is 102 MB.
      • Fabricated using a 22 nm CMOS SOI process. Die size of 649 mm². 4.2 billion transistors.
      • Each core is an aggressive, wide-issue superscalar design, with 16 execution pipelines for massive data crunching.
      • Uncore supports a massive 7.6 Tb/s off-chip bandwidth including memory and SMP links, PCIe links, an off-chip coherent accelerator interface, as well as on-chip bus-attached data accelerators.
      Figure: POWER8™ chip photomicrograph with superimposed demarcations to indicate regions occupied by cores, L2/L3 caches, chip interconnect, memory controllers, etc.
  • 8. IBM POWER8 Uncore
      • 512 KB private L2 per core, an 8 MB L3 instance per core.
      • Memory stack separated into four on-chip MCUs, each partitioned into two sub-controllers, for a total of 8 memory I/O links and an off-chip Centaur L4 buffer chip per link.
      • PowerBus (PB) provides coherent communication support across the cache-memory subsystem.
      Figure: Block-diagram view of the chip, with clearly defined “uncore” elements (highlighted in blue).
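As a quick consistency check, the figures quoted on the POWER8 slides can be combined as follows. The variable names are illustrative; the numbers are the ones stated on the slides (12 cores, 512 KB L2 and 8 MB L3 per core, four MCUs with two sub-controllers each).

```python
CORES = 12
L2_PER_CORE_MB = 0.5   # 512 KB private L2 per core
L3_PER_CORE_MB = 8.0   # one 8 MB L3 instance per core

# Total on-chip L2+L3 capacity across all cores.
total_cache_mb = CORES * (L2_PER_CORE_MB + L3_PER_CORE_MB)

MCUS = 4
SUB_CONTROLLERS_PER_MCU = 2

# One memory I/O link per sub-controller.
memory_links = MCUS * SUB_CONTROLLERS_PER_MCU

print(total_cache_mb, "MB of L2+L3;", memory_links, "memory I/O links")
```

The results, 102 MB and 8 links, agree with the totals given on the slides.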
  • 9. Simulation for Uncore
      Table: Characteristics of workloads used for power abstraction
      Figure: Uncore simulation environment
  • 10. Power Bus Ramp
      • The Power Bus Ramp (PBIEX)
          – Interface between a chiplet and the Power Bus Unit.
          – Buffers memory requests initiated by L3 (cache miss/flush) until the Power Bus is available.
      • Sends/receives memory requests to/from the Power Bus Unit (PBEH).
          – Activity is foremost determined by the read/write bandwidth to and from its associated chiplet.
      • Handles coherence requests on the Power Bus.
          – Activity is thus further determined by snooping activity resulting from reads/writes originating from other chiplets.
      Figure: PBIEX regression statistics and bar graph showing error sources for predicted power for each workload.
  • 11. Power Bus Unit
      • Routing fabric between each chiplet and the memory/network controllers. Performs coherency checks on each memory request received from the chiplets.
      • Each L3 cache miss results in the PBEH broadcasting a request to each chiplet to check whether some other L3 contains the requested cache line.
          – If not, the request is forwarded to the correct memory controller.
      • Communicates coherence requests and responses between chiplets, as well as memory requests to and from the MCU.
      Figure: PBEH regression statistics and bar graph showing error sources for predicted power for each workload.
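The miss-handling behavior described for the PBEH can be sketched as a toy functional model. This is an illustration only: the class, its fields, and the choice to count one snoop per non-requesting chiplet are assumptions, not details of the actual hardware.

```python
from dataclasses import dataclass


@dataclass
class PowerBusSketch:
    """Toy model of broadcast snooping on an L3 miss (hypothetical structure)."""
    l3_contents: dict          # chiplet id -> set of cached line addresses
    snoop_events: int = 0      # activity marker: snoops seen by chiplets
    memory_requests: int = 0   # requests forwarded to a memory controller

    def handle_l3_miss(self, requester: int, line: int) -> str:
        owner = None
        for chiplet, lines in self.l3_contents.items():
            if chiplet == requester:
                continue
            # The broadcast reaches every other chiplet, hit or not.
            self.snoop_events += 1
            if owner is None and line in lines:
                owner = chiplet
        if owner is not None:
            return f"chiplet {owner}"
        # No other L3 holds the line: forward to memory.
        self.memory_requests += 1
        return "memory controller"


bus = PowerBusSketch({0: set(), 1: {0x40}, 2: {0x80}})
first = bus.handle_l3_miss(requester=0, line=0x80)   # held by chiplet 2
second = bus.handle_l3_miss(requester=0, line=0xC0)  # held by nobody
print(first, "|", second, "|", bus.snoop_events, "snoops")
```

The sketch shows why snoop counts rise with traffic from other chiplets: every miss generates snoop activity across the fabric even when the line is ultimately served from memory.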
  • 12. Memory Controller Unit
      • Interface between the PBEH and the high-speed serial I/O links going to and from memory.
      • Each request is assembled into a transmission frame and sent over the link.
      • Activity of the MCU is foremost determined by the read and write bandwidth of the PBEH.
      • The MCU must also reject a request if it cannot handle more requests due to full buffers.
          – Activity is therefore also dependent on the number of such retries.
      Figure: MCU regression statistics and bar graph showing error sources for predicted power for each workload.
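To show why retries become a separate activity marker for the MCU, here is a small bounded-buffer sketch. The buffer capacity, drain rate, and arrival pattern are made-up values, and the class is a hypothetical simplification rather than a description of the real controller.

```python
class McuBufferSketch:
    """Toy MCU request buffer that rejects requests when full (hypothetical)."""

    def __init__(self, capacity: int, drain_per_cycle: int):
        self.capacity = capacity
        self.drain_per_cycle = drain_per_cycle
        self.occupancy = 0
        self.accepted = 0
        self.retries = 0   # rejected requests that must be retried later

    def cycle(self, arrivals: int) -> None:
        # Accept requests until the buffer is full, reject the rest.
        for _ in range(arrivals):
            if self.occupancy < self.capacity:
                self.occupancy += 1
                self.accepted += 1
            else:
                self.retries += 1
        # Send buffered requests over the memory link.
        self.occupancy -= min(self.drain_per_cycle, self.occupancy)


mcu = McuBufferSketch(capacity=4, drain_per_cycle=1)
for arrivals in [3, 3, 3, 0]:
    mcu.cycle(arrivals)
print(mcu.accepted, "accepted,", mcu.retries, "retries,", mcu.occupancy, "queued")
```

When the offered bandwidth exceeds what the link can drain, the retry count grows independently of the accepted read/write count, which is why a bandwidth-only model would miss part of the activity.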
  • 13. Abstract Model Conclusions
      • Uncore elements in modern-day processors can be modeled accurately even with seemingly drastic abstractions.
      • The models are accurate for the abstraction level at which they are intended to be used.
          – < 6% maximum errors across the abstract uncore models.
          – < 9% power difference observed for the minimum vs. maximum address and data switching workloads.
      • Future work: Model refinements to focus on DS events.
          – Capture the degree of bit switching on addresses and data that move through the uncore units.
  • 14. Summary & Conclusions
      • Uncore power and the identification of power reduction opportunities are critical aspects of future power-efficient microprocessor design.
      • We present a practical methodology for use in an industrial setting for deriving abstract analytical power models for selected key uncore elements.
      • We show that even with very few power event markers and a small set of stress marks, it is possible to develop accurate power models for uncore elements of a modern-day chip.
      • We quantify the accuracy such models have in providing improved power proxies and predicting worst-case bounds on chip-level inductive noise in future technologies.