MemGuard: Memory Bandwidth Reservation
System for Efficient Performance Isolation in
Multi-core Platforms
Apr 9, 2013
Heechul Yun+, Gang Yao+, Rodolfo Pellizzoni*,
Marco Caccamo+, Lui Sha+
+University of Illinois at Urbana-Champaign
*University of Waterloo
Real-Time Applications
2
• Resource-intensive real-time applications
– Multimedia processing(*), real-time data analytics(**), object tracking
• Requirements
– Need more performance at lower cost → Commercial Off-The-Shelf (COTS)
– Performance guarantee (i.e., temporal predictability and isolation)
(*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010
(**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012
Modern System-on-Chip (SoC)
• More cores
– Freescale P4080 has 8 cores
• More sharing
– Shared memory hierarchy (LLC, MC, DRAM)
– Shared I/O channels
3
More performance
Less cost
But, isolation?
Problem: Shared Memory Hierarchy
• Shared hardware resources
• OS has little control
4
[Figure: Core1–Core4 share the Last Level Cache (LLC), causing space contention, and the Memory Controller (MC) plus DRAM, causing access contention; Parts 1–4 denote per-core partitions]
Memory Performance Isolation
• Q. How to guarantee worst-case performance?
– Need to guarantee memory performance
5
[Figure: Core1–Core4, each with a private LLC partition, share the Memory Controller and DRAM; Parts 1–4 denote per-core partitions]
Inter-Core Memory Interference
• Significant slowdown (Up to 2x on 2 cores)
• Slowdown is not proportional to memory bandwidth usage
6
[Figure: dual-core Intel Core2 setup with private L2 caches and shared memory; a foreground benchmark (X-axis) co-runs with background 470.lbm. Runtime slowdown due to interference: 437.leslie3d (1.6GB/s), 462.libquantum (1.5GB/s), 410.bwaves (1.5GB/s), 471.omnetpp (1.4GB/s), with slowdown ratios ranging up to ~2.2x]
Background: DRAM Chip
• State-dependent access latency
– Row miss: 19 cycles, Row hit: 9 cycles
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
[Figure: a DRAM chip with Banks 1–4; each bank contains Rows 1–5 and a Row Buffer; a READ (Bank 1, Row 3, Col 7) requires activate/precharge when switching rows, then read/write through the row buffer at Col 7]
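The state-dependent cost above can be captured by a toy model. A sketch, using the 5-5-5 timings from this slide; charging a closed bank the full miss cost is a simplification, and the class and method names are illustrative, not any real DRAM simulator's API:

```python
# Toy model of state-dependent DRAM access latency: each bank caches one
# open row in its row buffer; a request to a different row pays
# precharge + activate on top of the column access (5-5-5 timing).
class DRAMBank:
    T_PRE, T_ACT, T_CAS, T_DATA = 5, 5, 5, 4  # cycles

    def __init__(self):
        # No row open yet; for simplicity a closed bank is charged as a miss.
        self.open_row = None

    def access(self, row):
        if self.open_row == row:                  # row hit
            return self.T_CAS + self.T_DATA       # 5 + 4 = 9 cycles
        # row miss: write back the old row, activate the new one
        self.open_row = row
        return self.T_PRE + self.T_ACT + self.T_CAS + self.T_DATA  # 19 cycles

bank = DRAMBank()
print(bank.access(3))  # first access to Row 3: miss -> 19
print(bank.access(3))  # same row: hit -> 9
print(bank.access(1))  # row switch: miss -> 19
```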
Background: Memory Controller (MC)
• Request queue
– Buffer read/write requests from CPU cores
– Unpredictable queuing delay due to reordering
8
Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.
Background: MC Queue Re-ordering
• Improve row hit ratio and throughput
• Unpredictable queuing delay
9
Initial Queue (2 row switches):
Core1: READ Row 1, Col 1
Core2: READ Row 2, Col 1
Core1: READ Row 1, Col 2
Reordered Queue (1 row switch):
Core1: READ Row 1, Col 1
Core1: READ Row 1, Col 2
Core2: READ Row 2, Col 1
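The open-row-first reordering sketched above (FR-FCFS-style: prefer the oldest request that hits the currently open row, else the oldest request) can be simulated on this slide's three-request queue. A minimal sketch, not the behavior of any specific controller:

```python
# FR-FCFS-like reordering sketch: repeatedly pick the oldest pending
# request whose row matches the currently open row; fall back to the
# oldest request when nothing hits.
def reorder(queue):
    open_row, schedule, pending = None, [], list(queue)
    while pending:
        hit = next((r for r in pending if r[1] == open_row), None)
        req = hit if hit is not None else pending[0]
        pending.remove(req)
        schedule.append(req)
        open_row = req[1]
    return schedule

def row_switches(schedule):
    # Count how often consecutive requests target different rows.
    rows = [row for _core, row, _col in schedule]
    return sum(1 for a, b in zip(rows, rows[1:]) if a != b)

# The slide's queue, as (core, row, col) tuples:
queue = [("Core1", 1, 1), ("Core2", 2, 1), ("Core1", 1, 2)]
print(row_switches(queue))            # 2 row switches in the initial order
print(row_switches(reorder(queue)))   # 1 row switch after reordering
```

The throughput gain is exactly why COTS controllers reorder, and why the queuing delay seen by any one core becomes unpredictable.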
Challenges for Real-Time Systems
• Memory controller(MC) level queuing delay
– Main source of interference
– Unpredictable (re-ordering)
• DRAM state dependent latency
– DRAM row open/close state
• State of the Art
– Predictable DRAM controller h/w: [Akesson’07] [Paolieri’09] [Reineke’11] → not in COTS
10
[Figure: Core1–Core4 behind the memory controller and DRAM; prior work replaces the COTS memory controller with a predictable memory controller in hardware]
Our Approach
• OS level approach
– Works on commodity h/w
– Guarantees performance of each core
– Maximizes memory performance if possible
11
[Figure: Core1–Core4 behind the unmodified COTS memory controller and DRAM, with an OS control mechanism regulating the cores' memory traffic]
MemGuard
12
• Memory bandwidth reservation and reclaiming
[Figure: MemGuard runs inside the Operating System on a multicore processor; a per-core BW Regulator, driven by each core's PMC, enforces a per-core rate (e.g., 0.9GB/s, 0.1GB/s, 0.1GB/s, 0.1GB/s), and a Reclaim Manager redistributes unused budget in front of the Memory Controller and DRAM DIMM]
Memory Bandwidth Reservation
• Idea
– OS monitors and enforces each core’s memory bandwidth usage
13
[Figure: timeline over 1ms regulation periods; a core's budget (e.g., 2) is decremented on each memory fetch during execution; when the budget is exhausted the core's tasks are dequeued, and they are enqueued again when the next period refills the budget]
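The regulation loop above can be sketched as a per-core token bucket over 1ms periods. In the real system a PMC overflow interrupt plays the role of the per-access decrement; the class and method names below are illustrative, not MemGuard's actual API:

```python
# Sketch of per-core bandwidth regulation: each core gets a budget of
# memory accesses per 1ms period; when the budget reaches zero the
# core's tasks are dequeued (throttled) until the next period refill.
class Regulator:
    def __init__(self, budget):
        self.budget_per_period = budget
        self.remaining = budget
        self.throttled = False

    def on_memory_access(self):      # in MemGuard: a PMC overflow interrupt
        if self.throttled:
            return
        self.remaining -= 1
        if self.remaining == 0:
            self.throttled = True    # dequeue the core's tasks

    def on_period_tick(self):        # 1ms timer: refill and re-enqueue
        self.remaining = self.budget_per_period
        self.throttled = False

reg = Regulator(budget=2)
reg.on_memory_access()
reg.on_memory_access()
print(reg.throttled)   # True: budget exhausted mid-period
reg.on_period_tick()
print(reg.throttled)   # False: new period, tasks enqueued again
```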
Memory Bandwidth Reservation
14
Guaranteed Bandwidth: rmin
15
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
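rmin is the bandwidth the DRAM can sustain even when every access is a row miss. A back-of-the-envelope sketch, assuming a 400MHz DDR2-800 command clock and 64-byte cache lines; the exact cycle accounting is controller-specific, and the talk's own figure (per the speaker notes) is about 1.2GB/s:

```python
# Back-of-the-envelope guaranteed bandwidth: worst case, every 64-byte
# cache-line fetch is a row miss costing 19 DRAM clock cycles
# (5-5-5 PC6400-DDR2, per the slide).
LINE_BYTES = 64
ROW_MISS_CYCLES = 19
DRAM_CLOCK_HZ = 400e6                       # DDR2-800 command clock (assumed)

seconds_per_access = ROW_MISS_CYCLES / DRAM_CLOCK_HZ
r_min = LINE_BYTES / seconds_per_access     # bytes per second
print(round(r_min / 1e9, 2), "GB/s")        # ~1.35 GB/s, the same ballpark
                                            # as the ~1.2 GB/s in the talk
```

Though far below the 6.4GB/s peak, regulating all cores' rates to stay under this floor means requests are likely served without queuing in the memory controller, which is what makes the reservation a guarantee.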
Memory Access Pattern
• Memory access patterns vary over time
• Static reservation is inefficient
16
[Figure: memory requests over time (ms) for two applications; the request rate varies significantly across execution phases]
Memory Bandwidth Reclaiming
• Key objective
– Redistribute excess bandwidth to demanding
cores
– Improve memory b/w utilization
• Predictive bandwidth donation and reclaiming
– Donate unneeded budget predictively
– Reclaim on an on-demand basis
17
Reclaim Example
• Time 0
– Initial budget 3 for both cores
• Time 3,4
– Decrement budgets
• Time 10 (period 1)
– Predictive donation (total 4)
• Time 12,15
– Core 1: reclaim
• Time 16
– Core 0: reclaim
• Time 17
– Core 1: no remaining budget; dequeue tasks
• Time 20 (period 2)
– Core 0 donates 2
– Core 1 does not donate
18
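The donation/reclaim protocol walked through above can be sketched as follows. This is a simplification under assumed names (the real algorithm lives in the MemGuard paper and source): at each period a core donates the budget it predicts it will not need into a global pool, and a core that exhausts its assigned budget mid-period reclaims from the pool before being throttled.

```python
# Sketch of predictive donation + on-demand reclaiming.
class ReclaimManager:
    def __init__(self):
        self.pool = 0                     # globally shared donated budget

    def donate(self, amount):
        self.pool += amount

    def reclaim(self, amount=1):
        take = min(amount, self.pool)
        self.pool -= take
        return take

class Core:
    def __init__(self, assigned, mgr):
        self.assigned, self.mgr = assigned, mgr
        self.predicted_use = assigned     # initial prediction: use it all
        self.remaining = assigned

    def new_period(self):
        # Keep only the predicted need; donate the rest to the pool.
        self.remaining = self.predicted_use
        self.mgr.donate(self.assigned - self.predicted_use)

    def memory_access(self):
        if self.remaining == 0 and self.mgr.reclaim() == 0:
            return False                  # no budget anywhere: throttle
        if self.remaining > 0:
            self.remaining -= 1
        return True

mgr = ReclaimManager()
core0 = Core(assigned=3, mgr=mgr)
core0.predicted_use = 1                   # a light phase is predicted
core0.new_period()
print(mgr.pool)                           # 2 units donated to the pool
```

A misprediction is safe in the sense that the core can pull its donation back on demand; only when the pool is also empty does it get throttled until the next period (the "reclaim underrun" case in the backup slides).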
Best-effort Bandwidth Sharing
• Key objective
– Utilize best-effort bandwidth whenever possible
• Best-effort bandwidth
– After all cores use their budgets (i.e., delivering
guaranteed bandwidth), before the next period begins
• Sharing policy
– Maximize throughput
– Broadcast to all cores to continue executing
19
Evaluation Platform
• Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM
– Prefetchers were turned off for evaluation
– PowerPC-based P4080 (8-core) and ARM-based Exynos4412 (4-core) ports also exist
• Modified Linux kernel 3.6.0 + MemGuard kernel module
– https://github.com/heechul/memguard/wiki/MemGuard
• All 29 benchmarks from SPEC2006, plus synthetic benchmarks
20
[Figure: Intel Core2Quad topology; Cores 0 and 1 share one L2 cache, Cores 2 and 3 share another, and both pairs share the Memory Controller and DRAM]
Evaluation Results
[Figure: 462.libquantum (foreground, C0) and memory hogs (background, C2) on a dual-core Intel Core2 with private L2s and shared memory. Normalized IPC of the foreground, run-alone vs. co-run, under three configurations: w/o MemGuard, MemGuard (reservation only), and MemGuard (reclaim+share)]
Without MemGuard, co-running causes a 53% performance reduction.
Evaluation Results
[Figure: same setup, with 462.libquantum (foreground) reserved 1GB/s and the memory hogs (background) reserved 0.2GB/s; normalized IPC of the foreground, run-alone vs. co-run, under the three configurations]
Reservation provides performance isolation: the foreground keeps its guaranteed performance.
Evaluation Results
[Figure: same setup and reservations (1GB/s foreground, 0.2GB/s background); normalized IPC of the foreground, run-alone vs. co-run, under the three configurations]
Reclaiming and sharing maximize performance while preserving the guarantee.
SPEC2006 Results
24
• Soft performance guarantee for the foreground (X-axis)
– W.r.t. a 1.0GB/s memory b/w reservation
[Figure: normalized IPC (relative to guaranteed performance) of each SPEC2006 foreground benchmark at 1.0GB/s, co-running with background 470.lbm at 0.2GB/s on the dual-core Intel Core2 setup]
SPEC2006 Results
• Improve overall throughput
– Background: 368%, foreground (X-axis): 6%
25
[Figure: normalized IPC (relative to guaranteed performance) of the foreground (X-axis) @1.0GB/s and the background (470.lbm) @0.2GB/s across SPEC2006, on the dual-core Intel Core2 setup]
Conclusion
• Inter-Core Memory Interference
– Big challenge for multi-core based real-time systems
– Sources: queuing delay in MC, state dependent latency in
DRAM
• MemGuard
– OS mechanism providing efficient per-core memory
performance isolation on COTS H/W
– Memory bandwidth reservation and reclaiming support
– https://github.com/heechul/memguard/wiki/MemGuard
26
Thank you.
27
Reclaim Underrun Error
28
Effect of MemGuard
• Soft real-time application on each core
• Provides differentiated memory bandwidth
– Per-core weights of 1:2:4:8 for the guaranteed b/w; spare bandwidth sharing is enabled
29
Editor's Notes

  1. Let me begin the talk. Today, I will talk about.
  2. Performance and energy consumption are important. One of the memory-intensive
  3. Modern SoCs can satisfy many of these requirements. This figure is an example of such an SoC.
  4. The memory bandwidth isolation problem. Resource isolation → reservation; CPU reservation has been extensively studied. Memory bandwidth reservation
  5. In multicore-based systems, however, CPU reservation doesn’t provide a sufficient guarantee. We focus on a partitioned system where each core performs an independent workload. The big question, however, is how we can guarantee the performance of each core. For the cache, mechanisms exist in commodity processors, but for memory not so much.
  6. To quantify the magnitude of the interference problem, I ran an experiment on a dual-core system. First, both the foreground on core0 and the background on core2 suffer slowdown (expected). Second, some benchmarks, e.g., libquantum, suffer much more, while lbm suffers less than 20%. Interestingly, libquantum is not the most memory-intensive task. The X-axis is sorted; milc suffers more than omnetpp, for example. We need to know how much the slowdown will be, or how to avoid it. As shown here, performance interference is huge, but quantifying it is challenging. To better understand,
  7. Access cost differs depending on access history. DRAM is composed of a set of banks that can be accessed in parallel. Each bank is composed of multiple rows, each ranging from 1KB to 2KB, so a DRAM address can be given as (bank, row, column); rank and channel can be added depending on the system. Now the interesting part: in order to access a location, the corresponding row must first be copied to a register file (the row buffer). The copy process is called activate. Once activated, consecutive accesses can be handled through the row buffer as long as they are in the same row. If a new request is to a different row, the buffer has to be copied back and the new row copied in; storing the row buffer back to DRAM is called precharge. So simply speaking, a row miss costs 5 + 5 + 5 + 4 = 19 cycles and a row hit costs 5 + 4 = 9 cycles.
  8. Q. Buffer size? Queuing delay: while queuing delay is well studied in the context of network packet switching, it cannot be directly applied here because the scheduler re-orders requests (to maximize throughput).
  9. Major source of unpredictability. T1 -> Core1 … FIXME
  10. There are many h/w proposals that eliminate these limitations at the level of the DRAM controller, but there has been very little work providing performance guarantees with software-based mechanisms. Paolieri-AMC
  11. There are many h/w proposals that eliminate these limitations at the level of the DRAM controller, but there has been very little work providing performance guarantees with software-based mechanisms.
  12. MemGuard consists of a set of bandwidth controllers that run inside the OS kernel. Each controller regulates the memory request rate of one core to a certain value: Core1’s request rate is 0.9GB/s, core2’s rate is 0.1GB/s, and so on. The goal is to guarantee this rate for each core regardless of memory activity on other cores; in other words, we want to reserve the bandwidth for each core. While reservation is for the performance guarantee, reclaiming improves performance whenever possible. For example, if Core1 does not use its assigned b/w because it is currently in a CPU-intensive phase, the unused b/w is redistributed by the reclaiming mechanism. Let me explain each of these in detail. [Prof. Sha]
  13. First, let me explain how the b/w regulator works.
  14. Why do we want to regulate the request rates?
  15. The guaranteed b/w is defined as the worst-case DRAM performance, and it can be calculated. The claim is that if we regulate the request rates of all cores to be below the minimum DRAM performance, which is 1.2GB/s in this case, requests are likely to be processed immediately without being buffered in the memory controller queues. While this is much smaller than the peak, many applications do not use peak bandwidth all the time, hence reserving this b/w is meaningful: a performance guarantee. We will later show this holds with extensive experiments on a real platform.
  16. Because of these problems, reservation of memory bandwidth is not easy. As with any reservation strategy, reserving a resource may waste it if the reserved b/w goes unused. Unfortunately, an application’s memory b/w demand typically varies significantly over time. To mitigate the problem, we support dynamic b/w reclaiming.
  17. The key objective here is to redistribute underused bandwidth efficiently at runtime in order to increase memory b/w utilization whenever possible.
  18. This shows the effectiveness of reclaiming and spare bandwidth sharing. Foreground: each SPEC2006 benchmark on core0 with a 1.0GB/s reservation. Background: the lbm benchmark with a 0.2GB/s reservation.
  19. This shows the effectiveness of reclaiming and spare bandwidth sharing. Foreground: each SPEC2006 benchmark on core0 with a 1.0GB/s reservation. Background: the lbm benchmark with a 0.2GB/s reservation.
  20. [Prof. Sha], idea. Application example: multi-camera synthetic vision.