This document describes MemGuard, an operating system mechanism for providing efficient per-core memory performance isolation on commercial off-the-shelf hardware. MemGuard uses memory bandwidth reservation to guarantee each core's minimum memory bandwidth. It then performs predictive bandwidth donation and on-demand reclaiming to redistribute excess bandwidth, improving overall utilization. Evaluation shows that MemGuard isolates performance, eliminating an interference-induced slowdown of more than 50% for a foreground real-time task, while maximizing throughput via bandwidth sharing.
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multicore Platforms
1. MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms
Apr 9, 2013
Heechul Yun+, Gang Yao+, Rodolfo Pellizzoni*,
Marco Caccamo+, Lui Sha+
+University of Illinois at Urbana-Champaign
*University of Waterloo
2. Real-Time Applications
• Resource intensive real-time applications
– Multimedia processing(*), real-time data analytics(**), object tracking
• Requirements
– More performance at lower cost, using Commercial Off-The-Shelf (COTS) hardware
– Performance guarantee (i.e., temporal predictability and isolation)
(*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010
(**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012
3. Modern System-on-Chip (SoC)
• More cores
– Freescale P4080 has 8 cores
• More sharing
– Shared memory hierarchy (LLC, MC, DRAM)
– Shared I/O channels
More performance, less cost. But, isolation?
4. Problem: Shared Memory Hierarchy
• Shared hardware resources
• OS has little control
[Figure: Core1-Core4 share the Last Level Cache (LLC, space contention), the Memory Controller (MC, access contention), and DRAM (Parts 1-4)]
5. Memory Performance Isolation
• Q. How to guarantee worst-case performance?
– Need to guarantee memory performance
[Figure: each core gets a private LLC partition, but Core1-Core4 still share the Memory Controller and DRAM (Parts 1-4)]
6. Inter-Core Memory Interference
• Significant slowdown (Up to 2x on 2 cores)
• Slowdown is not proportional to memory bandwidth usage
[Figure: runtime slowdown due to interference for foreground SPEC2006 tasks 437.leslie3d (1.6GB/s), 462.libquantum (1.5GB/s), 410.bwaves (1.5GB/s), and 471.omnetpp (1.4GB/s), each co-running with a 470.lbm background task on a dual-core Intel Core2 with private L2 caches and shared memory; slowdown ratios range from 1.0 up to about 2.2]
7. Background: DRAM Chip
[Figure: a DRAM chip is a set of banks (Bank 1-4) that can be accessed in parallel; each bank contains multiple rows (Row 1-5) and a row buffer. "activate" copies a row into the row buffer, "precharge" writes it back to the array, and reads/writes such as READ (Bank 1, Row 3, Col 7) are served out of the row buffer.]
• State-dependent access latency
– Row miss: 19 cycles, Row hit: 9 cycles (see the sketch below)
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
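To make the state-dependent cost concrete, here is a minimal user-space sketch of the row-buffer behavior described above. It is my illustration rather than anything from the paper: the bank count and the 9-cycle hit (CAS + transfer) versus 19-cycle miss (precharge + activate + CAS + transfer) breakdown follow the slide's PC6400-DDR2 5-5-5 numbers, and all identifiers are invented.

```c
#include <stdio.h>

#define NBANKS 4
#define ROW_HIT_CYCLES   9   /* CAS(5) + transfer(4) */
#define ROW_MISS_CYCLES 19   /* precharge(5) + activate(5) + CAS(5) + transfer(4) */

static int open_row[NBANKS]; /* row latched in each bank's row buffer (-1 = none) */

static int access_cost(int bank, int row)
{
    if (open_row[bank] == row)
        return ROW_HIT_CYCLES;   /* served straight from the row buffer */
    open_row[bank] = row;        /* precharge the old row, activate the new one */
    return ROW_MISS_CYCLES;
}

int main(void)
{
    for (int b = 0; b < NBANKS; b++)
        open_row[b] = -1;
    printf("READ (Bank 1, Row 3): %d cycles\n", access_cost(1, 3)); /* miss: 19 */
    printf("READ (Bank 1, Row 3): %d cycles\n", access_cost(1, 3)); /* hit:   9 */
    printf("READ (Bank 1, Row 4): %d cycles\n", access_cost(1, 4)); /* miss: 19 */
    return 0;
}
```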
8. Background: Memory Controller (MC)
• Request queue
– Buffer read/write requests from CPU cores
– Unpredictable queuing delay due to reordering
Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.
9. Background: MC Queue Re-ordering
• Improve row hit ratio and throughput
• Unpredictable queuing delay
Initial queue (arrival order), 2 row switches:
Core1: READ Row 1, Col 1
Core2: READ Row 2, Col 1
Core1: READ Row 1, Col 2
Reordered queue (row hits first), 1 row switch:
Core1: READ Row 1, Col 1
Core1: READ Row 1, Col 2
Core2: READ Row 2, Col 1
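The effect of re-ordering can be checked with a short sketch, assuming an FR-FCFS-style policy that serves row hits first. The three requests mirror the example above; all identifiers are my own.

```c
#include <stdio.h>

struct req { int core, row, col; };

/* Count transitions to a different row after the first access;
 * each transition forces a precharge + activate in the open bank. */
static int row_switches(const struct req *q, int n)
{
    int switches = 0, open = q[0].row;
    for (int i = 1; i < n; i++) {
        if (q[i].row != open) {
            switches++;
            open = q[i].row;
        }
    }
    return switches;
}

int main(void)
{
    struct req fifo[] = {       /* arrival order, as in the slide */
        {1, 1, 1}, {2, 2, 1}, {1, 1, 2},
    };
    struct req reordered[] = {  /* row-hit-first order */
        {1, 1, 1}, {1, 1, 2}, {2, 2, 1},
    };
    printf("initial queue:   %d row switches\n", row_switches(fifo, 3));      /* 2 */
    printf("reordered queue: %d row switches\n", row_switches(reordered, 3)); /* 1 */
    return 0;
}
```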
10. Challenges for Real-Time Systems
• Memory controller (MC) level queuing delay
– Main source of interference
– Unpredictable (re-ordering)
• DRAM state-dependent latency
– DRAM row open/close state
• State of the art
– Predictable DRAM controller hardware: [Akesson’07] [Paolieri’09] [Reineke’11]; not available in COTS
[Figure: a predictable memory controller would replace the COTS memory controller between the four cores and DRAM, but requires custom hardware]
11. Our Approach
• OS level approach
– Works on commodity h/w
– Guarantees performance of each core
– Maximizes memory performance if possible
[Figure: MemGuard adds an OS-level control mechanism between the four cores and the unmodified COTS memory controller and DRAM]
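As a rough illustration of the regulation idea (not the actual MemGuard kernel code), the sketch below simulates one core's regulator in user space. In the real system the counter is a per-core hardware performance counter programmed to raise an overflow interrupt when the budget is exhausted, throttling dequeues the core's tasks, and a periodic timer replenishes the budget; the names and constants here are illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

struct regulator {
    long budget;    /* memory requests allowed per period */
    long counter;   /* requests issued so far this period */
    bool throttled;
};

/* In MemGuard: the performance-counter overflow interrupt handler. */
static void on_counter_overflow(struct regulator *r)
{
    r->throttled = true;          /* dequeue this core's tasks */
}

/* In MemGuard: the periodic timer handler at each period boundary. */
static void on_period_timer(struct regulator *r)
{
    r->counter = 0;
    r->throttled = false;         /* re-enqueue tasks, resume counting */
}

static void issue_request(struct regulator *r)
{
    if (r->throttled)
        return;                   /* core stalls until the next period */
    if (++r->counter >= r->budget)
        on_counter_overflow(r);
}

int main(void)
{
    struct regulator r = { .budget = 3 };
    for (int i = 0; i < 5; i++)
        issue_request(&r);
    printf("issued=%ld throttled=%d\n", r.counter, r.throttled);  /* issued=3 throttled=1 */
    on_period_timer(&r);
    printf("throttled after period boundary: %d\n", r.throttled); /* 0 */
    return 0;
}
```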
18. Reclaim Example
• Time 0
– Initial budget 3 for both cores
• Time 3,4
– Decrement budgets
• Time 10 (period 1)
– Predictive donation (total 4)
• Time 12,15
– Core 1: reclaim
• Time 16
– Core 0: reclaim
• Time 17
– Core 1: no remaining budget; dequeue tasks
• Time 20 (period 2)
– Core 0 donates 2
– Core 1 does not donate
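The timeline above can be expressed as a small piece of budget accounting. The following is a simplified sketch under assumptions of my own: prediction keeps exactly last period's usage, reclaiming pulls fixed one-unit chunks from a global pool, and all identifiers are invented.

```c
#include <stdbool.h>
#include <stdio.h>

#define RECLAIM_CHUNK 1

static long g_pool;     /* donated, not-yet-reclaimed budget (global) */

struct core {
    long assigned;      /* static per-period reservation */
    long budget;        /* remaining budget this period */
    long used;          /* requests consumed this period */
    bool throttled;
};

/* Period boundary: predictively donate the budget we expect not to use. */
static void period_begin(struct core *c)
{
    long keep = c->used < c->assigned ? c->used : c->assigned;
    g_pool += c->assigned - keep;
    c->budget = keep;
    c->used = 0;
    c->throttled = false;
}

/* One memory request; reclaim on demand once our own budget is gone. */
static void consume(struct core *c)
{
    if (c->throttled)
        return;
    if (c->budget == 0) {
        if (g_pool >= RECLAIM_CHUNK) {
            g_pool -= RECLAIM_CHUNK;
            c->budget += RECLAIM_CHUNK;
        } else {
            c->throttled = true;   /* no budget left anywhere: dequeue tasks */
            return;
        }
    }
    c->budget--;
    c->used++;
}

int main(void)
{
    struct core c0 = { .assigned = 3, .used = 1 };  /* used 1 last period */
    struct core c1 = { .assigned = 3, .used = 3 };  /* used all of it */
    period_begin(&c0);                 /* keeps 1, donates 2 */
    period_begin(&c1);                 /* keeps 3, donates 0 */
    for (int i = 0; i < 4; i++)        /* c1 needs more than its budget */
        consume(&c1);
    printf("pool=%ld, c1 throttled=%d\n", g_pool, c1.throttled); /* pool=1, throttled=0 */
    return 0;
}
```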
19. Best-effort Bandwidth Sharing
• Key objective
– Utilize best-effort bandwidth whenever possible
• Best-effort bandwidth
– Bandwidth left over after all cores have used their budgets (i.e., the
guaranteed bandwidth has been delivered) but before the next period begins
• Sharing policy
– Maximize throughput
– Broadcast to all cores so they continue to execute (see the sketch below)
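A matching sketch of the sharing policy, as a fragment that extends the same style of accounting (same caveats: illustrative names and structure, not MemGuard's code):

```c
#include <stdbool.h>

struct core_state {
    long used, assigned;
    bool throttled;
};

/* Once every core has consumed its assigned budget, the guaranteed
 * bandwidth has been delivered; wake all throttled cores so the
 * remaining time until the next period is used at full speed. */
static void share_best_effort(struct core_state *cores, int n)
{
    for (int i = 0; i < n; i++)
        if (cores[i].used < cores[i].assigned)
            return;                   /* guaranteed b/w not yet delivered */
    for (int i = 0; i < n; i++)
        cores[i].throttled = false;   /* "broadcast": all cores resume */
}
```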
20. Evaluation Platform
• Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM
– Prefetchers were turned off for evaluation
– Also ported to the PowerPC-based P4080 (8 cores) and the ARM-based Exynos 4412 (4 cores)
• Modified Linux kernel 3.6.0 + MemGuard kernel module
– https://github.com/heechul/memguard/wiki/MemGuard
• Used all 29 benchmarks from SPEC2006, plus synthetic benchmarks
[Figure: Intel Core2Quad topology; Core 0 and Core 1 share one L2 cache, Core 2 and Core 3 share another, and all four cores share the Memory Controller and DRAM]
26. Conclusion
• Inter-Core Memory Interference
– Big challenge for multi-core based real-time systems
– Sources: queuing delay in MC, state dependent latency in
DRAM
• MemGuard
– OS mechanism providing efficient per-core memory
performance isolation on COTS H/W
– Memory bandwidth reservation and reclaiming support
– https://github.com/heechul/memguard/wiki/MemGuard
29. Effect of MemGuard
• Soft real-time application on each core.
• Provides differentiated memory bandwidth
– Per-core weights of 1:2:4:8 for the guaranteed bandwidth, with spare bandwidth sharing enabled
Editor's Notes
Let me begin the talk. Today, I will talk about MemGuard.
Performance and energy consumption are important. These are examples of memory-intensive real-time applications.
Modern SoCs can satisfy many of these requirements. This figure is an example of such an SoC.
The memory bandwidth isolation problem: resource reservation, in particular CPU reservation, has been studied extensively; memory bandwidth reservation has not.
In multicore-based systems, however, CPU reservation does not provide a sufficient guarantee. We focus on a partitioned system where each core runs a workload independent of the others. The big question is how we can guarantee the performance of each core. For caches, isolation mechanisms exist in commodity processors; for memory, much less so.
To quantify the magnitude of the interference problem, I ran an experiment on a dual-core system. First, both the foreground task on core 0 and the background task on core 2 suffer slowdown, as expected. Second, some tasks, e.g., libquantum, suffer much more, while lbm suffers less than 20%. Interestingly, libquantum is not the most memory-intensive task. The X-axis is sorted; milc suffers more than omnetpp, for example. We need to know how large the slowdown will be, or how to avoid it. As shown here, performance interference is huge, but quantifying it is challenging. To better understand it, let us look at how DRAM works.
Access cost differs depending on access history. DRAM is composed of a set of banks that can be accessed in parallel. Each bank is composed of multiple rows, each ranging from 1KB to 2KB, so a DRAM address can be given as (bank, row, column); rank and channel can be added depending on the system. The interesting part is that in order to access a location, the corresponding row must first be copied into a register file, the row buffer; this copy process is called activate. Once a row is activated, consecutive accesses can be handled through the row buffer as long as they are in the same row. If a new request is in a different row, the buffer must be written back to DRAM, which is called precharge, and the new row activated. So, simply speaking, a row miss costs 5 + 5 + 5 + 4 = 19 cycles, while a row hit costs 5 + 4 = 9 cycles.
Q: What about buffer size? Queuing delay: while queuing delay is well studied in the context of network packet switching, those results cannot be applied directly here because of scheduler re-ordering, which is done to maximize throughput.
A major source of unpredictability. T1 -> Core1 … FIXME
There are many hardware proposals that eliminate these limitations at the DRAM controller level, but there has been very little work that provides performance guarantees with software-based mechanisms. Paolieri-AMC.
There are many hardware proposals that eliminate these limitations at the DRAM controller level, but there has been very little work that provides performance guarantees with software-based mechanisms.
MemGuard consists of a set of bandwidth regulators that run inside the OS kernel. Each regulator limits the memory request rate of one core to a certain value: core 1's request rate is 0.9GB/s, core 2's rate is 0.1GB/s, and so on. The goal is to guarantee this rate for each core regardless of memory activity on the other cores; in other words, we want to reserve the bandwidth for each core. While reservation provides the performance guarantee, reclaiming improves performance whenever possible. For example, if core 1 does not use its assigned bandwidth because it is currently in a CPU-intensive phase, the unused bandwidth is redistributed by the reclaiming mechanism. Let me explain each of these in detail. [Prof. Sha]
First, let me explain how the bandwidth regulator works.
Why do we want to regulate the request rates?
The guaranteed bandwidth is defined as the worst-case DRAM performance, and it can be calculated. The claim is that if we regulate the request rates of all cores to stay below the minimum DRAM performance, 1.2GB/s in this case, requests are likely to be processed immediately without being buffered in the memory controller queues. While this is much smaller than the peak, many applications do not use peak bandwidth all the time, so reserving this bandwidth is meaningful for performance guarantees. We later show that this holds through extensive experiments on a real platform.
Because of these problems, reservation of memory bandwidth is not easy. As with any reservation strategy, reserving a resource may waste it if the reserved bandwidth goes unused. Unfortunately, an application's memory bandwidth demand typically varies significantly over time. To mitigate this problem, we support dynamic bandwidth reclaiming.
The key objective here is to redistribute underused bandwidth efficiently at runtime in order to increase memory bandwidth utilization when possible.
This shows the effectiveness of reclaiming and spare bandwidth sharing. Foreground: each SPEC2006 benchmark on core 0 with a 1.0GB/s reservation. Background: the lbm benchmark with a 0.2GB/s reservation.
[Prof. Sha], idea. Application example: multiple-camera synthetic vision.