Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
MemGuard: Memory Bandwidth Reservation
System for Efficient Performance Isolation in
Multi-core Platforms
Apr 9, 2013
Heec...
Real-Time Applications
2
• Resource intensive real-time applications
– Multimedia processing(*), real-time data analytic(*...
Modern System-on-Chip (SoC)
• More cores
– Freescale P4080 has 8 cores
• More sharing
– Shared memory hierarchy (LLC, MC, ...
Problem: Shared Memory Hierarchy
• Shared hardware resources
• OS has little control
Core1 Core2 Core3 Core4
DRAM
Part 1 P...
Memory Performance Isolation
• Q. How to guarantee worst-case performance?
– Need to guarantee memory performance
Part 1 P...
Inter-Core Memory Interference
• Significant slowdown (Up to 2x on 2 cores)
• Slowdown is not proportional to memory bandw...
Bank 4
Background: DRAM Chip
Row 1
Row 2
Row 3
Row 4
Row 5
Bank 1
Row Buffer
Bank 2
Bank 3
activate
precharge
Read/write
•...
Background: Memory Controller(MC)
• Request queue
– Buffer read/write requests from CPU cores
– Unpredictable queuing dela...
Background: MC Queue Re-ordering
• Improve row hit ratio and throughput
• Unpredictable queuing delay
9
Core1: READ Row 1,...
Challenges for Real-Time Systems
• Memory controller(MC) level queuing delay
– Main source of interference
– Unpredictable...
Our Approach
• OS level approach
– Works on commodity h/w
– Guarantees performance of each core
– Maximizes memory perform...
Operating System
MemGuard
12
Core1 Core2 Core3 Core4
PMC PMC PMC PMC
DRAM DIMM
MemGuard
Multicore Processor
Memory Control...
Memory Bandwidth Reservation
• Idea
– OS monitor and enforce each core’s memory bandwidth usage
13
1ms 2ms0
Dequeue tasks
...
Memory Bandwidth Reservation
14
Guaranteed Bandwidth: rmin
15
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
Memory Access Pattern
• Memory access patterns vary over time
• Static reservation is inefficient
16
Time(ms)
Memory
reque...
Memory Bandwidth Reclaiming
• Key objective
– Redistribute excess bandwidth to demanding
cores
– Improve memory b/w utiliz...
Reclaim Example
• Time 0
– Initial budget 3 for both cores
• Time 3,4
– Decrement budgets
• Time 10 (period 1)
– Predictiv...
Best-effort Bandwidth Sharing
• Key objective
– Utilize best-effort bandwidth whenever possible
• Best-effort bandwidth
– ...
Evaluation Platform
• Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM
– Prefetchers were turned off for evaluation
– ...
Evaluation Results
C0
Shared Memory
C2
Intel Core2
L2 L2
462.libquantum
(foreground)
memory hogs
(background)
0.0
0.2
0.4
...
Evaluation Results
C0
Shared Memory
C2
Intel Core2
L2 L2
462.Libquantum
(foreground)
0.0
0.2
0.4
0.6
0.8
1.0
1.2
run-alone...
Evaluation Results
C0
Shared Memory
C2
Intel Core2
L2 L2
462.Libquantum
(foreground)
0.0
0.2
0.4
0.6
0.8
1.0
1.2
run-alone...
SPEC2006 Results
24
0.0
0.5
1.0
1.5
2.0
2.5
3.0
foreground (X-axis) @1.0GB/s
NormalizedIPC
Normalized to guaranteed perfor...
SPEC2006 Results
• Improve overall throughput
– background: 368%, foreground(X-axis): 6%
25
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7....
Conclusion
• Inter-Core Memory Interference
– Big challenge for multi-core based real-time systems
– Sources: queuing dela...
Thank you.
27
Reclaim Underrun Error
28
Effect of MemGuard
• Soft real-time application on each core.
• Provides differentiated memory bandwidth
– weight for each...
Upcoming SlideShare
Loading in …5
×

MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multicore Platforms

4,461 views

Published on

Memory bandwidth in modern multi-core platforms is highly variable for many reasons and is a big challenge in designing real-time systems as applications are increasingly becoming more memory intensive. In this work, we proposed, designed, and implemented an efficient memory bandwidth reservation system, that we call MemGuard. MemGuard distinguishes memory bandwidth as two parts: guaranteed and best effort. It provides bandwidth reservation for the guaranteed bandwidth for temporal isolation, with efficient reclaiming to maximally utilize the reserved bandwidth. It further improves performance by exploiting the best effort bandwidth after satisfying each core’s reserved bandwidth. MemGuard is evaluated with SPEC2006 benchmarks on a real hardware platform, and the results demonstrate that it is able to provide memory performance isolation with minimal impact on overall throughput.

Published in: Technology
  • Be the first to comment

MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multicore Platforms

  1. 1. MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms Apr 9, 2013 Heechul Yun+, Gang Yao+, Rodolfo Pellizzoni*, Marco Caccamo+, Lui Sha+ +University of Illinois at Urbana-Champaign *University of Waterloo
  2. 2. Real-Time Applications 2 • Resource intensive real-time applications – Multimedia processing(*), real-time data analytic(**), object tracking • Requirements – Need more performance and cost less  Commercial Off-The Shelf (COTS) – Performance guarantee (i.e., temporal predictability and isolation) (*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010 (**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012
  3. 3. Modern System-on-Chip (SoC) • More cores – Freescale P4080 has 8 cores • More sharing – Shared memory hierarchy (LLC, MC, DRAM) – Shared I/O channels 3 More performance Less cost But, isolation?
  4. 4. Problem: Shared Memory Hierarchy • Shared hardware resources • OS has little control Core1 Core2 Core3 Core4 DRAM Part 1 Part 2 Part 3 Part 4 4 Memory Controller (MC) Shared Last Level Cache (LLC) Space contention Access contention
  5. 5. Memory Performance Isolation • Q. How to guarantee worst-case performance? – Need to guarantee memory performance Part 1 Part 2 Part 3 Part 4 5 Core1 Core2 Core3 Core4 DRAM Memory Controller LLC LLC LLC LLC
  6. 6. Inter-Core Memory Interference • Significant slowdown (Up to 2x on 2 cores) • Slowdown is not proportional to memory bandwidth usage 6 Core Shared Memory foreground X-axis Intel Core2 L2 L2 1.0 1.2 1.4 1.6 1.8 2.0 2.2 437.leslie3d 462.libquantum 410.bwaves 471.omnetpp Slowdown ratio due to interference (1.6GB/s) (1.5GB/s) (1.5GB/s) (1.4GB/s) Core background 470.lbm Runtimeslowdown Core
  7. 7. Bank 4 Background: DRAM Chip Row 1 Row 2 Row 3 Row 4 Row 5 Bank 1 Row Buffer Bank 2 Bank 3 activate precharge Read/write • State dependent access latency – Row miss: 19 cycles, Row hit: 9 cycles (*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting) READ (Bank 1, Row 3, Col 7) Col7
  8. 8. Background: Memory Controller(MC) • Request queue – Buffer read/write requests from CPU cores – Unpredictable queuing delay due to reordering 8 Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.
  9. 9. Background: MC Queue Re-ordering • Improve row hit ratio and throughput • Unpredictable queuing delay 9 Core1: READ Row 1, Col 1 Core2: READ Row 2, Col 1 Core1: READ Row 1, Col 2 Core1: READ Row 1, Col 1 Core1: READ Row 1, Col 2 Core2: READ Row 2, Col 1 DRAM DRAM Initial Queue Reordered Queue 2 Row Switch 1 Row Switch
  10. 10. Challenges for Real-Time Systems • Memory controller(MC) level queuing delay – Main source of interference – Unpredictable (re-ordering) • DRAM state dependent latency – DRAM row open/close state • State of Art – Predictable DRAM controller h/w: [Akesson’07] [Paolieri’09] [Reineke’11]  not in COTS 10 Core 1 Core 2 Core 3 Core 4 DRAM Memory ControllerMemory Controller DRAM Predictable Memory Controller
  11. 11. Our Approach • OS level approach – Works on commodity h/w – Guarantees performance of each core – Maximizes memory performance if possible 11 Core 1 Core 2 Core 3 Core 4 DRAM Memory ControllerMemory Controller DRAM OS control mechanism
  12. 12. Operating System MemGuard 12 Core1 Core2 Core3 Core4 PMC PMC PMC PMC DRAM DIMM MemGuard Multicore Processor Memory Controller • Memory bandwidth reservation and reclaiming BW Regulator BW Regulator BW Regulator BW Regulator 0.9GB/s 0.1GB/s 0.1GB/s 0.1GB/s Reclaim Manager
  13. 13. Memory Bandwidth Reservation • Idea – OS monitor and enforce each core’s memory bandwidth usage 13 1ms 2ms0 Dequeue tasks Enqueue tasks Dequeue tasks Budget Core activity 2 1 computation memory fetch
  14. 14. Memory Bandwidth Reservation 14
  15. 15. Guaranteed Bandwidth: rmin 15 (*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
  16. 16. Memory Access Pattern • Memory access patterns vary over time • Static reservation is inefficient 16 Time(ms) Memory requests Time(ms) Memory requests
  17. 17. Memory Bandwidth Reclaiming • Key objective – Redistribute excess bandwidth to demanding cores – Improve memory b/w utilization • Predictive bandwidth donation and reclaiming – Donate unneeded budget predictively – Reclaim on-demand basis 17
  18. 18. Reclaim Example • Time 0 – Initial budget 3 for both cores • Time 3,4 – Decrement budgets • Time 10 (period 1) – Predictive donation (total 4) • Time 12,15 – Core 1: reclaim • Time 16 – Core 0: reclaim • Time 17 – Core 1: no remaining budget; dequeue tasks • Time 20 (period 2) – Core 0 donates 2 – Core 1 do not donate 18
  19. 19. Best-effort Bandwidth Sharing • Key objective – Utilize best-effort bandwidth whenever possible • Best-effort bandwidth – After all cores use their budgets (i.e., delivering guaranteed bandwidth), before the next period begins • Sharing policy – Maximize throughput – Broadcast all cores to continue to execute 19
  20. 20. Evaluation Platform • Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM – Prefetchers were turned off for evaluation – Power PC based P4080 (8core) and ARM based Exynos4412(4core) also has been ported • Modified Linux kernel 3.6.0 + MemGuard kernel module – https://github.com/heechul/memguard/wiki/MemGuard • Used the entire 29 benchmarks from SPEC2006 and synthetic benchmarks 20 Core 0 I D L2 Cache Intel Core2Quad Core 1 I D Core 2 I D L2 Cache Core 3 I D Memory Controller DRAM
  21. 21. Evaluation Results C0 Shared Memory C2 Intel Core2 L2 L2 462.libquantum (foreground) memory hogs (background) 0.0 0.2 0.4 0.6 0.8 1.0 1.2 run-alone co-run run-alone co-run run-alone co-run w/o Memguard Memguard (reservation only) Memguard (reclaim+share) NormalizedIPC Foreground (462.libquantum) C2 53% performance reduction
  22. 22. Evaluation Results C0 Shared Memory C2 Intel Core2 L2 L2 462.Libquantum (foreground) 0.0 0.2 0.4 0.6 0.8 1.0 1.2 run-alone co-run run-alone co-run run-alone co-run w/o Memguard Memguard (reservation only) Memguard (reclaim+share) NormalizedIPC Foreground (462.libquantum) 1GB/s memory hogs (background) C2 .2GB/s Reservation provides performance isolation Guaranteed performance
  23. 23. Evaluation Results C0 Shared Memory C2 Intel Core2 L2 L2 462.Libquantum (foreground) 0.0 0.2 0.4 0.6 0.8 1.0 1.2 run-alone co-run run-alone co-run run-alone co-run w/o Memguard Memguard (reservation only) Memguard (reclaim+share) NormalizedIPC Foreground (462.libquantum) 1GB/s memory hogs (background) C2 .2GB/s Reclaiming and Sharing maximize performance Guaranteed performance
  24. 24. SPEC2006 Results 24 0.0 0.5 1.0 1.5 2.0 2.5 3.0 foreground (X-axis) @1.0GB/s NormalizedIPC Normalized to guaranteed performance • Guarantee(soft) performance of foreground (x-axis) – W.r.t. 1.0GB/s memory b/w reservation C0 Shared Memory C2 Intel Core2 L2 L2 X-axis (foreground) 1GB/s 470.lbm (background) C2 .2GB/s
  25. 25. SPEC2006 Results • Improve overall throughput – background: 368%, foreground(X-axis): 6% 25 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 foreground (X-axis) @1.0GB/s background (470.lbm) @0.2GB/s NormalizedIPC Normalized to guaranteed performance C0 Shared Memory C2 Intel Core2 L2 L2 X-axis (foreground) 1GB/s 470.lbm (background) C2 .2GB/s
  26. 26. Conclusion • Inter-Core Memory Interference – Big challenge for multi-core based real-time systems – Sources: queuing delay in MC, state dependent latency in DRAM • MemGuard – OS mechanism providing efficient per-core memory performance isolation on COTS H/W – Memory bandwidth reservation and reclaiming support – https://github.com/heechul/memguard/wiki/MemGuard 26
  27. 27. Thank you. 27
  28. 28. Reclaim Underrun Error 28
  29. 29. Effect of MemGuard • Soft real-time application on each core. • Provides differentiated memory bandwidth – weight for each core=1:2:4:8 for the guaranteed b/w, spare bandwidth sharing is enabled 29

×