Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems

In modern Commercial Off-The-Shelf (COTS) multicore systems, each core
can generate many parallel memory requests at a time. The processing of
these parallel requests in the DRAM controller greatly affects the memory
interference delay experienced by tasks running on the platform.

In this paper, we present a new parallelism-aware worst-case memory
interference delay analysis for COTS multicore systems. The analysis
considers a COTS processor that can generate multiple outstanding
requests and a COTS DRAM controller that has separate read and write
request buffers, prioritizes reads over writes, and supports
out-of-order request processing. Focusing on LLC and DRAM bank
partitioned systems, our analysis computes worst-case upper bounds on
memory-interference delays caused by competing memory requests.

We validate our analysis on a Gem5 full-system simulator modeling a
realistic COTS multicore platform, with a set of carefully designed
synthetic benchmarks as well as SPEC2006 benchmarks.
The evaluation results show that our analysis produces safe upper
bounds in all tested benchmarks, while the current state-of-the-art analysis significantly
under-estimates the delays.

Published in: Engineering
  1. Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
     Heechul Yun+, Rodolfo Pellizzoni*, Prathap Kumar Valsan+
     +University of Kansas, *University of Waterloo
  2. High-Performance Multicores for Embedded Real-Time Systems
     • Why?
       – Intelligence → more performance
       – Space, weight, power (SWaP), cost
  3. Challenge: Shared Memory Hierarchy
     • Hardware resources are contended among the cores
     • Tasks can suffer significant memory interference delays
     [Figure: Core1–Core4 sharing the cache, memory controller (MC), and DRAM]
  4. Memory Interference Delay
     • Can be extremely high in the worst case
     [Figure: normalized execution time of a benchmark running with 0–3 memory-intensive co-runners on ARM Cortex-A15, Intel Nehalem, and Intel Haswell; the worst observed slowdowns reach 8.0×, 33.5×, and 45.8×, i.e., up to a 45.8× slowdown]
  5. Modeling Memory Interference
     • Common (false) assumptions on COTS systems
       – A single resource
         • Reality: multiple parallel resources (banks)
       – A constant memory service cost
         • Reality: it varies depending on the DRAM bank state
       – Round-robin arbitration, in-order processing
         • Reality: FR-FCFS can re-order requests
       – Read and write requests are treated equally
         • Reality: writes are buffered and processed opportunistically
       – One outstanding request per core
         • Reality: an out-of-order core can generate parallel requests
     • The first three assumptions were addressed in [Kim’14]; the last two are addressed in this work
     [Kim’14] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding Memory Interference Delay in COTS-based Multi-Core Systems,” RTAS’14
  6. Our Approach
     • Realistic memory interference model for COTS systems
       – Memory-level parallelism (MLP) in the COTS architecture
       – Write buffering and opportunistic batch processing in the MC
     • DRAM bank partitioning
       – Reduces interference
     • Delay analysis
       – Computes the worst-case memory interference delay of the task under analysis
  7. Outline
     • Motivation
     • Background
       – COTS multicore architecture
       – DRAM organization
       – Memory controller
     • Our approach
     • Evaluation
     • Conclusion
  8. Memory-Level Parallelism (MLP)
     • Broadly defined as the number of concurrent memory requests that a given architecture can handle at a time
  9. COTS Multicore Architecture
     • Out-of-order cores: multiple memory requests
     • Non-blocking caches (per-core MSHRs): multiple outstanding cache misses
     • MC and DRAM: multiple banks
     [Figure: Core1–Core4 with per-core MSHRs, Last-Level Cache (LLC), memory controller (MC) with read/write request buffers and a scheduler, CMD/ADDR and DATA buses, and a DRAM DIMM with Banks 1–4]
  10. DRAM Organization
      • Without partitioning, accesses are a mess of intra-bank and inter-bank conflicts
      [Figure: Core1–Core4 sharing the LLC, MC, and a DRAM DIMM with Banks 1–4]
  11. DRAM Bank Partitioning
      • Private banking
        – The OS kernel allocates pages from dedicated banks for each core
        – Eliminates intra-bank conflicts
      [Figure: Core1–Core4, L3 cache, MC, and a DRAM DIMM with each core mapped to its own bank]
  12. Bank Access Cost
      • State-dependent access cost
        – Row hit: fast
        – Row miss: slow
      [Figure: a DRAM bank with Rows 1–5 and a row buffer; READ (Bank 1, Row 3, Col 7) precharges the open row, activates Row 3 into the row buffer, then reads/writes Col 7]
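The state-dependent cost on the slide above can be sketched as a tiny bank model. The latency constants and class structure below are illustrative assumptions, not values from the paper:

```python
# Hypothetical sketch of state-dependent DRAM bank access cost.
# Latencies (in controller cycles) are made-up placeholders.
T_ACT, T_PRE, T_RW = 15, 15, 4  # activate, precharge, read/write

class Bank:
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row):
        """Return the cycles needed to read/write one column of `row`."""
        if self.open_row == row:        # row hit: data already in the row buffer
            return T_RW
        cost = T_RW + T_ACT             # row miss: activate the new row first
        if self.open_row is not None:   # row conflict: precharge the old row too
            cost += T_PRE
        self.open_row = row
        return cost

bank = Bank()
first = bank.access(3)      # miss on an idle bank: activate + read = 19
hit = bank.access(3)        # row hit: read only = 4
conflict = bank.access(5)   # conflict: precharge + activate + read = 34
```

The asymmetry between `hit` and `conflict` is exactly why the analysis must track bank state rather than assume a constant service cost.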
  13. Memory Controller
      • Memory requests from the cores enter separate read and write request buffers
      • Writes are buffered and processed opportunistically
      [Figure: read and write request buffers feeding per-bank schedulers (Bank 1 … Bank N) and a channel scheduler in front of the DRAM chip]
  14. “Intelligent” Read/Write Switching
      • Intuition
        – Writes are not on the critical path, so buffer them and process them opportunistically
      • Algorithm [Hansson’14]
        – If there are reads, process them unless the write buffer is almost full (high watermark)
        – If there are no reads and there are enough buffered writes (low watermark), process the writes until reads arrive
      [Hansson’14] Hansson et al., “Simulating DRAM controllers for future system architecture exploration,” ISPASS’14
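The watermark policy described on the slide above can be sketched as a single decision function. Buffer size and thresholds are illustrative assumptions, not the values used by [Hansson’14]:

```python
# Sketch of watermark-based read/write switching. The thresholds are
# hypothetical; real controllers tune them to the write buffer size.
HIGH_WATERMARK = 28   # forced write drain: buffer almost full
LOW_WATERMARK = 16    # enough writes buffered to make a drain worthwhile

def next_batch(num_reads, num_writes):
    """Decide whether the controller services reads or writes next."""
    if num_writes >= HIGH_WATERMARK:
        return "write"    # writes can no longer wait
    if num_reads > 0:
        return "read"     # reads are on the critical path: prioritize them
    if num_writes >= LOW_WATERMARK:
        return "write"    # no reads pending: drain writes opportunistically
    return "idle"

forced = next_batch(num_reads=3, num_writes=30)        # "write"
normal = next_batch(num_reads=3, num_writes=10)        # "read"
drain = next_batch(num_reads=0, num_writes=20)         # "write"
```

This is the mechanism behind Key Intuition #3 later in the deck: the worst case for a read is arriving just as a forced write drain begins.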
  15. FR-FCFS Scheduling [Rixner’00]
      • Priority order
        1. Row-hit requests
        2. Older requests
      • Maximizes memory throughput
      [Rixner’00] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. Owens, “Memory access scheduling,” ACM SIGARCH Computer Architecture News, 2000
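The two-level priority order on the slide above amounts to a short selection rule, sketched here under the simplifying assumption that a request is just its target row:

```python
# Minimal sketch of FR-FCFS arbitration [Rixner'00]: among pending
# requests for a bank, row hits win; ties break by arrival order (FCFS).
def frfcfs_pick(rows, open_row):
    """rows: target rows of pending requests, oldest first.
    Returns the index of the request scheduled next, or None if empty."""
    if not rows:
        return None
    for i, row in enumerate(rows):
        if row == open_row:   # first-ready: the oldest row hit wins
            return i
    return 0                  # no row hit: plain FCFS, pick the oldest

# Open row is 3: the request at index 1 is a row hit, so it is
# re-ordered ahead of the older row-miss request to row 7.
picked = frfcfs_pick([7, 3, 3], open_row=3)
```

This re-ordering is precisely why the round-robin/in-order assumption on slide 5 is unrealistic for COTS controllers.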
  16. Outline
      • Motivation
      • Background
      • Our approach
        – System model
        – Delay analysis
      • Evaluation
      • Conclusion
  17. System Model
      • Task
        – Solo execution time: C
        – Memory demand (# of LLC misses): H
      • Core
        – Can generate multiple, but bounded, parallel requests
          • Upper-bounded by the L1 cache’s MSHR size
      • Cache (LLC)
        – Assume no cache-level interference
          • Core-private or partitioned LLC
          • No MSHR contention
      • DRAM controller
        – FR-FCFS scheduler, open-page policy
        – Separate read and write request buffers
          • Watermark scheme for processing writes
  18. Delay Analysis
      • Goal
        – Compute the worst-case memory interference delay of the task under analysis
      • Request-driven analysis
        – Based on the task’s own memory demand H
        – Compute the worst-case per-request delay RD
        – Memory interference delay = RD × H
      • Job-driven analysis
        – Based on the other tasks’ memory requests over time
        – See the paper
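The request-driven bound on the slide above can be written out directly. The RD value below is a hypothetical placeholder; the paper derives it from the DRAM timing parameters and the competing-request bound:

```python
# Sketch of the request-driven bound: total interference is the task's
# own LLC-miss count H times the worst-case per-request delay RD.
def request_driven_bound(H, RD):
    """Worst-case memory interference delay for a task with H LLC misses,
    given a per-request worst-case delay RD (same time unit as the result)."""
    return RD * H

# e.g. H = 1000 misses with a hypothetical RD of 200 ns per request
delay_ns = request_driven_bound(H=1000, RD=200)
```

Note the bound depends only on the task's own demand H, not on when the co-runners issue their requests; that time-dependence is what the job-driven analysis adds.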
  19. Key Intuition #1
      • The number of competing requests, Nrq, is bounded
        – Because the number of per-core parallel requests is bounded
        – Example: the Cortex-A15’s per-core bound is 6, so Nrq = 6 × 3 (cores) = 18
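The arithmetic on the slide above is just a product of two bounds; a one-line sketch makes the inputs explicit (the function name is mine, not the paper's notation):

```python
# Sketch of the competing-request bound Nrq: each interfering core can
# have at most `per_core_bound` outstanding requests (limited by its
# MSHRs), so Nrq is that bound times the number of interfering cores.
def competing_requests(per_core_bound, interfering_cores):
    return per_core_bound * interfering_cores

# Cortex-A15 example from the slide: 4 cores, 3 of them interfering
nrq = competing_requests(per_core_bound=6, interfering_cores=3)  # 18
```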
  20. Key Intuition #2
      • The DRAM sub-commands of the competing memory requests overlap
        – Much less pessimistic than [Kim’14], which simply sums each sub-command’s maximum delay
        – See the paper for the proof
  21. Key Intuition #3
      • The worst-case delay happens when
        – The read buffer holds Nrq requests
        – The write request buffer has just become full, starting a write batch
        – And then the read request under analysis arrives
      • RD = read batch delay + write batch delay
  22. Outline
      • Motivation
      • Background
      • Our approach
      • Evaluation
      • Conclusion
  23. Evaluation Setup
      • Gem5 simulator
        – 4 out-of-order cores (based on the Cortex-A15)
          • The L2 MSHR size is increased to eliminate MSHR contention
        – DRAM controller model from [Hansson’14]
        – LPDDR2 @ 533 MHz
      • Linux 3.14
        – PALLOC [Yun’14] is used to partition DRAM banks and the LLC
      • Workload
        – Subject: Latency, SPEC2006
        – Co-runners: Bandwidth (write)
      [Figure: Core1–Core4 sharing the LLC and DRAM; the subject runs on one core, co-runners on the others]
  24. Results with the Latency Benchmark
      • Ours(ideal): read-only delay analysis (ignores writes)
      • Ours(opt): assumes writes are balanced over multiple banks
      • Ours(worst): all writes target one bank and all are row misses
      [Figure: normalized response time for Measured, [Kim’14], Ours(ideal), Ours(opt), and Ours(worst); [Kim’14] underestimates the measured delay, while our bounds overestimate it]
      [Kim’14] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding Memory Interference Delay in COTS-based Multi-Core Systems,” RTAS’14
  25. Results with SPEC2006 Benchmarks
      • Main source of pessimism:
        – The pathological case of write (LLC write-back) processing
  26. Conclusion
      • Memory interference delay on COTS multicores
        – Existing analysis methods rely on strong assumptions
      • Our approach
        – A realistic model of the COTS memory system
          • Parallel memory requests
          • Read prioritization and opportunistic write processing
        – Request-driven and job-driven delay analysis methods
          • Pessimistic, but still useful for tasks with low memory intensity
      • Future work
        – Reduce pessimism in the analysis
  27. Thank You
