Taming Non-blocking Caches to Improve
Isolation in Multicore Real-Time Systems
Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi
University of Kansas
1
High-Performance Multicores
for Real-Time Systems
• Why?
– Intelligence → more performance
– Space, weight, power (SWaP), cost
2
Time Predictability Challenge
3
• Shared hardware resource contention can cause
significant interference delays
• Shared cache is a major shared resource
[Figure: Core1-Core4 running Tasks 1-4, sharing a cache and a memory controller (MC) to DRAM]
Cache Partitioning
• Can be done in either software or hardware
– Software: page coloring
– Hardware: way partitioning
• Eliminate unwanted cache-line evictions
• Common assumption
– cache partitioning → performance isolation
• If working-set fits in the cache partition
• Not necessarily true on out-of-order cores using
non-blocking caches
4
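The page-coloring idea above can be sketched in C. The cache geometry below (line size, associativity) is assumed for illustration; only the 2MB LLC size matches a platform discussed later. The page color is the set-index bits that lie above the page-offset boundary, so the OS can steer a task's pages to a cache partition by allocating only pages of certain colors.

```c
/* Page-coloring sketch. LINE_SIZE and WAYS are illustrative assumptions. */
#define LINE_SIZE  64          /* bytes per cache line (assumed) */
#define LLC_SIZE   (2 << 20)   /* 2MB LLC */
#define WAYS       16          /* associativity (assumed) */
#define PAGE_SIZE  4096

/* Number of distinct page colors the LLC offers. */
unsigned num_colors(void) {
    unsigned sets = LLC_SIZE / (LINE_SIZE * WAYS);  /* 2048 sets */
    return (sets * LINE_SIZE) / PAGE_SIZE;          /* set-index bytes per page */
}

/* Color of a physical page: low set-index bits above the page offset. */
unsigned page_color(unsigned long paddr) {
    return (unsigned)((paddr / PAGE_SIZE) % num_colors());
}
```

Two pages with the same color map to the same group of LLC sets, so giving each core a disjoint color range partitions the cache.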
Outline
• Evaluate the isolation effect of cache partitioning
– On four real COTS multicore architectures
– Observed up to 21X slowdown of a task running on a
dedicated core accessing a dedicated cache partition
• Understand the source of contention
– Using a cycle-accurate full system simulator
• Propose an OS/architecture collaborative solution
– For better cache performance isolation
5
Memory-Level Parallelism (MLP)
• Broadly defined as the number of concurrent
memory requests that a given architecture
can handle at a time
6
Non-blocking Cache(*)
• Can serve cache hits under multiple cache misses
– Essential for an out-of-order core and any multicore
• Miss-Status-Holding Registers (MSHRs)
– On a miss, allocate an MSHR entry to track the request
– On receiving the data, clear the MSHR entry
7
[Figure: blocking vs. non-blocking cache. A blocking cache pays each miss penalty serially; a non-blocking cache overlaps multiple outstanding misses and the CPU stalls only when a result is needed.]
(*) D. Kroft. “Lockup-free instruction fetch/prefetch cache organization,” ISCA’81
Non-blocking Cache
• # of cache MSHRs → memory-level parallelism (MLP) of the cache
• What happens if all MSHRs are occupied?
– The cache is locked up
– Subsequent accesses---including cache hits---to
the cache stall
– We will see the impact of this in later experiments
8
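The lock-up behavior described above can be modeled with a toy MSHR file in C. The entry count and field set are illustrative, not taken from any real design: a miss allocates an entry, a fill clears it, and once every entry is valid the cache locks up and even hits must wait.

```c
#define NUM_MSHRS 4            /* illustrative MSHR count */

typedef struct {
    int valid;
    unsigned long block_addr;  /* which miss this entry tracks */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* On a miss: try to allocate an MSHR entry. Returns the index, or -1 if
 * all MSHRs are occupied -- the cache is locked up and the access stalls. */
int mshr_alloc(unsigned long block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshrs[i].valid) {
            mshrs[i].valid = 1;
            mshrs[i].block_addr = block_addr;
            return i;
        }
    }
    return -1;  /* no free entry: subsequent accesses, even hits, stall */
}

/* On receiving the data from the next level: clear the entry. */
void mshr_clear(int idx) {
    mshrs[idx].valid = 0;
}
```

With NUM_MSHRS outstanding misses in flight, a fifth request fails to allocate; clearing any entry unblocks the cache again.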
COTS Multicore Platforms
             Cortex-A7    Cortex-A9     Cortex-A15    Nehalem
Core         4 cores @    4 cores @     4 cores @     4 cores @
             1.4GHz,      1.7GHz,       2.0GHz,       2.8GHz,
             in-order     out-of-order  out-of-order  out-of-order
LLC (shared) 512KB        1MB           2MB           8MB
9
• COTS multicore platforms
– Odroid-XU4: 4x Cortex-A7 and 4x Cortex-A15
– Odroid-U3: 4x Cortex-A9
– Dell desktop: Intel Xeon quad-core (Nehalem)
Identified MLP
• Local MLP
– MLP of a core-private cache
• Global MLP
– MLP of the shared cache (and DRAM)
10
(See paper for our experimental identification method)
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Local MLP 1 4 6 10
Global MLP 4 4 11 16
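One simplified way to think about the identification idea (the paper's actual procedure differs in detail): run k independent pointer chains concurrently and watch aggregate throughput; it grows with k until the platform's MSHRs are exhausted, then flattens. A hypothetical estimator over such a sweep:

```c
/* Given measured throughputs t[0..n-1], where t[i] is the aggregate
 * throughput with (i+1) concurrent pointer chains, estimate MLP as the
 * chain count at which throughput stops improving (within a tolerance).
 * This is a first-order model for intuition, not the paper's method. */
int estimate_mlp(const double *t, int n, double tol) {
    for (int i = 1; i < n; i++)
        if (t[i] - t[i - 1] < tol)
            return i;          /* saturated at i chains */
    return n;                  /* never saturated within the sweep */
}
```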
Cache Interference Experiments
• Measure the performance of the ‘subject’
– (1) alone, (2) with co-runners
– LLC is partitioned (equal partition) using PALLOC (*)
• Q: Does cache partitioning provide isolation?
11
[Figure: the subject task runs on Core1 with a dedicated LLC partition; co-runners run on Cores 2-4; all share the LLC and DRAM]
(*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
IsolBench: Synthetic Workloads
• Latency
– A linked-list traversal, data dependency, one outstanding miss
• Bandwidth
– Array reads or writes, no data dependency, multiple outstanding misses
• Subject benchmarks: LLC partition fitting
12
Working-set size: (LLC) < ¼ LLC → cache hits, (DRAM) > 2X LLC → cache misses
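The two access patterns can be sketched as follows; this is a simplified illustration of the idea, not the actual IsolBench source. Pointer chasing serializes misses (each load depends on the previous one), while independent array reads let the core keep many misses in flight.

```c
#include <stddef.h>

/* Latency-style access: pointer chasing through a linked list. Each load
 * depends on the previous one, so at most one miss is outstanding.
 * Nodes are padded to one (assumed 64-byte) cache line. */
struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

long chase(struct node *head, long iters) {
    long count = 0;
    struct node *p = head;
    while (iters--) { p = p->next; count++; }
    return count;
}

/* Bandwidth-style access: independent sequential reads. No data dependency
 * between loads, so many misses can be outstanding at once. */
long bw_read(const int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += arr[i];
    return sum;
}
```

Sizing the list or array below a quarter of the LLC yields the cache-hit (LLC) variants; sizing it above twice the LLC yields the cache-miss (DRAM) variants.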
Latency(LLC) vs. BwRead(DRAM)
• No interference on Cortex-A7 and Nehalem
• On Cortex-A15, Latency(LLC) suffers 3.8X slowdown
– despite partitioned LLC
13
BwRead(LLC) vs. BwRead(DRAM)
• Up to 10.6X slowdown on Cortex-A15
• Cache partitioning != performance isolation
– On all tested out-of-order cores (A9, A15, Nehalem)
14
BwRead(LLC) vs. BwWrite(DRAM)
• Up to 21X slowdown on Cortex-A15
• Writes generally cause more slowdowns
– Due to write-backs
15
EEMBC and SD-VBS
• X-axis: EEMBC, SD-VBS (cache partition fitting)
– Co-runners: BwWrite(DRAM)
• Cache partitioning != performance isolation
16
Cortex-A7 (in-order) Cortex-A15 (out-of-order)
MSHR Contention
• Shortage of cache MSHRs → the cache locks up
• LLC MSHRs are shared resources
– 4 cores × L1 MSHRs > LLC MSHRs
17
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Local MLP 1 4 6 10
Global MLP 4 4 11 16
(See paper for our experimental identification method)
Outline
• Evaluate the isolation effect of cache
partitioning
• Understand the source of contention
– Effect of LLC MSHRs
– Effect of L1 MSHRs
• Propose an OS/architecture collaborative
solution
18
Simulation Setup
• Gem5 full-system simulator
– 4 out-of-order cores (modeling Cortex-A15)
• L1: 32KB I&D, LLC (L2): 2MB
– Vary #of L1 and LLC MSHRs
• Linux 3.14
– Use PALLOC to partition LLC
• Workload
– IsolBench, EEMBC, SD-VBS, SPEC2006
19
[Figure: the subject task runs on Core1 with a dedicated LLC partition; co-runners run on Cores 2-4; all share the LLC and DRAM]
Effect of LLC (Shared) MSHRs
• Increasing LLC MSHRs eliminates the MSHR contention
• Issue: scalable?
– MSHRs are highly associative (must be accessed in parallel)
20
L1:6/LLC:12 MSHRs L1:6/LLC:24 MSHRs
Effect of L1 (Private) MSHRs
• Reducing L1 MSHRs can eliminate contention
• Issue: single thread performance. How much?
– It depends. L1 MSHRs are often underutilized
– EEMBC: low, SD-VBS: medium, SPEC2006: high
21
EEMBC, SD-VBS | IsolBench, SPEC2006
Outline
• Evaluate the isolation effect of cache
partitioning
• Understand the source of contention
• Propose an OS/architecture collaborative
solution
22
OS Controlled MSHR Partitioning
23
• Add two registers in each core’s L1 cache
– TargetCount: max. MSHRs (set by OS)
– ValidCount: used MSHRs (set by HW)
• OS can control each core’s MLP
– By adjusting the core’s TargetCount register
[Figure: per-core L1 MSHR file. Each entry (MSHR 1..n) holds Valid, Block Addr., Issue, and Target Info. fields; the TargetCount and ValidCount registers sit alongside the MSHR array. The four cores’ private MSHRs feed the shared LLC MSHRs.]
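The two registers slot into the L1 MSHR allocation path. The sketch below is our reading of the slide, with illustrative names: an L1 miss may allocate an MSHR only while ValidCount is below the OS-programmed TargetCount, so the OS caps each core's MLP.

```c
typedef struct {
    int target_count;   /* max MSHRs this core may use (set by OS) */
    int valid_count;    /* MSHRs currently in use (maintained by HW) */
} mshr_ctrl_t;

/* Attempt to allocate one MSHR on a miss. Returns 1 on success; 0 means
 * the core has reached its MLP budget and the access must stall. */
int try_alloc(mshr_ctrl_t *c) {
    if (c->valid_count >= c->target_count)
        return 0;
    c->valid_count++;
    return 1;
}

/* Data returned from the next level: release the entry. */
void release(mshr_ctrl_t *c) {
    c->valid_count--;
}
```

Lowering TargetCount throttles a core's memory-level parallelism without touching the cache's capacity partition.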
OS Scheduler Design
• Partition LLC MSHRs by enforcing per-core TargetCount limits
• When RT tasks are scheduled
– RT tasks: reserve MSHRs
– Non-RT tasks: share remaining MSHRs
• Implemented in Linux kernel
– prepare_task_switch() in kernel/sched/core.c
24
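The reservation policy above can be modeled as a pure function; the real implementation hooks prepare_task_switch(), and the budget and names here are illustrative assumptions. An RT task gets its fixed reservation; a non-RT task shares whatever the running RT tasks have not reserved.

```c
#define TOTAL_LLC_MSHRS 12   /* global MLP budget (illustrative) */

/* TargetCount to program into a core's L1 when a task is scheduled in,
 * given how many MSHRs the currently running RT tasks have reserved. */
int target_count(int is_rt, int rt_reserved_each, int num_rt_running) {
    if (is_rt)
        return rt_reserved_each;                 /* guaranteed reservation */
    int reserved = rt_reserved_each * num_rt_running;
    int remaining = TOTAL_LLC_MSHRS - reserved;  /* best-effort share */
    return remaining > 0 ? remaining : 1;        /* keep at least one MSHR */
}
```

Keeping the sum of reservations below the LLC MSHR count ensures an RT task can never be locked out of the shared cache by co-runners.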
Case Study
• 4 RT tasks (EEMBC)
– One RT per core
– Reserve 2 MSHRs
– Periods (P): 20, 30, 40, 60ms
– WCET (C): ~8ms
• 4 NRT tasks
– One NRT per core
– Run on slack
• Up to 20% WCET reduction
– Compared to cache partition only
25
Conclusion
• Evaluated the effect of cache partitioning on four real COTS
multicore architectures and a full system simulator
• Found cache partitioning does not ensure cache (hit)
performance isolation due to MSHR contention
• Proposed low-cost architecture extension and OS scheduler
support to eliminate MSHR contention
• Availability
– https://github.com/CSL-KU/IsolBench
– Synthetic benchmarks (IsolBench), test scripts, kernel patches
26
Exp3: BwRead(LLC) vs. BwRead(LLC)
• Cache bandwidth contention is not the main source
– Co-runners with higher cache bandwidth cause smaller slowdowns
27
EEMBC, SD-VBS Workload
• Subject
– Subset of EEMBC, SD-VBS
– High L2 hit, Low L2 miss
• Co-runners
– BwWrite(DRAM): High L2 miss, write-intensive
28
More Related Content

What's hot

Lost with data consistency
Lost with data consistencyLost with data consistency
Lost with data consistency
Michał Gryglicki
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed System
RKGhosh3
 
“Linux Kernel CPU Hotplug in the Multicore System”
“Linux Kernel CPU Hotplug in the Multicore System”“Linux Kernel CPU Hotplug in the Multicore System”
“Linux Kernel CPU Hotplug in the Multicore System”
GlobalLogic Ukraine
 
Chapter 7
Chapter 7Chapter 7
Chapter 7
Amin Omi
 
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-netReceive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
Yan Vugenfirer
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
Alex Maestretti
 
SOUG Day Oracle 21c New Security Features
SOUG Day Oracle 21c New Security FeaturesSOUG Day Oracle 21c New Security Features
SOUG Day Oracle 21c New Security Features
Stefan Oehrli
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
Michelle Holley
 
Memory management
Memory managementMemory management
Memory management
Adrien Mahieux
 
Embedded linux network device driver development
Embedded linux network device driver developmentEmbedded linux network device driver development
Embedded linux network device driver development
Amr Ali (ISTQB CTAL Full, CSM, ITIL Foundation)
 
Versatile tensor accelerator (vta) introduction and usage
Versatile tensor accelerator (vta) introduction and usage Versatile tensor accelerator (vta) introduction and usage
Versatile tensor accelerator (vta) introduction and usage
jemin lee
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
Adrien Mahieux
 
Linux Preempt-RT Internals
Linux Preempt-RT InternalsLinux Preempt-RT Internals
Linux Preempt-RT Internals
哲豪 康哲豪
 
Memory mapping
Memory mappingMemory mapping
Memory mapping
SnehalataAgasti
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Brendan Gregg
 
Multicore Processors
Multicore ProcessorsMulticore Processors
Multicore Processors
Smruti Sarangi
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
hugo lu
 
Cache memory
Cache  memoryCache  memory
Cache memory
Prasenjit Dey
 
I2 c bus
I2 c busI2 c bus
Tasklet vs work queues (Deferrable functions in linux)
Tasklet vs work queues (Deferrable functions in linux)Tasklet vs work queues (Deferrable functions in linux)
Tasklet vs work queues (Deferrable functions in linux)
RajKumar Rampelli
 

What's hot (20)

Lost with data consistency
Lost with data consistencyLost with data consistency
Lost with data consistency
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed System
 
“Linux Kernel CPU Hotplug in the Multicore System”
“Linux Kernel CPU Hotplug in the Multicore System”“Linux Kernel CPU Hotplug in the Multicore System”
“Linux Kernel CPU Hotplug in the Multicore System”
 
Chapter 7
Chapter 7Chapter 7
Chapter 7
 
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-netReceive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
 
SOUG Day Oracle 21c New Security Features
SOUG Day Oracle 21c New Security FeaturesSOUG Day Oracle 21c New Security Features
SOUG Day Oracle 21c New Security Features
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
 
Memory management
Memory managementMemory management
Memory management
 
Embedded linux network device driver development
Embedded linux network device driver developmentEmbedded linux network device driver development
Embedded linux network device driver development
 
Versatile tensor accelerator (vta) introduction and usage
Versatile tensor accelerator (vta) introduction and usage Versatile tensor accelerator (vta) introduction and usage
Versatile tensor accelerator (vta) introduction and usage
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Linux Preempt-RT Internals
Linux Preempt-RT InternalsLinux Preempt-RT Internals
Linux Preempt-RT Internals
 
Memory mapping
Memory mappingMemory mapping
Memory mapping
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Multicore Processors
Multicore ProcessorsMulticore Processors
Multicore Processors
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
Cache memory
Cache  memoryCache  memory
Cache memory
 
I2 c bus
I2 c busI2 c bus
I2 c bus
 
Tasklet vs work queues (Deferrable functions in linux)
Tasklet vs work queues (Deferrable functions in linux)Tasklet vs work queues (Deferrable functions in linux)
Tasklet vs work queues (Deferrable functions in linux)
 

Similar to Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Heechul Yun
 
Deterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureDeterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System Architecture
Heechul Yun
 
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
Heechul Yun
 
Memory (Computer Organization)
Memory (Computer Organization)Memory (Computer Organization)
Memory (Computer Organization)
JyotiprakashMishra18
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allen
jaxconf
 
Cpu Caches
Cpu CachesCpu Caches
Cpu Caches
shinolajla
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
shinolajla
 
Ct213 memory subsystem
Ct213 memory subsystemCt213 memory subsystem
Ct213 memory subsystem
Sandeep Kamath
 
Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...
Heechul Yun
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
GlobalLogic Ukraine
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization
2022002857mbit
 
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline OptimalityErasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Yue Cheng
 
Lect18
Lect18Lect18
Lect18
Vin Voro
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 
7_mem_cache.ppt
7_mem_cache.ppt7_mem_cache.ppt
7_mem_cache.ppt
RohitPaul71
 
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Hsien-Hsin Sean Lee, Ph.D.
 
Spectrum Scale Memory Usage
Spectrum Scale Memory UsageSpectrum Scale Memory Usage
Spectrum Scale Memory Usage
Tomer Perry
 
Mass storage systems presentation operating systems
Mass storage systems presentation operating systemsMass storage systems presentation operating systems
Mass storage systems presentation operating systems
night1ng4ale
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler
Sarwan ali
 

Similar to Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems (20)

Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
 
Deterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureDeterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System Architecture
 
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
 
Memory (Computer Organization)
Memory (Computer Organization)Memory (Computer Organization)
Memory (Computer Organization)
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allen
 
Cpu Caches
Cpu CachesCpu Caches
Cpu Caches
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
Ct213 memory subsystem
Ct213 memory subsystemCt213 memory subsystem
Ct213 memory subsystem
 
Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization
 
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline OptimalityErasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
 
Lect18
Lect18Lect18
Lect18
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
7_mem_cache.ppt
7_mem_cache.ppt7_mem_cache.ppt
7_mem_cache.ppt
 
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
 
Spectrum Scale Memory Usage
Spectrum Scale Memory UsageSpectrum Scale Memory Usage
Spectrum Scale Memory Usage
 
Mass storage systems presentation operating systems
Mass storage systems presentation operating systemsMass storage systems presentation operating systems
Mass storage systems presentation operating systems
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler
 

Recently uploaded

openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
snaprevwdev
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
Kamal Acharya
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
harshapolam10
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
DharmaBanothu
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
21UME003TUSHARDEB
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
uqyfuc
 
Presentation on Food Delivery Systems
Presentation on Food Delivery SystemsPresentation on Food Delivery Systems
Presentation on Food Delivery Systems
Abdullah Al Noman
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
paraasingh12 #V08
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
ElakkiaU
 
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Levelised Cost of Hydrogen  (LCOH) Calculator ManualLevelised Cost of Hydrogen  (LCOH) Calculator Manual
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Massimo Talia
 
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdfSELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
Pallavi Sharma
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
DharmaBanothu
 
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Transcat
 
Assistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdfAssistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdf
Seetal Daas
 
Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
pvpriya2
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
Divyanshu
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
Paris Salesforce Developer Group
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
ijseajournal
 

Recently uploaded (20)

openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
Presentation on Food Delivery Systems
Presentation on Food Delivery SystemsPresentation on Food Delivery Systems
Presentation on Food Delivery Systems
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
 
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Levelised Cost of Hydrogen  (LCOH) Calculator ManualLevelised Cost of Hydrogen  (LCOH) Calculator Manual
Levelised Cost of Hydrogen (LCOH) Calculator Manual
 
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdfSELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
 
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
 
Assistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdfAssistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdf
 
Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

  • 1. Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1
  • 2. High-Performance Multicores for Real-Time Systems • Why? – Intelligence  more performance – Space, weight, power (SWaP), cost 2
  • 3. Time Predictability Challenge 3 Core1 Core2 Core3 Core4 Memory Controller (MC) Shared Cache • Shared hardware resource contention can cause significant interference delays • Shared cache is a major shared resource DRAM Task 1 Task 2 Task 3 Task 4
  • 4. Cache Partitioning • Can be done in either software or hardware – Software: page coloring – Hardware: way partitioning • Eliminate unwanted cache-line evictions • Common assumption – cache partitioning  performance isolation • If working-set fits in the cache partition • Not necessarily true on out-of-order cores using non-blocking caches 4
  • 5. Outline • Evaluate the isolation effect of cache partitioning – On four real COTS multicore architectures – Observed up to 21X slowdown of a task running on a dedicated core accessing a dedicated cache partition • Understand the source of contention – Using a cycle-accurate full system simulator • Propose an OS/architecture collaborative solution – For better cache performance isolation 5
  • 6. Memory-Level Parallelism (MLP) • Broadly defined as the number of concurrent memory requests that a given architecture can handle at a time 6
  • 7. Non-blocking Cache(*) • Can serve cache hits under multiple cache misses – Essential for an out-of-order core and any multicore • Miss-Status-Holding Registers (MSHRs) – On a miss, allocate an MSHR entry to track the request – On receiving the data, clear the MSHR entry (diagram: a blocking cache stalls for the full miss penalty on every miss, while a non-blocking cache overlaps multiple outstanding misses and stalls only when the result is needed) 7 (*) D. Kroft. “Lockup-free instruction fetch/prefetch cache organization,” ISCA’81
  • 8. Non-blocking Cache • # of cache MSHRs → memory-level parallelism (MLP) of the cache • What happens if all MSHRs are occupied? – The cache is locked up – Subsequent accesses, including cache hits, stall – We will see the impact of this in later experiments 8
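The allocate-on-miss / clear-on-fill behavior above can be sketched as a toy software model. This is only an illustration of the mechanism, not any real cache controller; the names (`mshr_alloc`, `NUM_MSHRS`, etc.) are hypothetical:

```c
#include <stdbool.h>

/* Toy model of a non-blocking cache's MSHR file (illustrative only).
 * A miss allocates an entry; when every entry is busy the cache
 * "locks up" and even hits must stall behind the miss traffic. */
#define NUM_MSHRS 4

struct mshr { bool valid; unsigned long block_addr; };
static struct mshr mshrs[NUM_MSHRS];

/* Returns the entry index on success, or -1 if the cache is locked up. */
int mshr_alloc(unsigned long block_addr)
{
    /* A miss to a block that is already being fetched reuses its entry. */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return i;
    /* Otherwise grab a free entry to track the outgoing miss. */
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshrs[i].valid) {
            mshrs[i].valid = true;
            mshrs[i].block_addr = block_addr;
            return i;
        }
    }
    return -1;  /* all MSHRs busy: subsequent accesses stall */
}

/* Called when the fill data returns from the next memory level. */
void mshr_clear(int idx)
{
    mshrs[idx].valid = false;
}
```

With `NUM_MSHRS` set to 4, a fifth distinct outstanding miss fails to allocate, which is exactly the lockup condition the slide describes.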
  • 9. COTS Multicore Platforms • COTS multicore platforms – Odroid-XU4: 4x Cortex-A7 and 4x Cortex-A15 – Odroid-U3: 4x Cortex-A9 – Dell desktop: Intel Xeon quad-core (Nehalem) (table: Cortex-A7: 4 cores @ 1.4GHz, in-order, 512KB shared LLC; Cortex-A9: 4 cores @ 1.7GHz, out-of-order, 1MB; Cortex-A15: 4 cores @ 2.0GHz, out-of-order, 2MB; Nehalem: 4 cores @ 2.8GHz, out-of-order, 8MB) 9
  • 10. Identified MLP • Local MLP – MLP of a core-private cache • Global MLP – MLP of the shared cache (and DRAM) (table: Local MLP – Cortex-A7: 1, Cortex-A9: 4, Cortex-A15: 6, Nehalem: 10; Global MLP – Cortex-A7: 4, Cortex-A9: 4, Cortex-A15: 11, Nehalem: 16) (See paper for our experimental identification method) 10
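The experimental identification method (detailed in the paper) can be understood as pointer chasing over a varying number of independent chains: each chain serializes its own misses, so observed bandwidth grows with the number of chains until the platform's MLP is saturated. A minimal sketch of such a multi-chain traversal kernel, with hypothetical names, might look like:

```c
#include <stddef.h>

/* One cache-line-sized node per hop (64-byte lines assumed). */
struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

#define MAX_CHAINS 16

/* Traverse 'nchains' independent linked lists in lockstep. Each chain
 * is a serial dependency, but chains are independent of each other, so
 * an out-of-order core can keep up to 'nchains' misses in flight.
 * Returns a sum of the final pointers so the loop is not optimized away. */
size_t chase(struct node *heads[], int nchains, long iters)
{
    struct node *p[MAX_CHAINS];
    for (int c = 0; c < nchains; c++)
        p[c] = heads[c];
    for (long i = 0; i < iters; i++)
        for (int c = 0; c < nchains; c++)
            p[c] = p[c]->next;   /* independent dependency chains */
    size_t acc = 0;
    for (int c = 0; c < nchains; c++)
        acc += (size_t)p[c];
    return acc;
}
```

Timing this kernel while increasing `nchains` and watching where throughput stops scaling is one way to estimate local and global MLP.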
  • 11. Cache Interference Experiments • Measure the performance of the ‘subject’ – (1) alone, (2) with co-runners – LLC is partitioned (equal partition) using PALLOC (*) • Q: Does cache partitioning provide isolation? (diagram: subject on Core1, co-runner(s) on Cores 2-4, all sharing the LLC and DRAM) 11 (*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
  • 12. IsolBench: Synthetic Workloads • Latency – A linked-list traversal, data dependency, one outstanding miss • Bandwidth – Array reads or writes, no data dependency, multiple misses • Subject benchmarks: LLC partition fitting 12 • Working-set size: (LLC) < ¼ LLC → cache hits, (DRAM) > 2X LLC → cache misses
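The two kernels can be sketched as follows. These are simplified illustrations in the spirit of the IsolBench benchmarks (the real code is at the GitHub link on the conclusion slide), not the actual benchmark sources; names and the 64-byte line size are assumptions:

```c
#include <stddef.h>

/* One node per 64-byte cache line. */
struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

/* Latency: pointer chasing. Each load depends on the previous one,
 * so at most one miss is outstanding at a time (MLP = 1). */
struct node *latency(struct node *head, long iters)
{
    struct node *p = head;
    for (long i = 0; i < iters; i++)
        p = p->next;          /* data dependency serializes misses */
    return p;
}

/* Bandwidth: touch one word per cache line with no dependency
 * between iterations, so an out-of-order core can issue many
 * concurrent misses. */
long bandwidth_read(long *array, size_t len)
{
    long sum = 0;
    for (size_t i = 0; i < len; i += 64 / sizeof(long))
        sum += array[i];      /* independent loads -> multiple misses */
    return sum;
}
```

Sizing the list or array below a quarter of the LLC makes the accesses cache hits, per the (LLC) workloads; sizing it at twice the LLC forces misses, per the (DRAM) workloads.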
  • 13. Latency(LLC) vs. BwRead(DRAM) • No interference on Cortex-A7 and Nehalem • On Cortex-A15, Latency(LLC) suffers 3.8X slowdown – despite partitioned LLC 13
  • 14. BwRead(LLC) vs. BwRead(DRAM) • Up to 10.6X slowdown on Cortex-A15 • Cache partitioning != performance isolation – On all tested out-of-order cores (A9, A15, Nehalem) 14
  • 15. BwRead(LLC) vs. BwWrite(DRAM) • Up to 21X slowdown on Cortex-A15 • Writes generally cause more slowdowns – Due to write-backs 15
  • 16. EEMBC and SD-VBS • X-axis: EEMBC, SD-VBS (cache partition fitting) – Co-runners: BwWrite(DRAM) • Cache partitioning != performance isolation (graphs: Cortex-A7 (in-order) and Cortex-A15 (out-of-order)) 16
  • 17. MSHR Contention • Shortage of cache MSHRs → locks up the cache • LLC MSHRs are shared resources – 4 cores x L1 cache MSHRs > LLC MSHRs (table: Local MLP – Cortex-A7: 1, Cortex-A9: 4, Cortex-A15: 6, Nehalem: 10; Global MLP – Cortex-A7: 4, Cortex-A9: 4, Cortex-A15: 11, Nehalem: 16) (See paper for our experimental identification method) 17
  • 18. Outline • Evaluate the isolation effect of cache partitioning • Understand the source of contention – Effect of LLC MSHRs – Effect of L1 MSHRs • Propose an OS/architecture collaborative solution 18
  • 19. Simulation Setup • Gem5 full-system simulator – 4 out-of-order cores (modeling Cortex-A15) • L1: 32K I&D, LLC (L2): 2048K – Vary # of L1 and LLC MSHRs • Linux 3.14 – Use PALLOC to partition LLC • Workload – IsolBench, EEMBC, SD-VBS, SPEC2006 (diagram: subject on Core1, co-runner(s) on Cores 2-4, sharing the LLC and DRAM) 19
  • 20. Effect of LLC (Shared) MSHRs • Increasing LLC MSHRs eliminates the MSHR contention • Issue: scalable? – MSHRs are highly associative (must be accessed in parallel) (graphs: L1:6/LLC:12 MSHRs vs. L1:6/LLC:24 MSHRs) 20
  • 21. Effect of L1 (Private) MSHRs • Reducing L1 MSHRs can eliminate contention • Issue: single thread performance. How much? – It depends. L1 MSHRs are often underutilized – EEMBC: low, SD-VBS: medium, SPEC2006: high (graphs: EEMBC, SD-VBS vs. IsolBench, SPEC2006) 21
  • 22. Outline • Evaluate the isolation effect of cache partitioning • Understand the source of contention • Propose an OS/architecture collaborative solution 22
  • 23. OS Controlled MSHR Partitioning • Add two registers in each core’s L1 cache – TargetCount: max. MSHRs (set by OS) – ValidCount: used MSHRs (set by HW) • OS can control each core’s MLP – By adjusting the core’s TargetCount register (diagram: per-core L1 MSHR files, each entry holding Valid, Block Addr., Issue, and Target Info. fields, feeding the shared LLC) 23
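The TargetCount/ValidCount mechanism can be sketched in software as a simple quota gate on MSHR allocation. This is a behavioral model of the proposed hardware extension, assuming the two per-core registers described on the slide; the function and field names are illustrative:

```c
#include <stdbool.h>

/* Behavioral model of the proposed per-core MSHR quota (sketch). */
struct l1_mshr_ctrl {
    unsigned target_count;  /* max MSHRs this core may use (set by OS) */
    unsigned valid_count;   /* MSHRs currently in use (set by HW) */
};

/* HW-side check on a miss: allocate an MSHR only while under quota.
 * Capping a core's in-flight misses caps its MLP, which is how the OS
 * bounds each core's pressure on the shared LLC MSHRs. */
bool mshr_may_allocate(struct l1_mshr_ctrl *c)
{
    if (c->valid_count >= c->target_count)
        return false;       /* core at its quota: the miss must wait */
    c->valid_count++;
    return true;
}

/* HW-side release when the miss completes. */
void mshr_release(struct l1_mshr_ctrl *c)
{
    if (c->valid_count > 0)
        c->valid_count--;
}
```

Writing a smaller TargetCount throttles a core without touching its cache-space partition, which is the key design point: space and MSHR bandwidth are partitioned independently.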
  • 24. OS Scheduler Design • Partition LLC MSHRs by enforcing the per-core TargetCount limits • When RT tasks are scheduled – RT tasks: reserve MSHRs – Non-RT tasks: share remaining MSHRs • Implemented in Linux kernel – prepare_task_switch() at kernel/sched/core.c 24
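The scheduler-side policy could be hooked at context-switch time roughly as follows. This is a hedged sketch of the idea, not the actual kernel patch: `mshr_switch_hook`, `mshr_quota`, and the stand-in register write are all hypothetical names, and the real implementation lives in `prepare_task_switch()` in `kernel/sched/core.c`:

```c
/* Sketch of per-task MSHR quota programming at context switch. */
struct task {
    int rt_prio;         /* > 0 for real-time tasks */
    unsigned mshr_quota; /* MSHRs reserved for this RT task */
};

/* Stands in for this core's memory-mapped TargetCount register. */
static unsigned target_count_reg;

static void set_target_count(unsigned n)
{
    target_count_reg = n;   /* real HW: an MMIO/system-register write */
}

/* Called as the incoming task is installed on this core.
 * RT tasks get their reserved share; non-RT tasks share whatever
 * is left after all RT reservations. */
void mshr_switch_hook(struct task *next, unsigned total_mshrs,
                      unsigned reserved_mshrs)
{
    if (next->rt_prio > 0)
        set_target_count(next->mshr_quota);
    else
        set_target_count(total_mshrs - reserved_mshrs);
}
```

This matches the policy on the slide: RT tasks run with a guaranteed MSHR reservation, while non-RT tasks (e.g. running in the slack, as in the case study) split the remainder.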
  • 25. Case Study • 4 RT tasks (EEMBC) – One RT per core – Reserve 2 MSHRs – Periods (P): 20, 30, 40, 60ms – Compute time (C): ~8ms • 4 NRT tasks – One NRT per core – Run on slack • Up to 20% WCET reduction – Compared to cache partition only 25
  • 26. Conclusion • Evaluated the effect of cache partitioning on four real COTS multicore architectures and a full system simulator • Found cache partitioning does not ensure cache (hit) performance isolation due to MSHR contention • Proposed low-cost architecture extension and OS scheduler support to eliminate MSHR contention • Availability – https://github.com/CSL-KU/IsolBench – Synthetic benchmarks (IsolBench), test scripts, kernel patches 26
  • 27. Exp3: BwRead(LLC) vs. BwRead(LLC) • Cache bandwidth contention is not the main source – Higher cache-bandwidth co-runners cause smaller slowdowns 27
  • 28. EEMBC, SD-VBS Workload • Subject – Subset of EEMBC, SD-VBS – High L2 hit, Low L2 miss • Co-runners – BwWrite(DRAM): High L2 miss, write-intensive 28

Editor's Notes

  1. This talk is about improving isolation in modern multicore real-time systems focusing on shared cache.
  2. High-performance multicore processors are increasingly demanded in many modern real-time systems to satisfy the ever increasing performance need while reducing size, weight, and power consumption.
  3. Unfortunately, it is well known that providing timing predictability is very difficult in multicore processors due to contention in shared hardware resources. In this work, we focus on the shared cache.
  4. Cache partitioning is a well known technique to improve isolation for real-time systems. It can be done either in software (by the OS) or in hardware. Either way, the goal of partitioning is to eliminate unwanted cache-line evictions by providing isolated space per core/task. Most literature equates cache partitioning with cache performance isolation. However, as we will show in this work, partitioning the cache doesn’t necessarily provide performance isolation in modern high-performance multicore architectures.
  5. This talk is composed of three parts. First, we evaluate the isolation effect of cache partitioning on four real COTS multicore architectures. Surprisingly, we observe up to 21X slowdown of a task running on a dedicated core, accessing a dedicated cache partition, even though almost all of its memory accesses are cache hits. We then further investigate the source of the contention using a cycle-accurate full system simulator. Finally, based on this understanding, we propose an OS and architecture collaborative solution to provide true cache performance isolation.
  6. Before we begin, let me briefly introduce essential background on non-blocking caches. A non-blocking cache can continue to serve cache-hit requests under multiple cache misses. Supporting multiple outstanding misses is important to leverage memory-level parallelism. It is especially important for out-of-order cores, but in-order multicores can also benefit from non-blocking shared caches. In a non-blocking cache, there are special hardware registers known as miss-status-holding registers (MSHRs). When a cache miss occurs, the cache controller allocates an MSHR entry to track the outgoing cache miss, and the entry is cleared when the data is delivered from the lower-level memory hierarchy.
  7. The number of MSHRs essentially determines the parallelism of the cache. Now, because MSHRs are limited and getting the data from DRAM can take hundreds of CPU cycles, it is possible that the MSHRs of a cache become fully occupied with outstanding misses. If that happens, the cache locks up and all subsequent accesses to the cache stall.
  8. For experiments, we use four COTS multicore architectures: one in-order and three out-of-order architectures.
  9. On these platforms, we perform a set of cache-interference experiments. The basic setup is as follows. We run a subject benchmark on one core and increase the number of co-runners on the other cores. I would like to stress that the LLC is partitioned on a per-core basis. So, what we would like to see is whether cache partitioning provides cache performance isolation.
  10. First, we use a benchmark suite called IsolBench that consists of two synthetic benchmarks: latency and bandwidth. The latency benchmark is a simple linked-list traversal program. Due to data dependency, only one outstanding miss can be generated at a time. The bandwidth benchmark is a simple array access program. As there is no data dependency, multiple concurrent accesses can be made on an out-of-order core. We created six workloads by varying the working-set sizes. Here, (LLC) denotes a working set smaller than the given cache partition, while (DRAM) denotes one twice the size of the LLC. In all workloads, the subject benchmarks always fit in their cache partitions; in other words, their accesses are cache hits. So, ideally, cache partitioning should provide performance isolation regardless of the co-runners.
  11. We also used real benchmarks as subjects and got similar results: isolation is provided on the in-order cores but not on the out-of-order cores.
  12. We attribute this to the MSHR contention problem. This table shows the experimentally obtained core-local and global memory-level parallelism of the tested platforms. The detailed methods used to obtain the MLP are in the paper. These numbers were also validated against the processors’ manuals. What this table suggests is that the number of MSHRs in the LLC is smaller than the sum of the L1 MSHRs on the tested out-of-order cores. As I mentioned before, the shortage of LLC MSHRs can then lock up the LLC, even if the cache space is partitioned.
  13. Next, we validate the MSHR contention problem using a cycle-accurate full system simulator. We also perform a set of sensitivity analyses to better understand the effects of MSHRs in the local and shared caches.
  14. At first, we suspected that this might be caused by the shared bus that connects the LLC and the cores, but the next experiment showed that this was not the case. In this experiment, we changed the co-runners’ working-set sizes to fit in their cache partitions. Because the co-runners’ memory accesses are now also cache hits, the cache bandwidth demand, and therefore the shared bus contention, should have increased. Yet, the slowdown is much smaller than in the previous experiment, where the cache bandwidth usage was much lower.