This document provides an overview of cache partitioning techniques (CPTs) in multicore processors. It begins with the motivation for CPTs: cache contention grows as the number of cores increases. It then classifies CPTs along several axes, including granularity (way, set, block), static vs. dynamic, strict vs. pseudo, and hardware vs. software control. It discusses challenges of CPTs, such as profiling overhead and implementation complexity, and covers techniques for profiling cache usage and determining cache partitions. The goal of CPTs is to improve performance by reducing interference between applications sharing a cache.
2. Some acronyms
CP = cache partitioning, CPT = CP technique
HW = hardware, SW = software
BW = bandwidth
LLC = last level cache
RP = replacement policy
MLP/TLP = memory/thread level parallelism
IPC = instructions per cycle
NUCA = non-uniform cache architecture
QoS = quality-of-service
Perf = performance
N denotes number of cores
3. Motivation for Cache Management in Multicores
With increasing number of cores, performance of multicores does not scale linearly due to cache contention and other factors
Memory requirement of applications is increasing
=> Cache management has become extremely important in multicores
4. Private v/s shared cache
Private caches:
Avoid interference
Cannot account for inter- and intra-application variation in cache requirements
Limited capacity => cannot reduce miss-rate effectively
Shared cache:
Higher total capacity => can reduce miss-rate
Interference b/w apps under traditional cache management policies => performance loss, unfairness and lack of QoS
Use CP in a shared cache => capacity advantage of a shared cache, performance-isolation advantage of private caches
5. Examples of processors with shared LLC
IBM Power 7
Intel Core i7
AMD Phenom X4
Sun Niagara T2
We first provide background on CPTs and then discuss several CPTs
7. Potential of CP
Different apps, and different threads within a multithreaded app, have different cache demands and performance sensitivity
Further, performance of cores may differ due to
  differences in cache latencies due to NUCA design
  differences in core frequencies due to process variation
CP can compensate for these differences!
CP can also optimize for fairness and QoS
8. Potential of CP
CP avoids interference & provides higher effective cache capacity
Reduces miss-rate and bandwidth contention
This may benefit even those applications whose cache quotas are reduced!
Saves energy by
  reducing execution time
  allowing unused cache to be power-gated
9. Challenges of CP
Number of possible partitions increases exponentially with core-count
Simple schemes become ineffective
Finding the partitioning with minimum overall cache miss-rate (i.e., optimal partitioning) is NP-hard; moreover, the optimal partitioning may not be fair
Naive CPTs incur large profiling and reconfiguration overhead
Hardware support required for implementing CPTs (e.g., true LRU) may be too costly or unavailable
10. Challenges of CP
Reduction in miss-rate brought by a CPT may not translate into better performance when the bottleneck lies elsewhere, e.g., load-imbalance or BW congestion
CP is useful only for LLC-sensitive apps; unnecessary or harmful for small-footprint apps
CP is unnecessary for large caches
11. A Quick Background on Page coloring
(will be useful for understanding CP)
12. Page Coloring
[Figure: OS-controlled address translation maps a virtual address (virtual page number + page offset) to a physical address (physical page number + page offset); the physical address indexes a physically indexed cache as cache tag, set index and block offset. The page color bits are the bits shared by the physical page number and the set index, and are under OS control.]
• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).
Lin et al. HPCA’08
13. Summary of page coloring
Virtual address has: virtual page number and page offset
VA converted to PA by OS-controlled address translation; PA used in a physically indexed cache
Page color bits = common bits between physical page number and set index
Physically indexed cache is divided into multiple regions; each OS page will be cached in one of those regions, indexed by its page color
OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits); a small sketch of extracting the color bits follows this slide
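As a rough illustration of the bit manipulation involved, the following sketch extracts the page color from a physical address under an assumed cache geometry (64 B blocks, 4096 sets, 4 KB pages); the geometry and the sample addresses are illustrative assumptions, not tied to any particular processor.

```c
/* Minimal sketch: extract the page color of a physical address, assuming a
 * physically indexed cache with the bit layout described above. The geometry
 * (64 B blocks, 4096 sets, 4 KB pages) is illustrative. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 6    /* 64 B cache blocks */
#define SET_INDEX_BITS    12   /* 4096 sets         */
#define PAGE_OFFSET_BITS  12   /* 4 KB pages        */

/* Color bits are the set-index bits that lie above the page offset. */
static unsigned page_color(uint64_t paddr) {
    unsigned color_bits = BLOCK_OFFSET_BITS + SET_INDEX_BITS - PAGE_OFFSET_BITS;
    return (unsigned)((paddr >> PAGE_OFFSET_BITS) & ((1u << color_bits) - 1));
}

int main(void) {
    uint64_t pages[] = { 0x0000, 0x1000, 0x40000, 0x41000 };
    for (unsigned i = 0; i < 4; i++)
        printf("paddr 0x%llx -> color %u\n",
               (unsigned long long)pages[i], page_color(pages[i]));
    return 0;
}
```

An OS implementing coloring would choose, for each virtual page of an application, a free physical frame whose color lies in that application's assigned set of colors.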
14. Computing # of Page Colors
#PageColor bits = #BlockOffset bits + #SetIndex bits - #PageOffset bits
=> #PageColors = (CacheBlockSize * NumberOfSets) / PageSize
              = CacheSize / (PageSize * CacheAssociativity)
[Figure: the physical address is split into page number and page offset, and also into cache tag, cache set index and block offset; the cache color bits are the overlap between the two splits.]
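To make the arithmetic concrete, here is a minimal standalone sketch that plugs in the parameter values used in the example on the next slide (16-way, 4 MB cache, 64 B blocks, 4 KB pages); the numbers are illustrative, not tied to any particular processor.

```c
/* Minimal sketch: compute the number of page colors from cache and page
 * parameters, using the formulas above. Parameter values are illustrative
 * (they match the 16-way, 4 MB, 64 B-block, 4 KB-page example). */
#include <stdio.h>

int main(void) {
    unsigned cache_size    = 4 * 1024 * 1024;  /* 4 MB   */
    unsigned block_size    = 64;               /* 64 B   */
    unsigned associativity = 16;               /* 16-way */
    unsigned page_size     = 4 * 1024;         /* 4 KB   */

    unsigned num_sets   = cache_size / (block_size * associativity);
    unsigned num_colors = (block_size * num_sets) / page_size;
    /* Equivalently: cache_size / (page_size * associativity). */

    printf("sets = %u, colors = %u, blocks = %u\n",
           num_sets, num_colors, cache_size / block_size);
    return 0;
}
```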
16. Classification 1. Based on granularity
Cache quota can be allocated in terms of ways, sets (colors) or blocks
A 16-way, 4MB cache with block size of 64B and system page size of 4KB => 16 ways, 64 colors, 65536 blocks
Granularity increases from way to set to block
17. Way-based CPT
Simple implementation
Flushing-free reconfiguration
Ease of obtaining way-level profiling information
Sufficient for small N (number of cores)
Harms associativity
Meaningful only if associativity >= 2N (at least one way needs to be allocated to each core)
Requires caches of high associativity => high access latency/power overheads
Requires additional bits to identify the owner core of each block
(A sketch of mask-restricted victim selection follows this slide.)
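To show what allocating ways means in practice, here is a minimal sketch of victim selection restricted by per-core way masks: on a miss, the victim is chosen only among the requesting core's ways, so cores never evict each other's blocks. The masks, the LRU ages and the two-core split are illustrative assumptions, not the mechanism of any specific processor.

```c
/* Minimal sketch of way-based partitioning: the victim on a miss is the
 * oldest block among the ways owned by the requesting core. */
#include <stdint.h>
#include <stdio.h>

#define NUM_WAYS 16

/* Example quota for a 2-core system: core 0 gets ways 0-5, core 1 gets 6-15. */
static const uint16_t way_mask[2] = { 0x003F, 0xFFC0 };

/* age[w] = how long ago way w in this set was used (bigger = older). */
static int pick_victim(const unsigned age[NUM_WAYS], int core) {
    int victim = -1;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!(way_mask[core] & (1u << w)))
            continue;                      /* not this core's way: skip     */
        if (victim < 0 || age[w] > age[victim])
            victim = w;                    /* oldest way owned by this core */
    }
    return victim;
}

int main(void) {
    unsigned age[NUM_WAYS] = { 3, 9, 1, 4, 0, 2, 7, 5, 8, 6, 11, 10, 12, 15, 14, 13 };
    printf("core 0 evicts way %d\n", pick_victim(age, 0));   /* expect way 1  */
    printf("core 1 evicts way %d\n", pick_victim(age, 1));   /* expect way 13 */
    return 0;
}
```

The extra per-block owner bits mentioned above are what lets hardware know which mask to consult (or, as here, the mask of the core issuing the miss).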
18. Set (color)-based CPT
Higher granularity than way-based CP
Amenable to SW control
Requires significant changes to the OS
May complicate virtual memory management
Changes set-indices of many blocks => these blocks need to be flushed or migrated to their new set-indices
To lower this overhead, reduce reconfiguration frequency or the number of recolored pages in each interval, or perform page-migration in a lazy manner
19. Block-based CPT
Provides highest granularity
Highly useful for large N
Obtaining profiling info for block-based allocation is challenging
Some CPTs obtain this info by linearly interpolating the miss-rate curve of way-level monitors => not accurate
May require changes to the RP and additional bits to identify the owner core of each block
20. Classification 2. Whether static or dynamic
Static CPT: determines cache partitions offline (i.e., before application execution)
Dynamic CPT: determines cache partitions at runtime (i.e., while the application is running)
21. Static v/s dynamic CPT
Static CPT:
Useful for testing all possible partitions for small core-counts to find an upper bound on gain
Not feasible with large N
Cannot account for temporal variation in cache behavior
Dynamic CPT:
Suitable for large N
Can account for temporal variation in behavior
Incurs runtime overhead
Unnecessary if app behavior is uniform over time
22. Classification 3. Whether strict or pseudo
Strict (hard) CPT: cache quota is strictly enforced
Pseudo (soft) CPT: cache quota is not strictly enforced; actual allocation may differ from the target quota
Ex.: 8-way cache, quota of App1 = 3 ways, App2 = 5 ways
Strict: enforce [3,5] in all intervals
Pseudo: quota = [3,5] in most intervals but [2,6] or [4,4] in other intervals
23. [Figure: comparison of way-based and block-based CP, contrasting strict partitioning (actual allocation stays close to the target) with pseudo-partitioning (actual allocation deviates from the target). Sanchez et al. ISCA’11]
24. Strict v/s pseudo CPT
Strict CPT:
Important to guarantee QoS and fairness
May lead to inefficient utilization of the cache, esp. when the allocation granularity is large: dead blocks of one core cannot be evicted by another core, even if that core could benefit from those blocks
Pseudo CPT:
May provide most benefits of strict CPT with a much simpler implementation
Allows cores to steal quota from other cores
Actual quota of a core can differ from its target; this problem is esp. severe with large N
25. Classification 4. Whether HW or SW-control
HW-based CPT: CPT is independent of OS
parameters and is implemented in HW
SW-based CPT: Partitioning decision is taken in
SW, CPT depends on OS features (e.g., system page
size)
26. HW-based v/s SW-based CPT
HW-based CPT:
Can be used at fine granularity (~10K cycles)
Reduces profiling and reconfiguration overhead
Adding the required HW support is challenging
SW-based CPT:
SW control is important to account for other processor components, management schemes and system-level goals, e.g., optimizing fairness (v/s cache-level goals, e.g., minimizing miss-rate)
Higher reconfiguration overhead => used at coarse granularity (>1M cycles)
27. Classification 5. Fully v/s partially partitioned
[Figure: in a fully partitioned cache, all capacity is partitioned between cores (private to each core); in a partially partitioned cache, part of the capacity is partitioned and the rest remains shared between cores.]
Fully partitioned: higher capacity and better granularity available for partitioning
Partially partitioned: may provide advantages of both shared and private caches
28. CPTs in real processors
Some Intel processors provide support for way-based
CP [Int16]
Page coloring-based CP [Lin08] in Linux kernel
Intel Xeon processor E5-2600 v3 family: support for
implementing shared cache QoS. It has
“cache monitoring technology” to track cache usage
“cache allocation technology” for allocating cache quotas,
e.g. to avoid cache starvation
AMD Opteron: pseudo-CPT to restrict cache quota of
cache-polluting apps
[Int16: Intel 64 and IA-32 Architectures Developer’s Manual: Vol. 3B http://goo.gl/sw24WL ]
[Lin08: Lin et al. HPCA’08]
30. How to perform CP
Profile apps to find their cache behavior/requirement
Classify apps based on their cache behavior
Determine cache quota of each app
31. Profiling techniques
Collect data about hits/misses to different ways.
Based on that, decide benefit from giving/taking-
away cache space to/from an app
Set-sampling: only a few sets need to be monitored
to estimate the properties of the entire cache
Data can be collected from actual cache or
separate profiling unit
Separate unit only needs tags, not data => size small
By using set-sampling, its size can be reduced greatly
32. [Figure: (a) a cache with 4 ways and K sets, holding tag and data directories (Tj and Dj denote the tag and data of set j); (b) an auxiliary tag directory (ATD), which stores only tags plus hit counters, ordered MRU to LRU; (c) a sampled ATD with sampling ratio 2. Profiling data can be collected from the actual cache (a) or from a separate profiling unit ((b) and (c)).]
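A minimal Python sketch of the monitoring idea above, under assumed data structures rather than the actual hardware: each sampled set keeps only tags in true-LRU order, a hit at recency position i increments a counter H[i], and set-sampling limits which sets are tracked.

```python
# Sketch of a sampled auxiliary tag directory (ATD) with per-way hit counters.

class SampledATD:
    def __init__(self, num_sets, ways, sampling_ratio):
        self.ways = ways
        # tags per monitored set, ordered MRU -> LRU
        self.sets = {s: [] for s in range(0, num_sets, sampling_ratio)}
        self.hits = [0] * ways          # H[i]: hits at recency position i
        self.misses = 0

    def access(self, set_index, tag):
        stack = self.sets.get(set_index)
        if stack is None:               # set not sampled
            return
        if tag in stack:
            pos = stack.index(tag)      # recency (stack) position of the hit
            self.hits[pos] += 1
            stack.remove(tag)
        else:
            self.misses += 1
            if len(stack) == self.ways:
                stack.pop()             # evict the LRU tag
        stack.insert(0, tag)            # move/insert at MRU

    def misses_with(self, ways):
        # Estimated misses if this core were given `ways` ways; counts cover
        # only the sampled sets, so scale by the sampling ratio if needed.
        return self.misses + sum(self.hits[ways:])
```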
34. How to Use Profiling Information (two-core system)
• Find misses avoided with each way in each core
• Find the partitioning which gives the least misses (highest hits)
Xie et al. HIPEAC’10
35. How to Use Profiling Information (four-core system)
Xie et al. HIPEAC’10
36. Cache-behavior Based Application Classification
Cache insensitive: not many accesses to L2
Cache friendly: miss-rate reduces with increasing
L2 quota
Cache fitting: miss-rate reduces with increasing L2
quota and becomes nearly-zero at some point
Streaming: very large working set; shows thrashing
with any RP due to inadequate cache reuse
Thrashing: working set larger than cache capacity;
thrashes an LRU-managed cache, but may benefit
from thrash-resistant RPs.
37. Cache-behavior Based Application Classification
• Reduce cache quota of thrashing/streaming app
• Give higher quota to friendly and fitting app till they
benefit from it
38. Cache-behavior Based Application Classification
Utility = change in miss-rate with cache quota
Low, high and saturating utility
39. Ideas for Reducing Overhead of CPTs
1. A few thrashing apps are responsible for most interference
in a shared cache
=> Restraining just their cache quotas can provide
performance benefits similar to exact (but complex)
CPTs
2. Some works extend CPTs proposed for true-LRU to
pseudo-LRU
3. Use Bloom Filter to reduce storage overhead
41. Utility Based Shared Cache Partitioning
Goal: Maximize system throughput
Observation: Not all threads/applications benefit
equally from caching => simple LRU replacement is not
good for system throughput
Idea: Allocate more cache space to applications that
obtain the most benefit from more space
Qureshi et al. MICRO’06
42. Utility Based Shared Cache Partitioning
Utility U_a^b = Misses with a ways - Misses with b ways
[Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, illustrating low-utility, high-utility and saturating-utility applications.]
43. Partitioning Algorithm
Evaluate all possible partitions and select the best
With a ways to core1 and (16-a) ways to core2:
Hitscore1 = H_0 + H_1 + … + H_{a-1} (from UMON1)
Hitscore2 = H_0 + H_1 + … + H_{16-a-1} (from UMON2)
Select a that maximizes (Hitscore1 + Hitscore2) (a sketch of this search follows)
Partitioning done once every 5 million cycles
Qureshi et al. MICRO’06
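A small sketch of this exhaustive two-core search, assuming H1 and H2 are the per-way hit counters (H[i] = hits at recency position i) reported by the two utility monitors; names are illustrative.

```python
# Sketch: score every split of the ways and keep the best one.

def best_two_core_partition(H1, H2, total_ways=16):
    best_a, best_score = 1, -1
    for a in range(1, total_ways):                       # at least one way each
        score = sum(H1[:a]) + sum(H2[:total_ways - a])   # Hitscore1 + Hitscore2
        if score > best_score:
            best_a, best_score = a, score
    return best_a, total_ways - best_a                   # ways for core1, core2
```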
44. Way Partitioning Support
1. Each line has core-id bits
2. On a miss, count ways_occupied in the set by the miss-causing app:
If ways_occupied < ways_given: victim is the LRU line of some other app
Else: victim is the LRU line of the miss-causing app
(A victim-selection sketch follows.)
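A minimal sketch of this victim-selection rule, assuming each line records its owner core and the lines of a set are kept in LRU order (index 0 = MRU); the data layout is illustrative, not the hardware structure.

```python
# Sketch: pick the index of the victim line in one set.

def choose_victim(set_lines, miss_core, ways_given):
    """set_lines: list of (core_id, tag) ordered MRU -> LRU."""
    occupied = sum(1 for core, _ in set_lines if core == miss_core)
    if occupied < ways_given[miss_core]:
        # Under quota: take the LRU line belonging to some other core.
        for i in range(len(set_lines) - 1, -1, -1):
            if set_lines[i][0] != miss_core:
                return i
    # At/over quota (or no other core's line present): evict own LRU line.
    for i in range(len(set_lines) - 1, -1, -1):
        if set_lines[i][0] == miss_core:
            return i
    return len(set_lines) - 1
```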
45. Greedy and Lookahead CP Algorithm
Greedy CP algorithm
Iteratively assign a way to an app with highest utility for
that way
Optimal if utility curves of all apps are convex
Lookahead algorithm
Works for general case when utility curves are not convex
In each iteration, for every app: Compute “maximum
marginal utility” (MMU) and least number of ways at
which MMU occurs
App with largest MMU is given # of ways required for
achieving MMU.
Stop iterating when all ways allocated
Qureshi et al. MICRO’06
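A rough Python sketch of the lookahead loop described above, assuming miss_curves[c][w] gives the misses of core c when it holds w ways (0 <= w <= total_ways); tie-breaking and stopping details are simplifications.

```python
# Sketch: lookahead allocation using maximum marginal utility (MMU).

def lookahead_partition(miss_curves, total_ways):
    n = len(miss_curves)
    alloc = [0] * n
    remaining = total_ways
    while remaining > 0:
        best = None                                  # (mmu, core, extra_ways)
        for c in range(n):
            cur = alloc[c]
            for extra in range(1, remaining + 1):
                gain = miss_curves[c][cur] - miss_curves[c][cur + extra]
                mmu = gain / extra                   # marginal utility per way
                # '>' keeps the least extra at which the MMU occurs
                if best is None or mmu > best[0]:
                    best = (mmu, c, extra)
        _, core, extra = best                        # app with largest MMU wins
        alloc[core] += extra
        remaining -= extra
    return alloc
```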
46. Machine-learning based CPT
Perform synergistic management of processor
resources (e.g., L2 cache quota, power budget and
off-chip bandwidth), instead of isolated management
Train a neural network to learn processor
performance as function of allocated resources
Use a stochastic hill-climbing based search heuristic
Use a way-based CPT, per-core DVFS to manage
chip-power distribution and a strategy for
distributing bandwidth between apps
Bitirgen et al. MICRO'08
47. Coloring-based CPT
CPT for performance
Run one interval with current partition and one interval each
with increasing/decreasing quotas of each core (total 2 cores)
Select partition with least misses
CPT for QoS
Target: perf. of app1 is not degraded by more than threshold1, and perf. of
app2 is maximized
if ((IPC of app1 - baselineIPC of app1)< threshold2)
Increase quota of app1 (if already maximum, stall app2)
Else if(IPC of app1 > baselineIPC of app1)
Resume app2 (if stalled) or increase its quota
On change in cache quota, perform lazy page-migration
Lin et al. HPCA'08
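A hedged sketch of the QoS control loop above; the function and argument names, and reading threshold2 as the allowed IPC shortfall of app1, are assumptions for illustration, not the original implementation.

```python
# Sketch: per-interval color-quota adjustment for QoS between two apps.

def qos_adjust(ipc1, baseline_ipc1, colors1, colors2, app2_stalled,
               threshold2, max_colors):
    """Return updated (colors1, colors2, app2_stalled) for the next interval."""
    if ipc1 - baseline_ipc1 < threshold2:
        # app1 degraded beyond the margin: grow its quota; once it holds all
        # colors, stall app2 instead.
        if colors1 < max_colors:
            colors1, colors2 = colors1 + 1, colors2 - 1
        else:
            app2_stalled = True
    elif ipc1 > baseline_ipc1:
        # app1 has slack: resume app2 if stalled, otherwise return a color.
        if app2_stalled:
            app2_stalled = False
        elif colors1 > 1:
            colors1, colors2 = colors1 - 1, colors2 + 1
    # After a quota change, recolored pages would be migrated lazily.
    return colors1, colors2, app2_stalled
```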
48. Coloring-based CPT with HW support
Estimate energy consumption of running apps for
different cache quota
Let #sets in LLC be X
Estimate miss-rate for caches of different numbers of sets,
viz., X, X/2, X/4, X/8, etc., using profiling units
From this, estimate energy consumption for different
cache partitions
Quota of app with small utility is reduced or is increased
only slightly
In some partitions, some colors may not be allocated to
any core
From these, select a partitioning with minimum energy
Power-gate unused cache colors
Mittal et al. TVLSI’14
49. Vantage: A Block-based CPT (1/2)
Divide cache into managed and unmanaged portion
(e.g., 85:15)
Only partition managed portion
Allows maintaining associativity of each partition
Sanchez et al. ISCA’11
50. Vantage: A Block-based CPT (2/2)
Preferentially evict blocks from unmanaged portion
~0 evictions from managed portion
Enforce quotas by matching demotion and
promotion rates
On any eviction, all candidates with eviction
priorities greater than a partition-specific threshold
are demoted
Use time-stamp based LRU to estimate eviction
priorities with low-overhead
Sanchez et al. ISCA’11
52. Partitioning by Controlling Insertion-priority (1/3)
Find cache quota of each app
Quota of an app decides its insertion priority location
Cache hit => block promoted by one step with probability
Z and not promoted with probability 1-Z
Blocks of apps with low priority experience high
competition
Thrashing apps get one way each; also, Z is kept very small
for them (a sketch follows)
Xie et al. ISCA'09
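A minimal sketch of quota-driven insertion and probabilistic promotion on one recency stack; the exact mapping from quota to insertion position is an assumption chosen to match the example on the next slide (8-way cache, quotas of 5 and 3 ways giving insertion locations 3 and 5).

```python
# Sketch: insertion-priority based partitioning on a single set.
import random

def pipp_insert(stack, ways, tag, quota):
    """stack: list ordered MRU -> LRU. Larger quota => insert closer to MRU."""
    if len(stack) == ways:
        stack.pop()                              # evict the LRU block
    pos = min(ways - quota, len(stack))          # assumed quota -> position mapping
    stack.insert(pos, tag)

def pipp_hit(stack, tag, z):
    """On a hit, promote the block by a single position with probability z."""
    i = stack.index(tag)
    if i > 0 and random.random() < z:
        stack[i - 1], stack[i] = stack[i], stack[i - 1]
```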
53. Partitioning by Controlling Insertion-priority (2/3)
[Figure (Xie et al. ISCA'09): worked example on the recency stack of an 8-way cache shared by Core 0 and Core 1. Core 0 inserts new blocks at location 3 and Core 1 at location 5 of the stack; on a hit a block is promoted by one position; the deviation of each core's actual occupancy from its quota is tracked across a sequence of accesses.]
54. Limitations:
many partitions may have low insertion positions =>
severe contention at near-LRU positions
blocks inserted at near-MRU positions become difficult to evict
Partitioning by Controlling Insertion-priority (3/3)
Xie et al. ISCA'09
55. Decay-interval based CPT
Decay interval: if a block is not accessed for one decay
interval, it becomes a candidate for replacement
irrespective of its LRU status.
Tune decay intervals of apps based on their cache
utility and priority
=> blocks of apps with high priority and locality stay
in cache for longer time
Choose decay interval which minimizes total misses
and increases cache usage efficiency
Petoumenos et al. MoBS'06
56. Reuse-distance based CPT
Keep a block in cache only until its expected reuse
happens
This reuse distance is called protecting distance (PD)
At insertion/promotion time, reuse distance of a
block set to PD
On each access to set, PD values for all its blocks
decreased by one; if value reaches 0, block becomes
replacement candidate.
Change PD to control cache quota of an app
In multicores, find PDs for the cores that maximize overall
cache hit rate
Duong et al. MICRO'12
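A rough sketch of protecting-distance bookkeeping, with assumed per-block fields rather than the original hardware counters; on a miss an expired (PD = 0) block is the preferred victim, falling back to the least-protected block in this simplification.

```python
# Sketch: protecting-distance (PD) based replacement within one set.

def pd_access(cache_set, ways, tag, app, pd_of):
    """cache_set: list of dicts {'tag','app','pd'}. Returns True on a hit."""
    for blk in cache_set:                      # age every block of the set
        if blk["pd"] > 0:
            blk["pd"] -= 1
    for blk in cache_set:
        if blk["tag"] == tag:
            blk["pd"] = pd_of[app]             # reset PD on hit (promotion)
            return True
    if len(cache_set) == ways:                 # miss with a full set
        victim = min(cache_set, key=lambda b: b["pd"])   # prefer expired blocks
        cache_set.remove(victim)
    cache_set.append({"tag": tag, "app": app, "pd": pd_of[app]})
    return False
```

Raising or lowering pd_of[app] changes how long that app's blocks are protected, which is how the per-app cache quota is controlled.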
57. Performance metric-based
CPTs
Next slide discusses limitations of miss-rate guided
CPTs. We then summarize CPTs which are guided
directly by some performance metric
58. Limitation of miss-rate guided CPTs
Latency of different misses may be different due to
instantaneous MLP and NUCA effects, however,
most CPTs treat different misses equally
[Figure: timeline of (a) an isolated L2 miss, where the pipeline stalls for the full memory latency before commit restarts, vs. (b) clustered L2 misses, whose memory latencies overlap; the isolated miss has a higher average latency, the clustered misses a lower average latency.]
Moreto et al. HiPEAC'08
59. Using MLP penalty of misses
Find cache misses for different # of ways
Assign higher MLP penalty to isolated misses than
clustered misses
Compute perf. impact of a cache miss converted into
hit and vice versa on an increase/decrease in cache
size, respectively
From this, find length of the miss-cluster
L2 instruction misses stall fetch => they have a fixed
miss latency and MLP penalty
From all possible partitions, select one with
minimum total "MLP penalty"
Moreto et al. HiPEAC'08
60. Using application slowdown model
This model measures app slowdown due to
interference at shared cache and main memory
Measure slowdown for every app at different # of
ways
Compute marginal slowdown utility (MSU) as
MSU = (slowdown with W+K ways - slowdown with W ways) / K
Partition using lookahead algorithm, except that use
MSU instead of marginal miss utility
Subramanian et al. MICRO'15
61. Using stall rate curves
Use "instruction retirement stall rate curves“ (SRC):
stall cycles due to memory latency at various L2 sizes
Get SRC directly from HW counters on real system
SRC is better than miss-rate curve in guiding CPT,
since SRC accounts for several factors, e.g.,
L2 miss-rate
impact of L2 misses on instruction retirement stall
memory bus contention
variable latencies of lower levels of memory hierarchy
(e.g., L3 and main memory)
Tam et al. WIOSCA'07
62. Using Memory Bandwidth
Apps with large miss-count may not consume largest
off-chip bandwidth if their memory accesses are not
clustered
=> Partitioning based on bandwidth can provide
better performance than based on misses
Through offline analysis, find partition with least
overall bandwidth requirement
Reduce cache quota of apps with low bandwidth
requirement
Yu et al. DAC'10
63. CPTs for Various Optimization Objectives
Example objectives:
• Fairness
• Load-balancing
• Implementing priorities
• Energy
• Limiting peak power of LLC
64. CPT for Ensuring Fairness
Iteratively perform two steps
1. Quota allocation: evaluate the fairness metric for all apps
If the gap between the apps with the least and the most
unfair impact of CP > threshold1:
Transfer some cache space from the app with lower
unfairness to the one with higher unfairness
Exclude these 2 apps and repeat the step for the remaining apps (see the sketch after this slide)
2. Adjustment: If reduction in miss-rate of app receiving
increased quota is more than threshold2
Commit decision made in quota allocation step
Else
Reverse the decision
Kim et al. PACT'04
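A hedged sketch of the quota-allocation step, assuming unfairness[a] is the chosen fairness metric for app a (e.g., its slowdown under sharing) and quota[a] its current cache share; the step size and tie handling are illustrative.

```python
# Sketch: iterative transfer of cache space from least- to most-affected apps.

def fairness_reallocate(unfairness, quota, threshold1, step=1):
    remaining = set(unfairness)
    while len(remaining) >= 2:
        worst = max(remaining, key=lambda a: unfairness[a])
        best = min(remaining, key=lambda a: unfairness[a])
        if unfairness[worst] - unfairness[best] <= threshold1:
            break                               # remaining apps are fair enough
        if quota[best] > step:                  # donor must keep some quota
            quota[best] -= step
            quota[worst] += step
        remaining -= {worst, best}              # exclude this pair, repeat
    return quota
```

The adjustment step described above would then commit or reverse each transfer depending on whether the receiving app's miss-rate actually improved by more than threshold2.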
65. Using Feedback-Control theory
Assume: IPC targets are given for all apps
Find new targets to maximize cache utilization
Find cache quota to achieve those targets
If total cache quota exceeds cache size
For QoS: reduce quota of low-priority apps
For fairness: reduce quota in proportion to current quota
App-level controller (a PID controller) finds quota
required for next epoch to achieve perf targets based
on perf in previous epoch with its quota
Srikantaiah et al. MICRO'09 (PID = proportional integral derivative)
66. Limiting LLC power and achieving fairness/QoS (1/2)
Goal: Limiting maximum power of LLC and
achieving fair or differentiated cache access latencies
between different apps
Use a two-level synergistic controller design
1. LLC power-controller (every 10M cycles)
Limits maximum LLC power for a given budget by
controlling number of active LLC banks
Remaining banks are power-gated
Wang et al. TC'2012
67. Limiting LLC power and achieving fairness/QoS (2/2)
2. Latency controller (every 1M cycles)
Controls ratio of cache access latencies between two
apps on every pair of neighboring cores.
For fairness: same latencies for all apps
For QoS: shorter latencies for high-priority apps
Finds cache-bank quota of each app
Their technique provides theoretical guarantee of
accurate control and system stability.
Controllers are designed as PI controllers
Wang et al. TC'2012
68. Changing quotas in different intervals (1/2)
Allocate different sized partitions to different apps in
different intervals
Cache quotas are expanded and contracted in
different intervals
[Figure: four apps (0-3) sharing the cache. (a) Fairness-oriented spatial partitioning: IPC=0.26, WS=1.23, FS=1.0. (b) Throughput-oriented spatial partitioning: IPC=0.52, WS=2.42, FS=1.22. (c) Multiple time-sharing partitioning (the proposed technique, which time-shares several spatial partitions for both throughput and fairness): IPC=0.52, WS=2.42, FS=1.97. WS/FS = weighted/fair speedup.]
69. Changing quotas in different intervals (2/2)
A thrashing app already has low throughput, so
reducing its quota in a contraction epoch does not
reduce its perf much…
but increasing its quota in an expansion epoch boosts
its perf greatly, which compensates for the slowdown in
contraction epochs.
Expansion opportunity is given to different apps
equally for fairness
in differentiated manner for QoS
70. CPTs for load-balancing in Multithreaded Apps (1/2)
[Figure: four cores C0-C3 sharing a 32-way L2. (a) Shared cache: one thread lies on the critical path. (b) Partitioned cache: threads 0-3 receive 3, 16, 8 and 5 L2 ways respectively, with the critical-path thread receiving the largest share.]
Critical thread can be accelerated by giving higher
cache quota to it => bottleneck removed
71. 1st CPT
Record CPIs of all threads
Allocate more ways to threads with higher CPIs
Limitation: thread's cache sensitivity is not taken into
account
2nd CPT
For each thread, build a model of how CPI varies with
cache quota
Do curve fitting by “cubic spline interpolation”
Repeatedly transfer one way from fastest thread to slowest
thread until some other thread becomes slowest
At this point, revert cache allocation by one-step and
accept this partitioning
Muralidhara et al. IPDPS'10
CPTs for load-balancing in Multithreaded Apps (2/2)
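A simplified sketch of the second scheme, assuming cpi_model[t] is a callable (e.g., the fitted spline) predicting thread t's CPI for a given number of ways, and alloc maps each thread to its current way count; names are illustrative.

```python
# Sketch: move ways from the fastest thread to the slowest (critical) thread.

def rebalance_ways(cpi_model, alloc):
    def cpi(t):
        return cpi_model[t](alloc[t])
    slowest = max(alloc, key=cpi)                  # current critical thread
    while True:
        fastest = min(alloc, key=cpi)
        if fastest == slowest or alloc[fastest] <= 1:
            break
        prev = dict(alloc)
        alloc[fastest] -= 1                        # take one way from the fastest
        alloc[slowest] += 1                        # give it to the critical thread
        if max(alloc, key=cpi) != slowest:         # another thread became slowest:
            alloc = prev                           # revert one step and accept
            break
    return alloc
```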
72. Removing imbalance due to process variation (1/2)
[Figure: frequencies of the cores of a 4-core processor, 1.8, 2.1, 2.4 and 2.6 GHz, differing due to process variation.]
For multithreaded programs
with synchronization barriers,
the slowest core will limit the
performance of other cores
Kozhikkottu et al. DAC'14
73. Removing imbalance due to process variation (2/2)
Use cache partitioning to give a higher cache quota to the slower cores
[Figure: the same 4-core processor (1.8, 2.1, 2.4, 2.6 GHz); PV-aware L2 cache partitioning allocates 20, 6, 4 and 2 L2 ways to these cores respectively, yielding higher throughput.]
Kozhikkottu et al. DAC'14
74. Saving leakage energy (1/2)
Locality of an app = (accesses to LRU blocks)/(accesses to
MRU blocks)
Most hits at MRU => app needs few ways to achieve high
hit rate and vice versa
Compare this ratio for two apps to decide cache quota
Kotera et al. HiPEAC'11
75. Saving leakage energy (2/2)
[Figure: the cache ways divided into a region allocated to core 0, a region allocated to core 1, and a power-gated region.]
• Compare the above locality ratio with thresholds to decide the number of
ways to power-gate
• Insight: if the total cache requirement of the cores < available cache
=> power-gate the remaining cache to save leakage energy
76. Saving dynamic energy
[Figure: an 8-set, 8-way cache shared by cores 0-3, shown as (a) a shared unpartitioned cache, (b) a shared partitioned cache, and (c) a shared partitioned cache with way-aligned data; only (c) saves dynamic energy.]
Ensure way-alignment of the data of each core: on an access by a core,
only that core's ways need to be accessed => dynamic energy saved
Sundararajan et al. HPCA’12
77. CPTs in various contexts
If cache is NUCA: try reducing both
misses and
hit latency (by allocating cache banks to closest
core)
If main memory is PCM: perform CP such that
both misses and writebacks are minimized
since PCM has high write energy/latency and low
write endurance
79. Integration with processor partitioning
• When the variation in degree of TLP between apps is high,
equally distributing processors between them is not optimal
• => Perform both
• processor partitioning (every 65M cycles) and
• cache partitioning (every 10M cycles)
Srikantaiah et al. SC'09
80. Integration with DRAM-bank partitioning
In the physical address, a few bits are common between the LLC set-index
bits and the DRAM bank-index bits
Thus, we can perform cache-only, bank-only or
combined partitioning, based on which is better
[Figure: physical frame number and page offset of a physical address, marking the bank-only bits (21-22), cache-only bits (16-18) and overlapped bits (14-15) on a processor with 8GB memory and 64 banks.]
Liu et al. ISCA'14
81. Integration with Bandwidth Partitioning
Whether BW partitioning can improve perf depends on
difference in miss frequencies between apps
With decreasing bandwidth, scope of perf improvement
increases
=> CP may lower the impact of BW partitioning on perf
By reducing difference in miss frequencies of apps and
By reducing total cache misses which relieves BW pressure
But, if CP increases difference in miss frequencies, it
increases impact of BW partitioning on performance.
E.g., for cache insensitive apps, CP cannot improve perf,
but by changing difference in miss frequencies, CP
enhances effectiveness of BW partitioning in boosting perf
Liu et al. HPCA'10
82. Integration with DVFS (1/3)
Model problem of dividing shared resource (chip
power budget and LLC capacity) between apps as a
dynamic distributed market
each app (core) is an agent
resource-prices change based on “demand” and “supply”
Initially:
each agent has a purchasing budget and builds a
performance model as function of allocated resource
A global arbiter fixes initial prices of all resources
Wang et al. HPCA'15
83. Integration with DVFS (2/3)
Iteratively: Each agent bids for the resources to
maximize its perf
Based on the bids, arbiter increases and reduces the price
of resources in high and low demand, respectively.
Agents bid again under new prices
Iteration stops when
change in price within iterations is very small or
a threshold # of iterations done or
no improvement in perf of an agent on changing the bid
At this point, perform resource-allocation
84. Integration with DVFS (3/3)
Agents work in decentralized manner and only
centralized function is pricing scheme => this
technique scales well to >64 cores
For throughput: assign larger budget to agents
with higher marginal utility
For fairness: assign equal budgets to all agents
Find cache utility by seeing miss-rate change and
power utility by changing frequency using DVFS
85. Integration with RP selection (1/2)
CPTs and thrash-resistant RPs are complementary
RPs temporally share LLC based on apps’ “locality”
CPTs spatially divide LLC based on apps’ “utility”
=> Thrash-resistant RPs good for workloads with
poor-locality apps
CPTs good for workloads with apps with widely
different utility values
Idea: Perform CPT with RP-selection to optimize
both locality and utility
Zhan et al. TC'14
86. Integration with RP selection (2/2)
[Figure: recency stack of an 8-way cache shared by Core 0 (Quota0 = 5 ways, RP0 = BIP) and Core 1 (Quota1 = 3 ways, RP1 = LRU); a decision module sets each core's insertion position and replacement point, choosing between LRU and BIP behavior with probabilities p and 1-p.]
Find hits at different way-counts for LRU and BIP
From this, find optimal CP and optimal RP
For CP use lookahead algorithm
RP chosen for a core is implemented in its cache
portion
87. References
S. Mittal, “A Survey of Techniques for Cache
Partitioning in Multicore Processors”, ACM
Computing Surveys, 2017 (link)