Made by- Himanshu Koli (2K10/CO/041)
Hiren Madan (2K10/CO/042)
Energy-Efficient Hardware Data Prefetching
SEMINAR
Delhi Technological University
(COE-416)
Contents
 Introduction
 What is Data Prefetching?
 Prefetching Classification
 How Prefetching works?
 Software Prefetching
 Limitations of Software based Prefetching
 Hardware Prefetching
 Hardware Vs. Software Approach
 Energy Aware Data Prefetching
 Energy Aware Prefetching Architecture
 Energy Aware Prefetching Techniques
 References
Introduction
Why is Data Prefetching needed?
 Microprocessor performance has increased at a dramatic rate.
 The expanding gap between microprocessor and DRAM performance has necessitated aggressive techniques to reduce the large latency of memory accesses.
 Cache memory hierarchies have managed to keep pace with processor request rates, but remain too expensive to use as a main store technology.
 Large cache hierarchies have proven effective in reducing the average memory access penalty for programs that show a high degree of locality in their addressing patterns.
 But scientific and other data-intensive programs spend more than half their run times stalled on memory requests.
Processor-Memory Performance Gap
 Under an on-demand fetch policy, the first access to a cache block always results in a cache miss. Such misses are known as cold-start or compulsory misses.
 When a program references a large array, cached elements of the array are likely to be overwritten (evicted).
 If the processor later needs a value that has been overwritten, it must make a full main memory access. This is called a capacity miss.
What is Data Prefetching?
 Data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference, rather than waiting for a cache miss to trigger the fetch.
 The prefetch proceeds in parallel with processor computation, giving the memory system time to transfer the desired data from main memory to the cache.
 Ideally, the prefetch completes just in time for the processor to access the needed data in the cache without stalling.
Execution Diagram assuming- a) No Prefetching,
b) Perfect Prefetching and c) Degraded Prefetching
How Prefetching Works?
Prefetching Classification
 Various prefetching techniques have been proposed-
Instruction Prefetching vs. Data Prefetching
Software-controlled prefetching vs. Hardware-controlled
prefetching.
 Data prefetching for different structures in general purpose
programs:
Prefetching for array structures.
Prefetching for pointer and linked data structures.
Software Data Prefetching
 Explicit “fetch” instructions
 Non-blocking memory operation.
 Cannot cause exceptions (e.g. page faults).
 Additional instructions executed.
 Modest hardware complexity
 Challenge -- prefetch scheduling
 Placement of the fetch instruction relative to the matching load or store instruction.
 Hand-coded by programmer or automated by compiler.
 Adding just a few prefetch directives to a program can substantially
improve performance.
 Prefetching is most often used within loops responsible for large array
calculations.
 Common in scientific codes
 Poor cache utilization
 Predictable array referencing patterns
 Fetch instructions can be placed inside loop bodies so that current
iteration prefetches data for a future iteration.
Example : Vector Product
 No prefetching
for (i = 0; i < N; i++)
{ sum += a[i]*b[i]; }
 Assume each cache block holds 4
elements .
 Code segment will cause a cache
miss every fourth iteration.
 Simple prefetching
for (i = 0; i < N; i++)
{
fetch (&a[i+1]);
fetch (&b[i+1]);
sum += a[i]*b[i];
}
 Problem-
 Unnecessary prefetch operations
Example (contd.)
 Prefetching + loop unrolling
for (i = 0; i < N; i+=4)
{
fetch (&a[i+4]);
fetch (&b[i+4]);
sum += a[i]*b[i];
sum += a[i+1]*b[i+1];
sum += a[i+2]*b[i+2];
sum += a[i+3]*b[i+3];
}
 Problem
 First and last iterations
fetch (&sum);
fetch (&a[0]);
fetch (&b[0]);
for (i = 0; i < N-4; i+=4)
{
fetch (&a[i+4]);
fetch (&b[i+4]);
sum += a[i]*b[i];
sum += a[i+1]*b[i+1];
sum += a[i+2]*b[i+2];
sum += a[i+3]*b[i+3];
}
for (i = N-4; i < N; i++)
sum = sum + a[i]*b[i];
Example (contd.)
 Previous assumption: prefetching 1
iteration ahead is sufficient to hide the
memory latency.
 When loops contain small
computational bodies, it may be
necessary to initiate prefetches d
iterations before the data is referenced.
d: prefetch distance
l: avg. memory latency
s: estimated cycle time of the shortest possible execution path through one loop iteration
fetch (&sum);
for (i = 0; i < 12; i += 4)
{
fetch (&a[i]);
fetch (&b[i]);
}
for (i = 0; i < N-12; i += 4)
{
fetch(&a[i+12]);
fetch(&b[i+12]);
sum = sum + a[i] *b[i];
sum = sum + a[i+1]*b[i+1];
sum = sum + a[i+2]*b[i+2];
sum = sum + a[i+3]*b[i+3];
}
for (i = N-12; i < N; i++)
sum = sum + a[i]*b[i];
d = ⌈ l / s ⌉
(With the loop above, d = 3 iterations, so prefetches are issued 12 elements ahead of their use.)
Limitations of Software-based Prefetching
 Normally restricted to loops with array accesses
 Hard for general applications with irregular access
patterns
 Processor execution overhead
 Significant code expansion
 Performed statically.
Hardware Data Prefetching
 Special Hardware required.
 No need for programmer or compiler intervention.
 No changes to existing executable.
 Take advantage of run-time information.
Sequential Prefetching
 By grouping consecutive memory words into single units, caches exploit the principle of spatial locality to
implicitly prefetch data that is likely to be referenced in the near future.
 Larger cache blocks suffer from
 cache pollution
 false sharing in multiprocessors.
 One block lookahead (OBL) approach
 Initiate a prefetch for block b+1 when block b is accessed.
 Prefetch-on-miss
 Whenever an access for block b results in a cache miss
 Tagged prefetch
 Associates a tag bit with every memory block.
 When a block is demand-fetched or a prefetched block is referenced for the first time, next block is
fetched.
 Used in HP PA7200
 OBL may not be initiated far enough in advance of the actual use to avoid a
processor memory stall.
 To solve this, increase the number of blocks prefetched after a demand fetch
from one to K, where K is known as the degree of prefetching.
 Aids the memory system in staying ahead of rapid processor requests.
 As each prefetched block, b, is accessed for the first time, the cache is
interrogated to check if blocks b+1, ... b+K are present in the cache and, if
not, the missing blocks are fetched from memory.
Three Forms of Sequential Prefetching:
a) Prefetch on miss, b) Tagged Prefetch and
c) Sequential Prefetching with K = 2.
 Shortcoming
 Prefetching K > 1 subsequent blocks causes additional traffic and cache pollution.
 Solution: adaptive sequential prefetching
 Vary the value of K during program execution.
 High spatial locality → large K value.
 A prefetch efficiency metric is periodically calculated:
 the ratio of useful prefetches to total prefetches.
 The value of K is initialized to one, incremented whenever the prefetch efficiency exceeds a predetermined upper threshold, and decremented whenever the efficiency drops below a lower threshold.
 If K reaches zero, prefetching is disabled and the prefetch hardware begins to monitor how often a cache miss to block b occurs while block b-1 is cached.
 Prefetching is restarted if the ratio of these two counts exceeds the lower threshold of the prefetch efficiency.
Sequential Adaptive Prefetching
Stride Prefetching
 Stride Prefetching monitors memory access patterns in the processor
to detect constant-stride array references originating from loop
structures.
 Accomplished by comparing successive addresses used by memory
instructions.
 This requires storing the previous address used by each memory instruction along with the last detected stride. A hardware table, the Reference Prediction Table (RPT), holds this information for the most recently used load instructions.
 Each RPT entry contains the PC address of the load instruction, the memory address it last accessed, a stride value for entries that have established a stride, and a state field used to control the actual prefetching.
 A typical RPT contains 64 entries, each 64 bits wide.
 Prefetch commands are issued only when a matching stride is detected.
 However, stride prefetching uses an associative hardware table that is accessed whenever a load instruction is detected.
Pointer Prefetching
 Effective for pointer intensive programs containing linked data
structures.
 No constant stride.
 Dependence based prefetching-
 Uses two hardware tables.
The correlation table (CT) stores the dependence correlation between a load
instruction that produces an address (the producer) and a subsequent load that
uses that address (the consumer).
The potential producer window (PPW) records the most recently loaded values
and the corresponding instructions. When a load commits, the
corresponding correlation is added to the CT.
Combined Stride and Pointer
Prefetching
 Objective: a single technique that works for all types of memory access patterns.
 Handles both array and pointer accesses.
 Better performance.
 Uses all three tables (RPT, PPW, CT).
Hardware vs. Software Approach
Hardware
 Performance cost: low
 Memory traffic: high
 History-directed, so it can be less effective
 Uses profiling information
Software
 Performance cost: high
 Memory traffic: low
 Better improvement
 Uses human knowledge; inserted by hand
Energy Aware Data Prefetching
 Energy and power efficiency have become key design objectives in microprocessors, in both the embedded and general-purpose domains.
 Aggressive prefetching techniques often improve performance in most applications, but they can increase memory system energy consumption by as much as 30%.
 Power-consumption sources
 Prefetching hardware
Prefetch history tables
Control logic
 Extra memory accesses
Unnecessary prefetching
[Figure: power density (W/cm²) of Intel processors, from the 4004 through the Pentium® family, plotted from 1970 to 2010 on a log scale from 1 to 10000 and compared with a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Intel]
Prefetching Hardware Required
Prefetching Energy Sources
 Prefetching hardware:
 Data (history table) and control logic.
 Extra tag-checks in L1 cache
 When a prefetch hits in L1 (no prefetch needed)
 Extra memory accesses to L2 Cache
 Due to useless prefetches from L2 to L1.
 Extra off-chip memory accesses
 When data cannot be found in the L2 Cache.
Energy-Aware Prefetching Architecture
[Figure: energy-aware prefetching architecture. Load instructions (LDQ: RA, RB, OFFSET, hints) feed a stride prefetcher and a pointer prefetcher; prefetch requests pass through a stride counter and a Prefetch Filtering Buffer (PFB) before reaching the L1 D-cache tag and data arrays, and surviving prefetches go to the L2 cache alongside regular cache accesses. The figure marks four filtering points: compiler-based selective filtering, compiler-assisted adaptive prefetching, prefetch filtering using the stride counter, and hardware filtering using the PFB.]
Energy-Aware Prefetching Techniques
 Compiler-Based Selective Filtering (CBSF)
 Only searching the prefetch hardware tables for selective memory instructions
identified by the compiler.
 Compiler-Assisted Adaptive Prefetching (CAAP)
 Selectively applying different prefetching schemes depending on predicted access
patterns.
 Compiler-driven Filtering using Stride Counter (SC)
 Reducing prefetching energy consumption wasted on memory access patterns
with very small strides.
 Hardware-based Filtering using PFB (PFB)
 Further reducing the L1 cache related energy overhead due to prefetching based
on locality with prefetching addresses.
Compiler-Based Selective Filtering
 Searches the prefetch hardware tables only for selected memory instructions identified by the compiler.
 Energy is reduced by:
considering only loop or recursive memory accesses
considering only array and linked-data-structure memory accesses
Compiler-Assisted Adaptive Prefetching
 Selects a prefetching scheme based on the predicted access pattern:
 Memory accesses to an array that does not belong to any larger structure are fed only into the stride prefetcher.
 Memory accesses to an array that belongs to a larger structure are fed into both the stride and pointer prefetchers.
 Memory accesses to a linked data structure containing no arrays are fed only into the pointer prefetcher.
 Memory accesses to a linked data structure that contains arrays are fed into both prefetchers.
Compiler-Hinted Filtering Using a Runtime SC
 Reduces prefetching energy wasted on memory access patterns with very small strides.
 Loads with small strides are filtered out; prefetching is worthwhile only when the stride is larger than half the cache line size, since smaller strides keep hitting the same line.
 Each stride-counter entry contains:
 the Program Counter (PC)
 a stride counter
 The counter records how many times the instruction occurs.
Hardware-based Filtering using PFB
 To reduce the number of L1 tag-checks due to prefetching, we add a
PFB to remember the most recently prefetched cache tags.
 We check the prefetching address against the PFB when a prefetching
request is issued by the prefetch engine.
 If the address is found in the PFB, the prefetching request is dropped
and we assume that the data is already in the L1 cache.
 When the data is not found in the PFB, we perform normal tag lookup
and proceed according to the lookup results.
 The LRU replacement algorithm is used when the PFB is full.
Power Dissipation of Hardware Tables
 The size of a typical history table is at least 64 × 64 bits, implemented as a fully associative CAM.
 Each prefetching access consumes more than 12 mW, which is higher than a low-power cache access.
 Low-power cache design techniques such as sub-banking do not work for such fully associative tables.
Conclusion
 Improves performance.
 Reduces the energy overhead of hardware data prefetching.
 Reduces total energy consumption.
 Uses compiler-assisted and hardware-based energy-aware techniques together with a new power-aware prefetch engine.
References
 Y. Guo, "Energy-Efficient Hardware Data Prefetching," IEEE, vol. 19, no. 2, Feb. 2011.
 A. J. Smith, "Sequential program prefetching in memory hierarchies," IEEE Computer, vol. 11, no. 12, pp. 7–21, Dec. 1978.
 A. Roth, A. Moshovos, and G. S. Sohi, "Dependence based prefetching for linked data structures," in Proc. ASPLOS-VIII, Oct. 1998, pp. 115–126.
Understanding Inductive Bias in Machine Learning
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
 

Energy-efficient data prefetching

  • 1. Made by- Himanshu Koli (2K10/CO/041) Hiren Madan (2K10/CO/042) Energy-Efficient Hardware Data Prefetching SEMINAR Delhi Technological University (COE-416)
  • 2. Contents  Introduction  What is Data Prefetching?  Prefetching Classification  How Prefetching works?  Software Prefetching  Limitations of Software based Prefetching  Hardware Prefetching  Hardware Vs. Software Approach  Energy Aware Data Prefetching  Energy Aware Prefetching Architecture  Energy Aware Prefetching Techniques  References
  • 3. Introduction Why need Data Prefetching?  Microprocessor performance has increased at a dramatic rate.  The expanding gap between microprocessor and DRAM performance has necessitated the use of aggressive techniques designed to reduce the large latency of memory accesses.  Cache memory hierarchies have managed to keep pace with processor memory request rates, but fast memory continues to be too expensive to use as a main store technology.  Large cache hierarchies have proven effective in reducing the average memory access penalty for programs that show a high degree of locality in their addressing patterns. But scientific and other data-intensive programs spend more than half their run times stalled on memory requests.
  • 5.  With an on-demand fetch policy, the first access to a cache block always results in a cache miss. Such cache misses are known as cold start or compulsory misses.  When we reference a large array, there is a high possibility that elements of the array will be overwritten (evicted from the cache).  If we then need a previous value of the array which has been overwritten, the processor must make a full main-memory access. This is called a capacity miss.
  • 6. What is Data Prefetching?  Data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference, rather than waiting for a cache miss to perform a memory fetch.  Prefetch proceeds in parallel with processor computation, allowing the memory system time to transfer the desired data from main memory to the cache.  Ideally, the prefetch completes just in time for the processor to access the needed data in the cache without stalling.
  • 7. Execution Diagram assuming- a) No Prefetching, b) Perfect Prefetching and c) Degraded Prefetching
  • 9. Prefetching Classification  Various prefetching techniques have been proposed- Instruction Prefetching vs. Data Prefetching Software-controlled prefetching vs. Hardware-controlled prefetching.  Data prefetching for different structures in general purpose programs: Prefetching for array structures. Prefetching for pointer and linked data structures.
  • 10. Software Data Prefetching  Explicit “fetch” instructions  Non-blocking memory operation.  Cannot cause exceptions (e.g. page faults).  Additional instructions executed.  Modest hardware complexity  Challenge -- prefetch scheduling  Placement of fetch instruction relative to the matching load and store instruction.  Hand-coded by programmer or automated by compiler.
  • 11.  Adding just a few prefetch directives to a program can substantially improve performance.  Prefetching is most often used within loops responsible for large array calculations.  Common in scientific codes  Poor cache utilization  Predictable array referencing patterns  Fetch instructions can be placed inside loop bodies so that current iteration prefetches data for a future iteration.
  • 12. Example: Vector Product
 No prefetching
for (i = 0; i < N; i++) {
  sum += a[i]*b[i];
}
 Assume each cache block holds 4 elements.
 Code segment will cause a cache miss every fourth iteration.
 Simple prefetching
for (i = 0; i < N; i++) {
  fetch (&a[i+1]);
  fetch (&b[i+1]);
  sum += a[i]*b[i];
}
 Problem - unnecessary prefetch operations
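In compilable C, the slide's `fetch()` pseudo-instruction can be approximated with the `__builtin_prefetch` hint. This is a sketch assuming a GCC- or Clang-compatible compiler; the hint never changes the result, it only warms the cache:

```c
#include <stddef.h>

/* Vector product with software prefetching one iteration ahead.
 * __builtin_prefetch stands in for the slide's fetch() pseudo-instruction. */
double dot_prefetch(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {                   /* do not prefetch past the arrays */
            __builtin_prefetch(&a[i + 1]); /* hint: a[i+1] will be read soon  */
            __builtin_prefetch(&b[i + 1]);
        }
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because prefetch is only a hint, correctness is identical to the plain loop; only the miss pattern differs.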
  • 13. Example (contd.)
 Prefetching + loop unrolling
for (i = 0; i < N; i+=4) {
  fetch (&a[i+4]);
  fetch (&b[i+4]);
  sum += a[i]*b[i];
  sum += a[i+1]*b[i+1];
  sum += a[i+2]*b[i+2];
  sum += a[i+3]*b[i+3];
}
 Problem - first and last iterations
fetch (&sum);
fetch (&a[0]);
fetch (&b[0]);
for (i = 0; i < N-4; i+=4) {
  fetch (&a[i+4]);
  fetch (&b[i+4]);
  sum += a[i]*b[i];
  sum += a[i+1]*b[i+1];
  sum += a[i+2]*b[i+2];
  sum += a[i+3]*b[i+3];
}
for (i = N-4; i < N; i++)
  sum = sum + a[i]*b[i];
  • 14. Example (contd.)
 Previous assumption: prefetching 1 iteration ahead is sufficient to hide the memory latency.
 When loops contain small computational bodies, it may be necessary to initiate prefetches d iterations before the data is referenced, where
  d = ⌈l / s⌉
  d: prefetch distance
  l: avg. memory latency
  s: estimated cycle time of the shortest possible execution path through one loop iteration
fetch (&sum);
for (i = 0; i < 12; i += 4) {
  fetch (&a[i]);
  fetch (&b[i]);
}
for (i = 0; i < N-12; i += 4) {
  fetch(&a[i+12]);
  fetch(&b[i+12]);
  sum = sum + a[i] *b[i];
  sum = sum + a[i+1]*b[i+1];
  sum = sum + a[i+2]*b[i+2];
  sum = sum + a[i+3]*b[i+3];
}
for (i = N-12; i < N; i++)
  sum = sum + a[i]*b[i];
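The prefetch distance formula d = ⌈l / s⌉ can be evaluated with integer ceiling division; the function name here is illustrative:

```c
/* Prefetch distance d = ceil(l / s):
 * latency          l = average memory latency in cycles,
 * cycles_per_iter  s = cycles of the shortest path through one iteration.
 * (l + s - 1) / s computes the ceiling without floating point. */
unsigned prefetch_distance(unsigned latency, unsigned cycles_per_iter) {
    return (latency + cycles_per_iter - 1) / cycles_per_iter;
}
```

With a 100-cycle latency and a 9-cycle loop body this gives d = 12, matching the 12-iteration lookahead used in the code above.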
  • 15. Limitation of Software-based Prefetching  Normally restricted to loops with array accesses  Hard for general applications with irregular access patterns  Processor execution overhead  Significant code expansion  Performed statically.
  • 16. Hardware Data Prefetching  Special Hardware required.  No need for programmer or compiler intervention.  No changes to existing executable.  Take advantage of run-time information.
  • 17. Sequential Prefetching  By grouping consecutive memory words into single units, caches exploit the principle of spatial locality to implicitly prefetch data that is likely to be referenced in the near future.  Larger cache blocks suffer from  cache pollution  false sharing in multiprocessors.  One block lookahead (OBL) approach  Initiate a prefetch for block b+1 when block b is accessed.  Prefetch-on-miss  Whenever an access for block b results in a cache miss  Tagged prefetch  Associates a tag bit with every memory block.  When a block is demand-fetched or a prefetched block is referenced for the first time, next block is fetched.  Used in HP PA7200
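A toy model can make the tagged-prefetch bookkeeping concrete: each block carries a tag bit that is set when the block is prefetched and cleared on its first demand use. This is a sketch under simplifying assumptions (a flat presence bitmap, no evictions); the type and function names are invented for illustration:

```c
#include <stdbool.h>

#define NBLOCKS 64

/* Tagged one-block-lookahead (OBL) model. */
typedef struct {
    bool present[NBLOCKS];
    bool tag[NBLOCKS];   /* true = brought in by prefetch, not yet referenced */
} TaggedCache;

static void prefetch_block(TaggedCache *c, int b) {
    if (b < NBLOCKS && !c->present[b]) {
        c->present[b] = true;
        c->tag[b] = true;
    }
}

/* Access block b; returns 1 if a prefetch of b+1 was triggered, else 0. */
int access_block(TaggedCache *c, int b) {
    if (!c->present[b]) {        /* demand miss: fetch b, prefetch b+1 */
        c->present[b] = true;
        c->tag[b] = false;
        prefetch_block(c, b + 1);
        return 1;
    }
    if (c->tag[b]) {             /* first reference to a prefetched block */
        c->tag[b] = false;
        prefetch_block(c, b + 1);
        return 1;
    }
    return 0;                    /* ordinary hit: no prefetch */
}
```

A sequential scan thus keeps exactly one block ahead of the demand stream, which is the behavior credited to the HP PA7200 scheme.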
  • 18.  OBL may not be initiated far enough in advance of the actual use to avoid a processor memory stall.  To solve this, increase the number of blocks prefetched after a demand fetch from one to K, where K is known as the degree of prefetching.  Aids the memory system in staying ahead of rapid processor requests.  As each prefetched block, b, is accessed for the first time, the cache is interrogated to check if blocks b+1, ... b+K are present in the cache and, if not, the missing blocks are fetched from memory.
  • 19. Three Forms of Sequential Prefetching: a) Prefetch on miss, b) Tagged Prefetch and c) Sequential Prefetching with K = 2.
  • 20.  Shortcoming  Prefetching K > 1 subsequent blocks causes additional traffic and cache pollution.  Solution: adaptive sequential prefetching  Vary the value of K during program execution  High spatial locality  large K value  Prefetch efficiency metric periodically calculated  Ratio of useful prefetches to total prefetches
  • 21.  The value of K is initialized to one, incremented whenever the prefetch efficiency exceeds a predetermined upper threshold, and decremented whenever the efficiency drops below a lower threshold.  If K is reduced to zero, prefetching is disabled and the prefetch hardware begins to monitor how often a cache miss to block b occurs while block b-1 is cached.  Prefetching is restarted if the ratio of these two numbers exceeds the lower threshold of the prefetch efficiency.
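The K-adjustment rule above reduces to a small function. This is a minimal sketch; the threshold values passed in are illustrative, not the ones from the original proposal:

```c
/* Adaptive sequential prefetching: adjust the degree K from the measured
 * prefetch efficiency (useful prefetches / total prefetches).
 *   efficiency > hi  -> increment K
 *   efficiency < lo  -> decrement K (floor at 0; K == 0 disables prefetching)
 *   otherwise        -> leave K unchanged */
int adapt_degree(int k, double efficiency, double lo, double hi) {
    if (efficiency > hi)
        return k + 1;
    if (efficiency < lo && k > 0)
        return k - 1;
    return k;
}
```

Calling this once per sampling period gives the described behavior: K grows while spatial locality is high and decays to zero when most prefetches are wasted.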
  • 23. Stride Prefetching  Stride prefetching monitors memory access patterns in the processor to detect constant-stride array references originating from loop structures.  Accomplished by comparing successive addresses used by memory instructions.  This requires the previous address used by a memory instruction to be stored along with the last detected stride; a hardware table called the Reference Prediction Table (RPT) is added to hold this information for the most recently used load instructions.
  • 24.  Each RPT entry contains the PC address of the load instruction, the memory address previously accessed by the instruction, a stride value for those entries that have established a stride, and a state field used to control the actual prefetching.  Contains 64 entries; each entry of 64 bits.  Prefetch commands are issued only when a matching stride is detected  However, stride prefetching uses an associative hardware table which is accessed whenever a load instruction is detected.
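The per-entry update can be sketched as follows. The real RPT drives a small state machine from the state field; this sketch collapses it to a single "stride confirmed" flag, and issues a prefetch of addr+stride only once the same stride has been observed twice in a row:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified Reference Prediction Table entry (one per load PC). */
typedef struct {
    uintptr_t last_addr;  /* address used by the previous execution   */
    intptr_t  stride;     /* last detected stride                     */
    bool      confirmed;  /* stand-in for the RPT's state field       */
} RPTEntry;

/* Update the entry for a new execution of its load. Returns true and
 * writes the predicted address when a matching stride is detected. */
bool rpt_update(RPTEntry *e, uintptr_t addr, uintptr_t *predicted) {
    intptr_t stride = (intptr_t)(addr - e->last_addr);
    bool match = (stride == e->stride);
    e->confirmed = match;
    e->stride = stride;
    e->last_addr = addr;
    if (match && stride != 0) {
        *predicted = addr + (uintptr_t)stride;
        return true;
    }
    return false;
}
```

For a load walking an array of 8-byte elements, the third execution confirms the stride and every execution after it prefetches one element ahead.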
  • 25. Pointer Prefetching  Effective for pointer-intensive programs containing linked data structures.  No constant stride.  Dependence-based prefetching-  Uses two hardware tables.  The correlation table (CT) stores the dependence correlation between a load instruction that produces an address (producer) and a subsequent load that uses that address (consumer).  The potential producer window (PPW) records the most recently loaded values and the corresponding instructions. When a load commits, the corresponding correlation is added to the CT.
  • 26. Combined Stride and Pointer Prefetching  Objective to evaluate a technique that would work for all types of memory access patterns.  Use both array and pointer  Better performance  All three tables (RPT, PPW, CT)
  • 27. Hardware vs. Software Approach
Hardware:  Performance cost: low  Memory traffic: high  History-directed, so it could be less effective  Can use profiling info.
Software:  Performance cost: high  Memory traffic: low  Better improvement  Uses human knowledge; prefetches inserted by hand.
  • 28. Energy Aware Data Prefetching  Energy and power efficiency have become key design objectives in microprocessors, in both embedded and general-purpose domains.  Aggressive prefetching techniques often improve performance in most applications, but they can increase memory-system energy consumption by as much as 30%.  Power-consumption sources  Prefetching hardware: prefetch history tables and control logic  Extra memory accesses: unnecessary prefetching
  • 29. Figure: power density (W/cm2, log scale from 1 to 10000) of Intel processors from the 4004 (1970) through the Pentium® family, approaching that of a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Intel.
  • 31. Prefetching Energy Sources  Prefetching hardware:  Data (history table) and control logic.  Extra tag-checks in L1 cache  When a prefetch hits in L1 (no prefetch needed)  Extra memory accesses to L2 Cache  Due to useless prefetches from L2 to L1.  Extra off-chip memory accesses  When data cannot be found in the L2 Cache.
  • 32. Energy-Aware Prefetching Architecture Figure: the L1 D-cache (tag array and data array) is fed by a stride prefetcher and a pointer prefetcher. Load-queue (LDQ) entries carry compiler hints into the prefetchers; prefetches are filtered at four points before reaching the L2 cache: Compiler-Based Selective Filtering, Compiler-Assisted Adaptive Prefetching, Prefetch Filtering using a Stride Counter, and Hardware Filtering using a Prefetching Filtering Buffer (PFB).
  • 33. Energy-Aware Prefetching Techniques  Compiler-Based Selective Filtering (CBSF)  Only searching the prefetch hardware tables for selective memory instructions identified by the compiler.  Compiler-Assisted Adaptive Prefetching (CAAP)  Selectively applying different prefetching schemes depending on predicted access patterns.  Compiler-driven Filtering using a Stride Counter (SC)  Reducing prefetching energy consumption wasted on memory access patterns with very small strides.  Hardware-based Filtering using the PFB  Further reducing the L1 cache-related energy overhead due to prefetching, based on locality of prefetching addresses.
  • 34. Compiler-based Selective Filtering  Only searching the prefetch hardware tables for selective memory instructions identified by the compiler.  Energy reduced by- Using loop or recursive type memory access Use only array and linked data structure memory access
  • 35. Compiler-Assisted Adaptive Prefetching  Selects different prefetching schemes based on the access pattern:  Memory accesses to an array that does not belong to any larger structure are fed only into the stride prefetcher.  Memory accesses to an array that belongs to a larger structure are fed into both stride and pointer prefetchers.  Memory accesses to a linked data structure with no arrays are fed only into the pointer prefetcher.  Memory accesses to a linked data structure that contains arrays are fed into both prefetchers.
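The four routing rules above amount to two boolean formulas. A minimal sketch, with illustrative field names standing in for the compiler's hints:

```c
#include <stdbool.h>

/* Compiler hints about one memory access (names are illustrative). */
typedef struct {
    bool is_array;          /* access is to an array                         */
    bool array_in_struct;   /* ...that is part of a larger structure         */
    bool is_linked;         /* access walks a linked data structure          */
    bool linked_has_arrays; /* ...whose nodes contain arrays                 */
} AccessHint;

/* CAAP routing: decide which prefetcher(s) see this access. */
void caap_route(AccessHint h, bool *to_stride, bool *to_pointer) {
    *to_stride  = h.is_array || (h.is_linked && h.linked_has_arrays);
    *to_pointer = h.is_linked || (h.is_array && h.array_in_struct);
}
```

Standalone arrays reach only the stride prefetcher, plain linked structures only the pointer prefetcher, and the two mixed cases reach both, exactly as the four bullets state.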
  • 36. Compiler-Hinted Filtering Using a Runtime SC  Reduces prefetching energy wasted on memory access patterns with very small strides.  Prefetches are issued only when the stride is larger than half the cache line size.  Each entry contains  Program Counter (PC)  Stride counter  The counter tracks how many times the instruction occurs.
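The filtering predicate itself is tiny: strides smaller than half a cache line mostly touch lines that are already (pre)fetched, so those prefetches are dropped. A minimal sketch, with an illustrative function name and byte-based units:

```c
#include <stdbool.h>

/* Stride-counter filter: issue a prefetch only when the (absolute) stride
 * exceeds half the cache line size. line_bytes is the L1 line size. */
bool stride_worth_prefetching(long stride_bytes, long line_bytes) {
    long s = stride_bytes < 0 ? -stride_bytes : stride_bytes;
    return s > line_bytes / 2;
}
```

With 64-byte lines, an 8-byte stride (consecutive doubles) is filtered out, while a 40-byte stride passes.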
  • 37. Hardware-based Filtering using PFB  To reduce the number of L1 tag-checks due to prefetching, we add a PFB to remember the most recently prefetched cache tags.  We check the prefetching address against the PFB when a prefetching request is issued by the prefetch engine.  If the address is found in the PFB, the prefetching request is dropped and we assume that the data is already in the L1 cache.  When the data is not found in the PFB, we perform normal tag lookup and proceed according to the lookup results.  The LRU replacement algorithm is used when the PFB is full.
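The PFB check-and-insert can be sketched as a small LRU array. This is a toy model under simplifying assumptions (tags kept in MRU-first order, shifting on every access); the capacity and names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PFB_SIZE 8   /* illustrative capacity */

/* Prefetching Filtering Buffer: remembers recently prefetched cache tags.
 * tags[0] is most recently used; the last slot is the LRU victim. */
typedef struct {
    uintptr_t tags[PFB_SIZE];
    int count;
} PFB;

/* Check-and-insert. Returns true if the tag was already in the PFB, meaning
 * the prefetch is dropped and the L1 tag-check is skipped; on a miss the tag
 * is inserted (evicting the LRU entry when full) and false is returned. */
bool pfb_filter(PFB *p, uintptr_t tag) {
    for (int i = 0; i < p->count; i++) {
        if (p->tags[i] == tag) {             /* hit: move tag to MRU slot */
            memmove(&p->tags[1], &p->tags[0], (size_t)i * sizeof tag);
            p->tags[0] = tag;
            return true;                     /* drop this prefetch */
        }
    }
    int n = p->count < PFB_SIZE ? p->count : PFB_SIZE - 1;
    memmove(&p->tags[1], &p->tags[0], (size_t)n * sizeof tag); /* evict LRU */
    p->tags[0] = tag;
    if (p->count < PFB_SIZE) p->count++;
    return false;                            /* proceed with normal lookup */
}
```

A repeated prefetch to the same line is filtered on every access after the first, which is where the L1 tag-check energy saving comes from.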
  • 38. Power Dissipation of Hardware Tables  The size of a typical history table is at least 64×64 bits, implemented as a fully-associative CAM table.  Each prefetching access consumes more than 12 mW, which is higher than our low-power cache access.  Low-power cache design techniques such as sub-banking do not help here.
  • 39. Conclusion  Improve the performance.  Reduce the energy overhead of hardware data prefetching.  Reduce total energy consumption.  Compiler-assisted and hardware-based energy-aware techniques and a new power-aware prefetch engine techniques are used.
  • 40. References  Y. Guo et al., "Energy-Efficient Hardware Data Prefetching," IEEE Trans. VLSI Systems, vol. 19, no. 2, Feb. 2011.  A. J. Smith, "Sequential program prefetching in memory hierarchies," IEEE Computer, vol. 11, no. 12, pp. 7–21, Dec. 1978.  A. Roth, A. Moshovos, and G. S. Sohi, "Dependence based prefetching for linked data structures," in Proc. ASPLOS-VIII, Oct. 1998, pp. 115–126.