Ketan N. Kulkarni & M. V. Rajesh
ECEN 676 – Advanced Computer Architecture
1st May 2009
Reducing Memory Stalls by Hardware-Based Data Prefetching Schemes
Agenda
- Introduction
- Background work
- Hardware based Data Cache Prefetching Algorithms
  1. Fixed Offset Prefetching
  2. Stride Based Prefetching
  3. Tag Correlated Prefetching
- Simulation Setup
- Results
- Conclusions
Introduction
- What is prefetching?
  - Filling the cache with relevant data before the program needs it.
- Why is prefetching needed?
  - The performance gap between microprocessors and DRAM keeps widening.
  - Exponential increase in the data access penalty.
- When to prefetch?
  - Whenever the bus is idle. (A perfect prefetching scheme is one that completely hides the memory latency.)
- Advantages:
  - Increases the L1 hit rate, reduces CPU stalls, and reduces the average memory access time (AMAT).
  - Program semantics remain unchanged.
- Caveats:
  - Prefetching too far in advance may lead to cache pollution.
  - Incorrect prefetches bring in data that is never used.
How Prefetching Works
[Figure: two execution timelines. Without prefetching, Load A and Load B each miss in L1 and the CPU is stalled while Fetch A and Fetch B complete. With prefetching, Prefetch A and Prefetch B (from L2 to L1) are issued ahead of the loads, so Load A and Load B hit in L1 and the possible misses are avoided. Legend: CPU stalled, CPU executing.]
Background Work
[Figure: map of prior data prefetching schemes. Group labels in the figure: Spatial, Software, Sequence-Base (Order Sensitive), Adaptive, Hybrid, Commercial Processors.]
Schemes shown:
- Feedback based [Honio2009]
- Stride Prefetch [Fu1992]
- Markov Prefetch [Joseph1997]
- GHB [Nesbit2004]
- Hybrid [Hsu1998]
- Software Support [Mowry1992]
- AC/DC [Nesbit2004]
- Adaptive Stream [Hur2006]
- FDP [Srinath2007]
- Tag Correlation [Hu2003]
- SMS [Somogyi2006]
- Sequential [Smith1978]
- RPT [Chen1995]
- Locality Detect [Johnson1998]
- Spatial Pat. [Chen2004]
- Buffer Block [Gindele1977]
- Adaptive Seq. [Dahlgren1993]
- AMPM Prefetch [Ishii2009]
- HW/SW Integrate [Gornish1994]
- Fixed offset
Commercial processors: SuperSPARC, R10000, PA7200, Power4, Pentium 4
Hardware based Prefetching
- Advantages:
  - Dynamic pattern matching.
  - No compiler support or ISA modification needed.
  - Takes advantage of regular, repeatable program behavior.
- Caveats:
  - Increased hardware complexity.
  - High-level program flow information is not available.
Fixed Offset Prefetching
- On a cache miss, retrieve the next block of memory.
- Sequential prefetching (spatial locality).
[Figure: the miss address (Tag | Index | Offset) plus a constant gives the prefetch address (Tag | Index | Offset).]
- Advantages:
  - Very simple scheme.
  - Less hardware.
- Disadvantages:
  - Relies solely on spatial locality to work.
  - Can't detect patterns.
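The address calculation on this slide can be sketched in a few lines of C. This is only an illustration, not the authors' simulator code: the function name is hypothetical, and the 64-byte line size is an assumption borrowed from the L1 configuration listed under Simulation Environment.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed line size (64 bytes, as in the simulated L1 cache). */
#define LINE_SIZE 64u

/* Hypothetical helper: address of the block to prefetch after a miss.
 * offset_blocks is the fixed offset in cache blocks (1 = next block). */
uint64_t fixed_offset_prefetch_addr(uint64_t miss_addr, unsigned offset_blocks)
{
    assert(offset_blocks > 0);
    /* Align the miss address down to the start of its cache block. */
    uint64_t block_base = miss_addr & ~(uint64_t)(LINE_SIZE - 1);
    /* Add the constant offset, expressed in whole blocks. */
    return block_base + (uint64_t)offset_blocks * LINE_SIZE;
}
```

With an offset of one block, a miss at 0x1004 would prefetch the block starting at 0x1040. A larger offset prefetches further ahead, at a higher risk of cache pollution.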
Stride Based Prefetching [Chen1995]
- Exploit stride patterns in data addresses.
- Prefetch the data at the predicted address before it is accessed.
- Store state and stride data in a Reference Prediction Table (RPT), and keep it updated.
- Make state transitions based on correct or incorrect predictions.
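RPT - Structure
[Figure: the Program Counter is used to look up an RPT entry. Each entry holds the instruction address, the (previous) data address, the stride, and the state. The effective address and the previous data address are subtracted when the entry is updated, and the stride is added to form the prefetching address.]
State Transition
[Figure: state diagram with four states: INIT, TRANS, NO-PRED, and STEADY. Transitions between the states are labeled Correct or Incorrect, according to whether the last prediction was right.]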
Stride based Prefetching (cont.)
- Advantages:
  - Detects uniform strides (e.g. loops).
  - Accurate prediction for many cases.
- Disadvantages:
  - Not much improvement with non-uniform strides.
  - Hardware overhead.
  - Cannot correlate strides of one instruction with those of others.
RPT - Example
- Load instructions at addresses 500, 504, and 512.
- Base addresses of matrices A, B, and C at locations 10,000, 50,000, and 90,000 respectively.
Matrix multiplication:
    int A[100][100], B[100][100], C[100][100];
    int i, j, k;
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            for (k = 0; k < 100; k++) {
                A[i][j] += B[i][k] * C[k][j];
            }
        }
    }
Assembly code (inner loop):
    Address  Instruction          Comment
    500      lw   r4, 0(r2)       load B[i][k]
    504      lw   r5, 0(r3)       load C[k][j]
    508      mul  r6, r5, r4      B[i][k] * C[k][j]
    512      lw   r7, 0(r1)       load A[i][j]
    516      addu r7, r7, r6      +=
    520      sw   r7, 0(r1)       store A[i][j]
    524      addu r2, r2, 4       ref B[i][k]
    528      addu r3, r3, 400     ref C[k][j]
    532      addu r11, r11, 1     increase k
    536      bne  r11, r13, 500   loop
RPT - Example (contd.)
Initial state:
    Instruction address   (Previous) data address   Stride   State
    (empty)
After iteration 1:
    Instruction address   (Previous) data address   Stride   State
    500                   50,000                    0        INIT
    504                   90,000                    0        INIT
    512                   10,000                    0        INIT
After iteration 2:
    Instruction address   (Previous) data address   Stride   State
    500                   50,004                    4        TRANS
    504                   90,400                    400      TRANS
    512                   10,000                    0        STEADY
After iteration 3:
    Instruction address   (Previous) data address   Stride   State
    500                   50,008                    4        STEADY
    504                   90,800                    400      STEADY
    512                   10,000                    0        STEADY
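The RPT update in this example can be traced with a small C sketch. It assumes the four-state transition rules described in [Chen1995] (a steady entry keeps its stride on a misprediction, the other states recompute it). The table size, the direct-mapped indexing, and the rule of issuing a prefetch in every state except NO-PRED when the stride is non-zero are simplifying assumptions made for illustration, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { RPT_INIT, RPT_TRANS, RPT_STEADY, RPT_NOPRED } rpt_state;

typedef struct {
    uint32_t  pc;         /* instruction address (tag) */
    uint32_t  prev_addr;  /* previous data address */
    int32_t   stride;
    rpt_state state;
    int       valid;
} rpt_entry;

#define RPT_ENTRIES 64    /* hypothetical table size */
static rpt_entry rpt[RPT_ENTRIES];

static rpt_entry *rpt_slot(uint32_t pc) { return &rpt[(pc >> 2) % RPT_ENTRIES]; }

/* Returns the entry for pc, or NULL if the table has none. */
const rpt_entry *rpt_lookup(uint32_t pc)
{
    const rpt_entry *e = rpt_slot(pc);
    return (e->valid && e->pc == pc) ? e : NULL;
}

/* Record one load. Returns 1 and sets *prefetch_addr if a prefetch is issued. */
int rpt_access(uint32_t pc, uint32_t addr, uint32_t *prefetch_addr)
{
    assert(prefetch_addr != NULL);
    rpt_entry *e = rpt_slot(pc);
    if (!e->valid || e->pc != pc) {          /* first reference: allocate in INIT */
        e->pc = pc; e->prev_addr = addr; e->stride = 0;
        e->state = RPT_INIT; e->valid = 1;
        return 0;
    }
    int correct = (addr == e->prev_addr + (uint32_t)e->stride);
    int32_t new_stride = (int32_t)(addr - e->prev_addr);
    switch (e->state) {
    case RPT_INIT:
        if (correct) e->state = RPT_STEADY;
        else { e->state = RPT_TRANS; e->stride = new_stride; }
        break;
    case RPT_TRANS:
        if (correct) e->state = RPT_STEADY;
        else { e->state = RPT_NOPRED; e->stride = new_stride; }
        break;
    case RPT_STEADY:
        if (!correct) e->state = RPT_INIT;   /* stride is kept */
        break;
    case RPT_NOPRED:
        if (correct) e->state = RPT_TRANS;
        else e->stride = new_stride;
        break;
    }
    e->prev_addr = addr;
    if (e->state != RPT_NOPRED && e->stride != 0) {
        *prefetch_addr = addr + (uint32_t)e->stride;
        return 1;
    }
    return 0;
}
```

Feeding the loads at 500, 504, and 512 through `rpt_access` for three iterations of the inner loop should reproduce the INIT, TRANS, and STEADY rows of the tables above. The load at 512 reaches STEADY one iteration earlier because its zero stride is predicted correctly on the second access.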
Tag Correlated Prefetching [Hu2003]
- L1 cache tags exhibit strong regularity.
- Similar to the two-level branch prediction technique.
- Local and global history.
- A correlating prefetcher that works with tags.
TCP Structure
[Figure: the miss address is split into Tag, Index, and Offset. The miss index selects a row of the Tag History Table (THT), which holds tags TAG1, TAG2, ..., TAGK. An index function over the THT contents and the miss tag selects an entry of the Pattern History Table (PHT), which holds a pair (TAG, TAG'). The figure marks a Lookup path, which reads TAG', and an Update path, which writes the miss tag into the tables.]
Modified TCP
[Figure: same THT/PHT organization, with fields labeled SUM, TAG, and TAG'. The index function takes the miss tag and TAGK as inputs.]
TCP (cont.)
- Advantages:
  - Captures global and local history.
  - Recognizes recurring patterns.
- Disadvantages:
  - More hardware.
  - Vulnerable to noise.
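The THT/PHT interplay can be illustrated with a deliberately simplified C sketch. It is not the design from [Hu2003] or the authors' simulator: the history depth K, the table sizes, the hash, and the choice to store the full history in each PHT entry to reject aliases are all assumptions made to keep the example small, and every name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define K        2      /* tags of history kept per THT row (assumed) */
#define THT_ROWS 256    /* one row per cache set (assumed) */
#define PHT_SIZE 1024   /* hypothetical PHT size */

typedef struct { uint32_t hist[K]; } tht_row;
typedef struct { uint32_t hist[K]; uint32_t next_tag; int valid; } pht_entry;

static tht_row   tht[THT_ROWS];
static pht_entry pht[PHT_SIZE];

/* Hypothetical index function: hash the tag history into a PHT slot. */
static unsigned pht_index(const uint32_t hist[K])
{
    uint32_t h = 0;
    for (int i = 0; i < K; i++) h = h * 31u + hist[i];
    return h % PHT_SIZE;
}

/* Called on an L1 miss. Returns 1 and writes the predicted next tag for
 * this set if the PHT already holds the current tag pattern. */
int tcp_miss(uint32_t miss_tag, uint32_t miss_index, uint32_t *predicted_tag)
{
    assert(miss_index < THT_ROWS && predicted_tag != NULL);
    tht_row *row = &tht[miss_index];

    /* Update: the old history was followed by miss_tag. */
    pht_entry *e = &pht[pht_index(row->hist)];
    memcpy(e->hist, row->hist, sizeof e->hist);
    e->next_tag = miss_tag;
    e->valid = 1;

    /* Shift the miss tag into this set's history. */
    for (int i = 0; i < K - 1; i++) row->hist[i] = row->hist[i + 1];
    row->hist[K - 1] = miss_tag;

    /* Lookup: has the new history been seen before? */
    e = &pht[pht_index(row->hist)];
    if (e->valid && memcmp(e->hist, row->hist, sizeof e->hist) == 0) {
        *predicted_tag = e->next_tag;
        return 1;
    }
    return 0;
}
```

If the same set misses on tags A, B, C and later on A, B again, the second occurrence of the pair (A, B) finds the stored pattern and predicts C. The prefetch address would then be formed from the predicted tag and the miss index. Because the PHT is shared by all sets, a pattern learned in one set can also serve another, which is one way to read the "global history" bullet.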
Simulation Environment
- Implementation: C++, Perl, Pin Tool [Reddi2004]
- Trace Driven Simulation
[Figure: a trace file drives the CPU, which is connected through address and data paths to the L1 data cache, the L2 data cache, and main memory. A prefetcher receives the address and the hit signal and calculates the next prefetch address.]
Cache configuration:
    Parameter            L1 cache            L2 cache
    Size                 32 KB               256 KB
    Associativity        2-way               8-way
    Line size            64 bytes            128 bytes
    Write policy         write-through       write-back
    Allocation policy    no-write-allocate   write-allocate
    Access time          1 cycle (hit)       20 cycles
    Replacement policy   LRU                 LRU
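The tag, index, and offset fields used by the prefetchers follow directly from these cache parameters. The sketch below shows the arithmetic for any size, associativity, and line size; the struct and function names are hypothetical and the code is not taken from the authors' simulator.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t tag, index, offset; } cache_addr;

/* Split an address into tag / set index / block offset for a
 * set-associative cache of the given geometry. */
cache_addr split_address(uint32_t addr, uint32_t cache_bytes,
                         uint32_t ways, uint32_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    assert(cache_bytes % (ways * line_bytes) == 0);
    uint32_t sets = cache_bytes / (ways * line_bytes);
    assert(sets > 0);

    cache_addr a;
    a.offset = addr % line_bytes;            /* byte within the line */
    a.index  = (addr / line_bytes) % sets;   /* which set */
    a.tag    = (addr / line_bytes) / sets;   /* remaining upper bits */
    return a;
}
```

For the L1 cache above (32 KB, 2-way, 64-byte lines) there are 256 sets, so address 50,000 splits into offset 16, index 13, and tag 3. The L2 cache (256 KB, 8-way, 128-byte lines) also has 256 sets, but the wider line gives the same address offset 80, index 134, and tag 1.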
Benchmarks
[Figure: Instruction Mix. Percentage (0% to 100%) of Loads, Stores, and Non-mem ops for each benchmark: grep, g++, ls, plamap, testgen, matrix.]
    Benchmark   Description
    grep        Unix utility to search for a pattern in an input file.
    g++         GNU C++ compiler for Unix.
    testgen     Program for creating test patterns for scan chains in DFT.
    plamap      A mapping algorithm for CPLD architecture.
    ls          Unix utility to list information about files in a directory.
    matrix      100x100 matrix multiplication.
Simulation Results - I
[Figure: CPI for each benchmark (grep, g++, ls, plamap, testgen, matrix) under No Prefetching, Fixed Offset Prefetching, Stride Based Prefetching, and Tag Correlating Prefetching. Y axis: CPI, 0.00 to 6.00.]
Simulation Results - II
[Figure: L1 Cache Hit Rate for each benchmark (grep, g++, ls, plamap, testgen, matrix) under No Prefetching, Fixed Offset Prefetching, Stride Based Prefetching, and Tag Correlating Prefetching. Y axis: Hit Rate (%), 90 to 100.]
Simulation Results - III
[Figure: Average Memory Access Time for each benchmark (grep, g++, ls, plamap, testgen, matrix) under No Prefetching, Fixed Offset Prefetching, Stride Based Prefetching, and Tag Correlating Prefetching. Y axis: AMAT (#cycles), 0 to 2.5.]
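For reference, AMAT in a two-level hierarchy is commonly computed as the L1 hit time plus the L1 miss rate times the cost of going to L2 and, on an L2 miss, to memory. The C sketch below encodes that textbook formula. Whether the simulator computed AMAT exactly this way is an assumption, and the slides give no main memory penalty, so any value passed for it is illustrative only.

```c
#include <assert.h>

/* Textbook two-level AMAT in cycles:
 *   AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * t_mem)
 * Miss rates are fractions in [0, 1]. */
double amat_cycles(double l1_hit_time, double l1_miss_rate,
                   double l2_access_time, double l2_miss_rate,
                   double mem_penalty)
{
    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_miss_rate >= 0.0 && l2_miss_rate <= 1.0);
    return l1_hit_time
         + l1_miss_rate * (l2_access_time + l2_miss_rate * mem_penalty);
}
```

With the simulated latencies (1 cycle for an L1 hit, 20 cycles for L2) and an L1 hit rate of 95%, an L2 that never misses gives 1 + 0.05 * 20 = 2 cycles. Raising the L1 hit rate is the lever the prefetchers pull: each point of hit rate removes 0.2 cycles from AMAT in this simplified setting.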
Simulation Results - IV
[Figure, panel 1: Effect of changing offset on L1 Hit Rate. Benchmark: grep. Lookahead sizes: 64, 2*64, 8*64. Y axis: L1 Hit Rate (%), 94.50 to 98.50.]
[Figure, panel 2: Effect of RPT size on L1 Hit Rate. Benchmark: g++. RPT sizes: 8, 64, 256, 2048. Y axis: L1 Hit Rate (%), 95.50 to 98.50.]
[Figure, panel 3: Effect of THT/PHT parameters on L1 Hit Rate. Benchmark: testgen. Two groups of bars, one for increasing m,n and one for increasing k. Configurations: m=8,n=8,k=4; m=4,n=4,k=4; m=8,n=8,k=2; m=8,n=8,k=4; m=8,n=8,k=8; m=8,n=8,k=64. Y axis: L1 Hit Rate (%), 96.95 to 97.45.]
Simulation Results - V
    Prefetching algorithm   Hardware                    CPI (% improvement)   Hit rate (% increase)   AMAT (% decrease)
    Fixed Offset            Area of adder, registers    10.28                 1.41                    9.62
    Stride Based            26.75KB (RPT)               16.56                 1.75                    20.42
    TCP                     72KB (THT) + 150KB(THT)     18.93                 1.96                    27.75
    Modified TCP            26KB (THT) + 150KB(THT)     16.98                 1.80                    21.23
The increase in hardware complexity pays off!
Conclusions
- Prefetching increases hit rate and decreases AMAT.
[Figure: Fixed Offset, Stride Based, Tag Correlated, arranged in order of increasing hardware complexity, increasing hit rate, and decreasing AMAT.]
- Fixed offset prefetching gives good performance for code with high spatial locality.
- Stride prefetching performs best when a program has steady memory access patterns, regardless of locality.
- TCP performs best on average.
References
- Chen1995 - T.-F. Chen and J.-L. Baer, "Effective hardware-based data prefetching for high-performance processors," IEEE Transactions on Computers, vol. 44, no. 5, pp. 609-623, May 1995.
- Hu2003 - Z. Hu, M. Martonosi, and S. Kaxiras, "TCP: tag correlating prefetchers," Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), pp. 317-326, 8-12 Feb. 2003.
- Reddi2004 - V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, "PIN: A Binary Instrumentation Tool for Computer Architecture Research and Education," 2004.
Questions
- Question 1: Is cache pollution a serious concern for anyone designing a prefetching algorithm?
- Answer: Cache pollution happens when the cache is cluttered with data that is never used. The difficulty is that the data a program will need is not known exactly; it can only be predicted. The goal is to prefetch all of the necessary data beforehand while bringing in as little unused data as possible.

Generative AI leverages algorithms to create various forms of content
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
 

676.v3

  • 1. Ketan N. Kulkarni & M. V. Rajesh ECEN 676 – Advanced Computer Architecture 1st May 2009 Reducing Memory Stalls by Hardware Based Data Prefetching Schemes
  • 2. Agenda
       - Introduction
       - Background work
       - Hardware based Data Cache Prefetching Algorithms
         1. Fixed Offset Prefetching
         2. Stride Based Prefetching
         3. Tag Correlated Prefetching
       - Simulation Setup
       - Results
       - Conclusions
  • 3. Introduction
       - What is prefetching?
         - Filling the cache with relevant data before the program needs it.
       - Why is prefetching needed?
         - The performance gap between microprocessors and DRAM keeps widening.
         - The data access penalty is increasing exponentially.
       - When to prefetch?
         - Whenever the bus is idle. (A perfect prefetching scheme completely masks the memory latency.)
       - Advantages:
         - Increases the L1 hit rate, reduces CPU stalls, and reduces the average memory access time (AMAT).
         - Program semantics remain unchanged.
       - Caveats:
         - Prefetching too far in advance may lead to cache pollution.
         - Incorrect prefetches.
  • 4. How Prefetching Works
       [Figure: two timelines. Without prefetching, Load A and Load B each miss in L1 and the CPU stalls during Fetch A and Fetch B. With prefetching, Prefetch A and Prefetch B (from L2 to L1) are issued ahead of the loads, so Load A and Load B hit in L1 and the possible misses are avoided. Legend: CPU stalled, CPU executing.]
  • 5. [Figure: taxonomy of prior prefetching schemes. Labels as extracted: Feedback based [Honio2009]; Spatial; Stride Prefetch [Fu1992]; Markov Prefetch [Joseph1997]; GHB [Nesbit2004]; Hybrid [Hsu1998]; Software Support [Mowry1992]; AC/DC [Nesbit2004]; Adaptive Stream [Hur2006]; FDP [Srinath2007]; Software Sequence-Base (Order Sensitive); Tag Correlation [Hu2003]; SMS [Somogyi2006]; Sequential [Smith1978]; RPT [Chen1995]; Locality Detect [Johnson1998]; Spatial Pat. [Chen2004]; Buffer Block [Gindele1977]; Adaptive; Hybrid; Adaptive Seq. [Dahlgren1993]; Commercial Processors (SuperSPARC, R10000, PA7200, Power4, Pentium 4); AMPM Prefetch [Ishii2009]; HW/SW Integrate [Gornish1994]; Fixed offset.]
  • 6. Hardware based Prefetching
       - Advantages:
         - Dynamic pattern matching.
         - No compiler support or ISA modification needed.
         - Takes advantage of regular, repeatable program behavior.
       - Caveats:
         - Increased complexity and hardware.
         - High-level program flow information is not available.
  • 7. Fixed Offset Prefetching
       - On a cache miss, retrieve the next block of memory.
       - Sequential prefetching (spatial locality).
       [Figure: the prefetch address (Tag | Index | Offset) is formed by adding a constant to the miss address (Tag | Index | Offset).]
       - Advantages:
         - Very simple scheme.
         - Less hardware.
       - Disadvantages:
         - Relies solely on spatial locality to work.
         - Can't detect patterns.
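The address arithmetic behind fixed offset prefetching can be sketched in a few lines. This is only an illustration, not the simulator used for the results: the function names are hypothetical, and the 64-byte line size is borrowed from the L1 configuration given in the simulation setup.

```python
LINE_SIZE = 64  # bytes; assumed to match the L1 line size from the simulation setup


def block_address(addr: int, line_size: int = LINE_SIZE) -> int:
    """Return the address of the first byte of the cache block containing addr."""
    return addr - (addr % line_size)


def fixed_offset_prefetch(miss_addr: int, offset: int = LINE_SIZE,
                          line_size: int = LINE_SIZE) -> int:
    """On a miss, return the block to prefetch: the miss address plus a constant offset."""
    return block_address(miss_addr + offset, line_size)
```

With the default offset of one line, a miss at 0x1234 (block 0x1200) would trigger a prefetch of block 0x1240. Passing `offset=8 * LINE_SIZE` models a larger lookahead and would prefetch block 0x1400 instead.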
  • 8. Stride Based Prefetching [Chen1995]
       - Exploit stride patterns in data addresses.
       - Prefetch the predicted data before it is accessed.
       - Store state and stride data in a Reference Prediction Table (RPT), and keep it updated.
       - Make state transitions based on correct or incorrect predictions.
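  • 9. RPT - Structure
       [Figure: the Program Counter is used to look up an RPT entry. Each entry holds four fields: instruction address, (previous) data address, stride, and state. An adder (+) produces the prefetching address; a subtractor (-) takes the effective address and feeds the entry update. Paths are labelled Lookup and Update.]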
  • 11. Stride based Prefetching (cont.)
       - Advantages:
         - Detects uniform strides (e.g. loops).
         - Accurate prediction for many cases.
       - Disadvantages:
         - Not much improvement with non-uniform strides.
         - Hardware overhead.
         - Cannot correlate strides of one instruction with those of others.
  • 12. RPT - Example
       - Load instructions at addresses 500, 504, and 512.
       - Base addresses of matrices A, B, and C are at locations 10,000, 50,000, and 90,000 respectively.
       Matrix multiplication:
         int A[100][100], B[100][100], C[100][100];
         for (i = 1; i < 100; i++) {
           for (j = 1; j < 100; j++) {
             for (k = 1; k < 100; k++) {
               A[i][j] += B[i][k] * C[k][j];
             }
           }
         }
       Assembly code (inner loop):
         500  lw   r4, 0(r2)        # load B[i][k]
         504  lw   r5, 0(r3)        # load C[k][j]
         508  mul  r6, r5, r4       # B[i][k] * C[k][j]
         512  lw   r7, 0(r1)        # load A[i][j]
         516  addu r7, r7, r6       # +=
         520  sw   r7, 0(r1)        # store A[i][j]
         524  addu r2, r2, 4        # ref B[i][k]
         528  addu r3, r3, 400      # ref C[k][j]
         532  addu r11, r11, 1      # increase k
         536  bne  r11, r13, 500    # loop
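The increments of 4 and 400 in the assembly follow from row-major layout. The sketch below reproduces the addresses the three loads touch in the inner loop, assuming 4-byte integers and zero-based indices (so the first access lands on each base address); the helper names are hypothetical.

```python
ELEM_SIZE = 4  # assumed size of an int in bytes
N = 100        # matrix dimension from the slide
BASE_A, BASE_B, BASE_C = 10_000, 50_000, 90_000  # base addresses from the slide


def elem_addr(base, row, col, n=N, size=ELEM_SIZE):
    """Row-major byte address of element [row][col] of an n x n matrix."""
    return base + (row * n + col) * size


def inner_loop_addresses(i, j, ks):
    """Addresses read by the loads of B[i][k], C[k][j], and A[i][j] as k varies."""
    return [(elem_addr(BASE_B, i, k), elem_addr(BASE_C, k, j), elem_addr(BASE_A, i, j))
            for k in ks]
```

Calling `inner_loop_addresses(0, 0, range(3))` walks B along a row (stride 4 bytes), walks C down a column (stride 100 * 4 = 400 bytes), and keeps re-reading the same element of A (stride 0).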
  • 13. RPT – Example (contd.)
       Initial state:
         Instruction address   (Previous) data address   Stride   State
         (empty)
       After iteration 1:
         Instruction address   (Previous) data address   Stride   State
         500                   50,000                    0        INIT
         504                   90,000                    0        INIT
         512                   10,000                    0        INIT
       After iteration 2:
         Instruction address   (Previous) data address   Stride   State
         500                   50,004                    4        TRANS
         504                   90,400                    400      TRANS
         512                   10,000                    0        STEADY
       After iteration 3:
         Instruction address   (Previous) data address   Stride   State
         500                   50,008                    4        STEADY
         504                   90,800                    400      STEADY
         512                   10,000                    0        STEADY
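The table evolution above can be reproduced with a small model of the RPT. This is a minimal sketch, not the authors' simulator: the class and method names are hypothetical, the table is an unbounded dictionary rather than a fixed-size hardware structure, and the fourth NO-PRED state is an assumption taken from the scheme described in [Chen1995] (it does not appear in the example). Suppressing prefetches when the stride is 0 is a further simplification.

```python
from dataclasses import dataclass

INIT, TRANS, STEADY, NO_PRED = "INIT", "TRANS", "STEADY", "NO-PRED"


@dataclass
class RptEntry:
    prev_addr: int
    stride: int = 0
    state: str = INIT


class ReferencePredictionTable:
    def __init__(self):
        # Keyed by instruction address (PC); unbounded for simplicity.
        self.entries = {}

    def access(self, pc, addr):
        """Record a load at `pc` touching `addr`; return an address to prefetch, or None."""
        entry = self.entries.get(pc)
        if entry is None:
            # First time this instruction is seen: allocate an entry in INIT.
            self.entries[pc] = RptEntry(prev_addr=addr)
            return None

        correct = addr == entry.prev_addr + entry.stride
        if correct:
            # A correct prediction moves the entry towards (or keeps it in) STEADY.
            entry.state = TRANS if entry.state == NO_PRED else STEADY
        elif entry.state == STEADY:
            # One misprediction in STEADY falls back to INIT and keeps the stride.
            entry.state = INIT
        else:
            # Otherwise learn the new stride and lower the confidence.
            entry.stride = addr - entry.prev_addr
            entry.state = TRANS if entry.state == INIT else NO_PRED
        entry.prev_addr = addr

        if entry.state == NO_PRED or entry.stride == 0:
            return None
        return addr + entry.stride
```

Feeding it the three loads at 500, 504, and 512 for three iterations of the inner loop leaves the entries in the states listed under "After iteration 3"; a fourth access by the load at 500 to address 50,012 would then request a prefetch of 50,016.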
  • 14. Tag Correlated Prefetching [Hu2003]
       - L1 cache tags exhibit strong regularity.
       - Similar to the two-level branch prediction technique.
       - Uses local and global history.
       - A correlating prefetcher that works with tags.
  • 15. TCP Structure
       [Figure: the miss address is split into TAG | INDEX | OFFSET. The miss index selects a row of the Tag History Table (THT), which holds the tags TAG1 ... TAGK. An index function over the tag history selects an entry of the Pattern History Table (PHT), which holds a pair (TAG, TAG'). Lookup and update paths are shown, both driven by the miss tag.]
  • 16. Modified TCP
       [Figure: same organization as the TCP structure. Labels: Miss Address (TAG | INDEX | OFFSET), miss tag, THT (SUM, TAGK), Index Function, PHT (TAG, TAG').]
  • 17. TCP (cont.)
       - Advantages:
         - Captures global and local history.
         - Recognizes recurring patterns.
       - Disadvantages:
         - More hardware.
         - Vulnerable to noise.
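The two-table lookup can be sketched as follows. This is an illustrative model under stated assumptions, not the design evaluated here: the class name is hypothetical, the PHT is a dictionary keyed by the tuple of the last k tags rather than a fixed-size table addressed by a hashed index function, and the bit widths (6 offset bits, 8 index bits) are derived from the 32 KB, 2-way, 64-byte-line L1 cache of the simulation setup.

```python
from collections import defaultdict, deque


class TagCorrelatingPrefetcher:
    def __init__(self, k=2, offset_bits=6, index_bits=8):
        self.k = k
        self.offset_bits = offset_bits
        self.index_bits = index_bits
        # THT: one row per cache set, holding the last k miss tags seen in that set.
        self.tht = defaultdict(lambda: deque(maxlen=k))
        # PHT: tag history -> tag that followed it last time (shared by all sets).
        self.pht = {}

    def split(self, addr):
        """Split an address into (tag, index); the block offset is discarded."""
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index

    def join(self, tag, index):
        """Rebuild a block address from a tag and a set index."""
        return ((tag << self.index_bits) | index) << self.offset_bits

    def on_miss(self, addr):
        """Train on an L1 miss; return a block address to prefetch, or None."""
        tag, index = self.split(addr)
        history = self.tht[index]
        if len(history) == self.k:
            # Update: the previous history in this set was followed by this miss tag.
            self.pht[tuple(history)] = tag
        history.append(tag)  # the deque drops the oldest tag automatically
        if len(history) < self.k:
            return None
        predicted = self.pht.get(tuple(history))  # lookup with the new history
        return None if predicted is None else self.join(predicted, index)
```

If one set repeatedly misses on tags 1, 2, 3, 1, 2, ..., the second time the history (1, 2) appears the sketch predicts tag 3 for that set. Because the PHT is shared, another set that later sees tags 1 then 2 gets the same prediction, which is one way to read "captures global and local history".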
  • 18. Simulation Environment
       - Trace driven simulation.
       - Implementation: C++, Perl, Pin Tool [Reddi2004].
       [Figure: block diagram connecting the trace file, CPU, data prefetcher, L1 data cache, L2 data cache, and main memory through address and data paths; the prefetcher calculates the next prefetch address.]
       Parameter            L1 cache             L2 cache
       Size                 32 KB                256 KB
       Associativity        2-way                8-way
       Line size            64 bytes             128 bytes
       Write policy         write-through        write-back
       Allocation policy    no-write-allocate    write-allocate
       Access time          1 cycle (hit time)   20 cycles
       Replacement policy   LRU                  LRU
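The cache parameters above fix how an address splits into tag, index, and offset. The following sketch works that out; the function names are hypothetical and the arithmetic assumes power-of-two sizes.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

For the L1 cache, `cache_geometry(32 * 1024, 2, 64)` gives 256 sets, 6 offset bits, and 8 index bits; for the L2 cache, `cache_geometry(256 * 1024, 8, 128)` gives 256 sets, 7 offset bits, and 8 index bits. With the L1 split, address 50,000 maps to tag 3, index 13, offset 16.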
  • 19. Benchmarks
       [Figure: stacked bar chart "Benchmark Instruction Mix" showing the percentage of loads, stores, and non-memory operations for grep, g++, ls, plamap, testgen, and matrix.]
       Benchmark   Description
       grep        Unix utility to search for a pattern in an input file.
       g++         GNU C++ compiler for Unix.
       testgen     Program for creating test patterns for scan chains in DFT.
       plamap      A mapping algorithm for CPLD architectures.
       ls          Unix utility to list information about files in a directory.
       matrix      100x100 matrix multiplication.
  • 20. Simulation Results - I
       [Figure: bar chart "CPI" per benchmark (grep, g++, ls, plamap, testgen, matrix) for No Prefetching, Fixed Offset Prefetching, Stride Based Prefetching, and Tag Correlating Prefetching.]
  • 21. Simulation Results - II
       [Figure: bar chart "L1 Cache Hit Rate" (%) per benchmark for the same four schemes.]
  • 22. Simulation Results - III
       [Figure: bar chart "Average Memory Access Time" (AMAT, in cycles) per benchmark for the same four schemes.]
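As a reminder of how hit rate feeds into AMAT, the standard relation is AMAT = hit time + miss rate x miss penalty. The sketch below applies it with the 1-cycle L1 hit time and 20-cycle L2 access time from the simulation setup. The function name and its parameters are hypothetical, the main memory latency is not stated in the slides and therefore defaults to zero (every L1 miss is assumed to hit in L2), and the hit rates in the usage note are made-up inputs, not measured results.

```python
def amat(l1_hit_rate, l1_hit_time=1.0, l2_access_time=20.0,
         l2_hit_rate=1.0, memory_time=0.0):
    """Average memory access time in cycles for a two-level cache hierarchy."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_hit_time + l1_miss_rate * (l2_access_time + l2_miss_rate * memory_time)
```

Under these assumptions, raising a hypothetical L1 hit rate from 95% to 97% lowers AMAT from about 2.0 to about 1.6 cycles, a 20% decrease. This illustrates why a hit-rate gain of a couple of percentage points can translate into a much larger relative AMAT reduction.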
  • 23. Simulation Results - IV
       [Figure: three bar charts of L1 hit rate (%).
        (a) "Effect of changing offset on L1 Hit Rate", benchmark grep; lookahead sizes 64, 2*64, 8*64.
        (b) "Effect of RPT size on L1 Hit Rate", benchmark g++; size = 8, 64, 256, 2048.
        (c) "Effect of THT/PHT parameters on L1 Hit Rate", benchmark testgen; groups "increasing m, n" and "increasing k" with configurations m=8, n=8, k=4; m=4, n=4, k=4; m=8, n=8, k=2; m=8, n=8, k=4; m=8, n=8, k=8; m=8, n=8, k=64.]
  • 24. Simulation Results – V
       Prefetching algorithm   Hardware                     CPI (% improvement)   Hit rate (% increase)   AMAT (% decrease)
       Fixed Offset            Area of adder, registers     10.28                 1.41                    9.62
       Stride Based            26.75 KB (RPT)               16.56                 1.75                    20.42
       TCP                     72 KB (THT) + 150 KB (PHT)   18.93                 1.96                    27.75
       Modified TCP            26 KB (THT) + 150 KB (PHT)   16.98                 1.80                    21.23
       The increase in hardware complexity pays off!
  • 25. Conclusions
       - Prefetching increases the hit rate and decreases AMAT.
       - Fixed offset prefetching gives good performance for code with high spatial locality.
       - Stride prefetching performs best when a program has steady memory access patterns, regardless of locality.
       - TCP performs better on average.
       [Figure: Fixed Offset -> Stride Based -> Tag Correlated, in order of increasing hardware complexity, increasing hit rate, and decreasing AMAT.]
  • 26. References
       - [Chen1995] Tien-Fu Chen and Jean-Loup Baer, "Effective hardware-based data prefetching for high-performance processors," IEEE Transactions on Computers, vol. 44, no. 5, pp. 609-623, May 1995.
       - [Hu2003] Z. Hu, M. Martonosi, and S. Kaxiras, "TCP: tag correlating prefetchers," Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), pp. 317-326, 8-12 Feb. 2003.
       - [Reddi2004] V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, "PIN: A Binary Instrumentation Tool for Computer Architecture Research and Education," 2004.
  • 28. Questions
       - Question 1: Is cache pollution a serious concern for anyone designing a prefetching algorithm?
       - Answer: Cache pollution happens when the cache is cluttered with useless data. The difficulty is that the data a program will need is never known exactly; it can only be predicted. The goal is therefore to prefetch all of the necessary data ahead of time while prefetching as little unused data as possible.