676.v3

Ketan N. Kulkarni & M. V. Rajesh
ECEN 676 – Advanced Computer Architecture
1st May 2009
Reducing Memory Stalls by Hardware-Based Data Prefetching Schemes

Agenda
 Introduction
 Background work
 Hardware based Data Cache Prefetching Algorithms
1. Fixed Offset Prefetching
2. Stride Based Prefetching
3. Tag Correlated Prefetching
 Simulation Setup
 Results
 Conclusions

Introduction
 What is Prefetching?
 Filling the cache with relevant data before it is needed by the program.
  Why is prefetching needed?
  The performance gap between microprocessors and DRAM keeps widening.
  The data access penalty increases exponentially.
  When to prefetch?
  Whenever the bus is idle. (A perfect prefetching scheme is one that completely masks the memory latency.)
  Advantages:
  Increases L1 hit rate, reduces CPU stalls, reduces average memory access time (AMAT).
  Program semantics remain unchanged.
  Caveats:
  Prefetching too far in advance may lead to cache pollution.
  Incorrect prefetches.

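How Prefetching Works
[Figure: two execution timelines.]
Without prefetching: LoadA misses in L1 and the CPU stalls during FetchA; LoadB then misses in L1 and the CPU stalls again during FetchB.
With prefetching: PrefetchA and PrefetchB (from L2 to L1) are issued ahead of the loads, so LoadA and LoadB hit in L1 and the CPU keeps executing. Prefetching avoids the possible miss.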

Background Work
[Figure: taxonomy of prior prefetching work. Labels recovered from the diagram:]
Categories: Spatial; Sequence-Based (Order Sensitive); Adaptive; Hybrid; Software; Commercial Processors.
Schemes: Buffer Block [Gindele1977]; Sequential [Smith1978]; Stride Prefetch [Fu1992]; Software Support [Mowry1992]; Adaptive Seq. [Dahlgren1993]; HW/SW Integrate [Gornish1994]; RPT [Chen1995]; Markov Prefetch [Joseph1997]; Hybrid [Hsu1998]; Locality Detect [Johnson1998]; Tag Correlation [Hu2003]; Spatial Pat. [Chen2004]; GHB [Nesbit2004]; AC/DC [Nesbit2004]; SMS [Somogyi2006]; Adaptive Stream [Hur2006]; FDP [Srinath2007]; Feedback based [Honio2009]; AMPM Prefetch [Ishii2009]; Fixed offset.
Commercial processors: SuperSPARC, R10000, PA7200, Power4, Pentium 4.

Hardware based Prefetching
 Advantages:
 Dynamic pattern matching.
 No compiler support/ ISA modification needed.
  Takes advantage of regular/repeatable program behavior.
 Caveats:
 Increased complexity/ hardware.
 High level program flow information not available.

Fixed Offset Prefetching
 On a cache miss, retrieve next block of memory.
 Sequential prefetching (spatial locality).
[Figure: prefetch address (Tag | Index | Offset) = miss address (Tag | Index | Offset) + Constant]
  Advantages:
  Very simple scheme.
  Less hardware.
  Disadvantages:
  Relies solely on spatial locality to work.
  Can't detect patterns.
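The address computation for fixed offset prefetching can be sketched as follows. This is a minimal illustration, not the simulator's code: the function name and parameters are hypothetical, the 64-byte default matches the L1 line size used in the simulation setup, and `offset_lines` stands for the constant added to the miss address.

```python
def fixed_offset_prefetch(miss_addr: int, line_size: int = 64,
                          offset_lines: int = 1) -> int:
    """Return the block-aligned address to prefetch after a miss.

    Assumption: the constant is a whole number of cache lines, so the
    prefetch target is the start of a later block.
    """
    block_base = miss_addr - (miss_addr % line_size)  # align down to the line
    return block_base + offset_lines * line_size


# A miss at 0x1234 with 64-byte lines prefetches the next block.
next_block = fixed_offset_prefetch(0x1234)
# A larger lookahead (8 lines) reaches further ahead.
far_block = fixed_offset_prefetch(0x1234, offset_lines=8)
```

With `offset_lines=1` this is plain sequential (next-block) prefetching; larger values trade timeliness against the risk of cache pollution.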

Stride Based Prefetching [Chen1995]
 Exploit stride patterns in data addresses.
  Prefetch the data before it is accessed.
  Store state and stride data in a Reference Prediction Table (RPT), and keep it updated.
  Make state transitions based on correct/incorrect predictions.

RPT - Structure
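[Figure: RPT organization. Each entry holds INSTRUCTION ADDRESS | (PREVIOUS) DATA ADDRESS | STRIDE | STATE. The Program Counter is used for Lookup and the Effective Address for Update. A subtractor (-) and an adder (+) are shown; the adder produces the PREFETCHING ADDRESS.]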

State Transition
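[Figure: state diagram of an RPT entry. States: INIT, TRANS, NO-PRED, STEADY. Edges are labeled Correct or Incorrect (four of each), depending on whether the predicted address matched the actual one.]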
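A minimal sketch of an RPT with this four-state scheme is given below. The transition table is an assumption reconstructed from the basic scheme described in [Chen1995], since the edges of the slide's diagram are not recoverable here. The class and field names are hypothetical, and a Python dict stands in for the finite hardware table.

```python
from dataclasses import dataclass

INIT, TRANS, STEADY, NO_PRED = "INIT", "TRANS", "STEADY", "NO-PRED"

# (state, prediction correct?) -> (next state, recompute stride?)
# Assumed from the basic scheme in [Chen1995].
TRANSITIONS = {
    (INIT, True): (STEADY, False),
    (INIT, False): (TRANS, True),
    (TRANS, True): (STEADY, False),
    (TRANS, False): (NO_PRED, True),
    (STEADY, True): (STEADY, False),
    (STEADY, False): (INIT, False),
    (NO_PRED, True): (TRANS, False),
    (NO_PRED, False): (NO_PRED, True),
}


@dataclass
class Entry:
    prev_addr: int
    stride: int = 0
    state: str = INIT


class RPT:
    def __init__(self):
        self.table = {}  # instruction address (PC) -> Entry

    def access(self, pc, addr):
        """Record a load at `pc` touching `addr`.

        Returns the address to prefetch, or None when no prefetch is issued.
        """
        entry = self.table.get(pc)
        if entry is None:  # first time this instruction is seen
            self.table[pc] = Entry(prev_addr=addr)
            return None
        correct = addr == entry.prev_addr + entry.stride
        next_state, update_stride = TRANSITIONS[(entry.state, correct)]
        if update_stride:
            entry.stride = addr - entry.prev_addr
        entry.prev_addr, entry.state = addr, next_state
        # No prefetch in NO-PRED; a zero stride targets a block already in use.
        if entry.state == NO_PRED or entry.stride == 0:
            return None
        return addr + entry.stride
```

Feeding the table a load at instruction address 500 with data addresses 50,000, 50,004, 50,008 moves its entry through INIT, TRANS, and STEADY with a stride of 4.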

Stride based Prefetching (cont.)
  Advantages:
  Detects uniform strides (e.g. loops).
  Accurate prediction for many cases.
  Disadvantages:
  Not much improvement with non-uniform strides.
  Hardware overhead.
  Cannot correlate strides of one instruction with those of others.

RPT - Example
 Load instructions at addresses 500, 504, and 512.
 Base addresses of matrices A, B, and C at locations
10,000, 50,000, and 90,000 respectively.
Matrix Multiplication
    int A[100][100], B[100][100], C[100][100];
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            for (k = 0; k < 100; k++) {
                A[i][j] += B[i][k] * C[k][j];
            }
        }
    }

Assembly Code (inner loop)
    Address  Instruction           Comment
    500      lw   r4, 0(r2)        load B[i][k]
    504      lw   r5, 0(r3)        load C[k][j]
    508      mul  r6, r5, r4       B[i][k] x C[k][j]
    512      lw   r7, 0(r1)        load A[i][j]
    516      addu r7, r7, r6       +=
    520      sw   r7, 0(r1)        store A[i][j]
    524      addu r2, r2, 4        ref B[i][k]
    528      addu r3, r3, 400      ref C[k][j]
    532      addu r11, r11, 1      increase k
    536      bne  r11, r13, 500    loop

RPT – Example (contd.)
Initial State
    Instruction Address   (Previous) Data Address   Stride   State
    (empty)

After Iteration 1
    Instruction Address   (Previous) Data Address   Stride   State
    500                   50,000                    0        INIT
    504                   90,000                    0        INIT
    512                   10,000                    0        INIT

After Iteration 2
    Instruction Address   (Previous) Data Address   Stride   State
    500                   50,004                    4        TRANS
    504                   90,400                    400      TRANS
    512                   10,000                    0        STEADY

After Iteration 3
    Instruction Address   (Previous) Data Address   Stride   State
    500                   50,008                    4        STEADY
    504                   90,800                    400      STEADY
    512                   10,000                    0        STEADY
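The strides of 4, 400, and 0 in the example follow from the row-major layout of the three matrices. The short sketch below recomputes them from the base addresses given in the example; the 4-byte element size is an assumption (it matches the `addu r2, r2, 4` step in the assembly), and the helper names are hypothetical.

```python
ELEM_BYTES = 4  # assumed size of an int
N = 100         # matrix dimension in the example
BASE = {"A": 10_000, "B": 50_000, "C": 90_000}


def addr(name, row, col):
    """Byte address of name[row][col] for a row-major N x N matrix."""
    return BASE[name] + (row * N + col) * ELEM_BYTES


def strides_per_k_step(i=0, j=0, k=0):
    """Address change seen by each load when the inner loop increments k."""
    return {
        500: addr("B", i, k + 1) - addr("B", i, k),  # B[i][k]: next column
        504: addr("C", k + 1, j) - addr("C", k, j),  # C[k][j]: next row
        512: addr("A", i, j) - addr("A", i, j),      # A[i][j]: unchanged
    }


strides = strides_per_k_step()
```

The load of C[k][j] steps a full row (100 elements of 4 bytes) per iteration, which is why a next-block scheme misses it while a stride scheme can follow it.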

Tag Correlated Prefetching [Hu2003]
 L1 cache tags exhibit strong regularity.
  Similar to the two-level branch prediction technique.
  Uses local/global history.
  A correlating prefetcher that works with tags.

TCP Structure
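[Figure: TCP block diagram. The miss address is split into TAG | INDEX | OFFSET. The miss index selects a row of the Tag History Table (THT), which holds the tags TAG1 ... TAGK. An index function over the THT row and the miss tag selects an entry (TAG, TAG') in the Pattern History Table (PHT). The diagram marks separate Lookup and Update paths.]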

Modified TCP
[Figure: Modified TCP block diagram. Labels recovered: Miss Address split into TAG | INDEX | OFFSET; Index Function; misstag; TAGK; SUM; TAG; TAG'; THT; PHT.]

TCP (cont.)
  Advantages:
  Captures global and local history.
  Recognizes recurring patterns.
  Disadvantages:
  More hardware.
  Vulnerable to noise.
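The THT/PHT interplay can be illustrated with a simplified sketch. It is not the design from [Hu2003] in detail: the class name, the dict-based tables, and the use of the raw tag history as the PHT key (instead of a hashed index function) are all assumptions made for brevity. The default bit widths correspond to a cache with 64-byte lines and 256 sets.

```python
class TagCorrelatingPrefetcher:
    def __init__(self, k=2, index_bits=8, offset_bits=6):
        self.k = k                  # number of past miss tags kept per set
        self.index_bits = index_bits
        self.offset_bits = offset_bits
        self.tht = {}               # miss index -> tuple of last k miss tags
        self.pht = {}               # tag history -> tag that followed it

    def split(self, addr):
        """Split an address into (tag, index); the offset is dropped."""
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index

    def on_miss(self, addr):
        """Update both tables for an L1 miss; return a prefetch address or None."""
        tag, index = self.split(addr)
        history = self.tht.get(index, ())
        if len(history) == self.k:
            self.pht[history] = tag           # Update: old history led to this tag
        history = (history + (tag,))[-self.k:]
        self.tht[index] = history
        if len(history) < self.k:
            return None                       # not enough history yet
        predicted = self.pht.get(history)     # Lookup: what followed last time?
        if predicted is None:
            return None
        return ((predicted << self.index_bits) | index) << self.offset_bits
```

If one set sees the miss tags 1, 2, 3, 1, 2, the last miss finds the history (1, 2) in the PHT and predicts tag 3 for the same set. Because the PHT in this sketch is keyed by tags only, a pattern learned in one set can also serve another.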

Simulation Environment
Implementation: C++, Perl, Pin Tool [Reddi2004]
Trace-driven simulation
[Figure: block diagram. A Trace File feeds the CPU; ADDRESS and DATA lines connect the CPU, L1 Data Cache, L2 Data Cache, and Main Memory. Other labels: Prefetcher, Hit, Calculate Next Prefetch Address.]

    Parameter            L1 Cache             L2 Cache
    Size                 32KB                 256KB
    Associativity        2-way                8-way
    Line size            64 bytes             128 bytes
    Write policy         write-through        write-back
    Write miss policy    no-write-allocate    write-allocate
    Access time          1 cycle hit time     20 cycle access time
    Replacement policy   LRU                  LRU
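The cache parameters above determine how an address splits into tag, index, and offset. The sketch below derives the number of sets and the field widths; it assumes power-of-two sizes, and the function names are hypothetical rather than taken from the simulator.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Number of sets and address field widths (power-of-two sizes assumed)."""
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


def split_address(addr, geom):
    """Return (tag, index, offset) of addr for the given geometry."""
    offset = addr & ((1 << geom["offset_bits"]) - 1)
    index = (addr >> geom["offset_bits"]) & ((1 << geom["index_bits"]) - 1)
    tag = addr >> (geom["offset_bits"] + geom["index_bits"])
    return tag, index, offset


L1 = cache_geometry(32 * 1024, ways=2, line_bytes=64)
L2 = cache_geometry(256 * 1024, ways=8, line_bytes=128)
```

Both caches come out at 256 sets (8 index bits); L1 uses 6 offset bits and L2 uses 7.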

Benchmarks
[Figure: Instruction Mix. Stacked bars (0% to 100%) per benchmark (grep, g++, ls, plamap, testgen, matrix); series: Loads, Stores, Non-mem ops.]

    Benchmark   Description
    grep        Unix utility to search for a pattern in an input file.
    g++         Unix GNU C++ compiler.
    testgen     Program for creating test patterns for scan chains in DFT.
    plamap      A mapping algorithm for CPLD architecture.
    ls          Unix utility to list information about files in a directory.
    matrix      100x100 matrix multiplication.

Simulation Results - I
[Figure: CPI per benchmark (grep, g++, ls, plamap, testgen, matrix); y-axis: CPI, 0.00 to 6.00; series: No Prefetching, Stride Based Prefetching, Tag Correlating Prefetching.]

Simulation Results - II
[Figure: L1 Cache Hit Rate per benchmark; y-axis: Hit Rate (%), 90 to 100; legend entry recovered: No Prefetching.]

Simulation Results - III
[Figure: Average Memory Access Time per benchmark; y-axis: AMAT (#cycles), 0 to 2.5; legend entry recovered: No Prefetching.]
Simulation Results - IV
[Figure, three panels, each plotting L1 Hit Rate (%):]
1. Effect of changing offset on L1 Hit Rate, benchmark grep; lookahead sizes 64, 2*64, 8*64.
2. Effect of RPT size on L1 Hit Rate, benchmark g++; sizes 8, 64, 256, 2048.
3. Effect of THT/PHT parameters on L1 Hit Rate, benchmark testgen; two groups (increasing m,n and increasing k) with configurations m=8;n=8;k=4, m=4;n=4;k=4, m=8;n=8;k=2, m=8;n=8;k=4, m=8;n=8;k=8, m=8;n=8;k=64.

Simulation Results – V
    Prefetching     Hardware                     CPI               Hit Rate       AMAT
    Algorithm                                    (% improvement)   (% increase)   (% decrease)
    Fixed Offset    Area of adder, registers     10.28             1.41           9.62
    Stride Based    26.75KB (RPT)                16.56             1.75           20.42
    TCP             72KB (THT) + 150KB(THT)      18.93             1.96           27.75
    Modified TCP    26KB (THT) + 150KB(THT)      16.98             1.80           21.23

The increase in hardware complexity pays off!

Conclusions
  Prefetching increases hit rate and decreases AMAT.
  Fixed offset would give good performance for code with high spatial locality.
  Stride prefetching would perform best when a program has steady memory access patterns, regardless of locality.
  TCP would perform better on average.
From Fixed Offset to Stride Based to Tag Correlated: increasing hardware complexity, increasing hit rate, decreasing AMAT.

References
  Chen1995 - T.-F. Chen and J.-L. Baer, "Effective hardware-based data prefetching for high-performance processors," IEEE Transactions on Computers, vol. 44, no. 5, pp. 609-623, May 1995.
  Hu2003 - Z. Hu, M. Martonosi, and S. Kaxiras, "TCP: tag correlating prefetchers," Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), pp. 317-326, 8-12 Feb. 2003.
  Reddi2004 - V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, "PIN: A Binary Instrumentation Tool for Computer Architecture Research and Education," 2004.

Questions
 Question 1: Is cache pollution a serious concern for
anyone designing a prefetching algorithm?
  Answer: Cache pollution occurs when the cache fills with data that is never used. The difficulty is that the data a program will need is not known exactly; it can only be predicted. The goal is to prefetch all of the necessary data ahead of time while prefetching as little unused data as possible.
