Final report

Data Prefetching Mechanisms and Their Impact on Performance
Group members:
Tian Nan<nanxx026@umn.edu>
Chang Xiong <xion1637@umn.edu>
Kshiti Deolalkar <deola002@umn.edu>
Vineel Kumar Reddy Gangireddygari <gangi009@umn.edu>
Introduction
Hardware pre-fetching is in-line with the main idea of out-of-order execution and fully
exploiting instruction level parallelism. It involves utilizing specialized hardware to track
load/store instructions and their access patterns in order to prefetch data based on past access
behavior. [10]
The graphic above clearly demonstrates how accurate and timely prefetching removes loads from
the critical path.
The advantages of hardware prefetching are that it does not waste instruction execution
bandwidth, can be easily tuned to system implementation, and does not introduce any code
portability issues. It also has better dynamic information (for instance: in case of unexpected
cache conflicts) and it does not add any instruction overhead to issue prefetches. It does however
increase the hardware complexity and is outperformed in terms of efficiency by software
prefetching in some cases. The succeeding topics discuss four major types of hardware
prefetching techniques, their algorithms along with the advantage/drawback each method
provides.
1.Stream Buffer
Because data prefetching is a technique for shorten the interval of data request and arrival to
execution unit, we first study what CPU Cache is. One of our hardware prefetching means
stream buffer based on the concept of one block lookahead (OBL) [1]
. So we study OBL in the
next step in order to figure out how stream buffer works.

1.1 How does stream buffer work
In this project, we implemented stream buffer for L2 cache. If there exists a miss in L2 cache, the
hardware will check stream buffer. Match all the data address in stream buffer. If there is a hit in
stream buffer, we just get data from stream buffer and do nothing to it. If there is a miss, we need
to flush the whole buffer, get the block with corresponding address X from memory to L2 cache,
and get the block with address X+1, X+2, … X+n into stream buffer. (n is the size of buffer).
Stream buffer for L1 cache has same distance from functional unit, and hit time is same as L1
cache hit time. Stream buffer for L2 cache also has same hit time as L2 cache. In this
circumstance, if a miss happens in L2 cache, but a hit in stream buffer, the penalty for this miss
is only twice of hit time, since stream buffer hit time is same as L2 cache hit time.
What happens if a miss occurs in stream buffer and L2 cache: We need to flush stream buffer.
We use dirty bit to check every block in stream buffer. For every block which is dirty, we need
to write the block back. When getting the block from memory for L2 cache, we also need to
prefetch n blocks from memory. The process is parallel, especially when the stream buffer size is
small.
2. Stride prefetchers(PC based)
This approach involves recording the ‘stride’ of a load instruction, i.e. the distance between the
memory addresses referenced by the instruction. The last address referenced by the load
instruction is also recorded. Then, the next time the load instruction is fetched, we also prefetch
the last address + stride of the same load instruction. A ‘reference prediction table’ (RPT) is
required for the tracking process. [8]
2.1 Reference prediction table: entries and algorithm
This is an I-detection scheme since it requires the address of the instruction, i.e. the program
counter. The reference prediction table is organized as a Cache. An example RPT would be, say,
a 256-entry direct mapped cache as in [7].
-Whenever there is a cache miss, the instruction address of the load instruction that is causing the
cache miss is compared against the entries in the reference prediction table.
-If there is no match, the instruction is entered into the RPT along with all of its information
required for tracking. The state is initially set to no-prefetch.
-When a read miss is obtained for the same instruction again, we get a match with the RPT. The
stride is then calculated, the new address as well as the stride is added to the RPT. The state is
now set to prefetch.

The below figure represents the process in the RPT.
The state transition diagram representing course of action in case of a correct or an incorrect
prediction is as follows:
The state is set to initial when the prefetching begins, and the stride has been calculated. To
check for correctness, the read requests in the second-level cache are matched with the RPT
entries. When an access to the same stride sequence is generated three times in a row, the state
becomes ‘steady’. ‘Transient’ state is when there has been a second incorrect prediction in a row.
The no-prefetch state helps reduce the number of unnecessary prefetches.
The length of a stride sequence is unknown in our case, so different schemes may determine the
number of blocks to be prefetched. The number of blocks would also depend on the architecture
in use.
2.2 Hardware requirements
The block diagram below encompasses the complete prefetching process, and is hence a reliable
source for deriving the hardware requirements the technique would entail. [9]
The prediction table is required to store several entries consisting of tracking details for the load
instructions. Increasing the number of entries can aid in better prefetch but beyond a certain limit

would yield diminishing returns. The other units are either for address calculation or comparison.
The comparator can be considered as an expensive operation since it would require bit by bit
comparison.
The drawback of this method includes the limitation on how far we can get ahead and the
amount of miss latency the prefetch can cover. In addition, this technique is not fast enough,
initiating the prefetch when the load is fetched the next time can be too late since the load will
access the data cache soon after it is fetched. Potential solutions to overcome these drawbacks
include using lookahead PC to index the prefetcher table, prefetch further ahead (last address +
N*stride) or to generate multiple prefetches.
3. Stride prefetchers(Cache block address based)
Cache Block Address based prefetchers operate based on a stride associated with a memory
address/region rather than strides associated with an instruction, as in PC based prefetchers. i.e.
access patterns in a particular memory location may look line A, A+N, A+2*N where A is the
address of memory location accessed first and N denotes the stride associated with that access
pattern to that memory region. So, there can be multiple strides for a particular instruction, each
associated with different memory regions that the respective instruction tried to access.
A normal PC-based stride predictors rely on PC to identify the stride when a loop repeats and so
cannot calculate stride of memory accesses that are not enclosed in a loop. Also, a stream buffer
based prefetcher is a special case of this Stride Prefetchers where the stride value is 1.
3.1 Design of Address Based Stride Predictor
Current Design is based on Dstride Prefetcher[5]
which consists of Prefetch Buffer, Prediction
Logic and Prefetching logic as shown in the block diagram below.

Block Diagram
3.1.1 Prefetch Buffer
This holds the predicted data and is opted to be isolated from the L1 cache in order to avoid
cache pollution. Both the L1 cache and the PB are searched in parallel and if the address is
matched in PB, the data is sent to the pipeline and the prefetching logic is informed about the
match and prefetches for any steady strides are initiated.
3.1.2 Prediction Logic
It consists of SPT (Stride prediction Table) that predicts strides by comparing current miss
address with recent miss address. The SPT has the following entries
● Miss address, which holds a data-cache miss address;
● Strides, which hold a fixed number of possible strides (4 in our case).
● Predicted Address corresponding to each stride
● Two state bits that are used to determine the steadiness of the stride predicted. An FSM
can be constructed as following

Initial state is set when a predicted address is calculated which moves on to transient if missed
address is not same as predicted address, else to steady state. When the state of a stride is no
prediction, the predictor does not initiate any prefetches to that particular memory location.
3.1.3 Prefetching Logic
A SST (Stride steady Table) and Prefetch Queue are present. When the SPT detects a stride to be
steady, it is moved into SST and a prefetch request for next address is queued onto Prefetch
Queue.
Each and every stride that was predicted to be steady has an entry in SST with following fields
● PA: predicted address
● LA-PA: Look Ahead Predicted Address
● Flag: To indicate whether the data prefetch is complete.
● nLA-PA: Next Look Ahead Predicted Address.
● Stride: The stride value that was found to be steady.
● LA-Dist: Look Ahead Stride Distance, it denotes how far we are prefetching.
So, when there is a hit in prefetch Buffer, the PA is updated with LA-PD and the LA-PD is
updated with nLA-PD. If the data has been fetched then the flag is set to IDLE state, else
PENDING state. Pending indicates that the data access is fast compared to the Prefetching. In
such case, the nLA-PD is calculated as
nLA-PD = LA-PD + stride*n (n is not equal to one)
Instead of nLA-PD = LA-PD + stride*1
This is how the Address based Stride buffer can overcome the disadvantage of the Stream
Buffers where the stride is unity.
To improve the prediction accuracy, the prefetch requests are not issued for data that is already
present in the data-cache or in the Prefetch Buffer. When the prefetch address raises a trap for
address translation, or when an out-of-range address is issued by the prefetching logic, the
prefect request is discarded. A pseudo-LRU policy is used for replacing SST entries.
Prefetch Queue maintains a table of requests with each entry looking as below
The Demand type data requests are processed with a higher priority over the prediction based
data requests. This is because the latency that might arise due to a stalling on demand might not
be overcomed by the latency gain due to prefetching data.

4. Locality Based Prefetchers
Stride prefetcher is a practical method to reduce the penalty of memory access. However,
sometimes patterns look random to closeby address, in other words, access patterns are not
perfectly strided in many applications. In these cases, locality based prefetching can be a good
solution.
Basically, locality based prefetchers use the dynamic feedback mechanism to improve the
average performance, including accuracy, timeliness of the prefetch request as well as the cache
pollution caused by prefetch requests.
In a conventional configuration of prefetcher, prefetch degree and prefetch distance are fixed
(like the N in stride prefetcher). In the feedback mechanism, the processor dynamically change
the prefetch degree and prefetch distance to get better performance (accuracy, timeliness and
cache pollution) of the prefetch.
4.1 Stream Prefetcher Design
The prefetcher is used to keep track of the access streams. There are four states for each tracking
entry:
1. Invalid: The tracking entry is not allocated a stream to keep track of. Initially, all tracking
entries are in this state.
2. Allocated: A demand (i.e. load/store) L2 miss allocates a tracking entry if the demand miss
does not find any existing tracking entry for its cache-block address.
3. Training: The prefetcher trains the direction (ascending or descending) of the stream based on
the next two L2 misses that occur +/- 16 cache blocks from the first miss.4 If the next two
accesses in the stream are to ascending (descending) addresses, the direction of the tracking entry
is set to 1 (0) and the entry transitions to Monitor and Request state.
4. Monitor and Request: The tracking entry monitors the accesses to a memory region from a
start pointer (address A) to an end pointer (address P). The maximum distance between the start
pointer and the end pointer is determined by Prefetch Distance, which indicates how far ahead of
the demand access stream the prefetcher can send requests. If there is a demand L2 cache access
to a cache block in the monitored memory region, the prefetcher requests cache blocks [P+1, ...,
P+N] as prefetch requests (assuming the direction of the tracking entry is set to 1). N is called the
Prefetch Degree. After sending the prefetch requests, the tracking entry starts monitoring the
memory region between addresses A+N to P+N (i.e. effectively it moves the tracked memory
region by N cache blocks)[6]
.

4.2 Dynamically Adjusting Prefetcher Behavior
There are two ways to adjust the prefetcher behavior, the first one is adjusting prefetcher
aggressiveness and the other one is that adjusting cache insertion policy of prefetched blocks.
Prefetcher aggressiveness is determined by the prefetch degree and prefetch distance. For
example, if the prefetch is late, processor will increase the aggressiveness (prefetch degree and
distance) to reduce the latency. If cache pollution occurs, processor will decrease the
aggressiveness. Similarly, when it comes to the insertion policy of the prefetched blocks, if
prefetched blocks cause cache pollution, we will not insert these blocks into the MRU
(Most-Recently-Used) position of the LRU stack.
5. Sequential Prefetching
This is the simplest and the most basic of all the pre-fetching methods and takes advantage of
spatial locality. It prefetches the next ‘N’ cache lines after a demand access or a demand miss.
[10] It is simple to implement as it does not require sophisticated pattern detection. It works well
for sequential/streaming access patterns. However, the demand for memory access is not always
sequential in nature and if the access patterns are irregular in nature, which is very often the case,
it ends up wasting a lot of bandwidth. It is not very suitable for any randomized patterns other
than sequential: consider a case in which the program is traversing memory from higher to lower
addresses – the sequential prefetching method would end up prefetching ‘previous’ N cache lines
increasing the odds of cache pollution and unnecessary stall cycles.
Sequential prefetching is based upon the assumption that cache misses exhibit high spatial
locality; if the processor accesses block B, the probability is high that the processor will also
reference block B+1. Therefore, if the cache experiences a miss on block B, it prefetches blocks
B+1, B+2,..., B+d, if these blocks are not already in the cache, where d is referred to as the
degree of prefetching. Sequential prefetching has been shown to perform well for parallel
applications on shared-memory multiprocessors. [12]
6. Simulation
Due to the time limit, we work on the simulation of four of five methods mentioned above,
sequential prefetching, PC-Based stride prefetching, Address-Based prefetching as well as
stream buffer prefetching.

6.1 Sequential Prefetching
For the sequential prefetching, the main modification is in cache.c. We modify the function
cache_create() to implement the sequential prefetch by adding one parameter which is the size of
prefetching.
6.2 PC-Based Stride Prefetching
The file ‘cache.c’ undergoes major modification as a large body of code is added to implement
the pre-fetching method. Corresponding changes are made in the ‘cache.h’ to ensure all functions
are declared and sim-outorder.c also undergoes certain minor modifications with respect to a few
function calls involved with the fetching aspect of the pipeline. After debugging, the code is run
on three benchmarks – anagram, go and gcc for extracting appropriate performance metrics and
using them for comparison against the other techniques. Since the cache.c file modifications are
key in the implementation of the stride-prefetching method on simplescalar, they are discussed
here in more detail:
The reference prediction table is designed to have 64 entries. When
we have an address that has generated a miss, the address is
checked against all the entries of the reference prediction table.
This is a computationally expensive step due to a linear run
through of all the entries of the table. In the event of a non-match,
the instruction that caused the miss is entered into the RPT and the
initial state is set to no-prefetch.
If there is a match, the pre-fetching mechanism is triggered and
new address calculation along with the calculated stride from the RPT aid in the prefetching.
How the stride is calculated or updated is based on the ‘state’ of the mechanism depending upon
past prediction patterns and the correctness of the prediction, as shown by the state table in
section 2.1. The four states are initial, steady, transient and no-prefetch. The state table can be
translated into code logic by either using switch-case, as is shown in the first image or by using
simple if-else, as shown in the second image.

The image on the left is from reference [11] and the image on the right is an actual code snippet;
both implementing a portion of the state table. While the left image is relatively easy to parse,
the right one refers to a variable ‘statue’ which is actually the state of the process. ‘Statue’ values
0,1,2, and 3 refer to initial, steady, transient and non-prefetch states. The if-else logic transfers
the logic to the next relevant state according to the state table in 2.1 depending on whether the
stride prediction was correct/incorrect for the past logic, its present/past state, and also updates
the stride if necessary. The pre-fetched data is not placed directly into the cache, but is placed in
a buffer.
6.3 Address-Based Stride Prefetching
All the additional logic needed to implement the Address based Stride Prefetching is added in the
“cache.c” file and the appropriate declarations are added in “cache.h” files of simplescalar. First,
A structure is defined to create a stride prediction table as shown below.
We have used 8 entry stride prediction table, so that the time delay to search the strides and
miss_address is not significant. These spt entries are searched for predicted address to match and
the state value is updated on every search.
The predict_addr is calculated by adding the corresponding stride value to the miss_addr.

I.e predict_addr[i] = miss_addr + stride[i] where i can vary from 1to 8.
The predict_addr is then compared to the input addr to the cache l2 and the correct flag is written
and then, states are updated according the code below.
All the steady entries need to be moved into a table called Steady stride table from which the
prefetching is done to separate prefetch buffer. The size of the stredy stride table used in our
project is 64. The rest modifications needed to prefetch are similar to the changes described in
the PC based stride prefetching mechanism.
6.4 Stream Buffer Based Prefetching
The core code is located in the function cache_prefetch_block in the cache.c file.
if(cp->nsets > 1000)
cache_prefetch_block(cp, addr, 0);
And we add two line in cache access function. If it is a miss in L2 cache(size = 1024), we need to
check stream buffer and prefetch data.
static struct cache_blk_t oneblk;
static md_addr_t oneaddr;

We only use static variables to record address and content of the block. So it is a 1 block size
stream buffer.
If it is a hit in stream buffer, we record this hit and return hit time. Do nothing to the buffer.
If it is a miss in stream buffer. We first check whether we should write back the previous block.
Then prefetch the next block by address plus block size. During this process, we need longer
time and to handle flag bit.

7.Results and Analysis
To run the simulation, we specify the cache as L2 instruction and data cache and the prefetching
is implemented in the L2 data cache which is two way associative. LRU replace policy is used.
The miss rates of four different methods are shown below.
From the results shown above, we can see that stream buffer based prefetching have the best
performance on miss rate. Stride methods almost have the same performance with the sequential
prefetching. This might because we just chose to use one stride in our implementation rather than
N strides. It’s likely that the stride methods will provide better performance if we increase the
stride number.
The IPC of four different methods are shown below. It’s obvious that two stride methods have
lower IPC because there are more computation and hardware access involved. For the sequential
prefetching and stream buffer based prefetching, their logic are simpler and thus lead to higher
IPC.

We also dive into the sequential prefetching method and try to explore the effect of prefetching
size. The results of Go benchmark is shown below.
From the results we can easily find out that as the prefetching size increase, the missing rate is
going down and it’s because we prefetch more blocks at one time. However we also notice that
as the prefetching size increase there is one point that miss rate is going up again. This is more

likely when a cache pollution happens. It basically means that we prefetch too many blocks at a
time and the useful cache blocks are flushed away every time there is a cache miss.
8. References
[1][2] Smith, Alan Jay (1982-09-01). "Cache Memories". ACM Comput. Surv. 14 (3): 473–530.
ISSN 0360-0300.
[3] F. Dahlgren, M. Dubois, and P. Stenström, ``Fixed and adaptive sequential prefetching in
shared memory multiprocessors.,'' in ICPP, pp. 56-63, 1993
[4] Mittal, Sparsh (2016-08-01). "A Survey of Recent Prefetching Techniques for Processor
Caches". ACM Comput. Surv. 49 (2): 35:1–35:35. ISSN 0360-0300.
[5] G. Hariprakash, R. Achutharaman and A. R. Omondi, "DStride: data-cache
miss-address-based stride prefetching scheme for multimedia processors," Proceedings 6th
Australasian Computer Systems Architecture Conference. ACSAC 2001, Gold Coast, Qld., 2001,
pp. 62-70.
[6] Srinath et al., “Feedback Directed Prefetching: Improving the Performance and
Bandwidth-Efficiency of Hardware Prefetchers “, HPCA 2007.
[7] Dahlgren and Stenstrom, “Effectiveness of Hardware-Based Stride and Sequential
Prefetching in Shared-Memory Multiprocessors,” High performance computer architecture,
1995.
[8] Baer and Chen, “An effective on-chip preloading scheme to reduce data access penalty,” SC
1991
[9] Kim and Viedenbaum, “Stride directed Prefetching for Secondary Caches”, Parallel
Processing, 1997.
[10] Prof. Onur Mutlu, “Computer Architecture: Prefetching II”,
http://www.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php?media=wiki:lectures:onur-740-fall11-lec
ture27-prefetching-ii-afterlecture.pdf
[11] Grannæs, M. (2006). Bandwidth-Aware Prefetching in Chip Multiprocessors. [online]
Norwegian University of Science and Technology Department of Computer and Information
Science. Available at: https://daim.idi.ntnu.no/masteroppgaver/001/1507/masteroppgave.pdf
[Accessed 21 Dec. 2017].
[12] Dahlgren, F. and Stenstrom, P. (1996). Evaluation of hardware-based stride and sequential
prefetching in shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed
Systems, 7(4), pp.385-398.

Final report

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Final report

Similar to Final report (20)

Recently uploaded

Recently uploaded (20)

Final report