SlideShare a Scribd company logo
1 of 15
Download to read offline
Data​ ​Prefetching​ ​Mechanisms​ ​and​ ​Their​ ​Impact​ ​on​ ​Performance
Group​ ​members:
Tian​ ​Nan<​nanxx026@umn.edu​>
Chang​ ​Xiong​ ​<xion1637@umn.edu>
Kshiti​ ​Deolalkar​ ​<deola002@umn.edu>
Vineel​ ​Kumar​ ​Reddy​ ​Gangireddygari​ ​<gangi009@umn.edu>
Introduction
Hardware pre-fetching is in-line with the main idea of out-of-order execution and fully
exploiting instruction level parallelism. It involves utilizing specialized hardware to track
load/store instructions and their access patterns in order to prefetch data based on past access
behavior.​ ​[10]
The graphic above clearly demonstrates how accurate and timely prefetching removes loads from
the​ ​critical​ ​path.
The advantages of hardware prefetching are that it does not waste instruction execution
bandwidth, can be easily tuned to system implementation, and does not introduce any code
portability issues. It also has better dynamic information (for instance: in case of unexpected
cache conflicts) and it does not add any instruction overhead to issue prefetches. It does however
increase the hardware complexity and is outperformed in terms of efficiency by software
prefetching in some cases. The succeeding topics discuss four major types of hardware
prefetching techniques, their algorithms along with the advantage/drawback each method
provides.
1.Stream​ ​Buffer
Because data prefetching is a technique for shorten the interval of data request and arrival to
execution unit, we first study what CPU Cache is. One of our hardware prefetching means
stream buffer based on the concept of one block lookahead (OBL) ​[1]​
. So we study OBL in the
next​ ​step​ ​in​ ​order​ ​to​ ​figure​ ​out​ ​how​ ​stream​ ​buffer​ ​works.
1.1​ ​How​ ​does​ ​stream​ ​buffer​ ​work
In this project, we implemented stream buffer for L2 cache. If there exists a miss in L2 cache, the
hardware will check stream buffer. Match all the data address in stream buffer. If there is a hit in
stream buffer, we just get data from stream buffer and do nothing to it. If there is a miss, we need
to flush the whole buffer, get the block with corresponding address X from memory to L2 cache,
and​ ​get​ ​the​ ​block​ ​with​ ​address​ ​X+1,​ ​X+2,​ ​…​ ​X+n​ ​into​ ​stream​ ​buffer.​ ​(n​ ​is​ ​the​ ​size​ ​of​ ​buffer).
Stream buffer for L1 cache has same distance from functional unit, and hit time is same as L1
cache hit time. Stream buffer for L2 cache also has same hit time as L2 cache. In this
circumstance, if a miss happens in L2 cache, but a hit in stream buffer, the penalty for this miss
is​ ​only​ ​twice​ ​of​ ​hit​ ​time,​ ​since​ ​stream​ ​buffer​ ​hit​ ​time​ ​is​ ​same​ ​as​ ​L2​ ​cache​ ​hit​ ​time.
What happens if a miss occurs in stream buffer and L2 cache: We need to flush stream buffer.
We use dirty bit to check every block in stream buffer. For every block which is dirty, we need
to write the block back. When getting the block from memory for L2 cache, we also need to
prefetch n blocks from memory. The process is parallel, especially when the stream buffer size is
small.
2.​ ​Stride​ ​prefetchers(PC​ ​based)
This approach involves recording the ‘stride’ of a load instruction, i.e. the distance between the
memory addresses referenced by the instruction. The last address referenced by the load
instruction is also recorded. Then, the next time the load instruction is fetched, we also prefetch
the ​last address + stride of the same load instruction. A ‘reference prediction table’ (RPT) is
required​ ​for​ ​the​ ​tracking​ ​process.​ ​[8]
2.1​ ​Reference​ ​prediction​ ​table:​ ​entries​ ​and​ ​algorithm
This is an I-detection scheme since it requires the address of the instruction, i.e. the program
counter. The reference prediction table is organized as a Cache. An example RPT would be, say,
a​ ​256-entry​ ​direct​ ​mapped​ ​cache​ ​as​ ​in​ ​[7].
-Whenever there is a cache miss, the instruction address of the load instruction that is causing the
cache​ ​miss​ ​is​ ​compared​ ​against​ ​the​ ​entries​ ​in​ ​the​ ​reference​ ​prediction​ ​table.
-If there is no match, the instruction is entered into the RPT along with all of its information
required​ ​for​ ​tracking.​ ​The​ ​state​ ​is​ ​initially​ ​set​ ​to​ ​no-prefetch.
-When a read miss is obtained for the same instruction again, we get a match with the RPT. The
stride is then calculated, the new address as well as the stride is added to the RPT. The state is
now​ ​set​ ​to​ ​prefetch.
The​ ​below​ ​figure​ ​represents​ ​the​ ​process​ ​in​ ​the​ ​RPT.
The state transition diagram representing course of action in case of a correct or an incorrect
prediction​ ​is​ ​as​ ​follows:
The state is set to initial when the prefetching begins, and the stride has been calculated. To
check for correctness, the read requests in the second-level cache are matched with the RPT
entries. When an access to the same stride sequence is generated three times in a row, the state
becomes ‘steady’. ‘Transient’ state is when there has been a second incorrect prediction in a row.
The​ ​no-prefetch​ ​state​ ​helps​ ​reduce​ ​the​ ​number​ ​of​ ​unnecessary​ ​prefetches.
The length of a stride sequence is unknown in our case, so different schemes may determine the
number of blocks to be prefetched. The number of blocks would also depend on the architecture
in​ ​use.
2.2​ ​Hardware​ ​requirements
The block diagram below encompasses the complete prefetching process, and is hence a reliable
source​ ​for​ ​deriving​ ​the​ ​hardware​ ​requirements​ ​the​ ​technique​ ​would​ ​entail.​ ​[9]
The prediction table is required to store several entries consisting of tracking details for the load
instructions. Increasing the number of entries can aid in better prefetch but beyond a certain limit
would yield diminishing returns. The other units are either for address calculation or comparison.
The comparator can be considered as an expensive operation since it would require bit by bit
comparison.
The drawback of this method includes the limitation on how far we can get ahead and the
amount of miss latency the prefetch can cover. In addition, this technique is not fast enough,
initiating the prefetch when the load is fetched the next time can be too late since the load will
access the data cache soon after it is fetched. Potential solutions to overcome these drawbacks
include using lookahead PC to index the prefetcher table, prefetch further ahead (last address +
N*stride)​ ​or​ ​to​ ​generate​ ​multiple​ ​prefetches.
3.​ ​Stride​ ​prefetchers(Cache​ ​block​ ​address​ ​based)
Cache Block Address based prefetchers operate based on a stride associated with a memory
address/region rather than strides associated with an instruction, as in PC based prefetchers. i.e.
access patterns in a particular memory location may look line A, A+N, A+2*N where A is the
address of memory location accessed first and N denotes the stride associated with that access
pattern to that memory region. So, there can be multiple strides for a particular instruction, each
associated​ ​with​ ​different​ ​memory​ ​regions​ ​that​ ​the​ ​respective​ ​instruction​ ​tried​ ​to​ ​access.
A normal PC-based stride predictors rely on PC to identify the stride when a loop repeats and so
cannot calculate stride of memory accesses that are not enclosed in a loop. Also, a stream buffer
based​ ​prefetcher​ ​is​ ​a​ ​special​ ​case​ ​of​ ​this​ ​Stride​ ​Prefetchers​ ​where​ ​the​ ​stride​ ​value​ ​is​ ​1.
3.1​ ​​ ​Design​ ​of​ ​Address​ ​Based​ ​Stride​ ​Predictor
Current Design is based on Dstride Prefetcher​[5] ​
which consists of Prefetch Buffer, Prediction
Logic​ ​and​ ​Prefetching​ ​logic​ ​as​ ​shown​ ​in​ ​the​ ​block​ ​diagram​ ​below.
Block​ ​Diagram
3.1.1​ ​Prefetch​ ​Buffer
This holds the predicted data and is opted to be isolated from the L1 cache in order to avoid
cache pollution. Both the L1 cache and the PB are searched in parallel and if the address is
matched in PB, the data is sent to the pipeline and the prefetching logic is informed about the
match​ ​and​ ​prefetches​ ​for​ ​any​ ​steady​ ​strides​ ​are​ ​initiated.
3.1.2​ ​Prediction​ ​Logic
It consists of SPT (Stride prediction Table) that predicts strides by comparing current miss
address​ ​with​ ​recent​ ​miss​ ​address.​ ​The​ ​SPT​ ​has​ ​the​ ​following​ ​entries
● Miss​ ​address,​ ​which​ ​holds​ ​a​ ​data-cache​ ​miss​ ​address;
● Strides,​ ​which​ ​hold​ ​a​ ​fixed​ ​number​ ​of​ ​possible​ ​strides​ ​(4​ ​in​ ​our​ ​case).
● Predicted​ ​Address​ ​corresponding​ ​to​ ​each​ ​stride
● Two state bits that are used to determine the steadiness of the stride predicted. An FSM
can​ ​be​ ​constructed​ ​as​ ​following
Initial state is set when a predicted address is calculated which moves on to transient if missed
address is not same as predicted address, else to steady state. When the state of a stride is no
prediction,​ ​the​ ​predictor​ ​does​ ​not​ ​initiate​ ​any​ ​prefetches​ ​to​ ​that​ ​particular​ ​memory​ ​location.
3.1.3​ ​Prefetching​ ​Logic
A SST (Stride steady Table) and Prefetch Queue are present. When the SPT detects a stride to be
steady, it is moved into SST and a prefetch request for next address is queued onto Prefetch
Queue.
Each​ ​and​ ​every​ ​stride​ ​that​ ​was​ ​predicted​ ​to​ ​be​ ​steady​ ​has​ ​an​ ​entry​ ​in​ ​SST​ ​with​ ​following​ ​fields
● PA:​ ​predicted​ ​address
● LA-PA:​ ​Look​ ​Ahead​ ​Predicted​ ​Address
● Flag:​ ​To​ ​indicate​ ​whether​ ​the​ ​data​ ​prefetch​ ​is​ ​complete.
● nLA-PA:​ ​Next​ ​Look​ ​Ahead​ ​Predicted​ ​Address.
● Stride:​ ​​ ​The​ ​stride​ ​value​ ​that​ ​was​ ​found​ ​to​ ​be​ ​steady.
● LA-Dist:​ ​Look​ ​Ahead​ ​Stride​ ​Distance,​ ​it​ ​denotes​ ​how​ ​far​ ​we​ ​are​ ​prefetching.
So, when there is a hit in prefetch Buffer, the PA is updated with LA-PD and the LA-PD is
updated with nLA-PD. If the data has been fetched then the flag is set to IDLE state, else
PENDING state. Pending indicates that the data access is fast compared to the Prefetching. In
such​ ​case,​ ​the​ ​nLA-PD​ ​is​ ​calculated​ ​as
​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​nLA-PD​ ​=​ ​LA-PD​ ​+​ ​stride*n​ ​​ ​(n​ ​is​ ​not​ ​equal​ ​to​ ​one)
Instead​ ​of​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​nLA-PD​ ​=​ ​LA-PD​ ​+​ ​stride*1
This is how the Address based Stride buffer can overcome the disadvantage of the Stream
Buffers​ ​where​ ​the​ ​stride​ ​is​ ​unity.
To improve the prediction accuracy, the prefetch requests are not issued for data that is already
present in the data-cache or in the Prefetch Buffer. When the prefetch address raises a trap for
address translation, or when an out-of-range address is issued by the prefetching logic, the
prefect​ ​request​ ​is​ ​discarded.​ ​A​ ​pseudo-LRU​ ​policy​ ​is​ ​used​ ​for​ ​replacing​ ​SST​ ​entries.
Prefetch​ ​Queue​ ​maintains​ ​a​ ​table​ ​of​ ​requests​ ​with​ ​each​ ​entry​ ​looking​ ​as​ ​below
The Demand type data requests are processed with a higher priority over the prediction based
data requests. This is because the latency that might arise due to a stalling on demand might not
be​ ​overcomed​ ​by​ ​the​ ​latency​ ​gain​ ​due​ ​to​ ​prefetching​ ​data.
4.​ ​Locality​ ​Based​ ​Prefetchers
Stride prefetcher is a practical method to reduce the penalty of memory access. However,
sometimes patterns look random to closeby address, in other words, access patterns are not
perfectly strided in many applications. In these cases, locality based prefetching can be a good
solution.
Basically, locality based prefetchers use the dynamic feedback mechanism to improve the
average performance, including accuracy, timeliness of the prefetch request as well as the cache
pollution​ ​caused​ ​by​ ​prefetch​ ​requests.
In a conventional configuration of prefetcher, prefetch degree and prefetch distance are fixed
(like the N in stride prefetcher). In the feedback mechanism, the processor dynamically change
the prefetch degree and prefetch distance to get better performance (accuracy, timeliness and
cache​ ​pollution)​ ​of​ ​the​ ​prefetch.
4.1​ ​Stream​ ​Prefetcher​ ​Design
The prefetcher is used to keep track of the access streams. There are four states for each tracking
entry:
1. Invalid: The tracking entry is not allocated a stream to keep track of. Initially, all tracking
entries​ ​are​ ​in​ ​this​ ​state.
2. Allocated: A demand (i.e. load/store) L2 miss allocates a tracking entry if the demand miss
does​ ​not​ ​find​ ​any​ ​existing​ ​tracking​ ​entry​ ​for​ ​its​ ​cache-block​ ​address.
3. Training: The prefetcher trains the direction (ascending or descending) of the stream based on
the next two L2 misses that occur +/- 16 cache blocks from the first miss.4 If the next two
accesses in the stream are to ascending (descending) addresses, the direction of the tracking entry
is​ ​set​ ​to​ ​1​ ​(0)​ ​and​ ​the​ ​entry​ ​transitions​ ​to​ ​Monitor​ ​and​ ​Request​ ​state.
4. Monitor and Request: The tracking entry monitors the accesses to a memory region from a
start pointer (address A) to an end pointer (address P). The maximum distance between the start
pointer and the end pointer is determined by Prefetch Distance, which indicates how far ahead of
the demand access stream the prefetcher can send requests. If there is a demand L2 cache access
to a cache block in the monitored memory region, the prefetcher requests cache blocks [P+1, ...,
P+N] as prefetch requests (assuming the direction of the tracking entry is set to 1). N is called the
Prefetch Degree. After sending the prefetch requests, the tracking entry starts monitoring the
memory region between addresses A+N to P+N (i.e. effectively it moves the tracked memory
region​ ​by​ ​N​ ​cache​ ​blocks)​[6]​
.
4.2​ ​Dynamically​ ​Adjusting​ ​Prefetcher​ ​Behavior
There are two ways to adjust the prefetcher behavior, the first one is adjusting prefetcher
aggressiveness​ ​and​ ​the​ ​other​ ​one​ ​is​ ​that​ ​adjusting​ ​cache​ ​insertion​ ​policy​ ​of​ ​prefetched​ ​blocks.
Prefetcher aggressiveness is determined by the prefetch degree and prefetch distance. For
example, if the prefetch is late, processor will increase the aggressiveness (prefetch degree and
distance) to reduce the latency. If cache pollution occurs, processor will decrease the
aggressiveness. Similarly, when it comes to the insertion policy of the prefetched blocks, if
prefetched blocks cause cache pollution, we will not insert these blocks into the MRU
(Most-Recently-Used)​ ​position​ ​of​ ​the​ ​LRU​ ​stack.
5.​ ​Sequential​ ​Prefetching
This is the simplest and the most basic of all the pre-fetching methods and takes advantage of
spatial locality. It prefetches the next ‘N’ cache lines after a demand access or a demand miss.
[10] It is simple to implement as it does not require sophisticated pattern detection. It works well
for sequential/streaming access patterns. However, the demand for memory access is not always
sequential in nature and if the access patterns are irregular in nature, which is very often the case,
it ends up wasting a lot of bandwidth. It is not very suitable for any randomized patterns other
than sequential: consider a case in which the program is traversing memory from higher to lower
addresses – the sequential prefetching method would end up prefetching ‘previous’ N cache lines
increasing​ ​the​ ​odds​ ​of​ ​cache​ ​pollution​ ​and​ ​unnecessary​ ​stall​ ​cycles.
Sequential prefetching is based upon the assumption that cache misses exhibit high spatial
locality; if the processor accesses block B, the probability is high that the processor will also
reference block B+1. Therefore, if the cache experiences a miss on block B, it prefetches blocks
B+1, B+2,..., B+d, if these blocks are not already in the cache, where d is referred to as the
degree of prefetching. Sequential prefetching has been shown to perform well for parallel
applications​ ​on​ ​shared-memory​ ​multiprocessors.​ ​[12]
6.​ ​Simulation
Due to the time limit, we work on the simulation of four of five methods mentioned above,
sequential prefetching, PC-Based stride prefetching, Address-Based prefetching as well as
stream​ ​buffer​ ​prefetching.
6.1​ ​Sequential​ ​Prefetching
For the sequential prefetching, the main modification is in cache.c. We modify the function
cache_create() to implement the sequential prefetch by adding one parameter which is the size of
prefetching.
6.2​ ​PC-Based​ ​Stride​ ​Prefetching
The file ‘cache.c’ undergoes major modification as a large body of code is added to implement
the pre-fetching method. Corresponding changes are made in the ‘cache.h’ to ensure all functions
are declared and sim-outorder.c also undergoes certain minor modifications with respect to a few
function calls involved with the fetching aspect of the pipeline. After debugging, the code is run
on three benchmarks – ​anagram, go ​and gcc for extracting appropriate performance metrics and
using them for comparison against the other techniques. Since the cache.c file modifications are
key in the implementation of the stride-prefetching method on simplescalar, they are discussed
here​ ​in​ ​more​ ​detail:
The reference prediction table is designed to have ​64 ​entries. When
we have an address that has generated a ​miss​, the address is
checked against all the entries of the reference prediction table.
This is a computationally expensive step due to a linear run
through of all the entries of the table. In the event of a ​non-match​,
the instruction that caused the miss is entered into the RPT and the
initial​ ​state​ ​is​ ​set​ ​to​ ​​no-prefetch​.
If there is a ​match​, the pre-fetching mechanism is triggered and
new address calculation along with the calculated stride from the RPT aid in the prefetching.
How the stride is calculated or updated is based on the ‘state’ of the mechanism depending upon
past prediction patterns and the correctness of the prediction, as shown by the ​state table in
section 2.1​. The four states are ​initial, steady, transient and ​no-prefetch​. The state table can be
translated into code logic by either using ​switch-case​, as is shown in the first image or by using
simple​ ​​if-else​,​ ​as​ ​shown​ ​in​ ​the​ ​second​ ​image.
​ ​
The image on the left is from reference [11] and the image on the right is an actual code snippet;
both implementing a portion of the state table. While the left image is relatively easy to parse,
the right one refers to a variable ‘statue’ which is actually the state of the process. ‘Statue’ values
0,1,2, and 3 refer to initial, steady, transient and non-prefetch states. The if-else logic transfers
the logic to the next relevant state according to the state table in 2.1 depending on whether the
stride prediction was correct/incorrect for the past logic, its present/past state, and also updates
the stride if necessary. The pre-fetched data is not placed directly into the cache, but is placed in
a​ ​buffer.
6.3​ ​Address-Based​ ​Stride​ ​Prefetching
All the additional logic needed to implement the Address based Stride Prefetching is added in the
“cache.c” file and the appropriate declarations are added in “cache.h” files of simplescalar. First,
A​ ​structure​ ​is​ ​defined​ ​to​ ​create​ ​a​ ​stride​ ​prediction​ ​table​ ​as​ ​shown​ ​below.
We​ ​have​ ​used​ ​8​ ​entry​ ​stride​ ​prediction​ ​table,​ ​so​ ​that​ ​the​ ​time​ ​delay​ ​to​ ​search​ ​the​ ​strides​ ​and
miss_address​ ​is​ ​not​ ​significant.​ ​These​ ​spt​ ​entries​ ​are​ ​searched​ ​for​ ​predicted​ ​address​ ​to​ ​match​ ​and
the​ ​state​ ​value​ ​is​ ​updated​ ​on​ ​every​ ​search.
The​ ​predict_addr​ ​is​ ​calculated​ ​by​ ​adding​ ​the​ ​corresponding​ ​stride​ ​value​ ​to​ ​the​ ​miss_addr.
I.e​ ​predict_addr[i]​ ​=​ ​miss_addr​ ​+​ ​stride[i]​ ​where​ ​i​ ​can​ ​vary​ ​from​ ​1to​ ​8.
The​ ​predict_addr​ ​is​ ​then​ ​compared​ ​to​ ​the​ ​input​ ​addr​ ​to​ ​the​ ​cache​ ​l2​ ​and​ ​the​ ​correct​ ​flag​ ​is​ ​written
and​ ​then,​ ​states​ ​are​ ​updated​ ​according​ ​the​ ​code​ ​below.
All​ ​the​ ​steady​ ​entries​ ​need​ ​to​ ​be​ ​moved​ ​into​ ​a​ ​table​ ​called​ ​Steady​ ​stride​ ​table​ ​from​ ​which​ ​the
prefetching​ ​is​ ​done​ ​to​ ​separate​ ​prefetch​ ​buffer.​ ​The​ ​size​ ​of​ ​the​ ​stredy​ ​stride​ ​table​ ​used​ ​in​ ​our
project​ ​is​ ​64.​ ​The​ ​rest​ ​modifications​ ​needed​ ​to​ ​prefetch​ ​are​ ​similar​ ​to​ ​the​ ​changes​ ​described​ ​in
the​ ​PC​ ​based​ ​stride​ ​prefetching​ ​mechanism.
6.4​ ​Stream​ ​Buffer​ ​Based​ ​Prefetching
The​ ​core​ ​code​ ​is​ ​located​ ​in​ ​the​ ​function​ ​cache_prefetch_block​ ​in​ ​the​ ​cache.c​ ​file.
if(cp->nsets​ ​>​ ​1000)
cache_prefetch_block(cp,​ ​addr,​ ​0);
And we add two line in cache access function. If it is a miss in L2 cache(size = 1024), we need to
check​ ​stream​ ​buffer​ ​and​ ​prefetch​ ​data.
static​ ​struct​ ​cache_blk_t​ ​oneblk;
static​ ​md_addr_t​ ​oneaddr;
We only use static variables to record address and content of the block. So it is a 1 block size
stream​ ​buffer.
If​ ​it​ ​is​ ​a​ ​hit​ ​in​ ​stream​ ​buffer,​ ​we​ ​record​ ​this​ ​hit​ ​and​ ​return​ ​hit​ ​time.​ ​Do​ ​nothing​ ​to​ ​the​ ​buffer.
If it is a miss in stream buffer. We first check whether we should write back the previous block.
Then prefetch the next block by address plus block size. During this process, we need longer
time​ ​and​ ​to​ ​​ ​handle​ ​flag​ ​bit.
7.Results​ ​and​ ​Analysis
To run the simulation, we specify the cache as L2 instruction and data cache and the prefetching
is​ ​implemented​ ​in​ ​the​ ​L2​ ​data​ ​cache​ ​which​ ​is​ ​two​ ​way​ ​associative.​ ​LRU​ ​replace​ ​policy​ ​is​ ​used.
The​ ​miss​ ​rates​ ​of​ ​four​ ​different​ ​methods​ ​are​ ​shown​ ​below.
From​ ​the​ ​results​ ​shown​ ​above,​ ​we​ ​can​ ​see​ ​that​ ​stream​ ​buffer​ ​based​ ​prefetching​ ​have​ ​the​ ​best
performance​ ​on​ ​miss​ ​rate.​ ​Stride​ ​methods​ ​almost​ ​have​ ​the​ ​same​ ​performance​ ​with​ ​the​ ​sequential
prefetching.​ ​This​ ​might​ ​because​ ​we​ ​just​ ​chose​ ​to​ ​use​ ​one​ ​stride​ ​in​ ​our​ ​implementation​ ​rather​ ​than
N​ ​strides.​ ​It’s​ ​likely​ ​that​ ​the​ ​stride​ ​methods​ ​will​ ​provide​ ​better​ ​performance​ ​if​ ​we​ ​increase​ ​the
stride​ ​number.
The​ ​IPC​ ​of​ ​four​ ​different​ ​methods​ ​are​ ​shown​ ​below.​ ​It’s​ ​obvious​ ​that​ ​two​ ​stride​ ​methods​ ​have
lower​ ​IPC​ ​because​ ​there​ ​are​ ​more​ ​computation​ ​and​ ​hardware​ ​access​ ​involved.​ ​For​ ​the​ ​sequential
prefetching​ ​and​ ​stream​ ​buffer​ ​based​ ​prefetching,​ ​their​ ​logic​ ​are​ ​simpler​ ​and​ ​thus​ ​lead​ ​to​ ​higher
IPC.
We​ ​also​ ​dive​ ​into​ ​the​ ​sequential​ ​prefetching​ ​method​ ​and​ ​try​ ​to​ ​explore​ ​the​ ​effect​ ​of​ ​prefetching
size.​ ​The​ ​results​ ​of​ ​Go​ ​benchmark​ ​is​ ​shown​ ​below.
From​ ​the​ ​results​ ​we​ ​can​ ​easily​ ​find​ ​out​ ​that​ ​as​ ​the​ ​prefetching​ ​size​ ​increase,​ ​the​ ​missing​ ​rate​ ​is
going​ ​down​ ​and​ ​it’s​ ​because​ ​we​ ​prefetch​ ​more​ ​blocks​ ​at​ ​one​ ​time.​ ​However​ ​we​ ​also​ ​notice​ ​that
as​ ​the​ ​prefetching​ ​size​ ​increase​ ​there​ ​is​ ​one​ ​point​ ​that​ ​miss​ ​rate​ ​is​ ​going​ ​up​ ​again.​ ​This​ ​is​ ​more
likely​ ​when​ ​a​ ​cache​ ​pollution​ ​happens.​ ​It​ ​basically​ ​means​ ​that​ ​we​ ​prefetch​ ​too​ ​many​ ​blocks​ ​at​ ​a
time​ ​and​ ​the​ ​useful​ ​cache​ ​blocks​ ​are​ ​flushed​ ​away​ ​every​ ​time​ ​there​ ​is​ ​a​ ​cache​ ​miss.
8.​ ​References
[1][2] Smith, Alan Jay (1982-09-01). "Cache Memories". ACM Comput. Surv. 14 (3): 473–530.
ISSN​ ​0360-0300.
[3] F. Dahlgren, M. Dubois, and P. Stenström, ``Fixed and adaptive sequential prefetching in
shared​ ​memory​ ​multiprocessors.,''​ ​in​ ​ICPP,​ ​pp.​ ​56-63,​ ​1993
[4] Mittal, Sparsh (2016-08-01). "A Survey of Recent Prefetching Techniques for Processor
Caches".​ ​ACM​ ​Comput.​ ​Surv.​ ​49​ ​(2):​ ​35:1–35:35.​ ​ISSN​ ​0360-0300.
[5] G. Hariprakash, R. Achutharaman and A. R. Omondi, "DStride: data-cache
miss-address-based stride prefetching scheme for multimedia processors," ​Proceedings 6th
Australasian Computer Systems Architecture Conference. ACSAC 2001​, Gold Coast, Qld., 2001,
pp.​ ​62-70.
[6] Srinath et al., “Feedback Directed Prefetching: Improving the Performance and
Bandwidth-Efficiency​ ​of​ ​Hardware​ ​Prefetchers​ ​“,​ ​HPCA​ ​2007.
[7] Dahlgren and Stenstrom, “Effectiveness of Hardware-Based Stride and Sequential
Prefetching in Shared-Memory Multiprocessors,” High performance computer architecture,
1995.
[8] Baer and Chen, “An effective on-chip preloading scheme to reduce data access penalty,” SC
1991
[9] Kim and Viedenbaum, “Stride directed Prefetching for Secondary Caches”, Parallel
Processing,​ ​1997.
[10]​ ​Prof.​ ​Onur​ ​Mutlu,​ ​“Computer​ ​Architecture:​ ​Prefetching​ ​II”,
http://www.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php?media=wiki:lectures:onur-740-fall11-lec
ture27-prefetching-ii-afterlecture.pdf
[11] ​Grannæs, M. (2006). ​Bandwidth-Aware Prefetching in Chip Multiprocessors​. [online]
Norwegian University of Science and Technology Department of Computer and Information
Science. Available at: https://daim.idi.ntnu.no/masteroppgaver/001/1507/masteroppgave.pdf
[Accessed​ ​21​ ​Dec.​ ​2017].
[12] ​Dahlgren, F. and Stenstrom, P. (1996). Evaluation of hardware-based stride and sequential
prefetching in shared-memory multiprocessors. ​IEEE Transactions on Parallel and Distributed
Systems​,​ ​7(4),​ ​pp.385-398.

More Related Content

What's hot

Cloud schedulers and Scheduling in Hadoop
Cloud schedulers and Scheduling in HadoopCloud schedulers and Scheduling in Hadoop
Cloud schedulers and Scheduling in HadoopPallav Jha
 
Lect09 adv-branch-prediction
Lect09 adv-branch-predictionLect09 adv-branch-prediction
Lect09 adv-branch-predictionGour Rakesh
 
[2009 11-09] branch prediction
[2009 11-09] branch prediction[2009 11-09] branch prediction
[2009 11-09] branch predictionmobilevc
 
multiacess protocol
multiacess protocolmultiacess protocol
multiacess protocolKanwal Khan
 
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Evaluating Classification Algorithms Applied To Data Streams   Esteban DonatoEvaluating Classification Algorithms Applied To Data Streams   Esteban Donato
Evaluating Classification Algorithms Applied To Data Streams Esteban DonatoEsteban Donato
 
RTH-RSS Mac: Path loss exponent estimation with received signal strength loca...
RTH-RSS Mac: Path loss exponent estimation with received signal strength loca...RTH-RSS Mac: Path loss exponent estimation with received signal strength loca...
RTH-RSS Mac: Path loss exponent estimation with received signal strength loca...IOSR Journals
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Sreerag parallel programming
Sreerag   parallel programmingSreerag   parallel programming
Sreerag parallel programmingSreerag Gopinath
 
Design and analysis of a model predictive controller for active queue management
Design and analysis of a model predictive controller for active queue managementDesign and analysis of a model predictive controller for active queue management
Design and analysis of a model predictive controller for active queue managementISA Interchange
 
Sliding window protocol
Sliding window protocolSliding window protocol
Sliding window protocolRishu Seth
 
Tcp performance simulationsusingns2
Tcp performance simulationsusingns2Tcp performance simulationsusingns2
Tcp performance simulationsusingns2Justin Frankel
 
All-Reduce and Prefix-Sum Operations
All-Reduce and Prefix-Sum Operations All-Reduce and Prefix-Sum Operations
All-Reduce and Prefix-Sum Operations Syed Zaid Irshad
 

What's hot (20)

Cloud schedulers and Scheduling in Hadoop
Cloud schedulers and Scheduling in HadoopCloud schedulers and Scheduling in Hadoop
Cloud schedulers and Scheduling in Hadoop
 
Lect09 adv-branch-prediction
Lect09 adv-branch-predictionLect09 adv-branch-prediction
Lect09 adv-branch-prediction
 
[2009 11-09] branch prediction
[2009 11-09] branch prediction[2009 11-09] branch prediction
[2009 11-09] branch prediction
 
multiacess protocol
multiacess protocolmultiacess protocol
multiacess protocol
 
Telecom KPIs definitions
Telecom KPIs definitionsTelecom KPIs definitions
Telecom KPIs definitions
 
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Evaluating Classification Algorithms Applied To Data Streams   Esteban DonatoEvaluating Classification Algorithms Applied To Data Streams   Esteban Donato
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
 
04 cache memory
04 cache memory04 cache memory
04 cache memory
 
TCP vs UDP / Sumiet23
TCP vs UDP / Sumiet23TCP vs UDP / Sumiet23
TCP vs UDP / Sumiet23
 
RTH-RSS Mac: Path loss exponent estimation with received signal strength loca...
RTH-RSS Mac: Path loss exponent estimation with received signal strength loca...RTH-RSS Mac: Path loss exponent estimation with received signal strength loca...
RTH-RSS Mac: Path loss exponent estimation with received signal strength loca...
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Sreerag parallel programming
Sreerag   parallel programmingSreerag   parallel programming
Sreerag parallel programming
 
Access to non local names
Access to non local namesAccess to non local names
Access to non local names
 
A Performance Comparison of TCP Protocols
A Performance Comparison of TCP Protocols A Performance Comparison of TCP Protocols
A Performance Comparison of TCP Protocols
 
Design and analysis of a model predictive controller for active queue management
Design and analysis of a model predictive controller for active queue managementDesign and analysis of a model predictive controller for active queue management
Design and analysis of a model predictive controller for active queue management
 
Sliding window protocol
Sliding window protocolSliding window protocol
Sliding window protocol
 
Chapter 7 Run Time Environment
Chapter 7   Run Time EnvironmentChapter 7   Run Time Environment
Chapter 7 Run Time Environment
 
Tcp performance simulationsusingns2
Tcp performance simulationsusingns2Tcp performance simulationsusingns2
Tcp performance simulationsusingns2
 
All-Reduce and Prefix-Sum Operations
All-Reduce and Prefix-Sum Operations All-Reduce and Prefix-Sum Operations
All-Reduce and Prefix-Sum Operations
 
F03201028033
F03201028033F03201028033
F03201028033
 
Computer System Architecture
Computer System ArchitectureComputer System Architecture
Computer System Architecture
 

Similar to Final report

Tdt4260 miniproject report_group_3
Tdt4260 miniproject report_group_3Tdt4260 miniproject report_group_3
Tdt4260 miniproject report_group_3Yulong Bai
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetchingHimanshu Koli
 
A QOS BASED LOAD BALANCED SWITCH
A QOS BASED LOAD BALANCED SWITCHA QOS BASED LOAD BALANCED SWITCH
A QOS BASED LOAD BALANCED SWITCHecij
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceinventy
 
Performance measures
Performance measuresPerformance measures
Performance measuresDivya Tiwari
 
from OO to Multicore SST
from OO to Multicore SSTfrom OO to Multicore SST
from OO to Multicore SSTstratusdesign
 
The effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handlingThe effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handlingcsandit
 
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...cscpconf
 
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...IJERA Editor
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System ManagementIbrahim Amer
 
Mainmemoryfinal 161019122029
Mainmemoryfinal 161019122029Mainmemoryfinal 161019122029
Mainmemoryfinal 161019122029marangburu42
 

Similar to Final report (20)

Tdt4260 miniproject report_group_3
Tdt4260 miniproject report_group_3Tdt4260 miniproject report_group_3
Tdt4260 miniproject report_group_3
 
shieh06a
shieh06ashieh06a
shieh06a
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
Oversimplified CA
Oversimplified CAOversimplified CA
Oversimplified CA
 
Minimize Staleness and Stretch in Streaming Data Warehouses
Minimize Staleness and Stretch in Streaming Data WarehousesMinimize Staleness and Stretch in Streaming Data Warehouses
Minimize Staleness and Stretch in Streaming Data Warehouses
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
StateKeeper Report
StateKeeper ReportStateKeeper Report
StateKeeper Report
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetching
 
A QOS BASED LOAD BALANCED SWITCH
A QOS BASED LOAD BALANCED SWITCHA QOS BASED LOAD BALANCED SWITCH
A QOS BASED LOAD BALANCED SWITCH
 
676.v3
676.v3676.v3
676.v3
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
Performance measures
Performance measuresPerformance measures
Performance measures
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
from OO to Multicore SST
from OO to Multicore SSTfrom OO to Multicore SST
from OO to Multicore SST
 
The effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handlingThe effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handling
 
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
 
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System Management
 
Mainmemoryfinal 161019122029
Mainmemoryfinal 161019122029Mainmemoryfinal 161019122029
Mainmemoryfinal 161019122029
 

Recently uploaded

VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 

Recently uploaded (20)

VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 

Final report

  • 1. Data​ ​Prefetching​ ​Mechanisms​ ​and​ ​Their​ ​Impact​ ​on​ ​Performance Group​ ​members: Tian​ ​Nan<​nanxx026@umn.edu​> Chang​ ​Xiong​ ​<xion1637@umn.edu> Kshiti​ ​Deolalkar​ ​<deola002@umn.edu> Vineel​ ​Kumar​ ​Reddy​ ​Gangireddygari​ ​<gangi009@umn.edu> Introduction Hardware pre-fetching is in-line with the main idea of out-of-order execution and fully exploiting instruction level parallelism. It involves utilizing specialized hardware to track load/store instructions and their access patterns in order to prefetch data based on past access behavior.​ ​[10] The graphic above clearly demonstrates how accurate and timely prefetching removes loads from the​ ​critical​ ​path. The advantages of hardware prefetching are that it does not waste instruction execution bandwidth, can be easily tuned to system implementation, and does not introduce any code portability issues. It also has better dynamic information (for instance: in case of unexpected cache conflicts) and it does not add any instruction overhead to issue prefetches. It does however increase the hardware complexity and is outperformed in terms of efficiency by software prefetching in some cases. The succeeding topics discuss four major types of hardware prefetching techniques, their algorithms along with the advantage/drawback each method provides. 1.Stream​ ​Buffer Because data prefetching is a technique for shorten the interval of data request and arrival to execution unit, we first study what CPU Cache is. One of our hardware prefetching means stream buffer based on the concept of one block lookahead (OBL) ​[1]​ . So we study OBL in the next​ ​step​ ​in​ ​order​ ​to​ ​figure​ ​out​ ​how​ ​stream​ ​buffer​ ​works.
  • 2. 1.1​ ​How​ ​does​ ​stream​ ​buffer​ ​work In this project, we implemented stream buffer for L2 cache. If there exists a miss in L2 cache, the hardware will check stream buffer. Match all the data address in stream buffer. If there is a hit in stream buffer, we just get data from stream buffer and do nothing to it. If there is a miss, we need to flush the whole buffer, get the block with corresponding address X from memory to L2 cache, and​ ​get​ ​the​ ​block​ ​with​ ​address​ ​X+1,​ ​X+2,​ ​…​ ​X+n​ ​into​ ​stream​ ​buffer.​ ​(n​ ​is​ ​the​ ​size​ ​of​ ​buffer). Stream buffer for L1 cache has same distance from functional unit, and hit time is same as L1 cache hit time. Stream buffer for L2 cache also has same hit time as L2 cache. In this circumstance, if a miss happens in L2 cache, but a hit in stream buffer, the penalty for this miss is​ ​only​ ​twice​ ​of​ ​hit​ ​time,​ ​since​ ​stream​ ​buffer​ ​hit​ ​time​ ​is​ ​same​ ​as​ ​L2​ ​cache​ ​hit​ ​time. What happens if a miss occurs in stream buffer and L2 cache: We need to flush stream buffer. We use dirty bit to check every block in stream buffer. For every block which is dirty, we need to write the block back. When getting the block from memory for L2 cache, we also need to prefetch n blocks from memory. The process is parallel, especially when the stream buffer size is small. 2.​ ​Stride​ ​prefetchers(PC​ ​based) This approach involves recording the ‘stride’ of a load instruction, i.e. the distance between the memory addresses referenced by the instruction. The last address referenced by the load instruction is also recorded. Then, the next time the load instruction is fetched, we also prefetch the ​last address + stride of the same load instruction. A ‘reference prediction table’ (RPT) is required​ ​for​ ​the​ ​tracking​ ​process.​ ​[8] 2.1​ ​Reference​ ​prediction​ ​table:​ ​entries​ ​and​ ​algorithm This is an I-detection scheme since it requires the address of the instruction, i.e. the program counter. The reference prediction table is organized as a Cache. An example RPT would be, say, a​ ​256-entry​ ​direct​ ​mapped​ ​cache​ ​as​ ​in​ ​[7]. -Whenever there is a cache miss, the instruction address of the load instruction that is causing the cache​ ​miss​ ​is​ ​compared​ ​against​ ​the​ ​entries​ ​in​ ​the​ ​reference​ ​prediction​ ​table. -If there is no match, the instruction is entered into the RPT along with all of its information required​ ​for​ ​tracking.​ ​The​ ​state​ ​is​ ​initially​ ​set​ ​to​ ​no-prefetch. -When a read miss is obtained for the same instruction again, we get a match with the RPT. The stride is then calculated, the new address as well as the stride is added to the RPT. The state is now​ ​set​ ​to​ ​prefetch.
  • 3. The​ ​below​ ​figure​ ​represents​ ​the​ ​process​ ​in​ ​the​ ​RPT. The state transition diagram representing course of action in case of a correct or an incorrect prediction​ ​is​ ​as​ ​follows: The state is set to initial when the prefetching begins, and the stride has been calculated. To check for correctness, the read requests in the second-level cache are matched with the RPT entries. When an access to the same stride sequence is generated three times in a row, the state becomes ‘steady’. ‘Transient’ state is when there has been a second incorrect prediction in a row. The​ ​no-prefetch​ ​state​ ​helps​ ​reduce​ ​the​ ​number​ ​of​ ​unnecessary​ ​prefetches. The length of a stride sequence is unknown in our case, so different schemes may determine the number of blocks to be prefetched. The number of blocks would also depend on the architecture in​ ​use. 2.2​ ​Hardware​ ​requirements The block diagram below encompasses the complete prefetching process, and is hence a reliable source​ ​for​ ​deriving​ ​the​ ​hardware​ ​requirements​ ​the​ ​technique​ ​would​ ​entail.​ ​[9] The prediction table is required to store several entries consisting of tracking details for the load instructions. Increasing the number of entries can aid in better prefetch but beyond a certain limit
  • 4. would yield diminishing returns. The other units are either for address calculation or comparison. The comparator can be considered as an expensive operation since it would require bit by bit comparison. The drawback of this method includes the limitation on how far we can get ahead and the amount of miss latency the prefetch can cover. In addition, this technique is not fast enough, initiating the prefetch when the load is fetched the next time can be too late since the load will access the data cache soon after it is fetched. Potential solutions to overcome these drawbacks include using lookahead PC to index the prefetcher table, prefetch further ahead (last address + N*stride)​ ​or​ ​to​ ​generate​ ​multiple​ ​prefetches. 3.​ ​Stride​ ​prefetchers(Cache​ ​block​ ​address​ ​based) Cache Block Address based prefetchers operate based on a stride associated with a memory address/region rather than strides associated with an instruction, as in PC based prefetchers. i.e. access patterns in a particular memory location may look line A, A+N, A+2*N where A is the address of memory location accessed first and N denotes the stride associated with that access pattern to that memory region. So, there can be multiple strides for a particular instruction, each associated​ ​with​ ​different​ ​memory​ ​regions​ ​that​ ​the​ ​respective​ ​instruction​ ​tried​ ​to​ ​access. A normal PC-based stride predictors rely on PC to identify the stride when a loop repeats and so cannot calculate stride of memory accesses that are not enclosed in a loop. Also, a stream buffer based​ ​prefetcher​ ​is​ ​a​ ​special​ ​case​ ​of​ ​this​ ​Stride​ ​Prefetchers​ ​where​ ​the​ ​stride​ ​value​ ​is​ ​1. 3.1​ ​​ ​Design​ ​of​ ​Address​ ​Based​ ​Stride​ ​Predictor Current Design is based on Dstride Prefetcher​[5] ​ which consists of Prefetch Buffer, Prediction Logic​ ​and​ ​Prefetching​ ​logic​ ​as​ ​shown​ ​in​ ​the​ ​block​ ​diagram​ ​below.
  • 5. Block​ ​Diagram 3.1.1​ ​Prefetch​ ​Buffer This holds the predicted data and is opted to be isolated from the L1 cache in order to avoid cache pollution. Both the L1 cache and the PB are searched in parallel and if the address is matched in PB, the data is sent to the pipeline and the prefetching logic is informed about the match​ ​and​ ​prefetches​ ​for​ ​any​ ​steady​ ​strides​ ​are​ ​initiated. 3.1.2​ ​Prediction​ ​Logic It consists of SPT (Stride prediction Table) that predicts strides by comparing current miss address​ ​with​ ​recent​ ​miss​ ​address.​ ​The​ ​SPT​ ​has​ ​the​ ​following​ ​entries ● Miss​ ​address,​ ​which​ ​holds​ ​a​ ​data-cache​ ​miss​ ​address; ● Strides,​ ​which​ ​hold​ ​a​ ​fixed​ ​number​ ​of​ ​possible​ ​strides​ ​(4​ ​in​ ​our​ ​case). ● Predicted​ ​Address​ ​corresponding​ ​to​ ​each​ ​stride ● Two state bits that are used to determine the steadiness of the stride predicted. An FSM can​ ​be​ ​constructed​ ​as​ ​following
  • 6. Initial state is set when a predicted address is calculated which moves on to transient if missed address is not same as predicted address, else to steady state. When the state of a stride is no prediction,​ ​the​ ​predictor​ ​does​ ​not​ ​initiate​ ​any​ ​prefetches​ ​to​ ​that​ ​particular​ ​memory​ ​location. 3.1.3​ ​Prefetching​ ​Logic A SST (Stride steady Table) and Prefetch Queue are present. When the SPT detects a stride to be steady, it is moved into SST and a prefetch request for next address is queued onto Prefetch Queue. Each​ ​and​ ​every​ ​stride​ ​that​ ​was​ ​predicted​ ​to​ ​be​ ​steady​ ​has​ ​an​ ​entry​ ​in​ ​SST​ ​with​ ​following​ ​fields ● PA:​ ​predicted​ ​address ● LA-PA:​ ​Look​ ​Ahead​ ​Predicted​ ​Address ● Flag:​ ​To​ ​indicate​ ​whether​ ​the​ ​data​ ​prefetch​ ​is​ ​complete. ● nLA-PA:​ ​Next​ ​Look​ ​Ahead​ ​Predicted​ ​Address. ● Stride:​ ​​ ​The​ ​stride​ ​value​ ​that​ ​was​ ​found​ ​to​ ​be​ ​steady. ● LA-Dist:​ ​Look​ ​Ahead​ ​Stride​ ​Distance,​ ​it​ ​denotes​ ​how​ ​far​ ​we​ ​are​ ​prefetching. So, when there is a hit in prefetch Buffer, the PA is updated with LA-PD and the LA-PD is updated with nLA-PD. If the data has been fetched then the flag is set to IDLE state, else PENDING state. Pending indicates that the data access is fast compared to the Prefetching. In such​ ​case,​ ​the​ ​nLA-PD​ ​is​ ​calculated​ ​as ​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​nLA-PD​ ​=​ ​LA-PD​ ​+​ ​stride*n​ ​​ ​(n​ ​is​ ​not​ ​equal​ ​to​ ​one) Instead​ ​of​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​nLA-PD​ ​=​ ​LA-PD​ ​+​ ​stride*1 This is how the Address based Stride buffer can overcome the disadvantage of the Stream Buffers​ ​where​ ​the​ ​stride​ ​is​ ​unity. To improve the prediction accuracy, the prefetch requests are not issued for data that is already present in the data-cache or in the Prefetch Buffer. When the prefetch address raises a trap for address translation, or when an out-of-range address is issued by the prefetching logic, the prefect​ ​request​ ​is​ ​discarded.​ ​A​ ​pseudo-LRU​ ​policy​ ​is​ ​used​ ​for​ ​replacing​ ​SST​ ​entries. Prefetch​ ​Queue​ ​maintains​ ​a​ ​table​ ​of​ ​requests​ ​with​ ​each​ ​entry​ ​looking​ ​as​ ​below The Demand type data requests are processed with a higher priority over the prediction based data requests. This is because the latency that might arise due to a stalling on demand might not be​ ​overcomed​ ​by​ ​the​ ​latency​ ​gain​ ​due​ ​to​ ​prefetching​ ​data.
  • 7. 4.​ ​Locality​ ​Based​ ​Prefetchers Stride prefetcher is a practical method to reduce the penalty of memory access. However, sometimes patterns look random to closeby address, in other words, access patterns are not perfectly strided in many applications. In these cases, locality based prefetching can be a good solution. Basically, locality based prefetchers use the dynamic feedback mechanism to improve the average performance, including accuracy, timeliness of the prefetch request as well as the cache pollution​ ​caused​ ​by​ ​prefetch​ ​requests. In a conventional configuration of prefetcher, prefetch degree and prefetch distance are fixed (like the N in stride prefetcher). In the feedback mechanism, the processor dynamically change the prefetch degree and prefetch distance to get better performance (accuracy, timeliness and cache​ ​pollution)​ ​of​ ​the​ ​prefetch. 4.1​ ​Stream​ ​Prefetcher​ ​Design The prefetcher is used to keep track of the access streams. There are four states for each tracking entry: 1. Invalid: The tracking entry is not allocated a stream to keep track of. Initially, all tracking entries​ ​are​ ​in​ ​this​ ​state. 2. Allocated: A demand (i.e. load/store) L2 miss allocates a tracking entry if the demand miss does​ ​not​ ​find​ ​any​ ​existing​ ​tracking​ ​entry​ ​for​ ​its​ ​cache-block​ ​address. 3. Training: The prefetcher trains the direction (ascending or descending) of the stream based on the next two L2 misses that occur +/- 16 cache blocks from the first miss.4 If the next two accesses in the stream are to ascending (descending) addresses, the direction of the tracking entry is​ ​set​ ​to​ ​1​ ​(0)​ ​and​ ​the​ ​entry​ ​transitions​ ​to​ ​Monitor​ ​and​ ​Request​ ​state. 4. Monitor and Request: The tracking entry monitors the accesses to a memory region from a start pointer (address A) to an end pointer (address P). The maximum distance between the start pointer and the end pointer is determined by Prefetch Distance, which indicates how far ahead of the demand access stream the prefetcher can send requests. If there is a demand L2 cache access to a cache block in the monitored memory region, the prefetcher requests cache blocks [P+1, ..., P+N] as prefetch requests (assuming the direction of the tracking entry is set to 1). N is called the Prefetch Degree. After sending the prefetch requests, the tracking entry starts monitoring the memory region between addresses A+N to P+N (i.e. effectively it moves the tracked memory region​ ​by​ ​N​ ​cache​ ​blocks)​[6]​ .
  • 8. 4.2​ ​Dynamically​ ​Adjusting​ ​Prefetcher​ ​Behavior There are two ways to adjust the prefetcher behavior, the first one is adjusting prefetcher aggressiveness​ ​and​ ​the​ ​other​ ​one​ ​is​ ​that​ ​adjusting​ ​cache​ ​insertion​ ​policy​ ​of​ ​prefetched​ ​blocks. Prefetcher aggressiveness is determined by the prefetch degree and prefetch distance. For example, if the prefetch is late, processor will increase the aggressiveness (prefetch degree and distance) to reduce the latency. If cache pollution occurs, processor will decrease the aggressiveness. Similarly, when it comes to the insertion policy of the prefetched blocks, if prefetched blocks cause cache pollution, we will not insert these blocks into the MRU (Most-Recently-Used)​ ​position​ ​of​ ​the​ ​LRU​ ​stack. 5.​ ​Sequential​ ​Prefetching This is the simplest and the most basic of all the pre-fetching methods and takes advantage of spatial locality. It prefetches the next ‘N’ cache lines after a demand access or a demand miss. [10] It is simple to implement as it does not require sophisticated pattern detection. It works well for sequential/streaming access patterns. However, the demand for memory access is not always sequential in nature and if the access patterns are irregular in nature, which is very often the case, it ends up wasting a lot of bandwidth. It is not very suitable for any randomized patterns other than sequential: consider a case in which the program is traversing memory from higher to lower addresses – the sequential prefetching method would end up prefetching ‘previous’ N cache lines increasing​ ​the​ ​odds​ ​of​ ​cache​ ​pollution​ ​and​ ​unnecessary​ ​stall​ ​cycles. Sequential prefetching is based upon the assumption that cache misses exhibit high spatial locality; if the processor accesses block B, the probability is high that the processor will also reference block B+1. Therefore, if the cache experiences a miss on block B, it prefetches blocks B+1, B+2,..., B+d, if these blocks are not already in the cache, where d is referred to as the degree of prefetching. Sequential prefetching has been shown to perform well for parallel applications​ ​on​ ​shared-memory​ ​multiprocessors.​ ​[12] 6.​ ​Simulation Due to the time limit, we work on the simulation of four of five methods mentioned above, sequential prefetching, PC-Based stride prefetching, Address-Based prefetching as well as stream​ ​buffer​ ​prefetching.
  • 9. 6.1​ ​Sequential​ ​Prefetching For the sequential prefetching, the main modification is in cache.c. We modify the function cache_create() to implement the sequential prefetch by adding one parameter which is the size of prefetching. 6.2​ ​PC-Based​ ​Stride​ ​Prefetching The file ‘cache.c’ undergoes major modification as a large body of code is added to implement the pre-fetching method. Corresponding changes are made in the ‘cache.h’ to ensure all functions are declared and sim-outorder.c also undergoes certain minor modifications with respect to a few function calls involved with the fetching aspect of the pipeline. After debugging, the code is run on three benchmarks – ​anagram, go ​and gcc for extracting appropriate performance metrics and using them for comparison against the other techniques. Since the cache.c file modifications are key in the implementation of the stride-prefetching method on simplescalar, they are discussed here​ ​in​ ​more​ ​detail: The reference prediction table is designed to have ​64 ​entries. When we have an address that has generated a ​miss​, the address is checked against all the entries of the reference prediction table. This is a computationally expensive step due to a linear run through of all the entries of the table. In the event of a ​non-match​, the instruction that caused the miss is entered into the RPT and the initial​ ​state​ ​is​ ​set​ ​to​ ​​no-prefetch​. If there is a ​match​, the pre-fetching mechanism is triggered and new address calculation along with the calculated stride from the RPT aid in the prefetching. How the stride is calculated or updated is based on the ‘state’ of the mechanism depending upon past prediction patterns and the correctness of the prediction, as shown by the ​state table in section 2.1​. The four states are ​initial, steady, transient and ​no-prefetch​. The state table can be translated into code logic by either using ​switch-case​, as is shown in the first image or by using simple​ ​​if-else​,​ ​as​ ​shown​ ​in​ ​the​ ​second​ ​image.
  • 10. ​ ​ The image on the left is from reference [11] and the image on the right is an actual code snippet; both implementing a portion of the state table. While the left image is relatively easy to parse, the right one refers to a variable ‘statue’ which is actually the state of the process. ‘Statue’ values 0,1,2, and 3 refer to initial, steady, transient and non-prefetch states. The if-else logic transfers the logic to the next relevant state according to the state table in 2.1 depending on whether the stride prediction was correct/incorrect for the past logic, its present/past state, and also updates the stride if necessary. The pre-fetched data is not placed directly into the cache, but is placed in a​ ​buffer. 6.3​ ​Address-Based​ ​Stride​ ​Prefetching All the additional logic needed to implement the Address based Stride Prefetching is added in the “cache.c” file and the appropriate declarations are added in “cache.h” files of simplescalar. First, A​ ​structure​ ​is​ ​defined​ ​to​ ​create​ ​a​ ​stride​ ​prediction​ ​table​ ​as​ ​shown​ ​below. We​ ​have​ ​used​ ​8​ ​entry​ ​stride​ ​prediction​ ​table,​ ​so​ ​that​ ​the​ ​time​ ​delay​ ​to​ ​search​ ​the​ ​strides​ ​and miss_address​ ​is​ ​not​ ​significant.​ ​These​ ​spt​ ​entries​ ​are​ ​searched​ ​for​ ​predicted​ ​address​ ​to​ ​match​ ​and the​ ​state​ ​value​ ​is​ ​updated​ ​on​ ​every​ ​search. The​ ​predict_addr​ ​is​ ​calculated​ ​by​ ​adding​ ​the​ ​corresponding​ ​stride​ ​value​ ​to​ ​the​ ​miss_addr.
  • 11. I.e​ ​predict_addr[i]​ ​=​ ​miss_addr​ ​+​ ​stride[i]​ ​where​ ​i​ ​can​ ​vary​ ​from​ ​1to​ ​8. The​ ​predict_addr​ ​is​ ​then​ ​compared​ ​to​ ​the​ ​input​ ​addr​ ​to​ ​the​ ​cache​ ​l2​ ​and​ ​the​ ​correct​ ​flag​ ​is​ ​written and​ ​then,​ ​states​ ​are​ ​updated​ ​according​ ​the​ ​code​ ​below. All​ ​the​ ​steady​ ​entries​ ​need​ ​to​ ​be​ ​moved​ ​into​ ​a​ ​table​ ​called​ ​Steady​ ​stride​ ​table​ ​from​ ​which​ ​the prefetching​ ​is​ ​done​ ​to​ ​separate​ ​prefetch​ ​buffer.​ ​The​ ​size​ ​of​ ​the​ ​stredy​ ​stride​ ​table​ ​used​ ​in​ ​our project​ ​is​ ​64.​ ​The​ ​rest​ ​modifications​ ​needed​ ​to​ ​prefetch​ ​are​ ​similar​ ​to​ ​the​ ​changes​ ​described​ ​in the​ ​PC​ ​based​ ​stride​ ​prefetching​ ​mechanism. 6.4​ ​Stream​ ​Buffer​ ​Based​ ​Prefetching The​ ​core​ ​code​ ​is​ ​located​ ​in​ ​the​ ​function​ ​cache_prefetch_block​ ​in​ ​the​ ​cache.c​ ​file. if(cp->nsets​ ​>​ ​1000) cache_prefetch_block(cp,​ ​addr,​ ​0); And we add two line in cache access function. If it is a miss in L2 cache(size = 1024), we need to check​ ​stream​ ​buffer​ ​and​ ​prefetch​ ​data. static​ ​struct​ ​cache_blk_t​ ​oneblk; static​ ​md_addr_t​ ​oneaddr;
  • 12. We only use static variables to record address and content of the block. So it is a 1 block size stream​ ​buffer. If​ ​it​ ​is​ ​a​ ​hit​ ​in​ ​stream​ ​buffer,​ ​we​ ​record​ ​this​ ​hit​ ​and​ ​return​ ​hit​ ​time.​ ​Do​ ​nothing​ ​to​ ​the​ ​buffer. If it is a miss in stream buffer. We first check whether we should write back the previous block. Then prefetch the next block by address plus block size. During this process, we need longer time​ ​and​ ​to​ ​​ ​handle​ ​flag​ ​bit.
  • 13. 7.Results​ ​and​ ​Analysis To run the simulation, we specify the cache as L2 instruction and data cache and the prefetching is​ ​implemented​ ​in​ ​the​ ​L2​ ​data​ ​cache​ ​which​ ​is​ ​two​ ​way​ ​associative.​ ​LRU​ ​replace​ ​policy​ ​is​ ​used. The​ ​miss​ ​rates​ ​of​ ​four​ ​different​ ​methods​ ​are​ ​shown​ ​below. From​ ​the​ ​results​ ​shown​ ​above,​ ​we​ ​can​ ​see​ ​that​ ​stream​ ​buffer​ ​based​ ​prefetching​ ​have​ ​the​ ​best performance​ ​on​ ​miss​ ​rate.​ ​Stride​ ​methods​ ​almost​ ​have​ ​the​ ​same​ ​performance​ ​with​ ​the​ ​sequential prefetching.​ ​This​ ​might​ ​because​ ​we​ ​just​ ​chose​ ​to​ ​use​ ​one​ ​stride​ ​in​ ​our​ ​implementation​ ​rather​ ​than N​ ​strides.​ ​It’s​ ​likely​ ​that​ ​the​ ​stride​ ​methods​ ​will​ ​provide​ ​better​ ​performance​ ​if​ ​we​ ​increase​ ​the stride​ ​number. The​ ​IPC​ ​of​ ​four​ ​different​ ​methods​ ​are​ ​shown​ ​below.​ ​It’s​ ​obvious​ ​that​ ​two​ ​stride​ ​methods​ ​have lower​ ​IPC​ ​because​ ​there​ ​are​ ​more​ ​computation​ ​and​ ​hardware​ ​access​ ​involved.​ ​For​ ​the​ ​sequential prefetching​ ​and​ ​stream​ ​buffer​ ​based​ ​prefetching,​ ​their​ ​logic​ ​are​ ​simpler​ ​and​ ​thus​ ​lead​ ​to​ ​higher IPC.
  • 14. We​ ​also​ ​dive​ ​into​ ​the​ ​sequential​ ​prefetching​ ​method​ ​and​ ​try​ ​to​ ​explore​ ​the​ ​effect​ ​of​ ​prefetching size.​ ​The​ ​results​ ​of​ ​Go​ ​benchmark​ ​is​ ​shown​ ​below. From​ ​the​ ​results​ ​we​ ​can​ ​easily​ ​find​ ​out​ ​that​ ​as​ ​the​ ​prefetching​ ​size​ ​increase,​ ​the​ ​missing​ ​rate​ ​is going​ ​down​ ​and​ ​it’s​ ​because​ ​we​ ​prefetch​ ​more​ ​blocks​ ​at​ ​one​ ​time.​ ​However​ ​we​ ​also​ ​notice​ ​that as​ ​the​ ​prefetching​ ​size​ ​increase​ ​there​ ​is​ ​one​ ​point​ ​that​ ​miss​ ​rate​ ​is​ ​going​ ​up​ ​again.​ ​This​ ​is​ ​more
  • 15. likely​ ​when​ ​a​ ​cache​ ​pollution​ ​happens.​ ​It​ ​basically​ ​means​ ​that​ ​we​ ​prefetch​ ​too​ ​many​ ​blocks​ ​at​ ​a time​ ​and​ ​the​ ​useful​ ​cache​ ​blocks​ ​are​ ​flushed​ ​away​ ​every​ ​time​ ​there​ ​is​ ​a​ ​cache​ ​miss. 8.​ ​References [1][2] Smith, Alan Jay (1982-09-01). "Cache Memories". ACM Comput. Surv. 14 (3): 473–530. ISSN​ ​0360-0300. [3] F. Dahlgren, M. Dubois, and P. Stenström, ``Fixed and adaptive sequential prefetching in shared​ ​memory​ ​multiprocessors.,''​ ​in​ ​ICPP,​ ​pp.​ ​56-63,​ ​1993 [4] Mittal, Sparsh (2016-08-01). "A Survey of Recent Prefetching Techniques for Processor Caches".​ ​ACM​ ​Comput.​ ​Surv.​ ​49​ ​(2):​ ​35:1–35:35.​ ​ISSN​ ​0360-0300. [5] G. Hariprakash, R. Achutharaman and A. R. Omondi, "DStride: data-cache miss-address-based stride prefetching scheme for multimedia processors," ​Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001​, Gold Coast, Qld., 2001, pp.​ ​62-70. [6] Srinath et al., “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency​ ​of​ ​Hardware​ ​Prefetchers​ ​“,​ ​HPCA​ ​2007. [7] Dahlgren and Stenstrom, “Effectiveness of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors,” High performance computer architecture, 1995. [8] Baer and Chen, “An effective on-chip preloading scheme to reduce data access penalty,” SC 1991 [9] Kim and Viedenbaum, “Stride directed Prefetching for Secondary Caches”, Parallel Processing,​ ​1997. [10]​ ​Prof.​ ​Onur​ ​Mutlu,​ ​“Computer​ ​Architecture:​ ​Prefetching​ ​II”, http://www.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php?media=wiki:lectures:onur-740-fall11-lec ture27-prefetching-ii-afterlecture.pdf [11] ​Grannæs, M. (2006). ​Bandwidth-Aware Prefetching in Chip Multiprocessors​. [online] Norwegian University of Science and Technology Department of Computer and Information Science. Available at: https://daim.idi.ntnu.no/masteroppgaver/001/1507/masteroppgave.pdf [Accessed​ ​21​ ​Dec.​ ​2017]. [12] ​Dahlgren, F. and Stenstrom, P. (1996). Evaluation of hardware-based stride and sequential prefetching in shared-memory multiprocessors. ​IEEE Transactions on Parallel and Distributed Systems​,​ ​7(4),​ ​pp.385-398.