Effects of the Maximum Prefetch Degree in a Stride
Prefetcher Using a Reference Prediction Table
Yulong Bai, Fredrik Berdon Haave, Victor Iversen and Dominik Heinrich Thönnes
Abstract—Since the introduction of the integrated circuit, the
performance of CPUs has increased at a much higher rate than
the performance of memory. This has over time lead to a gap
in performance called the memory wall [1]. Since CPUs are
dependent on memory to function as efficient computer systems,
caches and prefetching has been introduced to hide the memory
latency seen by the CPU. In this paper, an instruction based stride
prefetcher using a reference prediction table (RPT) is studied,
and it is examined how the performance of the prefetcher is
affected by the state information used, which is the maximum
prefetch degree. We show that increasing this value increases
the coverage of the prefetching scheme. However, the highest
performance is achieved for a low maximum prefetch degree,
which gives a higher overall speedup when compared to a
sequential prefetcher with a variable prefetch degree and an
adaptive tagged sequential prefetching scheme.
I. INTRODUCTION
One of the biggest problems in modern computer architecture is the memory wall [1]. It describes the fact that the performance of memory grows much more slowly than the speed of CPUs, as can be observed in Figure 1. This development has existed since the early years of computers, and is still in effect. As the difference grows over the years, it becomes more important to use the memory efficiently. The gains in CPU performance will go to waste if the CPU cannot be fed with data quickly enough.
Fig. 1. The increasing gap in performance between the CPU and the memory
over time. This figure is from [5].
Originating from heavy use of sequential and looping
control structures, computer programs tend to reference a
concentrated part of the total memory space over a longer
time. This phenomenon is called locality [7]. When a program
tends to reference the same memory addresses over a longer
period of time, this is called temporal locality. When the
memory addresses needed by the program are located close to
the already accessed parts of the memory, it is called spatial
locality [6].
A method for bridging the performance gap between the
memory and the CPU is the memory hierarchy, which exploits
locality in computer programs. This solution is also supported
by the fact that memory generally gets faster the smaller
it is. The memory hierarchy in practice means that modern
computers have a small amount of very fast memory close
to the CPU, a large amount of relatively slow memory further
away from the CPU, and several layers in between. In addition
to weakening the impact of the memory wall, another benefit
from this is the ability to hold frequently used data high in
the hierarchy. This increases the memory access speed for
software that often reuses its data.
A method to increase the utilization of higher levels of the memory hierarchy is known as prefetching. It loads data that will be needed later into a faster level of the memory hierarchy. The goal is to hide memory latency by loading new data while the current data is still being processed. This of course requires knowing which data will be accessed in the future, which is the crucial problem of prefetching.
In this paper we present some prefetching techniques and analyze how our implementations can improve the performance of some benchmarks. The main technique we are
focusing on is an instruction based stride prefetcher, which
uses a reference prediction table (RPT). We will be exploring
the effect of different values of the maximum prefetch degree.
The most important result is that a higher maximum prefetch degree increases the coverage, but beyond a certain point the achieved speedup is negatively impacted. Using a maximum prefetch degree of 5, this prefetching scheme achieved its best case result: an overall average speedup of 1.085 compared to no prefetching.
In the next section we introduce some background on
the topic of prefetching. Thereafter we describe our chosen
scheme. We then cover our methodology, and present our
results. Next we discuss two sequential prefetchers we also
made, before going over some related work. The conclusions
follow at the end.
II. BACKGROUND
The architecture of a modern computer consists of multiple levels of caches in a hierarchy, as explained in Section I. On the top (but below the CPU registers) is the Level 1
cache. A cache miss takes place if the CPU requests data that is
not present in a cache. The aim of good prefetchers is to reduce
the number of cache misses and also to keep the memory
traffic as low as possible, thereby avoiding unnecessary data
loads. We now give a rough overview of different classes of
prefetching algorithms.
A. Sequential Prefetching
The simplest method is called sequential prefetching [2]. In
this case every time a cache-line is loaded, the prefetcher loads
a specific number of following cache-lines. Due to spatial
locality, this method works quite well and is in fact often used
in modern CPU architectures [13].
An important definition for sequential prefetching is the
prefetch degree. This is the number of subsequent cache-lines
that will be prefetched when one cache-line is accessed [2].
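To make the definition concrete, the following minimal sketch (our own illustration, not taken from [2]; kBlockSize and issue_prefetch are assumed stand-ins for the cache-line size and the memory system) shows degree-k sequential prefetching:

```cpp
#include <cstdint>

using Addr = uint64_t;
constexpr Addr kBlockSize = 64;  // assumed cache-line size in bytes

// Stub: in a real system this would hand the address to the memory system.
void issue_prefetch(Addr addr) {}

// Degree-k sequential prefetching: on every access to a cache line,
// prefetch the next `degree` consecutive lines.
void sequential_prefetch(Addr accessed_addr, int degree) {
    Addr line = accessed_addr - (accessed_addr % kBlockSize);
    for (int i = 1; i <= degree; ++i) {
        issue_prefetch(line + static_cast<Addr>(i) * kBlockSize);
    }
}
```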
B. Address Based Prefetching
In cases where memory is not accessed sequentially, we
need other methods to design a good prefetching scheme.
For example, when accessing the same element in a number of identical objects, a constant stride occurs between the accesses. In
this case we can use address based prefetching to reduce the
number of cache misses. The idea behind this method is to
calculate the distance between the current and the previous
memory access, and to use this distance as an offset for the
next prefetch [2]. This method therefore skips all the memory
in between, i.e. all the other elements in the object.
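A minimal sketch of this idea, reusing the declarations from the previous sketch and assuming a single global last-address register:

```cpp
// Address based prefetching: use the distance between the current and
// the previous memory access as the offset for the next prefetch.
void address_based_prefetch(Addr current_addr) {
    static Addr last_addr = 0;
    static bool have_last = false;
    if (have_last) {
        const int64_t stride = static_cast<int64_t>(current_addr - last_addr);
        if (stride != 0) {
            issue_prefetch(current_addr + stride);
        }
    }
    last_addr = current_addr;
    have_last = true;
}
```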
C. Instruction Based Prefetching
A way to improve the simple address based prefetching is
to use the instruction addresses to distinguish between the
instructions executed by the CPU. This is mostly beneficial
for more complex loop bodies. In this case a table is used
to store the latest memory access for each executed load
instruction. Every time memory is accessed, the table is used
to check the previous load address for the current instruction.
The difference between the previous load address and the
current address is calculated as the stride, which is then
used as an offset for the next prefetch. This is called Stride
Directed Prefetching [2]. A way to improve this prefetching
scheme is to add state information to the table that tracks
the load instructions executed by the CPU. This is done in
order to increase the accuracy of the prefetching scheme by
reducing the number of prefetched blocks that would not be
referenced by the CPU before being replaced. In this scheme, a reference prediction table (RPT) is used to contain the state information, in addition to the instruction address, the previous memory address, and the stride [2].
III. INSTRUCTION BASED STRIDE PREFETCHER USING AN
RPT
The implemented prefetcher is a stride prefetcher. It uses
the addresses of the instructions executed by the CPU to
track the load instructions and the addresses of the data they
request. The load instructions are tracked using a table, where
the PC, load address and the calculated values for the stride
and confidence are stored. This table works as the RPT for
the prefetcher, and is shown in Figure 2.
When the CPU executes a load instruction, the PC for the
load instruction is used to check if there exists an entry in the
table with this PC. If an entry with the same PC as the executed
load instruction is found, the previously stored load address
is subtracted from the instruction’s load address, and stored
in the table as the stride. If the entry already has a non-zero
stride value, the old stride value is compared to the calculated
value. The confidence parameter is incremented if the new
stride value matches the old one, otherwise the confidence for
the entry is reset to zero. If no entry can be found in the table
for the executed load instruction, a new entry is added to the
table with the PC and load address taken from the executed
load instruction. A new entry has a stride and confidence equal
to zero.
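A minimal sketch of this update logic, under our own naming assumptions (the paper gives no code; the PC serves as the table key):

```cpp
#include <cstdint>
#include <unordered_map>

using Addr = uint64_t;

// One RPT entry: 8 B PC (the map key), 8 B last address, 8 B stride and
// 1 B confidence, matching the field sizes given later in this section.
struct RPTEntry {
    Addr    last_addr  = 0;
    int64_t stride     = 0;
    uint8_t confidence = 0;
};

std::unordered_map<Addr, RPTEntry> rpt;  // keyed by load-instruction PC

// Update the RPT when the CPU executes a load at `pc` for `load_addr`.
void rpt_update(Addr pc, Addr load_addr) {
    auto it = rpt.find(pc);
    if (it == rpt.end()) {
        // New entry: stride and confidence start at zero.
        rpt[pc] = RPTEntry{load_addr, 0, 0};
        return;
    }
    RPTEntry &e = it->second;
    const int64_t new_stride = static_cast<int64_t>(load_addr - e.last_addr);
    if (e.stride != 0 && new_stride == e.stride) {
        ++e.confidence;   // stride verified once more (a real design
                          // would also cap this at the maximum degree)
    } else if (e.stride != 0) {
        e.confidence = 0;  // stride broken: reset confidence
    }
    e.stride    = new_stride;
    e.last_addr = load_addr;
}
```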
A queue is used to make sure that recently executed loads
and their prefetch parameters are not removed from the table
when entries have to be removed. This is done in order to
accommodate the size limit for the data structures used in the
implementation. When the CPU executes a load instruction,
the PC for this instruction is looked up in the queue, and
moved to the top if found. If the PC is not found, it is added
to the top of the queue. Figure 2 depicts how the instruction
based stride prefetcher updates its data structures when the
CPU executes a load instruction.
Fig. 2. Instruction based stride prefetcher
A prefetch is issued if the confidence parameter for the
current load instruction is higher than zero, which means that
the calculated stride value has been verified at least once for
this load. When issuing a prefetch, the prefetch degree is
set equal to the confidence parameter. Thus, the number of
requests added to the prefetch queue is equal to the number
of times the stride has been verified for the current load
instruction.
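Continuing the sketch above, the issue logic could look as follows. The cap kMaxPrefetchDegree reflects the maximum prefetch degree studied in this paper, which the conclusion describes as a limit on the confidence parameter; applying it at issue time, as here, is equivalent.

```cpp
#include <algorithm>

constexpr int kMaxPrefetchDegree = 5;  // the parameter studied in this paper

// Issue prefetches for a load at `pc` to `load_addr`. The degree equals
// the confidence (the number of times the stride was verified), capped
// at the maximum prefetch degree.
void rpt_issue(Addr pc, Addr load_addr) {
    auto it = rpt.find(pc);
    if (it == rpt.end()) return;
    const RPTEntry &e = it->second;
    if (e.confidence == 0 || e.stride == 0) return;
    const int degree = std::min<int>(e.confidence, kMaxPrefetchDegree);
    for (int i = 1; i <= degree; ++i) {
        issue_prefetch(load_addr + i * e.stride);
    }
}
```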
To accommodate the size limit of 8 kB, an upper limit is
set for the number of entries in the table and the queue. Since
these structures contain the same number of elements at all
times, the maximum number of entries for both structures is
determined by the data types of the variables stored in them.
An entry in the table consists of a PC (8 bytes), last address (8
bytes), stride (8 bytes), and a confidence parameter (1 byte).
An entry in the queue only contains a reference to a PC (8 bytes). Thus, one entry across both data structures occupies 33 bytes in total. In order to meet the requirement of using at most 8 kB, a
maximum of 242 entries can be added to each data structure,
as given by Equation 1. For each executed load instruction,
the size of the data structures is checked. If the size is going
to be more than 242 entries when a new entry is added, the
queue is used to find the least recently used entry. This entry
is then removed from both structures in order to make room
for the new entry.
\[ \text{max entries} = \frac{8\,\text{kB}}{\text{size}(2 \times \text{PC} + \text{last address} + \text{stride} + \text{confidence})} \tag{1} \]
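A small sketch of the resulting capacity check and LRU eviction, continuing the sketches above (our own reconstruction; the figure of 242 entries implies 1 kB = 1000 bytes):

```cpp
#include <list>

constexpr size_t kEntrySize  = 2 * 8 + 8 + 8 + 1;  // 33 bytes per entry
constexpr size_t kMaxEntries = 8000 / kEntrySize;  // 8 kB / 33 B = 242

std::list<Addr> lru_queue;  // most recently used PC at the front

// Keep `pc` at the top of the LRU queue; if a new entry would exceed
// the 242-entry budget, evict the least recently used PC from both
// the queue and the RPT.
void touch_and_evict(Addr pc) {
    lru_queue.remove(pc);  // drop the old position, if any
    if (rpt.find(pc) == rpt.end() && rpt.size() >= kMaxEntries) {
        const Addr victim = lru_queue.back();  // least recently used PC
        lru_queue.pop_back();
        rpt.erase(victim);
    }
    lru_queue.push_front(pc);
}
```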
IV. METHODOLOGY
We use the M5 simulator to perform benchmarking of the implemented prefetchers studied in this paper. The simulator is a generic simulation infrastructure with useful tools for debugging and collecting statistics [4]. It is equipped with two CPU models: an in-order non-pipelined CPU named SimpleCPU, and the O3CPU, an out-of-order, superscalar and pipelined CPU which features a simultaneous multithreading model [4]. When benchmarking the prefetchers, the O3CPU model is used to model the simulated architecture. It is based on the DEC Alpha Tsunami architecture, where the Alpha 21264 microprocessor is used as a reference [3]. The memory structure of the Alpha 21264 microprocessor is shown in Figure 3, with values corresponding to the values used in the simulated system given in Table I.
Fig. 3. Alpha 21264 memory structure
Benchmarking is performed by running a Python script which defines the test parameters used in the simulator. A selection of the SPEC CPU2000 benchmark suite is used to test the performance of the prefetcher, where the selection consists of the benchmarks ammp, applu, apsi, art110, art470, bzip2 program, bzip2 source, galgel, swim, twolf and wupwise.

TABLE I
TEST PARAMETERS USED FOR BENCHMARKING

Test parameter                  Value
CPU clock speed                 2 GHz
Physical memory size            256 MB
Caches with prefetching         L2
L2 size                         1 MB
Memory bus clock speed          400 MHz
Memory bus width                64 b
Memory latency                  30 ns
Instruction checkpoint restore  1e9
Warmup instructions             1e7
Maximum instructions            1e8
The prefetcher interfaces with the simulated memory system through an interface file supplied together with the M5 simulator. The interface includes a function called by the simulator to notify the prefetcher before any memory is accessed, enabling the prefetcher to initialize necessary data structures, in addition to functions that notify the prefetcher when the cache is accessed and when a prefetch is completed. The interface also includes functions that can be called by the prefetcher in order to issue a prefetch, and to set, clear and check prefetch bits available in the L2 cache. Functions to check if a memory block is currently in the cache and to monitor the prefetch queue are also available.
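As a rough illustration of how these pieces fit together, the callback structure could be sketched as follows; the function and type names here are placeholders modeled on the description above, not the verbatim M5 interface:

```cpp
// Placeholder types and names only; the real interface file supplied
// with M5 defines the authoritative signatures.
struct AccessStat {
    Addr pc;        // PC of the instruction that accessed the cache
    Addr mem_addr;  // address of the accessed data
    bool miss;      // whether the access missed in the L2 cache
};

// Called once before any memory is accessed.
void prefetch_init() {
    // Initialize the RPT and the LRU queue from the earlier sketches.
}

// Called on every cache access.
void prefetch_access(AccessStat stat) {
    touch_and_evict(stat.pc);
    rpt_update(stat.pc, stat.mem_addr);
    rpt_issue(stat.pc, stat.mem_addr);
}

// Called when a prefetch completes, e.g. to set a prefetch bit.
void prefetch_complete(Addr addr) {}
```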
In order to measure and compare our results we use the
following metrics: the accuracy of a prefetcher is defined as
the number of good prefetches divided by the number of total
prefetches, as shown by Equation 2. An ideal prefetcher has
an accuracy of one.
\[ \text{accuracy} = \frac{\text{good prefetches}}{\text{total prefetches}} \tag{2} \]
To also include the right timing of the prefetches we can
look at the coverage. This is defined as the number of useful
prefetches divided by the number of cache misses the program
would generate without prefetching. An ideal prefetcher has a
coverage of one. The formula used to calculate the coverage
is given by Equation 3.
\[ \text{coverage} = \frac{\text{useful prefetches}}{\text{cache misses without prefetching}} \tag{3} \]
Another metric that is important in order to understand
the measurements we made is the speedup. It is defined as
the computation time without prefetching, divided by the
computation time with prefetching enabled, as shown by
Equation 4. The higher the speedup, the better the prefetcher
works. A speedup of one means that the prefetcher does not
reduce the computation time of the program, while a speedup
of less than one means that the prefetcher slows down the
program.
\[ \text{speedup} = \frac{\text{execution time without prefetcher}}{\text{execution time with prefetcher}} \tag{4} \]
To compare the results we get out of the different bench-
marks, the harmonic mean, given by Equation 5, is used. This
is how we obtain the average values presented in this paper.
\[ H_{\text{avg}} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \tag{5} \]
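For concreteness, the averaging used throughout this paper can be written as the following small helper (our own illustration):

```cpp
#include <vector>

// Harmonic mean, as in Equation 5. It weights values below one
// (slowdowns) more heavily than an arithmetic mean would, making it a
// conservative way to average speedups.
double harmonic_mean(const std::vector<double> &xs) {
    double sum_of_inverses = 0.0;
    for (const double x : xs) sum_of_inverses += 1.0 / x;
    return static_cast<double>(xs.size()) / sum_of_inverses;
}
```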
V. RESULTS
Figure 4 depicts the obtained speedup for the prefetcher described in Section III, for different values of the maximum prefetch degree. As seen in the figure, a speedup above
1.5 is achieved for the ammp benchmark, and above 1.2 for
the wupwise benchmark. For the remaining benchmarks, the
speedup is below 1.1, giving a best average speedup of 1.085
for a maximum prefetch degree of 5.
Fig. 4. Speedup results for the instruction based stride prefetcher from the benchmarks for different values of the maximum prefetch degree (maximum degree = 1, 5, 10, 20)
In Figure 5, the accuracy is shown for the different benchmarks, for different values of the maximum prefetch degree.
The figure shows that a higher maximum prefetch degree is
beneficial for the art110 and art470 benchmarks, while for the
apsi benchmark, a lower maximum degree leads to a higher
accuracy.
Figure 6 depicts the coverage for the different benchmarks,
for different values of the maximum prefetch degree. From
this figure, it is seen that for the majority of the benchmarks,
the coverage increases with an increasing maximum prefetch
degree. A maximum degree of 5 obtains coverage results close
to the results given by using a maximum prefetch degree of
10 and 20 for the majority of the benchmarks.
In Table II, the mean values for the accuracy, coverage
and speedup from the benchmarks are shown for the different
values of the maximum prefetch degree.
Fig. 5. Accuracy results for the instruction based stride prefetcher from the benchmarks for different values of the maximum prefetch degree (maximum degree = 1, 5, 10, 20)
Fig. 6. Coverage results for the instruction based stride prefetcher from the benchmarks for different values of the maximum prefetch degree (maximum degree = 1, 5, 10, 20)
TABLE II
MEAN VALUES OF THE BENCHMARK RESULTS
Maximum prefetch degree Accuracy Coverage Speedup
1 0.618 0.118 1.055
5 0.639 0.155 1.085
10 0.639 0.157 1.082
20 0.632 0.155 1.074
Other prefetching schemes, such as adaptive sequential prefetching, Delta Correlating Prediction Tables (DCPT), Delta Correlating Prediction Tables with L1 hoisting (DCPT-P), standard prefetching with RPT, and related schemes, are listed in Table III together with their average speedup results. Compared to these schemes, the instruction based stride prefetcher using an RPT obtains a speedup slightly higher than the DCPT-P scheme, which has the highest average speedup of all the schemes in Table III. The instruction based stride prefetcher using an RPT obtains this speedup with a maximum prefetch degree of 5, as shown in Table II. The prefetching schemes listed in Table III come as part of the M5 simulator described in Section IV.
TABLE III
MEAN VALUES OF THE SPEEDUP FOR OTHER PREFETCHING SCHEMES
Prefetching scheme Speedup
Adaptive sequential 0.998
DCPT 1.052
DCPT-P 1.084
RPT 1.057
Sequential-on-access 1.011
Sequential-on-miss 0.993
Tagged 1.000
VI. DISCUSSION
This section describes the two sequential prefetchers we also implemented, and how they compare to the instruction based stride prefetcher presented in Section III.
A. Variable degree sequential prefetcher
This prefetcher looks for sequential accesses, defined as accesses that happen in adjacent memory blocks. The prefetching degree starts at 1, and prefetching is performed sequentially on every access. For every sequential access, the prefetching degree is increased by a constant number. If the sequential access pattern is broken, the prefetching degree is reset to 1.
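A minimal sketch of this scheme, reusing the declarations from the earlier sketches (our own reconstruction; the adjacency test on block numbers is an assumption):

```cpp
// Variable degree sequential prefetcher: the degree grows by a constant
// increment on each sequential access and resets when the pattern breaks.
struct VarDegreeSeq {
    Addr last_block = 0;
    int  degree     = 1;
    int  increment  = 1;  // the "degree incrementation" parameter

    void on_access(Addr addr) {
        const Addr block = addr / kBlockSize;
        if (block == last_block + 1) {
            degree += increment;  // sequential pattern continues
        } else if (block != last_block) {
            degree = 1;           // pattern broken: back to degree 1
        }
        last_block = block;
        for (int i = 1; i <= degree; ++i) {
            issue_prefetch((block + i) * kBlockSize);
        }
    }
};
```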
The result is highly dependent on how aggressively the prefetching degree is increased. This scheme shows a moderate speedup if the prefetching degree is increased very conservatively. If the prefetching degree is increased too quickly, the speedup drops below one. Overall speedup is shown in Table IV, and detailed results are shown in Figures 7, 8 and 9.
Fig. 7. Speedup results for the variable degree sequential prefetcher in the benchmarks for different values of degree incrementation per sequential access (incrementation = 1, 2, 4, 8)
Compared to the instruction based stride prefetcher, the
biggest speedup difference comes in the ammp benchmark.
TABLE IV
MEAN VALUES OF THE SPEEDUP FOR THE VARIABLE DEGREE
SEQUENTIAL PREFETCHER
Degree incrementation per sequential access Speedup
1 1.012
2 1.009
4 0.978
8 0.944
Fig. 8. Accuracy results for the variable degree sequential prefetcher in the benchmarks for different values of degree incrementation per sequential access (incrementation = 1, 2, 4, 8)
Fig. 9. Coverage results for the variable degree sequential prefetcher in the benchmarks for different values of degree incrementation per sequential access (incrementation = 1, 2, 4, 8)
It can be seen that this difference is due to a disparity in
accuracy and coverage between the different prefetchers. The
instruction based stride prefetcher is to a large extent able
to identify the benchmark’s access patterns, and manages
to achieve high accuracy and coverage. The variable degree
sequential prefetcher on the other hand, has accuracy and
coverage very close to zero in this benchmark.
B. Adaptive tagged sequential prefetcher
This prefetcher tags the prefetched addresses. The prefetching degree starts at 1, and prefetching is performed sequentially on every access. Once a certain number of memory accesses has happened, this prefetcher uses the tags to count how many of the prefetches were useful, i.e. used by the benchmark. If the fraction of useful prefetches is high enough, the prefetching degree is increased by 1, and it is decreased by 1 if the fraction is too low. High enough is defined as >60%, and too low is defined as <40%.
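A minimal sketch of the adaptation step (our own reconstruction; the tag bookkeeping is elided, and we assume the degree never drops below 1):

```cpp
// Adaptive tagged sequential prefetcher: every `window` memory accesses,
// the tags are used to count useful prefetches and adjust the degree.
struct AdaptiveTaggedSeq {
    int window     = 40;  // memory access counting window size
    int accesses   = 0;
    int prefetches = 0;   // prefetches issued in the current window
    int useful     = 0;   // tagged prefetches that were later used
    int degree     = 1;

    void end_of_window() {
        const double useful_fraction =
            prefetches > 0 ? static_cast<double>(useful) / prefetches : 0.0;
        if (useful_fraction > 0.60) {
            ++degree;                        // prefetches are paying off
        } else if (useful_fraction < 0.40 && degree > 1) {
            --degree;                        // too many wasted prefetches
        }
        accesses = prefetches = useful = 0;  // start a new window
    }

    void on_access() {
        if (++accesses == window) end_of_window();
    }
};
```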
The results show an overall speedup below one for all sizes of the memory access counting window. The size of the counting window also does not seem to affect the speedup significantly. Table V shows the overall speedup, and Figures 10, 11 and 12 depict the detailed results.

TABLE V
MEAN VALUES OF THE SPEEDUP FOR THE ADAPTIVE TAGGED SEQUENTIAL PREFETCHER

Memory access counting window size   Speedup
10                                   0.984
20                                   0.981
40                                   0.980
80                                   0.983
Fig. 10. Speedup results for the adaptive tagged sequential prefetcher in the benchmarks for different sizes of the memory access counting window (window size = 10, 20, 40, 80)
The adaptive tagged sequential prefetcher has the same problem as the variable degree sequential prefetcher in the ammp benchmark; it achieves a speedup below one because the accuracy and coverage are very close to zero. Additionally, for the adaptive tagged sequential prefetcher the galgel benchmark also stands out with a speedup below one. This happens even though the coverage is comparable to both the variable degree sequential prefetcher and the instruction based stride prefetcher. The difference here lies in the accuracy, where the adaptive tagged sequential prefetcher's accuracy is much lower than that of the other two prefetchers.
Fig. 11. Accuracy results for the adaptive tagged sequential prefetcher in the benchmarks for different sizes of the memory access counting window (window size = 10, 20, 40, 80)
Fig. 12. Coverage results for the adaptive tagged sequential prefetcher in the benchmarks for different sizes of the memory access counting window (window size = 10, 20, 40, 80)
VII. RELATED WORK
Various aspects of caches are examined by A. J. Smith [6], including ways to prefetch data into them. It is theorized that the only viable way to prefetch data into a cache in practice is to prefetch the immediately sequential memory block, due to the need for a fast hardware implementation. This is known as one block lookahead (OBL). Three methods for OBL are focused on: always prefetch, prefetch on misses and tagged prefetch. Prefetch on misses is found to be the least effective method, but both always prefetch and tagged prefetch succeed well in reducing the miss ratio. Additionally, the tagged prefetch method only requires a small increase in the access ratio over prefetch on misses, thus utilizing memory bandwidth more efficiently.
S. VanderWiel and D. Lilja [8], [9] describe many details of prefetching, including tagged prefetching. Their paper discusses a number of aspects that need to be taken into account when designing a prefetcher, such as when the prefetches are initiated, where the prefetched data is placed, and what the unit of prefetch should be. Their findings show that no prefetching scheme is optimal for all workloads; they all have different strengths and weaknesses.
S. Srinath et al. [10] propose a tagged prefetching mechanism that incorporates dynamic feedback into its design. They call their method feedback directed prefetching (FDP), and the goal is to increase the performance improvement provided by prefetching as well as to reduce the bandwidth usage. Their prefetcher estimates its accuracy, timeliness and the prefetcher-caused cache pollution in order to dynamically adjust its aggressiveness. The results show that the FDP prefetcher improves the average performance by 6.5% compared to the best performing conventional stream based prefetcher, as well as consuming 18.7% less memory bandwidth. The improvements developed for the FDP prefetcher are shown to be applicable to several other types of prefetchers, including global history buffer delta correlation prefetchers (GHB/DC) and PC based stride prefetchers.
K. Nesbit and J. Smith [11] examine a technique which addresses the problems of static prefetch tables. To solve the problem of fixed memory per prefetch key, they use a global history buffer (GHB), which splits the prefetch keys and the addresses into two different structures. The results show that the GHB profits more if a global address is used as a key; in that case it is 20% faster and the memory traffic is reduced by 90%. For the instruction pointer based methods the GHB is also faster (9%), but the memory traffic is the same. They also point out that it is more important for a prefetcher to be accurate than to be fast, since memory bandwidth is such an important factor in modern CPUs.
L. Ramos et al. [12] suggest that prefetchers should work on multiple levels to fit different purposes. In their case they design them to minimize the cost of the prefetcher, to minimize the loss due to bad prefetching, or to maximize the average performance. They also introduce a new correlating prefetcher, which they call PDFCM (Prefetching based on a Differential Finite Context Machine). It uses a history and a distance table to calculate and store multiple strides. By using a simple sequential prefetcher they are able to minimize the cost to 1 kB. Their PDFCM is able to eliminate all performance losses caused by the prefetcher. By combining both techniques they obtained the best average performance.
VIII. CONCLUSION
This paper focuses on how prefetching can be used to mitigate the problems caused by the memory wall. The more specific goal of our work is to explore the effects of the maximum prefetch degree in a stride prefetcher using an RPT. In order to do this we implemented an instruction based stride prefetcher and ran it through a selection of the SPEC CPU2000 benchmarks, with varying values of the maximum prefetch degree.
From the results given in Section V, it is seen that a higher limit for the confidence parameter (given by a higher maximum prefetch degree) leads to increased coverage for the majority of the benchmarks used to test the prefetcher discussed in Section III. As for the accuracy, increasing the maximum prefetch degree does not seem to have a consistent effect, with varying results across the benchmarks.
Although a higher maximum prefetch degree yields a higher coverage for the prefetcher, the speedup does not increase monotonically with the maximum prefetch degree. This can be explained by looking at the definitions of coverage and speedup given in Equations 3 and 4 in Section IV. The higher coverage may come with a lower speedup, since the memory bus must spend more bandwidth on prefetching. This bandwidth could otherwise have been used to load more important data, which would have made it possible to obtain a higher speedup.
Compared to the sequential prefetcher with a variable prefetch degree and the adaptive tagged sequential prefetching scheme studied in Section VI, the performance of the instruction based stride prefetcher using an RPT is either equal to these schemes or higher, depending on the benchmark. The overall best case difference in speedup for the instruction based stride prefetcher is 7.3% compared to the sequential prefetcher with a variable prefetch degree and 10.1% compared to the adaptive tagged sequential prefetching scheme.
REFERENCES
[1] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1); March 1995.
[2] M. Grannæs, Reducing Memory Latency by Improving Resource Utilization. Norway: Norwegian University of Science and Technology; June 2010.
[3] M5 simulator system TDT4260 Computer Architecture User documentation. Norway: Norwegian University of Science and Technology; January 2014.
[4] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, S. K. Reinhardt, The M5 Simulator: Modeling Networked Systems. IEEE Micro, vol. 26, no. 4, pp. 52-60, July/August: The IEEE Computer Society; 2006.
[5] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers; 2006.
[6] A. J. Smith. Cache memories. ACM Comput. Surv., 14(3), pp. 473-530; 1982.
[7] P. J. Denning. On modeling program behavior. In Proc. Spring Joint Computer Conference, vol. 40, pp. 937-944, Arlington, Va, USA: AFIPS Press; 1972.
[8] S. VanderWiel and D. Lilja. A survey of data prefetching techniques. Technical Report 5, University of Minnesota; October 1996.
[9] S. VanderWiel and D. Lilja. When caches aren't enough: data prefetching techniques. Computer, 30(7), pp. 23-30; July 1997.
[10] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. Technical report, TR-HPS-2006-006, University of Texas at Austin; May 2006.
[11] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. IEEE Micro, 25, pp. 90-97; January 2005.
[12] L. M. Ramos, J. L. Briz, P. E. Ibanez, and V. Vinals. Multi-level adaptive prefetching based on performance gradient tracking. In Data Prefetching Championship-1; 2009.
[13] Inside Intel Core Microarchitecture and Smart Memory Access, https://software.intel.com/sites/default/files/m/d/4/1/d/8/sma.pdf; April 2015.

More Related Content

What's hot

PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...
PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...
PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...caijjournal
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsJames McGalliard
 
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSVTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSvtunotesbysree
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performanceSyed Zaid Irshad
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukAndrii Vozniuk
 
ORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERS
ORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERSORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERS
ORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERSNexgen Technology
 
Latency aware write buffer resource
Latency aware write buffer resourceLatency aware write buffer resource
Latency aware write buffer resourceijdpsjournal
 
Physical organization of parallel platforms
Physical organization of parallel platformsPhysical organization of parallel platforms
Physical organization of parallel platformsSyed Zaid Irshad
 

What's hot (15)

shashank_micro92_00697015
shashank_micro92_00697015shashank_micro92_00697015
shashank_micro92_00697015
 
B010630409
B010630409B010630409
B010630409
 
PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...
PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...
PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...
 
shieh06a
shieh06ashieh06a
shieh06a
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific Applications
 
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSVTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performance
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii Vozniuk
 
Unit 8
Unit 8Unit 8
Unit 8
 
ORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERS
ORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERSORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERS
ORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERS
 
Load rebalancing
Load rebalancingLoad rebalancing
Load rebalancing
 
Latency aware write buffer resource
Latency aware write buffer resourceLatency aware write buffer resource
Latency aware write buffer resource
 
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
 
Physical organization of parallel platforms
Physical organization of parallel platformsPhysical organization of parallel platforms
Physical organization of parallel platforms
 
Cray xt3
Cray xt3Cray xt3
Cray xt3
 

Similar to Tdt4260 miniproject report_group_3

The effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handlingThe effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handlingcsandit
 
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...cscpconf
 
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...IJCSEA Journal
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetchingHimanshu Koli
 
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...Ilango Jeyasubramanian
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
Load balancing in Distributed Systems
Load balancing in Distributed SystemsLoad balancing in Distributed Systems
Load balancing in Distributed SystemsRicha Singh
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpeSAT Publishing House
 
Aca lab project (rohit malav)
Aca lab project (rohit malav) Aca lab project (rohit malav)
Aca lab project (rohit malav) Rohit malav
 
Factors affecting performance
Factors affecting performanceFactors affecting performance
Factors affecting performancemissstevenson01
 
Operating Systems - memory management
Operating Systems - memory managementOperating Systems - memory management
Operating Systems - memory managementMukesh Chinta
 
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docxCS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docxfaithxdunce63732
 
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...IJERA Editor
 

Similar to Tdt4260 miniproject report_group_3 (20)

The effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handlingThe effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handling
 
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
 
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetching
 
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
 
Final report
Final reportFinal report
Final report
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
Oversimplified CA
Oversimplified CAOversimplified CA
Oversimplified CA
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
Load balancing in Distributed Systems
Load balancing in Distributed SystemsLoad balancing in Distributed Systems
Load balancing in Distributed Systems
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
 
Cad report
Cad reportCad report
Cad report
 
Aca lab project (rohit malav)
Aca lab project (rohit malav) Aca lab project (rohit malav)
Aca lab project (rohit malav)
 
Factors affecting performance
Factors affecting performanceFactors affecting performance
Factors affecting performance
 
Operating Systems - memory management
Operating Systems - memory managementOperating Systems - memory management
Operating Systems - memory management
 
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docxCS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
 
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
An Efficient Low Complexity Low Latency Architecture for Matching of Data Enc...
 

Recently uploaded

Gaya Call Girls #9907093804 Contact Number Escorts Service Gaya
Gaya Call Girls #9907093804 Contact Number Escorts Service GayaGaya Call Girls #9907093804 Contact Number Escorts Service Gaya
Gaya Call Girls #9907093804 Contact Number Escorts Service Gayasrsj9000
 
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...ranjana rawat
 
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...Pooja Nehwal
 
Dubai Call Girls O528786472 Call Girls In Dubai Wisteria
Dubai Call Girls O528786472 Call Girls In Dubai WisteriaDubai Call Girls O528786472 Call Girls In Dubai Wisteria
Dubai Call Girls O528786472 Call Girls In Dubai WisteriaUnited Arab Emirates
 
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一zul5vf0pq
 
Call Girls Delhi {Rs-10000 Laxmi Nagar] 9711199012 Whats Up Number
Call Girls Delhi {Rs-10000 Laxmi Nagar] 9711199012 Whats Up NumberCall Girls Delhi {Rs-10000 Laxmi Nagar] 9711199012 Whats Up Number
Call Girls Delhi {Rs-10000 Laxmi Nagar] 9711199012 Whats Up NumberMs Riya
 
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...Pooja Nehwal
 
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,Pooja Nehwal
 
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | DelhiFULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhisoniya singh
 
Beautiful Sapna Call Girls CP 9711199012 ☎ Call /Whatsapps
Beautiful Sapna Call Girls CP 9711199012 ☎ Call /WhatsappsBeautiful Sapna Call Girls CP 9711199012 ☎ Call /Whatsapps
Beautiful Sapna Call Girls CP 9711199012 ☎ Call /Whatsappssapnasaifi408
 
Russian Call Girls In South Delhi Delhi 9711199012 💋✔💕😘 Independent Escorts D...
Russian Call Girls In South Delhi Delhi 9711199012 💋✔💕😘 Independent Escorts D...Russian Call Girls In South Delhi Delhi 9711199012 💋✔💕😘 Independent Escorts D...
Russian Call Girls In South Delhi Delhi 9711199012 💋✔💕😘 Independent Escorts D...nagunakhan
 
如何办理(NUS毕业证书)新加坡国立大学毕业证成绩单留信学历认证原版一比一
如何办理(NUS毕业证书)新加坡国立大学毕业证成绩单留信学历认证原版一比一如何办理(NUS毕业证书)新加坡国立大学毕业证成绩单留信学历认证原版一比一
如何办理(NUS毕业证书)新加坡国立大学毕业证成绩单留信学历认证原版一比一ga6c6bdl
 
Call Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile serviceCall Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile servicerehmti665
 
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝soniya singh
 
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...Suhani Kapoor
 
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311  Call Girls in Thane , Independent Escort Service ThanePallawi 9167673311  Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service ThanePooja Nehwal
 
Alambagh Call Girl 9548273370 , Call Girls Service Lucknow
Alambagh Call Girl 9548273370 , Call Girls Service LucknowAlambagh Call Girl 9548273370 , Call Girls Service Lucknow
Alambagh Call Girl 9548273370 , Call Girls Service Lucknowmakika9823
 
Russian Call Girls Kolkata Chhaya 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls Kolkata Chhaya 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls Kolkata Chhaya 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls Kolkata Chhaya 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
9892124323, Call Girl in Juhu Call Girls Services (Rate ₹8.5K) 24×7 with Hote...
9892124323, Call Girl in Juhu Call Girls Services (Rate ₹8.5K) 24×7 with Hote...9892124323, Call Girl in Juhu Call Girls Services (Rate ₹8.5K) 24×7 with Hote...
9892124323, Call Girl in Juhu Call Girls Services (Rate ₹8.5K) 24×7 with Hote...Pooja Nehwal
 

Recently uploaded (20)

Gaya Call Girls #9907093804 Contact Number Escorts Service Gaya
Gaya Call Girls #9907093804 Contact Number Escorts Service GayaGaya Call Girls #9907093804 Contact Number Escorts Service Gaya
Gaya Call Girls #9907093804 Contact Number Escorts Service Gaya
 
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
 
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...
 
Dubai Call Girls O528786472 Call Girls In Dubai Wisteria
Dubai Call Girls O528786472 Call Girls In Dubai WisteriaDubai Call Girls O528786472 Call Girls In Dubai Wisteria
Dubai Call Girls O528786472 Call Girls In Dubai Wisteria
 
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
 
Call Girls Delhi {Rs-10000 Laxmi Nagar] 9711199012 Whats Up Number
Call Girls Delhi {Rs-10000 Laxmi Nagar] 9711199012 Whats Up NumberCall Girls Delhi {Rs-10000 Laxmi Nagar] 9711199012 Whats Up Number
Call Girls Delhi {Rs-10000 Laxmi Nagar] 9711199012 Whats Up Number
 
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
 
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
 
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | DelhiFULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
 
Beautiful Sapna Call Girls CP 9711199012 ☎ Call /Whatsapps
Beautiful Sapna Call Girls CP 9711199012 ☎ Call /WhatsappsBeautiful Sapna Call Girls CP 9711199012 ☎ Call /Whatsapps
Beautiful Sapna Call Girls CP 9711199012 ☎ Call /Whatsapps
 
Russian Call Girls In South Delhi Delhi 9711199012 💋✔💕😘 Independent Escorts D...
Russian Call Girls In South Delhi Delhi 9711199012 💋✔💕😘 Independent Escorts D...Russian Call Girls In South Delhi Delhi 9711199012 💋✔💕😘 Independent Escorts D...
Russian Call Girls In South Delhi Delhi 9711199012 💋✔💕😘 Independent Escorts D...
 
如何办理(NUS毕业证书)新加坡国立大学毕业证成绩单留信学历认证原版一比一
如何办理(NUS毕业证书)新加坡国立大学毕业证成绩单留信学历认证原版一比一如何办理(NUS毕业证书)新加坡国立大学毕业证成绩单留信学历认证原版一比一
如何办理(NUS毕业证书)新加坡国立大学毕业证成绩单留信学历认证原版一比一
 
Call Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile serviceCall Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile service
 
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
 
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
 
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
 
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311  Call Girls in Thane , Independent Escort Service ThanePallawi 9167673311  Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service Thane
 
Alambagh Call Girl 9548273370 , Call Girls Service Lucknow
Alambagh Call Girl 9548273370 , Call Girls Service LucknowAlambagh Call Girl 9548273370 , Call Girls Service Lucknow
Alambagh Call Girl 9548273370 , Call Girls Service Lucknow
 
Russian Call Girls Kolkata Chhaya 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls Kolkata Chhaya 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls Kolkata Chhaya 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls Kolkata Chhaya 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
9892124323, Call Girl in Juhu Call Girls Services (Rate ₹8.5K) 24×7 with Hote...
9892124323, Call Girl in Juhu Call Girls Services (Rate ₹8.5K) 24×7 with Hote...9892124323, Call Girl in Juhu Call Girls Services (Rate ₹8.5K) 24×7 with Hote...
9892124323, Call Girl in Juhu Call Girls Services (Rate ₹8.5K) 24×7 with Hote...
 

Tdt4260 miniproject report_group_3

  • 1. 1 Effects of the Maximum Prefetch Degree in a Stride Prefetcher Using a Reference Prediction Table Yulong Bai, Fredrik Berdon Haave, Victor Iversen and Dominik Heinrich Th¨onnes Abstract—Since the introduction of the integrated circuit, the performance of CPUs has increased at a much higher rate than the performance of memory. This has over time lead to a gap in performance called the memory wall [1]. Since CPUs are dependent on memory to function as efficient computer systems, caches and prefetching has been introduced to hide the memory latency seen by the CPU. In this paper, an instruction based stride prefetcher using a reference prediction table (RPT) is studied, and it is examined how the performance of the prefetcher is affected by the state information used, which is the maximum prefetch degree. We show that increasing this value increases the coverage of the prefetching scheme. However, the highest performance is achieved for a low maximum prefetch degree, which gives a higher overall speedup when compared to a sequential prefetcher with a variable prefetch degree and an adaptive tagged sequential prefetching scheme. I. INTRODUCTION ONE of the biggest problems in modern computer archi- tecture is the memory wall [1]. It describes the problem that the performance of memory is growing much slower than the speed of CPUs, as can be observed in Figure 1. This development has existed since the early years of computers, and is still in effect. As the difference is getting bigger over the years, it gets more important to use the memory in an efficient way. The gains in CPU performance will go to waste if it cannot be fed quickly enough with data. Fig. 1. The increasing gap in performance between the CPU and the memory over time. This figure is from [5]. Originating from heavy use of sequential and looping control structures, computer programs tend to reference a concentrated part of the total memory space over a longer time. This phenomena is called locality [7]. When a program tends to reference the same memory addresses over a longer period of time, this is called temporal locality. When the memory addresses needed by the program are located close to the already accessed parts of the memory, it is called spatial locality [6]. A method for bridging the performance gap between the memory and the CPU is the memory hierarchy, which exploits locality in computer programs. This solution is also supported by the fact that memory generally gets faster the smaller it is. The memory hierarchy in practice means that modern computers have a small amount of very fast memory close to the CPU, a large amount of relatively slow memory further away from the CPU, and several layers in between. In addition to weakening the impact of the memory wall, another benefit from this is the ability to hold frequently used data high in the hierarchy. This increases the memory access speed for software that often reuses its data. A method to increase the utilization of higher levels of the memory hierarchy is known as prefetching. It is used to load data which is used later to a faster level in the memory hierarchy. By doing this, the goal is to hide memory latency by loading new data while the current data is still being processed. This of course means knowing which data is being accessed in the future, which is the crucial problem of prefetching. In this paper we present some prefetching techniques and analyze how our implementations can improve the perfor- mance of some benchmarks. 
The main technique we are focusing on is an instruction based stride prefetcher, which uses a reference prediction table (RPT). We will be exploring the effect of different values of the maximum prefetch degree. The most important result is that a higher maximum prefetch degree increases the coverage, but at a certain point the achieved speedup will be negatively impacted. Using a maxi- mum prefetch degree of 5, this prefetching scheme achieved an overall average speedup of 1.085 compared to no prefetching. This is the best case result. In the next section we introduce some background on the topic of prefetching. Thereafter we describe our chosen scheme. We then cover our methodology, and present our results. Next we discuss two sequential prefetchers we also made, before going over some related work. The conclusions follow at the end. II. BACKGROUND THE architecture of a modern computer consists of multi- ple levels of caches in a hierarchy, as explained in Section I. On the top (but below the CPU registers) is the Level 1 cache. A cache miss takes place if the CPU requests data that is not present in a cache. The aim of good prefetchers is to reduce the number of cache misses and also to keep the memory
traffic as low as possible, thereby avoiding unnecessary data loads. We now give a rough overview of different classes of prefetching algorithms.

A. Sequential Prefetching

The simplest method is called sequential prefetching [2]. In this case, every time a cache-line is loaded, the prefetcher loads a specific number of following cache-lines. Due to spatial locality, this method works quite well and is in fact often used in modern CPU architectures [13]. An important definition for sequential prefetching is the prefetch degree: the number of subsequent cache-lines that will be prefetched when one cache-line is accessed [2].

B. Address Based Prefetching

In cases where memory is not accessed sequentially, we need other methods to design a good prefetching scheme. For example, when repeatedly accessing a specific element in a number of equally sized objects, a constant stride separates the accesses. In this case we can use address based prefetching to reduce the number of cache misses. The idea behind this method is to calculate the distance between the current and the previous memory access, and to use this distance as an offset for the next prefetch [2]. This method therefore skips all the memory in between, i.e. all the other elements in the objects.

C. Instruction Based Prefetching

A way to improve the simple address based prefetching is to use the instruction addresses to distinguish between the instructions executed by the CPU. This is mostly beneficial for more complex loop bodies. In this case a table is used to store the latest memory access for each executed load instruction. Every time memory is accessed, the table is used to check the previous load address for the current instruction. The difference between the previous load address and the current address is calculated as the stride, which is then used as an offset for the next prefetch. This is called Stride Directed Prefetching [2]. A way to improve this prefetching scheme is to add state information to the table that tracks the load instructions executed by the CPU. This is done in order to increase the accuracy of the prefetching scheme by reducing the number of prefetched blocks that would not be referenced by the CPU before being replaced. In this scheme, a reference prediction table (RPT) is used to contain the state information, in addition to the instruction address, the previous memory address, and the stride [2].

III. INSTRUCTION BASED STRIDE PREFETCHER USING AN RPT

THE implemented prefetcher is a stride prefetcher. It uses the addresses of the instructions executed by the CPU to track the load instructions and the addresses of the data they request. The load instructions are tracked using a table, where the PC, load address and the calculated values for the stride and confidence are stored. This table works as the RPT for the prefetcher, and is shown in Figure 2. When the CPU executes a load instruction, the PC for the load instruction is used to check if there exists an entry in the table with this PC. If an entry with the same PC as the executed load instruction is found, the previously stored load address is subtracted from the instruction's load address, and the difference is stored in the table as the stride. If the entry already has a non-zero stride value, the old stride value is compared to the calculated value. The confidence parameter is incremented if the new stride value matches the old one; otherwise the confidence for the entry is reset to zero.
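To make the table layout and the update rule concrete, the following C++ sketch shows one possible encoding. The names (RptEntry, update_entry) are ours for illustration and do not come from the actual implementation:

```cpp
#include <cstdint>

// One RPT row: the PC of a load instruction, the address it last
// loaded, the last observed stride, and a confidence counter that
// tracks how many times in a row the stride has been verified.
struct RptEntry {
    uint64_t pc;         // 8 bytes
    uint64_t last_addr;  // 8 bytes
    int64_t  stride;     // 8 bytes
    uint8_t  confidence; // 1 byte
};

// Update an existing entry when its load instruction executes again:
// recompute the stride, then bump or reset the confidence.
void update_entry(RptEntry &e, uint64_t load_addr) {
    int64_t new_stride = static_cast<int64_t>(load_addr) -
                         static_cast<int64_t>(e.last_addr);
    if (e.stride != 0 && new_stride == e.stride)
        ++e.confidence;   // stride verified one more time
    else
        e.confidence = 0; // stride changed or not yet established
    e.stride    = new_stride;
    e.last_addr = load_addr;
}
```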
If no entry can be found in the table for the executed load instruction, a new entry is added to the table with the PC and load address taken from the executed load instruction. A new entry has a stride and confidence equal to zero.

A queue is used to make sure that recently executed loads and their prefetch parameters are not evicted when entries have to be removed from the table. This is done in order to accommodate the size limit for the data structures used in the implementation. When the CPU executes a load instruction, the PC for this instruction is looked up in the queue, and moved to the top if found. If the PC is not found, it is added to the top of the queue. Figure 2 depicts how the instruction based stride prefetcher updates its data structures when the CPU executes a load instruction.

Fig. 2. Instruction based stride prefetcher

A prefetch is issued if the confidence parameter for the current load instruction is higher than zero, which means that the calculated stride value has been verified at least once for this load. When issuing a prefetch, the prefetch degree is set equal to the confidence parameter. Thus, the number of requests added to the prefetch queue is equal to the number of times the stride has been verified for the current load instruction.

To accommodate the size limit of 8 kB, an upper limit is set for the number of entries in the table and the queue. Since these structures contain the same number of elements at all times, the maximum number of entries for both structures is determined by the data types of the variables stored in them. An entry in the table consists of a PC (8 bytes), last address (8 bytes), stride (8 bytes), and a confidence parameter (1 byte). An entry in the queue only contains a reference to a PC (8 bytes). Thus, one entry across both data structures takes 33 bytes in total. In order to meet the requirement of using at most 8 kB, a maximum of 242 entries can be added to each data structure, as given by Equation 1.

max entries = 8 kB / size(2 × PC + last address + stride + confidence)    (1)

For each executed load instruction, the size of the data structures is checked. If the size would exceed 242 entries when a new entry is added, the queue is used to find the least recently used entry. This entry is then removed from both structures in order to make room for the new entry.
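A minimal sketch of this access handler, reusing RptEntry and update_entry from the sketch above, could look as follows. The issue_prefetch function stands in for the prefetch request call offered by the simulator interface, and MAX_DEGREE caps the confidence-derived degree; the exact names are our assumptions:

```cpp
#include <algorithm>
#include <cstdint>
#include <list>
#include <unordered_map>

// 8 kB budget divided by 33 bytes per entry (Equation 1) gives 242.
constexpr std::size_t MAX_ENTRIES = 242;
constexpr uint8_t MAX_DEGREE = 5;  // best-performing maximum prefetch degree

std::unordered_map<uint64_t, RptEntry> table;  // the RPT, keyed by load PC
std::list<uint64_t> lru;                       // front = most recently used PC

void issue_prefetch(uint64_t addr);            // provided by the simulator

void on_load(uint64_t pc, uint64_t load_addr) {
    auto it = table.find(pc);
    if (it == table.end()) {
        if (table.size() == MAX_ENTRIES) {     // full: evict the LRU entry
            table.erase(lru.back());           // from both structures
            lru.pop_back();
        }
        table.emplace(pc, RptEntry{pc, load_addr, 0, 0});
    } else {
        update_entry(it->second, load_addr);
        const RptEntry &e = it->second;
        // Prefetch degree equals the confidence, capped by the maximum degree.
        uint8_t degree = std::min(e.confidence, MAX_DEGREE);
        for (uint8_t d = 1; d <= degree; ++d)
            issue_prefetch(static_cast<uint64_t>(
                static_cast<int64_t>(load_addr) + d * e.stride));
    }
    lru.remove(pc);       // O(n) for brevity; a real implementation would
    lru.push_front(pc);   // keep a list iterator in each table entry
}
```

With confidence zero the loop body never runs, matching the rule that a prefetch is only issued once the stride has been verified at least once.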
IV. METHODOLOGY

WE use the M5 simulator to perform benchmarking of the implemented prefetchers studied in this paper. The simulator is a generic simulation infrastructure with useful tools for debugging and collecting statistics [4]. It is equipped with two CPU models: an in-order non-pipelined CPU named SimpleCPU, and the O3CPU, an out-of-order, superscalar and pipelined CPU that features a simultaneous multithreading model [4]. When benchmarking the prefetchers, the O3CPU model is used to model the simulated architecture. It is based on the DEC Alpha Tsunami architecture, where the Alpha 21264 microprocessor is used as a reference [3]. The memory structure of the Alpha 21264 microprocessor is shown in Figure 3, with values corresponding to the values used in the simulated system, given in Table I.

Fig. 3. Alpha 21264 memory structure

Benchmarking is performed by running a Python script which defines the test parameters used in the simulator. A selection of the SPEC CPU2000 benchmark suite is used to test the performance of the prefetcher, consisting of the benchmarks ammp, applu, apsi, art110, art470, bzip2 program, bzip2 source, galgel, swim, twolf and wupwise.

TABLE I
TEST PARAMETERS USED FOR BENCHMARKING

  Test parameter                  Value
  CPU clock speed                 2 GHz
  Physical memory size            256 MB
  Caches with prefetching         L2
  L2 size                         1 MB
  Memory bus clock speed          400 MHz
  Memory bus width                64 b
  Memory latency                  30 ns
  Instruction checkpoint restore  1e9
  Warmup instructions             1e7
  Maximum instructions            1e8

The prefetcher interfaces the simulated memory system through an interface file supplied together with the M5 simulator. The interface includes a function called by the simulator before any memory is accessed, which lets the prefetcher initialize the necessary data structures, in addition to functions that notify the prefetcher when the cache is accessed and when a prefetch is completed. The interface also includes functions the prefetcher can call to issue a prefetch, and to set, clear and check prefetch bits available in the L2 cache. Functions to check if a memory block is currently in the cache and to monitor the prefetch queue are also available.

In order to measure and compare our results we use the following metrics. The accuracy of a prefetcher is defined as the number of good prefetches divided by the number of total prefetches, as shown by Equation 2. An ideal prefetcher has an accuracy of one.

accuracy = good prefetches / total prefetches    (2)

To also include the right timing of the prefetches we can look at the coverage. This is defined as the number of useful prefetches divided by the number of cache misses the program would generate without prefetching. An ideal prefetcher has a coverage of one. The formula used to calculate the coverage is given by Equation 3.

coverage = useful prefetches / cache misses without prefetching    (3)

Another metric that is important in order to understand the measurements we made is the speedup. It is defined as the computation time without prefetching, divided by the computation time with prefetching enabled, as shown by Equation 4.

speedup = execution time without prefetcher / execution time with prefetcher    (4)
The higher the speedup, the better the prefetcher works. A speedup of one means that the prefetcher does not reduce the computation time of the program, while a speedup of less than one means that the prefetcher slows down the program.
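These three ratios, and the harmonic mean used to average them across benchmarks (Equation 5 below), are simple enough to state directly in code. The following small C++ sketch is ours, with variable names chosen purely for illustration:

```cpp
#include <vector>

double accuracy(double good_prefetches, double total_prefetches) {
    return good_prefetches / total_prefetches;       // Equation 2
}

double coverage(double useful_prefetches, double misses_without_pf) {
    return useful_prefetches / misses_without_pf;    // Equation 3
}

double speedup(double time_without_pf, double time_with_pf) {
    return time_without_pf / time_with_pf;           // Equation 4
}

// Harmonic mean of per-benchmark results (Equation 5).
double harmonic_mean(const std::vector<double> &xs) {
    double inv_sum = 0.0;
    for (double x : xs) inv_sum += 1.0 / x;
    return static_cast<double>(xs.size()) / inv_sum;
}
```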
To compare the results we get from the different benchmarks, the harmonic mean, given by Equation 5, is used. This is how we obtain the average values presented in this paper.

H_avg = n / (1/x_1 + 1/x_2 + ... + 1/x_n) = n / (sum_{i=1}^{n} 1/x_i)    (5)

V. RESULTS

FIGURE 4 depicts the obtained speedup for the prefetcher described in Section III, for different values of the maximum prefetch degree. As seen in the figure, a speedup above 1.5 is achieved for the ammp benchmark, and above 1.2 for the wupwise benchmark. For the remaining benchmarks, the speedup is below 1.1, giving a best average speedup of 1.085 for a maximum prefetch degree of 5.

Fig. 4. Speedup results for the instruction based stride prefetcher from the benchmarks for different values of the maximum prefetch degree

In Figure 5, the accuracy is shown for the different benchmarks, for different values of the maximum prefetch degree. The figure shows that a higher maximum prefetch degree is beneficial for the art110 and art470 benchmarks, while for the apsi benchmark, a lower maximum degree leads to a higher accuracy.

Fig. 5. Accuracy results for the instruction based stride prefetcher from the benchmarks for different values of the maximum prefetch degree

Figure 6 depicts the coverage for the different benchmarks, for different values of the maximum prefetch degree. From this figure, it is seen that for the majority of the benchmarks, the coverage increases with an increasing maximum prefetch degree. A maximum degree of 5 obtains coverage results close to those given by maximum prefetch degrees of 10 and 20 for the majority of the benchmarks.

Fig. 6. Coverage results for the instruction based stride prefetcher from the benchmarks for different values of the maximum prefetch degree

In Table II, the mean values for the accuracy, coverage and speedup from the benchmarks are shown for the different values of the maximum prefetch degree.

TABLE II
MEAN VALUES OF THE BENCHMARK RESULTS

  Maximum prefetch degree   Accuracy   Coverage   Speedup
  1                         0.618      0.118      1.055
  5                         0.639      0.155      1.085
  10                        0.639      0.157      1.082
  20                        0.632      0.155      1.074

Other prefetching schemes like adaptive sequential prefetching, Delta Correlating Prediction Tables (DCPT), Delta Correlating Prediction Tables with L1 hoisting (DCPT-P), standard prefetching with RPT, and related schemes are listed in Table III together with their average speedup results. Compared to these schemes, the instruction based stride prefetcher using RPT obtains a speedup slightly higher than the DCPT-P scheme, which has the highest average speedup of all the schemes in Table III. The instruction based stride prefetcher using RPT obtains this speedup with a maximum prefetch degree of 5, as shown in Table II. The prefetching schemes listed in Table III come as part of the M5 simulator,
described in Section IV.

TABLE III
MEAN VALUES OF THE SPEEDUP FOR OTHER PREFETCHING SCHEMES

  Prefetching scheme     Speedup
  Adaptive sequential    0.998
  DCPT                   1.052
  DCPT-P                 1.084
  RPT                    1.057
  Sequential-on-access   1.011
  Sequential-on-miss     0.993
  Tagged                 1.000

VI. DISCUSSION

THIS section describes the two sequential prefetchers we also implemented, and how they compare to the instruction based stride prefetcher presented in Section III.

A. Variable degree sequential prefetcher

This prefetcher looks for sequential accesses, which are defined as accesses that happen in adjacent memory blocks. The prefetching degree starts at 1, and the prefetching is performed sequentially on every access. For every sequential access, the prefetching degree is increased by a constant number. If the sequential access pattern is broken, the prefetching degree is reset to 1.

The result is highly dependent on how aggressively the prefetching degree is increased. This scheme shows a moderate speedup if the prefetching degree is increased very conservatively. If the prefetching degree is increased too quickly, the speedup drops below one, i.e. the prefetcher slows the program down. Overall speedup is shown in Table IV, and detailed results are shown in Figures 7, 8 and 9.

TABLE IV
MEAN VALUES OF THE SPEEDUP FOR THE VARIABLE DEGREE SEQUENTIAL PREFETCHER

  Degree incrementation per sequential access   Speedup
  1                                             1.012
  2                                             1.009
  4                                             0.978
  8                                             0.944

Fig. 7. Speedup results for the variable degree sequential prefetcher in the benchmarks for different values of degree incrementation per sequential access

Fig. 8. Accuracy results for the variable degree sequential prefetcher in the benchmarks for different values of degree incrementation per sequential access

Fig. 9. Coverage results for the variable degree sequential prefetcher in the benchmarks for different values of degree incrementation per sequential access

Compared to the instruction based stride prefetcher, the biggest speedup difference comes in the ammp benchmark. It can be seen that this difference is due to a disparity in accuracy and coverage between the prefetchers. The instruction based stride prefetcher is to a large extent able to identify the benchmark's access patterns, and manages to achieve high accuracy and coverage. The variable degree sequential prefetcher, on the other hand, has accuracy and coverage very close to zero in this benchmark.
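The core of this scheme fits in a few lines. The following C++ sketch is our reconstruction under stated assumptions: a 64-byte cache block, an issue_prefetch call standing in for the simulator interface, and a neutral response to repeated accesses to the same block, which the description above does not specify:

```cpp
#include <cstdint>

constexpr uint64_t BLOCK_SIZE = 64;  // cache block size; our assumption
constexpr int INCREMENT = 1;         // degree incrementation per sequential access

uint64_t last_block = 0;             // block number of the previous access
int degree = 1;

void issue_prefetch(uint64_t addr);  // provided by the simulator

// Called on every cache access: ramp the degree up while the access
// pattern stays sequential, reset it when the pattern breaks, and
// prefetch `degree` blocks following the current one.
void on_access(uint64_t addr) {
    uint64_t block = addr / BLOCK_SIZE;
    if (block == last_block + 1)
        degree += INCREMENT;         // sequential access: more aggressive
    else if (block != last_block)
        degree = 1;                  // pattern broken: back to degree 1
    last_block = block;
    for (int d = 1; d <= degree; ++d)
        issue_prefetch((block + d) * BLOCK_SIZE);
}
```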
B. Adaptive tagged sequential prefetcher

This prefetcher tags the prefetched addresses. The prefetching degree starts at 1, and the prefetching is performed sequentially on every access. Once a certain number of memory accesses has happened, this prefetcher uses the tags to count how many of the prefetches were useful, i.e. used by the benchmark. If the number of useful prefetches is high enough, the prefetching degree is increased by 1, and it is decreased by 1 if the number of useful prefetches is too low. High enough is defined as >60%, and too low is defined as <40%.

The results show an overall speedup below one for all sizes of the memory access counting window, i.e. the prefetcher slows the benchmarks down on average. The size of the counting window also does not seem to affect the speedup significantly. Table V shows the overall speedup, and Figures 10, 11 and 12 depict the detailed results.

TABLE V
MEAN VALUES OF THE SPEEDUP FOR THE ADAPTIVE TAGGED SEQUENTIAL PREFETCHER

  Memory access counting window size   Speedup
  10                                   0.984
  20                                   0.981
  40                                   0.980
  80                                   0.983

Fig. 10. Speedup results for the adaptive tagged sequential prefetcher in the benchmarks for different sizes of the memory access counting window

Fig. 11. Accuracy results for the adaptive tagged sequential prefetcher in the benchmarks for different sizes of the memory access counting window

Fig. 12. Coverage results for the adaptive tagged sequential prefetcher in the benchmarks for different sizes of the memory access counting window

The adaptive tagged sequential prefetcher has the same problem as the variable degree sequential prefetcher in the ammp benchmark; it achieves a speedup below one because the accuracy and coverage are very close to zero. Additionally, for the adaptive tagged sequential prefetcher the galgel benchmark also stands out with a speedup below one. This happens even though the coverage is comparable to both the variable degree sequential prefetcher and the instruction based stride prefetcher. The difference here lies in the accuracy, where the adaptive tagged sequential prefetcher's accuracy is much lower than that of the other two prefetchers.
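A sketch of the feedback loop is given below. It is our reconstruction, not the actual implementation: we track tags in a set rather than using the L2 prefetch bits mentioned in Section IV, we take the useful fraction over the prefetches issued within the window (the description above does not pin the ratio down exactly), and BLOCK_SIZE and the function names are assumptions:

```cpp
#include <cstdint>
#include <unordered_set>

constexpr uint64_t BLOCK_SIZE = 64;   // cache block size; our assumption
constexpr int WINDOW = 40;            // memory access counting window size

std::unordered_set<uint64_t> tagged;  // blocks brought in by a prefetch
int degree = 1;
int accesses = 0, issued = 0, useful = 0;

void issue_prefetch(uint64_t addr);   // provided by the simulator

void on_access(uint64_t addr) {
    uint64_t block = addr / BLOCK_SIZE;
    if (tagged.erase(block))
        ++useful;                     // a demand access hit a tagged prefetch
    for (int d = 1; d <= degree; ++d) {
        issue_prefetch((block + d) * BLOCK_SIZE);  // sequential prefetch
        tagged.insert(block + d);
        ++issued;
    }
    if (++accesses == WINDOW) {       // end of the counting window
        double ratio = issued > 0 ? static_cast<double>(useful) / issued : 0.0;
        if (ratio > 0.6)
            ++degree;                 // high enough: be more aggressive
        else if (ratio < 0.4 && degree > 1)
            --degree;                 // too low: back off
        accesses = issued = useful = 0;
    }
}
```

In hardware the tag set would be bounded by the cache itself, since a prefetch bit lives alongside each cache block; the unbounded set here is a simplification for illustration.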
VII. RELATED WORK

VARIOUS aspects of caches are examined by A. J. Smith [6], including ways to prefetch data into them. It is theorized that the only viable way to prefetch data into a cache in practice is to prefetch the immediately sequential memory block, due to the need for a fast hardware implementation. This is known as one block lookahead (OBL). Three methods for OBL are focused on: always prefetch, prefetch on misses and tagged prefetch. Prefetch on misses is found to be the least effective method, but both always prefetch and tagged prefetch succeed well in reducing the miss ratio. Additionally, the tagged prefetch method only requires a small increase in the access ratio over prefetch on misses, thus utilizing memory bandwidth more efficiently.

S. VanderWiel and D. Lilja [8], [9] describe many details of prefetching, including tagged prefetching. Their paper discusses a number of aspects that need to be taken into account
when designing a prefetcher, such as when the prefetches are initiated, where the prefetched data is placed, and what the unit of prefetch should be. Their findings show that no prefetching scheme is optimal for all workloads; they all have different strengths and weaknesses.

S. Srinath et al. [10] propose a tagged prefetching mechanism that incorporates dynamic feedback into its design. They call their method feedback directed prefetching (FDP), and the goal is to increase the performance improvement provided by prefetching as well as to reduce the bandwidth usage. Their prefetcher estimates its accuracy, timeliness and the prefetcher-caused cache pollution in order to dynamically adjust its aggressiveness. The results show that the FDP prefetcher improves the average performance by 6.5% compared to the best performing conventional stream based prefetcher, while consuming 18.7% less memory bandwidth. The improvements developed for the FDP prefetcher are shown to be applicable to several other types of prefetchers, including global history buffer delta correlation prefetchers (GHB/DC) and PC based stride prefetchers.

K. Nesbit and J. Smith [11] examine a technique which addresses the problems of static prefetch tables. To solve the problem of fixed memory per prefetch key, they use a global history buffer (GHB), which splits the prefetch keys and the addresses into two different structures. The results show that the GHB profits most when a global address is used as the key; in that case it is 20% faster and the memory traffic is reduced by 90%. For the instruction pointer based methods the GHB is also faster (9%), but the memory traffic is the same. They also point out that it is more important for a prefetcher to be accurate than to be fast, since memory bandwidth is such an important factor in modern CPUs.

L. Ramos et al. [12] suggest that prefetchers should work on multiple levels to fit different purposes. In their case they design them to minimize the cost of the prefetcher, to minimize the loss due to bad prefetching, or to maximize the average performance. They also introduce a new correlating prefetcher, which they call PDFCM (Prefetching based on a Differential Finite Context Machine). It uses a history and a distance table to calculate and store multiple strides. By using a simple sequential prefetcher they are able to minimize the cost to 1 kB. Their PDFCM is able to eliminate all performance losses caused by the prefetcher. By combining both techniques they obtained the best average performance.

VIII. CONCLUSION

THIS paper focuses on how prefetching can be used to mitigate the problems caused by the memory wall. The more specific goal of our work is to explore the effects of the maximum prefetch degree in a stride prefetcher using an RPT. In order to do this we implemented an instruction based stride prefetcher, and ran it through a selection of the SPEC CPU2000 benchmarks with varying values of the maximum prefetch degree.

From the results given in Section V, it is seen that a higher limit for the confidence parameter (given by a higher maximum prefetch degree) leads to increased coverage for the majority of the benchmarks used to test the prefetcher discussed in Section III. As for the accuracy, increasing the maximum prefetch degree does not seem to have a consistent effect, with varying results across the benchmarks used.
Although a higher maximum prefetch degree yields a higher coverage for the prefetcher, the speedup does not increase monotonically with the maximum prefetch degree. This can be explained by looking at the definitions of coverage and speedup given in Equations 3 and 4 in Section IV. The higher coverage may come with a lower speedup, since the memory bus must spend a larger share of its bandwidth on prefetching. This bandwidth could otherwise have been used to load more important data, which would have made it possible to obtain a higher speedup.

Compared to the sequential prefetcher with a variable prefetch degree and the adaptive tagged sequential prefetching scheme studied in Section VI, the performance of the instruction based stride prefetcher using an RPT is either equal to these schemes or higher, depending on the benchmark. The overall best case difference in speedup for the instruction based stride prefetcher is 7.3% compared to the sequential prefetcher with a variable prefetch degree and 10.1% compared to the adaptive tagged sequential prefetching scheme.

REFERENCES

[1] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. ACM Computer Architecture News, 23(1); March 1995.
[2] M. Grannæs. Reducing Memory Latency by Improving Resource Utilization. Norway: Norwegian University of Science and Technology; June 2010.
[3] M5 simulator system TDT4260 Computer Architecture User documentation. Norway: Norwegian University of Science and Technology; January 2014.
[4] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE Micro, vol. 26, no. 4, pp. 52-60, July/August: The IEEE Computer Society; 2006.
[5] J. L. Hennessy and D. A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers; 2006.
[6] A. J. Smith. Cache memories. ACM Comput. Surv., 14(3), pp. 473-530; 1982.
[7] P. J. Denning. On modeling program behavior. In Proc. Spring Joint Computer Conference, vol. 40, pp. 937-944, Arlington, Va, USA: AFIPS Press; 1972.
[8] S. VanderWiel and D. Lilja. A survey of data prefetching techniques. Technical Report 5, University of Minnesota; October 1996.
[9] S. VanderWiel and D. Lilja. When caches aren't enough: data prefetching techniques. Computer, 30(7), pp. 23-30; July 1997.
[10] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. Technical report, TR-HPS-2006-006, University of Texas at Austin; May 2006.
[11] K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. IEEE Micro, 25, pp. 90-97; January 2005.
[12] L. M. Ramos, J. L. Briz, P. E. Ibanez, and V. Vinals. Multi-level adaptive prefetching based on performance gradient tracking. In Data Prefetching Championship-1; 2009.
[13] Inside Intel Core Microarchitecture and Smart Memory Access, https://software.intel.com/sites/default/files/m/d/4/1/d/8/sma.pdf; April 2015.