TDT4260 Miniproject Report, Group 3
Effects of the Maximum Prefetch Degree in a Stride
Prefetcher Using a Reference Prediction Table
Yulong Bai, Fredrik Berdon Haave, Victor Iversen and Dominik Heinrich Thönnes
Abstract—Since the introduction of the integrated circuit, the
performance of CPUs has increased at a much higher rate than
the performance of memory. This has over time led to a gap
in performance called the memory wall [1]. Since CPUs are
dependent on memory to function as efficient computer systems,
caches and prefetching have been introduced to hide the memory
latency seen by the CPU. In this paper, an instruction based stride
prefetcher using a reference prediction table (RPT) is studied,
and we examine how its performance is affected by its state
information, specifically the maximum prefetch degree. We
show that increasing this value increases
the coverage of the prefetching scheme. However, the highest
performance is achieved for a low maximum prefetch degree,
which gives a higher overall speedup when compared to a
sequential prefetcher with a variable prefetch degree and an
adaptive tagged sequential prefetching scheme.
I. INTRODUCTION
ONE of the biggest problems in modern computer archi-
tecture is the memory wall [1]. It describes the problem
that the performance of memory is growing much slower than
the speed of CPUs, as can be observed in Figure 1. This
development has existed since the early years of computers,
and is still in effect. As the gap widens over the years, it
becomes increasingly important to use the memory efficiently.
The gains in CPU performance will go to waste
if it cannot be fed quickly enough with data.
Fig. 1. The increasing gap in performance between the CPU and the memory
over time. This figure is from [5].
Originating from heavy use of sequential and looping
control structures, computer programs tend to reference a
concentrated part of the total memory space over a longer
time. This phenomenon is called locality [7]. When a program
tends to reference the same memory addresses over a longer
period of time, this is called temporal locality. When the
memory addresses needed by the program are located close to
the already accessed parts of the memory, it is called spatial
locality [6].
A method for bridging the performance gap between the
memory and the CPU is the memory hierarchy, which exploits
locality in computer programs. This solution is also supported
by the fact that memory generally gets faster the smaller
it is. The memory hierarchy in practice means that modern
computers have a small amount of very fast memory close
to the CPU, a large amount of relatively slow memory further
away from the CPU, and several layers in between. In addition
to weakening the impact of the memory wall, another benefit
from this is the ability to hold frequently used data high in
the hierarchy. This increases the memory access speed for
software that often reuses its data.
A method to increase the utilization of higher levels of
the memory hierarchy is known as prefetching. It is used to
load data which is used later to a faster level in the memory
hierarchy. By doing this, the goal is to hide memory latency by
loading new data while the current data is still being processed.
This of course requires knowing which data will be accessed
in the future, which is the crucial problem of prefetching.
In this paper we present some prefetching techniques and
analyze how our implementations can improve the perfor-
mance of some benchmarks. The main technique we are
focusing on is an instruction based stride prefetcher, which
uses a reference prediction table (RPT). We will be exploring
the effect of different values of the maximum prefetch degree.
The most important result is that a higher maximum prefetch
degree increases the coverage, but at a certain point the
achieved speedup will be negatively impacted. Using a maxi-
mum prefetch degree of 5, this prefetching scheme achieved an
overall average speedup of 1.085 compared to no prefetching.
This is the best case result.
In the next section we introduce some background on
the topic of prefetching. Thereafter we describe our chosen
scheme. We then cover our methodology, and present our
results. Next we discuss two sequential prefetchers we also
made, before going over some related work. The conclusions
follow at the end.
II. BACKGROUND
THE architecture of a modern computer consists of multi-
ple levels of caches in a hierarchy, as explained in Section
I. On the top (but below the CPU registers) is the Level 1
cache. A cache miss takes place if the CPU requests data that is
not present in a cache. The aim of good prefetchers is to reduce
the number of cache misses and also to keep the memory
traffic as low as possible, thereby avoiding unnecessary data
loads. We now give a rough overview of different classes of
prefetching algorithms.
A. Sequential Prefetching
The easiest method is called sequential prefetching [2]. In
this case every time a cache-line is loaded, the prefetcher loads
a specific number of following cache-lines. Due to spatial
locality, this method works quite well and is in fact often used
in modern CPU architectures [13].
An important definition for sequential prefetching is the
prefetch degree. This is the number of subsequent cache-lines
that will be prefetched when one cache-line is accessed [2].
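As a rough sketch (not the simulator's interface; `BLOCK_SIZE` and the function name are assumptions for illustration), a sequential prefetcher of degree k simply computes the k cache-lines that follow the accessed one:

```python
BLOCK_SIZE = 64  # assumed cache-line size in bytes

def sequential_prefetch(addr, degree):
    """Return the addresses of the `degree` cache-lines following `addr`."""
    base = (addr // BLOCK_SIZE) * BLOCK_SIZE  # align to the line boundary
    return [base + i * BLOCK_SIZE for i in range(1, degree + 1)]
```

For example, a degree-2 prefetch triggered by an access to 0x1000 would request the lines at 0x1040 and 0x1080.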
B. Address Based Prefetching
In cases where memory is not accessed sequentially, we
need other methods to design a good prefetching scheme.
For example when accessing a specific element in a number
of objects, a constant stride occurs between the elements. In
this case we can use address based prefetching to reduce the
number of cache misses. The idea behind this method is to
calculate the distance between the current and the previous
memory access, and to use this distance as an offset for the
next prefetch [2]. This method therefore skips all the memory
in between, i.e. all the other elements in the object.
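A minimal sketch of this idea, assuming the prefetcher remembers only the single previous access address:

```python
def next_prefetch(prev_addr, curr_addr):
    """Use the distance between the two latest accesses as the next offset."""
    stride = curr_addr - prev_addr  # e.g. the size of one object
    return curr_addr + stride       # skips all the memory in between
```

For accesses at addresses 100 and 132 the stride is 32, so the next prefetch targets 164.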
C. Instruction Based Prefetching
A way to improve the simple address based prefetching is
to use the instruction addresses to distinguish between the
instructions executed by the CPU. This is mostly beneficial
for more complex loop bodies. In this case a table is used
to store the latest memory access for each executed load
instruction. Every time memory is accessed, the table is used
to check the previous load address for the current instruction.
The difference between the previous load address and the
current address is calculated as the stride, which is then
used as an offset for the next prefetch. This is called Stride
Directed Prefetching [2]. A way to improve this prefetching
scheme is to add state information to the table that tracks
the load instructions executed by the CPU. This is done in
order to increase the accuracy of the prefetching scheme by
reducing the number of prefetched blocks that would not be
referenced by the CPU before being replaced. In this scheme,
a reference prediction table (RPT) is used to hold the state
information, together with the instruction address, the previous
memory address, and the stride [2].
III. INSTRUCTION BASED STRIDE PREFETCHER USING AN
RPT
THE implemented prefetcher is a stride prefetcher. It uses
the addresses of the instructions executed by the CPU to
track the load instructions and the addresses of the data they
request. The load instructions are tracked using a table, where
the PC, load address and the calculated values for the stride
and confidence are stored. This table works as the RPT for
the prefetcher, and is shown in Figure 2.
When the CPU executes a load instruction, the PC for the
load instruction is used to check if there exists an entry in the
table with this PC. If an entry with the same PC as the executed
load instruction is found, the previously stored load address
is subtracted from the instruction’s load address, and stored
in the table as the stride. If the entry already has a non-zero
stride value, the old stride value is compared to the calculated
value. The confidence parameter is incremented if the new
stride value matches the old one, otherwise the confidence for
the entry is reset to zero. If no entry can be found in the table
for the executed load instruction, a new entry is added to the
table with the PC and load address taken from the executed
load instruction. A new entry has a stride and confidence equal
to zero.
A queue is used to make sure that recently executed loads
and their prefetch parameters are not removed from the table
when entries have to be removed. This is done in order to
accommodate the size limit for the data structures used in the
implementation. When the CPU executes a load instruction,
the PC for this instruction is looked up in the queue, and
moved to the top if found. If the PC is not found, it is added
to the top of the queue. Figure 2 depicts how the instruction
based stride prefetcher updates its data structures when the
CPU executes a load instruction.
Fig. 2. Instruction based stride prefetcher
A prefetch is issued if the confidence parameter for the
current load instruction is higher than zero, which means that
the calculated stride value has been verified at least once for
this load. When issuing a prefetch, the prefetch degree is
set equal to the confidence parameter. Thus, the number of
requests added to the prefetch queue is equal to the number
of times the stride has been verified for the current load
instruction.
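The update and issue logic described above can be sketched in Python. This is an illustrative model, not the M5 implementation: the class name, the use of an `OrderedDict` to combine the table with the LRU queue, the returned list standing in for the prefetch queue, and the prefetch-address formula `addr + i × stride` are all assumptions.

```python
from collections import OrderedDict

MAX_ENTRIES = 242   # capacity from Equation 1
MAX_DEGREE = 5      # best-performing maximum prefetch degree (Section V)

class StridePrefetcherRPT:
    """Illustrative model of the RPT update and prefetch-issue logic."""

    def __init__(self):
        # An OrderedDict stands in for both the table and the LRU queue:
        # pc -> (last_address, stride, confidence)
        self.rpt = OrderedDict()

    def on_load(self, pc, addr):
        if pc not in self.rpt:
            if len(self.rpt) >= MAX_ENTRIES:
                self.rpt.popitem(last=False)   # evict least recently used
            self.rpt[pc] = (addr, 0, 0)        # new entry: stride = confidence = 0
            return []
        last_addr, stride, conf = self.rpt[pc]
        new_stride = addr - last_addr
        if stride != 0:                        # only compare non-zero old strides
            conf = conf + 1 if new_stride == stride else 0
        self.rpt[pc] = (addr, new_stride, conf)
        self.rpt.move_to_end(pc)               # mark as most recently used
        # the prefetch degree equals the confidence, capped by the maximum degree
        degree = min(conf, MAX_DEGREE)
        return [addr + new_stride * i for i in range(1, degree + 1)]
```

With a constant stride of 8, the third access to the same PC verifies the stride once and issues a single prefetch; each further access raises the confidence, and with it the degree, until the cap is reached.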
To accommodate the size limit of 8 kB, an upper limit is
set for the number of entries in the table and the queue. Since
these structures contain the same number of elements at all
times, the maximum number of entries for both structures is
determined by the data types of the variables stored in them.
An entry in the table consists of a PC (8 bytes), last address (8
bytes), stride (8 bytes), and a confidence parameter (1 byte).
An entry in the queue only contains a reference to a PC (8
bytes). Thus, an entry in both data structures is 33 bytes in
total. In order to meet the requirement of using at most 8 kB, a
maximum of 242 entries can be added to each data structure,
as given by Equation 1. For each executed load instruction,
the size of the data structures is checked. If the size is going
to be more than 242 entries when a new entry is added, the
queue is used to find the least recently used entry. This entry
is then removed from both structures in order to make room
for the new entry.
max entries = 8 kB / size(2 × PC + last address + stride + confidence) (1)
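The arithmetic behind Equation 1 can be checked directly; the 242-entry limit follows if 8 kB is taken as 8000 bytes (8192 bytes would give 248 entries), which is the assumption made here:

```python
PC = LAST_ADDRESS = STRIDE = 8   # bytes per field
CONFIDENCE = 1                   # bytes

# the PC is counted twice: once in the table entry and once in the queue
entry_size = 2 * PC + LAST_ADDRESS + STRIDE + CONFIDENCE   # 33 bytes
max_entries = 8000 // entry_size                           # 242 entries
```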
IV. METHODOLOGY
WE use the M5 simulator to perform benchmarking of
the implemented prefetchers studied in this paper. The
simulator is a generic simulation infrastructure with useful
tools for debugging and collecting statistics [4]. It is equipped
with two CPU models: an in-order non-pipelined CPU named
SimpleCPU, and the O3CPU, which is an out-of-order, su-
perscalar and pipelined CPU which features a simultaneous
multithreading model [4]. When performing benchmarking
of the prefetchers, the O3CPU model is used to model the
simulated architecture. It is based on the DEC Alpha Tsunami
architecture, where the Alpha 21264 microprocessor is used
as a reference [3]. The memory structure of the Alpha 21264
microprocessor is shown in Figure 3, with values correspond-
ing to the values used in the simulated system, given in Table
I.
Fig. 3. Alpha 21264 memory structure
Benchmarking is performed by running a Python script
which defines the test parameters used in the simulator. A
TABLE I
TEST PARAMETERS USED FOR BENCHMARKING
Test parameter Value
CPU clock speed 2 GHz
Physical memory size 256 MB
Caches with prefetching L2
L2 size 1 MB
Memory bus clock speed 400 MHz
Memory bus width 64 bits
Memory latency 30 ns
Instruction checkpoint restore 1e9
Warmup-instructions 1e7
Maximum instructions 1e8
selection of the SPEC CPU2000 benchmark suite is used to
test the performance of the prefetcher, where the selection
consists of the benchmarks ammp, applu, apsi, art110, art470,
bzip2 program, bzip2 source, galgel, swim, twolf and wup-
wise.
The prefetcher interfaces the simulated memory system
through an interface file supplied together with the M5 simu-
lator. The interface includes functions called by the simulator
to notify the prefetcher before any memory is accessed, to
enable the prefetcher to initialize necessary data structures,
in addition to functions that notify the prefetcher when the
cache is accessed, and when a prefetch is completed. The
interface also includes functions that could be called by the
prefetcher in order to issue a prefetch, and to set, clear and
check prefetch bits available in the L2 cache. Functions to
check if a memory block is currently in the cache and to
monitor the prefetch queue are also available.
In order to measure and compare our results we use the
following metrics: the accuracy of a prefetcher is defined as
the number of good prefetches divided by the number of total
prefetches, as shown by Equation 2. An ideal prefetcher has
an accuracy of one.
accuracy = good prefetches / total prefetches (2)
To also include the right timing of the prefetches we can
look at the coverage. This is defined as the number of useful
prefetches divided by the number of cache misses the program
would generate without prefetching. An ideal prefetcher has a
coverage of one. The formula used to calculate the coverage
is given by Equation 3.
coverage = useful prefetches / cache misses without prefetching (3)
Another metric that is important in order to understand
the measurements we made is the speedup. It is defined as
the computation time without prefetching, divided by the
computation time with prefetching enabled, as shown by
Equation 4. The higher the speedup, the better the prefetcher
works. A speedup of one means that the prefetcher does not
reduce the computation time of the program, while a speedup
of less than one means that the prefetcher slows down the
program.
speedup = execution time without prefetcher / execution time with prefetcher (4)
To compare the results we get out of the different bench-
marks, the harmonic mean, given by Equation 5, is used. This
is how we obtain the average values presented in this paper.
Havg = n / (1/x1 + 1/x2 + · · · + 1/xn) (5)
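The four metrics translate directly into code; a minimal sketch:

```python
def accuracy(good_prefetches, total_prefetches):
    return good_prefetches / total_prefetches               # Equation 2

def coverage(useful_prefetches, misses_without_prefetching):
    return useful_prefetches / misses_without_prefetching   # Equation 3

def speedup(time_without, time_with):
    return time_without / time_with                         # Equation 4

def harmonic_mean(values):
    return len(values) / sum(1.0 / x for x in values)       # Equation 5
```

Note that the harmonic mean weights slow benchmarks more heavily than the arithmetic mean would, which is why it is the conventional choice for averaging speedups.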
V. RESULTS
FIGURE 4 depicts the obtained speedup for the prefetcher
described in Section III, for different values of the maxi-
mum prefetch degree. As seen in the figure, a speedup above
1.5 is achieved for the ammp benchmark, and above 1.2 for
the wupwise benchmark. For the remaining benchmarks, the
speedup is below 1.1, giving a best average speedup of 1.085
for a maximum prefetch degree of 5.
Fig. 4. Speedup results for the instruction based stride prefetcher from the
benchmarks for different values of the maximum prefetch degree
In Figure 5, the accuracy is shown for the different bench-
marks, for different values of the maximum prefetch degree.
The figure shows that a higher maximum prefetch degree is
beneficial for the art110 and art470 benchmarks, while for the
apsi benchmark, a lower maximum degree leads to a higher
accuracy.
Figure 6 depicts the coverage for the different benchmarks,
for different values of the maximum prefetch degree. From
this figure, it is seen that for the majority of the benchmarks,
the coverage increases with an increasing maximum prefetch
degree. A maximum degree of 5 obtains coverage results close
to the results given by using a maximum prefetch degree of
10 and 20 for the majority of the benchmarks.
In Table II, the mean values for the accuracy, coverage
and speedup from the benchmarks are shown for the different
values of the maximum prefetch degree.
Other prefetching schemes like adaptive sequential prefetch-
ing, Delta Correlating Prediction Tables (DCPT), Delta Corre-
lating Prediction Tables with L1 hoisting (DCPT-P), standard
Fig. 5. Accuracy results for the instruction based stride prefetcher from the
benchmarks for different values of the maximum prefetch degree
Fig. 6. Coverage results for the instruction based stride prefetcher from the
benchmarks for different values of the maximum prefetch degree
TABLE II
MEAN VALUES OF THE BENCHMARK RESULTS
Maximum prefetch degree Accuracy Coverage Speedup
1 0.618 0.118 1.055
5 0.639 0.155 1.085
10 0.639 0.157 1.082
20 0.632 0.155 1.074
prefetching with RPT, and related schemes are listed in Table
III together with their average speedup results. Compared to
these schemes, the instruction based stride prefetcher using
RPT obtains a speedup slightly higher than the DCPT-P
scheme, which is the scheme with the highest average speedup
of all the schemes in Table III. The instruction based stride
prefetcher using RPT obtains this speedup with a maximum
prefetch degree of 5, as shown in Table II. The prefetching
schemes listed in Table III come as part of the M5 simulator,
described in Section IV.
TABLE III
MEAN VALUES OF THE SPEEDUP FOR OTHER PREFETCHING SCHEMES
Prefetching scheme Speedup
Adaptive sequential 0.998
DCPT 1.052
DCPT-P 1.084
RPT 1.057
Sequential-on-access 1.011
Sequential-on-miss 0.993
Tagged 1.000
VI. DISCUSSION
THIS section describes the two sequential prefetchers
we also implemented, and how they compare to the
instruction based stride prefetcher presented in Section III.
A. Variable degree sequential prefetcher
This prefetcher looks for sequential accesses, which are
defined as accesses to adjacent memory blocks. The
prefetching degree starts at 1, and the prefetching is performed
sequentially on every access. For every sequential access, the
prefetching degree is increased by a constant number. If the
sequential access pattern is broken, the prefetching degree is
reset to 1.
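The scheme above can be sketched as follows; this is an illustrative model, with `BLOCK_SIZE`, the class name, and the treatment of repeated accesses to the same block (degree left unchanged) as assumptions:

```python
BLOCK_SIZE = 64   # assumed cache-line size in bytes

class VariableDegreeSequential:
    """Illustrative model of the variable degree sequential prefetcher."""

    def __init__(self, increment):
        self.increment = increment   # degree added per sequential access
        self.degree = 1
        self.last_block = None

    def on_access(self, addr):
        block = addr // BLOCK_SIZE
        if self.last_block is not None:
            if block == self.last_block + 1:
                self.degree += self.increment   # sequential pattern continues
            elif block != self.last_block:
                self.degree = 1                 # pattern broken: reset
        self.last_block = block
        # prefetch `degree` following lines sequentially on every access
        return [(block + i) * BLOCK_SIZE for i in range(1, self.degree + 1)]
```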
The result is highly dependent on how aggressively the
prefetching degree is being increased. This scheme shows a
moderate speedup if the prefetching degree is being increased
very conservatively. If the prefetching degree is increased too
quickly, the speedup falls below one. Overall speedup is shown
in Table IV, and detailed results are shown in Figures 7, 8
and 9.
Fig. 7. Speedup results for the variable degree sequential prefetcher in the
benchmarks for different values of degree incrementation per sequential access
Compared to the instruction based stride prefetcher, the
biggest speedup difference comes in the ammp benchmark.
TABLE IV
MEAN VALUES OF THE SPEEDUP FOR THE VARIABLE DEGREE
SEQUENTIAL PREFETCHER
Degree incrementation per sequential access Speedup
1 1.012
2 1.009
4 0.978
8 0.944
Fig. 8. Accuracy results for the variable degree sequential prefetcher in the
benchmarks for different values of degree incrementation per sequential access
Fig. 9. Coverage results for the variable degree sequential prefetcher in the
benchmarks for different values of degree incrementation per sequential access
It can be seen that this difference is due to a disparity in
accuracy and coverage between the different prefetchers. The
instruction based stride prefetcher is to a large extent able
to identify the benchmark’s access patterns, and manages
to achieve high accuracy and coverage. The variable degree
sequential prefetcher on the other hand, has accuracy and
coverage very close to zero in this benchmark.
TABLE V
MEAN VALUES OF THE SPEEDUP FOR THE ADAPTIVE TAGGED
SEQUENTIAL PREFETCHER
Memory access counting window size Speedup
10 0.984
20 0.981
40 0.980
80 0.983
B. Adaptive tagged sequential prefetcher
This prefetcher tags the prefetched addresses. The prefetch-
ing degree starts at 1, and the prefetching is performed se-
quentially on every access. Once a certain number of memory
accesses has happened, this prefetcher uses the tags to count
how many of the prefetches were useful, i.e. used by the
benchmark. If the number of useful prefetches is high enough,
the prefetching degree is increased by 1, and it is decreased by
1 if the number of useful prefetches is too low. High enough
is defined as >60%, and too low is defined as <40%.
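The degree-adjustment logic described above can be sketched as follows. Only the window counting and threshold comparison are modeled; the tagging of prefetched lines and the sequential prefetching itself are left out, and the class and method names are assumptions:

```python
HIGH, LOW = 0.6, 0.4   # usefulness thresholds from the text

class AdaptiveTaggedSequential:
    """Illustrative model of the degree-adjustment logic only."""

    def __init__(self, window):
        self.window = window   # memory accesses per counting window
        self.degree = 1        # current sequential prefetch degree
        self.accesses = 0
        self.issued = 0        # prefetches issued in this window
        self.useful = 0        # tagged prefetches later used by the program

    def record(self, issued, useful):
        """Account for one memory access and its tag feedback."""
        self.accesses += 1
        self.issued += issued
        self.useful += useful
        if self.accesses == self.window:
            ratio = self.useful / self.issued if self.issued else 0.0
            if ratio > HIGH:
                self.degree += 1                       # be more aggressive
            elif ratio < LOW:
                self.degree = max(1, self.degree - 1)  # back off
            self.accesses = self.issued = self.useful = 0
```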
The results show an overall speedup below one for all sizes
of the memory access counting window. The size of the
counting window also does not seem to affect the speedup
significantly. Table V shows the overall speedup, and Figures
10, 11 and 12 depict the detailed results.
Fig. 10. Speedup results for the adaptive tagged sequential prefetcher in the
benchmarks for different sizes of the memory access counting window
The adaptive tagged sequential prefetcher has the same
problem as the variable degree sequential prefetcher in the
ammp benchmark; it achieves a speedup below one because the
accuracy and coverage are very close to zero. Additionally, for
the adaptive tagged sequential prefetcher the galgel benchmark
also stands out with a speedup below one. This happens
even though the coverage is comparable to both the variable
degree sequential prefetcher and the instruction based stride
prefetcher. The difference here lies in the accuracy, where the
adaptive tagged sequential prefetcher’s accuracy is much lower
than that of the other two prefetchers.
Fig. 11. Accuracy results for the adaptive tagged sequential prefetcher in the
benchmarks for different sizes of the memory access counting window
Fig. 12. Coverage results for the adaptive tagged sequential prefetcher in the
benchmarks for different sizes of the memory access counting window
VII. RELATED WORK
VARIOUS aspects of caches are examined by A. J. Smith
[6], including ways to prefetch data into them. It is
theorized that the only viable way to prefetch data into a cache
in practice is to prefetch the immediately sequential memory
block, due to the need for a fast hardware implementation.
This is known as one block lookahead (OBL). Three methods
for OBL are focused on: always prefetch, prefetch on misses
and tagged prefetch. Prefetch on misses is found to be the
least effective method, but both always prefetch and tagged
prefetch succeed well in reducing the miss ratio. Additionally,
the tagged prefetch method only requires a small increase in
the access ratio over prefetch on misses, thus utilizing memory
bandwidth more efficiently.
S. VanderWiel and D. Lilja [8], [9] describe many details
of prefetching, including tagged prefetching. Their paper dis-
cusses a number of aspects that need to be taken into account
when designing a prefetcher, such as when the prefetches are
initiated, where the prefetched data is placed, and what the unit
of prefetch should be. Their findings show that no prefetching
scheme is optimal for all workloads; they all have different
strengths and weaknesses.
S. Srinath et al. [10] propose a tagged prefetching mecha-
nism that incorporates dynamic feedback into its design. They
call their method feedback directed prefetching (FDP), and the
goal is to increase the performance improvement provided by
prefetching as well as to reduce the bandwidth usage. Their
prefetcher estimates its accuracy, timeliness and the prefetcher-
caused cache pollution in order to dynamically adjust its
aggressiveness. The results show that the FDP prefetcher im-
proves the average performance by 6.5% compared to the best
performing conventional stream based prefetcher, as well as
consuming 18.7% less memory bandwidth. The improvements
developed for the FDP prefetcher are shown to be applicable
to several other types of prefetchers, including global history
buffer delta correlation prefetchers (GHB/DC) and PC based
stride prefetchers.
K. Nesbit and J. Smith [11] examine a technique which
addresses the problems of static prefetch tables. To solve
the problems of fixed memory per prefetch key, they use a
global history buffer, which splits up the prefetch keys and
the addresses in two different structures. The results show that
the GHB profits more if a global address is used as the key. In
that case it is 20% faster and the memory traffic is reduced
by 90%. For the instruction pointer based methods the GHB
is also faster (9%), but the memory traffic is the same. They
also point out the fact that it is more important for a prefetcher
to be accurate than to be fast, since the memory bandwidth is
such an important factor in modern CPUs.
L. Ramos et al. [12] suggest that prefetchers should work
on multiple levels to fit different purposes. In their case they
design them to minimize the cost of the prefetcher, to minimize
the loss due to bad prefetching or to maximize the average
performance. They also introduce a new correlating prefetcher,
which they call PDFCM (Prefetching based on a Differential
Finite Context Machine). It uses a history and a distance table
to calculate and store multiple strides. By using a simple
sequential prefetcher they are able to minimize the cost to
1 kB. Their PDFCM is able to cut all performance losses
caused by the prefetcher. By combining both techniques they
gained the best average performance.
VIII. CONCLUSION
THIS paper focuses on how prefetching can be used to
mitigate the problems caused by the memory wall. The
more specific goal of our work is to explore the effects of
the maximum prefetch degree in a stride prefetcher using an
RPT. In order to do this we implement an instruction based
stride prefetcher, and run it through a selection of the SPEC
CPU2000 benchmarks. The benchmarks were then run with
varying values of the maximum prefetch degree.
From the results given in Section V, it is seen that a
higher limit for the confidence parameter (given by a higher
maximum prefetch degree) leads to increased coverage for
the majority of the benchmarks used to test the prefetcher
discussed in Section III. As for the accuracy, increasing the
maximum prefetch degree does not seem to have a specific
effect, with varying results for the benchmarks used.
Although a higher maximum prefetch degree yields a higher
coverage for the prefetcher, the speedup does not increase
monotonically with the maximum prefetch degree. This
can be explained by looking at the definitions of coverage and
speedup given in Equations 3 and 4 in Section IV. The higher
coverage may lead to a lower speedup, since the memory bus
will need to use a higher amount of bandwidth for prefetching.
This bandwidth could otherwise have been used to load more
important data, which would have made it possible to obtain
a higher speedup.
Compared to the sequential prefetcher with a variable
prefetch degree and the adaptive tagged sequential prefetching
scheme studied in Section VI, the performance of the instruc-
tion based stride prefetcher using an RPT is either equal to
these schemes or higher, depending on the benchmark. The
overall best case difference in speedup for the instruction based
stride prefetcher is 7.3% compared to the sequential prefetcher
with a variable prefetch degree and 10.1% compared to the
adaptive tagged sequential prefetching scheme.
REFERENCES
[1] W. Wulf and S. McKee. Hitting the memory wall: Implications of the
obvious. ACM Computer Architecture News, 23(1); March 1995.
[2] M. Grannæs, Reducing Memory Latency by Improving Resource Utiliza-
tion. Norway: Norwegian University of Science and Technology; June
2010.
[3] M5 simulator system TDT4260 Computer Architecture User documenta-
tion. Norway: Norwegian University of Science and Technology; January
2014.
[4] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, S.
K. Reinhardt, The M5 Simulator: Modeling Networked Systems. IEEE
Micro, vol. 26, no. 4, pp. 52-60, July/August: The IEEE Computer
Society; 2006.
[5] J. L. Hennessy and D.A. Patterson, Computer Architecture, Fourth Edi-
tion: A Quantitative Approach. San Francisco, CA, USA: Morgan
Kaufmann Publishers; 2006.
[6] A. J. Smith. Cache memories. ACM Comput. Surv., 14(3), pp. 473-530;
1982.
[7] P. J. Denning. On modeling program behavior. in Proc Spring Joint
Computer Conference, vol. 40, pp. 937-944, Arlington, Va, USA: AFIPS
Press; 1972
[8] S. VanderWiel and D. Lilja. A survey of data prefetching techniques.
Technical Report 5, University of Minnesota; October 1996.
[9] S. VanderWiel and D. Lilja. When caches aren’t enough: data prefetching
techniques. Computer, 30(7), pp. 23-30; Jul 1997.
[10] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback directed prefetch-
ing: Improving the performance and bandwidth-efficiency of hardware
prefetchers. Technical report, TR-HPS-2006-006, University of Texas
at Austin; May 2006.
[11] K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global
History Buffer. Micro, IEEE, 25, pp. 90–97; Jan. 2005.
[12] L. M. Ramos, J. L. Briz, P. E. Ibanez, and V. Vinals. Multi-level adaptive
prefetching based on performance gradient tracking. In Data Prefetching
Championship-1; 2009.
[13] Inside Intel Core Microarchitecture and Smart Memory Access,
https://software.intel.com/sites/default/files/m/d/4/1/d/8/sma.pdf; April 2015.