Cad report

Study and Analysis of Software-Based Data
Prefetching Schemes
Priyanka Goswami
Electrical and Computer Engineering
The University of Arizona
Tucson, USA
priyankag@email.arizona.edu
Shalaka Satam
Electrical and Computer Engineering
The University of Arizona
Tucson, USA
shalakasatam@email.arizona.edu
Abstract— For multicore processors, data prefetching is an
effective technique to reduce latency caused by data access and
cache misses. There are two main methods of prefetching –
software and hardware prefetching. A third category is the
hybrid prefetching scheme, which is a combination of software
and hardware scheme and aims at combing the benefits of both
the techniques. For this project, we will perform a study
prefetching techniques and features used for classifying different
techniques. We will then study the most recent prefetching
techniques developed for modern day processors, classify them
based on different criteria and perform a qualitative and
quantitative evaluation of their performance. We will also
evaluate the performance of compiler based data prefetching
scheme using the built-in prefetcher of gcc compiler.
Keywords—prefetching; latency; memory; cache; processor;
multicore; software; data
I. INTRODUCTION
Prefetching is a technique used for transferring data from
the main memory to a temporary storage before it is required
by an application. It is used for reducing memory latencies by
overlapping computation with communication (data access).
An effective prefetcher should be able to accurately predict
the address of the data required in the future, thus helping in
reducing cache misses, besides improving the overall speed of
execution. Prefetching has become important for modern
computers mainly because of the exponential increase of
dataset sizes and the significant difference between DRAM
and processor speeds [1]. Prefetching has also become a key
component of Network on Chip (NoC) based multiprocessor
systems, where the effectiveness of the prefetching technique
directly translates to an improvement in overall system
performance. This is because in such a system the memory
access latency depends on the network traffic, besides the
distance between the processor request data and the storage
location [2].
Although prefetching is one of the most widely used
methods to reduce memory access latency, there are many
factors and design specifications that must be considered for it
to be effective. For example, in Chip – multiprocessors
(CMPs) increasing the lower level cache sizes is not plausible
and hence it is important that the prefetching technique used is
highly accurate to prevent cache pollution [3]. Additionally, it
is important that the prefetcher is light-weight and has low
overhead.
Over years, many different prefetching techniques have
been designed that aim to reduce memory access latency,
predict future data accesses with high accuracy, while being
light weight and having minimum overhead. The techniques
have also evolved to adapt to different processor architectures.
In the report, first we will introduce the different criteria used
for classifying prefetching techniques and introduce a new
possible criterion that can be used for classifying prefetchers,
based on the recent developments in processor architecture.
The following section will focus on classification and
analyzing the recently developed software based prefetching
techniques, the performance improvements achieved by them
and the type of processor architecture they are developed for.
We will also examine the advantages and disadvantages of
hardware and software based prefetching techniques and how
hybrid prefetching techniques have helped in combining the
best features of both the techniques. Then we will study the
effectiveness of software based data prefetching technique by
implementing compiler based prefetching on loop – intensive
codes and analyze the results.
II. CLASSIFICATION OF PREFETCHERS
Over the years, many distinctive characteristics of
prefetchers have been used to classify them. The most
commonly used characteristics used are [4] [5] [6]:
• Based on where the prefetching is implemented: This is
the most popular and oldest criteria used for classifying
prefetching techniques. The techniques are classified as:
1. Hardware based prefetcher – Prefetchers use
information of current and previous cache accesses to
fetch blocks of data. This technique generally
requires additional hardware (like extra memory to
store the history table) or modifications to the current
hardware. It is generally used where the data access
patterns are regular, which makes it easier for the
prediction algorithm the correctly determine future
data accesses.

2. Software based prefetcher – This is a compiler
optimization technique, in which prefetch instructions
are either inserted by the compiler by identifying loops
in the code, or by the programmer. The time to
prefetch depends on the loop execution time, to ensure
that communication time is not more than the
computation time. Software based prefetchers are
preferred for irregular short streams of data, which is
seen in out – of order processors. This also includes
branch prediction algorithms.
3. Hybrid prefetcher - Combination of hardware and
software based prefetcher.
• Based on what is prefetched: This is applicable if separate
memory is used for instructions and data.
1. Instruction prefetching
2. Data prefetching
• Based on events that can be used to determine if
prefetching is required: This can be event triggered (like a
cache miss), software controlled, prediction based or
consisting of look ahead counter.
• Based on the source and destination of the data being
prefetched: The source can be main memory or a lower
level cache (like L1) and the destination is a higher - level
cache (this can be private, shared or exclusively for
storing prefetched data).
• Based on component initializing the prefetching:
1. Push based in which the memory or server prefetches
data and sends it to the processor performing
computation.
2. Pull based in which the processor requests the
(prefetched) data from the memory or lower
level cache.
• Based on the technique used for identifying the data to be
retrieved: The technique can be classified as
1. Static prefetcher
2. Dynamic prefetcher
We introduce another possible classification scheme for
prefetching algorithms based on the type of processor /
processor components a technique is suitable for. like no. of
cores a processor has (single core, multi core or many core
[7]), network based multiprocessor system, chip
multiprocessor systems, etc. We will use this scheme in
addition to the previous described characteristics to classify
the different software based prefetching techniques we will
analyze in the following section. Additionally, prefetchers can
also be classified based on the data the retrieve, i.e. temporal,
spatial or both.
Some of the main metrics used to analyze the performance of
prefetchers are prefetch degree, accuracy, overhead, timely
prefetching and difference in speed of application execution.
III. ANALYSIS AND CLASSIFICATION OF PREFETCHING
TECHNIQUES
In this section, we will analyze some recently developed
prefetching techniques for modern day processors such as
NoC multiprocessors. Based on the analysis we will classify
the technique using the above criteria and the performance
improvement achieved. A similar classification and review has
been described in [4], but in this report, we will focus only on
data prefetching techniques, especially those that have not
been covered in previously published surveys.
In [8], Jain and Lin (2013) introduce the Irregular Stream
Buffer prefetcher. It aims on using an additional level of
indirection for converting correlated addresses to consecutive
addresses for irregular sequence of memory references. It
identifies temporal streams of data, by identifying the next
memory access based on the current reference. The main
components of the prefetcher are: training unit, address
mapping caches and stream predictor.
• It prefetches data from the main memory (pull based
prefetcher) and uses a software controlled prediction
based algorithm. It is designed for single and multi-core
processors and is dynamic prefetcher.
• The prefetcher is tested using the SPEC 2006 benchmarks
on Marss simulator and has obtained 23.1% speed-up and
93.7% accuracy.
• On-chip storage is 32KB with 8.4% traffic overhead.
In [9], Mao et al. (2014) describe two methods: the request
prioritization (RP) and hybrid-local global prefetch (HLGPC).
The RP consists of assigning priorities on different low-level
cache (LLC) accesses (like read request, write back request,
etc.) and use a miss state handling register (MSHR) to track
the elapsed time for servicing each request. The HLGPC is
used to control the aggressiveness (prefetch degree and
distance) of the prefetcher by using two metrics: prefetch
frequency of each core and global access frequency of LLC.
• It prefetches data from main memory (pull based
prefetcher) and stores in L1 cache and uses software
controlled algorithm and is dynamic in nature. It is
designed for chip-multiprocessors having multi-core
system and STT-RAM LLC.
• The prefetcher is tested using SPEC 2000/2006
benchmarks with the Intel Sandy Bridge configuration in
MacSim simulator and energy calculations are done
using CACTI and synopsis. The performance
improvement increases, with larger sized LLC, with
maximum improvement of 9.1%, while the energy saved
decreases with minimum improvement of 5.6% (for
8MB LLC).
• The prefetcher requires additional registers (128 7-bit
MSHR and 20-entry write buffer).

In [2], Cireno et al. (2016) have designed a prefetching system
which consists of the client (responsible for handling prefetch
requests and load data to local cache) and the server
(responsible for tracking the time and notifying the client and
maintain the directory). The prefetching scheme initiates when
there is a cache miss and then a time series based prediction is
used to determine when the prefetch request is to be serviced.
In case of a miss, the register is updated and prefetching
parameters are adjusted. For timed prefetching, the client
maintains a separate time estimation for each client.
• It is pull based prefetcher as the client gets the data from
the main memory to higher level caches (L1 and L2) and
uses a combination of event triggered and prediction
based prefetcher. It is developed for NoC based multi
core processors and is dynamic in nature. The prefetcher
is used to fetch temporal data
• The prefetcher is evaluated using Extended SPARC V8
Arch and simulated on Infinity Platform. For 16 core
system, the performance improvement is 6.25 %. Metrics
used for evaluating are processor penalty, miss rate and
network transactions.
In [10], Bakhshalipour et al. (2017) have proposed a
prefetching technique called Domino, that uses two previous
miss addresses from the global history table, to determine the
next address of data to be prefetched. There are two Miss
History Tables (MHTs) per core – (1). MHT – 1 stores the 1st
miss (1 tag + prediction field + valid bit) and (2). MHT – 2
stores two consecutive misses (2 tags + prediction field +
identifier). A single predicted address is determined by
XORing the addresses from MHT – 2. Prefetch request is sent
if there is a hit in either MHT – 1 or MHT – 2.
• The technique is used to prefetch temporal data from the
main memory to the L1 Data cache directly (pull based
prefetcher) and is intended for multi core server
processors. The prefetcher uses an event trigger (cache
miss) to initialize and is dynamic in nature.
• Domino prefetcher is evaluated using Flexus simulator
with 16 core processor (Ultra SPARC III) and 8MB LLC
and 32 KB L1-data cache and benchmarks from the
CloudSuite like MapReduce.
• The Domino prefetcher improves system performance by
26% (more for certain benchmarks), and has better
performance compared to [8] (which also prefetches
temporal data but uses PC and address correlation), for
server workloads. But evaluation has been done with the
assumption that infinite space is available to store MHT –
1/ 2 tables, which is not possible for real world systems.
In [11], Fuchs et al. (2014) use code block working sets
(CBWS), which provides the address trace of ordered lines
accessed by loops, and the prefetcher uses this data to fetch
entire data block required for the loop iteration. The proposed
prefetcher is implemented as an addition to the spatial
memory streaming prefetcher. BLOCK_BEGIN and
BLOCK_END (instructions added to the ISA) are used to
determine the block boundaries. For the prediction, to prevent
timing constraints, prefetcher stores a history of k, that enables
predicting the blocks required, farther ahead. Although
developed for in-order pipeline, the prefetcher can be also
used for out-of-order pipeline, by fetching the address during
the commit stage.
• The prefetcher is used to fetch spatial data from the main
memory to the L2 cache (pull based prefetcher) and is
software controlled, using prediction to prefetch data. It is
dynamic type and developed for multi – core processors. Since
it uses additional space to store the history table, this
prefetcher can be classified as an hybrid prefetcher.
• The CBWS prefetcher is implemented as an add on to the
SMS prefetcher and performance evaluation is done in the
GEM5 simulator using SPEC2006 benchmarks. The
performance is compared with using only the SMS prefetcher.
The metrics used for evaluation cache misses, accuracy,
timely prefetches and overall speed – up.
• The CBWS + SMS prefetcher shows less cache misses
for majority benchmarks, compared to stride, GHB and SMS
prefetcher. Also, compared to the SMS, the CBWS + SMS
shows an improvement in timely accesses, but this is seen
only for memory – intensive benchmarks. Improvement is also
seen in accuracy, while performance speed – up is of 1.16
times over the SMS prefetcher for all benchmarks.
• This prefetcher uses less than 1KB of storage but
additional space is needed to store the differential history table
(DHT). For evaluation, a DHT of size 16 is used.
In [3], Kadjo et al. propose a prefetcher that performs optimal
control-flow speculation and effective address value
speculation to give an accurate prediction of future memory
references. The B-Fetch design depends on the expected path
through the future basic blocks and the effective addresses of
the next load instructions. Firstly, the future execution path is
predicted by the branch predictor. Then the B-Fetch analyses
the variation in the contents of the registers due to the earlier
branch instructions. This information is used to predict the
effective address. The Memory History Table (MHT) stores
the source, current and offset values. The use of variation
observed in register values over the history of effective
addresses helps in rightful prefetching even for instructions
which display irregular flow of control and data access
patterns.
• The technique uses a software controlled hybrid
prefetcher which is dynamic in nature. It is pull based,
getting data from the memory to the LLC.
• The prefetcher is tested using SPEC CPU2006
benchmarks, in the GEM5 simulator and the results
obtained by the prefetcher are compared with Stride and

SMS prefetcher. For evaluation, single and multi-threaded
loads are used.
• B-Fetch prefetcher achieves a mean speed up of
23.0% over the set baseline for 12.94KB size. But to
store the MHT additional hardware in form of
memory is required.
In [12], Aziz et al. propose a method to control prefetching
aggressiveness for network-based multiprocessor. The
controller minimizes the processor penalty by adjusting
prefetching aggressiveness. The Hill Climbing approach is
used to reduce the penalty. This method has five steps. Firstly,
the read miss transactions that arrive at the directory are
captured. The address prediction predicts the next processor
demanded address. The aggressive prediction predicts the
number of cache blocks that needs to be sent to the private
cache. This is followed by building the network packet which
contains all the predicted block addresses and finally the read
request is made to the private cache of predicted addresses.
The prefetcher performs transactions on the first level of
cache. This helps in avoiding network coherence delays.
• The prefetcher uses a software controlled algorithm and
the prefetcher is operated in static and dynamic mode
(with respect to the prefetching degree). It is pull based
and uses a private type of cache for storing the retrieved
data.
• The prefetcher is tested using PARSEC and SPLASH – 2
benchmarks. Additionally, benchmarks from MiBench
have also been used. The testing is done using real
systems - four Extended Sparc V8 ArchC with L1 and L2
cache.
• The prefetcher can reduce penalty by 7% and achieves an
increase of 24% in prefetching accuracy, for fixed degree
of prefetching.
In [13], Garside and Audsley propose a stream based
prefetcher consisting of a Prefetch Unit (PU). The PU includes
a stream buffer, prefetch buffer and squash buffer. It gathers
information about the data access trends by snooping on all
transactions made to the main memory. For each CPU, a
separate shared memory tile is present to access its memory.
The PU gets the memory request and if it is present in the
prefetch buffer, the requested address is added to the squash
buffer, else it is considered as a miss and the stream buffer is
updated.
• The prefetcher is developed for NoC multiprocessors, and
is event triggered, dynamic type.
• The PU is implemented within 4x4 Bluetiles NoC on
Xilinx Virtex-7 FPGA, with external memory and each
CPU has connection to the shared memory tree. The
CPUs are configured to run at 50MHz.
• It is observed that large prefetch distances yield better
results if the memory load increases. Improvement is also
seen in timely prefetching and accuracy of fetches.
In [1], Khan et al. propose the use of low overhead runtime
sampling and fast cache modeling for prefetching. The
prefetcher improves the single thread performance and it also
minimizes off-chip traffic and off-chip bandwidth
consumption. The prefetcher samples the memory instructions
in a random manner. The data cache blocks which are
accessed by the sampled instructions are monitored to see
reusability. If reused, the stride sample is recorded. The stride
sample is the difference between current and the previous
memory address accesses by the instruction. The reuse
samples are used to form the per-instruction cache
performance model. The stride samples are analyzed to find
the appropriate prefetch distance. Then the prefetch instruction
is scheduled for the load.
• This prefetching technique uses software based algorithm
and is dynamic in nature.
• The technique was evaluated on SPEC CPU 2006
benchmarks. The technique was also evaluated with
512kB L2 cache and it found 94% misses covered on
average.
• This prefetching method maintains minimum off-chip
traffic and it avoids LLC pollution. It also lowers off-chip
bandwidth demand. The results showed that the
multicores achieved higher throughput when resource
efficient prefetching is used.
From the above study and analysis most of the recent
works in prefetching have focused on developing light weight
prefetchers, especially for NoC multiprocessors. A common
feature of all the NoC based prefetchers is storing the data
directly from the main memory to the L1 cache, instead of
having a dedicated prefetch cache. Additionally, it is seen that
all prefetchers using prediction for determining the future
accesses need additional hardware (dedicated memory space)
for storing the history and hence are better classified under
hybrid prefetchers rather than simple software based
prefetchers. Also, the features used for carrying out
predictions and the amount of history stored varies for
different techniques.
IV. SOFTWARE BASED DATA PREFETCHING
Microprocessors use fetch instruction to implement
prefetching. Fetch instructions do not block memory
operations and hence have a lockup-free cache. This leads to
bypassing of outstanding memory operations when
prefetching is done. Software-initiated prefetching requires
minimum hardware as compared to the other prefetching
techniques. The complexity of software-initiated prefetching
lies on the position of the fetch instruction within the target
application.
Most software based prefetching techniques developed are
generally used along with a standard technique like the SMS
or GHB prefetcher, as described in [11], where the CBWS is
used as an additional component of the SMS prefetcher to
improve overall performance.

In the project, we have used the built in prefetch command
available for the gcc compiler and used it in loop intensive
codes for prefetching data, that will be required for the next
iteration. This is a very simple and efficient approach, that can
be used for applications related to image processing, matrix
operations, etc. For loops that access data in strides or contain
additional computations within the loop that would require
prefetching to be done in strides, rather than just one iteration
ahead, a prefetch distance must be determined. This distance
is a function of the latency caused due to cache misses and the
cycle time of the shortest path for one iteration of the loop [6].
V. EVALUATION AND RESULTS
For evaluating the built – in compiler based prefetcher, we
initially developed a C++ code to find the transpose of a
matrix and used the _mm_prefetch (char *p, int k) to prefetch
the data before next loop iteration, where p defines the address
for prefetching the data and k signifies the type of prefetching
to be done.
Figure. 1. Types of prefetches using _mm_prefetch(). [16]
Another prefetching function is the __builtin_prefetch (&i, j,
k), where I is the pointer to the address to be fetched, and j and
k define the type of prefetching. This was used in loop
intensive benchmarks of MiBench like fft, which contains
many nested loops as well. The figure below shows a part of
the loop of fft with the prefetching commands.
#define DO_PREFETCH
....
for(i=0;i<MAXSIZE;i++)
{RealIn[i]=0;
for(j=0;j<MAXWAVES;j++)
{if (rand()%2)
{RealIn[i]+=coeff[j]*cos(amp[j]*i);}
Else { RealIn[i]+=coeff[j]*sin(amp[j]*i);
ImagIn[i]=0;
#ifdef DO_PREFETCH
__builtin_prefetch (&coeff[j+1], 0, 1);
__builtin_prefetch (&amp[j+1], 0, 1);
#endif }
}
Figure 2. Loop of fft (main) function with prefetching
For comparison, this prefetc instruction was also applied to the
Dijkstra benchmark, which has just one loop. On running the
code using gcc compiler, on the system for fft the maximum
speed up was about 15% and for matrix transpose, a speed up
of 5% was achieved. Compared to this Dijkstra showed a
negligible performance difference. It has been seen that
compiler based prefetching along with other optimization
techniques such as loop unrolling, can show significant
improvement in performance.
VI. CONCLUSION
Data prefetching is one of the most commonly used
techniques in processor for reducing memory access latency,
cache misses, achieving higher accuracy and improving
overall performance of the system. While hardware based
prefetching, schemes were suitable for in order processors,
with predictable memory accesses, with the use of out of order
execution, software based prefetching techniques have shown
better performance. With development of modern day
processors such as NoC multiprocessors and chip
multiprocessors, a combination of both hardware and software
prefetching schemes are being used, and this class of
prefetchers is called hybrid prefetchers. In the project, we
have reviewed various prefetching techniques that have been
developed for modern day processors. We have classified each
technique based on set of features which makes it easier to
identify the type of application and processor, it is suitable for.
We have also listed the evaluation methodology, assumptions
and the performance improvement achieved by each
prefetching technique. Finally, we have implemented a simple
compiler based software data prefetching technique and
evaluated the results.
REFERENCES
[1] M. Khan, A. Sandberg and E. Hagersten, "A Case for
Resource Efficient Prefetching in Multicores," 2014 43rd
International Conference on Parallel Processing, Minneapolis
MN, 2014, pp. 101-110. doi: 10.1109/ICPP.2014.19
[2] M. Cireno, A. Aziz and E. Barros, "Temporized data
prefetching algorithm for NoC-based multiprocessor systems,"
2016 IEEE 27th International Conference on Application-
specific Systems, Architectures and Processors (ASAP),
London, 2016, pp. 235-236.doi: 10.1109/ASAP.2016.7760805
[3] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz and D.
Jimenez, "B-Fetch: Branch Prediction Directed Prefetching for
Chip-Multiprocessors," 2014 47th Annual IEEE/ACM
International Symposium on Microarchitecture, Cambridge,
2014, pp. 623-634.doi: 10.1109/MICRO.2014.29
[4] S. Byna, Y. Chen and X. H. Sun, "A Taxonomy of Data
Prefetching Mechanisms," 2008 International Symposium on
Parallel Architectures, Algorithms, and Networks (i-span
2008), Sydney, NSW, 2008, pp. 19-24.doi: 10.1109/I-
SPAN.2008.24
[5] S. Mittal, “A Survey of Recent Prefetching Techniques for
Processor Caches”, 20xx. ACM Computing Surveys 0, 0,
Article 0 (2016), 36 pages
[6] S. VanderWiel and David J. Lilja, “A Survey of Data
Prefetching Techniques”, Technical Report No.: HPPC-96-05,
doi: 10.1.1.2.4449 [online]: http://citeseerx.ist.psu.edu/
[7]“Manycore -vs- Multicore”, [online]:
https://goparallel.sourceforge.net/ask-james-reinders-
multicore-vs-manycore/

[8] Akanksha Jain and Calvin Lin. 2013. Linearizing irregular
memory accesses for improved correlated prefetching. In
International Symposium on Microarchitecture. 247–259,
December 2013. Doi:10.1145/2540708.2540730
[9] J. Li, C. J. Xue and Yinlong Xu, "STT-RAM based
energy-efficiency hybrid cache for CMPs," 2011 IEEE/IFIP
19th International Conference on VLSI and System-on-Chip,
Hong Kong, 2011, pp. 31-36. doi:
10.1109/VLSISoC.2011.6081626
[10] M. Bakhshalipour; P. Lotfi-Kamran; H. Sarbazi-Azad,
"An Efficient Temporal Data Prefetcher for L1 Caches," in
IEEE Computer Architecture Letters, vol.PP, no.99, pp.1-1
doi: 10.1109/LCA.2017.2654347
[11] A. Fuchs, S. Mannor, U. Weiser and Y. Etsion, "Loop-
Aware Memory Prefetching Using Code Block Working
Sets," 2014 47th Annual IEEE/ACM International Symposium
on Microarchitecture, Cambridge, 2014, pp. 533-544. doi:
10.1109/MICRO.2014.27
[12] J. Garside and N. C. Audsley, "Prefetching across a
shared memory tree within a Network-on-Chip architecture,"
2013 International Symposium on System on Chip (SoC),
Tampere, 2013, pp. 1-4. doi: 10.1109/ISSoC.2013.6675268
[13] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz and D.
Jimenez, "B-Fetch: Branch Prediction Directed Prefetching for
Chip-Multiprocessors," 2014 47th Annual IEEE/ACM
International Symposium on Microarchitecture, Cambridge,
2014, pp. 623-634. doi: 10.1109/MICRO.2014.29
[14] A. Aziz, M. Cireno, E. Barros and B. Prado, "Balanced
prefetching aggressiveness controller for NoC-based
multiprocessor," 2014 27th Symposium on Integrated Circuits
and Systems Design (SBCCI), Aracaju, 2014, pp. 1-7.doi:
10.1145/2660540.2660541
[15] M. Khan, A. Sandberg and E. Hagersten, "A Case for
Resource Efficient Prefetching in Multicores," 2014 43rd
International Conference on Parallel Processing, Minneapolis
MN, 2014, pp. 101-110.doi: 10.1109/ICPP.2014.19
[16] Lee, J., Kim, H., and Vuduc, R. 2012. When prefetching
works, when it doesn’t, and why. ACM Trans. Archit. Code
Optim. 9, 1, Article 2 (March 2012), 29 pages
http://doi.acm.org/10.1145/2133382.2133384

Cad report

Recommended

Recommended

More Related Content

What's hot

What's hot (15)

Similar to Cad report

Similar to Cad report (20)

More from Priyanka Goswami

More from Priyanka Goswami (9)

Recently uploaded

Recently uploaded (20)

Cad report