Study and Analysis of Software-Based Data
Prefetching Schemes
Priyanka Goswami
Electrical and Computer Engineering
The University of Arizona
Tucson, USA
priyankag@email.arizona.edu
Shalaka Satam
Electrical and Computer Engineering
The University of Arizona
Tucson, USA
shalakasatam@email.arizona.edu
Abstract— For multicore processors, data prefetching is an effective technique for reducing the latency caused by data accesses and cache misses. There are two main methods of prefetching: software and hardware prefetching. A third category, the hybrid prefetching scheme, combines the two and aims at capturing the benefits of both techniques. For this project, we first study prefetching techniques and the features used for classifying them. We then study the most recent prefetching techniques developed for modern processors, classify them based on different criteria, and perform a qualitative and quantitative evaluation of their performance. We also evaluate the performance of a compiler-based data prefetching scheme using the built-in prefetcher of the gcc compiler.
Keywords—prefetching; latency; memory; cache; processor;
multicore; software; data
I. INTRODUCTION
Prefetching is a technique used for transferring data from
the main memory to a temporary storage before it is required
by an application. It is used for reducing memory latencies by
overlapping computation with communication (data access).
An effective prefetcher should accurately predict the addresses of data required in the future, thereby reducing cache misses and improving the overall speed of execution. Prefetching has become important for modern
computers mainly because of the exponential increase of
dataset sizes and the significant difference between DRAM
and processor speeds [1]. Prefetching has also become a key
component of Network on Chip (NoC) based multiprocessor
systems, where the effectiveness of the prefetching technique
directly translates to an improvement in overall system
performance. This is because in such a system the memory access latency depends on the network traffic, in addition to the distance between the processor requesting the data and the storage location [2].
Although prefetching is one of the most widely used
methods to reduce memory access latency, there are many
factors and design specifications that must be considered for it
to be effective. For example, in chip multiprocessors (CMPs), increasing lower-level cache sizes is not feasible, and hence it is important that the prefetching technique used is highly accurate, to prevent cache pollution [3]. Additionally, it is important that the prefetcher is lightweight and has low overhead.
Over the years, many different prefetching techniques have been designed that aim to reduce memory access latency and predict future data accesses with high accuracy, while remaining lightweight and having minimal overhead. The techniques
have also evolved to adapt to different processor architectures.
In this report, we first introduce the different criteria used for classifying prefetching techniques and propose a new possible criterion for classifying prefetchers, based on recent developments in processor architecture.
The following section focuses on classifying and analyzing recently developed software-based prefetching techniques, the performance improvements they achieve, and the type of processor architecture they are designed for. We also examine the advantages and disadvantages of hardware- and software-based prefetching, and how hybrid prefetching techniques combine the best features of both. We then study the effectiveness of software-based data prefetching by implementing compiler-based prefetching on loop-intensive code and analyzing the results.
II. CLASSIFICATION OF PREFETCHERS
Over the years, many distinctive characteristics of prefetchers have been used to classify them. The most commonly used criteria are [4] [5] [6]:
• Based on where the prefetching is implemented: This is the oldest and most popular criterion for classifying prefetching techniques. Techniques are classified as:
1. Hardware-based prefetcher – Uses information from current and previous cache accesses to fetch blocks of data. This technique generally requires additional hardware (such as extra memory to store a history table) or modifications to existing hardware. It is generally used where data access patterns are regular, which makes it easier for the prediction algorithm to correctly determine future data accesses.
2. Software-based prefetcher – A compiler optimization technique in which prefetch instructions are inserted either by the compiler, by identifying loops in the code, or by the programmer. The prefetch timing depends on the loop execution time, to ensure that communication time does not exceed computation time. Software-based prefetchers are preferred for short, irregular data streams, as seen in out-of-order processors. This category also includes branch-prediction-based algorithms.
3. Hybrid prefetcher - Combination of hardware and
software based prefetcher.
• Based on what is prefetched: This is applicable if separate
memory is used for instructions and data.
1. Instruction prefetching
2. Data prefetching
• Based on the events used to determine whether prefetching is required: Prefetching can be event triggered (e.g., by a cache miss), software controlled, prediction based, or driven by a look-ahead counter.
• Based on the source and destination of the data being prefetched: The source can be the main memory or a lower-level cache, and the destination is a higher-level cache (which may be private, shared, or used exclusively for storing prefetched data).
• Based on component initializing the prefetching:
1. Push based in which the memory or server prefetches
data and sends it to the processor performing
computation.
2. Pull based in which the processor requests the
(prefetched) data from the memory or lower
level cache.
• Based on the technique used for identifying the data to be
retrieved: The technique can be classified as
1. Static prefetcher
2. Dynamic prefetcher
We introduce another possible classification scheme for prefetching algorithms, based on the type of processor or processor components a technique is suited to: the number of cores (single-core, multi-core, or many-core [7]), network-based multiprocessor systems, chip multiprocessor systems, etc. We will use this scheme, in addition to the previously described characteristics, to classify the software-based prefetching techniques analyzed in the following section. Additionally, prefetchers can also be classified based on the type of data they retrieve, i.e., temporal, spatial, or both.
The main metrics used to analyze the performance of prefetchers are prefetch degree, accuracy, overhead, timeliness, and the change in application execution speed.
III. ANALYSIS AND CLASSIFICATION OF PREFETCHING
TECHNIQUES
In this section, we analyze some recently developed prefetching techniques for modern processors such as NoC multiprocessors. Based on the analysis, we classify each technique using the above criteria and note the performance improvement achieved. A similar classification and review is given in [4], but in this report we focus only on data prefetching techniques, especially those not covered in previously published surveys.
In [8], Jain and Lin (2013) introduce the Irregular Stream Buffer (ISB) prefetcher. It uses an additional level of indirection to convert correlated addresses into consecutive addresses for irregular sequences of memory references. It identifies temporal streams of data by predicting the next memory access from the current reference. The main components of the prefetcher are a training unit, address-mapping caches, and a stream predictor.
• It prefetches data from the main memory (pull-based prefetcher) and uses a software-controlled, prediction-based algorithm. It is designed for single- and multi-core processors and is a dynamic prefetcher.
• The prefetcher was tested using the SPEC 2006 benchmarks on the MARSS simulator, obtaining a 23.1% speed-up and 93.7% accuracy.
• The on-chip storage is 32 KB, with 8.4% traffic overhead.
In [9], Mao et al. (2014) describe two methods: request prioritization (RP) and hybrid local-global prefetch control (HLGPC). RP assigns priorities to different last-level cache (LLC) accesses (read requests, write-back requests, etc.) and uses a miss status handling register (MSHR) to track the elapsed time for servicing each request. HLGPC controls the aggressiveness (prefetch degree and distance) of the prefetcher using two metrics: the prefetch frequency of each core and the global access frequency of the LLC.
• It prefetches data from the main memory (pull-based prefetcher), stores it in the L1 cache, uses a software-controlled algorithm, and is dynamic in nature. It is designed for chip multiprocessors with multiple cores and an STT-RAM LLC.
• The prefetcher was tested using SPEC 2000/2006 benchmarks with an Intel Sandy Bridge configuration in the MacSim simulator; energy calculations were done using CACTI and Synopsys tools. The performance improvement increases with larger LLC sizes, with a maximum improvement of 9.1%, while the energy saving decreases, with a minimum improvement of 5.6% (for an 8 MB LLC).
• The prefetcher requires additional registers (a 128-entry 7-bit MSHR and a 20-entry write buffer).
In [2], Cireno et al. (2016) design a prefetching system consisting of a client (responsible for handling prefetch requests and loading data into the local cache) and a server (responsible for tracking time, notifying the client, and maintaining the directory). The prefetching scheme is initiated on a cache miss, and a time-series-based prediction determines when the prefetch request should be serviced. On a miss, the register is updated and the prefetching parameters are adjusted. For timed prefetching, a separate time estimate is maintained for each client.
• It is a pull-based prefetcher, as the client fetches data from the main memory into the higher-level caches (L1 and L2), and it combines event-triggered and prediction-based prefetching. It is developed for NoC-based multi-core processors, is dynamic in nature, and fetches temporal data.
• The prefetcher was evaluated using an extended SPARC V8 architecture simulated on the Infinity platform. For a 16-core system, the performance improvement is 6.25%. The metrics used for evaluation are processor penalty, miss rate, and network transactions.
In [10], Bakhshalipour et al. (2017) propose a prefetching technique called Domino, which uses the two previous miss addresses from a global history to determine the address of the next data block to prefetch. There are two Miss History Tables (MHTs) per core: MHT-1 stores the last miss (1 tag + prediction field + valid bit) and MHT-2 stores the last two consecutive misses (2 tags + prediction field + identifier). The index into MHT-2 is formed by XORing the two miss addresses, yielding a single predicted address. A prefetch request is sent on a hit in either MHT-1 or MHT-2.
• The technique prefetches temporal data from the main memory directly into the L1 data cache (pull-based prefetcher) and is intended for multi-core server processors. The prefetcher is initialized by an event trigger (a cache miss) and is dynamic in nature.
• The Domino prefetcher was evaluated using the Flexus simulator with a 16-core processor (UltraSPARC III), an 8 MB LLC, a 32 KB L1 data cache, and benchmarks from CloudSuite such as MapReduce.
• The Domino prefetcher improves system performance by 26% (more for certain benchmarks) and outperforms [8] (which also prefetches temporal data but uses PC and address correlation) for server workloads. However, the evaluation assumes infinite storage for the MHT-1/2 tables, which is not possible in real-world systems.
In [11], Fuchs et al. (2014) use code block working sets (CBWS), which provide the address trace of the ordered lines accessed by a loop; the prefetcher uses this data to fetch the entire data block required for a loop iteration. The proposed prefetcher is implemented as an addition to the spatial memory streaming (SMS) prefetcher. BLOCK_BEGIN and BLOCK_END instructions, added to the ISA, mark the block boundaries. To relax timing constraints on prediction, the prefetcher stores a history of depth k, which enables predicting the required blocks farther ahead. Although developed for in-order pipelines, the prefetcher can also be used for out-of-order pipelines by fetching the address during the commit stage.
• The prefetcher fetches spatial data from the main memory to the L2 cache (pull-based prefetcher) and is software controlled, using prediction to prefetch data. It is a dynamic prefetcher developed for multi-core processors. Since it uses additional space to store the history table, it is better classified as a hybrid prefetcher.
• The CBWS prefetcher is implemented as an add-on to the SMS prefetcher, with performance evaluated in the gem5 simulator using SPEC 2006 benchmarks and compared against the SMS prefetcher alone. The metrics used for evaluation are cache misses, accuracy, timely prefetches, and overall speed-up.
• The CBWS + SMS prefetcher shows fewer cache misses for the majority of benchmarks compared to the stride, GHB, and SMS prefetchers. Compared to SMS alone, CBWS + SMS also improves timely accesses, though only for memory-intensive benchmarks. Accuracy improves as well, and the speed-up over the SMS prefetcher is 1.16x across all benchmarks.
• This prefetcher uses less than 1 KB of storage, but additional space is needed to store the differential history table (DHT). For evaluation, a DHT of size 16 is used.
In [3], Kadjo et al. propose a prefetcher that performs control-flow speculation and effective-address value speculation to accurately predict future memory references. The B-Fetch design relies on the expected path through future basic blocks and the effective addresses of upcoming load instructions. First, the future execution path is predicted by the branch predictor. B-Fetch then analyzes the variation in register contents caused by earlier branch instructions and uses this information to predict effective addresses. A Memory History Table (MHT) stores the source, current, and offset values. Using the variation observed in register values over the history of effective addresses enables accurate prefetching even for instructions with irregular control flow and data access patterns.
• The technique uses a software-controlled hybrid prefetcher that is dynamic in nature. It is pull based, fetching data from the memory into the LLC.
• The prefetcher was tested using SPEC CPU2006 benchmarks in the gem5 simulator, and its results were compared with the stride and SMS prefetchers using single- and multi-threaded workloads.
• B-Fetch achieves a mean speed-up of 23.0% over the baseline with a 12.94 KB storage budget, but additional hardware memory is required to store the MHT.
In [14], Aziz et al. propose a method to control prefetching aggressiveness for network-based multiprocessors. The controller minimizes processor penalty by adjusting the prefetching aggressiveness, using a hill-climbing approach to reduce the penalty. The method has five steps. First, the read-miss transactions arriving at the directory are captured. Address prediction then determines the next processor-demanded address, and aggressiveness prediction determines the number of cache blocks to be sent to the private cache. This is followed by building the network packet containing all predicted block addresses, and finally the read request for the predicted addresses is issued to the private cache. The prefetcher performs transactions on the first level of cache, which helps avoid network coherence delays.
• The prefetcher uses a software-controlled algorithm and can operate in static or dynamic mode (with respect to the prefetching degree). It is pull based and uses a private cache to store the retrieved data.
• The prefetcher was tested using the PARSEC and SPLASH-2 benchmarks, along with benchmarks from MiBench. The testing was done on a real system of four extended SPARC V8 ArchC cores with L1 and L2 caches.
• The prefetcher reduces penalty by 7% and achieves a 24% increase in prefetching accuracy for a fixed prefetching degree.
In [12], Garside and Audsley propose a stream-based prefetcher built around a Prefetch Unit (PU), which includes a stream buffer, a prefetch buffer, and a squash buffer. It gathers information about data access trends by snooping on all transactions made to the main memory. Each CPU accesses its memory through the shared memory tree. The PU receives each memory request: if the address is present in the prefetch buffer, it is added to the squash buffer; otherwise it is treated as a miss and the stream buffer is updated.
• The prefetcher is developed for NoC multiprocessors and is event triggered and dynamic.
• The PU is implemented within a 4x4 Bluetiles NoC on a Xilinx Virtex-7 FPGA, with external memory, and each CPU is connected to the shared memory tree. The CPUs are configured to run at 50 MHz.
• It is observed that larger prefetch distances yield better results as the memory load increases. Improvements are also seen in prefetch timeliness and accuracy.
In [1], Khan et al. propose the use of low-overhead runtime sampling and fast cache modeling for prefetching. The prefetcher improves single-thread performance while minimizing off-chip traffic and off-chip bandwidth consumption. It samples memory instructions randomly, and the data cache blocks accessed by the sampled instructions are monitored for reuse. When a block is reused, a stride sample is recorded, defined as the difference between the current and previous memory addresses accessed by that instruction. The reuse samples form a per-instruction cache performance model, and the stride samples are analyzed to find an appropriate prefetch distance, after which the prefetch instruction is scheduled for the load.
• This prefetching technique uses a software-based algorithm and is dynamic in nature.
• The technique was evaluated on the SPEC CPU 2006 benchmarks; with a 512 KB L2 cache, it covered 94% of misses on average.
• The method keeps off-chip traffic to a minimum, avoids LLC pollution, and lowers off-chip bandwidth demand. The results show that multicores achieve higher throughput when resource-efficient prefetching is used.
From the above study and analysis, most recent work on prefetching has focused on developing lightweight prefetchers, especially for NoC multiprocessors. A common feature of the NoC-based prefetchers is storing data directly from the main memory into the L1 cache, instead of using a dedicated prefetch cache. Additionally, all prefetchers that use prediction to determine future accesses need additional hardware (dedicated memory space) to store their history, and hence are better classified as hybrid prefetchers rather than purely software-based ones. The features used for prediction and the amount of history stored vary across techniques.
IV. SOFTWARE BASED DATA PREFETCHING
Microprocessors implement prefetching using fetch instructions. Fetch instructions are non-blocking and therefore require a lockup-free cache, which allows prefetches to overlap with outstanding memory operations. Software-initiated prefetching requires minimal hardware compared to other prefetching techniques; its complexity lies in placing the fetch instructions correctly within the target application.
Most software-based prefetching techniques are used alongside a standard technique such as the SMS or GHB prefetcher, as in [11], where CBWS is an additional component of the SMS prefetcher that improves overall performance.
In this project, we have used the built-in prefetch intrinsics available in the gcc compiler within loop-intensive code, prefetching the data that will be required for the next iteration. This is a simple and efficient approach that suits applications such as image processing and matrix operations. For loops that access data in strides, or that contain enough computation per iteration that prefetching must reach several iterations ahead rather than just one, a prefetch distance must be determined. This distance is a function of the latency caused by cache misses and the cycle time of the shortest path through one loop iteration [6].
V. EVALUATION AND RESULTS
For evaluating the built-in compiler-based prefetching, we first developed a C++ program to compute the transpose of a matrix and used _mm_prefetch(const char *p, int k) to prefetch data before the next loop iteration, where p is the address from which to prefetch and k specifies the type of prefetch to be done.
Figure 1. Types of prefetches using _mm_prefetch() [16].
Another prefetching function is __builtin_prefetch(&i, j, k), where the first argument is the address to be fetched, the second indicates read or write access, and the third the degree of temporal locality. This was used in loop-intensive MiBench benchmarks such as fft, which contains many nested loops. The figure below shows part of the fft loop with the prefetching calls added.
#define DO_PREFETCH
....
for (i = 0; i < MAXSIZE; i++) {
    RealIn[i] = 0;
    for (j = 0; j < MAXWAVES; j++) {
        if (rand() % 2)
            RealIn[i] += coeff[j] * cos(amp[j] * i);
        else
            RealIn[i] += coeff[j] * sin(amp[j] * i);
        ImagIn[i] = 0;
#ifdef DO_PREFETCH
        __builtin_prefetch(&coeff[j + 1], 0, 1);
        __builtin_prefetch(&amp[j + 1], 0, 1);
#endif
    }
}
Figure 2. Loop of the fft (main) function with prefetching added.
For comparison, the prefetch instruction was also applied to the Dijkstra benchmark, which has just one loop. Running the code with the gcc compiler on our system, the maximum speed-up was about 15% for fft and 5% for the matrix transpose, while Dijkstra showed a negligible performance difference. It has been observed that compiler-based prefetching, combined with other optimization techniques such as loop unrolling, can yield significant performance improvements.
VI. CONCLUSION
Data prefetching is one of the most commonly used techniques in processors for reducing memory access latency and cache misses and improving the overall performance of the system. While hardware-based prefetching schemes were suitable for in-order processors with predictable memory accesses, with the adoption of out-of-order execution, software-based prefetching techniques have shown better performance. With the development of modern processors such as NoC multiprocessors and chip multiprocessors, combinations of hardware and software prefetching schemes are being used; this class of prefetchers is called hybrid prefetchers. In this project, we have reviewed various prefetching techniques developed for modern processors. We have classified each technique based on a set of features, which makes it easier to identify the type of application and processor it is suitable for. We have also listed the evaluation methodology, assumptions, and performance improvement of each technique. Finally, we have implemented a simple compiler-based software data prefetching technique and evaluated the results.
REFERENCES
[1] M. Khan, A. Sandberg and E. Hagersten, "A Case for Resource Efficient Prefetching in Multicores," 2014 43rd International Conference on Parallel Processing, Minneapolis, MN, 2014, pp. 101-110. doi: 10.1109/ICPP.2014.19
[2] M. Cireno, A. Aziz and E. Barros, "Temporized data prefetching algorithm for NoC-based multiprocessor systems," 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP), London, 2016, pp. 235-236. doi: 10.1109/ASAP.2016.7760805
[3] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz and D. Jimenez, "B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, 2014, pp. 623-634. doi: 10.1109/MICRO.2014.29
[4] S. Byna, Y. Chen and X. H. Sun, "A Taxonomy of Data Prefetching Mechanisms," 2008 International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN 2008), Sydney, NSW, 2008, pp. 19-24. doi: 10.1109/I-SPAN.2008.24
[5] S. Mittal, "A Survey of Recent Prefetching Techniques for Processor Caches," ACM Computing Surveys, 2016, 36 pages.
[6] S. VanderWiel and D. J. Lilja, "A Survey of Data Prefetching Techniques," Technical Report No. HPPC-96-05. [Online]: http://citeseerx.ist.psu.edu/
[7] "Manycore -vs- Multicore," [Online]: https://goparallel.sourceforge.net/ask-james-reinders-multicore-vs-manycore/
[8] A. Jain and C. Lin, "Linearizing irregular memory accesses for improved correlated prefetching," International Symposium on Microarchitecture, December 2013, pp. 247-259. doi: 10.1145/2540708.2540730
[9] J. Li, C. J. Xue and Y. Xu, "STT-RAM based energy-efficiency hybrid cache for CMPs," 2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, Hong Kong, 2011, pp. 31-36. doi: 10.1109/VLSISoC.2011.6081626
[10] M. Bakhshalipour, P. Lotfi-Kamran and H. Sarbazi-Azad, "An Efficient Temporal Data Prefetcher for L1 Caches," IEEE Computer Architecture Letters, vol. PP, no. 99, pp. 1-1. doi: 10.1109/LCA.2017.2654347
[11] A. Fuchs, S. Mannor, U. Weiser and Y. Etsion, "Loop-Aware Memory Prefetching Using Code Block Working Sets," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, 2014, pp. 533-544. doi: 10.1109/MICRO.2014.27
[12] J. Garside and N. C. Audsley, "Prefetching across a shared memory tree within a Network-on-Chip architecture," 2013 International Symposium on System on Chip (SoC), Tampere, 2013, pp. 1-4. doi: 10.1109/ISSoC.2013.6675268
[13] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz and D. Jimenez, "B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, 2014, pp. 623-634. doi: 10.1109/MICRO.2014.29
[14] A. Aziz, M. Cireno, E. Barros and B. Prado, "Balanced prefetching aggressiveness controller for NoC-based multiprocessor," 2014 27th Symposium on Integrated Circuits and Systems Design (SBCCI), Aracaju, 2014, pp. 1-7. doi: 10.1145/2660540.2660541
[15] M. Khan, A. Sandberg and E. Hagersten, "A Case for Resource Efficient Prefetching in Multicores," 2014 43rd International Conference on Parallel Processing, Minneapolis, MN, 2014, pp. 101-110. doi: 10.1109/ICPP.2014.19
[16] J. Lee, H. Kim and R. Vuduc, "When prefetching works, when it doesn't, and why," ACM Trans. Archit. Code Optim., vol. 9, no. 1, Article 2, March 2012, 29 pages. doi: 10.1145/2133382.2133384

 
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
 
Ly3421472152
Ly3421472152Ly3421472152
Ly3421472152
 
4 fuzzy aqm
4 fuzzy aqm4 fuzzy aqm
4 fuzzy aqm
 

Similar to Cad report

A Taxonomy of Data Prefetching Mechanisms
A Taxonomy of Data Prefetching MechanismsA Taxonomy of Data Prefetching Mechanisms
A Taxonomy of Data Prefetching Mechanisms
ijtsrd
 
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSFAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
Maurvi04
 
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
cscpconf
 
The effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handlingThe effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handling
csandit
 
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docxCS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
faithxdunce63732
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
IJCSEA Journal
 
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
Barbara Aichinger
 
Enhancing proxy based web caching system using clustering based pre fetching ...
Enhancing proxy based web caching system using clustering based pre fetching ...Enhancing proxy based web caching system using clustering based pre fetching ...
Enhancing proxy based web caching system using clustering based pre fetching ...
eSAT Publishing House
 
Performing initiative data prefetching
Performing initiative data prefetchingPerforming initiative data prefetching
Performing initiative data prefetching
Kamal Spring
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
ashishmulchandani
 
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Nikhil Jain
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
eSAT Journals
 
Query Evaluation Techniques for Large Databases.pdf
Query Evaluation Techniques for Large Databases.pdfQuery Evaluation Techniques for Large Databases.pdf
Query Evaluation Techniques for Large Databases.pdf
RayWill4
 
Performance Review of Zero Copy Techniques
Performance Review of Zero Copy TechniquesPerformance Review of Zero Copy Techniques
Performance Review of Zero Copy Techniques
CSCJournals
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetching
Himanshu Koli
 
Cpu provisioning algorithms for service differentiation in cloud based enviro...
Cpu provisioning algorithms for service differentiation in cloud based enviro...Cpu provisioning algorithms for service differentiation in cloud based enviro...
Cpu provisioning algorithms for service differentiation in cloud based enviro...
ieeepondy
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
cscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
csandit
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의
DzH QWuynh
 

Similar to Cad report (20)

A Taxonomy of Data Prefetching Mechanisms
A Taxonomy of Data Prefetching MechanismsA Taxonomy of Data Prefetching Mechanisms
A Taxonomy of Data Prefetching Mechanisms
 
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSFAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
 
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...
 
The effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handlingThe effective way of processor performance enhancement by proper branch handling
The effective way of processor performance enhancement by proper branch handling
 
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docxCS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...
 
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
 
Enhancing proxy based web caching system using clustering based pre fetching ...
Enhancing proxy based web caching system using clustering based pre fetching ...Enhancing proxy based web caching system using clustering based pre fetching ...
Enhancing proxy based web caching system using clustering based pre fetching ...
 
Performing initiative data prefetching
Performing initiative data prefetchingPerforming initiative data prefetching
Performing initiative data prefetching
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 
Query Evaluation Techniques for Large Databases.pdf
Query Evaluation Techniques for Large Databases.pdfQuery Evaluation Techniques for Large Databases.pdf
Query Evaluation Techniques for Large Databases.pdf
 
Performance Review of Zero Copy Techniques
Performance Review of Zero Copy TechniquesPerformance Review of Zero Copy Techniques
Performance Review of Zero Copy Techniques
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetching
 
Cpu provisioning algorithms for service differentiation in cloud based enviro...
Cpu provisioning algorithms for service differentiation in cloud based enviro...Cpu provisioning algorithms for service differentiation in cloud based enviro...
Cpu provisioning algorithms for service differentiation in cloud based enviro...
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의
 

More from Priyanka Goswami

Fog computing and data concurrency
Fog computing and data concurrencyFog computing and data concurrency
Fog computing and data concurrency
Priyanka Goswami
 
Project 3
Project 3Project 3
Project 3
Priyanka Goswami
 
Texture based feature extraction and object tracking
Texture based feature extraction and object trackingTexture based feature extraction and object tracking
Texture based feature extraction and object tracking
Priyanka Goswami
 
Stock analysis report
Stock analysis reportStock analysis report
Stock analysis report
Priyanka Goswami
 
Presentation_Final
Presentation_FinalPresentation_Final
Presentation_Final
Priyanka Goswami
 
Data Acquisition System
Data Acquisition SystemData Acquisition System
Data Acquisition System
Priyanka Goswami
 
Data Acquisition System
Data Acquisition SystemData Acquisition System
Data Acquisition System
Priyanka Goswami
 
Biomedical image processing ppt
Biomedical image processing pptBiomedical image processing ppt
Biomedical image processing ppt
Priyanka Goswami
 
Thermal Imaging and its Applications
Thermal Imaging and its ApplicationsThermal Imaging and its Applications
Thermal Imaging and its Applications
Priyanka Goswami
 

More from Priyanka Goswami (9)

Fog computing and data concurrency
Fog computing and data concurrencyFog computing and data concurrency
Fog computing and data concurrency
 
Project 3
Project 3Project 3
Project 3
 
Texture based feature extraction and object tracking
Texture based feature extraction and object trackingTexture based feature extraction and object tracking
Texture based feature extraction and object tracking
 
Stock analysis report
Stock analysis reportStock analysis report
Stock analysis report
 
Presentation_Final
Presentation_FinalPresentation_Final
Presentation_Final
 
Data Acquisition System
Data Acquisition SystemData Acquisition System
Data Acquisition System
 
Data Acquisition System
Data Acquisition SystemData Acquisition System
Data Acquisition System
 
Biomedical image processing ppt
Biomedical image processing pptBiomedical image processing ppt
Biomedical image processing ppt
 
Thermal Imaging and its Applications
Thermal Imaging and its ApplicationsThermal Imaging and its Applications
Thermal Imaging and its Applications
 

Recently uploaded

132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
abbyasa1014
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Sinan KOZAK
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
gerogepatton
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
sachin chaurasia
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
IJNSA Journal
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
Aditya Rajan Patra
 
Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
Madan Karki
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 

Recently uploaded (20)

132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
 
Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 

Cad report

directly translates to an improvement in overall system performance. This is because, in such a system, the memory access latency depends on the network traffic as well as on the distance between the processor requesting the data and the storage location [2].

Although prefetching is one of the most widely used methods for reducing memory access latency, many factors and design constraints must be considered for it to be effective. For example, in chip multiprocessors (CMPs), increasing the lower-level cache sizes is not feasible, so the prefetching technique must be highly accurate to prevent cache pollution [3]. Additionally, it is important that the prefetcher is lightweight and has low overhead. Over the years, many prefetching techniques have been designed that aim to reduce memory access latency and predict future data accesses with high accuracy while remaining lightweight and imposing minimal overhead. These techniques have also evolved to adapt to different processor architectures.

In this report, we first introduce the different criteria used for classifying prefetching techniques and propose a new possible criterion, based on recent developments in processor architecture. The following section focuses on classifying and analyzing recently developed software-based prefetching techniques, the performance improvements they achieve, and the type of processor architecture they target. We also examine the advantages and disadvantages of hardware- and software-based prefetching, and how hybrid prefetching techniques combine the best features of both.
Then we study the effectiveness of software-based data prefetching by implementing compiler-based prefetching on loop-intensive code and analyzing the results.

II. CLASSIFICATION OF PREFETCHERS

Over the years, many distinctive characteristics of prefetchers have been used to classify them. The most commonly used are [4] [5] [6]:

• Based on where the prefetching is implemented: This is the oldest and most popular classification criterion. The techniques are classified as:
1. Hardware-based prefetcher – Uses information from current and previous cache accesses to fetch blocks of data. This generally requires additional hardware (such as extra memory to store a history table) or modifications to the existing hardware. It is typically used where data access patterns are regular, which makes it easier for the prediction algorithm to correctly determine future data accesses.
2. Software-based prefetcher – A compiler optimization technique in which prefetch instructions are inserted either by the compiler, by identifying loops in the code, or by the programmer. The prefetch timing depends on the loop execution time, to ensure that communication time does not exceed computation time. Software-based prefetchers are preferred for irregular, short streams of data, as seen in out-of-order processors. This category also includes branch-prediction-based algorithms.
3. Hybrid prefetcher – A combination of hardware- and software-based prefetching.

• Based on what is prefetched: Applicable when separate memories are used for instructions and data.
1. Instruction prefetching
2. Data prefetching

• Based on the event that triggers prefetching: Prefetching can be event-triggered (e.g., by a cache miss), software-controlled, prediction-based, or driven by a look-ahead counter.

• Based on the source and destination of the prefetched data: The source can be main memory or a lower-level cache (such as L1), and the destination is a higher-level cache (which can be private, shared, or dedicated to storing prefetched data).

• Based on the component initiating the prefetch:
1. Push-based, in which the memory or a server prefetches data and sends it to the processor performing the computation.
2. Pull-based, in which the processor requests the (prefetched) data from memory or a lower-level cache.

• Based on the technique used for identifying the data to be retrieved:
1. Static prefetcher
2. Dynamic prefetcher

We introduce another possible classification scheme for prefetching algorithms, based on the type of processor or processor components a technique is suitable for: the number of cores (single-core, multi-core, or many-core [7]), network-based multiprocessor systems, chip-multiprocessor systems, etc.
We will use this scheme, in addition to the previously described characteristics, to classify the software-based prefetching techniques analyzed in the following section. Additionally, prefetchers can be classified based on the locality of the data they retrieve, i.e., temporal, spatial, or both. Some of the main metrics used to analyze prefetcher performance are prefetch degree, accuracy, overhead, timeliness, and the resulting application speed-up.

III. ANALYSIS AND CLASSIFICATION OF PREFETCHING TECHNIQUES

In this section, we analyze some recently developed prefetching techniques for modern processors such as NoC multiprocessors. Based on this analysis, we classify each technique using the above criteria and report the performance improvement achieved. A similar classification and review is given in [4], but this report focuses only on data prefetching techniques, especially those not covered in previously published surveys.

In [8], Jain and Lin (2013) introduce the Irregular Stream Buffer prefetcher. It uses an additional level of indirection to map correlated addresses to consecutive addresses for irregular sequences of memory references. It identifies temporal streams of data by predicting the next memory access from the current reference. The main components of the prefetcher are a training unit, address-mapping caches, and a stream predictor.
• It prefetches data from main memory (pull-based) and uses a software-controlled, prediction-based algorithm. It is a dynamic prefetcher designed for single- and multi-core processors.
• Tested using the SPEC 2006 benchmarks on the Marss simulator, it achieves a 23.1% speed-up with 93.7% accuracy.
• Its on-chip storage is 32 KB, with 8.4% traffic overhead.

In [9], Mao et al. (2014) describe two methods: request prioritization (RP) and hybrid local-global prefetch control (HLGPC).
RP assigns priorities to different last-level cache (LLC) accesses (read requests, write-back requests, etc.) and uses a miss status holding register (MSHR) to track the elapsed time for servicing each request. HLGPC controls the aggressiveness (prefetch degree and distance) of the prefetcher using two metrics: the prefetch frequency of each core and the global access frequency of the LLC.
• It prefetches data from main memory (pull based), stores it in the L1 cache, uses a software-controlled algorithm and is dynamic in nature. It is designed for multi-core chip multiprocessors with an STT-RAM LLC.
• The prefetcher is tested using SPEC 2000/2006 benchmarks with an Intel Sandy Bridge configuration in the MacSim simulator, and energy calculations are done using CACTI and Synopsys tools. The performance improvement increases with larger LLC sizes, reaching a maximum of 9.1%, while the energy saving decreases, with a minimum improvement of 5.6% (for an 8MB LLC).
• The prefetcher requires additional registers (128 7-bit MSHRs and a 20-entry write buffer).
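A throttling controller in the spirit of HLGPC can be sketched as follows. This is a minimal illustration of adjusting prefetch degree from observed usefulness; the structure fields, thresholds and interval scheme are our own assumptions, not details taken from [9]:

```c
/* Hypothetical per-core prefetch-degree throttling: raise the degree
 * when prefetches prove useful, back off when they pollute the cache.
 * All names and thresholds are illustrative assumptions. */
typedef struct {
    unsigned issued;   /* prefetches issued in the current interval  */
    unsigned useful;   /* prefetched lines later hit by demand loads */
    unsigned degree;   /* current prefetch degree (1..8)             */
} pf_core_state;

void pf_adjust_degree(pf_core_state *s) {
    if (s->issued == 0)
        return;                       /* nothing observed this interval */
    double accuracy = (double)s->useful / (double)s->issued;
    if (accuracy > 0.75 && s->degree < 8)
        s->degree++;                  /* accurate: be more aggressive   */
    else if (accuracy < 0.40 && s->degree > 1)
        s->degree--;                  /* polluting: reduce aggressiveness */
    s->issued = s->useful = 0;        /* start a fresh interval         */
}
```

A real HLGPC implementation additionally weighs the global LLC access frequency across cores; this sketch shows only the local feedback loop.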
In [2], Cireno et al. (2016) design a prefetching system consisting of a client (responsible for handling prefetch requests and loading data into the local cache) and a server (responsible for tracking time, notifying the client and maintaining the directory). The prefetching scheme is initiated when there is a cache miss, and a time-series-based prediction is then used to determine when the prefetch request should be serviced. In case of a miss, the register is updated and the prefetching parameters are adjusted. For timed prefetching, a separate time estimate is maintained for each client.
• It is a pull based prefetcher, as the client moves data from the main memory to the higher-level caches (L1 and L2), and it combines event-triggered and prediction-based prefetching. It is developed for NoC-based multi-core processors, is dynamic in nature, and is used to fetch temporal data.
• The prefetcher is evaluated using an Extended SPARC V8 architecture and simulated on the Infinity Platform. For a 16-core system, the performance improvement is 6.25%. The metrics used for evaluation are processor penalty, miss rate and network transactions.
In [10], Bakhshalipour et al. (2017) propose a prefetching technique called Domino, which uses the two previous miss addresses from a global history table to determine the next address to be prefetched. There are two Miss History Tables (MHTs) per core: (1) MHT-1 stores the first miss (1 tag + prediction field + valid bit) and (2) MHT-2 stores two consecutive misses (2 tags + prediction field + identifier). A single predicted address is determined by XORing the addresses from MHT-2. A prefetch request is sent if there is a hit in either MHT-1 or MHT-2.
• The technique is used to prefetch temporal data from the main memory directly into the L1 data cache (pull based) and is intended for multi-core server processors.
The prefetcher is initiated by an event trigger (a cache miss) and is dynamic in nature.
• The Domino prefetcher is evaluated using the Flexus simulator with a 16-core processor (UltraSPARC III), an 8MB LLC and a 32KB L1 data cache, using benchmarks from CloudSuite such as MapReduce.
• Domino improves system performance by 26% (more for certain benchmarks) and outperforms [8] (which also prefetches temporal data but uses PC and address correlation) for server workloads. However, the evaluation assumes that infinite space is available to store the MHT-1/2 tables, which is not possible in real-world systems.
In [11], Fuchs et al. (2014) use code block working sets (CBWS), which provide the address trace of the ordered lines accessed by loops; the prefetcher uses this data to fetch the entire data block required for a loop iteration. The proposed prefetcher is implemented as an addition to the spatial memory streaming (SMS) prefetcher. BLOCK_BEGIN and BLOCK_END instructions, added to the ISA, mark the block boundaries. To avoid timing constraints, the prefetcher stores a history of depth k, which enables predicting the required blocks farther ahead. Although developed for in-order pipelines, the prefetcher can also be used for out-of-order pipelines by fetching the address during the commit stage.
• The prefetcher fetches spatial data from the main memory to the L2 cache (pull based) and is software controlled, using prediction to prefetch data. It is dynamic and developed for multi-core processors. Since it uses additional hardware to store the history table, this prefetcher can be classified as a hybrid prefetcher.
• The CBWS prefetcher is implemented as an add-on to the SMS prefetcher, and performance evaluation is done in the gem5 simulator using SPEC 2006 benchmarks. The performance is compared with using only the SMS prefetcher.
The metrics used for evaluation are cache misses, accuracy, timely prefetches and overall speed-up.
• The CBWS + SMS prefetcher shows fewer cache misses for the majority of benchmarks, compared to the stride, GHB and SMS prefetchers. Also, compared to SMS alone, CBWS + SMS shows an improvement in timely accesses, but only for memory-intensive benchmarks. An improvement is also seen in accuracy, and the speed-up is 1.16× over the SMS prefetcher across all benchmarks.
• This prefetcher uses less than 1KB of storage, but additional space is needed for the differential history table (DHT). For evaluation, a DHT of size 16 is used.
In [3], Kadjo et al. propose a prefetcher that performs control-flow speculation and effective-address value speculation to accurately predict future memory references. The B-Fetch design relies on the expected path through future basic blocks and the effective addresses of the upcoming load instructions. First, the future execution path is predicted by the branch predictor. B-Fetch then analyzes the variation in register contents caused by the earlier branch instructions, and this information is used to predict the effective address. The Memory History Table (MHT) stores the source, current and offset values. Using the variation observed in register values over the history of effective addresses enables correct prefetching even for instructions that exhibit irregular control flow and data access patterns.
• The technique uses a software-controlled hybrid prefetcher that is dynamic in nature. It is pull based, moving data from memory to the LLC.
• The prefetcher is tested using SPEC CPU2006 benchmarks in the gem5 simulator, and the results obtained are compared with the Stride and
SMS prefetchers. For the evaluation, single- and multi-threaded loads are used.
• B-Fetch achieves a mean speed-up of 23.0% over the baseline with a 12.94KB storage budget, but additional hardware in the form of memory is required to store the MHT.
In [12], Aziz et al. propose a method to control prefetching aggressiveness for network-based multiprocessors. The controller minimizes the processor penalty by adjusting the prefetching aggressiveness, using a hill-climbing approach to reduce the penalty. The method has five steps. First, the read-miss transactions arriving at the directory are captured. Address prediction then predicts the next address the processor will demand, and aggressiveness prediction determines the number of cache blocks to be sent to the private cache. This is followed by building the network packet containing all the predicted block addresses, and finally the read request for the predicted addresses is issued to the private cache. The prefetcher operates on the first level of cache, which helps avoid network coherence delays.
• The prefetcher uses a software-controlled algorithm and operates in both static and dynamic modes (with respect to the prefetching degree). It is pull based and uses a private cache for storing the retrieved data.
• The prefetcher is tested using the PARSEC and SPLASH-2 benchmarks, along with benchmarks from MiBench. The testing is done on four Extended SPARC V8 ArchC cores with L1 and L2 caches.
• The prefetcher can reduce the penalty by 7% and achieves a 24% increase in prefetching accuracy for a fixed degree of prefetching.
In [13], Garside and Audsley propose a stream-based prefetcher built around a Prefetch Unit (PU). The PU includes a stream buffer, a prefetch buffer and a squash buffer. It gathers information about data access trends by snooping on all transactions made to the main memory.
For each CPU, a separate shared memory tile is present through which it accesses its memory. The PU receives the memory request; if the requested address is present in the prefetch buffer, it is added to the squash buffer, otherwise it is treated as a miss and the stream buffer is updated.
• The prefetcher is developed for NoC multiprocessors, is event triggered and is of the dynamic type.
• The PU is implemented within a 4x4 Bluetiles NoC on a Xilinx Virtex-7 FPGA with external memory, and each CPU is connected to the shared memory tree. The CPUs are configured to run at 50MHz.
• It is observed that larger prefetch distances yield better results as the memory load increases. Improvements are also seen in prefetch timeliness and accuracy.
In [1], Khan et al. propose the use of low-overhead runtime sampling and fast cache modeling for prefetching. The prefetcher improves single-thread performance while minimizing off-chip traffic and off-chip bandwidth consumption. The prefetcher samples memory instructions randomly, and the data cache blocks accessed by the sampled instructions are monitored for reuse. If a block is reused, a stride sample is recorded; the stride sample is the difference between the current and previous memory addresses accessed by the instruction. The reuse samples are used to build a per-instruction cache performance model, and the stride samples are analyzed to find the appropriate prefetch distance. A prefetch instruction is then scheduled for the load.
• This prefetching technique uses a software-based algorithm and is dynamic in nature.
• The technique was evaluated on SPEC CPU 2006 benchmarks. With a 512KB L2 cache, it covered 94% of misses on average.
• This prefetching method keeps off-chip traffic to a minimum, avoids LLC pollution and lowers off-chip bandwidth demand. The results show that multicores achieve higher throughput when resource-efficient prefetching is used.
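The stride sampling step used by [1] can be illustrated with a small sketch: record the difference between a load's current and previous effective addresses, and predict the next address once the stride repeats. The structure layout, confidence rule and function names are our own assumptions for illustration, not details from the paper:

```c
#include <stdint.h>

/* Per-instruction stride sample (hypothetical layout). */
typedef struct {
    uintptr_t last_addr;  /* previous sampled address for this load */
    intptr_t  stride;     /* last observed stride                   */
    int       confident;  /* nonzero once the stride has repeated   */
} stride_sample;

/* Feed one sampled effective address; returns the predicted next
 * address once the stride is stable, or 0 while still training. */
uintptr_t stride_observe(stride_sample *s, uintptr_t addr) {
    intptr_t d = (intptr_t)(addr - s->last_addr);
    s->confident = (s->last_addr != 0 && d == s->stride);
    s->stride = d;
    s->last_addr = addr;
    return s->confident ? addr + (uintptr_t)d : 0;
}
```

In the full scheme, the predicted address is combined with the reuse-based cache model to pick a prefetch distance before a prefetch instruction is scheduled.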
From the above study and analysis, most recent work on prefetching has focused on developing lightweight prefetchers, especially for NoC multiprocessors. A common feature of the NoC-based prefetchers is storing data directly from the main memory into the L1 cache instead of using a dedicated prefetch cache. Additionally, all prefetchers that use prediction to determine future accesses need additional hardware (dedicated memory space) for storing history, and are hence better classified as hybrid prefetchers rather than purely software-based ones. The features used for making predictions and the amount of history stored also vary across techniques.
IV. SOFTWARE BASED DATA PREFETCHING
Microprocessors implement software prefetching using fetch instructions. Fetch instructions do not block on outstanding memory operations, which requires a lockup-free cache; this allows prefetches to bypass outstanding memory operations. Software-initiated prefetching requires minimal hardware compared to the other prefetching techniques; its complexity lies in the placement of the fetch instructions within the target application. Most software-based prefetching techniques are used together with a standard technique such as an SMS or GHB prefetcher, as in [11], where CBWS is an additional component of the SMS prefetcher that improves overall performance.
In this project, we have used the built-in prefetch intrinsics available in the gcc compiler in loop-intensive code to prefetch the data required by the next iteration. This is a simple and efficient approach for applications such as image processing and matrix operations. For loops that access data in strides, or that contain additional computation requiring prefetching several iterations ahead rather than just one, a prefetch distance must be determined. This distance is a function of the latency caused by cache misses and the cycle time of the shortest path through one iteration of the loop [6].
V. EVALUATION AND RESULTS
To evaluate the compiler's built-in prefetching, we first developed a C++ program to compute the transpose of a matrix and used _mm_prefetch(char *p, int k) to prefetch data before the next loop iteration, where p is the address of the data to prefetch and k specifies the type of prefetch.
Figure 1. Types of prefetches using _mm_prefetch(). [16]
Another prefetching function is __builtin_prefetch(&i, j, k), where i is the pointer to the address to be fetched and j and k define the type of prefetching. This was used in loop-intensive MiBench benchmarks such as fft, which contains many nested loops. The figure below shows part of the fft loop with the prefetching commands.

#define DO_PREFETCH
....
for (i = 0; i < MAXSIZE; i++) {
    RealIn[i] = 0;
    for (j = 0; j < MAXWAVES; j++) {
        if (rand() % 2) {
            RealIn[i] += coeff[j] * cos(amp[j] * i);
        } else {
            RealIn[i] += coeff[j] * sin(amp[j] * i);
            ImagIn[i] = 0;
#ifdef DO_PREFETCH
            __builtin_prefetch(&coeff[j + 1], 0, 1);
            __builtin_prefetch(&amp[j + 1], 0, 1);
#endif
        }
    }
}

Figure 2. Loop of the fft (main) function with prefetching
For comparison, this prefetch instruction was also applied to the Dijkstra benchmark, which has just one loop.
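The matrix-transpose experiment can be sketched as follows. The project's actual kernel is not shown in the text, so this is our own minimal version, using the gcc __builtin_prefetch hint; the matrix size N and the one-iteration prefetch distance are illustrative assumptions (a strided loop like this may warrant a larger distance, chosen from the miss latency and per-iteration cycle time as discussed above [6]):

```c
#include <stddef.h>

#define N 64  /* illustrative matrix dimension */

/* Transpose src into dst.  The destination is written column-wise
 * (stride of N doubles), so each write touches a new cache line;
 * we hint the next destination line one iteration ahead.
 * rw = 1 marks the prefetch as being for a write. */
void transpose(double src[N][N], double dst[N][N]) {
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            if (j + 1 < N)
                __builtin_prefetch(&dst[j + 1][i], 1, 1);
            dst[j][i] = src[i][j];
        }
    }
}
```

The hint changes no results, only the latency of the strided writes, which is why the measured effect shows up as a wall-clock speed-up rather than a change in output.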
On running the code compiled with gcc, the maximum speed-up for fft was about 15%, and for matrix transpose a speed-up of 5% was achieved. In comparison, Dijkstra showed a negligible performance difference. It has been seen that compiler-based prefetching, combined with other optimization techniques such as loop unrolling, can yield significant performance improvements.
VI. CONCLUSION
Data prefetching is one of the most commonly used techniques in processors for reducing memory access latency and cache misses and improving the overall performance of the system. While hardware-based prefetching schemes were suitable for in-order processors with predictable memory accesses, with the adoption of out-of-order execution software-based prefetching techniques have shown better performance. With the development of modern processors such as NoC multiprocessors and chip multiprocessors, combinations of hardware and software prefetching schemes are being used; this class of prefetchers is called hybrid prefetchers.
In this project, we have reviewed various prefetching techniques developed for modern processors. We have classified each technique based on a set of features, which makes it easier to identify the types of applications and processors each is suitable for. We have also listed the evaluation methodology, assumptions and performance improvement of each technique. Finally, we have implemented a simple compiler-based software data prefetching technique and evaluated the results.
REFERENCES
[1] M. Khan, A. Sandberg and E. Hagersten, "A Case for Resource Efficient Prefetching in Multicores," 2014 43rd International Conference on Parallel Processing, Minneapolis, MN, 2014, pp. 101-110. doi: 10.1109/ICPP.2014.19
[2] M. Cireno, A. Aziz and E.
Barros, "Temporized data prefetching algorithm for NoC-based multiprocessor systems," 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP), London, 2016, pp. 235-236. doi: 10.1109/ASAP.2016.7760805
[3] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz and D. Jimenez, "B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, 2014, pp. 623-634. doi: 10.1109/MICRO.2014.29
[4] S. Byna, Y. Chen and X. H. Sun, "A Taxonomy of Data Prefetching Mechanisms," 2008 International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN 2008), Sydney, NSW, 2008, pp. 19-24. doi: 10.1109/I-SPAN.2008.24
[5] S. Mittal, "A Survey of Recent Prefetching Techniques for Processor Caches," 20xx. ACM Computing Surveys 0, 0, Article 0 (2016), 36 pages.
[6] S. VanderWiel and D. J. Lilja, "A Survey of Data Prefetching Techniques," Technical Report No. HPPC-96-05. doi: 10.1.1.2.4449. [Online]: http://citeseerx.ist.psu.edu/
[7] "Manycore -vs- Multicore," [Online]: https://goparallel.sourceforge.net/ask-james-reinders-multicore-vs-manycore/
[8] A. Jain and C. Lin, "Linearizing irregular memory accesses for improved correlated prefetching," International Symposium on Microarchitecture, December 2013, pp. 247-259. doi: 10.1145/2540708.2540730
[9] J. Li, C. J. Xue and Y. Xu, "STT-RAM based energy-efficiency hybrid cache for CMPs," 2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, Hong Kong, 2011, pp. 31-36. doi: 10.1109/VLSISoC.2011.6081626
[10] M. Bakhshalipour, P. Lotfi-Kamran and H. Sarbazi-Azad, "An Efficient Temporal Data Prefetcher for L1 Caches," IEEE Computer Architecture Letters, vol. PP, no. 99, pp. 1-1. doi: 10.1109/LCA.2017.2654347
[11] A. Fuchs, S. Mannor, U. Weiser and Y. Etsion, "Loop-Aware Memory Prefetching Using Code Block Working Sets," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, 2014, pp. 533-544. doi: 10.1109/MICRO.2014.27
[12] J. Garside and N. C. Audsley, "Prefetching across a shared memory tree within a Network-on-Chip architecture," 2013 International Symposium on System on Chip (SoC), Tampere, 2013, pp. 1-4. doi: 10.1109/ISSoC.2013.6675268
[13] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz and D. Jimenez, "B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, 2014, pp. 623-634. doi: 10.1109/MICRO.2014.29
[14] A. Aziz, M. Cireno, E. Barros and B. Prado, "Balanced prefetching aggressiveness controller for NoC-based multiprocessor," 2014 27th Symposium on Integrated Circuits and Systems Design (SBCCI), Aracaju, 2014, pp. 1-7. doi: 10.1145/2660540.2660541
[15] M. Khan, A. Sandberg and E. Hagersten, "A Case for Resource Efficient Prefetching in Multicores," 2014 43rd International Conference on Parallel Processing, Minneapolis, MN, 2014, pp. 101-110. doi: 10.1109/ICPP.2014.19
[16] J. Lee, H. Kim and R. Vuduc, "When prefetching works, when it doesn't, and why," ACM Trans. Archit. Code Optim.
9, 1, Article 2 (March 2012), 29 pages. http://doi.acm.org/10.1145/2133382.2133384