SURVEY ON CACHE REPLICATION
MECHANISMS
BY
LAKSHMI YASASWI KAMIREDDY
(651771619)
CONTENTS
Abstract
1. Introduction
2. Background
3. Schemes
3.1. Victim Replication
3.2. Adaptive Selective Replication
3.3. Adaptive Probability Replication
3.4. Dynamic Reusability based Replication
3.5. Locality Aware Data Replication
4. Results
5. Conclusions
6. References
Abstract
Present-day systems place a high demand on multicore processors. As the number of cores on a Chip Multi-Processor
(CMP) increases, so does the need for effective management of the on-chip cache. Cache management plays an
important role in improving performance, which is achieved by reducing the number of misses and the miss latency.
These two factors, the number of misses and the miss latency, cannot be reduced at the same time. Some CMPs use a
shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses, while others use private L2
caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals
use selective replication to strike a balance between miss latency and on-chip capacity. There are two kinds of
replication: static replication and dynamic replication. This paper focuses on the existing dynamic replication
schemes and gives an analysis of each scheme on several benchmarks.
1. Introduction
Upcoming generations of multicore processors and applications will operate on massive data. A major challenge for
near-future multicore processors is the data movement incurred by conventional cache hierarchies, which has a very
high impact on off-chip bandwidth, on-chip memory access latency, and energy consumption. A large on-chip cache is
possible, but it is not a scalable solution; it is limited to a small number of cores, and hence the only practical option is to
physically distribute memory in pieces so that every core is near some portion of the cache. Such a solution can provide
a large amount of aggregate cache capacity and fast private memory for each core, but at the same time the distributed
cache and network resources are difficult to manage efficiently, as they require architectural support for cache coherence
and consistency under the ubiquitous shared memory model. Most directory-based protocols enable fast local caching to
exploit data locality, but even they have scalability issues. Some recent proposals have addressed the issue of
directory scalability in single-chip multicores using sharer compression techniques or limited directories. But fast
private caches still suffer from two major problems: (1) due to capacity constraints, they cannot hold the working set of
applications that operate on massive data, and (2) due to frequent communication between cores, data is often displaced
from them [1]. This leads to increased network traffic and a higher request rate to the last-level cache. On-chip wires do
not scale at the same pace as transistors, so data movement not only impacts memory access latency but also consumes
more power due to the energy consumption of network and cache resources [2]. Though private LLC
organizations (e.g., [3]) have low hit latencies, their off-chip miss rates are high in applications that have uneven
distributions of working sets or exhibit high degrees of sharing (due to cache line replication). Shared LLC organizations
(e.g., [4]), on the other hand, lead to non-uniform cache access (NUCA) [5] that hurts on-chip locality, but their off-chip
miss rates are low since cache lines are not replicated. Several proposals have explored the idea of hybrid LLCs.
Replication mechanisms have been proposed to balance access latency and cache capacity in hybrid L2 cache
designs [6][7]. Two types of replication approaches have been proposed: static [8, 9] and dynamic [10, 11, 12, 13, 14].
In static replication, a data block is placed through predefined address interleaving; therefore, the set of LLC banks that
may contain that data block is fixed. The placement of instruction pages in R-NUCA [8] and in S-NUCA [9] is
static. In dynamic replication, a data block can be placed in any LLC bank. Victim Replication [10], Adaptive Selective
Replication [11], Adaptive Probability Replication [12], Dynamic Reusability-based Replication [13], and Locality-Aware
data replication at the Last-Level Cache [14] fall into this category. These replication mechanisms have their own
advantages and disadvantages. This paper analyzes these dynamic replication schemes.
2. Background
Chronologically, the first of the dynamic replication mechanisms listed above is Victim Replication
(VR) [10]. VR is based on shared caches, but it tries to capture evictions from the local primary cache in the
local L2 slice to reduce subsequent access latency to the same cache block. Victim replicas and global L2 cache blocks
share L2 slice capacity. In VR, all primary cache misses must first check the local L2 tags in case there is a valid local
replica. On a replica miss, the request is forwarded to the home tile. On a replica hit, the replica is invalidated in the local
L2 slice and moved into the primary cache [10]. The next technique is Adaptive Selective Replication
(ASR) [11], which adopts a replication mechanism similar to VR but focuses on the capacity contention between replicas
and global L2 cache blocks. ASR dynamically estimates the cost (extra misses) and benefit (lower hit latency) of
replication and adjusts the number of receivable victims to avoid hurting L2 cache performance [11]. Another
replication scheme, the Adaptive Probability Replication (APR) [12] mechanism, counts each
block's accesses in L2 cache slices and monitors the number of evicted blocks with different numbers of accesses, to
estimate at runtime the re-reference probability of blocks over their lifetime. Using the predicted re-reference probability,
APR adopts a probability replication policy and a probability insertion policy to replicate blocks at corresponding
probabilities and insert them at appropriate positions, according to their re-reference probability [12]. In the same
conference, another mechanism named Dynamic Reusability-based Replication (DRR) [13] was introduced. DRR is a
hybrid cache architecture that dynamically monitors the reuse pattern of cache blocks and replicates blocks with high
reusability to appropriate L2 cache slices [13]. Replicas are shared by nearby cores through a fast lookup mechanism,
Network Address Mapping, which records the location of the nearest replica in network interfaces and forwards
subsequent L1 miss requests to the replica immediately. DRR improves the performance of shared caches by exploiting
reusability-based replication, the fast lookup mechanism, and replica sharing. The most recent technique is the
locality-aware selective data replication protocol for the last-level cache (LLC) [14]. This method lowers memory access
latency and energy by replicating only high-locality cache lines in the LLC slice of the requesting core, while
simultaneously keeping the off-chip miss rate low. The approach relies on low-overhead yet highly accurate in-hardware
runtime classification of data locality at the cache-line granularity, and only allows replication for cache lines with high
reuse [14]. A classifier captures the LLC pressure at the existing replica locations and adapts the replication decision
accordingly. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability
concerns in traditional coherence protocols. The following sections discuss the schemes in detail.
3. Schemes
3.1. Victim Replication (VR)
Victim replication (VR) is a hybrid scheme that combines the large capacity of a shared L2 cache with the low hit
latency of a private L2 cache. VR is primarily based on a shared L2 cache, but in addition tries to capture evictions from the
local primary cache in the local L2 slice. Each retained victim is a local L2 replica of a line that already exists in the
L2 of the remote home tile. When a miss occurs at the shared L2 cache, a line is brought in from memory and placed in
the on-chip L2 at a home tile determined by a subset of the physical address bits, as in a shared L2 cache. The requested line
is directly forwarded to the primary cache of the requesting processor. If the line's residency in the primary cache is
terminated because of an incoming invalidation or writeback request, the usual shared L2 cache protocol is followed. If a
primary cache line is evicted because of a conflict or capacity miss, then a copy of the victim line is kept in the local slice
to reduce subsequent access latency to the same line. A global line with remote sharers is never evicted in favor of a local
replica, as an actively cached global line is likely to be in use. The VR replication policy will replace the following classes
of cache lines in the target set in descending priority order: (1) An invalid line; (2) A global line with no sharers; (3) An
existing replica. If there are no lines belonging to these three categories, no replica is made and the victim is evicted from
the tile as in shared L2 cache [10]. If there is more than one line in the selected category, VR picks at random. All primary
cache misses first check the local L2 tags in case there’s a valid local replica. On a replica miss, the request is forwarded
to the home tile. On a replica hit, the replica is invalidated in the local L2 slice and moved into the primary cache. When a
downgrade or invalidation request is received from the home tile, the L2 tags will also be checked in addition to the
primary cache tags [10].
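For illustration, the following C sketch captures the victim-selection priority just described; the line-state encoding, the
fixed candidate bound, and the function interface are assumptions of this sketch, not details from [10].

    /* Minimal sketch of VR's replica-placement priority, under an assumed
     * line-state encoding; not the actual hardware implementation in [10]. */
    #include <stdlib.h>

    typedef enum { INVALID, GLOBAL_NO_SHARERS, GLOBAL_SHARED, REPLICA } LineState;

    /* Return the way in the target set that may hold a replica, or -1 if
     * no replica is made and the victim is evicted as in a shared L2. */
    int select_replica_way(const LineState set[], int ways) {
        LineState priority[] = { INVALID, GLOBAL_NO_SHARERS, REPLICA };
        for (int p = 0; p < 3; p++) {
            int candidates[64], n = 0;          /* assumes ways <= 64 */
            for (int w = 0; w < ways; w++)
                if (set[w] == priority[p]) candidates[n++] = w;
            if (n > 0)
                return candidates[rand() % n];  /* ties are broken at random */
        }
        return -1;  /* never displace a global line with remote sharers */
    }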
3.2. Adaptive Selective Replication (ASR)
Adaptive Selective Replication (ASR) obtains the optimum replication level by balancing the benefits of replication against
its costs. L2 cache block replication improves memory system performance when the average L1 miss latency is reduced.
The following equation describes the average cycles for L1 cache misses, normalized by instructions executed:
$$\frac{\text{L1 miss cycles}}{\text{Instruction}} \;=\; \frac{P_{localL2}\,L_{localL2}}{\text{Instructions}/\text{L1 misses}} \;+\; \frac{P_{remoteL2}\,L_{remoteL2}}{\text{Instructions}/\text{L1 misses}} \;+\; \frac{P_{miss}\,L_{miss}}{\text{Instructions}/\text{L1 misses}}$$
P_x is the probability of a memory request being satisfied by the entity x, where x is the local L2 cache, a remote L2 cache, or
main memory, and L_x equals the latency of that entity [11]. The combination of the localL2 and remoteL2 terms represents
the memory cycles spent on L2 cache hits, and the third term depicts the memory cycles spent on L2 cache misses.
Replication increases the probability that L1 misses hit in the local L2 cache, thus the PlocalL2 term increases and the
PremoteL2 term decreases. Because the latency of a local L2 cache hit is tens of cycles faster than a remote L2 cache hit,
the net effect of increasing replication is a reduction in cycles spent on L2 cache hits. However, more replication devotes
more capacity to replica blocks, thus fewer unique blocks exist on-chip, increasing the probability of L2 cache misses,
Pmiss. If the probability of a miss increases significantly due to replication, the miss term will dominate, as the latency of
memory is hundreds of cycles greater than the L2 hit latencies. Therefore, balancing these three terms is necessary to
improve memory system performance.
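As a concrete illustration of the equation, the short C program below evaluates it for one hypothetical design point; all
latencies and probabilities are invented for the example, not measurements from [11].

    /* Worked example of the average L1 miss-cycle equation; every number
     * here is an illustrative assumption, not data from [11]. */
    #include <stdio.h>

    int main(void) {
        double P_local  = 0.50, L_local  = 15.0;   /* local L2 hit */
        double P_remote = 0.45, L_remote = 40.0;   /* remote L2 hit */
        double P_miss   = 0.05, L_miss   = 300.0;  /* off-chip miss */
        double instr_per_l1_miss = 20.0;           /* Instructions / L1 misses */

        double cpi_contrib = (P_local * L_local + P_remote * L_remote
                              + P_miss * L_miss) / instr_per_l1_miss;

        /* More replication raises P_local (cheaper hits) but also raises
         * P_miss (fewer unique blocks on-chip); the balance decides. */
        printf("L1 miss cycles per instruction: %.2f\n", cpi_contrib);
        return 0;
    }

With these numbers the three terms contribute 7.5, 18, and 15 cycles per L1 miss, about 2.03 cycles per instruction;
shifting probability from the remote term to the local term lowers the total until the growing miss term dominates.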
Optimal performance often arises from an intermediate replication level. Figure 1 graphically depicts this tradeoff. The
Replication Benefit curve, Figure 1(a), illustrates the trend that increasing replication reduces L2 cache hit cycles. Due to
the strong locality of shared read-only requests, a small degree of L2 replication can significantly reduce L2 hit cycles by
moving many previously remote L2 hits into the local cache. In contrast, further increases in replication reduce L2 hit
cycles only gradually, because fewer unique blocks on-chip lead to fewer total L2 hits. The Replication Cost curve, Figure 1(b), illustrates
that increasing L2 replication increases the memory cycles spent on off-chip misses. The Replication Effectiveness curve,
Figure 1(c), combines the benefit and cost curves and plots the total memory cycles. Because the benefit and cost curves
are generally convex and have opposite slopes, the minimum of the Replication Effectiveness curve often lies between
allowing all replications and no replications. ASR estimates the slopes of the benefit and cost curves to approximate the
optimal replication level.
Figure 1: Replication benefit (a), cost (b), and effectiveness (c) curves [11]
By dynamically monitoring the benefit and cost of replication, ASR attempts to achieve the optimal level of replication.
ASR identifies discrete replication levels and makes a piecewise approximation of the memory cycle slope [11]. Thus
ASR simplifies the analysis to a local decision of whether the amount of replication should be increased, decreased, or
remain the same. Figure 1 illustrates the case where the current replication level, labeled C, results in HC hit cycles-per-
instruction and MC miss cycles-per-instruction. ASR considers three alternatives: (i) increasing replication to the next
higher level, labeled H, (ii) decreasing replication to the next lower level, labeled L, or (iii) leaving the replication
unchanged [11]. To make this decision, ASR not only needs HC and MC, but also four additional hit and miss cycles-per-
instruction values: HH and MH for the next higher level and HL and ML for the next lower level. To simplify the
collection process, ASR estimates only the four differences between the hit and miss cycles-per-instruction: (1) the benefit
of increasing replication (decrease in L2 hit cycles, HC - HH); (2) the cost of increasing replication (increase in L2 miss
cycles, MH - MC); (3) the benefit of decreasing replication, (decrease in L2 miss cycles, MC - ML); and (4) the cost of
decreasing replication (increase in L2 hit cycles, HL - HC). By comparing these cost and benefit counters, ASR will
increase, decrease, or leave unchanged the replication level.
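The resulting decision rule can be summarized by the sketch below; the function boundary and the point at which the
decision fires are assumptions, while the four deltas correspond directly to the counters described in the text.

    /* Sketch of ASR's replication-level decision [11], comparing the four
     * estimated cost/benefit deltas; the triggering policy is assumed. */
    typedef enum { DECREASE, STAY, INCREASE } AsrDecision;

    AsrDecision asr_decide(double benefit_inc,  /* HC - HH: hit cycles saved  */
                           double cost_inc,     /* MH - MC: miss cycles added */
                           double benefit_dec,  /* MC - ML: miss cycles saved */
                           double cost_dec)     /* HL - HC: hit cycles added  */
    {
        if (benefit_inc > cost_inc) return INCREASE;  /* more replication pays */
        if (benefit_dec > cost_dec) return DECREASE;  /* less replication pays */
        return STAY;
    }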
3.3. Adaptive Probability Replication (APR)
APR is based on a distributed shared L2 cache design. To predict re-reference probability, APR adds a counter to
each cache block to record and transfer its number of accesses. In APR, each tile stores the re-reference probability of
blocks from other remote L2 cache slices in its network interface component, using a simple lookup table called the
Re-Reference Probability Buffer (RRPB) [12]. The RRPB keeps re-reference probability entries for all other L2 slices;
each entry holds replication thresholds for different numbers of accesses, which indicate the re-reference probability of
blocks with those access counts. In the local L2 slice, if there is an invalid block or the victim is not a shared global
block, the replica is filled into the L2 cache slice; otherwise, the replication is abandoned. The insert position of a replica
is determined by its corresponding re-reference probability. When a replica is accessed again, it is deleted from the local
L2 cache slice and moved to the local L1 cache.
APR counts every access to L2 cache blocks and records the number of evicted blocks with different numbers of
accesses, to estimate re-reference probability at runtime. The re-reference probability of a block with N accesses is the
fraction of evicted blocks with more than N accesses among the evicted blocks with at least N accesses. The estimation
is performed only when a global block replacement occurs (not a replica replacement). The re-reference probabilities are
propagated to all other tiles at a certain interval (such as 10000 cycles) by attaching them to any response message.
Because blocks from other remote L2 slices may be accessed in local L2 cache slices due to replication, each replica
access also increments the corresponding counter associated with the block. When a replica is accessed, the associated
counter is moved along with it to the L1 cache block. The counter values of blocks in L1 caches are sent back to the
home L2 slice when the blocks are evicted, to accumulate the number of accesses.
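The estimator implied by this definition can be sketched as follows; the eviction histogram layout and its size are
assumptions of this sketch.

    /* Sketch of APR's runtime re-reference probability estimate [12].
     * evicted[a] counts global blocks evicted after exactly a accesses;
     * the table bound is an assumption. */
    #define MAX_ACC 16

    double rereference_probability(const unsigned evicted[MAX_ACC], int n) {
        unsigned at_least_n = 0, more_than_n = 0;
        for (int a = n; a < MAX_ACC; a++) {
            at_least_n += evicted[a];
            if (a > n) more_than_n += evicted[a];
        }
        /* P(re-reference | n accesses) = evicted blocks with > n accesses
         * over evicted blocks with >= n accesses. */
        return at_least_n ? (double)more_than_n / at_least_n : 0.0;
    }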
As in ASR, a linear feedback shift register generates a pseudo-random number, which is compared to the corresponding
replication threshold. When an evicted L1 block passes through the network interface, APR captures this message and
looks up the corresponding RRPB entry according to its address. If the replication threshold corresponding to the block's
number of accesses is less than the generated random number, the block is sent to the local L2 slice; otherwise, it is
evicted to the home L2 slice. In the local L2 cache slice, the block is inserted if there is an invalid block or the victim is
not a shared global block; otherwise, it is sent to the home L2 slice. Blocks with more accesses have higher re-reference
probability. Probability insertion is implemented in APR according to the number of accesses of the replicated block,
which indicates the insert position. If the number of accesses of a block exceeds the way size, the block is inserted at the
MRU position. The aim of probability insertion is to make blocks with lower re-reference probability survive for a
shorter time.
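Putting the pieces together, a sketch of the replication and insertion decisions is shown below. The comparison direction
follows the text above, the 16-bit LFSR is a textbook construction, and the exact mapping from access count to insert
position is an assumption of this sketch, not a detail from [12].

    /* Sketch of APR's probability replication and insertion [12]. */
    #include <stdbool.h>
    #include <stdint.h>

    static uint16_t lfsr = 0xACE1u;  /* textbook 16-bit Fibonacci LFSR */
    static uint16_t lfsr_next(void) {
        uint16_t bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1u;
        lfsr = (uint16_t)((lfsr >> 1) | (uint16_t)(bit << 15));
        return lfsr;
    }

    /* Replicate an evicted L1 block locally? thresholds[] comes from the
     * RRPB entry for the block's home slice; n_accesses is assumed to be
     * saturated to the table size. Comparison direction as in the text. */
    bool apr_replicate(const uint16_t thresholds[], int n_accesses) {
        return thresholds[n_accesses] < lfsr_next();
    }

    /* Probability insertion: more accesses -> closer to MRU (position 0),
     * so low-probability replicas survive a shorter time. The linear
     * mapping here is illustrative. */
    int apr_insert_position(int n_accesses, int ways) {
        return (n_accesses >= ways) ? 0 : ways - 1 - n_accesses;
    }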
3.4. Dynamic Reusability based Replication (DRR)
DRR dynamically replicates blocks with high reusability to other appropriate L2 cache slices and allows the replicas to
be shared by nearby cores via a fast lookup mechanism [13]. A set-associative Core Access Counter Buffer (CACB) is
used to determine which block should be replicated and the corresponding destination of the replication. For recently
accessed blocks, the CACB records access counts for each core more than a certain number of hops (for example,
2 hops, which is also the smallest distance between the home slice and replica slices) away from the home slice. Thus,
only 10 counters are needed in one CACB entry for a 16-core CMP. When the block receives a Read request from one
core, the corresponding counter increases. To maintain coherence, when the block receives a Write request, all the
counters of the block are reset to zero. In the CACB, a larger counter means higher reusability. When the maximum
counter of a block reaches a certain threshold (for example, 5), the block is replicated to the slice corresponding to the
maximum counter.
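A simplified view of the CACB bookkeeping is sketched below; the counter width is an assumption, and the filtering of
cores by hop distance (only cores more than 2 hops from the home slice have counters) is omitted for brevity.

    /* Sketch of DRR's Core Access Counter Buffer (CACB) update [13]. */
    #include <stdbool.h>

    #define N_CORES 16
    #define REPL_THRESHOLD 5   /* example threshold from the text */

    typedef struct { unsigned char count[N_CORES]; } CacbEntry;

    /* Returns the core whose counter reached the threshold (replicate to
     * the slice nearest that core), or -1 if no replication is triggered. */
    int cacb_access(CacbEntry *e, int core, bool is_write) {
        if (is_write) {                       /* writes reset all counters */
            for (int c = 0; c < N_CORES; c++) e->count[c] = 0;
            return -1;
        }
        if (++e->count[core] >= REPL_THRESHOLD)
            return core;
        return -1;
    }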
After the block to replicate and the destination have been determined, the home L2 cache slice sends a replication
request to the destination. When the destination receives the replication request, it allocates cache space to hold the
replica. If the destination has space available for the replica, it responds with an acknowledgement to the home L2 cache
slice; otherwise, it responds with a fail message. Once the replication operation is completed, the replication destination
is stored in a set-associative Replication Directory Buffer (RDB) in the home L2 slice. If the replication fails, the
destination is not stored in the RDB. When a Read request reaches the home L2 cache slice, if the distance between the
requesting core and the nearest replica is less than the given replica distance (for example, 3 hops in a 16-core CMP), the
request is forwarded to the nearest replica; otherwise, the request is satisfied at the home L2 cache slice.
When a replica receives a forwarded request, it responds with data to the requesting core. When the data response
message passes through the network interface of the requesting core, the replica's location is stored in a set-associative
Network Address Mapping Buffer (NAMB). The NAMB is embedded in the network interface and records the locations
of replicas that have serviced requests for the core. When an L1 cache Read miss request passes through the network
interface, it first searches the NAMB. On a NAMB hit, the request is forwarded to the recorded replica location
immediately; otherwise, the request continues on to the home L2 cache slice. For coherency maintenance, when an L1
cache Write miss request passes through the network interface, it travels to the home L2 cache slice and does not search
the NAMB. This ensures that write operations can be serialized at the unique home L2 cache slice. Because the NAMB
is embedded in the network interface, its access latency can be hidden behind other network interface operations.
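The routing decision made at the requesting core's network interface can be sketched as follows; the direct-mapped
organization and the tile-id types are assumptions of this sketch.

    /* Sketch of the NAMB lookup on an L1 miss [13]. */
    #include <stdbool.h>

    #define NAMB_SETS 64  /* assumed direct-mapped organization */

    typedef struct { unsigned long tag; int replica_tile; bool valid; } NambEntry;

    int namb_route(const NambEntry namb[NAMB_SETS], unsigned long addr,
                   int home_tile, bool is_write) {
        if (is_write) return home_tile;  /* writes serialize at the home slice */
        const NambEntry *e = &namb[addr % NAMB_SETS];
        if (e->valid && e->tag == addr)
            return e->replica_tile;      /* hit: forward to the nearest replica */
        return home_tile;                /* miss: continue to the home slice */
    }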
3.5. Locality Aware Data Replication at Last Level Cache
Run-length is defined as the number of accesses to a cache line (at the LLC) from a particular core before a conflicting
access by another core or before it is evicted. The higher the run-length, the greater the benefit of replicating the cache
line in the requester's LLC slice. Instructions and shared data (both read-only and read-write) can
be replicated if they demonstrate good reuse. It is also important to adapt the replication decision at runtime in case the
reuse of data changes during an application’s execution.
On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is
inserted at the private L1 cache. A Replica Reuse counter at the LLC directory entry is incremented. The replica reuse
counter is a saturating counter used to capture reuse information. It is initialized to ‘1’ on replica creation and incremented
on every replica hit. On the other hand, if a replica is not found, the request is forwarded to the LLC home location. If the
cache line is not found there, it is either brought in from the off-chip memory or the underlying coherence protocol takes
the necessary actions to obtain the most recent copy of the cache line. A replication mode bit is used to identify whether a
replica is allowed to be created for the particular core and a home reuse counter is used to track the number of times the
cache line is accessed at the home location by the particular core. This counter is initialized to ‘0’ and incremented on
every hit at the LLC home location. If the replication mode bit is set to true, the cache line is inserted in the requester's
LLC slice and the private L1 cache. Otherwise, the home reuse counter is incremented. If this counter has reached the
Replication Threshold (RT), the requesting core is “promoted” and the cache line is inserted in its LLC slice and private
L1 cache. If the home reuse counter is still less than RT, a replica is not created. The cache line is only inserted in the
requester’s private L1 cache [14].
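The classification performed at the home location on a read can be sketched as follows; the per-core directory record is
modeled as a simple struct, which is an assumption of this sketch rather than the actual hardware layout in [14].

    /* Sketch of the locality-aware read classification at the LLC home [14]. */
    #include <stdbool.h>

    #define RT 3  /* Replication Threshold */

    typedef struct {
        bool     replica_mode;  /* is this core currently a replica sharer? */
        unsigned home_reuse;    /* hits at the home location by this core */
    } CoreLocality;

    /* Returns true if a replica should be created in the requester's LLC
     * slice (the line is always inserted in the private L1). */
    bool on_home_read_hit(CoreLocality *c) {
        if (c->replica_mode) return true;  /* already a replica sharer */
        if (++c->home_reuse >= RT) {       /* enough home reuse: promote */
            c->replica_mode = true;
            return true;
        }
        return false;
    }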
On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a
replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted at the private L1 cache. In addition, the
Replica Reuse counter is incremented. If a replica is not found or exists in the Shared (S) state, the request is forwarded to
the LLC home location. The directory invalidates all the LLC replicas and L1 cache copies of the cache line, thereby
maintaining the single-writer multiple-reader invariant. On an invalidation request, both the LLC slice and L1 cache on a
core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC
home location. In addition, if a valid LLC replica exists, the replica reuse counter is communicated back with the
acknowledgement. The locality classifier uses this information along with the home reuse counter to determine whether
the core stays a replica sharer. If the (replica + home) reuse is greater than or equal to RT, the core maintains replica
status; otherwise, it is demoted to non-replica status. When an L1 cache line is evicted, the LLC replica location is probed for the
same address. If a replica is found, the dirty data in the L1 cache line is merged with it, else an acknowledgement is sent
to the LLC home location. However, when an LLC replica is evicted, the L1 cache is probed for the same address and
invalidated. An acknowledgement message containing the replica reuse counter is sent back to the LLC home location. If
the replica reuse is greater than or equal to RT, the core maintains replica status; otherwise, it is demoted to non-replica status.
After all acknowledgements are processed, the Home Reuse counters of all non-replica sharers other than the writer are
reset to ‘0’. This has to be done since these sharers have not shown enough reuse to be “promoted”. If the writer is a non-
replica sharer, its home reuse counter is modified as follows. If the writer is the only sharer (replica or non-replica), its
home reuse counter is incremented, else it is reset to ‘1’. This enables the replication of migratory shared data at the
writer, while avoiding it if the replica is likely to be downgraded due to conflicting requests by other cores.
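The reclassification applied when a replica is torn down (invalidation or LLC-replica eviction) can be sketched as below,
reusing the CoreLocality record and RT from the previous sketch; only the keep-or-demote rule from the text is modeled.

    /* Sketch of replica-status reclassification [14]: a core keeps replica
     * status only if its combined (replica + home) reuse reaches RT. */
    void on_replica_teardown(CoreLocality *c, unsigned replica_reuse) {
        if (replica_reuse + c->home_reuse >= RT)
            c->replica_mode = true;    /* stays a replica sharer */
        else
            c->replica_mode = false;   /* demoted to non-replica status */
    }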
4. Results
APR improves performance by 12% on average for the splash-2 benchmarks over the Baseline (shared cache design),
and by 24% for the parsec benchmarks over the Baseline. VR displays similar performance on splash-2 and parsec: 5%
over the Baseline for splash-2, and 4% for parsec. ASR performs similarly to VR on splash-2, but is better than VR on
parsec (15% over the Baseline). R-NUCA obtains 2% and 8% performance gains for the splash-2 and parsec benchmarks
respectively; this is because instructions exhibit strong locality and occupy little capacity in these benchmarks. APR
demonstrates stable performance improvement on both splash-2 and parsec. Overall, APR improves performance by
21% on average over the baseline, by 17% over VR, by 10% over ASR, and by 15% over R-NUCA. Replication schemes
increase the L2 cache miss rate. Figures 4 and 5 show the normalized L2 cache miss ratio of the evaluated replication
schemes for the splash-2 and parsec benchmarks respectively. APR improves the L2 miss ratio by as much as 49% for
splash-2, and by 38% for parsec. Compared to VR and ASR, APR shows a lower L2 miss ratio. This comes from its
replication filtering policy and replica insertion policy: the probability replication filtering policy reduces contention for
L2 cache capacity, and the probability insertion policy reduces the residency time of replicas. Both policies tend to
reduce the impact of extra replicas on the limited L2 capacity.
Figure 2: Normalized Execution time for Splash-2 benchmarks
Figure 3: Normalized Execution time for Parsec Benchmark
Figure 4: Normalized Miss ratio for Splash-2 benchmarks
Figure 5: Normalized Miss ratio for Parsec benchmarks
DRR achieves lower read latency than the other techniques. Figure 6 shows the normalized L2 cache average read
latency. VR, ASR, and R-NUCA do not reduce read latency relative to the Baseline, while DRR reduces read latency by
12%. These results show that DRR takes full advantage of the benefits of replicas through its network address mapping
mechanism; in VR and ASR, unnecessary extra search latency offsets the benefits of replicas. R-NUCA's instruction
replication has limited benefit for the splash-2 and parsec benchmarks. Figure 7 shows the normalized execution time.
As can be seen, DRR improves the total execution time of almost all benchmarks compared to the baseline system, VR,
ASR, and R-NUCA. The maximum performance gain occurs on the dedup benchmark, with about a 69% performance
improvement. The average performance improvement is about 30% over the baseline system, about 16% over VR,
about 8% over ASR, and about 25% over R-NUCA. While the performance improvements vary across benchmarks,
DRR shows better performance in almost all cases, indicating the good adaptivity of the reusability-based replication
scheme. The measured L2 cache miss rate is shown in Figure 8: compared to the baseline system, VR increases the L2
cache miss rate by about 162%, ASR by about 91%, R-NUCA by about 67%, and DRR by about 48%.
Figure 6: Normalized average read latency
Figure 7: Normalized Execution time
Figure 8: Normalized L2 miss ratio
The locality-aware protocol provides better energy consumption and performance than the other LLC data management
schemes. It is important to balance on-chip data locality against off-chip miss rate, and overall an RT of 3 achieves the
best trade-off. It is also important to replicate all types of data; the selective replication of only certain types of data by
R-NUCA (instructions) and ASR (instructions, shared read-only data) leads to sub-optimal energy and performance.
Overall, the locality-aware protocol has 16%, 14%, 13% and 21% lower energy and 4%, 9%, 6% and 13% lower
completion time compared to VR, ASR, R-NUCA and S-NUCA respectively.
Figure 9: Normalized Energy
Figure 10: Normalized Completion Time
Figure 11: Normalized L1 Cache Miss
5. Conclusions
For applications whose working sets fit within the LLC even if replication is done on every L1 cache miss, the
locality-aware scheme performs well in both energy and performance. VR is good for applications with high access
rates to shared read-write data, but has higher L2 cache energy than the other schemes. Applications with more accesses
to instructions and shared read-only data benefit from ASR; the locality-aware protocol, APR, and DRR also perform
about the same in such cases. Replication of migratory shared data requires creation of a replica in an Exclusive
coherence state. The locality-aware protocol makes LLC replicas for such data when sufficient reuse is detected and
hence performs well; APR and DRR also perform comparatively well for this kind of data. Applying the probability
replication and probability insertion policies to VR and ASR has been shown to yield better performance than the
individual schemes. From the above analysis it can be understood that the choice depends on the kind of data the
application uses the most, and hence the replication policy has to be chosen based on that kind of data. However, of the
techniques discussed, the newly proposed APR, DRR, and Locality-Aware replication schemes perform better in most
cases than the existing ASR and VR. Note that only dynamic replication schemes are discussed above; static replication
schemes have their own benefits in certain types of applications, as can be seen from the improved performance of
R-NUCA in certain cases. A replication scheme should therefore be selected in such a way that it satisfies most of the
needs of the CMP, i.e., as many applications as possible.
6. References
[1] G. Kurian, O. Khan, and S. Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the
40th Annual International Symposium on Computer Architecture (ISCA '13), pages 523-534, New York, NY, USA, 2013.
ACM.
[2] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar. Near-threshold voltage (NTV) design:
opportunities and challenges. In Design Automation Conference, 2012.
[3] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache hierarchy and memory subsystem of
the AMD Opteron processor. IEEE Micro, 30(2), March 2010.
[4] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem). White paper, 2008.
[5] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip
caches. In International Conference on Architectural Support for Programming Languages and Operating Systems,
2002.
[6] M. Zhang and K. Asanovic. Victim replication: maximizing capacity while hiding wire delay in tiled chip
multiprocessors. In Proceedings of the International Symposium on Computer Architecture, pages 336-345, 2005.
[7] B. M. Beckmann, M. R. Marty, and D. A. Wood. ASR: adaptive selective replication for CMP caches. In Proceedings
of the Annual International Symposium on Microarchitecture (MICRO), pages 443-454, 2006.
[8] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: near-optimal block placement and
replication in distributed caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture,
pages 184-195, 2009.
[9] J. Chang and G. S. Sohi. Cooperative caching for chip multiprocessors. In Proceedings of the 33rd International
Symposium on Computer Architecture, pages 264-275, 2006.
[10] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the
37th Annual International Symposium on Microarchitecture (MICRO-37), pages 319-330, 2004.
[11] M. Kandemir, F. Li, M. J. Irwin, and S. W. Son. A novel migration-based NUCA design for chip multiprocessors.
In International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2008), 2008.
[12] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue. High performance cache block replication using
re-reference probability in CMPs. In 18th International Conference on High Performance Computing (HiPC), 2011.
[13] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue. Dynamic reusability-based replication with network
address mapping in CMPs. In 18th International Conference on High Performance Computing (HiPC), 2011.
[14] G. Kurian, S. Devadas, and O. Khan. Locality-aware data replication in the last-level cache. In 20th International
Symposium on High Performance Computer Architecture (HPCA), pages 1-12, IEEE, 2014.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 

Several proposals have explored hybrid LLC designs, and replication mechanisms have been proposed to balance access latency against cache capacity in hybrid L2 cache designs [6][7]. Two types of replication approaches have been proposed: static [8, 9] and dynamic [10, 11, 12, 13, 14].
In static replication, a data block is placed through predefined address interleaving, so the LLC banks that may contain it are fixed; the placement of instruction pages in R-NUCA [8] and of data in S-NUCA [9] is static. In dynamic replication, a data block can be placed in any LLC bank. Victim Replication [10], Adaptive Selective Replication [11], Adaptive Probability Replication [12], Dynamic Reusability based Replication [13], and Locality Aware Data Replication at the Last Level Cache [14] fall into this category. Each of these mechanisms has its own advantages and disadvantages; this paper analyzes these dynamic replication schemes.

2. Background

Chronologically, the first of the dynamic replication mechanisms listed above is Victim Replication (VR) [10]. VR is based on shared caches, but it tries to capture evictions from the local primary cache in the local L2 slice to reduce subsequent access latency to the same cache block; victim replicas and global L2 cache blocks share the L2 slice capacity. In VR, all primary cache misses must first check the local L2 tags in case there is a valid local replica. On a replica miss, the request is forwarded to the home tile; on a replica hit, the replica is invalidated in the local L2 slice and moved into the primary cache [10].
The next technique, Adaptive Selective Replication (ASR) [11], adopts a replication mechanism similar to VR but focuses on the capacity contention between replicas and global L2 cache blocks. ASR dynamically estimates the cost (extra misses) and benefit (lower hit latency) of replication and adjusts the number of receivable victims to avoid hurting L2 cache performance [11]. Adaptive Probability Replication (APR) [12] counts each block's accesses in the L2 cache slices and monitors the number of evicted blocks with different access counts to estimate, at runtime, the re-reference probability of blocks over their lifetime. Using the predicted re-reference probability, APR adopts a probability replication policy and a probability insertion policy to replicate blocks with the corresponding probabilities and insert them at appropriate positions [12]. In the same conference, Dynamic Reusability-based Replication (DRR) [13] was introduced: a hybrid cache architecture that dynamically monitors the reuse pattern of cache blocks and replicates blocks with high reusability to appropriate L2 cache slices [13]. Replicas are shared by nearby cores through a fast lookup mechanism, Network Address Mapping, which records the location of the nearest replica in the network interfaces and forwards subsequent L1 miss requests to the replica immediately. DRR thus improves the performance of shared caches through reusability-based replication, fast lookup, and replica sharing. The most recent technique is the locality-aware selective data replication protocol for the last-level cache (LLC) [14]. It lowers memory access latency and energy by replicating only high-locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. This approach relies on a low-overhead yet highly accurate in-hardware runtime classification of data locality at cache-line granularity, and only allows replication for cache lines with high reuse [14]. A classifier captures the LLC pressure at the existing replica locations and adapts the replication decision accordingly, and the locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. The following sections discuss the schemes in detail.

3. Schemes

3.1. Victim Replication (VR)

Victim Replication (VR) is a hybrid scheme that combines the large capacity of a shared L2 cache with the low hit latency of a private L2 cache. VR is primarily based on a shared L2 cache, but in addition tries to capture evictions from the local primary cache in the local L2 slice. Each retained victim is a local L2 replica of a line that already exists in the L2 of the remote home tile. On a miss in the shared L2 cache, a line is brought in from memory and placed in the on-chip L2 at a home tile determined by a subset of the physical address bits, as in a conventional shared L2 cache, and the requested line is forwarded directly to the primary cache of the requesting processor. If the line's residency in the primary cache ends because of an incoming invalidation or writeback request, the usual shared L2 cache protocol is followed.
If a primary cache line is evicted because of a conflict or capacity miss, a copy of the victim line is kept in the local slice to reduce subsequent access latency to the same line. A global line with remote sharers is never evicted in favor of a local replica, as an actively cached global line is likely to be in use. The VR replication policy replaces the following classes of cache lines in the target set, in descending priority order: (1) an invalid line; (2) a global line with no sharers; (3) an existing replica. If no line belongs to these three categories, no replica is made and the victim is evicted from the tile as in a shared L2 cache [10]; if more than one line falls in the selected category, VR picks one at random. All primary cache misses first check the local L2 tags in case there is a valid local replica. On a replica miss, the request is forwarded to the home tile; on a replica hit, the replica is invalidated in the local L2 slice and moved into the primary cache. When a downgrade or invalidation request is received from the home tile, the L2 tags are checked in addition to the primary cache tags [10].
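To make the placement priority concrete, the following minimal Python sketch selects the line a victim replica may overwrite. The state names and dictionary shape are illustrative assumptions, not code from [10].

import random

# Illustrative line states for one L2 set; names are ours, not from [10].
INVALID, GLOBAL, REPLICA = "invalid", "global", "replica"

def choose_replica_victim(target_set):
    """Pick the line a local victim replica may overwrite, or None if the
    victim should simply be evicted as in a plain shared L2 cache."""
    # Priority 1: an invalid line.
    candidates = [line for line in target_set if line["state"] == INVALID]
    # Priority 2: a global line with no sharers.
    if not candidates:
        candidates = [line for line in target_set
                      if line["state"] == GLOBAL and line["sharers"] == 0]
    # Priority 3: an existing replica.
    if not candidates:
        candidates = [line for line in target_set if line["state"] == REPLICA]
    # VR breaks ties within a category at random.
    return random.choice(candidates) if candidates else None

For example, given a set containing a global line with two sharers and an existing replica, the function returns the replica, since an actively shared global line is never displaced; if it returns None, the victim leaves the tile exactly as in the plain shared L2 protocol.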
3.2. Adaptive Selective Replication (ASR)

ASR obtains the optimal replication level by balancing the benefits of replication against its costs. L2 cache block replication improves memory system performance when it reduces the average L1 miss latency. The following equation describes the average cycles spent on L1 cache misses, normalized by instructions executed:

\[
\frac{\text{L1 miss cycles}}{\text{instruction}}
  = \frac{P_{localL2} \cdot L_{localL2}}{\text{instructions}/\text{L1 miss}}
  + \frac{P_{remoteL2} \cdot L_{remoteL2}}{\text{instructions}/\text{L1 miss}}
  + \frac{P_{miss} \cdot L_{miss}}{\text{instructions}/\text{L1 miss}}
\]

where \(P_x\) is the probability of a memory request being satisfied by entity \(x\) (the local L2 cache, the remote L2 caches, or main memory) and \(L_x\) is the latency of that entity [11]. The localL2 and remoteL2 terms together represent the memory cycles spent on L2 cache hits, and the third term represents the memory cycles spent on L2 cache misses. Replication increases the probability that L1 misses hit in the local L2 cache: the \(P_{localL2}\) term grows and the \(P_{remoteL2}\) term shrinks. Because a local L2 hit is tens of cycles faster than a remote L2 hit, the net effect of increasing replication is a reduction in cycles spent on L2 cache hits. However, more replication devotes more capacity to replica blocks, so fewer unique blocks exist on chip, increasing the miss probability \(P_{miss}\). If the miss probability rises significantly, the miss term dominates, since memory latency is hundreds of cycles greater than the L2 hit latencies. Balancing these three terms is therefore necessary to improve memory system performance, and optimal performance often arises at an intermediate replication level. Figure 1 depicts this tradeoff. The replication benefit curve, Figure 1(a), shows that increasing replication reduces L2 cache hit cycles: due to the strong locality of shared read-only requests, a small degree of replication can significantly reduce hit cycles by turning many previously remote L2 hits into local ones, while further replication reduces hit cycles only gradually because fewer unique blocks on chip lead to fewer total L2 hits. The replication cost curve, Figure 1(b), shows that increasing L2 replication increases the memory cycles spent on off-chip misses. The replication effectiveness curve, Figure 1(c), combines the benefit and cost curves and plots total memory cycles; because the benefit and cost curves are generally convex with opposite slopes, the minimum of the effectiveness curve often lies between allowing all replications and allowing none. ASR estimates the slopes of the benefit and cost curves to approximate the optimal replication level.

Figure 1 [11]: (a) replication benefit, (b) replication cost, and (c) replication effectiveness.

By dynamically monitoring the benefit and cost of replication, ASR attempts to achieve the optimal level of replication. ASR identifies discrete replication levels and makes a piecewise approximation of the memory cycle slope [11], reducing the analysis to a local decision of whether the amount of replication should increase, decrease, or remain the same. Figure 1 illustrates the case where the current replication level, labeled C, results in HC hit cycles per instruction and MC miss cycles per instruction. ASR considers three alternatives: (i) increasing replication to the next higher level, labeled H; (ii) decreasing replication to the next lower level, labeled L; or (iii) leaving the replication level unchanged [11]. To make this decision, ASR needs not only HC and MC but also four additional hit and miss cycles-per-instruction values: HH and MH for the next higher level, and HL and ML for the next lower level.
To simplify the collection process, ASR estimates only the four differences between these hit and miss cycles-per-instruction values: (1) the benefit of increasing replication (the decrease in L2 hit cycles, HC - HH); (2) the cost of increasing replication (the increase in L2 miss cycles, MH - MC); (3) the benefit of decreasing replication (the decrease in L2 miss cycles, MC - ML); and (4) the cost of decreasing replication (the increase in L2 hit cycles, HL - HC). By comparing these cost and benefit counters, ASR increases, decreases, or leaves unchanged the replication level.
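A minimal sketch of this decision, assuming the four deltas have already been gathered by ASR's counters, follows; the function and argument names are ours, not from [11].

# A hedged sketch of ASR's replication-level controller [11]. Each
# argument is a cycles-per-instruction delta from the cost/benefit counters.
def adjust_replication_level(level, max_level,
                             benefit_up, cost_up, benefit_down, cost_down):
    # benefit_up   = HC - HH : hit cycles saved by more replication
    # cost_up      = MH - MC : miss cycles added by more replication
    # benefit_down = MC - ML : miss cycles saved by less replication
    # cost_down    = HL - HC : hit cycles added by less replication
    if benefit_up > cost_up and level < max_level:
        return level + 1   # moving to level H is a net win
    if benefit_down > cost_down and level > 0:
        return level - 1   # moving to level L is a net win
    return level           # stay at level C

Because only differences are compared, the controller never needs absolute hit or miss cycle counts, which is what makes the hardware counters cheap.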
3.3. Adaptive Probability Replication (APR)

APR is built on a distributed shared L2 cache design. To predict re-reference probability, APR adds a counter to each cache block to record and transfer its number of accesses. Each tile stores the re-reference probabilities of blocks from the other remote L2 cache slices in its network interface component, using a simple lookup table called the Re-Reference Probability Buffer (RRPB) [12]. The RRPB keeps one re-reference probability entry per remote L2 slice; each entry holds replication thresholds for different access counts, which indicate the re-reference probability of blocks with that many accesses. In the local L2 slice, if there is an invalid block or the victim is not a shared global block, the replica is filled into the L2 cache slice; otherwise, the replication is abandoned. The insert position of a replica is determined by its re-reference probability, and when a replica is accessed again it is deleted from the local L2 cache slice and moved into the local L1 cache. APR counts every access to L2 cache blocks and records the number of evicted blocks with different access counts to estimate re-reference probability at runtime: for example, the re-reference probability of a block with N accesses is the proportion of evicted blocks with more than N accesses among those with at least N accesses. This estimation is performed only when a global block replacement occurs (not a replica replacement), and the re-reference probabilities are propagated to all other tiles at a fixed interval (for example, every 10000 cycles) by attaching them to any response message. Because blocks from remote L2 slices may be accessed in local L2 slices due to replication, each replica access also increments the counter associated with the block; when a replica is accessed, its counter moves with the block into the L1 cache, and counter values are sent back to the home L2 slice when the blocks are evicted, so that the home slice can accumulate the access counts. Like ASR, APR uses a linear feedback shift register to generate a pseudo-random number, which is compared against the corresponding replication threshold. When an evicted L1 block passes through the network interface, APR captures the message and looks up the corresponding RRPB entry by address; if the comparison selects the block for replication, it is sent to the local L2 slice, and otherwise it is evicted to the home L2 slice. In the local L2 cache slice, if there is an invalid block or the victim is not a shared global block, the block is inserted; otherwise, it is sent to the home L2 slice. Blocks with more accesses have higher re-reference probability. Probability insertion is implemented according to the access count of the replicated block, with the access count indicating the insert position: if a block's access count exceeds the associativity, it is inserted at the MRU position. The aim of probability insertion is to make blocks with lower re-reference probability survive for a shorter time.
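The following sketch restates APR's three decisions in Python: the runtime probability estimate, the LFSR-style replication test, and the insertion position. The dict-based data shapes and function names are assumptions; random.random() merely stands in for the hardware LFSR.

import random

def rereference_probability(evicted_by_count, n):
    """P(re-reference | n accesses): among evicted blocks with at least
    n accesses, the fraction that were accessed more than n times."""
    more = sum(c for a, c in evicted_by_count.items() if a > n)
    at_least = sum(c for a, c in evicted_by_count.items() if a >= n)
    return more / at_least if at_least else 0.0

def should_replicate(accesses, rrpb):
    """rrpb maps every access count 0..K to a replication threshold for
    one home slice. Replicate with probability equal to the threshold."""
    bucket = min(accesses, max(rrpb))   # clamp to the largest tracked count
    return random.random() < rrpb[bucket]

def insert_position(accesses, ways):
    """Probability insertion: blocks with at least `ways` accesses go to
    MRU (position 0); colder blocks land nearer LRU so they expire sooner."""
    return 0 if accesses >= ways else ways - 1 - accesses

Note that should_replicate() reads the comparison as "replicate with probability equal to the estimated re-reference probability", which is the direction consistent with hot blocks being replicated more often; the exact comparison wiring is an assumption on our part.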
3.4. Dynamic Reusability based Replication (DRR)

DRR dynamically replicates blocks with high reusability to appropriate L2 cache slices and allows the replicas to be shared by nearby cores via a fast lookup mechanism [13]. A set-associative Core Access Counter Buffer (CACB) determines which blocks should be replicated and the destination of each replication. For recently accessed blocks, the CACB records access counts for the cores at least a certain number of hops away from the home slice (for example, 2 hops, which is also the smallest distance between the home slice and replica slices); only 10 counters per CACB entry are therefore needed in a 16-core CMP. When a block receives a Read request from a core, the corresponding counter is incremented; to preserve coherence, a Write request resets all of the block's counters to zero. In the CACB, a larger counter means higher reusability: when a block's maximum counter reaches a certain threshold (for example, 5), the block is replicated to the slice corresponding to that counter. Once the replicating block and destination are determined, the home L2 cache slice sends a replication request to the destination, which allocates cache space to hold the replica. If the destination has space available for the replica, it responds with an acknowledgement to the home L2 cache slice; otherwise, it responds with a failure message. When the replication completes, the destination is stored in a set-associative Replication Directory Buffer (RDB) in the home L2 slice; if the replication fails, the destination is not stored. When a Read request reaches the home L2 cache slice, if the distance between the requesting core and the nearest replica is less than the given replica distance (for example, 3 hops in a 16-core CMP), the request is forwarded to the nearest replica; otherwise, it is satisfied at the home L2 cache slice.
When the replica receives the forwarded request, it responds with the data to the requesting core. As the data response message passes through the network interface of the requesting core, the replica's location is stored in a set-associative Network Address Mapping Buffer (NAMB). The NAMB is embedded in the network interface and records the locations of replicas that have serviced the core. When an L1 cache Read miss request passes through the network interface, it first searches the NAMB: on a hit, the request is forwarded immediately to the recorded replica location; on a miss, it continues on to the home L2 cache slice. For coherence, an L1 cache Write miss request does not search the NAMB and always travels to the home L2 cache slice, ensuring that write operations are serialized at the unique home slice. Because the NAMB is embedded in the network interface, its access latency can be hidden behind other network interface operations.
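A compact sketch of the CACB update and replication trigger is given below. The threshold and hop limit use the example values quoted above; the dict-based entry and the hop-distance callback are illustrative, not a faithful description of the hardware in [13].

REUSE_THRESHOLD = 5    # example replication threshold from the text
MIN_REPLICA_HOPS = 2   # replicas only serve cores >= 2 hops from home

def on_home_access(cacb_entry, core, is_write, hops_from_home):
    """cacb_entry maps core id -> per-core access counter for one block.
    Returns the core whose slice should receive a replica, or None."""
    if is_write:
        # Writes reset all counters so stale reuse never drives replication.
        for c in cacb_entry:
            cacb_entry[c] = 0
        return None
    if hops_from_home(core) >= MIN_REPLICA_HOPS:
        cacb_entry[core] = cacb_entry.get(core, 0) + 1
        best = max(cacb_entry, key=cacb_entry.get)
        if cacb_entry[best] >= REUSE_THRESHOLD:
            return best   # replicate toward the most frequent far reader
    return None

Resetting on writes is what keeps the mechanism coherent: only read-dominated blocks can ever accumulate enough reuse to be replicated.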
3.5. Locality Aware Data Replication at Last Level Cache

Run-length is defined as the number of accesses to a cache line (at the LLC) from a particular core before a conflicting access by another core or before the line is evicted. The greater the number of accesses with high run-length, the greater the benefit of replicating the cache line in the requester's LLC slice. Instructions and shared data (both read-only and read-write) can be replicated if they demonstrate good reuse, and the replication decision is adapted at runtime in case the reuse of data changes during an application's execution.

On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is inserted into the private L1 cache and a Replica Reuse counter at the LLC directory entry is incremented; this saturating counter captures reuse information, is initialized to '1' on replica creation, and is incremented on every replica hit. If no replica is found, the request is forwarded to the LLC home location; if the cache line is not present there either, it is brought in from off-chip memory or the underlying coherence protocol takes the necessary actions to obtain the most recent copy. A replication mode bit identifies whether a replica may be created for the particular core, and a home reuse counter tracks the number of times the core has accessed the cache line at the home location; this counter is initialized to '0' and incremented on every hit at the LLC home. If the replication mode bit is set, the cache line is inserted into the requester's LLC slice and private L1 cache. Otherwise, the home reuse counter is incremented: once it reaches the Replication Threshold (RT), the requesting core is "promoted" and the line is inserted into its LLC slice and private L1 cache; while it remains below RT, no replica is created and the line is inserted only into the requester's private L1 cache [14].

On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted into the private L1 cache and the Replica Reuse counter is incremented. If no replica is found, or it exists only in the Shared (S) state, the request is forwarded to the LLC home location, and the directory invalidates all LLC replicas and L1 cache copies of the line, maintaining the single-writer multiple-reader invariant.

On an invalidation request, both the LLC slice and the L1 cache on a core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC home location; if a valid LLC replica exists, its replica reuse counter is returned with the acknowledgement. The locality classifier uses this information together with the home reuse counter: if the combined (replica + home) reuse is at least RT, the core keeps its replica status, otherwise it is demoted to non-replica status.

When an L1 cache line is evicted, the LLC replica location is probed for the same address; if a replica is found, the dirty data in the L1 line is merged into it, otherwise an acknowledgement is sent to the LLC home location. When an LLC replica is evicted, the L1 cache is probed for the same address and invalidated, and an acknowledgement carrying the replica reuse counter is sent back to the LLC home; if the replica reuse is at least RT, the core keeps its replica status, else it is demoted. After all acknowledgements are processed, the home reuse counters of all non-replica sharers other than the writer are reset to '0', since these sharers have not shown enough reuse to be promoted. If the writer is a non-replica sharer, its home reuse counter is incremented when the writer is the only sharer (replica or non-replica) and reset to '1' otherwise. This enables the replication of migratory shared data at the writer while avoiding replication when the replica would likely be downgraded by conflicting requests from other cores.
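The promotion/demotion logic can be summarized in a small state sketch, given here under the assumption of one classifier instance per (cache line, core) pair; RT = 3 is the value the evaluation in [14] found best, and the class and method names are ours.

RT = 3  # replication threshold; 3 gave the best trade-off in [14]

class LocalityClassifier:
    def __init__(self):
        self.replica_mode = False   # replication mode bit
        self.home_reuse = 0         # hits by this core at the LLC home

    def allow_replica_on_home_hit(self):
        """Called on a read hit at the LLC home; True means the line may
        also be inserted into the requester's LLC slice."""
        if self.replica_mode:
            return True
        self.home_reuse += 1
        if self.home_reuse >= RT:
            self.replica_mode = True    # core is "promoted"
            return True
        return False                    # fill the private L1 only

    def on_replica_invalidation_or_eviction(self, replica_reuse):
        """Demote the core unless replica-plus-home reuse reached RT."""
        if replica_reuse + self.home_reuse < RT:
            self.replica_mode = False
        self.home_reuse = 0

Because the classifier tracks reuse rather than sharers, it stays decoupled from the directory's sharer-tracking structures, which is the source of the protocol's scalability.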
4. Results

APR improves performance by 12% on average over the baseline shared cache design for the SPLASH-2 benchmarks, and by 24% for the PARSEC benchmarks. VR performs similarly on both suites: 5% over the baseline for SPLASH-2 and 4% for PARSEC. ASR matches VR on SPLASH-2 but beats it on PARSEC (15% over the baseline). R-NUCA obtains 2% and 8% performance gains for SPLASH-2 and PARSEC respectively, because instructions exhibit strong locality and occupy little capacity in these benchmarks. APR demonstrates stable performance improvement across both suites: overall, it improves performance by 21% on average over the baseline, by 17% over VR, by 10% over ASR, and by 15% over R-NUCA. Replication schemes increase the miss rate of the L2 cache. Figures 4 and 5 show the normalized L2 cache miss ratios of the evaluated replication schemes for SPLASH-2 and PARSEC respectively. APR improves the L2 miss ratio by as much as 49% for SPLASH-2 and by 38% for PARSEC, and shows a lower L2 miss ratio than VR and ASR. This comes from its replication filtering policy and replica insertion policy: probability replication filtering reduces contention for L2 cache capacity, and probability insertion reduces the residency time of replicas, so both policies reduce the pressure that extra replicas place on the limited L2 capacity.

Figure 2: Normalized execution time for SPLASH-2 benchmarks.
Figure 3: Normalized execution time for PARSEC benchmarks.
Figure 4: Normalized L2 miss ratio for SPLASH-2 benchmarks.
Figure 5: Normalized L2 miss ratio for PARSEC benchmarks.

DRR achieves lower read latency than the other techniques. Figure 6 shows the normalized L2 cache average read latency: VR, ASR, and R-NUCA do not reduce read latency relative to the baseline, while DRR reduces it by 12%. These results show that DRR takes full advantage of its replicas through the network address mapping mechanism, whereas unnecessary extra search latency offsets the benefit of replicas in VR and ASR, and R-NUCA's instruction replication has limited benefit for the SPLASH-2 and PARSEC benchmarks. Figure 7 shows the normalized execution time. DRR improves the total execution time of almost all benchmarks compared to the baseline system, VR, ASR, and R-NUCA. The maximum performance gain occurs on the dedup benchmark, at about 69%; the average improvement is about 30% over the baseline, about 16% over VR, about 8% over ASR, and about 25% over R-NUCA. While the improvements vary across benchmarks, DRR performs better in almost all cases, indicating the good adaptivity of reusability-based replication. The measured L2 cache miss rates are shown in Figure 8: compared to the baseline system, VR increases the L2 miss rate by about 162%, ASR by about 91%, R-NUCA by about 67%, and DRR by about 48%.
Figure 6: Normalized average read latency.
Figure 7: Normalized execution time.
Figure 8: Normalized L2 miss ratio.
The locality-aware protocol provides better energy consumption and performance than the other LLC data management schemes. It is important to balance on-chip data locality against the off-chip miss rate, and overall an RT of 3 achieves the best trade-off. It is also important to replicate all types of data: the selective replication of only certain data types by R-NUCA (instructions) and ASR (instructions and shared read-only data) leads to sub-optimal energy and performance. Overall, the locality-aware protocol has 16%, 14%, 13%, and 21% lower energy and 4%, 9%, 6%, and 13% lower completion time compared to VR, ASR, R-NUCA, and S-NUCA respectively.

Figure 9: Normalized energy.
Figure 10: Normalized completion time.
Figure 11: Normalized L1 cache misses.

5. Conclusions

For applications whose working sets fit within the LLC even when a replica is made on every L1 cache miss, the locality-aware scheme performs well in both energy and performance. VR suits applications with frequent accesses to shared read-write data, but has higher L2 cache energy than the other schemes. Applications dominated by accesses to instructions and shared read-only data benefit from ASR; the locality-aware protocol, APR, and DRR perform almost as well in such cases. Replication of migratory shared data requires creating a replica in an exclusive coherence state; the locality-aware protocol makes LLC replicas for such data once sufficient reuse is detected and hence performs well, and APR and DRR also perform comparatively well on this kind of data. Applying APR's probability replication and probability insertion policies to VR and ASR has been shown to outperform the individual schemes. The above analysis shows that the best replication policy depends on the kind of data an application uses most, so the policy must be chosen accordingly. Still, among the techniques discussed, the newer APR, DRR, and locality-aware replication schemes perform better than the existing ASR and VR in most cases. Only dynamic replication schemes are discussed here; static replication schemes have their own benefits for certain application types, as seen in the improved performance of R-NUCA in some cases. A replication scheme should therefore be selected to satisfy as many of the CMP's needs, i.e., as many applications, as possible.

6. References

[1] G. Kurian, O. Khan, and S. Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13), pages 523-534, New York, NY, USA, 2013. ACM.
[2] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar. Near-threshold voltage (NTV) design: opportunities and challenges. In Design Automation Conference, 2012.
[3] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache hierarchy and memory subsystem of the AMD Opteron processor. IEEE Micro, 30(2), Mar. 2010.
[4] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem). White paper, 2008.
[5] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[6] M. Zhang and K. Asanovic. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proceedings of the International Symposium on Computer Architecture, pages 336-345, 2005.
[7] B. M. Beckmann, M. R. Marty, and D. A. Wood. ASR: Adaptive selective replication for CMP caches. In Proceedings of the Annual International Symposium on Microarchitecture (MICRO), pages 443-454, 2006.
[8] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 184-195, 2009.
[9] J. C. Chang and G. S. Sohi. Cooperative caching for chip multiprocessors. In Proceedings of the 33rd International Symposium on Computer Architecture, pages 264-275, 2006.
[10] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the 37th Annual International Symposium on Microarchitecture (MICRO-37), pages 319-330, 2004.
[11] M. Kandemir, F. Li, M. J. Irwin, and S. W. Son. A novel migration-based NUCA design for chip multiprocessors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2008.
[12] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue. High performance cache block replication using re-reference probability in CMPs. In Proceedings of the 18th International Conference on High Performance Computing (HiPC), 2011.
[13] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue. Dynamic reusability-based replication with network address mapping in CMPs. In Proceedings of the 18th International Conference on High Performance Computing (HiPC), 2011.
[14] G. Kurian, S. Devadas, and O. Khan. Locality-aware data replication in the last-level cache. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), pages 1-12. IEEE, New York, 2014.