This document summarizes several dynamic cache replication mechanisms: Victim Replication replicates cache lines evicted from the local cache to reduce access latency. Adaptive Selective Replication dynamically adjusts replication based on estimated costs and benefits. Adaptive Probability Replication replicates blocks based on predicted reuse probabilities. Dynamic Reusability-based Replication replicates blocks with high reuse. Locality-Aware Data Replication only replicates high-locality blocks to reduce misses while maintaining low replication overhead. The document provides details on these schemes and compares their approaches to dynamic cache block replication.
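Victim Replication, for example, keeps a copy of each line evicted from the small local cache in a local slice of the shared cache, so a later access can hit locally instead of going to a remote slice. A minimal sketch of that behaviour (the class name, LRU bookkeeping, and capacities are illustrative assumptions, not any published design):

```python
from collections import OrderedDict

class VictimReplicatingCache:
    """Toy two-level cache: lines evicted from the local level are
    replicated into a local victim slice (victim replication)."""

    def __init__(self, l1_capacity, victim_capacity):
        self.l1 = OrderedDict()       # small local cache, LRU order
        self.victims = OrderedDict()  # local replica slice for evicted lines
        self.l1_capacity = l1_capacity
        self.victim_capacity = victim_capacity

    def access(self, addr):
        """Return 'l1', 'victim', or 'remote' for where the line was found."""
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "l1"
        hit = "victim" if addr in self.victims else "remote"
        self.victims.pop(addr, None)      # promoted back into L1
        self.l1[addr] = True
        if len(self.l1) > self.l1_capacity:
            evicted, _ = self.l1.popitem(last=False)
            self.victims[evicted] = True  # replicate the victim locally
            if len(self.victims) > self.victim_capacity:
                self.victims.popitem(last=False)
        return hit
```

After filling a two-entry L1 with three lines, re-accessing the first line hits the local victim slice rather than requiring a remote fetch.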
A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks... — Narendra Singh Yadav
Clustering is an important research area for mobile ad hoc networks (MANETs) because it increases network capacity, reduces routing overhead, and makes the network more scalable under high mobility and large numbers of mobile nodes. In a clustered network, the clusterhead manages and stores recent routing information. However, frequent clusterhead changes cause the stored routing information to be lost, alter the routes between nodes, degrade the performance of the routing protocol, and destabilize the cluster structure; in addition, electing a new clusterhead incurs communication overhead in the form of exchanged messages. The goal, then, is to change clusterheads as rarely as possible, keeping the cluster structure stable and preventing loss of routing information, which in turn improves the performance of clustering-based routing protocols. This can be achieved by an efficient cluster maintenance scheme. In this work, a novel clustering algorithm, the Incremental Maintenance Clustering Scheme (IMS), is proposed for mobile ad hoc networks. Its goals are a low number of clusterhead and cluster-member changes, stable clusters, and minimal clustering overhead. Through simulations, the performance of IMS is compared with that of Least Cluster Change (LCC) and the maintenance scheme of the Cluster Based Routing Protocol (CBRP) in terms of the number of clusterhead changes, the number of cluster-member changes, and clustering overhead under varying mobility and speed. The simulation results demonstrate the superiority of IMS over both LCC and the CBRP maintenance scheme.
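The abstract names Least Cluster Change (LCC) as a baseline. LCC's maintenance rule — re-cluster only when two clusterheads come into contact or a node loses coverage — can be sketched as follows; the lowest-ID tie-break and the data structures are illustrative assumptions, and IMS itself is not shown here:

```python
def lcc_maintain(heads, neighbors):
    """One maintenance step of the Least Cluster Change (LCC) rule:
    a clusterhead is demoted only when it hears another clusterhead
    (the lower ID survives), and a node with no head in range elects
    itself. Everything else is left untouched, minimizing head changes.

    heads: set of node ids currently acting as clusterheads
    neighbors: dict mapping node id -> set of node ids in radio range
    """
    heads = set(heads)
    # Rule 1: two heads in contact -> the higher id steps down.
    for h in sorted(heads):
        if h in heads and any(n in heads and n < h for n in neighbors[h]):
            heads.discard(h)
    # Rule 2: a node with no clusterhead in range becomes one.
    for node in sorted(neighbors):
        covered = node in heads or any(n in heads for n in neighbors[node])
        if not covered:
            heads.add(node)
    return heads
```

Because demotions and elections are triggered only by these two events, clusterhead changes (and the re-election message overhead the abstract describes) stay infrequent as the topology drifts.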
Architecture and implementation issues of multi core processors and caching... — eSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of engineering and technology.
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util... — VLSICS Design
Shrinking process technology, reduced scale, and power-hungry chip I/O have led to the System on Chip (SoC). SoC designs that use a traditional standard bus scheme encounter issues such as non-uniform delay and routing problems. Crossbars scale better than buses but become huge as the number of nodes grows. The Network on Chip (NoC) has become the design paradigm for SoC design thanks to its highly regular interconnect structure, good scalability, and linear design effort. The main components of an NoC topology are the network adapters, routing nodes, and network interconnect links. This paper deals with replacing D flip-flop (DFF) based register arrays with full-custom SRAM-based arrays in the input module of a routing node in a 2D mesh NoC topology, to optimize the area and power of the input block. Full-custom design of the SRAMs was carried out with MILKYWAY, while physical implementation of the input module with SRAMs was carried out with Synopsys IC Compiler. The improved design occupies approximately 30% of the area of the original design, consistent with the ratio of the area of an SRAM cell to that of a D flip-flop, which is approximately 6:28. Power consumption is almost halved, to 1.5 mW, and the maximum operating frequency improves from 50 MHz to 200 MHz. The paper also studies and quantifies the behavior of the single packet-array design relative to the multiple packet-array design. Intuitively, a common packet buffer should utilize the available buffer space better, which in turn translates into lower transmission delays. A MATLAB model is used to show quantitatively how performance improves in a common packet-array design.
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN... — IJCNCJournal
Load unbalancing is a multi-variant, multi-constraint problem that degrades the performance and efficiency of computing resources. Load balancing techniques address its two undesirable situations: overloading and underloading. Cloud computing relies on scheduling and load balancing for a virtualized environment with resource sharing across the cloud infrastructure, and both must be handled well to achieve optimal resource sharing. Hence, efficient resource reservation is required to guarantee load optimization in the cloud. This work presents an integrated resource reservation and load balancing algorithm for efficient cloud provisioning. The approach builds a priority-based resource scheduling model to obtain resource reservation with threshold-based load balancing, improving the efficiency of the cloud framework. Better utilization of virtual machines through appropriate workload adjustment is achieved by dynamically picking a job from the submitted jobs using the priority-based resource scheduling model. Experimental evaluations show that the proposed scheme reduces execution time, lowers resource cost, and improves resource utilization under dynamic resource provisioning conditions.
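The combination of priority ordering and a per-host load threshold can be sketched as below. This is a toy model under stated assumptions (the function name, job representation, and deferral policy are hypothetical, not the paper's algorithm):

```python
import heapq

def place_jobs(jobs, hosts, threshold):
    """Threshold-based placement sketch: jobs are taken in priority order
    (highest first) and each is placed on the least-loaded host whose load
    would stay at or below the threshold; otherwise the job waits.

    jobs: list of (priority, load) tuples
    hosts: number of hosts
    threshold: maximum allowed load per host
    Returns (placements, waiting): job index -> host, and deferred jobs.
    """
    loads = [(0.0, h) for h in range(hosts)]   # (current load, host id) min-heap
    heapq.heapify(loads)
    order = sorted(range(len(jobs)), key=lambda i: -jobs[i][0])
    placements, waiting = {}, []
    for i in order:
        load, host = heapq.heappop(loads)
        if load + jobs[i][1] <= threshold:
            placements[i] = host
            load += jobs[i][1]
        else:
            waiting.append(i)                  # would overload even the idlest host
        heapq.heappush(loads, (load, host))
    return placements, waiting
```

The min-heap always yields the least-loaded host, so placement balances load, while the threshold check prevents the overloading case the abstract describes.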
In a Mobile Ad hoc Network (MANET), node mobility, limited battery power, and weak node capabilities cause frequent network partitioning and node disconnections. To improve data availability, database systems create multiple copies of each data object and allocate them to different nodes. This paper proposes the Automated Re-allocator of Replicas Over MANET (ARROM), which addresses these issues. ARROM reduces the average response time of requests between clients and database servers by frequently reallocating replicas; it also increases the average throughput in the network. Our performance study indicates that ARROM improves average response time and average network throughput in a MANET compared to a recent existing scheme.
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities — idescitation
This paper proposes a priority-flow-oriented design of the Ksensor architecture. Ksensor is a multiprocessor traffic capture and analysis system for high-speed networks implemented in kernel space. While the current architecture permits the capture and analysis of data flows, there are several scenarios where it does not perform adequately, for example when a certain type of traffic is more valuable than others. This work therefore pursues a design that allows Ksensor to provide broader data-flow treatment. The improvement will allow the new architecture to provide more reliable data-flow capture and processing.
In a simultaneous multithreaded system, a core’s pipeline resources are sometimes partitioned and otherwise shared amongst numerous active threads. One mutual resource is the write buffer, which acts as an intermediary between a store instruction’s retirement from the pipeline and the store value being written to cache. The write buffer takes a completed store instruction from the load/store queue and eventually writes the value to the level-one data cache. Once a store is buffered with a write-allocate cache policy, the store must remain in the write buffer until its cache block is in level-one data cache. This latency may vary from as little as a single clock cycle (in the case of a level-one cache hit) to several hundred clock cycles (in the case of a cache miss). This paper shows that cache misses routinely dominate the write buffer’s resources and deny cache hits from being written to memory, thereby degrading performance of simultaneous multithreaded systems. This paper proposes a technique to reduce denial of resources to cache hits by limiting the number of cache misses that may concurrently reside in the write buffer and shows that system performance can be improved by using this technique.
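The paper's exact mechanism is not given in this abstract, but the idea of capping the number of misses allowed in the write buffer can be illustrated with a toy admission model (the function name, draining assumptions, and return values are hypothetical):

```python
def admit_stores(stores, buffer_size, miss_cap):
    """Sketch of miss-capping: a store that missed in L1 may enter the
    write buffer only while fewer than miss_cap misses are already
    buffered, so long-latency misses cannot monopolize entries that
    cache hits need.

    stores: iterable of booleans (True = L1 hit, False = L1 miss)
    Returns a list of decisions: 'buffered' or 'stalled'.
    """
    buffered_misses = 0
    decisions = []
    for is_hit in stores:
        if is_hit:
            # A hit drains in a cycle or two; it stalls only if misses
            # have filled the whole buffer.
            decisions.append("buffered" if buffered_misses < buffer_size
                             else "stalled")
        elif buffered_misses < miss_cap:
            decisions.append("buffered")
            buffered_misses += 1    # occupies its entry for many cycles
        else:
            decisions.append("stalled")  # miss waits; entries stay free for hits
    return decisions
```

With a cap of two misses in a four-entry buffer, later hits still get entries; with the cap equal to the buffer size (no cap), four outstanding misses deny the next hit, which is the pathology the paper reports.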
Over time, machine learning inference workloads have become more demanding in terms of latency and throughput, with multiple models deployed in the same system. This scenario leaves large room for runtime and memory optimizations, which current systems fail to exploit because they treat ML models and tasks as black boxes.
In contrast, Pretzel adopts a white-box description of ML models, which allows the framework to perform optimizations across deployed models and running tasks, saving memory and increasing overall system performance. In this talk we will present the motivation behind Pretzel, its current design, and possible future developments.
Distributed shared memory
General architecture
Design and implementation issues of DSM
Granularity
Factors influencing block size selection
Consistency model
Replacement strategy
Which block to replace
Where to place a replaced block
Thrashing
Heterogeneous DSM
Issues
Deadlock
Truly dependable software systems should be built with structuring techniques able to decompose the software complexity without
hiding important hypotheses and assumptions such as those regarding
their target execution environment and the expected fault- and system
models. A judicious assessment of what can be made transparent and
what should be translucent is necessary. This paper discusses a practical
example of a structuring technique built with these principles in mind:
Reflective and refractive variables. We show that our technique offers
an acceptable degree of separation of the design concerns, with limited
code intrusion; at the same time, by construction, it separates but does
not hide the complexity required for managing fault-tolerance. In particular, our technique offers access to collected system-wide information
and the knowledge extracted from that information. This can be used
to devise architectures that minimize the hazard of a mismatch between
dependable software and the target execution environments.
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS — cscpconf
The main aim of our research is to find the limit of Amdahl's Law for multicore processors, i.e., the number of cores beyond which adding cores no longer improves the efficiency of the overall CMP (Chip Multiprocessor, a.k.a. multicore processor) architecture. This limit is expected to lie either in the architecture of the multicore processor or in the programming. We surveyed the multicore processor architectures of various chip manufacturers, namely INTEL™, AMD™, and IBM™, and the various techniques they use to improve multicore performance. We conducted cluster experiments to find this limit, and in this paper we propose an alternate multicore processor design based on the results of our cluster experiment.
C&ESS presentation: performance review 2016 — Baig Ali
Performance review of the Civil Engineering and Support Services ("C&ESS") Department of OGDCL Pakistan
by
Engr. Baig Ali
Chief Engineer (Civil), Contracts & JVs.
Compositional Analysis for the Multi-Resource Server — Ericsson
The Multi-Resource Server (MRS) technique has been proposed to enable predictable execution of memory intensive real-time applications on COTS multi-core platforms.
A NOVEL CACHE RESOLUTION TECHNIQUE FOR COOPERATIVE CACHING IN WIRELESS MOBILE... — cscpconf
Cooperative caching is used in mobile ad hoc networks to reduce the latency perceived by mobile clients while retrieving data and to reduce the traffic load in the network. Caching also increases data availability in the face of server disconnections. Implementing a cooperative caching technique involves four major design considerations: (i) cache placement and resolution, which decides where to place and how to locate the cached data; (ii) cache admission control, which decides which data to cache; (iii) cache replacement, which makes the replacement decision when the cache is full; and (iv) consistency maintenance, i.e., maintaining consistency between the data at the server and in the cache. In this paper we propose an effective cache resolution technique that reduces the number of messages flooded into the network to find the requested data. The experimental results are promising with respect to the metrics studied.
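The resolution step the abstract describes — locate a cached copy nearby before contacting the server — can be sketched as a simple lookup ladder. This is an illustrative model, not the paper's protocol; the function name, the per-neighbour unicast, and the message counting are assumptions:

```python
def resolve(item, local_cache, neighbor_caches):
    """Cache resolution sketch for cooperative caching: check the local
    cache first, then ask one-hop neighbours, and only fall back to the
    (distant) server if nobody holds the item.

    Returns (source, messages), where messages counts request
    transmissions sent into the network.
    """
    if item in local_cache:
        return "local", 0
    messages = 0
    for node, cache in neighbor_caches.items():
        messages += 1                 # one unicast request per neighbour
        if item in cache:
            return f"neighbor:{node}", messages
    return "server", messages + 1     # request forwarded on to the server
```

Any resolution that stops at a neighbour saves the flood toward the server, which is the message-overhead reduction the abstract targets.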
Peer to peer cache resolution mechanism for mobile ad hoc networks — ijwmn
In this paper we investigate the problem of cache resolution in a mobile peer to peer ad hoc network. In our
vision cache resolution should satisfy the following requirements: (i) it should result in low message
overhead and (ii) the information should be retrieved with minimum delay. In this paper, we show that
these goals can be achieved by splitting the one hop neighbours in to two sets based on the transmission
range. The proposed approach reduces the number of messages flooded in to the network to find the
requested data. This scheme is fully distributed and comes at very low cost in terms of cache overhead. The
experimental results gives a promising result based on the metrics of studies
International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
Comparative study on Cache Coherence Protocolsiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSijdpsjournal
Advances in Integrated Circuit processing allow for more microprocessor design options. As Chip Multiprocessor system (CMP) become the predominant topology for leading microprocessors, critical components of the system are now integrated on a single chip. This enables sharing of computation resources that was not previously possible. In addition the virtualization of these computation resources exposes the system to a mix of diverse and competing workloads. On chip Cache memory is a resource of primary concern as it can be dominant in controlling overall throughput. This Paper presents analysis of various parameters affecting the performance of Multi-core Architectures like varying the number of cores, changes L2 cache size, further we have varied directory size from 64 to 2048 entries on a 4 node, 8 node 16 node and 64 node Chip multiprocessor which in turn presents an open area of research on multicore processors with private/shared last level cache as the future trend seems to be towards tiled architecture executing multiple parallel applications with optimized silicon area utilization and excellent performance.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
Similar to Survey paper _ lakshmi yasaswi kamireddy(651771619) (20)
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the student’s details, driver’s details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two type of water scarcity. One is physical. The other is economic water scarcity.
Abstract
Present-day systems demand many processor cores on a single chip. As the number of cores on a Chip Multi-Processor (CMP) increases, so does the need for effective management of the on-chip cache. Cache management plays an important role in improving performance, which is achieved by reducing the number of misses and the miss latency; these two factors cannot be reduced at the same time. Some CMPs use a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses, while others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals use selective replication to strike a balance between miss latency and on-chip capacity. Replication comes in two kinds: static replication and dynamic replication. This paper focuses on the existing dynamic replication schemes and analyzes each scheme on several benchmarks.
1. Introduction
Upcoming generations of multicore processors and applications will operate on massive data. A major challenge in near-future multicore processors is the data movement incurred by conventional cache hierarchies, which has a very high impact on off-chip bandwidth, on-chip memory access latency, and energy consumption. A large on-chip cache is possible, but it is not a scalable solution: it is limited to a small number of cores, and hence the only practical option is to physically distribute memory in pieces so that every core is near some portion of the cache. Such a solution might provide a large amount of aggregate cache capacity and fast private memory for each core, but at the same time it is difficult to manage the distributed cache and network resources efficiently, as they require architectural support for cache coherence and consistency under the ubiquitous shared memory model. Most directory-based protocols enable fast local caching to exploit data locality, but even they have scalability issues. Some of the most recent proposals have addressed the issue of directory scalability in single-chip multicores using sharer compression techniques or limited directories. But the fast private caches still suffer from two major problems: (1) due to capacity constraints, they cannot hold the working set of applications that operate on massive data, and (2) due to frequent communication between cores, data is often displaced from them [1]. This leads to increased network traffic and a higher request rate to the last-level cache. On-chip wires do not scale at the same pace as transistors, because of which data movement not only impacts memory access latency but also consumes more power due to the energy consumption of network and cache resources [2]. Though private LLC organizations (e.g., [3]) have low hit latencies, their off-chip miss rates are high in applications that have uneven distributions of working sets or exhibit high degrees of sharing (due to cache line replication). Shared LLC organizations (e.g., [4]), on the other hand, lead to non-uniform cache access (NUCA) [5] that hurts on-chip locality, but their off-chip miss rates are low since cache lines are not replicated. Several proposals have explored the idea of a hybrid LLC. Replication mechanisms have been proposed to balance access latency against cache capacity in hybrid L2 cache designs [6] [7]. Two types of replication approaches have been proposed: static [8, 9] and dynamic [10, 11, 12, 13, 14]. In static replication, a data block is placed through predefined address interleaving; therefore, the LLC banks that may contain that data block are fixed. The data placement of instruction pages in R-NUCA [8] and in S-NUCA [9] is static. In dynamic replication, a data block can be placed in any LLC bank. Victim Replication [10], Adaptive Selective Replication [11], Adaptive Probability Replication [12], Dynamic Reusability-based Replication [13], and Locality-Aware Data Replication at the Last-Level Cache [14] fall into this category. These replication mechanisms have their own advantages and disadvantages. This paper analyzes these dynamic replication schemes.
2. Background
Chronologically, the first of the dynamic replication mechanisms mentioned above is Victim Replication (VR) [10], which is based on shared caches but tries to capture evictions from the local primary cache in the local L2 slice, to reduce subsequent access latency to the same cache block. Victim replicas and global L2 cache blocks share L2 slice capacity. In VR, all primary cache misses must first check the local L2 tags in case there is a valid local replica. On a replica miss, the request is forwarded to the home tile. On a replica hit, the replica is invalidated in the local L2 slice and moved into the primary cache [10]. The next technique introduced is Adaptive Selective Replication (ASR) [11], which adopts a replication mechanism similar to VR but focuses on the capacity contention between replicas and global L2 cache blocks. ASR dynamically estimates the cost (extra misses) and benefit (lower hit latency) of replication and adjusts the number of receivable victims to avoid hurting L2 cache performance [11]. Another replication scheme, the Adaptive Probability Replication (APR) [12] mechanism, counts each block's accesses in the L2 cache slices and monitors the number of evicted blocks with different numbers of accesses, to estimate the re-reference probability of blocks over their lifetime at runtime. Using the predicted re-reference probability, APR adopts a probability replication policy and a probability insertion policy to replicate blocks at the corresponding probabilities and insert them at an appropriate position according to their re-reference probability [12]. In the same conference, another mechanism named Dynamic Reusability-based Replication (DRR) [13] was introduced. DRR is a hybrid cache architecture that dynamically monitors the reuse pattern of cache blocks and replicates blocks with high reusability to appropriate L2 cache slices [13]. Replicas are shared by nearby cores through a fast lookup mechanism, Network Address Mapping, which records the location of the nearest replica in the network interfaces and forwards subsequent L1 miss requests to the replica immediately. This improves the performance of shared caches by exploiting reusability-based replication, the fast lookup mechanism, and replica sharing. The most recent technique introduced is the locality-aware selective data replication protocol for the last-level cache (LLC) [14]. This method achieves lower memory access latency and energy by replicating only high-locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. The approach relies on low-overhead yet highly accurate in-hardware runtime classification of data locality at the cache-line granularity, and only allows replication for cache lines with high reuse [14]. A classifier captures the LLC pressure at the existing replica locations, and the replication decision is adapted accordingly. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. The following sections discuss these schemes in detail.
3. Schemes
3.1. Victim Replication (VR)
Victim Replication (VR) is a hybrid scheme that combines the large capacity of a shared L2 cache with the low hit latency of a private L2 cache. VR is primarily based on a shared L2 cache, but in addition tries to capture evictions from the local primary cache in the local L2 slice. Each retained victim is a local L2 replica of a line that already exists in the L2 of the remote home tile. When a miss occurs at the shared L2 cache, a line is brought in from memory and placed in the on-chip L2 at a home tile determined by a subset of the physical address bits, as in a shared L2 cache. The requested line is directly forwarded to the primary cache of the requesting processor. If the line's residency in the primary cache is terminated because of an incoming invalidation or writeback request, the usual shared L2 cache protocol is followed. If a primary cache line is evicted because of a conflict or capacity miss, a copy of the victim line is kept in the local slice to reduce subsequent access latency to the same line. A global line with remote sharers is never evicted in favor of a local replica, as an actively cached global line is likely to be in use. The VR replication policy will replace the following classes of cache lines in the target set, in descending priority order: (1) an invalid line; (2) a global line with no sharers; (3) an existing replica. If there are no lines belonging to these three categories, no replica is made and the victim is evicted from the tile, as in a shared L2 cache [10]. If there is more than one line in the selected category, VR picks one at random. All primary cache misses first check the local L2 tags in case there is a valid local replica. On a replica miss, the request is forwarded to the home tile. On a replica hit, the replica is invalidated in the local L2 slice and moved into the primary cache. When a downgrade or invalidation request is received from the home tile, the L2 tags are checked in addition to the primary cache tags [10].
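The replacement priority above can be sketched in code. This is a minimal illustration, not the paper's implementation: the line representation and field names are assumptions made for the example.

```python
import random

def choose_replica_victim(target_set):
    """Pick a line in the target L2 set to displace in favor of a local
    replica, or None if no replica should be made.

    Priority order from VR [10]: (1) an invalid line, (2) a global line
    with no sharers, (3) an existing replica. A global line with remote
    sharers is never displaced. Ties within a category break at random.
    """
    for predicate in (
        lambda l: not l["valid"],                        # (1) invalid line
        lambda l: l["is_global"] and l["sharers"] == 0,  # (2) unshared global
        lambda l: l["is_replica"],                       # (3) existing replica
    ):
        candidates = [l for l in target_set if predicate(l)]
        if candidates:
            return random.choice(candidates)
    return None  # no replica is made; the victim leaves the tile

# Example: a 4-way set where only one global line has no sharers.
ways = [
    {"valid": True, "is_global": True, "sharers": 2, "is_replica": False},
    {"valid": True, "is_global": True, "sharers": 0, "is_replica": False},
    {"valid": True, "is_global": True, "sharers": 1, "is_replica": False},
    {"valid": True, "is_global": True, "sharers": 3, "is_replica": False},
]
victim = choose_replica_victim(ways)  # the unshared global line
```

Note that the random tie-break within a category matches the text's "VR picks at random"; everything else about the data layout is invented for the sketch.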
3.2. Adaptive Selective Replication (ASR)
Adaptive Selective Replication (ASR) obtains the optimum replication level by balancing the benefits of replication against its costs. L2 cache block replication improves memory system performance when the average L1 miss latency is reduced.
The following equation describes the average cycles for L1 cache misses, normalized by instructions executed:

L1 miss cycles / Instruction =
    (P_localL2 * L_localL2) / (Instructions / L1 miss)
  + (P_remoteL2 * L_remoteL2) / (Instructions / L1 miss)
  + (P_miss * L_miss) / (Instructions / L1 miss)
Px is the probability of a memory request being satisfied by entity x, where x is the local L2 cache, the remote L2 caches, or main memory, and Lx is the latency of each entity [11]. The combination of the localL2 and remoteL2 terms represents the memory cycles spent on L2 cache hits, and the third term represents the memory cycles spent on L2 cache misses. Replication increases the probability that L1 misses hit in the local L2 cache; thus the P_localL2 term increases and the P_remoteL2 term decreases. Because a local L2 cache hit is tens of cycles faster than a remote L2 cache hit, the net effect of increasing replication is a reduction in cycles spent on L2 cache hits. However, more replication devotes more capacity to replica blocks; thus fewer unique blocks exist on-chip, increasing the probability of L2 cache misses, P_miss. If the probability of a miss increases significantly due to replication, the miss term will dominate, as the latency of memory is hundreds of cycles greater than the L2 hit latencies. Therefore, balancing these three terms is necessary to improve memory system performance.
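As a concrete check of this tradeoff, the sketch below evaluates the equation with made-up latencies and probabilities; every number is an illustrative assumption, not a measurement from [11].

```python
def l1_miss_cycles_per_instruction(p_local, p_remote, p_miss,
                                   l_local, l_remote, l_miss,
                                   instructions_per_l1_miss):
    """Average memory cycles per instruction spent on L1 misses,
    following the equation above: each term weights an entity's
    latency by the probability it services the miss."""
    assert abs(p_local + p_remote + p_miss - 1.0) < 1e-9
    return (p_local * l_local + p_remote * l_remote + p_miss * l_miss) \
           / instructions_per_l1_miss

# Without replication: most L1 misses are serviced by remote L2 slices.
base = l1_miss_cycles_per_instruction(0.10, 0.85, 0.05, 10, 40, 400, 50)
# With replication: many more local hits, at a slightly higher miss rate.
repl = l1_miss_cycles_per_instruction(0.60, 0.33, 0.07, 10, 40, 400, 50)
# Here replication wins: the local-hit savings outweigh the extra misses.
```

With these assumed values the replicated configuration spends fewer miss cycles per instruction; pushing p_miss higher would eventually flip the comparison, which is exactly the balance ASR tries to find.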
Optimal performance often arises from an intermediate replication level. Figure 1 graphically depicts this tradeoff. The Replication Benefit curve, Figure 1(a), illustrates the trend that increasing replication reduces L2 cache hit cycles. Due to the strong locality of shared read-only requests, a small degree of L2 replication can significantly reduce L2 hit cycles by moving many previously remote L2 hits into the local cache. In contrast, further replication reduces L2 hit cycles only gradually, because fewer unique blocks on-chip lead to fewer total L2 hits. The Replication Cost curve, Figure 1(b), illustrates that increasing L2 replication increases the memory cycles spent on off-chip misses. The Replication Effectiveness curve, Figure 1(c), combines the benefit and cost curves and plots the total memory cycles. Because the benefit and cost curves are generally convex and have opposite slopes, the minimum of the Replication Effectiveness curve often lies between allowing all replications and allowing none. ASR estimates the slopes of the benefit and cost curves to approximate the optimal replication level.
Figure 1 [11]: (a) Replication Benefit, (b) Replication Cost, (c) Replication Effectiveness.
By dynamically monitoring the benefit and cost of replication, ASR attempts to achieve the optimal level of replication. ASR identifies discrete replication levels and makes a piecewise approximation of the memory-cycle slope [11]. Thus ASR simplifies the analysis to a local decision of whether the amount of replication should be increased, decreased, or remain the same. Figure 1 illustrates the case where the current replication level, labeled C, results in HC hit cycles per instruction and MC miss cycles per instruction. ASR considers three alternatives: (i) increasing replication to the next higher level, labeled H, (ii) decreasing replication to the next lower level, labeled L, or (iii) leaving the replication unchanged [11]. To make this decision, ASR needs not only HC and MC, but also four additional hit and miss cycles-per-instruction values: HH and MH for the next higher level, and HL and ML for the next lower level. To simplify the collection process, ASR estimates only the four differences between the hit and miss cycles-per-instruction values: (1) the benefit of increasing replication (the decrease in L2 hit cycles, HC - HH); (2) the cost of increasing replication (the increase in L2 miss cycles, MH - MC); (3) the benefit of decreasing replication (the decrease in L2 miss cycles, MC - ML); and (4) the cost of decreasing replication (the increase in L2 hit cycles, HL - HC). By comparing these cost and benefit counters, ASR increases, decreases, or leaves unchanged the replication level.
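The three-way decision can be sketched as below. The four delta names mirror the text, but the comparison rule (raise replication whenever its benefit exceeds its cost, else consider lowering it) is a simplified assumption about how ASR weighs the counters, not the paper's exact policy.

```python
def asr_decision(benefit_up, cost_up, benefit_down, cost_down):
    """Decide whether to raise, lower, or keep the replication level.

    benefit_up   = HC - HH  (L2 hit cycles saved at the next higher level)
    cost_up      = MH - MC  (extra L2 miss cycles at the next higher level)
    benefit_down = MC - ML  (L2 miss cycles saved at the next lower level)
    cost_down    = HL - HC  (extra L2 hit cycles at the next lower level)
    """
    if benefit_up > cost_up:
        return "increase"      # more replication pays for itself
    if benefit_down > cost_down:
        return "decrease"      # less replication pays for itself
    return "unchanged"         # current level is a local optimum

# Example: big hit-cycle savings from replicating more, small miss cost.
decision = asr_decision(benefit_up=100, cost_up=20,
                        benefit_down=5, cost_down=50)
```

The point of the sketch is the shape of the logic: ASR never needs absolute HC/MC values, only the four differences, which keeps the hardware counters cheap.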
3.3. Adaptive Probability Replication (APR)
APR is based on a distributed shared L2 cache design. To predict re-reference probability, APR adds a counter to each cache block to record and transfer the number of accesses. In APR, each tile stores the re-reference probabilities of blocks from other remote L2 cache slices in its network interface component, using a simple lookup table called the Re-Reference Probability Buffer (RRPB) [12]. The RRPB keeps re-reference probability entries for all other L2 slices. A re-reference probability entry holds replication thresholds for different numbers of accesses; the replication thresholds indicate the re-reference probability of blocks with that number of accesses. In the local L2 slice, if there is an invalid block or the victim is not a shared global block, the replica is filled into the L2 cache slice; otherwise, the replication is abandoned. The insert position of a replica is determined by its corresponding re-reference probability. When a replica is accessed again, it is deleted from the local L2 cache slice and moved to the local L1 cache.
APR counts every access to L2 cache blocks and records the number of evicted blocks with different numbers of accesses, to estimate the re-reference probability at runtime. For example, the re-reference probability of a block with N accesses is the number of evicted blocks with more than N accesses divided by the number of evicted blocks with at least N accesses. The estimation is performed only when a global block replacement occurs (not a replica replacement). The re-reference probabilities are propagated to all other tiles at a certain interval (such as every 10,000 cycles) by attaching them to any response message. Because blocks from other remote L2 slices may be accessed in local L2 cache slices due to replication, each replica access also increments the corresponding counter associated with the block. When a replica is accessed, the associated counter is also moved to the L1 cache block. The counter values of blocks in L1 caches are sent back to the home L2 slice when the blocks are evicted, to accumulate the number of accesses.
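The estimation rule above can be made concrete with a small sketch. The eviction histogram below is invented data for illustration; APR's actual hardware counters and update timing are described in [12].

```python
from collections import Counter

# Invented histogram: how many global blocks were evicted after
# exactly k accesses (k -> count).
evicted_access_counts = Counter({1: 50, 2: 30, 3: 15, 4: 5})

def re_reference_probability(n, hist):
    """P(a block with n accesses is accessed again), estimated as
    (# evicted with more than n accesses) /
    (# evicted with at least n accesses), per the rule in the text."""
    at_least = sum(c for k, c in hist.items() if k >= n)
    more = sum(c for k, c in hist.items() if k > n)
    return more / at_least if at_least else 0.0

p1 = re_reference_probability(1, evicted_access_counts)
p3 = re_reference_probability(3, evicted_access_counts)
# Blocks that have already been touched more often are likelier to
# be touched again, so p3 need not exceed p1 in this made-up data.
```

The quotient form means the estimate needs only one eviction histogram per slice, which is cheap to maintain and to broadcast piggybacked on response messages, as the text describes.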
Like ASR, a linear-feedback shift register generates a pseudo-random number, which is compared with the corresponding replication threshold. When an evicted L1 block passes through the network interface, APR captures the message and looks up the corresponding RRPB entry according to the block's address. If the replication threshold corresponding to the block's number of accesses is less than the generated random number, the block is sent to the local L2 slice; otherwise, it is evicted to the home L2 slice. In the local L2 cache slice, if there is an invalid block or the victim is not a shared global block, the block is inserted; otherwise, it is sent to the home L2 slice. Blocks with more accesses have higher re-reference probability. Probability insertion is implemented in APR according to the number of accesses of the replicated block, where the number of accesses indicates the insert position. If the number of accesses of a block exceeds the way size, the block is inserted at the MRU position. The aim of probability insertion is to make blocks with lower re-reference probability survive for a shorter time.
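The two APR policies can be sketched as follows. The comparison direction follows the wording above (replicate when the threshold is below the draw); the way size and the use of `random.random` in place of the paper's linear-feedback shift register are assumptions for the sketch.

```python
import random

WAYS = 8  # assumed L2 associativity for the example

def should_replicate(threshold, draw=None):
    """Probability replication: replicate the evicted L1 block into
    the local L2 slice iff its replication threshold is less than
    the pseudo-random draw, per the description above."""
    if draw is None:
        draw = random.random()  # stand-in for the LFSR output
    return threshold < draw

def insert_position(accesses, ways=WAYS):
    """Probability insertion: the access count sets the insert
    position; a block whose count exceeds the way size goes straight
    to the MRU position (here, position ways - 1)."""
    return min(accesses, ways - 1)
```

Low-access blocks thus land near the LRU end and are displaced quickly, which is exactly the "shorter survival" goal the text states for low re-reference-probability blocks.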
3.4. Dynamic Reusability-based Replication (DRR)
DRR dynamically replicates blocks with high reusability to other appropriate L2 cache slices and allows the replicas to be shared by nearby cores via a fast lookup mechanism [13]. A set-associative Core Access Counter Buffer (CACB) is used to determine which blocks should be replicated and the corresponding replication destinations. For recently accessed blocks, the CACB records access counts for each core that is more than a certain number of hops (for example, 2 hops, which is also the smallest distance between the home slice and replica slices) away from the home slice. Thus, only 10 counters are needed in one CACB entry for a 16-core CMP. When a block receives a Read request from a core, the corresponding counter is incremented. To avoid coherence problems, when the block receives a Write request, all the counters of the block are reset to zero. In the CACB, a larger counter value means higher reusability. When the maximum counter of a block reaches a certain threshold (for example, 5), the block is replicated to the slice corresponding to that counter.
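The CACB counting logic can be sketched as a small class. The structure and method names are illustrative assumptions; only the counting rules (per-core Read counters, Write reset, threshold trigger) come from the text.

```python
REPL_THRESHOLD = 5  # example threshold from the text

class CACBEntry:
    """One CACB entry: one counter per sufficiently distant core."""

    def __init__(self, num_distant_cores):
        self.counters = [0] * num_distant_cores

    def on_read(self, core):
        """Count a Read from `core`; return the core id to replicate
        toward once its counter reaches the threshold, else None."""
        self.counters[core] += 1
        if self.counters[core] >= REPL_THRESHOLD:
            return core
        return None

    def on_write(self):
        """A Write resets every counter, keeping replicas coherent."""
        self.counters = [0] * len(self.counters)

entry = CACBEntry(10)  # 10 counters suffice for a 16-core CMP (text)
dest = None
for _ in range(REPL_THRESHOLD):
    dest = entry.on_read(3)  # fires on the fifth Read from core 3
```

Resetting on every Write is the key coherence simplification: a replica is only ever created from read-shared data, so the home slice stays the single serialization point for writes.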
After the block to replicate and the destination have been determined, the home L2 cache slice sends a replication request to the destination. When the destination receives the replication request, it allocates cache space to hold the replica. If the destination has space available for the replica, it responds with an acknowledgement to the home L2 cache slice; otherwise, it responds with a failure message. Once the replication operation is completed, the replication destination is stored in a set-associative Replication Directory Buffer (RDB) in the home L2 slice. If the replication fails, the destination is not stored in the RDB. When a Read request reaches the home L2 cache slice, if the distance between the requesting core and the nearest replica is less than the given replica distance (for example, 3 hops in a 16-core CMP), the request is forwarded to the nearest replica; otherwise, the request is satisfied at the home L2 cache slice.
When a replica receives a forwarded request, it responds with data to the requesting core. As the data response message passes through the requesting core's network interface, the replica's location is stored in a set-associative Network Address Mapping Buffer (NAMB). The NAMB is embedded in the network interface and records the locations of replicas that have serviced the core. When an L1 cache Read miss request passes through the network interface, it first searches the NAMB. On a NAMB hit, the request is forwarded immediately to the recorded replica location; otherwise, it continues on to the home L2 cache slice. For coherence maintenance, an L1 cache Write miss request does not search the NAMB and travels directly to the home L2 cache slice, ensuring that write operations are serialized at the unique home L2 cache slice. Because the NAMB is embedded in the network interface, its access latency can be hidden by other network interface operations.
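The routing decision at the network interface reduces to a few lines. This is a sketch under the assumptions that the NAMB behaves like an address-to-slice map and that `route_l1_miss` is an invented name; the read/write asymmetry is exactly the rule from the text.

```python
def route_l1_miss(addr, is_write, namb, home_slice):
    """Decide where an L1 miss request is sent, per the text: writes
    always target the home slice (the serialization point) and never
    search the NAMB; reads go to a known replica on a NAMB hit."""
    if is_write:
        return home_slice
    return namb.get(addr, home_slice)  # fall through to home on a miss
```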
3.5. Locality Aware Data Replication at Last Level Cache
Run-length is defined as the number of accesses to a cache line (at the LLC) from a particular core before a conflicting
access by another core or before it is evicted. The greater the number of accesses with high run-length, the greater the benefit of replicating the cache line in the requester's LLC slice. Instructions and shared data (both read-only and read-write) can
be replicated if they demonstrate good reuse. It is also important to adapt the replication decision at runtime in case the
reuse of data changes during an application’s execution.
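The run-length metric above can be illustrated with a toy trace walker. The function name and the trace representation are invented for illustration; evictions are omitted for brevity, so a run here ends only on a conflicting access by another core or at the end of the trace.

```python
def run_lengths(trace):
    """Compute run-lengths from a list of (core, line) access pairs:
    the length of each run of consecutive accesses to a line by one
    core before another core's conflicting access."""
    runs, current = [], {}          # current: line -> (core, length)
    for core, line in trace:
        prev_core, length = current.get(line, (core, 0))
        if core == prev_core:
            current[line] = (core, length + 1)
        else:
            runs.append(length)     # the previous core's run ends here
            current[line] = (core, 1)
    runs.extend(length for _, length in current.values())
    return runs
```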
On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is
inserted at the private L1 cache. A Replica Reuse counter at the LLC directory entry is incremented. The replica reuse
counter is a saturating counter used to capture reuse information. It is initialized to ‘1’ on replica creation and incremented
on every replica hit. On the other hand, if a replica is not found, the request is forwarded to the LLC home location. If the
cache line is not found there, it is either brought in from the off-chip memory or the underlying coherence protocol takes
the necessary actions to obtain the most recent copy of the cache line. A replication mode bit is used to identify whether a
replica is allowed to be created for the particular core and a home reuse counter is used to track the number of times the
cache line is accessed at the home location by the particular core. This counter is initialized to ‘0’ and incremented on
every hit at the LLC home location. If the replication mode bit is set to true, the cache line is inserted in the requester’s
LLC slice and the private L1 cache. Otherwise, the home reuse counter is incremented. If this counter has reached the Replication Threshold (RT), the requesting core is “promoted” and the cache line is inserted in its LLC slice and private L1 cache. If the home reuse counter is still less than RT, a replica is not created; the cache line is only inserted in the requester’s private L1 cache [14].
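The home-slice decision path on a read can be sketched as follows. The function and dictionary field names are assumptions; the replication mode bit, the home reuse counter initialized to 0, and the promotion at RT are from the text (RT = 3 is the value the evaluation later finds best).

```python
RT = 3  # Replication Threshold

def on_home_read(sharer_state, core):
    """What the home LLC slice does for a read from `core` when the
    core's local slice had no replica. Returns where the line lands."""
    st = sharer_state.setdefault(core, {"replica_mode": False,
                                        "home_reuse": 0})
    if st["replica_mode"]:
        return "LLC replica + L1"      # already a replica sharer
    st["home_reuse"] += 1              # count this hit at home
    if st["home_reuse"] >= RT:
        st["replica_mode"] = True      # core is "promoted"
        return "LLC replica + L1"
    return "L1 only"                   # not enough reuse yet
```

With RT = 3, the first two reads from a core are served without replication; the third promotes the core and creates an LLC replica.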
On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a
replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted at the private L1 cache. In addition, the
Replica Reuse counter is incremented. If a replica is not found or exists in the Shared (S) state, the request is forwarded to
the LLC home location. The directory invalidates all the LLC replicas and L1 cache copies of the cache line, thereby
maintaining the single-writer multiple-reader invariant. On an invalidation request, both the LLC slice and L1 cache on a
core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC
home location. In addition, if a valid LLC replica exists, the replica reuse counter is communicated back with the
acknowledgement. The locality classifier uses this information along with the home reuse counter to determine whether
the core stays as a replica sharer. If the (replica +home) reuse is greater than or equal to the RT, the core maintains replica
status, else it is demoted to non-replica status. When an L1 cache line is evicted, the LLC replica location is probed for the
same address. If a replica is found, the dirty data in the L1 cache line is merged with it, else an acknowledgement is sent
to the LLC home location. However, when an LLC replica is evicted, the L1 cache is probed for the same address and
invalidated. An acknowledgement message containing the replica reuse counter is sent back to the LLC home location. If
the replica reuse is greater than or equal to RT, the core maintains replica status, else it is demoted to non-replica status.
After all acknowledgements are processed, the Home Reuse counters of all non-replica sharers other than the writer are
reset to ‘0’. This has to be done since these sharers have not shown enough reuse to be “promoted”. If the writer is a non-replica sharer, its home reuse counter is modified as follows: if the writer is the only sharer (replica or non-replica), its home reuse counter is incremented; otherwise it is reset to ‘1’. This enables the replication of migratory shared data at the writer, while avoiding it if the replica is likely to be downgraded due to conflicting requests by other cores.
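The classifier's bookkeeping after a write can be sketched in one function. Field and function names are illustrative; the (replica + home) reuse test against RT, the reset of non-writer non-replica counters, and the writer's increment-or-reset rule are taken from the text.

```python
RT = 3  # Replication Threshold, as above

def classify_on_write(sharers, writer):
    """Update sharer status after a write invalidates all copies.
    `sharers` maps core -> {"replica": bool, "replica_reuse": int,
    "home_reuse": int}."""
    for core, st in sharers.items():
        if st["replica"]:
            # Keep replica status only if (replica + home) reuse >= RT.
            st["replica"] = (st["replica_reuse"] + st["home_reuse"]) >= RT
        if not st["replica"] and core != writer:
            st["home_reuse"] = 0       # insufficient reuse: start over
    w = sharers.get(writer)
    if w is not None and not w["replica"]:
        if len(sharers) == 1:          # writer is the only sharer
            w["home_reuse"] += 1       # migratory data: build up reuse
        else:
            w["home_reuse"] = 1        # conflicting sharers exist
```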
4. Results
APR improves performance by 12% on average over the Baseline (shared cache design) for the splash-2 benchmarks and by 24% for the parsec benchmarks. VR shows similar performance on both suites: 5% over Baseline for splash-2 and 4% for parsec. ASR performs similarly to VR on splash-2 but outperforms it on parsec (15% over Baseline). R-NUCA obtains 2% and 8% performance gains for splash-2 and parsec respectively, because instructions in these benchmarks exhibit strong locality and occupy relatively little capacity. APR demonstrates stable performance improvement across both suites. Overall, APR improves performance by 21% on average over the Baseline, by 17% over VR, by 10% over ASR, and by 15% over R-NUCA. Replication schemes increase the L2 cache miss rate. Figures 4 and 5 show the normalized L2 cache miss ratio of the evaluated replication schemes for the splash-2 and parsec benchmarks respectively. APR increases the L2 miss ratio by as much as 49% for splash-2 and by 38% for parsec. Compared to VR and ASR, APR shows a lower L2 miss ratio, owing to its replication filtering and replica insertion policies: the probability replication filtering policy reduces contention for L2 cache capacity, and the probability insertion policy reduces the residency time of replicas. Both policies limit the pressure that extra replicas place on the limited L2 capacity.
Figure 2: Normalized Execution time for Splash-2 benchmarks
Figure 3: Normalized Execution time for Parsec Benchmark
Figure 4: Normalized Miss ratio for Splash-2 benchmarks
Figure 5: Normalized Miss ratio for Parsec benchmarks
DRR achieves lower read latency than the other techniques. Figure 6 shows the normalized L2 cache average read latency. VR, ASR, and R-NUCA do not reduce read latency relative to the Baseline, while DRR reduces it by 12%. These results show that DRR takes full advantage of its replicas through the network address mapping mechanism, whereas unnecessary extra search latency offsets the benefit of replicas in VR and ASR. R-NUCA's instruction replication has limited benefit for the splash-2 and parsec benchmarks. Figure 7 shows the normalized execution time. As can be seen, DRR improves the total execution time of almost all benchmarks compared to the baseline system, VR, ASR, and R-NUCA. The maximum performance gain occurs on the dedup benchmark, with about 69% improvement. The average performance improvement is about 30% over the baseline system, about 16% over VR, about 8% over ASR, and about 25% over R-NUCA. While the improvements vary across benchmarks, DRR shows better performance in almost all cases, indicating the good adaptivity of the reusability-based replication scheme. The L2 cache miss rate is shown in Figure 8. Compared to the baseline system, VR increases the L2 cache miss rate by about 162%, ASR by about 91%, R-NUCA by about 67%, and DRR by about 48%.
Figure 6: Normalized average read latency
Figure 7: Normalized Execution time
Figure 8: Normalized L2 miss ratio
The locality-aware protocol provides better energy consumption and performance than the other LLC data management schemes. It is important to balance on-chip data locality against the off-chip miss rate; overall, an RT of 3 achieves the best trade-off. It is also important to replicate all types of data: the selective replication of only certain types by R-NUCA (instructions) and ASR (instructions and shared read-only data) leads to sub-optimal energy and performance. Overall, the locality-aware protocol achieves 16%, 14%, 13%, and 21% lower energy and 4%, 9%, 6%, and 13% lower completion time compared to VR, ASR, R-NUCA, and S-NUCA respectively.
Figure 9: Normalized Energy
Figure 10: Normalized Completion Time
Figure 11: Normalized L1 Cache Miss
5. Conclusions
For applications whose working sets fit within the LLC, the locality-aware scheme performs well in both energy and performance even if replication is done on every L1 cache miss. VR is good for applications with heavy access to shared read-write data, but it incurs higher L2 cache energy than the other schemes. Applications with more accesses to instructions and shared read-only data benefit from ASR; the locality-aware protocol, APR, and DRR perform almost the same in such cases. Replication of migratory shared data requires creating a replica in an Exclusive coherence state. The locality-aware protocol creates LLC replicas for such data when sufficient reuse is detected and hence performs well; APR and DRR also perform comparatively well for this kind of data. Applying the probability replication and probability insertion policies to VR and ASR has been shown to yield better performance than the individual schemes. From the above analysis, the best choice depends on the kind of data an application uses most, and the replication policy has to be chosen accordingly. Among the techniques discussed, the newly proposed APR, DRR, and locality-aware replication schemes perform better in most cases than the existing ASR and VR. Only dynamic replication schemes are discussed above; static replication schemes have their own benefits for certain types of applications, as can be seen from the improved performance of R-NUCA in some cases. A replication scheme should therefore be selected so that it satisfies as many of the CMP's needs, i.e. as many applications, as possible.
6. References
[1] G. Kurian, O. Khan, and S. Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the
40th Annual International Symposium on Computer Architecture, ISCA ’13, pages 523–534, New York, NY, USA, 2013.
ACM.
[2] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar. Near-threshold voltage (NTV) design
opportunities and challenges. In Design Automation Conference, 2012.
[3] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache Hierarchy and Memory Subsystem of
the AMD Opteron Processor. IEEE Micro, 30(2), Mar. 2010.
[4] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem). White Paper, 2008.
[5] C. Kim, D. Burger, and S. W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-
Chip Caches. In International Conference on Architectural Support for Programming Languages and Operating Systems,
2002.
[6] Zhang, M. and K. Asanovic, Victim replication: Maximizing capacity while hiding wire delay in tiled chip
multiprocessors. Proceedings -International Symposium on Computer Architecture, 2005: p. 336-345.
[7] Beckmann, B.M., M.R. Marty, and D.A. Wood, ASR: Adaptive selective replication for CMP caches. Proceedings of
the Annual International Symposium on Microarchitecture, MICRO, 2006: p. 443-454.
[8] Hardavellas, N., M. Ferdman, B. Falsafi and A. Ailamaki, Reactive NUCA: Near-Optimal Block Placement and
Replication in Distributed Caches. the 36th Annual International Symposium on Computer Architecture, 2009: p. 184-
195.
[9] Chang, J.C. and G.S. Sohi, Cooperative caching for chip multiprocessors. The 33rd International Symposium on Computer Architecture, Proceedings, 2006: p. 264-275.
[10] Beckmann, B.M. and D.A. Wood, Managing wire delay in large chip-multiprocessor caches. Micro-37 2004: 37th
Annual International Symposium on Microarchitecture, Proceedings, 2004: p. 319-330.
[11] Kandemir, M., F. Li, M. J. Irwin, and S. W. Son, A novel migration-based NUCA design for Chip Multiprocessors.
in High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for. 2008.
[12] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue, High Performance Cache Block Replication Using Re-Reference Probability in CMPs. High Performance Computing (HiPC), 2011, 18th International Conference.
[13] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue, Dynamic Reusability-based Replication with Network Address Mapping in CMPs. High Performance Computing (HiPC), 2011, 18th International Conference.
[14]Kurian, G., Devadas, S., Khan, O.: Locality-Aware Data Replication in the Last-Level Cache. In: 20th International
Symposium on High Performance Computer Architecture, pp. 1-12, IEEE Press, New York (2014).