Duplication of data in storage systems is becoming an increasingly common problem. The system introduces I/O Deduplication, a storage optimization that exploits content similarity to improve I/O performance by eliminating I/O operations, reducing the mechanical delays incurred during the remaining I/O operations, and sharing data with existing users when duplicates are found on the client or server side. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each of these techniques is motivated by observations from I/O workload traces obtained from actively used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data.
Two of the most common concerns for current cloud storage systems are data reliability and storage cost. To ensure data reliability, current clouds typically use a multi-replica strategy (commonly three replicas), which consumes a large amount of storage and results in high storage cost for applications that generate vast amounts of data in the cloud. This paper presents a cost-efficient and reliable data management mechanism named SiDe (Similarity Based Deduplication) for the private cloud storage systems of enterprises. When a file is stored, the parameters file_name, file_size and file_storage_duration are taken as input from the user. To minimize the memory requirement, the file is divided into fixed-size chunks and only the unique chunks are stored, along with the code needed for regeneration. A compressed copy is then created for input files stored for a long duration, and a key generation algorithm produces a unique key to ensure security. Simulation indicates that, compared with the conventional three-replica strategy, SiDe reduces cloud storage space consumption by around 81-84% for file sizes between 10 KB and 300 KB containing duplicate data chunks, and hence considerably reduces cloud storage cost.
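The fixed-size chunking and unique-chunk storage step described above can be illustrated with a minimal sketch. The chunk size, hash choice (SHA-256) and in-memory chunk store below are assumptions made for illustration; the abstract does not specify them.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; the paper does not state one

def dedup_store(data: bytes, chunk_store: dict) -> list:
    """Split data into fixed-size chunks, store only unique chunks,
    and return the chunk-fingerprint recipe used for regeneration."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in chunk_store:       # first time this content is seen: store one copy
            chunk_store[fp] = chunk
        recipe.append(fp)               # duplicates contribute only this reference
    return recipe

def regenerate(recipe: list, chunk_store: dict) -> bytes:
    """Rebuild the original file from its chunk recipe."""
    return b"".join(chunk_store[fp] for fp in recipe)

if __name__ == "__main__":
    store = {}
    original = b"A" * 10000 + b"B" * 4096 + b"A" * 4096
    recipe = dedup_store(original, store)
    assert regenerate(recipe, store) == original
    print(f"chunks referenced: {len(recipe)}, unique chunks stored: {len(store)}")
```

Because duplicate chunks are stored once and referenced many times, storage savings grow with the proportion of duplicate chunks in the input files.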
Hadoop is an open-source implementation of the MapReduce framework for distributed processing. A Hadoop cluster is a computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle massive-scale data, Hadoop relies on the Hadoop Distributed File System (HDFS). Like most distributed file systems, HDFS faces a familiar problem of data sharing and availability among compute nodes, which often degrades performance. This paper is an experimental evaluation of Hadoop's computing performance, carried out by designing a rack-aware cluster that uses Hadoop's default block placement policy to improve data availability. Additionally, an adaptive data replication scheme that relies on access count prediction using Lagrange's interpolation is adapted to fit the scenario (a sketch of this prediction step follows below). Experiments conducted on a rack-aware cluster setup significantly reduced task completion time, but once the volume of data being processed increases, there is a considerable cutback in computational speed due to the update cost. The threshold level that balances the update cost against the replication factor is identified and presented graphically.
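The access-count prediction step can be sketched as follows. This is a minimal illustration of Lagrange interpolation applied to per-block access counts; the sample values, window length and replication threshold are hypothetical and not taken from the paper.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Access counts of one HDFS block over the last four time windows (hypothetical);
# predict the count for the next window and decide whether to add a replica.
windows = [1, 2, 3, 4]
access_counts = [12, 18, 27, 40]
predicted = lagrange_predict(windows, access_counts, 5)

REPLICATION_THRESHOLD = 50  # assumed threshold for raising the replication factor
print(f"predicted accesses in window 5: {predicted:.1f}")
if predicted > REPLICATION_THRESHOLD:
    print("increase replication factor for this block")
```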
Cloud computing is widely considered a strong candidate for the next dominant technology in the IT industry. It offers basic system maintenance and scalable resource management through Virtual Machines (VMs). As an essential technology of cloud computing, virtualization has been a hot research topic in recent years. The high overhead of virtualization has been largely addressed by hardware extensions in the CPU industry and by software improvements in the hypervisors themselves. However, the high demand placed on VM image storage remains a difficult problem. Existing systems have tried to reduce VM image storage consumption by means of deduplication inside a storage area network (SAN), but a SAN cannot satisfy the increasing demand of large-scale VM hosting for cloud computing because of its cost. In this project, we propose SILO, an improved deduplication file system designed specifically for large-scale VM deployment. Its design provides fast VM deployment through a similarity- and locality-based fingerprint index for data transfer, and low storage consumption through deduplication of VM images; a heartbeat protocol in the Meta Data Server (MDS) is implemented to recover data from the data servers. It also provides a comprehensive set of storage features, including a backup server for VM images, on-demand fetching through a network, and caching on local disks via copy-on-read techniques. Experiments show that SILO performs well and introduces only minor performance overhead.
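A rough sketch of the kind of similarity- and locality-based fingerprint index the abstract alludes to is shown below. The segment size, the choice of the minimum hash as the representative fingerprint, and the in-memory structures are assumptions made for illustration, not the authors' implementation.

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()

class SimilarityLocalityIndex:
    """Keep only one representative fingerprint per segment in RAM (similarity);
    on a representative hit, prefetch the whole segment's fingerprints (locality)."""
    def __init__(self):
        self.primary = {}     # representative fingerprint -> segment id
        self.segments = {}    # segment id -> set of all fingerprints in that segment
        self.loaded = set()   # fingerprints prefetched into the dedup cache

    def insert_segment(self, seg_id, fps):
        self.segments[seg_id] = set(fps)
        self.primary[min(fps)] = seg_id        # min-hash as the representative

    def dedupe_segment(self, fps):
        """Return only the fingerprints of chunks that still need to be stored."""
        seg_id = self.primary.get(min(fps))
        if seg_id is not None:                 # a similar stored segment was found
            self.loaded |= self.segments[seg_id]
        return [fp for fp in fps if fp not in self.loaded]

if __name__ == "__main__":
    index = SimilarityLocalityIndex()
    base = [fingerprint(bytes([i]) * 4096) for i in range(4)]
    index.insert_segment("vm-image-1/seg-0", base)
    # The same segment re-uploaded from a second VM image: nothing is stored again.
    print(index.dedupe_segment(list(base)))    # -> []
```

Keeping only one fingerprint per segment in memory is what keeps the RAM footprint low for very large VM image collections, while the segment prefetch preserves most of the deduplication ratio.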
Keywords — Deduplication, Storage area network, Load Balancing, Hash table, Disk copies.
Delay-Optimized File Retrieval under LT-Based Cloud Storage (IJARIIT)
Fountain-code-based distributed storage systems provide reliable online storage by placing unlabeled subsets of encoded blocks on different storage nodes. The Luby Transform (LT) code is one of the prominent fountain codes for storage systems because of its efficient recovery. However, to guarantee a high probability of successful decoding, fountain-code-based storage requires the retrieval of additional blocks, and this requirement can introduce additional delay. We argue that multi-stage recovery of blocks is effective in reducing the file retrieval delay. We first develop a delay model for multi-stage recovery schemes applicable to the considered system, and with this model we focus on optimal recovery schedules given requirements on the probability of successful decoding. Our numerical results suggest a fundamental trade-off between the file retrieval delay and the probability of successful file decoding, and show that the retrieval delay can be substantially reduced by optimally scheduling block requests in a multi-stage fashion.
Data warehouses store integrated and consistent data in a subject-oriented repository dedicated especially to supporting business intelligence processes. However, keeping these repositories updated usually involves complex and time-consuming processes, commonly known as Extract-Transform-Load (ETL) tasks. These data-intensive tasks normally execute within a limited time window, and their computational requirements tend to grow over time as more data is handled. We therefore believe that a grid environment could serve well as the backbone of the technical infrastructure, with the clear financial advantage of using already acquired desktop computers normally present in the organization. This article proposes a different approach to distributing ETL processes in a grid environment, taking into account not only the processing performance of its nodes but also the existing bandwidth, in order to estimate grid availability in the near future and thereby optimize workflow distribution.
Grid computing is a type of parallel and distributed system designed to provide reliable access to data and computational resources over wide area networks. These resources are distributed across different geographical locations. Efficient data sharing in global networks is complicated by erratic node failures, unreliable network connectivity and limited bandwidth. Replication is a technique used in grid systems to improve applications' response time and to reduce bandwidth consumption. In this paper, we present a survey of basic and recent replication techniques proposed by other researchers, followed by a full comparative study of these replication strategies.
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES (neirew J)
ABSTRACT
Data in the cloud is increasing rapidly, and this huge amount of data is stored in data centers around the world. Data deduplication allows lossless compression by removing duplicate data, so these data centers can use storage efficiently by eliminating redundancy. Attacks on cloud computing infrastructure are not new, but attacks that exploit the deduplication feature of the cloud are relatively recent and have gained attention. Attacks on deduplication in the cloud environment can happen in several ways and can give away sensitive information. Although deduplication enables efficient storage and bandwidth utilization, the feature has drawbacks. In this paper, data deduplication features are closely examined, and the behavior of data deduplication under its various parameters is explained and analyzed.
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION... (ijcsit)
The wide usage of the Internet and the availability of powerful computers and high-speed networks as low-cost commodity components have had a deep impact on the way we use computers today: these technologies have made it possible to use multi-owner, geographically distributed resources to address large-scale problems in areas such as science, engineering and commerce. The paradigm of grid computing has evolved from research on these topics. Performance and utilization of the grid depend on a complex and highly dynamic procedure for optimally balancing the load among the available nodes. In this paper, we suggest a novel two-dimensional figure of merit that captures the network's effect on load balance and fault tolerance estimation, in order to improve network utilization. Fault tolerance is enhanced by adaptively decreasing replication time and message cost, while load balance is improved by adaptively decreasing the mean job response time. Finally, Genetic Algorithm, Ant Colony Optimization and Particle Swarm Optimization are analyzed with regard to their solutions, issues and improvements concerning load balancing in the computational grid. A significant improvement in system utilization was attained, and experimental results demonstrate that the proposed method's performance surpasses that of other methods.
The growth of the Internet of Things and wireless technology has led to an enormous generation of data for various applications such as healthcare, scientific and data-intensive workloads. Cloud-based Storage Area Networks (SANs) have been widely used in recent times for storing and processing these data. Providing fault-tolerant and continuous access to data with minimal latency and cost is challenging, and an efficient fault-tolerance mechanism is required. Data replication is an efficient fault-tolerance mechanism that has been considered by existing methodologies. However, data replica placement is challenging, and existing methods are not efficient with respect to the dynamic application requirements of cloud-based storage area networks; this incurs latency and, in turn, a higher cost of data transmission. This work presents an efficient replica placement and transmission technique, Bipartite Graph based Data Replica Placement (BGDRP), that aids in minimizing latency and computing cost. The performance of BGDRP is evaluated using a real-time scientific application workflow; the outcome shows that BGDRP minimizes data access latency, computation time and cost compared with state-of-the-art techniques.
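A minimal sketch of replica placement modeled as a weighted bipartite graph between data blocks and storage nodes is shown below. The latency matrix, node capacities and the greedy assignment rule are illustrative assumptions, not the BGDRP algorithm itself.

```python
# Blocks on one side, candidate nodes on the other; edge weight = expected access
# latency (ms) of serving that block from that node (hypothetical values).
latency = {
    ("block-A", "node-1"): 4, ("block-A", "node-2"): 9, ("block-A", "node-3"): 6,
    ("block-B", "node-1"): 7, ("block-B", "node-2"): 3, ("block-B", "node-3"): 8,
    ("block-C", "node-1"): 5, ("block-C", "node-2"): 6, ("block-C", "node-3"): 2,
}
capacity = {"node-1": 1, "node-2": 1, "node-3": 2}   # replicas each node can host

def place_replicas(latency, capacity):
    """Greedy matching: repeatedly pick the cheapest remaining edge whose node
    still has capacity and whose block has not yet been placed."""
    placement, remaining = {}, dict(capacity)
    for (block, node), cost in sorted(latency.items(), key=lambda kv: kv[1]):
        if block not in placement and remaining[node] > 0:
            placement[block] = (node, cost)
            remaining[node] -= 1
    return placement

print(place_replicas(latency, capacity))
```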
Cloud computing is one of the emerging techniques for processing big data, a term that refers to very large collections or volumes of data. Processing big data (such as MRI and DICOM images) normally takes more time than other data. Tasks such as handling big data can be addressed using Hadoop, and enhancing Hadoop helps the user to process large sets of images or data. The Advanced Hadoop Distributed File System (AHDF) and MapReduce are the two main functions used to enhance Hadoop. HDFS is Hadoop's file storage system, used for storing and retrieving data, while MapReduce combines two functions, map and reduce: map splits the inputs, and reduce integrates the outputs of the map step. Medical settings have recently experienced problems such as machine failure and fault tolerance while processing results for scanned data. A time-optimized scheduling algorithm, the Advanced Dynamic Handover Reduce Function (ADHRF), is therefore introduced into the reduce function. The enhancement of Hadoop and the introduction of ADHRF in the cloud help to overcome processing risks and obtain optimized results with less waiting time and a lower error percentage in the output image.
Data Deduplication: Venti and its improvements (Umair Amjad)
Data is now available in digital form everywhere. To store this massive amount of data, the storage methodology should be efficient and intelligent enough to detect redundant data before saving it. Data deduplication techniques are widely used by storage servers to eliminate the possibility of storing multiple copies of the same data: deduplication identifies duplicate data portions that are about to be stored and also removes duplication in data already stored, yielding significant cost savings. This paper is about data deduplication; taking Venti as the base case, it discusses the system in detail and also identifies areas of improvement in Venti that are addressed by other papers.
Performance evaluation and estimation model using regression method for Hadoop word count (redpel dot com)
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE (ijdpsjournal)
Big Data has introduced the challenge of storing and processing large volumes of data (text, images and videos). Centralised processing of massive data on a single node is no longer viable, leading to the emergence of distributed storage, parallel processing, and hybrid distributed storage and parallel processing frameworks. The main objective of this paper is to evaluate the load balancing and task allocation strategy of our hybrid distributed storage and parallel processing framework, CLOAK-Reduce. To achieve this goal, we first present a theoretical study of the architecture and operation of some DHT-based MapReduce systems; we then compare, by simulation, the data collected from their load balancing and task allocation strategies. The simulation results show that CLOAK-Reduce C5R5 replication provides better load balancing efficiency for MapReduce job submission, with 10% churn or no churn.
Provable Multicopy Dynamic Data Possession in Cloud Computing Systems (1crore projects)
I-Sieve: An Inline High Performance Deduplication System Used in Cloud Storage (redpel dot com)
A Hybrid Cloud Approach for Secure Authorized De-Duplication (Editor IJMTER)
Cloud backup is used for people's personal storage, reducing maintenance effort and managing structure and storage space. The challenging part is the deduplication process, in both local and global backup deduplication. Prior work provides either local storage deduplication or global storage deduplication to improve storage capacity and processing time. In this paper, the proposed system, called ALG-Dedupe (Application-aware Local-Global source Deduplication), provides efficient deduplication with low system load, a shortened backup window, and increased power efficiency for the user's personal storage. In the proposed system, large data is partitioned into smaller parts called chunks, and any redundancy in the data is removed before it is stored in the storage area.
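The local-then-global source deduplication flow described above can be sketched roughly as follows. The hash function, the in-memory local index, and the stand-in for the remote server index are illustrative assumptions rather than the ALG-Dedupe design.

```python
import hashlib

local_index = set()     # fingerprints of chunks this client has already backed up
global_index = set()    # fingerprints known to the backup server (stub for a remote call)

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def backup_chunk(chunk: bytes) -> str:
    """Check the cheap local index first, then the global index,
    and upload the chunk only if neither side has seen it."""
    fp = fingerprint(chunk)
    if fp in local_index:
        return "skipped (local duplicate)"
    if fp in global_index:                 # in practice: one round trip to the server
        local_index.add(fp)
        return "skipped (global duplicate)"
    global_index.add(fp)                   # in practice: upload the chunk data
    local_index.add(fp)
    return "uploaded"

if __name__ == "__main__":
    print(backup_chunk(b"report-v1"))      # uploaded
    print(backup_chunk(b"report-v1"))      # skipped (local duplicate)
    local_index.clear()                    # simulate a second client on the same server
    print(backup_chunk(b"report-v1"))      # skipped (global duplicate)
```

Checking the local index first keeps most duplicate detection off the network, while the global check still removes cross-client redundancy on the server.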
Efficient usage of memory management in big data using "anti-caching" (eSAT Journals)
Abstract: The increase in the capacity of main memory, coupled with its decreasing cost, has fueled the development of in-memory database systems that manage data entirely in memory, thereby eliminating the disk I/O bottleneck. However, in the Big Data era, keeping all data in memory is impossible and even unnecessary. Ideally, one would like the high access speed of memory combined with the large capacity and low price of disk, which hinges on the ability to effectively utilize both main memory and disk. This paper analyzes state-of-the-art approaches to achieving this goal for in-memory databases, referred to as "anti-caching" to distinguish them from traditional caching mechanisms. Extensive experiments were conducted to study the effect of each fine-grained component of the anti-caching process on both performance and prediction accuracy. To avoid interference from unrelated components of specific systems, these approaches are implemented on a uniform platform to ensure a fair comparison; the usability of each approach, and how intrusive it is to the systems that adopt it, is also studied. Based on the results, guidelines for designing a good anti-caching approach are given, and a general, efficient approach is sketched that can be used in most in-memory database systems without much code modification. Key Words: Caching, Anti-Caching, State-of-the-art approach, in-memory database systems.
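A very small sketch of the anti-caching idea, evicting cold tuples from memory to disk rather than caching hot disk pages in memory, is given below. The LRU policy, the capacity value and the on-disk dictionary stand-in are assumptions for illustration only.

```python
from collections import OrderedDict

class AntiCachingTable:
    """Keep hot tuples in memory; when memory is full, evict the coldest tuple
    to a disk store and bring it back only on access (the reverse of caching)."""
    def __init__(self, memory_capacity: int):
        self.capacity = memory_capacity
        self.memory = OrderedDict()   # tuple id -> tuple, ordered by recency
        self.disk = {}                # stand-in for the on-disk anti-cache

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        if len(self.memory) > self.capacity:
            cold_key, cold_value = self.memory.popitem(last=False)
            self.disk[cold_key] = cold_value          # evict the coldest tuple to disk

    def get(self, key):
        if key not in self.memory:                    # "un-evict" on access
            self.put(key, self.disk.pop(key))
        self.memory.move_to_end(key)
        return self.memory[key]

table = AntiCachingTable(memory_capacity=2)
for k in ("t1", "t2", "t3"):
    table.put(k, {"id": k})
print(sorted(table.memory), sorted(table.disk))   # ['t2', 't3'] ['t1']
print(table.get("t1")["id"], sorted(table.disk))  # t1 is back in memory; t2 was evicted
```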
Query optimization in OODBMS: identifying subqueries for query management (ijdms)
This paper is based on a relatively new approach to query optimization in object databases, which uses query decomposition and cached query results to improve the execution of a query. The issues in focus are fast retrieval and high reuse of cached queries, decomposition of a query into subqueries, and decomposition of complex queries into smaller ones for fast retrieval of results. We also try to address another open area of query caching, handling wider queries: parts of cached results can help answer other (wider) queries, and many cached queries can be combined while producing the result; a toy illustration follows below. Multiple experiments were performed to demonstrate the productivity of this way of optimizing a query. The limitation of the technique is that it is useful mainly in scenarios where the data manipulation rate is very low compared with the data retrieval rate.
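The toy illustration below shows the idea of decomposing a query and reusing cached subquery results. Representing a query as a set of predicates and keying the cache on that set are simplifications assumed for the example, not the paper's actual optimizer.

```python
# A query is represented here simply as a frozenset of predicates; a cached
# subquery can answer part of a wider query if its predicates are a subset.
query_cache = {}   # frozenset of predicates -> materialized result (list of ids)

def run_against_store(predicates):
    """Stand-in for the expensive object-database scan."""
    print(f"  executing against store: {sorted(predicates)}")
    return [f"obj-{abs(hash(p)) % 100}" for p in sorted(predicates)]

def execute(predicates: frozenset):
    if predicates in query_cache:                      # exact cache hit
        return query_cache[predicates]
    # Find the largest cached subquery contained in this wider query.
    usable = [k for k in query_cache if k < predicates]
    if usable:
        base = max(usable, key=len)
        print(f"  reusing cached subquery: {sorted(base)}")
        result = query_cache[base] + run_against_store(predicates - base)
    else:
        result = run_against_store(predicates)
    query_cache[predicates] = result
    return result

execute(frozenset({"dept = 'CS'"}))                      # cold: full scan
execute(frozenset({"dept = 'CS'", "year = 2017"}))       # reuses the cached subquery
```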
Study on potential capabilities of a noDB system (ijitjournal)
There is a need for an optimal data-to-query processing technique to handle the increasing database size, complexity and diversity of use. With the rise of commercial websites and social networks, the expectation is that highly scalable, more flexible databases will replace the RDBMS. Complex applications and Big Table workloads require highly optimized queries, and users are facing increasing bottlenecks in their data analysis. A growing part of the database community recognizes the need for significant and fundamental changes to database design. A new philosophy for creating database systems, called noDB, aims at minimizing the data-to-query time, most prominently by removing the need to load data before launching queries: queries are processed without any data preparation or loading steps, and there may be no need to store the data at all, since users can pipe raw data from websites, databases or spreadsheets directly as input without storing anything. This study is based on PostgreSQL systems. A series of baseline experiments is executed to evaluate the performance of the system with respect to (a) data loading cost, (b) query processing time, (c) avoidance of collisions and deadlocks, (d) enabling big data storage and (e) optimizing query processing. The study finds significant potential capabilities of the noDB system over traditional database management systems.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items; it is also an elementary foundation for supervised learning, which encompasses classifiers and feature extraction methods. Applying this algorithm is crucial for understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing such data requires state-of-the-art computing machines whose infrastructure is expensive to set up. Hence a distributed environment such as a cluster is employed for tackling such scenarios. Apache Hadoop is one such cluster framework for distributed environments; it helps by distributing voluminous data across a number of nodes in the cluster. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
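A compact sketch of one map/reduce pass of Apriori, counting candidate itemsets across partitioned transactions, is given below. The in-process map and reduce functions stand in for Hadoop tasks, and the minimum-support value and candidate size are assumptions for the example.

```python
from itertools import combinations
from collections import defaultdict

MIN_SUPPORT = 2      # assumed minimum support count
ITEMSET_SIZE = 2     # counting candidate 2-itemsets in this pass

def mapper(transaction):
    """Emit (candidate itemset, 1) for every k-itemset in one transaction."""
    for itemset in combinations(sorted(set(transaction)), ITEMSET_SIZE):
        yield itemset, 1

def reducer(pairs):
    """Sum the counts per candidate and keep the frequent ones."""
    counts = defaultdict(int)
    for itemset, count in pairs:
        counts[itemset] += count
    return {k: v for k, v in counts.items() if v >= MIN_SUPPORT}

transactions = [            # each line of the input split is one transaction
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
]
# The shuffle phase is implicit here: all mapper outputs are fed to one reducer.
emitted = [pair for t in transactions for pair in mapper(t)]
print(reducer(emitted))     # frequent 2-itemsets with their support counts
```

In a real Hadoop job, each pass over candidate size k would be one MapReduce job, with the frequent (k-1)-itemsets distributed to the mappers to prune candidates before counting.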
Abstract: Cloud storage is usually a distributed infrastructure in which data is not stored on a single device but spread over several storage nodes located in different areas. To ensure data availability, some amount of redundancy has to be maintained, but introducing redundancy leads to additional costs such as extra storage space and the communication bandwidth required for restoring data blocks. Existing systems treat the storage infrastructure as homogeneous, where all nodes have the same online availability, which leads to efficiency losses. The proposed system considers a heterogeneous distributed storage system in which each node exhibits a different online availability. Monte Carlo sampling is used to measure the online availability of storage nodes, and a parallel version of Particle Swarm Optimization is used to assign redundant data blocks according to that availability. The optimal data assignment policy reduces the redundancy and its associated cost.
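A minimal sketch of Monte Carlo sampling for per-file availability under heterogeneous node availabilities is shown below. The node availabilities, the (n, k) redundancy scheme and the trial count are illustrative assumptions, not values from the paper.

```python
import random

def estimate_availability(node_availability, k, trials=100_000):
    """Estimate the probability that at least k of the placed blocks are online,
    given each hosting node's individual online availability."""
    successes = 0
    for _ in range(trials):
        online = sum(random.random() < p for p in node_availability)
        if online >= k:
            successes += 1
    return successes / trials

# A file split into n = 5 redundant blocks, any k = 3 of which suffice to rebuild it;
# each block sits on a node with a different (hypothetical) online availability.
node_availability = [0.99, 0.95, 0.90, 0.80, 0.70]
print(f"estimated file availability: {estimate_availability(node_availability, k=3):.4f}")
```

An optimizer such as PSO can then search over candidate block-to-node assignments, using an estimate like this as (part of) the fitness function.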
Secure and Efficient Client and Server Side Data Deduplication to Reduce Storage in Remote Cloud Computing Systems

Hemanth Chandra N. and Sahana D. Gowda
Dept. of Computer Science, B.N.M Institute of Technology, Bangalore, India

IDL - International Digital Library of Technology & Research: International e-Journal For Technology And Research, Volume 1, Issue 5, May 2017. Available at: www.dbpublications.org
Abstract: Duplication of data in storage systems is becoming an increasingly common problem. The system introduces I/O Deduplication, a storage optimization that exploits content similarity to improve I/O performance by eliminating I/O operations, reducing the mechanical delays incurred during the remaining I/O operations, and sharing data with existing users when duplicates are found on the client or server side. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval and selective duplication. Each of these techniques is motivated by our observations with I/O workload traces obtained from actively used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data.

Keywords: Deduplication, POD, Data Redundancy, Storage Optimization, iDedup, I/O Deduplication, I/O Performance.
1. INTRODUCTION
Duplication of data in primary storage systems is quite common due to the technological trends that have been driving storage capacity consolidation. The elimination of duplicate content at both the file and block levels to improve storage space utilization is an active area of research; indeed, eliminating most duplicate content is inevitable in capacity-sensitive applications such as archival storage, for cost effectiveness. On the other hand, there are systems with a moderate degree of content similarity in their primary storage, such as email servers, virtualized servers and NAS devices running file and version control servers. In email servers, mailing lists, circulated attachments and SPAM can lead to duplication. Virtual machines may run similar software and thus create collocated duplicate content across their virtual disks. Finally, file and version control servers of collaborative groups often store copies of the same documents, sources and executables. In such systems, if the degree of content similarity is not overwhelming, eliminating duplicate data may not be a primary concern.

Gray and Shenoy [1] have pointed out that, given the technology trends for price-capacity and price-performance of memory/disk sizes and disk accesses respectively, disk data must "cool" at the rate of 10X per decade, and they suggest data replication as a means to this end. An instantiation of this suggestion is the intrinsic replication of data created by consolidation, as seen in many storage systems including the ones illustrated earlier. Here we refer to intrinsic (application- or user-generated) data replication, as opposed to forced (system-generated) redundancy such as in a RAID-1 storage system. In such systems, capacity constraints are invariably secondary to I/O performance.
We analyzed on-disk duplication of content and I/O traces obtained from three varied production systems at Florida International University (FIU): a virtualized host running two department web servers, the department email server, and a file server for our research group. Three observations emerge from the analysis of these traces. First, the analysis revealed significant levels of both disk static similarity and workload static similarity within each of these systems. Disk static similarity indicates the amount of duplicate content in the storage medium, while workload static similarity indicates the degree of on-disk duplicate content accessed by the I/O workload; these similarity measures are defined formally in Section 2. Second, a consistent and marked discrepancy was discovered between reuse distances for sectors and for content in the I/O accesses on these systems, indicating that content is reused more frequently than sectors (the sketch below illustrates how this is measured). Third, there is significant overlap in content accessed over successive intervals of longer time frames such as days or weeks.
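To make the second observation concrete, the same access trace can be measured twice, once keyed by sector and once keyed by a content fingerprint; when distinct sectors hold identical content, the content-keyed reuse distances come out shorter. The trace values below are hypothetical, and the reuse-distance definition (distinct keys seen between consecutive accesses to the same key) is the standard one, not necessarily the exact formulation used in the paper.

```python
def reuse_distances(keys):
    """For each access, the number of distinct keys seen since the previous
    access to the same key (None for a first access)."""
    last_seen, history, distances = {}, [], []
    for i, key in enumerate(keys):
        if key in last_seen:
            distances.append(len(set(history[last_seen[key] + 1:i])))
        else:
            distances.append(None)
        last_seen[key] = i
        history.append(key)
    return distances

# A toy trace of (sector, content hash). Sectors 10 and 30 hold identical content "a".
trace = [(10, "a"), (20, "b"), (30, "a"), (10, "a"), (40, "c"), (30, "a")]
print(reuse_distances([sector for sector, _ in trace]))    # sector-keyed distances
print(reuse_distances([content for _, content in trace]))  # content-keyed distances
```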
Based on these observations, the premise explored is that
intrinsic content similarity in storage systems and access
to replicated content within I/O workloads can both be
utilized to improve I/O performance. To this end, I/O
Deduplication, a storage optimization that utilizes content
similarity to either eliminate I/O operations altogether
or optimize the resulting disk head movement within
the storage system, is designed and evaluated. I/O
Deduplication comprises three key techniques:
(i) content-based caching, which uses the popularity of
"data content" rather than "data location" of I/O accesses
in making caching decisions; (ii) dynamic replica retrieval,
which, upon a cache miss for a read operation, dynamically
chooses to retrieve the content replica that minimizes disk
head movement; and (iii) selective duplication, which
dynamically replicates frequently accessed content in
scratch space distributed over the entire storage medium to
increase the effectiveness of dynamic replica retrieval.
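A minimal sketch of the first technique, content-based caching, is given below. It is an illustration only: the real system operates beneath the block layer, and the hash choice, eviction policy, and data structures here are assumptions. The cache is indexed by a fingerprint of a block's contents rather than by its sector address, so two sectors holding identical data consume a single cache entry, and a read of either sector can be served from the copy cached for the other.

import hashlib
from collections import OrderedDict

class ContentCache:
    """Block cache indexed by content fingerprint instead of sector number."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.sector_to_fp = {}             # sector -> fingerprint of its contents
        self.fp_to_data = OrderedDict()    # fingerprint -> block data, in LRU order

    def lookup(self, sector):
        fp = self.sector_to_fp.get(sector)
        if fp is not None and fp in self.fp_to_data:
            self.fp_to_data.move_to_end(fp)      # refresh LRU position
            return self.fp_to_data[fp]           # hit, possibly via another sector
        return None                              # miss: caller goes to disk

    def insert(self, sector, data):
        fp = hashlib.sha1(data).hexdigest()
        self.sector_to_fp[sector] = fp           # metadata map left unbounded for brevity
        if fp not in self.fp_to_data and len(self.fp_to_data) >= self.capacity:
            self.fp_to_data.popitem(last=False)  # evict the least recently used content
        self.fp_to_data[fp] = data
        self.fp_to_data.move_to_end(fp)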
2. RELATED WORK
This paper is related to prior work on Deduplication
and caching for primary storage. Nimrod Megiddo and
Dharmendra S. Modha [20] discussed the
problem of cache management in a demand-paging
scenario with uniform page sizes. They proposed a new cache
management policy, Adaptive Replacement Cache (ARC),
that has several advantages. In response to evolving and
changing access patterns, ARC dynamically, adaptively, and
continually balances the recency and
frequency components in an online, self-tuning
fashion. ARC uses a learning rule to
adaptively and continually revise its assumptions
about the workload. ARC is empirically
universal: it empirically performs as well as a
fixed replacement policy even when the latter
uses the best workload-specific tuning parameter
selected in an offline fashion. Consequently, ARC
works uniformly well across varied workloads and
cache sizes without any need for workload-specific a
priori knowledge or tuning. Policies such as
LRU-2, 2Q, LRFU and LIRS require user-defined
parameters and, unfortunately, no single choice works
uniformly well across different workloads and cache
sizes. ARC is simple to implement and,
like LRU, has constant complexity per request; in
comparison, LRU-2 and LRFU both require
logarithmic time complexity in the cache size. ARC
is also scan-resistant: it allows one-time
sequential requests to pass through without polluting
the cache. On real-life traces drawn from numerous
domains, ARC leads to substantial performance gains
over LRU for a wide range of cache sizes.
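The adaptation at the heart of ARC can be illustrated with a deliberately simplified sketch. This is not the published ARC algorithm of [20]; the list management is reduced to its essentials, several corner cases are dropped, and the adjustment step is assumed. Two resident lists hold recently seen and frequently seen entries, two ghost lists remember recently evicted keys, and a hit in either ghost list nudges the target size p toward the list that would have caught it.

from collections import OrderedDict

class SimplifiedARC:
    """ARC-style adaptive cache (illustrative simplification, not full ARC)."""
    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                       # target share of the recency list t1
        self.t1 = OrderedDict()          # resident: seen once recently
        self.t2 = OrderedDict()          # resident: seen at least twice
        self.b1 = OrderedDict()          # ghost keys evicted from t1
        self.b2 = OrderedDict()          # ghost keys evicted from t2

    def _evict(self):
        # Evict from t1 when it exceeds its target share, otherwise from t2.
        if self.t1 and (len(self.t1) > self.p or not self.t2):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        elif self.t2:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None
        while len(self.b1) > self.c:     # keep the ghost lists bounded
            self.b1.popitem(last=False)
        while len(self.b2) > self.c:
            self.b2.popitem(last=False)

    def access(self, key, value):
        if key in self.t1:               # resident hit: promote to the frequency list
            del self.t1[key]
            self.t2[key] = value
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        ghost_hit = False
        if key in self.b1:               # ghost hit: the recency list was too small
            self.p = min(self.c, self.p + 1)
            del self.b1[key]
            ghost_hit = True
        elif key in self.b2:             # ghost hit: the frequency list was too small
            self.p = max(0, self.p - 1)
            del self.b2[key]
            ghost_hit = True
        if len(self.t1) + len(self.t2) >= self.c:
            self._evict()
        (self.t2 if ghost_hit else self.t1)[key] = value   # admit the missed block
        return False

Because p moves only on ghost hits, the split between the two lists adapts to the workload without any user-set parameter, which is the self-tuning behaviour described above.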
Aayush Gupta et al. [22] designed CA-SSD, which
employs content-addressable storage (CAS) to exploit
value locality in SSDs. The CA-SSD design employs
enhancements primarily in the flash translation layer
(FTL) with minimal additional hardware, suggesting
its feasibility. Using three real-world workloads with
content information, statistical characterizations of two
aspects of value locality, value popularity and
temporal value locality, that form the foundation of
CA-SSD are devised. CA-SSD is able to reduce average
response times by about 59-84% compared to
traditional SSDs. Even for workloads with little or no
value locality, CA-SSD continues to offer performance
comparable to a traditional SSD. These findings
advocate the adoption of CAS in SSDs, paving the way
for a new generation of these devices.
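The central mechanism in CA-SSD, mapping logical pages to physical pages through a content fingerprint so that identical values occupy a single flash page, can be sketched as follows. This is a host-side toy model under assumed structures, not the FTL design of [22]; garbage collection, hash collisions and persistence are ignored.

import hashlib

class ContentAddressedFTL:
    """Toy content-addressed mapping: identical page values share one
    physical page, tracked with reference counts."""
    def __init__(self):
        self.lpn_to_fp = {}        # logical page number -> fingerprint
        self.fp_to_ppn = {}        # fingerprint -> physical page number
        self.refcount = {}         # fingerprint -> number of logical pages using it
        self.flash = {}            # physical page number -> data (stands in for NAND)
        self.next_ppn = 0

    def write(self, lpn, data):
        fp = hashlib.sha1(data).hexdigest()
        old = self.lpn_to_fp.get(lpn)
        if old == fp:
            return                               # same value rewritten: no flash write at all
        if old is not None:                      # drop the reference to the old value
            self.refcount[old] -= 1
            if self.refcount[old] == 0:
                del self.flash[self.fp_to_ppn.pop(old)]
                del self.refcount[old]
        if fp not in self.fp_to_ppn:             # genuinely new value: allocate one flash page
            self.fp_to_ppn[fp] = self.next_ppn
            self.flash[self.next_ppn] = data
            self.next_ppn += 1
            self.refcount[fp] = 0
        self.refcount[fp] += 1
        self.lpn_to_fp[lpn] = fp

    def read(self, lpn):
        return self.flash[self.fp_to_ppn[self.lpn_to_fp[lpn]]]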
Chi Zhang, Xiang Yu, and Y. Wang [23] answered
two key questions. First, since both eager writing and
mirroring rely on extra capacity to deliver
performance improvements, how can competing
resource demands be satisfied given a fixed amount of total disk
space? Second, since eager writing allows data to be
dynamically located, how can this high degree
of location independence be exploited in an intelligent disk
scheduler? Their work addressed both questions
and compared the resulting EW-Array
prototype's performance against that of conventional
approaches. The experimental results demonstrate that
the eager-writing disk array is an effective approach to
providing scalable performance for an important class
of transaction-processing applications.
Kiran Srinivasan et al. [24] proposed an inline
Deduplication solution, iDedup, for primary
workloads, while minimizing extra I/Os and seeks. Their
algorithm is based on two key insights from real-world
workloads: (i) spatial locality exists in duplicated
primary data, and (ii) temporal locality exists in the
access patterns of duplicated data. Using the first
insight, only sequences of disk blocks are selectively
deduplicated. This reduces fragmentation and
amortizes the seeks caused by Deduplication. The
second insight allows the expensive on-disk
Deduplication metadata to be replaced with a smaller, in-memory
cache. These techniques enable capacity savings to be
traded off against performance, as demonstrated in
their evaluation with real-world workloads. That
evaluation shows that iDedup achieves 60-70% of the
maximum Deduplication with less than a 5% CPU
overhead and a 2-4% latency impact.
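The spatial-locality insight in iDedup, deduplicating only runs of consecutive blocks whose existing copies are also sequential on disk, can be illustrated with the simplified sketch below. The threshold value, the in-memory fingerprint table and the interface are assumptions made for illustration; the real system works inside the file system write path described in [24].

import hashlib

THRESHOLD = 4   # minimum run length worth deduplicating (assumed value)

def plan_write(blocks, fp_table):
    """blocks: payloads of one sequential write; fp_table: in-memory map
    fingerprint -> on-disk block address. Returns ('dedupe', addr) or
    ('write', data) actions, one per block."""
    fps = [hashlib.sha1(b).hexdigest() for b in blocks]
    actions, i = [], 0
    while i < len(blocks):
        # Grow a run of blocks that are duplicates AND sequential on disk.
        j = i
        while (j < len(blocks) and fps[j] in fp_table and
               (j == i or fp_table[fps[j]] == fp_table[fps[j - 1]] + 1)):
            j += 1
        if j - i >= THRESHOLD:
            for k in range(i, j):                 # long run: reference existing copies
                actions.append(('dedupe', fp_table[fps[k]]))
            i = j
        else:
            actions.append(('write', blocks[i]))  # short run: write normally to avoid
            i += 1                                # fragmenting later sequential reads
    return actions

Skipping short runs trades some capacity savings for read performance, which is exactly the trade-off quantified in the evaluation above.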
Jiri Schindler et al. [26] introduce proximal I/O, a new
technique for improving random disk I/O performance
in file systems. The key enabling technology for
proximal I/O is the ability of disk drives to retire
multiple I/Os, spread across dozens of tracks, in a
single revolution. Compared to traditional update-in-place
or write-anywhere file systems, this technique
can provide a nearly seven-fold improvement in
random I/O performance while maintaining (near)
sequential on-disk layout. Their paper quantifies
proximal I/O performance and proposes a simple data
layout engine that uses a flash-memory-based write
cache to aggregate random updates until they have
sufficient density to exploit proximal I/O. The results
show that with a cache of just 1% of the overall disk-based
storage capacity, it is possible to service 5.3 user
I/O requests per revolution for a random-update
workload. On an aged file system, the layout can
sustain serial read bandwidth within 3% of the best
case. Despite using flash memory, the overall system
cost is just one-third that of a system with the
requisite number of spindles to achieve the equivalent
number of random I/O operations.
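The data layout engine in [26] stages random updates in a flash-backed write cache and destages a group only once enough dirty blocks fall close together on disk to be retired within a few revolutions. A rough sketch of that batching rule follows; the geometry constants, the density threshold, and the cache structure are all assumptions made for illustration.

BLOCKS_PER_TRACK = 512    # assumed disk geometry
TRACK_GROUP = 16          # tracks treated as "proximate" (assumed)
DENSITY = 8               # destage once this many dirty blocks share a track group

class WriteStager:
    """Hold random updates in a staging cache; release them only when
    enough of them land near one another on disk."""
    def __init__(self):
        self.dirty = {}                          # disk block address -> data

    def write(self, addr, data):
        self.dirty[addr] = data
        return self._maybe_destage(addr)

    def _group(self, addr):
        return (addr // BLOCKS_PER_TRACK) // TRACK_GROUP

    def _maybe_destage(self, addr):
        group = self._group(addr)
        batch = [a for a in self.dirty if self._group(a) == group]
        if len(batch) < DENSITY:
            return []                            # not dense enough yet: keep staging
        for a in batch:                          # dense cluster: retire it together
            del self.dirty[a]
        return sorted(batch)                     # addresses to issue as one proximal burst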
2.1 EXISTING SYSTEM
Existing data Deduplication schemes for
primary storage, such as iDedup and Offline-Dedupe,
are capacity oriented in that they focus on storage
capacity savings: they select only the large requests to
deduplicate and bypass all the small requests (e.g., 4
KB, 8 KB or less). The rationale is that small I/O
requests account for only a tiny fraction of the storage
capacity requirement, making Deduplication on them
unprofitable and potentially counterproductive
given the substantial Deduplication overhead
involved. However, previous workload studies have
revealed that small files dominate in primary storage
systems (more than 50 percent) and are at the root of the
system performance bottleneck. Furthermore, due to the
buffer effect, primary storage workloads exhibit obvious
I/O burstiness. To address this important performance issue of
primary storage in the Cloud and the above
Deduplication-induced problems, a Performance-Oriented
data Deduplication scheme, called POD, rather
than a capacity-oriented one (e.g., iDedup), is proposed to improve
the I/O performance of primary storage systems in the
Cloud. Figure 1 shows the architecture of POD. Taking
the workload characteristics into account,
POD takes a two-pronged approach to improving the
performance of primary storage systems while minimizing the
performance overhead of Deduplication: a
request-based selective Deduplication technique, called
Select-Dedupe, to alleviate data fragmentation, and an
adaptive memory management scheme, called iCache,
to ease the memory contention between bursty read
traffic and bursty write traffic.
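A rough sketch of the request-based selective idea follows. It is an illustration under assumed structures rather than the published Select-Dedupe logic: each incoming write request is fingerprinted as a whole, redundant requests whose content is already on disk are removed from the critical path and merely remapped, while unique or rarely repeated requests are written in place to keep the on-disk layout sequential. The disk object and the popularity threshold are hypothetical.

import hashlib

class SelectiveDedupe:
    """Illustrative request-based selective Deduplication (assumed logic)."""
    def __init__(self, popularity_threshold=2):
        self.fp_store = {}                 # fingerprint -> on-disk address of the data
        self.seen = {}                     # fingerprint -> times this content was written
        self.threshold = popularity_threshold

    def handle_write(self, request_addr, data, disk):
        # disk is an assumed object exposing write(addr, data) and remap(addr, src_addr).
        fp = hashlib.sha1(data).hexdigest()
        self.seen[fp] = self.seen.get(fp, 0) + 1
        if fp in self.fp_store and self.seen[fp] >= self.threshold:
            disk.remap(request_addr, self.fp_store[fp])   # redundant, popular: skip the write
            return 'deduplicated'
        disk.write(request_addr, data)                    # unique or cold: write in place to
        self.fp_store.setdefault(fp, request_addr)        # preserve sequential layout
        return 'written'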
2.2 PROPOSED SYSTEM (POD vs iDedup)
A possible future direction is to optionally coalesce or
even eliminate altogether write I/O operations for
content that is already duplicated elsewhere on the
disk, or alternatively direct such writes to alternate
locations in the scratch space.
Figure 1. Architecture diagram of POD.
While the first option might seem similar to data
Deduplication at a high level, a primary focus on the
performance implications of such optimizations, rather
than on capacity improvements, is suggested. Any
optimization for writes affects the read-side
optimizations of I/O Deduplication, so a careful
analysis and evaluation of the trade-off points in this
design space is important; in addition, data is shared with
existing users if duplicate content is found on the client or the server side.
Advantages:
1. Requires less space on the storage server.
2. Data Deduplication is performed completely across all blocks.
3. IMPLEMENTATION
POD
In this paper, the system proposes POD, a
performance-oriented Deduplication scheme, to
improve the performance of primary storage systems
in the Cloud by leveraging data Deduplication on the
I/O path to remove redundant write requests while also
saving storage space. Figure 2 shows the sequence
diagram, which illustrates how the flow of control
passes between the modules. POD takes a
request-based selective Deduplication approach
(Select-Dedupe) to deduplicating the I/O redundancy
on the critical I/O path in such a way that it minimizes
the data fragmentation problem. Meanwhile, an
intelligent cache management scheme (iCache) is employed in
POD to further improve read performance and
increase space savings by adapting to I/O burstiness.
Extensive trace-driven evaluations show that POD
significantly improves the performance and saves the
capacity of primary storage systems in the Cloud.
Figure 2. Sequence diagram of POD.
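The iCache idea, shifting memory between the read cache and the write buffer as the workload turns read-bursty or write-bursty, can be sketched as follows. The window size, the burst test, and the adjustment step are assumptions made for illustration; the actual iCache policy in POD is more involved.

class AdaptiveCachePartition:
    """Split a fixed memory budget between a read cache and a write buffer,
    nudging the boundary toward whichever side is currently bursty."""
    def __init__(self, total_slots, step=8, window=1000):
        self.total = total_slots
        self.read_slots = total_slots // 2       # current share given to the read cache
        self.step = step
        self.window = window
        self.reads = 0
        self.writes = 0

    def record(self, is_write):
        if is_write:
            self.writes += 1
        else:
            self.reads += 1
        if self.reads + self.writes >= self.window:
            self._rebalance()

    def _rebalance(self):
        if self.writes > 2 * self.reads:         # write burst: grow the write buffer
            self.read_slots = max(self.step, self.read_slots - self.step)
        elif self.reads > 2 * self.writes:       # read burst: grow the read cache
            self.read_slots = min(self.total - self.step, self.read_slots + self.step)
        self.reads = self.writes = 0

    @property
    def write_slots(self):
        return self.total - self.read_slots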
iDedup
The system also describes iDedup, an inline
Deduplication system specifically targeting
latency-sensitive primary storage workloads, and
compares it with newer techniques such as POD, also
referred to here as I/O Deduplication. With latency-sensitive
workloads, inline Deduplication faces several
challenges: fragmentation leading to extra disk seeks
for reads, Deduplication processing overheads on the
critical I/O path, and extra latency caused by I/Os for
dedup-metadata management.
3.1 COMPARATIVE ANALYSIS
In this section, the performance of the POD and iDedup
Deduplication models is evaluated through extensive
trace-driven experiments.
Figure 3. The time-delay performance of the different
Deduplication schemes: (a) time delay in iDedup; (b) time delay in POD.
A prototype of POD has been implemented as a module
in the Linux operating system, and trace-driven
experiments are used to evaluate its effectiveness and
efficiency. In this paper, POD is compared with the
capacity-oriented scheme iDedup. The two models
(POD vs. iDedup) are compared on two parameters:
time delay and CPU utilization.
Figure 4. The CPU utilization of the different
Deduplication schemes: (a) CPU utilization in iDedup; (b) CPU utilization in POD.
For the first parameter, time delay, four production
files were used in the trace-driven evaluation. Figure 3(a)
shows the time delay for the iDedup model, and
Figure 3(b) shows the time delay for the same files in
the POD model. For both Deduplication models, the
same files, named cmain, owner, clog, and search, were
uploaded. The results show that POD is more efficient
than iDedup.
For the second parameter, CPU utilization, a single file
named abc.txt was uploaded to both the POD and
iDedup models. Figure 4(a) shows the CPU utilization
in the iDedup model, and Figure 4(b) shows the CPU
utilization in the POD model. The results again show
that POD is more efficient than the iDedup model.
4. CONCLUSION
System and storage consolidation trends are driving
increased duplication of data within storage systems.
Past efforts have been primarily directed towards the
elimination of such duplication for improving storage
capacity utilization. With I/O Deduplication, a
contrary view is taken: intrinsic duplication in a
class of systems that are not capacity-bound can be
effectively utilized to improve I/O performance, the
traditional Achilles' heel of storage systems. Three
techniques contained within I/O Deduplication work
together to either optimize I/O operations or eliminate
them altogether. An in-depth evaluation of these
mechanisms revealed that together they reduced
average disk I/O times by 28-47%, a large
improvement that can directly impact the
overall application-level performance of disk-I/O-bound
systems. The content-based caching mechanism
increased memory caching effectiveness, raising
cache hit rates by 10% to 4x for read operations when
compared to traditional sector-based caching.
Head-position-aware dynamic replica retrieval directed I/O
operations to alternate locations on the fly and
additionally reduced I/O times by 10-20%. Selective
duplication created additional replicas of popular
content during periods of low foreground I/O activity
and further improved the effectiveness of dynamic
replica retrieval by 23-35%.
FUTURE WORK
I/O Deduplication opens up several directions for
future work. One avenue is to explore
content-based optimizations for write I/O operations:
a possible direction is to optionally coalesce or
even eliminate altogether write I/O operations for
content that is already duplicated elsewhere on the
disk, or alternatively direct such writes to alternate
locations in the scratch space. While the first option
might seem similar to data Deduplication at a high
level, a primary focus on the performance implications
of such optimizations, rather than on capacity
improvements, is suggested. Any optimization for
writes affects the read-side optimizations of I/O
Deduplication, so a careful analysis and evaluation of
the trade-off points in this design space is important;
in addition, data is shared with existing users if duplicate
content is found on the client or the server side.
REFERENCES
[1] Jim Gray and Prashant Shenoy. Rules of Thumb in
Data Engineering. Proc. of the IEEE International
Conference on Data Engineering, February 2000.
[2] Charles B. Morrey III and Dirk Grunwald.
Peabody: The Time Travelling Disk. In Proc. of the
IEEE/NASA MSST, 2003.
[3] Windsor W. Hsu, Alan Jay Smith, and Honesty C.
Young. The Automatic Improvement of Locality in
Storage Systems. ACM Transactions on Computer
Systems, 23(4):424–473, Nov 2005.
[4] S. Quinlan and S. Dorward. Venti: A New
Approach to Archival Storage. Proc. of the USENIX
Annual Technical Conference on File and Storage
Technologies, January 2002.
[5] Cyril U. Orji and Jon A. Solworth. Doubly
distorted mirrors. In Proceedings of the ACM
SIGMOD, 1993.
[6] Medha Bhadkamkar, Jorge Guerra, Luis Useche,
Sam Burnett, Jason Liptak, Raju Rangaswami, and
Vagelis Hristidis. BORG: Block-reORGanization for
Selfoptimizing Storage Systems. In Proc. of the
USENIX Annual Technical Conference on File and
Storage Technologies, February 2009.
[7] Sergey Brin, James Davis, and Hector Garcia-
Molina. Copy Detection Mechanisms for Digital
Documents. In Proc. of ACM SIGMOD, May 1995.
[8] Austin Clements, Irfan Ahmad, Murali Vilayannur,
and Jinyuan Li. Decentralized Deduplication in san
cluster file systems. In Proc. of the USENIX Annual
Technical Conference, June 2009.
[9] Burton H. Bloom. Space/time trade-offs in hash
coding with allowable errors. Communications of the
ACM, 13(7):422–426, 1970.
[10] Daniel Ellard, Jonathan Ledlie, Pia Malkani, and
Margo Seltzer. Passive NFS Tracing of Email and
Research Workloads. In Proc. of the USENIX Annual
Technical Conference on File and Storage
Technologies, March 2003.
[11] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M.
Tracey. Redundancy Elimination Within Large
Collections of Files. Proc. of the USENIX Annual
Technical Conference, 2004.
[12] Binny S. Gill. On multi-level exclusive caching:
offline optimality and why promotions are better than
demotions. In Proc. of the USENIX Annual Technical
Conference on File and Storage Technologies,
February 2008.
[13] Diwaker Gupta, Sangmin Lee, Michael Vrable,
Stefan Savage, Alex C. Snoeren, George Varghese,
Geoffrey Voelker, and Amin Vahdat. Difference
Engine: Harnessing Memory Redundancy in Virtual
Machines. Proc. of the USENIX OSDI, December
2008.
[14] Hai Huang, Wanda Hung, and Kang G. Shin.
FS2: Dynamic Data Replication In Free Disk Space
For Improving Disk Performance And Energy
Consumption. In Proc. of the ACM SOSP, October
2005.
[15] Jorge Guerra, Luis Useche, Medha Bhadkamkar,
Ricardo Koller, and Raju Rangaswami. The Case for
Active Block Layer Extensions. ACM Operating
Systems Review, 42(6), October 2008.
[16] N. Jain, M. Dahlin, and R. Tewari. TAPER:
Tiered Approach for Eliminating Redundancy in
Replica Synchronization. In Proc. of the USENIX
Conference on File and Storage Systems, 2005.
[17] Song Jiang, Feng Chen, and Xiaodong Zhang.
Clock-pro: An effective improvement of the clock
replacement. In Proc. of the USENIX Annual
Technical Conference, April 2005.
[18] Andrew Leung, Shankar Pasupathy, Garth
Goodson and Ethan Miller. Measurement and Analysis
of Large-Scale Network File System Workloads. Proc.
of the USENIX Annual Technical Conference, June
2008.
[19] Xuhui Li, Ashraf Aboulnaga, Kenneth Salem,
Aamer Sachedina, and Shaobo Gao. Second-tier cache
management using write hints. In Proc. of the
USENIX Annual Technical Conference on File and
Storage Technologies, 2005.
[20] Nimrod Megiddo and D. S. Modha. Arc: A self-
tuning, low overhead replacement cache. In Proc. of
USENIX Annual Technical Conference on File and
Storage Technologies, 2003.
[21] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L.
Traiger. Evaluation techniques for storage hierarchies.
IBM Systems Journal, 9(2):78–117, 1970.
[22] Aayush Gupta, Raghav Pisolkar, Bhuvan
Urgaonkar, and Anand Sivasubramaniam. Leveraging
Value Locality in Optimizing NAND Flash-based
SSDs, 2011.
[23] Chi Zhang, Xiang Yu, Y. Wang. Configuring and
Scheduling an Eager-Writing Disk Array for a
Transaction Processing Workload. 2002.
[24] Kiran Srinivasan, Tim Bisson, Garth Goodson,
Kaladhar Voruganti. iDedup: Latency-aware, Inline
Data Deduplication for Primary Storage. 2012.
[25] Dina Bitton and Jim Gray. Disk Shadowing. In
Proc. of the International Conference on Very Large
Data Bases, 1988.
[26] Jiri Schindler, Sandip Shete, Keith A. Smith.
Improving Throughput for Small Disk Requests with
Proximal I/O. 2011.
[27] Mark Lillibridge, Kave Eshghi, Deepavali
Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter
Camble. Sparse indexing: large scale, inline
deduplication using sampling and locality. In Proc. of
the USENIX Annual Technical Conference on File
and Storage Technologies, February 2009.