Data Deduplication: Venti and its improvements
Umair Amjad
12-5044
umairamjadawan@gmail.com
Department of Computer Science, National University of Computer and Emerging Sciences, Pakistan
Abstract
The entire world is adopting digital technologies, moving from legacy approaches to digital ones, and data is
the primary asset that now exists in digital form everywhere. To store this massive volume of data, storage
systems must be efficient and intelligent enough to detect redundant data before saving it. Data
deduplication techniques are widely used by storage servers to eliminate the possibility of storing
multiple copies of the same data. Deduplication identifies duplicate portions of data before they are written
to a storage system and also removes duplication in data already stored, yielding significant
cost savings. This paper surveys data deduplication, takes Venti as a base case, discusses it in detail, and
identifies areas for improvement in Venti that have been addressed by other papers.
Keywords – Data deduplication; data storage; hash index; Venti; archival data
1. Introduction
The world is producing a large amount of
digital data, and the volume is growing rapidly.
According to one study, the amount of information
added to the digital universe is growing by 57% per
year. This whopping growth of information places a
considerable load on storage systems. Thirty-five
percent of this information is generated by
enterprises and therefore must be retained for
regulatory compliance and legal reasons, so it is
critical to back up the data regularly to a disaster
recovery site for data availability and integrity.
Rapidly growing data poses many challenges to
existing storage systems. One observation is
that a significant fraction of information contains
duplicates, due to reasons such as backups,
copies, and version updates. Thus, deduplication
techniques have been invented to avoid storing
redundant information.
A number of trends have motivated the
creation of deduplication solutions. Archival
systems such as Venti have identified significant
information redundancy within and across
machines due to update versions and commonly
installed applications and libraries. In addition to
storage overhead, duplicate file content can also
have other negative effects on the system. As files
are accessed, they are cached in memory and in
the hard disk cache. Duplicate content can
consume unnecessary memory cache that could
be used to cache additional unique content.
Deduplication solves these issues by locating
identical content and handling it appropriately.
Instead of storing the same file content multiple
times, we can have a new file that references the
identical content already stored in the system. The
use of deduplication results in more efficient use of
both memory cache and storage capacity.
This paper takes Venti as a base case for data
deduplication and examines its missing areas. After
these missing areas are identified, solutions are
proposed with reference to other research papers.
2. Background
In storage archives, a large quantity of data
is redundant or only slightly changed relative to other
chunks of data. The term data deduplication refers to
techniques that save only a single instance of
replicated data and provide links to that instance
instead of storing additional copies of the same
data. Many techniques exist for eliminating
redundancy from stored data, and at present data
deduplication has gained popularity in the research
community. Data deduplication is a specialized data
compression technique for eliminating redundant
data, typically to improve storage utilization. In the
deduplication process, redundant data is simply
referenced rather than stored.
With the evolution of backup services from tape to
disk, data deduplication has become a key
element of the backup process. It ensures that
only one copy of a given piece of data is saved in the
data center; every user who wants to access that
data is linked to the single stored instance. Data
deduplication therefore helps reduce the size of the
data center. In other words, deduplication means
that the replication of data that is commonly
duplicated, for example in the cloud, is controlled
and managed to shrink the physical storage space
required for such replicas. The basic steps for
deduplication are listed below, followed by a small
sketch of the pipeline:
1. In the first step, files are divided into small
segments.
2. After the segments are created, new and
existing data are checked for similarity by
comparing fingerprints produced by a hashing
algorithm.
3. Metadata structures are then updated.
4. Segments are compressed.
5. All duplicate data is deleted and a data
integrity check is performed.
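A minimal Python sketch of these steps is shown below; the fixed segment size, SHA-1 fingerprints, zlib compression, and in-memory dictionary are illustrative assumptions rather than the design of any particular system.

```python
import hashlib
import zlib

SEGMENT_SIZE = 4096  # assumed fixed segment size (bytes)

def deduplicate(data: bytes, store: dict) -> list:
    """Steps 1-4: segment the data, fingerprint each segment, update the
    metadata 'recipe', and store each new segment in compressed form."""
    recipe = []
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        fp = hashlib.sha1(segment).hexdigest()      # fingerprint via hashing
        if fp not in store:                         # unseen segment: keep one copy
            store[fp] = zlib.compress(segment)
        recipe.append(fp)                           # duplicates only add a reference
    return recipe

def reconstruct(recipe: list, store: dict) -> bytes:
    """Step 5 (integrity): rebuild the data and verify each segment's fingerprint."""
    out = []
    for fp in recipe:
        segment = zlib.decompress(store[fp])
        assert hashlib.sha1(segment).hexdigest() == fp
        out.append(segment)
    return b"".join(out)

store = {}
original = b"abc" * 10000
assert reconstruct(deduplicate(original, store), store) == original
```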
2.1 Types of Data Deduplication
There are two major categories of data
deduplication on which all research is based.
1. Offline data deduplication (target-based): In offline
deduplication, data is first written to the storage
disk and the deduplication process takes place at a
later time. It is performed at the target data storage
center. In this case the client is unmodified and is
not aware of any deduplication. This approach
improves storage utilization and does not make
clients wait for hash-based calculations, but it does
not save bandwidth.
2. Online data deduplication (source-based): In online
deduplication, duplicate data is removed before
being written to the storage disk. Deduplication is
performed on the data at the source, before it is
transferred. A deduplication-aware backup agent is
installed on the client, which backs up only unique
data. The result is increased bandwidth and storage
efficiency, but this places extra computational load
on the backup client. Duplicates are replaced by
pointers, and the actual duplicate data is never sent
over the network.
Once the timing of data deduplication has
been decided, there are a number of existing
techniques that can be applied. The most widely
used deduplication approaches are file-level hashing
and block-level hashing.
1. File-level hashing: In file-level hashing, the whole
file is passed to a hashing function, typically a
cryptographic hash such as MD5 or SHA-1. The
cryptographic hash is used to find entirely duplicate
files. This approach is fast, with low computation
and low additional metadata overhead, and it works
very well for complete system backups, where fully
duplicate files are common. However, the coarse
granularity of duplicate matching prevents it from
matching two files that differ by even a single byte
or bit of data.
2. Block-level hashing: Here the file is broken into a
number of smaller sections before deduplication.
The number of sections depends on the approach
being used. The two most common types of
block-level hashing are fixed-size chunking and
variable-length chunking. In fixed-size chunking, a
file is divided into a number of fixed-size pieces
called chunks; in variable-length chunking, a file is
broken into chunks of variable length. Each chunk
is passed to a cryptographic hash function (usually
MD5 or SHA-1) to obtain a chunk identifier, which
is used to locate duplicate data.
With file-level deduplication, any change inside a
file forces the entire file to be stored again.
Presentations and other documents are often edited
in small ways, such as updating a date or a report
page, yet the whole document must then be
re-stored. Block-level deduplication instead stores
one version of the document plus only the changed
portions between versions. File-level techniques
generally achieve less than a 5:1 compression ratio,
whereas block-level techniques can reach 20:1 or
even 50:1. The sketch below illustrates why a
one-byte edit defeats file-level matching but leaves
most block-level chunks unchanged.
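As a concrete illustration of this difference, the following hedged Python sketch (the fixed 4 KiB chunks and SHA-1 are arbitrary choices, not those of any specific product) compares file-level and block-level fingerprints of two files that differ in a single byte.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size

def file_fingerprint(data: bytes) -> str:
    """File-level hashing: one fingerprint for the whole file."""
    return hashlib.sha1(data).hexdigest()

def chunk_fingerprints(data: bytes) -> list:
    """Block-level hashing: one fingerprint per fixed-size chunk."""
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

old = bytes(100 * 4096)              # a 400 KiB file of zeros
new = b"x" + old[1:]                 # the same file with one byte changed

# File-level: the single fingerprint differs, so the whole file is re-stored.
print(file_fingerprint(old) == file_fingerprint(new))          # False

# Block-level: only the first chunk's fingerprint differs.
old_fps, new_fps = chunk_fingerprints(old), chunk_fingerprints(new)
changed = sum(a != b for a, b in zip(old_fps, new_fps))
print(changed, "of", len(old_fps), "chunks changed")           # 1 of 100
```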
2.2 Methodologies of Deduplication
At present, research on deduplication
focuses on two aspects: removing as much
duplicate data as possible to reduce storage
capacity requirements, and doing so efficiently with
respect to the resources required. Most traditional
backup systems use file-level deduplication, yet
deduplication technology can exploit inter-file and
intra-file redundancy to eliminate duplicate or
similar data at block or byte granularity. Some
available architectures follow source deduplication;
with this approach, however, the user faces a delay
in sending data to the backup store. The remaining
architectures, which support a target deduplication
strategy, provide single-system deduplication: at the
target side a single system (server) handles all user
requests to store data and maintains the hash index
for the disks attached to it.
Venti: Venti is a network storage system. It uses
identical hash values to identify block contents,
thereby reducing the storage space that data
occupies. Venti addresses blocks for large storage
applications and enforces a write-once policy to
avoid conflicting writes to the same data. This
system emerged in the early stages of network
storage, so it is not well suited to handling vast
amounts of data and is not scalable.
3. Venti as a base case
The key idea behind Venti is to identify
data blocks by a hash of their contents, called a
fingerprint in the Venti paper. The fingerprint is the
source of all of Venti's notable benefits. Because
blocks are addressed by the fingerprint of their
contents, a block cannot be modified without
changing its address (write-once behavior). Writes
are idempotent, since multiple writes of the same
data can be coalesced and do not require additional
storage. Multiple clients can share data blocks on a
Venti server without cooperating or coordinating.
Inherent integrity checking is also
ensured: when a block is retrieved, both the client
and the server can compute the fingerprint of the
data and compare it to the requested fingerprint.
Features such as replication, caching, and load
balancing are likewise facilitated, and because the
contents of a particular block are immutable, the
problem of data coherency is greatly reduced. The
main challenge of the work, on the other hand, is
also brought about by hashing. The design of Venti
requires a hash function that generates a unique
fingerprint for every data block that a client may
want to store. Venti employs a cryptographic hash
function, SHA-1, for which it is computationally
infeasible to find two distinct inputs that hash to the
same value (at the time the paper was written, no
SHA-1 collisions were known). As for the choice of
storage technology, the authors make a convincing
argument for magnetic disks by comparing the
prices and performance of disks and optical
storage systems.
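A minimal content-addressed, write-once store in this spirit can be sketched in a few lines of Python; this is an illustration of the fingerprint-as-address idea, not Venti's actual on-disk format or protocol.

```python
import hashlib

class ContentAddressedStore:
    """Toy write-once store: blocks are addressed by the SHA-1 of their contents."""

    def __init__(self):
        self._blocks = {}

    def write(self, block: bytes) -> str:
        fp = hashlib.sha1(block).hexdigest()
        # Writes are idempotent: storing the same block twice keeps one copy.
        self._blocks.setdefault(fp, block)
        return fp

    def read(self, fp: str) -> bytes:
        block = self._blocks[fp]
        # Inherent integrity check: the data must hash back to its address.
        if hashlib.sha1(block).hexdigest() != fp:
            raise IOError("corrupted block")
        return block

store = ContentAddressedStore()
fp = store.write(b"hello archival world")
assert store.write(b"hello archival world") == fp   # coalesced duplicate write
assert store.read(fp) == b"hello archival world"
```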
Each block is prefixed by a header that
describes the contents of the block. The primary
purpose of the header is to provide integrity
checking during normal operation and to assist in
data recovery. The header includes a magic
number, the fingerprint and size of the block, the
time when the block was first written, and identity
of the user that wrote it. The header also includes
a user-supplied type identifier, which is explained
in the original Venti paper. Note that only one copy
of a given block is stored in the log; thus the user
and time fields correspond to the first time the block
was stored on the server. The encoding field in the block header
indicates whether the data was compressed and, if
so, the algorithm used. The e-size field indicates
the size of the data after compression, enabling the
location of the next block in the arena to be
determined.
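The header layout can be pictured with a small Python struct sketch; the field widths and ordering below are assumptions for illustration and do not reproduce Venti's actual on-disk encoding.

```python
import struct
import time

# Assumed illustrative layout: magic (4 bytes), SHA-1 fingerprint (20 bytes),
# block size (4), write time (8), user id (4), type (1), encoding (1), e-size (4).
HEADER_FMT = ">I20sIQIBBI"

def pack_header(fingerprint: bytes, size: int, user: int,
                btype: int, encoding: int, esize: int) -> bytes:
    return struct.pack(HEADER_FMT, 0x56454E54, fingerprint, size,
                       int(time.time()), user, btype, encoding, esize)

hdr = pack_header(b"\x00" * 20, 8192, 1001, 0, 1, 4096)
print(len(hdr), "bytes")  # equals struct.calcsize(HEADER_FMT)
```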
In addition to a log of data blocks, an
arena includes a header, a directory, and a trailer.
The header identifies the arena. The directory
contains a copy of the block header and offset for
every block in the arena. By replicating the
headers of all the blocks in one relatively small part
of the arena, the server can rapidly check or
rebuild the system's global block index. The
directory also facilitates error recovery if part of the
arena is destroyed or corrupted. The trailer
summarizes the current state of the arena itself,
including the number of blocks and the size of the
log. Within the arena, the data log and the directory
start at opposite ends and grow towards each
other. When the arena is filled, it is marked as
sealed, and a fingerprint is computed for the
contents of the entire arena. Sealed arenas are
never modified.
The basic operation of Venti is to store and
retrieve blocks based on their fingerprints. A
fingerprint is 160 bits long, and the number of
possible fingerprints far exceeds the number of
blocks stored on a server. The disparity between
the number of fingerprints and blocks means it is
impractical to map the fingerprint directly to a
location on a storage device. Instead, Venti uses an
index to locate a block within the log. The index is
implemented as a disk-resident hash table. The
index is divided into fixed-sized buckets, each of
which is stored as a single disk block. Each bucket
contains the index map for a small section of the
fingerprint space. A hash function is used to map
fingerprints to index buckets in a roughly uniform
manner, and then the bucket is examined using
binary search. This structure is simple and
efficient, requiring one disk access to locate a
block in almost all cases.
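The bucketed index lookup can be sketched as follows; the bucket count, in-memory representation, and sorted-entry binary search are illustrative assumptions, whereas a real implementation keeps each bucket in a single disk block.

```python
import bisect
import hashlib

NUM_BUCKETS = 1 << 16  # assumed number of fixed-size index buckets

def bucket_of(fingerprint: bytes) -> int:
    """Map a fingerprint to a bucket roughly uniformly."""
    return int.from_bytes(hashlib.sha1(fingerprint).digest()[:4], "big") % NUM_BUCKETS

class BlockIndex:
    """Toy index: each bucket holds sorted (fingerprint, log_offset) entries."""

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, fingerprint: bytes, log_offset: int):
        bucket = self.buckets[bucket_of(fingerprint)]
        bisect.insort(bucket, (fingerprint, log_offset))

    def lookup(self, fingerprint: bytes):
        bucket = self.buckets[bucket_of(fingerprint)]
        i = bisect.bisect_left(bucket, (fingerprint,))
        if i < len(bucket) and bucket[i][0] == fingerprint:
            return bucket[i][1]          # offset of the block in the data log
        return None                      # not stored yet

idx = BlockIndex()
idx.insert(b"\x01" * 20, 4096)
print(idx.lookup(b"\x01" * 20))  # 4096
```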
Three applications, Vac, physical backup, and
usage with Plan 9 file system, are demonstrated to
show the effectiveness of Venti. In addition to the
development of the Venti prototype, a collection of
tools for integrity checking and error recovery were
built. The authors also gave some preliminary
performance results for read and write operations
with the Venti prototype. By using disks, they've
shown an access time for archival data that is
comparable to that of non-archival data. However,
they also identified the main problem: uncached
sequential read performance is particularly poor,
because sequential reads still require random
reads of the index. They point out one possible
solution: read-ahead.
4. Improvements in Venti
Three areas identified in the Venti paper require
improvement; each is discussed below.
4.1 Hashing Collision:
'A Comparison Study of Deduplication
Implementations with Small-Scale Workloads'
addresses Venti's exposure to hash collisions. The
design of Venti requires a hash function that
generates a unique fingerprint for every data block
that a client may want to store. For a server of a
given capacity, the likelihood that two different
blocks will have the same hash value, known as a
collision, can be determined. Although the
probability of two blocks producing identical keys is
extremely low, the Small-Scale Workloads study
hashes each block with both SHA-256 and MD5
simultaneously as an extra safeguard, with each
hash function mapping into one of two hash tables,
as sketched below.
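The following Python sketch, a simplification rather than the paper's implementation, shows the idea: a block is treated as a duplicate only if both its SHA-256 and MD5 fingerprints have been seen before.

```python
import hashlib

sha_table = {}   # SHA-256 fingerprint -> block location
md5_table = {}   # MD5 fingerprint -> block location

def is_duplicate(block: bytes) -> bool:
    """Treat a block as a duplicate only if both fingerprints are already known."""
    sha_fp = hashlib.sha256(block).hexdigest()
    md5_fp = hashlib.md5(block).hexdigest()
    duplicate = sha_fp in sha_table and md5_fp in md5_table
    if not duplicate:
        # Record the block under both fingerprints (location is illustrative).
        location = len(sha_table)
        sha_table[sha_fp] = location
        md5_table[md5_fp] = location
    return duplicate

print(is_duplicate(b"block A"))  # False: first time seen
print(is_duplicate(b"block A"))  # True: both fingerprints match
```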
4.2 Fixed-size chunking:
'A Low-bandwidth Network File System',
known as LBFS, addresses this problem by
considering only non-overlapping chunks of files
and avoids sensitivity to shifting file offsets by
setting chunk boundaries based on file contents,
rather than on position within a file. Insertions and
deletions therefore only affect the surrounding
chunks. To divide a file into chunks, LBFS
examines every (overlapping) 48-byte region of the
file and, with a probability determined by each
region's contents, considers the region to mark the
end of a data chunk. LBFS selects these boundary
regions, called breakpoints, using Rabin fingerprints.
The figure in the LBFS paper shows how LBFS
might divide up a file and what happens to chunk
boundaries after a series of edits:
1. The first panel shows the original file, divided into
variable-length chunks with breakpoints determined
by a hash of each 48-byte region.
2. The second panel shows the effect of inserting some
text into the file. The text is inserted in chunk c4,
producing a new, larger chunk c8. However, all
other chunks remain the same. Thus, one need only
send c8 to transfer the new file to a recipient that
already has the old version.
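A hedged Python sketch of content-defined chunking follows; it uses a simple rolling hash over a 48-byte window in place of true Rabin fingerprints, and the window size, mask, and minimum/maximum chunk sizes are illustrative choices.

```python
import os

WINDOW = 48           # bytes examined at each position (as in LBFS)
MASK = (1 << 13) - 1  # boundary when the low 13 bits match -> ~8 KiB average chunks
MIN_CHUNK, MAX_CHUNK = 2048, 65536  # guard rails against tiny or huge chunks

def window_hash(window: bytes) -> int:
    """Simple polynomial hash standing in for a Rabin fingerprint."""
    h = 0
    for b in window:
        h = (h * 257 + b) & 0xFFFFFFFFFFFFFFFF
    return h

def content_defined_chunks(data: bytes) -> list:
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        if i - start < MIN_CHUNK:
            continue
        h = window_hash(data[i - WINDOW:i])
        if (h & MASK) == MASK or i - start >= MAX_CHUNK:
            chunks.append(data[start:i])   # breakpoint: end the chunk here
            start = i
    chunks.append(data[start:])            # final (possibly short) chunk
    return chunks

data = os.urandom(200_000)
edited = data[:50_000] + b"inserted text" + data[50_000:]
# Most chunks survive the insertion unchanged, so only a few need re-sending.
common = set(content_defined_chunks(data)) & set(content_defined_chunks(edited))
print(len(common), "shared chunks")
```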
4.3 Better Access Control:
'A Low-bandwidth Network File System'
uses an RPC library, extended with compression
support, that provides authentication and encryption
of the traffic between a client and a server. The
entire LBFS protocol, RPC headers and all, is
passed through gzip compression, tagged with a
message authentication code, and then encrypted.
At mount time, the client and server negotiate a
session key, the server authenticates itself to the
user, and the user authenticates herself to the
server, all using public-key cryptography. The client
and server communicate over TCP using Sun RPC.
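The compress-then-MAC-then-encrypt pipeline can be sketched roughly as follows in Python; this uses zlib, HMAC-SHA-256, and Fernet from the third-party cryptography package as stand-ins, not LBFS's actual RPC layer or cipher choices.

```python
import hmac
import hashlib
import zlib
from cryptography.fernet import Fernet  # third-party: pip install cryptography

enc_key = Fernet.generate_key()
mac_key = b"shared-mac-key-from-session-setup"  # assumed negotiated at mount time
cipher = Fernet(enc_key)

def protect(message: bytes) -> bytes:
    """Compress, tag with a MAC, then encrypt (the order described for LBFS traffic)."""
    compressed = zlib.compress(message)
    tag = hmac.new(mac_key, compressed, hashlib.sha256).digest()
    return cipher.encrypt(tag + compressed)

def unprotect(wire: bytes) -> bytes:
    payload = cipher.decrypt(wire)
    tag, compressed = payload[:32], payload[32:]
    expected = hmac.new(mac_key, compressed, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad message authentication code")
    return zlib.decompress(compressed)

assert unprotect(protect(b"GETATTR /home/user/file")) == b"GETATTR /home/user/file"
```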
'POTSHARDS: Secure Long-Term
Archival Storage Without Encryption' uses secret
splitting and approximate pointers as a way to
move security from encryption to authentication
and to avoid reliance on encryption algorithms that
may be compromised at some point in the future.
Unlike encryption, secret splitting provides
information-theoretic security. In addition, each
user maintains a separate, recoverable index over her
data, so a compromised index does not affect the
other users and a lost index is not equivalent to
data deletion. More importantly, in the event that a
user loses her index, both the index and the data
itself can be securely reconstructed from the user’s
shares stored across multiple archives.
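Secret splitting itself can be illustrated with a tiny XOR-based scheme in Python; this n-of-n split, where all shares are needed to reconstruct the data, is a simplification and not POTSHARDS's actual threshold scheme or approximate-pointer mechanism.

```python
import os
from functools import reduce

def xor_all(chunks: list) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def split_secret(secret: bytes, n_shares: int) -> list:
    """XOR-based n-of-n secret splitting: any n-1 shares reveal nothing."""
    shares = [os.urandom(len(secret)) for _ in range(n_shares - 1)]
    last = bytes(b ^ r for b, r in zip(secret, xor_all(shares)))
    return shares + [last]

def combine_shares(shares: list) -> bytes:
    return xor_all(shares)

secret = b"archival record contents"
shares = split_secret(secret, 3)            # store each share at a different archive
assert combine_shares(shares) == secret     # all shares together recover the data
```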
5. Conclusion
Archival data is growing exponentially, so
systems that can eliminate data duplication
effectively are much needed. This paper has
elaborated on Venti in depth and on its areas for
improvement; three major issues of Venti were
discussed, but there may be cases in which the
proposed solutions fail. For hashing, a collision
could still occur if SHA and MD5 both produce
duplicate keys for distinct blocks. Similarly,
content-based chunking is computationally
expensive, so it could be improved further. Finally,
Venti has not been evaluated in a distributed
environment, which makes that an ideal candidate
for future work.
6. References
[1] "Deduplication and Compression Techniques
in Cloud Design" by Amrita Upadhyay, Pratibha R
Balihalli, Shashibhushan Ivaturi and Shrisha Rao
2012 IEEE
[2] "Avoiding the Disk Bottleneck in the Data
Domain Deduplication File System" by Benjamin
Zhu Data Domain, Inc. 6th USENIX Conference on
File and Storage Technologies
[3] P. Kulkarni, J. LaVoie, F. Douglis and J. Tracey,
Redundancy elimination within large collections of
files, in Proc. USENIX 2004 Annual Technical
Conference, 2004.
[4] Dave Russell: Data De-duplication Will Be
Even Bigger in 2010, Gartner, 8 February 2010.
[5] Mark W. Storer, Kevin M. Greenan, Darrell D.
E. Long and Ethan L. Miller. Secure data
deduplication. In Proceedings of the 2008 ACM
Workshop on Storage Security and Survivability,
October 2008.
[6] “Fujitsu’s storage systems and related
technologies supporting cloud computing,” 2010.
[Online]. Available: http://www.fujitsu.com/global/
[7] S. Quinlan and S. Dorward, Venti: A New
Approach to Archival Data Storage, in Proceedings
of the 1st USENIX Conference on File and Storage
Technologies, Monterey, CA: USENIX
Association, 2002, pp. 89-101.
[8] D. Bhagwat, K. Eshghi, D.D.E. Long and M.
Lillibridge, Extreme Binning: Scalable, Parallel
Deduplication for Chunk- based File Backup, in
2009 IEEE International Symposium on Modeling,
Analysis and Simulation of Computer and
Telecommunication Systems Mascots, 2009, pp.
237-245.
[9] J. Black. Compare-by-hash: A reasoned
analysis, in USENIX Association Proceedings of
the 2006 USENIX Annual Technical Conference,
2006, pp. 85-90.
[10] D. Borthakur, The Hadoop Distributed File
System: Architecture and Design, 2007.
URL:hadoop.apache.org/hdfs/docs/current/hdfs_de
sign.pdf, accessed in Oct 2011.
More Related Content

What's hot

Improved deduplication with keys and chunks in HDFS storage providers
Improved deduplication with keys and chunks in HDFS storage providersImproved deduplication with keys and chunks in HDFS storage providers
Improved deduplication with keys and chunks in HDFS storage providers
IRJET Journal
 
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
IRJET Journal
 
Paper id 252014139
Paper id 252014139Paper id 252014139
Paper id 252014139
IJRAT
 
Data deduplication and chunking
Data deduplication and chunkingData deduplication and chunking
Data deduplication and chunking
Sanchita Kadambari
 
Duplicate File Analyzer using N-layer Hash and Hash Table
Duplicate File Analyzer using N-layer Hash and Hash TableDuplicate File Analyzer using N-layer Hash and Hash Table
Duplicate File Analyzer using N-layer Hash and Hash Table
AM Publications
 
Available techniques in hadoop small file issue
Available techniques in hadoop small file issueAvailable techniques in hadoop small file issue
Available techniques in hadoop small file issue
IJECEIAES
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...Alexander Decker
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
Unmesh Baile
 
Improving availability and reducing redundancy using deduplication of cloud s...
Improving availability and reducing redundancy using deduplication of cloud s...Improving availability and reducing redundancy using deduplication of cloud s...
Improving availability and reducing redundancy using deduplication of cloud s...
dhanarajp
 
JPJ1448 Cooperative Caching for Efficient Data Access in Disruption Toleran...
JPJ1448   Cooperative Caching for Efficient Data Access in Disruption Toleran...JPJ1448   Cooperative Caching for Efficient Data Access in Disruption Toleran...
JPJ1448 Cooperative Caching for Efficient Data Access in Disruption Toleran...
chennaijp
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
neirew J
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performance
ijcsa
 
Cooperative caching for efficient data access in
Cooperative caching for efficient data access inCooperative caching for efficient data access in
Cooperative caching for efficient data access in
Shakas Technologies
 
IRJET - A Secure Access Policies based on Data Deduplication System
IRJET - A Secure Access Policies based on Data Deduplication SystemIRJET - A Secure Access Policies based on Data Deduplication System
IRJET - A Secure Access Policies based on Data Deduplication System
IRJET Journal
 
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET Journal
 

What's hot (17)

Improved deduplication with keys and chunks in HDFS storage providers
Improved deduplication with keys and chunks in HDFS storage providersImproved deduplication with keys and chunks in HDFS storage providers
Improved deduplication with keys and chunks in HDFS storage providers
 
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
 
Paper id 252014139
Paper id 252014139Paper id 252014139
Paper id 252014139
 
Data deduplication and chunking
Data deduplication and chunkingData deduplication and chunking
Data deduplication and chunking
 
Duplicate File Analyzer using N-layer Hash and Hash Table
Duplicate File Analyzer using N-layer Hash and Hash TableDuplicate File Analyzer using N-layer Hash and Hash Table
Duplicate File Analyzer using N-layer Hash and Hash Table
 
Available techniques in hadoop small file issue
Available techniques in hadoop small file issueAvailable techniques in hadoop small file issue
Available techniques in hadoop small file issue
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Improving availability and reducing redundancy using deduplication of cloud s...
Improving availability and reducing redundancy using deduplication of cloud s...Improving availability and reducing redundancy using deduplication of cloud s...
Improving availability and reducing redundancy using deduplication of cloud s...
 
paper
paperpaper
paper
 
A mathematical appraisal
A mathematical appraisalA mathematical appraisal
A mathematical appraisal
 
JPJ1448 Cooperative Caching for Efficient Data Access in Disruption Toleran...
JPJ1448   Cooperative Caching for Efficient Data Access in Disruption Toleran...JPJ1448   Cooperative Caching for Efficient Data Access in Disruption Toleran...
JPJ1448 Cooperative Caching for Efficient Data Access in Disruption Toleran...
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performance
 
Cooperative caching for efficient data access in
Cooperative caching for efficient data access inCooperative caching for efficient data access in
Cooperative caching for efficient data access in
 
IRJET - A Secure Access Policies based on Data Deduplication System
IRJET - A Secure Access Policies based on Data Deduplication SystemIRJET - A Secure Access Policies based on Data Deduplication System
IRJET - A Secure Access Policies based on Data Deduplication System
 
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
 

Similar to Data Deduplication: Venti and its improvements

BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
dbpublications
 
IRJET- Distributed Decentralized Data Storage using IPFS
IRJET- Distributed Decentralized Data Storage using IPFSIRJET- Distributed Decentralized Data Storage using IPFS
IRJET- Distributed Decentralized Data Storage using IPFS
IRJET Journal
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
ijdms
 
Secure distributed deduplication systems
Secure distributed deduplication systemsSecure distributed deduplication systems
Secure distributed deduplication systems
Pvrtechnologies Nellore
 
Secure Distributed Deduplication Systems with Improved Reliability
Secure Distributed Deduplication Systems with Improved ReliabilitySecure Distributed Deduplication Systems with Improved Reliability
Secure Distributed Deduplication Systems with Improved Reliability
1crore projects
 
Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsShivansh Gaur
 
Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2
Rishikesh Pathak
 
Operating system
Operating systemOperating system
Operating system
Hussain Ahmady
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Science
ijtsrd
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
dbpublications
 
database ppt(2)
database ppt(2)database ppt(2)
database ppt(2)
EshetuGeletu2
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
Rahul Chaturvedi
 
1. Chapter One.pdf
1. Chapter One.pdf1. Chapter One.pdf
1. Chapter One.pdf
fikadumola
 
Approved TPA along with Integrity Verification in Cloud
Approved TPA along with Integrity Verification in CloudApproved TPA along with Integrity Verification in Cloud
Approved TPA along with Integrity Verification in Cloud
Editor IJCATR
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
Knoldus Inc.
 
03 Data Recovery - Notes
03 Data Recovery - Notes03 Data Recovery - Notes
03 Data Recovery - NotesKranthi
 
An Efficient Approach to Manage Small Files in Distributed File Systems
An Efficient Approach to Manage Small Files in Distributed File SystemsAn Efficient Approach to Manage Small Files in Distributed File Systems
An Efficient Approach to Manage Small Files in Distributed File Systems
IRJET Journal
 
Distributed virtual disk storage system
Distributed virtual disk storage systemDistributed virtual disk storage system
Distributed virtual disk storage system
Alexander Decker
 
11.distributed virtual disk storage system
11.distributed virtual disk storage system11.distributed virtual disk storage system
11.distributed virtual disk storage system
Alexander Decker
 

Similar to Data Deduplication: Venti and its improvements (20)

BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
 
IRJET- Distributed Decentralized Data Storage using IPFS
IRJET- Distributed Decentralized Data Storage using IPFSIRJET- Distributed Decentralized Data Storage using IPFS
IRJET- Distributed Decentralized Data Storage using IPFS
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
 
Secure distributed deduplication systems
Secure distributed deduplication systemsSecure distributed deduplication systems
Secure distributed deduplication systems
 
Secure Distributed Deduplication Systems with Improved Reliability
Secure Distributed Deduplication Systems with Improved ReliabilitySecure Distributed Deduplication Systems with Improved Reliability
Secure Distributed Deduplication Systems with Improved Reliability
 
Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity Algorithms
 
Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2
 
Operating system
Operating systemOperating system
Operating system
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Science
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
database ppt(2)
database ppt(2)database ppt(2)
database ppt(2)
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
 
1. Chapter One.pdf
1. Chapter One.pdf1. Chapter One.pdf
1. Chapter One.pdf
 
Approved TPA along with Integrity Verification in Cloud
Approved TPA along with Integrity Verification in CloudApproved TPA along with Integrity Verification in Cloud
Approved TPA along with Integrity Verification in Cloud
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
 
03 Data Recovery - Notes
03 Data Recovery - Notes03 Data Recovery - Notes
03 Data Recovery - Notes
 
An Efficient Approach to Manage Small Files in Distributed File Systems
An Efficient Approach to Manage Small Files in Distributed File SystemsAn Efficient Approach to Manage Small Files in Distributed File Systems
An Efficient Approach to Manage Small Files in Distributed File Systems
 
Distributed virtual disk storage system
Distributed virtual disk storage systemDistributed virtual disk storage system
Distributed virtual disk storage system
 
11.distributed virtual disk storage system
11.distributed virtual disk storage system11.distributed virtual disk storage system
11.distributed virtual disk storage system
 

More from Umair Amjad

Automated Process for Auditng in Agile - SCRUM
Automated Process for Auditng in Agile - SCRUMAutomated Process for Auditng in Agile - SCRUM
Automated Process for Auditng in Agile - SCRUM
Umair Amjad
 
Bead–Sort :: A Natural Sorting Algorithm
Bead–Sort :: A Natural Sorting AlgorithmBead–Sort :: A Natural Sorting Algorithm
Bead–Sort :: A Natural Sorting Algorithm
Umair Amjad
 
Apache logs monitoring
Apache logs monitoringApache logs monitoring
Apache logs monitoring
Umair Amjad
 
Exact Cell Decomposition of Arrangements used for Path Planning in Robotics
Exact Cell Decomposition of Arrangements used for Path Planning in RoboticsExact Cell Decomposition of Arrangements used for Path Planning in Robotics
Exact Cell Decomposition of Arrangements used for Path Planning in Robotics
Umair Amjad
 
Ruby on Rails workshop for beginner
Ruby on Rails workshop for beginnerRuby on Rails workshop for beginner
Ruby on Rails workshop for beginner
Umair Amjad
 
SQL WORKSHOP::Lecture 7
SQL WORKSHOP::Lecture 7SQL WORKSHOP::Lecture 7
SQL WORKSHOP::Lecture 7Umair Amjad
 
SQL WORKSHOP::Lecture 6
SQL WORKSHOP::Lecture 6SQL WORKSHOP::Lecture 6
SQL WORKSHOP::Lecture 6Umair Amjad
 
SQL WORKSHOP::Lecture 5
SQL WORKSHOP::Lecture 5SQL WORKSHOP::Lecture 5
SQL WORKSHOP::Lecture 5Umair Amjad
 
SQL WORKSHOP::Lecture 4
SQL WORKSHOP::Lecture 4SQL WORKSHOP::Lecture 4
SQL WORKSHOP::Lecture 4Umair Amjad
 
SQL WORKSHOP::Lecture 13
SQL WORKSHOP::Lecture 13SQL WORKSHOP::Lecture 13
SQL WORKSHOP::Lecture 13Umair Amjad
 
SQL WORKSHOP::Lecture 12
SQL WORKSHOP::Lecture 12SQL WORKSHOP::Lecture 12
SQL WORKSHOP::Lecture 12Umair Amjad
 
SQL WORKSHOP::Lecture 11
SQL WORKSHOP::Lecture 11SQL WORKSHOP::Lecture 11
SQL WORKSHOP::Lecture 11Umair Amjad
 
SQL WORKSHOP::Lecture 10
SQL WORKSHOP::Lecture 10SQL WORKSHOP::Lecture 10
SQL WORKSHOP::Lecture 10Umair Amjad
 
SQL WORKSHOP::Lecture 9
SQL WORKSHOP::Lecture 9SQL WORKSHOP::Lecture 9
SQL WORKSHOP::Lecture 9Umair Amjad
 
SQL WORKSHOP::Lecture 3
SQL WORKSHOP::Lecture 3SQL WORKSHOP::Lecture 3
SQL WORKSHOP::Lecture 3Umair Amjad
 
SQL WORKSHOP::Lecture 2
SQL WORKSHOP::Lecture 2SQL WORKSHOP::Lecture 2
SQL WORKSHOP::Lecture 2Umair Amjad
 
SQL WORKSHOP::Lecture 1
SQL WORKSHOP::Lecture 1SQL WORKSHOP::Lecture 1
SQL WORKSHOP::Lecture 1Umair Amjad
 
DCT based Watermarking technique
DCT based Watermarking techniqueDCT based Watermarking technique
DCT based Watermarking technique
Umair Amjad
 
Multi-core processor and Multi-channel memory architecture
Multi-core processor and Multi-channel memory architectureMulti-core processor and Multi-channel memory architecture
Multi-core processor and Multi-channel memory architecture
Umair Amjad
 
Migration from Rails2 to Rails3
Migration from Rails2 to Rails3Migration from Rails2 to Rails3
Migration from Rails2 to Rails3
Umair Amjad
 

More from Umair Amjad (20)

Automated Process for Auditng in Agile - SCRUM
Automated Process for Auditng in Agile - SCRUMAutomated Process for Auditng in Agile - SCRUM
Automated Process for Auditng in Agile - SCRUM
 
Bead–Sort :: A Natural Sorting Algorithm
Bead–Sort :: A Natural Sorting AlgorithmBead–Sort :: A Natural Sorting Algorithm
Bead–Sort :: A Natural Sorting Algorithm
 
Apache logs monitoring
Apache logs monitoringApache logs monitoring
Apache logs monitoring
 
Exact Cell Decomposition of Arrangements used for Path Planning in Robotics
Exact Cell Decomposition of Arrangements used for Path Planning in RoboticsExact Cell Decomposition of Arrangements used for Path Planning in Robotics
Exact Cell Decomposition of Arrangements used for Path Planning in Robotics
 
Ruby on Rails workshop for beginner
Ruby on Rails workshop for beginnerRuby on Rails workshop for beginner
Ruby on Rails workshop for beginner
 
SQL WORKSHOP::Lecture 7
SQL WORKSHOP::Lecture 7SQL WORKSHOP::Lecture 7
SQL WORKSHOP::Lecture 7
 
SQL WORKSHOP::Lecture 6
SQL WORKSHOP::Lecture 6SQL WORKSHOP::Lecture 6
SQL WORKSHOP::Lecture 6
 
SQL WORKSHOP::Lecture 5
SQL WORKSHOP::Lecture 5SQL WORKSHOP::Lecture 5
SQL WORKSHOP::Lecture 5
 
SQL WORKSHOP::Lecture 4
SQL WORKSHOP::Lecture 4SQL WORKSHOP::Lecture 4
SQL WORKSHOP::Lecture 4
 
SQL WORKSHOP::Lecture 13
SQL WORKSHOP::Lecture 13SQL WORKSHOP::Lecture 13
SQL WORKSHOP::Lecture 13
 
SQL WORKSHOP::Lecture 12
SQL WORKSHOP::Lecture 12SQL WORKSHOP::Lecture 12
SQL WORKSHOP::Lecture 12
 
SQL WORKSHOP::Lecture 11
SQL WORKSHOP::Lecture 11SQL WORKSHOP::Lecture 11
SQL WORKSHOP::Lecture 11
 
SQL WORKSHOP::Lecture 10
SQL WORKSHOP::Lecture 10SQL WORKSHOP::Lecture 10
SQL WORKSHOP::Lecture 10
 
SQL WORKSHOP::Lecture 9
SQL WORKSHOP::Lecture 9SQL WORKSHOP::Lecture 9
SQL WORKSHOP::Lecture 9
 
SQL WORKSHOP::Lecture 3
SQL WORKSHOP::Lecture 3SQL WORKSHOP::Lecture 3
SQL WORKSHOP::Lecture 3
 
SQL WORKSHOP::Lecture 2
SQL WORKSHOP::Lecture 2SQL WORKSHOP::Lecture 2
SQL WORKSHOP::Lecture 2
 
SQL WORKSHOP::Lecture 1
SQL WORKSHOP::Lecture 1SQL WORKSHOP::Lecture 1
SQL WORKSHOP::Lecture 1
 
DCT based Watermarking technique
DCT based Watermarking techniqueDCT based Watermarking technique
DCT based Watermarking technique
 
Multi-core processor and Multi-channel memory architecture
Multi-core processor and Multi-channel memory architectureMulti-core processor and Multi-channel memory architecture
Multi-core processor and Multi-channel memory architecture
 
Migration from Rails2 to Rails3
Migration from Rails2 to Rails3Migration from Rails2 to Rails3
Migration from Rails2 to Rails3
 

Recently uploaded

Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 

Recently uploaded (20)

Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 

Data Deduplication: Venti and its improvements

  • 1. Data Deduplication: Venti and its improvements Umair Amjad 12-5044 umairamjadawan@gmail.com Department of Computer Science, National University of Computer and Emerging Sciences, Pakistan Abstract Entire world is adapting digital technologies, converting from legacy approach to Digital approach. Data is the primary thing which is available in digital form everywhere. To store this massive data, the storage methodology should be efficient as well as intelligent enough to find the redundant data to save. Data deduplication techniques are widely used by storage servers to eliminate the possibilities of storing multiple copies of the data. Deduplication identifies duplicate data portions going to be stored in storage systems also removes duplication in existing stored data in storage systems. Hence yield a significant cost saving. This paper is about data deduplication, taking Venti as base case discussed it in detail and also identify area of improvements in Venti which are addressed by other papers. Keywords – Data deduplication; data storage; hash index; venti; archival data; 1. Introduction The world is producing the large number of digital data that is growing rapidly. According to a study, the information producing per year to the digital universe is growing by 57% annually. This whopping growth of information is imparting a considerable load on storage systems. Thirty-five percent of this information is generated by enterprises and therefore must be retained due to regulatory compliance and legal reasons. So it is critical to backup the data regularly to a disaster recovery site for data availability and integrity. Rapidly developing data arises many challenges to the existing storage systems. One observation is that a significant fraction of information contains duplicates, due to reasons such as backups, copies, and version updates. Thus, deduplication techniques have been invented to avoid storing redundant information. A number of trends have motivated the creation of deduplication solutions. Archival systems such as Venti have identified significant information redundancy within and across machines due to update versions and commonly installed applications and libraries. In addition to storage overhead, duplicate file content can also have other negative effects on the system. As files are accessed, they are cached in memory and in the hard disk cache. Duplicate content can consume unnecessary memory cache that could be used to cache additional unique content. Deduplication solves these issues by locating identical content and handling it appropriately. Instead of storing the same file content multiple times, we can have a new file that references the identical content already stored in the system. The use of deduplication results in more efficient use of both memory cache and storage capacity. This paper is taking Venti as base case for data deduplication and its missing areas. After identification of missing areas there solution is proposed in reference to other research papers. 2. Background In storage archives a large quantity of data is redundant and slight changed to another chunk of data. The term data deduplication points to the techniques that saves only one single instance of replicated data, and provide links to that instance of copy in place of storing other original copies of this data. There are many techniques exists for eliminating redundancy from the stored data. At present data deduplication has gained popularity in the research community . 
specialized data compression technique for eliminating redundant data, typically to improve storage utilization.
In the deduplication process, redundant data is identified and not stored again. With the evolution of backup services from tape to disk, data deduplication has turned into a key element of the backup process. Only one copy of the data is saved in the data center, and every user who wants to access that data is linked to the single stored instance. Deduplication therefore helps decrease the size of the data center. In other words, deduplication means that the copies of data that would normally be duplicated (for example in the cloud) are controlled and managed so as to shrink the physical storage space required for such replication.

The basic steps for deduplication are:
1. Files are divided into small segments.
2. New and existing segments are checked for similarity by comparing fingerprints produced by a hashing algorithm.
3. Metadata structures are updated.
4. Segments are compressed.
5. Duplicate data is deleted and a data integrity check is performed.

2.1 Types of Data Deduplication

There are two major categories of data deduplication on which all research is based.

1. Offline data deduplication (target based): In offline deduplication, data is first written to the storage disk and the deduplication process takes place at a later time. It is performed at the target data storage center. The client is unmodified and is not aware of any deduplication. This approach improves storage utilization and no one needs to wait for hash calculations, but it does not save bandwidth.

2. Online data deduplication (source based): In online deduplication, duplicate data is removed before being written to the storage disk. It is performed on the data at the source, before it is transferred. A deduplication-aware backup agent is installed on the client, which backs up only unique data. The result is increased bandwidth and storage efficiency, but this places an extra computational load on the backup client. Duplicates are replaced by pointers, and the actual duplicate data is never sent over the network.

Once the timing of deduplication has been decided, there are a number of existing techniques that can be applied. The most widely used approaches are file-level hashing and block-level hashing; a minimal sketch contrasting the two follows this list.

1. File-level hashing: The whole file is passed to a hashing function, usually a cryptographic hash such as MD5 or SHA-1. The cryptographic hash is used to find entire duplicate files. This approach is fast, with low computation and little additional metadata overhead, and it works well for complete system backups where fully duplicated files are common. However, the coarse granularity of duplicate matching prevents it from matching two files that differ by only a single byte or bit.

2. Block-level hashing: The file is broken into a number of smaller sections before deduplication. The number of sections depends on the approach being used; the two most common variants are fixed-size chunking and variable-length chunking. In fixed-size chunking, a file is divided into a number of fixed-size pieces called chunks. In variable-length chunking, a file is broken into chunks of variable length. Each chunk is passed to a cryptographic hash function (usually MD5 or SHA-1) to obtain a chunk identifier, which is then used to locate duplicate data.
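As a rough illustration of the two approaches above, the following Python sketch computes a single file-level fingerprint and a list of fixed-size block-level fingerprints, storing each unique chunk once in an in-memory dictionary. The chunk size, function names, and dictionary-based store are illustrative assumptions, not taken from any of the systems discussed.

import hashlib

CHUNK_SIZE = 8 * 1024   # assumed fixed chunk size; real systems use anything from a few KB up

def file_fingerprint(path):
    # File-level hashing: one SHA-1 digest identifies the whole file.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def store_chunks(path, store):
    # Block-level hashing with fixed-size chunking: one digest per chunk.
    # 'store' maps fingerprint -> chunk bytes; duplicate chunks are kept only once.
    recipe = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            fp = hashlib.sha1(chunk).hexdigest()
            store.setdefault(fp, chunk)
            recipe.append(fp)
    return recipe   # ordered fingerprints needed to reconstruct the file

Reconstructing a file then amounts to concatenating store[fp] for each fingerprint in its recipe; two files that contain identical, aligned chunks share those chunks in the store.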
With file-level deduplication, any change inside a file causes the entire file to be stored again. A presentation or other document may change only a small piece of content, such as a date or the title of a page, yet the whole document must be re-stored. Block-level deduplication stores only one version of the document and then only the parts that change between versions. File-level techniques generally achieve compression ratios of less than 5:1, while block-level techniques can compress stored data by 20:1 or even 50:1.

2.2 Methodologies of Deduplication

At present, research on deduplication focuses on two aspects. One is removing as much duplicate data as possible, thereby reducing the storage capacity required. The other is the efficiency of the resources needed to achieve this. Most traditional backup systems use file-level deduplication; however, deduplication technology can also exploit inter-file and intra-file redundancy to eliminate duplicate or similar data at block or byte granularity. Some architectures follow source deduplication, but with this approach the user faces a delay in sending data to the backup store. The remaining architectures, which support a target deduplication strategy, provide single-system deduplication: at the target side a single system (server) handles all user requests to store data and maintains the hash index for the disks attached to it.

Venti is a network storage system. It uses identical hash values to identify block contents and thereby reduces the amount of storage occupied. Venti manages blocks for large storage applications and enforces a write-once policy to avoid conflicting updates to data. The system emerged in the early stages of network storage, so it is not well suited to very large volumes of data and it does not scale.

3. Venti as a base case

The key idea behind Venti is to identify data blocks by a hash of their contents, called a fingerprint in this paper. The fingerprint is the source of Venti's main benefits. Because blocks are addressed by the fingerprint of their contents, a block cannot be modified without changing its address (write-once behavior). Writes are idempotent, since multiple writes of the same data can be coalesced and do not require additional storage. Multiple clients can share data blocks on a Venti server without cooperating or coordinating. Integrity checking is inherent: when a block is retrieved, both the client and the server can compute the fingerprint of the data and compare it to the requested fingerprint. Features such as replication, caching, and load balancing are also facilitated, because the contents of a particular block are immutable and the problem of data coherency is greatly reduced.

The main challenge of the work, on the other hand, is also brought about by hashing. The design of Venti requires a hash function that can generate a unique fingerprint for every data block a client may want to store. Venti employs a cryptographic hash function, SHA-1, for which it is computationally infeasible to find two distinct inputs that hash to the same value (at the time Venti was published, no SHA-1 collisions were known). As to the choice of storage technology, the authors make a convincing argument for magnetic disks by comparing the prices and performance of disks and optical storage systems.
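To make the write-once, fingerprint-addressed semantics concrete, here is a small in-memory sketch of a Venti-like content-addressed store. It is a toy model under simplified assumptions (a Python dict standing in for Venti's append-only log and index), not Venti's actual implementation.

import hashlib

class ContentAddressedStore:
    # Toy model: blocks are addressed by the SHA-1 fingerprint of their contents,
    # writes are idempotent, and stored blocks are never modified (write-once).

    def __init__(self):
        self._blocks = {}   # fingerprint (hex) -> block bytes

    def write(self, data):
        fp = hashlib.sha1(data).hexdigest()
        # Writing the same content twice coalesces to a single stored copy.
        self._blocks.setdefault(fp, bytes(data))
        return fp

    def read(self, fp):
        data = self._blocks[fp]
        # Inherent integrity check: recompute the fingerprint and compare it.
        if hashlib.sha1(data).hexdigest() != fp:
            raise IOError("block corrupted: fingerprint mismatch")
        return data

store = ContentAddressedStore()
fp1 = store.write(b"archival block")
fp2 = store.write(b"archival block")    # idempotent: same fingerprint, no extra storage
assert fp1 == fp2 and store.read(fp1) == b"archival block"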
Each block is prefixed by a header that describes the contents of the block. The primary purpose of the header is to provide integrity checking during normal operation and to assist in data recovery. The header includes a magic number, the fingerprint and size of the block, the time when the block was first written, and the identity of the user that wrote it. The header also includes a user-supplied type identifier, which is described further in the Venti paper. Note that only one copy of a given block is stored in the log, so the user and time fields correspond to the first time the block was stored on the server. The encoding field in the block header indicates whether the data was compressed and, if so, the algorithm used. The e-size field indicates the size of the data after compression, enabling the location of the next block in the arena to be determined.

In addition to a log of data blocks, an arena includes a header, a directory, and a trailer. The header identifies the arena. The directory contains a copy of the block header and offset for every block in the arena. By replicating the headers of all the blocks in one relatively small part of the arena, the server can rapidly check or rebuild the system's global block index. The directory also facilitates error recovery if part of the arena is destroyed or corrupted. The trailer summarizes the current state of the arena itself, including the number of blocks and the size of the log. Within the arena, the data log and the directory start at opposite ends and grow towards each other. When the arena is filled, it is marked as sealed, and a fingerprint is computed for the contents of the entire arena. Sealed arenas are never modified.

The basic operation of Venti is to store and retrieve blocks based on their fingerprints. A fingerprint is 160 bits long, and the number of possible fingerprints far exceeds the number of blocks stored on a server. This disparity makes it impractical to map a fingerprint directly to a location on a storage device. Instead, an index is used to locate a block within the log. The index is implemented as a disk-resident hash table, divided into fixed-size buckets, each of which is stored as a single disk block. Each bucket contains the index map for a small section of the fingerprint space. A hash function maps fingerprints to index buckets in a roughly uniform manner, and the bucket is then examined using binary search. This structure is simple and efficient, requiring one disk access to locate a block in almost all cases (a simplified model of this index appears at the end of this section).

Three applications, Vac, physical backup, and use with the Plan 9 file system, are demonstrated to show the effectiveness of Venti. In addition to the development of the Venti prototype, a collection of tools for integrity checking and error recovery was built. The authors also give some preliminary performance results for read and write operations with the Venti prototype. By using disks, they show an access time for archival data that is comparable to non-archival data. However, they also identify the main problem: uncached sequential read performance is particularly poor, because sequential reads require random reads of the index. They point out one possible solution: read-ahead.
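The following is a simplified, in-memory model of the index lookup described above: fingerprints are mapped to fixed-size buckets, and each bucket is searched with binary search. The bucket count and the in-memory lists stand in for Venti's on-disk layout and are illustrative assumptions only.

import bisect
import hashlib

NUM_BUCKETS = 1 << 16   # illustrative; Venti sizes each bucket to fit one disk block

def bucket_of(fp):
    # Map a 20-byte SHA-1 fingerprint to a bucket roughly uniformly.
    return int.from_bytes(fp[:4], "big") % NUM_BUCKETS

class BlockIndex:
    # Each bucket is a sorted list of (fingerprint, log_offset) pairs, so a lookup
    # costs one bucket access plus a binary search, mirroring the scheme above.

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, fp, offset):
        bisect.insort(self.buckets[bucket_of(fp)], (fp, offset))

    def lookup(self, fp):
        bucket = self.buckets[bucket_of(fp)]
        i = bisect.bisect_left(bucket, (fp, 0))
        if i < len(bucket) and bucket[i][0] == fp:
            return bucket[i][1]    # offset of the block in the data log
        return None

index = BlockIndex()
fp = hashlib.sha1(b"some block").digest()
index.insert(fp, 4096)
assert index.lookup(fp) == 4096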
4.1 Hashing Collision

'A Comparison Study of Deduplication Implementations with Small-Scale Workloads' addresses Venti's exposure to hash collisions. The design of Venti requires a hash function that generates a unique fingerprint for every data block a client may want to store, and for a server of a given capacity the likelihood that two different blocks will have the same hash value (a collision) can be estimated. Although the probability of two different blocks producing identical keys is extremely low, this work uses two hash algorithms, SHA-256 and MD5, simultaneously as an extra safeguard. Each of the hash functions maps to one of two hash tables.

4.2 Fixed-size Chunking

'A Low-bandwidth Network File System' (LBFS) addresses this problem by considering only non-overlapping chunks of files and avoids sensitivity to shifting file offsets by setting chunk boundaries based on file contents rather than on position within a file. Insertions and deletions therefore only affect the surrounding chunks. To divide a file into chunks, LBFS examines every (overlapping) 48-byte region of the file and, with probability 2^-13 over each region's contents, considers it to be the end of a data chunk. LBFS selects these boundary regions, called breakpoints, using Rabin fingerprints. The LBFS paper illustrates how a file might be divided and what happens to chunk boundaries after a series of edits (a sketch of content-defined chunking appears at the end of this section):

1. The original file is divided into variable-length chunks, with breakpoints determined by a hash of each 48-byte region.
2. After some text is inserted into the file, the insertion lands in chunk c4, producing a new, larger chunk c8; all other chunks remain the same. Thus one need only send c8 to transfer the new file to a recipient that already has the old version.

4.3 Better Access Control

'A Low-bandwidth Network File System' uses an RPC library that supports authenticating and encrypting traffic between a client and server. The entire LBFS protocol, RPC headers and all, is passed through gzip compression, tagged with a message authentication code, and then encrypted. At mount time, the client and server negotiate a session key, the server authenticates itself to the user, and the user authenticates herself to the server, all using public-key cryptography. Compression support is added to the RPC transport, and the client and server communicate over TCP using Sun RPC.

'POTSHARDS: Secure Long-Term Archival Storage Without Encryption' uses secret splitting and approximate pointers as a way to move security from encryption to authentication and to avoid reliance on encryption algorithms that may be compromised at some point in the future. Unlike encryption, secret splitting provides information-theoretic security. In addition, each user maintains a separate, recoverable index over her data, so a compromised index does not affect other users and a lost index is not equivalent to data deletion. More importantly, in the event that a user loses her index, both the index and the data itself can be securely reconstructed from the user's shares stored across multiple archives.
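To make the content-defined chunking of Section 4.2 concrete, here is a minimal Python sketch. It uses a simple polynomial rolling hash rather than the Rabin fingerprints used by LBFS, and the hash base, modulus, and breakpoint value are illustrative assumptions; a real implementation would also enforce minimum and maximum chunk sizes, as LBFS does.

WINDOW = 48            # bytes per sliding window, as in LBFS
MASK = (1 << 13) - 1   # examine the low-order 13 bits: boundary probability about 2^-13
MAGIC = 0              # arbitrary breakpoint value; any constant below 2^13 works
BASE = 257
MOD = (1 << 61) - 1    # large prime modulus for the toy rolling hash
POW_W = pow(BASE, WINDOW, MOD)

def chunk_boundaries(data):
    # Return the end offsets of content-defined chunks.  A boundary is declared
    # whenever the rolling hash of the last 48 bytes matches MAGIC in its low 13 bits.
    boundaries = []
    h = 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW_W) % MOD   # drop the byte leaving the window
        if i + 1 >= WINDOW and (h & MASK) == MAGIC:
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))                   # final partial chunk
    return boundaries

Because each boundary depends only on the 48 bytes preceding it, inserting or deleting data perturbs only the chunks around the edit; boundaries further along the file realign, which is exactly the property that lets LBFS resend only the changed chunk.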
5. Conclusion

Archival data is growing exponentially, so a system that can eliminate data duplication effectively is much needed. Although this paper has elaborated on Venti in depth along with its areas for improvement, and three major issues of Venti have been discussed, there may be cases in which the proposed solutions fail. For hashing, such a case occurs when SHA-256 and MD5 both produce colliding keys. Similarly, content-based chunking is computationally expensive, so further improvement could reduce its cost. Venti has not been evaluated in a distributed environment, which makes that an ideal candidate for future work.

6. References

[1] A. Upadhyay, P. R. Balihalli, S. Ivaturi and S. Rao, "Deduplication and Compression Techniques in Cloud Design," IEEE, 2012.
[2] B. Zhu (Data Domain, Inc.), "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," in Proc. 6th USENIX Conference on File and Storage Technologies (FAST).
[3] P. Kulkarni, J. LaVoie, F. Douglis and J. Tracey, "Redundancy Elimination Within Large Collections of Files," in Proc. USENIX 2004 Annual Technical Conference, 2004.
[4] D. Russell, "Data De-duplication Will Be Even Bigger in 2010," Gartner, 8 February 2010.
[5] M. W. Storer, K. M. Greenan, D. D. E. Long and E. L. Miller, "Secure Data Deduplication," in Proc. 2008 ACM Workshop on Storage Security and Survivability, October 2008.
[6] "Fujitsu's storage systems and related technologies supporting cloud computing," 2010. [Online]. Available: http://www.fujitsu.com/global/
[7] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Storage," in Proc. 1st USENIX Conference on File and Storage Technologies (FAST), Monterey, CA, 2002, pp. 89-101.
[8] D. Bhagwat, K. Eshghi, D. D. E. Long and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup," in Proc. IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2009, pp. 237-245.
[9] J. Black, "Compare-by-Hash: A Reasoned Analysis," in Proc. 2006 USENIX Annual Technical Conference, 2006, pp. 85-90.
[10] D. Borthakur, "The Hadoop Distributed File System: Architecture and Design," 2007. [Online]. Available: hadoop.apache.org/hdfs/docs/current/hdfs_design.pdf, accessed October 2011.