2. What is cloud storage?
• Cloud computing is an emerging computing paradigm in
which resources of the computing infrastructure are
provided as services over the Internet.
Basic characteristics of cloud:
• On-demand self-service
• Broad network access
• Resource pooling
• Measured service
• Rapid elasticity
3. What is deduplication?
Deduplication: a technique that eliminates redundant
data by storing only a single copy of each file or block.
It reduces the space and bandwidth requirements of data
storage services such as the cloud.
It provides major savings in backup environments (more
than 90% in common business scenarios).
4. Deduplication is among the most impactful storage technologies
• In April 2008, IBM acquired Diligent
• In July 2009, EMC acquired Data Domain
• In July 2010, Dell acquired Ocarina
5. How are files deduped?
• Fingerprint each file using a hash function
– Common hashes used: SHA-1, SHA-256, others…
– Store an index of all the hashes already in the system
• New file:
– Compute its hash
– Look the hash up in the index table
– If the hash is new → add it to the index and store the data
– If the hash is known → store a pointer to the existing data
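The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production system; the `DedupStore` class and its method names are made up for this example, and SHA-256 stands in for whichever hash a real system uses.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    # Hash the file contents; identical files yield identical fingerprints
    return hashlib.sha256(data).hexdigest()

class DedupStore:
    """Toy dedup store: one copy of each distinct payload, pointers per file."""
    def __init__(self):
        self.index = {}   # fingerprint -> stored data (single copy)
        self.files = {}   # filename -> fingerprint (pointer)

    def put(self, name: str, data: bytes) -> None:
        h = fingerprint(data)
        if h not in self.index:      # new fingerprint: store the actual data
            self.index[h] = data
        self.files[name] = h         # known fingerprint: store only a pointer

store = DedupStore()
store.put("a.txt", b"hello world")
store.put("b.txt", b"hello world")   # duplicate file: no extra copy stored
```

After the two `put` calls, `store.files` records both names but `store.index` holds the payload only once.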
6. Deduplication strategies
There are two main deduplication strategies:
File-level deduplication – only a single copy of each file is
stored, based on the hash value of the whole file.
Block-level deduplication – each file is divided into
blocks, and only a single copy of identical blocks is stored.
Blocks can be fixed-size or variable-size.
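The block-level strategy can be sketched as follows. This is a hedged illustration, assuming fixed-size blocks; the function name and the per-file "recipe" representation are inventions for the example, not part of the slides.

```python
import hashlib

def dedup_blocks(files: dict, block_size: int = 4):
    """Store each distinct block once; each file keeps a list of pointers."""
    store, recipes = {}, {}
    for name, data in files.items():
        pointers = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            h = hashlib.sha256(block).hexdigest()
            store.setdefault(h, block)   # single copy of identical blocks
            pointers.append(h)
        recipes[name] = pointers         # the file is a sequence of pointers
    return store, recipes

store, recipes = dedup_blocks({"a": b"AAAABBBBCCCC", "b": b"AAAABBBBDDDD"})
# the shared leading blocks "AAAA" and "BBBB" are stored only once
```

Even though the two files differ, their common blocks are stored a single time, which is exactly the saving that file-level deduplication would miss here.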
7. This particularly includes the following
• Location: Deduplication can be performed at dierent
locations. Depending on the participating machines
and steps in the customized deduplication process,it
is either performed on the client machine (source-
side) or near the final data store (target-side).
• Since that conserves network bandwidth,
8. • Time: Depending on the use case, deduplication is
performed either as the data is transferred from the
source to the target (in-band or in-line) or
asynchronously at well-defined intervals (out-of-band
or post-process).
• Depending on the amount of data, the in-line approach
can cause a bottleneck in terms of throughput on the
server side.
9. Applications
Due to its storage-reducing nature, deduplication is widely
used in
• Backup systems
• Disaster recovery
10. Deduplication methods
• Deduplication methods differ greatly in how the redundant
data is identified
• Depending on the requirements of the application
and the characteristics of the data, they are categorised
as
- single instance storage
- fixed-size chunking
- variable-size chunking
- file-type-aware chunking
11. Single instance storage:
It does not break files into smaller chunks but
rather uses the entire file as the chunk.
This method only eliminates duplicate files and does
not detect files that differ in just a few bytes.
Advantages:
Fast indexing
Low CPU usage
Disadvantages:
Cannot be applied to large files with changing data
12. Fixed-size chunking
Instead of using the entire file as the smallest unit, it breaks
files into equally sized chunks.
If a large file is changed, only the changed chunks must
be re-indexed and transferred to the backup location.
Disadvantages:
It fails to detect redundant data if some bytes are
inserted into or deleted from the file, because chunk
boundaries are determined by offset rather than by
content.
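The boundary-shift weakness is easy to demonstrate. The sketch below (with an artificially small chunk size of 4 bytes, chosen only for readability) shows that inserting a single byte at the front of a file shifts every chunk boundary, so none of the old chunks are recognized.

```python
def fixed_chunks(data: bytes, size: int = 4) -> list:
    # Split data at fixed offsets: 0, size, 2*size, ...
    return [data[i:i + size] for i in range(0, len(data), size)]

a = fixed_chunks(b"ABCDEFGHIJKL")
b = fixed_chunks(b"XABCDEFGHIJKL")  # same file with one byte inserted in front

# Every boundary after the insertion point has shifted, so no chunk of the
# new file matches a chunk of the old one -- zero deduplication achieved.
shared = set(a) & set(b)
```

Here `shared` is empty even though the two files are almost identical, which is precisely why content-defined (variable-size) chunking was introduced.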
13. Variable-size chunking
It defines breakpoints where a certain condition
becomes true.
This is usually done with a fixed-size overlapping sliding
window.
At every offset of a file, the contents of the sliding
window are analyzed and a fingerprint f is
calculated.
If f satisfies the break condition, a new breakpoint
has been found and a new chunk is created.
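The sliding-window procedure can be sketched as below. This is a deliberately simplified toy: the fingerprint is just the byte sum of the window (real systems use a Rabin fingerprint that updates in O(1) per offset), and the window size and break mask are arbitrary choices for illustration.

```python
WINDOW = 4
MASK = 0x0F   # break condition: low 4 bits of the fingerprint are all zero

def rolling_chunks(data: bytes) -> list:
    """Content-defined chunking with a fixed-size overlapping sliding window."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        # Toy fingerprint f of the current window; a real implementation
        # would use a rolling (Rabin) fingerprint instead of recomputing.
        f = sum(data[i - WINDOW:i])
        if f & MASK == 0:            # break condition satisfied:
            chunks.append(data[start:i])   # a new breakpoint is found
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk with no breakpoint
    return chunks
```

Because breakpoints depend on window *content* rather than absolute offset, inserting a few bytes only disturbs the chunks around the edit; boundaries further on realign, which is what makes this method robust where fixed-size chunking fails.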
14. File-type-aware chunking
The best redundancy detection can be achieved
if the data stream is understood by the
chunking method.
Advantages:
Breakpoints are more natural
Better space savings
15. Future scope
• The solutions described above provide security in different
ways, but none of them covers all the vulnerabilities. In
addition, no method has yet been proposed
that can establish trust between the user and the cloud
service provider. The client is never assured that his files are
known only to him and to no one else, including the SSP.
We therefore propose a solution whose main aim is to
establish trust between the user and the SSP. This method
involves injecting an application on the client-side LAN. An
abstract view of the setup is shown in figure 1. The various
modules and the working of the system are described below: