SlideShare a Scribd company logo
1 of 4
Data Deduplication and Chunking
Now-a-days, the demand of data storage capacity is increasing drastically. Due to more
demands of storage, the computer society is moving towards cloud storage. Security of data
and cost factors are important challenges in cloud storage. A duplicate file not only wastes the
storage, it also increases the access time. So the detection and removal of duplicate data is an
essential task. Data deduplication, an efficient approach to data reduction, has gained
increasing attention and popularity in large-scale storage systems. It eliminates redundant data
at the file or subfile level and identifies duplicate content by its cryptographically secure hash
signature. It is very tricky because neither duplicate files don’t have a common key nor they
contain error. There are several approaches to identify and remove redundant data:
 at file level
 at chunk level.
Data deduplication is technique toprevent the storage of redundant data in storage devices.
Data deduplication is gaining much attention by the researchers because it is an efficient
approach to data reduction. Deduplication identifies duplicate contents at chunk-level by
using hash functions and eliminates redundant contents at chunk level. According to
Microsoft research on deduplication, in their production primary and secondary memory are
redundant about 50% and 85% of the data respectively and should be removed by the
deduplication technology.
Chunking is a process to split a file into smaller files called chunks. In some
applications, such as remote data compression, data synchronization, and
data deduplication, chunking is important because it determines the duplicate
detection performance of the system.
Deduplication can be performed at chunk level or file level.
Chunk-level deduplication is preferred over file level deduplication because it identifies and
eliminate redundancy at a finer granularity. Chunking is one of the main challenges in the
deduplication-system. Efficient chunking is one of the key elements that decide the overall
deduplication performance.
The chunking step also play a significant impact on the deduplication ratio and performance.
An effective chunking algorithm should satisfy two properties.
Chunks must be created in a content-defined manner and all chunk sizes should have equal
probability of being selected. Chunk level deduplication scheme divides the input stream into
chunks and produce hash value that identifies each chunk uniquely. Chunking algorithm can
divide the input data stream into chunk of either fixedsize or variable size.If the fingerprints
are matched with previously stored chunk, then these duplicated chunks are removed otherwise
these are stored.
Data deduplication is the process of eliminating the redundant data. It is the technique that is
used to track and remove the same chunks in a storage unit. It is an efficient way to store data
or information. The process of deduplication is shown in figure.
Data deduplication strategies are classified into two parts i.e. File level deduplication and Block
(chunk)level deduplication. In file level deduplication, if the hash value for two files are same
then they are considered identical. Only one copy of a file is kept. Duplicate or redundant
copies of the same file are removed. This type of deduplication is also known as single instance
storage (SIS). In block (chunk)level deduplication, the file is fragmented into blocks (chunks)
then checks and removes duplicate blocks present in files. Only one copy of each block is
maintained. It reduces more space than SIS. It maybe further divided into fixed length and
variable length deduplication. In fixed length deduplication, the size of each block is constant
or fixed whereas in variable length deduplication, the size of each block is different or not
fixed.
Chunking based Data Deduplication
In this method, each file is firstly divided into a sequence of mutually exclusive blocks, called
chunks. Each chunk is a contiguous sequence of bytes from the file which is processed
independently. Various approaches are discussed in details to generate chunks.
Static Chunking
The static or fixed size chunking mechanism divides the file into same sized chunks. Figure 6
shows the working of static size chunking. It is the fastest among other chunking algorithms to
detect duplicate blocks but its performance is not satisfactory. It also leads to a serious problem
called “Boundary Shift”. Boundary shifting problem arises due to modifications in the data. If
user inserts or deletes a byte from the data, then static chunking will generate different
fingerprints for the subsequent chunks even though mostly data in file are intact.
Content Defined Chunking
To overcome the problem of boundary shift, content defined chunking (CDC) is used. CDC
reduces the amount of duplicate data found by the data deduplication systems, instead of setting
boundaries at multiples of the limited chunk size. CDC approach defines breakpoints where a
specific condition becomes true. The working of CDC is shown in figure 7. In simple words,
It can be said that CDC approach determine the chunk boundaries based on the content of the
file. Most of the deduplication systems use the CDC algorithm to achieve good data
deduplication throughput. It is also known as Variable size chunking. It uses the Rabin
fingerprint algorithm for hash value generation. It takes more processing time than the fixed
size chunking but has better deduplication efficiency.
Frequency Based Chunking Algorithm
FBC approach is a hybrid chunking algorithm that divides the byte stream according to the
chunk frequency. First, it identifies the fixed size chunks with high frequency using bloom
filter then the appearance is checked for each chunk in the bloom filter. If the chunk exists in
bloom filter, then it is passed through the parallel filter otherwise record it in one of bloom
filter then count the frequency for each chunk from parallel filter .Its working is shown in figure
13.

More Related Content

What's hot

Survey on cloud backup services of personal storage
Survey on cloud backup services of personal storageSurvey on cloud backup services of personal storage
Survey on cloud backup services of personal storageeSAT Journals
 
Peer to peer cache resolution mechanism for mobile ad hoc networks
Peer to peer cache resolution mechanism for mobile ad hoc networksPeer to peer cache resolution mechanism for mobile ad hoc networks
Peer to peer cache resolution mechanism for mobile ad hoc networksijwmn
 
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...IJORCS
 
Presented by Anu Mattatholi
Presented by Anu MattatholiPresented by Anu Mattatholi
Presented by Anu MattatholiAnu Mattatholi
 
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATIONPRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATIONcscpconf
 
Schema migrations in no sql
Schema migrations in no sqlSchema migrations in no sql
Schema migrations in no sqlDr-Dipali Meher
 
Data encryption using LSB matching algorithm and Reserving Room before Encryp...
Data encryption using LSB matching algorithm and Reserving Room before Encryp...Data encryption using LSB matching algorithm and Reserving Room before Encryp...
Data encryption using LSB matching algorithm and Reserving Room before Encryp...IJERA Editor
 
Information Upload and retrieval using SP Theory of Intelligence
Information Upload and retrieval using SP Theory of IntelligenceInformation Upload and retrieval using SP Theory of Intelligence
Information Upload and retrieval using SP Theory of IntelligenceINFOGAIN PUBLICATION
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
Design of file system architecture with cluster
Design of file system architecture with clusterDesign of file system architecture with cluster
Design of file system architecture with clustereSAT Publishing House
 

What's hot (20)

Survey on cloud backup services of personal storage
Survey on cloud backup services of personal storageSurvey on cloud backup services of personal storage
Survey on cloud backup services of personal storage
 
Peer to peer cache resolution mechanism for mobile ad hoc networks
Peer to peer cache resolution mechanism for mobile ad hoc networksPeer to peer cache resolution mechanism for mobile ad hoc networks
Peer to peer cache resolution mechanism for mobile ad hoc networks
 
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
 
Srinivasan2-10-12
Srinivasan2-10-12Srinivasan2-10-12
Srinivasan2-10-12
 
Presented by Anu Mattatholi
Presented by Anu MattatholiPresented by Anu Mattatholi
Presented by Anu Mattatholi
 
T7
T7T7
T7
 
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATIONPRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
 
Schema migrations in no sql
Schema migrations in no sqlSchema migrations in no sql
Schema migrations in no sql
 
E045026031
E045026031E045026031
E045026031
 
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
 
J0212065068
J0212065068J0212065068
J0212065068
 
C0312023
C0312023C0312023
C0312023
 
A0360109
A0360109A0360109
A0360109
 
B0330811
B0330811B0330811
B0330811
 
T9
T9T9
T9
 
Data encryption using LSB matching algorithm and Reserving Room before Encryp...
Data encryption using LSB matching algorithm and Reserving Room before Encryp...Data encryption using LSB matching algorithm and Reserving Room before Encryp...
Data encryption using LSB matching algorithm and Reserving Room before Encryp...
 
Information Upload and retrieval using SP Theory of Intelligence
Information Upload and retrieval using SP Theory of IntelligenceInformation Upload and retrieval using SP Theory of Intelligence
Information Upload and retrieval using SP Theory of Intelligence
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
1771 1775
1771 17751771 1775
1771 1775
 
Design of file system architecture with cluster
Design of file system architecture with clusterDesign of file system architecture with cluster
Design of file system architecture with cluster
 

Similar to Data deduplication and chunking

Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsShivansh Gaur
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPijdms
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESneirew J
 
11700220085_DDBMS.pptx
11700220085_DDBMS.pptx11700220085_DDBMS.pptx
11700220085_DDBMS.pptxSouvikRoy8783
 
A signature based indexing method for efficient content-based retrieval of re...
A signature based indexing method for efficient content-based retrieval of re...A signature based indexing method for efficient content-based retrieval of re...
A signature based indexing method for efficient content-based retrieval of re...Mumbai Academisc
 
Comparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbComparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbsonalighai
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...dbpublications
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 

Similar to Data deduplication and chunking (20)

Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity Algorithms
 
P2P Cache Resolution System for MANET
P2P Cache Resolution System for MANETP2P Cache Resolution System for MANET
P2P Cache Resolution System for MANET
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
 
Operating system
Operating systemOperating system
Operating system
 
DDBMS Paper with Solution
DDBMS Paper with SolutionDDBMS Paper with Solution
DDBMS Paper with Solution
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
Advance DBMS
Advance DBMSAdvance DBMS
Advance DBMS
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
 
Distributed file systems dfs
Distributed file systems   dfsDistributed file systems   dfs
Distributed file systems dfs
 
Deduplication in Open Spurce Cloud
Deduplication in Open Spurce CloudDeduplication in Open Spurce Cloud
Deduplication in Open Spurce Cloud
 
11700220085_DDBMS.pptx
11700220085_DDBMS.pptx11700220085_DDBMS.pptx
11700220085_DDBMS.pptx
 
Hadoop bank
Hadoop bankHadoop bank
Hadoop bank
 
A signature based indexing method for efficient content-based retrieval of re...
A signature based indexing method for efficient content-based retrieval of re...A signature based indexing method for efficient content-based retrieval of re...
A signature based indexing method for efficient content-based retrieval of re...
 
Comparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbComparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsb
 
Computer Science Homework Help
Computer Science Homework HelpComputer Science Homework Help
Computer Science Homework Help
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 

Recently uploaded

Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
High Profile Call Girls Dahisar Arpita 9907093804 Independent Escort Service ...
High Profile Call Girls Dahisar Arpita 9907093804 Independent Escort Service ...High Profile Call Girls Dahisar Arpita 9907093804 Independent Escort Service ...
High Profile Call Girls Dahisar Arpita 9907093804 Independent Escort Service ...Call girls in Ahmedabad High profile
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 

Recently uploaded (20)

Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
High Profile Call Girls Dahisar Arpita 9907093804 Independent Escort Service ...
High Profile Call Girls Dahisar Arpita 9907093804 Independent Escort Service ...High Profile Call Girls Dahisar Arpita 9907093804 Independent Escort Service ...
High Profile Call Girls Dahisar Arpita 9907093804 Independent Escort Service ...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 

Data deduplication and chunking

  • 1. Data Deduplication and Chunking Now-a-days, the demand of data storage capacity is increasing drastically. Due to more demands of storage, the computer society is moving towards cloud storage. Security of data and cost factors are important challenges in cloud storage. A duplicate file not only wastes the storage, it also increases the access time. So the detection and removal of duplicate data is an essential task. Data deduplication, an efficient approach to data reduction, has gained increasing attention and popularity in large-scale storage systems. It eliminates redundant data at the file or subfile level and identifies duplicate content by its cryptographically secure hash signature. It is very tricky because neither duplicate files don’t have a common key nor they contain error. There are several approaches to identify and remove redundant data:  at file level  at chunk level. Data deduplication is technique toprevent the storage of redundant data in storage devices. Data deduplication is gaining much attention by the researchers because it is an efficient approach to data reduction. Deduplication identifies duplicate contents at chunk-level by using hash functions and eliminates redundant contents at chunk level. According to Microsoft research on deduplication, in their production primary and secondary memory are redundant about 50% and 85% of the data respectively and should be removed by the deduplication technology. Chunking is a process to split a file into smaller files called chunks. In some applications, such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate detection performance of the system. Deduplication can be performed at chunk level or file level. Chunk-level deduplication is preferred over file level deduplication because it identifies and eliminate redundancy at a finer granularity. Chunking is one of the main challenges in the deduplication-system. Efficient chunking is one of the key elements that decide the overall deduplication performance. The chunking step also play a significant impact on the deduplication ratio and performance. An effective chunking algorithm should satisfy two properties. Chunks must be created in a content-defined manner and all chunk sizes should have equal probability of being selected. Chunk level deduplication scheme divides the input stream into chunks and produce hash value that identifies each chunk uniquely. Chunking algorithm can divide the input data stream into chunk of either fixedsize or variable size.If the fingerprints are matched with previously stored chunk, then these duplicated chunks are removed otherwise these are stored. Data deduplication is the process of eliminating the redundant data. It is the technique that is used to track and remove the same chunks in a storage unit. It is an efficient way to store data or information. The process of deduplication is shown in figure.
  • 2. Data deduplication strategies are classified into two parts i.e. File level deduplication and Block (chunk)level deduplication. In file level deduplication, if the hash value for two files are same then they are considered identical. Only one copy of a file is kept. Duplicate or redundant copies of the same file are removed. This type of deduplication is also known as single instance storage (SIS). In block (chunk)level deduplication, the file is fragmented into blocks (chunks) then checks and removes duplicate blocks present in files. Only one copy of each block is maintained. It reduces more space than SIS. It maybe further divided into fixed length and variable length deduplication. In fixed length deduplication, the size of each block is constant or fixed whereas in variable length deduplication, the size of each block is different or not fixed. Chunking based Data Deduplication In this method, each file is firstly divided into a sequence of mutually exclusive blocks, called chunks. Each chunk is a contiguous sequence of bytes from the file which is processed independently. Various approaches are discussed in details to generate chunks. Static Chunking The static or fixed size chunking mechanism divides the file into same sized chunks. Figure 6 shows the working of static size chunking. It is the fastest among other chunking algorithms to detect duplicate blocks but its performance is not satisfactory. It also leads to a serious problem called “Boundary Shift”. Boundary shifting problem arises due to modifications in the data. If user inserts or deletes a byte from the data, then static chunking will generate different fingerprints for the subsequent chunks even though mostly data in file are intact.
  • 3. Content Defined Chunking To overcome the problem of boundary shift, content defined chunking (CDC) is used. CDC reduces the amount of duplicate data found by the data deduplication systems, instead of setting boundaries at multiples of the limited chunk size. CDC approach defines breakpoints where a specific condition becomes true. The working of CDC is shown in figure 7. In simple words, It can be said that CDC approach determine the chunk boundaries based on the content of the file. Most of the deduplication systems use the CDC algorithm to achieve good data deduplication throughput. It is also known as Variable size chunking. It uses the Rabin fingerprint algorithm for hash value generation. It takes more processing time than the fixed size chunking but has better deduplication efficiency.
  • 4. Frequency Based Chunking Algorithm FBC approach is a hybrid chunking algorithm that divides the byte stream according to the chunk frequency. First, it identifies the fixed size chunks with high frequency using bloom filter then the appearance is checked for each chunk in the bloom filter. If the chunk exists in bloom filter, then it is passed through the parallel filter otherwise record it in one of bloom filter then count the frequency for each chunk from parallel filter .Its working is shown in figure 13.