SlideShare a Scribd company logo
1 of 19
Application-Aware Big Data Deduplication
in Cloud Environment
Yinjin Fu , Member, IEEE, Nong Xiao, Member, IEEE, Hong Jiang , Fellow, IEEE,
Guyu Hu, and Weiwei Chen, Member, IEEE
IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 7, NO. 4, OCTOBER-DECEMBER 2019
Present by:
Md. Safayet Hossain
Id:1907551
M.Sc in Cse (kUET)
Present to:
M.M.A. Hashem
Professor
Department of Computer Science & Engineering
Khulna University of Engineering & Technology (KUET)
Khulna -9203, Bangladesh.
Abstract
Here, They proposed AppDedupe, an application-aware scalable inline distributed
deduplication framework in cloud environment.
By exploiting application awareness, data similarity and locality to optimize
distributed deduplication with inter-node two-tiered data routing and intra-node
application-aware deduplication , they could solve the challenges of traditional
deduplication techniques.
It builds application-aware similarity indices with super-chunk handprints to
speedup the intra-node deduplication process with high efficiency
INTRODUCTION
What is data deduplication ??
Data deduplication , a specialized data reduction technique widely deployed in disk-
based storage systems, not only saves data storage space, power and cooling in data
centers, also decreases significant administration time, operational complexity and
risk of human error.
It partitions large data objects into smaller parts, called chunks, represents these
chunks by their fingerprints, replaces the duplicate chunks with their fingerprints
after chunk fingerprint index lookup, and only transfers or stores the unique chunks
for the purpose of improving communication and storage efficiency.
INTRODUCTION
INTRODUCTION
The proposed AppDedupe distributed deduplication system has the following salient features that distinguish
it from the state-of-the-art mechanisms:
AppDedupe is the first research work on leveraging application awareness in the context
of distributed deduplication.
It performs two-tiered routing decision by exploiting application awareness, data
similarity and locality to direct data routing from clients to deduplication storage nodes to
achieve a good tradeoff between the conflicting goals of high deduplication effectiveness
and low system overhead.
It builds a global application route table and independent similarity indices.
Evaluation results show that it consistently and significantly outperforms the state-of-the-
art schemes in distributed deduplication efficiency by achieving high global
deduplication effectiveness
Background and motivation
Distributed Deduplication Techniques:
Background and motivation
Application Difference Redundancy Analysis
They compared the chunk fingerprints of test datasets with inter-application deduplication (inter-
app) and intra-application deduplication (inter-app) using chunk-level deduplication with a fixed
chunk size of 4 KB calculate the corresponding MD5 value as the chunk fingerprint in different
applications, including Linux kernel source code (Linux), dataset in mail server (Mail), virtual
machine images (VM), dataset in web server (Web), photo collections (Photo), music library
(Audio) and movie fileset (Video).
APPDEDUPE DESIGN
They focused on the three design principles to govern their AppDedupe system design:
i. Throughput. The deduplication throughput should scale with the number of nodes
by parallel deduplication across the storage nodes.
ii. Capacity. Similar data should be forwarded to the same deduplication node to
achieve high duplicate elimination ratio.
iii. Scalability. The distributed deduplication system should easily scale out to handle
massive data volumes with balanced workload among nodes
System Overview
It consists of three main
components:
1. clients,
2. dedupe storage nodes
3. director.
Two-Tiered Data Routing Scheme
They presented the two-tiered data routing scheme including: the file-level
application aware routing decision in director and the super-chunk level similarity
aware data routing in clients.
They use 2 algorithm for this,
Algorithm 1. Application Aware Routing Algorithm
Algorithm 2. Handprinting Based Stateful Data Routing
Interconnect Communication
Fig:3 Message exchanges for store operation
Interconnect Communication
Fig:4 Message exchanges for retrieve operation
Key data structures in dedupe process
Evaluation Platform
To evaluate parallel deduplication efficiency they used:
 Linux (Ubuntu 14.10)
 CPU (4-core 8-thread Intel X3440, 2.53 GHz and 16 GB RAM)
 Hard disk drive (Seagate ST1000DM 1TB)
 7 desktops serve as the clients
 One server serves as the director
 Three servers for dedupe storage nodes.
Workload
The Workload Characteristics
of The Real-World Datasets
and Traces
Evaluation Metrics
• Deduplication Efficiency(DE)
• Normalized Deduplication Ratio(NDR)
• Normalized Effective Deduplication Ratio(NEDR)
• Data Skew for Distributed Storage
The Comparison of parallel deduplication throughput of the application
aware similarity index structure with that of the traditional similarity
index structure for multiple data streams
 Traditional similarity index (Naive)
 Aware similarity index (Application aware)
 “cold cache” means the chunk fingerprint cache is empty
 “warm cache” means the duplicate chunk fingerprint had
already been stored in the cache
RAM Usage of Intra-Node Deduplication
for Various Application Workloads
Application-Aware Big Data Deduplication in Cloud

More Related Content

What's hot

Grid Computing
Grid ComputingGrid Computing
Grid Computingabhiritva
 
Grid computing
Grid computingGrid computing
Grid computingWipro
 
Grid computing ppt 2003(done)
Grid computing ppt 2003(done)Grid computing ppt 2003(done)
Grid computing ppt 2003(done)TASNEEM88
 
grid computing
grid computinggrid computing
grid computingrock om
 
Grid and cluster_computing_chapter1
Grid and cluster_computing_chapter1Grid and cluster_computing_chapter1
Grid and cluster_computing_chapter1Bharath Kumar
 
Grid computing by vaishali sahare [katkar]
Grid computing by vaishali sahare [katkar]Grid computing by vaishali sahare [katkar]
Grid computing by vaishali sahare [katkar]vaishalisahare123
 
Grid computing [2005]
Grid computing [2005]Grid computing [2005]
Grid computing [2005]Raul Soto
 
Inroduction to grid computing by gargi shankar verma
Inroduction to grid computing by gargi shankar vermaInroduction to grid computing by gargi shankar verma
Inroduction to grid computing by gargi shankar vermagargishankar1981
 
Unit i introduction to grid computing
Unit i   introduction to grid computingUnit i   introduction to grid computing
Unit i introduction to grid computingsudha kar
 
Challenges and advantages of grid computing
Challenges and advantages of grid computingChallenges and advantages of grid computing
Challenges and advantages of grid computingPooja Dixit
 
Grid computing the grid
Grid computing the gridGrid computing the grid
Grid computing the gridJivan Nepali
 
Grid computing 2007
Grid computing 2007Grid computing 2007
Grid computing 2007Tank Bhavin
 
Introduction to Grid Computing
Introduction to Grid ComputingIntroduction to Grid Computing
Introduction to Grid Computingabhijeetnawal
 

What's hot (20)

Grid Computing
Grid ComputingGrid Computing
Grid Computing
 
Grid computing
Grid computingGrid computing
Grid computing
 
Grid computing
Grid computingGrid computing
Grid computing
 
grid computing
grid computinggrid computing
grid computing
 
Grid computing ppt 2003(done)
Grid computing ppt 2003(done)Grid computing ppt 2003(done)
Grid computing ppt 2003(done)
 
grid computing
grid computinggrid computing
grid computing
 
Grid and cluster_computing_chapter1
Grid and cluster_computing_chapter1Grid and cluster_computing_chapter1
Grid and cluster_computing_chapter1
 
Grid computing by vaishali sahare [katkar]
Grid computing by vaishali sahare [katkar]Grid computing by vaishali sahare [katkar]
Grid computing by vaishali sahare [katkar]
 
Grid computing [2005]
Grid computing [2005]Grid computing [2005]
Grid computing [2005]
 
Grid Computing
Grid ComputingGrid Computing
Grid Computing
 
Inroduction to grid computing by gargi shankar verma
Inroduction to grid computing by gargi shankar vermaInroduction to grid computing by gargi shankar verma
Inroduction to grid computing by gargi shankar verma
 
Unit i introduction to grid computing
Unit i   introduction to grid computingUnit i   introduction to grid computing
Unit i introduction to grid computing
 
Challenges and advantages of grid computing
Challenges and advantages of grid computingChallenges and advantages of grid computing
Challenges and advantages of grid computing
 
Grid Computing
Grid ComputingGrid Computing
Grid Computing
 
Grid computing the grid
Grid computing the gridGrid computing the grid
Grid computing the grid
 
Grid computing ppt
Grid computing pptGrid computing ppt
Grid computing ppt
 
Grid computing 2007
Grid computing 2007Grid computing 2007
Grid computing 2007
 
Grid computing
Grid computingGrid computing
Grid computing
 
Grid Presentation
Grid PresentationGrid Presentation
Grid Presentation
 
Introduction to Grid Computing
Introduction to Grid ComputingIntroduction to Grid Computing
Introduction to Grid Computing
 

Similar to Application-Aware Big Data Deduplication in Cloud

Survey on cloud backup services of personal storage
Survey on cloud backup services of personal storageSurvey on cloud backup services of personal storage
Survey on cloud backup services of personal storageeSAT Journals
 
Geo distributed parallelization pacts in map reduce
Geo distributed parallelization pacts in map reduceGeo distributed parallelization pacts in map reduce
Geo distributed parallelization pacts in map reduceeSAT Publishing House
 
IRJET- Cross User Bigdata Deduplication
IRJET-  	  Cross User Bigdata DeduplicationIRJET-  	  Cross User Bigdata Deduplication
IRJET- Cross User Bigdata DeduplicationIRJET Journal
 
1-s2.0-S0957417422020759-main.pdf
1-s2.0-S0957417422020759-main.pdf1-s2.0-S0957417422020759-main.pdf
1-s2.0-S0957417422020759-main.pdfarchurssu
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
A Hybrid Cloud Approach for Secure Authorized De-Duplication
A Hybrid Cloud Approach for Secure Authorized De-DuplicationA Hybrid Cloud Approach for Secure Authorized De-Duplication
A Hybrid Cloud Approach for Secure Authorized De-DuplicationEditor IJMTER
 
Dataintensive
DataintensiveDataintensive
Dataintensivesulfath
 
GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...
GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...
GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...ijfcstjournal
 
Green wsn optimization of energy use
Green wsn  optimization of energy useGreen wsn  optimization of energy use
Green wsn optimization of energy useijfcstjournal
 
Data Analysis In The Cloud
Data Analysis In The CloudData Analysis In The Cloud
Data Analysis In The CloudMonica Carter
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTijwscjournal
 
A Reconfigurable Component-Based Problem Solving Environment
A Reconfigurable Component-Based Problem Solving EnvironmentA Reconfigurable Component-Based Problem Solving Environment
A Reconfigurable Component-Based Problem Solving EnvironmentSheila Sinclair
 
An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...
An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...
An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...IRJET Journal
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaNithin Kakkireni
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...dbpublications
 
Implementing load balancing algorithm in middleware system of volunteer cloud...
Implementing load balancing algorithm in middleware system of volunteer cloud...Implementing load balancing algorithm in middleware system of volunteer cloud...
Implementing load balancing algorithm in middleware system of volunteer cloud...Gargee Hiray
 
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...M H
 

Similar to Application-Aware Big Data Deduplication in Cloud (20)

Survey on cloud backup services of personal storage
Survey on cloud backup services of personal storageSurvey on cloud backup services of personal storage
Survey on cloud backup services of personal storage
 
Data Dimensional Reduction by Order Prediction in Heterogeneous Environment
Data Dimensional Reduction by Order Prediction in Heterogeneous EnvironmentData Dimensional Reduction by Order Prediction in Heterogeneous Environment
Data Dimensional Reduction by Order Prediction in Heterogeneous Environment
 
Geo distributed parallelization pacts in map reduce
Geo distributed parallelization pacts in map reduceGeo distributed parallelization pacts in map reduce
Geo distributed parallelization pacts in map reduce
 
IRJET- Cross User Bigdata Deduplication
IRJET-  	  Cross User Bigdata DeduplicationIRJET-  	  Cross User Bigdata Deduplication
IRJET- Cross User Bigdata Deduplication
 
1-s2.0-S0957417422020759-main.pdf
1-s2.0-S0957417422020759-main.pdf1-s2.0-S0957417422020759-main.pdf
1-s2.0-S0957417422020759-main.pdf
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
 
A Hybrid Cloud Approach for Secure Authorized De-Duplication
A Hybrid Cloud Approach for Secure Authorized De-DuplicationA Hybrid Cloud Approach for Secure Authorized De-Duplication
A Hybrid Cloud Approach for Secure Authorized De-Duplication
 
Dataintensive
DataintensiveDataintensive
Dataintensive
 
GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...
GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...
GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...
 
Green wsn optimization of energy use
Green wsn  optimization of energy useGreen wsn  optimization of energy use
Green wsn optimization of energy use
 
Data Analysis In The Cloud
Data Analysis In The CloudData Analysis In The Cloud
Data Analysis In The Cloud
 
Eg4301808811
Eg4301808811Eg4301808811
Eg4301808811
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
 
A Reconfigurable Component-Based Problem Solving Environment
A Reconfigurable Component-Based Problem Solving EnvironmentA Reconfigurable Component-Based Problem Solving Environment
A Reconfigurable Component-Based Problem Solving Environment
 
An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...
An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...
An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_Sharmila
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
Implementing load balancing algorithm in middleware system of volunteer cloud...
Implementing load balancing algorithm in middleware system of volunteer cloud...Implementing load balancing algorithm in middleware system of volunteer cloud...
Implementing load balancing algorithm in middleware system of volunteer cloud...
 
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
 

More from Safayet Hossain

Find Transitive closure of a Graph Using Warshall's Algorithm
Find Transitive closure of a Graph Using Warshall's AlgorithmFind Transitive closure of a Graph Using Warshall's Algorithm
Find Transitive closure of a Graph Using Warshall's AlgorithmSafayet Hossain
 
Color Guided Thermal image Super Resolution
Color Guided Thermal image Super ResolutionColor Guided Thermal image Super Resolution
Color Guided Thermal image Super ResolutionSafayet Hossain
 
Different type of attack on computer
Different type of attack on computerDifferent type of attack on computer
Different type of attack on computerSafayet Hossain
 
Region based image segmentation
Region based image segmentationRegion based image segmentation
Region based image segmentationSafayet Hossain
 
Anti- aliasing computer graphics
Anti- aliasing computer graphicsAnti- aliasing computer graphics
Anti- aliasing computer graphicsSafayet Hossain
 
detect emotion from text
detect emotion from textdetect emotion from text
detect emotion from textSafayet Hossain
 
Remittance Management System
Remittance Management System Remittance Management System
Remittance Management System Safayet Hossain
 

More from Safayet Hossain (12)

Epipolar geometry
Epipolar geometryEpipolar geometry
Epipolar geometry
 
Find Transitive closure of a Graph Using Warshall's Algorithm
Find Transitive closure of a Graph Using Warshall's AlgorithmFind Transitive closure of a Graph Using Warshall's Algorithm
Find Transitive closure of a Graph Using Warshall's Algorithm
 
Color Guided Thermal image Super Resolution
Color Guided Thermal image Super ResolutionColor Guided Thermal image Super Resolution
Color Guided Thermal image Super Resolution
 
Different type of attack on computer
Different type of attack on computerDifferent type of attack on computer
Different type of attack on computer
 
Region based image segmentation
Region based image segmentationRegion based image segmentation
Region based image segmentation
 
Anti- aliasing computer graphics
Anti- aliasing computer graphicsAnti- aliasing computer graphics
Anti- aliasing computer graphics
 
detect emotion from text
detect emotion from textdetect emotion from text
detect emotion from text
 
Vector computing
Vector computingVector computing
Vector computing
 
Green computing
Green computing Green computing
Green computing
 
E waste...
E   waste...E   waste...
E waste...
 
Economic presentation
Economic presentationEconomic presentation
Economic presentation
 
Remittance Management System
Remittance Management System Remittance Management System
Remittance Management System
 

Recently uploaded

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 

Recently uploaded (20)

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 

Application-Aware Big Data Deduplication in Cloud

  • 1. Application-Aware Big Data Deduplication in Cloud Environment Yinjin Fu , Member, IEEE, Nong Xiao, Member, IEEE, Hong Jiang , Fellow, IEEE, Guyu Hu, and Weiwei Chen, Member, IEEE IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 7, NO. 4, OCTOBER-DECEMBER 2019 Present by: Md. Safayet Hossain Id:1907551 M.Sc in Cse (kUET) Present to: M.M.A. Hashem Professor Department of Computer Science & Engineering Khulna University of Engineering & Technology (KUET) Khulna -9203, Bangladesh.
  • 2. Abstract Here, They proposed AppDedupe, an application-aware scalable inline distributed deduplication framework in cloud environment. By exploiting application awareness, data similarity and locality to optimize distributed deduplication with inter-node two-tiered data routing and intra-node application-aware deduplication , they could solve the challenges of traditional deduplication techniques. It builds application-aware similarity indices with super-chunk handprints to speedup the intra-node deduplication process with high efficiency
  • 3. INTRODUCTION What is data deduplication ?? Data deduplication , a specialized data reduction technique widely deployed in disk- based storage systems, not only saves data storage space, power and cooling in data centers, also decreases significant administration time, operational complexity and risk of human error. It partitions large data objects into smaller parts, called chunks, represents these chunks by their fingerprints, replaces the duplicate chunks with their fingerprints after chunk fingerprint index lookup, and only transfers or stores the unique chunks for the purpose of improving communication and storage efficiency.
  • 5. INTRODUCTION The proposed AppDedupe distributed deduplication system has the following salient features that distinguish it from the state-of-the-art mechanisms: AppDedupe is the first research work on leveraging application awareness in the context of distributed deduplication. It performs two-tiered routing decision by exploiting application awareness, data similarity and locality to direct data routing from clients to deduplication storage nodes to achieve a good tradeoff between the conflicting goals of high deduplication effectiveness and low system overhead. It builds a global application route table and independent similarity indices. Evaluation results show that it consistently and significantly outperforms the state-of-the- art schemes in distributed deduplication efficiency by achieving high global deduplication effectiveness
  • 6. Background and motivation Distributed Deduplication Techniques:
  • 7. Background and motivation Application Difference Redundancy Analysis They compared the chunk fingerprints of test datasets with inter-application deduplication (inter- app) and intra-application deduplication (inter-app) using chunk-level deduplication with a fixed chunk size of 4 KB calculate the corresponding MD5 value as the chunk fingerprint in different applications, including Linux kernel source code (Linux), dataset in mail server (Mail), virtual machine images (VM), dataset in web server (Web), photo collections (Photo), music library (Audio) and movie fileset (Video).
  • 8. APPDEDUPE DESIGN They focused on the three design principles to govern their AppDedupe system design: i. Throughput. The deduplication throughput should scale with the number of nodes by parallel deduplication across the storage nodes. ii. Capacity. Similar data should be forwarded to the same deduplication node to achieve high duplicate elimination ratio. iii. Scalability. The distributed deduplication system should easily scale out to handle massive data volumes with balanced workload among nodes
  • 9. System Overview It consists of three main components: 1. clients, 2. dedupe storage nodes 3. director.
  • 10. Two-Tiered Data Routing Scheme They presented the two-tiered data routing scheme including: the file-level application aware routing decision in director and the super-chunk level similarity aware data routing in clients. They use 2 algorithm for this, Algorithm 1. Application Aware Routing Algorithm Algorithm 2. Handprinting Based Stateful Data Routing
  • 11. Interconnect Communication Fig:3 Message exchanges for store operation
  • 12. Interconnect Communication Fig:4 Message exchanges for retrieve operation
  • 13. Key data structures in dedupe process
  • 14. Evaluation Platform To evaluate parallel deduplication efficiency they used:  Linux (Ubuntu 14.10)  CPU (4-core 8-thread Intel X3440, 2.53 GHz and 16 GB RAM)  Hard disk drive (Seagate ST1000DM 1TB)  7 desktops serve as the clients  One server serves as the director  Three servers for dedupe storage nodes.
  • 15. Workload The Workload Characteristics of The Real-World Datasets and Traces
  • 16. Evaluation Metrics • Deduplication Efficiency(DE) • Normalized Deduplication Ratio(NDR) • Normalized Effective Deduplication Ratio(NEDR) • Data Skew for Distributed Storage
  • 17. The Comparison of parallel deduplication throughput of the application aware similarity index structure with that of the traditional similarity index structure for multiple data streams  Traditional similarity index (Naive)  Aware similarity index (Application aware)  “cold cache” means the chunk fingerprint cache is empty  “warm cache” means the duplicate chunk fingerprint had already been stored in the cache
  • 18. RAM Usage of Intra-Node Deduplication for Various Application Workloads