The document proposes AppDedupe, a distributed deduplication framework for cloud environments that exploits application awareness, data similarity, and locality. AppDedupe uses a two-tiered routing scheme with application-aware routing at the director level and similarity-aware routing at the client level. It builds application-aware similarity indices with super-chunk handprints to speed up intra-node deduplication. Evaluation results show that AppDedupe consistently outperforms state-of-the-art schemes in distributed deduplication efficiency, achieving high global deduplication effectiveness.
Application-Aware Big Data Deduplication in Cloud
1. Application-Aware Big Data Deduplication
in Cloud Environment
Yinjin Fu , Member, IEEE, Nong Xiao, Member, IEEE, Hong Jiang , Fellow, IEEE,
Guyu Hu, and Weiwei Chen, Member, IEEE
IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 7, NO. 4, OCTOBER-DECEMBER 2019
Present by:
Md. Safayet Hossain
Id:1907551
M.Sc. in CSE (KUET)
Present to:
M.M.A. Hashem
Professor
Department of Computer Science & Engineering
Khulna University of Engineering & Technology (KUET)
Khulna -9203, Bangladesh.
2. Abstract
They proposed AppDedupe, an application-aware, scalable, inline distributed
deduplication framework for cloud environments.
By exploiting application awareness, data similarity and locality to optimize
distributed deduplication with inter-node two-tiered data routing and intra-node
application-aware deduplication, they address the challenges faced by traditional
deduplication techniques.
It builds application-aware similarity indices with super-chunk handprints to
speed up the intra-node deduplication process with high efficiency.
3. INTRODUCTION
What is data deduplication?
Data deduplication, a specialized data reduction technique widely deployed in disk-
based storage systems, not only saves storage space, power and cooling in data
centers, but also significantly reduces administration time, operational complexity
and the risk of human error.
It partitions large data objects into smaller parts, called chunks; represents these
chunks by their fingerprints; replaces duplicate chunks with their fingerprints
after a chunk fingerprint index lookup; and transfers or stores only the unique
chunks, thereby improving communication and storage efficiency.
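The chunk-and-fingerprint workflow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses fixed-size 4 KB chunking and MD5 fingerprints (as in the paper's redundancy analysis), and a plain dictionary stands in for the chunk store and fingerprint index.

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # 4 KB fixed-size chunks, as in the paper's analysis

def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a data object into fixed-size chunks and fingerprint each with MD5."""
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        yield hashlib.md5(chunk).hexdigest(), chunk

def deduplicate(data: bytes, store: dict):
    """Store only unique chunks; return the object's fingerprint recipe."""
    recipe = []
    for fp, chunk in chunk_fingerprints(data):
        if fp not in store:       # chunk fingerprint index lookup
            store[fp] = chunk     # only unique chunks are transferred or stored
        recipe.append(fp)         # duplicate chunks are replaced by fingerprints
    return recipe

# Two identical 4 KB chunks of "A" followed by one chunk of "B":
store = {}
recipe = deduplicate(b"A" * 8192 + b"B" * 4096, store)
# the recipe references 3 chunks, but only 2 unique chunks are stored
```

The recipe preserves enough information to reconstruct the object from the store, which is why only unique chunks need to cross the network.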
5. INTRODUCTION
The proposed AppDedupe distributed deduplication system has the following salient features that distinguish
it from the state-of-the-art mechanisms:
AppDedupe is the first research work on leveraging application awareness in the context
of distributed deduplication.
It performs two-tiered routing decision by exploiting application awareness, data
similarity and locality to direct data routing from clients to deduplication storage nodes to
achieve a good tradeoff between the conflicting goals of high deduplication effectiveness
and low system overhead.
It builds a global application route table and independent similarity indices.
Evaluation results show that it consistently and significantly outperforms the state-of-the-
art schemes in distributed deduplication efficiency by achieving high global
deduplication effectiveness.
7. Background and motivation
Application Difference Redundancy Analysis
They compared the chunk fingerprints of test datasets under inter-application deduplication
(inter-app) and intra-application deduplication (intra-app), using chunk-level deduplication
with a fixed chunk size of 4 KB and the MD5 value of each chunk as its fingerprint, across
different applications: Linux kernel source code (Linux), a mail server dataset (Mail),
virtual machine images (VM), a web server dataset (Web), photo collections (Photo), a music
library (Audio) and a movie fileset (Video).
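The intra-app vs. inter-app comparison above can be illustrated with a toy sketch (the datasets here are synthetic stand-ins, not the paper's workloads): intra-app deduplication counts unique chunks within each application's dataset separately, while inter-app (global) deduplication also eliminates the chunks shared across applications.

```python
import hashlib

def fingerprints(dataset: bytes, chunk_size: int = 4096) -> set:
    """Fingerprint set of a dataset under 4 KB fixed-size chunking with MD5."""
    return {hashlib.md5(dataset[i:i + chunk_size]).hexdigest()
            for i in range(0, len(dataset), chunk_size)}

# Hypothetical stand-ins for two application datasets that share one chunk:
app_a = b"X" * 4096 + b"Y" * 4096   # chunks {X, Y}
app_b = b"X" * 4096 + b"Z" * 4096   # chunks {X, Z}

intra = len(fingerprints(app_a)) + len(fingerprints(app_b))  # per-app unique chunks
inter = len(fingerprints(app_a) | fingerprints(app_b))       # globally unique chunks
cross_app_overlap = intra - inter   # redundancy found only by inter-app dedup
```

The paper's measurement is the real-world version of this comparison: if `cross_app_overlap` is small relative to the per-app redundancy, routing data by application loses little deduplication effectiveness.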
8. APPDEDUPE DESIGN
They focused on three design principles governing the AppDedupe system design:
i. Throughput. The deduplication throughput should scale with the number of nodes
through parallel deduplication across the storage nodes.
ii. Capacity. Similar data should be forwarded to the same deduplication node to
achieve a high duplicate elimination ratio.
iii. Scalability. The distributed deduplication system should easily scale out to handle
massive data volumes with a balanced workload among nodes.
10. Two-Tiered Data Routing Scheme
They presented a two-tiered data routing scheme comprising file-level
application-aware routing decisions at the director and super-chunk-level
similarity-aware data routing at the clients.
Two algorithms implement this scheme:
Algorithm 1. Application Aware Routing Algorithm
Algorithm 2. Handprinting Based Stateful Data Routing
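A minimal sketch of the client-side, similarity-aware tier, assuming a handprint is the k numerically smallest chunk fingerprints of a super-chunk and that each storage node keeps a similarity index of handprints it has seen. The helper names and the load-based tie-breaking rule are illustrative, not the paper's exact Algorithm 2.

```python
import hashlib

def handprint(chunk_fps, k: int = 4) -> set:
    """Handprint: the k smallest fingerprints represent a super-chunk."""
    return set(sorted(chunk_fps)[:k])

def route_super_chunk(chunk_fps, node_indices, node_loads, k: int = 4) -> int:
    """Similarity-aware stateful routing sketch: send the super-chunk to the
    node whose similarity index shares the most handprint fingerprints,
    breaking ties toward the least-loaded node."""
    hp = handprint(chunk_fps, k)
    def score(node):
        overlap = len(hp & node_indices[node])
        return (-overlap, node_loads[node])   # max overlap, then min load
    target = min(node_indices, key=score)
    node_indices[target] |= hp                # update that node's similarity index
    node_loads[target] += len(chunk_fps)      # track load for tie-breaking
    return target

# Two storage nodes; route the same super-chunk twice:
node_indices = {0: set(), 1: set()}
node_loads = {0: 0, 1: 0}
fps = [hashlib.md5(bytes([i])).hexdigest() for i in range(8)]
first = route_super_chunk(fps, node_indices, node_loads)
second = route_super_chunk(fps, node_indices, node_loads)
# similar data is routed to the same node both times
```

Routing at super-chunk granularity (rather than per chunk) is what preserves locality: all chunks of a super-chunk land on one node, so that node's fingerprint cache stays effective.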
14. Evaluation Platform
To evaluate parallel deduplication efficiency, they used:
Linux (Ubuntu 14.10)
CPU (4-core 8-thread Intel X3440, 2.53 GHz and 16 GB RAM)
Hard disk drive (Seagate ST1000DM 1TB)
7 desktops serve as the clients
One server serves as the director
Three servers serve as dedupe storage nodes.
17. Comparison of the parallel deduplication throughput of the application-
aware similarity index structure with that of the traditional similarity
index structure for multiple data streams
Traditional similarity index (Naive)
Application-aware similarity index (App-aware)
"Cold cache" means the chunk fingerprint cache is empty;
"warm cache" means the duplicate chunk fingerprints have
already been loaded into the cache.
18. RAM Usage of Intra-Node Deduplication
for Various Application Workloads