This is not my paper, just an assignment of the computer algorithm class I am taking to present a paper.
Title: Understanding and Surpassing Dropbox: Efficient
Incremental Synchronization in Cloud Storage Services
Authors: Shenglong Li, Quanlu Zhang, Zhi Yang, Yafei Dai
Source: http://dx.doi.org/10.1109/GLOCOM.2015.7417235
Presenter: Fajar Purnama
Video https://bit.tube/play?hash=QmSKeTyFcuKuRrTGMqLVHy43RXHrQgQkPoXhnH4MAMkf6K&channel=156033
Presentation of Understanding and Surpassing Dropbox Globecom 2015
1. Understanding and Surpassing Dropbox: Efficient
Incremental Synchronization in Cloud Storage Services
Shenglong Li, Quanlu Zhang, Zhi Yang, Yafei Dai
Peking University
(lishenglong, zql, yangzhi, dyf)@net.pku.edu.cn
June 18, 2016
Presented by: Fajar Purnama (HICC LAB) Kumamoto University, GLOBECOM2015 June 18, 2016 1 / 29
2. Outline
1 Introduction
  Background
  Objective
  Contribution
2 Related Work
  Measurement of cloud storage services
  Similarity detection technique
  State of The Art
3 Understanding Incremental Sync Of Cloud Storage Services
  Rsync Algorithm
  Sync Mechanism on Dropbox
  Detail Measurement and Analysis
4 System Design and Implementation
  System Architecture
  Delta Sharing
  Chunk-Based Rsync with Similarity Detection
  Efficient conflict resolution
5 Evaluation
  Modification Benchmark
  File Conflict
  Comparison with other cloud services
  Evaluation of Additional Overhead
6 Conclusion
3. Introduction Background
Cloud Storage Services
With users' increasing demand for high data reliability and convenient
data access, cloud storage services have become extremely prevalent and
phenomenally successful. They are especially popular for file sharing
scenarios.
Seafile
4. Introduction Objective
Understanding and Surpassing Dropbox
Data synchronization is the heart of cloud storage
services, and incremental data synchronization is applied
to minimize network traffic.
Check whether "modified data = uploaded data" holds for
the active client.
Check whether "downloaded data (passive client) =
uploaded data (active client)" holds.
Check whether both active and passive clients remain
efficient during a file conflict.
Create an improved prototype based on the findings.
5. Introduction Contribution
3
Measurement on Dropbox
Conduct intensive measurements on Dropbox in file
sharing scenarios.
Mechanism on Dropbox
Unravel the sync mechanisms employed in Dropbox on
both active and passive clients.
Minbox
Design several novel mechanisms that resolve the traffic
problems, and apply them in an efficient incremental
synchronization system.
6. Related Work Measurement of cloud storage services
Measurement of cloud storage services
Drago et al. first uncover the Dropbox system architecture
and data sync mechanism through an ISP-level
large-scale measurement.
Li et al. reveal the traffic overuse problem in Dropbox when
users frequently modify files in the synced folder, and
propose an efficient batched sync mechanism to
avoid massive metadata interaction.
In a separate study, Li et al. focus on quantifying and
understanding traffic usage effectiveness through
measurements of several popular cloud storage services
on different devices.
7. Related Work Similarity detection technique
Similarity detection technique
Xia et al. propose a new similarity detection
algorithm to better exploit similarity with low
RAM overhead and high throughput.
Google deploys SimHash to improve space
efficiency and query quality for web crawling.
Mark Manasse implements MinHash using a
shingle sampling technique to extract features.
8. Related Work State of The Art
1
While these previous works cover the data sync
mechanism as one of the key operations, none of
them tries to fully understand the mechanism of
the incremental sync technique in file sharing
scenarios, or measures the network traffic under
different write behaviours. Moreover, we reveal
network traffic waste problems that have not been
explored before and design several sync mechanisms
to solve them.
9. Related Work State of The Art
2
Our system design and implementation are
different from these works. Specifically, we design
an efficient chunk-based delta encoding
mechanism embedding similarity detection
technique, which combines locality-sensitive hash
and content defined chunking technique to
optimize the computation overhead while
guaranteeing precision. Moreover, this mechanism
can integrate with other deduplication techniques
seamlessly.
10. Understanding Incremental Sync Of Cloud Storage Services Rsync Algorithm
An Incremental Data Sync Algorithm
The whole point of rsync is that when a file is modified on a remote host,
only the modified part is sent to the client rather than the whole file.
When a file is modified, the client retrieves a signature of the original
file, which consists of strong checksums (e.g., BLAKE2, MD5) and weak
checksums (e.g., Adler-32, a type of rolling checksum).
The client first computes the weak checksums of the blocks in the changed
file. If a weak checksum matches one of the retrieved checksums, the client
calculates the block's strong checksum to verify that the two blocks are
indeed the same.
If there is no match, the client rolls one byte forward and calculates the
weak checksum again to look for matching blocks, which makes it possible to
find content that has been shifted (skewed).
Finally, all the differing parts, called the delta, are found and sent back
to the server. The changed file is generated on the server by merging the
delta with the original file, which is called patching.
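The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not librsync or Dropbox code: the 4-byte block size, the function names, and the simplified Adler-32-style weak checksum are assumptions made for the example, MD5 stands in for the strong checksum, and the weak checksum is recomputed at each offset instead of being rolled incrementally as a real implementation would do.

```python
import hashlib

BLOCK = 4  # tiny, hypothetical block size so the example is easy to follow

def weak(block: bytes) -> int:
    """Simplified Adler-32-style weak checksum (not the real Adler-32)."""
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return (b << 16) | a

def strong(block: bytes) -> bytes:
    """Strong checksum used to confirm a weak-checksum match."""
    return hashlib.md5(block).digest()

def signature(old: bytes) -> dict:
    """Map weak checksum -> list of (strong checksum, block index) of the old file."""
    sig = {}
    for pos in range(0, len(old), BLOCK):
        block = old[pos:pos + BLOCK]
        sig.setdefault(weak(block), []).append((strong(block), pos // BLOCK))
    return sig

def delta(new: bytes, sig: dict) -> list:
    """Return a list of ('copy', block_index) and ('data', literal_bytes) operations."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        block = new[i:i + BLOCK]
        match = None
        for s, idx in sig.get(weak(block), []):
            if s == strong(block):  # verify the weak match with the strong checksum
                match = idx
                break
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            i += len(block)
        else:
            literal.append(new[i])  # no match: move forward by one byte
            i += 1
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def patch(old: bytes, ops: list) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * BLOCK:(arg + 1) * BLOCK] if kind == "copy" else arg
    return bytes(out)

old = b"abcdefghijklmnop"
new = b"abcdXefghijklmnop"  # one byte inserted after the first block
ops = delta(new, signature(old))
print(ops)
```

In this example only the single inserted byte travels as literal data; the rest of the new file is described by references to blocks of the old file.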
11. Understanding Incremental Sync Of Cloud Storage Services Rsync Algorithm
Illustration of Rsync
[Figure: rsync between client and server. A signature of the old file is compared against the new file to produce the delta (patch); old file + delta (patch) = new file.]
12. Understanding Incremental Sync Of Cloud Storage Services Sync Mechanism on Dropbox
Dropbox Index Server and Amazon Data Server
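[Figure: Client, Index Server, and Data Server]
1. The client requests the file location from the index server.
2. The index server sends the file location.
3. The client syncs the file with the data server using rsync at that file location.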
13. Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Active and Passive Client
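[Figure: the active client adds or modifies data (for example 1B in a 10MB file) and syncs with the Dropbox servers, which in turn sync with the passive clients. The question on each link is whether 10MB or only 1B is transferred.]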
14. Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Experiment 1: Replacement at Different Positions
1. Divide both files into 4MB chunks. For example, for a 10MB file:
   Old file: 4MB | 4MB | 2MB
   New file: 4MB | 4MB | 2MB
2. Check whether each pair of chunks is identical; if not, execute rsync on that chunk.
   Old file: 4MB  | 4MB  | 2MB
             Same | Same | rsync
   New file: 4MB  | 4MB  | 2MB
For Figure 1, rsync is executed on the first, the second, or the third chunk, depending on where the replacement is made.
Uplink (A) and downlink (B), both based on the delta, should be the same size.
Downlink for client A is the same because the "active client" already stores the signature data.
Passive clients have to send the signature to the data server, and that is why they have uplink traffic.
The uplink is smaller when the "end" is modified because rsync works on 4KB blocks and the last chunk is only 2MB.
Librsync uses a 256-bit strong checksum and a 32-bit weak checksum, so the signature of a 4MB chunk is ((256b+32b)/8)*(4MB/4KB) = 36KB.
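The signature arithmetic on this slide can be checked with a short calculation. The function name and default parameters below are assumptions chosen to mirror the numbers on the slide (4KB blocks, 256-bit strong and 32-bit weak checksums).

```python
KB, MB = 1024, 1024 * 1024

def signature_bytes(chunk_size, block_size=4 * KB, strong_bits=256, weak_bits=32):
    """Signature size of one chunk: one (strong, weak) checksum pair per rsync block."""
    blocks = -(-chunk_size // block_size)  # ceiling division
    return blocks * (strong_bits + weak_bits) // 8

print(signature_bytes(4 * MB) // KB)  # signature of a full 4MB chunk, in KB
print(signature_bytes(2 * MB) // KB)  # signature of the 2MB tail chunk, in KB
```

A 4MB chunk holds 1024 blocks of 4KB, each contributing 36 bytes, which gives 36KB; the 2MB tail chunk gives half of that.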
15. Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Experiment 1: Insertion at Different Positions
For a 10MB file divided into chunks of 4MB | 4MB | 2MB:
Case 1: rsync on every chunk. Signature sent = 36KB + 36KB + 18KB = 90KB.
Case 2: rsync on chunks 2 and 3. Signature sent = 36KB + 18KB = 54KB.
Case 3: rsync on the last chunk only. Signature sent = 18KB.
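The three signature totals can be reproduced with a small model. It assumes fixed 4MB chunking and that an insertion shifts the content of every chunk from the insertion point onward, so all of those chunks need rsync. The insertion offsets used here (0, 5MB, 9MB) are illustrative choices, not values taken from the paper.

```python
KB, MB = 1024, 1024 * 1024
CHUNK, BLOCK = 4 * MB, 4 * KB
PAIR = (256 + 32) // 8  # bytes per block: strong + weak checksum

def signature_sent(file_size, insert_offset):
    """Total signature bytes for the chunks at or after the insertion point."""
    total = 0
    for start in range(0, file_size, CHUNK):
        end = min(start + CHUNK, file_size)
        if end > insert_offset:  # this chunk's content is shifted by the insertion
            blocks = (end - start + BLOCK - 1) // BLOCK
            total += blocks * PAIR
    return total

for offset in (0, 5 * MB, 9 * MB):
    print(offset // MB, "MB ->", signature_sent(10 * MB, offset) // KB, "KB")
```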
16. Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Experiment 2: Modification with different amounts of data
Replace or insert different amounts of data, ranging from 4KB to
4MB, in the middle of a 4MB file and an 8MB file.
When the replaced content is larger than 100KB, the amount of data
transferred is less than the amount modified, due to data compression in Dropbox.
Data insertion may show a traffic waste problem for larger data because of
skewing under fixed-length chunking, which rsync by itself should normally
be able to handle.
17. Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Experiment 3: File conflict
Figure 3: A and B modify the file at the same time and both sync to the server.
B's update reaches the server first, and A syncs from the server.
But when A's modified data reaches the server and is synced, B treats it as a
new file and re-downloads the whole file.
Figure 4 is a more complicated case, similar to Figure 3 but
with 3 file conflicts.
18. System Design and Implementation System Architecture
System Architecture
19. System Design and Implementation Delta Sharing
Delta Sharing
Usually a passive client executes rsync repeatedly
to sync updates in a timely manner, which is wasteful.
Since passive clients tend to stay online, the delta
generated by the active client can be reused.
In other words, passive clients do not have to execute
rsync; they retrieve the delta from the delta server.
Passive clients do not have to maintain the online
state themselves, since it can be tracked through the index server.
If a passive client is offline for a long time, the previous
mechanism is used (execute rsync).
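The decision a passive client makes under delta sharing can be sketched as follows. This is a hypothetical model, not Minbox code: the DeltaServer class, the per-version keying of deltas, and the action tuples are all assumptions for illustration. A missing delta stands in for the "offline for a long time" case, where the client falls back to rsync.

```python
class DeltaServer:
    """Hypothetical store of deltas uploaded by active clients, keyed by file version."""

    def __init__(self):
        self.deltas = {}  # (path, from_version) -> delta that leads to from_version + 1

    def put(self, path, from_version, delta):
        self.deltas[(path, from_version)] = delta

    def get(self, path, from_version):
        return self.deltas.get((path, from_version))


def sync_passive(server, path, local_version, latest_version):
    """Return the actions a passive client would take to reach the latest version."""
    actions = []
    version = local_version
    while version < latest_version:
        delta = server.get(path, version)
        if delta is None:
            # No shared delta is available for this version: fall back to rsync.
            return [("rsync", path)]
        actions.append(("apply_delta", delta))
        version += 1
    return actions


server = DeltaServer()
server.put("report.txt", 1, "delta 1->2")
server.put("report.txt", 2, "delta 2->3")
print(sync_passive(server, "report.txt", 1, 3))  # reuses the shared deltas
print(sync_passive(server, "report.txt", 0, 3))  # falls back to rsync
```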
20. System Design and Implementation Chunk-Based Rsync with Similarity Detection
Similarity Detection Mechanism
21. System Design and Implementation Chunk-Based Rsync with Similarity Detection
Algorithm
22. System Design and Implementation Chunk-Based Rsync with Similarity Detection
Algorithm Summary
Use a locality-sensitive hash to detect similar chunks.
To reduce computation overhead while guaranteeing
detection precision, the ImpMinHash algorithm is
employed.
The non-deduplicated chunks are sliced into
sub-blocks using Rabin fingerprints.
Then the smallest cyclic redundancy check (CRC)
checksums of the sub-blocks are selected to identify the
chunk. Finally, the Jaccard index is used to compute the
similarity between chunks.
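The summary above can be illustrated with a generic "smallest-k" sketch. This is not the paper's ImpMinHash, whose details are not given on the slide. As stated assumptions: a CRC32 over a sliding 16-byte window replaces the Rabin fingerprint for choosing cut points, the sub-block sizes and k = 32 are arbitrary, and all function names are hypothetical.

```python
import random
import zlib

def sub_blocks(chunk: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut the chunk wherever a checksum of the last `window` bytes has its low bits zero."""
    blocks, start = [], 0
    for end in range(window, len(chunk) + 1):
        if end - start >= window and zlib.crc32(chunk[end - window:end]) & mask == 0:
            blocks.append(chunk[start:end])
            start = end
    if start < len(chunk):
        blocks.append(chunk[start:])
    return blocks

def sketch(chunk: bytes, k: int = 32) -> set:
    """Keep the k smallest CRC32 values of the sub-blocks as the chunk's feature set."""
    return set(sorted({zlib.crc32(b) for b in sub_blocks(chunk)})[:k])

def jaccard(a: set, b: set) -> float:
    """Jaccard index |A intersect B| / |A union B|; two empty sets count as identical."""
    return len(a & b) / len(a | b) if a | b else 1.0

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(8192))
edited = original[:4000] + b"inserted text" + original[4000:]
unrelated = bytes(rng.getrandbits(8) for _ in range(8192))

print(jaccard(sketch(original), sketch(edited)))     # similar chunks score high
print(jaccard(sketch(original), sketch(unrelated)))  # unrelated chunks score low
```

Because only a few sub-blocks change when a small edit is made, most of the smallest checksums survive, so the edited chunk is still recognized as similar to the original.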
23. System Design and Implementation Efficient conflict resolution
Efficient conflict resolution
24. Evaluation Modification Benchmark
Modification Benchmark
Replaying experiment 1 shows that, unlike Dropbox, Minbox has no uplink traffic on the
passive client.
Replaying experiment 2 gives the results in Figure 8 and Figure 9, where Minbox, which
implements the similarity detection algorithm, outperforms Dropbox.
MinboxFD (native) uses fixed-length 4MB chunking, while MinboxVD uses content
defined chunking (CDC) with an average chunk size of 4MB.
In most cases, MinboxVD performs the best by taking advantage of CDC to avoid the
impact of content skewing.
However, for large modification workloads on the 8MB file, MinboxFD outperforms MinboxVD,
because MinboxVD slices out new chunks that are not similar to the original chunks.
After matching these chunks, MinboxVD may generate more redundant delta
than MinboxFD.
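The effect of content skewing on fixed-length versus content-defined chunking can be seen in a toy experiment. This is a sketch under stated assumptions, not the Minbox implementation: chunks are about 64 bytes instead of 4MB, a CRC32 over a sliding window picks the CDC cut points, and the function names are hypothetical.

```python
import hashlib
import random
import zlib

def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Fixed-length chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Content-defined chunking: cut where a checksum of the last `window` bytes has low bits zero."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if end - start >= window and zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def reused(old_chunks: list, new_chunks: list) -> int:
    """Count the new chunks whose content already exists among the old chunks."""
    known = {hashlib.md5(c).digest() for c in old_chunks}
    return sum(hashlib.md5(c).digest() in known for c in new_chunks)

rng = random.Random(1)
old = bytes(rng.getrandbits(8) for _ in range(8192))
new = b"!" + old  # one byte inserted at the front skews everything after it

fixed_reused = reused(fixed_chunks(old), fixed_chunks(new))
cdc_reused = reused(cdc_chunks(old), cdc_chunks(new))
print("fixed-length chunks reused:", fixed_reused)
print("content-defined chunks reused:", cdc_reused)
```

With fixed-length chunking the one-byte shift changes every chunk boundary, while content-defined boundaries move with the content, so most chunks are still recognized.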
25. Evaluation File Conflict
File Conflict
Dropbox downloads the whole file while Minbox only needs to
download the delta.
High network efficiency on Minbox.
26. Evaluation Comparison with other cloud services
Comparison with other cloud services
Figure 12 shows a comparison between Minbox and Seafile, which also uses
CDC.
Seafile has to send the whole modified chunk, and the client downloads the whole
file in each case, while Minbox only deals with the part identified by rsync.
Compared with others such as Google Drive and OneDrive, Minbox takes
advantage of the incremental sync mechanism.
On file conflict, the others download the whole file, while Minbox uses rsync.
27. Evaluation Evaluation of Additional Overhead
Evaluation of Additional Overhead
Finally, it is necessary to discuss the overhead of
ImpMinHash in Minbox. We generate the
ImpMinHash of a 4MB file and record the
signature size and computation time. The result is
that the ImpMinHash signature has the same size as a
MinHash signature, which takes few bytes compared with
the rsync signature. For signature computation time,
ImpMinHash consumes two additional CPU ticks
in comparison to rsync.
28. Conclusion
Efficient Incremental Synchronization in Cloud Storage
Services
Understanding Dropbox
In this paper, comprehensive measurements are conducted on
Dropbox in file sharing scenarios, unraveling the incremental sync
mechanism inside Dropbox.
Surpassing Dropbox
Meanwhile, the significant network traffic waste existing in Dropbox
is revealed, and an efficient incremental sync system is designed and
implemented to solve these problems.
In the evaluation, Minbox significantly reduces the network traffic
during sync and solves the problem of file conflict with little overhead.
29. Conclusion
Thank you
Any comments or questions?