Understanding and Surpassing Dropbox: Efficient
Incremental Synchronization in Cloud Storage Services
Shenglong Li 1 Quanlu Zhang 1 Zhi Yang 1 Yafei Dai 1
1Peking University
(lishenglong, zql, yangzhi, dyf)@net.pku.edu.cn
June 18, 2016
Presented by: Fajar Purnama (HICC LAB) Kumamoto University, GLOBECOM2015 June 18, 2016 1 / 29
Outline
1 Introduction
Background
Objective
Contribution
2 Related Work
Measurement of cloud storage services
Similarity detection technique
State of The Art
3 Understanding Incremental Sync Of Cloud Storage Services
Rsync Algorithm
Sync Mechanism on Dropbox
Detail Measurement and Analysis
4 System Design and Implementation
System Architecture
Delta Sharing
Chunk-Based Rsync with Similarity Detection
Efficient conflict resolution
5 Evaluation
Modification Benchmark
File Conflict
Comparison with other cloud services
Evaluation of Additional Overhead
6 Conclusion
Introduction Background
Cloud Storage Services
With users' increasing demand for high data reliability and convenient
data access, cloud storage services have become extremely prevalent and
phenomenally successful. They are best known for file sharing
scenarios.
Seafile
Introduction Objective
Understanding and Surpassing Dropbox
Data synchronization is the heart of cloud storage
services, and incremental data synchronization is applied
to minimize network traffic.
Whether "modified data = uploaded data" holds for the
active client.
Whether "downloaded data (passive client) =
uploaded data (active client)" holds.
Whether both active and passive clients remain
efficient during a file conflict.
Create an improved prototype based on the findings.
Introduction Contribution
3
Measurement on Dropbox
Conduct intensive measurements on Dropbox in file
sharing scenarios.
Mechanism on Dropbox
Unravel the sync mechanisms employed in Dropbox on
both active and passive clients.
Minbox
Design several novel mechanisms that resolve the traffic
problems, and apply them in an efficient incremental
synchronization system.
Related Work Measurement of cloud storage services
Measurement of cloud storage services
Drago first uncovers the Dropbox system architecture
and data sync mechanism through an ISP-level
large-scale measurement.
Li reveals the traffic overuse problem in Dropbox when
a user frequently modifies files in the synced folder, and
proposes an efficient batched sync mechanism to
avoid massive metadata interaction.
Li focuses on quantifying and understanding traffic
usage effectiveness through the measurements of
several popular cloud storage services on different
devices.
Related Work Similarity detection technique
Similarity detection technique
Xia proposes a new similarity detection
algorithm to better exploit similarity with low
RAM overhead and high throughput.
Google deploys SimHash to improve space
efficiency and query quality for web crawling.
Mark Manasse implements MinHash using a
shingle sampling technique to extract features.
Related Work State of The Art
1
While these previous works cover the data sync
mechanism as one of the key operations, none of
them tries to fully understand the mechanism of
the incremental sync technique in the file sharing scenario
or measures the network traffic under different
write behaviours. Moreover, we reveal network
traffic waste problems that have not been explored before
and design several sync mechanisms to solve them.
Related Work State of The Art
2
Our system design and implementation differ
from these works. Specifically, we design
an efficient chunk-based delta encoding
mechanism with an embedded similarity detection
technique, which combines locality-sensitive hashing
and content-defined chunking to
reduce the computation overhead while
guaranteeing precision. Moreover, this mechanism
can integrate seamlessly with other deduplication
techniques.
Understanding Incremental Sync Of Cloud Storage Services Rsync Algorithm
An Incremental Data Sync Algorithm
The whole point of rsync is that, when a file is modified on a remote host,
only the modified part is sent to the client rather than the whole
file.
When a file is modified, the client retrieves its signature, which consists of
strong checksums (e.g., BLAKE2, MD5) and weak checksums (e.g., Adler-32,
a type of rolling checksum).
The client first computes weak checksums of the blocks in the changed file.
If a checksum matches one of the retrieved checksums, the client
calculates the block's strong checksum to verify that the two blocks are indeed the
same.
If there is no match, the client rolls one byte forward and calculates the weak
checksum again to look for matching blocks, which helps to find
skewed (shifted) content.
Finally, all the differing parts, called the delta, are found and sent back to
the server. The changed file is generated on the server by merging the delta and
the original file, which is called patching.
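As a rough illustration of the steps above, the following Python sketch builds a signature, computes a delta, and patches the old file. It is not librsync's or Dropbox's implementation: the 4-byte block size, the function names, and the instruction format are assumptions made for this example, MD5 stands in for the strong checksum, and the weak checksum is recomputed for every window instead of being rolled incrementally.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size, chosen only so the example is easy to follow


def signature(old: bytes, block: int = BLOCK) -> dict:
    """Map each weak checksum to the (strong checksum, block index) pairs of the old file."""
    sig = {}
    for i in range(0, len(old), block):
        piece = old[i:i + block]
        entry = (hashlib.md5(piece).digest(), i // block)
        sig.setdefault(zlib.adler32(piece), []).append(entry)
    return sig


def delta(new: bytes, sig: dict, block: int = BLOCK) -> list:
    """Return ('copy', block_index) and ('data', literal_bytes) instructions."""
    out, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block]
        match = None
        # The weak checksum filters candidates; the strong checksum confirms them.
        for strong, index in sig.get(zlib.adler32(window), []):
            if hashlib.md5(window).digest() == strong:
                match = index
                break
        if match is None:
            # No match: keep one literal byte and move the window one byte forward.
            literal.append(new[pos])
            pos += 1
        else:
            if literal:
                out.append(("data", bytes(literal)))
                literal = bytearray()
            out.append(("copy", match))
            pos += len(window)
    if literal:
        out.append(("data", bytes(literal)))
    return out


def patch(old: bytes, instructions: list, block: int = BLOCK) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    result = bytearray()
    for kind, value in instructions:
        if kind == "copy":
            result += old[value * block:(value + 1) * block]
        else:
            result += value
    return bytes(result)
```

With old = b"abcdefghijkl" and new = b"abcdXefghijkl", the delta carries a single literal byte (the inserted "X") plus three copy instructions, even though every byte after the insertion has shifted.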
Understanding Incremental Sync Of Cloud Storage Services Rsync Algorithm
Illustration of Rsync
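[Figure: rsync between client and server. The server derives a signature from the old file and sends it to the client; the client compares the signature with the new file and sends back the delta (patch); the server computes old file + delta (patch) = new file.]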
Understanding Incremental Sync Of Cloud Storage Services Sync Mechanism on Dropbox
Dropbox Index Server and Amazon Data Server
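Components: index server, client, data server.
1. The client requests the file location from the index server.
2. The index server sends the file location.
3. The client syncs the file with the data server using rsync at that
file location.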
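The three steps can be sketched as follows. This is only a toy model of the separation between metadata and data: the class names, the location format, and the method names are hypothetical, and the data server simply stores the bytes where a real client would run rsync.

```python
class IndexServer:
    """Hypothetical metadata service that maps file names to storage locations."""

    def __init__(self):
        self.locations = {}

    def locate(self, name: str) -> str:
        # Steps 1 and 2: the client asks for a location and receives one.
        return self.locations.setdefault(name, f"blocks/{name}")


class DataServer:
    """Hypothetical block store standing in for the data server."""

    def __init__(self):
        self.blobs = {}

    def sync(self, location: str, content: bytes) -> None:
        # Step 3: a real client would exchange rsync signatures and deltas here.
        self.blobs[location] = content


def sync_file(name: str, content: bytes, index: IndexServer, data: DataServer) -> str:
    """Run the three-step flow for one file and return the location used."""
    location = index.locate(name)
    data.sync(location, content)
    return location
```

The point of the split is that the index server only ever handles small metadata messages, while the bulk transfer goes to the data server.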
Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Active and Passive Client
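[Figure: the active client adds or modifies a file (10MB + 1B) and syncs it to the Dropbox servers, which sync it to the passive clients. The question is whether each side transfers the whole 10MB or only the 1B that changed.]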
Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Experiment 1: Replacement at Different Positions
1. Divide both files into 4MB chunks. For example, for a 10MB file:
Old file: 4MB | 4MB | 2MB
New file: 4MB | 4MB | 2MB
2. Check whether each pair of chunks is identical;
if not, execute rsync on that pair.
Old file: 4MB | 4MB | 2MB
Result: Same | Same | rsync
New file: 4MB | 4MB | 2MB
For Figure 1, rsync is executed on the first, the middle, or the last chunk,
depending on where the data is replaced.
Uplink (A) and downlink (B), which are based on the delta, should be the
same size.
Downlink for client A is the same because the “active client”
already stores the signature data.
Passive clients have to send the signature to the data server,
and that’s why there’s uplink.
Uplink when the “end” is modified is smaller because rsync works on
4KB blocks and the last chunk is only 2MB. Librsync uses a 256-bit strong
checksum and a 32-bit weak checksum per block, so the signature of a 4MB
chunk is ((256b+32b)/8)*(4MB/4KB) = 36KB.
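The signature arithmetic can be checked with a few lines of Python. The function name and its parameters are illustrative; the 4KB block size and the 256-bit and 32-bit checksum widths are the figures quoted on the slide.

```python
def signature_size(chunk_bytes: int, block_bytes: int = 4 * 1024,
                   strong_bits: int = 256, weak_bits: int = 32) -> int:
    """Bytes needed for the rsync signature of one chunk."""
    blocks = -(-chunk_bytes // block_bytes)  # ceiling division
    return blocks * (strong_bits + weak_bits) // 8


MB = 1024 * 1024
print(signature_size(4 * MB) // 1024, "KB")  # signature of a full 4MB chunk
print(signature_size(2 * MB) // 1024, "KB")  # signature of the 2MB tail chunk
```

A 4MB chunk holds 1024 blocks of 36 bytes of checksums each, which gives 36KB; the 2MB tail chunk holds half as many blocks and so needs 18KB.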
Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Experiment 1: Insertion at Different Positions
Both files are divided into chunks: 4MB | 4MB | 2MB
Insertion at the beginning: rsync on every chunk. Signature
sent = 36KB + 36KB + 18KB = 90KB
Insertion in the middle: rsync on chunks 2 and 3. Signature
sent = 36KB + 18KB = 54KB
Insertion at the end: rsync on the last chunk only.
Signature sent = 18KB
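The three cases follow from one observation: with fixed-length chunks, an insertion shifts the content of every chunk from the insertion point onward. The sketch below reproduces the totals for a non-empty file under that assumption; the helper names are made up for this example, and the 36 bytes per 4KB block come from the checksum sizes quoted for librsync.

```python
MB = 1024 * 1024
KB = 1024
SIG_BYTES_PER_BLOCK = (256 + 32) // 8  # strong plus weak checksum per 4KB block


def chunk_sizes(file_size: int, chunk: int = 4 * MB) -> list:
    """Sizes of the fixed-length chunks of a file, e.g. 10MB -> 4MB, 4MB, 2MB."""
    return [min(chunk, file_size - off) for off in range(0, file_size, chunk)]


def chunks_to_rsync(file_size: int, insert_offset: int, chunk: int = 4 * MB) -> list:
    """Indices of the chunks whose content shifts after an insertion at insert_offset."""
    count = len(chunk_sizes(file_size, chunk))
    first = min(insert_offset // chunk, count - 1)
    return list(range(first, count))


def signature_bytes(file_size: int, insert_offset: int,
                    chunk: int = 4 * MB, block: int = 4 * KB) -> int:
    """Total signature bytes exchanged for the chunks that need rsync."""
    sizes = chunk_sizes(file_size, chunk)
    affected = chunks_to_rsync(file_size, insert_offset, chunk)
    total_blocks = sum(-(-sizes[i] // block) for i in affected)
    return total_blocks * SIG_BYTES_PER_BLOCK
```

For a 10MB file, calling signature_bytes with offsets 0, 5MB and 10MB yields 90KB, 54KB and 18KB respectively, matching the three cases above.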
Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Experiment 2: Modification with different amounts of data
Replace or insert different amounts of data, ranging from 4KB to
4MB, in the middle of a 4MB file and an 8MB file.
When the replaced content is larger than 100KB, the amount of data
transferred is less than the amount modified, due to data compression in Dropbox.
Data insertion can waste traffic for larger insertions because content
shifts across the fixed-length chunk boundaries, even though rsync by
itself would normally handle shifted content.
Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Experiment 3: File conflict
In Figure 3, A and B modify the file at the same time and both sync to the server.
B's update reaches the server first, and A syncs it from the server.
But when A’s modified data reaches the server and is synced, B treats it as a
new file and re-downloads the whole file.
Figure 4 is more complicated, but the case is similar to Figure 3
with 3 file conflicts.
System Design and Implementation System Architecture
System Architecture
System Design and Implementation Delta Sharing
Delta Sharing
Normally, a passive client repeatedly executes rsync
to pick up updates in time, which is wasteful.
Since passive clients tend to stay online, the delta
generated by the active client can be reused.
In other words, passive clients do not have to execute
rsync; they retrieve the delta from the delta server.
Passive clients do not have to maintain the online
state themselves, since it can be marked through the index server.
If a passive client has been offline for long, the previous
mechanism is used (execute rsync).
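A minimal sketch of this decision, assuming deltas are cached per file and per base version: the slides do not say how Minbox keys or expires deltas, so the version numbers, the class, and the function names here are hypothetical.

```python
class DeltaServer:
    """Hypothetical cache of deltas produced by active clients."""

    def __init__(self):
        self.deltas = {}

    def publish(self, name: str, base_version: int, delta: bytes) -> None:
        # The active client uploads the delta it computed against base_version.
        self.deltas[(name, base_version)] = delta

    def fetch(self, name: str, base_version: int):
        # Returns None when no delta exists for the client's local version.
        return self.deltas.get((name, base_version))


def passive_sync(name: str, local_version: int, delta_server: DeltaServer, run_rsync):
    """Reuse a shared delta when one fits; otherwise fall back to rsync."""
    delta = delta_server.fetch(name, local_version)
    if delta is not None:
        return "shared-delta", delta
    # For example, the client was offline for long and its version is too old.
    return "rsync", run_rsync(name)
```

A passive client whose local copy matches the base version of a published delta takes the first branch and sends no signature at all, which is where the uplink saving comes from.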
System Design and Implementation Chunk-Based Rsync with Similarity Detection
Similarity Detection Mechanism
System Design and Implementation Chunk-Based Rsync with Similarity Detection
Algorithm
System Design and Implementation Chunk-Based Rsync with Similarity Detection
Algorithm Summary
Use locality-sensitive hashing to detect similar chunks.
To reduce computation overhead while guaranteeing
detection precision, the ImpMinHash algorithm is
employed.
Each non-deduplicated chunk is sliced into
sub-blocks using Rabin fingerprinting.
The smallest cyclic redundancy check (CRC)
checksums are then used to identify the chunk. Finally, the
Jaccard index is used to compute the similarity between chunks.
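The pipeline summarized above can be approximated with a plain bottom-k MinHash. The slides do not give the details of ImpMinHash, so this sketch should not be read as that algorithm: the boundary rule is a simple stand-in for Rabin fingerprinting, and the window, mask, and sketch size k are arbitrary values picked for illustration.

```python
import zlib


def sub_blocks(data: bytes, window: int = 4, mask: int = 0x0F) -> list:
    """Cut after positions where the CRC of the last `window` bytes has zero low bits.

    Boundaries depend only on nearby content, as with Rabin fingerprinting.
    """
    blocks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            blocks.append(data[start:end])
            start = end
    if start < len(data):
        blocks.append(data[start:])
    return blocks


def sketch(data: bytes, k: int = 8) -> frozenset:
    """Keep the k smallest CRC32 values of the sub-blocks as the chunk's sketch."""
    checksums = {zlib.crc32(block) for block in sub_blocks(data)}
    return frozenset(sorted(checksums)[:k])


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard index |a & b| / |a | b|; two empty sketches count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two chunks would then be treated as similar when jaccard(sketch(x), sketch(y)) exceeds some threshold, and only similar pairs would go on to the more expensive rsync matching. Comparing small sketches gives only an approximation of the similarity of the full sub-block sets.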
System Design and Implementation Efficient conflict resolution
Efficient conflict resolution
Evaluation Modification Benchmark
Modification Benchmark
Replaying Experiment 1: unlike Dropbox, there is no uplink on Minbox’s passive
client.
Replaying Experiment 2: the results are shown in Figure 8 and Figure 9, where Minbox,
which implements the similarity detection algorithm, outperforms Dropbox.
MinboxFD (native) uses fixed-length 4MB chunking, while MinboxVD uses content-defined
chunking (CDC) with an average chunk size of 4MB.
In most cases, MinboxVD performs best by taking advantage of CDC to avoid the
impact of content skewing.
However, for large modification workloads on the 8MB file, MinboxFD outperforms MinboxVD,
because MinboxVD slices new chunks that are not similar to the original chunks.
After matching these chunks, MinboxVD may generate more redundant delta
than MinboxFD.
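The difference between the two chunking strategies under content skewing can be seen in a small experiment. This is a toy comparison, not MinboxFD or MinboxVD: the rolling-sum boundary rule, the 4-byte window, and the divisor are assumptions chosen to keep the example short, and the chunk sizes are tiny compared with the 4MB used in the evaluation.

```python
def fixed_chunks(data: bytes, size: int = 8) -> list:
    """Fixed-length chunking: boundaries depend only on the byte offset."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 4, divisor: int = 10) -> list:
    """Content-defined chunking: cut where the sum of the last `window` bytes
    is divisible by `divisor`, so boundaries move together with the content."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if sum(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(old_chunks: list, new_chunks: list) -> int:
    """Number of distinct chunks present in both versions (no transfer needed)."""
    return len(set(old_chunks) & set(new_chunks))


old = bytes(range(64))
new = b"\xff" + old  # one byte inserted at the very beginning

print("fixed:", shared(fixed_chunks(old), fixed_chunks(new)), "of", len(fixed_chunks(old)))
print("cdc:  ", shared(cdc_chunks(old), cdc_chunks(new)), "of", len(cdc_chunks(old)))
```

With fixed-length chunks the single inserted byte misaligns every chunk, so none is shared; with content-defined boundaries only the first chunk changes and the remaining 12 of 13 are reused. The regular ramp used as input makes the content-defined chunks uniform in length here; real data would produce variable-length chunks.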
Evaluation File Conflict
File Conflict
Dropbox downloads the whole file while Minbox only needs to
download the delta.
High network efficiency on Minbox.
Evaluation Comparison with other cloud services
Comparison with other cloud services
Figure 12 compares Minbox with Seafile, which also uses
CDC.
Seafile has to send the whole modified chunk, and the client downloads the whole
file in each case, while Minbox only deals with the rsync part.
Compared with others such as Google Drive and OneDrive, Minbox takes
advantage of the incremental sync mechanism.
On file conflict, the others download the whole file, while Minbox uses rsync.
Evaluation Evaluation of Additional Overhead
Evaluation of Additional Overhead
Finally, it is necessary to discuss the overhead of
ImpMinHash in Minbox. We generate the
ImpMinHash of a 4MB file and record the
signature size and computation time. The result is
that ImpMinHash has the same size as MinHash,
which takes few bytes compared with the rsync
signature. For signature computation time,
ImpMinHash consumes two additional CPU ticks
in comparison to rsync.
Conclusion
Efficient Incremental Synchronization in Cloud Storage
Services
Understanding Dropbox
The paper conducts comprehensive measurements on
Dropbox in a file sharing scenario and unravels the incremental sync
mechanism inside Dropbox.
Surpassing Dropbox
The paper also reveals significant network traffic waste
in Dropbox, and designs and implements an efficient incremental
sync system to solve these problems.
In the evaluation, Minbox significantly reduces the network traffic
during sync and solves the problem of file conflict with little overhead.
Conclusion
Thank you
Any comments or questions?
