Open Source Data Deduplication


Data deduplication is a hot topic in storage and saves significant disk space in many environments, with some trade-offs. We'll discuss what deduplication is and where the open source solutions stand versus commercial offerings. The presentation leans toward the practical: where attendees can use it in real-world projects (what works, what doesn't, whether to use it in production, and so on).

Published in: Technology, Education
  • Different types of deduplication levels: file level; block level; variable block versus fixed block (Quantum/DD: variable blocks)
  • Pretty much the same as all compression

    1. Open Source Data Deduplication
       Nick @RedWireServices, (206) 829-8621
       Last updated 8/10/2011
    2. Introduction
       ● What is Deduplication? Different kinds?
       ● Why do you want it?
       ● How does it work?
       ● Advantages / Drawbacks
       ● Commercial Implementations
       ● Open Source implementations: performance, reliability, and stability of each
    3. What is Data Deduplication
       Wikipedia: ". . . data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored, along with references to the unique copy of data. Deduplication is able to reduce the required storage capacity since only the unique data is stored. Depending on the type of deduplication, redundant files may be reduced, or even portions of files or other data that are similar can also be removed . . ."
    4. Why Dedupe?
       ● Save disk space and money (fewer disks)
       ● Fewer disks = less power, cooling, and space
       ● Improve write performance (of duplicate data)
       ● Be efficient: don't re-copy or store previously stored data
    5. Where Does it Work Well?
       ● Secondary Storage
         ● Backups/Archives
         ● Online backups with limited bandwidth/replication
         ● Save disk space: additional full backups take little space
       ● Virtual Machines (Primary & Secondary)
       ● File Shares
    6. Not a Fit
       ● Random data
         ● Video
         ● Pictures
         ● Music
         ● Encrypted files (many vendors dedupe, then encrypt)
    7. Types
       ● Source / Target
       ● Global
       ● Fixed/Sliding Block
       ● File Based (SIS)
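The difference between fixed and sliding (variable) blocks is easiest to see when data shifts. The sketch below is a simplified illustration, not the algorithm of any product named in this deck: it uses a CRC32 of the trailing 16 bytes as a stand-in for a real rolling hash, and the window size, mask, and 1 KB block size are arbitrary assumptions. It inserts one byte at the front of a buffer and counts how many chunks still match.

```python
import random
import zlib

WINDOW = 16            # bytes of trailing context that decide a boundary (assumed)
MASK = (1 << 10) - 1   # cut when the low 10 hash bits are zero, ~1 KB average chunk


def fixed_chunks(data, size=1024):
    """Split at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def sliding_chunks(data):
    """Split wherever a hash of the last WINDOW bytes matches a pattern.

    Boundaries depend only on nearby content, so they move with the data.
    CRC32 is recomputed per position here for clarity; a real implementation
    would use a rolling hash that updates in constant time.
    """
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(32 * 1024))
shifted = b"!" + original  # one byte inserted at the front

fixed_shared = set(fixed_chunks(original)) & set(fixed_chunks(shifted))
sliding_orig = sliding_chunks(original)
sliding_shared = set(sliding_orig) & set(sliding_chunks(shifted))

print(len(fixed_shared), "fixed chunks still match")
print(len(sliding_shared), "of", len(sliding_orig), "sliding chunks still match")
```

With fixed blocks every boundary moves by one byte, so nothing downstream matches. With content-defined boundaries, only the chunk containing the insertion changes and the rest can still be deduplicated.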
    8. Drawbacks
       ● Slow writes, slower reads
       ● High CPU/memory utilization (dedicated server is a must)
       ● Increases data loss risk / corruption
         ● Collision risk of 1.3x10^-49% chance per PB (256 bit hash & 8KB blocks)
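For a sense of how small the hash collision risk is, here is a rough birthday-bound estimate. It is a sketch under stated assumptions: a decimal petabyte (10^15 bytes), 8 KB blocks, and a uniformly distributed 256-bit hash. The slide's figure may rest on different assumptions, so the exponent printed here need not match it; the point is only that the probability is vanishingly small either way.

```python
from fractions import Fraction


def collision_probability(total_bytes, block_bytes, hash_bits):
    """Birthday-bound approximation: p ~= n(n-1)/2 / 2**hash_bits."""
    n = total_bytes // block_bytes  # number of blocks hashed
    return Fraction(n * (n - 1), 2) / Fraction(2 ** hash_bits)


# Assumptions: 1 PB = 10**15 bytes, 8 KB = 8192 bytes, 256-bit hash.
p = collision_probability(10**15, 8 * 1024, 256)
print(f"probability of any collision: {float(p):.2e}")
print(f"as a percentage:              {float(p) * 100:.2e}%")
```

Exact fractions are used so the intermediate values stay precise before the final conversion to a float.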
    9. How Does it Work?
    10. Without Dedupe
    11. With Dedupe
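The idea behind the two slides above can be sketched in a few lines. This is a toy in-memory illustration, not how SDFS, LessFS, or any other implementation is actually built; the class name `DedupeStore`, the 8 KB block size, and the use of SHA-256 are assumptions chosen for the example. Each file becomes a list of block hashes, and each unique block is stored once.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # fixed 8 KB blocks (illustrative choice)


class DedupeStore:
    """Toy block store: every unique block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}  # hex digest -> block bytes (the single stored copy)
        self.files = {}   # file name -> ordered list of digests (the "recipe")

    def write(self, name, data):
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if this content has not been seen before.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name):
        # Rebuild the file by following its recipe of block references.
        return b"".join(self.blocks[d] for d in self.files[name])

    def physical_bytes(self):
        return sum(len(b) for b in self.blocks.values())

    def logical_bytes(self):
        return sum(len(self.blocks[d]) for r in self.files.values() for d in r)


store = DedupeStore()
backup = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
store.write("full-monday", backup)
store.write("full-tuesday", backup)  # a second identical full backup

print("logical bytes: ", store.logical_bytes())
print("physical bytes:", store.physical_bytes())
```

The second full backup adds only references, so six blocks are addressable while three are physically stored. The extra hashing and lookup on every write is also where the CPU and memory cost comes from.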
    12. Block Reclamation
        ● In general, blocks are not removed/freed when a file is removed
        ● We must periodically check blocks for references; a block with no reference can be deleted, freeing allocated space
        ● Process can be expensive, so it is scheduled during off-peak hours
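The reclamation pass described above amounts to a mark-and-sweep over the block store. The following is a minimal sketch with made-up data structures (a dict of blocks and a dict of per-file hash lists); the function name `reclaim_unreferenced` is hypothetical and real implementations will differ.

```python
def reclaim_unreferenced(blocks, files):
    """Delete blocks that no file references; return the number of bytes freed."""
    # Mark: collect every digest still referenced by some file.
    live = {digest for recipe in files.values() for digest in recipe}
    # Sweep: anything not marked can be deleted.
    dead = [digest for digest in blocks if digest not in live]
    freed = sum(len(blocks[digest]) for digest in dead)
    for digest in dead:
        del blocks[digest]
    return freed


blocks = {"h1": b"x" * 4096, "h2": b"y" * 4096, "h3": b"z" * 4096}
files = {"a.img": ["h1", "h2"], "b.img": ["h2", "h3"]}

del files["b.img"]                 # removing the file frees nothing by itself
blocks_before = len(blocks)        # still 3 blocks on disk
freed = reclaim_unreferenced(blocks, files)
print("freed", freed, "bytes; remaining blocks:", sorted(blocks))
```

Block `h2` survives because `a.img` still references it. The pass has to walk every file's references, which is why it is expensive on a large store.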
    13. Commercial Implementations
        ● Just about every backup vendor
          ● Symantec, CommVault
          ● Cloud: Asigra, Barracuda, Dropbox (global), JungleDisk, Mozy
        ● NAS/SAN/Backup Targets
          ● NEC HydraStor
          ● DataDomain/EMC Avamar
          ● Quantum
          ● NetApp
    14. Open Source Implementations
        ● Fuse Based
          ● Lessfs
          ● SDFS (OpenDedupe)
        ● Others
          ● ZFS
          ● btrfs (? Off-line only)
        ● Limited (file based / SIS)
          ● BackupPC (reliable!)
          ● Rdiff-backup
    15. How Good is it?
        ● Many see 10-20x deduplication, meaning 10-20 times more logical object storage than physical
        ● Especially true in backup or virtual environments
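Dedupe results are quoted either as a ratio ("10x") or as a percent reduction, and it helps to be able to convert between the two. The helpers below are a small sketch; the function names are made up for illustration.

```python
def dedupe_ratio(logical_bytes, physical_bytes):
    """Logical (addressable) size divided by physical (stored) size."""
    return logical_bytes / physical_bytes


def percent_reduction(logical_bytes, physical_bytes):
    """Share of the logical size that did not need to be stored."""
    return 100.0 * (logical_bytes - physical_bytes) / logical_bytes


def ratio_from_reduction(percent):
    """Convert a percent reduction back into an N-times ratio."""
    return 100.0 / (100.0 - percent)


print(dedupe_ratio(1000, 100))        # ten times more logical than physical
print(percent_reduction(1000, 100))   # the same result as a percentage
print(ratio_from_reduction(50.0))     # a 50% reduction is only a 2x ratio
```

The relationship is not linear: 10x corresponds to a 90% reduction and 20x to 95%, so the last few percent are where the large ratios come from.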
    16. SDFS / OpenDedupe
        ● Java 7 based / platform agnostic
        ● Uses fuse
        ● S3 storage support
        ● Snapshots
        ● Inline or batch mode deduplication
        ● Supposedly fast (290MBps+ on great H/W)
        ● Support for global/clustered dedupe
        ● Probably most mature OSS Dedupe (IMHO)
    17. SDFS
    18. SDFS Install & Go
        Install Java
        # rpm -Uvh SDFS-1.0.7-2.x86_64.rpm
        # sudo mkfs.sdfs --volume-name=sdfs_128k --io-max-file-write-buffers=32 --volume-capacity=550GB --io-chunk-size=128 --chunk-store-data-location=/mnt/data
        # sudo modprobe fuse
        # sudo mount.sdfs -v sdfs_128k -m /mnt/dedupe
    19. SDFS
        ● Pro
          ● Works when configured properly
          ● Appears to be multithreaded
        ● Con
          ● Slow / resource intensive (CPU/Memory)
          ● Fragile: easy to mess up options, leading to crashes, with little user feedback
          ● Standard POSIX utilities do not show accurate data (e.g. df; you must use getfattr -d <mount point> and calculate bytes → GB/TB and % free yourself)
          ● Slow with 4k blocks, which is the size recommended for VMs
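Since the byte counts have to be converted by hand, a small helper saves some arithmetic. This is a sketch only: the slides do not give the attribute names that `getfattr -d` reports, so the functions below simply take raw byte counts as arguments, and both function names are hypothetical.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


def percent_free(capacity_bytes, used_bytes):
    """Free space as a percentage of total capacity."""
    return 100.0 * (capacity_bytes - used_bytes) / capacity_bytes


# Example values only; substitute the byte counts read from the mount point.
capacity = 550 * 1024**3
used = 128 * 1024**3
print(human_size(capacity), "total,", human_size(used), "used,",
      f"{percent_free(capacity, used):.1f}% free")
```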
    20. LessFS
        ● Written in C = less CPU overhead
        ● Have to build yourself (configure && make && make install)
        ● Has replication, encryption
        ● Uses fuse
    21. LessFS Install
        wget http://...lessfs-1.4.2.tar.gz
        tar zxvf *.tar.gz
        wget http://...db-4.8.30.tar.gz
        yum install buildstuff…
        . . .
        echo never >/sys/kernel/mm/redhat_transparent_hugepage/defrag
        echo no >/sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
    22. LessFS Go
        sudo vi /etc/lessfs.cfg
          BLOCKDATA_PATH=/mnt/data/dta/blockdata.dta
          META_PATH=/mnt/meta/mta
          BLKSIZE=4096 # only 4k supported on centos 5
          ENCRYPT_DATA=on
          ENCRYPT_META=off
        mklessfs -c /etc/lessfs.cfg
        lessfs /etc/lessfs.cfg /mnt/dedupe
    23. LessFS
        ● Pro
          ● Does inline compression by default as well
          ● Reasonable VM compression with 128k blocks
        ● Con
          ● Fragile
          ● Stats/FS info hard to see (per file accounting, no totals)
          ● Kernel >= 2.6.26 required for blocks > 4k (RHEL6 only)
          ● Running with 4k blocks is not really feasible
    24. LessFS
    25. Other OSS
        ● ZFS?
          ● Tried it, and empirically it was a drag, but I have no hard data (got like 3x dedupe with identical full backups of VMs)
          ● At least it's stable…
    26. Kick the Tires
        ● Test data set: ~330GB of data
          ● 22GB of documents, pictures, music
          ● Virtual Machines
            – 220GB Windows 2003 Server with SQL Data
            – 2003 AD DC ~60GB
            – 2003 Server ~8GB
            – Two OpenSolaris VMs, 1.5 & 2.7GB
            – 3GB Windows 2000 VM
            – 15GB XP Pro VM
    27. Kick the Tires
        ● Test Environment
          ● AWS High CPU Extra Large Instance
          ● ~7GB of RAM
          ● ~Eight cores, ~2.5GHz each
          ● ext4
    28. Compression Performance
        ● First round (all "unique" data)
        ● If another copy was put in (like another full), we should expect 100% reduction for that non-unique data (1x dedupe per run)

        | FS             | Home Data | % Home Reduction | VM Data | % VM Reduction | Combined | % Total Reduction | MBps     |
        |----------------|-----------|------------------|---------|----------------|----------|-------------------|----------|
        | SDFS 4k        | 21GB      | 4.50%            | 109GB   | 64%            | 128GB    | 61%               | 16       |
        | lessfs 4k      | 24GB      | -9%              | N/A     | 51%            | N/A      | 50%               | 4 (est.) |
        | SDFS 128k      | 21GB      | 4.50%            | 255GB   | 16%            | 276GB    | 15%               | 40       |
        | lessfs 128k    | 21GB      | 4.50%            | 130GB   | 57%            | 183GB    | 44%               | 24       |
        | tar/gz --fast  | 21GB      | 4.50%            | 178GB   | 41%            | 199GB    | 39%               | 35       |
    29. Write Performance (don't trust this)
        [Bar chart: write throughput in MBps (axis 0 to 40) for raw, SDFS 4k, lessfs 4k, SDFS 128k, lessfs 128k, tar/gz --fast]
    30. Kick the Tires: Part 2
        ● Test data set: two ~204GB full backup archives from a popular commercial vendor
        ● Test Environment
          ● VirtualBox VM, 2GB RAM, 2 cores, 2x 7200RPM SATA drives (meta & data separated for LessFS)
          ● Physical CPU: Quad Core Xeon
    31. Write Performance
        [Bar chart: write throughput in MBps (axis 0 to 40) for raw, SDFS 128k write, SDFS 128k re-write, LessFS 128k write, LessFS 128k re-write]
    32. Load (SDFS 128k)
    33. Open Source Dedupe
        ● Pro
          ● Free
          ● Can be stable, if well managed
        ● Con
          ● Not in repos yet
          ● Efforts behind them seem very limited, 1 dev each
          ● No/poor documentation
    34. The Future
        ● Eventual commodity?
        ● btrfs
          ● Dedupe planned (off-line only)
    35. Conclusion/Recommendations
        ● Dedupe is great, if it works and it meets your performance and storage requirements
        ● OSS Dedupe has a way to go
        ● SDFS/OpenDedupe is the best OSS option right now
        ● JungleDisk is good and cheap, but not OSS
    36. About Red Wire Services
        If you found this presentation helpful, consider Red Wire Services for your next Backup, Archive, or IT Disaster Recovery Planning project. Learn more at
    37. About Nick Webb
        Nick Webb is the founder of Red Wire Services, in Seattle, WA. Nick is available to speak on a variety of IT Disaster Recovery related topics, including:
        ● Preserving Your Digital Legacy
        ● Getting Started with your Small Business Disaster Recovery Plan
        ● Archive Storage for SMBs
        If interested in having Nick speak to your group, please call (206) 829-8621 or email