SlideShare a Scribd company logo
Open Source Data Deduplication




                        Nick Webb
nickw@redwireservices.com www.redwireservices.com @RedWireServices
                          (206) 829-8621
                          Last updated 8/10/2011
Introduction
●   What is Deduplication? Different kinds?
●   Why do you want it?
●   How does it work?
●   Advantages / Drawbacks
●   Commercial Implementations
●   Open Source implementations, performance,
    reliability, and stability of each
What is Data Deduplication
Wikipedia:

. . . data deduplication is a specialized data compression
technique for eliminating coarse-grained redundant data,
typically to improve storage utilization. In the deduplication
process, duplicate data is deleted, leaving only one copy of
the data to be stored, along with references to the unique
copy of data. Deduplication is able to reduce the required
storage capacity since only the unique data is stored.

Depending on the type of deduplication, redundant files may
be reduced, or even portions of files or other data that are
similar can also be removed . . .
Why Dedupe?
●   Save disk space and money (less disks)
●   Less disks = less power, cooling, and space
●   Improve write performance (of duplicate data)
●   Be efficient – don’t re-copy or store previously
    stored data
Where does it Work Well?
●   Secondary Storage
    ●   Backups/Archives
    ●   Online backups with limited
        bandwidth/replication
    ●   Save disk space – additional
        full backups take little space
●   Virtual Machines (Primary &
    Secondary)
●   File Shares
Not a Fit
●   Random data
    ●   Video
    ●   Pictures
    ●   Music
    ●   Encrypted files
         –   many vendors dedupe, then encrypt
Types
●   Source / Target
●   Global
●   Fixed/Sliding Block
●   File Based (SIS)
Drawbacks
●   Slow writes, slower reads
●   High CPU/memory utilization (dedicated server
    is a must)
●   Increases data loss risk / corruption
    ●   Collision risk of 1.3x10^-49% chance per PB
    ●   (256 bit hash & 8KB Blocks)
How Does it Work?
Without Dedupe
With Dedupe
Block Reclamation
 ●   In general, blocks are not
     removed/freed when a file is
     removed
 ●   We must periodically check blocks
     for references, a block with no
     reference can be deleted, freeing
     allocated space
 ●   Process can be expensive,
     scheduled during off-peak
Commercial Implementations
●   Just about every backup vendor
    ●   Symantec, CommVault
    ●   Cloud: Asigra, Baracuda, Dropbox (global), JungleDisk,
        Mozy
●   NAS/SAN/Backup Targets
    ●   NEC HydraStor
    ●   DataDomain/EMC Avamar
    ●   Quantum
    ●   NetApp
Open Source Implementations
●   Fuse Based
    ●   Lessfs
    ●   SDFS (OpenDedupe)
●   Others
    ●   ZFS
    ●   btrfs (? Off-line only)
●   Limited (file based / SIS)
    ●   BackupPC (reliable!)
    ●   Rdiff-backup
How Good is it?
●   Many see 10-20x deduplicaiton meaning 10-20
    times more logical object storage than physical
●   Especially true in backup or virtual
    environments
SDFS / OpenDedupe
                 www.opendedup.org
●   Java 7 Based / platform agnostic
●   Uses fuse
●   S3 storage support
●   Snapshots
●   Inline or batch mode deduplication
●   Supposedly fast (290MBps+ on great H/W)
●   Support for global/clustered dedupe
●   Probably most mature OSS Dedupe (IMHO)
SDFS
SDFS Install & Go

Install Java
# rpm –Uvh SDFS-1.0.7-2.x86_64.rpm
# sudo mkfs.sdfs --volume-name=sdfs_128k 
     --io-max-file-write-buffers=32 
     --volume-capacity=550GB 
     --io-chunk-size=128 
     --chunk-store-data-location=/mnt/data
# sudo modprobe fuse
# sudo mount.sdfs -v sdfs_128k -m 
     /mnt/dedupe
SDFS
●   Pro
    ●   Works when configured properly
    ●   Appears to be multithreaded
●   Con
    ●   Slow / resource intensive (CPU/Memory)
    ●   Fragile, easy to mess up options, leading to crashes, little
        user feedback
    ●   Standard POSIX utilities do not show accurate data (e.g. df,
        must use getfattr -d <mount point>, and calculate bytes →
        GB/TB and % free yourself)
    ●   Slow with 4k blocks, recommended for VMs
LessFS
                          www.lessfs.com

●   Written in C = Less CPU Overhead
●   Have to build yourself (configure && make && make install)
●   Has replication, encryption
●   Uses fuse
LessFS Install
wget http://...lessfs-1.4.2.tar.gz
tar zxvf *.tar.gz
wget http://...db-4.8.30.tar.gz
yum install buildstuff…
. . .
echo never >
/sys/kernel/mm/redhat_transparent_hugepage/defrag
echo no >
/sys/kernel/mm/redhat_transparent_hugepage/khugep
aged/defrag
LessFS Go
sudo vi /etc/lessfs.cfg
BLOCKDATA_PATH=/mnt/data/dta/blockdata.dta
META_PATH=/mnt/meta/mta
BLKSIZE=4096 # only 4k supported on centos 5
ENCRYPT_DATA=on
ENCRYPT_META=off


mklessfs -c /etc/lessfs.cfg
lessfs /etc/lessfs.cfg /mnt/dedupe
LessFS
●   Pro
    ●   Does inline compression by default as well
    ●   Reasonable VM compression with 128k blocks
●   Con
    ●   Fragile
    ●   Stats/FS info hard to see (per file accounting, no totals)
    ●   Kernel >= 2.6.26 required for blocks > 4k (RHEL6 only)
    ●   Running with 4k blocks is not really feasible
LessFS
Other OSS
●   ZFS?
    ●   Tried it, and empirically it was a drag, but I have no
        hard data (got like 3x dedupe with identical full
        backups of VMs)
    ●   At least it’s stable…
Kick the Tires
●   Test data set; ~330GB of data
    ●   22GB of documents, pictures, music
    ●   Virtual Machines
        –   220GB Windows 2003 Server with SQL Data
        –   2003 AD DC ~60GB
        –   2003 Server ~8GB
        –   Two OpenSolaris VMs, 1.5 & 2.7GB
        –   3GB Windows 2000 VM
        –   15GB XP Pro VM
Kick the Tires
●   Test Environment
    ●   AWS High CPU Extra Large Instance
    ●   ~7GB of RAM
    ●   ~Eight Cores ~2.5GHz each
    ●   ext4
Compression Performance
●   First round (all “unique” data)
●   If another copy was put in (like another full), we should expect
    100% reduction for that non-unique data (1x dedupe per run)
      FS              Home   % Home      VM     % VM        Combined % Total     MBps
                      Data   Reduction   Data   Reduction            Reduction
      SDFS 4k         21GB   4.50%       109    64%         128GB    61%         16
                                         GB
      lessfs 4k       24GB   -9%         N/A    51%         N/A      50%         4
      (est.)
      SDFS 128k       21GB   4.50%       255    16%         276GB    15%         40
                                         GB
      lessfs 128k     21GB   4.50%       130    57%         183GB    44%         24
                                         GB
      tar/gz --fast   21GB   4.50%       178    41%         199GB    39%         35
                                         GB
Write Performance
                      (don't trust this)
                                  MBps

40


35


30


25


20                                                                          MBps


15


10


 5


 0
     raw    SDFS 4k   lessfs 4k   SDFS 128k   lessfs 128k   tar/gz --fast
Kick the Tires: Part 2
●   Test data set – two ~204GB full backup
    archives from a popular commercial vendor
●   Test Environment
    ●   VirtualBox VM, 2GB RAM, 2 Cores, 2x7200RPM
        SATA drives (meta & data separated for LessFS)
    ●   Physical CPU: Quad Core Xeon
Write Performance
                                 MBps

40


35


30


25


20                                                                           MBps


15


10


 5


 0
     raw   SDFS 128k W   SDFS 128k Re-W   LessFS 128k W   LessFS 128k Re-W
Load
(SDFS 128k)
Open Source Dedupe
●   Pro
    ●   Free
    ●   Can be stable, if well managed
●   Con
    ●   Not in repos yet
    ●   Efforts behind them seem very limited, 1 dev each
    ●   No/Poor documentation
The Future
●   Eventual Commodity?
●   brtfs
    ●   Dedupe planned (off-line only)
Conclusion/Recommendations
●   Dedupe is great, if it works and it meets your
    performance and storage requirements
●   OSS Dedupe has a way to go
●   SDFS/OpenDedupe is best OSS option right
    now
●   JungleDisk is good and cheap, but not OSS
About Red Wire Services
If you found this presentation helpful, consider
Red Wire Services for your next
Backup, Archive, or IT Disaster Recovery
Planning project.
Learn more at www.RedWireServices.com
About Nick Webb
Nick Webb is the founder of Red Wire Services, in
Seattle, WA. Nick is available to speak on a variety of IT
Disaster Recovery related topics, including:
●   Preserving Your Digital Legacy
●   Getting Started with your Small Business Disaster
    Recovery Plan
●   Archive Storage for SMBs
If interested in having Nick speak to your group, please
call (206) 829-8621 or email info@redwireservices.com

More Related Content

What's hot

Hadoop HDFS NameNode HA
Hadoop HDFS NameNode HAHadoop HDFS NameNode HA
Hadoop HDFS NameNode HA
Hanborq Inc.
 
CDW: SAN vs. NAS
CDW: SAN vs. NASCDW: SAN vs. NAS
CDW: SAN vs. NAS
Spiceworks
 
IBM Spectrum Scale for File and Object Storage
IBM Spectrum Scale for File and Object StorageIBM Spectrum Scale for File and Object Storage
IBM Spectrum Scale for File and Object Storage
Tony Pearson
 
Network Attached Storage (NAS)
Network Attached Storage (NAS)Network Attached Storage (NAS)
Network Attached Storage (NAS)sandeepgodfather
 
GlusterFS Architecture - June 30, 2011 Meetup
GlusterFS Architecture - June 30, 2011 MeetupGlusterFS Architecture - June 30, 2011 Meetup
GlusterFS Architecture - June 30, 2011 MeetupGlusterFS
 
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral ProgramBig Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
inside-BigData.com
 
Ibm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ashIbm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ash
Ashutosh Mate
 
Walk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoCWalk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoC
Ceph Community
 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN Caching
Sandeep Patil
 
Quantum NDX - NAS Based Data Protection
Quantum NDX - NAS Based Data Protection Quantum NDX - NAS Based Data Protection
Quantum NDX - NAS Based Data Protection Quantum
 
IBM Spectrum Scale Networking Flow
IBM Spectrum Scale Networking FlowIBM Spectrum Scale Networking Flow
IBM Spectrum Scale Networking Flow
Sandeep Patil
 
Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje...
Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje...Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje...
Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje...
Maginatics
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfs
Rami Jebara
 
Continuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed GapsContinuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed GapsGilHecht
 
958 and 959 sales exam prep
958 and 959 sales exam prep958 and 959 sales exam prep
958 and 959 sales exam prep
Jason Wong
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
IBM Spectrum Scale Security
IBM Spectrum Scale Security IBM Spectrum Scale Security
IBM Spectrum Scale Security
Sandeep Patil
 
Maginatics Cloud Storage Platform - MCSP 3.0 Technical Highlights
Maginatics Cloud Storage Platform - MCSP 3.0 Technical HighlightsMaginatics Cloud Storage Platform - MCSP 3.0 Technical Highlights
Maginatics Cloud Storage Platform - MCSP 3.0 Technical Highlights
Maginatics
 

What's hot (18)

Hadoop HDFS NameNode HA
Hadoop HDFS NameNode HAHadoop HDFS NameNode HA
Hadoop HDFS NameNode HA
 
CDW: SAN vs. NAS
CDW: SAN vs. NASCDW: SAN vs. NAS
CDW: SAN vs. NAS
 
IBM Spectrum Scale for File and Object Storage
IBM Spectrum Scale for File and Object StorageIBM Spectrum Scale for File and Object Storage
IBM Spectrum Scale for File and Object Storage
 
Network Attached Storage (NAS)
Network Attached Storage (NAS)Network Attached Storage (NAS)
Network Attached Storage (NAS)
 
GlusterFS Architecture - June 30, 2011 Meetup
GlusterFS Architecture - June 30, 2011 MeetupGlusterFS Architecture - June 30, 2011 Meetup
GlusterFS Architecture - June 30, 2011 Meetup
 
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral ProgramBig Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
 
Ibm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ashIbm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ash
 
Walk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoCWalk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoC
 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN Caching
 
Quantum NDX - NAS Based Data Protection
Quantum NDX - NAS Based Data Protection Quantum NDX - NAS Based Data Protection
Quantum NDX - NAS Based Data Protection
 
IBM Spectrum Scale Networking Flow
IBM Spectrum Scale Networking FlowIBM Spectrum Scale Networking Flow
IBM Spectrum Scale Networking Flow
 
Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje...
Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje...Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje...
Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje...
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfs
 
Continuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed GapsContinuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed Gaps
 
958 and 959 sales exam prep
958 and 959 sales exam prep958 and 959 sales exam prep
958 and 959 sales exam prep
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
IBM Spectrum Scale Security
IBM Spectrum Scale Security IBM Spectrum Scale Security
IBM Spectrum Scale Security
 
Maginatics Cloud Storage Platform - MCSP 3.0 Technical Highlights
Maginatics Cloud Storage Platform - MCSP 3.0 Technical HighlightsMaginatics Cloud Storage Platform - MCSP 3.0 Technical Highlights
Maginatics Cloud Storage Platform - MCSP 3.0 Technical Highlights
 

Viewers also liked

Deduplication
DeduplicationDeduplication
Deduplication
Lars Marius Garshol
 
[DPM 2015] PerfectDedup - Secure Data Deduplication for Cloud Storage
[DPM 2015] PerfectDedup - Secure Data Deduplication for Cloud Storage[DPM 2015] PerfectDedup - Secure Data Deduplication for Cloud Storage
[DPM 2015] PerfectDedup - Secure Data Deduplication for Cloud Storage
Pasquale Puzio
 
Netapp Deduplication concepts
Netapp Deduplication conceptsNetapp Deduplication concepts
Netapp Deduplication concepts
Saroj Sahu
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
Sara-Jayne Terp
 
Accurate Hotel Mapping to Increase your Bookings
Accurate Hotel Mapping to Increase your BookingsAccurate Hotel Mapping to Increase your Bookings
Accurate Hotel Mapping to Increase your Bookings
Tavisca Solutions
 
Creative portfolio tavisca 2014
Creative portfolio tavisca 2014Creative portfolio tavisca 2014
Creative portfolio tavisca 2014
Tavisca Solutions
 
Presentation deduplication backup software and system
Presentation   deduplication backup software and systemPresentation   deduplication backup software and system
Presentation deduplication backup software and system
xKinAnx
 
Secure auditing and deduplicating data in cloud
Secure auditing and deduplicating data in cloudSecure auditing and deduplicating data in cloud
Secure auditing and deduplicating data in cloud
nexgentech15
 
Accurate Hotel Mapping to Increase your Bookings
Accurate Hotel Mapping to Increase your BookingsAccurate Hotel Mapping to Increase your Bookings
Accurate Hotel Mapping to Increase your Bookings
Tavisca Solutions
 
A Hybrid Cloud Approach for Secure Authorized Deduplication
A Hybrid Cloud Approach for Secure Authorized DeduplicationA Hybrid Cloud Approach for Secure Authorized Deduplication
A Hybrid Cloud Approach for Secure Authorized Deduplication
SWAMI06
 
Linking data without common identifiers
Linking data without common identifiersLinking data without common identifiers
Linking data without common identifiers
Lars Marius Garshol
 
Source of Data in Research
Source of Data in ResearchSource of Data in Research
Source of Data in ResearchManu K M
 
EMC Deduplication Fundamentals
EMC Deduplication FundamentalsEMC Deduplication Fundamentals
EMC Deduplication Fundamentals
emcbaltics
 
Deduplication Using Solr: Presented by Neeraj Jain, Stubhub
Deduplication Using Solr: Presented by Neeraj Jain, StubhubDeduplication Using Solr: Presented by Neeraj Jain, Stubhub
Deduplication Using Solr: Presented by Neeraj Jain, Stubhub
Lucidworks
 

Viewers also liked (16)

Deduplication
DeduplicationDeduplication
Deduplication
 
[DPM 2015] PerfectDedup - Secure Data Deduplication for Cloud Storage
[DPM 2015] PerfectDedup - Secure Data Deduplication for Cloud Storage[DPM 2015] PerfectDedup - Secure Data Deduplication for Cloud Storage
[DPM 2015] PerfectDedup - Secure Data Deduplication for Cloud Storage
 
Netapp Deduplication concepts
Netapp Deduplication conceptsNetapp Deduplication concepts
Netapp Deduplication concepts
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Avamar presales 1.0
Avamar presales 1.0Avamar presales 1.0
Avamar presales 1.0
 
Accurate Hotel Mapping to Increase your Bookings
Accurate Hotel Mapping to Increase your BookingsAccurate Hotel Mapping to Increase your Bookings
Accurate Hotel Mapping to Increase your Bookings
 
Creative portfolio tavisca 2014
Creative portfolio tavisca 2014Creative portfolio tavisca 2014
Creative portfolio tavisca 2014
 
Presentation deduplication backup software and system
Presentation   deduplication backup software and systemPresentation   deduplication backup software and system
Presentation deduplication backup software and system
 
Hotel map process
Hotel map process   Hotel map process
Hotel map process
 
Secure auditing and deduplicating data in cloud
Secure auditing and deduplicating data in cloudSecure auditing and deduplicating data in cloud
Secure auditing and deduplicating data in cloud
 
Accurate Hotel Mapping to Increase your Bookings
Accurate Hotel Mapping to Increase your BookingsAccurate Hotel Mapping to Increase your Bookings
Accurate Hotel Mapping to Increase your Bookings
 
A Hybrid Cloud Approach for Secure Authorized Deduplication
A Hybrid Cloud Approach for Secure Authorized DeduplicationA Hybrid Cloud Approach for Secure Authorized Deduplication
A Hybrid Cloud Approach for Secure Authorized Deduplication
 
Linking data without common identifiers
Linking data without common identifiersLinking data without common identifiers
Linking data without common identifiers
 
Source of Data in Research
Source of Data in ResearchSource of Data in Research
Source of Data in Research
 
EMC Deduplication Fundamentals
EMC Deduplication FundamentalsEMC Deduplication Fundamentals
EMC Deduplication Fundamentals
 
Deduplication Using Solr: Presented by Neeraj Jain, Stubhub
Deduplication Using Solr: Presented by Neeraj Jain, StubhubDeduplication Using Solr: Presented by Neeraj Jain, Stubhub
Deduplication Using Solr: Presented by Neeraj Jain, Stubhub
 

Similar to Open Source Data Deduplication

VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld
 
Red Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed Hat Gluster Storage Performance
Red Hat Gluster Storage Performance
Red_Hat_Storage
 
Database performance tuning for SSD based storage
Database  performance tuning for SSD based storageDatabase  performance tuning for SSD based storage
Database performance tuning for SSD based storage
Angelo Rajadurai
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
DataStax Academy
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
David Grier
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
PostgreSQL Experts, Inc.
 
SSD based storage tuning for databases
SSD based storage tuning for databasesSSD based storage tuning for databases
SSD based storage tuning for databases
Angelo Rajadurai
 
Shootout at the AWS Corral
Shootout at the AWS CorralShootout at the AWS Corral
Shootout at the AWS Corral
PostgreSQL Experts, Inc.
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningFromDual GmbH
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems
confluent
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
Uri Cohen
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
JAXLondon2014
 
Ceph Day San Jose - Red Hat Storage Acceleration Utlizing Flash Technology
Ceph Day San Jose - Red Hat Storage Acceleration Utlizing Flash TechnologyCeph Day San Jose - Red Hat Storage Acceleration Utlizing Flash Technology
Ceph Day San Jose - Red Hat Storage Acceleration Utlizing Flash Technology
Ceph Community
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
Nicolas Poggi
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflows
jasonajohnson
 
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
Command Prompt., Inc
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
Tomas Vondra
 
Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems Baruch Osoveskiy
 

Similar to Open Source Data Deduplication (20)

VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
 
Red Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed Hat Gluster Storage Performance
Red Hat Gluster Storage Performance
 
Database performance tuning for SSD based storage
Database  performance tuning for SSD based storageDatabase  performance tuning for SSD based storage
Database performance tuning for SSD based storage
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
SSD based storage tuning for databases
SSD based storage tuning for databasesSSD based storage tuning for databases
SSD based storage tuning for databases
 
Shootout at the AWS Corral
Shootout at the AWS CorralShootout at the AWS Corral
Shootout at the AWS Corral
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL Tuning
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
Ceph Day San Jose - Red Hat Storage Acceleration Utlizing Flash Technology
Ceph Day San Jose - Red Hat Storage Acceleration Utlizing Flash TechnologyCeph Day San Jose - Red Hat Storage Acceleration Utlizing Flash Technology
Ceph Day San Jose - Red Hat Storage Acceleration Utlizing Flash Technology
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflows
 
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
 
Five steps perform_2009 (1)
Five steps perform_2009 (1)Five steps perform_2009 (1)
Five steps perform_2009 (1)
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
 
Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems
 

Recently uploaded

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 

Recently uploaded (20)

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 

Open Source Data Deduplication

  • 1. Open Source Data Deduplication Nick Webb nickw@redwireservices.com www.redwireservices.com @RedWireServices (206) 829-8621 Last updated 8/10/2011
  • 2. Introduction ● What is Deduplication? Different kinds? ● Why do you want it? ● How does it work? ● Advantages / Drawbacks ● Commercial Implementations ● Open Source implementations, performance, reliability, and stability of each
  • 3. What is Data Deduplication Wikipedia: . . . data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored, along with references to the unique copy of data. Deduplication is able to reduce the required storage capacity since only the unique data is stored. Depending on the type of deduplication, redundant files may be reduced, or even portions of files or other data that are similar can also be removed . . .
  • 4. Why Dedupe? ● Save disk space and money (less disks) ● Less disks = less power, cooling, and space ● Improve write performance (of duplicate data) ● Be efficient – don’t re-copy or store previously stored data
  • 5. Where does it Work Well? ● Secondary Storage ● Backups/Archives ● Online backups with limited bandwidth/replication ● Save disk space – additional full backups take little space ● Virtual Machines (Primary & Secondary) ● File Shares
  • 6. Not a Fit ● Random data ● Video ● Pictures ● Music ● Encrypted files – many vendors dedupe, then encrypt
  • 7. Types ● Source / Target ● Global ● Fixed/Sliding Block ● File Based (SIS)
  • 8. Drawbacks ● Slow writes, slower reads ● High CPU/memory utilization (dedicated server is a must) ● Increases data loss risk / corruption ● Collision risk of 1.3x10^-49% chance per PB ● (256 bit hash & 8KB Blocks)
  • 9. How Does it Work?
  • 12. Block Reclamation ● In general, blocks are not removed/freed when a file is removed ● We must periodically check blocks for references, a block with no reference can be deleted, freeing allocated space ● Process can be expensive, scheduled during off-peak
  • 13. Commercial Implementations ● Just about every backup vendor ● Symantec, CommVault ● Cloud: Asigra, Baracuda, Dropbox (global), JungleDisk, Mozy ● NAS/SAN/Backup Targets ● NEC HydraStor ● DataDomain/EMC Avamar ● Quantum ● NetApp
  • 14. Open Source Implementations ● Fuse Based ● Lessfs ● SDFS (OpenDedupe) ● Others ● ZFS ● btrfs (? Off-line only) ● Limited (file based / SIS) ● BackupPC (reliable!) ● Rdiff-backup
  • 15. How Good is it? ● Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical ● Especially true in backup or virtual environments
  • 16. SDFS / OpenDedupe www.opendedup.org ● Java 7 Based / platform agnostic ● Uses fuse ● S3 storage support ● Snapshots ● Inline or batch mode deduplication ● Supposedly fast (290MBps+ on great H/W) ● Support for global/clustered dedupe ● Probably most mature OSS Dedupe (IMHO)
  • 17. SDFS
  • 18. SDFS Install & Go Install Java # rpm –Uvh SDFS-1.0.7-2.x86_64.rpm # sudo mkfs.sdfs --volume-name=sdfs_128k --io-max-file-write-buffers=32 --volume-capacity=550GB --io-chunk-size=128 --chunk-store-data-location=/mnt/data # sudo modprobe fuse # sudo mount.sdfs -v sdfs_128k -m /mnt/dedupe
  • 19. SDFS ● Pro ● Works when configured properly ● Appears to be multithreaded ● Con ● Slow / resource intensive (CPU/Memory) ● Fragile, easy to mess up options, leading to crashes, little user feedback ● Standard POSIX utilities do not show accurate data (e.g. df, must use getfattr -d <mount point>, and calculate bytes → GB/TB and % free yourself) ● Slow with 4k blocks, recommended for VMs
  • 20. LessFS www.lessfs.com ● Written in C = Less CPU Overhead ● Have to build yourself (configure && make && make install) ● Has replication, encryption ● Uses fuse
  • 21. LessFS Install wget http://...lessfs-1.4.2.tar.gz tar zxvf *.tar.gz wget http://...db-4.8.30.tar.gz yum install buildstuff… . . . echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugep aged/defrag
  • 22. LessFS Go sudo vi /etc/lessfs.cfg BLOCKDATA_PATH=/mnt/data/dta/blockdata.dta META_PATH=/mnt/meta/mta BLKSIZE=4096 # only 4k supported on centos 5 ENCRYPT_DATA=on ENCRYPT_META=off mklessfs -c /etc/lessfs.cfg lessfs /etc/lessfs.cfg /mnt/dedupe
  • 23. LessFS ● Pro ● Does inline compression by default as well ● Reasonable VM compression with 128k blocks ● Con ● Fragile ● Stats/FS info hard to see (per file accounting, no totals) ● Kernel >= 2.6.26 required for blocks > 4k (RHEL6 only) ● Running with 4k blocks is not really feasible
  • 25. Other OSS ● ZFS? ● Tried it, and empirically it was a drag, but I have no hard data (got like 3x dedupe with identical full backups of VMs) ● At least it’s stable…
  • 26. Kick the Tires ● Test data set; ~330GB of data ● 22GB of documents, pictures, music ● Virtual Machines – 220GB Windows 2003 Server with SQL Data – 2003 AD DC ~60GB – 2003 Server ~8GB – Two OpenSolaris VMs, 1.5 & 2.7GB – 3GB Windows 2000 VM – 15GB XP Pro VM
  • 27. Kick the Tires ● Test Environment ● AWS High CPU Extra Large Instance ● ~7GB of RAM ● ~Eight Cores ~2.5GHz each ● ext4
  • 28. Compression Performance ● First round (all “unique” data) ● If another copy was put in (like another full), we should expect 100% reduction for that non-unique data (1x dedupe per run) FS Home % Home VM % VM Combined % Total MBps Data Reduction Data Reduction Reduction SDFS 4k 21GB 4.50% 109 64% 128GB 61% 16 GB lessfs 4k 24GB -9% N/A 51% N/A 50% 4 (est.) SDFS 128k 21GB 4.50% 255 16% 276GB 15% 40 GB lessfs 128k 21GB 4.50% 130 57% 183GB 44% 24 GB tar/gz --fast 21GB 4.50% 178 41% 199GB 39% 35 GB
  • 29.
  • 30. Write Performance (don't trust this) MBps 40 35 30 25 20 MBps 15 10 5 0 raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k tar/gz --fast
  • 31. Kick the Tires: Part 2 ● Test data set – two ~204GB full backup archives from a popular commercial vendor ● Test Environment ● VirtualBox VM, 2GB RAM, 2 Cores, 2x7200RPM SATA drives (meta & data separated for LessFS) ● Physical CPU: Quad Core Xeon
  • 32. Write Performance MBps 40 35 30 25 20 MBps 15 10 5 0 raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W
  • 33.
  • 35. Open Source Dedupe ● Pro ● Free ● Can be stable, if well managed ● Con ● Not in repos yet ● Efforts behind them seem very limited, 1 dev each ● No/Poor documentation
  • 36. The Future ● Eventual Commodity? ● brtfs ● Dedupe planned (off-line only)
  • 37. Conclusion/Recommendations ● Dedupe is great, if it works and it meets your performance and storage requirements ● OSS Dedupe has a way to go ● SDFS/OpenDedupe is best OSS option right now ● JungleDisk is good and cheap, but not OSS
  • 38. About Red Wire Services If you found this presentation helpful, consider Red Wire Services for your next Backup, Archive, or IT Disaster Recovery Planning project. Learn more at www.RedWireServices.com
  • 39. About Nick Webb Nick Webb is the founder of Red Wire Services, in Seattle, WA. Nick is available to speak on a variety of IT Disaster Recovery related topics, including: ● Preserving Your Digital Legacy ● Getting Started with your Small Business Disaster Recovery Plan ● Archive Storage for SMBs If interested in having Nick speak to your group, please call (206) 829-8621 or email info@redwireservices.com

Editor's Notes

  1. Different types of deduplication levels:File levelBlock levelVariable block versus fixed block Quantum/DD Variable Blocks
  2. Pretty much the same as all compression