Deduplication and Single
                                   Instance Storage

                                    Practical Applications for Backups,
                                     Archiving, and Primary Storage




                                       Presented by:

                                              Jacob Farmer
                                              Cambridge Computer


© Copyright 2009-2010, Cambridge Computer Services, Inc. – All Rights Reserved
www.CambridgeComputer.com – 781-250-3000
About Your Lecturer

       Jacob Farmer, CTO, Cambridge Computer
         • Cambridge Computer, founded in 1991, provides training, integration,
           sales, and consulting in the fields of storage management, data
           protection, and digital archiving.
       Been working in data protection and storage management for
       almost 20 years.
         • Lecturer on storage technologies for Usenix for the past 10 years.
       Hybrid of industry analyst and consultant to end-users.
         • Spend 25% of my time working in the industry, going to conferences,
           meeting with vendors.
         • 75% of my time customer-facing, helping the sales and services
           departments design solutions for end users.
       Email: jfarmer@CambridgeComputer.com


Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010
© Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved.
                                                                                   www.CambridgeComputer.com   2
Follow Me on Twitter


        My personal activities:
         •@JacobAFarmer
                –Note the “A” – my middle initial
        My educational activities
         •@Cambridge_EDU

Agenda / Topics

       Dedupe basics
         • What is it, how does it work, and what is all the fuss about?
         • Hashing, segmenting, indexing, etc.
       Dedupe for backup systems
         • Basic benefits
         • Different approaches for scaling backups and how they relate back to dedupe
                – Front end bottlenecks
                – Backup data-movers
                – Back-end bottlenecks and scalable deduping
       Dedupe for primary storage
         • Virtual servers, physical servers, VDI
         • Rich media dedupe
       WAN Accelerators
       Questions as time permits

What is Deduplication?

       A term that refers to a number of different methods
       and techniques for reducing multiple instances of
       identical data down to a single (or at least fewer)
       instances.
         • Common data is replaced with pointers or tokens that refer
           back to the actual data.
       Other terms for deduplication
         • Data Reduction
         • Commonality Factoring
         • Capacity Optimization
         • Single Instancing or Single-Instance Storage (SIS)
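
The idea of replacing common data with pointers can be sketched in a few lines. This is a minimal in-memory illustration, not any vendor's implementation; the class name `SingleInstanceStore`, its methods, and the file paths are hypothetical.

```python
import hashlib

class SingleInstanceStore:
    """Toy store: each unique blob is kept once, keyed by its content hash."""

    def __init__(self):
        self.blobs = {}     # hash token -> actual data (stored once)
        self.catalog = {}   # logical name -> hash token (the "pointer")

    def put(self, name, data):
        token = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(token, data)   # keep the data only if it is new
        self.catalog[name] = token
        return token

    def get(self, name):
        # Follow the pointer back to the single stored instance.
        return self.blobs[self.catalog[name]]

    def physical_bytes(self):
        return sum(len(b) for b in self.blobs.values())

store = SingleInstanceStore()
attachment = b"quarterly report " * 1000
for user in ("alice", "bob", "carol"):
    store.put(f"/home/{user}/report.doc", attachment)

print(len(store.catalog), "logical files,",
      len(store.blobs), "stored instance,",
      store.physical_bytes(), "bytes on disk")
```

Three logical files end up pointing at one stored copy, which is single instancing at the whole-file level.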

Is Deduplication a Form of
 Compression?
       Yes, and No.

       YES – Deduplication results in data taking up less
       storage space or consuming less bandwidth on a
       network circuit.
         • Note that dedupe is often used in conjunction with
           conventional compression.
       NO – Deduplication could work on data types that
       are not compressible.
         • If you have 10 identical JPEG files stored in an
           uncompressible format, they could be reduced to a single
           instance, thus freeing up 90% of your capacity.
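
The JPEG example can be tried with a short sketch. Random bytes stand in for already-compressed image data (an assumption made so the example needs no real files), and the 50,000-byte size is arbitrary.

```python
import hashlib
import random
import zlib

rng = random.Random(0)
# Random bytes stand in for an already-compressed file such as a JPEG (assumption).
photo = bytes(rng.getrandbits(8) for _ in range(50_000))
copies = [photo] * 10

# Conventional compression gains nothing on this kind of data.
compressed_size = len(zlib.compress(photo, 9))

# Deduplication keeps one instance per distinct fingerprint.
unique = {hashlib.sha1(c).digest(): c for c in copies}
logical = sum(len(c) for c in copies)
physical = sum(len(c) for c in unique.values())
savings = 1 - physical / logical

print(f"zlib:   {len(photo)} -> {compressed_size} bytes")
print(f"dedupe: {logical} -> {physical} bytes ({savings:.0%} freed)")
```

Compression and dedupe attack different kinds of redundancy, which is why the two are often combined.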

Where Do You Find Dedupe
 Solutions?
       Deduplication solutions come to market wherever cost savings or
       efficiencies can be achieved by eliminating redundancy.
         • Backups
                – Conventional backup systems generate tons of redundant data
         • Email systems (at rest and in flight)
                – I send an email with the same attachment to everyone in the company.
                – Then everyone stores it in his/her personal home directory
                – Everyone in the branch offices pulls it over the WAN
         • File traffic over a WAN
         • Application and O.S. binaries across multiple systems
                – Virtual Servers and Virtual Desktops
                – Backups over a WAN
         • Very large collections of rich media files

Hashing / Fingerprinting

       Hashing (aka fingerprints, digests, signatures)
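         • Generates an effectively unique number (128 to 256 bits for the
           hash algorithms listed here) based on content
         • Hash acts as a proxy for content
         • Given a hash, not computationally feasible to generate
           content
       Common Hashing Algorithms
         • MD5 (128 bits)
         • SHA-1 (160 bits)
         • SHA-256 (256 bits)
         • AES is sometimes listed here, but it is a block cipher, not a
           hash function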
       Hash Size and the Birthday Paradox
         • The size of the hash needs to be suitable to the task at
           hand
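
A rough sketch of how hash size relates to collision risk, using Python's standard `hashlib`. The workload (1 PB of unique data cut into 8 KiB segments) is an assumption chosen for illustration, and the formula is the usual birthday-bound approximation, which only holds while the probability is small.

```python
import hashlib

segment = b"The quick brown fox jumped over the lazy dog."
for name in ("md5", "sha1", "sha256"):
    digest = hashlib.new(name, segment).digest()
    print(f"{name:7s} {len(digest) * 8:3d} bits  {digest.hex()[:16]}...")

def collision_probability(num_segments, hash_bits):
    """Birthday-bound approximation: p ~= n^2 / 2^(bits + 1), for small p."""
    return num_segments ** 2 / 2 ** (hash_bits + 1)

# Assumed workload: 1 PB of unique data in 8 KiB segments.
n = (10 ** 15) // (8 * 1024)
for bits in (128, 160, 256):
    p = collision_probability(n, bits)
    print(f"{bits}-bit hash, {n:.2e} segments: p ~ {p:.1e}")
```

Every extra bit of hash halves the estimated collision probability for the same number of segments, so a larger repository calls for a larger hash.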

Hash Collisions - Are they real?


       [Figure: probability scale marked 10^-10, 10^-20, and 10^-30, with events
        placed along it, roughly from most to least likely: hit by lightning;
        win the lottery; simultaneous triple disk fault on RAID-6; Fibre Channel
        bit error rate; Cretaceous extinction meteor hitting in the next second;
        cryptographic hash collision.]



What Makes Deduplication and SIS
 Technology Difficult to Engineer?
       Hash Processing
         • Modern CPUs make this much easier
                – 100+ MB/sec/core
         • Hardware co-processor cards can hash at rates north of
           1.5GB/Sec.
       Disk performance
         • Deduped data often ends up getting fragmented on disk
                – This can hurt performance especially for backup systems

       Alignment of de-dupe segments
       Indexing

Indexing Can Be Hard
       [Figure: log-log plot of # lookups/sec (10^0 to 10^6) against # records
        (10^1 to 10^9). Regions, from bottom to top: human lookup rates,
        software database technology, purpose-built hardware. Example points:
        iPod, NYC phonebook, large database, router, fine-grained content
        tracking.]
Parsing / Segmenting / Chunking

       Data needs to be “chopped up” in a consistent way in order to get optimal
       dedupe ratios
       Without any kind of special segmenting strategies backup streams and
       complex file types do not dedupe effectively

       Large files are almost always changed with overstrike semantics
         • Databases, structured data, .vmdk, .pst files
       Small files are almost always changed with insert semantics
         • Office apps, editors etc
       If there are large files (e.g. database tables, virtual machine images) in the
       backup mix, their treatment usually will dominate any data reduction
       strategy.
         • Don’t sweat the small stuff!
       Different vendors may have strengths with one type or another
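
Why segmenting strategy matters can be shown with a small experiment. The sketch below compares fixed-size segments with a simple content-defined scheme on random test data. The window size, divisor, and use of `zlib.crc32` are illustrative assumptions, and recomputing the checksum at every position is done for clarity rather than speed (real products typically use a rolling hash).

```python
import hashlib
import random
import zlib

def fixed_chunks(data, size=64):
    """Cut at fixed offsets; an insert shifts every later boundary."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data, window=16, divisor=64):
    """Cut wherever a checksum of the last `window` bytes hits a chosen value,
    so boundaries follow the content rather than the byte offset."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared_segments(chunker, old, new):
    """Count segment fingerprints the new version has in common with the old."""
    old_hashes = {hashlib.sha1(c).digest() for c in chunker(old)}
    new_hashes = {hashlib.sha1(c).digest() for c in chunker(new)}
    return len(old_hashes & new_hashes)

rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
inserted = original[:100] + b"horse" + original[100:]     # insert semantics
overstruck = original[:100] + b"horse" + original[105:]   # overstrike semantics

for label, changed in (("insert", inserted), ("overstrike", overstruck)):
    print(label,
          "| fixed:", shared_segments(fixed_chunks, original, changed),
          "| content-defined:", shared_segments(content_chunks, original, changed))
```

With an overstrike, fixed-size segments still line up and nearly everything dedupes. With an insert, the fixed boundaries shift and almost nothing matches, while the content-defined boundaries resynchronize shortly after the change.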

Change Types: Insert v. Overstrike



        Insert:
             The quick brown fox jumped over the lazy dog.
             The quick brown horse jumped over the lazy dog.
             Identical data (may be) misaligned
        Overstrike (“Fred” added to employee database):
             Joe                                            Sue
             Joe                         Fred               Sue
             Identical data doesn’t move

NetBackup OST (open storage option)

       API and framework that makes it very easy for a dedupe
       target device vendor to parse the data stream.
         • Pre-segments content
         • Enables more efficient dedupe solutions
         • Allows for smart copy between systems of only changed
           data


       [Figure: smart copy between systems; segment labels shown: PQZ, R, Z, PQR]

Deduplication for Backups




Backup Systems Have a Lot of
 Redundant Data
       Conventional backup solutions generate a ton of redundant
       data
         • Assuming weekly full backups, a file that has not changed in 5 years
           still gets backed up 260 times!
         • Assuming daily full backups of email, a message you received 5 years
           ago gets backed up 1825 times.
                – Similarly, a record in a database from 5 years ago might be backed up
                  1825 times!
       There are really two problems to solve:
         • Minimizing the amount of redundant data that gets repeatedly
           transferred
         • Minimizing the amount of redundant data that gets stored.
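
The counts above are simple multiplication, spelled out here. The function names are hypothetical, and the redundant fraction assumes the item never changes, so every copy after the first is pure duplication.

```python
def copies_backed_up(fulls_per_year, years):
    """How many times an unchanged item is captured by repeated full backups."""
    return fulls_per_year * years

def redundant_fraction(copies):
    """Share of the backed-up copies that duplicate the first one."""
    return (copies - 1) / copies

weekly = copies_backed_up(52, 5)    # weekly fulls for 5 years
daily = copies_backed_up(365, 5)    # daily fulls for 5 years

print(f"weekly fulls: {weekly} copies, {redundant_fraction(weekly):.2%} redundant")
print(f"daily fulls:  {daily} copies, {redundant_fraction(daily):.2%} redundant")
```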



Most of the Buzz on Dedupe is from
 Backup Target Vendors
       A “target” is a backup storage device
       Dedupe disk targets generally come packaged as
         • NAS
                – File server (NFS or CIFS) interface
         • Virtual tape library
                – A disk device that emulates a tape library
                – Fibre Channel or iSCSI interface
                – NAS vs. VTL is outside the scope of this talk




When Do Dedupe Disk Targets Shine?

       When you are backing up a lot of redundant data
         • Files that never or seldom change between backups
         • Duplicated files
         • Databases and email repositories that are receptive to
           commonality factoring
       When you are retaining backup data for a decent
       amount of time
         • Ideally you are keeping several weeks of backups
       When you seek to replicate a conventional backup
       system over a WAN.

Example: NYC Law Firm with
 NetBackup
       6+ Terabytes
       Full backups every day!
         • Why? Because someone had a bad experience in the past
           with incremental backups and has trust issues
       90 day retention period
       Most files seldom change
         • Many files are scanned images that never change
       Several large databases
       Several TB of MS Exchange
       Average result – 102x capacity optimization!
Try to Visualize 100x Capacity
 Optimization




   [Figure: a single 3U cabinet OR seven full racks]

   One 3U cabinet v. 7 racks full of gear!

Backup Vaulting – Another Use Case
 for Dedupe




Why Replicate the Backup System?

       Relatively easy DR solution
         • Does not require additional software for the hosts
         • Does not require storage devices with replication
           capabilities
         • One system that replicates all of your hosts
                – Platform-independent

       Eliminate the need to ship tapes off site
         • Eliminate the need to encrypt tapes




Example: Defense Contractor
 Replicating ERP System
       Problem: CIO does not want employee data being
       sent off site without encrypting the tapes.
         • IT staff wants to avoid tape encryption.

       Solution:
         • Full backup of 800GB+ Oracle database to deduplicating disk target
           every day.
         • Retain backups for 60 days on disk.
                – 60 x 800GB = 48TB
         • Vault backups to remote site over T1

       Outcome
         • Dedupe ratio of about 70:1
         • 800GB backup job traverses the T1 in a few hours
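
A back-of-the-envelope check on these numbers, under stated assumptions: a T1 is taken at its nominal 1.544 Mbit/s line rate with no protocol overhead, decimal units are used, and "a few hours" is taken to mean 3 hours. None of these assumptions come from the slide.

```python
T1_BITS_PER_SEC = 1.544e6   # nominal T1 line rate
GB = 1e9                    # decimal gigabyte

def transfer_hours(num_bytes, bits_per_sec=T1_BITS_PER_SEC):
    """Hours needed to push num_bytes through the link at full line rate."""
    return num_bytes * 8 / bits_per_sec / 3600

# Sending the full 800GB backup without any data reduction.
raw_days = transfer_hours(800 * GB) / 24

# 60 days x 800GB = 48TB of logical backups, stored at about 70:1.
stored_tb = 60 * 800 / 70 / 1000

# How much data fits through the T1 in the assumed 3-hour window.
window_hours = 3
sendable_gb = T1_BITS_PER_SEC / 8 * 3600 * window_hours / GB

print(f"raw 800GB over T1:      {raw_days:.0f} days")
print(f"disk used for 48TB:     {stored_tb:.2f} TB")
print(f"T1 payload in {window_hours} hours:  {sendable_gb:.1f} GB")
```

Under these assumptions the un-deduplicated job would take weeks, while only about 2GB fits through the link in 3 hours, which suggests that only a small fraction of each nightly full is new data.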

But, Before We Get all Hot and
 Bothered . . .


 Let’s review how backup systems
 actually work!



Common Backup Bottlenecks

       Backup Clients
         • You have to get data off the host and transfer it
       Network
         • Seldom the real bottleneck, except over a WAN
       Backup Servers
         • I/O processing is the most common bottleneck
       Storage Devices
         • Storage devices can be a bottleneck, but are seldom the whole
           problem.


Front-end and Network – Minimize
 Duplication in the First Place
       Backups generate a lot of redundant data, so what if
       we had smarter client software that did not generate
       redundant data?
         • Incremental Forever
                – After the first full backup, only do incremental backups
                – This is what IBM TSM does, for instance
         • Synthetic Full Backup
                – Last week’s full backup is merged with this week’s incremental
                  backups to “synthesize” this week’s full backup.
                – No need to transfer redundant data
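
The synthetic full idea can be sketched as a merge on the backup server. This is a simplified model, not how any particular backup product stores its catalogs: each backup is a dictionary of path to content, and `None` marks a deleted file (an assumed convention).

```python
def synthesize_full(previous_full, incrementals):
    """Merge the last full with later incrementals (oldest first) into a new full.

    Each backup maps path -> content; a value of None marks a deletion
    (a convention assumed for this sketch).
    """
    new_full = dict(previous_full)
    for incremental in incrementals:
        for path, content in incremental.items():
            if content is None:
                new_full.pop(path, None)    # file was deleted since the full
            else:
                new_full[path] = content    # file was added or changed
    return new_full

last_week_full = {"a.txt": "v1", "b.txt": "v1", "c.txt": "v1"}
monday = {"a.txt": "v2"}
tuesday = {"c.txt": None, "d.txt": "v1"}

this_week_full = synthesize_full(last_week_full, [monday, tuesday])
print(this_week_full)
```

The merge runs entirely on data the backup system already holds, so the clients only ever send the incrementals.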




Example: Energy Firm using IBM
 Tivoli Storage Manager
       TSM only backs up files that have changed.
         • It does not generate a lot of duplicate files
       Most of the 15TB of capacity are documentation and images
       that do not change – ever.
         • Relatively little of it is database.
         • Images don’t compress
         • Utilizing compression on TSM client for compressible files
       Overall deduplication ratio: about 2:1
         • Can’t justify the cost of dedupe across the board
         • Resolution: Set up dedupe tier for database and email
                – Do the file backups to conventional disk and tape




Synthetic Full Backups – An Approach
 that Creates a Need for Dedupe
       Synthetic Full Backups
         • “Poor man’s incremental forever”
         • Combine subsequent incremental backups with the
           previous full backup to “synthesize” the next week’s full
           backup.
         • Great technique for minimizing networking traffic from
           backups.
       Synthetic full backups require that at least two weeks
       of backups be available on disk.
         • Dedupe disk targets tend to be a big win for synthetic full
           backups

Example: Research Firm with 6 Week
 Retention and Synthetic Fulls
       60TB+
         • Mix of large file systems, content management systems,
           email, and database
       Using Commvault with heavy use of synthetic full
       backups
       6 week retention on disk
       Dedupe ratios between 8x and 16x
         • NOTE: Their backup data could not fit in one dedupe box,
           so they are managing 4 separate dedupe appliances in
           each of their locations.


Theoretical v. Actual Capacity
 Your Mileage May Vary
       YMMV – one customer’s mileage
         • 48 TB raw disk
         • 36 TB with RAID-6
         • 35 +/- TB for unique capacity
         • 3-5 TB deliberately left empty for headroom
       Might hold
         • As much as 500 TB of backups
         • Or as little as 50 TB.
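
The spread between 50 TB and 500 TB follows from multiplying usable unique capacity by whatever dedupe ratio the data actually achieves. In the sketch below, the 4 TB headroom is the midpoint of the 3-5 TB range, and the two ratios (1.6:1 and 16:1) are assumptions picked so the results land near the figures on the slide; the slide itself does not state them.

```python
def logical_capacity_tb(unique_tb, headroom_tb, dedupe_ratio):
    """Backup data the box can hold: usable unique capacity times the ratio."""
    return (unique_tb - headroom_tb) * dedupe_ratio

UNIQUE_TB = 35     # unique capacity after RAID-6 and overhead
HEADROOM_TB = 4    # midpoint of the 3-5 TB deliberately left empty

for ratio in (1.6, 16):   # assumed ratios, not from the slide
    held = logical_capacity_tb(UNIQUE_TB, HEADROOM_TB, ratio)
    print(f"{ratio:>4}:1 -> about {held:.0f} TB of backups")
```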




Dedupe on the Backup Client

       Host-side dedupe is a form of sub-file-level incremental
         • Instead of catching block-level changes, the file system changes are
           hashed and compared with the back-end storage repository.
         • Alternative to block-based CDP
         • Unique data segments are then transferred to the backup service.
       Host-side deduping is very valuable over the WAN.
         • Minimizes data that needs to be transferred
         • Typically it will dedupe across hosts, reducing files that are common to
           multiple hosts
                – Such as application and operating system binaries
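
The exchange between a deduplicating client and the backup service can be sketched as follows. This is a toy model with hypothetical names (`BackupServer`, `backup_file`) and an assumed fixed 4 KiB segment size; real products differ in segmenting, protocol, and index design.

```python
import hashlib

SEGMENT_SIZE = 4096   # assumed fixed segment size for this sketch

class BackupServer:
    def __init__(self):
        self.segments = {}    # hash -> data, shared across all hosts
        self.manifests = {}   # (host, path) -> ordered list of hashes

    def missing(self, hashes):
        """Tell the client which segments the repository does not have yet."""
        return [h for h in hashes if h not in self.segments]

    def upload(self, segments_by_hash):
        self.segments.update(segments_by_hash)

    def commit(self, host, path, hashes):
        self.manifests[(host, path)] = hashes

    def restore(self, host, path):
        return b"".join(self.segments[h] for h in self.manifests[(host, path)])

def backup_file(server, host, path, data):
    """Hash locally, ask what is new, send only that. Returns bytes sent."""
    pieces = [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]
    hashes = [hashlib.sha256(p).hexdigest() for p in pieces]
    by_hash = dict(zip(hashes, pieces))
    to_send = {h: by_hash[h] for h in set(server.missing(hashes))}
    server.upload(to_send)
    server.commit(host, path, hashes)
    return sum(len(p) for p in to_send.values())

server = BackupServer()
# Four distinct 4 KiB segments stand in for a binary common to both hosts.
os_binary = b"".join(bytes([i]) * SEGMENT_SIZE for i in range(4))

sent_london = backup_file(server, "london", "/bin/tool", os_binary)
sent_hongkong = backup_file(server, "hongkong", "/bin/tool", os_binary)
print("london sent", sent_london, "bytes; hongkong sent", sent_hongkong, "bytes")
```

The second host sends only hashes, because the repository already holds every segment of the shared binary. That is the saving that matters over a WAN.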




WAN Backup Software with Dedupe in
 the Client
       [Figure: WAN backup topology. Labels: Dedupe Client, London, Hong Kong,
        Local USB, WAN, LAN, New York, Jersey City, Shared Client & Local
        Recovery, Backup Server(s).]
Backup System Network I/O
 Processing Bottlenecks


 Moving Backup Data Through the
 Network



Backup Server I/O Processing is a
 Major Bottleneck
       In most enterprise backup systems a single backup
       server would be a major performance bottleneck
         • Unless you were doing incremental forever or sub-file-level
           backups
         • Add a dedupe process to that and it becomes that much
           harder
       A common practice for scaling out backup server
       performance is to add network “data movers”
         • Also known as: storage nodes, media servers, media
           agents, etc.


Interesting Idea – Add Deduplication
 to the Network Data Movers




       [Figure: network data movers with deduplication. Labels: Network,
        Dedicated Storage Network.]




I/O Processing Bottlenecks
 Network Data Movers and “LAN-Free”



       [Figure: network data movers and “LAN-Free” clients. Labels: Network
        data movers, “LAN-Free” Clients, Dedicated Storage Network.]




LAN-Free Backup Clients and NDMP
 Backups
       In larger enterprise-class backup systems it is
       common to have larger servers move data directly to
       storage devices over Fibre Channel.
       The fastest way to back up a large NAS server is to do
       NDMP dumps over Fibre Channel.




“LAN-FREE” – End-run Around the
 Backup Server



       [Figure: LAN-free backup. Labels: GigE, Storage Area Network, Tape Robot
        Arm.]
       SAN clients work like slave servers. They back up directly to the
       storage media, while reporting metadata over the LAN to the backup
       server.
       Presumably all of these tape drives are part of a tape library.

Dedupe with LAN-Free Backup Clients

       With LAN-Free backup you get no benefit from
       dedupe processing residing on the data movers.
         • The dedupe logic needs to sit on the target storage device
       This is where VTLs shine
         • VTL works just like tape
                – Network data movers work fine
                – LAN-Free clients work fine
         • VTLs offer higher throughput than CIFS or NFS
                – Common to see total throughput in excess of 1GB/Sec
         • VTLs might offer tighter integration with tape
       Many VTLs do dedupe as a post-process


Back End Bottlenecks


 Can your dedupe appliance keep
 pace with the backup system?



Back-End Bottlenecks: Can the
 Dedupe Storage Devices Hack It?
       If you open up the flood gates, you might find that a
       single dedupe box on the LAN cannot hack it.
       Some solutions:
         • Buy lots of individual dedupe devices
         • Maybe use a VTL implementation of dedupe
                – Sorry, out of the scope of this lecture
         • Post-process deduping instead of deduping on-the-fly
                – Less efficient from a capacity standpoint, but should be able to
                  achieve considerably better performance
         • New grid-based architectures that offer parallel processing
           for deduplication
         • Newer dedupe devices that are up to the task
Stand-alone Deduplication Servers

       Single server dedupe solutions are often constrained
       by:
         • RAM and processing power
         • The size of the index they can manage
         • Disk performance
       When you max out the box, you need to buy another
       one
         • Very painful incremental upgrade
         • No dedupe across multiple boxes
         • Make sure that you buy a big enough box!

Object-Based File System with Grid
 Architecture and Global Dedupe
       CIFS/NFS Clients, Backup System Data Movers, Conventional File System Consumers
       Front-End Nodes
         • Export File Systems
         • Scale-out performance into GBs/Sec
       Back-End Nodes
         • Manage disk, dedupe, and redundancy
         • Scale-deep to Petabytes of capacity
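
The difference between several stand-alone boxes and one global dedupe index can be sketched in a few lines. This is a toy model, not any vendor's design: the segment contents are hypothetical, segments are pre-cut, and SHA-1 stands in for whatever fingerprint a real system uses.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Content hash used as the dedupe key."""
    return hashlib.sha1(segment).hexdigest()


def stored_segments_separate(streams_per_box):
    """Each box keeps its own index, so a segment seen by two boxes is stored twice."""
    total = 0
    for streams in streams_per_box:
        index = set()
        for stream in streams:
            for segment in stream:
                index.add(fingerprint(segment))
        total += len(index)
    return total


def stored_segments_global(streams_per_box):
    """All nodes share one index, so each unique segment is stored once."""
    index = set()
    for streams in streams_per_box:
        for stream in streams:
            for segment in stream:
                index.add(fingerprint(segment))
    return len(index)


if __name__ == "__main__":
    os_image = [b"kernel", b"libc", b"shell"]   # hypothetical segments common to every host
    box_a = [os_image + [b"mail-db"]]           # backup stream routed to box A
    box_b = [os_image + [b"file-share"]]        # backup stream routed to box B
    boxes = [box_a, box_b]
    print("separate boxes:", stored_segments_separate(boxes))
    print("global index:  ", stored_segments_global(boxes))
```

With separate boxes the shared operating system segments are stored once per box. With a global index they are stored once in total.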




VTL with Scalable Deduplication



       [Diagram: GigE; Storage Network; VTL; De-Duplication Processors; Single Instance Repository]

Summary: Alternative Technologies
 to Dedupe Disk Targets
       Don’t Duplicate in the First Place
         • Incremental Forever Backups
         • WAN-enabled backups, perhaps with dedupe on the client

       Throw disk at it
         • Bulk SATA arrays typically cost less than $1K per TB
                – Capacities up to 2PB
                – Densities on the order of 1PB / rack
                – MAID – power management to spin down inactive drives

       Replicate Your SAN or NAS
         • Use an optimized file backup or archive solution to provide file recovery
           and to meet retention requirements



Other Examples of Dedupe
 Technology


 Primary Storage, VDI, Rich Media
 Archiving



Block-Level Dedupe for Primary
 Storage
       Most dedupe solutions are designed specifically for backup
       and archival data.
       A limited number of products can dedupe live data.
         • Perhaps one day, dedupe for primary storage will be a way of life
       Great applications – those with redundant data!
         • Desktop virtualization (VDI)
                – A number of very interesting solutions are coming to market
         • VMDK backup, dedupe, and fail-over on one platform
         • Boot image servers
       Reclamation of empty disk space
         • Blank space deduplicates very nicely!
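
A minimal sketch of why blank space deduplicates so well under fixed-size block dedupe: every zero-filled block hashes to the same fingerprint, so any amount of empty space collapses to one stored block. The 4 KB block size and the synthetic volume layout are assumptions for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, in bytes


def dedupe_blocks(volume: bytes, block_size: int = BLOCK_SIZE):
    """Split a volume into fixed-size blocks and return (total_blocks, unique_blocks)."""
    unique = set()
    total = 0
    for offset in range(0, len(volume), block_size):
        block = volume[offset:offset + block_size]
        unique.add(hashlib.sha256(block).digest())
        total += 1
    return total, len(unique)


if __name__ == "__main__":
    # 10 distinct "used" blocks followed by 990 blocks of never-written (zeroed) space.
    used = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(1, 11))
    blank = bytes(BLOCK_SIZE * 990)
    total, unique = dedupe_blocks(used + blank)
    print(f"{total} blocks on the volume, {unique} unique blocks to store")
```

Here the 990 blank blocks cost the same as one, so the volume needs 11 stored blocks instead of 1000.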



Single-Instance Storage for Virtual
 Desktops
       Storage is a big deal-breaker for many VDI use cases
         • VDI replaces desktop storage and desktop personnel with SAN storage and
           highly specialized storage managers
       New techniques for VDI storage break the desktop down into
       elements and find commonality across all desktops
       Virtual desktop file systems are “stitched together” from common
       elements:
         • Operating system
         • Applications or sets of applications
         • Variable elements
                – For example: anti-virus signatures
         • Personal elements
                – Screen savers and background images
                – Google toolbar
                – Personal applications
                – Personal files
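
The storage effect of stitching desktops together from common elements can be estimated with simple arithmetic. The sketch below is illustrative only: the element names and sizes in `SHARED_ELEMENTS`, the personal data per desktop, and the desktop count are made-up numbers, not measurements.

```python
# All sizes are hypothetical, in GB.
SHARED_ELEMENTS = {
    "operating_system": 15.0,
    "office_apps": 5.0,
    "antivirus_signatures": 0.5,
}


def full_copy_storage(desktops: int, personal_gb: float) -> float:
    """Every desktop carries its own copy of every element."""
    return desktops * (sum(SHARED_ELEMENTS.values()) + personal_gb)


def single_instance_storage(desktops: int, personal_gb: float) -> float:
    """Shared elements are stored once; only personal elements are per desktop."""
    return sum(SHARED_ELEMENTS.values()) + desktops * personal_gb


if __name__ == "__main__":
    n, personal = 1000, 2.0
    print(f"full copies:     {full_copy_storage(n, personal):,.1f} GB")
    print(f"single instance: {single_instance_storage(n, personal):,.1f} GB")
```

With these assumed numbers the shared elements dominate the full-copy total, so storing them once removes most of the capacity requirement.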


Dedupe Across Large Collections of
 Rich Media Files
       Many types of files have content-level commonality across a
       large collection of files.
         • TIFF
         • JPG
         • PNG
         • OpenEXR
         • DICOM
         • MS Office Documents
         • PDFs
       A high level of commonality can be detected and
       deduplicated, assuming a large enough sample set of data.
         • Capacity optimization (depending on file type) on the order of 2x to 10x and
           beyond.
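
An optimization ratio converts to saved capacity as 1 - 1/ratio. The helper name `space_saved` is ours, used only to work through the 2x and 10x figures above.

```python
def space_saved(ratio: float) -> float:
    """Fraction of raw capacity saved for a given optimization ratio (10 means 10x)."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1 - 1 / ratio


if __name__ == "__main__":
    for ratio in (2, 5, 10):
        print(f"{ratio}x optimization frees {space_saved(ratio):.0%} of the raw capacity")
```

Going from 2x to 10x moves the savings from 50% to 90%, so each further increase in the ratio buys less additional capacity than the one before.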



Dedupe in WAN Accelerators




MS Exchange Branch Office: Example
 of the Need for Dedupe over the WAN


       [Diagram: Chicago site with MS Exchange Server; WAN; sites in New York, Atlanta, and San Fran]
         • Message with attachment sent to all staff.
         • Single instance message storage, but the same message crosses the WAN multiple times

WAN Accelerator with Inline Dedupe



       [Diagram: Site A and Site B linked by WAN Accelerators / WAFS Gateways; File Servers or NAS Appliance]
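
A minimal sketch of inline dedupe between a pair of WAN accelerators, applied to the case where the same attachment crosses the WAN more than once. The `Accelerator` class, the fixed 1 KB segment size, and the `("data", ...)` / `("ref", ...)` wire format are hypothetical. A real product would also handle variable-size segments, cache eviction, and keeping the two caches in sync.

```python
import hashlib

SEGMENT_SIZE = 1024  # assumed fixed segment size, in bytes


class Accelerator:
    """Hypothetical WAN accelerator that remembers every segment it has seen."""

    def __init__(self):
        self.cache = {}  # fingerprint -> segment bytes

    def encode(self, payload: bytes):
        """Sender side: replace segments already sent with their fingerprints."""
        wire = []
        for offset in range(0, len(payload), SEGMENT_SIZE):
            segment = payload[offset:offset + SEGMENT_SIZE]
            digest = hashlib.sha256(segment).digest()
            if digest in self.cache:
                wire.append(("ref", digest))
            else:
                self.cache[digest] = segment
                wire.append(("data", segment))
        return wire

    def decode(self, wire):
        """Receiver side: rebuild the payload, learning new segments along the way."""
        out = []
        for kind, value in wire:
            if kind == "data":
                self.cache[hashlib.sha256(value).digest()] = value
                out.append(value)
            else:
                out.append(self.cache[value])
        return b"".join(out)


def wire_bytes(wire):
    """Bytes that actually cross the WAN (segment data or 32-byte references)."""
    return sum(len(value) for _, value in wire)


if __name__ == "__main__":
    # A synthetic 100 KB attachment made of 100 distinct segments.
    attachment = b"".join(hashlib.sha256(str(i).encode()).digest() * 32 for i in range(100))
    site_a, site_b = Accelerator(), Accelerator()
    for attempt in (1, 2, 3):
        wire = site_a.encode(attachment)
        assert site_b.decode(wire) == attachment
        print(f"transfer {attempt}: {wire_bytes(wire)} bytes on the WAN")
```

The first transfer carries the full attachment. Repeat transfers carry only one 32-byte reference per segment.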




Questions – If Time Permits




