EMC Deduplication Fundamentals


Deduplication reduces the disk storage needed to retain and protect data by factors of 10-30x or more, making disk a cost-effective alternative to tape. Data on disk is available online and onsite for longer retention periods, and restores become fast and reliable. Storing only unique data on disk also means that data can be cost-effectively replicated over existing networks to remote sites for disaster recovery and consolidated tape operations.

Published in: Technology

  • Another important differentiator for Data Domain systems is the Data Invulnerability Architecture. It lays out the industry's best defense against data integrity issues by providing levels of data protection, data verification, and self-healing that are unavailable in conventional disk or tape systems. There are three key areas of data integrity protection described on this slide.

    First is end-to-end data verification at backup time. As illustrated by the graphic at the right, end-to-end verification means reading data after it is written and comparing it to what was sent to disk, proving that it is reachable through the file system to disk and that the data is not corrupted. Specifically, when the Data Domain Operating System receives a write request from backup software, it computes a checksum over the data. After analyzing the data for redundancy, it stores the new data segments and all of the checksums. After all the data has been written to disk, the Data Domain Operating System verifies that it can read the entire file from the disk platter and through the Data Domain file system, and that the checksums of the data read back match the checksums of the written data. This confirms the data is correct and recoverable from every level of the system. If there are problems anywhere along the way (for example, if a bit has flipped on a disk drive), they will be caught. Since most restores happen within a day or two of backups, systems that verify or correct data integrity slowly over time will be too late for most recoveries.

    Second is a self-healing file system. Data Domain systems actively re-verify the integrity of all data every week in an ongoing background process. This scrub process finds and repairs defects on the disk before they can become a problem. In addition, real-time error detection ensures that all data returned to the user during a restore is correct. On every read from disk, the system first verifies that the block read from disk is the block expected. It then uses the checksum to verify the integrity of the data. If any issue is found, the Data Domain Operating System will self-heal and correct the data error.

    Third, in addition to data verification and self-healing, there is a collection of other capabilities. Data Domain with RAID 6 provides double disk failure protection; NVRAM enables fast, safe restart; and snapshots provide point-in-time file system recoverability.

    Backups are the data store of last resort. The Data Invulnerability Architecture provides extra levels of data integrity protection to detect faults and repair them, so that backup data and recovery are not at risk.
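The write-then-verify sequence described in these notes can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the Data Domain Operating System's implementation: the class name `VerifyingStore`, the in-memory dictionary standing in for disk, and the choice of SHA-256 as the checksum are all hypothetical.

```python
import hashlib


class VerifyingStore:
    """Toy in-memory store (hypothetical) that re-checks data after writing."""

    def __init__(self):
        self._segments = {}  # checksum -> stored bytes (stands in for disk)

    def write(self, data: bytes) -> str:
        # Compute the checksum when the write request is received.
        checksum = hashlib.sha256(data).hexdigest()
        # Store the segment only if it is new (redundant data is skipped).
        self._segments.setdefault(checksum, data)
        # Read the data back and confirm it matches what was sent.
        self.read(checksum)
        return checksum

    def read(self, checksum: str) -> bytes:
        data = self._segments[checksum]
        # Verify integrity on every read, not only at write time.
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError(f"checksum mismatch for segment {checksum[:12]}")
        return data


store = VerifyingStore()
key = store.write(b"backup segment")
restored = store.read(key)
```

In this sketch a flipped bit in the stored bytes would surface as an `IOError` on the next read. Repairing the damaged segment, for example from RAID parity, is outside what the sketch models.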
  • In addition to DD Boost, EMC offers four additional Data Domain software options that can enhance the value of a Data Domain system in your environment.

    The first is DD Virtual Tape Library software, which eliminates tape-related failures by enabling all Data Domain systems to emulate multiple tape devices over a Fibre Channel interface. This software option provides easy integration of deduplication storage in open systems and IBM i environments.

    Next is DD Replicator software, which provides fast, network-efficient, encrypted replication for disaster recovery, remote office data protection, multi-site tape consolidation, and long-term offsite retention. DD Replicator asynchronously transfers only the compressed, deduplicated data over the WAN, making network-based replication cost-effective, fast, and reliable. In addition, you can replicate up to 270 remote sites into a single Data Domain system for consolidated protection of your distributed enterprise.

    Next, DD Retention Lock software enables you to implement deduplication with file locking to satisfy IT governance and compliance policies for archive protection. DD Retention Lock also enables electronic data shredding on a per-file basis to ensure that deleted files have been disposed of in an appropriate and permanent manner, in order to maintain confidentiality of classified material, limit liability, and enforce privacy requirements.

    Finally, DD Encryption software protects backup and archive data stored on Data Domain systems with encryption that is performed inline, before the data is written to disk. Encrypting data at rest satisfies internal governance rules and compliance regulations and protects against theft or loss of a physical system. The combination of inline encryption and deduplication provides the most secure data-at-rest encryption solution available.
  • Like other Data Domain systems, Data Domain Archiver includes a controller and storage shelves, referred to as the “active tier” in this system. The active tier can be expanded to up to four storage shelves (96 TB of usable capacity), and it is used for short-term (generally less than 90 days) retention of backup and archive data. DD Archiver also incorporates an “archive tier” with up to 23 additional storage shelves (474 TB of usable capacity). Built on a standard Data Domain controller, DD Archiver leverages existing Data Domain technology to enable high throughput of up to 9.8 TB/hr. DD Archiver is cost-optimized for long-term retention of backup and archive data, up to a total of 570 TB usable or 28.5 PB logical capacity (assuming a 50:1 deduplication ratio). The system combines a low cost per gigabyte with high throughput. Finally, new fault isolation capabilities ensure long-term recoverability of archive units.

    All of this leverages existing Data Domain system advantages, including support for network-efficient replication with DD Replicator as well as DD Retention Lock for enforcing file retention. In addition, Data Domain’s Data Invulnerability Architecture ensures data integrity for the life of the system.

    The combination of high-throughput, cost-optimized storage built on proven Data Domain system technology makes DD Archiver the perfect tape replacement solution.
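The capacity figures quoted for DD Archiver can be checked with simple arithmetic. The sketch below uses only numbers from the notes; the variable names are illustrative, and it assumes decimal units (1 PB = 1000 TB).

```python
ACTIVE_TIER_TB = 96    # up to four shelves in the active tier
ARCHIVE_TIER_TB = 474  # up to 23 additional shelves in the archive tier
DEDUP_RATIO = 50       # the 50:1 ratio assumed in the notes

usable_tb = ACTIVE_TIER_TB + ARCHIVE_TIER_TB
logical_pb = usable_tb * DEDUP_RATIO / 1000  # assumes 1 PB = 1000 TB

print(f"{usable_tb} TB usable, {logical_pb} PB logical")
```

The two tiers add up to the quoted 570 TB usable, and 570 TB at 50:1 gives the quoted 28.5 PB logical. A lower deduplication ratio would reduce the logical figure proportionally.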
  • Here’s a look at the latest Data Domain product family, including the recently introduced DD800 series, Data Domain Global Deduplication Array, and Data Domain Archiver (the system for long-term retention of backup and archive data).
  • Optional slide. EMC Global Services are a large component of your total EMC experience. EMC Global Services allows you to:

    Save money by:
      • Significantly lowering your implementation and operating expenditure costs
      • Filling internal resource gaps for less
      • Protecting your investments in EMC solutions

    Accelerate time to value by:
      • Reducing deployment time
      • Accelerating return on investment for new projects
      • Easing the burden of compliance while protecting critical business information

    Mitigate risk and get better results by:
      • Configuring the solution to meet your requirements
      • Improving your service levels and reducing your management costs
      • Using EMC best practices and unmatched product expertise for a superior customer experience
      • Reducing disruption while taking advantage of the features and benefits of the latest EMC products and solutions

    1. Deduplication Fundamentals
    2. Data Domain Basics
       Easy integration with existing environment
       Backup and archive applications: EMC, Symantec, CommVault, IBM, BakBone Software, Vizioncore
       Connectivity: CIFS, NFS, NDMP, DD Boost over Ethernet; Virtual Tape Library (VTL) over Fibre Channel
       [Diagram labels: Control Tier, Target Tier, Disaster Recovery Tier; Replication; DD890 appliance (shown twice)]
       DD890 appliance:
       • 2U
       • 2 to 10 ports
       • 10 and 1 Gigabit Ethernet; 8 Gb/s Fibre Channel
       • RAID 6
       • Up to 285 TB usable capacity with shelves
       • 2 TB or 1 TB 7.2K rpm SATA HDD in shelf
       • File system
       • NVRAM
       • N+1 fans and redundant, hot-plug power supplies
    3. Data Deduplication: Technology Overview
       Store more backups in a smaller footprint
       [Diagram: data segments lettered A through L, shown for the Friday full backup, the Monday through Thursday incrementals, and the second Friday full backup]

       Backup                  Logical   Estimated reduction   Physical
       Friday full             1 TB      2–4x                  250 GB
       Monday incremental      100 GB    7–10x                 10 GB
       Tuesday incremental     100 GB    7–10x                 10 GB
       Wednesday incremental   100 GB    7–10x                 10 GB
       Thursday incremental    100 GB    7–10x                 10 GB
       Second Friday full      1 TB      50–60x                18 GB
       Total                   2.4 TB    7.8x                  308 GB
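The totals in the Technology Overview table follow directly from its rows. As a quick check, the sketch below sums the logical and physical sizes listed on the slide and derives the overall reduction. It assumes decimal units (1 TB = 1000 GB), and the variable names are illustrative.

```python
# (backup, logical GB, physical GB) as listed in the slide table
week = [
    ("Friday full", 1000, 250),
    ("Monday incremental", 100, 10),
    ("Tuesday incremental", 100, 10),
    ("Wednesday incremental", 100, 10),
    ("Thursday incremental", 100, 10),
    ("Second Friday full", 1000, 18),
]

logical_gb = sum(logical for _, logical, _ in week)
physical_gb = sum(physical for _, _, physical in week)
reduction = logical_gb / physical_gb

print(f"{logical_gb / 1000:.1f} TB logical, {physical_gb} GB physical, {reduction:.1f}x")
# prints "2.4 TB logical, 308 GB physical, 7.8x"
```

Most of the overall reduction comes from the second full backup: nearly all of its segments were already stored, so its 1 TB of logical data adds only 18 GB on disk.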
    4. Retain: Store More for Longer with Less
       Over one year of retention in 3U of Data Domain deduplication storage

       Backup data          Cumulative logical   Estimated reduction   Physical
       First full           1 TB                 4x                    250 GB
       Week 1 (April 7)     2.4 TB               8x                    308 GB
       Week 2 (April 14)    3.8 TB               10x                   366 GB
       Week 3 (April 21)    5.2 TB               12x                   424 GB
       Month 1 (April 28)   6.6 TB               14x                   482 GB
       Month 2 (May 31)     12.2 TB              17x                   714 GB
       Month 3 (June 30)    17.8 TB              19x                   946 GB
       Month 4 (July 31)    23.4 TB              20x                   1,178 GB
       Total                23.4 TB              20x                   1,178 GB
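The retention table is consistent with a simple linear model: after the first full backup, each additional week adds 1.4 TB of logical data but only 58 GB of physical data. These per-week increments are inferred from the differences between table rows and are not stated on the slide. The function name `retained` is hypothetical, and decimal units (1 TB = 1000 GB) are assumed.

```python
def retained(weeks):
    """Return (logical GB, physical GB) after the first full plus `weeks` more weeks.

    The per-week increments are inferred from the slide's table rows.
    """
    logical = 1000 + 1400 * weeks  # 1 TB first full, then 1.4 TB per week
    physical = 250 + 58 * weeks    # 250 GB first full, then 58 GB per week
    return logical, physical


# Weeks 0-4 and months 2-4 (8, 12, 16 weeks), matching the table rows.
for weeks in (0, 1, 2, 3, 4, 8, 12, 16):
    logical, physical = retained(weeks)
    ratio = logical / physical
    print(f"week {weeks:2d}: {logical / 1000:5.1f} TB logical, "
          f"{physical:5d} GB physical, {ratio:4.1f}x")
```

Under this model the cumulative ratio keeps rising because the one-time cost of the first full backup is spread over more retained backups. It approaches but never exceeds 1400 / 58, or roughly 24x.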
    5. Data Integrity: Data Invulnerability Architecture
       End-to-end data verification: checksum; deduplication, write to disk; verify
       Self-healing file system: cleaning; expired data; defrag; verify
       Other: RAID 6; NVRAM; snapshots
       [Diagram labels: Generate checksum; Verify data; Re-checksum and compare; File system; Deduplication; Local compression; RAID; Verify the file system metadata integrity; Verify user data integrity; Verify stripe integrity]
    6. Network-Efficient Replication for True Disaster Recovery
       Lowers WAN costs; improves service level agreements
       Flexible replication:
       • One-to-many
       • Many-to-one
       • Bi-directional
       • System-to-system
       • Cascaded
       Source: remote sites
       Destination: data center hub; supports hundreds of remote sites
       95–99% cross-site bandwidth reduction
       [Diagram: Data Domain systems holding database, archive, and backup data connect over the WAN to a Data Domain Global Deduplication Array; each link is labeled 1–5%]
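The bandwidth reduction on this slide comes from sending only data the destination does not already hold. The sketch below illustrates that idea in a many-to-one setup. It is a simplified model, not DD Replicator's protocol: the function names, the dictionary standing in for the hub, and the use of SHA-256 fingerprints are assumptions, and compression and the cost of exchanging fingerprints are ignored.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Identify a segment by its content (SHA-256 is an assumption)."""
    return hashlib.sha256(segment).hexdigest()


def replicate(source_segments, destination):
    """Send only segments the destination lacks; return bytes sent.

    `destination` is a dict mapping fingerprint -> segment bytes.
    """
    sent = 0
    for segment in source_segments:
        fp = fingerprint(segment)
        if fp not in destination:      # destination already has it: skip
            destination[fp] = segment  # new data crosses the WAN
            sent += len(segment)
    return sent


hub = {}
site_a = [b"A" * 1000, b"B" * 1000]
site_b = [b"A" * 1000, b"C" * 1000]  # shares one segment with site A

sent_a = replicate(site_a, hub)  # both segments are new to the hub
sent_b = replicate(site_b, hub)  # only the "C" segment is new
```

Because the hub keeps one fingerprint index for all sources, a segment already received from one remote site is not sent again by another. This is the cross-site effect the slide refers to.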
    7. DD Boost Software
       • Distributes parts of deduplication process to backup server or application clients
       • Licensable software works across Data Domain portfolio
       • Supports majority of backup software market
         • EMC Avamar and NetWorker
         • Symantec NetBackup and Backup Exec
       • Speeds backups by up to 50 percent
       • Process more backups with existing resources
         • 20–40 percent less overall impact to backup server
         • 80–99 percent less LAN bandwidth
       • Enables Data Domain replication management from the backup application
    8. Additional Data Domain Software Options
       Data Domain Replicator
       • Network-efficient and encrypted
       • Transfers only compressed, deduplicated data over the WAN
       • Consolidate up to 270 remote sites into a single system
       Data Domain Virtual Tape Library
       • Easily integrates with Fibre Channel
       • Emulates multiple tape libraries
       • Supports open systems and IBM i operating environments
       Data Domain Encryption
       • Inline encryption of data at rest
       • Satisfies internal governance rules and compliance regulations
       • Protects against theft or loss of a physical system
       Data Domain Retention Lock
       • File locking to satisfy IT governance and compliance policies
       • Electronic data shredding
    9. DD Archiver Overview
       Cost-optimized long-term retention
       Data Domain system for backup and archive
       • Active tier: short-term data protection; less than 90 days
       • Archive tier: scalable long-term retention; multiple years
       High-throughput deduplication storage
       • Up to 9.8 TB/hr
       Cost optimized for long-term retention
       • Up to 570 TB usable, 28.5 PB logical capacity
       • Low cost per gigabyte while maintaining high throughput
       • Fault isolation of archive units for long-term recoverability
       Leverage existing Data Domain system advantages
       • Supports DD Replicator and DD Retention Lock software options
       • Data Domain Data Invulnerability Architecture to ensure data integrity
    10. Industry’s Most Scalable Inline Deduplication Systems
        • DD Archiver
        • Global Deduplication Array
        • DD800 Appliance Series
        • DD600 Appliance Series
        • DD140 Remote Office Appliance
        Software options: DD Boost, DD Virtual Tape Library, DD Replicator, DD Retention Lock, and DD Encryption
    11. Deduplication Storage Evaluation Criteria
    12. Methodology: Inline versus Post-Process Deduplication
        Inline: deduplication before storing
        • Other activities unimpeded
        • Predictable
        • Simpler
        Post-process: deduplication after storing
        • 3x disk accesses to shared store
        • The more processes, the more resource contention
          • Copy to tape: too slow to stream tape
          • Recovery: service level agreement predictability
          • Replication: poor time-to-disaster-recovery
          • Deduplication: if interleaved with backup or restore
        • More administration to fight these issues
    13. Performance: CPU-Centric versus Spindle-Bound
        [Chart: throughput (MB/s, up to 6,000) versus number of disk spindles (50 to 200); labels: Data Domain; most deduplication vendors; Fibre Channel; SATA]
    14. Data Domain Systems Trajectory
        Data Domain SISL Scaling Architecture: CPU-centric
        Improvement since 2004:
        • Throughput: ~175x
        • Capacity: ~450x
        [Chart: throughput (GB/s) by year, from 2004 through 2010 and 2011 to 2014 (est.) and future; axis values 0.04, 1.5, 3, 5; labels: DD200 (2004); single-controller, standard protocols; DD Boost; dual-controller Global Deduplication Array]
    15. Why Data Domain?
        Less disk to resource, less to manage
        • CPU-centric deduplication
        • Inline deduplication
        Simple, mature, and flexible
        • Simple, mature appliance
        • Any fabric, any software, backup or archive applications
        Resilience and disaster recovery
        • Storage of last resort
        • Fast time-to-DR readiness
        • Cross-site global compression
        • Data center or remote office
    16. Why EMC Global Services?
        Save money
        • Significantly lower implementation and operating expenditures
        • Fill internal resource gaps for less
        • Protect investments in EMC solutions
        Accelerate time to value
        • Reduce deployment time
        • Accelerate return on investment for new projects
        • Ease the burden of compliance while protecting critical business information
        Mitigate risk and get better results
        • Configure the solution to meet your requirements
        • Improve service levels; reduce management costs
        • EMC best practices and unmatched product expertise for a superior customer experience
        • Reduce disruption while taking advantage of the features and benefits of the latest EMC products and solutions