
Data Footprint Reduction: Understanding IBM Storage Options



sSE20 presented at IBM Edge 2012 conference



  1. 1. sSE20 Data Footprint Reduction: Understanding IBM Storage Efficiency Options. Tony Pearson, Master Inventor and Senior Managing Consultant, IBM Corp; Sanjay S Bhikot, Advisory Unix and Storage Administrator, Ricoh Americas Corp. #IBMEDGE © 2012 IBM Corporation
  2. 2. Data Footprint Reduction is the catch-all term for a variety of technologies designed to help reduce storage costs. This session will cover thin provisioning, space-efficient copies, deduplication and compression technologies, and describe the IBM storage products that provide these capabilities.
  3. 3. Sessions: Tony Pearson
     • Monday: 1:00pm Storing Archive Data for Compliance Challenges; 4:15pm IBM Watson: What it Means for Society
     • Tuesday: 4:15pm Using Social Media: Birds of a Feather (BOF)
     • Wednesday: 9:00am Data Footprint Reduction: IBM Storage Options; 2:30pm IBM's Storage Strategy in the Smarter Computing Era; 4:15pm IBM SONAS and the Cloud Storage Taxonomy
     • Thursday: 9:00am IBM Watson: What it Means for Society; 10:30am Tivoli Storage Productivity Center Overview; 5:30pm IBM Edge "Free for All" hosted by Scott Drummond
  4. 4. Agenda • Thin Provisioning • Space-Efficient Copy • Data Deduplication • Compression
  5. 5. History of Thin Provisioning
     • 1994: The StorageTek Iceberg 9200 array introduced Thin Provisioning on slower 7200 RPM drives for mainframe systems
     • 1997: IBM resold this as the RAMAC Virtual Array (RVA) for mainframe servers
     • Today: Thin Provisioning is available for many operating systems on IBM storage, including DS8000, XIV, SVC, N series, Storwize V7000, DS3500 and DCS3700
  6. 6. Why Space is Over-Allocated
     • Scenario 1: space requirements under-estimated
       – Running out of space requires a larger volume, and a new request may take weeks to accommodate
       – Application outage if not addressed in time
       – Data must be moved to the larger volume, with an application outage during data movement
     • Scenario 2: space requirements over-estimated
       – Capacity lasts for years: no data migration, no application outages, no penalties
     • When faced with this dilemma, most will err on the side of over-estimating
  7. 7. Fully Allocated vs. Thin Provisioned
     • Fully allocated: the host sees the fully allocated amount; allocated but unused space is dedicated to this host and wasted until written to
     • Thin provisioned: the host sees the full virtual amount, but physical space is allocated only for the actual data written; empty space remains available to others
  8. 8. Fully Allocated vs. Thin Provisioned: the host sees a volume or LUN that consists of blocks numbered 0 to n
     • Volume/LUN: one or more extents
     • Extent: the allocation unit, made up of one or more grains
     • Grain: a range of 1 or more blocks
     • Block: typically 512 or 4096 bytes
  9. 9. Coarse and Fine-Grain Example: blocks 00, 55 and 99 written on a 100-block volume of 10 extents (grains 00-01, 54-55, 98-99; 10 grains = 1 extent)
     • Fully allocated: all 10 extents allocated
     • Coarse-grain: only the 3 extents containing the written blocks are allocated
     • Fine-grain: only 1 extent allocated, since the three written grains are packed together into a single extent
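The slide's arithmetic can be sketched in a few lines. This is an illustrative model only, not any IBM product's allocator; the extent and grain sizes are the slide's toy values (1 extent = 10 blocks, 1 grain = 2 blocks).

```python
import math

EXTENT_BLOCKS = 10   # toy value from the slide: 1 extent = 10 blocks
GRAIN_BLOCKS = 2     # toy value from the slide: 1 grain = 2 blocks

def extents_needed(written_blocks, volume_blocks=100):
    """Extents consumed by each allocation scheme when only the
    blocks in `written_blocks` hold data."""
    total_extents = volume_blocks // EXTENT_BLOCKS
    fully = total_extents                                   # allocated up front
    coarse = len({b // EXTENT_BLOCKS for b in written_blocks})
    grains = len({b // GRAIN_BLOCKS for b in written_blocks})
    fine = math.ceil(grains / (EXTENT_BLOCKS // GRAIN_BLOCKS))
    return fully, coarse, fine

print(extents_needed([0, 55, 99]))   # → (10, 3, 1), matching the slide
```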
  10. 10. How IBM has implemented Thin Provisioning
     • IBM DS8000: coarse-grain; 1 GB allocation unit
     • IBM XIV: fine-grain; 17 GB allocation unit; 1 MB grain size
     • SVC and Storwize V7000: fine-grain; 16 MB to 8 GB allocation unit; 32-256 KB grain size
     • DS3500, DCS3700: fine-grain; 4 GB allocation unit; 64 KB grain size
  11. 11. Thick-to-Thin Migration: a volume mirror pairs the fully-allocated volume (Copy 0) with a thin-provisioned volume (Copy 1); only non-zero blocks are copied
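The "only non-zero blocks copied" rule can be sketched as follows. This is a conceptual model, not SVC's actual mirroring code; the sparse thin copy is represented as a hypothetical dict of LBA to block data.

```python
ZERO = b"\x00" * 512  # one 512-byte block of zeros

def thick_to_thin(source_blocks):
    """Mirror a fully-allocated volume (Copy 0) into a thin-provisioned
    one (Copy 1): all-zero blocks are simply never allocated."""
    thin = {}
    for lba, data in enumerate(source_blocks):
        if data != ZERO:          # skip empty blocks entirely
            thin[lba] = data
    return thin

vol = [ZERO, b"A" * 512, ZERO, b"B" * 512]
thin = thick_to_thin(vol)
print(sorted(thin))               # → [1, 3]: only the two non-zero blocks
```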
  12. 12. Empty Space Reclaim (Thin Provisioning with allocations in 17 GB units and 1 MB chunks, or grains)
     • Only non-zero blocks consume physical space
     • Avoid writing empty blocks: any I/O request that tries to write a block of all zeros to unallocated space is ignored
     • Background task to find empty chunks: a background task scans all blocks, looking for chunks containing all zeros
     • Empty space reclaimed: empty chunks are returned to unallocated space, so that it can be used for other volumes
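The background scan step can be sketched as below; again an illustration under simplified assumptions (a grain is a short list of blocks, the allocation map is a plain dict), not XIV's implementation.

```python
ZERO = b"\x00" * 512

def reclaim_empty_grains(allocated):
    """Background scan: release any grain whose blocks are all zeros
    back to the free pool, keeping only grains with real data."""
    return {g: blocks for g, blocks in allocated.items()
            if any(b != ZERO for b in blocks)}

grains = {0: [b"A" * 512, ZERO],   # has data, stays allocated
          1: [ZERO, ZERO],         # all zeros, reclaimed
          2: [ZERO, b"B" * 512]}
print(sorted(reclaim_empty_grains(grains)))   # → [0, 2]
```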
  13. 13. Thin Provisioning
     • Pros
       – Just-in-time allocation increases the utilization percentage
       – Eliminates the pressure to make accurate space estimates
       – Dynamically expand a volume without impacting applications or rebooting the server
       – Reduces the data footprint and lowers costs
       – Shifts focus from volumes to storage pool capacity
     • Cons
       – Not all file systems are cooperative or friendly
       – Deletion of files does not free space for others ("sdelete" writes zeros over deleted file space)
       – Some implementations may impact I/O performance
       – May not support the same set of features, copy services, or replication
       – "Writing checks you can't cash"
  14. 14. Agenda • Thin Provisioning • Space-Efficient Copy • Data Deduplication • Compression
  15. 15. History of Space-Efficient Copies
     • 1993: NetApp introduces Snapshot in its WAFL file system
     • 1997: IBM Enterprise Storage Server (ESS) introduces the NOCOPY parameter on FlashCopy
     • Today: Space-Efficient Copy is available on many IBM storage systems, including DS8000, XIV, SVC, N series, Storwize V7000, DS3500, DS5000 and DCS3700
  16. 16. Space-Efficient Copies: for a 300 GB source (100 GB allocated, 40 GB written), traditional copies fully allocate each of Destination 1, 2 and 3, while space-efficient copies reserve only about 10% (30 GB) each
  17. 17. Method 1: Copy on Write (COW)
     • The copy is a set of pointers to the original data (source and destination both show blocks A B C D)
     • On a write to the original volume: pause I/O, copy the original block of data to the destination, then update the original block (source becomes A B C2 D; destination keeps the original C)
     • Slows performance, and may limit the number of destination copies
     • Can be combined with a background copy for a full copy
  18. 18. Method 2: Redirect on Write (ROW)
     • The copy is a set of pointers to the original data (source and destination both show blocks A B C D)
     • A write to the original volume is redirected to new empty space (C2); the previous data is left alone
     • Does not impact performance
     • Supports many destination copies
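The two methods above can be contrasted in a toy sketch. This is a conceptual illustration of the write paths only, not any product's snapshot engine; both classes and their names are hypothetical.

```python
class CowSnapshot:
    """Copy-on-write: before the source block is overwritten,
    the old data is copied out to the snapshot."""
    def __init__(self, source):
        self.source = source
        self.saved = {}                    # blocks preserved at write time

    def write(self, lba, data):
        if lba not in self.saved:          # extra copy on the write path
            self.saved[lba] = self.source[lba]
        self.source[lba] = data            # then update in place

    def read_snapshot(self, lba):
        return self.saved.get(lba, self.source[lba])

class RowSnapshot:
    """Redirect-on-write: new writes land in fresh space; the
    original blocks are left alone, so no copy on the write path."""
    def __init__(self, source):
        self.base = list(source)           # frozen view for the snapshot
        self.redirects = {}                # lba -> redirected new data

    def write(self, lba, data):
        self.redirects[lba] = data         # one write, no read-copy-update

    def read_current(self, lba):
        return self.redirects.get(lba, self.base[lba])

vol = ["A", "B", "C", "D"]
cow = CowSnapshot(vol)
cow.write(2, "C2")
print(vol[2], cow.read_snapshot(2))        # → C2 C

row = RowSnapshot(["A", "B", "C", "D"])
row.write(2, "C2")
print(row.read_current(2), row.base[2])    # → C2 C
```

Note how the COW write does two operations (save, then update) while the ROW write does one, which is why the slide says ROW does not impact performance.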
  19. 19. Space-Efficient Copies
     • Pros
       – Supports both fully-allocated and thin-provisioned sources
       – Reduces the data footprint and lowers costs
       – Allows you to keep more copies online and take copies more frequently
       – Can be used as checkpoint copies during batch processing
     • Cons
       – Some implementations may impact I/O performance
       – Requires that you estimate the maximum percentage changed (typically 10-20%)
       – Exceeding the reserved space invalidates the copy
  20. 20. Agenda • Thin Provisioning • Space-Efficient Copy • Data Deduplication • Compression
  21. 21. History of Data Deduplication
     • 2007: IBM acquires Diligent and introduces the ProtecTIER TS7600 virtual tape library with data deduplication
     • 2008: Advanced Single Instance Store (A-SIS) brings deduplication to the IBM N series and NetApp disk storage
     • Today: IBM offers a variety of choices, including ProtecTIER, N series, and Tivoli Storage Manager (TSM v6)
  22. 22. Data Deduplication • Data deduplication reduces capacity requirements by storing only one unique instance of the data on disk and creating pointers for duplicate data elements
  23. 23. Deduplication reduces the disk required for backup copies
  24. 24. Two Primary Data Deduplication Approaches
     • Hash-based deduplication: sometimes referred to as a Content Addressable Storage approach
     • HyperFactor: a different approach, based on an agnostic view of data
  25. 25. Hash-Based Approach
     1. Slice data into chunks (fixed or variable): A B C D E
     2. Generate a hash per chunk and save it: Ah Bh Ch Dh Eh
     3. Slice the next data into chunks and look for a hash match
     4. Reference the data previously stored
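The four steps above can be sketched as a minimal hash-based deduplicating store. This is an illustrative toy (fixed 8-byte chunks, SHA-1 for the 20-byte hash mentioned on a later slide), not a real product's engine.

```python
import hashlib

def dedup_store(data, store, chunk_size=8):
    """Fixed-size chunking + hash lookup. `store` maps hash -> chunk;
    returns the list of hashes (pointers) that represents `data`."""
    refs = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha1(chunk).digest()   # 20-byte hash per chunk
        if h not in store:                 # only unique chunks are stored
            store[h] = chunk
        refs.append(h)                     # duplicates become pointers
    return refs

store = {}
dedup_store(b"AAAAAAAABBBBBBBBAAAAAAAA", store)   # chunks A, B, A
print(len(store))                                  # → 2 unique chunks stored
```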
  26. 26. HyperFactor Approach
     1. Look through the new data stream for similarity
     2. Read the elements that are most similar (Element A, Element B, Element C)
     3. Diff the reference with the version; this will use several elements
     4. Matches are factored out; unique data is added to the repository
  27. 27. Assessment of Hash-Based Approaches
     • Example: imagine a chunk size of 8 KB
       – A 1 TB repository has ~125,000,000 8 KB chunks
       – Each hash is 20 bytes long, and a pointer scheme is needed to reference 1 TB
       – The hash table requires 2.5 GB of RAM: no issue
       – With a 100 TB repository, ~250 GB of RAM is required
     • Applicable for all chunking methods
     • Hash table in memory
       – Overhead for in-band deduplication
       – The hash table grows with data volume, and a growing hash table may become a performance bottleneck (scalability issues)
     • Hash collisions must be handled
     • The hash table must be protected; one copy might not be sufficient
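The slide's memory arithmetic checks out with its round decimal units (8 KB = 8000 bytes):

```python
# Slide's figures: 8 KB chunks, 20-byte hashes, decimal units.
chunk = 8 * 1000                 # 8 KB
repo = 1 * 1000**4               # 1 TB repository
chunks = repo // chunk
table_gb = chunks * 20 / 1000**3
print(chunks, table_gb)          # → 125000000 2.5 (so ~250 GB at 100 TB)
```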
  28. 28. When Deduplication Occurs
     1. In-line processing: as data is received by the target device it is deduplicated in real time, and only unique data is stored on disk; data written to the disk storage is already deduplicated
     2. Post-processing: as data is received by the target device it is temporarily stored on disk; the data is subsequently read back in to be processed by a deduplication engine
  29. 29. Comparison of Offerings
     • In-line, hash-based: other vendors
     • In-line, HyperFactor: IBM ProtecTIER (TS7680G, TS7650G, TS7650, TS7620 Express, TS7610 Express)
     • Post-process, hash-based: IBM Tivoli Storage Manager (TSM), N series
  30. 30. IBM ProtecTIER with HyperFactor
     • Gateways: attach up to 1 PB of disk; two models: TS7680 for IBM System z, and TS7650G for distributed systems
     • Appliances: disk included inside; three models for distributed systems: TS7650 (in three sizes), TS7620 (new!), and TS7610 (in two sizes)
  31. 31. ProtecTIER vs. Tivoli Storage Manager
     • Both solutions offer the benefits of target-side deduplication: greatly reduced storage capacity requirements; lower operational costs, energy usage and TCO; faster recoveries with more data on disk
     • Complementary solutions today! They can be used together, but don't deduplicate the same data twice
     • Use ProtecTIER (IBM TS7600) when:
       – The highest performance and capacity scaling are required
       – Up to 1400 MB/sec (2.5 GB/s with 2 nodes) deduplication rates are needed
       – Deduplicated capacities up to 25 PB are required
       – You wish to avoid the operational impact of post-processing deduplication
       – A VTL appliance model is desired
       – Deduplicating across multiple TSM (or other backup) servers
     • Use TSM 6 built-in deduplication when:
       – You desire deduplication operations completely integrated within TSM
       – The benefits of deduplication are desired without separate hardware or software dependencies or licenses (ships with TSM Extended Edition)
       – You desire end-to-end data lifecycle management with a minimized TSM data store
  32. 32. Data Deduplication
     • Pros
       – Designed for backups
       – Can offer up to 25x data footprint reduction, allowing disk backup repositories to approach the cost of tape-based solutions
       – Allows more backup copies to remain on disk for faster restores
       – Available with a variety of interfaces, including VTL, OST and NAS
     • Cons
       – Dealing with hash collisions may require byte-for-byte comparisons or keeping a secondary copy of the data
       – Some systems do not scale
       – Some systems have slow restores (re-hydrating data back to normal)
       – Primary data may not dedupe very well; your mileage may vary!
  33. 33. Agenda • Thin Provisioning • Space-Efficient Copy • Data Deduplication • Compression
  34. 34. History of Compression
     • 1973: NASA and IBM developed the Houston Aerospace Spooling Protocol (HASP) with compression for long-distance data transmission
     • 1986: IBM introduced the Improved Data Recording Capability (IDRC) for the 3480 tape drive
     • Today: IBM offers real-time compression for file and block level access to disk storage
  35. 35. Lossy vs. Lossless Methods
     • Lossy: decompression does not return data back to its original contents ("good enough?"); used with music, photos, video, medical images, scanned documents, fax machines
     • Lossless: decompression returns data back to exactly its original contents; used with databases, emails, spreadsheets, office documents, source code
  36. 36. How Compression Works
     • Lempel-Ziv lossless compression builds a dictionary of repeated phrases: sequences of two or more characters that can be represented with fewer bits
     • In the slide's excerpt from "Lord of the Rings", all of the red text represents repeated sequences eligible for compression! (Source: The Lempel Ziv Algorithm, Christian Zeeh, 2003)
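The effect is easy to demonstrate with Python's standard `zlib` (DEFLATE is LZ77 plus Huffman coding). The repeated sentence below is an arbitrary stand-in for the slide's red text, not the actual excerpt.

```python
import zlib

# Repeated phrases compress well: LZ replaces each repeat
# with a short back-reference into the dictionary window.
text = b"the ring was made in the fires of the mountain " * 50
packed = zlib.compress(text)
print(len(text), len(packed))            # compressed size is far smaller
assert zlib.decompress(packed) == text   # lossless: exact round trip
```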
  37. 37. Compressed Volumes
     • Fully allocated: allocated but unused space is dedicated to this host, wasted until written to
     • Thin provisioned: the host sees the full virtual amount, but physical space is allocated only for the actual data written
     • Compressed: the physical space allocated can be up to 80% less than the actual data written
  38. 38. Real-time Compression! (workstations and application servers reach the disk array over the IP network)
     • Real-time compression for primary data
       – Less data stored on primary storage (up to 80%)
       – No changes to applications or procedures
     • Compression happens before the data gets to the storage array
       – Larger effective storage cache, so the disk array can serve more requests from its read/write cache
       – Lower storage CPU overhead
     • Does not cause performance degradation
       – Much smaller I/O and lower disk workload
       – Reads and writes are faster due to the storage array responding from cache instead of disk
       – Additionally, reads may come from the advanced read-ahead cache (no write cache)
  39. 39. FIVO vs. VIFO
     • Fixed Input, Variable Output (FIVO): WAN transmission; sequential tape; IBM Tivoli Storage Manager; zip, tar, etc.
     • Variable Input, Fixed Output (VIFO): Random Access Compression Engine™ (RACE); IBM Real-Time Compression Appliances; IBM SVC, Storwize V7000
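The distinction can be sketched conceptually: FIVO compresses fixed-size input blocks into outputs of varying size, while VIFO consumes a variable amount of input per fixed-size output block, recording a map for random access. This toy sketch uses `zlib` and is in no way RACE's actual algorithm; the halving search and block sizes are illustrative assumptions.

```python
import zlib

def fivo(data, in_block=4096):
    """Fixed Input, Variable Output: compress each fixed-size input
    block independently; the output sizes vary (tape/zip style)."""
    return [zlib.compress(data[i:i + in_block])
            for i in range(0, len(data), in_block)]

def vifo(data, out_block=4096):
    """Variable Input, Fixed Output: consume as much input as fits
    (compressed) into one fixed-size output slot, and record how much
    input each slot covers so the data stays randomly accessible."""
    out, i = [], 0
    while i < len(data):
        take = len(data) - i
        while take > 1 and len(zlib.compress(data[i:i + take])) > out_block:
            take //= 2              # shrink input until the output fits
        out.append((i, take, zlib.compress(data[i:i + take])))
        i += take                   # on disk each slot would be padded
    return out

data = bytes(range(256)) * 64       # 16 KB of compressible sample data
print(len(fivo(data)), len(vifo(data)))
```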
  40. 40. Compression for Disk Data
     • Traditional approaches (file compression after modification): editing the file (ABC DEF GHI → ABC DMN FGH I) shifts all following blocks, so the old and new compressed copies share only one common block; extra work to 'edit' a file; negative impact on deduplication; no notion of data location
     • Real-time compression: only the modified block changes, leaving multiple identical blocks; a small amount of work and I/O to edit; enhances deduplication; data location is tracked via a map
  41. 41. Compression Without Compromise: Expected Compression Ratios
     • Databases: up to 80%
     • Server virtualization: Linux virtual OSes up to 70%; Windows virtual OSes up to 55%
     • Collaboration: Office 2003 up to 75%; Office 2007 or later up to 25%
     • Engineering/Design: CAD/CAM up to 75%
  42. 42. Estimating Compression of a Volume
     • Objectives: run over a block device and estimate (a) the portion of non-zero blocks in the volume, and (b) the compression rate of the non-zero blocks with RTC
     • Performance: runs fast! Under 60 seconds no matter what the volume size; typical running time on a machine with multiple disks is under 20 seconds
     • Guarantees: ~5% max error on the estimate, and the guarantee can be improved with more running time
     • Method: random sampling and compression throughout the volume; collect enough non-zero samples to gain the desired confidence (more zero blocks means slower, since it takes more time to find non-zero blocks); mathematical analysis gives the confidence guarantees
     • Note: this estimates compression during migration of a volume into RTC (data at rest)
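The sampling method above can be sketched in miniature. This is a hypothetical estimator using `zlib` as a stand-in for RTC's compressor, without the tool's statistical error bounds; block size, sample count, and the synthetic volume are all illustrative assumptions.

```python
import random
import zlib

ZERO = b"\x00" * 4096

def estimate(volume, samples=200, seed=1):
    """Randomly sample blocks; report the non-zero fraction and the
    mean compression saving of the non-zero samples (data at rest)."""
    rng = random.Random(seed)
    nonzero, savings = 0, []
    for _ in range(samples):
        blk = volume[rng.randrange(len(volume))]
        if blk != ZERO:                       # zero blocks cost nothing
            nonzero += 1
            savings.append(1 - len(zlib.compress(blk)) / len(blk))
    frac = nonzero / samples
    return frac, (sum(savings) / len(savings) if savings else 0.0)

# Synthetic volume: half empty, half highly compressible blocks.
vol = [ZERO] * 500 + [bytes([i % 7]) * 4096 for i in range(500)]
frac, saving = estimate(vol)
print(round(frac, 2), round(saving, 2))
```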
  43. 43. IBM Real-Time Compression
     • For NAS devices: IBM Real-Time Compression Appliances STN 6500 and STN 6800
     • For block devices: SAN Volume Controller and Storwize V7000
  44. 44. Migrating to Compressed Disk: a volume mirror pairs the fully-allocated volume (Copy 0) with a compressed or thin-provisioned volume (Copy 1); only non-zero blocks are copied
  45. 45. Data Compression
     • Pros
       – Can be used for data transmission, tape, and disk data
       – Can offer up to 80% data footprint reduction
       – Available as a front-end appliance or integrated into the storage system
       – Can be "dedupe-friendly"
     • Cons
       – Some implementations are post-process: they store uncompressed data first and compress later
       – Some implementations impact performance and/or consume substantial CPU resources
       – Benefits vary by data type, and by whether applications do their own compression or encryption; your mileage may vary
  46. 46. Thank You! Session: sSE20. Presenters: Tony Pearson, Sanjay Bhikot. #IBMEDGE
     Intel, the Intel logo, Xeon and Xeon Inside are trademarks or registered trademarks of Intel Corporation in the U.S. and/or other countries.
  47. 47. Additional Resources: Email, Twitter, Blog, Books, IBM Expert Network
  48. 48. Trademarks and disclaimers
     © IBM Corporation 2012. All rights reserved.
     Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of The Minister for the Cabinet Office, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries. Other product and service names might be trademarks of IBM or other companies. Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web.
     Information is provided "AS IS" without warranty of any kind. The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here. Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your IBM representative or Business Partner for the most current pricing in your geography. Photographs shown may be engineering prototypes. Changes may be incorporated in production models. References in this document to IBM products or services do not imply that IBM intends to make them available in every country.