Techniques for Managing Huge Data LISA10

Slides from the USENIX LISA10 Tutorial on Techniques for Managing Huge Data

Transcript

  • 1. Techniques for Handling Huge Storage. Richard.Elling@RichardElling.com, USENIX LISA'10 Conference, November 8, 2010
  • 2. Agenda: How did we get here? When good data goes bad. Capacity, planning, and design. What comes next? Note: this tutorial uses live demos, slides not so much.
  • 3. History
  • 4. Milestones in Tape Evolution: 1951 - magnetic tape for data storage; 1964 - 9 track; 1972 - Quarter Inch Cartridge (QIC); 1977 - Commodore Datasette; 1984 - IBM 3480; 1989 - DDS/DAT; 1995 - IBM 3590; 2000 - T9940; 2000 - LTO; 2006 - T10000; 2008 - TS1130
  • 5. Milestones in Disk Evolution: 1954 - hard disk invented; 1950s - solid state disk invented; 1981 - Shugart Associates System Interface (SASI); 1984 - Personal Computer Advanced Technology (PC/AT) Attachment, later shortened to ATA; 1986 - "Small" Computer System Interface (SCSI); 1986 - Integrated Drive Electronics (IDE); 1994 - EIDE; 1994 - Fibre Channel (FC); 1995 - flash-based SSDs; 2001 - Serial ATA (SATA); 2005 - Serial Attached SCSI (SAS)
  • 6. Architectural Changes: simple, parallel interfaces; serial interfaces; aggregated serial interfaces
  • 7. When Good Data Goes Bad
  • 8. Failure Rates
    Mean Time Between Failures (MTBF): a statistical interarrival error rate, often cited in literature and data sheets. MTBF = total operating hours / total number of failures.
    Annualized Failure Rate (AFR): AFR = operating hours per year / MTBF, expressed as a percent.
    Example: MTBF = 1,200,000 hours; year = 24 x 365 = 8,760 hours; AFR = 8,760 / 1,200,000 = 0.0073 = 0.73%.
    AFR is easier to grok than MTBF. Note that "operating hours per year" is a flexible definition.
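A minimal sketch of the AFR calculation above in Python, using the slide's example of a 1,200,000-hour MTBF and 24x7 operation:

```python
def afr(mtbf_hours, operating_hours_per_year=24 * 365):
    """Annualized Failure Rate as a fraction: operating hours per year / MTBF."""
    return operating_hours_per_year / mtbf_hours

# Slide example: MTBF = 1,200,000 hours, running 24x7
print(f"AFR = {afr(1_200_000):.2%}")   # AFR = 0.73%
```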
  • 9. Multiple Systems and Statistics
    Consider 100 systems, each with an MTBF = 1,000 hours. At time = 1,000 hours, 100 failures have occurred, but not all systems will see exactly one failure.
    [Chart: number of systems versus number of failures (0-4), with the tails annotated "Unlucky", "Very Unlucky", and "Very, Very Unlucky"]
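The slide's point can be checked with a simple Poisson model (my assumption, not stated on the slide): with one expected failure per system at t = MTBF, roughly a third of the systems see no failures and a similar fraction see two or more.

```python
import math

systems = 100
mean_failures = 1.0   # at t = MTBF, one expected failure per system

# Poisson probability of k failures when the expected count is 1
for k in range(5):
    p = math.exp(-mean_failures) * mean_failures**k / math.factorial(k)
    print(f"{k} failures: ~{systems * p:.0f} of {systems} systems")
# roughly 37 systems see 0 failures, 37 see 1, 18 see 2, 6 see 3, 2 see 4
```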
  • 10. Failure Rates
    MTBF is a summary metric. Manufacturers estimate MTBF by stressing many units for short periods of qualification time, and summary metrics hide useful information.
    Example: a mortality study of children aged 5-14 during 1996-1998 measured 20.8 deaths per 100,000, which works out to an MTBF of 4,807 years; yet the current world average life expectancy is 67.2 years.
    For large populations, such as huge disk farms, the summary MTBF can appear constant. The better question to answer is: "is my failure rate increasing or decreasing?"
  • 11. Why Do We Care? Summary statistics, like MTBF or AFR, can be misleading or risky if we do not also distinguish between stable and trending processes. We need to analyze the ordered times between failures in relation to the system age to describe system reliability.
  • 12. Time Dependent Reliability
    Useful for repairable systems: the system can be repaired to satisfactory operation by any action, and failures occur sequentially in time.
    Measure the age of the components of a system, and distinguish age from interarrival times (time between failures). This doesn't have to be precise; a resolution of weeks works OK.
    Some devices report Power On Hours (POH): SMART for disks, OSes. Clerical solutions or inventory asset systems work fine.
  • 13. TDR Example 1 [Chart: mean cumulative failures versus system age in months (1-50) for Disk Set A, Disk Set B, Disk Set C, and a Target MTBF line]
  • 14. TDR Example 2: Did a common event occur? [Chart: mean cumulative failures versus system age in months for Disk Set A, Disk Set B, Disk Set C, and a Target MTBF line]
  • 15. TDR Example 2.5 [Chart: mean cumulative failures versus calendar date, Jan 1, 2010 through Feb 3, 2014]
  • 16. Long Term Storage
    Near-line disk systems for backup: access time and bandwidth advantages over tape.
    Enterprise-class tape for backup and archival: 15-30 years shelf life, significant ECC, read error rate of 1e-20 (versus 1e-15 for an enterprise-class HDD).
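To put those error rates in perspective, a rough back-of-the-envelope sketch (my arithmetic, not the slides'): the expected number of unrecoverable errors for a given amount of data read is bits read times the per-bit error rate.

```python
def expected_unrecoverable_errors(bytes_read, errors_per_bit):
    """Expected unrecoverable read errors = bits read * per-bit error rate."""
    return bytes_read * 8 * errors_per_bit

data = 10e12   # read 10 TB
print(f"enterprise HDD  (1e-15): {expected_unrecoverable_errors(data, 1e-15):.2f} expected errors")
print(f"enterprise tape (1e-20): {expected_unrecoverable_errors(data, 1e-20):.1e} expected errors")
# ~0.08 expected errors per 10 TB read from disk versus ~8e-7 from tape
```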
  • 17. Reliability
    Reliability is time dependent, and TDR analysis reveals trends. Use cumulative plots, mean cumulative plots, and recurrence rates; graphs are good.
    Track failures and downtime by system versus age and calendar dates. Correlate anomalous behavior. Manage retirement, refresh, and preventative processes using real data.
  • 18. Data Sheets
  • 19. Reading Data Sheets
    Manufacturers publish useful data sheets and product guides.
    Reliability information: MTBF or AFR, UER (or equivalent), warranty. The operating hours per year behind an AFR figure can be buried in a footnote.
    Performance: interface bandwidth, sustained bandwidth (aka internal or media bandwidth), average rotational delay or rpm (HDD), average response or seek time, native sector size.
    Environmentals: power.
  • 20. Availability
  • 21. Nines Matter: Is the Internet up?
  • 22. Nines Matter: Is the Internet up? Is the Internet down?
  • 23. Nines Matter: Is the Internet up? Is the Internet down? Is the Internet's reliability 5-9's?
  • 24. Nines Don't Matter: Is the Internet up? Is the Internet down? Is the Internet's reliability 5-9's? Do 5-9's matter?
  • 25. Reliability Matters! Is the Internet up? Is the Internet down? Is the Internet's reliability 5-9's? Do 5-9's matter? Reliability matters!
  • 26. Designing for Failure
    Change design perspective.
    Design for success: how to make it work? What you learned in school: solve the equation. Can be difficult...
    Design for failure: how to make it work when everything breaks? What you learned in the army: win the war. Can be difficult... at first...
  • 27. Example: Design for Success [Diagram: HA-Cluster plugin, two x86 servers running NexentaStor attached to shared storage over FC, SAS, or iSCSI]
  • 28. Designing for Failure
    Application-level replication: hard to implement (coding required), some activity in the open community, and hard to apply to general-purpose computing.
    Examples: DoD, Google, Facebook, Amazon, ... the big guys. Tends to scale well with size. Multiple copies of data.
  • 29. Reliability - Availability: Reliability trumps availability. If disks didn't break, RAID would not exist; if servers didn't break, HA clusters would not exist. Reliability is measured in probabilities; availability is measured in nines.
  • 30. Data Retention
  • 31. Evaluating Data Retention
    MTTDL = Mean Time To Data Loss. Note: MTBF is not constant in the real world, but treating it as constant keeps the math simple.
    MTTDL[1] is a simple MTTDL model:
    No parity (single vdev, striping, RAID-0): MTTDL[1] = MTBF / N
    Single parity (mirror, RAIDZ, RAID-1, RAID-5): MTTDL[1] = MTBF² / (N * (N-1) * MTTR)
    Double parity (3-way mirror, RAIDZ2, RAID-6): MTTDL[1] = MTBF³ / (N * (N-1) * (N-2) * MTTR²)
    Triple parity (4-way mirror, RAIDZ3): MTTDL[1] = MTBF⁴ / (N * (N-1) * (N-2) * (N-3) * MTTR³)
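A small Python sketch of the MTTDL[1] formulas above; the MTBF and MTTR values in the example call are placeholders, not figures taken from the slides:

```python
def mttdl1(mtbf, n, mttr, parity):
    """MTTDL[1] = MTBF^(parity+1) / (N * (N-1) * ... * (N-parity) * MTTR^parity).
    parity: 0 = stripe, 1 = mirror/RAIDZ, 2 = RAIDZ2, 3 = RAIDZ3. Hours in, hours out."""
    denom = 1.0
    for i in range(parity + 1):
        denom *= (n - i)
    return mtbf ** (parity + 1) / (denom * mttr ** parity)

# Example with assumed values: 8-disk RAIDZ2, MTBF = 1.2M hours, MTTR = 48 hours
hours_per_year = 24 * 365
print(f"MTTDL[1] = {mttdl1(1.2e6, n=8, mttr=48, parity=2) / hours_per_year:,.0f} years")
```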
  • 32. Another MTTDL Model
    The MTTDL[1] model doesn't take unrecoverable reads into account, but unrecoverable reads (UER) are becoming the dominant failure mode.
    UER is specified as errors per bits read, so more bits = higher probability of loss per vdev. The MTTDL[2] model considers UER.
  • 33. Why Worry about UER?
    Richard's study: 3,684 hosts with 12,204 LUNs; 11.5% of all LUNs reported read errors.
    Bairavasundaram et al., FAST08 (www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf): 1.53M LUNs over 41 months; RAID reconstruction discovers 8% of checksum mismatches; "For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined."
    Manufacturers trade UER for space.
  • 34. Why Worry about UER? [Chart: RAID array study]
  • 35. Why Worry about UER? [Chart: RAID array study, annotated with unrecoverable reads, a disk that disappeared, and a "disk pull"] "Disk pull" tests aren't very useful.
  • 36. MTTDL[2] Model
    Probability that a reconstruction will fail: Precon_fail = (N-1) * size / UER
    The model doesn't work for non-parity schemes (single vdev, striping, RAID-0).
    Single parity (mirror, RAIDZ, RAID-1, RAID-5): MTTDL[2] = MTBF / (N * Precon_fail)
    Double parity (3-way mirror, RAIDZ2, RAID-6): MTTDL[2] = MTBF² / (N * (N-1) * MTTR * Precon_fail)
    Triple parity (4-way mirror, RAIDZ3): MTTDL[2] = MTBF³ / (N * (N-1) * (N-2) * MTTR² * Precon_fail)
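A hedged sketch of the MTTDL[2] formulas; I am reading the UER in the Precon_fail expression in its "1 error per UER bits read" form (for example, 1e15 bits for an enterprise HDD), and the example values are assumptions rather than numbers from the slides:

```python
def precon_fail(n, disk_size_bytes, uer_bits_per_error):
    """Probability a reconstruction hits an unrecoverable read:
    (N-1) surviving disks * disk size in bits / bits read per error."""
    return (n - 1) * disk_size_bytes * 8 / uer_bits_per_error

def mttdl2(mtbf, n, mttr, parity, disk_size_bytes, uer_bits_per_error):
    """MTTDL[2] for parity = 1 (mirror/RAIDZ), 2 (RAIDZ2), 3 (RAIDZ3). Hours in, hours out."""
    denom = precon_fail(n, disk_size_bytes, uer_bits_per_error)
    for i in range(parity):               # N * (N-1) * ... for `parity` terms
        denom *= (n - i)
    return mtbf ** parity / (denom * mttr ** (parity - 1))

# Assumed example: 8 x 2 TB RAIDZ2, MTBF = 1.2M hours, MTTR = 48 hours, UER = 1 in 1e15 bits
hours_per_year = 24 * 365
print(f"MTTDL[2] = {mttdl2(1.2e6, 8, 48, 2, 2e12, 1e15) / hours_per_year:,.0f} years")
```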
  • 37. Practical View of MTTDL[1]
  • 38. MTTDL[1] Comparison
  • 39. MTTDL Models: Mirror. Spares are not always better...
  • 40. MTTDL Models: RAIDZ2
  • 41. Space, Dependability, and Performance
  • 42. Dependability Use Case
    Customer has 15+ TB of read-mostly data; 16-slot, 3.5" drive chassis; 2 TB HDDs.
    Option 1: one raidz2 set, 24 TB available space (12 data + 2 parity), 2 hot spares, 48-hour disk replacement time. MTTDL[1] = 1,790,000 years.
    Option 2: two raidz2 sets, 24 TB available space in total (each set: 6 data + 2 parity), no hot spares. MTTDL[1] = 7,450,000 years.
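Reusing the mttdl1() sketch from slide 31, the two layouts can be compared directly. The MTBF and MTTR inputs below are placeholders (the slide's 1,790,000- and 7,450,000-year figures come from the author's own assumptions, which are not given), but the relative ordering, with two narrower raidz2 sets beating one wide set, falls out of the N*(N-1)*(N-2) term either way:

```python
# Assumed inputs: MTBF = 1.2M hours and MTTR = 48 hours for both options
mtbf, mttr, hours_per_year = 1.2e6, 48, 24 * 365

option1 = mttdl1(mtbf, n=14, mttr=mttr, parity=2)       # one 12 data + 2 parity raidz2 set
option2 = mttdl1(mtbf, n=8, mttr=mttr, parity=2) / 2    # two 6+2 sets; losing either set loses data
print(f"Option 1: {option1 / hours_per_year:,.0f} years")
print(f"Option 2: {option2 / hours_per_year:,.0f} years ({option2 / option1:.1f}x better)")
```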
  • 43. Planning for Spares
    The number of systems drives the need for spares. How many spares do you need? How often do you plan replacements?
    Replacing devices immediately becomes impractical; not replacing devices increases risk, but by how much? There is no black/white answer, it depends...
  • 44. SparesOptimizer Demo
  • 45. Capacity, Planning, and Design
  • 46. Space
    Space is a poor sizing metric, really! Technology marketing heavily pushes space, and maximizing space can mean compromising performance AND reliability.
    As disks and tapes get bigger, they don't get better. The $150 rule. PHBs get all excited about space, and most current capacity planning tools manage by space.
  • 47. Bandwidth
    Bandwidth constraints in modern systems are rare, and overprovisioning for bandwidth is relatively simple. Where to gain bandwidth can be tricky: link aggregation (Ethernet, SAS) and MPIO. Adding parallelism beyond 2 trades off reliability.
  • 48. Latency
    Lower latency == better performance. Latency != IOPS: IOPS can also be achieved with parallelism, but parallelism only improves latency when latency is constrained by bandwidth.
    Latency = access time + transfer time.
    HDD: access time limited by seek and rotate; transfer time usually limited by media or internal bandwidth.
    SSD: access time limited by architecture more than c; transfer time limited by architecture and interface.
    Tape: access time measured in seconds.
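A quick sketch of the latency = access time + transfer time split for an HDD; the 8 ms seek, 7,200 rpm, and 150 MB/s media-bandwidth figures are generic assumptions, not numbers from the slides:

```python
def hdd_latency_ms(io_bytes, seek_ms=8.0, rpm=7200, media_mb_per_s=150):
    """Latency = access time (seek + average rotational delay) + transfer time (size / media bandwidth)."""
    rotational_ms = 0.5 * 60_000 / rpm                    # half a rotation, on average
    transfer_ms = io_bytes / (media_mb_per_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms

print(f"4 KiB random read: ~{hdd_latency_ms(4096):.1f} ms")    # dominated by access time
print(f"1 MiB read:        ~{hdd_latency_ms(2**20):.1f} ms")   # transfer time starts to matter
```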
  • 49. Deduplication
  • 50. What is Deduplication?
    A $2.1 billion feature and the 2009 buzzword of the year. A technique for improving storage space efficiency: it trades big I/Os for small I/Os, but it does not eliminate I/O.
    Implementation styles:
    Offline or post-processing: data is written to nonvolatile storage, and a process comes along later and dedupes it (example: tape archive dedup).
    Inline: data is deduped as it is being allocated to nonvolatile storage (example: ZFS).
  • 51. Dedup how-to
    Given a bunch of data: find data that is duplicated, build a lookup table of references to the data, and replace duplicate data with a pointer to the entry in the lookup table.
    Granularity: file, block, or byte.
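A minimal block-level sketch of that recipe in Python, using a toy in-memory table keyed by SHA-256 (an illustration, not how any particular product implements it):

```python
import hashlib

class DedupStore:
    """Toy block-level dedup: store each unique block once, track references by checksum."""
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.table = {}              # checksum -> (block bytes, reference count)

    def write(self, data):
        """Split data into blocks, dedupe, and return the block checksums (the "pointers")."""
        refs = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            key = hashlib.sha256(block).hexdigest()
            stored, count = self.table.get(key, (block, 0))
            self.table[key] = (stored, count + 1)    # duplicate block: just bump the refcount
            refs.append(key)
        return refs

store = DedupStore()
refs = store.write(b"A" * 8192 + b"B" * 4096)        # two identical "A" blocks, one "B" block
print(len(refs), "logical blocks,", len(store.table), "unique blocks stored")
```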
  • 52. Dedup Constraints
    Size of the deduplication table.
    Quality of the checksums: collisions happen, since all possible permutations of N bits cannot be stored in N/10 bits. Checksums can be evaluated by probability of collisions; multiple checksums can be used, but the gains are marginal.
    Compression algorithms can work against deduplication: dedup before or after compression?
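The "probability of collisions" point can be made concrete with the usual birthday approximation, P(collision) ≈ n² / 2^(b+1) for n blocks and a b-bit checksum; the block counts below are my example, not the slides':

```python
def collision_probability(num_blocks, checksum_bits):
    """Birthday approximation: P(at least one collision) ~= n^2 / 2^(b+1)."""
    return num_blocks ** 2 / 2 ** (checksum_bits + 1)

blocks = (1 << 40) // 4096    # 1 TiB of 4 KiB blocks, about 2^28 entries
for bits in (64, 128, 256):
    print(f"{bits}-bit checksum: P(collision) ~= {collision_probability(blocks, bits):.1e}")
# A 256-bit checksum makes an accidental collision vanishingly unlikely; a 64-bit
# checksum does not, which is why verification or stronger hashes get used.
```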
  • 53. Verification [Flow chart of the inline dedup write path: on write(), checksum and compress the block, then look up the DDT entry. No DDT match: create a new entry. DDT match with verify off: add a reference. DDT match with verify on: read the stored data and compare; if the data matches, add a reference, otherwise create a new entry.]
  • 54. Reference Counts (eggs courtesy of Richard's chickens)
  • 55. Replication
  • 56. Replication Services [Chart: replication options arranged from days to seconds of Recovery Point Objective against system I/O performance from faster to slower: traditional backup (NDMP, tar), file-level sync (rsync), object-level sync (databases, ZFS), block replication (DRBD, SNDR), application-level replication, mirror]
  • 57. How Many Copies Do You Need?
    Answer: at least one, and more is better...
    One production, one backup
    One production, one near-line, one backup
    One production, one near-line, one backup, one at a DR site
    One production, one near-line, one backup, one at a DR site, one archived in a vault
    RAID doesn't count. Consider 3 to 4 as a minimum for important data.
  • 58. Tiering Example [Diagram: big, honking disk array with file-based backup to a big, honking tape library] Works great, but...
  • 59. Tiering Example [Diagram: same disk array, tape library, and file-based backup] ... backups never complete: 10 million files, 1 million daily changes, 12-hour backup window.
  • 60. Tiering Example [Diagram: disk array replicated to near-line backup storage, then backed up to the tape library] Backups to near-line storage and tape have different policies: 10 million files, 1 million daily changes, a weekly backup window, and hourly block-level replication.
  • 61. Tiering Example [Diagram: same disk array, near-line backup, and tape library] Quick file restoration is possible.
  • 62. Application-Level Replication Example [Diagram: an application stores data at different sites (Site 1, Site 2, Site 3), with a long-term archive option]
  • 63. Data Sheets
  • 64. Reading Data Sheets Redux
    Manufacturers publish useful data sheets and product guides.
    Reliability information: MTBF or AFR, UER (or equivalent), warranty. The operating hours per year behind an AFR figure can be buried in a footnote.
    Performance: interface bandwidth, sustained bandwidth (aka internal or media bandwidth), average rotational delay or rpm (HDD), average response or seek time, native sector size.
    Environmentals: power.
  • 65. Summary
  • 66. Key Points
    You will need many copies of your data; get used to it. The cost/byte decreases faster than kicking old habits.
    Replication is a good thing, use it often. Tiering is a good thing, use it often.
    Beware of designing only for success; design for failure, too. Reliability trumps availability.
    Space, dependability, performance: pick two.
  • 67. Thank You! Questions? Richard.Elling@RichardElling.com, Richard.Elling@Nexenta.com