
RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

FAST '15 presentation slides

  1. RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures
     Ao Ma, Fred Douglis, Guanlin Lu, Darren Sawyer (EMC); Surendar Chandra, Windsor Hsu (Datrium)
  2. Pervasive RAID Protection
     Disk failures are commonplace
     • Whole-disk failure
     • Partial failure
     RAID is widely deployed
     • Protects data against failures with redundancy
  3. RAID Overview
     Storage systems are evolving
     • Escalating use of less reliable drives causes more whole-disk failures
     • Increasing disk capacity results in more sector errors
     Solution
     • Add extra redundancy (RAID5, RAID6, …)
       – Ensures data reliability at the cost of storage efficiency
     Is adding extra redundancy an efficient solution?
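     As a back-of-the-envelope illustration of that storage-efficiency cost (the 16-disk group size is a hypothetical example, not a number from the slides): with p parity disks in an n-disk group, the usable-capacity fraction is

        \[
          \frac{n-p}{n}, \qquad
          \text{e.g. } \frac{16-1}{16} = 93.75\% \;(\text{RAID5}), \quad
          \frac{16-2}{16} = 87.5\% \;(\text{RAID6}), \quad
          \frac{16-3}{16} = 81.25\% \;(\text{triple parity}).
        \]

     Each extra level of parity tolerates one more simultaneous failure but gives up another disk's worth of capacity per group.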
  4. What We Did
     Analyzed 1 million SATA disks and revealed
     • Failure modes degrading RAID reliability
     • Reallocated sectors reflect disk reliability deterioration
     • Disk failure is predictable
     Built RAIDSHIELD, an active defense mechanism
     • Reconstruct a failing disk before it's too late!
     • PLATE: single-disk proactive protection
       – Deployment eliminates 70% of RAID failures
     • ARMOR: disk-group proactive protection
       – Recognizes vulnerable RAID groups
  5. Outline
     • Background
     • Disk failure analysis
     • RAIDSHIELD:
       – Identify a failure indicator
       – Reallocated Sector (RS) characterization
       – Single-disk proactive protection
       – Disk-group proactive protection
  6. Whole-disk Failure Definition
     Disk failure does not follow a fail-stop model
     The production systems studied define failure as
     • The connection is lost
     • An operation exceeds the timeout threshold
     • A write fails
  7. Disk Data Collection
     • Each disk drive model is denoted as <family-capacity>
     • Relative sizes within a family are ordered by the capacity number
       – E.g., A-2 is larger than A-1

     Disk Model | Population (Thousands) | First Deployment | Log Length (Months)
     A-1        |  34                    | 06/2008          | 60
     A-2        | 165                    | 11/2008          | 60
     B-1        | 100                    | 06/2008          | 48
     C-1        |  93                    | 10/2010          | 36
     C-2        | 253                    | 12/2010          | 36
     D-1        | 384                    | 09/2011          | 21
  8. What Do Real Disk Failures Look Like?
  9. Distribution of Lifetime of Failed Drives
     [Bar charts: percentage of failed drives by age, in 6-month bins from 0-6 to 54-60 months.
      A-1: 0, 0, 1, 2, 4, 15, 34, 29, 11, 3 (%); A-2: 0, 0, 0, 1, 4, 24, 46, 20, 3, 1 (%)]
     A large fraction of failed drives are found at a similar age
  10. Increasing Frequency of Sector Errors
      The number of affected disks keeps growing
      • About 10% of disks have sector errors by the 3rd year
      Sector error counts increase continuously
      • The average error count increases 25% to 300% year over year
  11. Passive Redundancy Is Inefficient
      Drives fail at a similar age
      • The failure rate is not constant
      • There is a high risk of multiple simultaneous failures
      Increasing frequency of sector errors
      • Exacerbates the risk of reconstruction failures
      Ensuring reliability in the worst case requires adding considerable extra redundancy, making it unattractive from a cost perspective
  12. RAIDSHIELD, the Proactive Protection
      Motivation
      • Ensure data safety with minimal redundancy
      • Proactively recognize impending failures and migrate vulnerable data in advance
      Methodology
      • Identify an indicator of impending failure
      • Indicator characterization
      • Proactive protection
  13. Identify a Failure Indicator
      Potential indicators
      • Various disk errors
      Criterion for a good indicator
      • It happens much more frequently on failed disks than on working disks
      Approach
      • Quantify the discrimination between the error values on failed disks and working ones
        – Decile comparison is used
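      A minimal sketch of the decile comparison described above, as it might be computed offline (the function name and the sample arrays are hypothetical; the slides do not show the actual implementation):

         import numpy as np

         def decile_comparison(failed_counts, working_counts):
             """Compare an error metric across failed vs. working disks by deciles.

             A metric discriminates well when its deciles on failed disks are
             much larger than the corresponding deciles on working disks.
             """
             deciles = np.arange(10, 100, 10)  # 10th, 20th, ..., 90th percentiles
             failed = np.percentile(failed_counts, deciles)
             working = np.percentile(working_counts, deciles)
             return list(zip(deciles, failed, working))

         # Hypothetical (made-up) error counts for two small disk populations.
         failed_rs = [3, 12, 55, 140, 260, 410, 700, 1100, 1900, 2300]
         working_rs = [0, 0, 0, 0, 1, 1, 2, 5, 20, 35]
         for d, f, w in decile_comparison(failed_rs, working_rs):
             print(f"{d}th percentile: failed={f:.0f}, working={w:.0f}")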
  14. Media Error Comparison
      Failed disks have more media errors than working ones
      The discrimination is not significant enough
      [Chart: A-2 media error count by decile (1st-9th): failed disks 2, 3, 5, 9, 15, 22, 32, 47, 86; working disks 1, 1, 1, 2, 3, 4, 7, 13, 30]
  15. Reallocated Sector (RS) Comparison
      RS is strongly correlated with disk failures
      [Chart: A-2 reallocated sector count by decile (1st-9th): failed disks 2, 23, 87, 187, 327, 522, 812, 1242, 2025; working disks 0, 0, 0, 0, 0, 1, 2, 6, 29]
  16. Correlation Between Sector Errors and Whole-disk Failure
      Most failed drives tend to have a larger number of RS than working ones
      RS is the most strongly correlated with whole-disk failures, followed by media errors, pending sector errors, and uncorrectable sector errors
      RS is a strong indicator of impending disk failure
  17. RS Characterization (1)
      A larger RS count implies a higher failure rate within the two-month observation window
      [Chart: A-2 disk failure rate (%) vs. RS count, for RS counts 0, 40, 80, ..., 600: 1.7, 67, 75, 80, 83, 85, 86, 89, 90, 90, 91, 92, 93, 93, 94, 95]
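      A sketch of how such a conditional failure rate could be tabulated from labeled disk snapshots (the data layout, bucket width, and window handling are assumptions, not the paper's code):

         from collections import defaultdict

         def failure_rate_by_rs(snapshots, bucket_width=40):
             """Empirical failure rate per RS-count bucket.

             `snapshots` is a list of (rs_count, failed) pairs, where `failed`
             records whether the disk failed within the observation window
             (two months in the slides) after the RS count was sampled.
             """
             totals = defaultdict(int)
             failures = defaultdict(int)
             for rs_count, failed in snapshots:
                 bucket = (rs_count // bucket_width) * bucket_width
                 totals[bucket] += 1
                 failures[bucket] += int(failed)
             return {b: failures[b] / totals[b] for b in sorted(totals)}

         # Hypothetical usage with made-up snapshots:
         snapshots = [(0, False), (0, False), (45, True), (130, True), (130, False), (610, True)]
         for bucket, rate in failure_rate_by_rs(snapshots).items():
             print(f"RS {bucket}-{bucket + 39}: {rate:.0%} failed within the window")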
  18. RS Characterization (2)
      The larger the RS count, the faster the disk fails
      [Chart: time margin to failure (days) vs. RS count, shown as the 10th/25th percentile, median, and 75th/90th percentile]
  19. PLATE: Single-Disk Proactive Protection
      The RS count indicates the degree of disk reliability deterioration
      Use the RS count to predict impending disk failure in advance
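      A minimal sketch of this threshold rule (the threshold value, callback, and function names are hypothetical; the slides do not specify the implementation):

         RS_THRESHOLD = 280  # hypothetical cutoff; the next slide sweeps a range of thresholds

         def check_disk(disk_id, reallocated_sector_count, start_proactive_copy):
             """Flag a disk whose reallocated-sector count signals impending failure.

             When the RS count crosses the threshold, the disk's contents are copied
             to a spare before the disk fails outright, so the RAID group never has
             to reconstruct from already-degraded redundancy.
             """
             if reallocated_sector_count >= RS_THRESHOLD:
                 start_proactive_copy(disk_id)  # migrate data to a spare drive
                 return True                    # disk is treated as failing
             return False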
  20. Simulation Result: Failure Capture Rate Given Different RS Thresholds
      Both the predicted-failure rate and the false-positive rate decrease as the threshold increases
      [Chart, for RS thresholds 20, 40, 60, 80, 100, 200, 300, 400, 500, 600:
       failures predicted (%) 70.1, 66.6, 64, 61.8, 59.9, 52.1, 47, 42.6, 39, 36.9;
       false positives (%) 4.5, 2.8, 2.1, 1.7, 1.4, 0.8, 0.7, 0.4, 0.3, 0.27]
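      A sketch of how such a threshold sweep could be evaluated from labeled disk histories (field names and sample data are hypothetical, and normalizing false positives by the working-disk population is my assumption):

         def sweep_rs_thresholds(disks, thresholds):
             """Capture rate and false-positive rate for candidate RS thresholds.

             `disks` is a list of (max_rs_count, failed) pairs; `failed` records
             whether the disk actually failed during the observation period.
             """
             failed_total = sum(1 for _, failed in disks if failed)
             working_total = len(disks) - failed_total
             results = []
             for t in thresholds:
                 flagged = [(rs, failed) for rs, failed in disks if rs >= t]
                 captured = sum(1 for _, failed in flagged if failed)
                 false_pos = sum(1 for _, failed in flagged if not failed)
                 results.append((t,
                                 captured / failed_total if failed_total else 0.0,
                                 false_pos / working_total if working_total else 0.0))
             return results

         # Hypothetical usage with made-up disks:
         disks = [(350, True), (15, False), (90, True), (0, False), (600, True), (40, False)]
         for t, capture, fp in sweep_rs_thresholds(disks, [20, 100, 300]):
             print(f"threshold={t}: captured {capture:.0%} of failures, {fp:.0%} false positives")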
  21. PLATE Deployment Result: Causes of Recovery Incidents
      Single-disk proactive protection reduces RAID failures by about 70%, equivalent to eliminating 88% of the triple-disk failures
      [Chart (% of incidents). Without proactive protection: hardware failures 5, others 15, triple failures 80.
       With proactive protection: hardware failures 5, others 15, triple failures 10, triple failures eliminated 70]
  22. Motivation for ARMOR: RAID-Group Proactive Protection
      The remaining 10% of triple failures
      • PLATE misses RAID failures caused by multiple less reliable drives whose RS counts haven't exceeded the threshold
      Triage
      • Prioritize the disk groups with the highest risk
  23. Disk Group Protection Example
      [Diagram: four 4-disk groups (disks 1-1 through 4-4), each disk marked as good, under threat of failure (?), or facing imminent failure (X). DG1 is healthy, DG2 faces imminent failure, DG3 is protected, DG4 may fail.]
      Single-disk protection (PLATE): replace 2-3, 2-4, 3-4
      • Can't identify DG4, nor the difference between DG2 and DG3
      Group protection (ARMOR): replace DG4 or increase its redundancy
      • Protects DG4 and recognizes the difference between DG2 and DG3
  24. ARMOR Methodology
      Calculate the single-disk failure probability
      • Conditional probability through Bayes' theorem
      Calculate the probability of a vulnerable RAID group
      • Combine the single-disk probabilities through a joint probability
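      A minimal sketch of the two steps named above, under assumptions the slides don't spell out: the per-disk probability comes from Bayes' theorem applied to the RS count, and the group is treated as vulnerable when several member disks are likely to fail together, with the disks assumed independent (the exact joint formulation here is mine, not the paper's):

         from itertools import combinations
         from math import prod

         def p_fail_given_rs(rs, p_rs_given_fail, p_rs_given_ok, p_fail_prior):
             """Bayes' theorem: P(fail | RS = rs) from per-population RS distributions."""
             num = p_rs_given_fail(rs) * p_fail_prior
             den = num + p_rs_given_ok(rs) * (1 - p_fail_prior)
             return num / den if den > 0 else 0.0

         def p_at_least_k_fail(per_disk_probs, k):
             """Probability that at least k disks in the group fail, assuming independence."""
             n = len(per_disk_probs)
             total = 0.0
             # Enumerate every subset of exactly j >= k failing disks (fine for small groups).
             for j in range(k, n + 1):
                 for failing in combinations(range(n), j):
                     total += prod(per_disk_probs[i] if i in failing else 1 - per_disk_probs[i]
                                   for i in range(n))
             return total

         # Hypothetical usage: a RAID6 group survives two failures, so it becomes
         # vulnerable when three or more member disks are likely to fail together.
         group_probs = [0.02, 0.3, 0.45, 0.05, 0.6, 0.01]
         print(f"P(>=3 failures) = {p_at_least_k_fail(group_probs, 3):.4f}")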
  25. Evaluation
      The discrimination shows that ARMOR is effective at recognizing endangered DGs
      In practice, it identifies most DG failures that are not predicted by PLATE
      [Chart: probability decile distribution (1st-9th). Groups with more than one failure: 0.25, 0.33, 0.39, 0.44, 0.46, 0.5, 0.63, 0.73, 0.93; groups without failure: 0.15, 0.2, 0.23, 0.25, 0.27, 0.28, 0.3, 0.31, 0.32]
  26. Related Work
      Google reports that SMART metrics such as reallocated sectors strongly suggest an impending failure, but also finds that half of the failed disks show no such errors [Pinheiro'07]
      • Different workload and RAID rewrite
      Disk failure prediction
      • Average maximum latency [Goldszmidt'12]
      • SMART failure prediction [Murray'05, Hughes'02]
  27. Summary
      We analyzed 1 million SATA drives
      • Observed failure modes that degrade RAID reliability
      • Revealed that the RS count reflects disk reliability deterioration
      • Disk failure is predictable
      We built RAIDSHIELD, an active defense mechanism
      • PLATE: single-disk proactive protection
        – Deployment eliminates 70% of RAID failures
      • ARMOR: disk-group proactive protection
        – Recognizes vulnerable RAID groups
        – We hope to deploy it in the future
      Is adding extra redundancy an efficient solution?
      • Use as much redundancy as needed to ensure availability
      • Proactive replacement should decrease the level needed
  28. RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures
      Questions?
      Acknowledgements: Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau; the Data Domain engineering team; members of AD and the CTO office; Stephen Manley
      Contact: ao.ma@emc.com
