Storage systems reliability
This document describes the various RAID levels and how they improve system storage reliability.


White Paper – System Storage Reliability
Juha Salenius

Storage Systems

As a way of defining the various facets of reliability, let's start out by configuring a hypothetical server using six internal hot-swap disk drives that can be either Serial Attached SCSI (SAS) or Serial Advanced Technology Attachment (SATA) disk drives. If these drives were SAS disks set up as Just a Bunch of Drives (JBOD), a non-RAID configuration, then the MTBF would be all we'd need to define the computed failure rate. With an individual drive MTBF_SAS of 1,400,000 hours[1], the combined MTBF for six drives computes to 233,333 hours using the following equation, where N is the number of identical components (in this case disk drives) and the subscripts are tc = total components and c = component:

    MTBF_tc = MTBF_c / N    (special case where all components are the same)

In contrast to the SAS MTBF_tc, using individual SATA drives instead, each exhibiting an individual MTBF_SATA of 1,000,000 hours[2], the combined MTBF_tc for six drives in a JBOD configuration is 166,667 hours.

[1] Adaptec Inc. Storage Advisors Weblog, 11/02/2005
[2] Ibid.
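As a rough illustration (ours, not part of the original paper), the two JBOD figures above can be reproduced with a few lines of Python using the drive MTBF values cited in the text:

```python
# Combined MTBF for N identical components in series (no redundancy):
# MTBF_tc = MTBF_c / N

def combined_mtbf(per_component_mtbf_hours: float, n_components: int) -> float:
    """Series MTBF for N identical components (special-case formula)."""
    return per_component_mtbf_hours / n_components

if __name__ == "__main__":
    print(combined_mtbf(1_400_000, 6))  # SAS JBOD:  ~233,333 hours
    print(combined_mtbf(1_000_000, 6))  # SATA JBOD: ~166,667 hours
```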
RAID Considerations:

Even with the drives configured as a RAID (levels 0, 1, 5 and 6) array, the total MTBF for all the drives remains the same as above because MTBF does not take any redundancy into account. MTBI (Mean Time Between Interruption) can be used to highlight the difference in uptime based on a redundant configuration. In a JBOD configuration the MTBF and the MTBI are the same. You'll notice that once we move from a non-redundant system to a system with redundant components, we move from reporting MTBF to reporting MTBI as the more meaningful term from a system perspective.

Consider the following RAID levels:

  • RAID 0 will not be considered here because it does not provide any failure protection, though it does provide higher throughput by striping the data stream across multiple drives, and it is usually combined with other RAID levels to increase their throughput. Certain RAID levels can be combined, for example RAID 10, RAID 50 and RAID 60. These configurations combine data striping across multiple drives with either mirroring or parity. They increase complexity but improve performance.
  • RAID 1 mirrors data across two disk drives, which requires doubling the number of data drives.
  • RAID 5 uses parity to recover from a bad read; the data and parity are written in blocks across all drives in the array, with the parity distributed evenly among the drives. Because of the added parity information, a minimum of three disk drives is needed to implement RAID 5.
  • RAID 6 is a RAID 5 configuration with yet an additional parity block.

With RAID 5 and 6, the ratio of data storage to parity storage increases as the number of spindles increases, so for a system with six drives there could be the equivalent of five data drives and one parity drive for RAID 5, and four data drives and two parity drives for RAID 6. The spindle overhead for RAID 5 with five drives is 20%, and doubling the total number of drives to 10 decreases the overhead to 10%.

Why use RAID 6 instead of RAID 5? RAID 5 provides protection against a single failed drive; RAID 6 provides protection against two concurrent failures. Once a drive has failed, the only time the array is exposed to an additional failure is the time it takes to replace and rebuild the failed drive, the MTTR (Mean Time To Repair) interval. With RAID 6, the exposure to an additional failure is eliminated because of the additional parity. If the system has a hot spare drive, the time to repair is significantly reduced: the rebuild can start immediately and the failed drive can be replaced during or after the rebuild. The probability of another hardware failure during the MTTR interval is extremely low. But there is another disk-related issue that can cause a problem during this MTTR interval: a hard read error, more prevalent in SATA disks.

SAS and SATA Drive Considerations:

Both SAS and SATA drives have well-defined Bit Error Rates (BER). SAS drives are more robust than SATA drives, exhibiting a BER on the order of one error per 10^15 bits read[3], equating to one error per 100 terabytes (TB) read. SATA drives are not as robust and exhibit BERs on the order of one error per 10^14 bits read[4], or one every 10 TB.

What does this mean from a system perspective? To illustrate the issue, we'll start with a SATA disk array that has failed due to a hardware problem and is in the process of rebuilding. Let's make some assumptions: the array has 10 drives, the drives are 500 GB each, and each drive has a 10^14 read BER. The following formula determines how many full-array reads (rebuilds) fit within one expected unrecoverable error:

    rebuilds per expected error = BER interval / bits read per rebuild
                                = 10^14 bits / (10 drives × 500 GB × 8 bits per byte)
                                = 10^14 / 4×10^13
                                = 2.5

It's entirely possible that an array will be rebuilt 2.5 times in its life, and a non-recoverable error may occur during those 2.5 rebuilds. This scenario only addresses 500 GB drives, but the industry has moved on and drive sizes have increased to 1 TB and beyond, which makes this issue more problematic. The larger the drives or the more drives in the array, the more frequently a non-recoverable read error can occur during a rebuild.

[3] Ibid.
[4] Ibid.
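A small Python sketch (ours, not the paper's) reproduces the 2.5-rebuild figure under the stated assumptions, taking decimal gigabytes (1 GB = 10^9 bytes) and a 10^14-bit error interval:

```python
# Expected full-array rebuilds per unrecoverable read error, for the
# 10-drive, 500 GB SATA example above (decimal GB assumed: 1 GB = 1e9 bytes).

BER_BITS = 1e14                      # one unrecoverable error per 1e14 bits read
DRIVES = 10
DRIVE_BYTES = 500e9                  # 500 GB per drive

bits_per_rebuild = DRIVES * DRIVE_BYTES * 8
rebuilds_per_error = BER_BITS / bits_per_rebuild

print(f"bits read per rebuild:     {bits_per_rebuild:.2e}")    # 4.00e+13
print(f"rebuilds per expected URE: {rebuilds_per_error:.1f}")  # 2.5
```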
Combining Optimal RAID and Hard Drive Choices (probability) – the math:

A concern with SATA disk technology is the Unrecoverable Read Error (URE) rate, which is currently one per 10^14 bits. A URE every 10^14 bits equates to an error every 2.4×10^10 sectors. This becomes critical as drive sizes increase. When a drive fails in a 7-drive RAID 5 array made up of 2 TB SATA disks, the 6 remaining good 2 TB drives have to be read completely to recover the missing data. As the RAID controller is reconstructing the data, it is very likely to see a URE occur on the remaining media. At that point the RAID reconstruction stops.

Here's the math. Reading the six surviving drives means reading 12 TB, or roughly 2.3×10^10 sectors, and each sector has a 1 in 2.4×10^10 chance of an unrecoverable error:

    P(data loss) = 1 − (1 − 1/2.4×10^10)^(2.3×10^10) ≈ 1 − e^(−0.96) ≈ 0.62

There is a 62% chance of data loss due to an uncorrectable read error on a 7-drive (2 TB each) RAID 5 array with one failed disk, assuming a 10^14 read error rate and ~23 billion sectors in 12 TB. Feeling lucky?[5]

[5] "Does RAID 6 stop working in 2019?" by Robin Harris, Saturday, 27 February 2010 (http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/)
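For the curious, here is a short Python check of that probability (our sketch, using the sector counts quoted above):

```python
import math

# Probability of hitting at least one unrecoverable read error (URE) while
# reading the six surviving 2 TB drives of a degraded 7-drive RAID 5 array.
SECTOR_BYTES = 512
surviving_bytes = 6 * 2e12                        # 12 TB still to be read
sectors_to_read = surviving_bytes / SECTOR_BYTES  # ~2.3e10 sectors
ure_per_sector = 1 / 2.4e10                       # one URE per 2.4e10 sectors (1e14 bits)

# P(at least one URE) = 1 - (1 - p)^n, computed via log1p/expm1 for accuracy
p_loss = -math.expm1(sectors_to_read * math.log1p(-ure_per_sector))
print(f"probability of a URE during the rebuild: {p_loss:.0%}")  # ~62%
```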
RAID 6 is a technique that can be used to mitigate this failure during the rebuild cycle. This is important because it allows the system to recover from two disk failures: one complete drive failure and a subsequent hard read error from the surviving disks in the array during the rebuild.

With customers looking to reduce system cost using SATA technology, the additional overhead for RAID 6 parity is becoming acceptable. But there are drawbacks to using RAID 6, which include longer write times due to the additional time required to generate the RAID 5 parity and then the RAID 6 parity. When an error occurs during a read, both RAID 5 and RAID 6 arrays see reduced read throughput due to bit recovery.

As we mentioned at the beginning of this article, we wanted to constrain this discussion to defining RAS (Reliability, Availability and Serviceability) and addressing increased reliability with SATA disks in a RAID environment. But there are other areas that should be addressed at the system level that also affect disk drive performance. One such area is rotational vibration. This issue is a systemic problem in rack-mount systems due to the critical thermal constraints of 1U and 2U chassis in a NEBS environment. Rotational vibration effects are mitigated in our mechanical designs, and the techniques used are covered in a separate document.

Reliability – MTTF and MTBF (Mean Time To Failure and Mean Time Between Failure)

With so much data exposed to catastrophic failure, exacerbated in the cloud computing environment, it's important to maintain data integrity, especially in the medical, telecommunications and military markets. Systems designed for these markets must address three key areas: Reliability, Availability and Serviceability. System uptime must be maximized and mission-critical data must be maintained.

The term Mean Time To Failure (MTTF) is an estimate of the average, or mean, time until the initial failure of a design or component (you may not want to include external failures), or until a disruption in the operation of the product, process, procedure or design occurs. A failure assumes that the product cannot be repaired, nor can it resume any of its normal operations, without taking it out of service. MTTF is similar to Mean Time Between Failure (MTBF), though MTBF is typically slightly longer than MTTF because MTBF includes the repair time of the design or component, known as MTTR (Mean Time To Repair).[6]

What is reliability? Per the Six Sigma SPC (Statistical Process Control) Quality Control Dictionary, reliability is the probability that any given design or process will execute within the anticipated operational or design margin for a specified period of time. In addition, the system will work under defined operating conditions with a minimum amount of stoppage due to a design or process error. Some indicators of reliability are MTBF (Mean Time Between Failures) computations, ALT (Accelerated Life Testing using temperature chambers), MTTF (Mean Time To Failure) computations, and Chi-Square[7] (the statistical difference between observed and expected results).

MTBF is a calculated indication of reliability. From a system perspective, any reliable assembly must satisfactorily perform its intended function under defined circumstances which may not be part of the MTBF calculation's environment, such as operating at varying ambient temperatures. MTBF addresses reliability in a very controlled and limited scope. Traditionally, MTBF calculations are based on the Telcordia Technologies Special Report SR-332, Issue 1, Reliability Prediction Procedure for Electronic Equipment. The results of these calculations can be used to roughly assist customers in evaluating individual products, but should not be used as a representation or guarantee of the reliability or performance of the product. MTBF is only a gross representation of how reliable a system will be under clearly defined conditions, clearly not the real world.

If we can't use the results of the MTBF calculation to determine when components will wear out in the real world, or which product is better than the others, and MTBF does not provide a reliable metric for field failures, then why use it? Because it allows us to determine how often a system will fail under steady-state environmental conditions. Early in the design cycle, component MTBF can be used to determine which parts will fail first, enabling engineering to improve the design's robustness by selecting more robust components or designing with hardened assemblies.

[6] Paraphrasing the Six Sigma SPC Quality Control Dictionary and Glossary, http://www.sixsigmaspc.com/dictionary/glossary.html
[7] Ibid.
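As a tiny worked illustration of the MTTF/MTTR/MTBF relationship described above (the figures here are hypothetical, not from the paper):

```python
# MTBF = MTTF + MTTR for a repairable unit (hypothetical figures).
mttf_hours = 100_000   # mean time to the failure itself
mttr_hours = 8         # mean time to repair after the failure
mtbf_hours = mttf_hours + mttr_hours
print(mtbf_hours)      # 100,008 hours: slightly longer than MTTF, as noted above
```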
There are three methods used in MTBF calculations: 1. the black box; 2. the black box plus laboratory inputs; and 3. the black box plus field-failure inputs. While the industry traditionally uses Method 1, Kontron Inc. CBPU/CRMS uses a combination of all three methods – black box with lab inputs, coupled with field data where available. For large aggregated component assemblies, such as computer baseboards, there are typically vendor-calculated MTBF figures; for passive components, there is industry-standard failure rate data; and for proprietary components, lab or field data is available.

Availability – MTBI (Mean Time Between Interruption)

If MTTF and MTBF describe failures measured from initial power-on, or from a previous failure including the repair time (MTTR), what is meant by Mean Time Between Interruption (MTBI)? It addresses designs that provide redundancy, allowing a redundant component to fail without halting (failing) the system. The system may not run at full speed during the time it takes to replace or rebuild the failed component, but it will run. MTBI durations are much longer than MTTF/MTBF intervals, which is better, and they can span multiple failures (as with RAID 6) with the replacement of the redundant components.

Serviceability – MTTR (Mean Time To Repair)

This term refers to how quickly and easily a system can be repaired after an MTBF, MTTF or MTBI event.

One measure of availability is what's touted as the Five Nines. As we've seen, MTBI and MTTR are tightly coupled. There is a significant amount of marketing literature promoting Five Nines availability for systems designed for critical environments. But what is meant by Five Nines? This particular metric is an artifact of the monolithic telecommunications industry, when the incumbent carriers exercised complete control over the equipment installed in their central offices. Five Nines availability was, and in many cases remains, a requirement of telco-grade Service Level Agreements (SLAs), defined by the ratio of system uptime (MTBI) to unplanned downtime (MTTR), not counting scheduled maintenance, planned updates, reboots, etc. Five Nines availability means an uptime of 99.999% per year or, expressed conversely, roughly five minutes and fifteen seconds of unplanned downtime per year, comparable to six sigma, a 99.99966% process capability. With downtime measured in minutes, it is vitally important that the system serviceability duration is minimized and that spare parts are available locally, e.g., hot spares for disk drive arrays.

With Five Nines reflecting the system elements and not the network, we can easily compute the network-level availability. For example, if two non-redundant serial network elements each have 99.999% availability, the total availability of the network is 0.99999 × 0.99999 = 0.99998, or 99.998%, which is Four Nines availability. Notice that with redundant components in a system we use MTBI, not MTBF, as the measure of the interval between system-level failures.
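A quick Python check of the downtime and series-availability arithmetic above (our sketch):

```python
# Five Nines downtime budget and series (non-redundant) network availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60

availability = 0.99999                                 # Five Nines
downtime_min = (1 - availability) * MINUTES_PER_YEAR
print(f"unplanned downtime per year: {downtime_min:.2f} minutes")  # ~5.26 min

# Two non-redundant elements in series: availabilities multiply.
network_availability = availability * availability
print(f"network availability: {network_availability:.5%}")         # ~99.998%
```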
By providing redundancy for all high-powered and rotating components, we increase the time the system takes to fail (MTBI) but reduce the MTTF/MTBF, because there are more components that can fail.

Computations

When evaluating a non-redundant system, all of the sub-systems' MTBF numbers can be viewed as a series sequence, with any single component or assembly failure causing a system failure. The total calculated MTBF will be less than the lowest individual component MTBF, as illustrated in the following formula:

    MTBF_tc = 1 / (1/MTBF_c1 + 1/MTBF_c2 + … + 1/MTBF_cN)    (standard case where all components aren't the same)

When we add redundant assemblies to the system, these combined components are measured as a single block and the system-level result is no longer MTBF but rather MTBI; the system keeps working even with a failed redundant component. For example, in a system with no redundant fans, the MTBF for the fan group may be 261,669 hours. After we add redundant fans, the MTBI is 3,370,238,148 hours, even though the MTBF is reduced because of the added fans. Because this MTBI is such a large number, the fan group is virtually eliminated from the equation for system MTBI. We add redundant components to increase the MTBI of the grouped components so they no longer adversely affect the system-level MTBI, because their MTBI values are so large. The system's single points of failure are reduced by taking the assemblies that traditionally exhibit high single-point failure rates, i.e., any component assemblies that move, rotate or work at the edge of their thermal or electrical envelope, and designing the system in such a way that these assemblies are redundant.

Power supplies are also items that fail because they are usually working at the higher end of their components' thermal and electrical limits. By adding redundant power supplies, the MTBF can go from 125,000 hours for a single supply to an MTBI of 326,041,999 hours for a redundant pair. Like the earlier example with the fans, this is a substantial change and will have a major positive impact on the system MTBI.
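To close, here is a short Python sketch of the series formula above, together with a common two-unit redundancy approximation (MTBI ≈ MTBF² / (2 × MTTR)). The component MTBFs and the 24-hour MTTR are our assumptions, not the paper's; the approximation lands in the same range as the redundant power supply figure quoted above, but it is not necessarily the model used to produce that figure.

```python
# Series MTBF for non-identical components (standard case), plus a common
# approximation for a repairable 1-of-2 redundant pair: MTBF**2 / (2 * MTTR).

def series_mtbf(mtbfs_hours):
    """Standard case: failure rates add, so MTBF_tc = 1 / sum(1/MTBF_i)."""
    return 1.0 / sum(1.0 / m for m in mtbfs_hours)

def redundant_pair_mtbi(mtbf_hours, mttr_hours):
    """Approximate MTBI of two identical redundant units with repair (assumed model)."""
    return mtbf_hours ** 2 / (2 * mttr_hours)

if __name__ == "__main__":
    # Series example with hypothetical component MTBFs: the weakest part
    # dominates, and the total falls below the lowest individual MTBF.
    print(round(series_mtbf([500_000, 250_000, 125_000])))   # ~71,429 hours

    # Redundant power supply pair, assuming a hypothetical 24-hour MTTR.
    print(round(redundant_pair_mtbi(125_000, 24)))           # ~325.5 million hours
```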