This white paper discusses system storage reliability. It begins by defining key reliability metrics like MTBF and MTBI and how they apply to non-redundant and redundant storage configurations. It then analyzes the reliability impacts of different RAID levels and drive types. RAID 6 is recommended for use with SATA drives to protect against double failures during rebuild. The paper also calculates reliability statistics for various hypothetical storage systems to illustrate these concepts.
White Paper – System Storage Reliability
Juha Salenius
Storage Systems
As a way of defining the various facets of reliability, let's start out by configuring a hypothetical server using six internal hot-swap disk drives that can be either Serial Attached SCSI (SAS) or Serial Advanced Technology Attachment (SATA) disk drives. If these drives were SAS disks and were set up as Just a Bunch of Drives (JBOD), a non-RAID configuration, then the MTBF would be all we'd need to define the computed failure rate. With an individual drive MTBF_SAS of 1,400,000 hours [1], the combined MTBF for six drives would compute to 233,333 hours using the following equation, where N is the number of the same component (in this case disk drives) and the subscripts are tc = total components and c = component:
MTBF_tc = MTBF_c / N (special case where all components are the same)
In contrast to the SAS MTBF_tc, using individual SATA drives instead, each exhibiting an individual MTBF_SATA of 1,000,000 hours [2], the combined MTBF_tc for six drives in a JBOD configuration is 166,667 hours.
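As a quick check, this series combination for identical drives can be computed directly. A minimal Python sketch (the drive counts and MTBF figures are simply the examples above):

# Series MTBF for N identical components: MTBF_tc = MTBF_c / N
def jbod_mtbf(drive_mtbf_hours, drive_count):
    return drive_mtbf_hours / drive_count

print(jbod_mtbf(1_400_000, 6))  # SAS example:  ~233,333 hours
print(jbod_mtbf(1_000_000, 6))  # SATA example: ~166,667 hours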
RAID Considerations:
Even with the drives configured as a RAID (levels 0, 1, 5 and 6) array the total MTBF for all the drives will
remain the same as above because it does not take into account any redundancy. MTBI (Mean Time
Between Interruption) can be used to highlight the difference in uptime based on a redundant
configuration. In a JBOD configuration the MTBF and the MTBI are the same. You’ll notice that once we
move from a non-redundant system to a system with redundant components we move from reporting
MTBF to reporting MTBI as the more meaningful term from a system perspective.
Consider the following RAID levels:
RAID 0 will not be considered here because it does not provide any failure protection, though it does provide higher throughput by striping the data stream across multiple drives, and it is usually used in combination with other RAID levels to increase their throughput. Certain RAID levels can be combined, for example RAID 10, RAID 50 and RAID 60. These configurations combine data striping across multiple drives with either mirroring or parity drives; they add complexity but improve performance.
RAID 1 mirrors data across two disk drives, which requires doubling the number of data drives.
RAID 5 uses parity to recover from a bad read; the data and parity are written in blocks across all drives in the array, with the parity distributed evenly among all drives. Because of the added parity information, a minimum of three disk drives is needed to implement RAID 5.
RAID 6 is a RAID 5 configuration with an additional parity bit.
[1] Adaptec Inc. Storage Advisors Weblog, 11/02/2005.
[2] Ibid.
With RAID 5 and 6 the ratio of data storage to parity storage increases as the number of spindles increases, so for a system with six drives there could be the equivalent of five data drives and one parity drive for RAID 5, or four data drives and two parity drives for RAID 6. The spindle overhead for RAID 5 with five drives is 20%, and doubling the total number of drives to 10 decreases the overhead to 10%.
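As an illustration of this trade-off, the parity (spindle) overhead can be written as a simple function of drive count; a small sketch using the figures above:

# Fraction of spindles consumed by parity for a given array width
def parity_overhead(total_drives, parity_drives):
    return parity_drives / total_drives

print(parity_overhead(5, 1))   # RAID 5, 5 drives:  0.20 (20%)
print(parity_overhead(10, 1))  # RAID 5, 10 drives: 0.10 (10%)
print(parity_overhead(6, 2))   # RAID 6, 6 drives:  ~0.33 (two parity drives' worth)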
Why use RAID 6 instead of RAID 5? RAID 5 provides protection against a single failed drive; RAID 6 provides protection against two concurrent failures. Once a drive has failed, the only exposure the array has to an additional failure is the time it takes to replace and rebuild the failed drive, the MTTR (Mean Time To Repair) interval. With RAID 6, the exposure to a single additional failure is eliminated because of the additional parity. If the system has a hot spare drive the time to repair will be significantly reduced: the rebuild can start immediately and the failed drive can be replaced during or after the rebuild. The probability of another hardware failure during the MTTR interval is extremely low. But there is another disk-related issue that could cause a problem during this MTTR interval, and that is a hard read error, more prevalent in SATA disks.
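To see why the chance of a second mechanical failure inside the repair window is so small, here is a rough sketch assuming exponentially distributed, independent drive failures; the 24-hour rebuild window is an illustrative assumption, and the MTBF and drive count are taken from the six-drive SATA example above:

import math

# P(at least one surviving drive fails during the rebuild window)
def p_second_failure(surviving_drives, drive_mtbf_hours, rebuild_hours):
    return 1 - math.exp(-surviving_drives * rebuild_hours / drive_mtbf_hours)

print(p_second_failure(5, 1_000_000, 24))  # ~0.00012, i.e. roughly 0.01%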
SAS and SATA Drive Considerations:
Both SAS and SATA drives have well-defined Bit Error Rates (BER). SAS drives are more robust than SATA drives, exhibiting a BER on the order of one out of every 10^15 bits read [3], equating to roughly one error per 100 terabytes (TB) read.
SATA drives are not as robust and exhibit BERs on the order of one in every 10^14 bits read [4], or roughly one error per 10 TB read.
What does this mean from a system perspective? To illustrate the issue, we'll start with a SATA disk array that has failed due to a hardware problem and is in the process of rebuilding. Let's make some assumptions: the array uses 500 GB drives, it has 10 drives in it, and the drives each have a one-in-10^14 read BER. The following formula estimates how many complete rebuilds, on average, can occur before an unrecoverable error is encountered:

rebuilds per URE = error interval / bits read per rebuild = 10^14 bits / (10 drives x 500 GB x 8 bits/byte) ≈ 2.5
It’s entirely possible that an array will be rebuilt 2.5 times in its life and there may be a non-recoverable
error occurring during those 2.5 rebuilds. This scenario only addresses 500GB drives, but the industry
has moved on and drive sizes have increased to 1TB and beyond, which makes this issue more
problematic. The larger the drives or the more drives in the array, the more likely a non-recoverable read error is to occur during a rebuild.
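The same estimate is easy to reproduce for other drive sizes and array widths; a minimal sketch of the calculation above (500 GB taken as 500 x 10^9 bytes):

# Average number of full-array reads (rebuilds) expected before one URE
def rebuilds_per_ure(drive_count, drive_bytes, bits_per_error=1e14):
    bits_read_per_rebuild = drive_count * drive_bytes * 8
    return bits_per_error / bits_read_per_rebuild

print(rebuilds_per_ure(10, 500e9))   # ~2.5 rebuilds (the 500 GB example)
print(rebuilds_per_ure(10, 2000e9))  # ~0.6 rebuilds with 2 TB drives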
Combining Optimal RAID and Hard Drive Choices (probability) – the math:
A concern with SATA disk technology is the Unrecoverable Read Error (URE), currently specified at one error per 10^14 bits read. A URE every 10^14 bits equates to an error every 2.4E10 (512-byte) sectors. This becomes critical as drive sizes increase.

[3] Ibid.
[4] Ibid.

When a drive fails in a 7-drive RAID 5 array made up of 2 TB SATA disks, the 6 remaining good
2 TB drives will have to be read completely to recover the missing data. As the RAID controller is
reconstructing the data, it is very likely to encounter a URE on the remaining media. At that point the RAID reconstruction stops.
Here's the math: the 6 surviving 2 TB drives hold roughly 12 TB, i.e. about 2.3E10 sectors or 9.6E13 bits, all of which must be read without error to complete the rebuild. The probability of hitting at least one URE is therefore

P(data loss) = 1 - (1 - 10^-14)^(9.6E13) ≈ 1 - e^(-0.96) ≈ 0.62

That is, there is a 62% chance of data loss due to an uncorrectable read error on a 7-drive (2 TB each) RAID 5 array with one failed disk, assuming a one-in-10^14 read error rate and ~23 billion sectors in 12 TB. Feeling lucky? [5]
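The figure can be reproduced directly; a small sketch, assuming 512-byte sectors and treating each bit read as an independent trial at the quoted error rate:

# Probability of at least one URE while reading the surviving capacity
def p_rebuild_failure(surviving_bytes, bits_per_error=1e14):
    bits_to_read = surviving_bytes * 8
    return 1 - (1 - 1 / bits_per_error) ** bits_to_read

print(p_rebuild_failure(6 * 2e12))  # six surviving 2 TB drives -> ~0.62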
RAID 6 is a technique that can be used to mitigate this failure during the rebuild cycle. This is important because it allows the system to recover from two failures: one failed drive and a subsequent single hard read error from the surviving disks in the array during the rebuild.
With customers looking to reduce system cost using SATA technology, the additional overhead for RAID 6 parity is becoming acceptable. But there are drawbacks to using RAID 6, which include longer write times due to the additional time required to generate the RAID 5 parity and then the second RAID 6 parity. When an error occurs during a read, both RAID 5 and RAID 6 arrays reduce read throughput due to bit recovery.
As we mentioned in the beginning of this article, we wanted to constrain this discussion to defining RAS
and addressing increased reliability with SATA disks in a RAID environment. But there are other areas
that should be addressed at the system level that also affect disk drive performance. One such area is
rotational vibration. This issue is a systemic problem in rack mount systems due to the critical thermal
constraints in 1U and 2U chassis in a NEBS environment. Rotational vibration effects are mitigated in
our mechanical designs and the techniques used are covered in a separate document.
[5] Robin Harris, "Does RAID 6 stop working in 2019?", StorageMojo, 27 February 2010 (http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/).
Reliability – MTTF and MTBF (Mean Time To Failure and Mean Time Between Failure)
With so much data exposed to catastrophic failure, a risk exacerbated in the cloud computing environment, it's important to maintain data integrity, especially in the medical, telecommunications and military markets. Systems designed for these markets must address three key areas: Reliability, Availability and Serviceability; system uptime must be maximized and mission-critical data must be maintained.
The term Mean Time To Failure (MTTF) is an estimate of the average, or mean, time until the initial failure of a design or component (you may not want to include external failures), or until a disruption in the operation of the product, process, procedure, or design occurs. A failure assumes that the product cannot be repaired, nor can it resume any of its normal operations, without taking it out of service.
MTTF is similar to Mean Time Between Failure (MTBF), though MTBF is typically slightly longer than MTTF because MTBF includes the repair time of the design or component. In other words, MTBF is the average time between failures including the average repair time, which is known as MTTR (Mean Time To Repair). [6]
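Expressed as a simple relationship under the definitions above (a common simplification rather than a formal derivation):

MTBF = MTTF + MTTR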
What is Reliability? Per the Six Sigma SPC (Statistical Process Control) Quality Control Dictionary,
Reliability is the probability for any given design or process to execute within the anticipated operational
or design margin for a specified period of time. In addition, the system will work under defined
operating conditions with a minimum amount of stoppage due to a design or process error. Some
indicators for reliability are MTBF (Mean Time Between Failures) computations, ALT (Accelerated Life
Test using temperature chambers), MTTF (Mean Time To Failure) computations, and Chi-Square [7] (statistical difference between observed and expected).
MTBF is a calculated indication of reliability. From a system perspective, any reliable assembly must
satisfactorily perform its intended function under some defined circumstances which may not be part of
the MTBF calculation’s environment. This may include conditions such as operating in varying ambient
temperatures. MTBF addresses reliability in a very controlled and limited scope. Traditionally, MTBF
calculations are based on the Telcordia Technologies Special Report SR-332, Issue 1, Reliability
Prediction Procedure for Electronic Equipment. The results from these calculations can be used to
roughly assist customers in the evaluation of the individual products, but should not be used as a
representation or guarantee of reliability or performance of the product. MTBF is only a gross
representation of how reliable a system will be under clearly defined conditions, clearly not real world.
If the results of an MTBF calculation cannot tell us when components will wear out in the real world or which product is better than another, and MTBF does not provide a reliable metric for field failures, then why use it? Because it allows us to estimate how often a system will fail under steady-state environmental conditions. Early in the design cycle, component MTBF can be used to determine which parts are most likely to fail first, enabling engineering to improve design robustness by selecting more
robust components or designing with hardened assemblies. There are three methods used in MTBF calculations: 1. the black box, 2. the black box plus laboratory inputs, and 3. the black box plus field failure inputs. While the industry traditionally uses Method 1, Kontron Inc. CBPU/CRMS uses a combination of all three methods: black box with lab inputs coupled with field data where available. For large aggregated component assemblies, such as computer baseboards, there are typically vendor-calculated MTBF figures; for passive components, there is industry-standard failure rate data; and for proprietary components, lab or field data is available.

[6] Paraphrasing the Six Sigma SPC's Quality Control Dictionary and Glossary, http://www.sixsigmaspc.com/dictionary/glossary.html
[7] Ibid.
Availability – MTBI (Mean Time Between Interruption)
If MTTF addresses failures measured from initial power-on, and MTBF addresses failures measured from the previous failure including the repair time (MTTR), then what is meant by Mean Time Between Interruption (MTBI)? It addresses designs that provide redundancy, allowing a redundant component to fail without halting (failing) the system. The system may not run at full speed during the time it takes to replace or rebuild the failed component, but it will run. MTBI durations are much longer than MTBF intervals, which is better, and they can span multiple failures (as with RAID 6) provided the failed redundant components are replaced.
Serviceability – MTTR (Mean Time To Repair)
This term refers to how quickly and easily a system can be repaired after an MTBF, MTTF or MTBI event.
One measure of availability is what’s touted as the Five Nines. As we’ve seen, MTBI and MTTR are
tightly coupled. There is a significant amount of marketing literature promoting Five Nines availability
for systems designed for critical environments. But what is meant by Five Nines? This particular metric
is an artifact of the monolithic telecommunications industry when the incumbent carriers exercised
complete control of the equipment installed in their central offices. Five Nines availability was, and in
many cases remains, a requirement of Telco-grade Service Level Agreements (SLAs), defining the ratio of system uptime (MTBI) to total time (MTBI plus unplanned downtime, MTTR), not counting scheduled maintenance, planned updates, reboots, etc. Five Nines availability means an uptime of 99.999% per year or, expressed conversely, roughly five and a quarter minutes of unplanned downtime per year, comparable to six sigma's 99.99966% process capability. With downtime measured in minutes, it is
vitally important that the system serviceability duration is minimized and any spare parts are available
locally, e.g., hot spares for disk drive arrays.
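The downtime figures follow directly from the availability ratio; a brief sketch (525,600 minutes being one 365-day year):

# Unplanned downtime per year implied by an availability ratio
def downtime_minutes_per_year(availability):
    return (1 - availability) * 365 * 24 * 60

print(downtime_minutes_per_year(0.99999))  # Five Nines: ~5.26 minutes per year
print(downtime_minutes_per_year(0.99998))  # 99.998%:    ~10.5 minutes per year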
With Five Nines reflecting the system elements and not the network, we can easily compute the
network level availability. For example, if two non-redundant serial network elements each have
99.999% availability, the total availability of the network is 0.99999 X 0.99999 = 0.99998 or 99.998% or
Four Nines availability. Notice that with redundant components in a system we use MTBI not MTBF as a
measure of the interval between system level failures. By providing redundancy for all high powered
and rotating components we increase the time the system takes to fail (MTBI) but reduce the
MTTF/MTBF because there are more components to fail.
Computations
When evaluating a non-redundant system, all sub-systems' MTBF numbers can be viewed as a series configuration in which any single component or assembly failure causes a system failure. The total calculated MTBF will be less than the lowest individual component MTBF, as illustrated in the following formula:
MTBF_tc = 1 / (1/MTBF_1 + 1/MTBF_2 + ... + 1/MTBF_n) (standard case where the components aren't all the same)
When we add redundant assemblies to the system, these combined components are measured as a
single block and the system level result is no longer MTBF but rather MTBI; the system keeps working
even with the failed redundant component. For example, in a system with no redundant fans, the MTBF
for the fan group may be 261,669 hours. After we add redundant fans, the MTBI is 3,370,238,148 hours
even though the MTBF is reduced because of the added fans. Because this MTBI is such a large number,
the fan group is virtually eliminated from the equation for system MTBI. We add redundant
components to increase the MTBI of the grouped components so they no longer adversely affect the
system-level MTBI because their MTBI values are so large. The system's single points of failure are reduced by taking the assemblies that traditionally exhibit the highest failure rates (any component assemblies that move, rotate, or work at the edge of their thermal or electrical envelope) and designing the system in such a way that these assemblies are redundant.
Power supplies are also items that fail because they are usually working at the higher end of their components' thermal and electrical limits. By adding redundant power supplies, the MTBF can go from 125,000 hours for a single supply to an MTBI of 326,041,999 hours for a redundant pair. Like the earlier example with the fans, this is a substantial change and will have a major positive impact on the system MTBI.
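As a rough sketch of how such figures combine: the series (non-redundant) case follows the formula above, while the redundant-pair estimate below uses the common MTBF^2 / (2 x MTTR) approximation with an assumed 24-hour repair time; the exact MTBI values quoted in this paper depend on its own model and MTTR assumptions:

# Series (non-redundant) combination: any single failure fails the system
def series_mtbf(mtbf_hours_list):
    return 1 / sum(1 / m for m in mtbf_hours_list)

# Common approximation for a repairable pair of identical redundant units
def redundant_pair_mtbi(unit_mtbf_hours, mttr_hours):
    return unit_mtbf_hours ** 2 / (2 * mttr_hours)

print(series_mtbf([125_000, 261_669]))   # two sub-systems in series: ~84,600 hours
print(redundant_pair_mtbi(125_000, 24))  # ~3.3e8 hours, the same order as the paper's pair figure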