This document discusses common misunderstandings about MTBF (mean time between failures) and how using MTBF alone can lead to incorrect reliability estimates. It provides examples of how assuming components all have the same MTBF does not translate to the overall system meeting that MTBF, and how using an exponential distribution fitted to MTBF data may not accurately model actual failure rate trends that change over time. The document emphasizes using additional reliability metrics and distribution fitting for more accurate reliability analysis and decision making.
Common Mistakes with MTBF Explained
Common Mistakes with MTBF
MTBF is widely used to describe the reliability of a component or system. It is
also often misunderstood and used incorrectly. In some sense, the very name
“mean time between failures” contributes to the misunderstanding. The objective
of this paper is to explore the nature of the MTBF misunderstandings and the
impact on decision-making and program costs.
Mean-Time-Between-Failure (MTBF) as defined by MIL-STD-721C Definition of
Terms for Reliability and Maintainability, 12 June 1981, is
A basic measure of reliability for repairable items: The mean number of life
units during which all parts of the item perform within their specified limits, during
a particular measurement interval under stated conditions.
The related measure, Mean-Time-To-Failure (MTTF), is defined as
A basic measure of reliability for non-repairable items: The total number of
life units of an item divided by the total number of failures within that population,
during a particular measurement interval under stated conditions.
These definitions are very similar. The subtle difference is important, yet the
confusion is further complicated when attempting to quantify MTBF or MTTF. In
both cases we often use the calculation as described within the MTTF definition.
This is what we would do for any group of values for which we wanted to estimate
the mean (average) value. Tally the number of hours all units have operated and
divide by the number of failures. This provides an unbiased (statistically
speaking) estimate of the population mean.
Keep in mind that time-to-failure data is often not normally distributed. The
underlying distribution for life data starts at time zero and increases. The
exponential family of distributions tends to describe life data well and is
commonly used. The unbiased estimate for the mean value of an exponential
distribution is as described for the MTTF definition above.
When working with data from a repairable system, one should use the
Nonhomogeneous Poisson Process (NHPP), which is a generalization of the
Poisson process. The estimate for the failure intensity can have various
models, yet it is often assumed to follow the exponential model. This results in
the common estimate of MTBF of

$$\mathrm{MTBF} = \frac{T(k)}{k}$$

where T(k) is the total operating time of one or more systems and k is the
cumulative number of failures. [1]
This introduces the first source of confusion when considering MTBF, failure
rates, or hazard rates. Since we intuitively use the simple calculation to estimate
the mean value, many do not then apply that estimate with the reliability
function of the appropriate distribution.
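The point estimate T(k)/k described above can be sketched in a few lines. This is a minimal illustration only; the function name and the fleet figures are hypothetical and do not come from the paper.

```python
def mtbf_point_estimate(total_operating_hours: float, failures: int) -> float:
    """Return T(k) / k: total operating time divided by cumulative failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for a point estimate")
    return total_operating_hours / failures


# Hypothetical fleet: three systems with these operating hours, 4 failures total.
hours_per_system = [1200.0, 950.0, 1850.0]
print(mtbf_point_estimate(sum(hours_per_system), 4))  # 1000.0
```

The number that comes out is only a mean. It says nothing yet about the fraction of units expected to survive a given duration.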
For example, if a vendor states the product has an MTTF of 16,000 hours, and
we wanted to know how many out of 100 units will fail in 8,000 hours, the
appropriate calculation is
$$R(t) = e^{-\left(\frac{t}{\theta}\right)}$$

$$R(8{,}000) = e^{-\left(\frac{8{,}000}{16{,}000}\right)} = 0.61$$
such that we expect about 61 of the 100 units (61%) to operate for the full
8,000 hours, and about 39 to fail.
This is assuming an exponential distribution and non-repairable units. Given only
an MTTF value, the most likely distribution to use without additional information is
the exponential.
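The 100-unit example above can be reproduced with a short sketch. It assumes, as the text does, an exponential distribution and non-repairable units; the function name is illustrative.

```python
import math


def exponential_reliability(t: float, theta: float) -> float:
    """R(t) = exp(-t / theta) for an exponential life distribution."""
    return math.exp(-t / theta)


units = 100
r = exponential_reliability(8_000, 16_000)
print(round(r, 2))             # 0.61
print(round(units * r))        # 61 expected survivors
print(round(units * (1 - r)))  # 39 expected failures
```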
Extending this same example to determine the reliability at 16,000 hours, we find
that only about 1/3 of the units would be expected to still be operating. If
someone holds the common misunderstanding of the failure rate value that
MTBF represents, it can lead to significant loss of resources or mission
readiness.
For example, a radar detection OEM received a contract to design and
manufacture a specific system with 5,000 hours MTBF. The specification
included functionality, mission duration and expected equipment duty cycle,
along with minor variations to the airborne inhabited environment. The contract
specified 5,000 hours MTBF as the sole reliability requirement. The design
team designed, built and tested the system and achieved better than 5,000 hours
MTBF.
The Air Force found the unit to be the leading cause of aborted missions
(equipment related) and complained to the OEM. A careful analysis of the field
data proved the units actually achieved almost 6,000 hours MTBF, thus exceeding
the specification. Of course, this didn't change the data on aborted missions. In
part the OEM's equipment just happened to be the least reliable equipment on
the aircraft.
A short discussion with the team found some misunderstanding and that “errors
had been made”. The Air Force procurement team and the prime contractor
personnel mistakenly thought the term “5,000 hours MTBF” meant at least 5,000
failure-free operating hours. In reality the term, in this case, meant that
approximately two-thirds of the units are expected to have at least one failure
over a period of 5,000 operating hours. And, in fact, the product performed about
20% better than the specification.
The problem was exacerbated by the mission requiring the use of three of the
OEM's units during the mission. In reliability terms the equipment was in series,
meaning that if any one of the three units failed, the crew had to abort the
mission. Therefore, the probability of successfully completing 1,000 hours of
operation where all three units have to work is
$$R_{sys}(t) = R_1(t) \times R_2(t) \times R_3(t)$$

$$R_{sys}(1{,}000) = e^{-\left(\frac{1{,}000}{5{,}000}\right)} \times e^{-\left(\frac{1{,}000}{5{,}000}\right)} \times e^{-\left(\frac{1{,}000}{5{,}000}\right)} = 0.55$$
Even though each of the individual units has about an 82% reliability (or
probability of surviving 1,000 hours), the three in series have only a 55%
reliability, or probability that all three will operate for 1,000 hours.
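The series calculation can be checked with a small sketch, assuming independent units with exponential lifetimes as in the text. The helper name is an illustrative choice.

```python
import math


def series_reliability(unit_reliabilities):
    """A series system works only if every unit works, so multiply."""
    return math.prod(unit_reliabilities)


r_unit = math.exp(-1_000 / 5_000)                  # one unit over 1,000 hours
print(round(r_unit, 2))                            # 0.82
print(round(series_reliability([r_unit] * 3), 2))  # 0.55
```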
Acknowledging either a specification error or a misunderstanding of the metric,
the team still had the issue of aborted missions. Simply changing the
reliability requirements would not change the design of the equipment without a
significant re-design. Further discussion found that installing a warm standby
unit permitted the rapid replacement of a failed unit during the mission, thus
effectively and significantly reducing mission aborts. The reliability of a
3-out-of-4 system is
$$R_{sys}(t) = 1 - \sum_{i=0}^{m-1} \binom{n}{i} R^{i}(t) \left(1 - R(t)\right)^{n-i}$$

where m is the number of units, out of n total, that have to be operating for the
overall system to be operating. [2] In the example above, m=3 and n=4, and the
example has a reliability for a single unit of about 82%. For three in series the
system reliability drops to about 55%. The 3-out-of-4 parallel system reliability
calculation results in 85%. Suffice it to say the reliability is significantly
improved.
Note that using reliability in the above function does not require the use of
MTBF. The reliability term can come from any distribution.
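The redundancy formula above can be sketched as follows, assuming identical and independent units. The argument names `k` (units required) and `n` (units installed) are naming choices for this sketch, not the paper's notation.

```python
import math


def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n identical, independent units work."""
    # Subtract the probability that fewer than k units are working.
    return 1.0 - sum(
        math.comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k)
    )


r_unit = math.exp(-1_000 / 5_000)  # single unit, 1,000 hours, 5,000 hours MTBF
print(round(k_out_of_n_reliability(3, 4, r_unit), 2))  # 0.85
print(round(k_out_of_n_reliability(3, 3, r_unit), 2))  # 0.55 (plain series case)
```

Because the function only takes a unit reliability `r`, that value could equally come from a Weibull or lognormal fit.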
Calculating or using only the MTBF value to represent a product's reliability can
lead to more than misunderstanding. If the product performs better or worse than
expected, you may have unnecessary spares expenses or not enough spares to
continue effectively. Another issue that may arise is an unexpected increase in
failure rate after a few years of a very low failure rate. The single
parameter, MTBF, does not provide information about the changing nature of
failure rates over time.
The following graph is a plot of the percentage of the population that has failed
over time, or cumulative distribution function (CDF) plot. The red line is the plot
of the fitted exponential distribution. The data and fitted line represent a failure
rate trend that is declining over time. Over time the total number of failures
continues to rise, yet the slope is less than the slope for the exponential
distribution. This is actual data; the time scale and title have been removed to
protect the source. The theta of the exponential distribution is 49,093 hours,
whereas the Weibull distribution has a beta of 0.5823 and an eta of 31,344 hours.
On this plot, the exponential distribution has a slope of 1. The fitted Weibull
distribution slope is less than one. Keep in mind that the exponential and Weibull
distributions are members of the exponential family of distributions. The formula
for the reliability function of the 2-parameter Weibull distribution is
$$R(t) = e^{-\left(\frac{t}{\eta}\right)^{\beta}}$$

where the beta is the slope and eta is the characteristic life. Setting beta to 1
reduces the formula to the reliability function for the exponential distribution.

$$R(t) = e^{-\left(\frac{t}{\theta}\right)}$$
where theta is the characteristic life and is also the inverse of the failure rate and
commonly theta is called MTTF or MTBF.
The plot of the CDF is related to the reliability function. Reliability is the
percentage of units surviving over a specific duration. And the CDF plots the
percentage of units failed over a specific duration. The CDF is represented by
F(t) and the CDF for the Weibull distribution is
$$F(t) = 1 - e^{-\left(\frac{t}{\eta}\right)^{\beta}}$$

therefore,

$$R(t) = 1 - F(t)$$
Essentially, the vertical axis on the above plot is reversed: the CDF rises from
0 to 100%, whereas the reliability function falls from 100% to 0%.
Consider the above CDF plot again. If the underlying data is represented by only
one value, say MTBF, we are in effect representing the data with the ill-fitted red
line. Only at one point in time does the distribution actually represent the data:
the point in time where they cross. Thus, if we need to make a decision prior
to that point based on the expected reliability of the system, and only MTBF is
available, we would use the exponential distribution. For example, at 100 hours
we find the MTBF-based reliability to be
$$R(t) = e^{-\left(\frac{t}{\theta}\right)}$$

$$R(100) = e^{-\left(\frac{100}{49{,}093}\right)} = 0.9980$$
We get a number and can make a decision if the system meets our reliability
requirements. Whereas, using the fitted reliability distribution, we have a
description of the data using two parameters. Calculating the reliability at the
same point of time using the Weibull distribution we find
$$R(t) = e^{-\left(\frac{t}{\eta}\right)^{\beta}}$$

$$R(100) = e^{-\left(\frac{100}{31{,}344}\right)^{0.5823}} = 0.965$$
The difference in estimates may or may not make a difference in the decision, yet
we often attempt to use the best available data when making important decisions.
The estimate provided by the exponential distribution is potentially misleading
and, in the above example, overstates the system's reliability. This error varies
and gets worse when examining a shorter period of time.
This error may lead to accepting a system that actually does not meet
the requirements. Or it may cause understocking of needed spare parts for
failures that are likely to occur, leading to reduced mission readiness.
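The comparison above can be explored at several durations with a short sketch using the fitted parameters quoted in the text. Measuring the error as the ratio of predicted failure fractions is an assumption of this sketch, as are the function names and the chosen time points; it is one plausible reading of how the error "gets worse" at shorter durations.

```python
import math


def exponential_unreliability(t, theta):
    """F(t) = 1 - exp(-t / theta)."""
    return -math.expm1(-t / theta)


def weibull_unreliability(t, beta, eta):
    """F(t) = 1 - exp(-(t / eta) ** beta)."""
    return -math.expm1(-((t / eta) ** beta))


THETA = 49_093              # fitted exponential, hours
BETA, ETA = 0.5823, 31_344  # fitted Weibull

for t in (10, 100, 1_000, 10_000):
    f_exp = exponential_unreliability(t, THETA)
    f_wbl = weibull_unreliability(t, BETA, ETA)
    # Ratio > 1 means the exponential fit predicts too few failures.
    print(f"t={t:>6} h  R_exp={1 - f_exp:.4f}  R_weibull={1 - f_wbl:.4f}  "
          f"failure ratio={f_wbl / f_exp:.1f}")
```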
The following CDF plot shows a different situation. Here the data tends to
increase in failure rate over time and has a slope greater than one. Again the
exponential (MTBF) estimate does not reflect the actual data very well, except at
one point.
Again, the title and vertical axis have been removed from this plot of actual
data. The theta for the exponential distribution is 20,860 hours. The fitted
parameters for the Weibull distribution are a beta of 1.897 and an eta of 23,507
hours.
Performing the reliability calculations for the two distributions at 100 hours
gives the following two results
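$$R(t) = e^{-\left(\frac{t}{\theta}\right)}$$

$$R(100) = e^{-\left(\frac{100}{20{,}860}\right)} = 0.9952$$

is for the exponential distribution, and for the Weibull distribution

$$R(t) = e^{-\left(\frac{t}{\eta}\right)^{\beta}}$$

$$R(100) = e^{-\left(\frac{100}{23{,}507}\right)^{1.897}} = 0.999968$$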
And while this difference may or may not change the decision based on the
system reliability, using the exponential distribution may lead to costly mistakes.
In this case, the system reliability estimate may be mistakenly represented as
being too low. This may lead to a cancellation of the program, or the overstocking
of spare parts.
Of course, in both examples, the difference between the two fitted curves depends
on which time point is selected. And if the duration of interest is beyond the
intersection of the two fitted lines, then the mistakes lead to different results.
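The intersection mentioned above can be located in closed form. Setting exp(-t/theta) equal to exp(-(t/eta)^beta) and solving for t > 0 gives t = (eta^beta / theta)^(1/(beta - 1)) when beta is not 1. The sketch below applies this to the two fitted parameter sets quoted in the text; the function names and dictionary labels are illustrative.

```python
import math


def crossing_time(theta, beta, eta):
    """Time t > 0 where exp(-t/theta) equals exp(-(t/eta)**beta), beta != 1."""
    if beta == 1:
        raise ValueError("with beta == 1 the curves coincide or never cross")
    return (eta**beta / theta) ** (1.0 / (beta - 1.0))


def exponential_reliability(t, theta):
    return math.exp(-t / theta)


def weibull_reliability(t, beta, eta):
    return math.exp(-((t / eta) ** beta))


fits = {
    "decreasing failure rate": dict(theta=49_093, beta=0.5823, eta=31_344),
    "increasing failure rate": dict(theta=20_860, beta=1.897, eta=23_507),
}
for label, p in fits.items():
    t_star = crossing_time(**p)
    # Both reliabilities agree at t_star; on either side the error changes sign.
    print(label, round(t_star),
          round(exponential_reliability(t_star, p["theta"]), 3),
          round(weibull_reliability(t_star, p["beta"], p["eta"]), 3))
```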
Another area of misleading use of MTBF is the lack of reliability apportionment.
The confusion comes from the notion of the weakest link limiting the reliability of
a system, as in the excerpt from the poem by Oliver Wendell Holmes, “The
Deacon's Masterpiece, or, the Wonderful One-Hoss Shay: A Logical
Story” [3], where the chaise was built so that every part was as sturdy and
strong as all the other parts. Then,
--What do you think the parson found,
When he got up and stared around?
The poor old chaise in a heap or mound,
As if it had been to the mill and ground!
You see, of course, if you're not a dunce,
How it went to pieces all at once,
-- All at once, and nothing first,
-- Just as bubbles do when they burst.
In practice, products do not fail all at once and completely. In more complex
systems, while many possible components may be the first to fail, it may be
unclear exactly which component will fail first. The replacement of that
component generally does not improve the probability of failure of the other
components, thus a different component may cause the next failure.
Back to the weakest link idea. In a series system, in reliability terms, if any one
element of a system fails, then the system fails. Given technical and design
limitations, there is one element that is inherently weaker than the rest of the
system. Suppose we know the compressor is the weakest link in a product
and it has an MTBF of 5,000 hours. Well, then no other component needs to be
any better than 5,000 hours MTBF. Right? And one might say that for a system
that has no field replaceable units, upon the first failure the unit has to be
totally replaced anyway. Basically, the thought is that since the compressor limits
the life of the product (the weakest link), no other component needs to be better
than 5,000 hours MTBF.
Given a system goal of 5,000 hours MTBF and using the logic from above and
from the One-Hoss Shay, we create a complex product with each subsystem
designed and tested to the same goal, 5,000 hours MTBF. Let's assume the
product has a display, circuit board, and power supply, in addition to the
compressor mentioned above.
For the sake of argument, let's assume each of the four subsystems does actually
have an exponential distribution for expected time to failure. This means that
each subsystem has a 1/5,000 chance of failure every hour of operation and it
stays constant over time. Inverting the MTBF to find the failure rate per hour, we
find 1/5,000 = 0.0002 failures per hour. And let's say that over a two-year period
the systems are expected to operate 2,500 hours.
“No problem, everything meets at least 5,000 hours MTBF”, one might say. Let's
do the math.
$$R_{sys}(t) = R_1(t) \times R_2(t) \times R_3(t) \times R_4(t)$$

$$R_{sys}(2{,}500) = e^{-\left(\frac{2{,}500}{5{,}000}\right)} \times e^{-\left(\frac{2{,}500}{5{,}000}\right)} \times e^{-\left(\frac{2{,}500}{5{,}000}\right)} \times e^{-\left(\frac{2{,}500}{5{,}000}\right)} = 0.135$$
The more subsystems and components designed and selected to just meet the
5,000 hours MTBF, the worse the actual result. The system reliability of 13.5%
over 2,500 hours assumes that each subsystem achieves only 5,000 hours MTBF. In
practice each will achieve some other number, yet the point is that if, in design
and practice, each subsystem only achieves the system goal, the resulting system
reliability will be surprisingly low.
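The effect of adding subsystems can be sketched as follows, assuming independent exponential subsystems that all have the same MTBF, as in the example above. The function name and the list of subsystem counts are illustrative.

```python
import math


def series_reliability_identical(n_subsystems, mtbf_hours, t_hours):
    """n independent exponential subsystems in series, all with the same MTBF."""
    # Failure rates add in series, so the exponent scales with n.
    return math.exp(-n_subsystems * t_hours / mtbf_hours)


for n in (1, 2, 4, 8):
    print(n, round(series_reliability_identical(n, 5_000, 2_500), 3))
# 1 0.607
# 2 0.368
# 4 0.135
# 8 0.018
```

Under these assumptions the four-subsystem product behaves like a single exponential item with an MTBF of 5,000 / 4 = 1,250 hours, well short of the 5,000 hour system goal.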
Another assumption in the above example is the use of exponential distributions
to describe each subsystem. This is often not true, and using a Weibull or
lognormal distribution may be appropriate. For example, the compressor most
likely has a wearout type of failure mechanism. And we are able to find a set of
data that with analysis provides a good fit to a Weibull distribution. The Weibull
parameters for the compressor are a beta of 2 and an eta of 5,642 hours (note:
this would be estimated as a theta of 5,000 hours for a fitted exponential
distribution).
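Using the new information with the same example as above, and treating all four
subsystems as following this same Weibull distribution (as the calculation below
does), we have

$$R_1(2{,}500) = e^{-\left(\frac{2{,}500}{5{,}642}\right)^{2}} = 0.82$$

$$R_{sys}(t) = R_1(t) \times R_2(t) \times R_3(t) \times R_4(t)$$

$$R_{sys}(2{,}500) = (0.82)^{4} = 0.45$$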
The result is better because, in the early portion of the life distribution, the
failure rate is relatively low. It is only later, after about 5,000 hours, that the
failure rate climbs above that of the estimated exponential distribution. The
exponential estimate understates the reliability at 2,500 hours.
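The link between the compressor's Weibull parameters and its 5,000 hour mean can be checked with the standard Weibull mean, eta times Gamma(1 + 1/beta). The sketch below also recomputes the single-subsystem reliability at 2,500 hours under both models; the function names are illustrative.

```python
import math


def weibull_mean(beta, eta):
    """Mean life of a 2-parameter Weibull: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)


def weibull_reliability(t, beta, eta):
    return math.exp(-((t / eta) ** beta))


BETA, ETA = 2, 5_642  # compressor parameters quoted in the text

mean_life = weibull_mean(BETA, ETA)
r_weibull = weibull_reliability(2_500, BETA, ETA)
r_exponential = math.exp(-2_500 / 5_000)  # same mean, exponential assumption

print(round(mean_life))         # 5000
print(round(r_weibull, 2))      # 0.82
print(round(r_exponential, 2))  # 0.61
```

Both models share the same mean life, yet they disagree substantially about the fraction of compressors surviving 2,500 hours.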
Conclusion
We have the math tools and understanding to use the appropriate distributions to
describe the expected failures or reliability functions. Using MTBF for
convenience, convention or “because the customer expects that metric” tends
to lead to poor estimates and misunderstandings. Avoiding the use of the MTBF
simplifications can only improve the description of the underlying predictions, test
or field data results.
Using the best available data to make decisions implies that we use the best
available tools to represent the data. Doing so can save you and your
organization from costly errors within your program.
Endnotes
[1] Paul A. Tobias and David C. Trindade. 1998. Applied Reliability. 2nd ed.
Chapman Hall/CRC Press, page 367.
[2] Patrick D. T. O'Connor, with David Newton and Richard Bromley. 2002.
Practical Reliability Engineering. 4th ed. Chichester: Wiley, page 166.
[3] Oliver Wendell Holmes, “The Deacon's Masterpiece, or, the Wonderful
One-Hoss Shay: A Logical Story”, Atlantic Monthly, September 1858.