Common Mistakes with MTBF

Paper on the issues with MTBF, published in the Spring 2011 issue of the RMSP Journal.

MTBF is widely used to describe the reliability of a component or system. It is also often misunderstood and used incorrectly. In some sense, the very name “mean time between failures” contributes to the misunderstanding. The objective of this paper is to explore the nature of the MTBF misunderstandings and the impact on decision-making and program costs.

Mean-Time-Between-Failure (MTBF) as defined by MIL-STD-721C Definition of Terms for Reliability and Maintainability, 12 June 1981, is

A basic measure of reliability for repairable items: The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions.

The related measure, Mean-Time-To-Failure (MTTF), is defined as

A basic measure of reliability for non-repairable items: The total number of life units of an item divided by the total number of failures within that population, during a particular measurement interval under stated conditions.




These definitions are very similar. The subtle difference is important, yet the confusion is further complicated when attempting to quantify MTBF or MTTF. In both cases we often use the calculation described within the MTTF definition. This is what we would do for any group of values for which we wanted an estimate of the mean (average) value. Tally the number of hours all units have operated and divide by the number of failures. This provides an unbiased (statistically speaking) estimate of the population mean.

Keep in mind that time-to-failure data is often not normally distributed. The underlying distribution for life data starts at time zero and increases. The exponential family of distributions tends to describe life data well and is commonly used. The unbiased estimate for the mean value of an exponential distribution is as described for the MTTF definition above.

When working with data from a repairable system, one should use the Nonhomogeneous Poisson Process (NHPP), which is a generalization of the Poisson process. The estimate for the failure intensity can have various models, yet it is often assumed to be the exponential model. This results in the common estimate of MTBF of

$$\mathrm{MTBF} = \frac{T(k)}{k}$$

where T(k) is the total time of one or more system operations and k is the cumulative number of failures. [1]
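The point estimate described above (total operating hours divided by the number of failures) can be sketched in a few lines of Python. This is only an illustration: the function name `mtbf_point_estimate` and the operating hours below are hypothetical and do not come from the paper.

```python
def mtbf_point_estimate(operating_hours, failure_count):
    """Total operating time across all units divided by the failure count."""
    if failure_count <= 0:
        raise ValueError("at least one failure is needed for a point estimate")
    return sum(operating_hours) / failure_count


# Hypothetical data: hours accumulated by four units, with three failures in total.
unit_hours = [1200.0, 950.0, 2100.0, 1750.0]
observed_failures = 3

estimate = mtbf_point_estimate(unit_hours, observed_failures)
print(f"Estimated MTBF: {estimate:.0f} hours")
```

The estimate is a single number. It says nothing about how the failures are spread over time, which is the limitation the rest of the paper explores.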
This introduces the first source of confusion when considering MTBF, failure rates, or hazard rates. Since we intuitively use the simple calculation to estimate the mean value, many do not then apply that estimate with the reliability function of the appropriate distribution.

For example, if a vendor states the product has an MTTF of 16,000 hours, and we want to know how many out of 100 units will fail in 8,000 hours, the appropriate calculation is

$$R(t) = e^{-\left(\frac{t}{\theta}\right)}$$

$$R(8{,}000) = e^{-\left(\frac{8{,}000}{16{,}000}\right)} = 0.61$$

such that we expect 61 out of the 100 units, or 61%, to operate for the full 8,000 hours, and about 39 units to fail.

This is assuming an exponential distribution and non-repairable units. Given only an MTTF value, the most likely distribution to use without additional information is the exponential.

Extending this same example to determine the reliability at 16,000 hours, we find that only about 1/3 of the units would be expected to still be operating. If someone holds this common misunderstanding of the failure rate that MTBF represents, it can lead to a significant loss of resources or mission readiness.

For example, a radar detection OEM received a contract to design and manufacture a specific system with 5,000 hours MTBF. The specification included functionality, mission duration and expected equipment duty cycle, along with minor variations to the airborne inhabited environment. The contract specified 5,000 hours MTBF as the sole reliability requirement. The design team designed, built and tested the system and achieved better than 5,000 hours MTBF.

The Air Force found the unit to be the leading cause of aborted missions (equipment related) and complained to the OEM. A careful analysis of the field data showed the units actually achieved almost 6,000 hours MTBF, thus exceeding the specification. Of course, this didn't change the data on aborted missions.
In part, the OEM's equipment just happened to be the least reliable equipment on the aircraft.

A short discussion with the team found some misunderstanding and that "errors had been made". The Air Force procurement team and the prime contractor personnel mistakenly thought the term '5,000 hours MTBF' meant at least 5,000 failure-free operating hours. In reality the term, in this case, meant that approximately two-thirds of the units are expected to have at least one failure over a period of 5,000 operating hours. And, in fact, the product performed about 20% better than the specification.

The problem was exacerbated by the mission requiring the use of three of the OEM's units during the mission. Reliability speaking, the equipment was in series, meaning that if any one of the three units failed, the crew had to abort the mission. Therefore, the probability of successfully completing 1,000 hours of operation where all three units have to work is

$$R_{sys}(t) = R_1(t) \times R_2(t) \times R_3(t)$$

$$R_{sys}(1{,}000) = e^{-\left(\frac{1{,}000}{5{,}000}\right)} \times e^{-\left(\frac{1{,}000}{5{,}000}\right)} \times e^{-\left(\frac{1{,}000}{5{,}000}\right)} = 0.55$$

Even though each of the individual units has about an 82% reliability (or probability of surviving 1,000 hours), the three in series have only a 55% reliability, or probability that all three will operate for 1,000 hours.

Acknowledging either a specification error or a misunderstanding of the metric, the team still had the issue of aborted missions. Simply changing the reliability requirements would not change the design of the equipment without a significant re-design. Further discussion found that installing a warm standby unit permitted the rapid replacement of a failed unit during the mission, thus effectively and significantly reducing mission aborts. The reliability of a 3-out-of-4 system is

$$R_{sys}(t) = 1 - \sum_{i=0}^{m-1} \binom{n}{i} R^{i}(t) \left(1 - R(t)\right)^{n-i}$$

where m is the number of units, out of n total, that have to be operating for the overall system to be operating. [2] In the example above, m = 3 and n = 4, and a single unit has a reliability of about 82%. For three in series the system reliability drops to about 55%. The calculation for the 3-out-of-4 parallel system results in 85%.
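The series and 3-out-of-4 figures above can be reproduced with a short Python sketch. It assumes identical, independent units with exponentially distributed times to failure, as in the example. The function and parameter names (`required`, `total`) are illustrative choices, not notation from the paper.

```python
import math


def exponential_reliability(hours, mtbf):
    """Probability that one unit survives `hours`, assuming a constant failure rate."""
    return math.exp(-hours / mtbf)


def series_reliability(unit_reliability, unit_count):
    """All `unit_count` identical, independent units must survive."""
    return unit_reliability ** unit_count


def m_out_of_n_reliability(unit_reliability, required, total):
    """At least `required` of `total` identical, independent units must survive.

    Computed as one minus the probability that fewer than `required` survive.
    """
    shortfall = sum(
        math.comb(total, i)
        * unit_reliability ** i
        * (1.0 - unit_reliability) ** (total - i)
        for i in range(required)
    )
    return 1.0 - shortfall


r_unit = exponential_reliability(1000, 5000)
r_series = series_reliability(r_unit, 3)
r_3_of_4 = m_out_of_n_reliability(r_unit, required=3, total=4)

print(f"single unit:  {r_unit:.2f}")
print(f"3 in series:  {r_series:.2f}")
print(f"3-out-of-4:   {r_3_of_4:.2f}")
```

Because `m_out_of_n_reliability` takes a reliability value rather than an MTBF, the unit reliability passed in could come from any fitted distribution.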
Suffice it to say the reliability is significantly improved.

Note that using reliability in the above function does not require the use of MTBF. The reliability term can come from any distribution.

Calculating or using only the MTBF value to represent a product's reliability can lead to more than misunderstanding. If the product performs better or worse than expected, you may have unnecessary spares expenses or not enough spares to continue effectively. Another issue that may arise is an unexpected increase in failure rate after a few years of a very low failure rate. Using the single parameter, MTBF, does not provide information about the changing nature of failure rates over time.

The following graph is a plot of the percentage of the population that has failed over time, or cumulative distribution function (CDF) plot. The red line is the plot of the fitted exponential distribution. The data and fitted line represent a failure rate trend that is declining over time. Over time the total number of failures continues to rise, yet the slope is low, or less than the slope for the exponential distribution.

This is actual data, and the time scale and title have been removed to protect the source. The theta of the exponential distribution is 49,093 hours, whereas the Weibull distribution has a beta of 0.5823 and an eta of 31,344 hours.

On this plot, the exponential distribution has a slope of 1. The fitted Weibull distribution slope is less than one. Keep in mind that the exponential and Weibull distributions are members of the exponential family of distributions. The formula for the reliability function of the 2-parameter Weibull distribution is

$$R(t) = e^{-\left(\frac{t}{\eta}\right)^{\beta}}$$
where beta is the slope and eta is the characteristic life. Setting beta to 1 reduces the formula to the reliability function for the exponential distribution,

$$R(t) = e^{-\left(\frac{t}{\theta}\right)}$$

where theta is the characteristic life and is also the inverse of the failure rate. Commonly theta is called MTTF or MTBF.

The plot of the CDF is related to the reliability function. Reliability is the percentage of units surviving over a specific duration, and the CDF plots the percentage of units failed over a specific duration. The CDF is represented by F(t), and the CDF for the Weibull distribution is

$$F(t) = 1 - e^{-\left(\frac{t}{\eta}\right)^{\beta}}$$

therefore,

$$R(t) = 1 - F(t)$$

Essentially the vertical axis on the above plot reverses. For the CDF it rises from 0 to 100%, whereas for the reliability function it runs from 100% to 0%.

Consider the above CDF plot again. If the underlying data is represented by only one value, say MTBF, we are in effect representing the data with the ill-fitted red line. Only at one point in time does the exponential distribution actually represent the data: the point where the two fitted lines cross. Thus, if we need to make a decision prior to that point based on the expected reliability of the system, and we have only the MTBF, we would use the exponential distribution. For example, at 100 hours we find the MTBF-based reliability to be

$$R(t) = e^{-\left(\frac{t}{\theta}\right)}$$

$$R(100) = e^{-\left(\frac{100}{49{,}093}\right)} = 0.9980$$

We get a number and can decide whether the system meets our reliability requirements. Using the fitted reliability distribution instead, we have a description of the data using two parameters. Calculating the reliability at the same point in time using the Weibull distribution we find
$$R(t) = e^{-\left(\frac{t}{\eta}\right)^{\beta}}$$

$$R(100) = e^{-\left(\frac{100}{31{,}344}\right)^{0.5823}} = 0.965$$

The difference in estimates may or may not make a difference in the decision, yet we often attempt to use the best available data when making important decisions. The estimate provided by the exponential distribution is potentially misleading and, in the above example, overstates the system's reliability. This error varies and gets worse when examining a shorter period of time.

This may lead to accepting a system that actually does not meet the requirements. Or, it may cause the understocking of needed spare parts for failures that are likely to occur, leading to reduced mission readiness.

The following CDF plot shows a different situation. Here the data tends to increase in failure rate over time and has a slope greater than one. Again the exponential (MTBF) estimate does not reflect the actual data very well, except at one point.
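The 100-hour comparison for the first data set (theta of 49,093 hours for the exponential fit, beta of 0.5823 and eta of 31,344 hours for the Weibull fit) can be checked numerically. The sketch below is a minimal illustration using only the fitted parameters quoted in the text. The function names are assumptions made for this example.

```python
import math


def weibull_reliability(hours, beta, eta):
    """Two-parameter Weibull reliability: exp(-(t / eta) ** beta)."""
    return math.exp(-((hours / eta) ** beta))


def exponential_reliability(hours, theta):
    """Exponential reliability, which is the Weibull form with beta fixed at 1."""
    return weibull_reliability(hours, 1.0, theta)


mission_hours = 100
r_exponential = exponential_reliability(mission_hours, theta=49093)
r_weibull = weibull_reliability(mission_hours, beta=0.5823, eta=31344)

print(f"exponential (MTBF only): {r_exponential:.4f}")
print(f"fitted Weibull:          {r_weibull:.4f}")
print(f"difference:              {r_exponential - r_weibull:.4f}")
```

Changing `mission_hours` shows how the gap between the two estimates moves with the time point chosen.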
Again, the title and vertical axis have been removed from this plot of actual data. The theta for the exponential distribution is 20,860 hours, and the fitted parameters for the Weibull distribution are a beta of 1.897 and an eta of 23,507 hours.

Performing the reliability calculations for the two distributions at 100 hours gives

$$R(t) = e^{-\left(\frac{t}{\theta}\right)}$$

$$R(100) = e^{-\left(\frac{100}{20{,}860}\right)} = 0.9952$$

for the exponential distribution, and

$$R(t) = e^{-\left(\frac{t}{\eta}\right)^{\beta}}$$

$$R(100) = e^{-\left(\frac{100}{23{,}507}\right)^{1.897}} = 0.999968$$

for the Weibull distribution.

While this difference may or may not change the decision based on the system reliability, using the exponential distribution may lead to costly mistakes. In this case, the system reliability estimate may be mistakenly represented as being too low. This may lead to a cancellation of the program, or the overstocking of spare parts.

Of course, in both examples, the difference between the two fitted curves depends on which time point is selected. And if the duration of interest is beyond the intersection of the two fitted lines, then the mistakes lead to different results.

Another area of misleading use of MTBF is the lack of reliability apportionment. The confusion comes from the notion of the weakest link limiting the reliability of a system, as in the excerpt from the poem by Oliver Wendell Holmes, "The Deacon's Masterpiece, or, the Wonderful One-Hoss Shay: A Logical Story" [3], where the chaise was built with every part as sturdy and strong as all the other parts. Then,

 --What do you think the parson found,
 When he got up and stared around?
 The poor old chaise in a heap or mound,
 As if it had been to the mill and ground!
 You see, of course, if you're not a dunce,
 How it went to pieces all at once, --
 All at once, and nothing first, --
 Just as bubbles do when they burst.

In practice, products do not fail all at once and completely. In more complex systems, while many possible components may be the first to fail, it may be unclear exactly which component will fail first. The replacement of that component generally does not improve the probability of failure of the other components, thus a different component may cause the next failure.

Back to the weakest link idea. In a series system, reliability speaking, if any one element of a system fails, then the system fails. Given technical and design limitations there is one element that is inherently weaker than the rest of the system. Suppose we know the compressor is the weakest link in a product and it has an MTBF of 5,000 hours. Well, then no other component needs to be any better than 5,000 hours MTBF. Right? And, one might say that for a system that has no field replaceable units, upon the first failure the unit has to be totally replaced anyway. Basically, the thought is that since the compressor limits the life of the product (the weakest link), no other component needs to be better than 5,000 hours MTBF.

Given a system goal of 5,000 hours MTBF and using the logic from above and from the One-Hoss Shay, we create a complex product with each subsystem designed and tested to the same goal, 5,000 hours MTBF. Let's assume the product has a display, circuit board, and power supply, in addition to the compressor mentioned above.

For the sake of argument, let's assume each of the four subsystems does actually have an exponential distribution for expected time to failure. This means that each subsystem has a 1/5,000 chance of failure every hour of operation, and it stays constant over time. Inverting the MTBF to find the failure rate per hour, we find 1/5,000 = 0.0002 failures per hour. And, let's say that over a two-year period the systems are expected to operate 2,500 hours.

"No problem, everything meets at least 5,000 hours MTBF", one might say.
Let's do the math.

$$R_{sys}(t) = R_1(t) \times R_2(t) \times R_3(t) \times R_4(t)$$

$$R_{sys}(2{,}500) = e^{-\left(\frac{2{,}500}{5{,}000}\right)} \times e^{-\left(\frac{2{,}500}{5{,}000}\right)} \times e^{-\left(\frac{2{,}500}{5{,}000}\right)} \times e^{-\left(\frac{2{,}500}{5{,}000}\right)} = 0.135$$
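The same arithmetic can be sketched in Python to show how the series reliability falls as subsystems are added. The sketch assumes independent subsystems with constant failure rates, so the rates simply add. The helper names are hypothetical.

```python
import math


def series_exponential_reliability(hours, subsystem_mtbfs):
    """Reliability of independent exponential subsystems in series.

    With constant failure rates, the system failure rate is the sum of the
    subsystem failure rates (each rate is 1 / MTBF).
    """
    total_rate = sum(1.0 / mtbf for mtbf in subsystem_mtbfs)
    return math.exp(-total_rate * hours)


def series_mtbf(subsystem_mtbfs):
    """System MTBF implied by the summed failure rates (same assumptions)."""
    return 1.0 / sum(1.0 / mtbf for mtbf in subsystem_mtbfs)


operating_hours = 2500
subsystem_goal = 5000.0

for count in range(1, 5):
    mtbfs = [subsystem_goal] * count
    reliability = series_exponential_reliability(operating_hours, mtbfs)
    print(f"{count} subsystem(s): R = {reliability:.3f}, "
          f"system MTBF = {series_mtbf(mtbfs):.0f} hours")
```

Under these assumptions, four subsystems that each just meet 5,000 hours MTBF give a system MTBF of 1,250 hours. Reaching a 5,000-hour system goal with four equal subsystems would require about 20,000 hours MTBF from each one.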
The more subsystems and components designed and selected to just meet the 5,000 hours MTBF, the worse the actual result. The result of a system reliability of 13.5% over 2,500 hours assumes that each subsystem achieves only 5,000 hours MTBF. In practice each will achieve some other number, yet the point is that, in design and practice, if each subsystem achieves only the system goal, the result will be surprisingly low.

Another assumption in the above example is the use of exponential distributions to describe each subsystem. This is often not true, and using a Weibull or lognormal distribution may be appropriate. For example, the compressor most likely has a wearout type of failure mechanism. Suppose we are able to find a set of data that with analysis provides a good fit to a Weibull distribution. The Weibull parameters for the compressor are a beta of 2 and an eta of 5,642 hours (note: this would be estimated as a theta of 5,000 hours for a fitted exponential distribution).

Using the new information with the same example as above, and applying this reliability to each of the four subsystems as the calculation does, we have

$$R_1(2{,}500) = e^{-\left(\frac{2{,}500}{5{,}642}\right)^{2}} = 0.82$$

$$R_{sys}(t) = R_1(t) \times R_2(t) \times R_3(t) \times R_4(t)$$

$$R_{sys}(2{,}500) = (0.82)^{4} = 0.45$$

The result is better because, in the early portion of the life distribution, the failure rate is relatively low. It is only later, after about 5,000 hours, that the failure rate climbs above that of the estimated exponential distribution. It is overstating the reliability at 2,500 hours.

Conclusion

We have the math tools and understanding to use the appropriate distributions to describe the expected failures or reliability functions. Using MTBF for convenience, convention or "because the customer expects that metric" all tend to lead to poor estimates and misunderstandings. Avoiding the use of the MTBF simplifications can only improve the description of the underlying predictions, test or field data results.

Using the best available data to make decisions implies that we use the best available tools to represent the data. Doing so can save you and your organization from costly errors within your program.
Endnotes

[1] Paul A. Tobias and David C. Trindade. 1998. Applied Reliability. 2nd ed. Chapman & Hall/CRC Press, page 367.

[2] Patrick D. T. O'Connor, with David Newton and Richard Bromley. 2002. Practical Reliability Engineering. 4th ed. Chichester: Wiley, page 166.

[3] Oliver Wendell Holmes, "The Deacon's Masterpiece, or, the Wonderful One-Hoss Shay: A Logical Story", Atlantic Monthly, September 1858.