Dart reliability


Accelerated tests on failures

  • Thank host for having me. Harold Williams (Reliability Review editor) and Cheryl Ascarrunz (Brocade) inspired the presentation. Harold needed an article for RR, and Cheryl needed RDT results, fast! The piecewise linear model is the same as in DORT, a previous ASQ Statistics Group presentation. How many have planned MTBF demo tests? How many have planned reliability demo tests? How many have actually done them? How many chargeable failures did you observe? How many real failures did you observe? What did you learn? What did you want to learn? An earlier version of the presentation is posted at http://www.ewh.ieee.org/r6/scv/rs/articles/DART.pdf
  • Part 1 appeared in ASQ Reliability Review, Vol. 24, No. 2, June 2004, and Part 2 will appear in Vol. 24, No. 3
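  • PLFR = piecewise linear failure rate function. Reliability = P[Life > t]. RAF = RAF(t) = Reliability Acceleration Factor = P[Life ≤ t|Accelerated]/P[Life ≤ t|Working], the ratio of failure probabilities, which reproduces the RAF(60) values quoted on the RAF slide.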
  • Credible means not only warm, fuzzy feeling estimates but unbiased estimates with minimum variance. Reliability is “the probability of successful function [according to customers] to specified ages [DoA, warranty, PM, useful life] under specified conditions [field conditions].” Definition is coming to be accepted: O’Connor 3rd edition, Blanchard, and others. My interpretations are in brackets. Failure rate function depends on age. Ages at failures and survivors’ ages are sufficient but not necessary to estimate field reliability, for products already in the field. Ships and returns counts, by calendar interval, are statistically sufficient to estimate field reliability. If a new product resembles an old one, then its reliability probably will resemble that of the old product. All you have to test is new parts and verify that new product reliability resembles old, with adjustments a la “Credible Reliability Prediction,” http://www.asq-rd.org/publications.htm.
  • Management wants reliability before field data are available. Real statisticians would never estimate MTBF without observing at least one failure > MTBF. Management never agrees to recommended sample size and test time. The question was a joke. Two high stress levels aren’t necessary.
  • IM stands for infant mortality. This slide motivates modeling infant mortality. The data for this slide used to be on http://www.intel.com/support. Intel’s units for 50 hours were dpm and other ages were in FITs, to hide the apparent infant mortality. I translated dpm to FITs and graphed the data on log-log scales.
  • Could use mixture Weibull to model ↓ and ↑ failure rate, but you probably won’t live long enough to estimate the parameters of the ↑ portion of the mixture. I actually had a case where acceleration didn’t affect Weibull shape parameter, but I was lucky.
  • Slope b = 0.0001 and constant a = 0.0001. “a(t)” stands for “actuarial” failure rate, because I often use discrete failure rate functions. Notation (7 − t)+ means (7 − t) for t < 7 and 0 thereafter. MTBF = 9975.5 age units is way out of picture to the right. That’s pretty close to 1/0.0001 = 10,000 because IM isn’t very much, ~0.0001*7*7/2 = 0.00245. Dotted line indicates possible increasing failure rate. More anon.
  • See the change in slope around 170 pulse? Dr. Conconi recommended fitting linear pieces to (heart rate)^2. I lost my spreadsheet to do that. It’s easy though, simply make a pair of columns for each data point, one for the data and one for the regression fit. Fit two pieces using LINREG() to the data before and after the k-th data point; k increases for each pair. Pick the pair with the smallest SSE, SUMXMY2() function.
  • MTBF = 9975.5 is a Taylor series expansion for small a and b. The exact MTBF formula involves the imaginary ERF (error function). IM stands for infant mortality. The probability b*to^2/2 = 0.00245 is the area of the triangle under the failure rate function two slides back.
  • Refer to graph of PLFR with dotted line three slides back for an example of linearly increasing failure rate; it is also shown on the next slide. IM means infant mortality. System acc. is not the same as parts acc., even for series systems, unless parts are iid (independent and identically distributed reliability functions). Proof: P[System Life > t] = Product[P[Part Life > t]] for a series system of independent components. P[Part Life > t] = exp[−Int[a(u)du;{u,0,t}]]; plug some accelerated failure rate model in and see what the system reliability formula looks like.
  • Level RH sections correspond to the first two accelerations. The L-shaped curve makes the constant failure rate increase linearly.
  • Reliability acc. factor depends on age t. Used same parameters a, b, and to as in PLFR and reliability slides. Increasing parameters a or b has only moderate effect on RAF. Changing to an increasing failure rate, limit of equal step stress, really changes RAF! Food for thought.
  • Alpha and beta are acceleration parameters to be estimated. Adding a lot of parameters defeats the idea of using a simple model to cope with small samples, short test times, and few failures. That stress factor formula is only a suggestion. Miner’s rule says deterioration doesn’t depend on how you got there: high stress early or late, order of stresses doesn’t matter. Accelerate more pieces of the model and you have to estimate more parameters, which means you need more failures, in each piece of the model.
  • I think Wayne Nelson, Bill Meeker, and perhaps Escobar did PhD or company research work on optimal designs for censored reliability tests with exponential, Weibull, and normal designs.
  • |D|-optimal design of testing to age zero is pretty dumb. The problem is that DoE expects every design point to yield an observation of the dependent variable, age at failure. Refer to DORT article for Neyman design. Min variance design requires you to specify how much variance you’re willing to tolerate. Nelson and Meeker have produced such designs assuming exponential and Weibull and perhaps lognormal reliability functions. Give them credit. See Part 2 of my article for one min. variance design for PLFR. Moderately credible design can be done with any probability, not just 50%. If you run the moderately credible design and don’t get the failures needed to estimate parameters, beg for more samples and test time.
  • PS: The version of this table in the RR article part 2, Sept. 2004, got truncated from the bottom. Objective of moderately credible design is to have at least 50% probability of getting enough data to estimate parameters. Of course you can change 50% to anything you can stand. You have to guess values of parameters a, b, and to to make this work. Choose n (sample size) and t (test time) until you get a design you like. Many rows are not shown. They compute probabilities of combinations of failures before and after to. P[failure < to] means the probability that one unit fails before to. P[failure in [to, t)] means the probability that one unit fails at or after to but before t. P[failure < to|n] means the probability of exactly one failure before to among n units. P[≥ 1 failure in [to, t)|n-1] means at least one failure at or after to but before t, conditional on n-1 survivors to to. P[Both, all] = probability of at least one failure before to and at least one afterwards. Case 3 is the preferred design, because it has the highest probability, 0.504, of at least one failure before to and at least one between to and t. You need those failures to estimate the true values of the parameters. I’m disappointed the choices had so little effect on P[Both, all]. C’est la vie.
  • 20 put on test, 5 failed, and 15 survived age 45. Data with infant mortality followed by an approximately constant failure rate. (Not all rows are shown. There were 20 samples, and 15 survived 45 time units.)
  • Parameters and statistics to be estimated are in the left column, models are in the top row. Pairs of columns, 2 and 3, 4 and 5, and 6 and 7, contain simpler and fancier versions. The last three rows contain statistics for comparing the simpler and fancier models in the column pairs. The b(to−t)+ct model is statistically significantly better than the ct model. I used Excel Solver to maximize the ln likelihood as a function of the parameters a, b, c, and to. Under the null hypothesis of constant failure rate, b = c = 0, the mle of a is the number of failures divided by the total time on test (the MTBF estimate is its reciprocal). Under the null hypothesis of linearly increasing failure rate, c > 0, and no infant mortality, b = 0, the mle of c is twice the number of failures divided by the sum of the squared ages of failures and survivors. The mle of parameter a for the a+b(to−t)+ct model is 0. That means that model coincides with the b(to−t)+ct model. (The MTBFs for the two models differ slightly, because different approximations were used. The LR test statistics differ, because they come from different degrees of freedom, the numbers of parameters estimated.) The piecewise linear models differ statistically significantly, at around the 25% level, from the monotonically linear or constant models: a, ct, and a+ct. The mles of to agree that infant mortality ends around age 3.3. The winning model is the b(to−t)+ct model, because it has the maximum likelihood (closest negative value to zero). It is statistically significantly different from the ct model at the 10% significance level. (The LR test statistic is 14.389, compared to the 10% chi-square value of 6.251.) The LR statistic is the likelihood ratio statistic to test Ho: failure rate function is the simpler model to the left vs. Ha: failure rate function is the fancier model in the right-hand column of each pair. It is −2*ln[Likelihood(Ho)/Likelihood(Ho or Ha)]. Sig level is your choice. Chi-square is the Sig level percentile of a chi-square RV with degrees of freedom equal to number of estimated parameters (a, b, c, and perhaps to). Accept Ho if LR statistic is smaller than chi-square. NOTE: LR test statistic has chi-square distribution asymptotically, as sample size AND number of failures increase. Have to simulate for small samples, but LR at least provides a test statistic.
  • Assume the acceleration multiplies the PLFR by x^p (power law acc.). “Reasonable” stress is one which doesn’t induce failures in unusual proportions. Of course you want more failures, sooner. The clue is to estimate parameters a, b, c, to, and p under the constraint that the MTBF computed with those parameters is as specified or that which was to be verified. LR stands for likelihood ratio (test). Repeat estimation without the MTBF constraint to have a denominator for the LR test.
  • Not all data are shown. Units 12-20 survived 45 age units.
  • Column 1 contains the parameters and statistics to be estimated. Columns 2 and 3 contain the models in the first rows and the computed parameter estimates in the remaining rows. The last three rows of column 3 contain statistics for comparing the models in columns 2 and 3. Table shows the corresponding unaccelerated parameter estimates, as functions of an MTBF prediction equal to 125, the theoretical MTBF from the a+ct model using the parameter estimates in the a+ct column of table 3 corresponding to unaccelerated test data. The LR test statistic is statistically significant at the 46% significance level.
  • Demonstration requirement was actually 50% of MTBF prediction. Test design was based on the assumption of constant failure rate. Xcvr = electro-optical transceiver, which converts digital signals into light and vice versa.
  • There were no failures in IM from which to estimate IM parameters b and t o . Best model turned out to be failure rate = ct, linearly increasing, with a = 0. The MTBF AF (acceleration factor) for xcvrs is 35, not the 14.6 value expected for the whole unit. We are tacitly assuming that xcvr failures are by far the most probable failure mode. There is no evidence to contradict that. The %iles were simulated assuming the asymptotic normal distribution of the max. likelihood estimator of parameter c and its asymptotic standard deviation. Seven times 16 = 112 is not too small a sample, but only two failures makes me nervous. Real statisticians would simulate the LR test statistic with two failures to make sure of exact coverage for small number of failures, even though there are 112 xcvrs on test. The reason for the ~1000 hours is that the simulation was only 100 units of 16 xcvrs. I hit F9 to recompute a bunch of times, and the range was from 750 to 1200 with an eyeball average of ~1000
  • Thanks for patience
  • MH217F1.xls is supposed to inspire you to buy “Credible Reliability Prediction.” KMRelEst.xls is supposed to inspire you to buy my software for ships and returns. Redundancy… is supposed to inspire more complex, complete reliability allocation. Weibull… is also supposed to inspire you to buy my software for ships and returns. What other freebies would people like? FMECA? Revenue management? Naturally, once you use these, you’ll want more. Call.
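The PLFR notes above (failure rate a + b(to − t)+, reliability, MTBF, and RAF) can be checked numerically. The following is a minimal sketch, not the author's spreadsheet: the function names are hypothetical, the MTBF uses Simpson's rule up to to plus the exact exponential tail, and RAF is taken as the ratio of failure probabilities, accelerated over unaccelerated, which is the reading that reproduces the quoted RAF(60) values.

```python
import math

def cum_hazard(t, a=0.0001, b=0.0001, to=7.0, c=0.0):
    """Integral from 0 to t of the PLFR a + b*(to - u)+ + c*u."""
    s = min(t, to)  # the infant-mortality piece stops contributing at to
    return a * t + b * (to * s - s * s / 2.0) + c * t * t / 2.0

def reliability(t, **params):
    """R(t) = exp(-cumulative hazard)."""
    return math.exp(-cum_hazard(t, **params))

def mtbf(a=0.0001, b=0.0001, to=7.0, steps=7000):
    """Integral of R(t) for the c = 0 case.

    Simpson's rule on [0, to] (steps must be even), plus the exact
    tail: beyond to the failure rate is the constant a, so the
    remaining integral is R(to)/a.
    """
    h = to / steps
    total = reliability(0.0, a=a, b=b, to=to) + reliability(to, a=a, b=b, to=to)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * reliability(i * h, a=a, b=b, to=to)
    head = total * h / 3.0
    tail = reliability(to, a=a, b=b, to=to) / a
    return head + tail

def raf(t, base, accel):
    """Assumed definition: P[fail by t | accelerated] / P[fail by t | working]."""
    return (1.0 - reliability(t, **accel)) / (1.0 - reliability(t, **base))

if __name__ == "__main__":
    base = dict(a=0.0001, b=0.0001, to=7.0)
    print("MTBF:", mtbf())
    print("RAF(60), a doubled:", raf(60, base, dict(a=0.0002, b=0.0001, to=7.0)))
    print("RAF(60), b doubled:", raf(60, base, dict(a=0.0001, b=0.0002, to=7.0)))
```

With the slide's parameters this should land close to the quoted MTBF of 9975.5 and the RAF(60) values 1.705 and 1.288. The third quoted RAF, for the linearly increasing case, is not reproduced here because the slope used for it is not recoverable from the slide.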
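For the two simplest models in the example results, the maximum likelihood estimates have closed forms, so they can be checked without Solver. This is a sketch under one stated assumption: the unaccelerated data are read as failures at ages 1, 2, 15, 30, and 45, with 15 survivors at age 45 (the notes say 5 of 20 failed). Function names are hypothetical.

```python
import math

# Assumed reading of the unaccelerated example data:
# 5 failures and 15 survivors censored at age 45.
FAILURES = [1, 2, 15, 30, 45]
SURVIVORS = [45] * 15

def fit_constant(failures, survivors):
    """Constant failure rate a: mle is failures / total time on test."""
    r = len(failures)
    ttt = sum(failures) + sum(survivors)
    a = r / ttt
    ln_l = r * math.log(a) - a * ttt
    return a, 1.0 / a, ln_l  # rate, MTBF, ln likelihood

def fit_linear(failures, survivors):
    """Failure rate c*t: mle is 2 * failures / sum of squared ages."""
    r = len(failures)
    sum_sq = sum(t * t for t in failures + survivors)
    c = 2.0 * r / sum_sq
    ln_l = sum(math.log(c * t) for t in failures) - c * sum_sq / 2.0
    mtbf = math.sqrt(math.pi / (2.0 * c))  # mean of a Rayleigh-type life
    return c, mtbf, ln_l

def lr_statistic(ln_l_null, ln_l_alt):
    """-2 ln[L(Ho)/L(Ho or Ha)] from the two ln likelihoods."""
    return -2.0 * (ln_l_null - ln_l_alt)

if __name__ == "__main__":
    print("a model: ", fit_constant(FAILURES, SURVIVORS))
    print("ct model:", fit_linear(FAILURES, SURVIVORS))
    # ln likelihoods for a and a+b(to-t)+ taken from the results table
    print("LR, a vs a+b(to-t)+:", lr_statistic(-30.17, -28.21))
```

Under that reading, the a model should give a near 0.0065, MTBF near 154, and ln likelihood near −30.17, and the ct model should give MTBF near 73 and ln likelihood near −34.98, in line with the results table. The piecewise models still need a numerical optimizer, as the notes describe.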

    1. Design and Analysis of Accelerated Reliability Tests, with Piecewise Linear Failure Rate Functions (PLFR). ASQ SV Statistical Group, Sept. 8, 2004. IEEE Reliability Society, Silicon Valley. Larry George, Problem Solving Tools, PST http://www.fieldreliability.com
    2. DART Abstract. Part 1 proposes piecewise linear failure rate (PLFR) function models, for modeling simplicity and resemblance to the left-hand end of the bathtub curve. The PLFR is inspired by: • Failure rates are not constant, often because of infant mortality • Tests have too few samples, are for too short times, and have few failures • Need to quantify infant mortality as well as MTBF. It shows how to estimate the PLFR parameters, reliability, infant mortality, and MTBF. It proposes acceleration alternatives, including one that accelerates testing greatly without screwing up results. Part 2 describes how to design and analyze accelerated reliability tests, assuming a PLFR and power law acceleration. It shows how to obtain credible results, with limited sample size and test time, at one accelerated stress level. It provides estimators for model parameters, reliability, MTBF, and confidence intervals, and it shows how to test model assumptions and verify MTBF.
    3. Part 1 Contents: Motivation for PLFR; MTBF and reliability for PLFR; Acceleration of PLFR and RAF
    4. DART Objectives. Make credible MTBF, reliability, and failure rate function estimates • (Credible Reliability Prediction, http://www.asq-rd.org/publications.htm and http://www.fieldreliability.com/Preface.htm) • Quantify infant mortality: proportion and duration • Verify MTBF. Use accelerated tests with only one, high stress level. Use available information early in life cycle.
    5. Today’s Situation? Management wants reliability ASAP. How to verify MTBF with tests that end long before MTBF, accelerated, with few if any failures? How to verify P[Life > useful life] > 0.9 with high confidence with small samples and short tests? • Has management ever agreed to sample size and test time? Can you extrapolate accelerated tests, at high stress, to working stress, with few failures well before MTBF? • NIST, ASQ [Meeker and Hahn], and others [Nelson, Bagdonavicius et al, Viertl] recommend ≥ two acc. stress levels
    6. Intel FITS have Infant Mortality • Data used to be at http://www.intel.com/support [Figure: failure rate in FITs vs. age in years, log-log scales, for Intel parts 28F400BX, 28F400BV, 28F008SA, 28F016SV, 28F001, 87C196KC, 80C51BH, 80486SXSA, and 80486DX2]
    7. Common, Invalid Assumptions. Constant failure rate • Infant mortality → initially ↓ failure rate. Monotonic ↑ or ↓ failure rate • Products often have both (rules out Weibull) [George 1995]. Cite bathtub curve. Acceleration doesn’t affect Weibull shape parameter • It does, usually, according to Richard Barlow [ http://www.esc.auckland.ac.nz/Organisations/ORSNZ/Newsletter ]. Can’t extrapolate to normal stress with only one accelerated stress level (one hand clapping) • Yes we can!
    8. Piecewise Linear Failure Rate. a(t) = a + b(to − t)+ = 0.0001 + 0.0001(7 − t)+. Dotted line is a possibly ↑ failure rate. [Figure: failure rate vs. age, ages 0 to 14]
    9. Test Conconi. Aerobic threshold is the heart rate at which the slope of work rate vs. heart rate decreases
    10. Reliability with PLFR. Reliability function has two parts, IM and after: • R(t) = exp[0.0001*t^2/2 − t(0.0001 + 0.0001*to)] for t < to • R(t) = exp[−0.0001*t − 0.0001*to^2/2] for t ≥ to • P[Fail in IM] ~ b*to^2/2 • MTBF ~ (1 − b*to^2/2)/a + b*to^3/6 − a*b*to^4/24 = 9975.5 [Figure: reliability vs. age, ages 0 to 14, reliability axis 0.997 to 0.999]
    11. Acceleration alternatives. Constant segment increases to greater constant. Constant segment becomes linearly increasing (limit of equal step stress); i.e., acc. induces premature wearout. Infant mortality slope increases, and perhaps to, the age at the end of IM, decreases as acceleration exacerbates process defects. System acceleration ≠ part accelerations! (unless parts are iid and in series)
    12. Acceleration alternatives. [Figure: failure rate vs. age, ages 0 to 14, showing three alternatives: constant a ↑, constant b ↑, and linearly ↑]
    13. Reliability Acceleration Factor. RAF(t) = (1−Racc(t))/(1−RUnacc(t)) > 1.0 • RAF(60) = 1.705 for doubling the constant failure rate, a, from 0.0001 to 0.0002 • RAF(60) = 1.288 for doubling infant mortality, b, from 0.0001 to 0.0002 • RAF(60) = 11.350 for changing from constant, a, to linearly increasing failure rate, a+0.0005*t!
    14. Fairly General Acceleration Model. aAcc(t) = aUnAcc[t/θ(x)]/θ(x) [Xiong and Ji] • ln θ(x) = α + βx • x is stress factor, (stress−normal)/(max stress−normal) • Continuous version of equal-step stress. Multiplies failure rate by a factor and rescales age t. Includes Arrhenius and Eyring models [Shaked], motivated by Miner’s rule. Apply it to constant, IM slope, or entire piecewise linear failure rate function.
    15. 15. Part 2 Designs and examples  |D|-optimal and other statistical designs fail  Exponential, Weibull, and normal designs exist  Moderately credible design Contrary to popular recommendations, you need only one acceleration level Examples: estimate parameters, LR test of MTBF  Unacc. and acc. FreebiesPST http://www.fieldreliability.com 15
    16. Alternative Designs. |D|-optimal is versatile, but recommends tests at 0, to, and anywhere thereafter • DoE expects every design point to yield age at failure. Reliability tests often don’t. Highly censored data. Consider Neyman design for multiple strata [Neyman, George 2002 (DORT)]. In minimum variance design, must specify how much variance [Nelson, Meeker and Hahn]. Moderately credible design gives 50% probability of at least one failure in infant mortality and one thereafter, sufficient to estimate piecewise linear parameters.
    17. Moderately Credible Design. Want 50% probability of ≥ 1 failure in IM and ≥ 1 after IM before end of test, t
        Parameters | Case 1 | Case 2 | Case 3
        a constant (guess) | 0.01 | 0.01 | 0.01
        b IM slope (guess) | 0.01 | 0.01 | 0.01
        to IM ends (guess) | 2 | 2 | 2
        n sample size (choose) | 29 | 34 | 31
        t test time (choose) | 7 | 5.6 | 6.4
        P[failure < to] | 0.039 | 0.039 | 0.039
        P[failure in [to, t)] | 0.047 | 0.034 | 0.041
        P[failure < to|n] | 0.371 | 0.356 | 0.366
        P[≥ 1 failure in [to, t)|n-1] | 0.739 | 0.680 | 0.718
        P[Both, all] | 0.501 | 0.499 | 0.504
    18. Example Data (Unacc.)
        Sample | Age at failure | Survivors’ ages
        1 | 1 |
        2 | 2 |
        3 | 15 |
        4 | 30 |
        5 | 45 |
        6 | | 45
        … | |
        19 | | 45
        20 | | 45
    19. Example Result
        Parameter/Model | a | a+b(to−t)+ | ct | b(to−t)+ +ct | a+ct | a+b(to−t)+ +ct
        a | 0.007 | 0.004 | | | 0.008 | 0.000
        b | | 0.016 | | 0.018 | | 0.018
        c | | | 0.000 | 0.000 | 0.000 | 0.000
        to | | 3.319 | | 3.346 | | 3.346
        MTBF | 154 | 215 | 73 | 83 | 125 | 84
        ln likelihood | -30.17 | -28.21 | -34.97 | -27.78 | -30.27 | -27.78
        LR statistic | | 3.919 | | 14.389 | | 4.989
        Sig level | | 10% | | 10% | | 10%
        χ2 | | 6.251 | | 6.251 | | 7.779
        Best model: b(to−t)+ +ct
    20. Put all your eggs in one basket for acceleration. a(t) = x^p*(a + b(to−t)+ + ct). Test at highest reasonable stress. Predict MTBF or use specified MTBF. Find mle of parameters, constrained to specified MTBF at working stress, x=1. Use LR to test specified MTBF • −2ln[L(MTBF)/L(unconstrained)] ~ χ2
    21. Example Data (Accel.)
        Sample | Ages at failures | Survivors’ age
        1 | 1 |
        2 | 1 |
        3 | 2 |
        4 | 2 |
        5 | 10 |
        6 | 15 |
        7 | 20 |
        8 | 25 |
        9 | 30 |
        10 | 35 |
        11 | 40 |
        … | |
        20 | | 45
    22. Example Result, x = 1.5
        Parameter | x^p*(a+ct) | x^p*(a+b(to−t)+ +ct)
        a | 0.001452 | 0
        b | | 0.018298
        c | 7.79E-05 | 0.000180
        to | | 3.345768
        p | 5.149690 | 5
        MTBF | 125 | 125
        Log likelihood | -53.84 | -56.17
        LR test statistic | | -4.65
        Sig level | | 10%
        Chi-square | | 9.23634
        Better model
    23. Switch Example. Demonstrate MTBF > 39,500 hours with 75% confidence. Test 7 switches for 6 weeks (1008 hours) at 60° C with MTBF AF = 14.6 (Arrhenius) to give χ2 LCL of ~39,000 hours. Xcvrs failed at 486 and 660 hours (16 xcvrs per switch), after IM.
    24. Real Example Data
        Parameter | Value
        c | 3.56E-8 per hour per hour
        Stdev c | c/√(2n) = 2.38E-9 per hr^2
        MTBF | √(π/(2c)) = 6645 hours
        25th %ile of MTBF | 6584 hours
        MTBF of 16 xcvrs acc. | √(π/(32c)) = 1661 hours
        25th %ile of 16-xcvr MTBF | ~1000 hours
        25th %ile of 16-xcvr MTBF, unacc. | 1000*35 = 35,000 hours
    25. Recommendations. For simplicity, use the PLFR to approximate left-hand end of bathtub curve… Approximate acceleration with power law; rescale age if necessary and if Miner’s rule fits. Use one, high level of acc. and MTBF to test hypotheses and extrapolate back to working stress. Send data to pstlarry@yahoo.com for PLFR analyses, free of charge.
    26. Freebies at http://www.fieldreliability.com • MTBF prediction a la MIL-HDBK-217F • Kaplan-Meier nonparametric reliability estimate from ages at failures and survivors’ ages • Redundancy reliability allocation • Weibull reliability estimate from ages at failures and survivors’ ages • What would you like?
    27. References
        Bagdonavicius, Vilijandas and Mikhail Nikulin, Accelerated Life Models, Modeling and Statistical Analysis, Chapman and Hall, New York, 2002
        George, L. L., “Design of Ongoing Reliability Tests (DORT),” ASQ Reliability Review, Vol. 22, No. 4, pp 5-13, 28, Dec. 2002
        George, L. L., “Design of Accelerated Reliability Tests,” ASQ Reliability Review, Part 1, Vol. 24, No. 2, pp 11-31, June 2004 and Part 2, Vol. 24, No. 3, pp 6-28, Sept. 2004. Presentation is at http://www.ewh.ieee.org/r6/scv/rs/articles/DART.pdf
        Kalbfleisch, John D. and Ross L. Prentice, The Statistical Analysis of Failure Time Data, Second Edition, Wiley, New York, 2002
        Meeker, William Q. and Gerald J. Hahn, How to Plan an Accelerated Life Test: Some Practical Guidelines, Vol. 10, ASQ, 1985
        Nelson, Wayne, Accelerated Testing, Wiley, New York, 1990
        NIST, Engineering Statistics Handbook, Ch., “Accelerated Life Tests,” http://www.itl.nist.gov/div898/handbook/apr/section3/apr314.htm
        Shaked, Moshe, “Accelerated life testing for a class of linear hazard rate type distributions,” Technometrics, Vol. 20, No. 4, pp 457-466, November 1978
        Viertl, Reinhard, Statistical Methods in Accelerated Life Testing, Vandenhoeck & Ruprecht, Göttingen, 1988
        George, L. L., “What MTBF Do You Want?” ASQ Reliability Review, Vol. 15, No. 3, pp 23-25, Sept. 1995
        Neyman, J., “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection,” J. of the Roy. Statist. Soc., Vol. 97, pp 558-606, 1934
        Xiong, Chengjie, and Ming Ji, “Analysis of Grouped and Censored Data from Step-Stress Life Test,” IEEE Trans. on Rel., Vol. 53, No. 1, pp. 22-28, March 2004
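The moderately credible design table (slide 17) can be recomputed from the guessed parameters a, b, and to. The sketch below is one reading of the speaker notes, not the author's spreadsheet: P[Both, all] is summed over the number of failures before to, each weighted by the probability of at least one later failure among the remaining units, using the unconditional per-unit probability of failing in [to, t). Function and key names are hypothetical.

```python
import math

def cum_hazard(t, a, b, to):
    """Integral from 0 to t of the PLFR a + b*(to - u)+."""
    s = min(t, to)
    return a * t + b * (to * s - s * s / 2.0)

def design_probabilities(n, t, a=0.01, b=0.01, to=2.0):
    """Rows of the moderately credible design table for one case."""
    r_to = math.exp(-cum_hazard(to, a, b, to))
    r_t = math.exp(-cum_hazard(t, a, b, to))
    p = 1.0 - r_to   # one unit fails before to
    q = r_to - r_t   # one unit fails in [to, t)
    one_before = n * p * (1.0 - p) ** (n - 1)     # exactly one of n before to
    some_after = 1.0 - (1.0 - q) ** (n - 1)       # >= 1 of n-1 in [to, t)
    # Assumed construction of P[Both, all]: k >= 1 failures before to,
    # then at least one failure in [to, t) among the other n - k units.
    both = sum(
        math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        * (1.0 - (1.0 - q) ** (n - k))
        for k in range(1, n + 1)
    )
    return {"p_before": p, "p_after": q, "one_before": one_before,
            "some_after": some_after, "both": both}

if __name__ == "__main__":
    for case, (n, t) in enumerate([(29, 7.0), (34, 5.6), (31, 6.4)], start=1):
        row = design_probabilities(n, t)
        print(case, {k: round(v, 3) for k, v in row.items()})
```

Under these assumptions the first four rows should match the slide to three decimals, and P[Both, all] should agree to within about 0.001 (the slide's values may come from a truncated sum, since not all spreadsheet rows are shown). Changing n and t in the loop is the "choose until you get a design you like" step.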