2012 RAMS Investment in Reliability Program versus Return – How to Decide



A short paper on ways to estimate the value of a range of reliability activities. Given at RAMS 2012.

Selecting the right tool, or the right investment, for a specific reliability task is often left to the judgment of the reliability professional. With experience these choices become simpler, yet in many cases the task can be daunting. By examining the decision process we explore a means to determine the most cost-effective reliability activities for specific situations.
Not all reliability tools provide useful information or timely results in every situation, so how does one choose the best activities for a given situation? After conducting over 100 reliability program assessments and working with dozens of design teams to build effective reliability programs, the author lays out a means to trade off the costs and benefits when selecting reliability activities.
Given the constraints and the objectives, there is a best set of tools to employ during the development process to produce a reliable product. This paper explores the cost/benefit equation in three different cases: high cost and low volume, low cost and high volume, and brand new technology product development. Considerations include risk, models, processes, and technology, along with customer or market expectations. Another significant consideration is the reliability maturity of the organization.
There isn't a single set of tools or activities that will always produce a reliable product in a cost-effective manner. Carefully considering the current situation and capabilities permits the team to select the right tools to make significant progress toward a reliable product.


Investment in Reliability Program versus Return – How to Decide

Fred Schenkelberg, Ops A La Carte, LLC

Key Words: Reliability, program, planning, investment, ROI

© IEEE 2012 – Annual Reliability and Maintainability Symposium

SUMMARY & CONCLUSIONS

Selecting the right tool, or the right investment, for a specific reliability task is often left to the judgment of the reliability professional. With experience these choices become simpler, yet in many cases the task can be daunting. By examining the decision process we explore a means to determine the most cost-effective reliability activities for specific situations.

Not all reliability tools provide useful information or timely results in every situation, so how does one choose the best activities for a given situation? After conducting over 100 reliability program assessments and working with dozens of design teams to build effective reliability programs, the author lays out a means to trade off the costs and benefits when selecting reliability activities.

Given the constraints and the objectives, there is a best set of tools to employ during the development process to produce a reliable product. This paper explores the cost/benefit equation in three different cases: high cost and low volume, low cost and high volume, and brand new technology product development. Considerations include risk, models, processes, and technology, along with customer or market expectations. Another significant consideration is the reliability maturity of the organization.

There isn't a single set of tools or activities that will always produce a reliable product in a cost-effective manner. Carefully considering the current situation and capabilities permits the team to select the right tools to make significant progress toward a reliable product.

1 INTRODUCTION

Many reliability activities are naturally part of design engineering. Adding the weight of a radio to an aircraft is a tradeoff between the value of the communication function and the cost of lifting the additional weight. Another tradeoff is between the value of the communication function and the cost of maintenance or repair. The repair cost is in part related to the reliability of the equipment.

Seasonal consumer products may have an emphasis on time to market and the cost of lost sales. Medical products may have an emphasis on product safety and the cost of potential product liability. For each product development team, the ability to quantify the cost of unreliability is important in order to balance the appropriate investment in achieving reliability objectives.

Each task related to the development of a product has a cost in resources, time, or materials. Focusing on tasks that effectively and efficiently move the design toward a market-acceptable product drives the overall program's success. One aspect is the product's reliability. Engineers and program managers can quickly estimate the cost of specific tasks, like Highly Accelerated Life Test (HALT) or Accelerated Life Test (ALT). Yet the value returned to the program is not as clear.

This paper explores a few examples and situations to illustrate how to determine the Return On Investment (ROI). Your actual situation is different and the resulting ROI will also be different. The assessment of which tasks add the most value provides a guide to building an overall reliability plan.

In each of the three examples explored there are numerous 'facts' stated that would be known by the design team. These are simply stated as facts to build information about the situation. Each case is built from an actual situation experienced in my work. Also, there are many assumptions made and stated as assumptions. In practice we do not have all the information or facts; assumptions permit the calculation to continue, and stating them clearly permits the team to challenge, understand, and improve the calculations.

2 HALT AND TIME TO MARKET

Consider the development of a new game controller. It is a high volume product with the majority of sales expected immediately after product launch during the holiday sales period. It is a new design with a time to market emphasis, the majority of the product is manufactured prior to the start of sales, there are no repairs, and the controller is an enabling part of a larger system. The controller's reliability goal is 98% reliable over the first year of ownership when used as part of the game system.

2.1 HALT vs ALT Discussion

One of the basic questions facing the team is, "Will the product meet the 98% reliability goal?" An ALT may help answer this question if we know which failure mechanism(s) will lead to failure during the first year [1]. This is a new product without any field history. Other controllers designed for this environment have experienced a range of failure causes, often dominated by shock and vibration damage from dropping.

The design team's risk analysis strongly suspects that drop damage will be the most significant contributor to product failures. The new controller is different enough that the field data from earlier controllers is unlikely to apply. Also, it is unknown which specific element of the design would fail first, or at all, over one year of use. Therefore, discovering the most likely failure mechanisms is important.

The initial project plan did not include HALT on the first set of prototypes. Rather, it would sample from the second set of prototypes, 8 weeks later, just before the transfer of the design to manufacturing, to conduct design verification testing (DVT), including life testing. The drop testing portion of the DVT is expected to take a week to accomplish.

The reliability engineer on this program recommends performing HALT on the first available prototypes. The HALT plan uses high levels of random vibration and high shock loads to quickly assess the design weaknesses related to product drop damage. The project manager requests more information on timing, cost, and benefits (value).

2.2 HALT Cost

There isn't time to procure a HALT chamber within the development schedule; therefore let's collect quotes from HALT labs to conduct the testing. Let's assume a quote of $10k for one round of testing [2]. Of course, if HALT facilities were available internally this cost would be less.

Also consider that the first-round prototypes are about 5 times more expensive than second-round prototype units. The first round of prototypes is a small run with specialized tooling and quick-turn production, costing approximately $1k per unit. We are requesting five units, each at an $800 premium over later prototypes, or $4k in total.

Rounding out the expected costs of engineering support, testing equipment support, and failure analysis support, we estimate an additional cost of approximately $10k. Therefore, the total cost to the program to add HALT is approximately $24k.

2.3 HALT Value

One of the primary benefits of HALT is the potential uncovering of new failure mechanisms in the design [3]. By conducting the HALT on the first available prototypes, the design team increases the time available to resolve design errors or make design improvements. Designers tend to design away from failures; HALT is a tool to discover previously unknown (or unsuspected) failure mechanisms.

Let's assume (for the purpose of this example) that the design prior to any testing has a 25% chance of containing a failure mechanism that will lead to an unacceptably high first year failure rate. In discussions with the program manager we learn that they would delay the start of production if there were a 10% or higher expected field failure rate. The cost of the delay was estimated at $500k per day in lost sales. With an assumed 30 days to design and implement an improvement to resolve a major reliability issue, the delay would cost the program $500k/day for 30 days, or $15 million.

There is a good chance the design is fine and will meet the reliability objectives. Let's assume that 75% of the time the design has an overall failure rate of less than 10% over the first year. The other 25% of the time, the underlying design has at least one major failure mechanism that may be detected and resolved prior to the start of sales.

Also, consider that no testing program will uncover all faults. Let's assume that only 10% of the time will both HALT and DVT fail to find a major (>10% failure rate) issue. HALT may not find the issue while DVT does detect the fault, let's say 50% of the time. And let's assume HALT finds the fault only 40% of the time. Note: this low rate is a pessimistic estimate of the ability of a well-executed HALT; in my experience HALT is much more effective.

For the value calculation, a 25% chance that an unacceptable failure rate exists in the design, times a 40% chance of HALT finding the issue, times the cost avoided by having time to solve the issue without a 30 day program delay, results in an expected savings of 0.25 x 0.40 x $15m = $1.5 million.

Figure 1 HALT Value Calculation

2.4 HALT ROI

The ROI is the ratio of the expected return over the cost: $1.5 million divided by $24k, which results in an ROI of over 60.

This is only part of the value, as it considers only the detection of major issues and the schedule slip thus avoided. The HALT will also find less significant issues that wouldn't have resulted in a schedule slip, yet the earlier detection would reduce the cost of implementing design changes. Plus, HALT may find unique failure mechanisms beyond what the DVT would find, thus leading to an incremental reduction in the achieved field failure rate.

3 ALT AND MARKET SHARE

A design team working on a medical device understands that market share is related to product reliability. The current product performs adequately yet has the highest field failure rate of similar products. Customers complain about the poor reliability, and the market share reflects the comparative reliability ranking. The product with the highest market share is also the most reliable.

The design challenge is to create a product that is more reliable than the competition at about the same price point and, if possible, with improved functionality. The early concepts all include a novel design using a sealing material of unproven reliability. The uncertainty suggests the implementation of an accelerated life test to estimate the expected product reliability.

Achieving the higher reliability is expected to result in more than tripling the market share in the first year. This would result in sales of the $3k/unit product jumping from 10k per year now to approximately 30k per year, an additional $60m in revenue. Furthermore, the increase in sales would require more than doubling the manufacturing capacity at a cost of $5 million. The decision to increase the manufacturing capacity depends on the estimated product reliability. In order to have the capacity available to meet the expected demand, the decision has to be made and the $5 million committed prior to the start of production.

3.1 Reliability Goal and ALT Discussion

The current product achieves 90% reliability over two years. The best competitive product is estimated to achieve 98% reliability over the same period. The goal for the new design is 99% reliability or better over two years. This is a major goal, and simply conducting an ALT is not going to achieve the result. Yet a key element is understanding whether the goal has been achieved or not. The $5 million investment in manufacturing depends on knowing whether the design will or will not meet the goal.

ALT in this case can answer the question, as it is focused on the expected dominant failure mechanism [4]. The failure mechanism and the stresses are all known. The new design using the novel material does leave uncertainty around how the design will actually perform. A well-designed ALT has the capability to ascertain the expected reliability performance.

3.2 ALT Cost

ALT is often an expensive test to conduct. The test design, samples, product operation jigs (robots, actuators, software, etc.), monitoring equipment, and failure analysis all add to the cost. Let's assume the total test planning and setup cost is $50k. The high reliability to be demonstrated will require a significant number of samples. The following formula [5] provides a rough estimate of the number of samples needed for a test to demonstrate 99% reliability with 90% confidence, assuming no failures of any tested samples.

$$ n = \frac{\ln(1 - C)}{\ln(R)} = \frac{\ln(1 - 0.9)}{\ln(0.99)} \cong 230 \qquad (1) $$

Where,
n is the sample size
C is the statistical confidence
R is the reliability

The 230 sample number is based on a success testing approach, assuming the failure mechanism and associated stress are well understood. Reducing the sample size with the use of degradation testing, or some other method, may increase the testing complexity and result in lower overall costs.

The cost of the subsystem that holds the seal is $200 each. 230 x $200 gives an estimated sample cost of $46k. Therefore, the total cost of the ALT is approximately $96k.

3.3 ALT Value

In this situation the test provides a binary result. The population either does or does not achieve at least 99% reliability. Keeping in mind that the ALT is run with a sample to represent the population, there is some uncertainty about the results. Statistical error may lead to four outcomes, as shown in Table 1 [6]. Assuming the test design used 90% confidence and has 90% power, we have a 10% chance of thinking the reliability is less than it actually is, and so not investing in added manufacturing capacity (a lost opportunity for increased sales). And we have a 10% chance of thinking the reliability is better than 99% when it is not, thus investing in added capacity when demand will not materialize.

                 The unknown actual reliability is less than 99%
Test Result      Is TRUE          Is FALSE
R >= 99%         Type I error     Correct
R < 99%          Correct          Type II error

Table 1 Statistical Errors

Going into the ALT we have a 50/50 chance that the new material and design will meet the 99% reliability goal. Combining that with the uncertainty of statistical error and a $5m decision, we can calculate the value of the test.

Figure 2 ALT Value Calculations

3.4 ALT ROI

The ROI is the ratio of the expected return over the cost. $4m over $96k results in an ROI of over 41.

Of course, the $5m decision isn't the only factor in the value of the ALT. It also provides a baseline for further testing (test cost savings), and it may provide information on the amount of margin the design has over the goal and permit further design enhancements. It also confirms the change in reliability, permitting proactive changes in warranty accruals and in service and repair operations.
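The HALT and ALT calculations in Sections 2 and 3 can be reproduced with a short Python sketch. The HALT figures and the sample size come directly from the text. The ALT value is a different matter: Figure 2 is not reproduced here, so the four-outcome decomposition below is an assumption, chosen because it is consistent with the stated 50/50 prior, the 90% confidence and power, the $5m decision, and the $4m result. The function and variable names are illustrative only.

```python
import math


def roi(expected_return, cost):
    """Ratio of expected return to cost, as defined in Sections 2.4 and 3.4."""
    return expected_return / cost


def success_run_sample_size(confidence, reliability):
    """Equation (1): samples needed to demonstrate `reliability` with no failures."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))


# --- HALT example (Section 2) ---
halt_cost = 10_000 + 5 * 800 + 10_000   # lab quote + prototype premium + support
delay_cost = 500_000 * 30               # $500k/day over an assumed 30 day slip
p_major_flaw = 0.25                     # assumed chance the design has a major flaw
p_halt_finds = 0.40                     # assumed (pessimistic) HALT detection rate
halt_value = p_major_flaw * p_halt_finds * delay_cost

# --- ALT example (Section 3) ---
n = success_run_sample_size(confidence=0.90, reliability=0.99)
alt_cost = 50_000 + n * 200             # planning and setup + samples at $200 each

# ASSUMPTION (Figure 2 is not available): credit each correct test conclusion
# with the $5m at stake and debit each erroneous conclusion by the same amount.
decision = 5_000_000
p_meets_goal = 0.5                      # 50/50 prior stated in Section 3.3
p_correct = 0.9                         # 90% confidence and 90% power
outcomes = [
    (p_meets_goal * p_correct, +decision),              # goal met, test agrees
    (p_meets_goal * (1 - p_correct), -decision),        # goal met, test disagrees
    ((1 - p_meets_goal) * p_correct, +decision),        # goal missed, test agrees
    ((1 - p_meets_goal) * (1 - p_correct), -decision),  # goal missed, test disagrees
]
alt_value = sum(p * payoff for p, payoff in outcomes)

print(f"HALT: cost ${halt_cost:,}, value ${halt_value:,.0f}, "
      f"ROI {roi(halt_value, halt_cost):.1f}")
print(f"ALT:  n = {n}, cost ${alt_cost:,}, value ${alt_value:,.0f}, "
      f"ROI {roi(alt_value, alt_cost):.1f}")
```

Under these assumptions the sketch gives ROI values consistent with the "over 60" and "over 41" figures quoted above. Changing any single probability shows how sensitive each ROI is to the stated assumptions, which is the discussion the paper encourages.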
4 DERATING AND FIELD FAILURE RATE

The specialized test and measurement industry creates very complex electronic equipment. These are expensive tools with total production of maybe 50 per year over a four year period. And, like other high cost/low volume products, the cost of failure is very high.

Because the unit costs are very high, the ability to test sufficient numbers of units to failure, or at all, is severely limited. It is not uncommon to have only one or two units for all qualification testing. Furthermore, the complexity of the units provides multiple possible failure mechanisms, and only rarely does the design provide a clear dominant failure mechanism on which to focus reliability evaluations.

Given the barriers to conducting physical testing, the reliability team recommends implementing detailed derating analysis for the selection of every electronic component. The design team does use some derating concepts, yet only based on a 50% guideline and without detailed analysis. Therefore, the project manager has requested more information about the process, costs, and value.

4.1 Derating and Field Failures Discussion

Derating is the selection of components that have ratings (power, voltage, etc.) above the expected stress [. Selecting a capacitor with a voltage rating of 10 volts to bridge a 5 volt potential would be considered a 50% derating. Selecting components whose rating just matches the expected stress generally leads to premature failure of the components. The ratings vendors provide only imply that the component can experience the stress at the rated value for a very short time. Derating provides a margin to minimize the accumulation of damage or the chance exposure to a stress high enough to cause a failure. The same concept applies to mechanical designs using safety margins.

At Hewlett-Packard, a study of the effects of various design for reliability tools found a very high correlation between well-executed derating programs and low field failure rates. This contributed to the 50% fewer field failures experienced [9]. One particular division, where the design team embarked on a full implementation of derating on all products, realized a 50% reduction in field failures in the first year and continued to realize reduced failure rates over subsequent years as more fully derated product designs shipped.

4.2 Derating Cost

Higher rated components cost more and are generally larger in size. Assume the current bill of material cost is $100k per unit and that, with the implementation of detailed and thorough derating, the bill of material cost doubles to $200k. For a production run of 50 units, the material cost increases by $5m.

The additional engineering time for training, circuit analysis, and procurement may add an additional $1m to the project cost. The total cost is an additional $6m to the program.

4.3 Derating Value

The primary value of component derating is that the increased circuit robustness of the product leads to fewer field failures [7]. A field failure is expensive, due to the replacement cost, failure analysis, and possible redesign and qualification costs. Let's assume that each field failure has an average cost of $2m, or four times the sales price.

Reducing a 10% annual failure rate (a low estimate for such complex products) to 5% would result in 2.5 fewer $2m failures per year, for an annual savings of $5m.

4.4 Derating ROI

The ROI is the ratio of the expected return over the cost. With a cost of $6 million and a return of only $5m, the ROI is less than one, at 0.83.

If the starting failure rate or the cost of failure is low, then this ROI may not exceed the breakeven point. Also, consider the market and competition impact. If the high failure rate caused a loss of market share, that may further increase the cost of failure. With the figures assumed here, implementing derating does not make sense in this situation.

5 RELIABILITY MATURITY CONSIDERATIONS

Organizations have different capabilities and approaches to reliability. In some, product reliability is not considered and the product performance is fairly random and unpredictable. Other organizations do considerable testing and use a wide range of tools to improve reliability, yet the testing and tools are generally applied in response to customer complaints and field failures. And a few organizations are proactive in the selection of high value reliability design activities [8].

Whether the base culture is reactive or proactive with respect to reliability suggests different routes to making reliability improvements. Less mature organizations may require training and maybe a pilot program to build acceptance of the proposed changes. More mature organizations may not find additional tools with significant ROIs, yet may understand and be able to calculate the impact of reliability improvements on market share or customer satisfaction.

In less mature organizations the calculations for cost and benefit may be more difficult and rely on more assumptions. That is not a reason for not doing the calculations. State the assumptions and start the discussion to find better information for the assessment.

In more mature organizations, while the calculations may be easier to accomplish given the better understanding of costs and benefits, the direct ROIs are likely to be smaller. These organizations also understand the value of customer satisfaction and of avoiding the costs associated with reactive engineering on field problems. Mature organizations do not have 25% of their design engineering resources responding to field failures.
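The derating arithmetic from Section 4 can be sketched in Python as follows. All input figures are the assumptions stated in the text; the break-even values at the end are an added illustration of the remark in Section 4.4 about the breakeven point, not a result reported in the paper. Variable names are hypothetical.

```python
# Inputs as assumed in Sections 4.2 and 4.3
units_per_year = 50
bom_before = 100_000                 # current bill of material cost per unit
bom_after = 200_000                  # bill of material cost with thorough derating
engineering_cost = 1_000_000         # training, circuit analysis, procurement
cost_per_failure = 2_000_000         # average cost of one field failure
failure_rate_before = 0.10           # annual failure rate without detailed derating
failure_rate_after = 0.05            # annual failure rate with detailed derating

# Cost: added material for one year's production plus engineering time
derating_cost = units_per_year * (bom_after - bom_before) + engineering_cost

# Return: field failures avoided in one year times the cost of each failure
failures_avoided = units_per_year * (failure_rate_before - failure_rate_after)
derating_return = failures_avoided * cost_per_failure
derating_roi = derating_return / derating_cost

# Illustration only: what would have to change for the ROI to reach 1.0
breakeven_cost_per_failure = derating_cost / failures_avoided
breakeven_failures_avoided = derating_cost / cost_per_failure

print(f"Cost ${derating_cost:,}, return ${derating_return:,.0f}, "
      f"ROI {derating_roi:.2f}")
print(f"Break-even cost per failure: ${breakeven_cost_per_failure:,.0f}")
print(f"Break-even failures avoided per year: {breakeven_failures_avoided:.1f}")
```

With these inputs the ROI stays below one. Under the same assumptions, the investment would break even only if each failure cost about $2.4m or if three failures per year were avoided, which is one way to frame the follow-up discussion with the project manager.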
6 CONCLUSIONS

The decision to add a reliability-specific task generally adds cost to the development program. The costs are typically easy to calculate by summing engineering time, material costs, added samples, added time, and other direct costs. On the other hand, the benefit is more difficult to calculate. The benefits may include an estimated reduction of field failure rates, an estimated reduction in risks, or expected discovery rates of serious field failure issues during the early design phase.

The calculation of value before the value is realized is difficult and often based on a series of assumptions. Stating the assumptions and showing the calculations permits the team to understand the calculations and check the assumptions. Having an estimated value provides a quantitative means to determine the return on investment. The ROI value provides a means to determine the relative value of any investment, thus permitting the comparison of all the investment decisions made during a development project.

Without the quantitative value calculation the team relies on the anecdotal belief that the tools will provide value. In some cases this will be obvious or not in question, yet in those cases where there is any doubt, the examples in this paper provide guidance for the ROI calculation of reliability tasks.

Every product development team faces different criteria for value (time, cost, etc.) and different sets of constraints (time, samples, test capabilities, etc.). Just like creating a specific test plan, the ROI calculation is tailored to fit the situation.

Not every tool is appropriate to use, and the analysis of the ROI, even with estimates and assumptions, provides an organization the ability to select the tools that provide the best value.

REFERENCES

1. Silverman, Mike. How Reliable Is Your Product? Cupertino, CA: Super Star Press, December, 2010, pg. 193.
2. Personal Communication with Mike Silverman, June 18th, 2011.
3. Hobbs, Gregg K. Accelerated Reliability Engineering: HALT and HASS. Chichester; New York: Wiley, 2000, pg. 43.
4. Nelson, Wayne. Accelerated Testing: Statistical Models, Test Plans, and Data Analysis. Edited by S S Wilks Samuel. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons, 1990, pg. 3.
5. Wasserman, Gary S. Reliability Verification, Testing and Analysis in Engineering Design. New York: Marcel Dekker, 2003, pg. 209.
6. Ott, Lyman. An Introduction to Statistical Methods and Data Analysis. Belmont, Calif.: Duxbury Press, 1993, pg. 216.
7. Ireson, William Grant, Clyde F Coombs, and Richard Y Moss. Handbook of Reliability Engineering and Management. New York: McGraw Hill, 1995, pg. 16.9.
8. Crosby, Philip B. Quality Is Free: The Art of Making Quality Certain. New York: Signet, 1979.
9. Ireson, William Grant, Clyde F Coombs, and Richard Y Moss. Handbook of Reliability Engineering and Management. New York: McGraw Hill, 1995, pg. 5.4.

BIOGRAPHIES

Fred Schenkelberg
Ops A La Carte, LLC
990 Richard Avenue, Suite 101
Santa Clara, CA 95050, USA

e-mail: fms@opsalacarte.com

Fred Schenkelberg is a reliability engineering and management consultant with Ops A La Carte, with areas of focus including reliability engineering management training and accelerated life testing. Previously, he co-founded and built the HP corporate reliability program, including consulting on a broad range of HP products. He is a lecturer with the University of Maryland, teaching a graduate level course on reliability engineering management. He earned a Master of Science degree in statistics at Stanford University in 1996. He earned his bachelor's degree in Physics at the United States Military Academy in 1983. Fred is an active volunteer with the management committee of RAMS, currently the Chair of the American Society for Quality Reliability Division, active at the local level with the Society of Reliability Engineers and IEEE's Reliability Society, active on IEEE reliability standards development teams, and recently joined the US delegation as a voting member of the IEC TAG 56 - Durability. He is a Senior Member of ASQ and IEEE. He is an ASQ Certified Quality and Reliability Engineer.