Tutorial on Effective Reliability Program Traits and Management


The supporting paper for my tutorial at RAMS 2011 and 2013. It looks at the key features that make a great (or poor) reliability program.

The purpose of this tutorial is to highlight key traits for the effective management of a reliability program. The basic premise is that no single list of reliability activities will work for every product. Every product development and production team faces a different history, different constraints, and a different set of variables and uncertainties. As a result, what worked for the last program may or may not be appropriate for the current project. A handful of key traits separate the valuable programs from the merely busy programs. These traits and the underlying structure can provide a framework to create a cost-effective and efficient reliability program.

Published in: Technology

  1. 1. Effective Reliability Program Traits and Management Fred Schenkelberg Senior Reliability Consultant Ops A La Carte, LLC 990 Richard Ave, Suite 101 Santa Clara, CA 95050 Schenkelberg: page i
  2. 2. SUMMARY & PURPOSE

     The purpose of this tutorial is to highlight key traits for the effective management of a reliability program. The basic premise is that no single list of reliability activities will work for every product. Every product development and production team faces a different history, different constraints, and a different set of variables and uncertainties. As a result, what worked for the last program may or may not be appropriate for the current project. A handful of key traits separate the valuable programs from the merely busy programs. These traits and the underlying structure can provide a framework to create a cost-effective and efficient reliability program.

     Fred Schenkelberg

     Fred Schenkelberg is a reliability engineering and management consultant with Ops A La Carte, with areas of focus including reliability engineering management training and accelerated life testing. Previously, he co-founded and built the HP corporate reliability program, including consulting on a broad range of HP products. He is a lecturer with the University of Maryland teaching a graduate level course on reliability engineering management. He earned a master of science degree in statistics at Stanford University in 1996. He earned his bachelor's degree in physics at the United States Military Academy in 1983. Fred is an active volunteer as the Executive Producer of the American Society for Quality Reliability Division webinar program, with IEEE reliability standards development teams, and previously as a voting member of the IEC TAG 56 - Durability. He is a Senior Member of ASQ and IEEE. He is an ASQ Certified Quality and Reliability Engineer.

     Table of Contents
     1. Introduction
     2. Basic Structure
     3. Reliability Goals
     4. Apportionment
     5. Feedback Mechanisms
     6. Determining Value
     7. Maturity Model
     8. Conclusions
     9. References
     10. Tutorial Visuals
  3. 3. 1. INTRODUCTION

     A product's design, supply chain and assembly process in large part establish the product's reliability performance. A product well suited for the use application will meet or exceed the customer's durability expectations. The myriad decisions made by the entire design and production team create the eventual product reliability performance. The structure for these decisions is the focus of this tutorial.

     Considering that each activity of a design team takes resources such as time and money to accomplish, focusing the use of these resources on activities of high value is a common strategy. Including product reliability in the value proposition permits the entire team to weigh the importance of product reliability and the appropriate use of tools to accomplish both the business and product reliability objectives.

     The basic premise of this tutorial is that no one set of reliability activities is appropriate for every product development situation. Selecting and integrating the best tools permits the execution of an effective and efficient reliability program.

     The traits of very good reliability programs and examples of very poor practices in this tutorial serve to illustrate how to approach establishing an effective reliability program. Highlighting the basic structure along with guidelines on how to tailor a reliability program will permit the repeatable creation of reliable products.

     Acronyms and Notation
     ALT     Accelerated Life Test
     CAD     Computer Aided Design
     FMEA    Failure Modes and Effects Analysis
     HALT    Highly Accelerated Life Test
     LED     Light Emitting Diode
     MTBF    Mean Time Between Failure
     PoF     Physics of Failure
     SPICE   Simulation Program with Integrated Circuit Emphasis

     2. BASIC STRUCTURE

     A product reliability program is a process. Like any process it has inputs and outputs, plus generally some form of an objective and feedback. Furthermore, the process may or may not be controlled or even a conscious part of the organization. Reliability may just happen, good or bad. Results may or may not be known or understood.

     In some organizations, the reliability program may be highly structured with required activities at each stage along the product lifecycle. In other organizations, reliability is considered as a set of tests (e.g. environmental or safety compliance). And, in some organizations, reliability is effectively a part of everyone's role.

     In each example above, the resulting product reliability may meet the customer's expectations or not. There isn't a single process that will always work.

     Going back to the basic notion of a simple process, consider the objective for a moment. For a reliability program one may desire a specific outcome of a reliable product. The process then should promote activities leading to the creation of a reliable product. This brings up the question of what is a 'reliable product'?

     The objective or goal provides the direction and guidance for the reliability program. Clearly stating the reliability goal is a key trait of very effective programs. Leaving the goal unstated or vaguely understood may lead to one or more of the following:

     • High field failure rate
     • Product recall
     • Over designed and expensive product
     • Design team priority confusion

     Another element of a process is feedback. This occurs within the process as part of the creation of the output, and it most certainly exists externally based on the output or process results.

     The final result for product reliability is the customer acceptance or rejection of the product. If the product functions longer than expected, like an HP calculator, the product is considered a 'good value'. If the product fails quickly or often, especially compared to other products providing the same solution, it is considered of 'poor value'.

     In some organizations the feedback is non-existent, in others it is captured within a warranty claims system, in others within service or repair programs. Customers may complain directly with returned products and demands for replacements, or indirectly by simply not purchasing the product in the future.

     The feedback within the reliability program attempts to anticipate the customer's feedback prior to the delivery of the product to the customer. Depending on the product and the organization, this feedback may be very formally determined, highly structured and very accurate. Or, the feedback may be random, haphazard and inaccurate. Both types of feedback may be suitable, again depending on the product and organization.

     Establishing the appropriate set of feedback mechanisms within a reliability program is done within the context of the product reliability goals and the value to the organization of the feedback. The process benefits from feedback that is timely and accurate enough to make decisions. It is those decisions that lead to the product's reliability in the hands of the customers.

     Therefore the basic structure for any reliability program is to clearly establish and state the reliability goal, then determine the appropriate set of feedback mechanisms that provide timely information to permit design and production decisions. The 'how' to decide the 'appropriate set' is the subject of this tutorial.

     3. RELIABILITY GOALS

     The target, objective, mission or goal is the statement that provides the design team focus and direction. A well stated goal will establish the business connection to the technical decisions related to the product durability expectations. A well stated goal provides clarity across the organization and permits a common language for discussing design, supply chain and manufacturing decisions.

     Let's explore the definition of a 'well stated reliability goal'. First, it is not simply MTBF, “as good as or better
  4. 4. than…”, or ‘a 5 year product’. These are common 'goals' found across many industries, yet none permit a clear technical understanding of the durability expectations for the product.

     The common definition for reliability is

          Reliability is … the ability or capability of the product to perform the specified function in the designated environment for a minimum length of time or minimum number of cycles or events. (Ireson, Coombs et al. 1995)

     Note this definition has four elements:

     • Function
     • Environment
     • Duration
     • Probability

     3.1 Function

     The function is what the product is to do or perform. For example, an emergency room ventilator is to provide assisted breathing for a person. This requires the ventilator to produce breathable air within a range of pressures within a prescribed cycle of respiration. It may include requirements for filtering, temperature, and adjustments to pressure and timing of the cycle, etc. Often, a product development team either develops or is given a detailed set of functional requirements.

     Often the functional elements of a product are directly measurable. And, the quality function of most organizations verifies that the design and production units meet the functional requirements. When the product does not meet the functional requirements, it is considered a product failure. Within the function definition, which are the most important functions? Which must not fail? Which functions, if they fail, may simply degrade performance, if noticed by the customer at all?

     3.2 Environment

     The environment could be considered the weather around the product when in use: 'weather' such as temperature, humidity, UV radiation intensity, etc. It should also include environmental factors that provide destructive stresses, such as vibration, moisture, corrosive gases, voltage transients, and many more.

     Another element of the environment is the use of the product. What is the use profile? Is it once a day for a few minutes, like a remote control for the stereo system? Or is it a 24/7 operation, such as a server system processing transactions for a major online store? The profile may include details concerning human interactions, operating modes, shipping, storage, and installation. The environmental conditions need to detail how the product responds or degrades to the set of stresses the product encounters. The environmental conditions focus on drivers for the product's most likely failure mechanisms.

     3.3 Duration

     The duration is the amount of time or number of cycles the product is expected to function. A computer printer may be expected to print for five years. A washing machine is expected to wash clothes for 10 years. An implanted hearing aid is expected to last the life of the patient; if the patient is a child this expectation may be more than 70 years.

     The duration expectations may be defined by contract, market expectations, or by a business decision. The duration or life expectancy most likely is not the warranty period. For example, many personal computers have a 3 month or 1 year warranty period. Yet, the product is expected to last at least two years or more with normal use.

     Many products have multiple durations that are of interest.

     • Out of box
     • Warranty
     • Design Life

     The initial, out-of-box, or installation period is that duration when the customer is first setting up and using the product. Brand visibility is at the highest and the expectation that a new product will function as expected is very high. The types of failures that may occur include installation or configuration errors, mistaken purchase, shipping or installation damage, or simply buyer error. All of these 'failures' cost the company producing the product resources.

     The warranty period is the duration associated with the producer's promise to provide a product free of defects for a stated period of time. For example a computer may have a 1 year warranty period. During this one year, if the product fails (usually limited to normal use and operating environment) the producer will repair or replace the product. Naturally this will cost the producer resources.

     The design life is the business or market expected product duration of functional use. After the warranty period there isn't an expectation for the producer to replace or repair the product, yet the customer may have a reasonable expectation that the product will function satisfactorily over the design life duration. For example, many cell phones have a 3-month warranty, yet as consumers we have an expectation that the phone will function for two years or more.

     Marketing or senior management may set the design life. They may want to establish a market position for the product related to reliability. One way is to design a very robust product with a long design life duration. HP calculators often have only a 3-month or 1-year warranty, yet many have lasted 10 or more years. These calculators are known for their robustness and often cost more to purchase: a reliability premium.

     Each of the three durations often involves different risks related to the failure mechanisms. It is rare for bearings to wear out in the first 30 days, yet more likely over a 10-year design life. Establishing three or more durations within the product reliability goal permits the design team to focus on and address the full range of product reliability risks.

     3.4 Probability

     The probability is the likelihood of the product surviving over a specified period of time. In the formal reliability definition above, the phrase 'ability or capability' refers to the probability. This is the statistical part of the reliability goal and without it the goal is fairly meaningless. Furthermore, stating a probability without an associated duration and distribution is also meaningless in most cases.
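The four elements and the out-of-box, warranty, and design life durations described above can be captured in a small record, which makes it easy to see what a vague goal such as 'a 5 year product' leaves out. The following is only a sketch: the class and function names are hypothetical, and the printer's function, environment, and probability values are illustrative assumptions rather than figures from the tutorial.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Couplet:
    """One duration/probability pair, e.g. the warranty period."""
    name: str
    years: Optional[float]        # duration of interest, None if unstated
    reliability: Optional[float]  # probability of surviving it, None if unstated


@dataclass(frozen=True)
class ReliabilityGoal:
    """Hypothetical container for the four elements of a reliability goal."""
    function: str = ""
    environment: str = ""
    couplets: Tuple[Couplet, ...] = ()


def missing_elements(goal: ReliabilityGoal) -> List[str]:
    """List which of the four elements the goal statement leaves unstated."""
    missing = []
    if not goal.function.strip():
        missing.append("function")
    if not goal.environment.strip():
        missing.append("environment")
    if not goal.couplets or any(c.years is None for c in goal.couplets):
        missing.append("duration")
    if not goal.couplets or any(c.reliability is None for c in goal.couplets):
        missing.append("probability")
    return missing


# 'A 5 year product' states only a duration.
vague = ReliabilityGoal(couplets=(Couplet("design life", 5.0, None),))

# Illustrative printer goal; every value below is an assumption.
printer = ReliabilityGoal(
    function="print text pages at the rated speed and print quality",
    environment="indoor office, 15-30 C, a few hundred pages per month",
    couplets=(
        Couplet("out of box", 30 / 365, 0.995),
        Couplet("warranty", 1.0, 0.98),
        Couplet("design life", 5.0, 0.95),
    ),
)

print(missing_elements(vague))    # ['function', 'environment', 'probability']
print(missing_elements(printer))  # []
```

A record like this is not a reliability model; it only forces each element of the goal to be written down so the team can discuss it.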
  5. 5. What is the chance that a particular product will function as expected over the entire expected design life? How many of the installed units will be functional over the warranty period? Since each product and the associated environmental stress vary, the use of statistics is unavoidable in describing product reliability. Even the definition of a product failure may vary by customer.

     While there are many common terms to convey the probability of survival, the use of a percentage surviving is the easiest understood and most easily applied across an organization. Stating that 95% of units are expected to survive over the 5 year design life means 95 out of 100 units will function properly over the 5 year period. Or, that a single product has a 19 in 20 chance (95%) of surviving 5 years. A similar statement is that not more than 5% of products fail over the full five years. Or, it may be stated as not more than a 1% failure rate per year.

     A common probability statement is the inverse of the failure rate, or MTBF. The 95% reliability over 5 years (t) becomes approximately 100 years MTBF (θ). This does not mean the product will last 100 years; it does mean that 95% of the products are expected to last 5 years.

     Finally, stating a separate failure probability for each duration of interest provides a set of duration/probability couplets that permit different focus for early or out of box failure risks versus the longer term failure risks.

     Suppose the product has a specific mission time, say an aircraft with an expected 12-hour mission over a 20-year serviceable life period. The probability of success for the 12-hour mission time may be set relatively high. And, it may have a conditional probability considering the number of missions since the last major service. Some products have availability goals and undergo routine maintenance or repair. These products and many complex systems require additional complexity in their goal setting. For the purpose of this discussion, we are considering simple products that are not normally repaired, or products where the main interest is in the time to the first failure.

     The point is that setting the reliability goal for a product is not as simple as stating a 'five year life'. It requires a clear statement with sufficient detail of each of the four elements: function, environment, duration, and probability. And, it may and often should include at least three duration/probability couplets. The goal establishes the direction or target for the entire design, supply chain and manufacturing team.

     4. APPORTIONMENT

     The system or product level reliability goal is not sufficient by itself. Ideally, every component or assembly step that has a possible impact on the final product reliability should have an established reliability goal. Each individual element should have goals that are tailored to that specific element. For example, a cooling fan that only operates when the internal temperature reaches a defined value has a different use profile than the entire system. The function and environment are different for the specific fan than for the system. The computer provides a platform for computer programs to operate along with a user interface, whereas the fan provides cooling. Many of the environmental factors for the computer also impact the fan, yet not everything applies and some will require modification. Sophisticated models include apportioned goals, addressing many functions, several use profiles and several environments, different durations, and conditional probabilities. Simple models work to get started; as more details become available concerning the design and use, sophisticated models are increasingly useful.

     The duration may also require modification. The durations are most often the same as at the system level and may require modification if the various components or subsystems are only employed during specific phases of the product's use, e.g. an installation and configuration aid.

     The probability will require modification unless the product has no component or subsystem elements. This is rare except for raw materials. Even a simple discrete resistor has multiple components that may have different failure mechanisms. For example, the resistive element and the soldering leads have different functional descriptions, are made of different materials and experience different sets of stresses that lead to failure. The probability of failure is not the same as for the system.

     Another way to look at the probability differences is to break down the system probability of success to each element within the product. Consider a simple system with two primary means to fail (say the resistor with the resistive and connection elements as an example for discussion), where the system has a 90% probability of successfully functioning over 20 years. If both of the elements also have a 90% probability, and either the resistive or connection element causes a system failure, then either the system or subsystem goals are misstated. As you already know, for a simple series system the probability of success for the subsystems has to be larger such that when they are multiplied together the result meets or exceeds the system goal.

     There are excellent references for basic reliability modeling and many papers and forums to discuss even the most complex systems. The intention is to apportion the system reliability goal, especially the probability value, to all major elements of the product.

     4.1 Establishing the probability apportionment

     The time to establish the reliability apportionment is early in the project. Depending on the project and the known values from field data, vendors, previous projects, etc., the apportionment may be well founded on data, or simply a guess. Both are valuable.

     Consider a simple example of a computer system with five major subsystems: motherboard, disk drive, monitor, power supply and keyboard. Of course there are other elements, yet for this example we limit the list to these.

     Suppose this is our first product and little is known about the reliability of any of these components (for example, when designing the first personal computers in the 1980s). Further, let's assume the system goal is 95% reliable over a 5-year period for the design life. Having no other information, a straight-line apportionment is as good a starting place as any. Therefore, each of the five subsystems receives an apportionment goal of 99% reliable over 5 years. Also, the functional and environmental elements receive attention to adjust to those subsystems' particulars.
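The arithmetic behind the MTBF statement and the straight-line apportionment can be checked in a few lines. This is a minimal sketch, not the author's method: the function names are hypothetical, the MTBF conversion assumes a constant failure rate (exponential distribution), and the apportionment assumes a series system of independent subsystems.

```python
import math


def mtbf_from_reliability(reliability: float, duration: float) -> float:
    """MTBF (theta) such that exp(-duration / theta) equals the reliability.

    Assumes an exponential (constant failure rate) time-to-failure model.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    if duration <= 0:
        raise ValueError("duration must be positive")
    return -duration / math.log(reliability)


def equal_apportionment(system_goal: float, n_subsystems: int) -> float:
    """Per-subsystem reliability for n equal subsystems in series."""
    return system_goal ** (1.0 / n_subsystems)


def series_reliability(subsystem_reliabilities) -> float:
    """System reliability when any subsystem failure fails the system."""
    return math.prod(subsystem_reliabilities)


# 95% reliability over 5 years is roughly a 100-year MTBF.
print(round(mtbf_from_reliability(0.95, 5.0), 1))   # 97.5 (years)

# Five subsystems sharing a 95% system goal need about 99% each.
print(round(equal_apportionment(0.95, 5), 4))       # 0.9898
print(series_reliability([0.99] * 5) >= 0.95)       # True

# Two elements at 90% each cannot meet a 90% system goal in series.
print(round(series_reliability([0.90, 0.90]), 2))   # 0.81
```

The last line restates the resistor example: when both elements carry the same 90% value as the system, the product is only 81%, so one of the goals is misstated.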
  6. 6. At first, this simple method provides a starting point for the team's discussion concerning reliability. It provides the basis for product design, part procurement, validation and verification testing, and the myriad cost/benefit tradeoff decisions required during the product lifecycle.

     Over time, years of field data, vendor data and internal product testing continue to improve the understanding of each subsystem's reliability. This understanding becomes the base for the initial apportionment estimates for a new product. Consider a new project for a personal computer where only the CPU and associated chipset are new. The overall apportionment model may start with the best available reliability values for all the subsystems and include an adjustment to the motherboard value considering the uncertainty or estimated value change regarding the new CPU chipset. The uncertainty is relatively low and the use within subtle design decisions is possible.

     4.2 Adjusting the probability apportionment

     Go back to the first personal computer design and simple straight-line apportionment. A little common sense and feedback from vendors may provide additional information. The keyboard is most likely more reliable than the power supply, for example. Adjusting the goal for the power supply down, say to 98%, then requires an adjustment in one or more of the other subsystems such that the product remains at or above the system goal of 95%. The same rule applies for any other series system of apportionment.

     Another consideration for the apportionment adjustment is the cost/benefit tradeoff. For nearly any development project there is a limit to product cost; therefore simply purchasing the most expensive components, which may or may not be the most reliable, is not always an option. Back to the power supply example above. Let's say the vendor of the initially selected power supply considers the use, environment and functional requirements and states that the power supply will have a 95% probability of success over 5 years. That is the same value as the overall system goal, and unless all the other subsystems are perfect (100% reliable over 5 years) the design team will not achieve the reliability objective.

     A search reveals three alternative power supplies that will meet the functional requirements. One has a 97% reliability at a cost of $50, the second has a 98% reliability at a cost of $100 and the third has a 99% reliability at a cost of $250.

     If product cost is not an issue (rarely the case), spend the $250 and achieve the apportioned objective. If it is possible to improve the reliability of other subsystems, say the monitor, for less cost, to offset the difference between the 99% goal and the 98% or 97% reliability associated with less expensive power supplies, then that would provide the highest reliability for the least cost. This is a simple illustration of the cost/benefit tradeoff; in practice these may become very complicated decisions.

     An advanced practice is to establish reliability goals and associated apportionment for the various stage gates during the product lifecycle. With each successive round of design, prototyping, and analysis not only is the product improving, but the uncertainty is also diminishing. Using the lower limit for reliability estimates is one way to reflect the range of reliability uncertainty.

     The primary intent of using reliability goals and apportionment is to permit meaningful decisions concerning reliability along with the ability to consider product cost and other important aspects of the design in a meaningful manner.

     5. FEEDBACK MECHANISMS

     There are two basic questions in reliability engineering. What is going to fail? And, when will the product fail? Both are related to failure mechanisms. The first may require the discovery of the failure mechanism. The second may require the determination of the expected behavior of the failure mechanism over time. Both questions have a wide range of tools available to find the answers. It is the selection of the right tools to provide a good enough answer in an effective and efficient manner that is the subject of this section.

     Each engineer tends to design away from failure (Petroski 1994). And, each engineer generally knows about the most likely failure mechanisms related to their section of the design, within the realm of their experience. They may gain additional experience as their design fails in unexpected (to them) ways. Part of the design process is to uncover failures and improve the design to avoid or lessen the probability of the same.

     Tools such as FMEA and HALT permit the design team to discover failures. Often the FMEA session permits the design team to share the known or expected failure mechanisms. Occasionally, a new possible failure mode appears in these sessions. The real value is in improving the ability of the entire team to identify unknown failures and address the effects of the known expected failures. Each person on the FMEA team brings a set of known or expected failures to the discussion. The combined set increases the entire team's awareness of the larger set of possible issues.

     HALT, in the broadest sense, starts with the first product models or bench top testing. Exploring the reaction of the product to various stimulations is an exploration of where the product works by defining where it doesn't work. The intention of HALT is to apply stresses relevant to the product's environment (vibration, voltage, temperature, usage rates, etc.) and determine the boundary between functional and non-functional behavior. Careful root cause analysis then uncovers and explains the failures, enabling the design to adjust to create a more robust product.

     Common engineering tools also permit this discovery. Many CAD programs include basic finite element analysis capabilities. Adjusting material properties to reflect the effects of aging (e.g. oxidation of polymers making them more brittle) and performing a simple analysis may find aging weaknesses in the design. The same applies for SPICE models of circuits. Consider the expected drift of capacitor values over time and the continued functionality of the circuit.

     If the product is new or contains new technology or assembly processes, the nature of the failures may not be well understood. FMEA and HALT and related discovery tools apply. If the project is to refine an existing product and there is ample internal and field data defining the areas for improvements, then the discovery tools do not add value.

     The first question looks for what will fail. If the failures are known or the various tools help determine what will fail, the
  7. 7. product reliability can be improved by addressing those aspects of the product that lead to the failures. One approach to product design is: build, test, fix, repeat. That is, find and fix the first element of a design to fail and the product improves. Continue to do so until there are no more failures or the design reaches the design limits of the materials (for example, the first failure occurs as the polymer case melts).

     The primary drawback to this approach is the inability to quantify the product reliability value concerning how many units will last how long. Understanding what will fail is critical to being able to answer the second question: when will it fail?

     As the design team addresses the design issues, the second question enables them to know if they have achieved the product reliability goal. As with discovery tools, there are many tools available to determine how long a product will last. Predictions, accelerated life testing, and demonstration tests are all capable of providing an estimate of how long a product may last.

     Deterministic models may also provide results. For example, the polymer diffusion rate permits air to accumulate within a tube, which at a critical air volume will block fluid flow. This process can be modeled and the time to failure calculated for different wall thicknesses and air pressures. Field data is often the most accurate way to estimate actual field performance although it is usually not available for new products or elements of new products.

     To illustrate how to select the appropriate tools to provide feedback to the design team, let's consider a few cases. Keep in mind that not all tools are appropriate for all situations.

     5.1 The existing technology in new environment case

     To illustrate the existing technology in new environment case, consider the initial design of an LED brake light. This is new technology with respect to the application of the LED to the car taillight environment. While LED lighting has been available in a range of applications for some time, the car taillight environment is harsher and more demanding than previous application environments. Simply consider the ambient temperature extremes, from overnight outdoors in Fargo, ND (-30°F) to direct sunlight exposure within an unventilated enclosure in the Tucson, AZ summer (180°F). Also, a new assembly process to attach dozens of LED elements to a brake light pattern frame in a high-volume mass-production assembly line will be required.

     There is no history, no previous products on the market using LEDs in anything like the brake light environment. What could possibly go wrong? The design team doesn't know what could go wrong. Therefore, the appropriate set of tools should first discover the most likely failure mechanisms. FMEA and HALT both apply, for example. Both of these tools can build on what is already known about LED operations and known failure mechanisms. The new environment may accelerate some little known failure mechanisms, or it may simply accelerate already well known mechanisms.

     Once the failure mechanisms are known, the requirement for the new brake lights is to last twice as long as the current incandescent systems, or a 95% probability of lasting 10 years. Simply finding the failures, surprising failure modes or not, permits the evaluation of how long they will last in the expected new environment. Tools such as ALT may apply. Thermal cycling for the solder joint attachments and high temperature exposure while illuminated to evaluate the luminosity degradation are two examples of what could be usefully tested.

     The results of the discovery evaluations along with engineering judgments concerning the uncertainty of failure mechanism behaviors will prioritize the list of most likely failure mechanisms. This list then can be sorted by appropriate stress to design accelerated life tests. More than one failure mechanism may be accelerated due to the high temperature exposure, for instance.

     The reliability program most likely will have goal setting, apportionment, initial reliability predictions based on literature and vendor data, prototype testing of various solder joint attachment methodologies, and product level accelerated life testing focused on 3 to 5 different stresses.

     This approach takes advantage of existing knowledge concerning LED technology and previous explorations of failure mechanisms within LED technology and solder attachment methods. The approach also considers if any new failure modes may appear in the new, harsher environment. The approach also considers the relatively low cost of the individual units and the ability to quickly measure the product performance by the use of ALT. The initial risk is high for the new environment, and if the LEDs actually last longer than twice the expected life of the incandescent systems the product provides a cost savings to the car manufacturer and […]

     5.2 The low volume high cost case

     In comparison to the first case, consider a product that has a very limited production volume, say 50 total units. Plus, each unit is very expensive, say $1 million. Running 30 units each in three different ALTs to failure is not viable. Even getting one full unit for destructive HALT testing is not likely. Yet all the same unknowns as above or more may apply.

     Consider an oil exploration sensor array unit that attaches to the drill string during drilling and has the function to monitor and report the presence of specific types of hydrocarbons. This is a complex system in a very harsh environment.

     The list of what and how the product could fail is quite long. Given the constraint of no system level units for product testing, only a few of the tools from the first case apply: goal setting, apportionment, prediction and FMEA. The FMEA is a discovery tool and will not provide the necessary feedback on the product's expected durability. Thus the onus is on performing accurate predictions.

     In this case, the use of Physics of Failure (PoF) modeling may be the most valuable tool available. Understanding the relationship between the expected stresses and the component level responses over time permits the PoF models to predict the system life. The development of the PoF models related to the critical component failure mechanisms may take significant work, yet the option to test multiple units is not viable. Therefore, the analytical and theoretical work permits the team to receive feedback on the expected product weaknesses and expected life limiting failure mechanisms. Even determining the critical component failure mechanisms may be difficult.
  8. 8. This approach takes advantage of the existing literature detailing failure mechanisms for a wide range of components, plus the ability to evaluate individual components at much less expense than the full system. The approach has more risk in the identification of the unknown failure mechanisms related to the full system configuration and use, yet careful use of tools like FMEA and reliability modeling permits the team to mitigate this risk to some extent.

5.3 The moderate volume product family variation case

A common case is the modification of an existing product. There is field data, the previous product testing information is available, and the list of known failure mechanisms is well defined. Furthermore, the product functions, intended environment and use profile remain basically the same.

In some regards, this is more difficult than the previous two cases. One approach would be to only test the new product with respect to the changes, and possibly only evaluate the individual new components, with the justification that nothing else has changed.

The second possible approach is to repeat all of the evaluations and testing as done for the original product. Here the justification may be that the relatively minor changes may adversely impact existing elements of the product. Or, worse, the justification could be a 'we always do the full set of testing' mentality.

Both approaches have risks and costs that can and should be mitigated. Using the existing reliability models and best available data, the design team can isolate the changes and assign a range of predicted values to the new component reliability. In conjunction with that, they can perform a very focused Design FMEA on the changes, with an emphasis on how the changes impact any other element within the design.

At this point, the design team can decide if the uncertainty concerning either the interaction effects or the expected life warrants further testing. If the true value anywhere within the range of reliability uncertainty will not preclude the product launch, then clearly no further testing is needed and the current prediction is sufficient. If the low end of the range, on the other hand, would require further reliability improvements, or if the impact of the changes on other aspects of the product is unclear, then further analysis and testing will be needed.

The appropriate approach considers what is known and unknown, and the associated risks and decision points. The intent is to provide both guidance and feedback to the design team that permits well informed decisions. Using too few or too many reliability tools may incur undue risks or costs. The well crafted reliability program carefully considers how each reliability activity provides feedback toward answering the two primary questions: what will fail and when will the product fail. The intent is to add value to the product.

6. DETERMINING VALUE

One way to select reliability tools for improving the product's reliability is to consider the return on investment. If the activity will not reduce risk, increase durability, reduce engineering time, or eliminate failure mechanisms, etc., then the activity should not occur.

Consider a simple example. Recall the computer with five subsystems from the apportionment discussion above. The initial goal for each subsystem was 99% over 5 years. Suppose after the first round of predictions we find the keyboard has an expected reliability of 99.9% over 5 years. Furthermore, let's assume the remaining four subsystems all meet their goal at 99%. And, it is possible to improve the reliability of the keyboard to 99.999% by spending $1 more per keyboard. And, let's assume it will cost the company $1,000 per field failure for any cause. And, we expect to build and sell 100,000 computers.

For the current keyboard, we expect 100,000 - (0.999 * 100,000) = 100 keyboard failures. These will cost the company 100 * $1,000 = $100,000.

For the new keyboard the cost will be 100,000 * $1 = $100,000. The savings will be due to reducing the field failures from 100 failures to 1 failure. The new keyboard's one failure costs $1,000. This is down from $100,000 for a savings of $99,000.

For a savings of approximately $99,000 we spent $100,000, which may make it difficult to justify the change. The calculations might be more favorable if, for the same cost of change, a difference in reliability from 99% to 99.5% could be made in the power supply. For any proposed change that impacts the reliability apportionment model, the above calculation quickly illustrates the value.

Yet, not all of the reliability tools directly increase or decrease the expected reliability. In some cases, the tool might only shorten the time to detect the failure mechanisms. HALT is an example of this, and it often finds most of the failure mechanisms in a design within a week, which would normally take months of standards based environmental testing to uncover. The savings in time to market risk more than justify the necessity of making multiple trips to the HALT […]

Another cost saving is the reduction in uncertainty. By simply improving the accuracy of reliability predictions, the range of the estimated reliability diminishes. Once the range no longer crosses a decision threshold to either conduct further analysis or testing, the project resources can focus improvement efforts on other high uncertainty or low reliability elements.

7. MATURITY MODEL

The state of the organization is also important. A design team that has no experience or expertise in statistical methods will probably flounder when trying to use an event-conditional based reliability block diagram that requires advanced statistical modeling. Getting this team to simply use a Weibull cumulative distribution plot may be a stretch, and provide more value initially.

Each organization has a set of skills, expectations, structures, etc. that defines the culture concerning product reliability. Reliability tools that are designed and applied to make an impact within the organization should fit within or be close to the organization's current capabilities. The tools will only have impact and be useful if they are understood and make the current situation better. For example, a team that is consistently surprised by field failure modes may immediately benefit by conducting HALT testing to discover failure mechanisms before their customers do so for them.

Phil Crosby in his book Quality is Free (Crosby 1979) created a maturity matrix focused on quality. With slight
  9. 9. modification, by substituting reliability for quality, the same basic table is meaningful for the assessment of an organization's reliability program.

The primary difference that separates an effective reliability program from a non-effective one is the proactive nature of the program. On one occasion I conducted assessments of two organizations located in the same building. Both designed and manufactured telecommunication equipment with similar complexity and volume. The interview schedule had me going up and down stairs almost every hour for two days, and by midday of the first day I enjoyed going upstairs and dreaded heading down. Despite all the product and business similarities, the two reliability programs were dramatically different; as different as their reliability results.

Downstairs the interviews started late and got interrupted by urgent phone calls or in-person requests; firefighting at its best. The team employed a wide range of tools, all that were listed on a checklist, for each project. The reliability goals were not known to the design team, and the few that did know them also understood that they would not be measured, nor would a failure to meet those goals impede getting the product to market. The people I talked to stated reliability was very important and were very busy fixing field failures or issues identified in testing just before product launch. Reliability was done by "the guy that left last year".

Upstairs the interviews started on time and proceeded without interruption. No one remembered the last time there had been an urgent need to resolve a field issue. The team employed reliability tools that would benefit the project as needed. The specific testing that was done was tailored to the risks identified during the design phase. The goals were widely known and their current status was also known, both during development and after product launch. The people I talked to stated reliability was very important and they knew what to do to meet their reliability objectives. The team's manager taught reliability thinking and skills, and everyone did reliability.

For both organizations the basic structure and thought process to determine which reliability tools to use would apply, but because of their different stages of development different tools would be needed. The downstairs, stage II, organization would require coaching, training, and resources to break the cycle of letting surprising field failures dominate the engineering day. The upstairs, stage IV, organization might be ready for advanced tools related to product modeling or field data analysis. They would have the time to learn advanced accelerated degradation testing methods, for example.

8. CONCLUSIONS

In summary, the traits of effective reliability programs include the ability to:
• State clear reliability goals;
• Enable tradeoff decision-making;
• Selectively use only value-added reliability activities;
• Promote a proactive reliability culture.

The basic message is that no one list or standard of tasks makes an effective reliability program. The selection of valuable tools and the establishment of a basic structure for decision-making permit an organization to achieve the desired reliability objectives.

9. REFERENCES

Crosby, P. B. (1979). Quality is Free: The Art of Making Quality Certain. New York, Signet.

Ireson, W. G., C. F. Coombs, et al. (1995). Handbook of Reliability Engineering and Management. New York, McGraw-Hill.

Petroski, H. (1994). Design Paradigms: Case Histories of Error and Judgment in Engineering. Cambridge, Cambridge University Press.
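The keyboard trade-off worked through in section 6 can be sketched as a short calculation. This is a minimal sketch under stated assumptions, not a tool from the tutorial: the function names `field_failure_cost` and `net_benefit` are hypothetical, the improved keyboard reliability is set to 0.99999 (the value that yields the single expected failure quoted in the text), and the power supply scenario assumes the same $100,000 total cost of change.

```python
def field_failure_cost(units, reliability, cost_per_failure):
    """Return (expected field failures, expected total failure cost)."""
    failures = units * (1.0 - reliability)
    return failures, failures * cost_per_failure


def net_benefit(units, r_current, r_improved, added_cost_per_unit, cost_per_failure):
    """Savings from avoided field failures minus the cost of the change."""
    _, cost_now = field_failure_cost(units, r_current, cost_per_failure)
    _, cost_new = field_failure_cost(units, r_improved, cost_per_failure)
    savings = cost_now - cost_new
    investment = units * added_cost_per_unit
    # Round to cents to hide floating point noise in the subtraction.
    return round(savings - investment, 2)


if __name__ == "__main__":
    UNITS = 100_000            # computers built and sold (from the text)
    COST_PER_FAILURE = 1_000   # dollars per field failure (from the text)
    ADDED_COST = 1             # dollars more per unit (from the text)

    # Keyboard: 99.9% today; 0.99999 is an assumption matching "1 failure".
    keyboard = net_benefit(UNITS, 0.999, 0.99999, ADDED_COST, COST_PER_FAILURE)

    # Power supply: 99% to 99.5% for the same cost of change (assumed $1/unit).
    power_supply = net_benefit(UNITS, 0.99, 0.995, ADDED_COST, COST_PER_FAILURE)

    print(f"Keyboard change net benefit:     ${keyboard:,.0f}")
    print(f"Power supply change net benefit: ${power_supply:,.0f}")
```

Under these assumptions the keyboard change loses about $1,000 overall, while the same spend on the power supply returns about $400,000, which is the comparison the text points to when it says the calculation "quickly illustrates the value".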