# Reliability Engineering

### Reliability Engineering

Reliability engineering is engineering that emphasizes dependability in the lifecycle management of a product. Dependability, or reliability, describes the ability of a system or component to function under stated conditions for a specified period of time. [1] Reliability engineering is a sub-discipline within systems engineering. Reliability is theoretically defined as the probability of success (one minus the probability of failure), as the frequency of failures, or in terms of availability, a probability derived from reliability and maintainability. Maintainability and maintenance may be defined as a part of reliability engineering. Reliability plays a key role in the cost-effectiveness of systems.

Although reliability is defined and affected by stochastic parameters, according to some acknowledged specialists, quality, reliability and safety are not achieved by mathematics and statistics. Nearly all teaching and literature on the subject emphasizes these aspects and ignores the reality that the ranges of uncertainty involved largely invalidate quantitative methods for prediction and measurement. [2]

Reliability engineering for complex systems requires a different, more elaborate systems approach than for non-complex systems. Reliability engineering may involve the creation of proper use studies and requirements specifications, hardware and software design, functional (failure) analysis, testing and analyzing manufacturing, maintenance, transport, storage, spare parts stocking, operations research, human factors and technical documentation. The acquisition and organisation of data and information may also be important. Effective reliability engineering requires an understanding of the basics of failure mechanisms, which in turn requires experience, broad engineering skills and good knowledge from many different special fields of engineering, such as tribology, stress and fracture mechanics, fatigue, thermal engineering, shock, electrical engineering and chemical engineering.
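The theoretical definitions above can be stated compactly. As a sketch, let $T$ be the random time to failure of an item and $F(t)$ its cumulative distribution function (both symbols are introduced here for illustration and do not come from the source):

$$
R(t) = \Pr(T > t) = 1 - F(t)
$$

Under the additional, often unrealistic, assumption of a constant failure rate $\lambda$, this reduces to

$$
R(t) = e^{-\lambda t}, \qquad \text{MTBF} = \frac{1}{\lambda}
$$

so the "frequency of failures" view and the "probability" view describe the same model only when that assumption holds.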
Reliability engineering is closely related to safety engineering and system safety, in that they use common methods for their analysis and may require input from each other. Reliability engineering focuses on the costs of failure caused by system downtime, the cost of spares, repair equipment, personnel and the cost of warranty claims. The focus of safety engineering is normally not on cost but on preserving life and nature, and it therefore deals only with particular dangerous system failure modes. Here too, high reliability (safety) levels are the result of good engineering and attention to detail, and almost never the result of purely reactive failure management (reliability accounting / statistics). [3]

"Reliability is, after all, engineering in its most practical form," as James R. Schlesinger, former US Secretary of Defense, once stated.

#### Overview

Reliability may be defined in the following ways:

• The idea that an item is fit for a purpose with respect to time
• The capacity of a designed, produced or maintained item to perform as required over time
• The capacity of a population of designed, produced or maintained items to perform as required over a specified time
• The resistance to failure of an item over time
• The probability of an item to perform a required function under stated conditions for a specified period of time
• The durability of an object

Many engineering techniques are used in reliability engineering, such as reliability hazard analysis, failure mode and effects analysis (FMEA), failure modes, mechanisms, and effects analysis (FMMEA), [4] fault tree analysis (FTA), material stress and wear calculations, fatigue and creep analysis, finite element analysis, reliability prediction, thermal (stress) analysis, corrosion analysis, human error analysis, reliability testing, statistical uncertainty estimations, Monte Carlo simulations, design of experiments, reliability centered maintenance (RCM), and failure reporting and corrective actions management.
Because of the large number of reliability techniques, their expense, and the varying degrees of reliability required for different situations, most projects develop a reliability program plan to specify the reliability tasks that will be performed for that specific system.

Consistent with the creation of safety cases, for example ARP4761, the goal is to provide a robust set of qualitative and quantitative evidence that use of a component or system will not be associated with unacceptable risk. The basic steps are to:

• Thoroughly identify relevant unreliability "hazards", e.g. potential conditions, events, human errors, failure modes, interactions, failure mechanisms and root causes, by specific analysis or tests
• Assess the associated system risk, by specific analysis or testing
• Propose mitigation, e.g. requirements, design changes, detection, maintenance, training, by which the risks may be lowered and controlled at an acceptable level
• Determine the best mitigation and get agreement on final, acceptable risk levels, possibly based on cost-benefit analysis

Risk is the combination of the probability and the severity of the failure incident (scenario) occurring. In a de minimis definition, the severity of failures includes the cost of spare parts, man-hours, logistics, damage (secondary failures) and downtime of machines, which may cause production loss. A more complete definition of failure can also include injury, dismemberment and death of people within the system (witness mine accidents, industrial accidents, space shuttle failures), and the same for innocent bystanders (witness the citizenry of cities like Bhopal, Love Canal, Chernobyl or Sendai, and other victims of the 2011 Tōhoku earthquake and tsunami). What is acceptable is determined by the managing authority, the customers or the affected communities. Residual risk is the risk that is left over after all reliability activities have finished; it includes the unidentified risk and is therefore not completely quantifiable.
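The risk definition and the mitigation steps above can be illustrated with a small ranking exercise. The following is a minimal sketch, assuming that "combination" is simplified to the product of probability and severity, and that severity is expressed as a cost. The `Hazard` class, the function names and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    """One identified unreliability hazard (all values are illustrative)."""
    name: str
    probability: float  # assumed chance of the failure scenario per year
    severity: float     # assumed cost if it occurs (spares, downtime, ...)

    @property
    def risk(self) -> float:
        # Simplifying assumption: risk = probability x severity.
        return self.probability * self.severity


def rank_by_risk(hazards):
    """Return hazards sorted from highest to lowest risk."""
    return sorted(hazards, key=lambda h: h.risk, reverse=True)


def net_benefit(hazard, mitigated_probability, mitigation_cost):
    """Risk reduction achieved by a mitigation, minus its (annualised) cost."""
    mitigated_risk = mitigated_probability * hazard.severity
    return (hazard.risk - mitigated_risk) - mitigation_cost


# Hypothetical hazard register.
hazards = [
    Hazard("pump seal leak", probability=0.2, severity=5_000),
    Hazard("controller failure", probability=0.01, severity=200_000),
    Hazard("sensor drift", probability=0.5, severity=400),
]

for h in rank_by_risk(hazards):
    print(f"{h.name}: risk = {h.risk:.0f} per year")

# Cost-benefit check for one proposed mitigation (hypothetical numbers).
controller = hazards[1]
print(net_benefit(controller, mitigated_probability=0.002, mitigation_cost=1_000))
```

A positive net benefit suggests the mitigation is worth considering on cost grounds alone. Such a calculation covers only the identified hazards, so it says nothing about the residual, unidentified risk.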
#### Reliability and availability program plan

A reliability program plan is used to document exactly what "best practices" (tasks, methods, tools, analyses and tests) are required for a particular (sub)system, as well as to clarify customer requirements for reliability assessment. For large-scale, complex systems, the reliability program plan should be a separate document. Resource determination for manpower and budgets for testing and other tasks is critical for a successful program. In general, the amount of work required for an effective program for complex systems is large.

A reliability program plan is essential for achieving high levels of reliability, testability and maintainability, and the resulting system availability. It is developed early during system development and refined over the system's life cycle. It specifies not only what the reliability engineer does, but also the tasks performed by other stakeholders. A reliability program plan is approved by top program management, which is responsible for allocating sufficient resources for its implementation.

A reliability program plan may also be used to evaluate and improve the availability of a system through a strategy of focusing on increasing testability and maintainability rather than reliability. Improving maintainability is generally easier than improving reliability. Maintainability estimates (repair rates) are also generally more accurate. However, because the uncertainties in the reliability estimates are in most cases very large, they are likely to dominate the availability (prediction uncertainty) problem, even when maintainability levels are very high. When reliability is not under control, more complicated issues may arise, such as manpower shortages (maintainers / customer service capability), spare part availability, logistic delays, lack of repair facilities, extensive retrofit and complex configuration management costs, and others.
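The point about reliability uncertainty dominating the availability prediction can be made concrete. The sketch below assumes the common steady-state relation availability = MTBF / (MTBF + MTTR), which the source does not state explicitly. The nominal values and the uncertainty bands (a factor of 3 on MTBF, 20 % on MTTR) are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def spread(values):
    """Width of the range covered by the given values."""
    return max(values) - min(values)


# Hypothetical nominal values and uncertainty bands (not from the source).
MTBF_NOMINAL, MTTR_NOMINAL = 1_000.0, 10.0
mtbf_band = (MTBF_NOMINAL / 3, MTBF_NOMINAL * 3)      # factor-of-3 uncertainty
mttr_band = (MTTR_NOMINAL * 0.8, MTTR_NOMINAL * 1.2)  # +/- 20 % uncertainty

# Vary one input across its band while holding the other at its nominal value.
spread_from_mtbf = spread([availability(m, MTTR_NOMINAL) for m in mtbf_band])
spread_from_mttr = spread([availability(MTBF_NOMINAL, r) for r in mttr_band])

print(f"nominal availability: {availability(MTBF_NOMINAL, MTTR_NOMINAL):.4f}")
print(f"spread due to MTBF uncertainty: {spread_from_mtbf:.4f}")
print(f"spread due to MTTR uncertainty: {spread_from_mttr:.4f}")
```

With these assumed bands, the spread caused by the MTBF uncertainty is several times larger than the spread caused by the MTTR uncertainty, which is the effect the paragraph above describes.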
The problem of unreliability may also be increased by the "domino effect" of maintenance-induced failures after repairs. Focusing only on maintainability is therefore not enough. If failures are prevented, none of the other issues matter, and reliability is therefore generally regarded as the most important part of availability. Reliability needs to be evaluated and improved in relation to both availability and the cost of ownership (due to the cost of spare parts, maintenance man-hours, transport costs, storage costs, part obsolescence risks, etc.). But, as GM and Toyota have belatedly discovered, total cost of ownership (TCO) also includes the downstream liability costs when reliability calculations do not sufficiently or accurately address customers' personal bodily risks. Often a trade-off is
#### Design for reliability

Reliability design begins with the development of a (system) model. Reliability and availability models use block diagrams and fault trees to provide a graphical means of evaluating the relationships between different parts of the system. These models may incorporate predictions based on failure rates taken from historical data. While the (input data) predictions are often not accurate in an absolute sense, they are valuable for assessing relative differences between design alternatives. Maintainability parameters, for example MTTR, are other inputs for these models.

The most important fundamental initiating causes and failure mechanisms are to be identified and analyzed with engineering tools. A diverse set of practical guidance and practical performance and reliability requirements should be provided to designers so that they can generate low-stressed designs and products that protect against, or are protected against, damage and excessive wear. Proper validation of input loads (requirements) may be needed, as may verification of reliability "performance" by testing.

[Figure: A fault tree diagram]

One of the most important design techniques is redundancy. This means that if one part of the system fails, there is an alternate success path, such as a backup system. The reason why this is the ultimate design choice is that high-confidence reliability evidence for new parts or items is often not available or is extremely expensive to obtain. By creating redundancy, together with a high level of failure monitoring and the avoidance of common cause failures, even a system with relatively poor single-channel (part) reliability can be made highly reliable (mission reliability) at system level. No reliability testing need be required for this.
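The effect of redundancy on system-level reliability can be sketched with the standard series and parallel rules used in reliability block diagrams. This is a minimal illustration that assumes statistically independent channels, which is exactly what common cause failures would violate. The function names and the reliability values are hypothetical.

```python
from math import prod


def series(reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one channel must work: 1 minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


channel = 0.90  # hypothetical, relatively poor single-channel reliability

# Mission reliability with 1, 2 and 3 independent redundant channels.
for n in (1, 2, 3):
    print(f"{n} redundant channel(s): {parallel([channel] * n):.4f}")

# A redundant pair feeding a single shared component (hypothetical 0.99).
shared_part = 0.99
print(f"pair in series with shared part: {series([parallel([channel] * 2), shared_part]):.4f}")
```

The last line shows why single points of failure matter: a shared part in series caps the reliability that the redundant pair can deliver.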
Furthermore, by using redundancy together with dissimilar design and manufacturing processes (different suppliers) for the single independent channels, less sensitivity to quality issues (early childhood failures) is created, and very high levels of reliability can be achieved at all moments of the development cycles (early life times and long term). Redundancy can also be applied in systems engineering by double-checking requirements, data, designs, calculations, software and tests to overcome systematic failures.

Another design technique to prevent failures is called physics of failure. This technique relies on understanding the physical static and dynamic failure mechanisms. It accounts for variation in load, strength and stress leading to failure at a high level of detail, made possible by modern finite element method (FEM) software programs that can handle complex geometries and mechanisms such as creep, stress relaxation, fatigue and probabilistic design (Monte Carlo simulations / DOE). The material or component can be redesigned to reduce the probability of failure and to make it more robust against variation.

Another common design technique is component derating: selecting components whose tolerance significantly exceeds the expected stress, such as using a heavier-gauge wire that exceeds the normal specification for the expected electrical current.

Another effective way to deal with unreliability issues is to perform analysis to predict degradation, making it possible to prevent unscheduled down events / failures from occurring. RCM (reliability centered maintenance) programs can be used for this.

Many tasks, techniques and analyses are specific to particular industries and applications. Commonly these include:

• Built-in test (BIT) (testability analysis)
• Failure mode and effects analysis (FMEA)
• Reliability hazard analysis
• Reliability block-diagram analysis
• Dynamic reliability block-diagram analysis [5]
• Fault tree analysis
• Root cause analysis
• Sneak circuit analysis
• Accelerated testing
• Reliability growth analysis
• Weibull analysis
• Thermal analysis by finite element analysis (FEA) and / or measurement
• Thermally induced, shock and vibration fatigue analysis by FEA and / or measurement
• Electromagnetic analysis
• Statistical interference
• Avoidance of single points of failure
• Functional analysis and functional failure analysis (e.g., function FMEA, FHA or FFA)
• Predictive and preventive maintenance: reliability centered maintenance (RCM) analysis
• Testability analysis
• Failure diagnostics analysis (normally also incorporated in FMEA)
• Human error analysis
• Operational hazard analysis
• Manual screening
• Integrated logistics support

Results are presented during the system design reviews and logistics reviews. Reliability is just one requirement among many system requirements. Engineering trade studies are used to determine the optimum balance between reliability and other requirements and constraints.

#### Reliability prediction and improvement

Reliability prediction is the combination of creating a proper reliability model, estimating (and justifying) the input parameters for this model (such as failure rates for a particular failure mode or event, and the mean time to repair the system for a particular failure), and finally providing a system (or part) level estimate for the output reliability parameters (system availability or a particular functional failure frequency).

Some recognized reliability engineering specialists (e.g. Patrick O'Connor, R. Barnard) have argued that too much emphasis is often given to the prediction of reliability parameters and that more effort should be devoted to the prevention of failure (reliability improvement). In most cases, failures can and should be prevented in the first place. The emphasis on quantification and target setting in terms of, for example, MTBF might suggest that there is a limit to the amount of reliability that can be achieved. In theory there is no inherent limit, and higher reliability does not need to be more costly in development. Another of their arguments is that prediction of reliability based on historical data can be very misleading, as a comparison is only valid for exactly the same designs, products, manufacturing processes and maintenance under exactly the same loads and environmental context. Even a minor change in detail in any of these could have major effects on reliability. Furthermore, the most unreliable and important items (the most interesting candidates for a reliability investigation) are the ones most often subjected to many modifications and changes. Engineering designs are updated frequently in most industries. This is why the standard (reactive or proactive) statistical methods and processes used in the medical industry or the insurance sector are not as effective for engineering. Another surprising but logical argument is that to be able to accurately predict reliability by testing, the exact mechanisms of failure must have been known in most cases and therefore, in most cases, can