Reliability engineering 1
Reliability engineering is engineering that emphasizes dependability in the lifecycle management of a product.
Dependability, or reliability, describes the ability of a system or component to function under stated conditions for a
specified period of time.
Reliability engineering is a sub-discipline within systems engineering. Reliability is
theoretically defined as the probability of success (one minus the probability of failure), as the frequency of
failures, or in terms of availability, a probability
derived from reliability and maintainability. Maintainability and maintenance may be defined as a part of reliability
engineering. Reliability plays a key role in cost-effectiveness of systems.
Although reliability is defined and affected by stochastic parameters, according to some acknowledged specialists,
quality, reliability and safety are not achieved by mathematics and statistics. Nearly all teaching and literature on the
subject emphasizes these aspects, and ignores the reality that the ranges of uncertainty involved largely invalidate
quantitative methods for prediction and measurement.
Reliability engineering for complex systems requires a different, more elaborate systems approach than for
non-complex systems. Reliability engineering may involve the creation of proper use studies and requirements
specification, hardware and software design, functional (failure) analysis, testing, and the analysis of manufacturing,
maintenance, transport, storage, spare parts stocking, operations research, human factors and technical
documentation. Data and information acquisition and organisation may also be important. Effective reliability
engineering requires an understanding of the basics of failure mechanisms, for which experience, broad engineering
skills and good knowledge of many different special fields of engineering are needed, such as tribology, stress and
fracture mechanics, fatigue, thermal, shock, electrical and chemical engineering.
Reliability engineering is closely related to safety engineering and system safety, in that they use common methods
for their analysis and may require input from each other. Reliability engineering focuses on costs of failure caused by
system downtime, cost of spares, repair equipment, personnel and cost of warranty claims. The focus of safety
engineering is normally not on cost, but on preserving life and nature, and therefore deals only with particular
dangerous system failure modes. Here too, high reliability (safety) levels are the result of good engineering and
attention to detail, and almost never the result of only reactive failure management (reliability accounting / statistics).
"Reliability is, after all, engineering in its most practical form" as once stated by James R. Schlesinger, Former US
Secretary of Defense.
Reliability may be defined in the following ways:
• The idea that an item is fit for a purpose with respect to time
• The capacity of a designed, produced or maintained item to perform as required over time
• The capacity of a population of designed, produced or maintained items to perform as required over a specified time
• The resistance to failure of an item over time
• The probability of an item to perform a required function under stated conditions for a specified period of time
• The durability of an object
Many engineering techniques are used in reliability engineering, such as reliability hazard analysis, failure mode and
effects analysis (FMEA), failure modes, mechanisms, and effects analysis (FMMEA),
fault tree analysis (FTA),
material stress and wear calculations, fatigue and creep analysis, finite element analysis, reliability prediction,
thermal (stress) analysis, corrosion analysis, human error analysis, reliability testing, statistical uncertainty
estimations, Monte Carlo simulations, design of experiments, reliability centered maintenance (RCM), failure
reporting and corrective actions management. Because of the large number of reliability techniques, their expense,
and the varying degrees of reliability required for different situations, most projects develop a reliability program
plan to specify the reliability tasks that will be performed for that specific system.
Consistent with the creation of safety cases, for example ARP4761, the goal is to provide a robust set of qualitative
and quantitative evidence that use of a component or system will not be associated with unacceptable risk. The basic
steps to take are to:
• First thoroughly identify relevant unreliability "hazards", e.g. potential conditions, events, human errors, failure
modes, interactions, failure mechanisms and root causes, by specific analysis or tests
• Assess the associated system risk, by specific analysis or testing
• Propose mitigation, e.g. requirements, design changes, detection, maintenance, training, by which the risks may
be lowered and controlled at an acceptable level
• Determine the best mitigation and get agreement on final, acceptable risk levels, possibly based on cost-benefit analysis
Risk is the combination of probability and severity of the failure incident (scenario) occurring.
In a de minimis definition, the severity of failures includes the cost of spare parts, man-hours, logistics, damage
(secondary failures) and downtime of machines, which may cause production loss. A more complete definition of
failure can also mean injury, dismemberment and death of people within the system (witness mine accidents,
industrial accidents, space shuttle failures) and the same to innocent bystanders (witness the citizenry of cities like
Bhopal, Love Canal, Chernobyl or Sendai and other victims of the 2011 Tōhoku earthquake and tsunami). What is
acceptable is determined by the managing authority, the customers or the affected communities. Residual risk is the
risk that is left over after all reliability activities have finished and includes the un-identified risk and is therefore not
Reliability and availability program plan
A reliability program plan is used to document exactly what "best practices" (tasks, methods, tools, analysis and
tests) are required for a particular (sub)system, as well as clarify customer requirements for reliability assessment.
For large scale, complex systems, the reliability program plan should be a separate document. Resource
determination for manpower and budgets for testing and other tasks is critical for a successful program. In general,
the amount of work required for an effective program for complex systems is large.
A reliability program plan is essential for achieving high levels of reliability, testability, maintainability and the
resulting system availability. It is developed early during system development and refined over the system's
life cycle. It specifies not only what the reliability engineer does, but also the tasks performed by other stakeholders.
A reliability program plan is approved by top program management, which is responsible for allocation of sufficient
resources for its implementation.
A reliability program plan may also be used to evaluate and improve the availability of a system by a strategy of
focusing on increasing testability and maintainability rather than reliability. Improving maintainability is generally
easier than improving reliability. Maintainability estimates (repair rates) are also generally more accurate. However,
because the uncertainties in the reliability estimates are in most cases very large, they are likely to dominate the
availability (prediction uncertainty) problem, even when maintainability levels are very high. When reliability is not under
control more complicated issues may arise, like manpower (maintainers / customer service capability) shortage,
spare part availability, logistic delays, lack of repair facilities, extensive retro-fit and complex configuration
management costs and others. The problem of unreliability may be increased also due to the "domino effect" of
maintenance induced failures after repairs. Only focusing on maintainability is therefore not enough. If failures are
prevented, none of the others are of any importance and therefore reliability is generally regarded as the most
important part of availability. Reliability needs to be evaluated and improved related to both availability and the cost
of ownership (due to cost of spare parts, maintenance man-hours, transport costs, storage cost, part obsolescence risks,
etc.). But, as GM and Toyota have belatedly discovered, TCO also includes the down-stream liability costs when
reliability calculations do not sufficiently or accurately address customers' personal bodily risks. Often a trade-off is
needed between the two. There might be a maximum ratio between availability and cost of ownership. Testability of
a system should also be addressed in the plan as this is the link between reliability and maintainability. The
maintenance strategy can influence the reliability of a system (e.g. by preventive and/or predictive maintenance),
although it can never bring it above the inherent reliability.
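The point that reliability uncertainty tends to dominate availability predictions can be illustrated with a minimal sketch. It assumes the common steady-state relation A = MTBF / (MTBF + MTTR), which the text above does not state explicitly, and every number in it is hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical nominal estimates for one subsystem.
nominal = steady_state_availability(mtbf_hours=1000.0, mttr_hours=10.0)

# Assumed uncertainty: the MTBF estimate may be off by a factor of ten,
# while the repair-time estimate is off by only 20 %.
low_mtbf_case = steady_state_availability(mtbf_hours=100.0, mttr_hours=10.0)
high_mttr_case = steady_state_availability(mtbf_hours=1000.0, mttr_hours=12.0)

print(f"nominal availability:        {nominal:.4f}")
print(f"MTBF ten times lower:        {low_mtbf_case:.4f}")
print(f"MTTR twenty percent higher:  {high_mttr_case:.4f}")
```

Under these assumed numbers, the error in the MTBF estimate moves the predicted availability far more than the error in the repair time does.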
The reliability plan should clearly provide a strategy for availability control. Whether only availability or also cost of
ownership is more important depends on the use of the system. For example, a system that is a critical link in a
production system – e.g. a big oil platform – is normally allowed to have a very high cost of ownership if this
translates to even a minor increase in availability, as the unavailability of the platform results in a massive loss of
revenue which can easily exceed the high cost of ownership. A proper reliability plan should always address RAMT
analysis in its total context. RAMT stands in this case for reliability, availability, maintainability/maintenance and
testability in context to the customer needs.
For any system, one of the first tasks of reliability engineering is to adequately specify the reliability and
maintainability requirements derived from the overall availability needs and more importantly, from proper failure
analysis or preliminary test results. Setting only availability targets is not appropriate. Reliability requirements
address the system itself, including test and assessment requirements, and associated tasks and documentation.
Reliability requirements are included in the appropriate system or subsystem requirements specifications, test plans
and contract statements. Creation of proper lower level requirements is critical.
Provision of only quantitative minimum targets (e.g. MTBF values or failure rates) is not sufficient, for several
reasons. One reason is that a full validation (related to correctness and verifiability in time) of a quantitative
reliability allocation (requirement specification) on lower levels for complex systems often cannot be made, because
1) the requirements are probabilistic, 2) the level of uncertainty involved in showing compliance with all these
probabilistic requirements is high, and 3) good estimates of a (probabilistic) reliability
number per item are available only very late in the project, sometimes only after many years of in-service use.
Compare this problem with the continuous (re-)balancing of, for example, lower-level system mass requirements in the
development of an aircraft, which is already often a big undertaking. Notice that in this case masses differ by
only a few percent, and the data is non-probabilistic and already available in CAD models. In the case of reliability,
the levels of unreliability (failure rates) may change by factors of ten or more (thousands of percent) as a result of
very minor deviations in design, process or anything else. The information is often not available without huge
uncertainties within the development phase. This makes the allocation problem almost impossible to solve in a useful,
practical, valid manner that does not result in massive over- or under-specification. A pragmatic approach is therefore
needed, for example the use of general levels or classes of quantitative requirements that depend only on the severity of
failure effects. Also, the validation of results is a far more subjective task than for any other type of requirement.
(Quantitative) Reliability parameters -in terms of MTBF - are by far the most uncertain design parameters in any
Furthermore, reliability design requirements should drive a (system or part) design to incorporate features that
prevent failures from occurring or limit the consequences of failure in the first place. Requirements that serve only to
support predictions could distract the engineering effort into a kind of accounting work. A design requirement
should be precise enough that a designer can "design to" it and can also prove, through analysis or testing, that
the requirement has been achieved, if possible within a stated confidence. A test requirement should be
detailed and could be derived from failure analysis (FEM) or other lower part or material level reliability tests, e.g.
required overloads (or stresses) and the test time needed. To derive these requirements in an effective manner, a
systems engineering based risk assessment and mitigation logic should be used. The design requirements shall be
part of the output from functional or other failure analysis or tests. These requirements (often design constraints) are
in this way derived from failure analysis or preliminary tests.
The maintainability requirements address the costs of repairs as well as repair time. Testability requirements provide
the link between reliability and maintainability and should address detectability of failure modes (on a particular
system level), isolation levels and the creation of diagnostics (procedures).
As indicated above, reliability engineers should also address requirements for various reliability tasks and
documentation during system development, test, production, and operation. These requirements are generally
specified in the contract statement of work and depend on how much leeway the customer wishes to provide to the
contractor. Reliability tasks include various analyses, planning, and failure reporting. Task selection depends on the
criticality of the system as well as cost. A safety critical system may require a formal failure reporting and review
process throughout development, whereas a non-critical system may rely on final test reports. The most common
reliability program tasks are documented in reliability program standards, such as MIL-STD-785 and IEEE 1332.
Failure reporting analysis and corrective action systems are a common approach for product/process reliability
Practically, most failures can in the end be traced back to root causes that are human errors of some kind. For
example, human errors in:
• Use studies
• Requirement analysis / setting
• Configuration control
• Calculations / simulations / FEM analysis
• Design drawings
• Testing (incorrect load settings or failure measurement)
• Statistical analysis
• Quality control
• Maintenance manuals
• Incorrect feedback of information
However, humans are also very good at detecting (the same) failures, correcting failures and improvising when
abnormal situations occur. A policy of completely ruling human actions out of any design and
production process to improve reliability may therefore not be effective. Some tasks are better performed by humans
and some are better performed by machines. Furthermore, human errors in management and the organization of data
and information or the misuse or abuse of items may also contribute to unreliability. This is the core reason why high
levels of reliability for complex systems can only be achieved by following a robust systems engineering process
with proper planning and execution of the validation and verification tasks. This also includes careful organization of
data and information sharing and creating a "reliability culture" in the same sense as having a "safety culture" is
paramount in the development of safety critical systems.
Design for reliability
Reliability design begins with the development of a (system) model. Reliability and availability models use block
diagrams and fault trees to provide a graphical means of evaluating the relationships between different parts of the
system. These models may incorporate predictions based on failure rates taken from historical data. While the (input
data) predictions are often not accurate in an absolute sense, they are valuable to assess relative differences in design
alternatives. Maintainability parameters, for example MTTR, are other inputs for these models.
The most important fundamental initiating causes and failure mechanisms are to be identified and analyzed with
engineering tools. A diverse set of practical guidance and practical performance and reliability requirements should
be provided to designers so they can generate low-stressed designs and products that protect or are protected against
damage and excessive wear. Proper validation of input loads (requirements) may be needed, as may verification of
reliability "performance" by testing.
A Fault Tree Diagram
One of the most important design techniques is redundancy. This means that if one part of the system fails, there is an
alternate success path, such as a backup system. The reason why this is the ultimate design choice is that
high-confidence reliability evidence for new parts or items is often not available or extremely expensive to obtain. By
creating redundancy, together with a high level of failure monitoring and the avoidance of common cause failures,
even a system with relatively poor single-channel (part) reliability can be made highly reliable (mission reliability)
at system level. No reliability testing is required for this. Furthermore, by using redundancy together with
dissimilar designs and manufacturing processes (different suppliers) for the single independent channels, the system
becomes less sensitive to quality issues (early-life failures), and very high levels of reliability can be achieved at
all moments of the development cycles (early life times and long term). Redundancy can also be applied in systems
engineering by double checking requirements, data, designs, calculations, software and tests to overcome systematic failures.
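The effect of redundancy, and the reason common cause failures must be avoided, can be sketched numerically. The code below is an illustration only: it assumes a 1-out-of-n arrangement with identical channels and uses a simplified beta-factor model (a fraction `beta` of each channel's failure probability is attributed to a shared cause), neither of which is taken from the text above. The function name and the numbers are hypothetical.

```python
def redundant_system_reliability(channel_reliability: float,
                                 channels: int,
                                 beta: float = 0.0) -> float:
    """Reliability of a 1-out-of-n system with identical channels.

    Assumed simplified beta-factor model: a fraction `beta` of the channel
    failure probability is a common cause that fails all channels at once;
    the remainder fails each channel independently.
    """
    if not 0.0 <= channel_reliability <= 1.0:
        raise ValueError("channel_reliability must lie in [0, 1]")
    if channels < 1:
        raise ValueError("channels must be at least 1")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")

    q = 1.0 - channel_reliability        # failure probability of one channel
    q_common = beta * q                  # shared-cause part
    q_independent = (1.0 - beta) * q     # independent part
    # The system survives if the common cause does not occur AND
    # at least one channel survives its independent failure mode.
    return (1.0 - q_common) * (1.0 - q_independent ** channels)


# Hypothetical channel with a fairly poor reliability of 0.9.
single = redundant_system_reliability(0.9, channels=1)
triple_independent = redundant_system_reliability(0.9, channels=3)
triple_common_cause = redundant_system_reliability(0.9, channels=3, beta=0.1)

print(f"single channel:               {single:.5f}")
print(f"1oo3, fully independent:      {triple_independent:.5f}")
print(f"1oo3, 10 % common cause part: {triple_common_cause:.5f}")
```

With these assumed inputs, three independent channels raise the reliability from 0.9 to about 0.999, while even a small common cause fraction removes most of that gain. This is why dissimilar design and independent channels matter as much as the channel count.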
Another design technique to prevent failures is called physics of failure. This technique relies on understanding the
physical static and dynamic failure mechanisms. It accounts for variation in load, strength and stress leading to
failure at a high level of detail, made possible by modern finite element method (FEM) software that can
handle complex geometries and mechanisms like creep, stress relaxation, fatigue and probabilistic design (Monte
Carlo simulations / DOE). The material or component can be redesigned to reduce the probability of failure and to
make it more robust against variation. Another common design technique is component derating: selecting
components whose tolerance significantly exceeds the expected stress, such as using a heavier-gauge wire that exceeds the
normal specification for the expected electrical current.
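How variation in load and strength translates into a failure probability, and why derating helps, can be shown with a minimal load-strength interference sketch. It assumes that load and strength are independent and normally distributed, which is a modelling assumption and not something the text prescribes. All values and names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_failure_probability(mean_strength: float, sd_strength: float,
                                     mean_load: float, sd_load: float) -> float:
    """P(load > strength) for independent, normally distributed load and strength."""
    combined_sd = math.hypot(sd_strength, sd_load)
    if combined_sd == 0.0:
        raise ValueError("at least one standard deviation must be positive")
    safety_index = (mean_strength - mean_load) / combined_sd
    return normal_cdf(-safety_index)


# Hypothetical part: strength 500 +/- 40 MPa, applied stress 300 +/- 30 MPa.
baseline = interference_failure_probability(500.0, 40.0, 300.0, 30.0)

# Derated choice: a stronger part (600 +/- 40 MPa) under the same stress.
derated = interference_failure_probability(600.0, 40.0, 300.0, 30.0)

print(f"baseline failure probability: {baseline:.2e}")
print(f"derated failure probability:  {derated:.2e}")
```

The sketch shows that a modest increase in margin can reduce the computed failure probability by several orders of magnitude. It also shows how sensitive such numbers are to the assumed distributions, which is consistent with the caution about quantitative predictions expressed elsewhere in the article.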
Another effective way to deal with unreliability issues is to perform analysis that predicts degradation, making it
possible to prevent unscheduled down events and failures from occurring. RCM (Reliability Centered Maintenance)
programs can be used for this.
Many tasks, techniques and analyses are specific to particular industries and applications. Commonly these include:
• Built-in test (BIT) (testability analysis)
• Failure mode and effects analysis (FMEA)
• Reliability hazard analysis
• Reliability block-diagram analysis
• Dynamic reliability block-diagram analysis
• Fault tree analysis
• Root cause analysis
• Sneak circuit analysis
• Accelerated testing
• Reliability growth analysis
• Weibull analysis
• Thermal analysis by finite element analysis (FEA) and / or measurement
• Thermal induced, shock and vibration fatigue analysis by FEA and / or measurement
• Electromagnetic analysis
• Statistical interference
• Avoidance of single point of failure
• Functional analysis and functional failure analysis (e.g., function FMEA, FHA or FFA)
• Predictive and preventive maintenance: reliability centered maintenance (RCM) analysis
• Testability analysis
• Failure diagnostics analysis (normally also incorporated in FMEA)
• Human error analysis
• Operational hazard analysis
• Manual screening
• Integrated logistics support
Results are presented during the system design reviews and logistics reviews. Reliability is just one requirement
among many system requirements. Engineering trade studies are used to determine the optimum balance between
reliability and other requirements and constraints.
Reliability prediction and improvement
Reliability prediction is the combination of the creation of a proper reliability model together with estimating (and
justifying) the input parameters for this model (like failure rates for a particular failure mode or event and the mean
time to repair the system for a particular failure) and finally to provide a system (or part) level estimate for the output
reliability parameters (system availability or a particular functional failure frequency).
Some recognized reliability engineering specialists – e.g. Patrick O'Connor, R. Barnard – have argued that too much
emphasis is often given to the prediction of reliability parameters and more effort should be devoted to the
prevention of failure (reliability improvement). Failures can and should be prevented in the first place for most cases.
The emphasis on quantification and target setting in terms of (e.g.) MTBF might provide the idea that there is a limit
to the amount of reliability that can be achieved. In theory there is no inherent limit and higher reliability does not
need to be more costly in development. Another of their arguments is that prediction of reliability based on historic
data can be very misleading, as a comparison is only valid for exactly the same designs, products, manufacturing
processes and maintenance under exactly the same loads and environmental context. Even a minor change in detail
in any of these could have major effects on reliability. Furthermore, normally the most unreliable and important
items (most interesting candidates for a reliability investigation) are most often subjected to many modifications and
changes. Engineering designs are in most industries updated frequently. This is the reason why the standard
(re-active or pro-active) statistical methods and processes as used in the medical industry or insurance branch are not
as effective for engineering. Another surprising but logical argument is that to be able to accurately predict reliability
by testing, the exact mechanisms of failure must in most cases be known, and therefore in most cases can
be prevented. Following the incorrect route of trying to quantify and solve a complex reliability engineering
problem in terms of MTBF or probability, using the reactive approach, is referred to by Barnard as "Playing the
Numbers Game" and is regarded as bad practice.
For existing systems, it is arguable that responsible programs would directly analyse and try to correct the root cause
of discovered failures, and thereby may render the initial MTBF estimate fully invalid, as new assumptions (subject to
high error levels) about the effect of the patch or redesign must be made. Another practical issue concerns a general
lack of availability of detailed failure data, inconsistent filtering of failure (feedback) data, and ignoring statistical
errors, which are very high for rare events (like reliability related failures). Very clear guidelines must be present to
be able to count and compare failures related to different types of root causes (e.g. manufacturing-, maintenance-,
transport-, system-induced or inherent design failures). Comparing different types of causes may lead to incorrect
estimations and incorrect business decisions about the focus of improvement.
To perform a proper quantitative reliability prediction for systems may be difficult and may be very expensive if
done by testing. At part level, results can often be obtained with higher confidence, as many samples might be used
within the available testing budget; unfortunately, these tests might lack validity at system level due
to the assumptions that had to be made for part level testing. These authors argue that it cannot be emphasized
enough that testing for reliability should be done to create failures in the first place, learn from them and improve
the system or part. The general conclusion is drawn that an accurate and absolute prediction of reliability, by field
data comparison or testing, is in most cases not possible. An exception might be failures due to wear-out
problems like fatigue failures. In the introduction of MIL-STD-785 it is written that reliability prediction should be
used with great caution, if not used solely for comparison in trade-off studies.
Reliability is defined as the probability that a device will perform its intended function during a specified period of
time under stated conditions. Mathematically, this may be expressed as

$$R(t) = \int_{t}^{\infty} f(x)\,dx,$$

where $f(x)$ is the failure probability density function and $t$ is the length of the period of time (which is
assumed to start from time zero).
There are a few key elements of this definition:
1. Reliability is predicated on "intended function": generally, this is taken to mean operation without failure.
However, even if no individual part of the system fails, but the system as a whole does not do what was intended,
then it is still charged against the system reliability. The system requirements specification is the criterion against
which reliability is measured.
2. Reliability applies to a specified period of time. In practical terms, this means that a system has a specified
chance that it will operate without failure before time $t$. Reliability engineering ensures that components and
materials will meet the requirements during the specified time. Units other than time may sometimes be used.
3. Reliability is restricted to operation under stated (or explicitly defined) conditions. This constraint is necessary
because it is impossible to design a system for unlimited conditions. A Mars Rover will have different specified
conditions than a family car. The operating environment must be addressed during design and testing. That same
rover may be required to operate in varying conditions requiring additional scrutiny.
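As a worked special case, offered as an illustration and not as part of the definition itself, assume a constant failure rate $\lambda$ (the exponential model, which ignores early-life and wear-out behaviour). Taking reliability as the integral of the failure density $f$ from $t$ to infinity gives:

```latex
f(x) = \lambda e^{-\lambda x}, \qquad
R(t) = \int_{t}^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}, \qquad
\mathrm{MTTF} = \int_{0}^{\infty} R(t)\,dt = \frac{1}{\lambda}.
```

For a hypothetical item with $\lambda = 10^{-4}$ failures per hour and a period of $t = 1000$ hours, this gives $R(t) = e^{-0.1} \approx 0.905$. It also shows that at $t = \mathrm{MTTF}$ the reliability is only $e^{-1} \approx 0.37$ under this model.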
Quantitative system reliability parameters – theory
Quantitative requirements are specified using reliability parameters. The most common reliability parameter is the
mean time to failure (MTTF), which can also be specified as the failure rate (this is expressed as a frequency or
conditional probability density function (PDF)) or the number of failures during a given period. These parameters are
very useful for systems that are operated frequently, such as most vehicles, machinery, and electronic equipment.
Reliability increases as the MTTF increases. The MTTF is usually specified in hours, but can also be used with other
units of measurement, such as miles or cycles.
In other cases, reliability is specified as the probability of mission success. For example, reliability of a scheduled
aircraft flight can be specified as a dimensionless probability or a percentage, as in system safety engineering.
A special case of mission success is the single-shot device or system. These are devices or systems that remain
relatively dormant and only operate once. Examples include automobile airbags, thermal batteries and missiles.
Single-shot reliability is specified as a probability of one-time success, or is subsumed into a related parameter.
Single-shot missile reliability may be specified as a requirement for the probability of a hit. For such systems, the
probability of failure on demand (PFD) is the reliability measure – which actually is an unavailability number. This
PFD is derived from failure rate (a frequency of occurrence) and mission time for non-repairable systems.
For repairable systems, it is obtained from failure rate and mean-time-to-repair (MTTR) and test interval. This
measure may not be unique for a given system as this measure depends on the kind of demand. In addition to system
level requirements, reliability requirements may be specified for critical subsystems. In most cases, reliability
parameters are specified with appropriate statistical confidence intervals.
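The two routes to a probability of failure on demand described above can be sketched as follows. The formulas are common textbook approximations, assumed here for illustration and not taken from the article: a constant failure rate for the non-repairable case, and the low-demand approximation PFD ≈ λ·(TI/2 + MTTR) for a periodically tested, repairable item, which is only reasonable when λ·TI is much smaller than 1. Function names and numbers are hypothetical.

```python
import math


def pfd_non_repairable(failure_rate_per_hour: float, mission_time_hours: float) -> float:
    """Probability that a dormant, non-repairable item has failed by the end of the mission.

    Assumes a constant failure rate (exponential model).
    """
    if failure_rate_per_hour < 0 or mission_time_hours < 0:
        raise ValueError("failure rate and mission time must be non-negative")
    return 1.0 - math.exp(-failure_rate_per_hour * mission_time_hours)


def pfd_average_periodic_test(failure_rate_per_hour: float,
                              test_interval_hours: float,
                              mttr_hours: float = 0.0) -> float:
    """Average PFD of a periodically tested, repairable item.

    Assumed approximation: lambda * (TI / 2 + MTTR), valid for lambda * TI << 1.
    """
    if min(failure_rate_per_hour, test_interval_hours, mttr_hours) < 0:
        raise ValueError("inputs must be non-negative")
    return failure_rate_per_hour * (test_interval_hours / 2.0 + mttr_hours)


# Hypothetical single-shot device: 1e-6 failures/hour, dormant for ten years.
dormant_pfd = pfd_non_repairable(1e-6, mission_time_hours=87_600.0)

# Same failure rate, but proof-tested yearly and repaired within 8 hours.
tested_pfd = pfd_average_periodic_test(1e-6, test_interval_hours=8_760.0, mttr_hours=8.0)

print(f"untested for ten years: PFD = {dormant_pfd:.4f}")
print(f"tested yearly:          PFD = {tested_pfd:.5f}")
```

Under these assumptions, periodic testing lowers the unavailability on demand by more than an order of magnitude without any change to the failure rate itself.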
Reliability modelling is the process of predicting or understanding the reliability of a component or system prior to
its implementation. Two types of analysis that are often used to model the availability behavior of a complete system
(including effects from logistics issues like spare part provisioning, transport and manpower) are fault tree analysis
and reliability block diagrams. At component level the same types of analysis can be used together with others. The
input for the models can come from many sources: testing, earlier operational experience (field data) or data
handbooks from the same or mixed industries. In all cases, the data must be used with great caution, as
predictions are only valid when the same product is used in the same context. Often predictions are only made to
A reliability block diagram showing a 1oo3 (1 out of 3) redundant designed
For part level predictions, two separate
fields of investigation are common:
• The physics of failure approach uses an
understanding of physical failure
mechanisms involved, such as
mechanical crack propagation or
chemical corrosion degradation or
• The parts stress modelling approach is an
empirical method for prediction based on counting the number and type of components of the system, and the
stress they undergo during operation.
Software reliability is a more challenging area that must be considered when it is a considerable component to
Reliability test requirements
Reliability test requirements can follow from any analysis for which the first estimate of failure probability, failure
mode or effect needs to be justified. Evidence can be generated with some level of confidence by testing. With
software-based systems, the probability is a mix of software and hardware-based failures. Testing reliability
requirements is problematic for several reasons. A single test is in most cases insufficient to generate enough
statistical data. Multiple tests or long-duration tests are usually very expensive. Some tests are simply impractical,
and environmental conditions can be hard to predict over a system's life cycle.
Reliability engineering is used to design a realistic and affordable test program that provides empirical evidence that
the system meets its reliability requirements. Statistical confidence levels are used to address some of these concerns.
A certain parameter is expressed along with a corresponding confidence level: for example, an MTBF of 1000 hours
at 90% confidence level. From this specification, the reliability engineer can, for example, design a test with explicit
criteria for the number of hours and number of failures until the requirement is met or failed. Different sorts of tests
The combination of required reliability level and required confidence level greatly affects the development cost and
the risk to both the customer and producer. Care is needed to select the best combination of requirements – e.g.
cost-effectiveness. Reliability testing may be performed at various levels, such as component, subsystem and system.
Also, many factors must be addressed during testing and operation, such as extreme temperature and humidity,
shock, vibration, or other environmental factors (like loss of signal, cooling or power; or other catastrophes such as
fire, floods, excessive heat, physical or security violations or other myriad forms of damage or degradation). For
systems that must last many years, accelerated life tests may be needed.
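As an illustration of the "MTBF of 1000 hours at 90% confidence" example above, the following Python sketch estimates how many accumulated test hours a fixed-duration test would need for a given number of allowed failures. It assumes a constant failure rate (exponentially distributed times between failures) and a time-terminated test, so that the failure count follows a Poisson distribution. Neither assumption comes from the text, and the function names are hypothetical.

```python
import math


def poisson_cdf(r, mean):
    """Probability of observing at most r events when `mean` are expected."""
    term = math.exp(-mean)
    total = term
    for k in range(1, r + 1):
        term *= mean / k
        total += term
    return total


def required_test_hours(mtbf_target, confidence, allowed_failures=0):
    """Accumulated test hours needed to demonstrate `mtbf_target` at the given
    one-sided confidence level when at most `allowed_failures` occur.

    Assumes a constant failure rate and a time-terminated test.
    """
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    # Expand the bracket until the pass probability drops below the target.
    while poisson_cdf(allowed_failures, hi) > target:
        hi *= 2.0
    # Bisect on the expected number of failures (test time / MTBF).
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if poisson_cdf(allowed_failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return mtbf_target * hi


if __name__ == "__main__":
    for r in range(3):
        hours = required_test_hours(1000.0, 0.90, allowed_failures=r)
        print(f"allowed failures = {r}: about {hours:.0f} test hours")
```

Under these assumptions a failure-free demonstration needs about 2.3 times the target MTBF in accumulated test time, and allowing one failure raises this to roughly 3.9 times, which shows how quickly the required confidence level drives test cost.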
A reliability sequential test plan
The purpose of reliability testing is to discover potential problems with the design as early as possible and,
ultimately, provide confidence that the system meets its reliability requirements.
Reliability testing may be performed at several levels and there are different types of testing. Complex systems may be
tested at component, circuit board, unit, assembly, subsystem and system levels. (The test level nomenclature varies
among applications.) For example, performing environmental stress screening tests at lower levels, such as piece parts
or small assemblies, catches problems before they cause failures at higher levels. Testing proceeds during each level
of integration through full-up system testing, developmental testing, and operational testing, thereby reducing
program risk. However, testing does not mitigate
With each test, both a statistical type 1 and a type 2 error can be made; their probabilities depend on sample size,
test time, assumptions and the needed discrimination ratio. Taking "the design meets its requirement" as the null
hypothesis, there is a risk of incorrectly rejecting a good design (type 1 error) and a risk of incorrectly accepting a
bad design (type 2 error).
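The trade-off between these two risks and the discrimination ratio can be sketched with Wald's sequential probability ratio test, one common basis for sequential reliability test plans. The Python sketch below assumes exponentially distributed times between failures. Here `theta0` is the acceptable MTBF, `theta1` the unacceptable MTBF (their quotient is the discrimination ratio), `alpha` the risk of rejecting a good design and `beta` the risk of accepting a bad one. All names are illustrative assumptions, not taken from a particular standard.

```python
import math


def sprt_decision(total_hours, failures, theta0, theta1, alpha, beta):
    """Wald sequential test for an exponential MTBF (illustrative sketch).

    theta0: acceptable MTBF, theta1: unacceptable MTBF (theta0 > theta1).
    alpha: risk of rejecting a design whose MTBF is theta0.
    beta: risk of accepting a design whose MTBF is theta1.
    Returns "accept", "reject" or "continue".
    """
    if theta0 <= theta1:
        raise ValueError("theta0 must exceed theta1")
    # Log-likelihood ratio of the 'bad' hypothesis to the 'good' one.
    llr = (failures * math.log(theta0 / theta1)
           - (1.0 / theta1 - 1.0 / theta0) * total_hours)
    reject_limit = math.log((1.0 - beta) / alpha)
    accept_limit = math.log(beta / (1.0 - alpha))
    if llr >= reject_limit:
        return "reject"
    if llr <= accept_limit:
        return "accept"
    return "continue"


if __name__ == "__main__":
    # Discrimination ratio 2 (1000 h versus 500 h), both risks at 10 %.
    print(sprt_decision(1000.0, 0, 1000.0, 500.0, 0.10, 0.10))
    print(sprt_decision(2500.0, 0, 1000.0, 500.0, 0.10, 0.10))
    print(sprt_decision(100.0, 4, 1000.0, 500.0, 0.10, 0.10))
```

A smaller discrimination ratio or smaller risks move the two decision limits apart, so more test time is needed before either limit is crossed.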
It is not always feasible to test all system requirements. Some systems are prohibitively expensive to test; some
failure modes may take years to observe; some complex interactions result in a huge number of possible test cases;
and some tests require the use of limited test ranges or other resources. In such cases, different approaches to testing
can be used, such as (highly) accelerated life testing, design of experiments, and simulations.
The desired level of statistical confidence also plays a role in reliability testing. Statistical confidence is increased
by increasing either the test time or the number of items tested. Reliability test plans are designed to achieve the
specified reliability at the specified confidence level with the minimum number of test units and test time. Different
test plans result in different levels of risk to the producer and consumer. The desired reliability, statistical
confidence, and risk levels for each side influence the ultimate test plan. The customer and developer should agree in
advance on how reliability requirements will be tested.
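For pass/fail testing, the link between reliability, confidence and the number of items tested can be illustrated with the zero-failure ("success run") relationship n = ln(1 - C) / ln(R). The Python sketch below assumes independent test units, a binomial pass/fail outcome and a plan that allows no failures; the function name is hypothetical.

```python
import math


def success_run_sample_size(reliability, confidence):
    """Smallest number of units that must all pass to demonstrate the given
    reliability at the given confidence (zero failures allowed)."""
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


if __name__ == "__main__":
    for r, c in [(0.90, 0.90), (0.95, 0.90), (0.99, 0.95)]:
        n = success_run_sample_size(r, c)
        print(f"R = {r:.2f}, C = {c:.2f}: {n} failure-free units")
```

Under these assumptions, demonstrating 90% reliability at 90% confidence takes 22 failure-free units, while 99% reliability at 95% confidence takes 299, which is why the customer and developer need to agree on both figures before the test plan is fixed.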
A key aspect of reliability testing is to define "failure". Although this may seem obvious, there are many situations
where it is not clear whether a failure is really the fault of the system. Variations in test conditions, operator
differences, weather and unexpected situations create differences between the customer and the system developer.
One strategy to address this issue is to use a scoring conference process. A scoring conference includes
representatives from the customer, the developer, the test organization, the reliability organization, and sometimes
independent observers. The scoring conference process is defined in the statement of work. Each test case is
considered by the group and "scored" as a success or failure. This scoring is the official result used by the reliability
As part of the requirements phase, the reliability engineer develops a test strategy with the customer. The test
strategy makes trade-offs between the needs of the reliability organization, which wants as much data as possible,
and constraints such as cost, schedule and available resources. Test plans and procedures are developed for each
reliability test, and results are documented.
The purpose of accelerated life testing (ALT) is to induce field failure in the laboratory at a much faster rate by
providing a harsher, but nonetheless representative, environment. In such a test, the product is expected to fail in the
lab just as it would have failed in the field—but in much less time. The main objective of an accelerated test is either
of the following:
• To discover failure modes
• To predict the normal field life from the high-stress lab life
An accelerated testing program can be broken down into the following steps:
• Define objective and scope of the test
• Collect required information about the product
• Identify the stress(es)
• Determine level of stress(es)
• Conduct the accelerated test and analyze the collected data
Common ways to determine a life-stress relationship are:
• Arrhenius model
• Eyring model
• Inverse power law model
• Temperature–humidity model
• Temperature non-thermal model
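As an example of the first model in the list, the Arrhenius relationship gives an acceleration factor AF = exp[(Ea / k) (1 / T_use - 1 / T_stress)], with temperatures in kelvin, Ea the activation energy and k the Boltzmann constant. The Python sketch below uses an activation energy of 0.7 eV and use/stress temperatures of 55 °C and 125 °C purely as assumed illustrative values; real values depend on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Ratio of life at use temperature to life at stress temperature."""
    t_use = use_temp_c + 273.15
    t_stress = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress)
    return math.exp(exponent)


if __name__ == "__main__":
    # Assumed values for illustration only.
    af = arrhenius_acceleration_factor(0.7, use_temp_c=55.0, stress_temp_c=125.0)
    lab_life_hours = 1000.0
    print(f"acceleration factor: {af:.1f}")
    print(f"projected field life: {af * lab_life_hours:.0f} hours")
```

The projection is only as good as the assumed activation energy and the premise that the elevated temperature triggers the same failure mechanism that would occur in the field.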
Further information: Software reliability
Software reliability is a special aspect of reliability engineering. System reliability, by definition, includes all parts of
the system, including hardware, software, supporting infrastructure (including critical external interfaces), operators
and procedures. Traditionally, reliability engineering focuses on critical hardware parts of the system. Since the
widespread use of digital integrated circuit technology, software has become an increasingly critical part of most
electronics and, hence, nearly all present day systems.
There are significant differences, however, in how software and hardware behave. Most hardware unreliability is the
result of a component or material failure that results in the system not performing its intended function. Repairing or
replacing the hardware component restores the system to its original operating state. However, software does not fail
in the same sense that hardware fails. Instead, software unreliability is the result of unanticipated results of software
operations. Even relatively small software programs can have astronomically large combinations of inputs and states
that are infeasible to exhaustively test. Restoring software to its original state only works until the same combination
of inputs and states results in the same unintended result. Software reliability engineering must take this into account.
Despite this difference in the source of failure between software and hardware, several software reliability models
based on statistics have been proposed to quantify what we experience with software: the longer software is run, the
higher the probability that it will eventually be used in an untested manner and exhibit a latent defect that results in a
failure (Shooman 1987), (Musa 2005), (Denney 2005).
As with hardware, software reliability depends on good requirements, design and implementation. Software
reliability engineering relies heavily on a disciplined software engineering process to anticipate and design against
unintended consequences. There is more overlap between software quality engineering and software reliability
engineering than between hardware quality and reliability. A good software development plan is a key aspect of the
software reliability program. The software development plan describes the design and coding standards, peer
reviews, unit tests, configuration management, software metrics and software models to be used during software
A common reliability metric is the number of software faults, usually expressed as faults per thousand lines of code.
This metric, along with software execution time, is key to most software reliability models and estimates. The theory
is that the software reliability increases as the number of faults (or fault density) decreases.
Establishing a direct connection between fault density and mean-time-between-failure is difficult, however, because
of the way software faults are distributed in the code, their severity, and the probability of the combination of inputs
necessary to encounter the fault. Nevertheless, fault density serves as a useful indicator for the reliability engineer.
Other software metrics, such as complexity, are also used. This metric remains controversial, since changes in
software development and verification practices can have dramatic impact on overall defect rates.
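The fault density metric described above is simple to compute. The following Python sketch, with hypothetical module names and counts, expresses faults per thousand lines of code (KLOC) and ranks modules by it.

```python
def fault_density(faults, lines_of_code):
    """Faults per thousand lines of code (KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return faults / (lines_of_code / 1000.0)


if __name__ == "__main__":
    # Hypothetical modules: (known faults, lines of code).
    modules = {"navigation": (45, 30_000), "telemetry": (6, 12_000)}
    ranked = sorted(modules.items(), key=lambda kv: fault_density(*kv[1]), reverse=True)
    for name, (faults, loc) in ranked:
        print(f"{name}: {fault_density(faults, loc):.2f} faults/KLOC")
```

As the text notes, a ranking like this is an indicator of where to look, not a direct estimate of mean-time-between-failure.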
Testing is even more important for software than hardware. Even the best software development process results in
some software faults that are nearly undetectable until tested. As with hardware, software is tested at several levels,
starting with individual units, through integration and full-up system testing. Unlike hardware, it is inadvisable to
skip levels of software testing. During all phases of testing, software faults are discovered, corrected, and re-tested.
Reliability estimates are updated based on the fault density and other metrics. At a system level,
mean-time-between-failure data can be collected and used to estimate reliability. Unlike hardware, performing
exactly the same test on exactly the same software configuration does not provide increased statistical confidence.
Instead, software reliability uses different metrics, such as code coverage.
Eventually, the software is integrated with the hardware in the top-level system, and software reliability is subsumed
by system reliability. The Software Engineering Institute's capability maturity model is a common means of
assessing the overall software development process for reliability and quality purposes.
Reliability engineering vs safety engineering
Reliability engineering differs from safety engineering with respect to the kind of hazards that are considered.
Reliability engineering is in the end only concerned with cost. It relates to all reliability hazards that could transform
into incidents with a particular level of loss of revenue for the company or the customer. These costs can arise from
loss of production due to system unavailability, unexpectedly high or low demand for spares, repair costs, man-hours,
(multiple) re-designs, interruptions to normal production (e.g. due to long repair times or unexpected
demand for non-stocked spares) and many other indirect costs.
Safety engineering, on the other hand, is more specific and regulated. It relates only to very specific system
safety hazards that could potentially lead to severe accidents, and is primarily concerned with loss of life, loss of
equipment, or environmental damage. The related system functional reliability requirements are sometimes
extremely high. It deals with unwanted dangerous events (for life, property, and environment) in the same sense as
reliability engineering, but does not normally look directly at cost and is not concerned with repair actions after
failures or accidents (at system level). Another difference is the level of impact of failures on society and the degree
of government control. Safety engineering is often strictly regulated by governments (e.g. in the nuclear, aerospace,
defense, rail and oil industries).
Furthermore, safety engineering and reliability engineering may even have contradicting requirements. This relates
to system-level architecture choices. For example, in train signal control systems it is
common practice to use a fail-safe system design concept. In this concept, wrong-side failures need to be
controlled to an extremely low failure rate. These failures are related to possible severe effects, like frontal collisions
(two GREEN lights). Systems are designed in a way that the vast majority of failures will simply result in a temporary
or total loss of signals or open contacts of relays and generate RED lights for all trains. This is the safe state. All
trains are stopped immediately. This fail-safe logic might unfortunately lower the reliability of the system. The
reason for this is the higher risk of false tripping, as any full or temporary, intermittent failure is quickly latched in a
shut-down (safe) state. Different solutions are available for this issue. See chapter Fault Tolerance below.
Reliability can be increased here by using 2oo2 (2 out of 2) redundancy at part or system level, but this in
turn lowers the safety level (more possibilities for wrong-side and undetected dangerous failures). Fault-tolerant
voting systems (e.g. 2oo3 voting logic) can increase both reliability and safety at system level. In this case the
so-called "operational" or "mission" reliability as well as the safety of a system can be increased. This is also
common practice in aerospace systems that need continued availability and do not have a fail-safe mode (e.g. flight
computers and the related electrical, mechanical and/or hydraulic steering functions always need to be working;
there are no safe fixed positions for the rudder or other steering parts when the aircraft is flying).
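The effect of voting logic on both kinds of failure can be illustrated numerically. The Python sketch below assumes identical, independent channels, each with a probability `p_spurious` of tripping falsely (a safe but availability-reducing failure) and a probability `p_dangerous` of failing to trip on demand. An MooN architecture shuts down when at least M of its N channels vote to trip. The names and the example probabilities are assumptions for illustration, and common cause failures are ignored.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n independent channels show an event of probability p."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def voting_risks(m, n, p_spurious, p_dangerous):
    """Return (spurious trip probability, dangerous failure probability) for MooN voting.

    A spurious trip needs at least m channels to trip falsely.
    A dangerous failure needs at least n - m + 1 channels unable to trip.
    """
    spurious = prob_at_least(m, n, p_spurious)
    dangerous = prob_at_least(n - m + 1, n, p_dangerous)
    return spurious, dangerous


if __name__ == "__main__":
    for m, n in [(1, 1), (2, 2), (2, 3)]:
        spurious, dangerous = voting_risks(m, n, p_spurious=0.01, p_dangerous=0.001)
        print(f"{m}oo{n}: spurious trip {spurious:.2e}, dangerous failure {dangerous:.2e}")
```

With these assumed numbers, 2oo2 reduces spurious trips but roughly doubles the dangerous failure probability relative to 1oo1, whereas 2oo3 reduces both, matching the qualitative argument above.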
Basic reliability and mission (operational) reliability
The above example of a 2oo3 fault-tolerant system increases both mission reliability and safety. However, the
"basic" reliability of the system will in this case still be lower than that of a non-redundant (1oo1) or 2oo2 system. Basic
reliability refers to all failures, including those that might not result in system failure, but do result in maintenance
repair actions, logistic cost, use of spares, etc. For example, the replacement or repair of 1 channel in a 2oo3 voting
system that is still operating with one failed channel (which in this state actually has become a 1oo2 system) is
contributing to basic unreliability but not mission unreliability. Also, for example, the failure of the taillight of an
aircraft is not considered as a mission loss failure, but does contribute to the basic unreliability.
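The difference between the two measures can be made concrete with a short calculation. The Python sketch below assumes identical, independent channels, each surviving the mission with probability `r`. It treats the mission as successful when at least two of three channels work, and counts basic reliability as the probability that no channel fails at all, since any channel failure triggers a maintenance action. The function names and the value 0.9 are illustrative assumptions.

```python
def mission_reliability_2oo3(r):
    """Probability that at least two of three independent channels survive."""
    return 3.0 * r**2 - 2.0 * r**3


def basic_reliability(r, channels):
    """Probability that none of the channels fails (no maintenance demand)."""
    return r**channels


if __name__ == "__main__":
    r = 0.9  # assumed per-channel reliability
    print(f"1oo1: mission {r:.3f}, basic {basic_reliability(r, 1):.3f}")
    print(f"2oo3: mission {mission_reliability_2oo3(r):.3f}, basic {basic_reliability(r, 3):.3f}")
```

For r = 0.9 the 2oo3 system has a mission reliability of 0.972 against 0.9 for a single channel, but its basic reliability drops to 0.729 because there are three channels that can each generate a repair.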
Detectability and common cause failures
When using fault tolerant (redundant architectures) systems or systems that are equipped with protection functions,
detectability of failures and avoidance of common cause failures become paramount for safe functioning and/or
Reliability operational assessment
After a system is produced, reliability engineering monitors, assesses and corrects deficiencies. Monitoring includes
electronic and visual surveillance of critical parameters identified during the fault tree analysis design stage. The data
are constantly analyzed using statistical techniques, such as Weibull analysis and linear regression, to ensure the
system reliability meets requirements. Reliability data and estimates are also key inputs for system logistics. Data
collection is highly dependent on the nature of the system. Most large organizations have quality control groups that
collect failure data on vehicles, equipment and machinery. Consumer product failures are often tracked by the
number of returns. For systems in dormant storage or on standby, it is necessary to establish a formal surveillance
program to inspect and test random samples. Any changes to the system, such as field upgrades or recall repairs,
require additional reliability testing to ensure the reliability of the modification. Since it is not possible to anticipate
all the failure modes of a given system, especially ones with a human element, failures will occur. The reliability
program also includes a systematic root cause analysis that identifies the causal relationships involved in the failure
such that effective corrective actions may be implemented. When possible, system failures and corrective actions are
reported to the reliability engineering organization.
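The combination of Weibull analysis and linear regression mentioned above can be sketched as median rank regression: failure times are sorted, each is assigned an approximate median rank, and a straight line is fitted to ln(-ln(1 - F)) against ln(t). The Python sketch below assumes complete (uncensored) failure data and uses Bernard's approximation for the median ranks; the function name and the sample times are hypothetical.

```python
import math


def weibull_mrr(failure_times):
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression.

    Assumes complete data: every unit has failed and all times are positive.
    """
    times = sorted(failure_times)
    n = len(times)
    if n < 2:
        raise ValueError("need at least two failure times")
    xs = [math.log(t) for t in times]
    # Bernard's approximation of the median rank for the i-th ordered failure.
    ys = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the fitted line
    intercept = mean_y - beta * mean_x    # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


if __name__ == "__main__":
    # Hypothetical field failure times in operating hours.
    hours = [410.0, 620.0, 790.0, 950.0, 1130.0, 1340.0, 1600.0]
    beta, eta = weibull_mrr(hours)
    print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")
```

A shape parameter above 1 suggests wear-out behaviour and one below 1 suggests early-life failures, which is one way such monitoring data feeds back into corrective action. Field data usually contains units that have not yet failed, which this simple sketch does not handle.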
One of the most common methods applied in a reliability operational assessment is a failure reporting, analysis and
corrective action system (FRACAS). This systematic approach develops a reliability, safety and logistics
assessment based on failure/incident reporting, management, analysis and corrective/preventive actions.
Organizations today are adopting this method and use commercial systems, such as web-based FRACAS
applications, that enable an organization to create a failure/incident data repository from which statistics can be
derived to view accurate and genuine reliability, safety and quality performance.
It is extremely important to have one common-source FRACAS system for all end items. It should also be possible to
capture test results here in a practical way. Failure to adopt one integrated system that is easy to handle (easy data
entry for field engineers and repair shop engineers) and easy to maintain is likely to result in a FRACAS program failure.
Some of the common outputs from a FRACAS system include: field MTBF, MTTR, spares consumption,
reliability growth, and failure/incident distribution by type, location, part no., serial no., symptom, etc.
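A few of these outputs can be derived directly from a failure repository. The Python sketch below uses a hypothetical record layout (`FailureRecord`) and made-up data. It computes field MTBF as fleet operating hours divided by the number of failures, MTTR as the average repair time, and a failure count by type. A real FRACAS database would hold many more fields.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailureRecord:
    """Hypothetical minimal FRACAS entry."""
    serial_no: str
    failure_type: str
    repair_hours: float


def fracas_summary(records, fleet_operating_hours):
    """Summarize field MTBF, MTTR and failure distribution by type."""
    count = len(records)
    if count == 0:
        # No failures recorded: MTBF and MTTR cannot be estimated.
        return {"field_mtbf": None, "mttr": None, "failures_by_type": Counter()}
    return {
        "field_mtbf": fleet_operating_hours / count,
        "mttr": sum(r.repair_hours for r in records) / count,
        "failures_by_type": Counter(r.failure_type for r in records),
    }


if __name__ == "__main__":
    records = [
        FailureRecord("SN-001", "connector", 2.0),
        FailureRecord("SN-007", "power supply", 6.0),
        FailureRecord("SN-001", "connector", 4.0),
    ]
    summary = fracas_summary(records, fleet_operating_hours=12_000.0)
    print(summary["field_mtbf"], summary["mttr"], summary["failures_by_type"].most_common(1))
```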
The use of past data to predict the reliability of new comparable systems/items can be misleading as reliability is a
function of the context of use and can be affected by small changes in the designs/manufacturing.
Systems of any significant complexity are developed by organizations of people, such as a commercial company or a
government agency. The reliability engineering organization must be consistent with the company's organizational
structure. For small, non-critical systems, reliability engineering may be informal. As complexity grows, the need
arises for a formal reliability function. Because reliability is important to the customer, the customer may even
specify certain aspects of the reliability organization.
There are several common types of reliability organizations. The project manager or chief engineer may employ one
or more reliability engineers directly. In larger organizations, there is usually a product assurance or specialty
engineering organization, which may include reliability, maintainability, quality, safety, human factors, logistics, etc.
In such case, the reliability engineer reports to the product assurance manager or specialty engineering manager.
In some cases, a company may wish to establish an independent reliability organization. This is desirable to ensure
that the system reliability, which is often expensive and time consuming, is not unduly slighted due to budget and
schedule pressures. In such cases, the reliability engineer works for the project day-to-day, but is actually employed
and paid by a separate organization within the company.
Because reliability engineering is critical to early system design, it has become common for reliability engineers,
however the organization is structured, to work as part of an integrated product team.
The American Society for Quality has a program to become a Certified Reliability Engineer (CRE). Certification is
based on education, experience, and a certification test; periodic re-certification is required. The body of knowledge
for the test includes: reliability management, design evaluation, product safety, statistical tools, design and
development, modeling, reliability testing, collecting and using data, etc.
Another highly respected certification program is the CRP (Certified Reliability Professional). To achieve
certification, candidates must complete a series of courses focused on important Reliability Engineering topics,
successfully apply the learned body of knowledge in the workplace and publicly present this expertise in an industry
conference or journal.
Reliability engineering education
Some universities offer graduate degrees in reliability engineering. Other reliability engineers typically have an
engineering degree, which can be in any field of engineering, from an accredited university or college program.
Many engineering programs offer reliability courses, and some universities have entire reliability engineering
programs. A reliability engineer may be registered as a professional engineer by the state, but this is not required by
most employers. There are many professional conferences and industry training programs available for reliability
engineers. Several professional organizations exist for reliability engineers, including the IEEE Reliability Society,
the American Society for Quality (ASQ), and the Society of Reliability Engineers (SRE).
 Institute of Electrical and Electronics Engineers (1990) IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer
Glossaries. New York, NY ISBN 1-55937-079-3
 O'Connor, Patrick D. T. (2002), Practical Reliability Engineering (Fourth Ed.), John Wiley & Sons, New York. ISBN 978-0-4708-4462-5.
 Using Failure Modes, Mechanisms, and Effects Analysis in Medical Device Adverse Event Investigations, S. Cheng, D. Das, and M. Pecht,
ICBO: International Conference on Biomedical Ontology, Buffalo, NY, July 26–30, 2011, pp. 340–345
 Salvatore Distefano, Antonio Puliafito: Dependability Evaluation with Dynamic Reliability Block Diagrams and Dynamic Fault Trees. IEEE
Trans. Dependable Sec. Comput. 6(1): 4-17 (2009)
• Blanchard, Benjamin S. (1992), Logistics Engineering and Management (Fourth Ed.), Prentice-Hall, Inc.,
Englewood Cliffs, New Jersey.
• Breitler, Alan L. and Sloan, C. (2005), Proceedings of the American Institute of Aeronautics and Astronautics
(AIAA) Air Force T&E Days Conference, Nashville, TN, December, 2005: System Reliability Prediction:
towards a General Approach Using a Neural Network.
• Ebeling, Charles E., (1997), An Introduction to Reliability and Maintainability Engineering, McGraw-Hill
Companies, Inc., Boston.
• Denney, Richard (2005) Succeeding with Use Cases: Working Smart to Deliver Quality. Addison-Wesley
Professional Publishing. ISBN . Discusses the use of software reliability engineering in use case driven software
• Gano, Dean L. (2007), "Apollo Root Cause Analysis" (Third Edition), Apollonian Publications, LLC., Richland,
• Holmes, Oliver Wendell, Sr. The Deacon's Masterpiece
• Kapur, K.C., and Lamberson, L.R., (1977), Reliability in Engineering Design, John Wiley & Sons, New York.
• Kececioglu, Dimitri, (1991) "Reliability Engineering Handbook", Prentice-Hall, Englewood Cliffs, New Jersey
• Trevor Kletz (1998) Process Plants: A Handbook for Inherently Safer Design CRC ISBN 1-56032-619-0
• Leemis, Lawrence, (1995) Reliability: Probabilistic Models and Statistical Methods, 1995, Prentice-Hall. ISBN
• Frank Lees (2005). Loss Prevention in the Process Industries (3rd ed.). Elsevier.
• MacDiarmid, Preston; Morris, Seymour; et al., (1995), Reliability Toolkit: Commercial Practices Edition,
Reliability Analysis Center and Rome Laboratory, Rome, New York.
• Modarres, Mohammad; Kaminskiy, Mark; Krivtsov, Vasiliy (1999), "Reliability Engineering and Risk Analysis:
A Practical Guide", CRC Press, ISBN 0-8247-2000-8.
• Musa, John (2005) Software Reliability Engineering: More Reliable Software Faster and Cheaper, 2nd. Edition,
• Neubeck, Ken (2004) "Practical Reliability Analysis", Prentice Hall, New Jersey
• Neufelder, Ann Marie, (1993), Ensuring Software Reliability, Marcel Dekker, Inc., New York.
• O'Connor, Patrick D. T. (2002), Practical Reliability Engineering (Fourth Ed.), John Wiley & Sons, New York.
• Shooman, Martin, (1987), Software Engineering: Design, Reliability, and Management, McGraw-Hill, New
• Tobias, Trindade, (1995), Applied Reliability, Chapman & Hall/CRC, ISBN 0-442-00469-9
• Springer Series in Reliability Engineering (http://www.springer.com/series/6917)
• Nelson, Wayne B., (2004), Accelerated Testing – Statistical Models, Test Plans, and Data Analysis, John Wiley
& Sons, New York, ISBN 0-471-69736-2
• Bagdonavicius, V., Nikulin, M., (2002), "Accelerated Life Models. Modeling and Statistical analysis",
CHAPMAN&HALL/CRC, Boca Raton, ISBN 1-58488-186-0
US standards, specifications, and handbooks
• Aerospace Report Number: TOR-2007(8583)-6889 (http://www.everyspec.com/USAF/TORs/
TOR2007-8583-6889_14232/) Reliability Program Requirements for Space Systems, The Aerospace Corporation
(10 Jul 2007)
• DoD 3235.1-H (3rd Ed) (http://www.everyspec.com/DoD/DoD-PUBLICATIONS/DOD_3235x1-H_15048/)
Test and Evaluation of System Reliability, Availability, and Maintainability (A Primer), U.S. Department of
Defense (March 1982).
• NASA GSFC 431-REF-000370 (http://www.everyspec.com/NASA/NASA-GSFC/GSFC-Code-Series/
GSFC_431_REF_000370_2297/) Flight Assurance Procedure: Performing a Failure Mode and Effects Analysis,
National Aeronautics and Space Administration Goddard Space Flight Center (10 Aug 1996).
• IEEE 1332–1998 (http://ieeexplore.ieee.org/xpl/standardstoc.jsp?isnumber=15567) IEEE Standard
Reliability Program for the Development and Production of Electronic Systems and Equipment, Institute of
Electrical and Electronics Engineers (1998).
• JPL D-5703 (http://www.everyspec.com/NASA/NASA-JPL/JPL_D-5703_JUL1990_15049/) Reliability
Analysis Handbook, National Aeronautics and Space Administration Jet Propulsion Laboratory (July 1990).
• MIL-STD-785B (http://www.everyspec.com/MIL-STD/MIL-STD-0700-0799/MIL-STD-785B_23780/)
Reliability Program for Systems and Equipment Development and Production, U.S. Department of Defense (15
Sep 1980). (*Obsolete, superseded by ANSI/GEIA-STD-0009-2008 titled Reliability Program Standard for
Systems Design, Development, and Manufacturing, 13 Nov 2008)
• MIL-HDBK-217F (http://www.everyspec.com/MIL-HDBK/MIL-HDBK-0200-0299/
MIL-HDBK-217F_14591/) Reliability Prediction of Electronic Equipment, U.S. Department of Defense (2 Dec
• MIL-HDBK-217F (Notice 1) (http://www.everyspec.com/MIL-HDBK/MIL-HDBK-0200-0299/
MIL-HDBK-217F_NOTICE-1_14589/) Reliability Prediction of Electronic Equipment, U.S. Department of
Defense (10 Jul 1992).
• MIL-HDBK-217F (Notice 2) (http://www.everyspec.com/MIL-HDBK/MIL-HDBK-0200-0299/
MIL-HDBK-217F_NOTICE-2_14590/) Reliability Prediction of Electronic Equipment, U.S. Department of
Defense (28 Feb 1995).
• MIL-STD-690D (http://www.everyspec.com/MIL-STD/MIL-STD-0500-0699/MIL-STD-690D_15050/)
Failure Rate Sampling Plans and Procedures, U.S. Department of Defense (10 Jun 2005).
• MIL-HDBK-338B (http://www.everyspec.com/MIL-HDBK/MIL-HDBK-0300-0499/
MIL-HDBK-338B_15041/) Electronic Reliability Design Handbook, U.S. Department of Defense (1 Oct 1998).
• MIL-HDBK-2173 (http://www.everyspec.com/MIL-HDBK/MIL-HDBK-2000-2999/
MIL-HDBK-2173_15046/) Reliability-Centered Maintenance (RCM) Requirements for Naval Aircraft, Weapon
Systems, and Support Equipment, U.S. Department of Defense (30 JAN 1998); (superseded by NAVAIR
• MIL-STD-1543B (http://www.everyspec.com/MIL-STD/MIL-STD-1500-1599/MIL_STD_1543B_166/)
Reliability Program Requirements for Space and Launch Vehicles, U.S. Department of Defense (25 Oct 1988).
• MIL-STD-1629A (http://www.everyspec.com/MIL-STD/MIL-STD-1600-1699/MIL_STD_1629A_1556/)
Procedures for Performing a Failure Mode Effects and Criticality Analysis, U.S. Department of Defense (24 Nov
• MIL-HDBK-781A (http://www.everyspec.com/MIL-HDBK/MIL-HDBK-0700-0799/
MIL_HDBK_781A_1933/) Reliability Test Methods, Plans, and Environments for Engineering Development,
Qualification, and Production, U.S. Department of Defense (1 Apr 1996).
• NSWC-06 (Part A & B) (http://www.everyspec.com/USN/NSWC/
NSWC-06_RELIAB_HDBK_2006_15051/) Handbook of Reliability Prediction Procedures for Mechanical
Equipment, Naval Surface Warfare Center (10 Jan 2006).
• SR-332 (http://telecom-info.telcordia.com/site-cgi/ido/docs.cgi?ID=073944231SEARCH&KEYWORDS=&
TITLE=&DOCUMENT=sr-332&DATE=&CLASS=&COUNT=1000) Reliability Prediction Procedure for
Electronic Equipment, Telcordia Technologies (January 2011).
• FD-ARPP-01 (http://telecom-info.telcordia.com/site-cgi/ido/docs.cgi?ID=073944231SEARCH&
Reliability Prediction Procedure, Telcordia Technologies (January 2011).
In the UK, there are more up to date standards maintained under the sponsorship of UK MOD as Defence Standards.
The relevant Standards include:
DEF STAN 00-40 Reliability and Maintainability (R&M)
• PART 1: Issue 5: Management Responsibilities and Requirements for Programmes and Plans
• PART 4: (ARMP-4) Issue 2: Guidance for Writing NATO R&M Requirements Documents
• PART 6: Issue 1: IN-SERVICE R & M
• PART 7 (ARMP-7) Issue 1: NATO R&M Terminology Applicable to ARMP’s
DEF STAN 00-42 RELIABILITY AND MAINTAINABILITY ASSURANCE GUIDES
• PART 1: Issue 1: ONE-SHOT DEVICES/SYSTEMS
• PART 2: Issue 1: SOFTWARE
• PART 3: Issue 2: R&M CASE
• PART 4: Issue 1: Testability
• PART 5: Issue 1: IN-SERVICE RELIABILITY DEMONSTRATIONS
DEF STAN 00-43 RELIABILITY AND MAINTAINABILITY ASSURANCE ACTIVITY
• PART 2: Issue 1: IN-SERVICE MAINTAINABILITY DEMONSTRATIONS
DEF STAN 00-44 RELIABILITY AND MAINTAINABILITY DATA COLLECTION AND CLASSIFICATION
• PART 1: Issue 2: MAINTENANCE DATA & DEFECT REPORTING IN THE ROYAL NAVY, THE ARMY
AND THE ROYAL AIR FORCE
• PART 2: Issue 1: DATA CLASSIFICATION AND INCIDENT SENTENCING – GENERAL
• PART 3: Issue 1: INCIDENT SENTENCING – SEA
• PART 4: Issue 1: INCIDENT SENTENCING – LAND
DEF STAN 00-45 Issue 1: RELIABILITY CENTERED MAINTENANCE
DEF STAN 00-49 Issue 1: RELIABILITY AND MAINTAINABILITY MOD GUIDE TO TERMINOLOGY
These can be obtained from DSTAN (http://www.dstan.mod.uk). There are also many commercial standards,
produced by many organisations including the SAE, MSG, ARP, and IEE.
• FIDES (http://fides-reliability.org). The FIDES methodology (UTE-C 80-811) is based on the physics of
failures and supported by the analysis of test data, field returns and existing modelling.
• UTE-C 80–810 or RDF2000 (http://www.ute-fr.com/FR/). The RDF2000 methodology is based on the French
• TC 56 Standards: Dependability (http://tc56.iec.ch/about/standards0_1.htm)
• Prognostics Journal (http://www.prognosticsjournal.com) is an open access journal that provides an
international forum for the electronic publication of original research and industrial experience articles in all areas
of systems reliability and prognostics.
• Models and methods regarding reliability analysis (http://www.uncertainty-in-engineering.net/)
• Structural Safety (http://www.kokch.kts.ru/me/t6/SIA_6_Structural_Safety.pdf)