On the Nature of FMEA…
… An introduction
Mark Gerrand, BEng(Mech), MMaintReliabEng
2
What is FMEA?
• A FMEA…
– is the identification of the ways in which a system can fail and the consequences
thereof
– assists in the implementation of design and management corrective actions to
minimize the occurrence and severity, and maximize the detection, of failure.
FMEA is an acronym for Failure Modes and Effects Analysis.
3
What is FMEA? (2)
• The term FMEA is often used interchangeably with Failure Modes Effects and Criticality
Analysis (FMECA) although strictly speaking the latter represents a FMEA incorporating
Criticality Analysis
• The FMEA is sometimes also called a Fault Hazard Analysis. However, hazards tend to
be interpreted as a class of safety-related problems, the correction or mitigation of
which do not necessarily lead to a more efficient design, operation or maintenance of
the system
• The FMEA is therefore an assessment of risk in terms of:
– Safety
– Regulatory requirements
– Customer satisfaction, i.e. ability of the mission system to meet its objectives.
4
What is FMEA? (3)
• FMEAs are often related to the part of the system life cycle under consideration:
– System FMEA – This examines the effects of failures of system functions. The
concept or initial design FMEA is a higher-level system FMEA that is undertaken at
the beginning of a system’s life cycle
– Design FMEA – This is performed from the initial design or concept through to
detailed design, and usually focuses on effects caused by failures of lower-level
elements and assemblies
– Production or Process FMEA – This is performed towards the end of the design
phase to examine the impact of faulty processes and materiel associated with high-
volume production on mission systems or subsystems
• Other types of FMEA include the service FMEA. This can be applied to specific
processes used by service providers and industries such as maintenance and even
health care services. MIL-STD-1629 also describes a damage FMEA. However, there are
many more types than are described above
• This presentation focuses on the initial design and design FMEAs.
5
Why do FMEA?
• The output of a FMEA can be used to establish:
– A more efficacious system design through improvements in inherent:
• Reliability, availability and maintainability
• Safety
– Safer operational limits such as duty cycle and operating environment
– Identification of single points of failure and mission-critical items that can lead to:
• Better maintenance strategies through improved maintenance task analysis,
reliability centred maintenance and level of repair analysis
• More highly optimised sparing
– Operator and maintainer training design input
– Operator and maintainer handbook input
– System test and inspection plan input
– Production process improvement.
6
Why do FMEA? (2)
• Because it can assist in dealing with the knowable unknowns…
– ‘Nature has established patterns originating in the turn of events, but only for the
most part’ ~ Leibniz (German scientist, mathematician et al, 1646-1716)
– Trends are difficult to ascertain when causative events may be random… but
the detection and severity of outcomes may be controlled
– Therefore, FMECA is a useful risk management tool that considers what
happens when events do not follow measures of central tendency such as
means, but rather are the ‘outliers of distributions’.
7
Why do FMEA? (3)
• However…
• If there is no management commitment to utilise the FMEA, especially during the
design stage, then the FMEA is a meaningless paper exercise
• It is a document that may be required at an accident board of inquiry or court to
demonstrate that the system and support system design and design procedures
were not negligent
• This presentation does not make you a competent FMEA practitioner but may assist
you in determining deficiencies in your knowledge so that you can seek professional
help either before or after…
8
Why not do FMEA?
• Funds aren’t available! FMEA is time consuming and resource intensive, utilising
engineering and technical personnel from the design and design-related specialities such
as integrated logistics support (ILS) and quality. The requisite system experts must be
engaged so that the analysis is not trivial
• From a safety perspective, FMEA is essential. However, a case may be made for omission
with respect to commercial off-the-shelf (COTS) acquisition if all of the following are
applicable:
– The mission profiles, modes and phases are consistent with the Original Equipment
Manufacturer (OEM) design intent
– The intended maintenance regime is consistent with the OEM design intent
– The operational environment is consistent with the OEM design intent
– Design analysis shows that any proposed modifications or deviations from the original
design will not affect the above considerations
– Field failure data is available to prove, or has proven, the safety and efficacy of the
original design
• These criteria may not apply to projects consisting of integration of COTS equipment since
the effects of component failure upon the system design may have to be explored.
Just ask the ‘Ariane 5’ design team how!
9
Limitations of FMEA
• The limitations generally do not outweigh the benefits of FMEA! Some common problems
and limits are:
– The failure modes must be able to be envisaged by the designers or analysts
– Multiple-failure interactions may not be easily foreseen due to the item-by-item and
function-by-function foci of analysis inherent in the process. Many-to-many
relationships between cause, failure mode, effect and control can result in complexity
of resolution
– External influences on the system may be easily overlooked if they do not cause
specific or obvious equipment failures. An example of this may be the long-term
effect of ionising or gamma radiation at altitude
– Similarly, external influences by the system, such as contaminating or polluting the
environment, may be easily overlooked even when the system is considered to be
operating properly. An example is the possible influence of a ferrous system
unwittingly introduced to a magnetic test and degaussing range
– Human factors are often overlooked, under the premise that the equipment is
correctly operated by caring, sharing, well-trained operators and maintainers with
appropriate levels of mechanical empathy.
10
Limitations of FMEA (2)
• Several strategies can help overcome these problems:
– For highly safety-critical systems, other methods such as Fault Tree Analysis (FTA)
are run back-to-back with FMEA to provide extra confidence that all conceivable
hazards are considered
– Laboratory investigation of possible failure modes may be required. Thorough simulation or stimulation of the system may demonstrate unexpected dynamics, such as those encountered in control loops
– A mixed-discipline review team active throughout FMECA development mitigates
the possibility of error
– Expert system and design engineering opinion must be sought when analysing
complex systems. For example, FMEA is not ‘just an ILS thing’.
11
When are FMEAs done?
[Timeline diagram: Concept, System, Subsystem and Component FMEAs are mapped against the design flow – Requirements, System Specifications and Component Specifications – the SDR, PDR and DDR design reviews, and design verification via component testing, subsystem testing, and system reliability and design verification testing, spanning the initial design phase through to Design Complete.]
SDR: System Design Review
PDR: Preliminary Design Review
DDR: Detailed Design Review
12
How to do a FMECA:
Step 1… Select the Standard
• In the absence of a contractual requirement, select the reference standard that will supply the most suitable baseline methodology for your system
• Some references for FMEA are:
1. MIL-STD-1629A, ‘Procedures for Performing a Failure Mode, Effects and Criticality
Analysis’
2. SAE ARP-5580, ‘Recommended Failure Modes and Effects Analysis (FMEA)
Practices for Non-Automobile Applications’
3. SAE J-1739, ‘Potential Failure Mode and Effects Analysis in Design (Design FMEA),
Potential Failure Mode and Effects Analysis in Manufacturing and Assembly
Processes (Process FMEA), and Potential Failure Mode and Effects Analysis for
Machinery (Machinery FMEA)’
4. AS IEC 60812, ‘Analysis techniques for system reliability - Procedure for failure
mode and effects analysis (FMEA)’
• It is important to work to a known standard to maintain consistent and recognised ratings
definitions for severity (criticality), occurrence and detection.
13
Step 2… Establish the Analysis Framework (1)
• Who are the facilitators, contributors and internal reviewers of the FMEA?
– What are their core competencies and experience?
– What is the relevance and level of authority of their input?
– A team of 4 to 8 people is considered ideal for the system, or each subsystem,
FMEA
• What is the frequency and timing of reviews, updates and deliveries?
– For example, draft deliveries prior to Design Reviews?
– Note that the FMEA is an iterative, and often staged, process.
Quote for the day:
‘A design review is
like 100%
inspection with a
so-so gauge’
14
Step 2… Establish the Analysis Framework (2)
• Who will be responsible for the means of processing and storing data and reporting
results?
– Documenting the methodology, assumptions and source data
– Enabling use of MS Excel or a proprietary software package (noting that many such
tools are merely documenting aids)
– Developing data input and output templates
– Developing configuration control, archival & backup procedures
– Enabling shared directory access by different departments
– Determining to whom the FMEA will be distributed
– Ensuring open linkages to other design and production processes
• Who in the management structure will approve the FMEA and have the capacity to
authorise recommended actions?
– The hardest part of all: management buy-in.
Is the FMEA software certified or developed to a standard suitable for use in evaluating a safety-critical system?
15
Step 2… Establish the Analysis Framework (3)
• At this point, the decisions and assumptions common to all the FMEA(s) should be considered
• The criticality analysis may follow either of two paths, depending, in part, upon the availability of
quantitative failure rate information for each line item:
– Systems in initial design phases often do not have reliability estimates in the form of failure rates
available
– Many systems providers either do not have or will not provide ‘commercially sensitive’ failure rate
information
– Software failure rates will likely only ever be low-confidence estimates
• These factors provoke a choice of either a:
– Qualitative criticality analysis (i.e. for systems with high software content, and preliminary or
unknown designs), or
– Quantitative criticality analysis
• A choice should be made to use one method only. This should be applicable to all the FMEAs in a
system so that the results are directly comparable.
16
Step 3… Define the Problem Areas of Interest
• A FMEA may have a particular need to address:
– Safety of personnel, the system itself, and external systems with which the system
interrelates
– Regulatory compliance
– Supplier design (or production) capability
– Environmental impact
– Economic impact
– Any aspect of RAM
• Any or all of the above may apply for new systems, or new or problematic technology
• This in turn may determine the system or subsystems that are selected for analysis
• Some subsystems may be mission systems in their own right, such as flight simulators,
due to their prime cost, complexity and support requirements.
17
Step 4… Define the System and Identify Subsystems
• The overall system may be considered to be that which has a ‘fault free’ boundary, i.e. the
mission system plus all other systems which can contribute to the mission system’s
failure
• Assumptions regarding external influences have to be documented, e.g. fire, flood,
lightning, terrorism, malicious damage, reliability of commercial or government furnished
equipment and interfaces
• Identify the system for analysis, and determine whether analysis can be accommodated
as a single entity, or broken into a number of major subsystems, each with its own FMEA
• The boundaries for a system (and constituent subsystems) can be established through:
– The hardware or geographic orientation, e.g. the north-east XYZ beacon transmitter
– The functional service provided, e.g. training module, wide area network
– The discipline, e.g. hydraulic services, software
– Any arbitrary division that simplifies the number and type of interfaces across
boundaries.
18
Step 5… Select the Type of Analysis for Each FMEA
• A FMEA may be functional or element oriented, or any hybrid thereof:
– The functional approach is usually used prior to detailed design information being
available, and is usually undertaken in a ‘top down’, deductive manner:
• Concept and system FMEAs are usually functionally oriented
• System engineers provide significant input
– The element or component approach is usually undertaken in a ‘bottom up’ inductive
manner:
• Design FMEAs are usually component oriented
• Design engineers, as well as system engineers, provide significant input
– The hybrid approach allows complex systems to be dealt with at high levels
functionally, transitioning to constituent components at a lower level. This is
particularly suitable where equipment or components directly relate to functionality
• Although MIL-STD-1629A states that it applies only to hardware, the FMEA approach (even when that standard is used) is applicable to both hardware and software elements.
19
Step 6… Subdivide the System and Subsystems
• For the purposes of analysis, the system or subsystem in each FMEA is then
decomposed into manageable sections
• The boundaries, if not physically evident, may be determined in the same manner that
was suggested for defining the system and subsystem boundaries
– This may take the form of a logical, physical or hybrid hierarchy (a simple sketch follows this list):
– System
– Subsystems
– Programs
– Assemblies
– Modules
– Subassemblies
– Subroutines
– Procedures
– Components or parts
– Functions
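As a minimal, purely illustrative sketch (the breakdown, class and names below are hypothetical and not prescribed by any of the referenced standards), such a hierarchy can be recorded as a simple tree so that every FMEA line item carries its level of indenture:

```python
# Sketch: a system breakdown tree for FMEA line items.
# Names and levels are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Item:
    name: str
    level: str                      # e.g. 'System', 'Subsystem', 'Assembly', 'LRU'
    children: list = field(default_factory=list)

    def add(self, child: 'Item') -> 'Item':
        self.children.append(child)
        return child

    def walk(self, indenture: int = 0):
        """Yield every item with its depth so a worksheet can be generated."""
        yield indenture, self
        for child in self.children:
            yield from child.walk(indenture + 1)

system = Item('Differential Analyser', 'System')
power = system.add(Item('Power supply subsystem', 'Subsystem'))
power.add(Item('Contactor CB1', 'LRU'))       # chosen lowest level of analysis

for depth, item in system.walk():
    print('  ' * depth + f'{item.name} ({item.level})')
```

Walking the tree in this way yields the ordered list of line items that populate the FMEA worksheet, down to whatever lowest level is selected in the next step.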
20
Step 7… Selecting the Lowest Level of Analysis (1)
• The lowest level for complete initial analysis must then be determined to finish the subdivision of the system or subsystem
– For many large systems, it is arbitrarily down to the lowest replaceable or line
replacement unit (LRU). An LRU is any part replaced by the first or organisational
level of maintenance
– For small systems or assemblies, it may be down to individual piece parts or
components
• Software systems may analyse computer software components (CSCs) as the lowest
level
• Assuming a particular lowest hierarchical level for the analysis ensures full coverage, so that no areas are overlooked, yet the resolution should not be so fine that it needlessly engages resources and produces excessively detailed and voluminous output.
Save the trees – don’t sweat the small stuff – and avoid information
overload!
Caveat: Unless you have to…
21
Step 7… Selecting the Lowest Level of Analysis (2)
• For this reason, performing an initial high-level system-wide assessment of criticality
(possibly as a ‘generic’ FMEA) is useful – it culls, where justifiable, non-critical
subsystems from further analysis while still allowing analysis of the more critical
subsystems. For example, a data archival service may not be considered ‘mission
critical’ to a primary service of providing online data
• Analysis should only be performed to a depth beyond which no more information, useful
from the top-level perspective and intentions of the FMEA, would be gained. Criticality
and occurrence issues tend to be obvious. Detectability issues, particularly concerning
‘hidden failures’ are not necessarily so
• For example, if the system stops flying, we don’t really care what specifically broke
because hopefully we can quickly detect the failure cause at some level and
compensate, or let the design compensate!
22
Step 7… Selecting the Lowest Level of Analysis (3)
• Consider an electronics backplane FMEA:
– Analysis of a 1/4W resistor in a particular printed circuit assembly gives the next
higher failure mode as failure of the printed circuit assembly (PCA) itself
– However, design analysis has shown that all conceivable failures are benign and
failures of this nature are covered by PCA Built In Test Equipment (BITE)
– In this particular case, no more useful information to the FMEA is provided by
component analysis of the PCA than if failure modes for the PCA alone had been
provided because it is the PCA’s diagnostic port that either reports failure or fails to
report to the system
• Selecting the lowest level of analysis should therefore be a dynamic rather than a
prescribed process.
23
Step 8… Determining the Functions
• Faults are an inability to function in the desired manner, or else operation in an undesired manner,
regardless of the cause. A failure is an ongoing fault
• Failure Modes are the particular manner in which a failure occurs for a given cause
• Therefore, the first step for the subsequent determination of failure modes is to document all the
required functions of each item
• Functions should be able to be composed in a ‘verb-noun’ or ‘verb-phrase’ type expression, e.g.
‘Transmits - Turbine torque to main rotor gearbox’
• Sources of information for the required high-level functions will include the product specifications
which should dictate regulatory as well as functional requirements.
[Diagram relating Function, Fault/Failure, Failure Mode and Effect]
24
Step 9… Determining the Failure Modes (1)
• A failure mode (FM) can be described as an ‘anti-function’. Therefore, for every identified
function that an element has to perform to support the system, there are one or more
failure modes
• Information sources for failure modes may be interpreted from:
– Warranty, FRACAS, and failure history databases
– Reliability Block Diagrams (RBDs)
– Functional Block Diagrams (FBDs)
– Boundary Diagrams
– Interface Matrices
– Engineering drawings, schematics, and bills of material
– RIAC FMD-97 “Failure Mode/Mechanisms Distributions” which covers electrical,
electronic, mechanical and electromechanical parts and assemblies
• Some guidance on the construction of RBDs and FBDs is available in MIL-STD-756,
‘Reliability Modeling and Prediction’, as well as IEC 61078.
How can it
break, and
what is the root
cause?
25
Step 9… Determining the Failure Modes (2)
• System functionality during all operational, idle, standby and storage phases and states
should be considered
• Some examples of operational mission phases may be taxi, take-off, departure, cruise,
system deployment, holding, descent, approach, landing. Each of these phases can be
broken down again if necessary
• Failure modes of a system or subsystem can be caused by component faults or failures
• The failure modes caused by possible software faults must not be omitted. For
example, some modes of memory corruption can sometimes allow a program to continue
execution, albeit with incorrect data in that memory
• Define what is a failure carefully, e.g. operation of a safety device such as a circuit
breaker may be a mitigating response at a local level though it may cause higher level
system failure
26
Step 9… Determining the Failure Modes (3)
• The distinction between states and modes is often arbitrary. As guidance:
• A state is a functional condition or arrangement of operation that a system must be in
order to perform certain functions
– For example, the RS232 interface must have the Data Set Ready line in a high state
prior to receiving data
• A mode is a special state (i.e. a functional condition or arrangement) that implies that the
state is extended over time rather than being transient, and there is a high likelihood
that activities characteristic of the extended state will be carried out
– For example, landing gear should be in the landing mode prior to landing.
• As long as all the functions (what does this thing do?) in the respective states and
modes (when does it have to do this?) are identified, then one doesn’t have to get
particularly bitter and twisted over these kinds of definitions.
27
Step 9… Determining the Failure Modes (4)
• Therefore, for each mission phase and mode in which the element or subsystem should
be operable, the questions are asked:
– What is the effect of the failure mode upon the desired function?
• Complete failure?
• Partial failure? (i.e. does less than intended in scope, amplitude or timing)
• Intermittent failure (i.e. intermittently starts or stops)?
• Over-functioning (i.e. does more than intended)?
• Unintended functioning (i.e. does something else)?
• Compliance failures, such as excessive exhaust emissions, often fall into the latter
category.
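As a rough sketch of applying the questions above mechanically, each 'verb-noun' function can be paired with these generic 'anti-function' categories to seed candidate failure modes for later screening (the function text and helper name are illustrative only):

```python
# Sketch: pair each 'verb-noun' function with the generic anti-function
# categories above to generate candidate failure modes for screening.
FAILURE_CATEGORIES = [
    'Complete failure',
    'Partial failure (less than intended in scope, amplitude or timing)',
    'Intermittent failure',
    'Over-functioning (does more than intended)',
    'Unintended functioning (does something else)',
]

def candidate_failure_modes(function: str) -> list[str]:
    """Return one candidate failure mode per generic category."""
    return [f'{function}: {category}' for category in FAILURE_CATEGORIES]

for mode in candidate_failure_modes('Transmits turbine torque to main rotor gearbox'):
    print(mode)
```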
28
Step 9… Determining the Failure Modes (5)
• Failure modes are also often attributable to failure of an interface. Interfaces are
empirically responsible for 50% of field failures
• For each item in the FMEA, boundaries must be defined and interfaces through these boundaries identified. A 'black box' concept of the item may assist. Passive, static and non-load-bearing elements should all be considered.
[Black-box boundary diagram: a hypothetical item (the 'Eludium Q36 Explosive Modulator') with interfaces crossing its boundary. Energy transfer: thermal, magnetic, electric, electrostatic, radiation, torque, pressure, power, force, fields, load, impulse etc. Data transfer: messaging, logic, flow control, status, data, alarm etc. Material transfer: liquids, solids, gases, plasma, dust, colloids etc.]
29
Step 9… Determining the Failure Modes (6)
• Note that the speed, timing, quantity and quality of interface exchanges and any other
constraints upon them may be significant
• Relative movements and positions of assemblies may be important, e.g. a constraint on operation of a rolling-ball type computer mouse is that it must be in contact with a smooth, clean surface. An undesirable interface might be dust, which impedes the interaction of the mouse's internal rollers and ball
• Consideration must also be given to the human interaction and greater environment in
which the system is operating, such as weather and seasonal changes, night/day,
ionospheric state, magnetic dip etc. Human interaction may provoke faults in terms of
incorrect or insufficient maintenance as well as operator error.
30
Step 9… Determining the Failure Modes (7)
• As well as required interfaces, consideration should be given to undesired interfaces,
such as contamination, noise, vibration, blockage, leakage, overload and overflow, and
corruption, that may possibly occur
• The effect of ageing may also give rise to possible failure modes for an element such as
wear-out, erosion, misalignment, deterioration, fatigue, creep and corrosion
• Dimensional (due to wear, deformation, fatigue, thermal effects, lubrication instability
etc), as well as electronic, tolerance errors, imbalance or instability over time may be
insidious. These errors may compound or ‘stack’ at a higher subsystem or system level.
31
Step 9… Determining the Failure Modes (8)
• An example of a procedure to determine failure modes is:
1. Break complex components into subcomponents
2. Identify the functional contribution of each
3. Deduce possible failure modes for each function from comparable items:
– Consider any possible physical, time or other stresses on the item that may cause failure
– Are there any possible secondary or higher order failures that may also arise?
4. Associate causes to each failure mode
– Note that a single cause can provoke the failure of more than one function or component that shares the same interface or is exposed to the same conditions. These are termed common mode failures. As examples:
• A broken drive belt may cause failure of an engine’s alternator, coolant pump, cooling fan,
steering assist pump, air conditioning compressor, and air injection pump
• Fuel contamination can cause multiple engine failure
5. Progressively record results on a cause-effect or ‘fishbone’ diagram.
32
Step 9… Determining the Failure Modes (9)
• A fishbone diagram:
– The part failure (‘Contactor
failure’) is the local effect
– A part failure may have many
failure modes
– A failure mode (such as
‘Contacts resistive’) may have
one or more causes
• At succeeding levels, a previous
failure mode may now become a
cause for a failure mode. For
example:
– An ‘Overcurrent’ cause results
in a ‘Contacts burnt’ failure
mode
– ‘Contacts burnt’ is then a cause for the ‘Contacts resistive’ and ‘Armature sticking’ failure modes.
[Fishbone diagram for the local effect ‘Contactor failure’. Branches: contacts resistive (contacts dirty, contacts burnt, loss of spring tension); armature sticking (contacts burnt, hinge binding); contacts burnt (overcurrent); solenoid open circuit (winding failure; solenoid burnout from overvoltage or duty cycle exceeded); terminals loose; input failure (controller failure, connection failure).]
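A minimal sketch of recording these cause-and-effect relationships so that a failure mode at one level reappears as a cause at the next. The groupings are read from the contactor fishbone above; the structure itself is illustrative rather than mandated by any standard:

```python
# Sketch: cause -> failure mode -> effect chains from the contactor fishbone.
# Each entry maps a failure mode (or the local effect) to its immediate causes.
CAUSES = {
    'Contactor failure':     ['Contacts resistive', 'Armature sticking',
                              'Solenoid open circuit', 'Terminals loose', 'Input failure'],
    'Contacts resistive':    ['Contacts burnt', 'Contacts dirty', 'Loss of spring tension'],
    'Armature sticking':     ['Contacts burnt', 'Hinge binding'],
    'Contacts burnt':        ['Overcurrent'],
    'Solenoid open circuit': ['Winding failure', 'Solenoid burnout'],
    'Solenoid burnout':      ['Overvoltage', 'Duty cycle exceeded'],
    'Input failure':         ['Controller failure', 'Connection failure'],
}

def causal_chains(effect: str, chain=()):
    """Yield each chain from the stated effect down to a root cause."""
    causes = CAUSES.get(effect)
    if not causes:
        yield chain + (effect,)
        return
    for cause in causes:
        yield from causal_chains(cause, chain + (effect,))

for path in causal_chains('Contactor failure'):
    print(' <- '.join(path))
```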
33
Step 9… Determining the Failure Modes (10)
• Sometimes analysis does not assist with determining all the failure modes
• Determining unknown failure modes is often effectively maximised by ‘brainstorming’ the
information collated through the preceding processes
• Brainstorming is a non-procedural way of determining all possible alternatives in a non-
critical manner. Brainstorming attempts to utilize the imaginative and creative faculty
which is often circumvented by habitual logical or critical thought processes
• Use ‘probing’ rather than ‘leading’ types of questions, and an ‘active’ listening style
• Once all possible alternatives have been enumerated, then critical evaluation of each
alternative can be performed, and not before all input from the brainstorming session
has ceased. A number of short sessions with suitable personal review time intervening
may be more productive than a single long session.
34
Step 10… Documenting the Failure Modes
• Example subsystem: Differential Analyser
• Function: overload protection for the Differential Analyser power supply
• Item: Contactor CB1
• Failure modes and causes: 1. Contacts resistive: spring tension loss; 2. Contacts resistive: contacts burnt; 3. Contacts resistive: contacts dirty; 4. … etc …
• Local effect: contactor failure, current interrupted from main switchboard MS1
• Next higher level: voltage drop in Analyser supply cct MS1-CB1
• End effect: failure of the Differential Analyser through 240 VAC power supply drop out
35
Step 11… Determining the Effects (1)
• Once all the pertinent failure modes have been determined, the effects can be
considered
• Consider each of the system or subsystem elements, and the possible effect of each failure mode upon:
– Personnel
– Equipment
– Downtime
– Environment.
36
Step 11… Determining the Effects (2)
• Effects are documented according to the level within which they operate:
– Local effects. This concerns the effect upon the line item being analysed and
associated items in its immediate environment. The local effect may be the failure
mode
– Next Higher Level. This is for the next higher functional level or assembly in the
FMEA’s structure
– End effects. These concern the effect on the top level of the FMEA concerned, i.e.
the end effect for the system, subsystem, or component FMEA under consideration.
At the highest level, this includes the end user as well as the system. No
consideration of compensating provisions such as redundancy or safety devices is
given.
37
Step 11… Determining the Effects (3)
• End effects can percolate up to next higher level FMEAs as failure modes
• They may or may not represent a critical failure in the higher level FMEA
• Criticality is not necessarily inherited, and must be carefully interpreted from the relevant
specifications
• For example,
• A critical failure in a training subsystem will cause failure of the training
subsystem
• A failure of the training subsystem may not necessarily cause a critical failure of
the mission system, but will be considered as a failure mode in the mission
system FMEA.
38
Step 11… Determining the Effects (4)
[Diagram: subsystem FMEAs feeding a mission system FMEA. In the Simulator System FMEA (end effect on the Simulator), LRU and assembly failures roll up to Sim Failure 1 and Sim Failure 2, while module failures have no effect. In the Aircraft System FMEA (end effect on the Aircraft), LRU and assembly failures roll up to AC Failure 1 and AC Failure 2, while module failures have no effect. In the Mission System FMEA (end effect on the Mission System), AC Failure 1 and AC Failure 2 appear as failure modes causing system failure, whereas Sim Failure 1 and Sim Failure 2 appear as failure modes causing only degraded operation.]
39
Step 12… Determine the Failure Detection Method (1)
• Detection is the mechanism by which a failure becomes apparent so that compensating
action for continued operation, or else corrective maintenance, can be undertaken
• Abnormal indications of any system fault from any built in test equipment or processes must
be documented, such as ‘red arc’ or unusual behaviour of readings from instruments or
gauges
• Does the method allow isolation of the fault to some particular subsystem or element?
• What are the indications of failure of the indicator itself?
– Will watchdog timers or exception handlers provide warning messages or signals?
– Can the readings be correlated with some other instrument or gauge?
• For example, the indications of an aircraft’s failed attitude indicator will not cross-
reference correctly with the rate of turn and climb/descent readings.
40
Step 12… Determine the Failure Detection Method (2)
• If no alarm, instrument, or message provides notification, how does the operator know a failure within
the system is occurring or has occurred?
– What are qualitative signs in terms of specific auditory, visual, kinaesthetic, and olfactory inputs?
– What is the diagnostic logic that the operator will use?
• For example, a clutch overload may be evident from the burning smell, which may or may not be distinguishable from burning wiring insulation. Further fault isolation may be important from a safety perspective, such as risk of fire or the production of noxious gases
• In some instances, use of operator controls will verify some aspect of system failure and permit some
degree of fault isolation through the application of appropriate logic
– For example, complete failure of an engine in a light twin-engined aircraft may be identified in the
first instance by the requirement for a large rudder deflection to offset the asymmetric thrust, and
the failure is confirmed by tentatively reducing throttle on the ‘dead’ engine
– Another example is using circuit breakers or switches to determine which circuit may be
overloaded or defective.
41
Step 12… Determine the Failure Detection Method (3)
• Failures which may not be apparent during the course of normal operation until a primary
failure occurs are known as hidden failures
– An example of this is an automatic standby pump that fails to operate when a main
hydraulic pump fails
– Another example is failure of an uninterruptible power supply when the primary
power supply is interrupted
• Conversely, a standby pump may mask the failure of a main pump, or an uninterruptible
power supply may mask an intermittent or poor quality primary power supply
• The failures may only be discovered during the course of preventive maintenance
activities
• The indications and detection of failure of the second or tertiary backup systems as well
as the primary system must also be documented.
42
Step 13… Determine the Compensating Provisions (1)
• The Compensating Provisions section documents design and operator provisions for
recovery
• Design provisions are how the system recovers itself or tolerates failure. Forms of this
include:
• Equipment redundancy and switching mechanisms, including online, standby,
and offline types
• Alternate modes or paths of operation
– An example is automatic rerouting of network messages to bypass a failed
carrier service
• Increasing tolerance to overload, including improved material, derating, less
stressful environmental and operating limits
• Acceptable or graceful degradation
• Recovery actions such as function retry or software reloading.
43
Step 13… Determine the Compensating Provisions (2)
• Operator provisions are those actions that an operator takes to overcome or mitigate a
failure
• Operator provisions may include bringing online redundant devices, or controlling the
system in an other than normal fashion
– An example of this would be load-shedding of an electrical subsystem in the event
of a generator failure with limited backup provision by operating circuit breakers to
isolate power to non-critical subsystems
• Where there is a possibility of an incorrect operator (or maintainer) response, the
consequences should also be documented
• Changes in preventive maintenance frequency or intensity may reduce the possibility of
failure.
44
Step 14… Determine the Severity (1)
• Determination of the severity classification of a failure is performed with respect to
the end effect on the functional requirement and any possible safety hazard.
[Classification diagram. A failure – an event, including interoperability, in which an item does not perform as specified – is classed as either:
– Non-mission critical: item failure does not prevent the system from performing its mission, or
– Mission critical: item failure prevents the system from performing its mission. Mission-critical failures are further classed as:
• Non-safety critical: item failure prevents the mission but does not threaten system or operator safety
• Safety critical: item failure threatens system or operator safety]
45
Step 14… Determine the Severity (2)
• For MIL-STD-1629, Severity Classification falls into one of the following categories according to the
worst potential consequences. Some alternative industry interpretations are also given to illustrate
how these apply to commercial applications:
• Category I (Catastrophic): a failure results in major injury or death of personnel
• Category II (Critical): a failure results in minor injury to personnel, personnel exposure to harmful chemicals or radiation, a fire, or release of chemicals into the environment
• Category III (Marginal): a failure results in a low-level exposure to personnel, or activates a facility alarm system
• Category IV (Minor): a failure results in minor system damage but does not cause injury to personnel, allow any kind of exposure to operational or service [i.e. maintenance] personnel, or allow any release of chemicals into the environment
46
Step 14… Determine the Severity (3)
• The severity classifications that are unacceptable should be defined
• Unacceptable levels are usually Levels I & II, and sometimes Level III
• The levels of injury and exposure should also be defined. Some examples are:
– Low level exposure: Less than 25% of short-term exposure limits published for
work, health and safety
– Minor injury: A small burn, cut or pinch, or light electrical shock that can be handled by first aid and is not responsible for significant lost time
– Major injury: Requires medical attention other than first aid
47
Step 14… Determine the Severity (4)
• Other methods determine a Severity Number, usually based on a scale of 1
(goodness) to 10 (badness), which later facilitates calculation of a Risk Priority
Number for the failure mode.
• Whatever scale is used, it preferably should be traceable to a standard, and must be
documented.
Is it a ‘must work’ function?
48
Step 14… Determine the Severity (5)
• It is advantageous to formulate ground rules for each analysis by which severity can
more easily be evaluated. Some examples of these rules are:
1. For catastrophic hazards, dual component failures (items which are one-fault tolerant) are credible (i.e. could happen)
2. For catastrophic hazards, triple component failures (items with two-fault tolerance)
are not credible (i.e. not likely to happen)
3. For critical hazards, single component failures are credible
4. For critical hazards, dual component failures are not credible
5. Generally not included in the analysis are mounting brackets, secondary structures,
wiring and enclosures
• Document any such ground rules with the other assumptions!
• Some examples of Severity Numbering, mostly for consumer-oriented products such as
for the automotive market, are on the next slide.
49
Step 14… Determine the Severity (6)
Rating – three example interpretations:
1 – None (not apparent, no effect) | None | Unlikely to be detected
2 – Very minor (not apparent, minor effect) | Fit/Finish/Squeak/Rattle (FFSR) noticed by < 25% of customers | 20% chance of a customer return
3 – Minor (nuisance to the customer) | FFSR noticed by 50% of customers | 40% chance of a customer return
4 – Very low (lowered effectiveness) | FFSR noticed by 75% of customers | 60% chance of a customer return
5 – Low (customer complaint) | Comfort/convenience reduced | 80% chance of a customer return
6 – Moderate (potential ineffectiveness) | Comfort/convenience item(s) inoperable | 100% chance of a customer return
7 – High (customer dissatisfaction) | Reduced performance | Failure results in a customer complaint
8 – Very high (ineffective service or treatment) | Loss of mission function | Failure results in serious customer complaint
9 – Extremely high (regulatory non-compliance) | Loss of safe operation or regulatory non-compliance, with warning | Failure results in non-compliance with statutory safety standards
10 – Dangerously high (injury or death) | Loss of safe operation or regulatory non-compliance, without warning | Failure results in death
50
Step 15… The Risk Assessment Process (1)
• At this point, the FMEA is largely complete, with only criticality or priority analysis to prioritise the
effects of failure modes to finish the risk assessment process
• For the qualitative approach, there are two basic methods:
1. Determining and plotting a Probability of Occurrence, P(O), against the Severity Category for
each failure mode using MIL-STD-1629 definitions:
• Level A – Frequent
• Level B – Reasonably probable
• Level C – Occasional
• Level D – Remote
• Level E – Extremely unlikely
2. Calculating a Risk Priority Number (RPN) using the Severity Number, Occurrence Number, and
Detectability Number, and plotting the RPN against each failure mode
• The Occurrence Number is selected from a predetermined table of values in the same manner as for
the Severity Number. If failure rate data is available, a selection can be quantitatively made.
51
Step 15… The Risk Assessment Process (2)
1 – Remote (failure is unlikely): one occurrence in greater than 5 years, or less than 2 occurrences in 1 billion events (Cpk ≈ 2.00)
2 – Remote (failure is unlikely): one occurrence every 3 to 5 years, or 2 occurrences in 1 billion events (Cpk ≈ 2.00)
3 – Low (relatively few failures): one occurrence every 1 to 3 years, or 6 occurrences in 10 million events (Cpk ≈ 1.67)
4 – Low (relatively few failures): one occurrence per year, or 6 occurrences in 100,000 events (Cpk ≈ 1.33)
5 – Low (relatively few failures): one occurrence every 6 months to 1 year, or 1 occurrence in 10,000 events (Cpk ≈ 1.17)
6 – Moderate (occasional failure): one occurrence every 3 months, or 3 occurrences in 1,000 events (Cpk ≈ 1.00)
7 – Moderate (occasional failure): one occurrence every month, or 1 occurrence in 100 events (Cpk ≈ 0.83)
8 – High (repeated failure): one occurrence per week, or a probability of 5 occurrences in 100 events (Cpk ≈ 0.67)
9 – High (repeated failure): one occurrence every 3 to 4 days, or a probability of 3 occurrences in 10 events (Cpk ≈ 0.33)
10 – Very high (failure is almost inevitable): more than one occurrence per day, or a probability of more than 3 occurrences in 10 events (Cpk < 0.33)
52
Step 15… The Risk Assessment Process (3)
• If a numeric value for the failure mode, e.g. failure rate is available, then the value can
be converted to an occurrence per unit time (or per number of events), and the closest
definition of Occurrence Number selected
• The rate for a failure mode will have to be apportioned from the failure rate for the item.
An example of failure mode ratio allocation for an engine block assembly is shown on
the next slide
• Obviously if no quantitative data is available, then the analyst simply makes a qualitative
assessment from the table for the failure mode.
Learn to distinguish between the truly impossible and the highly improbable…
And then between those events that are unlikely but possible…
53
Step 15… The Risk Assessment Process (4)
Example: allocating Occurrence Numbers for an engine block with a 100% duty cycle and an expected life of 4,000 operational hours (250 FPMH overall):
• Failure of piston – 10% – 25 FPMH or 40,000 hours MTBF ≈ a failure every 3 to 5 years → select 2 for Occurrence
• Failure of piston rings – 40% – 100 FPMH or 10,000 hours MTBF ≈ a failure every 1 to 3 years → select 3 for Occurrence
• Failure of connecting rod & pin – 15% – 38 FPMH or 26,667 hours MTBF ≈ a failure every 1 to 3 years → select 3 for Occurrence
• Failure of crankshaft – 5% – 13 FPMH or 80,000 hours MTBF ≈ a failure in greater than 5 years → select 1 for Occurrence
• Failure of main or big end bearings – 30% – 75 FPMH or 13,333 hours MTBF ≈ a failure every 1 to 3 years → select 3 for Occurrence
• Total – 100% – 250 FPMH or 4,000 hours MTBF
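A short sketch of the arithmetic behind this allocation, assuming continuous operation at 8,760 hours per year and band edges simplified from the occurrence table two slides back (borderline intervals, such as roughly three years, may reasonably be judged into either adjacent band, as in the worked example):

```python
# Sketch: apportion an item failure rate to its failure modes and convert each
# to a calendar interval so an Occurrence Number can be read off the table.
import bisect

HOURS_PER_YEAR = 8760                 # assumes a 100% duty cycle, as above

# Years between failures -> Occurrence Number: <=1 yr -> 4, <=3 yr -> 3, <=5 yr -> 2, else 1
BAND_EDGES = [1.0, 3.0, 5.0]
BAND_NUMBERS = [4, 3, 2, 1]

def occurrence(item_fpmh: float, mode_ratio: float) -> tuple[float, float, int]:
    """Return (mode FPMH, MTBF in hours, Occurrence Number) for one failure mode."""
    mode_fpmh = item_fpmh * mode_ratio
    mtbf_hours = 1_000_000 / mode_fpmh
    years = mtbf_hours / HOURS_PER_YEAR
    return mode_fpmh, mtbf_hours, BAND_NUMBERS[bisect.bisect_left(BAND_EDGES, years)]

for mode, ratio in [('Piston', 0.10), ('Piston rings', 0.40), ('Crankshaft', 0.05)]:
    fpmh, mtbf, occ = occurrence(250.0, ratio)
    print(f'{mode}: {fpmh:.0f} FPMH, MTBF {mtbf:,.0f} h, Occurrence {occ}')
```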
54
Step 15… The Risk Assessment Process (5)
• The last step in the determination of an
RPN is selection of a Detectability
Number
• This represents the probability that a
cause or mechanism and potential
failure mode will be detected
• The Detectability number may reflect
the BITE design capability as well as
the expected operator competence in
fault detection and isolation.
1 Almost certain
2 Very high
3 High
4 Moderately high
5 Moderate
6 Low
7 Very low
8 Remote
9 Very remote
10 Absolute uncertainty
55
Step 16… Calculate the Risk Priority Number
• The RPN is therefore an integer with a range from 1 (most benign) to 1,000 (most serious).
RPN = Severity Number (1 to 10) × Occurrence Number (1 to 10) × Detectability Number (1 to 10)
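A minimal sketch of the calculation; the example ratings are arbitrary:

```python
# Sketch: the RPN is the product of the three 1-to-10 ratings described above.
def risk_priority_number(severity: int, occurrence: int, detectability: int) -> int:
    """Each rating must be an integer from 1 to 10; the RPN spans 1 to 1,000."""
    for rating in (severity, occurrence, detectability):
        if not 1 <= rating <= 10:
            raise ValueError('ratings must be between 1 and 10')
    return severity * occurrence * detectability

# e.g. a severe (8), occasional (6), hard-to-detect (7) failure mode
print(risk_priority_number(8, 6, 7))   # 336
```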
56
Step 17… Using the RPN (1)
• A ‘risk threshold’ value is determined, as well as a criticality number, beyond which
action must be taken
• High RPNs can be clearly identified by:
– Plotting against each failure mode so that they are readily identifiable
– Using Pareto analysis to segregate high-level RPNs, for example, identifying the top 20% of risk (a short sketch follows at the end of this slide)
• High RPNs are actioned by reducing the occurrence or severity, or increasing the
detectability, of failure modes.
Don’t select high thresholds to
get the ‘right’ answers!
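A rough sketch of the Pareto screening described above; the failure mode names and RPN values are hypothetical:

```python
# Sketch: rank failure modes by RPN and flag the top 20% for recommended action.
modes = {'Contacts resistive': 336, 'Solenoid burnout': 120,
         'Hinge binding': 45, 'Terminals loose': 280, 'Contacts dirty': 60}

ranked = sorted(modes.items(), key=lambda kv: kv[1], reverse=True)
cutoff = max(1, round(0.2 * len(ranked)))          # top 20%, at least one item
for name, rpn in ranked[:cutoff]:
    print(f'Action required: {name} (RPN {rpn})')
```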
57
Step 17… Using the RPN (2)
[Diagram: a device exhibits several failure modes leading to an effect. Occurrence is rated against the failure modes, Severity against the effect, and Detectability against the controls that detect, control the cause of, or mitigate the effect of each failure mode.]
58
Step 17… Using the RPN (3)
• Occurrence Number – Avoid or Eliminate Failure Causes: this is usually the first item to be changed. Improvements in material, design, testing (find and fix), operating environment, operating limits, maintenance etc reduce the frequency of occurrence, i.e. the cause of failure
• Severity Number – Eliminate or Reduce the Consequences of Failure: change will generally require system redesign to change the effects of failure. An example could be adding turbine engine containment shielding to protect against rotating part failure
• Detectability Number – Identify or Detect the Failure Earlier: changing this item is usually the least preferred action to improve the RPN, as changes in Occurrence and Severity will often have a more desirable outcome, i.e. more influence on the cause and result
59
Step 17… Using the RPN (4)
• All significant and critical items (e.g. those possessing failure modes resulting in end
effects equivalent to Severity Category I and II definitions) must have some form of
recommended action
• Recommended actions must be detailed, potentially effective and executable
• If the granularity of resolution for significant and critical items is too coarse, then lower levels of FMEA can be undertaken for these items
• The responsibility and completion date for recommended actions to improve the RPN
must be documented in the FMEA
• The final, acceptable RPN after recommended actions have been completed is then
entered into the FMEA worksheet. Otherwise, the process iterates.
60
Step 18… Costing the Risk (1)
• As well as the RPN metric, a frequency-cost metric can also be formulated to
augment analysis:
• This can be plotted in conjunction with the RPN for each failure mode to give a
bigger picture.
Failure mode frequency/year x Failure Cost/year
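A minimal sketch of the metric, assuming it is interpreted as the expected number of failures per year multiplied by the cost of each failure event; the figures are illustrative:

```python
# Sketch: annual cost risk for a failure mode.
def annual_cost_risk(failures_per_year: float, cost_per_failure: float) -> float:
    return failures_per_year * cost_per_failure

# e.g. a failure mode expected 0.5 times per year at $12,000 per event
print(f'${annual_cost_risk(0.5, 12_000):,.0f} per annum')   # $6,000 per annum
```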
61
Step 18… Costing the Risk (2)
• The cost risk per annum of a particular
failure mode can be given a number in
the same manner as a severity
number, as per the following table
• The actual values used will accord with the impact of failure upon the particular system.
1 Insignificant < $100 PA
2 Extremely low > $500 PA
3 Low >$1,000 PA
4 Low-medium > $5,000 PA
5 Medium > $10,000 PA
6 Medium-high > $50,000 PA
7 High > $100,000 PA
8 Very high > $500,000 PA
9 Extremely high > $1,000,000 PA
10 Disastrous > $10,000,000 PA
These costs are
project specific!
62
0000 0101… Some Software FMEA Considerations
• Software can, and should, be considered within a system FMEA or as a stand-alone FMEA, as it inevitably forms a management, control and diagnostic part of any system that does not solely rely on fires and steam. It falls into the highly recommended category for full safety design effort, and provides a medium benefit-to-cost rating according to NASA
• Software FMEA is only practicable at a functional level – it’s a virtual world out there
• Software modules, per se, do not fail or wear out - they only display incorrect behaviour as designed
into them:
– Software is nearly always delivered broken… but no-one knows exactly how much
– The only thing that software testing really proves is that one more bug was found
– Every software system is considered to contain faults which may lead to functional failure under
particular triggering conditions
– Failure modes for all possible dynamic conditions may be impossible to predict, or even test for
by simulation, so the emphasis is on correct design and defensive strategies.
The end of a software program is a declaration – not a fact…
63
0000 0101… Some More Software FMEA Considerations
• Therefore, analysis of software can only look for likely, rather than known, failure modes:
– Analysis may, however, determine the measures needed to be taken to prevent or mitigate
the occurrence and consequences of failure by specifying design and coding rules, and
review, inspection and testing requirements for different code functions
– One of the most useful outputs, given a rigorous development environment, is then a list of
test cases, which may be used to develop specific scenarios and stressors that may trigger
each potential failure mode
– Possible failure modes must consider any combination of inputs, timing and operating
modes that potentially produce fault conditions through programming or compiler errors.
Beware of COTS SOUP ~
Software Of Uncertain
Pedigree
64
0000 0101… Types of Software FMEA
• Software FMEA assists in identifying structural weaknesses in the design, and helps reveal weak or
missing requirements and latent software non-conformances
• There are two basic stages in software FMEA development:
– System software FMEA. This is performed as early as possible in the design phase, usually as
soon as the architecture has been developed and system functions are well-defined and
understood. This FMEA is used to evaluate the effectiveness of the top-level software
architecture and basic design protection of the system
– Detailed software FMEA. This is performed during the software detailed design and coding
stage of development. This FMEA validates that the software has been constructed to achieve
the specified mission and safety requirements. This FMEA is sometimes undertaken to examine
minimal cut sets from an initial Fault Tree Analysis that is performed instead of a system-level
FMEA.
65
0000 0101… Scope of Software FMEA
• All the software encountered in a project needs to be considered as candidates for FMEA depending
on the criticality of the application. As well as the application software, this may include other COTS
components, depending on whether they are certified to the required standard or not, such as:
– System kernel, e.g. boot and initialisation, the basic input output system
– System services, e.g. file and device input/output
– System and third-party library functions, and don’t forget the shrinkwrap, wrappers or glueware
– Development and support software, e.g. compilers, linkers, debuggers, development tools
– Embedded read-only memory software, including programmable logic
– Test software
• A decision must be made as to which hardware errors will directly influence software execution, and which affect software interfaces, and should consequently be considered as causes for possible software failure modes:
– Execution: central processor unit and memory failures, e.g. arithmetic logic unit, registers,
random access memory
– Interfaces: Peripheral failures, e.g. input/output ports, analog/digital converters, watchdog,
interrupt managers and timers that are implemented in hardware
• Just as for hardware, setting boundaries and assumptions for ‘good behaviour’ is required.
66
0000 0101… Software System FMEA
• Unintended system function due to software failure must be avoided in safety critical systems!
The FMEA will assist in identifying means of mitigating against potential failures and testing for
faults
• Initial safety requirements may come from:
– System specifications
– Regulatory compliance requirements
– Preliminary hazard analysis, which provides a matrix of potential hazards and hazard states
– Hazard testing results, which provide the required fault response times of the system
• A criticality, or potential risk, level can then be assessed for each software function determined in
the software architecture, which at this stage is usually a collection of computer software
configuration items (CSCIs).
67
0000 0101… Software System Failure Modes
• The most common failure modes for functions are:
– Failure to execute
– Incomplete execution
– Execution with incorrect timing, which includes incorrect activation and
execution time (including endless loop)
– Erroneous execution
• Two extra software failure modes specifically for interrupt service routines are:
– Failure to return (blocking lower-level priority interrupts from executing)
– Returning an incorrect priority.
68
0000 0101… Software Failure Causes
• There may be any number of causes that give rise to these failure modes, but some
general areas to consider are:
– Computational
– Logic
– Data I/O
– Data handling
– Interface
– Data definition
– Database.
69
0000 0101… Software System Effects
• Some system-level effects of failure may then be:
– The operating system halts or stops
– Program stops, with a clear error message
– Program stops, without a clear error message
– The program runs, producing obviously incorrect or ‘coarse incorrect’ results or unintended
functioning (including running too early or late)
– The program runs, producing ‘subtle incorrect’ results. These are apparently correct, but actually
incorrect, results
• The effects of the failure modes relevant to each functional subroutine are then assessed for
potentially hazardous outcomes
• From the effects, a severity can then be determined.
Service Provision Fault: Omission / Commission
Service Timing Fault: Early / Late
Service Value Fault: Coarse / Subtle Incorrect
70
0000 0101… Software Detailed FMEA
• The detailed FMEA can be applied to all, or only higher-risk modules by:
– Tracing potential failures in variables and their input (hardware or other routines)
– Tracing processing logic through the software to determine the effect of the failure
• A procedure to trace potential failures is to:
– Create a map between all input, output, local, and global variables and corresponding routines
– Develop failure modes for the data and variables. The variable (modes for input, effects for
output) failures are based on the variable type, i.e. the allowable type of data for the variable
– Develop failure modes for the events and processing logic. This involves determining what
negative effect the operators may have as well as dormant logic errors
• Note that any memory location, including buses and registers, that does not have integrity protection such as parity can be corrupted during operation
• At completion of a detailed FMEA, a mapping should exist from the top-level potential hazards to the
top-level critical variables.
71
0000 0101… Variable Failure Modes (1)
Variable type – failure modes:
• Analog or continuous – e.g. high (> 5.0); low (< 0.0)
• Boolean or logical – True when it should be False; False when it should be True
• Enumerated or list (e.g. A, B, C) – A when it should be B; A when it should be C; B when it should be A; B when it should be C; C when it should be A; C when it should be B
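A small sketch of enumerating these type-driven failure modes automatically; the variable names, ranges and enumeration values are illustrative only:

```python
# Sketch: generate failure modes for a variable from its type, as per the table.
from itertools import permutations

def analog_modes(name: str, low: float, high: float) -> list[str]:
    return [f'{name} high (> {high})', f'{name} low (< {low})']

def boolean_modes(name: str) -> list[str]:
    return [f'{name} True when it should be False',
            f'{name} False when it should be True']

def enumerated_modes(name: str, values: list[str]) -> list[str]:
    return [f'{name} is {a} when it should be {b}' for a, b in permutations(values, 2)]

print(analog_modes('CoolantTemp', 0.0, 5.0))
print(boolean_modes('PumpRunning'))
print(enumerated_modes('Gear', ['A', 'B', 'C']))
```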
72
0000 0101… Variable Failure Modes (2)
• Variables and data can also be analysed as having any of the following failure modes:
– Missing data, e.g. lost message, analog/digital conversion too granular, no read,
write or update
– Incorrect data. These can also be categorised as ‘subtle incorrect’ or ‘coarse
incorrect’, e.g. stuck, incorrect-in-range, out of range, inconsistent, bad command
– Timing of data. Data arrives too soon or too late (obsolete), e.g. data race or
collision, sampling rates too high or too low
– Extra data, e.g. data redundancy, message format incorrect, unwanted read, write or
update
• Consideration of these failure modes leads to an expansion of the previous variable map table, as required, as shown on the next slide.
Data tolerances for real data may cause threshold logic problems in dual redundant path processing
73
0000 0101… Variable Failure Modes (3)
Example: variable ArrayVar1 (double precision) in subroutine Program1:
• Omission (no read, write or update) – local effect: program failure
• Commission (unwanted read, write or update) – local effect: wasted processor cycles during I/O time
• Early timing – local effect: nil if no attempt is made to reload the original value within the timing cycle
• Late timing – local effect: use of obsolete data
• Subtle incorrect (value -0.00001 to +0.5) – local effect: possible error from use of a stuck or incorrect-in-range value
• Coarse incorrect (value +0.999E5 to +1.0E360) – local effect: error from inconsistent or out-of-range data
74
0000 0101… Logical Failure Modes
• Individual processing logic can also be examined for failure modes such as:
– Memory address errors, e.g. module coupling, array boundaries, and stale
pointers
– Language usage, e.g. variable declaration, possible underflow or overflow,
typing and initialisation
– Omitted event, i.e. event does not take place but execution continues
– Committed event, i.e. event takes place but shouldn’t have
– Incorrect logic, e.g. preconditions are inaccurate, event does not implement
intent, non-convergent algorithms, instability, endless loops, premature returns
– Timing or order, e.g. an event occurs too early, too late, or in wrong order
• FMEA for logic and events can be tabulated in a similar manner to the variable or
data table.
What does the software do
when the user times out?
75
0000 00101… Software Failure Detection
• Some examples of system failures are stopping, slow or incorrect responses, and startup failure. Detection of these types of failures often relies on the operator observing the system's lack of, or improper, functionality. Although self-evident, the actual means of detection should be supplemented with appropriate alarms or messages, including a means of system or operator compensation
• Memory errors that may require detection are stack overflow or corruption, memory leakage or
exhaustion, and the behaviour of garbage collection functions in real-time systems that may need
special checks
• The detection of failures may use software watchdog timers or task heartbeat monitors, message
sequence numbers, duplicated message checks (noting that this may be a purposeful strategy
for error-checking critical commands), software interrupts, input trend analysis, etc.
What is the effect on
synchronisation of a
distributed system when
processor failures
occur?
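As a minimal illustration of one of the mechanisms mentioned above, a software heartbeat monitor can flag a task that has stopped reporting within its allowed interval; the class design and timing values are assumptions, not taken from any particular standard:

```python
# Sketch: a heartbeat monitor. A monitored task calls beat() periodically;
# check() reports the task as failed once no heartbeat has arrived in time.
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        self.last_beat = time.monotonic()

    def check(self) -> bool:
        """Return True while the monitored task is considered alive."""
        return (time.monotonic() - self.last_beat) <= self.timeout_s

monitor = HeartbeatMonitor(timeout_s=0.1)
monitor.beat()
time.sleep(0.2)                         # simulated task stall
print('task alive?', monitor.check())   # False -> raise an alarm or compensate
```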
76
0000 00101… Software Failure Detection (2)
• As an aid to determining fault detection possibilities, consider each member of the fault
value class against those of the fault timing classes for each failure mode:
• Correct timing: correct value – ✓ (no fault); subtle incorrect – potentially undetectable; coarse incorrect – value or semantics detection; omission – timeout
• Early: correct value – time-based detection; subtle incorrect – time-based detection; coarse incorrect – value, time or semantics detection; omission – timeout
• Late: correct value – time-based detection; subtle incorrect – time-based detection; coarse incorrect – value, time or semantics detection; omission – timeout
• Infinitely late: timeout for all value classes
77
0000 0101… Software Compensating Provisions
• Some examples of high-level compensating provisions are automatic reversion to safe states or
passive or degraded modes of operation where non-critical tasks are given lower priority or even
halted. The software may also force checks on the system state and resources (memory, stack,
processor availability etc), verify the command itself and provide operator warnings, before executing
critical functions. Safety-critical messages may be given the highest priority
• Note that, as opposed to hardware, adding identical 'redundant' modules may not necessarily reduce
the occurrence of failures, as the cause of a failure mode may be identical in each added module.
One method of redundant programming, 'n-version' programming, uses 'n' independently developed
modules, each employing a different design or methodology to achieve the same end result, so that
the versions are not susceptible to the same failure causes. Voting systems may operate on redundant
data or use defaults
• The response time required between fault or failure detection and the system's compensating
measures may also determine the higher-level software design, such as alternative task scheduling,
and may even force alternatives in processor selection.
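A minimal majority voter over independently written versions might look as follows; the versions and the seeded fault are purely illustrative assumptions:

```python
# Minimal sketch (the versions and the seeded fault are purely illustrative):
# a majority voter over independently written 'n-version' implementations.

from collections import Counter

def version_a(x): return x * x
def version_b(x): return x ** 2
def version_c(x): return x * x + 1 if x == 3 else x * x   # deliberately seeded fault

def vote(x, versions=(version_a, version_b, version_c), default=None):
    """Return the majority result, or the default when no majority exists."""
    results = [v(x) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(versions) // 2 else default

print(vote(3))   # 9  -- the faulty version_c is outvoted
print(vote(4))   # 16 -- all versions agree
```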
78
0000 0101… Software Fault Occurrence Numbers
• Only the qualitative methods (Probability of Occurrence or Occurrence Number) of failure rate
assessment are strictly applicable due to the difficulty (if not impossibility) of accurately estimating
software modules’ failure rates
• Estimates of Occurrence have traditionally been based on metrics such as the number of lines of
code, function points, or complexity, which can be interpreted as measures of possible fault rate:
– Lines of code is the simplest metric and perhaps returns the least confidence
– Function point estimates are derived from high-level concerns such as the number of inputs,
outputs, inquiries and files used, along with an estimate of program complexity. These have the
advantage of being able to allow an estimate to be derived prior to actual coding for system level
analysis
– Complexity metrics, such as Halstead's, use inputs such as the number of distinct operators and
operands used in a program and the total quantity of each, from which volume, difficulty and
effort factors are derived
– Many of these types of metrics are available from language-specific scanners.
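A hedged sketch of the Halstead-style factors follows; the counts are hypothetical, and real analyses would take them from a language-specific scanner:

```python
# Minimal sketch (hypothetical counts; real analyses take them from a
# language-specific scanner) of the commonly published Halstead factors.

import math

def halstead(n1, n2, N1, N2):
    """n1/n2: distinct operators/operands; N1/N2: total operators/operands."""
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)          # program 'size' in bits
    difficulty = (n1 / 2.0) * (N2 / n2)              # how error-prone to write
    effort = difficulty * volume                     # mental effort estimate
    return {"volume": volume, "difficulty": difficulty, "effort": effort}

print(halstead(n1=12, n2=20, N1=60, N2=85))
```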
79
0000 0101… Software Fault Occurrence Numbers (2)
• Occurrence values can be hard to estimate, and if historical failure rates can’t be
established, some analysis teams have found an arbitrary value of either 5 or 10
satisfactory
• The occurrence of failures for software represents the rate at which faults have been built into the
system, as well as the frequency of use of the particular state that invokes the particular failure
mode
• From this perspective, the severity number of the failure mode and the means of detection are
often considered more important for software.
Do you know the limits of exponentiation and factorial for your system?
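Taking the question above literally, a quick probe of the host numeric limits is easy to write; this particular sketch assumes Python (IEEE 754 doubles for floats, unbounded integers), and other languages will differ:

```python
# Quick probe of the host numeric limits (this sketch assumes Python, i.e.
# IEEE 754 doubles for floats and unbounded integers; other languages differ).

import math
import sys

print(sys.float_info.max)       # largest representable double, about 1.8e308
print(math.exp(700))            # still finite
try:
    math.exp(1000)              # exceeds the double range
except OverflowError as err:
    print("OverflowError:", err)

print(math.factorial(20))       # exact, because Python integers are unbounded
```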
80
Back to Real Numbers: Determining Criticality à la MIL-STD-1629A
• In determining a Probability of Occurrence for a failure mode according to the MIL-STD-1629
qualitative method where failure rates are not known, an inherent assumption is that the failure mode
effect is going to occur
• The quantitative method of MIL-STD-1629 instead uses a more structured numeric approach to
determine a Criticality Number. The calculation incorporates numeric failure rates and a 'conditional
probability of loss' as well as duty cycle. The failure mode Criticality Numbers are then summed to
determine the Criticality Number for each item
• A criticality matrix or plot that uses Item Criticality Numbers calculated in this way gives clearer
boundaries between items than one based on the coarser, banded Probability of Occurrence estimates
• Thus the quantitative Criticality Number approach is preferable to the qualitative Probability of
Occurrence method of MIL-STD-1629 when all the failure rates for a system are known or can be
estimated.
81
Criticality Number Components…
• The Criticality Number is calculated from:
– Failure Effect Probability (β or Beta). This is the probability of the effect or loss actually
happening, given that a failure has occurred. This is assessed in a qualitative manner according
to the likelihood of either actual loss, probable loss, possible loss, or no effect. The standard
provides representative numeric values
– Part Failure rate (λp or Lambda). The item operational failure rate is estimated from field or
library data, modified for the prospective environment and usage as required
– Operating time (t). This is the mission time factored by the duty cycle, expressed in hours or
cycles per mission.
– Failure Mode Ratio (α or Alpha). The FMR is the fraction which represents the failure rate at which
a particular failure mode will occur with respect to the failure rate of the whole item or function.
These proportions are assessed from test or analysis, or simply best judgement. The total of all
FMRs for an item is 1 or 100%, so that:

Failure Mode Failure Rate = α × Item Failure Rate
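A minimal sketch of this apportionment (the item failure rate and the mode ratios are assumed figures, and the mode names are illustrative only):

```python
# Minimal sketch (assumed item failure rate and mode ratios, illustrative
# mode names): apportioning an item failure rate to its failure modes via
# the Failure Mode Ratios (alpha), which must total 1.0.

item_failure_rate = 25e-6            # assumed failures per hour for the item

failure_mode_ratios = {              # assumed alphas from test, analysis or judgement
    "contacts resistive": 0.5,
    "contacts burnt":     0.3,
    "armature sticking":  0.2,
}

assert abs(sum(failure_mode_ratios.values()) - 1.0) < 1e-9, "alphas must sum to 1"

mode_failure_rates = {mode: alpha * item_failure_rate
                      for mode, alpha in failure_mode_ratios.items()}
print(mode_failure_rates)            # each mode's share of the item failure rate
```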
82
Criticality Number Calculation…
• The Criticality Number is built up in two steps:
– Failure Mode Criticality Number (Cm). This is calculated for each failure mode of an
item
– Item Criticality Number (Cr). The Item Criticality Number is the sum of the Failure
Mode Criticality Numbers for an item
• Failure Mode Criticality is calculated as follows:

Cm = β × α × λp × t

• The Item Criticality is then the sum of all the Failure Mode Criticalities for an item:

Cr = Σ (Cm)i , summed over the n failure modes i = 1 … n
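As a worked sketch of these two steps (all betas, alphas, failure rates, operating times and mode names below are assumptions, not taken from the standard):

```python
# Worked sketch (all betas, alphas, failure rates, times and mode names are
# assumed figures for illustration) of Cm = beta * alpha * lambda_p * t and
# Cr as the sum of the Cm values for the item.

failure_modes = [
    # beta: probability of loss given failure; alpha: failure mode ratio;
    # lambda_p: part failure rate per hour;    t: operating hours per mission
    {"mode": "contacts resistive", "beta": 0.5, "alpha": 0.5, "lambda_p": 25e-6, "t": 4.0},
    {"mode": "contacts burnt",     "beta": 1.0, "alpha": 0.3, "lambda_p": 25e-6, "t": 4.0},
    {"mode": "armature sticking",  "beta": 0.1, "alpha": 0.2, "lambda_p": 25e-6, "t": 4.0},
]

def mode_criticality(fm):
    """Failure Mode Criticality Number: Cm = beta * alpha * lambda_p * t."""
    return fm["beta"] * fm["alpha"] * fm["lambda_p"] * fm["t"]

item_criticality = sum(mode_criticality(fm) for fm in failure_modes)   # Cr
for fm in failure_modes:
    print(fm["mode"], mode_criticality(fm))
print("Item Criticality Cr:", item_criticality)
```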
83
A Final Word…
• As well as its potential life cycle cost impact, an inadequate FMECA can represent a lack of rigour
in the design of the mission and support systems. This lack of rigour may contribute to critical
failures occurring in service, and to circumstances in which those failures cannot be satisfactorily
addressed
• However, one oft-maligned design feature, the single point of failure (SPF), is a fact of
life, and may only be significant if the potential occurrence or severity are high. We live
in and use complex systems of SPFs every day without any problem by virtue of the
fact that the occurrence of failure is often very low
• Excessive redundancy can increase prime cost, operator skill requirements, operational
and maintenance costs without great benefit. Single-engine turbine aircraft are
becoming more popular in the aviation world for just these reasons
• Sometimes, though, it’s just real nice to have…
84

More Related Content

What's hot

Failure Mode Effect Analysis in Engineering Failures
Failure Mode Effect Analysis in Engineering FailuresFailure Mode Effect Analysis in Engineering Failures
Failure Mode Effect Analysis in Engineering FailuresPadmanabhan Krishnan
 
Preventive Maintenance System.pptx
Preventive Maintenance System.pptxPreventive Maintenance System.pptx
Preventive Maintenance System.pptxhabibulhoque7
 
Mechatronics principles.pptx
Mechatronics principles.pptxMechatronics principles.pptx
Mechatronics principles.pptxJacksonSaad
 
Preventive maintenance (vuthy ng)
Preventive maintenance (vuthy ng)Preventive maintenance (vuthy ng)
Preventive maintenance (vuthy ng)VUTHY NG
 
Preventive maintenance
Preventive maintenancePreventive maintenance
Preventive maintenancePramod A
 
Maintenance planning systems
Maintenance planning  systemsMaintenance planning  systems
Maintenance planning systemsAnil Mohindru
 
Maintenance Planning and Scheduling Maturity Matrix - #1 of 2
Maintenance Planning and Scheduling Maturity Matrix - #1 of 2Maintenance Planning and Scheduling Maturity Matrix - #1 of 2
Maintenance Planning and Scheduling Maturity Matrix - #1 of 2Ricky Smith CMRP, CMRT
 
ETAP - Coordination and protecion 2
ETAP -  Coordination and protecion 2ETAP -  Coordination and protecion 2
ETAP - Coordination and protecion 2Himmelstern
 
Vibration Monitoring-Vibration Transducers-Vibration Troubleshooting
Vibration Monitoring-Vibration Transducers-Vibration TroubleshootingVibration Monitoring-Vibration Transducers-Vibration Troubleshooting
Vibration Monitoring-Vibration Transducers-Vibration TroubleshootingDhanesh S
 
Introduction to Reliability Centered Maintenance
Introduction to Reliability Centered MaintenanceIntroduction to Reliability Centered Maintenance
Introduction to Reliability Centered MaintenanceDibyendu De
 
ETAP - curso protecciones etap
ETAP - curso protecciones etapETAP - curso protecciones etap
ETAP - curso protecciones etapHimmelstern
 
Maintenance management
Maintenance managementMaintenance management
Maintenance managementRajeev Sharan
 
Types of Maintenance
Types of Maintenance Types of Maintenance
Types of Maintenance Solitrend
 
Reliability engineering chapter-1csi
Reliability engineering chapter-1csiReliability engineering chapter-1csi
Reliability engineering chapter-1csiCharlton Inao
 
PREVENTIVE MAINTENANCE
PREVENTIVE MAINTENANCEPREVENTIVE MAINTENANCE
PREVENTIVE MAINTENANCEAnupriyaDurai
 

What's hot (20)

Sequential fault location
Sequential fault locationSequential fault location
Sequential fault location
 
Failure Mode Effect Analysis in Engineering Failures
Failure Mode Effect Analysis in Engineering FailuresFailure Mode Effect Analysis in Engineering Failures
Failure Mode Effect Analysis in Engineering Failures
 
Failure Modes and Effect Analysis (FMEA)
Failure Modes and Effect Analysis (FMEA)Failure Modes and Effect Analysis (FMEA)
Failure Modes and Effect Analysis (FMEA)
 
Preventive Maintenance System.pptx
Preventive Maintenance System.pptxPreventive Maintenance System.pptx
Preventive Maintenance System.pptx
 
Mechatronics principles.pptx
Mechatronics principles.pptxMechatronics principles.pptx
Mechatronics principles.pptx
 
Preventive maintenance (vuthy ng)
Preventive maintenance (vuthy ng)Preventive maintenance (vuthy ng)
Preventive maintenance (vuthy ng)
 
Preventive maintenance
Preventive maintenancePreventive maintenance
Preventive maintenance
 
PM PRESENTATION
PM PRESENTATIONPM PRESENTATION
PM PRESENTATION
 
Maintenance planning systems
Maintenance planning  systemsMaintenance planning  systems
Maintenance planning systems
 
Maintenance Planning and Scheduling Maturity Matrix - #1 of 2
Maintenance Planning and Scheduling Maturity Matrix - #1 of 2Maintenance Planning and Scheduling Maturity Matrix - #1 of 2
Maintenance Planning and Scheduling Maturity Matrix - #1 of 2
 
ETAP - Coordination and protecion 2
ETAP -  Coordination and protecion 2ETAP -  Coordination and protecion 2
ETAP - Coordination and protecion 2
 
Vibration Monitoring-Vibration Transducers-Vibration Troubleshooting
Vibration Monitoring-Vibration Transducers-Vibration TroubleshootingVibration Monitoring-Vibration Transducers-Vibration Troubleshooting
Vibration Monitoring-Vibration Transducers-Vibration Troubleshooting
 
Introduction to Reliability Centered Maintenance
Introduction to Reliability Centered MaintenanceIntroduction to Reliability Centered Maintenance
Introduction to Reliability Centered Maintenance
 
Rtos Concepts
Rtos ConceptsRtos Concepts
Rtos Concepts
 
ETAP - curso protecciones etap
ETAP - curso protecciones etapETAP - curso protecciones etap
ETAP - curso protecciones etap
 
Maintenance management
Maintenance managementMaintenance management
Maintenance management
 
Types of Maintenance
Types of Maintenance Types of Maintenance
Types of Maintenance
 
Reliability engineering chapter-1csi
Reliability engineering chapter-1csiReliability engineering chapter-1csi
Reliability engineering chapter-1csi
 
PREVENTIVE MAINTENANCE
PREVENTIVE MAINTENANCEPREVENTIVE MAINTENANCE
PREVENTIVE MAINTENANCE
 
1734 ob8 s
1734 ob8 s1734 ob8 s
1734 ob8 s
 

Similar to An Introduction to FMEA Analysis

Bob (ababs) Youssef FMEA Workshop Training at Hughes rev3
Bob (ababs) Youssef FMEA Workshop Training at Hughes rev3Bob (ababs) Youssef FMEA Workshop Training at Hughes rev3
Bob (ababs) Youssef FMEA Workshop Training at Hughes rev3Abbas (Bob) Youssef MBA, PhD
 
Innovation day 2013 2.5 joris vanderschrick (verhaert) - embedded system de...
Innovation day 2013   2.5 joris vanderschrick (verhaert) - embedded system de...Innovation day 2013   2.5 joris vanderschrick (verhaert) - embedded system de...
Innovation day 2013 2.5 joris vanderschrick (verhaert) - embedded system de...Verhaert Masters in Innovation
 
Sean carter dan_deans
Sean carter dan_deansSean carter dan_deans
Sean carter dan_deansNASAPMC
 
Failure Mode and Effects Analysis (FMEA) Specialist Certification.pdf
Failure Mode and Effects Analysis (FMEA) Specialist Certification.pdfFailure Mode and Effects Analysis (FMEA) Specialist Certification.pdf
Failure Mode and Effects Analysis (FMEA) Specialist Certification.pdfdemingcertificationa
 
Failure mode and effects analysis
Failure mode and effects analysisFailure mode and effects analysis
Failure mode and effects analysisDeep parmar
 
fmea-130116034507-phpapp01.pdf
fmea-130116034507-phpapp01.pdffmea-130116034507-phpapp01.pdf
fmea-130116034507-phpapp01.pdfRajendran C
 
FMEA failure-mode-and-effect-analysis_Occupational safety and health
FMEA failure-mode-and-effect-analysis_Occupational safety and healthFMEA failure-mode-and-effect-analysis_Occupational safety and health
FMEA failure-mode-and-effect-analysis_Occupational safety and healthJing Jing Cheng
 
Pwc systems-implementation-lessons-learned
Pwc systems-implementation-lessons-learnedPwc systems-implementation-lessons-learned
Pwc systems-implementation-lessons-learnedAvi Kumar
 
failure modes and effects analysis (fmea)
failure modes and effects analysis (fmea)failure modes and effects analysis (fmea)
failure modes and effects analysis (fmea)palanivendhan
 
IT 381_Chap_7.ppt
IT 381_Chap_7.pptIT 381_Chap_7.ppt
IT 381_Chap_7.pptRajendran C
 
Failure mode effects analysis, Computer Integrated Manufacturing, Quality fun...
Failure mode effects analysis, Computer Integrated Manufacturing, Quality fun...Failure mode effects analysis, Computer Integrated Manufacturing, Quality fun...
Failure mode effects analysis, Computer Integrated Manufacturing, Quality fun...SUNDHARAVADIVELR1
 

Similar to An Introduction to FMEA Analysis (20)

FMEA
FMEAFMEA
FMEA
 
Application Of FMECA
Application Of FMECA Application Of FMECA
Application Of FMECA
 
Fmea handbook
Fmea handbookFmea handbook
Fmea handbook
 
Bob (ababs) Youssef FMEA Workshop Training at Hughes rev3
Bob (ababs) Youssef FMEA Workshop Training at Hughes rev3Bob (ababs) Youssef FMEA Workshop Training at Hughes rev3
Bob (ababs) Youssef FMEA Workshop Training at Hughes rev3
 
FMEA
FMEAFMEA
FMEA
 
Fmea handout
Fmea handoutFmea handout
Fmea handout
 
Innovation day 2013 2.5 joris vanderschrick (verhaert) - embedded system de...
Innovation day 2013   2.5 joris vanderschrick (verhaert) - embedded system de...Innovation day 2013   2.5 joris vanderschrick (verhaert) - embedded system de...
Innovation day 2013 2.5 joris vanderschrick (verhaert) - embedded system de...
 
Sean carter dan_deans
Sean carter dan_deansSean carter dan_deans
Sean carter dan_deans
 
Failure Mode and Effects Analysis (FMEA) Specialist Certification.pdf
Failure Mode and Effects Analysis (FMEA) Specialist Certification.pdfFailure Mode and Effects Analysis (FMEA) Specialist Certification.pdf
Failure Mode and Effects Analysis (FMEA) Specialist Certification.pdf
 
Failure mode and effects analysis
Failure mode and effects analysisFailure mode and effects analysis
Failure mode and effects analysis
 
fmea-130116034507-phpapp01.pdf
fmea-130116034507-phpapp01.pdffmea-130116034507-phpapp01.pdf
fmea-130116034507-phpapp01.pdf
 
FMEA failure-mode-and-effect-analysis_Occupational safety and health
FMEA failure-mode-and-effect-analysis_Occupational safety and healthFMEA failure-mode-and-effect-analysis_Occupational safety and health
FMEA failure-mode-and-effect-analysis_Occupational safety and health
 
Pwc systems-implementation-lessons-learned
Pwc systems-implementation-lessons-learnedPwc systems-implementation-lessons-learned
Pwc systems-implementation-lessons-learned
 
Fmea
FmeaFmea
Fmea
 
It 381 chap 7
It 381 chap 7It 381 chap 7
It 381 chap 7
 
Fmea
FmeaFmea
Fmea
 
failure modes and effects analysis (fmea)
failure modes and effects analysis (fmea)failure modes and effects analysis (fmea)
failure modes and effects analysis (fmea)
 
FMEA
FMEAFMEA
FMEA
 
IT 381_Chap_7.ppt
IT 381_Chap_7.pptIT 381_Chap_7.ppt
IT 381_Chap_7.ppt
 
Failure mode effects analysis, Computer Integrated Manufacturing, Quality fun...
Failure mode effects analysis, Computer Integrated Manufacturing, Quality fun...Failure mode effects analysis, Computer Integrated Manufacturing, Quality fun...
Failure mode effects analysis, Computer Integrated Manufacturing, Quality fun...
 

Recently uploaded

Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 

Recently uploaded (20)

Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 

An Introduction to FMEA Analysis

  • 1. On the Nature of FMEA… … An introduction Mark Gerrand, BEng(Mech), MMaintReliabEng
  • 2. 2 What is FMEA? • A FMEA… – is the identification of the ways in which a system can fail and the consequences thereof – assists in the implementation of design and management corrective actions to minimize the occurrence and severity, and maximize the detection, of failure. FMEA is an acronym for Failure Modes and Effects Analysis.
  • 3. 3 What is FMEA? (2) • The term FMEA is often used interchangeably with Failure Modes Effects and Criticality Analysis (FMECA) although strictly speaking the latter represents a FMEA incorporating Criticality Analysis • The FMEA is sometimes also called a Fault Hazard Analysis. However, hazards tend to be interpreted as a class of safety-related problems, the correction or mitigation of which do not necessarily lead to a more efficient design, operation or maintenance of the system • The FMEA is therefore an assessment of risk in terms of: – Safety – Regulatory requirements – Customer satisfaction, i.e. ability of the mission system to meet its objectives.
  • 4. 4 What is FMEA? (3) • FMEAs are often related to the part of the system life cycle under consideration: – System FMEA – This examines the effects of failures of system functions. The concept or initial design FMEA is a higher-level system FMEA that is undertaken at the beginning of a system’s life cycle – Design FMEA – This is performed from the initial design or concept through to detailed design, and usually focuses on effects caused by failures of lower-level elements and assemblies – Production or Process FMEA – This is performed towards the end of the design phase to examine the impact of faulty processes and materiel associated with high- volume production on mission systems or subsystems • Other types of FMEA include the service FMEA. This can be applied to specific processes used by service providers and industries such as maintenance and even health care services. MIL-STD-1629 also describes a damage FMEA. However, there are many more types than are described above • This presentation focuses on the initial design and design FMEAs.
  • 5. 5 Why do FMEA? • The output of a FMEA can be used to establish: – A more efficacious system design through improvements in inherent: • Reliability, availability and maintainability • Safety – Safer operational limits such as duty cycle and operating environment – Identification of single points of failure and mission-critical items that can lead to: • Better maintenance strategies through improved maintenance task analysis, reliability centred maintenance and level of repair analysis • More highly optimised sparing – Operator and maintainer training design input – Operator and maintainer handbook input – System test and inspection plan input – Production process improvement.
  • 6. 6 Why do FMEA? (2) • Because it can assist in dealing with the knowable unknowns… – ‘Nature has established patterns originating in the turn of events, but only for the most part’ ~ Leibniz (German scientist, mathematician et al, 1646-1716) – Trends are difficult to ascertain when causative events may be random… but the detection and severity of outcomes may be controlled – Therefore, FMECA is a useful risk management tool that considers what happens when events do not follow measures of central tendency such as means, but rather are the ‘outliers of distributions’.
  • 7. 7 Why do FMEA? (3) • However… • If there is no management commitment to utilise the FMEA, especially during the design stage, then the FMEA is a meaningless paper exercise • It is a document that may be required at an accident board of inquiry or court to demonstrate that the system and support system design and design procedures were not negligent • This presentation does not make you a competent FMEA practitioner but may assist you in determining deficiencies in your knowledge so that you can seek professional help either before or after…
  • 8. 8 Why not do FMEA? • Funds aren’t available! FMEA is time consuming and resource intensive, utilising engineering and technical personnel from the design and design-related specialities such as integrated logistics support (ILS) and quality. The requisite system experts must be engaged so that the analysis is not trivial • From a safety perspective, FMEA is essential. However, a case may be made for omission with respect to commercial off-the-shelf (COTS) acquisition if all of the following are applicable: – The mission profiles, modes and phases are consistent with the Original Equipment Manufacturer (OEM) design intent – The intended maintenance regime is consistent with the OEM design intent – The operational environment is consistent with the OEM design intent – Design analysis shows that any proposed modifications or deviations from the original design will not affect the above considerations – Field failure data is available to prove, or has proven, the safety and efficacy of the original design • These criteria may not apply to projects consisting of integration of COTS equipment since the effects of component failure upon the system design may have to be explored. Just ask the ‘Ariane 5’ design team how!
  • 9. 9 Limitations of FMEA • The limitations generally do not outweigh the benefits of FMEA! Some common problems and limits are: – The failure modes must be able to be envisaged by the designers or analysts – Multiple-failure interactions may not be easily foreseen due to the item-by-item and function-by-function foci of analysis inherent in the process. Many-to-many relationships between cause, failure mode, effect and control can result in complexity of resolution – External influences on the system may be easily overlooked if they do not cause specific or obvious equipment failures. An example of this may be the long-term effect of ionising or gamma radiation at altitude – Similarly, external influences by the system, such as contaminating or polluting the environment, may be easily overlooked even when the system is considered to be operating properly. An example is the possible influence of a ferrous system unwittingly introduced to a magnetic test and degaussing range – Human factors are often overlooked, under the premise that the equipment is correctly operated by caring, sharing, well-trained operators and maintainers with appropriate levels of mechanical empathy.
  • 10. 10 Limitations of FMEA (2) • Several strategies can help overcomes these problems: – For highly safety-critical systems, other methods such as Fault Tree Analysis (FTA) are run back-to-back with FMEA to provide extra confidence that all conceivable hazards are considered – Laboratory investigation may be required as to possible failure modes. Thorough simulation or stimulation of the system may demonstrate unexpected dynamics, such as those encountered in control loops – A mixed-discipline review team active throughout FMECA development mitigates the possibility of error – Expert system and design engineering opinion must be sought when analysing complex systems. For example, FMEA is not ‘just an ILS thing’.
  • 11. 11 When are FMEAs done? Design Reviews Design Verification System Reliability & Design Verification Testing Subsystem Testing Component Testing Concept FMEA System FMEA Subsystem FMEA Component FMEA Requirements System Specifications Component Specifications SDR PDR DDR Initial design phase Design Complete SDR: System Design Review PDR: Preliminary Design Review DDR: Detailed Design Review
  • 12. 12 How to do a FMECA: Step 1… Select the Standard • In lieu of a contractual requirement, select the reference standard which will supply the most suitable baseline methodology for your system • Some references for FMEA are: 1. MIL-STD-1629A, ‘Procedures for Performing a Failure Mode, Effects and Criticality Analysis’ 2. SAE ARP-5580, ‘Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications’ 3. SAE J-1739, ‘Potential Failure Mode and Effects Analysis in Design (Design FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and Assembly Processes (Process FMEA), and Potential Failure Mode and Effects Analysis for Machinery (Machinery FMEA)’ 4. AS IEC 60812, ‘Analysis techniques for system reliability - Procedure for failure mode and effects analysis (FMEA)’ • It is important to work to a known standard to maintain consistent and recognised ratings definitions for severity (criticality), occurrence and detection.
  • 13. 13 Step 2… Establish the Analysis Framework (1) • Who are the facilitators, contributors and internal reviewers of the FMEA? – What are their core competencies and experience? – What is the relevance and level of authority of their input? – A team of 4 to 8 people is considered ideal for the system, or each subsystem, FMEA • What is the frequency and timing of reviews, updates and deliveries? – For example, draft deliveries prior to Design Reviews? – Note that the FMEA is an iterative, and often staged, process. Quote for the day: ‘A design review is like 100% inspection with a so-so gauge’
  • 14. 14 Step 2… Establish the Analysis Framework (2) • Who will be responsible for the means of processing and storing data and reporting results? – Documenting the methodology, assumptions and source data – Enabling use of MS Excel or a proprietary software package (noting that many such tools are merely documenting aids) – Developing data input and output templates – Developing configuration control, archival & backup procedures – Enabling shared directory access by different departments – Determining to whom will the FMEA be distributed – Ensuring open linkages to other design and production processes • Who in the management structure will approve the FMEA and have the capacity to authorise recommended actions? – The hardest part of all: management buy-in. Is the FMEA software certified or developed to a standard suitable for use in evaluating a safety-critical system?
  • 15. 15 Step 2… Establish the Analysis Framework (3) • At this point, the decisions and assumptions common to all the FMEA(s) should be considered • The criticality analysis may follow either of two paths, depending, in part, upon the availability of quantitative failure rate information for each line item: – Systems in initial design phases often do not have reliability estimates in the form of failure rates available – Many systems providers either do not have or will not provide ‘commercially sensitive’ failure rate information – Software failure rates will likely only ever be low-confidence estimates • These factors provoke a choice of either a: – Qualitative criticality analysis (i.e. for systems with high software content, and preliminary or unknown designs), or – Quantitative criticality analysis • A choice should be made to use one method only. This should be applicable to all the FMEAs in a system so that the results are directly comparable.
  • 16. 16 Step 3… Define the Problems Areas of Interest • A FMEA may have a particular need to address: – Safety of personnel, the system itself, and external systems with which the system interrelates – Regulatory compliance – Supplier design (or production) capability – Environment impact – Economic impact – Any aspect of RAM • Any or all of the above may apply for new systems, or new or problematic technology • This in turn may determine the system or subsystems that are selected for analysis • Some subsystems may be mission systems in their own right, such as flight simulators, due to their prime cost, complexity and support requirements.
  • 17. 17 Step 4… Define the System and Identify Subsystems • The overall system may be considered to be that which has a ‘fault free’ boundary, i.e. the mission system plus all other systems which can contribute to the mission system’s failure • Assumptions regarding external influences have to be documented, e.g. fire, flood, lightning, terrorism, malicious damage, reliability of commercial or government furnished equipment and interfaces • Identify the system for analysis, and determine whether analysis can be accommodated as a single entity, or broken into a number of major subsystems, each with its own FMEA • The boundaries for a system (and constituent subsystems) can be established through: – The hardware or geographic orientation, e.g. the north-east XYZ beacon transmitter – The functional service provided, e.g. training module, wide area network – The discipline, e.g. hydraulic services, software – Any arbitrary division that simplifies the number and type of interfaces across boundaries.
  • 18. 18 Step 5… Select the Type of Analysis for Each FMEA • A FMEA may be functional or element oriented, or any hybrid thereof: – The functional approach is usually used prior to detailed design information being available, and is usually undertaken in a ‘top down’, deductive manner: • Concept and system FMEAs are usually functionally oriented • System engineers provide significant input – The element or component approach is usually undertaken in a ‘bottom up’ inductive manner: • Design FMEAs are usually component oriented • Design engineers, as well as system engineers, provide significant input – The hybrid approach allows complex systems to be dealt with at high levels functionally, transitioning to constituent components at a lower level. This is particularly suitable where equipment or components directly relate to functionality • Although MIL-STD-1629A states that that standard only applies to hardware, the FMEA approach (even when that standard is used) is applicable to both hardware and software elements.
  • 19. 19 Step 6… Subdivide the System and Subsystems • For the purposes of analysis, the system or subsystem in each FMEA is then decomposed into manageable sections • The boundaries, if not physically evident, may be determined in the same manner that was suggested for defining the system and subsystem boundaries • This may take the form of a logical, physical or hybrid hierarchy: – System – Subsystems – Programs – Assemblies – Modules – Subassemblies – Subroutines – Procedures – Components or parts – Functions
  • 20. 20 Step 7… Selecting the Lowest Level of Analysis (1) • The lowest level for complete initial analysis must then be determined to finish the subdivision of the system subsystem – For many large systems, it is arbitrarily down to the lowest replaceable or line replacement unit (LRU). An LRU is any part replaced by the first or organisational level of maintenance – For small systems or assemblies, it may be down to individual piece parts or components • Software systems may analyse computer software components (CSCs) as the lowest level • For analysis of a system, assuming a particular lowest hierarchical level ensures full coverage. It ensures that no areas are overlooked, yet should not at such a fine degree of resolution that needlessly engages resources and produces excessively detailed and voluminous output. Save the trees – don’t sweat the small stuff – and avoid information overload! Caveat: Unless you have to…
  • 21. 21 Step 7… Selecting the Lowest Level of Analysis (2) • For this reason, performing an initial high-level system-wide assessment of criticality (possibly as a ‘generic’ FMEA) is useful – it culls, where justifiable, non-critical subsystems from further analysis while still allowing analysis of the more critical subsystems. For example, a data archival service may not be considered ‘mission critical’ to a primary service of providing online data • Analysis should only be performed to a depth beyond which no more information, useful from the top-level perspective and intentions of the FMEA, would be gained. Criticality and occurrence issues tend to be obvious. Detectability issues, particularly concerning ‘hidden failures’ are not necessarily so • For example, if the system stops flying, we don’t really care what specifically broke because hopefully we can quickly detect the failure cause at some level and compensate, or let the design compensate!
  • 22. 22 Step 7… Selecting the Lowest Level of Analysis (3) • Consider an electronics backplane FMEA: – Analysis of a 1/4W resistor in a particular printed circuit assembly gives the next higher failure mode as failure of the printed circuit assembly (PCA) itself – However, design analysis has shown that all conceivable failures are benign and failures of this nature are covered by PCA Built In Test Equipment (BITE) – In this particular case, no more useful information to the FMEA is provided by component analysis of the PCA than if failure modes for the PCA alone had been provided because it is the PCA’s diagnostic port that either reports failure or fails to report to the system • Selecting the lowest level of analysis should therefore be a dynamic rather than a prescribed process.
  • 23. 23 Step 8… Determining the Functions • Faults are an inability to function in the desired manner, or else operation in an undesired manner, regardless of the cause. A failure is an ongoing fault • Failure Modes are the particular manner in which a failure occurs for a given cause • Therefore, the first step for the subsequent determination of failure modes is to document all the required functions of each item • Functions should be able to be composed in a ‘verb-noun’ or ‘verb-phrase’ type expression, e.g. ‘Transmits - Turbine torque to main rotor gearbox’ • Sources of information for the required high-level functions will include the product specifications which should dictate regulatory as well as functional requirements. Fault/Failure Failure Mode Effect Function
  • 24. 24 Step 9… Determining the Failure Modes (1) • A failure mode (FM) can be described as an ‘anti-function’. Therefore, for every identified function that an element has to perform to support the system, there are one or more failure modes • Information sources for failure modes may be interpreted from: – Warranty, FRACAS, and failure history databases – Reliability Block Diagrams (RBDs) – Functional Block Diagrams (FBDs) – Boundary Diagrams – Interface Matrices – Engineering drawings, schematics, and bills of material – RIAC FMD-97 “Failure Mode/Mechanisms Distributions” which covers electrical, electronic, mechanical and electromechanical parts and assemblies • Some guidance on the construction of RBDs and FBDs is available in MIL-STD-756, ‘Reliability Modeling and Prediction’, as well as IEC 61078. How can it break, and what is the root cause?
  • 25. 25 Step 9… Determining the Failure Modes (2) • System functionality during all operational, idle, standby and storage phases and states should be considered • Some examples of operational mission phases may be taxi, take-off, departure, cruise, system deployment, holding, descent, approach, landing. Each of these phases can be broken down again if necessary • Failure modes of a system or subsystem can be caused by component faults or failures • The failure modes caused by possible software faults must not be omitted. For example, some modes of memory corruption can sometimes allow a program to continue execution, albeit with incorrect data in that memory • Define what is a failure carefully, e.g. operation of a safety device such as a circuit breaker may be a mitigating response at a local level though it may cause higher level system failure
  • 26. 26 Step 9… Determining the Failure Modes (3) • The distinction between states and modes is often arbitrary. As guidance: • A state is a functional condition or arrangement of operation that a system must be in order to perform certain functions – For example, the RS232 interface must have the Data Set Ready line in a high state prior to receiving data • A mode is a special state (i.e. a functional condition or arrangement) that implies that the state is extended over time rather than being transient, and there is a high likelihood that activities characteristic of the extended state will be carried out – For example, landing gear should be in the landing mode prior to landing. • As long as all the functions (what does this thing do?) in the respective states and modes (when does it have to do this?) are identified, then one doesn’t have to get particularly bitter and twisted over these kinds of definitions.
  • 27. 27 Step 9… Determining the Failure Modes (4) • Therefore, for each mission phase and mode in which the element or subsystem should be operable, the questions are asked: – What is the effect of the failure mode upon the desired function? • Complete failure? • Partial failure? (i.e. does less than intended in scope, amplitude or timing) • Intermittent failure (i.e. intermittently starts or stops)? • Over-functioning (i.e. does more than intended)? • Unintended functioning (i.e. does something else)? • Compliance failures, such as excessive exhaust emissions, often fall into the latter category.
  • 28. 28 Step 9… Determining the Failure Modes (5) • Failure modes are also often attributable to failure of an interface. Interfaces are empirically responsible for 50% of field failures • For each item in the FMEA, boundaries must be defined and interfaces through these boundaries identified. A ‘black box’ concept of the item may assist. Passive, static and non-loading bearing elements should all be considered. Eludium Q36 Explosive Modulator Energy transfer: Thermal, magnetic, electric, electrostatic, radiation, torque, pressure, power, force, fields, load, impulse etc Data transfer: Messaging, logic, flow control, status, data, alarm etc Material transfer: Liquids, solids, gases, plasma, dust, colloids etc Boundary
  • 29. 29 Step 9… Determining the Failure Modes (6) • Note that the speed, timing, quantity and quality of interface exchanges and any other constraints upon them may be significant • Relative movements and positions of assemblies may be important, e.g. a constraint on operation of a rolling-ball type computer mouse is that is must be in contact with a smooth, clean surface. An undesirable interface might be dust, which impedes mouse internal roller and ball interaction • Consideration must also be given to the human interaction and greater environment in which the system is operating, such as weather and seasonal changes, night/day, ionospheric state, magnetic dip etc. Human interaction may provoke faults in terms of incorrect or insufficient maintenance as well as operator error.
  • 30. 30 Step 9… Determining the Failure Modes (7) • As well as required interfaces, consideration should be given to undesired interfaces, such as contamination, noise, vibration, blockage, leakage, overload and overflow, and corruption, that may possibly occur • The effect of ageing may also give rise to possible failure modes for an element such as wear-out, erosion, misalignment, deterioration, fatigue, creep and corrosion • Dimensional (due to wear, deformation, fatigue, thermal effects, lubrication instability etc), as well as electronic, tolerance errors, imbalance or instability over time may be insidious. These errors may compound or ‘stack’ at a higher subsystem or system level.
  • 31. 31 Step 9… Determining the Failure Modes (8) • An example of a procedure to determine failure modes is: 1. Break complex components into subcomponents 2. Identify the functional contribution of each 3. Deduce possible failure modes for each function from comparable items: – Consider any possible physical, time or other stresses on the item that may cause failure – Are there any possible secondary or higher order failures that may also arise? 4. Associate causes to each failure mode – Note that some causes can provoke more than one function or component to fail that use the same interface or are exposed to the same conditions. These modes are termed common mode failures. As examples: • A broken drive belt may cause failure of an engine’s alternator, coolant pump, cooling fan, steering assist pump, air conditioning compressor, and air injection pump • Fuel contamination can cause multiple engine failure 5. Progressively record results on a cause-effect or ‘fishbone’ diagram.
  • 32. 32 Step 9… Determining the Failure Modes (9) • A fishbone diagram: – The part failure (‘Contactor failure’) is the local effect – A part failure may have many failure modes – A failure mode (such as ‘Contacts resistive’) may have one or more causes • At succeeding levels, a previous failure mode may now become a cause for a failure mode. For example: – An ‘Overcurrent’ cause results in a ‘Contacts burnt’ failure mode – ‘Contacts burnt is then a cause for ‘Contacts resistive’ and ‘Armature sticking’ failure modes. Contactor failure Terminals loose Solenoid burnout Overvoltage Duty cycle exceeded Solenoid opencircuit Winding failure Input failure Controller failure Connection failure Contacts burnt Armature sticking Hinge binding Contacts resistive Contacts dirty Loss of spring tension Overcurrent
  • 33. 33 Step 9… Determining the Failure Modes (10) • Sometimes analysis does not assist with determining all the failure modes • Determining unknown failure modes is often effectively maximised by ‘brainstorming’ the information collated through the preceding processes • Brainstorming is a non-procedural way of determining all possible alternatives in a non- critical manner. Brainstorming attempts to utilize the imaginative and creative faculty which is often circumvented by habitual logical or critical thought processes • Use ‘probing’ rather than ‘leading’ types of questions, and an ‘active’ listening style • Once all possible alternatives have been enumerated, then critical evaluation of each alternative can be performed, and not before all input from the brainstorming session has ceased. A number of short sessions with suitable personal review time intervening may be more productive than a single long session.
  • 34. 34 Step 10… Documenting the Failure Modes • Example subsystem: Differential Analyser Function Overload protection for Differential Analyser power supply Item Contactor CB1 Failure Modes and Causes 1. Contacts resistive: spring tension loss 2. Contacts resistive: contacts burnt 3. Contacts resistive: contacts dirty 4. … etc … Local Effect Contactor failure, current interrupted from main switchboard MS1 Next Higher Level Voltage drop in Analyser supply cct MS1-CB1 End Effect Failure of Differential Analyser through 240 VAC power supply drop out .
  • 35. 35 Step 11… Determining the Effects (1) • Once all the pertinent failure modes have been determined, the effects can be considered • Consider each of the system or subsystem elements, and the possible effect of each of the failure mode upon: – Personnel – Equipment – Downtime – Environment.
  • 36. 36 Step 11… Determining the Effects (2) • Effects are documented according to the level within which they operate: – Local effects. This concerns the effect upon the line item being analysed and associated items in its immediate environment. The local effect may be the failure mode – Next Higher Level. This is for the next higher functional level or assembly in the FMEA’s structure – End effects. These concern the effect on the top level of the FMEA concerned, i.e. the end effect for the system, subsystem, or component FMEA under consideration. At the highest level, this includes the end user as well as the system. No consideration of compensating provisions such as redundancy or safety devices is given.
  • 37. 37 Step 11… Determining the Effects (3) • End effects can percolate up to next higher level FMEAs as failure modes • They may or may not represent a critical failure in the higher level FMEA • Criticality is not necessarily inherited, and must be carefully interpreted from the relevant specifications • For example, • A critical failure in a training subsystem will cause failure of the training subsystem • A failure of the training subsystem may not necessarily cause a critical failure of the mission system, but will be considered as a failure mode in the mission system FMEA.
  • 38. 38 Step 11… Determining the Effects (4) Sim LRU Fail Sim Fail 1 Sim Assy Fail Sim Fail 2 Sim Mod Fail No Effect AC LRU Fail AC Fail 1 AC Assy Faii AC Fail 2 AC Mod Fail No Effect Simulator System FMEA Aircraft System FMEA End Effect is on the Aircraft AC Failure 1 Sys Fail AC Failure 2 Sys Fail Sim Failure 1 Degraded Sim Failure 2 Degraded Mission System FMEA End Effect is on the Simulator End Effect is on the Mission System Sim Failure Modes AC Failure Modes Subsystem FMEAs
  • 39. 39 Step 12… Determine the Failure Detection Method (1) • Detection is the mechanism by which a failure becomes apparent so that compensating action for continued operation, or else corrective maintenance, can be undertaken • Abnormal indications of any system fault from any built in test equipment or processes must be documented, such as ‘red arc’ or unusual behaviour of readings from instruments or gauges • Does the method allow isolation of the fault to some particular subsystem or element? • What are the indications of failure of the indicator itself? – Will watchdog timers or exception handlers provide warning messages or signals? – Can the readings be correlated with some other instrument or gauge? • For example, the indications of an aircraft’s failed attitude indicator will not cross- reference correctly with the rate of turn and climb/descent readings.
  • 40. 40 Step 12… Determine the Failure Detection Method (2) • If no alarm, instrument, or message provides notification, how does the operator know a failure within the system is occurring or has occurred? – What are qualitative signs in terms of specific auditory, visual, kinaesthetic, and olfactory inputs? – What is the diagnostic logic that the operator will use? • For example, a clutch overload may be evident from the burning smell, which may or may not be distinctive from burning wiring insulation. Further fault isolation may be important from a safety perspective such as risk of fire or the production of noxious gases • In some instances, use of operator controls will verify some aspect of system failure and permit some degree of fault isolation through the application of appropriate logic – For example, complete failure of an engine in a light twin-engined aircraft may be identified in the first instance by the requirement for a large rudder deflection to offset the asymmetric thrust, and the failure is confirmed by tentatively reducing throttle on the ‘dead’ engine – Another example is using circuit breakers or switches to determine which circuit may be overloaded or defective.
  • 41. 41 Step 12… Determine the Failure Detection Method (3) • Failures which may not be apparent during the course of normal operation until a primary failure occurs are known as hidden failures – An example of this is an automatic standby pump that fails to operate when a main hydraulic pump fails – Another example is failure of an uninterruptible power supply when the primary power supply is interrupted • Conversely, a standby pump may mask the failure of a main pump, or an uninterruptible power supply may mask an intermittent or poor quality primary power supply • The failures may only be discovered during the course of preventive maintenance activities • The indications and detection of failure of the second or tertiary backup systems as well as the primary system must also be documented.
  • 42. 42 Step 13… Determine the Compensating Provisions (1) • The Compensating Provisions section documents design and operator provisions for recovery • Design provisions describe how the system recovers itself or tolerates failure. Forms of this include: • Equipment redundancy and switching mechanisms, including online, standby, and offline types • Alternate modes or paths of operation – An example is automatic rerouting of network messages to bypass a failed carrier service • Increased tolerance to overload, including improved materials, derating, and less stressful environmental and operating limits • Acceptable or graceful degradation • Recovery actions such as function retry or software reloading.
  • 43. 43 Step 13… Determine the Compensating Provisions (2) • Operator provisions are those actions that an operator takes to overcome or mitigate a failure • Operator provisions may include bringing redundant devices online, or controlling the system in an other-than-normal fashion – An example of this would be load-shedding an electrical subsystem in the event of a generator failure with limited backup provision, by operating circuit breakers to isolate power to non-critical subsystems • Where there is a possibility of an incorrect operator (or maintainer) response, the consequences should also be documented • Changes in preventive maintenance frequency or intensity may reduce the possibility of failure.
  • 44. 44 Step 14… Determine the Severity (1) • Determination of the severity classification of a failure is performed with respect to the end effect on the functional requirement and any possible safety hazard:
– Failure: an event, including an interoperability event, in which an item does not perform as specified
– Non-mission Critical: item failure does not prevent the system from performing its mission
– Mission Critical: item failure prevents the system from performing its mission
– Non-safety Critical: item failure prevents the mission but does not threaten system or operator safety
– Safety Critical: item failure threatens system or operator safety
  • 45. 45 Step 14… Determine the Severity (2) • For MIL-STD-1629, Severity Classification falls into one of the following categories according to the worst potential consequences. Some alternative industry interpretations are also given to illustrate how these apply to commercial applications:
– Category I (Catastrophic): a failure results in major injury or death of personnel
– Category II (Critical): a failure results in minor injury to personnel, personnel exposure to harmful chemicals or radiation, a fire, or a release of chemicals into the environment
– Category III (Marginal): a failure results in a low-level exposure to personnel, or activates a facility alarm system
– Category IV (Minor): a failure results in minor system damage but does not cause injury to personnel, allow any kind of exposure to operational or service [i.e. maintenance] personnel, or allow any release of chemicals into the environment
  • 46. 46 Step 14… Determine the Severity (3) • The severity classifications that are unacceptable should be defined • Unacceptable levels are usually Levels I & II, and sometimes Level III • The levels of injury and exposure should also be defined. Some examples are: – Low-level exposure: less than 25% of the short-term exposure limits published for work, health and safety – Minor injury: a small burn, cut or pinch, or a light electrical shock, that can be handled by first aid and does not cause significant lost time – Major injury: requires medical attention other than first aid
  • 47. 47 Step 14… Determine the Severity (4) • Other methods determine a Severity Number, usually on a scale of 1 (most benign) to 10 (most severe), which later facilitates calculation of a Risk Priority Number for the failure mode • Whatever scale is used, it should preferably be traceable to a standard, and must be documented. Is it a ‘must work’ function?
  • 48. 48 Step 14… Determine the Severity (5) • It is advantageous to formulate ground rules for each analysis by which severity can more easily be evaluated. Some examples of these rules are: 1. For catastrophic hazards, dual component failures (items which are one-fault tolerant) are credible (i.e. could happen) 2. For catastrophic hazards, triple component failures (items with two-fault tolerance) are not credible (i.e. not likely to happen) 3. For critical hazards, single component failures are credible 4. For critical hazards, dual component failures are not credible 5. Mounting brackets, secondary structures, wiring and enclosures are generally not included in the analysis • Document any such ground rules with the other assumptions! • Some examples of Severity Numbering, mostly for consumer-oriented products such as for the automotive market, are on the next slide.
  • 49. 49 Step 14… Determine the Severity (6) (Each rating shows a general interpretation | an automotive fit/finish/squeak/rattle (FFSR) interpretation | a customer-impact interpretation)
1 – None: not apparent, no effect | None | Unlikely to be detected
2 – Very minor: not apparent, minor effect | FFSR noticed by < 25% of customers | 20% chance of a customer return
3 – Minor: nuisance to the customer | FFSR noticed by 50% of customers | 40% chance of a customer return
4 – Very low: lowered effectiveness | FFSR noticed by 75% of customers | 60% chance of a customer return
5 – Low: customer complaint | Comfort/convenience reduced | 80% chance of a customer return
6 – Moderate: potential ineffectiveness | Comfort/convenience item(s) inoperable | 100% chance of a customer return
7 – High: customer dissatisfaction | Reduced performance | Failure results in a customer complaint
8 – Very high: ineffective service or treatment | Loss of mission function | Failure results in a serious customer complaint
9 – Extremely high: regulatory non-compliance | Loss of safe operation or regulatory non-compliance, with warning | Failure results in non-compliance with statutory safety standards
10 – Dangerously high: injury or death | Loss of safe operation or regulatory non-compliance, without warning | Failure results in death
  • 50. 50 Step 15… The Risk Assessment Process (1) • At this point the FMEA is largely complete, with only the criticality or priority analysis remaining to rank the effects of failure modes and finish the risk assessment process • For the qualitative approach, there are two basic methods: 1. Determining and plotting a Probability of Occurrence, P(O), against the Severity Category for each failure mode using the MIL-STD-1629 definitions: • Level A – Frequent • Level B – Reasonably probable • Level C – Occasional • Level D – Remote • Level E – Extremely unlikely 2. Calculating a Risk Priority Number (RPN) using the Severity Number, Occurrence Number, and Detectability Number, and plotting the RPN against each failure mode • The Occurrence Number is selected from a predetermined table of values in the same manner as for the Severity Number. If failure rate data is available, the selection can be made quantitatively.
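As a minimal sketch of the first method, the short script below tallies failure modes into Severity Category versus Probability of Occurrence cells of a criticality matrix; the failure modes and ratings are purely illustrative, not taken from any particular system:

```python
# Minimal sketch: bin failure modes into a criticality matrix of Severity
# Category (I-IV) vs Probability of Occurrence (A-E, per MIL-STD-1629).
from collections import defaultdict

failure_modes = [   # (failure mode, severity category, probability of occurrence)
    ("hydraulic pump seal leak", "III", "B"),
    ("actuator jam",             "II",  "D"),
    ("pressure sensor drift",    "IV",  "C"),
    ("accumulator rupture",      "I",   "E"),
]

matrix = defaultdict(list)
for name, severity, probability in failure_modes:
    matrix[(severity, probability)].append(name)

# Print counts per cell; entries towards Severity I / Level A demand the most attention
for severity in ["I", "II", "III", "IV"]:
    row = "  ".join(f"{p}:{len(matrix[(severity, p)])}" for p in "ABCDE")
    print(f"Severity {severity:>3}: {row}")
```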
  • 51. 51 Step 15… The Risk Assessment Process (2)
1 – Remote (failure is unlikely): one occurrence in greater than 5 years, or less than 2 occurrences in 1 billion events (Cpk ≈ 2.00)
2 – Remote: one occurrence every 3 to 5 years, or 2 occurrences in 1 billion events (Cpk ≈ 2.00)
3 – Low (relatively few failures): one occurrence every 1 to 3 years, or 6 occurrences in 10 million events (Cpk ≈ 1.67)
4 – Low: one occurrence per year, or 6 occurrences in 100,000 events (Cpk ≈ 1.33)
5 – Low: one occurrence every 6 months to 1 year, or 1 occurrence in 10,000 events (Cpk ≈ 1.17)
6 – Moderate (occasional failures): one occurrence every 3 months, or 3 occurrences in 1,000 events (Cpk ≈ 1.00)
7 – Moderate: one occurrence every month, or 1 occurrence in 100 events (Cpk ≈ 0.83)
8 – High (repeated failures): one occurrence per week, or a probability of 5 occurrences in 100 events (Cpk ≈ 0.67)
9 – High: one occurrence every 3 to 4 days, or a probability of 3 occurrences in 10 events (Cpk ≈ 0.33)
10 – Very high (failure is almost inevitable): more than one occurrence per day, or a probability of more than 3 occurrences in 10 events (Cpk < 0.33)
  • 52. 52 Step 15… The Risk Assessment Process (3) • If a numeric value for the failure mode, e.g. a failure rate, is available, then the value can be converted to an occurrence per unit time (or per number of events), and the closest definition of Occurrence Number selected • The rate for a failure mode will have to be apportioned from the failure rate for the item. An example of failure mode ratio allocation for an engine block assembly is shown on the next slide • If no quantitative data is available, then the analyst simply makes a qualitative assessment from the table for the failure mode. Learn to distinguish between the truly impossible and the highly improbable… and then between those events that are unlikely but possible…
  • 53. 53 Step 15… The Risk Assessment Process (4) Example: allocating Occurrence Numbers for an engine block with a 100% duty cycle and an expected life of 4,000 operational hours (250 FPMH, i.e. 4,000 hours MTBF, in total):
– Failure of piston: 10% → 25 FPMH (40,000 hours MTBF) ~ a failure every 3 to 5 years → select 2 for Occurrence
– Failure of piston rings: 40% → 100 FPMH (10,000 hours MTBF) ~ a failure every 1 to 3 years → select 3 for Occurrence
– Failure of connecting rod & pin: 15% → 38 FPMH (26,667 hours MTBF) ~ a failure every 1 to 3 years → select 3 for Occurrence
– Failure of crankshaft: 5% → 13 FPMH (80,000 hours MTBF) ~ a failure in greater than 5 years → select 1 for Occurrence
– Failure of main or big end bearings: 30% → 75 FPMH (13,333 hours MTBF) ~ a failure every 1 to 3 years → select 3 for Occurrence
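The apportionment above can be scripted. A minimal Python sketch follows, assuming 8,760 operating hours per year at 100% duty cycle and the calendar-interval bins from the Occurrence table; only the coarse end of the bins is implemented, and the engine-block figures are reused as the example:

```python
# Minimal sketch: apportion an item failure rate (FPMH = failures per million
# hours) across failure modes and map each mode to an Occurrence Number via the
# calendar intervals in the earlier table. Assumes 100% duty cycle.

HOURS_PER_YEAR = 8760

def occurrence_from_mtbf(mtbf_hours: float) -> int:
    years_between_failures = mtbf_hours / HOURS_PER_YEAR
    if years_between_failures > 5:
        return 1      # one occurrence in greater than 5 years
    if years_between_failures > 3:
        return 2      # one occurrence every 3 to 5 years
    if years_between_failures > 1:
        return 3      # one occurrence every 1 to 3 years
    return 4          # roughly one per year or more often; refine from the table

def allocate(item_fpmh: float, mode_ratios: dict) -> dict:
    """Apportion the item failure rate and assign an Occurrence Number per mode."""
    results = {}
    for mode, ratio in mode_ratios.items():
        mode_fpmh = item_fpmh * ratio
        mtbf = 1e6 / mode_fpmh          # hours between failures for this mode
        results[mode] = (round(mode_fpmh), round(mtbf), occurrence_from_mtbf(mtbf))
    return results

# Engine block example from the slide: 250 FPMH total (4,000 hours MTBF).
# Note: connecting rod & pin sits right on the 3-year boundary (~3.04 years),
# which the slide rounds into the 1-to-3-year bin (Occurrence 3).
print(allocate(250, {"piston": 0.10, "piston rings": 0.40,
                     "connecting rod & pin": 0.15, "crankshaft": 0.05,
                     "main/big end bearings": 0.30}))
```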
  • 54. 54 Step 15… The Risk Assessment Process (5) • The last step in the determination of an RPN is selection of a Detectability Number • This represents the probability that a cause or mechanism and potential failure mode will be detected • The Detectability Number may reflect the BITE design capability as well as the expected operator competence in fault detection and isolation:
1 – Almost certain, 2 – Very high, 3 – High, 4 – Moderately high, 5 – Moderate, 6 – Low, 7 – Very low, 8 – Remote, 9 – Very remote, 10 – Absolute uncertainty
  • 55. 55 Step 16… Calculate the Risk Priority Number • RPN = Severity Number (1 to 10) × Occurrence Number (1 to 10) × Detectability Number (1 to 10) • The RPN is therefore an integer with a range from 1 (most benign) to 1,000 (most serious).
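A minimal helper for this calculation might look like the following sketch; the example ratings are illustrative:

```python
def rpn(severity: int, occurrence: int, detectability: int) -> int:
    """Risk Priority Number: the product of the three 1-10 ratings (range 1-1,000)."""
    for rating in (severity, occurrence, detectability):
        if not 1 <= rating <= 10:
            raise ValueError("each rating must be between 1 and 10")
    return severity * occurrence * detectability

print(rpn(severity=7, occurrence=3, detectability=4))   # 84
```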
  • 56. 56 Step 17… Using the RPN (1) • A ‘risk threshold’ value is determined, as well as a criticality number, beyond which action must be taken • High RPNs can be clearly identified by: – Plotting them against each failure mode so that they are readily identifiable – Using Pareto analysis to segregate the highest RPNs, for example identifying the top 20% of risk • High RPNs are actioned by reducing the occurrence or severity, or increasing the detectability, of failure modes. Don’t select high thresholds to get the ‘right’ answers!
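A sketch of the Pareto screening is shown below; the failure modes, ratings and threshold are illustrative only, and any real threshold should be agreed before the analysis starts:

```python
# Minimal sketch: rank failure modes by RPN and flag the top 20% (Pareto cut)
# plus anything above an agreed risk threshold. Data is illustrative.
import math

failure_modes = {              # name -> (severity, occurrence, detectability)
    "seal leak":          (7, 4, 3),
    "bearing seizure":    (9, 2, 5),
    "sensor drift":       (4, 6, 7),
    "connector fretting": (5, 5, 4),
    "housing crack":      (8, 2, 2),
}
RISK_THRESHOLD = 120           # project-specific; agree it before the analysis

ranked = sorted(failure_modes.items(),
                key=lambda kv: kv[1][0] * kv[1][1] * kv[1][2], reverse=True)
top_n = math.ceil(0.2 * len(ranked))   # Pareto cut: top 20% of modes

for i, (mode, (s, o, d)) in enumerate(ranked):
    score = s * o * d
    flag = "ACTION" if i < top_n or score > RISK_THRESHOLD else ""
    print(f"{mode:20s} RPN={score:4d} {flag}")
```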
  • 57. 57 Step 17… Using the RPN (2) [Diagram: for each failure mode of a device, controls acting on the cause reduce the Occurrence, detection mechanisms improve the Detectability, and controls that mitigate the effect reduce the Severity.]
  • 58. 58 Step 17… Using the RPN (3)
– Occurrence Number – avoid or eliminate failure causes: this is usually the first item to be changed. Improvements in material, design, testing (find and fix), operating environment, operating limits, maintenance etc reduce the frequency of occurrence by addressing the cause of failure
– Severity Number – eliminate or reduce the consequences of failure: change will generally require system redesign to alter the effects of failure. An example could be adding turbine engine containment shielding to protect against rotating part failure
– Detectability Number – identify or detect the failure earlier: changing this item is usually the least preferred action to improve the RPN, as changes in the Occurrence and Severity will often have a more desirable outcome, i.e. more influence on the cause and the result
  • 59. 59 Step 17… Using the RPN (4) • All significant and critical items (e.g. those possessing failure modes resulting in end effects equivalent to Severity Category I and II definitions) must have some form of recommended action • Recommended actions must be detailed, potentially effective and executable • If the current level of analysis is too coarse to resolve significant and critical items, then lower-level FMEAs can be undertaken for these items • The responsibility and completion date for recommended actions to improve the RPN must be documented in the FMEA • The final, acceptable RPN after recommended actions have been completed is then entered into the FMEA worksheet. Otherwise, the process iterates.
  • 60. 60 Step 18… Costing the Risk (1) • As well as the RPN metric, a frequency-cost metric can also be formulated to augment the analysis: annual cost risk = failure mode frequency (occurrences per year) × cost per failure • This can be plotted in conjunction with the RPN for each failure mode to give a bigger picture.
  • 61. 61 Step 18… Costing the Risk (2) • The cost risk per annum of a particular failure mode can be given a number in the same manner as a Severity Number, as per the following table • The actual values used will depend on the impact of failure upon the particular system. These costs are project specific!
1 – Insignificant: < $100 PA
2 – Extremely low: > $500 PA
3 – Low: > $1,000 PA
4 – Low-medium: > $5,000 PA
5 – Medium: > $10,000 PA
6 – Medium-high: > $50,000 PA
7 – High: > $100,000 PA
8 – Very high: > $500,000 PA
9 – Extremely high: > $1,000,000 PA
10 – Disastrous: > $10,000,000 PA
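As a minimal sketch, the annual cost risk and its cost number could be derived as follows; the frequency, repair cost and band values are illustrative and would be replaced by project-specific figures:

```python
# Minimal sketch: annual cost risk for a failure mode and its cost number,
# using the per-annum thresholds from the table above. Values are illustrative.

COST_BANDS = [  # (lower bound in $ per annum, cost number), highest band first
    (10_000_000, 10), (1_000_000, 9), (500_000, 8), (100_000, 7),
    (50_000, 6), (10_000, 5), (5_000, 4), (1_000, 3), (500, 2), (0, 1),
]

def cost_number(frequency_per_year: float, cost_per_failure: float):
    """Annual cost risk = frequency x cost per failure, then look up the band."""
    annual_cost = frequency_per_year * cost_per_failure
    for lower_bound, number in COST_BANDS:
        if annual_cost > lower_bound or lower_bound == 0:
            return annual_cost, number

# e.g. a failure mode expected twice a year at $30,000 per repair plus downtime
print(cost_number(2.0, 30_000))   # (60000.0, 6)
```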
  • 62. 62 0000 0101… Some Software FMEA Considerations • Software can, and should, be considered within a system FMEA, or as a stand-alone FMEA, as it inevitably forms a management, control and diagnostic part of any system that does not rely solely on fires and steam. It falls into the highly recommended category for full safety design effort, and provides a medium benefit-to-cost rating according to NASA • Software FMEA is only practicable at a functional level – it’s a virtual world out there • Software modules, per se, do not fail or wear out – they only display the incorrect behaviour designed into them: – Software is nearly always delivered broken… but no-one knows exactly how much – The only thing that software testing really proves is that one more bug was found – Every software system is considered to contain faults which may lead to functional failure under particular triggering conditions – Failure modes for all possible dynamic conditions may be impossible to predict, or even test for by simulation, so the emphasis is on correct design and defensive strategies. The end of a software program is a declaration – not a fact…
  • 63. 63 0000 0101… Some More Software FMEA Considerations • Therefore, analysis of software can only look for likely, rather than known, failure modes: – Analysis may, however, determine the measures that need to be taken to prevent or mitigate the occurrence and consequences of failure by specifying design and coding rules, and review, inspection and testing requirements for different code functions – One of the most useful outputs, given a rigorous development environment, is then a list of test cases, which may be used to develop specific scenarios and stressors that may trigger each potential failure mode – Possible failure modes must consider any combination of inputs, timing and operating modes that could produce fault conditions through programming or compiler errors. Beware of COTS SOUP ~ Software Of Uncertain Pedigree
  • 64. 64 0000 0101… Types of Software FMEA • Software FMEA assists in identifying structural weaknesses in the design, and helps reveal weak or missing requirements and latent software non-conformances • There are two basic stages in software FMEA development: – System software FMEA. This is performed as early as possible in the design phase, usually as soon as the architecture has been developed and system functions are well-defined and understood. This FMEA is used to evaluate the effectiveness of the top-level software architecture and basic design protection of the system – Detailed software FMEA. This is performed during the software detailed design and coding stage of development. This FMEA validates that the software has been constructed to achieve the specified mission and safety requirements. This FMEA is sometimes undertaken to examine minimal cut sets from an initial Fault Tree Analysis that is performed instead of a system-level FMEA.
  • 65. 65 0000 0101… Scope of Software FMEA • All the software encountered in a project needs to be considered as a candidate for FMEA, depending on the criticality of the application. As well as the application software, this may include other COTS components, depending on whether or not they are certified to the required standard, such as: – The system kernel, e.g. boot and initialisation, the basic input output system – System services, e.g. file and device input/output – System and third-party library functions, and don’t forget the shrinkwrap, wrappers or glueware – Development and support software, e.g. compilers, linkers, debuggers, development tools – Embedded read-only memory software, including programmable logic – Test software • A decision must be made as to which hardware errors will directly influence software execution and which affect software interfaces, so that these can be considered as causes of possible software failure modes: – Execution: central processor unit and memory failures, e.g. arithmetic logic unit, registers, random access memory – Interfaces: peripheral failures, e.g. input/output ports, analog/digital converters, watchdogs, interrupt managers and timers that are implemented in hardware • Just as for hardware, setting boundaries and assumptions for ‘good behaviour’ is required.
  • 66. 66 0000 0101… Software System FMEA • Unintended system function due to software failure must be avoided in safety-critical systems! The FMEA will assist in identifying means of mitigating potential failures and testing for faults • Initial safety requirements may come from: – System specifications – Regulatory compliance requirements – Preliminary hazard analysis, which provides a matrix of potential hazards and hazard states – Hazard testing results, which provide the required fault response times of the system • A criticality, or potential risk, level can then be assessed for each software function determined in the software architecture, which at this stage is usually a collection of computer software configuration items (CSCIs).
  • 67. 67 0000 0101… Software System Failure Modes • The most common failure modes for functions are: – Failure to execute – Incomplete execution – Execution with incorrect timing, which includes incorrect activation and execution time (including an endless loop) – Erroneous execution • Two extra software failure modes specifically for interrupt service routines are: – Failure to return (blocking lower-priority interrupts from executing) – Returning an incorrect priority.
  • 68. 68 0000 0101… Software Failure Causes • There may be any number of causes that give rise to these failure modes, but some general areas to consider are: – Computational – Logic – Data I/O – Data handling – Interface – Data definition – Database.
  • 69. 69 0000 0101… Software System Effects • Some system-level effects of failure may then be: – The operating system halts or stops – The program stops, with a clear error message – The program stops, without a clear error message – The program runs, producing obviously incorrect or ‘coarse incorrect’ results or unintended functioning (including running too early or late) – The program runs, producing ‘subtle incorrect’ results, i.e. apparently correct, but actually incorrect, results • The effects of the failure modes relevant to each functional subroutine are then assessed for potentially hazardous outcomes • From the effects, a severity can then be determined • The effects can be grouped into fault classes: – Service provision fault: omission / commission – Service timing fault: early / late – Service value fault: coarse / subtle incorrect
  • 70. 70 0000 0101… Software Detailed FMEA • The detailed FMEA can be applied to all, or only the higher-risk, modules by: – Tracing potential failures in variables and their inputs (hardware or other routines) – Tracing processing logic through the software to determine the effect of the failure • A procedure to trace potential failures is to: – Create a map between all input, output, local, and global variables and their corresponding routines – Develop failure modes for the data and variables. The variable failures (modes for input, effects for output) are based on the variable type, i.e. the allowable type of data for the variable – Develop failure modes for the events and processing logic. This involves determining what negative effect the operators may have, as well as dormant logic errors • Note that any memory location, including buses and registers, that does not have integrity protection such as parity can be corrupted during operation • At the completion of a detailed FMEA, a mapping should exist from the top-level potential hazards to the top-level critical variables.
  • 71. 71 0000 0101… Variable Failure Modes (1)
– Analog or continuous variable: high (e.g. > 5.0); low (e.g. < 0.0)
– Boolean or logical variable: True when it should be False; False when it should be True
– Enumerated or list variable (e.g. A, B, C): A when it should be B; A when it should be C; B when it should be A; B when it should be C; C when it should be A; C when it should be B
  • 72. 72 0000 0101… Variable Failure Modes (2) • Variables and data can also be analysed as having any of the following failure modes: – Missing data, e.g. lost message, analog/digital conversion too granular, no read, write or update – Incorrect data. These can also be categorised as ‘subtle incorrect’ or ‘coarse incorrect’, e.g. stuck, incorrect-in-range, out of range, inconsistent, bad command – Timing of data: data arrives too soon or too late (obsolete), e.g. data race or collision, sampling rates too high or too low – Extra data, e.g. data redundancy, incorrect message format, unwanted read, write or update • Consideration of these failure modes leads to an expansion of the previous variable map table, as required, as shown on the next slide. Data tolerances for real data may cause threshold logic problems in dual redundant path processing
  • 73. 73 0000 0101… Variable Failure Modes (3) Example entry – Variable: ArrayVar1; Subroutine: Program1; Type: Double Precision
– Omission (no read, write or update) → local effect: program failure
– Commission (unwanted read, write or update) → local effect: wasted processor cycles during I/O time
– Early timing → local effect: nil, if no attempt is made to reload the original value within the timing cycle
– Late timing → local effect: use of obsolete data
– Subtle incorrect (value -0.00001 to +0.5) → local effect: possible error from use of a stuck or incorrect-in-range value
– Coarse incorrect (value +0.999E5 to +1.0E360) → local effect: error from inconsistent or out-of-range data
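A minimal data-structure sketch of the variable map described on the previous slides is shown below; the variable name, routine and ranges are illustrative, and the generic modes follow the earlier list:

```python
# Minimal sketch of a variable-to-routine map for a detailed software FMEA,
# enumerating failure modes by variable type. Names and ranges are illustrative.
from __future__ import annotations
from dataclasses import dataclass

GENERIC_MODES = ["omission (no read/write/update)",
                 "commission (unwanted read/write/update)",
                 "early timing", "late timing (obsolete data)"]

@dataclass
class VariableEntry:
    name: str
    routine: str
    var_type: str                      # e.g. "double", "bool", "enum"
    valid_range: tuple | None = None   # for continuous variables

    def failure_modes(self) -> list[str]:
        """Generic modes plus type-specific value failure modes."""
        modes = list(GENERIC_MODES)
        if self.var_type == "double" and self.valid_range:
            lo, hi = self.valid_range
            modes += [f"subtle incorrect (in range {lo} to {hi} but wrong)",
                      f"coarse incorrect (outside {lo} to {hi})"]
        elif self.var_type == "bool":
            modes += ["True when it should be False", "False when it should be True"]
        return modes

entry = VariableEntry("ArrayVar1", "Program1", "double", valid_range=(-0.00001, 0.5))
for mode in entry.failure_modes():
    print(f"{entry.name} ({entry.routine}): {mode}")
```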
  • 74. 74 0000 0101… Logical Failure Modes • Individual processing logic can also be examined for failure modes such as: – Memory address errors, e.g. module coupling, array boundaries, and stale pointers – Language usage, e.g. variable declaration, possible underflow or overflow, typing and initialisation – Omitted event, i.e. an event does not take place but execution continues – Committed event, i.e. an event takes place but shouldn’t have – Incorrect logic, e.g. preconditions are inaccurate, an event does not implement intent, non-convergent algorithms, instability, endless loops, premature returns – Timing or order, e.g. an event occurs too early, too late, or in the wrong order • FMEA for logic and events can be tabulated in a similar manner to the variable or data table. What does the software do when the user times out?
  • 75. 75 0000 0101… Software Failure Detection • Some examples of system failures are stopping, slow or incorrect responses, and startup failure. These types of failures are often readily detected by the operator observing the system’s lack of, or improper, functionality. Although self-evident, the actual means of detection should be supplemented with appropriate alarms or messages, including a means of system or operator compensation • Memory errors that may require detection are stack overflow or corruption, memory leakage or exhaustion, and the behaviour of garbage collection functions in real-time systems, which may need special checks • The detection of failures may use software watchdog timers or task heartbeat monitors, message sequence numbers, duplicated message checks (noting that this may be a purposeful strategy for error-checking critical commands), software interrupts, input trend analysis, etc. What is the effect on synchronisation of a distributed system when processor failures occur?
  • 76. 76 0000 0101… Software Failure Detection (2) • As an aid to determining fault detection possibilities, consider each member of the fault value class against those of the fault timing classes for each failure mode:
– Correct timing: correct value → no fault (✓); subtle incorrect → potentially undetectable; coarse incorrect → value/semantics detection; omission → timeout
– Early: correct value → time-based detection; subtle incorrect → time-based detection; coarse incorrect → value/time/semantics detection; omission → timeout
– Late: correct value → time-based detection; subtle incorrect → time-based detection; coarse incorrect → value/time/semantics detection; omission → timeout
– Infinitely late: timeout for every value class
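One of the detection mechanisms mentioned above, a task heartbeat monitor, can be sketched minimally as follows; the task names and timeout value are illustrative:

```python
# Minimal sketch of a task heartbeat monitor: each monitored task 'beats'
# periodically, and the monitor flags a timeout (omission / infinitely-late
# service) when a beat is overdue. Task names and timeout are illustrative.
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_beat = {}            # task name -> time of last heartbeat

    def beat(self, task):
        """Called by each monitored task once per cycle."""
        self.last_beat[task] = time.monotonic()

    def overdue_tasks(self):
        """Tasks whose last heartbeat is older than the timeout."""
        now = time.monotonic()
        return [task for task, t in self.last_beat.items()
                if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=0.5)
monitor.beat("nav_filter")
monitor.beat("display_update")
time.sleep(0.6)                  # simulate a stalled cycle
monitor.beat("display_update")   # only one task beats again
print(monitor.overdue_tasks())   # ['nav_filter'] -> raise alarm / compensate
```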
  • 77. 77 0000 0101… Software Compensating Provisions • Some examples of high-level compensating provisions are automatic reversion to safe states, or to passive or degraded modes of operation where non-critical tasks are given lower priority or even halted. The software may also force checks on the system state and resources (memory, stack, processor availability etc), verify the command itself and provide operator warnings before executing critical functions. Safety-critical messages may be given the highest priority • Note that, as opposed to hardware, adding identical ‘redundant’ modules may not necessarily reduce the occurrence of failures, as the cause of a failure mode may be identical in each added module. One method of redundant programming uses ‘n’ versions of a module, each developed with a different design or methodology to achieve the same end result, so that the versions are not all susceptible to the same failure causes. Voting systems may operate on the redundant data or use defaults • The required software fault response time for system compensating measures as a result of fault or failure detection may also determine the higher-level software design, such as alternative task scheduling. It may also force alternatives in processor selection.
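A minimal sketch of majority voting over diverse versions is shown below; the three ‘versions’ are trivial stand-ins (one with a seeded fault) rather than genuinely independent implementations:

```python
# Minimal sketch of majority voting over N independently developed versions of
# the same function (N-version programming). Real systems would run genuinely
# diverse implementations; version_c here carries a deliberately seeded fault.
from collections import Counter

def version_a(x: float) -> float: return round(x * 9 / 5 + 32, 3)
def version_b(x: float) -> float: return round(x * 1.8 + 32, 3)
def version_c(x: float) -> float: return round(x + 32, 3)      # seeded fault

def vote(value: float, versions, quorum: int = 2):
    """Return the majority result, or None (revert to a safe state) if no quorum."""
    results = Counter(v(value) for v in versions)
    answer, count = results.most_common(1)[0]
    return answer if count >= quorum else None

print(vote(100.0, [version_a, version_b, version_c]))   # 212.0 (2-of-3 agree)
```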
  • 78. 78 0000 0101… Software Fault Occurrence Numbers • Only the qualitative methods (Probability of Occurrence or Occurrence Number) of failure rate assessment are strictly applicable, due to the difficulty (if not impossibility) of accurately estimating software module failure rates • Estimates of Occurrence have traditionally been based on metrics such as the number of lines of code, function points, or complexity, which can be interpreted as measures of the possible fault rate: – Lines of code is the simplest metric and perhaps returns the least confidence – Function point estimates are derived from high-level concerns such as the number of inputs, outputs, inquiries and files used, along with an estimate of program complexity. These have the advantage of allowing an estimate to be derived prior to actual coding for system-level analysis – Complexity metrics, such as that of Halstead, use inputs such as the number of operands and operators available in a language, the actual quantity of these used, and volume, difficulty and effort factors related to a program – Many of these types of metrics are available from language-specific scanners.
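As a minimal sketch of the Halstead approach, the standard volume, difficulty and effort measures can be computed from operator and operand counts; the token counts below are illustrative and would normally come from a language-specific scanner:

```python
# Minimal sketch of the standard Halstead complexity measures from operator and
# operand counts. The token counts in the example call are illustrative.
import math

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """n1/n2: distinct operators/operands; N1/N2: total operators/operands."""
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    effort = difficulty * volume
    return {"volume": round(volume, 1),
            "difficulty": round(difficulty, 1),
            "effort": round(effort, 1)}

print(halstead(n1=12, n2=18, N1=70, N2=95))
```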
  • 79. 79 0000 0101… Software Fault Occurrence Numbers (2) • Occurrence values can be hard to estimate, and if historical failure rates can’t be established, some analysis teams have found an arbitrary value of either 5 or 10 satisfactory • The occurrence of failures for software represents the rate at which faults have been built into the system, as well as the frequency of use of the particular state that invokes the particular failure mode • From this perspective, greater emphasis is often placed on the Severity Number of the failure mode and the means of detection for software. Do you know the limits of exponentiation and factorial for your system?
  • 80. 80 Back to Real Numbers: Determining Criticality a la ‘1629A • In determining a Probability of Occurrence for a failure mode according to the MIL-STD-1629 qualitative method where failure rates are not known, an inherent assumption is that the failure mode effect is going to occur • The quantitative method of MIL-STD-1629 instead uses a more structured numeric approach to determine a Criticality Number. The calculation incorporates numeric failure rates and a ‘conditional probability of loss’ as well as duty cycle. The failure mode Criticality Numbers are then summed to determine the Criticality Number for each item • A criticality matrix or plot that uses an Item Criticality Number calculated in this way gives clearer boundaries between items than the coarser, estimated Probability of Occurrence • Thus the quantitative Criticality Number approach is preferable to the qualitative Probability of Occurrence method of MIL-STD-1629 when all the failure rates for a system are known or can be estimated.
  • 81. 81 Criticality Number Components… • The Criticality Number is calculated from: – Failure Effect Probability (β or Beta). This is the probability of the effect or loss actually happening, given that a failure has occurred. It is assessed qualitatively according to the likelihood of actual loss, probable loss, possible loss, or no effect; the standard provides representative numeric values – Part Failure Rate (λp or Lambda). The item operational failure rate is estimated from field or library data, modified for the prospective environment and usage as required – Failure Mode Ratio (α or Alpha). The FMR is the fraction of the item (or function) failure rate attributable to a particular failure mode, i.e. α = failure mode failure rate ÷ item failure rate. These proportions are assessed from test or analysis, or simply best judgement. The total of all FMRs for an item is 1, or 100% – Operating time (t). This is the mission time factored by the duty cycle, in hours or cycles per mission.
  • 82. 82 Criticality Number Calculation… • The Criticality Number is built up in two steps: – Failure Mode Criticality Number (Cm). This is calculated for each failure mode of an item – Item Criticality Number (Cr). The Item Criticality Number is the sum of the Failure Mode Criticality Numbers for an item • Failure Mode Criticality is calculated as: Cm = β × α × λp × t • The Item Criticality is then the sum of all the Failure Mode Criticalities for an item: Cr = Σ (Cm)i, summed over i = 1 to n failure modes
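A minimal sketch of the calculation follows; the β and α values, failure rate and mission time are illustrative, with the α values summing to 1 as the standard requires:

```python
# Minimal sketch of the MIL-STD-1629 quantitative criticality calculation:
# Cm = beta * alpha * lambda_p * t per failure mode, and Cr = sum of the Cm
# values for the item. The beta/alpha values below are illustrative only.

def failure_mode_criticality(beta: float, alpha: float,
                             lambda_p: float, t: float) -> float:
    """Cm = failure effect probability x failure mode ratio x part failure rate x operating time."""
    return beta * alpha * lambda_p * t

lambda_p = 250e-6      # part failure rate: 250 failures per million hours
t = 10.0               # operating hours per mission (mission time x duty cycle)
modes = [              # (failure mode, alpha, beta) -- alphas sum to 1.0
    ("piston seizure",      0.10, 1.0),   # actual loss
    ("ring wear",           0.40, 0.1),   # possible loss
    ("bearing failure",     0.30, 0.5),   # probable loss
    ("crankshaft fracture", 0.20, 1.0),   # actual loss
]

cm_values = {name: failure_mode_criticality(beta, alpha, lambda_p, t)
             for name, alpha, beta in modes}
cr = sum(cm_values.values())              # item criticality number
print(cm_values, f"Cr = {cr:.2e}")
```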
  • 83. 83 A Final Word… • As well as its potential life cycle cost impact, an inadequate FMECA can represent a lack of rigour in the design of the mission and support systems. This lack of rigour may contribute to occurrences of critical failures in service, and to circumstances in which critical failures cannot be satisfactorily addressed • However, one oft-maligned design feature, the single point of failure (SPF), is a fact of life, and may only be significant if the potential occurrence or severity is high. We live in and use complex systems of SPFs every day without any problem, by virtue of the fact that the occurrence of failure is often very low • Excessive redundancy can increase prime cost, operator skill requirements, and operational and maintenance costs without great benefit. Single-engine turbine aircraft are becoming more popular in the aviation world for just these reasons • Sometimes, though, it’s just real nice to have…