2. Investigating causes of failures &
mishaps
Stop and ask yourself…
Did you really find the causes
of the failure?
Like icebergs, most of the problem is
usually below the surface!
4. Technical Proficiency
• Once the accident happened, how did Gene
Kranz rely on the skills and expertise of his
people?
• How did Lovell work to initiate actions in the
spaceship? Was he able to balance that with
his technical responsibilities in the craft? How
did he do it?
• What steps does your unit take to maintain
Technical Proficiency?
Lessons from Apollo 13
5. Teambuilding
• How did Lovell contribute to the group process
when Mattingly wanted to practice the docking
procedure again after 3 hrs of practice?
• When Kranz had the team in the classroom, how
did he establish the goal, and then how did he go
about motivating others to achieve the goal of
returning the spacecraft safely to Earth?
• Did Lovell make the right call when faced with the
challenge of forcing Mattingly to stay behind
because of the fear of measles?
• How does a leader successfully build a strong
team, but then separate him or herself from the
Team to make a critical decision?
• How’s your Team doing?
6. Effective Communications
• Even as everything is breaking loose in
Mission Control, Gene Kranz asks his team
to “Work the Problem.” He then listens to
the experts report in on their areas of the
mission. How did his effective communication set the
stage for a successful recovery?
• Kranz stated “Failure is not an option” and
Lovell told his crew “I intend to go home.” By
clearly stating their ideas and vision, how did they
direct the teams towards mission
accomplishment?
• Who’s the best communicator you’ve ever
worked with? What made them excel?
7. Vision Development & Implementation
• JFK’s Vision: "I believe that this nation
should commit itself to achieving the goal,
before this decade is out, of landing a man
on the moon and returning him safely to
Earth."
• How does a stated vision focus the unit
and bring the crew together?
• Lovell states: “Columbus, Lindbergh, and
Armstrong; it is not a miracle for man to
walk on the moon, we just decided to go.”
• What’s the vision at your unit? Has
everyone decided “to go?” What can your
unit do to get everyone “on board”?
8. Conflict Management
• How did Lovell deal with stress and conflict in the
LEM?
• How did the CO2 challenge help the crew to
overcome the conflict they were experiencing?
• Is there more or less conflict when people are busy
and focused or when there is less to do and folks have
time on their hands? Why?
• How did Kranz and Lovell go about alleviating conflict
between the crew and the Medical team?
9. Decision Making & Problem
Solving
• How did the Team live the Competency of
Decision Making and Problem Solving in
working the “Power” problem to conclusion?
• Right after the explosion, Kranz asks
Mission Control, “What do we have on the
spacecraft that’s good?”
• Why did he ask this question?
• How did it aid in making the correct decision
to shut down the fuel cells?
• Does everyone on your Team ensure that the
Decision Makers have all the available and
correct information? Why or why not?
10. Creativity and Innovation
• We’ve discussed a lot of positive leadership
qualities during this session. How did Gene
Kranz create an environment in which his
Mission Control team was able to figure out
how to solve the CO2 problem with a
“Square Peg in a Round Hole”?
• Lovell states at the end of the movie:
“Thousands of people worked to bring the 3
of us back home.” How did creativity and
innovation make the “Successful Failure” a
reality?
• How does your unit build on Lessons
Learned?
12. Investigating causes of failures &
mishaps
When performing an investigation, it is necessary to look at more than
just the immediately visible cause, which is often the proximate cause.
Underlying organizational causes are more difficult to see, but they may
contribute significantly to the undesired outcome and, if not corrected,
will continue to create similar types of problems. These are root causes.
Mishap reporting and investigation requirements apply to all mishaps, and
investigations must identify the proximate cause(s), root cause(s) and
contributing factor(s).
13. Definitions
Proximate Cause(s) (Direct Cause)
• The event(s) that occurred, including any condition(s) that
existed immediately before the undesired outcome, directly
resulted in its occurrence and, if eliminated or modified, would
have prevented the undesired outcome.
• Examples of proximate causes:
Equipment
• Arced
• Leaked
• Over-loaded
• Over-heated
Human
• Pushed incorrect button
• Fell
• Dropped tool
• Connected wires
14. Definitions
Root Cause(s)
• One of multiple factors (events, conditions or organizational factors)
that contributed to or created the proximate cause and subsequent
undesired outcome and, if eliminated or modified, would have
prevented the undesired outcome. Typically multiple root causes
contribute to an undesired outcome.
Organizational factors
• Any operational or management structural entity that exerts control
over the system at any stage in its life cycle, including but not limited
to the system’s concept development, design, fabrication, test,
maintenance, operation, and disposal.
• Examples: resource management (budget, staff, training); policy
(content, implementation, verification); and management decisions.
15. Definitions
Root Cause Analysis (RCA)
• A structured evaluation method that identifies the root causes for
an undesired outcome and the actions adequate to prevent
recurrence. Root cause analysis should continue until
organizational factors have been identified, or until data are
exhausted.
• RCA is a method that helps professionals determine:
• What happened.
• How it happened.
• Why it happened.
• Allows learning from past problems, failures, and accidents.
16. Root Cause Analysis - Steps
1. Identify and clearly define the undesired outcome (outage).
2. Gather data.
3. Create a timeline.
4. Place events & conditions on an event and causal factor tree.
5. Use a fault tree or other method/tool to identify all potential causes.
6. Decompose system failures down to basic events or conditions (further describe what
happened).
7. Identify specific failure modes (Immediate Causes)
8. Continue asking “WHY” to identify root causes.
9. Check your logic and your facts. Eliminate items that are not causes or contributing
factors.
10. Generate solutions that address both proximate causes and root causes.
17. Root Cause Analysis - Steps
Clearly define the undesirable outcome.
• Describe the undesired outcome.
• For example: “software failed to deploy,” “transaction failed,” or
“XYZ project schedule significantly slipped.”
Gather data.
Identify facts surrounding the undesired outcome.
• When did the undesired outcome occur?
• Where did it occur?
• What conditions were present prior to its occurrence?
• What controls or barriers could have prevented its
occurrence but did not?
• What are all the potential causes?
• What actions can prevent recurrence?
• What amelioration occurred? Did it prevent further damage?
18. Root Cause Analysis - Steps
Create a timeline (sequence diagram)
• Illustrate the sequence of events in chronological order
horizontally across the page.
• Depict relationships between conditions, events, and exceeded
or failed barriers/controls.
[Diagram: timeline template. Elements: Event, Event, Event, Condition, Exceeded/Failed Barrier or Control, Undesired Outcome.]
19. Root Cause Analysis - Steps
Create a timeline (sequence diagram)
• If amelioration occurred (e.g., reboot server, move application to
another server), this should be included in the evaluation to ensure that
it did not contribute to the undesired outcome.
Example: In the case of a server reboot, the investigation should ensure that
the reboot was the result of the mishap and not a result of latent hardware
defects.
[Diagram: timeline template with amelioration. Elements: Event, Event, Event, Condition, Exceeded/Failed Barrier or Control, Undesired Outcome, Exceeded/Failed Amelioration.]
20. Root Cause Analysis - Steps
Example: simple timeline.
[Diagram: simple timeline. Boxes: Server Powered Up; Operating system started up; Switch port in wrong VLAN; Application failed to Go Live; Tech. Used Wrong Method To Correct; Lost transactions (Penalties paid).]
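The timeline step can be sketched in a few lines of Python. This is only an illustration: `TimelineEntry`, `build_timeline`, and `render` are hypothetical names, the labels are taken from the example above, and every timestamp (and therefore the resulting order) is invented because the slide gives none.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime  # hypothetical timestamp; the slides give no times
    kind: str       # "event", "condition", "barrier", or "outcome"
    label: str


def build_timeline(entries):
    """Return the entries sorted into chronological order."""
    return sorted(entries, key=lambda entry: entry.when)


def render(entries):
    """Lay the timeline out left to right, as it would run across the page."""
    return " -> ".join(f"[{e.kind}] {e.label}" for e in build_timeline(entries))


# Labels come from the example; every timestamp below is invented.
entries = [
    TimelineEntry(datetime(2000, 1, 1, 9, 40), "outcome", "Lost transactions (penalties paid)"),
    TimelineEntry(datetime(2000, 1, 1, 9, 0), "event", "Server powered up"),
    TimelineEntry(datetime(2000, 1, 1, 9, 20), "event", "Application failed to Go Live"),
    TimelineEntry(datetime(2000, 1, 1, 9, 5), "event", "Operating system started up"),
    TimelineEntry(datetime(2000, 1, 1, 9, 30), "event", "Technician used wrong method to correct"),
    TimelineEntry(datetime(2000, 1, 1, 9, 10), "condition", "Switch port in wrong VLAN"),
]

print(render(entries))
```

Entering facts in the order they are gathered and sorting afterwards keeps data collection separate from the chronological layout.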
21. Root Cause Analysis - Steps
Create an event and causal factor tree.
(A visual representation of the causes that led to the failure or mishap.)
• Place the undesired outcome at the top of the tree.
• Add all events, conditions, and exceeded/failed barriers that occurred immediately
before the undesired outcome and might have caused it.
[Diagram: event and causal factor tree. Boxes: Lost transactions (Penalties paid); Application failed to Go Live; Technician Used Wrong Method to Correct; Operating system started up; Server Powered Up; Switch port in wrong VLAN.]
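One way to hold an event and causal factor tree in code is a small recursive node type. The sketch below is an assumption-laden illustration: `CauseNode` is a hypothetical class, the labels come from the example, and the parent/child placement is a guess, since the original diagram layout is not recoverable from the text.

```python
from dataclasses import dataclass, field


@dataclass
class CauseNode:
    label: str
    kind: str = "event"  # "outcome", "event", "condition", or "barrier"
    causes: list = field(default_factory=list)

    def add(self, label, kind="event"):
        """Attach something that occurred immediately before this node."""
        child = CauseNode(label, kind)
        self.causes.append(child)
        return child

    def lines(self, depth=0):
        """Return the tree as indented text, outcome first."""
        out = ["  " * depth + f"{self.label} ({self.kind})"]
        for cause in self.causes:
            out.extend(cause.lines(depth + 1))
        return out


# The undesired outcome goes at the top of the tree.
outcome = CauseNode("Lost transactions (penalties paid)", "outcome")
app = outcome.add("Application failed to Go Live")
outcome.add("Technician used wrong method to correct")
# Placement under "Application failed to Go Live" is assumed, not sourced.
app.add("Operating system started up")
app.add("Server powered up")
app.add("Switch port in wrong VLAN", "condition")

print("\n".join(outcome.lines()))
```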
22. Root Cause Analysis - Steps
Create an event and causal factor tree.
• Brainstorm to ensure that all
possible causes are included, NOT
just those that you are sure are
involved.
• Be sure to consider people,
hardware, software, policy,
procedures, and the environment.
[Diagram: expanded event and causal factor tree. Boxes: Lost transactions (Penalties Paid); Application failed to Go Live; Technician Used Wrong Method to Correct; Operating system started up; Server Powered Up; Switch port in wrong VLAN; Electric power tripped; Power Supply Failed; Technicians not properly trained; Port labeled incorrectly; Switch labeled incorrectly; NIC driver wrong.]
23. Root Cause Analysis - Steps
Create an event and causal factor
tree continued...
• If you have solid data indicating
that one of the possible causes is
not applicable, it can be
eliminated from the tree.
Caution: Do not be too eager to eliminate
early on. If there is a possibility that it is a
causal factor, leave it and eliminate it later
when more information is available.
[Diagram: the expanded tree with one box crossed out (X). Boxes: Lost transactions (Penalties Paid); Application failed to Go Live; Technician Used Wrong Method to Correct; Operating system started up; Server Powered Up; Switch port in wrong VLAN; Electric power tripped; Power Supply Failed; Technicians not properly trained; Port labeled incorrectly; Switch labeled incorrectly; NIC driver wrong.]
24. Root Cause Analysis - Steps
Create an event and causal factor tree
continued…
• You may use a fault tree to determine all
potential causes and to decompose the
failure down to the “basic event” (e.g.,
system component level).
[Diagram: tree decomposed to basic events. Boxes: Lost transactions (Penalties Paid); Application failed to Go Live; Technician Used Wrong Method to Correct; Operating system started up; Switch port in wrong VLAN; Electric power tripped; Power supply failed; Technicians not properly trained; Switch labeled incorrectly; Port labeled incorrectly; NIC driver wrong; Maintenance swap with no re-label; Diagram wrong; Confusing labels.]
25. Root Cause Analysis - Steps
Create an event and causal factor
tree continued…
• A fault tree can also be used to
identify all possible types of
human failures.
[Diagram: fault tree of human failure types. Boxes: Lost transactions (Penalties paid); Application failed to Go Live; Technician Used Wrong Method to Correct; Switch port in wrong VLAN; Operating system started up.]
Human failure types shown in the tree:
• Perception Error: Didn’t Perceive System Feedback
• Interpretation Error: Didn’t Understand System Feedback
• Decision-Making Error: Correct Interpretation, Incorrect Decision
• Action-Execution Error: Correct Decision But Incorrect Action
• Also shown: Rule-Based Error, Knowledge-Based Error, Skill-Based Error
26. Root Cause Analysis - Steps
Create an event and causal factor tree continued…
• After you have identified all the possible causes, ask yourself “WHY” each
may have occurred.
• Be sure to keep your questions focused on the original issue. For example
“Why was the condition present?”; “Why did the event occur?”; “Why was
the parameter exceeded?” or “Why did the condition fail?”
[Diagram: Undesired Outcome at the top, with Event #1, Event #2, Condition, and Failed or Exceeded Barrier or Control beneath it. Each of the four has three "WHY" branches: WHY Event #1 Occurred; WHY Event #2 Occurred; WHY Condition Existed or Changed; WHY Failed/Exceeded Barrier or Control.]
27. Root Cause Analysis – Steps
Continue to ask “why” until you have reached:
1. Root cause(s) - including all
organizational factors that exert control
over the design, fabrication,
development, maintenance, operation,
and disposal of the system.
2. A problem that is not correctable by IT or
an IT contractor.
3. Insufficient data to continue.
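The three stopping points can be made explicit in code. The following is a minimal sketch, not a prescribed tool: `Stop` and `ask_why` are hypothetical names, each cause is assumed to have at most one recorded "why" answer, and the example chain uses labels from this deck arranged in an assumed order.

```python
from enum import Enum


class Stop(Enum):
    ROOT_CAUSE = "organizational root cause reached"
    NOT_CORRECTABLE = "not correctable by IT or an IT contractor"
    NO_DATA = "insufficient data to continue"


def ask_why(start, answers, organizational, out_of_scope):
    """Follow recorded 'why' answers from `start`; return (chain, stop reason)."""
    chain = [start]
    while True:
        current = chain[-1]
        if current in out_of_scope:
            return chain, Stop.NOT_CORRECTABLE
        nxt = answers.get(current)
        if nxt is None or nxt in chain:  # no answer recorded, or a loop
            break
        chain.append(nxt)
    reason = Stop.ROOT_CAUSE if chain[-1] in organizational else Stop.NO_DATA
    return chain, reason


# Assumed chain; each key's value is the recorded answer to "why?".
answers = {
    "Technician used wrong method to correct": "Insufficient anomaly training",
    "Insufficient anomaly training": "Training does not exist",
    "Training does not exist": "Insufficient training budget",
    "Insufficient training budget":
        "Organization underestimates importance of anomaly training",
}
organizational = {
    "Insufficient training budget",
    "Organization underestimates importance of anomaly training",
}

chain, reason = ask_why(
    "Technician used wrong method to correct", answers, organizational, set()
)
print(len(chain), reason.value)
```

The walk keeps going past the first organizational factor while answers remain, so that all organizational factors on the chain are reached before stopping.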
28. Root Cause Analysis - Steps
The resultant tree of questions and
answers should lead to a
comprehensive picture of
POTENTIAL causes for the
undesired outcome.
[Diagram: the "WHY" tree under the Undesired Outcome (Event #1, Event #2, Condition, Failed or Exceeded Barrier or Control), extended downward with further rows of "WHY" boxes; one Failed/Exceeded Barrier or Control branch is crossed out (X).]
29. Root Cause Analysis - Steps
Check your logic with a detailed review of
each potential cause.
• Verify it is a contributor or cause.
• If the action, deficiency, or decision in
question were corrected, eliminated
or avoided, would the undesired
outcome be prevented or avoided?
> If no, then eliminate it from the
tree.
[Diagram: the "WHY" tree under the Undesired Outcome (Event #1, Event #2, Condition, Failed or Exceeded Barrier or Control), with the branches that fail this check crossed out (X).]
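The logic check amounts to a counterfactual filter over the tree. A minimal sketch, assuming a hypothetical `Cause` node type and an investigator-supplied `would_prevent` judgment; the tree shape and the judgment about "Power supply failed" are invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Cause:
    label: str
    causes: List["Cause"] = field(default_factory=list)


def prune(node: Cause, would_prevent: Callable[[str], bool]) -> Cause:
    """Keep a cause only if correcting it would have prevented the outcome.

    A cause that fails the check is removed together with everything beneath it.
    """
    kept = [prune(c, would_prevent) for c in node.causes if would_prevent(c.label)]
    return Cause(node.label, kept)


def labels(node: Cause) -> List[str]:
    """Flatten the tree into a list of labels, top first."""
    found = [node.label]
    for cause in node.causes:
        found.extend(labels(cause))
    return found


tree = Cause("Lost transactions (penalties paid)", [
    Cause("Application failed to Go Live", [
        Cause("Switch port in wrong VLAN"),
        Cause("Power supply failed"),
    ]),
])

# Hypothetical review result: fixing the power supply would not have helped.
judgments = {"Power supply failed": False}
checked = prune(tree, lambda label: judgments.get(label, True))
print(labels(checked))
```

Causes without a recorded judgment default to staying on the tree, which matches the earlier caution against eliminating items too eagerly.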
30. Root Cause Analysis - Steps
Create an event and causal factor tree continued…
• The remaining items on the tree are the causes (or probable causes) necessary to
produce the undesired outcome.
• Proximate causes are those immediately before the undesired outcome.
• Intermediate causes are those between the proximate and root causes.
• Root causes are organizational factors or systemic problems located at the bottom
of the tree.
[Diagram: the "WHY" tree divided into three bands from top to bottom: PROXIMATE CAUSES (Event #1, Event #2, Condition, Failed or Exceeded Barrier or Control, directly under the Undesired Outcome), INTERMEDIATE CAUSES (the "WHY" boxes beneath them), and ROOT CAUSES (the lowest rows of "WHY" boxes).]
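The three bands can be derived from position in the tree. A rough sketch under stated assumptions: the tree is a plain dict from each label to its direct causes, `classify` is a hypothetical helper, anything directly under the outcome counts as proximate, and anything deeper with no further causes recorded counts as root. The example chain uses labels from this deck in an assumed arrangement.

```python
def classify(tree, outcome):
    """Label each cause as proximate, intermediate, or root by tree position.

    tree: dict mapping a label to the list of labels of its direct causes.
    """
    levels = {}
    stack = [(cause, 1) for cause in tree.get(outcome, [])]
    while stack:
        label, depth = stack.pop()
        children = tree.get(label, [])
        if depth == 1:
            levels[label] = "proximate"      # immediately before the outcome
        elif not children:
            levels[label] = "root"           # bottom of the tree
        else:
            levels[label] = "intermediate"   # between proximate and root
        stack.extend((child, depth + 1) for child in children)
    return levels


# Assumed arrangement of labels taken from the example.
tree = {
    "Lost transactions (penalties paid)": ["Application failed to Go Live"],
    "Application failed to Go Live": ["Switch port in wrong VLAN"],
    "Switch port in wrong VLAN": ["No quality inspection"],
    "No quality inspection": ["Insufficient quality staff"],
    "Insufficient quality staff": ["Insufficient budget"],
}

levels = classify(tree, "Lost transactions (penalties paid)")
for label, level in levels.items():
    print(f"{level:12} {label}")
```

A cause that is both directly under the outcome and has nothing beneath it is reported as proximate, which is a hint that more "why" questions are still needed there.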
31. Root Cause Analysis- Steps
Some people choose to leave contributing factors on the tree to show
all factors that influenced the event.
Contributing factor: An event or condition that may have contributed to the
occurrence of an undesired outcome but, if eliminated or modified, would not by
itself have prevented the occurrence.
If this is done, illustrate them differently (e.g., dotted line boxes and arrows) so that it is
clear that they are not causes.
[Diagram: the "WHY" tree under the Undesired Outcome (Event #1, Event #2, Condition, Failed or Exceeded Barrier or Control), with Contributing Factors drawn separately from the causes.]
32. Investigating Causes of Failures & Mishaps
Root Cause is Much Deeper
Keep Asking Why
[Diagram: causal factor tree. Boxes: Lost transactions (Penalties paid); Application failed to Go Live; Technician Used Wrong Method to Correct; Operating system started up; Switch port in wrong VLAN; No IP connection to network; VLAN incorrectly assigned; Incorrect server static address used; Engineer did not read correct label.]
33. Investigating Causes of Failures & Mishaps
[Diagram: completed causal factor tree. Boxes: Lost transactions (Penalties paid); Application failed to Go Live; Technician Used Wrong Method to Correct; Operating system started up; Switch port in wrong VLAN; No IP connection to network; VLAN incorrectly assigned; VLAN changed in unrelated move; Incorrect server static address used; Engineer did not read correct label; No Quality Inspection; Insufficient Quality Staff; Insufficient Budget; Procedure Incorrect; Not Updated; Not Under Configuration Mgmt; Decision-Making Error (Correct Interpretation, Incorrect Decision); New Task; Insufficient Anomaly Training; Training Does Not Exist; Insufficient Training Budget; Organization Under Estimates Importance of Anomaly Training.]
34. Root Cause Analysis- Steps
Generating Recommendations:
At a minimum corrective actions should be generated to eliminate proximate
causes and eliminate or mitigate the negative effects of root causes.
When multiple causes exist, there is limited budget, or it is difficult to
determine what should be corrected:
• Quantitative analysis can be used to determine the total contribution of
each cause to the undesirable outcome.
• Fishbone diagrams (or other methods) can be used to arrange causes
in order of their importance.
• Those causes which contribute most to the undesirable outcome
should be eliminated or the negative effects should be mitigated to
minimize risk.
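One simple form of the quantitative analysis is to turn a contribution score per cause into shares and correct the largest contributors first. The sketch below is illustrative only: `rank_causes` and `select_for_correction` are hypothetical helpers, the 80% coverage target is an arbitrary choice, and the scores are invented numbers, not data from any investigation.

```python
def rank_causes(contributions):
    """Return (label, share) pairs, largest contributor first."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive number")
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return [(label, value / total) for label, value in ranked]


def select_for_correction(ranked, coverage=0.8):
    """Pick top causes until their combined share reaches the coverage target."""
    chosen, covered = [], 0.0
    for label, share in ranked:
        if covered >= coverage:
            break
        chosen.append(label)
        covered += share
    return chosen


# Invented scores, e.g. how many past incidents each cause appeared in.
contributions = {
    "No quality inspection": 3,
    "Insufficient anomaly training": 6,
    "Procedure not updated": 1,
}

ranked = rank_causes(contributions)
to_fix = select_for_correction(ranked)
print(ranked)
print(to_fix)
```

When budget is limited, the coverage target is the knob to turn: a lower value yields a shorter list of causes to address first.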
35. Definitions of RCA &
Related Terms
Cause (Causal Factor): An event or condition that results in an effect. Anything that shapes or influences the outcome.
Proximate Cause(s): The event(s) that occurred, including any condition(s) that existed immediately before the undesired
outcome, directly resulted in its occurrence and, if eliminated or modified, would have prevented the
undesired outcome. Also known as the direct cause(s).
Root Cause(s): One of multiple factors (events, conditions or organizational factors) that contributed to or created the
proximate cause and subsequent undesired outcome and, if eliminated or modified, would have prevented
the undesired outcome. Typically multiple root causes contribute to an undesired outcome.
Root Cause Analysis (RCA): A structured evaluation method that identifies the root causes for an undesired outcome and the actions
adequate to prevent recurrence. Root cause analysis should continue until organizational factors have
been identified, or until data are exhausted.
Event: A real-time occurrence describing one discrete action, typically an error, failure, or malfunction.
Examples: pipe broke, power lost, lightning struck, person opened valve, etc.
Condition: Any as-found state, whether or not resulting from an event, that may have safety, health, quality, security,
operational, or environmental implications.
Organizational Factors: Any operational or management structural entity that exerts control over the system at any stage in its life
cycle, including but not limited to the system’s concept development, design, fabrication, test,
maintenance, operation, and disposal.
Examples: resource management (budget, staff, training); policy (content, implementation, verification);
and management decisions.
Contributing Factor: An event or condition that may have contributed to the occurrence of an undesired outcome but, if
eliminated or modified, would not by itself have prevented the occurrence.
Barrier: A physical device or an administrative control used to reduce risk of the undesired outcome to an
acceptable level. Barriers can provide physical intervention (e.g., a guardrail) or procedural separation in
time and space (e.g., lock-out-tag-out procedure).
36. MIR Process / Forms
Major Incident – Severe Business impact:
• service, system or infrastructure component not functioning adequately to enable business
process
• total loss of service, system or infrastructure component
Major Incidents can also be considered to be those which do not entirely impede the use of the
service, system or infrastructure component such as:
• continuous slow response
• general degradation of service
• Refer: http://thinkingproblemmanagement.blogspot.com