Mike Taylor: Lessons from history - Case studies that might help spot where things can go wrong

Lessons from history – Case studies that might help spot where things can go wrong

Mike Taylor, Advitech Pty Ltd, Mayfield, Australia

Incident Prevention Strategy, Feb 2016

• Risk-based intervention - develop a framework
for the ongoing identification and verification of
risk profiling, incorporating risk control measure
verification, and consideration of deployment
practices to target areas of risk priority.
• Human and organisational factors - research and
consider the impact of human and organisational
factors on risk management and reporting.

A few clues on where risk control measures may be weak or missing altogether

• “We’ll risk assess that out”
• “Everybody knows” assumptions
• Specification errors
• Management systems
• Unclear responsibilities
• Human error

Other warning signs

• Too much emphasis on the risk assessment
process, rather than the outcomes
• Some methods good for establishing priorities,
but not much else
• Reliance placed on barriers and controls
• Controls may not be as effective as first thought

• Control weaknesses may lie dormant for years

What about barriers and controls?

• Essential to list them
• Essential to judge their effectiveness
• Be wary of re-evaluating risk until proposed
barriers and controls are in place and found to be
effective
• Sometimes the existing controls are the ones that
are the weakest

Faults and failures

• Failure: Function not performed
• Fault: Loss of capability to perform the function
when called upon to do so
• Dangerous undetected faults: May lie dormant
for years before failure actually occurs
• Initial fault may be random or non-random

Random hardware failures

• Corrosion, wear, seizure, loosening, etc
• Predictable as to their rate, but not as to when
the next failure will occur
• Often detected and repaired before any damage
caused
• Various sources of information available
(histories)
• Conventional statistical analysis and modeling
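The point that random hardware failures are predictable as to rate, but not as to timing, can be illustrated with a small simulation. This is only a sketch: it assumes a constant failure rate (exponentially distributed times between failures), and the rate value, seed and function name are hypothetical choices made for illustration.

```python
import random


def simulate_failure_gaps(rate_per_hour, n_failures, seed=0):
    """Draw successive times between failures for a constant failure rate.

    Assumes an exponential model, which is the usual consequence of a
    constant (time-independent) failure rate.
    """
    rng = random.Random(seed)
    return [rng.expovariate(rate_per_hour) for _ in range(n_failures)]


# Hypothetical rate: on average one failure per 10,000 operating hours.
rate = 1e-4
gaps = simulate_failure_gaps(rate, 10_000)
mean_gap = sum(gaps) / len(gaps)

# The long-run average is close to 1/rate, so the rate is predictable...
print(f"mean time between failures: {mean_gap:.0f} h (1/rate = {1 / rate:.0f} h)")
# ...but individual gaps vary enormously, so the next failure is not.
print(f"shortest gap: {min(gaps):.1f} h, longest gap: {max(gaps):.0f} h")
```

The average settles near 1/rate, while the individual gaps range from very short to many times the mean. That spread is why histories and statistics help with planning but cannot say when the next failure will happen.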

Engineers comforted by predictability and numbers

• Calculating probability of failure on demand,
based on a uniform failure rate λ:
$$PFD_G = 2\left[(1-\beta_D)\lambda_{DD} + (1-\beta)\lambda_{DU}\right]^2 t_{CE}\, t_{GE} + \beta_D \lambda_{DD}\, MTTR + \beta \lambda_{DU}\left(\frac{T_1}{2} + MRT\right)$$
• Perhaps even seduced by the numbers?
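A short numeric sketch shows what the formula above actually computes. It assumes the expression is the IEC 61508-6 form for a one-out-of-two (1oo2) architecture, with the channel and group equivalent down times t_CE and t_GE taken from the same standard. All input values and the function name are hypothetical and chosen only for illustration.

```python
def pfd_1oo2(lambda_du, lambda_dd, beta, beta_d, t1, mttr, mrt):
    """Average probability of failure on demand for a 1oo2 group.

    Assumed to follow the IEC 61508-6 simplified equations. Rates are per
    hour, times are in hours. Returns the independent-failure term and the
    common-cause term separately so their sizes can be compared.
    """
    lambda_d = lambda_du + lambda_dd
    # Channel and group equivalent mean down times.
    t_ce = (lambda_du / lambda_d) * (t1 / 2 + mrt) + (lambda_dd / lambda_d) * mttr
    t_ge = (lambda_du / lambda_d) * (t1 / 3 + mrt) + (lambda_dd / lambda_d) * mttr
    independent = 2 * ((1 - beta_d) * lambda_dd + (1 - beta) * lambda_du) ** 2 * t_ce * t_ge
    common_cause = beta_d * lambda_dd * mttr + beta * lambda_du * (t1 / 2 + mrt)
    return independent, common_cause


# Hypothetical inputs: yearly proof test, 8 h repair and restoration times.
independent, common_cause = pfd_1oo2(
    lambda_du=1e-6, lambda_dd=1e-6, beta=0.1, beta_d=0.05,
    t1=8760.0, mttr=8.0, mrt=8.0,
)
total = independent + common_cause
print(f"independent term:  {independent:.3e}")
print(f"common-cause term: {common_cause:.3e}")
print(f"PFD_G:             {total:.3e}")
```

With these inputs the common-cause term is much larger than the independent term, so the result depends mostly on the assumed β factors rather than on the measured hardware failure rates. The precision of the final number can hide how much of it rests on judgement.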

Non-random failures

• So-called “systematic failures”
• Not related to normal degradation mechanisms
of corrosion, wear, etc
• Deterministic rather than probabilistic
• Often more difficult to detect and eliminate
• Actual failure may be the first indication of
trouble

What can be learned from history of non-random faults and failures?

• Quantitative information (component life, failure
modes, etc) generally not applicable
• Fewer obvious examples, unlike failures of
hardware components
• Not amenable to statistical analysis or modeling

• Subtle, underlying causes, often overlooked in
post-incident investigations

Why might systematic (non-random) failures receive less attention?

• People may assume that existing management
systems and processes are able to deal with them
• Examples:
– Design reviews
– Approvals processes
– Issues tracking
– Management of change
– Check / back-check systems

Case studies

• Barriers and controls found to be less effective
than initially assumed
• Non-random failures. Events not equally likely.
• Underlying faults or weaknesses that can remain
undetected for long periods

Clapham Junction, London, 1988

• Three trains collided
• 35 people killed
• Signal was green when it should have been red

• A wiring fault, after modification work
• Immediate fault was dormant for about eight
hours
• Underlying fault dormant for years


Source: Hidden A, 1989, Investigation into the Clapham Junction Railway Accident, Department of Transport, London

Milton Keynes, UK, 2008


• Fault was noticed before a collision could occur

• A software specification error, as part of
modification work
• Fault was dormant for months

Source: RAIB, 2010, Special Investigation – Review of the railway industry’s investigation of an irregular signal sequence at Milton Keynes, 29 December 2008, Department of Transport

Falkirk, Scotland, 2009

• Points were set in the wrong position for the train
to pass safely
• Train at 100 km/hour, fortunately did not derail
• Fault was dormant for a few hours
• Underlying fault dormant for years

• Proper testing not carried out after the work

Source: RAIB, 2010, Rail Accident Report 04/2010, Incident at Greenhill Upper Junction, near Falkirk, 22 March 2009, Department of Transport


Falkirk, Scotland

• Wire count not performed in the field
• Field workers assumed wire count done in the
workshop

Cootamundra, NSW, 2009


• Fault was noticed before a collision could occur

• An error during the design was not properly
tracked
• Fault was dormant for two years

Source: ATSB Transport Safety Report, Rail Occurrence Investigation RO-2009-009, Reported signal irregularity at Cootamundra, NSW, involving trains ST22 and 4MB7, 12 November 2009

Minneapolis, MN, 2007

• Steel bridge collapsed
• 13 persons killed
• Design fault, carried through to construction

• Fault was dormant for 40 years

Source: National Transportation Safety Board, Accident Report NTSB/HAR-08/03, PB2008-916203, Collapse of I-35W Highway Bridge, Minneapolis, Minnesota, August 1, 2007

USAir, Aliquippa, PA, 1994

• Aircraft crashed during landing approach, with all
on board lost
• Control system failure
• Original failure modes analysis anticipated such a
failure
• Analysis did not properly anticipate the effects
• Fault was dormant for 25 years
• Fault not revealed until two other aircraft
incidents

Source: Aircraft Accident Report – Uncontrolled Descent and Collision with Terrain, USAir Flight 427, Boeing 737-300, N513AU, near Aliquippa, Pennsylvania, September 8, 1994, National Transportation Safety Board, PB 99-910401

Alaska Airlines, Anacapa Island, CA, 2000

• Aircraft lost control in flight and crashed into the
Pacific Ocean. All on board lost.
• Mechanical failure of screw thread and nut
• Evidence of wear could have been detected, but
was not
• Fault was dormant for ten years

Source: Aircraft Accident Report, Loss of Control and Impact with Pacific Ocean, Alaska Airlines Flight 261, McDonnell Douglas MD-83, N963AS, about 2.7 miles north of Anacapa Island, California, January 31, 2000, National Transportation Safety Board, NTSB/AAR-02/01, PB2002-910402


American Airlines, Belle Harbor, NY, 2001

• Aircraft crashed shortly after take-off, with all on
board lost
• Pilot error
• Haptic feedback (“feel”) of rudder pedals
different from many other similar aircraft
• Aggressive use of rudder. Vertical stabilizer
overloaded.

Source: Aircraft Accident Report NTSB/AAR-04/04, In-Flight Separation of Vertical Stabilizer, American Airlines Flight 587, Airbus Industrie A300-605R, N14053, Belle Harbor, New York, November 12, 2001, National Transportation Safety Board, PB2004-910404, Notation 7439B

Cape Hillsborough, Qld, Australia, 2003

• Emergency medical services helicopter mission
• Aircraft crashed into sea on foggy night, with all
on board lost
• Possible loss of spatial orientation
• Several key risk factors present
• Operators unaware of US study into risk factors
• Fault was dormant for ten years

Source: Aviation Safety Investigation 2003 04282, Bell 407 VH-HT Cape Hillsborough, Qld, 17 October 2003, Australian Transport

Markham Colliery, UK, 1973

• Brake rod broke (fatigue fracture)
• 18 people killed
• Poor design: No practicable means of lubrication

• Warning from 1961 incident
• Crack probably present when inspected in 1961

Source: Calder JW, 1974, Accident at Markham Colliery, Derbyshire: report on the cause of, and circumstances attending, the overwind which occurred at Markham Colliery, Derbyshire, on 30 July 1973, Department of Energy

Qantas, Batam Island, Indonesia, 2010

• A380 engine rotor failure
• Significant damage from debris
• Caused by broken oil feed pipe, poorly
manufactured
• Failure modes analysis did not properly anticipate
the effects
• Two faults, each dormant for several years

Source: ATSB Transport Safety Report, Aviation Occurrence Investigation AO-2010-089, 27 June 2013, In-flight uncontained engine failure, Airbus A380, VH-OQA, overhead Batam Island, Indonesia, 4 November 2010

Conclusions

• Plenty of new mistakes to be made, without
repeating the old ones
• Human error implicated in most of these cases
• Human error rates much higher than those for
physical devices
• Statistics not much help when dealing with
non-random failures

Conclusions

• Easy to lose sight of the real issues if just focused
on process
• Misplaced reliance on barriers and controls,
especially existing controls
• Weakness can remain dormant for years

Implications for designers and operators

• Recognise that one systematic fault can undo all
the good work with random hardware failure
predictions
• Recognise the places where things can go wrong:

– Specification errors
– Failure mode assumptions
– “Everybody knows” assumptions
– Unclear responsibilities
• Look for subtle signs of problems during
operations
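The first implication above can be made concrete with a deliberately simple sketch. It assumes a hypothetical one-out-of-two pressure trip in which both redundant channels are configured from the same specification. The setpoints, pressures and function names are invented for illustration and are not taken from any of the case studies.

```python
def channel_trips(pressure_kpa, setpoint_kpa):
    """One trip channel: demands a shutdown when pressure exceeds its setpoint."""
    return pressure_kpa > setpoint_kpa


def one_out_of_two(pressure_kpa, setpoint_a_kpa, setpoint_b_kpa):
    """1oo2 voting: the system trips if either channel trips."""
    return (channel_trips(pressure_kpa, setpoint_a_kpa)
            or channel_trips(pressure_kpa, setpoint_b_kpa))


INTENDED_SETPOINT = 500.0   # hypothetical: what the designer meant (kPa)
SPECIFIED_SETPOINT = 5000.0  # hypothetical: what the specification said

# Normal operation: both versions behave identically, so the error is dormant.
normal = 300.0
print(one_out_of_two(normal, INTENDED_SETPOINT, INTENDED_SETPOINT))    # no trip
print(one_out_of_two(normal, SPECIFIED_SETPOINT, SPECIFIED_SETPOINT))  # no trip

# Real demand: redundancy does not help, because both channels share the error.
demand = 800.0
print(one_out_of_two(demand, INTENDED_SETPOINT, INTENDED_SETPOINT))    # trips
print(one_out_of_two(demand, SPECIFIED_SETPOINT, SPECIFIED_SETPOINT))  # fails to trip
```

Nothing here is probabilistic: the mis-specified system fails on every real demand, yet looks healthy in normal operation. A hardware reliability calculation for the two channels would not reveal it, because both channels work exactly as specified.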
