Mike Taylor: Lessons from history - Case studies that might help spot where things can go wrong
1. Lessons from history – Case studies
that might help spot where things can
go wrong
Mike Taylor, Advitech Pty Ltd, Mayfield, Australia
2. Incident Prevention Strategy, Feb 2016
• Risk-based intervention - develop a framework
for the ongoing identification and verification of
risk profiling, incorporating risk control measure
verification, and consideration of deployment
practices to target areas of risk priority.
• Human and organisational factors - research and
consider the impact of human and organisational
factors on risk management and reporting.
4. A few clues on where
risk control measures
may be weak or missing altogether
• “We’ll risk assess that out”
• “Everybody knows” assumptions
• Specification errors
• Management systems
• Unclear responsibilities
• Human error
5. Other warning signs
• Too much emphasis on the risk assessment
process, rather than the outcomes
• Some methods good for establishing priorities,
but not much else
• Reliance placed on barriers and controls
• Controls may not be as effective as first thought
• Control weaknesses may lie dormant for years
7. What about barriers and controls?
• Essential to list them
• Essential to judge their effectiveness
• Be wary of re-evaluating risk until proposed
barriers and controls are in place and found to be
effective
• Sometimes the existing controls are the ones that
are the weakest
8. Faults and failures
• Failure: Function not performed
• Fault: Loss of capability to perform the function
when called upon to do so
• Dangerous undetected faults: May lie dormant
for years before failure actually occurs
• Initial fault may be random or non-random
9. Random hardware failures
• Corrosion, wear, seizure, loosening, etc
• Predictable as to their rate, but not as to when
the next failure will occur
• Often detected and repaired before any damage
caused
• Various sources of information available
(histories)
• Conventional statistical analysis and modeling
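The point that random failures are predictable as to rate, but not as to timing, can be illustrated with a small simulation. This is only a sketch: it assumes a constant failure rate (exponentially distributed gaps between failures), and the rate, sample size and function name are hypothetical values chosen for illustration.

```python
import random


def simulate_failure_gaps(rate_per_year, n_failures, seed=0):
    """Draw the times between successive failures (in years),
    assuming a constant failure rate, i.e. exponential gaps."""
    rng = random.Random(seed)
    return [rng.expovariate(rate_per_year) for _ in range(n_failures)]


rate = 0.5  # hypothetical: on average one failure every two years
gaps = simulate_failure_gaps(rate, 10_000)

# Over many failures the observed rate settles close to the true rate...
estimated_rate = len(gaps) / sum(gaps)
# ...but individual gaps range from almost nothing to many years.
shortest, longest = min(gaps), max(gaps)

print(f"estimated rate: {estimated_rate:.3f} failures per year")
print(f"shortest gap: {shortest:.4f} years, longest gap: {longest:.1f} years")
```

The long-run rate is recovered well from the history, yet that history says very little about when the next individual failure will arrive.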
10. Engineers comforted by predictability
and numbers
• Calculating probability of failure on demand,
based on a uniform failure rate λ :
$$PFD_G = 2\left[(1-\beta_D)\,\lambda_{DD} + (1-\beta)\,\lambda_{DU}\right]^2 t_{CE}\, t_{GE} + \beta_D\,\lambda_{DD}\,MTTR + \beta\,\lambda_{DU}\left(\frac{T_1}{2} + MRT\right)$$
• Perhaps even seduced by the numbers?
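As a rough sketch of how such a figure is produced, the calculation can be written out in a few lines. The expression on this slide appears to be the one given in IEC 61508-6 for a two-channel (1oo2) group, with the second term in the bracket using the undetected rate λDU; that reading, the expressions used for tCE and tGE, the function name and all input values are assumptions made for illustration, not figures from the presentation.

```python
def pfd_avg_1oo2(lambda_dd, lambda_du, beta, beta_d, t1, mttr, mrt):
    """Average probability of failure on demand for a two-channel group.

    Rates are per hour; t1 (proof test interval), mttr and mrt are in hours.
    """
    lambda_d = lambda_dd + lambda_du
    # Channel and group equivalent mean down times (assumed definitions).
    t_ce = (lambda_du / lambda_d) * (t1 / 2 + mrt) + (lambda_dd / lambda_d) * mttr
    t_ge = (lambda_du / lambda_d) * (t1 / 3 + mrt) + (lambda_dd / lambda_d) * mttr
    # Both channels failing independently of each other.
    independent = (
        2 * ((1 - beta_d) * lambda_dd + (1 - beta) * lambda_du) ** 2 * t_ce * t_ge
    )
    # Common cause failures, detected and undetected.
    common_detected = beta_d * lambda_dd * mttr
    common_undetected = beta * lambda_du * (t1 / 2 + mrt)
    return independent + common_detected + common_undetected


# Hypothetical inputs, chosen only to exercise the formula.
pfd = pfd_avg_1oo2(
    lambda_dd=1e-6, lambda_du=1e-6, beta=0.1, beta_d=0.05,
    t1=8760, mttr=8, mrt=8,
)
print(f"PFD_G = {pfd:.3e}")
```

With these made-up inputs the undetected common cause term accounts for most of the result, so the answer is only as good as the assumed value of β.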
11. Non-random failures
• So-called “systematic failures”
• Not related to normal degradation mechanisms
of corrosion, wear, etc
• Deterministic rather than probabilistic
• Often more difficult to detect and eliminate
• Actual failure may be the first indication of
trouble
12. What can be learned from history of
non-random faults and failures?
• Quantitative information (component life, failure
modes, etc) generally not applicable
• Fewer obvious examples, unlike failures of
hardware components
• Not amenable to statistical analysis or modeling
• Subtle, underlying causes, often overlooked in
post-incident investigations
13. Why might systematic (non-random)
failures receive less attention?
• People may assume that existing management
systems and processes are able to deal with them
• Examples:
– Design reviews
– Approvals processes
– Issues tracking
– Management of change
– Check / back-check systems
14. Case studies
• Barriers and controls found to be less effective
than initially assumed
• Non-random failures. Events not equally likely.
• Underlying faults or weaknesses that can remain
undetected for long periods
15. Clapham Junction, London, 1988
• Three trains collided
• 35 people killed
• Signal was green when it should have been red
• A wiring fault, after modification work
• Immediate fault was dormant for about eight
hours
• Underlying fault dormant for years
16. Source: Hidden A, 1989, Investigation into the Clapham Junction Railway Accident, Department of Transport, London
18. Milton Keynes, England, 2008
• Signal was green when it should have been red
• Fault was noticed before a collision could occur
• A software specification error, as part of
modification work
• Fault was dormant for months
19. Source: RAIB, 2010, Special Investigation – Review of the railway industry’s investigation of an irregular signal sequence at Milton Keynes, 29 December 2008, Department for Transport
20. Falkirk, Scotland, 2009
• Points were set in the wrong position for the train
to pass safely
• Train at 100 km/hour, fortunately did not derail
• A wiring fault, after modification work
• Fault was dormant for a few hours
• Underlying fault dormant for years
22. Case study:
Falkirk, Scotland, 2009
• Points were set in the wrong position for the train to pass
safely
• Train at 80 km/hour, fortunately did not derail
• A wiring fault, after modification work
• Proper testing not carried out after the work
Source: RAIB, 2010, Rail Accident Report 04/2010, Incident at Greenhill Upper Junction, near Falkirk, 22 March 2009, Department for Transport
23. Source: RAIB, 2010, Rail Accident Report 04/2010
24. Falkirk, Scotland
• Wire count not performed in the field
• Field workers assumed wire count done in the
workshop
25. Cootamundra, NSW, 2009
• Signal was green when it should have been red
• Fault was noticed before a collision could occur
• An error during the design was not properly
tracked
• Fault was dormant for two years
26. Source: ATSB Transport Safety Report, Rail Occurrence Investigation RO-2009-009, Reported signal irregularity at Cootamundra, NSW, involving trains ST22 and 4MB7, 12 November 2009
27. Minneapolis, MN, 2007
• Steel bridge collapsed
• 13 persons killed
• Design fault, carried through to construction
• Fault was dormant for 40 years
28. Source: National Transportation Safety Board, Accident Report NTSB/HAR-08/03, PB2008-916203, Collapse of I-35W Highway Bridge, Minneapolis, Minnesota, August 1, 2007
33. USAir, Aliquippa, PA, 1994
• Aircraft crashed during landing approach, with all
on board lost
• Control system failure
• Original failure modes analysis anticipated such a
failure
• Analysis did not properly anticipate the effects
• Fault was dormant for 25 years
• Fault not revealed until two other aircraft
incidents
34. Source: Aircraft Accident Report, Uncontrolled Descent and Collision with Terrain, USAir Flight 427, Boeing 737-300, N513AU, near Aliquippa, Pennsylvania, September 8, 1994, National Transportation Safety Board, PB99-910401
36. Alaska Airlines,
Anacapa Island, CA, 2000
• Aircraft crashed into the sea during flight. All on board
lost.
• Mechanical failure of screw thread and nut
• Evidence of wear could have been detected, but
was not
• Fault was dormant for ten years
37. Source: Aircraft Accident Report, Loss of Control and Impact with Pacific Ocean, Alaska Airlines Flight 261, McDonnell Douglas MD-83, N963AS, about 2.7 miles north of Anacapa Island, California, January 31, 2000, National Transportation Safety Board, NTSB/AAR-02/01, PB2002-910402
38. Source: National Transportation Safety Board, NTSB/AAR-02/01, PB2002-910402
41. American Airlines,
Belle Harbor, NY, 2001
• Aircraft crashed shortly after take-off, with all on
board lost
• Pilot error
• Haptic feedback (“feel”) of rudder pedals
different from many other similar aircraft
• Aggressive use of rudder. Vertical stabilizer
overloaded.
42. Source: Aircraft Accident Report NTSB/AAR-04/04, In-Flight Separation of Vertical Stabilizer, American Airlines Flight 587, Airbus Industrie A300-605R, N14053, Belle Harbor, New York, November 12, 2001, National Transportation Safety Board, PB2004-910404, Notation 7439B
43. Cape Hillsborough, Qld, Australia, 2003
• Emergency medical services helicopter mission
• Aircraft crashed into sea on foggy night, with all
on board lost
• Possible loss of spatial orientation
• Several key risk factors present
• Operators unaware of US study into risk factors
• Fault was dormant for ten years
44. Source: Aviation Safety Investigation 2003 04282, Bell 407 VH-HT Cape Hillsborough, Qld, 17 October 2003, Australian Transport Safety Bureau
45. Markham Colliery, UK, 1973
• Brake rod broke (fatigue fracture)
• 18 people killed
• Poor design: No practicable means of lubrication
• Warning from 1961 incident
• Crack probably present when inspected in 1961
46. Source: Calder JW, 1974, Accident at Markham Colliery, Derbyshire: report on the cause of, and circumstances attending, the overwind which occurred at Markham Colliery, Derbyshire, on 30 July 1973, Department of Energy
49. Qantas, Batam Island, Indonesia, 2010
• A380 engine rotor failure
• Significant damage from debris
• Caused by broken oil feed pipe, poorly
manufactured
• Failure modes analysis did not properly anticipate
the effects
• Two faults, each dormant for several years
50. Source: ATSB Transport Safety Report, Aviation Occurrence Investigation Report AO-2010-089, 27 June 2013, In-flight uncontained engine failure, Airbus A380, VH-OQA, overhead Batam Island, Indonesia, 4 November 2010
57. Conclusions
• Plenty of new mistakes to be made, without
repeating the old ones
• Human error implicated in most of these cases
• Human error rates much higher than those for
physical devices
• Statistics not much help when dealing with
non-random failures
58. Conclusions
• Easy to lose sight of the real issues if just focused
on process
• Misplaced reliance on barriers and controls,
especially existing controls
• Weakness can remain dormant for years
59. Implications for designers
and operators
• Recognise that one systematic fault can undo all
the good work with random hardware failure
predictions
• Recognise the places where things can go wrong:
– Specification errors
– Failure mode assumptions
– “Everybody knows” assumptions
– Unclear responsibilities
• Look for subtle signs of problems during
operations
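The first implication above can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a dormant systematic fault and random hardware failure act independently; the function name and every number in it are hypothetical.

```python
def probability_of_failure_on_demand(pfd_random, p_systematic_fault):
    """Chance the safety function fails when called upon, combining a
    calculated random-hardware PFD with the chance that a dormant
    systematic fault defeats the function (independence assumed)."""
    return 1.0 - (1.0 - pfd_random) * (1.0 - p_systematic_fault)


pfd_random = 1e-4  # hypothetical result of a careful hardware calculation

# No systematic fault, a 1% chance of one, and one that is certainly present.
for p_sys in (0.0, 0.01, 1.0):
    total = probability_of_failure_on_demand(pfd_random, p_sys)
    print(f"P(systematic fault) = {p_sys:<4} -> overall failure probability = {total:.4f}")
```

Once a systematic fault is actually present, such as a wiring error left after modification work, the overall figure is 1 regardless of how small the calculated hardware number was.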