CRANFIELD UNIVERSITY
Alexander Tomczynski
APPLICATION OF RESILIENCE ENGINEERING CONCEPTS TO
THE MANAGEMENT OF AIRWORTHINESS
DEFENCE ACADEMY - COLLEGE OF MANAGEMENT AND
TECHNOLOGY
Military Aerospace and Airworthiness
MSc
Academic Year: 2013 - 2014
Supervisor: Dr Simon Place
March 2014
This thesis is submitted in partial fulfilment of the requirements for
the degree of Master of Science
© Crown Copyright 2014. All rights reserved. No part of this
publication may be reproduced without the written permission of the
copyright owner.
ABSTRACT
Complex safety critical systems in high hazard industries continue to have
accidents despite improvements in reliability, understanding of human factors
and the behaviour of organisations. Resilience engineering offers a new
paradigm in safety science and proposes that safety is defined as success
under varying performance conditions. The theory is examined and its
applicability to airworthiness is discussed. A related technique, the Functional
Resonance Analysis Method (FRAM), treats system performance as a control
problem. This methodology is employed to create an airworthiness
management tool for the Royal Air Force Tornado aircraft fleet. Data were
gathered from occurrence reports, practical experience and semi-structured
interviews with a variety of personnel within the airworthiness
system. The tool comprises a spreadsheet model with an accompanying
interactive visualisation tool. The tool is used to analyse two air safety
occurrences and also to attempt to provide a resilience based risk assessment
of an airworthiness issue. It was concluded that resilience engineering presents
a promising basis for better management of airworthiness. The initial version of
the tool was found to work well but extensive development work is required to
produce a desktop IT airworthiness resilience dashboard tool.
Keywords:
SYSTEM SAFETY, SAFETY CRITICAL SYSTEMS, ACCIDENT
INVESTIGATION
ACKNOWLEDGEMENTS
I would like to thank my wife Natalie for her encouragement and support.
Also worthy of thanks are Professor Erik Hollnagel and the rest of the “FRAMily”
who have gathered both online and at the 2013 meeting in Munich. The shared
knowledge and experience has been most instructive.
This project would not have been possible without the enthusiastic participation
of a large number of people at Royal Air Force Station Marham - service
personnel and employees of BAE Systems and Rolls-Royce.
The guidance provided by my supervisor Dr Simon Place has been invaluable
in the completion of this project and I thank him for it.
In remembrance of No. CXX Squadron, Crew 3
“Endurance”
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF EQUATIONS
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 Introduction
1.2 Background – Theories of Safety
1.3 Background – The Practical Requirement
1.4 What is ‘Airworthiness Management’?
1.5 The Research Aim
1.6 Objectives
1.7 Methodology Overview
1.8 Descriptions and Definitions
1.9 Thesis Structure
2 LITERATURE REVIEW
2.1 Airworthiness in the Context of Safety
2.1.1 Accident Investigations
2.1.2 Initial and Type Airworthiness
2.1.3 Safety Management
2.1.4 Continuing Airworthiness
2.2 A History of Safety Theory
2.2.1 Technological Age – Governing Philosophy
2.2.2 Technological Age – Tools
2.2.3 Limits of Probabilistic Risk Assessment
2.2.4 Human Factors
2.2.5 Organisational
2.3 Complexity
2.3.1 Complexity Theory
2.3.2 Systems Thinking and Systems Engineering
2.3.3 Control Theory
2.3.4 Non-Linear Dynamics
2.4 Resilience Engineering
2.4.1 Resilience Engineering as a Successor to Safety Management
2.4.2 Under Specification of Performance Conditions
2.4.3 Performance Variability
2.4.4 Examples of Resilience Engineering in Practice
2.4.5 Criticism of Resilience Engineering
2.4.6 Resilience Engineering and Airworthiness
2.4.7 Lean Resilience
2.5 Functional Resonance Analysis Method
2.6 Quantifying Resilience
2.7 Concluding Remarks
3 METHODOLOGY
3.1 Introduction
3.2 Working Arrangements
3.3 Research Interviews
3.4 Model Development
3.5 Air Safety Information Management System Data
3.5.1 Data Extraction
3.5.2 Assignment of Related Functions to Incidents
4 BUILDING THE TORNADO AIRWORTHINESS SYSTEM MODEL USING THE FUNCTIONAL RESONANCE ANALYSIS METHOD
4.1 Basic Principles
4.2 Taxonomy
4.3 FRAM Step 0 – Recognise the Purpose of the FRAM Analysis
4.4 FRAM Step 1a – Identify and Describe the Initial Function List
4.5 FRAM Step 1b – Verify Functions with Experts
4.6 Step 2 – Identification of Output Variability
4.7 Step 2a – Identify the Type of Function
4.8 Step 2b – Identify Internal Sources of Output Variability
4.9 Step 2c – Identify External Sources of Output Variability
4.10 Step 2d – Most Likely Dimension of Output Variability
4.11 Step 3 – Aggregation of Variability
4.12 Step 4 – Consequences of the Analysis
4.12.1 Step 4a – Damping Factors
4.12.2 Step 4b – Performance Indicators
4.13 Summary of TASM Layout
5 TORNADO AIRWORTHINESS SYSTEM MODEL VISUALISATION TOOL
5.1 Need for the Tool
5.2 Microsoft Visio
5.3 Building the Tool
5.3.1 General Functional Areas
5.3.2 Functions
5.3.3 External Dependencies
5.3.4 Functional Activities
5.4 Exploiting the Tool
5.5 Summary
6 USING THE TORNADO AIRWORTHINESS SYSTEM MODEL FOR INCIDENT ANALYSIS
6.1 Case for Using FRAM for Incident Modelling
6.2 Incident One – Thrust Reverser Incidents
6.2.1 Description of Incidents
6.2.2 Summary of the Investigations
6.2.3 Instantiation of the FRAM Model
6.2.4 The Sources of Variability
6.2.5 Insights from TASM
6.3 Incident 2 – Missing Rigging Pin
6.3.1 Description of Incident
6.3.2 Summary of Investigation
6.3.3 Instantiation of the TASM
6.3.4 Insights from TASM
7 USING THE TORNADO AIRWORTHINESS SYSTEM MODEL FOR RISK ANALYSIS
7.1 Case for Using TASM for Risk Analysis
7.2 Current Theoretical Basis for Airworthiness Risk Management
7.3 Proposal of FRAM Based Airworthiness Risk Theory
7.4 Proposal for a FRAM Based Risk Assessment Process
7.5 Risk Example – Operation of Components in Excess of Cleared Life
7.5.1 Generating a FRAM Model Risk Assessment
7.5.2 Insights into Risk
7.6 Proposal for a FRAM Based Risk Management
7.7 Chapter Summary
8 DISCUSSION
8.1 Applicability of the Resilience Engineering Paradigm to Airworthiness
8.2 The Tornado Airworthiness System Model – Initial Version
8.3 Incident Investigation
8.3.1 Data Collection
8.3.2 Aids to Investigation
8.4 Risk Assessment
8.4.1 Hazard Management vs Functional Resonance Management
8.5 Utility of the TASM for Type Airworthiness Activities
8.6 Utility of the TASM for Continuing Airworthiness Activities
8.7 Utility of TASM for Duty Holder Activity
8.8 Potential Use for System Improvement
8.9 Potential for Further Development of the TASM
8.9.1 Increased Model Fidelity
8.9.2 Application of Bayesian and/or Fuzzy Logic
8.9.3 Expansion into Operational Safety Management
8.10 Chapter Summary
9 CONCLUSIONS
9.1 Summary
9.2 Recommendations
9.2.1 Manage Airworthiness as a Control Problem
9.2.2 Use the TASM to Control the Airworthiness System
9.2.3 Review Airworthiness Risk from a Resilience Perspective
9.2.4 Use FRAM as a Means to Improve System Resilience and Efficiency
9.3 Potential for Further Research and Development
9.4 Concluding Remarks
REFERENCES
Appendix A – TORNADO AIRWORTHINESS FRAM MODEL
Appendix B – TORNADO AIRWORTHINESS MODEL VISUALISATION
Appendix C – PARTICIPANTS BRIEFING SHEET
LIST OF FIGURES
Figure 1-1 - Nimrod MR2 XV230
Figure 1-2 - RAF Tornado GR4 Aircraft
Figure 2-1 Accident Analysis and Risk Assessment Methods
Figure 2-2 Three Tracks on the Evolution of Safety Theory
Figure 2-3 The ‘Cynefin’ Framework – Complexity and Risk Management
Figure 2-4 General Form of a Model of Socio-technical Control
Figure 2-5 The Four Cornerstones of Resilience
Figure 2-6 Conceptual Framework for Resilience Engineering
Figure 2-7 Framework for managing the impact organisation, technology and human factors have on safety management systems
Figure 2-8 FRAM Function
Figure 4-1 FRAM Model Visualisation Demonstrating Taxonomy
Figure 4-2 TASM Step 12 – Screen Capture Showing Applicable Spreadsheet Areas
Figure 4-3 Visualising Functional Output Variability
Figure 4-4 Instances of Functional Output Variability Recorded in Occurrence Reports 2012/13
Figure 4-5 Instances of Reported Functional Output Variability by Function Type
Figure 4-6 Total Instances of Functional Output Variability Recorded in Occurrence Reports 2012/13
Figure 4-7 TASM Step 2 – Screen Capture Showing Applicable Spreadsheet Areas
Figure 4-8 Tracing Output Downstream Dependencies (Screen Capture)
Figure 4-9 Rough Score Matrix
Figure 4-10 Rough Downstream Function Variability Score
Figure 4-11 TASM Step 3 – Screen Capture Showing Applicable Spreadsheet Areas
Figure 4-12 TASM Step 4 – Screen Capture Showing Applicable Spreadsheet Areas
Figure 4-13 Example FRAM for 2 Functions, A and B
Figure 5-1 Visualisation Functional Groupings
Figure 5-2 A Function and Its Aspects
Figure 5-3 Screen Capture of Visualisation Tool with Functions Added
Figure 5-4 Screen Capture of Visualisation Tool with External Dependencies Added
Figure 5-5 5-6 Screen Capture of Visualisation Tool with all Functional Activities Shown
Figure 5-7 Activities and Dependencies Linked to Aspects of the ‘Train Maintenance Personnel’ Function
Figure 5-8 Selecting Layers within Visio – Screen Capture
Figure 5-9 DII Visio Viewer – Screen Capture
Figure 5-10 Visualisation Tool Key
Figure 6-1 Tornado GR4 with Thrust Reversers Deployed
Figure 6-2 Thrust Reverser Incidents Visualisation
Figure 6-3 Propulsion & Electrical System
Figure 6-4 Electrical System Potential Functionally Resonant Activities
Figure 6-5 Instantiation of Thrust Reverse Occurrence Reports
Figure 6-6 Location of Where Lost Pin was Installed
Figure 6-7 General Installation Location of Lost Pin
Figure 6-8 Pin Location in Tool Kit
Figure 6-9 Visualisation Tool Output for Rigging Tool Occurrence
Figure 6-10 Instantiation of Rigging Pin Occurrence
Figure 7-1 Tornado Process for Emergent Airworthiness Issues
Figure 7-2 Current Theoretical Basis for Tornado Airworthiness Risk Management
Figure 7-3 Proposed Functional Resonance Risk Management Theory - Visualisation of a Generic Hazardous Process
Figure 7-4 FRAM Model Risk Assessment Process
Figure 7-5 Operation of Components Beyond Cleared Life - First Stage Risk Visualisation, Excluding Background Functions
Figure 7-6 Visualisation of Hazard Generation Process
Figure 7-7 Visualisation of Potential Accident Processes
Figure 7-8 Proposed Risk Management Process
Figure 8-1 Fractal Property of the FRAM - Function Decomposed into Lower Level Functions
Figure 8-2 TASM Development Pathway
LIST OF TABLES
Table 2-1 Herrera's Ages of Safety Theory
Table 2-2 Benefits and Criticisms of Probabilistic Risk Assessment
Table 2-3 Examples of Resilience Engineering in Practice
Table 3-1 D-ASOR Classifications included in Data
Table 4-1 Example FRAM frame for Fault Diagnosis
Table 4-2 Listing of TASM Functions
Table 4-3 Summary of Internal Variability
Table 4-4 Summary of External Variability
Table 4-5 Example TASM Recording of Step 2a-c for Function 67 - Engine Fleet Monitoring
Table 4-6 Elaborate Description of Output Variability
Table 4-7 Characterising Output Variability – Flight Servicing
Table 4-8 Classifications for Frequency of Output Variability
Table 4-9 Classification of Amplitude of Performance Variability
Table 4-10 Aggregation of Variability for Flight Servicing
Table 4-11 Example of Step 4 - Flight Servicing
Table 6-1 Thrust Reverser Air Safety Occurrence Reports 2012/13
Table 6-2 Thrust Reverse Occurrences with Detailed Investigation
Table 6-3 Thrust Reverser FRAM Instantiation
Table 6-4 FRAM Model of Electrical System
Table 6-5 Electrical System Precondition Variability
Table 6-6 Functional Variability Noted From Investigation
Table 7-1 Configuration Management Aspects
Table 7-2 Summary of Second Stage of Risk Assessment
Table 7-3 Stage 2 - Scheduled Maintenance Function
Table 7-4 Stage 2 - Force and A4 Operations Function (Part 1)
Table 7-5 Stage 2 - Force and A4 Operations Function (Part 2)
Table 7-6 Stage 3 Replacement of Life Limited Parts Function
Table 7-7 Example Accident Generating Function FRAM Frame Layout
Table 7-8 Avionic Flight Systems Output – Baseline FRAM Model
Table 8-1 Utility of TASM for TAA Activities
Table 8-2 Potential CAMO Use of TASM
Table 8-3 Aviation Duty Holder Use of TASM
LIST OF EQUATIONS
Equation 1 – Linear System
Equation 2 – Additive Property
Equation 3 – Homogeneous Property
Equation 4 – Non Linear System; lack of Additive Property
Equation 5 – Non Linear System; lack of Homogeneous Property
Equation 6 – Rough Downstream Function Variability Score
LIST OF ABBREVIATIONS
A4          NATO designator for Logistics/Engineering
AC          Aircraft
AcciMap     Accident Map
ADF         Acceptable Deferred Fault
AEB         Accident Evolution and Barrier Function
AESO        Air Engineering Standing Orders
ALARP       As Low As Reasonably Practicable
ARC         Airworthiness Review Certificate
ATTAC       Aircraft Tornado Transformation Availability Contract
ASIMS       Air Safety Information Management System
ATHEANA     A Technique for Human Error ANAlysis
AWFL        AirWorthiness Flight Limitations
CAM         Continuing Airworthiness Manager
CAMO        Continuing Airworthiness Management Organisation
CAMSS       Continued Airworthiness Management Support Services
CMU         Combined Maintenance and Upgrade Unit
CREAM       Cognitive Reliability Error Analysis Method
CSNI        Committee on the Safety of Nuclear Installations
DAOS        Design Approved Organisation Scheme
DASOR       Defence Air Safety Occurrence Report
DE&S        Defence Equipment and Support
DII         Defence Information Infrastructure
DMS         Dedicated Maintenance System
DO          Design Organisation
DQAFF       Defence Quality Assurance Field Force
EA          Engineering Authority
EngO        Engineer Officer
ETTO        Efficiency Thoroughness Trade Off
FAST        Fast Air Support Team
FMEA        Failure Mode Effects Analysis
FMECA       Failure Mode and Criticality Analysis
FOC         Force Operations Centre
FRAM        Functional Resonance Analysis Method
GSE         Ground Support Equipment
HAS         Hardened Aircraft Shelter
HAZOPS      Hazard and Operability Study
HAZID       Hazard Identification
HCR         Human Cognitive Reliability
HEAT        Human Error Assessment Technique
HERA        Human Error in Air Traffic Management Technique
HFACS       Human Factors Analysis and Classification System
HPES        Human Performance Enhancement System
HRO         High Reliability Organisation
ITEA        Independent Technical Evaluation and Advice
JEngO       Junior Engineering Officer
JSP         Joint Service Publication
LFT         Latest Finish Time
LITS        Logistics Information Technology System
LOAA        Letter Of Airworthiness Authority
MAA         Military Airworthiness Authority
MAOS        Maintenance Approved Organisation Scheme
MAP-01      Manual of Airworthiness Processes - 01
MERMOS      Méthode d’Evaluation de la Réalisation des Missions Opérateur pour la Sûreté
MOD         Ministry of Defence
MMD         Man Made Disaster
MORT        Maintenance Oversight and Risk Tree
MSG         Maintenance Steering Group
MRP         MAA Regulatory Publications
MTO         Man-Technology-Organisation
MWO         Maintenance Work Order
NAT         Normal Accident Theory
NATO        North Atlantic Treaty Organisation
NETMA       NATO Eurofighter and Tornado Management Agency
OOPS        Out Of Phase Servicing
OSI         Occurrence Safety Investigation
ORG         Occurrence Review Group
PST         Propulsion Support Team
PT          Project Team
QA          Quality Assurance
QMS         Quality Management System
R2          2nd Line Repair
RA          Regulatory Article
RAF         Royal Air Force
RCA         Root Cause Analysis
ROCET       RB199 Operational Contract for Engine Transformation
RTS         Release To Service
RTSA        Release To Service Authority
SEngO       Senior Engineering Officer
SI(T)       Special Instruction (Technical)
SQEP        Suitably Qualified and Experienced Person
STAMP       Systems Theoretic Accident Model
STANEVAL    STANdards EVALuation
STEP        Sequential Timed Event Plotting
TAA         Type Airworthiness Authority
TAP         Technical Assistance Process
TASM        Tornado Airworthiness System Model
TGRF        Tornado Ground Attack & Reconnaissance Force
THERP       Technique for Human Reliability Analysis
TME         Testing and Measuring Equipment
TRACEr      Technique for The Retrospective Analysis of Cognitive Error
TSEMP       Tornado Safety and Environmental Management Plan
1 INTRODUCTION
“I can see him now, fighting with the controls, trying his best…
…We want some justice and the MOD to sit up and take notice, what they have
done could have been avoided; we live in hope that they will not let this happen
in the future.”
Mrs Adele Squires, Wife of Flight Lieutenant Al Squires, Captain of the Nimrod
aircraft XV230
1.1 Introduction
Air accidents have shown that aircraft are sometimes not as safe or as
airworthy as was previously imagined. With huge resources applied to ensuring
airworthiness, why do accidents still occur? It is often said that such accidents
could have been prevented if the lessons of the past had been heeded. Yet despite many
investigations and recommendations, accidents still occur. Why is this so? Are
existing tools for safety analysis inadequately applied or inadequate in and of
themselves? How can those charged with responsibility for complex
hazardous systems in industry, transportation or the military work better to
prevent accidents and yet still achieve their operational objectives?
Design engineers are duty bound to demonstrate that their system may initially
be operated without an unacceptable level of harm. Thereafter, the system must
be maintained and continually monitored for the increase of risk beyond
acceptable levels. For organisations that have very few if any accidents, this is
a major challenge; how can the risk of something that has not happened be
measured and managed? In the aviation domain, airworthiness is a property
that requires continual management and assessment. Whilst the property is
attributable to the materiel itself, aircraft systems experience almost constant
contact with humans and thus airworthiness is inherently bound up with the
humans who manage, operate and maintain aircraft. Aircraft and their
supporting organisations are complex ‘socio-technical systems’. Resilience
engineering is a new concept that provides insight into this relationship and
offers useful models and tools for better management of the safety of such
systems. If complex socio-technical systems managing airworthiness are better
understood, then perhaps future accidents will be prevented.
1.2 Background – Theories of Safety
There have been accidents involving complex systems ever since the industrial
revolution. The management of safety has consequently been a concern since
these times but a theoretical basis for safety did not emerge until the 1930s.
Herrera (2012) divides the development of safety theory into 4 overlapping ages:
the ages of technology, human factors, organisational safety
and complexity. The age of technology dealt with the design of machines and
why they fail whereas human factors has traditionally been concerned with why
humans fail to do what is expected of them. Organisational safety has been
concerned with the safe management of potentially hazardous enterprises and
how these fail – Reason’s (1997) famous ‘Swiss Cheese’ being the pre-eminent
model in the field. However detailed the taxonomies of failure in the first 3 ages
were, the associated models of accident causation have been linear. An
emerging 4th age of safety theory is that of complexity. In complexity theories,
accident causation models are non-linear and are sometimes said to be
intractable. The term Resilience Engineering has come to encompass the use
of these models; the practice seeks to develop socio-technological systems that
are resilient against those variations in system performance which may cause
accidents. Airworthiness has been mostly associated with technological safety
theory, with reliability and safety assessment methods such as fault tree
analysis dominating the thinking of designers and regulators. Although design
for human factors has been an issue since the 1940s, the field has generally
been concerned with operator performance; human factors in maintenance has
only more recently come into the spotlight (Reason and Hobbs, 2003). The
ability of aircraft operating authorities and regulators to maintain continuing
airworthiness has been the subject of analysis from an organisational safety
standpoint due to accidents such as Alaska Airlines Flight 261 (Woltjer, 2007). More
recently accidents such as Air France 447 (Stoop, 2013) have shown that
unexpected results can emerge from increasingly complex systems and that a
lack of resilience can be fatal.
1.3 Background – The Practical Requirement
This research will specifically address the management of airworthiness within
the United Kingdom’s military. On 2 September 2006 the UK military
suffered its single largest loss of life since the 1982 Falklands War, when a
Royal Air Force (RAF) Nimrod MR2 aircraft was destroyed near Kandahar,
Afghanistan. This was not the consequence of a hostile act or the outcome of
operator error. It was an accident caused by a failure to establish the correct
level of initial airworthiness through the design of modifications and thereafter a
failure to maintain continuing airworthiness in the condition of fuel and hot air
systems. The independent inquiry into the incident identified that the deeper
causes were organizational and managerial (Haddon-Cave, 2009).
Figure 1-1 - Nimrod MR2 XV230 (McKenzie, 2012)
As a consequence of the recommendations made by the Nimrod Review,
military airworthiness management has been comprehensively overhauled as
part of a reorganisation of ‘Air Safety’ within the Ministry of Defence (MOD). The
previously “byzantine” (Haddon-Cave, 2009) regulation of air safety has been
simplified through the establishment of the Military Aviation Authority (MAA).
Key to the new system has been the establishment of a chain of ‘duty holders’
who are named senior military officers with legal responsibility for the safety of
aircraft operated by their organisation. Duty Holders rely on Type Airworthiness
Authorities (TAA) and Continuing Airworthiness Managers (CAM) to ensure
that the airworthiness of their aircraft is adequately established and maintained.
In practice this is achieved through a variety of processes aimed at managing
the risk of a technical failure. There is an engineering programme to maintain
the integrity of the system’s initial airworthiness whilst developing the system’s
capability and also a maintenance programme specified by the Engineering
Authority (EA) (reporting to the TAA) and implemented by the Continuing
Airworthiness Management Organisation (CAMO). In common with most socio-
technological systems, these processes do not operate exactly as designed or
documented. Particular concerns centre on the human factors within the
maintenance programme and whether or not appropriate engineering
‘standards and practices’ can be ensured in the face of pressures to produce
operational output within an increasingly lean front line organisation. These are
typical examples of the Efficiency-Thoroughness Trade-Off (ETTO) principle
highlighted within the Resilience Engineering literature (Hollnagel et al, 2007).
Practical experience of the messy realities of military aircraft operations and
back-office airworthiness assessment was the genesis of the research aim. This
research uses the RAF’s Tornado Ground Attack Reconnaissance Force
(TGRF) as a case study.
Figure 1-2 - RAF Tornado GR4 Aircraft (Crown Copyright, 2009)
1.4 What is ‘Airworthiness Management’?
Large organisations exist to maintain, modify, provide resources, operate and
monitor aircraft fleets in order to keep them airworthy. The way in which this
multitude of functions is carried out has a variety of effects on the aircraft
system and the property of airworthiness. Those responsible for airworthiness
can only manage it indirectly by managing the functioning of the organisation.
This is achieved by means of tasking maintenance or setting policy, defining an
organisational structure (including contracting out elements), providing
resources and conducting quality assurance. So whilst making engineering
assessments and specifying what physical actions are to be carried out on an
aircraft system is critical, the management of airworthiness is a wider
endeavour.
1.5 The Research Aim
The aim of this thesis is:
To apply resilience engineering concepts by producing a system
model of an airworthiness management organisation in order to
provide a tool to improve management of airworthiness.
1.6 Objectives
In order to achieve the aim the following research objectives were established:
• Review the theoretical background to safety management and the
implications for airworthiness management.
• Review the concepts of Resilience Engineering with an emphasis on
applying it to airworthiness management.
• Establish a theoretical framework for a model of an airworthiness
management system.
• Gather and use primary research data to establish and validate a model
of the airworthiness management system for the RAF Tornado Force.
• Using the model, develop a tool to enhance the airworthiness
management system of the RAF Tornado Force.
1.7 Methodology Overview
A literature review of resilience engineering was carried out, which branched out
into source disciplines of systems thinking and engineering; control theory; non-
linear dynamics and complexity theory. A search for work in this area
addressing airworthiness or technical safety in other domains was conducted.
For the Tornado case study, the safety, airworthiness and assurance plans of
the various elements of the organisation were examined. Resilience
engineering provides a number of modelling techniques that could be applied to
the case study; these were assessed and down selected to the Functional
Resonance Analysis Method (FRAM). The system was assessed by semi-
structured interviews with key personnel as well as using a large amount of
information and experience gained from working within the system. The FRAM
Model was built within a spreadsheet and a separate model visualisation tool
was created using Microsoft Visio. This allowed for the identification of various
potential leading indicators for system safety. In order to validate the FRAM
model, specific case studies were required. Two incident reports and an
emergent airworthiness risk were selected for analysis.
1.8 Descriptions and Definitions
For simplicity the standard terminology as described within MAA02 – Military
Aviation Authority Master Glossary (MAA, 2012) is adopted for this thesis.
There are a number of minor differences in emphasis between terms used here
and in civil aviation or other domains; these are discussed where relevant.
1.9 Thesis Structure
This thesis is structured around the research objectives:
• Chapter 2 describes the theoretical foundations for resilience engineering
in the context of the other theories of safety and safety engineering
practice in other domains. Potentially useful models are analysed.
• Chapter 3 details the methodology for carrying out the primary research.
• Chapter 4 describes the process for building the case study FRAM Model
– the Tornado Airworthiness System Model.
• Chapter 5 describes the development of the FRAM visualisation tool.
• Chapter 6 discusses how the FRAM Model may be used for incident
analysis with reference to two examples.
• Chapter 7 gives a process for, and an example of, using the FRAM Model
as a risk assessment tool.
• Chapter 8 provides a general discussion of the case study exercise,
focussing on the applicability of Resilience Engineering to aspects of
airworthiness practice.
• Chapter 9 provides some conclusions.
2 LITERATURE REVIEW
The literature review will examine arguments for broadening the scope of
airworthiness to address the complexities of managing modern aircraft,
maintenance and support organisations. Existing notions of cause, failure and
hazards are challenged as the theoretical background to resilience engineering
is described. Models and methods for understanding and managing the safety
and airworthiness of complex systems are examined using the paradigm of
resilience engineering.
2.1 Airworthiness in the Context of Safety
There are a number of definitions for the term airworthiness; all these have at
their core the need for the aircraft to be able to be operated in safety or, as the
MAA has it, ‘without significant hazard’. Hazard is further defined as ‘an
intermediate state where the potential for harm exists’ (MAA, 2012b). The
hazard is said to lie between a cause (such as a technical or human failure) and
an accident. So whilst airworthiness is clearly a target for aerospace design
organisations to meet through satisfaction of certification standards, it is also an
element of system safety that requires management throughout the lifecycle of
the system. It is analogous to ‘technical safety’ in other domains, which is
often separated from ‘operational’ or ‘occupational’ safety.
2.1.1 Accident Investigations
The need to investigate loss of life or near misses is both a pragmatic and moral
choice. The conclusions drawn from such investigations are extremely
important at a human level but also critical to restoring system safety. It is
therefore vital for accident investigators to use mental and procedural models
that reflect the complexity of modern technologies. One of the largest accident
investigation agencies, the National Transportation Safety Board (NTSB),
determines a ‘probable cause’ in all its reports (Johnson and Holloway, 2004)
but ICAO recommends that ‘causes’ (plural) are determined (ICAO, 2001).
This indicates a governing accident chain theory in the former organisation but
perhaps a slightly more sophisticated model in the latter. Various writers (De
Landre et al., 2006; Coury et al., 2008) have proposed models or frameworks
in which multiple causes can be described in accident investigation. Much has
been written about the intersection between legal frameworks and accident
investigation methodologies. Dekker (2003) for example has described the
detrimental effect of the adversarial nature of justice. The rest of this chapter will
describe how assigning ‘root’ or probable cause to accidents is potentially
unhelpful in the context of complex systems. It follows therefore that notions of
blame or individual responsibility are often problematic to apply.
2.1.2 Initial and Type Airworthiness
Much of the airworthiness of a system is ‘designed-in’ before manufacture. This
involves specifications, systems configuration and assumptions on support and
maintenance philosophy. A structured systems engineering approach to safety
as described in ARP 4761 (SAE, 1996) is used to convince regulators that a
type certificate can be issued. The evolution of safety requirements and
regulation over a system’s lifecycle causes difficulty (Kelly and McDermid,
1999). Military aircraft in particular are often retained in service for many
decades. Whilst the technology may remain relatively constant, experience
shows that it is usual for operational usage to evolve over the course of the
lifecycle. For this reason it is important to regularly adjust, validate and reassess
airworthiness assessments if the type airworthiness of a design is to be
maintained.
2.1.3 Safety Management
For many complex systems, the development of safety cases is a mandatory
requirement (MoD, 2007) and in particular for military airworthiness this is
governed by MAA Regulatory Article 1205 (MAA, 2013). The concept of a
safety case is the presentation or collation of a body of evidence to assure
interested parties that the system is safe. This body of evidence is collected and
organised according to mental or procedural models. The theoretical basis for
these models is the same set of theories of safety described below. Safety
management systems are similarly structured according to the prevailing
theoretical approach to safety. An evolution in modelling requires an evolved
approach to safety management.
2.1.4 Continuing Airworthiness
Continuing airworthiness relates to the maintenance of a particular, safe system
state for each of the individual aircraft being managed (MAA, 2012b). Given that
it is never possible to comprehensively inspect/audit each aircraft before every
flight, there must be assumptions made as to the effect of organisational and
human interactions with the aircraft so as to maintain the system in a safe state.
Understanding maintenance system performance is critical to assuring
continued airworthiness. This is achieved through a Continuing Airworthiness
Management Organisation (CAMO) which provides assurance that its specified
tasks are being undertaken successfully. This is primarily achieved through a
quality assurance system, which ensures that rigorous processes are
established (Casey, 2013).
2.2 A History of Safety Theory
Chapter One sketched out a chronological view of ‘Ages’ of safety theory. New
theories tend to gain traction as a result of the investigation of major accidents.
Herrera (2012) describes how safety theory has evolved across technological,
human factors, organisational and complexity ‘ages’, identifying key accidents
and ideas on a time line, which is summarised in Table 2-1:
Table 2-1 Herrera's Ages of Safety Theory
Time | Accidents | Models, methods and concepts across the "Ages" of Safety Theory (Technology, Human Factors, Organisational, Complexity)
1930s | | Domino Model
1940 - 50s | | Failure Mode Effects Analysis (FMEA); Human Factors Design; Task Analysis
1960s | Aberfan Colliery Disaster | Fault Tree Analysis (FTA) - Minute-Man Missiles & Boeing aircraft; Energy Barrier Model; Technique for Human Error Rate Prediction
1970s | Flixborough & Seveso Chemical Plants; Tenerife Aircraft Collision; Three Mile Island Nuclear Plant | Probabilistic Risk Assessment (WASH-1400 Reactor Safety Study); Hazard & Operability Analysis; Energy Damage and Countermeasure Strategies; Man Made Disaster; Information Perspective
1980s | Bhopal Chemical Plant; Challenger Space Shuttle; Chernobyl Nuclear Plant; Kings Cross Railway; Piper Alpha Oil & Gas; Dryden Aviation | Crew Resource Management; Safety Culture; Swiss Cheese Model; Normal Accident Theory
1990s | Warsaw Air Crash; Iraq Friendly Fire; Cali Air Crash; Ariane 5 - Space; Norne Air Crash; Longford Oil & Gas | Mandatory Safety Cases (UK); Normal Deviations; Man, Technology and Organisation Concept; Drift into Failure; Risk Influence Model; High Reliability Organisations
2000s | Überlingen Air Crash; Columbia Space Shuttle; Helios Airways; Texas City Refinery; Nimrod Air Crash; Air France 447; Deepwater Horizon | Human Factors Analysis & Classification System; Failure of Leadership, Culture & Priorities; Aviation Safety Management Systems; Resilience Engineering; Theory of Practical Drift
Leonhardt et al (2009) present a breakdown of safety methodologies within a
Resilience Engineering White Paper. This document describes Technical,
Human Factors, Organisational and Systemic accident analysis and risk
assessment methods. Systemic models/methods are those that have recently
emerged to provide a means of analysing safety from a ‘complexity’ standpoint.
These are shown chronologically in Figure 2-1 with an expansion of each
abbreviation available within the glossary.
Figure 2-1 Accident Analysis and Risk Assessment Methods (Leonhardt et al,
2009)
Saleh et al (2010) present a slightly different narrative in the development of
safety theory. Whilst they note most of the same key ideas and developments,
they identify three tracks in safety theory leading towards the modern ‘system
and control theoretic’. These are illustrated below:
Figure 2-2 Three Tracks on the Evolution of Safety Theory (Saleh et al., 2010)
The tracks are not exhaustive and there is some cross coupling between ideas.
Herrera’s (2012) technological age can be likened to the middle track, the
defence in depth track is comparable to the organisational age whilst the top
track has many human factors elements but takes much from the current ‘age of
complexity’. The current state of the art is given as a systems engineering-control
theory approach. Saleh et al. (2010) acknowledge that the literature in the
field is particularly fractured. This is perhaps because the various theories
emanate from disparate fields such as psychology, reliability, operations studies
and management.
2.2.1 Technological Age – Governing Philosophy
The predominant theme in the technological age of safety theory is that of a
‘chain of causation’; first visualised as a set of toppling dominos by Heinrich
(1950). Each domino represented a factor in the accident: Management
controls; failure of a man; unsafe acts or mechanical conditions; the accident;
injury. Once the first domino was toppled, removal of any of the others would
prevent the final injury domino toppling. Related to this is the concept of an
accident or event chain, where causative elements or events link together to
form a chain, which if it had been broken would have prevented the accident. It
is unclear where this idea originated; it is perhaps a reflection that a linear view
of the world still represents the defining popular narrative for any major
accident. Leveson (2011) links this to an erroneous assumption that there is
always a cause for any given accident.
2.2.2 Technological Age – Tools
The notion of a linear event chain gave rise to methods of analysing system
safety or the related property of reliability. The Fault Tree Analysis (FTA)
methodologies were developed from reliability studies of the American
Minuteman missile system and quickly developed into a methodology for
analysing safety by defining the probability of an unsafe condition developing
(Herrera, 2012). Closely associated are event trees, which define hierarchies of
events following a single initiating event (such as an unsafe condition). These
analyses use stochastic methods to forecast top level probabilities for accidents
caused by single or multiple failures lower down in the system. There is always
a mathematical audit trail from the top level system safety target, for example
hull loss probability in commercial aviation, down to individual system or
component reliability data or predictions. Importantly, modern system safety
assessments contain more qualitative information based on expert
understanding of systems, carried out through Functional Hazard Assessments
(FHAs) (Dalton, 1996). When analysing accidents using event chain type
models such as FTA, there is a question of how far back it is appropriate to go
in order to find an initiating event. Leveson (2011) argues that selection of
initiating events is often arbitrary in accident analysis. It has been accepted in a
large number of major accident reports that management commitment to safety
or ‘safety culture’ is a key factor in risk of accident (Dekker, 2005), yet there is
no clear way in which these vital considerations can be fitted into an event chain
model. Reason (1997) espouses a version of the event chain in the famous
‘Swiss cheese’ model of organisational accidents. Reason’s cheese has
become the de-facto mental model for understanding safety and accidents
within the military aviation community, as articles in the RAF’s in-house safety
magazine Air Clues demonstrate (Anon, 2011; Gale et al., 2013).
Whilst Haddon-Cave’s (2009) investigation into Nimrod addresses issues of
culture and complexity, his view of causation is essentially linear. Leveson
(2011) outlines why linear accident models of the technological age such as the
Swiss Cheese are no longer considered acceptable:
• Direct Causality – there is a reliance on the notion that there is always a
linear relationship in which event A causes event B.
• Subjectivity in Selecting Events – The backward chain of events is
often shown to stop for a number of arbitrary reasons, which could
include familiarity with a particular event in the sequence (“We’ve seen
this before”), deviation from a standard (a component operates outside its
specification) or a lack of information (such as inability to understand a
human performance issue).
• Subjectivity in Selecting Chaining Conditions – It is often not clear
which factors caused each other.
• Discounting System Factors – Event chain models generally deal with
proximate causes and do not deal with issues such as culture or
organisational pressures which can pervade a socio-technical
system.
A useful example of how this approach to accident analysis can prove
disastrous is given by Leveson (2011). She notes how an incident in which a
DC-10 lost a cargo door (without loss of life) was attributed to the failure of a
baggage handler to close the door properly rather than to a design flaw. As a
result, two years later a similar failure led to the complete loss of a DC-10
near Paris in 1974.
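The quantitative side of the fault tree approach described in this section can be illustrated with a minimal sketch. It assumes that all basic events are independent, so that an AND gate multiplies probabilities and an OR gate is one minus the product of the complements. The tree structure, event names and probability values are hypothetical and are not taken from any Tornado analysis.

```python
from math import prod


def and_gate(probabilities):
    """Probability that all independent input events occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic event probabilities per flight hour (illustrative only).
pump_a_fails = 1e-3
pump_b_fails = 1e-3
control_unit_fails = 1e-5

# Top event: loss of supply = (pump A AND pump B fail) OR control unit fails.
both_pumps_fail = and_gate([pump_a_fails, pump_b_fails])
loss_of_supply = or_gate([both_pumps_fail, control_unit_fails])

print(f"P(both pumps fail) = {both_pumps_fail:.3e}")
print(f"P(loss of supply)  = {loss_of_supply:.3e}")
```

The arithmetic gives the 'mathematical audit trail' from component data to the top level event, but only the events that the analyst chose to draw into the tree contribute to the result.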
2.2.3 Limits of Probabilistic Risk Assessment
Both civil and military airworthiness certification standards require certain safety
targets to be met. These targets are expressed in terms of probabilities,
principally probability of hull loss and death of passengers or crew; for military
aircraft this is specified in Regulatory Article 1230 – Design Safety Targets
(MAA, 2012a). There are various other targets regarding risk of harm to third
parties or other unsafe conditions – these are operating risks. Operating risks
are also commonly assigned qualitative risk levels; in the case of military
aviation this process is specified in Regulatory Article 1210 – Management of
Operating Risk to Life (MAA, 2012a). This regulation advises Platform
Operators and Project Teams to make use of Fault Tree Analysis to enable
calculation of these risks. For some UK military platforms this has resulted in
the introduction of ‘Loss Models’ to guide the assessment of new or emergent
risks. In the case of Tornado, the Loss Model (Sugden, 2011) is not a tool that
can be used in isolation for predictive risk assessment; rather it uses incident
statistics to provide a current picture of loss rates across the fleet (Woodbridge,
2012). The regulation and recommended practice (SAE, 2010; Lloyd and Tye,
1982) for both civil and military airworthiness and safety targets call for the use of
fault tree and dependency diagram models. These methods of probabilistic risk
assessment (PRA) are linear, which usefully provides for aggregation of total
risk. There are however a variety of issues to consider in their use. Apostolakis
(2004) provides a summary of some of the benefits and criticisms of PRA.
However, in the case of airworthiness certification risk assessments the process
is generally based on a qualitative assessment of Functional Hazard Analysis
(FHA). FHA allows expert subjective analysis to provide an element of linkage
between various hazards. Equally Common Cause Analysis (CCA)
methodologies go some way to accounting for system-wide failure mechanisms.
The literature on resilience engineering disputes Apostolakis’ (2004) claim that
PRA deals effectively with true complexity.
Table 2-2 Benefits and Criticisms of Probabilistic Risk Assessment (Apostolakis, 2004)
Benefits:
• Multiple failures considered.
• Increases likelihood of spotting complex failure interactions.
• Facilitates communication.
• Integrated approach.
• Identifies unknown areas for research.
• Focuses risk management activity on key areas.
Criticisms:
• Human actions during accident scenarios cannot be modelled.
• Difficulty of quantifying software failures.
• Cannot model safety culture.
• Difficulty estimating design and manufacturing errors.
PRA models are essentially a product of the ‘technical era’ of safety science;
they assume linear behaviour and that the systems being analysed are
tractable and thus decomposable into independent subsystems. This remains the
de-facto approach to managing most complex socio-technical systems and
forms the basis of the safety case approach prevalent within many regulatory
environments. The fundamental assumptions that justify their use are
questionable when applied to complex socio-technical systems. The principal
concern is that the human element cannot be satisfactorily modelled using
Boolean logic. In systems where there are frequent interactions with humans,
whether operators, maintainers or design or support engineers, this presents the
possibility that common cause failures will be built into the system and that the
relationships will be non-linear.
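The sensitivity of a PRA result to the independence assumption can be shown with a short sketch. It uses the beta-factor model, a common simplification in reliability engineering that is not a method used in this thesis: a fraction beta of each channel's failure probability is assumed to come from a cause shared by both channels (for example, the same maintenance error applied to both). The function name and all values are hypothetical.

```python
def redundant_pair_failure(p, beta=0.0):
    """Approximate failure probability of a two-channel redundant system.

    p    -- total failure probability of each channel (hypothetical value)
    beta -- assumed fraction of p that is due to a shared (common) cause
    Uses the rare-event approximation: the two contributions are simply added.
    """
    independent_part = ((1.0 - beta) * p) ** 2  # both channels fail separately
    common_part = beta * p                      # one shared cause defeats both
    return independent_part + common_part


p = 1e-3  # hypothetical per-channel failure probability
for beta in (0.0, 0.01, 0.1):
    result = redundant_pair_failure(p, beta)
    print(f"beta={beta:<5} P(system fails) = {result:.2e}")
```

Under these assumed numbers, attributing just 1% of channel failures to a shared cause raises the system failure probability by roughly a factor of ten relative to the fully independent case, which is why the decomposition into independent subsystems matters so much.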
2.2.4 Human Factors
Herrera (2012) outlines how 20th century disasters such as Three Mile Island
and Flixborough showed that the event chain models were becoming
inadequate – the focus began to shift to human failing, with the human identified
as the number one unreliable component in the event chain. Herrera (2012)
highlights two trends in the age of human factors: studies concerned with
eliminating human error by design for human performance and studies into how
humans cope with disturbances.
2.2.5 Organisational
‘Man Made Disaster’ theory was the initiating scholarly theory behind
organisational accident theory (Saleh et al., 2010). This theory noted that within
a certain class of events known as ‘man made disasters’ there were multiple
event chains that reached a long way back into the past and that management and
organisation were key factors in causing accidents. Saleh et al. (2010) also note
‘Normal Accident Theory’ and ‘High Reliability Organisations’ as key precepts of
the organizational accident. Normal accident theory notes that there are tight
couplings between interacting causal factors in complex system accidents and
that they cannot be predicted. This has been condemned as a somewhat
fatalistic view. Herrera (2012) sees High Reliability Organisation Theory as a
counter to Normal Accident Theory. This characterises successful organisations
as those operating complex systems with a very small number of accidents.
Saleh et al. (2010) note that the research highlights a number of common
characteristics of such organisations such as:
• Preoccupation with failure and organizational learning.
• Commitment to and consensus on production and safety as concomitant
organizational goals.
• Organizational slack and redundancy.
These facets of successfully safe or high reliability organisations correspond to
aspects of ‘safety culture’ as described by Reason (1997) and others.
2.3 Complexity
Aircraft are complicated machines; they have many components interacting in a
multitude of combinations. Dekker (2011) holds that analytic reduction, as
practised within traditional linear safety analysis, is unable to describe how
system elements and processes behave when exposed to multiple
simultaneous influences. He also describes the key distinction between a
complicated system such as an aircraft, which could conceivably be
disassembled then reassembled by a single person, and complex systems. A
complex system is one where the boundaries are ‘fussy’ (require highly detailed
definition) and the structure is intractable; an aircraft operated subject to human
factors, culture, regulatory and organisational factors is therefore complex.
Cilliers (2005) defines complex systems as those having the following
properties:
• Large numbers of simple elements.
• Dynamic, propagating and non-linear interactions; these define
behaviour which is emergent and cannot be understood by inspection of
components nor predicted by deterministic methods.
• Open, exchanging energy and information with the environment.
• Memory is distributed within the system, influencing behaviour.
• Adaptive behaviour; without the intervention of external agents.
This study assumes that the complete aircraft system, incorporating its
operation and support, is complex rather than simply complicated. It could also
be argued that the addition of extensive software within aircraft renders the
system complex. The safety management system and airworthiness
management in particular must deal with complexity.
For those charged with managing the safety of complex systems, understanding
models for accidents and studying post mortem analyses of accidents does not
present a comprehensive approach to prevention. It is generally accepted that
events, hazards and risks often combine in unexpected ways. Is it therefore
adequate to manage safety risk as a game of ‘whack-a-mole’; eliminating or
mitigating risks as and when they become apparent (Zarboutis and Wright,
2006)?
It may be argued that a proactive reporting culture does much to allow
elimination or mitigation of risks before they materialise. Heinrich’s (1950)
‘iceberg’ model drives much of this effort to uncover previously unknown risk and
there is an indisputable logic which says that knowing about a risk is a first step
to eliminating or managing it. The continued history of complex accidents tells
us that this approach may never be completely effective in preventing
unexpected failure (Hollnagel, 2007). Leveson (2011) explains that the concept
of a High Reliability Organisation confuses notions of safety and reliability. Just
because individual components of a socio-technical system can be proven to be
individually reliable it does not follow that safety will necessarily emerge as a
system property. Systems may be reliable yet unsafe, such as the NASA Mars
lander which crashed because the designer failed to anticipate the interaction
between the software and mechanical systems. Equally, it is possible for a
system to be unreliable yet safe where its elements fail safe.
2.3.1 Complexity Theory
Accident investigation or analysis of complex system failure requires a mental
model to be applied to the accident scenario (Hollnagel, 2011). Similarly
accident prevention through risk management uses modelling to understand
potential accidents. Hitchins (2003) describes how complexity is relative to the
observer’s frame of reference. Modelling complex systems requires judgement
as to the extent of elaboration or its converse, encapsulation. He proposes that
systems derive their degree of complexity from their variety, connectedness and
disorder. Socio-technical systems are increasing in complexity as a result of the
increased use of networks. Manson (2001) provides a useful review of
complexity theory, most of the branches of which have an antecedent in general
systems theory. Three main branches of complexity theory are identified.
‘Algorithmic complexity’ holds that complexity is defined by the difficulty in
describing system characteristics. ‘Deterministic complexity’ deals with chaos or
catastrophe theories, which posit that stable complex systems may suddenly
become unstable. ‘Aggregate complexity’ deals with how elements interact to
produce complexity. A key property of complex systems is that of emergence,
which describes how system-wide characteristics cannot be computed by
aggregating the behaviour of system components. Zarboutis and Wright (2006)
highlight that patterns can emerge from complex socio-technical systems which
erode the resilience of those systems. Grøtan et al (2011) give a good account of the
theoretical foundations of complexity and how they can be applied to risk
assessment; the ‘Cynefin’ Framework provides a summary.
Figure 2-3 The ‘Cynefin’ Framework – Complexity and Risk Management (Grøtan
et al., 2011)
Generally the literature shows that whilst linear thinking has reached its limits
within system safety science, complexity theory has yet to be completely
applied to the problem. Zarboutis (2006) identifies how complexity theories can
be used to replace HAZOPS type safety analyses. The key inputs should be:
• How can system entities co-adapt?
• What will the probable effect be on the whole?
• How can such patterns be eliminated?
The output of such an analysis should therefore be some means of avoiding the
emergent harmful properties. Dekker (2011) advises that complexity theories
can be applied to accident investigation if the search for a single cause is
dropped and multiple narratives are allowed to overlap and on occasion
contradict each other. The nature of complexity defies analysis; Cilliers (2005)
writes on the ‘incompressibility’ of complex systems, in that the only reliable
model of a complex system is that which has the same level of detail as the
system itself. Clearly this is impractical, yet as any model will involve
simplification, disregarded elements may have non-linear effects and the
magnitude of the potential outcomes may be non-trivial. However, Cilliers (2005)
also states that whilst modelling and computing complex systems will never be
sufficient, it is still necessary.
2.3.2 Systems Thinking and Systems Engineering
The concept of a system is well-established with roots in philosophy and
thermodynamic theories leading to theories and practice surrounding systems
engineering. Hitchens (2003) provides one definition:
A system is an open set of complementary, interacting parts with properties,
capabilities and behaviours emerging both from the parts and their interactions.
The concept of emergence is an important one; accidents are emergent system
states of disorder. Systems engineering involves the generation of models to
represent a system (Oliver et al., 1997). Leveson (2011) first describes how
safety ought to fit into systems engineering’s primary activities – Needs
Analysis, Feasibility studies, Trade studies, System architecture development
and Interface analysis. This is the basis for system safety assessments
employed in generating evidence for airworthiness certification as per ARP
4761 (Dalton, 1996). Saleh (2010) distinguishes between failure modes
attributable to component failure and those failures attributable to emergent or
interactive failures; his thesis is that a systems theoretic approach addresses
this second set of failures. However he raises concerns that formal systems
theoretic approaches such as co-ordinatability and consistency in hierarchical
and multilevel systems are yet to be fully applied to safety analysis. Leveson’s
(2011) Systems-Theoretic Accident Model and Processes (STAMP) uses
control theory and processes as the key to prevention of accidents. It
decomposes the system across the complete lifecycle, from concept to
disposal, into a series of control loops. The key to prevention of accidents is
said to be keeping the entire system in a state of equilibrium, which is achieved
by applying constraints to implement control. The model is said to more
effectively deal with software than traditional notions of failure. STAMP utilises
descriptions of control loops at technological subsystem level, human controller
level and socio-technical organisation level, shown in Figure 2-4. STAMP uses
a taxonomy of control loop failure modes as an audit check list. Salmon et al
(2012) compare STAMP to other models, concluding that STAMP provides a
more comprehensive system description but it is difficult to incorporate human
failures into the model, which itself needs a highly developed understanding of
the whole system. This highlights the difficulty in applying theoretically strong
models of complexity to particular scenarios.
Figure 2-4 General Form of a Model of Socio-technical Control (Leveson,
2011)
2.3.3 Control Theory
STAMP (Leveson, 2011) suggests that safety can be treated as a control
engineering problem and Saleh (2010) identifies this idea as an important
corollary to the development of a systems thinking approach to safety.
Kontogiannis and Malakis (2012a) describe how the concept of a model with
control loops is fundamental to systems safety incorporating human and
organisational factors. Hollnagel and Woods (2005) produced an Extended
Control Model (ECOM) which describes generically how organisational
processes transfer downwards to interact directly with and control the technological
system and hence alter its state. The Viable System Model (VSM) uses
cybernetic principles to describe how safety goals are transferred downwards
through an organisation and how output is controlled by various measures such
as audit (Espejo, 1989). Kontogiannis and Malakis (2012a) combine these two
models and apply them to studying the accident involving the crash of flight
AEW-241 in December 1997. As is the case for many control and systems models
in the safety literature, Kontogiannis and Malakis (2012a) highlight the difficulty
of applying the models for the purposes of accident prevention. Kontogiannis
(2012b) also tries to apply these principles in a case study involving emergency
helicopter operations.
2.3.4 Non-Linear Dynamics
Control of complex socio-technical systems needs to address the problem of
non-linear behaviour. Bendat (1998) describes how physical and engineering
systems can be divided into linear and non-linear systems. A system $H$ is linear
if, for any inputs $x_1$ and $x_2$ and for any constants $c_1$, $c_2$,
Equation 1 – Linear System (Bendat, 1998)
\[ H[c_1 x_1 + c_2 x_2] = c_1 H[x_1] + c_2 H[x_2] \]
This leads to 2 properties:
Equation 2 - Additive Property (Bendat, 1998)
\[ H[x_1 + x_2] = H[x_1] + H[x_2] \]
Equation 3 – Homogeneous Property (Bendat, 1998)
\[ H[c x] = c H[x] \]
A non-linear system is therefore one where,
Equation 4 – Non Linear System; lack of Additive Property (Bendat, 1998)
\[ H[x_1 + x_2] \neq H[x_1] + H[x_2] \]
Equation 5 – Non Linear System; lack of Homogeneous Property (Bendat, 1998)
\[ H[c x] \neq c H[x] \]
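The additive and homogeneous properties can be checked numerically for any candidate system. The following is a minimal sketch only: the two systems (a pure gain and a gain with an added cubic term) and the test values are illustrative assumptions and are not drawn from Bendat (1998).

```python
import math


def linear_system(x):
    # Hypothetical linear system: a pure gain of 3
    return 3.0 * x


def nonlinear_system(x):
    # Hypothetical non-linear system: the same gain plus a cubic term
    return 3.0 * x + 0.5 * x ** 3


def is_additive(system, x1, x2):
    # Additive property: response to a sum equals the sum of the responses
    return math.isclose(system(x1 + x2), system(x1) + system(x2), abs_tol=1e-9)


def is_homogeneous(system, x, c):
    # Homogeneous property: scaling the input scales the output by the same constant
    return math.isclose(system(c * x), c * system(x), abs_tol=1e-9)


# Illustrative inputs and constant (assumed values)
x1, x2, c = 1.5, -0.7, 4.0

linear_ok = is_additive(linear_system, x1, x2) and is_homogeneous(linear_system, x1, c)
nonlinear_ok = is_additive(nonlinear_system, x1, x2) and is_homogeneous(nonlinear_system, x1, c)

print("linear system satisfies both properties:", linear_ok)
print("non-linear system satisfies both properties:", nonlinear_ok)
```

A single pair of inputs can only disprove linearity; passing the check for one pair does not prove that a system is linear over its whole operating range.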
This means that a linear system whose input is random data with a Gaussian
(normal) probability density function will produce an output that also has a
Gaussian probability density function. Bendat (1998) also makes the point that any
physical system will display non-linear properties if the input conditions are
suitably wide. Just as this is true for numerous examples in flight dynamics, it is
also true for various instances in safety and reliability, where oversimplifying
assumptions are made regarding the condition of equipment and its interaction
with maintenance and operating organisations. Human behaviour often defies
mathematical modelling due to its complexity and non-linear properties. As
previously described, it is common for safety analyses and models to assume
linear behaviour. In fact complex socio-technical systems generally exhibit a
lack of additive and homogeneous properties, where different inputs combine to
produce unexpected and ‘out-of-control’ outputs resulting in accidents. This
explains some of the difficulties encountered in producing a workable approach
to human and organisational reliability, as outlined by Rasmussen (1997).
Non-linear effects explain the concept of emergence: the behaviour of linear
systems is predictable and tractable, yet non-linear systems produce
unexpected results. Grøtan et al (2011) outline how this leads to the concept of
‘Black Swan’ events that are unexpected and have a huge impact, such as a
catastrophic accident with a complex system. These are understandable in
retrospect but could not have been predicted. Leveson (2011) describes how
such accidents are the result of non-linear interactions between components of
the system, whether human, organisational or technological. The key to
developing an improved method of managing safety and estimating risk will be
to understand and predict these non-linear interactions.
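The point about Gaussian inputs can be illustrated with a short simulation. This is a sketch under stated assumptions: the linear system is taken to be a simple gain, the non-linear system a squaring operation, and sample skewness is used as a rough indicator of departure from a Gaussian shape. None of these choices comes from the sources cited above.

```python
import random
import statistics


def skewness(values):
    # Sample skewness: close to zero for a symmetric (e.g. Gaussian) distribution
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


random.seed(1)  # fixed seed so the sketch is repeatable

# Gaussian input data (assumed zero mean, unit standard deviation)
inputs = [random.gauss(0.0, 1.0) for _ in range(50_000)]

# Hypothetical linear system: output is the input multiplied by a constant gain
linear_out = [2.0 * x for x in inputs]

# Hypothetical non-linear system: output is the square of the input
nonlinear_out = [x ** 2 for x in inputs]

skew_linear = skewness(linear_out)
skew_nonlinear = skewness(nonlinear_out)

print("skewness of linear output:", round(skew_linear, 3))
print("skewness of non-linear output:", round(skew_nonlinear, 3))
```

The linear output stays symmetric about its mean, whereas the non-linear output is strongly skewed, so statistics that assume a Gaussian output would misjudge the likelihood of extreme values.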
2.4 Resilience Engineering
The theory of resilience engineering is emerging as a response to the problems
posed to safety management and engineering by complexity theory and the age
of the organisational accident as described by Reason (1997). The central
theme is to move from a focus on failure, where notions of component reliability
are applied to complex systems, humans and organisations, to looking at how
systems can succeed under varying conditions. The literature on the subject is
somewhat fragmented, although a series of books has been published which
brings together the key ideas. One of the aviation organisations embracing
resilience engineering is EUROCONTROL, a multinational air traffic
management service provider; Leonhardt et al (2009) published a white
paper on the application of resilience engineering within the organisation. This
illustrates that there is a blurred line between ‘traditional resilience’ study as
applied to infrastructure, and resilience engineering which has emerged from
the study of safety. Hollnagel et al (2011) give a simple definition of resilience:
“Resilience is the intrinsic ability of a system to adjust its functioning prior to,
during, or following changes and disturbances, so that it can sustain required
operations under both expected and unexpected conditions.”
Woods and Hollnagel (2007) set the scene for resilience engineering. They
outline fundamentals which include a shift away from the traditional safety focus
on ‘what went wrong’ (hindsight) and what could go wrong (risk assessment) to
a focus on ‘what can go right’ for risk assessment and ‘what did go right’ for
accident analysis – also neatly summarised by Schafer (2012). Resilience
engineering also rejects the notion of human failure, error taxonomies and
reliability analysis of complex systems in favour of a theory that failures
represent either the breakdown in strategies for coping with complexity, or an
unfavourable combination of functional variability within a system (technological,
human or organisational). In resilience engineering, safety is redefined as the
ability to succeed under varying conditions. By observing how systems work
under everyday pressures, it should be possible to understand the level of
resilience in a system and how it might be engineered to increase this quality.
For the purposes of both accident investigation and risk assessment it is
necessary to move away from linear combinations of events to an
understanding of how a system might lose its dynamic stability and veer into an
accident trajectory (Hollnagel et al., 2007). In summary, there are four key
precepts to Resilience Engineering:
1. Performance conditions are always underspecified. Individuals
and organisations must therefore adjust what they do to match current
demands and resources. Because resources and time are finite, such
adjustments will inevitably be approximate.
2. Some adverse events can be attributed to a breakdown or
malfunctioning of components and normal system functions, but others
cannot. The latter can best be understood as the result of unexpected
combinations of performance variability.
3. Safety management cannot be based exclusively on hindsight,
nor rely on error tabulation and the calculation of failure probabilities.
Safety management must be proactive as well as reactive.
4. Safety cannot be isolated from the core (business) process, or
vice versa. Safety is the prerequisite for productivity, and productivity is
the prerequisite for safety. Safety must therefore be achieved by
improvements rather than by constraints.
These precepts define a theoretical approach drawn from various ideas about
organisational accidents and safety culture. The key development is the focus
on the functions within the system and the emphasis on improving their
combined performance, rather than a focus on the potential sources of hazards
and barriers for accident prevention. This positive standpoint is a key attraction
to the approach; the drive for operational performance improvement and safety
can be in synergy rather than in conflict. Hollnagel (2011) gives four
cornerstones to the practice of resilience engineering. The first is knowing what
to do to respond to everyday disturbances – the actual. The second is knowing
how to monitor potential threats from the environment and from the functioning
of the system itself – the critical. The third part of the practice is knowing what to
expect in terms of threats and opportunities in order to address the potential.
Finally, the fourth ‘cornerstone’ is that of the ability to address the factual
through learning.
Figure 2-5 The Four Cornerstones of Resilience (Hollnagel, 2007): Responding (Actual), knowing what to do; Monitoring (Critical), knowing what to look for; Anticipating (Potential), knowing what to expect; Learning (Factual), knowing what has happened.
A slightly different conceptual framework for Resilience Engineering is
presented by Madni (2009), offering more concrete requirements for
operationalising the practice:
Figure 2-6 Conceptual Framework for Resilience Engineering (Madni, 2009)
2.4.1 Resilience Engineering as a Successor to Safety Management
Leonhardt et al (2009) put the resilience engineering approach to safety
management simply:
The more likely it is that something goes right, the less likely it is that it goes
wrong.
Cambon (2006) provides a resilience framework for assessing safety
management systems and proposes a number of metrics based on Tripod
theory, which essentially measure the performance conditions under which the
SMS operates. The balance of these performance conditions is said to
determine the stability of the SMS. ‘Engineering’ implies design, and
Beauchamp (2006) notes how this can be achieved through organisational
learning to provide organisational resilience; a model for guidance is provided.
Zarboutis (2006) describes how, analogous to Rasmussen’s (1997) approach to
organisational drift, resilience engineering can identify symptoms of an erosion
in resilience. Johansson (2008) provides a ‘quick and dirty’ approach to
evaluating resilience in systems; this gives a helpful overview but does not
prescribe specific improvement or change activities. Stoker (2008) outlines a
comprehensive approach to the assessment of operational resilience,
effectively specifying a goal-based hierarchy for elements contributing to
resilience and producing a checklist approach. Whilst this is undoubtedly a
valuable activity, it is questionable whether it will be able to deal with the
emergence of safety issues.
2.4.2 Under Specification of Performance Conditions
Under-specification of performance conditions, that is, the factors that affect the
execution of a particular function, is a key concept in the literature (Hollnagel,
2007). In most organisations performance conditions are subject to control
through rules, with the idea that this will improve safety. Hale (2013) reviews the
literature on this, noting that there are two approaches: a classical top-down
approach that punishes transgression, and a bottom-up approach that
sees expert ability to adapt to changing circumstances as paramount.
Nathanael (2006) notes that it is impossible to make what happens in practice
match that which is espoused by officialdom; the key to generating resilience is
dialogue between the hierarchical levels.
2.4.3 Performance Variability
Resilience engineering regards performance variability as inherently useful; it
allows operations to continue in underspecified conditions. It also provides the
potential for coupling between functions where upstream performance variability
combines with downstream performance variability to grow in amplitude. This
phenomenon can be harnessed for system success or else it provides an origin
for safety risk (Hollnagel, 2012).
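The way upstream and downstream variability can combine may be sketched with a simple simulation. This is purely illustrative: FRAM itself is a qualitative method, and the coupling rule used here (the spread of the downstream function widens in proportion to the size of the upstream deviation), together with every numerical value, is an assumption made for the example rather than anything taken from Hollnagel (2012).

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

N = 20_000        # number of simulated task executions
BASE_SD = 1.0     # assumed everyday variability of each function
COUPLING = 1.0    # assumed strength of the upstream-downstream coupling
TOLERANCE = 3.0   # assumed limit beyond which the output is unacceptable


def downstream_output(upstream_deviation, coupling):
    # Assumed rule: a larger upstream deviation degrades the conditions for the
    # downstream function, so its own variability widens
    spread = BASE_SD * (1.0 + coupling * abs(upstream_deviation))
    return random.gauss(0.0, spread)


def exceedance_rate(values):
    # Fraction of executions whose deviation falls outside the tolerance
    return sum(abs(v) > TOLERANCE for v in values) / len(values)


upstream = [random.gauss(0.0, BASE_SD) for _ in range(N)]
uncoupled = [downstream_output(u, 0.0) for u in upstream]
coupled = [downstream_output(u, COUPLING) for u in upstream]

print("spread, uncoupled vs coupled:",
      round(statistics.pstdev(uncoupled), 2), round(statistics.pstdev(coupled), 2))
print("exceedance rate, uncoupled vs coupled:",
      round(exceedance_rate(uncoupled), 4), round(exceedance_rate(coupled), 4))
```

Neither function is 'failing' in this sketch; the out-of-tolerance outputs arise only from the combination of two sources of ordinary variability.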
2.4.4 Examples of Resilience Engineering in Practice
Resilience engineering is more theoretical than its name suggests, and
the practicality of implementing its precepts remains the subject of much
discussion. However, its principles can be found in evidence where it was not
specifically applied. Table 2-3 provides a brief summary of some examples.
Table 2-3 Examples of Resilience Engineering in Practice
Industry | Tools | Insights
Process Industry | Survey of workforce using Principal Component Analysis | Shirali et al. (2013) attempt quantitative measurement of resilience at an organisational level. Only possible to measure the potential for resilience rather than resilience itself. The following variables are given as indicators: top management commitment; just culture; learning culture; awareness and opacity; preparedness; flexibility.
Process Industry | Bayesian Networks; Resilience Dashboard (not currently achievable) | Pasman et al. (2013) define a holistic control methodology for plant safety using leading indicators derived from process measurements within the plant. Also use of process simulation tools to develop scenarios. Traditional HAZOP/FMEA analyses do not capture all potential accident scenarios. Key points: technical resilience can be measured/simulated, organisational factors less so; importance of leading indicators to enable response to variations; difficulty in dealing with drift in safety metrics; safety gains made through interdepartmental cooperation vs common cause failures; advocate extensive use of bow-ties.
Aviation | Interviews, audit and expert analysis | An investigation into both the sources of resilience and sources of brittleness. Comparison of two comparable small air carriers. Identification through extensive interviews. Resilience and brittleness categorised and risk assessed (Saurin and Carim Junior, 2012).
Air Traffic Management | FRAM | Analysis of a mid-air collision fatal accident. Provides notes on buffering capacity, flexibility, margins, tolerance and cross-scale interactions. There was no root cause; aircraft and ATM were operating normally. The system was inadequate (de Carvalho, 2011).
Aviation | Bayesian Belief Networks (BBN) | Examines the use of and qualification of experts to provide probability estimates for BBN. Hidden common causes in BBN, principally safety culture. Difficulty in estimating frequencies or probabilities of rare events. BBN assume the ‘Causal Markov Condition’, therefore common cause failures are difficult to deal with; applying BBN to FRAM may solve this issue (Brooker, 2011).
Aviation | FRAM | Alaska Airlines flight 261 accident analysed to understand FRAM’s performance against 5 key resilience characteristics: buffering capacity, flexibility, margin, tolerance, and cross-scale interactions (Woltjer, 2007).
Railways | FRAM | Interdisciplinary safety analysis of complex socio-technological systems based on the Functional Resonance Accident Model: an application to railway traffic supervision (Belmonte et al., 2011).
Nuclear | FRAM | Specific case study surrounding a task to move nuclear fuel; a specific task analysis rather than a generic system approach (Lundberg, 2008).
2.4.5 Criticism of Resilience Engineering
Oxstrand and Sylvander (2010) argue that Resilience Engineering is little more
than a rebranding of safety culture; they do not see how the practice can be
applied to the nuclear industry, which already uses both PRA and human
reliability analyses in the licensing of nuclear plants. In this industry, it is argued,
safety culture forms part of every operation. The nuclear industry defines safety
culture as:
“Safety Culture is that assembly of characteristics and attitudes in organisations
and individuals which establishes that, as an overriding priority, nuclear plant
safety issues receive the attention warranted by their significance.”
International Atomic Energy Agency (Edwards et al., 2013)
Clearly safety culture is fundamental to engineering resilience into a
socio-technical system. The theory of safety culture does not in and of itself
propose a different conceptual framework for the origin of unsafe system performance.
Also, some safety culture literature describes a requirement for safety to become
the overriding priority for an organisation (Edwards et al., 2013). Clearly this is
at odds with notions of efficiency-thoroughness trade-offs and the requirement
to increase the proportion of activities that ‘go right’ as a means for reducing the
number that ‘go wrong’. Whilst Resilience Engineering draws on much of the
theory around safety culture, it goes a lot further in proposing ways in which
organisations can be designed, analysed and modified in order to deliver
resilience. Le Coze (2013) describes a number of criticisms of Resilience
Engineering, the foremost amongst these being scepticism over the need to
introduce a new vocabulary to safety science. He also notes that the social
concept of power is missing from the resilience literature, although it could be
argued that the exercise of social power could be modelled as a function or a
resource. He also notes that many have questioned whether resilience
engineering presents anything new, arguing that it simply
connects a number of existing ideas, foremost of which is the High Reliability
Organisation concept. He does note that the proof of the concept will be in its
application to real systems, testing the worth of the ‘engineering’ aspect of the
theory. McDonald (2008) asserts that Resilience Engineering is attractive
because other models are weak. He notes that the theory needs to be further
unified and demonstrated in practical examples.
2.4.6 Resilience Engineering and Airworthiness
Current MAA (2011a) policy is based on the idea that airworthiness is made up
of four pillars: the safety management system, compliance with recognised
standards, competence (of people and organisations) and independent
assessment. All of these activities and qualities are likely to contribute to the
resilience of an airworthiness system. Wilson (2008) provides a system model
for resilience of an airworthiness system and presents a number of key ideas:
• The requirement for ‘organisational mindfulness’ – a safety culture keen
to seek out areas of risk.
• Balancing ALARP principles with ‘And Still Stay In Business’, which could
be thought of as an efficiency-thoroughness trade-off, as per Hollnagel
(2011).
• Understand how the organisational boundaries contribute to safety;
dealing with outsourcing, partnering and regulation.
• Translate strategies into management frameworks for managing
organisational risk – these can be represented by ‘framework diagrams’
that show the factors that impact on safety management systems.
This work was succeeded by a thesis by Wilson (2012) which produced a
framework called RISK2VALUE, an integrated management
framework and decision support tool kit which addresses both safety and value
management at an organisational level. A generic diagram, shown at Figure 2-7,
is provided to support decisions; its use is illustrated by means of an
extensive diagram mapping various relationships. The strength of this approach
is that it can either provide a generic approach to an audit of airworthiness or
guide the construction of a new system. Equally it provides an assessment of
socio-technical factors surrounding accidents. A criticism that could be levelled
at the tool is that the linkages between the elements are not explicitly defined
and it is therefore unclear how changes would influence the path that the
organisation took through the diagram.
Figure 2-7 Framework for managing the impact organisation, technology and human factors have on safety management systems (Wilson, 2008)
2.4.7 Lean Resilience
Leonhardt et al (2009) note that modern business systems are largely premised on
‘just-in-time’ processes. This methodology increases efficiency and
consequently coupling between upstream and downstream functions. Individual
system boundaries are more difficult to define as, for example, maintenance
units become increasingly tightly dependent on supply chains. Carney (2010)
urged caution in the introduction of lean principles and envisaged a hybrid
between lean maintenance and a more traditional model. Resilience
engineering in other domains has shown that it is in fact possible to harness the
approach to introduce production improvement alongside safety (Hounsgaard,
2013). Lean methodology is profoundly linear in its thinking (Carney, 2010); this
methodology is easily deployable in a highly tractable system such as a
production line. In less tractable systems such as maintenance it is likely that
Resilience Engineering techniques will produce better results.
2.5 Functional Resonance Analysis Method
The resilience engineering literature lacks specific methodologies or tools for
practical implementation of resilience engineering principles. The notable
exception is Hollnagel’s (2012) Functional Resonance Analysis Method
(FRAM). This is a technique for building models of complex socio-technological
systems. It differs from STAMP in that it is a method for generating a model
rather than a model in itself. FRAM maps the system as a series of functions, defined
by their various ‘aspects’ and linked ‘activities’.
Figure 2-8 FRAM Function (a function with six aspects: Input (I), Output (O), Preconditions (P), Resources (R), Time (T) and Control (C))
By analysing the output variability from each function and the extent to which
this variability is damped up-stream, it is possible to begin to understand how to
analyse system performance from a resilience engineering point of view. The
FRAM forms the basis of the case study in later chapters and is described in
detail in Chapter 4.
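As an aid to reading the later chapters, the structure of a FRAM function can be sketched as a simple data structure. This is a minimal illustration only and is not the spreadsheet model developed in this study: the class name, the three example functions and their aspect entries are all hypothetical. A coupling is identified wherever the Output of one function appears under another aspect of a different function.

```python
from dataclasses import dataclass, field

# The six aspects of a FRAM function
ASPECTS = ("input", "output", "precondition", "resource", "time", "control")


@dataclass
class FramFunction:
    name: str
    # Each aspect holds the set of named items attached to it
    aspects: dict = field(default_factory=lambda: {a: set() for a in ASPECTS})

    def add(self, aspect, item):
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect}")
        self.aspects[aspect].add(item)


def couplings(functions):
    """List (upstream, item, downstream, aspect) wherever an Output of one
    function matches an item on a non-output aspect of another function."""
    links = []
    for up in functions:
        for item in sorted(up.aspects["output"]):
            for down in functions:
                if down is up:
                    continue
                for aspect in ASPECTS:
                    if aspect != "output" and item in down.aspects[aspect]:
                        links.append((up.name, item, down.name, aspect))
    return links


# Hypothetical functions, for illustration only
issue = FramFunction("Issue tooling")
issue.add("output", "Calibrated tools")

rectify = FramFunction("Rectify fault")
rectify.add("resource", "Calibrated tools")
rectify.add("output", "Aircraft serviceable")

release = FramFunction("Release aircraft")
release.add("input", "Aircraft serviceable")

links = couplings([issue, rectify, release])
for link in links:
    print(link)
```

Representing couplings by matching item names keeps the model flat: functions are never wired together explicitly, so new functions can be added without editing existing ones.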
2.6 Quantifying Resilience
Most approaches to quantifying resilience rely on surveys and audit approaches
such as those described by Shirali et al. (2013) or by Saurin and Carim Junior
(2012). However, whilst
an overall system assessment is of value, system managers are interested in
particular risks and being able to quantify them and manage them towards
ALARP levels, as required by legislation. Within process industries a high
degree of automation can be achieved with intensive data collection and
monitoring. These aspects mean that it is comparatively easy to run simulations
and model different systems. Risks can therefore be assessed in a more
quantifiable manner (Pasman et al., 2013). A reliability approach to safety is easily
quantifiable through linear decomposition to produce probabilistic risk
assessment. By contrast it is much more difficult to provide quantitative
assessment using a resilience engineering approach. Luxhøj (2003) and
Williams (1996) present Bayesian Belief Networks as a potential solution to low
probability – high consequence risks. Slater (2013) has presented an approach
to nesting BBN within a FRAM model and hence providing a way of quantifying
risk analysis developed through FRAM. He presents this technique as an
alternative to HAZOPS for use in the process and transport industries. Brooker
(2011) analyses BBN in the aviation domain, focusing specifically on the ability
of experts to provide accurate assessments of probability in the case of low
probability events. He notes the ‘Causal Markov Condition’, which is an
assumption in BBN that there is no common cause failure mode across the
network; issues such as ‘safety culture’ are therefore difficult to address. Other
potential techniques for quantification are the use of fuzzy logic or fuzzy set
theory with the use of Monte Carlo simulation (Shirali et al., 2013). An approach to
quantifying resilience in the context of civil infrastructure is presented by Vugrin
(2009), providing a menu of control engineering methodologies that may be
suitable. The issue of data collection in more human-centric systems remains a
barrier to expansion of this method. Quantification is the key if Resilience
Engineering is going to gain ground against more traditional risk assessment
techniques.
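The difficulty that an unmodelled common cause creates for quantification can be shown with a small worked calculation. This is a sketch under assumed figures: two barriers are each more likely to fail when a shared condition (for example, a weak safety culture) is present, and every probability below is hypothetical rather than drawn from Brooker (2011) or any other source.

```python
import math

# Hypothetical probabilities, chosen only to illustrate the effect
P_CAUSE = 0.10             # shared condition (e.g. weak safety culture) is present
P_FAIL_GIVEN_CAUSE = 0.20  # each barrier fails when the condition is present
P_FAIL_OTHERWISE = 0.01    # each barrier fails when the condition is absent

# Marginal failure probability of a single barrier
p_fail = P_CAUSE * P_FAIL_GIVEN_CAUSE + (1 - P_CAUSE) * P_FAIL_OTHERWISE

# Joint failure if the two barriers are wrongly treated as independent
p_both_independent = p_fail ** 2

# Joint failure when the shared condition is modelled explicitly: the barriers
# are independent only once the state of the common cause is known
p_both_actual = (P_CAUSE * P_FAIL_GIVEN_CAUSE ** 2
                 + (1 - P_CAUSE) * P_FAIL_OTHERWISE ** 2)

underestimate_factor = p_both_actual / p_both_independent

print("single barrier failure probability:", round(p_fail, 5))
print("both fail, assuming independence:", round(p_both_independent, 6))
print("both fail, common cause modelled:", round(p_both_actual, 6))
print("independence underestimates by a factor of:", round(underestimate_factor, 2))
```

With these assumed figures, treating the barriers as independent understates the chance of both failing by several times, which is why hidden common causes matter for any probabilistic treatment of resilience.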
2.7 Concluding Remarks
The various ages of safety theory were all products of the technology of their
time. Now in an age characterised by networked technology it is clearly time to
fully address notions of complexity for the purpose of providing safe systems.
This is certainly the case for the new generation of civil and military aircraft.
Resilience Engineering appears to offer a different approach to previous
theories and models. In particular the notion that accidents emerge from
unforeseen combinations of varying functional performance is a powerful one. It
offers the prospect that analysis from this perspective might provide risk insights
that may otherwise be missed. It also rings true from experience within an
airworthiness environment. Notions of ‘accident trajectories’ and holes in
processes or defences do not resonate in the same way. There is an
opportunity to combine efforts in process improvement and efficiency with
safety strategies. Resilience engineering offers the theoretical framework and
FRAM provides a potential method. This will be explored in subsequent
sections. It remains the case however that there is some way to go to
operationalize Resilience Engineering; Madni (2009) lists the key issues:
• Help organizational decision makers in making trade-offs between
severe production pressures, required safety levels and acceptable risk.
• Measure organizational resilience.
• Identify ways to engineer the resilience of organizations.
The following chapters outline a case study in which this approach is tested.
3 METHODOLOGY
3.1 Introduction
In order to meet the research aim it was necessary to choose a technique with
which to model an airworthiness management system. The literature review
revealed that the Functional Resonance Analysis Method (FRAM) was the best
way to practically apply resilience engineering principles. The FRAM therefore
formed the basis of the practical element of the research. A single case study
organisation was used, with an aspiration of delivering an operationally useful
tool to the organisation at the end of the project. The case study was conducted
in two stages:
• Stage 1 – Construct a FRAM Model of the Airworthiness Management
System and concurrently develop a visualisation tool.
• Stage 2 – Test the model using scenarios drawn from occurrence
reporting and potential in-service airworthiness risks.
The model was developed iteratively, using expert opinion and data from a
variety of sources.
3.2 Working Arrangements
A key difficulty reported by other FRAM practitioners has been understanding
‘work as done’ rather than ‘work as imagined’. This was mitigated by conducting
the research from within the case study organisation on a part time basis, whilst
working within the Force Operations Centre. Moreover, this was preceded by nine
years’ work in other roles in military airworthiness, including quality assurance,
process improvement and error investigation roles. This provided insight into
‘work as done’ practice. Whilst there was a risk of bias, this was mitigated to
some extent through exposing parts of the model to other workers within the
organisation for verification.
3.3 Research Interviews
Semi-structured interviews were conducted with 19 different workers across all
of the functions. The interviews were flexibly arranged at the interviewees’ work
locations (generally offices but control rooms and tool stores were also visited). A
pre-briefing was provided in the form of a two sided A4 document, shown at
Appendix C. The average interview duration was around 30 minutes, giving a
rough total of around nine and a half hours of interview time over the course of
the project. The general interview structure was as follows:
• Check understanding and clarify scope of the study.
• Confirm that the participant was currently engaged in the function as part of
their daily activity.
• Check accuracy of each of the function aspects.
• Open questioning to highlight particular areas of variability in the
‘aspects’ of the function.
• Open questions to ascertain whether any aspects had been missed.
• Open questions to ascertain whether the participant’s work covered any
further relevant functions.
The following research interviews were conducted:
• Deputy Continuing Airworthiness Manager.
• Engineering Authority – various team members.
• Military Airworthiness Review Certificate team member.
• Continuing Airworthiness Management Organisation Quality Manager.
• Experienced Aircraft Technician at Inspector Level.
• Tornado Forward Fleet Manager.
• Front Line Squadron Senior and Junior Engineering Officers.
• Front Line Squadron Rectification Controller, Line Controller, Weapons,
Mechanical and Avionic Trade Managers, with additional contributions
from various mechanics, technicians, supervisors and inspectors.
• Tool Stores Controller.
• Rolls-Royce Technical Support Manager.
• BAES Technical Support Manager.
• BAES Reliability Engineering Manager.
• Depth Workshops Supervisor.
• Ground Support Equipment Trade Manager.
• Station Air Safety Officer.
3.4 Model Development
Most, if not all, risk assessment or incident investigation methods require
practitioners to be trained in the application of the technique and are generally
most effectively applied in teams (e.g. HAZOPS, Safety Panels, etc.). Time and
resources precluded this approach for the case study; however, insight from a
number of other practitioners’ case studies was gained through attendance at
the annual FRAM Workshop in Munich. Whilst Hollnagel’s (2012) guidelines for
FRAM model development were followed, the final Tornado Airworthiness
System Model used a number of innovative approaches. The main innovation
was the use of a Microsoft Visio drawing to provide an interactive ‘visualisation’
tool. This approach allowed the creation of a much larger model than has been
recorded to date in the literature. The visualisation tool was developed
concurrently with the spreadsheet model, which allowed for greater accuracy by
cross-checking between the two methods of describing the model. The final
model contains a total of 69 individual functions with 985 individual aspects
described. Where inconsistencies in the model became apparent or there was a
gap in knowledge, a variety of experts were consulted to provide additional
information through conversation or correspondence. In particular, various key
meetings were attended which provided insights that assisted with model
development:
• Force Operations Centre Daily Summary.
• Joint Qualifications and Trials Meeting.
• Level B Capability Programme Reviews.
• Various Upgrade Readiness Reviews.
• Fleet Planning Meetings.
• Scheduled Maintenance Reviews.
• Depth HQ Value Stream Analysis – Continuous Improvement Event.
• Mission Essential Equipment Continuous Improvement Event.
• Air Safety Occurrence Investigators Workshop.
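The cross-checking between the spreadsheet model and the visualisation described in this section can be illustrated with a short sketch. It is not the tool used in the study, which was built in a spreadsheet and Microsoft Visio; it is a minimal Python illustration that assumes each description of the model can be exported as rows of (function, aspect type, description). The six aspect types are those defined for a FRAM function by Hollnagel (2012). The function names, aspect descriptions and the helpers build, count_aspects and cross_check are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass, field

# The six aspect types of a FRAM function (Hollnagel, 2012).
ASPECT_TYPES = ("Input", "Output", "Precondition", "Resource", "Control", "Time")


@dataclass
class FramFunction:
    """One FRAM function with free-text descriptions under each aspect type."""
    name: str
    aspects: dict = field(default_factory=lambda: {t: [] for t in ASPECT_TYPES})

    def add_aspect(self, aspect_type, description):
        if aspect_type not in ASPECT_TYPES:
            raise ValueError(f"Unknown aspect type: {aspect_type}")
        self.aspects[aspect_type].append(description)

    def aspect_set(self):
        """All (aspect type, description) pairs, used for comparison."""
        return {(t, d) for t, items in self.aspects.items() for d in items}


def build(rows):
    """Build a model (name -> FramFunction) from (function, type, description) rows."""
    model = {}
    for name, aspect_type, description in rows:
        model.setdefault(name, FramFunction(name)).add_aspect(aspect_type, description)
    return model


def count_aspects(model):
    """Total number of distinct aspects described across all functions."""
    return sum(len(f.aspect_set()) for f in model.values())


def cross_check(spreadsheet, drawing):
    """List every function or aspect present in one description but not the other."""
    issues = []
    for name in sorted(set(spreadsheet) | set(drawing)):
        if name not in drawing:
            issues.append(f"{name}: missing from drawing")
        elif name not in spreadsheet:
            issues.append(f"{name}: missing from spreadsheet")
        else:
            a = spreadsheet[name].aspect_set()
            b = drawing[name].aspect_set()
            for aspect_type, desc in sorted(a ^ b):
                where = "spreadsheet" if (aspect_type, desc) in a else "drawing"
                issues.append(f"{name}: {aspect_type} '{desc}' only in {where}")
    return issues


# Hypothetical example data; these are not functions taken from the actual model.
spreadsheet = build([
    ("Rectify fault", "Input", "Fault reported"),
    ("Rectify fault", "Resource", "Calibrated tools"),
    ("Rectify fault", "Output", "Aircraft serviceable"),
    ("Issue tools", "Output", "Calibrated tools"),
])
drawing = build([
    ("Rectify fault", "Input", "Fault reported"),
    ("Rectify fault", "Output", "Aircraft serviceable"),
    ("Issue tools", "Output", "Calibrated tools"),
])

issues = cross_check(spreadsheet, drawing)
for issue in issues:
    print(issue)
```

In this example the drawing omits one Resource aspect, so the check reports a single discrepancy for the analyst to resolve. The same counting helper could be used to confirm totals such as the number of functions and aspects held in each description.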
A_TOMCZYNSKI_MAA-DISS-19-A__665187 FINAL
  • 9. v TABLE OF CONTENTS ABSTRACT ......................................................................................................... i ACKNOWLEDGEMENTS...................................................................................iii LIST OF FIGURES.............................................................................................ix LIST OF TABLES..............................................................................................xii LIST OF EQUATIONS......................................................................................xiv LIST OF ABBREVIATIONS...............................................................................xv 1 INTRODUCTION............................................................................................. 1 1.1 Introduction ............................................................................................... 1 1.2 Background – Theories of Safety.............................................................. 2 1.3 Background – The Practical Requirement ................................................ 3 1.4 What is ‘Airworthiness Management’?...................................................... 5 1.5 The Research Aim .................................................................................... 5 1.6 Objectives................................................................................................. 5 1.7 Methodology Overview ............................................................................. 6 1.8 Descriptions and Definitions ..................................................................... 6 1.9 Thesis Structure........................................................................................ 6 2 LITERATURE REVIEW................................................................................... 9 2.1 Airworthiness in the Context of Safety ...................................................... 
9 2.1.1 Accident Investigations....................................................................... 9 2.1.2 Initial and Type Airworthiness .......................................................... 10 2.1.3 Safety Management ......................................................................... 10 2.1.4 Continuing Airworthiness.................................................................. 11 2.2 A History of Safety Theory...................................................................... 11 2.2.1 Technological Age – Governing Philosophy..................................... 14 2.2.2 Technological Age – Tools............................................................... 14 2.2.3 Limits of Probabilistic Risk Assessment ........................................... 16 2.2.4 Human Factors................................................................................. 18 2.2.5 Organisational.................................................................................. 18 2.3 Complexity .............................................................................................. 19 2.3.1 Complexity Theory ........................................................................... 20 2.3.2 Systems Thinking and Systems Engineering ................................... 22 2.3.3 Control Theory ................................................................................. 24 2.3.4 Non-Linear Dynamics....................................................................... 25 2.4 Resilience Engineering ........................................................................... 26 2.4.1 Resilience Engineering as a Successor to Safety Management ...... 30 2.4.2 Under Specification of Performance Conditions............................... 30 2.4.3 Performance Variability .................................................................... 31 2.4.4 Examples of Resilience Engineering in Practice .............................. 
31 2.4.5 Criticism of Resilience Engineering.................................................. 33 2.4.6 Resilience Engineering and Airworthiness ....................................... 34
  • 10. vi 2.4.7 Lean Resilience................................................................................ 38 2.5 Functional Resonance Analysis Method................................................. 38 2.6 Quantifying Resilience ............................................................................ 39 2.7 Concluding Remarks............................................................................... 40 3 METHODOLOGY.......................................................................................... 41 3.1 Introduction ............................................................................................. 41 3.2 Working Arrangements ........................................................................... 41 3.3 Research Interviews ............................................................................... 41 3.4 Model Development................................................................................ 43 3.5 Air Safety Information Management System Data .................................. 44 3.5.1 Data Extraction................................................................................. 44 3.5.2 Assignment of Related Functions to Incidents ................................. 46 4 BUILDING THE TORNADO AIRWORTHINESS SYSTEM MODEL USING THE FUNCTIONAL RESONANCE ANALYSIS METHOD................................ 47 4.1 Basic Principles ...................................................................................... 47 4.2 Taxonomy............................................................................................... 48 4.3 FRAM Step 0 – Recognise the Purpose of the FRAM Analysis.............. 50 4.4 FRAM Step 1a – Identify and Describe the Initial Function List. ............. 51 4.5 FRAM Step 1b – Verify Functions with Experts ...................................... 53 4.6 Step 2 – Identification of Output Variability ............................................. 
56 4.7 Step 2a – Identify the Type of Function .................................................. 56 4.8 Step 2b – Identify Internal Sources of Output Variability......................... 59 4.9 Step 2c – Identify External Sources of Output Variability........................ 60 4.10 Step 2d – Most Likely Dimension of Output Variability.......................... 61 4.11 Step 3 – Aggregation of Variability........................................................ 65 4.12 Step 4 – Consequences of the Analysis ............................................... 71 4.12.1 Step 4a – Damping Factors............................................................ 71 4.12.2 Step 4b Performance Indicators..................................................... 71 4.13 Summary of TASM Layout.................................................................... 74 5 TORNADO AIRWORTHINESS SYSTEM MODEL VISUALISATION TOOL . 77 5.1 Need for the Tool .................................................................................... 77 5.2 Microsoft Visio ........................................................................................ 77 5.3 Building the Tool ..................................................................................... 77 5.3.1 General Functional Areas................................................................. 77 5.3.2 Functions.......................................................................................... 79 5.3.3 External Dependencies .................................................................... 82 5.3.4 Functional Activities.......................................................................... 84 5.4 Exploiting the Tool .................................................................................. 85 5.5 Summary ................................................................................................ 
89 6 USING THE TORNADO AIRWORTHINESS SYSTEM MODEL FOR INCIDENT ANALYSIS...................................................................................... 93 6.1 Case for Using FRAM for Incident Modelling .......................................... 93
  • 11. vii 6.2 Incident One – Thrust Reverser Incidents............................................... 94 6.2.1 Description of Incidents.................................................................... 95 6.2.2 Summary of the Investigations ......................................................... 96 6.2.3 Instantiation of the FRAM Model ...................................................... 98 6.2.4 The Sources of Variability .............................................................. 102 6.2.5 Insights from TASM........................................................................ 108 6.3 Incident 2 – Missing Rigging Pin........................................................... 111 6.3.1 Description of Incident.................................................................... 111 6.3.2 Summary of Investigation............................................................... 112 6.3.3 Instantiation of the TASM............................................................... 116 6.3.4 Insights from TASM........................................................................ 119 7 USING THE TORNADO AIRWORTHINESS SYSTEM MODEL FOR RISK ANALYSIS...................................................................................................... 121 7.1 Case for Using TASM for Risk Analysis................................................ 121 7.2 Current Theoretical Basis for Airworthiness Risk Management ............ 123 7.3 Proposal of FRAM Based Airworthiness Risk Theory........................... 124 7.4 Proposal for a FRAM Based Risk Assessment Process....................... 127 7.5 Risk Example – Operation of Components in Excess of Cleared Life .. 131 7.5.1 Generating a FRAM Model Risk Assessment ................................ 131 7.5.2 Insights into Risk ............................................................................ 144 7.6 Proposal for a FRAM Based Risk Management ................................... 
148 7.7 Chapter Summary................................................................................. 149 8 DISCUSSION.............................................................................................. 151 8.1 Applicability of the Resilience Engineering Paradigm to Airworthiness. 151 8.2 The Tornado Airworthiness System Model – Initial Version.................. 155 8.3 Incident Investigation ............................................................................ 156 8.3.1 Data Collection............................................................................... 157 8.3.2 Aids to Investigation ....................................................................... 157 8.4 Risk Assessment .................................................................................. 158 8.4.1 Hazard Management vs Functional Resonance Management....... 160 8.5 Utility of the TASM for Type Airworthiness Activities ............................ 160 8.6 Utility of the TASM for Continuing Airworthiness Activities ................... 162 8.7 Utility of TASM for Duty Holder Activity................................................. 164 8.8 Potential Use for System Improvement................................................. 165 8.9 Potential for Further Development of the TASM ................................... 166 8.9.1 Increased Model Fidelity ................................................................ 167 8.9.2 Application of Bayesian and/or Fuzzy Logic................................... 168 8.9.3 Expansion into Operational Safety Management ........................... 168 8.10 Chapter Summary............................................................................... 169 9 CONCLUSIONS.......................................................................................... 171 9.1 Summary .............................................................................................. 
171 9.2 Recommendations................................................................................ 172
  • 12. viii 9.2.1 Manage Airworthiness as a Control Problem ................................. 173 9.2.2 Use the TASM to Control the Airworthiness System ...................... 173 9.2.3 Review Airworthiness Risk from a Resilience Perspective............. 173 9.2.4 Use FRAM as a Means to Improve System Resilience and Efficiency................................................................................................. 173 9.3 Potential for Further Research and Development................................. 174 9.4 Concluding Remarks............................................................................. 174 REFERENCES............................................................................................... 177 Appendix A –TORNADO AIRWORTHINESS FRAM MODEL..................... 185 Appendix B – TORNADO AIRWORTHINESS MODEL VISUALISATION... 187 Appendix C – PARTICIPANTS BRIEFING SHEET .................................... 188
LIST OF FIGURES
Figure 1-1 - Nimrod MR2 XV230
Figure 1-2 - RAF Tornado GR4 Aircraft
Figure 2-1 Accident Analysis and Risk Assessment Methods
Figure 2-2 Three Tracks on the Evolution of Safety Theory
Figure 2-3 The ‘Cynefin’ Framework – Complexity and Risk Management
Figure 2-4 General Form of a Model of Socio-technical Control
Figure 2-5 The Four Cornerstones of Resilience
Figure 2-6 Conceptual Framework for Resilience Engineering
Figure 2-7 Framework for managing the impact organisation, technology and human factors have on safety management systems
Figure 2-8 FRAM Function
Figure 4-1 FRAM Model Visualisation Demonstrating Taxonomy
Figure 4-2 TASM Step 12 – Screen Capture Showing Applicable Spreadsheet Areas
Figure 4-3 Visualising Functional Output Variability
Figure 4-4 Instances of Functional Output Variability Recorded in Occurrence Reports 2012/13
Figure 4-5 Instances of Reported Functional Output Variability by Function Type
Figure 4-6 Total Instances of Functional Output Variability Recorded in Occurrence Reports 2012/13
Figure 4-7 TASM Step 2 – Screen Capture Showing Applicable Spreadsheet Areas
Figure 4-8 Tracing Output Downstream Dependencies (Screen Capture)
Figure 4-9 Rough Score Matrix
Figure 4-10 Rough Downstream Function Variability Score
Figure 4-11 TASM Step 3 – Screen Capture Showing Applicable Spreadsheet Areas
Figure 4-12 TASM Step 4 – Screen Capture Showing Applicable Spreadsheet Areas
Figure 4-13 Example FRAM for 2 Functions, A and B
Figure 5-1 Visualisation Functional Groupings
Figure 5-2 A Function and Its Aspects
Figure 5-3 Screen Capture of Visualisation Tool with Functions Added
Figure 5-4 Screen Capture of Visualisation Tool with External Dependencies Added
Figure 5-5 5-6 Screen Capture of Visualisation Tool with all Functional Activities Shown
Figure 5-7 Activities and Dependencies Linked to Aspects of the ‘Train Maintenance Personnel’ Function
Figure 5-8 Selecting Layers within Visio – Screen Capture
Figure 5-9 DII Visio Viewer – Screen Capture
Figure 5-10 Visualisation Tool Key
Figure 6-1 Tornado GR4 with Thrust Reversers Deployed
Figure 6-2 Thrust Reverser Incidents Visualisation
Figure 6-3 Propulsion & Electrical System
Figure 6-4 Electrical System Potential Functionally Resonant Activities
Figure 6-5 Instantiation of Thrust Reverse Occurrence Reports
Figure 6-6 Location of Where Lost Pin was Installed
Figure 6-7 General Installation Location of Lost Pin
Figure 6-8 Pin Location in Tool Kit
Figure 6-9 Visualisation Tool Output for Rigging Tool Occurrence
Figure 6-10 Instantiation of Rigging Pin Occurrence
Figure 7-1 Tornado Process for Emergent Airworthiness Issues
Figure 7-2 Current Theoretical Basis for Tornado Airworthiness Risk Management
Figure 7-3 Proposed Functional Resonance Risk Management Theory - Visualisation of a Generic Hazardous Process
Figure 7-4 FRAM Model Risk Assessment Process
Figure 7-5 Operation of Components Beyond Cleared Life - First Stage Risk Visualisation, Excluding Background Functions
Figure 7-6 Visualisation of Hazard Generation Process
Figure 7-7 Visualisation of Potential Accident Processes
Figure 7-8 Proposed Risk Management Process
Figure 8-1 Fractal Property of the FRAM - Function Decomposed into Lower Level Functions
Figure 8-2 TASM Development Pathway
LIST OF TABLES
Table 2-1 Herrera's Ages of Safety Theory
Table 2-2 Benefits and Criticisms of Probabilistic Risk Assessment
Table 2-3 Examples of Resilience Engineering in Practice
Table 3-1 D-ASOR Classifications included in Data
Table 4-1 Example FRAM frame for Fault Diagnosis
Table 4-2 Listing of TASM Functions
Table 4-3 Summary of Internal Variability
Table 4-4 Summary of External Variability
Table 4-5 Example TASM Recording of Step 2a-c for Function 67 - Engine Fleet Monitoring
Table 4-6 Elaborate Description of Output Variability
Table 4-7 Characterising Output Variability – Flight Servicing
Table 4-8 Classifications for Frequency of Output Variability
Table 4-9 Classification of Amplitude of Performance Variability
Table 4-10 Aggregation of Variability for Flight Servicing
Table 4-11 Example of Step 4 - Flight Servicing
Table 6-1 Thrust Reverser Air Safety Occurrence Reports 2012/13
Table 6-2 Thrust Reverse Occurrences with Detailed Investigation
Table 6-3 Thrust Reverser FRAM Instantiation
Table 6-4 FRAM Model of Electrical System
Table 6-5 Electrical System Precondition Variability
Table 6-6 Functional Variability Noted From Investigation
Table 7-1 Configuration Management Aspects
Table 7-2 Summary of Second Stage of Risk Assessment
Table 7-3 Stage 2 - Scheduled Maintenance Function
Table 7-4 Stage 2 - Force and A4 Operations Function (Part 1)
Table 7-5 Stage 2 - Force and A4 Operations Function (Part 2)
Table 7-6 Stage 3 Replacement of Life Limited Parts Function
Table 7-7 Example Accident Generating Function FRAM Frame Layout
Table 7-8 Avionic Flight Systems Output – Baseline FRAM Model
Table 8-1 Utility of TASM for TAA Activities
Table 8-2 Potential CAMO Use of TASM
Table 8-3 Aviation Duty Holder Use of TASM
LIST OF EQUATIONS
Equation 1 – Linear System
Equation 2 - Additive Property
Equation 3 – Homogeneous Property
Equation 4 – Non Linear System; lack of Additive Property
Equation 5 – Non Linear System; lack of Homogeneous Property
Equation 6 - Rough Downstream Function Variability Score
LIST OF ABBREVIATIONS
A4 - NATO designator for Logistics/Engineering
AC - Aircraft
AcciMap - Accident Map
ADF - Acceptable Deferred Fault
AEB - Accident Evolution and Barrier Function
AESO - Air Engineering Standing Orders
ALARP - As Low As Reasonably Practicable
ARC - Airworthiness Review Certificate
ATTAC - Aircraft Tornado Transformation Availability Contract
ASIMS - Air Safety Information Management System
ATHEANA - A Technique for Human Error ANAlysis
AWFL - AirWorthiness Flight Limitations
CAM - Continuing Airworthiness Manager
CAMO - Continuing Airworthiness Management Organisation
CAMSS - Continued Airworthiness Management Support Services
CMU - Combined Maintenance and Upgrade Unit
CREAM - Cognitive Reliability Error Analysis Method
CSNI - Committee on the Safety of Nuclear Installations
DAOS - Design Approved Organisation Scheme
DASOR - Defence Air Safety Occurrence Report
DE&S - Defence Equipment and Support
DII - Defence Information Infrastructure
DMS - Dedicated Maintenance System
DO - Design Organisation
DQAFF - Defence Quality Assurance Field Force
EA - Engineering Authority
EngO - Engineer Officer
ETTO - Efficiency Thoroughness Trade Off
FAST - Fast Air Support Team
FMEA - Failure Mode Effects Analysis
FMECA - Failure Mode and Criticality Analysis
FOC - Force Operations Centre
FRAM - Functional Resonance Analysis Method
GSE - Ground Support Equipment
HAS - Hardened Aircraft Shelter
HAZOPS - Hazard and Operability Study
HAZID - Hazard Identification
HCR - Human Cognitive Reliability
HEAT - Human Error Assessment Technique
HERA - Human Error in Air Traffic Management Technique
HFACS - Human Factors Analysis and Classification System
HPES - Human Performance Enhancement System
HRO - High Reliability Organisation
ITEA - Independent Technical Evaluation and Advice
JEngO - Junior Engineering Officer
JSP - Joint Service Publication
LFT - Latest Finish Time
LITS - Logistics Information Technology System
LOAA - Letter Of Airworthiness Authority
MAA - Military Aviation Authority
MAOS - Maintenance Approved Organisation Scheme
MAP-01 - Manual of Airworthiness Processes - 01
MERMOS - Méthode d’Evaluation de la Réalisation des Missions Opérateur pour la Sûreté
MOD - Ministry of Defence
MMD - Man Made Disaster
MORT - Management Oversight and Risk Tree
MSG - Maintenance Steering Group
MRP - MAA Regulatory Publications
MTO - Man-Technology-Organisation
MWO - Maintenance Work Order
NAT - Normal Accident Theory
NATO - North Atlantic Treaty Organisation
NETMA - NATO Eurofighter and Tornado Management Agency
OOPS - Out Of Phase Servicing
OSI - Occurrence Safety Investigation
ORG - Occurrence Review Group
PST - Propulsion Support Team
PT - Project Team
QA - Quality Assurance
QMS - Quality Management System
R2 - 2nd Line Repair
RA - Regulatory Article
RAF - Royal Air Force
RCA - Root Cause Analysis
ROCET - RB199 Operational Contract for Engine Transformation
RTS - Release To Service
RTSA - Release To Service Authority
SEngO - Senior Engineering Officer
SI(T) - Special Instruction (Technical)
SQEP - Suitably Qualified and Experienced Person
STAMP - Systems Theoretic Accident Model
STANEVAL - STANdards EVALuation
STEP - Sequential Timed Event Plotting
TAA - Type Airworthiness Authority
TAP - Technical Assistance Process
TASM - Tornado Airworthiness System Model
TGRF - Tornado Ground Attack & Reconnaissance Force
THERP - Technique for Human Reliability Analysis
TME - Testing and Measuring Equipment
TRACEr - Technique for The Retrospective Analysis of Cognitive Error
TSEMP - Tornado Safety and Environmental Management Plan
1 INTRODUCTION

“I can see him now, fighting with the controls, trying his best… …We want some justice and the MOD to sit up and take notice, what they have done could have been avoided; we live in hope that they will not let this happen in the future.”
Mrs Adele Squires, wife of Flight Lieutenant Al Squires, Captain of the Nimrod aircraft XV230

1.1 Introduction

Air accidents have shown that aircraft are sometimes not as safe or as airworthy as was previously imagined. With huge resources applied to ensuring airworthiness, why do accidents still occur? It is often said that such accidents could have been prevented had the lessons of the past been heeded. Yet despite many investigations and recommendations, accidents still occur. Why is this so? Are existing tools for safety analysis inadequately applied, or inadequate in and of themselves? How can those charged with responsibility for complex hazardous systems in industry, transportation or the military work better to prevent accidents and yet still achieve their operational objectives?

Design engineers are duty bound to demonstrate that their system may initially be operated without an unacceptable level of harm. Thereafter, the system must be maintained and continually monitored for any increase in risk beyond acceptable levels. For organisations that have very few accidents, if any, this is a major challenge: how can the risk of something that has not happened be measured and managed?

In the aviation domain, airworthiness is a property that requires continual management and assessment. Whilst the property is attributable to the materiel itself, aircraft systems are in almost constant contact with humans, and thus airworthiness is inherently bound up with the people who manage, operate and maintain aircraft. Aircraft and their supporting organisations are complex ‘socio-technical systems’.
Resilience engineering is a new concept that provides insight into this relationship and offers useful models and tools for better management of the safety of such systems. If complex socio-technical systems managing airworthiness are better understood, then perhaps future accidents will be prevented.

1.2 Background – Theories of Safety

There have been accidents involving complex systems ever since the industrial revolution. The management of safety has consequently been a concern since that time, but a theoretical basis for safety did not emerge until the 1930s. Herrera (2012) divides the development of safety theory into 4 overlapping ages: technology, human factors, organisational safety and complexity. The age of technology dealt with the design of machines and why they fail, whereas human factors has traditionally been concerned with why humans fail to do what is expected of them. Organisational safety has been concerned with the safe management of potentially hazardous enterprises and how these fail, with Reason’s (1997) famous ‘Swiss Cheese’ model being pre-eminent in the field. However detailed the taxonomies of failure in the first 3 ages were, the associated models of accident causation have been linear.

An emerging 4th age of safety theory is that of complexity. In complexity theories, accident causation models are non-linear and are sometimes said to be intractable. The term Resilience Engineering has come to encompass the use of these models; the practice seeks to develop socio-technological systems that are resilient against those variations in system performance which may cause accidents.

Airworthiness has mostly been associated with technological safety theory, with reliability and safety assessment methods such as fault tree analysis dominating the thinking of designers and regulators. Although design for human factors has been an issue since the 1940s, the field has generally been concerned with operator performance; human factors in maintenance has only more recently come into the spotlight (Reason and Hobbs, 2003).
The ability of aircraft operating authorities and regulators to maintain continuing airworthiness has been the subject of analysis from an organisational safety standpoint following accidents such as Alaska Airlines Flight 261 (Woltjer, 2007). More recently, accidents such as Air France 447 (Stoop, 2013) have shown that unexpected results can emerge from increasingly complex systems and that a lack of resilience can be fatal.

1.3 Background – The Practical Requirement

This research specifically addresses the management of airworthiness within the United Kingdom’s military. On 2 September 2006 the UK military suffered its largest single loss of life since the 1982 Falklands War, when a Royal Air Force (RAF) Nimrod MR2 aircraft was destroyed near Kandahar, Afghanistan. This was not the consequence of a hostile act or the outcome of operator error. It was an accident caused by a failure to establish the correct level of initial airworthiness through the design of modifications and, thereafter, a failure to maintain continuing airworthiness in the condition of the fuel and hot air systems. The independent inquiry into the incident identified that the deeper causes were organisational and managerial (Haddon-Cave, 2009).

Figure 1-1 - Nimrod MR2 XV230 (McKenzie, 2012)

As a consequence of the recommendations made by the Nimrod Review, military airworthiness management has been comprehensively overhauled as part of a reorganisation of ‘Air Safety’ within the Ministry of Defence (MOD). The previously “byzantine” (Haddon-Cave, 2009) regulation of air safety has been simplified through the establishment of the Military Aviation Authority (MAA). Key to the new system has been the establishment of a chain of ‘duty holders’: named senior military officers with legal responsibility for the safety of aircraft operated by their organisation. Duty Holders rely on Type Airworthiness Authorities (TAA) and Continuing Airworthiness Managers (CAM) to ensure that the airworthiness of their aircraft is adequately established and maintained.

In practice this is achieved through a variety of processes aimed at managing the risk of a technical failure. There is an engineering programme to maintain the integrity of the system’s initial airworthiness whilst developing the system’s capability, and also a maintenance programme specified by the Engineering Authority (EA) (reporting to the TAA) and implemented by the Continuing Airworthiness Management Organisation (CAMO). In common with most socio-technological systems, these processes do not operate exactly as designed or documented. Particular concerns centre on the human factors within the maintenance programme and on whether appropriate engineering ‘standards and practices’ can be ensured in the face of pressures to produce operational output within an increasingly lean front line organisation. These are typical examples of the Efficiency-Thoroughness Trade-Off (ETTO) principle highlighted within the Resilience Engineering literature (Hollnagel et al, 2007).

Practical experience of the messy realities of military aircraft operations and back-office airworthiness assessment was the genesis of the research aim. This research uses the RAF’s Tornado Ground Attack & Reconnaissance Force (TGRF) as a case study.

Figure 1-2 - RAF Tornado GR4 Aircraft (Crown Copyright, 2009)
1.4 What is ‘Airworthiness Management’?

Large organisations exist to maintain, modify, resource, operate and monitor aircraft fleets in order to keep them airworthy. The way in which this multitude of functions is carried out has a variety of effects on the aircraft system and on the property of airworthiness. Those responsible for airworthiness can only manage it indirectly, by managing the functioning of the organisation. This is achieved by tasking maintenance, setting policy, defining an organisational structure (including contracting out elements), providing resources and conducting quality assurance. So whilst making engineering assessments and specifying what physical actions are to be carried out on an aircraft system is critical, the management of airworthiness is a wider endeavour.

1.5 The Research Aim

The aim of this thesis is:

To apply resilience engineering concepts by producing a system model of an airworthiness management organisation in order to provide a tool to improve management of airworthiness.

1.6 Objectives

In order to achieve the aim, the following research objectives were established:

• Review the theoretical background to safety management and the implications for airworthiness management.
• Review the concepts of Resilience Engineering with an emphasis on applying them to airworthiness management.
• Establish a theoretical framework for a model of an airworthiness management system.
• Gather and use primary research data to establish and validate a model of the airworthiness management system for the RAF Tornado Force.
• Using the model, develop a tool to enhance the airworthiness management system of the RAF Tornado Force.
1.7 Methodology Overview

A literature review of resilience engineering was carried out, which branched out into the source disciplines of systems thinking and engineering, control theory, non-linear dynamics and complexity theory. A search was conducted for work in this area addressing airworthiness or technical safety in other domains. For the Tornado case study, the safety, airworthiness and assurance plans of the various elements of the organisation were examined. Resilience engineering provides a number of modelling techniques that could be applied to the case study; these were assessed and the Functional Resonance Analysis Method (FRAM) was selected. The system was assessed through semi-structured interviews with key personnel, supplemented by a large amount of information and experience gained from working within the system. The FRAM Model was built within a spreadsheet and a separate model visualisation tool was created using Microsoft Visio. This allowed various potential leading indicators for system safety to be identified. In order to validate the FRAM model, specific case studies were required. Two incident reports and an emergent airworthiness risk were selected for analysis.

1.8 Descriptions and Definitions

For simplicity, the standard terminology described within MAA02 – Military Aviation Authority Master Glossary (MAA, 2012) is adopted for this thesis. There are a number of minor differences in emphasis between terms used here and in civil aviation or other domains; these are discussed where relevant.

1.9 Thesis Structure

This thesis is structured around the research objectives:

• Chapter 2 describes the theoretical foundations for resilience engineering in the context of the other theories of safety and safety engineering practice in other domains. Potentially useful models are analysed.
• Chapter 3 details the methodology for carrying out the primary research.
• Chapter 4 describes the process for building the case study FRAM Model – the Tornado Airworthiness System Model.
• Chapter 5 describes the development of the FRAM visualisation tool.
• Chapter 6 discusses how the FRAM Model may be used for incident analysis with reference to two examples.
• Chapter 7 gives a process for, and an example of, using the FRAM Model as a risk assessment tool.
• Chapter 8 provides a general discussion of the case study exercise, focussing on the applicability of Resilience Engineering to aspects of airworthiness practice.
• Chapter 9 provides some conclusions.
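To make the modelling approach outlined in Section 1.7 a little more concrete, the following is a minimal Python sketch of how a FRAM model might be held as data. It is not the thesis's spreadsheet implementation. It assumes the six aspects of a function in the standard FRAM formulation (input, output, precondition, resource, control, time); the class name, the helper function and the coupling labels are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field

# The six aspects of a FRAM function in the standard formulation.
ASPECTS = ("input", "output", "precondition", "resource", "control", "time")


@dataclass
class FramFunction:
    """One function (activity) in the model; each aspect holds coupling labels."""
    name: str
    aspects: dict = field(default_factory=lambda: {a: [] for a in ASPECTS})


def downstream_couplings(functions, source_name):
    """List (function, aspect) pairs that depend on any output of the named function."""
    source = next(f for f in functions if f.name == source_name)
    outputs = set(source.aspects["output"])
    couplings = []
    for fn in functions:
        if fn.name == source_name:
            continue
        for aspect in ASPECTS:
            # An output of one function couples to a non-output aspect of another.
            if aspect != "output" and outputs & set(fn.aspects[aspect]):
                couplings.append((fn.name, aspect))
    return sorted(couplings)


# Hypothetical functions and coupling labels, for illustration only.
train = FramFunction("Train Maintenance Personnel")
train.aspects["output"].append("qualified maintainers")

servicing = FramFunction("Flight Servicing")
servicing.aspects["resource"].append("qualified maintainers")
servicing.aspects["output"].append("serviceable aircraft")

scheduled = FramFunction("Scheduled Maintenance")
scheduled.aspects["resource"].append("qualified maintainers")
scheduled.aspects["time"].append("maintenance schedule")

model = [train, servicing, scheduled]
print(downstream_couplings(model, "Train Maintenance Personnel"))
```

Holding couplings as shared labels, rather than as explicit links, mirrors the way a tabular model can be searched for every function affected when one function's output varies.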
2 LITERATURE REVIEW

The literature review examines arguments for broadening the scope of airworthiness to address the complexities of managing modern aircraft, maintenance and support organisations. Existing notions of cause, failure and hazard are challenged as the theoretical background to resilience engineering is described. Models and methods for understanding and managing the safety and airworthiness of complex systems are examined using the paradigm of resilience engineering.

2.1 Airworthiness in the Context of Safety

There are a number of definitions of the term airworthiness; all have at their core the need for the aircraft to be able to be operated in safety or, as the MAA has it, ‘without significant hazard’. Hazard is further defined as ‘an intermediate state where the potential for harm exists’ (MAA, 2012b). The hazard is said to lie between a cause (such as a technical or human failure) and an accident. So whilst airworthiness is clearly a target for aerospace design organisations to meet through satisfaction of certification standards, it is also an element of system safety that requires management throughout the lifecycle of the system. It is analogous to ‘technical safety’ in other domains, which is often separated from ‘operational’ or ‘occupational’ safety.

2.1.1 Accident Investigations

The need to investigate loss of life or near misses is both a pragmatic and a moral one. The conclusions drawn from such investigations are extremely important at a human level but are also critical to restoring system safety. It is therefore vital for accident investigators to use mental and procedural models that reflect the complexity of modern technologies. One of the largest accident investigation agencies, the National Transportation Safety Board (NTSB), determines a ‘probable cause’ in all its reports (Johnson and Holloway, 2004), whereas ICAO recommends that ‘causes’, in the plural, are determined (ICAO, 2001).
This indicates a governing accident chain theory in the former organisation but perhaps a slightly more sophisticated model in the latter. Various writers (De Landre et al., 2006; Coury et al., 2008) have proposed models or frameworks in which multiple causes can be described in accident investigation.

Much has been written about the intersection between legal frameworks and accident investigation methodologies. Dekker (2003), for example, has described the detrimental effect of the adversarial nature of justice. The rest of this chapter describes how assigning a ‘root’ or probable cause to accidents is potentially unhelpful in the context of complex systems. It follows that notions of blame or individual responsibility are often problematic to apply.

2.1.2 Initial and Type Airworthiness

Much of the airworthiness of a system is ‘designed-in’ before manufacture. This involves specifications, systems configuration and assumptions about support and maintenance philosophy. A structured systems engineering approach to safety, as described in ARP 4761 (SAE, 1996), is used to convince regulators that a type certificate can be issued. The evolution of safety requirements and regulation over a system’s lifecycle causes difficulty (Kelly and McDermid, 1999). Military aircraft in particular are often retained in service for many decades. Whilst the technology may remain relatively constant, experience shows that it is usual for operational usage to evolve over the course of the lifecycle. For this reason it is important to regularly adjust, validate and reassess airworthiness assessments if the type airworthiness of a design is to be maintained.

2.1.3 Safety Management

For many complex systems, the development of safety cases is a mandatory requirement (MoD, 2007); for military airworthiness in particular this is governed by MAA Regulatory Article 1205 (MAA, 2013). The concept of a safety case is the presentation or collation of a body of evidence to assure interested parties that the system is safe. This body of evidence is collected and organised according to mental or procedural models.
The theoretical basis for these models is provided by the same theories of safety that are described below. Safety management systems are similarly structured according to the prevailing theoretical approach to safety. An evolution in modelling requires an evolved approach to safety management.

2.1.4 Continuing Airworthiness

Continuing airworthiness relates to the maintenance of a particular, safe system state for each of the individual aircraft being managed (MAA, 2012b). Given that it is never possible to comprehensively inspect or audit each aircraft before every flight, assumptions must be made about the effect of organisational and human interactions with the aircraft in order to maintain the system in a safe state. Understanding maintenance system performance is critical to assuring continued airworthiness. This is achieved through a Continuing Airworthiness Management Organisation (CAMO), which provides assurance that its specified tasks are being undertaken successfully. This is primarily achieved through a quality assurance system, which ensures that rigorous processes are established (Casey, 2013).

2.2 A History of Safety Theory

Chapter One sketched out a chronological view of ‘Ages’ of safety theory. New theories tend to gain traction as a result of the investigation of major accidents. Herrera (2012) describes how safety theory has evolved across technological, human factors, organisational and complexity ‘ages’, identifying key accidents and ideas on a timeline, which is summarised in Table 2-1:
Table 2-1 Herrera's Ages of Safety Theory
Columns: Time; Accidents; "Age" of Safety Theory (Technology, Human Factors, Organisational, Complexity)

1930s. Theory: Domino Model.
1940 - 50s. Theory: Failure Mode Effects Analysis (FMEA); Human Factors Design; Task Analysis.
1960s. Accidents: Aberfan Colliery Disaster. Theory: Fault Tree Analysis (FTA) - Minuteman Missiles & Boeing aircraft; Energy Barrier Model; Technique for Human Error Rate Prediction.
1970s. Accidents: Flixborough & Seveso Chemical Plants; Tenerife Aircraft Collision; Three Mile Island Nuclear Plant. Theory: Probabilistic Risk Assessment (WASH-1400 Reactor Safety Study); Hazard & Operability Analysis; Energy Damage and Countermeasure Strategies; Man Made Disaster; Information Perspective.
1980s. Accidents: Bhopal Chemical Plant; Challenger Space Shuttle; Chernobyl Nuclear Plant; Kings Cross Railway; Piper Alpha Oil & Gas; Dryden Aviation. Theory: Crew Resource Management; Safety Culture; Swiss Cheese Model; Normal Accident Theory.
1990s. Accidents: Warsaw Air Crash; Iraq Friendly Fire; Cali Air Crash; Ariane 5 - Space; Norne Air Crash; Longford Oil & Gas. Theory: Mandatory Safety Cases (UK); Normal Deviations; Man, Technology and Organisation Concept; Drift into Failure; Risk Influence Model; High Reliability Organisations.
2000s. Accidents: Überlingen Air Crash; Columbia Space Shuttle; Helios Airways; Texas City Refinery; Nimrod Air Crash; Air France 447; Deepwater Horizon. Theory: Human Factors Analysis & Classification System; Failure of Leadership, Culture & Priorities; Aviation Safety Management Systems; Resilience Engineering; Theory of Practical Drift.

Leonhardt et al (2009) present a breakdown of safety methodologies within a Resilience Engineering White Paper. This document describes Technical, Human Factors, Organisational and Systemic accident analysis and risk assessment methods. Systemic models and methods are those that have recently emerged to provide a means of analysing safety from a ‘complexity’ standpoint. These are shown chronologically in Figure 2-1, with an expansion of each abbreviation available within the glossary.
Figure 2-1 Accident Analysis and Risk Assessment Methods (Leonhardt et al, 2009)

Saleh et al (2010) present a slightly different narrative of the development of safety theory. Whilst they note most of the same key ideas and developments, they identify three tracks in safety theory leading towards the modern ‘system and control theoretic’ approach. These are illustrated below:

Figure 2-2 Three Tracks on the Evolution of Safety Theory (Saleh et al., 2010)

The tracks are not exhaustive and there is some cross coupling between ideas. Herrera’s (2012) technological age can be likened to the middle track, the defence in depth track is comparable to the organisational age, whilst the top
track has many human factors elements but takes much from the current ‘age of complexity’. The current state of the art is given as a systems engineering-control theory approach. Saleh (2010) acknowledges that the literature in the field is particularly fractured. This is perhaps because the various theories emanate from disparate fields such as psychology, reliability, operations studies and management.

2.2.1 Technological Age – Governing Philosophy

The predominant theme in the technological age of safety theory is that of a ‘chain of causation’, first visualised as a set of toppling dominoes by Heinrich (1950). Each domino represented a factor in the accident: management controls; failure of a man; unsafe acts or mechanical conditions; the accident; injury. Once the first domino was toppled, removal of any of the others would prevent the final injury domino toppling. Related to this is the concept of an accident or event chain, where causative elements or events link together to form a chain which, had it been broken, would have prevented the accident. It is unclear where this idea originated; it is perhaps a reflection that a linear view of the world still represents the defining popular narrative for any major accident. Leveson (2011) links this to an erroneous assumption that there is always a cause for any given accident.

2.2.2 Technological Age – Tools

The notion of a linear event chain gave rise to methods of analysing system safety or the related property of reliability. Fault Tree Analysis (FTA) was developed from reliability studies of the American Minuteman missile system and quickly became a methodology for analysing safety by defining the probability of an unsafe condition developing (Herrera, 2012). Closely associated are event trees, which define hierarchies of events following a single initiating event (such as an unsafe condition).
These analyses use stochastic methods to forecast top level probabilities for accidents caused by single or multiple failures lower down in the system. There is always a mathematical audit trail from the top level system safety target, for example hull loss probability in commercial aviation, down to individual system or
component reliability data or predictions. Importantly, modern system safety assessments contain more qualitative information based on expert understanding of systems, gathered through Functional Hazard Assessments (FHAs) (Dalton, 1996).

When analysing accidents using event chain type models such as FTA, there is a question of how far back it is appropriate to go in order to find an initiating event. Leveson (2011) argues that selection of initiating events is often arbitrary in accident analysis. It has been accepted in a large number of major accident reports that management commitment to safety or ‘safety culture’ is a key factor in risk of accident (Dekker, 2005), yet there is no clear way in which these vital considerations can be fitted into an event chain model.

Reason (1997) espouses a version of the event chain in the famous ‘Swiss cheese’ model of organisational accidents. Reason’s cheese has become the de facto mental model for understanding safety and accidents within the military aviation community, as articles in Air Clues, the RAF’s in-house safety magazine, demonstrate (Anon, 2011; Gale et al., 2013). Whilst Haddon-Cave’s (2009) investigation into Nimrod addresses issues of culture and complexity, his view of causation is essentially linear. Leveson (2011) outlines why linear accident models of the technological age such as the Swiss Cheese model are no longer considered acceptable:
• Direct Causality – there is a reliance on the notion that there is always a linear relationship in which event A causes event B.
• Subjectivity in Selecting Events – The backward chain of events is often shown to stop for a number of arbitrary reasons, which could include familiarity with a particular event in the sequence (“We’ve seen this before”), deviation from a standard (a component operates outside its specification) or a lack of information (such as inability to understand a human performance issue).
• Subjectivity in Selecting Chaining Conditions – It is often not clear which factors caused each other.
• Discounting System Factors – Event chain models generally deal with proximate causes and do not deal with issues such as culture or
organisational pressures which can pervade a socio-technical system.

A useful example of how this approach to accident analysis can prove disastrous is given by Leveson (2011). She notes that an incident in which a DC-10 lost a cargo door (without loss of life) was attributed to the failure of a baggage handler to close the door properly rather than to a design flaw, with the result that two years later a similar incident caused the complete loss of a DC-10 near Paris in 1974.

2.2.3 Limits of Probabilistic Risk Assessment

Both civil and military airworthiness certification standards require certain safety targets to be met. These targets are expressed in terms of probabilities, principally probability of hull loss and death of passengers or crew; for military aircraft this is specified in Regulatory Article 1230 – Design Safety Targets (MAA, 2012a). There are various other targets regarding risk of harm to third parties or other unsafe conditions – these are operating risks. Operating risks are also commonly assigned qualitative risk levels; in the case of military aviation this process is specified in Regulatory Article 1210 – Management of Operating Risk to Life (MAA, 2012a). This regulation advises Platform Operators and Project Teams to make use of Fault Tree Analysis to enable calculation of these risks. For some UK military platforms this has resulted in the introduction of ‘Loss Models’ to guide the assessment of new or emergent risks. In the case of Tornado, the Loss Model (Sugden, 2011) is not a tool that can be used in isolation for predictive risk assessment; rather it uses incident statistics to provide a current picture of loss rates across the fleet (Woodbridge, 2012). The regulation and recommended practice (SAE, 2010; Lloyd and Tye, 1982) for both civil and military airworthiness and safety targets call for the use of fault tree and dependency diagram models.
These methods of probabilistic risk assessment (PRA) are linear, which usefully provides for aggregation of total risk. There are however a variety of issues to consider in their use. Apostolakis (2004) provides a summary of some of the benefits and criticisms of PRA. However in the case of airworthiness certification risk assessments the process
is generally based on a qualitative assessment of Functional Hazard Analysis (FHA). FHA allows expert subjective analysis to provide an element of linkage between various hazards. Equally, Common Cause Analysis (CCA) methodologies go some way to accounting for system-wide failure mechanisms. The literature on resilience engineering disputes Apostolakis’ (2004) claim that PRA deals effectively with true complexity.

Table 2-2 Benefits and Criticisms of Probabilistic Risk Assessment (Apostolakis, 2004)

Benefits:
• Multiple failures considered.
• Increases likelihood of spotting complex failure interactions.
• Facilitates communication.
• Integrated approach.
• Identifies unknown areas for research.
• Focuses risk management activity on key areas.

Criticisms:
• Human actions during accident scenarios cannot be modelled.
• Difficulty of quantifying software failures.
• Cannot model safety culture.
• Difficulty estimating design and manufacturing errors.

PRA models are essentially a product of the ‘technical era’ of safety science; they assume linear behaviour and that the systems being analysed are tractable, and thus decomposable into independent subsystems. This remains the de facto approach to managing most complex socio-technical systems and forms the basis of the safety case approach prevalent within many regulatory environments. The fundamental assumptions that justify their use are questionable when applied to complex socio-technical systems. The principal concern is that the human element cannot be satisfactorily modelled using Boolean logic. In systems where there are frequent interactions with humans, whether operators, maintainers or design or support engineers, this presents the possibility that common cause failures will be built into the system and that the relationships will be non-linear.
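The effect of the independence assumption can be illustrated with a minimal sketch. It evaluates a two-channel redundant function as a small fault tree, first assuming the channels fail independently and then adding a shared-cause basic event using a simple beta-factor style split. The failure probability, the beta value and the function names are hypothetical figures chosen only for illustration; they are not taken from any of the cited sources or from the Tornado Loss Model.

```python
def and_gate(probabilities):
    """Probability that all independent input events occur."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    none_occur = 1.0
    for p in probabilities:
        none_occur *= (1.0 - p)
    return 1.0 - none_occur


# Hypothetical figures, for illustration only.
P_CHANNEL = 1e-3  # assumed failure probability of one redundant channel
BETA = 0.1        # assumed fraction of channel failures that share a cause

# Case 1: channels treated as fully independent (pure AND gate).
p_independent = and_gate([P_CHANNEL, P_CHANNEL])

# Case 2: each channel's failures are split into an independent part and a
# common-cause part; the common cause defeats both channels at once.
p_channel_alone = (1.0 - BETA) * P_CHANNEL
p_common_cause = BETA * P_CHANNEL
p_with_common_cause = or_gate([
    and_gate([p_channel_alone, p_channel_alone]),
    p_common_cause,
])

print(f"independent channels: {p_independent:.3e}")
print(f"with common cause:    {p_with_common_cause:.3e}")
print(f"ratio:                {p_with_common_cause / p_independent:.1f}")
```

Under these assumed numbers the top event probability rises from roughly one in a million to roughly one in ten thousand once the shared cause is included, which is the kind of gap the criticisms above point to when human and organisational influences act on several supposedly independent subsystems.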
2.2.4 Human Factors

Herrera (2012) outlines how 20th century disasters such as Three Mile Island and Flixborough showed that the event chain models were becoming inadequate – the focus began to shift to human failing, with the human identified as the number one unreliable component in the event chain. Herrera (2012) highlights two trends in the age of human factors: studies concerned with eliminating human error by design for human performance and studies into how humans cope with disturbances.

2.2.5 Organisational

‘Man Made Disaster’ theory was the initiating scholarly theory behind organisational accident theory (Saleh et al., 2010). This theory noted that within a certain class of events known as ‘man made disasters’ there were multiple event chains that reached a long way back into the past, and that management and organisation were key factors in causing accidents. Saleh (2010) also notes ‘Normal Accident Theory’ and ‘High Reliability Organisations’ as key precepts of the organisational accident. Normal accident theory notes that there are tight couplings between interacting causal factors in complex system accidents and that they cannot be predicted. This has been condemned as a somewhat fatalistic view. Herrera (2012) sees High Reliability Organisation Theory as a counter to Normal Accident Theory. This characterises successful organisations as those operating complex systems with a very small number of accidents. Saleh (2010) notes that the research highlights a number of common characteristics of such organisations, such as:
• Preoccupation with failure and organizational learning.
• Commitment to and consensus on production and safety as concomitant organizational goals.
• Organizational slack and redundancy.
These facets of successfully safe or high reliability organisations correspond to aspects of ‘safety culture’ as described by Reason (1997) and others.
2.3 Complexity

Aircraft are complicated machines; they have many components interacting in a multitude of combinations. Dekker (2011) holds that analytic reduction, as practised within traditional linear safety analysis, is unable to describe how system elements and processes behave when exposed to multiple simultaneous influences. He also describes the key distinction between a complicated system such as an aircraft, which could conceivably be disassembled then reassembled by a single person, and a complex system. A complex system is one where the boundaries are ‘fussy’ (require highly detailed definition) and the structure is intractable; an aircraft operated subject to human factors, culture, regulatory and organisational factors is therefore complex. Cilliers (2005) defines complex systems as those having the following properties:
• Large numbers of simple elements.
• Dynamic, propagating and non-linear interactions; these define behaviour which is emergent and cannot be understood by inspection of components nor predicted by deterministic methods.
• Open, exchanging energy and information with the environment.
• Memory is distributed within the system, influencing behaviour.
• Adaptive behaviour, without the intervention of external agents.
This study assumes that the complete aircraft system, incorporating its operation and support, is complex rather than simply complicated. It could also be argued that the addition of extensive software within aircraft renders the system complex. The safety management system and airworthiness management in particular must deal with complexity.

For those charged with managing the safety of complex systems, understanding models for accidents and studying post mortem analyses of accidents does not present a comprehensive approach to prevention. It is generally accepted that events, hazards and risks often combine in unexpected ways. Is it therefore adequate to manage safety risk as a game of ‘whack-a-mole’; eliminating or
mitigating risks as and when they become apparent (Zarboutis and Wright, 2006)? It may be argued that a proactive reporting culture does much to allow elimination or mitigation of risks before they materialise. Heinrich’s (1950) ‘iceberg’ model drives much of this effort to uncover previously unknown risk, and there is an indisputable logic which says that knowing about a risk is a first step to eliminating or managing it. The continued history of complex accidents tells us that this approach may never be completely effective in preventing unexpected failure (Hollnagel, 2007).

Leveson (2011) explains that the concept of a High Reliability Organisation confuses notions of safety and reliability. Just because individual components of a socio-technical system can be proven to be individually reliable, it does not follow that safety will necessarily emerge as a system property. Systems may be reliable yet unsafe, such as the NASA Mars lander which crashed because the designer failed to anticipate the interaction between the software and mechanical systems. Equally, it is possible for a system to be unreliable yet safe, where systems fail safe.

2.3.1 Complexity Theory

Accident investigation or analysis of complex system failure requires a mental model to be applied to the accident scenario (Hollnagel, 2011). Similarly, accident prevention through risk management uses modelling to understand potential accidents. Hitchens (2003) describes how complexity is relative to the observer’s frame of reference. Modelling complex systems requires judgement as to the extent of elaboration or its converse, encapsulation. He proposes that systems derive their degree of complexity from their variety, connectedness and disorder. Socio-technical systems are increasing in complexity as a result of the increased use of networks. Manson (2001) provides a useful review of complexity theory, most of the branches of which have an antecedent in general systems theory.
Three main branches of complexity theory are identified. ‘Algorithmic complexity’ holds that complexity is defined by the difficulty of describing system characteristics. ‘Deterministic complexity’ deals with chaos or catastrophe theories, which posit that stable complex systems may become
suddenly unstable. ‘Aggregate complexity’ deals with how elements interact to produce complexity. A key property of complex systems is that of emergence, which describes how system-wide characteristics cannot be computed by the aggregation of system component behaviour. Zarboutis (2006) highlights patterns that emerge from complex socio-technical systems and erode their resilience. Grøtan et al (2011) give a good account of the theoretical foundations of complexity and how they can be applied to risk assessment; the ‘Cynefin’ Framework provides a summary.

Figure 2-3 The ‘Cynefin’ Framework – Complexity and Risk Management (Grøtan et al., 2011)

Generally the literature shows that whilst linear thinking has reached its limits within system safety science, complexity theory has yet to be completely applied to the problem. Zarboutis (2006) identifies how complexity theories can be used to replace HAZOP-type safety analyses. The key inputs should be:
• How can system entities co-adapt?
• What will the probable effect be on the whole?
• How can such patterns be eliminated?
The output of such an analysis should therefore be some means of avoiding the emergent harmful properties. Dekker (2011) advises that complexity theories can be applied to accident investigation if the search for a single cause is dropped and multiple narratives are allowed to overlap and on occasion contradict each other. The nature of complexity defies analysis; Cilliers (2005) writes on the ‘incompressibility’ of complex systems, in that the only reliable model of a complex system is one which has the same level of detail as the system itself. Clearly this is impractical, yet as any model will involve simplification, disregarded elements may have non-linear effects and the magnitude of the potential outcomes may be non-trivial. However, Cilliers (2005) also states that whilst modelling and computing complex systems will never be sufficient, it is still necessary.

2.3.2 Systems Thinking and Systems Engineering

The concept of a system is well-established, with roots in philosophy and thermodynamic theories leading to the theories and practice surrounding systems engineering. Hitchens (2003) provides one definition:

A system is an open set of complementary, interacting parts with properties, capabilities and behaviours emerging both from the parts and their interactions.

The concept of emergence is an important one; accidents are emergent system states of disorder. Systems engineering involves the generation of models to represent a system (Oliver et al., 1997). Leveson (2011) describes how safety ought to fit into systems engineering’s primary activities: needs analysis, feasibility studies, trade studies, system architecture development and interface analysis. This is the basis for system safety assessments employed in generating evidence for airworthiness certification as per ARP 4761 (Dalton, 1996).
Saleh (2010) distinguishes between failure modes attributable to component failure and those failures attributable to emergent or interactive failures; his thesis is that a systems theoretic approach addresses this second set of failures. However he raises concerns that formal systems theoretic approaches such as co-ordinatability and consistency in hierarchical and multilevel systems are yet to be fully applied to safety analysis. Leveson’s
(2011) Systems-Theoretic Accident Model and Processes (STAMP) uses control theory and processes as the key to prevention of accidents. It decomposes the system across the complete lifecycle, from concept to disposal, into a series of control loops. The key to prevention of accidents is said to be keeping the entire system in a state of equilibrium, which is achieved by applying constraints to implement control. The model is said to deal more effectively with software than traditional notions of failure. STAMP utilises descriptions of control loops at technological subsystem level, human controller level and socio-technical organisation level, shown in Figure 2-4. STAMP uses a taxonomy of control loop failure modes as an audit check list. Salmon et al (2012) compare STAMP to other models, concluding that STAMP provides a more comprehensive system description but that it is difficult to incorporate human failures into the model, which itself needs a highly developed understanding of the whole system. This highlights the difficulty in applying theoretically strong models of complexity to particular scenarios.
Figure 2-4 General Form of a Model of Socio-technical Control (Leveson, 2011)

2.3.3 Control Theory

STAMP (Leveson, 2011) suggests that safety can be treated as a control engineering problem and Saleh (2010) identifies this idea as an important corollary to the development of a systems thinking approach to safety. Kontogiannis and Malakis (2012a) describe how the concept of a model with control loops is fundamental to systems safety incorporating human and organisational factors. Hollnagel and Woods (2005) produced an Extended COntrol Model (ECOM) which describes generically how organisational
processes transfer downwards to directly interact with and control the technological system and hence alter its state. The Viable System Model (VSM) uses cybernetics principles to describe how safety goals are transferred downwards through an organisation and how output is controlled by various measures such as audit (Espejo, 1989). Kontogiannis (2012a) combines these two models and applies them to the crash of flight AEW-241 in December 1997. As with many control and systems models in the safety literature, Kontogiannis (2012a) highlights the difficulty of applying the models for the purposes of accident prevention. Kontogiannis (2012b) also tries to apply these principles in a case study involving emergency helicopter operations.

2.3.4 Non-Linear Dynamics

Control of complex socio-technical systems needs to address the problem of non-linear behaviour. Bendat (1998) describes how physical and engineering systems can be divided into linear and non-linear systems. A system $H$ is linear if, for any inputs $x_1$ and $x_2$ and for any constants $c_1$ and $c_2$,

Equation 1 – Linear System (Bendat, 1998)
$$H[c_1 x_1 + c_2 x_2] = c_1 H[x_1] + c_2 H[x_2]$$

This leads to two properties:

Equation 2 - Additive Property (Bendat, 1998)
$$H[x_1 + x_2] = H[x_1] + H[x_2]$$

Equation 3 – Homogeneous Property (Bendat, 1998)
$$H[c\,x] = c\,H[x]$$

A non-linear system is therefore one where,

Equation 4 – Non Linear System; lack of Additive Property (Bendat, 1998)
$$H[x_1 + x_2] \neq H[x_1] + H[x_2]$$

Equation 5 – Non Linear System; lack of Homogeneous Property (Bendat, 1998)
$$H[c\,x] \neq c\,H[x]$$
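As a brief worked illustration of the additive and homogeneous properties, consider two simple systems that are assumptions made here for the example and are not taken from Bendat (1998): a squaring system $H_s[x] = x^2$ and a constant-gain system $H_g[x] = 3x$.

```latex
% Illustrative systems only (assumed for this example, not from the source)
\begin{align*}
H_s[x_1 + x_2] &= (x_1 + x_2)^2 = x_1^2 + 2 x_1 x_2 + x_2^2
  \neq x_1^2 + x_2^2 = H_s[x_1] + H_s[x_2] \quad (x_1 x_2 \neq 0) \\
H_s[c\,x] &= c^2 x^2 \neq c\,x^2 = c\,H_s[x] \quad (c \notin \{0, 1\},\ x \neq 0) \\
H_g[c_1 x_1 + c_2 x_2] &= 3 (c_1 x_1 + c_2 x_2) = c_1 (3 x_1) + c_2 (3 x_2)
  = c_1 H_g[x_1] + c_2 H_g[x_2]
\end{align*}
```

The cross term $2 x_1 x_2$ is what makes the squaring system non-additive: the response to two inputs together differs from the sum of the responses to each alone. The constant-gain system has no such term and satisfies the linear definition for all inputs and constants.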
This means that for a linear system with a random input that has a Gaussian probability density function (i.e. a normal distribution), the system will transform that data and produce an output that also has a Gaussian probability density function. Bendat (1998) also makes the point that any physical system will display non-linear properties if the input conditions are sufficiently wide. Just as this is true for numerous examples in flight dynamics, it is also true for various instances in safety and reliability, where oversimplifying assumptions are made regarding the condition of equipment and its interaction with maintenance and operating organisations.

Human behaviour often defies mathematical modelling due to its complexity and non-linear properties. As previously described, it is common for safety analyses and models to assume linear behaviour. In fact, complex socio-technical systems generally exhibit a lack of additive and homogeneous properties, where different inputs combine to produce unexpected and ‘out-of-control’ outputs resulting in accidents. This explains some of the difficulties encountered in producing a workable approach to human and organisational reliability, as outlined by Rasmussen (1997). Non-linear effects explain the concept of emergence: the behaviour of linear systems is predictable and tractable, yet non-linear systems produce unexpected results. Grøtan (2011) outlines how this leads to the concept of ‘Black Swan’ events that are unexpected and have a huge impact, such as a catastrophic accident with a complex system. These are understandable in retrospect but could not have been predicted. Leveson (2011) describes how such accidents are the result of non-linear interactions between components of the system, whether human, organisational or technological. The key to developing an improved method of managing safety and estimating risk will be to understand and predict these non-linear interactions.
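The Gaussian-in, Gaussian-out behaviour of a linear system can be checked numerically with a minimal sketch. It passes the same normally distributed samples through a hypothetical constant-gain system and a hypothetical squaring system, then compares the sample skewness of the outputs. Both systems, the sample size and the seed are assumptions made for illustration. Near-zero skewness is only a necessary sign of a Gaussian output, not proof of one, so this is an indicative check rather than a formal test.

```python
import random
import statistics


def skewness(samples):
    """Sample skewness: close to zero for a symmetric distribution."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    return sum(((s - mean) / std) ** 3 for s in samples) / len(samples)


def linear_system(x):
    # Hypothetical linear system: a constant gain.
    return 3.0 * x


def nonlinear_system(x):
    # Hypothetical non-linear system: squaring the input.
    return x * x


rng = random.Random(0)  # fixed seed so the run is repeatable
inputs = [rng.gauss(0.0, 1.0) for _ in range(20000)]

skew_in = skewness(inputs)
skew_linear = skewness([linear_system(x) for x in inputs])
skew_nonlinear = skewness([nonlinear_system(x) for x in inputs])

print(f"input skewness:             {skew_in:.3f}")
print(f"linear output skewness:     {skew_linear:.3f}")
print(f"non-linear output skewness: {skew_nonlinear:.3f}")
```

The linear output keeps the symmetry of the input because scaling does not change the shape of the distribution, whereas the squared output is strongly right-skewed: rare large inputs are amplified disproportionately, which is the pattern the text associates with unexpected, high-impact outcomes.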
2.4 Resilience Engineering

The theory of resilience engineering is emerging as a response to the problems posed to safety management and engineering by complexity theory and the age of the organisational accident as described by Reason (1997). The central theme is to move from a focus on failure, where notions of component reliability
are applied to complex systems, humans and organisations, to looking at how systems can succeed under varying conditions. The literature on the subject is somewhat fragmented, although a series of books has been published which brings together the key ideas. One of the aviation organisations embracing resilience engineering is EUROCONTROL, a multinational air traffic management service provider; Leonhardt et al (2009) published a white paper on the application of resilience engineering within the organisation. This illustrates that there is a blurred line between ‘traditional resilience’ study as applied to infrastructure, and resilience engineering which has emerged from the study of safety. Hollnagel et al (2011) give a simple definition of resilience:

“Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions.”

Woods and Hollnagel (2007) set the scene for resilience engineering. They outline fundamentals which include a shift away from the traditional safety focus on ‘what went wrong’ (hindsight) and ‘what could go wrong’ (risk assessment) to a focus on ‘what can go right’ for risk assessment and ‘what did go right’ for accident analysis, as neatly summarised by Schafer (2012). Resilience engineering also rejects the notion of human failure, error taxonomies and reliability analysis of complex systems in favour of a theory that failures represent either the breakdown in strategies for coping with complexity, or an unfavourable combination of functional variability within a system (technological, human or organisational). In resilience engineering, safety is redefined as the ability to succeed under varying conditions.
By observing how systems work under everyday pressures, it should be possible to understand the level of resilience in a system and how it might be engineered to increase this quality. For the purposes of both accident investigation and risk assessment it is necessary to move away from linear combinations of events to an understanding of how a system might lose its dynamic stability and veer into an accident trajectory (Hollnagel et al., 2007). In summary, there are four key precepts to Resilience Engineering:
1. Performance conditions are always underspecified. Individuals and organisations must therefore adjust what they do to match current demands and resources. Because resources and time are finite, such adjustments will inevitably be approximate.
2. Some adverse events can be attributed to a breakdown or malfunctioning of components and normal system functions, but others cannot. The latter can best be understood as the result of unexpected combinations of performance variability.
3. Safety management cannot be based exclusively on hindsight, nor rely on error tabulation and the calculation of failure probabilities. Safety management must be proactive as well as reactive.
4. Safety cannot be isolated from the core (business) process, or vice versa. Safety is the prerequisite for productivity, and productivity is the prerequisite for safety. Safety must therefore be achieved by improvements rather than by constraints.

These precepts define a theoretical approach drawn from various ideas about organisational accidents and safety culture. The key development is the focus on the functions within the system and the emphasis on improving their combined performance, rather than a focus on the potential sources of hazards and barriers for accident prevention. This positive standpoint is a key attraction of the approach; the drive for operational performance improvement and safety can be in synergy rather than in conflict.

Hollnagel (2011) gives four cornerstones to the practice of resilience engineering. The first is knowing what to do to respond to everyday disturbances – the actual. The second is knowing how to monitor potential threats from the environment and from the functioning of the system itself – the critical. The third is knowing what to expect in terms of threats and opportunities in order to address the potential. Finally, the fourth ‘cornerstone’ is the ability to address the factual through learning.
Figure 2-5 The Four Cornerstones of Resilience (Hollnagel, 2007): Responding (Actual), knowing what to do; Monitoring (Critical), knowing what to look for; Anticipating (Potential), knowing what to expect; Learning (Factual), knowing what has happened.

A slightly different conceptual framework for Resilience Engineering is presented by Madni (2009), offering more concrete requirements for operationalising the practice:
Figure 2-6 Conceptual Framework for Resilience Engineering (Madni, 2009)

2.4.1 Resilience Engineering as a Successor to Safety Management

Leonhardt et al (2009) put the resilience engineering approach to safety management simply: the more likely it is that something goes right, the less likely it is that it goes wrong. Cambon (2006) provides a resilience framework for assessing safety management systems, proposing a number of metrics based on Tripod theory, which essentially measures the performance conditions under which the SMS operates. The balance of these performance conditions is said to determine the stability of the SMS. ‘Engineering’ implies design, and Beauchamp (2006) notes how this can be achieved through organisational learning to provide organisational resilience; a model for guidance is provided. Zarboutis (2006) describes how, analogous to Rasmussen’s (1997) approach to organisational drift, resilience engineering can identify symptoms of an erosion in resilience. Johansson (2008) provides a ‘quick and dirty’ approach to evaluating resilience in systems; this is a helpful overview but does not prescribe specific improvement or change activities. Stoker (2008) outlines a comprehensive approach to the assessment of operational resilience, effectively specifying a goal based hierarchy for elements contributing to resilience and producing a check list approach. Whilst this is undoubtedly a valuable activity, it is questionable whether it will be able to deal with the emergence of safety issues.

2.4.2 Under Specification of Performance Conditions

Under-specification of performance conditions, that is, of the factors that affect the execution of a particular function, is a key concept in the literature (Hollnagel, 2007). In most organisations performance conditions are subject to control through rules, with the idea that this will improve safety.
Hale (2013) reviews the literature on this, noting that there are two approaches: a classical top-down approach, punishing transgression, and a bottom-up approach that
sees expert ability to adapt to changing circumstances as paramount. Nathanael (2006) notes that it is impossible to make what happens in practice match that which is espoused by officialdom; the key to generating resilience is dialogue between the hierarchical levels.

2.4.3 Performance Variability

Resilience engineering regards performance variability as inherently useful; it allows operations to continue in underspecified conditions. It also provides the potential for coupling between functions, where upstream performance variability combines with downstream performance variability to grow in amplitude. This phenomenon can be harnessed for system success or else it provides an origin for safety risk (Hollnagel, 2012).

2.4.4 Examples of Resilience Engineering in Practice

Resilience engineering is more theoretical than its name suggests and the practicality of implementing its precepts remains a matter of debate. However, its principles can be found in evidence where it was not specifically applied. Table 2-3 provides a brief summary of some examples.

Table 2-3 Examples of Resilience Engineering in Practice

Industry: Process Industry
Tools: Survey of workforce using Principal Component Analysis
Insights: Shirali et al. (2013) attempt quantitative measurement of resilience at an organisational level. Only possible to measure the potential for resilience rather than resilience itself. The following variables are given as indicators:
• Top management commitment
• Just culture
• Learning culture
• Awareness and opacity
• Preparedness
• Flexibility

Industry: Process Industry
Tools: Bayesian Networks; Resilience Dashboard (not currently achievable)
Insights: Pasman et al. (2013) define a holistic control methodology for plant safety using leading indicators derived from process measurements within the plant. Also use of process simulation tools to develop scenarios. Traditional HAZOP/FMEA analyses do not capture all potential accident scenarios. Key Points:
• Technical resilience can be measured/simulated. Organisational factors less so.
• Importance of leading indicators to enable response to variations.
• Difficulty in dealing with drift in safety metrics.
• Safety gains made through interdepartmental cooperation vs common cause failures.
• Advocate extensive use of bow-ties.

Industry: Aviation
Tools: Interviews, audit and expert analysis
Insights: An investigation into both the sources of resilience and sources of brittleness. Comparison of two comparable small air carriers. Identification through extensive interviews. Resilience and brittleness categorised and risk assessed (Saurin and Carim Junior, 2012).

Industry: Air Traffic Management
Tools: FRAM
Insights: Analysis of a fatal mid-air collision. Provides notes on buffering capacity, flexibility, margins, tolerance and cross scale interactions. There was no root cause – aircraft and ATM were operating normally. The system was inadequate (de Carvalho, 2011).

Industry: Aviation
Tools: Bayesian Belief Networks (BBN)
Insights: Examines the use of and qualification of experts to provide probability estimates for BBN. Hidden common causes in BBN – principally safety culture. Difficulty in estimating frequencies or probabilities of rare events. BBN assume the ‘Causal Markov Condition’, therefore common cause failures are difficult to deal with – maybe applying BBN to FRAM would solve this issue (Brooker, 2011).

Industry: Aviation
Tools: FRAM
Insights: Alaska Airlines flight 261 accident analysed to understand FRAM’s performance against 5 key resilience characteristics: buffering capacity, flexibility, margin, tolerance, and cross-scale interactions (Woltjer, 2007).

Industry: Railways
Tools: FRAM
Insights: Interdisciplinary safety analysis of complex socio-technological systems based on the Functional Resonance Accident Model: an application to railway traffic supervision (Belmonte et al., 2011).

Industry: Nuclear
Tools: FRAM
Insights: Specific case study surrounding a task to move nuclear fuel – a specific task analysis rather than a generic system approach (Lundberg, 2008).

2.4.5 Criticism of Resilience Engineering

Oxstrand and Sylvander (2010) argue that resilience engineering is little more than a rebranding of safety culture; they do not see how the practice can be applied to the nuclear industry, which already uses both PRA and human reliability analyses in the licensing of nuclear plants. In this industry, it is argued, safety culture forms part of every operation. The nuclear industry defines safety culture as:

“Safety Culture is that assembly of characteristics and attitudes in organisations and individuals which establishes that, as an overriding priority, nuclear plant safety issues receive the attention warranted by their significance.” International Atomic Energy Agency (Edwards et al., 2013)

Clearly safety culture is fundamental to engineering resilience into a socio-technical system. The theory of safety culture does not in and of itself propose a different conceptual framework for the origin of unsafe system performance. Also, some safety culture literature describes a requirement for safety to become the overriding priority for an organisation (Edwards et al., 2013). Clearly this is at odds with notions of efficiency-thoroughness trade-offs and the requirement to increase the proportion of activities that ‘go right’ as a means of reducing the number that ‘go wrong’. Whilst Resilience Engineering draws on much of the theory around safety culture, it goes a lot further in proposing ways in which organisations can be designed, analysed and modified in order to deliver
resilience.

Le Coze (2013) describes a number of criticisms of Resilience Engineering, foremost amongst these being scepticism over the need to introduce a new vocabulary to safety science. He also notes that the social concept of power is missing from the resilience literature, although it could be argued that the exercise of social power could be modelled as a function or a resource. He further notes that many have objected that resilience engineering does not present anything new; it simply connects a number of existing ideas, foremost of which is the High Reliability Organisation concept. He does note that the proof of the concept will be in its application to real systems – testing the worth of the ‘engineering’ aspect of the theory. McDonald (2008) asserts that Resilience Engineering is attractive because other models are weak. He notes that the theory needs to be further unified and demonstrated in practical examples.

2.4.6 Resilience Engineering and Airworthiness

Current MAA (2011a) policy is based on the idea that airworthiness is made up of four pillars: the safety management system, compliance with recognised standards, competence (of people and organisations) and independent assessment. All of these activities and qualities are likely to contribute to the resilience of an airworthiness system. Wilson (2008) provides a system model for resilience of an airworthiness system and presents a number of key ideas:
• The requirement for ‘organisational mindfulness’ – a safety culture keen to seek out areas of risk.
• Balancing ALARP principles with ‘And Still Stay In Business’, which could be thought of as an efficiency-thoroughness trade-off, as per Hollnagel (2011).
• Understand how the organisational boundaries contribute to safety; dealing with outsourcing, partnering and regulation.
• Translate strategies into management frameworks for managing organisational risk – these can be represented by ‘framework diagrams’ that show the factors that impact on safety management systems.
This work was succeeded by a thesis by Wilson (2012), which produced a framework called RISK2VALUE. This is an integrated management framework and decision support tool kit that addresses both safety and value management at an organisational level. A generic diagram, shown at Figure 2-7, is provided to support decisions; its use is illustrated by means of an extensive diagram mapping various relationships. The strength of this approach is that it provides either a generic approach to an audit of airworthiness or a guide to the construction of a new system. Equally, it provides an assessment of socio-technical factors surrounding accidents. A criticism that could be levelled at the tool is that the linkages between the elements are not explicitly defined, and it is therefore unclear how changes would influence the path that the organisation took through the diagram.
Figure 2-7 Framework for managing the impact organisation, technology and human factors have on safety management systems (Wilson, 2008)
2.4.7 Lean Resilience

Leondhart (2009) notes that modern business systems are largely premised on ‘just-in-time’ processes. This methodology increases efficiency and, consequently, the coupling between upstream and downstream functions. Individual system boundaries are more difficult to define as, for example, maintenance units become increasingly tightly dependent on supply chains. Carney (2010) urged caution in the introduction of lean principles and envisaged a hybrid between lean maintenance and a more traditional model. Resilience engineering in other domains has shown that it is in fact possible to harness the approach to introduce production improvement alongside safety (Hounsgaard, 2013). Lean methodology is profoundly linear in its thinking (Carney, 2010); it is easily deployable in a highly tractable system such as a production line. In less tractable systems such as maintenance, it is likely that Resilience Engineering techniques will produce better results.

2.5 Functional Resonance Analysis Method

The resilience engineering literature lacks specific methodologies or tools for practical implementation of resilience engineering principles. The notable exception is Hollnagel’s (2012) Functional Resonance Analysis Method (FRAM). This is a technique for building models of complex socio-technological systems. It differs from STAMP in that it is a method for generating a model rather than a model in itself. FRAM maps the system as a series of functions, defined by their various ‘aspects’ and linked ‘activities’.

Figure 2-8 FRAM Function (a single function with six labelled aspects: Input (I), Output (O), Preconditions (P), Resources (R), Time (T) and Control (C))
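The function-and-aspect structure described above can be illustrated in code. The following is a minimal Python sketch and not the spreadsheet model developed in this thesis: the class name FramFunction, the helper find_couplings and the three example functions are hypothetical, introduced here purely for illustration. It assumes that a coupling exists wherever a label in one function's Output exactly matches a label in another function's Input, Preconditions, Resources, Time or Control.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# The six aspect names shown in Figure 2-8 (singular attribute names are an assumption).
ASPECTS = ("input", "output", "precondition", "resource", "time", "control")


@dataclass
class FramFunction:
    """One FRAM function; each aspect holds a set of free-text labels."""
    name: str
    input: Set[str] = field(default_factory=set)
    output: Set[str] = field(default_factory=set)
    precondition: Set[str] = field(default_factory=set)
    resource: Set[str] = field(default_factory=set)
    time: Set[str] = field(default_factory=set)
    control: Set[str] = field(default_factory=set)


def find_couplings(functions: List[FramFunction]) -> List[Tuple[str, str, str, str]]:
    """Return (upstream, label, downstream, aspect) for each matching label.

    Assumption: a coupling is an exact string match between an upstream
    Output label and a label in any non-Output aspect of another function.
    """
    couplings = []
    for up in functions:
        for down in functions:
            if up is down:
                continue
            for aspect in ASPECTS:
                if aspect == "output":
                    continue
                for label in sorted(up.output & getattr(down, aspect)):
                    couplings.append((up.name, label, down.name, aspect))
    return couplings


# Hypothetical functions, invented for illustration only.
rectify = FramFunction(
    name="Rectify fault",
    input={"fault reported"},
    output={"aircraft serviceable"},
    resource={"spares available"},
)
supply = FramFunction(name="Supply spares", output={"spares available"})
fly = FramFunction(name="Fly sortie", precondition={"aircraft serviceable"})

couplings = find_couplings([rectify, supply, fly])
for coupling in couplings:
    print(coupling)
```

For the three hypothetical functions, the sketch reports two couplings: the output of ‘Rectify fault’ is a precondition of ‘Fly sortie’, and the output of ‘Supply spares’ is a resource for ‘Rectify fault’. Exact string matching is a simplification; a larger model would depend on consistent naming of aspects across functions.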
By analysing the output variability from each function and the extent to which this variability is damped up-stream, it is possible to begin to understand how to analyse system performance from a resilience engineering point of view. The FRAM forms the basis of the case study in later chapters and is described in detail in Chapter 4.

2.6 Quantifying Resilience

Most approaches to quantifying resilience rely on surveys and audit approaches such as those described by Shirali et al. (2013) or by Saurin and Carim Junior (2012). However, whilst an overall system assessment is of value, system managers are interested in particular risks and in being able to quantify them and manage them towards ALARP levels, as required by legislation. Within process industries a high degree of automation can be achieved with intensive data collection and monitoring. These aspects mean that it is comparatively easy to run simulations and model different systems. Risks can therefore be assessed in a more quantifiable manner (Pasman et al., 2013).

A reliability approach to safety is easily quantifiable through linear decomposition to produce probabilistic risk assessment. By contrast, it is much more difficult to provide quantitative assessment using a resilience engineering approach. Luxhøj (2003) and Williams (1996) present Bayesian Belief Networks as a potential solution to low probability – high consequence risks. Slater (2013) has presented an approach to nesting BBN within a FRAM model, hence providing a way of quantifying risk analysis developed through FRAM. He presents this technique as an alternative to HAZOPS for use in the process and transport industries. Brooker (2011) analyses BBN in the aviation domain, focusing specifically on the ability of experts to provide accurate assessments of probability in the case of low probability events.
He notes the ‘Causal Markov Condition’, which is an assumption in BBN that there is no common cause failure mode across the network; issues such as ‘safety culture’ are therefore difficult to address. Other potential techniques for quantification are the use of fuzzy logic or fuzzy set theory with the use of Monte Carlo simulation (Shirali et al., 2013). An approach to quantifying resilience in the context of civil infrastructure is presented by Vugrin (2009), providing a menu of control engineering methodologies that may be suitable. The issue of data collection in more human-centric systems remains a barrier to expansion of this method. Quantification is the key if Resilience Engineering is going to gain ground against more traditional risk assessment techniques.

2.7 Concluding Remarks

The various ages of safety theory were all products of the technology of their time. Now, in an age characterised by networked technology, it is clearly time to fully address notions of complexity for the purpose of providing safe systems. This is certainly the case for the new generation of civil and military aircraft. Resilience Engineering appears to offer a different approach to previous theories and models. In particular, the notion that accidents emerge from unforeseen combinations of varying functional performance is a powerful one. It offers the prospect that analysis from this perspective might provide risk insights that may otherwise be missed. It also rings true from experience within an airworthiness environment. Notions of ‘accident trajectories’ and holes in processes or defences do not resonate in the same way.

There is an opportunity to combine efforts in process improvement and efficiency with safety strategies. Resilience engineering offers the theoretical framework and FRAM provides a potential method. This will be explored in subsequent sections. It remains the case, however, that there is some way to go to operationalize Resilience Engineering; Madni (2009) lists the key issues:
• Help organizational decision makers in making trade-offs between severe production pressures, required safety levels and acceptable risk.
• Measure organizational resilience.
• Identify ways to engineer the resilience of organizations.

The following chapters outline a case study in which this approach is tested.
3 METHODOLOGY

3.1 Introduction

In order to meet the research aim it was necessary to choose a technique with which to model an airworthiness management system. The literature review revealed that the Functional Resonance Analysis Method (FRAM) was the best way to practically apply resilience engineering principles. The FRAM therefore formed the basis of the practical element of the research. A single case study organisation was used, with an aspiration of delivering an operationally useful tool to the organisation at the end of the project. The case study was conducted in two stages:
• Stage 1 – Construct a FRAM Model of the Airworthiness Management System and concurrently develop a visualisation tool.
• Stage 2 – Test the model using scenarios drawn from occurrence reporting and potential in-service airworthiness risks.

The model was developed iteratively, using expert opinion and data from a variety of sources.

3.2 Working Arrangements

A key difficulty reported by other FRAM practitioners has been understanding ‘work as done’ rather than ‘work as imagined’. This was mitigated by conducting the research from within the case study organisation on a part-time basis, whilst working within the Force Operations Centre. Moreover, this was preceded by 9 years’ work in other roles in military airworthiness, including quality assurance, process improvement and error investigation roles. This provided insight into ‘work as done’ practice. Whilst there was a risk of bias, this was mitigated to some extent through exposing parts of the model to other workers within the organisation for verification.

3.3 Research Interviews

Semi-structured interviews were conducted with 19 different workers across all of the functions. The interviews were flexibly arranged at the interviewees’ work location (generally offices, but control rooms and tool stores were also visited). A pre-briefing was provided in the form of a two-sided A4 document, shown at Appendix C. The average interview duration was around 30 minutes, giving a rough total of around nine and a half hours of interview time over the course of the project. The general interview structure was as follows:
• Check understanding and clarify scope of the study.
• Confirm that the participant was currently engaged in the function as part of their daily activity.
• Check accuracy of each of the function aspects.
• Open questioning to highlight particular areas of variability in the ‘aspects’ of the function.
• Open questions to ascertain whether any aspects had been missed.
• Open questions to ascertain whether the participant’s work covered any further relevant functions.

The following research interviews were conducted:
• Deputy Continuing Airworthiness Manager.
• Engineering Authority – various team members.
• Military Airworthiness Review Certificate team member.
• Continuing Airworthiness Management Organisation Quality Manager.
• Experienced Aircraft Technician at Inspector Level.
• Tornado Forward Fleet Manager.
• Front Line Squadron Senior and Junior Engineering Officers.
• Front Line Squadron Rectification Controller, Line Controller, Weapons, Mechanical and Avionic Trade Managers, with additional contributions from various mechanics, technicians, supervisors and inspectors.
• Tool Stores Controller.
• Rolls-Royce Technical Support Manager.
• BAES Technical Support Manager.
• BAES Reliability Engineering Manager.
• Depth Workshops Supervisor.
• Ground Support Equipment Trade Manager.
• Station Air Safety Officer.

3.4 Model Development

Most if not all risk assessment or incident investigation methods require practitioners to be trained in the application of the technique and are generally most effectively applied in teams (e.g. HAZOPS, Safety Panels, etc.). Time and resources precluded this approach for the case study; however, insight from a number of other practitioners’ case studies was gained through attendance at the annual FRAM Workshop in Munich. Whilst Hollnagel’s (2012) guidelines for FRAM model development were followed, the final Tornado Airworthiness System Model used a number of innovative approaches. The main innovation was the use of a Microsoft Visio drawing to provide an interactive ‘visualisation’ tool. This approach allowed the creation of a much larger model than has been recorded to date in the literature. The visualisation tool was developed concurrently with the spreadsheet model, which allowed for greater accuracy by cross-checking between the two methods of describing the model. The final model contains a total of 69 individual functions with 985 individual aspects described.

Where inconsistencies in the model became apparent or there was a gap in knowledge, a variety of experts were used to provide additional information through conversation or correspondence. In particular, various key meetings were attended which provided insights that assisted with model development:
• Force Operations Centre Daily Summary.
• Joint Qualifications and Trials Meeting.
• Level B Capability Programme Reviews.
• Various Upgrade Readiness Reviews.
• Fleet Planning Meetings.
• Scheduled Maintenance Reviews.
• Depth HQ Value Stream Analysis – Continuous Improvement Event.
• Mission Essential Equipment Continuous Improvement Event.
• Air Safety Occurrence Investigators Workshop.