Resilience Engineering: A field of study, a community, and some perspective shifting.

Resilience Engineering
The field, the community, and some
perspective shifting.
John Allspaw
Adaptive Capacity Labs

[Screenshot of a diff, labelled “example #2”: index.html, hunk @@ -1,2 +1,2 @@, showing 1 changed file with 1 addition and 1 deletion.]

http://bitly.com/AllspawThesis

http://stella.report
Year-long project
Researchers analyzed 3 incidents, at:
Six themes
•Postmortems as re-calibration
•Blameless v. sanctionless after action actions
•Controlling the costs of coordination
•Visualizations during anomaly management
•Strange Loops
•Dark Debt

What You Are In For
1. Resilience Engineering: a field and a community
2. Accentuating the positive
3. Avoidance of shallow data
4. Some food for thought

• A field of study that emerged largely from Cognitive Systems Engineering,
early 2000s.
• 7 symposia over 12 years

Community
is largely made up of practitioners and researchers from…
Human Factors & Ergonomics
Cognitive Systems Engineering
Cybernetics
Complexity Science
Engineering*
Psychology
Sociology
Ecology
Safety Science
working in these domains…
Aviation/ATM
Rail
Maritime
Space
Surgery
Power Plants
Intelligence Agencies
Law Enforcement
Mining
Construction
Explosives
Firefighting
Anesthesia
Pediatrics
Power Grid & Distribution
Military Agencies
Software Engineering

Some of the cast of characters
David Woods
CSEL/OSU
Shawna Perry
Univ of Florida
Emergency Medicine
Dr. Richard Cook
Anesthesiologist
Researcher
Ivonne Andrade Herrera
SINTEF
Erik Hollnagel
Univ of S. Denmark
Anne-Sophie Nyssen
University de Liege
Johan Bergström
Lund University
Sidney Dekker
Griffith University
Asher Balkin
CSEL/OSU
Laura Maguire
CSEL/OSU

Sample of Research
Experiences in Fukushima Dai-ichi nuclear power plant in light of resilience engineering
Unmanned Aircraft Systems in (Inter)national Airspace: Resilience as a Lever in the Debate
Sociotechnical Networks for Power Grid Resilience: South Korean Case Study
Limits on adaptation: Modeling Resilience and Brittleness in Hospital Emergency Departments

[Diagram: the running system. Labels: internally sourced code; externally sourced code (e.g. DB); delivery technology stack; results; the using world.]

[Diagram, built up over several slides around the running system.
Artifacts: code repositories, macro descriptions, testing/validation suites, code, “code stuff”, meta rules, scripts, rules, etc., test cases.
Tools: code generating tools, testing tools, deploy tools, organization/encapsulation tools, “monitoring” tools.
Activity labels: getting stuff ready to be part of the running system; adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing.]

[Same diagram, annotated in three regions: “The Work Is Done Here”, “The Stuff You Build and Maintain With”, and “Your Product Or Service”.]

Copyright © 2016 by R.I. Cook for ACL, all rights reserved
ack: Michael Angeles http://konigi.com/tools/
What matters. Why what matters matters.
[Diagram: the same picture divided by “the line of representation” into “above the line” and “below the line”.
Below the line: the code, tools, delivery technology stack, and results from the earlier diagram.
Questions shown: “What’s it doing?”, “Why is it doing that?”, “What is happening?”, “What should be happening?”, “How should this work?”, “What needs to change?”, each paired with “What does it mean?”
Other labels: goals, purposes, risks; cognition, actions, interactions; speech, gestures, clicks, signals; representations, artifacts.
Activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting.
Note on the diagram: individuals have unique models of the “system”.]

[Same diagram with a “Time” axis added and two annotations: “things are changing here…” and “…and things are changing here”.]

“above the line”
…is not “management”
…is not “organization design” or reporting structures
…is how people work (detect/diagnose/solve problems, both acute and
chronic) alongside and with technology and each other, under continual
trade-off scenarios, that provide the audacity to build and sustain adaptive
capacity.

Resilience is something that a system
does, not what a system has.

“Resilience is an expression of how people, alone or together, cope with everyday situations – large and small – by adjusting their performance to the conditions. An organization’s performance is resilient if it can function as required under expected and unexpected conditions alike (changes/disturbances/opportunities).”

Hollnagel, Erik. Safety-II in Practice: Developing the Resilience Potentials

“Resilience is sustained adaptive capacity.”
–David Woods (2015)

Resilience is the story of the outage
that didn’t happen.

If you haven’t found people responsible for
outcomes, you haven’t “seen” the system.

Traditional view on the role of people (“Safety-I”)
Humans are predominantly seen as a liability or hazard.
They are a problem to be fixed.
RE view on the role of people in complex systems (“Safety-II”)
Humans are seen as a resource necessary for system flexibility and resilience.
They provide flexible solutions to many potential problems.

How does our software work, really?
How does our software break, really?
What do we do to keep it all working?

explanations of accidents
Safety-I
Accidents are caused by failures and malfunctions. The purpose of an
investigation is to identify causes and contributory factors.
Safety-II
Things go well and fail in basically the same ways, regardless of outcome.
The purpose of an investigation is to understand how things usually go right
as a basis for explaining how things occasionally go wrong.

incidents
(outages, degradations, breaches, accidents, near-misses, “glitches”,
untoward/unexpected events, etc.)

what makes incidents
interesting & valuable?

[Diagram: the “above the line” / “below the line” picture, shown again.]
incidents as…
drivers of software design
- “incidents of yesterday inform the architectures of tomorrow”
- incidents “below the line” drive changes “above the line”
- staffing, budgets, planning, roadmaps, etc.
- shape the design of new components, subsystems, architectures
💥

incidents as…
motivators for policy
- “incidents of yesterday inform the rules of tomorrow”
- tend also to give birth to new forms of regulations, policies, norms, compliance requirements, explosion of documentation, auditing, constraints, etc.
- influence staffing, budgets, planning, roadmaps, etc.
“Regulation SCI”
5/6/2010 - “Flash Crash” - loss of $1 trillion in market value in <10min
3/23/2012 - BATS IPO - systems issue halted the exchange’s own IPO
5/18/2012 - Facebook IPO - systems issue delayed IPO trading
8/1/2012 - Knight Capital - loss of $461 million in 45 minutes
PCI-DSS
1988-1998, Visa and MasterCard reported credit card losses due to fraud of $750 million

incidents tend to focus our
attention on what matters
💥

incidents help us gauge the delta (Δ) between
how we think the system works
and
how the system works
Δ is almost always greater than we imagine

research validates these opportunities
“…nonroutine, challenging events, because these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomena.” (Klein, Crandall, Hoffman, 2006)
A family of well-worn methods, approaches, and techniques
- Cognitive task/work analysis
- Process tracing
- Conversation analysis
- Critical decision method
- Critical incident technique
- more…

[Timeline diagrams, built up over several slides: individual incidents drawn as bars marked start, detect, and resolve, labelled 12 and 54 minutes; 20 and 73 minutes; 5 and 25 minutes; 135 and 100 minutes. The bars are then collapsed into a chart titled “incidents” with a “minutes” axis over jan, feb, mar, apr, may, jun.]

“Resilience is an expression of how people, alone or together, cope with everyday situations – large and small – by adjusting their performance to the conditions. An organization’s performance is resilient if it can function as required under expected and unexpected conditions alike (changes/disturbances/opportunities).”

“Resilience is
sustained adaptive capacity.”

What is it doing?!
Why is it doing that?!
What will it do next?
How did it get into this state?
WTF is happening?
If we do Y, will it help us figure out what to do?
Is it getting worse?
It looks like it’s fixed…but is it…?
If we do X, will it prevent it from getting worse…or make it worse?
Who else should we call that can help us?
Is this OUR issue, or are we BEING ATTACKED?!

incidents provide calibration about…
how decisions are focused
how attention flows
how work is coordinated
how escalation manifests
the weight of time pressure
the effects of uncertainty
the impact of ambiguity
what consequences are consequential

What can we learn
about these…
how decisions are focused
how attention flows
how work is coordinated
how escalation manifests
the weight of time pressure
the effects of uncertainty
the impact of ambiguity
what consequences are consequential
…from these?
(M)TTR?
(M)TTD?
Frequency of incidents?
Severity of incidents?
Customer impact?
Number of deploys?
“…while there is value in the items on the right, we value the items on the left more.”
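
To make concrete how little the metrics on the right carry, here is a minimal sketch of how (M)TTD and (M)TTR are typically computed. The two incident records are hypothetical, not data from this talk, and measuring time-to-resolve from incident start (rather than from detection) is an assumption; teams define it both ways.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: (start, detect, resolve).
# These timestamps are made up for illustration only.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = {
    "A": (t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=60)),
    "B": (t0, t0 + timedelta(minutes=55), t0 + timedelta(minutes=60)),
}


def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


def time_to_detect(start, detect, resolve):
    """Minutes from incident start until someone noticed."""
    return minutes(detect - start)


def time_to_resolve(start, detect, resolve):
    """Minutes from incident start until resolution (assumed definition)."""
    return minutes(resolve - start)


ttd = {name: time_to_detect(*times) for name, times in incidents.items()}
ttr = {name: time_to_resolve(*times) for name, times in incidents.items()}

mttd = mean(ttd.values())
mttr = mean(ttr.values())

print("TTD per incident:", ttd)
print("TTR per incident:", ttr)
print("MTTD:", mttd, "MTTR:", mttr)
```

Incidents A and B produce the same time-to-resolve, yet one went unnoticed for most of its duration and the other was noticed almost at once and took a long time to work through. Nothing in the numbers says how attention flowed, how work was coordinated, or what the uncertainty felt like.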

Thought Food
• We cannot comprehensively understand how our systems behave - we
continually build and revise our understandings based on (relatively sparse)
signals our tech sends us.
• Continuous delivery and “Chaos”/fault injection are coping strategies (hedges)
for the above state of affairs.
• Understanding activities “above the line” are basically unexplored or
ignored in our industry, and this needs to change.
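
As an illustration of the fault-injection bullet above, here is a minimal sketch of the idea: deliberately make a dependency fail some of the time and observe what the calling code actually does. All names (`with_faults`, `lookup_price`, `price_with_fallback`) are hypothetical and do not come from any real chaos-engineering tool.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose by the fault injector."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def lookup_price(item):
    """Stand-in for a call to some dependency (hypothetical)."""
    return {"widget": 10}.get(item, 0)


def price_with_fallback(lookup, item, default=-1):
    """Caller under study: falls back to a default when the lookup fails."""
    try:
        return lookup(item)
    except InjectedFault:
        return default


rng = random.Random(0)  # seeded so a run can be repeated
flaky_lookup = with_faults(lookup_price, failure_rate=0.5, rng=rng)

results = [price_with_fallback(flaky_lookup, "widget") for _ in range(1000)]
fallbacks = results.count(-1)
print(f"{fallbacks} of {len(results)} calls used the fallback")
```

The value is less in the count than in comparing what happened with what people expected to happen, which is one more sparse signal for revising a mental model of the system.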
