A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and unexpected conditions.
Also, in a world of complex systems, human error as an explanation for failure is largely a fallacy: it is an obstacle to learning and, therefore, to building resilient systems.
2. João Miranda
17 years in the IT world: developer, scrum master, ALM team lead, dev team lead, agile coach, solution architect.
Copes (tries to!) with 10+ Scrum teams.
DevOps Lisbon meetup co-organizer.
3. Human Factors & System Safety
Lund University - MSc and Learning Labs
4. In the 19th Century, things were a bit simpler...
Harvest at La Crau, with Montmajour in the Background, June 1888, Van Gogh Museum
5. In the early 20th Century, things got more complicated…
Industrial Revolution - History.com
6. “Now one of the very first requirements for a man who is fit to handle pig iron as a regular occupation is that he shall be so stupid and so phlegmatic that he more nearly resembles in his mental make-up the ox than any other type.”
F.W. Taylor, Principles of Scientific Management. 1911. New York and London: Harper & Brothers.
13. Complex Systems - In Layman’s Terms
“components come together to behave in different (sometimes surprising) ways that they never would on their own, in isolation.”
John Allspaw, “Resilience Engineering Part II: Lenses” (2012)
19. 5 Domains of Decision-Making
(The fifth, in the middle, is disorder.)
Cynefin Framework
By Snowden - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=33783436
20. Cynefin Framework - in detail
By Edwin Stoop (User:Marillion!!62) - [1], CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=53810658
21. Cognitive Demands of a Domain
● Dynamism
● Number of parts and extent of their interconnections
● Uncertainty
● Risk
A domain is complex if it is high in all of these dimensions.
* David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
22. Most of the time, we are in the complex domain.
Probe. Sense. Respond.
23. We need to be there in order to keep up with the fast pace of change.
27. A Definition
“A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and unexpected conditions.”
Erik Hollnagel, “Resilience Engineering”
32. “[Systems] that can manage something before it happens, by analysing the developments in the world around and preparing itself as well as possible.”
Erik Hollnagel, “Resilience Engineering”
35. Keep Up with New Business Models
E.g.: Web APIs*, Fintechs**.
* John Musser, “20 Business Models in 20 Minutes” (2013)
** McKinsey & Company, “Cutting Through the FinTech Noise: Markers of Success, Imperatives for Banks” (2015)
37. Do GameDay Exercises
“An exercise that tests a company’s systems, software, and people in the course of preparing for a response to a disastrous event”*
* Tom Limoncelli et al., “Resilience Engineering: Learning to Embrace Failure” (2012)
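Not from the talk itself: a minimal sketch of how one failure-injection step of a GameDay could be scripted and timed. The "users-api" container, the docker commands and the alert_fired() check are placeholders for whatever tooling the exercise actually uses.

# GameDay drill sketch: inject one failure, measure how long monitoring
# takes to notice it, and always restore the service at the end.
import subprocess
import time

INJECT = ["docker", "stop", "users-api"]    # hypothetical failure injection
RESTORE = ["docker", "start", "users-api"]  # hypothetical recovery step
DETECTION_DEADLINE_S = 300                  # the team expects an alert within 5 minutes


def alert_fired() -> bool:
    """Stand-in for a query against the real alerting system's API."""
    return False  # replace with an actual check during the exercise


def run_gameday() -> None:
    print("Injecting failure:", " ".join(INJECT))
    subprocess.run(INJECT, check=True)
    started = time.monotonic()
    try:
        while time.monotonic() - started < DETECTION_DEADLINE_S:
            if alert_fired():
                print(f"Alert fired after {time.monotonic() - started:.0f}s")
                break
            time.sleep(10)
        else:
            print("No alert within the deadline - a monitoring gap to learn from")
    finally:
        print("Restoring service:", " ".join(RESTORE))
        subprocess.run(RESTORE, check=True)


if __name__ == "__main__":
    run_gameday()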
40. “[Look for] that which could seriously affect the system’s performance in the near term – positively or negatively. The monitoring must cover the system’s own performance as well as what happens in the environment.”
Erik Hollnagel, “Resilience Assessment Grid (RAG)”
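As an illustration of that sentence only (not part of the RAG), here is a small sketch of a check that covers both sides: the system's own performance and the state of its environment. Every URL, field name and threshold below is an assumption.

# Monitoring sketch: one probe for "our own performance", one for "the environment".
import json
import urllib.request

INTERNAL_METRICS_URL = "http://localhost:9090/metrics.json"              # assumed in-house endpoint
DEPENDENCY_STATUS_URL = "https://status.example.com/api/v2/status.json"  # assumed provider status API
ERROR_RATE_THRESHOLD = 0.05


def fetch_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def check() -> list[str]:
    findings = []
    own = fetch_json(INTERNAL_METRICS_URL)
    if own.get("error_rate", 0.0) > ERROR_RATE_THRESHOLD:
        findings.append("own error rate above threshold")
    env = fetch_json(DEPENDENCY_STATUS_URL)
    if env.get("status", {}).get("indicator") not in (None, "none"):
        findings.append("upstream provider reporting degraded service")
    return findings


if __name__ == "__main__":
    for finding in check():
        print("WARNING:", finding)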
51. “Knowing what to do, or being able to respond to regular and irregular changes, disturbances, and opportunities by activating prepared actions or by adjusting current mode of functioning.”
Erik Hollnagel, “Resilience Assessment Grid (RAG)”
52. Characteristics of Response in Escalating Scenarios
Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980), via John Allspaw, “Resilience Engineering Part II: Lenses” (2012)
53. “[People] tend to neglect how processes develop over time (awareness of rates) versus assessing how things are in the moment.”
54. “…[people] have difficulty in dealing with exponential developments (hard to imagine how fast things can change, or accelerate).”
55. “…[people] tend to think in causal series as opposed to causal nets.”
Causal series: (A, therefore B)
Causal net: (A and B, therefore C and D, therefore E and A and F)
56. Pitfalls to Be Aware of
A sample.*
* David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
57. Failure to Adapt to New Events
People may get fixated on initial assessments.
58. Failure to Use External Guidance to Direct Focus
E.g.: Starting to treat a cause before treating more pressing consequences.
59. Failures of Prospective Memory
Forgetting to recall an intention at some future point in time.
60. Treating Interconnected Events as Independent
E.g.: Failing to consider how a recently deployed change to the Users API may be causing the Check-out process to fail.
64. Response in Simple and Complicated Systems
Runbooks and Linear Cause-Consequence Analysis are usually enough.
65. Response in Complex Systems
Probe-Sense-Respond
Ensure Different Roles Can Work as a Team (Dev+Ops)
Remove Barriers to Information Sharing
Leverage GameDay Insights
(Seamlessly) Record Events for Post-Mortem Analysis (see the sketch below)
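A hedged sketch of what "seamlessly recording events" might look like; the file path and fields are assumptions, and in practice the call would be wired into a ChatOps bot or deployment tooling rather than typed by hand.

# Incident timeline sketch: every notable action is appended, timestamped,
# to a shared log that the post-mortem can later replay in order.
import datetime
import json
from pathlib import Path

TIMELINE = Path("/var/log/incident-timeline.jsonl")  # hypothetical location


def record(actor: str, action: str, detail: str = "") -> None:
    """Append one timeline entry as a single JSON line."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "detail": detail,
    }
    with TIMELINE.open("a") as f:
        f.write(json.dumps(entry) + "\n")


# Example use during an incident (names invented):
# record("joana", "rolled back users-api to v1.4.2", "error rate spiking after deploy")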
68. Learning from the Past
“Manage something not only when it happens but also after it has happened. (...) [A system] can use this learning to adjust both how it monitors and how it responds.”
Erik Hollnagel, “Resilience Engineering”
81. “Employing simplicity thinking and linear logic, the official findings and the judicial rulings determined that the train driver was “exclusively” responsible for the crash.”*
* “Disaster complexity and the Santiago de Compostela train derailment”
82. Amazon’s outage
“Amazon’s massive AWS outage was caused by human error. One incorrect command and the whole internet suffers.”
Recode, March 2, 2017
83. “During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment (...)”
“Knightmare: A DevOps Cautionary Tale” (Knight Capital Loses $440 Million in 30 Minutes)
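The case study itself does not prescribe a fix, but as a hedged illustration, a post-deploy check along these lines would flag a server still running old code; the host names and the "/version" endpoint are invented for the example.

# Post-deploy verification sketch: confirm every server reports the same release.
import urllib.request

SERVERS = [f"smars-{i:02d}.internal" for i in range(1, 9)]  # eight hypothetical hosts
VERSION_PATH = "/version"  # assumed endpoint returning the deployed release id


def deployed_versions() -> dict[str, str]:
    versions = {}
    for host in SERVERS:
        with urllib.request.urlopen(f"http://{host}{VERSION_PATH}", timeout=5) as resp:
            versions[host] = resp.read().decode().strip()
    return versions


def verify_uniform_deployment() -> None:
    versions = deployed_versions()
    if len(set(versions.values())) != 1:
        raise SystemExit(f"Version mismatch across servers: {versions}")
    print("All servers run the same release:", next(iter(versions.values())))


if __name__ == "__main__":
    verify_uniform_deployment()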
86. [It is] well established that accidents cannot be attributed to a single cause or (...) a single individual.
87. Four Needs an Accident Report Must Fulfill
Sidney Dekker, “The psychology of accident investigation: epistemological, preventive, moral and existential meaning-making” (2014)
89. The way we look at human error focuses on moral and existential needs.
90. “Old” View of Human Error
● Human error is cause of failure
● Engineered systems are safe
● Make progress by protecting systems from unreliable humans
91. Hindsight Bias
“The inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it.”
“Hindsight bias”
93. Fundamental Attribution Error
“Our tendency to explain someone’s behaviour based on internal factors, such as personality or disposition, and to underestimate the influence that external factors, such as situational influences (...).”
“Fundamental Attribution Error - Definition & Overview”
94. It’s easier to change people than basic beliefs about a system.
95. Local Rationality Principle
“People do things that make sense to them given their goals, understanding of the situation and focus of attention at that time.
Work needs to be understood from the local perspectives of those doing the work.”
“Local Rationality”
97. Counterfactuals
“The human tendency to create possible alternatives to life events that have already occurred. They are thoughts that consist of ‘If I had only’.”
“Counterfactual Thinking”
98. Counterfactuals can affect people’s emotions, e.g. regret, guilt or relief.
They can also affect how people decide who deserves blame and responsibility.
100. A New View on Human Error
Human error as a symptom of failure
101. “New” View
● Human error as symptom of failure
● Safety is not inherent in systems
● Human error connected to features of people, tools, tasks and operating environment
102. Moving from Anecdote to Concept-Based Results
Five steps from context-specific to concept-dependent.
Sidney Dekker, “Reconstructing human contributions to accidents: the new view on error and performance” (2014)
103. 1. Lay Out the Sequence of Events in Context-Specific Language
How did people's mindset unfold in parallel with the situation evolving around them, and how did people influence the course of events?
104. 2. Divide the Sequence of Events into Episodes
If the accident evolves over a long period of time.
105. 3. Find Out How the World Looked or Changed During Each Episode
Couple behaviour with the situation. Connect the available information with how it was presented to people.
106. 4. Identify People's Goals, Focus of Attention and Knowledge Active at the Time
What people know and what they try to accomplish (their goals) determine where they will look, and hence the data that is available to them.
107. 5. Step Up to a Conceptual Description
This step is crucial so that we can learn from failures and identify commonalities between different events.
108. Now go and make your organization more humane...