Resilience Engineering &
Human Error
… in a world of complexity
João Miranda
17 years in the IT world: developer, scrum master, ALM team
lead, dev team lead, agile coach, solution architect
Copes (tries to!) with 10+ Scrum teams
DevOps Lisbon meetup co-organizer
Human Factors & System
Safety
Lund University - MSc and Learning Labs
In the 19th Century,
things were a bit
simpler...
Harvest at La Crau, with Montmajour in the Background, June 1888, Van Gogh Museum
In the early 20th Century,
things got more
complicated…
Industrial Revolution - History.com
“Now one of the very first requirements for a man who is fit
to handle pig iron as a regular occupation is that he shall
be so stupid and so phlegmatic that he more nearly
resembles in his mental make-up the ox than any other
type.”
F.W. Taylor. Principles of Scientific Management.
1911. New York and London, Harper & brothers.
In the 21st
Century, well...
Things change faster and faster.
Deepwater Horizon.
Production Safety
We need both.
What is complex?
A Complex World
Complex Systems - In Layman’s Terms
“components come together to behave in different
(sometimes surprising) ways that they never would on their
own, in isolation.”
John Allspaw
“Resilience Engineering Part II: Lenses” (2012)
● Emergent Behaviour
● Feedback Loops
● Cascading Failures
● Non-Linear Responses
● Fluid Boundaries
5 domains of decision-making
(The fifth, in the middle, is disorder)
Cynefin Framework
By Snowden - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=33783436
Cynefin
Cynefin Framework - in detail
By Edwin Stoop (User:Marillion!!62), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=53810658
Cognitive Demands of a Domain
● Dynamism
● Number of parts and extensiveness of its interconnections
● Uncertainty
● Risk
A domain is complex if high in all of these dimensions.
* David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
Most of the time,
we are in the complex domain.
Probe. Sense. Respond.
We need to be there in order to keep up
with the fast pace of change.
Sometimes, we fall into the chaotic.
Act! Sense. Respond.
We want to move to the complicated or
simple domains, whenever possible.
From novel or emergent practices to
good and best practices.
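The domain-to-response mapping can be written down as a small lookup table. This is only a sketch: the slides name the sequences for the complex and chaotic domains, while the sequences for the simple and complicated domains and the practice labels are taken from the usual description of the Cynefin framework. The `Domain` class and `playbook` function are hypothetical names, and the fifth domain (disorder) is left out because it has no response sequence of its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    response: tuple  # ordered decision steps for this domain
    practice: str    # kind of practice that tends to work here


# Assumption: entries for "simple" and "complicated" come from the common
# Cynefin description, not from the slides themselves.
CYNEFIN = {
    "simple": Domain("simple", ("Sense", "Categorize", "Respond"), "best practice"),
    "complicated": Domain("complicated", ("Sense", "Analyze", "Respond"), "good practice"),
    "complex": Domain("complex", ("Probe", "Sense", "Respond"), "emergent practice"),
    "chaotic": Domain("chaotic", ("Act", "Sense", "Respond"), "novel practice"),
}


def playbook(domain_name: str) -> str:
    """Render the response sequence and practice type for one domain."""
    d = CYNEFIN[domain_name]
    return f"{d.name}: {' -> '.join(d.response)} ({d.practice})"


if __name__ == "__main__":
    for name in CYNEFIN:
        print(playbook(name))
```

Reading the table from bottom to top mirrors the movement the slides describe: from novel and emergent practices towards good and best practices.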
Resilience Engineering
How to Survive In Today’s World
“A system is resilient if it can adjust its functioning prior to,
during, or following events (changes, disturbances, and
opportunities), and thereby sustain required operations
under both expected and unexpected conditions.”
Erik Hollnagel
“Resilience Engineering”
A Definition
Resilience Engineering is not IT specific
It’s not even its main focus
Roots in System Safety
Aviation, Shipping, Nuclear, Health Care, ...
1. Anticipate the Future
“[Systems] that can manage something before it happens,
by analysing the developments in the world around and
preparing itself as well as possible.”
Erik Hollnagel
“Resilience Engineering”
Anticipation Implies Risk-Taking
But there’s also risk in not anticipating: reacting too late.
Detect Shifting Technical Trends
Did you anticipate: containers, microservices, big data, ...?
Keep Up with New Business Models
E.g.: Web APIs*, Fintechs**.
*John Musser, “20 Business Models in 20 Minutes” (2013)
** McKinsey & Company, “Cutting Through the FinTech Noise: Markers of Success, Imperatives for Banks” (2015)
Perform Architectural Reviews
Adopting new architectures or tech must not be taken lightly.
Do GameDay Exercises
“An exercise that tests a company’s systems, software, and
people in the course of preparing for a response to a
disastrous event”*
*Tom Limoncelli et al., “Resilience Engineering: Learning to Embrace Failure” (2012)
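One small ingredient of a GameDay is deliberately making a dependency fail and watching how the rest of the system copes. The sketch below is a toy illustration under stated assumptions: `FlakyDependency`, `fetch_price` and `price_with_fallback` are hypothetical names, and the failure is injected on a fixed call number so the run is repeatable. A real exercise also tests people and procedures, which no code snippet captures.

```python
class InjectedFailure(Exception):
    """Raised only by the fault injector, never by real code."""


class FlakyDependency:
    """Wraps a callable and fails on chosen call numbers (1-based)."""

    def __init__(self, func, fail_on):
        self.func = func
        self.fail_on = set(fail_on)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on:
            raise InjectedFailure(f"injected failure on call {self.calls}")
        return self.func(*args, **kwargs)


def fetch_price(item):
    # Hypothetical primary dependency (e.g. a pricing service).
    return {"apple": 3, "pear": 4}[item]


def price_with_fallback(lookup, item, cached):
    """Use the live lookup; fall back to a possibly stale cache on failure."""
    try:
        return lookup(item), "live"
    except InjectedFailure:
        return cached[item], "cached"


flaky = FlakyDependency(fetch_price, fail_on=[2])  # second call fails
cache = {"apple": 3, "pear": 5}                    # deliberately stale value
results = [price_with_fallback(flaky, "pear", cache) for _ in range(3)]
print(results)  # [(4, 'live'), (5, 'cached'), (4, 'live')]
```

The stale cached value is intentional: the exercise shows not only that the fallback works, but also what users would actually see while it is active.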
Anticipation:
Know
What to Expect
2. Monitor the Present
“[Look for] that which could seriously affect the system’s
performance in the near term – positively or negatively. The
monitoring must cover the system’s own performance as
well as what happens in the environment.”
Erik Hollnagel
“Resilience Assessment Grid (RAG)”
Technical Actors In IT
i.e.: your hardware and software
Anomaly Detection
Collect Metrics, Trigger Alerts
QA & Monitoring. They tend to be the same thing over time...
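As a rough illustration of "collect metrics, trigger alerts", the sketch below flags a sample that deviates strongly from a rolling window of recent values. The class name `RollingAnomalyDetector`, the window size, the warm-up of five samples and the threshold of three standard deviations are all assumptions chosen for the example, not recommendations from the talk.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flags values far from the mean of a sliding window of recent samples."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically
        self.threshold = threshold           # allowed distance in standard deviations

    def observe(self, value: float) -> bool:
        """Record a sample and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.samples) >= 5:  # wait for a few points before judging
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.samples.append(value)
        return is_anomaly


detector = RollingAnomalyDetector()
latencies_ms = [100, 102, 98, 101, 99, 103, 100, 450]  # hypothetical metric
alerts = [v for v in latencies_ms if detector.observe(v)]
print(alerts)  # [450]
```

Note that anomalous samples are still added to the window, so a sustained shift gradually becomes the new baseline. Whether that is desirable depends on the metric.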
Make it
Visible
Graph everything
Social Actors In IT
i.e.: humans and their interactions
How well do the different teams (e.g.
Dev and Ops) work together?
How does the organization handle
conflicting goals?
Beware the siren calls for creating
“objective” and/or “precise”
performance indicators.
Monitor:
Know
What to Look For
3. Respond Now
“Knowing what to do, or being able to respond to regular
and irregular changes, disturbances, and opportunities by
activating prepared actions or by adjusting current mode
of functioning.”
Erik Hollnagel
“Resilience Assessment Grid (RAG)”
Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980),
via John Allspaw, “Resilience Engineering Part II: Lenses” (2012)
Characteristics of Response in
Escalating Scenarios
“[people] tend to neglect how processes develop over time
(awareness of rates) versus assessing how things are in
the moment.”
“…[people] have difficulty in dealing with exponential
developments (hard to imagine how fast things can
change, or accelerate).”
“…[people] tend to
think in causal series as opposed to causal nets
(A, therefore B) ->
(A and B, therefore C and D, therefore E and A and F)”
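The gap between "how things are" and "how fast they are changing" is easy to see with a little arithmetic. The sketch below compares two ways of extrapolating from the same two observations; the queue depths are hypothetical numbers and the function names are made up for this example.

```python
def linear_forecast(current: float, previous: float, steps: int) -> float:
    """Extrapolate assuming the last absolute increase repeats every step."""
    return current + (current - previous) * steps


def exponential_forecast(current: float, previous: float, steps: int) -> float:
    """Extrapolate assuming the last growth ratio repeats every step."""
    return current * (current / previous) ** steps


# Hypothetical queue depth, sampled one minute apart.
queue_prev, queue_now = 100, 200

for minutes in (1, 5, 10):
    print(
        minutes,
        linear_forecast(queue_now, queue_prev, minutes),
        exponential_forecast(queue_now, queue_prev, minutes),
    )
```

After ten minutes the linear guess is 1,200 items while the doubling process reaches 204,800. Both forecasts start from the same two data points, which is why a snapshot alone says little about where an escalating incident is heading.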
Pitfalls to Be Aware of
A sample.
* David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
Failure to Adapt to New Events
People may get fixated on initial assessments.
Failure to Use External Guidance to
Direct Focus
E.g.: Start treating a cause before treating more pressing
consequences.
Failures of Prospective Memory
Forgetting to recall an intention for some future point in time.
Treating Interconnected Events as
Independent
E.g.: Failing to consider how a recently deployed change to
the Users API may be causing the Check-out process to fail.
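One way to avoid treating such events as independent is to keep an explicit map of which services consume which, and to walk it whenever something changes. The sketch below assumes a hypothetical dependency map: only the Users API and the check-out process come from the example above, every other service name is invented for illustration.

```python
from collections import deque

# Hypothetical "is consumed by" edges: each key lists its direct consumers.
CONSUMERS = {
    "users-api": ["auth-service", "profile-page"],
    "auth-service": ["check-out", "admin-console"],
    "check-out": [],
    "profile-page": [],
    "admin-console": [],
    "search": ["catalog-page"],
    "catalog-page": [],
}


def possibly_affected(changed: str, consumers: dict) -> list:
    """Breadth-first walk from the changed service to everything downstream."""
    seen = {changed}
    queue = deque([changed])
    order = []
    while queue:
        service = queue.popleft()
        for consumer in consumers.get(service, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


affected = possibly_affected("users-api", CONSUMERS)
print(affected)  # ['auth-service', 'profile-page', 'check-out', 'admin-console']
```

The check-out process shows up even though it never calls the Users API directly, which is exactly the kind of indirect link that is easy to miss under pressure. In a complex system such a map is always incomplete, so it narrows the search rather than settling it.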
Overreliance on Familiar Signs
“The site is so slow. It must be the database again.”
Response in IT
An overview
Response Triggers
Users start complaining
Chance
Anomaly detection and alerting systems work
Response on Simple and Complicated Systems
Runbooks
and
Linear Cause-Consequence Analysis
are usually enough.
Response on Complex Systems
Probe-Sense-Respond
Ensure Different Roles can Work as a Team (Dev+Ops)
Remove Barriers to Information Sharing
Leverage GameDay Insights
(Seamlessly) Record Events for Post-Mortem Analysis
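Recording events seamlessly usually means that writing an entry costs responders almost nothing during the incident. A minimal sketch, assuming a hypothetical `IncidentTimeline` helper that appends timestamped JSON lines to any writable text stream (a file, a chat-bot backend, or an in-memory buffer as here):

```python
import io
import json
from datetime import datetime, timezone


class IncidentTimeline:
    """Appends one JSON object per line so the log is easy to replay later."""

    def __init__(self, sink):
        self.sink = sink  # any object with a write(str) method

    def record(self, actor: str, action: str, **details) -> dict:
        event = {
            "at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp
            "actor": actor,
            "action": action,
            "details": details,
        }
        self.sink.write(json.dumps(event) + "\n")
        return event


buffer = io.StringIO()  # stand-in for a real file or log stream
timeline = IncidentTimeline(buffer)
timeline.record("alerting", "page sent", metric="latency_ms", value=450)
timeline.record("on-call dev", "rolled back deploy", service="users-api")
print(buffer.getvalue())
```

Storing who did what and when, without interpretation, keeps the record usable for a post-mortem that tries to understand how the situation looked to people at the time.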
Response:
Know
What to Do
4. Learn from the Past
“Manage something not only when it happens but also after
it has happened. (...) [A system] can use this learning to
adjust both how it monitors and how it responds.”
Erik Hollnagel
“Resilience Engineering”
Learning from the Past
Obstacles to Learning
Why learning organizations are (probably) a minority.
Human Error “Explains” Everything
We tend to fall prey to our own biases
Blame Culture
Real or perceived. It doesn’t matter.
Misguided Focus On Production
Efficiency.
“I’m too busy to sharpen my saw.”
Common Ways to Learn
● Post-Mortems
● Retrospectives
● Accident Reports
Learn from Failure and Success
“Near-misses” are especially interesting.
Learning:
Know
What Happened
Four Cornerstones of Resilience
John Allspaw
“Resilience Engineering Part II: Lenses” (2012)
Hmm… Not too difficult, in theory.
How Organizations Process Information
Pathological                  | Bureaucratic              | Generative
Power-oriented                | Rule-oriented             | Performance-oriented
Low co-operation              | Modest co-operation       | High co-operation
Messengers shot               | Messengers neglected      | Messengers trained
Responsibilities shirked      | Narrow responsibilities   | Risks are shared
Bridging discouraged          | Bridging tolerated        | Bridging encouraged
Failure leads to scapegoating | Failure leads to justice  | Failure leads to inquiry
Novelty crushed               | Novelty leads to problems | Novelty implemented
Ron Westrum, “A typology of organisational cultures” (2004)
Learning requires a
new view on
human error.
Human Error
There is no such thing as...
“Employing simplicity thinking and linear logic,
the official findings and the judicial rulings
determined that the train driver was
“exclusively” responsible for the crash.”*
* Disaster complexity and the Santiago de
Compostela train derailment
Amazon’s outage
“Amazon’s massive
AWS outage was
caused by human
error.
One incorrect command
and the whole internet
suffers.”
Recode. March 2, 2017
“During the deployment of the new code, however, one of
Knight’s technicians did not copy the new code to one of the
eight SMARS computer servers. Knight did not have a
second technician review this deployment (...)”
Knightmare: A DevOps Cautionary Tale
Knight Capital Loses $440 Million in 30 Minutes
“Was it a human or technical error?”
Does this question make sense in
complex systems?
…well established that accidents cannot be
attributed to a single cause or (...) a
single individual
Four Needs
an accident report must fulfill
Sidney Dekker, “The psychology of accident investigation: epistemological, preventive, moral and existential meaning-making” (2014)
Epistemological
Preventive
Moral
Existential
Most of the time they are in conflict.
The way we look at human error focuses
on moral and existential needs.
● Human error is the cause of failure
● Engineered systems are safe
● Make progress by protecting systems
from unreliable humans
“Old” View Of Human Error
Hindsight Bias
“The inclination, after an event has occurred, to see the
event as having been predictable, despite there having
been little or no objective basis for predicting it.”
“Hindsight bias”
People don’t go to work to do a bad job.
Fundamental Attribution Error
“Our tendency to explain someone’s behaviour based on
internal factors, such as personality or disposition, and to
underestimate the influence that external factors, such as
situational influences (...).”
“Fundamental Attribution Error - Definition & Overview”
It’s easier to change people than basic
beliefs about a system.
Local Rationality Principle
“People do things that make sense to them given their
goals, understanding of the situation and focus of attention
at that time.
Work needs to be understood from the local perspectives of
those doing the work.”
“Local Rationality”
Normal people, doing normal things.
“The human tendency to create possible alternatives to life
events that have already occurred.
They are thoughts that consist of ‘If I had only’.”
“Counterfactual Thinking”
Counterfactuals
Counterfactuals can affect people’s
emotions, e.g.: regret, guilt or relief.
They can also affect how they decide
who deserves blame and responsibility.
We don’t find
cause.
We select cause.
A New View on Human
Error
Human error as a symptom of failure
● Human error as symptom of failure
● Safety is not inherent in systems
● Human error connected to features of
people, tools, tasks and operating
environment
“New” View
Moving From Anecdote to
Concept-Based Results
Five steps from context-specific to concept-dependent.
Sidney Dekker, “Reconstructing human contributions to accidents: the new view on error and performance.” (2014)
1. Lay Out the Sequence of Events in
Context-Specific Language
How did people's mindset unfold in parallel with the
situation evolving around them, and how did people
influence the course of events?
2. Divide Sequence of Events into
Episodes
If the accident evolves over a long period of time.
3. Find Out How the World Looked or
Changed During Each Episode
Couple behaviour with situation. Connect available
information with how it was presented to people.
4. Identify People's Goals, Focus of
Attention and Knowledge Active at the
Time
What people know and what they try to accomplish
(their goals) determine where they will look, and hence
the data that is available to them.
5. Step Up to a Conceptual Description
It’s crucial so that we can learn from failures and
identify commonalities between different events.
Now go and make your
organization more
humane...
...and Resilient!
1. Complex world =>
Emergent Behaviour
2. Resilience => Learning
3. No such thing as human
error

Resilience Engineering & Human Error... in IT

  • 1. Resilience Engineering & Human Error … in a world of complexity
  • 2. João Miranda 17 years in the IT world: developer, scrum master, ALM team lead, dev team lead, agile coach, solution architect Copes (tries to!) with 10+ Scrum teams DevOps Lisbon meetup co-organizer
  • 3. Human Factors & System Safety Lund University - MsC and Learning Labs
  • 4. In the 19th Century, things were a bit simpler... Harvest at La Crau, with Montmajour in the Background, June 1888, Van Gogh Museum
  • 5. In the early 20th Century, things got more complicated… Industrial Revolution - History.com
  • 6. “Now one of the very first requirements for a man who is fit to handle pig iron as a regular occupation is that he shall be so stupid and so phlegmatic that he more nearly resembles in his mental make-up the ox than any other type.” F.W. Taylor. Principles of Scientific Management. 1911. New York and London, Harper & brothers.
  • 8. Things change faster and faster.
  • 9.
  • 12. What is complex? A Complex World
  • 13. Complex Systems - In Layman’s Terms “components come together to behave in different (sometimes surprising) ways that they never would on their own, in isolation.” John Allspaw “Resilience Engineering Part II: Lenses” (2012)
  • 19. 5 domains of decision-making (The fifth, in the middle, is disorder) Cynefin Framework By Snowden - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=33783436 Cynefin
  • 20. Cynefin Framework - in detail By Edwin Stoop (User:Marillion!!62) - [1], CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=53810658
  • 21. Cognitive Demands of a Domain ● Dynamism ● Number of parts and extensiveness of its interconnections ● Uncertainty ● Risk A domain is complex if high in all of these dimensions. * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  • 22. Most of the time, we are in the complex domain. Probe. Sense. Respond.
  • 23. We need to be there in order to keep up with the fast pace of change.
  • 24. Sometimes, we fall into the chaotic. Act! Sense. Respond.
  • 25. We want to move to the complicated or simple domains, whenever possible. From novel or emergent practices to good and best practices.
  • 26. Resilience Engineering How to Survive In Today’s World
  • 27. “A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and unexpected conditions.” Erik Hollnagel “Resilience Engineering” A Definition
  • 29. Resilience Engineering is not IT specific It’s not even its main focus
  • 30. Roots in System Safety Aviation, Shipping, Nuclear, Health Care, ...
  • 32. “[Systems] that can manage something before it happens, by analysing the developments in the world around and preparing itself as well as possible.” Erik Hollnagel “Resilience Engineering”
  • 33. Anticipation Implies Risk-Taking But there’s also risk in not anticipating: reacting too late.
  • 34. Detect Shifting Technical Trends Did you anticipate: containers, microservices, big data, ...?
  • 35. Keep Up with New Business Models E.g.: Web APIs*, Fintechs**. *John Musser, “20 Business Models in 20 Minutes” (2013) ** McKinsey&Company, “Cutting Through the FinTech Noise: Markers of Success, Imperatives for Banks” (2015)
  • 36. Perform Architectural Reviews Adopting new architectures or tech must not be taken lightly.
  • 37. Do GameDay Exercises “An exercise that tests a company’s systems, software, and people in the course of preparing for a response to a disastrous event”* *Tom Limoncelli et al., “Resilience Engineering: Learning to Embrace Failure” (2012)
  • 39. 2. Monitor the Present
  • 40. “[Look for] that which could seriously affect the system’s performance in the near term – positively or negatively. The monitoring must cover the system’s own performance as well as what happens in the environment.” Erik Hollnagel “Resilience Assessment Grid (RAG)”
  • 41. Technical Actors In IT i.e.: your hardware and software
  • 43. QA & Monitoring. They tend to be the same thing over time...
  • 45. Social Actors In IT i.e.: humans and their interactions
  • 46. How well do the different teams (e.g. Dev and Ops) work together?
  • 47. How does the organization handle conflicting goals?
  • 48. Beware the siren calls for creating “objective” and/or “precise” performance indicators.
  • 51. “Knowing what to do, or being able to respond to regular and irregular changes, disturbances, and opportunities by activating prepared actions or by adjusting current mode of functioning.” Erik Hollnagel “Resilience Assessment Grid (RAG)”
  • 52. Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980), via John Allspaw, “Resilience Engineering Part II: Lenses” (2012) Characteristics of Response in Escalating Scenarios
  • 53. “[people] tend to neglect how processes develop over time (awareness of rates) versus assessing how things are in the moment.”
  • 54. “…[people] have difficulty in dealing with exponential developments (hard to imagine how fast things can change, or accelerate).”
  • 55. “…[people] tend to think in causal series as opposed to causal nets (A, therefore B) -> (A and B, therefore C and D, therefore E and A and F)”
  • 56. Pitfalls to Be Aware of A sample. * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  • 57. Failure to Adapt to New Events People may get fixated on initial assessments.
  • 58. Failure to Use External Guidance to Direct Focus E.g.: starting to treat a cause before treating more pressing consequences.
  • 59. Failures of Prospective Memory Failing to recall, at the right moment, an intention set for some future point in time.
  • 60. Treating Interconnected Events as Independent E.g.: Failing to consider how a recently deployed change to the Users API may be causing the Check-out process to fail.
  • 61. Overreliance on Familiar Signs “The site is so slow. It must be the database again.”
  • 62. Response in IT An overview
  • 63. Response Triggers ● Users start complaining ● Chance ● Anomaly detection and alerting systems work
  • 64. Response on Simple and Complicated Systems Runbooks and Linear Cause-Consequence Analysis are usually enough.
  • 65. Response on Complex Systems ● Probe-Sense-Respond ● Ensure Different Roles can Work as a Team (Dev+Ops) ● Remove Barriers to Information Sharing ● Leverage GameDay Insights ● (Seamlessly) Record Events for Post-Mortem Analysis
  • 67. 4. Learn from the Past
  • 68. “Manage something not only when it happens but also after it has happened. (...) [A system] can use this learning to adjust both how it monitors and how it responds. ” Erik Hollnagel “Resilience Engineering” Learning from the Past
  • 69. Obstacles to Learning Why learning organizations are (probably) a minority.
  • 70. Human Error “Explains” Everything We tend to fall prey to our own biases
  • 71. Blame Culture Real or perceived. It doesn’t matter.
  • 72. Misguided Focus On Production Efficiency. “I’m too busy to sharpen my saw.”
  • 73. Common Ways to Learn ● Post-Mortems ● Retrospectives ● Accident Reports
  • 74. Learn from Failure and Success “Near-misses” are especially interesting.
  • 77. Hmm… Not too difficult, in theory.
  • 78. How Organizations Process Information (Pathological / Bureaucratic / Generative) ● Power-oriented / Rule-oriented / Performance-oriented ● Low co-operation / Modest co-operation / High co-operation ● Messengers shot / Messengers neglected / Messengers trained ● Responsibilities shirked / Narrow responsibilities / Risks are shared ● Bridging discouraged / Bridging tolerated / Bridging encouraged ● Failure leads to scapegoating / Failure leads to justice / Failure leads to inquiry ● Novelty crushed / Novelty leads to problems / Novelty implemented Ron Westrum, “A typology of organisational cultures” (2004)
  • 79. Learning requires a new view on human error.
  • 80. There is no such thing as... Human Error
  • 81. “Employing simplicity thinking and linear logic, the official findings and the judicial rulings determined that the train driver was “exclusively” responsible for the crash.”* * Disaster complexity and the Santiago de Compostela train derailment
  • 82. Amazon’s outage “Amazon’s massive AWS outage was caused by human error. One incorrect command and the whole internet suffers.” Recode. March 2, 2017
  • 83. “During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment (...)” Knightmare: A DevOps Cautionary Tale Knight Capital Loses $440 Million in 30 Minutes
  • 84. “Was it a human or technical error?”
  • 85. Does this question make sense in complex systems?
  • 86. well established that accidents cannot be attributed to a single cause or (...) a single individual
  • 87. Four Needs an accident report must fulfill Sidney Dekker, “The psychology of accident investigation: epistemological, preventive, moral and existential meaning-making” (2014)
  • 89. The way we look at human error focuses on moral and existential needs.
  • 90. ● Human error is the cause of failure ● Engineered systems are safe ● Make progress by protecting systems from unreliable humans “Old” View Of Human Error
  • 91. Hindsight Bias “The inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it.” “Hindsight bias”
  • 92. People don’t go to work to do a bad job.
  • 93. Fundamental Attribution Error “Our tendency to explain someone’s behaviour based on internal factors, such as personality or disposition, and to underestimate the influence that external factors, such as situational influences (...).” “Fundamental Attribution Error - Definition & Overview”
  • 94. It’s easier to change people than basic beliefs about a system.
  • 95. Local Rationality Principle “People do things that make sense to them given their goals, understanding of the situation and focus of attention at that time. Work needs to be understood from the local perspectives of those doing the work.” “Local Rationality”
  • 96. Normal people, doing normal things.
  • 97. “The human tendency to create possible alternatives to life events that have already occurred. They are thoughts that consist of ‘If I had only’.” “Counterfactual Thinking” Counterfactuals
  • 98. Counterfactuals can affect people’s emotions, e.g.: regret, guilt or relief. They can also affect how they decide who deserves blame and responsibility.
  • 99. We don’t find cause. We select cause.
  • 100. A New View on Human Error Human error as a symptom of failure
  • 101. ● Human error as symptom of failure ● Safety is not inherent in systems ● Human error connected to features of people, tools, tasks and operating environment “New” View
  • 102. Moving From Anecdote to Concept-Based Results Five steps from context-specific to concept-dependent. Sidney Dekker, “Reconstructing human contributions to accidents: the new view on error and performance.” (2014)
  • 103. 1. Lay Out the Sequence of Events in Context-Specific Language How people's mindset unfolded in parallel with the situation evolving around them, and how people influenced the course of events.
  • 104. 2. Divide Sequence of Events into Episodes If the accident evolves over a long period of time.
  • 105. 3. Find Out How the World Looked or Changed During Each Episode Couple behaviour with situation. Connect available information with how it was presented to people.
  • 106. 4. Identify People's Goals, Focus of Attention and Knowledge Active at the Time What people know and what they try to accomplish (their goals) determine where they will look, hence the data that is available to them.
  • 107. 5. Step Up to a Conceptual Description This step is crucial: it lets us learn from failures and identify commonalities between different events.
  • 108. Now go and make your organization more humane...
  • 110. 1. Complex world => Emergent Behaviour 2. Resilience => Learning 3. No such thing as human error