FAILURE OR … RESILIENCE
CS5032 Lecture 2: Failure
DR JOHN ROOKSBY
IN THIS LECTURE…
This lecture
• Will introduce you to many of the themes I will cover on
  the course.
• Will characterise failure as the norm rather than the
  exception in systems operation.
• Will outline why critical systems engineering must
  address organisational and human factors as well as
  technical issues.
• Will build upon the idea of socio-technical systems
  engineering introduced in the last lecture, and will
  introduce the idea of resilience engineering
A STORY
A professor has to give an important lecture. He wakes up
late because his alarm clock fails to go off.
His wife has left the house already. Unfortunately she has
left the kitchen tap running and it has flooded the floor.
The professor rushes to clean up the mess.
He gets to his car only to realise he has locked his car and
house keys inside.
He has left a spare house-key with a neighbour – but the
neighbour is away.
He phones his wife but she doesn’t answer.
A STORY
He calls a friend and asks for a lift, but the friend’s car has
broken down.
The professor sets off for the bus, but then remembers there
is a bus strike.
He calls a taxi, but the taxi company is overwhelmed because
of the bus strike.
He gives up, calls work and cancels the lecture.




This story is adapted from Perrow C (1984) Normal Accidents: Living with
High-Risk Technologies. Basic Books.
ABOUT FAILURE
Failure is a judgment
Failures are common
Failures often have multiple causes
Failures cascade
Some failures are more serious than others
Failures often have no ill effect
Failures can often be recovered from
Engineering cannot eliminate failure
Success is as complex as failure
FAILURE IS A JUDGMENT
What do we judge the exact failure to be?
• Failure to get to work? Failure to give lecture? The smaller
  failures that led to cancellation?
What do we judge to be a significant failure?
• Does cancelling a lecture matter?
• Can cancellation be corrected for?
Different perspectives can be taken on failure
• Different explanations often suit different purposes
• There may sometimes be no definite agreement about a
  failure, but this does not mean any interpretation will do.
Figure: Passport issuing 1998/9
Sources: Graph – The Passport Delays of Summer 1999, NAO Report.
Images – BBC News.
FAILURES ARE COMMON
Errors and failures happen all the time, particularly in
complex systems where there is a lot to go wrong.
How many errors have you made in the last half an hour?

If servers in a data center have 99.999% reliability, what are the
odds that all will be working at any one time:
a) if it has 10,000 servers?
b) if it has 100,000 servers?
http://www.time.com/time/photogallery/0,29307,2036928_2218548,00.html
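The data center question above can be worked through numerically. This is a minimal sketch, assuming each server is working with probability 0.99999 independently of the others (the slide does not state independence; that is an assumption made here), so the probability that all n are working is 0.99999 to the power n. The function name is illustrative only.

```python
def prob_all_working(reliability: float, servers: int) -> float:
    """Probability that every server is working at once, assuming independent failures."""
    return reliability ** servers


for n in (10_000, 100_000):
    p = prob_all_working(0.99999, n)
    print(f"{n:>7} servers: P(all working) = {p:.3f}")
```

Under these assumptions the answer is roughly 0.90 for 10,000 servers and roughly 0.37 for 100,000, so at the larger scale having at least one server down is the more likely state.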
FAILURES OFTEN HAVE
MULTIPLE CAUSES
There were multiple (mainly mundane) causes behind the
lecture cancellation:
  •   Human error (leaving tap running, forgetting keys)
  •   Practices and procedures (waking up late, rushing)
  •   Technical failure (alarm clock, car)
  •   System design (door allows you to be locked out)
  •   Environment (lives too far from work)
  •   External failures (bus strike, lack of taxi capacity)
  •   Planning (relying on a single lecturer)

Who or what is responsible?
Who has responsibility?
http://gizmodo.com/5844628/a-passenger-airplane-nearly-flew-upside-down-because-of-a-dumb-pilot
FAILURES CASCADE
Complex systems have a high number of components and
will be dependent on a high number of external factors.
These interdependencies may not always be apparent.
Often the cause or causes of a failure are at some remove from
the failure itself.
• A simplistic view is that there are chains of failure. A
  domino effect where one problem leads to another
• A more complex view is that failures have complex webs
  of causes and influences
• We may also view failures in terms of problems with
  defenses
Disasters often result from unfortunate coincidences and
combinations of failure.
SWISS CHEESE
MODEL
Figure: layers labelled Operation, Software, Hardware.
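One way to read the Swiss cheese picture numerically: a hazard becomes an accident only when it passes through a hole in every layer. The sketch below assumes each layer fails to stop a hazard independently with some probability. The layer names follow the slide, but the probabilities are invented purely for illustration.

```python
from math import prod

# Hypothetical chance that each defensive layer fails to stop a given hazard.
hole_probability = {"Operation": 0.1, "Software": 0.05, "Hardware": 0.01}


def prob_penetrates_all(layers: dict[str, float]) -> float:
    """Chance a hazard passes every layer, assuming the layers fail independently."""
    return prod(layers.values())


print(f"P(hazard passes all layers) = {prob_penetrates_all(hole_probability):.6f}")
```

Independence is the optimistic case. Where layers share hidden interdependencies, their holes tend to line up more often than this simple product suggests.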
SOME FAILURES ARE MORE
SERIOUS THAN OTHERS
It is often helpful to distinguish between faults, errors, failures,
disasters and catastrophes, but there is no consistently used
terminology.
Failure is a judgment
The seriousness of a failure is contextually dependent.
• Failure in a life-critical system vs in a word processor
• When is it acceptable for an aging component to fail?
• When is it acceptable to take risks (e.g. do maintenance)?
Engineers take different perspectives on failure. Some argue
that all failures, no matter how small, should be taken seriously.
Some argue we need systems to be “good enough”.
FAILURES OFTEN HAVE NO ILL
EFFECT
An error or failure may happen many times with no ill effect.
• This can lead people to be complacent
• It may one day lead to disaster
For example, the Columbia shuttle disaster occurred when foam
shed during launch damaged the thermal protection on the
shuttle’s wing
• Similar foam strikes had happened many times
• NASA did not believe this strike could cause the loss of
  Columbia
FAILURES CAN OFTEN BE
RECOVERED FROM
A disaster is rarely an instantaneous event. Often a disaster
results from an unfortunate combination of failures and often
these take place over a period of time.
• Failures can often be mitigated
• Failures can often be recovered from
A resilient system is one that is able to recover from failures.
It is the opposite of a brittle system.
We must give operators the ability to mitigate and recover
from failure.
Image from: ATSB TRANSPORT SAFETY REPORT Aviation Occurrence Investigation – AO-2010-089 Preliminary
ENGINEERING CANNOT
ELIMINATE FAILURES
Good engineering can greatly reduce but never eliminate the
possibility of failure.
• Testing can be used to find problems but never show their
  absence
• Formal methods can be used to eliminate design faults but
  this does not mean problems will not emerge in
  manufacturing or system operation
Critical systems engineering must focus on operation as well
as design.
Systems are increasingly operated as services rather than
products, so this operational risk increasingly falls on the
developers (!)
SUCCESS IS AS COMPLEX AS
FAILURE
We need to learn from success, not just failure
• But success is even harder to define than failure.
Success is a judgment
• One person’s success is another’s failure
• A successful system may just be one that hasn’t yet failed
Success can be studied in terms of
• Noteworthy success
• Ordinary operation
• “Successful failures”
SOCIO-TECHNICAL SYSTEMS
ENGINEERING

Figure: a stack of layers, from top to bottom:
• Society
• Organisations
• People and Processes
• Applications
• Communications + Data Management
• Operating Systems
• Equipment
Side labels: Socio-Technical Systems Engineering (left), Software
Engineering (right).
RESILIENCE
Design for failure
• How can a system fail gracefully and appropriately?
Design for recovery
• How can a system be designed to support mitigation and
  recovery from failure?
Design for avoidance
• How can we reduce the number of failures a system will
  encounter?
For all of these we need to understand systems operation.
Critical systems engineering is not just about the design
process, but also about understanding operation.
Microsoft “containerised” data centre
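As a small software-level illustration of designing for failure and recovery, the sketch below retries a failing operation a few times and then degrades to a fallback value instead of crashing. All names here (`call_with_recovery`, `flaky_lookup`) are hypothetical. It covers only the technical layer, whereas mitigation and recovery in practice also depend on operators and organisations.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_recovery(operation: Callable[[], T], fallback: T, attempts: int = 3) -> T:
    """Try `operation` up to `attempts` times; return `fallback` if every attempt fails."""
    for _ in range(attempts):
        try:
            return operation()  # recovery: a later attempt may succeed
        except Exception:
            continue  # mitigation: absorb this failure and try again
    return fallback  # graceful failure: degrade instead of crashing


# Hypothetical operation that fails twice and then succeeds.
calls = {"count": 0}


def flaky_lookup() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service unavailable")
    return "live data"


print(call_with_recovery(flaky_lookup, fallback="cached data"))
```

Silently absorbing failures like this can itself breed complacency, so a real system would likely also record each failed attempt for operators to review.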
SUMMARY
1. Failure is the norm, not the exception
2. Resilient systems are able to cope with, recover from and
   avoid failure
3. Resilience is a socio-technical problem, not just a technical one
HOMEWORK
First read
Chapter 3 “The Human Contribution” from J Reason (2008)
The Human Contribution. Farnham, Ashgate.


Then
Make a note of any interesting slips, lapses, mistakes,
violations, etc. that you have made recently

More Related Content

What's hot

Human error and secure systems - DevOpsDays Ohio 2015
Human error and secure systems - DevOpsDays Ohio 2015Human error and secure systems - DevOpsDays Ohio 2015
Human error and secure systems - DevOpsDays Ohio 2015Dustin Collins
 
Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)ClubHack
 
E guide weathering the storm at your business
E guide weathering the storm at your businessE guide weathering the storm at your business
E guide weathering the storm at your businessSoma Technology Group
 
Tool Box Talk - Human Induced Failures 2
Tool Box Talk  - Human Induced Failures  2Tool Box Talk  - Human Induced Failures  2
Tool Box Talk - Human Induced Failures 2Ricky Smith CMRP, CMRT
 
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...Sam Knutson
 
When Things Break
When Things BreakWhen Things Break
When Things BreakRon Graham
 
The complete-guide-to-home-computer-maintenance
The complete-guide-to-home-computer-maintenanceThe complete-guide-to-home-computer-maintenance
The complete-guide-to-home-computer-maintenanceeyob eshetu
 
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...Adrian Sanabria
 
Blameless Post-mortems: Everything You Ever Wanted to Know
Blameless Post-mortems: Everything You Ever Wanted to KnowBlameless Post-mortems: Everything You Ever Wanted to Know
Blameless Post-mortems: Everything You Ever Wanted to KnowVictorOps
 
Successful EMIS Implementation - Gaining User Acceptance
Successful EMIS Implementation - Gaining User AcceptanceSuccessful EMIS Implementation - Gaining User Acceptance
Successful EMIS Implementation - Gaining User AcceptanceRoberta Macklin
 
Typing Work by Complexity
Typing Work by ComplexityTyping Work by Complexity
Typing Work by ComplexityDerek W. Wade
 

What's hot (16)

Human error and secure systems - DevOpsDays Ohio 2015
Human error and secure systems - DevOpsDays Ohio 2015Human error and secure systems - DevOpsDays Ohio 2015
Human error and secure systems - DevOpsDays Ohio 2015
 
Human errors
Human errorsHuman errors
Human errors
 
Dit yvol3iss41
Dit yvol3iss41Dit yvol3iss41
Dit yvol3iss41
 
Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)
 
E guide weathering the storm at your business
E guide weathering the storm at your businessE guide weathering the storm at your business
E guide weathering the storm at your business
 
Tool Box Talk - Human Induced Failures 2
Tool Box Talk  - Human Induced Failures  2Tool Box Talk  - Human Induced Failures  2
Tool Box Talk - Human Induced Failures 2
 
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
 
When Things Break
When Things BreakWhen Things Break
When Things Break
 
The complete-guide-to-home-computer-maintenance
The complete-guide-to-home-computer-maintenanceThe complete-guide-to-home-computer-maintenance
The complete-guide-to-home-computer-maintenance
 
Downtime-Whitepaper
Downtime-WhitepaperDowntime-Whitepaper
Downtime-Whitepaper
 
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
 
Creating a Technology Disaster Plan
Creating a Technology Disaster PlanCreating a Technology Disaster Plan
Creating a Technology Disaster Plan
 
232 a7d01
232 a7d01232 a7d01
232 a7d01
 
Blameless Post-mortems: Everything You Ever Wanted to Know
Blameless Post-mortems: Everything You Ever Wanted to KnowBlameless Post-mortems: Everything You Ever Wanted to Know
Blameless Post-mortems: Everything You Ever Wanted to Know
 
Successful EMIS Implementation - Gaining User Acceptance
Successful EMIS Implementation - Gaining User AcceptanceSuccessful EMIS Implementation - Gaining User Acceptance
Successful EMIS Implementation - Gaining User Acceptance
 
Typing Work by Complexity
Typing Work by ComplexityTyping Work by Complexity
Typing Work by Complexity
 

Viewers also liked

CS5032 Lecture 9: Learning from failure 1
CS5032 Lecture 9: Learning from failure 1CS5032 Lecture 9: Learning from failure 1
CS5032 Lecture 9: Learning from failure 1John Rooksby
 
Studying foursquare
Studying foursquareStudying foursquare
Studying foursquareMattias Rost
 
CS5032 Lecture 10: Learning from failure 2
CS5032 Lecture 10: Learning from failure 2CS5032 Lecture 10: Learning from failure 2
CS5032 Lecture 10: Learning from failure 2John Rooksby
 
Designing apps lecture
Designing apps lectureDesigning apps lecture
Designing apps lectureJohn Rooksby
 
Testing Sociotechnical Systems: Passport Issuing
Testing Sociotechnical Systems: Passport IssuingTesting Sociotechnical Systems: Passport Issuing
Testing Sociotechnical Systems: Passport IssuingJohn Rooksby
 
Testing Sociotechnical Systems: Heathrow Terminal 5
Testing Sociotechnical Systems: Heathrow Terminal 5Testing Sociotechnical Systems: Heathrow Terminal 5
Testing Sociotechnical Systems: Heathrow Terminal 5John Rooksby
 
Self tracking and digital health
Self tracking and digital healthSelf tracking and digital health
Self tracking and digital healthJohn Rooksby
 
Top 10 lies of Entrepreneurs
Top 10 lies of EntrepreneursTop 10 lies of Entrepreneurs
Top 10 lies of Entrepreneurshuer1278ft
 
Entrepreneurship & business modelling workshop
Entrepreneurship & business modelling  workshopEntrepreneurship & business modelling  workshop
Entrepreneurship & business modelling workshophgomersall
 
It's All About Execution in a Startup
It's All About Execution in a StartupIt's All About Execution in a Startup
It's All About Execution in a StartupAbhishek Shah
 
Leadership Mashups: 100 Entrepreneur Attributes
Leadership Mashups: 100 Entrepreneur AttributesLeadership Mashups: 100 Entrepreneur Attributes
Leadership Mashups: 100 Entrepreneur AttributesAdam Walz
 
17 Traits That Entrepreneurs Posses
17 Traits That Entrepreneurs Posses17 Traits That Entrepreneurs Posses
17 Traits That Entrepreneurs PossesAbhishek Shah
 
Building a Career as an Entrepreneur
Building a Career as an EntrepreneurBuilding a Career as an Entrepreneur
Building a Career as an EntrepreneurEric Tachibana
 
99 slideshares that every entrepreneur must read
99 slideshares that every entrepreneur must read99 slideshares that every entrepreneur must read
99 slideshares that every entrepreneur must readEric Tachibana
 
Stealing Your Einstein Ideas
Stealing Your Einstein IdeasStealing Your Einstein Ideas
Stealing Your Einstein IdeasAbhishek Shah
 
10 reasons it sucks to be an entrepreneur
10 reasons it sucks to be an entrepreneur10 reasons it sucks to be an entrepreneur
10 reasons it sucks to be an entrepreneurEric Tachibana
 
How to run a scrappy startup
How to run a scrappy startupHow to run a scrappy startup
How to run a scrappy startupRashmi Sinha
 

Viewers also liked (20)

CS5032 Lecture 9: Learning from failure 1
CS5032 Lecture 9: Learning from failure 1CS5032 Lecture 9: Learning from failure 1
CS5032 Lecture 9: Learning from failure 1
 
Studying foursquare
Studying foursquareStudying foursquare
Studying foursquare
 
CS5032 Lecture 10: Learning from failure 2
CS5032 Lecture 10: Learning from failure 2CS5032 Lecture 10: Learning from failure 2
CS5032 Lecture 10: Learning from failure 2
 
Designing apps lecture
Designing apps lectureDesigning apps lecture
Designing apps lecture
 
Testing Sociotechnical Systems: Passport Issuing
Testing Sociotechnical Systems: Passport IssuingTesting Sociotechnical Systems: Passport Issuing
Testing Sociotechnical Systems: Passport Issuing
 
Testing Sociotechnical Systems: Heathrow Terminal 5
Testing Sociotechnical Systems: Heathrow Terminal 5Testing Sociotechnical Systems: Heathrow Terminal 5
Testing Sociotechnical Systems: Heathrow Terminal 5
 
Self tracking and digital health
Self tracking and digital healthSelf tracking and digital health
Self tracking and digital health
 
Innovation Can be Trained
Innovation Can be TrainedInnovation Can be Trained
Innovation Can be Trained
 
Top 10 lies of Entrepreneurs
Top 10 lies of EntrepreneursTop 10 lies of Entrepreneurs
Top 10 lies of Entrepreneurs
 
Entrepreneurship & business modelling workshop
Entrepreneurship & business modelling  workshopEntrepreneurship & business modelling  workshop
Entrepreneurship & business modelling workshop
 
It's All About Execution in a Startup
It's All About Execution in a StartupIt's All About Execution in a Startup
It's All About Execution in a Startup
 
Leadership Mashups: 100 Entrepreneur Attributes
Leadership Mashups: 100 Entrepreneur AttributesLeadership Mashups: 100 Entrepreneur Attributes
Leadership Mashups: 100 Entrepreneur Attributes
 
17 Traits That Entrepreneurs Posses
17 Traits That Entrepreneurs Posses17 Traits That Entrepreneurs Posses
17 Traits That Entrepreneurs Posses
 
How can entrepreneurial mindset be developed in organisations?
How can entrepreneurial mindset be developed in organisations?How can entrepreneurial mindset be developed in organisations?
How can entrepreneurial mindset be developed in organisations?
 
Building a Career as an Entrepreneur
Building a Career as an EntrepreneurBuilding a Career as an Entrepreneur
Building a Career as an Entrepreneur
 
99 slideshares that every entrepreneur must read
99 slideshares that every entrepreneur must read99 slideshares that every entrepreneur must read
99 slideshares that every entrepreneur must read
 
Stealing Your Einstein Ideas
Stealing Your Einstein IdeasStealing Your Einstein Ideas
Stealing Your Einstein Ideas
 
5 things I wish I knew before starting up
5 things I wish I knew before starting up5 things I wish I knew before starting up
5 things I wish I knew before starting up
 
10 reasons it sucks to be an entrepreneur
10 reasons it sucks to be an entrepreneur10 reasons it sucks to be an entrepreneur
10 reasons it sucks to be an entrepreneur
 
How to run a scrappy startup
How to run a scrappy startupHow to run a scrappy startup
How to run a scrappy startup
 

Similar to CS5032 Lecture 2: Failure

Chaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsChaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsYury Roa
 
Architectural Patterns of Resilient Distributed Systems
 Architectural Patterns of Resilient Distributed Systems Architectural Patterns of Resilient Distributed Systems
Architectural Patterns of Resilient Distributed SystemsInes Sombra
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software designUwe Friedrichsen
 
Embracing Failure - AzureDay Rome
Embracing Failure - AzureDay RomeEmbracing Failure - AzureDay Rome
Embracing Failure - AzureDay RomeAlberto Acerbis
 
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Jaap van Ekris
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018Christophe Rochefolle
 
2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systemsJaap van Ekris
 
Debugging microservices in production
Debugging microservices in productionDebugging microservices in production
Debugging microservices in productionbcantrill
 
Resilient Functional Service Design
Resilient Functional Service DesignResilient Functional Service Design
Resilient Functional Service DesignUwe Friedrichsen
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskUwe Friedrichsen
 
2016-04-28 - VU Amsterdam - testing safety critical systems
2016-04-28 - VU Amsterdam - testing safety critical systems2016-04-28 - VU Amsterdam - testing safety critical systems
2016-04-28 - VU Amsterdam - testing safety critical systemsJaap van Ekris
 
Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Markus Eisele
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!Bert Ertman
 
Debugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindDebugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindbcantrill
 
DockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadDockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadKevin Crawley
 
Chaos Engineering
Chaos EngineeringChaos Engineering
Chaos EngineeringYury Roa
 
Problem management foundation - Significant havoc in technology
Problem management foundation - Significant havoc in technologyProblem management foundation - Significant havoc in technology
Problem management foundation - Significant havoc in technologyRonald Bartels
 

Similar to CS5032 Lecture 2: Failure (20)

Chaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsChaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in Systems
 
Architectural Patterns of Resilient Distributed Systems
 Architectural Patterns of Resilient Distributed Systems Architectural Patterns of Resilient Distributed Systems
Architectural Patterns of Resilient Distributed Systems
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software design
 
Chaos engineering
Chaos engineering Chaos engineering
Chaos engineering
 
Embracing Failure - AzureDay Rome
Embracing Failure - AzureDay RomeEmbracing Failure - AzureDay Rome
Embracing Failure - AzureDay Rome
 
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018
 
2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems
 
Debugging microservices in production
Debugging microservices in productionDebugging microservices in production
Debugging microservices in production
 
Resilient Functional Service Design
Resilient Functional Service DesignResilient Functional Service Design
Resilient Functional Service Design
 
Introduction to Chaos Engineering
Introduction to Chaos EngineeringIntroduction to Chaos Engineering
Introduction to Chaos Engineering
 
dist_systems.pdf
dist_systems.pdfdist_systems.pdf
dist_systems.pdf
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack risk
 
2016-04-28 - VU Amsterdam - testing safety critical systems
2016-04-28 - VU Amsterdam - testing safety critical systems2016-04-28 - VU Amsterdam - testing safety critical systems
2016-04-28 - VU Amsterdam - testing safety critical systems
 
Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!
 
Debugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindDebugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mind
 
DockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadDockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is Dead
 
Chaos Engineering
Chaos EngineeringChaos Engineering
Chaos Engineering
 
Problem management foundation - Significant havoc in technology
Problem management foundation - Significant havoc in technologyProblem management foundation - Significant havoc in technology
Problem management foundation - Significant havoc in technology
 

More from John Rooksby

Implementing Ethics for a Mobile App Deployment
Implementing Ethics for a Mobile App DeploymentImplementing Ethics for a Mobile App Deployment
Implementing Ethics for a Mobile App DeploymentJohn Rooksby
 
Digital Health From an HCI Perspective - Geraldine Fitzpatrick
Digital Health From an HCI Perspective - Geraldine FitzpatrickDigital Health From an HCI Perspective - Geraldine Fitzpatrick
Digital Health From an HCI Perspective - Geraldine FitzpatrickJohn Rooksby
 
How to evaluate and improve the quality of mHealth behaviour change tools
How to evaluate and improve the quality of mHealth behaviour change toolsHow to evaluate and improve the quality of mHealth behaviour change tools
How to evaluate and improve the quality of mHealth behaviour change toolsJohn Rooksby
 
Guest lecture: Designing mobile apps
Guest lecture: Designing mobile appsGuest lecture: Designing mobile apps
Guest lecture: Designing mobile appsJohn Rooksby
 
Talk at UCL: Mobile Devices in Everyday Use
Talk at UCL: Mobile Devices in Everyday UseTalk at UCL: Mobile Devices in Everyday Use
Talk at UCL: Mobile Devices in Everyday UseJohn Rooksby
 
Intimacy and Mobile Devices
Intimacy and Mobile DevicesIntimacy and Mobile Devices
Intimacy and Mobile DevicesJohn Rooksby
 
CS5032 Lecture 20: Dependable infrastructure 2
CS5032 Lecture 20: Dependable infrastructure 2CS5032 Lecture 20: Dependable infrastructure 2
CS5032 Lecture 20: Dependable infrastructure 2John Rooksby
 
CS5032 Lecture 19: Dependable infrastructure
CS5032 Lecture 19: Dependable infrastructureCS5032 Lecture 19: Dependable infrastructure
CS5032 Lecture 19: Dependable infrastructureJohn Rooksby
 
CS5032 Lecture 14: Organisations and failure 2
CS5032 Lecture 14: Organisations and failure 2CS5032 Lecture 14: Organisations and failure 2
CS5032 Lecture 14: Organisations and failure 2John Rooksby
 
CS5032 Lecture 13: organisations and failure
CS5032 Lecture 13: organisations and failureCS5032 Lecture 13: organisations and failure
CS5032 Lecture 13: organisations and failureJohn Rooksby
 

More from John Rooksby (12)

Implementing Ethics for a Mobile App Deployment
Implementing Ethics for a Mobile App DeploymentImplementing Ethics for a Mobile App Deployment
Implementing Ethics for a Mobile App Deployment
 
Digital Health From an HCI Perspective - Geraldine Fitzpatrick
Digital Health From an HCI Perspective - Geraldine FitzpatrickDigital Health From an HCI Perspective - Geraldine Fitzpatrick
Digital Health From an HCI Perspective - Geraldine Fitzpatrick
 
How to evaluate and improve the quality of mHealth behaviour change tools
How to evaluate and improve the quality of mHealth behaviour change toolsHow to evaluate and improve the quality of mHealth behaviour change tools
How to evaluate and improve the quality of mHealth behaviour change tools
 
Guest lecture: Designing mobile apps
Guest lecture: Designing mobile appsGuest lecture: Designing mobile apps
Guest lecture: Designing mobile apps
 
Talk at UCL: Mobile Devices in Everyday Use
Talk at UCL: Mobile Devices in Everyday UseTalk at UCL: Mobile Devices in Everyday Use
Talk at UCL: Mobile Devices in Everyday Use
 
Fitts' Law
Fitts' LawFitts' Law
Fitts' Law
 
Intimacy and Mobile Devices
Intimacy and Mobile DevicesIntimacy and Mobile Devices
Intimacy and Mobile Devices
 
Making data
Making dataMaking data
Making data
 
CS5032 Lecture 20: Dependable infrastructure 2
CS5032 Lecture 20: Dependable infrastructure 2CS5032 Lecture 20: Dependable infrastructure 2
CS5032 Lecture 20: Dependable infrastructure 2
 
CS5032 Lecture 19: Dependable infrastructure
CS5032 Lecture 19: Dependable infrastructureCS5032 Lecture 19: Dependable infrastructure
CS5032 Lecture 19: Dependable infrastructure
 
CS5032 Lecture 14: Organisations and failure 2
CS5032 Lecture 14: Organisations and failure 2CS5032 Lecture 14: Organisations and failure 2
CS5032 Lecture 14: Organisations and failure 2
 
CS5032 Lecture 13: organisations and failure
CS5032 Lecture 13: organisations and failureCS5032 Lecture 13: organisations and failure
CS5032 Lecture 13: organisations and failure
 


CS5032 Lecture 2: Failure

  • 4. IN THIS LECTURE… This lecture • Will introduce you to many of the themes I will cover on the course. • Will characterise failure as the norm rather than the exception in systems operation. • Will outline why critical systems engineering must address organisational and human factors as well as technical issues. • Will build upon the idea of socio-technical systems engineering introduced in the last lecture, and will introduce the idea of resilience engineering
  • 5. A STORY A professor has to give an important lecture. He wakes up late because his alarm clock fails to go off. His wife has left the house already. Unfortunately she has left the kitchen tap running and it has flooded the floor. The professor rushes to clean up the mess. He gets to his car only to realise he has locked his car and house keys inside. He has left a spare house-key with a neighbour – but the neighbour is away. He phones his wife but she doesn’t answer.
  • 6. A STORY He calls a friend and asks for a lift, but the friend’s car is broken down. The professor sets off for the bus, but then remembers there is a bus strike. He calls a taxi, but the taxi company is overwhelmed because of the bus strike. He gives up, calls work and cancels the lecture. This story is adapted from Perrow C (1984) Normal Accidents. Living with High Risk Technologies Basic Books.
  • 7. ABOUT FAILURE Failure is a judgement Failures are common Failures often have multiple causes Failures cascade Some failures are more serious than others Failures often have no ill effect Failures can often be recovered from Engineering cannot eliminate failure Success is as complex as failure
  • 8. FAILURE IS A JUDGMENT What do we judge the exact failure to be? • Failure to get to work? Failure to give lecture? The smaller failures that led to cancellation? What do we judge to be a significant failure? • Does cancelling a lecture matter? • Can cancellation be corrected for? Different perspectives can be taken on failure • Different explanations often suit different purposes • There may sometimes be no definite agreement about a failure, but this does not mean any interpretation will do.
  • 9. Sources: Graph - The Passport Delays of Summer 1999. NAO Report. Images – BBC News Passport issuing 1998/9
  • 10. FAILURES ARE COMMON Errors and failures happen all the time, particularly in complex systems where there is a lot to go wrong. How many errors have you made in the last half an hour? If servers in a data center have 99.999% reliability, what are the odds that all will be working at any one time: a) if it has 10,000 servers? b) if it has 100,000 servers? http://www.time.com/time/photogallery/0,29307,2036928_2218548,00.html
  • 11. FAILURES OFTEN HAVE MULTIPLE CAUSES There were multiple (mainly mundane) causes behind the lecture cancellation: • Human error (leaving tap running, forgetting keys) • Practices and procedures (Waking up late, rushing) • Technical failure (Alarm clock, Car) • System design (Door allows you to be locked out) • Environment (Lives too far from work) • External failures (Bus strike, lack of taxi capability) • Planning (Relying on a single lecturer) Who or what is responsible? Who has responsibility?
  • 13. FAILURES CASCADE Complex systems have a high number of components and will be dependent on a high number of external factors. These interdependencies may not always be apparent. Often the cause or causes of failure are at an order of remove from the failure itself • A simplistic view is that there are chains of failure. A domino effect where one problem leads to another • A more complex view is that failures have complex webs of causes and influences • We may also view failures in terms of problems with defenses Disasters often result from unfortunate coincidences and combinations of failure.
  • 14. SWISS CHEESE MODEL Operation Software Hardware
  • 15. SOME FAILURES ARE MORE SERIOUS THAN OTHERS It is often helpful to distinguish between faults, errors, failures, disasters and catastrophe. But there is no consistently used terminology. Failure is a judgment The seriousness of a failure is contextually dependent. • Failure in a life-critical system vs in a word processor • When is it acceptable for an aging component to fail? • When is it acceptable to take risks (e.g. do maintenance)? Engineers take different perspectives on failure. Some argue that all failures, no matter how small, should be taken seriously. Some argue we need systems to be “good enough”.
  • 17. FAILURES OFTEN HAVE NO ILL EFFECT An error or failure may happen many times with no ill effect. • This can lead people to be complacent • It may one day lead to disaster For example the Columbia shuttle disaster occurred when foam damaged tiles on the shuttle • Similar foam strikes had happened many times • NASA couldn’t believe this strike would cause the loss of Columbia
  • 18. FAILURES CAN OFTEN BE RECOVERED FROM A disaster is rarely an instantaneous event. Often a disaster results from an unfortunate combination of failures and often these take place over a period of time. • Failures can often be mitigated • Failures can often be recovered from A resilient system is one that is able to recover from failures. It is the opposite of a brittle system. We must give operators the ability to mitigate and recover from failure.
  • 19. Image from: ATSB TRANSPORT SAFETY REPORT Aviation Occurrence Investigation – AO-2010-089 Preliminary
  • 20. ENGINEERING CANNOT ELIMINATE FAILURES Good engineering can greatly reduce but never eliminate the possibility of failure. • Testing can be used to find problems but never show their absence • Formal methods can be used to eliminate design faults but this does not mean problems will not emerge in manufacturing or system operation Critical systems engineering must focus on operation as well as design. Systems are increasingly operated as services rather than products, so this risk is increasingly on the developers (!)
  • 22. SUCCESS IS AS COMPLEX AS FAILURE We need to learn from success, not just failure • But success is even harder to define than failure. Success is a judgment • One person’s success is another’s failure • A successful system may just be one that hasn’t yet failed Success can be studied in terms of • Noteworthy success • Ordinary operation • “Successful failures”
  • 25. SOCIO-TECHNICAL SYSTEMS ENGINEERING [Diagram: system layers from top to bottom: Society; Organisations; People and Processes; Applications; Communications + Data Management; Operating Systems; Equipment. Two scopes are marked across the layers: Socio-Technical Systems Engineering and Software Engineering.]
  • 26. RESILIENCE Design for failure • How can a system fail gracefully and appropriately? Design for recovery • How can a system be designed to support mitigation and recovery from failure? Design for avoidance • How can we reduce the number of failures a system will encounter? For all of these we need to understand systems operation. Critical systems engineering is not just about the design process, but also about understanding operation.
  • 28. SUMMARY 1. Failure is the norm, not the exception 2. Resilient systems are able to cope with, recover from and avoid failure 3. Resilience is a socio-technical, not technical problem
  • 29. HOMEWORK First read Chapter 3 “The Human Contribution” from J Reason (2008) The Human Contribution. Farnham, Ashgate. Then make a note of any interesting slips, lapses, mistakes, violations, etc. that you have made recently
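
The data centre question on slide 10 (FAILURES ARE COMMON) can be worked through with a short calculation. The sketch below is illustrative only and rests on two assumptions that the slide does not state: that "99.999% reliability" means each server is working at any given moment with probability 0.99999, and that servers fail independently of one another. The function name prob_all_working is hypothetical.

```python
def prob_all_working(n_servers: int, reliability: float = 0.99999) -> float:
    """Probability that every server is up at the same moment.

    Assumes (not stated on the slide) that each server is up with
    probability `reliability` and that failures are independent,
    so the individual probabilities simply multiply.
    """
    return reliability ** n_servers


if __name__ == "__main__":
    for n in (10_000, 100_000):
        print(f"{n:>7} servers: {prob_all_working(n):.1%} chance all are working")
```

Under these assumptions the chance that everything is working is roughly 90% with 10,000 servers and roughly 37% with 100,000. In practice failures are often correlated (shared power, network or software faults), so the independence assumption is a simplification.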

Editor's Notes

  1. 90 37
  2. All Nippon Airways. Flight “flips” on 6/9/2011
  3. During the swine flu outbreak? How well has a medic washed hands? (line infections)
  4. Qantas Batam Island Engine Explosion
  5. Unsinkable system
  6. The failures and successes of the London Ambulance Service
  7. Talk also about virtualisation