Transcript

  • 1. Responding to Outages Maturely. John Allspaw, SVP, Tech Ops. Code As Craft, Berlin. Tuesday, April 24, 12
  • 2. OPERABILITY
  • 3. PRODUCTION
  • 4. http://WhoOwnsMyAvailability.com
  • 6. How important is this?
  • 19. How important is this?
  • 20. How Can This Happen?
  • 21. Complicated? Complex?
  • 22. Complex Systems • Cascading Failures • Difficult to determine boundaries • Complex systems may be open • Complex systems may have a memory • Complex systems may be nested • Dynamic network of multiplicity • May produce emergent phenomena • Relationships are non-linear • Relationships contain feedback loops
  • 23. How Can This Happen? It does happen. And it will again. And again.
  • 25. Optimization: MTBF, MTTR
  • 26. http://www.flickr.com/photos/sparktography/75499095/
  • 27. How does team troubleshooting happen?
  • 28. Timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear
  • 29. Timeline (axis: Time), with Stress added: Problem Starts, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear
  • 30. Forced beyond learned roles • Actions whose consequences are both important and difficult to see • Cognitively and perceptually noisy • Coordinative load increases exponentially
  • 32. So What Can We Do?
  • 33. We Learn From Others
  • 34. Characteristics of response to escalating scenarios
  • 35. Characteristics of response to escalating scenarios: ...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment. “On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980
  • 36. Characteristics of response to escalating scenarios: ...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate). “On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980
  • 37. Characteristics of response to escalating scenarios: ...inclined to think in causal series, instead of causal nets. A therefore B, instead of A, therefore B and C (therefore D and E), etc. “On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980
  • 38. Pitfalls: Thematic Vagabonding
  • 39. Pitfalls: Goal Fixation (encystment)
  • 40. Pitfalls: Refusal to make decisions
  • 41. Heroism: Non-communicating lone wolf-isms
  • 42. Distraction: Irrelevant noise in comm channels
  • 43. Jens Rasmussen, Senior Member, IEEE. “Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”, IEEE Transactions On Systems, Man, and Cybernetics, May 1983
  • 44. SKILL-BASED: Simple, routine • RULE-BASED: Knowable, but unfamiliar • KNOWLEDGE-BASED: WTF IS GOING ON? (Reason, 1990)
  • 45. Team Troubleshooting • Which causes did you consider first? • Which ones did you not consider at all? • How much of what you considered comes from recent history? • How much comes from observations from other team members?
  • 46. Team Troubleshooting • How effective is the response team in communicating to other groups? Users? • How long does it take to exhaust obvious cause(s)?
  • 47. Team Dynamics
  • 48. High Reliability Organizations • Air Traffic Control • Naval Air Operations At Sea • Electrical Power Systems • Etc. • Complex Socio-Technical systems • Efficiency <-> Thoroughness • Time/Resource Constrained • Engineering-driven
  • 50. “The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”, Rochlin, La Porte, and Roberts. Naval War College Review, 1987. http://govleaders.org/reliability.htm
  • 52. Close interdependence between groups
  • 53. Close reciprocal coordination and information sharing, resulting in overlapping knowledge
  • 54. High redundancy: multiple people observing the same event and sharing information
  • 55. Broad definition of who belongs to the team.
  • 56. Teammates are included in the communication loops rather than excluded.
  • 57. Lots of error correction.
  • 58. High levels of situation comprehension: maintain constant awareness of the possibility of accidents.
  • 59. High levels of interpersonal skills
  • 60. Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.
  • 61. Patterns of authority are changed to meet the demands of the events: organizational flexibility.
  • 62. The reporting of errors and faults is rewarded, not punished.
  • 63. So What Else Can We Do?
  • 64. We Drill
  • 65. We GameDay
  • 67. We Learn To Improvise
  • 68. IMPROVISATION
  • 69. IMPROVISATION
  • 70. We Learn From Our Mistakes
  • 71. Postmortems • Full timelines: What happened, when, who involved • Review in public, everyone invited • Search for “second stories” instead of “human error” • Cultivating a blameless environment • Giving requisite authority to individuals to improve things
  • 72. Qualifying Response: High signal:noise in comm channels? Troubleshooting fatigue? Troubleshooting handoff? All tools on-hand and working? Improvised tooling or solutions? Metrics visibility? Collaborative and skillful communication?
  • 73. Remediation
  • 74. We Share Near-Miss Events
  • 75. Near Misses: “Hey everybody - Don’t be like me. I tried to X, but that wasn’t a good idea. It almost exploded everyone. So, don’t do: (details about X). Love, Joe”
  • 76. Near Misses • Can act like “vaccines” - help system safety without actually hurting anything • Happen more often, so provide more data on latent failures • Powerful reminder of hazards, and slows down the process of forgetting to be afraid
  • 77. Practice! • How we troubleshoot in the moment, as a distributed team • How we handle time pressure • How we Observe/Orient/Decide/Act • How we communicate during emergencies • How we trust (or not) each other during emergencies • How we relate to emergencies when things are normal • How we could detect how we are protected during normal times (i.e., why aren’t we going down RIGHT NOW?)
  • 78. Resilient Response • Can learn from other fields • Can train for outages • Can learn from mistakes • Can learn from successes as well as failures
  • 79. http://www.flickr.com/photos/sparktography/75499095/
  • 80. THE END
  • 81. A parting word. A parting challenge.
  • 82. Two Propositions
  • 83. 100 changes, 6 change-related issues
  • 84. 100 > 6
  • 85. Proposition #1: “Ways in which things go right are special cases of the ways in which things go wrong.”
  • 86. Proposition #1: Successes = failures gone wrong. Study the failures, generalize from that. Potential data sources: 6 out of 100
  • 87. Proposition #2: “Ways in which things go wrong are special cases of the ways in which things go right.”
  • 88. Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Potential data sources: 94 out of 100
  • 89. 94/100? OR 6/100?
  • 90. What and WHY Do Things Go RIGHT?
  • 91. Not just: why did we fail? But also: why did we succeed?
  • 92. Mature Role of Automation: “Ironies of Automation”, Lisanne Bainbridge. http://www.bainbrdg.demon.co.uk/Papers/Ironies.html
  • 93. Mature Role of Automation • Moves humans from manual operator to supervisor • Extends and augments human abilities, doesn’t replace them • Doesn’t remove “human error” • Is brittle • Recognizes that there is always discretionary space for humans • Recognizes the Law of Stretched Systems
  • 94. Law of Stretched Systems: “Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity.” D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns”, 2006
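
Slide 25 names MTBF and MTTR as the two quantities to optimize. As a rough illustration that is not part of the original deck, the sketch below applies the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), to compare doubling the time between failures against halving the time to recover. The function name and the sample figures (500 hours between failures, 2 hours to recover) are assumptions chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up.

    mtbf_hours: mean time between failures (hypothetical sample input)
    mttr_hours: mean time to recover (hypothetical sample input)
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only; they do not come from the talk.
baseline = availability(mtbf_hours=500.0, mttr_hours=2.0)
fewer_failures = availability(mtbf_hours=1000.0, mttr_hours=2.0)   # double MTBF
faster_recovery = availability(mtbf_hours=500.0, mttr_hours=1.0)   # halve MTTR

for label, value in [
    ("baseline", baseline),
    ("double MTBF", fewer_failures),
    ("halve MTTR", faster_recovery),
]:
    print(f"{label:12s} {value:.4%}")
```

Under this formula, halving recovery time yields the same availability as doubling the time between failures, which is one way to read the deck's focus on how teams respond once an outage has started.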