Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

When things go wrong, our judgement is clouded at best, blinded at worst.

To successfully navigate a large-scale outage, being aware of potential gaps in knowledge and context can help make for a better outcome. The Human Factors and Systems Safety communities have been studying how people situate themselves, coordinate within a team, use tooling, make decisions, and keep their cool in stressful, escalating scenarios. We can learn from this research in order to adopt a more mature stance when the s*#t hits the fan.

We’re going to look closely at how people behave under these circumstances using real-world examples, and see what we can learn from High Reliability Organizations (HROs) and fields such as aviation, the military, and trauma-driven healthcare.

Published in: Business


Transcript

  • 1. Escalating Scenarios: A Deep Dive Into Outage Pitfalls. John Allspaw, Velocity London 2012. Wednesday, October 3, 12
  • 2. TROUBLESHOOTING: This is NOT about troubleshooting. Or, not just about troubleshooting
  • 3. LAYOUT • Criteria • Situational Awareness • HROs • Decision Making • Communication • Team Coordination • A little bit of psychology
  • 6. How important is this?
  • 12. Oct 2011 Sept 2012
  • 17. Where to learn from?
  • 18. TMI
  • 21. Kegworth 1989
  • 22. Dr. Richard Cook, Velocity US 2012 http://www.youtube.com/watch?v=R_PDc0HFdP0
  • 23. “The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”, Rochlin, La Porte, and Roberts. Naval War College Review, 1987 http://govleaders.org/reliability.htm
  • 26. What Goes On In Our Heads?
  • 27. Jens Rasmussen (Senior Member, IEEE), 1983: “Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”, IEEE Transactions On Systems, Man, and Cybernetics, May 1983
  • 28. SKILL-BASED: Simple, routine • RULE-BASED: Knowable, but unfamiliar • KNOWLEDGE-BASED: WTF IS GOING ON? (Reason, 1990)
  • 29. Situational Awareness • "the perception of elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future" (Endsley, 1995) • "keeping track of what is going on around you in a complex, dynamic environment" (Moray, 2005, p. 4) • "knowing what is going on so you can figure out what to do" (Adam, 1993)
  • 30. OODA Loop • Observe: Metrics, Monitoring, Alerting, Alarming • Orient: Analysis, Visualization, Correlation • Decide: Planning, Resourcing • Act: Execution (credit: http://blog.b3k.us/ooda.html)
  • 31. Canonical Work: “Towards a Theory of Situational Awareness”, Mica Endsley, Human Factors (1995) http://www.satechnologies.com/Papers/pdf/Toward%20a%20Theory%20of%20SA.pdf
  • 32. Situational Awareness • Level I: Perception • Level II: Comprehension • Level III: Projection
  • 33. Model of Situational Awareness (diagram) • Task/System Factors: System capability, Interface design, Stress and workload, Complexity, Automation • Flow: State of the environment → Situational Awareness (LEVEL I: Perception of elements in current situation; LEVEL II: Comprehension of current situation; LEVEL III: Projection of future status) → Decision → Performance of actions, with Feedback • Individual Factors: Goals and objectives, Preconceptions (expectations), Information processing mechanisms, Long term memory states, Automaticity, Abilities, Experience, Training (Endsley)
  • 34. Level One: Perception
  • 36. Context: Can you spot the anomaly?
  • 37. Context: 24 hours
  • 38. Context: 7 days
  • 39. Context: Normal But Noisy
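The Context slides (36 to 39) suggest that a reading which looks alarming against 24 hours of data can turn out to be normal but noisy against 7 days. A minimal sketch of that idea follows. The data, the function names, and the choice of a 99th-percentile cutoff are all assumptions made for illustration; none of them come from the talk.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def exceeds_window(window, value, p=99):
    """True when `value` is above the p-th percentile of the given window."""
    return value > percentile(window, p)

# Hypothetical hourly request rates: six quiet days and one day with a short peak.
quiet_day = [100 + (hour % 5) for hour in range(24)]
busy_day = list(quiet_day)
busy_day[9:12] = [178, 182, 185]      # the peak, six days ago

last_7d = busy_day + quiet_day * 6    # oldest sample first
last_24h = last_7d[-24:]              # only the most recent quiet day

reading = 180
print(exceeds_window(last_24h, reading))  # judged against one day of context
print(exceeds_window(last_7d, reading))   # judged against a week of context
```

With this made-up data the same reading stands out against the 24-hour window but not against the 7-day window, which is the kind of context a single-day graph hides.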
  • 40. Level Two: Comprehension
  • 41. Level Two
  • 42. Mental Models • Categorization & Comprehension • Mental “map” or “schema” • Informed by experience, stored in memory
  • 43. Mental Models
  • 44. Level Three
  • 45. Mental Models
  • 46. Level Three: Common clues you’re losing SA at this level • Ambiguity • Fixation • Confusion • Lack of Information • Failure to maintain • Failure to meet expected checkpoint or target • Failure to resolve discrepancies • A bad gut feeling that things are not quite right
  • 47. Characteristics of response to escalating scenarios
  • 48. Characteristics of response to escalating scenarios: ...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment. “On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980
  • 49. Characteristics of response to escalating scenarios: ...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate). “On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980
  • 50. Characteristics of response to escalating scenarios: ...inclined to think in causal SERIES, instead of causal NETS. A therefore B, instead of A, therefore B and C (therefore D and E), etc. “On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980
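Slides 48 and 49 point at rates and exponential developments. As a rough worked example, the sketch below estimates how long a growing queue takes to reach a limit under a linear assumption and under an exponential one. The queue depths, the limit, and the function names are hypothetical and chosen only to show how far the two estimates can diverge.

```python
import math

def minutes_to_limit_linear(prev, curr, limit, step_minutes=1.0):
    """Assume the quantity keeps growing by the same absolute amount per step."""
    rate = (curr - prev) / step_minutes
    if rate <= 0:
        return math.inf
    return (limit - curr) / rate

def minutes_to_limit_exponential(prev, curr, limit, step_minutes=1.0):
    """Assume the quantity keeps growing by the same factor per step."""
    if prev <= 0 or curr <= prev:
        return math.inf
    growth = curr / prev
    return math.log(limit / curr) / math.log(growth) * step_minutes

# Hypothetical queue depth one minute ago and now, with a hard limit of 10,000.
prev_depth, curr_depth, limit = 200, 400, 10_000

print(minutes_to_limit_linear(prev_depth, curr_depth, limit))       # 48.0
print(minutes_to_limit_exponential(prev_depth, curr_depth, limit))  # roughly 4.6
```

If the queue is really doubling every minute, the linear reading of the same two samples overestimates the time available by about a factor of ten.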
  • 51. SA Pitfalls: Requisite Memory Trap
  • 52. SA Pitfalls: Workload, anxiety, fatigue, other stressors
  • 53. SA Pitfalls: Data Overload
  • 54. SA Pitfalls: Misplaced Salience
  • 55. http://www.perceptualedge.com/articles/Whitepapers/Dashboard_Design.pdf
  • 56. http://www.perceptualedge.com/articles/Whitepapers/Dashboard_Design.pdf
  • 58. SA Pitfalls: Complexity Creep. “Everything should be as simple as it can be, but not simpler.” - paraphrased, Einstein
  • 59. SA Pitfalls: Poor Mental Models
  • 60. SA Pitfalls: Out-Of-The-Loop Syndrome
  • 61. SA Pitfalls: Refusal to make decisions
  • 62. Heroism: Non-communicating lone wolf-isms
  • 63. Distraction: Irrelevant noise in comm channels
  • 65. TEAMS • Divide and conquer applied to problem space, division of labor • Incident resolution vs. Problem resolution • Reproducibility • Fault Tolerance Effects
  • 66. TEAMS: Shotgun debugging
  • 67. JOINT ACTIVITY • Interpredictability • Common Ground • Directability http://csel.eng.ohio-state.edu/woods/distributed/CG%20final.pdf
  • 68. Interpredictability
  • 69. Common Ground
  • 70. Directability
  • 71. Improvisation
  • 72. IMPROVISATION
  • 73. IMPROVISATION
  • 74. Improvisation: “...you can’t improvise on nothing; you got to improvise on something.” Charles Mingus
  • 75. (Diagram) Detect the Problem/Opportunity • Diagnose the problem • Represent the problem • Generate a course of action • Apply Leverage Points • Evaluate
  • 76. Communication Recommendations • Explicitness • Assertiveness • Timing
  • 77. Assertiveness • Passive • Assertive • Aggressive
  • 79. Exercise
  • 80. Communication • IRC? • Face-To-Face? • Conference Call? • Morse Code?
  • 81. Kegworth 1989
  • 82. (Diagram) Sender (Meaning) → Encode → Transmission → Decode → Receiver (Meaning)
  • 83. (Diagram) Sender (Meaning) → Encode → Transmission → Decode → Receiver (Meaning)
  • 84. (Diagram) Sender (Meaning) → Encode → Transmission → Decode → Receiver (Meaning), plus the return path with the roles reversed
  • 85. (Diagram) Sender (Meaning) → Encode → Transmission → Decode → Receiver (Meaning), plus the return path with the roles reversed
  • 86. Feedback: Informational
  • 87. Feedback: Corrective
  • 88. Feedback: Reinforcing
  • 89. Decision Making: Naturalistic Decision Making (NDM), Gary Klein
  • 90. Decision Making: Step One: What is the problem?
  • 91. Decision Making: Step Two: What shall I do?
  • 92. Decision Making: Recognition-Primed Decisions
  • 93. Decision Making: Rule-Based Decisions
  • 94. Decision Making: Choice decisions
  • 95. Decision Making: Creative decisions
  • 96. Decision Making: Creative • Choice • Rule-Based • RPD, arranged along two scales: decreasing to increasing cognitive effort, and decreasing to increasing effects of stress
  • 97. Decision Making: PRE-Mortem
  • 98. Tooling
  • 99. ? ?
  • 100. Time Period Metric
  • 101. Controls
  • 102. ALERTS • Meant to boost SA • Alarm overload • High false alarm rates • Routinely disable alerts
  • 103. Alert Reliability
  • 104. ALERT DESIGN • Signal:Noise can be difficult • Easy to err on more false alarms • Decay in trust • Origins: Undetectable conditions
  • 105. ALERT DESIGN: Confirmation
  • 106. ALERT DESIGN: Expectancy
  • 107. ALERT DESIGN
  • 108. ALERT DESIGN • Don’t make people singularly reliant on alarms • Support alarm confirmation activities • Make alarms unambiguous • Reduce, reduce, reduce false alerts • Set missed/false alert trade-offs appropriately
  • 109. ALERT DESIGN • Use multiple modalities • Minimize alarm disruptions to ongoing activities • Support the assessment/diagnosis of multiple alerts • Support global SA of systems in an alarm state
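The alert slides (102 to 109) mention false alarm rates and the missed/false alert trade-off. One way to make that trade-off visible is to replay labelled history against candidate thresholds, as in the sketch below. The sample data, the threshold values, and the function and key names are assumptions for illustration only; the talk does not prescribe this method.

```python
def alert_tradeoff(samples, threshold):
    """Summarise what a threshold alert would have done on labelled history.

    `samples` is a list of (metric_value, was_real_problem) pairs.
    """
    true_alerts = sum(1 for value, bad in samples if value >= threshold and bad)
    false_alerts = sum(1 for value, bad in samples if value >= threshold and not bad)
    missed = sum(1 for value, bad in samples if value < threshold and bad)

    fired = true_alerts + false_alerts
    problems = true_alerts + missed
    return {
        # Share of fired alerts that were not real problems (erodes trust).
        "false_alert_share": false_alerts / fired if fired else 0.0,
        # Share of real problems the alert never fired for.
        "missed_share": missed / problems if problems else 0.0,
    }

# Hypothetical history: CPU percentage and whether an operator confirmed a problem.
history = [
    (95, True), (91, True), (88, False), (85, True),
    (82, False), (78, False), (75, False), (60, False),
]

for threshold in (70, 80, 90):
    print(threshold, alert_tradeoff(history, threshold))
```

On this made-up history a threshold of 70 misses nothing but more than half of its alerts are false, while a threshold of 90 never cries wolf but misses one problem in three. Where to sit between those depends on the cost of each kind of error.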
  • 110. Mature Role of Automation: “Ironies of Automation” - Lisanne Bainbridge http://www.bainbrdg.demon.co.uk/Papers/Ironies.html
  • 111. Mature Role of Automation • Moves humans from manual operator to supervisor • Extends and augments human abilities, doesn’t replace them • Doesn’t remove “human error” • Is brittle • Recognizes that there is always discretionary space for humans • Recognizes the Law of Stretched Systems
  • 112. SUMMARY
  • 113. So what can we do? “In preparing for battle, I have always found that plans are useless but planning is indispensable.” - Eisenhower
  • 114. So what can we do? We develop our Non-Technical Skills • Situational Awareness • Communication • Decision Making • Improvisation • Crew Resource Management (CRM)
  • 115. So what can we do? We tailor our environment to adapt • Tooling to support SA • Learning from outages (PostMortem) • Anticipating problems (PreMortem) • Gather Meta-Metrics
  • 117. THE END
  • 118. Credits • http://www.flickr.com/photos/28650594@N03/2718027136/ • http://www.flickr.com/photos/telstar/2816887784/ • http://www.flickr.com/photos/soldiersmediacenter/3855375117/ • http://www.flickr.com/photos/19743256@N00/2640217771/ • http://www.flickr.com/photos/29456680@N06/6148754691/ • https://www.flickr.com/photos/spencediddy/3197199659/sizes/l/in/faves-allspaw/ • http://www.flickr.com/photos/splorp/64027565/sizes/l/in/photostream/