Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls
Upcoming SlideShare
Loading in...5
×
 

Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

on

  • 5,665 views

...



When things go wrong, our judgement is clouded at best, blinded at worst.

In order to successfully navigate a large-scale outage, being aware of potentials gaps in knowledge and context can help make for a better outcome. The Human Factors and Systems Safety community have been studying how people situate themselves, coordinate amongst a team, use tooling, make decisions, and keep their cool under sometimes very stressful and escalating scenarios. We can learn from this research in order to adopt a more mature stance when the s*#t hits the fan.

We’re going to look closely at how people behave under these circumstances using real-world examples and scan what we can learn from High Reliability Organizations(HROs) and fields such as aviation, military, and trauma-driven healthcare.

Statistics

Views

Total Views
5,665
Views on SlideShare
5,376
Embed Views
289

Actions

Likes
14
Downloads
127
Comments
3

16 Embeds 289

http://tech.m6web.fr 104
https://twitter.com 73
http://candidosalesg.wordpress.com 33
http://www.twylah.com 18
http://localhost 17
http://candidosg.com 14
http://lanyrd.com 6
http://www.flavors.me 5
http://local.tryghost.org 4
http://flavors.me 4
http://www.linkedin.com 4
https://si0.twimg.com 3
http://leed.galsungen.net 1
https://twimg0-a.akamaihd.net 1
http://fr.flavors.me 1
http://japhy.soup.io 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • @jallspaw Thank you for the follow-up, and fingers crossed for the next time 'round :-)
    Are you sure you want to
    Your message goes here
    Processing…
  • Hey Sascha - unfortunately, no, there wasn't a video of the talk. :( I'm likely to give the talk soon where it can be recorded, tho.
    Are you sure you want to
    Your message goes here
    Processing…
  • John, was this talk recorded by any chance? Nice slides, but I get the feeling that they only convey ~10% of the content of the talk.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls Presentation Transcript

  • Escalating Scenarios A Deep Dive Into Outage Pitfalls John Allspaw Velocity London 2012Wednesday, October 3, 12
  • TROUBLESHOOTING This is NOT about troubleshooting Or, not just about troubleshootingWednesday, October 3, 12
  • LAYOUT • Criteria • Situational Awareness • HROs • Decision Making • Communication • Team Coordination • A little bit of psychologyWednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • How important is this?Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Oct 2011 Sept 2012Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Where to learn from?Wednesday, October 3, 12
  • TMIWednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • Kegworth 1989Wednesday, October 3, 12
  • Dr. Richard Cook, Velocity US 2012 http://www.youtube.com/watch?v=R_PDc0HFdP0Wednesday, October 3, 12
  • “The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea” Rochlin, La Porte, and Roberts. Naval War College Review 1987 http://govleaders.org/reliability.htmWednesday, October 3, 12
  • Wednesday, October 3, 12
  • Wednesday, October 3, 12
  • What Goes On In Our Heads?Wednesday, October 3, 12
  • Jens Rasmussen, 1983 Senior Member, IEEE “Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models” IEEE Transactions On Systems, Man, and Cybernetics, May 1983Wednesday, October 3, 12
  • SKILL - BASED Simple, routine RULE - BASED Knowable, but unfamiliar KNOWLEDGE - BASED (Reason, 1990) WTF IS GOING ON?Wednesday, October 3, 12
  • Situational Awareness"the perception of elements in the environment within a volume oftime and space, the comprehension of their meaning, and theprojection of their status in the near future,” - (Endsley, 1995)"keeping track of what is going on around you in a complex,dynamic environment" (Moray, 2005, p. 4)"knowing what is going on so you can figure out what todo" (Adam, 1993)Wednesday, October 3, 12
  • OODA Loop Observe Orient Decide Act Metrics Analysis Planning Execution Monitoring Visualization Resourcing Alerting Correlation Alarming credit: http://blog.b3k.us/ooda.htmlWednesday, October 3, 12
  • Canonical Work “Towards a Theory of Situational Awareness” Mica Endsley, Human Factors (1995) http://www.satechnologies.com/Papers/pdf/Toward%20a%20Theory %20of%20SA.pdfWednesday, October 3, 12
  • Situational Awareness Level I Perception Level II Comprehension Level III ProjectionWednesday, October 3, 12
  • System capability Interface design Stress and workload Complexity Automation Task/System Factors Feedback Situational Awareness Perception Performance State of the of elements Comprehension Projection Decision of actions environment in current situation of current situation of future status LEVEL I LEVEL II LEVEL III Information processing Individual Factors mechanisms Long term Goals and memory states Automaticity objectives Preconceptions - Abilities (expectations) - Experience - Training (Endsley)Wednesday, October 3, 12
  • Level One: PerceptionWednesday, October 3, 12
  • Wednesday, October 3, 12
  • Context Can you spot the anomaly?Wednesday, October 3, 12
  • Context 24 hoursWednesday, October 3, 12
  • Context 7 daysWednesday, October 3, 12
  • Context Normal But NoisyWednesday, October 3, 12
  • Level TwoWednesday, October 3, 12 Comprehension
  • Level TwoWednesday, October 3, 12
  • Mental Models • Categorization & Comprehension • Mental “map” or “schema” • Informed by experience, stored in memoryWednesday, October 3, 12
  • Mental ModelsWednesday, October 3, 12
  • Level ThreeWednesday, October 3, 12
  • Mental ModelsWednesday, October 3, 12
  • Level Three Common Clues you’re losing SA at this level • Ambiguity • Fixation • Confusion • Lack of Information • Failure to maintain • Failure to meet expected checkpoint or target • Failure to resolve discrepancies • A bad gut feeling that things are not quite right Wednesday, October 3, 12
  • Characteristics of response to escalating scenariosWednesday, October 3, 12
  • Characteristics of response to escalating scenarios ...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980Wednesday, October 3, 12
  • Characteristics of response to escalating scenarios ...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate) “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980Wednesday, October 3, 12
  • Characteristics of response to escalating scenarios ...inclined to think in causal SERIES, instead of causal NETS. A therefore B, instead of A, therefore B and C (therefore D and E), etc. “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980Wednesday, October 3, 12
  • SA PitfallsRequisite Memory TrapWednesday, October 3, 12
  • SA PitfallsWorkload, anxiety,fatigue, other stressorsWednesday, October 3, 12
  • SA PitfallsData OverloadWednesday, October 3, 12
  • SA PitfallsMisplace SalienceWednesday, October 3, 12
  • http://www.perceptualedge.com/articles/Whitepapers/Dashboard_Design.pdfWednesday, October 3, 12
  • http://www.perceptualedge.com/articles/Whitepapers/Dashboard_Design.pdfWednesday, October 3, 12
  • Wednesday, October 3, 12
  • SA Pitfalls Complexity Creep “Everything should be as simple as it can be, but not simpler.” - paraphrased, EinsteinWednesday, October 3, 12
  • SA PitfallsPoor Mental ModelsWednesday, October 3, 12
  • SA PitfallsOut-Of-The-LoopSyndromeWednesday, October 3, 12
  • SA PitfallsRefusal to makedecisionsWednesday, October 3, 12
  • Heroism Non-communicating lone wolf-ismsWednesday, October 3, 12
  • Distraction Irrelevant noise in comm channelsWednesday, October 3, 12
  • Wednesday, October 3, 12
  • TEAMS • Divide and conquer applied to problem space, division of labor • Incident resolution vs. Problem resolution • Reproducibility • Fault Tolerance EffectsWednesday, October 3, 12
  • TEAMS Shotgun debuggingWednesday, October 3, 12
  • JOINT ACTIVITY • Interpredictability • Common Ground • Directability http://csel.eng.ohio-state.edu/woods/distributed/CG%20final.pdfWednesday, October 3, 12
  • InterpredictabilityWednesday, October 3, 12
  • Common GroundWednesday, October 3, 12
  • DirectabilityWednesday, October 3, 12
  • ImprovisationWednesday, October 3, 12
  • IMPROVISATIONWednesday, October 3, 12
  • IMPROVISATIONWednesday, October 3, 12
  • Improvisation “...you can’t improvise on nothing; you got to improvise on something.” Charles MingusWednesday, October 3, 12
  • Diagnose the problem Represent the problem Detect the Generate a Apply course of Leverage Problem/Opportunity action Points EvaluateWednesday, October 3, 12
  • Communication Recommendations •Explicitness •Assertiveness •TimingWednesday, October 3, 12
  • Assertiveness • Passive • Assertive • AggressiveWednesday, October 3, 12
  • Wednesday, October 3, 12
  • ExerciseWednesday, October 3, 12
  • Communication • IRC? • Face-To-Face? • Conference Call? • Morse Code?Wednesday, October 3, 12
  • Kegworth 1989Wednesday, October 3, 12
  • Transmission Encode Decode Meaning Meaning Sender ReceiverWednesday, October 3, 12
  • Transmission Encode Decode Meaning Meaning Sender ReceiverWednesday, October 3, 12
  • Transmission Encode Decode Meaning Meaning Sender Receiver Meaning Meaning Receiver Sender Decode Encode TransmissionWednesday, October 3, 12
  • Transmission Encode Decode Meaning Meaning Sender Receiver Meaning Meaning Receiver Sender Decode Encode TransmissionWednesday, October 3, 12
  • Feedback InformationalWednesday, October 3, 12
  • Feedback CorrectiveWednesday, October 3, 12
  • Feedback ReinforcingWednesday, October 3, 12
  • Decision Making Naturalistic Decision Making (NDM) Gary KleinWednesday, October 3, 12
  • Decision Making Step One: What is the problem?Wednesday, October 3, 12
  • Decision Making Step Two: What shall I do?Wednesday, October 3, 12
  • Decision Making Recognition-Primed DecisionsWednesday, October 3, 12
  • Decision Making Rule-Based DecisionsWednesday, October 3, 12
  • Decision Making Choice decisionsWednesday, October 3, 12
  • Decision Making Creative decisionsWednesday, October 3, 12
  • Decision Making Decreasing cognitive effort Decreasing effects of stress Creative Choice Rule-Based RPD Increasing cognitive effort Increasing effects of stressWednesday, October 3, 12
  • Decision Making PRE-MortemWednesday, October 3, 12
  • ToolingWednesday, October 3, 12
  • ? ?Wednesday, October 3, 12
  • Time Period MetricWednesday, October 3, 12
  • ControlsWednesday, October 3, 12
  • ALERTS • Meant to boost SA • Alarm overload • High false alarm rates • Routinely disable alertsWednesday, October 3, 12
  • Alert ReliabilityWednesday, October 3, 12
  • ALERT DESIGN • Signal:Noise can be difficult • Easy to err on more false alarms • Decay in trust • Origins: Undetectable conditionsWednesday, October 3, 12
  • ALERT DESIGN ConfirmationWednesday, October 3, 12
  • ALERT DESIGN ExpectancyWednesday, October 3, 12
  • ALERT DESIGNWednesday, October 3, 12
  • ALERT DESIGN • Don’t make people singularly reliant on alarms • Support alarm confirmation activities • Make alarms unambiguous • Reduce, reduce, reduce false alerts • Set missed/false alert trade-offs appropriatelyWednesday, October 3, 12
  • ALERT DESIGN • Use multiple modalities • Minimize alarm disruptions to ongoing activities • Support the assessment/diagnosis of multiple alerts • Support global SA of systems in an alarm stateWednesday, October 3, 12
  • Mature Role of Automation “Ironies of Automation” - Lisanne Bainbridge http://www.bainbrdg.demon.co.uk/Papers/Ironies.htmlWednesday, October 3, 12
  • Mature Role of Automation • Moves humans from manual operator to supervisor • Extends and augments human abilities, doesn’t replace it • Doesn’t remove “human error” • Are brittle • Recognize that there is always discretionary space for humans • Recognizes the Law of Stretched SystemsWednesday, October 3, 12
  • SUMMARYWednesday, October 3, 12
  • So what can we do? “In preparing for battle, I have always found that plans are useless but planning is indispensable.” - EisenhowerWednesday, October 3, 12
  • So what can we do? We develop our Non-Technical Skills • Situational Awareness • Communication • Decision Making • Improvisation • Crew Resource Management (CRM)Wednesday, October 3, 12
  • So what can we do? We tailor our environment to adapt • Tooling to support SA • Learning from outages (PostMortem) • Anticipating problems (PreMortem) • Gather Meta-MetricsWednesday, October 3, 12
  • Wednesday, October 3, 12
  • THE ENDWednesday, October 3, 12
  • Credits • http://www.flickr.com/photos/28650594@N03/2718027136/ • http://www.flickr.com/photos/telstar/2816887784/ • http://www.flickr.com/photos/soldiersmediacenter/3855375117/ • http://www.flickr.com/photos/19743256@N00/2640217771/ • http://www.flickr.com/photos/29456680@N06/6148754691/ • https://www.flickr.com/photos/spencediddy/3197199659/sizes/l/in/faves-allspaw/ • http://www.flickr.com/photos/splorp/64027565/sizes/l/in/photostream/Wednesday, October 3, 12