Resilient                Response in                Complex                Systems                                  John A...
OPERABILITYSunday, March 11, 12
PRODUCTIONSunday, March 11, 12
http://whoownsmyavailability.comSunday, March 11, 12
Sunday, March 11, 12
How important is this?Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
Sunday, March 11, 12
How important is this?Sunday, March 11, 12
How Can This Happen?Sunday, March 11, 12
Complicated?                          Complex?Sunday, March 11, 12
Complex Systems    •      Cascading Failures    •      Difficult to determine boundaries    •      Complex systems may be op...
1998Sunday, March 11, 12
How Can This Happen?                     It does happen.                    And it will again.Sunday, March 11, 12        ...
Sunday, March 11, 12
Optimization                       MTBF                       MTTRSunday, March 11, 12
http://www.flickr.com/photos/sparktography/75499095/Sunday, March 11, 12
How does team                       troubleshooting                           happen?Sunday, March 11, 12
Problem Starts                   Detection                          Evaluation                                  Response  ...
Problem Starts                                       Stress                   Detection                          Evaluatio...
Forced beyond learned roles         Actions whose consequences are both important and         difficult to see         Cogni...
Sunday, March 11, 12
So What                       Can We Do?Sunday, March 11, 12
We Learn From                          OthersSunday, March 11, 12
Characteristics of response to    escalating scenariosSunday, March 11, 12
Characteristics of response to    escalating scenarios                       ...tend to neglect how processes             ...
Characteristics of response to    escalating scenarios                       ...have difficulty in dealing with             ...
Characteristics of response to    escalating scenarios                       ...inclined to think in causal series,       ...
PitfallsThematicVagabondingSunday, March 11, 12
PitfallsGoal Fixation(encystment)Sunday, March 11, 12
PitfallsRefusal to makedecisionsSunday, March 11, 12
Heroism         Non-communicating lone wolf-ismsSunday, March 11, 12
Distraction         Irrelevant noise in comm channelsSunday, March 11, 12
Jens Rasmussen, 1983                                                       Senior Member, IEEE         “Skills, Rules, and...
SKILL - BASED                             Simple, routine RULE - BASED                        Knowable, but unfamiliar KNO...
Team DynamicsSunday, March 11, 12
High Reliability Organizations      • Air Traffic Control           • Complex Socio-Technical                               ...
Sunday, March 11, 12
“The Self-Designing High-Reliability Organization:         Aircraft Carrier Flight Operations at Sea”         Rochlin, La ...
"So you want to understand an aircraft carrier? Well, just                       imagine that its a busy day, and you shri...
Close interdependence         between groupsSunday, March 11, 12
Close reciprocal            coordination and            information sharing,            resulting in overlapping          ...
High redundancy: multiple            people observing the same            event and sharing            informationSunday, ...
Broad definition of who         belongs to the team.Sunday, March 11, 12
Teammates are included in         the communication loops         rather than excluded.Sunday, March 11, 12
Lots of error correction.Sunday, March 11, 12
High levels of situation         comprehension: maintain         constant awareness of the         possibility of accident...
High levels of interpersonal         skillsSunday, March 11, 12
Maintenance of detailed         records of past incidents         that are closely examined         with a view to learnin...
Patterns of authority are         changed to meet the         demands of the events:         organizational flexibility.Sun...
The reporting of errors and         faults is rewarded, not         punished.Sunday, March 11, 12
So What Else                       Can We Do?Sunday, March 11, 12
We DrillSunday, March 11, 12
We GameDaySunday, March 11, 12
Sunday, March 11, 12
We Learn To ImproviseSunday, March 11, 12
IMPROVISATIONSunday, March 11, 12
IMPROVISATIONSunday, March 11, 12
We Learn From Our                           MistakesSunday, March 11, 12
Postmortems      • Full timelines: What happened, when      • Review in public, everyone invited      • Search for “second...
Qualifying Response        High signal:noise in comm channels?        Troubleshooting fatigue?        Troubleshooting hand...
RemediationSunday, March 11, 12
Mature Role of Automation       “Ironies of Automation” - Lisanne Bainbridge          http://www.bainbrdg.demon.co.uk/Pape...
Mature Role of Automation        •       Moves humans from manual operator to supervisor        •       Extends and augmen...
Law of Stretched Systems         “Every system is stretched to operate at its         capacity; as soon as there is some  ...
We Share Near-Miss                            EventsSunday, March 11, 12
Near Misses                       Hey everybody -                       Don’t be like me. I tried to X, but               ...
Near Misses      • Can act like “vaccines” - help system safety without actually              hurting anything      • Happ...
A parting word    A parting challengeSunday, March 11, 12
Two PropositionsSunday, March 11, 12
100 changes         6 change-related issuesSunday, March 11, 12
100 > 6Sunday, March 11, 12
Proposition #1         “Ways in which things go right are special cases         of the ways in which things go wrong.”Sund...
Proposition #1                       Successes = failures gone wrong                       Study the failures, generalize ...
Proposition #2         “Ways in which things go wrong are special         cases of the ways in which things go right.”Sund...
Proposition #2                          Failures = successes gone wrong                          Study the successes, gene...
94/100 ?                          ORSunday, March 11, 12                       6/100 ?
What and WHY Do Things     Go RIGHT?Sunday, March 11, 12
Not just:                           why did we fail?        But also:                       why did we succeed?Sunday, Mar...
Resilient Response  •      Can learn from other fields  •      Can train for outages  •      Can learn from mistakes  •    ...
http://www.flickr.com/photos/sparktography/75499095/Sunday, March 11, 12
THE ENDSunday, March 11, 12
Upcoming SlideShare
Loading in...5
×

Resilient Response In Complex Systems

6,659

Published on

QCon London 2012 Keynote

Published in: Technology

Transcript of "Resilient Response In Complex Systems"

  1. 1. Resilient Response in Complex Systems John Allspaw SVP, Tech Ops Qcon London 2012Sunday, March 11, 12
  2. 2. OPERABILITYSunday, March 11, 12
  3. 3. PRODUCTIONSunday, March 11, 12
  4. 4. http://whoownsmyavailability.comSunday, March 11, 12
  5. 5. Sunday, March 11, 12
  6. 6. How important is this?Sunday, March 11, 12
  7. 7. Sunday, March 11, 12
  8. 8. Sunday, March 11, 12
  9. 9. Sunday, March 11, 12
  10. 10. Sunday, March 11, 12
  11. 11. Sunday, March 11, 12
  12. 12. Sunday, March 11, 12
  13. 13. Sunday, March 11, 12
  14. 14. Sunday, March 11, 12
  15. 15. Sunday, March 11, 12
  16. 16. Sunday, March 11, 12
  17. 17. Sunday, March 11, 12
  18. 18. Sunday, March 11, 12
  19. 19. How important is this?Sunday, March 11, 12
  20. 20. How Can This Happen?Sunday, March 11, 12
  21. 21. Complicated? Complex?Sunday, March 11, 12
  22. 22. Complex Systems • Cascading Failures • Difficult to determine boundaries • Complex systems may be open • Complex systems may have a memory • Complex systems may be nested • Dynamic network of multiplicity • May produce emergent phenomena • Relationships are non-linear • Relationships contain feedback loopsSunday, March 11, 12
  23. 23. 1998Sunday, March 11, 12
  24. 24. How Can This Happen? It does happen. And it will again.Sunday, March 11, 12 And again.
  25. 25. Sunday, March 11, 12
  26. 26. Optimization MTBF MTTRSunday, March 11, 12
  27. 27. http://www.flickr.com/photos/sparktography/75499095/Sunday, March 11, 12
  28. 28. How does team troubleshooting happen?Sunday, March 11, 12
  29. 29. Problem Starts Detection Evaluation Response Stable PostMortem Confirmation All Clear TimeSunday, March 11, 12
  30. 30. Problem Starts Stress Detection Evaluation Response Stable PostMortem Confirmation All Clear TimeSunday, March 11, 12
  31. 31. Forced beyond learned roles Actions whose consequences are both important and difficult to see Cognitively and perceptively noisy Coordinative load increases exponentiallySunday, March 11, 12
  32. 32. Sunday, March 11, 12
  33. 33. So What Can We Do?Sunday, March 11, 12
  34. 34. We Learn From OthersSunday, March 11, 12
  35. 35. Characteristics of response to escalating scenariosSunday, March 11, 12
  36. 36. Characteristics of response to escalating scenarios ...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980Sunday, March 11, 12
  37. 37. Characteristics of response to escalating scenarios ...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate) “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980Sunday, March 11, 12
  38. 38. Characteristics of response to escalating scenarios ...inclined to think in causal series, instead of causal nets. A therefore B, instead of A, therefore B and C (therefore D and E), etc. “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980Sunday, March 11, 12
  39. 39. PitfallsThematicVagabondingSunday, March 11, 12
  40. 40. PitfallsGoal Fixation(encystment)Sunday, March 11, 12
  41. 41. PitfallsRefusal to makedecisionsSunday, March 11, 12
  42. 42. Heroism Non-communicating lone wolf-ismsSunday, March 11, 12
  43. 43. Distraction Irrelevant noise in comm channelsSunday, March 11, 12
  44. 44. Jens Rasmussen, 1983 Senior Member, IEEE “Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models” IEEE Transactions On Systems, Man, and Cybernetics, May 1983Sunday, March 11, 12
  45. 45. SKILL - BASED Simple, routine RULE - BASED Knowable, but unfamiliar KNOWLEDGE - BASED (Reason, 1990) WTF IS GOING ON?Sunday, March 11, 12
  46. 46. Team DynamicsSunday, March 11, 12
  47. 47. High Reliability Organizations • Air Traffic Control • Complex Socio-Technical systems • Naval Air Operations At Sea • Efficiency <-> Thoroughness • Electrical Power Systems • Time/Resource Constrained • Etc. • Engineering-drivenSunday, March 11, 12
  48. 48. Sunday, March 11, 12
  49. 49. “The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea” Rochlin, La Porte, and Roberts. Naval War College Review 1987 http://govleaders.org/reliability.htmSunday, March 11, 12
  50. 50. "So you want to understand an aircraft carrier? Well, just imagine that its a busy day, and you shrink San Francisco Airport to only one short runway and one ramp and gate. Make planes take off and land at the same time, at half the present time interval, rock the runway from side to side, and require that everyone who leaves in the morning returns that same day. Make sure the equipment is so close to the edge of the envelope that its fragile. Then turn off the radar to avoid detection, impose strict controls on radios, fuel the aircraft in place with their engines running, put an enemy in the air, and scatter live bombs and rockets around. Now wet the whole thing down with salt water and oil, and man it with 20-year-olds, half of whom have never seen an airplane close-up. Oh, and by the way, try not to kill anyone."                                             -- Senior officer, Air DivisionSunday, March 11, 12
  51. 51. Close interdependence between groupsSunday, March 11, 12
  52. 52. Close reciprocal coordination and information sharing, resulting in overlapping knowledgeSunday, March 11, 12
  53. 53. High redundancy: multiple people observing the same event and sharing informationSunday, March 11, 12
  54. 54. Broad definition of who belongs to the team.Sunday, March 11, 12
  55. 55. Teammates are included in the communication loops rather than excluded.Sunday, March 11, 12
  56. 56. Lots of error correction.Sunday, March 11, 12
  57. 57. High levels of situation comprehension: maintain constant awareness of the possibility of accidents.Sunday, March 11, 12
  58. 58. High levels of interpersonal skillsSunday, March 11, 12
  59. 59. Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.Sunday, March 11, 12
  60. 60. Patterns of authority are changed to meet the demands of the events: organizational flexibility.Sunday, March 11, 12
  61. 61. The reporting of errors and faults is rewarded, not punished.Sunday, March 11, 12
  62. 62. So What Else Can We Do?Sunday, March 11, 12
  63. 63. We DrillSunday, March 11, 12
  64. 64. We GameDaySunday, March 11, 12
  65. 65. Sunday, March 11, 12
  66. 66. We Learn To ImproviseSunday, March 11, 12
  67. 67. IMPROVISATIONSunday, March 11, 12
  68. 68. IMPROVISATIONSunday, March 11, 12
  69. 69. We Learn From Our MistakesSunday, March 11, 12
  70. 70. Postmortems • Full timelines: What happened, when • Review in public, everyone invited • Search for “second stories” instead of “human error” • Cultivating a blameless environment • Giving requisite authority to individuals to improve thingsSunday, March 11, 12
  71. 71. Qualifying Response High signal:noise in comm channels? Troubleshooting fatigue? Troubleshooting handoff? All tools on-hand? Improvised tooling or solutions? Metrics visibility? Collaborative and skillful communication?Sunday, March 11, 12
  72. 72. RemediationSunday, March 11, 12
  73. 73. Mature Role of Automation “Ironies of Automation” - Lisanne Bainbridge http://www.bainbrdg.demon.co.uk/Papers/Ironies.htmlSunday, March 11, 12
  74. 74. Mature Role of Automation • Moves humans from manual operator to supervisor • Extends and augments human abilities, doesn’t replace it • Doesn’t remove “human error” • Are brittle • Recognize that there is always discretionary space for humans • Recognizes the Law of Stretched SystemsSunday, March 11, 12
  75. 75. Law of Stretched Systems “Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity” D.Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns” 2006Sunday, March 11, 12
  76. 76. We Share Near-Miss EventsSunday, March 11, 12
  77. 77. Near Misses Hey everybody - Don’t be like me. I tried to X, but that wasn’t a good idea. It almost exploded everyone. So, don’t do: (details about X) Love, JoeSunday, March 11, 12
  78. 78. Near Misses • Can act like “vaccines” - help system safety without actually hurting anything • Happen more often, so provide more data on latent failures • Powerful reminder of hazards, and slows down the process of forgetting to be afraidSunday, March 11, 12
  79. 79. A parting word A parting challengeSunday, March 11, 12
  80. 80. Two PropositionsSunday, March 11, 12
  81. 81. 100 changes 6 change-related issuesSunday, March 11, 12
  82. 82. 100 > 6Sunday, March 11, 12
  83. 83. Proposition #1 “Ways in which things go right are special cases of the ways in which things go wrong.”Sunday, March 11, 12
  84. 84. Proposition #1 Successes = failures gone wrong Study the failures, generalize from that. Potential data sources: 6 out of 100Sunday, March 11, 12
  85. 85. Proposition #2 “Ways in which things go wrong are special cases of the ways in which things go right.”Sunday, March 11, 12
  86. 86. Proposition #2 Failures = successes gone wrong Study the successes, generalize from thatSunday, March 11, 12 Potential data sources: 94 out of 100
  87. 87. 94/100 ? ORSunday, March 11, 12 6/100 ?
  88. 88. What and WHY Do Things Go RIGHT?Sunday, March 11, 12
  89. 89. Not just: why did we fail? But also: why did we succeed?Sunday, March 11, 12
  90. 90. Resilient Response • Can learn from other fields • Can train for outages • Can learn from mistakes • Can learn from successes as well as failuresSunday, March 11, 12
  91. 91. http://www.flickr.com/photos/sparktography/75499095/Sunday, March 11, 12
  92. 92. THE ENDSunday, March 11, 12

×