
Amplifying Sources of Resilience: What Research Says


Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2KltPtb.

John Allspaw talks about applying Resilience Engineering thinking & paradigms to the world of software engineering and outlines productive avenues to locate, amplify, support, and build this capacity. Filmed at qconlondon.com.

John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. He served as CTO at Etsy. His publications include the books “The Art of Capacity Planning”, “Web Operations” and “The DevOps Handbook”. His 2009 talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation” helped start the DevOps movement.

Published in: Technology

  1. 1. Amplifying Sources of Resilience What the Research Says John Allspaw (@allspaw) Adaptive Capacity Labs (@adaptiveclabs)
  2. 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/resilience-thinking-paradigm/ • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  4. 4. me 2009 Velocity Conf Consortium for Resilient Internet-Facing Business IT Adaptive Capacity Labs
  5. 5. Resilience Engineering Is a Field of Study • Emerged from Cognitive Systems Engineering • Early 2000s, largely in response to NASA events in 1999 and 2000 • 7 symposia over 12 years
  6. 6. Resilience Engineering is a Community is largely made up of practitioners and researchers from…. Cybernetics Engineering* Ecology Safety Science Biology Control Systems Human Factors & Ergonomics Cognitive Systems Engineering Complexity Science Cognitive Psychology Sociology Operations Research
  7. 7. Resilience Engineering is a Community working in domains such as… Rail Maritime Surgery Intelligence Agencies Law Enforcement Aviation/ATM Space Mining Construction Explosives Firefighting Anesthesia Pediatrics Power Grid & Distribution Military Agencies Software Engineering
  8. 8. Some of the cast of characters David Woods CSEL/OSU Shawna Perry Univ of Florida Emergency Medicine Dr. Richard Cook Anesthesiologist Researcher Ivonne Andrade Herrera SINTEF Erik Hollnagel Univ of S. Denmark Anne-Sophie Nyssen University de Liege Johan Bergström Lund University Sidney Dekker Griffith University Asher Balkin CSEL/OSU Laura Maguire CSEL/OSU
  9. 9. Some of the cast of characters David Woods CSEL/OSU Shawna Perry Univ of Florida Emergency Medicine Dr. Richard Cook Anesthesiologist Researcher Ivonne Andrade Herrera SINTEF Erik Hollnagel Univ of S. Denmark Anne-Sophie Nyssen University de Liege Johan Bergström Lund University Sidney Dekker Griffith University Asher Balkin CSEL/OSU Laura Maguire CSEL/OSU J. Paul Reed, Jessica DeVita, Casey Rosenthal, Nora Jones, (me)
  10. 10. [Diagram] externally sourced code (e.g. DB); internally sourced code; delivery technology stack; results; the using world
  11. 11. [Diagram, slide 10 plus:] macro descriptions
  12. 12. [Diagram, slide 11 plus:] code repositories; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
  13. 13. [Diagram, as slide 12; partial labels: system, framing, doing]
  14. 14. [Diagram, cropped view of the tools: code, deploy, organization/, “monitoring”] Adding stuff to the running system; Getting stuff ready to be part of the running system; architectural & structural framing; keeping track of what “the system” is doing
  15. 15. [Diagram, slide 12 with the annotations from slide 14, plus:] The Work Is Done Here; Your Product Or Service; The Stuff You Build and Maintain With
  16. 16. [Diagram, as slide 15 without the three added labels]
  17. 17. [Diagram, slide 16 plus:] the line of representation; above the line; below the line; goals, purposes, risks; cognition, actions, interactions; speech, gestures, clicks, signals, representations, artifacts; individuals have unique models of the “system”; What matters. Why what matters matters. Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What is happening? What should be happening? observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting. Copyright © 2016 by R.I. Cook for ACL, all rights reserved; ack: Michael Angeles http://konigi.com/tools/
  18. 18. [Diagram, as slide 17]
  19. 19. [Diagram, cropped view of slide 17; some labels cut off]
  20. 20. [Diagram, as slide 16, plus:] Time; things are changing here… …and things are changing here
  21. 21. Amplifying Sources of Resilience What the Research Says John Allspaw Adaptive Capacity Labs clarifications about what this actually is can’t amplify something if you can’t find it first
  22. 22. what we find when we look closely at incidents in software
  23. 23. logs time of year day of year time of day observations and hypotheses others share what has been investigated thus far what’s been happening in the world (news, service provider outages, etc.) time-series data alerts tracing/ observability tools recent changes in existing tech new dependencies who is on vacation, at a conference, traveling, etc. status of other ongoing work
  24. 24. - tenure - domain expertise - past experience with details
  25. 25. multiple perspectives on • what “it” is that is happening • what can — and what cannot — be done to “stem the bleeding” or “reduce the blast radius” • who has authority to take certain actions • what shouldn’t be tried to mitigate or repair
  26. 26. - problem detection - generating hypotheses - diagnostic actions - therapeutic actions - sacrifice decisions - coordinating - (re) planning - preparing for potential escalation/cascades multiple threads of activity some productive some unproductive
  27. 27. time pressure high consequences
  28. 28. this is not “debugging” “troubleshooting”
  29. 29. therefore…“resilience”?
  30. 30. software development may incur future liability in order to achieve short-term goals 1992 Ward Cunningham THANK YOU, WARD!
  31. 31. resilience is not: • preventative design • fault-tolerance • redundancy • Chaos Engineering • stuff about software or hardware • a property that a system has
  32. 32. unforeseen unanticipated unexpected fundamentally surprising
  33. 33. “things that have never happened before happen all the time” –Scott Sagan, “The Limits of Safety”
  34. 34. resilience is: • proactive activities aimed at preparing to be unprepared — without an ability to justify it economically! • sustaining the potential for future adaptive action when conditions change • something that a system does, not what it has
  35. 35. sustained adaptive capacity
  36. 36. sustained adaptive capacity
  37. 37. Poised To Adapt 1. Knowing what the platform is supposed to do 2. Knowing how the platform works 3. What the platform’s behavior means 4. Being able to devise a change that addresses 1, 2, & 3 5. Being able to predict the effects of that change 6. Being able to force the platform to change in that way 7. Being prepared to deal with the consequences Dr. Richard Cook, Velocity Conf 2016 Santa Clara, CA
  38. 38. Finding sources of resilience means finding and understanding cognitive work. observing inferring anticipating planning troubleshooting diagnosing correcting modifying reacting
  39. 39. all incidents can be worse what are things (people, maneuvers, knowledge, etc.) that went into preventing it from being worse?
  40. 40. How can I find this “adaptive capacity”? Find incidents that: • had a high degree of surprise • had consequences that were not severe. Then: • look closely at the details of what went into making each one not nearly as bad as it could have been • protect and acknowledge explicitly the sources you find
  41. 41. indications of surprise and novelty <murphy> wtf happened here <steve> I have no idea what is going on <laurie> well that's terrifying
  42. 42. indications about contrasting mental models <laurie> so you want to rebuild {server01} first? <laurie> neither box has been touched yet <laurie> and im a tad nervous to do both at once <lisa> wait wait, i thought the X table was small <jeremy> I'm still a bit confused why B and A are different if A got to 0 and B is still at 3099 <tim>: oh I see.. the retry interval is pretty aggressive
  43. 43. why not look at incidents with severe consequences? • scrutiny from stakeholders with a face-saving agenda tends to block deep inquiry • with “medium-severe” incidents the cost of getting details/descriptions of people’s perspectives is low relative to the potential gain • “Goldilocks” incidents are the ideal
  44. 44. some (contextual) sources • esoteric knowledge and expertise in the organization • flexible and dynamic staffing for novel situations • authority that is expected to migrate across roles • a “constant sense of unease” that drives explorations of “normal” work • capture and dissemination of near-misses
  45. 45. Summary! • Resilience is something a system does, not what a system has. • Creating and sustaining adaptive capacity in an organization while being unable to justify doing it specifically = resilient action. • How people (the flexible elements of The System™) cope with surprise is the path to finding sources of resilience
  46. 46. Resilience is the story of the outage that didn’t happen.
  47. 47. Thank You! @allspaw • Resilience Is A Verb (Woods, 2018) http://bit.ly/ResilienceIsAVerb • Stella Report http://stella.report • https://www.adaptivecapacitylabs.com/blog @AdaptiveCLabs • SRE Cognitive Work (chapter in Seeking SRE, O’Reilly Media) http://bit.ly/SRECognitiveWork • How Complex Systems Fail (Cook, 1998) http://bit.ly/ComplexSystemsFailure
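
Slides 40 to 43 describe a selection heuristic: look for incidents with a high degree of surprise whose consequences were not severe. The following is a minimal Python sketch of one way to shortlist such "Goldilocks" incidents. It is not from the talk. The `Incident` record, the 1 to 5 severity scale, the thresholds, and the marker phrases (loosely modeled on the chat excerpts in slides 41 and 42) are all assumptions made for illustration. Matching phrases is a crude stand-in for actually reading the transcripts closely.

```python
from dataclasses import dataclass

# Hypothetical marker phrases, loosely modeled on the chat excerpts in
# slides 41 and 42. A real review would read the transcript, not grep it.
SURPRISE_MARKERS = ("wtf", "no idea", "terrifying", "i thought", "confused")


@dataclass
class Incident:
    name: str
    severity: int          # assumed scale: 1 (minor) to 5 (severe)
    chat_lines: list[str]  # raw chat messages captured during the incident


def surprise_score(incident: Incident) -> int:
    """Count chat lines that contain at least one surprise marker."""
    return sum(
        any(marker in line.lower() for marker in SURPRISE_MARKERS)
        for line in incident.chat_lines
    )


def goldilocks_candidates(incidents, max_severity=3, min_surprise=1):
    """Keep surprising incidents that were not severe, most surprising first."""
    picked = [
        incident
        for incident in incidents
        if incident.severity <= max_severity
        and surprise_score(incident) >= min_surprise
    ]
    return sorted(picked, key=surprise_score, reverse=True)


if __name__ == "__main__":
    sample = [
        Incident("cache-eviction", 2, ["wtf happened here",
                                       "I have no idea what is going on"]),
        Incident("full-outage", 5, ["well that's terrifying"]),
        Incident("routine-deploy", 1, ["deploy finished"]),
    ]
    for incident in goldilocks_candidates(sample):
        print(incident.name, surprise_score(incident))
```

The output is only a shortlist of incidents worth a closer look. The work the slides describe, understanding what kept each incident from being worse, still has to be done with the people involved.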
