
Incident Coordination Workshop


by Eric Sigler, Head of DevOps, PagerDuty

  1. Incident Response & Coordination. Eric Sigler (@esigler), Head of DevOps, PagerDuty
  2. Everyone can improve their Incident Response process
  3. Take the time to clearly define your process today
  4. Why should organizations invest time improving it?
  5. Puppet / DORA “State of DevOps 2016 Report”
  6. What is Incident Response?
  7. Incident Response “Outer Loop”: Prepare, Execute, Improve
  8. Prepare, Execute, Improve
  9. Prepare: Monitoring, Alerting, Process, Practice
  10. Prepare: Monitoring, Alerting, Process, Practice
  11. “If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it.”
  12. Don’t forget the business metrics
  13. Prepare: Monitoring, Alerting, Process, Practice
  14. Setting up an alarm in AWS CloudWatch
  21. “I don’t want to get woken up at 3AM.”
  22. Scope down your alerts!
  23. Make it Immediate
  24. A problem in Production at 3AM? I’m there with bells on. A problem in Staging at 3AM? Maybe less so.
  25. Make it Human
  26. Humans are terrible shell script interpreters, especially at 3AM.
  27. Make it Actionable
  28. The “everything’s OK” alarm.
  29. Alerts should be: Immediately Human Actionable
  30. … so relate them to the business.
  31. “We now have more time to focus on other proje Connie-Lynne Villani Senior Manager
  32. Configure AWS CloudWatch to integrate with PagerDuty
  33. Grouping Alerts
  38. AWS Incidents in PagerDuty
  39. AWS Incidents in PagerDuty
  40. Prepare: Monitoring, Alerting, Process, Practice
  41. “Like getting lawn care advice from the superintendent of Augusta National”
  42. Know Your Role(s)…
  43. Incident Commander
  44. Consider a volunteer IC schedule
  45. Deputy Incident Commander
  46. Scribe
  47. Subject Matter Experts
  48. What criteria should you use to launch an Incident Response?
  49. Post incident criteria widely. Don’t litigate during a call.
  50. Prepare: Monitoring, Alerting, Process, Practice
  51. Practice your Incident Response plan beforehand
  52. Consider injecting failure, and testing “all of the above”.
  53. Don’t forget to include the rest of the business in your Incident Coordination
  54. Prepare, Execute, Improve
  55. Incident Response “Inner Loop”: Assess & Triage, Resolve or Remediate, Learn & Review
  56. Assess & Triage, Resolve or Remediate, Learn & Review
  57. Elect a leader (IC) at the beginning of the call
  58. How to give a status update
  59. Assess & Triage, Resolve or Remediate, Learn & Review
  60. Delegating tasks on a call
  61. Don’t forget to check back in.
  62. Have a clear mechanism for making decisions.
  63. “IC, I think we should do X” “The proposed action is X, is there any strong objection?”
  64. Dealing with communications “challenges”
  65. Humor is best in context.
  66. DT5: Roger that GND: Delta Tug 5, you can go right on bravo DT5: Right on bravo, taxi. (…): Testing, testing. 1-2-3-4. GND: Well, you can count to 4. It’s a step in the right direction. Find another frequency to test on now. (…): Sorry
  67. Assess & Triage, Resolve or Remediate, Learn & Review
  68. Capture everything, and call out what’s important now vs. later.
  69. Decreasing the scope of a call
  70. Capture everything for the postmortem / learning review.
  71. Prepare, Execute, Improve
  72. “You can’t fire your way to reliability.” Ensure your postmortems are blameless
  73. Beware of: Counterfactual Reasoning, Normative Language, Mechanistic Reasoning
  74. Maintain every postmortem in a collection / archive.
  75. Review your Incident Response process
  77. “We’ve handed out responsibility for handling alerts to the teams that know the most about the service. They’re the people who can generally fix things fastest.” Sam Eaton, Vice President of Engineering
  78. FD: “OK, why don’t, you gotta pass the data for the crew checklist anyway onboard, d MC: “Right” FD: “Don’tcha got a page update? Well why don't we read it up to them and that'll se MC: “Alright.” FD: “Both that mattered as well as what page you want it in the checklist?” MC: “OK.”
  79. TELMU: "Flight, TELMU.” FD: "Go TELMU.” TELMU: "We show the LEM overhead hatch is closed, and the heater current looks n FD: "OK." GUIDE: "Flight, Guidance." FD: "Go Guidance" GUIDE: "We've had a hardware restart, I don't know what it was."
  80. FD: "GNC, you wanna look at it? See if you've seen a problem" Lovell: "Houston, we've had a problem ..." FD: "Rog, we're copying it CAPCOM, we see a hardware restart" Lovell: "... Main B Bus undervolt" FD: "You see an AC bus undervolt there guidance, er, ah, EECOM?" EECOM: "Negative flight" FD: "I believe the crew reported it." ???: "We got a main B undervolt"
  81. EECOM: "OK flight we've got some instrumentation issues ... let me add em up” FD: "Rog" CAPCOM: "OK stand by 13 we're looking at it" EECOM: "We may have had an instrumentation problem flight" FD: "Rog" INCO: "Flight, INCO” FD: "Go INCO” INCO: "We switched to wide beam about the time he had that problem"
  82. Haise: "...the voltage is looking good. And we had a pretty large bang associated with FD: "OK" CAPCOM: "Roger, Fred." FD: "INCO, you said you went to wide beam with that?" INCO: "Yes" FD: "Let's see if we can correlate those times get the time when you went to wide-beam INCO: "OK"
  84. Challenge: Audit your alarms this week. Are they all immediately human actionable?
  85. Puppet / DORA “State of DevOps 2016 Report”
  86. You'll sleep better at night
  87. See You At AWS Summit San Francisco! April 18-19
  88. LEARN MORE
  89. Eric Sigler, Head of DevOps, PagerDuty. Incident Response & Coordination
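
The closing challenge in the deck (audit your alarms: are they all immediately human actionable?) could be sketched as a small checklist in code. This is only an illustration and not part of the talk: the `Alert` fields, the example alert names, and the rule that only production problems count as immediate are assumptions made for this sketch, loosely following the "Make it Immediate / Make it Human / Make it Actionable" slides.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical description of one alert rule, filled in by hand during an audit."""
    name: str
    environment: str          # e.g. "production" or "staging"
    needs_human: bool         # False if a script could respond without human judgment
    has_concrete_action: bool  # True if the responder has a specific step to take


def audit(alert: Alert) -> list[str]:
    """Return the reasons this alert should not page a person; an empty list means it passes."""
    problems = []
    # Immediate: assumed here to mean "production only" (the 3AM staging example).
    if alert.environment != "production":
        problems.append("not immediate: non-production problems can wait")
    # Human: if no judgment is needed, automate the response instead of paging.
    if not alert.needs_human:
        problems.append("not human: a script could handle this")
    # Actionable: an alert with nothing to do (the "everything's OK" alarm) is noise.
    if not alert.has_concrete_action:
        problems.append("not actionable: the responder has nothing to do")
    return problems


if __name__ == "__main__":
    # Example alert definitions are invented for illustration.
    alerts = [
        Alert("checkout-errors-high", "production", True, True),
        Alert("staging-disk-full", "staging", True, True),
        Alert("everything-ok", "production", False, False),
    ]
    for alert in alerts:
        problems = audit(alert)
        status = "keep paging" if not problems else "; ".join(problems)
        print(f"{alert.name}: {status}")
```

Alerts that fail any check are candidates for demotion to a ticket, a dashboard, or automation rather than a page.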