Mean Time to Sleep: Quantifying the On-Call Experience

Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.


Mean Time to Sleep: Quantifying the On-Call Experience

  1. @lozzd • @ryan_frantz Mean Time to Sleep Quantifying the on-call experience
  2. Laurie Denness @lozzd Ryan Frantz @ryan_frantz
  3. @lozzd • @ryan_frantz Who is in an on-call rotation?
  4. @lozzd • @ryan_frantz Who is on call right now?
  5. @lozzd • @ryan_frantz Who feels like on-call sucks?
  6. Welcome. How is on call?
  7. @lozzd • @ryan_frantz Let’s help our people sleep
  8. @lozzd • @ryan_frantz Make on-call more bearable
  9. @lozzd • @ryan_frantz Incremental Changes
  10. @lozzd • @ryan_frantz Email to Acknowledge
  11. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  12. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  13. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  14. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  15. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  16. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  17. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
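
A minimal sketch of the "reply 'ack'" flow described in the slides above: take an incoming reply email and, if the body starts with "ack", relay the sender, subject, and any extra context to IRC. The mail delivery path and the relay_to_irc() helper are assumptions for illustration; the deck doesn't show Etsy's actual plumbing.

```python
"""Sketch: turn an email reply beginning with "ack" into an IRC line."""
import email


def relay_to_irc(message):
    # Hypothetical stand-in for an IRC bot/webhook; here we just print.
    print("[#ops] %s" % message)


def handle_reply(raw_email):
    msg = email.message_from_string(raw_email)
    sender = msg.get("From", "unknown")
    subject = msg.get("Subject", "")
    body = (msg.get_payload(decode=True) or b"").decode("utf-8", "replace")
    # Use the first non-empty line of the plain-text body as the ack text.
    first_line = next((l.strip() for l in body.splitlines() if l.strip()), "")
    if first_line.lower().startswith("ack"):
        # "ack" plus context shows up in IRC, e.g. "ack clearing old logs now".
        relay_to_irc("%s acknowledged %r: %s" % (sender, subject, first_line))


if __name__ == "__main__":
    sample = (
        "From: laurie@example.com\n"
        "Subject: Re: PROBLEM: disk_space on web12\n"
        "\n"
        "ack clearing old logs now\n"
    )
    handle_reply(sample)
```
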
  18. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night?
  19. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night? • Do you care if one of your web/hadoop/X boxes dies in the middle of the night?
  20. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night? • Do you care if one of your web/hadoop/X boxes dies in the middle of the night? • Can it wait until the morning?
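
One way the email-only idea can be expressed, sketched here as a small generator of Nagios object config. The generic-service template and ops-email-only contact group are assumed names, and the "email only" behaviour ultimately comes from how those contacts are defined (email notification commands, no pager).

```python
"""Sketch: emit a Nagios service definition for a check that should only
ever email (never page). Directive names follow standard Nagios object
config; the template and contact group names are assumptions."""

EMAIL_ONLY_SERVICE = """\
define service {{
    use                  generic-service     ; assumed base template
    host_name            {host}
    service_description  {desc}
    check_command        {command}
    contact_groups       ops-email-only      ; contacts that only receive email
    notification_options w,c,r
}}
"""


def email_only_service(host, desc, command):
    return EMAIL_ONLY_SERVICE.format(host=host, desc=desc, command=command)


if __name__ == "__main__":
    # RAID degradation can wait until morning, so it gets the email-only group.
    print(email_only_service("web12", "RAID status", "check_raid"))
```
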
  21. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  22. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  23. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  24. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  25. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients
  26. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients
  27. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients
  28. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes
  29. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes • Link to Runbook
  30. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes • Link to Runbook
  31. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes • Link to runbook
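
A sketch of what "added context" can look like in practice (this is what nagios-herald does for real): pull extra Nagios macros into the notification so the person being paged sees previous state, duration in state, other recipients, notes, and a runbook link. It assumes Nagios is exporting macros as NAGIOS_* environment variables (enable_environment_macros); the exact variable names should be checked against your Nagios version.

```python
"""Sketch: build a notification body enriched with Nagios macro context."""
import os


def macro(name, default="n/a"):
    # Nagios exposes $SERVICESTATE$ etc. as NAGIOS_SERVICESTATE when
    # environment macros are enabled.
    return os.environ.get("NAGIOS_" + name, default)


def build_notification():
    lines = [
        "%s / %s is %s" % (macro("HOSTNAME", "unknown-host"),
                           macro("SERVICEDESC", "unknown-service"),
                           macro("SERVICESTATE", "UNKNOWN")),
        "Previous state : %s" % macro("LASTSERVICESTATE"),
        "In this state  : %s" % macro("SERVICEDURATION"),
        "Also notified  : %s" % macro("NOTIFICATIONRECIPIENTS"),
        "Notes          : %s" % macro("SERVICENOTES"),
        "Runbook        : %s" % macro("SERVICENOTESURL"),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_notification())
```
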
  32. @lozzd • @ryan_frantz Alert Storms • Reduce noise when 200 things go wrong by aggregating
  33. @lozzd • @ryan_frantz Alert Storms • Reduce noise when 200 things go wrong by aggregating • Trigger alert percentage of pool over threshold
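
A sketch of the pool-percentage idea: instead of 200 individual host alerts in a storm, a single check fires when more than a threshold fraction of a pool is unhealthy. get_pool_health() is a placeholder; a real check would ask Nagios, Graphite, or a load balancer for member state. Exit codes follow the standard Nagios plugin convention.

```python
"""Sketch: one aggregated check over a pool instead of per-host alerts."""
import sys


def get_pool_health(pool):
    # Placeholder data: (member, healthy?) pairs for the named pool.
    return [("web%02d" % i, i % 10 != 0) for i in range(1, 41)]


def check_pool(pool, warn_pct=10.0, crit_pct=25.0):
    members = get_pool_health(pool)
    down = [name for name, healthy in members if not healthy]
    pct = 100.0 * len(down) / len(members)
    status, code = "OK", 0  # Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL
    if pct >= crit_pct:
        status, code = "CRITICAL", 2
    elif pct >= warn_pct:
        status, code = "WARNING", 1
    print("%s: %.1f%% of %s pool unhealthy (%d/%d): %s"
          % (status, pct, pool, len(down), len(members), ", ".join(down) or "none"))
    return code


if __name__ == "__main__":
    sys.exit(check_pool("web"))
```
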
  34. @lozzd • @ryan_frantz Low friction downtime • IRC commands to downtime hosts/sets of hosts
  35. @lozzd • @ryan_frantz Low friction downtime • IRC commands to downtime hosts/sets of hosts
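
A sketch of low-friction downtime: an IRC bot command like "!downtime web12 120 rebuilding raid" can be translated into a SCHEDULE_HOST_DOWNTIME external command. The external-command syntax is standard Nagios; the command-file path and the bot parsing are assumptions.

```python
"""Sketch: schedule Nagios host downtime by appending to the command file."""
import time

NAGIOS_CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"  # adjust for your install


def schedule_host_downtime(host, minutes, author, comment,
                           cmd_file=NAGIOS_CMD_FILE):
    now = int(time.time())
    end = now + minutes * 60
    # [timestamp] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger;duration;author;comment
    line = "[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s\n" % (
        now, host, now, end, minutes * 60, author, comment)
    with open(cmd_file, "a") as f:
        f.write(line)
    return line


if __name__ == "__main__":
    # What the bot would do after parsing "!downtime web12 120 rebuilding raid"
    # (writing to /tmp here so the sketch runs outside a Nagios box):
    print(schedule_host_downtime("web12", 120, "lozzd", "rebuilding raid",
                                 cmd_file="/tmp/nagios.cmd").strip())
```
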
  36. @lozzd • @ryan_frantz Downtime Reminders • Help prevent false notifications
  37. @lozzd • @ryan_frantz Downtime Reminders • Help prevent false notifications
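
A sketch of the downtime-reminder idea: warn when a scheduled downtime is about to expire so a half-fixed host doesn't start paging again by surprise. This assumes the default Nagios status.dat layout (hostdowntime { ... } blocks with host_name= and end_time= fields); field names can vary between versions, so treat this as illustrative.

```python
"""Sketch: find downtimes in status.dat that expire within the next hour."""
import time

STATUS_DAT = "/var/cache/nagios3/status.dat"  # adjust for your install


def expiring_downtimes(path=STATUS_DAT, within_seconds=3600):
    expiring, block = [], None
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line.startswith("hostdowntime {") or line.startswith("servicedowntime {"):
                block = {}
            elif block is not None and line == "}":
                end = int(block.get("end_time", 0))
                if 0 < end - time.time() < within_seconds:
                    expiring.append((block.get("host_name", "?"), end))
                block = None
            elif block is not None and "=" in line:
                key, _, value = line.partition("=")
                block[key] = value
    return expiring


if __name__ == "__main__":
    try:
        for host, end in expiring_downtimes():
            print("Reminder: downtime for %s ends at %s"
                  % (host, time.strftime("%H:%M", time.localtime(end))))
    except IOError:
        print("status.dat not found; point STATUS_DAT at your Nagios status file")
```
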
  38. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team
  39. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd)
  40. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd) • Re-running jobs (transient errors)
  41. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd) • Re-running jobs (transient errors) • Duplicate crons (Chef)
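
A sketch of an event handler in the classic pattern from the Nagios docs: Nagios invokes the script with $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$, and the handler only acts once the failure is confirmed (HARD CRITICAL). The systemd restart command is an assumption; swap in whatever your platform uses.

```python
#!/usr/bin/env python
"""Sketch: event handler that restarts nscd on a HARD CRITICAL."""
import subprocess
import sys


def main(state, state_type, attempt):
    # Soft states are still being retried and may recover on their own,
    # so only step in on a confirmed (HARD) CRITICAL.
    if state == "CRITICAL" and state_type == "HARD":
        print("Event handler: restarting nscd")
        return subprocess.call(["systemctl", "restart", "nscd"])
    return 0


if __name__ == "__main__":
    if len(sys.argv) != 4:
        sys.exit("usage: handle_nscd.py <state> <statetype> <attempt>")
    sys.exit(main(sys.argv[1], sys.argv[2], sys.argv[3]))
```
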
  42. @lozzd • @ryan_frantz Incremental Improvements? • Maybe
  43. @lozzd • @ryan_frantz Incremental Improvements? • Maybe • More ideas; hoped they’d stick
  44. @lozzd • @ryan_frantz Incremental Improvements? • Maybe • More ideas; hoped they’d stick • We didn’t know because we didn’t measure
  45. @lozzd • @ryan_frantz Measure Everything • “You can’t manage what you can’t measure.” - Deming (not really)
  46. @lozzd • @ryan_frantz Measure Everything • “You can’t manage what you can’t measure.” - Deming (not really) • But, we weren’t measuring anything
  47. @lozzd • @ryan_frantz What should we measure?
  48. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity)
  49. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not)
  50. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not) • Alert times: Off-hours?
  51. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not) • Alert times: Off-hours? • Noisy hosts/services
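
A sketch of the measurements listed above, computed over a list of alert records. The field names and sample data are illustrative (Opsweekly stores the real thing); "off-hours" is taken here to mean before 08:00 or after 18:00 local time.

```python
"""Sketch: summarize alert volume, actionability, off-hours share, noisiest checks."""
from collections import Counter
from datetime import datetime

ALERTS = [  # sample records
    {"time": datetime(2014, 6, 2, 3, 12), "severity": "critical",
     "host": "web12", "check": "disk_space", "actionable": False},
    {"time": datetime(2014, 6, 2, 14, 5), "severity": "warning",
     "host": "db03", "check": "replication_lag", "actionable": True},
    {"time": datetime(2014, 6, 3, 2, 40), "severity": "critical",
     "host": "web12", "check": "disk_space", "actionable": False},
]


def summarize(alerts):
    by_severity = Counter(a["severity"] for a in alerts)
    noisiest = Counter((a["host"], a["check"]) for a in alerts)
    off_hours = sum(1 for a in alerts if a["time"].hour < 8 or a["time"].hour >= 18)
    actionable = sum(1 for a in alerts if a["actionable"])
    return {
        "total": len(alerts),
        "by_severity": dict(by_severity),
        "off_hours_pct": 100.0 * off_hours / len(alerts),
        "actionable_pct": 100.0 * actionable / len(alerts),
        "noisiest": noisiest.most_common(3),
    }


if __name__ == "__main__":
    for key, value in summarize(ALERTS).items():
        print("%-15s %s" % (key, value))
```
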
  52. @lozzd • @ryan_frantz Opsweekly
  53. @lozzd • @ryan_frantz We have data.
  54. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports
  55. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing
  56. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing 3. Aggregate alerts
  57. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing 3. Aggregate alerts 4. Profit
  58. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch)
  59. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch) • Standard Nagios feature
  60. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch) • Standard Nagios feature • Computers can do this for us!
  61. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com
  62. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info
  63. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info • Put switch info into Chef using ohai
  64. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info • Put switch info into Chef using ohai • Create Nagios host configs based on data
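
A sketch of the last step in the chain above: once each node's uplink switch is known (LLDP → ohai → Chef in the deck), emit Nagios host definitions whose parents directive points at that switch, so a dead switch suppresses the flood of downstream host alerts. The node/switch mapping and the generic-host template name are assumed; in reality the data would come from a Chef search.

```python
"""Sketch: render Nagios host configs with `parents` set from switch data."""

HOST_TEMPLATE = """\
define host {{
    use        generic-host    ; assumed base template
    host_name  {name}
    address    {address}
    parents    {switch}
}}
"""

# Stand-in for data harvested from each node's LLDP neighbour via ohai/Chef.
NODES = [
    {"name": "web12", "address": "10.1.2.12", "switch": "sw-aisle3-top"},
    {"name": "web13", "address": "10.1.2.13", "switch": "sw-aisle3-top"},
]


def render_hosts(nodes):
    return "\n".join(HOST_TEMPLATE.format(**n) for n in nodes)


if __name__ == "__main__":
    print(render_hosts(NODES))
```
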
  65. @lozzd • @ryan_frantz Service Dependencies • Hundreds of Graphite-sourced checks
  66. @lozzd • @ryan_frantz Service Dependencies • Hundreds of Graphite-sourced checks • Created new template that sets a servicegroup that depends on the Graphite service.
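
A sketch of the Graphite dependency just described: every Graphite-sourced check joins a servicegroup, and a single servicedependency makes that whole group depend on the Graphite check itself, so a Graphite outage doesn't page for hundreds of downstream checks. Directive names follow the Nagios 3+ object docs (dependent_servicegroup_name); host, service, and group names here are assumptions.

```python
"""Sketch: one servicedependency covering all Graphite-sourced checks."""

DEPENDENCY = """\
define servicedependency {
    host_name                      graphite01        ; host running Graphite
    service_description            graphite          ; the Graphite check itself
    dependent_servicegroup_name    graphite-sourced  ; group set by the check template
    notification_failure_criteria  c,u               ; stay quiet while Graphite is critical/unknown
    execution_failure_criteria     c,u
}
"""

if __name__ == "__main__":
    print(DEPENDENCY)
```
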
  67. @lozzd • @ryan_frantz Keep on analyzing • It’s okay to just identify and delete alerts that don’t mean anything!
  68. @lozzd • @ryan_frantz Keep on analyzing • It’s okay to just identify and delete alerts that don’t mean anything! • Or move them to email only
  69. @lozzd • @ryan_frantz More Quantification!
  70. @lozzd • @ryan_frantz Reviewing the Year • Use reports
  71. @lozzd • @ryan_frantz Reviewing the Year • Use reports • Use search
  72. @lozzd • @ryan_frantz Reviewing the Year • Use reports • Use search • Identify noisiest alerts
  73. @lozzd • @ryan_frantz Reviewing the Year • (yearly report screenshots)
  74. @lozzd • @ryan_frantz Nagios Hack Day/Week • Great time to look at this data and make improvements
  75. @lozzd • @ryan_frantz Nagios Hack Day/Week • Great time to look at this data and make improvements • If disk space is the worst offender, can we rethink that?
  76. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation
  77. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation • A whole subset of alerts that don’t go to Ops
  78. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation • A whole subset of alerts that don’t go to Ops • More teams starting this but Search Team is at 100%
  79. @lozzd • @ryan_frantz Sleep Tracking
  80. @lozzd • @ryan_frantz
  81. “Track your life!” - @ph
  82. @lozzd • @ryan_frantz
  83. @lozzd • @ryan_frantz
  84. @lozzd • @ryan_frantz
  85. @lozzd • @ryan_frantz
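
A sketch of the sleep-tracking pipeline (what the jawboneup_to_graphite project linked at the end does for real): push last night's sleep duration into Graphite over the standard plaintext protocol so it can be graphed alongside alert volume. The metric name, Graphite host, and the hard-coded hours value are placeholders; the real value would come from the Jawbone UP API.

```python
"""Sketch: send a sleep-duration metric to Graphite's plaintext listener."""
import socket
import time

GRAPHITE_HOST = "graphite.example.com"
GRAPHITE_PORT = 2003  # Graphite's plaintext (line) protocol port


def send_metric(path, value, timestamp=None, host=GRAPHITE_HOST, port=GRAPHITE_PORT):
    timestamp = int(timestamp or time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()
    return line


if __name__ == "__main__":
    try:
        # 6.2 hours of sleep for one on-call engineer (placeholder value).
        print(send_metric("sleep.lozzd.hours", 6.2).strip())
    except OSError as err:
        print("could not reach Graphite (%s); metric line was not sent" % err)
```
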
  86. @lozzd • @ryan_frantz Did it work?
  87. @lozzd • @ryan_frantz Did it work?
  88. @lozzd • @ryan_frantz Did it work? • Yes.
  89. @lozzd • @ryan_frantz Did it work? • Yes.
  90. @lozzd • @ryan_frantz Did it work? • Yes. • Signal to noise ratio is much better
  91. @lozzd • @ryan_frantz Did it work? • Yes.
  92. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that
  93. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that • Adding alerts all the time means new “annoying” things
  94. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that • Adding alerts all the time means new “annoying” things • Keep monitoring
  95. @lozzd • @ryan_frantz What’s next?
  96. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep
  97. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep • But not the effect on the person when they come to work the next day
  98. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep • But not the effect on the person when they come to work the next day • How do we measure the impact of sleep loss/deprivation?
  99. @lozzd • @ryan_frantz The Effect of Sleep • But not the effect on the person when they come to work the next day • How do we measure the impact of sleep loss/deprivation? • Subjective: Pittsburgh Sleepiness Scale • Objective: Psychomotor vigilance task (PVT) to measure alertness
  100. @lozzd • @ryan_frantz Beyond Opsweekly • Employee wellness program
  101. @lozzd • @ryan_frantz Beyond Opsweekly • Employee wellness program • Security have started using past sleep data to check for weird logins to systems
  102. @lozzd • @ryan_frantz More context: nagios-herald
  103. @lozzd • @ryan_frantz More reports • We have a bunch of data, we can build better reports, drill down to analyze alerting trends
  104. @lozzd • @ryan_frantz More reports • We have a bunch of data, we can build better reports, drill down to analyze alerting trends • Can we attribute particular actions to reduced noise volume? • Aggregate alerts • Non-downtimed alerts
  105. @lozzd • @ryan_frantz Thanks
  106. @lozzd • @ryan_frantz Etsy Ops Team
  107. @lozzd • @ryan_frantz SewMona
  108. @lozzd • @ryan_frantz Open Source/Links • http://ryanfrantz.com/mtts • https://github.com/etsy/opsweekly • https://github.com/etsy/nagios-herald • https://github.com/jonlives/jawboneup_to_graphite • http://codeascraft.com
  109. @lozzd • @ryan_frantz Questions?
  110. @lozzd • @ryan_frantz Mean Time to Sleep Quantifying the on-call experience
  111. @lozzd • @ryan_frantz Mean Time to Sleep Quantifying the on-call experience
