Mean Time to Sleep: Quantifying the On-Call Experience

Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.

Transcript

  • 1. @lozzd • @ryan_frantz Mean Time to Sleep Quantifying the on-call experience
  • 2. Laurie Denness @lozzd Ryan Frantz @ryan_frantz
  • 3. @lozzd • @ryan_frantz Who is in an on-call rotation?
  • 4. @lozzd • @ryan_frantz Who is on call right now?
  • 5. @lozzd • @ryan_frantz Who feels like on-call sucks?
  • 6. Welcome. How is on call?
  • 7. @lozzd • @ryan_frantz Let’s help our people sleep
  • 8. @lozzd • @ryan_frantz Make on-call more bearable
  • 9. @lozzd • @ryan_frantz Incremental Changes
  • 10. @lozzd • @ryan_frantz Email to Acknowledge
  • 11. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
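One way the "reply to acknowledge" flow could be wired up is a small mail hook that turns an "ack" reply into a Nagios acknowledgement via the external command file. The sketch below is illustrative, not Etsy's implementation: the command-file path, the subject format, and the IRC relay step are all assumptions.

    #!/usr/bin/env python
    """Hypothetical mail hook: acknowledge a Nagios alert when someone replies "ack".

    Assumptions (not from the deck): the reply arrives on stdin, the original alert
    subject looks like "PROBLEM: host/service is STATE", and Nagios's external
    command file lives at /var/lib/nagios/rw/nagios.cmd.
    """
    import email
    import re
    import sys
    import time

    CMD_FILE = "/var/lib/nagios/rw/nagios.cmd"  # assumed path

    def main():
        msg = email.message_from_file(sys.stdin)
        author = msg.get("From", "unknown")
        m = re.search(r"([\w.-]+)/(.+?) is", msg.get("Subject", ""))
        body = msg.get_payload(decode=True) or b""
        text = body.decode("utf-8", "replace").strip()
        if not m or not text.lower().startswith("ack"):
            return
        host, service = m.group(1), m.group(2)
        comment = text[3:].strip() or "acknowledged via email"
        now = int(time.time())
        # Nagios external command: sticky=1, notify=1, persistent=0
        line = "[%d] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;1;1;0;%s;%s\n" % (
            now, host, service, author, comment)
        with open(CMD_FILE, "w") as f:
            f.write(line)
        # A chat bot could then relay "<author> acked host/service: comment" to IRC (not shown).

    if __name__ == "__main__":
        main()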
  • 18. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night?
  • 19. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night? • Do you care if one of your web/hadoop/X boxes dies in the middle of the night?
  • 20. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night? • Do you care if one of your web/hadoop/X boxes dies in the middle of the night? • Can it wait until the morning?
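Routing "can wait until morning" checks to email only is plain Nagios configuration. A minimal sketch, assuming the stock sample notification commands and made-up host and contact names:

    # Hedged sketch: route a non-urgent check to an email-only contact so it
    # never pages the on-call person overnight.

    define contact {
        contact_name                    ops-email-only
        alias                           Ops email-only notifications
        email                           ops-alerts@example.com
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u,r
        service_notification_options    w,c,r
        host_notification_commands      notify-host-by-email
        service_notification_commands   notify-service-by-email
    }

    define service {
        use                     generic-service
        host_name               web01
        service_description     RAID Status
        check_command           check_raid
        contacts                ops-email-only   ; no pager contact attached
    }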
  • 21. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  • 25. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients
  • 28. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes
  • 29. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes • Link to Runbook
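Etsy's real implementation of this added context is nagios-herald (linked at the end of the deck). Without reproducing its internals, here is a minimal sketch of the idea using standard Nagios notification macros, assuming macros are exported as environment variables (enable_environment_macros=1) and that something else actually delivers the rendered message:

    #!/usr/bin/env python
    """Sketch of the "added context" idea: pull previous state, duration,
    recipients, notes, and the runbook link from Nagios macros into the alert body.
    """
    import os

    def render_alert():
        m = os.environ.get
        lines = [
            "%s on %s is %s" % (m("NAGIOS_SERVICEDESC", "?"),
                                m("NAGIOS_HOSTNAME", "?"),
                                m("NAGIOS_SERVICESTATE", "?")),
            "Previous state: %s" % m("NAGIOS_LASTSERVICESTATE", "unknown"),
            "In this state for: %s" % m("NAGIOS_SERVICEDURATION", "unknown"),
            "Also notified: %s" % m("NAGIOS_NOTIFICATIONRECIPIENTS", "unknown"),
            "Notes: %s" % m("NAGIOS_SERVICENOTES", "none"),
            "Runbook: %s" % m("NAGIOS_SERVICENOTESURL", "none"),
        ]
        return "\n".join(lines)

    if __name__ == "__main__":
        print(render_alert())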
  • 32. @lozzd • @ryan_frantz Alert Storms • Reduce noise when 200 things go wrong by aggregating
  • 33. @lozzd • @ryan_frantz Alert Storms • Reduce noise when 200 things go wrong by aggregating • Trigger alert percentage of pool over threshold
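The "percentage of pool over threshold" idea is what Nagios's check_cluster plugin does; a pool-level check in that spirit, with illustrative arguments, might look like this:

    #!/usr/bin/env python
    """Sketch of a pool-level check: alert on the fraction of a pool that is
    failing instead of paging once per host.

    Usage (arguments are illustrative):
        check_pool.py --total 200 --failed 12 --warn 5 --crit 10
    where --warn/--crit are percentages of the pool.
    """
    import argparse
    import sys

    OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes

    def main():
        p = argparse.ArgumentParser()
        p.add_argument("--total", type=int, required=True, help="members in the pool")
        p.add_argument("--failed", type=int, required=True, help="members currently failing")
        p.add_argument("--warn", type=float, default=5.0, help="warning threshold, percent")
        p.add_argument("--crit", type=float, default=10.0, help="critical threshold, percent")
        args = p.parse_args()

        pct = 100.0 * args.failed / max(args.total, 1)
        msg = "%d of %d members failing (%.1f%%)" % (args.failed, args.total, pct)
        if pct >= args.crit:
            print("CRITICAL: " + msg)
            sys.exit(CRITICAL)
        if pct >= args.warn:
            print("WARNING: " + msg)
            sys.exit(WARNING)
        print("OK: " + msg)
        sys.exit(OK)

    if __name__ == "__main__":
        main()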
  • 34. @lozzd • @ryan_frantz Low friction downtime • IRC commands to downtime hosts/sets of hosts
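Low-friction downtime from IRC can boil down to the bot writing Nagios's scheduling commands to the external command file. A minimal sketch, assuming a command-file path and that a chat command such as "!downtime web01 2h kernel upgrade" has already been parsed elsewhere:

    #!/usr/bin/env python
    """Sketch of the helper an IRC bot could call to schedule downtime."""
    import time

    CMD_FILE = "/var/lib/nagios/rw/nagios.cmd"  # assumed path

    def schedule_downtime(host, hours, author, comment, include_services=True):
        now = int(time.time())
        end = now + int(hours * 3600)
        # fixed=1, trigger_id=0; duration only matters for flexible downtime
        args = "%s;%d;%d;1;0;%d;%s;%s" % (host, now, end, end - now, author, comment)
        lines = ["[%d] SCHEDULE_HOST_DOWNTIME;%s\n" % (now, args)]
        if include_services:
            lines.append("[%d] SCHEDULE_HOST_SVC_DOWNTIME;%s\n" % (now, args))
        with open(CMD_FILE, "w") as f:
            f.writelines(lines)

    if __name__ == "__main__":
        schedule_downtime("web01", 2, "laurie", "kernel upgrade")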
  • 36. @lozzd • @ryan_frantz Downtime Reminders • Help prevent false notifications
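One way to build downtime reminders, assuming MK Livestatus is available (socket path and the delivery step are assumptions), is to poll for downtimes that expire soon so the owner can extend or clean them up before false pages start:

    #!/usr/bin/env python
    """Sketch of a downtime reminder using MK Livestatus."""
    import socket
    import time

    LIVESTATUS_SOCKET = "/var/lib/nagios/rw/live"  # assumed path
    QUERY = "GET downtimes\nColumns: host_name service_description end_time author\n\n"

    def expiring_downtimes(within_seconds=3600):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(LIVESTATUS_SOCKET)
        s.sendall(QUERY.encode())
        s.shutdown(socket.SHUT_WR)
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
        s.close()
        soon = []
        now = time.time()
        for line in data.decode().splitlines():
            if not line:
                continue
            host, service, end_time, author = line.split(";")[:4]
            if 0 < float(end_time) - now < within_seconds:
                soon.append((host, service or "(host)", author))
        return soon

    if __name__ == "__main__":
        for host, service, author in expiring_downtimes():
            print("Reminder: downtime on %s/%s (set by %s) expires soon" % (host, service, author))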
  • 38. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team
  • 39. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd)
  • 40. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd) • Re-running jobs (transient errors)
  • 41. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd) • Re-running jobs (transient errors) • Duplicate crons (Chef)
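The nscd case follows the classic Nagios event-handler pattern: a service's event_handler command passes $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ to a script that only acts on hard failures. The restart command and attempt threshold below are illustrative, not Etsy's exact setup:

    #!/usr/bin/env python
    """Sketch of a Nagios event handler that restarts nscd."""
    import subprocess
    import sys

    def main():
        state, state_type, attempt = sys.argv[1], sys.argv[2], int(sys.argv[3])
        # Only act on hard failures (or the final soft attempt) so we don't flap.
        if state != "CRITICAL":
            return
        if state_type == "HARD" or attempt >= 3:
            subprocess.call(["service", "nscd", "restart"])

    if __name__ == "__main__":
        main()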
  • 42. @lozzd • @ryan_frantz Incremental Improvements? • Maybe
  • 43. @lozzd • @ryan_frantz Incremental Improvements? • Maybe • More ideas; hoped they’d stick
  • 44. @lozzd • @ryan_frantz Incremental Improvements? • Maybe • More ideas; hoped they’d stick • We didn’t know because we didn’t measure
  • 45. @lozzd • @ryan_frantz Measure Everything • “You can’t manage what you can’t measure.” - Deming (not really)
  • 46. @lozzd • @ryan_frantz Measure Everything • “You can’t manage what you can’t measure.” - Deming (not really) • But, we weren’t measuring anything
  • 47. @lozzd • @ryan_frantz What should we measure?
  • 48. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity)
  • 49. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not)
  • 50. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not) • Alert times: Off-hours?
  • 51. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not) • Alert times: Off-hours? • Noisy hosts/services
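The measurements in this list are what Opsweekly (introduced on the next slide) surfaces. This is not Opsweekly itself, just a sketch of the kinds of counts involved, over a hypothetical tab-separated notification log of epoch_seconds, severity, host, service:

    #!/usr/bin/env python
    """Tally alert volume, off-hours alerts, and the noisiest checks from a log."""
    from collections import Counter
    from datetime import datetime
    import sys

    def summarize(path):
        by_severity = Counter()
        by_service = Counter()
        off_hours = 0
        total = 0
        with open(path) as f:
            for line in f:
                ts, severity, host, service = line.rstrip("\n").split("\t")
                hour = datetime.fromtimestamp(float(ts)).hour
                total += 1
                by_severity[severity] += 1
                by_service["%s/%s" % (host, service)] += 1
                if hour < 9 or hour >= 18:  # crude "off-hours"; weekends ignored for brevity
                    off_hours += 1
        print("total alerts: %d (off-hours: %d)" % (total, off_hours))
        print("by severity: %s" % dict(by_severity))
        print("noisiest checks: %s" % by_service.most_common(5))

    if __name__ == "__main__":
        summarize(sys.argv[1])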
  • 52. @lozzd • @ryan_frantz Opsweekly
  • 53. @lozzd • @ryan_frantz We have data.
  • 54. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports
  • 55. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing
  • 56. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing 3. Aggregate alerts
  • 57. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing 3. Aggregate alerts 4. Profit
  • 58. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch)
  • 59. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch) • Standard Nagios feature
  • 60. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch) • Standard Nagios feature • Computers can do this for us!
  • 61. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com
  • 62. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info
  • 63. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info • Put switch info into Chef using ohai
  • 64. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info • Put switch info into Chef using ohai • Create Nagios host configs based on data
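The end result of that pipeline is an ordinary Nagios host definition carrying a parents directive; with parents set, Nagios treats hosts behind a dead switch as UNREACHABLE rather than DOWN and suppresses the flood of pages. A hedged sketch of what the generated config might look like (host and switch names are made up; the switch value would come from the LLDP data stored in Chef by the ohai plugin):

    define host {
        use         generic-host
        host_name   web01
        address     10.0.12.34
        parents     switch-a12      ; from node['lldp'] data in Chef
    }

    define host {
        use         generic-host
        host_name   switch-a12
        address     10.0.12.1
    }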
  • 65. @lozzd • @ryan_frantz Service Dependencies • Hundreds of Graphite-sourced checks
  • 66. @lozzd • @ryan_frantz Service Dependencies • Hundreds of Graphite-sourced checks • Created new template that sets a servicegroup that depends on the Graphite service.
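A hedged sketch of that dependency, with illustrative names: every check in the Graphite-sourced servicegroup depends on the Graphite service itself, so a Graphite outage produces one page instead of hundreds.

    define servicedependency {
        host_name                       graphite01
        service_description             Graphite
        dependent_servicegroup_name     graphite-sourced-checks
        notification_failure_criteria   w,u,c    ; suppress notifications while Graphite is unhealthy
        execution_failure_criteria      n        ; keep running the checks regardless
    }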
  • 67. @lozzd • @ryan_frantz Keep on analyzing • It’s okay to just identify and delete alerts that don’t mean anything!
  • 68. @lozzd • @ryan_frantz Keep on analyzing • It’s okay to just identify and delete alerts that don’t mean anything! • Or move them to email only
  • 69. @lozzd • @ryan_frantz More Quantification!
  • 70. @lozzd • @ryan_frantz Reviewing the Year • Use reports
  • 71. @lozzd • @ryan_frantz Reviewing the Year • Use reports • Use search
  • 72. @lozzd • @ryan_frantz Reviewing the Year • Use reports • Use search • Identify noisiest alerts
  • 73. @lozzd • @ryan_frantz Reviewing the Year YEARLY REPORT SCREENSHOTS
  • 74. @lozzd • @ryan_frantz Nagios Hack Day/Week • A great time to look at this data and make improvements
  • 75. @lozzd • @ryan_frantz Nagios Hack Day/Week • A great time to look at this data and make improvements • If Disk Space is the worst, can we rethink that?
  • 76. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation
  • 77. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation • A whole subset of alerts that don’t go to Ops
  • 78. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation • A whole subset of alerts that don’t go to Ops • More teams starting this but Search Team is at 100%
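In Nagios terms, "outsourcing" a class of alerts is just pointing their checks at another team's contact group. A hedged sketch with made-up names (the contact "search-pager" and the check command are placeholders):

    define contactgroup {
        contactgroup_name   search-oncall
        alias               Search team on-call rotation
        members             search-pager
    }

    define service {
        use                     generic-service
        host_name               search01
        service_description     Solr Query Latency
        check_command           check_solr_latency
        contact_groups          search-oncall    ; pages Search, not Ops
    }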
  • 79. @lozzd • @ryan_frantz Sleep Tracking
  • 80. @lozzd • @ryan_frantz
  • 81. “Track your life!” - @ph
  • 82. @lozzd • @ryan_frantz
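The sleep-tracking slides show graphs of sleep data alongside alert data. The linked jawboneup_to_graphite project does something along these lines; as a generic illustration (metric name, Graphite host, and the hard-coded sample are assumptions), pushing a nightly sleep number into Graphite's plaintext protocol is only a few lines:

    #!/usr/bin/env python
    """Sketch of pushing sleep data into Graphite so it can sit next to alert volume."""
    import socket
    import time

    GRAPHITE_HOST = "graphite.example.com"   # assumed
    GRAPHITE_PORT = 2003                     # Graphite's plaintext protocol port

    def send_metric(path, value, timestamp=None):
        timestamp = int(timestamp or time.time())
        line = "%s %s %d\n" % (path, value, timestamp)
        sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5)
        sock.sendall(line.encode())
        sock.close()

    if __name__ == "__main__":
        # e.g. 6.5 hours of sleep recorded for one on-call engineer last night
        send_metric("oncall.sleep.hours.lozzd", 6.5)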
  • 86. @lozzd • @ryan_frantz Did it work?
  • 88. @lozzd • @ryan_frantz Did it work? • Yes.
  • 90. @lozzd • @ryan_frantz Did it work? • Yes. • Signal to noise ratio is much better
  • 92. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that
  • 93. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that • Adding alerts all the time means new “annoying” things
  • 94. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that • Adding alerts all the time means new “annoying” things • Keep monitoring
  • 95. @lozzd • @ryan_frantz What’s next?
  • 96. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep
  • 97. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep • But not the effect on the person when they come to work the next day
  • 98. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep • But not the effect on the person when they come to work the next day • How do we measure the impact of sleep loss/deprivation?
  • 99. @lozzd • @ryan_frantz The Effect of Sleep • But not the effect on the person when they come to work the next day • How do we measure the impact of sleep loss/deprivation? • Subjective: Pittsburgh Sleepiness Scale • Objective: Psychomotor vigilance task (PVT) to measure alertness
  • 100. @lozzd • @ryan_frantz Beyond Opsweekly • Employee wellness program
  • 101. @lozzd • @ryan_frantz Beyond Opsweekly • Employee wellness program • Security has started using past sleep data to check for suspicious logins to systems
  • 102. @lozzd • @ryan_frantz More context: nagios-herald
  • 103. @lozzd • @ryan_frantz More reports • We have a bunch of data; we can build better reports and drill down to analyze alerting trends
  • 104. @lozzd • @ryan_frantz More reports • We have a bunch of data; we can build better reports and drill down to analyze alerting trends • Can we attribute particular actions to reduced noise volume? • Aggregate alerts • Non-downtimed alerts
  • 105. @lozzd • @ryan_frantz Thanks
  • 106. @lozzd • @ryan_frantz Etsy Ops Team
  • 107. @lozzd • @ryan_frantz SewMona
  • 108. @lozzd • @ryan_frantz Open Source/Links • http://ryanfrantz.com/mtts • https://github.com/etsy/opsweekly • https://github.com/etsy/nagios-herald • https://github.com/jonlives/jawboneup_to_graphite • http://codeascraft.com
  • 109. @lozzd • @ryan_frantz Questions?
  • 110. @lozzd • @ryan_frantz Mean Time to Sleep Quantifying the on-call experience