Mean Time to Sleep: Quantifying the On-Call Experience

Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.


  1. @lozzd • @ryan_frantz Mean Time to Sleep: Quantifying the on-call experience
  2. Laurie Denness @lozzd Ryan Frantz @ryan_frantz
  3. @lozzd • @ryan_frantz Who is in an on-call rotation?
  4. @lozzd • @ryan_frantz Who is on call right now?
  5. @lozzd • @ryan_frantz Who feels like on-call sucks?
  6. Welcome. How is on call?
  7. @lozzd • @ryan_frantz Let’s help our people sleep
  8. @lozzd • @ryan_frantz Make on-call more bearable
  9. @lozzd • @ryan_frantz Incremental Changes
  10. @lozzd • @ryan_frantz Email to Acknowledge
  11. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  12. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  13. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  14. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  15. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  16. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  17. @lozzd • @ryan_frantz Email to Acknowledge • Replying “ack” with some context makes it appear in IRC too
  18. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night?
  19. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night? • Do you care if one of your web/hadoop/X boxes dies in the middle of the night?
  20. @lozzd • @ryan_frantz Email Only Alerts • Do you care if RAID becomes degraded in the middle of the night? • Do you care if one of your web/hadoop/X boxes dies in the middle of the night? • Can it wait until the morning?
  21. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  22. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  23. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  24. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state
  25. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients
  26. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients
  27. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients
  28. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes
  29. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes • Link to Runbook
  30. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes • Link to Runbook
  31. @lozzd • @ryan_frantz Added Context • Previous service state • Duration in that state • Alert recipients • Notes • Link to runbook
  32. @lozzd • @ryan_frantz Alert Storms • Reduce noise when 200 things go wrong by aggregating
  33. @lozzd • @ryan_frantz Alert Storms • Reduce noise when 200 things go wrong by aggregating • Trigger the alert when the percentage of the pool that is failing goes over a threshold
  34. @lozzd • @ryan_frantz Low friction downtime • IRC commands to downtime hosts/sets of hosts
  35. @lozzd • @ryan_frantz Low friction downtime • IRC commands to downtime hosts/sets of hosts
  36. @lozzd • @ryan_frantz Downtime Reminders • Help prevent false notifications
  37. @lozzd • @ryan_frantz Downtime Reminders • Help prevent false notifications
  38. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team
  39. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd)
  40. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd) • Re-running jobs (transient errors)
  41. @lozzd • @ryan_frantz Event Handlers • Teach Nagios to augment the team • Restarting services (nscd) • Re-running jobs (transient errors) • Duplicate crons (Chef)
  42. @lozzd • @ryan_frantz Incremental Improvements? • Maybe
  43. @lozzd • @ryan_frantz Incremental Improvements? • Maybe • More ideas; hoped they’d stick
  44. @lozzd • @ryan_frantz Incremental Improvements? • Maybe • More ideas; hoped they’d stick • We didn’t know because we didn’t measure
  45. @lozzd • @ryan_frantz Measure Everything • “You can’t manage what you can’t measure.” - Deming (not really)
  46. @lozzd • @ryan_frantz Measure Everything • “You can’t manage what you can’t measure.” - Deming (not really) • But, we weren’t measuring anything
  47. @lozzd • @ryan_frantz What should we measure?
  48. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity)
  49. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not)
  50. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not) • Alert times: Off-hours?
  51. @lozzd • @ryan_frantz What should we measure? • Volume of alerts (total, by severity) • Alert categorization (actionable vs not) • Alert times: Off-hours? • Noisy hosts/services
  52. @lozzd • @ryan_frantz Opsweekly
  53. @lozzd • @ryan_frantz We have data.
  54. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports
  55. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing
  56. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing 3. Aggregate alerts
  57. @lozzd • @ryan_frantz Aggregate alerts 1. Look at reports 2. Wow, look at all those alerts for the same thing 3. Aggregate alerts 4. Profit
  58. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch)
  59. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch) • Standard Nagios feature
  60. @lozzd • @ryan_frantz Parent relationships • Prevent alerts due to upstream issues (downed switch) • Standard Nagios feature • Computers can do this for us!
  61. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com
  62. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info
  63. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info • Put switch info into Chef using ohai
  64. @lozzd • @ryan_frantz Parent relationships • signalvnoise.com • LLDP on host shows switch info • Put switch info into Chef using ohai • Create Nagios host configs based on data
  65. @lozzd • @ryan_frantz Service Dependencies • Hundreds of Graphite-sourced checks
  66. @lozzd • @ryan_frantz Service Dependencies • Hundreds of Graphite-sourced checks • Created new template that sets a servicegroup that depends on the Graphite service.
  67. @lozzd • @ryan_frantz Keep on analyzing • It’s okay to just identify and delete alerts that don’t mean anything!
  68. @lozzd • @ryan_frantz Keep on analyzing • It’s okay to just identify and delete alerts that don’t mean anything! • Or move them to email only
  69. @lozzd • @ryan_frantz More Quantification!
  70. @lozzd • @ryan_frantz Reviewing the Year • Use reports
  71. @lozzd • @ryan_frantz Reviewing the Year • Use reports • Use search
  72. @lozzd • @ryan_frantz Reviewing the Year • Use reports • Use search • Identify noisiest alerts
  73. @lozzd • @ryan_frantz Reviewing the Year YEARLY REPORT SCREENSHOTS
  74. @lozzd • @ryan_frantz Nagios Hack Day/Week • Great time to look at this data and make improvements
  75. @lozzd • @ryan_frantz Nagios Hack Day/Week • Great time to look at this data and make improvements • If disk space is the worst, can we rethink that?
  76. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation
  77. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation • A whole subset of alerts that don’t go to Ops
  78. @lozzd • @ryan_frantz Outsource Your Alerts • Etsy’s Search Team has on-call rotation • A whole subset of alerts that don’t go to Ops • More teams starting this but Search Team is at 100%
  79. @lozzd • @ryan_frantz Sleep Tracking
  80. @lozzd • @ryan_frantz
  81. “Track your life!” - @ph
  82. @lozzd • @ryan_frantz
  83. @lozzd • @ryan_frantz
  84. @lozzd • @ryan_frantz
  85. @lozzd • @ryan_frantz
  86. @lozzd • @ryan_frantz Did it work?
  87. @lozzd • @ryan_frantz Did it work?
  88. @lozzd • @ryan_frantz Did it work? • Yes.
  89. @lozzd • @ryan_frantz Did it work? • Yes.
  90. @lozzd • @ryan_frantz Did it work? • Yes. • Signal to noise ratio is much better
  91. @lozzd • @ryan_frantz Did it work? • Yes.
  92. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that
  93. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that • Adding alerts all the time means new “annoying” things
  94. @lozzd • @ryan_frantz Did it work? • Yes. • Okay, so it’s a little more complicated than that • Adding alerts all the time means new “annoying” things • Keep monitoring
  95. @lozzd • @ryan_frantz What’s next?
  96. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep
  97. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep • But not the effect on the person when they come to work the next day
  98. @lozzd • @ryan_frantz The Effect of Sleep • We focus on people’s sleep • But not the effect on the person when they come to work the next day • How do we measure the impact of sleep loss/deprivation?
  99. @lozzd • @ryan_frantz The Effect of Sleep • But not the effect on the person when they come to work the next day • How do we measure the impact of sleep loss/deprivation? • Subjective: Pittsburgh Sleepiness Scale • Objective: Psychomotor vigilance task (PVT) to measure alertness
  100. @lozzd • @ryan_frantz Beyond Opsweekly • Employee wellness program
  101. @lozzd • @ryan_frantz Beyond Opsweekly • Employee wellness program • Security have started using past sleep data to check for weird logins to systems
  102. @lozzd • @ryan_frantz More context: nagios-herald
  103. @lozzd • @ryan_frantz More reports • We have a bunch of data, we can build better reports, drill down to analyze alerting trends
  104. @lozzd • @ryan_frantz More reports • We have a bunch of data, we can build better reports, drill down to analyze alerting trends • Can we attribute particular actions to reduced noise volume? • Aggregate alerts • Non-downtimed alerts
  105. @lozzd • @ryan_frantz Thanks
  106. @lozzd • @ryan_frantz Etsy Ops Team
  107. @lozzd • @ryan_frantz SewMona
  108. @lozzd • @ryan_frantz Open Source/Links • http://ryanfrantz.com/mtts • https://github.com/etsy/opsweekly • https://github.com/etsy/nagios-herald • https://github.com/jonlives/jawboneup_to_graphite • http://codeascraft.com
  109. @lozzd • @ryan_frantz Questions?
  110. @lozzd • @ryan_frantz Mean Time to Sleep: Quantifying the on-call experience
  111. @lozzd • @ryan_frantz Mean Time to Sleep: Quantifying the on-call experience
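
The "Alert Storms" slides describe aggregating per-host checks so that one alert fires when a percentage of a pool is over threshold, instead of one alert per host. A minimal Python sketch of that idea follows. It is not the check Etsy runs: the `HostCheck` type, the function names, and the 10%/25% thresholds are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostCheck:
    """Result of one per-host check (hypothetical structure)."""
    host: str
    ok: bool


def pool_failure_percent(checks):
    """Return the percentage of hosts in the pool whose check is failing."""
    if not checks:
        return 0.0
    failed = sum(1 for c in checks if not c.ok)
    return 100.0 * failed / len(checks)


def aggregate_state(checks, warn_pct=10.0, crit_pct=25.0):
    """Collapse many per-host results into a single Nagios-style state.

    The thresholds are illustrative defaults, not values from the talk.
    """
    pct = pool_failure_percent(checks)
    if pct >= crit_pct:
        return "CRITICAL"
    if pct >= warn_pct:
        return "WARNING"
    return "OK"


# Example pool: 20 web hosts, the first 3 of which are failing (15%).
pool = [HostCheck(f"web{i:02d}", ok=(i >= 3)) for i in range(20)]
print(aggregate_state(pool), pool_failure_percent(pool))
```

With this shape, three dead hosts out of twenty produce a single WARNING for the pool rather than three separate pages.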

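The "What should we measure?" slides list four things: alert volume (total and by severity), actionable versus not, off-hours timing, and the noisiest hosts/services. The Python sketch below shows one way such a summary could be computed from a list of alerts. It does not reflect how Opsweekly is implemented: the `Alert` fields, the 09:00-18:00 weekday definition of working hours, and the sample data are assumptions for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    """One notification received while on call (hypothetical structure)."""
    when: datetime
    host: str
    service: str
    severity: str
    actionable: bool


def is_off_hours(when, start_hour=9, end_hour=18):
    """True on weekends or outside the assumed working hours on a weekday."""
    return when.weekday() >= 5 or not (start_hour <= when.hour < end_hour)


def summarize(alerts, top_n=3):
    """Compute the four measurements named on the slides."""
    return {
        "total": len(alerts),
        "by_severity": dict(Counter(a.severity for a in alerts)),
        "actionable": sum(1 for a in alerts if a.actionable),
        "off_hours": sum(1 for a in alerts if is_off_hours(a.when)),
        "noisiest": Counter((a.host, a.service) for a in alerts).most_common(top_n),
    }


# Made-up sample data: a Monday 03:15 page, a Monday afternoon
# warning, and a Saturday midday warning.
alerts = [
    Alert(datetime(2014, 6, 2, 3, 15), "web01", "disk", "CRITICAL", True),
    Alert(datetime(2014, 6, 2, 14, 0), "web01", "disk", "WARNING", False),
    Alert(datetime(2014, 6, 7, 12, 0), "db01", "raid", "WARNING", False),
]
summary = summarize(alerts)
print(summary)
```

Counting by `(host, service)` pair is one choice among several; grouping by service alone would surface checks that are noisy across the whole fleet.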