Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.
20. @lozzd • @ryan_frantz
Email Only Alerts
• Do you care if RAID becomes degraded in the middle of
the night?
• Do you care if one of your web/hadoop/X boxes dies in
the middle of the night?
• Can it wait until the morning?
51. @lozzd • @ryan_frantz
What should we measure?
• Volume of alerts (total, by severity)
• Alert categorization (actionable vs not)
• Alert times: Off-hours?
• Noisy hosts/services
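The measurements listed above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the talk: the record fields (`host`, `service`, `severity`, `time`, `actionable`), the sample data, and the 09:00 to 18:00 weekday definition of working hours are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert records; field names and values are made up for illustration.
ALERTS = [
    {"host": "web01", "service": "Disk Space", "severity": "CRITICAL",
     "time": datetime(2014, 3, 3, 2, 15), "actionable": False},
    {"host": "web01", "service": "Disk Space", "severity": "WARNING",
     "time": datetime(2014, 3, 3, 14, 5), "actionable": True},
    {"host": "db02", "service": "MySQL Replication", "severity": "CRITICAL",
     "time": datetime(2014, 3, 4, 23, 40), "actionable": True},
    {"host": "web01", "service": "Load", "severity": "WARNING",
     "time": datetime(2014, 3, 5, 4, 30), "actionable": False},
]


def is_off_hours(ts, start_hour=9, end_hour=18):
    """Assumed definition: weekends, or any time outside start_hour..end_hour."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def summarize(alerts):
    """Compute volume, categorization, off-hours count, and noisiest checks."""
    return {
        "total": len(alerts),
        "by_severity": Counter(a["severity"] for a in alerts),
        "actionable": sum(1 for a in alerts if a["actionable"]),
        "off_hours": sum(1 for a in alerts if is_off_hours(a["time"])),
        "noisiest": Counter(
            (a["host"], a["service"]) for a in alerts
        ).most_common(3),
    }


summary = summarize(ALERTS)
print(summary)
```

Counting by `(host, service)` pair rather than by host alone keeps a single flapping check from being hidden behind an otherwise quiet machine.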
70. @lozzd • @ryan_frantz
Parent relationships
• Prevent alerts due to upstream issues (downed switch)
• Standard Nagios feature
• Computers can do this for us!
74. @lozzd • @ryan_frantz
Parent relationships
• signalvnoise.com
• LLDP on host shows switch info
• Put switch info into Chef using ohai
• Create Nagios host configs based on data
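As a rough sketch of the last step above, the snippet below renders Nagios host objects with a `parents` directive from a host-to-switch mapping. The mapping, the host names, and the `generic-host` template name are hypothetical; in the workflow described on the slide the data would come from LLDP via ohai and Chef, which is not modeled here. The template is assumed to supply every other required host directive.

```python
# Hypothetical host -> upstream switch mapping, standing in for the LLDP
# data that ohai would publish into Chef. All names are made up.
HOST_TO_SWITCH = {
    "web01.example.com": "switch-a1.example.com",
    "web02.example.com": "switch-a1.example.com",
    "db01.example.com": "switch-b2.example.com",
}


def host_definition(host_name, parent, template="generic-host"):
    """Render one Nagios host object whose `parents` directive names its switch.

    The template (an assumed name) is expected to provide the remaining
    required directives such as check and notification settings.
    """
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {host_name}\n"
        f"    parents    {parent}\n"
        "}\n"
    )


def render_hosts(host_to_switch):
    """Render all host objects in a stable (sorted) order."""
    return "\n".join(
        host_definition(host, switch)
        for host, switch in sorted(host_to_switch.items())
    )


config = render_hosts(HOST_TO_SWITCH)
print(config)
```

With `parents` set, Nagios can mark hosts behind a downed switch as unreachable rather than down, which is what suppresses the flood of per-host alerts. The switch itself also has to exist as a Nagios host object for the reference to resolve.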
77. @lozzd • @ryan_frantz
Service Dependencies
• Hundreds of Graphite-sourced checks
• Created new template that sets a servicegroup that
depends on the Graphite service.
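One way to express this in Nagios object configuration is a service template that places every Graphite-sourced check in a servicegroup, plus a `servicedependency` that makes the whole group depend on the Graphite service. The sketch below generates both objects; the template name, servicegroup name, master host, and service description are hypothetical, and the failure criteria are an assumed choice, so check the directives against the object definition reference for your Nagios version.

```python
def graphite_service_template(template_name, servicegroup):
    """Render a non-registered service template that sets a servicegroup."""
    return (
        "define service {\n"
        f"    name           {template_name}\n"
        f"    servicegroups  {servicegroup}\n"
        "    register       0\n"
        "}\n"
    )


def graphite_dependency(master_host, master_service, dependent_group):
    """Render a dependency of a whole servicegroup on one master service.

    Assumed criteria: keep executing the dependent checks (n = none), but
    suppress their notifications while the master service is in a
    WARNING, UNKNOWN, or CRITICAL state (w,u,c).
    """
    return (
        "define servicedependency {\n"
        f"    host_name                      {master_host}\n"
        f"    service_description            {master_service}\n"
        f"    dependent_servicegroup_name    {dependent_group}\n"
        "    execution_failure_criteria     n\n"
        "    notification_failure_criteria  w,u,c\n"
        "}\n"
    )


# Hypothetical names, for illustration only.
template = graphite_service_template("graphite-check", "graphite-checks")
dependency = graphite_dependency(
    "graphite01.example.com", "Graphite", "graphite-checks"
)
print(template)
print(dependency)
```

Each Graphite-sourced check would then inherit from the template with `use graphite-check`, so a Graphite outage produces one alert for Graphite instead of hundreds for the checks that read from it.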
78. @lozzd • @ryan_frantz
Keep on analyzing
• It’s okay to just identify and delete alerts that don’t
mean anything!
• Or move them to email only
86. @lozzd • @ryan_frantz
• Great time to look at this data and make improvements
Nagios Hack Day/Week
87. @lozzd • @ryan_frantz
• Great time to look at this data and make improvements
• If Disk Space is the worst. Can we rethink that?
Nagios Hack Day/Week
89. @lozzd • @ryan_frantz
Outsource Your Alerts
• Etsy’s Search Team has on-call rotation
• A whole subset of alerts that don’t go to Ops
• More teams starting this but Search Team is at 100%
114. @lozzd • @ryan_frantz
Did it work?
• Yes.
• Okay, so it’s a little more complicated than that
• Adding alerts all the time means new “annoying” things
• Keep monitoring
118. @lozzd • @ryan_frantz
The Effect of Sleep
• We focus on people’s sleep
• But not the effect on the person when they come to
work the next day
• How do we measure the impact of sleep loss/deprivation?
• Subjective: Pittsburgh Sleepiness Scale
• Objective: Psychomotor vigilance task (PVT) to measure
alertness
122. @lozzd • @ryan_frantz
Beyond Opsweekly
• Employee wellness program
• Security have started using past sleep data to check for
weird logins to systems
124. @lozzd • @ryan_frantz
More reports
• We have a bunch of data, we can build better reports,
drill down to analyze alerting trends
• Can we attribute particular actions to reduced noise
volume?
• Aggregate alerts
• Non-downtimed alerts
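A possible starting point for such reports is a per-week alert count, which makes it easier to line up a drop in volume with a specific change. The sketch below is an assumption-laden illustration: the `(date, downtimed)` record shape and the sample data are made up, and treating "non-downtimed" as "alerts that did not fire during scheduled downtime" is only one reading of the bullet above.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (day the alert fired, fired during scheduled downtime?)
ALERTS = [
    (date(2014, 3, 3), False),
    (date(2014, 3, 4), True),
    (date(2014, 3, 6), False),
    (date(2014, 3, 11), False),
]


def weekly_counts(alerts, include_downtimed=False):
    """Aggregate alerts per ISO (year, week), optionally skipping downtimed ones."""
    counts = Counter()
    for day, downtimed in alerts:
        if downtimed and not include_downtimed:
            continue
        iso = day.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


non_downtimed = weekly_counts(ALERTS)
everything = weekly_counts(ALERTS, include_downtimed=True)
print(non_downtimed)
print(everything)
```

Comparing the weeks before and after a cleanup (for example, a Nagios hack week) against these counts gives a first, rough indication of whether that action reduced noise.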