Mean Time to Sleep
Quantifying the on-call experience
Laurie Denness
@lozzd
Ryan Frantz
@ryan_frantz
Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.

Mean Time to Sleep: Quantifying the On-Call Experience

  1. Mean Time to Sleep: Quantifying the on-call experience
  2. Laurie Denness (@lozzd) and Ryan Frantz (@ryan_frantz)
  3. Who is in an on-call rotation?
  4. Who is on call right now?
  5. Who feels like on-call sucks?
  6. Welcome. How is on call?
  7. Let’s help our people sleep
  8. Make on-call more bearable
  9. Incremental Changes
  10-17. Email to Acknowledge
      • Replying “ack” with some context makes it appear in IRC too
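A minimal sketch of how an "ack" reply might be parsed before being relayed to IRC. The function names, the message format, and the IRC line layout are all assumptions for illustration; the slides do not show how the real integration works.

```python
import re
from email import message_from_string
from typing import Optional, Tuple

# Matches a line that starts with "ack", optionally followed by free-text context.
ACK_PATTERN = re.compile(r"^\s*ack\b[\s:,-]*(.*)$", re.IGNORECASE)


def parse_ack(raw_email: str) -> Optional[Tuple[str, str]]:
    """Return (sender, context) if the first real line of the reply is an ack."""
    msg = message_from_string(raw_email)
    sender = msg.get("From", "unknown")
    body = msg.get_payload()
    if not isinstance(body, str):
        return None  # multipart replies are out of scope for this sketch
    for line in body.splitlines():
        if not line.strip() or line.startswith(">"):
            continue  # skip blank lines and quoted text from the original alert
        match = ACK_PATTERN.match(line)
        if match:
            return sender, match.group(1).strip()
        return None  # the first real line is not an ack
    return None


def irc_line(sender: str, context: str) -> str:
    """Format the acknowledgement for an IRC channel (hypothetical layout)."""
    return f"ACK from {sender}: {context or '(no context given)'}"
```

Only the first non-quoted line is inspected, so an "ack" buried in the quoted alert text does not acknowledge anything by accident.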
  18-20. Email Only Alerts
      • Do you care if RAID becomes degraded in the middle of the night?
      • Do you care if one of your web/Hadoop/X boxes dies in the middle of the night?
      • Can it wait until the morning?
  21-31. Added Context
      • Previous service state
      • Duration in that state
      • Alert recipients
      • Notes
      • Link to runbook
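As a rough illustration of the context fields listed above, an alert body could be assembled like this. The field layout and the function names are assumptions; the slides do not show the actual alert template.

```python
from datetime import timedelta
from typing import List


def format_duration(delta: timedelta) -> str:
    """Render a duration as hours, minutes and seconds."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes}m {seconds}s"


def alert_body(service: str, state: str, previous_state: str,
               in_previous_state: timedelta, recipients: List[str],
               notes: str, runbook_url: str) -> str:
    """Build a plain-text alert that carries the extra context."""
    return "\n".join([
        f"{service} is {state}",
        f"Previous state: {previous_state} (for {format_duration(in_previous_state)})",
        f"Recipients: {', '.join(recipients)}",
        f"Notes: {notes}",
        f"Runbook: {runbook_url}",
    ])
```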
  32-33. Alert Storms
      • Reduce noise when 200 things go wrong by aggregating
      • Trigger an alert when the failing percentage of a pool goes over a threshold
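The pool-percentage idea can be sketched as a single aggregate check that replaces one alert per host. The function name, the input shape, and the threshold semantics here are assumptions, not the check Etsy runs.

```python
from typing import Mapping


def pool_over_threshold(host_ok: Mapping[str, bool], threshold_pct: float) -> bool:
    """Return True when the share of failing hosts reaches threshold_pct.

    host_ok maps each host in the pool to True (healthy) or False (failing).
    """
    if not host_ok:
        return False  # an empty pool has nothing to alert on
    failing = sum(1 for ok in host_ok.values() if not ok)
    return 100.0 * failing / len(host_ok) >= threshold_pct
```

With 200 hosts and a 10% threshold, ten failures stay quiet and thirty failures raise one alert instead of thirty.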
  34-35. Low friction downtime
      • IRC commands to downtime hosts/sets of hosts
  36-37. Downtime Reminders
      • Help prevent false notifications
  38-41. Event Handlers
      • Teach Nagios to augment the team
      • Restarting services (nscd)
      • Re-running jobs (transient errors)
      • Duplicate crons (Chef)
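A hedged sketch of an event handler for the service-restart case. Nagios can pass the `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros to a handler command; the decision rule below (restart on the third soft failure or on a hard failure) and the `service nscd restart` command are assumptions, not the handler shown in the talk.

```python
import subprocess
from typing import Callable, List, Optional


def restart_command(state: str, state_type: str, attempt: int,
                    service: str = "nscd", soft_attempt: int = 3) -> Optional[List[str]]:
    """Decide whether to restart, based on the values of the Nagios macros."""
    if state != "CRITICAL":
        return None  # nothing to do for OK, WARNING or UNKNOWN
    if state_type == "HARD" or (state_type == "SOFT" and attempt == soft_attempt):
        return ["service", service, "restart"]
    return None  # early soft failures: wait and see if the service recovers


def handle(state: str, state_type: str, attempt: int,
           run: Callable = subprocess.run) -> bool:
    """Run the restart if warranted; return True when a restart was attempted."""
    cmd = restart_command(state, state_type, attempt)
    if cmd is None:
        return False
    run(cmd, check=False)
    return True
```

Passing `run` as a parameter keeps the decision logic testable without touching a real service.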
  42-44. Incremental Improvements?
      • Maybe
      • More ideas; hoped they’d stick
      • We didn’t know because we didn’t measure
  45-46. Measure Everything
      • “You can’t manage what you can’t measure.” - Deming (not really)
      • But, we weren’t measuring anything
  47-51. What should we measure?
      • Volume of alerts (total, by severity)
      • Alert categorization (actionable vs not)
      • Alert times: Off-hours?
      • Noisy hosts/services
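The four measurements above could be computed from a list of alert records roughly as follows. The `Alert` record, the 09:00 to 18:00 weekday definition of working hours, and the `summarize` function are assumptions for illustration; this is not Opsweekly's data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass
class Alert:
    host: str
    service: str
    severity: str      # e.g. "WARNING" or "CRITICAL"
    when: datetime
    actionable: bool   # categorization supplied by the on-call person


def is_off_hours(when: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Weekends and anything outside start_hour..end_hour count as off-hours."""
    return when.weekday() >= 5 or not (start_hour <= when.hour < end_hour)


def summarize(alerts: Iterable[Alert]) -> dict:
    """Volume, categorization, off-hours count and noisiest host/service pairs."""
    alerts = list(alerts)
    return {
        "total": len(alerts),
        "by_severity": Counter(a.severity for a in alerts),
        "actionable": sum(a.actionable for a in alerts),
        "off_hours": sum(is_off_hours(a.when) for a in alerts),
        "noisiest": Counter((a.host, a.service) for a in alerts).most_common(3),
    }
```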
  52. Opsweekly
  53. We have data.
  54-57. Aggregate alerts
      1. Look at reports
      2. Wow, look at all those alerts for the same thing
      3. Aggregate alerts
      4. Profit
  58-60. Parent relationships
      • Prevent alerts due to upstream issues (downed switch)
      • Standard Nagios feature
      • Computers can do this for us!
  61-64. Parent relationships
      • signalvnoise.com
      • LLDP on host shows switch info
      • Put switch info into Chef using ohai
      • Create Nagios host configs based on data
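The last step, turning discovered switch data into Nagios host definitions, might look like the sketch below. The `parents` directive is standard Nagios; the input mapping, the `generic-host` template name, and the function names are assumptions, and the real pipeline (LLDP to ohai to Chef) is not shown in the slides.

```python
from typing import Mapping, Tuple


def host_definition(host: str, address: str, switch: str,
                    template: str = "generic-host") -> str:
    """Render one Nagios host object whose parent is its upstream switch."""
    return "\n".join([
        "define host {",
        f"    use        {template}",
        f"    host_name  {host}",
        f"    address    {address}",
        f"    parents    {switch}",
        "}",
    ])


def render_hosts(discovered: Mapping[str, Tuple[str, str]]) -> str:
    """discovered maps host name to (address, switch name as reported via LLDP)."""
    return "\n\n".join(
        host_definition(host, address, switch)
        for host, (address, switch) in sorted(discovered.items())
    )
```

With the parent set, Nagios can mark hosts behind a downed switch as unreachable rather than down, which is what suppresses the per-host alerts.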
  65-66. Service Dependencies
      • Hundreds of Graphite-sourced checks
      • Created new template that sets a servicegroup that depends on the Graphite service.
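One way to express "a servicegroup that depends on the Graphite service" is a Nagios `servicedependency` object. The sketch below generates one; the host, service and servicegroup names are hypothetical, and the failure criteria are an assumption rather than the values used in the talk.

```python
def graphite_dependency(graphite_host: str, graphite_service: str,
                        dependent_group: str) -> str:
    """Render a servicedependency so the group stays quiet while Graphite is unhealthy."""
    return "\n".join([
        "define servicedependency {",
        f"    host_name                      {graphite_host}",
        f"    service_description            {graphite_service}",
        f"    dependent_servicegroup_name    {dependent_group}",
        # Suppress notifications for dependents while Graphite is WARNING, UNKNOWN or CRITICAL.
        "    notification_failure_criteria  w,u,c",
        # Keep running the dependent checks regardless of Graphite's state.
        "    execution_failure_criteria     n",
        "}",
    ])
```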
  67-68. Keep on analyzing
      • It’s okay to just identify and delete alerts that don’t mean anything!
      • Or move them to email only
  69. More Quantification!
  70-72. Reviewing the Year
      • Use reports
      • Use search
      • Identify noisiest alerts
  73. Reviewing the Year (yearly report screenshots)
  74-75. Nagios Hack Day/Week
      • Great time to look at this data and make improvements
      • If disk space is the worst, can we rethink that?
  76-78. Outsource Your Alerts
      • Etsy’s Search Team has on-call rotation
      • A whole subset of alerts that don’t go to Ops
      • More teams starting this but Search Team is at 100%
  79. Sleep Tracking
  80. (no text)
  81. “Track your life!” - @ph
  82-85. (no text)
  86-90. Did it work?
      • Yes.
      • Signal to noise ratio is much better
  91-94. Did it work?
      • Yes.
      • Okay, so it’s a little more complicated than that
      • Adding alerts all the time means new “annoying” things
      • Keep monitoring
  95. What’s next?
  96-99. The Effect of Sleep
      • We focus on people’s sleep
      • But not the effect on the person when they come to work the next day
      • How do we measure the impact of sleep loss/deprivation?
      • Subjective: Pittsburgh Sleepiness Scale
      • Objective: Psychomotor vigilance task (PVT) to measure alertness
  100-101. Beyond Opsweekly
      • Employee wellness program
      • Security have started using past sleep data to check for weird logins to systems
  102. More context: nagios-herald
  103-104. More reports
      • We have a bunch of data, we can build better reports, drill down to analyze alerting trends
      • Can we attribute particular actions to reduced noise volume?
          • Aggregate alerts
          • Non-downtimed alerts
  105. Thanks
  106. Etsy Ops Team
  107. SewMona
  108. Open Source/Links
      • http://ryanfrantz.com/mtts
      • https://github.com/etsy/opsweekly
      • https://github.com/etsy/nagios-herald
      • https://github.com/jonlives/jawboneup_to_graphite
      • http://codeascraft.com
  109. Questions?
  110-111. Mean Time to Sleep: Quantifying the on-call experience