@lozzd • @ryan_frantz
Mean Time to Sleep
Quantifying the on-call experience
Laurie Denness
@lozzd
Ryan Frantz
@ryan_frantz
Who is in an on-call rotation?
Who is on call right now?
Who feels like on-call sucks?
Welcome. How is on call?
Let’s help our people sleep
Make on-call more bearable
Incremental Changes
Email to Acknowledge
• Replying “ack” with some context makes it appear in
IRC too
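
A minimal sketch of how an "ack" reply could be recognised and relayed, assuming the reply's first non-blank line starts with "ack" followed by free-form context. The function names and the IRC message format are hypothetical, not taken from the talk.

```python
import re
from typing import Optional

# Assumed convention: first non-blank line is "ack" plus optional context.
ACK_PATTERN = re.compile(r"^\s*ack\b[\s:,-]*(.*)$", re.IGNORECASE)


def parse_ack_reply(body: str) -> Optional[str]:
    """Return the context text if the reply is an acknowledgement, else None."""
    for line in body.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        match = ACK_PATTERN.match(line)
        # Only the first non-blank line decides; quoted alert text is ignored.
        return match.group(1).strip() if match else None
    return None


def irc_message(sender: str, alert: str, context: str) -> str:
    """Format the line a bot could post to the channel (format is made up)."""
    suffix = f": {context}" if context else ""
    return f"{sender} acked {alert}{suffix}"
```

For example, `parse_ack_reply("ack looking into it")` yields the context `"looking into it"`, which `irc_message` can then turn into a channel line.
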
Email Only Alerts
• Do you care if RAID becomes degraded in the middle of
the night?
• Do you care if one of your web/hadoop/X boxes dies in
the middle of the night?
• Can it wait until the morning?
Added Context
• Previous service state
• Duration in that state
• Alert recipients
• Notes
• Link to runbook
Alert Storms
• Reduce noise when 200 things go wrong by aggregating
• Trigger an alert only when a percentage of the pool is over threshold
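
The pool-percentage idea can be sketched as a single aggregate check. This is an illustration under assumptions: `pool_alert`, its input shape, and the 20% default are hypothetical, not the checks used in the talk.

```python
from typing import Mapping, Optional


def pool_alert(host_states: Mapping[str, bool],
               max_unhealthy_pct: float = 20.0) -> Optional[str]:
    """Return one aggregate alert message, or None if the pool is fine.

    host_states maps host name -> True when that host's check is healthy.
    The threshold default is an arbitrary example value.
    """
    total = len(host_states)
    if total == 0:
        return None
    unhealthy = sorted(h for h, ok in host_states.items() if not ok)
    pct = 100.0 * len(unhealthy) / total
    if pct <= max_unhealthy_pct:
        return None  # a few dead boxes can wait until the morning
    return (f"{len(unhealthy)}/{total} hosts unhealthy ({pct:.0f}%): "
            + ", ".join(unhealthy))
```

With 200 hosts, 3 failures produce no alert, while 50 failures produce a single message instead of 50 pages.
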
Low friction downtime
• IRC commands to downtime hosts/sets of hosts
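
A rough sketch of translating a chat command into Nagios external commands. The `!downtime <hosts> <duration> <comment>` syntax is invented for illustration. The output follows the `SCHEDULE_HOST_DOWNTIME` external command format (host, start, end, fixed, trigger id, duration, author, comment).

```python
import re
from typing import List

_UNITS = {"m": 60, "h": 3600, "d": 86400}
# Hypothetical chat syntax: "!downtime web01,web02 2h kernel upgrade"
_COMMAND = re.compile(r"^!downtime\s+(\S+)\s+(\d+)([mhd])\s+(.+)$")


def downtime_commands(message: str, author: str, now: int) -> List[str]:
    """Build one SCHEDULE_HOST_DOWNTIME line per host named in the message.

    `now` is a Unix timestamp; downtime starts immediately and is fixed.
    """
    match = _COMMAND.match(message.strip())
    if match is None:
        raise ValueError("usage: !downtime <host[,host...]> <N>[m|h|d] <comment>")
    hosts, amount, unit, comment = match.groups()
    duration = int(amount) * _UNITS[unit]
    end = now + duration
    return [
        f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{now};{end};1;0;"
        f"{duration};{author};{comment}"
        for host in hosts.split(",") if host
    ]
```

Each returned line would be written to the Nagios command file (the path set by `command_file` in the main configuration). A real bot would also need to reject comments containing `;` or newlines.
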
Downtime Reminders
• Help prevent false notifications
Event Handlers
• Teach Nagios to augment the team
• Restarting services (nscd)
• Re-running jobs (transient errors)
• Duplicate crons (Chef)
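
A sketch of the decision an event handler might make before restarting a service such as nscd. It assumes the handler is invoked with the `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros as arguments. The restart command and the "third soft attempt" rule are illustrative assumptions.

```python
import subprocess
from typing import Callable, Sequence


def should_restart(state: str, state_type: str, attempt: int,
                   soft_attempt: int = 3) -> bool:
    """Restart on a HARD CRITICAL, or on the Nth SOFT CRITICAL check attempt."""
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == soft_attempt


def handle(args: Sequence[str],
           run: Callable[..., object] = subprocess.run) -> bool:
    """args = [state, state_type, attempt]; returns True if a restart ran."""
    state, state_type, attempt = args[0], args[1], int(args[2])
    if not should_restart(state, state_type, attempt):
        return False
    # Assumed restart command; adjust for the local init system.
    run(["service", "nscd", "restart"], check=False)
    return True
```

Passing `run` as a parameter keeps the restart logic testable without touching a real service.
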
Incremental Improvements?
• Maybe
• More ideas; hoped they’d stick
• We didn’t know because we didn’t measure
Measure Everything
• “You can’t manage what you can’t measure.”
- Deming (not really)
• But, we weren’t measuring anything
What should we measure?
• Volume of alerts (total, by severity)
• Alert categorization (actionable vs not)
• Alert times: Off-hours?
• Noisy hosts/services
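
The four measurements above could be computed from a list of alert records roughly as follows. This is a sketch only: the `Alert` fields and the 09:00 to 18:00 weekday definition of working hours are assumptions, not Opsweekly's data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class Alert:
    timestamp: datetime
    host: str
    service: str
    severity: str      # e.g. "CRITICAL", "WARNING"
    actionable: bool   # tagged by the on-call engineer after the fact


def is_off_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Weekends, or any time outside the assumed working hours."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def summarize(alerts: Iterable[Alert]) -> dict:
    alerts = list(alerts)
    return {
        "total": len(alerts),
        "by_severity": Counter(a.severity for a in alerts),
        "actionable": sum(a.actionable for a in alerts),
        "off_hours": sum(is_off_hours(a.timestamp) for a in alerts),
        # Noisiest host/service pairs, most frequent first.
        "noisiest": Counter((a.host, a.service) for a in alerts).most_common(3),
    }
```
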
Opsweekly
We have data.
Aggregate alerts
1. Look at reports
2. Wow, look at all those alerts for the same thing
3. Aggregate alerts
4. Profit
Parent relationships
• Prevent alerts due to upstream issues (downed switch)
• Standard Nagios feature
• Computers can do this for us!
Parent relationships
• signalvnoise.com
• LLDP on host shows switch info
• Put switch info into Chef using ohai
• Create Nagios host configs based on data
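
The last step, generating host configs from discovered switch data, might look like the sketch below. The node attribute name `lldp_switch` and the `generic-host` template are hypothetical. `parents` is the standard Nagios host directive that suppresses notifications for hosts behind a downed parent.

```python
from typing import Mapping


def render_hosts(nodes: Mapping[str, Mapping[str, str]],
                 template: str = "generic-host") -> str:
    """Render Nagios host definitions from node data.

    nodes maps host name -> attributes; "ipaddress" is required and
    "lldp_switch" (an assumed attribute name) is optional.
    """
    blocks = []
    for name in sorted(nodes):
        attrs = nodes[name]
        lines = [
            "define host {",
            f"    use        {template}",
            f"    host_name  {name}",
            f"    address    {attrs['ipaddress']}",
        ]
        switch = attrs.get("lldp_switch")
        if switch:
            # Only emit a parent when LLDP actually reported a switch.
            lines.append(f"    parents    {switch}")
        lines.append("}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

The switch itself also needs a host definition, otherwise Nagios will reject the `parents` reference.
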
Service Dependencies
• Hundreds of Graphite-sourced checks
• Created new template that sets a servicegroup that
depends on the Graphite service.
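
One way to express "this servicegroup depends on the Graphite service" is a `servicedependency` object. The sketch below renders one. The host, service and group names are placeholders, and the directive set should be checked against the object definitions for the Nagios version in use.

```python
def graphite_dependency(graphite_host: str, graphite_service: str,
                        dependent_servicegroup: str) -> str:
    """Render a servicedependency: checks in the group depend on Graphite.

    notification_failure_criteria w,u,c suppresses notifications for the
    dependent checks while the Graphite service is WARNING, UNKNOWN or
    CRITICAL.
    """
    return (
        "define servicedependency {\n"
        f"    host_name                      {graphite_host}\n"
        f"    service_description            {graphite_service}\n"
        f"    dependent_servicegroup_name    {dependent_servicegroup}\n"
        "    notification_failure_criteria  w,u,c\n"
        "}\n"
    )
```
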
Keep on analyzing
• It’s okay to just identify and delete alerts that don’t
mean anything!
• Or move them to email only
More Quantification!
Reviewing the Year
• Use reports
• Use search
• Identify noisiest alerts
(Yearly report screenshots)
Nagios Hack Day/Week
• Great time to look at this data and make improvements
• If disk space is the worst, can we rethink that?
Outsource Your Alerts
• Etsy’s Search Team has on-call rotation
• A whole subset of alerts that don’t go to Ops
• More teams starting this but Search Team is at 100%
Sleep Tracking
“Track your life!” - @ph
Did it work?
• Yes.
• Signal to noise ratio is much better
• Okay, so it’s a little more complicated than that
• Adding alerts all the time means new “annoying” things
• Keep monitoring
What’s next?
The Effect of Sleep
• We focus on people’s sleep
• But not the effect on the person when they come to
work the next day
• How do we measure the impact of sleep loss/deprivation?
• Subjective: Pittsburgh Sleepiness Scale
• Objective: Psychomotor vigilance task (PVT) to measure
alertness
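
As an illustration of the objective measure, a PVT session is usually reduced to a few numbers such as mean reaction time and the count of lapses. The sketch below assumes reaction times in milliseconds and uses a 500 ms lapse threshold, a common convention rather than anything stated in the talk.

```python
from statistics import mean, median
from typing import Iterable


def summarize_pvt(reaction_times_ms: Iterable[float],
                  lapse_threshold_ms: float = 500.0) -> dict:
    """Summarise one PVT session; a lapse is a slow response (assumed >= 500 ms)."""
    rts = list(reaction_times_ms)
    if not rts:
        raise ValueError("need at least one reaction time")
    lapses = sum(1 for rt in rts if rt >= lapse_threshold_ms)
    return {
        "trials": len(rts),
        "mean_ms": mean(rts),
        "median_ms": median(rts),
        "lapses": lapses,
    }
```

Comparing these summaries for the morning after a quiet night and the morning after a paged night would give a first, rough view of next-day impact.
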
Beyond Opsweekly
• Employee wellness program
• Security have started using past sleep data to check for
weird logins to systems
More context: nagios-herald
More reports
• We have a bunch of data, we can build better reports,
drill down to analyze alerting trends
• Can we attribute particular actions to reduced noise
volume?
• Aggregate alerts
• Non-downtimed alerts
Thanks
Etsy Ops Team
SewMona
Open Source/Links
• http://ryanfrantz.com/mtts
• https://github.com/etsy/opsweekly
• https://github.com/etsy/nagios-herald
• https://github.com/jonlives/jawboneup_to_graphite
• http://codeascraft.com
Questions?