SlideShare a Scribd company logo
1 of 21
Download to read offline
THE ON-CALL FIREFIGHT 
SURVIVAL GUIDE 
DOWN & DIRTY IN 
THE DEVOPS TRENCHES 
Being on-call sucks 
but with these tips, it can get better.
TABLE OF CONTENTS 
00 
01 What happens when you’re on-call 
02 BEFORE you go on-call 
03 DURING a crisis 
04 AFTER an incident 
05 Additional resources 
The VictorOps On-Call Firefight Survival Guide
SUFFERING FROM 
GHOST PAGER LEG? 
“ With the phantom vibrations, the brain sometimes 
misinterprets sensory input according to the 
preconceived hypothesis that a vibrating sensation 
will be coming from the phone. In other words, it 
seems smartphone users are just so primed for, and 
attentive to, the sensation of their phone going 
off that they simply experience the occasional 
false alarm.” 
(also known as Phantom Cellphone Vibrations or 
Phantom Vibration Syndrome) HOW TO STOP PHANTOM VIBRATIONS 
Use a different device 
50% 
63% 
75% 
Change location of device 
Use audible ringer 
0 1 
http://en.wikipedia.org/wiki/Phantom_vibration_syndrome 
http://mentalfloss.com/article/30960/why-do-people-feel-phantom-cellphone-vibrations source: http://www.ncbi.nlm.nih.gov/ 
pubmed/3363267 
The VictorOps On-Call Firefight Survival Guide
Rated sleep quality was lower, and sleepiness was 
higher during the subsequent day. It was suggested 
that the effects were due to apprehension/ 
uneasiness induced by the prospect of being 
awakened by an alarm.” 
WHAT HAPPENS TO 
YOUR SLEEP... 
Less Rapid Eye Movement (REM) sleep 
Less Slow Wave Sleep (SWS) 
Higher Heart Rate 
(When on-call.) HERE’S WHAT HAPPENS 
The VictorOps On-Call Firefight Survival Guide 
source: http://www.ncbi.nlm.nih.gov/pubmed/3363267 
01 
“
013 
PSYCHOLOGICAL IMPACTS 
OF WORKING ON-CALL, LONG-TERM 
(Being on-call can affect...) 
On-shift performance Recovery Long term health & well-being 
The VictorOps On-Call Firefight Survival Guide 
source: http://www2.hull.ac.uk/science/psychology/research/healthandappliedpsychology/oncallworking.aspx
ALL OF THE RESEARCH MEANS 
NOTHING WHEN YOU’RE IN THE 
MIDDLE OF FIGHTING A FIRE. 
BEFORE, DURING and AFTER a critical incident takes place... 
(Some steps to follow.) 
02 
photo credit: http://www.flickr.com/photos/thenationalguard 
The VictorOps On-Call Firefight Survival Guide
Personal contact methods: Be sure that the alerts are getting to you the way 
you prefer by customizing how you get notified, whether that’s SMS, email or 
phone. You can also use more than one way of communicating if you’re afraid 
of missing anything. (For example, first send a text and then call me if I don’t 
respond in 5 minutes.) 
Escalation rules: The feeling of having no back-up can bring a person to a 
panicked state where they begin making bad decisions. You can easily avoid 
this by having an escalation plan in place ahead of time. These sort of escalation 
policies are easy to set up so that everyone on the team knows when more 
people need to be involved. 
Alert routing: It can be beneficial if certain alerts can be seen by certain people, 
so when setting up alerts, be sure to route them to the team that is most 
capable of solving the problem. Additionally, there are times when specific 
alerts should be seen by different people, simply for the sake of knowing 
what’s happening with the infrastructure, as in the case of a CTO or SVP 
of Engineering. 
STEP #1. 
SETTING UP ALERTS. 
02 
Personal contact methods: 
Escalation rules: 
Alert routing: 
The correct configuration of your alerts is crucial. 
The VictorOps On-Call Firefight Survival Guide
02 The VictorOps On-Call Firefight Survival Guide 
STEP #2 
PICK THE RIGHT HAND OFF DAY 
(What a difference a day makes.) 
Mondays and Fridays are frequently interrupted by 
long weekends and holidays. If your handoff is a 
manual process or you’re passing a physical 
pager around, forgetting to do the switch on a 
Friday could mean someone’s stuck with an extra 
weekend on-call. 
WEDNESDAY, FTW! 
Try doing the on-call handoff on Wednesday. Most 
everyone will be in the office, and you won’t have the 
distractions of the previous or the coming weekend 
interfering with an orderly handoff.
02 The VictorOps On-Call Firefight Survival Guide 
STEP #3 
HOLD A HAND-OFF MEETING 
(All together now, kumbaya!) 
Most people don’t pay too much attention to the 
monitoring system when they’re not on-call. This is 
a good thing. We all need to regain our sanity and 
tune out when we can. But this means that when your 
turn does come up, you may not be prepared for 
everything that’s happening. New issues you haven’t 
heard about may have arisen in the past week. 
SHARE THE KNOWLEDGE 
Getting everyone on the team together for a short week-ly 
meeting on handoff day can greatly ease the transition. 
Whoever was on-call can talk about the big events of the 
previous week and if they’ve noticed anything trending in 
the wrong direction, they can provide some early warning. 
This is also a chance for the whole team to see and hear who 
is on-call, and to check in on how the platform is doing.
02 The VictorOps On-Call Firefight Survival Guide 
(Do you see what I see?) 
If you work in a large enough organization, 
consider having someone from each of the dis-ciplines 
(development, infrastructure, database 
management) on-call at the same time. You can 
split up alerts based on who should be responding, 
and everyone on-call will know they can contact 
the others if they need help out of a jam. 
2 HEADS ARE BETTER THAN ONE. 
A second set of eyes with a different perspective can be 
invaluable in a firefight when every minute counts. 
STEP #4 
USE THE BUDDY SYSTEM
STEP #5 
GO MOBILE 
02 The VictorOps On-Call Firefight Survival Guide 
(All in the palm of your hand.) 
These days, most people carry a computer in 
their pocket. Smart phones may not be an 
ideal way to manage complex systems, but 
apps exist that can enable remediation for 
many problems without having to go home or 
hunt for a wifi hotspot. 
LIFE SAVERS 
By adopting technologies and policies that 
allow for remote problem solving you can 
reduce time-to-resolution and save marriages 
at the same time.
CRISIS MANAGEMENT: 
Tips to remember when you’re in the trenches 
03 The VictorOps On-Call Firefight Survival Guide
03 The VictorOps On-Call Firefight Survival Guide 
TIP #1 
CLEAR HEADS SOLVE PROBLEMS 
(It’s right in front of you.) 
We’ve all experienced the situation where the 
solution to a problem is staring you in the face but 
you’re too distracted to see it. Jumping right into a 
problem when you’re still thinking about that jerk 
that cut you off in traffic will only increase 
your stress and decrease your focus. 
WAYS TO LET GO OF DISTRACTIONS 
The time you take to get into a problem-solving mode 
will more than pay for itself in quicker, better, more 
creative solutions. 
Get a drink of water. 
Go for a walk around the block. 
Take some deep breaths.
03 The VictorOps On-Call Firefight Survival Guide 
TIP #2 
DONT TRY TO TAKE THE HILL BY YOURSELF. 
(Be wary of hero culture.) 
Techies love to tell stories of when they single-handedly 
saved their Fortune-500 company from 
disaster using only their wits. Here’s a hint: most of 
these stories are nonsense. Taking sole responsibility 
for solving a major problem puts too much pressure 
on you, and makes it harder to focus. 
DID YOU SEE THAT? 
Besides, you’re managing a big distributed platform, right? 
There’s a good chance that no one in your organization per-fectly 
understands every moving part. Even if you are deal-ing 
with things you know, a second set of eyes might notice 
that minor typo in the config that you didn’t see.
03 The VictorOps On-Call Firefight Survival Guide 
(Work in shifts for extended crises.) 
Sometimes the problem is so serious that, even 
when the root cause is found, it’s clear that 
it will take hours or even days to get it fixed. 
Get enough people on the problem to fix the 
problem, and let everyone else keep on with 
whatever they’re doing. 
PRESERVE THE RESERVES. 
When you reach hour 12 of the recovery process, you’ll be 
grateful that there are fresh, rested people ready to take over 
and give the first responders a break. 
TIP #3 
DON’T WAKE UP THE WHOLE TEAM. 
12
03 The VictorOps On-Call Firefight Survival Guide 
TIP #4 
YOUR TEAM IS JUST AS TIRED AND STRESSED-OUT... 
(as you are.) 
When you’re in the thick of things, it’s easy to get 
angry. Maybe someone seems more distracted than 
you’d like. Maybe someone makes a joke to ease 
the tension and you don’t think it’s funny. Maybe 
someone gets short-tempered with you. Relax, and 
put things in perspective. 
IN 100 YEARS, NO ONE WILL CARE 
WHAT HAPPENED TODAY. 
Remember that everyone who’s dealing with the issue is 
feeling the pressure, maybe even more than you are. Lash-ing 
out in anger will only escalate the stress on everyone, so 
don’t do it. Try to keep an even keel, and you’ll notice that 
your calm demeanor can help diffuse other people’s panic 
and anger.
03 The VictorOps On-Call Firefight Survival Guide 
(Try to foster a culture where troubleshooting happens in a safe space.) 
Sometimes, a problem will arise because of a 
human error. It’s important to know what 
happened to cause the problem so that 
everyone can learn from the mistake. But that 
doesn’t mean that the troubleshooting and 
repair process should grind to a halt in favor 
of finger-pointing. 
WHY FINGER-POINTING 
IS COUNTER-PRODUCTIIVE 
• It takes the team’s focus away from where it should be. 
• It creates a tense atmosphere that can cause people to 
fear sharing. 
TIP #5 
PUT OFF THE BLAME DISCUSSION.
RESOURCES: 
Tools for the next incident, just in case. 
04 The VictorOps On-Call Firefight Survival Guide
04 The VictorOps On-Call Firefight Survival Guide 
POST MORTEM 
CHECKLIST 
• What happened? 
• Who was affected? 
• What was done to fix it? 
• How was the business affected? 
• What can be done to prevent this from happening again? 
• What actions they took at what time? 
• What effects they observed? 
• What expectations did they have? 
• What assumptions did they make? 
• What is their understanding of the timeline of events as they occurred? 
Generally, as a team... 
Participants should be able to account for the following, 
without fear of punishment or retribution…
04 The VictorOps On-Call Firefight Survival Guide 
ADDITIONAL 
RESOURCES 
http://puppetlabs.com/blog/devops-developers-resource-bridging-gap 
http://art.cim3.org/pm_workshop/Project_Post_Mortem/Postmortem_Template--MSF.html 
http://velocityconf.com/velocity2013/public/schedule/detail/28251 
http://victorops.force.com/knowledgebase 
http://codeascraft.com/2012/05/22/blameless-postmortems/ 
http://www.continentalmessage.com/blog/best-practices-for-establishing-on-call-contact-procedures
Be victorious. 
victorops.com/survival

More Related Content

Similar to On-call Firefight Survival Guide

Beating digital distractions
Beating digital distractionsBeating digital distractions
Beating digital distractionsscharrlibrary
 
Simulating Real World Attack
Simulating Real World AttackSimulating Real World Attack
Simulating Real World Attacktmacuk
 
451 and Endgame - Zero breach Tolerance: Earliest protection across the attac...
451 and Endgame - Zero breach Tolerance: Earliest protection across the attac...451 and Endgame - Zero breach Tolerance: Earliest protection across the attac...
451 and Endgame - Zero breach Tolerance: Earliest protection across the attac...Adrian Sanabria
 
The Cluetrain Manifesto 46 60 Chapters
The Cluetrain Manifesto 46 60 ChaptersThe Cluetrain Manifesto 46 60 Chapters
The Cluetrain Manifesto 46 60 ChaptersMarlo la O'
 
Crisis communication
Crisis communicationCrisis communication
Crisis communicationFrank Calberg
 
UX Sofia 2011 - Conrad Albrecht-Buehler
UX Sofia 2011 - Conrad Albrecht-BuehlerUX Sofia 2011 - Conrad Albrecht-Buehler
UX Sofia 2011 - Conrad Albrecht-BuehlerLucrat
 
Polpeo rehearsing-and-managing-social-media-crisis-2014
Polpeo rehearsing-and-managing-social-media-crisis-2014Polpeo rehearsing-and-managing-social-media-crisis-2014
Polpeo rehearsing-and-managing-social-media-crisis-2014Polpeo
 
Common Technology Misconceptions
Common Technology MisconceptionsCommon Technology Misconceptions
Common Technology MisconceptionsNick Toadvine
 
How to Avoid the 'Stall Call' and Other Distractions While on a Conference Call
How to Avoid the 'Stall Call' and Other Distractions While on a Conference CallHow to Avoid the 'Stall Call' and Other Distractions While on a Conference Call
How to Avoid the 'Stall Call' and Other Distractions While on a Conference CallInterCall
 
The Productive Entrepreneur
The Productive EntrepreneurThe Productive Entrepreneur
The Productive EntrepreneurLife Hacks 24
 
E moderation carrot-rehearsing-managing-social-media-crisis-july-2013
E moderation carrot-rehearsing-managing-social-media-crisis-july-2013E moderation carrot-rehearsing-managing-social-media-crisis-july-2013
E moderation carrot-rehearsing-managing-social-media-crisis-july-2013Emoderation
 
Rehearsing and Managing a Social Media Crisis
Rehearsing and Managing a Social Media CrisisRehearsing and Managing a Social Media Crisis
Rehearsing and Managing a Social Media CrisisCarrot Communications
 
Sentiment Analysis and Applications in the News and Media Industry
Sentiment Analysis and Applications in the News and Media IndustrySentiment Analysis and Applications in the News and Media Industry
Sentiment Analysis and Applications in the News and Media IndustryRobin Leonard
 
Be Productive NOT Busy! Wayne Weathersby (Podcast)
Be Productive NOT Busy! Wayne Weathersby (Podcast)Be Productive NOT Busy! Wayne Weathersby (Podcast)
Be Productive NOT Busy! Wayne Weathersby (Podcast)Wayne Weathersby
 
30 minute coaching intensives origin
30 minute coaching intensives   origin30 minute coaching intensives   origin
30 minute coaching intensives originSimon Bozeat
 
A warts free life-1 new
A warts free life-1 newA warts free life-1 new
A warts free life-1 newAbrie Snyman
 
Best Expert Tips for Managing a Remote Workforce During Coronavirus Outbreak
Best Expert Tips for Managing a Remote Workforce During Coronavirus OutbreakBest Expert Tips for Managing a Remote Workforce During Coronavirus Outbreak
Best Expert Tips for Managing a Remote Workforce During Coronavirus OutbreakSmartworks
 

Similar to On-call Firefight Survival Guide (20)

Beating digital distractions
Beating digital distractionsBeating digital distractions
Beating digital distractions
 
Crisis management 2013
Crisis management 2013Crisis management 2013
Crisis management 2013
 
Simulating Real World Attack
Simulating Real World AttackSimulating Real World Attack
Simulating Real World Attack
 
451 and Endgame - Zero breach Tolerance: Earliest protection across the attac...
451 and Endgame - Zero breach Tolerance: Earliest protection across the attac...451 and Endgame - Zero breach Tolerance: Earliest protection across the attac...
451 and Endgame - Zero breach Tolerance: Earliest protection across the attac...
 
The Cluetrain Manifesto 46 60 Chapters
The Cluetrain Manifesto 46 60 ChaptersThe Cluetrain Manifesto 46 60 Chapters
The Cluetrain Manifesto 46 60 Chapters
 
Crisis communication
Crisis communicationCrisis communication
Crisis communication
 
UX Sofia 2011 - Conrad Albrecht-Buehler
UX Sofia 2011 - Conrad Albrecht-BuehlerUX Sofia 2011 - Conrad Albrecht-Buehler
UX Sofia 2011 - Conrad Albrecht-Buehler
 
Polpeo rehearsing-and-managing-social-media-crisis-2014
Polpeo rehearsing-and-managing-social-media-crisis-2014Polpeo rehearsing-and-managing-social-media-crisis-2014
Polpeo rehearsing-and-managing-social-media-crisis-2014
 
Become a-better-listener
Become a-better-listenerBecome a-better-listener
Become a-better-listener
 
Common Technology Misconceptions
Common Technology MisconceptionsCommon Technology Misconceptions
Common Technology Misconceptions
 
How to Avoid the 'Stall Call' and Other Distractions While on a Conference Call
How to Avoid the 'Stall Call' and Other Distractions While on a Conference CallHow to Avoid the 'Stall Call' and Other Distractions While on a Conference Call
How to Avoid the 'Stall Call' and Other Distractions While on a Conference Call
 
The Productive Entrepreneur
The Productive EntrepreneurThe Productive Entrepreneur
The Productive Entrepreneur
 
E moderation carrot-rehearsing-managing-social-media-crisis-july-2013
E moderation carrot-rehearsing-managing-social-media-crisis-july-2013E moderation carrot-rehearsing-managing-social-media-crisis-july-2013
E moderation carrot-rehearsing-managing-social-media-crisis-july-2013
 
Rehearsing and Managing a Social Media Crisis
Rehearsing and Managing a Social Media CrisisRehearsing and Managing a Social Media Crisis
Rehearsing and Managing a Social Media Crisis
 
Sentiment Analysis and Applications in the News and Media Industry
Sentiment Analysis and Applications in the News and Media IndustrySentiment Analysis and Applications in the News and Media Industry
Sentiment Analysis and Applications in the News and Media Industry
 
Be Productive NOT Busy! Wayne Weathersby (Podcast)
Be Productive NOT Busy! Wayne Weathersby (Podcast)Be Productive NOT Busy! Wayne Weathersby (Podcast)
Be Productive NOT Busy! Wayne Weathersby (Podcast)
 
30 minute coaching intensives origin
30 minute coaching intensives   origin30 minute coaching intensives   origin
30 minute coaching intensives origin
 
A warts free life-1 new
A warts free life-1 newA warts free life-1 new
A warts free life-1 new
 
Online learningorientationppt1 (1)
Online learningorientationppt1 (1)Online learningorientationppt1 (1)
Online learningorientationppt1 (1)
 
Best Expert Tips for Managing a Remote Workforce During Coronavirus Outbreak
Best Expert Tips for Managing a Remote Workforce During Coronavirus OutbreakBest Expert Tips for Managing a Remote Workforce During Coronavirus Outbreak
Best Expert Tips for Managing a Remote Workforce During Coronavirus Outbreak
 

More from VictorOps

Failure as Success Devops Roadtrip Seattle 2016
Failure as Success Devops Roadtrip Seattle 2016Failure as Success Devops Roadtrip Seattle 2016
Failure as Success Devops Roadtrip Seattle 2016VictorOps
 
DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck VictorOps
 
DevOps: A Practical Guide
DevOps: A Practical GuideDevOps: A Practical Guide
DevOps: A Practical GuideVictorOps
 
Crisis Communication Webinar
Crisis Communication WebinarCrisis Communication Webinar
Crisis Communication WebinarVictorOps
 
The Importance of Minimum Viable Runbooks Webinar
The Importance of Minimum Viable Runbooks WebinarThe Importance of Minimum Viable Runbooks Webinar
The Importance of Minimum Viable Runbooks WebinarVictorOps
 
DevOps Roadtrip - Denver
DevOps Roadtrip - DenverDevOps Roadtrip - Denver
DevOps Roadtrip - DenverVictorOps
 
VictorOps & Raygun: A Stunning Integration
VictorOps & Raygun: A Stunning IntegrationVictorOps & Raygun: A Stunning Integration
VictorOps & Raygun: A Stunning IntegrationVictorOps
 
ChatOps: The New Interface of DevOps
ChatOps: The New Interface of DevOpsChatOps: The New Interface of DevOps
ChatOps: The New Interface of DevOpsVictorOps
 
6 Steps to Creating a Minimum Viable Runbook Infographic
6 Steps to Creating a Minimum Viable Runbook Infographic6 Steps to Creating a Minimum Viable Runbook Infographic
6 Steps to Creating a Minimum Viable Runbook InfographicVictorOps
 
Incident Lifecycle Infographic
Incident Lifecycle InfographicIncident Lifecycle Infographic
Incident Lifecycle InfographicVictorOps
 
Crisis Management & Why It's Important Infographic
Crisis Management & Why It's Important InfographicCrisis Management & Why It's Important Infographic
Crisis Management & Why It's Important InfographicVictorOps
 
Real World ChatOps
Real World ChatOpsReal World ChatOps
Real World ChatOpsVictorOps
 
DevOps Culture Shift: Expanding On-Call Responsibilties
DevOps Culture Shift: Expanding On-Call ResponsibiltiesDevOps Culture Shift: Expanding On-Call Responsibilties
DevOps Culture Shift: Expanding On-Call ResponsibiltiesVictorOps
 
Tips & Tricks To Reducing TTR
Tips & Tricks To Reducing TTRTips & Tricks To Reducing TTR
Tips & Tricks To Reducing TTRVictorOps
 
The Open-Source Monitoring Landscape
The Open-Source Monitoring LandscapeThe Open-Source Monitoring Landscape
The Open-Source Monitoring LandscapeVictorOps
 
Actors: Not Just for Movies Anymore
Actors: Not Just for Movies AnymoreActors: Not Just for Movies Anymore
Actors: Not Just for Movies AnymoreVictorOps
 
An Introduction to Rearview - Time Series Based Monitoring
An Introduction to Rearview - Time Series Based MonitoringAn Introduction to Rearview - Time Series Based Monitoring
An Introduction to Rearview - Time Series Based MonitoringVictorOps
 
The Art & Zen of Managing Nagios with Puppet
The Art & Zen of Managing Nagios with PuppetThe Art & Zen of Managing Nagios with Puppet
The Art & Zen of Managing Nagios with PuppetVictorOps
 
ChatOps Unplugged
ChatOps UnpluggedChatOps Unplugged
ChatOps UnpluggedVictorOps
 
Post-mortem Fail
Post-mortem FailPost-mortem Fail
Post-mortem FailVictorOps
 

More from VictorOps (20)

Failure as Success Devops Roadtrip Seattle 2016
Failure as Success Devops Roadtrip Seattle 2016Failure as Success Devops Roadtrip Seattle 2016
Failure as Success Devops Roadtrip Seattle 2016
 
DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck
 
DevOps: A Practical Guide
DevOps: A Practical GuideDevOps: A Practical Guide
DevOps: A Practical Guide
 
Crisis Communication Webinar
Crisis Communication WebinarCrisis Communication Webinar
Crisis Communication Webinar
 
The Importance of Minimum Viable Runbooks Webinar
The Importance of Minimum Viable Runbooks WebinarThe Importance of Minimum Viable Runbooks Webinar
The Importance of Minimum Viable Runbooks Webinar
 
DevOps Roadtrip - Denver
DevOps Roadtrip - DenverDevOps Roadtrip - Denver
DevOps Roadtrip - Denver
 
VictorOps & Raygun: A Stunning Integration
VictorOps & Raygun: A Stunning IntegrationVictorOps & Raygun: A Stunning Integration
VictorOps & Raygun: A Stunning Integration
 
ChatOps: The New Interface of DevOps
ChatOps: The New Interface of DevOpsChatOps: The New Interface of DevOps
ChatOps: The New Interface of DevOps
 
6 Steps to Creating a Minimum Viable Runbook Infographic
6 Steps to Creating a Minimum Viable Runbook Infographic6 Steps to Creating a Minimum Viable Runbook Infographic
6 Steps to Creating a Minimum Viable Runbook Infographic
 
Incident Lifecycle Infographic
Incident Lifecycle InfographicIncident Lifecycle Infographic
Incident Lifecycle Infographic
 
Crisis Management & Why It's Important Infographic
Crisis Management & Why It's Important InfographicCrisis Management & Why It's Important Infographic
Crisis Management & Why It's Important Infographic
 
Real World ChatOps
Real World ChatOpsReal World ChatOps
Real World ChatOps
 
DevOps Culture Shift: Expanding On-Call Responsibilties
DevOps Culture Shift: Expanding On-Call ResponsibiltiesDevOps Culture Shift: Expanding On-Call Responsibilties
DevOps Culture Shift: Expanding On-Call Responsibilties
 
Tips & Tricks To Reducing TTR
Tips & Tricks To Reducing TTRTips & Tricks To Reducing TTR
Tips & Tricks To Reducing TTR
 
The Open-Source Monitoring Landscape
The Open-Source Monitoring LandscapeThe Open-Source Monitoring Landscape
The Open-Source Monitoring Landscape
 
Actors: Not Just for Movies Anymore
Actors: Not Just for Movies AnymoreActors: Not Just for Movies Anymore
Actors: Not Just for Movies Anymore
 
An Introduction to Rearview - Time Series Based Monitoring
An Introduction to Rearview - Time Series Based MonitoringAn Introduction to Rearview - Time Series Based Monitoring
An Introduction to Rearview - Time Series Based Monitoring
 
The Art & Zen of Managing Nagios with Puppet
The Art & Zen of Managing Nagios with PuppetThe Art & Zen of Managing Nagios with Puppet
The Art & Zen of Managing Nagios with Puppet
 
ChatOps Unplugged
ChatOps UnpluggedChatOps Unplugged
ChatOps Unplugged
 
Post-mortem Fail
Post-mortem FailPost-mortem Fail
Post-mortem Fail
 

Recently uploaded

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Recently uploaded (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

On-call Firefight Survival Guide

  • 1. THE ON-CALL FIREFIGHT SURVIVAL GUIDE DOWN & DIRTY IN THE DEVOPS TRENCHES Being on-call sucks but with these tips, it can get better.
  • 2. TABLE OF CONTENTS 00 01 What happens when you’re on-call 02 BEFORE you go on-call 03 DURING a crisis 04 AFTER an incident 05 Additional resources The VictorOps On-Call Firefight Survival Guide
  • 3. SUFFERING FROM GHOST PAGER LEG? “ With the phantom vibrations, the brain sometimes misinterprets sensory input according to the preconceived hypothesis that a vibrating sensation will be coming from the phone. In other words, it seems smartphone users are just so primed for, and attentive to, the sensation of their phone going off that they simply experience the occasional false alarm.” (also known as Phantom Cellphone Vibrations or Phantom Vibration Syndrome) HOW TO STOP PHANTOM VIBRATIONS Use a different device 50% 63% 75% Change location of device Use audible ringer 0 1 http://en.wikipedia.org/wiki/Phantom_vibration_syndrome http://mentalfloss.com/article/30960/why-do-people-feel-phantom-cellphone-vibrations source: http://www.ncbi.nlm.nih.gov/ pubmed/3363267 The VictorOps On-Call Firefight Survival Guide
  • 4. Rated sleep quality was lower, and sleepiness was higher during the subsequent day. It was suggested that the effects were due to apprehension/ uneasiness induced by the prospect of being awakened by an alarm.” WHAT HAPPENS TO YOUR SLEEP... Less Rapid Eye Movement (REM) sleep Less Slow Wave Sleep (SWS) Higher Heart Rate (When on-call.) HERE’S WHAT HAPPENS The VictorOps On-Call Firefight Survival Guide source: http://www.ncbi.nlm.nih.gov/pubmed/3363267 01 “
  • 5. 013 PSYCHOLOGICAL IMPACTS OF WORKING ON-CALL, LONG-TERM (Being on-call can affect...) On-shift performance Recovery Long term health & well-being The VictorOps On-Call Firefight Survival Guide source: http://www2.hull.ac.uk/science/psychology/research/healthandappliedpsychology/oncallworking.aspx
  • 6. ALL OF THE RESEARCH MEANS NOTHING WHEN YOU’RE IN THE MIDDLE OF FIGHTING A FIRE. BEFORE, DURING and AFTER a critical incident takes place... (Some steps to follow.) 02 photo credit: http://www.flickr.com/photos/thenationalguard The VictorOps On-Call Firefight Survival Guide
  • 7. Personal contact methods: Be sure that the alerts are getting to you the way you prefer by customizing how you get notified, whether that’s SMS, email or phone. You can also use more than one way of communicating if you’re afraid of missing anything. (For example, first send a text and then call me if I don’t respond in 5 minutes.) Escalation rules: The feeling of having no back-up can bring a person to a panicked state where they begin making bad decisions. You can easily avoid this by having an escalation plan in place ahead of time. These sort of escalation policies are easy to set up so that everyone on the team knows when more people need to be involved. Alert routing: It can be beneficial if certain alerts can be seen by certain people, so when setting up alerts, be sure to route them to the team that is most capable of solving the problem. Additionally, there are times when specific alerts should be seen by different people, simply for the sake of knowing what’s happening with the infrastructure, as in the case of a CTO or SVP of Engineering. STEP #1. SETTING UP ALERTS. 02 Personal contact methods: Escalation rules: Alert routing: The correct configuration of your alerts is crucial. The VictorOps On-Call Firefight Survival Guide
  • 8. 02 The VictorOps On-Call Firefight Survival Guide STEP #2 PICK THE RIGHT HAND OFF DAY (What a difference a day makes.) Mondays and Fridays are frequently interrupted by long weekends and holidays. If your handoff is a manual process or you’re passing a physical pager around, forgetting to do the switch on a Friday could mean someone’s stuck with an extra weekend on-call. WEDNESDAY, FTW! Try doing the on-call handoff on Wednesday. Most everyone will be in the office, and you won’t have the distractions of the previous or the coming weekend interfering with an orderly handoff.
  • 9. 02 The VictorOps On-Call Firefight Survival Guide STEP #3 HOLD A HAND-OFF MEETING (All together now, kumbaya!) Most people don’t pay too much attention to the monitoring system when they’re not on-call. This is a good thing. We all need to regain our sanity and tune out when we can. But this means that when your turn does come up, you may not be prepared for everything that’s happening. New issues you haven’t heard about may have arisen in the past week. SHARE THE KNOWLEDGE Getting everyone on the team together for a short week-ly meeting on handoff day can greatly ease the transition. Whoever was on-call can talk about the big events of the previous week and if they’ve noticed anything trending in the wrong direction, they can provide some early warning. This is also a chance for the whole team to see and hear who is on-call, and to check in on how the platform is doing.
  • 10. 02 The VictorOps On-Call Firefight Survival Guide (Do you see what I see?) If you work in a large enough organization, consider having someone from each of the dis-ciplines (development, infrastructure, database management) on-call at the same time. You can split up alerts based on who should be responding, and everyone on-call will know they can contact the others if they need help out of a jam. 2 HEADS ARE BETTER THAN ONE. A second set of eyes with a different perspective can be invaluable in a firefight when every minute counts. STEP #4 USE THE BUDDY SYSTEM
  • 11. STEP #5 GO MOBILE 02 The VictorOps On-Call Firefight Survival Guide (All in the palm of your hand.) These days, most people carry a computer in their pocket. Smart phones may not be an ideal way to manage complex systems, but apps exist that can enable remediation for many problems without having to go home or hunt for a wifi hotspot. LIFE SAVERS By adopting technologies and policies that allow for remote problem solving you can reduce time-to-resolution and save marriages at the same time.
  • 12. CRISIS MANAGEMENT: Tips to remember when you’re in the trenches 03 The VictorOps On-Call Firefight Survival Guide
  • 13. 03 The VictorOps On-Call Firefight Survival Guide TIP #1 CLEAR HEADS SOLVE PROBLEMS (It’s right in front of you.) We’ve all experienced the situation where the solution to a problem is staring you in the face but you’re too distracted to see it. Jumping right into a problem when you’re still thinking about that jerk that cut you off in traffic will only increase your stress and decrease your focus. WAYS TO LET GO OF DISTRACTIONS The time you take to get into a problem-solving mode will more than pay for itself in quicker, better, more creative solutions. Get a drink of water. Go for a walk around the block. Take some deep breaths.
  • 14. 03 The VictorOps On-Call Firefight Survival Guide TIP #2 DONT TRY TO TAKE THE HILL BY YOURSELF. (Be wary of hero culture.) Techies love to tell stories of when they single-handedly saved their Fortune-500 company from disaster using only their wits. Here’s a hint: most of these stories are nonsense. Taking sole responsibility for solving a major problem puts too much pressure on you, and makes it harder to focus. DID YOU SEE THAT? Besides, you’re managing a big distributed platform, right? There’s a good chance that no one in your organization per-fectly understands every moving part. Even if you are deal-ing with things you know, a second set of eyes might notice that minor typo in the config that you didn’t see.
  • 15. 03 The VictorOps On-Call Firefight Survival Guide (Work in shifts for extended crises.) Sometimes the problem is so serious that, even when the root cause is found, it’s clear that it will take hours or even days to get it fixed. Get enough people on the problem to fix the problem, and let everyone else keep on with whatever they’re doing. PRESERVE THE RESERVES. When you reach hour 12 of the recovery process, you’ll be grateful that there are fresh, rested people ready to take over and give the first responders a break. TIP #3 DON’T WAKE UP THE WHOLE TEAM. 12
  • 16. 03 The VictorOps On-Call Firefight Survival Guide TIP #4 YOUR TEAM IS JUST AS TIRED AND STRESSED-OUT... (as you are.) When you’re in the thick of things, it’s easy to get angry. Maybe someone seems more distracted than you’d like. Maybe someone makes a joke to ease the tension and you don’t think it’s funny. Maybe someone gets short-tempered with you. Relax, and put things in perspective. IN 100 YEARS, NO ONE WILL CARE WHAT HAPPENED TODAY. Remember that everyone who’s dealing with the issue is feeling the pressure, maybe even more than you are. Lash-ing out in anger will only escalate the stress on everyone, so don’t do it. Try to keep an even keel, and you’ll notice that your calm demeanor can help diffuse other people’s panic and anger.
  • 17. 03 The VictorOps On-Call Firefight Survival Guide (Try to foster a culture where troubleshooting happens in a safe space.) Sometimes, a problem will arise because of a human error. It’s important to know what happened to cause the problem so that everyone can learn from the mistake. But that doesn’t mean that the troubleshooting and repair process should grind to a halt in favor of finger-pointing. WHY FINGER-POINTING IS COUNTER-PRODUCTIIVE • It takes the team’s focus away from where it should be. • It creates a tense atmosphere that can cause people to fear sharing. TIP #5 PUT OFF THE BLAME DISCUSSION.
  • 18. RESOURCES: Tools for the next incident, just in case. 04 The VictorOps On-Call Firefight Survival Guide
  • 19. 04 The VictorOps On-Call Firefight Survival Guide POST MORTEM CHECKLIST • What happened? • Who was affected? • What was done to fix it? • How was the business affected? • What can be done to prevent this from happening again? • What actions they took at what time? • What effects they observed? • What expectations did they have? • What assumptions did they make? • What is their understanding of the timeline of events as they occurred? Generally, as a team... Participants should be able to account for the following, without fear of punishment or retribution…
  • 20. 04 The VictorOps On-Call Firefight Survival Guide ADDITIONAL RESOURCES http://puppetlabs.com/blog/devops-developers-resource-bridging-gap http://art.cim3.org/pm_workshop/Project_Post_Mortem/Postmortem_Template--MSF.html http://velocityconf.com/velocity2013/public/schedule/detail/28251 http://victorops.force.com/knowledgebase http://codeascraft.com/2012/05/22/blameless-postmortems/ http://www.continentalmessage.com/blog/best-practices-for-establishing-on-call-contact-procedures