Best Practices for On-Call Teams
2021
Mandi Walls
DevOps Advocate
@lnxchk
mwalls@pagerduty.com
Proprietary & Confidential
On-Call
A formalized process and schedule for responding to unplanned incidents, alerts, and/or system, service, or application issues
Why On-Call?
Bring Subject Matter Experts (SMEs) in at the beginning
Reduce the chaos of responding to alerts and incidents
Minimize time to acknowledge and resolve
Minimize handoffs, context switching, and burnout
Every Business is a Digital Business
Make payments
Shop online
Be entertained
Order food
Be connected
Get around
Do work
Buy anything
Stay healthy
Who Goes On-Call?
Benefits of On-Call to Your Team
Know Exactly:
When to be available
Who to call
What you might be called for
Choose schedules that meet service needs
Handoff Meetings
A formal handoff to the new on-call responder to ensure they have all the context they need for their shift
Equipment
(Usually) Company Provided:
Laptop
Phone
Access to the Internet
Backup Access
Accounts and Access
Prepare a checklist for your team
❏ Working local copy of repos
❏ Configured environments
❏ Current credentials for third-party services
❏ VPN access
❏ Passwords and permissions to environments
❏ Access to monitoring and dashboards
Team Norms
Responsibilities
PREVENT
RESOLVE
MOBILIZE
TRIAGE
Not Responsibilities
Solve everything
Keep a regular workload
Sit on every incident
Spend every waking moment on alerts
Building an On-Call Culture
Empathy
Psychological Safety
Onboarding
Create Shadow Rotations:
New folks join in “listen only mode”
Allows people to learn new ways of operating
Creates a low-stress environment
Builds confidence
Iteration and Improvement
MTTA (mean time to acknowledge) and MTTR (mean time to resolve)
Mean Time to Market for fixes
Escaped defects (bugs that reached production undetected)
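As a rough illustration of how the first two metrics can be tracked, the sketch below computes MTTA and MTTR from a list of incident timestamps. The incident records, the function names, and the choice to measure both metrics from the moment the incident triggered are assumptions for this example; teams define MTTR in different ways.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered, acknowledged, resolved).
incidents = [
    (datetime(2021, 3, 1, 2, 0), datetime(2021, 3, 1, 2, 4), datetime(2021, 3, 1, 2, 50)),
    (datetime(2021, 3, 3, 14, 0), datetime(2021, 3, 3, 14, 2), datetime(2021, 3, 3, 14, 32)),
    (datetime(2021, 3, 7, 23, 30), datetime(2021, 3, 7, 23, 39), datetime(2021, 3, 8, 0, 38)),
]


def mean_minutes(deltas):
    """Average a series of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mtta_minutes(records):
    """Mean time from trigger to acknowledgement (assumed definition)."""
    return mean_minutes(ack - trig for trig, ack, _ in records)


def mttr_minutes(records):
    """Mean time from trigger to resolution (assumed definition)."""
    return mean_minutes(res - trig for trig, _, res in records)


print(f"MTTA: {mtta_minutes(incidents):.1f} min")
print(f"MTTR: {mttr_minutes(incidents):.1f} min")
```

Watching how these numbers move from one rotation to the next is usually more useful than any single value.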
Escalating Beyond Your Team
Complexity complicates diagnoses
Don’t keep the wrong people involved
Focus on folks who can fix
Keep other stakeholders informed out of band
Never Hesitate to Escalate
Initiating Major Incidents
Ensure you have a way to kick off a Major Incident
!ic page
Etiquette and Setting Expectations
Humane On-Call
Allow rescheduling
Pre-emptive backup notifications
Sleep is necessary
Watch for burnout
Sleep. Is. Necessary.
Team Participation
Equal responsibilities
Holiday coverage
Monitor hours on call and the number of sleep-time alerts
Talk about good behaviors
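One way to monitor sleep-time alerts is to count, per responder, the alerts that arrived during an agreed sleep window. The sketch below is a minimal example under stated assumptions: the 22:00 to 06:00 window, the responder names, and the alert records are all hypothetical, and timestamps are assumed to already be in each responder's local time.

```python
from collections import Counter
from datetime import datetime, time

# Assumed sleep window; adjust to what the team agrees on.
SLEEP_START = time(22, 0)
SLEEP_END = time(6, 0)


def is_sleep_time(ts):
    """True if the timestamp falls inside the overnight sleep window."""
    t = ts.time()
    # The window wraps past midnight, so either condition qualifies.
    return t >= SLEEP_START or t < SLEEP_END


def sleep_alerts_per_responder(alerts):
    """Count sleep-time alerts for each responder."""
    return Counter(who for who, ts in alerts if is_sleep_time(ts))


# Hypothetical alert log: (responder, local time the alert fired).
alerts = [
    ("ana", datetime(2021, 5, 3, 2, 15)),
    ("ana", datetime(2021, 5, 3, 14, 0)),
    ("ben", datetime(2021, 5, 4, 23, 40)),
    ("ana", datetime(2021, 5, 5, 5, 59)),
    ("ben", datetime(2021, 5, 6, 6, 0)),
]

for who, count in sorted(sleep_alerts_per_responder(alerts).items()):
    print(f"{who}: {count} sleep-time alerts")
```

A lopsided count is a prompt for a conversation about rebalancing the rotation or fixing the noisy service, not a score.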
Time Management
Lower regular workload during on-call weeks
Avoiding Burnout
Monitor team health
Work on systematic improvement with peers
Extend empathy to team members
Implementing Suggestions
Taking Stock of Your Alerts
Manage alert fatigue
Ensure all alerts are actionable
Complete and current docs
Manage external dependencies
Clear the noise
Disable junk alerts
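A periodic alert audit can be sketched as a small script that flags alerts worth reviewing. Everything in the example below is an assumption for illustration: the `AlertStats` fields, the 50% action-rate threshold, the sample data, and the runbook URL are hypothetical and would need to come from your own monitoring exports.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Hypothetical summary of one alert rule over a review period."""
    name: str
    fired: int            # how many times the alert fired
    actioned: int         # how many times a responder had to act on it
    runbook_url: str = ""  # empty string means no documentation is linked


def review_candidates(stats, min_action_rate=0.5):
    """Return {alert name: [reasons]} for alerts that deserve a second look."""
    flagged = {}
    for s in stats:
        reasons = []
        if not s.runbook_url:
            reasons.append("no runbook")
        if s.fired and s.actioned / s.fired < min_action_rate:
            reasons.append("rarely actionable")
        if reasons:
            flagged[s.name] = reasons
    return flagged


# Hypothetical sample data.
stats = [
    AlertStats("disk-full", fired=10, actioned=9,
               runbook_url="https://wiki.example.com/runbooks/disk-full"),
    AlertStats("cpu-spike", fired=40, actioned=2,
               runbook_url="https://wiki.example.com/runbooks/cpu-spike"),
    AlertStats("cert-expiry", fired=3, actioned=3),
]

for name, reasons in review_candidates(stats).items():
    print(f"{name}: {', '.join(reasons)}")
```

Flagged alerts are candidates for tuning, documenting, or disabling; the final decision stays with the team that owns the service.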
Prioritization of Projects
Start with stable
Prioritize for the customer
Don’t inflate priorities
Leveraging the Traditional NOC
L1 response
Utilize runbooks
Cross-train as Incident Commanders
Flexible Models
Experiment with shift length
Utilize follow-the-sun and sleep/wake when possible
24x7, 24x5, 8x5 as appropriate for services
Partner with other teams
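To make the follow-the-sun idea concrete, here is a minimal sketch that picks which regional team is on call for a given moment. The three regions, the eight-hour UTC handoff boundaries, and the function name are assumptions for illustration, not a recommended schedule.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical handoff table: (UTC hour the shift starts, team on call).
# Entries must be sorted by start hour.
HANDOFFS = [(0, "APAC"), (8, "EMEA"), (16, "Americas")]


def team_on_call(now):
    """Return the team covering the given timezone-aware datetime."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour = now.astimezone(timezone.utc).hour
    current = HANDOFFS[0][1]
    for start, team in HANDOFFS:
        # Keep the latest shift whose start hour has already passed.
        if hour >= start:
            current = team
    return current


print(team_on_call(datetime(2021, 6, 1, 3, 0, tzinfo=timezone.utc)))
print(team_on_call(datetime(2021, 6, 1, 10, 0, tzinfo=timezone(timedelta(hours=-8)))))
```

With a table like this, each team's shift can fall inside its own working hours, which is the point of the model: nobody is paged while asleep as a matter of routine.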
Establishing Good Behaviors
Set standards for the team
Manage MTTA, MTTR
Don’t hesitate to escalate
Sober on-call
Additional Resources
PagerDuty Resources
For step-by-step instructions on setting up your team in PagerDuty, see the On-Call Rotations and Schedules resources page
How to Get Notified Before You Go On-Call in PagerDuty
Sign up for our e-book
Keep an eye on our events page (https://www.pagerduty.com/events/) for meetups, webinars, PagerDuty Connects, and other opportunities
For in-depth training, check out PagerDuty University: https://www.pagerduty.com/university/
Join the PagerDuty Community at https://forums.pagerduty.com
Industry Resources
Increment, a magazine published by Stripe, devoted its very first issue to on-call: https://increment.com/on-call/
Alice Goldfuss’s open source on-call handbook: https://github.com/alicegoldfuss/oncall-handbook
New Relic shares some of their best practices for on-call, as well as their incident response workflows: https://blog.newrelic.com/engineering/on-call-and-incident-response-new-relic-best-practices/
In "Mean Time to Sleep," a classic session from the Velocity conference, Etsy’s team talks about how they worked to quantify their on-call: https://www.youtube.com/watch?v=FLqucVb_et0&feature=youtu.be&ab_channel=LaurieDenness
More resources at https://goingoncall.pagerduty.com/resources

PagerDuty: Best Practices for On Call Teams