How to Build a Healthy On-Call Culture

SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN
How to build a healthy  
on-call culture

https://unsplash.com/photos/yO3whNbzxsc
2014 failure: https://www.theverge.com/2014/10/3/6414949/911-call-failures-fcc
2018 failure: https://edition.cnn.com/2018/12/28/us/centurylink-outage-911-calls/index.html

ON-CALL CAN BE A SOURCE OF
STRESS AND BURNOUT.
https://unsplash.com/photos/Of8C-QHqagM

The analysis revealed significant effects of
extended work availability on the daily
start-of-day mood and cortisol awakening
response.
EXTENDED WORK AVAILABILITY AND ITS RELATION WITH START-OF-DAY MOOD AND CORTISOL.
J OCCUP HEALTH PSYCHOL. 2016 JAN;21(1):105-18. DOI: 10.1037/A0039602. EPUB 2015 AUG 3.

https://www.atlassian.com/blog/software-teams/modern-software-development-trends

https://en.dopl3r.com/memes/hot-topics/microservices/247404

You build it, you run it.
DR. WERNER VOGELS, CTO AMAZON

Dev - Ops
Developers on-call
Dev - Management
Increasing demands

Everything was fine, until it wasn’t.

INCIDENT TIMELINE
5:30pm  Customers report problem.
5:50pm  Page “alerting” team through Slack app.
5:55pm  On-call engineer looks at recent changes on Jira.
6:00pm  On-call adds me as a responder. So, I get paged.
6:15pm  We bring in the incident response team and enter statuspage entry.

INCIDENT TIMELINE
6:40pm  We disable one of the clusters and stop the problem.
6:45pm  Get alerts from Cloudwatch and associate them with the incident.
8:00pm  After a lot of debugging, we find a bug.
9:00pm  Fix the code and ship it. We still have inconsistencies.
2:00am  Run data sync job and bring the app back into a healthy state.
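
The timeline above can be turned into a few rough numbers. This is a minimal sketch, not part of the talk: the clock times come from the slides, while the calendar date, the helper names, and the choice of which events count as "paged", "mitigated", and "resolved" are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Clock times are from the slides; the calendar date is a placeholder assumption.
DAY = datetime(2000, 1, 1)

def at(hour, minute, next_day=False):
    """Build a timestamp on the placeholder day (or the day after)."""
    return DAY.replace(hour=hour, minute=minute) + timedelta(days=1 if next_day else 0)

# Assumed milestones picked from the incident timeline.
reported = at(17, 30)                 # Customers report problem.
paged = at(17, 50)                    # "Alerting" team is paged through Slack.
mitigated = at(18, 40)                # One cluster disabled, problem stops.
resolved = at(2, 0, next_day=True)    # Data sync done, app healthy again.

def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)

time_to_page = minutes_between(reported, paged)
time_to_mitigate = minutes_between(reported, mitigated)
time_to_resolve = minutes_between(reported, resolved)

print(time_to_page, time_to_mitigate, time_to_resolve)
```

Under these assumptions the team was paged 20 minutes after the first customer report, the impact stopped after 70 minutes, and the app was fully healthy after 510 minutes (8.5 hours).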

KEY TAKEAWAYS
Actionable alerts
Training
Transparency
Analysis and learning

TAKEAWAY 1
ACTIONABLE ALERTS PROVIDE CONTEXT AND GUIDANCE TO REDUCE MTTR AND STRESS.

Actionable alerts
Automated alerting
Catch inconsistencies before customer impact. Group similar alerts automatically.
Escalation paths
Make it easy to call for help. Ensure someone is taking care of the problem.
One-click actions and guides
Leverage one-click actions to triage and remediate issues. Have runbooks as guides.
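
Two of the ideas on this slide, grouping similar alerts and escalation paths, can be sketched in a few lines. This is an illustrative sketch only: the `Alert` fields, the grouping key (service plus check), and the escalation policy with its delays and roles are all hypothetical and are not taken from the talk or from any specific alerting tool.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    service: str   # hypothetical field: which service raised the alert
    check: str     # hypothetical field: which check fired
    message: str

def group_alerts(alerts):
    """Group alerts that share the same service and check into one bucket."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.check)].append(alert)
    return dict(groups)

# Hypothetical escalation policy: (minutes unacknowledged, who gets notified).
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (10, "secondary on-call"),
    (30, "engineering manager"),
]

def responders_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Everyone who should have been notified by now if nobody acknowledged."""
    return [who for delay, who in policy if minutes_unacknowledged >= delay]

alerts = [
    Alert("api", "latency", "p99 latency high"),
    Alert("api", "latency", "p99 latency still high"),
    Alert("db", "disk", "disk 90% full"),
]
groups = group_alerts(alerts)
print(len(groups))                 # number of distinct alert groups
print(responders_to_notify(15))    # who is in the loop after 15 minutes
```

Here the three raw alerts collapse into two groups, so the on-call engineer sees two items to act on instead of three, and an alert that stays unacknowledged for 15 minutes reaches both the primary and the secondary on-call.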

TAKEAWAY 2
TRAINING GIVES CONFIDENCE.

Training
Onboarding
Get new engineers ready to be on-call. Explain the basics and give access to the right tools. Use shadowing as you bring new people in.
Game day
Rehearse like it is real. Know your role during incidents and have fun at the same time.

TAKEAWAY 3
TRANSPARENCY MAKES ON-CALL MORE HUMANE.

Transparency
Open company, no bullshit
Make it written, make it available.
atlassian.com/software/jira/ops/handbook
Statuspage updates
Communicate incident status with internal and external stakeholders.

TAKEAWAY 4
ANALYZE AND CONTINUOUSLY LEARN FROM EACH INCIDENT.

Analysis and learning
Collect operational data
Record every detail of on-call changes and the incident response process.
Postmortems
Write a detailed document on the incident. While doing that, don’t blame anyone.
Compensate
Remember: On-call is not leisure time. Give your employees something in return.
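
As one example of what collected operational data makes possible, the sketch below counts out-of-hours pages per engineer, a number that could feed into both workload analysis and compensation. Everything here is assumed for illustration: the page log is made-up sample data, the engineer names are placeholders, and the 9:00 to 18:00 weekday working window is an arbitrary choice.

```python
from collections import Counter
from datetime import datetime

# Made-up page log for illustration: (engineer, time the page fired).
pages = [
    ("alice", datetime(2000, 1, 3, 14, 5)),   # Monday afternoon
    ("alice", datetime(2000, 1, 4, 2, 30)),   # Tuesday, middle of the night
    ("bob", datetime(2000, 1, 4, 23, 10)),    # Tuesday, late evening
    ("alice", datetime(2000, 1, 8, 11, 0)),   # Saturday
]

def is_out_of_hours(ts, start_hour=9, end_hour=18):
    """True on weekends, or outside start_hour..end_hour on weekdays."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)

def out_of_hours_pages(page_log):
    """Count out-of-hours pages per engineer."""
    return Counter(who for who, ts in page_log if is_out_of_hours(ts))

counts = out_of_hours_pages(pages)
print(counts)
```

With this sample log, alice has two out-of-hours pages and bob has one; the Monday afternoon page is not counted because it falls inside the assumed working window.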

https://unsplash.com/photos/hRdVSYpffas
ON-CALL CAN BE  
HAPPIER AND HEALTHIER

SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN
Thank you!
