Raise your hand if you enjoy being buried in alerts or woken up at 2am. (Yeah... thought so.) Ever-rising customer expectations around high availability and performance put massive pressure on the teams who develop and support SaaS products. And teams are literally losing sleep over it.
Until outages and other incidents are a thing of the past, organizations need to invest in a way of dealing with them that won't lead to burnout. In this session, you'll learn how to combine the latest tooling with DevOps practices in pursuit of a sustainable incident response workflow. It's all about transparency, actionable alerts, resilience, and learning from each incident.
3. ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT.
https://unsplash.com/photos/Of8C-QHqagM
4. "The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response."
Extended work availability and its relation with start-of-day mood and cortisol. J Occup Health Psychol. 2016 Jan;21(1):105-18. doi: 10.1037/a0039602. Epub 2015 Aug 3.
16. INCIDENT TIMELINE
5:30pm: Customers report problem.
5:50pm: Page “alerting” team through Slack app.
5:55pm: On-call engineer looks at recent changes on Jira.
6:00pm: On-call adds me as a responder. So, I get paged.
6:15pm: We bring in the incident response team and add a Statuspage entry.
21. INCIDENT TIMELINE
6:40pm: We disable one of the clusters and stop the problem.
6:45pm: Get alerts from CloudWatch and associate them with the incident.
8:00pm: After a lot of debugging, we find a bug.
9:00pm: Fix the code and ship it. We still have inconsistencies.
2:00am: Run data sync job and bring the app back into a healthy state.
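To make the length of this incident concrete, the clock times on the timeline slides can be turned into elapsed minutes. The following is a rough sketch only: the calendar date is an assumption (the slides give clock times, not dates), 2:00am is taken to fall on the following day, and the names `at`, `minutes_between`, `time_to_page`, `time_to_mitigate`, and `time_to_recover` are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical date: the slides only give clock times.
DAY = datetime(2020, 1, 1)


def at(hour, minute, next_day=False):
    """Build a timestamp on DAY, or on the day after when next_day is True."""
    return DAY.replace(hour=hour, minute=minute) + timedelta(days=1 if next_day else 0)


# Events and times as they appear on the timeline slides.
timeline = [
    (at(17, 30), "Customers report problem"),
    (at(17, 50), "Alerting team paged through Slack app"),
    (at(17, 55), "On-call engineer looks at recent changes on Jira"),
    (at(18, 0), "Additional responder paged"),
    (at(18, 15), "Incident response team brought in, Statuspage entry added"),
    (at(18, 40), "One cluster disabled, problem stops"),
    (at(18, 45), "CloudWatch alerts associated with the incident"),
    (at(20, 0), "Bug found"),
    (at(21, 0), "Fix shipped, inconsistencies remain"),
    (at(2, 0, next_day=True), "Data sync job run, app healthy"),
]


def minutes_between(start, end):
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


first_report = timeline[0][0]
time_to_page = minutes_between(first_report, timeline[1][0])
time_to_mitigate = minutes_between(first_report, timeline[5][0])
time_to_recover = minutes_between(first_report, timeline[-1][0])

for when, what in timeline:
    print(f"+{minutes_between(first_report, when):>3} min  {what}")
```

Under these assumptions, the gap between the first customer report and the healthy state is eight and a half hours, most of it spent after the immediate problem had already been stopped.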
24. Actionable alerts
Automated alerting
Catch inconsistencies before customer impact. Group similar alerts automatically.
Escalation paths
Make it easy to call for help. Ensure someone is taking care of the problem.
One-click actions and guides
Leverage one-click actions to triage and remediate issues. Have runbooks as guides.
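As an illustration of grouping similar alerts and of an escalation path, here is a minimal sketch. It is not how any particular alerting product works: the `Alert` fields, the grouping key (source, service, check), the sample alerts, and the timings and roles in `ESCALATION_PATH` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str   # monitoring tool that raised the alert (hypothetical field)
    service: str  # affected service (hypothetical field)
    check: str    # name of the failing check (hypothetical field)
    message: str


def group_alerts(alerts):
    """Group alerts that share the same source, service, and check."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.source, alert.service, alert.check)].append(alert)
    return dict(groups)


# Hypothetical escalation path: (minutes without acknowledgement, who to notify).
ESCALATION_PATH = [
    (0, "primary on-call"),
    (10, "secondary on-call"),
    (20, "engineering manager"),
]


def who_to_notify(minutes_unacknowledged):
    """Return everyone who should have been notified after this many minutes."""
    return [who for after, who in ESCALATION_PATH if minutes_unacknowledged >= after]


# Made-up sample alerts: three notifications, two underlying problems.
sample = [
    Alert("cloudwatch", "sync", "queue-depth", "queue depth above threshold"),
    Alert("cloudwatch", "sync", "queue-depth", "queue depth still above threshold"),
    Alert("cloudwatch", "api", "error-rate", "5xx rate above threshold"),
]
grouped = group_alerts(sample)
print(len(sample), "alerts in", len(grouped), "groups")
print(who_to_notify(12))
```

Grouping on a fixed key is the simplest possible choice; the point is only that the on-call engineer sees one item per problem, and that an unacknowledged alert always reaches someone.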
28. Training
Onboarding
Get new engineers ready to be on-call. Explain the basics and give access to the right tools. Use shadowing as you bring new people in.
Game day
Rehearse like it is real. Know your role during incidents and have fun at the same time.
31. Transparency
Open company, no bullshit
Make it written, make it available. atlassian.com/software/jira/ops/handbook
Statuspage updates
Communicate incident status with internal and external stakeholders.
34. Analysis and learning
Collect operational data
Record every detail of on-call changes and the incident response process.
Postmortems
Write a detailed document on the incident. While doing that, don’t blame anyone.
Compensate
Remember: On-call is not leisure time. Give your employees something in return.
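One way operational data can feed the compensation discussion is a count of how often each person is paged outside working hours. The sketch below assumes a simple page log of (responder, timestamp) pairs; the names, the timestamps, and the 9:00 to 18:00 weekday working hours are all invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical page log: (responder, time the page was sent).
pages = [
    ("alice", datetime(2020, 1, 6, 14, 5)),   # Monday afternoon
    ("alice", datetime(2020, 1, 7, 2, 10)),   # Tuesday, middle of the night
    ("bob", datetime(2020, 1, 8, 23, 45)),    # Wednesday, late evening
    ("bob", datetime(2020, 1, 11, 11, 0)),    # Saturday
    ("alice", datetime(2020, 1, 9, 9, 30)),   # Thursday morning
]


def is_out_of_hours(when, start_hour=9, end_hour=18):
    """True on weekends or outside the assumed weekday working hours."""
    return when.weekday() >= 5 or not (start_hour <= when.hour < end_hour)


def out_of_hours_pages(page_log):
    """Count out-of-hours pages per responder."""
    return Counter(who for who, when in page_log if is_out_of_hours(when))


load = out_of_hours_pages(pages)
for who, count in load.most_common():
    print(f"{who}: {count} out-of-hours page(s)")
```

A tally like this only works if pages are recorded consistently, which is the reason to collect the data in the first place.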