The worst time to discover your processes are totally broken? During a live incident. Avoid incident management flops by running regular simulations that actually reflect a realistic incident environment. I'll teach you how.
Palms are sweaty, knees weak, arms are heavy... Sound like your first on-call shift?
One of the biggest challenges in incident response work, especially for newer SREs, is the lack of safe spaces to fail. Incident simulations can be an effective way to take the terror out of that first on-call shift, but they take careful planning. In this talk, I’ll explore different types of simulations (from tabletops to full-on realistic mock incidents), how and when to utilize them, and how to make sure you get the most out of them.
Ashley Sawatsky is an expert in incident management and communication, with a focus on the SaaS world. In her 6+ years of experience building and scaling Shopify's incident response program, she developed the ability to fluently translate the technical aspects of SRE incident response to Legal, PR, Customer Support, and Executive stakeholders. Now, she works at Rootly as Senior Developer Relations Advocate, where she engages with the SRE and incident response communities, and consults with customers from the world's largest tech companies, including Canva, Figma, and NVIDIA, on their incident response strategies.
1. Fake It ‘Til You Make It
How to get the most out of your incident simulations
Ashley Sawatsky
Reliability Advocate
2. Hey, I’m Ashley 👋
➔ Managed guest complaints / escalations at Disney
➔ Joined Shopify in 2016
➔ Founding member of Shopify’s Incident Response team, went on to build and lead Incident Communications
➔ Joined Rootly this year where I get to help other organizations level up their incident management practices!
9. 👀 You’re looking for…
➔ Single points of failure
➔ Gaps in process
➔ Bad process
➔ Communication breakdowns
➔ Misalignment
➔ Broken escalation paths
➔ Slowdowns
13. This is not a performance exercise.
1 Incident simulations shouldn’t feel like a test for employees. Make it clear to everyone participating that the only consequence of making mistakes during the simulation is learning.
14. Be transparent.
2 People should know they’re participating in a simulation that is intended to challenge them and expose weaknesses in the incident response process.
15. Make it feel real.
3 The environment should be as realistic and close to the actual experience of running an incident as possible.
16. Don’t stop the clock.
4 We’ll get into the mechanics soon, including how to handle it when someone needs to step “outside” of the simulation in a way that doesn’t derail the overall scenario.
17. Save feedback for the retro.
5 Encourage participants to avoid breaking the “fourth wall”. No “next time we run one of these we should…” while the simulation is active.
18. Make it fun.
6 Let the participants expense their lunch that day, or send a fun gift like a relaxing scented candle, a specially designed sticker, etc.
21. Don’t go nuclear.
2 Scenarios that are too severe are likely to be escalated to VP level or higher. For crisis exercises, bring in a neutral third party.
22. Work in some curveballs.
3 Think of your scenario as dynamic, like a choose-your-own-adventure. Have some curveballs ready to throw to keep things interesting.
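One way to prepare curveballs ahead of time is to write the scenario down as a small data structure the facilitator can consult during the simulation. The following is a minimal sketch only, not something from the talk: the class names (`Scenario`, `Curveball`), the trigger labels, and the example incident are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Curveball:
    """A twist the facilitator can inject while the simulation is running."""
    trigger: str        # hypothetical label for what the facilitator is watching for
    inject: str         # the event or message to introduce
    used: bool = False  # each curveball is thrown at most once


@dataclass
class Scenario:
    """A mock incident with an opening prompt and a set of prepared curveballs."""
    title: str
    opening_prompt: str
    curveballs: List[Curveball] = field(default_factory=list)

    def next_curveball(self, observed: str) -> Optional[Curveball]:
        """Return the first unused curveball whose trigger matches, marking it used."""
        for cb in self.curveballs:
            if not cb.used and cb.trigger == observed:
                cb.used = True
                return cb
        return None


# Hypothetical example scenario; the details are invented for illustration.
scenario = Scenario(
    title="Checkout latency",
    opening_prompt="Alert: checkout p99 latency is climbing.",
    curveballs=[
        Curveball("responders_ahead_of_schedule", "A second region starts reporting errors."),
        Curveball("escalation_requested", "The secondary on-call does not acknowledge the page."),
    ],
)

picked = scenario.next_curveball("escalation_requested")
print(picked.inject if picked else "No curveball prepared for this moment.")
```

Keeping the triggers as plain labels means the facilitator still decides when a curveball fits; the structure only makes sure the options are ready before the clock starts.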
26. Facilitator
➔ Can be the organizer or someone else
➔ Ensures the simulation runs smoothly by thinking on their feet
➔ Welcomes participants
➔ Available for questions from participants
➔ Notifies participants when simulation is over
➔ Runs post-simulation retrospective
27. Responders
➔ Identify responders for every incident role
➔ Don’t default to your strongest responders
➔ May be new to on-call or in need of practice to build confidence
➔ Work with leads to identify who would most benefit
28. Observers
➔ Your strongest incident responders from each response team (SRE, Communications, Legal, etc.)
➔ Experts in your organization’s incident response practices
➔ Silent observers throughout the simulation
➔ Noting opportunities for improvement
29. Scribe
➔ Documents key moments, decisions, observer discussion points, and responder actions
➔ Prepares timelines for retrospective
➔ Supports the facilitator as needed
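The scribe's notes can be kept in a form that turns into a retrospective timeline with little extra work. Here is a rough sketch, assuming the scribe tags each note with one of the categories listed above. The `ScribeLog` and `TimelineEntry` names and the sample entries are hypothetical, not taken from the talk or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TimelineEntry:
    at: datetime
    kind: str  # e.g. "key_moment", "decision", "observer_note", "action"
    note: str


class ScribeLog:
    """Collects notes in any order and renders them chronologically."""

    def __init__(self) -> None:
        self.entries: List[TimelineEntry] = []

    def record(self, at: datetime, kind: str, note: str) -> None:
        self.entries.append(TimelineEntry(at, kind, note))

    def timeline(self) -> List[str]:
        # Sort by timestamp so late-added notes still land in the right place.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return [f"{e.at:%H:%M} [{e.kind}] {e.note}" for e in ordered]


# Hypothetical notes from a simulation that started at 10:00.
start = datetime(2024, 1, 1, 10, 0)
log = ScribeLog()
log.record(start + timedelta(minutes=12), "decision", "Incident commander declares severity.")
log.record(start + timedelta(minutes=5), "action", "Responder acknowledges the page.")

for line in log.timeline():
    print(line)
# 10:05 [action] Responder acknowledges the page.
# 10:12 [decision] Incident commander declares severity.
```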
33. The Viewing Room
➔ Video call where the facilitator, observers, and scribe will hang out during the simulation
➔ Discuss how things are going, adjust as needed
➔ Note any observations to discuss in the retrospective
➔ Scribe records key points
40. Manage incidents the easy way with Rootly
Rootly brings your entire incident response tooling suite together so you can manage incidents from start to finish right from Slack.
Create automated workflows that are completely customized to your incident response process to handle tasks like paging, escalation, surfacing playbooks, syncing action items to Jira and other tooling, updating your status page, and more.
Want to see it in action? Come find me between talks!