The story of firefighting: learnings from incident management
Daniil Doronkin
An incident is inevitable
Amazon
Bad weather caused power failures throughout AWS US East.
Google
Successive lightning strikes on their European datacenter caused loss of
power to Google Compute Engine storage systems within that region.
GitHub
On January 28th, 2016, GitHub experienced a power disruption at their primary datacenter.
Publicly available post-mortem list
Incident Experience
Stop “bleeding” · Turn costs into learnings · Increase system maturity
Love learning new things about distributed systems
Joining on-call to help with incidents
Arrived from the Netherlands by train in time
The opinions expressed in this presentation are strictly
personal and do not reflect the official stance of any
organization I am associated with
The business is
growing rapidly
This has benefits, but also brings a number
of challenges and raises expectations.
More features here and now
The team has to figure out how to deliver high quality
in a timely manner
Many customers want to be on board
The platform must be able to scale accordingly
Scaling across continents
Geo-distribution is the key to success
Tools · Roles · Collaboration · Impact · Priority · Escalation · Logs · Metrics · Traces
Where to start?
Firefighter: investigation, hotfix, monitoring
Coordinator: decision making, tracking timelines, involving people
Communicator: stakeholder updates, internal communications
(Updates, directs, and briefs flow between the three roles.)
Comfortable
Environment
A well-equipped environment provides
the necessary tools and resources to
respond effectively to incidents
PagerDuty, Grafana OnCall, Squadcast and more
Incident response tools to manage an incident
workflow end to end
Slack, Zoom, Teams and more
Communication tools to support collaboration and
keep interested parties posted
Custom tools
Panic buttons, Slack Ops bots, documentation storage,
and anything else that helps and increases visibility
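As an illustration of the custom-tools idea, here is a minimal sketch of a Slack ops bot built with the slack_bolt library; the /incident command, the #incidents channel, and the environment variable names are assumptions made for this example, not tooling described in the talk.

```python
# Minimal Slack ops bot sketch (assumes slack_bolt is installed and the
# SLACK_BOT_TOKEN / SLACK_SIGNING_SECRET environment variables are set).
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

# Hypothetical "/incident" slash command: posts to a shared channel so the
# incident is visible to everyone instead of living in DMs.
@app.command("/incident")
def declare_incident(ack, command, client):
    ack()  # Slack requires an acknowledgement within 3 seconds
    summary = command.get("text", "no summary provided")
    client.chat_postMessage(
        channel="#incidents",  # assumed public incident channel
        text=f":rotating_light: New incident declared: {summary}",
    )

if __name__ == "__main__":
    app.start(port=3000)  # HTTP mode; Socket Mode is also an option
```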
Tools · Roles · Collaboration · Impact · Priority · Escalation · Logs · Metrics · Traces
Measuring impact
Priorities in incident response
Represent the urgency and impact level, guiding teams on how quickly and intensely to address an issue
P0 - Critical incident: highest priority, typically representing a total system outage or critical service failure
P1 - High priority: partial functionality is impaired, or a critical subset of users is impacted
P2 - Moderate priority: moderate issues affecting a specific feature or subset of users
P2 - The Data Analytics service job failed to aggregate the data collected in the past hours; system performance insights (internal users only) were not updated.
P1 - Order delivery tracking in the mobile app is broken; users cannot see where their order is at the moment.
P0 - The checkout order page is broken; a significant drop in the number of actual orders has been detected.
L1 - Engineer on duty, business hours
L2 - Engineering Manager
L3 - On-call SRE, 24/7
L4 - Global Incident Response team, 24/7
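To make the escalation ladder concrete, here is a small sketch (in Python, with invented timeouts and target names) of how a page could walk the levels until someone acknowledges it; in practice this usually lives in an incident response tool such as PagerDuty or Grafana OnCall rather than in custom code.

```python
from dataclasses import dataclass

# Illustrative escalation policy; targets and timeouts are assumptions for
# this example, not the actual policy from the talk.
@dataclass
class EscalationStep:
    level: str
    target: str
    ack_timeout_minutes: int  # how long to wait for an acknowledgement

POLICY = [
    EscalationStep("L1", "engineer-on-duty (business hours)", 15),
    EscalationStep("L2", "engineering-manager", 15),
    EscalationStep("L3", "on-call-sre (24/7)", 10),
    EscalationStep("L4", "global-incident-response (24/7)", 10),
]

def escalate(page_acknowledged) -> None:
    """Walk the policy until someone acknowledges the page."""
    for step in POLICY:
        print(f"Paging {step.target} ({step.level}), "
              f"waiting {step.ack_timeout_minutes} min for ack...")
        if page_acknowledged(step):
            print(f"Acknowledged at {step.level}.")
            return
    print("No acknowledgement - incident left unowned, page leadership.")

# Example: nobody below L3 acknowledges the page.
escalate(lambda step: step.level == "L3")
```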
Tools · Roles · Collaboration · Impact · Priority · Escalation · Logs · Metrics · Traces
Observability
Collaboration Roles Tools
Impact Priority Escalation
Logs Metrics Traces
Firefighting 🔥🔥🔥
“Shisa Kanko” to the rescue
Japan's railway safety method reduces errors
Point and Call system
The system works by engaging multiple senses, which increases concentration and attention
How is this useful for us?
Simple and clear actions can be defined so that everyone can learn and apply them.
Runbook aka
incident guideline
Shared knowledge for day-to-day IT operations, including the incident management process
✅Check the priority
If it is critical - create an incident
✅Check the scope
If it is global - escalate according to the policy
✅Start the call
Invite relevant people such as service experts
✅Assign roles
Make sure everyone is aware of their responsibilities
✅[Communicator] Keep audience updated
Ensure timely updates in public channels
✅[Firefighter] Open firefighting dashboard
It is your entry point to the incident investigation
…
✅Close incident
Update the status and initiate postmortem
Infrastructure issue?
No → move on to the next question. Yes → what type?
Availability zone (AZ) outage → failover traffic
Traffic increased → scale out
Failing pod → kill the pod
❗Plan extra buffer for AZ failure
❗Take into account peak season
❗Perform capacity tests
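Two of the mitigations above, scale out and kill the pod, can be sketched with the official Kubernetes Python client; the namespace, deployment name, pod name, and replica count below are placeholders for this example.

```python
# Sketch only: assumes the `kubernetes` Python client is installed and a
# kubeconfig with sufficient permissions is available.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()
core = client.CoreV1Api()

NAMESPACE = "shop"          # placeholder namespace
DEPLOYMENT = "checkout"     # placeholder deployment name

# "Scale out": bump the replica count of the overloaded deployment.
apps.patch_namespaced_deployment_scale(
    name=DEPLOYMENT,
    namespace=NAMESPACE,
    body={"spec": {"replicas": 10}},
)

# "Kill the pod": delete a failing pod and let the ReplicaSet recreate it.
core.delete_namespaced_pod(name="checkout-7d9f4c-xyz12", namespace=NAMESPACE)
```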
Dependency issue?
No → move on to the next question. Yes → do you have a fallback?
Fallback available → “turn off” the dependency
No fallback → escalate to the corresponding team
❗Consider fallback solutions if possible
❗Hide dependencies behind feature flags
❗Rely on resiliency patterns
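A minimal sketch of the "hide dependencies behind feature flags" and fallback tips; the flag store, the recommendation client, and the cached fallback are hypothetical stand-ins, not components from the talk.

```python
# Minimal sketch of a feature-flag-guarded dependency with a fallback.
# `flags`, `recommendation_client` and `cached_top_sellers` are hypothetical.
import logging

def get_recommendations(user_id, flags, recommendation_client, cached_top_sellers):
    # Kill switch: during an incident the flag can be flipped off to
    # "turn off" the dependency without a deployment.
    if not flags.is_enabled("recommendations"):
        return cached_top_sellers()

    try:
        return recommendation_client.fetch(user_id, timeout=0.2)
    except Exception:
        # Resiliency pattern: degrade gracefully instead of failing the page.
        logging.warning("recommendation service unavailable, serving fallback")
        return cached_top_sellers()
```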
Recent deployment?
No → move on to the next question. Yes → can you roll back?
Rollback possible → roll back the deployment
Rollback not possible → roll out a hotfix
❗Deploy to a pre-production environment
❗Use canary deployments
❗Use feature flags
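As a sketch of the canary-deployment tip, the promotion decision can be reduced to comparing error rates between the canary and the stable version; the thresholds and the metric source are assumptions, and real setups usually delegate this to a deployment tool rather than hand-rolled checks.

```python
# Toy canary promotion check; error rates would come from a real metrics query.
MAX_ABSOLUTE_ERROR_RATE = 0.01   # assumed budget: 1% of requests may fail
MAX_RELATIVE_DEGRADATION = 1.5   # canary may be at most 1.5x worse than stable

def should_promote(canary_error_rate: float, stable_error_rate: float) -> bool:
    if canary_error_rate > MAX_ABSOLUTE_ERROR_RATE:
        return False
    if stable_error_rate > 0 and canary_error_rate > stable_error_rate * MAX_RELATIVE_DEGRADATION:
        return False
    return True

# Example: canary at 0.4% errors vs stable at 0.3% -> promote.
print(should_promote(0.004, 0.003))  # True
# Example: canary at 2% errors -> roll back instead.
print(should_promote(0.02, 0.003))   # False
```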
Database issue?
No → continue with a deep-dive investigation. Yes → what type?
Schema migration → roll back the schema, restore data
Deadlocks → kill blocking sessions
Performance → kill long-running queries, scale out
❗Back up data, use audit trails and event logs
❗Deploy fixes in small steps, keep changes compatible
❗Acknowledge schema updates
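For the "kill long-running queries" branch, here is a hedged Postgres-specific sketch using psycopg2 and pg_stat_activity; the connection string and the five-minute threshold are assumptions, and terminating sessions should be a last resort agreed with the service owners.

```python
# Sketch: find and terminate Postgres queries running longer than 5 minutes.
# Assumes psycopg2 is installed and the DSN below is replaced with a real one.
import psycopg2

conn = psycopg2.connect("dbname=shop user=ops host=db.internal")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND pid <> pg_backend_pid()
          AND query_start < now() - interval '5 minutes'
        ORDER BY runtime DESC
        """
    )
    for pid, runtime, query in cur.fetchall():
        print(f"terminating pid={pid}, running for {runtime}: {query[:80]}")
        # pg_terminate_backend ends the whole session for that backend.
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```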
Deep dive
investigation
What not to do
❌ Making rollouts during the incident
❌ Keeping information in private channels
❌ Taking on an unbearable weight
❌ Pointing fingers and blaming
Continuous improvement
Postmortem
Root Cause Analysis
Understanding the Why and preventing possible
recurrence in the future
Learning and Improvement
Sharing knowledge and experiences and identifying
weaknesses in existing processes, procedures, and tools.
Transparency
Provide stakeholders with a clear understanding of the
causes of the incident and the steps being taken.
Explain why an incident happened and
how to avoid it in the future, without
blaming anyone
RFO (Reason For Outage), Postmortem document
1. Overview and ownership
2. Impact
3. Timeline
4. Root cause and contributing factors
5. Mitigation
6. Follow ups
Postmortem templates repository
❗How did the incident process go?
❗What could have gone much worse?
❗Prioritize follow-up work
Postmortem value
✅ Identify common patterns from RFOs
✅ Improve new service bootstrap and CI/CD pipelines
✅ Create learning materials
Get ready for the next
outage
From past learnings we can train
ourselves in the production environment
“Turn off” AZ or regions
To ensure that failover AZ and regions are able to
handle the load
Cut service dependencies
To ensure your service is fault tolerant and fallback
solution works
Join other incidents
Get to know the process and learn from experienced
colleagues
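One low-tech way to rehearse turning off an AZ on Kubernetes is to cordon every node in one zone so the other zones have to absorb the load; the zone value below is a placeholder, and a real game day should be announced, coordinated, and time-boxed.

```python
# Game-day sketch: simulate an AZ outage by cordoning all nodes in one zone.
# Assumes the `kubernetes` Python client and a kubeconfig with node permissions.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

ZONE = "eu-central-1a"  # placeholder zone to "turn off"

nodes = core.list_node(label_selector=f"topology.kubernetes.io/zone={ZONE}")
for node in nodes.items:
    # Cordon: mark the node unschedulable so workloads shift to other zones.
    core.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
    print(f"cordoned {node.metadata.name}")

# Afterwards, verify the remaining zones absorb the load, then uncordon:
#   core.patch_node(name, {"spec": {"unschedulable": False}})
```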
Let's sum it up
● We set up the foundation
● We discovered the firefighting tips
● We emphasized the importance of postmortems
AI to the rescue!
AI capabilities
Anomaly detection
Domain knowledge
Copilot
Incident summary
Grafana ML
Offers data analysis and generative AI capabilities, including creating
alerts, forecasting capacity requirements, and identifying anomalies.
Datadog Bits AI
Uses natural language to simplify queries, streamline incident response,
suggest fixes, and automate workflows for seamless collaboration.
New Relic AI
Get clear insights from your data using simple language and an
integrated platform that makes it easy to understand complex telemetry.
Q&A
