The story of firefighting: learnings from incident management
Daniil Doronkin
An incident is inevitable
Amazon
Bad weather caused power failures throughout AWS US East.
Google
Successive lightning strikes on their European datacenter caused loss of
power to Google Compute Engine storage systems within that region.
GitHub
On January 28th, 2016, GitHub experienced a power disruption at their primary datacenter.
Publicly available post-mortem list
Incident Experience
Stop “bleeding” · Turn costs into learnings · Increase system maturity
Love learning new things about distributed systems
Joining on-call to help with incidents
Arrived from the Netherlands by train in time
The opinions expressed in this presentation are strictly
personal and do not reflect the official stance of any
organization I am associated with
The business is
growing rapidly
This has benefits, but also brings a number
of challenges and raises expectations.
More features here and now
The team has to figure out how to deliver high quality
in a timely manner
Many customers want to be on board
The platform must be able to scale accordingly
Scaling across continents
Geo-distribution is the key to success
Tools · Roles · Collaboration · Impact · Priority · Escalation · Logs · Metrics · Traces
Where to start?
Firefighter: investigation, hotfix, monitoring
Coordinator: decision making, tracking timelines, involving people
Communicator: stakeholder updates, internal communications
(Updates, directs, and briefs flow between the three roles.)
Comfortable
Environment
A well-equipped environment provides
the necessary tools and resources to
respond effectively to incidents
PagerDuty, Grafana OnCall, Squadcast and more
Incident response tools to manage an incident
workflow end to end
Slack, Zoom, Teams and more
Communication tools to support collaboration and
keep interested parties posted
Custom tools
Panic buttons, Slack Ops bots, documentation storage,
and anything else that helps and increases visibility
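As an illustration of the custom-tools idea, here is a minimal sketch of a Slack ops bot built with the slack_bolt library; the /incident command, the #incidents channel, and the environment variable names are assumptions made for this example, not tooling described in the talk.

```python
# Minimal Slack ops bot sketch (assumes slack_bolt is installed and the
# SLACK_BOT_TOKEN / SLACK_SIGNING_SECRET environment variables are set).
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

# Hypothetical "/incident" slash command: posts to a shared channel so the
# incident is visible to everyone instead of living in DMs.
@app.command("/incident")
def declare_incident(ack, command, client):
    ack()  # Slack requires an acknowledgement within 3 seconds
    summary = command.get("text", "no summary provided")
    client.chat_postMessage(
        channel="#incidents",  # assumed public incident channel
        text=f":rotating_light: New incident declared: {summary}",
    )

if __name__ == "__main__":
    app.start(port=3000)  # HTTP mode; Socket Mode is also an option
```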
Tools · Roles · Collaboration · Impact · Priority · Escalation · Logs · Metrics · Traces
Measuring impact
Priorities in incident response
Represent the urgency and impact level, guiding teams on how quickly and intensely to address an issue
P0 - Critical incident: highest priority, typically representing a total system outage or critical service failure
P1 - High priority: partial functionality is impaired, or a critical subset of users is impacted
P2 - Moderate priority: moderate issues affecting a specific feature or subset of users
P2 - The Data Analytics service job failed to aggregate the data collected in the past hours; system performance insights (internal users only) were not updated.
P1 - Order delivery tracking in the mobile app is broken; users cannot see where their order is at the moment.
P0 - The checkout order page is broken; a significant drop in the number of actual orders has been detected.
L1 - Engineer on duty, business hours
L2 - Engineering Manager
L3 - On-call SRE, 24/7
L4 - Global Incident Response team, 24/7
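To make the escalation ladder concrete, here is a small sketch (in Python, with invented timeouts and target names) of how a page could walk the levels until someone acknowledges it; in practice this usually lives in an incident response tool such as PagerDuty or Grafana OnCall rather than in custom code.

```python
from dataclasses import dataclass

# Illustrative escalation policy; targets and timeouts are assumptions for
# this example, not the actual policy from the talk.
@dataclass
class EscalationStep:
    level: str
    target: str
    ack_timeout_minutes: int  # how long to wait for an acknowledgement

POLICY = [
    EscalationStep("L1", "engineer-on-duty (business hours)", 15),
    EscalationStep("L2", "engineering-manager", 15),
    EscalationStep("L3", "on-call-sre (24/7)", 10),
    EscalationStep("L4", "global-incident-response (24/7)", 10),
]

def escalate(page_acknowledged) -> None:
    """Walk the policy until someone acknowledges the page."""
    for step in POLICY:
        print(f"Paging {step.target} ({step.level}), "
              f"waiting {step.ack_timeout_minutes} min for ack...")
        if page_acknowledged(step):
            print(f"Acknowledged at {step.level}.")
            return
    print("No acknowledgement - incident left unowned, page leadership.")

# Example: nobody below L3 acknowledges the page.
escalate(lambda step: step.level == "L3")
```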
Tools · Roles · Collaboration · Impact · Priority · Escalation · Logs · Metrics · Traces
Observability
Collaboration Roles Tools
Impact Priority Escalation
Logs Metrics Traces
Firefighting 🔥🔥🔥
“Shisa Kanko” to the rescue
Japan's railway safety method reduces errors
Point and Call system
The system works by engaging multiple senses, which increases concentration and attention
How is this useful for us?
Simple and clear actions can be defined so that everyone can learn and apply them.
Runbook aka
incident guideline
Shared knowledge for day-to-day IT operations, including the incident management process
✅Check the priority
If it is critical - create an incident
✅Check the scope
If it is global - escalate according to the policy
✅Start the call
Invite relevant people such as service experts
✅Assign roles
Make sure everyone is aware of their responsibilities
✅[Communicator] Keep audience updated
Ensure timely updates in public channels
✅[Firefighter] Open firefighting dashboard
It is your entry point to the incident investigation
…
✅Close incident
Update the status and initiate postmortem
Infrastructure issue?
No → move on to the next question. Yes → what type?
Availability zone (AZ) outage → failover traffic
Traffic increased → scale out
Failing pod → kill the pod
❗Plan extra buffer for AZ failure
❗Take into account peak season
❗Perform capacity tests
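Two of the mitigations above, scale out and kill the pod, can be sketched with the official Kubernetes Python client; the namespace, deployment name, pod name, and replica count below are placeholders for this example.

```python
# Sketch only: assumes the `kubernetes` Python client is installed and a
# kubeconfig with sufficient permissions is available.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()
core = client.CoreV1Api()

NAMESPACE = "shop"          # placeholder namespace
DEPLOYMENT = "checkout"     # placeholder deployment name

# "Scale out": bump the replica count of the overloaded deployment.
apps.patch_namespaced_deployment_scale(
    name=DEPLOYMENT,
    namespace=NAMESPACE,
    body={"spec": {"replicas": 10}},
)

# "Kill the pod": delete a failing pod and let the ReplicaSet recreate it.
core.delete_namespaced_pod(name="checkout-7d9f4c-xyz12", namespace=NAMESPACE)
```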
Dependency issue?
No → move on to the next question. Yes → do you have a fallback?
Fallback available → “turn off” the dependency
No fallback → escalate to the corresponding team
❗Consider fallback solutions if possible
❗Hide dependencies behind feature flags
❗Rely on resiliency patterns
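A minimal sketch of the "hide dependencies behind feature flags" and fallback tips; the flag store, the recommendation client, and the cached fallback are hypothetical stand-ins, not components from the talk.

```python
# Minimal sketch of a feature-flag-guarded dependency with a fallback.
# `flags`, `recommendation_client` and `cached_top_sellers` are hypothetical.
import logging

def get_recommendations(user_id, flags, recommendation_client, cached_top_sellers):
    # Kill switch: during an incident the flag can be flipped off to
    # "turn off" the dependency without a deployment.
    if not flags.is_enabled("recommendations"):
        return cached_top_sellers()

    try:
        return recommendation_client.fetch(user_id, timeout=0.2)
    except Exception:
        # Resiliency pattern: degrade gracefully instead of failing the page.
        logging.warning("recommendation service unavailable, serving fallback")
        return cached_top_sellers()
```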
Recent deployment?
No → move on to the next question. Yes → can you roll back?
Rollback possible → roll back the deployment
Rollback not possible → roll out a hotfix
❗Deploy to a pre-production environment
❗Use canary deployments
❗Use feature flags
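As a sketch of the canary-deployment tip, the promotion decision can be reduced to comparing error rates between the canary and the stable version; the thresholds and the metric source are assumptions, and real setups usually delegate this to a deployment tool rather than hand-rolled checks.

```python
# Toy canary promotion check; error rates would come from a real metrics query.
MAX_ABSOLUTE_ERROR_RATE = 0.01   # assumed budget: 1% of requests may fail
MAX_RELATIVE_DEGRADATION = 1.5   # canary may be at most 1.5x worse than stable

def should_promote(canary_error_rate: float, stable_error_rate: float) -> bool:
    if canary_error_rate > MAX_ABSOLUTE_ERROR_RATE:
        return False
    if stable_error_rate > 0 and canary_error_rate > stable_error_rate * MAX_RELATIVE_DEGRADATION:
        return False
    return True

# Example: canary at 0.4% errors vs stable at 0.3% -> promote.
print(should_promote(0.004, 0.003))  # True
# Example: canary at 2% errors -> roll back instead.
print(should_promote(0.02, 0.003))   # False
```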
Database issue?
No → continue with a deep-dive investigation. Yes → what type?
Schema migration → roll back the schema, restore data
Deadlocks → kill blocking sessions
Performance → kill long-running queries, scale out
❗Back up data, use audit trails and event logs
❗Deploy fixes in small steps, keep changes compatible
❗Acknowledge schema updates
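For the "kill long-running queries" branch, here is a hedged Postgres-specific sketch using psycopg2 and pg_stat_activity; the connection string and the five-minute threshold are assumptions, and terminating sessions should be a last resort agreed with the service owners.

```python
# Sketch: find and terminate Postgres queries running longer than 5 minutes.
# Assumes psycopg2 is installed and the DSN below is replaced with a real one.
import psycopg2

conn = psycopg2.connect("dbname=shop user=ops host=db.internal")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND pid <> pg_backend_pid()
          AND query_start < now() - interval '5 minutes'
        ORDER BY runtime DESC
        """
    )
    for pid, runtime, query in cur.fetchall():
        print(f"terminating pid={pid}, running for {runtime}: {query[:80]}")
        # pg_terminate_backend ends the whole session for that backend.
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```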
Deep dive
investigation
What not to do
❌ Making rollouts during the incident
❌ Keeping information in private channels
❌ Taking on an unbearable weight
❌ Pointing fingers and blaming
Continuous improvement
Postmortem
Root Cause Analysis
Understanding the Why and preventing possible
recurrence in the future
Learning and Improvement
Sharing knowledge and experiences and identifying
weaknesses in existing processes, procedures, and tools.
Transparency
Provide stakeholders with a clear understanding of the
causes of the incident and the steps being taken.
Explain why an incident happened and
how to avoid it in the future, without
blaming anyone
RFO (Reason For Outage), Postmortem document
1. Overview and ownership
2. Impact
3. Timeline
4. Root cause and contributing factors
5. Mitigation
6. Follow ups
Postmortem templates repository
❗How did the incident process go?
❗What could have gone much worse?
❗Prioritize follow-up work
Postmortem value
✅ Identify common patterns from RFOs
✅ Improve new service bootstrap and CI/CD pipelines
✅ Create learning materials
Get ready for the next
outage
From past learnings we can train
ourselves in the production environment
“Turn off” AZ or regions
To ensure that failover AZ and regions are able to
handle the load
Cut service dependencies
To ensure your service is fault tolerant and fallback
solution works
Join other incidents
Get to know the process and learn from experienced
colleagues
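One low-tech way to rehearse turning off an AZ on Kubernetes is to cordon every node in one zone so the other zones have to absorb the load; the zone value below is a placeholder, and a real game day should be announced, coordinated, and time-boxed.

```python
# Game-day sketch: simulate an AZ outage by cordoning all nodes in one zone.
# Assumes the `kubernetes` Python client and a kubeconfig with node permissions.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

ZONE = "eu-central-1a"  # placeholder zone to "turn off"

nodes = core.list_node(label_selector=f"topology.kubernetes.io/zone={ZONE}")
for node in nodes.items:
    # Cordon: mark the node unschedulable so workloads shift to other zones.
    core.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
    print(f"cordoned {node.metadata.name}")

# Afterwards, verify the remaining zones absorb the load, then uncordon:
#   core.patch_node(name, {"spec": {"unschedulable": False}})
```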
Let's sum it up
● We set up the foundation
● We discovered the firefighting tips
● We emphasized the importance of postmortems
AI to the rescue!
AI capabilities
Anomaly detection
Domain knowledge
Copilot
Incident summary
Grafana ML
Offers data analysis and generative AI capabilities, including creating
alerts, forecasting capacity requirements, and identifying anomalies.
Datadog Bits AI
Uses natural language to simplify queries, streamline incident response,
suggest fixes, and automate workflows for seamless collaboration.
New Relic AI
Get clear insights from your data using simple language and an
integrated platform that makes it easy to understand complex telemetry.
Q&A
