Major incidents can be very stressful, frustrating and chaotic experiences, especially if the on-call responders lack the proper process, training and coordination.
In this talk, we will walk through a real incident from PagerDuty’s own history, to illustrate what an effective incident response looks like. We will recreate the incident timeline step by step and go over all of the different roles involved, including the incident commander, scribe, customer/business liaison and subject matter experts. We will also cover the process and tooling needed to respond quickly and effectively to major incidents in order to minimize customer and business impact.
Anatomy of a real-life incident - Alex Solomon, CTO and Co-Founder of PagerDuty
1. Anatomy of a real-life incident
Alex Solomon
CTO & Co-Founder @ PagerDuty
2. THIS IS A TRUE STORY
The events in this presentation took place in San Francisco and Toronto on January 6, 2017.
In the interest of brevity, some details have been omitted.
3. The Services
• Web2Kafka Service: publishes change events from the web monolith to Kafka for other services to consume
• Incident Log Entries Service: stores log entries for incidents
• Stack: Docker, Mesos / Marathon, Linux kernel
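For context on what the Web2Kafka service does, here is a minimal sketch (not PagerDuty's actual code) of publishing a change event from a web application to Kafka for other services to consume; the broker address, topic name, and event shape are illustrative assumptions, using the kafka-python client.

```python
import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["kafka01:9092"],  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical change event emitted by the web monolith
change_event = {
    "type": "incident.log_entry.created",       # illustrative event type
    "incident_id": "PABC123",                    # illustrative id
    "occurred_at": "2017-01-06T15:21:00-08:00",
}

producer.send("web-change-events", value=change_event)  # assumed topic name
producer.flush()
```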
6. [3:21 PM] David (SME): !ic page
Chat BOT: 🚨Paging Incident Commander(s)
Officer URL:
✔ Eric has been paged.
✔ Ken has been paged.
✔ Peter has been paged.
Incident triggered in the following service:
https://pd.pagerduty.com/services/PERDDFI
David (SME): web2kafka is down, and I'm not sure what's going on
(David kicked off the major incident process)
[3:21 PM] Eric (IC): Taking IC
(Eric took the IC role; he was the IC primary on-call)
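The `!ic page` command asks the chat bot to page the on-call Incident Commanders through PagerDuty. The bot's implementation isn't shown in the talk; below is a hypothetical sketch of what such a handler could do using the public PagerDuty Events API v2, with the routing key, source, and summary as placeholder values.

```python
import requests

# Placeholder integration key for the Incident Commander escalation service
ROUTING_KEY = "YOUR_IC_SERVICE_INTEGRATION_KEY"

def page_incident_commander(summary: str) -> str:
    """Trigger a PagerDuty incident via the public Events API v2."""
    resp = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,    # e.g. "web2kafka is down"
                "source": "chat-bot",  # assumed source name
                "severity": "critical",
            },
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["dedup_key"]  # key for later acknowledge/resolve events
```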
7. The Incident Commander
• The Wartime General: decision maker during a major incident
• GOAL: drive the incident to resolution quickly and effectively
• Gather data: ask subject matter experts to diagnose and debug various aspects of the system
• Listen: collect proposed repair actions from SMEs
• Decide: decide on a course of action
• Act (via delegation): once a decision is made, ask the team to act on it. The IC should always delegate all diagnosis and repair actions to the rest of the team.
8. Priyam (SME): I'm here from EM
Evan (SME): lmk if you need SRE
sounds like IHM might be down too
[3:22 PM] Ken (DEPUTY): @renee, please join the call
(Ken took the deputy role)
(Other SMEs joined)
9. The Deputy (backup IC)
• The Sidekick: right-hand person for the IC
• Monitor the status of the incident
• Be prepared to page other people
• Provide regular updates to business and/or exec stakeholders
10. [3:23 PM] Peter (SCRIBE): I am now the scribe
(Peter becomes the scribe)
Peter (SCRIBE):
Eric: Looking to find Mesos experts
Evan: Looking for logs & dashboards
[3:23 PM] Ken (DEPUTY): Notifications are still going out, subject lines are filled in but not email bodies (they use ILEs)
(Ken is both deputy and scribe)
[3:24 PM] Zayna (SME): seeing a steady rise in crashes in the Android app around trigger incident log entries
[3:25 PM] JD (SME): No ILEs will be generated due to LES not being able to query web2kafka
Peter (SCRIBE):
Eric: David, what have you looked at?
David: trolling logs, see errors
David: tried restarting, doesn't help
(Renee, an SME, and the team are discussing the customer-visible impact of the incident)
11. The Scribe
• The Record-keeper
• Add notes to the chatroom when findings are determined or significant actions are taken
• Add TODOs to the room that indicate follow-ups for later (generally after the incident)
• Monitor tasks assigned by the IC to other team members, remind the IC to follow up
12. [3:26 PM] Peter (SCRIBE):
Eric: What's the customer impact?
Renee (SME): Can't expand incident details
Peter (SCRIBE):
David: No ILEs can be created
Renee: no incident details, error msg in the UI
[3:27 PM] Peter (SCRIBE):
Eric: Comms rep on the phone? Luke
Eric to Luke: Please compose a tweet
(The IC asked the customer liaison to write a msg to customers)
[3:29 PM] Luke (CUST LIAISON): suggested tweet: `There is currently an issue affecting the incident log entries component of our web application causing the application to display errors. We are actively investigating.`
Peter (SCRIBE): Luke to tweet
(Msg was sent out to customers)
13. The Communications Liaison
• The link to the customer
• Monitor customer and business impact
• Provide regular updates to customers (and/or to customer-facing folks in the business)
• (Optional) Provide regular updates to stakeholders
14. [3:34 PM] Peter (SCRIBE):
David added Cees to the incident
Eric: Is there a runbook for mesos?
David: Yes, but not for this issue.
(We paged a Mesos expert who is not on-call)
[3:36 PM] Eric (IC): @cees Would you join us on the bridge? We have a few Mesos questions
Cees (SME): I’m away from any laptops, just arrived at a pub for dinner.
(The Mesos expert joined the chat)
[3:37 PM] Peter (SCRIBE):
Evan: might need to kick new hardware if system is actually unreachable.
Evan: slave01 is reachable
David: slave02 is not reachable.
David: slave03 is not reachable.
David: only 3 slaves for mesos
Eric: We are down to only one host
Evan: Seeing some stuff. Memory exhaustion.
Peter (SCRIBE): TODO: Create a runbook for mesos to stop the world and start again
(The scribe captured a TODO to record & remember a follow-up that should happen after the incident is resolved)
15. [3:38 PM] Peter (SCRIBE):
David: Only 3 slaves in that cluster, we have another cluster in us-west-1
Eric: Two options: kick more slaves or restart marathon
[3:39 PM] Peter (SCRIBE):
Evan: OOM killer has kicked in on slave01
(Noticed that oom-killer killed the docker process on slave01)
[3:41 PM] Peter (SCRIBE):
Eric to Evan: please reboot slave02 and slave03
(Restart slaves first)
[3:42 PM] Cees (SME): slave01 is now down
[3:44 PM] Peter (SCRIBE):
David: Consider bringing up another cluster?
Cees: Should be trivial
(They are considering bringing up another Mesos cluster in west1)
[3:47 PM] Peter (SCRIBE):
Eric: Stop slaves in west2, startup web2kafka in west1
Evan: slave02 is alive!
Eric: Waiting 2 minutes
(slave02 is back up after reboot, so they hold off on flipping to west1)
16. [3:49 PM] Peter (SCRIBE):
Evan: Slave02 is quiet.
Evan: Slave02 is trying to start, exiting with code 137
Evan: 137 means it's being killed by OOM, OOM is killing docker containers continuously
(They realized the problem: oom-killer is killing the docker containers over and over)
[3:53 PM] Peter (SCRIBE): Proposed Action: David is going to configure marathon to allow more memory
[3:54 PM] Peter (SCRIBE): Proposed Action: Evan to force reboot slave01
(The resolution action was to redeploy web2kafka with a higher cgroup/Docker memory limit: 2 GB, vs 512 MB before)
[3:55 PM] Luke (CUST LIAISON): Customer impact: there are 4 tickets so far and 2 customers chatting with us, which is another 2 tickets
(The customer liaison provided an update on the customer impact)
[3:56 PM] Peter (SCRIBE):
David: Web2kafka appears to be running
Eric: Looks like all things are running
Renee: Things are fine with notifications
JD: LES is seeing progress
(The system is recovering)
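The resolution above was to redeploy web2kafka with a larger memory limit (2 GB instead of 512 MB); exit code 137 is 128 + SIGKILL, which is the signal the oom-killer delivers. As a rough illustration only, not the team's actual commands, here is a sketch of raising a Marathon app's memory limit through Marathon's v2 REST API; the Marathon endpoint and app id are assumptions.

```python
import requests

MARATHON = "http://marathon.internal:8080"  # assumed Marathon endpoint
APP_ID = "/web2kafka"                       # assumed Marathon app id

# Inspect the current app definition (the story says the limit was 512 MB).
app = requests.get(f"{MARATHON}/v2/apps{APP_ID}", timeout=10).json()["app"]
print("current mem limit (MB):", app["mem"])

# PUT an updated definition with a 2048 MB limit; Marathon then rolls out
# a new deployment of the app with the larger cgroup/Docker memory limit.
resp = requests.put(
    f"{MARATHON}/v2/apps{APP_ID}",
    json={"id": APP_ID, "mem": 2048},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # contains the deploymentId and new app version
```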
17. The Punchline
• Root cause
• Increase in traffic caused web2kafka to increase its memory usage
• This caused the Linux oom-killer to kill the process
• Then, mesos / marathon immediately restarted it, it ramped up memory again, oom-killer killed it, and so on.
• After doing this restart-kill cycle multiple times, we hit a race-condition bug in the Linux kernel causing a kernel panic and killing the host
• Other services running on the host were impacted, notably LES
18. Summary
• Incident Command
• The most important role, crucial to fast decision making and action!
• Takes practice and experience
• Deputy
• The right-hand person for the IC, can step in and take over Incident Command for long-running incidents
• Responsible for business & exec stakeholder communications, allowing the technical team to focus on incident resolution
• Scribe
• Essential for providing context in the chatroom and tracking follow-ups & action items (for example, the IC saying “Evan, do X, report back in 5 min”)
• Produces step-by-step documentation which is very helpful for constructing the timeline later (in the post-mortem)
• Communications liaison
• Essential for tracking customer impact and communicating status to customers
19. The End
Alex Solomon
CTO & Co-Founder @ PagerDuty
alex@pagerduty.com
The PagerDuty Incident Response process and training is open-source: https://response.pagerduty.com