Anatomy of a real-life incident
Alex Solomon
CTO & Co-Founder @ PagerDuty
THIS IS A TRUE STORY
The events in this presentation took place in
San Francisco and Toronto on January 6, 2017
In the interest of brevity, some details have
been omitted
The Services
Web2Kafka Service: publishes change events from the web monolith to Kafka for other services to consume
Incident Log Entries Service: stores log entries for incidents
Underlying stack: Docker, Mesos / Marathon, Linux kernel
The People
Major incident response principal roles
Eric: Incident Commander
Peter: Scribe
Ken: Deputy
Luke: Communications Liaison
Subject Matter Experts (SMEs)
David: Core on-call
Cees: Core eng
Evan: SRE on-call
Renee: IM People on-call
Zayna: Mobile on-call
JD: IM Data on-call
Priyam: EM on-call
The Incident
[3:21 PM] David:
SME
!ic page
Officer URL:
Chat BOT
🚨Paging Incident Commander(s)

✔ Eric has been paged.
✔ Ken has been paged.
✔ Peter has been paged.
Incident triggered in the following service:
https://pd.pagerduty.com/services/PERDDFI
David:
SME
web2kafka is down, and I'm not sure what's
going on
kicked off the major incident
process
[3:21 PM] Eric:
IC
Taking IC
Eric took the IC role (he was the primary on-call IC)
The Incident Commander
• The Wartime General: decision maker during a major incident
• GOAL: drive the incident to resolution quickly and effectively
• Gather data: ask subject matter experts to diagnose and
debug various aspects of the system
• Listen: collect proposed repair actions from SMEs
• Decide: decide on a course of action
• Act (via delegation): once a decision is made, ask the team
to act on it. IC should always delegate all diagnosis and
repair actions to the rest of the team.
Priyam:
SME
I’m here from EM
Evan:
SME
lmk if you need SRE
sounds like IHM might be down too
Ken:
DEPUTY
@renee, please join the call
[3:22 PM]
Ken took the deputy role
Other SMEs joined
The Deputy (backup IC)
• The Sidekick: right hand person for the IC
• Monitor the status of the incident
• Be prepared to page other people
• Provide regular updates to business and/or
exec stakeholders
Peter:
SCRIBE
I am now the scribe
Eric: Looking to find Mesos experts
Evan: Looking for logs & dashboards
Zayna:
SME
seeing a steady rise in crashes in
Android app around trigger incident log
entries
[3:24 PM]
JD:
SME
No ILEs will be generated due to LES
not being able to query web2kafka
[3:25 PM]
Eric: David, what have you looked at?
David: trolling logs, see errors
David: tried restarting, doesn’t help
[3:23 PM] Ken:
DEPUTY
Notifications are still going out, subject
lines are filled in but not email bodies
(they use ILEs)
Renee:
SME
Peter becomes the scribe
Discussing customer-visible
impact of the incident
Ken is both deputy and
scribe
The Scribe
• The Record-keeper
• Add notes to the chatroom when findings are
determined or significant actions are taken
• Add TODOs to the room that indicate follow-ups
for later (generally after the incident)
• Monitor tasks assigned by the IC to other
team members, remind the IC to follow-up
Renee:
SME
Can’t expand incident details
Luke:
CUST LIAISON
suggested tweet: `There is currently an
issue affecting the incident log entries
component of our web application
causing the application to display
errors. We are actively investigating.`
[3:29 PM]
David: No ILEs can be created
Renee: no incident details, error msg in
the UI
[3:27 PM] Peter:
SCRIBE
Eric: Comms rep on the phone? Luke
Eric to Luke: Please compose a tweet
Peter:
SCRIBE
Eric: What’s the customer impact?
[3:26 PM] Peter:
SCRIBE
Luke to tweet
Peter:
SCRIBE
IC asked the customer liaison
to write a msg to customers
Msg was sent out to
customers
The Communications Liaison
• The link to the customer
• Monitor customer and business impact
• Provide regular updates to customers (and/or
to customer-facing folks in the business)
• (Optional) Provide regular updates to
stakeholders
Cees:
SME
I’m away from any laptops, just arrived
at a pub for dinner.
[3:36 PM]
@cees Would you join us on the bridge?
We have a few Mesos questions
Eric:
IC
Evan: might need to kick new hardware if
system is actually unreachable.
Evan: slave01 is reachable
David: slave02 is not reachable.
David: slave03 is not reachable.
David: only 3 slaves for mesos
Eric: We are down to only one host
Evan: Seeing some stuff. Memory
exhaustion.
[3:37 PM] Peter:
SCRIBE
TODO: Create a runbook for mesos to
stop the world and start again
Peter:
SCRIBE
David added Cees to the incident
Eric: Is there a runbook for mesos?
David: Yes, but not for this issue.
[3:34 PM] Peter:
SCRIBE
Scribe captured a TODO to record &
remember a follow-up that should
happen after the incident is resolved
We paged a Mesos expert
who is not on-call
The Mesos expert joined the
chat
David: Only 3 slaves in that cluster, we
have another cluster in us-west-1
Eric: Two options: kick more slaves or
restart marathon
[3:38 PM] Peter:
SCRIBE
Evan: OOM killer has kicked in on
slave01
[3:39 PM] Peter:
SCRIBE
Eric: Stop slaves in west2, startup
web2kafka in west1
Evan: slave02 is alive!
Eric: Waiting 2 minutes
[3:47 PM] Peter:
SCRIBE
David: Consider bringing up another cluster?
Cees: Should be trivial
[3:44 PM] Peter:
SCRIBE
Eric to evan: please reboot slave02 and
slave03
[3:41 PM] Peter:
SCRIBE
Restart slaves first
Cees:
SME
slave01 is now down
[3:42 PM] Evan:
SME
They are considering
bringing up another
Mesos cluster in west1
slave02 is back up after
reboot, so they hold off
on flipping to west1
Noticed that oom-killer
killed the docker
process on slave01
Evan: Slave02 is quiet.
Evan: Slave02 is trying to start, exiting with
code 137
[3:49 PM] Peter:
SCRIBE
Evan: Slave02 is quiet.
Evan: 137 means it’s being killed by OOM,
OOM is killing docker containers
continuously
Peter:
SCRIBE
[3:53 PM] Proposed Action: David is going to
configure marathon to allow more memory
Peter:
SCRIBE
[3:54 PM] Proposed Action: Evan to force reboot
slave01
Peter:
SCRIBE
[3:56 PM] David: Web2kafka appears to be running
Eric: Looks like all things are running
Renee: Things are fine with notifications
JD: LES is seeing progress
Peter:
SCRIBE
[3:55 PM] Customer impact: there are 4 tickets so far
and 2 customers chatting with us, which is
another 2 tickets
Luke:
CUST LIAISON
They realized the
problem: oom-killer is
killing the docker
containers over and over
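The exit code 137 that Evan reported can be decoded mechanically. By the common shell and Docker convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9, SIGKILL. A minimal sketch follows; the helper name `describe_exit_code` is hypothetical, and SIGKILL alone does not prove an OOM kill, so the kernel log should still be checked for oom-killer messages.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


if __name__ == "__main__":
    # On Linux this prints: terminated by SIGKILL
    print(describe_exit_code(137))
```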
The resolution action was
to redeploy web2kafka
with a higher cgroup/Docker
memory limit:
2 GB (vs 512 MB before)
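As a rough illustration of that change, a Marathon app definition carries its memory limit in the `mem` field, expressed in MiB. The sketch below builds a before and an after definition. Only the 512 MB and 2 GB figures come from the incident; the app id, image name, CPU share, and instance count are hypothetical placeholders, and the real web2kafka definition is not shown in the slides.

```python
import json


def marathon_app(app_id: str, image: str, mem_mib: int,
                 cpus: float = 0.5, instances: int = 1) -> dict:
    """Build a minimal Marathon app definition for a Docker container."""
    return {
        "id": app_id,
        "cpus": cpus,          # hypothetical value
        "mem": mem_mib,        # memory limit in MiB
        "instances": instances,  # hypothetical value
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},  # hypothetical image name
        },
    }


# Before the incident: 512 MB limit. After the fix: 2 GB limit.
before = marathon_app("/web2kafka", "example/web2kafka:latest", mem_mib=512)
after = marathon_app("/web2kafka", "example/web2kafka:latest", mem_mib=2048)

if __name__ == "__main__":
    print(json.dumps(after, indent=2))
```

A definition like this would typically be submitted to Marathon to trigger a redeploy; the exact deployment tooling the team used is not stated.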
The customer liaison
provided an update on
the customer impact
The system is recovering
The Punchline
• Root cause
• Increase in traffic caused web2kafka to increase its memory usage
• This caused the Linux oom-killer to kill the process
• Then Mesos / Marathon immediately restarted it; it ramped up memory
again, oom-killer killed it again, and so on.
• After doing this restart-kill cycle multiple times, we hit a race-condition
bug in the Linux kernel causing a kernel panic and killing the host
• Other services running on the host were impacted, notably LES
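The restart-kill cycle described above can be pictured with a toy model. This is only a sketch under made-up numbers: memory demand, growth rate, and tick count are hypothetical, and the kernel panic itself is not modeled. It only shows why a limit below the steady-state demand produces an endless kill loop while a limit above it produces none.

```python
def count_oom_kills(limit_mib: int, demand_mib: int,
                    growth_mib_per_tick: int, ticks: int) -> int:
    """Count how often a process is killed for exceeding its memory limit.

    The process grows toward `demand_mib`; when usage exceeds `limit_mib`
    it is killed and the scheduler restarts it from zero.
    """
    kills = 0
    usage = 0
    for _ in range(ticks):
        usage = min(usage + growth_mib_per_tick, demand_mib)
        if usage > limit_mib:
            kills += 1
            usage = 0  # immediate restart, as the scheduler did
    return kills


if __name__ == "__main__":
    # Hypothetical load: the service wants about 1500 MiB under higher traffic.
    print(count_oom_kills(512, 1500, 100, 60))   # 10 kills in 60 ticks
    print(count_oom_kills(2048, 1500, 100, 60))  # 0 kills
```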
Summary
• Incident Command
• The most important role, crucial to fast decision making and action!
• Takes practice and experience
• Deputy
• The right-hand person for the IC, can step in and take over Incident Command for long-running incidents
• Responsible for business & exec stakeholder communications, allowing the technical team to focus on incident
resolution
• Scribe
• Essential for providing context in the chatroom and tracking follow-ups & action items (for example, the IC
saying “Evan, do X, report back in 5 min”)
• Produces step-by-step documentation, which is very helpful for constructing the timeline later (in the post-mortem)
• Communications liaison
• Essential for tracking customer impact and communicating status to customers
The End
Alex Solomon
CTO & Co-Founder @ PagerDuty
alex@pagerduty.com
The PagerDuty Incident Response process and training is open-source: https://response.pagerduty.com

Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 

Anatomy of a real-life incident - Alex Solomon, CTO and Co-Founder of PagerDuty

  • 1. Anatomy of a real-life incident Alex Solomon CTO & Co-Founder @
  • 2. THIS IS A TRUE STORY The events in this presentation took place in San Francisco and Toronto on January 6, 2017 In the interest of brevity, some details have been omitted
  • 3. The Services Web2Kafka Service Incident Log Entries Service Docker Mesos / marathon Linux Kernel publishes change events from web monolith to Kafka for other services to consume stores log entries for incidents
  • 4. The People Eric Incident Commander Peter Scribe Ken Deputy Luke Communications Liaison Major incident response principal roles David Core on-call Cees Core eng Evan SRE on-call Renee IM People on-call Zayna Mobile on-call JD IM Data on-call Priyam EM on-call Subject Matter Experts (SMEs)
  • 6. [3:21 PM] David: SME !ic page Officer URL: Chat BOT 🚨Paging Incident Commander(s) ✔ Eric has been paged. ✔ Ken has been paged. ✔ Peter has been paged. Incident triggered in the following service: https://pd.pagerduty.com/services/PERDDFI David: SME web2kafka is down, and I'm not sure what's going on kicked off the major incident process [3:21 PM] Eric: IC Taking IC Eric took the IC role (he was IC primary on-call)
  • 7. The Incident Commander • The Wartime General: decision maker during a major incident • GOAL: drive the incident to resolution quickly and effectively • Gather data: ask subject matter experts to diagnose and debug various aspects of the system • Listen: collect proposed repair actions from SMEs • Decide: decide on a course of action • Act (via delegation): once a decision is made, ask the team to act on it. IC should always delegate all diagnosis and repair actions to the rest of the team.
  • 8. Priyam: SME I’m here from EM Evan: SME lmk if you need SRE sounds like IHM might be down too Ken: DEPUTY @renee, please join the call [3:22 PM] Ken took the deputy role Other SMEs joined
  • 9. The Deputy (backup IC) • The Sidekick: right hand person for the IC • Monitor the status of the incident • Be prepared to page other people • Provide regular updates to business and/or exec stakeholders
  • 10. Peter: SCRIBE I am now the scribe Eric: Looking to find Mesos experts Evan: Looking for logs & dashboards Zayna: SME seeing a steady rise in crashes in Android app around trigger incident log entries [3:24 PM] JD: SME No ILEs will be generated due to LES not being able to query web2kafka [3:25 PM] Eric: David, what have you looked at? David: trolling logs, see errors David: tried restarting, doesn’t help [3:23 PM] Ken: DEPUTY Notifications are still going out, subject lines are filled in but not email bodies (they use ILEs) Renee: SME Peter becomes the scribe Discussing customer-visible impact of the incident Ken is both deputy and scribe
  • 11. The Scribe • The Record-keeper • Add notes to the chatroom when findings are determined or significant actions are taken • Add TODOs to the room that indicate follow-ups for later (generally after the incident) • Monitor tasks assigned by the IC to other team members, remind the IC to follow up
  • 12. Renee: SME Can’t expand incident details Luke: CUST LIAISON suggested tweet: `There is currently an issue affecting the incident log entries component of our web application causing the application to display errors. We are actively investigating.` [3:29 PM] David: No ILEs can be created Renee: no incident details, error msg in the UI [3:27 PM] Peter: SCRIBE Eric: Comms rep on the phone? Luke Eric to Luke: Please compose a tweet Peter: SCRIBE Eric: What’s the customer impact? [3:26 PM] Peter: SCRIBE Luke to tweet Peter: SCRIBE IC asked the customer liaison to write a msg to customers Msg was sent out to customers
  • 13. The Communications Liaison • The link to the customer • Monitor customer and business impact • Provide regular updates to customers (and/or to customer-facing folks in the business) • (Optional) Provide regular updates to stakeholders
  • 14. Cees: SME I’m away from any laptops, just arrived at a pub for dinner. [3:36 PM] @cees Would you join us on the bridge? We have a few Mesos questions Eric: IC Evan: might need to kick new hardware if system is actually unreachable. Evan: slave01 is reachable David: slave02 is not reachable. David: slave03 is not reachable. David: only 3 slaves for mesos Eric: We are down to only one host Evan: Seeing some stuff. Memory exhaustion. [3:37 PM] Peter: SCRIBE TODO: Create a runbook for mesos to stop the world and start again Peter: SCRIBE David added Cees to the incident Eric: Is there a runbook for mesos? David: Yes, but not for this issue. [3:34 PM] Peter: SCRIBE Scribe captured a TODO to record & remember a follow-up that should happen after the incident is resolved We paged a Mesos expert who is not on-call The Mesos expert joined the chat
  • 15. David: Only 3 slaves in that cluster, we have another cluster in us-west-1 Eric: Two options: kick more slaves or restart marathon [3:38 PM] Peter: SCRIBE Evan: OOM killer has kicked in on slave01 [3:39 PM] Peter: SCRIBE Eric: Stop slaves in west2, startup web2kafka in west1 Evan: slave02 is alive! Eric: Waiting 2 minutes [3:47 PM] Peter: SCRIBE David: Consider bringing up another cluster? Cees: Should be trivial [3:44 PM] Peter: SCRIBE Eric to evan: please reboot slave02 and slave03 [3:41 PM] Peter: SCRIBE Restart slaves first Cees: SME slave01 is now down [3:42 PM] Evan: SME They are considering bringing up another Mesos cluster in west1 slave02 is back up after reboot, so they hold off on flipping to west1 Noticed that oom-killer killed the docker process on slave01
  • 16. Evan: Slave02 is quiet. Evan: Slave02 is trying to start, exiting with code 137 [3:49 PM] Peter: SCRIBE Evan: Slave02 is quiet. Evan: 137 means it’s being killed by OOM, OOM is killing docker containers continuously Peter: SCRIBE [3:53 PM] Proposed Action: David is going to configure marathon to allow more memory Peter: SCRIBE [3:54 PM] Proposed Action: Evan to force reboot slave01 Peter: SCRIBE [3:56 PM] David: Web2kafka appears to be running Eric: Looks like all things are running Renee: Things are fine with notifications JD: LES is seeing progress Peter: SCRIBE [3:55 PM] Customer impact: there are 4 tickets so far and 2 customers chatting with us, which is another 2 tickets Luke: CUST LIAISON They realized the problem: oom-killer is killing the docker containers over and over The resolution action was to redeploy web2kafka with a higher cgroup/Docker memory limit: 2 GB (vs 512 MB before) The customer liaison provided an update on the customer impact The system is recovering
  • 17. The Punchline • Root cause • Increase in traffic caused web2kafka to increase its memory usage • This caused the Linux oom-killer to kill the process • Then, mesos / marathon immediately restarted it, it ramped up memory again, oom-killer killed it, and so on. • After doing this restart-kill cycle multiple times, we hit a race-condition bug in the Linux kernel causing a kernel panic and killing the host • Other services running on the host were impacted, notably LES
  • 18. Summary • Incident Command • The most important role, crucial to fast decision making and action! • Takes practice and experience • Deputy • The right-hand person for the IC, can step in and take over Incident Command for long-running incidents • Responsible for business & exec stakeholder communications, allowing the technical team to focus on incident resolution • Scribe • Essential for providing context in the chatroom and tracking follow-ups & action items (for example, the IC saying “Evan, do X, report back in 5 min”) • Produces step-by-step documentation which is very helpful for constructing the timeline later (in the post-mortem) • Communications liaison • Essential for tracking customer impact and communicating status to customers
  • 19. The End Alex Solomon CTO & Co-Founder @ alex@pagerduty.com The PagerDuty Incident Response process and training is open-source: https://response.pagerduty.com
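
A note on the exit code mentioned in slide 16: by the usual shell convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9, SIGKILL. The kernel's oom-killer uses SIGKILL, which is why the team read 137 as an OOM kill, although anything that sends SIGKILL produces the same status. The sketch below is illustrative only: the function name `describe_exit_code` is hypothetical, it is not taken from the talk, and it assumes a Linux host where signal 9 is SIGKILL.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128  # e.g. 137 - 128 = 9
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited with error status {code}"


if __name__ == "__main__":
    # The status Evan reported for the web2kafka container.
    print(describe_exit_code(137))
```

A status like this only says how the process died, not why; confirming an OOM kill still means checking the kernel log on the host, as the responders did.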