Nothing Good Ever Happens After 2am

Reversim 2019
Daniel Korn
Engineering Team Lead at BigPanda 
korndaniel1
BigPanda’s Outage Procedure
Roles and responsibilities
• On-call
• Incident Manager On-Call (IMOC)
• Tech Lead On-Call (TLOC)
• Support On-Call (SOC)
Incident Priority Definitions
Priority | Affects | Outage resolution
P1 | Core feature, multiple customers | 24/7
P2 | Core feature, single customer | 24/7
P3 | Secondary feature, no workaround | Next business day
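As a rough illustration (not from the talk), the priority table above could be encoded as a small function. The names `classify_priority` and `RESOLUTION`, the parameters, and the `None` result for cases the table does not cover are all assumptions made for this sketch.

```python
from typing import Optional

# Outage resolution window per priority, copied from the table.
RESOLUTION = {"P1": "24/7", "P2": "24/7", "P3": "Next business day"}


def classify_priority(core_feature: bool, customers_affected: int,
                      has_workaround: bool) -> Optional[str]:
    """Map an incident to P1/P2/P3 following the priority table.

    Returns None for combinations the table does not define (assumption).
    """
    if core_feature:
        if customers_affected > 1:
            return "P1"  # core feature, multiple customers
        if customers_affected == 1:
            return "P2"  # core feature, single customer
        return None  # no affected customer: not covered by the table
    if not has_workaround:
        return "P3"  # secondary feature, no workaround
    return None  # secondary feature with a workaround: not covered


priority = classify_priority(core_feature=True, customers_affected=5,
                             has_workaround=False)
print(priority, RESOLUTION[priority])
```

Returning `None` instead of guessing keeps the sketch honest about the gaps: the slide does not say how to rank a secondary feature that has a workaround.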
Tools
• Alerting
• Communication
• Observability
1. Alert/Support notifies On-call
2. On-call performs simple mitigation
3. On-call escalates to IMOC
4. IMOC assesses impact, determines P1/P2/P3
5. IMOC escalates to TLOC and SOC
6. On-call: if (P1) { StatusPage; dedicated channel; }
7. SOC updates customers
8. R&D mitigates till solved, updates StatusPage
9. IMOC verifies resolved, posts summary in channel
10. IMOC runs postmortem, shares with stakeholders
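The `If (P1)` step above is written as pseudocode on the slide. A minimal Python rendering might look like the following; the function name `on_call_p1_actions` and the action strings are hypothetical, and real integrations with a status page or chat tool are deliberately left out.

```python
from typing import List


def on_call_p1_actions(priority: str) -> List[str]:
    """Return the extra actions On-call takes for this priority.

    Mirrors the slide's pseudocode: if (P1) { StatusPage; dedicated channel; }
    The action strings are placeholders, not real API calls.
    """
    actions: List[str] = []
    if priority == "P1":
        actions.append("open StatusPage incident")
        actions.append("open dedicated channel")
    return actions


print(on_call_p1_actions("P1"))
print(on_call_p1_actions("P3"))
```

The branch only fires once a priority has been determined, which is why the impact assessment has to happen before this step.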
The Long Night
THIS IS A TRUE STORY. The events depicted in this postmortem took place in Tel Aviv and San Francisco in 2018.
Despite the request of the survivors, the names have not been changed. Out of respect for our customers, the story has been told exactly as it occurred.
• Michal: On-call
• Almog & Pini: TLOCs
• Daniel (Me): TLOC
• Shmeff Andru: SOC Support
• Julio: Support
Background
• REMINDER: BigPanda’s SLA
• New Access Control (RBAC) service
• Not all customers migrated
• Sunday: Multi-service deployment
[MON 05:03 PM] SOC
Multiple tickets: “cannot update environments”
[05:05 PM] On-call
Asks SOC for details, opens a dedicated Slack channel
[05:08 PM] On-call
Identifies the issue as auth-related, notifies TLOCs
[05:33 PM] SOC
Considers opening a status page, but “might be a P3”
[05:35 PM] On-call
“We think it’s related to a deploy, working on a fix”
[06:16 PM] SOC
Opens status page
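For a sense of scale, the gap between the first tickets (05:03 PM) and the status page opening (06:16 PM) can be computed from the two timestamps on the slides. The helper `minutes_between` is illustrative only and not part of the talk; it assumes both times fall on the same day and an English locale for the AM/PM marker.

```python
from datetime import datetime

FMT = "%I:%M %p"  # 12-hour clock with AM/PM, as used on the slides


def minutes_between(start: str, end: str) -> int:
    """Minutes elapsed from start to end, both on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


print(minutes_between("05:03 PM", "06:16 PM"))  # 73
```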
TAKEAWAY: Stick to the Plan
[06:50-07:30 PM] TLOCs
Fix is tested, not reproduced; debate: fix or revert
[07:41 PM] TLOCs
Deploy fix to production
[07:45-08:05 PM] SOC
Verifies together with TLOCs that the issue is resolved
[08:10 PM] SOC
Closes status page. On-call and TLOCs leave
TAKEAWAY: Rule of Thumb: REVERT FIRST
[12:45 AM] Support
“Some customers can’t see roles in the env editor”
[12:57 AM] SOC
“So it appears to be just a UI issue”. Notifies On-call
[12:59 AM] On-call
Notifies TLOC
[01:01 AM] TLOC
Starts investigating the issue
“If it looks like an outage, and (support) sounds like an outage, then it might be just a bug”
– Someone smart
TAKEAWAY: Do not Assume an Outage
[01:20 AM] TLOCs
Identify the cause, start working on a fix
[01:54 AM] TLOCs
Deploy fix to production, ask SOC to verify with customers
“If you think this has a happy ending, you haven’t been paying attention.”
– Ramsay Bolton
[01:57 AM] Support
Customers report the initial issue: “cannot update environments”
[02:00 AM] SOC + Support
Debate re-opening the StatusPage
[02:03 AM] TLOCs
Start investigating the issue
[02:10 AM] TLOCs
Identify the cause: lack of permissions (migration)
[02:15-02:51 AM] TLOC
Manually adds missing permissions to customers DB
TAKEAWAY: Time to Call it a Night
[02:52 AM] TLOC
Having problems with a specific customer
[02:56 AM] SOC
Verifies this customer is facing the issue
[02:56-03:25 AM] TLOCs
Identify the problem: edge case involving FT and manual customizations
[03:25 AM] SOC
Asks TLOC to discuss the situation on a phone call
[03:29- AM] SOC + TLOC
Sensitive customer, no changes, issue remains
[-04:07 AM] SOC + TLOC
SOC asks TLOC to commit to a fix by EOD
[09:30 AM - 05:12 PM] TLOCs
Implement a fix, deploy to production, ask SOC to verify
[05:25 PM] SOC
Verifies issue resolved
TAKEAWAY: Do not Commit to Action Items
[WED 11:00 AM] R&D
Conduct a postmortem, share with R&D and CS
[07:00 PM] CS + R&D + PM
Joint postmortem, preparing customer updates
“Chaos isn’t a pit. Chaos is a ladder.”
– Petyr “Littlefinger” Baelish
Recap
• Stick to the plan
• Rule of thumb: REVERT FIRST
• Do not assume an outage
• Time to call it a night
• Do not commit to action items