Nothing Good Ever Happens After 2am

A Postmortem session from Reversim Summit 2019.


  1. Nothing Good Ever Happens After 2am (Reversim 2019)
  2. Daniel Korn, Engineering Team Lead at BigPanda (korndaniel1)
  3. BigPanda’s Outage Procedure
  4. Roles and responsibilities
     • On-call
     • Incident Manager On-Call (IMOC)
     • Tech Lead On-Call (TLOC)
     • Support On-Call (SOC)
  5. Incident Priority Definitions
     Priority | Affects | Outage Resolution
     P1 | Core feature, multiple customers | 24/7
     P2 | Core feature, single customer | 24/7
     P3 | Secondary feature, no workaround | Next business day
  6. Tools
  7. Tools • Alerting
  8. Tools • Alerting • Communication
  9. Tools • Alerting • Communication • Observability
  10. Outage procedure, steps 1-5
      1. Alert/Support notifies On-call
      2. On-call performs simple mitigation
      3. On-call escalates to IMOC
      4. IMOC assesses impact, determines P1/P2/P3
      5. IMOC escalates to TLOC and SOC
  11. Outage procedure, steps 6-10
      6. On-call: if (P1) { StatusPage; dedicated channel; }
      7. SOC updates customers
      8. R&D mitigates until solved, updates StatusPage
      9. IMOC verifies the issue is resolved, posts a summary in the channel
      10. IMOC runs a postmortem, shares it with stakeholders
  12. The Long Night
  13. THIS IS A TRUE STORY. The events depicted in this postmortem took place in Tel Aviv and San Francisco in 2018. Despite the request of the survivors, the names have not been changed. Out of respect for our customers, the story has been told exactly as it occurred.
  14. Michal: On-call • Almog & Pini: TLOCs • Daniel (me): TLOC
  15. Shmeff: SOC • Andru: Support • Julio: Support
  16. Background
      • REMINDER: BigPanda’s SLA
      • New Access Control (RBAC) service
      • Not all customers migrated
      • Sunday: Multi-service deployment
  17. [MON 05:03 PM] SOC: multiple tickets: “cannot update environments”
      [05:05 PM] On-call: asks SOC for details, opens a dedicated Slack channel
      [05:08 PM] On-call: identifies the issue as Auth-related, notifies TLOCs
  18. [05:33 PM] SOC: considers opening a status page, but “might be a P3”
      [05:35 PM] On-call: “we think it’s related to a deploy, working on a fix”
      [06:16 PM] SOC: opens status page
  19. TAKEAWAY: Stick to the Plan
  20. [06:50-07:30 PM] TLOCs: fix is tested, not reproduced; debate: fix or revert
      [07:41 PM] TLOCs: deploy fix to production
      [07:45-08:05 PM] SOC: verifies together with TLOCs that the issue is resolved
      [08:10 PM] SOC: closes status page; On-call and TLOCs leave
  21. TAKEAWAY: Rule of Thumb: REVERT FIRST
  22. [12:45 AM] Support: “Some customers can’t see roles in the env editor”
      [12:57 AM] SOC: “So it appears to be just a UI issue”. Notifies On-call
      [12:59 AM] On-call: notifies TLOC
      [01:01 AM] TLOC: starts investigating the issue
  23. “If it looks like an outage, and (support) sounds like an outage, then it might be just a bug” - Someone smart
  24. TAKEAWAY: Do not Assume an Outage
  25. [01:20 AM] TLOCs: identify the cause, start working on a fix
      [01:54 AM] TLOCs: deploy fix to production, ask SOC to verify with customers
  26. “If you think this has a happy ending, you haven’t been paying attention.” - Ramsay Bolton
  27. [01:57 AM] Support: customers reporting the initial issue - “cannot update environments”
      [02:00 AM] SOC + Support: debating whether to re-open the StatusPage
      [02:03 AM] TLOCs: start investigating the issue
  28. [02:10 AM] TLOCs: identify the cause - lack of permissions (migration)
      [02:15-02:51 AM] TLOC: manually adds missing permissions to customers DB
  29. TAKEAWAY: Time to Call it a Night
  30. [02:52 AM] TLOC: having problems with a specific customer
      [02:56 AM] SOC: verifies this customer is facing the issue
      [02:56-03:25 AM] TLOCs: identify the problem - edge case involving FT and manual customizations
      [03:25 AM] SOC: asks TLOC to discuss the situation on a phone call
  31. [03:29- AM] SOC + TLOC: sensitive customer, no changes, issue remains
      [-04:07 AM] SOC + TLOC: SOC asks TLOC to commit to a fix by EOD
      [09:30 AM - 05:12 PM] TLOCs: implement a fix, deploy to production, ask SOC to verify
      [05:25 PM] SOC: verifies issue resolved
  32. TAKEAWAY: Do not Commit to Action Items
  33. [WED 11:00 AM] R&D: conduct a postmortem, share with R&D and CS
      [07:00 PM] CS + R&D + PM: joint postmortem, preparing customer updates
  34. “Chaos isn’t a pit. Chaos is a ladder.” - Petyr “Littlefinger” Baelish
  35. Recap
  36. • Stick to the plan
      • Rule of thumb: REVERT FIRST
      • Do not assume an outage
      • Time to call it a night
      • Do not commit to action items
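
The priority definitions (slide 5) and the P1 branch in the outage procedure (slide 11) can be sketched in a few lines of Python. This is only an illustration of the rules as written on the slides, not BigPanda's actual tooling: the names `Priority`, `RESOLUTION`, `classify`, and `response_steps` are hypothetical, and two points are assumptions. A secondary feature that has a workaround is not covered by the slide and is returned as `None`, and the steps after the P1 branch are assumed to apply to every priority.

```python
from enum import Enum
from typing import List, Optional


class Priority(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# Outage resolution windows as listed on the priority slide.
RESOLUTION = {
    Priority.P1: "24/7",
    Priority.P2: "24/7",
    Priority.P3: "next business day",
}


def classify(core_feature: bool, multiple_customers: bool,
             has_workaround: bool = False) -> Optional[Priority]:
    """Map observed impact to a priority (hypothetical helper)."""
    if core_feature:
        # Core feature: P1 for multiple customers, P2 for a single one.
        return Priority.P1 if multiple_customers else Priority.P2
    if not has_workaround:
        # Secondary feature with no workaround.
        return Priority.P3
    # Assumption: the slide does not define this case.
    return None


def response_steps(priority: Priority) -> List[str]:
    """Steps 6-10 of the procedure; StatusPage and channel only for P1."""
    steps: List[str] = []
    if priority is Priority.P1:
        steps += ["On-call: open StatusPage",
                  "On-call: open dedicated channel"]
    # Assumption: the remaining steps apply to every priority.
    steps += [
        "SOC: update customers",
        "R&D: mitigate until solved, update StatusPage",
        "IMOC: verify resolved, post summary in channel",
        "IMOC: run postmortem, share with stakeholders",
    ]
    return steps
```

For example, `classify(core_feature=True, multiple_customers=True)` returns `Priority.P1`, and `response_steps(Priority.P1)` starts with the StatusPage and dedicated-channel steps. Making the classification explicit like this relates to the first takeaway: in the story, the status page was delayed because the SOC thought the incident "might be a P3".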
