15. THIS IS A TRUE STORY.
The events depicted in this postmortem
took place in Tel Aviv and San Francisco
in 2018.
Despite the request of the survivors, the
names have not been changed.
Out of respect for our customers, the
story has been told exactly as it occurred.
18. Background
• REMINDER: BigPanda’s SLA
• New Access Control (RBAC) service
• Not all customers migrated
• Sunday: Multi-service deployment
19. [MON 05:03 PM] SOC
Multiple tickets: “cannot
update environments”
[05:05 PM] On-call
Asks SOC for details, opens a
dedicated Slack channel
[05:08 PM] On-call
Identifies as Auth-related,
notifies TLOCs
22. [05:33 PM] SOC
Considers opening a status
page, but “might be a P3”
[05:35 PM] On-call
“we think it’s related to a
deploy, working on a fix”
[06:16 PM] SOC
Opens status page
24. [06:50-07:30 PM] TLOCs
Fix is tested, not reproduced;
debate fix or revert
[07:41 PM] TLOCs
Deploy fix to production
[07:45-08:05 PM] SOC
Verifies together with TLOCs
the issue is resolved
[08:10 PM] SOC
Closes status page
On-call and TLOCs leave
26. [12:45 AM] Support
“Some customers can’t see
roles in the env editor”
[12:57 AM] SOC
“So it appears to be just a
UI issue”. Notifies On-call
[12:59 AM] On-call
Notifies TLOC
[01:01 AM] TLOC
Starts investigating the issue
28. “If it looks like an outage, and (support)
sounds like an outage, then it might
be just a bug”
– Someone smart
36. [02:52 AM] TLOC
Having problems with a
specific customer
[02:56 AM] SOC
Verifies this customer is
facing the issue
[02:56-03:25 AM] TLOCs
Identify the problem - edge case
involving FT and manual customizations
[03:25 AM] SOC
Asks TLOC to discuss the
situation on a phone call
37. [03:29- AM] SOC + TLOC
Sensitive customer, no
changes, issue remains
[-04:07 AM] SOC + TLOC
SOC asks TLOC to
commit to fix by EOD
[09:30 AM - 05:12 PM] TLOCs
Implement a fix, deploy to production,
ask SOC to verify
[05:25 PM] SOC
Verifies issue resolved