War Room Warrior: How to manage war room situations

Site outages and incidents are par for the course for tech companies. Solving the root cause often involves tense conversations and high-pressure situations. The war room serves as a dedicated space where the most critical team members work through these issues, but what happens when everyone gets in the room?

The goal is clear: restore site service as quickly as possible. There can be many approaches to getting there, however. In this talk, Rashi Khurana will share best practices in leading teams to troubleshoot issues and navigate incident resolutions. She will address communication strategies, people management, process implementation, and managing retroactive evaluations.

Published in: Technology

  1. 1. Proprietary and confidential War Room Warrior: How to keep your cool in a catastrophe Rashi Khurana, Vice President of Engineering @RaKhurana
  2. 2. A beautiful hike... #WarRoomWarrior
  3. 3. ...and then I got paged.
  4. 4. Hi, I’m Rashi Khurana!
  5. 5. Chaos, Customers and Revenue. Annual downtime: ticketmaster.com, nike.com, jcpenny.com, gamestop.com, victoriasecret.com, groupon.com, flipkart.com, taobao.com
  6. 6. Proprietary and confidential
  7. 7. Part 1: Ask yourself, “Are we prepared for these situations?”
  8. 8. Besides uptime, we measure: ● Mean Time To Detect the issue (MTTD): the time between when the incident started and when we first realized it (got paged). ● Mean Time To Resolve the issue (MTTR): the time between when the incident was reported and when it was fully resolved.
  9. 9. Prepare for MTTD: ● Monitoring and alerting ● Logging ● Service ownership ● PagerDuty on-calls ○ Triage or escalations ● Organize a war room ● Get the right crew online ● Traceroutes and similar developer tests
  10. 10. Proprietary and confidential
  11. 11. It’s all about monitoring/alerting
  12. 12. Prepare for MTTR: ● Documentation and runbooks ○ Setup requirements like Okta, SumoLogic, LDAP, etc. ○ Runbooks for on-call ● Skills and training: “I got paged, now what?” Welcome! You are a war room warrior! ○ On-call runbook walkthroughs ○ On-call expectations
  13. 13. The “Follow The Sun” Approach: ● Multiple tiers of responders ● Tier 1 and escalations ● Set up your Service Operating Centers globally ● Training and documentation
  14. 14. #TechTransformation Runbook
  15. 15. Change Management: Changes are managed, not controlled. Create a framework for frequent changes: ● SDLC full cycle includes change requests ● DevOps version of Change Control ● CI/CD and iterate frequently ● Changes are still logged (Jira) ● Easy to access and revisit (deployment markers)
  16. 16. Change Control: Change requires risk assessments at specific points in time. Create a framework for risky changes: ● Risk profiles ● Conservative approvals ● Dev + Ops version of Change Control
  17. 17. Change Advisory Board (CAB Approval): Applies to every change that is critical to all services, such as DNS routing changes or incoming proxy updates. Questions to ask: ● Do we have a roll-back procedure? ● Does it include the time it takes to execute the roll-back? ● What services or products can it impact? ● Are any other changes scheduled around the same time? ● Was the change tested in pre-production? ● Is execution happening at peak customer hours?
  18. 18. Internal Communication Strategy: How do we communicate what is going on? ● Accessible email template ● Set up email groups for :from and :to ● Easy-to-read color coding ○ Red - Critical impact ○ Orange - Parts of critical flow impacted ○ Green - All back to normal ● Slack channel #warroom
  19. 19. Company email from OPS_Incident to tech.notices
  20. 20. External Communication Strategy: ● Media and comms for social media ● Status page or maintenance page ● High-revenue customers
  21. 21. Part 2: In the war room
  22. 22. Impact Definitions
  23. 23. Recap: What do we have so far? ● Severity level is determined ● Communication is started ● There is an Incident Manager ● There is a Tech Recovery Manager ● Staff who were paged are present ● There is a decision maker ● Let’s look into the difficult part...
  24. 24. Impact Detection: ● Is there customer impact? - SEV 0 ● Is the impact functionality-specific or sitewide? - Dashboards ● Any changes in CAB that day? ● Is the impact perpetual or intermittent? ● Am I able to reproduce the issue in Production? ● Are customers starting to contact Customer Care? ● What percentage of customers are impacted? ● Am I able to reproduce the issue in QA? (hint)
  25. 25. Marching to resolution - Infrastructure vs Application
  26. 26. Code that runs our applications and services: ● Issue is siloed to my application. ○ Is the issue reproducible in QA? ● When was the last deployment? ● What part of the site is impacted, and was there a code change in a downstream dependency? ● Are CPU, throughput, or memory trends erratic? ● Memcache and DB connections for the application ● Are there any A/B tests running?
  27. 27. Infrastructure that runs our application: ● Includes load balancers, KVMs, network, nodes, storage, Puppet, Chef, AWS and K8s, EMC, etc. ● Are multiple teams getting paged? ● Is the issue not reproducible in DEV/QA? ● What is the common denominator for the paged application? ● Are errors on a single route for the application or has the overall error rate spiked? ● Check network and load balancer graphs ● Check the dependency map view of New Relic to see if there is something red. ● Catch-22: it is possible the traffic does not even reach us.
  28. 28. Infrastructure as Code: ● Best practices for code apply to infrastructure: ○ code reviews ○ versioning ○ automation tests ○ e.g., Puppet, Chef, Ansible, Terraform, Helm charts, Jenkinsfiles, Dockerfiles. ● Application teams own issues that are infrastructural. ● Self-serve: You built it, you run it!
  29. 29. “Not my issue.” ● Lead from behind ● Listen and gather information ● Be curious, probe from different angles ● Broader context - use your expertise to give feedback ● Help with trivial tasks ● Moral support ● But don’t get in the way
  30. 30. Sometimes it’s not your issue until it is. ● Application teams may be needed to restart ● Rebuild a lost image ● Verify after changes ● Lingering issues in the aftermath ● e.g., Artifactory issue
  31. 31. Note the slip-ups: ● Are monitoring thresholds set up correctly? ● Are we hearing from our customers before we are aware of the issue? ● Was there a warning before the alert? ● Could this have been caught by an automated test in a pre-prod environment?
  32. 32. Part 3: Post-incident and Ownership
  33. 33. It’s a learning opportunity. Setting up a Postmortem or Root Cause Analysis: ● Postmortem presenters - owners ● Audience ● Knowledge sharing for the organization ● Details published in email ● When is a postmortem closed?
  34. 34. No Blame!
  35. 35. Sample Postmortem
  36. 36. Postmortems: ● Overview ● 5 Whys ● Resolution ● Root Cause
  37. 37. Postmortems: ● Overview ● 5 Whys ● Resolution ● Root Cause ● Action Items / Next Steps (JIRA Ticket References Required)
  38. 38. Postmortems: ● Overview ● 5 Whys ● Resolution ● Root Causes ● Action Items / Next Steps (JIRA Ticket References Required) ● Impact ● Lessons learned and knowledge shared
  39. 39. Postmortems: ● Overview ● 5 Whys ● Resolution ● Root Causes ● Action Items / Next Steps (JIRA Ticket References Required) ● Impact ● Lessons Learned ● Could the incident have been detected earlier? ● Were proper procedures followed in notifying support teams? ● Responders / Attendees ● Timelines
  40. 40. Conclusion
  41. 41. Thank You! @RaKhurana
  42. 42. Severity-Level Examples
  43. 43. Severity-Level Examples
  44. 44. Change Management - Process details
  45. 45. CAB key Jira steps
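
Slide 8 defines MTTD and MTTR in terms of incident timestamps. The arithmetic could be sketched roughly as below. This is only an illustration: the `Incident` record, its field names, and the sample timestamps are made up, and it assumes that "reported" means the moment the team was paged.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions, not from the talk.
    started: datetime   # when the incident actually began
    detected: datetime  # when the team was first paged about it
    resolved: datetime  # when service was fully restored


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time between incident start and the first page."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time between the report (assumed: the page) and full resolution."""
    return mean_delta([i.resolved - i.detected for i in incidents])


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0),
             datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 10, 45)),
    Incident(datetime(2024, 2, 1, 22, 0),
             datetime(2024, 2, 1, 22, 15),
             datetime(2024, 2, 1, 23, 15)),
]

print("MTTD:", mttd(incidents))  # MTTD: 0:10:00
print("MTTR:", mttr(incidents))  # MTTR: 0:50:00
```

Tracking the two numbers separately shows whether time is being lost before the page (monitoring and alerting) or after it (runbooks and training), which matches the "Prepare for MTTD" and "Prepare for MTTR" slides.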

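Slides 18 and 19 describe a color-coded email template sent from OPS_Incident to tech.notices. One way such a notice might be assembled is sketched below using Python's standard `email.message` module. The `Status` enum, the `build_notice` helper, the `example.com` addresses, and the sample summary are all hypothetical; the talk does not show the actual template.

```python
from email.message import EmailMessage
from enum import Enum


class Status(Enum):
    # Color codes and meanings as listed on slide 18.
    RED = "Critical impact"
    ORANGE = "Parts of critical flow impacted"
    GREEN = "All back to normal"


def build_notice(
    status: Status,
    summary: str,
    sender: str = "OPS_Incident@example.com",    # hypothetical address
    recipient: str = "tech.notices@example.com",  # hypothetical address
) -> EmailMessage:
    """Build a plain-text incident notice with the color code in the subject."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[{status.name}] {summary}"
    msg.set_content(f"Status: {status.name} - {status.value}\n\n{summary}\n")
    return msg


# Made-up example incident.
notice = build_notice(Status.ORANGE, "Checkout errors elevated")
print(notice["Subject"])
```

Putting the color code at the front of the subject line lets readers gauge severity from the inbox view without opening the message.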