"Case study — how we block production. Triangulate issue, fix and postmortem", Mykyta Savin

  1. Structure of the presentation ●Short overview of the project and infrastructure, 5 mins ●Issue description (spoiler: production is down), 5 mins ●Investigation and triangulation, 15 mins ●Fix
  2. E-government platform
  3. RabbitMQ infrastructure prod-rabbit-dr-1 prod-rabbit-dr-3 prod-rabbit-dr-4 prod-rabbit-dr-1 prod-rabbit-dr-2 prod-rabbit-new-lb
  4. Problem definition phase
  5. Problem description ●Alarm on average page processing time fired ●Alarm on number of gateway timeouts fired
  6. Problem description ●Alarm on average page processing time fired ●Alarm on number of gateway timeouts fired ●We checked manually: the website was very slow ●The situation kept getting worse: ~15 mins after the incident was detected, the site almost stopped working
  7. Problem triangulation
  8. Anamnesis checklist ●Did we change anything in the system recently? ●Did we have any recent deployments? ●Is RMQ working fine? Is it overloaded? ●Is there anything abnormal in the logs or the monitoring system? ●Check system elements one by one if required
  9. ●Huge traffic increase on prod-rabbit-lb, up to 300 Mbit/s ●Number of messages in queues is normal ●But the in/out rate in RMQ is really slow ●Nothing incorrect in the logs ●So the first result: something keeps RMQ busy
  10. EUREKA
  11. Problem solution
  12. ●The fix is trivial: in the short term, just increase the memory limit for the problematic server ●In the longer term, rework the service so that it generates smaller RMQ messages
  13. Lessons learned
  14. ●Improve monitoring: – Check for frequently restarting services – Check for Docker OOM killer events – Check for system-wide OOM killer events
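
The three monitoring checks from the last slide could be sketched roughly as below. This is only an illustration, not the tooling used in the talk. It assumes the kernel log text (for example from `dmesg`) and the JSON output of `docker inspect` have already been collected by some other means. The function names, the `max_restarts` threshold and the exact regular expression are hypothetical, and the wording of OOM killer log lines varies between kernel versions, so the pattern may need adjusting.

```python
import json
import re

# Assumed shape of kernel OOM killer lines (wording differs between kernels):
#   "Out of memory: Killed process 1234 (java) ..."             system-wide OOM
#   "Memory cgroup out of memory: Killed process 1234 (java)"   container memory limit hit
OOM_RE = re.compile(
    r"(?P<scope>Memory cgroup out of memory|Out of memory): "
    r"Killed? process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(kernel_log: str) -> list:
    """Return one dict per OOM kill found in the given kernel log text."""
    events = []
    for line in kernel_log.splitlines():
        match = OOM_RE.search(line)
        if match is None:
            continue
        is_cgroup = match.group("scope").startswith("Memory cgroup")
        events.append({
            "scope": "cgroup" if is_cgroup else "system",
            "pid": int(match.group("pid")),
            "process": match.group("name"),
        })
    return events


def find_suspect_containers(inspect_json: str, max_restarts: int = 3) -> list:
    """Return names of containers that were OOM-killed or restarted too often.

    `inspect_json` is the JSON array printed by `docker inspect <containers>`;
    only the `Name`, `RestartCount` and `State.OOMKilled` fields are read.
    """
    suspects = []
    for container in json.loads(inspect_json):
        oom_killed = container.get("State", {}).get("OOMKilled", False)
        restarts = container.get("RestartCount", 0)
        if oom_killed or restarts > max_restarts:
            suspects.append(container.get("Name", "?").lstrip("/"))
    return suspects
```

A monitoring job could run both functions periodically and raise an alarm whenever either returns a non-empty list. Scope `cgroup` corresponds to the Docker memory limit check and scope `system` to the system-wide check.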