Ignite (10m) how to not burn out your monitoring team

Bad monitoring, alerting, and logging have left Gil Zellner very frustrated in some of his previous positions. It seems that almost nobody gets this exactly right. This talk covers the most annoying issues he has come across and offers advice on how to fix them.

Published in: Engineering

  1. Relaxing picture of yoga
  2. PagerDuty alert
  3. Hunt through logs for 2 hours
  4. How to not burn out your production team. Gil Zellner (CloudifyDev at Gigaspaces). Twitter: @Heathenaspargus
  5. Who am I? Now: Past:
  6. (no text)
  7. (no text)
  8. The cost of hiring a new employee is 1.5-3x their monthly salary
  9. (no text)
  10. (no text)
  11. Next day
  12. (no text)
  13. (no text)
  14. Frustration: I am unable to complete my task
  15. Time spent inefficiently
  16. Repetitive tasks
  17. Working alone
  18. Yak shaving
  19. https://www.ergoflex.co.uk/blog/category/sleep-research/sleeponomics-could-sleep-deprivation-be-the-real-reason-politicians-make-bad-decisions
  20. Easy (days): no changes to infrastructure, just policy. Intermediate (months): small changes to apps, logging, light automation. Hard (years): design for better operability, long term.
  21. Mandatory half day off after a night production issue
  22. Allocate weekly time to resolve or automate the issues that kept us up at night
  23. Wider rotation (more people do on-call)
  24. Knowledge Matrix Deploy System Mobile Link Backend Gil V V Karen V V Ari V V
  25. (no text)
  26. Easy (days): no changes to infrastructure, just policy. Intermediate (months): small changes to apps, logging, light automation. Hard (years): design for better operability, long term.
  27. (no text)
  28. Solution: alert only on things that meet the following criteria: 1) symptoms, not suspected "causes"; 2) actionable; 3) business-breaking
  29. Alerte générale! ("General alert!")
  30. Solution: direct alerts to the relevant parties
  31. (no text)
  32. (no text)
  33. (no text)
  34. (no text)
  35. What are your KPIs?
  36. (no text)
  37. Netflix stream starts per second
  38. Picking how to measure things
  39. Diagnosis
  40. Make a heal script
  41. (no text)
  42. (no text)
  43. Facebook Auto Remediation: https://www.facebook.com/notes/facebook-engineering/making-facebook-self-healing/10150275248698920
  44. Easy (days): no changes to infrastructure, just policy. Intermediate (months): small changes to apps, logging, light automation. Hard (years): design for better operability, long term.
  45. (no text)
  46. Bad artists copy, great artists steal. Email: Gil.Zellner@gmail.com, Twitter: @Heathenaspargus
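
The alerting policy from the slides (page only on symptoms that are actionable and business-breaking, and direct each alert to the relevant party) could be sketched roughly as below. This is an illustration under stated assumptions, not code from the talk: the `Alert` fields, the `OWNERS` table, and the team names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; the field names are assumptions."""
    name: str
    service: str                # label used to find the owning team
    is_symptom: bool            # a user-visible symptom, not a suspected cause
    is_actionable: bool         # someone can act on it right now
    is_business_breaking: bool  # the business is hurt if nobody acts


# Hypothetical ownership table: which on-call rotation owns which service.
OWNERS = {
    "backend": "backend-oncall",
    "mobile": "mobile-oncall",
}
DEFAULT_OWNER = "ops-oncall"


def should_page(alert: Alert) -> bool:
    """Page only when all three criteria hold."""
    return (
        alert.is_symptom
        and alert.is_actionable
        and alert.is_business_breaking
    )


def route(alert: Alert) -> Optional[str]:
    """Return the rotation to page, or None if the alert should not page."""
    if not should_page(alert):
        return None
    return OWNERS.get(alert.service, DEFAULT_OWNER)


if __name__ == "__main__":
    checkout_down = Alert("checkout errors", "backend", True, True, True)
    disk_at_80 = Alert("disk at 80%", "backend", False, True, False)
    print(route(checkout_down))  # the backend rotation is paged
    print(route(disk_at_80))     # None: a suspected cause, so nobody is woken up
```

Alerts that fail the filter are not necessarily discarded; in a setup like this they would typically go to a dashboard or a ticket queue rather than to someone's pager.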
