Observability - From fire fighting to smoke detection


  1. THE IMPORTANCE OF OBSERVABILITY: From Fire Fighting to Smoke Detection. @ShaneCarroll84
  2. WHAT IS OBSERVABILITY?
  3. DASHBOARDS
  4. KEY METRICS
  5. OBJECTIVES
  6. ALERTS
  7. OBSERVABILITY IS LIKE A FITNESS TRACKER, BUT FOR YOUR SYSTEM
  8. MACGYVER’S JOB? Track content opens, clicks, views, interactions… Get this data and send it on for other teams.
  9. CHALLENGES: Considerable amount of old enterprise environments… Ancient systems that are not easily testable.
  10. HOW DID THE LACK OF OBSERVABILITY IMPACT OUR TEAM?
  11. IT ALL STARTED WITH A BUG… Missing a large number of tracking events for one customer… 🤔 A customer is telling us we’ve a problem with our system.
  12. WE WERE IN FIRE FIGHTING MODE: Scrambling to try to find the cause of the issue… Didn’t know the full impact, causing stress for the team.
  13. (No text on this slide.)
  14. LET’S LOOK AT THE LOGS… Logs were noisy and not easily searchable… Not all the information needed to isolate the issue was logged.
  15. MACGYVER, WE HAVE A PROBLEM: Issue with titles containing special characters… 🤦 Lots of customers impacted and over a million tracking events lost.
  16. WHY DID WE NOT GET AN ALERT? IT WAS LOST IN A SEA OF ALERTS.
  17. OUR PROCESS WAS BROKEN! Took weeks to gather missing data and reprocess events… Problem continued to impact customers and other teams.
  18. WE DIDN’T KNOW OUR SYSTEM’S HEALTH: What can we learn from this? Use a bad experience, such as a bug, to learn and spark change.
  19. WHEN YOU LOOK AT YOUR CURRENT SYSTEM, HOW DO YOU KNOW IT’S HEALTHY?
  20. TESTING AT POPPULO: Rob Meaney, Head of Testing. CODS model. 10 P’s of Testability.
  21. LEARNING REVIEW: Detection, Impact, Isolation, Fix & Retest, Repair, Minimise Impact, Prevention.
  22. SHARE LEARNINGS
  23. RISK
  24. THREE AMIGOS: George Dinwiddie first came up with this strategy in his blog. The Three Amigos – Product, Developer, and Tester – discuss the new feature.
  25. THREE AMIGOS: Allow for discussion on risk before beginning work on a feature… We now ‘Three Amigo’ each new story before beginning any new development.
  26. REFINING ALERTS: Reduce noise and only alert on what is important to the team. Entire team takes responsibility for investigating and fixing alerts.
  27. DAILY STAND-UP: We changed our stand-up to include a new question: “Any new alerts today?” Small change, but now it’s part of our team process!
  28. IMPROVED LOGGING: Removed noise. Logged everything that helped isolate potential issues. Used structured logging.
  29. UNSTRUCTURED LOGS
  30. UNSTRUCTURED LOGS
  31. VISUALISATIONS
  32. DASHBOARDS: Identified critical metrics. Highlight failures. Show trends. Added important tests to the dashboard.
  33. KEY METRICS
  34. COMPARE DATA
  35. DRILL-DOWN
  36. LIVE ARCHITECTURE
  37. WATCHING TV AT WORK: Why invest time and effort into monitoring tools if no one looks at them? Dashboards are now in constant view!
  38. TRACING
  39. TRACING
  40. BUT BE CAREFUL: Don’t introduce a dependency in your system when adding monitoring tools!
  41. PERFORMANCE TESTS
  42. PERFORMANCE TESTS
  43. CLIENT-SIDE ERRORS
  44. CLIENT-SIDE ERRORS
  45. CLIENT-SIDE ERRORS
  46. CLIENT-SIDE ERRORS
  47. IN-HOUSE TOOLS
  48. IN-HOUSE TOOLS
  49. IN-HOUSE TOOLS
  50. SMOKE DETECTION MODE ACTIVATED: Risk is discussed early. Alerts are raised and actioned. Dashboards show our critical metrics.
  51. SMOKE DETECTION MODE ACTIVATED: Investigating logs and isolating issues is easier. Quickly see who's impacted. Replay scripts are available.
  52. HOW DO YOU GO FROM FIRE FIGHTING TO SMOKE DETECTION? LEARN FROM THE FIRE!
  53. THANK YOU!
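
The IMPROVED LOGGING slide mentions structured logging without showing what it looks like. The sketch below is one possible illustration in Python, not the team's actual setup: the logger name `tracking` and the fields `customer_id`, `event_type` and `title` are assumptions made up for this example. It emits one JSON object per log line, so a dropped tracking event can be filtered by customer instead of searched for in free text.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes an attribute
        # on the record; merge it in as top-level, searchable fields.
        payload.update(getattr(record, "context", {}))
        # ensure_ascii=False keeps special characters in titles readable.
        return json.dumps(payload, ensure_ascii=False)


def build_logger(stream) -> logging.Logger:
    """Create a logger (hypothetical name 'tracking') that writes JSON lines."""
    logger = logging.getLogger("tracking")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Write to an in-memory buffer so the example is self-contained.
buffer = io.StringIO()
log = build_logger(buffer)
log.warning(
    "tracking event dropped",
    extra={
        "context": {
            "customer_id": "c-123",  # hypothetical field
            "event_type": "open",  # hypothetical field
            "title": "Q&A: résumé tips",  # title with special characters
        }
    },
)

# Because every line is JSON, "who's impacted?" becomes a simple filter.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
impacted = {r["customer_id"] for r in records if r["message"] == "tracking event dropped"}
```

In a real system the handler would write to a log aggregator rather than an in-memory buffer; the point is only that named fields make the logs searchable and make the affected customers quick to list.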
