Monitoring Graceful Failure

How can you be sure that your team is alerted of a failure before it causes an outage for your users?

The move from monolith to microservices lets pieces of functionality be deployed individually and on demand. Isolating functionality means one microservice can fail without bringing down the whole system. However, releasing services and monitoring the API calls made between them has become more complex.

Whether you’re launching a new product or iterating on a feature, delivering a delightful experience is crucial to your success. If something does fail, you’d prefer your users didn’t notice. Be thoughtful about how your system will degrade, how you will inject failure to verify your design, and how you will monitor the result.
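As a rough illustration of degrading without the user noticing, the sketch below wraps a non-critical call in a fallback and records the failure so monitoring can see it. Every name here (`with_fallback`, `fetch_recommendations`, `render_home_page`, `degraded_events`) is hypothetical and not from the talk; the list stands in for whatever metrics or event client your monitoring stack provides.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for a metrics/event client (assumption, not a real API).
degraded_events: List[str] = []


def with_fallback(name: str, primary: Callable[[], T], fallback: T) -> T:
    """Run `primary`; if it raises, record the failure and return `fallback`."""
    try:
        return primary()
    except Exception as exc:
        # Let monitoring know, even though the user sees a working page.
        degraded_events.append(f"{name}: {exc}")
        return fallback


def fetch_recommendations() -> List[str]:
    """Hypothetical non-critical dependency that is currently failing."""
    raise TimeoutError("recommendations service timed out")


def render_home_page() -> dict:
    """Serve the page with an empty recommendations list instead of an error."""
    recs = with_fallback("recommendations", fetch_recommendations, [])
    return {"status": "ok", "recommendations": recs}
```

Calling `render_home_page()` still returns a page with `status` set to `"ok"`, while `degraded_events` gains one entry that an alerting rule could act on.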

In this Sensu Summit 2019 talk, Lorne Kligerman, Director of Product at Gremlin, will cover failing gracefully as an engineering goal that can be confidently tested and monitored with Chaos Engineering. By purposely causing the failure of one service at a time in a controlled environment, you can safely observe and react quickly to limit the effect on the end user.
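The idea of failing one service at a time and checking that critical functionality survives can be sketched as a small experiment loop. This is a toy, in-process model under stated assumptions, not Gremlin's or Sensu's API: the dependency names are borrowed from the slides, while `call`, `handle_request`, and `run_experiments` are hypothetical helpers, and treating "content" as the only critical dependency is an assumption made for the example.

```python
from typing import Dict, List, Set

CRITICAL = "content"                             # assumed critical dependency
OPTIONAL = ["cache", "feature_1", "feature_2"]   # assumed non-critical ones


def call(dependency: str, down: Set[str]) -> str:
    """Simulate a dependency call that fails when the dependency is 'down'."""
    if dependency in down:
        raise ConnectionError(f"{dependency} is unavailable")
    return f"{dependency} data"


def handle_request(down: Set[str]) -> Dict[str, object]:
    """Serve a request; optional dependencies degrade, the critical one does not."""
    degraded: List[str] = []
    response: Dict[str, object] = {CRITICAL: call(CRITICAL, down)}
    for dep in OPTIONAL:
        try:
            response[dep] = call(dep, down)
        except ConnectionError:
            degraded.append(dep)  # skip the feature instead of failing the page
    response["degraded"] = degraded
    return response


def run_experiments() -> Dict[str, bool]:
    """Fail one dependency at a time and record whether the request survived."""
    results: Dict[str, bool] = {}
    for dep in [CRITICAL] + OPTIONAL:
        try:
            response = handle_request({dep})
            results[dep] = CRITICAL in response
        except ConnectionError:
            results[dep] = False  # this failure would be a user-facing outage
    return results
```

In this model, `run_experiments()` reports `False` only for `content`, which is the kind of finding that tells you where a fallback or an alert is still missing.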

Published in: Technology
Monitoring Graceful Failure

  1. Monitoring Graceful Failure. Lorne Kligerman, Director of Product, Gremlin (@lklig); Aaron Sachs, Customer Reliability Engineer, Sensu (@asachs01)
  4. Be down in 10! T-Ho 2017 Hey team… bit of a spill but I’m fine.
  5. We Expect Technology To Just Work™
  6. Black Friday Failures: “Technical Issues Likely Cost Retailers Billions”; “Macy’s, Lowe’s hit by Black Friday technical glitches”; “Retail outages online leave shoppers frustrated on Black Friday”; People.com
  7. Airline Incidents: “Computer Problems Blamed For Flight Delays” (4.1.19); “Major US Airlines hit by delays after glitch at vendor” (4.1.19); “Pilots of doomed Boeing 737 MAX fought the plane’s software and lost” (4.4.19)
  8. Technology is fragile. Plan ahead to keep your users happy. FAILURE / GRACEFUL DEGRADATION
  9. Why Are Failures So Common?
  10. Legacy Systems
  11. Lack of Testing Failure: UI, End to end, Integration, Unit
  13. What Can We Do About It?
  14. Design For Failure
  15. Loading Screens Are Not Graceful
  16. Fail on Your Own Terms: Key User Stories & Features; Edge Cases From Unexpected User Behaviour; Dependency Failures
  17. Inject Failure By Breaking Things On Purpose
  18. Inject failure one service at a time. Maintain critical functionality.
  19. Degrade Gracefully
  20. When one dependency fails, users are often affected: Storage, Auth, User Data, Content, Cache, Feature 1, Feature 2
  22. Monitoring + Chaos Engineering
  23. Let Monitoring Know
  26. Let The Right People Know
  29. Closing the Loop
  32. RELIABILITY THROUGH CHAOS ENGINEERING. Design for Failure: identify the most critical end user functionality. Inject Failure: impact your system to be sure your user experience isn’t impacted. Degrade Gracefully: plan for non critical functionality not to get in the way. Delight Your Users: your product metrics will show behaviour, no matter the condition. Graceful Failure.
  33. USE inthefamily FOR $50 OFF
  34. gremlin.com/lorne
  35. Q&A. Lorne Kligerman, Director of Product, Gremlin (@lklig); Aaron Sachs, Customer Reliability Engineer, Sensu (@asachs01)
