
Resilience from Theory to Practice



video: https://www.youtube.com/watch?v=IBC9gcYqNR4

In this talk, Efim Dimenstein, Chief Architect at Liveperson, will cover the rules and guidelines for building resilient systems, how they are implemented in real life, and the lessons learned during the process. The talk will focus on achieving resilience in practice and will feature many examples and lessons learned from building systems that currently run in production at extreme scale.

Efim will talk about:

· General resilience guidelines

· How they are implemented in practice

· What changes needed to be implemented to achieve resilience

· Lessons learned

· Summary

Published in: Technology


  1. Resilience From Theory to Practice, by Efim Dimenstein (Chief Architect) and Ori Cohen (Lead Resilience Engineer), Jan 2016
  2. What is Liveperson: Liveperson transforms the connection between brands and consumers.
  3. Our Scale: 1.5M concurrent visits, 3BN visits/month, 200BN API calls/month, 2 PB data
  4. Our Production: 99.97% uptime, 6 data centers, 1000+ physical servers, 6000+ VMs
  5. Our Engineering: fast release cycle, ~250 people in R&D, constant innovation, multiple technologies
  6. 33 interruptions per month on average :)
  7. The Past
  8. The Past
  9. The Present
  10. LiveEngage Platform: composable, ~100 services, we keep splitting, much easier to scale
  11. LiveEngage Platform: services are grouped into types; the platform is divided into layers
  12. LiveEngage Platform
  13. Everything That Can Go Wrong Will Go Wrong
  14. Resilience Pyramid: DC, HW, SERVICE, COMPONENT, CODE
  15. DC Resilience - Global
  16. DC Resilience: Primary, Secondary
  17. Service: Service X with Node1, Node2, Node3, ..., NodeN
  18. Service: Service X with Node1, Node2, Node3, ..., NodeN; HA Functionality
  19. Service Grouping: Administration & Configuration, Real Time, Near Real Time, Offline
  20. Components: solve once, reuse; the glue; level of abstraction; isolates common problems
  21. Components - Guidelines: Retries, Fallback, Cache
  22. @ ground level
  23. trust company
  24. trust engineers
  25. and still evaluate
  26. knowledge is power
  27. tooling
  28. testing
  29. deployment
  30. metrics
  31. logs
  32. E2E
  33. ALERTING
  34. untested == unreliable
  35. but… ?
  36. cost-effective
  37. visibility
  38. incident injection testing
  39. process
  40. opt-in
  41. resilience @ scale: multi-layered solution; requires monitoring and testing; ingrained in the company culture; keep things simple; trust and empower your engineers; break stuff
  42. Thank you!
  43. Q&A
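
The component guidelines slide names retries, fallback, and cache without showing how they combine. The following is only a minimal sketch of one common way to combine them, not the implementation described in the talk: the class name `ResilientCall`, its parameters, and the choice of exponential backoff are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")


class ResilientCall:
    """Hypothetical wrapper: retry a call, then fall back to a cached value."""

    def __init__(self, max_attempts: int = 3, base_delay: float = 0.1) -> None:
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        # Last known good result per key, used as the fallback source.
        self._cache: Dict[str, object] = {}

    def call(self, key: str, fn: Callable[[], T],
             default: Optional[T] = None) -> Optional[T]:
        for attempt in range(self.max_attempts):
            try:
                result = fn()
                self._cache[key] = result  # refresh the cache on success
                return result
            except Exception:
                # Assumption: every exception is treated as retryable.
                # Real code would usually retry only transient errors.
                if attempt < self.max_attempts - 1:
                    # Exponential backoff between attempts.
                    time.sleep(self.base_delay * (2 ** attempt))
        # All attempts failed: serve the cached value, else the default.
        return self._cache.get(key, default)  # type: ignore[return-value]
```

A caller would wrap a remote lookup as `ResilientCall().call("config", fetch_config, default={})`, where `fetch_config` stands for any zero-argument function. Serving the last known good value keeps a component answering while its dependency is down, at the cost of possibly stale data.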
