Successfully reported this slideshow.
Your SlideShare is downloading. ×

Architectural Patterns of Resilient Distributed Systems

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

YouTube videos are no longer supported on SlideShare

View original on YouTube

! OMG Strangeloop 2015!!
Architectural
Patterns of
Resilient
Distributed
Systems
Ines
Sombra
@Randommood
ines@fastly.com
Upcoming SlideShare
We hear you like papers
We hear you like papers
Loading in …3
×

Check these out next

1 of 75 Ad

Architectural Patterns of Resilient Distributed Systems

Download to read offline

Strangeloop 2015 - http://www.thestrangeloop.com/2015/architectural-patterns-of-resilient-distributed-systems.html

Talk repo: https://github.com/Randommood/Strangeloop2015
Video: https://www.youtube.com/watch?v=ohvPnJYUW1E

Strangeloop 2015 - http://www.thestrangeloop.com/2015/architectural-patterns-of-resilient-distributed-systems.html

Talk repo: https://github.com/Randommood/Strangeloop2015
Video: https://www.youtube.com/watch?v=ohvPnJYUW1E

Advertisement
Advertisement

More Related Content

Viewers also liked (20)

Advertisement

Similar to Architectural Patterns of Resilient Distributed Systems (20)

Advertisement

Recently uploaded (20)

Architectural Patterns of Resilient Distributed Systems

  1. 1. ! OMG Strangeloop 2015!! Architectural Patterns of Resilient Distributed Systems
  2. 2. Ines Sombra @Randommood ines@fastly.com
  3. 3. Globally distributed & highly available
  4. 4. Today’s Journey Why care? Resilience literature Resilience in industry Conclusions @randommood ♥
  5. 5. OBLIGATORY DISCLAIMER SLIDE 
 All from a practitioner’s perspective! @randommood Things you may see in this talk Pugs Fast talking Life pondering Un-tweetable moments Rantifestos What surprised me this year Wedding factoids and trivia
  6. 6. Why Resilience?
  7. 7. How can I make a system more resilient? @randommood ♥
  8. 8. @randommood Resilience is the ability of a system to adapt or keep working when challenges occur
  9. 9. Defining Resilience Fault-tolerance Evolvability Scalability Failure isolation Complexity management @randommood ♥
  10. 10. It’s what really matters @randommood
  11. 11. Resilience in Literature lll
  12. 12. Harvest & Yield Model
  13. 13. @randommood Fraction of successfully answered queries Close to uptime but more useful because it directly maps to user experience (uptime misses this) Focus on yield rather than uptime Yield
  14. 14. @randommood From Coda Hale’s “You can’t sacrifice partition tolerance” Server A Server B Server C Baby AnimalsCute Harvest Fraction of the complete result
  15. 15. @randommood From Coda Hale’s “You can’t sacrifice partition tolerance” Server A Server B Server C Baby AnimalsCute X 66% harvest Harvest Fraction of the complete result
  16. 16. @randommood #1: Probabilistic Availability Graceful harvest degradation under faults Randomness to make the worst-case & average-case the same Replication of high-priority data for greater harvest control Degrading results based on client capability
  17. 17. @randommood #2 Decomposition & Orthogonality Decomposing into subsystems independently intolerant to harvest degradation but the application can continue if they fail You can only provide strong consistency for the subsystems that need it Orthogonal mechanisms (state vs functionality) ♥
  18. 18. @randommood “If your system favors yield or harvest is an outcome of its design” Fox & Brewer
  19. 19. Cook & Rasmussen model
  20. 20. Economic failure boundary Unacceptable workload boundary Accident boundary Cook & Rasmussen Operating point
  21. 21. Economic failure boundary Unacceptable workload boundary Accident boundary Cook & Rasmussen
  22. 22. Economic failure boundary Unacceptable workload boundary Accident boundary Pressure towards efficiency Cook & Rasmussen
  23. 23. Economic failure boundary Unacceptable workload boundary Accident boundary Pressure towards efficiency Reduction of effort Cook & Rasmussen
  24. 24. Economic failure boundary Unacceptable workload boundary Accident boundary Pressure towards efficiency Reduction of effort Cook & Rasmussen Incident!
  25. 25. Economic failure boundary Unacceptable workload boundary Accident boundary Pressure towards efficiency Reduction of effort Safety Campaign Cook & Rasmussen
  26. 26. Economic failure boundary Unacceptable workload boundary Accident boundary Pressure towards efficiency Reduction of effort error margin Marginal boundary Safety Campaign Cook & Rasmussen
  27. 27. error margin Original marginal boundary R.I.Cook - 2004 Acceptable operating point Accident boundary Flirting with the margin
  28. 28. error margin Original marginal boundary R.I.Cook - 2004 Acceptable operating point Accident boundary Flirting with the margin
  29. 29. error margin Original marginal boundary R.I.Cook - 2004 Acceptable operating point Accident boundary Flirting with the margin
  30. 30. error margin Original marginal boundary R.I.Cook - 2004 Acceptable operating point Accident boundary Flirting with the margin
  31. 31. error margin Original marginal boundary R.I.Cook - 2004 Acceptable operating point Accident boundary Flirting with the margin
  32. 32. error margin Original marginal boundary R.I.Cook - 2004 Acceptable operating point Accident boundary Flirting with the margin
  33. 33. R.I.Cook - 2004 Accident boundary Flirting with the margin New marginal boundary!
  34. 34. R.I.Cook - 2004 Accident boundary Flirting with the margin New marginal boundary!
  35. 35. @randommood Insights from Cook’s model Engineering resilience requires a model of safety based on: mentoring, responding, adapting, and learning System safety is about what can happen, where the operating point actually is, and what we do under pressure Resilience is operator community focused
  36. 36. @randommood Engineering system resilience Build support for continuous maintenance Reveal control of system to operators Know it’s going to get moved, replaced, and used in ways you did not intend Think about configurations as interfaces
  37. 37. Borrill's model
  38. 38. Traditional
 engineering Reactive
 ops unk-unk @randommood Probability of failure Rank A system’s complexity Cascading or catastrophic failures & you don’t know where they will come from! Same area as other 2 combined
  39. 39. Traditional
 engineering Reactive
 ops unk-unk @randommood Failure areas need != strategies Probability of failure Rank
  40. 40. Traditional
 engineering Reactive
 ops unk-unk @randommood Failure areas need != strategies Probability of failure Rank Kingsbury
  41. 41. Traditional
 engineering Reactive
 ops unk-unk @randommood Failure areas need != strategies Probability of failure Rank Kingsbury VS
  42. 42. Traditional
 engineering Reactive
 ops unk-unk @randommood Failure areas need != strategies Probability of failure Rank Kingsbury Alvaro VS
  43. 43. Strategies to build resilience Code standards Programming patterns Testing (full system!) Metrics & monitoring Convergence to good state Hazard inventories Redundancies Feature flags Dark deploys Runbooks & docs Canaries System verification Formal methods Fault injection Classical engineering Reactive Operations Unknown-Unknown The goal is to build failure domain independence
  44. 44. @randommood “Thinking about building system resilience using a single discipline is insufficient. We need different strategies” Borrill
  45. 45. Wedding Trivia!!! @randommood
  46. 46. Resilience in Industry
  47. 47. @randommood Now with sparkles! ✨ ✨
  48. 48. @randommood API inherently more vulnerable to any system failures or latencies in the stack Without fault tolerance: 30 dependencies w 99.99% uptime could result in 2+ hours of downtime per month! Leveraged client libraries
  49. 49. @randommood Netflix’s resilient patterns Aggressive network timeouts & retries. Use of Semaphores. Separate threads on per- dependency thread pools Circuit-breakers to relieve pressure in underlying systems Exceptions cause app to shed load until things are healthy
  50. 50. @randommood We went on a diet just like you!#
  51. 51. $ $
  52. 52. @randommood Key insights from Chubby Library vs service? Service and client library control + storage of small data files with restricted operations Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand Distributed Systems
  53. 53. @randommood Key insights from Chubby Centralized services are hard to construct but you can dedicate effort into architecting them well and making them failure-tolerant Restricting user behavior increased resilience Consumers of your service are part of your UNK- UNK scenarios
  54. 54. @randommood And the family arrives!
  55. 55. @randommood Key insights from Truce Evolution of our purging system from v1 to v3 Used Bimodal Multicast (Gossip protocol) to provide extremely fast purging speed Design concerns & system evolution Tyler McMullen Bruce Spang
  56. 56. Existing best practices won’t save you @randommood Key insights from NetSys João Taveira Araújo 
 looking suave Faild allows us to fail & recover hosts via MAC- swapping and ECMP on switches Do immediate or gradual host failure & recovery Watch Joao’s talk
  57. 57. Existing best practices won’t save you @randommood Key insights from NetSys João Taveira Araújo 
 looking suave Faild allows us to fail & recover hosts via MAC- swapping and ECMP on switches Do immediate or gradual host failure & recovery Watch Joao’s talk
  58. 58. Existing best practices won’t save you @randommood Key insights from NetSys João Taveira Araújo 
 looking suave Faild allows us to fail & recover hosts via MAC- swapping and ECMP on switches Do immediate or gradual host failure & recovery Watch Joao’s talk
  59. 59. @randommood So we have a myriad of systems with different stages of evolution Resilient systems like Varnish, Powderhorn, and Faild have taught us many lessons but some applications have availability problems, why? But wait a minute! ♥
  60. 60. @randommood Everyone okay?
  61. 61. Resilient architectural patterns
  62. 62. @randommood Redundancies are key Redundancies of resources, execution paths, checks, replication of data, replay of messages, anti-entropy build resilience Gossip / epidemic protocols too Capacity planning matters Optimizations can make your system less resilient!
  63. 63. @randommood Unawareness of proximity to error boundary means we are always guessing Complex operations make systems less resilient & more incident-prone You design operability too! Operations matter
  64. 64. @randommood Complexity if increases safety is actually good Adding resilience may come at the cost of other desired goals (e.g. performance, simplicity, cost, etc) Not all complexity is bad
  65. 65. @randommood Leverage Engineering best practices Resiliency and testing are correlated. TEST! Versioning from the start - provide an upgrade path from day 1 Upgrades & evolvability of systems is still tricky. Mixed-mode operations need to be common Re-examine the way we prototype systems
  66. 66. Bringing it together♥
  67. 67. tl;dr OPERABILITYWHILE IN DESIGN UNK-UNK Are we favoring harvest or yield? Orthogonality & decomposition FTW Do we have enough redundancies in place? Are we resilient to our dependencies? Am I providing enough control to my operators? Would I want to be on call for this? Rank your services: what can be dropped, killed, deferred? Monitoring and alerting in place? The existence of this stresses diligence on the other two areas Have we done everything we can? Abandon hope and resort to human sacrifices ♥ ♥ Theory matters!
  68. 68. IMPROVING OPERABILITYWHILE IN DESIGN Test dependency failures Code reviews != tests. Have both Distrust client behavior, even if they are internal Version (APIs, protocols, disk formats) from the start. Support mixed-mode operations. Checksum all the things Error handling, circuit breakers, backpressure, leases, timeouts Automation shortcuts taken while in a rush will come back to haunt you Release stability is o"en tied to system stability. Iron out your deploy process Link alerts to playbooks Consolidate system configuration (data bags, config file, etc) tl;dr♥ ♥ Operators determine resilience
  69. 69. @randommood We can’t recover from lack of design. Not minding harvest/yield means we sign up for a redesign the moment we finish coding. TODAY’S RANTIFESTO♥ ♥
  70. 70. Thank you!github.com/Randommood/Strangeloop2015 77 Special thanks to Paul Borrill, Jordan West, Caitie McCaffrey, Camille Fournier, Mike O'Neill, Neha Narula, Joao Taveira, Tyler McMullen, Zac Duncan, Nathan Taylor, Ian Fung, Armon Dadgard, Peter Alvaro, Peter Bailis, Bruce Spang, Matt Whiteley, Alex Rasmussen, Aysulu Greenberg, Elaine Greenberg, and Greg Bako.

×