
Resisting to The Shocks

... A fast summary about Resiliency in Distributed Systems: discussion, reasoning, patterns...

Published in: Software
  1. Resisting to the Shocks: Resilience Patterns in an Unstable World! STEFANO FAGO (Extended version, from Meetup Crafted Software: 7th Edition, 11th October 2018)
  2. Resilience? The concept of Resilience has multiple definitions; the definition we will use is: << ...The Capacity to Recover Quickly from Difficulties; Toughness... >>
  3. What is a Resilient System? << ...it is a system that on the outside seems complex but is characterized by a simpler modular structure made up of components that, when necessary, can detach and reconfigure themselves: this prevents the problems of one part from cascading onto the others... >> [A. Zolli - http://resiliencethebook.com/] A Resilient System is characterized by: – dynamicity – modularity – diversity – decoupling – integrated shock absorbers
  4. Why have a Resilient System? ● ...because having a 24/7, 99.99999% system... is cool!?! ● ...because I'm... an incredible Software Engineer!?! ● ...because I don't want my business to lose money! << ...Many systems are built to pass QA testing rather than to survive the world after launch... >> [Michael Nygard - https://pragprog.com/book/mnee2/release-it-second-edition]
  5. Fallacies of Distributed Computing ● The network is reliable ● Latency is zero ● Bandwidth is infinite ● The network is secure ● Topology doesn't change ● There is one administrator ● Transport cost is zero ● The network is homogeneous [https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing] [https://www.rgoarchitects.com/Files/fallacies.pdf]
  6. Murphy's Laws for Resilience ● If there is anything that can break in the system, it will break! ● If there is something that can break the system, there is at least one customer who will find it! ● Under pressure... things get worse! ● Size matters, but... you'll be wrong anyway! << ...the three most frequent types of failures we observed were due to: 1) Inbound request pattern changes, including overload and bad actors 2) Resource exhaustion such as CPU, memory, io_loop, or networking resources 3) Dependency failures, including infrastructure, data store, and downstream services... >> [UBER Engineering]
  7. Fragility? Some Causes... ● Usage of proprietary protocols and software ● Deployment of proprietary systems to a large number of computers that cannot be properly assessed in terms of security vulnerabilities or other potential misuses ● Single points of failure ● Inter-dependence of services ● Systems that can easily be influenced by pressure groups ● Weak architecture ● Missing fallback scenarios and graceful degradation https://devopsagenda.techtarget.com/opinion/Why-software-resilience-should-be-the-real-goal-of-DevOps
  8. Resiliency isn't Reliability... ● Reliability: The target at which software designers have always aimed: perfect operation all the time. Reliability is the planned outcome. ● Resiliency: The ability of an app to recover from certain types of failure and yet remain functional from the customer perspective. Resilience is how you achieve the outcome. https://cabforward.com/the-difference-between-reliable-and-resilient-software/
  9. Resilience in Distributed Systems: What does it imply? ● The 100% trap: not IF it will break but... WHEN it will break! << ...the normal state of operation is partial failure... >> [Adrian Hornsby] ● It is not a perfect feature! << ...it is impossible for a system to have all three properties of consistency, resilience and partition-tolerance... >> [Architectural Design for Resilience - Dong Liu, Ralph Deters, and W.J. Zhang (2010)] ● It implies complexity; it does not reduce it! ● It requires studying, measuring and understanding the business objectives!
  10. Resilience in Distributed Systems: Base Elements ● Isolation: break the system down into parts, give the parts autonomy, avoid the propagation of failures ● Low Coupling: complementary to Isolation, contributes to the non-propagation of failures; components are ignorant of one another ● Communication Methods: condition how to model the domain and the recovery mechanisms; can be heterogeneous (Sync, Async, Location Transparency, Message Passing, Streaming, ...) ● Mitigate Failures: anticipate unavoidable failures and adopt both system and application recovery mechanisms
  11. Resilience in Distributed Systems: Isolation is important ...using an intuitive point of view... FAILURE & CHANGE [Mark Hibberd - https://www.youtube.com/watch?v=_VftQXWDkfk]
  12. Patterns of Resilience (by Uwe Friedrichsen)
  13. Patterns of Resilience: Bulkhead Isolate! Don't Propagate! ● Redundancy of Systems and Resources: where possible, replicate a critical resource so it can be readily replaced ● Categorized Resource Allocation: classify resources and break them down into measurable and manipulable reference pools WARNING: Redundancy and pools may vary over time, and some of them are affected by more than one factor
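The bulkhead idea can be sketched in a few lines of Python. This is a minimal illustration, not the pattern's canonical implementation: a semaphore caps how many concurrent calls may enter one resource pool, and callers beyond the cap fail fast instead of exhausting shared threads or connections (the `Bulkhead` and `db_pool` names are illustrative).

```python
import threading

class Bulkhead:
    """Caps concurrent calls into one resource pool: excess calls fail fast
    instead of piling up and exhausting shared threads or connections."""

    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately when the pool is saturated.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

db_pool = Bulkhead(max_concurrent=2)   # hypothetical pool for database calls
print(db_pool.call(lambda: "row"))     # capacity available, so the call runs
```

A separate `Bulkhead` per downstream dependency keeps a failure in one pool from starving the others.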
  14. Patterns of Resilience: Queueing Take Your Time! ● Deferrable Work: postpone a non-urgent activity ● Bounded Queue / Load-Levelling Queue: load absorbers for request or traffic spikes ● BackPressure/Throttling: queue-overload management policies to avoid indefinite growth WARNING: Asynchrony makes coordination complex, and the approach must be refined against measurements taken from reality
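A bounded queue with load shedding can be sketched with the standard-library `queue` module. This is a minimal, illustrative example (the `work` queue and `submit` helper are invented names): the queue absorbs a small burst, and once full the producer gets an explicit backpressure signal instead of an ever-growing backlog.

```python
import queue

work = queue.Queue(maxsize=2)  # bounded: absorbs only a small burst

def submit(item):
    """Enqueue or shed: a full queue returns False so the producer can
    back off or throttle instead of letting the backlog grow forever."""
    try:
        work.put_nowait(item)
        return True
    except queue.Full:
        return False  # backpressure signal to the caller

print([submit(i) for i in ("a", "b", "c")])  # [True, True, False]
```

What the producer does with the `False` (drop, retry later, slow down) is the throttling policy the slide warns must be tuned against real traffic.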
  15. Patterns of Resilience: Timeout Stop Waiting: Fail Fast & Don't Propagate! ● Make the duration of an activity predictable ● Set timing goals, measure, refine according to reality WARNING: A goal may be specific to one resource and not impact the others; and how should timeout errors be handled?
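One way to bound the duration of a call in Python is to run it through an executor and cap the wait on its future. A minimal sketch, assuming a simulated slow dependency (`slow_dependency` is an invented stand-in):

```python
import concurrent.futures
import time

def slow_dependency():
    time.sleep(0.5)          # simulated slow downstream call
    return "payload"

# Bound how long we are willing to wait, then fail fast.
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_dependency)
    try:
        result = future.result(timeout=0.05)   # give up after 50 ms
    except concurrent.futures.TimeoutError:
        result = "timed out"                   # handle it: fallback, retry, or error

print(result)
```

Note the slide's warning in miniature: the timeout here abandons the wait but the worker thread still runs to completion, so the timeout policy for one resource does not by itself free the underlying work.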
  16. Patterns of Resilience: Retry If you fail once, try again! Some failures are temporary or recoverable... Trying again requires: a bounded number of attempts, and a delay between the retries (backoff) https://aws.amazon.com/it/blogs/architecture/exponential-backoff-and-jitter/ WARNING: Requires an idempotence analysis of the activities involved
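The retry-with-backoff idea from the linked AWS article can be sketched as below. This is an illustrative "full jitter" variant (the `retry` and `flaky` names are invented): each retry sleeps a random time up to a capped exponential bound, so many clients retrying at once do not hammer the dependency in lockstep.

```python
import random
import time

def retry(fn, attempts=4, base=0.05, cap=1.0):
    """Retry with capped exponential backoff plus full jitter: sleep a
    random time in [0, min(cap, base * 2**attempt)] between attempts."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: propagate the failure
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient glitch")  # recoverable failure
    return "ok"

print(retry(flaky))  # succeeds on the third attempt
```

The slide's warning applies directly: `retry` blindly re-invokes `fn`, which is only safe if the operation is idempotent.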
  17. Patterns of Resilience: Fallback/Fail Silent Don't Fail... Degrade Gracefully! Do not fail with destructive actions but with approximations or alternative actions ● Default Value / Derived Value ● Alternative Actions/Invocations ● Caching WARNING: The related business conditions must be incorporated!
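A fallback chain combining two of the bullets (caching and a default value) can be sketched as below. Everything here is illustrative: `live_price` simulates an outage, and `DEFAULT_PRICE` stands in for the business-approved default the warning says must be agreed on.

```python
def live_price(product_id):
    raise TimeoutError("pricing service unavailable")  # simulated outage

DEFAULT_PRICE = 9.99  # hypothetical business-approved default value

price_cache = {}

def price_with_fallback(product_id):
    """Degrade gracefully: prefer the live value, fall back to the last
    cached value, and only then to a safe default, instead of failing hard."""
    try:
        price = live_price(product_id)
        price_cache[product_id] = price  # refresh the cache on success
        return price
    except Exception:
        return price_cache.get(product_id, DEFAULT_PRICE)

print(price_with_fallback("sku-1"))  # -> 9.99
```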
  18. Patterns of Resilience: Limiter No Stress, Know Your Limit! ● Rate-Limiter ● Concurrency-Limiter ● Adaptive Resource Sizing ● BackPressure/Throttling WARNING: These policies should not replace an effort to understand resource sizing; use appropriate algorithms and refine them against real data for the different use cases.
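One common rate-limiter algorithm is the token bucket, sketched below as an illustration (the class and parameter names are invented): tokens refill at a sustained rate, the bucket capacity allows short bursts, and a denied call is the caller's cue to shed, queue, or back off.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: sustained throughput of `rate` calls per
    second, with bursts of up to `capacity` calls."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # denied: caller should shed load, queue, or back off

limiter = TokenBucket(rate=1.0, capacity=2)
print(limiter.allow(), limiter.allow(), limiter.allow())  # burst of 2, then denied
```

Picking `rate` and `capacity` is exactly the resource-sizing effort the warning says the limiter cannot replace.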
  19. Patterns of Resilience: Circuit Breaker Don't do it if it hurts! Interrupt a pathological situation with a controlled and immediate failure. The failure state is revoked according to indices or time conditions. WARNING: Defining the parameters for tripping the breaker and for recovery can be a difficult task, and the consequences on the critical execution path of the services must be studied.
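A minimal circuit-breaker sketch, with invented names and deliberately simple trip/reset parameters (consecutive-failure threshold, time-based half-open): after enough consecutive failures the breaker opens and rejects calls immediately; once the reset window elapses, one trial call is let through.

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open, calls fail
    fast; after `reset_after` seconds, one trial call is allowed (half-open)."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result

cb = CircuitBreaker(threshold=2, reset_after=30.0)
for _ in range(2):
    try:
        cb.call(lambda: 1 / 0)   # two failures trip the breaker
    except ZeroDivisionError:
        pass
try:
    cb.call(lambda: "ok")        # rejected without touching the dependency
except RuntimeError as e:
    print(e)
```

The hard part the warning points at is choosing `threshold` and `reset_after`: real breakers often use error rates over sliding windows rather than simple consecutive counts.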
  20. Patterns of Resilience: Decoupling by Events Describe in terms of the things that happen (Event), not the things that do the work (Command) Isolate/decouple components, model with domains, accept failures with notifications allowing recovery of the components/sub-systems ● Event-Sourcing / CQRS / Message-Passing ● SAGA (alternative to 2PC) WARNING: Asynchronous activities and domain modeling make the system safer but complex. Abuse of queues and listener networks can appear. There is a tradeoff between transactionality and compensative activities.
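The decoupling idea can be shown with a toy in-process event bus (all names invented; real systems would use a broker, durable queues, and proper error handling): producers publish events describing what happened, subscribers react independently, and one failing subscriber does not stop the others.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process publish/subscribe: handlers are decoupled from the
    publisher and from each other."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            try:
                handler(payload)
            except Exception:
                pass  # isolate the failure; a real system would log and retry

bus = EventBus()
seen = []
bus.subscribe("order_placed", lambda e: seen.append(("bill", e)))
bus.subscribe("order_placed", lambda e: 1 / 0)   # faulty subscriber
bus.subscribe("order_placed", lambda e: seen.append(("ship", e)))
bus.publish("order_placed", {"id": 42})
print(seen)  # both healthy handlers ran despite the faulty one
```

The swallowed exception is where the slide's tradeoff lives: with no shared transaction, the faulty subscriber's missed work must be recovered by retries or a compensating action.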
  21. Patterns of Resilience: Chaos Engineering << ...Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production... >> https://www.oreilly.com/ideas/chaos-engineering ● Implement testing in production, with realistic data and volumes! ● Have the infrastructure for continuous experiments of... Chaos! ● Learn from every failure / Always invent new failures! WARNING: Complex startup, specific skills, supporting products, and <<...don't use the term Chaos Engineering, use Continuous limited scope disaster recovery instead. You might actually get a budget that way...>> [Russ Miles]
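At its smallest scale, chaos experimentation is just deliberate fault injection. A toy sketch (decorator and function names invented; real chaos tooling injects faults at the infrastructure or mesh level, not in application code): a fraction of calls fail on purpose, so the surrounding timeouts, retries and fallbacks are exercised continuously.

```python
import random

def inject_faults(failure_rate, exc=ConnectionError):
    """Fault-injection decorator: make a random fraction of calls fail so
    the resilience machinery around them is continuously exercised."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc("injected failure")
            return fn(*args, **kwargs)
        return inner
    return wrap

@inject_faults(failure_rate=0.2)   # 20% of calls fail on purpose
def fetch_profile(user_id):
    return {"id": user_id}         # hypothetical downstream call
```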
  22. From Resilient to (auto)Recoverable Target for Architectural Maturity [Bilgin Ibryam]
  23. From Resilient to (auto)Recoverable At first sight you'll think of adopting these patterns only as an application-level solution, but... it is in this context that DevOps practices and tools become an integral part of a broader vision – containers and container orchestration – artifact life cycle – distribution policies for certificates, configurations and artifacts – monitoring & metrics WARNING: adopting DevOps implies complexity, skills, organization, and << ...application safety and correctness, in a distributed system is still the responsibility, of the application... >> [Christian Posta]
  24. From Resilient to (auto)Recoverable In order to be suitable for automation in (cloud native) environments a service must be: – Idempotent for restarts (a service can be killed and started multiple times). – Idempotent for scaling up/down (a service can be autoscaled to multiple instances). – An idempotent service producer (other services may retry calls). – An idempotent service consumer (the service or the mesh can retry outgoing calls). If your service always behaves the same way when the above actions are performed one or multiple times, then the platform will be able to recover your services from failures without human intervention. [https://www.infoq.com/articles/microservices-post-kubernetes - Bilgin Ibryam]
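The "idempotent service producer/consumer" requirement can be sketched with a request-id deduplication scheme (all names invented; a real service would keep the dedup table in a durable store): replaying the same request, whether from a client retry, a mesh retry, or a restart, returns the recorded outcome instead of repeating the side effect.

```python
processed = {}  # request_id -> receipt; a real service would use a durable store

def charge(request_id, amount):
    """Idempotent handler: the same request_id always yields the same
    recorded outcome, never a second charge."""
    if request_id in processed:
        return processed[request_id]
    receipt = {"request_id": request_id, "charged": amount}
    processed[request_id] = receipt  # record the outcome before acknowledging
    return receipt

first = charge("req-1", 10)
again = charge("req-1", 10)            # retried call: same receipt returned
print(first is again, len(processed))  # True 1
```

This is what lets the platform retry and restart freely: the caller supplies the request id, so at-least-once delivery becomes effectively-once processing.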
  25. Remember that... ● Distributed systems are different because they fail often / Extract services ● Writing robust distributed systems costs more than writing robust single-machine systems / Robust, open source distributed systems are much less common than robust, single-machine systems ● If you can fit your problem in memory, it's probably trivial / "It's slow" is the hardest problem you'll ever debug ● Implement backpressure throughout your system / Find ways to be partially available ● Metrics are the only way to get your job done: use percentiles, not averages ● Learn to estimate your capacity / Exploit data locality / Writing cached data back to persistent storage is bad ● Feature flags are how infrastructure is rolled out / Use the CAP theorem to critique systems https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
  26. Resilience & Performance Anti-Patterns Are you in doubt? Does the system get complicated? Maybe it is useful to compare the design of the system, the services, or the resilience patterns used against the following performance anti-patterns! ● N+1 Calls ● N+1 Queries ● Payload Flood ● Granularity ● Tight Coupling ● Inefficient Service Flow ● Dependencies
  27. Reality will change again, but... do not waste money! Be Resilient and Recoverable! Thank You All!!!
