
Resilient Functional Service Design

This slide deck addresses the importance of proper functional design for creating resilient distributed systems (not only, but also microservice-based systems).

It starts by explaining a pitfall that many developers fall into when getting started with resilience: the effect of the fundamentals, i.e., creating bulkheads and choosing the communication paradigm, on system robustness at runtime is often heavily underrated. Instead, purely technical measures like circuit breakers, backpressure, etc. are often overestimated.

Unfortunately, no technical measure will help to create a robust system if the functional design leads to highly coupled services where the availability of one service depends entirely on the availability of another service. The same is true if you need to call many services that all need to be available to answer a client request.
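
The effect of calling many services can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that services fail independently and that a request succeeds only if every service it touches is up. The function name and the 99.9% figure are illustrative assumptions, not numbers from the deck.

```python
from math import prod


def request_availability(service_availabilities):
    """Availability of a request that needs every listed service to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(service_availabilities)


if __name__ == "__main__":
    # Hypothetical example: each service is available 99.9% of the time.
    for n in (1, 3, 12):
        combined = request_availability([0.999] * n)
        print(n, "service(s) in the request path:", round(combined, 4))
```

With these assumed figures, twelve services in the request path push the combined availability below 99%, although every single service offers 99.9%.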

To make it worse, most of the widespread design approaches like functional decomposition, DRY (don't repeat yourself), design for re-use, or layered architecture lead to exactly those problems, i.e., they are not suitable for designing distributed systems.

This slide deck does not offer a silver bullet to solve the problem (and actually I believe there is none), but at least a few guiding principles. Additionally, it shows how the choice of the communication paradigm influences the bulkhead design and thereby opens up more options for a good service design that also supports resilience on a functional level.
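
To illustrate how the communication paradigm changes the coupling between two services, here is a minimal sketch. Everything in it is hypothetical: `PaymentService`, `EventLog`, and the two checkout functions are invented for illustration, and the in-memory queue only stands in for a durable event channel. It is not the design shown in the slides, just one way to picture the difference.

```python
from collections import deque


class PaymentService:
    """Hypothetical downstream service that may be unavailable."""

    def __init__(self):
        self.available = True
        self.authorized = []

    def authorize(self, order_id):
        if not self.available:
            raise ConnectionError("payment service unavailable")
        self.authorized.append(order_id)


def checkout_sync(order_id, payment):
    # Request/response: the caller fails whenever the callee is down.
    payment.authorize(order_id)
    return "confirmed"


class EventLog:
    """In-memory stand-in for a durable event channel (an assumption)."""

    def __init__(self):
        self.pending = deque()

    def publish(self, event):
        self.pending.append(event)

    def deliver(self, handler):
        # Deliver events in order; on failure keep the rest queued for later.
        while self.pending:
            try:
                handler(self.pending[0])
            except ConnectionError:
                return False
            self.pending.popleft()
        return True


def checkout_async(order_id, log):
    # Events: the caller only records what happened and answers right away.
    log.publish(order_id)
    return "accepted"
```

In this sketch, `checkout_sync` raises as soon as the payment service is down, while `checkout_async` still answers and the queued event is processed by a later `deliver` call once the service is back. The price is that the caller can only report "accepted", not "confirmed", which is exactly the kind of functional design decision the deck is about.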

As always, this slide deck comes without the voice track, i.e., most of the information is missing. But I hope that the slides on their own also provide some helpful hints.

Remark: As the "dismiss reusability" slide tends to get quite some attention, one more remark about that slide: On the voice track I usually add "If you find something that is worth being made reusable, i.e., it satisfies the commercial constraints of Fred Brooks' 'Rule of 9' (see https://blog.codecentric.de/en/2015/10/the-broken-promise-of-re-use/ for details), do not put it in a service. Instead, create a library and put the functionality in there. And make sure that changes to the library do not force all users to upgrade at the same time, but that any library user can update whenever it fits the user's schedule. Otherwise, you would have introduced tight coupling through the back door again."

Published in: Technology

Resilient Functional Service Design

  1. 1. Resilient Functional Service Design The usually forgotten parts of resilient software design Uwe Friedrichsen – codecentric AG – 2015-2017
  2. 2. @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
  3. 3. What’s that “resilience” thing?
  4. 4. Business Production Availability
  5. 5. (Almost) every system is a distributed system Chas Emerick http://www.infoq.com/presentations/problems-distributed-systems
  6. 6. A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. Leslie Lamport
  7. 7. Failures in today’s complex, distributed and interconnected systems are not the exception. •  They are the normal case •  They are not predictable
  8. 8. … and it’s getting “worse” •  Cloud-based systems •  Microservices •  Zero Downtime •  Mobile & IoT •  Social Web → Ever-increasing complexity and connectivity
  9. 9. Do not try to avoid failures. Embrace them.
  10. 10. resilience (IT) the ability of a system to handle unexpected situations -  without the user noticing it (best case) -  with a graceful degradation of service (worst case)
  11. 11. Beware of the “100% available” trap!
  12. 12. Designing for resilience The pitfall
  13. 13. First, you learn about resilience …
  14. 14. Complement Core Detect Prevent Recover Mitigate Treat
  15. 15. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Bulkhead System level Monitor Watchdog Heartbeat Acknowledgement Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement
  16. 16. ... then, you digest the stuff just learned
  17. 17. Confinement Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Bulkhead System level Monitor Watchdog Heartbeat Acknowledgement Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically •  “Oh, my! Theoretical blah! Uncool! Know that anyway for eons. So, let’s move on to the cool parts …” •  “Ah, now we’re talkin’! Here’s the cool stuff! That’s practical, applicable. Don’t you have more code examples? Or even better: Can’t we turn that all into a live hacking session?” •  “Offline activities? Hmm, let’s focus on the other stuff.” •  “Uh, sounds like one-off, tough stuff … Better start with the easier stuff, best with library support” •  “Yeah, more cool stuff! Aren’t there more libs like Hystrix that we can drag into our projects with a line of configuration?” •  “Well, neat … I’ll come back to that stuff whenever I really need it”
  18. 18. Core Detect Recover Mitigate Treat Prevent Complement Developer priority Relevance for application robustness Ye be warned! If you don’t get this part right, nothing else matters Here be dragons! This is extremely hard and poorly understood
  19. 19. Let’s recap …
  20. 20. The core parts are •  extremely important •  poorly understood •  massively underestimated
  21. 21. Houston, we have a problem!
  22. 22. Let’s have a closer look at the core parts
  23. 23. Complement Core Detect Prevent Recover Mitigate Treat
  24. 24. Core Detect Treat Prevent Recover Mitigate Complement Isolation
  25. 25. Isolation •  System must not fail as a whole •  Split system in parts and isolate parts against each other •  Avoid cascading failures •  Foundation of resilient software design
  26. 26. Core Detect Treat Prevent Recover Mitigate Complement Isolation Bulkhead
  27. 27. Bulkheads are not about thread pools!
  28. 28. Bulkheads •  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”) •  Diverse implementation choices available, e.g., µservice, actor, scs, ... •  Implementation choice impacts system and resilience design a lot •  Shaping good bulkheads is extremely hard (pure design issue)
  29. 29. Sounds easy. Where is the problem?
  30. 30. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design How do we avoid this …
  31. 31. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  32. 32. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... without ending up with this?
  33. 33. Let’s use the well-known best practices •  Divide & conquer a.k.a. functional decomposition •  DRY (Don’t Repeat Yourself) •  Design for reusability •  Layered architecture •  …
  34. 34. Unfortunately, …
  35. 35. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design ... this usually leads to this ...
  36. 36. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  37. 37. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... and in the end also often to this.
  38. 38. Welcome to distributed hell!
  39. 39. Caches to the rescue!
  40. 40. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design Cache of B Break tight service coupling by caching data/responses of downstream service
  41. 41. Caches to the rescue?
  42. 42. Do you really think that copying stale data all over your system is a suitable measure to fix an inherently broken design?
  43. 43. We have to re-learn design for distributed systems!
  44. 44. A works-out-of-the-box-in-all-contexts, just-add-water-and-stir, three-bullet-point panacea for designing perfect bulkheads
  45. 45. You need lots of those …
  46. 46. ... maybe some of those
  47. 47. Then it is a lot of hard work …
  48. 48. ... and there is no silver bullet
  49. 49. Yet, a few guiding thoughts about bulkhead design …
  50. 50. Foundations of design •  High cohesion, low coupling •  Separation of concerns •  Crucial across process boundaries •  Still poorly understood issue •  Start with •  Understanding organizational boundaries •  Understanding use cases and flows •  Identifying functional domains (→ DDD) •  Finding areas that change independently •  Do not start with a data model!
  51. 51. Short activation paths •  Long activation paths affect availability •  Increase latency and likelihood of failures •  Minimize remote calls per request •  Need to balance opposing forces •  Avoid monolith → clear separation of concerns •  Minimize requests → cluster functionality & data •  Caches sometimes help, but stale data as trade-off
  52. 52. Dismiss reusability •  Reusability increases coupling •  Reusability leads to bad service design •  Reusability compromises availability •  Reusability rarely pays •  Do not strive for reuse •  Strive for replaceability instead
  53. 53. Broadening the options ...
  54. 54. Core Detect Treat Prevent Recover Mitigate Complement Isolation Communication paradigm Bulkhead
  55. 55. Communication paradigm •  Request-response <-> messaging <-> events <-> … •  Heavily influences resilience patterns to be used •  Also heavily influences functional bulkhead design •  Very fundamental decision which is often underestimated
  56. 56. µS Request/Response : Horizontal slicing Flow / Process µS µS µS µS µS µS Event-driven : Vertical slicing µS µS µS µS µS Flow / Process
  57. 57. Synchronous R/R vs. asynchronous events •  Decomposition •  Vertically divide-and-conquer vs. horizontally go-with-the-flow •  Coordination •  Coordination logic/services and orchestration vs. event chains and choreography •  Transactions •  Built-in transaction handling vs. external supervision •  Error handling •  Built into service vs. escalation/supervision strategy •  Separation of concerns •  Multiple responsibilities service vs. single responsibility services •  Encapsulation •  Domain logic distributed across services vs. domain logic in one place •  Reusability vs. Replaceability •  Complexity •  A draw …
  58. 58. The communication paradigm influences the functional service design a lot and also the resilience patterns to be used
  59. 59. Example: order fulfillment •  Simple order, credit card, non-digital items •  Add coupons incl. validation •  Add promotions incl. usage notification •  Add bonus card incl. purchase notification •  Customer accounts as payment type •  PayPal as payment type •  Integrate digital music library •  Integrate digital video library •  Integrate e-book library
  60. 60. Design exercise – Part 1 Create a bulkhead design for the case study •  Use one communication paradigm •  Synchronous request/response (e.g., REST) •  Asynchronous messaging (e.g., Akka) •  Asynchronous events (e.g., Pub/Sub) •  Assume incremental requirements •  How many services do you need to touch •  What about the functional isolation of the services •  How big/maintainable are the resulting services •  Take a few notes
  61. 61. Online Shop Checkout Credit Card Provider Warehouse System Coupon Management Campaign Management PayPal Loyalty Management Accounts Receivables Music Library E-Book Library Video Library E-Mail Server Customer pressed “Buy now” ?
  62. 62. Order Fulfillment Service Online Shop Payment Service Credit Card Provider Shipment Service Warehouse System <Foreign Service> <Own Service> Coupon Management Promotion Campaign Management Loyalty Account Service Payment Provider PayPal Loyalty Management Accounts Receivables Music Library E-Book Library Video Library E-Mail Server Coupon Credit Card Coordinate Warehouse Coordinate Assets Notify Cust. PayPal Coordinate
  63. 63. Order confirmed Online Shop Credit Card Provider Warehouse System <Foreign Service> <Own Service> Coupon Management Campaign Management Account service Credit Card Service Loyalty Management Accounts Receivables Music Library E-Book Library Video Library E-Mail Server PayPal PayPal Service Warehouse Service Promotion Service Bonus Card Service Coupon Service Music Library Service Video Library Service E-Book Library Service Notification Service Payment authorized Digital asset provisioned Payment failed <Event> Order fulfillment supervisor Track flow of events Reschedule events in case of failure Services are responsible to eventually succeed or fail for good, usually incorporating a supervision/escalation hierarchy for that
  64. 64. Do not limit your design options upfront without an important reason
  65. 65. Wrap-up •  Today’s systems are distributed •  Failures are not avoidable, nor predictable •  Resilient software design needed •  Bulkhead design is •  crucial for application robustness •  poorly understood •  massively underrated •  different from traditional design best practices •  Communication paradigms broaden your bulkhead design options
  66. 66. We have to re-learn design for distributed systems
  67. 67. @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
