Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Resilience reloaded
More resilience patterns
Uwe Friedrichsen, codecentric AG, 2015-2016
@ufried
Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
Previously on “Resilience” …
Why resilience?
It‘s all about production!
Business
Production
Availability
Availability ≔
MTTF
MTTF + MTTR
MTTF: Mean Time To Failure
MTTR: Mean Time To Recovery
Traditional stability approach
Availability ≔
MTTF
MTTF + MTTR
Maximize MTTF
(Almost) every system is a distributed system
Chas Emerick
The Eight Fallacies of Distributed Computing
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The...
A distributed system is one in which the failure
of a computer you didn't even know existed
can render your own computer u...
Failures in todays complex, distributed and
interconnected systems are not the exception.
• They are the normal case
• The...
Do not try to avoid failures. Embrace them.
Resilience approach
Availability ≔
MTTF
MTTF + MTTR
Minimize MTTR
resilience (IT)
the ability of a system to handle unexpected situations
 without the user noticing it (best case)
 with ...
Isolation
Latency Control
Fail Fast
Circuit Breaker
Timeouts
Fan out &
quickest reply
Bounded Queues
Shed Load
Bulkheads
L...
… and there is more
• Recovery & mitigation patterns
• More supervision patterns
• Architectural patterns
• Anti-fragility...
(Title music starts & opening credits shown)
Let’s complete the picture first …
Isolation
Latency Control
Loose Coupling
Supervision
Core
(Architectural)
Detection Treatment Prevention
Recovery
Mitigation
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Isolation
Loose Coupling
Latency Control
Node level
...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Isolation Redundancy
Communication
paradigm
Supporti...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Supporting
patterns
Communication
paradigm
Redundanc...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Supporting
patterns
Communication
paradigm
Redundanc...
Bulkheads
• Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
• Shaping good bulkheads is hard (pur...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Supporting
patterns
Communication
paradigm
Redundanc...
Communication paradigm
• Request-response <-> messaging <-> events
• Not a pattern, but heavily influences resilience patt...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Supporting
patterns
Communication
paradigm
Redundanc...
Redundancy
• Core resilience concept
• Applicable to all failure types
• Basis for many recovery and mitigation patterns
•...
Failure types
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure
Failure types
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure
Usage of redundan...
Failure types
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure
Usage of redundan...
Failure types
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure
Usage of redundan...
Failure types
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure
Usage of redundan...
Failure types
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure
Usage of redundan...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Supporting
patterns
Communication
paradigm
Redundanc...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Node level
Supporting
patterns
System level
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Node level
Supporting
patterns
System level
Timeout
...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Node level
Supporting
patterns
System level
Monitor ...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Node level
Supporting
patterns
System level
Voting
S...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Retry
Limit retries
Retry
• Very basic recovery pattern
• Recover from omission errors
• Limit retries to minimize extra load on an already lo...
Retry example
// doAction returns true if successful, false otherwise
boolean doAction(...) {
...
}
// General pattern
boo...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Retry
Limit retries
Rollback
Checkpoint Safe point
Rollback
• Roll back state and/or execution path to a defined safe state
• Recover from internal errors caused by external...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Retry
Limit retries
Rollback
Checkpoint Safe point
R...
Roll-forward
• Advance execution past the point of error
• Often used as escalation if retry or rollback do not succeed
• ...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Retry
Limit retries
Rollback Roll-
forward
Checkpoin...
Reset
• Often used as radical escalation if all other measures failed
• Restart service – do not forget to provide a consi...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Retry
Limit retries
Rollback
Restart
Roll-
forward
R...
Failover
• Used as escalation if other measures failed or would take too long
• Requires redundancy – trades resources for...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Retry
Limit retries
Rollback
Restart
Roll-
forward
R...
Read repair
• Handle response failures due to relaxed temporal constraints
• Requires redundancy – trades resources for av...
Read repair example (Riak, Java) 1/2
public class FooResolver implements ConflictResolver<Foo> {
@Override
public Foo reso...
Read repair example (Riak, Java) 2/2
public class BuddyResolver implements ConflictResolver<Buddy> {
@Override
public Budd...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Retry
Limit retries
Rollback
Restart
Roll-
forward
R...
Error Handler
• Separate business logic and error handling
• Business logic just focuses on getting the task done
• Error ...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Retry
Limit retries
Rollback
Restart
Roll-
forward
R...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Fallback
Fail
silently
Alternative action
Default va...
Fallback
• Execute an alternative action if the original action fails
• Basis for most mitigation patterns
• Fail silently...
Default value example (Hystrix, Java) 1/2
public class DefaultValueCommand extends HystrixCommand<String> {
private static...
Default value example (Hystrix, Java) 2/2
@Test
public void shouldSucceed() {
DefaultValueCommand command = new DefaultVal...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Fallback
Fail
silently
Alternative action
Default va...
Queues for resources
• Protect resource from temporary overload situations
• Limit queue size to limit latency at longer-l...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Fallback
Fail
silently
Alternative action
Default va...
Shed Load
• Use if overload will lead to unacceptable throughput of resource
• Shed requests in order to keep throughput o...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Fallback
Fail
silently
Alternative action
Default va...
Share Load
• Use if overload will lead to unacceptable throughput of resource
• Share load between (added) resources to ke...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Fallback
Fail
silently
Alternative action
Default va...
Deferrable work
• Maximize resources for online request processing under high load
• Pause or slow down routine and batch ...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Fallback
Fail
silently
Alternative action
Default va...
Marked data
• Avoid repeated and/or spreading errors due to erroneous data
• Use if time or information to correct data im...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Fallback
Fail
silently
Alternative action
Default va...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Let sleeping dogs lie
Small releasesHot deployments
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Routine
maintenance
Anti-entropy
Routine maintenance
• Reduce system entropy – keep preventable errors from occurring
• Especially important if errors were...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Routine
maintenance
Spread the news
Anti-entropy
Spread the news
• Pro-actively spread information about changes in system state
• Use a gossip or epidemic protocol for ro...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Routine
maintenance
Backup
request
Spread the news
A...
Backup request
• Send request to multiple workers (optionally a bit offset)
• Use quickest reply and discard all other res...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Routine
maintenance
Backup
request
Anti-fragility
Di...
Anti-fragility
• Avoid fragility caused by homogenization and standardization
• Protect against disastrous failures by usi...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Routine
maintenance
Backup
request
Anti-fragility
Di...
Error injection
• Make resilient software design sustainable
• Inject errors at runtime and observe how the system reacts
...
• Chaos Monkey
• Chaos Gorilla
• Chaos Kong
• Latency Monkey
• Compliance Monkey
• Security Monkey
• Janitor Monkey
• Doct...
PreventionDetectionCore
(Architectural)
Recovery
Mitigation
Treatment
Routine
maintenance
Backup
request
Anti-fragility
Di...
Towards a pattern language …
Decisions to make
• General decisions
• Bulkhead type
• Communication paradigm
• Decisions per failure scenario (repeat)
•...
Core
(Architectural)
Detection Treatment Prevention
Recovery
Mitigation
Isolation Redundancy
Communication
paradigm
Suppor...
Core
(Architectural)
Detection Treatment Prevention
Recovery
Mitigation
Isolation Redundancy
Communication
paradigm
Suppor...
Core
(Architectural)
Detection Treatment Prevention
Recovery
Mitigation
Isolation Redundancy
Communication
paradigm
Suppor...
Core
(Architectural)
Detection Treatment Prevention
Recovery
Mitigation
Isolation Redundancy
Communication
paradigm
Suppor...
Wrap-up
• Today’s systems are distributed
• Failures are not avoidable
• Failures are not predictable
• Resilient software...
Further reading
1. Michael T. Nygard, Release It!,
Pragmatic Bookshelf, 2007
2. Robert S. Hanmer,
Patterns for Fault Toler...
Do not avoid failures. Embrace them!
@ufried
Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
Resilience reloaded - more resilience patterns
Upcoming SlideShare
Loading in …5
×

Resilience reloaded - more resilience patterns

4,848 views

Published on

The talk belonging to this slide deck is basically a "sequel" of the "Patterns of resilience" talk. It briefly repeats a few core statements of the previous talk and then starts to complete the picture.

First, it creates a more complete pattern taxonomy and locates the taxonomy of the previous talk in it. Then, it repeats a few of the patterns of the first talk concerning core architectural patterns and error detection patterns, enriching them with some additional information.

After that the core part of this talk starts: The error recovery and error mitigation patterns. The talk presents several well known and not so well known patterns from those two domains, partially augmented with code examples. Then after, briefly touching fault treatment patterns, some more patterns of the error prevention domain are presented.

The talk concludes with a small guideline, how to identify the appropriate pattern set for a concrete situation.

As always, the voice track and therefore a lot of information is missing in this slide deck. Yet, I hope that it provides a few helpful pointers.

Published in: Technology

Resilience reloaded - more resilience patterns

  1. 1. Resilience reloaded More resilience patterns Uwe Friedrichsen, codecentric AG, 2015-2016
  2. 2. @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
  3. 3. Previously on “Resilience” …
  4. 4. Why resilience?
  5. 5. It‘s all about production!
  6. 6. Business Production Availability
  7. 7. Availability ≔ MTTF MTTF + MTTR MTTF: Mean Time To Failure MTTR: Mean Time To Recovery
  8. 8. Traditional stability approach Availability ≔ MTTF MTTF + MTTR Maximize MTTF
  9. 9. (Almost) every system is a distributed system Chas Emerick
  10. 10. The Eight Fallacies of Distributed Computing 1. The network is reliable 2. Latency is zero 3. Bandwidth is infinite 4. The network is secure 5. Topology doesn't change 6. There is one administrator 7. Transport cost is zero 8. The network is homogeneous Peter Deutsch https://blogs.oracle.com/jag/resource/Fallacies.html
  11. 11. A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. Leslie Lamport
  12. 12. Failures in todays complex, distributed and interconnected systems are not the exception. • They are the normal case • They are not predictable • They are not avoidable
  13. 13. Do not try to avoid failures. Embrace them.
  14. 14. Resilience approach Availability ≔ MTTF MTTF + MTTR Minimize MTTR
  15. 15. resilience (IT) the ability of a system to handle unexpected situations  without the user noticing it (best case)  with a graceful degradation of service (worst case)
  16. 16. Isolation Latency Control Fail Fast Circuit Breaker Timeouts Fan out & quickest reply Bounded Queues Shed Load Bulkheads Loose Coupling Asynchronous Communication Event-Driven Idempotency Self-Containment Relaxed Temporal Constraints Location Transparency Stateless Supervision Monitor Complete Parameter Checking Error Handler Escalation
  17. 17. … and there is more • Recovery & mitigation patterns • More supervision patterns • Architectural patterns • Anti-fragility patterns • Fault treatment & prevention patterns A rich pattern family
  18. 18. (Title music starts & opening credits shown)
  19. 19. Let’s complete the picture first …
  20. 20. Isolation Latency Control Loose Coupling Supervision
  21. 21. Core (Architectural) Detection Treatment Prevention Recovery Mitigation
  22. 22. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Isolation Loose Coupling Latency Control Node level Supervision System level
  23. 23. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Isolation Redundancy Communication paradigm Supporting patterns
  24. 24. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Supporting patterns Communication paradigm RedundancyIsolation
  25. 25. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Supporting patterns Communication paradigm RedundancyIsolation Bulkhead
  26. 26. Bulkheads • Core isolation pattern (a.k.a. “failure units” or “units of mitigation”) • Shaping good bulkheads is hard (pure design issue) • Diverse implementation choices available, e.g., µservice, actor, scs, ... • Implementation choice impacts system and resilience design a lot
  27. 27. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Supporting patterns Communication paradigm RedundancyIsolation
  28. 28. Communication paradigm • Request-response <-> messaging <-> events • Not a pattern, but heavily influences resilience patterns to be used • Also heavily influences functional bulkhead design • Very fundamental decision which is often underestimated
  29. 29. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Supporting patterns Communication paradigm RedundancyIsolation
  30. 30. Redundancy • Core resilience concept • Applicable to all failure types • Basis for many recovery and mitigation patterns • Often different variants implemented in a system
  31. 31. Failure types • Crash failure • Omission failure • Timing failure • Response failure • Byzantine failure
  32. 32. Failure types • Crash failure • Omission failure • Timing failure • Response failure • Byzantine failure Usage of redundancy • Patterns • Failover • Schemes • Active/Passive • Active/Active • N+M Redundancy • Implementation examples • Load balancer + health check (e.g., HAProxy) • Dynamic routing + health check (e.g., Consul, ZooKeeper) • Cluster manager with shared IP (e.g., Pacemaker & Corosync)
  33. 33. Failure types • Crash failure • Omission failure • Timing failure • Response failure • Byzantine failure Usage of redundancy • Patterns • Retry (to different replica) • Failover • Backup Request • Schemes • Identical replicas • Failover schemes (for failover) • Implementation examples • Client-based routing • Load balancer • Leaky bucket + dynamic routing
  34. 34. Failure types • Crash failure • Omission failure • Timing failure • Response failure • Byzantine failure Usage of redundancy • Patterns • Timeout + retry to different replica • Timeout + failover • Backup Request • Schemes • Identical replicas • Failover schemes (for failover) • Implementation examples • Client-based routing • Load balancer • Circuit breaker + dynamic routing
  35. 35. Failure types • Crash failure • Omission failure • Timing failure • Response failure • Byzantine failure Usage of redundancy • Patterns • Voting • Recovery blocks • Routine exercise • Schemes • Identical replicas • Different replicas (recovery blocks) • Implementation examples • Majority based quorum • Adaptive weighted sum • Synthetic computation
  36. 36. Failure types • Crash failure • Omission failure • Timing failure • Response failure • Byzantine failure Usage of redundancy • Patterns • Voting • Recovery blocks • Routine exercise • Schemes • Identical replicas • Different replicas (recovery blocks) • Implementation examples • n > 3t quorum • Adaptive weighted sum • Synthetic computation
  37. 37. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Supporting patterns Communication paradigm RedundancyIsolation Stateless Idempotency Escalation Structural Behavioral Zero downtime deployment Location transparency Relaxed temporal constraints
  38. 38. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Node level Supporting patterns System level
  39. 39. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Node level Supporting patterns System level Timeout Circuit breaker Complete parameter checking Checksum
  40. 40. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Node level Supporting patterns System level Monitor Watchdog Heartbeat Acknowledgement
  41. 41. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Node level Supporting patterns System level Voting Synthetic transaction Leaky bucketRoutine checks Health check Fail fast
  42. 42. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment
  43. 43. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Retry Limit retries
  44. 44. Retry • Very basic recovery pattern • Recover from omission errors • Limit retries to minimize extra load on an already loaded resource • Limit retries to avoid recurring errors
  45. 45. Retry example // doAction returns true if successful, false otherwise boolean doAction(...) { ... } // General pattern boolean success = false int tries = 0; while (!success && (tries < MAX_TRIES)) { success = doAction(...); tries++; } // Alternative one-retry-only variant success = doAction(...) || doAction(...);
  46. 46. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Retry Limit retries Rollback Checkpoint Safe point
  47. 47. Rollback • Roll back state and/or execution path to a defined safe state • Recover from internal errors caused by external failures • Use checkpoints and safe points to provide safe rollback points • Limit retries to avoid recurring errors
  48. 48. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Retry Limit retries Rollback Checkpoint Safe point Roll- forward
  49. 49. Roll-forward • Advance execution past the point of error • Often used as escalation if retry or rollback do not succeed • Not applicable if skipped activity is essential • Use checkpoints and safe points to provide safe roll-forward points
  50. 50. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Retry Limit retries Rollback Roll- forward Checkpoint Safe point Restart Reconnec t Data Reset Startup consistenc y Reset
  51. 51. Reset • Often used as radical escalation if all other measures failed • Restart service – do not forget to provide a consistent startup state • Reset data to a guaranteed consistent state if nothing else helps • Sometimes simply trying to reconnect helps (often forgotten)
  52. 52. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Retry Limit retries Rollback Restart Roll- forward Reconnec t Checkpoint Safe point Data Reset Startup consistenc y Reset Failover
  53. 53. Failover • Used as escalation if other measures failed or would take too long • Requires redundancy – trades resources for availability • Many implementation variants available, incl. out-of-the-box solutions • Usually implemented as a monitor-dynamic router combination
  54. 54. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Retry Limit retries Rollback Restart Roll- forward Reconnec t Checkpoint Safe point Data Reset Startup consistenc y Failover Reset Read repair
  55. 55. Read repair • Handle response failures due to relaxed temporal constraints • Requires redundancy – trades resources for availability • Decides correct state based on conflicting siblings • Often implemented in NoSQL databases (but not always accessible)
  56. 56. Read repair example (Riak, Java) 1/2 public class FooResolver implements ConflictResolver<Foo> { @Override public Foo resolve(List<Foo> siblings) { // Insert your sibling resolution logic here } } public class Buddy { public String name; public Set<String> nicknames; public Buddy(String name, Set<String> nicknames) { this.name = name; this.nicknames = nicknames; } }
  57. 57. Read repair example (Riak, Java) 2/2 public class BuddyResolver implements ConflictResolver<Buddy> { @Override public Buddy resolve(List<Buddy> siblings) { if (siblings.size == 0) { return null; } else if (siblings.size == 1) { return siblings.get(0); } else { // Name is also used as key. Thus, all siblings have the same name String name = siblings.get(0). name; Set<String> mergedNicknames = new HashSet<String>(); for (Buddy buddy : siblings) { mergedNicknames.addAll(buddy.nicknames); } return new Buddy(name, mergedNicknames); } } }
  58. 58. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Retry Limit retries Rollback Restart Roll- forward Reconnec t Checkpoint Safe point Data Reset Startup consistenc y Failover Read repair Reset Error handler
  59. 59. Error Handler • Separate business logic and error handling • Business logic just focuses on getting the task done • Error handler focuses on recovering from errors • Easier to maintain – can be extended to structural escalation
  60. 60. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Retry Limit retries Rollback Restart Roll- forward Reconnec t Checkpoint Safe point Data Reset Startup consistenc y Failover Read repair Error handler Reset
  61. 61. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment
  62. 62. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Fallback Fail silently Alternative action Default value
  63. 63. Fallback • Execute an alternative action if the original action fails • Basis for most mitigation patterns • Fail silently – silently ignore the error and continue processing • Default value – return a predefined default value if an error occurs
  64. 64. Default value example (Hystrix, Java) 1/2 public class DefaultValueCommand extends HystrixCommand<String> { private static final String COMMAND_GROUP = "default”; private final boolean preCondition; public DefaultValueCommand(boolean preCondition) { super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP)); this.preCondition = preCondition; } @Override protected String run() throws Exception { if (!preCondition) throw new RuntimeException((”Action failed")); return ”I am a result"; } @Override protected String getFallback() { return ”I am a default value"; // Return default value if action fails } }
  65. 65. Default value example (Hystrix, Java) 2/2 @Test public void shouldSucceed() { DefaultValueCommand command = new DefaultValueCommand(true); String s = command.execute(); assertEquals(”I am a result", s); } @Test public void shouldProvideDefaultValue () { DefaultValueCommand command = new DefaultValueCommand(false); String s = null; try { s = command.execute(); } catch (Exception e) { fail("Did not return default value"); } assertEquals(”I am a default value", s); }
  66. 66. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Fallback Fail silently Alternative action Default value Queue for resources Bounded queue Finish work in progress Fresh work before stale
  67. 67. Queues for resources • Protect resource from temporary overload situations • Limit queue size to limit latency at longer-lasting overload • Finish work in progress – Create pushback on the callers • Fresh work before stale – Discard old entries
  68. 68. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Fallback Fail silently Alternative action Default value Queue for resources Bounded queue Finish work in progress Fresh work before stale Shed load
  69. 69. Shed Load • Use if overload will lead to unacceptable throughput of resource • Shed requests in order to keep throughput of resource acceptable • Shed load at periphery – Minimize impact on resource itself • Usually combined with monitor to watch load of resource
  70. 70. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Fallback Fail silently Alternative action Default value Shed load Queue for resources Bounded queue Finish work in progress Fresh work before stale Share load Statically Dynamically
  71. 71. Share Load • Use if overload will lead to unacceptable throughput of resource • Share load between (added) resources to keep throughput good • Minimize amount of synchronization needed between resources • Usually combined with monitor to watch load of resource(s)
  72. 72. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Fallback Fail silently Alternative action Default value Shed loadShare load Queue for resources Bounded queue Finish work in progress Fresh work before staleStatically Dynamically Deferrable work
  73. 73. Deferrable work • Maximize resources for online request processing under high load • Pause or slow down routine and batch jobs • Provide a means to pause routine and batch jobs from outside • Alternatively use a scheduler with dynamic resource allocation
  74. 74. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Fallback Fail silently Alternative action Default value Shed loadShare load Queue for resources Bounded queue Finish work in progress Fresh work before stale Marked data Statically Dynamically Deferrable work
  75. 75. Marked data • Avoid repeated and/or spreading errors due to erroneous data • Use if time or information to correct data immediately is missing • Mark data as being erroneous – check flag before processing data • Use routine maintenance job to correct data
  76. 76. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Fallback Fail silently Alternative action Default value Shed loadShare load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before staleStatically Dynamically Deferrable work
  77. 77. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment
  78. 78. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Let sleeping dogs lie Small releasesHot deployments
  79. 79. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment
  80. 80. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Routine maintenance Anti-entropy
  81. 81. Routine maintenance • Reduce system entropy – keep preventable errors from occurring • Especially important if errors were only mitigated, not corrected • Check system periodically and fix detected faults and errors • Balance benefits, costs and additional system load
  82. 82. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Routine maintenance Spread the news Anti-entropy
  83. 83. Spread the news • Pro-actively spread information about changes in system state • Use a gossip or epidemic protocol for robustness and efficiency • Can also be used for data reconciliation • Balance benefits, costs and additional network load
  84. 84. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Routine maintenance Backup request Spread the news Anti-entropy
  85. 85. Backup request • Send request to multiple workers (optionally a bit offset) • Use quickest reply and discard all other responses • Prevents latent responses (or at least reduces probability) • Requires redundancy – trades resources for availability
  86. 86. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Routine maintenance Backup request Anti-fragility Diversity Jitter Spread the news Anti-entropy
  87. 87. Anti-fragility • Avoid fragility caused by homogenization and standardization • Protect against disastrous failures by using diverse solutions • Protect against cumulating effects by introducing jitter • Balance risks, benefits and added costs and efforts carefully
  88. 88. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy
  89. 89. Error injection • Make resilient software design sustainable • Inject errors at runtime and observe how the system reacts • Can also be used to detect yet unknown failure modes • Make sure to inject errors of all types
  90. 90. • Chaos Monkey • Chaos Gorilla • Chaos Kong • Latency Monkey • Compliance Monkey • Security Monkey • Janitor Monkey • Doctor Monkey https://github.com/Netflix/SimianArmy
  91. 91. PreventionDetectionCore (Architectural) Recovery Mitigation Treatment Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy
  92. 92. Towards a pattern language …
  93. 93. Decisions to make • General decisions • Bulkhead type • Communication paradigm • Decisions per failure scenario (repeat) • Error detection on node & system level • Recovery/mitigation mechanism • Supporting treatment mechanism • Supporting prevention mechanism • Complementing decisions • Complementing redundancy mechanism(s) • Complementing architectural patterns
  94. 94. Core (Architectural) Detection Treatment Prevention Recovery Mitigation Isolation Redundancy Communication paradigm Supporting patterns Node level System level 1 Decide core system properties 2 Choose patterns per failure scenario (Have the different failure types in mind)3 Decide complementing patterns Ongoing Create and refine system design and functional decomposition. Functionally decouple bulkheads (A good functional decomposition on business level is the prerequisite for an effective resilience)
  95. 95. Core (Architectural) Detection Treatment Prevention Recovery Mitigation Isolation Redundancy Communication paradigm Supporting patterns Node level System level Example: Erlang (Akka) MonitorMessaging Actor Escalation Heartbeat Restart (Let it crash) Hot deployments
  96. 96. Core (Architectural) Detection Treatment Prevention Recovery Mitigation Isolation Redundancy Communication paradigm Supporting patterns Node level System level Example: Netflix Monitor Request/r esponse (Micro)Service Retry Hot deployments (Canary releases) Fallback Share load Bounded queue Timeout Circuit breake r Several variants Error injection
  97. 97. Core (Architectural) Detection Treatment Prevention Recovery Mitigation Isolation Redundancy Communication paradigm Supporting patterns Node level System level What is your pattern language?
  98. 98. Wrap-up • Today’s systems are distributed • Failures are not avoidable • Failures are not predictable • Resilient software design needed • Rich pattern language • Start with core system properties • Choose patterns based on failure scenarios • Complement with careful functional design
  99. 99. Further reading 1. Michael T. Nygard, Release It!, Pragmatic Bookshelf, 2007 2. Robert S. Hanmer, Patterns for Fault Tolerant Software, Wiley, 2007 3. Andrew Tanenbaum, Marten van Steen, Distributed Systems – Principles and Paradigms, Prentice Hall, 2nd Edition, 2006 4. Hystrix Wiki, https://github.com/Netflix/Hystrix/wiki 5. Uwe Friedrichsen, Patterns of resilience, http://de.slideshare.net/ufried/patterns-of- resilience
  100. 100. Do not avoid failures. Embrace them!
  101. 101. @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com

×