Why resilience - A primer at varying flight altitudes


This session provides a primer to resilience at varying flight altitudes.

It starts at a management level and motivates why resilience is important, why it is important today and what the business case for resilience is (or actually is not).

Then it descends to a high-level architectural view and explains resilience in a bit more detail, including its relation to availability and the difference between resilience and robustness.

Afterwards it descends to the design level and explains some selected core principles of resilience, some of them garnished with code examples at grass-roots flight altitude.

At the end, the flight altitude rises again: the session gives some recommendations on how to introduce resilient software design into your software development process and explains how resilience relates to some neighboring topics.

Of course, this slide deck shows only a fraction of the actual talk content, as the voice track is missing, but I hope it will be helpful anyway.

Published in: Technology

Why resilience - A primer at varying flight altitudes

  1. 1. Why Resilience? A primer at varying flight altitudes Uwe Friedrichsen, codecentric AG, 2014
  2. 2. @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
  3. 3. Resilience? Never heard of it …
  4. 4. re•sil•ience (rɪˈzɪl yəns) also re•sil′ien•cy, n. 1.  the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity. 2.  ability to recover readily from illness, depression, adversity, or the like; buoyancy. Random House Kernerman Webster's College Dictionary, © 2010 K Dictionaries Ltd. Copyright 2005, 1997, 1991 by Random House, Inc. All rights reserved. http://www.thefreedictionary.com/resilience
  5. 5. Resilience (IT) The ability of an application to handle unexpected situations -  without the user noticing it (best case) -  with a graceful degradation of service (worst case)
  6. 6. Resilience is not about testing your application (You should definitely test your application, but that's a different story)

         import static org.junit.Assert.assertEquals;
         import org.junit.Test;

         public class MySUTTest {
             @Test
             public void shouldDoSomething() {
                 MySUT sut = new MySUT();
                 MyResult result = sut.doSomething();
                 assertEquals(<Some expected result>, result);
             }
             …
         }

  7. 7. It's all about production!
  8. 8. Why should I care?
  9. 9. Business Production Availability Resilience
  10. 10. Your web server doesn't look good …
  11. 11. The dreaded SiteTooSuccessfulException …
  12. 12. Reasons to care about resilience •  Loss of lives •  Loss of goods (manufacturing facilities) •  Loss of money •  Loss of reputation
  13. 13. Why should I care about it today? (The risks you mention are not new)
  14. 14. Resilience drivers •  Cloud-based systems •  Highly scalable systems •  Zero Downtime •  IoT & Mobile •  Social → Reliably running distributed systems
  15. 15. What’s the business case? (I don’t see any money to be made with it)
  16. 16. Counter question Can you afford to ignore it? (It’s not about making money, it’s about not losing money)
  17. 17. Resilience business case •  Identify risk scenarios •  Calculate current occurrence probability •  Calculate future occurrence probability •  Calculate short-term losses •  Calculate long-term losses •  Assess risks and money •  Do not forget the competitors
  18. 18. Let’s dive deeper into resilience
  19. 19. Classification attempt ISO/IEC 9126 software quality characteristics: Functionality, Reliability, Usability, Efficiency, Maintainability, Portability. Reliability: A set of attributes that bear on the capability of software to maintain its level of performance under stated conditions for a stated period of time. Available with acceptable latency. Resilience goes beyond that.
  20. 20. How can I maximize availability?
  21. 21. Availability ≔ MTTF / (MTTF + MTTR) MTTF: Mean Time To Failure MTTR: Mean Time To Recovery
  22. 22. Traditional approach (robustness) Availability ≔ MTTF / (MTTF + MTTR) Maximize MTTF
  23. 23. A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. Leslie Lamport
  24. 24. Failures in today's complex, distributed, interconnected systems are not the exception. They are the normal case.
  25. 25. Contemporary approach (resilience) Availability ≔ MTTF / (MTTF + MTTR) Minimize MTTR
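
As a small worked illustration of the availability formula on the slides above, here is a minimal sketch. The class name `Availability` and the sample figures mentioned afterwards are assumptions made for this example, not numbers from the talk.

```java
/** Hypothetical helper illustrating Availability = MTTF / (MTTF + MTTR). */
final class Availability {
    private Availability() {}

    /**
     * Computes availability as a fraction between 0 and 1.
     * Both arguments must use the same time unit (for example hours).
     */
    static double of(double mttf, double mttr) {
        if (mttf < 0 || mttr < 0 || mttf + mttr == 0) {
            throw new IllegalArgumentException(
                "MTTF and MTTR must be non-negative and not both zero");
        }
        return mttf / (mttf + mttr);
    }
}
```

With an assumed MTTF of 1,000 hours, `Availability.of(1000, 1)` is roughly 0.9990, while cutting the recovery time to six minutes, `Availability.of(1000, 0.1)`, yields roughly 0.9999. Mathematically, doubling MTTF and halving MTTR have the same effect; the point of the resilience approach is that in distributed systems failures cannot be pushed out indefinitely, which leaves MTTR as the lever within reach.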
  26. 26. Do not try to avoid failures. Embrace them.
  27. 27. What kinds of failures do I need to deal with?
  28. 28. Failure types •  Crash failure •  Omission failure •  Timing failure •  Response failure •  Byzantine failure
  29. 29. How do I implement resilience?
  30. 30. Bulkheads
  31. 31. •  Divide system into failure units •  Isolate failure units •  Define fallback strategy
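
The three bulkhead steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the approach shown in the talk: the class `Bulkhead` and its `call` method are hypothetical names, and a `java.util.concurrent.Semaphore` stands in for whatever isolation mechanism a real system would use.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical bulkhead: caps concurrent calls into one failure unit. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action inside the failure unit, or the fallback if that is not possible. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            // The unit is saturated: reject immediately instead of queueing up callers.
            return fallback.get();
        }
        try {
            return action.get();
        } catch (RuntimeException e) {
            // The failure stays inside this unit; callers only see the fallback.
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

Giving each downstream dependency its own `Bulkhead` instance means a slow or failing dependency can exhaust only its own permits, not the resources of the whole application.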
  32. 32. Redundancy
  33. 33. •  Elaborate use case: Minimize MTTR / scale transactions / handle response errors / … •  Define routing & balancing strategy: Round robin / master-slave / fan-out & quickest one wins / … •  Consider admin involvement: Automatic vs. manual / notification – monitoring / …
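
One of the routing strategies named above, "fan-out & quickest one wins", can be sketched with the standard `ExecutorService.invokeAny` call. The class `FanOut` and the method `quickestOf` are hypothetical names for this example; a real setup would also need to decide how replicas are discovered and whether a shared thread pool is used.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical fan-out helper: ask all replicas, take the first successful answer. */
final class FanOut {
    private FanOut() {}

    static <T> T quickestOf(List<Callable<T>> replicas, long timeout, TimeUnit unit) {
        // One thread per replica so that all requests really run in parallel.
        ExecutorService executor = Executors.newFixedThreadPool(replicas.size());
        try {
            // invokeAny returns the result of the first task that completes successfully.
            return executor.invokeAny(replicas, timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for replicas", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("No replica answered successfully in time", e);
        } finally {
            // Interrupt the slower replicas; their answers are no longer needed.
            executor.shutdownNow();
        }
    }
}
```

The price of this strategy is extra load, since every request is sent to every replica, so it fits read-only or idempotent operations best.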
  34. 34. Loose Coupling
  35. 35. •  Isolate failure units (complements bulkheads) •  Go asynchronous wherever possible •  Use timeouts & circuit breakers •  Make actions idempotent
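
To illustrate the last bullet, "make actions idempotent", here is a minimal sketch. It assumes that every request carries a client-generated request ID; the class `IdempotentAccount` and its methods are hypothetical and only serve the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical account whose deposits are safe to retry. */
final class IdempotentAccount {
    // Maps each already processed request ID to the balance it produced.
    private final Map<String, Long> processed = new HashMap<>();
    private long balance;

    /**
     * Applies a deposit at most once per request ID.
     * A retried request returns the result of the first attempt
     * instead of changing the balance again.
     */
    synchronized long deposit(String requestId, long amount) {
        Long previous = processed.get(requestId);
        if (previous != null) {
            return previous;
        }
        balance += amount;
        processed.put(requestId, balance);
        return balance;
    }

    synchronized long balance() {
        return balance;
    }
}
```

With idempotent actions, a client that hits a timeout can simply resend the request: it does not need to know whether the first attempt was lost or only its response was.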
  36. 36. Implementation Example #1 Timeouts
  37. 37. Timeouts (1)

         // Basics
         myObject.wait();          // Do not use this by default
         myObject.wait(TIMEOUT);   // Better use this

         // Some more basics
         myThread.join();          // Do not use this by default
         myThread.join(TIMEOUT);   // Better use this

  38. 38. Timeouts (2)

         // Using the Java concurrent library
         Callable<MyActionResult> myAction = <My Blocking Action>
         ExecutorService executor = Executors.newSingleThreadExecutor();
         Future<MyActionResult> future = executor.submit(myAction);
         MyActionResult result = null;
         try {
             result = future.get();                   // Do not use this by default
             result = future.get(TIMEOUT, TIMEUNIT);  // Better use this
         } catch (TimeoutException e) {               // Only thrown if timeouts are used
             ...
         } catch (...) {
             ...
         }

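  39. 39. Timeouts (3)

         // Using Guava SimpleTimeLimiter
         // Note: the constructor and the four-argument callWithTimeout match the Guava API of 2014;
         // newer Guava releases use SimpleTimeLimiter.create(ExecutorService) and a
         // three-argument callWithTimeout that throws TimeoutException
         Callable<MyActionResult> myAction = <My Blocking Action>
         SimpleTimeLimiter limiter = new SimpleTimeLimiter();
         MyActionResult result = null;
         try {
             result = limiter.callWithTimeout(myAction, TIMEOUT, TIMEUNIT, false);
         } catch (UncheckedTimeoutException e) {
             ...
         } catch (...) {
             ...
         }
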
  40. 40. Implementation Example #2 Circuit Breaker
  41. 41. Circuit Breaker – concept [Diagram: a Client sends a Request through the Circuit Breaker to a Resource. Lifecycle states: Closed, Open, Half-Open, with transitions labeled "Resource unavailable" and "Resource available"]
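
The Closed / Open / Half-Open lifecycle from the concept slide can be sketched as a small state machine. This is a simplified illustration, not the implementation used in the talk or in any library: the class `CircuitBreaker`, the failure threshold, the open duration and the injectable clock are all assumptions made for this example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker with the three classic states. */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures that trip the breaker
    private final long openMillis;       // how long to stay open before probing again
    private final LongSupplier clock;    // injectable time source, e.g. System::currentTimeMillis
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns the current state, moving from OPEN to HALF_OPEN once the wait time is over. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Sends the request to the resource unless the breaker is open; falls back on failure. */
    <T> T call(Supplier<T> request, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();  // fail fast, the resource is not even contacted
        }
        try {
            T result = request.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed probe in HALF_OPEN reopens the breaker immediately.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

One simplification to be aware of: in the half-open state this sketch lets every concurrent caller through, whereas production-grade breakers usually admit only a single probe request.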
  42. 42. Implemented patterns •  Timeout •  Circuit breaker •  Load shedder
  43. 43. Supported patterns •  Bulkheads (a.k.a. Failure Units) •  Fail fast •  Fail silently •  Graceful degradation of service •  Failover •  Escalation •  Retry •  ...
  44. 44. Hello, world!
  45. 45.

         public class HelloCommand extends HystrixCommand<String> {
             private static final String COMMAND_GROUP = "default";
             private final String name;

             public HelloCommand(String name) {
                 super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
                 this.name = name;
             }

             @Override
             protected String run() throws Exception {
                 return "Hello, " + name;
             }
         }

         @Test
         public void shouldGreetWorld() {
             String result = new HelloCommand("World").execute();
             assertEquals("Hello, World", result);
         }

  46. 46. Source: https://github.com/Netflix/Hystrix/wiki/How-it-Works
  47. 47. Fallbacks
  48. 48. •  What will you do if a request fails? •  Consider failure handling from the very beginning •  Supplement with general failure handling strategies
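
The question "What will you do if a request fails?" can be answered in code by pairing every remote call with a timeout and a fallback value. The following sketch uses only the Java concurrency classes already shown on the timeout slides; the class `WithFallback` and the method `callOrFallback` are hypothetical names, and the executor is created per call only to keep the example self-contained.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: run a request with a timeout, return a fallback if it fails. */
final class WithFallback {
    private WithFallback() {}

    static <T> T callOrFallback(Callable<T> request, long timeout, TimeUnit unit, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(request);
        try {
            return future.get(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
            return fallback;
        } catch (ExecutionException | TimeoutException e) {
            // The request failed or took too long: degrade gracefully.
            return fallback;
        } finally {
            future.cancel(true);      // no effect if the request already completed
            executor.shutdownNow();
        }
    }
}
```

A typical fallback value would be a cached or default response, so that the user sees a degraded answer rather than an error.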
  49. 49. Scalability
  50. 50. •  Define scaling strategy •  Think full stack •  Apply D-I-D rule •  Design for elasticity
  51. 51. … and many more •  Supervision patterns •  Recovery & mitigation patterns •  Anti-fragility patterns •  Supporting patterns •  A rich pattern family Different approach than traditional enterprise software development
  52. 52. How do I integrate resilience into my software development process?
  53. 53. Steps to adopt resilient software design 1.  Create awareness: Go DevOps 2.  Create capability: Coach your developers 3.  Create sustainability: Inject errors
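
Step 3, "inject errors", can start very small: a wrapper that makes a dependency fail on purpose at a configurable rate, so that fallbacks and timeouts are exercised regularly. The class `FaultInjector` below is a hypothetical sketch for this purpose and is not taken from the talk or from any specific tool.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Hypothetical fault injector: makes wrapped calls fail at a configurable rate. */
final class FaultInjector {
    private final double failureRate;  // 0.0 = never fail, 1.0 = always fail
    private final Random random;

    FaultInjector(double failureRate, Random random) {
        if (failureRate < 0.0 || failureRate > 1.0) {
            throw new IllegalArgumentException("failureRate must be between 0.0 and 1.0");
        }
        this.failureRate = failureRate;
        this.random = random;
    }

    /** Returns a supplier that behaves like the delegate but sometimes throws instead. */
    <T> Supplier<T> wrap(Supplier<T> delegate) {
        return () -> {
            // nextDouble() is in [0.0, 1.0), so a rate of 1.0 always fails and 0.0 never does.
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("Injected failure");
            }
            return delegate.get();
        };
    }
}
```

Passing a seeded `Random` keeps test runs reproducible, while an unseeded one gives the irregular failure pattern that is closer to production behavior.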
  54. 54. Related topics Reactive Anti-fragility Fault-tolerant software design Recovery-oriented computing
  55. 55. Wrap-up •  Resilience is about availability •  Crucial for today's complex systems •  Not caring is a risk •  Go DevOps to create awareness
  56. 56. Do not avoid failures. Embrace them!
  57. 57. @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
