Why resilience - A primer at varying flight altitudes

This session provides a primer to resilience at varying flight altitudes.

It starts at a management level and motivates why resilience is important, why it is important today and what the business case for resilience is (or actually is not).

Then it descends to a high-level architectural view and explains resilience in a bit more detail, its correlation to availability and the difference between resilience and robustness.

Afterwards it descends to a design level and explains some selected core principles of resilience, some of them illustrated with code examples at grass-roots flight altitude.

At the end the flight altitude rises again: some recommendations on how to introduce resilient software design into your software development process are given, and the correlation to some related topics is explained.

Of course, this slide deck only shows a fraction of the actual talk contents, as the voice track is missing, but I hope it will be helpful anyway.



Presentation Transcript

  • Why Resilience? A primer at varying flight altitudes Uwe Friedrichsen, codecentric AG, 2014
  • @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
  • Resilience? Never heard of it …
  • re•sil•ience (rɪˈzɪl yəns) also re•sil′ien•cy, n. 1.  the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity. 2.  ability to recover readily from illness, depression, adversity, or the like; buoyancy. Random House Kernerman Webster's College Dictionary, © 2010 K Dictionaries Ltd. Copyright 2005, 1997, 1991 by Random House, Inc. All rights reserved. http://www.thefreedictionary.com/resilience
  • Resilience (IT) The ability of an application to handle unexpected situations -  without the user noticing it (best case) -  with a graceful degradation of service (worst case)
  • Resilience is not about testing your application (You should definitely test your application, but that’s a different story)
        public class MySUTTest {
            @Test
            public void shouldDoSomething() {
                MySUT sut = new MySUT();
                MyResult result = sut.doSomething();
                assertEquals(<Some expected result>, result);
            }
            …
        }
  • It’s all about production!
  • Why should I care?
  • Business Production Availability Resilience
  • Your web server doesn’t look good …
  • The dreaded SiteTooSuccessfulException …
  • Reasons to care about resilience •  Loss of lives •  Loss of goods (manufacturing facilities) •  Loss of money •  Loss of reputation
  • Why should I care about it today? (The risks you mention are not new)
  • Resilience drivers •  Cloud-based systems •  Highly scalable systems •  Zero Downtime •  IoT & Mobile •  Social → Reliably running distributed systems
  • What’s the business case? (I don’t see any money to be made with it)
  • Counter question Can you afford to ignore it? (It’s not about making money, it’s about not losing money)
  • Resilience business case •  Identify risk scenarios •  Calculate current occurrence probability •  Calculate future occurrence probability •  Calculate short-term losses •  Calculate long-term losses •  Assess risks and money •  Do not forget the competitors
  • Let’s dive deeper into resilience
  • Classification attempt: ISO/IEC 9126 software quality characteristics: Functionality, Reliability, Usability, Efficiency, Maintainability, Portability. Reliability: A set of attributes that bear on the capability of software to maintain its level of performance under stated conditions for a stated period of time. Available with acceptable latency. Resilience goes beyond that.
  • How can I maximize availability?
  • Availability ≔ MTTF / (MTTF + MTTR) MTTF: Mean Time To Failure MTTR: Mean Time To Recovery
  • Traditional approach (robustness) Availability ≔ MTTF / (MTTF + MTTR) Maximize MTTF
  • A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. Leslie Lamport
  • Failures in today’s complex, distributed, interconnected systems are not the exception. They are the normal case.
  • Contemporary approach (resilience) Availability ≔ MTTF / (MTTF + MTTR) Minimize MTTR
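A rough worked example of the two approaches, using made-up numbers (a baseline of 1000 hours MTTF and 10 hours MTTR). The class and method names are illustrative and do not come from the talk. With these assumed values, doubling MTTF lifts availability from about 0.9901 to about 0.9950, while cutting MTTR to one hour reaches about 0.9990.

```java
// Illustrative helper; the names and sample values are assumptions, not from the slides.
final class Availability {
    private Availability() {}

    /** Availability = MTTF / (MTTF + MTTR); both arguments use the same time unit. */
    static double of(double mttf, double mttr) {
        if (mttf < 0 || mttr < 0 || mttf + mttr == 0) {
            throw new IllegalArgumentException("MTTF and MTTR must be non-negative and not both zero");
        }
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        double baseline = of(1000.0, 10.0);   // fails every 1000 h, needs 10 h to recover
        double robust = of(2000.0, 10.0);     // robustness: double the time to failure
        double resilient = of(1000.0, 1.0);   // resilience: recover ten times faster
        System.out.printf("baseline=%.4f robust=%.4f resilient=%.4f%n", baseline, robust, resilient);
    }
}
```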
  • Do not try to avoid failures. Embrace them.
  • What kinds of failures do I need to deal with?
  • Failure types •  Crash failure •  Omission failure •  Timing failure •  Response failure •  Byzantine failure
  • How do I implement resilience?
  • Bulkheads
  • •  Divide system into failure units •  Isolate failure units •  Define fallback strategy
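The three steps above can be sketched in plain Java. This is a minimal sketch, assuming one `Bulkhead` instance per failure unit and a semaphore as the isolation mechanism; a dedicated thread pool per unit is another common choice. The class name and its API are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: one Bulkhead instance guards one failure unit.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise, or on failure, uses the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();      // unit is saturated: reject instead of queueing up
        }
        try {
            return action.get();
        } catch (RuntimeException e) {
            return fallback.get();      // the failure stays inside this unit
        } finally {
            permits.release();
        }
    }
}
```

Because callers never block on a saturated unit, a slow dependency can only tie up as many caller threads as its bulkhead has permits.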
  • Redundancy
  • •  Elaborate use case: Minimize MTTR / scale transactions / handle response errors / … •  Define routing & balancing strategy: Round robin / master-slave / fan-out & quickest one wins / … •  Consider admin involvement: Automatic vs. manual / notification – monitoring / …
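One of the routing strategies named above, "fan-out & quickest one wins", could look roughly like this with the standard `ExecutorService.invokeAny` call. This is a sketch under assumptions: the `FanOut` class and the idea of modelling each redundant replica as a `Callable` are illustrative only.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: send the same request to all replicas, take the first success.
final class FanOut {
    private FanOut() {}

    static <T> T quickestOf(List<Callable<T>> replicas, long timeout, TimeUnit unit) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("at least one replica is required");
        }
        ExecutorService executor = Executors.newFixedThreadPool(replicas.size());
        try {
            // invokeAny returns the result of one task that completed without throwing
            return executor.invokeAny(replicas, timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a replica", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("no replica answered successfully in time", e);
        } finally {
            executor.shutdownNow();     // interrupt the replicas that lost the race
        }
    }
}
```

The price of this strategy is extra load on every replica, so it fits read-only or idempotent requests best.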
  • Loose Coupling
  • •  Isolate failure units (complements bulkheads) •  Go asynchronous wherever possible •  Use timeouts & circuit breakers •  Make actions idempotent
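Idempotent actions matter because timeouts and retries can deliver the same request more than once. A minimal sketch, assuming every request carries a caller-supplied request ID; the `IdempotentReceiver` class is hypothetical, and a real one would also need to expire old entries.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: remembers results by request ID so that retries have no extra effect.
final class IdempotentReceiver<R> {
    // Unbounded for brevity; a production version would evict old entries.
    private final Map<String, R> resultsByRequestId = new ConcurrentHashMap<>();

    /** Executes the action at most once per request ID and replays the stored result afterwards. */
    R handle(String requestId, Supplier<R> action) {
        // Note: a null result is not stored, so such an action would run again on retry.
        return resultsByRequestId.computeIfAbsent(requestId, id -> action.get());
    }
}
```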
  • Implementation Example #1 Timeouts
  • Timeouts (1)
        // Basics
        myObject.wait();          // Do not use this by default
        myObject.wait(TIMEOUT);   // Better use this
        // Some more basics
        myThread.join();          // Do not use this by default
        myThread.join(TIMEOUT);   // Better use this
  • Timeouts (2)
        // Using the Java concurrent library
        Callable<MyActionResult> myAction = <My Blocking Action>
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<MyActionResult> future = executor.submit(myAction);
        MyActionResult result = null;
        try {
            result = future.get();                   // Do not use this by default
            result = future.get(TIMEOUT, TIMEUNIT);  // Better use this
        } catch (TimeoutException e) {               // Only thrown if timeouts are used
            ...
        } catch (...) {
            ...
        }
  • Timeouts (3)
        // Using Guava SimpleTimeLimiter
        Callable<MyActionResult> myAction = <My Blocking Action>
        SimpleTimeLimiter limiter = new SimpleTimeLimiter();
        MyActionResult result = null;
        try {
            result = limiter.callWithTimeout(myAction, TIMEOUT, TIMEUNIT, false);
        } catch (UncheckedTimeoutException e) {
            ...
        } catch (...) {
            ...
        }
  • Implementation Example #2 Circuit Breaker
  • Circuit Breaker – concept (diagram: the Client sends its Request to the Resource through the Circuit Breaker; lifecycle states: Closed, Open, Half-Open; transition labels: Resource unavailable, Resource available)
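The lifecycle of the three states can be sketched in a few lines of plain Java. This is a simplified illustration under assumptions, not the implementation used by any particular library: the class name, the consecutive-failure threshold, the fixed open interval and the injectable clock are all hypothetical choices.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of the Closed / Open / Half-Open lifecycle.
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long openMillis;        // how long to stay open before a trial call
    private final LongSupplier clock;     // e.g. System::currentTimeMillis

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;      // allow a trial call to probe the resource
        }
        return state;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();        // fail fast, do not touch the resource
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;             // resource available again
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;           // resource unavailable
            openedAt = clock.getAsLong();
        }
    }
}
```

For brevity this sketch does not limit the number of concurrent trial calls in the half-open state.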
  • Implemented patterns •  Timeout •  Circuit breaker •  Load shedder
  • Supported patterns •  Bulkheads (a.k.a. Failure Units) •  Fail fast •  Fail silently •  Graceful degradation of service •  Failover •  Escalation •  Retry •  ...
  • Hello, world!
  • Hystrix command:
        public class HelloCommand extends HystrixCommand<String> {
            private static final String COMMAND_GROUP = "default";
            private final String name;

            public HelloCommand(String name) {
                super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
                this.name = name;
            }

            @Override
            protected String run() throws Exception {
                return "Hello, " + name;
            }
        }

        // Usage, e.g. inside a JUnit test class
        @Test
        public void shouldGreetWorld() {
            String result = new HelloCommand("World").execute();
            assertEquals("Hello, World", result);
        }
  • Source: https://github.com/Netflix/Hystrix/wiki/How-it-Works
  • Fallbacks
  • •  What will you do if a request fails? •  Consider failure handling from the very beginning •  Supplement with general failure handling strategies
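One way to answer "what will you do if a request fails?" is a chain of progressively less accurate sources that ends in a static default, which gives a graceful degradation of service. A minimal sketch; the `FallbackChain` class and the example sources (live call, cached value, default) are assumptions for illustration.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: try each source in order and degrade step by step.
final class FallbackChain {
    private FallbackChain() {}

    /** Returns the first value a source delivers without throwing, else the last resort. */
    static <T> T firstSuccessful(List<Supplier<T>> sources, T lastResort) {
        for (Supplier<T> source : sources) {
            try {
                return source.get();
            } catch (RuntimeException e) {
                // this source failed, fall through to the next, less accurate one
            }
        }
        return lastResort;
    }
}
```

A caller might pass a live service call first, a cache lookup second, and a neutral default as the last resort.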
  • Scalability
  • •  Define scaling strategy •  Think full stack •  Apply D-I-D rule •  Design for elasticity
  • … and many more •  Supervision patterns •  Recovery & mitigation patterns •  Anti-fragility patterns •  Supporting patterns •  A rich pattern family. Different approach than traditional enterprise software development
  • How do I integrate resilience into my software development process?
  • Steps to adopt resilient software design 1.  Create awareness: Go DevOps 2.  Create capability: Coach your developers 3.  Create sustainability: Inject errors
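Step 3, injecting errors, can start very small. The sketch below wraps any call so that it fails with a configurable probability; the `FaultInjector` class, its parameters and the exception it throws are assumptions for illustration and are not taken from the talk or from any specific tool.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical sketch: makes a wrapped call fail randomly to exercise failure handling.
final class FaultInjector {
    private final double failureProbability;  // 0.0 = never fail, 1.0 = always fail
    private final Random random;

    FaultInjector(double failureProbability, long seed) {
        if (failureProbability < 0.0 || failureProbability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0.0 and 1.0");
        }
        this.failureProbability = failureProbability;
        this.random = new Random(seed);        // seeded so that test runs are repeatable
    }

    /** Returns a supplier that throws instead of calling the action with the given probability. */
    <T> Supplier<T> wrap(Supplier<T> action) {
        return () -> {
            if (random.nextDouble() < failureProbability) {
                throw new IllegalStateException("injected failure");
            }
            return action.get();
        };
    }
}
```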
  • Related topics Reactive Anti-fragility Fault-tolerant software design Recovery-oriented computing
  • Wrap-up •  Resilience is about availability •  Crucial for today’s complex systems •  Not caring is a risk •  Go DevOps to create awareness
  • Do not avoid failures. Embrace them!