BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Going Resilient...
Building blocks of failure-tolerant systems
Anatole Tresch, Principal Consultant
Agenda
Going Resilient…2 19.10.16

Motivation

Complex Systems

Resilient Design

Resilience in Java

Summary
Bio
Anatole Tresch
• Principal Consultant
• Star Spec Lead JSR 354 Money & Currency
• Technical Architect, Lead Engineer
• PPMC Member Apache Tamaya
• Twitter/Google+: @atsticks
• anatole@apache.org
• anatole.tresch@trivadis.com
Motivation
“But it ain't how hard you hit; it's
about how hard you can get hit,
and keep moving forward. How
much you can take, and keep
moving forward. That's how
winning is done.”
Rocky Balboa
https://www.youtube.com/watch?v=vJHkTtvnUqA
Resilience/Resiliency is ...
1) the power or ability to return to the original form, position, etc., after being
bent, compressed, or stretched; elasticity.
2) ability to recover readily from illness, depression, adversity, or the like;
buoyancy.
Random House Kernerman Webster's College Dictionary, © 2010 K Dictionaries Ltd.
Copyright 2005, 1997, 1991 by Random House, Inc. All rights reserved.
http://www.thefreedictionary.com/resilience
Why should I care?
Well...
It's all about successful production!
[Diagram labels: Business, Software, Availability, Ops, Dev]
Availability = MTTF / (MTTF + MTTR)
Given MTTF = Mean Time To Failure, and
MTTR = Mean Time To Recovery
How can availability be maximized?
Availability = MTTF / (MTTF + MTTR)
Traditional Approach:
Maximize MTTF
This worked for a long time…
(a quick look at computing history...)
We started relatively primitive...
We used mechanics...
And then electricity...
We started to connect machines...
...until...
...today...
We started with one single computer...
Then we added more…
...and connected them...
Number of computers increased...
But sometimes computers break...
... sometimes many break at once...
... sometimes you don't even know...
And on top of that ...
...we virtualized machines...
...we added microservices, IoT, ...
...and connected all together...
Moving Targets !
...did I forget to mention...
THE NETWORK !
Complex Systems
Almost every system is a
distributed system.
complex
Chas Emerick
Complex systems...
«We can model and understand in isolation. But, when released into
competitive nominally regulated societies, their connections
proliferate, their interactions and interdependencies multiply, their
complexities mushroom. And we are caught short.»
Sidney Dekker
Do not try to avoid failures.
Embrace them!
...
Sounds easy ?
Unfortunately it's not...
Resilient Design
Availability = MTTF / (MTTF + MTTR)
Resilient Approach:
Minimize MTTR
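A small worked example may help here. The sketch below simply evaluates the availability formula; the class name and all numbers are made up for illustration. It shows that the formula depends only on the ratio of MTTR to MTTF, so halving the recovery time helps exactly as much as doubling the time between failures.

```java
// Illustrative sketch only: evaluates Availability = MTTF / (MTTF + MTTR).
// The class name and the sample numbers are assumptions, not from the slides.
final class AvailabilityExample {

    /** Fraction of time the system is up, given both times in the same unit. */
    static double availability(double mttf, double mttr) {
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        double baseline = availability(1000, 10); // fails every 1000 h, 10 h to recover
        double twiceMttf = availability(2000, 10); // failures half as frequent
        double halfMttr = availability(1000, 5);   // recovery twice as fast
        System.out.println("baseline:  " + baseline);
        System.out.println("2 x MTTF:  " + twiceMttf);
        System.out.println("MTTR / 2:  " + halfMttr);
    }
}
```

With these numbers the baseline is roughly 99.0%, and both variants reach roughly 99.5%. The resilient approach is a bet that, in a complex system, shortening recovery is the more realistic lever of the two.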
Let's start with some basics...
A command is a task is a ...
[Diagram: a Command with Input, Output, and Error]
Commands can be connected...
[Diagram: chained Commands; the Output of one Command is the Input of the next, and each Command has its own Error]
So where must "resilience" be added?
[Diagram: Input flowing through chained Commands]
• Isolation
• Decouple communications with events
• Flow Control
• Manage state
Asynchronous communications, the simple way...
[Diagram: two Commands, each with Input, Output, and Error, separated by a Bulkhead]
Now let's apply the model to distributed systems...
Event-Driven communications
[Diagram: two Components, each with Input, Output, and Error, separated by a Bulkhead and connected through Queues]
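The queue-based model can be sketched in plain Java. This is a minimal illustration, not taken from the talk: the class QueuedComponent, its method processOne, and the use of String events are all assumptions. It shows one component reading from a bounded input queue and publishing either a result or an error event, so a failure never propagates to the caller as an exception.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: a component decoupled from its neighbours by queues.
final class QueuedComponent {
    private final BlockingQueue<String> input;
    private final BlockingQueue<String> output;
    private final BlockingQueue<String> errors;
    private final Function<String, String> handler;

    QueuedComponent(BlockingQueue<String> input, BlockingQueue<String> output,
                    BlockingQueue<String> errors, Function<String, String> handler) {
        this.input = input;
        this.output = output;
        this.errors = errors;
        this.handler = handler;
    }

    /** Handles at most one event; returns false if none arrived within the timeout. */
    boolean processOne(long timeoutMillis) {
        final String event;
        try {
            event = input.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        if (event == null) {
            return false;
        }
        try {
            // offer() never blocks; if the output queue is full the result is dropped.
            // A real system would need an explicit overflow policy here.
            output.offer(handler.apply(event));
        } catch (RuntimeException e) {
            // The failure stays inside this component and is reported as an event.
            errors.offer(event + " -> " + e.getMessage());
        }
        return true;
    }
}
```

Passing bounded queues such as ArrayBlockingQueue keeps memory use predictable when the consumer is slower than the producer.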
Additional functionality needed...
[Diagram: two Components connected through a Queue, with an Error Queue and an Error Handler; added concerns: Latency Control, Monitoring, Supervision, Escalation, Location Transparency]
Latency Control
• Bounded Queues
• Fan out & quickest reply
• Circuit Breakers and Fail Fast
• Timeouts
• Throttling, Semaphores
• Failover
• Degradation of service level
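To make "Circuit Breakers and Fail Fast" concrete, here is a deliberately small sketch in plain Java. It is not the Hystrix or Vert.x implementation; the class SimpleCircuitBreaker and its parameters are hypothetical. After a number of consecutive failures the breaker opens and answers with the fallback immediately; once the reset timeout has passed it lets one trial call through.

```java
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker, for illustration only.
final class SimpleCircuitBreaker {
    private final int maxFailures;
    private final long resetTimeoutNanos;
    private int consecutiveFailures;
    private boolean open;
    private long openedAtNanos;

    SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis) {
        this.maxFailures = maxFailures;
        this.resetTimeoutNanos = resetTimeoutMillis * 1_000_000L;
    }

    /** Runs the action, or returns the fallback right away while the circuit is open. */
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (open && System.nanoTime() - openedAtNanos < resetTimeoutNanos) {
            return fallback.get(); // fail fast: the failing resource is not called
        }
        try {
            T result = action.get(); // circuit closed, or a trial call after the timeout
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= maxFailures) {
                open = true; // (re)open and restart the reset timer
                openedAtNanos = System.nanoTime();
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

Synchronizing the whole call keeps the sketch short but serializes all callers; the libraries shown later avoid that and add timeouts and metrics on top.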
Managing shared state: Quorums
• Ensure a decision can be taken at any time
• Odd number of voters, num >= 3 (an even number can end in a tie)
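The arithmetic behind majority quorums is short enough to sketch. The class and method names below are made up for illustration; the code computes how many votes form a strict majority and how many voter failures a group of a given size can tolerate.

```java
// Illustrative sketch of majority-quorum arithmetic; names are hypothetical.
final class Quorum {

    /** Smallest number of votes that is a strict majority of all voters. */
    static int majority(int voters) {
        return voters / 2 + 1;
    }

    /** Number of voters that may fail while a majority can still be formed. */
    static int toleratedFailures(int voters) {
        return voters - majority(voters);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " voters: majority " + majority(n)
                    + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

Three voters need 2 votes and tolerate 1 failure; four voters need 3 votes and still tolerate only 1 failure; five voters need 3 votes and tolerate 2. Adding a fourth voter to a group of three therefore buys no extra fault tolerance.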
Kernel Based Architecture
Structure systems like onions in layers:
• State & failure management in layers
• "Kernel" holds and protects the critical state
• Kernel is always engaged through layers of protection
Rounding up...
Resilient Systems in IT require
• Asynchronous Communications
• Idempotent, self-contained events
• Location Transparency
• Isolation & Recursive Restartability
• Complete, unified Input and Output Validation
• Common Error Handling and Monitoring
• Supervision
• Minimal shared state, Redundancy
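Idempotent events deserve a small illustration, because asynchronous delivery usually means an event can arrive more than once. The sketch below is an assumption-laden example (the class IdempotentAccount, the event id scheme, and the deposit scenario are all invented): the handler remembers which event ids it has already applied, so a redelivered event has no further effect.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of an idempotent event handler.
final class IdempotentAccount {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final AtomicLong balance = new AtomicLong();

    /** Applies a deposit event at most once; returns false for a duplicate delivery. */
    boolean onDeposit(String eventId, long amount) {
        if (!processedEventIds.add(eventId)) {
            return false; // this event id was already applied, so ignore the redelivery
        }
        balance.addAndGet(amount);
        return true;
    }

    long balance() {
        return balance.get();
    }
}
```

The event carries everything the handler needs (an id and an amount), which is what makes it self-contained. In a real system the set of processed ids would have to be persisted and pruned.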
We require decoupling
in time and space!
Resilience in Java
Resilience is a design task.
Later improvements mostly focus
on latency control.
Example: Let's call a service...
@Inject
private AService service;

public void myMethod(String input){
   try{
     // Pass the input to the service (the slide passed the service itself)
     String result = service.call(input);
     // do something with the result
   }catch(Exception e){
     throw new IllegalStateException("Server error", e);
   }
}
Simple Example
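// 'executor' is assumed to be an ExecutorService field; 'service' as in the simple example
public void myMethod(Input input){
   // Expression lambda: treated as a Callable<String>, so the Future carries the result
   Future<String> resultFuture =
      executor.submit(() -> service.call(input));
   try{
      // Wait at most 4 seconds for the answer
      String result = resultFuture.get(4000L,
         TimeUnit.MILLISECONDS);
      // do something with the result
   }catch(Exception e){
      resultFuture.cancel(true); // do not leave the call running after a timeout
      throw new IllegalStateException("Server error", e);
   }
}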
Executor Example, using a timeout
https://github.com/Netflix/Hystrix
Using Hystrix "standalone"...
Hystrix – Basic Use

• It wraps your code
• Adds resilience features
• Synchronous, asynchronous, and reactive use
Hystrix – Fallback
Hystrix – Fallback, cascading
Hystrix – Timeout
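import java.util.concurrent.Callable;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class TimeoutCommand extends HystrixCommand<String> {
   // COMMAND_GROUP is not shown on the slide; assumed to be a String constant defined elsewhere
   private final Callable<String> task;

   public TimeoutCommand(int millis, Callable<String> task) {
      super(Setter.withGroupKey(
               HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP))
            .andCommandPropertiesDefaults(
               HystrixCommandProperties.Setter()
                  .withExecutionIsolationThreadTimeoutInMilliseconds(
                     millis))); // the slide used an undefined 'timeout' here
      this.task = task;
   }

   @Override
   protected String run() throws Exception {
      return task.call();
   }

   // Returned when run() fails or exceeds the timeout
   @Override
   protected String getFallback() {
      return "Resource timed out";
   }
}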
Hystrix – Dashboard
Using Hystrix with Spring Boot…
http://cloud.spring.io/spring-cloud-netflix
Spring Boot with Hystrix – CircuitBreaker
@SpringBootApplication
@EnableCircuitBreaker
public class Application {
   public static void main(String[] args) {
      new SpringApplicationBuilder(Application.class)
         .web(true).run(args);
   }
}

@Component
public class StoreIntegration {
   @HystrixCommand(fallbackMethod = "defaultStores")
   public Object getStores(Map<String, Object> parameters) {
      //do stuff that might fail
   }

   // Fallback; the slide cuts this signature off, it is assumed to mirror getStores
   public Object defaultStores(Map<String, Object> parameters) {
      return "something useful";
   }
}
Using Akka…
http://akka.io/
Akka – A simple Actor
import akka.actor.ActorRef;
import akka.actor.Props;
import akka.actor.UntypedActor;

public class HelloWorld extends UntypedActor {
   @Override
   public void preStart() {
      // create the greeter child actor and ask it to greet
      final ActorRef greeter = getContext().actorOf(
         Props.create(Greeter.class), "greeter");
      greeter.tell(Greeter.Msg.GREET, getSelf());
   }

   @Override
   public void onReceive(Object msg) {
      if (msg == Greeter.Msg.DONE) {
         // when the greeter is done, stop this actor and with it the application
         getContext().stop(getSelf());
      } else {
         unhandled(msg);
      }
   }
}
Akka – Supervision
// The slide cuts off the first statement; childProps is assumed to be built like this
Props childProps = Props.create(Greeter.class);
Props supervisorProps = BackoffSupervisor.props(
   Backoff.onStop(
      childProps,
      "myEcho",
      Duration.create(3, TimeUnit.SECONDS),   // minimum backoff
      Duration.create(30, TimeUnit.SECONDS),  // maximum backoff
      0.2)); // adds 20% "noise" to
             // vary the intervals slightly
system.actorOf(
   supervisorProps, "echoSupervisor");
Akka – Fault Tolerance
// Allow at most 10 restarts per minute; decide per exception type
private static SupervisorStrategy strategy =
   new OneForOneStrategy(10, Duration.create("1 minute"), DeciderBuilder.
      match(ArithmeticException.class, e -> resume()).
      match(NullPointerException.class, e -> restart()).
      match(IllegalArgumentException.class, e -> stop()).
      matchAny(o -> escalate()).build());

@Override
public SupervisorStrategy supervisorStrategy() {
   return strategy;
}
Vert.x
http://vertx.io/
Vert.x – Circuit Breaker
CircuitBreaker breaker = CircuitBreaker.create("my-circuit-breaker", vertx,
   new CircuitBreakerOptions().setMaxFailures(5).setTimeout(2000));

breaker.<String>execute(future -> {
   vertx.createHttpClient().getNow(8080, "localhost", "/", response -> {
      if (response.statusCode() != 200) {
         future.fail("HTTP error");
      } else {
         response
            .exceptionHandler(future::fail)
            .bodyHandler(buffer -> {
               future.complete(buffer.toString());
            });
      }
   });
}).setHandler(ar -> {
   // Do something with the result
});
Summary
Resilient Software Design
• ...is a must!
• ...is achievable
• ...is well supported by frameworks such as Hystrix and Akka
• The patterns used are ubiquitous across all kinds of distributed systems
• ...fits naturally with microservices
THANK YOU !
• Hystrix Wiki: https://github.com/Netflix/Hystrix/wiki
• Jonas Bonér: Resilience is by design, http://virtualjug.com/resilience-is-by-design/
• R. Cook, J. Rasmussen: "Going solid": a model of system dynamics and consequences for patient safety
• Reinette Biggs et al.: Applying Resilience Thinking
• Michael Mehaffy, Nikos A. Salingaros: Toward Resilient Architectures 1: Biology Lessons
• Richard Cook: How complex systems fail
• http://www.mindsetonline.com/
• George Candea, Armando Fox: Crash-Only Software
• George Candea, Armando Fox: Turning the Crash-Only Pattern from a Sledgehammer to a Scalpel
• Michael T. Nygard: Release It!, Pragmatic Bookshelf, 2007
• Robert S. Hanmer: Patterns for Fault Tolerant Software, Wiley, 2007
• Andrew Tanenbaum, Maarten van Steen: Distributed Systems – Principles and Paradigms, Prentice Hall, 2nd Edition, 2006
• Uwe Friedrichsen, Slideshare: http://de.slideshare.net/ufried
Anatole Tresch
Principal Consultant
Tel. +41 58 459 53 93
anatole.tresch@trivadis.com