Going Resilient...

Building blocks of failure-tolerant systems
Anatole Tresch, Principal Consultant

Agenda
Going Resilient… 19.10.16

Motivation

Complex Systems

Resilient Design

Resilience in Java

Summary

Bio
Anatole Tresch
• Principal Consultant
• Star Spec Lead JSR 354 Money & Currency
• Technical Architect, Lead Engineer
• PPMC Member Apache Tamaya
• Twitter/Google+: @atsticks
• anatole@apache.org
• anatole.tresch@trivadis.com

Motivation

“But it ain't how hard you hit; it's
about how hard you can get hit,
and keep moving forward. How
much you can take, and keep
moving forward. That's how
winning is done.”
Rocky Balboa
https://www.youtube.com/watch?v=vJHkTtvnUqA
Resilience/Resiliency is ...
1) the power or ability to return to the original form, position, etc., after being
bent, compressed, or stretched; elasticity.
2) ability to recover readily from illness, depression, adversity, or the like;
buoyancy.
Random House Kernerman Webster's College Dictionary, © 2010 K Dictionaries Ltd.
Copyright 2005, 1997, 1991 by Random House, Inc. All rights reserved.
http://www.thefreedictionary.com/resilience

Why should I care ?

Well...

It's all about successful production!

Business
Software
Availability
Ops / Dev

Availability = MTTF / (MTTF + MTTR)
where MTTF = Mean Time To Failure and
MTTR = Mean Time To Recovery

How can availability be maximized?

Availability = MTTF / (MTTF + MTTR)
Traditional approach:
Maximize MTTF

This worked for a long time…
(a quick look at computing history...)

We started relatively primitive...

We used mechanics...

And then electricity...

We started to connect machines...

...until...

...today...

We started with one single computer...

Then we added more…
...and connected them...

Number of computers increased...

But sometimes computers break...

... sometimes they break in large numbers...

... sometimes you don't even know...

And on top of that ...

...we virtualized machines...

...we added microservices, IoT, ...

...and connected all together...

Moving Targets !

...did I forget to mention...

THE NETWORK !

Complex Systems

Almost every system is a
distributed (complex) system.
Chas Emerick

Complex systems...
«We can model and understand in isolation. But, when released into
competitive nominally regulated societies, their connections
proliferate, their interactions and interdependencies multiply, their
complexities mushroom. And we are caught short.»
Sidney Dekker

Do not try to avoid failures.
Embrace them!

...

Sounds easy ?

Unfortunately it's not...

Resilient Design

Availability = MTTF / (MTTF + MTTR)
Resilient approach:
Minimize MTTR
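
To see why shortening recovery matters, the formula can be evaluated with a few numbers. The sketch below is illustrative only: the class name `Availability` and the figures (1000 h MTTF, 10 h MTTR) are assumptions, not measurements from the talk.

```java
/** Illustrative availability calculator; names and numbers are assumptions. */
final class Availability {

    /** Availability = MTTF / (MTTF + MTTR); both values in the same time unit. */
    static double of(double mttf, double mttr) {
        if (mttf <= 0 || mttr < 0) {
            throw new IllegalArgumentException("MTTF must be > 0 and MTTR >= 0");
        }
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        double baseline = of(1000, 10);     // hypothetical system: fails every 1000 h, 10 h to recover
        double doubledMttf = of(2000, 10);  // traditional approach: fail half as often
        double halvedMttr = of(1000, 5);    // resilient approach: recover twice as fast

        System.out.println("baseline     " + baseline);
        System.out.println("MTTF doubled " + doubledMttf);
        System.out.println("MTTR halved  " + halvedMttr);
    }
}
```

Halving MTTR gives the same availability as doubling MTTF, because 1000 / 1005 equals 2000 / 2010. The two levers are mathematically equivalent; the resilient approach simply picks the one that stays reachable once failures can no longer be avoided.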

Let's start with some basics...

A command is a task is a ...
[Diagram: a Command takes an Input and produces an Output or an Error]

Commands can be connected...
[Diagram: two Commands chained; the Output of the first is the Input of the second, and each Command has its own Error channel]
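
The Command model can be sketched in a few lines of Java. `Command`, `Result`, and `pipe` are hypothetical names chosen for illustration; the slides only show the diagram.

```java
/** Outcome of a Command: either an output value or an error, never both. */
final class Result<T> {
    private final T value;
    private final Exception error;

    private Result(T value, Exception error) {
        this.value = value;
        this.error = error;
    }

    static <T> Result<T> ok(T value) { return new Result<>(value, null); }
    static <T> Result<T> failed(Exception error) { return new Result<>(null, error); }

    boolean isOk() { return error == null; }
    T value() { return value; }
    Exception error() { return error; }
}

/** A unit of work with one input, one output, and an error channel. */
interface Command<I, O> {
    O run(I input) throws Exception;

    /** Runs the command and captures any exception on the error channel. */
    default Result<O> execute(I input) {
        try {
            return Result.ok(run(input));
        } catch (Exception e) {
            return Result.failed(e);
        }
    }

    /** Connects two commands: this command's output becomes the next one's input. */
    default <R> Command<I, R> pipe(Command<O, R> next) {
        return input -> next.run(run(input));
    }
}

final class CommandDemo {
    public static void main(String[] args) {
        Command<String, Integer> parse = Integer::parseInt;
        Command<Integer, Integer> twice = n -> n * 2;
        Command<String, Integer> chain = parse.pipe(twice);

        System.out.println(chain.execute("21").value());   // prints 42
        System.out.println(chain.execute("oops").isOk());  // prints false
    }
}
```

An error in any step of the chain ends up on the single error channel of the combined command, which is the place where resilience measures can later be attached.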

So where must "resilience" be added?
[Diagram: a chain of Commands with an Input]
• Isolation
• Decouple communications with events
• Flow Control
• Manage state

Asynchronous communications, the simple way...
[Diagram: two Commands, each with Input, Output, and Error, separated by a Bulkhead]
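
A bulkhead can be approximated in plain Java by giving each downstream dependency its own small thread pool with a bounded queue, so that a slow dependency exhausts only its own resources. The class name `Bulkhead` and the sizes used here are assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** One isolated compartment: a fixed number of workers plus a bounded waiting queue. */
final class Bulkhead {
    private final ThreadPoolExecutor pool;

    Bulkhead(int threads, int queueCapacity) {
        pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy()); // reject instead of queueing without bound
    }

    /** Runs the task asynchronously; throws RejectedExecutionException when the compartment is full. */
    <T> Future<T> submit(Callable<T> task) {
        return pool.submit(task);
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        Bulkhead bulkhead = new Bulkhead(1, 1);          // one worker, one waiting slot
        CountDownLatch release = new CountDownLatch(1);
        Callable<String> slowCall = () -> { release.await(); return "done"; };
        try {
            bulkhead.submit(slowCall);   // occupies the worker
            bulkhead.submit(slowCall);   // waits in the queue
            bulkhead.submit(slowCall);   // compartment full: rejected
        } catch (RejectedExecutionException e) {
            System.out.println("rejected, the caller can fail fast or degrade");
        } finally {
            release.countDown();
            bulkhead.shutdown();
        }
    }
}
```

The caller gets its answer asynchronously through the `Future`, and a saturated compartment rejects work immediately instead of letting it pile up.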

Now let's apply the model to distributed systems...

Event-Driven communications
[Diagram: two Components, each with Input, Output, and Error, separated by a Bulkhead and connected through Queues]

Additional functionality needed...
[Diagram: two Components connected through Queues, with Input, Output, and Error channels, surrounded by the supporting functions below]
• Queue and Error Queue
• Error Handler
• Latency Control
• Monitoring
• Supervision
• Escalation
• Location Transparency

Latency Control
• Bounded Queues
• Fan out & quickest reply
• Circuit Breakers and Fail Fast
• Timeouts
• Throttling, Semaphores
• Failover
• Degradation of service level
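
Of the techniques listed, the circuit breaker is the least obvious to build by hand. The following is a minimal sketch, assuming a simple failure counter and an injectable clock; `SimpleCircuitBreaker` and all of its members are hypothetical names, and production libraries handle concurrency and half-open probing more carefully.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Minimal circuit breaker: CLOSED, then OPEN after too many failures, then HALF_OPEN after a pause. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;
    private final long resetTimeoutMillis;
    private final LongSupplier clock;   // injectable time source, e.g. System::currentTimeMillis
    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis, LongSupplier clock) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clock = clock;
    }

    /** Current state; an open circuit becomes half-open once the reset timeout has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= resetTimeoutMillis) {
            state = State.HALF_OPEN;    // let a trial call through
        }
        return state;
    }

    /** Calls the task, or returns the fallback immediately (fail fast) while the circuit is open. */
    <T> T call(Callable<T> task, T fallback) {
        if (state() == State.OPEN) {
            return fallback;
        }
        try {
            T result = task.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            return fallback;
        }
    }

    private synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failures++;
        // A failed trial call, or too many failures in a row, opens the circuit
        if (state == State.HALF_OPEN || failures >= maxFailures) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

While the circuit is open the protected resource is not called at all, which bounds latency for the caller and gives the failing component time to recover.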

Managing shared state: Quorums

Ensure a decision can be taken at any time

Odd number of voters, num >= 3
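
The voter count matters because a decision needs a strict majority. A small sketch (class and method names are illustrative) computes the majority size and the number of failed voters a group can tolerate:

```java
/** Majority-quorum arithmetic; names are illustrative. */
final class Quorum {

    /** Smallest strict majority of the given number of voters. */
    static int majority(int voters) {
        if (voters < 1) {
            throw new IllegalArgumentException("need at least one voter");
        }
        return voters / 2 + 1;
    }

    /** Number of voters that may fail while a majority can still be formed. */
    static int toleratedFailures(int voters) {
        return voters - majority(voters);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println(n + " voters: majority " + majority(n)
                    + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

With 3 voters one failure is tolerated. A fourth voter raises the majority to 3 but still tolerates only one failure, and it introduces the possibility of a 2:2 tie, which is why odd group sizes are the usual choice.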

Kernel Based Architecture
Structure systems like onions in layers:
• State & failure management in layers
• "Kernel" holds and protects the critical state
• Kernel is always engaged through layers of protection

Rounding up...
Resilient Systems in IT require

Asynchronous Communications

Idempotent, self-contained events

Location Transparency

Isolation & Recursive Restartability

Complete, unified Input and Output Validation

Common Error Handling and Monitoring

Supervision

Minimal shared state, Redundancy
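
Idempotent events are what make redelivery after a failure safe. One way to get there, sketched below under the assumption that every event carries a unique id, is to remember which ids were already applied. `Event` and `IdempotentHandler` are hypothetical names, and the in-memory set stands in for whatever durable store a real system would use.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** A self-contained event: it carries its own identity and all data needed to process it. */
final class Event {
    final String id;
    final String payload;

    Event(String id, String payload) {
        this.id = id;
        this.payload = payload;
    }
}

/** Wraps a handler so that a redelivered event (same id) is applied only once. */
final class IdempotentHandler implements Consumer<Event> {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    private final Consumer<Event> delegate;

    IdempotentHandler(Consumer<Event> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void accept(Event event) {
        // add() returns false when the id was seen before, i.e. a duplicate delivery
        if (!processed.add(event.id)) {
            return;
        }
        try {
            delegate.accept(event);
        } catch (RuntimeException e) {
            processed.remove(event.id);  // not applied, so allow a later retry
            throw e;
        }
    }
}
```

With such a handler the sender may simply resend after a timeout or a restart, without having to know whether the first delivery arrived.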

We require decoupling
in time and space !

Resilience in Java

Resilience is a design task.
Later improvements mostly focus
on latency control.

Example: Let's call a service...

@Inject
private AService service;

public void myMethod(String input) {
    try {
        // Synchronous call: blocks the caller until the service answers
        String result = service.call(input);
        // do something with the result
    } catch (Exception e) {
        throw new IllegalStateException("Server error", e);
    }
}
Simple Example

public void myMethod(Input input) {
    // Expression lambda, so it is a Callable<String> and the Future carries the result
    Future<String> resultFuture = executor.submit(() -> service.call(input));
    try {
        // Wait at most 4 seconds for the reply
        String result = resultFuture.get(4000L, TimeUnit.MILLISECONDS);
        // do something with the result
    } catch (Exception e) {
        throw new IllegalStateException("Server error", e);
    }
}
Executor Example, using a timeout
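
A timeout alone leaves the slow task running in the background. The sketch below extends the idea with cancellation and a fallback value; `TimedCall` and `callWithTimeout` are hypothetical names, and the daemon thread pool is an assumption made so the example is self-contained.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {
    // Daemon threads, so a stuck call never keeps the JVM alive
    private static final ExecutorService EXECUTOR = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-call");
        t.setDaemon(true);
        return t;
    });

    /** Returns the task's result, or the fallback if the task fails or exceeds the timeout. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = EXECUTOR.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the worker so it does not linger
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the task itself failed
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();  // preserve the caller's interrupt status
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(() -> "fast reply", 1000L, "fallback"));
        System.out.println(callWithTimeout(() -> {
            Thread.sleep(5000L);                 // simulated slow service
            return "late reply";
        }, 100L, "fallback"));
    }
}
```

Returning a fallback instead of throwing is one form of degraded service; whether that is acceptable depends on the caller.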

https://github.com/Netflix/Hystrix

Using Hystrix "standalone"...

Hystrix – Basic Use

• It wraps your code
• Adds resilient features
• Synchronous, asynchronous, and reactive use

Hystrix – Fallback

Hystrix – Fallback, cascading

Hystrix – Timeout
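import java.util.concurrent.Callable;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class TimeoutCommand extends HystrixCommand<String> {
    // Assumption: the slide does not show COMMAND_GROUP, so a placeholder value is used
    private static final String COMMAND_GROUP = "default";
    private final Callable<String> task;

    public TimeoutCommand(int millis, Callable<String> task) {
        super(Setter.withGroupKey(
                HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP))
            .andCommandPropertiesDefaults(
                HystrixCommandProperties.Setter()
                    .withExecutionIsolationThreadTimeoutInMilliseconds(millis)));
        this.task = task;
    }

    @Override
    protected String run() throws Exception {
        return task.call();
    }

    // Returned when run() fails or exceeds the timeout
    @Override
    protected String getFallback() {
        return "Resource timed out";
    }
}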

Hystrix – Dashboard

Using Hystrix with Spring Boot…
http://cloud.spring.io/spring-cloud-netflix

Spring Boot with Hystrix – CircuitBreaker
@SpringBootApplication
@EnableCircuitBreaker
public class Application {
    public static void main(String[] args) {
        new SpringApplicationBuilder(Application.class)
            .web(true).run(args);
    }
}

@Component
public class StoreIntegration {
    @HystrixCommand(fallbackMethod = "defaultStores")
    public Object getStores(Map<String, Object> parameters) {
        //do stuff that might fail
    }

    // The fallback takes the same parameters as the method it protects
    public Object defaultStores(Map<String, Object> parameters) {
        return "something useful";
    }
}

Using Akka…
http://akka.io/

Akka – A simple Actor
import akka.actor.ActorRef;
import akka.actor.Props;
import akka.actor.UntypedActor;

public class HelloWorld extends UntypedActor {

    @Override
    public void preStart() {
        // create the greeter actor as a child and ask it to greet
        final ActorRef greeter = getContext().actorOf(
            Props.create(Greeter.class), "greeter");
        greeter.tell(Greeter.Msg.GREET, getSelf());
    }

    @Override
    public void onReceive(Object msg) {
        if (msg == Greeter.Msg.DONE) {
            // when the greeter is done, stop this actor and with it the application
            getContext().stop(getSelf());
        } else {
            unhandled(msg);
        }
    }
}

Akka – Supervision
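// Assumption: the slide shows only a fragment of the statement that defines
// childProps; the supervised child is taken to be the Greeter actor
final Props childProps = Props.create(Greeter.class);

final Props supervisorProps = BackoffSupervisor.props(
    Backoff.onStop(
        childProps,
        "myEcho",                               // name of the child actor
        Duration.create(3, TimeUnit.SECONDS),   // minimum backoff
        Duration.create(30, TimeUnit.SECONDS),  // maximum backoff
        0.2)); // adds 20% "noise" to vary the intervals slightly

system.actorOf(supervisorProps, "echoSupervisor");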
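
The restart delays that result from such a backoff configuration can be approximated with a small calculation. The sketch assumes a delay that doubles on every restart, is capped at the maximum, and is stretched by a random factor. It illustrates the idea and is not Akka's implementation; `BackoffDelay` and its method are hypothetical names.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Exponential backoff with a cap and random noise; an illustration, not Akka code. */
final class BackoffDelay {

    /**
     * Delay in milliseconds before restart number restartCount (0-based).
     * noise is a random value in [0, 1) supplied by the caller.
     */
    static long millis(int restartCount, long minMillis, long maxMillis,
                       double randomFactor, double noise) {
        // Double the delay on every restart, but never exceed the maximum
        double base = Math.min((double) maxMillis, minMillis * Math.pow(2, restartCount));
        // Stretch by up to randomFactor so that many restarting actors do not retry in lockstep
        return Math.round(base * (1.0 + noise * randomFactor));
    }

    public static void main(String[] args) {
        for (int restart = 0; restart < 6; restart++) {
            double noise = ThreadLocalRandom.current().nextDouble();
            System.out.println("restart " + restart + ": wait "
                    + millis(restart, 3_000L, 30_000L, 0.2, noise) + " ms");
        }
    }
}
```

Without noise, the values 3 s and 30 s from the slide give waits of 3, 6, 12, 24, 30, 30 seconds: quick recovery from a short glitch, and no hammering of a resource that stays down.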

Akka – Fault Tolerance
private static SupervisorStrategy strategy =
    new OneForOneStrategy(10, Duration.create("1 minute"), DeciderBuilder.
        match(ArithmeticException.class, e -> resume()).
        match(NullPointerException.class, e -> restart()).
        match(IllegalArgumentException.class, e -> stop()).
        matchAny(o -> escalate()).build());

@Override
public SupervisorStrategy supervisorStrategy() {
    return strategy;
}

Vert.x
http://vertx.io/

Vert.x – Circuit Breaker
CircuitBreaker breaker = CircuitBreaker.create("my-circuit-breaker", vertx,
    new CircuitBreakerOptions().setMaxFailures(5).setTimeout(2000));

breaker.<String>execute(future -> {
    vertx.createHttpClient().getNow(8080, "localhost", "/", response -> {
        if (response.statusCode() != 200) {
            future.fail("HTTP error");
        } else {
            response
                .exceptionHandler(future::fail)
                .bodyHandler(buffer -> {
                    future.complete(buffer.toString());
                });
        }
    });
}).setHandler(ar -> {
    // Do something with the result
});

Summary

Summary
Resilient Software Design
• ...is a must!
• ...is achievable
• ...is well supported by frameworks such as Hystrix and Akka
• The patterns used are ubiquitous across all kinds of distributed systems
• ...fits naturally with microservices

THANK YOU !

Going Resilient...
• Hystrix Wiki, https://github.com/Netflix/Hystrix/wiki
• Jonas Bonér: Resilience is by design, http://virtualjug.com/resilience-is-by-design/
• R. Cook, J. Rasmussen: "Going solid": a model of system dynamics and consequences for patient safety
• Reinette Biggs et al.: Applying Resilient Thinking
• Michael Mehaffy, Nikos A. Salingaros: Toward Resilient Architectures 1: Biology Lessons
• Richard Cook: How complex systems fail
• http://www.mindsetonline.com/
• George Candea, Armando Fox: Crash-Only Software
• George Candea, Armando Fox: Turning the Crash-Only Pattern from a Sledgehammer to a Scalpel
• Michael T. Nygard: Release It!, Pragmatic Bookshelf, 2007
• Robert S. Hanmer: Patterns for Fault Tolerant Software, Wiley, 2007
• Andrew Tanenbaum, Maarten van Steen: Distributed Systems – Principles and Paradigms, Prentice Hall, 2nd Edition, 2006
• Uwe Friedrichsen, Slideshare: http://de.slideshare.net/ufried

Going Resilient...
Anatole Tresch
Principal Consultant
Tel. +41 58 459 53 93
anatole.tresch@trivadis.com