Fault tolerant microservices - LJC Skills Matter 4thNov2014

Fault tolerant
microservices
BSkyB
@chbatey

Who is this guy?
● Enthusiastic nerd
● Senior software engineer at BSkyB
● Builds a lot of distributed applications
● Apache Cassandra MVP

Agenda
1. Setting the scene
○ What do we mean by a fault?
○ What is a microservice?
○ Monolith application vs the micro(ish) service
2. A worked example
○ Identify an issue
○ Reproduce/test it
○ Show how to deal with the issue

So… what do applications look like?

So... what do systems look like now?

But different things go wrong...
● down
● slow network
● slow app
● 2 second max
● GC :(
● missing packets

Fault tolerance
1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off

Time for an example...
● All examples are on github
● Technologies used:
○ Dropwizard
○ Spring Boot
○ Wiremock
○ Hystrix
○ Graphite
○ Saboteur

Example: Movie player service
[Diagram: Play Movie → Shiny App (several instances) → User Service, Device Service, Pin Service]

Testing microservices
You don’t know a service is fault tolerant if you don’t test faults

Isolated service tests
[Diagram: the acceptance test primes mocks of the User, Device and Pin services, then sends Play Movie to the Shiny App]

1 - Don’t take forever
● If at first you don’t succeed, don’t take forever to tell someone
● Timeout and fail fast

Which timeouts?
● Socket connection timeout
● Socket read timeout

Your service hung for 30 seconds :(
Customer
You :(

Which timeouts?
● Socket connection timeout
● Socket read timeout
● Resource acquisition
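
The two socket timeouts in this list can be sketched with nothing but the JDK. This is an illustration rather than code from the talk: the class name SocketTimeouts and the silent local server are assumptions made so the example runs on its own.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SocketTimeouts {

    /** Returns true if the read gave up after readTimeoutMs instead of blocking forever. */
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // A local server that completes the TCP handshake but never sends a byte.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Socket connection timeout: upper bound on establishing the connection.
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()),
                    connectTimeoutMs);
            // Socket read timeout: upper bound on each individual blocking read.
            client.setSoTimeout(readTimeoutMs);
            try {
                client.getInputStream().read();
                return false; // the server answered or closed the connection
            } catch (SocketTimeoutException e) {
                return true; // no data arrived within readTimeoutMs
            }
        }
    }
}
```

Note that the read timeout restarts on every read, so a dependency that trickles out one byte at a time can keep a request alive far longer than the configured value.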

Your service hung for 10 minutes :(

Let’s think about this

A little more detail

Wiremock + Saboteur + Vagrant
● Vagrant - launches + provisions local VMs
● Saboteur - uses tc, iptables to simulate network issues
● Wiremock - used to mock HTTP dependencies
● Cucumber - acceptance tests

I can write an automated test for that?
[Diagram: inside a Vagrant + VirtualBox VM, Wiremock stands in for the User, Device and Pin services, with Saboteur alongside; the acceptance test primes Saboteur to drop traffic, calls the Play Movie Service, then resets]

Implementing reliable timeouts
● Homemade: Worker Queue + Thread pool (executor)
● Hystrix
● Spring Cloud Netflix
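
The homemade option could look roughly like the sketch below: the call runs on a bounded thread pool and the caller waits on the Future for at most the time the SLA allows. The class name TimeLimitedCaller and the pool and queue sizes are illustrative assumptions, not taken from the talk.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeLimitedCaller {
    // Bounded pool and queue (sizes are arbitrary); submit() throws
    // RejectedExecutionException once both are full.
    private final ExecutorService pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(10));

    /** Runs the dependency call on a worker thread and waits at most timeoutMs for it. */
    <T> T call(Callable<T> dependency, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(dependency);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it stops occupying a thread
            throw e;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Unlike a socket read timeout, this bounds the whole call from the caller's point of view, whatever the dependency is doing.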

A simple Spring RestController
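import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class Resource {
    private static final Logger LOGGER = LoggerFactory.getLogger(Resource.class);
    @Autowired
    private ScaryDependency scaryDependency;

    // Delegates straight to the dependency on whichever thread served the request
    @RequestMapping("/scary")
    public String callTheScaryDependency() {
        LOGGER.info("RestController: I wonder which thread I am on!");
        return scaryDependency.getScaryString();
    }
}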

Scary dependency
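import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {
    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    public String getScaryString() {
        LOGGER.info("Scary dependency: I wonder which thread I am on!");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            try {
                Thread.sleep(10000); // simulate a dependency that hangs for 10 seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
            }
            return "Really slow scary string";
        }
    }
}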

All on the tomcat thread
13:07:32.814 [http-nio-8080-exec-1] INFO info.batey.examples.Resource - RestController: I wonder which thread I am on!
... examples.ScaryDependency - Scary dependency: I wonder which thread I am on!

Seriously this simple now?
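import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {
    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    // The annotation moves the call onto a Hystrix thread pool
    @HystrixCommand
    public String getScaryString() {
        LOGGER.info("Scary dependency: I wonder which thread I am on!");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            try {
                Thread.sleep(10000); // simulate a dependency that hangs for 10 seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
            }
            return "Really slow scary string";
        }
    }
}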

What an annotation can do...
... examples.Resource - RestController: I wonder which thread I am on!
13:07:32.896 [hystrix-ScaryDependency-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on!

Timeouts take home
● You can’t use network level timeouts for SLAs
● Test your SLAs - if someone says you can’t, hit them with a stick
● Scary things happen without network issues

2 - Don’t try if you can’t succeed

Complexity
● When an application grows in complexity it
will eventually start sending emails

Complexity
● When an application grows in complexity it will eventually contain queues and thread pools

Don’t try if you can’t succeed
● Executors: unbounded queues :(
○ newFixedThreadPool
○ newSingleThreadExecutor
○ newCachedThreadPool (unbounded threads rather than an unbounded queue)
● Bound your queues and threads
● Fail quickly when the queue / maxPoolSize is met
● Know your drivers
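
To make the difference concrete, the sketch below compares the queue behind Executors.newFixedThreadPool with an explicitly bounded ThreadPoolExecutor. The class and method names (BoundedPools, bounded, queueHeadroom) are hypothetical helpers invented for this illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPools {

    /** Fixed number of threads, fixed queue length, and rejection for anything beyond that. */
    static ThreadPoolExecutor bounded(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** How many more tasks the executor's queue accepts before it is full. */
    static int queueHeadroom(ThreadPoolExecutor executor) {
        return executor.getQueue().remainingCapacity();
    }

    /** Headroom of the queue that Executors.newFixedThreadPool creates. */
    static int fixedPoolHeadroom() {
        ThreadPoolExecutor fixed = (ThreadPoolExecutor) Executors.newFixedThreadPool(1);
        try {
            return queueHeadroom(fixed);
        } finally {
            fixed.shutdown();
        }
    }
}
```

The fixed pool reports a headroom of Integer.MAX_VALUE, which means work piles up in memory long before anything is rejected; the bounded pool starts rejecting as soon as its queue is full.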

This is a functional requirement
● Set the timeout very high
● Use Wiremock to add a large delay to the requests
● Set queue size and thread pool size to 1
● Send in 2 requests to use the thread and fill the queue
● What happens on the 3rd request?
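
The same scenario can be reproduced in miniature against a plain JDK executor. In this sketch a CountDownLatch stands in for the delayed Wiremock response, and the class name SaturationScenario is an assumption for the example; a real test would drive the service over HTTP as the steps describe.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class SaturationScenario {

    /** Returns true if the third request is rejected straight away. */
    static boolean thirdRequestIsRejected() {
        // Thread pool size 1 and queue size 1, as in the steps above.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch release = new CountDownLatch(1); // stands in for the slow dependency
        Runnable slowRequest = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            pool.execute(slowRequest); // request 1 occupies the only thread
            pool.execute(slowRequest); // request 2 fills the queue
            try {
                pool.execute(slowRequest); // request 3 has nowhere to go
                return false;
            } catch (RejectedExecutionException e) {
                return true; // failed fast instead of waiting
            }
        } finally {
            release.countDown(); // let the blocked requests finish
            pool.shutdown();
        }
    }
}
```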

Expect rubbish
● Expect invalid HTTP
● Expect malformed response bodies
● Expect connection failures
● Expect huge / tiny responses

Testing with Wiremock
stubFor(get(urlEqualTo("/dependencyPath"))
        .willReturn(aResponse()
                .withFault(Fault.MALFORMED_RESPONSE_CHUNK)));

{
    "request": {
        "method": "GET",
        "url": "/fault"
    },
    "response": {
        "fault": "RANDOM_DATA_THEN_CLOSE"
    }
}

{
    "request": {
        "method": "GET",
        "url": "/fault"
    },
    "response": {
        "fault": "EMPTY_RESPONSE"
    }
}

4 - Know if it’s your fault

What to record
● Metrics: Timings, errors, concurrent incoming requests, thread pool statistics, connection pool statistics
● Logging: Boundary logging, Elasticsearch / Logstash
● Request identifiers

Separate resource pools
● Don’t flood your dependencies
● Be able to answer the questions:
○ How many connections will you make to dependency X?
○ Are you getting close to your max connections?
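
One way to be able to answer both questions is to give each dependency its own fixed budget of concurrent calls. The sketch below uses a JDK Semaphore for that; DependencyLimiter and its methods are hypothetical names, and a thread pool or connection pool per dependency would serve the same purpose.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

class DependencyLimiter {
    private final Semaphore permits;
    private final int maxConcurrent;

    DependencyLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free, otherwise fails immediately. */
    <T> T call(Callable<T> dependencyCall) throws Exception {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException(
                    "dependency limit of " + maxConcurrent + " reached");
        }
        try {
            return dependencyCall.call();
        } finally {
            permits.release();
        }
    }

    /** Number of calls currently in progress; compare with the limit to see how close you are. */
    int inFlight() {
        return maxConcurrent - permits.availablePermits();
    }
}
```

With one limiter per dependency, a slow Pin Service can use up only its own budget, and inFlight() is a natural value to publish as a metric.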

So easy with Dropwizard + Hystrix
@Override
public void initialize(Bootstrap<AppConfig> appConfigBootstrap) {
    // Publish Hystrix metrics through Dropwizard's metric registry
    HystrixCodaHaleMetricsPublisher metricsPublisher
            = new HystrixCodaHaleMetricsPublisher(appConfigBootstrap.getMetricRegistry());
    HystrixPlugins.getInstance().registerMetricsPublisher(metricsPublisher);
}

metrics:
  reporters:
    - type: graphite
      host: 192.168.10.120
      port: 2003
      prefix: shiny_app

5 - Don’t whack a dead horse
[Diagram: Play Movie → Shiny App (several instances) → User Service, Device Service, Pin Service]

What to do..
● Yes this will happen..
● Mandatory dependency - fail *really* fast
● Throttling
● Fallbacks

Circuit breaker pattern

Implementation with Hystrix
@GET
@Timed
public String integrate() {
    LOGGER.info("I best do some integration!");
    // Each dependency call is wrapped in its own Hystrix command
    String user = new UserServiceDependency(userService).execute();
    String device = new DeviceServiceDependency(deviceService).execute();
    Boolean pinCheck = new PinCheckDependency(pinService).execute();
    return String.format("[User info: %s] \n[Device info: %s] \n[Pin check: %s] \n", user, device,
            pinCheck);
}

public class PinCheckDependency extends HystrixCommand<Boolean> {
    // Constructor and httpClient field are not shown on the slide

    @Override
    protected Boolean run() throws Exception {
        HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
        HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
        String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
        return Boolean.valueOf(pinCheckInfo);
    }
}

public class PinCheckDependency extends HystrixCommand<Boolean> {
    // Constructor and httpClient field are not shown on the slide

    @Override
    protected Boolean run() throws Exception {
        HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
        HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
        String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
        return Boolean.valueOf(pinCheckInfo);
    }

    // Returned when run() fails, times out, is rejected, or the circuit is open
    @Override
    public Boolean getFallback() {
        return true;
    }
}

Triggering the fallback
● Error threshold percentage
● Bucket of time for the percentage
● Minimum number of requests to trigger
● Time before trying a request again
● Disable
● Per instance statistics
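
How the first four settings interact is easier to see in a stripped-down breaker. The following is a hand-rolled sketch, not Hystrix's implementation: SimpleCircuitBreaker, its constructor parameters and the injected clock are all assumptions for illustration, and it uses one coarse lock and a single resetting window where a production breaker would use finer-grained rolling buckets.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class SimpleCircuitBreaker {
    private final int errorThresholdPercentage; // trip at or above this error rate
    private final int minimumRequests;          // ignore the error rate below this volume
    private final long windowMs;                // bucket of time the statistics cover
    private final long sleepWindowMs;           // wait this long before a trial request
    private final LongSupplier clock;           // injected so the behaviour is testable

    private long windowStart;
    private int requests;
    private int errors;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int errorThresholdPercentage, int minimumRequests,
                         long windowMs, long sleepWindowMs, LongSupplier clock) {
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.minimumRequests = minimumRequests;
        this.windowMs = windowMs;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Synchronized for brevity only; this serialises calls to the dependency.
    synchronized <T> T call(Supplier<T> dependency, Supplier<T> fallback) {
        long now = clock.getAsLong();
        if (open && now - openedAt < sleepWindowMs) {
            return fallback.get(); // circuit open: do not touch the dependency
        }
        if (now - windowStart >= windowMs) {
            resetWindow(now); // start a fresh bucket of statistics
        }
        requests++;
        try {
            T result = dependency.get();
            if (open) { // the trial request succeeded, so close the circuit
                open = false;
                resetWindow(now);
            }
            return result;
        } catch (RuntimeException e) {
            errors++;
            boolean overThreshold = requests >= minimumRequests
                    && errors * 100 / requests >= errorThresholdPercentage;
            if (open || overThreshold) {
                open = true;
                openedAt = now; // (re)start the sleep window
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    private void resetWindow(long now) {
        windowStart = now;
        requests = 0;
        errors = 0;
    }
}
```

With a threshold of 50%, a minimum of 4 requests and a 5 second sleep window, four consecutive failures open the circuit, calls during the next 5 seconds go straight to the fallback, and the first call after that is allowed through as a trial.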

6 - Turn off broken stuff
● The kill switch
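
At its simplest, a kill switch is a flag that operations can flip at runtime and that is checked before every call to the dependency. The sketch below assumes a hypothetical KillSwitch class; how the flag gets flipped (admin endpoint, configuration reload and so on) is left out.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    void turnOff() {
        enabled.set(false);
    }

    void turnOn() {
        enabled.set(true);
    }

    /** Calls the dependency only while the switch is on; otherwise uses the fallback. */
    <T> T call(Supplier<T> dependency, Supplier<T> fallback) {
        return enabled.get() ? dependency.get() : fallback.get();
    }
}
```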

To recap
1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off

Links
● Examples:
○ https://github.com/chbatey/spring-cloud-example
○ https://github.com/chbatey/dropwizard-hystrix
○ https://github.com/chbatey/vagrant-wiremock-saboteur
● Tech:
○ https://github.com/Netflix/Hystrix
○ https://www.vagrantup.com/
○ http://wiremock.org/
○ https://github.com/tomakehurst/saboteur

Questions?
● Thanks for listening!
● http://christopher-batey.blogspot.co.uk/

Developer takeaways
● Learn about TCP
● Love vagrant, docker etc to enable testing
● Don’t trust libraries

Hystrix cost - do this yourself

Hystrix metrics
● Failure count
● Percentiles from Hystrix point of view
● Error percentages

How to test metric publishing?
● Stub out Graphite and verify calls?
● Programmatically call Graphite and verify numbers?
● Make metrics + logs part of the story demo
