JDD 2016 - Jedrzej Dabrowa - Distributed System Fault Injection Testing With Docker

Distributed system fault injection testing with Docker
Jedrzej Dabrowa
Software Engineer

Engineering optimized for impact
What will I be talking about?
● What fails and why?
● Case study
○ Use of Elasticsearch in Base
○ Elasticsearch Proxy
● Testing microservices
● Fault injection tests toolkit
● Examples

Microservices - quick recap
● Hard to do, even harder to do right

● You get all the benefits of distributed systems…

● …with all the pains:
○ CAP theorem
○ Cloud-native (SCALE!)
○ Network failures
https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
○ Hardware failures
○ All kinds of stuff you would never think about

http://www.theregister.co.uk/2013/06/08/facebook_cloud_versus_cloud/

Elasticsearch proxy - Case study
● Account ID-based sharding
○ Works well for small to medium, similarly sized clients
○ Enter big clients… and problems start

● Need for a new solution
○ Solve existing problems...
■ Keep current solution for small accounts
■ Handle big ones differently
○ … as well as those yet to come. And improve!
■ Prioritize interactive traffic for a top-notch user experience
■ Provide SLA/QoS for database access
■ Enable dynamic configuration

“All problems in computer science can
be solved by another level of indirection”
(“... except of course for the problem of too many indirections”)
David Wheeler

[Diagram: account ID sharding vs. custom sharding]
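
A minimal sketch of the indirection idea, not the actual ES Proxy implementation: small accounts keep account ID-based sharding, while accounts listed in a dynamically configurable override table are handled differently. The shard count, the override table, and every name below are hypothetical.

```python
from typing import Dict

NUM_SHARDS = 8  # hypothetical shard count

# Hypothetical dynamic configuration: big accounts mapped to dedicated targets.
custom_routes: Dict[int, str] = {42: "big-account-index-42"}


def route(account_id: int) -> str:
    """Return the target for a request made on behalf of account_id."""
    # Big accounts listed in the override table are handled differently.
    if account_id in custom_routes:
        return custom_routes[account_id]
    # Everyone else keeps the existing account ID-based sharding.
    return f"shard-{account_id % NUM_SHARDS}"
```

Because the override table is plain data, it could be reloaded at runtime, which is one way to read the "dynamic configuration" requirement.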

Testing microservices
● Testing a single service is no different from testing a regular application
● Testing a microservices-based system is a whole other story
● New challenges require new approaches
[Figure: Mike Cohn's test pyramid with cost/effort and execution speed axes; "Our target" is marked on it. Source: https://github.com/xolvio/qualityfaster]

● Welcome to the non-deterministic world
● Sources of complexity:
○ System space complexity
○ Fault space complexity
● Impossible to efficiently explore every possibility

● Interesting example: Netflix
● Chaos Monkey
● Mess with production (!)
● Lineage-driven fault injection
○ https://people.ucsc.edu/~palvaro/socc16.pdf
● Found problems in a critical place (App Boot):
○ Brute-force exploration would take ~2^100 iterations
○ 5 potential failures found in ~200 experiments
● http://techblog.netflix.com/2016/01/automated-failure-testing.html
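
To see why brute force is hopeless, here is a small sketch of how the fault space grows. It assumes each failure point is independently either healthy or failed, so n points give 2^n combinations; the figure of 100 failure points and the experiment rate are illustrative assumptions, not numbers from the talk.

```python
def fault_space_size(failure_points: int) -> int:
    """Number of healthy/failed combinations for independent failure points."""
    return 2 ** failure_points


def years_to_explore(failure_points: int, experiments_per_second: float) -> float:
    """Time needed to run one experiment per combination, in years."""
    seconds = fault_space_size(failure_points) / experiments_per_second
    return seconds / (365 * 24 * 3600)


if __name__ == "__main__":
    # Assumed: 100 failure points, one million experiments per second.
    print(fault_space_size(100))
    print(f"{years_to_explore(100, 1e6):.2e} years")
```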

ES Proxy Fault Injection tests
● Need: environment
● Solution: Docker
○ Docker compose
○ Makes it easy to set up the whole environment
○ A relatively complex system can be hosted on a PC
○ Nice, declarative configuration through compose files

● Need: microservices binaries
● Solution: Amazon ECR
○ Allows teams to share their services
○ Current version easily recognizable

● Need: Harness tool
● Solution: BATS
○ https://github.com/sstephenson/bats
○ BASH-based tool // :(
○ TAP-compliant
○ Simple
○ Lets you implement isolated test scenarios and set up the environment
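
"TAP-compliant" means the harness reports results in the Test Anything Protocol: a plan line `1..N` followed by one `ok` or `not ok` line per test. The sketch below reproduces that output shape in Python rather than BATS; the runner and the scenario names are hypothetical and only illustrate the format.

```python
from typing import Callable, List, Tuple


def run_tap(scenarios: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run each scenario and return TAP-formatted result lines."""
    lines = [f"1..{len(scenarios)}"]  # TAP plan line
    for number, (name, check) in enumerate(scenarios, start=1):
        try:
            check()
            lines.append(f"ok {number} - {name}")
        except AssertionError:
            lines.append(f"not ok {number} - {name}")
    return lines


def passing_scenario() -> None:
    """Hypothetical scenario that succeeds."""


def failing_scenario() -> None:
    """Hypothetical scenario that fails."""
    raise AssertionError("simulated failure")


if __name__ == "__main__":
    for line in run_tap([("proxy answers", passing_scenario),
                         ("proxy survives fault", failing_scenario)]):
        print(line)
```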

● Need: a way to inject faults
● Solution: Pumba tool
○ https://github.com/gaia-adm/pumba
○ Based on netem
○ Modifies egress traffic
○ Supports various delay/loss models
○ Relatively fresh (current version: 0.2.6)

ES Proxy Fault Injection tests scenarios
● So… what do we test?
○ Remember first slide?
○ Focus on the most obvious failure points first
○ What is the potential problem? How should the application behave?
■ ES Proxy loses ZK connection
■ ZK partition / quorum loss
■ There are delays in the network
■ Packets are dropped
■ Packets are reordered

Enable global 800 ms delay
Enable global 8% packet loss
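
Since Pumba is based on netem, faults like these come down to `tc ... netem` rules on a container's egress interface. The sketch below only builds the argument lists for such rules; it does not run them (that needs Linux and root privileges) and it is not Pumba's own command line. The interface name `eth0` and the helper names are assumptions.

```python
from typing import List


def netem_delay(interface: str, delay_ms: int) -> List[str]:
    """Build a tc command that delays all egress packets on interface."""
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]


def netem_loss(interface: str, loss_percent: float) -> List[str]:
    """Build a tc command that randomly drops egress packets on interface."""
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "loss", f"{loss_percent:g}%"]


def netem_clear(interface: str) -> List[str]:
    """Build a tc command that removes the root qdisc, ending the fault."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


if __name__ == "__main__":
    print(" ".join(netem_delay("eth0", 800)))
    print(" ".join(netem_loss("eth0", 8)))
    print(" ".join(netem_clear("eth0")))
```

Keeping the commands as lists makes them easy to pass to `subprocess.run` inside a test's setup and teardown.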

Conclusions
● Failures will happen
● Proper design and keeping fault tolerance in mind give a pretty good level of confidence
● Fault injection tests improve software reliability in a way that cannot be achieved through other kinds of tests
● These tests are only as good as you want/need them to be; they are not exhaustive

jedrzej.dabrowa@getbase.com
lab.getbase.com/java
@JeDabrowa @getbaselab
