Having more than a hundred loosely coupled microservices makes resiliency testing a big challenge. In a probabilistic system, failure is inevitable. With the help of Docker and the ecosystem around it, we built a framework that allowed us to test core components of Base against network issues, partitions, and similar faults. Learn how you can build one yourself and sleep well, without system outages.
2. Engineering optimized for impact
What will I be talking about?
● What fails and why?
● Case study
○ Use of Elasticsearch in Base
○ Elasticsearch Proxy
● Testing microservices
● Fault injection tests toolkit
● Examples
Microservices - quick recap
● Hard to do, even harder to do right
● You get all the benefits of distributed systems…
● … with all the pains:
○ CAP theorem
○ Cloud-native (SCALE!)
○ Network failures (see https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing)
○ Hardware failures
○ All kinds of stuff you would never think about
http://www.theregister.co.uk/2013/06/08/facebook_cloud_versus_cloud/
Elasticsearch proxy - Case study
● Account ID-based sharding
○ Works well for small to medium, similarly sized clients
○ Enter big clients… and problems start
Elasticsearch proxy - Case study
● Need for a new solution
○ Solve existing problems…
■ Keep the current solution for small accounts
■ Handle big ones differently
○ … as well as those yet to come. And improve!
■ Prioritize interactive traffic for a top-notch user experience
■ Provide SLA/QoS for database access
■ Enable dynamic configuration
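The requirements above can be illustrated with a minimal sketch. It is not the actual ES Proxy implementation, which the slides do not show. The class names `AccountRouter` and `RequestQueue`, the index names, and the two traffic classes are all hypothetical. The sketch assumes small accounts are hashed onto shared indices, big accounts get a dedicated index through a lookup table that could be reloaded at runtime, and interactive requests are served before batch ones.

```python
import hashlib
import heapq
import itertools
from typing import Dict, List, Tuple


class AccountRouter:
    """Hypothetical router: hash small accounts, special-case big ones."""

    def __init__(self, shard_count: int, dedicated: Dict[int, str]):
        self.shard_count = shard_count
        # Big account id -> dedicated index name. Swapping this dict at
        # runtime stands in for "dynamic configuration".
        self.dedicated = dict(dedicated)

    def index_for(self, account_id: int) -> str:
        # Big accounts are handled differently: they get their own index.
        if account_id in self.dedicated:
            return self.dedicated[account_id]
        # Small accounts keep account ID-based sharding (stable hash).
        digest = hashlib.sha1(str(account_id).encode()).digest()
        shard = int.from_bytes(digest[:8], "big") % self.shard_count
        return f"shared-{shard}"


class RequestQueue:
    """Hypothetical queue that serves interactive traffic before batch."""

    PRIORITY = {"interactive": 0, "batch": 1}

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # keeps FIFO order within a class

    def push(self, kind: str, request: str) -> None:
        heapq.heappush(self._heap, (self.PRIORITY[kind], next(self._seq), request))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]
```

A strict priority queue is the simplest possible policy. A real QoS layer would likely also need rate limits so that batch traffic is not starved.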
“All problems in computer science can be solved by another level of indirection”
(“... except of course for the problem of too many indirections”)
David Wheeler
Testing microservices
● Testing a single service is no different from testing a regular application
● Testing a microservices-based system is a whole other story
● New challenges require new approaches
[Figure: test pyramid after Mike Cohn; axes: cost/effort and execution speed; callout: "Our target". Source: https://github.com/xolvio/qualityfaster]
Testing microservices
● Welcome to the non-deterministic world
● Sources of complexity:
○ System space complexity
○ Fault space complexity
● Impossible to efficiently explore every possibility
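A short calculation shows why exhaustive exploration is out of reach. The sketch below assumes a simplified model that is not taken from the slides: every injection point is either healthy or failed, independently of the others, so n points give 2^n fault combinations. The service names are hypothetical placeholders. A common compromise, also shown, is to enumerate only combinations with a small number of simultaneous failures.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def fault_combinations(points: int) -> int:
    """Size of the fault space when each point is either healthy or failed."""
    return 2 ** points


def bounded_fault_sets(
    points: Sequence[str], max_faults: int
) -> Iterator[Tuple[str, ...]]:
    """Yield only the fault sets with at most max_faults simultaneous failures."""
    for k in range(1, max_faults + 1):
        yield from combinations(points, k)


# Hypothetical injection points, for illustration only.
services = ["es-proxy", "zookeeper", "elasticsearch", "network"]

for n in (10, 20, 100):
    print(f"{n} injection points -> {fault_combinations(n)} combinations")

print(len(list(bounded_fault_sets(services, max_faults=2))), "sets with <= 2 faults")
```

Bounding the number of simultaneous faults keeps the experiment count polynomial in n, at the cost of missing failures that need more faults to trigger.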
Testing microservices
● Interesting example: Netflix
● Chaos Monkey
● Mess with production (!)
● Lineage-driven fault injection
○ https://people.ucsc.edu/~palvaro/socc16.pdf
● Found a problem in a critical place (App Boot):
○ Brute-force exploration would take ~2^100 iterations
○ 5 potential failures found in ~200 experiments
● http://techblog.netflix.com/2016/01/automated-failure-testing.html
ES Proxy Fault Injection tests
● Need: environment
● Solution: Docker
○ Docker Compose
○ Makes it easy to set up the whole environment
○ A relatively complex system can be hosted on a single PC
○ Nice, declarative configuration through compose files
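One way to use such an environment from a test suite is to start and stop it around each run. The sketch below is an assumption about how this could look, not the toolkit from the talk. `ComposeEnvironment` and the compose file name are hypothetical. It shells out to the `docker compose` CLI (older installations use the standalone `docker-compose` binary instead) and takes the command runner as a parameter, so the logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, List


def compose_command(compose_file: str, *args: str) -> List[str]:
    """Build a `docker compose` invocation for the given compose file."""
    return ["docker", "compose", "-f", compose_file, *args]


class ComposeEnvironment:
    """Context manager: bring the environment up on entry, tear it down on exit."""

    def __init__(self, compose_file: str, run: Callable = subprocess.run):
        self.compose_file = compose_file
        self._run = run  # injectable so tests can record commands instead

    def __enter__(self) -> "ComposeEnvironment":
        # Start all services in the background.
        self._run(compose_command(self.compose_file, "up", "-d"), check=True)
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Remove containers and volumes so every run starts from a clean state.
        self._run(compose_command(self.compose_file, "down", "--volumes"), check=True)


# Usage (requires Docker; the file name is a placeholder):
#
#     with ComposeEnvironment("docker-compose.yml"):
#         run_fault_injection_scenarios()
```

Tearing the environment down with its volumes trades speed for isolation: each scenario starts from the same known state.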
ES Proxy Fault Injection tests
● Need: microservices binaries
● Solution: Amazon ECR
○ Allows teams to share their services
○ Current version easily recognizable
ES Proxy Fault Injection tests scenarios
● So… what do we test?
○ Remember the first slide?
○ Focus on the most obvious failure points first
○ What is the potential problem? How should the application behave?
● Example scenarios:
○ ES Proxy loses ZK connection
○ ZK partition / quorum loss
○ There are delays in the network
○ Packets are dropped
○ Packets are reordered
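The scenarios above map naturally onto standard Docker and Linux tools. The slides do not say which tools the toolkit used, so the following is only a sketch under stated assumptions: partitions are simulated with `docker network disconnect`, and delay, loss, and reordering with `tc ... netem` run inside the container. That requires `tc` (iproute2) in the image and the `NET_ADMIN` capability. The container names, the network name, the `eth0` device, and the fault parameters are hypothetical.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List


def _run(cmd: List[str]) -> None:
    subprocess.run(cmd, check=True)


# netem arguments per fault type; the values are arbitrary examples.
FAULTS = {
    "delay": ("delay", "200ms"),
    "loss": ("loss", "10%"),
    # netem only reorders packets when a delay is configured as well.
    "reorder": ("delay", "10ms", "reorder", "25%", "50%"),
}


def netem_command(container: str, *netem_args: str, device: str = "eth0") -> List[str]:
    """Build the command that attaches a netem qdisc inside a container."""
    return ["docker", "exec", container, "tc", "qdisc", "add",
            "dev", device, "root", "netem", *netem_args]


def netem_clear_command(container: str, device: str = "eth0") -> List[str]:
    """Build the command that removes the root qdisc again."""
    return ["docker", "exec", container, "tc", "qdisc", "del", "dev", device, "root"]


@contextmanager
def network_fault(container: str, fault: str,
                  run: Callable[[List[str]], None] = _run) -> Iterator[None]:
    """Apply delay, loss, or reordering for the duration of the block."""
    run(netem_command(container, *FAULTS[fault]))
    try:
        yield
    finally:
        run(netem_clear_command(container))


@contextmanager
def partitioned(container: str, network: str,
                run: Callable[[List[str]], None] = _run) -> Iterator[None]:
    """Cut a container off from a Docker network, then heal the partition."""
    run(["docker", "network", "disconnect", network, container])
    try:
        yield
    finally:
        run(["docker", "network", "connect", network, container])
```

A scenario such as "ES Proxy loses ZK connection" would then wrap its assertions in `with partitioned("zk1", "testnet"):`, check how the proxy behaves while the fault is active, and check again after the block exits to confirm recovery.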
Conclusions
● Failures will happen
● Proper design that keeps fault tolerance in mind gives a pretty good level of confidence
● Fault injection tests improve software reliability in a way that cannot be achieved through other kinds of tests
● Those tests are only as good as you want/need them to be - they’re not exhaustive