Antifragility and testing for distributed systems failure

Failure is inevitable. In our modern world of continuously delivered and increasingly complex distributed architectures (looking at you, microservices), it is important to be able to test and improve our systems under a range of failure conditions.

In this talk, Matt discusses these complexities and the forces they exert on development teams, presenting some simple strategies and practical advice to deal with them.

  1. Antifragility and testing distributed systems: approaches for testing and improving resiliency
  2. Failure: it’s inevitable
  3. Microservice Architectures ■ Bounded contexts ■ Deterministic in nature ■ Simple behaviour ■ Independently testable (e.g. Pact)
  4. Distributed Architectures Conversely… ■ Unbounded context ■ Non-determinism ■ Exhibit chaotic behaviour ■ Emergent behaviour ■ Complex testing
  5. Problems with traditional approaches ■ Integration test hell ■ Need to get by without E2E environments ■ Learnings are non-representative anyway ■ Slower ■ Costly (effort + $$)
  6. Alternative? Create an isolated, simulated environment ■ Run locally or on a CI environment ■ Fast - no need to set up complex test data, scenarios etc. ■ Enables single-variable hypothesis testing ■ Automatable
  7. Lab Testing with Docker Compose: hypothesis testing in simulated environments
  8. Docker Compose ■ Docker container orchestration tool ■ Run locally or remotely ■ Works across platforms (Windows, Mac, *nix) ■ Easy to use
  9. Nginx Let’s take a practical, real-world example: Nginx as an API Proxy.
  10. Simulating failure with Muxy: “A tool to help simulate distributed systems failures”
  11. Hypothesis testing Our job is to hypothesise, test, learn, change, and repeat
  12. Nginx Testing H0 = Introducing network latency does not cause errors. Test setup: ● Nginx running locally, with Production configuration ● DNSMasq used to resolve production URLs to other Docker containers ● Muxy container set up, proxying the API ● A test harness to hit the API via Nginx n times, expecting 0 failures
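The harness itself isn't shown in the deck; a minimal sketch of what it could look like in Go follows. The endpoint URL, request count, and client timeout are assumptions for illustration, not values from the original demo.

    package main

    import (
        "fmt"
        "net/http"
        "os"
        "time"
    )

    func main() {
        // Assumed values: the demo's real endpoint and request count may differ.
        endpoint := "http://localhost:8080/api/" // Nginx, proxying the Muxy-fronted API
        n := 100
        client := &http.Client{Timeout: 2 * time.Second} // fail fast when Muxy injects latency

        failures := 0
        for i := 0; i < n; i++ {
            resp, err := client.Get(endpoint)
            if err != nil || resp.StatusCode != http.StatusOK {
                failures++
            }
            if resp != nil {
                resp.Body.Close()
            }
        }

        fmt.Printf("%d/%d requests failed\n", failures, n)
        if failures > 0 {
            os.Exit(1) // H0 rejected: the injected latency caused errors
        }
    }

Because the harness exits non-zero on any failure, it can run unattended on CI against the Docker Compose environment, which is what makes the single-variable hypothesis testing automatable.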
  13. Demo Fingers crossed...
  14. Knobs and Levers We now have a number of levers to pull. What if we... ● Want to improve on our SLA? ● Want to see how it performs if the API is hard down? ● ...
  15. Antifragility Failure is inevitable, let’s make it normal
  16. Titanic Architectures
  17. Titanic Architectures “Titanic architectures are architectures that are good in theory, but haven’t been put into practice”
  18. Anti-titanic architectures? “What doesn’t kill you makes you stronger”
  19. Antifragility “The resilient resists shocks and stays the same; the antifragile gets better” - Nassim Taleb
  20. Chaos Engineering ● We expect our teams to build resilient applications ○ Fault tolerance across and within service boundaries ● We expect servers and dependent services to fail ● Let’s make that normal ● Production is a playground ● Levelling up
  21. Chaos Engineering - Principles 1. Build a hypothesis around Steady State Behavior 2. Vary real-world events 3. Run experiments in production 4. Automate experiments to run continuously. Requires the ability to measure - you need metrics! http://www.principlesofchaos.org/
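Principle 1 depends on having a measurable steady state. As a rough, hypothetical illustration (the metrics endpoint, response shape, baseline, and tolerance below are all made up for the sketch), an automated experiment could assert that an error-rate metric stays near its baseline while the failure is injected:

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    // metricsResponse is a hypothetical shape for an error-rate metric;
    // a real system would query its own monitoring stack instead.
    type metricsResponse struct {
        ErrorRate float64 `json:"error_rate"` // fraction of failed requests
    }

    func fetchErrorRate(url string) (float64, error) {
        resp, err := http.Get(url)
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()
        var m metricsResponse
        if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
            return 0, err
        }
        return m.ErrorRate, nil
    }

    func main() {
        // Hypothetical endpoint and thresholds; the point is the comparison
        // against steady state, not these particular numbers.
        const metricsURL = "http://metrics.internal/error-rate"
        const baseline = 0.01   // steady-state error rate measured before the experiment
        const tolerance = 0.005 // drift we are willing to accept during the experiment

        current, err := fetchErrorRate(metricsURL)
        if err != nil {
            fmt.Fprintln(os.Stderr, "could not measure steady state:", err)
            os.Exit(2)
        }
        if current > baseline+tolerance {
            fmt.Printf("hypothesis rejected: error rate %.3f exceeds steady state %.3f\n", current, baseline)
            os.Exit(1)
        }
        fmt.Println("steady state held during the experiment")
    }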
  22. Production Hypothesis Testing H0 = Loss of an AWS region does not result in errors. Test setup: ● Multi-region application setup for the video playing API ● Apply Chaos Kong to us-west-2 ● Measure aggregate production traffic for ‘normal’ levels
  23. Kill an AWS region http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html
  24. Go/Hystrix API Demo H0 = Introducing network latency does not cause API errors. Test setup: ● API1 running with a Hystrix circuit breaker that trips if API2 does not respond within SLAs ● Muxy container set up, proxying upstream API2 ● A test harness to hit API1 n times, expecting 0 failures
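The full demo lives in the Muxy examples linked at the end of the deck. A stripped-down sketch of the API1 side, using the afex/hystrix-go port of Hystrix, might look like the following; the command settings and the API2 address are illustrative assumptions, not the demo's actual configuration.

    package main

    import (
        "fmt"
        "io"
        "net/http"

        "github.com/afex/hystrix-go/hystrix"
    )

    func main() {
        // Circuit-breaker settings approximating an SLA: treat slow or failing
        // calls to API2 as errors. Values here are illustrative only.
        hystrix.ConfigureCommand("api2", hystrix.CommandConfig{
            Timeout:               1000, // ms: responses slower than the SLA count as failures
            MaxConcurrentRequests: 100,
            ErrorPercentThreshold: 25, // open the circuit once 25% of calls fail
        })

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            var body []byte
            err := hystrix.Do("api2", func() error {
                // Happy path: call upstream API2 (proxied through Muxy in the demo).
                resp, err := http.Get("http://api2:8001/") // assumed address
                if err != nil {
                    return err
                }
                defer resp.Body.Close()
                body, err = io.ReadAll(resp.Body)
                return err
            }, func(err error) error {
                // Fallback: return a degraded response rather than surfacing the failure.
                body = []byte(`{"status":"degraded"}`)
                return nil
            })
            if err != nil {
                http.Error(w, err.Error(), http.StatusServiceUnavailable)
                return
            }
            w.Write(body)
        })

        fmt.Println("API1 listening on :8000")
        http.ListenAndServe(":8000", nil)
    }

With Muxy adding latency to API2, the breaker trips once the error threshold is reached and the fallback keeps API1 answering, which is what lets the harness still observe 0 failures.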
  25. Human Factors Technology is only part of the problem; can we test that too?
  26. Chernobyl ● Worst nuclear disaster of all time (1986) ● Public information sketchy ● Estimated > 3M Ukrainians affected ● Radioactive clouds sent over Europe ● Combination of system + human errors ● Series of seemingly logical steps -> catastrophe
  27. What we know about human factors ● Accidents happen ● 1am - 8am = higher incidence of human errors ● Humans will ignore directions ○ They sometimes need to (e.g. override) ○ Other times they think they need to (mistake) ● Computers are better at following processes
  28. Translation Let’s use a Production deployment as a key example: ● CI -> CD pipeline used to deploy ● Production incident occurs 6 hours later (2am) ● ...what do we do? ● We trust the build pipeline and avoid non-standard actions. These events help us understand and improve our systems
  29. Game Day Exercises “A game day exercise is where we intentionally try to break our system, with the goal of being able to understand it better and learn from it”
  30. Game Day Exercises Prerequisites: ● A game plan ● All team members and affected staff aware of it ● Close collaboration between Dev, Ops, Test, Product people etc. ● An open mind ● Hypotheses ● Metrics ● Bravery
  31. Game Day Exercises ● Get the entire team together ● Make a simple diagram of the system on a whiteboard ● Come up with ~5 failure scenarios ● Write down hypotheses for each scenario ● Back up any data you can’t lose ● Induce each failure and observe the results https://stripe.com/blog/game-day-exercises-at-stripe
  32. Game Day Exercises Examples of things that fail: ● Application dies ● Hard disk fails ● Machine dies < AZ < Region… ● GitHub/source control goes down ● Build server dies ● Loss of or degraded network connectivity ● Loss of dependent API ● ...
  33. Wrapping up I hope I didn’t fail
  34. Wrapping up ■ Apply the scientific method ■ Use metrics to learn and make decisions ■ Docker Compose + Muxy to automate failure ■ Build resilience into software & architecture ■ Regularly test Production resilience until it’s normal ■ Production outages are opportunities to learn ■ Start small!
  35. Thank you PRESENTED BY: @matthewfellows
  36. References ■ Antifragility (https://en.wikipedia.org/wiki/Antifragile) ■ Chaos Engineering (http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html) ■ Principles of Chaos (http://www.principlesofchaos.org/) ■ Human factors in large-scale technological systems' accidents: Three Mile Island, Bhopal, Chernobyl (http://oae.sagepub.com/content/5/2/133.abstract)
  37. Code & Tool References ■ Docker Compose (https://www.docker.com/docker-compose) ■ Muxy (https://github.com/mefellows/muxy) ■ Nginx resilience testing with Docker Compose (www.onegeek.com.au/articles/resilience-testing-nginx-with-docker-dnsmasq-and-muxy) ■ Golang + Hystrix resilience testing with Docker Compose (https://github.com/mefellows/muxy/tree/mst-meetup-demo/examples/hystrix)
