
Chaos Engineering: When should you release the monkeys?

Chaos Engineering is listed as 'Trial' in the ThoughtWorks Tech Radar, but what is it, really, and how is it different from traditional testing? When and why should you get started with Chaos Engineering, and is Chaos Monkey the right place to start when you do?

Published in: Software

  1. Chaos Engineering: When should you release the monkeys? (Steve Upton)
  2. What is Chaos Engineering? / Why Chaos Engineering? / Getting started with Chaos Engineering
  3. What is Chaos Engineering?
  4. Netflix Principles
  6. Netflix Principles:
      1. No service should be a Single Point Of Failure.
      2. Never trust that you’ve done #1 correctly.
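Slide 6 pairs redundancy with distrust of that redundancy. As a minimal sketch (the replica functions and the `call_with_failover` helper are hypothetical, not a Netflix API), a client avoids depending on any single instance by trying redundant replicas in turn. The second principle is why you would then deliberately break one replica and confirm that the call still succeeds.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every redundant replica of a service has failed."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Call each replica in order and return the first successful result.

    Hypothetical helper: a real client would also add timeouts and only
    catch the errors it knows how to handle.
    """
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # this replica failed; try the next one
            errors.append(exc)
    raise AllReplicasFailed(errors)


def broken_replica() -> str:
    """Simulates a replica that has crashed."""
    raise ConnectionError("replica unavailable")


def healthy_replica() -> str:
    return "response from healthy replica"


# "Never trust that you've done #1 correctly": deliberately break one replica.
result = call_with_failover([broken_replica, healthy_replica])
print(result)
```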
  7. “It is better to be in a constant state of minor failure [than occasionally catastrophic failure]” - Richard Rodger, CEO @ nearForm
  12. Principles of Chaos Engineering (source: principlesofchaos.org):
      1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
      2. Hypothesize that this steady state will continue in both the control group and the experimental group.
      3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
      4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
  13. Principles of Chaos Engineering:
      1. Make sure your system is working.
      2. Try to break it.
      3. Did it break?
      4. Repeat.
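Principle 4 on slide 12 can be made concrete with a small comparison. This sketch assumes the steady-state metric is the fraction of successful requests and uses a fixed `tolerance` (both hypothetical choices). A real experiment would more likely apply a statistical test to monitoring data than a hard threshold.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Assumed steady-state metric: fraction of requests that succeeded."""
    if not outcomes:
        raise ValueError("need at least one observation")
    return sum(outcomes) / len(outcomes)


def hypothesis_survives(control: Sequence[bool], experiment: Sequence[bool],
                        tolerance: float = 0.01) -> bool:
    """True if the experimental group's steady state stays within `tolerance`
    of the control group's, i.e. we failed to disprove the hypothesis."""
    return abs(success_rate(control) - success_rate(experiment)) <= tolerance


control = [True] * 999 + [False]       # 99.9% success, no failures injected
mild = [True] * 990 + [False] * 10     # 99.0% success with failures injected
severe = [True] * 950 + [False] * 50   # 95.0% success with failures injected

print(hypothesis_survives(control, mild))    # True: hypothesis survives
print(hypothesis_survives(control, severe))  # False: difference found, hypothesis disproved
```

Framing the check as "try to disprove" matters: a surviving hypothesis only means this experiment found no difference, not that the system is proven resilient.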
  14. Principles of Chaos Engineering
  15. Principles of Chaos Engineering
  16. Chaos Engineering needs a way of validating that the system is working and a way of inducing failures.
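Slides 13 and 16 reduce the practice to a loop built from two ingredients: a probe that validates the system is working, and a way of inducing failures. The sketch below is entirely hypothetical. An in-process fault injector stands in for real failures, the "system" is a caller that retries a flaky dependency, and the 95% success threshold is an arbitrary choice.

```python
import random
from typing import Callable, Optional

Call = Callable[[], str]


class InjectedFault(Exception):
    """Stands in for a real failure such as a crashed server or severed connection."""


def inject_faults(dependency: Call, failure_rate: float, rng: random.Random) -> Call:
    """A way of inducing failures: each call fails with probability `failure_rate`."""
    def faulty() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return dependency()
    return faulty


def make_system(dependency: Call, attempts: int) -> Call:
    """Hypothetical system under test: calls `dependency`, retrying up to `attempts` times."""
    def system() -> str:
        last_error: Optional[Exception] = None
        for _ in range(attempts):
            try:
                return dependency()
            except InjectedFault as exc:
                last_error = exc
        raise RuntimeError("all attempts failed") from last_error
    return system


def probe(system: Call, calls: int = 200) -> float:
    """A way of validating that the system is working: fraction of successful calls."""
    successes = 0
    for _ in range(calls):
        try:
            system()
            successes += 1
        except Exception:
            pass
    return successes / calls


def chaos_round(dependency: Call, attempts: int, failure_rate: float,
                rng: random.Random, threshold: float = 0.95) -> bool:
    """One pass of the loop; returns True if the system survived the injected failures."""
    # 1. Make sure your system is working.
    if probe(make_system(dependency, attempts)) < threshold:
        raise RuntimeError("not in steady state; fix that before injecting failures")
    # 2. Try to break it.
    broken_dependency = inject_faults(dependency, failure_rate, rng)
    # 3. Did it break?
    return probe(make_system(broken_dependency, attempts)) >= threshold


rng = random.Random(42)
# 4. Repeat, e.g. with different failure rates or retry settings.
for attempts, rate in [(3, 0.1), (1, 0.5)]:
    print(attempts, rate, chaos_round(lambda: "ok", attempts, rate, rng))
```

With three attempts the system should tolerate a 10% failure rate, while a single attempt against a 50% failure rate should break it. The point of the loop is that the answer comes from observation rather than from reading the retry code.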
  17. “Isn’t this just stress testing?” - Everyone
  18. Real-world failure modes
  19. Real-world failure modes
  20. Real-world failure modes
  21. “Everything fails all the time” - Werner Vogels, VP + CTO @ AWS
  22. “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Leslie Lamport
  23. Decoupled cause and effect
  24. Emphasis on monitoring
  25. Why Chaos Engineering?
  26. Cynefin Framework (image credit: UX Brighton)
  27–34. Cynefin Framework, built up one domain at a time:
      • Simple: Sense, Categorize, Respond (Best Practice)
      • Complicated: Sense, Analyze, Respond (Good Practice)
      • Complex: Probe, Sense, Respond (Emergent)
      • Chaotic: Act, Sense, Respond (Novel)
  35. Image credit: @johncutlefish
  37. Complex / Complicated / Chaotic / Simple
  38. Complex / Complicated / Chaotic / Simple: Chaos Engineering
  39. Complex / Complicated / Chaotic / Simple
  40. Complex / Chaotic: ?
  41. Complex / Complicated / Chaotic / Simple: Unified Logging
  42. Complex / Complicated / Chaotic / Simple: Code Freeze
  43. “An evolving [software] system increases in complexity unless work is done to reduce it.” - Lehman's laws of software evolution
  44. Complex / Complicated / Chaotic / Simple: Where are you today?
  45. Complex / Complicated / Chaotic / Simple: Where will you be tomorrow?
  46. Getting started with Chaos Engineering
  47–49. Experiment definition, split across three slides:
      {
        "version": "1.0.0",
        "title": "What is the impact of an expired certificate on our application chain?",
        "description": "If a certificate expires, we should gracefully deal with the issue.",
        "steady-state-hypothesis": {
          "title": "Application responds",
          "probes": [
            {
              "type": "probe",
              "name": "we-can-request-sunset",
              "tolerance": 200,
              "provider": { "type": "http", "timeout": 3, "verify_tls": false, "url": "https://localhost:8443/city/Paris" }
            }
          ]
        },
        "method": [
          {
            "type": "action",
            "name": "swap-to-expired-cert",
            "provider": { "type": "process", "path": "cp", "arguments": "expired-cert.pem cert.pem" }
          },
          {
            "type": "action",
            "name": "restart-astre-service-to-pick-up-certificate",
            "provider": { "type": "process", "path": "pkill", "arguments": "--echo -HUP -F astre.pid" }
          }
        ],
        "rollbacks": [
          {
            "type": "action",
            "name": "swap-to-valid-cert",
            "provider": { "type": "process", "path": "cp", "arguments": "valid-cert.pem cert.pem" }
          },
          { "ref": "restart-astre-service-to-pick-up-certificate" },
          { "ref": "restart-sunset-service-to-pick-up-certificate" }
        ]
      }
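The experiment on slides 47–49 appears to use the Chaos Toolkit experiment format: a steady-state hypothesis made of probes, a method of actions that introduce the failure, and rollbacks that restore the system. The sketch below is a deliberately simplified model of how such a file is evaluated, not the Chaos Toolkit implementation. Tolerances are compared by plain equality, rollbacks always run (the real tool has its own rules for when rollbacks run), and `fake_execute` is a hypothetical in-memory stand-in for the `cp`, `pkill`, and HTTP providers.

```python
from typing import Any, Callable, Dict, List

Activity = Dict[str, Any]
Executor = Callable[[Activity], Any]


def hypothesis_met(experiment: Dict[str, Any], execute: Executor) -> bool:
    """Run each steady-state probe; simplification: tolerance is an exact expected value."""
    probes = experiment["steady-state-hypothesis"]["probes"]
    return all(execute(probe) == probe["tolerance"] for probe in probes)


def run_experiment(experiment: Dict[str, Any], execute: Executor) -> str:
    """Check steady state, run the method, re-check, then always apply rollbacks."""
    by_name = {a["name"]: a for a in experiment.get("method", [])}
    try:
        if not hypothesis_met(experiment, execute):
            return "aborted"  # the system was not healthy to begin with
        for activity in experiment.get("method", []):
            execute(activity)
        return "completed" if hypothesis_met(experiment, execute) else "deviated"
    finally:
        for rollback in experiment.get("rollbacks", []):
            # {"ref": name} reuses an activity already defined in the method
            execute(by_name[rollback["ref"]] if "ref" in rollback else rollback)


# Hypothetical in-memory environment: the app answers 200 only with a valid cert.
state = {"cert": "valid"}
log: List[str] = []


def fake_execute(activity: Activity) -> Any:
    log.append(activity["name"])
    if activity["type"] == "probe":
        return 200 if state["cert"] == "valid" else 503
    if activity["name"] == "swap-to-expired-cert":
        state["cert"] = "expired"
    elif activity["name"] == "swap-to-valid-cert":
        state["cert"] = "valid"
    return None


experiment = {
    "steady-state-hypothesis": {"probes": [
        {"type": "probe", "name": "we-can-request-sunset", "tolerance": 200}]},
    "method": [
        {"type": "action", "name": "swap-to-expired-cert"},
        {"type": "action", "name": "restart-astre-service-to-pick-up-certificate"},
    ],
    "rollbacks": [
        {"type": "action", "name": "swap-to-valid-cert"},
        {"ref": "restart-astre-service-to-pick-up-certificate"},
    ],
}

result = run_experiment(experiment, fake_execute)
print(result, state["cert"])  # deviated valid
```

Here the fake application does not handle the expired certificate gracefully, so the hypothesis is disproved ("deviated"), yet the rollbacks still return the system to a valid certificate. Because `{"ref": ...}` entries point at activities defined elsewhere in the experiment, the slides' reference to `restart-sunset-service-to-pick-up-certificate` presumably relies on an action not shown on them; this sketch leaves it out.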
  50. Chaos Engineering / You must be this tall to use Microservices
  51. Chaos Engineering / You must be this tall to use Chaos Engineering
  52. Monitoring
  53. https://artillery.io/chaos-lambda/
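Slide 53 links to chaos-lambda, which, as far as we know, randomly terminates instances in AWS Auto Scaling groups during business hours, in the spirit of Chaos Monkey. The sketch below shows only that style of victim selection. The group names, the per-group `probability`, and the 9:00 to 17:00 weekday window are hypothetical, and it makes no cloud API calls.

```python
import random
from datetime import datetime
from typing import Dict, List


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Only inject failures on weekdays, while people are around to respond."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(groups: Dict[str, List[str]], probability: float,
                 now: datetime, rng: random.Random) -> Dict[str, str]:
    """For each group, with the given probability, choose one instance to terminate.

    Only the selection logic is shown; nothing is actually terminated.
    """
    if not in_business_hours(now):
        return {}
    victims: Dict[str, str] = {}
    for group, instances in groups.items():
        if instances and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


groups = {"web": ["web-1", "web-2", "web-3"], "api": ["api-1", "api-2"]}
rng = random.Random(7)
print(pick_victims(groups, 0.5, datetime(2024, 1, 3, 10, 0), rng))  # a Wednesday morning
print(pick_victims(groups, 0.5, datetime(2024, 1, 6, 10, 0), rng))  # a Saturday: {}
```

Restricting terminations to staffed hours is the design choice that makes random failure tolerable: the monkeys are released only when someone can watch the monitoring and respond.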
  54. Thank you. Steve Upton (@Steve_Upton)
