Chaos engineering - The art of breaking stuff in production on purpose

An introduction to Chaos Engineering given at the FutureTech 2019 conference

Published in: Internet

  1. CHAOS ENGINEERING: THE FINE ART OF BREAKING STUFF IN PRODUCTION ON PURPOSE. GEERT VAN DER CRUIJSEN, @GEERTVDC
  2. GEERT VAN DER CRUIJSEN, @GEERTVDC: CLOUD NATIVE ARCHITECT, FULL CYCLE DEVELOPER, DEVOPS COACH. #DOEPICSHIT
  3. WHY DO WE NEED CHAOS ENGINEERING?
  4. “IN A COMPLEX LANDSCAPE YOUR APPLICATION IS NEVER FULLY UP”
  5. TRADITIONAL MONITORING TOOLS ARE DEAD!
  6. MEASURE USER IMPACT
  7. MEASURE USER IMPACT: RELIABILITY, AVAILABILITY, LATENCY, THROUGHPUT, CORRECTNESS, FRESHNESS, COVERAGE, QUALITY, DURABILITY
  8. RESILIENT APPLICATIONS: INFRASTRUCTURE, NETWORK, APPLICATION, PEOPLE
  9. GRACEFUL DEGRADATION, FAIL OPEN
  10. GRACEFUL DEGRADATION, FAIL OPEN. BUT WE DO TESTS?
  11. BUT WE DO TESTS? UNIT TESTS: INPUT, UNIT A, OUTPUT
  12. BUT WE DO TESTS? INTEGRATION TESTS: INPUT, COMPONENT / SERVICE A, COMPONENT / SERVICE B, OUTPUT
  13. WHAT IS CHAOS ENGINEERING?
  14. CHAOS ENGINEERING IS NOT RANDOMLY BREAKING STUFF IN PRODUCTION
  15. CHAOS ENGINEERING: “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” https://principlesofchaos.org
  16. CHAOS ENGINEERING: “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” https://principlesofchaos.org
  17. SERVICE INPUT OUTPUT SERVICE
  18. CHAOS ENGINEERING EXPERIMENTS: HOST FAILURE, RESOURCE CAPACITY ATTACKS, APPLICATION FAILURE, NETWORK ATTACKS, BRENT ATTACK
  19. CHAOS ENGINEERING: ONLY IN PRODUCTION?
  20. YOUR FIRST EXPERIMENT: HOW TO START
  21. GAME DAY
  22. INCIDENT RESPONSE LEARNING: OUTAGE, NORMAL, DETECT & ANALYSIS, FIX, LEARN, IMPROVE
  23. CHAOS GAME DAY: CHAOS EXPERIMENT, NORMAL, DETECT & ANALYSIS, FIX, LEARN, IMPROVE
  24. CHAOS EXPERIMENT PHASES: STEADY STATE, DEFINE HYPOTHESIS, DESIGN & EXECUTE, LEARN, FIX, EMBED
  25. STEADY STATE (PHASES: STEADY STATE, DEFINE HYPOTHESIS, DESIGN & EXECUTE, LEARN, FIX, EMBED)
  26. STEADY STATE: MEASURE BUSINESS METRICS. 100 ms of extra load time drops Amazon’s sales by 1%
  27. STEADY STATE: SERVICE UNDER TEST, ROUTING, SERVICE B
  28. STEADY STATE: SERVICE UNDER TEST, ROUTING, SERVICE B, CONTROL SERVICE, EXPERIMENT SERVICE
  29. STEADY STATE: SERVICE UNDER TEST, ROUTING, SERVICE B, CONTROL SERVICE, EXPERIMENT SERVICE, 98%, 1%, 1%
  30. ALWAYS BE ABLE TO ABORT
  31. DEFINE HYPOTHESIS (PHASES: STEADY STATE, DEFINE HYPOTHESIS, DESIGN & EXECUTE, LEARN, FIX, EMBED)
  32. DEFINE HYPOTHESIS: BRAINSTORM WHAT CAN GO WRONG. BRING EVERYONE: DEVELOPERS, SRE / OPERATIONS, NETWORKS, BUSINESS, INFRASTRUCTURE, TESTERS. WHAT CAN GO WRONG? WHAT IF THE DATABASE IS DOWN? WHAT IF A SERVICE RESPONDS SLOWER? WHAT IF MY CACHE RESPONDS SLOWLY? WHAT IF A POD DIES? WHAT IF THE LOAD BALANCER STOPS? WHAT IF…?
  33. STOP IF YOU KNOW THE EXPERIMENT WILL BREAK
  34. DESIGN & EXECUTE EXPERIMENT (PHASES: STEADY STATE, DEFINE HYPOTHESIS, DESIGN & EXECUTE, LEARN, FIX, EMBED)
  35. DESIGN & EXECUTE EXPERIMENT: START SMALL, NOTIFY PEOPLE INVOLVED, SLOWLY INCREASE BLAST RADIUS. TOOLS: GREMLIN.COM, CHAOSTOOLKIT.ORG, GITHUB.COM/NETFLIX/SIMIANARMY, GITHUB.COM/ASOBTI/KUBE-MONKEY
  36. LEARN (PHASES: STEADY STATE, DEFINE HYPOTHESIS, DESIGN & EXECUTE, LEARN, FIX, EMBED)
  37. LEARN: HOW FAST DID WE RECOVER? HOW FAST DID WE DETECT? DO NOT BLAME!
  38. FIX (PHASES: STEADY STATE, DEFINE HYPOTHESIS, DESIGN & EXECUTE, LEARN, FIX, EMBED)
  39. FIX: IMPLEMENT FIX, RERUN EXPERIMENT
  40. EMBED (PHASES: STEADY STATE, DEFINE HYPOTHESIS, DESIGN & EXECUTE, LEARN, FIX, EMBED)
  41. EMBED: ONBOARDING, CONTINUOUS CHAOS, EMBED IN CULTURE
  42. PATTERNS: RESILIENT ARCHITECTURE
  43. MULTI PARALLELISM (PARALLELISM / AVAILABILITY / DOWNTIME PER YEAR): 1 / 99% / 3 DAYS 16 HOURS; 2 / 99.99% / 53 MINUTES; 3 / 99.9999% / 32 SECONDS. HOW PARALLEL IS YOUR CLOUD COMPONENT? REGIONS, AVAILABILITY ZONES
  44. ASYNC COMMUNICATION: SYNC REQUIRES A CONNECTION PER REQUEST. FOCUS ON MESSAGE BASED COMMUNICATION. DECOUPLING: PUB, SUB, LISTENER
  45. QUEUE BASED LOAD DISTRIBUTION
  46. QUEUE BASED LOAD DISTRIBUTION
  47. QUEUE BASED LOAD DISTRIBUTION: SERVICE BUS
  48. IDEMPOTENT APIS (HTTP METHOD: IDEMPOTENCE / SAFETY): GET: YES / YES; HEAD: YES / YES; PUT: YES / NO; DELETE: YES / NO; POST: NO / NO; PATCH: NO / NO
  49. BULKHEAD PATTERN: ISOLATE WORKLOADS LIKE THE HULL OF A SHIP
  50. CIRCUIT BREAKER
  51. CIRCUIT BREAKER: ADD JITTER TO RETRIES
  52. SPLIT RESPONSIBILITIES: READ / WRITE, SHARDING, CQRS
  53. WRAP UP: BIG CULTURE CHANGE, FULL CYCLE DEVELOPERS, PRODUCTION ACCESS, START EXPERIMENTING, START SMALL, CHECK OUT TOOLS, OBSERVABILITY
  54. “CHAOS ENGINEERING DOESN’T CAUSE PROBLEMS, IT JUST REVEALS THEM” NORA JONES – CHAOS ENGINEERING LEAD, SLACK
  55. GEERT VAN DER CRUIJSEN, @GEERTVDC. THANK YOU! ALL PICTURES USED ARE FROM UNSPLASH.COM. BOOKS: Chaos Engineering (O’Reilly), Chaos Engineering Observability (O’Reilly). TOOLS: chaostoolkit.org, gremlin.com, github.com/netflix/simianarmy, github.com/asobti/kube-monkey. RESOURCES: principlesofchaos.org, github.com/dastergon/awesome-chaos-engineering, docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
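
The parallelism figures on slide 43 can be reproduced with a short calculation. The sketch below is an illustration, not part of the talk: it assumes each parallel instance fails independently, that the system is up as long as one instance is up, and that a year has 365 days. The function names are made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent instances when one is enough.

    The system is down only if every instance is down at the same time,
    so the combined unavailability is (1 - single) ** copies.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_seconds_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        availability = parallel_availability(0.99, copies)
        seconds = downtime_seconds_per_year(availability)
        print(f"{copies} instance(s): {availability:.6%} available, "
              f"about {seconds:,.0f} seconds of downtime per year")
```

In practice failures across regions or availability zones are rarely fully independent, so real numbers tend to be worse than this idealized model suggests.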

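
Slide 51 advises adding jitter to retries. The slides do not say which scheme is meant, so the following is only a minimal sketch of one common option, exponential backoff with "full jitter": each retry waits a random time between zero and an exponentially growing cap, so that many clients recovering from the same outage do not retry in lockstep. All names and default values here (`backoff_with_jitter`, `call_with_retries`, `base`, `cap`) are hypothetical.

```python
import random
import time


def backoff_with_jitter(attempt, base=0.1, cap=5.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_attempts=4, sleep=time.sleep, rng=random):
    """Call `operation`, retrying on any exception with jittered backoff.

    The last failure is re-raised once `max_attempts` calls have failed.
    `sleep` and `rng` are injectable so the behaviour can be tested
    without real waiting or real randomness.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_with_jitter(attempt, rng=rng))
```

A circuit breaker would sit around a helper like this and stop calling the dependency altogether after repeated failures; that part is left out to keep the sketch short.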