Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Chaos Engineering: Why Breaking Things Should Be Practised.

2,537 views

Published on

With the rise of micro-services and large-scale distributed architectures, software systems have grown increasingly complex and hard to understand. Adding to that complexity, the velocity of software delivery has also dramatically increased, resulting in failures being harder to predict and contain.

While the cloud allows for high availability, redundancy and fault-tolerance, no single component can guarantee 100% uptime. Therefore, we have to understand availability but especially learn how to design architectures with failure in mind.

And since failures have become more and more chaotic in nature, we must turn to chaos engineering in order to identify failures before they become outages.

In this talk, I will deep dive into availability, reliability and large-scale architectures and make an introduction to chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more resilient systems.

Published in: Technology
  • Be the first to comment

Chaos Engineering: Why Breaking Things Should Be Practised.

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Adrian Hornsby, Cloud Architecture Evangelist @ AWS @adhorn Chaos Engineering: Why Breaking Things Should Be Practiced.
  2. 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Been there?
  3. 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. An Area of Complex & Dynamic Systems
  4. 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Failures are a given and everything will eventually fail over time. Werner Vogels CTO – Amazon.com “ “
  5. 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. System failure rate Early Failures Wear Out Failures Observed Failures Random Failures
  6. 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. System failure rate For high-velocity deployments Early Failures Wear Out Failures Observed Failures Random Failures
  7. 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. … at the Edge
  8. 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building Confidence Through Testing Unit testing of components: • Tested in isolation to ensure function meets expectations. Functional testing of integrations: • Each execution path tested to assure expected results. Is it enough???
  9. 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jesse Robbins – mid 2000’s GameDay: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  10. 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://www.youtube.com/watch?v=zoz0ZjfrQ9s Jesse Robbins – mid 2000’s GameDay: Creating Resiliency Through Destruction
  11. 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix 2013 https://medium.com/netflix-techblog/active-active-for-multi-regional-resiliency-c47719f6685b
  12. 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chaos Monkeys
  13. 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://github.com/Netflix/SimianArmy • Chaos Monkey - Kill instances randomly • Latency Monkey - Induce latency in services • Chaos Gorilla - Simulates AZ and regions failure • Conformity Monkey - Make sure instances follow good practices
  14. 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  15. 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. http://principlesofchaos.org
  16. 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://bit.ly/2uKOJMQ
  17. 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What “really” is Chaos Engineering?
  18. 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”http://principlesofchaos.org
  19. 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  20. 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Failure Injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks!
  21. 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  22. 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” Nora Jones Senior Chaos Engineer, Netflix
  23. 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Before breaking things …
  24. 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Really!! Don’t break things before you have done your home work!
  25. 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. People Application Network & Data Infrastructure
  26. 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Infrastructure
  27. 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability Availability Downtime per year Categories 95% (1-nine) 18 days 6 hours Batch processing, Data extraction, Load jobs. 99% (2-nines) 3 days 15 hours Internal Tools, Project Tracking 99.9% (3-nines) 8 hours 45 minutes Online Commerce 99.99% (4-nines) 52 minutes Video Delivery, Broadcast systems 99.999% (5-nines) 5 minutes Telecom Industry (ATM Transactions) 99.9999% (6-nines) 31 seconds Answering to my loved one* * Joke  http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
  28. 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. System Availability Availability = Normal Operation Time Total Time MTBF** MTBF** + MTTR* = * Mean Time To Repair (MTTR) **Mean Time Between Failure (MTBF)
  29. 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability in Series Component Availability Downtime X 99% (2-nines) 3 days 15 hours Y 99.99% (4-nines) 52 minutes X and Y Combined 98.99% 3 days 16 hours 33 minutes
  30. 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability in Parallel Component Availability Downtime X 99% (2-nines) 3 days 15 hours Two X in parallel 99.99% (4-nines) 52 minutes Three X in parallel 99.9999% (6-nines) 31 seconds
  31. 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability Zone 1 Availability Zone 2 Availability Zone n Multi-AZ Support Instance Failure Application
  32. 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Auto-Scaling • Compute efficiency • Node failure • Traffic spikes • Performance bugs
  33. 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Immutable Infrastructure • No updates on live systems • Always start from a new instance being provisioned • Deploy the new software • Test in different environments (dev, staging) • Deploy to prod (inactive) • Change references (DNS or Load Balancer) • Keep old version around (inactive) • Fast rollback if things go wrong
  34. 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Immutable components are replaced for every deployment, rather than being updated in- place. Immutable Infrastructure
  35. 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Infrastructure as Code • Template of the infrastructure in code. • Version controlled infrastructure. • Repeatable template. • Testable infrastructure. • Automate it!
  36. 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Network & Data
  37. 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Read / Write Sharding RDS DB Instance Read Replica App Instance App Instance App Instance RDS DB Instance Master (Multi-AZ) RDS DB Instance Read Replica RDS DB Instance Read Replica
  38. 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Database Federation Users DB Products DB App Instance App Instance App Instance
  39. 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Database Sharding User ShardID 002345 A 002346 B 002347 C 002348 B 002349 A CBA App Instance App Instance App Instance
  40. 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Message passing for async. patterns A Queue B A Queue BListener Pub-Sub
  41. 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Web Instances Worker Instance Worker Instance Queue API Instance API Instance API Instance API: {DO foo} PUT JOB: {JobID: 0001, Task: DO foo} API: {JobID: 0001} GET JOB: {JobID: 0001, Task: DO foo} Cache Result: { JobID: 0001, Result: bar }
  42. 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Worker Instance Worker Instance Queue API Instance API Instance API Instance Cache Amazon SNS Push Notification User
  43. 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Exponential Backoff
  44. 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Circuit Breaker • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. https://martinfowler.com/bliki/CircuitBreaker.html
  45. 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dynamic Routing with Route53 1. Latency Based Routing 2. Geo DNS 3. Weighted Round Robin 4. DNS Failover Amazon Route53 Resource A In US Resource B in EU User in US
  46. 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dynamic Routing 1. Improve Latency for end-users 2. Disaster Recovery Applications in US West Applications in US East Users from San Francisco Users from New York Service 1 Service 2 Service 3 Service 4 Service 1 Service 2 Service 3 Service 4
  47. 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Application
  48. 48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Stateless Services AZ1 AZ2 AWS Region Data Store Cache Auto-ScalingGroup User
  49. 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Transient state does not belong in the database.
  50. 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CAP Theorem Consistency Availability Partition Tolerance Data is consistent. All nodes see the same state. Every request is non-failing. Service still responds as expected if some nodes crash. Distributed System In the presence of a network partition, you must choose between consistency and availability!
  51. 51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Eventual Consistency … if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Availability An eventually consistent system can return any value before it converges!! https://en.wikipedia.org/wiki/Eventual_consistency Distributed System Every request is non-failing.
  52. 52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Process A Process B Process A Process B Synchronous Asynchronous Waiting Working Continues get or fetch resultGet result
  53. 53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Non-blocking UI https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
  54. 54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Exception Handling
  55. 55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Service Degradation & Fallbacks
  56. 56. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. People
  57. 57. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “It is not failure itself that holds you back; it is the fear of failure that paralyses you.” Brian Tracy
  58. 58. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conway’s Law User UI Team Application Team DBA Team ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Siloed Teams Siloed Applications
  59. 59. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conway’s Law ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Services Cross-Functional Teams
  60. 60. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fire Drills
  61. 61. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Phases of Chaos Engineering
  62. 62. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steady State Hypothesis Design Experiment Verify & Learn Fix GameDay
  63. 63. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steady State
  64. 64. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Steady State? • ”normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  65. 65. What is Steady State? • ”normal” behavior of your system • Business Metric https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
  66. 66. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Business Metrics at work Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  67. 67. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steady State Important: • Know the value range of Healthy State!
  68. 68. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Hypothesis
  69. 69. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” ”What if latency increases by 300ms?” ”What if the database stops?” Make it everyone’s problem!
  70. 70. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Disclaimer! Don’t make an hypothesis that you know will break you!
  71. 71. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix team members Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri, suggest the following inputs for Chaos experiments: • Simulating the failure of an entire region or datacenter. • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production. • Injecting latency between services for a select percentage of traffic over a predetermined period of time. • Function-based chaos (runtime injection): Randomly causing functions to throw exceptions. • Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions. • Time travel: Forcing system clocks out of sync with each other. • Executing a routine in driver code emulating I/O errors. • Maxing out CPU cores on an Elasticsearch cluster.
  72. 72. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Designing Experiment
  73. 73. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Pick hypothesis • Scope the experiment • Identify metrics • Notify the organization
  74. 74. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rules of thumbs • Start with very small • As close as possible to production • Minimize the blast radius. • Have an emergency STOP!
  75. 75. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Run the Experiment
  76. 76. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New Version Users Canary deployment Old Version 99% Users 1% Users Start with .. Dynamic Routing (Route53)
  77. 77. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Verify & Learn
  78. 78. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. DON’T blame that one person …
  79. 79. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick-in? • Time for self healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?
  80. 80. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PostMortems The 5 WHYs Outage Because of … Because of … Because of … Because of …
  81. 81. The Conveyor Belt Accident Question: Why did the associate damage his thumb? Answer: Because his thumb got caught in the conveyor. Question: Why did his thumb get caught in the conveyor? Answer: Because he was chasing his bag, which was on a running conveyor belt. Question: Why did he chase his bag? Answer: Because he placed his bag on the conveyor, but it then turned-on by surprise Question: Why was his bag on the conveyor? Answer: Because he used the conveyor as a table Conclusion: So, the likely root cause of the associate’s damaged thumb is that he simply needed a table, there wasn’t one around, so he used a conveyor as a table. https://www.linkedin.com/pulse/use-5-whys-find-root-causes-peter-abilla/
  82. 82. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. No, seriously. Root Cause is a Fallacy. http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
  83. 83. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. No, seriously. Root Cause is a Fallacy. • Can you clarify if there were any preceding events? • Why would they believe acting in this way was the best course of action to deliver the desired outcome? • Is there another failure mode that could present here? • What decisions or events prior to this made this work before? • Why stop there – are there places to dig deeper that could shine a light more on this? • Did others step in to help, to advise, or to intercede? http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
  84. 84. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fix
  85. 85. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Challenges to Chaos Engineering Mostly Cultural • no time or flexibility to simulate disasters. • teams already spending all of its time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (micro- services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  86. 86. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Changing Culture takes time! Be patient…
  87. 87. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. More Resources • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • https://www.gremlin.com • https://queue.acm.org/detail.cfm?id=2353017 • https://softwareengineeringdaily.com/ • https://github.com/dastergon/awesome-sre • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf • https://medium.com/@NetflixTechBlog • http://principlesofchaos.org • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp • https://github.com/adhorn/awesome-chaos-engineering • https://www.infoq.com/presentations/netflix-chaos-microservices
  88. 88. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Twilio Use-Case Discovering Issues with HTTP/2 via Chaos Testing https://www.twilio.com/blog/2017/10/http2-issues.html ”While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.”
  89. 89. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thanks you! @adhorn https://medium.com/@adhorn

×