
Keynote - Chaos Engineering: Why breaking things should be practiced


Keynote delivered by Madhusudan Shekar on the topic "Chaos Engineering: Why breaking things should be practiced", presented at AWS Community Day, Bangalore 2018.

Keynote - Chaos Engineering: Why breaking things should be practiced

  1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. BENGALURU
  2. Chaos Engineering: Why breaking things should be practiced. Madhusudan Shekar | Oct 6, 2018 | @madhushekar23
  3. Selfie Time… Thank you @adhorn for slides
  4. Chaos Engineering: Why breaking things should be practiced. Madhusudan Shekar | Oct 6, 2018 | @madhushekar23
  5. OLD WORLD IT: Employees at work; Factories + supply chain; Sales channels; Marketing analytics
  6. OLD WORLD IT / NEW WORLD IT: Employees at work; Factories + supply chain; Sales channels; Marketing analytics
  7. NEW WORLD IT: Employees at work; Factories + supply chain; IoT connected things; Online marketing; Continuous supply tracking; Just in time production; Online sales + delivery; Social media
  8. Worked in Dev, Ops problem now
  9. “Failures are a given and everything will eventually fail over time.” Werner Vogels, CTO, Amazon.com
  10. … at the Edge
  11. Building Confidence Through Testing
      • Unit testing of components: tested in isolation to ensure function meets expectations.
      • Functional testing of integrations: each execution path tested to assure expected results.
      Is it enough???
  12. Jesse Robbins. GameDay: Creating Resiliency Through Destruction. https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  13. Jesse Robbins, mid 2000’s. GameDay: Creating Resiliency Through Destruction. https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  14. Netflix 2013. https://medium.com/netflix-techblog/active-active-for-multi-regional-resiliency-c47719f6685b
  15. [slide without text]
  16. https://bit.ly/2uKOJMQ
  17. Twilio Use-Case: Discovering Issues with HTTP/2 via Chaos Testing. “While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.” https://www.twilio.com/blog/2017/10/http2-issues.html
  18. What “really” is Chaos Engineering?
  19. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  20. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  21. Failure Injection
      • Start small & build confidence
      • Application level
      • Host failure
      • Resource attacks (CPU, memory, …)
      • Network attacks (dependencies, latency, …)
      • Region attacks!
  22. [slide without text]
  23. Chaos Changes…: Application; Network & Data; Infrastructure; People
  24. Infrastructure
  25. Availability
      Availability          Downtime per year
      99% (2-nines)         3 days 15 hours
      99.99% (4-nines)      52 minutes
      99.999% (5-nines)     5 minutes
      99.9999% (6-nines)    31 seconds
  26. Availability in Parallel
      Component             Availability          Downtime
      X                     99% (2-nines)         3 days 15 hours
      Two X in parallel     99.99% (4-nines)      52 minutes
      Three X in parallel   99.9999% (6-nines)    31 seconds
  27. Multi-AZ Support. Diagram labels: Availability Zone 1; Availability Zone 2; Availability Zone n; Instance Failure; Application
  28. Auto-Scaling
      • Compute efficiency
      • Node failure
      • Traffic spikes
      • Performance bugs
  29. Immutable Infrastructure
      • No updates on live systems
      • Always start from a new instance being provisioned
      • Deploy the new software
      • Test in different environments (dev, staging)
      • Deploy to prod (inactive)
      • Change references (DNS or Load Balancer)
      • Keep old version around (inactive)
      • Fast rollback if things go wrong
  30. Infrastructure as Code
      • Template of the infrastructure in code.
      • Version controlled infrastructure.
      • Repeatable template.
      • Testable infrastructure.
      • Automate it!
  31. Network & Data
  32. Read / Write Sharding. Diagram labels: three App Instances; RDS DB Instance Master (Multi-AZ); three RDS DB Instance Read Replicas
  33. Database Federation. Diagram labels: Users DB; Products DB; three App Instances
  34. Database Sharding. Diagram labels: shards A, B, C; three App Instances
      User      ShardID
      002345    A
      002346    B
      002347    C
      002348    B
      002349    A
  35. Message passing for async. patterns. Diagram labels: A, Queue, B; A, Queue, B, Listener; Pub-Sub
  36. Diagram labels: Web Instances; three API Instances; Queue; two Worker Instances; Cache
      • API: {DO foo}
      • PUT JOB: {JobID: 0001, Task: DO foo}
      • API: {JobID: 0001}
      • GET JOB: {JobID: 0001, Task: DO foo}
      • Result: { JobID: 0001, Result: bar }
  37. Exponential Backoff
  38. Circuit Breaker
      • Wrap a protected function call in a circuit breaker object, which monitors for failures.
      • If failures reach a certain threshold, the circuit breaker trips.
      https://martinfowler.com/bliki/CircuitBreaker.html
  39. Dynamic Routing with Route53
      1. Latency Based Routing
      2. Geo DNS
      3. Weighted Round Robin
      4. DNS Failover
      Diagram labels: Amazon Route53; Resource A in US; Resource B in EU; User in US
  40. Dynamic Routing
      1. Improve latency for end-users
      2. Disaster Recovery
      Diagram labels: Users from San Francisco; Users from New York; Applications in US West (Service 1 to Service 4); Applications in US East (Service 1 to Service 4)
  41. Application
  42. Stateless Services. Diagram labels: User; AWS Region; AZ1; AZ2; Auto-Scaling Group; Cache; Data Store
  43. Transient state does not belong in the database.
  44. CAP Theorem (Distributed System)
      • Consistency: Data is consistent. All nodes see the same state.
      • Availability: Every request is non-failing.
      • Partition Tolerance: Service still responds as expected if some nodes crash.
      In the presence of a network partition, you must choose between consistency and availability!
  45. Eventual Consistency (Distributed System)
      • Eventual Consistency: “… if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.” https://en.wikipedia.org/wiki/Eventual_consistency
      • Availability: Every request is non-failing.
      An eventually consistent system can return any value before it converges!!
  46. Service Degradation & Fallbacks
  47. People
  48. “It is not failure itself that holds you back; it is the fear of failure that paralyses you.” Brian Tracy
  49. Conway’s Law: “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Diagram labels: User; UI Team; Application Team; DBA Team; Siloed Teams; Siloed Applications
  50. Conway’s Law: “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Diagram labels: Cross-Functional Teams; Services
  51. Fire Drills
  52. Phases of Chaos Engineering
  53. Steady State; Hypothesis; Design Experiment; Verify & Learn; Fix
  54. What is Steady State?
      • “normal” behavior of your system
      https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  55. Hypothesis …?
      • “What if this load balancer breaks?”
      • “What if Redis becomes slow?”
      • “What if a host on Cassandra goes away?”
      • “What if latency increases by 300ms?”
      • “What if the database stops?”
      Make it everyone’s problem!
  56. Disclaimer! Don’t make a hypothesis that you know will break you!
  57. • Pick hypothesis
      • Scope the experiment
      • Identify metrics
      • Notify the organization
      Run the Experiment
      • Start very small
      • As close as possible to production
      • Minimize the blast radius.
      • Have an emergency STOP!
  58. DON’T blame that one person …
  59. Quantifying the result of the experiment
      • Time to detect?
      • Time for notification? And escalation?
      • Time to public notification?
      • Time for graceful degradation to kick in?
      • Time for self healing to happen?
      • Time to recovery, partial and full?
      • Time to all-clear and stable?
  60. PostMortems: The 5 WHYs. Outage; Because of …; Because of …; Because of …; Because of …
  61. Big Challenges to Chaos Engineering: Mostly Cultural
      • no time or flexibility to simulate disasters.
      • teams already spending all of their time fixing things.
      • can be very political.
      • might force deep conversations.
      • deeply invested in a specific technical roadmap (microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.
  62. Changing Culture takes time! Be patient…
  63. More Resources
      • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
      • https://www.gremlin.com
      • https://queue.acm.org/detail.cfm?id=2353017
      • https://softwareengineeringdaily.com/
      • https://github.com/dastergon/awesome-sre
      • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
      • https://medium.com/@NetflixTechBlog
      • http://principlesofchaos.org
      • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
      • https://github.com/adhorn/awesome-chaos-engineering
      • https://www.infoq.com/presentations/netflix-chaos-microservices
      • http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
      • http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
  64. Thank you @madhushekar23
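
The "Availability in Parallel" figures on slide 26 can be reproduced with a short calculation. This is a minimal sketch, not part of the original deck. It assumes the redundant components fail independently of each other, which real systems only approximate. The function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year, used as an approximation


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    # The group is unavailable only when every copy is unavailable at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def downtime_seconds_per_year(availability: float) -> float:
    """Expected downtime per year, in seconds, for a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = parallel_availability(0.99, copies)
        print(f"{copies} x 99%: availability={a:.6%}, "
              f"downtime={downtime_seconds_per_year(a):,.0f} s/year")
```

With a 99% component, one copy gives roughly 3 days 15 hours of downtime per year, two copies give 99.99%, and three copies give 99.9999%, matching the slide. Correlated failures (a shared dependency, a bad deployment) break the independence assumption, which is one reason to test the redundancy with experiments rather than trust the arithmetic.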

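Slide 37 names exponential backoff without showing it. The sketch below is one possible reading, not the speaker's implementation. It doubles the delay ceiling after each failed attempt and, as an added assumption, sleeps a random fraction of that ceiling ("full jitter") so that many clients do not retry in lockstep. The function name, parameters, and defaults are all hypothetical.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is exhausted.

    `sleep` and `rng` are injectable so the behavior can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The ceiling doubles each attempt: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter (an assumption of this sketch): wait a random part of the ceiling.
            sleep(ceiling * rng())
```

A caller would wrap a flaky network call, for example `retry_with_backoff(lambda: client.get(url))`, where `client` and `url` are placeholders. In practice the `except` clause should be narrowed to errors that are actually retryable.
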
Editor's Notes

  • With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. As a result, “random” failures have grown difficult to predict. At the same time, our dependence on these systems has only increased.
  • Traditionally, these sensible measures to gain confidence are taken before systems or applications reach production. Once in production, the traditional approach is to rely on monitoring and logging to confirm that everything is working correctly. If it is behaving as expected, then you don't have a problem. If it is not, and it requires human intervention (troubleshooting, triage, resolution, etc.), then you need to react to the incident and get things working again as fast as possible.
    This implies that once a system is in production, "Don't touch it!"—except, of course, when it's broken, in which case touch it all you want, under the time pressure inherent in an outage response.


    https://queue.acm.org/detail.cfm?id=2353017
  • The term GameDay was coined by Jesse Robbins when he worked at Amazon and was responsible for availability. Jesse created GameDays with the goal of increasing reliability by purposefully creating major failures on a regular basis.
  • Super power with Docker (Dockerfiles) instead of Chef or Puppet.
  • Invest time to save time
  • Writes and updates
    Counters!!!! Not on the DB, use Redis!!
  • Database Federation is where we break up the database by function.
    In our example, we have separated the Forums DB, the Users DB, and the Products DB.
    Of course, cross-functional queries are harder to do, and you may need to do your joins at the application layer for these types of queries.
    This will reduce our database footprint for a while, and the great thing is that it spares you from having to shard until much further down the line.
    This isn’t going to help for single large tables; for this we will need to shard.
  • Sharding is where we break up that single large database into multiple DBs. We might need to do this because of database or table size, or potentially for high write IOPS as well.
    Here is an example of us breaking up a database with a large table into 3 databases. Above we show where each userID is located, but the easiest way to describe how this would work would be to use the example of all users with A-H go into one DB, and I – M go in another, and N – Z go into the third DB.
    Typically this is done by key space and your application has to be aware of where to read from, update and write to for a particular record. ORM support can help here.
    This does create operational complexity, so if you can federate first, do that.

    This can be done with SQL or NoSQL, and DynamoDB does this for you under the covers on the backend as your data size increases and the reads / writes per second scale.
  • Route your website visitors to an alternate location to avoid site outages
  • Does a region fail?
    Full region: no
    Individual services can fail region-wide
    Most of the time, it is a configuration issue
    leading to cascading failures.
  • Eventual consistency, also called optimistic replication, is widely deployed in distributed systems, and has origins in early mobile computing projects. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is a weak guarantee: most stronger models, like linearizability, are trivially eventually consistent, but a system that is merely eventually consistent does not usually fulfill these stronger constraints.
  • The stronger the relationship between the metric and the business outcome you care about, the stronger the signal you have for making actionable decisions.
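
The sharding note above describes routing users by key space (A-H, I-M, N-Z) and says the application has to know where each record lives. A minimal sketch of that lookup follows. The shard names and the `shard_for_user` function are hypothetical, and the sketch assumes usernames start with an ASCII letter.

```python
# Key-space ranges from the note; the shard names are placeholders.
SHARD_RANGES = [
    ("A", "H", "shard-1"),
    ("I", "M", "shard-2"),
    ("N", "Z", "shard-3"),
]


def shard_for_user(username: str) -> str:
    """Return the shard that holds `username`, chosen by its first letter."""
    if not username:
        raise ValueError("username must not be empty")
    first = username[0].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    # Digits, punctuation, and non-ASCII initials are outside this sketch's key space.
    raise ValueError(f"no shard covers usernames starting with {first!r}")
```

Every read, update, and write for a user goes through this lookup first, which is the operational complexity the note warns about. Range-based keys like these can also produce uneven shards if names cluster on a few letters.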

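The circuit breaker described on slide 38 (wrap a protected call, count failures, trip at a threshold) can be sketched as follows. This is an illustrative version, not code from the talk or from the linked article. The class name, the default threshold, and the reset timeout that lets a trial call through after a cool-down are assumptions of this sketch.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the protected function.
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap the risky dependency, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a placeholder, and treat `CircuitOpenError` as the signal to serve a fallback. That pairs naturally with the service degradation idea on slide 46.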