Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018

At Audible, we have invested in chaos engineering. In this session, we describe the experiment frameworks and some of the testing we’ve done on AWS, including using serverless technologies. We also discuss the scalability testing that we performed in order to gain full confidence in our entire system.

  1. Chaos Engineering and Scalability at Audible.com (ARC308). Tyler Lund, Director, Software Engineering, Audible.com. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Agenda
      • What I talk about when I talk about distributed architectures
      • The curious case of the broken downloads
      • A song of ice and chaos
      • Principles of chaos and the Goblet of Fire
      • The art and Zen of implementing chaos
      • What happened?
  5. Largest producer and retailer of audiobooks … worldwide!
  6. To unleash the power of the spoken word
  7. Audible members listened for almost 3 BILLION hours in 2017
  9. Audible downloads (diagram labels): Ownership Stats Table DB Activation Audible Download Service Content Delivery Service Get Static Metadata Content Delivery Engine
  10. Diagram: Service A, Service B, Service C, Service D
  20. Audible downloads (diagram labels): Ownership Stats Table DB Activation Audible Download Service Content Delivery Service Get Static Metadata Content Delivery Engine
  23. Goals of chaos engineering
      • Distributed systems are complex.
      • No one person can understand the entirety.
      • Chaos is inherent.
      • It’s better to accept the chaos.
      • Gain confidence in understanding the system through experiments.
  24. Unit Testing (diagram labels): Component A, Input, Output
  25. Integration Testing (diagram labels): Service A, Input, Output
  26. What is chaos engineering?
      • Chaos engineering is the discipline of experimenting on distributed systems to gain confidence in their behavior
      • Fail calls between services or add latency
  27. Implementing chaos engineering
      • Socialization
      • Monitoring
      • Graceful restarts and degradation
      • Targeted chaos
      • Cause a cascading failure
      • Build a failure injection framework
      • Create a chaos automation platform
  28. Socialization
      • Acknowledge the complexity of the system
      • Get support from the business
      • Never let a good problem go to waste
  29. Socialization
      • First CPU hog experiment in 2012
      • Socialized cause and impact of download issue
      • Documented system architecture
      • Created file access experiments
  30. Getting started with chaos
      • Define the steady state
      • Start with non-critical services, in QA
      • Only experiment on the services of teams that want to be
  31. Monitoring: What are your key business metrics?
      • Playback starts/second
      • Adds to cart
      • Orders
      • Membership signups
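One way to make "define the steady state" concrete for a key business metric such as playback starts/second is to derive a tolerance band from baseline samples. The sketch below is only an illustration: the function names, the three-standard-deviation band, and the sample values are assumptions, not something the talk specifies.

```python
from statistics import mean, stdev


def steady_state_band(samples, k=3.0):
    """Return a (low, high) band of k standard deviations around the baseline mean.

    `samples` are baseline KPI readings (for example playback starts/second)
    collected while no experiment is running. The choice of k is an assumption.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma


def within_steady_state(value, band):
    """True if a fresh KPI reading falls inside the steady-state band."""
    low, high = band
    return low <= value <= high


# Hypothetical baseline readings, made up for illustration only.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
band = steady_state_band(baseline)
```

A reading for which `within_steady_state` returns `False` during an experiment is a signal that the hypothesis about the system may not hold.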
  32. Monitoring
  33. Graceful restarts and degradation
      • Start with on/off
      • Bring down a host, service, or DB
      • Spike the CPU
  34. Targeted chaos
      • Start with your own team or one prepared for it
      • Look at large recent issues
  35. Cause a cascading failure
      • Define a hypothesis about the system
      • Test the hypothesis
      • Break something other teams are dependent on
  36. Types of experiments
      • Add latency
      • Make services and dependencies unavailable
      • Exceptions
      • Packet loss
      • Failed requests
      • Resource contention
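Two of the experiment types listed above, added latency and failed requests, can be sketched as a wrapper around a dependency call. This is a minimal illustration under stated assumptions: `inject_faults` and `InjectedFault` are hypothetical names, and this is not Audible's framework or the Gremlin API.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real dependency failure (hypothetical name)."""


def inject_faults(latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Decorator that delays a call and fails it with a given probability.

    `rng` and `sleep` are injectable so experiments and tests stay deterministic.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a slow dependency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping a client call with, say, `@inject_faults(latency_s=0.2, failure_rate=0.1)` lets the caller's timeouts, retries, and fallbacks be exercised without touching the real dependency.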
  37. Create a failure injection framework
  39. Create a failure injection framework
      • Web UI
      • Host agent
      • Service framework injection
  40. Latency Injection (diagram labels): Service A, Service B, Injector
  41. Gremlin proxy injector
  42. Audible downloads (diagram labels): Ownership Stats Table DB Activation Audible Download Service Content Delivery Service Get Static Metadata Content Delivery Engine
  43. Create a chaos automation platform
      • Calculate the maximum impact the KPI can experience
      • Run control and experiment clusters
      • Route based on impact
      • Stop experiment if KPI suffers or customers experience pain
      • Figure out problem with time, not when paged
  44. Diagram labels: Client, LB (98%, 1%, 1%), Chaos Automation, Service A - Prod, Service A - Control, Service A - Experiment
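The 98% / 1% / 1% split in the diagram can be sketched as deterministic weighted routing. The mapping of percentages to the prod, control, and experiment clusters is an assumption read off the label order, and hashing a request ID is an illustrative choice rather than a description of Audible's load balancer.

```python
import hashlib

# Assumed mapping of the slide's percentages to clusters.
WEIGHTS = (("prod", 98), ("control", 1), ("experiment", 1))


def route(request_id, weights=WEIGHTS):
    """Pick a cluster for a request, stable for a given request_id."""
    total = sum(weight for _, weight in weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the ID so the same request always lands in the same bucket.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for cluster, weight in weights:
        upper += weight
        if bucket < upper:
            return cluster
    raise AssertionError("unreachable: weights cover every bucket")
```

Because control and experiment receive equal, small shares, their KPIs can be compared directly, and the worst-case customer impact is bounded by the experiment share of traffic.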
  45. High Deviation: Stop Experiment
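A stop condition of this kind can be sketched by comparing the same KPI on the control and experiment clusters and halting once the relative deviation crosses a threshold. The function names and the 5% threshold are assumptions chosen for illustration.

```python
def relative_deviation(control_kpi, experiment_kpi):
    """Relative difference of the experiment KPI from the control KPI."""
    if control_kpi <= 0:
        raise ValueError("control KPI must be positive")
    return abs(experiment_kpi - control_kpi) / control_kpi


def should_stop(control_kpi, experiment_kpi, max_deviation=0.05):
    """True when the experiment cluster deviates too far from control."""
    return relative_deviation(control_kpi, experiment_kpi) > max_deviation


def run_experiment(readings, max_deviation=0.05):
    """Scan (control, experiment) KPI pairs in order.

    Returns the index of the reading that triggers a stop,
    or None if the experiment runs to completion.
    """
    for index, (control_kpi, experiment_kpi) in enumerate(readings):
        if should_stop(control_kpi, experiment_kpi, max_deviation):
            return index
    return None
```

In a real platform the stop would also roll traffic back to the production cluster; that step is left out here.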
  46. Run chaos all the time, everywhere
      • Run on all services, critical and non-critical
      • Run with limited warning
      • Run in production
      • Run often
      • Get feedback
  47. Results
      • Show examples of resilience.
      • Link back to previous issues.
      • Can’t be proven with integration tests.
      • Find issues you wouldn’t otherwise find. Retry storms.
  48. Our results
      • Development team ownership
      • Prioritizing experiments
      • Building a framework
      • Preventing customer impact
  49. Takeaways
      • Everyone can and should do chaos
      • Implementing chaos educates and improves engineering teams
      • Do no harm to the customer
      • Involve business partners
  50. Chaos engineering doesn’t cause problems. Chaos engineering reveals them.
  51. Related breakouts
      • Tuesday, Nov 27: ARC314 – Globalizing Player Accounts at Riot Games While Maintaining Availability, 1:45 PM – 2:45 PM | Aria East, Plaza Level, Orovada 2
      • Tuesday, Nov 27: ARC307 – How Intuit Turbo Tax Ran Entirely on AWS for 2017 Taxes, 10:45 AM – 11:45 AM | Venetian, Level 2, Venetian F
  52. Thank you! Tyler Lund, tlund@audible.com, @tylopoda