The Case for Chaos

An inside look at the philosophy and methodology around chaos at Netflix. Case studies, lessons learned, chaos concepts.

Published in: Technology

  1. The Case for Chaos – AWS Pop-up Loft. Bruce Wong – Engineering Manager – Chaos Engineering, Netflix
  2. Who am I? Bruce Wong (@bruce_m_wong)
  3. Who am I? Bruce Wong • Netflix since 2010
  4. Who am I? Bruce Wong • Netflix since 2010 • Computer Science
  5. Who am I? Bruce Wong • Netflix since 2010 • Computer Science • Builds Engineering Teams • 5 different teams so far
  6. Agenda • Why? • Case Studies • How you can start chaos testing • Future chaos
  7. Failure is Unavoidable • Disks fail • Power outages, and your generator fails • Software bugs • Human error
  8. What about the cloud?
  9. Cloud Case Study • XSA-108 security vulnerability • ~10% of EC2 instances rebooted • Spread over 5 days • One availability zone at a time
  10. Chaos Validated + Public Cloud Validated
  11. Netflix & Micro-Services http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
  12. Netflix & Micro-Services
  13. (no text on slide)
  14. (no text on slide)
  15. (no text on slide)
  16. (no text on slide)
  17. Graceful Degradation: Product + Engineering Decision
  18. Designing for Failure • Infrastructure Failure: instance terminations (single points of failure), latency, Availability Zone, regional • Application Failure: graceful degradation, software bugs
  19. Testing • Unit testing • Integration testing • Functional testing • Regression testing • Chaos testing • Finding bugs earlier
  20. Resilience needs to be tested • Testing is hard: large and growing data sets, Internet-scale traffic, innovation and new features, change is constant
  21. Resilience needs to be tested • Validate resilience design • Don’t wait for the next outage: uncontrolled, unpredictable • Hope is not a strategy
  22. Types of Chaos: Instances Fail. Lessons • Be as stateless as possible • Autoscaling groups are good • Invest in automation to rebuild state when necessary • Running Chaos Monkey on C*
  23. Types of Chaos: Many Instances Can Fail. Lessons • Cassandra works as expected • Moving traffic back to steady state is just as hard • Infrastructure management tools can be a bottleneck
  24. Types of Chaos: Natural Disasters Happen. Lessons • Cassandra works as expected • Moving traffic back to steady state is just as hard • Infrastructure management can be a bottleneck • Smaller blast radius benefits • Traffic + capacity orchestration is hard
  25. Types of Chaos: Latency. Still Learning • Functional fallbacks don’t account for system limitations • Thread pools • Connection pools • Slow can be hard to find • Slow can be hard to contain • Unbounded queues are BAD
  26. Unbounded Queues • Come in many forms, to name a few: threads, memory, disk • Bounded by physical limitations • VERY difficult to find • Elastic is not infinite
  27. For Example: Memory and Data • Data is important • In-memory queue grows and shrinks • Failure Mode #1: out of memory • NOT A MEMORY LEAK!
  28. For Example: Memory and Data • Data is important • If queue gets to size X: write to disk, flush later • Failure Mode #2: disk full, file descriptors saturated
  29. For Example: Memory and Data • Data is important … but not as important as uptime
  30. Starting Chaos • Start small, very small • Start with simple, stateless systems • Start manually and coordinated • Failure Injection Fridays • Build confidence • Outages are opportunities
  31. Chaos takes time: 2010, 2012, 2014
  32. Aspirational Chaos • Increase frequency & intensity: reduces chance of drift • Infrastructure: continuous latency injection, Chaos Gorilla random AZ weekly, Latency Gorilla, CPU, memory, disk • Application: continuous validation of fallbacks, startup dependency failure injection
  33. Questions
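
The unbounded-queue slides (26 to 29) describe an in-memory queue that grows until the process runs out of memory, and a spill-to-disk variant that fails once the disk or file descriptors run out. One way to read the closing point, "not as important as uptime", is to cap the queue and shed data when it is full. The sketch below illustrates that idea only; it is not Netflix's implementation, and the `SheddingBuffer` class, its method names, and the size limit are all hypothetical.

```python
import queue


class SheddingBuffer:
    """Bounded in-memory buffer that drops new items when full.

    Hypothetical helper for illustration: it trades data loss for
    a hard upper limit on memory use.
    """

    def __init__(self, max_items):
        # queue.Queue treats maxsize <= 0 as "infinite", which is exactly
        # the unbounded behaviour this sketch is meant to avoid.
        if max_items <= 0:
            raise ValueError("max_items must be positive")
        self._queue = queue.Queue(maxsize=max_items)
        self.dropped = 0  # count of shed items, worth exporting as a metric

    def offer(self, item):
        """Enqueue without blocking; return False and count the drop if full."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """Remove and return everything currently buffered, oldest first."""
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items


# Usage: a buffer of 3 accepts the first three events and sheds the rest.
buffer = SheddingBuffer(max_items=3)
accepted = [buffer.offer(n) for n in range(5)]
drained = buffer.drain()
```

Dropping the newest item is only one possible policy; dropping the oldest, or rejecting the caller's request, are equally valid choices depending on which data matters most. Whatever the policy, the drop counter makes the shedding visible, which speaks to the slide's warning that these limits are very difficult to find.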
