Chaos Driven Development (Bruce Wong)

Session slides from Future Insights Live, Vegas 2015:
https://futureinsightslive.com/las-vegas-2015/

Reliability and uptime are critical aspects of any product. It doesn't matter how beautiful the user interface is, how amazing a feature is, or how stunning a cutting-edge product is if it's down. We live in a world where our users expect the products we make to work anytime, every time, all the time. Chaos driven development is the discipline of starting with failure scenarios and designing our products with failure in mind. It forces us to understand our users, our minimum viable product, and what drives our businesses in order to architect the systems on which we build our product's foundation. Innovation is about navigating tradeoffs; chaos driven development helps us both understand and be intentional about the tradeoffs we make. Bruce takes a look at the radical strategies Netflix applies to ensure a reliable customer experience: why chaos, how chaos, what chaos, and the results.

Chaos Driven Development (Bruce Wong)

  1. CHAOS DRIVEN DEVELOPMENT Future Insights Live 2015, Las Vegas Bruce Wong
  2. A LITTLE ABOUT ME • Founder of Chaos Engineering @ Netflix • Computer Science Background • Multiple roles scaling Netflix from 8m to 60m+ subs • Currently Taking a Break @bruce_m_wong
  3. Most enterprises hire people to fix things. Netflix hires people to break things…. …we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes. http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone
  4. http://www.techrepublic.com/article/serious-about-cloud-it-might-be-time-to-look-into-chaos-engineering/ https://gigaom.com/2014/09/11/netflixs-new-chaos-engineering-push-aims-to-hire-staff-to-help-break-its-cloud-based-system/
  5. http://www.cnbc.com/id/102394893
  6. http://www.cnbc.com/id/102394893
  7. CHAOS DEFINED “If it ain’t broke, don’t fix it” - Bert Lance, Nation’s Business 1977 “If it ain’t broke, try harder” - chaos philosophy
  8. CHAOS DEFINED Intentionally introducing failure into a system with the purpose of validating resilience design.
  9. WHY CHAOS? Failure happens.
  10. WHY CHAOS? • Hardware fails • Power outages • Software has bugs • Human error • Natural disasters
  11. http://money.cnn.com/2012/10/30/technology/netflix-hurricane-sandy/
  12. http://www.pcworld.com/article/2691772/how-netflix-survived-the-amazon-ec2-reboot.html https://gigaom.com/2014/10/03/netflix-lost-218-database-servers-during-aws-reboot-and-stayed-online/
  13. BLUE MOONS “Once in a blue moon” will eventually happen
  14. FAULT-TOLERANT DESIGN PRINCIPLES • Eliminate Single Points of Failure • Allow parts of the system to fail independently (Failure Isolation) • Prevent propagation (Failure Containment)
  15. START WITH CONSEQUENCES Chaos Driven Development
  16. MINIMUM VIABLE PRODUCT • Understand your users • Understand your value proposition • Understand your business
  17. PRIORITIZE • Many aspects and features are important • Each has different consequences when it does not work • A product’s value proposition is what drives your business
  18. DESIGN FOR FAILURE What failure isolation might look like
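
The slide's illustration is not part of this transcript, so the following is only a minimal sketch of the failure isolation and containment principles from slide 14, not the design shown in the talk. It assumes a page built from independent sections, each with its own fallback. The names `render_page`, `fetch_playback`, and `fetch_recommendations` are hypothetical and chosen for illustration.

```python
from typing import Callable, Dict


def render_page(sections: Dict[str, Callable[[], str]],
                fallbacks: Dict[str, str]) -> Dict[str, str]:
    """Render each section on its own so one failure cannot take down the page."""
    rendered = {}
    for name, fetch in sections.items():
        try:
            rendered[name] = fetch()
        except Exception:
            # Containment: only this section degrades, using its fallback
            # (or an empty string when no fallback was configured).
            rendered[name] = fallbacks.get(name, "")
    return rendered


def fetch_playback() -> str:
    # Hypothetical core feature that is healthy.
    return "playback ready"


def fetch_recommendations() -> str:
    # Hypothetical secondary dependency that is currently failing.
    raise ConnectionError("recommendation service unavailable")


page = render_page(
    {"playback": fetch_playback, "recommendations": fetch_recommendations},
    {"recommendations": "popular titles"},
)
```

Here the failing recommendations call degrades to a static fallback while playback, the part of the product that carries the value proposition in this made-up example, is unaffected.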
  19. APPLYING CHAOS Validation of fault-tolerant design
  20. BREAKING THE CONNECTION How confident are you? - Next week? - Next month? - After that “quick patch”?
  21. WHAT DOES CHAOS LOOK LIKE? • Types - errors, latency • Duration - how long? • Intensity - how much?
  22. WHAT DOES CHAOS LOOK LIKE? • Return errors for a % of requests • e.g. return HTTP 500 for 1% of requests for 1 minute
  23. WHAT DOES CHAOS LOOK LIKE? • Make it slow(er) - Introduce Latency • e.g. sleep for 10ms on every request for 1 minute
  24. WHAT DOES CHAOS LOOK LIKE? Gradually increase • e.g. sleep for 10ms on every request for 1 minute • then sleep for 100ms on every request for 3 minutes
  25. WHAT DOES CHAOS LOOK LIKE? The design/implementation worked! • microscopic impact, high confidence What if it didn’t work? • smaller impact than an outage • proactively fix it and try again
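
The type, duration, and intensity knobs from slides 21 to 24 can be sketched as a small request wrapper. This is an illustrative sketch under stated assumptions, not Netflix's tooling: `ChaosStage` and `ChaosInjector` are hypothetical names, and the clock, random source, and sleep function are injectable only so the schedule can be exercised without waiting.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


@dataclass
class ChaosStage:
    duration_s: float        # duration: how long the stage runs
    error_rate: float = 0.0  # intensity: fraction of requests answered with HTTP 500
    latency_s: float = 0.0   # intensity: delay added to every request


class ChaosInjector:
    """Hypothetical wrapper that applies a sequence of chaos stages to requests."""

    def __init__(self, stages: Iterable[ChaosStage],
                 clock: Callable[[], float] = time.monotonic,
                 rng: Callable[[], float] = random.random,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.stages = list(stages)
        self.clock = clock
        self.rng = rng
        self.sleep = sleep
        self.started_at = clock()

    def current_stage(self) -> Optional[ChaosStage]:
        # Walk the stages in order until the elapsed time falls inside one.
        elapsed = self.clock() - self.started_at
        for stage in self.stages:
            if elapsed < stage.duration_s:
                return stage
            elapsed -= stage.duration_s
        return None  # the exercise is over; requests pass through untouched

    def handle(self, handler: Callable[[], Response]) -> Response:
        stage = self.current_stage()
        if stage is not None:
            if stage.latency_s > 0:
                self.sleep(stage.latency_s)
            if self.rng() < stage.error_rate:
                return (500, "injected failure")
        return handler()


# Slide 22: HTTP 500 for 1% of requests for 1 minute.
ERRORS = [ChaosStage(duration_s=60, error_rate=0.01)]
# Slide 24: 10 ms of latency for 1 minute, then 100 ms for 3 minutes.
RAMP = [ChaosStage(duration_s=60, latency_s=0.010),
        ChaosStage(duration_s=180, latency_s=0.100)]
```

A service would wrap its request handling with something like `ChaosInjector(RAMP).handle(real_handler)`. Because the schedule ends on its own, the exercise stays controlled and planned in the sense of the chaos-versus-outage comparison later in the deck.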
  26. WHAT DOES AN OUTAGE LOOK LIKE? • Detection takes time (TTD) • Analysis takes time • Resolution takes time (TTR) • Inconvenient times
  27. CHAOS VS OUTAGE Chaos • Controlled • Planned • Intentional • Microscopic user impact Outages • Uncontrolled • Unpredictable • Unintended • Large impact
  28. WHAT ABOUT TESTING? • Testing is good - do it, automate it • While great testing disciplines can find most functional bugs… • Scale, traffic and capacity • System misconfiguration and design limitations
  29. LESSONS LEARNED • Learn more from chaos exercises than from outages • Fixing a failure mode will uncover new ones • Configuration is often overlooked • Tools can break
  30. WHY IS THIS HARD?
  31. WHAT MAKES RESILIENCE DESIGN HARD? • Product and Engineering Decision • Tradeoffs are difficult • Organizational Silos
  32. ORGANIZATIONAL SILOS • Services by Domain • Dev/Ops/Product • Incomplete context
  33. WHAT MAKES CHAOS HARD? In addition to the technical challenges • Organizations rarely incentivize people to try to break production • Misconceptions about complex systems and scale
  34. TAKEAWAYS • What are the consequences? • Start small, start early • Work together - share context • Validate, don’t assume
  35. QUESTIONS?
