Resiliency through failure @ QConNY 2013

Slides from my talk at QCon New York on how Netflix increases resiliency through failure, covering the Chaos Monkey, Chaos Gorilla, Latency Monkey, and others from the Simian Army.

Published in: Technology

  1. Resiliency through failure
      Netflix's Approach to Extreme Availability in the Cloud
      Ariel Tseitlin
      http://www.linkedin.com/in/atseitlin
      @atseitlin
  2. About Netflix
      Netflix is the world’s leading Internet television network with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
      [1] http://ir.netflix.com/
  3. A complex distributed system
  4. How Netflix Streaming Works
      [Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations; Browse; Play; Watch]
  5. [Image slide, no text]
  6. [Image slide, no text]
  7. Our goal is availability
      • Members can stream Netflix whenever they want
      • New users can explore and sign up for the service
      • New members can activate their service and add new devices
  8. Failure is all around us
      • Disks fail
      • Power goes out. And your generator fails.
      • Software bugs introduced
      • People make mistakes
      Failure is unavoidable
  9. We design around failure
      • Exception handling
      • Clusters
      • Redundancy
      • Fault tolerance
      • Fall-back or degraded experience (Hystrix)
      • All to insulate our users from failure
      Is that enough?
  10. It’s not enough
      • How do we know if we’ve succeeded?
      • Does the system work as designed?
      • Is it as resilient as we believe?
      • How do we prevent drifting into failure?
      The typical answer is…
  11. More testing!
      • Unit testing
      • Integration testing
      • Stress testing
      • Exhaustive test suites to simulate and test all failure modes
      Can we effectively simulate a large-scale distributed system?
  12. Building distributed systems is hard
      Testing them exhaustively is even harder
      • Massive data sets and changing shape
      • Internet-scale traffic
      • Complex interaction and information flow
      • Asynchronous nature
      • 3rd party services
      • All while innovating and building features
      Prohibitively expensive, if not impossible, for most large-scale systems
  13. What if we could reduce variability of failures?
  14. There is another way
      • Cause failure to validate resiliency
      • Test design assumptions by stressing them
      • Don’t wait for random failure. Remove its uncertainty by forcing it periodically
  15. And that’s exactly what we did
  16. Instances fail
  17. [Image slide, no text]
  18. Chaos Monkey taught us…
      • State is bad
      • Clusters are good
      • Surviving single instance failure is not enough
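
The idea behind the slides above (force instance failure periodically instead of waiting for it) can be illustrated with a minimal sketch. This is not Netflix's Chaos Monkey code: the function name `unleash_chaos`, the `groups` mapping, the `probability` knob, and the `terminate` hook are all hypothetical names chosen for illustration, and a real tool would call a cloud provider API where `terminate` is invoked.

```python
import random
from typing import Callable, Dict, List, Optional


def unleash_chaos(
    groups: Dict[str, List[str]],
    terminate: Callable[[str], None],
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Terminate at most one randomly chosen instance per group.

    Returns a mapping of group name to the instance that was terminated.
    All names here are illustrative assumptions, not a real API.
    """
    if rng is None:
        rng = random.Random()
    victims: Dict[str, str] = {}
    for group, instances in groups.items():
        if not instances:
            continue  # nothing to terminate in an empty group
        if rng.random() >= probability:
            continue  # this group is spared on this run
        victim = rng.choice(instances)
        terminate(victim)  # hypothetical hook, e.g. a cloud API call
        victims[group] = victim
    return victims
```

Run on a schedule against clustered, stateless services, a loop like this turns a rare surprise into a routine event, which is the point the "State is bad / Clusters are good" lessons are making.
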
  19. Lots of instances fail
  20. Chaos Gorilla
  21. Chaos Gorilla taught us…
      • Hidden assumptions on deployment topology
      • Infrastructure control plane can be a bottleneck
      • Large scale events are hard to simulate
      • Rapidly shifting traffic is error prone
      • Smooth recovery is a challenge
      • Cassandra works as expected
  22. What about larger catastrophes?
      Anyone remember Sandy?
  23. Chaos Kong (*some day soon*)
  24. The Sick and Wounded
  25. Latency Monkey
  26. [Image slide, no text]
  27. Hystrix, RxJava
      http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
  28. Latency Monkey taught us
      • Startup resiliency is often missed
      • An ongoing unified approach to runtime dependency management is important (visibility & transparency get missed otherwise)
      • Know thy neighbor (unknown dependencies)
      • Fall backs can fail too
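
The Latency Monkey and fallback slides above can be sketched together: inject artificial delay into a dependency call, then check that the caller degrades to a fallback instead of hanging. This is a minimal Python illustration under stated assumptions, not the Hystrix API (Hystrix is a Java library) and not Latency Monkey itself; `with_latency`, `call_with_fallback`, and the thread pool size are hypothetical choices.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")

# Shared worker pool so a slow call does not block the caller's thread.
_pool = ThreadPoolExecutor(max_workers=4)


def with_latency(func: Callable[[], T], delay_s: float) -> Callable[[], T]:
    """Wrap func so every call is delayed by delay_s seconds."""
    def slowed() -> T:
        time.sleep(delay_s)  # injected latency
        return func()
    return slowed


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    timeout_s: float,
) -> T:
    """Return primary() if it finishes within timeout_s, else fallback().

    Any error or timeout in primary triggers the fallback. Errors raised
    by the fallback itself propagate to the caller, since fallbacks can
    fail too and that case should stay visible.
    """
    future = _pool.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        return fallback()
```

For example, wrapping a hypothetical `recommendations` call with `with_latency(recommendations, 0.3)` and invoking it through `call_with_fallback(..., timeout_s=0.05)` should return the fallback value, which is the degraded experience the earlier slides describe.
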
  29. Entropy
  30. Clutter accumulates
      • Complexity
      • Cruft
      • Vulnerabilities
      • Cost
  31. Janitor Monkey
  32. Janitor Monkey taught us…
      • Label everything
      • Clutter builds up
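
The "label everything" lesson above can be sketched as a small inventory sweep. This is an illustration only, not Janitor Monkey's actual rules: the `Resource` type, the `owner` tag name, and the 30-day idle threshold are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class Resource:
    """Hypothetical inventory record for a cloud resource."""
    resource_id: str
    tags: Dict[str, str] = field(default_factory=dict)
    last_used: Optional[datetime] = None  # None means never seen in use


def find_clutter(
    resources: List[Resource],
    now: datetime,
    max_idle: timedelta = timedelta(days=30),  # assumed threshold
    required_tag: str = "owner",               # assumed tag name
) -> List[Tuple[str, List[str]]]:
    """Return (resource_id, reasons) for each resource that looks like clutter."""
    flagged: List[Tuple[str, List[str]]] = []
    for res in resources:
        reasons: List[str] = []
        if required_tag not in res.tags:
            reasons.append(f"missing '{required_tag}' tag")
        if res.last_used is None or now - res.last_used > max_idle:
            reasons.append("idle")
        if reasons:
            flagged.append((res.resource_id, reasons))
    return flagged
```

Reporting the reasons rather than deleting outright is a deliberate choice in this sketch: unlabeled resources have no obvious owner to notify, which is why labeling comes first.
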
  33. Ranks of the Simian Army
      • Chaos Monkey
      • Chaos Gorilla
      • Latency Monkey
      • Janitor Monkey
      • Conformity Monkey
      • Circus Monkey
      • Doctor Monkey
      • Howler Monkey
      • Security Monkey
      • Chaos Kong
      • Efficiency Monkey
  34. Observability is key
      • Don’t exacerbate real customer issues with failure exercises
      • Deep system visibility is key to root-cause failures and understand the system
  35. Organizational elements
      • Every engineer is an operator of the service
      • Each failure is an opportunity to learn
      • Blameless culture
      Goal is to create a learning organization
  36. Assembling the Puzzle
  37. Open Source Projects
      Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
      • Priam: Cassandra as a Service
      • Astyanax: Cassandra client for Java
      • CassJMeter: Cassandra test suite
      • Cassandra: Multi-region EC2 datastore support
      • Aegisthus: Hadoop ETL for Cassandra
      • AWS Usage: Spend analytics
      • Governator: Library lifecycle and dependency injection
      • Odin: Cloud orchestration
      • Blitz4j: Async logging
      • Exhibitor: Zookeeper as a Service
      • Curator: Zookeeper Patterns
      • EVCache: Memcached as a Service
      • Eureka / Discovery: Service Directory
      • Archaius: Dynamics Properties Service
      • Edda: Config state with history
      • Denominator
      • Ribbon: REST Client + mid-tier LB
      • Karyon: Instrumented REST Base Serve
      • Servo and Autoscaling Scripts
      • Genie: Hadoop PaaS
      • Hystrix: Robust service pattern
      • RxJava: Reactive Patterns
      • Asgard: AutoScaleGroup based AWS console
      • Chaos Monkey: Robustness verification
      • Latency Monkey
      • Janitor Monkey
      • Bakeries / Aminotor
  38. How does it all fit together?
  39. [Image slide, no text]
  40. Our Current Catalog of Releases
      Free code available at http://netflix.github.com
  41. Takeaways
      Regularly inducing failure in your production environment validates resiliency and increases availability
      Use the NetflixOSS platform to handle the heavy lifting for building large-scale distributed cloud-native applications
  42. Thank you!
      Any questions?
      Ariel Tseitlin
      http://www.linkedin.com/in/atseitlin
      @atseitlin