Resiliency through failure @ QConNY 2013


Slides from my talk at QCon New York on how Netflix increases resiliency through failure, covering the Chaos Monkey, Chaos Gorilla, Latency Monkey, and others from the Simian Army.


1. Resiliency through failure: Netflix's Approach to Extreme Availability in the Cloud. Ariel Tseitlin (@atseitlin)
2. About Netflix: Netflix is the world's leading Internet television network, with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. [1]
3. A complex distributed system
4. How Netflix Streaming Works (architecture diagram): a customer device (PC, PS3, TV…) browses via the web site or Discovery API, plays via the Streaming API, and watches from Open Connect CDN edge locations; supporting pieces include User Data, Personalization, DRM, QoS Logging, CDN Management and Steering, and Content Encoding, spanning Consumer Electronics, AWS Cloud Services, and CDN Edge Locations.
5. (image-only slide)
6. (image-only slide)
7. Our goal is availability
  • Members can stream Netflix whenever they want
  • New users can explore and sign up for the service
  • New members can activate their service and add new devices
8. Failure is all around us
  • Disks fail
  • Power goes out. And your generator fails.
  • Software bugs get introduced
  • People make mistakes
  Failure is unavoidable
9. We design around failure
  • Exception handling
  • Clusters
  • Redundancy
  • Fault tolerance
  • Fall-back or degraded experience via Hystrix (sketched below)
  • All to insulate our users from failure
  Is that enough?
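The fall-back bullet is the Hystrix pattern. As a minimal sketch (the command name, group key, and fallback value here are hypothetical, not from the talk), a Hystrix command wraps a risky remote call and serves a degraded response when that call fails or times out:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical command wrapping a call to a recommendations service.
public class RecommendationsCommand extends HystrixCommand<String> {

    private final String memberId;

    public RecommendationsCommand(String memberId) {
        super(HystrixCommandGroupKey.Factory.asKey("Recommendations"));
        this.memberId = memberId;
    }

    @Override
    protected String run() throws Exception {
        // The real remote call; it may fail or exceed its timeout.
        return fetchPersonalizedRecommendations(memberId);
    }

    @Override
    protected String getFallback() {
        // Degraded experience: a generic, non-personalized row.
        return "popular-titles";
    }

    private String fetchPersonalizedRecommendations(String id) {
        throw new RuntimeException("dependency unavailable"); // simulated failure
    }
}
```

Calling `new RecommendationsCommand("m123").execute()` runs `run()` under a timeout and circuit breaker and, when it fails, returns "popular-titles" from `getFallback()`: the kind of insulation the slide describes.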
10. It's not enough
  • How do we know if we've succeeded?
  • Does the system work as designed?
  • Is it as resilient as we believe?
  • How do we prevent drifting into failure?
  The typical answer is…
11. More testing!
  • Unit testing
  • Integration testing
  • Stress testing
  • Exhaustive test suites to simulate and test all failure modes
  Can we effectively simulate a large-scale distributed system?
12. Building distributed systems is hard. Testing them exhaustively is even harder.
  • Massive data sets and changing shape
  • Internet-scale traffic
  • Complex interaction and information flow
  • Asynchronous nature
  • 3rd-party services
  • All while innovating and building features
  Prohibitively expensive, if not impossible, for most large-scale systems
13. What if we could reduce the variability of failures?
14. There is another way
  • Cause failure to validate resiliency
  • Test design assumptions by stressing them
  • Don't wait for random failure. Remove its uncertainty by forcing it periodically (see the sketch below).
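A minimal sketch of that forcing function, assuming the AWS SDK for Java v1 (the real Chaos Monkey adds opt-in rules, schedules, and safety checks this sketch omits):

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

import java.util.List;
import java.util.Random;

// Core idea only: on a schedule, pick one instance from a cluster
// and terminate it, so resiliency is validated continuously.
public class InstanceKiller {

    private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
    private final Random random = new Random();

    public void killOne(List<String> clusterInstanceIds) {
        String victim = clusterInstanceIds.get(random.nextInt(clusterInstanceIds.size()));
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
        System.out.println("Terminated " + victim + "; the cluster must absorb the loss.");
    }
}
```

Running something like this during business hours, while engineers are watching, turns instance loss from a rare surprise into a routine, observable event.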
15. And that's exactly what we did
16. Instances fail
17. (image-only slide)
18. Chaos Monkey taught us…
  • State is bad
  • Clusters are good
  • Surviving single-instance failure is not enough
19. Lots of instances fail
20. Chaos Gorilla
21. Chaos Gorilla taught us…
  • Hidden assumptions about deployment topology
  • The infrastructure control plane can be a bottleneck
  • Large-scale events are hard to simulate
  • Rapidly shifting traffic is error-prone
  • Smooth recovery is a challenge
  • Cassandra works as expected
22. What about larger catastrophes? Anyone remember Sandy?
23. Chaos Kong (*some day soon*)
24. The Sick and Wounded
25. Latency Monkey
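Latency Monkey's mechanism is to inject artificial delay into service-to-service calls rather than kill instances. A minimal sketch of the idea, assuming a hypothetical `Dependency` interface (the real implementation hooks into Netflix's communication layer):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

// Hypothetical dependency interface, for illustration only.
interface Dependency {
    String call(String request);
}

// Wraps a real dependency and injects random latency into every call,
// so callers' timeout and fallback handling actually gets exercised.
class LatencyInjectingDependency implements Dependency {
    private final Dependency delegate;
    private final int maxDelayMillis;
    private final Random random = new Random();

    LatencyInjectingDependency(Dependency delegate, int maxDelayMillis) {
        this.delegate = delegate;
        this.maxDelayMillis = maxDelayMillis;
    }

    @Override
    public String call(String request) {
        try {
            TimeUnit.MILLISECONDS.sleep(random.nextInt(maxDelayMillis)); // injected delay
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return delegate.call(request);
    }
}
```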
26. (image-only slide)
27. Hystrix, RxJava
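Both libraries the slide names support the same defensive pattern: bound a slow dependency with a timeout and substitute a degraded default. A minimal RxJava 1.x sketch (the one-second budget and the row names are illustrative assumptions):

```java
import java.util.concurrent.TimeUnit;

import rx.Observable;
import rx.schedulers.Schedulers;

public class TimeoutFallbackDemo {
    public static void main(String[] args) {
        String result = Observable
                .fromCallable(() -> {
                    Thread.sleep(5000);                // simulated slow remote call
                    return "personalized-row";
                })
                .subscribeOn(Schedulers.io())          // don't block the caller
                .timeout(1, TimeUnit.SECONDS)          // latency budget
                .onErrorReturn(t -> "default-row")     // degraded fallback
                .toBlocking()
                .single();

        System.out.println(result);                    // prints "default-row"
    }
}
```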
28. Latency Monkey taught us
  • Startup resiliency is often missed
  • An ongoing, unified approach to runtime dependency management is important (visibility and transparency get missed otherwise)
  • Know thy neighbor (unknown dependencies)
  • Fallbacks can fail too
29. Entropy
30. Clutter accumulates
  • Complexity
  • Cruft
  • Vulnerabilities
  • Cost
31. Janitor Monkey
32. Janitor Monkey taught us…
  • Label everything (see the sketch below)
  • Clutter builds up
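The "label everything" lesson is mechanical to enforce. A minimal sketch, assuming the AWS SDK for Java v1 and a hypothetical required `owner` tag, that flags untagged instances as cleanup candidates (pagination omitted for brevity):

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;

// Scan EC2 instances and report those missing an "owner" tag:
// the kind of unlabeled clutter a janitor process targets.
public class UntaggedInstanceScanner {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        for (Reservation reservation : ec2.describeInstances().getReservations()) {
            for (Instance instance : reservation.getInstances()) {
                boolean hasOwner = instance.getTags().stream()
                        .anyMatch(tag -> "owner".equals(tag.getKey()));
                if (!hasOwner) {
                    System.out.println("Cleanup candidate: " + instance.getInstanceId());
                }
            }
        }
    }
}
```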
33. Ranks of the Simian Army
  • Chaos Monkey
  • Chaos Gorilla
  • Latency Monkey
  • Janitor Monkey
  • Conformity Monkey
  • Circus Monkey
  • Doctor Monkey
  • Howler Monkey
  • Security Monkey
  • Chaos Kong
  • Efficiency Monkey
34. Observability is key
  • Don't exacerbate real customer issues with failure exercises
  • Deep system visibility is key to root-causing failures and understanding the system
35. Organizational elements
  • Every engineer is an operator of the service
  • Each failure is an opportunity to learn
  • Blameless culture
  The goal is to create a learning organization
36. Assembling the Puzzle
37. Open Source Projects (legend: GitHub / Techblog, Apache Contributions, Techblog Post, Coming Soon)
  • Priam: Cassandra as a Service
  • Astyanax: Cassandra client for Java
  • CassJMeter: Cassandra test suite
  • Cassandra: Multi-region EC2 datastore support
  • Aegisthus: Hadoop ETL for Cassandra
  • AWS Usage: Spend analytics
  • Governator: Library lifecycle and dependency injection
  • Odin: Cloud orchestration
  • Blitz4j: Async logging
  • Exhibitor: Zookeeper as a Service
  • Curator: Zookeeper patterns
  • EVCache: Memcached as a Service
  • Eureka / Discovery: Service directory
  • Archaius: Dynamic properties service
  • Edda: Config state with history
  • Denominator
  • Ribbon: REST client + mid-tier load balancing
  • Karyon: Instrumented REST base server
  • Servo and autoscaling scripts
  • Genie: Hadoop PaaS
  • Hystrix: Robust service pattern
  • RxJava: Reactive patterns
  • Asgard: AutoScaleGroup-based AWS console
  • Chaos Monkey: Robustness verification
  • Latency Monkey
  • Janitor Monkey
  • Bakeries / Aminator
38. How does it all fit together?
39. (image-only slide)
40. Our Current Catalog of Releases. Free code available at
41. Takeaways
  Regularly inducing failure in your production environment validates resiliency and increases availability.
  Use the NetflixOSS platform to handle the heavy lifting of building large-scale, distributed, cloud-native applications.
42. Thank you! Any questions? Ariel Tseitlin (@atseitlin)