The genre box shots were chosen because we have rights to use them; going forward, we are starting to make specific logos for each project.
Resiliency through failure @ QConNY 2013
@atseitlin
Resiliency through failure
Netflix's Approach to Extreme Availability in the Cloud
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin

About Netflix
Netflix is the world’s leading Internet television network with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series.
http://ir.netflix.com/

Our goal is availability
• Members can stream Netflix whenever they want
• New users can explore and sign up for the service
• New members can activate their service and add new devices

Failure is all around us
• Disks fail
• Power goes out. And your generator fails.
• Software bugs introduced
• People make mistakes
Failure is unavoidable

We design around failure
• Exception handling
• Clusters
• Redundancy
• Fault tolerance
• Fall-back or degraded experience (Hystrix)
• All to insulate our users from failure
Is that enough?

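The fall-back bullet above can be illustrated with a minimal sketch. This is plain Python, not the Hystrix API; `call_with_fallback`, `personalized_rows`, and `popular_rows` are hypothetical names invented for the example, and a real library would add timeouts, circuit breaking, and metrics on top of this idea.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run primary; if it raises, return the fallback result instead."""
    try:
        return primary()
    except Exception:
        # Degraded experience: serve something usable rather than an error.
        return fallback()


def personalized_rows() -> list[str]:
    # Hypothetical dependency call that is currently failing.
    raise ConnectionError("recommendation service unavailable")


def popular_rows() -> list[str]:
    # Hypothetical static fallback that needs no remote call.
    return ["Popular titles"]


rows = call_with_fallback(personalized_rows, popular_rows)
print(rows)
```

The caller still gets a usable response while the dependency is down, which is the "degraded experience" the slide refers to.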
It’s not enough
• How do we know if we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?
The typical answer is…

More testing!
• Unit testing
• Integration testing
• Stress testing
• Exhaustive test suites to simulate and test all failure modes
Can we effectively simulate a large-scale distributed system?

Building distributed systems is hard
Testing them exhaustively is even harder
• Massive data sets and changing shape
• Internet-scale traffic
• Complex interaction and information flow
• Asynchronous nature
• 3rd party services
• All while innovating and building features
Prohibitively expensive, if not impossible, for most large-scale systems

What if we could reduce variability of failures?

There is another way
• Cause failure to validate resiliency
• Test design assumptions by stressing them
• Don’t wait for random failure. Remove its uncertainty by forcing it periodically

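As a rough sketch of "cause failure to validate resiliency", the snippet below terminates one randomly chosen instance in a toy in-memory cluster and then checks that requests are still served. It is not the Chaos Monkey implementation; the `Cluster` class, `induce_failure`, and the instance ids are assumptions made up for illustration.

```python
import random


class Cluster:
    """Toy in-memory cluster; stands in for a real instance group."""

    def __init__(self, instance_ids):
        self.healthy = set(instance_ids)

    def terminate(self, instance_id):
        self.healthy.discard(instance_id)

    def handle_request(self):
        # A request succeeds as long as at least one instance is healthy.
        return len(self.healthy) > 0


def induce_failure(cluster, rng):
    """Terminate one randomly chosen healthy instance and return its id."""
    victim = rng.choice(sorted(cluster.healthy))
    cluster.terminate(victim)
    return victim


rng = random.Random(2013)  # seeded so the run is repeatable
cluster = Cluster(["i-a", "i-b", "i-c"])
victim = induce_failure(cluster, rng)
survived = cluster.handle_request()
print(f"terminated {victim}, still serving: {survived}")
```

Running something like this on a schedule, rather than waiting for an instance to die on its own, is what turns an unpredictable failure into a routine check.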
Chaos Gorilla taught us…
• Hidden assumptions on deployment topology
• Infrastructure control plane can be a bottleneck
• Large-scale events are hard to simulate
• Rapidly shifting traffic is error-prone
• Smooth recovery is a challenge
• Cassandra works as expected

What about larger catastrophes?
Anyone remember Sandy?

Latency Monkey taught us
• Startup resiliency is often missed
• An ongoing unified approach to runtime dependency management is important (visibility & transparency get missed otherwise)
• Know thy neighbor (unknown dependencies)
• Fallbacks can fail too

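The kind of experiment behind these lessons can be sketched as follows: inject artificial delay into a dependency call and check that the caller times out and falls back. This is a simplified illustration using Python's standard library, not Latency Monkey itself; `slow_dependency`, `call_with_timeout`, and the delay values are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_dependency(injected_delay_s):
    """Hypothetical dependency with artificially injected latency."""
    time.sleep(injected_delay_s)
    return "fresh data"


def call_with_timeout(fn, timeout_s, fallback, *args):
    """Return fn(*args), or fallback() if fn does not finish within timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The slow call keeps running in its worker thread; we just stop waiting.
        return fallback()
    finally:
        pool.shutdown(wait=False)


# No injected latency: the real answer arrives in time.
fast = call_with_timeout(slow_dependency, 0.5, lambda: "cached data", 0.0)
# Injected latency longer than the timeout: the caller falls back.
slow = call_with_timeout(slow_dependency, 0.05, lambda: "cached data", 0.3)
print(fast, "|", slow)
```

Note that the fallback here is itself just a function call; if it raised, the caller would see that error, which is one way "fallbacks can fail too".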
Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
• Priam: Cassandra as a Service
• Astyanax: Cassandra client for Java
• CassJMeter: Cassandra test suite
• Cassandra: Multi-region EC2 datastore support
• Aegisthus: Hadoop ETL for Cassandra
• AWS Usage: Spend analytics
• Governator: Library lifecycle and dependency injection
• Odin: Cloud orchestration
• Blitz4j: Async logging
• Exhibitor: Zookeeper as a Service
• Curator: Zookeeper Patterns
• EVCache: Memcached as a Service
• Eureka / Discovery: Service Directory
• Archaius: Dynamics Properties Service
• Edda: Config state with history
• Denominator
• Ribbon: REST Client + mid-tier LB
• Karyon: Instrumented REST Base Serve
• Servo and Autoscaling Scripts
• Genie: Hadoop PaaS
• Hystrix: Robust service pattern
• RxJava: Reactive Patterns
• Asgard: AutoScaleGroup based AWS console
• Chaos Monkey: Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries / Aminotor

Our Current Catalog of Releases
Free code available at http://netflix.github.com

Takeaways
Regularly inducing failure in your production environment validates resiliency and increases availability
Use the NetflixOSS platform to handle the heavy lifting for building large-scale distributed cloud-native applications