Resiliency through failure @ QConNY 2013

Slides from my talk at QCon New York on how Netflix increases resiliency through failure, covering the Chaos Monkey, Chaos Gorilla, Latency Monkey, and others from the Simian Army.

1. Resiliency through failure: Netflix's Approach to Extreme Availability in the Cloud
   Ariel Tseitlin
   http://www.linkedin.com/in/atseitlin | @atseitlin

2. About Netflix
   Netflix is the world's leading Internet television network, with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1].
   [1] http://ir.netflix.com/

3. A complex distributed system
4. How Netflix Streaming Works
   [Architecture diagram] Customer devices and consumer electronics (PC, PS3, TV…) Browse and Play against AWS cloud services (Web Site or Discovery API, User Data, Personalization, Streaming API, DRM, QoS Logging, CDN Management and Steering, Content Encoding) and Watch from Open Connect CDN boxes at the CDN edge locations.
5. (image slide)

6. (image slide)

7. Our goal is availability
   • Members can stream Netflix whenever they want
   • New users can explore and sign up for the service
   • New members can activate their service and add new devices

8. Failure is all around us
   • Disks fail
   • Power goes out. And your generator fails.
   • Software bugs are introduced
   • People make mistakes
   Failure is unavoidable.

9. We design around failure
   • Exception handling
   • Clusters
   • Redundancy
   • Fault tolerance
   • Fall-back or degraded experience (Hystrix; see the sketch below)
   • All to insulate our users from failure
   Is that enough?
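To make the fall-back bullet concrete, here is a minimal sketch of the pattern using the real HystrixCommand API; the command name, group key, and fallback value are illustrative assumptions, not Netflix's production code. When run() fails, Hystrix serves getFallback(), so members see a degraded row rather than an error:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Hypothetical command wrapping a call to a personalization dependency.
    public class RecommendationsCommand extends HystrixCommand<String> {

        private final String memberId;

        public RecommendationsCommand(String memberId) {
            super(HystrixCommandGroupKey.Factory.asKey("Personalization"));
            this.memberId = memberId;
        }

        @Override
        protected String run() {
            // Primary path: call the downstream service (stubbed to fail here).
            throw new RuntimeException("personalization unavailable for " + memberId);
        }

        @Override
        protected String getFallback() {
            // Degraded experience: a static, unpersonalized row instead of an error.
            return "popular-titles";
        }
    }

Calling new RecommendationsCommand("member-123").execute() returns "popular-titles" here, and Hystrix wraps the same command with thread-pool isolation and circuit breaking.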
10. It's not enough
    • How do we know if we've succeeded?
    • Does the system work as designed?
    • Is it as resilient as we believe?
    • How do we prevent drifting into failure?
    The typical answer is…

11. More testing!
    • Unit testing
    • Integration testing
    • Stress testing
    • Exhaustive test suites to simulate and test all failure modes
    Can we effectively simulate a large-scale distributed system?

12. Building distributed systems is hard. Testing them exhaustively is even harder.
    • Massive data sets and changing shape
    • Internet-scale traffic
    • Complex interaction and information flow
    • Asynchronous nature
    • 3rd-party services
    • All while innovating and building features
    Prohibitively expensive, if not impossible, for most large-scale systems.

13. What if we could reduce the variability of failures?

14. There is another way
    • Cause failure to validate resiliency
    • Test design assumptions by stressing them
    • Don't wait for random failure. Remove its uncertainty by forcing it periodically.

15. And that's exactly what we did

16. Instances fail

17. (image slide)

18. Chaos Monkey taught us…
    • State is bad
    • Clusters are good
    • Surviving single-instance failure is not enough
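For illustration, a minimal sketch of the core Chaos Monkey idea, forcing single-instance failure on a schedule, written against the AWS SDK for Java. It is an assumption-laden toy, not the real tool, which adds per-application opt-in/opt-out, scheduling windows, and audit trails:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    import java.util.List;
    import java.util.Random;

    public class ChaosMonkeySketch {

        private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        private final Random random = new Random();

        // Pick one instance of a cluster at random and terminate it.
        // Run this periodically (e.g. once per workday) against each cluster.
        public void killRandomInstance(List<String> clusterInstanceIds) {
            if (clusterInstanceIds.isEmpty()) {
                return; // nothing to terminate
            }
            String victim = clusterInstanceIds.get(random.nextInt(clusterInstanceIds.size()));
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
            System.out.println("Chaos Monkey terminated " + victim);
        }
    }

If the cluster is stateless and sized with redundancy, as the lessons above demand, this termination should be invisible to members.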
19. Lots of instances fail

20. Chaos Gorilla

21. Chaos Gorilla taught us…
    • Hidden assumptions about deployment topology
    • The infrastructure control plane can be a bottleneck
    • Large-scale events are hard to simulate
    • Rapidly shifting traffic is error-prone
    • Smooth recovery is a challenge
    • Cassandra works as expected
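Where Chaos Monkey kills one instance, Chaos Gorilla simulates losing an entire Availability Zone. A minimal sketch of that kind of exercise, again with the AWS SDK for Java; the target zone is an assumption, and the real exercise also coordinates evacuating traffic out of the zone:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.DescribeInstancesRequest;
    import com.amazonaws.services.ec2.model.Filter;
    import com.amazonaws.services.ec2.model.Instance;
    import com.amazonaws.services.ec2.model.Reservation;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    import java.util.ArrayList;
    import java.util.List;

    public class ZoneOutageSketch {

        public static void main(String[] args) {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // Find every running instance in the target zone (pagination omitted).
            DescribeInstancesRequest request = new DescribeInstancesRequest().withFilters(
                    new Filter("availability-zone").withValues("us-east-1a"),
                    new Filter("instance-state-name").withValues("running"));

            List<String> victims = new ArrayList<>();
            for (Reservation reservation : ec2.describeInstances(request).getReservations()) {
                for (Instance instance : reservation.getInstances()) {
                    victims.add(instance.getInstanceId());
                }
            }

            // Simulate the zone going dark all at once.
            if (!victims.isEmpty()) {
                ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victims));
            }
        }
    }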
22. What about larger catastrophes? Anyone remember Sandy?

23. Chaos Kong (*some day soon*)

24. The Sick and Wounded

25. Latency Monkey

26. (image slide)

27. Hystrix, RxJava
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
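The techblog post above is about bounding latency with timeouts and serving fallbacks. A minimal sketch of that idea in RxJava 1.x; the 500 ms budget, the simulated slow call, and the fallback value are illustrative assumptions:

    import java.util.concurrent.TimeUnit;

    import rx.Observable;
    import rx.schedulers.Schedulers;

    public class RxFallbackSketch {

        public static void main(String[] args) {
            Observable<String> recommendations = Observable
                    .fromCallable(RxFallbackSketch::slowDependencyCall)
                    .subscribeOn(Schedulers.io())              // run the call off the caller's thread
                    .timeout(500, TimeUnit.MILLISECONDS)       // bound how long we will wait
                    .onErrorReturn(error -> "popular-titles"); // degraded experience on timeout/failure

            System.out.println(recommendations.toBlocking().single()); // prints "popular-titles"
        }

        private static String slowDependencyCall() throws InterruptedException {
            Thread.sleep(2_000); // simulate a latent dependency
            return "personalized-titles";
        }
    }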
28. Latency Monkey taught us
    • Startup resiliency is often missed
    • An ongoing, unified approach to runtime dependency management is important (visibility and transparency get missed otherwise)
    • Know thy neighbor (unknown dependencies)
    • Fallbacks can fail too
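On the injection side, the essence of Latency Monkey is adding artificial delay to a fraction of calls into a service so that clients' timeouts and fallbacks actually get exercised. A minimal, self-contained sketch; the class and its parameters are hypothetical, not Netflix's implementation:

    import java.util.Random;
    import java.util.concurrent.Callable;
    import java.util.concurrent.TimeUnit;

    public class LatencyInjector {

        private final Random random = new Random();
        private final double injectionRate; // fraction of calls to slow down, e.g. 0.05
        private final long delayMillis;     // artificial delay per affected call

        public LatencyInjector(double injectionRate, long delayMillis) {
            this.injectionRate = injectionRate;
            this.delayMillis = delayMillis;
        }

        // Wrap a downstream call, sometimes making it artificially slow.
        public <T> T call(Callable<T> downstream) throws Exception {
            if (random.nextDouble() < injectionRate) {
                TimeUnit.MILLISECONDS.sleep(delayMillis); // pretend the dependency is sick
            }
            return downstream.call();
        }
    }

Wiring something like this into a service's request path and ramping delayMillis up is what surfaces the lessons above, such as fallbacks that themselves fail under latency.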
29. Entropy

30. Clutter accumulates
    • Complexity
    • Cruft
    • Vulnerabilities
    • Cost

31. Janitor Monkey

32. Janitor Monkey taught us…
    • Label everything
    • Clutter builds up
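The "label everything" lesson is mechanical to act on: anything without an owner label is a cleanup candidate. A minimal sketch with the AWS SDK for Java, assuming an "owner" tag convention; Janitor Monkey itself uses configurable rules and notifies owners before deleting anything:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.Instance;
    import com.amazonaws.services.ec2.model.Reservation;

    public class JanitorSketch {

        public static void main(String[] args) {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // Flag instances carrying no "owner" tag (pagination omitted for brevity).
            for (Reservation reservation : ec2.describeInstances().getReservations()) {
                for (Instance instance : reservation.getInstances()) {
                    boolean hasOwner = instance.getTags().stream()
                            .anyMatch(tag -> "owner".equals(tag.getKey()));
                    if (!hasOwner) {
                        System.out.println("Cleanup candidate: " + instance.getInstanceId());
                    }
                }
            }
        }
    }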
33. Ranks of the Simian Army
    • Chaos Monkey
    • Chaos Gorilla
    • Latency Monkey
    • Janitor Monkey
    • Conformity Monkey
    • Circus Monkey
    • Doctor Monkey
    • Howler Monkey
    • Security Monkey
    • Chaos Kong
    • Efficiency Monkey

34. Observability is key
    • Don't exacerbate real customer issues with failure exercises
    • Deep system visibility is key to root-causing failures and understanding the system

35. Organizational elements
    • Every engineer is an operator of the service
    • Each failure is an opportunity to learn
    • Blameless culture
    The goal is to create a learning organization.

36. Assembling the Puzzle

37. Open Source Projects
    (Color legend from the slide: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon)
    • Priam: Cassandra as a Service
    • Astyanax: Cassandra client for Java
    • CassJMeter: Cassandra test suite
    • Cassandra: multi-region EC2 datastore support
    • Aegisthus: Hadoop ETL for Cassandra
    • AWS Usage: spend analytics
    • Governator: library lifecycle and dependency injection
    • Odin: cloud orchestration
    • Blitz4j: async logging
    • Exhibitor: Zookeeper as a Service
    • Curator: Zookeeper patterns
    • EVCache: Memcached as a Service
    • Eureka / Discovery: service directory
    • Archaius: dynamic properties service
    • Edda: config state with history
    • Denominator
    • Ribbon: REST client + mid-tier load balancing
    • Karyon: instrumented REST base server
    • Servo and autoscaling scripts
    • Genie: Hadoop PaaS
    • Hystrix: robust service pattern
    • RxJava: reactive patterns
    • Asgard: AutoScalingGroup-based AWS console
    • Chaos Monkey: robustness verification
    • Latency Monkey
    • Janitor Monkey
    • Bakeries / Aminator
38. How does it all fit together?

39. (image slide)

40. Our Current Catalog of Releases
    Free code available at http://netflix.github.com

41. Takeaways
    Regularly inducing failure in your production environment validates resiliency and increases availability.
    Use the NetflixOSS platform to handle the heavy lifting of building large-scale, distributed, cloud-native applications.

42. Thank you! Any questions?
    Ariel Tseitlin
    http://www.linkedin.com/in/atseitlin
    @atseitlin
