Resiliency through failure @ QConNY 2013

Slides from my talk at QCon New York on how Netflix increases resiliency through failure, covering the Chaos Monkey, Chaos Gorilla, Latency Monkey, and others from the Simian Army.

Speaker notes

  • The genre box shots were chosen because we have rights to use them; we are starting to make specific logos for each project going forward.

Transcript

  • 1. Resiliency through failure: Netflix's Approach to Extreme Availability in the Cloud. Ariel Tseitlin, http://www.linkedin.com/in/atseitlin, @atseitlin
  • 2. About Netflix: Netflix is the world's leading Internet television network, with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]. [1] http://ir.netflix.com/
  • 3. A complex distributed system
  • 4. How Netflix Streaming Works. [Architecture diagram spanning three domains: Consumer Electronics (Customer Device: PC, PS3, TV…), AWS Cloud Services (Web Site or Discovery API, User Data, Personalization, Streaming API, DRM, QoS Logging, CDN Management and Steering, Content Encoding), and CDN Edge Locations (Open Connect CDN boxes). User actions: Browse, Play, Watch.]
  • 5. [Image-only slide]
  • 6. [Image-only slide]
  • 7. Our goal is availability
      • Members can stream Netflix whenever they want
      • New users can explore and sign up for the service
      • New members can activate their service and add new devices
  • 8. Failure is all around us
      • Disks fail
      • Power goes out. And your generator fails.
      • Software bugs are introduced
      • People make mistakes
    Failure is unavoidable
  • 9. We design around failure
      • Exception handling
      • Clusters
      • Redundancy
      • Fault tolerance
      • Fall-back or degraded experience (Hystrix)
    All to insulate our users from failure. Is that enough?
  • 10. It's not enough
      • How do we know if we've succeeded?
      • Does the system work as designed?
      • Is it as resilient as we believe?
      • How do we prevent drifting into failure?
    The typical answer is…
  • 11. More testing!
      • Unit testing
      • Integration testing
      • Stress testing
      • Exhaustive test suites to simulate and test all failure modes
    Can we effectively simulate a large-scale distributed system?
  • 12. Building distributed systems is hard. Testing them exhaustively is even harder.
      • Massive data sets with changing shape
      • Internet-scale traffic
      • Complex interactions and information flow
      • Asynchronous nature
      • 3rd-party services
      • All while innovating and building features
    Prohibitively expensive, if not impossible, for most large-scale systems
  • 13. What if we could reduce the variability of failures?
  • 14. There is another way
      • Cause failure to validate resiliency
      • Test design assumptions by stressing them
      • Don't wait for random failure. Remove its uncertainty by forcing it periodically.
  • 15. And that's exactly what we did
  • 16. Instances fail
  • 17. [Image-only slide]
  • 18. Chaos Monkey taught us…
      • State is bad
      • Clusters are good
      • Surviving single-instance failure is not enough
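
To make the technique concrete, here is a minimal Chaos-Monkey-style sketch in Java against the AWS SDK: pick one instance at random from an Auto Scaling group and terminate it, on the theory that a healthy cluster should absorb the loss. This is an illustration only, not Netflix's actual Chaos Monkey; the class and method names (MiniChaosMonkey, terminateRandomInstance) are invented for the example.

    import java.util.List;
    import java.util.Random;

    import com.amazonaws.services.autoscaling.AmazonAutoScaling;
    import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
    import com.amazonaws.services.autoscaling.model.AutoScalingGroup;
    import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest;
    import com.amazonaws.services.autoscaling.model.Instance;
    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    // Hypothetical mini Chaos Monkey: terminate one random instance in an ASG.
    public class MiniChaosMonkey {
        private static final Random RANDOM = new Random();

        public static void terminateRandomInstance(String asgName) {
            AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // Look up the group and pick one of its instances at random.
            AutoScalingGroup group = autoScaling.describeAutoScalingGroups(
                    new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName))
                    .getAutoScalingGroups().get(0);
            List<Instance> instances = group.getInstances();
            if (instances.isEmpty()) {
                return; // nothing to kill
            }
            String victim = instances.get(RANDOM.nextInt(instances.size())).getInstanceId();

            // Terminate it. The ASG should replace the instance, and the
            // service should keep serving traffic throughout.
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
            System.out.println("Chaos Monkey terminated " + victim + " in " + asgName);
        }
    }

Run on a schedule during business hours, while engineers are around to observe, this turns "instances fail" from an assumption into a continuously tested fact.
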
  • 19. Lots of instances fail
  • 20. Chaos Gorilla
  • 21. Chaos Gorilla taught us…
      • Hidden assumptions on deployment topology
      • The infrastructure control plane can be a bottleneck
      • Large-scale events are hard to simulate
      • Rapidly shifting traffic is error-prone
      • Smooth recovery is a challenge
      • Cassandra works as expected
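
The same idea scaled up: a Chaos-Gorilla-style exercise takes out everything a service runs in one availability zone, forcing the surviving zones to absorb the traffic. Again a sketch under assumed names (MiniChaosGorilla, failZone), not the real tool; a real exercise would paginate the API calls and coordinate with traffic shifting.

    import com.amazonaws.services.autoscaling.AmazonAutoScaling;
    import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
    import com.amazonaws.services.autoscaling.model.AutoScalingGroup;
    import com.amazonaws.services.autoscaling.model.Instance;
    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    // Hypothetical mini Chaos Gorilla: simulate the loss of one availability zone.
    public class MiniChaosGorilla {
        public static void failZone(String targetZone) {
            AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // First page of groups only; a real tool would paginate.
            for (AutoScalingGroup group : autoScaling.describeAutoScalingGroups()
                    .getAutoScalingGroups()) {
                for (Instance instance : group.getInstances()) {
                    if (targetZone.equals(instance.getAvailabilityZone())) {
                        // Kill every instance the group runs in the target zone.
                        ec2.terminateInstances(new TerminateInstancesRequest()
                                .withInstanceIds(instance.getInstanceId()));
                    }
                }
            }
        }
    }
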
  • 22. What about larger catastrophes? Anyone remember Sandy?
  • 23. Chaos Kong (*some day soon*)
  • 24. The Sick and Wounded
  • 25. Latency Monkey
  • 26. [Image-only slide]
  • 27. Hystrix, RxJava. http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
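
Hystrix is the piece that turns a dependency failure into a degraded experience rather than an outage: each remote call is wrapped in a command with a timeout, a circuit breaker, and a fallback. A minimal sketch follows; the HystrixCommand API is real, while the recommendations service and its client are hypothetical stand-ins.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Wraps a call to a hypothetical recommendations service. If the call
    // fails, times out, or the circuit is open, Hystrix falls back to a
    // static list: a degraded experience instead of an error page.
    public class RecommendationsCommand extends HystrixCommand<String> {
        private final String memberId;

        public RecommendationsCommand(String memberId) {
            super(HystrixCommandGroupKey.Factory.asKey("Recommendations"));
            this.memberId = memberId;
        }

        @Override
        protected String run() throws Exception {
            // Remote call to the personalization tier (hypothetical helper).
            return RecommendationsClient.fetchFor(memberId);
        }

        @Override
        protected String getFallback() {
            // Degraded experience: non-personalized popular titles.
            return "popular-titles";
        }
    }

    // Hypothetical remote client, stubbed so the example is self-contained.
    class RecommendationsClient {
        static String fetchFor(String memberId) {
            throw new RuntimeException("simulated dependency failure");
        }
    }

Usage: new RecommendationsCommand("member-123").execute() returns the live result when the dependency is healthy and "popular-titles" when it is not.
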
  • 28. Latency Monkey taught us
      • Startup resiliency is often missed
      • An ongoing, unified approach to runtime dependency management is important (visibility and transparency get missed otherwise)
      • Know thy neighbor (unknown dependencies)
      • Fallbacks can fail too
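
Latency Monkey induces artificial delays in the RESTful client-server communication layer; the sketch below shows the underlying idea as a plain servlet filter that delays a small, configurable fraction of requests, so callers' timeouts and fallbacks get exercised against a slow, not just dead, dependency. The filter is an invented illustration, not Latency Monkey's actual mechanism.

    import java.io.IOException;
    import java.util.Random;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    // Hypothetical latency injection: delay a fraction of incoming requests
    // so downstream callers must survive a sick (slow) dependency.
    public class LatencyInjectionFilter implements Filter {
        private static final Random RANDOM = new Random();
        private static final double INJECTION_RATE = 0.01; // 1% of requests
        private static final long DELAY_MILLIS = 2000;     // simulated slowness

        @Override
        public void doFilter(ServletRequest request, ServletResponse response,
                FilterChain chain) throws IOException, ServletException {
            if (RANDOM.nextDouble() < INJECTION_RATE) {
                try {
                    Thread.sleep(DELAY_MILLIS); // act like the sick and wounded
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            chain.doFilter(request, response); // then serve normally
        }

        @Override public void init(FilterConfig config) {}
        @Override public void destroy() {}
    }
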
  • 29. Entropy
  • 30. Clutter accumulates
      • Complexity
      • Cruft
      • Vulnerabilities
      • Cost
  • 31. Janitor Monkey
  • 32. Janitor Monkey taught us…
      • Label everything
      • Clutter builds up
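
A Janitor-Monkey-style sweep is easy to sketch: find resources nothing is using, for example unattached EBS volumes past an age threshold, and flag them. This hypothetical sketch (MiniJanitor is an invented name) only reports candidates; a real janitor would notify owners and wait before deleting, which is also where "label everything" pays off, since ownership tags make notification possible.

    import java.util.Date;
    import java.util.concurrent.TimeUnit;

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.DescribeVolumesRequest;
    import com.amazonaws.services.ec2.model.Filter;
    import com.amazonaws.services.ec2.model.Volume;

    // Hypothetical mini janitor: report unattached EBS volumes older than 30 days.
    public class MiniJanitor {
        public static void reportOrphanedVolumes() {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
            Date cutoff = new Date(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30));

            // "available" status means the volume is not attached to any instance.
            DescribeVolumesRequest request = new DescribeVolumesRequest()
                    .withFilters(new Filter("status").withValues("available"));
            for (Volume volume : ec2.describeVolumes(request).getVolumes()) {
                if (volume.getCreateTime().before(cutoff)) {
                    System.out.println("Cleanup candidate: " + volume.getVolumeId()
                            + " (created " + volume.getCreateTime() + ")");
                }
            }
        }
    }
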
  • 33. Ranks of the Simian Army
      • Chaos Monkey
      • Chaos Gorilla
      • Latency Monkey
      • Janitor Monkey
      • Conformity Monkey
      • Circus Monkey
      • Doctor Monkey
      • Howler Monkey
      • Security Monkey
      • Chaos Kong
      • Efficiency Monkey
  • 34. Observability is key
      • Don't exacerbate real customer issues with failure exercises
      • Deep system visibility is key to root-causing failures and understanding the system
  • 35. Organizational elements
      • Every engineer is an operator of the service
      • Each failure is an opportunity to learn
      • Blameless culture
    Goal is to create a learning organization
  • 36. Assembling the Puzzle
  • 37. Open Source Projects (legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon)
      • Priam: Cassandra as a Service
      • Astyanax: Cassandra client for Java
      • CassJMeter: Cassandra test suite
      • Cassandra: multi-region EC2 datastore support
      • Aegisthus: Hadoop ETL for Cassandra
      • AWS Usage: spend analytics
      • Governator: library lifecycle and dependency injection
      • Odin: cloud orchestration
      • Blitz4j: async logging
      • Exhibitor: Zookeeper as a Service
      • Curator: Zookeeper patterns
      • EVCache: Memcached as a Service
      • Eureka / Discovery: service directory
      • Archaius: dynamic properties service
      • Edda: config state with history
      • Denominator
      • Ribbon: REST client + mid-tier load balancing
      • Karyon: instrumented REST base server
      • Servo and autoscaling scripts
      • Genie: Hadoop PaaS
      • Hystrix: robust service pattern
      • RxJava: reactive patterns
      • Asgard: AutoScalingGroup-based AWS console
      • Chaos Monkey: robustness verification
      • Latency Monkey
      • Janitor Monkey
      • Bakeries / Aminator
  • 38. How does it all fit together?
  • 39. [Image-only slide]
  • 40. Our Current Catalog of Releases. Free code available at http://netflix.github.com
  • 41. Takeaways
      • Regularly inducing failure in your production environment validates resiliency and increases availability
      • Use the NetflixOSS platform to handle the heavy lifting for building large-scale, distributed, cloud-native applications
  • 42. Thank you! Any questions? Ariel Tseitlin, http://www.linkedin.com/in/atseitlin, @atseitlin