Your SlideShare is downloading. ×

Resiliency through Failure @ OSCON 2013


Published on

Resiliency through failure talk from OSCON, updated with a few details on application architecture and resiliency. …

Resiliency through failure talk from OSCON, updated with a few details on application architecture and resiliency.

More details at

Published in: Technology
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide
  • The genre box shots were chosen because we have rights to use them, we are starting to make specific logos for each project going forward.
  • Transcript

    • 1. @atseitlin Resiliency through failure Netflix's Approach to Extreme Availability in the Cloud Ariel Tseitlin @atseitlin
    • 2. @atseitlin About Netflix Netflix is the world’s leading Internet television network with more than 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series[1] [1]
    • 3. @atseitlin A complex distributed system
    • 4. @atseitlin How Netflix Streaming Works Customer Device (PC, PS3, TV…) Web Site or Discovery API User Data Personalization Streaming API DRM QoS Logging OpenConnect CDN Boxes CDN Management and Steering Content Encoding Consumer Electronics AWS Cloud Services CDN Edge Locations Browse Play Watch
    • 5. @atseitlin Highly Available Architecture Micro-services, redundancy, resiliency
    • 6. @atseitlin Web Server Dependencies Flow (Home page business transaction as seen by AppDynamics) Start Here memcached Cassandra Web service S3 bucket Personalization movie group chooser Each icon is three to a few hundred instances across three AWS zones
    • 7. @atseitlin Component Micro-Services Test With Chaos Monkey, Latency Monkey
    • 8. @atseitlin Three Balanced Availability Zones Test with Chaos Gorilla Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
    • 9. @atseitlin Triple Replicated Persistence Cassandra maintenance affects individual replicas Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
    • 10. @atseitlin Isolated Regions Will someday test with Chaos Kong Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers
    • 11. @atseitlin Failure Modes and Effects Failure Mode Probability Current Mitigation Plan Application Failure High Automatic degraded response AWS Region Failure Low Wait for region to recover AWS Zone Failure Medium Continue to run on 2 out of 3 zones Datacenter Failure Medium Migrate more functions to cloud Data store failure Low Restore from S3 backups S3 failure Low Restore from remote archive Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
    • 12. @atseitlin Application Resilience Run what you wrote Rapid detection Rapid Response Fail often
    • 13. @atseitlin Run What You Wrote • Make developers responsible for failures – Then they learn and write code that doesn’t fail • Use Incident Reviews to find gaps to fix – Make sure its not about finding “who to blame” • Keep timeouts short, fail fast – Don’t let cascading timeouts stack up
    • 14. @atseitlin Rapid Detection • If your pilot had no instument panel, would you ever board fly on a plane? – Never run your service blind • Monitor services, not instances – Make instance failure a non-event • Don’t pay people to watch screens – Instead pay them to build alerting
    • 15. @atseitlin Edda AWS Instances, ASGs, et c. Eureka Services metadata AppDynamics Request flow Edda – Configuration History
    • 16. @atseitlin Edda Query Examples Find any instances that have ever had a specific public IP address $ curl "http://edda/api/v2/view/instances;publicIpAddress=;_since=0" ["i-0123456789","i-012345678a","i-012345678b”] Show the most recent change to a security group $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2" --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810 +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504 @@ -1,33 +1,33 @@ { … "ipRanges" : [ "", "", + "", - "" … }
    • 17. @atseitlin Rapid Rollback • Use a new Autoscale Group to push code • Leave existing ASG in place, switch traffic • If OK, auto-delete old ASG a few hours later • If “whoops”, switch traffic back in seconds
    • 18. @atseitlin Asgard
    • 19. @atseitlin
    • 20. @atseitlin
    • 21. @atseitlin Our goal is availability • Members can stream Netflix whenever they want • New users can explore and sign up for the service • New members can activate their service and add new devices
    • 22. @atseitlin Failure is all around us • Disks fail • Power goes out. And your generator fails. • Software bugs introduced • People make mistakes Failure is unavoidable
    • 23. @atseitlin We design around failure • Exception handling • Clusters • Redundancy • Fault tolerance • Fall-back or degraded experience (Hystrix) • All to insulate our users from failure Is that enough?
    • 24. @atseitlin It’s not enough • How do we know if we’ve succeeded? • Does the system work as designed? • Is it as resilient as we believe? • How do we prevent drifting into failure? The typical answer is…
    • 25. @atseitlin More testing! • Unit testing • Integration testing • Stress testing • Exhaustive test suites to simulate and test all failure mode Can we effectively simulate a large- scale distributed system?
    • 26. @atseitlin Building distributed systems is hard Testing them exhaustively is even harder • Massive data sets and changing shape • Internet-scale traffic • Complex interaction and information flow • Asynchronous nature • 3rd party services • All while innovating and building features Prohibitively expensive, if not impossible, for most large-scale systems
    • 27. @atseitlin What if we could reduce variability of failures?
    • 28. @atseitlin There is another way • Cause failure to validate resiliency • Test design assumption by stressing them • Don’t wait for random failure. Remove its uncertainty by forcing it periodically
    • 29. @atseitlin And that’s exactly what we did
    • 30. @atseitlin Instances fail
    • 31. @atseitlin
    • 32. @atseitlin Chaos Monkey taught us… • State is bad • Clusters are good • Surviving single instance failure is not enough
    • 33. @atseitlin Lots of instances fail
    • 34. @atseitlin Chaos Gorilla
    • 35. @atseitlin Chaos Gorilla taught us… • Hidden assumptions on deployment topology • Infrastructure control plane can be a bottleneck • Large scale events are hard to simulate • Rapidly shifting traffic is error prone • Smooth recovery is a challenge • Cassandra works as expected
    • 36. @atseitlin What about larger catastrophes? Anyone remember Sandy?
    • 37. @atseitlin Chaos Kong (*some day soon*)
    • 38. @atseitlin The Sick and Wounded
    • 39. @atseitlin Latency Monkey
    • 40. @atseitlin
    • 41. @atseitlin Resilient Design – Hystrix, RxJava
    • 42. @atseitlin Latency Monkey taught us • Startup resiliency is often missed • An ongoing unified approach to runtime dependency management is important (visibility & transparency gets missed otherwise) • Know thy neighbor (unknown dependencies) • Fall backs can fail too
    • 43. @atseitlin Entropy
    • 44. @atseitlin Clutter accumulates • Complexity • Cruft • Vulnerabilities • Cost
    • 45. @atseitlin Janitor Monkey
    • 46. @atseitlin Janitor Monkey taught us… • Label everything • Clutter builds up
    • 47. @atseitlin Ranks of the Simian Army • Chaos Monkey • Chaos Gorilla • Latency Monkey • Janitor Monkey • Conformity Monkey • Circus Monkey • Doctor Monkey • Howler Monkey • Security Monkey • Chaos Kong • Efficiency Monkey
    • 48. @atseitlin Observability is key • Don’t exacerbate real customer issues with failure exercises • Deep system visibility is key to root-cause failures and understand the system
    • 49. @atseitlin Organizational elements • Every engineer is an operator of the service • Each failure is an opportunity to learn • Blameless culture Goal is to create a learning organization
    • 50. @atseitlin Assembling the Puzzle
    • 51. @atseitlin Netflix Highly Available Platform now open @NetflixOSS
    • 52. @atseitlin Open Source Projects Github / Techblog Apache Contributions Techblog Post Coming Soon Priam Cassandra as a Service Astyanax Cassandra client for Java CassJMeter Cassandra test suite Cassandra Multi-region EC2 datastore support Aegisthus Hadoop ETL for Cassandra Ice Spend analytics Governator Library lifecycle and dependency injection Odin Cloud orchestration Blitz4j Async logging Exhibitor Zookeeper as a Service Curator Zookeeper Patterns EVCache Memcached as a Service Eureka / Discovery Service Directory Archaius Dynamics Properties Service Edda Config state with history Denominator Ribbon REST Client + mid-tier LB Karyon Instrumented REST Base Serve Servo and Autoscaling Scripts Genie Hadoop PaaS Hystrix Robust service pattern RxJava Reactive Patterns Asgard AutoScaleGroup based AWS console Chaos Monkey Robustness verification Latency Monkey Janitor Monkey Bakeries / Aminotor Legend
    • 53. @atseitlin How does it all fit together?
    • 54. @atseitlin
    • 55. @atseitlin Our Current Catalog of Releases Free code available at
    • 56. @atseitlin We’re hiring! • Simian Army • Cloud Tools • NetflixOSS • Cloud Operations • Reliability Engineering • Edge Services • Many, many more
    • 57. @atseitlin Takeaways Create fine-grained micro-services. Don’t trust your dependencies. Regularly inducing failure in your production environment validates resiliency and increases availability Netflix has built and deployed a scalable global and highly available Platform as a Service and opened sourced it (NetflixOSS) @atseitlin @NetflixOSS
    • 58. @atseitlin Thank you! Any questions? Ariel Tseitlin @atseitlin