Resiliency through Failure @ OSCON 2013


Resiliency through failure talk from OSCON, updated with a few details on application architecture and resiliency.

Published in: Technology
  • The genre box shots were chosen because we have rights to use them; we are starting to make specific logos for each project going forward.

    1. 1. @atseitlin Resiliency through failure Netflix's Approach to Extreme Availability in the Cloud Ariel Tseitlin @atseitlin
    2. 2. @atseitlin About Netflix Netflix is the world’s leading Internet television network with more than 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
    3. 3. @atseitlin A complex distributed system
    4. 4. @atseitlin How Netflix Streaming Works [Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations; Browse; Play; Watch]
    5. 5. @atseitlin Highly Available Architecture Micro-services, redundancy, resiliency
    6. 6. @atseitlin Web Server Dependencies Flow (home page business transaction as seen by AppDynamics) [Diagram labels: Start Here; memcached; Cassandra; Web service; S3 bucket; Personalization movie group chooser] Each icon is three to a few hundred instances across three AWS zones
    7. 7. @atseitlin Component Micro-Services Test With Chaos Monkey, Latency Monkey
    8. 8. @atseitlin Three Balanced Availability Zones: test with Chaos Gorilla [Diagram labels: Load Balancers; Cassandra and Evcache Replicas in Zone A, Zone B, and Zone C]
    9. 9. @atseitlin Triple Replicated Persistence: Cassandra maintenance affects individual replicas [Diagram labels: Load Balancers; Cassandra and Evcache Replicas in Zone A, Zone B, and Zone C]
    10. 10. @atseitlin Isolated Regions: will someday test with Chaos Kong [Diagram labels: US-East Load Balancers with Cassandra Replicas in Zone A, Zone B, and Zone C; EU-West Load Balancers with Cassandra Replicas in Zone A, Zone B, and Zone C]
    11. 11. @atseitlin Failure Modes and Effects
        Failure Mode | Probability | Current Mitigation Plan
        Application Failure | High | Automatic degraded response
        AWS Region Failure | Low | Wait for region to recover
        AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
        Datacenter Failure | Medium | Migrate more functions to cloud
        Data store failure | Low | Restore from S3 backups
        S3 failure | Low | Restore from remote archive
        Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
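The zone-failure mitigation on this slide (keep running on 2 out of 3 zones) implies a capacity constraint. The sketch below works through the arithmetic under two assumptions that are not stated in the talk: traffic is spread evenly across zones, and load from a lost zone redistributes evenly to the survivors. The function names are illustrative.

```python
from fractions import Fraction


def max_safe_utilization(zones: int, zones_lost: int = 1) -> Fraction:
    """Highest steady-state utilization per zone that still leaves room
    to absorb the traffic of `zones_lost` failed zones (assumes even load)."""
    if not 0 <= zones_lost < zones:
        raise ValueError("zones_lost must be in [0, zones)")
    return Fraction(zones - zones_lost, zones)


def utilization_after_failure(utilization: Fraction, zones: int,
                              zones_lost: int = 1) -> Fraction:
    """Per-zone utilization once the surviving zones pick up the lost traffic."""
    if not 0 <= zones_lost < zones:
        raise ValueError("zones_lost must be in [0, zones)")
    return utilization * Fraction(zones, zones - zones_lost)
```

With three zones, this model says each zone should run at no more than two thirds of its capacity in steady state if the service is to survive the loss of one zone without shedding load.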
    12. 12. @atseitlin Application Resilience Run what you wrote Rapid detection Rapid Response Fail often
    13. 13. @atseitlin Run What You Wrote • Make developers responsible for failures – Then they learn and write code that doesn’t fail • Use Incident Reviews to find gaps to fix – Make sure it’s not about finding “who to blame” • Keep timeouts short, fail fast – Don’t let cascading timeouts stack up
    14. 14. @atseitlin Rapid Detection • If your pilot had no instrument panel, would you ever fly on a plane? – Never run your service blind • Monitor services, not instances – Make instance failure a non-event • Don’t pay people to watch screens – Instead pay them to build alerting
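One way to read "keep timeouts short, fail fast" is to bound every dependency call with a deadline and return a fallback when it is missed. The following is a minimal sketch using only the Python standard library; `call_with_timeout` and the two sample dependencies are hypothetical names, not code from the talk.

```python
import concurrent.futures
import time

# Shared worker pool for dependency calls (the size is an arbitrary choice).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() in the pool; return `fallback` if it misses the deadline."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting; the worker thread may still finish later.
        future.cancel()
        return fallback


def slow_dependency():
    time.sleep(0.5)  # simulates a dependency that is responding slowly
    return "fresh"


def fast_dependency():
    return "fresh"
```

Because the caller gives up after `timeout_s`, a slow dependency costs at most that long per request instead of holding the request open for the full half second.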
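"Monitor services, not instances" can be illustrated by alerting on an aggregate error rate instead of on any single machine. This is a sketch under assumed inputs: the `InstanceStats` record, the function names, and the 5% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    instance_id: str
    requests: int
    errors: int


def service_error_rate(stats):
    """Error rate over all requests served by the service, across instances."""
    total = sum(s.requests for s in stats)
    if total == 0:
        return 0.0
    return sum(s.errors for s in stats) / total


def should_alert(stats, threshold=0.05):
    """Alert on the service-level rate, not on any single instance."""
    return service_error_rate(stats) > threshold
```

In this model a single failing instance that serves little traffic does not page anyone, while a modest error rate spread across the whole fleet does.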
    15. 15. @atseitlin Edda – Configuration History • Edda: AWS instances, ASGs, etc. • Eureka: services metadata • AppDynamics: request flow
    16. 16. @atseitlin Edda Query Examples
        Find any instances that have ever had a specific public IP address:
            $ curl "http://edda/api/v2/view/instances;publicIpAddress=;_since=0"
            ["i-0123456789","i-012345678a","i-012345678b"]
        Show the most recent change to a security group:
            $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
            --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
            +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
            @@ -1,33 +1,33 @@
            {
            …
            "ipRanges" : [
            "",
            "",
            + "",
            - ""
            …
            }
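The Edda queries on this slide attach modifiers to a resource path with semicolons, either as bare flags (`_diff`, `_all`) or as `key=value` pairs (`_limit=2`). The helper below is a sketch that reproduces only that URL shape; `edda_url` is a hypothetical name, it was inferred from the two slide examples and not checked against Edda's documentation, and it performs no network call.

```python
def edda_url(base, path, *flags, **params):
    """Join a resource path with ';'-separated modifiers, as in the slide."""
    parts = [base.rstrip("/") + path]
    parts.extend(flags)                                    # bare flags, e.g. _diff
    parts.extend(f"{k}={v}" for k, v in params.items())    # key=value modifiers
    return ";".join(parts)


# Rebuilds the security-group query shown on the slide.
url = edda_url(
    "http://edda",
    "/api/v2/aws/securityGroups/sg-0123456789",
    "_diff", "_all",
    _limit=2,
)
```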
    17. 17. @atseitlin Rapid Rollback • Use a new Autoscale Group to push code • Leave existing ASG in place, switch traffic • If OK, auto-delete old ASG a few hours later • If “whoops”, switch traffic back in seconds
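The rollback flow on this slide can be modeled as a small state machine: push a new group alongside the old one, switch traffic, then either retire the old group or switch back. This is an in-memory sketch only; the `Cluster` class and its method names are hypothetical and do not represent Asgard or the AWS API.

```python
class Cluster:
    """Tracks which Auto Scaling Group (ASG) currently receives traffic."""

    def __init__(self, initial_asg):
        self.asgs = [initial_asg]
        self.active = initial_asg

    def push(self, new_asg):
        """Create the new ASG next to the old one and switch traffic to it."""
        self.asgs.append(new_asg)
        previous = self.active
        self.active = new_asg
        return previous  # keep a handle so we can roll back quickly

    def rollback(self, previous_asg):
        """Switch traffic back; possible only while the old ASG still exists."""
        if previous_asg not in self.asgs:
            raise ValueError("old ASG already deleted, cannot roll back")
        self.active = previous_asg

    def retire(self, old_asg):
        """Delete an old ASG once the new one has proven itself."""
        if old_asg == self.active:
            raise ValueError("refusing to delete the ASG serving traffic")
        self.asgs.remove(old_asg)
```

The point of leaving the old group in place is visible in `rollback`: reverting is a pointer switch, not a redeploy.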
    18. 18. @atseitlin Asgard
    19. 19. @atseitlin
    20. 20. @atseitlin
    21. 21. @atseitlin Our goal is availability • Members can stream Netflix whenever they want • New users can explore and sign up for the service • New members can activate their service and add new devices
    22. 22. @atseitlin Failure is all around us • Disks fail • Power goes out. And your generator fails. • Software bugs introduced • People make mistakes Failure is unavoidable
    23. 23. @atseitlin We design around failure • Exception handling • Clusters • Redundancy • Fault tolerance • Fall-back or degraded experience (Hystrix) • All to insulate our users from failure Is that enough?
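The "fall-back or degraded experience" bullet names Hystrix, which is a Java library. The sketch below is not Hystrix's API; it is a minimal circuit breaker in Python that illustrates the general pattern of short-circuiting to a fallback after repeated failures. The class name, thresholds, and injectable clock are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Returns a fallback while a dependency keeps failing."""

    def __init__(self, failure_threshold=3, reset_after_s=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # time the circuit opened, or None if closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Otherwise half-open: let one trial call through below.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the failing dependency receives no traffic at all, which gives it room to recover and keeps callers from queuing behind it.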
    24. 24. @atseitlin It’s not enough • How do we know if we’ve succeeded? • Does the system work as designed? • Is it as resilient as we believe? • How do we prevent drifting into failure? The typical answer is…
    25. 25. @atseitlin More testing! • Unit testing • Integration testing • Stress testing • Exhaustive test suites to simulate and test all failure modes Can we effectively simulate a large-scale distributed system?
    26. 26. @atseitlin Building distributed systems is hard Testing them exhaustively is even harder • Massive data sets and changing shape • Internet-scale traffic • Complex interaction and information flow • Asynchronous nature • 3rd party services • All while innovating and building features Prohibitively expensive, if not impossible, for most large-scale systems
    27. 27. @atseitlin What if we could reduce variability of failures?
    28. 28. @atseitlin There is another way • Cause failure to validate resiliency • Test design assumptions by stressing them • Don’t wait for random failure. Remove its uncertainty by forcing it periodically
    29. 29. @atseitlin And that’s exactly what we did
    30. 30. @atseitlin Instances fail
    31. 31. @atseitlin
    32. 32. @atseitlin Chaos Monkey taught us… • State is bad • Clusters are good • Surviving single instance failure is not enough
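The idea behind Chaos Monkey, forcing instance failure on a schedule instead of waiting for it, can be sketched in a few lines. This is not the Simian Army implementation: `unleash_chaos`, the group dictionary, and the `terminate` callback are hypothetical stand-ins for real cluster metadata and a real termination call.

```python
import random


def unleash_chaos(groups, terminate, rng, probability=1.0):
    """Terminate at most one randomly chosen instance per group.

    groups:      mapping of group name -> list of instance ids
    terminate:   callback that actually kills an instance
    rng:         random.Random instance (seedable for repeatable drills)
    probability: chance that a given group is hit on this run
    """
    victims = []
    for name in sorted(groups):
        instances = groups[name]
        if not instances or rng.random() >= probability:
            continue
        victim = rng.choice(instances)
        terminate(victim)
        victims.append(victim)
    return victims
```

Limiting each run to one instance per group matches the first lesson on the slide: a service built as a stateless cluster should not notice the loss.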
    33. 33. @atseitlin Lots of instances fail
    34. 34. @atseitlin Chaos Gorilla
    35. 35. @atseitlin Chaos Gorilla taught us… • Hidden assumptions on deployment topology • Infrastructure control plane can be a bottleneck • Large scale events are hard to simulate • Rapidly shifting traffic is error prone • Smooth recovery is a challenge • Cassandra works as expected
    36. 36. @atseitlin What about larger catastrophes? Anyone remember Sandy?
    37. 37. @atseitlin Chaos Kong (*some day soon*)
    38. 38. @atseitlin The Sick and Wounded
    39. 39. @atseitlin Latency Monkey
    40. 40. @atseitlin
    41. 41. @atseitlin Resilient Design – Hystrix, RxJava
    42. 42. @atseitlin Latency Monkey taught us • Startup resiliency is often missed • An ongoing unified approach to runtime dependency management is important (visibility & transparency gets missed otherwise) • Know thy neighbor (unknown dependencies) • Fall backs can fail too
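Latency injection, the technique behind Latency Monkey, can be illustrated with a wrapper that delays a call before running it. This is a sketch only: the `inject_latency` decorator and its parameters are hypothetical, and the real tool is described in the talk as acting on service calls, not on local Python functions.

```python
import functools
import random
import time


def inject_latency(delay_s, probability=1.0, rng=random, sleep=time.sleep):
    """Decorator that delays a call by `delay_s` with the given probability."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_s)  # simulate a slow dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Passing a custom `sleep` makes the delay observable in tests without waiting, and wrapping a fallback path as well as the primary path is one way to check the slide's last lesson, that fallbacks can fail too.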
    43. 43. @atseitlin Entropy
    44. 44. @atseitlin Clutter accumulates • Complexity • Cruft • Vulnerabilities • Cost
    45. 45. @atseitlin Janitor Monkey
    46. 46. @atseitlin Janitor Monkey taught us… • Label everything • Clutter builds up
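Both Janitor Monkey lessons, label everything and expect clutter to build up, suggest a cleanup pass that flags resources with no owner label or no recent use. The sketch below assumes a simple resource record with `id`, `owner`, and `last_used` fields and a 30-day idle limit; none of these come from the talk or from the real tool.

```python
from datetime import datetime, timedelta


def find_cleanup_candidates(resources, now, max_idle=timedelta(days=30)):
    """Return (resource id, reason) pairs for resources that look like clutter."""
    candidates = []
    for resource in resources:
        if not resource.get("owner"):
            # Unlabeled resources cannot be traced back to a team.
            candidates.append((resource["id"], "missing owner label"))
        elif now - resource["last_used"] > max_idle:
            candidates.append((resource["id"], "unused"))
    return candidates
```

A conservative version of this would notify the owner first and delete only after a grace period, which is possible only when the owner label exists.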
    47. 47. @atseitlin Ranks of the Simian Army • Chaos Monkey • Chaos Gorilla • Latency Monkey • Janitor Monkey • Conformity Monkey • Circus Monkey • Doctor Monkey • Howler Monkey • Security Monkey • Chaos Kong • Efficiency Monkey
    48. 48. @atseitlin Observability is key • Don’t exacerbate real customer issues with failure exercises • Deep system visibility is key to root-cause failures and understand the system
    49. 49. @atseitlin Organizational elements • Every engineer is an operator of the service • Each failure is an opportunity to learn • Blameless culture Goal is to create a learning organization
    50. 50. @atseitlin Assembling the Puzzle
    51. 51. @atseitlin Netflix Highly Available Platform now open @NetflixOSS
    52. 52. @atseitlin Open Source Projects
        Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
        • Priam: Cassandra as a Service
        • Astyanax: Cassandra client for Java
        • CassJMeter: Cassandra test suite
        • Cassandra: Multi-region EC2 datastore support
        • Aegisthus: Hadoop ETL for Cassandra
        • Ice: Spend analytics
        • Governator: Library lifecycle and dependency injection
        • Odin: Cloud orchestration
        • Blitz4j: Async logging
        • Exhibitor: Zookeeper as a Service
        • Curator: Zookeeper Patterns
        • EVCache: Memcached as a Service
        • Eureka / Discovery: Service Directory
        • Archaius: Dynamic Properties Service
        • Edda: Config state with history
        • Denominator
        • Ribbon: REST Client + mid-tier LB
        • Karyon: Instrumented REST Base Server
        • Servo and Autoscaling Scripts
        • Genie: Hadoop PaaS
        • Hystrix: Robust service pattern
        • RxJava: Reactive Patterns
        • Asgard: AutoScaleGroup based AWS console
        • Chaos Monkey: Robustness verification
        • Latency Monkey
        • Janitor Monkey
        • Bakeries / Aminator
    53. 53. @atseitlin How does it all fit together?
    54. 54. @atseitlin
    55. 55. @atseitlin Our Current Catalog of Releases Free code available at
    56. 56. @atseitlin We’re hiring! • Simian Army • Cloud Tools • NetflixOSS • Cloud Operations • Reliability Engineering • Edge Services • Many, many more
    57. 57. @atseitlin Takeaways Create fine-grained micro-services. Don’t trust your dependencies. Regularly inducing failure in your production environment validates resiliency and increases availability. Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS). @atseitlin @NetflixOSS
    58. 58. @atseitlin Thank you! Any questions? Ariel Tseitlin @atseitlin