MassTLC Cloud Summit Keynote

My keynote at the MassTLC Cloud Summit on Oct 8th on the Netflix architecture and future in the cloud

  1. 1. @atseitlin Netflix Cloud Platform: Netflix's evolution in the cloud. Ariel Tseitlin @atseitlin
  2. 2. @atseitlin About Netflix: Netflix is the world’s leading Internet television network, with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
  3. 3. @atseitlin Original Content
  4. 4. @atseitlin Critical Acclaim
  5. 5. @atseitlin A complex distributed system
  6. 6. @atseitlin How Netflix Streaming Works. Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding. Tiers: Consumer Electronics; AWS Cloud Services; CDN Edge Locations. Stages: Browse; Play; Watch
  7. 7. @atseitlin Highly Available Architecture: micro-services, redundancy, resiliency
  8. 8. @atseitlin Web Server Dependencies Flow: home page business transaction. Diagram labels: Start Here; memcached; Cassandra; Web service; S3 bucket; Personalization movie group chooser. Each icon is three to a few hundred instances across three AWS zones
  9. 9. @atseitlin Component Micro-Services: test with Chaos Monkey, Latency Monkey
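The idea of testing micro-services by removing instances at random can be sketched in a few lines. This is only an in-memory illustration, not the actual Chaos Monkey implementation: the fleet layout, the instance IDs, and the rule of sparing a service's last instance are all assumptions made for the example, and no cloud API is called.

```python
import random


def pick_victims(instances_by_service, rng):
    """Pick one random instance per service, skipping single-instance services.

    Sparing the last instance is an assumption of this sketch.
    """
    victims = {}
    for service, instances in instances_by_service.items():
        if len(instances) > 1:
            victims[service] = rng.choice(sorted(instances))
    return victims


def terminate(instances_by_service, victims):
    """Return the fleet with the chosen victims removed (simulation only)."""
    return {
        service: [i for i in instances if i != victims.get(service)]
        for service, instances in instances_by_service.items()
    }


# Hypothetical fleet: service name -> instance IDs.
fleet = {"api": ["i-a1", "i-a2", "i-a3"], "web": ["i-w1", "i-w2"], "batch": ["i-b1"]}
victims = pick_victims(fleet, random.Random(42))
survivors = terminate(fleet, victims)
```

A service that keeps answering requests after `terminate` has removed one of its instances passes the test; one that does not has exposed a missing redundancy.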
  10. 10. @atseitlin Three Balanced Availability Zones: test with Chaos Gorilla. Diagram: Load Balancers; Cassandra and EVCache replicas in Zone A, Zone B, and Zone C
  11. 11. @atseitlin Triple Replicated Persistence: Cassandra maintenance affects individual replicas. Diagram: Load Balancers; Cassandra and EVCache replicas in Zone A, Zone B, and Zone C
  12. 12. @atseitlin Isolated Regions: will someday test with Chaos Kong. Diagram: US-East Load Balancers with Cassandra replicas in Zone A, Zone B, and Zone C; EU-West Load Balancers with Cassandra replicas in Zone A, Zone B, and Zone C
  13. 13. @atseitlin Failure Modes and Effects
      Failure Mode         | Probability | Current Mitigation Plan
      Application Failure  | High        | Automatic degraded response
      AWS Region Failure   | Low         | Wait for region to recover
      AWS Zone Failure     | Medium      | Continue to run on 2 out of 3 zones
      Datacenter Failure   | Medium      | Migrate more functions to cloud
      Data store failure   | Low         | Restore from S3 backups
      S3 failure           | Low         | Restore from remote archive
      Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
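Continuing to run on 2 out of 3 zones implies that the two surviving zones can absorb the full load. A rough sizing sketch follows, assuming traffic is spread evenly across zones; the function name and the figure of 300 peak instances are hypothetical and not taken from the talk.

```python
import math


def per_zone_instances(peak_instances, zones=3, zones_lost=1):
    """Instances each zone needs so the surviving zones still cover peak load."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances / surviving)


peak = 300  # hypothetical number of instances needed at peak
without_headroom = math.ceil(peak / 3)      # 100 per zone, no zone can be lost
with_headroom = per_zone_instances(peak)    # 150 per zone, one zone can be lost
overprovision = (with_headroom * 3) / peak  # 1.5x total capacity
```

Under these assumptions, tolerating one zone failure costs about 50% extra capacity, unless auto-scaling can add instances in the surviving zones quickly enough.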
  14. 14. @atseitlin Application Resilience: Run what you wrote; Rapid detection; Rapid response; Fail often
  15. 15. @atseitlin Run What You Wrote • Make developers responsible for failures – Then they learn and write code that doesn’t fail • Use Incident Reviews to find gaps to fix – Make sure it’s not about finding “who to blame” • Keep timeouts short, fail fast – Don’t let cascading timeouts stack up
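A minimal sketch of "keep timeouts short, fail fast": wrap a dependency call in a short timeout and return a degraded fallback instead of waiting. This uses only the Python standard library and is not how Hystrix or any Netflix library is implemented; the function names, the 0.2 second budget, and the fallback values are assumptions for illustration.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it is too slow or raises."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # fail fast: do not let the caller's own timeout stack up
    except Exception:
        return fallback  # degraded response on error
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


def slow_personalization():
    time.sleep(1.0)  # simulates a dependency that has become slow
    return ["personalized row"]


def fast_personalization():
    return ["personalized row"]


result_slow = call_with_timeout(slow_personalization, 0.2, ["popular titles"])
result_fast = call_with_timeout(fast_personalization, 0.2, ["popular titles"])
```

Note that the slow call keeps running in its worker thread after the caller has moved on; a production library would also bound the number of such in-flight calls.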
  16. 16. @atseitlin Rapid Detection • If your pilot had no instrument panel, would you ever fly on a plane? – Never run your service blind • Monitor services, not instances – Make instance failure a non-event • Don’t pay people to watch screens – Instead pay them to build alerting
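"Monitor services, not instances" can be illustrated by alerting on an error rate aggregated across the whole service. The sketch below is an assumption-laden example: the data layout, the 5% threshold, and the instance IDs are invented for illustration.

```python
def service_error_rate(instance_stats):
    """Aggregate (requests, errors) pairs across all instances of one service."""
    total_requests = sum(requests for requests, _ in instance_stats.values())
    total_errors = sum(errors for _, errors in instance_stats.values())
    return total_errors / total_requests if total_requests else 0.0


def should_alert(instance_stats, threshold=0.05):
    """Alert only when the service as a whole exceeds the error threshold."""
    return service_error_rate(instance_stats) > threshold


# One instance failing every request: a non-event at the service level.
one_bad_instance = {"i-1": (1000, 2), "i-2": (1000, 3), "i-3": (50, 50)}

# Every instance failing about 10% of requests: a service-level problem.
service_wide = {"i-1": (1000, 90), "i-2": (1000, 110), "i-3": (1000, 100)}

alert_single = should_alert(one_bad_instance)
alert_service = should_alert(service_wide)
```

In the first case automation can simply replace the bad instance; only the second case should alert a human.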
  17. 17. @atseitlin Rapid Rollback • Use a new Autoscale Group to push code • Leave existing ASG in place, switch traffic • If OK, auto-delete old ASG a few hours later • If “whoops”, switch traffic back in seconds
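The rollback flow on this slide can be modelled as a small state machine. This is an in-memory sketch only: the class, its method names, and the group names are hypothetical, and it makes no AWS or Asgard calls.

```python
class Deployment:
    """Tracks which Auto Scaling Group (ASG) receives traffic."""

    def __init__(self, live_asg):
        self.asgs = {live_asg}
        self.live = live_asg
        self.previous = None

    def push(self, new_asg):
        """Start a new ASG beside the old one, then switch traffic to it."""
        self.asgs.add(new_asg)
        self.previous, self.live = self.live, new_asg

    def rollback(self):
        """Switch traffic back to the old ASG, which is still running."""
        if self.previous is None:
            raise RuntimeError("no previous ASG to roll back to")
        self.live, self.previous = self.previous, None

    def cleanup(self):
        """Delete the old ASG once the new one has proven healthy."""
        if self.previous is not None:
            self.asgs.discard(self.previous)
            self.previous = None


bad = Deployment("app-v001")
bad.push("app-v002")
bad.rollback()   # "whoops": traffic returns to app-v001

good = Deployment("app-v001")
good.push("app-v002")
good.cleanup()   # all OK: app-v001 is deleted later
```

Rollback is fast because it only changes where traffic goes; nothing has to be rebuilt or redeployed.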
  18. 18. @atseitlin Asgard
  19. 19. @atseitlin Made possible in the cloud: APIs, Elasticity, Efficiency
  20. 20. @atseitlin APIs • Control everything (start, terminate, scale) • Inject failure • Monitor & audit • Automate operations
  21. 21. @atseitlin Elasticity • Capacity planning replaced with forecasting • Dynamic load-based auto-scaling • New data centers at the click of a button
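One simple form of load-based auto-scaling resizes the group in proportion to how far a load metric is from its target. The sketch below assumes average CPU as the metric, with a 60% target and size limits of 3 and 300; these values and the proportional rule itself are illustrative assumptions, not the policy Netflix uses.

```python
import math


def desired_instances(current, avg_cpu, target_cpu=60.0, min_size=3, max_size=300):
    """Scale the group so average CPU moves toward the target, within limits."""
    raw = math.ceil(current * avg_cpu / target_cpu)
    return max(min_size, min(max_size, raw))


scale_up = desired_instances(10, 90.0)     # load above target: grow the group
scale_down = desired_instances(10, 30.0)   # load below target: shrink the group
at_floor = desired_instances(4, 15.0)      # never go below min_size
at_ceiling = desired_instances(200, 120.0) # never go above max_size
```

The minimum keeps enough instances to survive failures during quiet periods, and the maximum bounds cost if the metric misbehaves.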
  22. 22. @atseitlin Efficiency • ~10x trough to peak ratio. Fill trough with batch workloads • Optimize machine class for each service • Highly available red/black deployments
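A roughly 10x trough-to-peak ratio means that capacity held for the peak is mostly idle in the trough. The arithmetic can be sketched as follows, assuming capacity is reserved at the peak level; the hourly demand figures are invented for the example.

```python
def idle_capacity(hourly_demand, reserved):
    """Instances left over each hour after serving demand."""
    return [max(0, reserved - demand) for demand in hourly_demand]


# Hypothetical instances needed per hour, with a 10x trough-to-peak swing.
demand = [100, 100, 200, 500, 1000, 700]
reserved = max(demand)

ratio = max(demand) / min(demand)
idle = idle_capacity(demand, reserved)
batch_instance_hours = sum(idle)  # capacity available to batch workloads
```

With these numbers, more than half of the reserved instance-hours are free for batch jobs that can run whenever streaming demand is low.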
  23. 23. @atseitlin Coming soon to a cloud near you: Billing & Payments, Big Data & Analytics, SaaS
  24. 24. @atseitlin Billing & Payments • PCI compliance • Privacy & security • Intermediate step of cache in the cloud
  25. 25. @atseitlin Big Data & Analytics • On deck for cloud migration • ETL already in cloud with EMR (Hadoop) • Many cloud alternatives but not yet as mature as the old guard
  26. 26. @atseitlin Corporate systems moving to SaaS • Email (Exchange->Google Apps) • Expense Management (Concur->Workday) • Document sharing (File Servers->Box) • Goal is 100% SaaS
  27. 27. @atseitlin
  28. 28. @atseitlin Open Source Projects
      Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
      • Priam: Cassandra as a Service
      • Astyanax: Cassandra client for Java
      • CassJMeter: Cassandra test suite
      • Cassandra: Multi-region EC2 datastore support
      • Aegisthus: Hadoop ETL for Cassandra
      • Ice: Spend analytics
      • Governator: Library lifecycle and dependency injection
      • Odin: Cloud orchestration
      • Blitz4j: Async logging
      • Exhibitor: Zookeeper as a Service
      • Curator: Zookeeper Patterns
      • EVCache: Memcached as a Service
      • Eureka / Discovery: Service Directory
      • Archaius: Dynamic Properties Service
      • Edda: Config state with history
      • Denominator
      • Ribbon: REST Client + mid-tier LB
      • Karyon: Instrumented REST Base Serve
      • Servo and Autoscaling Scripts
      • Genie: Hadoop PaaS
      • Hystrix: Robust service pattern
      • RxJava: Reactive Patterns
      • Asgard: AutoScaleGroup based AWS console
      • Chaos Monkey: Robustness verification
      • Latency Monkey
      • Janitor Monkey
      • Bakeries / Aminator
  29. 29. @atseitlin
  30. 30. @atseitlin Our Current Catalog of Releases Free code available at
  31. 31. @atseitlin We’re hiring! • Simian Army • Cloud Tools • NetflixOSS • Cloud Operations • Reliability Engineering • Many, many more
  32. 32. @atseitlin Takeaways: Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS). The Cloud enables elasticity, efficiency, and fine-grained control via APIs. Credit cards, Big Data, and the rest of the corporate systems are next to move to the Cloud. @atseitlin @NetflixOSS
  33. 33. @atseitlin Thank you! Any questions? Ariel Tseitlin @atseitlin