Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013


If we start with the need to make the business more agile and responsive to opportunities and competitive threats, a big component of the time taken is the development and delivery of web services. Cloud Native architecture delivers speed, scalability, and security through automation of continuously delivered single-function microservices with a denormalized NoSQL back end. In the case of Netflix, the streaming service is deployed globally using Cassandra to provide cross-zone and cross-region replication. NetflixOSS is a set of open source components that anyone can use to help them adopt Cloud Native architectures, and there is even a prize for the best open source contributions to NetflixOSS.

  1. Cloud Native at Netflix: What Changed? July 2013. Adrian Cockcroft, @adrianco #netflixcloud @NetflixOSS
  2. Cloud Native / Netflix Architecture / NetflixOSS
  3. Cloud Native: What is it? Why?
  4. Engineers: solve hard problems, build amazing and complex things, fix things when they break
  5. Strive for perfection: perfect code, perfect hardware, perfectly operated
  6. But perfection takes too long… Compromises: time to market vs. quality. Utopia remains out of reach
  7. Where time to market wins big: making a land grab, disrupting competitors (OODA), anything delivered as web services
  8. The OODA loop: Observe (land grab opportunity, competitive move, customer pain point analysis, measure customers), Orient (get buy-in, plan response, research alternatives), Decide (commit resources), Act (implement, deliver, engage customers). Colonel Boyd, USAF: “Get inside your adversaries' OODA loop to disorient them”
  9. How soon? Code features in days instead of months. Get hardware in minutes instead of weeks. Incident response in seconds instead of hours
  10. A new engineering challenge: construct a highly agile and highly available service from ephemeral and assumed-broken components
  11. Inspiration
  12. How to get to Cloud Native: freedom and responsibility for developers, decentralize and automate ops activities, integrate DevOps into the business organization
  13. Four Transitions • Management: integrated roles in a single organization – Business, Development, Operations -> BusDevOps • Developers: denormalized data – NoSQL – decentralized, scalable, available, polyglot • Responsibility from Ops to Dev: continuous delivery – decentralized small daily production updates • Responsibility from Ops to Dev: agile infrastructure – cloud – hardware in minutes, provisioned directly by developers
  14. Netflix BusDevOps Organization: Chief Product Officer; VP Product Management with Directors of Product; VPs of UI Engineering, Discovery Engineering, and Platform, each with Directors of Development and Developers + DevOps, their own data sources, and AWS. Denormalized, independently updated and scaled data; code, independently updated via continuous delivery; cloud, self-service updated and scaled infrastructure
  15. Decentralized Deployment
  16. Asgard Developer Portal
  17. Ephemeral Instances • Largest services are autoscaled • Average lifetime of an instance is 36 hours (diagram: Push, Autoscale Up, Autoscale Down)
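The autoscaling behind slide 17 can be sketched as a simple capacity rule: size the fleet to demand, clamped between a floor and a ceiling. This is an illustrative sketch only; `perInstanceCapacity` and the min/max bounds are assumed parameters, not Netflix's actual Asgard/AWS scaling policy.

```java
// Hypothetical autoscaling rule: size the fleet to current demand,
// clamped between a floor and a ceiling. Illustrative only.
public class Autoscaler {
    static int desiredInstances(double requestsPerSec,
                                double perInstanceCapacity,
                                int min, int max) {
        // Round up so capacity always covers demand.
        int needed = (int) Math.ceil(requestsPerSec / perInstanceCapacity);
        return Math.max(min, Math.min(max, needed));
    }

    public static void main(String[] args) {
        // A spike from 1,000 to 5,000 req/s scales the group up to its ceiling.
        System.out.println(desiredInstances(1000, 100, 2, 50)); // 10
        System.out.println(desiredInstances(5000, 100, 2, 50)); // 50
    }
}
```

Instances scaled up this way are exactly the "ephemeral and assumed broken" components of slide 10: the group shrinks again when load drops, which is why the average instance lives only 36 hours.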
  18. Netflix Streaming: a Cloud Native application based on an open source platform
  19. Netflix Member Web Site Home Page: personalization driven – how does it work?
  20. How Netflix Streaming Works: the customer device (PC, PS3, TV…) talks to the web site or Discovery API; AWS cloud services handle user data, personalization, the streaming API, DRM, QoS logging, CDN management and steering, and content encoding; OpenConnect CDN boxes at CDN edge locations serve video to consumer electronics
  21. Streaming bandwidth: mean bandwidth +39% in the 6 months from Nov 2012 to March 2013
  22. Real Web Server Dependencies Flow (Netflix home page business transaction as seen by AppDynamics): starting at the web service and fanning out to memcached, Cassandra, an S3 bucket, and the personalization movie group choosers (for US, Canada and Latam). Each icon is three to a few hundred instances across three AWS zones
  23. Three Balanced Availability Zones, tested with Chaos Gorilla: Cassandra and EVCache replicas in Zones A, B, and C behind load balancers
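The three-zone layout is what lets Chaos Gorilla take out an entire availability zone without losing data: with replication factor 3 and one replica per zone, a quorum (2 of 3) still exists after a zone dies. A minimal sketch of that arithmetic, assuming replicas are spread evenly across zones as in the slide:

```java
// Quorum arithmetic behind the three-balanced-zones design.
// Assumes replicas are spread evenly, one per zone.
public class ZoneQuorum {
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Can quorum reads/writes continue after losing one whole zone?
    static boolean survivesZoneLoss(int replicationFactor, int zones) {
        int replicasLostWithZone = replicationFactor / zones;
        return replicationFactor - replicasLostWithZone >= quorum(replicationFactor);
    }

    public static void main(String[] args) {
        // RF=3 across 3 zones: losing a zone leaves 2 replicas >= quorum of 2.
        System.out.println(survivesZoneLoss(3, 3)); // true
    }
}
```

The same check shows why two zones with RF=2 would not be enough: losing a zone leaves one replica, below the quorum of two.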
  24. Isolated Regions: US-East and EU-West each have their own load balancers and Cassandra replicas in Zones A, B, and C
  25. Cross-Region Use Cases • Geographic isolation – US to Europe replication of subscriber data – read intensive, low update rate – production use since late 2011 • Redundancy for regional failover – US East to US West replication of everything – includes write-intensive data, high update rate – testing now
  26. Benchmarking Global Cassandra: a write-intensive test of cross-region replication capacity. 16 x hi1.4xlarge SSD nodes per zone = 96 total; 192 TB of SSD in six locations (three zones each in US-West-2 Oregon and US-East-1 Virginia) up and running Cassandra in 20 min. Test load: 1 million writes at CL.ONE (wait for one replica to ack); validation load: 1 million reads after 500 ms at CL.ONE with no data loss. Inter-region traffic up to 9 Gbit/s at 83 ms; 18 TB backups from S3
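The CL.ONE choice in the benchmark is why the 83 ms inter-region link does not show up in client latency: the coordinator returns as soon as the fastest replica acks, and the remaining replicas, including the remote ones, catch up asynchronously. A sketch of that effect with made-up per-replica latencies:

```java
import java.util.Arrays;

// Client-visible write latency at a given consistency level: the time
// until the required number of replicas have acknowledged the write.
// Replica latencies below are invented for illustration.
public class ConsistencyLatency {
    static long ackLatencyMs(long[] replicaLatenciesMs, int acksRequired) {
        long[] sorted = replicaLatenciesMs.clone();
        Arrays.sort(sorted);
        return sorted[acksRequired - 1]; // k-th fastest ack completes the write
    }

    public static void main(String[] args) {
        // Two local replicas plus one 83 ms away in the other region.
        long[] latencies = {5, 9, 83};
        System.out.println(ackLatencyMs(latencies, 1)); // CL.ONE: 5
        System.out.println(ackLatencyMs(latencies, 3)); // CL.ALL: 83
    }
}
```

This also explains the validation step: reading back at CL.ONE after 500 ms checks that the asynchronous replication actually delivered every write to the other region.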
  27. Managing Multi-Region Availability: regional load balancers in front of Cassandra replicas in Zones A, B, and C in each region, steered by UltraDNS, DynECT DNS, and AWS Route53. Denominator – manage traffic via multiple DNS providers with Java code. 2013 timeline: concept Jan, code Feb, OSS March, production use May
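Denominator's core idea, a single Java abstraction over several DNS vendors so that traffic-steering logic does not care which provider hosts the zone, can be sketched like this. The interface and class names below are invented for illustration; they are not Denominator's real API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the Denominator idea: one abstraction over
// multiple DNS providers. All names here are invented, not the real API.
interface DnsProvider {
    void setRecord(String name, String target);
    String getRecord(String name);
}

class InMemoryProvider implements DnsProvider {
    private final Map<String, String> records = new HashMap<>();
    public void setRecord(String name, String target) { records.put(name, target); }
    public String getRecord(String name) { return records.get(name); }
}

public class TrafficManager {
    private final List<DnsProvider> providers;

    TrafficManager(List<DnsProvider> providers) {
        this.providers = providers;
    }

    // Point the service alias at a region's endpoint on every provider,
    // so traffic moves even if one DNS vendor is having trouble.
    void failover(String alias, String regionEndpoint) {
        for (DnsProvider p : providers) {
            p.setRecord(alias, regionEndpoint);
        }
    }

    public static void main(String[] args) {
        DnsProvider a = new InMemoryProvider();
        DnsProvider b = new InMemoryProvider();
        new TrafficManager(List.of(a, b)).failover("api.example.com", "us-west-2.example.com");
        System.out.println(a.getRecord("api.example.com")); // us-west-2.example.com
    }
}
```

Driving several providers through one interface is the point of the slide: regional failover must not depend on any single DNS vendor being up.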
  28. Incidents – Impact and Mitigation: X incidents with public relations / media impact; XX incidents with high customer service call volume; XXX incidents with metrics impact (feature disable, affects A/B test results); XXXX incidents with no impact (fast retry or automated failover). Y incidents mitigated by active-active and game day practicing; YY incidents mitigated by better tools and practices; YYY incidents mitigated by better data tagging
  29. Cloud Security: automated attack surface monitoring, crypto key store management (CloudHSM), scale to resist DDoS attacks
  30. What Changed? “Impossible” deployments are easy; jointly building code with vendors in public; highly available and secure despite scale and speed
  31. The DIY Question: why doesn't Netflix build and run its own cloud?
  32. Fitting Into Public Scale: a spectrum from Public (around 1,000 instances, where startups sit) through a grey area (Netflix) to Private (around 100,000 instances, e.g. Facebook)
  33. How big is Public? An AWS upper-bound estimate based on the number of public IP addresses: every provisioned instance gets a public IP by default (some VPC instances don't). AWS maximum possible instance count: 4.2 million as of May 2013. Growth >10x in three years, >2x per annum
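The growth claim on this slide is simple compound growth: at >2x per annum, the 4.2 million upper bound of May 2013 would double every year. A one-method sketch of that arithmetic (the growth factor is the slide's estimate, not a measured value):

```java
// Compound-growth arithmetic behind the ">2x per annum" estimate.
public class GrowthEstimate {
    static double project(double current, double factorPerYear, double years) {
        return current * Math.pow(factorPerYear, years);
    }

    public static void main(String[] args) {
        // 4.2M instances in May 2013, doubling yearly -> ~8.4M a year later.
        System.out.println(project(4.2e6, 2.0, 1)); // 8400000.0
    }
}
```

Negative `years` back-casts the same curve, which is how the ">10x in three years" figure relates: 10x over three years works out to a bit more than 2x per year compounded.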
  34. A Cloud Native Open Source Platform
  35. Goals: establish our solutions as best practices / standards; hire, retain and engage top engineers; build up the Netflix technology brand; benefit from a shared ecosystem
  36. Example Application – RSS Reader, with Zuul traffic processing and routing in front
  37. Ice – Detailed AWS “Chargeback”
  38. Boosting the @NetflixOSS Ecosystem
  39. What's coming next? More use cases, more features, better portability, higher availability, easier to deploy, contributions from end users, contributions from vendors
  40. Vendor-Driven Portability: interest in using NetflixOSS for enterprise private clouds – “It's done when it runs Asgard.” Functionally complete, demonstrated March, released June in v3.3. Offering a $10K prize for integration work; vendor and end-user interest. OpenStack “Heat” getting there; PayPal C3 Console based on Asgard
  41. Functionality and scale now, portability coming. Moving from parts to a platform in 2013. Netflix is fostering a cloud native ecosystem. Rapid evolution – low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
  42. Details • Meetup S1E3 July – featuring contributors Eucalyptus, IBM, PayPal, Riot Games – • Lightning Talks March S1E2 – roadmap • Lightning Talks Feb S1E1 – • Asgard In Depth Feb S1E1 – • Security Architecture – • Cost Aware Cloud Architectures – with Jinesh Varia of AWS – varia-aws-and-adrian-cockroft-netflix
  43. What Changed? Speed wins, and Cloud Native helps you get there. NetflixOSS makes it easier for everyone to become Cloud Native. @adrianco #netflixcloud @NetflixOSS