Flowcon keynote (extended for CMG) on how Speed Wins and how Netflix is doing Continuous Delivery

The Flowcon keynote was given a few days before CMG; the CMG version has a few tweaks and some extra content added at the start and end. This was the opening keynote at both conferences, on how Speed Wins and how Netflix is doing Continuous Delivery.

Published in: Technology

  1. Now Playing on Netflix: Adventures in a Cloudy Future. CMG November 2013. Adrian Cockcroft, @adrianco, @NetflixOSS, http://www.linkedin.com/in/adriancockcroft
  2. Netflix Member Web Site Home Page: Personalization Driven – How Does It Work?
  3. How Netflix Used to Work: Consumer Electronics Oracle Monolithic Web App AWS Cloud Services MySQL CDN Edge Locations Oracle Datacenter Customer Device (PC, PS3, TV…) Monolithic Streaming App MySQL Content Management Limelight/Level 3 Akamai CDNs Content Encoding
  4. How Netflix Streaming Works Today: Consumer Electronics User Data Web Site or Discovery API AWS Cloud Services Personalization CDN Edge Locations DRM Datacenter Customer Device (PC, PS3, TV…) Streaming API QoS Logging OpenConnect CDN Boxes CDN Management and Steering Content Encoding
  5. Nov 2012 Streaming Bandwidth; March 2013 Mean Bandwidth +39% in 6 months
  6. Netflix Scale • Tens of thousands of instances on AWS – Typically 4 core, 30GByte, Java business logic – Thousands created/removed every day • Thousands of Cassandra NoSQL storage nodes – Mostly 8 core, 60GByte, 2TByte of SSD – 65 different clusters, over 300TB data, triple zone – Over 40 are multi-region clusters (6, 9 or 12 zone) – Biggest 288 nodes, 300K rps, 1.3M wps
  7. Reactions over time: 2009 “You guys are crazy! Can’t believe it”; 2010 “What Netflix is doing won’t work”; 2011 “It only works for ‘Unicorns’ like Netflix”; 2012 “We’d like to do that but can’t”; 2013 “We’re on our way using Netflix OSS code”
  8. "This is the IT swamp draining manual for anyone who is neck deep in alligators." Adrian Cockcroft, Cloud Architect at Netflix
  9. Web-scale Cloud Commodity Client/Server Mainframe
  10. Goal of Traditional IT: Reliable hardware running stable software
  11. SCALE Breaks hardware
  12. …SPEED Breaks software
  13. SPEED at SCALE Breaks everything
  14. Incidents – Impact and Mitigation • PR (Public Relations Media Impact): X Incidents, Y incidents mitigated by Active Active, game day practicing • CS (High Customer Service Calls): XX Incidents, YY incidents mitigated by better tools and practices • Metrics impact – Feature disable (Affects AB Test Results): XXX Incidents, YYY incidents mitigated by better data tagging • No Impact – fast retry or automated failover: XXXX Incidents
  15. Web Scale Architecture: DNS Automation (AWS Route53, DynECT DNS, UltraDNS); two sets of Regional Load Balancers, each in front of Zone A, Zone B, Zone C; Cassandra Replicas in every zone
  16. CIO Says Speed IT Up!
  17. “Get inside your adversaries' OODA loop to disorient them” Colonel Boyd, USAF
  18. Land grab opportunity Engage customers Deliver Measure Customers Act Competitive Move Observe Colonel Boyd, USAF “Get inside your adversaries' OODA loop to disorient them” Customer Pain Point Analysis Orient Model Hypotheses Implement Decide Commit Resources Plan Response Get Buy-in
  19. Territory Expansion Print Ad Campaign Upgrade Mainframe Measure Revenue Act Foreign Competition Observe Mainframe Era - 1 year cycle Customer Pain Point Systems Analysis Orient Capacity Model Customize Vendor SW Decide Vendor Evaluation 5 year Plan Board Level Buy-in
  20. 80’s Mainframe Innovation Cycle • Cost $1M to $100M • Duration 1 to 5 years • Bet the whole company • Cost of failure – bankrupt or bought • Cobol and DB2 on MVS
  21. Territory Expansion TV Advert Campaign Install Servers Measure Revenue Act Foreign Competition Observe Client/Server Era – 3 month cycle Customer Pain Point Data Warehouse Orient Capacity Estimate Customize Vendor SW Decide Vendor Evaluation 1 year Plan CIO Level Buy-in
  22. 90’s Client/Server Innovation Cycle • Cost $100K to $10M • Duration 3 – 12 months • Bet a product line or division • Cost of failure – revenue hit, CIO’s job • C++ and Oracle on Solaris
  23. Territory Expansion Web Display Ads Measure Sales Install Capacity Act Competitive Moves Observe Commodity Era – 2 week agile train Customer Pain Point Data Warehouse Orient Capacity Estimate Code Feature Decide Feature Priority 2 Week Plan Business Buy-in
  24. 00’s Commodity Agile Innovation Cycle • Cost $10K to $1M • Duration 2 – 12 weeks • Bet a product feature • Cost of failure – product mgr reputation • Java and MySQL on RedHat Linux
  25. Train Model Process Hand-Off Steps: Product Manager, Developer, QA Integration Team, Operations Deploy Team, BI Analytics Team
  26. What Happened? Rate of change increased; cost, size and risk of change reduced
  27. Cloud Native: Construct a highly agile and highly available service from ephemeral and assumed broken components
  28. Real Web Server Dependencies Flow (Netflix Home page business transaction as seen by AppDynamics). Each icon is three to a few hundred instances across three AWS zones. Cassandra; memcached; Start Here; Personalization movie group choosers (for US, Canada and Latam); Web service; S3 bucket
  29. Continuous Deployment: No time for handoff to IT
  30. Developer Self Service: Freedom and Responsibility
  31. Developers run what they wrote: Root access and PagerDuty
  32. IT is a Cloud API: DEVops automation
  33. Github all the things! Leverage social coding
  34. Putting it all together…
  35. Land grab opportunity Launch AB Test Automatic Deploy Measure Customers Act Competitive Move Observe Continuous Delivery on Cloud Customer Pain Point Analysis Orient Model Hypotheses Increment Implement Decide Plan Response Share Plans JFDI
  36. Continuous Innovation Cycle • Cost near zero, variable expense • Duration hours to days • Bet a decoupled microservice code push • Cost of failure – near zero, instant rollback • Clojure/Scala/Python on NoSQL on Cloud
  37. Continuous Deploy Hand-Off Steps: Product Manager – A/B test setup and enable, self service hypothesis test results; Developer – automated test, self service deploy, on call, self service analytics
  38. Continuous Deploy Automation: Check in code, Jenkins build; Bake AMI, launch in test env; Functional and performance test; Production canary test; Production red/black push
  39. Bad Canary Signature
  40. Happy Canary Signature
  41. Global Deploy Automation: West Coast, East Coast and Europe Load Balancers, each in front of Zone A, Zone B, Zone C, with Cassandra Replicas in every zone. Afternoon in California, night-time in Europe; if passes test suite, canary then deploy after peak in Europe; canary then deploy next day on East Coast; canary then deploy next day on West Coast
  42. Ephemeral Instances • Largest services are autoscaled • Average lifetime of an instance is 36 hours. Autoscale Up; Autoscale Down; Push
  43. (New Today!) Predictive Autoscaling: 24 Hours predicted traffic vs. actual; More morning load Sat/Sun high traffic; Lower load on Weds; Prediction driving AWS Autoscaler to plan capacity
  44. Inspiration
  45. Takeaway: Speed Wins; Assume Broken; Cloud Native Automation; Github is your “app store” and resumé. @adrianco @NetflixOSS http://netflix.github.com
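
The Continuous Deploy Automation steps on slide 38 can be read as a gated pipeline: each stage runs only if the previous one passed, and the final red/black push keeps the old cluster available for instant rollback. The following is a minimal Python sketch under stated assumptions, not Netflix's actual tooling. The names `run_pipeline`, `canary_passes`, `Cluster` and `red_black_push`, and the 1.5x error tolerance, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stage names are taken from slide 38; everything else here is illustrative.
STAGES = [
    "check in code, Jenkins build",
    "bake AMI, launch in test env",
    "functional and performance test",
    "production canary test",
    "production red/black push",
]


def run_pipeline(stages: List[str], run_stage: Callable[[str], bool]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the list of stages that completed successfully.
    """
    completed = []
    for stage in stages:
        if not run_stage(stage):
            break
        completed.append(stage)
    return completed


def canary_passes(baseline_error_rate: float, canary_error_rate: float,
                  tolerance: float = 1.5) -> bool:
    """Hypothetical canary check: pass if the canary's error rate is at most
    `tolerance` times the baseline's. The 1.5 default is an assumption."""
    return canary_error_rate <= baseline_error_rate * tolerance


@dataclass
class Cluster:
    """Hypothetical stand-in for a group of instances running one version."""
    version: str
    takes_traffic: bool = False


def red_black_push(old: Cluster, new: Cluster,
                   is_healthy: Callable[[Cluster], bool]) -> Cluster:
    """Switch traffic to `new`, keeping `old` running so rollback is instant.

    Returns whichever cluster ends up serving traffic.
    """
    new.takes_traffic, old.takes_traffic = True, False
    if not is_healthy(new):
        # Roll back by routing traffic to the old cluster again.
        new.takes_traffic, old.takes_traffic = False, True
        return old
    return new
```

Calling `run_pipeline(STAGES, run_stage)` with a `run_stage` callback that fails the canary stage returns only the first three stages, so the production push never happens. In `red_black_push`, an unhealthy new cluster hands traffic straight back to the old one.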

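Slide 43 describes a traffic prediction driving the AWS autoscaler to plan capacity. One way to picture this is converting an hourly traffic forecast into a desired instance count ahead of time. The sketch below is a hypothetical illustration, not Netflix's predictive autoscaling engine. The function name `plan_capacity`, the per-instance throughput, the headroom factor and the minimum instance count are all assumed values.

```python
import math
from typing import List


def plan_capacity(predicted_rps: List[float], rps_per_instance: float,
                  headroom: float = 0.25, min_instances: int = 3) -> List[int]:
    """Turn a per-hour traffic forecast into per-hour desired instance counts.

    predicted_rps: forecast requests per second for each hour.
    rps_per_instance: assumed throughput one instance can sustain.
    headroom: assumed extra fraction of capacity kept above the forecast.
    min_instances: assumed floor, e.g. one instance in each of three zones.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    plan = []
    for rps in predicted_rps:
        # Round up so the planned capacity never falls below the forecast.
        needed = math.ceil(rps * (1 + headroom) / rps_per_instance)
        plan.append(max(min_instances, needed))
    return plan
```

For example, `plan_capacity([0, 1000, 4000], 100.0)` yields `[3, 13, 50]`: the floor applies to the idle hour, and the other two hours are sized at 25% above the forecast.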