Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Netflix and Open Source

Summary of cloud native architecture, NetflixOSS and the Cloud Prize

  • Login to see the comments

Netflix and Open Source

  1. 1. Netflix and Open Source March 2013 Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS
  2. 2. Cloud NativeNetflixOSS – Cloud Native On-RampNetflix Open Source Cloud Prize
  3. 3. Netflix Member Web Site Home Page Personalization Driven – How Does It Work?
  4. 4. How Netflix Streaming WorksConsumerElectronics User Data Web Site orAWS Cloud Discovery API Services PersonalizationCDN EdgeLocations DRM Customer Device Streaming API (PC, PS3, TV…) QoS Logging CDN Management and Steering OpenConnect CDN Boxes Content Encoding
  5. 5. Content Delivery ServiceOpen Source Hardware Design + FreeBSD, bird, nginx
  6. 6. November 2012 Traffic
  7. 7. Real Web Server Dependencies Flow (Netflix Home page business transaction as seen by AppDynamics)Each icon isthree to a fewhundredinstancesacross three CassandraAWS zones memcached Web service Start Here S3 bucketThree Personalization movie groupchoosers (for US, Canada and Latam)
  8. 8. Cloud Native Architecture Clients Things Autoscaled Micro JVM JVM JVM Services Autoscaled Micro JVM JVM Memcached ServicesDistributed Quorum Cassandra Cassandra Cassandra NoSQL Datastores Zone A Zone B Zone C
  9. 9. Non-Native Cloud ArchitectureAgile Mobile iOS/Android Mammals Cloudy App Servers BufferDatacenter MySQL Legacy AppsDinosaurs
  10. 10. New Anti-Fragile Patterns Micro-services Chaos enginesHighly available systems composed from ephemeral components
  11. 11. Stateless Micro-Service ArchitectureLinux Base AMI (CentOS or Ubuntu) Optional Apache frontend, Java (JDK 6 or 7)memcached,non-java apps AppDynamics Monitoring appagent monitoring Tomcat Log rotation Application war file, base Healthcheck, status to S3 GC and thread servlet, platform, client servlets, JMX interface,AppDynamics dump logging interface jars, Astyanax Servo autoscalemachineagent Epic/Atlas
  12. 12. Cassandra Instance ArchitectureLinux Base AMI (CentOS or Ubuntu) Tomcat andPriam on JDK Java (JDK 7)Healthcheck, Status AppDynamics appagent monitoring Cassandra Server MonitoringAppDynamics Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk GC and thread holding Commit log and SSTablesmachineagent dump logging Epic/Atlas
  13. 13. Configuration State Management Datacenter CMDB’s woeful Cloud native is the solution Dependably complete
  14. 14. Edda – Configuration History Eureka Services metadata AWS AppDynamicsInstances, Request flowASGs, etc. Edda Monkeys
  15. 15. Edda Query ExamplesFind any instances that have ever had a specific public IP address$ curl "http://edda/api/v2/view/instances;publicIpAddress=;_since=0" ["i-0123456789","i-012345678a","i-012345678b”]Show the most recent change to a security group$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504@@ -1,33 +1,33 @@ {… "ipRanges" : [ "", "",+ "",- ""… }
  16. 16. Cloud NativeMaster copies of data are cloud resident Everything is dynamically provisioned All services are ephemeral
  17. 17. Scalability Demands
  18. 18. Asgard
  19. 19. Cloud Deployment Scalability New Autoscaled AMI – zero to 500 instances from 21:38:52 - 21:46:32, 7m40sScaled up and down over a few days, total 2176 instance launches, m2.2xlarge (4 core 34GB) Min. 1st Qu. Median Mean 3rd Qu. Max. 41.0 104.2 149.0 171.8 215.8 562.0
  20. 20. Ephemeral Instances • Largest services are autoscaled • Average lifetime of an instance is 36 hours P u s hAutoscale Up Autoscale Down
  21. 21. Leveraging Public Scale 1,000 Instances 100,000 Instances GreyPublic Private AreaStartups Netflix Google
  22. 22. How big is Public? AWS Maximum Possible Instance Count 3.7 Million Growth >10x in Three Years, >2x Per AnnumAWS upper bound estimate based on the number of public IP Addresses Every provisioned instance gets a public IP by default
  23. 23. Availability Is it running yet?How many places is it running in? How far apart are those places?
  24. 24. Antifragile API PatternsFunctional Reactive with Circuit Breakers and Bulkheads
  25. 25. Outages• Running very fast with scissors – Mostly self inflicted – bugs, mistakes – Some caused by AWS bugs and mistakes• Next step is multi-region – Investigating and building in stages during 2013 – Could have prevented some of our 2012 outages
  26. 26. Managing Multi-Region Availability AWS DynECT Route53 UltraDNS DNS Regional Load Balancers Regional Load Balancers Zone A Zone B Zone C Zone A Zone B Zone CCassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas What we need is a portable way to manage multiple DNS providers….
  27. 27. Denominator Software Defined DNS for Java Edda, Multi- Use Cases Region Failover Common Model DenominatorDNS Vendor Plug-in AWS Route53 DynECT UltraDNS Etc…API Models (varied IAM Key Auth User/pwd User/pwdand mostly broken) REST REST SOAP Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)
  28. 28. A Cloud Native Open Source Platform
  29. 29. Inspiration
  30. 30. Three Questions Why is Netflix doing this?How does it all fit together? What is coming next?
  31. 31. Beware of Geeks Bearing Gifts: Strategies for an Increasingly Open Economy Simon Wardley - Researcher at the Leading Edge Forum
  32. 32. How did Netflix get ahead?Netflix Business + Developer Org Traditional IT Operations• Doing it right now • Taking their time• SaaS Applications • Pilot private cloud projects• PaaS for agility • Beta quality installations• Public IaaS for AWS features • Small scale• Big data in the cloud • Integrating several vendors• Integrating many APIs • Paying big $ for software• FOSS from github • Paying big $ for consulting• Renting hardware for 1hr • Buying hardware for 3yrs• Coding in Java/Groovy/Scala • Hacking at scripts
  33. 33. Netflix Platform Evolution 2009-2010 2011-2012 2013-2014Bleeding Edge Common Shared Innovation Pattern Pattern Netflix ended up several years ahead of the industry, but it’s not a sustainable position
  34. 34. Making it easy to followExploring the wild west each time vs. laying down a shared route
  35. 35. Establish our Hire, Retain and solutions as Best Engage TopPractices / Standards Engineers Goals Build up Netflix Benefit from a Technology Brand shared ecosystem
  36. 36. How does it all fit together?
  37. 37. NetflixOSS Continuous Build and Deployment Github Maven AWS NetflixOSS Central Base AMI Source Cloudbees Dynaslave Jenkins AWS AWS Build Aminator Baked AMIs Slaves Bakery Odin Asgard AWS Orchestration (+ Frigga) Account API Console
  38. 38. NetflixOSS Services ScopeAWS AccountAsgard ConsoleArchaius Config Multiple AWS Regions Service Cross region Priam C* Eureka Registry Explorers Dashboards Exhibitor ZK 3 AWS Zones Application Priam Evcache Atlas Edda History Clusters Cassandra Memcached Monitoring Autoscale Groups Persistent Storage Ephemeral Storage Instances Simian ArmyGenie Hadoop Services
  39. 39. NetflixOSS Instance Libraries • Baked AMI – Tomcat, Apache, your codeInitialization • Governator – Guice based dependency injection • Archaius – dynamic configuration properties client • Eureka - service registration client Service • Karyon - Base Server for inbound requests • RxJava – Reactive pattern • Hystrix/Turbine – dependencies and real-time status Requests • Ribbon - REST Client for outbound calls • Astyanax – Cassandra client and pattern libraryData Access • Evcache – Zone aware Memcached client • Curator – Zookeeper patterns • Denominator – DNS routing abstraction • Blitz4j – non-blocking logging Logging • Servo – metrics export for autoscaling • Atlas – high volume instrumentation
  40. 40. NetflixOSS Testing and Automation • CassJmeter – Load testing for Cassandra Test Tools • Circus Monkey – Test account reservation rebalancing • Janitor Monkey – Cleans up unused resources • Efficiency MonkeyMaintenance • Doctor Monkey • Howler Monkey – Complains about expiring certs • Chaos Monkey – Kills Instances • Chaos Gorilla – Kills Availability ZonesAvailability • Chaos Kong – Kills Regions • Latency Monkey – Latency and error injection • Security Monkey Security • Conformity Monkey
  41. 41. Example Application – RSS Reader
  42. 42. What’s Coming Next? Better portability Higher availability MoreFeatures Easier to deploy Contributions from end users Contributions from vendors More Use Cases
  43. 43. Vendor Driven Portability Interest in using NetflixOSS for Enterprise Private Clouds “It’s done when it runs Asgard” Functionally complete Demonstrated March Release 3.3 in 2Q13 Some vendor interestSome vendor interest Many missing featuresNeeds AWS compatible Autoscaler Bait and switch AWS API strategy
  44. 44. AWS 2009 vs. ??? Eucalyptus 3.3
  45. 45. Netflix Cloud PrizeBoosting the @NetflixOSS Ecosystem
  46. 46. In 2012 Netflix Engineering won this..
  47. 47. We’d like to give out prizes too But what for? Contributions to NetflixOSS! Shared under Apache license Located on github
  48. 48. How long do you have? Entries open March 13th Entries close September 15th Six months…
  49. 49. Who can win? Almost anyone, anywhere…Except current or former Netflix or AWS employees
  50. 50. Who decides who wins? Nominating Committee Panel of Judges
  51. 51. Judges Aino Corry Martin FowlerProgram Chair for Qcon/GOTO Simon Wardley Chief Scientist Thoughtworks Strategist Werner Vogels Yury Izrailevsky CTO Amazon Joe Weinman VP Cloud Netflix SVP Telx, Author “Cloudonomics”
  52. 52. What are Judges Looking For? Eligible, Apache 2.0 licensed Original and useful contribution to NetflixOSS Code that successfully builds and passes a test suite A large number of watchers, stars and forks on github NetflixOSS project pull requests Good code quality and structure Documentation on how to build and run itEvidence that code is in use by other projects, or is running in production
  53. 53. What do you win?One winner in each of the 10 categories Ticket and expenses to attend AWS Re:Invent 2013 in Las Vegas A Trophy
  54. 54. How do you enter? Get a (free) github accountFork Send us your email address Describe and build your entry Twitter #cloudprize
  55. 55. Award Apache Registration Close Entries AWS CeremonyGithub Opens Today Github Licensed Github September 15 Dinner Contributions Re:Invent November Judges Winners $10K cash $5K AWS Netflix Nominations Categories Ten Prize Engineering Categories AWSTrophy Re:Invent Conforms to Working Community Tickets Entrants Rules Code Traction
  56. 56. Functionality and scale now, portability coming Moving from parts to a platform in 2013 Netflix is fostering an ecosystem Rapid Evolution - Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
  57. 57. TakeawayNetflix is making it easy for everyone to adopt Cloud Native patterns. Open Source is not just the default, it’s a strategic weapon. @adrianco #netflixcloud @NetflixOSS