Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Same basic flow as the keynote, but with a lot more detail, and we had a lot more interactive discussion rather than a presentation format. See part 2 for some more specific detail and links to other presentations.

  1. 1. Introduction: Building Using The NetflixOSS Architecture – May 2013 – Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS – http://www.linkedin.com/in/adriancockcroft
  2. 2. Presentation vs. Tutorial • Presentation – Short duration, focused subject – One presenter to many anonymous audience – A few questions at the end • Tutorial – Time to explore in and around the subject – Tutor gets to know the audience – Discussion, rat-holes, “bring out your dead”
  3. 3. Introduction – Who are you? • Netflix Open Source Cloud Prize • Cloud Native – More details • NetflixOSS – Cloud Native On-Ramp
  4. 4. Adrian Cockcroft • Director, Architecture for Cloud Systems, Netflix Inc. – Previously Director for Personalization Platform • Distinguished Availability Engineer, eBay Inc. 2004-7 – Founding member of eBay Research Labs • Distinguished Engineer, Sun Microsystems Inc. 1988-2004 – 2003-4 Chief Architect High Performance Technical Computing – 2001 Author: Capacity Planning for Web Services – 1999 Author: Resource Management – 1995 & 1998 Author: Sun Performance and Tuning – 1996 Japanese Edition of Sun Performance and Tuning • SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ) • More – Twitter @adrianco – Blog http://perfcap.blogspot.com – Presentations at http://www.slideshare.net/adrianco
  5. 5. Attendee Introductions • Who are you, where do you work • Why are you here today, what do you need • “Bring out your dead” – Do you have a specific problem or question? – One sentence elevator pitch • What instrument do you play?
  6. 6. Boosting the @NetflixOSS Ecosystem See netflix.github.com
  7. 7. In 2012 Netflix Engineering won this..
  8. 8. We’d like to give out prizes too But what for? Contributions to NetflixOSS! Shared under Apache license Located on github
  9. 9. How long do you have? Entries open March 13th Entries close September 15th Six months…
  10. 10. Who can win? Almost anyone, anywhere… Except current or former Netflix or AWS employees
  11. 11. Who decides who wins? Nominating Committee Panel of Judges
  12. 12. Judges: Aino Corry – Program Chair for Qcon/GOTO • Martin Fowler – Chief Scientist, Thoughtworks • Simon Wardley – Strategist • Yury Izrailevsky – VP Cloud, Netflix • Werner Vogels – CTO, Amazon • Joe Weinman – SVP Telx, Author “Cloudonomics”
  13. 13. What are Judges Looking For? • Eligible, Apache 2.0 licensed NetflixOSS project pull requests • Original and useful contribution to NetflixOSS • Good code quality and structure • Documentation on how to build and run it • Code that successfully builds and passes a test suite • Evidence that code is in use by other projects, or is running in production • A large number of watchers, stars and forks on github
  14. 14. What do you win? One winner in each of the 10 categories • Ticket and expenses to attend AWS Re:Invent 2013 in Las Vegas • A trophy
  15. 15. How do you enter? • Get a (free) github account • Fork github.com/netflix/cloud-prize • Send us your email address • Describe and build your entry • Twitter #cloudprize
  16. 16. Contest flow (flattened diagram): Entrants – Netflix Engineering nominations – six judges – winners; nominations judged on conformance to rules, working code and community traction across the prize categories. Registration opened March 13 on Github; Apache-licensed contributions on Github; entries close September 15 on Github; award ceremony dinner in November at AWS Re:Invent. Ten prize categories: $10K cash, $5K AWS, AWS Re:Invent tickets, trophy.
  17. 17. Cloud Native – a recap of the keynote in much more detail and discussion
  18. 18. A new engineering challenge Construct a highly agile and highly available service from ephemeral and often broken components
  19. 19. Inspiration
  20. 20. Netflix Streaming A Cloud Native Application based on an open source platform
  21. 21. Netflix Member Web Site Home Page Personalization Driven – How Does It Work?
  22. 22. How Netflix Streaming Works (architecture diagram spanning Consumer Electronics, AWS Cloud Services and CDN Edge Locations): Customer Device (PC, PS3, TV…) • Web Site or Discovery API • User Data • Personalization • Streaming API • DRM • QoS Logging • OpenConnect CDN Boxes • CDN Management and Steering • Content Encoding
  23. 23. Real Web Server Dependencies Flow (diagram: the Netflix home page business transaction as seen by AppDynamics). Nodes include memcached, Cassandra, web services, S3 buckets and the personalization movie group choosers (for US, Canada and Latam); each icon is three to a few hundred instances across three AWS zones.
  24. 24. New Cloud Native Patterns Micro-services and Chaos engines Highly available systems composed from ephemeral components Open Source is the default
  25. 25. Some Strategic Questions What changed…
  26. 26. The AWS Question Why does Netflix use AWS when Amazon Prime is a competitor?
  27. 27. Netflix vs. Amazon Prime • Do retailers competing with Amazon use AWS? – Yes, lots of them, Netflix is no different • Does Prime have a platform advantage? – No, because Netflix gets to run on AWS • Does Netflix take Amazon Prime seriously? – Yes, but so far Prime isn’t impacting our business
  28. 28. (Bandwidth chart) Amazon Video 1.31% • 18x Prime (Nov 2012 streaming bandwidth) • 25x Prime (March 2013 mean bandwidth) • +39% over 6 months
  29. 29. The Google Cloud Question Why doesn’t Netflix use Google Cloud as well as AWS?
  30. 30. Google Cloud – Wait and See. Pros: Cloud Native • Huge scale for internal apps • Exposing internal services • Nice clean API model • Starting a price war • Fast for what it does • Rapid start & minute billing. Cons: In beta until last week • No big customers yet • Missing many key features • Different arch model • Missing billing options • No SSD or huge instances • Zone maintenance windows. But: anyone interested is welcome to port NetflixOSS components to Google Cloud.
  31. 31. Cloud Wars: Price and Performance – the AWS vs. GCS war vs. private cloud. What changed: everyone using AWS or GCS gets the price cuts and performance improvements as they happen; no need to switch vendor. No change: private cloud stays locked in for three years.
  32. 32. The DIY Question Why doesn’t Netflix build and run its own cloud?
  33. 33. Fitting Into Public Scale (diagram): a range from Public through a Grey Area to Private, spanning 1,000 to 100,000 instances, with Startups, Netflix and Facebook placed along it.
  34. 34. How big is Public? AWS upper-bound estimate based on the number of public IP addresses (every provisioned instance gets a public IP by default): maximum possible instance count 3.7 million; growth >10x in three years, >2x per annum.
  35. 35. The Alternative Supplier Question What if there is no clear leader for a feature, or AWS doesn’t have what we need?
  36. 36. Things We Don’t Use AWS For: SaaS applications (Pagerduty, AppDynamics) • Content Delivery Service • DNS Service
  37. 37. CDN Scale (chart, gigabits to terabits): AWS CloudFront, Akamai, Limelight, Level 3, Netflix Openconnect, YouTube, with Startups, Netflix and Facebook shown for scale.
  38. 38. Content Delivery Service – Open Source Hardware Design + FreeBSD, bird, nginx. See openconnect.netflix.com
  39. 39. DNS Service: AWS Route53 is missing too many features. Multiple-vendor strategy: Dyn, Ultra, Route53. Abstracted (broken) DNS APIs with Denominator.
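To make the Denominator idea concrete, here is a minimal illustrative sketch of a provider-neutral DNS abstraction in the same spirit. The interface and class names below are hypothetical, not Denominator's actual API; they only show how one traffic-switch call can fan out to multiple DNS vendors.

```java
// Illustrative sketch only: a provider-neutral DNS abstraction in the spirit of
// Denominator. These types are hypothetical, not the real Denominator API.
import java.util.List;

interface DnsProvider {
    String name();                                           // e.g. "dyn", "ultra", "route53"
    void upsertCname(String zone, String fqdn, String target, int ttl);
}

// Switch member-facing traffic between regions by updating the same record
// through every configured provider.
class RegionalTrafficSwitcher {
    private final List<DnsProvider> providers;

    RegionalTrafficSwitcher(List<DnsProvider> providers) {
        this.providers = providers;
    }

    /** Point the alias at a regional load balancer via each provider. */
    void routeTo(String zone, String alias, String regionalElb) {
        for (DnsProvider provider : providers) {
            try {
                provider.upsertCname(zone, alias, regionalElb, 60); // short TTL for fast failover
            } catch (RuntimeException e) {
                // One broken vendor API should not block the others.
                System.err.println("update failed on " + provider.name() + ": " + e.getMessage());
            }
        }
    }
}
```

Keeping the per-provider failure handling inside the loop is what lets a broken vendor API degrade gracefully rather than blocking a traffic switch.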
  40. 40. Availability Questions Is it running yet? How many places is it running in? How far apart are those places?
  41. 41. Netflix Outages • Running very fast with scissors – Mostly self inflicted – bugs, mistakes from pace of change – Some caused by AWS bugs and mistakes • Incident Life-cycle Management by Platform Team – No runbooks, no operational changes by the SREs – Tools to identify what broke and call the right developer • Next step is multi-region – Investigating and building in stages during 2013 – Could have prevented some of our 2012 outages
  42. 42. Managing Multi-Region Availability (diagram): in each region, Cassandra replicas in Zones A, B and C sit behind regional load balancers; UltraDNS, DynECT DNS and AWS Route53 route traffic to the regions, with Denominator managing traffic via multiple DNS providers.
  43. 43. Cloud Native Big Data Size the cluster to the data Size the cluster to the questions Never wait for space or answers
  44. 44. Netflix Dataoven (diagram): a data warehouse of over 2 petabytes, fed by the Ursula and Aegisthus data pipelines – ~100 billion events/day from cloud services and terabytes of dimension data from C*; Hadoop clusters on AWS EMR (1300 nodes, 800 nodes, multiple nightly 150-node clusters); RDS metadata; gateways and tools.
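Slide 43's "size the cluster to the questions" maps naturally onto short-lived EMR clusters. Below is a minimal sketch using the AWS SDK for Java to start a disposable Hadoop cluster; the cluster name, instance types, node count, roles, log bucket and release label are illustrative assumptions, not the actual Dataoven configuration.

```java
// A minimal sketch of bringing up a right-sized, disposable Hadoop cluster on
// AWS EMR. All names and sizes here are illustrative assumptions.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

public class NightlyClusterSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("nightly-etl")                          // hypothetical cluster name
                .withReleaseLabel("emr-5.36.0")
                .withApplications(new Application().withName("Hadoop"))
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withLogUri("s3://my-logs/emr/")                  // hypothetical log bucket
                .withInstances(new JobFlowInstancesConfig()
                        .withMasterInstanceType("m5.xlarge")
                        .withSlaveInstanceType("m5.xlarge")
                        .withInstanceCount(150)                   // sized to the questions, not to fixed hardware
                        .withKeepJobFlowAliveWhenNoSteps(false)); // tear down when the work is done

        String clusterId = emr.runJobFlow(request).getJobFlowId();
        System.out.println("Started cluster " + clusterId);
    }
}
```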
  45. 45. Cloud Native Patterns Master copies of data are cloud resident Dynamically provisioned micro-services Services are distributed and ephemeral
  46. 46. Cloud Native Architecture (diagram): clients and “things” talk to autoscaled micro services (JVMs) backed by distributed quorum NoSQL datastores (Cassandra) and Memcached, spread across Zones A, B and C.
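A brief arithmetic aside on why spreading replicas across three zones keeps the service available through a zone outage; this is standard Cassandra quorum math, not a Netflix-specific formula.

```latex
% Quorum with replication factor RF = 3 across zones A, B, C
\mathrm{quorum} = \left\lfloor \frac{RF}{2} \right\rfloor + 1
                = \left\lfloor \frac{3}{2} \right\rfloor + 1 = 2
```

Losing any one zone leaves two of the three replicas, so quorum reads and writes continue without interruption.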
  47. 47. Non-Native Cloud Architecture (diagram): datacenter “dinosaurs” (legacy apps, MySQL) behind a cloudy buffer of app servers, serving agile mobile “mammals” (iOS/Android apps).
  48. 48. How to get to Cloud Native? Freedom and Responsibility for Developers Decentralize and Automate Ops Activities Integrate DevOps into the Business Organization
  49. 49. Four Transitions • Management: Integrated Roles in a Single Organization – Business, Development, Operations -> BusDevOps • Developers: Denormalized Data – NoSQL – Decentralized, scalable, available, polyglot • Responsibility from Ops to Dev: Continuous Delivery – Decentralized small daily production updates • Responsibility from Ops to Dev: Agile Infrastructure - Cloud – Hardware in minutes, provisioned directly by developers
  50. 50. Netflix BusDevOps Organization (org chart): the Chief Product Officer oversees the VP Product Management (Directors Product), the VP UI Engineering (Directors Development, Developers + DevOps, UI data sources, AWS), the VP Discovery Engineering (Directors Development, Developers + DevOps, Discovery data sources, AWS) and the VP Platform (Directors Platform, Developers + DevOps, Platform data sources, AWS). Denormalized, independently updated and scaled data; cloud, independently updated and scaled infrastructure; code, independently updated continuous delivery.
  51. 51. Decentralized Deployment
  52. 52. Asgard http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
  53. 53. Ephemeral Instances • Largest services are autoscaled • Average lifetime of an instance is 36 hours (diagram: push, autoscale up, autoscale down)
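For readers who want to see what "autoscaled" looks like at the API level, here is a minimal sketch of a scale-up policy using the AWS SDK for Java. The auto scaling group name, adjustment size and cooldown are illustrative assumptions, and a matching CloudWatch alarm (not shown) would actually trigger the policy.

```java
// A minimal sketch of defining a simple scale-up policy for an auto scaling group.
// Group name, adjustment and cooldown are illustrative, not Netflix's settings.
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyResult;

public class ScaleUpPolicySketch {
    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Add two instances when the policy is triggered (e.g. by a CloudWatch alarm).
        PutScalingPolicyResult result = autoScaling.putScalingPolicy(
                new PutScalingPolicyRequest()
                        .withAutoScalingGroupName("api-prod-v042")   // hypothetical ASG name
                        .withPolicyName("scale-up-on-high-load")
                        .withAdjustmentType("ChangeInCapacity")
                        .withScalingAdjustment(2)
                        .withCooldown(300));                          // seconds between scaling actions

        System.out.println("Created policy: " + result.getPolicyARN());
    }
}
```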
  54. 54. A Cloud Native Open Source Platform See netflix.github.com
  55. 55. Three Questions Why is Netflix doing this? How does it all fit together? What is coming next?
  56. 56. Beware of Geeks Bearing Gifts: Strategies for an Increasingly Open Economy Simon Wardley - Researcher at the Leading Edge Forum
  57. 57. How did Netflix get ahead? Netflix BusDevOps Org: • Doing it since 2009 • SaaS Applications • PaaS for agility • Public IaaS for AWS features • Big data in the cloud • Integrating many APIs • FOSS from github • Renting hardware for 1hr • Coding in Java/Groovy/Scala. Traditional IT Operations: • Taking their time • Pilot private cloud projects • Beta quality installations • Small scale • Integrating several vendors • Paying big $ for software • Paying big $ for consulting • Buying hardware for 3yrs • Hacking at scripts
  58. 58. Netflix Platform Evolution: Bleeding Edge Innovation (2009-2010), Common Pattern (2011-2012), Shared Pattern (2013-2014). Netflix ended up several years ahead of the industry, but it’s becoming commoditized now.
  59. 59. Making it easy to follow Exploring the wild west each time vs. laying down a shared route
  60. 60. Goals: Establish our solutions as Best Practices / Standards • Hire, Retain and Engage Top Engineers • Build up Netflix Technology Brand • Benefit from a shared ecosystem
  61. 61. How does it all fit together?
  62. 62. Example Application – RSS Reader
  63. 63. NetflixOSS Continuous Build and Deployment (diagram): Github NetflixOSS source and Maven Central feed Cloudbees Jenkins builds (on Dynaslave AWS build slaves); the Aminator bakery combines the build output with an AWS base AMI to produce baked AMIs; the Asgard (+ Frigga) console and the Odin orchestration API deploy them into the AWS account.
  64. 64. NetflixOSS Services Scope (diagram). Per AWS account: Asgard console, Archaius config service, cross-region Priam C*, Pytheas dashboards, Atlas monitoring, Genie and Lipstick Hadoop services, AWS usage cost monitoring. Across multiple AWS regions: Eureka registry, Exhibitor ZK, Edda history, Simian Army. Within 3 AWS zones: application clusters, autoscale groups, instances, Priam Cassandra persistent storage, Evcache Memcached ephemeral storage.
  65. 65. NetflixOSS Instance Libraries. Initialization: Baked AMI – Tomcat, Apache, your code • Governator – Guice based dependency injection • Archaius – dynamic configuration properties client • Eureka – service registration client. Service Requests: Karyon – base server for inbound requests • RxJava – reactive pattern • Hystrix/Turbine – dependencies and real-time status • Ribbon – REST client for outbound calls. Data Access: Astyanax – Cassandra client and pattern library • Evcache – zone aware Memcached client • Curator – Zookeeper patterns • Denominator – DNS routing abstraction. Logging: Blitz4j – non-blocking logging • Servo – metrics export for autoscaling • Atlas – high volume instrumentation.
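As a concrete example of the service-request libraries above, here is a minimal Hystrix sketch: an outbound dependency call wrapped in a command with a fallback. The "recommendation service" call is a hypothetical placeholder; a real Netflix service would make the outbound call through Ribbon.

```java
// A minimal sketch of wrapping an outbound dependency call in a Hystrix command.
// The downstream call inside run() is a hypothetical placeholder.
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class RecommendationsCommand extends HystrixCommand<String> {
    private final String memberId;

    public RecommendationsCommand(String memberId) {
        // Commands in the same group share a thread pool and circuit-breaker metrics.
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationService"));
        this.memberId = memberId;
    }

    @Override
    protected String run() throws Exception {
        // Hypothetical remote call; a real service would go through Ribbon here.
        return callRecommendationService(memberId);
    }

    @Override
    protected String getFallback() {
        // Degraded-but-available default when the dependency is slow or failing.
        return "[]";
    }

    private String callRecommendationService(String memberId) {
        return "[\"movie-1\",\"movie-2\"]"; // stand-in for a real HTTP call
    }

    public static void main(String[] args) {
        // Synchronous execution; Hystrix also supports queue() and observe().
        System.out.println(new RecommendationsCommand("member-123").execute());
    }
}
```

Calling execute() runs the command on a Hystrix-managed thread pool; when the dependency fails, times out, or the circuit opens, callers get the fallback instead of a cascading error.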
  66. 66. NetflixOSS Testing and Automation. Test Tools: CassJmeter – load testing for Cassandra • Circus Monkey – test account reservation rebalancing. Maintenance: Janitor Monkey – cleans up unused resources • Efficiency Monkey • Doctor Monkey • Howler Monkey – complains about AWS limits. Availability: Chaos Monkey – kills instances • Chaos Gorilla – kills availability zones • Chaos Kong – kills regions • Latency Monkey – latency and error injection. Security: Security Monkey – security group and S3 bucket permissions • Conformity Monkey – architectural pattern warnings.
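To illustrate the Chaos Monkey idea (this is not the actual Simian Army implementation), here is a hedged sketch using the AWS SDK for Java that terminates one random instance in an auto scaling group. The group name is hypothetical, and the real Chaos Monkey adds scheduling, opt-in/opt-out and notification logic.

```java
// A minimal sketch of the Chaos Monkey idea: pick one instance from an auto
// scaling group and terminate it, trusting autoscaling to replace it.
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.AutoScalingGroup;
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest;
import com.amazonaws.services.autoscaling.model.Instance;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import java.util.List;
import java.util.Random;

public class ChaosSketch {
    public static void main(String[] args) {
        AmazonAutoScaling asg = AmazonAutoScalingClientBuilder.defaultClient();
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Look up the target group (hypothetical name).
        List<AutoScalingGroup> groups = asg.describeAutoScalingGroups(
                new DescribeAutoScalingGroupsRequest()
                        .withAutoScalingGroupNames("api-prod-v042"))
                .getAutoScalingGroups();
        if (groups.isEmpty() || groups.get(0).getInstances().isEmpty()) {
            return; // nothing to kill
        }

        // Pick a random victim and terminate it; the group should self-heal.
        List<Instance> instances = groups.get(0).getInstances();
        Instance victim = instances.get(new Random().nextInt(instances.size()));
        ec2.terminateInstances(new TerminateInstancesRequest()
                .withInstanceIds(victim.getInstanceId()));
        System.out.println("Terminated " + victim.getInstanceId());
    }
}
```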
  67. 67. What’s Coming Next? More use cases • More features • Better portability • Higher availability • Easier to deploy • Contributions from end users • Contributions from vendors
  68. 68. Vendor Driven Portability: interest in using NetflixOSS for enterprise private clouds – “It’s done when it runs Asgard.” (Flattened comparison of private cloud stacks: functionally complete, demonstrated in March, release 3.3 in 2Q13; some vendor interest, needs an AWS compatible autoscaler; some vendor interest, many missing features, “confused” AWS API strategy.)
  69. 69. AWS 2009 Baseline features needed to support NetflixOSS Eucalyptus 3.3
  70. 70. Functionality and scale now, portability coming Moving from parts to a platform in 2013 Netflix is fostering a cloud native ecosystem Rapid Evolution - Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
  71. 71. Takeaway NetflixOSS makes it easier for everyone to become Cloud Native @adrianco #netflixcloud @NetflixOSS

Editor's Notes

  • Hive – thin metadata layer on top of S3. Used for ad-hoc analytics (Ursula for merge ETL). HiveQL gets compiled into a set of MR jobs (1 -> many). It is a CLI – it runs on the gateways, not like a relational DB server or a service that the query gets shipped to. Pig – used for ETL (can create DAGs, workflows for Hadoop processes); Pig scripts also get compiled into MR jobs. Java – straight up Hadoop, not for the faint of heart; some recommendation algorithms are in Hadoop. Python/Java – UDFs. Applications such as Sting use the tools on some gateway to access all the various components. Next – focus on two key components: Data & Clusters.
  • When Netflix first moved to cloud it was bleeding edge innovation, we figured stuff out and made stuff up from first principles. Over the last two years more large companies have moved to cloud, and the principles, practices and patterns have become better understood and adopted. At this point there is intense interest in how Netflix runs in the cloud, and several forward looking organizations adopting our architectures and starting to use some of the code we have shared. Over the coming years, we want to make it easier for people to share the patterns we use.
  • The railroad made it possible for California to be developed quickly; by creating an easy-to-follow path, we can create a much bigger ecosystem around the Netflix platform.
