Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction


Same basic flow as the keynote, but with a lot more detail, and we had a lot more interactive discussion rather than a presentation format. See part 2 for some more specific detail and links to other presentations.

Published in Technology
  • Hive: thin metadata layer on top of S3. Used for ad-hoc analytics (Ursula for merge ETL). HiveQL gets compiled into a set of MR jobs (1 -> many). It is a CLI that runs on the gateways, not like a relational DB server or a service that the query gets shipped to. Pig: used for ETL (can create DAGs, workflows for Hadoop processes). Pig scripts also get compiled into MR jobs. Java: straight-up Hadoop, not for the faint of heart. Some recommendation algorithms are in Hadoop. Python/Java: UDFs. Applications such as Sting use the tools on some gateway to access all the various components. Next: focus on two key components, Data & Clusters.
  • When Netflix first moved to cloud it was bleeding edge innovation; we figured stuff out and made stuff up from first principles. Over the last two years more large companies have moved to cloud, and the principles, practices and patterns have become better understood and adopted. At this point there is intense interest in how Netflix runs in the cloud, and several forward-looking organizations are adopting our architectures and starting to use some of the code we have shared. Over the coming years, we want to make it easier for people to share the patterns we use.
  • The railroad made it possible for California to be developed quickly. By creating an easy-to-follow path, we can create a much bigger ecosystem around the Netflix platform.


  • 1. Introduction: Building Using The NetflixOSS Architecture
    May 2013
    Adrian Cockcroft
    @adrianco #netflixcloud @NetflixOSS
  • 2. Presentation vs. Tutorial
    • Presentation
      – Short duration, focused subject
      – One presenter to many anonymous audience
      – A few questions at the end
    • Tutorial
      – Time to explore in and around the subject
      – Tutor gets to know the audience
      – Discussion, rat-holes, “bring out your dead”
  • 3. Introduction – Who are you?
    Netflix Open Source Cloud Prize
    Cloud Native – More details
    NetflixOSS – Cloud Native On-Ramp
  • 4. Adrian Cockcroft
    • Director, Architecture for Cloud Systems, Netflix Inc.
      – Previously Director for Personalization Platform
    • Distinguished Availability Engineer, eBay Inc. 2004-7
      – Founding member of eBay Research Labs
    • Distinguished Engineer, Sun Microsystems Inc. 1988-2004
      – 2003-4 Chief Architect High Performance Technical Computing
      – 2001 Author: Capacity Planning for Web Services
      – 1999 Author: Resource Management
      – 1995 & 1998 Author: Sun Performance and Tuning
      – 1996 Japanese Edition of Sun Performance and Tuning
        • SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ)
    • More
      – Twitter @adrianco
      – Blog
      – Presentations at
  • 5. Attendee Introductions
    • Who are you, where do you work
    • Why are you here today, what do you need
    • “Bring out your dead”
      – Do you have a specific problem or question?
      – One sentence elevator pitch
    • What instrument do you play?
  • 6. Boosting the @NetflixOSS Ecosystem
    See
  • 7. In 2012 Netflix Engineering won this..
  • 8. We’d like to give out prizes too
    But what for?
    Contributions to NetflixOSS!
    Shared under Apache license
    Located on github
  • 9. How long do you have?
    Entries open March 13th
    Entries close September 15th
    Six months…
  • 10. Who can win?
    Almost anyone, anywhere…
    Except current or former Netflix or AWS employees
  • 11. Who decides who wins?
    Nominating Committee
    Panel of Judges
  • 12. Judges
    Aino Corry, Program Chair for Qcon/GOTO
    Martin Fowler, Chief Scientist Thoughtworks
    Simon Wardley, Strategist
    Yury Izrailevsky, VP Cloud Netflix
    Werner Vogels, CTO Amazon
    Joe Weinman, SVP Telx, Author “Cloudonomics”
  • 13. What are Judges Looking For?
    Eligible, Apache 2.0 licensed
    NetflixOSS project pull requests
    Original and useful contribution to NetflixOSS
    Good code quality and structure
    Documentation on how to build and run it
    Code that successfully builds and passes a test suite
    Evidence that code is in use by other projects, or is running in production
    A large number of watchers, stars and forks on github
  • 14. What do you win?
    One winner in each of the 10 categories
    Ticket and expenses to attend AWS Re:Invent 2013 in Las Vegas
    A Trophy
  • 15. How do you enter?
    Get a (free) github account
    Fork us your email address
    Describe and build your entry
    Twitter #cloudprize
  • 16. Diagram labels: Entrants; Netflix Engineering; Six Judges; Winners; Nominations; Conforms to Rules; Working Code; Community Traction; Categories; Registration Opened March 13; Github; Apache Licensed Contributions; Github; Close Entries September 15; Github; Award Ceremony Dinner November, AWS Re:Invent; Ten Prize Categories; $10K cash; $5K AWS; AWS Re:Invent Tickets; Trophy
  • 17. Cloud Native
    Recap the keynote in much more detail and discussion
  • 18. A new engineering challenge
    Construct a highly agile and highly available service from ephemeral and often broken components
  • 19. Inspiration
  • 20. Netflix Streaming
    A Cloud Native Application based on an open source platform
  • 21. Netflix Member Web Site Home Page
    Personalization Driven – How Does It Work?
  • 22. How Netflix Streaming Works
    Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations
  • 23. Real Web Server Dependencies Flow
    (Netflix Home page business transaction as seen by AppDynamics)
    Diagram labels: Start Here; memcached; Cassandra; Web service; S3 bucket; Personalization movie group choosers (for US, Canada and Latam)
    Each icon is three to a few hundred instances across three AWS zones
  • 24. New Cloud Native Patterns
    Micro-services and Chaos engines
    Highly available systems composed from ephemeral components
    Open Source is the default
  • 25. Some Strategic Questions
    What changed…
  • 26. The AWS Question
    Why does Netflix use AWS when Amazon Prime is a competitor?
  • 27. Netflix vs. Amazon Prime
    • Do retailers competing with Amazon use AWS?
      – Yes, lots of them, Netflix is no different
    • Does Prime have a platform advantage?
      – No, because Netflix gets to run on AWS
    • Does Netflix take Amazon Prime seriously?
      – Yes, but so far Prime isn’t impacting our business
  • 28. Chart labels: Amazon Video 1.31%; 18x Prime; 25x Prime; Nov 2012 Streaming Bandwidth; March 2013 Mean Bandwidth; +39% 6mo
  • 29. The Google Cloud Question
    Why doesn’t Netflix use Google Cloud as well as AWS?
  • 30. Google Cloud – Wait and See
    Pros
    • Cloud Native
    • Huge scale for internal apps
    • Exposing internal services
    • Nice clean API model
    • Starting a price war
    • Fast for what it does
    • Rapid start & minute billing
    Cons
    • In beta until last week
    • No big customers yet
    • Missing many key features
    • Different arch model
    • Missing billing options
    • No SSD or huge instances
    • Zone maintenance windows
    But: Anyone interested is welcome to port NetflixOSS components to Google Cloud
  • 31. Cloud Wars: Price and Performance
    Diagram labels: AWS vs. GCS War; Private Cloud
    What Changed: Everyone using AWS or GCS gets the price cuts and performance improvements, as they happen. No need to switch vendor.
    No Change: Locked in for three years.
  • 32. The DIY Question
    Why doesn’t Netflix build and run its own cloud?
  • 33. Fitting Into Public Scale
    Chart labels: Public; Grey Area; Private; 1,000 Instances; 100,000 Instances; Netflix; Facebook; Startups
  • 34. How big is Public?
    AWS upper bound estimate based on the number of public IP Addresses
    Every provisioned instance gets a public IP by default
    AWS Maximum Possible Instance Count 3.7 Million
    Growth >10x in Three Years, >2x Per Annum
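A quick check of the two growth figures on slide 34, as a sketch only: a tenfold increase over three years corresponds to a compound factor of 10^(1/3) per year, which is a little above 2, so the ">10x in three years" and ">2x per annum" figures are consistent. The function name is illustrative and not from the deck.

```python
def annual_growth_factor(total_multiple: float, years: float) -> float:
    """Compound yearly factor that yields total_multiple after the given number of years."""
    return total_multiple ** (1.0 / years)


# Slide figure: more than 10x growth over three years.
factor = annual_growth_factor(10.0, 3.0)
print(f"{factor:.2f}x per year")
```

Compounding the yearly factor three times recovers the tenfold figure, which is why the per-annum number is the cube root rather than 10 divided by 3.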
  • 35. The Alternative Supplier Question
    What if there is no clear leader for a feature, or AWS doesn’t have what we need?
  • 36. Things We Don’t Use AWS For
    SaaS Applications – Pagerduty, Appdynamics
    Content Delivery Service
    DNS Service
  • 37. CDN Scale
    Chart labels: AWS CloudFront; Akamai; Limelight; Level 3; Netflix Openconnect; YouTube; Gigabits; Terabits; Netflix; Facebook; Startups
  • 38. Content Delivery Service
    Open Source Hardware Design + FreeBSD, bird, nginx
    see
  • 39. DNS Service
    AWS Route53 is missing too many features
    Multiple vendor strategy: Dyn, Ultra, Route53
    Abstracted (broken) DNS APIs with Denominator
  • 40. Availability Questions
    Is it running yet?
    How many places is it running in?
    How far apart are those places?
  • 41. Netflix Outages
    • Running very fast with scissors
      – Mostly self inflicted – bugs, mistakes from pace of change
      – Some caused by AWS bugs and mistakes
    • Incident Life-cycle Management by Platform Team
      – No runbooks, no operational changes by the SREs
      – Tools to identify what broke and call the right developer
    • Next step is multi-region
      – Investigating and building in stages during 2013
      – Could have prevented some of our 2012 outages
  • 42. Managing Multi-Region Availability
    Diagram labels (the first group appears twice): Regional Load Balancers; Cassandra Replicas Zone A; Cassandra Replicas Zone B; Cassandra Replicas Zone C
    DNS providers: UltraDNS; DynECT DNS; AWS Route53
    Denominator – manage traffic via multiple DNS providers
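A minimal sketch of the idea on slide 42: put several DNS providers behind one interface and apply the same traffic change to all of them. This is not Denominator's API (Denominator is a Java library); the `DnsProvider` interface, the `InMemoryProvider` stand-in, the provider names, the record name, and the region labels are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class DnsProvider(Protocol):
    """Hypothetical common interface; each real vendor would need its own adapter."""
    name: str

    def set_weights(self, record: str, weights: Dict[str, int]) -> None:
        ...


@dataclass
class InMemoryProvider:
    """Stand-in provider that only stores weights in memory (no network calls)."""
    name: str
    records: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def set_weights(self, record: str, weights: Dict[str, int]) -> None:
        self.records[record] = dict(weights)


def shift_traffic(providers: List[DnsProvider], record: str,
                  weights: Dict[str, int]) -> List[str]:
    """Apply the same regional weights at every provider; return the names updated."""
    updated = []
    for provider in providers:
        provider.set_weights(record, weights)
        updated.append(provider.name)
    return updated


# Hypothetical example: drain region-1 and send all traffic to region-2.
providers = [InMemoryProvider("provider-a"), InMemoryProvider("provider-b"),
             InMemoryProvider("provider-c")]
updated = shift_traffic(providers, "api.example.com", {"region-1": 0, "region-2": 100})
print(updated)
```

The point of the common interface is that a regional failover becomes one call rather than three vendor-specific procedures.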
  • 43. Cloud Native Big Data
    Size the cluster to the data
    Size the cluster to the questions
    Never wait for space or answers
  • 44. Netflix Dataoven
    Diagram labels: Data Warehouse, Over 2 Petabytes; Ursula; Aegisthus; Data Pipelines; From cloud Services, ~100 Billion Events/day; From C*, Terabytes of Dimension data; Hadoop Clusters – AWS EMR: 1300 nodes, 800 nodes, Multiple 150 nodes Nightly; RDS Metadata; Gateways; Tools
  • 45. Cloud Native Patterns
    Master copies of data are cloud resident
    Dynamically provisioned micro-services
    Services are distributed and ephemeral
  • 46. Cloud Native Architecture
    Diagram labels: Distributed Quorum NoSQL Datastores; Autoscaled Micro Services; Autoscaled Micro Services; Clients; Things; JVM; JVM; JVM; JVM; Cassandra; Cassandra; Cassandra; Memcached; JVM; Zone A; Zone B; Zone C
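The "distributed quorum" label on slide 46 can be illustrated with the usual majority arithmetic. This sketch assumes three replicas, one per zone, which matches the three Cassandra boxes and three zones in the diagram but is not stated as a setting in the deck; the function names are illustrative.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of the replica set."""
    return replicas // 2 + 1


def reads_see_latest_write(read_count: int, write_count: int, replicas: int) -> bool:
    """Read and write sets must overlap in at least one replica: R + W > N."""
    return read_count + write_count > replicas


def tolerated_replica_losses(replicas: int) -> int:
    """How many replicas can be down while a quorum is still reachable."""
    return replicas - quorum(replicas)


n = 3  # assumed: one replica in each of Zone A, Zone B, Zone C
q = quorum(n)
print(q, reads_see_latest_write(q, q, n), tolerated_replica_losses(n))
```

Under that assumption, quorum reads and writes keep working when any single zone is lost, which is the property the ephemeral-component architecture relies on.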
  • 47. Non-Native Cloud Architecture
    Diagram labels: Datacenter Dinosaurs; Cloudy Buffer; Agile Mobile Mammals; iOS/Android; App Servers; MySQL; Legacy Apps
  • 48. How to get to Cloud Native?
    Freedom and Responsibility for Developers
    Decentralize and Automate Ops Activities
    Integrate DevOps into the Business Organization
  • 49. Four Transitions
    • Management: Integrated Roles in a Single Organization
      – Business, Development, Operations -> BusDevOps
    • Developers: Denormalized Data – NoSQL
      – Decentralized, scalable, available, polyglot
    • Responsibility from Ops to Dev: Continuous Delivery
      – Decentralized small daily production updates
    • Responsibility from Ops to Dev: Agile Infrastructure – Cloud
      – Hardware in minutes, provisioned directly by developers
  • 50. Netflix BusDevOps Organization
    Org chart labels: Chief Product Officer; VP Product Management; Directors Product; VP UI Engineering; Directors Development; Developers + DevOps; UI Data Sources; AWS; VP Discovery Engineering; Directors Development; Developers + DevOps; Discovery Data Sources; AWS; VP Platform; Directors Platform; Developers + DevOps; Platform Data Sources; AWS
    Denormalized, independently updated and scaled data
    Cloud, independently updated and scaled infrastructure
    Code, independently updated continuous delivery
  • 51. Decentralized Deployment
  • 52. Asgard
  • 53. Ephemeral Instances
    • Largest services are autoscaled
    • Average lifetime of an instance is 36 hours
    Chart labels: Push; Autoscale Up; Autoscale Down
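To give a feel for what a 36-hour average instance lifetime means operationally, here is a back-of-envelope sketch using Little's Law (arrival rate = population / average time in system). It assumes a fleet in steady state, and the fleet size is a made-up number chosen for easy arithmetic, not a Netflix figure.

```python
def launches_per_hour(fleet_size: int, mean_lifetime_hours: float) -> float:
    """Steady-state launch rate needed to keep fleet_size instances running."""
    return fleet_size / mean_lifetime_hours


MEAN_LIFETIME_HOURS = 36.0       # from slide 53
HYPOTHETICAL_FLEET_SIZE = 3600   # assumption for illustration only

rate = launches_per_hour(HYPOTHETICAL_FLEET_SIZE, MEAN_LIFETIME_HOURS)
print(f"{rate:.0f} instance launches per hour")
```

At that rate every instance is effectively disposable, which is why deployment and recovery have to be automated rather than handled one machine at a time.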
  • 54. A Cloud Native Open Source Platform
    See
  • 55. Three Questions
    Why is Netflix doing this?
    How does it all fit together?
    What is coming next?
  • 56. Beware of Geeks Bearing Gifts: Strategies for an Increasingly Open Economy
    Simon Wardley – Researcher at the Leading Edge Forum
  • 57. How did Netflix get ahead?
    Netflix BusDevOps Org
    • Doing it since 2009
    • SaaS Applications
    • PaaS for agility
    • Public IaaS for AWS features
    • Big data in the cloud
    • Integrating many APIs
    • FOSS from github
    • Renting hardware for 1hr
    • Coding in Java/Groovy/Scala
    Traditional IT Operations
    • Taking their time
    • Pilot private cloud projects
    • Beta quality installations
    • Small scale
    • Integrating several vendors
    • Paying big $ for software
    • Paying big $ for consulting
    • Buying hardware for 3yrs
    • Hacking at scripts
  • 58. Netflix Platform Evolution
    Bleeding Edge Innovation (2009-2010); Common Pattern (2011-2012); Shared Pattern (2013-2014)
    Netflix ended up several years ahead of the industry, but it’s becoming commoditized now
  • 59. Making it easy to follow
    Exploring the wild west each time vs. laying down a shared route
  • 60. Goals
    Establish our solutions as Best Practices / Standards
    Hire, Retain and Engage Top Engineers
    Build up Netflix Technology Brand
    Benefit from a shared ecosystem
  • 61. How does it all fit together?
  • 62. Example Application – RSS Reader
  • 63. NetflixOSS Continuous Build and Deployment
    Diagram labels: Github NetflixOSS Source; AWS Base AMI; Maven Central; Cloudbees Jenkins; Aminator Bakery; Dynaslave AWS Build Slaves; Asgard (+ Frigga) Console; AWS Baked AMIs; Odin Orchestration API; AWS Account
  • 64. NetflixOSS Services Scope
    AWS Account: Asgard Console; Archaius Config Service; Cross region Priam C*; Pytheas Dashboards; Atlas Monitoring; Genie, Lipstick Hadoop Services; AWS Usage Cost Monitoring
    Multiple AWS Regions: Eureka Registry; Exhibitor ZK; Edda History; Simian Army
    3 AWS Zones: Application Clusters, Autoscale Groups, Instances; Priam Cassandra Persistent Storage; Evcache Memcached Ephemeral Storage
  • 65. NetflixOSS Instance Libraries
    Initialization
      • Baked AMI – Tomcat, Apache, your code
      • Governator – Guice based dependency injection
      • Archaius – dynamic configuration properties client
      • Eureka – service registration client
    Service Requests
      • Karyon – Base Server for inbound requests
      • RxJava – Reactive pattern
      • Hystrix/Turbine – dependencies and real-time status
      • Ribbon – REST Client for outbound calls
    Data Access
      • Astyanax – Cassandra client and pattern library
      • Evcache – Zone aware Memcached client
      • Curator – Zookeeper patterns
      • Denominator – DNS routing abstraction
    Logging
      • Blitz4j – non-blocking logging
      • Servo – metrics export for autoscaling
      • Atlas – high volume instrumentation
  • 66. NetflixOSS Testing and Automation
    Test Tools
      • CassJmeter – Load testing for Cassandra
      • Circus Monkey – Test account reservation rebalancing
    Maintenance
      • Janitor Monkey – Cleans up unused resources
      • Efficiency Monkey
      • Doctor Monkey
      • Howler Monkey – Complains about AWS limits
    Availability
      • Chaos Monkey – Kills Instances
      • Chaos Gorilla – Kills Availability Zones
      • Chaos Kong – Kills Regions
      • Latency Monkey – Latency and error injection
    Security
      • Security Monkey – security group and S3 bucket permissions
      • Conformity Monkey – architectural pattern warnings
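The "Chaos Monkey – Kills Instances" idea can be sketched as a toy simulation: pick one instance at random from each group and remove it, then see whether the service still has capacity. This is not the Chaos Monkey code or its configuration; the function, the group names, and the instance IDs are hypothetical, and nothing here talks to AWS.

```python
import random
from typing import Dict, List


def unleash_monkey(groups: Dict[str, List[str]], rng: random.Random,
                   probability: float = 1.0) -> Dict[str, str]:
    """Remove at most one randomly chosen instance per group; return what was killed."""
    killed: Dict[str, str] = {}
    for group_name, instances in groups.items():
        # Skip empty groups, and skip groups that lose the probability draw.
        if instances and rng.random() < probability:
            victim = rng.choice(instances)
            instances.remove(victim)
            killed[group_name] = victim
    return killed


# Hypothetical in-memory model of two autoscale groups.
groups = {"api": ["i-001", "i-002", "i-003"], "batch": []}
killed = unleash_monkey(groups, random.Random(42))
print(killed, groups)
```

In a real environment the autoscale group would replace the terminated instance; the exercise is only useful if the service is built so that users do not notice in the meantime.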
  • 67. What’s Coming Next?
    More Use Cases
    More Features
    Better portability
    Higher availability
    Easier to deploy
    Contributions from end users
    Contributions from vendors
  • 68. Vendor Driven Portability
    Interest in using NetflixOSS for Enterprise Private Clouds
    “It’s done when it runs Asgard”
    Functionally complete
    Demonstrated March
    Release 3.3 in 2Q13
    Some vendor interest
    Needs AWS compatible Autoscaler
    Some vendor interest
    Many missing features
    “Confused” AWS API strategy
  • 69. AWS 2009
    Baseline features needed to support NetflixOSS
    Eucalyptus 3.3
  • 70. Functionality and scale now, portability coming
    Moving from parts to a platform in 2013
    Netflix is fostering a cloud native ecosystem
    Rapid Evolution – Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
  • 71. Takeaway
    NetflixOSS makes it easier for everyone to become Cloud Native
    @adrianco #netflixcloud @NetflixOSS