Genie - Hadoop Platform as a Service at Netflix

In a prior tech blog (http://nflx.it/XoySYR), we had discussed the architecture of our petabyte-scale data warehouse in the cloud. Salient features of our architecture include the use of Amazon’s Simple Storage Service (S3) as our "source of truth", leveraging the elasticity of the cloud to run multiple dynamically resizable Hadoop clusters to support various workloads, and our horizontally scalable Hadoop Platform as a Service called Genie.

We are pleased to announce that Genie is now open source (http://nflx.it/15rd6pJ), and available to the public from the Netflix OSS GitHub site (https://github.com/Netflix/genie).
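
As context for the architecture summarized above: because the warehouse lives in S3 rather than in any single cluster's HDFS, any cluster (or any client) can enumerate and read the same data independently, which is what makes the clusters themselves disposable. Below is a minimal sketch of that idea using boto3; the bucket name and key layout are hypothetical, purely for illustration.

    # Minimal sketch: listing one day's partition of a warehouse table directly
    # from S3. The bucket and key layout are hypothetical placeholders, not
    # Netflix's actual warehouse layout.
    import boto3

    s3 = boto3.client("s3")

    def list_partition(bucket, table, dateint):
        """Return the S3 keys that make up a single date partition of a table."""
        prefix = f"warehouse/{table}/dateint={dateint}/"
        keys = []
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        return keys

    # Any cluster or gateway can perform this read-only enumeration; no single
    # cluster "owns" the data.
    print(list_partition("example-dw-bucket", "playback_events", "20130626"))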

Slide notes:
  • Reference: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html. Use cases – reporting, analytics, insights, algorithms (e.g. recommendations). But big deal – so does everyone in the room.
  • What is scale? It means different things to different people
  • A few petabytes of data – billions of log events captured each day, with retention of a few months. Many clusters – 1000s of nodes. Again, big deal – there are many others in the room who do Hadoop at this scale (petabyte is the new terabyte).
  • Our Hadoop processing is 100% in the (public) cloud. In our case, the public cloud is AWS. This is what differentiates our infrastructure from the rest. Hadoop in the cloud is different from Hadoop in the datacenter – in this talk, we will discuss our cloud-based Hadoop platform.
  • S3 is the source of truth – decoupling of storage from the computational infrastructure. S3 benefits: highly durable and available (11 9's), bucket versioning, and highly elastic – we grew our data warehouse organically from a few hundred terabytes to petabytes without having to provision any storage resources in advance. HDFS? Only for transient data and intermediate results for multi-stage jobs. S3 cons: performance, eventual consistency.
  • Another benefit of S3 – multiple clusters can read/process the same data. (Semi-)persistent SLA and ad-hoc clusters, ~800-1300 nodes. Multiple ad-hoc clusters to A/B test new releases/features. Nightly "bonus" clusters to supplement the SLA cluster. Operating assumption – clusters may go down at any time.
  • Traditional gateways/CLIs for ad-hoc querying. Genie: REST API for job execution/monitoring, and a repository/abstraction for clusters and metastores. Franklin, the metadata service (MDS): uses HCatalog/HiveServer to talk to the Hive metastore.
  • Next – we will focus on Genie for the rest of the talk. Other tools will be talked about in the other Netflix talk.
  • EMR: Hadoop IaaS, and an API to run jobs on transient clusters – our clusters are semi-persistent, and job submissions don't result in new clusters. Oozie: a workflow tool, which only supports the Hadoop ecosystem – we have hybrid jobs (Teradata + Hadoop) being orchestrated by UC4, so we just needed a job submission API; also, no support for Hive when we started. Templeton: no multi-cluster or multi-user support, not quite ready for prime time.
  • Genie is a resource "match-maker".
  • The unit of execution is a Hadoop/Hive/Pig job. Users provide scripts, dependencies and other metadata. Genie does no scheduling per se – it only does "meta-scheduling", or resource matching.
  • Status defines whether a cluster is accepting jobs. Configurations are the *-site.xml files and properties. Other metadata: cluster name, schedule, etc. (cluster registration is sketched after these notes).
  • Two classes of users: admins and end-users. Admins spin up clusters and set cluster metadata; end-users use the clusters once they have been registered. Genie is built on top of Netflix OSS.
  • Genie figures out which resources to run jobs on – the back-end resources are abstracted away. Execution is asynchronous, since jobs may be long-running (a sketch of a job-submission client appears after these notes).
  • Every job runs as a separate process using the Hadoop/Hive/Pig CLIs. This avoids "jar hell", since each job needs its own Hadoop jars. Jobs run in their own sandbox (working directory), which provides isolation between jobs, and between Genie and the jobs. Standard output/error of jobs is easily available. This also makes it possible to support multiple versions of Hadoop/Hive/Pig and to connect to multiple clusters.
  • The configuration service helps us do crazy (cool) things. We will describe each of these in greater detail.
  • New bonus clusters are launched each night – but clients are oblivious of the actual host names/IPs. One way to do this: higher-SLA jobs first ask for the cluster by name (the fallback rule is sketched after these notes).
  • If it doesn't exist, they revert back to the existing cluster. Why not just expand the existing cluster? Better isolation, and mixing and matching instance types is not ideal for Hadoop – the prod cluster uses m1.xlarges for slave nodes. Shrinking has proven to be a problem, and we want to do a hard shutdown when those instances are needed on awsprod.
  • We had to bounce the prod JobTracker to enable priorities for "long-pole" jobs. We wanted to do it with minimal impact to SLA jobs.
  • We must wait for all existing jobs to finish for minimal impact. Hadoop jobs are long-running – we don't want to kill a 5-hour job nearing its finish.
  • The prod cluster is back up after maintenance. Jobs that were scheduled on the query cluster will continue to run there until they finish. This is done from time to time – although not too often, we do red-black pushes…
  • This is the initial state – we need to spin up a new cluster, e.g. to push a new feature.
  • Spin up the new cluster, mark it as UP, and mark the old cluster as OUT_OF_SERVICE.
  • The old cluster goes from OUT_OF_SERVICE to TERMINATED.
  • Mention that we will be writing a tech blog post about this soon, with more details. Two query clusters – A/B testing a new fair share scheduler.
  • Set up desired instance counts across multiple AZs. Do "red-black" pushes using "sequential ASGs". Loss of individual nodes will cause jobs running on those nodes to be lost.
  • An auto-scaling policy is set up to expand if the number of running jobs is > ~80% (a boto3 sketch of a comparable policy appears at the end of this page).
  • Genie is still biased towards running in the cloud and at Netflix, but we will generalize/improve it based on community feedback.
  • Come listen to how we enable "Data Platform as a Service" – it is truly Lipstick on a Pig.
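
The notes above describe Genie's execution API only in outline: a submission carries a job type, file dependencies, command-line arguments and schedule/configuration tags, and the response is a job ID that can be used to poll status, fetch output, or kill the job. The sketch below shows what a client of such a REST API could look like in Python; the host, paths and field names are assumptions for illustration only, not Genie's published API (see https://github.com/Netflix/genie for the real interface).

    # Sketch of a client for a Genie-style job execution REST API.
    # The base URL, paths and payload fields are hypothetical placeholders.
    import time
    import requests

    GENIE = "http://genie.example.com:7001/genie/v0"  # hypothetical endpoint

    def submit_pig_job(script_path, args, schedule="sla", configuration="prod"):
        """Submit a Pig job and return the job ID assigned by the service."""
        payload = {
            "jobType": "pig",                  # one of hadoop / hive / pig
            "fileDependencies": [script_path], # e.g. an S3 path to the script
            "cmdArgs": args,
            "schedule": schedule,              # e.g. adhoc or sla
            "configuration": configuration,    # e.g. prod, test, unittest
        }
        resp = requests.post(f"{GENIE}/jobs", json=payload, timeout=30)
        resp.raise_for_status()
        return resp.json()["jobID"]

    def wait_for_job(job_id, poll_seconds=60):
        """Poll until the job leaves the running states; jobs can run for hours."""
        while True:
            status = requests.get(f"{GENIE}/jobs/{job_id}", timeout=30).json()["status"]
            if status not in ("INIT", "RUNNING"):
                return status
            time.sleep(poll_seconds)

    job_id = submit_pig_job("s3://example-bucket/scripts/daily_agg.pig",
                            ["-p", "dateint=20130626"])
    print(job_id, wait_for_job(job_id))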
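
Admins drive the other half of the system through the configuration (resource) API: a cluster is registered with a name, a schedule, configurations (the *-site.xml files and properties) and a status, and is later flipped to OUT_OF_SERVICE or TERMINATED to drain or retire it – which is how the bonus-cluster and red/black scenarios above work. A sketch of what those two admin calls might look like, again with hypothetical paths and field names:

    # Sketch of admin calls against a Genie-style cluster configuration API.
    # Paths and field names are hypothetical, for illustration only.
    import requests

    GENIE = "http://genie.example.com:7001/genie/v0"  # hypothetical endpoint

    def register_cluster(name, schedule, configuration, site_xml_prefix):
        """Register a newly launched cluster so jobs can be routed to it."""
        payload = {
            "name": name,
            "schedule": schedule,                 # sla, bonus, adhoc, ...
            "configuration": configuration,       # prod, test, ...
            "siteXmlsS3Prefix": site_xml_prefix,  # where the *-site.xml files live
            "status": "UP",
        }
        resp = requests.post(f"{GENIE}/config/clusters", json=payload, timeout=30)
        resp.raise_for_status()

    def set_cluster_status(name, status):
        """Flip a cluster's status, e.g. OUT_OF_SERVICE (drain) or TERMINATED (retire)."""
        resp = requests.put(f"{GENIE}/config/clusters/{name}",
                            json={"status": status}, timeout=30)
        resp.raise_for_status()

    # Red/black push: bring up the new cluster, then stop routing new jobs to the
    # old one; its in-flight jobs are allowed to finish before it is terminated.
    register_cluster("prodsla-20130626", "sla", "prod",
                     "s3://example-bucket/confs/prodsla-20130626/")
    set_cluster_status("prodsla-20130625", "OUT_OF_SERVICE")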
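
The "meta-scheduling" itself is essentially tag matching with a fallback: prefer a cluster asked for by name (e.g. the nightly bonus cluster) if one is UP, otherwise pick any UP cluster whose schedule and configuration tags match the job. The toy, in-memory restatement below is only meant to make that rule concrete; it is not Genie's code.

    # Toy illustration of the cluster "match-making" described in the notes.
    def pick_cluster(clusters, schedule, configuration, preferred_name=None):
        """clusters: list of dicts with name, schedule, configuration and status."""
        up = [c for c in clusters if c["status"] == "UP"]
        if preferred_name:
            for c in up:
                if c["name"] == preferred_name:
                    return c
        for c in up:  # fall back to tag matching
            if schedule in c["schedule"] and configuration in c["configuration"]:
                return c
        raise RuntimeError("no UP cluster matches this job")

    clusters = [
        {"name": "prodsla", "schedule": ["sla"], "configuration": ["prod"], "status": "UP"},
        {"name": "bonus", "schedule": ["bonus", "sla"], "configuration": ["prod"], "status": "OUT_OF_SERVICE"},
    ]
    # The nightly bonus cluster has been taken OUT_OF_SERVICE, so an SLA job that
    # prefers it falls back to the regular prod SLA cluster.
    print(pick_cluster(clusters, "sla", "prod", preferred_name="bonus")["name"])  # prodsla
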
Transcript: Genie - Hadoop Platform as a Service at Netflix

    1. Genie – Hadoop Platform as a Service at Netflix. Sriram Krishnan, Hadoop Summit, June 26, 2013.
    2. Netflix does Hadoop
    3. Netflix does Hadoop at scale
    4. Netflix does Hadoop at scale*
    5. Netflix does Hadoop at scale in the cloud
    6. S3 as the Cloud Data Warehouse
    7. Multiple Hadoop Clusters – Cloud Data Warehouse; Hadoop (EMR) Clusters
    8. Data Platform as a Service – Cloud Data Warehouse; Hadoop (EMR) Clusters; Hadoop Platform as a Service: Job Execution, Resource Configuration & Management, Metadata Service (Franklin)
    9. Large Ecosystem of Clients & Tools – Cloud Data Warehouse; Hadoop (EMR) Clusters; Hadoop Platform as a Service: Job Execution, Resource Configuration & Management, Metadata Service (Franklin)
    10. Why Genie? Simple API for job submission and management. Accessible from the data center and the cloud. Abstraction of physical details of back-end Hadoop clusters.
    11. What Genie is Not: a workflow scheduler, such as Oozie; a task scheduler, such as fair share or capacity schedulers; an end-to-end resource management tool.
    12. Genie: Job Execution – API to run Hadoop, Hive and Pig jobs; auto-magic submission of jobs to the right Hadoop cluster; abstracting away cluster details from clients.
    13. Genie: Resource Configuration – API for management of cluster metadata; status: up, out of service, or terminated; site-specific Hadoop, Hive and Pig configurations; cluster naming/tagging for job submissions.
    14. [Architecture diagram] Genie registers itself with the Eureka service; clients (including a Python API) discover the service via Eureka and invoke it (submit jobs) through Ribbon; admins launch and register clusters; Genie launches Hadoop/Hive/Pig jobs on them. Built on Netflix OSS (http://netflix.github.com): Karyon, Archaius, Eureka, Ribbon, Servo.
    15. Genie: Job Execution – the REST call specifies: job type: {hadoop, hive, pig}; file dependencies (script, UDFs, etc.); command-line arguments; schedule: {adhoc, sla}; configuration: {prod, test, unittest}.
    16. Genie: Job Execution – response: job ID* (* used to query status, get outputs, kill the job).
    17. Genie Job Details – job ID; script to execute; standard output and error; Pig logs; job conf directory.
    18. Genie – Use Cases Enabled at Netflix: running nightly short-lived "bonus" clusters to augment ETL processing; re-routing traffic between clusters; "red/black" pushes for clusters; attaching stand-alone gateways to clusters; running 100% of all SLA jobs, and a high percentage of ad-hoc jobs.
    19. Nightly Short-lived Bonus Clusters – Execution Service, Configuration Service. Prod SLA cluster: schedule: sla; configurations: prod.
    20. Nightly Short-lived Bonus Clusters – Bonus cluster: schedule: bonus; configurations: prod. Prod SLA cluster: schedule: sla; configurations: prod. Job request: {Schedule=bonus, Configuration=prod}.
    21. Nightly Short-lived Bonus Clusters – Bonus cluster: schedule: bonus; configurations: prod; status: OUT_OF_SERVICE. Prod SLA cluster: schedule: sla; configurations: prod. Job request: {Schedule=sla, Configuration=prod}.
    22. Nightly Short-lived Bonus Clusters – Bonus cluster: schedule: bonus; configurations: prod; status: TERMINATED. Prod SLA cluster: schedule: sla; configurations: prod. Job request: {Schedule=sla, Configuration=prod}.
    23. Rerouting Traffic Between Clusters – Ad-hoc cluster: schedule: adhoc; configurations: prod, test. Prod SLA cluster: schedule: sla; configurations: prod. Job request: {Schedule=sla, Configuration=prod}.
    24. Rerouting Traffic Between Clusters – Ad-hoc cluster: schedule: adhoc, sla; configurations: prod, test. Prod SLA cluster: schedule: sla; configurations: prod; status: OUT_OF_SERVICE. Job request: {Schedule=sla, Configuration=prod}.
    25. Rerouting Traffic Between Clusters – Ad-hoc cluster: schedule: adhoc; configurations: prod, test. Prod SLA cluster: schedule: sla; configurations: prod; status: UP. Job request: {Schedule=sla, Configuration=prod}.
    26. "Red/Black" Pushes for Clusters – Prod SLA cluster: schedule: sla; configurations: prod; status: UP. Job request: {Schedule=sla, Configuration=prod}.
    27. "Red/Black" Pushes for Clusters – Prod SLA cluster: schedule: sla; configurations: prod; status: OUT_OF_SERVICE. Prod SLA cluster: schedule: sla; configurations: prod; status: UP. Job request: {Schedule=sla, Configuration=prod}.
    28. "Red/Black" Pushes for Clusters – Prod SLA cluster: schedule: sla; configurations: prod; status: TERMINATED. Prod SLA cluster: schedule: sla; configurations: prod; status: UP. Job request: {Schedule=sla, Configuration=prod}.
    29. Genie Usage at Netflix – usage statistics brought to you by "Sherlock": a Pig job to gather Hadoop job statistics, with Tableau-based visualization.
    30. Cloud Deployment – Asgard is also part of Netflix OSS: https://github.com/Netflix/asgard
    31. Auto Scaling in the Cloud
    32. Genie is now part of Netflix OSS! http://techblog.netflix.com/2013/06/genie-is-out-of-bottle.html – clone it on GitHub at https://github.com/Netflix/genie. Still "version 0" – work in progress! All contributions and feedback welcome! Come talk to us and check out live demos at the Netflix booth.
    33. Watching Pigs Fly with the Netflix Hadoop Toolkit
    34. Sriram Krishnan – We're hiring! Thank you! Home: http://www.netflix.com; Jobs: http://jobs.netflix.com; Tech Blog: http://techblog.netflix.com/
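
The deployment notes and slides 30-31 mention that the Genie auto-scaling group expands when the number of running jobs crosses roughly 80%; at Netflix this is managed through Asgard and Servo-published metrics. As a rough, stack-agnostic illustration, here is what a comparable scale-out policy could look like with boto3 – the ASG name, metric name, namespace and thresholds are hypothetical.

    # Rough sketch of a scale-out policy comparable to the one described in the
    # notes ("expand if number of running jobs > ~80%"). Netflix drives this via
    # Asgard/Servo; this boto3 version is only an illustration, and the ASG name,
    # metric, namespace and thresholds are hypothetical.
    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    # Add two Genie instances whenever the alarm below fires.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="genie-prod-v001",
        PolicyName="genie-scale-out",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=2,
        Cooldown=600,
    )

    # Alarm on a custom metric the service would publish (running jobs as a
    # percentage of configured capacity), firing when it stays above 80%.
    cloudwatch.put_metric_alarm(
        AlarmName="genie-running-jobs-high",
        Namespace="Example/Genie",
        MetricName="RunningJobsPercent",
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )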
