Genie – Hadoop Platform as a Service at Netflix
Sriram Krishnan
Hadoop Summit, June 26, 2013
Genie - Hadoop Platform as a Service at Netflix

Recently on our tech blog, we discussed the architecture of our petabyte-scale data warehouse in the cloud (http://nflx.it/XoySYR). Salient features include the use of Amazon's Simple Storage Service (S3) as our “source of truth”, leveraging the elasticity of the cloud to run multiple dynamically resizable Hadoop clusters to support various workloads, and our implementation of a horizontally scalable Hadoop Platform as a Service called “Genie”. In this presentation, we will focus on Genie, which provides job and resource management for the Hadoop ecosystem in the cloud, and is the core service that the various components of the enterprise ecosystem at Netflix use to integrate with Hadoop in the cloud. From the perspective of the end-user, Genie abstracts away the physical details of various (potentially transient) Hadoop resources in the cloud, and provides RESTful APIs to submit and monitor Hadoop, Hive and Pig jobs without having to install any Hadoop clients. We will describe how Genie is used in production at Netflix for processing hundreds of terabytes of data every day, running thousands of ETL (extract, transform, load) jobs, plus hundreds of ad-hoc jobs from our visualization tools and our web interface. Finally, we will discuss our plans for open sourcing Genie.
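
As a rough illustration of the job submission described above, the sketch below builds the kind of job description a client might send to a REST endpoint. The field names, the `build_job_request` helper, the allowed value sets and the S3 paths are all assumptions made for illustration (loosely based on the parameters listed in the slides); they are not Genie's actual request schema.

```python
import json

# Hypothetical field names and value sets, modelled on the slide text.
# This is NOT Genie's actual request schema.
JOB_TYPES = {"hadoop", "hive", "pig"}
SCHEDULES = {"adhoc", "sla", "bonus"}
CONFIGURATIONS = {"prod", "test", "unittest"}


def build_job_request(job_type, script, dependencies, cmd_args, schedule, configuration):
    """Validate the inputs and return a JSON-serialisable job description."""
    if job_type not in JOB_TYPES:
        raise ValueError(f"unknown job type: {job_type!r}")
    if schedule not in SCHEDULES:
        raise ValueError(f"unknown schedule: {schedule!r}")
    if configuration not in CONFIGURATIONS:
        raise ValueError(f"unknown configuration: {configuration!r}")
    return {
        "jobType": job_type,
        "fileDependencies": [script, *dependencies],
        "cmdArgs": cmd_args,
        "schedule": schedule,
        "configuration": configuration,
    }


request = build_job_request(
    job_type="pig",
    script="s3://example-bucket/scripts/daily_report.pig",  # placeholder path
    dependencies=["s3://example-bucket/udfs/example_udfs.jar"],  # placeholder path
    cmd_args="-p date=20130626",
    schedule="sla",
    configuration="prod",
)

# The serialised body is what a client would send in the REST call.
body = json.dumps(request, sort_keys=True)
print(body)
```

Note that the request names a schedule and a configuration rather than a cluster host, which is what lets the service pick the back-end cluster on the client's behalf.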

Published in: Technology, Business

  • Reference tech blog post (http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html). Use cases: reporting, analytics, insights, algorithms (e.g. recommendations). But big deal: so does everyone in the room.
  • What is scale? It means different things to different people
  • 80-100 billion events per day, 10s of TB of data (compressed). Totals ~2 PB (retention is a few months). Many clusters: 2000-2500 nodes at different times during the day. Again, big deal: there are many others in the room who do Hadoop at this scale (petabyte is the new terabyte).
  • Our Hadoop processing is 100% in the (public) cloud. In our case, the public cloud is AWS. This is what differentiates our infrastructure from the rest. Hadoop in the cloud is different from Hadoop in the datacenter; in this talk, we will discuss our cloud-based Hadoop platform. We made certain architectural choices to make it easy for our end-users to run Hadoop jobs, and for us to manage Hadoop resources.
  • S3 is the source of truth. S3 benefits: highly durable and available (11 9’s); bucket versioning; highly elastic (we grew our data warehouse organically from a few hundred terabytes to petabytes without having to provision any storage resources in advance). HDFS? Only for transient data and intermediate results for multi-stage jobs. S3 cons: performance, eventual consistency.
  • Another benefit of S3: multiple clusters can read/process the same data. (Semi-)persistent SLA and ad-hoc clusters: ~800-1300 nodes. Multiple ad-hoc clusters to A/B test new releases/features. Nightly "bonus" clusters to supplement the SLA cluster. Operating assumption: clusters may go down at any time. If we lose a cluster, we just respin it. Clusters are interchangeable: storage is decoupled from the computational infrastructure.
  • All end-users want to do is run jobs and access their data. As the platform team, our goal is to shield them from the back-end complexity. Genie: REST API for job execution/monitoring; repository/abstraction for clusters and metastores. Franklin (MDS): uses HiveServer to talk to the Hive metastore. In all honesty, very few people use this API directly.
  • Next: we will focus on Genie for the rest of the talk. Other tools will be covered in the other Netflix talk, “Watching Pigs Fly with the Netflix Hadoop Toolkit” (Thu, 1:40 PM).
  • EMR: Hadoop IaaS, and an API to run jobs on transient clusters; our clusters are semi-persistent, and job submissions don’t result in new clusters. Oozie: workflow tool, which only supports the Hadoop ecosystem; we have hybrid jobs (Teradata+Hadoop) being orchestrated by UC4, so we just needed a job submission API. Also no support for Hive when we started. Templeton: no multi-cluster, multi-user support, not quite ready for prime-time.
  • Genie is a resource “match-maker”. Next: we look at two key services that Genie provides.
  • Unit of execution is a Hadoop/Hive/Pig job. Users provide scripts, dependencies and other metadata. Does no scheduling per se; only does “meta-scheduling” or resource matching.
  • Status defines whether a cluster is accepting jobs. Configurations are *-site.xmls and properties. Cluster name, schedule, etc. Next we look at the two classes of users supported by Genie, and the overall lifecycle.
  • Two classes of users: admins and end-users. Admins spin up clusters and set cluster metadata. Users use the clusters once they have been registered. Genie is built on top of Netflix OSS.
  • Genie figures out the resources to run jobs on; back-end resources are abstracted out. Asynchronous execution since jobs may be long-running.
  • Every job runs as a separate process using the Hadoop/Hive/Pig CLI. Avoids “jar hell” since it needs Hadoop jars. Jobs run in their own sandbox (working directory). Provides isolation between jobs, and between Genie and the jobs. Standard output/error of jobs easily available. Able to support multiple versions of Hadoop/Hive/Pig, and connect to multiple clusters.
  • Configuration service helps us do crazy (cool) things. Will describe each of these in greater detail.
  • New bonus clusters launched each night, but clients are oblivious of actual host names/IPs. One way to do this: higher SLA jobs first ask for cluster by name.
  • If it doesn’t exist, revert back to existing cluster. Why not just expand? Better isolation. Mixing and matching instance types not ideal for Hadoop. Prod cluster uses m1.xlarges for slave nodes. Shrink has proven to be a problem. We want to do hard shutdown when those instances are needed on awsprod.
  • We had to bounce the prod job tracker to enable priorities for “long-pole” jobs. Wanted to do it with minimal impact to SLA jobs.
  • Must wait for all existing jobs to finish for minimal impact. Hadoop jobs are long running; don’t want to kill a 5 hour job nearing its finish.
  • Prod cluster is back up after maintenance. Jobs that were scheduled on the query cluster will continue to run there until they finish. This is done from time to time; although not too often, we do red-black pushes…
  • This is initial state – we need to spin up a new cluster, e.g. to push a new feature
  • * Spin up new cluster, mark it as UP, mark old cluster as OOS
  • OUT_OF_SERVICE to TERMINATED
  • Our techblog shows number of Hadoop jobs; this shows Genie jobs. Two query clusters: A/B testing new fair share scheduler. Mention that we will be writing a techblog about this soon, with more details.
  • Set up desired instance counts across multiple AZs. Do “red-black” pushes using “sequential ASGs”. Loss of individual nodes will cause jobs running on those nodes to be lost.
  • Auto-scaling policy set up to expand if number of running jobs > ~80%
  • Still biased towards running in the cloud and at Netflix, but will generalize/improve it based on community feedback
  • * Come listen to how we enable “Data Platform as a Service” – it is truly Lipstick on a Pig.
  • Genie - Hadoop Platform as a Service at Netflix
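
The resource “match-making” described in the notes above (clusters tagged with schedules and configurations, and only clusters that are up accepting jobs) can be sketched as follows. This is a minimal illustration under stated assumptions, not Genie's implementation: the `Cluster` class, `match_cluster`, `route` and the cluster names are all hypothetical, and the real selection logic may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster metadata record (names are illustrative only)."""
    name: str
    schedules: frozenset
    configurations: frozenset
    status: str = "UP"  # "UP", "OUT_OF_SERVICE" or "TERMINATED"


def match_cluster(clusters: List[Cluster], schedule: str, configuration: str) -> Optional[Cluster]:
    """Return the first UP cluster tagged with the schedule and configuration."""
    for cluster in clusters:
        if (cluster.status == "UP"
                and schedule in cluster.schedules
                and configuration in cluster.configurations):
            return cluster
    return None


def route(clusters: List[Cluster], schedules: List[str], configuration: str) -> Optional[Cluster]:
    """Try each schedule in order of preference; return the first match or None."""
    for schedule in schedules:
        match = match_cluster(clusters, schedule, configuration)
        if match is not None:
            return match
    return None


prod = Cluster("prod-sla", frozenset({"sla"}), frozenset({"prod"}))
bonus_up = Cluster("bonus", frozenset({"bonus"}), frozenset({"prod"}))
bonus_oos = Cluster("bonus", frozenset({"bonus"}), frozenset({"prod"}),
                    status="OUT_OF_SERVICE")

# While the bonus cluster is up, a job that prefers "bonus" lands there.
print(route([prod, bonus_up], ["bonus", "sla"], "prod").name)   # bonus
# Once it is out of service, the same job falls back to the SLA cluster.
print(route([prod, bonus_oos], ["bonus", "sla"], "prod").name)  # prod-sla
```

The same matching covers the rerouting and red/black cases: tagging another cluster with the `sla` schedule, or flipping a cluster's status, changes where new jobs go without any change on the client side.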

    1. Genie – Hadoop Platform as a Service at Netflix. Sriram Krishnan, Hadoop Summit, June 26, 2013
    2. Netflix does Hadoop
    3. Netflix does Hadoop at scale
    4. Netflix does Hadoop at scale*
    5. Netflix does Hadoop at scale in the cloud
    6. S3 as the Cloud Data Warehouse. Diagram labels: Cloud Data Warehouse
    7. Multiple Hadoop Clusters. Diagram labels: Cloud Data Warehouse; Hadoop (EMR) Clusters
    8. Data Platform as a Service. Diagram labels: Cloud Data Warehouse; Hadoop (EMR) Clusters; Hadoop Platform as a Service; Job Execution; Resource Configuration & Management; Metadata Service (Franklin)
    9. Large Ecosystem of Clients & Tools. Diagram labels: Cloud Data Warehouse; Hadoop (EMR) Clusters; Hadoop Platform as a Service; Job Execution; Resource Configuration & Management; Metadata Service (Franklin)
    10. Why Genie? • Simple API for job submission and management • Accessible from the data center and the cloud • Abstraction of physical details of back-end Hadoop clusters
    11. What Genie is Not • A workflow scheduler, such as Oozie • A task scheduler, such as fair share or capacity schedulers • An end-to-end resource management tool
    12. Genie: Job Execution • API to run Hadoop, Hive and Pig jobs • Auto-magic submission of jobs to the right Hadoop cluster • Abstracting away cluster details from clients
    13. Genie: Resource Configuration • API for management of cluster metadata • Status: up, out of service, or terminated • Site-specific Hadoop, Hive and Pig configurations • Cluster naming/tagging for job submissions
    14. Architecture diagram. Labels: Eureka Service; Registers service; Discovers service; Invokes (submits job); Launches job; Launches cluster(s); Registers cluster. Actors: End-users (Client, Eureka Client, Ribbon); Admins (Client, Eureka Client, Python API). Components: Karyon, Archaius, Eureka Client, Ribbon, Servo, Hadoop, Hive, Pig. Netflix OSS: http://netflix.github.com
    15. Genie: Job Execution (REST call) • Job Type: {hadoop, hive, pig} • File dependencies (script, udfs, etc) • Command-line arguments • Schedule: {adhoc, sla} • Configuration: {prod, test, unittest}
    16. Genie: Job Execution. Response: job ID* (* used to query status, get outputs, kill job)
    17. Genie Job Details. Labels: Job ID; Script to execute; Standard output and error; Pig logs; Job conf directory
    18. Genie – Use Cases Enabled at Netflix • Running nightly short-lived “bonus” clusters to augment ETL processing • Re-routing traffic between clusters • “Red/black” pushes for clusters • Attaching stand-alone gateways to clusters • Running 100% of all SLA jobs, and a high percentage of ad-hoc jobs
    19. Nightly Short-lived Bonus Clusters. Execution Service; Configuration Service. Prod SLA Cluster: Schedule: sla; Configurations: prod
    20. Nightly Short-lived Bonus Clusters. Execution Service; Configuration Service; {Schedule=bonus, Configuration=prod}. Bonus Cluster: Schedule: bonus; Configurations: prod. Prod SLA Cluster: Schedule: sla; Configurations: prod
    21. Nightly Short-lived Bonus Clusters. Execution Service; Configuration Service; {Schedule=sla, Configuration=prod}. Bonus Cluster: Schedule: bonus; Configurations: prod; Status: OUT_OF_SERVICE. Prod SLA Cluster: Schedule: sla; Configurations: prod
    22. Nightly Short-lived Bonus Clusters. Execution Service; Configuration Service; {Schedule=sla, Configuration=prod}. Bonus Cluster: Schedule: bonus; Configurations: prod; Status: TERMINATED. Prod SLA Cluster: Schedule: sla; Configurations: prod
    23. Rerouting Traffic Between Clusters. Execution Service; Configuration Service; {Schedule=sla, Configuration=prod}. Ad-hoc Cluster: Schedule: adhoc; Configurations: prod, test. Prod SLA Cluster: Schedule: sla; Configurations: prod
    24. Rerouting Traffic Between Clusters. Execution Service; Configuration Service; {Schedule=sla, Configuration=prod}. Ad-hoc Cluster: Schedule: adhoc, sla; Configurations: prod, test. Prod SLA Cluster: Schedule: sla; Configurations: prod; Status: OUT_OF_SERVICE
    25. Rerouting Traffic Between Clusters. Execution Service; Configuration Service; {Schedule=sla, Configuration=prod}. Ad-hoc Cluster: Schedule: adhoc; Configurations: prod, test. Prod SLA Cluster: Schedule: sla; Configurations: prod; Status: UP
    26. “Red/Black” Pushes for Clusters. Execution Service; Configuration Service; {Schedule=sla, Configuration=prod}. Prod SLA Cluster: Schedule: sla; Configurations: prod; Status: UP
    27. “Red/Black” Pushes for Clusters. Execution Service; Configuration Service; {Schedule=sla, Configuration=prod}. Prod SLA Cluster: Schedule: sla; Configurations: prod; Status: OUT_OF_SERVICE. Prod SLA Cluster: Schedule: sla; Configurations: prod; Status: UP
    28. “Red/Black” Pushes for Clusters. Execution Service; Configuration Service; {Schedule=sla, Configuration=prod}. Prod SLA Cluster: Schedule: sla; Configurations: prod; Status: TERMINATED. Prod SLA Cluster: Schedule: sla; Configurations: prod; Status: UP
    29. Genie Usage at Netflix • Usage statistics brought to you by “Sherlock” • Pig job to gather Hadoop job statistics • Tableau-based visualization
    30. Genie Deployment in the Cloud • Asgard is also part of Netflix OSS • https://github.com/Netflix/asgard
    31. Auto Scaling in the Cloud
    32. Genie is now part of Netflix OSS! • http://techblog.netflix.com/2013/06/genie-is-out-of-bottle.html • Clone it on GitHub at: https://github.com/Netflix/genie • Still “version 0”: work in progress! • All contributions and feedback welcome! • Come talk to us and check out live demos at the Netflix Booth
    33. Watching Pigs Fly with the Netflix Hadoop Toolkit
    34. Sriram Krishnan. We’re hiring! Thank you! Home: http://www.netflix.com Jobs: http://jobs.netflix.com Tech Blog: http://techblog.netflix.com/
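
The auto-scaling note for slide 31 (expand when the number of running jobs exceeds roughly 80%) can be read as a simple threshold check. The sketch below assumes the 80% is measured against the total number of jobs the current Genie instances can run at once; that interpretation, the `should_expand` name and its parameters are assumptions, since the slides do not spell out what the percentage is relative to.

```python
def should_expand(running_jobs: int, job_capacity: int, threshold: float = 0.8) -> bool:
    """Return True when utilisation exceeds the threshold (assumed ~80%).

    job_capacity is the assumed maximum number of concurrent jobs across
    the currently running instances; it is a hypothetical parameter.
    """
    if job_capacity <= 0:
        raise ValueError("job_capacity must be positive")
    return running_jobs / job_capacity > threshold


# 45 of 50 job slots in use is 90% utilisation, above the threshold.
print(should_expand(running_jobs=45, job_capacity=50))
# 20 of 50 job slots in use is 40% utilisation, below the threshold.
print(should_expand(running_jobs=20, job_capacity=50))
```

In a real deployment the equivalent check would typically live in the auto-scaling policy of the cloud provider rather than in application code.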
