May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
 


During the past three years Oozie has become the de facto workflow scheduling system for Hadoop. Oozie has proven itself as a scalable, secure and multi-tenant service. It stably processes more than 45% of the jobs run across more than 25 Hadoop clusters at Yahoo. At the same time, adoption in other enterprises has increased substantially since Oozie was contributed to the Apache community. We attribute these achievements to design decisions described in a paper that was selected to be presented at a workshop during the ACM/SIGMOD conference. This presentation covers the key architectural design choices described in the paper. Operational metrics will be used to illustrate production experience at Yahoo, and we will also include a quick tutorial.


    Presentation Transcript

    • Oozie: Towards a Scalable Workflow Management System for Hadoop
      Mohammad Islam and Virag Kothari
    • Accepted Paper
      • Workshop in ACM/SIGMOD, May 2012.
      • It is a team effort!
        Mohammad Islam, Angelo Huang, Mohamed Battisha, Michelle Chiang, Santhosh Srinivasan, Craig Peters, Andreas Neumann, Alejandro Abdelnur
    • Presentation Workflow
      [diagram: Oozie Tutorial -> Design Decisions -> Results -> Questions? -> Address Question -> END]
    • Installing Oozie
      Step 1: Download the Oozie tarball
        curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz
      Step 2: Unpack the tarball
        tar -xzvf <PATH_TO_OOZIE_TAR>
      Step 3: Run the setup script
        bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip
      Step 4: Start Oozie
        bin/oozie-start.sh
      Step 5: Check the status of Oozie
        bin/oozie admin -oozie http://localhost:11000/oozie -status
    • Running an Example
      • Standalone Map-Reduce job
          $ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir
      • Using Oozie
          Example DAG: Start -> MapReduce wordcount -> End (on OK) or Kill (on ERROR)
          Workflow.xml:
            <workflow-app name=..>
              <start..>
              <action>
                <map-reduce>
                ……
                ……
            </workflow-app>
    • Example Workflow
        <action name='wordcount'>
          <map-reduce>
            <configuration>
              <property>
                <name>mapred.mapper.class</name>
                <value>org.myorg.WordCount.Map</value>
              </property>
              <property>
                <name>mapred.reducer.class</name>
                <value>org.myorg.WordCount.Reduce</value>
              </property>
              <property>
                <name>mapred.input.dir</name>
                <value>/usr/joe/inputDir</value>
              </property>
              <property>
                <name>mapred.output.dir</name>
                <value>/usr/joe/outputDir</value>
              </property>
            </configuration>
          </map-reduce>
        </action>
    • A Workflow Application
      Three components are required for a workflow:
      1) Workflow.xml: contains the job definition
      2) Libraries: an optional 'lib/' directory containing .jar/.so files
      3) Properties file:
         • Parameterization of the workflow XML
         • The mandatory property is oozie.wf.application.path
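
The three components above could be laid out as in the following minimal sketch. The directory name `wordcount-app`, the `nameNode` and `jobTracker` property names and values, and the HDFS path are assumptions made for illustration. The slides name only `oozie.wf.application.path` as mandatory.

```shell
#!/bin/sh
# Sketch: build a local workflow application directory (all names are hypothetical).
set -e
APP_DIR="$(mktemp -d)/wordcount-app"

# 1) workflow.xml holds the job definition.
# 2) lib/ is optional and holds .jar/.so files.
mkdir -p "$APP_DIR/lib"
: > "$APP_DIR/workflow.xml"   # empty placeholder; a real workflow definition goes here

# 3) job.properties parameterizes workflow.xml.
# The quoted 'EOF' keeps ${nameNode} literal so that Oozie resolves it, not the shell.
cat > "$APP_DIR/job.properties" <<'EOF'
# Hypothetical values, assumed for illustration
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
# The one mandatory property: where the workflow application lives on HDFS
oozie.wf.application.path=${nameNode}/user/joe/wordcount-app
EOF

cat "$APP_DIR/job.properties"
```

In a typical setup the application directory is copied to HDFS, while the properties file stays on the local machine and is passed to the `oozie` client with `-config`.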
    • Workflow Submission
      Run Workflow Job
        $ oozie job -run -config job.properties -oozie http://localhost:11000/oozie/
        Workflow ID: 00123-123456-oozie-wrkf-W
      Check Workflow Job Status
        $ oozie job -info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/
        -----------------------------------------------------------------------
        Workflow Name: test-wf
        App Path: hdfs://localhost:11000/user/your_id/oozie/
        Workflow job status [RUNNING]
        ...
        -----------------------------------------------------------------------
    • Key Features and Design Decisions
      • Multi-tenant
      • Security
        – Authenticate every request
        – Pass appropriate token to Hadoop job
      • Scalability
        – Vertical: add extra memory/disk
        – Horizontal: add machines
    • Oozie Job Processing
      [diagram labels: End user, Oozie Server, Oozie Security, Hadoop Access, Secure Job, Kerberos]
    • Oozie-Hadoop Security
      [diagram labels: End user, Oozie Server, Oozie Security, Hadoop Access, Secure Job, Kerberos]
    • Oozie-Hadoop Security
      • Oozie is a multi-tenant system
      • A job can be scheduled to run later
      • Oozie submits and maintains the Hadoop jobs
      • Hadoop needs a security token for each request
      Question: Who should provide the security token to Hadoop, and how?
    • Oozie-Hadoop Security Contd.
      • Answer: Oozie
      • How?
        – Hadoop considers Oozie as a super-user
        – Hadoop does not check the end-user credential
        – Hadoop only checks the credential of the Oozie process
      • BUT the Hadoop job is executed as the end-user.
      • Oozie utilizes the doAs() functionality of Hadoop.
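
On the Hadoop side, treating a service as a super-user that may act on behalf of end users is commonly expressed through Hadoop's proxy-user properties in core-site.xml. The slides do not show this configuration, so the sketch below rests on assumptions: the service user name `oozie`, the host name, and the group value are all hypothetical.

```shell
#!/bin/sh
# Sketch: write proxy-user properties to a scratch file for review.
# Nothing here touches a real Hadoop installation.
set -e
OUT="$(mktemp)"
OOZIE_USER="oozie"               # assumed Unix user that runs the Oozie server
OOZIE_HOST="oozie.example.com"   # hypothetical host the Oozie server runs on

cat > "$OUT" <<EOF
<!-- Fragment to merge into core-site.xml on the Hadoop side -->
<property>
  <name>hadoop.proxyuser.${OOZIE_USER}.hosts</name>
  <value>${OOZIE_HOST}</value>
</property>
<property>
  <name>hadoop.proxyuser.${OOZIE_USER}.groups</name>
  <value>users</value>
</property>
EOF

cat "$OUT"
```

With settings of this kind, Hadoop authenticates only the Oozie process, and the job still runs under the identity of the end user that Oozie impersonates.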
    • User-Oozie Security
      [diagram labels: End user, Oozie Server, Oozie Security, Hadoop Access, Secure Job, Kerberos]
    • Why Oozie Security?
      • One user should not modify another user's job
      • Hadoop doesn't authenticate the end-user
      • Oozie has to verify its user before passing the job to Hadoop
    • How does Oozie Support Security?
      • Built-in authentication
        – Kerberos
        – Non-secured (default)
      • Design Decision
        – Pluggable authentication
        – Easy to include a new type of authentication
        – Yahoo supports 3 types of authentication.
    • Job Submission to Hadoop
      • Oozie is designed to handle thousands of jobs at the same time
      • Question: Should the Oozie server
        – Submit the Hadoop job directly?
        – Wait for it to finish?
      • Answer: No
    • Job Submission Contd.
      • Reason
        – Resource constraints: a single Oozie process can't simultaneously create thousands of threads, one for each Hadoop job. (Scaling limitation)
        – Isolation: running user code on the Oozie server might destabilize Oozie
      • Design Decision
        – Create a launcher Hadoop job
        – Execute the actual user job from the launcher.
        – Wait asynchronously for the job to finish.
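
The launcher idea can be illustrated with plain shell processes. This is a toy analogy under stated assumptions, not Oozie code: `user_job`, `launcher`, the job ids and the status files are hypothetical stand-ins. The closing `wait` exists only so the script can print results. The slides do not say how Oozie itself learns that a job has completed.

```shell
#!/bin/sh
# Toy sketch of the launcher pattern using background processes.
set -e
WORK="$(mktemp -d)"

# Stand-in for the user's actual job; in Oozie this would be a Hadoop job.
user_job() { sleep 1; echo "wordcount done"; }

# The launcher runs the user job outside the server's own flow and records
# the outcome in a status file, which plays the role of a completion signal.
launcher() {
  if user_job > "$WORK/$1.out" 2>&1; then
    echo SUCCEEDED > "$WORK/$1.status"
  else
    echo FAILED > "$WORK/$1.status"
  fi
}

# "Server": start one launcher per job in the background and move on at once,
# so no server thread is tied up for the lifetime of a job.
for id in job1 job2 job3; do
  launcher "$id" &
done
echo "server accepted 3 jobs without blocking"

# Later, completion is read from the status files.
wait
for id in job1 job2 job3; do
  echo "$id: $(cat "$WORK/$id.status")"
done
```

The trade-off has the same shape as on the slides: the server stays light and isolated from user code, at the cost of one extra process (in Oozie, one extra map slot) per job.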
    • Job Submission to Hadoop
      [diagram: Oozie Server outside the Hadoop Cluster; Job Tracker, Launcher Mapper and Actual M/R Job inside it; interactions numbered 1 to 5]
    • Job Submission Contd.
      • Advantages
        – Horizontal scalability: if load increases, add machines to the Hadoop cluster
        – Stability: isolation of user code and system process
      • Disadvantages
        – An extra map slot is occupied by each job.
    • Production Setup
      • Total number of nodes: 42K+
      • Total number of clusters: 25+
      • Total number of processed jobs ≈ 750K/month
      • Data presented is from two clusters
      • Each of them has nearly 4K nodes
      • Total number of users per cluster = 50
    • Oozie Usage Pattern @ Y!
      [chart: Distribution of Job Types on Production Clusters; y-axis: Percentage (0 to 50); x-axis: Job type (fs, java, map-reduce, pig); series: #1 Cluster, #2 Cluster]
      • Pig and Java are the most popular
      • Pure Map-Reduce jobs are fewer
    • Experimental Setup
      • Number of nodes: 7
      • Number of map slots: 28
      • 4 cores, 16 GB RAM
      • 64-bit RHEL
      • Oozie Server
        – 3 GB RAM
        – Internal queue size = 10K
        – Number of worker threads = 300
    • Job Acceptance
      [chart: Workflow Acceptance Rate; y-axis: workflows accepted/min (0 to 1400); x-axis: number of submission threads (2 to 640)]
      Observation: Oozie can accept a large number of jobs
    • Timeline of an Oozie Job
      [timeline: User submits job -> Oozie submits to Hadoop -> Job completes at Hadoop -> Job completes at Oozie; the first interval is the preparation overhead, the last interval is the completion overhead]
      Total Oozie Overhead = Preparation + Completion
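
As a worked example of the overhead formula, the sketch below derives both overhead terms from the four timeline events. The timestamps are made-up values in milliseconds, chosen only to make the arithmetic easy to follow. They are not measurements from the talk.

```shell
#!/bin/sh
# Sketch: compute Oozie overhead from four timeline events (hypothetical values, in ms).
set -e
T_USER_SUBMIT=0        # user submits the job to Oozie
T_HADOOP_SUBMIT=900    # Oozie submits the job to Hadoop
T_HADOOP_DONE=60900    # job completes at Hadoop
T_OOZIE_DONE=61500     # job completes at Oozie

# Preparation overhead: time Oozie spends before the job reaches Hadoop.
PREPARATION=$((T_HADOOP_SUBMIT - T_USER_SUBMIT))
# Completion overhead: time Oozie spends after Hadoop has finished the job.
COMPLETION=$((T_OOZIE_DONE - T_HADOOP_DONE))
# Total Oozie overhead = preparation + completion.
TOTAL=$((PREPARATION + COMPLETION))

echo "preparation=${PREPARATION}ms completion=${COMPLETION}ms total=${TOTAL}ms"
```

The time the job spends running inside Hadoop is excluded on purpose, since it does not depend on Oozie.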
    • Oozie Overhead
      [chart: Per Action Overhead; y-axis: overhead in milliseconds (0 to 1800); x-axis: number of actions per workflow (1, 5, 10, 50)]
      Observation: Oozie overhead is less when multiple actions are in the same workflow.
    • Oozie Futures
      • Scalability
        – Hot-Hot/Load balancing service
        – Replace SQL DB with Zookeeper
      • Improved usability
      • Extend the benchmarking scope
      • Monitoring WS API
    • Take Away
      • Oozie is
        – Easier to use
        – Scalable
        – Secure and multi-tenant
    • Q&A
      Mohammad K Islam, kamrul@yahoo-inc.com
      Virag Kothari, virag@yahoo-inc.com
      http://incubator.apache.org/oozie/