May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

During the past three years Oozie has become the de facto workflow scheduling system for Hadoop. Oozie has proven itself as a scalable, secure and multi-tenant service. Oozie stably processes more than 45% of the jobs run across more than 25 Hadoop clusters at Yahoo. At the same time, adoption in other enterprises has increased substantially since Oozie was contributed to the Apache community. We attribute these achievements to design decisions described in a paper that was selected to be presented at a workshop during the ACM/SIGMOD conference. This presentation covers the key architectural design choices described in the paper. Operational metrics will be used to illustrate production experience at Yahoo, and we will also include a quick tutorial.

  • 1. Oozie: Towards a Scalable Workflow Management System for Hadoop
      Mohammad Islam and Virag Kothari
  • 2. Accepted Paper
      • Workshop in ACM/SIGMOD, May 2012.
      • It is a team effort!
      Mohammad Islam, Angelo Huang, Mohamed Battisha, Michelle Chiang, Santhosh Srinivasan, Craig Peters, Andreas Neumann, Alejandro Abdelnur
  • 3. Presentation Workflow
      [Diagram: Oozie Tutorial → Design Decisions → Results → Questions? → Address Question → END]
  • 4. Installing Oozie
      Step 1: Download the Oozie tarball
          curl -O
      Step 2: Unpack the tarball
          tar -xzvf <PATH_TO_OOZIE_TAR>
      Step 3: Run the setup script
          bin/ -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip
      Step 4: Start Oozie
          bin/oozie-start.sh
      Step 5: Check the status of Oozie
          bin/oozie admin -oozie http://localhost:11000/oozie -status
  • 5. Running an Example
      • Standalone Map-Reduce job
          $ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir
      • Using Oozie
          [Diagram: Example DAG. Start → wordcount (MapReduce) → OK → End; ERROR → Kill]
          Workflow.xml:
              <workflow-app name=..>
                <start..>
                <action>
                  <map-reduce>
                  ……
                  ……
              </workflow-app>
  • 6. Example Workflow
      <action name='wordcount'>
        <map-reduce>
          <configuration>
            <property>
              <name>mapred.mapper.class</name>
              <value>org.myorg.WordCount.Map</value>
            </property>
            <property>
              <name>mapred.reducer.class</name>
              <value>org.myorg.WordCount.Reduce</value>
            </property>
            <property>
              <name>mapred.input.dir</name>
              <value>/usr/joe/inputDir</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/usr/joe/outputDir</value>
            </property>
          </configuration>
        </map-reduce>
      </action>
      Slide annotations:
          mapred.mapper.class = org.myorg.WordCount.Map
          mapred.reducer.class = org.myorg.WordCount.Reduce
          mapred.input.dir = inputDir
          mapred.output.dir = outputDir
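The relationship between the action XML on this slide and its name = value annotations can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper name `action_properties` is hypothetical, the fragment is a cleaned-up copy of the slide's action, and a complete workflow.xml that declares an XML namespace would need namespace-aware tag names.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the slide's action (quotes and paths are assumptions
# about what the slide intended).
ACTION_XML = """
<action name="wordcount">
  <map-reduce>
    <configuration>
      <property>
        <name>mapred.mapper.class</name>
        <value>org.myorg.WordCount.Map</value>
      </property>
      <property>
        <name>mapred.reducer.class</name>
        <value>org.myorg.WordCount.Reduce</value>
      </property>
      <property>
        <name>mapred.input.dir</name>
        <value>/usr/joe/inputDir</value>
      </property>
      <property>
        <name>mapred.output.dir</name>
        <value>/usr/joe/outputDir</value>
      </property>
    </configuration>
  </map-reduce>
</action>
"""


def action_properties(xml_text):
    """Return the <property> entries of an action as a name -> value dict."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.iter("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    return props


if __name__ == "__main__":
    for name, value in action_properties(ACTION_XML).items():
        print(f"{name} = {value}")
```

Each printed line corresponds to one of the annotations shown beside the XML on the slide.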
  • 7. A Workflow Application
      Three components are required for a Workflow:
      1) Workflow.xml: contains the job definition
      2) Libraries: optional ‘lib/’ directory containing .jar/.so files
      3) Properties file:
          • Parameterization of the Workflow xml
          • Mandatory property is
  • 8. Workflow Submission
      Run Workflow Job
          $ oozie job -run http://localhost:11000/oozie/
          Workflow ID: 00123-123456-oozie-wrkf-W
      Check Workflow Job Status
          $ oozie job -info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/
          -----------------------------------------------------------------------
          Workflow Name: test-wf
          App Path: hdfs://localhost:11000/user/your_id/oozie/
          Workflow job status [RUNNING]
          ...
          ------------------------------------------------------------------------
  • 9. Key Features and Design Decisions
      • Multi-tenant
      • Security
          – Authenticate every request
          – Pass appropriate token to Hadoop job
      • Scalability
          – Vertical: Add extra memory/disk
          – Horizontal: Add machines
  • 10. Oozie Job Processing
      [Diagram labels: Oozie Security, Hadoop Access, Secure Job, Kerberos, Oozie Server, End user]
  • 11. Oozie-Hadoop Security
      [Diagram labels: Oozie Security, Hadoop Access, Secure Job, Kerberos, Oozie Server, End user]
  • 12. Oozie-Hadoop Security
      • Oozie is a multi-tenant system
      • A job can be scheduled to run later
      • Oozie submits/maintains the Hadoop jobs
      • Hadoop needs a security token for each request
      Question: Who should provide the security token to Hadoop, and how?
  • 13. Oozie-Hadoop Security Contd.
      • Answer: Oozie
      • How?
          – Hadoop considers Oozie a super-user
          – Hadoop does not check the end-user credential
          – Hadoop only checks the credential of the Oozie process
      • BUT the Hadoop job is executed as the end-user.
      • Oozie utilizes the doAs() functionality of Hadoop.
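The super-user arrangement described on this slide can be illustrated with a toy model. It is not Hadoop's doAs() API; every name here (`TRUSTED_SUPERUSERS`, `Job`, `submit_as`) is hypothetical and exists only to show the idea that the service's own credential is checked while the job runs as the end user.

```python
from dataclasses import dataclass

# Hypothetical set of services the cluster trusts to act for other users.
TRUSTED_SUPERUSERS = {"oozie"}


@dataclass
class Job:
    name: str
    run_as: str  # the user the job executes as


def submit_as(caller, end_user, job_name):
    """Accept a job from `caller` that should execute as `end_user`."""
    # Only the caller's identity is checked, never the end user's credential.
    if caller != end_user and caller not in TRUSTED_SUPERUSERS:
        raise PermissionError(f"{caller} may not act as {end_user}")
    return Job(name=job_name, run_as=end_user)


if __name__ == "__main__":
    job = submit_as("oozie", "joe", "wordcount")
    print(job.run_as)
```

In this sketch a trusted service can submit on behalf of any user, while an ordinary user can only submit as themselves.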
  • 14. User-Oozie Security
      [Diagram labels: Oozie Security, Hadoop Access, Secure Job, Kerberos, Oozie Server, End user]
  • 15. Why Oozie Security?
      • One user should not modify another user’s job
      • Hadoop doesn’t authenticate the end-user
      • Oozie has to verify its user before passing the job to Hadoop
  • 16. How does Oozie Support Security?
      • Built-in authentication
          – Kerberos
          – Non-secured (default)
      • Design Decision
          – Pluggable authentication
          – Easy to include a new type of authentication
          – Yahoo supports 3 types of authentication.
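One common way to get pluggable authentication is a registry of interchangeable authenticators. The sketch below assumes that pattern; it is not Oozie's actual interface, and the names `register`, `authenticate`, the `"simple"` and `"token"` kinds, and the sample token are all made up for illustration.

```python
from typing import Callable, Dict, Optional

# An authenticator maps request headers to a user name, or None if rejected.
Authenticator = Callable[[Dict[str, str]], Optional[str]]
_AUTHENTICATORS: Dict[str, Authenticator] = {}


def register(kind):
    """Decorator that plugs a new authentication type into the registry."""
    def decorator(fn):
        _AUTHENTICATORS[kind] = fn
        return fn
    return decorator


@register("simple")
def simple_auth(headers):
    # Non-secured mode: trust the user name the client sends.
    return headers.get("user.name") or None


@register("token")
def token_auth(headers):
    # Stand-in for a real mechanism such as Kerberos (hypothetical token table).
    tokens = {"s3cret": "joe"}
    return tokens.get(headers.get("token", ""))


def authenticate(kind, headers):
    """Authenticate one request with the configured authenticator."""
    user = _AUTHENTICATORS[kind](headers)
    if user is None:
        raise PermissionError("authentication failed")
    return user


if __name__ == "__main__":
    print(authenticate("simple", {"user.name": "joe"}))
```

Adding a new authentication type then means registering one more function, without touching the request-handling code.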
  • 17. Job Submission to Hadoop
      • Oozie is designed to handle thousands of jobs at the same time
      • Question: Should the Oozie server
          – Submit the Hadoop job directly?
          – Wait for it to finish?
      • Answer: No
  • 18. Job Submission Contd.
      • Reason
          – Resource constraints: A single Oozie process can’t simultaneously create thousands of threads, one for each Hadoop job. (Scaling limitation)
          – Isolation: Running user code on the Oozie server might destabilize Oozie
      • Design Decision
          – Create a launcher Hadoop job
          – Execute the actual user job from the launcher.
          – Wait asynchronously for the job to finish.
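The difference between one blocked thread per job and asynchronous waiting can be sketched with Python's standard `concurrent.futures`. This is a simulation under stated assumptions: the thread pool stands in for the Hadoop cluster, `launcher` and `run_all` are hypothetical names, and the short sleep stands in for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def launcher(job_id):
    """Stand-in for the launcher job that runs the user's job off the server."""
    time.sleep(0.01)  # pretend the job runs on the cluster
    return f"{job_id}: SUCCEEDED"


def run_all(job_ids, workers=4):
    """Submit every job, then collect results through completion callbacks."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for job_id in job_ids:
            future = pool.submit(launcher, job_id)
            # The callback fires on completion; the submitter never blocks per job.
            future.add_done_callback(lambda f: results.append(f.result()))
    # Leaving the with-block waits for the pool, so all callbacks have run.
    return sorted(results)


if __name__ == "__main__":
    print(run_all([f"job-{i}" for i in range(10)]))
```

The submitting side here handles any number of jobs with a fixed number of workers, which is the property the design decision on the slide is after.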
  • 19. Job Submission to Hadoop
      [Diagram: Oozie Server and a Hadoop Cluster containing the Job Tracker, the Launcher Mapper and the Actual M/R Job; arrows numbered 1 to 5]
  • 20. Job Submission Contd.
      • Advantages
          – Horizontal scalability: If load increases, add machines to the Hadoop cluster
          – Stability: Isolation of user code and system process
      • Disadvantages
          – An extra map-slot is occupied by each job.
  • 21. Production Setup
      • Total number of nodes: 42K+
      • Total number of clusters: 25+
      • Total number of processed jobs ≈ 750K/month
      • Data presented from two clusters
      • Each of them has nearly 4K nodes
      • Total number of users per cluster = 50
  • 22. Oozie Usage Pattern @ Y!
      [Chart: Distribution of Job Types on Production Clusters. Y-axis: Percentage. X-axis: Job type (fs, java, map-reduce, pig). Series: #1 Cluster, #2 Cluster]
      • Pig and Java are the most popular
      • Pure Map-Reduce jobs are fewer in number
  • 23. Experimental Setup
      • Number of nodes: 7
      • Number of map-slots: 28
      • 4 Core, RAM: 16 GB
      • 64 bit RHEL
      • Oozie Server
          – 3 GB RAM
          – Internal queue size = 10 K
          – Number of worker threads = 300
  • 24. Job Acceptance
      [Chart: Workflow Acceptance Rate. Y-axis: Workflows accepted/min. X-axis: Number of submission threads]
      Observation: Oozie can accept a large number of jobs
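  • 25. Time Line of an Oozie Job
      [Timeline: User submits job → Oozie submits to Hadoop → Job completes at Hadoop → Job completes at Oozie]
      Preparation Overhead: from user submission until Oozie submits to Hadoop
      Completion Overhead: from job completion at Hadoop until completion at Oozie
      Total Oozie Overhead = Preparation + Completion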
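The overhead formula on this slide can be worked through with four timestamps. The function name `oozie_overhead` and the millisecond values below are made up purely for illustration; they are not measurements from the talk.

```python
def oozie_overhead(user_submit, hadoop_submit, hadoop_done, oozie_done):
    """Return (preparation, completion, total) overhead from four timestamps."""
    preparation = hadoop_submit - user_submit  # before the job reaches Hadoop
    completion = oozie_done - hadoop_done      # after Hadoop has finished
    return preparation, completion, preparation + completion


if __name__ == "__main__":
    # Illustrative timestamps in milliseconds (not data from the slides).
    prep, comp, total = oozie_overhead(0, 400, 60_400, 60_700)
    print(prep, comp, total)  # prints: 400 300 700
```

The time the job actually spends running on Hadoop (60 seconds in this made-up case) is excluded from the overhead.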
  • 26. Oozie Overhead
      [Chart: Per Action Overhead. Y-axis: Overhead in millisecs. X-axis: Number of Actions/Workflow (1, 5, 10, 50 actions)]
      Observation: Per-action Oozie overhead is lower when multiple actions are in the same workflow.
  • 27. Oozie Futures
      • Scalability
          – Hot-Hot/Load balancing service
          – Replace SQL DB with Zookeeper
      • Improved usability
      • Extend the benchmarking scope
      • Monitoring WS API
  • 28. Take Away
      • Oozie is
          – Easier to use
          – Scalable
          – Secure and multi-tenant
  • 29. Q&A
      Mohammad K Islam (kamrul@yahoo-), Virag Kothari