Your SlideShare is downloading. ×
Oozie HUG May12
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Oozie HUG May12


Published on

Presented at HUG, 2012

Presented at HUG, 2012

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Oozie: Towards a Scalable Workflow Management System for Hadoop Mohammad Islam And Virag Kothari
  • 2. Accepted Paper•  Workshop in ACM/SIGMOD, May 2012.•  It is a team effort!Mohammad Islam Angelo HuangMohamed Battisha Michelle ChiangSanthosh Srinivasan Craig PetersAndreas Neumann Alejandro Abdelnur
  • 3. Presentation Workflow
  • 4. Installing OozieStep 1: Download the Oozie tarballcurl -O 2: Unpack the tarballtar –xzvf <PATH_TO_OOZIE_TAR>Step 3: Run the setup scriptbin/ -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zipStep 4: Start ooziebin/oozie-start.shStep 5: Check status of ooziebin/oozie admin -oozie http://localhost:11000/oozie -status
  • 5. Running an Example•  Standalone Map-Reduce job$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir•  Using Oozie MapReduce OK <workflow –app name =..> Start End <start..> wordcount <action> <map-reduce> ERROR …… …… </workflow> Kill Example DAG Workflow.xml
  • 6. Example Workflow<action name=’wordcount> <map-reduce> <configuration> <property> <name>mapred.mapper.class</name> mapred.mapper.class = <value>org.myorg.WordCount.Map</value> org.myorg.WordCount.Map </property> <property> <name>mapred.reducer.class</name> <value>org.myorg.WordCount.Reduce</value> mapred.reducer.class = </property> org.myorg.WordCount.Reduce <property> <name>mapred.input.dir</name> <value>usr/joe/inputDir </value> mapred.input.dir = inputDir </property> <property> <name>mapred.output.dir</name> <value>/usr/joe/outputDir</value> mapred.output.dir = outputDir </property> </configuration> </map-reduce> </action>
  • 7. A Workflow ApplicationThree components required for a Workflow:1)  Workflow.xml: Contains job definition2) Libraries: optional ‘lib/’ directory contains .jar/.so files3) Properties file:•  Parameterization of Workflow xml•  Mandatory property is
  • 8. Workflow SubmissionRun Workflow Job $ oozie job –run -config -oozie http://localhost:11000/oozie/ Workflow ID: 00123-123456-oozie-wrkf-WCheck Workflow Job Status $ oozie job –info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/ ----------------------------------------------------------------------- Workflow Name: test-wf App Path: hdfs://localhost:11000/user/your_id/oozie/ Workflow job status [RUNNING] ... ------------------------------------------------------------------------
  • 9. Key Features and DesignDecisions•  Multi-tenant•  Security –  Authenticate every request –  Pass appropriate token to Hadoop job•  Scalability –  Vertical: Add extra memory/disk –  Horizontal: Add machines
  • 10. Oozie Job Processing Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnduser
  • 11. Oozie-Hadoop Security Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnd user
  • 12. Oozie-Hadoop Security •  Oozie is a multi-tenant system •  Job can be scheduled to run later •  Oozie submits/maintains the hadoop jobs •  Hadoop needs security token for each requestQuestion: Who should provide the securitytoken to hadoop and how?
  • 13. Oozie-Hadoop Security Contd.•  Answer: Oozie•  How? – Hadoop considers Oozie as a super-user – Hadoop does not check end-user credential – Hadoop only checks the credential of Oozie process•  BUT hadoop job is executed as end-user.•  Oozie utilizes doAs() functionality of Hadoop.
  • 14. User-Oozie Security Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnd user
  • 15. Why Oozie Security?•  One user should not modify another user’s job•  Hadoop doesn’t authenticate end–user•  Oozie has to verify its user before passing the job to Hadoop
  • 16. How does Oozie Support Security?•  Built-in authentication –  Kerberos –  Non-secured (default)•  Design Decision –  Pluggable authentication –  Easy to include new type of authentication –  Yahoo supports 3 types of authentication.
  • 17. Job Submission to Hadoop•  Oozie is designed to handle thousands of jobs at the same time•  Question : Should Oozie server –  Submit the hadoop job directly? –  Wait for it to finish? •  Answer: No
  • 18. Job Submission Contd.•  Reason –  Resource constraints: A single Oozie process can’t simultaneously create thousands of thread for each hadoop job. (Scaling limitation) –  Isolation: Running user code on Oozie server might de-stabilize Oozie•  Design Decision –  Create a launcher hadoop job –  Execute the actual user job from the launcher. –  Wait asynchronously for the job to finish.
  • 19. Job Submission to Hadoop Hadoop Cluster 5 Job Actual Tracker M/R JobOozie 3Server 1 4 Launcher 2 Mapper
  • 20. Job Submission Contd.•  Advantages –  Horizontal scalability: If load increases, add machines into Hadoop cluster –  Stability: Isolation of user code and system process•  Disadvantages –  Extra map-slot is occupied by each job.
  • 21. Production Setup•  Total number of nodes: 42K+•  Total number of Clusters: 25+•  Total number of processed jobs ≈ 750K/month•  Data presented from two clusters•  Each of them have nearly 4K nodes•  Total number of users /cluster = 50
  • 22. Oozie Usage Pattern @ Y! Distribution of Job Types On Production Clusters 50 45 40 35Percentage 30 25 20 #1 Cluster 15 #2 Cluster 10 5 0 fs java map-reduce pig Job type
  • 23. Experimental Setup•  Number of nodes: 7•  Number of map-slots: 28•  4 Core, RAM: 16 GB•  64 bit RHEL•  Oozie Server –  3 GB RAM –  Internal Queue size = 10 K –  # Worker Thread = 300
  • 24. Job Acceptance Workflow Acceptance Rate workflows Accepted/Min 1400 1200 1000 800 600 400 200 0 2 6 10 14 20 40 52 100 120 200 320 640 Number of Submission ThreadsObservation: Oozie can accept a large number of jobs
  • 25. Time Line of a Oozie Job User Oozie Job Job submits submits to completes completes Job Hadoop at Hadoop at Oozie Time Preparation Completion Overhead OverheadTotal Oozie Overhead = Preparation + Completion
  • 26. Oozie Overhead Per Action OverheadOverhead in millisecs 1800 1600 1400 1200 1000 800 600 400 200 0 1 Action 5 Actions 10 Actions 50 Actions Number of Actions/WorkflowObservation: Oozie overhead is less when multipleactions are in the same workflow.
  • 27. Oozie Futures•  Scalability –  Hot-Hot/Load balancing service –  Replace SQL DB with Zookeeper•  Improved Usability•  Extend the benchmarking scope•  Monitoring WS API
  • 28. Take Away ..•  Oozie is –  Easier to use –  Scalable –  Secure and multi-tenant
  • 29. Q&A Mohammad K Islam Virag
  • 30. Vertical Scalability•  Oozie asynchronously processes the user request.•  Memory resident –  An internal queue to store any sub-task. –  Thread pool to process the sub-tasks.•  Both items –  Fixed in size –  Don’t change due to load variations –  Size can be configured if needed. Might need extra memory
  • 31. Challenges in Scalability•  Centralized persistence storage –  Currently supports any SQL DB (such as Oracle, MySQL, derby etc) –  Each DBMS has respective limitations –  Oozie scalability is limited by the underlying DBMS –  Using of Zookeeper might be an option