Oozie: Towards a Scalable Workflow Management System for Hadoop                     Mohammad Islam                        ...
Accepted Paper•  Workshop in ACM/SIGMOD, May 2012.•  It is a team effort!Mohammad Islam      Angelo HuangMohamed Battisha ...
Presentation Workflow
Installing OozieStep 1: Download the Oozie tarballcurl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incu...
Running an Example•  Standalone Map-Reduce job$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outpu...
Example Workflow<action name=’wordcount>     <map-reduce>        <configuration>              <property>                  ...
A Workflow ApplicationThree components required for a Workflow:1)  Workflow.xml:    Contains job definition2) Libraries:  ...
Workflow SubmissionRun Workflow Job $ oozie job –run -config job.properties -oozie http://localhost:11000/oozie/ Workflow ...
Key Features and DesignDecisions•  Multi-tenant•  Security  –  Authenticate every request  –  Pass appropriate token to Ha...
Oozie Job Processing         Oozie Security                                                      Hadoop                   ...
Oozie-Hadoop Security           Oozie Security                                                         Hadoop             ...
Oozie-Hadoop Security •    Oozie is a multi-tenant system •    Job can be scheduled to run later •    Oozie submits/mainta...
Oozie-Hadoop Security Contd.•  Answer: Oozie•  How?  – Hadoop considers Oozie as a super-user  – Hadoop does not check end...
User-Oozie Security           Oozie Security                                                         Hadoop               ...
Why Oozie Security?•  One user should not modify another user’s   job•  Hadoop doesn’t authenticate end–user•  Oozie has t...
How does Oozie Support Security?•  Built-in authentication  –  Kerberos  –  Non-secured (default)•  Design Decision  –  Pl...
Job Submission to Hadoop•  Oozie is designed to handle thousands of   jobs at the same time•  Question : Should Oozie serv...
Job Submission Contd.•  Reason  –  Resource constraints: A single Oozie process     can’t simultaneously create thousands ...
Job Submission to Hadoop                    Hadoop Cluster             5     Job                                Actual    ...
Job Submission Contd.•  Advantages  –  Horizontal scalability: If load increases, add     machines into Hadoop cluster  – ...
Production Setup•  Total number of nodes: 42K+•  Total number of Clusters: 25+•  Total number of processed jobs ≈ 750K/mon...
Oozie Usage Pattern @ Y!                  Distribution of Job Types On Production Clusters             50             45  ...
Experimental Setup•    Number of nodes: 7•    Number of map-slots: 28•    4 Core, RAM: 16 GB•    64 bit RHEL•    Oozie Ser...
Job Acceptance                                         Workflow Acceptance Rate workflows Accepted/Min                    ...
Time Line of a Oozie Job  User       Oozie            Job         Job submits   submits to     completes   completes   Job...
Oozie Overhead                                          Per Action OverheadOverhead in millisecs                        18...
Oozie Futures•  Scalability  –  Hot-Hot/Load balancing service  –  Replace SQL DB with Zookeeper•  Improved Usability•  Ex...
Take Away ..•  Oozie is  –  Easier to use  –  Scalable  –  Secure and multi-tenant
Q&A   Mohammad K Islam                Virag Kotharikamrul@yahoo-inc.com      virag@yahoo-inc.com    http://incubator.apach...
Vertical Scalability•  Oozie asynchronously processes the user   request.•  Memory resident  –  An internal queue to store...
Challenges in Scalability•  Centralized persistence storage  –  Currently supports any SQL DB (such as     Oracle, MySQL, ...
Upcoming SlideShare
Loading in …5
×

Oozie HUG May12

2,947 views

Published on

Presented at HUG, 2012

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,947
On SlideShare
0
From Embeds
0
Number of Embeds
13
Actions
Shares
0
Downloads
74
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Oozie HUG May12

  1. 1. Oozie: Towards a Scalable Workflow Management System for Hadoop Mohammad Islam And Virag Kothari
  2. 2. Accepted Paper•  Workshop in ACM/SIGMOD, May 2012.•  It is a team effort!Mohammad Islam Angelo HuangMohamed Battisha Michelle ChiangSanthosh Srinivasan Craig PetersAndreas Neumann Alejandro Abdelnur
  3. 3. Presentation Workflow
  4. 4. Installing OozieStep 1: Download the Oozie tarballcurl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gzStep 2: Unpack the tarballtar –xzvf <PATH_TO_OOZIE_TAR>Step 3: Run the setup scriptbin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zipStep 4: Start ooziebin/oozie-start.shStep 5: Check status of ooziebin/oozie admin -oozie http://localhost:11000/oozie -status
  5. 5. Running an Example•  Standalone Map-Reduce job$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir•  Using Oozie MapReduce OK <workflow –app name =..> Start End <start..> wordcount <action> <map-reduce> ERROR …… …… </workflow> Kill Example DAG Workflow.xml
  6. 6. Example Workflow<action name=’wordcount> <map-reduce> <configuration> <property> <name>mapred.mapper.class</name> mapred.mapper.class = <value>org.myorg.WordCount.Map</value> org.myorg.WordCount.Map </property> <property> <name>mapred.reducer.class</name> <value>org.myorg.WordCount.Reduce</value> mapred.reducer.class = </property> org.myorg.WordCount.Reduce <property> <name>mapred.input.dir</name> <value>usr/joe/inputDir </value> mapred.input.dir = inputDir </property> <property> <name>mapred.output.dir</name> <value>/usr/joe/outputDir</value> mapred.output.dir = outputDir </property> </configuration> </map-reduce> </action>
  7. 7. A Workflow ApplicationThree components required for a Workflow:1)  Workflow.xml: Contains job definition2) Libraries: optional ‘lib/’ directory contains .jar/.so files3) Properties file:•  Parameterization of Workflow xml•  Mandatory property is oozie.wf.application.path
  8. 8. Workflow SubmissionRun Workflow Job $ oozie job –run -config job.properties -oozie http://localhost:11000/oozie/ Workflow ID: 00123-123456-oozie-wrkf-WCheck Workflow Job Status $ oozie job –info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/ ----------------------------------------------------------------------- Workflow Name: test-wf App Path: hdfs://localhost:11000/user/your_id/oozie/ Workflow job status [RUNNING] ... ------------------------------------------------------------------------
  9. 9. Key Features and DesignDecisions•  Multi-tenant•  Security –  Authenticate every request –  Pass appropriate token to Hadoop job•  Scalability –  Vertical: Add extra memory/disk –  Horizontal: Add machines
  10. 10. Oozie Job Processing Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnduser
  11. 11. Oozie-Hadoop Security Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnd user
  12. 12. Oozie-Hadoop Security •  Oozie is a multi-tenant system •  Job can be scheduled to run later •  Oozie submits/maintains the hadoop jobs •  Hadoop needs security token for each requestQuestion: Who should provide the securitytoken to hadoop and how?
  13. 13. Oozie-Hadoop Security Contd.•  Answer: Oozie•  How? – Hadoop considers Oozie as a super-user – Hadoop does not check end-user credential – Hadoop only checks the credential of Oozie process•  BUT hadoop job is executed as end-user.•  Oozie utilizes doAs() functionality of Hadoop.
  14. 14. User-Oozie Security Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnd user
  15. 15. Why Oozie Security?•  One user should not modify another user’s job•  Hadoop doesn’t authenticate end–user•  Oozie has to verify its user before passing the job to Hadoop
  16. 16. How does Oozie Support Security?•  Built-in authentication –  Kerberos –  Non-secured (default)•  Design Decision –  Pluggable authentication –  Easy to include new type of authentication –  Yahoo supports 3 types of authentication.
  17. 17. Job Submission to Hadoop•  Oozie is designed to handle thousands of jobs at the same time•  Question : Should Oozie server –  Submit the hadoop job directly? –  Wait for it to finish? •  Answer: No
  18. 18. Job Submission Contd.•  Reason –  Resource constraints: A single Oozie process can’t simultaneously create thousands of thread for each hadoop job. (Scaling limitation) –  Isolation: Running user code on Oozie server might de-stabilize Oozie•  Design Decision –  Create a launcher hadoop job –  Execute the actual user job from the launcher. –  Wait asynchronously for the job to finish.
  19. 19. Job Submission to Hadoop Hadoop Cluster 5 Job Actual Tracker M/R JobOozie 3Server 1 4 Launcher 2 Mapper
  20. 20. Job Submission Contd.•  Advantages –  Horizontal scalability: If load increases, add machines into Hadoop cluster –  Stability: Isolation of user code and system process•  Disadvantages –  Extra map-slot is occupied by each job.
  21. 21. Production Setup•  Total number of nodes: 42K+•  Total number of Clusters: 25+•  Total number of processed jobs ≈ 750K/month•  Data presented from two clusters•  Each of them have nearly 4K nodes•  Total number of users /cluster = 50
  22. 22. Oozie Usage Pattern @ Y! Distribution of Job Types On Production Clusters 50 45 40 35Percentage 30 25 20 #1 Cluster 15 #2 Cluster 10 5 0 fs java map-reduce pig Job type
  23. 23. Experimental Setup•  Number of nodes: 7•  Number of map-slots: 28•  4 Core, RAM: 16 GB•  64 bit RHEL•  Oozie Server –  3 GB RAM –  Internal Queue size = 10 K –  # Worker Thread = 300
  24. 24. Job Acceptance Workflow Acceptance Rate workflows Accepted/Min 1400 1200 1000 800 600 400 200 0 2 6 10 14 20 40 52 100 120 200 320 640 Number of Submission ThreadsObservation: Oozie can accept a large number of jobs
  25. 25. Time Line of a Oozie Job User Oozie Job Job submits submits to completes completes Job Hadoop at Hadoop at Oozie Time Preparation Completion Overhead OverheadTotal Oozie Overhead = Preparation + Completion
  26. 26. Oozie Overhead Per Action OverheadOverhead in millisecs 1800 1600 1400 1200 1000 800 600 400 200 0 1 Action 5 Actions 10 Actions 50 Actions Number of Actions/WorkflowObservation: Oozie overhead is less when multipleactions are in the same workflow.
  27. 27. Oozie Futures•  Scalability –  Hot-Hot/Load balancing service –  Replace SQL DB with Zookeeper•  Improved Usability•  Extend the benchmarking scope•  Monitoring WS API
  28. 28. Take Away ..•  Oozie is –  Easier to use –  Scalable –  Secure and multi-tenant
  29. 29. Q&A Mohammad K Islam Virag Kotharikamrul@yahoo-inc.com virag@yahoo-inc.com http://incubator.apache.org/oozie/
  30. 30. Vertical Scalability•  Oozie asynchronously processes the user request.•  Memory resident –  An internal queue to store any sub-task. –  Thread pool to process the sub-tasks.•  Both items –  Fixed in size –  Don’t change due to load variations –  Size can be configured if needed. Might need extra memory
  31. 31. Challenges in Scalability•  Centralized persistence storage –  Currently supports any SQL DB (such as Oracle, MySQL, derby etc) –  Each DBMS has respective limitations –  Oozie scalability is limited by the underlying DBMS –  Using of Zookeeper might be an option

×