Your SlideShare is downloading. ×
Oozie HUG May12
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Oozie HUG May12

2,417

Published on

Presented at HUG, 2012

Presented at HUG, 2012

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,417
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
61
Comments
0
Likes
1
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Oozie: Towards a Scalable Workflow Management System for Hadoop Mohammad Islam And Virag Kothari
  • 2. Accepted Paper•  Workshop in ACM/SIGMOD, May 2012.•  It is a team effort!Mohammad Islam Angelo HuangMohamed Battisha Michelle ChiangSanthosh Srinivasan Craig PetersAndreas Neumann Alejandro Abdelnur
  • 3. Presentation Workflow
  • 4. Installing OozieStep 1: Download the Oozie tarballcurl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gzStep 2: Unpack the tarballtar –xzvf <PATH_TO_OOZIE_TAR>Step 3: Run the setup scriptbin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zipStep 4: Start ooziebin/oozie-start.shStep 5: Check status of ooziebin/oozie admin -oozie http://localhost:11000/oozie -status
  • 5. Running an Example•  Standalone Map-Reduce job$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir•  Using Oozie MapReduce OK <workflow –app name =..> Start End <start..> wordcount <action> <map-reduce> ERROR …… …… </workflow> Kill Example DAG Workflow.xml
  • 6. Example Workflow<action name=’wordcount> <map-reduce> <configuration> <property> <name>mapred.mapper.class</name> mapred.mapper.class = <value>org.myorg.WordCount.Map</value> org.myorg.WordCount.Map </property> <property> <name>mapred.reducer.class</name> <value>org.myorg.WordCount.Reduce</value> mapred.reducer.class = </property> org.myorg.WordCount.Reduce <property> <name>mapred.input.dir</name> <value>usr/joe/inputDir </value> mapred.input.dir = inputDir </property> <property> <name>mapred.output.dir</name> <value>/usr/joe/outputDir</value> mapred.output.dir = outputDir </property> </configuration> </map-reduce> </action>
  • 7. A Workflow ApplicationThree components required for a Workflow:1)  Workflow.xml: Contains job definition2) Libraries: optional ‘lib/’ directory contains .jar/.so files3) Properties file:•  Parameterization of Workflow xml•  Mandatory property is oozie.wf.application.path
  • 8. Workflow SubmissionRun Workflow Job $ oozie job –run -config job.properties -oozie http://localhost:11000/oozie/ Workflow ID: 00123-123456-oozie-wrkf-WCheck Workflow Job Status $ oozie job –info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/ ----------------------------------------------------------------------- Workflow Name: test-wf App Path: hdfs://localhost:11000/user/your_id/oozie/ Workflow job status [RUNNING] ... ------------------------------------------------------------------------
  • 9. Key Features and DesignDecisions•  Multi-tenant•  Security –  Authenticate every request –  Pass appropriate token to Hadoop job•  Scalability –  Vertical: Add extra memory/disk –  Horizontal: Add machines
  • 10. Oozie Job Processing Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnduser
  • 11. Oozie-Hadoop Security Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnd user
  • 12. Oozie-Hadoop Security •  Oozie is a multi-tenant system •  Job can be scheduled to run later •  Oozie submits/maintains the hadoop jobs •  Hadoop needs security token for each requestQuestion: Who should provide the securitytoken to hadoop and how?
  • 13. Oozie-Hadoop Security Contd.•  Answer: Oozie•  How? – Hadoop considers Oozie as a super-user – Hadoop does not check end-user credential – Hadoop only checks the credential of Oozie process•  BUT hadoop job is executed as end-user.•  Oozie utilizes doAs() functionality of Hadoop.
  • 14. User-Oozie Security Oozie Security Hadoop Access Secure Job Kerberos Oozie ServerEnd user
  • 15. Why Oozie Security?•  One user should not modify another user’s job•  Hadoop doesn’t authenticate end–user•  Oozie has to verify its user before passing the job to Hadoop
  • 16. How does Oozie Support Security?•  Built-in authentication –  Kerberos –  Non-secured (default)•  Design Decision –  Pluggable authentication –  Easy to include new type of authentication –  Yahoo supports 3 types of authentication.
  • 17. Job Submission to Hadoop•  Oozie is designed to handle thousands of jobs at the same time•  Question : Should Oozie server –  Submit the hadoop job directly? –  Wait for it to finish? •  Answer: No
  • 18. Job Submission Contd.•  Reason –  Resource constraints: A single Oozie process can’t simultaneously create thousands of thread for each hadoop job. (Scaling limitation) –  Isolation: Running user code on Oozie server might de-stabilize Oozie•  Design Decision –  Create a launcher hadoop job –  Execute the actual user job from the launcher. –  Wait asynchronously for the job to finish.
  • 19. Job Submission to Hadoop Hadoop Cluster 5 Job Actual Tracker M/R JobOozie 3Server 1 4 Launcher 2 Mapper
  • 20. Job Submission Contd.•  Advantages –  Horizontal scalability: If load increases, add machines into Hadoop cluster –  Stability: Isolation of user code and system process•  Disadvantages –  Extra map-slot is occupied by each job.
  • 21. Production Setup•  Total number of nodes: 42K+•  Total number of Clusters: 25+•  Total number of processed jobs ≈ 750K/month•  Data presented from two clusters•  Each of them have nearly 4K nodes•  Total number of users /cluster = 50
  • 22. Oozie Usage Pattern @ Y! Distribution of Job Types On Production Clusters 50 45 40 35Percentage 30 25 20 #1 Cluster 15 #2 Cluster 10 5 0 fs java map-reduce pig Job type
  • 23. Experimental Setup•  Number of nodes: 7•  Number of map-slots: 28•  4 Core, RAM: 16 GB•  64 bit RHEL•  Oozie Server –  3 GB RAM –  Internal Queue size = 10 K –  # Worker Thread = 300
  • 24. Job Acceptance Workflow Acceptance Rate workflows Accepted/Min 1400 1200 1000 800 600 400 200 0 2 6 10 14 20 40 52 100 120 200 320 640 Number of Submission ThreadsObservation: Oozie can accept a large number of jobs
  • 25. Time Line of a Oozie Job User Oozie Job Job submits submits to completes completes Job Hadoop at Hadoop at Oozie Time Preparation Completion Overhead OverheadTotal Oozie Overhead = Preparation + Completion
  • 26. Oozie Overhead Per Action OverheadOverhead in millisecs 1800 1600 1400 1200 1000 800 600 400 200 0 1 Action 5 Actions 10 Actions 50 Actions Number of Actions/WorkflowObservation: Oozie overhead is less when multipleactions are in the same workflow.
  • 27. Oozie Futures•  Scalability –  Hot-Hot/Load balancing service –  Replace SQL DB with Zookeeper•  Improved Usability•  Extend the benchmarking scope•  Monitoring WS API
  • 28. Take Away ..•  Oozie is –  Easier to use –  Scalable –  Secure and multi-tenant
  • 29. Q&A Mohammad K Islam Virag Kotharikamrul@yahoo-inc.com virag@yahoo-inc.com http://incubator.apache.org/oozie/
  • 30. Vertical Scalability•  Oozie asynchronously processes the user request.•  Memory resident –  An internal queue to store any sub-task. –  Thread pool to process the sub-tasks.•  Both items –  Fixed in size –  Don’t change due to load variations –  Size can be configured if needed. Might need extra memory
  • 31. Challenges in Scalability•  Centralized persistence storage –  Currently supports any SQL DB (such as Oracle, MySQL, derby etc) –  Each DBMS has respective limitations –  Oozie scalability is limited by the underlying DBMS –  Using of Zookeeper might be an option

×