Apache Hadoop India Summit 2011 talk "Oozie – Workflow for Hadoop" by Andreas Neumann

1. Oozie – Workflow for Hadoop
   Andreas Neumann

2. Who Am I?
   - Dr. Andreas Neumann
   - Software Architect, Yahoo!
   - anew <at> yahoo-inc <dot> com
   - At Yahoo! (2008-present)
     - Grid architecture
     - Content Platform
     - Research
   - At IBM (2000-2008)
     - Database (DB2) development
     - Enterprise Search

3. Oozie Overview
   - Main features
     - Execute and monitor workflows in Hadoop
     - Periodic scheduling of workflows
     - Trigger execution by data availability
     - HTTP and command-line interfaces + web console
   - Adoption
     - ~100 users on the mailing list since the launch on GitHub
     - In production at Yahoo!, running >200K jobs/day

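   The HTTP interface mentioned above can be exercised directly. A minimal sketch, assuming the default Oozie port (11000), the v1 web-services endpoints, and a hypothetical host name:

      # List recent jobs (roughly what the web console's job list shows)
      curl "http://oozie-host:11000/oozie/v1/jobs?len=10"

      # Inspect a single job; mirrors the CLI's -info / -log / -definition commands shown later
      curl "http://oozie-host:11000/oozie/v1/job/1-20090525161321-oozie-xyz-W?show=info"
      curl "http://oozie-host:11000/oozie/v1/job/1-20090525161321-oozie-xyz-W?show=log"
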
4. Oozie Workflow Overview
   - Purpose: execution of workflows on the Grid
   - Architecture: Oozie is a Tomcat web app exposing a WS API, backed by a database, submitting work to Hadoop/Pig/HDFS

5. Oozie Workflow: Directed Acyclic Graph of Jobs
   Diagram: a DAG of actions (Java Main, M/R streaming job, Pig job, M/R job, FS job) connected by OK transitions, with start and end nodes, a decision node (branches MORE / ENOUGH), and a fork/join pair

6. Oozie Workflow Example
   <workflow-app name='wordcount-wf'>
     <start to='wordcount'/>
     <action name='wordcount'>
       <map-reduce>
         <job-tracker>foo.com:9001</job-tracker>
         <name-node>hdfs://bar.com:9000</name-node>
         <configuration>
           <property>
             <name>mapred.input.dir</name>
             <value>${inputDir}</value>
           </property>
           <property>
             <name>mapred.output.dir</name>
             <value>${outputDir}</value>
           </property>
         </configuration>
       </map-reduce>
       <ok to='end'/>
       <error to='kill'/>
     </action>
     <kill name='kill'/>
     <end name='end'/>
   </workflow-app>
   Diagram: Start, then the M-R wordcount action, transitioning to End on OK and to Kill on Error

7. Oozie Workflow Nodes
   - Control flow:
     - start / end / kill
     - decision
     - fork / join
   - Actions:
     - map-reduce
     - pig
     - hdfs
     - sub-workflow
     - java – run custom Java code

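   The deck shows full actions but not the control-flow syntax. As a minimal sketch (the node names, branch targets, and size threshold are made up for illustration; fs:dirSize and the GB constant are standard Oozie workflow EL):

      <!-- Hypothetical decision node: branch on the size of the input directory -->
      <decision name="check-input">
        <switch>
          <case to="big-input-path">${fs:dirSize(inputDir) gt 10 * GB}</case>
          <default to="small-input-path"/>
        </switch>
      </decision>

      <!-- Hypothetical fork/join pair: run a Pig job and an M/R job in parallel;
           both actions transition to the join on OK, and the join continues to 'end' -->
      <fork name="parallel-work">
        <path start="pig-job"/>
        <path start="mr-job"/>
      </fork>
      <join name="parallel-done" to="end"/>
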
8. Oozie Workflow Application
   - An HDFS directory containing:
     - Definition file: workflow.xml
     - Configuration file: config-default.xml
     - App files: lib/ directory with JAR and SO files
     - Pig scripts

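   As an illustration of that layout, a deployed wordcount application might look like the listing below (the JAR, native library, and script names are hypothetical):

      wordcount-wf/
        workflow.xml           definition (slide 6)
        config-default.xml     default parameters
        lib/
          wordcount.jar        job classes used by the map-reduce action
          libstemmer.so        optional native code
        wordcount.pig          optional script referenced by a pig action
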
9. Running an Oozie Workflow Job
   Application deployment:
   $ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount

   Workflow job parameters:
   $ cat job.properties
   oozie.wf.application.path = hdfs://bar.com:9000/usr/abc/wordcount
   inputDir = /usr/abc/input-data
   outputDir = /usr/abc/output-data

   Job execution:
   $ oozie job -run -config job.properties
   job: 1-20090525161321-oozie-xyz-W

10. Monitoring an Oozie Workflow Job
   Workflow job status:
   $ oozie job -info 1-20090525161321-oozie-xyz-W
   ------------------------------------------------------------------------
   Workflow Name : wordcount-wf
   App Path      : hdfs://bar.com:9000/usr/abc/wordcount
   Status        : RUNNING
   ...

   Workflow job log:
   $ oozie job -log 1-20090525161321-oozie-xyz-W

   Workflow job definition:
   $ oozie job -definition 1-20090525161321-oozie-xyz-W

11. Oozie Coordinator Overview
   - Purpose:
     - Coordinated execution of workflows on the Grid
     - Existing workflows remain backwards compatible
   - Architecture: the Oozie client talks to the WS API of the Tomcat web app; the Coordinator engine checks data availability in Hadoop and triggers the Workflow engine

12. Oozie Application Lifecycle
   Diagram: between its start and end times, the Coordinator engine materializes the coordinator job into actions at multiples of its frequency (0*f, 1*f, 2*f, ..., N*f); each created action, once started, runs a workflow in the Workflow engine

13. Use Case 1: Time Triggers
   - Execute your workflow every 15 minutes (CRON)
   Timeline: 00:15, 00:30, 00:45, 01:00, ...

14. Example 1: Run Workflow Every 15 Minutes
   <coordinator-app name="coord1"
                    start="2009-01-08T00:00Z"
                    end="2010-01-01T00:00Z"
                    frequency="15"
                    xmlns="uri:oozie:coordinator:0.1">
     <action>
       <workflow>
         <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
         <configuration>
           <property><name>key1</name><value>value1</value></property>
         </configuration>
       </workflow>
     </action>
   </coordinator-app>
   Note: the frequency attribute is expressed in minutes, so frequency="15" materializes an action every 15 minutes.

15. Use Case 2: Time and Data Triggers
   - Materialize your workflow every hour, but only run it when the input data is ready.
   Timeline: hourly instances at 01:00, 02:00, 03:00, 04:00, each gated on whether the input data exists in Hadoop

16. Example 2: Data Triggers
   <coordinator-app name="coord1" frequency="${1*HOURS}" ...>
     <datasets>
       <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
         <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
       </dataset>
     </datasets>
     <input-events>
       <data-in name="inputLogs" dataset="logs">
         <instance>${current(0)}</instance>
       </data-in>
     </input-events>
     <action>
       <workflow>
         <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
         <configuration>
           <property><name>inputData</name><value>${dataIn('inputLogs')}</value></property>
         </configuration>
       </workflow>
     </action>
   </coordinator-app>

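   On the receiving side, the logsprocessor workflow would consume the resolved dataset URIs through the inputData property passed in by the coordinator, in the same way the wordcount workflow on slide 6 uses ${inputDir}. A minimal, hypothetical sketch of that fragment:

      <!-- Hypothetical fragment of logsprocessor-wf/workflow.xml:
           ${dataIn('inputLogs')} from the coordinator arrives here as ${inputData} -->
      <property>
        <name>mapred.input.dir</name>
        <value>${inputData}</value>
      </property>
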
17. Use Case 3: Rolling Windows
   - Access 15-minute datasets and roll them up into hourly datasets
   Timeline: 00:15, 00:30, 00:45, 01:00 roll up into 01:00; 01:15, 01:30, 01:45, 02:00 roll up into 02:00

18. Example 3: Rolling Windows
   <coordinator-app name="coord1" frequency="${1*HOURS}" ...>
     <datasets>
       <dataset name="logs" frequency="15" initial-instance="2009-01-01T00:00Z">
         <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
       </dataset>
     </datasets>
     <input-events>
       <data-in name="inputLogs" dataset="logs">
         <start-instance>${current(-3)}</start-instance>
         <end-instance>${current(0)}</end-instance>
       </data-in>
     </input-events>
     <action>
       <workflow>
         <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
         <configuration>
           <property><name>inputData</name><value>${dataIn('inputLogs')}</value></property>
         </configuration>
       </workflow>
     </action>
   </coordinator-app>

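   To make the window concrete: for the coordinator action with nominal time 2009-01-01T01:00Z, ${current(-3)} through ${current(0)} select the four quarter-hour instances ending at 01:00 (matching the timeline on slide 17), so ${dataIn('inputLogs')} should resolve to roughly this comma-separated list:

      hdfs://bar:9000/app/logs/2009/01/01/00/15,
      hdfs://bar:9000/app/logs/2009/01/01/00/30,
      hdfs://bar:9000/app/logs/2009/01/01/00/45,
      hdfs://bar:9000/app/logs/2009/01/01/01/00
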
19. Use Case 4: Sliding Windows
   - Access the last 24 hours of data, and roll them up every hour.
   Timeline: each hourly rollup reads the previous 24 hourly instances, e.g. 01:00 through 24:00 into the 24:00 rollup, 02:00 through 01:00 of the next day into the following rollup, and so on

20. Example 4: Sliding Windows
   <coordinator-app name="coord1" frequency="${1*HOURS}" ...>
     <datasets>
       <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
         <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
       </dataset>
     </datasets>
     <input-events>
       <data-in name="inputLogs" dataset="logs">
         <start-instance>${current(-23)}</start-instance>
         <end-instance>${current(0)}</end-instance>
       </data-in>
     </input-events>
     <action>
       <workflow>
         <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
         <configuration>
           <property><name>inputData</name><value>${dataIn('inputLogs')}</value></property>
         </configuration>
       </workflow>
     </action>
   </coordinator-app>
   Note: compared to Example 3, only the dataset frequency (hourly) and the input window change; ${current(-23)} through ${current(0)} cover the last 24 hourly instances.

21. Oozie Coordinator Application
   - An HDFS directory containing:
     - Definition file: coordinator.xml
     - Configuration file: coord-config-default.xml

22. Running an Oozie Coordinator Job
   Application deployment:
   $ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job

   Coordinator job parameters:
   $ cat job.properties
   oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job

   Job execution:
   $ oozie job -run -config job.properties
   job: 1-20090525161321-oozie-xyz-C

23. Monitoring an Oozie Coordinator Job
   Coordinator job status:
   $ oozie job -info 1-20090525161321-oozie-xyz-C
   ------------------------------------------------------------------------
   Job Name : wordcount-coord
   App Path : hdfs://bar.com:9000/usr/abc/coord_job
   Status   : RUNNING
   ...

   Coordinator job log:
   $ oozie job -log 1-20090525161321-oozie-xyz-C

   Coordinator job definition:
   $ oozie job -definition 1-20090525161321-oozie-xyz-C

24. Oozie Web Console: List Jobs (screenshot)
25. Oozie Web Console: Job Details (screenshot)
26. Oozie Web Console: Failed Action (screenshot)
27. Oozie Web Console: Error Messages (screenshot)

28. What's Next for Oozie?
   - New features
     - More out-of-the-box actions: distcp, hive, ...
     - Authentication framework
       - Authenticate a client with Oozie
       - Authenticate an Oozie workflow with downstream services
     - Bundles: manage multiple coordinators together
     - Asynchronous datasets and coordinators
   - Scalability
     - Memory footprint
     - Data notification instead of polling
   - Integration with Howl (http://github.com/yahoo/howl)

29. We Need You!
   - Oozie is open source
     - Source: http://github.com/yahoo/oozie
     - Docs: http://yahoo.github.com/oozie
     - List: http://tech.groups.yahoo.com/group/Oozie-users/
   - To contribute:
     - https://github.com/yahoo/oozie/wiki/How-To-Contribute

30. Thank You!
   github.com/yahoo/oozie/wiki/How-To-Contribute
