Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas Neumann
 

    Presentation Transcript

    • Andreas Neumann Oozie – Workflow for Hadoop
    • Who Am I?
        • Dr. Andreas Neumann
        • Software Architect, Yahoo!
        • anew <at> yahoo-inc <dot> com
        • At Yahoo! (2008-present)
          • Grid architecture
          • Content Platform
          • Research
        • At IBM (2000-2008)
          • Database (DB2) Development
          • Enterprise Search
    • Oozie Overview
      • Main Features
        • Execute and monitor workflows in Hadoop
        • Periodic scheduling of workflows
        • Trigger execution by data availability
        • HTTP and command line interface + Web console (see the HTTP sketch below)
      • Adoption
        • ~100 users on the mailing list since launch on GitHub
        • In production at Yahoo!, running >200K jobs/day
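      The HTTP interface is a REST/JSON web-services API exposed by the Oozie server. A minimal sketch, assuming a server on host foo.com at Oozie's default port 11000 and the v1 API (the host is an assumption; the job id is the one used later in this deck):

        # list workflow jobs as JSON
        $ curl "http://foo.com:11000/oozie/v1/jobs?jobtype=wf"
        # show the status of a single job
        $ curl "http://foo.com:11000/oozie/v1/job/1-20090525161321-oozie-xyz-W?show=info"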
    • Oozie Workflow Overview
      • Purpose:
        • Execution of workflows on the Grid
      [Architecture diagram: Oozie is a Tomcat web-app exposing a WS API, backed by a DB, driving Hadoop/Pig/HDFS]
    • Oozie Workflow: a Directed Acyclic Graph of Jobs
      [DAG diagram: start → Java Main → M/R streaming job → decision (MORE / ENOUGH) → fork → Pig job and M/R job in parallel → join → Java Main → FS job → end, each transition taken on OK]
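      How the control-flow nodes of such a DAG are written in workflow XML; a minimal sketch, where the node names, hosts, script name, and the fs:exists predicate are illustrative, not from the talk:

        <workflow-app name="dag-sketch" xmlns="uri:oozie:workflow:0.1">
          <start to="check"/>
          <!-- decision: route on an EL predicate -->
          <decision name="check">
            <switch>
              <case to="fanout">${fs:exists(wf:conf('inputDir'))}</case>
              <default to="end"/>
            </switch>
          </decision>
          <!-- fork: start both branches in parallel -->
          <fork name="fanout">
            <path start="pig-job"/>
            <path start="mr-job"/>
          </fork>
          <action name="pig-job">
            <pig>
              <job-tracker>foo.com:9001</job-tracker>
              <name-node>hdfs://bar.com:9000</name-node>
              <script>transform.pig</script>
            </pig>
            <ok to="merge"/>
            <error to="kill"/>
          </action>
          <action name="mr-job">
            <map-reduce>
              <job-tracker>foo.com:9001</job-tracker>
              <name-node>hdfs://bar.com:9000</name-node>
            </map-reduce>
            <ok to="merge"/>
            <error to="kill"/>
          </action>
          <!-- join: continue only after all forked branches reach it -->
          <join name="merge" to="end"/>
          <kill name="kill">
            <message>A branch failed</message>
          </kill>
          <end name="end"/>
        </workflow-app>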
    • Oozie Workflow Example
      <workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.1">
        <start to="wordcount"/>
        <action name="wordcount">
          <map-reduce>
            <job-tracker>foo.com:9001</job-tracker>
            <name-node>hdfs://bar.com:9000</name-node>
            <configuration>
              <property>
                <name>mapred.input.dir</name>
                <value>${inputDir}</value>
              </property>
              <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
              </property>
            </configuration>
          </map-reduce>
          <ok to="end"/>
          <error to="kill"/>
        </action>
        <kill name="kill">
          <message>wordcount failed</message>
        </kill>
        <end name="end"/>
      </workflow-app>
      [diagram: Start → M-R wordcount; on OK → End, on Error → Kill]
    • Oozie Workflow Nodes
      • Control Flow:
        • start/end/kill
        • decision
        • fork/join
      • Actions:
        • map-reduce
        • pig
        • hdfs
        • sub-workflow
        • java – run custom Java code (see the sketch below)
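      As a sketch of the action-node syntax, here is what a java action could look like (the class name and argument are hypothetical):

        <action name="custom-step">
          <java>
            <job-tracker>foo.com:9001</job-tracker>
            <name-node>hdfs://bar.com:9000</name-node>
            <main-class>com.example.CustomStep</main-class>  <!-- hypothetical class -->
            <arg>${inputDir}</arg>
          </java>
          <ok to="end"/>
          <error to="kill"/>
        </action>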
    • Oozie Workflow Application
      • An HDFS directory containing (example layout below):
        • Definition file: workflow.xml
        • Configuration file: config-default.xml
        • App files: lib/ directory with JAR and SO files
        • Pig Scripts
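      A possible layout of such a directory for the wordcount example (all file names besides workflow.xml and config-default.xml are illustrative):

        wordcount-wf/
          workflow.xml            workflow definition
          config-default.xml      default configuration
          lib/wordcount.jar       Java code for the actions
          lib/libutil.so          native libraries, if any
          wordcount.pig           Pig scripts referenced by pig actions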
    • Running an Oozie Workflow Job
      • Application Deployment:
      • $ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount
      • Workflow Job Parameters:
      • $ cat job.properties
      • oozie.wf.application.path = hdfs://bar.com:9000/usr/abc/wordcount
      • inputDir = /usr/abc/input-data
      • outputDir = /user/abc/output-data
      • Job Execution:
      • $ oozie job -run -config job.properties
      • job: 1-20090525161321-oozie-xyz-W
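      The oozie CLI is pointed at the server either per invocation with the -oozie option or once via the OOZIE_URL environment variable; a sketch, assuming the server runs on host foo.com at its default port 11000:

        $ export OOZIE_URL=http://foo.com:11000/oozie
        $ oozie job -run -config job.properties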
    • Monitoring an Oozie Workflow Job
      • Workflow Job Status:
      • $ oozie job -info 1-20090525161321-oozie-xyz-W
      • ------------------------------------------------------------------------
      • Workflow Name : wordcount-wf
      • App Path : hdfs://bar.com:9000/usr/abc/wordcount
      • Status : RUNNING
      • Workflow Job Log:
      • $ oozie job -log 1-20090525161321-oozie-xyz-W
      • Workflow Job Definition:
      • $ oozie job -definition 1-20090525161321-oozie-xyz-W
    • Oozie Coordinator Overview
      • Purpose:
        • Coordinated execution of workflows on the Grid
        • Existing workflows run unchanged (backward compatible)
      [Architecture diagram: Oozie Client → WS API → Oozie Coordinator (checks data availability) → Oozie Workflow → Hadoop, inside a Tomcat web-app]
    • Oozie Application Lifecycle
      [Diagram: a coordinator job runs from start to end; at each multiple of its frequency f (0*f, 1*f, 2*f, … N*f) the Coordinator Engine creates an action, and each action starts a workflow in the Workflow Engine]
    • Use Case 1: Time Triggers
      • Execute your workflow every 15 minutes (CRON)
      [timeline: runs at 00:15, 00:30, 00:45, 01:00, …]
    • Example 1: Run Workflow every 15 mins
      <coordinator-app name="coord1"
                       start="2009-01-08T00:00Z"
                       end="2010-01-01T00:00Z"
                       frequency="15"
                       xmlns="uri:oozie:coordinator:0.1">
        <action>
          <workflow>
            <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
            <configuration>
              <property> <name>key1</name><value>value1</value> </property>
            </configuration>
          </workflow>
        </action>
      </coordinator-app>
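      A bare number in frequency is interpreted as minutes, so frequency="15" above means every 15 minutes. The coordinator EL also provides calendar-aware frequency functions; a sketch (exact availability depends on the coordinator schema version):

        frequency="${coord:minutes(15)}"
        frequency="${coord:days(1)}"      <!-- one full day, even across DST shifts -->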
    • Use Case 2: Time and Data Triggers
      • Materialize your workflow every hour, but only run it when the input data is ready.
      [diagram: hourly slots 01:00–04:00, each asking 'input data exists?' before running on Hadoop]
    • Example 2: Data Triggers
      <coordinator-app name="coord1" frequency="${1*HOURS}" …>
        <datasets>
          <dataset name="logs" frequency="${1*HOURS}"
                   initial-instance="2009-01-01T00:00Z">
            <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
          </dataset>
        </datasets>
        <input-events>
          <data-in name="inputLogs" dataset="logs">
            <instance>${current(0)}</instance>
          </data-in>
        </input-events>
        <action>
          <workflow>
            <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
            <configuration>
              <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
            </configuration>
          </workflow>
        </action>
      </coordinator-app>
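      To make the EL concrete: assume Oozie materializes the action whose nominal time is 2009-01-02T05:00Z. ${current(0)} selects the dataset instance with that same nominal time, and ${dataIn('inputLogs')} expands it through the URI template, so the workflow property becomes:

        inputData = hdfs://bar:9000/app/logs/2009/01/02/05

      The action is held back until that instance is actually available (by default, Oozie looks for a _SUCCESS done-flag in the directory).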
    • Use Case 3: Rolling Windows
      • Access 15-minute datasets and roll them up into hourly datasets
      [diagram: the 00:15–01:00 15-minute instances roll up into the 01:00 hourly dataset, 01:15–02:00 into 02:00, …]
    • Example 3: Rolling Windows
      <coordinator-app name="coord1" frequency="${1*HOURS}" …>
        <datasets>
          <dataset name="logs" frequency="15"
                   initial-instance="2009-01-01T00:00Z">
            <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
          </dataset>
        </datasets>
        <input-events>
          <data-in name="inputLogs" dataset="logs">
            <start-instance>${current(-3)}</start-instance>
            <end-instance>${current(0)}</end-instance>
          </data-in>
        </input-events>
        <action>
          <workflow>
            <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
            <configuration>
              <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
            </configuration>
          </workflow>
        </action>
      </coordinator-app>
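      For the same nominal time 2009-01-02T05:00Z, ${current(-3)} through ${current(0)} covers the four 15-minute instances ending at the nominal time, and ${dataIn('inputLogs')} resolves to a comma-separated list of their URIs (wrapped here for readability):

        inputData = hdfs://bar:9000/app/logs/2009/01/02/04/15,
                    hdfs://bar:9000/app/logs/2009/01/02/04/30,
                    hdfs://bar:9000/app/logs/2009/01/02/04/45,
                    hdfs://bar:9000/app/logs/2009/01/02/05/00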
    • Use Case 4: Sliding Windows
      • Access the last 24 hours of data and roll them up every hour.
      [diagram: each hourly run reads the trailing 24 hours of instances, the window sliding forward one hour per run]
    • Example 4: Sliding Windows
      <coordinator-app name="coord1" frequency="${1*HOURS}" …>
        <datasets>
          <dataset name="logs" frequency="${1*HOURS}"
                   initial-instance="2009-01-01T00:00Z">
            <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
          </dataset>
        </datasets>
        <input-events>
          <data-in name="inputLogs" dataset="logs">
            <start-instance>${current(-23)}</start-instance>
            <end-instance>${current(0)}</end-instance>
          </data-in>
        </input-events>
        <action>
          <workflow>
            <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
            <configuration>
              <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
            </configuration>
          </workflow>
        </action>
      </coordinator-app>
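      The XML is nearly identical to the rolling window; what changes the behavior is that the window, ${current(-23)} through ${current(0)}, spans 24 instances while the coordinator fires every hour, so consecutive actions overlap in 23 of their inputs. A sketch of two consecutive actions:

        action at 2009-01-02T00:00Z reads 2009-01-01T01:00Z .. 2009-01-02T00:00Z
        action at 2009-01-02T01:00Z reads 2009-01-01T02:00Z .. 2009-01-02T01:00Z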
    • Oozie Coordinator Application
      • An HDFS directory containing:
        • Definition file: coordinator.xml
        • Configuration file: coord-config-default.xml
    • Running an Oozie Coordinator Job
      • Application Deployment:
      • $ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job
      • Coordinator Job Parameters:
      • $ cat job.properties
      • oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job
      • Job Execution:
      • $ oozie job -run -config job.properties
      • job: 1-20090525161321-oozie-xyz-C
    • Monitoring an Oozie Coordinator Job
      • Coordinator Job Status:
      • $ oozie job -info 1-20090525161321-oozie-xyz-C
      • ------------------------------------------------------------------------
      • Job Name : wordcount-coord
      • App Path : hdfs://bar.com:9000/usr/abc/coord_job
      • Status : RUNNING
      • Coordinator Job Log:
      • $ oozie job -log 1-20090525161321-oozie-xyz-C
      • Coordinator Job Definition:
      • $ oozie job -definition 1-20090525161321-oozie-xyz-C
    • Oozie Web Console: List Jobs
    • Oozie Web Console: Job Details
    • Oozie Web Console: Failed Action
    • Oozie Web Console: Error Messages
    • What’s Next For Oozie?
      • New Features
        • More out-of-the-box actions: distcp, hive, …
        • Authentication framework
          • Authenticate a client with Oozie
          • Authenticate an Oozie workflow with downstream services
        • Bundles: Manage multiple coordinators together
        • Asynchronous data sets and coordinators
      • Scalability
        • Memory footprint
        • Data notification instead of polling
      • Integration with Howl (http://github.com/yahoo/howl)
    • We Need You!
      • Oozie is Open Source
        • Source: http://github.com/yahoo/oozie
        • Docs: http://yahoo.github.com/oozie
        • List: http://tech.groups.yahoo.com/group/Oozie-users/
      • To Contribute:
        • https://github.com/yahoo/oozie/wiki/How-To-Contribute
    • Thank You! github.com/yahoo/oozie/wiki/How-To-Contribute