SlideShare a Scribd company logo
1 of 10
Apache Oozie Workflow
Scheduler
Overview of Oozie
• Oozie is an open-source Apache project that provides a framework for
coordinating and scheduling Hadoop jobs. Oozie is not restricted to just
MapReduce jobs; you can use Oozie to schedule Pig, Hive, Sqoop, Streaming
jobs, and even Java programs.
• Oozie is a Java web application that runs in a Tomcat instance.
• Oozie can be run as a service, and then start workflows using the oozie
command.
Overview of Oozie
Oozie has two main capabilities:
1. Oozie Workflow: a collection of actions (defined in a workflow.xml file).
• Pig Actions
• Hive Actions
• MapReduce Actions
2. Oozie Coordinator: a recurring workflow (defined in a coordinator.xml file).
• Schedule a Job Based on Time
• Schedule a Job Based on Data Availability
Defining an Oozie Workflow
Defining an Oozie Workflow
• An Oozie workflow consists of a workflow.xml file and the necessary files
required by the workflow. The workflow is put into HDFS with the
following directory structure:
• /appdir/workflow.xml
• /appdir/config-default.xml
• /appdir/lib/files.jar
• The workflow.xml, a workflow definition consists of two main entries:
• Control flow nodes: for determining the execution path.
• Action nodes: for executing a job or task.
• The config-default.xml file is optional and contains properties shared by
all workflows.
• Each workflow can also have a job.properties file (not put into HDFS) for
job-specific properties.
Pig Action
<workflow-app xmlns="uri:oozie:workflow:0.2" name=“transform-workflow">
<start to="transform_data" />
<action name=" transform_data">
<pig>
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path=“old_data_path" />
</prepare>
<script>transform.pig</script>
</pig>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
</workflow-app>
Hive Action
<action name="find_count_visit">
<hive xmlns="uri:oozie:hive-action:0.5">
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path=“old_data_path" />
</prepare>
<job-xml>hive-site.xml</job-xml>
<configuration>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
</configuration>
<script>findcount.hive</script>
</hive>
<ok to="end" />
<error to="fail" />
</action>
MapReduce Action
Schedule a Job Based on Time
<coordinator-app name="tf-idf" frequency="1440"
start="2013-01-01T00:00Z" end="2013-12-31T00:00Z"
timezone="UTC"
xmlns="uri:oozie:coordinator:0.1">
<action>
<workflow>
<app-path> hdfs://node:8020/home/train/tfidf/workflow
</app-path>
</workflow>
</action>
</coordinator-app>
Schedule Based on Data Availability
<coordinator-app name="file_check" frequency="1440"
start="2012-01-01T00:00Z" end="2015-12-31T00:00Z" timezone="UTC"
xmlns="uri:oozie:coordinator:0.1">
<datasets>
<dataset name="input1">
<uri-template>
hdfs://node:8020/job/result/
</uri-template>
</dataset>
</datasets>
<action>
<workflow>
<app-path>hdfs://node:8020/myapp/</app-path>
</workflow>
</action>
</coordinator-app>

More Related Content

What's hot

modern module development - Ken Barber 2012 Edinburgh Puppet Camp
modern module development - Ken Barber 2012 Edinburgh Puppet Campmodern module development - Ken Barber 2012 Edinburgh Puppet Camp
modern module development - Ken Barber 2012 Edinburgh Puppet Camp
Puppet
 
OpenStack Horizon: Controlling the Cloud using Django
OpenStack Horizon: Controlling the Cloud using DjangoOpenStack Horizon: Controlling the Cloud using Django
OpenStack Horizon: Controlling the Cloud using Django
David Lapsley
 
London devops logging
London devops loggingLondon devops logging
London devops logging
Tomas Doran
 

What's hot (20)

Oozie meetup - HA
Oozie meetup - HAOozie meetup - HA
Oozie meetup - HA
 
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for HadoopMay 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
 
Oozie HUG May12
Oozie HUG May12Oozie HUG May12
Oozie HUG May12
 
Oozie @ Riot Games
Oozie @ Riot GamesOozie @ Riot Games
Oozie @ Riot Games
 
Data Pipeline Management Framework on Oozie
Data Pipeline Management Framework on OozieData Pipeline Management Framework on Oozie
Data Pipeline Management Framework on Oozie
 
Oozie Summit 2011
Oozie Summit 2011Oozie Summit 2011
Oozie Summit 2011
 
October 2014 HUG : Oozie HA
October 2014 HUG : Oozie HAOctober 2014 HUG : Oozie HA
October 2014 HUG : Oozie HA
 
TXLF: Chef- Software Defined Infrastructure Today & Tomorrow
TXLF: Chef- Software Defined Infrastructure Today & TomorrowTXLF: Chef- Software Defined Infrastructure Today & Tomorrow
TXLF: Chef- Software Defined Infrastructure Today & Tomorrow
 
Gradle - Build System
Gradle - Build SystemGradle - Build System
Gradle - Build System
 
Learn you some Ansible for great good!
Learn you some Ansible for great good!Learn you some Ansible for great good!
Learn you some Ansible for great good!
 
modern module development - Ken Barber 2012 Edinburgh Puppet Camp
modern module development - Ken Barber 2012 Edinburgh Puppet Campmodern module development - Ken Barber 2012 Edinburgh Puppet Camp
modern module development - Ken Barber 2012 Edinburgh Puppet Camp
 
Liquibase få kontroll på dina databasförändringar
Liquibase   få kontroll på dina databasförändringarLiquibase   få kontroll på dina databasförändringar
Liquibase få kontroll på dina databasförändringar
 
RESTFul development with Apache sling
RESTFul development with Apache slingRESTFul development with Apache sling
RESTFul development with Apache sling
 
Powershell For Developers
Powershell For DevelopersPowershell For Developers
Powershell For Developers
 
OpenStack Horizon: Controlling the Cloud using Django
OpenStack Horizon: Controlling the Cloud using DjangoOpenStack Horizon: Controlling the Cloud using Django
OpenStack Horizon: Controlling the Cloud using Django
 
Getting started with agile database migrations for java flywaydb
Getting started with agile database migrations for java flywaydbGetting started with agile database migrations for java flywaydb
Getting started with agile database migrations for java flywaydb
 
Creating Modular Test-Driven SPAs with Spring and AngularJS
Creating Modular Test-Driven SPAs with Spring and AngularJSCreating Modular Test-Driven SPAs with Spring and AngularJS
Creating Modular Test-Driven SPAs with Spring and AngularJS
 
London devops logging
London devops loggingLondon devops logging
London devops logging
 
Play framework
Play frameworkPlay framework
Play framework
 
Database migration with flyway
Database migration  with flywayDatabase migration  with flyway
Database migration with flyway
 

Similar to Apache Oozie Workflow Scheduler - Module 10

Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow ManagerBreathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
DataWorks Summit
 
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow ManagerBreathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
DataWorks Summit
 

Similar to Apache Oozie Workflow Scheduler - Module 10 (20)

Apache Oozie
Apache OozieApache Oozie
Apache Oozie
 
Apache Oozie.pptx
Apache Oozie.pptxApache Oozie.pptx
Apache Oozie.pptx
 
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Hadoop Oozie
Hadoop OozieHadoop Oozie
Hadoop Oozie
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptx
 
oozieee.pdf
oozieee.pdfoozieee.pdf
oozieee.pdf
 
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow ManagerBreathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
 
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow ManagerBreathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
 
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow ManagerBreathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
 
Mobile Fest 2018. Алексей Лизенко. Make your project great again
Mobile Fest 2018. Алексей Лизенко. Make your project great againMobile Fest 2018. Алексей Лизенко. Make your project great again
Mobile Fest 2018. Алексей Лизенко. Make your project great again
 
Case Study: Plus Retail - Moving from the Old World to the New World
Case Study: Plus Retail - Moving from the Old World to the New WorldCase Study: Plus Retail - Moving from the Old World to the New World
Case Study: Plus Retail - Moving from the Old World to the New World
 
Rest API with Swagger and NodeJS
Rest API with Swagger and NodeJSRest API with Swagger and NodeJS
Rest API with Swagger and NodeJS
 
Hbase coprocessor with Oozie WF referencing 3rd Party jars
Hbase coprocessor with Oozie WF referencing 3rd Party jarsHbase coprocessor with Oozie WF referencing 3rd Party jars
Hbase coprocessor with Oozie WF referencing 3rd Party jars
 
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop EcosystemUnveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Solr
SolrSolr
Solr
 
Debugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-CloudDebugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-Cloud
 

More from Rohit Agrawal

More from Rohit Agrawal (9)

Hadoop 2.0, MRv2 and YARN - Module 9
Hadoop 2.0, MRv2 and YARN - Module 9Hadoop 2.0, MRv2 and YARN - Module 9
Hadoop 2.0, MRv2 and YARN - Module 9
 
Advance HBase and Zookeeper - Module 8
Advance HBase and Zookeeper - Module 8Advance HBase and Zookeeper - Module 8
Advance HBase and Zookeeper - Module 8
 
Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5
 
Advance MapReduce Concepts - Module 4
Advance MapReduce Concepts - Module 4Advance MapReduce Concepts - Module 4
Advance MapReduce Concepts - Module 4
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Hive and HiveQL - Module6
Hive and HiveQL - Module6Hive and HiveQL - Module6
Hive and HiveQL - Module6
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Apache Oozie Workflow Scheduler - Module 10

  • 2. Overview of Oozie • Oozie is an open-source Apache project that provides a framework for coordinating and scheduling Hadoop jobs. Oozie is not restricted to just MapReduce jobs; you can use Oozie to schedule Pig, Hive, Sqoop, Streaming jobs, and even Java programs. • Oozie is a Java web application that runs in a Tomcat instance. • Oozie can be run as a service, and then start workflows using the oozie command.
  • 3. Overview of Oozie Oozie has two main capabilities: 1. Oozie Workflow: a collection of actions (defined in a workflow.xml file). • Pig Actions • Hive Actions • MapReduce Actions 2. Oozie Coordinator: a recurring workflow (defined in a coordinator.xml file). • Schedule a Job Based on Time • Schedule a Job Based on Data Availability
  • 4. Defining an Oozie Workflow
  • 5. Defining an Oozie Workflow • An Oozie workflow consists of a workflow.xml file and the necessary files required by the workflow. The workflow is put into HDFS with the following directory structure: • /appdir/workflow.xml • /appdir/config-default.xml • /appdir/lib/files.jar • The workflow.xml, a workflow definition consists of two main entries: • Control flow nodes: for determining the execution path. • Action nodes: for executing a job or task. • The config-default.xml file is optional and contains properties shared by all workflows. • Each workflow can also have a job.properties file (not put into HDFS) for job-specific properties.
  • 6. Pig Action <workflow-app xmlns="uri:oozie:workflow:0.2" name=“transform-workflow"> <start to="transform_data" /> <action name=" transform_data"> <pig> <job-tracker>${resourceManager}</job-tracker> <name-node>${nameNode}</name-node> <prepare> <delete path=“old_data_path" /> </prepare> <script>transform.pig</script> </pig> <ok to="end" /> <error to="fail" /> </action> <kill name="fail"> <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end" /> </workflow-app>
  • 7. Hive Action <action name="find_count_visit"> <hive xmlns="uri:oozie:hive-action:0.5"> <job-tracker>${resourceManager}</job-tracker> <name-node>${nameNode}</name-node> <prepare> <delete path=“old_data_path" /> </prepare> <job-xml>hive-site.xml</job-xml> <configuration> <property> <name>mapreduce.map.output.compress</name> <value>true</value> </property> </configuration> <script>findcount.hive</script> </hive> <ok to="end" /> <error to="fail" /> </action>
  • 9. Schedule a Job Based on Time <coordinator-app name="tf-idf" frequency="1440" start="2013-01-01T00:00Z" end="2013-12-31T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1"> <action> <workflow> <app-path> hdfs://node:8020/home/train/tfidf/workflow </app-path> </workflow> </action> </coordinator-app>
  • 10. Schedule Based on Data Availability <coordinator-app name="file_check" frequency="1440" start="2012-01-01T00:00Z" end="2015-12-31T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1"> <datasets> <dataset name="input1"> <uri-template> hdfs://node:8020/job/result/ </uri-template> </dataset> </datasets> <action> <workflow> <app-path>hdfs://node:8020/myapp/</app-path> </workflow> </action> </coordinator-app>