Everything that you ever wanted to know about Oozie, but were afraid to ask

B Lublinsky, A Yakubovich
Apache Oozie
• Oozie is a workflow/coordination system to
  manage Apache Hadoop jobs.
• A single Oozie server implements all four
  functional Oozie components:
  – Oozie workflow
  – Oozie coordinator
  – Oozie bundle
  – Oozie SLA.
Main components
[Diagram: Oozie Server containing Bundle, Coordinator (time condition monitoring, data condition monitoring) and workflow (wf logic, actions, job submission and monitoring). Clients: 3rd party application through the WS API, and the Oozie Command Line Interface. Stored data: bundle, coordinator and workflow definitions and states. Hadoop side: Oozie shared libraries, HDFS, MapReduce.]
Oozie workflow
Workflow Language
Flow-control node   XML element type        Description
Decision            workflow:DECISION       expresses “switch-case” logic
Fork                workflow:FORK           splits one path of execution into multiple concurrent paths
Join                workflow:JOIN           waits until every concurrent execution path of the preceding fork node arrives at it
Kill                workflow:KILL           forces a workflow job to kill (abort) itself

Action node         XML element type        Description
java                workflow:JAVA           invokes the main() method of the specified Java class
fs                  workflow:FS             manipulates files and directories in HDFS; supports the commands move, delete, mkdir
MapReduce           workflow:MAP-REDUCE     starts a Hadoop map/reduce job; this can be a Java MR job, a streaming job or a pipes job
Pig                 workflow:PIG            runs a Pig job
Sub workflow        workflow:SUB-WORKFLOW   runs a child workflow job
Hive *              workflow:HIVE           runs a Hive job
Shell *             workflow:SHELL          runs a shell command
ssh *               workflow:SSH            starts a shell command on a remote machine as a remote secure shell
Sqoop *             workflow:SQOOP          runs a Sqoop job
Email *             workflow:EMAIL          sends emails from an Oozie workflow application
Distcp ?                                    under development (Yahoo)
Workflow actions
• Oozie workflow supports two types of actions:
   – Synchronous, executed inside the Oozie runtime
   – Asynchronous, executed as a MapReduce job.
[Sequence diagram of an action start]
Participants: ActionStartCommand, WorkflowStore, Services, ActionExecutorContext, JavaActionExecutor, JobClient
   1. workflow := getWorkflow()
   2. action := getAction()
   3. context := init<>()
   4. executor := get()
   5. start()
   6. submitLauncher()
   7. jobClient := get()
   8. runningJob := submit()
   9. setStartData()
Workflow lifecycle
[State diagram with the states PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED]
Oozie execution console
Extending Oozie workflow
• Oozie provides a “minimal” workflow language, which
  contains only a handful of control and action nodes.
• Oozie supports a very elegant extensibility mechanism:
  custom action nodes. Custom action nodes extend
  Oozie’s language with additional actions (verbs).
• Creating a custom action requires the following:
   – A Java action implementation, which extends the ActionExecutor
     class.
   – An XML schema for the action, defining the action’s
     configuration parameters.
   – Packaging the Java implementation and the configuration schema
     into an action jar, which has to be added to the Oozie war.
   – Extending oozie-site.xml to register the custom
     executor with the Oozie runtime.
Oozie Workflow Client
• Oozie integrates easily with enterprise applications
  through its client APIs, which come in two types:
• REST HTTP API
   A set of HTTP requests:
   • Info requests (job status, job configuration)
   • Job management (submit, start, suspend, resume, kill)
   Example: job definition info request
       GET /oozie/v0/job/job-ID?show=definition
• Java API - package org.apache.oozie.client
   – OozieClient
       start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
   – WorkflowJob, WorkflowAction
   – CoordinatorJob, CoordinatorAction
   – SLAEvent
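As a rough illustration of the info request shown above, the sketch below only assembles the request URI in Java; it does not contact a server. The class name OozieRestUrls, the base URL http://localhost:11000/oozie and the job ID are assumptions made for this example, not values from the slides.

```java
import java.net.URI;

// Hypothetical helper (not part of the Oozie client): builds the URI for a
// job info request of the form GET <base>/v0/job/<job-ID>?show=<what>.
final class OozieRestUrls {

    static URI jobInfo(String baseUrl, String jobId, String show) {
        // Strip a trailing slash so the path separator is not doubled.
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return URI.create(base + "/v0/job/" + jobId + "?show=" + show);
    }

    public static void main(String[] args) {
        // Base URL and job ID are placeholders for illustration only.
        System.out.println(jobInfo("http://localhost:11000/oozie", "job-ID", "definition"));
    }
}
```

The resulting URI can be handed to any HTTP client; keeping URI construction separate makes it easy to check without a running Oozie server.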
Oozie workflow good, bad and ugly
• Good
   – Nice integration with the Hadoop ecosystem, making it easy to build
     processes that synchronize the execution of multiple MapReduce,
     Hive, Pig, etc. jobs.
   – Nice UI for tracking execution progress
   – Simple APIs for integration with other applications
   – Simple extensibility APIs
• Bad
   – A process has to be expressed directly in hPDL, with no visual support
   – No support for uber jars (but we added our own)
• Ugly
   – Static forking (but you can regenerate the workflow and invoke it on the fly)
   – No support for loops
Oozie Coordinator
Coordinator language
Element type      Description                                          Attributes and sub-elements
coordinator-app   top-level element of a coordinator instance          frequency, start, end
controls          specifies the execution policy for the coordinator   timeout (actions), concurrency (actions),
                  and its elements (workflow actions)                  execution order (workflow instances)
action            required single element specifying the associated    workflow name
                  workflow; the jobs specified in the workflow
                  consume and produce dataset instances
datasets          collection of data referred to by a logical name;
                  datasets specify data dependencies between
                  workflow instances
input event       specifies the input conditions (in the form of
                  present datasets) required to execute a
                  coordinator action
output event      specifies the dataset that should be produced by
                  the coordinator action
Coordinator lifecycle
Oozie Bundle
Bundle lifecycle
[State diagram with the states PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED, PAUSED, SUCCEEDED, KILLED, FAILED]
Oozie SLA
SLA Navigation
[Diagram relating the SLA_EVENT table to the job and action tables; columns shown:]
   SLA_EVENT:      event_id, alert_contact, alert_frequency, …, sla_id, ...
   COORD_JOBS:     id, app_name, app_path, …
   WF_JOBS:        id, app_name, app_path, …
   COORD_ACTIONS:  id, action_number, action_xml, …, external_id, ...
   WF_ACTIONS:     id, conf, console_url, …
Using Probes to analyze/monitor Places

• Select probe data for specified time/location
• Validate – Filter - Transform probe data
• Calculate statistics on available probe data
• Distribute data per geo-tiles
• Calculate place statistics (e.g. attendance index)
-------------------------------------------------------------
If an exception condition occurs, report failure
If all steps succeed, report success
Workflow as acyclic graph
Workflow – fragment 1
Workflow – fragment 2
Oozie tips and tricks
Configuring workflow
• Oozie provides 3 overlapping mechanisms to configure a workflow:
  config-default.xml, the job properties file, and job arguments that can
  be passed to Oozie as part of the command line invocation.
• Oozie processes these three sets of parameters as follows:
    – Use all of the parameters from the command line invocation
    – For remaining unresolved parameters, use the job properties file
    – Use config-default.xml for everything else
• Although the documentation does not clearly describe when to use
  which, the overall recommendation is as follows:
    – Use config-default.xml for parameters that never change for a
      given workflow
    – Use job properties for parameters that are common to a given
      deployment of a workflow
    – Use command line arguments for parameters that are specific to
      a given workflow invocation.
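The lookup order described above can be sketched in Java as a simple merge of three property sets. This is only an illustration of the precedence, not Oozie's own code; the class WorkflowConfigResolver and the parameter names queueName and inputDir are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper, not part of Oozie: illustrates the precedence only.
final class WorkflowConfigResolver {

    /** Merges the three sources so that higher-priority values win. */
    static Properties resolve(Properties configDefault,
                              Properties jobProperties,
                              Properties commandLineArgs) {
        Properties merged = new Properties();
        merged.putAll(configDefault);    // lowest priority: config-default.xml
        merged.putAll(jobProperties);    // overrides config-default.xml
        merged.putAll(commandLineArgs);  // highest priority: command line
        return merged;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("queueName", "default");
        defaults.setProperty("inputDir", "/data/in");

        Properties job = new Properties();
        job.setProperty("queueName", "etl");

        Properties cli = new Properties();
        cli.setProperty("inputDir", "/data/in/2012-10-01");

        Properties merged = resolve(defaults, job, cli);
        System.out.println(merged.getProperty("queueName")); // prints "etl"
        System.out.println(merged.getProperty("inputDir"));  // prints "/data/in/2012-10-01"
    }
}
```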
Accessing and storing process variables
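• Accessing
  – Through the arguments of the java main() method
• Storing
     // needs java.io.File, java.io.FileOutputStream, java.io.OutputStream, java.util.Properties
     // Oozie passes the location of the output file in this system property
     String ooziePropFileName =
            System.getProperty("oozie.action.output.properties");
     Properties props = new Properties();
     props.setProperty(key, value);
     // try-with-resources closes the stream even if store() fails
     try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
         props.store(os, "");
     }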
Validating data presence
• Oozie provides two possible approaches for validating
  the presence of resource file(s):
   – Using the Oozie coordinator’s input events based on the dataset.
     Technically the simplest approach, but it does not provide the
     more complex decision support that might be required: it either
     runs the corresponding workflow or not.
   – A custom java node inside the Oozie workflow. This allows extending
     the decision logic, e.g. sending notifications about data absence,
     running execution on partial data under certain timing conditions, etc.
• Additional configuration parameters for the Oozie coordinator,
  for example the ability to wait for file arrival, can expand
  the usage of the Oozie coordinator.
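To make the second approach more concrete, here is a minimal sketch of the kind of decision logic a custom java node could carry. Everything in it is an assumption for illustration: the class DataPresenceCheck, the four outcomes, and the two thresholds (maximum wait and minimum fraction of files) do not come from the slides or from Oozie.

```java
// Hypothetical decision logic for a custom java node that validates input data.
final class DataPresenceCheck {

    enum Decision { RUN_FULL, RUN_PARTIAL, WAIT, NOTIFY_MISSING }

    /**
     * @param expected           number of input files the workflow expects
     * @param present            number of input files found so far
     * @param waitedMinutes      how long the node has already waited
     * @param maxWaitMinutes     assumed limit after which waiting stops
     * @param minPartialFraction assumed minimum share of files for a partial run
     */
    static Decision decide(int expected, int present,
                           long waitedMinutes, long maxWaitMinutes,
                           double minPartialFraction) {
        if (present >= expected) {
            return Decision.RUN_FULL;        // everything arrived
        }
        if (waitedMinutes < maxWaitMinutes) {
            return Decision.WAIT;            // still within the waiting window
        }
        // Here expected > present >= 0, so expected is positive.
        double fraction = (double) present / expected;
        return fraction >= minPartialFraction
                ? Decision.RUN_PARTIAL       // enough data for a partial run
                : Decision.NOTIFY_MISSING;   // too little data: report absence
    }

    public static void main(String[] args) {
        System.out.println(decide(10, 10, 0, 60, 0.8));
        System.out.println(decide(10, 9, 90, 60, 0.8));
        System.out.println(decide(10, 2, 90, 60, 0.8));
    }
}
```

In a real workflow the returned value could be written to the action output properties and evaluated by a decision node.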
Invoking MapReduce jobs
• Oozie provides two different ways of invoking a MapReduce
  job: the MapReduce action and the java action.
• Invoking a MapReduce job with the java action is somewhat
  similar to invoking the job with the Hadoop command line
  from the edge node. You specify a driver as the class for the
  java action and Oozie invokes the driver. This approach
  has two main advantages:
   – The same driver class can be used both for running the
     MapReduce job from an edge node and as a java action in an
     Oozie process.
   – A driver provides a convenient place for executing additional
     code, for example clean-up required for MapReduce execution.
• The driver requires a proper shutdown hook to ensure that
  there are no lingering MapReduce jobs.
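A shutdown hook of this kind might look roughly like the sketch below. The KillableJob interface is a hypothetical stand-in for whatever handle the driver holds on the submitted job, so the example stays self-contained; a real driver would adapt its Hadoop job object to it.

```java
// Sketch of a driver-side shutdown hook; names are illustrative only.
final class DriverShutdownHook {

    /** Hypothetical stand-in for a handle to a submitted MapReduce job. */
    interface KillableJob {
        boolean isComplete();
        void kill();
    }

    /** Builds the hook thread: kills the job if it is still running. */
    static Thread hookFor(KillableJob job) {
        return new Thread(() -> {
            if (!job.isComplete()) {
                job.kill();
            }
        }, "mr-driver-shutdown");
    }

    /** Registers the hook with the JVM; it runs when the JVM shuts down. */
    static void register(KillableJob job) {
        Runtime.getRuntime().addShutdownHook(hookFor(job));
    }
}
```

The driver would call register(...) right after submitting the job, so that a killed launcher does not leave the MapReduce job running.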
Implementing predefined looping and forking
• hPDL is an XML document with a well-defined
  schema.
• This means that the actual workflow can be easily
  manipulated using JAXB objects, which can be
  generated from the hPDL schema using the xjc compiler.
• It also means that we can create the complete
  workflow programmatically, based on a calculated
  number of fork branches, or implement loops
  as repeated actions.
• The other option is to create a template process
  and modify it based on calculated parameters.
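The idea of generating a fork with a calculated number of branches can be sketched as follows. Note that this example deliberately uses plain string building rather than the JAXB objects mentioned above, to stay self-contained; the class name ForkJoinGenerator and the branch naming scheme are assumptions made for illustration.

```java
// Sketch: emits an hPDL fork/join fragment for a computed number of branches.
// Uses string building instead of JAXB; node names are illustrative only.
final class ForkJoinGenerator {

    static String forkJoin(String forkName, String joinName,
                           String nextNode, int branches) {
        if (branches < 2) {
            throw new IllegalArgumentException("a fork needs at least two branches");
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<fork name=\"").append(forkName).append("\">\n");
        for (int i = 1; i <= branches; i++) {
            // Each path points at the first node of one concurrent branch.
            xml.append("    <path start=\"").append(forkName)
               .append("-branch-").append(i).append("\"/>\n");
        }
        xml.append("</fork>\n");
        // All branches are expected to transition to this join when done.
        xml.append("<join name=\"").append(joinName)
           .append("\" to=\"").append(nextNode).append("\"/>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(forkJoin("fork-1", "join-1", "end", 3));
    }
}
```

The generated fragment still needs the action nodes for each branch; a loop can be unrolled the same way, by emitting the repeated action a calculated number of times.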
Oozie client security (or lack of)
• By default the Oozie client reads the client’s identity from the
  local machine OS and passes it to the Oozie server,
  which uses this identity to invoke MR jobs.
• Impersonation can be implemented by overriding the
  OozieClient class’s createConfiguration method, where
  client variables can be set through a new constructor.
         // user is a field set by the new constructor; null falls back to the OS user
         @Override
         public Properties createConfiguration() {
             Properties conf = new Properties();
             if (user == null) {
                 conf.setProperty(USER_NAME, System.getProperty("user.name"));
             } else {
                 conf.setProperty(USER_NAME, user);
             }
             return conf;
         }
Uber jars with Oozie
An uber jar contains resources: other jars, .so libraries, zip files.

[Diagram: the Oozie server starts a launcher java action from the uber jar, which holds the launcher classes plus jars, so and zip resources. The launcher:]
   – unpacks resources to the current uber jar dir
   – sets an inverse classloader
   – invokes the MR driver and passes arguments
   – sets a shutdown hook
   – ‘waits for complete’ while the mappers run

<java>
   …
  <main-class>${wfUberLauncher}</main-class>
  <arg>-appStart=${wfAppMain}</arg>
   …
</java>

 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChicago Hadoop Users Group
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Chicago Hadoop Users Group
 

More from Chicago Hadoop Users Group (18)

Kinetica master chug_9.12
Kinetica master chug_9.12Kinetica master chug_9.12
Kinetica master chug_9.12
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your Business
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Financial Data Analytics with Hadoop
Financial Data Analytics with HadoopFinancial Data Analytics with Hadoop
Financial Data Analytics with Hadoop
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Everything you wanted to know, but were afraid to ask about Oozie

  • 1. Everything that you ever wanted to know about Oozie, but were afraid to ask B Lublinsky, A Yakubovich
  • 2. Apache Oozie • Oozie is a workflow/coordination system to manage Apache Hadoop jobs. • A single Oozie server implements all four functional Oozie components: – Oozie workflow – Oozie coordinator – Oozie bundle – Oozie SLA.
  • 3. Main components (diagram). The Oozie Server hosts three layers: Bundle, Coordinator (time condition monitoring, data condition monitoring) and workflow (wf logic over actions; job submission and monitoring). Clients: a 3rd party application through the WS API, and the Oozie Command Line Interface. Also shown: definitions and states; Oozie shared libraries and data in HDFS; MapReduce on Hadoop; a Bundle containing Coordinators, each Coordinator driving a Workflow.
  • 5. Workflow Language
      Flow-control nodes (node, XML element type: description)
        – Decision, workflow:DECISION: expresses “switch-case” logic
        – Fork, workflow:FORK: splits one path of execution into multiple concurrent paths
        – Join, workflow:JOIN: waits until every concurrent execution path of a previous fork node arrives at it
        – Kill, workflow:kill: forces a workflow job to kill (abort) itself
      Action nodes (node, XML element type: description)
        – java, workflow:JAVA: invokes the main() method of the specified java class
        – fs, workflow:FS: manipulates files and directories in HDFS; supports commands: move, delete, mkdir
        – MapReduce, workflow:MAP-REDUCE: starts a Hadoop map/reduce job, which could be a java MR job, a streaming job or a pipe job
        – Pig, workflow:pig: runs a Pig job
        – Sub workflow, workflow:SUB-WORKFLOW: runs a child workflow job
        – Hive *, workflow:HIVE: runs a Hive job
        – Shell *, workflow:SHELL: runs a Shell command
        – ssh *, workflow:SSH: starts a shell command on a remote machine as a remote secure shell
        – Sqoop *, workflow:SQOOP: runs a Sqoop job
        – Email *, workflow:EMAIL: sends emails from an Oozie workflow application
        – Distcp: ? Under development (Yahoo)
  • 6. Workflow actions
      • Oozie workflow supports two types of actions:
        – Synchronous, executed inside the Oozie runtime
        – Asynchronous, executed as a Map Reduce job.
      Sequence diagram (participants: ActionStartCommand, WorkflowStore, Services, ActionExecutorContext, JavaActionExecutor, JobClient):
        1 : workflow := getWorkflow()
        2 : action := getAction()
        3 : context := init<>()
        4 : executor := get()
        5 : start()
        6 : submitLauncher()
        7 : jobClient := get()
        8 : runningJob := submit()
        9 : setStartData()
  • 7. Workflow lifecycle (state diagram). States: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED
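The slide lists only the state names. As a minimal sketch, the lifecycle can be modelled as an enum with an allowed-transition table. The transitions below are an assumption based on a common reading of the Oozie workflow specification, not something the slide states, and `WorkflowStatus` is a hypothetical name rather than an Oozie class.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the workflow job states named on the slide.
enum WorkflowStatus {
    PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED;

    // Allowed next states. ASSUMPTION: this transition set is not taken
    // from the slide; verify it against the Oozie version you run.
    Set<WorkflowStatus> next() {
        switch (this) {
            case PREP:      return EnumSet.of(RUNNING, KILLED);
            case RUNNING:   return EnumSet.of(SUSPENDED, SUCCEEDED, KILLED, FAILED);
            case SUSPENDED: return EnumSet.of(RUNNING, KILLED);
            default:        return EnumSet.noneOf(WorkflowStatus.class);
        }
    }

    // A state with no outgoing transitions ends the job.
    boolean isTerminal() {
        return next().isEmpty();
    }
}
```

A monitoring client could poll a job and stop once `isTerminal()` returns true for the reported status.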
  • 9. Extending Oozie workflow
      • Oozie provides a “minimal” workflow language, which contains only a handful of control and action nodes.
      • Oozie supports a very elegant extensibility mechanism: custom action nodes. Custom action nodes extend Oozie’s language with additional actions (verbs).
      • Creating a custom action requires the following:
        – A Java action implementation, which extends the ActionExecutor class.
        – An XML schema for the action, defining the action’s configuration parameters.
        – Packaging of the Java implementation and the configuration schema into an action jar, which has to be added to the Oozie war.
        – Extending oozie-site.xml to register information about the custom executor with the Oozie runtime.
  • 10. Oozie Workflow Client
      • Oozie integrates with enterprise applications through its client APIs, of which there are two types:
      • REST HTTP API, a number of HTTP requests:
        – Info requests (job status, job configuration)
        – Job management (submit, start, suspend, resume, kill)
        Example, job definition info request: GET /oozie/v0/job/job-ID?show=definition
      • Java API, package org.apache.oozie.client:
        – OozieClient: start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
        – WorkflowJob, WorkflowAction
        – CoordinatorJob, CoordinatorAction
        – SLAEvent
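The job definition info request from the slide can be issued from plain Java. This is a minimal sketch using the JDK's `java.net.http` client (Java 11 or later) rather than the Oozie Java API; the class name `OozieRestSketch`, the server base URL and the job ID are assumptions for illustration, and only the `/oozie/v0/job/job-ID?show=definition` path comes from the slide.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; not part of the Oozie client library.
final class OozieRestSketch {

    // Builds the info-request URI: <serverBase>/oozie/v0/job/<jobId>?show=<show>.
    static URI jobInfoUri(String serverBase, String jobId, String show) {
        return URI.create(serverBase + "/oozie/v0/job/" + jobId + "?show=" + show);
    }

    // Sends the GET request and returns the raw response body as a string.
    static String fetch(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

For example, `OozieRestSketch.fetch(OozieRestSketch.jobInfoUri("http://localhost:11000", jobId, "definition"))` would request the workflow definition, assuming a server is listening at that (assumed) address.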
  • 11. Oozie workflow good, bad and ugly
      • Good
        – Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple Map Reduce, Hive, Pig, etc. jobs.
        – Nice UI for tracking execution progress
        – Simple APIs for integration with other applications
        – Simple extensibility APIs
      • Bad
        – The process has to be expressed directly in hPDL, with no visual support
        – No support for uber jars (but we added our own)
      • Ugly
        – Static forking (but you can regenerate the workflow and invoke it on the fly)
        – No support for loops
  • 13. Coordinator language (element type: description; attributes and sub-elements)
        – coordinator-app: top-level element in a coordinator instance; attributes: frequency, start, end
        – controls: specify the execution policy for the coordinator and its elements (workflow actions); sub-elements: timeout (actions), concurrency (actions), execution order (workflow instances)
        – action: required singular element specifying the associated workflow. The jobs specified in the workflow consume and produce dataset instances; sub-element: Workflow name
        – datasets: collection of data referred to by a logical name. Datasets serve to specify data dependencies between workflow instances
        – input event: specifies the input conditions (in the form of present data sets) that are required in order to execute a coordinator action
        – output event: specifies the dataset that should be produced by the coordinator action
  • 16. Bundle lifecycle (state diagram). States: PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED, PAUSED, SUCCEEDED, KILLED, FAILED
  • 18. SLA Navigation (table diagram)
        – COORD_JOBS: id, app_name, app_path, …
        – WF_JOBS: id, app_name, app_path, …
        – SLA_EVENT: event_id, alert_contact, alert_frequency, …, sla_id, ...
        – COORD_ACTIONS: id, action_number, action_xml, …, external_id, ...
        – WF_ACTIONS: id, conf, console_url, …
  • 20. Using Probes to analyze/monitor Places
      • Select probe data for the specified time/location
      • Validate, filter and transform probe data
      • Calculate statistics on available probe data
      • Distribute data per geo-tiles
      • Calculate place statistics (e.g. attendance index)
      If an exception condition happens, report failure. If all steps succeeded, report success.
  • 24. Oozie tips and tricks
  • 25. Configuring workflow
      • Oozie provides 3 overlapping mechanisms to configure a workflow: config-default.xml, the job properties file, and job arguments that can be passed to Oozie as part of command line invocations.
      • Oozie processes these three sets of parameters as follows:
        – Use all of the parameters from the command line invocation
        – For remaining unresolved parameters, use the job config
        – Use config-default.xml for everything else
      • Although the documentation does not describe clearly when to use which, the overall recommendation is as follows:
        – Use config-default.xml for parameters that never change for a given workflow
        – Use job properties for parameters that are common for a given deployment of a workflow
        – Use command line arguments for parameters that are specific to a given workflow invocation.
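The precedence described on this slide (command line first, then job properties, then config-default.xml) can be illustrated with layered `java.util.Properties`. This is only a sketch of the resolution order, not Oozie's own implementation; `WorkflowParams` and the parameter names in the usage note are hypothetical.

```java
import java.util.Properties;

// Hypothetical illustration of the three-layer parameter precedence.
final class WorkflowParams {

    // Layers are applied from lowest to highest priority, so a key set on
    // the command line overrides the same key from the job properties,
    // which in turn overrides config-default.xml.
    static Properties resolve(Properties configDefault,
                              Properties jobProperties,
                              Properties commandLine) {
        Properties merged = new Properties();
        merged.putAll(configDefault);
        merged.putAll(jobProperties);
        merged.putAll(commandLine);
        return merged;
    }
}
```

With a (made-up) key `queue` set to `default` in config-default.xml and to `prod` in the job properties, the resolved value is `prod` unless the command line supplies its own value.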
  • 26. Accessing and storing process variables
      • Accessing
        – Through the arguments in java main
      • Storing
          // Oozie passes the name of the output properties file as a system property.
          String ooziePropFileName = System.getProperty("oozie.action.output.properties");
          Properties props = new Properties();
          props.setProperty(key, value);
          // try-with-resources closes the stream even if store() throws.
          try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
              props.store(os, "");
          }
  • 27. Validating data presence
      • Oozie provides two possible approaches for validating the presence of resource file(s):
        – Using Oozie coordinator’s input events based on the data set. This is technically the simplest approach, but it does not provide the more complex decision support that might be required: it either runs the corresponding workflow or it does not.
        – A custom java node inside the Oozie workflow. This allows extending the decision logic, for example by sending notifications about data absence or by running on partial data under certain timing conditions.
      • Additional configuration parameters for the Oozie coordinator, for example the ability to wait for files to arrive, can expand its usage.
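The kind of decision logic a custom java node could add on top of a plain presence check might look like the sketch below. The policy, the `DataPresenceCheck` class, the `Decision` values and all parameters are hypothetical, chosen only to illustrate "notify on absence" and "run on partial data after a timeout"; a real node would obtain the file counts from HDFS.

```java
// Hypothetical decision policy for a custom data-validation node.
final class DataPresenceCheck {

    enum Decision { RUN, RUN_PARTIAL, WAIT, NOTIFY_ABSENT }

    // present/expected: number of input files found vs. required.
    // minutesWaited/maxWaitMinutes: how long we have waited vs. the limit.
    static Decision decide(int present, int expected,
                           long minutesWaited, long maxWaitMinutes) {
        if (present >= expected) {
            return Decision.RUN;              // all data arrived
        }
        if (minutesWaited < maxWaitMinutes) {
            return Decision.WAIT;             // still within the waiting window
        }
        // Waiting window exhausted: run on what we have, or report absence.
        return present > 0 ? Decision.RUN_PARTIAL : Decision.NOTIFY_ABSENT;
    }
}
```

A coordinator input event alone covers only the first case (run when all data is present); the other three outcomes are what the custom node would add.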
  • 28. Invoking Map Reduce jobs
      • Oozie provides two different ways of invoking a Map Reduce job: the MapReduce action and the java action.
      • Invoking a Map Reduce job with the java action is somewhat similar to invoking the job with the Hadoop command line from the edge node. You specify a driver as the class for the java activity and Oozie invokes the driver. This approach has two main advantages:
        – The same driver class can be used both for running the Map Reduce job from an edge node and for a java action in an Oozie process.
        – A driver provides a convenient place for executing additional code, for example clean-up required for Map Reduce execution.
      • The driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs.
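The shutdown hook mentioned in the last bullet can be sketched with the JDK's `Runtime.addShutdownHook`. To keep the example self-contained, `KillableJob` is a hypothetical stand-in for a handle to the submitted Map Reduce job; in a real driver it would wrap whatever job object the Hadoop API returns.

```java
// Hypothetical sketch of a driver-side shutdown hook.
final class DriverShutdownSketch {

    // Stand-in for a handle to a submitted Map Reduce job (assumption).
    interface KillableJob {
        boolean isComplete();
        void kill();
    }

    // Builds the hook thread: kills the job only if it is still running.
    static Thread killHook(KillableJob job) {
        return new Thread(() -> {
            if (!job.isComplete()) {
                job.kill();
            }
        });
    }

    // Registers the hook so the JVM runs it when the driver process exits,
    // for example when Oozie kills the launcher.
    static Thread register(KillableJob job) {
        Thread hook = killHook(job);
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

The driver would call `register(job)` right after submitting the job and before blocking on completion, so that an aborted driver does not leave the job running.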
  • 29. Implementing predefined looping and forking
      • hPDL is an XML document with a well-defined schema.
      • This means that the actual workflow can be easily manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.
      • As a result, the complete workflow can be created programmatically, based on a calculated number of fork branches, or with loops implemented as repeated actions.
      • The other option is to create a template process and modify it based on calculated parameters.
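To illustrate generating a fork with a calculated number of branches, here is a minimal sketch. It uses plain string building instead of the JAXB objects the slide recommends, purely to stay self-contained; the class `ForkJoinGenerator` and the node names (`branch-0`, `branch-1`, ...) are hypothetical, and the output is only the fork/join fragment, not a complete workflow.

```java
// Hypothetical generator for an hPDL fork/join fragment.
final class ForkJoinGenerator {

    // Emits a <fork> with `branches` paths and the matching <join>.
    // Each path starts a node named "branch-<i>", which the caller must
    // define elsewhere in the workflow and route to the join.
    static String forkJoin(String forkName, String joinName,
                           String nextNode, int branches) {
        StringBuilder xml = new StringBuilder();
        xml.append("<fork name=\"").append(forkName).append("\">\n");
        for (int i = 0; i < branches; i++) {
            xml.append("    <path start=\"branch-").append(i).append("\"/>\n");
        }
        xml.append("</fork>\n");
        xml.append("<join name=\"").append(joinName)
           .append("\" to=\"").append(nextNode).append("\"/>\n");
        return xml.toString();
    }
}
```

A loop with a known iteration count can be unrolled the same way, by emitting the repeated action N times with each copy's `ok` transition pointing at the next one.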
  • 30. Oozie client security (or lack of)
      • By default the Oozie client reads the client’s identity from the local machine OS and passes it to the Oozie server, which uses this identity for MR job invocation.
      • Impersonation can be implemented by overriding the OozieClient class’ createConfiguration method, with the client variables (here, user) set through a new constructor.
          public Properties createConfiguration() {
              Properties conf = new Properties();
              if (user == null)
                  // No explicit user supplied: fall back to the OS identity.
                  conf.setProperty(USER_NAME, System.getProperty("user.name"));
              else
                  conf.setProperty(USER_NAME, user);
              return conf;
          }
  • 31. uber jars with Oozie (diagram)
      • The uber jar contains resources: other jars, so libraries, zip files.
      • The Oozie server runs a java action, which starts the uber jar Classes (Launcher).
      • Launcher steps:
        – unpack resources to the current uber jar dir
        – set inverse classloader
        – invoke the MR driver, pass arguments
        – set shutdown hook
        – ‘wait for complete’
      • java action configuration: <java> … <main-class>${wfUberLauncher}</main-class> <arg>-appStart=${wfAppMain}</arg> … </java>