Everything that you ever wanted to know about Oozie, but were afraid to ask

B Lublinsky, A Yakubovich
Apache Oozie
• Oozie is a workflow/coordination system to
  manage Apache Hadoop jobs.
• A single Oozie server implements all four
  functional Oozie components:
  – Oozie workflow
  – Oozie coordinator
  – Oozie bundle
  – Oozie SLA.
Main components
[Figure: Oozie server architecture. Labels: 3rd party application, Oozie Command Line Interface, WS API, Oozie Server (bundle, coordinator, workflow, actions), time condition monitoring, data condition monitoring, wf logic, job submission and monitoring, definitions and states, Oozie shared libraries, HDFS, MapReduce, Hadoop.]
Oozie workflow
Workflow Language
Flow-control node   XML element type        Description
Decision            workflow:DECISION       expresses “switch-case” logic
Fork                workflow:FORK           splits one path of execution into multiple concurrent paths
Join                workflow:JOIN           waits until every concurrent execution path of a previous fork node arrives at it
Kill                workflow:kill           forces a workflow job to kill (abort) itself

Action node         XML element type        Description
java                workflow:JAVA           invokes the main() method of the specified Java class
fs                  workflow:FS             manipulates files and directories in HDFS; supports the commands move, delete, mkdir
MapReduce           workflow:MAP-REDUCE     starts a Hadoop map/reduce job, which can be a Java MR job, a streaming job or a pipe job
Pig                 workflow:pig            runs a Pig job
Sub workflow        workflow:SUB-WORKFLOW   runs a child workflow job
Hive *              workflow:HIVE           runs a Hive job
Shell *             workflow:SHELL          runs a shell command
ssh *               workflow:SSH            starts a shell command on a remote machine as a remote secure shell
Sqoop *             workflow:SQOOP          runs a Sqoop job
Email *             workflow:EMAIL          sends emails from an Oozie workflow application
Distcp ?                                    under development (Yahoo)
Workflow actions
• Oozie workflow supports two types of actions:
    – Synchronous, executed inside the Oozie runtime
    – Asynchronous, executed as a Map Reduce job
[Sequence diagram: starting an action. Participants: ActionStartCommand, WorkflowStore, Services, ActionExecutorContext, JavaActionExecutor, JobClient. Calls, in order:]
  1. workflow := getWorkflow()
  2. action := getAction()
  3. context := init<>()
  4. executor := get()
  5. start()
  6. submitLauncher()
  7. jobClient := get()
  8. runningJob := submit()
  9. setStartData()
Workflow lifecycle

[State diagram. Workflow job states: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED.]
Oozie execution console
Extending Oozie workflow
• Oozie provides a “minimal” workflow language, which
  contains only a handful of control and action nodes.
• Oozie supports a very elegant extensibility mechanism:
  custom action nodes. Custom action nodes make it possible to extend
  Oozie’s language with additional actions (verbs).
• Creating a custom action requires the following:
   – A Java action implementation, which extends the ActionExecutor
     class.
   – An XML schema for the action, defining the action’s
     configuration parameters.
   – Packaging of the Java implementation and configuration schema
     into an action jar, which has to be added to the Oozie war.
   – Extending oozie-site.xml to register information about the custom
     executor with the Oozie runtime.
Oozie Workflow Client
• Oozie provides an easy way to integrate with enterprise
  applications through the Oozie client APIs. It provides two
  types of APIs:
• REST HTTP API
   A number of HTTP requests:
   • Info requests (job status, job configuration)
   • Job management (submit, start, suspend, resume, kill)
   Example: job definition info request
       GET /oozie/v0/job/job-ID?show=definition
• Java API - package org.apache.oozie.client
   – OozieClient
       start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
   – WorkflowJob, WorkflowAction
   – CoordinatorJob, CoordinatorAction
   – SLAEvent
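As a minimal sketch of the job definition info request shown above, the following Java builds (without sending) the corresponding GET request using the JDK's java.net.http API (Java 11+). The base URL passed in by the caller, and the class and method names, are hypothetical and only serve the illustration; "job-ID" is the placeholder from the slide.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of the Oozie client library.
final class OozieRestSketch {

    // Builds the GET request for a job's workflow definition.
    // baseUrl is an assumption, e.g. "http://localhost:11000" for a local server.
    static HttpRequest definitionRequest(String baseUrl, String jobId) {
        URI uri = URI.create(baseUrl + "/oozie/v0/job/" + jobId + "?show=definition");
        return HttpRequest.newBuilder(uri).GET().build();
    }
}
```

The request can then be sent with java.net.http.HttpClient; the Java client API listed above (OozieClient) wraps this kind of call for you.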
Oozie workflow good, bad and ugly
• Good
   – Nice integration with the Hadoop ecosystem, making it easy to build
     processes encompassing synchronized execution of multiple Map
     Reduce, Hive, Pig, etc. jobs.
   – Nice UI for tracking execution progress
   – Simple APIs for integration with other applications
   – Simple extensibility APIs
• Bad
   – Process has to be expressed directly in hPDL with no visual support
   – No support for Uber Jars (but we added our own)
• Ugly
   – Static forking (but you can regenerate the workflow and invoke it on the fly)
   – No support for loops
Oozie Coordinator
Coordinator language
Element type      Description                                            Attributes and sub-elements
coordinator-app   top-level element in a coordinator instance            frequency, start, end
controls          specifies the execution policy for the coordinator     timeout (actions), concurrency (actions),
                  and its elements (workflow actions)                    execution order (workflow instances)
action            required singular element specifying the associated    workflow name
                  workflow; the jobs specified in the workflow consume
                  and produce dataset instances
datasets          collection of data referred to by a logical name;
                  datasets specify data dependencies between
                  workflow instances
input event       specifies the input conditions (in the form of
                  present datasets) that are required in order to
                  execute a coordinator action
output event      specifies the dataset that should be produced
                  by the coordinator action
Coordinator lifecycle
Oozie Bundle
Bundle lifecycle

[State diagram. Bundle job states: PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED, PAUSED, SUCCEEDED, KILLED, FAILED.]
Oozie SLA
SLA Navigation
[Diagram: database tables used for SLA navigation]
SLA_EVENT: event_id, alert_contact, alert_frequency, …, sla_id, ...
COORD_JOBS: id, app_name, app_path, …
COORD_ACTIONS: id, action_number, action_xml, …, external_id, ...
WF_JOBS: id, app_name, app_path, …
WF_ACTIONS: id, conf, console_url, …
Using Probes to analyze/monitor Places

• Select probe data for specified time/location
• Validate – Filter - Transform probe data
• Calculate statistics on available probe data
• Distribute data per geo-tiles
• Calculate place statistics (e.g. attendance index)
-------------------------------------------------------------
If exception condition happens, report failure
If all steps succeeded, report success
Workflow as acyclic graph
Workflow – fragment 1
Workflow – fragment 2
Oozie tips and tricks
Configuring workflow
• Oozie provides 3 overlapping mechanisms to configure a workflow:
  config-default.xml, the job properties file and job arguments that can
  be passed to Oozie as part of command line invocations.
• Oozie processes these three sets of parameters as follows:
    – Use all of the parameters from the command line invocation
    – For remaining unresolved parameters, use the job properties file
    – Use config-default.xml for everything else
• Although the documentation does not describe clearly when to use
  which, the overall recommendation is as follows:
    – Use config-default.xml for parameters that never change for a
      given workflow
    – Use job properties for parameters that are common to a given
      deployment of a workflow
    – Use command line arguments for parameters that are specific to
      a given workflow invocation.
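The precedence order described above can be sketched in a few lines of Java. This is only an illustration of the layering (command line over job properties over config-default.xml), not Oozie's own resolution code; the class and method names are hypothetical.

```java
import java.util.Properties;

// Hypothetical illustration of the three configuration layers.
final class WorkflowConfigSketch {

    // Later putAll calls overwrite earlier ones, so the order of the calls
    // encodes the precedence: defaults first, command line last.
    static Properties resolve(Properties configDefault,
                              Properties jobProperties,
                              Properties commandLine) {
        Properties resolved = new Properties();
        resolved.putAll(configDefault);  // lowest precedence
        resolved.putAll(jobProperties);  // overrides config-default.xml
        resolved.putAll(commandLine);    // highest precedence
        return resolved;
    }
}
```

A parameter defined in all three places resolves to the command line value, one defined only in config-default.xml keeps its default.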
Accessing and storing process variables
• Accessing
  – Through the arguments in java main
• Storing
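     // Oozie passes the location of the output properties file in a system property
     String ooziePropFileName =
            System.getProperty("oozie.action.output.properties");
     Properties props = new Properties();
     props.setProperty(key, value);
     // try-with-resources closes the stream even if store() throws
     try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
         props.store(os, "");
     }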
Validating data presence
• Oozie provides two possible approaches for validating the
  presence of resource file(s):
   – Using the Oozie coordinator’s input events based on the dataset.
     This is technically the simplest approach, but it does
     not provide the more complex decision support that might be
     required: it either runs the corresponding workflow or it does not.
   – A custom java node inside the Oozie workflow. This allows extending
     the decision logic, for example by sending notifications about data
     absence or running on partial data under certain timing conditions.
• Additional configuration parameters for the Oozie coordinator,
  for example the ability to wait for files to arrive, can expand
  the usage of the Oozie coordinator.
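To make the custom java node option more concrete, here is a minimal sketch of the kind of decision logic such a node could carry. Everything in it is an assumption for illustration: the class, the Decision values, and the rule itself (run when all data is present, keep waiting until a deadline, then run on partial data or fail). Checking the files in HDFS and sending notifications are left out.

```java
import java.time.Duration;

// Hypothetical decision logic for a custom data-presence check.
final class DataPresenceSketch {

    enum Decision { RUN, WAIT, RUN_PARTIAL, FAIL }

    // present/expected: number of input files found vs. required.
    // waited/maxWait: how long we have waited vs. how long we are willing to wait.
    static Decision decide(int present, int expected, Duration waited, Duration maxWait) {
        if (present >= expected) {
            return Decision.RUN;            // all data is there
        }
        if (waited.compareTo(maxWait) < 0) {
            return Decision.WAIT;           // still within the waiting window
        }
        // Deadline reached: run on what we have, or fail if there is nothing.
        return present > 0 ? Decision.RUN_PARTIAL : Decision.FAIL;
    }
}
```

The returned value could be written to the action's output properties and evaluated by a decision node in the workflow.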
Invoking Map Reduce jobs
• Oozie provides two different ways of invoking a Map Reduce
  job: the MapReduce action and the java action.
• Invoking a Map Reduce job with the java action is somewhat
  similar to invoking the job with the Hadoop command line
  from the edge node. You specify a driver as the class for the
  java action and Oozie invokes the driver. This approach
  has two main advantages:
   – The same driver class can be used both for running the Map
     Reduce job from an edge node and as a java action in an Oozie
     process.
   – A driver provides a convenient place for executing additional
     code, for example clean-up required for Map Reduce execution.
• The driver requires a proper shutdown hook to ensure that
  there are no lingering Map Reduce jobs.
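A shutdown hook of the kind mentioned above might look like the following sketch. The KillableJob interface is a hypothetical stand-in for whatever handle the driver holds on the submitted Map Reduce job; only Runtime.addShutdownHook is a real JDK call here.

```java
// Hypothetical sketch: kill a still-running job when the driver JVM shuts down.
final class DriverShutdownSketch {

    // Stand-in for the driver's handle on the submitted job (assumption).
    interface KillableJob {
        boolean isComplete() throws Exception;
        void kill() throws Exception;
    }

    // Registers a JVM shutdown hook that kills the job if it has not finished.
    // Returns the hook thread so the caller can remove it after normal completion.
    static Thread registerKillHook(KillableJob job) {
        Thread hook = new Thread(() -> {
            try {
                if (!job.isComplete()) {
                    job.kill();
                }
            } catch (Exception e) {
                System.err.println("Could not kill job on shutdown: " + e);
            }
        });
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

A driver would call registerKillHook right after submitting the job, so that killing the java action also stops the Map Reduce job it started.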
Implementing predefined looping and forking
• hPDL is an XML document with a well-defined schema.
• This means that the actual workflow can be easily
  manipulated using JAXB objects, which can be
  generated from the hPDL schema using the xjc compiler.
• It also means that we can create the complete
  workflow programmatically, based on a calculated
  number of fork branches, or implement loops
  as repeated actions.
• The other option is to create a template process
  and modify it based on calculated parameters.
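As a rough sketch of programmatic generation, the following Java emits a workflow with a calculated number of fork branches. To stay self-contained it builds the XML with a StringBuilder instead of the JAXB objects suggested above. The node names, the ${jobTracker}, ${nameNode} and ${mainClass} parameters and the 0.2 schema namespace are assumptions for illustration; a generated workflow should be validated against the hPDL schema actually in use.

```java
// Hypothetical generator for a fork/join workflow with N parallel java actions.
final class ForkWorkflowSketch {

    // name is assumed to be XML-safe; no escaping is done in this sketch.
    static String generate(String name, int branches) {
        if (branches < 2) {
            throw new IllegalArgumentException("a fork needs at least 2 branches");
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<workflow-app xmlns=\"uri:oozie:workflow:0.2\" name=\"")
           .append(name).append("\">\n");
        xml.append("  <start to=\"fork-node\"/>\n");

        // One <path> per calculated branch.
        xml.append("  <fork name=\"fork-node\">\n");
        for (int i = 0; i < branches; i++) {
            xml.append("    <path start=\"branch-").append(i).append("\"/>\n");
        }
        xml.append("  </fork>\n");

        // One java action per branch; the branch index is passed as an argument.
        for (int i = 0; i < branches; i++) {
            xml.append("  <action name=\"branch-").append(i).append("\">\n")
               .append("    <java>\n")
               .append("      <job-tracker>${jobTracker}</job-tracker>\n")
               .append("      <name-node>${nameNode}</name-node>\n")
               .append("      <main-class>${mainClass}</main-class>\n")
               .append("      <arg>").append(i).append("</arg>\n")
               .append("    </java>\n")
               .append("    <ok to=\"join-node\"/>\n")
               .append("    <error to=\"fail\"/>\n")
               .append("  </action>\n");
        }

        xml.append("  <join name=\"join-node\" to=\"end\"/>\n");
        xml.append("  <kill name=\"fail\">\n")
           .append("    <message>Branch failed</message>\n")
           .append("  </kill>\n");
        xml.append("  <end name=\"end\"/>\n");
        xml.append("</workflow-app>\n");
        return xml.toString();
    }
}
```

The same idea covers predefined loops: instead of parallel paths, the generator would chain N copies of an action, each pointing its ok transition at the next one.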
Oozie client security (or lack of)
• By default the Oozie client reads the client’s identity from the
  local machine OS and passes it to the Oozie server,
  which uses this identity for MR job invocation.
• Impersonation can be implemented by overriding the
  OozieClient class’s createConfiguration method, where
  client variables can be set through a new constructor.
         // "user" is a field set by the new constructor; null means "use the OS identity"
         @Override
         public Properties createConfiguration() {
             Properties conf = new Properties();
             if (user == null) {
                 conf.setProperty(USER_NAME, System.getProperty("user.name"));
             } else {
                 conf.setProperty(USER_NAME, user);
             }
             return conf;
         }
uber jars with Oozie
uber jar contains resources: other jars, so libraries, zip files
[Diagram: the Oozie server runs a launcher java action from the uber jar, which holds the classes (Launcher) plus jars, so and zip resources. Steps shown for the launcher:]
  – unpack resources to current uber jar dir
  – set inverse classloader
  – invoke MR driver, pass arguments
  – set shutdown hook
  – ‘wait for complete’
[The MR driver in turn runs the mappers. The java action is configured as:]
<java>
   …
  <main-class>${wfUberLauncher}</main-class>
  <arg>-appStart=${wfAppMain}</arg>
   …
</java>
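The "inverse classloader" step is not spelled out on the slide. One common reading is a child-first (parent-last) class loader, which looks in the unpacked uber jar resources before delegating to the parent. The sketch below shows that technique under this assumption; it is not the authors' launcher code, and the class name is hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical child-first class loader: tries its own URLs before the parent.
final class ChildFirstClassLoaderSketch extends URLClassLoader {

    ChildFirstClassLoaderSketch(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Core JDK classes always come from the parent chain.
            if (name.startsWith("java.")) {
                return super.loadClass(name, resolve);
            }
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name);                 // own URLs first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, false);    // then the usual parent-first path
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

A launcher could build such a loader over the URLs of the unpacked jars and set it as the thread context class loader before invoking the MR driver.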

More Related Content

What's hot

Introducing Obsidian Software and RAVEN-GCS for PowerPC
Introducing Obsidian Software and RAVEN-GCS for PowerPCIntroducing Obsidian Software and RAVEN-GCS for PowerPC
Introducing Obsidian Software and RAVEN-GCS for PowerPC
DVClub
 
Yahoo Cloud Serving Benchmark
Yahoo Cloud Serving BenchmarkYahoo Cloud Serving Benchmark
Yahoo Cloud Serving Benchmark
kevin han
 

What's hot (20)

PayPayでのk8s活用事例
PayPayでのk8s活用事例PayPayでのk8s活用事例
PayPayでのk8s活用事例
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
その ionice、ほんとに効いてますか?
その ionice、ほんとに効いてますか?その ionice、ほんとに効いてますか?
その ionice、ほんとに効いてますか?
 
Java 17直前!オレ流OpenJDK「の」開発環境(Open Source Conference 2021 Online/Kyoto 発表資料)
Java 17直前!オレ流OpenJDK「の」開発環境(Open Source Conference 2021 Online/Kyoto 発表資料)Java 17直前!オレ流OpenJDK「の」開発環境(Open Source Conference 2021 Online/Kyoto 発表資料)
Java 17直前!オレ流OpenJDK「の」開発環境(Open Source Conference 2021 Online/Kyoto 発表資料)
 
Introducing Obsidian Software and RAVEN-GCS for PowerPC
Introducing Obsidian Software and RAVEN-GCS for PowerPCIntroducing Obsidian Software and RAVEN-GCS for PowerPC
Introducing Obsidian Software and RAVEN-GCS for PowerPC
 
Migration Guide from Java 8 to Java 11 #jjug
Migration Guide from Java 8 to Java 11 #jjugMigration Guide from Java 8 to Java 11 #jjug
Migration Guide from Java 8 to Java 11 #jjug
 
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)
 
Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018
 
Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.
 
At least onceってぶっちゃけ問題の先送りだったよね #kafkajp
At least onceってぶっちゃけ問題の先送りだったよね #kafkajpAt least onceってぶっちゃけ問題の先送りだったよね #kafkajp
At least onceってぶっちゃけ問題の先送りだったよね #kafkajp
 
Hive
HiveHive
Hive
 
MySQL Binlog Events でストリーム処理してみた #MySQLUC15
MySQL Binlog Events でストリーム処理してみた #MySQLUC15MySQL Binlog Events でストリーム処理してみた #MySQLUC15
MySQL Binlog Events でストリーム処理してみた #MySQLUC15
 
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのかApache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Yahoo Cloud Serving Benchmark
Yahoo Cloud Serving BenchmarkYahoo Cloud Serving Benchmark
Yahoo Cloud Serving Benchmark
 
Hadoopの標準GUI HUEの最新情報
Hadoopの標準GUI HUEの最新情報Hadoopの標準GUI HUEの最新情報
Hadoopの標準GUI HUEの最新情報
 
Apache Flink Adoption @ Shopify
Apache Flink Adoption @ ShopifyApache Flink Adoption @ Shopify
Apache Flink Adoption @ Shopify
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
Apache EventMesh を使ってみた
Apache EventMesh を使ってみたApache EventMesh を使ってみた
Apache EventMesh を使ってみた
 

Viewers also liked

HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas NApache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Yahoo Developer Network
 

Viewers also liked (20)

Apache Oozie
Apache OozieApache Oozie
Apache Oozie
 
Oozie sweet
Oozie sweetOozie sweet
Oozie sweet
 
Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY Way
 
Oozie towards zero downtime
Oozie towards zero downtimeOozie towards zero downtime
Oozie towards zero downtime
 
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case StudyOozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
 
Oozie meetup - HA
Oozie meetup - HAOozie meetup - HA
Oozie meetup - HA
 
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for HadoopMay 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Process Safety Life Cycle Management: Best Practices and Processes
Process Safety Life Cycle Management: Best Practices and ProcessesProcess Safety Life Cycle Management: Best Practices and Processes
Process Safety Life Cycle Management: Best Practices and Processes
 
Oozie @ Riot Games
Oozie @ Riot GamesOozie @ Riot Games
Oozie @ Riot Games
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
July 2012 HUG: Overview of Oozie Qualification Process
July 2012 HUG: Overview of Oozie Qualification ProcessJuly 2012 HUG: Overview of Oozie Qualification Process
July 2012 HUG: Overview of Oozie Qualification Process
 
Oozie at Yahoo
Oozie at YahooOozie at Yahoo
Oozie at Yahoo
 
Oozie HUG May12
Oozie HUG May12Oozie HUG May12
Oozie HUG May12
 
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas NApache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
 
Advanced Oozie
Advanced OozieAdvanced Oozie
Advanced Oozie
 
October 2014 HUG : Oozie HA
October 2014 HUG : Oozie HAOctober 2014 HUG : Oozie HA
October 2014 HUG : Oozie HA
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Dmp hadoop getting_start
Dmp hadoop getting_startDmp hadoop getting_start
Dmp hadoop getting_start
 

Similar to Everything you wanted to know, but were afraid to ask about Oozie

Similar to Everything you wanted to know, but were afraid to ask about Oozie (20)

Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
 
A Tale of a Server Architecture (Frozen Rails 2012)
A Tale of a Server Architecture (Frozen Rails 2012)A Tale of a Server Architecture (Frozen Rails 2012)
A Tale of a Server Architecture (Frozen Rails 2012)
 
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
 
Apache Oozie
Apache OozieApache Oozie
Apache Oozie
 
Apache Oozie.pptx
Apache Oozie.pptxApache Oozie.pptx
Apache Oozie.pptx
 
Hadoop Oozie
Hadoop OozieHadoop Oozie
Hadoop Oozie
 
oozieee.pdf
oozieee.pdfoozieee.pdf
oozieee.pdf
 
Hbase coprocessor with Oozie WF referencing 3rd Party jars
Hbase coprocessor with Oozie WF referencing 3rd Party jarsHbase coprocessor with Oozie WF referencing 3rd Party jars
Hbase coprocessor with Oozie WF referencing 3rd Party jars
 
Oozie Summit 2011
Oozie Summit 2011Oozie Summit 2011
Oozie Summit 2011
 
Apache Oozie Workflow Scheduler - Module 10
Apache Oozie Workflow Scheduler - Module 10Apache Oozie Workflow Scheduler - Module 10
Apache Oozie Workflow Scheduler - Module 10
 
Oozie Hug May 2011
Oozie Hug May 2011Oozie Hug May 2011
Oozie Hug May 2011
 
Introducing spring
Introducing springIntroducing spring
Introducing spring
 
WORKS 11 Presentation
WORKS 11 PresentationWORKS 11 Presentation
WORKS 11 Presentation
 
F03-Cloud-Obiwee
F03-Cloud-ObiweeF03-Cloud-Obiwee
F03-Cloud-Obiwee
 
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Status update OEG - Nov 2012
Status update OEG - Nov 2012Status update OEG - Nov 2012
Status update OEG - Nov 2012
 
BPMS1
BPMS1BPMS1
BPMS1
 
BPMS1
BPMS1BPMS1
BPMS1
 

More from Chicago Hadoop Users Group

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Chicago Hadoop Users Group
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your Business
Chicago Hadoop Users Group
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
Chicago Hadoop Users Group
 

More from Chicago Hadoop Users Group (18)

Kinetica master chug_9.12
Kinetica master chug_9.12Kinetica master chug_9.12
Kinetica master chug_9.12
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your Business
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Financial Data Analytics with Hadoop
Financial Data Analytics with HadoopFinancial Data Analytics with Hadoop
Financial Data Analytics with Hadoop
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Everything you wanted to know, but were afraid to ask about Oozie

  • 1. Everything that you ever wanted to know about Oozie, but were afraid to ask (B Lublinsky, A Yakubovich)
  • 2. Apache Oozie
      • Oozie is a workflow/coordination system to manage Apache Hadoop jobs.
      • A single Oozie server implements all four functional Oozie components:
          – Oozie workflow
          – Oozie coordinator
          – Oozie bundle
          – Oozie SLA
  • 3. Main components
      [Architecture diagram. Labels: Oozie Server (Bundle, Coordinator, workflow with actions and wf logic); time condition monitoring; data condition monitoring; job submission and monitoring; definitions, states; 3rd party application; WS API; Oozie Command Line Interface; Oozie shared libraries; HDFS; Hadoop (MapReduce, Data)]
  • 5. Workflow Language
      Flow-control node | XML element type | Description
      Decision | workflow:DECISION | expresses “switch-case” logic
      Fork | workflow:FORK | splits one path of execution into multiple concurrent paths
      Join | workflow:JOIN | waits until every concurrent execution path of a previous fork node arrives at it
      Kill | workflow:KILL | forces a workflow job to kill (abort) itself

      Action node | XML element type | Description
      java | workflow:JAVA | invokes the main() method of the specified java class
      fs | workflow:FS | manipulates files and directories in HDFS; supports commands: move, delete, mkdir
      MapReduce | workflow:MAP-REDUCE | starts a Hadoop map/reduce job, which can be a java MR job, a streaming job or a pipes job
      Pig | workflow:PIG | runs a Pig job
      Sub workflow | workflow:SUB-WORKFLOW | runs a child workflow job
      Hive * | workflow:HIVE | runs a Hive job
      Shell * | workflow:SHELL | runs a Shell command
      ssh * | workflow:SSH | starts a shell command on a remote machine as a remote secure shell
      Sqoop * | workflow:SQOOP | runs a Sqoop job
      Email * | workflow:EMAIL | sends emails from an Oozie workflow application
      Distcp | ? | Under development (Yahoo)
  • 6. Workflow actions
      • Oozie workflow supports two types of actions:
          – Synchronous, executed inside the Oozie runtime
          – Asynchronous, executed as a Map Reduce job
      • [Sequence diagram. Participants: ActionStartCommand, WorkflowStore, Services, ActionExecutorContext, JavaActionExecutor, JobClient. Calls in order:]
          1: workflow := getWorkflow()
          2: action := getAction()
          3: context := init<>()
          4: executor := get()
          5: start()
          6: submitLauncher()
          7: jobClient := get()
          8: runningJob := submit()
          9: setStartData()
  • 7. Workflow lifecycle
      [State diagram. States: PREP, KILLED, RUNNING, FAILED, SUSPENDED, SUCCEEDED]
  • 9. Extending Oozie workflow
      • Oozie provides a “minimal” workflow language, which contains only a handful of control and action nodes.
      • Oozie supports an elegant extensibility mechanism: custom action nodes, which extend Oozie's language with additional actions (verbs).
      • Creating a custom action requires the following:
          – A Java action implementation that extends the ActionExecutor class.
          – An XML schema for the action, defining its configuration parameters.
          – Packaging of the Java implementation and configuration schema into an action jar, which has to be added to the Oozie war.
          – An extension of oozie-site.xml to register the custom executor with the Oozie runtime.
  • 10. Oozie Workflow Client
      • Oozie integrates with enterprise applications through its client APIs, of which there are two types.
      • REST HTTP API, a set of HTTP requests:
          – Info requests (job status, job configuration)
          – Job management (submit, start, suspend, resume, kill)
          – Example, a job definition info request: GET /oozie/v0/job/job-ID?show=definition
      • Java API, package org.apache.oozie.client:
          – OozieClient: start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
          – WorkflowJob, WorkflowAction
          – CoordinatorJob, CoordinatorAction
          – SLAEvent
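The job definition info request from slide 10 could be built roughly as follows. This is a minimal sketch using the JDK's own `java.net.http` classes (Java 11+) rather than the Oozie client library; the class name `OozieInfoRequests`, the base URL, and the job ID used in the usage note are illustrative assumptions, and the job ID is assumed to need no URL encoding.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustrative helper, not part of the Oozie client library. */
final class OozieInfoRequests {

    private OozieInfoRequests() {}

    /**
     * Builds (but does not send) the request
     * GET {oozieBaseUrl}/v0/job/{jobId}?show=definition
     * where oozieBaseUrl is assumed to end in "/oozie" without a trailing slash.
     */
    static HttpRequest jobDefinitionRequest(String oozieBaseUrl, String jobId) {
        URI uri = URI.create(oozieBaseUrl + "/v0/job/" + jobId + "?show=definition");
        return HttpRequest.newBuilder(uri).GET().build();
    }
}
```

A caller would pass something like `jobDefinitionRequest("http://oozie.example.com:11000/oozie", jobId)` (a hypothetical host) and send the result with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.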
  • 11. Oozie workflow good, bad and ugly
      • Good
          – Nice integration with the Hadoop ecosystem, making it easy to build processes that synchronize the execution of multiple Map Reduce, Hive, Pig, etc. jobs
          – Nice UI for tracking execution progress
          – Simple APIs for integration with other applications
          – Simple extensibility APIs
      • Bad
          – The process has to be expressed directly in hPDL, with no visual support
          – No support for uber jars (but we added our own)
      • Ugly
          – Static forking (but you can regenerate the workflow and invoke it on the fly)
          – No support for loops
  • 13. Coordinator language
      Element type | Description | Attributes and sub-elements
      coordinator-app | top-level element in a coordinator instance | frequency, start, end
      controls | specify the execution policy for the coordinator and its elements (workflow actions) | timeout (actions), concurrency (actions), execution order (workflow instances)
      action | required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances | workflow name
      datasets | collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances |
      input event | specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action |
      output event | specifies the dataset that should be produced by the coordinator action |
  • 16. Bundle lifecycle
      [State diagram. States: PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, KILLED, SUSPENDED, FAILED, PAUSED, SUCCEEDED]
  • 18. SLA Navigation
      [Table relationship diagram. Tables and the columns shown:]
          – COORD_JOBS: id, app_name, app_path, …
          – WF_JOBS: id, app_name, app_path, …
          – SLA_EVENT: event_id, alert_contact, alert-frequency, …, sla_id, ...
          – COORD_ACTIONS: id, action_number, action_xml, …, external_id, ...
          – WF_ACTIONS: id, conf, console_url, …
  • 19.
  • 20. Using Probes to analyze/monitor Places
      • Select probe data for specified time/location
      • Validate, filter and transform probe data
      • Calculate statistics on available probe data
      • Distribute data per geo-tiles
      • Calculate place statistics (e.g. attendance index)
      • If an exception condition happens, report failure; if all steps succeeded, report success
  • 24. Oozie tips and tricks
  • 25. Configuring workflow
      • Oozie provides three overlapping mechanisms to configure a workflow: config-default.xml, the job properties file, and job arguments passed to Oozie as part of the command-line invocation.
      • Oozie processes these three sets of parameters as follows:
          – Use all of the parameters from the command-line invocation
          – For the remaining unresolved parameters, use the job config (the job properties file)
          – Use config-default.xml for everything else
      • Although the documentation does not clearly describe when to use which, the overall recommendation is as follows:
          – Use config-default.xml for parameters that never change for a given workflow
          – Use the job properties file for parameters that are common to a given deployment of a workflow
          – Use command-line arguments for parameters that are specific to a given workflow invocation
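The precedence described on slide 25 can be pictured as a simple layered merge. The sketch below only illustrates that ordering with `java.util.Properties`; it is not Oozie's internal resolution code, and the class name `WorkflowConfigResolver` and the parameter names in the usage note are hypothetical.

```java
import java.util.Properties;

/** Illustrative only: mimics the precedence described on the slide, not Oozie internals. */
final class WorkflowConfigResolver {

    private WorkflowConfigResolver() {}

    /**
     * Merges the three parameter sets so that later sources override earlier ones:
     * config-default.xml < job properties file < command-line arguments.
     */
    static Properties resolve(Properties configDefault,
                              Properties jobProperties,
                              Properties commandLine) {
        Properties resolved = new Properties();
        resolved.putAll(configDefault);  // lowest priority: values that never change
        resolved.putAll(jobProperties);  // overrides defaults: per-deployment values
        resolved.putAll(commandLine);    // highest priority: per-invocation values
        return resolved;
    }
}
```

For example, if `queueName` is set in all three sources, the command-line value wins, while a `nameNode` set only in config-default.xml is still visible in the result.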
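  • 26. Accessing and storing process variables
      • Accessing: through the arguments in java main
      • Storing: write the values to the properties file named by the oozie.action.output.properties system property
          // needs java.io.File, java.io.FileOutputStream, java.io.OutputStream, java.util.Properties
          // Oozie passes the location of the output file as a system property
          String ooziePropFileName = System.getProperty("oozie.action.output.properties");
          Properties props = new Properties();
          props.setProperty(key, value);
          // try-with-resources closes the stream even if store() throws
          try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
              props.store(os, "");
          }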
  • 27. Validating data presence
      • Oozie provides two possible approaches for validating the presence of resource files:
          – Oozie coordinator input events based on a dataset. Technically this is the simplest approach, but it offers no support for more complex decisions: it either runs the corresponding workflow or does not.
          – A custom java node inside the Oozie workflow. This allows extending the decision logic, for example by sending notifications about data absence or running on partial data under certain timing conditions.
      • Additional configuration parameters for the Oozie coordinator, for example the ability to wait for files to arrive, can expand its usage.
  • 28. Invoking Map Reduce jobs
      • Oozie provides two different ways of invoking a Map Reduce job: the MapReduce action and the java action.
      • Invoking a Map Reduce job with the java action is similar to invoking it with the Hadoop command line from the edge node. You specify a driver as the class for the java action, and Oozie invokes the driver. This approach has two main advantages:
          – The same driver class can be used both for running the Map Reduce job from an edge node and as a java action in an Oozie process.
          – A driver provides a convenient place for executing additional code, for example clean-up required for Map Reduce execution.
      • The driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs.
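The shutdown hook mentioned on slide 28 might look like the sketch below. To stay self-contained it does not use Hadoop classes: `KillableJob` is a hypothetical stand-in for whatever job handle the driver holds, and `DriverShutdownHook` is an illustrative name. A real driver would adapt its Hadoop job handle to this interface and deal with the checked exceptions involved.

```java
/** Illustrative only: registers a JVM shutdown hook that kills an unfinished job. */
final class DriverShutdownHook {

    /** Hypothetical stand-in for a submitted Map Reduce job handle; not a Hadoop type. */
    interface KillableJob {
        boolean isComplete();
        void kill();
    }

    private DriverShutdownHook() {}

    /**
     * Registers a hook that runs when the driver JVM shuts down (for example when
     * the launcher is killed) and kills the job if it has not completed yet.
     * Returns the hook thread so the caller can remove it after a normal finish.
     */
    static Thread register(KillableJob job) {
        Thread hook = new Thread(() -> {
            if (!job.isComplete()) {
                job.kill(); // avoid leaving a lingering job on the cluster
            }
        }, "mr-driver-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Once the job finishes normally, the driver can call `Runtime.getRuntime().removeShutdownHook(hook)` so the hook does not run at exit.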
  • 29. Implementing predefined looping and forking
      • hPDL is an XML document with a well-defined schema.
      • The actual workflow can therefore be easily manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.
      • This means that the complete workflow can be created programmatically, based on a calculated number of fork branches, or with loops implemented as repeated actions.
      • The other option is to create a template process and modify it based on calculated parameters.
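As a rough illustration of generating a workflow with a calculated number of fork branches, the sketch below emits hPDL by plain string building. This deliberately swaps out the JAXB approach named on the slide, since the xjc-generated classes are not available here. The class name `ForkWorkflowGenerator`, the node names (`fork-node`, `branch-N`, `join-node`, `fail`, `end`), the `${jobTracker}` and `${nameNode}` parameters, and the `uri:oozie:workflow:0.2` schema version are all assumptions made for the example, and the inputs are assumed to contain no XML special characters.

```java
/** Illustrative only: string-based hPDL generation in place of JAXB-generated classes. */
final class ForkWorkflowGenerator {

    private ForkWorkflowGenerator() {}

    /**
     * Builds an hPDL document with the given number of parallel java actions
     * between a fork and a join. Each branch receives its index as an argument.
     * appName and mainClass are inserted unescaped (assumed XML-safe).
     */
    static String generate(String appName, String mainClass, int branches) {
        if (branches < 2) {
            // a fork is only meaningful with two or more concurrent paths
            throw new IllegalArgumentException("a fork needs at least two branches");
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<workflow-app xmlns=\"uri:oozie:workflow:0.2\" name=\"")
           .append(appName).append("\">\n");
        xml.append("  <start to=\"fork-node\"/>\n");
        xml.append("  <fork name=\"fork-node\">\n");
        for (int i = 0; i < branches; i++) {
            xml.append("    <path start=\"branch-").append(i).append("\"/>\n");
        }
        xml.append("  </fork>\n");
        for (int i = 0; i < branches; i++) {
            xml.append("  <action name=\"branch-").append(i).append("\">\n")
               .append("    <java>\n")
               .append("      <job-tracker>${jobTracker}</job-tracker>\n")
               .append("      <name-node>${nameNode}</name-node>\n")
               .append("      <main-class>").append(mainClass).append("</main-class>\n")
               .append("      <arg>").append(i).append("</arg>\n")
               .append("    </java>\n")
               .append("    <ok to=\"join-node\"/>\n")
               .append("    <error to=\"fail\"/>\n")
               .append("  </action>\n");
        }
        xml.append("  <join name=\"join-node\" to=\"end\"/>\n");
        xml.append("  <kill name=\"fail\">\n")
           .append("    <message>Branch failed</message>\n")
           .append("  </kill>\n");
        xml.append("  <end name=\"end\"/>\n");
        xml.append("</workflow-app>\n");
        return xml.toString();
    }
}
```

The generated text would then be written to the workflow application directory and submitted as usual. A JAXB-based version would build the same structure from typed objects and gain schema validation, at the cost of the code generation step.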
  • 30. Oozie client security (or lack of)
      • By default the Oozie client reads the client's identity from the local machine OS and passes it to the Oozie server, which uses this identity for MR job invocation.
      • Impersonation can be implemented by overriding the OozieClient class's createConfiguration method, where client variables can be set through a new constructor:
          public Properties createConfiguration() {
              Properties conf = new Properties();
              // "user" is set through the new constructor; fall back to the OS identity
              if (user == null)
                  conf.setProperty(USER_NAME, System.getProperty("user.name"));
              else
                  conf.setProperty(USER_NAME, user);
              return conf;
          }
  • 31. Uber jars with Oozie
      • [Diagram labels: Oozie server; java action (Launcher); uber jar (classes, jars, so, zip); mappers]
      • The uber jar contains resources: other jars, so libraries, zip files
      • Launcher steps shown: unpack resources to the current uber jar dir; set inverse classloader; invoke the MR driver and pass arguments; set shutdown hook; ‘wait for complete’
      • Action definition shown: <java> … <main-class>${wfUberLauncher}</main-class> <arg>-appStart=${wfAppMain}</arg> … </java>