Oozie: A Workflow Scheduling For Hadoop


                    Mohammad Islam
Presentation Workflow
What is Hadoop?

•  A framework for very large scale data
   processing in a distributed environment
•  The main ideas came from Google (the
   MapReduce and GFS papers)
•  Implemented at Yahoo! and open sourced
   to Apache.
•  Hadoop is free!
What is Hadoop? Contd.

•  Main components:
  –  HDFS
  –  Map-Reduce
•  Highly scalable, fault-tolerant system
•  Built on commodity hardware.
Yet Another WF System?
•  Workflow management is a mature field.
•  Many WF systems are already available.
•  An existing WF system could be adapted
   for Hadoop.
•  But no WF system was built specifically for Hadoop.
•  Hadoop has particular strengths and shortcomings.
•  Oozie was designed for Hadoop alone,
   with those characteristics in mind.
Oozie in Hadoop Eco-System

(diagram) Oozie sits at the top of the Hadoop stack, orchestrating Pig, Sqoop, and Hive (with HCatalog alongside), which in turn run on Map-Reduce over HDFS.
Oozie : The Conductor
A Workflow Engine
•  Oozie executes workflows defined as DAGs of jobs
•  Supported job types include Map-Reduce, Pig, Hive,
   arbitrary scripts, custom Java code, etc.
(example DAG, roughly: start → M/R job → fork into an M/R streaming job and a Pig job → join → decision — on MORE, run another M/R job and loop; on ENOUGH, proceed → FS job and Java action → end)
A Scheduler
•  Oozie executes workflows based on:
   –  Time dependency (frequency)
   –  Data dependency

(diagram) The Oozie client talks to the Oozie server through a WS API; inside the server, the Oozie Coordinator checks time and data availability and triggers the Oozie Workflow engine, which runs jobs on Hadoop.
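Both trigger types can be sketched in a few lines of Python (a conceptual model only — the function and names are illustrative, not Oozie's implementation):

```python
from datetime import datetime, timedelta

def materialize_actions(start, frequency_minutes, count, data_ready):
    """One coordinator action per frequency tick; a workflow is only
    triggered once its input data is available."""
    triggered = []
    for i in range(count):
        nominal_time = start + timedelta(minutes=i * frequency_minutes)
        if data_ready(nominal_time):        # data dependency check
            triggered.append(nominal_time)  # would submit a workflow here
        # (a real coordinator keeps waiting for late data; simplified here)
    return triggered

start = datetime(2012, 1, 1, 0, 0)
# pretend the data for the 00:30 tick has not landed yet
ready = lambda t: t != datetime(2012, 1, 1, 0, 30)
runs = materialize_actions(start, 15, 4, ready)
# ticks 00:00, 00:15 and 00:45 trigger; 00:30 is held back
```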
Bundle
•  A new abstraction layer on top of the Coordinator.
•  Users can define and execute a set of coordinator
   applications together.
•  Bundles are optional.

(diagram) The same server, hosted in Tomcat: the WS API feeds the Bundle layer, which drives Coordinators, which drive Workflows, which run jobs on Hadoop; the coordinator layer checks data availability.
Oozie Abstraction Layers
Layer 1: a Bundle contains one or more Coordinator jobs.
Layer 2: each Coordinator job materializes Coordinator actions, and each action runs a Workflow job.
Layer 3: each Workflow job executes the actual Hadoop jobs (M/R, Pig, ...).
Access to Oozie Service

•  Four different ways:
  – Using Oozie CLI client
  – Java API
  – REST API
  – Web Interface (read-only)
Installing Oozie

Step 1: Download the Oozie tarball
curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz

Step 2: Unpack the tarball
tar -xzvf <PATH_TO_OOZIE_TAR>

Step 3: Run the setup script
bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip

Step 4: Start Oozie
bin/oozie-start.sh

Step 5: Check the status of Oozie
bin/oozie admin -oozie http://localhost:11000/oozie -status
Running an Example

•  Standalone Map-Reduce job
$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir


•  Using Oozie


Example DAG: Start → MapReduce "wordcount" → on OK → End; on ERROR → Kill

The corresponding workflow.xml outline:
<workflow-app name=…>
  <start …/>
  <action>
    <map-reduce>
    ……
  </action>
  ……
</workflow-app>
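Filled out, a minimal but complete workflow for this DAG might look like the following sketch (element names follow the Oozie workflow schema; exact attributes vary by schema version):

```xml
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <!-- job-tracker, name-node and configuration elements go here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>wordcount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```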
Example Workflow
<action name='wordcount'>
   <map-reduce>
      <configuration>
         <property>
            <name>mapred.mapper.class</name>
            <value>org.myorg.WordCount.Map</value>
         </property>
         <property>
            <name>mapred.reducer.class</name>
            <value>org.myorg.WordCount.Reduce</value>
         </property>
         <property>
            <name>mapred.input.dir</name>
            <value>/usr/joe/inputDir</value>
         </property>
         <property>
            <name>mapred.output.dir</name>
            <value>/usr/joe/outputDir</value>
         </property>
      </configuration>
   </map-reduce>
</action>
A Workflow Application
Three components make up a workflow application:

1) workflow.xml:
   Contains the job definition

2) Libraries:
   An optional 'lib/' directory containing .jar/.so files

3) Properties file:
•  Parameterizes the workflow XML
•  The one mandatory property is oozie.wf.application.path
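For this workflow, a minimal job.properties might look like the following (host names are illustrative; nameNode and jobTracker are user-defined parameters referenced from the workflow XML):

```properties
nameNode=hdfs://bar.com:9000
jobTracker=foo.com:9001
oozie.wf.application.path=${nameNode}/usr/abc/wf_job
```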
Workflow Submission
Deploy Workflow to HDFS

  $ hadoop fs -put wf_job hdfs://bar.com:9000/usr/abc/wf_job

Run Workflow Job

  $ oozie job -run -config job.properties -oozie http://localhost:11000/oozie/
  Workflow ID: 00123-123456-oozie-wrkf-W

Check Workflow Job Status

  $ oozie job -info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/
Oozie Web Console
(screenshot)

Oozie Web Console: Job Details
(screenshot)

Oozie Web Console: Action Details
(screenshot)

Hadoop Job Details
(screenshot)
Use Case : Time Triggers
•  Execute your workflow every 15 minutes
   (CRON)

   (timeline: 00:15 → 00:30 → 00:45 → 01:00 → …)
Run Workflow every 15 mins
(screenshot)
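Expressed as a coordinator application, the 15-minute trigger might look roughly like this sketch (start/end dates and paths are illustrative):

```xml
<coordinator-app name="cron-coord" frequency="${coord:minutes(15)}"
                 start="2012-01-01T00:15Z" end="2012-01-02T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar.com:9000/usr/abc/wf_job</app-path>
    </workflow>
  </action>
</coordinator-app>
```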
Use Case: Time and Data Triggers
•  Execute your workflow every hour, but only run
   it when the input data is ready.

   (timeline: 01:00 → 02:00 → 03:00 → 04:00; each run first checks whether the input data exists on Hadoop)
Data Triggers
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
   <datasets>
      <dataset name="logs" frequency="${1*HOURS}"
               initial-instance="2009-01-01T23:59Z">
         <uri-template>
            hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}
         </uri-template>
      </dataset>
   </datasets>
   <input-events>
      <data-in name="inputLogs" dataset="logs">
         <instance>${current(0)}</instance>
      </data-in>
   </input-events>
</coordinator-app>
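How a dataset instance such as ${current(0)} resolves to a concrete HDFS path can be illustrated in Python (a conceptual model of the template substitution, not Oozie's code; resolve_uri is an illustrative name):

```python
from datetime import datetime

def resolve_uri(template, instance_time):
    """Substitute ${YEAR}/${MONTH}/${DAY}/${HOUR} for one dataset instance."""
    return (template
            .replace("${YEAR}", f"{instance_time.year:04d}")
            .replace("${MONTH}", f"{instance_time.month:02d}")
            .replace("${DAY}", f"{instance_time.day:02d}")
            .replace("${HOUR}", f"{instance_time.hour:02d}"))

template = "hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}"
# current(0): the dataset instance whose nominal time matches the action's own
uri = resolve_uri(template, datetime(2009, 1, 2, 5))
# -> hdfs://bar:9000/app/logs/2009/01/02/05
```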
Running a Coordinator Job
Application Deployment
$ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job

Coordinator Job Parameters:
$ cat job.properties
 oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job

Job Submission
$ oozie job -run -config job.properties
  job: 000001-20090525161321-oozie-xyz-C
Debugging a Coordinator Job
Coordinator Job Information
$ oozie job -info 000001-20090525161321-oozie-xyz-C
 Job Name : wordcount-coord
 App Path : hdfs://bar.com:9000/usr/abc/coord_job
 Status   : RUNNING

Coordinator Job Log
$ oozie job -log 000001-20090525161321-oozie-xyz-C

Coordinator Job Definition
$ oozie job -definition 000001-20090525161321-oozie-xyz-C
Three Questions …
 Do you need Oozie?


Q1 : Do you have multiple jobs with
     dependencies?
Q2 : Do your jobs start based on time or data
     availability?
Q3 : Do you need monitoring and operational
     support for your jobs?
   If any one of your answers is YES,
   then you should consider Oozie!
What Oozie is NOT

•  Oozie is not a resource scheduler

•  Oozie is not for off-grid scheduling
   –  Note: off-grid execution is possible through the
      SSH action

•  If you only submit jobs occasionally,
   Oozie is NOT a must
   –  Though Oozie does provide REST API based submission
Oozie in Apache
Main Contributors
Oozie in Apache

•  Y! internal usage:
  –  Total number of users: 375
  –  Total number of processed jobs ≈ 750K/month
•  External downloads:
  –  2500+ in the last year from GitHub
  –  A large number of downloads via 3rd-party
     packaging.
Oozie Usage Contd.

•  User community:
  –  Membership
    •  Y! internal – 286
    •  External – 163
  –  Number of messages (approximate)
    •  Y! internal – 7/day
    •  External – 10+/day
Oozie: Towards a Scalable Workflow
 Management System for Hadoop

                     Mohammad Islam
Key Features and Design Decisions
•  Multi-tenant
•  Security
  –  Authenticate every request
  –  Pass appropriate token to Hadoop job
•  Scalability
  –  Vertical: Add extra memory/disk
  –  Horizontal: Add machines
Oozie Job Processing

(diagram) The end user submits a job to the Oozie server; the Oozie server accesses Hadoop over a secure channel, authenticating via Kerberos.
Oozie-Hadoop Security

 •    Oozie is a multi-tenant system
 •    Jobs can be scheduled to run later
 •    Oozie submits and maintains the Hadoop jobs
 •    Hadoop needs a security token for each
      request

Question: Who should provide the security
token to Hadoop, and how?
Oozie-Hadoop Security Contd.

•  Answer: Oozie
•  How?
  – Hadoop considers Oozie a super-user
  – Hadoop does not check the end user's
    credential
  – Hadoop only checks the credential of the
    Oozie process

•  BUT the Hadoop job is executed as the end user.
•  Oozie utilizes Hadoop's doAs() (proxy-user) functionality.
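The proxy-user model can be illustrated with a toy sketch (this models the idea only; it is not Hadoop's API — TRUSTED_PROXIES and submit are made-up names):

```python
# Toy model of Hadoop's proxy-user check -- not Hadoop's real API.
TRUSTED_PROXIES = {"oozie"}  # super-users Hadoop is configured to trust

def submit(job, caller, run_as):
    """Hadoop verifies only the caller's credential; the job itself
    is owned and executed as run_as (the end user)."""
    if caller not in TRUSTED_PROXIES:
        raise PermissionError(f"{caller} may not impersonate {run_as}")
    return {"job": job, "owner": run_as}

submitted = submit("wordcount", caller="oozie", run_as="joe")
# submitted["owner"] == "joe": the job runs as the end user,
# even though the Oozie process was the one authenticated
```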
User-Oozie Security

(diagram as above, now highlighting the secure channel between the end user and the Oozie server)
Why Oozie Security?

•  One user should not be able to modify another
   user's job
•  Hadoop doesn't authenticate the end user
•  Oozie has to verify the user before passing
   the job to Hadoop
How does Oozie Support Security?

•  Built-in authentication
  –  Kerberos
  –  Non-secured (default)
•  Design Decision
  –  Pluggable authentication
  –  Easy to add new types of authentication
  –  Yahoo supports 3 types of authentication.
Job Submission to Hadoop

•  Oozie is designed to handle thousands of
   jobs at the same time

•  Question: Should the Oozie server
  –  Submit the Hadoop job directly?
  –  Wait for it to finish?

•  Answer: No
Job Submission Contd.
•  Reason
  –  Resource constraints: a single Oozie process
     can't create thousands of simultaneous threads,
     one per Hadoop job (a scaling limitation)
  –  Isolation: running user code on the Oozie server
     might destabilize Oozie
•  Design decision
  –  Create a "launcher" Hadoop job
  –  Execute the actual user job from the launcher.
  –  Wait asynchronously for the job to finish.
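The launcher pattern can be sketched in Python (conceptual only — all class and function names are illustrative, and FakeCluster stands in for Hadoop):

```python
# Sketch of Oozie's launcher pattern (conceptual; names illustrative).
class FakeCluster:
    """Stand-in for Hadoop: tracks submitted jobs and their states."""
    def __init__(self):
        self.jobs = {}
    def submit(self, job):
        job_id = f"job_{len(self.jobs)}"
        self.jobs[job_id] = "SUCCEEDED"   # pretend it finishes instantly
        return job_id
    def status(self, job_id):
        return self.jobs[job_id]

def launcher_mapper(user_job, cluster):
    # Runs as a single map task on the cluster, in its own JVM, so user
    # code never executes inside the Oozie server process.
    return cluster.submit(user_job)

class OozieServer:
    def __init__(self, cluster):
        self.cluster, self.pending = cluster, {}
    def start_action(self, name, user_job):
        # Submit the launcher and return immediately -- no blocked thread.
        self.pending[name] = launcher_mapper(user_job, self.cluster)
    def poll(self, name):
        # Completion is noticed asynchronously, e.g. by periodic polling.
        return self.cluster.status(self.pending[name])

server = OozieServer(FakeCluster())
server.start_action("wordcount", user_job="wordcount.jar")
state = server.poll("wordcount")
```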
Job Submission to Hadoop

(diagram, roughly: 1) the Oozie server submits a launcher job to the JobTracker; 2) the launcher starts as a single mapper on the cluster; 3) the launcher mapper submits the actual M/R job; 4) the actual job runs; 5) completion is reported back through the JobTracker)
Job Submission Contd.

•  Advantages
  –  Horizontal scalability: if load increases, add
     machines to the Hadoop cluster
  –  Stability: isolation of user code from the system
     process
•  Disadvantages
  –  An extra map slot is occupied by each job.
Production Setup

•    Total number of nodes: 42K+
•    Total number of clusters: 25+
•    Data presented from two clusters
•    Each of them has nearly 4K nodes
•    Total number of users per cluster ≈ 50
Oozie Usage Pattern @ Y!
(bar chart: distribution of job types — fs, java, map-reduce, pig — as a percentage of jobs on two production clusters)
Experimental Setup

•    Number of nodes: 7
•    Number of map slots: 28
•    4 cores, 16 GB RAM
•    64-bit RHEL
•    Oozie server
     –  3 GB RAM
     –  Internal queue size = 10K
     –  Number of worker threads = 300
Job Acceptance
(chart: workflows accepted per minute vs. number of submission threads, from 2 to 640 threads)

Observation: Oozie can accept a large number of jobs
Time Line of an Oozie Job

(timeline: user submits job → Oozie submits it to Hadoop → job completes at Hadoop → job completes at Oozie; the first interval is the preparation overhead, the last the completion overhead)

Total Oozie overhead = preparation + completion
Oozie Overhead
(chart: per-action overhead in milliseconds for workflows with 1, 5, 10, and 50 actions)

Observation: Oozie overhead is lower when multiple
actions are in the same workflow.
Future Work

•  Scalability
  •  Hot-hot / load-balanced service
  •  Replace the SQL DB with ZooKeeper
•  Event-based data processing
  •  Asynchronous data processing
  •  User-defined events
•  Extend the benchmarking scope
•  Monitoring WS API
Take Away ..

•  Oozie is
  – an Apache project
  – scalable, reliable, and multi-tenant
  – deployed in production and growing
    fast.
Q&A




                          Mohammad K Islam
                      kamrul@yahoo-inc.com
  http://incubator.apache.org/oozie/
Backup Slides
Coordinator Application Lifecycle

(diagram: between the job's start and end times, the coordinator engine creates one action per frequency tick — at 0*f, 1*f, 2*f, …, N*f. When an action starts, the workflow engine runs the corresponding workflow, itself a DAG of actions)
Vertical Scalability

•  Oozie processes user requests asynchronously.
•  Memory resident:
  –  an internal queue that stores sub-tasks
  –  a thread pool that processes the sub-tasks
•  Both are
  –  fixed in size
  –  unchanged by load variations
  –  configurable if needed (which may require
     extra memory)
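The fixed-size queue plus worker pool can be sketched in Python (an illustrative model of the pattern, not Oozie's code; sizes here are shrunk for the demo):

```python
# Sketch of a fixed-size internal queue plus worker-thread pool.
import queue
import threading

QUEUE_SIZE = 10_000     # fixed; configured, not driven by load
tasks = queue.Queue(maxsize=QUEUE_SIZE)   # put() blocks when full
results = []
lock = threading.Lock()

def worker():
    while True:
        task = tasks.get()
        if task is None:          # poison pill: shut this worker down
            break
        with lock:
            results.append(task * 2)   # stand-in for processing a sub-task
        tasks.task_done()

# a real deployment would use e.g. 300 workers; 4 suffice for the demo
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for i in range(100):
    tasks.put(i)
tasks.join()                      # wait until every sub-task is processed
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()
# queue capacity and pool size never grew with the load
```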
Challenges in Scalability

•  Centralized persistent storage
  –  Currently supports any SQL DB (such as
     Oracle, MySQL, Derby, etc.)
  –  Each DBMS has its own limitations
  –  Oozie's scalability is limited by the underlying
     DBMS
  –  Using ZooKeeper might be an option
Oozie Workflow Application
•  Contents
  –  A workflow.xml file
  –  Resource files, config files and Pig scripts
  –  All necessary JAR and native library files

•  Parameters
  –  The workflow.xml is parameterized; parameters
     can be propagated to map-reduce, pig & ssh jobs

•  Deployment
  –  In a directory in the HDFS of the Hadoop cluster
     where the Hadoop & Pig jobs will run
Running a Workflow Job (oozie cmd)

Workflow Application Deployment
$ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf
$ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf/lib
$ hadoop fs -copyFromLocal workflow.xml wordcount.xml hdfs://usr/tucu/wordcount-wf
$ hadoop fs -copyFromLocal hadoop-examples.jar hdfs://usr/tucu/wordcount-wf/lib

Workflow Job Execution
$ oozie run -o http://foo.corp:8080/oozie \
    -a hdfs://bar.corp:9000/usr/tucu/wordcount-wf \
    input=/data/2008/input output=/data/2008/output
 Workflow job id [1234567890-wordcount-wf]

Workflow Job Status
$ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf
 Workflow job status [RUNNING]
 ...
Big Features (1/2)

•  Integration with Hadoop 0.23
•  Event-based data processing (3.3)
  •  Asynchronous Data processing (3.3)
  •  User-defined event (Next)
•  HCatalog integration (3.3)
  –  Non-polling approach
Big Features (2/2)

•  DAG AM (Next)
  –  WF will run as an AM on the Hadoop cluster
  –  Higher scalability
•  Extend the coordinator's scope (3.3)
  –  Currently supports only WF
  –  Any job type such as pig, hive, MR, java, distcp
     can be scheduled without a WF.
•  Dataflow-based processing in Hadoop (Next)
  –  Create an abstraction to make things easier.
Usability (1/2)

•  Easy adoption
  –  Modeling tool (3.3)
  –  IDE integration (3.3)
  –  Modular Configurations (3.3)
•  Monitoring API (3.3)
•  Improved Oozie application management
   (next+)
  –  Automate uploading the application to HDFS
  –  Application versioning
  –  Seamless upgrades
Usability (2/2)

•  Shell and Distcp Action
•  Improved UI for coordinator
•  Mini-Oozie for CI
  –  Like Mini-cluster
•  Support multiple versions (3.3)
  –  Pig, Distcp, Hive etc.
•  Allow job notification through JMS (Next ++)
•  Prioritization (Next++)
  –  By user, system level.
Reliability

•  Auto-Retry in WF Action level

•  High-Availability
  –  Hot-Warm through ZooKeeper (3.3)
  –  Hot-Hot with Load Balancing (Next)
Manageability

•  Email action

•  Query Pig Stats/Hadoop Counters
  –  Runtime control of Workflow based on stats
  –  Application-level control using the stats
REST-API for Hadoop Components

•  Direct access to Hadoop components
  –  Emulates the command line through REST
     API.
•  Supported Products:
  –  Pig
  –  Map Reduce

De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
 
Cloud Best Practices
Cloud Best PracticesCloud Best Practices
Cloud Best Practices
 
oozieee.pdf
oozieee.pdfoozieee.pdf
oozieee.pdf
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
 
Nov 2011 HUG: Oozie
Nov 2011 HUG: Oozie Nov 2011 HUG: Oozie
Nov 2011 HUG: Oozie
 
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
 
Oozie hugnov11
Oozie hugnov11Oozie hugnov11
Oozie hugnov11
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Everything ruby
Everything rubyEverything ruby
Everything ruby
 
A Tale of a Server Architecture (Frozen Rails 2012)
A Tale of a Server Architecture (Frozen Rails 2012)A Tale of a Server Architecture (Frozen Rails 2012)
A Tale of a Server Architecture (Frozen Rails 2012)
 
Connecting the Worlds of Java and Ruby with JRuby
Connecting the Worlds of Java and Ruby with JRubyConnecting the Worlds of Java and Ruby with JRuby
Connecting the Worlds of Java and Ruby with JRuby
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Oozie sweet

  • 1. Oozie: A Workflow Scheduling For Hadoop Mohammad Islam
  • 3. What is Hadoop? •  A framework for very large-scale data processing in distributed environments •  The main idea came from Google •  Implemented at Yahoo! and open-sourced to Apache •  Hadoop is free!
  • 4. What is Hadoop? Contd. •  Main components: –  HDFS –  Map-Reduce •  Highly scalable, fault-tolerant system •  Built on commodity hardware.
  • 5. Yet Another WF System? •  Workflow management is a mature field •  Many WF systems are already available •  An existing WF system could be adapted for Hadoop •  But no WF system was built for Hadoop •  Hadoop has particular strengths and shortcomings •  Oozie was designed specifically for Hadoop with those features in mind
  • 6. Oozie in Hadoop Eco-System [Stack diagram: Oozie sits above Pig, Sqoop, Hive and HCatalog, which run on Map-Reduce over HDFS]
  • 7. Oozie: The Conductor
  • 8. A Workflow Engine •  Oozie executes workflows defined as a DAG of jobs •  Supported job types include Map-Reduce, Pig, Hive, arbitrary scripts, custom Java code, etc. [Example DAG: start, a fork running an M/R streaming job and a Pig job in parallel, a join, a decision node (MORE loops back, ENOUGH continues), and M/R, FS and Java actions before end]
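The fork/join semantics on this slide can be sketched as a topological walk over a small DAG modeled on the example; node names here are illustrative labels, not Oozie syntax:

```python
from collections import deque

# Hypothetical DAG modeled on the slide: fork runs two jobs in
# parallel; join waits for both before the workflow continues.
dag = {
    "start": ["mr-job"],
    "mr-job": ["fork"],
    "fork": ["mr-streaming-job", "pig-job"],
    "mr-streaming-job": ["join"],
    "pig-job": ["join"],
    "join": ["end"],
    "end": [],
}

def execution_order(dag, start="start"):
    """Kahn-style topological order: a node runs only after all its parents."""
    indegree = {n: 0 for n in dag}
    for node, children in dag.items():
        for c in children:
            indegree[c] += 1
    ready = deque([start])
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for c in dag[node]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

order = execution_order(dag)
```

Whatever order the two forked branches run in, the join node is only reached after both complete, which is exactly the constraint the fork/join pair expresses.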
  • 9. A Scheduler •  Oozie executes workflows based on: –  Time Dependency (Frequency) –  Data Dependency [Architecture: the Oozie client submits via the WS API to the Oozie server, where the Coordinator checks data availability and triggers Workflows that run on Hadoop]
  • 10. Bundle •  A new abstraction layer on top of Coordinator •  Lets users define and execute a set of coordinator applications •  Bundle is optional [Architecture: client → Tomcat-hosted Oozie server via WS API; Bundle → Coordinator (checks data availability) → Workflow → Hadoop]
  • 11. Oozie Abstraction Layers [Layer 1: Bundle. Layer 2: Coordinator jobs, each materializing coordinator actions. Layer 3: Workflow jobs started by coordinator actions, containing M/R and Pig jobs]
  • 12. Access to Oozie Service •  Four different ways: – Using Oozie CLI client – Java API – REST API – Web Interface (read-only)
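As a rough illustration of the REST route, a job submission is an HTTP POST of the job configuration XML to the server's jobs endpoint. The sketch below only builds the request so its shape is visible; the server URL and property values are assumptions taken from examples elsewhere in the deck:

```python
import urllib.request

OOZIE_URL = "http://localhost:11000/oozie"  # assumed server URL, per the install slide

# Minimal XML configuration payload; property names mirror job.properties.
config_xml = """<configuration>
  <property>
    <name>user.name</name><value>joe</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs://bar.com:9000/usr/abc/wf_job</value>
  </property>
</configuration>"""

# Build (but do not send) the submit-and-start request.
req = urllib.request.Request(
    OOZIE_URL + "/v1/jobs?action=start",
    data=config_xml.encode("utf-8"),
    headers={"Content-Type": "application/xml;charset=UTF-8"},
    method="POST",
)
```

The CLI and Java API go through this same WS API; the web interface is read-only.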
  • 13. Installing Oozie
     Step 1: Download the Oozie tarball
       curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz
     Step 2: Unpack the tarball
       tar -xzvf <PATH_TO_OOZIE_TAR>
     Step 3: Run the setup script
       bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip
     Step 4: Start Oozie
       bin/oozie-start.sh
     Step 5: Check the status of Oozie
       bin/oozie admin -oozie http://localhost:11000/oozie -status
  • 14. Running an Example
     •  Standalone Map-Reduce job:
       $ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir
     •  Using Oozie: [Example DAG: Start → wordcount (a map-reduce action) → End on OK, Kill on ERROR], defined in workflow.xml:
       <workflow-app name=...>
         <start .../>
         <action name="wordcount">
           <map-reduce> ... </map-reduce>
         </action>
         ...
       </workflow-app>
  • 15. Example Workflow
     <action name='wordcount'>
       <map-reduce>
         <configuration>
           <property>
             <name>mapred.mapper.class</name>
             <value>org.myorg.WordCount.Map</value>
           </property>
           <property>
             <name>mapred.reducer.class</name>
             <value>org.myorg.WordCount.Reduce</value>
           </property>
           <property>
             <name>mapred.input.dir</name>
             <value>/usr/joe/inputDir</value>
           </property>
           <property>
             <name>mapred.output.dir</name>
             <value>/usr/joe/outputDir</value>
           </property>
         </configuration>
       </map-reduce>
     </action>
  • 16. A Workflow Application Three components are required for a Workflow: 1)  workflow.xml: contains the job definition 2)  Libraries: an optional 'lib/' directory containing .jar/.so files 3)  Properties file: •  Parameterizes the workflow XML •  The mandatory property is oozie.wf.application.path
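The properties file is plain key=value lines. A minimal parser sketch (the file contents below are illustrative, following the deck's later examples) shows how the mandatory oozie.wf.application.path would be read:

```python
def parse_properties(text):
    """Parse Java-style .properties lines: '#' comments, key=value pairs."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# Illustrative job.properties content.
job_properties = """
# workflow application location (mandatory)
oozie.wf.application.path=hdfs://bar.com:9000/usr/abc/wf_job
inputDir=/usr/joe/inputDir
outputDir=/usr/joe/outputDir
"""

props = parse_properties(job_properties)
```

Every other key in the file is free-form and is substituted into the parameterized workflow.xml at run time.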
  • 17. Workflow Submission
     Deploy the Workflow to HDFS:
       $ hadoop fs -put wf_job hdfs://bar.com:9000/usr/abc/wf_job
     Run the Workflow Job:
       $ oozie job -run -config job.properties -oozie http://localhost:11000/oozie/
       Workflow ID: 00123-123456-oozie-wrkf-W
     Check the Workflow Job Status:
       $ oozie job -info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/
  • 19. Oozie Web Console: Job Details
  • 20. Oozie Web Console: Action Details
  • 22. Use Case: Time Triggers •  Execute your workflow every 15 minutes (CRON) [Timeline: 00:15 00:30 00:45 01:00]
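The 15-minute trigger amounts to materializing one coordinator action at every frequency boundary. The arithmetic can be sketched as below (start time and count are illustrative):

```python
from datetime import datetime, timedelta

def materialization_times(start, frequency, count):
    """Nominal action times: start + i * frequency for i in 0..count-1."""
    return [start + i * frequency for i in range(count)]

# Reproduce the slide's timeline: four actions, 15 minutes apart.
times = materialization_times(
    datetime(2012, 1, 1, 0, 15), timedelta(minutes=15), 4
)
labels = [t.strftime("%H:%M") for t in times]
```

Each of these nominal times becomes one coordinator action, which in turn starts one workflow run.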
  • 23. Run Workflow every 15 mins
  • 24. Use Case: Time and Data Triggers •  Execute your workflow every hour, but only run it when the input data is ready [Timeline: 01:00 02:00 03:00 04:00, each gated on "Hadoop input data exists?"]
  • 25. Data Triggers
     <coordinator-app name="coord1" frequency="${1*HOURS}" ...>
       <datasets>
         <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T23:59Z">
           <uri-template>
             hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}
           </uri-template>
         </dataset>
       </datasets>
       <input-events>
         <data-in name="inputLogs" dataset="logs">
           <instance>${current(0)}</instance>
         </data-in>
       </input-events>
     </coordinator-app>
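The coordinator fills ${YEAR}/${MONTH}/${DAY}/${HOUR} in the dataset's uri-template from each instance's nominal time, then checks that the resulting HDFS path exists. A simplified sketch of the substitution (this is not Oozie's implementation, only the idea):

```python
from datetime import datetime
from string import Template

def resolve_uri(uri_template, nominal_time):
    """Fill the dataset URI placeholders from an instance's nominal time."""
    return Template(uri_template).substitute(
        YEAR=f"{nominal_time.year:04d}",
        MONTH=f"{nominal_time.month:02d}",
        DAY=f"{nominal_time.day:02d}",
        HOUR=f"{nominal_time.hour:02d}",
    )

uri = resolve_uri(
    "hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}",
    datetime(2009, 1, 2, 23),
)
```

${current(0)} selects the dataset instance whose nominal time matches the coordinator action's own time; current(-1) would select the previous hour's directory, and so on.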
  • 26. Running a Coordinator Job
     Application Deployment:
       $ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job
     Coordinator Job Parameters:
       $ cat job.properties
       oozie.coord.application.path=hdfs://bar.com:9000/usr/abc/coord_job
     Job Submission:
       $ oozie job -run -config job.properties
       job: 000001-20090525161321-oozie-xyz-C
  • 27. Debugging a Coordinator Job
     Coordinator Job Information:
       $ oozie job -info 000001-20090525161321-oozie-xyz-C
       Job Name : wordcount-coord
       App Path : hdfs://bar.com:9000/usr/abc/coord_job
       Status   : RUNNING
     Coordinator Job Log:
       $ oozie job -log 000001-20090525161321-oozie-xyz-C
     Coordinator Job Definition:
       $ oozie job -definition 000001-20090525161321-oozie-xyz-C
  • 28. Three Questions … Do you need Oozie? Q1 : Do you have multiple jobs with dependency? Q2 : Does your job start based on time or data availability? Q3 : Do you need monitoring and operational support for your jobs? If any one of your answers is YES, then you should consider Oozie!
  • 29. What Oozie is NOT •  Oozie is not a resource scheduler •  Oozie is not for off-grid scheduling o  Note: off-grid execution is possible through the SSH action •  If you only submit jobs occasionally, Oozie is NOT a must o  Oozie provides REST API-based submission
  • 30. Oozie in Apache Main Contributors
  • 31. Oozie in Apache •  Y! internal usage: –  Total number of users: 375 –  Total number of processed jobs ≈ 750K/month •  External downloads: –  2500+ in the last year from GitHub –  A large number of downloads via 3rd-party packaging
  • 32. Oozie Usage Contd. •  User Community: –  Membership •  Y! internal – 286 •  External – 163 –  Number of messages (approximate) •  Y! internal – 7/day •  External – 10+/day
  • 33. Oozie: Towards a Scalable Workflow Management System for Hadoop Mohammad Islam
  • 34. Key Features and Design Decisions •  Multi-tenant •  Security –  Authenticate every request –  Pass appropriate token to Hadoop job •  Scalability –  Vertical: Add extra memory/disk –  Horizontal: Add machines
  • 35. Oozie Job Processing [Diagram: end user → Oozie server (Oozie security) → Hadoop, with Kerberos securing job access]
  • 36. Oozie-Hadoop Security [Diagram: end user → Oozie server → Hadoop; this slide highlights the Oozie-to-Hadoop leg, secured via Kerberos]
  • 37. Oozie-Hadoop Security •  Oozie is a multi-tenant system •  A job can be scheduled to run later •  Oozie submits/maintains the Hadoop jobs •  Hadoop needs a security token for each request Question: Who should provide the security token to Hadoop, and how?
  • 38. Oozie-Hadoop Security Contd. •  Answer: Oozie •  How? – Hadoop considers Oozie a super-user – Hadoop does not check the end user's credentials – Hadoop only checks the credentials of the Oozie process •  BUT the Hadoop job is executed as the end user •  Oozie utilizes Hadoop's doAs() functionality
  • 39. User-Oozie Security [Diagram: end user → Oozie server → Hadoop; this slide highlights the user-to-Oozie leg]
  • 40. Why Oozie Security? •  One user should not be able to modify another user's job •  Hadoop doesn't authenticate the end user •  Oozie has to verify its user before passing the job to Hadoop
  • 41. How does Oozie Support Security? •  Built-in authentication –  Kerberos –  Non-secured (default) •  Design Decision –  Pluggable authentication –  Easy to include new type of authentication –  Yahoo supports 3 types of authentication.
  • 42. Job Submission to Hadoop •  Oozie is designed to handle thousands of jobs at the same time •  Question: Should the Oozie server –  Submit the Hadoop job directly? –  Wait for it to finish? •  Answer: No
  • 43. Job Submission Contd. •  Reason –  Resource constraints: a single Oozie process can't simultaneously create thousands of threads, one per Hadoop job (scaling limitation) –  Isolation: running user code on the Oozie server might destabilize Oozie •  Design Decision –  Create a launcher Hadoop job –  Execute the actual user job from the launcher –  Wait asynchronously for the job to finish
  • 44. Job Submission to Hadoop [Diagram: the Oozie server submits a launcher job to the Job Tracker; the launcher runs as a single mapper on the Hadoop cluster and itself submits and monitors the actual M/R job]
  • 45. Job Submission Contd. •  Advantages –  Horizontal scalability: if load increases, add machines to the Hadoop cluster –  Stability: isolation of user code from the system process •  Disadvantages –  An extra map slot is occupied by each job
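The launcher pattern above can be mimicked in miniature: the "server" hands each job to a separate worker and returns immediately, then observes completion through an asynchronous callback. Plain threads stand in for the Hadoop cluster here; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

completed = []

def actual_job(job_id):
    """Stand-in for the real M/R job the launcher would submit."""
    return f"{job_id}:SUCCEEDED"

def on_done(future):
    # Asynchronous completion callback: the 'server' never blocks per job.
    completed.append(future.result())

# The pool plays the role of the cluster's map slots.
with ThreadPoolExecutor(max_workers=2) as launcher_pool:
    for job_id in ("wf-1", "wf-2"):
        launcher_pool.submit(actual_job, job_id).add_done_callback(on_done)
# Leaving the 'with' block waits for outstanding launchers to finish.
```

The point of the design is visible in the sketch: the submitting process holds no thread per running job, so thousands of concurrent jobs don't exhaust its resources.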
  • 46. Production Setup •  Total number of nodes: 42K+ •  Total number of clusters: 25+ •  Data presented from two clusters •  Each of them has nearly 4K nodes •  Total number of users/cluster = 50
  • 47. Oozie Usage Pattern @ Y! [Bar chart: distribution of job types (fs, java, map-reduce, pig) as a percentage on two production clusters]
  • 48. Experimental Setup •  Number of nodes: 7 •  Number of map-slots: 28 •  4 Core, RAM: 16 GB •  64 bit RHEL •  Oozie Server –  3 GB RAM –  Internal Queue size = 10 K –  # Worker Thread = 300
  • 49. Job Acceptance [Chart: workflows accepted per minute vs. number of submission threads (2 to 640)] Observation: Oozie can accept a large number of jobs
  • 50. Time Line of an Oozie Job [Timeline: user submits job → Oozie submits to Hadoop (preparation overhead) → job completes at Hadoop → job completes at Oozie (completion overhead)] Total Oozie Overhead = Preparation + Completion
  • 51. Oozie Overhead [Bar chart: per-action overhead in milliseconds for workflows with 1, 5, 10 and 50 actions] Observation: Oozie overhead is less when multiple actions are in the same workflow.
  • 52. Future Work •  Scalability •  Hot-Hot/Load balancing service •  Replace SQL DB with Zookeeper •  Event-based data processing •  Asynchronous Data processing •  User-defined event •  Extend the benchmarking scope •  Monitoring WS API
  • 53. Take Away .. •  Oozie is – Apache project – Scalable, Reliable and multi-tenant – Deployed in production and growing fast.
  • 54. Q&A Mohammad K Islam kamrul@yahoo-inc.com http://incubator.apache.org/oozie/
  • 56. Coordinator Application Lifecycle [Diagram: between start and end, a coordinator job materializes actions 0, 1, 2, …, N at times 0*f, 1*f, 2*f, …, N*f; the Coordinator Engine creates each action, which then starts a workflow job in the Workflow Engine]
  • 57. Vertical Scalability •  Oozie processes user requests asynchronously •  Memory-resident components: –  An internal queue to store sub-tasks –  A thread pool to process the sub-tasks •  Both items are: –  Fixed in size –  Unchanged by load variations –  Configurable in size if needed (which might need extra memory)
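The fixed-size queue plus fixed-size thread pool described above can be sketched with the standard library; the sizes and the doubling "work" are illustrative, not Oozie's internals:

```python
import queue
import threading

task_queue = queue.Queue(maxsize=100)  # bounded queue: memory stays fixed under load
results = []
lock = threading.Lock()

def worker():
    while True:
        task = task_queue.get()
        if task is None:              # sentinel: shut this worker down
            task_queue.task_done()
            return
        with lock:
            results.append(task * 2)  # stand-in for real sub-task processing
        task_queue.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]  # fixed-size pool
for t in pool:
    t.start()
for n in range(10):
    task_queue.put(n)                 # blocks when the queue is full (backpressure)
for _ in pool:
    task_queue.put(None)              # one sentinel per worker
for t in pool:
    t.join()
```

Because both the queue and the pool are bounded, a load spike translates into backpressure on producers rather than unbounded memory growth, which is the property the slide is claiming.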
  • 58. Challenges in Scalability •  Centralized persistence storage –  Currently supports any SQL DB (such as Oracle, MySQL, derby etc) –  Each DBMS has respective limitations –  Oozie scalability is limited by the underlying DBMS –  Using of Zookeeper might be an option
  • 59. Oozie Workflow Application •  Contents –  A workflow.xml file –  Resource files, config files and Pig scripts –  All necessary JAR and native library files •  Parameters –  The workflow.xml is parameterized; parameters can be propagated to map-reduce, pig & ssh jobs •  Deployment –  In a directory in the HDFS of the Hadoop cluster where the Hadoop & Pig jobs will run
  • 60. Oozie Running a Workflow Job cmd
     Workflow Application Deployment:
       $ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf
       $ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf/lib
       $ hadoop fs -copyFromLocal workflow.xml wordcount.xml hdfs://usr/tucu/wordcount-wf
       $ hadoop fs -copyFromLocal hadoop-examples.jar hdfs://usr/tucu/wordcount-wf/lib
     Workflow Job Execution:
       $ oozie run -o http://foo.corp:8080/oozie -a hdfs://bar.corp:9000/usr/tucu/wordcount-wf input=/data/2008/input output=/data/2008/output
       Workflow job id [1234567890-wordcount-wf]
     Workflow Job Status:
       $ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf
       Workflow job status [RUNNING]
  • 61. Big Features (1/2) •  Integration with Hadoop 0.23 •  Event-based data processing (3.3) •  Asynchronous data processing (3.3) •  User-defined event (Next) •  HCatalog integration (3.3) –  Non-polling approach
  • 62. Big Features (2/2) •  DAG AM (Next) –  WF will run as an AM on the Hadoop cluster –  Higher scalability •  Extend the coordinator's scope (3.3) –  Currently supports only WF –  Any job type, such as pig, hive, MR, java or distcp, can be scheduled without a WF •  Dataflow-based processing in Hadoop (Next) –  Create an abstraction to make things easier
  • 63. Usability (1/2) •  Easy adoption –  Modeling tool (3.3) –  IDE integration (3.3) –  Modular configurations (3.3) •  Monitoring API (3.3) •  Improved Oozie application management (next+) –  Automate uploading the application to HDFS –  Application versioning –  Seamless upgrade
  • 64. Usability (2/2) •  Shell and Distcp Action •  Improved UI for coordinator •  Mini-Oozie for CI –  Like Mini-cluster •  Support multiple versions (3.3) –  Pig, Distcp, Hive etc. •  Allow job notification through JMS (Next ++) •  Prioritization (Next++) –  By user, system level.
  • 65. Reliability •  Auto-Retry in WF Action level •  High-Availability –  Hot-Warm through ZooKeeper (3.3) –  Hot-Hot with Load Balancing (Next)
  • 66. Manageability •  Email action •  Query Pig Stats/Hadoop Counters –  Runtime control of Workflow based on stats –  Application-level control using the stats
  • 67. REST-API for Hadoop Components •  Direct access to Hadoop components –  Emulates the command line through REST API. •  Supported Products: –  Pig –  Map Reduce