A Workflow Engine
• Oozie executes a workflow defined as a DAG of jobs.
• Job types include: MapReduce, Pig, Hive, any script, custom Java code, etc.
[Diagram: example workflow DAG — start, an M/R streaming job, a fork/join over M/R and Pig jobs, a decision node (MORE loops back, ENOUGH proceeds), FS and Java actions, end]
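The DAG execution idea on this slide can be sketched in a few lines. This is not Oozie code; the node names mirror the diagram and are purely illustrative, and `graphlib` (Python 3.9+) stands in for Oozie's own DAG resolution.

```python
# Minimal sketch (not Oozie internals): a workflow engine resolves a DAG
# of actions into a valid execution order, honoring fork/join dependencies.
from graphlib import TopologicalSorter

# Hypothetical workflow: start -> fork into "mr-stream" and "pig",
# join, decide, then an FS step leading to end.
dag = {
    "mr-stream": {"start"},   # each node maps to its predecessors
    "pig": {"start"},
    "join": {"mr-stream", "pig"},
    "decision": {"join"},
    "fs": {"decision"},
    "end": {"fs"},
}

def run_order(dag):
    """Return one valid execution order for the workflow DAG."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(dag)
```

In the real engine, forked branches run concurrently; the topological order only captures the "may not start before" constraints.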
A Scheduler
• Oozie executes workflows based on:
  – Time dependency (frequency)
  – Data dependency
[Diagram: an Oozie client submits via the WS API to the Oozie server; the Oozie coordinator checks data availability and launches Oozie workflows on Hadoop]
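The coordinator's trigger condition can be sketched as a single predicate. This is an illustrative simplification, not Oozie's implementation: the function name and parameters are hypothetical.

```python
# Minimal sketch (not Oozie code): a coordinator action fires only when
# BOTH dependencies hold - the scheduled (nominal) time has arrived AND
# the input data it depends on is available.
def should_trigger(now, nominal_time, data_available):
    """Time dependency and data dependency must both be satisfied."""
    return now >= nominal_time and data_available
```

In Oozie, the "data available" check is a poll for the input dataset instances declared in the coordinator definition.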
REST API for Hadoop Components
• Direct access to Hadoop components
  – Emulates the command line through a REST API.
• Supported products:
  – Pig
  – MapReduce
Three Questions … Do You Need Oozie?
• Q1: Do you have multiple jobs with dependencies?
• Q2: Do your jobs start based on time or data availability?
• Q3: Do you need monitoring and operational support for your jobs?
If any one of your answers is YES, you should consider Oozie!
What Oozie is NOT
• Oozie is not a resource scheduler.
• Oozie is not for off-grid scheduling.
  – Note: off-grid execution is possible through the SSH action.
• If you only submit jobs occasionally, Oozie is still an option.
  – Oozie provides REST API–based submission.
Oozie in Apache
• Y! internal usage:
  – Total number of users: 375
  – Total processed jobs ≈ 750K/month
• External downloads:
  – 2500+ in the last year from GitHub
  – A large number of downloads served by 3rd-party packaging.
Next Release …
• Integration with Hadoop 0.23
• HCatalog integration
  – Non-polling approach
Usability
• Script action
• DistCp action
• Suspend action
• Mini-Oozie for CI
  – Like mini-cluster
• Support for multiple versions
  – Pig, DistCp, Hive, etc.
Reliability
• Auto-retry at the workflow action level
• High availability
  – Hot-warm through ZooKeeper
Manageability
• Email action
• Query Pig stats / Hadoop counters
  – Runtime control of a workflow based on the stats
  – Application-level control using the stats
Challenges: Queue Starvation
• Which queue?
  – Not a Hadoop queue issue.
  – Oozie's internal queue for processing Oozie sub-tasks.
  – Oozie's main execution engine.
• User problem:
  – Killing or suspending a job takes a very long time.
Challenges: Queue Starvation
Technical problem:
• Before execution, every task acquires a lock on the job id.
• Special high-priority tasks (such as Kill or Suspend) couldn't get the lock and therefore starve.
[Diagram: in-queue J1 J2 J1 J1 J2 J1(H) J2 J1 — the high-priority task J1(H) starves behind the normal J1 tasks]
Challenges: Queue Starvation
Resolution:
• Add the high-priority task to both the interrupt list and the normal queue.
• Before de-queuing, check whether there is a task in the interrupt list for the same job id. If there is one, execute that first.
[Diagram: de-queuing a J1 task finds J1(H) in the interrupt list and executes it first]
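The resolution on this slide can be sketched as follows. This is not Oozie's actual implementation; the function and field names are hypothetical, and a `done` flag stands in for however the real queue de-duplicates the copy left in the normal queue.

```python
# Minimal sketch (not Oozie internals) of the interrupt-list fix:
# high-priority tasks are placed in BOTH the normal queue and a
# per-job-id interrupt list; on de-queue, any pending interrupt task
# for the same job id runs first.
from collections import deque

def enqueue_priority(queue, interrupts, task):
    """Add a high-priority task to the normal queue AND the interrupt list."""
    queue.append(task)
    interrupts.setdefault(task["job"], deque()).append(task)

def dequeue(queue, interrupts):
    """Return the next task to execute, interrupts first for the same job id."""
    while queue:
        task = queue.popleft()
        if task.get("done"):            # already executed via the interrupt list
            continue
        pending = interrupts.get(task["job"])
        if pending:
            hp = pending.popleft()      # high-priority task jumps ahead
            hp["done"] = True
            if hp is not task:
                queue.appendleft(task)  # the normal task waits its turn
            return hp
        return task
    return None
```

With this scheme, a Kill for job J1 runs as soon as any J1 task reaches the front of the queue, instead of waiting for its own position.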
Oozie Futures
• Easy adoption
  – Modeling tool
  – IDE integration
  – Modular configurations
• Job notification through JMS
• Event-based data processing
• Prioritization
  – By user, and at the system level
Take Away …
• Oozie is
  – In Apache!
  – Reliable and feature-rich.
  – Growing fast.
Q&A
Mohammad K Islam
firstname.lastname@example.org
http://incubator.apache.org/oozie/
Who Needs Oozie?
• Multiple jobs with sequential/conditional/parallel dependencies
• Need to run a job or workflow periodically
• Need to launch a job when data becomes available
• Operational requirements:
  – Easy monitoring
  – Reprocessing
  – Catch-up
Challenges: Queue Starvation
Problem:
• Consider a queue with tasks of types T1 and T2, with max concurrency = 2.
• An over-provisioned task (marked in red) is pushed back to the queue.
• At high load, it gets penalized in favor of later-arriving tasks of the same type.
[Diagram: in-queue T1 T2 T1 T1 T1 T2 T1 with running counters C(T1) = 2 and C(T2) = 1; the over-provisioned T1 cannot execute and is pushed to the head of the queue — starvation!]
Challenges: Queue Starvation
Resolution:
• Before de-queuing any task, check its concurrency.
• If the limit is violated, skip it and take the next task.
[Diagram: the blocked T1 is skipped; T2 is de-queued and starts running; T1 executes normally once capacity frees up]
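The skip-on-violation rule can be sketched like this. Again, not Oozie's actual code: the function name, the `running`/`limits` dictionaries, and the task shape are illustrative assumptions.

```python
# Minimal sketch (not Oozie internals) of the concurrency fix: before
# de-queuing a task, check its type's concurrency limit; if the limit
# is reached, skip it and try the next task instead of letting the
# blocked task hold up the head of the queue.
from collections import deque

def pick_next(queue, running, limits):
    """Return the first queued task whose type is under its limit, or None."""
    skipped = []
    picked = None
    while queue:
        task = queue.popleft()
        if running.get(task["type"], 0) < limits[task["type"]]:
            picked = task
            break
        skipped.append(task)            # over-provisioned: skip, don't starve others
    for t in reversed(skipped):
        queue.appendleft(t)             # skipped tasks keep their queue position
    return picked
```

Skipped tasks are restored to the front in their original order, so they run as soon as a slot of their type opens, while other task types are no longer blocked behind them.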