
Oozie hugnov11


Oozie is a Scheduler for Apache Hadoop jobs.

Published in: Technology

  1. Oozie Evolution: Gateway to Hadoop Eco-System
     Mohammad Islam
  2. Agenda
     •  What is Oozie?
     •  What is in the Next Release?
     •  Challenges
     •  Future Works
     •  Q&A
  3. Oozie in Hadoop Eco-System
     [Diagram labels: Oozie, HCatalog, Pig, Sqoop, Hive, Map-Reduce, HDFS]
  4. Oozie: The Conductor
  5. A Workflow Engine
     •  Oozie executes workflows defined as a DAG of jobs.
     •  Job types include Map-Reduce, Pig, Hive, any script, custom Java code, etc.
     [Diagram labels: start, fork, M/R streaming job, Pig job, join, decision (MORE / ENOUGH), M/R job, FS job, Java job, end]
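To make the "DAG of jobs" idea on slide 5 concrete, here is a minimal Python sketch that runs jobs in dependency order. It is not Oozie's workflow syntax or engine: the job names are loosely borrowed from the slide's diagram, the edges between them are assumptions, and `run_workflow` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
# The edges are assumptions; the slide's diagram did not survive extraction.
workflow = {
    "mr_streaming": set(),
    "pig": set(),
    "mr": {"mr_streaming", "pig"},  # waits for both parallel branches (a "join")
    "fs": {"mr"},
    "java": {"fs"},
}


def run_workflow(dag, run_job):
    """Run every job once, only after all of its dependencies have run."""
    order = []
    for job in TopologicalSorter(dag).static_order():
        run_job(job)
        order.append(job)
    return order


# Stand-in for real job submission: this sketch only records the order.
executed = run_workflow(workflow, lambda job: None)
```

The two branch jobs may run in either order, but both always precede `mr`, which is the property a fork/join pair expresses.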
  6. A Scheduler
     •  Oozie executes workflows based on:
        –  Time dependency (frequency)
        –  Data dependency
     [Diagram labels: Oozie Client, WS API, Oozie Server, Oozie Coordinator, Oozie Workflow, Check Data Availability, Hadoop]
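The two trigger conditions on slide 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Oozie's coordinator API: `nominal_times`, `ready_to_run`, and the input paths are all hypothetical names invented for the example.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """Yield the times at which a periodic workflow is due in [start, end)."""
    current = start
    while current < end:
        yield current
        current += frequency


def ready_to_run(nominal_time, now, required_paths, path_exists):
    """A run is ready once its time has come and all of its input data exists."""
    return now >= nominal_time and all(path_exists(p) for p in required_paths)


# Time dependency: one run every 6 hours over one day.
times = list(nominal_times(datetime(2011, 11, 1), datetime(2011, 11, 2), timedelta(hours=6)))

# Data dependency: hypothetical input paths, with a set standing in for HDFS.
available = {"/data/2011-11-01-00"}
now = datetime(2011, 11, 1, 7)
first_ready = ready_to_run(times[0], now, ["/data/2011-11-01-00"], available.__contains__)
second_ready = ready_to_run(times[1], now, ["/data/2011-11-01-06"], available.__contains__)
```

Here the first run is ready because its time has passed and its input exists, while the second is due but still waiting for data.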
  7. REST API for Hadoop Components
     •  Direct access to Hadoop components
        –  Emulates the command line through a REST API.
     •  Supported products:
        –  Pig
        –  Map Reduce
  8. Three Questions: Do You Need Oozie?
     •  Q1: Do you have multiple jobs with dependencies?
     •  Q2: Does your job start based on time or data availability?
     •  Q3: Do you need monitoring and operational support for your jobs?
     If any one of your answers is YES, then you should consider Oozie!
  9. What Oozie is NOT
     •  Oozie is not a resource scheduler.
     •  Oozie is not for off-grid scheduling.
        o  Note: Off-grid execution is possible through the SSH action.
     •  If you want to submit your job occasionally, Oozie is an option.
        o  Oozie provides REST API based submission.
  10. Oozie in Apache: Main Contributors
  11. Oozie in Apache
      •  Y! internal usage:
         –  Total number of users: 375
         –  Total number of processed jobs ≈ 750K/month
      •  External downloads:
         –  2500+ in the last year from GitHub
         –  A large number of downloads maintained by 3rd party packaging.
  12. Oozie Usages Contd.
      •  User community:
         –  Membership
            •  Y! internal: 286
            •  External: 163
         –  Messages (approximate)
            •  Y! internal: 7/day
            •  External: 8/day
  13. Next Release …
      •  Integration with Hadoop 0.23
      •  HCatalog integration
         –  Non-polling approach
  14. Usability
      •  Script action
      •  Distcp action
      •  Suspend action
      •  Mini-Oozie for CI
         –  Like Mini-cluster
      •  Support for multiple versions
         –  Pig, Distcp, Hive, etc.
  15. Reliability
      •  Auto-retry at the WF action level
      •  High availability
         –  Hot-warm through ZooKeeper
  16. Manageability
      •  Email action
      •  Query Pig stats/Hadoop counters
         –  Runtime control of the workflow based on stats
         –  Application-level control using the stats
  17. Challenges: Queue Starvation
      •  Which queue?
         –  Not a Hadoop queue issue.
         –  Oozie's internal queue, which processes the Oozie sub-tasks.
         –  Oozie's main execution engine.
      •  User problem:
         –  A job's kill/suspend takes a very long time.
  18. Challenges: Queue Starvation
      Technical problem:
      •  Before execution, every task acquires a lock on the job id.
      •  Special high-priority tasks (such as Kill or Suspend) couldn't get the lock and therefore starve.
      [Diagram: In Queue: J1 J2 J1 J1 J2 J1(H) J2 J1. Caption: Starvation for High Priority Task!]
  19. Challenges: Queue Starvation
      Resolution:
      •  Add the high-priority task to both the interrupt list and the normal queue.
      •  Before de-queuing, check whether there is any task in the interrupt list for the same job id. If there is one, execute that first.
      [Diagram: In Queue: J1 J2 J1 J1 J2 J1(H) J2 J1. In Interrupt List: J1(H). Caption: J1 finds a task in interrupt queue]
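The resolution on slide 19 can be sketched as a small Python class. This is a simplified model, not Oozie's actual queue code: `InterruptAwareQueue`, its method names, and the task names are hypothetical, and job-id locking is left out entirely. It also assumes that the queue copy of a high-priority task is dropped once that task has already run from the interrupt list, which the slide does not state.

```python
from collections import deque
from itertools import count


class InterruptAwareQueue:
    def __init__(self):
        self._ids = count()
        self.queue = deque()   # normal FIFO queue of (task_id, job_id, name)
        self.interrupts = {}   # job_id -> deque of pending high-priority tasks
        self.done = set()      # ids of high-priority tasks already handed out

    def add(self, job_id, name, high_priority=False):
        task = (next(self._ids), job_id, name)
        if high_priority:
            self.interrupts.setdefault(job_id, deque()).append(task)
        self.queue.append(task)  # high-priority tasks go into both structures

    def next_task(self):
        """Return the next task to execute, or None if the queue is empty."""
        while self.queue:
            task_id, job_id, _ = self.queue[0]
            if task_id in self.done:
                # Queue copy of an interrupt that already ran: drop it.
                self.queue.popleft()
                self.done.discard(task_id)
                continue
            pending = self.interrupts.get(job_id)
            if pending:
                urgent = pending.popleft()
                if urgent[0] == task_id:
                    self.queue.popleft()       # the head is itself the interrupt
                else:
                    self.done.add(urgent[0])   # head stays queued for the next call
                return urgent
            return self.queue.popleft()
        return None


q = InterruptAwareQueue()
q.add("J1", "a")
q.add("J2", "b")
q.add("J1", "c")
q.add("J1", "kill", high_priority=True)
q.add("J2", "d")
order = [q.next_task()[2] for _ in range(5)]
```

In this run the kill for J1 is handed out as soon as any J1 task reaches the head of the queue, instead of waiting behind the three tasks queued ahead of it.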
  20. Oozie Futures
      •  Easy adoption
         –  Modeling tool
         –  IDE integration
         –  Modular configurations
      •  Allow job notification through JMS
      •  Event-based data processing
      •  Prioritization
         –  By user, system level.
  21. Take Away
      •  Oozie is
         –  In Apache!
         –  Reliable and feature-rich.
         –  Growing fast.
  22. Q&A
      Mohammad K Islam
      kamrul@yahoo-inc.com
      http://incubator.apache.org/oozie/
  23. Who needs Oozie?
      •  Multiple jobs that have sequential/conditional/parallel dependencies
      •  Need to run a job/workflow periodically.
      •  Need to launch a job when data is available.
      •  Operational requirements:
         –  Easy monitoring
         –  Reprocessing
         –  Catch-up
  24. Challenges: Queue Starvation
      Problem:
      •  Consider a queue with tasks of type T1 and T2. Max concurrency = 2.
      •  An over-provisioned task (marked in red) is pushed back to the queue.
      •  At high load, it is penalized in favor of tasks of the same type that arrived later.
      [Diagram: In Queue: T1 T2 T1 T1 T1 T2 T1; Running; counters C(T1) and C(T2). Caption: Starvation! T1 cannot execute and is pushed to head of queue]
  25. Challenges: Queue Starvation
      Resolution:
      •  Before de-queuing any task, check its concurrency.
      •  If violated, skip it and get the next task.
      [Diagram: In Queue: T1 T2 T1 T1 T1 T2 T1; Running; counters C(T1) and C(T2). Annotations show a T1 task that cannot execute being skipped.]
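The skip-instead-of-requeue rule on slide 25 might look like the following in Python. It is a sketch under assumptions, not Oozie's implementation: `poll`, the `running` counter dictionary, and the task names are hypothetical, and the limit of 2 per task type is taken from the slide's example.

```python
from collections import deque

MAX_CONCURRENCY = 2  # per task type, as in the slide's example


def poll(queue, running):
    """Remove and return the first queued task whose type is below the limit.

    queue:   deque of (task_type, name) tuples in arrival order
    running: dict mapping task_type -> number of tasks currently executing
    Tasks that would exceed the limit keep their position in the queue.
    """
    for index, (task_type, _) in enumerate(queue):
        if running.get(task_type, 0) < MAX_CONCURRENCY:
            break
    else:
        return None  # empty queue, or every queued type is at its limit
    task = queue[index]
    del queue[index]
    running[task_type] = running.get(task_type, 0) + 1
    return task


queue = deque([("T1", "a"), ("T1", "b"), ("T1", "c"), ("T2", "d")])
running = {}
picked = [poll(queue, running)[1] for _ in range(3)]
blocked = poll(queue, running)   # only T1 "c" is left and T1 is at its limit
running["T1"] -= 1               # one T1 task finishes
resumed = poll(queue, running)
```

Because the over-limit T1 task is skipped rather than pushed back, it keeps its place and runs as soon as a T1 slot frees up, ahead of any T1 tasks that arrive later.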
