• Save
Nov 2011 HUG: Oozie
Upcoming SlideShare
Loading in...5

Like this? Share it with your network


Nov 2011 HUG: Oozie






Total Views
Views on SlideShare
Embed Views



1 Embed 2

http://localhost 2



Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

Nov 2011 HUG: Oozie Presentation Transcript

  • 1. Oozie EvolutionGateway to Hadoop Eco-System Mohammad Islam
  • 2. Agenda•  What is Oozie?•  What is in the Next Release?•  Challenges•  Future Works•  Q&A
  • 3. Oozie in Hadoop Eco-System Oozie HCatalog Pig Sqoop HiveOozie Map-Reduce HDFS
  • 4. Oozie : The Conductor
  • 5. A Workflow Engine•  Oozie executes workflow defined as DAG of jobs•  The job type includes: Map-Reduce/Pig/Hive/Any script/ Custom Java Code etc M/R streaming job M/R start job fork join Pig MORE decision job M/R ENOUGH job FS end Java job
  • 6. A Scheduler•  Oozie executes workflow based on: –  Time Dependency (Frequency) –  Data Dependency Oozie Server Check WS API Oozie Data Availability Coordinator Oozie Oozie Workflow Client Hadoop
  • 7. REST-API for Hadoop Components•  Direct access to Hadoop components –  Emulates the command line through REST API.•  Supported Products: –  Pig –  Map Reduce
  • 8. Three Questions … Do you need Oozie?Q1 : Do you have multiple jobs with dependency?Q2 : Does your job start based on time or data availability?Q3 : Do you need monitoring and operational support for your jobs? If any one of your answers is YES, then you should consider Oozie!
  • 9. What Oozie is NOT•  Oozie is not a resource scheduler•  Oozie is not for off-grid scheduling o  Note: Off-grid execution is possible through SSH action.•  If you want to submit your job occasionally, Oozie is an option. o  Oozie provides REST API based submission.
  • 10. Oozie in ApacheMain Contributors
  • 11. Oozie in Apache•  Y! internal usages: –  Total number of user : 375 –  Total number of processed jobs ≈ 750K/ month•  External downloads: –  2500+ in last year from GitHub –  A large number of downloads maintained by 3rd party packaging.
  • 12. Oozie Usages Contd.•  User Community: –  Membership •  Y! internal - 286 •  External – 163 –  Message (approximate) •  Y! internal – 7/day •  External – 8/day
  • 13. Next Release …•  Integration with Hadoop 0.23•  HCatalog integration –  Non-polling approach
  • 14. Usability•  Script Action•  Distcp Action•  Suspend Action•  Mini-Oozie for CI –  Like Mini-cluster•  Support multiple versions –  Pig, Distcp, Hive etc.
  • 15. Reliability•  Auto-Retry in WF Action level•  High-Availability –  Hot-Warm through ZooKeeper
  • 16. Manageability•  Email action•  Query Pig Stats/Hadoop Counters –  Runtime control of Workflow based on stats –  Application-level control using the stats
  • 17. Challenges : Queue Starvation•  Which Queue? –  Not a Hadoop queue issue. –  Oozie internal queue to process the Oozie sub-tasks. –  Oozie’s main execution engine.•  User Problem : –  Job’s kill/suspend takes very long time.
  • 18. Challenges : Queue StarvationTechnical Problem: •  Before execution, every task acquires lock on the job id. •  Specialhigh-priority tasks (such as Kill or Suspend) couldn’t get the lock and therefore, starve. In Queue J1 J2 J1 J1 J2 J1(H) J2 J1 Starvation for High Priority Task!
  • 19. Challenges : Queue StarvationResolution: • Add the high priority task in both the interrupt list and normal queue. •  Before de-queue, check if there is any task in the interrupt list for the same job id. If there is one, execute that first. In Queue J1 J2 J1 J1 J2 J1(H) J2 J1 finds a task in interrupt queue In Interrupt ListJ1(H)
  • 20. Oozie Futures•  Easy adoption –  Modeling tool –  IDE integration –  Modular Configurations•  Allow job notification through JMS•  Event-based data processing•  Prioritization –  By user, system level.
  • 21. Take Away ..•  Oozie is –  In Apache! –  Reliable and feature-rich. –  Growing fast.
  • 22. Q&A Mohammad K Islam kamrul@yahoo-inc.com http://incubator.apache.org/oozie/
  • 23. Who needs Oozie?•  Multiple jobs that have sequential/ conditional/parallel dependency•  Need to run job/Workflow periodically.•  Need to launch job when data is available.•  Operational requirements: –  Easy monitoring –  Reprocessing –  Catch-up
  • 24. Challenges : Queue StarvationProblem: •  Consider queue with tasks of type T1 and T2. Max Concurrency = 2. •  Over-provisioned task (marked by red) is pushed back to the queue. •  At high load, it gets penalized in favor of same type, but later arrival of tasks . In Queue Running C (T1) C (T2)T1 T2 T1 T1 T1 T2 T1 012 01 Starvation! T1 cannot execute and is pushed to head of queue
  • 25. Challenges : Queue StarvationResolution: •  Before de-queuing any task, check its concurrency. •  If violated, skip and get the next task. In Queue Running C (T1) C (T2)T1 T2 T1 T1 T1 T2 T1 012 01 2Enqueue T2 now T1 cannot execute, so skip by one normallyfront T1 now executes node to