Oozie hugnov11
Oozie is a Scheduler for Apache Hadoop jobs.

Published in: Technology
Transcript

  • 1. Oozie Evolution: Gateway to Hadoop Eco-System (Mohammad Islam)
  • 2. Agenda
      •  What is Oozie?
      •  What is in the Next Release?
      •  Challenges
      •  Future Works
      •  Q&A
  • 3. Oozie in Hadoop Eco-System
      [Diagram: Oozie, HCatalog, Pig, Sqoop, Hive, Map-Reduce, HDFS]
  • 4. Oozie : The Conductor
  • 5. A Workflow Engine
      •  Oozie executes workflows defined as a DAG of jobs.
      •  Job types include Map-Reduce, Pig, Hive, any script, custom Java code, etc.
      [Diagram: example workflow DAG with start, fork, join, decision (branches MORE and ENOUGH) and end nodes, running M/R streaming, M/R, Pig, FS and Java jobs]
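
The idea of running a workflow as a DAG of jobs can be illustrated with a minimal Python sketch. This is not Oozie's workflow definition language or its engine; the node names and the dictionary layout are assumptions made up for the illustration. It only shows that a fork/join structure fixes a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a node, each value is the set of nodes
# that must finish before it can run. Names are illustrative, not Oozie syntax.
workflow = {
    "start": set(),
    "fork": {"start"},
    "mr_streaming_job": {"fork"},
    "pig_job": {"fork"},
    "join": {"mr_streaming_job", "pig_job"},
    "java_job": {"join"},
    "end": {"java_job"},
}

def execution_order(dag):
    """Return one valid order in which the nodes of the DAG can run."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
print(order)
```

The two jobs between fork and join have no dependency on each other, so either may come first in the printed order; a real engine could run them in parallel.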
  • 6. A Scheduler
      •  Oozie executes workflows based on:
          –  Time Dependency (Frequency)
          –  Data Dependency
      [Diagram: Oozie Client, WS API, Oozie Server (Oozie Coordinator, Oozie Workflow), Check Data Availability, Hadoop]
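
A rough Python sketch of the two trigger types, under stated assumptions: the function names, the hourly frequency, and the availability check are all hypothetical and do not reflect the Oozie Coordinator's actual implementation. A run becomes due at each frequency tick (time dependency) and is released only once its input data exists (data dependency).

```python
from datetime import datetime, timedelta

def nominal_times(start, end, frequency):
    """Yield the times at which a workflow run is due (time dependency)."""
    t = start
    while t < end:
        yield t
        t += frequency

def ready_runs(start, end, frequency, data_available):
    """Return the due times whose input data already exists (data dependency)."""
    return [t for t in nominal_times(start, end, frequency) if data_available(t)]

# Hypothetical availability check: pretend only the first two hourly inputs landed.
landed = {datetime(2011, 11, 1, 0), datetime(2011, 11, 1, 1)}

window_start = datetime(2011, 11, 1, 0)
window_end = datetime(2011, 11, 1, 4)
runs = ready_runs(window_start, window_end, timedelta(hours=1), landed.__contains__)
print(runs)
```

In this sketch the data check is a one-shot call per due time; a polling scheduler would repeat it until the data appears.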
  • 7. REST-API for Hadoop Components
      •  Direct access to Hadoop components
          –  Emulates the command line through REST API.
      •  Supported Products:
          –  Pig
          –  Map Reduce
  • 8. Three Questions … Do you need Oozie?
      •  Q1: Do you have multiple jobs with dependency?
      •  Q2: Does your job start based on time or data availability?
      •  Q3: Do you need monitoring and operational support for your jobs?
      If any one of your answers is YES, then you should consider Oozie!
  • 9. What Oozie is NOT
      •  Oozie is not a resource scheduler
      •  Oozie is not for off-grid scheduling
          –  Note: Off-grid execution is possible through SSH action.
      •  If you want to submit your job occasionally, Oozie is an option.
          –  Oozie provides REST API based submission.
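
REST-based submission might look roughly like the Python sketch below, which only builds the HTTP request and does not send it. The `/v1/jobs` endpoint, the `action=start` parameter, and the XML configuration body follow the Oozie Web Services API as documented for later releases; treat them as assumptions and check them against the version you run. The server address, user name, and HDFS path are hypothetical.

```python
import urllib.request
from xml.sax.saxutils import escape

def build_submit_request(oozie_url, properties):
    """Build (but do not send) a job-submission request for an Oozie server."""
    props = "".join(
        f"<property><name>{escape(k)}</name><value>{escape(v)}</value></property>"
        for k, v in properties.items()
    )
    body = f"<configuration>{props}</configuration>".encode("utf-8")
    return urllib.request.Request(
        url=f"{oozie_url}/v1/jobs?action=start",  # assumed endpoint, see lead-in
        data=body,
        headers={"Content-Type": "application/xml;charset=UTF-8"},
        method="POST",
    )

req = build_submit_request(
    "http://localhost:11000/oozie",  # assumed server address
    {
        "user.name": "alice",  # hypothetical user
        "oozie.wf.application.path": "hdfs://namenode/user/alice/my-wf",  # hypothetical path
    },
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the submission against a live server.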
  • 10. Oozie in Apache: Main Contributors
  • 11. Oozie in Apache
      •  Y! internal usages:
          –  Total number of users: 375
          –  Total number of processed jobs ≈ 750K/month
      •  External downloads:
          –  2500+ in the last year from GitHub
          –  A large number of downloads maintained by 3rd party packaging.
  • 12. Oozie Usages Contd.
      •  User Community:
          –  Membership
              •  Y! internal: 286
              •  External: 163
          –  Messages (approximate)
              •  Y! internal: 7/day
              •  External: 8/day
  • 13. Next Release …
      •  Integration with Hadoop 0.23
      •  HCatalog integration
          –  Non-polling approach
  • 14. Usability
      •  Script Action
      •  Distcp Action
      •  Suspend Action
      •  Mini-Oozie for CI
          –  Like Mini-cluster
      •  Support multiple versions
          –  Pig, Distcp, Hive etc.
  • 15. Reliability
      •  Auto-Retry in WF Action level
      •  High-Availability
          –  Hot-Warm through ZooKeeper
  • 16. Manageability
      •  Email action
      •  Query Pig Stats/Hadoop Counters
          –  Runtime control of Workflow based on stats
          –  Application-level control using the stats
  • 17. Challenges : Queue Starvation
      •  Which Queue?
          –  Not a Hadoop queue issue.
          –  Oozie internal queue to process the Oozie sub-tasks.
          –  Oozie’s main execution engine.
      •  User Problem:
          –  A job’s kill/suspend takes a very long time.
  • 18. Challenges : Queue Starvation
      Technical Problem:
      •  Before execution, every task acquires a lock on the job id.
      •  Special high-priority tasks (such as Kill or Suspend) couldn’t get the lock and therefore starve.
      [Diagram: In Queue: J1, J2, J1, J1, J2, J1(H), J2, J1. Starvation for High Priority Task!]
  • 19. Challenges : Queue Starvation
      Resolution:
      •  Add the high-priority task to both the interrupt list and the normal queue.
      •  Before de-queue, check if there is any task in the interrupt list for the same job id. If there is one, execute that first.
      [Diagram: In Queue: J1, J2, J1, J1, J2, J1(H), J2, J1. J1 finds a task in the interrupt queue. In Interrupt List: J1(H)]
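
The resolution on slide 19 can be sketched in Python as follows. This is a simplified model, not Oozie's actual queue code: the class and method names are invented, job-id locking is left out, and task names are assumed to be unique so the queue copy of an already-executed high-priority task can be recognized and skipped.

```python
from collections import deque

class CommandQueue:
    """Toy queue with an interrupt list keyed by job id (illustrative only)."""

    def __init__(self):
        self.queue = deque()       # normal FIFO of (job_id, task)
        self.interrupts = {}       # job_id -> pending high-priority tasks
        self.already_run = set()   # high-priority tasks run via the interrupt list

    def add(self, job_id, task, high_priority=False):
        # A high-priority task goes into BOTH the normal queue and the interrupt list.
        self.queue.append((job_id, task))
        if high_priority:
            self.interrupts.setdefault(job_id, []).append(task)

    def poll(self):
        """Return the next (job_id, task) to execute, or None if nothing is left."""
        while self.queue:
            job_id, task = self.queue.popleft()
            if task in self.already_run:
                self.already_run.discard(task)  # stale queue copy, skip it
                continue
            pending = self.interrupts.get(job_id)
            if pending:
                urgent = pending.pop(0)
                if urgent != task:
                    # Run the interrupt first; the normal task stays at the head.
                    self.queue.appendleft((job_id, task))
                    self.already_run.add(urgent)
                return job_id, urgent
            return job_id, task
        return None

q = CommandQueue()
q.add("J1", "a")
q.add("J2", "b")
q.add("J1", "c")
q.add("J1", "kill", high_priority=True)
results = [q.poll() for _ in range(5)]
print(results)
```

Although the kill was enqueued last, it is returned by the first poll because the head of the queue belongs to the same job id.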
  • 20. Oozie Futures
      •  Easy adoption
          –  Modeling tool
          –  IDE integration
          –  Modular Configurations
      •  Allow job notification through JMS
      •  Event-based data processing
      •  Prioritization
          –  By user, system level.
  • 21. Take Away ..
      •  Oozie is
          –  In Apache!
          –  Reliable and feature-rich.
          –  Growing fast.
  • 22. Q&A Mohammad K Islam kamrul@yahoo-inc.com http://incubator.apache.org/oozie/
  • 23. Who needs Oozie?
      •  Multiple jobs that have sequential/conditional/parallel dependency
      •  Need to run a job/Workflow periodically.
      •  Need to launch a job when data is available.
      •  Operational requirements:
          –  Easy monitoring
          –  Reprocessing
          –  Catch-up
  • 24. Challenges : Queue Starvation
      Problem:
      •  Consider a queue with tasks of type T1 and T2. Max Concurrency = 2.
      •  An over-provisioned task (marked in red) is pushed back to the queue.
      •  At high load, it gets penalized in favor of later-arriving tasks of the same type.
      [Diagram: In Queue: T1, T2, T1, T1, T1, T2, T1. Running counters C(T1) and C(T2). Starvation! T1 cannot execute and is pushed to head of queue]
  • 25. Challenges : Queue Starvation
      Resolution:
      •  Before de-queuing any task, check its concurrency.
      •  If violated, skip and get the next task.
      [Diagram: In Queue: T1, T2, T1, T1, T1, T2, T1. Running counters C(T1) and C(T2). Annotations garbled in extraction; they indicate that a T1 that cannot execute is skipped and that T2 and T1 then execute normally]
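
The skip-on-violation rule from slide 25 can be sketched in Python like this. It is an illustrative model under assumptions, not Oozie's code: the function name is made up, tasks are represented only by their type, and a single concurrency limit applies to every type.

```python
from collections import Counter, deque

def poll_respecting_concurrency(queue, running, max_concurrency):
    """Remove and return the first queued task type still under its limit.

    Tasks whose type is already at max_concurrency are skipped in place,
    so they keep their position instead of being pushed back.
    Returns None when every queued task would violate the limit.
    """
    for i, task_type in enumerate(queue):
        if running[task_type] < max_concurrency:
            del queue[i]
            running[task_type] += 1
            return task_type
    return None

queue = deque(["T1", "T2", "T1"])
running = Counter({"T1": 2})  # two T1 tasks already running, limit is 2

first = poll_respecting_concurrency(queue, running, 2)   # T1 skipped, T2 chosen
second = poll_respecting_concurrency(queue, running, 2)  # only T1 left, all blocked
running["T1"] -= 1                                       # one running T1 finishes
third = poll_respecting_concurrency(queue, running, 2)   # head T1 can now run
print(first, second, third)
```

Because skipped tasks are never re-enqueued, the oldest T1 is still at the head of the queue when a slot frees up, which is what removes the starvation described on slide 24.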
