Oozie hugnov11
 

Oozie is a Scheduler for Apache Hadoop jobs.

    Oozie hugnov11 Presentation Transcript

    • Oozie Evolution: Gateway to Hadoop Eco-System
        Mohammad Islam
    • Agenda
        • What is Oozie?
        • What is in the Next Release?
        • Challenges
        • Future Works
        • Q&A
    • Oozie in Hadoop Eco-System
        [Diagram: Oozie alongside HCatalog, Pig, Sqoop, Hive, Map-Reduce, and HDFS]
    • Oozie: The Conductor
    • A Workflow Engine
        • Oozie executes a workflow defined as a DAG of jobs.
        • Job types include Map-Reduce, Pig, Hive, any script, custom Java code, etc.
        [Diagram: example workflow with start, fork, join, decision (MORE / ENOUGH), and end nodes, running M/R streaming, M/R, Pig, FS, and Java jobs]
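
The "DAG of jobs" idea on this slide can be illustrated with a small sketch. This is not how Oozie defines or runs workflows (Oozie workflows are written in XML); it only shows the ordering guarantee a DAG gives. The job names and the `run_workflow` helper are hypothetical, loosely modeled on the job types named in the slide.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a job, each value is the set of jobs
# that must finish before it can start.
workflow = {
    "mr_streaming": set(),
    "mr_job": {"mr_streaming"},       # first branch after the fork
    "pig_job": {"mr_streaming"},      # second branch after the fork
    "fs_job": {"mr_job", "pig_job"},  # join: waits for both branches
    "java_job": {"fs_job"},
}

def run_workflow(dag, run_job):
    """Run each job only after all of its dependencies; return the order used."""
    order = []
    for job in TopologicalSorter(dag).static_order():
        run_job(job)
        order.append(job)
    return order

# The callback stands in for submitting a real job.
executed = run_workflow(workflow, lambda job: None)
```

The two fork branches may run in either order relative to each other; the only guarantee is that the join (`fs_job`) comes after both.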
    • A Scheduler
        • Oozie executes workflows based on:
            – Time dependency (frequency)
            – Data dependency
        [Diagram labels: Oozie Client, WS API, Oozie Server, Oozie Coordinator, Check Data Availability, Oozie Workflow, Hadoop]
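
As a rough illustration of combining the two triggers on this slide, the sketch below walks nominal run times at a fixed frequency (time dependency) and keeps only those whose input data has arrived (data dependency). The function name `ready_runs`, the `available` callback, and the sample dates are assumptions made for the example, not part of Oozie.

```python
from datetime import datetime, timedelta

def ready_runs(start, end, frequency, available):
    """Return the nominal times in [start, end) whose input data is available."""
    runs = []
    t = start
    while t < end:
        if available(t):  # data dependency check for this nominal time
            runs.append(t)
        t += frequency    # time dependency: next nominal time
    return runs

start = datetime(2011, 11, 1, 0, 0)
# Hypothetical data arrivals: hours 0 and 2 have data, hour 1 does not yet.
arrived = {start, start + timedelta(hours=2)}
runs = ready_runs(
    start,
    start + timedelta(hours=3),
    timedelta(hours=1),
    lambda t: t in arrived,
)
```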
    • REST-API for Hadoop Components
        • Direct access to Hadoop components
            – Emulates the command line through a REST API.
        • Supported products:
            – Pig
            – Map Reduce
    • Three Questions … Do you need Oozie?
        Q1: Do you have multiple jobs with dependencies?
        Q2: Does your job start based on time or data availability?
        Q3: Do you need monitoring and operational support for your jobs?
        If any one of your answers is YES, then you should consider Oozie!
    • What Oozie is NOT
        • Oozie is not a resource scheduler.
        • Oozie is not for off-grid scheduling.
            o Note: Off-grid execution is possible through the SSH action.
        • If you want to submit your job occasionally, Oozie is an option.
            o Oozie provides REST API based submission.
    • Oozie in Apache: Main Contributors
    • Oozie in Apache
        • Y! internal usage:
            – Total number of users: 375
            – Total number of processed jobs ≈ 750K/month
        • External downloads:
            – 2500+ in the last year from GitHub
            – A large number of downloads maintained by 3rd-party packaging.
    • Oozie Usages Contd.
        • User community:
            – Membership
                • Y! internal: 286
                • External: 163
            – Messages (approximate)
                • Y! internal: 7/day
                • External: 8/day
    • Next Release …
        • Integration with Hadoop 0.23
        • HCatalog integration
            – Non-polling approach
    • Usability
        • Script Action
        • Distcp Action
        • Suspend Action
        • Mini-Oozie for CI
            – Like Mini-cluster
        • Support multiple versions
            – Pig, Distcp, Hive etc.
    • Reliability
        • Auto-Retry at the WF Action level
        • High-Availability
            – Hot-Warm through ZooKeeper
    • Manageability
        • Email action
        • Query Pig Stats/Hadoop Counters
            – Runtime control of Workflow based on stats
            – Application-level control using the stats
    • Challenges: Queue Starvation
        • Which queue?
            – Not a Hadoop queue issue.
            – Oozie's internal queue, which processes the Oozie sub-tasks.
            – Oozie's main execution engine.
        • User problem:
            – A job's kill/suspend takes a very long time.
    • Challenges: Queue Starvation
        Technical Problem:
        • Before execution, every task acquires a lock on the job id.
        • Special high-priority tasks (such as Kill or Suspend) cannot get the lock and therefore starve.
        [Diagram: In Queue: J1 J2 J1 J1 J2 J1(H) J2 J1. Starvation for the high-priority task J1(H)!]
    • Challenges: Queue Starvation
        Resolution:
        • Add the high-priority task to both the interrupt list and the normal queue.
        • Before de-queuing, check whether there is any task in the interrupt list for the same job id. If there is one, execute that first.
        [Diagram: In Queue: J1 J2 J1 J1 J2 J1(H) J2 J1. In Interrupt List: J1(H). J1 finds a task in the interrupt list.]
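
The resolution on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not Oozie's actual implementation: the `Task` and `InterruptAwareQueue` names are hypothetical, locking is left out, and the sketch assumes the ordinary task at the head of the queue simply stays there while the interrupt task for the same job id runs first.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Task:
    job_id: str
    name: str
    done: bool = False

class InterruptAwareQueue:
    def __init__(self):
        self.queue = deque()  # normal FIFO queue of tasks
        self.interrupts = {}  # job_id -> deque of high-priority tasks

    def add(self, task, high_priority=False):
        # A high-priority task goes into BOTH the interrupt list and the queue.
        if high_priority:
            self.interrupts.setdefault(task.job_id, deque()).append(task)
        self.queue.append(task)

    def next_task(self):
        while self.queue:
            head = self.queue[0]
            pending = self.interrupts.get(head.job_id)
            if pending:
                # An interrupt exists for the same job id: run it first and
                # leave the head of the normal queue in place.
                urgent = pending.popleft()
                urgent.done = True
                return urgent
            self.queue.popleft()
            if not head.done:  # skip queue copies already run as interrupts
                head.done = True
                return head
        return None

q = InterruptAwareQueue()
q.add(Task("J1", "a"))
q.add(Task("J2", "b"))
q.add(Task("J1", "c"))
q.add(Task("J1", "kill"), high_priority=True)

order = []
while (task := q.next_task()) is not None:
    order.append(task.name)
```

In this run the `kill` task for J1 is executed as soon as any J1 task reaches the head of the queue, instead of waiting behind every earlier J1 task. A real engine would presumably drop the remaining J1 tasks after a kill; the sketch keeps them only to show the ordering.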
    • Oozie Futures
        • Easy adoption
            – Modeling tool
            – IDE integration
            – Modular configurations
        • Allow job notification through JMS
        • Event-based data processing
        • Prioritization
            – By user, system level.
    • Take Away ..
        • Oozie is
            – In Apache!
            – Reliable and feature-rich.
            – Growing fast.
    • Q&A
        Mohammad K Islam
        kamrul@yahoo-inc.com
        http://incubator.apache.org/oozie/
    • Who needs Oozie?
        • Multiple jobs that have sequential/conditional/parallel dependencies
        • Need to run a job/workflow periodically.
        • Need to launch a job when data is available.
        • Operational requirements:
            – Easy monitoring
            – Reprocessing
            – Catch-up
    • Challenges: Queue Starvation
        Problem:
        • Consider a queue with tasks of type T1 and T2. Max concurrency = 2.
        • An over-provisioned task (marked in red) is pushed back to the queue.
        • At high load, it is penalized in favor of later-arriving tasks of the same type.
        [Diagram: In Queue: T1 T2 T1 T1 T1 T2 T1, with running counts C(T1) and C(T2). Starvation! T1 cannot execute and is pushed to the head of the queue.]
    • Challenges: Queue Starvation
        Resolution:
        • Before de-queuing any task, check its concurrency.
        • If violated, skip it and get the next task.
        [Diagram: In Queue: T1 T2 T1 T1 T1 T2 T1, with running counts C(T1) and C(T2). The T1 task at the head cannot execute, so it is skipped and the next eligible task executes normally.]
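
A minimal sketch of the "check concurrency, skip if violated" rule might look like the following. The function name `dequeue_runnable` and the plain list and dict used for the queue and the running counts are assumptions for illustration; Oozie's internal queue is not described in this much detail on the slide.

```python
def dequeue_runnable(queue, running, max_concurrency):
    """Remove and return the first task type whose concurrency limit is not
    yet reached. Tasks that would violate the limit are skipped and keep
    their position in the queue. Returns None if nothing can run."""
    for i, task_type in enumerate(queue):
        if running.get(task_type, 0) < max_concurrency:
            queue.pop(i)
            running[task_type] = running.get(task_type, 0) + 1
            return task_type
    return None

# Two T1 tasks are already running, so T1 is at its limit of 2.
queue = ["T1", "T2", "T1"]
running = {"T1": 2, "T2": 0}

first = dequeue_runnable(queue, running, max_concurrency=2)
second = dequeue_runnable(queue, running, max_concurrency=2)
```

Because the skipped T1 task is never removed and re-added, it keeps its place at the front and runs as soon as a T1 slot frees up, rather than losing out to T1 tasks that arrived later.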