Workflow on Hadoop Using Oozie__HadoopSummit2010
Hadoop Summit 2010 - Developers Track

Workflow on Hadoop Using Oozie
Alejandro Abdelnur, Yahoo!

  • Let me try to formalize things a bit. A coordinator application can be parameterized: the locations of its inputs and outputs, and any special settings for the workflow jobs. It has a start date, an end date, and a frequency, which can also be parameterized. On every frequency tick a workflow job is scheduled, but it stays in WAITING state until all of its input is available. A coordinator application defines all its inputs and outputs, which are normally related to the frequency and to the time at which the workflow jobs are scheduled.

Workflow on Hadoop Using Oozie: Presentation Transcript

  • Yahoo! Workflow Engine for Hadoop
    • Alejandro Abdelnur
    Yahoo!
    • Oozie workflow engine (Oozie 1)
    • Oozie coordinator engine (Oozie 2)
    • Getting Oozie
    Session Agenda
    • What was Oozie?
      • An Oozie workflow is a DAG of MR/pig/fs/java/workflow actions
      • Workflow applications are written in a PDL in XML
      • Workflow applications are parameterized
      • Oozie is a server
        • It is transactional, reliable and it scales
        • HTTP REST API only (Java API, CLI, console on top of it)
        • Implementation: Java web-app + SQL DB
    Oozie 1, Workflow
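To make the "DAG of actions" idea concrete, here is a minimal Python sketch that checks a workflow graph is acyclic and derives one valid execution order. It does not use Oozie's XML PDL or any Oozie API; the node names (`fs-cleanup`, `mr-aggregate`, and so on) and the `execution_order` helper are hypothetical and chosen only for illustration.

```python
from collections import deque


def execution_order(transitions):
    """Return one valid execution order for a workflow graph (Kahn's algorithm).

    `transitions` maps each node name to the list of nodes that follow it.
    Raises ValueError if the graph has a cycle, i.e. it is not a DAG.
    """
    nodes = set(transitions)
    for targets in transitions.values():
        nodes.update(targets)

    # Count incoming transitions for every node.
    indegree = {node: 0 for node in nodes}
    for targets in transitions.values():
        for target in targets:
            indegree[target] += 1

    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in transitions.get(node, ()):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle, so it is not a DAG")
    return order


# Hypothetical workflow: one fs action, then an MR and a Pig action in
# parallel, then a java action once both have finished.
WORKFLOW = {
    "start": ["fs-cleanup"],
    "fs-cleanup": ["mr-aggregate", "pig-join"],
    "mr-aggregate": ["java-publish"],
    "pig-join": ["java-publish"],
    "java-publish": ["end"],
}
```

Calling `execution_order(WORKFLOW)` yields an order in which every action appears after all of its predecessors; a graph with a loop is rejected.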
  • Users Experience
    • “Oozie has enabled us to reduce our index building operation from a manually intensive 4-days process to 6-hours fully automated process...”
    • Keyword Research Service team
    • “…It saved us tremendous amount of time and resources not to develop alternative custom solution to manage our complex workflows on the Grid…”
    • Segment Manager team
    • Oozie users: 50
    • Workflow applications: 4868
    • Largest workflow: 2000 action nodes
    • Average action nodes per workflow: 18
    • Workflow jobs in last month: 55K
    • Longest running workflow job: 17 hours
    • Workflow action nodes by type (chart; labels and values listed in the order extracted):
      • Labels: Map-Red, Pig, File System, Java, Sub-Workflow
      • Values: 23%, 30%, 19%, 18%, 4%
    Some Numbers
    • Releases
      • 4 feature releases, 6 patches, 1 DB schema change (from Oozie 1 to 2)
    • Failures? YES (recovered from them? YES )
      • Servlet-container (Tomcat) and database (MySQL)
      • Did we lose workflow jobs data? NO
      • Code issues that caused failures:
        • DB CONN leaks (fix: use command pattern all over)
        • Thread pool starvation (fix: added thread quota per command type)
        • HDFS CONN leaks (fix: 2nd-level caching)
    The First Year …
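The "thread quota per command type" fix for thread pool starvation can be illustrated with a small Python sketch. This is not Oozie's code (Oozie is a Java web-app); the `QuotaExecutor` class and its parameters are hypothetical. The idea shown is that a command whose type already occupies its quota of threads is deferred rather than handed to the pool, so one busy command type cannot take every worker.

```python
import threading
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor


class QuotaExecutor:
    """Thread pool that caps how many commands of one type run at once."""

    def __init__(self, max_workers, quota_per_type):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._quota = quota_per_type
        self._running = defaultdict(int)    # command type -> threads in use
        self._pending = defaultdict(deque)  # command type -> deferred commands
        self._lock = threading.Lock()

    def submit(self, cmd_type, fn):
        """Run fn now if its type is under quota, otherwise defer it."""
        with self._lock:
            if self._running[cmd_type] >= self._quota:
                self._pending[cmd_type].append(fn)
                return
            self._running[cmd_type] += 1
        self._pool.submit(self._run, cmd_type, fn)

    def _run(self, cmd_type, fn):
        # Run the command, then keep draining deferred commands of the same
        # type on this thread; release the quota slot when none are left.
        while fn is not None:
            try:
                fn()
            except Exception:
                pass  # a real server would log the failure here
            with self._lock:
                if self._pending[cmd_type]:
                    fn = self._pending[cmd_type].popleft()
                else:
                    self._running[cmd_type] -= 1
                    fn = None

    def shutdown(self):
        """Wait for all submitted and deferred commands to finish."""
        self._pool.shutdown(wait=True)
```

With `QuotaExecutor(max_workers=4, quota_per_type=2)`, six commands of one type never use more than two threads, which leaves the other two workers free for other command types.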
    • Deployment Model
      • Started: 1 Tomcat / multiple Oozies
      • Now: multiple instances, each 1 Tomcat / 1 Oozie
    • Database
      • Migrated from MySQL to Oracle
    … The First Year …
    • Co-existence with Hadoop
      • When JT/NN are slow, Oozie users complain that Oozie is slow
      • Bad workflows can overload JT/NN
        • fork of 2000+ MR
        • Java action looping waiting for files to become available
      • Hadoop patching requires a synchronized patching of Oozie (because of Hadoop-RPC compatibility issues)
      • Different Y! clusters use different Hadoop versions (it requires juggling with Oozie code to avoid more branches)
    … The First Year …
    • Implementation changes
      • Deprecated SSH action, added JAVA action
      • MR/Pig actions are started via a launcher M(1)R(0) job
      • Improved user logging (especially for Pig)
      • Removed external calls from within DB TRX (nasty one)
      • Using (Open)JPA for DB access
    • Got right from the beginning
      • Backward compatibility for API and PDL: ALWAYS KEPT
      • Heavy use of asynchronous command execution (queue + threadpool)
      • Instrumentation data (for monitoring)
    … The First Year
    • A Workflow job MUST NOT be started until all external input is available
    RULE for Oozie Workflows
    • What is Oozie 2? It is Oozie 1 PLUS …
      • Time+Data driven execution of workflow jobs
        • Workflow job is scheduled at a regular frequency
        • Workflow job is started when all input data is available
    Oozie 2 Coordinator
    [Diagram: a coordinator app with frequency f; IN → Workflow → OUT]
  • Use Cases: Data Pipelines
    [Diagram: a data pipeline. Labels: WS; coordinator f (5min), 01JAN to 31DEC, with LOG instances 1:05, 1:10, 1:15, ..., 2:00 and PH1 instances 1:05, 1:10, 1:15, ..., 2:00; coordinator f (60min), 01JAN to 31DEC, with PH2 instance 2:00]
    • A coordinator application can be parameterized
    • Coordinator jobs have frequency, start & end date
    • Every tick of the frequency a coordinator action is created
    • The coordinator action starts a workflow job only when all input data is available
    • Coordinator applications define their input/output data
    • Input/output data is (normally) relative to the action creation time (the job frequency) and is expressed as URI templates:
    • hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}
    Coordinator Applications
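The bullets above can be sketched in a few lines of Python: one action creation time per frequency tick between the start and end dates, and a URI template resolved against that time. This is only an illustration, not Oozie's implementation. The helper names `action_times` and `resolve` are hypothetical, and two details are assumptions: the end date is treated as exclusive, and the date fields are zero-padded.

```python
from datetime import datetime, timedelta
from string import Template

# Template copied from the slide; the elided "..." part is left as-is.
PH1_TEMPLATE = "hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}"


def action_times(start, end, frequency):
    """Yield one action creation time per tick in [start, end)."""
    tick = start
    while tick < end:
        yield tick
        tick += frequency


def resolve(uri_template, when):
    """Fill the ${...} variables of a URI template from a datetime."""
    return Template(uri_template).substitute(
        YEAR=f"{when.year:04d}",
        MONTH=f"{when.month:02d}",
        DAY=f"{when.day:02d}",
        HOUR=f"{when.hour:02d}",
        MIN=f"{when.minute:02d}",
    )


# Hourly coordinator job over a three-hour window.
ticks = list(action_times(datetime(2010, 1, 1, 1, 0),
                          datetime(2010, 1, 1, 4, 0),
                          timedelta(hours=1)))
uris = [resolve(PH1_TEMPLATE, tick) for tick in ticks]
```

Each entry in `uris` is the location a coordinator action created at that tick would look at, for example `hdfs://.../ph1/2010/01/01/02/00` for the 2:00 tick.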
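  • Coordinator Input and Output Data
    [Diagram: coordinator job with frequency f_j (60min), 01JAN to 31DEC, action at 2:00. Input dataset PH1 with frequency f_i (5min): instances 1:05, 1:10, 1:15, ..., 2:00, addressed as ${current(-11)}, ${current(-10)}, ${current(-9)}, ..., ${current(0)}. Output dataset PH2 with frequency f_o (60min): instance 2:00, addressed as ${current(0)}. IN → Workflow → OUT]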
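The `${current(n)}` references in this diagram can be read as "the n-th dataset instance relative to the action's time". The Python sketch below follows that reading; it is inferred from the diagram, not taken from the Oozie specification, and the `current` function signature and the `initial_instance` parameter are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def current(n, nominal_time, initial_instance, frequency):
    """Return the dataset instance n steps away from the latest instance
    at or before nominal_time (n = 0 is that latest instance)."""
    # timedelta // timedelta gives the whole number of instances elapsed.
    elapsed = (nominal_time - initial_instance) // frequency
    return initial_instance + (elapsed + n) * frequency


# PH1 is produced every 5 minutes; the hourly job's action runs at 2:00.
PH1_FREQUENCY = timedelta(minutes=5)
PH1_INITIAL = datetime(2010, 1, 1, 0, 0)
ACTION_TIME = datetime(2010, 1, 1, 2, 0)

# Input window from ${current(-11)} to ${current(0)}: twelve instances.
window = [current(n, ACTION_TIME, PH1_INITIAL, PH1_FREQUENCY)
          for n in range(-11, 1)]
```

Under these assumptions the window starts at the 1:05 instance and ends at the 2:00 instance, matching the twelve PH1 inputs shown for the 2:00 action.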
    • Minutes and hours in a day change on a per-TZ basis
      • Hours in March == 31 * 24 ? YES & NO
    • A day of hourly datasets is always 24 instances? YES & NO
    • How about mixing datasets from different US TZs?
    • How about mixing datasets from different TZs from different regions/countries?
    • SOLUTION: Built-In Support for TZ/DS
    Daylight Saving is Evil
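The "YES & NO" answers can be checked with a short Python sketch that counts the elapsed hours in a month for a given time zone. It uses the standard `zoneinfo` module (Python 3.9+, with time zone data available on the system); the `hours_in_month` helper is hypothetical and has nothing to do with Oozie's built-in TZ/DS support.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def hours_in_month(year, month, tz_name):
    """Count the elapsed hours in a calendar month of the given time zone."""
    tz = ZoneInfo(tz_name)
    start = datetime(year, month, 1, tzinfo=tz)
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=tz)
    else:
        end = datetime(year, month + 1, 1, tzinfo=tz)
    # Convert to UTC before subtracting: subtracting two datetimes that share
    # a tzinfo object ignores the UTC offset change caused by daylight saving.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return int(elapsed.total_seconds() // 3600)


march_utc = hours_in_month(2010, 3, "UTC")
march_pacific = hours_in_month(2010, 3, "America/Los_Angeles")
```

In UTC, March 2010 has 31 * 24 = 744 hours; in `America/Los_Angeles` it has 743, because the clocks moved forward one hour on 14 March 2010. An hourly dataset for that day has 23 instances, not 24.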
    • Automatic temporary back-off from JT/NN when down or too slow
    • Map-Reduce and Pig job submission over HTTP (w/o WF)
    • High Availability (via Zookeeper)
    • Improved Workflow Schema
    • Complete Coordinator specification support (asynch datasets and apps)
    • More user friendly functions
    • Integration with metadata system
    • Coordinator reprocessing features
    • Coordinator application bundles (manage many coord jobs as one unit)
    What is Next?
    • http://developer.yahoo.com/hadoop
    • http://yahoo.github.com/oozie
    Getting Oozie
  • Questions?
    • Alejandro Abdelnur
    • [email_address]