Workflow on Hadoop Using Oozie__HadoopSummit2010
Hadoop Summit 2010 - Developers Track

Workflow on Hadoop Using Oozie
Alejandro Abdelnur, Yahoo!

  • Speaker note: Let me try to formalize things a bit. A coordinator application can be parameterized: the locations of its inputs and outputs, and special settings for the workflow jobs. It has a start date, an end date, and a frequency, which can also be parameterized. On every frequency tick a workflow job is scheduled, but it stays in the WAITING state until all of its input is available. A coordinator application defines all of its input and output data, and these are normally relative to the frequency and to the time at which the workflow jobs are scheduled.

Transcript

  • 1. Yahoo! Workflow Engine for Hadoop
    • Alejandro Abdelnur
    Yahoo!
  • 2.
    • Oozie workflow engine (Oozie 1)
    • Oozie coordinator engine (Oozie 2)
    • Getting Oozie
    Session Agenda
  • 3.
    • What was Oozie?
      • An Oozie workflow is a DAG of MR/pig/fs/java/workflow actions
      • Workflow applications are written in a PDL (process definition language) in XML
      • Workflow applications are parameterized
      • Oozie is a server
        • It is transactional, reliable and it scales
        • HTTP REST API only (Java API, CLI, console on top of it)
        • Implementation: Java web-app + SQL DB
    Oozie 1, Workflow
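The DAG idea on this slide can be illustrated with a minimal sketch. It is not Oozie's implementation and it does not use the XML PDL: the action names and the `run_order` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever ordering logic the engine uses.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "mr-clean": set(),
    "pig-aggregate": {"mr-clean"},
    "java-validate": {"mr-clean"},
    "fs-publish": {"pig-aggregate", "java-validate"},
}

def run_order(dag):
    """Return one valid execution order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as exc:
        # The second element of args holds the detected cycle.
        raise ValueError(f"not a DAG: {exc.args[1]}") from exc

order = run_order(workflow)
print(order)
```

Any order returned here starts with `mr-clean` and ends with `fs-publish`; the two middle actions are independent of each other, which is the situation a fork in a workflow would exploit.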
  • 4. Users Experience
    • “Oozie has enabled us to reduce our index building operation from a manually intensive 4-days process to 6-hours fully automated process...”
    • Keyword Research Service team
    • “… It saved us tremendous amount of time and resources not to develop alternative custom solution to manage our complex workflows on the Grid…”
    • Segment Manager team
  • 5.
    • Oozie users: 50
    • Workflow applications: 4868
    • Largest workflow: 2000 action nodes
    • Average action nodes per workflow: 18
    • Workflow jobs in last month: 55K
    • Longest running workflow job: 17 hours
    • Workflow action nodes by type (chart; labels and values as extracted, in order): Map-Red, Pig, File System, Java, Sub-Workflow; 23%, 30%, 19%, 18%, 4%
    Some Numbers
  • 6.
    • Releases
      • 4 feature releases, 6 patches, 1 DB schema change (from Oozie 1 to 2)
    • Failures? YES (recovered from them? YES )
      • Servlet-container (Tomcat) and database (MySQL)
      • Did we lose workflow jobs data? NO
      • Code issues that caused failures:
        • DB CONN leaks (fix: use command pattern all over)
        • Thread pool starvation (fix: added thread quota per command type)
        • HDFS CONN leaks (fix: 2nd-level caching)
    The First Year …
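The thread pool starvation fix mentioned on this slide (a thread quota per command type) can be sketched as follows. This is an illustration under stated assumptions, not Oozie's code: the `QuotaQueue` class, the command type names, and the default quota of 1 are all hypothetical, and the sketch models only the bookkeeping, not real threads.

```python
from collections import Counter, deque

class QuotaQueue:
    """Queue that never lets one command type hold more than its quota of worker slots."""

    def __init__(self, quotas):
        self.quotas = quotas      # e.g. {"check": 1, "start": 2} (hypothetical types)
        self.pending = deque()    # (command_type, payload) in arrival order
        self.running = Counter()  # command_type -> number currently running

    def submit(self, command_type, payload):
        self.pending.append((command_type, payload))

    def take(self):
        """Pop the oldest pending command whose type is under quota, else return None."""
        for i, (ctype, payload) in enumerate(self.pending):
            if self.running[ctype] < self.quotas.get(ctype, 1):
                del self.pending[i]
                self.running[ctype] += 1
                return ctype, payload
        return None

    def done(self, command_type):
        """Release the slot held by a finished command."""
        self.running[command_type] -= 1

q = QuotaQueue({"check": 1, "start": 2})
for n in range(3):
    q.submit("check", n)
q.submit("start", "job-a")

first = q.take()   # ("check", 0)
second = q.take()  # "check" is at quota, so the "start" command goes next
third = q.take()   # None: only "check" commands remain and the quota is used up
q.done("check")
fourth = q.take()  # ("check", 1)
```

Without the quota, the three `check` commands would occupy workers ahead of `job-a`; with it, a burst of one command type cannot starve the others.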
  • 7.
    • Deployment Model
      • Started: 1 Tomcat / multiple Oozies
      • Now: multiple instances, each 1 Tomcat / 1 Oozie
    • Database
      • Migrated from MySQL to Oracle
    … The First Year …
  • 8.
    • Co-existence with Hadoop
      • When JT/NN are slow, Oozie users complain that Oozie is slow
      • Bad workflows can overload JT/NN
        • fork of 2000+ MR
        • Java action looping waiting for files to become available
      • Hadoop patching requires a synchronized patching of Oozie (because of Hadoop-RPC compatibility issues)
      • Different Y! clusters use different Hadoop versions (it requires juggling with Oozie code to avoid more branches)
    … The First Year …
  • 9.
    • Implementation changes
      • Deprecated SSH action, added JAVA action
      • MR/Pig actions are started via a launcher M(1)R(0) job
      • Improved user logging (especially for Pig)
      • Removed external calls from within DB TRX (nasty one)
      • Using (Open)JPA for DB access
    • Got right from the beginning
      • Backward compatibility for API and PDL: ALWAYS KEPT
      • Heavy use of asynchronous command execution (queue + threadpool)
      • Instrumentation data (for monitoring)
    … The First Year
  • 10.
  • 11.
    • A Workflow job MUST NOT be started until all external input is available
    RULE for Oozie Workflows
  • 12.
    • What is Oozie 2? It is Oozie 1 PLUS …
      • Time+Data driven execution of workflow jobs
        • Workflow job is scheduled at a regular frequency
        • Workflow job is started when all input data is available
    Oozie 2 Coordinator
    [Diagram: Coordinator app with frequency f; IN → Workflow → OUT]
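The "started when all input data is available" rule can be sketched as a simple readiness check. This is a minimal illustration, not the coordinator's implementation: the `missing_inputs` and `action_state` helpers, the `READY` label, and the `exists` callable (which stands in for a file system lookup) are assumptions, and the URIs keep the elided host from the slides.

```python
def missing_inputs(input_uris, exists):
    """Return the input URIs that `exists` does not report as present yet."""
    return [uri for uri in input_uris if not exists(uri)]

def action_state(input_uris, exists):
    """WAITING until every input is present, then READY to start the workflow job."""
    return "WAITING" if missing_inputs(input_uris, exists) else "READY"

# A set of URIs plays the role of the file system in this sketch.
available = {"hdfs://.../ph1/2010/01/01/01/55"}
inputs = [
    "hdfs://.../ph1/2010/01/01/01/55",
    "hdfs://.../ph1/2010/01/01/02/00",
]

before = action_state(inputs, available.__contains__)  # "WAITING"
available.add("hdfs://.../ph1/2010/01/01/02/00")
after = action_state(inputs, available.__contains__)   # "READY"
```

Passing `exists` in as a parameter keeps the check independent of any particular storage API.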
  • 13. Use Cases: Data Pipelines
    [Diagram: timelines from 01JAN to 31DEC. LOG and PH1 instances at 1:05, 1:10, 1:15, ... 2:00 with f (5min); a PH2 instance at 2:00 with f (60min); also labeled WS]
  • 14.
    • A coordinator application can be parameterized
    • Coordinator jobs have frequency, start & end date
    • Every tick of the frequency a coordinator action is created
    • The coordinator action starts a workflow job only when all input data is available
    • Coordinator applications define their input/output data
    • Input/output data is (normally) relative to the action creation time (set by the job frequency) and is expressed as URI templates:
    • hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}
    Coordinator Applications
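The two mechanisms on this slide, one action per frequency tick and URI templates resolved against a time, can be sketched together. This is an illustration, not Oozie's resolver: the `nominal_times` and `resolve` helpers are hypothetical, the end date is assumed to be exclusive, the fields are assumed to be zero-padded, and the elided host in the template is kept exactly as it appears on the slide. Python's `string.Template` happens to use the same `${NAME}` placeholder syntax.

```python
from datetime import datetime, timedelta
from string import Template

URI_TEMPLATE = Template("hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}")

def nominal_times(start, end, frequency):
    """Yield one action creation time per frequency tick in [start, end)."""
    t = start
    while t < end:
        yield t
        t += frequency

def resolve(template, instant):
    """Fill the template's time fields from `instant` (zero padding is an assumption)."""
    return template.substitute(
        YEAR=f"{instant.year:04d}",
        MONTH=f"{instant.month:02d}",
        DAY=f"{instant.day:02d}",
        HOUR=f"{instant.hour:02d}",
        MIN=f"{instant.minute:02d}",
    )

ticks = list(nominal_times(
    datetime(2010, 1, 1, 1, 0),
    datetime(2010, 1, 1, 4, 0),
    timedelta(minutes=60),
))
uris = [resolve(URI_TEMPLATE, t) for t in ticks]
print(uris[1])  # hdfs://.../ph1/2010/01/01/02/00
```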
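  • 15. Coordinator Input and Output Data
    [Diagram: timeline from 01JAN to 31DEC. Input PH1 instances at 1:05, 1:10, 1:15, ... 2:00 with input frequency f i (5min), labeled ${current(-11)}, ${current(-10)}, ${current(-9)}, ... ${current(0)}; output PH2 instance at 2:00 with output frequency f o (60min), labeled ${current(0)}; job frequency f j (60min); IN → Workflow → OUT]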
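One way to read the `${current(n)}` labels on this slide is as offsets, in dataset instances, from the latest instance at or before the action's creation time. The sketch below works under that assumption; the `current` function signature and the `initial_instance` parameter are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta

def current(n, nominal_time, initial_instance, frequency):
    """Time of the dataset instance n steps from the latest one at or before nominal_time."""
    # timedelta // timedelta floors to a whole number of instances.
    steps = (nominal_time - initial_instance) // frequency
    return initial_instance + (steps + n) * frequency

nominal = datetime(2010, 1, 1, 2, 0)   # action created at 2:00
initial = datetime(2010, 1, 1, 0, 0)   # assumed first instance of the 5-minute dataset
five_min = timedelta(minutes=5)

# One hour of 5-minute input: current(-11) through current(0).
inputs = [current(n, nominal, initial, five_min) for n in range(-11, 1)]
print(inputs[0].strftime("%H:%M"), inputs[-1].strftime("%H:%M"))  # 01:05 02:00
```

With a 60-minute job and a 5-minute input dataset, the twelve instances from 1:05 to 2:00 are exactly the inputs the diagram shows for the 2:00 action.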
  • 16.
    • The number of minutes and hours in a day changes on a per-time-zone (TZ) basis
      • Hours in March == 31 * 24 ? YES & NO
    • A day of hourly datasets is always 24 instances? YES & NO
    • How about mixing datasets from different US TZs?
    • How about mixing datasets from different TZs from different regions/countries?
    • SOLUTION: Built-In Support for TZ/DS
    Daylight Saving is Evil
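The "YES & NO" answers on this slide can be checked with a little arithmetic. The sketch below hard-codes the US Pacific rule in force since 2007 (clocks move forward at 02:00 on the second Sunday of March) rather than using a time zone database; the helper names are hypothetical, and only the March transition is handled.

```python
from datetime import datetime, timedelta

PST = timedelta(hours=-8)  # UTC offset in standard time
PDT = timedelta(hours=-7)  # UTC offset in daylight time

def dst_start_utc(year):
    """UTC instant of the spring-forward: second Sunday of March, 02:00 PST."""
    first = datetime(year, 3, 1)
    # weekday(): Monday is 0, Sunday is 6.
    second_sunday = first + timedelta(days=(6 - first.weekday()) % 7 + 7)
    return second_sunday.replace(hour=2) - PST

def local_midnight_utc(day):
    """UTC instant of local midnight on `day` (ignores the November fall-back)."""
    utc = day - PST
    if utc >= dst_start_utc(day.year):
        utc = day - PDT
    return utc

def elapsed_hours(start_day, end_day):
    """Real hours between two local midnights."""
    return (local_midnight_utc(end_day) - local_midnight_utc(start_day)) // timedelta(hours=1)

hours_in_march = elapsed_hours(datetime(2010, 3, 1), datetime(2010, 4, 1))
hours_on_switch_day = elapsed_hours(datetime(2010, 3, 14), datetime(2010, 3, 15))
hours_on_normal_day = elapsed_hours(datetime(2010, 3, 15), datetime(2010, 3, 16))
print(hours_in_march, 31 * 24)  # 743 744
```

In this zone, March 2010 has 743 hours rather than 744, and an hourly dataset has 23 instances on 14 March but 24 on the following day.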
  • 17.
    • Automatic temporary back-off from JT/NN when down or too slow
    • Map-Reduce and Pig job submission over HTTP (without a workflow)
    • High Availability (via Zookeeper)
    • Improved Workflow Schema
    • Complete Coordinator specification support (asynch datasets and apps)
    • More user friendly functions
    • Integration with metadata system
    • Coordinator reprocessing features
    • Coordinator application bundles (manage many coord jobs as one unit)
    What is Next?
  • 18.
    • http://developer.yahoo.com/hadoop
    • http://yahoo.github.com/oozie
    Getting Oozie
  • 19. Questions?
    • Alejandro Abdelnur
    • [email_address]