Workflow on Hadoop Using Oozie__HadoopSummit2010
Hadoop Summit 2010 - Developers Track
Workflow on Hadoop Using Oozie
Alejandro Abdelnur, Yahoo!

Published in: Technology
  • Let me try to formalize things a bit. A coordinator application can be parameterized: the locations of its inputs and outputs, and any special settings for its workflow jobs. A coordinator job has a start date, an end date and a frequency, all of which can also be parameterized. On every tick of the frequency a workflow job is scheduled, but it stays in the WAITING state until all of its input is available. A coordinator application defines all of its input and output data, which are normally relative to the frequency and to the time at which the workflow jobs are scheduled.
  • Transcript

    • 1. Yahoo! Workflow Engine for Hadoop
      • Alejandro Abdelnur
      Yahoo!
    • 2. Session Agenda
      • Oozie workflow engine (Oozie 1)
      • Oozie coordinator engine (Oozie 2)
      • Getting Oozie
    • 3. Oozie 1, Workflow
      • What was Oozie?
        • An Oozie workflow is a DAG of MR/Pig/FS/Java/sub-workflow actions
        • Workflow applications are written in a PDL (process definition language) expressed in XML
        • Workflow applications are parameterized
        • Oozie is a server
          • It is transactional, reliable and it scales
          • HTTP REST API only (the Java API, CLI and console are built on top of it)
          • Implementation: Java web-app + SQL DB
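The "DAG of actions" idea on this slide can be illustrated with a minimal Python sketch. This is not Oozie's PDL or its API: the node names, the dictionary layout and the `execution_order` helper are all hypothetical, chosen only to show how dependencies between actions constrain the order in which they can run.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow (not Oozie syntax):
# node name -> (action type, set of nodes that must finish first).
workflow = {
    "prepare-dirs": ("fs", set()),
    "aggregate": ("map-reduce", {"prepare-dirs"}),
    "filter": ("pig", {"prepare-dirs"}),
    "report": ("java", {"aggregate", "filter"}),
}


def execution_order(wf):
    """Return one valid execution order for the workflow.

    Raises graphlib.CycleError if the nodes do not form a DAG.
    """
    graph = {name: deps for name, (_action_type, deps) in wf.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
print(order)  # "prepare-dirs" comes first and "report" comes last
```

In this sketch "aggregate" and "filter" have no dependency on each other, so they could run in parallel, which is roughly what a fork in a workflow expresses.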
    • 4. Users Experience
      • “Oozie has enabled us to reduce our index building operation from a manually intensive 4-days process to 6-hours fully automated process...”
        • Keyword Research Service team
      • “… It saved us tremendous amount of time and resources not to develop alternative custom solution to manage our complex workflows on the Grid…”
        • Segment Manager team
    • 5. Some Numbers
      • Oozie users: 50
      • Workflow applications: 4868
      • Largest workflow: 2000 action nodes
      • Average action nodes per workflow: 18
      • Workflow jobs in last month: 55K
      • Longest running workflow job: 17 hours
      • Workflow action nodes by type (chart; labels and values listed in extracted order):
        • Labels: Map-Red, Pig, File System, Java, Sub-Workflow
        • Values: 23%, 30%, 19%, 18%, 4%
    • 6. The First Year …
      • Releases
        • 4 feature releases, 6 patches, 1 DB schema change (from Oozie 1 to 2)
      • Failures? YES (recovered from them? YES)
        • Servlet-container (Tomcat) and database (MySQL)
        • Did we lose workflow jobs data? NO
        • Code issues that caused failures:
          • DB connection leaks (fix: use command pattern all over)
          • Thread pool starvation (fix: added thread quota per command type)
          • HDFS connection leaks (fix: 2nd-level caching)
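The "thread quota per command type" fix mentioned above can be sketched as follows. This is an illustration of the general idea, not Oozie's implementation: the `QuotaQueue` class, its method names and the command types are hypothetical. The point is that a burst of one kind of command cannot occupy every worker, because each type is capped.

```python
from collections import Counter, deque


class QuotaQueue:
    """Command queue that caps how many commands of each type run at once."""

    def __init__(self, quota_per_type):
        self.quota = quota_per_type  # e.g. {"check": 1, "submit": 2}
        self.pending = deque()       # (command_type, payload) in arrival order
        self.running = Counter()     # command_type -> commands in flight

    def put(self, command_type, payload):
        self.pending.append((command_type, payload))

    def take(self):
        """Return the oldest pending command whose type is under quota, or None."""
        for index, (command_type, _payload) in enumerate(self.pending):
            # Types without an explicit quota default to 1 in this sketch.
            if self.running[command_type] < self.quota.get(command_type, 1):
                break
        else:
            return None
        command = self.pending[index]
        del self.pending[index]
        self.running[command[0]] += 1
        return command

    def done(self, command_type):
        """Release the slot held by a finished command."""
        self.running[command_type] -= 1


queue = QuotaQueue({"check": 1, "submit": 2})
queue.put("check", "job-a")
queue.put("check", "job-b")
queue.put("submit", "job-c")

first = queue.take()   # oldest command, type "check"
second = queue.take()  # "check" is at quota, so the "submit" command is handed out
third = queue.take()   # nothing eligible until a "check" finishes
queue.done("check")
fourth = queue.take()  # the second "check" command can now run
```

A worker thread would call `take()` in a loop and `done()` when the command finishes; locking is omitted here to keep the sketch short.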
    • 7. … The First Year …
      • Deployment Model
        • Started: 1 Tomcat / multiple Oozies
        • Now: multiple Tomcats, each running 1 Oozie
      • Database
        • Migrated from MySQL to Oracle
    • 8. … The First Year …
      • Co-existence with Hadoop
        • When JT/NN are slow, Oozie users complain that Oozie is slow
        • Bad workflows can overload JT/NN
          • Fork of 2000+ MR jobs
          • Java action looping while waiting for files to become available
        • Hadoop patching requires a synchronized patching of Oozie (because of Hadoop-RPC compatibility issues)
        • Different Y! clusters use different Hadoop versions (it requires juggling with Oozie code to avoid more branches)
    • 9. … The First Year
      • Implementation changes
        • Deprecated SSH action, added Java action
        • MR/Pig actions are started via a launcher M(1)R(0) job
        • Improved user logging (especially for Pig)
        • Removed external calls from within DB transactions (nasty one)
        • Using (Open)JPA for DB access
      • Got right from the beginning
        • Backward compatibility for API and PDL: ALWAYS KEPT
        • Heavy use of asynchronous command execution (queue + thread pool)
        • Instrumentation data (for monitoring)
    • 10.
    • 11. RULE for Oozie Workflows
      • A Workflow job MUST NOT be started until all external input is available
    • 12. Oozie 2 Coordinator
      • What is Oozie 2? It is Oozie 1 PLUS …
        • Time+Data driven execution of workflow jobs
          • Workflow job is scheduled at a regular frequency
          • Workflow job is started when all input data is available
      • (Diagram: a coordinator app with frequency f; IN, Workflow, OUT)
    • 13. Use Cases: Data Pipelines
      • (Diagram labels: WS; f (5min); LOG 1:05, LOG 1:10, LOG 1:15, LOG 2:00; PH1 1:05, PH1 1:10, PH1 1:15, PH1 2:00; f (60min); PH2 2:00; date range 01JAN to 31DEC)
    • 14. Coordinator Applications
      • A coordinator application can be parameterized
      • Coordinator jobs have a frequency, a start date and an end date
      • On every tick of the frequency, a coordinator action is created
      • The coordinator action starts a workflow job only when all input data is available
      • Coordinator applications define their input/output data
      • Input/output data is (normally) relative to the action creation time (the job frequency) and is expressed as URI templates:
        • hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}
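The coordinator rules on this slide can be sketched in Python. This is a simplified model, not Oozie code: the helper names (`nominal_times`, `resolve`, `ready`) are hypothetical, the end date is assumed to be exclusive, and zero-padded fields are an assumption about how the template variables are filled. The URI template is the one from the slide, with its elided host left as-is.

```python
from datetime import datetime, timedelta
from string import Template

# URI template taken verbatim from the slide (the "..." is elided in the source).
URI_TEMPLATE = Template("hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}")


def nominal_times(start, end, frequency):
    """Yield one action creation time per frequency tick (end assumed exclusive)."""
    tick = start
    while tick < end:
        yield tick
        tick += frequency


def resolve(template, when):
    """Fill the template variables from a timestamp (zero padding is an assumption)."""
    return template.substitute(
        YEAR=f"{when.year:04d}",
        MONTH=f"{when.month:02d}",
        DAY=f"{when.day:02d}",
        HOUR=f"{when.hour:02d}",
        MIN=f"{when.minute:02d}",
    )


def ready(input_uris, available_uris):
    """An action may start its workflow only when every input exists."""
    return all(uri in available_uris for uri in input_uris)


ticks = list(
    nominal_times(
        datetime(2010, 1, 1, 1, 0),
        datetime(2010, 1, 1, 4, 0),
        timedelta(hours=1),
    )
)
uri = resolve(URI_TEMPLATE, ticks[1])
print(uri)  # hdfs://.../ph1/2010/01/01/02/00
```

With a set of URIs that already exist, `ready([uri], existing)` stays false until the resolved path appears in `existing`, which mirrors an action waiting for its input.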
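    • 15. Coordinator Input and Output Data
      • (Diagram: IN, Workflow, OUT. Job frequency f_j (60min). Input dataset PH1, frequency f_i (5min), instances 1:05, 1:10, 1:15, 2:00 shown, with labels ${current(-11)}, ${current(-10)}, ${current(-9)}, ${current(0)}. Output dataset PH2, frequency f_o (60min), instance 2:00, with label ${current(0)}. Date range 01JAN to 31DEC)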
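The `${current(n)}` labels in this diagram can be read as offsets, counted in dataset instances, from the action's nominal time. The sketch below is a simplified model and not Oozie's implementation: it assumes the dataset instances line up exactly with the nominal time, and the `current` function here is a hypothetical Python stand-in for the expression on the slide.

```python
from datetime import datetime, timedelta


def current(n, nominal_time, dataset_frequency):
    """Timestamp of the n-th dataset instance relative to the nominal time.

    Assumes dataset instances are aligned with the nominal time.
    """
    return nominal_time + n * dataset_frequency


nominal = datetime(2010, 1, 1, 2, 0)  # the action created at 2:00
five_minutes = timedelta(minutes=5)   # input dataset frequency, f_i

# Inputs from current(-11) to current(0): one hour of 5-minute instances.
inputs = [current(n, nominal, five_minutes) for n in range(-11, 1)]
print(inputs[0].strftime("%H:%M"), inputs[-1].strftime("%H:%M"))  # 01:05 02:00
```

Under these assumptions the 60-minute job at 2:00 consumes twelve 5-minute instances, 1:05 through 2:00, and produces the single output instance `current(0)` at 2:00.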
    • 16. Daylight Saving is Evil
      • Minutes and hours in a day change on a per-TZ basis
        • Hours in March == 31 * 24? YES & NO
      • Is a day of hourly datasets always 24 instances? YES & NO
      • How about mixing datasets from different US TZs?
      • How about mixing datasets from different TZs from different regions/countries?
      • SOLUTION: Built-in support for TZ/DS
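The "YES & NO" answer about hours in March can be checked with a short Python sketch. It uses the standard `zoneinfo` module and assumes a time zone database is available on the machine; the `hours_in_month` helper is hypothetical and only serves as a worked example, it is not how Oozie handles time zones.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def hours_in_month(year, month, tz_name):
    """Count the real elapsed hours in a calendar month of the given time zone."""
    tz = ZoneInfo(tz_name)
    start = datetime(year, month, 1, tzinfo=tz)
    next_year = year + 1 if month == 12 else year
    next_month = 1 if month == 12 else month + 1
    end = datetime(next_year, next_month, 1, tzinfo=tz)
    # Convert to UTC first: subtracting two datetimes that share a tzinfo
    # compares wall-clock time and would hide the daylight saving shift.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return int(elapsed.total_seconds() // 3600)


print(hours_in_month(2010, 3, "UTC"))                  # 744, which is 31 * 24
print(hours_in_month(2010, 3, "America/Los_Angeles"))  # 743, one hour lost to DST
```

A coordinator that expects 744 hourly instances for March would therefore wait forever on a dataset produced in a zone that observes daylight saving, which is the kind of mismatch built-in TZ/DS support is meant to avoid.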
    • 17. What is Next?
      • Automatic temporary back-off from JT/NN when down or too slow
      • Map-Reduce and Pig job submission over HTTP (w/o WF)
      • High Availability (via ZooKeeper)
      • Improved Workflow Schema
      • Complete Coordinator specification support (asynchronous datasets and apps)
      • More user-friendly functions
      • Integration with metadata system
      • Coordinator reprocessing features
      • Coordinator application bundles (manage many coordinator jobs as one unit)
    • 18. Getting Oozie
      • http://developer.yahoo.com/hadoop
      • http://yahoo.github.com/oozie
    • 19. Questions?
      • Alejandro Abdelnur
      • [email_address]
