Workflow on Hadoop Using Oozie__HadoopSummit2010


Hadoop Summit 2010 - Developers Track
Workflow on Hadoop Using Oozie
Alejandro Abdelnur, Yahoo!

Published in: Technology
  • Speaker note on coordinator applications: Let me try to formalize things a bit. A coordinator application can be parameterized: the locations of its inputs and outputs, and any special settings for the workflow jobs. It has a start date, an end date and a frequency, which can also be parameterized. On every frequency tick a workflow job is scheduled, but it stays in WAITING state until all of its input is available. A coordinator application defines all of its inputs and outputs, and these are normally relative to the frequency and to the time at which the workflow jobs are scheduled.
  • Transcript of "Workflow on Hadoop Using Oozie__HadoopSummit2010"

    1. Yahoo! Workflow Engine for Hadoop <ul><li>Alejandro Abdelnur, Yahoo!</li></ul>
    2. Session Agenda <ul><li>Oozie workflow engine (Oozie 1)</li><li>Oozie coordinator engine (Oozie 2)</li><li>Getting Oozie</li></ul>
    3. Oozie 1, Workflow <ul><li>What was Oozie? <ul><li>An Oozie workflow is a DAG of MR/Pig/FS/Java/sub-workflow actions</li><li>Workflow applications are written in a PDL in XML</li><li>Workflow applications are parameterized</li><li>Oozie is a server <ul><li>It is transactional, reliable and it scales</li><li>HTTP REST API only (Java API, CLI, console on top of it)</li><li>Implementation: Java web-app + SQL DB</li></ul></li></ul></li></ul>
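To make the "workflow is a DAG of actions" idea concrete, here is a minimal Python sketch. It is not Oozie's PDL or its scheduler: the action names are hypothetical, and only the action types (fs, pig, map-reduce, java, sub-workflow) come from the slide. It shows how a dependency graph yields a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
# The names are invented for illustration; the types in parentheses are the
# action types listed on the slide.
workflow = {
    "prepare-dirs (fs)": set(),
    "clean-logs (pig)": {"prepare-dirs (fs)"},
    "aggregate (map-reduce)": {"clean-logs (pig)"},
    "validate (java)": {"clean-logs (pig)"},
    "publish (sub-workflow)": {"aggregate (map-reduce)", "validate (java)"},
}


def execution_order(dag):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

The two middle actions have no dependency on each other, so a real engine could run them in parallel; the sketch only guarantees that every action appears after the actions it depends on.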
    4. User Experience <ul><li>“Oozie has enabled us to reduce our index building operation from a manually intensive 4-days process to 6-hours fully automated process...” (Keyword Research Service team)</li><li>“… It saved us tremendous amount of time and resources not to develop alternative custom solution to manage our complex workflows on the Grid…” (Segment Manager team)</li></ul>
    5. Some Numbers <ul><li>Oozie users: 50</li><li>Workflow applications: 4868</li><li>Largest workflow: 2000 action nodes</li><li>Average action nodes per workflow: 18</li><li>Workflow jobs in last month: 55K</li><li>Workflow action nodes by type (chart; labels in slide order: Map-Red, Pig, File System, Java, Sub-Workflow; values in slide order: 23%, 30%, 19%, 18%, 4%)</li><li>Longest running workflow job: 17 hours</li></ul>
    6. The First Year … <ul><li>Releases <ul><li>4 feature releases, 6 patches, 1 DB schema change (from Oozie 1 to 2)</li></ul></li><li>Failures? YES (recovered from them? YES) <ul><li>Servlet container (Tomcat) and database (MySQL)</li><li>Did we lose workflow job data? NO</li><li>Code issues that caused failures: <ul><li>DB connection leaks (fix: use the command pattern all over)</li><li>Thread pool starvation (fix: added a thread quota per command type)</li><li>HDFS connection leaks (fix: 2nd-level caching)</li></ul></li></ul></li></ul>
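The "thread quota per command type" fix can be illustrated with a small Python sketch. It is an assumption about the general technique, not Oozie's actual code: the command type names, the `pick_next` helper and the quota values are all hypothetical. The idea is that a dispatcher skips pending commands whose type already occupies its share of threads, so one busy command type cannot starve the others.

```python
from collections import Counter


def pick_next(pending, in_flight, quota):
    """Remove and return the first pending command whose type is under quota.

    pending   -- list of (command_type, payload) in arrival order
    in_flight -- Counter of currently running commands per type
    quota     -- dict mapping command type to its max concurrent commands
    Returns None when every pending command is blocked by its quota.
    """
    for i, (ctype, _payload) in enumerate(pending):
        if in_flight[ctype] < quota[ctype]:
            in_flight[ctype] += 1
            return pending.pop(i)
    return None


# Hypothetical command types and quotas, for illustration only.
pending = [("action-start", "wf-1"), ("action-start", "wf-2"), ("callback", "wf-3")]
in_flight = Counter()
quota = {"action-start": 1, "callback": 2}

first = pick_next(pending, in_flight, quota)   # an action-start command runs
second = pick_next(pending, in_flight, quota)  # action-start is at quota, so the callback runs
third = pick_next(pending, in_flight, quota)   # only a blocked action-start is left

in_flight["action-start"] -= 1                 # the first command finishes
fourth = pick_next(pending, in_flight, quota)  # the waiting action-start can now run
```

A production dispatcher would also need locking and a wake-up mechanism when a slot frees; those are left out to keep the quota check itself visible.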
    7. … The First Year … <ul><li>Deployment Model <ul><li>Started: 1 Tomcat / multiple Oozies</li><li>Now: multiple instances of 1 Tomcat / 1 Oozie</li></ul></li><li>Database <ul><li>Migrated from MySQL to Oracle</li></ul></li></ul>
    8. … The First Year … <ul><li>Co-existence with Hadoop <ul><li>When JT/NN are slow, Oozie users complain that Oozie is slow</li><li>Bad workflows can overload JT/NN <ul><li>Fork of 2000+ MR jobs</li><li>Java action looping while waiting for files to become available</li></ul></li><li>Hadoop patching requires synchronized patching of Oozie (because of Hadoop-RPC compatibility issues)</li><li>Different Y! clusters use different Hadoop versions (this requires juggling the Oozie code to avoid more branches)</li></ul></li></ul>
    9. … The First Year <ul><li>Implementation changes <ul><li>Deprecated SSH action, added Java action</li><li>MR/Pig actions are started via a launcher M(1)R(0) job (one map task, zero reduce tasks)</li><li>Improved user logging (especially for Pig)</li><li>Removed external calls from within DB transactions (nasty one)</li><li>Using (Open)JPA for DB access</li></ul></li><li>Got right from the beginning <ul><li>Backward compatibility for API and PDL: ALWAYS KEPT</li><li>Heavy use of asynchronous command execution (queue + thread pool)</li><li>Instrumentation data (for monitoring)</li></ul></li></ul>
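As a rough illustration of "asynchronous command execution (queue + thread pool)", the Python sketch below queues callables on a shared pool and collects their results. It only demonstrates the general pattern using the standard library; the `run_commands` helper and the command payloads are hypothetical and do not reflect Oozie's internal command classes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, max_workers=4):
    """Queue every command on a shared thread pool; return results in submit order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit() returns immediately; the pool's internal queue holds the
        # commands until a worker thread is free.
        futures = [pool.submit(command) for command in commands]
        return [future.result() for future in futures]


# Hypothetical commands: each one stands in for a unit of engine work.
commands = [lambda name=name: f"started {name}" for name in ("wf-1", "wf-2", "wf-3")]
results = run_commands(commands, max_workers=2)
print(results)
```

Because the caller only enqueues work, a slow command delays other queued commands rather than the thread that accepted the request.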
    10. (no text on this slide)
    11. RULE for Oozie Workflows <ul><li>A Workflow job MUST NOT be started until all external input is available</li></ul>
    12. Oozie 2 Coordinator <ul><li>What is Oozie 2? It is Oozie 1 PLUS … <ul><li>Time+Data driven execution of workflow jobs <ul><li>Workflow job is scheduled at a regular frequency</li><li>Workflow job is started when all input data is available</li></ul></li></ul></li></ul> [Diagram: Coordinator app with frequency f; IN, Workflow, OUT]
    13. Use Cases: Data Pipelines [Diagram labels: WS; f (5min); f (60min); LOG 1:05, 1:10, 1:15, 2:00; PH1 1:05, 1:10, 1:15, 2:00; PH2 2:00; 01JAN to 31DEC]
    14. Coordinator Applications <ul><li>A coordinator application can be parameterized</li><li>Coordinator jobs have a frequency, a start date and an end date</li><li>On every tick of the frequency a coordinator action is created</li><li>The coordinator action starts a workflow job only when all input data is available</li><li>Coordinator applications define their input/output data</li><li>Input/output data is (normally) relative to the action creation time (the job frequency) and is expressed as URI templates: <ul><li>hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}</li></ul></li></ul>
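The coordinator bullets above can be sketched in Python as two small steps: generate one action creation time per frequency tick, then resolve the URI template for a given time. This is a simplified illustration, not Oozie's implementation. The template is copied from the slide (including its elided host part), while the `ticks` and `resolve` helpers, the zero-padding of fields and the exclusive end date are assumptions.

```python
from datetime import datetime, timedelta

# Template as shown on the slide; the elided "..." part is left as-is.
TEMPLATE = "hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}"


def ticks(start, end, frequency):
    """Yield one action creation time per frequency tick in [start, end).

    Treating the end date as exclusive is an assumption of this sketch.
    """
    t = start
    while t < end:
        yield t
        t += frequency


def resolve(template, t):
    """Fill the template's time variables; zero-padding is an assumption."""
    fields = {
        "YEAR": f"{t.year:04d}",
        "MONTH": f"{t.month:02d}",
        "DAY": f"{t.day:02d}",
        "HOUR": f"{t.hour:02d}",
        "MIN": f"{t.minute:02d}",
    }
    for name, value in fields.items():
        template = template.replace("${" + name + "}", value)
    return template


start = datetime(2010, 1, 1, 1, 5)
creation_times = list(ticks(start, start + timedelta(hours=1), timedelta(minutes=5)))
uris = [resolve(TEMPLATE, t) for t in creation_times]
print(uris[0])
```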
    15. Coordinator Input and Output Data [Diagram: job frequency f_j (60min), input frequency f_i (5min), output frequency f_o (60min), 01JAN to 31DEC; inputs PH1 1:05, 1:10, 1:15, ..., 2:00 referenced as ${current(-11)}, ${current(-10)}, ${current(-9)}, ..., ${current(0)}; output PH2 2:00 referenced as ${current(0)}; IN, Workflow, OUT]
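The `${current(n)}` references in the diagram can be read as offsets, in units of the dataset frequency, from the time the coordinator action is created. The Python sketch below is a simplified model under that reading: it assumes the action's nominal time falls exactly on a dataset instance, and the `current` and `input_instances` helpers are hypothetical names, not Oozie functions.

```python
from datetime import datetime, timedelta


def current(nominal_time, n, dataset_frequency):
    """Return the dataset instance n steps from the nominal time (n <= 0 is the past).

    Simplification: assumes nominal_time is itself a dataset instance.
    """
    return nominal_time + n * dataset_frequency


def input_instances(nominal_time, first, last, dataset_frequency):
    """Return the instances current(first) .. current(last), oldest first."""
    return [current(nominal_time, n, dataset_frequency) for n in range(first, last + 1)]


# An hourly job created at 2:00 that reads the twelve 5-minute PH1 instances
# from current(-11) to current(0), as in the diagram.
nominal = datetime(2010, 1, 1, 2, 0)
instances = input_instances(nominal, -11, 0, timedelta(minutes=5))
print(instances[0].time(), instances[-1].time())
```

With these values the range spans 1:05 to 2:00, which matches the PH1 instances shown feeding the 2:00 PH2 output.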
    16. Daylight Saving is Evil <ul><li>Minutes and hours in a day change on a per-TZ basis <ul><li>Hours in March == 31 * 24? YES and NO</li></ul></li><li>A day of hourly datasets is always 24 instances? YES and NO</li><li>How about mixing datasets from different US TZs?</li><li>How about mixing datasets from different TZs from different regions/countries?</li><li>SOLUTION: Built-in support for TZ/DS</li></ul>
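The "YES and NO" answers can be checked with a short Python sketch that measures real elapsed hours between two local wall-clock times. It uses the standard `zoneinfo` module (Python 3.9+, with a time zone database available on the system); the helper names and the choice of `America/Los_Angeles` and the year 2010 are illustrative assumptions, not something the slides specify.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def elapsed_hours(start_local, end_local):
    """Real elapsed hours between two timezone-aware datetimes.

    Both values are converted to UTC first, because subtracting two datetimes
    that share a tzinfo compares wall-clock values and hides DST shifts.
    """
    delta = end_local.astimezone(timezone.utc) - start_local.astimezone(timezone.utc)
    return delta / timedelta(hours=1)


def hours_in_march(year, tz_name):
    """Hours between March 1 and April 1 (local midnight) in the given zone."""
    tz = ZoneInfo(tz_name)
    return elapsed_hours(datetime(year, 3, 1, tzinfo=tz), datetime(year, 4, 1, tzinfo=tz))


march_utc = hours_in_march(2010, "UTC")
march_pacific = hours_in_march(2010, "America/Los_Angeles")

pacific = ZoneInfo("America/Los_Angeles")
spring_forward_day = elapsed_hours(
    datetime(2010, 3, 14, tzinfo=pacific), datetime(2010, 3, 15, tzinfo=pacific)
)
print(march_utc, march_pacific, spring_forward_day)
```

In UTC, March has 31 * 24 = 744 hours; under US Pacific rules for 2010 it has 743, and the day the clocks move forward (March 14) has 23, so a "day of hourly datasets" is not always 24 instances.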
    17. What is Next? <ul><li>Automatic temporary back-off from JT/NN when down or too slow</li><li>Map-Reduce and Pig job submission over HTTP (w/o WF)</li><li>High Availability (via ZooKeeper)</li><li>Improved Workflow Schema</li><li>Complete Coordinator specification support (async datasets and apps)</li><li>More user-friendly functions</li><li>Integration with metadata system</li><li>Coordinator reprocessing features</li><li>Coordinator application bundles (manage many coord jobs as one unit)</li></ul>
    18. Getting Oozie <ul><li>http://developer.yahoo.com/hadoop</li><li>http://yahoo.github.com/oozie</li></ul>
    19. Questions? <ul><li>Alejandro Abdelnur</li><li>[email_address]</li></ul>
