Workflow on Hadoop Using Oozie__HadoopSummit2010

Hadoop Summit 2010 - Developers Track
Workflow on Hadoop Using Oozie
Alejandro Abdelnur, Yahoo!

Published in: Technology

  1. 1. Yahoo! Workflow Engine for Hadoop <ul><li>Alejandro Abdelnur </li></ul>Yahoo!
  2. 2. Session Agenda <ul><li>Oozie workflow engine (Oozie 1) </li></ul><ul><li>Oozie coordinator engine (Oozie 2) </li></ul><ul><li>Getting Oozie </li></ul>
  3. 3. Oozie 1, Workflow <ul><li>What was Oozie? </li></ul><ul><ul><li>An Oozie workflow is a DAG of MR/Pig/FS/Java/sub-workflow actions </li></ul></ul><ul><ul><li>Workflow applications are written in a PDL (process definition language) expressed in XML </li></ul></ul><ul><ul><li>Workflow applications are parameterized </li></ul></ul><ul><ul><li>Oozie is a server </li></ul></ul><ul><ul><ul><li>It is transactional, reliable, and scalable </li></ul></ul></ul><ul><ul><ul><li>HTTP REST API only (the Java API, CLI, and console are built on top of it) </li></ul></ul></ul><ul><ul><ul><li>Implementation: Java web-app + SQL DB </li></ul></ul></ul>
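The "DAG of actions" idea on this slide can be illustrated with a minimal Python sketch. This is not Oozie's PDL or its API: the action names and the `execution_order` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever validation the server actually performs.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "fs-cleanup": set(),
    "mr-extract": {"fs-cleanup"},
    "pig-aggregate": {"mr-extract"},
    "java-validate": {"mr-extract"},
    "sub-workflow-publish": {"pig-aggregate", "java-validate"},
}


def execution_order(actions):
    """Return one valid run order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(actions).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"workflow is not a DAG: {err.args[1]}") from err


order = execution_order(workflow)
```

Any order returned this way runs every action after its dependencies; a definition with a cycle is rejected before anything is submitted.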
  4. 4. User Experience <ul><li>“Oozie has enabled us to reduce our index building operation from a manually intensive 4-day process to a 6-hour fully automated process...” </li></ul><ul><li>Keyword Research Service team </li></ul><ul><li>“… It saved us a tremendous amount of time and resources by not having to develop an alternative custom solution to manage our complex workflows on the Grid…” </li></ul><ul><li>Segment Manager team </li></ul>
  5. 5. Some Numbers <ul><li>Oozie users: 50 </li></ul><ul><li>Workflow applications: 4868 </li></ul><ul><li>Largest workflow: 2000 action nodes </li></ul><ul><li>Average action nodes per workflow: 18 </li></ul><ul><li>Workflow jobs in last month: 55K </li></ul><ul><li>Workflow action nodes by type (pie chart; labels: Map-Red, Pig, File System, Java, Sub-Workflow; values: 23%, 30%, 19%, 18%, 4%) </li></ul><ul><li>Longest running workflow job: 17 hours </li></ul>
  6. 6. The First Year … <ul><li>Releases </li></ul><ul><ul><li>4 feature releases, 6 patches, 1 DB schema change (from Oozie 1 to 2) </li></ul></ul><ul><li>Failures? YES (recovered from them? YES) </li></ul><ul><ul><li>Servlet-container (Tomcat) and database (MySQL) </li></ul></ul><ul><ul><li>Did we lose workflow jobs data? NO </li></ul></ul><ul><ul><li>Code issues that caused failures: </li></ul></ul><ul><ul><ul><li>DB connection leaks (fix: use the command pattern everywhere) </li></ul></ul></ul><ul><ul><ul><li>Thread pool starvation (fix: added a thread quota per command type) </li></ul></ul></ul><ul><ul><ul><li>HDFS connection leaks (fix: 2nd-level caching) </li></ul></ul></ul>
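The thread-pool starvation fix mentioned on this slide (a thread quota per command type) could look roughly like the following Python sketch. Oozie itself is a Java web-app and its real implementation is not shown in the slides; the `QuotaExecutor` class, its method names, and the command-type labels are all hypothetical.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class QuotaExecutor:
    """Thread pool that caps concurrently running commands per command type."""

    def __init__(self, max_workers, quotas):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._quotas = dict(quotas)                    # type -> max concurrent
        self._running = {kind: 0 for kind in quotas}   # type -> currently admitted
        self._pending = deque()                        # commands waiting for quota
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)

    def submit(self, kind, fn, *args):
        """Queue a command; it starts once its type is under quota."""
        with self._lock:
            self._pending.append((kind, fn, args))
            self._dispatch()

    def _dispatch(self):
        # Caller must hold self._lock. Commands over quota stay queued, so one
        # busy command type cannot occupy every worker thread.
        deferred = deque()
        while self._pending:
            kind, fn, args = self._pending.popleft()
            if self._running[kind] < self._quotas[kind]:
                self._running[kind] += 1
                self._pool.submit(self._run, kind, fn, args)
            else:
                deferred.append((kind, fn, args))
        self._pending = deferred

    def _run(self, kind, fn, args):
        try:
            fn(*args)
        finally:
            with self._lock:
                self._running[kind] -= 1
                self._dispatch()
                self._idle.notify_all()

    def drain(self):
        """Block until every queued command has finished, then stop the pool."""
        with self._idle:
            self._idle.wait_for(
                lambda: not self._pending and not any(self._running.values())
            )
        self._pool.shutdown(wait=True)
```

For example, `QuotaExecutor(max_workers=4, quotas={"pig": 1, "mr": 2})` would leave at least one worker free for other command types even if many `"pig"` and `"mr"` commands are queued at once.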
  7. 7. … The First Year … <ul><li>Deployment Model </li></ul><ul><ul><li>Started: 1 Tomcat / multiple Oozies </li></ul></ul><ul><ul><li>Now: multiple 1 Tomcat / 1 Oozie pairs </li></ul></ul><ul><li>Database </li></ul><ul><ul><li>Migrated from MySQL to Oracle </li></ul></ul>
  8. 8. … The First Year … <ul><li>Co-existence with Hadoop </li></ul><ul><ul><li>When the JobTracker/NameNode (JT/NN) are slow, Oozie users complain that Oozie is slow </li></ul></ul><ul><ul><li>Bad workflows can overload the JT/NN </li></ul></ul><ul><ul><ul><li>A fork of 2000+ MR jobs </li></ul></ul></ul><ul><ul><ul><li>A Java action looping while it waits for files to become available </li></ul></ul></ul><ul><ul><li>Patching Hadoop requires synchronized patching of Oozie (because of Hadoop-RPC compatibility issues) </li></ul></ul><ul><ul><li>Different Y! clusters use different Hadoop versions (this requires juggling Oozie code to avoid more branches) </li></ul></ul>
  9. 9. … The First Year <ul><li>Implementation changes </li></ul><ul><ul><li>Deprecated SSH action, added JAVA action </li></ul></ul><ul><ul><li>MR/Pig actions are started via a launcher M(1)R(0) job </li></ul></ul><ul><ul><li>Improved user logging (especially for Pig) </li></ul></ul><ul><ul><li>Removed external calls from within DB transactions (nasty one) </li></ul></ul><ul><ul><li>Using (Open)JPA for DB access </li></ul></ul><ul><li>Got right from the beginning </li></ul><ul><ul><li>Backward compatibility for API and PDL: ALWAYS KEPT </li></ul></ul><ul><ul><li>Heavy use of asynchronous command execution (queue + thread pool) </li></ul></ul><ul><ul><li>Instrumentation data (for monitoring) </li></ul></ul>
  10. 10.
  11. 11. RULE for Oozie Workflows <ul><li>A Workflow job MUST NOT be started until all external input is available </li></ul>
  12. 12. Oozie 2 Coordinator <ul><li>What is Oozie 2? It is Oozie 1 PLUS … </li></ul><ul><ul><li>Time+Data driven execution of workflow jobs </li></ul></ul><ul><ul><ul><li>A workflow job is scheduled at a regular frequency </li></ul></ul></ul><ul><ul><ul><li>A workflow job is started when all input data is available </li></ul></ul></ul>(Diagram: a coordinator app with frequency f wraps a workflow that reads IN and writes OUT.)
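A minimal Python sketch of time+data driven execution, under stated assumptions: the function names, the inclusive end time, and the use of `(dataset, time)` tuples to stand for input data are all hypothetical and do not reflect Oozie's coordinator API.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """Yield one nominal time per frequency tick from start to end (inclusive)."""
    tick = start
    while tick <= end:
        yield tick
        tick += frequency


def runnable_actions(start, end, frequency, inputs_for, available):
    """Return the ticks whose required inputs are all present in `available`.

    `inputs_for(tick)` lists the inputs a tick needs; a tick with any missing
    input is held back, mirroring the rule that a workflow job must not start
    until all of its input data exists.
    """
    return [
        tick
        for tick in nominal_times(start, end, frequency)
        if all(item in available for item in inputs_for(tick))
    ]


# Hourly job over three ticks; only the 1:00 and 2:00 logs have arrived.
start = datetime(2010, 1, 1, 1, 0)
end = datetime(2010, 1, 1, 3, 0)
available = {("LOG", datetime(2010, 1, 1, 1, 0)), ("LOG", datetime(2010, 1, 1, 2, 0))}
ready = runnable_actions(
    start, end, timedelta(hours=1), lambda tick: [("LOG", tick)], available
)
```

Here `ready` contains the 1:00 and 2:00 ticks; the 3:00 tick stays waiting until its input appears.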
  13. 13. Use Cases: Data Pipelines (Pipeline diagram; labels: WS; LOG 1:05, 1:10, 1:15, 2:00; f (5min) producing PH1 1:05, 1:10, 1:15, 2:00; f (60min) producing PH2 2:00; both timelines run from 01JAN to 31DEC.)
  14. 14. Coordinator Applications <ul><li>A coordinator application can be parameterized </li></ul><ul><li>Coordinator jobs have a frequency, a start date, and an end date </li></ul><ul><li>At every tick of the frequency, a coordinator action is created </li></ul><ul><li>The coordinator action starts a workflow job only when all input data is available </li></ul><ul><li>Coordinator applications define their input/output data </li></ul><ul><li>Input/output data is (normally) relative to the action creation time (determined by the job frequency) and is expressed as URI templates: </li></ul><ul><li>hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN} </li></ul>
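To make the URI template concrete, here is a small Python sketch that fills in the date fields for one instant. The `resolve_uri` helper and the zero-padded field widths are assumptions made for illustration; the template string is copied from the slide, with its elided `...` part left untouched.

```python
from datetime import datetime
from string import Template

# Template copied from the slide; the elided "..." part is kept as-is.
PH1_TEMPLATE = "hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}"


def resolve_uri(template, instant):
    """Substitute zero-padded date fields of `instant` into a URI template."""
    fields = {
        "YEAR": f"{instant.year:04d}",
        "MONTH": f"{instant.month:02d}",
        "DAY": f"{instant.day:02d}",
        "HOUR": f"{instant.hour:02d}",
        "MIN": f"{instant.minute:02d}",
    }
    return Template(template).substitute(fields)


uri = resolve_uri(PH1_TEMPLATE, datetime(2010, 1, 1, 1, 5))
# uri == "hdfs://.../ph1/2010/01/01/01/05"
```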
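  15. 15. Coordinator Input and Output Data (Diagram: input dataset PH1 with frequency f i (5min), instances 1:05, 1:10, 1:15, 2:00, addressed as ${current(-11)}, ${current(-10)}, ${current(-9)}, ${current(0)}; coordinator job frequency f j (60min); output dataset PH2 with frequency f o (60min), instance 2:00, addressed as ${current(0)}; IN, Workflow, OUT; timeline from 01JAN to 31DEC.)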
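The `${current(n)}` offsets on this slide can be checked with a little arithmetic. The Python sketch below assumes that the action's nominal time coincides with a dataset instance and that `current(n)` simply steps `n` dataset periods from it; the helper names are hypothetical and the real coordinator semantics may involve more (for example, a dataset's initial instance).

```python
from datetime import datetime, timedelta


def current(n, nominal_time, dataset_frequency):
    """Return the dataset instance n periods away from the nominal time."""
    return nominal_time + n * dataset_frequency


def input_window(nominal_time, dataset_frequency, first, last):
    """Return the instances from current(first) through current(last), inclusive."""
    return [
        current(n, nominal_time, dataset_frequency)
        for n in range(first, last + 1)
    ]


# A 60-minute job running at 2:00 that consumes a 5-minute dataset.
nominal = datetime(2010, 1, 1, 2, 0)
window = input_window(nominal, timedelta(minutes=5), -11, 0)
# window[0] is 1:05, window[-1] is 2:00, and len(window) is 12
```

With a 5-minute input dataset, `current(-11)` through `current(0)` covers exactly the twelve instances from 1:05 to 2:00 that one hourly run consumes.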
  16. 16. Daylight Saving is Evil <ul><li>The number of minutes and hours in a day changes on a per-time-zone (TZ) basis </li></ul><ul><ul><li>Hours in March == 31 * 24 ? YES & NO </li></ul></ul><ul><li>Is a day of hourly datasets always 24 instances? YES & NO </li></ul><ul><li>How about mixing datasets from different US TZs? </li></ul><ul><li>How about mixing datasets from different TZs in different regions/countries? </li></ul><ul><li>SOLUTION: Built-in support for TZ/DS </li></ul>
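The "YES & NO" answer for March can be reproduced with Python's standard `zoneinfo` module (Python 3.9+, with time zone data available on the system). The `hours_in_month` helper is hypothetical, and the choice of `America/Los_Angeles` and the year 2010 is only an example.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def hours_in_month(year, month, tz_name):
    """Count the elapsed hours in a calendar month as seen in a time zone."""
    tz = ZoneInfo(tz_name)
    start = datetime(year, month, 1, tzinfo=tz)
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    end = datetime(next_year, next_month, 1, tzinfo=tz)
    # Convert to UTC first: subtracting two datetimes that share a tzinfo
    # compares wall-clock values and would hide the daylight-saving shift.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return elapsed // timedelta(hours=1)


utc_hours = hours_in_month(2010, 3, "UTC")                      # 31 * 24
pacific_hours = hours_in_month(2010, 3, "America/Los_Angeles")  # one hour fewer
```

In UTC, March 2010 has 31 * 24 = 744 hours; in US Pacific time it has 743, because clocks moved forward one hour on 14 March 2010.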
  17. 17. What is Next? <ul><li>Automatic temporary back-off from the JT/NN when they are down or too slow </li></ul><ul><li>Map-Reduce and Pig job submission over HTTP (without a workflow) </li></ul><ul><li>High Availability (via ZooKeeper) </li></ul><ul><li>Improved Workflow Schema </li></ul><ul><li>Complete Coordinator specification support (asynchronous datasets and apps) </li></ul><ul><li>More user-friendly functions </li></ul><ul><li>Integration with metadata system </li></ul><ul><li>Coordinator reprocessing features </li></ul><ul><li>Coordinator application bundles (manage many coordinator jobs as one unit) </li></ul>
  18. 18. Getting Oozie <ul><li>http://developer.yahoo.com/hadoop </li></ul><ul><li>http://yahoo.github.com/oozie </li></ul>
  19. 19. Questions? <ul><li>Alejandro Abdelnur </li></ul><ul><li>[email_address] </li></ul>