Yahoo! Workflow Engine for Hadoop
Alejandro Abdelnur, Yahoo!
Session Agenda
- Oozie workflow engine (Oozie 1)
- Oozie coordinator engine (Oozie 2)
- Getting Oozie
What was Oozie? (Oozie 1, Workflow)
- An Oozie workflow is a DAG of MR/Pig/FS/Java/sub-workflow actions
- Workflow applications are written in a PDL (Process Definition Language) in XML
- Workflow applications are parameterized
- Oozie is a server; it is transactional, reliable, and it scales
- HTTP REST API only (Java API, CLI, and console built on top of it)
- Implementation: Java web-app + SQL DB
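A minimal workflow definition in the XML PDL might look like the sketch below: one parameterized map-reduce action wired between a start and an end node. All names and paths here (my-wf, firstjob, the `${inputDir}`/`${outputDir}` parameters) are hypothetical, and element details vary by workflow schema version.

```xml
<!-- Sketch of an Oozie 1 workflow: a DAG with a single parameterized
     map-reduce action. Names and parameters are hypothetical. -->
<workflow-app name="my-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="firstjob"/>
  <action name="firstjob">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <!-- Transitions define the DAG edges -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Map-reduce action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The `${...}` placeholders are what "workflow applications are parameterized" refers to: they are resolved from properties supplied at submission time.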
Users' Experience
"Oozie has enabled us to reduce our index building operation from a manually intensive 4-day process to a 6-hour fully automated process..." (Keyword Research Service team)
"... It saved us a tremendous amount of time and resources not to develop an alternative custom solution to manage our complex workflows on the Grid..." (Segment Manager team)
Some Numbers
- Oozie users: 50
- Workflow applications: 4,868
- Largest workflow: 2,000 action nodes
- Average action nodes per workflow: 18
- Workflow jobs in the last month: 55K
- Longest running workflow job: 17 hours
- Workflow action nodes by type: Map-Red 23%, Pig 30%, File System 19%, Java 18%, Sub-Workflow 4%
The First Year …
- Releases: 4 feature releases, 6 patches, 1 DB schema change (from Oozie 1 to 2)
- Failures? YES (recovered from them? YES): servlet container (Tomcat) and database (MySQL)
- Did we lose workflow jobs data? NO
- Code issues that caused failures:
  - DB connection leaks (fix: use the command pattern all over)
  - Thread pool starvation (fix: added a thread quota per command type)
  - HDFS connection leaks (fix: 2nd-level caching)
… The First Year …
- Deployment model: started with 1 Tomcat / multiple Oozies; now multiple instances of 1 Tomcat / 1 Oozie
- Database: migrated from MySQL to Oracle
… The First Year …
Co-existence with Hadoop:
- When the JT/NN are slow, Oozie users complain that Oozie is slow
- Bad workflows can overload the JT/NN: a fork of 2,000+ MR jobs, or a Java action looping while waiting for files to become available
- Hadoop patching requires a synchronized patching of Oozie (because of Hadoop RPC compatibility issues)
- Different Y! clusters use different Hadoop versions (it requires juggling with the Oozie code to avoid more branches)
… The First Year
Implementation changes:
- Deprecated the SSH action, added the Java action
- MR/Pig actions are started via a launcher job with 1 map and 0 reduces
- Improved user logging (especially for Pig)
- Removed external calls from within DB transactions (a nasty one)
- Using (Open)JPA for DB access
Got right from the beginning:
- Backward compatibility for the API and PDL: ALWAYS KEPT
- Heavy use of asynchronous command execution (queue + thread pool)
- Instrumentation data (for monitoring)
RULE for Oozie Workflows: a workflow job MUST NOT be started until all external input is available
What is Oozie 2? It is Oozie 1 PLUS …
- Time- and data-driven execution of workflow jobs
- A workflow job is scheduled at a regular frequency
- A workflow job is started when all its input data is available
[Diagram: Oozie 2 Coordinator — a coordinator app with frequency f drives a workflow from IN to OUT]
Use Cases: Data Pipelines
[Diagram: a pipeline running 01JAN to 31DEC. A 5-minute job turns the WS feed into PH1 instances (1:05, 1:10, 1:15, ..., 2:00); an hourly job combines the PH1 and LOG instances (1:05, 1:10, ..., 2:00) into an hourly PH2 instance (2:00).]
Coordinator Applications
- A coordinator application can be parameterized
- Coordinator jobs have a frequency and start & end dates
- On every tick of the frequency a coordinator action is created
- The coordinator action starts a workflow job only when all input data is available
- Coordinator applications define their input/output data
- Input/output data is (normally) relative to the action creation time (the job frequency); it is expressed as URI templates: hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}
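The parameterization typically comes from a properties file handed to the Oozie client at submission time. A minimal sketch, where all host names, paths, and values are hypothetical (only the property-file mechanism itself comes from Oozie):

```properties
# Hypothetical job.properties for a coordinator job submission.
nameNode=hdfs://nn.example.com:9000
jobTracker=jt.example.com:9001

# Coordinator window, corresponding to the frequency/start/end dates above
start=2010-01-01T01:00Z
end=2010-12-31T23:00Z

# Where the coordinator application XML lives, plus workflow parameters
oozie.coord.application.path=${nameNode}/apps/ph2-coord
queueName=default
```

Any `${...}` placeholder left in the coordinator or workflow XML is resolved against these properties when the job is submitted.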
Coordinator Input and Output Data
[Diagram: an hourly coordinator job f_j (60min), running 01JAN to 31DEC, consumes the twelve most recent instances of a 5-minute input dataset f_i, from ${current(-11)} to ${current(0)} (PH1 1:05 ... 2:00), and produces the ${current(0)} instance of an hourly output dataset f_o (PH2 2:00). IN → Workflow → OUT]
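The input/output resolution above can be sketched as a coordinator application definition. Everything below is a hedged example: the app name, dataset names, and paths are hypothetical, the URI template mirrors the slide's elided `hdfs://.../` form, and element details vary by coordinator schema version.

```xml
<!-- Sketch of a coordinator app: an hourly job that consumes twelve
     5-minute PH1 instances and produces one hourly PH2 instance.
     Names and paths are hypothetical. Frequencies are in minutes. -->
<coordinator-app name="ph2-coord" frequency="60"
                 start="2010-01-01T01:00Z" end="2010-12-31T23:00Z"
                 timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="ph1" frequency="5"
             initial-instance="2010-01-01T00:00Z"
             timezone="America/Los_Angeles">
      <uri-template>hdfs://.../ph1/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MIN}</uri-template>
    </dataset>
    <dataset name="ph2" frequency="60"
             initial-instance="2010-01-01T01:00Z"
             timezone="America/Los_Angeles">
      <uri-template>hdfs://.../ph2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- Twelve 5-minute instances relative to the action creation time -->
    <data-in name="input" dataset="ph1">
      <start-instance>${coord:current(-11)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  <output-events>
    <!-- One hourly output instance -->
    <data-out name="output" dataset="ph2">
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>hdfs://.../apps/ph2-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The coordinator materializes an action every hour, but the workflow only starts once all twelve PH1 input instances exist.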
Daylight Saving is Evil
- Minutes and hours in a day change on a per-TZ basis
- Hours in March == 31 * 24? YES & NO
- Is a day of hourly datasets always 24 instances? YES & NO
- How about mixing datasets from different US TZs?
- How about mixing datasets from different TZs from different regions/countries?
- SOLUTION: built-in support for TZ/DS
What is Next?
- Automatic temporary back-off from the JT/NN when down or too slow
- Map-Reduce and Pig job submission over HTTP (without a workflow)
- High availability (via ZooKeeper)
- Improved workflow schema
- Complete coordinator specification support (asynchronous datasets and apps)
- More user-friendly functions
- Integration with the metadata system
- Coordinator reprocessing features
- Coordinator application bundles (manage many coordinator jobs as one unit)
Getting Oozie
http://developer.yahoo.com/hadoop
http://yahoo.github.com/oozie
Questions? Alejandro Abdelnur [email_address]

Workflow on Hadoop Using Oozie (Hadoop Summit 2010)


Editor's Notes

  • #15 Let me try to formalize things a bit. A coordinator application can be parameterized: the locations of its inputs and outputs, and special settings for the workflow jobs. It has a start date, an end date, and a frequency, which can also be parameterized. On every frequency tick a workflow job is scheduled, but the workflow stays in WAITING state until all of its input is available. A coordinator application defines all of its inputs and an output, and these are normally relative to the frequency and the time the workflow jobs are scheduled.