October 2013 HUG: Oozie 4.x

Apache Oozie has come a long way and now accounts for over 2.8 million jobs per month on Yahoo's grid infrastructure. If you run Hadoop jobs repeatedly and are looking for a smarter way of doing it, Apache Oozie is the answer. Whether you are running complex data transformation jobs chained one after another or a simple daily data copy, Oozie workflows help you manage these tasks efficiently. Mona will cover the new features introduced in Apache Oozie 4.x, in particular Apache HCatalog Integration, Job Notifications and SLA Monitoring, for building large-scale and efficient data processing pipelines.

  • 1. Oozie – Now and Beyond. Presented by Mona Chitnis | Hadoop User Group, Yahoo Sunnyvale, October 16, 2013
  • 2. Team In Action: Alejandro Abdelnur, Mohammad Islam, Rohini Palaniswamy, Robert Kanter, Virag Kothari, Mona Chitnis, Ryota Egashira, Michelle Chiang, Bowen Zhang. Yahoo Confidential & Proprietary
  • 4. Overview: Why Oozie?
    The Problem
    - Doing something on the grid often required multiple steps: MapReduce job, Pig job, Streaming job, HDFS operation (mkdir, chmod, etc.)…
    - Multiple ad-hoc solutions existed: custom job control, shell scripts, cron…
    - The cost of building and running apps was high: development and applications engineering support, operations, and hardware
    The Need
    - A server-based workflow system to manage Hadoop jobs
    - Workflow scheduler with better support for grid jobs (native integration with Hadoop): orchestrate dependencies between jobs; execute at a specific time or on data availability; retry jobs in the event of failures (reliable)
    - Common framework for communication and execution of production processes: sync (clocked dataset) awareness; async (unspecified scheduling frequency) data awareness
    - Horizontally scalable and extensible system
    - Open-source
    - Workflows to couple resources instead of having a monolithic code base
  • 5. Overview: Oozie – A Workflow Engine
    - Oozie executes a workflow defined as a DAG of jobs
    - Job types include MapReduce, Pig, Hive, shell script, custom Java code, etc.
    - Introduced in Oozie 1.x
    - Control-flow nodes: start, kill, end | fork, join, decision
    - Action nodes: map reduce, pig, hive, distcp, java, fs, sub-workflow, shell, ssh, email
    [Diagram: example workflow DAG with start, M/R jobs, a fork/join around an M/R job and a Pig job, a decision node (MORE / ENOUGH), FS and Java jobs, and OK / ERROR transitions to end / kill]
  • 6. Overview: Example M/R Action [Figure: map-reduce action definition annotated with JT and NN, Mapper, Reducer, Input Directory, Output Directory, Queue Name]
  • 7. Overview: Workflow State Transitions [Figure: workflow state transition diagram. Source: Chicago HUG, Dec 2012]
  • 8. Overview: Oozie (Coordinator) – A Scheduler
    - Oozie executes workflows based on time dependency (frequency) and data dependency
    - Introduced in 2.x
    [Diagram: the Oozie Client talks to the Oozie Server through the WS API; inside the server, the Oozie Coordinator checks data availability in HDFS/HCat and starts the Oozie Workflow]
  • 9. Overview: Oozie (Bundle) – A Pipeline Framework
    - Users can define and execute a "bundle" of coordinator apps: large-scale data processing (inter-related coordinators); operability and manageability of pipelines
    - Users can start/stop/suspend/resume/rerun at the bundle level
    - Introduced in 3.x; bundles are optional
    [Diagram: the Oozie Client talks to the Oozie Server through the WS API; inside the server, the Bundle drives the Oozie Coordinator, which checks data availability in HDFS/HCat and starts the Oozie Workflow]
  • 10. Overview: Layers of Abstraction in Oozie
    1. Bundle: contains Coord Jobs
    2. Coordinator: each Coord Job materializes Coord Actions, and each Coord Action runs a WF Job
    3. Workflow: each WF Job runs actions such as M/R Jobs and Pig Jobs
  • 11. Overview: Architectural Overview [Diagram: Oozie is a Java web-app. Web Services (JSON/REST API) with security, WS API and WS callback; a DAG engine handling submit, start, rerun, callback, suspend, resume, kill, signal and job info; commands (check action, start action, end action, notification) executed asynchronously via a command queue and a command-executor thread pool; a recovery daemon thread; instrumentation; WF store (Oracle DB) and WF lib; action executors (M/R, Pig, fs, sub-wf), pluggable to support additional action types]
  • 12. Overview: Oozie Security, Multi-tenancy and Scalability [Diagram: End User, Oozie Server, and a Hadoop cluster (YARN RM, Launcher Mapper, actual M/R job)]
    1. Auth. End User (Kerberos, Y! specific)
    2. Create Launcher Job (super-user)
    3. Execute User Job (doAs)
    4. Response
    5. Async Callback
  • 13. USE CASES
  • 14. Use Cases and Common Patterns: Use Case 1, Time Triggers. Execute your workflow every 15 minutes. [Timeline: 00:15, 00:30, 00:45, 01:00]
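
The 15-minute time trigger can be modelled with a few lines of Python. This is only an illustrative sketch of how nominal times are laid out, not Oozie code: the function name `nominal_times`, the sample date, and the half-open `[start, end)` interval are assumptions made for the example.

```python
from datetime import datetime, timedelta

def nominal_times(start, end, frequency_minutes):
    """Return the nominal time of every action in the half-open range [start, end)."""
    step = timedelta(minutes=frequency_minutes)
    times = []
    current = start
    while current < end:
        times.append(current)
        current += step
    return times

def as_labels(times):
    """Format nominal times as HH:MM labels, like the timeline on the slide."""
    return [t.strftime("%H:%M") for t in times]
```

Calling `as_labels(nominal_times(datetime(2013, 10, 16, 0, 15), datetime(2013, 10, 16, 1, 15), 15))` yields the four ticks shown on the timeline.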
  • 15. Use Cases and Common Patterns: Use Case 2, Time and Data Triggers. Materialize your workflow every hour, but only run it when the input data (which is loaded to the grid every hour) is ready. [Timeline: at 01:00, 02:00, 03:00 and 04:00, check "Input Data Exists?" on Hadoop]
  • 16. Use Cases and Common Patterns: Use Case 2, Time and Data Triggers
        <coordinator-app name="coord1" frequency="${coord:hours(1)}"…>
          <!-- Dataset definition -->
          <datasets>
            <dataset name="logs" frequency="${coord:hours(1)}" initial-instance="2009-01-01T23:59Z">
              <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
            </dataset>
          </datasets>
          <!-- Input events definition; current(0) is the instance at the time the coordinator action is materialized (created) -->
          <input-events>
            <data-in name="inputLogs" dataset="logs">
              <instance>${coord:current(0)}</instance>
            </data-in>
          </input-events>
          <!-- Action definition -->
          <action>
            <workflow>
              <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
              <configuration>
                <property>
                  <name>inputData</name>
                  <value>${coord:dataIn('inputLogs')}</value>
                </property>
              </configuration>
            </workflow>
          </action>
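
To make the data trigger concrete, the sketch below resolves the dataset's URI template for one instance time and checks whether that input is available. It is a simplified Python model, not Oozie's implementation: `resolve_uri` and `ready_to_run` are hypothetical helpers, and availability is modelled as membership in a set of paths rather than a real file system check.

```python
from datetime import datetime

# URI template copied from the dataset definition on the slide.
URI_TEMPLATE = "hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}"

def resolve_uri(template, instance_time):
    """Substitute the date fields of one dataset instance into the template."""
    fields = {
        "${YEAR}": f"{instance_time:%Y}",
        "${MONTH}": f"{instance_time:%m}",
        "${DAY}": f"{instance_time:%d}",
        "${HOUR}": f"{instance_time:%H}",
    }
    for token, value in fields.items():
        template = template.replace(token, value)
    return template

def ready_to_run(instance_time, available_paths):
    """An action runs only once its resolved input path is available (assumed check)."""
    return resolve_uri(URI_TEMPLATE, instance_time) in available_paths
```

With this model, an action materialized for a given hour stays waiting until the matching directory appears in `available_paths`, which mirrors the "only run when the input data is ready" behaviour described in the use case.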
  • 17. Use Cases and Common Patterns: Use Case 3, Rolling Window. Access 15-minute datasets and roll them up into hourly datasets. [Timeline: 00:15, 00:30, 00:45, 01:00 roll up into 01:00; 01:15, 01:30, 01:45, 02:00 roll up into 02:00]
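
The rolling window can be sketched as a function that maps each hourly run to the four 15-minute instances it consumes. This is an illustrative Python model only; `rolling_window` and its parameters are assumed names, not part of Oozie.

```python
from datetime import datetime, timedelta

def rolling_window(hour_end, parts=4, part_minutes=15):
    """Return the 15-minute instances that roll up into the hour ending at hour_end."""
    step = timedelta(minutes=part_minutes)
    # Oldest instance first: hour_end - 45 min, ..., hour_end.
    return [hour_end - step * (parts - 1 - i) for i in range(parts)]
```

Consecutive hourly runs consume disjoint sets of instances, which is what distinguishes a rolling window from the sliding window in the next use case.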
  • 18. Use Cases and Common Patterns: Use Case 4, Sliding Window. Access the last 24 hours of data, and roll them up every hour. [Timeline: 01:00, 02:00, 03:00 … 24:00 roll up into 24:00; 02:00, 03:00, 04:00 … +1 day 01:00 roll up into +1 day 01:00; 03:00, 04:00, 05:00 … +1 day 02:00 roll up into +1 day 02:00]
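
A sliding window differs from a rolling one in that consecutive runs overlap. The Python sketch below illustrates this under the assumption that each hourly run reads the 24 hourly instances ending at its own nominal time; `sliding_window` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def sliding_window(nominal_time, hours=24):
    """Return the hourly instances covering the last `hours` hours, oldest first."""
    return [nominal_time - timedelta(hours=h) for h in range(hours - 1, -1, -1)]
```

Two runs one hour apart share 23 of their 24 instances, so each hour only one new instance has to arrive before the next run can start.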
  • 19. Where are We Today: Proven Scale and Multi-tenancy
    - 2.8 M jobs/month
    - 13,000 jobs/server day
    - 16% of all Hadoop jobs
    - 75 products
    - 255 monthly users
    - 2,000+ projects
    - 5.4 M compute hrs/month
    - 770,000 workflows
    - Between 1-8 actions
    - Avg. 4 actions/workflow
    - 250 coordinator jobs/day
    - 17 clusters
    - 67% of Oozie jobs kicked off through a coordinator
  • 20. Where are We Today: Mix Of Job Types For Workflows [Chart: stacked bar of jobs by type (Pig, MapReduce, Java, Other), with segments of 39%, 29%, 28% and 4% (Other)]
    Sample use of job types (Pig, MapReduce, Java, Others): data processing/filtering; aggregation; publishing data (HDFS/HCat); legacy code and logic; distcp and shell; data copy/transfer
  • 22. What's New in Oozie: Existing Features (Oozie 3.x)
    - HBase access through Oozie, via credentials
    - HCatalog access through Oozie, via credentials
    - Email action
    - DistCp action (intra- as well as inter-cluster copy)
    - Shell action (run any script, e.g. perl, python, hadoop CLI)
    - Workflow dry-run & Fork-Join validation
    - Bulk monitoring (REST API)
    - Coordinator EL functions for parameterized workflows
    - Job DAG
  • 23. What's New in Oozie: HBase Credentials
    - Add a "credentials" section to workflow.xml. The type is "hbase".
    - Specify which action (for example a java or map-reduce action) uses the credentials.
    - Put hbase-site.xml in the Oozie application path, and use <file> in workflow.xml to put hbase-site.xml in the distributed cache. A copy of hbase-site.xml can be found at gateway:/home/gs/conf/hbase/hbase-site.xml.
    - Put the jars guava-*.jar, zookeeper-*.jar, hbase-*.jar and protobuf-java-*.jar in the workflow "lib" dir.
    - Make sure you are using Oozie XSD version 0.3 or above for the credentials tag.
    - Refer to http://twiki.corp.yahoo.com/view/CCDI/UseHbaseCred
        <workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.3">
          <credentials>
            <credential name="hbase.cert" type="hbase"> </credential>
            <!-- optional properties: zookeeper.znode.parent, hbase.zookeeper.quorum -->
          </credentials>
          <start to="map-reduce-action"/>
          <action name="map-reduce-action" cred="hbase.cert">
            <map-reduce>
              <configuration>
                <property>
                  <name>mapred.mapper.class</name>
                  <value>SampleMapperHBase</value>
                </property>
                <property>
                  <name>mapred.reducer.class</name>
                  <value>org.apache.oozie.example.DemoReducer</value>
                </property>
              </configuration>
              <file>hbase-site.xml#hbase-site.xml</file>
            </map-reduce>
            ...
  • 24. What's New in Oozie: Oozie 4.0. 1. HCatalog Integration; 2. Job Notifications; 3. SLA Monitoring
  • 25. What's New in Oozie: 1. HCatalog Integration
    - Oozie now supports HCatalog datasets, in addition to HDFS: query the HCat server directly, or receive "partition created" notifications
    - With HDFS datasets, Oozie polls the NameNode to check data availability. Drawbacks: delay; single source
    [Diagram: Oozie repeatedly asks the NameNode "data exists?" for HDFS paths such as /data/click/2013/03/10, /data/click/2013/03/11, /data/click/2013/03/12]
  • 26. What's New in Oozie: Latest Oozie 4.0 Features, HCatalog Integration
    - The HCat metastore has info about HDFS datasets, locations and file formats.
    - Using the HCat loader and storer, a dataset can be consumed uniformly using Pig, Hive and Map/Reduce in Oozie, using the "database, table, partition" abstraction.
    - Oozie is notified of partition availability via JMS messages, to trigger workflows immediately.
    - Use the JARs hcatalog-core.jar, webhcat-javaclient.jar, hive-common.jar, hive-exec.jar, hive-metastore.jar, hive-serde.jar and libfb303.jar in the workflow "lib".
    - Docs: http://oozie.apache.org/docs/4.0.0/DG_HCatalogIntegration.html
        <coordinator-app name="hcat-coord" … >
          <datasets>
            <dataset name="inp-logs" frequency="${coord:hours(1)}">
              <uri-template>${hcatNode}/${db}/${table}/ds=${YEAR}-${MONTH}-${DAY};region=${region}</uri-template>
              <done-flag></done-flag>
            </dataset>
            <dataset name="out-logs" frequency="${coord:days(1)}">
              <uri-template>${hcatNode}/${db}/${outputtable}/ds=${dataOut};region=${region}</uri-template>
              <done-flag></done-flag>
            </dataset>
          ...
          <property>
            <name>FILTER</name>
            <value>${coord:dataInPartitionFilter('input', 'pig')}</value>
          </property>
    Pig action script:
        A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
        B = FILTER A BY $FILTER;
        C = foreach B generate foo, bar;
        store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
  • 27. What's New in Oozie: With HCatalog + Notifications (high-level diagram). The Data Producer produces data (distcp, pig, M/R..) into HDFS at /data/click/2013/03/12, then updates the metadata in HCatalog: ALTER TABLE click ADD PARTITION(data='2013/03/12') location 'hdfs://data/click/2013/03/12'
  • 28. What's New in Oozie: With HCatalog + Notifications (high-level diagram). Components: Data Producer, HDFS, HCatalog, Oozie, Message Bus (e.g. ActiveMQ). Oozie: 1. Query/Poll Partition (HCatalog); 2. Register Topic (Message Bus)
  • 29. What's New in Oozie: With HCatalog + Notifications (high-level diagram). The Data Producer produces data (distcp, pig, M/R..) into HDFS at /data/click/2013/03/12 and updates the metadata in HCatalog: ALTER TABLE click ADD PARTITION(data='2013/03/12') location 'hdfs://data/click/2013/03/12'
    1. Oozie queries/polls the partition in HCatalog
    2. Oozie registers a topic on the Message Bus (e.g. ActiveMQ)
    3. HCatalog pushes a <New Partition> notification to the Message Bus
    4. The Message Bus notifies Oozie of the new partition, and Oozie starts the workflow
  • 30. What's New in Oozie: Latest Oozie 4.0 Features, 2. Job Notifications
    - A notification event is sent on each job status change.
    - Messages are sent to the configured JMS-compliant message broker.
    - Users should write message listeners to listen on select topics (e.g. username).
    - To filter further, apply JMS selectors on messages, e.g. user, jobid, app-type, status, msg-type (JOB or SLA).
    - Docs: http://oozie.apache.org/docs/4.0.0/DG_JMSNotifications.html
    Filter desired app-types for notification:
        <property>
          <name>oozie.service.EventHandlerService.filter.app.types</name>
          <value>workflow_job, workflow_action, coordinator_job, coordinator_action</value>
        </property>
    Notification message example: Coordinator Action Failure Event
    - Header (selectors): AppType (Coordinator_Action), Status (FAILURE), User, App-Name
    - Message body (JSON): ID (coord action ID), Parent ID (coord job ID), NominalTime, StartTime, EndTime, Status (FAILED, KILLED, SUSPENDED, TIMEDOUT), Error-Code and Error-Message (if KILLED or FAILED)
  • 31. What's New in Oozie: Latest Oozie 4.0 Features, 3. SLA Monitoring
    - Oozie can actively track SLAs on a job's start time, end time and duration.
    - Event status: START_MET, START_MISS; END_MET, END_MISS; DURATION_MET, DURATION_MISS
    - At any time, the SLA processing stage will reflect:
        Not_Started: job not yet begun
        In_Process: job started and is running, and SLAs are being tracked
        Met: caused by an END_MET
        Miss: caused by an END_MISS
    - Access/filter SLA info via the web-console dashboard, REST API, JMS messages, or email alert.
    - Docs: http://oozie.apache.org/docs/4.0.0/DG_SLAMonitoring.html
        <workflow-app xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2" name="sla-wf">
          ...
          <end name="end"/>
          <sla:info>
            <sla:nominal-time>${nominalTime}</sla:nominal-time>
            <sla:should-start>${shouldStart}</sla:should-start>
            <sla:should-end>${shouldEnd}</sla:should-end>
            <sla:max-duration>${duration}</sla:max-duration>
            <sla:alert-events>start_miss,end_miss</sla:alert-events>
            <sla:alert-contact>joe@yahoo</sla:alert-contact>
          </sla:info>
        </workflow-app>
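
The relationship between the SLA fields and the event statuses can be illustrated with a small Python model. This is a sketch under stated assumptions, not Oozie's SLA engine: it evaluates a job that has already finished, it assumes should-start and should-end are offsets from the nominal time, and the names `sla_events` and `sla_stage` are hypothetical.

```python
from datetime import datetime, timedelta

def sla_events(nominal, should_start, should_end, max_duration, actual_start, actual_end):
    """Return the three SLA event statuses for a job that has already finished."""
    expected_start = nominal + should_start   # assumed: offset from nominal time
    expected_end = nominal + should_end       # assumed: offset from nominal time
    return [
        "START_MET" if actual_start <= expected_start else "START_MISS",
        "END_MET" if actual_end <= expected_end else "END_MISS",
        "DURATION_MET" if actual_end - actual_start <= max_duration else "DURATION_MISS",
    ]

def sla_stage(events):
    """Final stage of a finished job: Met is caused by END_MET, Miss by END_MISS."""
    return "Met" if "END_MET" in events else "Miss"
```

In this model a job can exceed its maximum duration and still end in the Met stage, because the final stage follows the end-time event only, as listed on the slide.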
  • 32. What's New in Oozie: SLA Monitoring Dashboard [Screenshot: SLA monitoring dashboard]
  • 33. Demo: Checking Oozie Job. 1. CLI (yoozie_client)
        $ oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-joe
        ----------------------------------------------------------------
        Workflow Name : map-reduce-wf
        App Path      : hdfs://localhost:8020/user/joe/workflows/map-reduce
        Status        : SUCCEEDED
        Run           : 0
        User          : joe
        Group         : users
        Created       : 2009-05-26 05:01
        Started       : 2009-05-26 05:01
        Ended         : 2009-05-26 05:01
        Actions
        ----------------------------------------------------------------
        Action Name  Type        Status  Transition  External Id            External Status  Error Code  Start             End
        ----------------------------------------------------------------
        hadoop1      map-reduce  OK      end         job_200904281535_0254  SUCCEEDED        -           2009-05-26 05:01  2009-05-26 05:01
        ----------------------------------------------------------------
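
The same job information is also available over Oozie's web services API, which serves job details at `/v1/job/<job-id>?show=info`. The Python sketch below builds that URL and fetches the JSON; the helper names are hypothetical, and it assumes a server reachable without authentication (a Kerberos-secured server would need additional handling not shown here).

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def job_info_url(oozie_url, job_id):
    """Build the web services URL that returns information about one job."""
    return f"{oozie_url.rstrip('/')}/v1/job/{quote(job_id)}?show=info"

def fetch_job_info(oozie_url, job_id):
    """Fetch and decode the job information JSON (assumes no authentication)."""
    with urlopen(job_info_url(oozie_url, job_id)) as response:
        return json.load(response)
```

For the job shown in the CLI demo, `fetch_job_info("http://localhost:11000/oozie", "14-20090525161321-oozie-joe")` would return a dictionary whose `status` field carries the same value as the Status line of the CLI output.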
  • 34. Demo: Checking / Debugging Oozie Jobs. 2. Web-Console, e.g. http://my-oozie-server:4080/oozie. Docs: https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook
  • 35. What else is out there?
  • 36. Oozie at ASF: Oozie vs. Other Workflow Systems (columns: Oozie | LinkedIn's system | Spotify's system)
    Champion: Yahoo! (now ASF) | LinkedIn | Spotify
    Apache Affiliation: TLP | License only | License only
    Language: Java | Java | Python
    Adoption: High, part of all standard Hadoop distributions | Low | Low
    Code Complexity: High (>100K lines) | Medium (<50K lines) | Low (<10K lines)
    Hadoop Job Support: Extensive built-in support | Limited job types | Limited job types
    Docs & Support: Excellent | Limited | Limited
    Auth.: Kerberos, custom | xml-based, custom | Linux-based
    Reruns: Yes (recovery, retries at all levels) | Partial | After removing output, idempotent
    UI: Average | Good | -
  • 37. Roadmap: The Next Release
    - Scalability and performance improvements to handle higher loads
    - More 1 and 5 min frequency jobs
    - High Availability with Load Balancing
    - Flexible Cron-Based Scheduling
    - Handling cluster rolling upgrades for Hadoop 2.0
  • 38. Q & A