August 2016 HUG: Recent development in Apache Oozie

Recent Development in Oozie
Purshotam Shah (purushah@yahoo-inc.com)
Satish Saley (saley@yahoo-inc.com)

Agenda
1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Why Oozie?
• Out-of-box support for multiple job types
  • Java, shell, distcp
  • MapReduce
    • Pipes, streaming
  • Pig, Hive, Spark
• Highly scalable
• High availability
  • Hot-Hot with rolling upgrades
  • Load balanced
• Hue integration
[Diagram: Oozie, HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]

Scale at Yahoo
Deployed on all clusters (production, non-production)
One instance per cluster
75 products / 2000+ projects
255 monthly users
90,00 workflow jobs daily (June 2016, one busy cluster)
Between 1 and 8 actions per workflow; avg. 4 actions/workflow
Extreme use case: 100-200 workflow jobs submitted per minute
2,277 coordinator jobs daily (June 2016, one busy cluster)
Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
99% of workflow jobs are kicked off from coordinators
97 bundle jobs daily (June 2016, one busy cluster)

Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting

Data Pipelines
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Pipeline

Oozie Coordinator
<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>
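
To make the dataset resolution concrete, here is a minimal Python sketch of how `${coord:current(0)}` could map to a concrete path for the `input1` dataset above. It is a toy model rather than Oozie's implementation: the function names and the nominal time are hypothetical, and time zones and daylight saving are ignored.

```python
from datetime import datetime, timedelta


def current_instance(nominal, initial, freq_minutes, n=0):
    """Return the dataset instance time that coord:current(n) would point at."""
    freq = timedelta(minutes=freq_minutes)
    # Index of the latest dataset instance at or before the nominal time.
    base = (nominal - initial) // freq
    return initial + (base + n) * freq


def resolve_uri(template, instance):
    """Fill the ${YEAR}/${MONTH}/${DAY}/${HOUR} placeholders of a uri-template."""
    return (template
            .replace("${YEAR}", f"{instance:%Y}")
            .replace("${MONTH}", f"{instance:%m}")
            .replace("${DAY}", f"{instance:%d}")
            .replace("${HOUR}", f"{instance:%H}"))


template = "hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}"
initial = datetime(2009, 1, 1, 0, 0)    # initial-instance of input1
nominal = datetime(2009, 1, 2, 0, 30)   # hypothetical nominal time of one action

uri = resolve_uri(template, current_instance(nominal, initial, 60))
print(uri)
```

With an hourly dataset, a negative argument such as `current_instance(nominal, initial, 60, -1)` steps back one hour from the instance selected by `coord:current(0)`.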

Current limitation of Oozie coordinator
• All datasets are required
• All instances are forced
• We can't combine datasets from multiple providers
• There is no way to assign priority among datasets

Complex dependencies
OOZIE-1976 : Specifying coordinator input datasets in more logical ways

Oozie Coordinator with input logic
<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <!-- dataset definitions not shown -->
  </datasets>
  <input-events>
    <!-- data-in definitions not shown -->
  </input-events>
  <input-logic>
    <or name="input1ORinput2">
      <data-in dataset="input1"/>
      <data-in dataset="input2"/>
    </or>
  </input-logic>
  …

BCP Support
Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as
either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Minimum availability processing
• Sometimes we want to process even if only partial data is available.
<input-logic>
<data-in dataset="A" min="4"/>
</input-logic>
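
The `min` attribute can be pictured as a simple threshold check. The following Python sketch is a toy model of that idea, not Oozie's code; the function name and the `/data/A/...` paths are hypothetical.

```python
def min_satisfied(instances, available, minimum=None):
    """Return (satisfied, found) for one dataset.

    Without a minimum every requested instance is required;
    with a minimum, that many available instances are enough.
    """
    found = [path for path in instances if path in available]
    required = len(instances) if minimum is None else minimum
    return len(found) >= required, found


# Six hypothetical hourly instances of dataset A, four of which have arrived.
instances = [f"/data/A/2016/08/01/{hour:02d}" for hour in range(6)]
available = set(instances[:4])

ok_with_min, found = min_satisfied(instances, available, minimum=4)
ok_without_min, _ = min_satisfied(instances, available)
print(ok_with_min, ok_without_min, len(found))
```

In this model the action would proceed with the four instances that exist when `min="4"` is set, and would keep waiting without it.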

Optional feeds
• Dataset B is optional. Oozie will start processing as soon as A is available. It will include
the instances from A and whatever is available from B.
<input-logic>
<and name="optional">
<data-in dataset="A"/>
<data-in dataset="B" min="0"/>
</and>
</input-logic>
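
As a rough illustration of the optional-feed behaviour described above, the sketch below models an `and` over datasets where each one may carry its own minimum. It is a simplified assumption about the semantics, with hypothetical names and paths, not a description of Oozie internals.

```python
def resolve_and(inputs, available):
    """Toy model of <and>: every input must meet its own requirement.

    inputs is a list of (name, instances, minimum); a minimum of None
    means all instances are required, 0 makes the dataset optional.
    Returns the instances to process per dataset, or None if not ready.
    """
    collected = {}
    for name, instances, minimum in inputs:
        found = [path for path in instances if path in available]
        required = len(instances) if minimum is None else minimum
        if len(found) < required:
            return None
        collected[name] = found
    return collected


a = ["/data/A/00", "/data/A/01"]
b = ["/data/B/00", "/data/B/01"]
available = {"/data/A/00", "/data/A/01", "/data/B/01"}

result = resolve_and([("A", a, None), ("B", b, 0)], available)
print(result)
```

Here A is complete, so the model is satisfied and hands over both A instances plus the single B instance that happens to exist.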

Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
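
One way to picture precedence inside an `or` is "first listed, first chosen". The Python sketch below models that reading under the assumption that a dataset only qualifies when all of its instances are available; the names and paths are hypothetical and this is not Oozie's implementation.

```python
def resolve_or(datasets, available):
    """Toy model of <or>: return the first listed dataset that is fully available.

    datasets is an ordered list of (name, instances); order encodes priority.
    """
    for name, instances in datasets:
        if all(path in available for path in instances):
            return name, instances
    return None


datasets = [
    ("A", ["/data/A/00", "/data/A/01"]),
    ("B", ["/data/B/00", "/data/B/01"]),
    ("C", ["/data/C/00", "/data/C/01"]),
]
# A is only partially there; B and C are both complete.
available = {"/data/A/00", "/data/B/00", "/data/B/01", "/data/C/00", "/data/C/01"}

chosen = resolve_or(datasets, available)
print(chosen)
```

Although C is also complete, the model picks B because it is listed earlier.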

Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary
only after waiting for a specific amount of time.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
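
The fallback rule described above can be sketched as a small decision function. This is only a model of the prose: the function name is hypothetical, and treating the `wait` value as minutes is an assumption, since the slide does not state the unit.

```python
def choose_source(a_ready, b_ready, minutes_waited, wait_minutes=120):
    """Toy model of 'wait for primary'.

    Prefer the primary source A whenever it is ready; fall back to the
    secondary source B only once the wait period has elapsed.
    Returns "A", "B", or None (keep waiting).
    """
    if a_ready:
        return "A"
    if b_ready and minutes_waited >= wait_minutes:
        return "B"
    return None


print(choose_source(a_ready=False, b_ready=True, minutes_waited=30))
print(choose_source(a_ready=False, b_ready=True, minutes_waited=120))
print(choose_source(a_ready=True, b_ready=True, minutes_waited=5))
```

In this model B being available early is not enough; the action keeps waiting for A until the wait period is over.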

Combining Dataset From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is
missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
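
A minimal Python sketch of the combine idea follows: for each instance in the range, take it from A if present, otherwise from B. The function name and the `/data/...` paths are hypothetical, and the sketch assumes both providers cover the same instance range, as in the snippet above.

```python
def combine(a_instances, b_instances, available):
    """Toy model of <combine>: fill each slot from A first, then from B.

    Returns the chosen path per slot, or None if some slot is missing
    from both providers.
    """
    chosen = []
    for a_path, b_path in zip(a_instances, b_instances):
        if a_path in available:
            chosen.append(a_path)
        elif b_path in available:
            chosen.append(b_path)
        else:
            return None
    return chosen


offsets = range(-5, 0)  # stands in for coord:current(-5) .. coord:current(-1)
a_instances = [f"/data/A/t{n}" for n in offsets]
b_instances = [f"/data/B/t{n}" for n in offsets]
available = {"/data/A/t-5", "/data/A/t-4", "/data/A/t-2",
             "/data/B/t-5", "/data/B/t-3", "/data/B/t-1"}

merged = combine(a_instances, b_instances, available)
print(merged)
```

Note that the slot for `t-5` comes from A even though B also has it, which mirrors the "A first" rule in the text.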

MiniOozie
• MiniOozie
  • HCat
  • Pig
  • Hive
  • Spark
• MiniOozieClient
  • To communicate with the Oozie server.

Oozie unit Yaml
name: TestCoordinator
job:
  properties:
    raw_logs_path: "/tmp/test/input"
    aggregated_logs_path: "/user/test/output"
    oozie.coord.application.path: src/test/resources/coordinator-test.xml
hdfs:
  touchz:
    - /tmp/test/input/2010/02/01/09/_SUCCESS
    - /tmp/test/input/2010/02/01/10/_SUCCESS
  mkdir:
    - /user/test/output
validations:
  validate_job:
    sleep: 6000
    coordinator_actions:
      - coordinator_action: "@2"
        not_status: WAITING
        nominal_time: 2010-02-01T11:00Z

Spark Action
• Oozie native support for Apache Spark jobs
• Introduced last year in Apache Oozie 4.2.0

Example
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Spark-FileCopy</name>
<class>org.apache.oozie.example.SparkFileCopy</class>
<jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
<file>${nameNode}/${examplesRoot}/apps/spark/myfiles/somefile.txt</file>
<archive>${nameNode}/${examplesRoot}/apps/spark/myfiles/someArchive.zip</archive>
<spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
<arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>
<arg>${nameNode}/${examplesRoot}/output-data/spark</arg>
</spark>

PySpark Example
• Automatically sets up pyspark.zip and py4j-src.zip from Sharelib
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>PySparkExample</name>
<jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>
<spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
</spark>

Modes supported
• In local and yarn-client mode, the driver runs inside the Oozie launcher itself, so any
property meant for the driver must be prefixed with oozie.launcher.
• For example, raise oozie.launcher.mapreduce.map.memory.mb and
oozie.launcher.mapreduce.map.java.opts to increase
driver memory.

Master      Mode
local[*]
yarn        client
yarn        cluster
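
As a small illustration of the prefixing rule, the helper below builds the two launcher properties named above for local and client mode. The helper name and the 0.8 heap-to-container ratio are assumptions made for this sketch; they do not come from Oozie or from the slides.

```python
def launcher_driver_properties(mode, memory_mb, heap_fraction=0.8):
    """Return launcher properties that size the driver in local/client mode.

    In these modes the driver lives in the Oozie launcher map task, so its
    memory is controlled through oozie.launcher.-prefixed MapReduce settings.
    For any other mode this sketch returns an empty dict.
    """
    if mode not in ("local", "client"):
        return {}
    # Assumed ratio between container size and JVM heap (illustrative only).
    heap_mb = int(memory_mb * heap_fraction)
    return {
        "oozie.launcher.mapreduce.map.memory.mb": str(memory_mb),
        "oozie.launcher.mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
    }


props = launcher_driver_properties("client", 4096)
for name, value in props.items():
    print(f"{name}={value}")
```

The resulting pairs would go into the `<configuration>` of the action or into the job properties.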

Recent enhancements
• Support for PySpark jobs
• Show Spark Job URLs in Oozie UI under Child Jobs Tab
• Automatically include spark-defaults.conf from Sharelib
• Support for <file> and <archive>
• Faster job launch time
• Simplify setting up of classpath
• Avoid re-uploading jars for localization by reusing hdfs paths in
mapreduce.job.cache.files
• A couple of bug fixes

Future Work
• Oozie unit testing framework
  • No unit tests now. Directly tested by running in staging
• Coordinator dependency management
• Better reprocessing
• Aperiodic processing
  • Managed through workarounds
