Recent Development in Oozie
Purshotam Shah (purushah@yahoo-inc.com)
Satish Saley (saley@yahoo-inc.com)

August 2016 HUG: Recent development in Apache Oozie


The first part of the talk will describe the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages; incremental and partial processing; combinatorial, conditional, and optional processing; priority processing; late processing; and BCP management. The second part of the talk will focus on out-of-the-box support for Spark jobs.

Speakers:
Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer.
Satish Saley is a software engineer at Yahoo!. He contributes to Apache Oozie.

Published in: Technology

August 2016 HUG: Recent development in Apache Oozie

  1. Recent Development in Oozie
     Purshotam Shah (purushah@yahoo-inc.com)
     Satish Saley (saley@yahoo-inc.com)
  2. Agenda
     1. Oozie at Yahoo
     2. Data Pipelines and Complex dependencies
     3. Oozie unit testing
     4. Spark Action
     5. Future Work
  3. Why Oozie?
     • Out-of-box support for multiple job types
        • Java, shell, distcp
        • MapReduce (pipes, streaming)
        • Pig, Hive, Spark
     • Highly scalable
     • High availability
        • Hot-Hot with rolling upgrades
        • Load balanced
     • Hue integration
     [Diagram: Oozie, HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]
  4. Scale at Yahoo
     • Deployed on all clusters (production, non-production)
     • One instance per cluster
     • 75 products / 2000+ projects
     • 255 monthly users
     • 90,00 workflow jobs daily (June 2016, one busy cluster)
        • Between 1 and 8 actions; avg. 4 actions/workflow
        • Extreme use case: submit 100-200 workflow jobs per min
     • 2,277 coordinator jobs daily (June 2016, one busy cluster)
        • Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
        • 99% of workflow jobs kicked from coordinator
     • 97 bundle jobs daily (June 2016, one busy cluster)
  5. Agenda: 1 Oozie at Yahoo, 2 Data Pipelines and Complex dependencies, 3 Oozie unit testing, 4 Spark Action, 5 Future Work
  6. Data Pipelines: Ad Exchange, Ad Latency, Search Advertising, Content Management, Content Optimization, Content Personalization, Flickr Video, Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting, Advertisement Content Targeting
  7. Data Pipelines: Anti Spam, Content, Retargeting, Research, Dashboards & Reports, Forecasting, Email, Data Intelligence, Data Management, Audience Pipeline
  8. Use Case - Data pipeline
  9. Oozie Coordinator

         <coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
           <datasets>
             <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
               <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
             </dataset>
             <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
               <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
             </dataset>
           </datasets>
           <input-events>
             <data-in name="coordInput1" dataset="input1">
               <instance>${coord:current(0)}</instance>
             </data-in>
             <data-in name="coordInput2" dataset="input2">
               <instance>${coord:current(0)}</instance>
             </data-in>
           </input-events>
           <action>
             <workflow>
               <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
             </workflow>
           </action>
         </coordinator-app>
  10. Current limitations of the Oozie coordinator
      • All datasets are required
      • All instances are forced
      • We can’t combine datasets from multiple providers
      • There is no way to assign priority among datasets
  11. Complex dependencies
      OOZIE-1976: Specifying coordinator input datasets in more logical ways
  12. Oozie Coordinator with input logic

          <coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
            <datasets>
              <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
                <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
              </dataset>
              <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
                <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
              </dataset>
            </datasets>
            <input-events>
              <data-in name="coordInput1" dataset="input1">
                <instance>${coord:current(0)}</instance>
              </data-in>
              <data-in name="coordInput2" dataset="input2">
                <instance>${coord:current(0)}</instance>
              </data-in>
            </input-events>
            <input-logic>
              <or name="input1ORinput2">
                <data-in dataset="input1"/>
                <data-in dataset="input2"/>
              </or>
            </input-logic>
            …
  13. BCP Support
      Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available.

          <input-logic>
            <or name="AorB">
              <data-in dataset="A"/>
              <data-in dataset="B"/>
            </or>
          </input-logic>
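      The slide above does not show how the workflow receives whichever input was chosen. The fragment below is a minimal sketch of one way this might be wired up. It assumes that the `${coord:dataIn(...)}` EL function accepts the name given to the `<or>` block, and it uses a hypothetical workflow property called `inputDir`; neither detail comes from the slides.

      ```xml
      <!-- Sketch only: hands the input that satisfied "AorB" to the workflow. -->
      <action>
        <workflow>
          <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
          <configuration>
            <property>
              <!-- "inputDir" is a hypothetical property name read by the workflow. -->
              <name>inputDir</name>
              <!-- Assumption: dataIn resolves the name of the <or> block above. -->
              <value>${coord:dataIn('AorB')}</value>
            </property>
          </configuration>
        </workflow>
      </action>
      ```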
  14. Minimum availability processing
      • Sometimes we want to process even if only partial data is available.

          <input-logic>
            <data-in dataset="A" min="4"/>
          </input-logic>
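      The `min` attribute only makes sense when the input spans several instances. As a sketch, assuming a data-in named `A` that covers six instances of a dataset called `dataset_A` (the name and the range are illustrative), the action could start once any four of the six are present:

      ```xml
      <!-- Sketch only: A spans six instances, current(-5) through current(0). -->
      <input-events>
        <data-in name="A" dataset="dataset_A">
          <start-instance>${coord:current(-5)}</start-instance>
          <end-instance>${coord:current(0)}</end-instance>
        </data-in>
      </input-events>
      <!-- Start processing when at least four of those instances are available. -->
      <input-logic>
        <data-in dataset="A" min="4"/>
      </input-logic>
      ```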
  15. Optional feeds
      • Dataset B is optional. Oozie will start processing as soon as A is available, and will include the data from A plus whatever is available from B.

          <input-logic>
            <and name="optional">
              <data-in dataset="A"/>
              <data-in dataset="B" min="0"/>
            </and>
          </input-logic>
  16. Priority Among Dataset Instances
      A will have higher precedence over B, and B will have higher precedence over C.

          <input-logic>
            <or name="AorBorC">
              <data-in dataset="A"/>
              <data-in dataset="B"/>
              <data-in dataset="C"/>
            </or>
          </input-logic>
  17. Wait for primary
      Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.

          <input-logic>
            <or name="AorB">
              <data-in dataset="A" wait="120"/>
              <data-in dataset="B"/>
            </or>
          </input-logic>
  18. Combining Dataset From Multiple Providers
      The combine function will first check instances from A, then go to B for whatever is missing in A.

          <data-in name="A" dataset="dataset_A">
            <start-instance>${coord:current(-5)}</start-instance>
            <end-instance>${coord:current(-1)}</end-instance>
          </data-in>
          <data-in name="B" dataset="dataset_B">
            <start-instance>${coord:current(-5)}</start-instance>
            <end-instance>${coord:current(-1)}</end-instance>
          </data-in>
          <input-logic>
            <combine name="AB">
              <data-in dataset="A"/>
              <data-in dataset="B"/>
            </combine>
          </input-logic>
  19. Agenda: 1 Oozie at Yahoo, 2 Data Pipelines and Complex dependencies, 3 Oozie unit testing, 4 Spark Action, 5 Future Work
  20. MiniOozie
      • MiniOozie
         • HCat
         • Pig
         • Hive
         • Spark
      • MiniOozieClient
         • To communicate with the Oozie server
  21. Oozie unit YAML

          name: TestCoordinator
          job:
            properties:
              raw_logs_path: "/tmp/test/input"
              aggregated_logs_path: "/user/test/output"
              oozie.coord.application.path: src/test/resources/coordinator-test.xml
          hdfs:
            touchz:
              - /tmp/test/input/2010/02/01/09/_SUCCESS
              - /tmp/test/input/2010/02/01/10/_SUCCESS
            mkdir:
              - /user/test/output
          validations:
            validate_job:
              sleep: 6000
              coordinator_actions:
                - coordinator_action: "@2"
                  not_status: WAITING
                  nominal_time: 2010-02-01T11:00Z
  22. Agenda: 1 Oozie at Yahoo, 2 Data Pipelines and Complex dependencies, 3 Oozie unit testing, 4 Spark Action, 5 Future Work
  23. Spark Action (Yahoo Confidential & Proprietary)
      • Oozie native support for Apache Spark jobs
      • Introduced last year in Apache Oozie 4.2.0
  24. Example

          <spark xmlns="uri:oozie:spark-action:0.2">
            <master>yarn</master>
            <mode>cluster</mode>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <file>${nameNode}/${examplesRoot}/apps/spark/myfiles/somefile.txt</file>
            <archive>${nameNode}/${examplesRoot}/apps/spark/myfiles/someArchive.zip</archive>
            <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
            <arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>${nameNode}/${examplesRoot}/output-data/spark</arg>
          </spark>
  25. PySpark Example
      • Automatically sets up pyspark.zip and py4j-src.zip from Sharelib

          <spark xmlns="uri:oozie:spark-action:0.2">
            <master>yarn</master>
            <mode>cluster</mode>
            <name>PySparkExample</name>
            <jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>
            <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
          </spark>
  26. Modes supported
      • Supported master/mode combinations: local[*]; yarn with client mode; yarn with cluster mode.
      • In local and yarn-client mode, the driver runs in the Oozie launcher itself, so any property meant for the driver must be prefixed with oozie.launcher.
      • For example, to increase driver memory, modify oozie.launcher.mapreduce.map.memory.mb and oozie.launcher.mapreduce.map.java.opts.
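      As a sketch of the point above, the two launcher properties could be set in the action's `<configuration>` block when running in yarn-client mode. The memory values, the `${jobTracker}` variable, and the reuse of the Spark-FileCopy example are illustrative assumptions, not settings from the slides.

      ```xml
      <!-- Sketch only: in yarn-client mode the driver lives in the launcher,
           so driver memory is raised through oozie.launcher.* properties. -->
      <spark xmlns="uri:oozie:spark-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
          <property>
            <name>oozie.launcher.mapreduce.map.memory.mb</name>
            <!-- Illustrative value: container size for the launcher map task. -->
            <value>4096</value>
          </property>
          <property>
            <name>oozie.launcher.mapreduce.map.java.opts</name>
            <!-- Illustrative value: JVM heap kept below the container size. -->
            <value>-Xmx3072m</value>
          </property>
        </configuration>
        <master>yarn</master>
        <mode>client</mode>
        <name>Spark-FileCopy</name>
        <class>org.apache.oozie.example.SparkFileCopy</class>
        <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
      </spark>
      ```

      In yarn-cluster mode the driver runs in its own YARN container, so driver memory would normally be set through Spark's own options rather than these launcher properties.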
  27. Recent enhancements
      • Support for PySpark jobs
      • Show Spark job URLs in the Oozie UI under the Child Jobs tab
      • Automatically include spark-defaults.conf from Sharelib
      • Support for <file> and <archive>
      • Faster job launch time
      • Simplified classpath setup
      • Avoid re-uploading jars for localization by reusing HDFS paths in mapreduce.job.cache.files
      • A couple of bug fixes
  28. Agenda: 1 Oozie at Yahoo, 2 Data Pipelines and Complex dependencies, 3 Oozie unit testing, 4 Spark Action, 5 Future Work
  29. Future Work
      • Oozie unit testing framework
         • No unit tests now. Directly tested by running in staging
      • Coordinator dependency management
      • Better reprocessing
      • Aperiodic processing
         • Managed through workarounds
