Recent Development in Oozie
Purshotam Shah (purushah@yahoo-inc.com)
Satish Saley (saley@yahoo-inc.com)
Agenda
1 Oozie at Yahoo
2 Data Pipelines and Complex dependencies
3 Oozie unit testing
4 Spark Action
5 Future Work
Why Oozie?
• Out-of-box support for multiple job types
  • Java, shell, distcp
  • MapReduce
    • Pipes, streaming
  • Pig, Hive, Spark
• Highly scalable
• High availability
  • Hot-Hot with rolling upgrades
  • Load balanced
• Hue integration
[Diagram: Oozie alongside HBase, Pig, Hive, Spark, YARN, HDFS, Hue, and HCatalog]
Scale at Yahoo
Deployed on all clusters (production, non-production)
One instance per cluster
75 products / 2,000+ projects
255 monthly users
90,00 workflow jobs daily (June 2016, one busy cluster)
Between 1 and 8 actions per workflow (avg. 4 actions/workflow)
Extreme use case: 100-200 workflow jobs submitted per minute
2,277 coordinator jobs daily (June 2016, one busy cluster)
Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
99% of workflow jobs are kicked off by coordinators
97 bundle jobs daily (June 2016, one busy cluster)
Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
Data Pipelines
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Pipeline
Use Case - Data pipeline
Oozie Coordinator
<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>
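To make the dataset definitions above more concrete, here is a minimal Python sketch of how a uri-template could be resolved for coord:current(n). It is not Oozie code. It assumes, as a simplification, that dataset instances fall at initial-instance plus whole multiples of the frequency, and it ignores time zones and daylight saving. The function names and the nominal time are hypothetical.

```python
from datetime import datetime, timedelta


def current_instance(nominal, initial, frequency_min, n=0):
    """Return the dataset instance time for coord:current(n) (simplified model)."""
    step = timedelta(minutes=frequency_min)
    # Whole dataset periods elapsed between the initial instance and the nominal time.
    periods = (nominal - initial) // step
    return initial + (periods + n) * step


def resolve_uri(template, instance):
    """Substitute the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} placeholders."""
    fields = {
        "${YEAR}": f"{instance.year:04d}",
        "${MONTH}": f"{instance.month:02d}",
        "${DAY}": f"{instance.day:02d}",
        "${HOUR}": f"{instance.hour:02d}",
    }
    for key, value in fields.items():
        template = template.replace(key, value)
    return template


template = "hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}"
initial = datetime(2009, 1, 1, 0, 0)    # initial-instance from the dataset
nominal = datetime(2009, 1, 2, 0, 30)   # hypothetical nominal time of an action
instance = current_instance(nominal, initial, frequency_min=60)
print(resolve_uri(template, instance))
# prints hdfs://localhost:9000/tmp/revenue_feed-1/2009/01/02/00
```

With n set to -1 the same functions would point at the previous hourly directory, which is how ranges such as current(-5) to current(-1) select several instances.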
Current limitations of the Oozie coordinator
• All datasets are required
• All instances are forced
• We can't combine datasets from multiple providers
• There is no way to assign priority among datasets
Complex dependencies
OOZIE-1976 : Specifying coordinator input datasets in more logical ways
Oozie Coordinator with input logic
<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <input-logic>
    <or name="input1ORinput2">
      <data-in dataset="input1"/>
      <data-in dataset="input2"/>
    </or>
  </input-logic>
…...............
BCP Support
Pull data from A or B. Specify the dataset as AorB. The action starts running as soon as
either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Minimum availability processing
• Sometimes we want to process even if only partial data is available.
<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>
Optional feeds
• Dataset B is optional. Oozie starts processing as soon as A is available. It includes
the dataset from A and whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
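The optional-feed rule can be read as an AND over per-dataset minimums. The sketch below is a toy model in Python, not Oozie's implementation: each input is a pair of (instances available, minimum required), and the function name and sample values are hypothetical.

```python
def and_satisfied(inputs):
    """Return True when every input has at least its required minimum.

    inputs is a list of (available_count, required_min) pairs, one per data-in.
    """
    return all(available >= required for available, required in inputs)


# Dataset A needs its single instance; dataset B is optional (min=0).
a_ready_b_missing = [(1, 1), (0, 0)]
a_missing_b_ready = [(0, 1), (1, 0)]

print(and_satisfied(a_ready_b_missing))  # True: A is there, B is optional
print(and_satisfied(a_missing_b_ready))  # False: A is still missing
```

Under this reading, min="0" simply removes B from the set of conditions that can block the action.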
Priority Among Dataset Instances
A will have higher precedence over B and B will have higher precedence over C.
<input-logic>
<or name="AorBorC">
<data-in dataset="A"/>
<data-in dataset="B"/>
<data-in dataset="C"/>
</or>
</input-logic>
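One way to picture the precedence rule is a scan over the datasets in the order they are listed, stopping at the first one that is available. The following Python sketch illustrates that idea only; the function name and the availability set are hypothetical and not part of Oozie.

```python
def pick_by_priority(datasets, available):
    """Return the first dataset, in listed order, that is available, else None."""
    for name in datasets:
        if name in available:
            return name
    return None


order = ["A", "B", "C"]                      # order of the data-in elements
print(pick_by_priority(order, {"B", "C"}))   # B: A is missing, B outranks C
print(pick_by_priority(order, set()))        # None: nothing is available yet
```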
Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary
only after waiting for a specific amount of time.
<input-logic>
<or name="AorB">
<data-in dataset="A" wait="120"/>
<data-in dataset="B"/>
</or>
</input-logic>
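The wait rule can be sketched as a small decision function. This is an illustrative Python model, not Oozie's code: it assumes the wait value is in minutes and that elapsed time is measured from whatever point the waiting starts, which the slide does not specify. The function and parameter names are hypothetical.

```python
def choose_source(a_available, b_available, elapsed_min, wait_min=120):
    """Pick the primary source A, or fall back to B once the wait has expired."""
    if a_available:
        return "A"
    # B is only considered after the primary has been given its full wait.
    if elapsed_min >= wait_min and b_available:
        return "B"
    return None  # keep waiting


print(choose_source(False, True, elapsed_min=30))    # None: still waiting for A
print(choose_source(False, True, elapsed_min=120))   # B: wait expired, fall back
print(choose_source(True, True, elapsed_min=200))    # A: primary always wins
```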
Combining Dataset From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is
missing in A.
<data-in name="A" dataset="dataset_A">
<start-instance>${coord:current(-5)}</start-instance>
<end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
<start-instance>${coord:current(-5)}</start-instance>
<end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
<combine name="AB">
<data-in dataset="A"/>
<data-in dataset="B"/>
</combine>
</input-logic>
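The combine behaviour described above can be modelled per instance: take A's copy when it exists, otherwise B's, and wait if neither has it. The Python sketch below is a toy illustration under that reading; the function name and the availability sets are hypothetical.

```python
def combine(offsets, available_a, available_b):
    """Map each instance offset to the provider that supplies it.

    Returns None if some instance is missing from both providers.
    """
    chosen = {}
    for offset in offsets:
        if offset in available_a:
            chosen[offset] = "A"      # A is checked first
        elif offset in available_b:
            chosen[offset] = "B"      # B fills what A is missing
        else:
            return None               # not ready yet
    return chosen


offsets = range(-5, 0)  # coord:current(-5) through coord:current(-1)
result = combine(offsets, available_a={-5, -4, -2}, available_b={-3, -2, -1})
print(result)
# prints {-5: 'A', -4: 'A', -3: 'B', -2: 'A', -1: 'B'}
```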
MiniOozie
• MiniOozie
  • HCat
  • Pig
  • Hive
  • Spark
• MiniOozieClient
  • To communicate with the Oozie server
Oozie unit Yaml
name: TestCoordinator
job:
  properties:
    raw_logs_path: "/tmp/test/input"
    aggregated_logs_path: "/user/test/output"
    oozie.coord.application.path: src/test/resources/coordinator-test.xml
hdfs:
  touchz:
    - /tmp/test/input/2010/02/01/09/_SUCCESS
    - /tmp/test/input/2010/02/01/10/_SUCCESS
  mkdir:
    - /user/test/output
validations:
  validate_job:
    sleep: 6000
    coordinator_actions:
      - coordinator_action: "@2"
        not_status: WAITING
        nominal_time: 2010-02-01T11:00Z
Spark Action
Yahoo Confidential & Proprietary
• Oozie native support for Apache Spark jobs
• Introduced last year in Apache Oozie 4.2.0
Example
<spark xmlns="uri:oozie:spark-action:0.2">
  <master>yarn</master>
  <mode>cluster</mode>
  <name>Spark-FileCopy</name>
  <class>org.apache.oozie.example.SparkFileCopy</class>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
  <file>${nameNode}/${examplesRoot}/apps/spark/myfiles/somefile.txt</file>
  <archive>${nameNode}/${examplesRoot}/apps/spark/myfiles/someArchive.zip</archive>
  <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
  <arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>
  <arg>${nameNode}/${examplesRoot}/output-data/spark</arg>
</spark>
PySpark Example
• Automatically sets up pyspark.zip and py4j-src.zip from Sharelib
<spark xmlns="uri:oozie:spark-action:0.2">
  <master>yarn</master>
  <mode>cluster</mode>
  <name>PySparkExample</name>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>
  <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
</spark>
Modes supported
• In local and yarn-client mode, the driver runs inside the Oozie launcher itself, so any
property meant for the driver must be prefixed with oozie.launcher.
• For example, raise oozie.launcher.mapreduce.map.memory.mb and
oozie.launcher.mapreduce.map.java.opts to increase driver memory.
Master      Mode
local[*]
yarn        client
yarn        cluster
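As a rough illustration of the rule above, the Python helper below decides where a driver memory setting would go for each master/mode combination. It is a hypothetical helper, not part of Oozie. The 80% heap-to-container ratio is an assumption, and the "spark-opts" key is just a label for the action's spark-opts element, where the standard spark-submit option --driver-memory would be placed in cluster mode.

```python
def driver_memory_settings(master, mode, memory_mb):
    """Return the settings that would raise driver memory for a Spark action."""
    runs_in_launcher = master.startswith("local") or mode == "client"
    if runs_in_launcher:
        # The driver lives in the Oozie launcher, so size the launcher itself.
        heap_mb = memory_mb * 8 // 10  # assumed headroom between container and heap
        return {
            "oozie.launcher.mapreduce.map.memory.mb": str(memory_mb),
            "oozie.launcher.mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
        }
    # In yarn cluster mode the driver runs on the cluster, outside the launcher.
    return {"spark-opts": f"--driver-memory {memory_mb}m"}


print(driver_memory_settings("yarn", "client", 4096))
print(driver_memory_settings("yarn", "cluster", 4096))
```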
Recent enhancements
• Support for PySpark jobs
• Show Spark Job URLs in Oozie UI under Child Jobs Tab
• Automatically include spark-defaults.conf from Sharelib
• Support for <file> and <archive>
• Faster job launch time
• Simplify setting up of classpath
• Avoid re-uploading jars for localization by reusing HDFS paths in mapreduce.job.cache.files
• A couple of bug fixes
Future Work
• Oozie unit testing framework
  • No unit tests now; jobs are tested directly by running in staging
• Coordinator dependency management
• Better reprocessing
• Aperiodic processing
  • Currently managed through workarounds

More Related Content

What's hot

The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid DataWorks Summit
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsDataWorks Summit/Hadoop Summit
 
Concur Discovers the True Value of Data
Concur Discovers the True Value of DataConcur Discovers the True Value of Data
Concur Discovers the True Value of DataCloudera, Inc.
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerDataWorks Summit
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksData Con LA
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalDataWorks Summit
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidDataWorks Summit
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataDataWorks Summit
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-PatternsDouglas Moore
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneDouglas Moore
 

What's hot (20)

Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and Improvements
 
Concur Discovers the True Value of Data
Concur Discovers the True Value of DataConcur Discovers the True Value of Data
Concur Discovers the True Value of Data
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache Ranger
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposal
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government data
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 

Similar to August 2016 HUG: Recent development in Apache Oozie

Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieDataWorks Summit/Hadoop Summit
 
oozieee.pdf
oozieee.pdfoozieee.pdf
oozieee.pdfwwww63
 
Workflow on Hadoop Using Oozie__HadoopSummit2010
Workflow on Hadoop Using Oozie__HadoopSummit2010Workflow on Hadoop Using Oozie__HadoopSummit2010
Workflow on Hadoop Using Oozie__HadoopSummit2010Yahoo Developer Network
 
RDF Validation in a Linked Data World - A vision beyond structural and value ...
RDF Validation in a Linked Data World - A vision beyond structural and value ...RDF Validation in a Linked Data World - A vision beyond structural and value ...
RDF Validation in a Linked Data World - A vision beyond structural and value ...Nandana Mihindukulasooriya
 
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for HadoopMay 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for HadoopYahoo Developer Network
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleBig Data Joe™ Rossi
 
제3회난공불락 오픈소스 인프라세미나 - MySQL Performance
제3회난공불락 오픈소스 인프라세미나 - MySQL Performance제3회난공불락 오픈소스 인프라세미나 - MySQL Performance
제3회난공불락 오픈소스 인프라세미나 - MySQL PerformanceTommy Lee
 
Composing re-useable ETL on Hadoop
Composing re-useable ETL on HadoopComposing re-useable ETL on Hadoop
Composing re-useable ETL on HadoopPaul Lam
 
Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...
Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...
Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...DevOpsDays Tel Aviv
 
20151010 my sq-landjavav2a
20151010 my sq-landjavav2a20151010 my sq-landjavav2a
20151010 my sq-landjavav2aIvan Ma
 
Apache Falcon - Simplifying Managing Data Jobs on Hadoop
Apache Falcon - Simplifying Managing Data Jobs on HadoopApache Falcon - Simplifying Managing Data Jobs on Hadoop
Apache Falcon - Simplifying Managing Data Jobs on HadoopDataWorks Summit
 
Building a Dev/Test Cloud with Apache CloudStack
Building a Dev/Test Cloud with Apache CloudStackBuilding a Dev/Test Cloud with Apache CloudStack
Building a Dev/Test Cloud with Apache CloudStackke4qqq
 
Debugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-CloudDebugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-CloudSoam Acharya
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDataWorks Summit
 

Similar to August 2016 HUG: Recent development in Apache Oozie (20)

October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
 
Hadoop Oozie
Hadoop OozieHadoop Oozie
Hadoop Oozie
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
 
oozieee.pdf
oozieee.pdfoozieee.pdf
oozieee.pdf
 
Workflow on Hadoop Using Oozie__HadoopSummit2010
Workflow on Hadoop Using Oozie__HadoopSummit2010Workflow on Hadoop Using Oozie__HadoopSummit2010
Workflow on Hadoop Using Oozie__HadoopSummit2010
 
RDF Validation in a Linked Data World - A vision beyond structural and value ...
RDF Validation in a Linked Data World - A vision beyond structural and value ...RDF Validation in a Linked Data World - A vision beyond structural and value ...
RDF Validation in a Linked Data World - A vision beyond structural and value ...
 
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for HadoopMay 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Micro service architecture
Micro service architectureMicro service architecture
Micro service architecture
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
 
ACM BPM and elasticsearch AMIS25
ACM BPM and elasticsearch AMIS25ACM BPM and elasticsearch AMIS25
ACM BPM and elasticsearch AMIS25
 
제3회난공불락 오픈소스 인프라세미나 - MySQL Performance
제3회난공불락 오픈소스 인프라세미나 - MySQL Performance제3회난공불락 오픈소스 인프라세미나 - MySQL Performance
제3회난공불락 오픈소스 인프라세미나 - MySQL Performance
 
Composing re-useable ETL on Hadoop
Composing re-useable ETL on HadoopComposing re-useable ETL on Hadoop
Composing re-useable ETL on Hadoop
 
Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...
Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...
Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...
 
20151010 my sq-landjavav2a
20151010 my sq-landjavav2a20151010 my sq-landjavav2a
20151010 my sq-landjavav2a
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Apache Falcon - Simplifying Managing Data Jobs on Hadoop
Apache Falcon - Simplifying Managing Data Jobs on HadoopApache Falcon - Simplifying Managing Data Jobs on Hadoop
Apache Falcon - Simplifying Managing Data Jobs on Hadoop
 
Building a Dev/Test Cloud with Apache CloudStack
Building a Dev/Test Cloud with Apache CloudStackBuilding a Dev/Test Cloud with Apache CloudStack
Building a Dev/Test Cloud with Apache CloudStack
 
Debugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-CloudDebugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-Cloud
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
 

More from Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

August 2016 HUG: Recent development in Apache Oozie

  • 1. Recent Development in Oozie Purshotam Shah (purushah@yahoo-inc.com) Satish Saley (saley@yahoo-inc.com)
  • 2. Agenda: 1. Oozie at Yahoo 2. Data Pipelines and Complex dependencies 3. Oozie unit testing 4. Spark Action 5. Future Work
  • 3. Why Oozie? Out-of-the-box support for multiple job types (Java, shell, DistCp; MapReduce, including Pipes and streaming; Pig, Hive, Spark). Highly scalable. High availability: hot-hot with rolling upgrades, load balanced. Hue integration. [Diagram: Oozie with HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]
  • 4. Scale at Yahoo: Deployed on all clusters (production and non-production), one instance per cluster. 75 products / 2,000+ projects. 255 monthly users. 90,00 workflow jobs daily (June 2016, one busy cluster), with 1 to 8 actions per workflow and an average of 4. Extreme use case: 100-200 workflow jobs submitted per minute. 2,277 coordinator jobs daily (June 2016, one busy cluster), with frequencies of 5, 10, 15 minutes, hourly, daily, weekly, and monthly (25% under 15 minutes). 99% of workflow jobs are kicked off by a coordinator. 97 bundle jobs daily (June 2016, one busy cluster).
  • 5. Agenda: 1. Oozie at Yahoo 2. Data Pipelines and Complex dependencies 3. Oozie unit testing 4. Spark Action 5. Future Work
  • 6. Data Pipelines: Ad Exchange, Ad Latency, Search Advertising, Content Management, Content Optimization, Content Personalization, Flickr Video, Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting, Advertisement Content Targeting
  • 7. Data Pipelines: Anti Spam, Content, Retargeting, Research, Dashboards & Reports, Forecasting, Email Data Intelligence Data Management, Audience Pipeline
  • 8. Use Case - Data pipeline
  • 9. Oozie Coordinator
      <coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
        <datasets>
          <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
          </dataset>
          <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
          </dataset>
        </datasets>
        <input-events>
          <data-in name="coordInput1" dataset="input1">
            <instance>${coord:current(0)}</instance>
          </data-in>
          <data-in name="coordInput2" dataset="input2">
            <instance>${coord:current(0)}</instance>
          </data-in>
        </input-events>
        <action>
          <workflow>
            <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
          </workflow>
        </action>
      </coordinator-app>
  • 10. Current limitations of the Oozie coordinator • All datasets are required • All instances are forced • Datasets from multiple providers cannot be combined • There is no way to assign priority among datasets
  • 11. Complex dependencies: OOZIE-1976: Specifying coordinator input datasets in more logical ways
  • 12. Oozie Coordinator with input logic
      <coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
        <datasets>
          <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
          </dataset>
          <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
          </dataset>
        </datasets>
        <input-events>
          <data-in name="coordInput1" dataset="input1">
            <instance>${coord:current(0)}</instance>
          </data-in>
          <data-in name="coordInput2" dataset="input2">
            <instance>${coord:current(0)}</instance>
          </data-in>
        </input-events>
        <input-logic>
          <or name="input1ORinput2">
            <data-in dataset="input1"/>
            <data-in dataset="input2"/>
          </or>
        </input-logic>
        …...............
  • 13. BCP Support: Pull data from A or B. Specify the dataset as AorB. The action starts running as soon as either dataset A or B is available.
      <input-logic>
        <or name="AorB">
          <data-in dataset="A"/>
          <data-in dataset="B"/>
        </or>
      </input-logic>
  • 14. Minimum availability processing: Sometimes we want to process even if only partial data is available.
      <input-logic>
        <data-in dataset="A" min="4"/>
      </input-logic>
  • 15. Optional feeds: Dataset B is optional. Oozie starts processing as soon as A is available, and includes the data from A plus whatever is available from B.
      <input-logic>
        <and name="optional">
          <data-in dataset="A"/>
          <data-in dataset="B" min="0"/>
        </and>
      </input-logic>
  • 16. Priority Among Dataset Instances: A has precedence over B, and B has precedence over C.
      <input-logic>
        <or name="AorBorC">
          <data-in dataset="A"/>
          <data-in dataset="B"/>
          <data-in dataset="C"/>
        </or>
      </input-logic>
  • 17. Wait for primary: Sometimes we want to prefer the primary data source and switch to the secondary only after waiting for a specific amount of time.
      <input-logic>
        <or name="AorB">
          <data-in dataset="A" wait="120"/>
          <data-in dataset="B"/>
        </or>
      </input-logic>
  • 18. Combining Datasets From Multiple Providers: The combine function first checks instances from A, then goes to B for whatever is missing in A.
      <data-in name="A" dataset="dataset_A">
        <start-instance>${coord:current(-5)}</start-instance>
        <end-instance>${coord:current(-1)}</end-instance>
      </data-in>
      <data-in name="B" dataset="dataset_B">
        <start-instance>${coord:current(-5)}</start-instance>
        <end-instance>${coord:current(-1)}</end-instance>
      </data-in>
      <input-logic>
        <combine name="AB">
          <data-in dataset="A"/>
          <data-in dataset="B"/>
        </combine>
      </input-logic>
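The input-logic slides name each logical block (`AorB`, `AB`, and so on) but do not show how a workflow receives the resolved paths. A minimal sketch, assuming the coordinator action can pass `${coord:dataIn('AB')}` with the input-logic name in the same way it passes a plain `<data-in>` name; this should be verified against the Oozie version in use. The property name `inputDirs` is hypothetical, and the app path is reused from the coordinator slides.

```xml
<!-- Sketch only: coordinator <action> consuming the result of the "AB" input logic. -->
<action>
  <workflow>
    <!-- Workflow location reused from the coordinator example on the earlier slides. -->
    <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    <configuration>
      <property>
        <!-- Hypothetical property name that the workflow would read. -->
        <name>inputDirs</name>
        <!-- Assumption: resolves to the instances selected by <combine name="AB">. -->
        <value>${coord:dataIn('AB')}</value>
      </property>
    </configuration>
  </workflow>
</action>
```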
  • 19. Agenda: 1. Oozie at Yahoo 2. Data Pipelines and Complex dependencies 3. Oozie unit testing 4. Spark Action 5. Future Work
  • 20. MiniOozie • MiniOozie: HCat, Pig, Hive, Spark • MiniOozieClient: used to communicate with the Oozie server
  • 21. Oozie unit YAML: name: TestCoordinator job: properties: raw_logs_path: "/tmp/test/input" aggregated_logs_path: "/user/test/output" oozie.coord.application.path: src/test/resources/coordinator-test.xml hdfs: touchz: - /tmp/test/input/2010/02/01/09/_SUCCESS - /tmp/test/input/2010/02/01/10/_SUCCESS mkdir: - /user/test/output validations: validate_job: sleep: 6000 coordinator_actions: - coordinator_action: "@2" not_status: WAITING nominal_time: 2010-02-01T11:00Z
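The deck does not show the `coordinator-test.xml` that this YAML test points at. The following is a hypothetical coordinator that would be consistent with the test as written: an hourly job starting at 10:00Z, so that action `@2` has nominal time 2010-02-01T11:00Z and depends on the `/10/` input directory touched in the `hdfs` section. Everything except `raw_logs_path` and `aggregated_logs_path` (the app name, dataset name, `workflow_app_path`, and the choice of `coord:current(-1)`) is an assumption made for illustration.

```xml
<!-- Hypothetical coordinator-test.xml; a sketch, not the file used in the talk. -->
<coordinator-app name="test-coordinator" frequency="${coord:hours(1)}"
                 start="2010-02-01T10:00Z" end="2010-02-01T12:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="raw_logs" frequency="${coord:hours(1)}"
             initial-instance="2010-02-01T00:00Z" timezone="UTC">
      <!-- raw_logs_path comes from the YAML job properties; depending on the
           setup, a scheme and authority prefix may be needed here. -->
      <uri-template>${raw_logs_path}/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <!-- No <done-flag>: Oozie then waits for a _SUCCESS file in the directory. -->
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="raw_logs">
      <!-- Action @1 (10:00Z) needs hour 09; action @2 (11:00Z) needs hour 10. -->
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <!-- Hypothetical property holding the workflow location. -->
      <app-path>${workflow_app_path}</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
        <property>
          <name>outputDir</name>
          <value>${aggregated_logs_path}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

With both `_SUCCESS` files in place, the second action's dependency is already satisfied, which is what the `not_status: WAITING` validation checks.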
  • 22. Agenda: 1. Oozie at Yahoo 2. Data Pipelines and Complex dependencies 3. Oozie unit testing 4. Spark Action 5. Future Work
  • 23. Spark Action (Yahoo Confidential & Proprietary) • Native Oozie support for Apache Spark jobs • Introduced last year in Apache Oozie 4.2.0
  • 24. Example
      <spark xmlns="uri:oozie:spark-action:0.2">
        <master>yarn</master>
        <mode>cluster</mode>
        <name>Spark-FileCopy</name>
        <class>org.apache.oozie.example.SparkFileCopy</class>
        <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
        <file>${nameNode}/${examplesRoot}/apps/spark/myfiles/somefile.txt</file>
        <archive>${nameNode}/${examplesRoot}/apps/spark/myfiles/someArchive.zip</archive>
        <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
        <arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>
        <arg>${nameNode}/${examplesRoot}/output-data/spark</arg>
      </spark>
  • 25. PySpark Example: Automatically sets up pyspark.zip and py4j-src.zip from Sharelib.
      <spark xmlns="uri:oozie:spark-action:0.2">
        <master>yarn</master>
        <mode>cluster</mode>
        <name>PySparkExample</name>
        <jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>
        <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
      </spark>
  • 26. Modes supported • Supported master/mode combinations: local[*]; yarn with mode client; yarn with mode cluster • In local and yarn-client mode the driver runs in the Oozie launcher itself, so any property meant for the driver must be prefixed with oozie.launcher. • For example, raise oozie.launcher.mapreduce.map.memory.mb and oozie.launcher.mapreduce.map.java.opts to give the driver more memory.
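The launcher-prefixed properties from slide 26 can be set in the action's own `<configuration>` block. A minimal sketch, assuming a yarn-client variant of the Spark-FileCopy example from slide 24; the memory values are illustrative placeholders rather than recommendations, and `${jobTracker}` is the conventional Oozie example variable, not something shown in the deck.

```xml
<!-- Sketch only: yarn-client Spark action with a larger launcher, and therefore driver. -->
<spark xmlns="uri:oozie:spark-action:0.2">
  <!-- These two elements may instead be supplied by a workflow <global> section. -->
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <configuration>
    <property>
      <!-- Container size of the launcher map task that hosts the driver. -->
      <name>oozie.launcher.mapreduce.map.memory.mb</name>
      <value>4096</value>
    </property>
    <property>
      <!-- JVM heap for the launcher; kept below the container size above. -->
      <name>oozie.launcher.mapreduce.map.java.opts</name>
      <value>-Xmx3276m</value>
    </property>
  </configuration>
  <master>yarn</master>
  <mode>client</mode>
  <name>Spark-FileCopy</name>
  <class>org.apache.oozie.example.SparkFileCopy</class>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
  <arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>
  <arg>${nameNode}/${examplesRoot}/output-data/spark</arg>
</spark>
```

In yarn-cluster mode the driver runs in its own YARN container, so driver memory would be set through Spark options instead of these launcher properties.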
  • 27. Recent enhancements • Support for PySpark jobs • Spark job URLs shown in the Oozie UI under the Child Jobs tab • spark-defaults.conf automatically included from Sharelib • Support for <file> and <archive> • Faster job launch time • Simpler classpath setup • No re-uploading of jars for localization, by reusing HDFS paths in mapreduce.job.cache.files • A couple of bug fixes
  • 28. Agenda: 1. Oozie at Yahoo 2. Data Pipelines and Complex dependencies 3. Oozie unit testing 4. Spark Action 5. Future Work
  • 29. Future Work • Oozie unit testing framework (no unit tests now; jobs are tested directly by running in staging) • Coordinator dependency management • Better reprocessing • Aperiodic processing (currently managed through workarounds)