Building and managing complex
dependencies pipeline using
Apache Oozie
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer
Agenda
1 Oozie at Yahoo
2 Data Pipelines
3 SLA and Monitoring
4 Monitoring Limitations and User monitoring systems
5 Future Work
Why Oozie?
 Out-of-box support for multiple job types
 Java, shell, DistCp
 MapReduce
• Pipes, streaming
 Pig, Hive, Spark
 Highly scalable
 High availability
 Hot-Hot with rolling upgrades
 Load balanced
 Hue Integration
[Diagram: Oozie with HBase, Pig, Hive, Spark, YARN, HDFS, Hue, and HCatalog]
Deployment Architecture at Yahoo
[Diagram: clients submit requests through a load balancer, which redirects them to Oozie Server 1 and Oozie Server 2. Other components shown: ZooKeeper (via Curator) for lock management, Oracle RAC, and the Hadoop cluster, HBase, and HCatalog. Inter-server communication is used for log streaming, sharelib updates, etc. Security labels on the connections: https + kerberos / cookie-based auth; https + kerberos; kerberos.]
Scale at Yahoo
Deployed on all clusters (production, non-production)
One instance per cluster
75 products / 2000 + projects
255 monthly users
90,00 workflow jobs daily (June 2016, one busy cluster)
Between 1 and 8 actions; avg. 4 actions/workflow
Extreme use case: 100-200 workflow jobs submitted per minute
2,277 coordinator jobs daily (June 2016, one busy cluster)
Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25% : < 15 min)
99% of workflow jobs are kicked off from coordinators
97 bundle jobs daily (June 2016, one busy cluster)
Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
Data Pipelines
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Pipeline
Use Case - Data pipeline
Large Scale Data Pipeline Requirements
 Administrative
 One should be able to start, stop, and pause all related pipelines, or part of them, at the
same time
 Dependency Management
 BCP support
 Data is not guaranteed; start processing even if only partial data is available
 Mandatory and optional feeds
 Multiple Providers
 If data is available from multiple providers, I want to specify the provider priority
 Combining datasets from multiple providers
 SLA Management
 Monitor pipeline processing to take immediate action in case of failures or SLA misses
 Pipeline owners should get notified if an SLA is missed
Bundle
 The Bundle system allows the user to define and execute a loosely coupled
set of coordinators. The coordinators depend on each other, but the
dependency is enforced only through their inputs and outputs.
 A bundle can be used to start/stop/suspend/resume/rerun the whole pipeline
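As an illustration of the bundle idea, a minimal sketch of a bundle definition might look like the following. The bundle name, coordinator names, and application paths are hypothetical, and the schema version is an assumption that should be checked against the Oozie release in use.

```xml
<!-- Hypothetical bundle grouping two coordinators of one pipeline. -->
<!-- Oozie does not order them; the second simply waits for data the first produces. -->
<bundle-app name="audience-pipeline" xmlns="uri:oozie:bundle:0.2">
    <coordinator name="ingest-coord">
        <app-path>${nameNode}/apps/audience/ingest</app-path>
    </coordinator>
    <coordinator name="aggregate-coord">
        <app-path>${nameNode}/apps/audience/aggregate</app-path>
    </coordinator>
</bundle-app>
```

Suspending, resuming, or killing the bundle job then applies to every coordinator listed in it.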
Complex dependencies
OOZIE-1976 : Specifying coordinator input datasets in more logical ways
BCP Support
Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as
either dataset A or B is available.
<input-logic>
    <or name="AorB">
        <data-in dataset="A"/>
        <data-in dataset="B"/>
    </or>
</input-logic>
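The same operators can also be nested when BCP applies to groups of feeds rather than single feeds. A minimal sketch, assuming four hypothetical data-in names A, B, C and D: the action would start once A and B are both available, or once C and D are both available.

```xml
<input-logic>
    <!-- Either the primary pair (A and B) or the backup pair (C and D) -->
    <or name="primaryOrBackup">
        <and>
            <data-in dataset="A"/>
            <data-in dataset="B"/>
        </and>
        <and>
            <data-in dataset="C"/>
            <data-in dataset="D"/>
        </and>
    </or>
</input-logic>
```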
Minimum availability processing
 Sometimes we want to start processing even if only partial data is available.
<input-logic>
    <!-- start once at least 4 instances of A are available -->
    <data-in dataset="A" min="4"/>
</input-logic>
Optional feeds
 Dataset B is optional. Oozie will start processing as soon as A is available, and will include
the data from A and whatever is available from B.
<input-logic>
    <and name="optional">
        <data-in dataset="A"/>
        <data-in dataset="B" min="0"/>
    </and>
</input-logic>
Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
    <or name="AorBorC">
        <data-in dataset="A"/>
        <data-in dataset="B"/>
        <data-in dataset="C"/>
    </or>
</input-logic>
Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary
only after waiting for a specific amount of time.
<input-logic>
    <or name="AorB">
        <data-in dataset="A" wait="120"/>
        <data-in dataset="B"/>
    </or>
</input-logic>
Combining Dataset From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is
missing in A.
<data-in name="A" dataset="dataset_A">
    <start-instance>${coord:current(-5)}</start-instance>
    <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
    <start-instance>${coord:current(-5)}</start-instance>
    <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
    <combine name="AB">
        <data-in dataset="A"/>
        <data-in dataset="B"/>
    </combine>
</input-logic>
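To show where these fragments sit, here is a minimal sketch of a complete coordinator that uses the combine logic. The application name, dataset URIs, the `inputDirs` property, and the `${startTime}`, `${endTime}`, `${nameNode}` and `${workflowPath}` parameters are all hypothetical. The schema version and the use of `coord:dataIn` with the input-logic name are assumptions to verify against the documentation of the Oozie release in use.

```xml
<coordinator-app name="combine-example" frequency="${coord:hours(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.5">
    <datasets>
        <!-- Same feed delivered by two providers (hypothetical paths) -->
        <dataset name="dataset_A" frequency="${coord:hours(1)}"
                 initial-instance="${startTime}" timezone="UTC">
            <uri-template>${nameNode}/data/providerA/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
        <dataset name="dataset_B" frequency="${coord:hours(1)}"
                 initial-instance="${startTime}" timezone="UTC">
            <uri-template>${nameNode}/data/providerB/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="A" dataset="dataset_A">
            <start-instance>${coord:current(-5)}</start-instance>
            <end-instance>${coord:current(-1)}</end-instance>
        </data-in>
        <data-in name="B" dataset="dataset_B">
            <start-instance>${coord:current(-5)}</start-instance>
            <end-instance>${coord:current(-1)}</end-instance>
        </data-in>
    </input-events>
    <input-logic>
        <!-- The dataset attribute here refers to the data-in names above -->
        <combine name="AB">
            <data-in dataset="A"/>
            <data-in dataset="B"/>
        </combine>
    </input-logic>
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
            <configuration>
                <property>
                    <!-- Pass the resolved input paths to the workflow -->
                    <name>inputDirs</name>
                    <value>${coord:dataIn('AB')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```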
Monitoring
 Configure to receive notifications
 Email action
 HTTP notifications for job status change
 Email notification for SLA misses
 JMS notification for SLA events
 By Polling
 CLI/REST API monitoring
• Single Job monitoring
• Bulk Monitoring for Bundles and Coordinators
• SLA monitoring
Monitoring
 An email action can be added to a workflow to send mail
 Job status change notification for coordinator actions
 oozie.coord.action.notification.url
 oozie.coord.action.notification.proxy
 Job status change notification for workflows
 oozie.wf.workflow.notification.url
 oozie.wf.workflow.notification.proxy
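A minimal sketch of how the notification URLs might be set in a job configuration is shown below. The host `monitor.example.com` and the URL paths are hypothetical. The `$jobId`, `$actionId` and `$status` tokens are the placeholders Oozie substitutes when it calls the URL.

```xml
<configuration>
    <!-- Called on workflow job status changes -->
    <property>
        <name>oozie.wf.workflow.notification.url</name>
        <value>http://monitor.example.com/oozie/wf?jobId=$jobId&amp;status=$status</value>
    </property>
    <!-- Called on coordinator action status changes -->
    <property>
        <name>oozie.coord.action.notification.url</name>
        <value>http://monitor.example.com/oozie/coord?actionId=$actionId&amp;status=$status</value>
    </property>
</configuration>
```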
Job Monitoring - polling
 Supported for both CLI and web service
 Single job monitoring
 Bulk job monitoring
 Multiple filter parameters, such as
• Bundle name, bundle id, username, startcreatedtime, endcreatedtime
 Multiple job statuses, such as
• oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'
SLA Monitoring
 Oozie can actively track SLAs on jobs'
 Start-time, End-time, Duration
 Access/Filter SLA info via
 Web-console dashboard
 REST API
 JMS Messages
 Email alert
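For reference, a minimal sketch of an SLA definition on a coordinator action is shown below. The application name, time limits, and contact address are hypothetical values, and the schema versions are assumptions to check against the Oozie release in use.

```xml
<coordinator-app name="sla-example" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4"
                 xmlns:sla="uri:oozie:sla:0.2">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
        <sla:info>
            <!-- All limits are measured from the action's nominal time -->
            <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
            <sla:should-start>${10 * MINUTES}</sla:should-start>
            <sla:should-end>${2 * HOURS}</sla:should-end>
            <sla:max-duration>${90 * MINUTES}</sla:max-duration>
            <!-- Which misses trigger an alert, and who receives the email -->
            <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
            <sla:alert-contact>oncall@example.com</sla:alert-contact>
        </sla:info>
    </action>
</coordinator-app>
```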
SLA dashboard – tabular view
SLA dashboard – Graph view
Monitoring Limitations
 User view
 BCP SLA support
 No Color coding
 Paging/oncall
 Threshold
 Consolidated email
 Multi grid view
Data pipeline monitoring use case from Y!
 Set up a cron job that periodically pulls SLA information from Oozie
 If there is any SLA miss, a notification is sent to the internal monitoring
system
› Pages and sends a mobile alert to the on-call person
› Sends an email alert
Case-1
Case-2
SLA information
 Divided into four sections
 SLA Details
 Error jobs
 Long Running Jobs
 Running jobs
SLA-status
Long Waiting jobs
Long Waiting jobs – missing dependencies
Error Jobs
Running job details
Job explorer
Feeds - jobs
Validation job
 Data pipelines also periodically run validation jobs to validate the output
 Different pipelines have different validation requirements. One example of a validation
job is validating the number of click impressions against billing details.
Alert
Reprocessing
 One of the biggest requirements of a pipeline is to reprocess the whole
dependent DAG.
 Oozie does not track dependencies between jobs; they are linked only through their input and output data
 This makes it very difficult to rerun the whole pipeline for a particular
nominal time.
Reprocessing
 To work around this Oozie limitation, they have built a job dependency DAG.
 It is very similar to the job explorer -> feed lookup feature.
 The job explorer -> feed lookup is based on the output produced by
coordinator jobs.
 The job dependency DAG is based on the input to jobs.
 Currently there is no UI for this; they parse Oozie jobs daily and store the
dependencies in a text file.
Reprocessing
 Rerun the failed action and all dependent coordinator jobs.
• Easy to do
• Cons
– Difficult to monitor
 Create a new coordinator for the time range that failed
• Easy to monitor
Consolidate SLA Monitoring
Future Work
 Oozie Unit testing framework
 No unit tests now. Directly tested by running in staging
 Coordinator Dependency management
 Better reprocessing
 Aperiodic and Incremental processing
 Managed through workarounds
Oozie BOF at Ballroom B
THANK YOU
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team

 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

Building and managing complex dependencies pipeline using Apache Oozie

  • 1. Building and managing complex dependencies pipeline using Apache Oozie. Purshotam Shah (purushah@yahoo-inc.com), Sr. Software Engineer, Yahoo Hadoop team; Apache Oozie PMC member and committer
  • 2. Agenda: (1) Oozie at Yahoo, (2) Data Pipelines, (3) SLA and Monitoring, (4) Monitoring Limitations and User monitoring systems, (5) Future Work
  • 3. Why Oozie? Out-of-box support for multiple job types: Java, shell, distcp; MapReduce (Pipes, streaming); Pig, Hive, Spark. Highly scalable. High availability: hot-hot with rolling upgrades; load balanced. Hue integration. Components shown: Oozie, HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog
  • 4. Deployment Architecture at Yahoo. Diagram components: Load Balancer; Oozie Server 1; Oozie Server 2; ZooKeeper/Curator; Oracle RAC; Hadoop Cluster, HBase, HCatalog. Connection labels: submit request; request redirection; inter-server communication for log streaming, sharelib update, etc.; lock management. Security labels: https + kerberos / cookie-based auth (twice); https + kerberos; kerberos (twice)
  • 5. Scale at Yahoo. Deployed on all clusters (production, non-production); one instance per cluster. 75 products / 2000+ projects; 255 monthly users. 90,00 workflow jobs daily (June 2016, one busy cluster): between 1-8 actions, avg. 4 actions/workflow; extreme use case: submit 100-200 workflow jobs per min. 2,277 coordinator jobs daily (June 2016, one busy cluster): frequency 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min); 99% of workflow jobs kicked from coordinator. 97 bundle jobs daily (June 2016, one busy cluster)
  • 6. Agenda: (1) Oozie at Yahoo, (2) Data Pipelines, (3) SLA and monitoring, (4) Monitoring Limitations and User monitoring systems, (5) Future Work
  • 7. Data Pipelines: Ad Exchange, Ad Latency, Search Advertising, Content Management, Content Optimization, Content Personalization, Flickr Video, Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting, Advertisement Content Targeting
  • 8. Data Pipelines: Anti Spam Content Retargeting Research Dashboards & Reports Forecasting Email Data Intelligence Data Management Audience Pipeline
  • 9. Use Case - Data pipeline
  • 10. Large Scale Data Pipeline Requirements. Administrative: one should be able to start, stop, and pause all related pipelines, or part of them, at the same time. Dependency management: BCP support; data is not guaranteed, so start processing even if only partial data is available; mandatory and optional feeds.
  • 11. Large Scale Data Pipeline Requirements. Multiple providers: if data is available from multiple providers, the user wants to specify the provider priority; combining datasets from multiple providers. SLA management: monitor pipeline processing to take immediate action in case of failures or SLA misses; pipeline owners should get notified if an SLA is missed.
  • 12. Bundle. The Bundle system allows the user to define and execute a loosely coupled set of coordinators. The coordinators depend on each other, but the dependency is enforced only via their inputs and outputs. A bundle can be used to start/stop/suspend/resume/rerun the whole pipeline.
  • 13. Complex dependencies. OOZIE-1976: Specifying coordinator input datasets in more logical ways
  • 14. BCP Support. Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available. <input-logic> <or name="AorB"> <data-in dataset="A"/> <data-in dataset="B"/> </or> </input-logic>
  • 15. Minimum availability processing. Sometimes we want to process even if only partial data is available. <input-logic> <data-in dataset="A" min="4"/> </input-logic>
  • 16. Optional feeds. Dataset B is optional; Oozie will start processing as soon as A is available. It will include the data from A and whatever is available from B. <input-logic> <and name="optional"> <data-in dataset="A"/> <data-in dataset="B" min="0"/> </and> </input-logic>
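The min attribute used in the minimum-availability and optional-feed slides can be read as a threshold on how many dataset instances must exist before the action may run. The Python sketch below models only that reading. The function name, the integer instance counts, and the rule that a missing min means "all requested instances" are illustrative assumptions, not Oozie's implementation.

```python
from typing import Optional


def min_satisfied(available: int, requested: int, minimum: Optional[int] = None) -> bool:
    """Return True when enough instances of a dataset are available.

    available: number of instances that currently exist.
    requested: number of instances the coordinator asked for.
    minimum:   value of the min attribute; None is assumed to mean
               that every requested instance is required.
    """
    threshold = requested if minimum is None else minimum
    return available >= threshold


def optional_feed_ready(a_available: int, a_requested: int, b_available: int) -> bool:
    """Model of <and> over A (mandatory) and B with min="0" (optional)."""
    a_ok = min_satisfied(a_available, a_requested)
    b_ok = min_satisfied(b_available, requested=0, minimum=0)  # always true
    return a_ok and b_ok
```

With min="0", the B branch never blocks the action, which is why the slide calls B optional.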
  • 17. Priority Among Dataset Instances. A has higher precedence than B, and B has higher precedence than C. <input-logic> <or name="AorBorC"> <data-in dataset="A"/> <data-in dataset="B"/> <data-in dataset="C"/> </or> </input-logic>
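The precedence rule on this slide (and the simpler AorB case on the BCP slide) amounts to "take the first available dataset in declaration order". A minimal Python sketch of that rule follows; the function name and the use of a plain set of available dataset names are assumptions made for illustration.

```python
from typing import Optional, Sequence, Set


def resolve_or(priority: Sequence[str], available: Set[str]) -> Optional[str]:
    """Return the first dataset, in priority order, that is available.

    Returns None when no listed dataset is available yet, which in this
    model means the action keeps waiting.
    """
    for name in priority:
        if name in available:
            return name
    return None
```

For example, with the order A, B, C and only B and C present, the sketch selects B.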
  • 18. Wait for primary. Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time. <input-logic> <or name="AorB"> <data-in dataset="A" wait="120"/> <data-in dataset="B"/> </or> </input-logic>
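The wait behaviour described on this slide can be sketched as a small decision function: prefer A whenever it is available, and fall back to B only once the wait period has elapsed. The function name is hypothetical, and treating the wait value as minutes is an assumption; the slide does not state the unit.

```python
from typing import Optional


def pick_source(a_available: bool, b_available: bool,
                minutes_elapsed: int, wait_minutes: int = 120) -> Optional[str]:
    """Choose between primary A and secondary B.

    A wins whenever it is available. B is accepted only after the wait
    period (assumed here to be in minutes) has passed without A arriving.
    """
    if a_available:
        return "A"
    if b_available and minutes_elapsed >= wait_minutes:
        return "B"
    return None  # keep waiting
```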
  • 19. Combining Dataset From Multiple Providers. The combine function first checks instances from A, then goes to B for whatever is missing in A. <data-in name="A" dataset="dataset_A"> <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:current(-1)}</end-instance> </data-in> <data-in name="B" dataset="dataset_B"> <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:current(-1)}</end-instance> </data-in> <input-logic> <combine name="AB"> <data-in dataset="A"/> <data-in dataset="B"/> </combine> </input-logic>
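The combine rule on this slide can be illustrated per instance: each requested instance is taken from A if A has it, otherwise from B. The Python sketch below follows that description only. The function name, the integer instance offsets, and the choice to mark an instance missing from both providers as None are assumptions, not Oozie behaviour.

```python
from typing import Dict, Iterable, Optional, Set


def combine(instances: Iterable[int], a_has: Set[int], b_has: Set[int]) -> Dict[int, Optional[str]]:
    """Map each requested instance to the provider that supplies it.

    A is checked first; B fills in only what A is missing. An instance
    found in neither provider maps to None in this sketch.
    """
    resolved: Dict[int, Optional[str]] = {}
    for inst in instances:
        if inst in a_has:
            resolved[inst] = "A"
        elif inst in b_has:
            resolved[inst] = "B"
        else:
            resolved[inst] = None
    return resolved
```

Using the offsets -5 through -1 from the slide as instance identifiers, an instance present in both providers is always attributed to A.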
  • 20. Agenda: (1) Oozie at Yahoo, (2) Data Pipelines, (3) SLA and monitoring, (4) Monitoring Limitations and User monitoring systems, (5) Future Work
  • 21. Monitoring. Configure to receive notifications: email action; HTTP notifications for job status change; email notification for SLA misses; JMS notification for SLA events. By polling: CLI/REST API monitoring (single job monitoring; bulk monitoring for bundles and coordinators; SLA monitoring).
  • 22. Monitoring. An email action can be added to a workflow to send mail. Job status change notification for coordinator actions: oozie.coord.action.notification.url, oozie.coord.action.notification.proxy. Job status change notification for workflows: oozie.wf.workflow.notification.url, oozie.wf.workflow.notification.proxy.
  • 23. Job Monitoring - polling. Supported for both CLI and web service. Single job monitoring. Bulk job monitoring: multiple parameters, such as bundle name, bundle id, username, startcreatedtime, endcreatedtime; multiple job statuses, such as: oozie jobs -bulk bundle=bundle-app-1; actionstatus=RUNNING; actionstatus=FAILED
  • 24. SLA Monitoring. Oozie can actively track SLAs on jobs' start time, end time, and duration. Access/filter SLA info via: web-console dashboard; REST API; JMS messages; email alert.
  • 25. SLA dashboard – tabular view
  • 26. SLA dashboard – Graph view
  • 27. Agenda: (1) Oozie at Yahoo, (2) Data Pipelines, (3) SLA and monitoring, (4) Monitoring Limitations and User monitoring systems, (5) Future Work
  • 28. Monitoring Limitations: user view; BCP SLA support; no color coding; paging/on-call; threshold; consolidated email; multi-grid view
  • 29. Data pipeline monitoring use case from Y!
  • 30. Case-1. Set up a cron job that periodically pulls SLA information from Oozie. If there is any SLA miss, a notification is sent to the internal monitoring system, which pages and sends a mobile alert to the on-call person and sends an email alert.
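The polling step in Case-1 can be sketched as a filter over SLA records followed by a notification callback. The Python below is a rough illustration only: the record shape (dictionaries with "id", "appName", and "slaStatus" keys, and "MISS" as the status value) is an assumption about what the polled SLA data looks like, and notify stands in for the internal monitoring system, which the slides do not describe.

```python
from typing import Callable, Dict, Iterable, List


def find_sla_misses(summaries: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose (assumed) slaStatus field is MISS."""
    return [s for s in summaries if s.get("slaStatus") == "MISS"]


def alert_on_misses(summaries: Iterable[Dict[str, str]],
                    notify: Callable[[str], None]) -> int:
    """Send one message per SLA miss and return how many were sent.

    notify is a hypothetical hook; a real setup would page the on-call
    person or send email from here.
    """
    misses = find_sla_misses(summaries)
    for s in misses:
        notify(f"SLA miss: {s.get('appName', '?')} ({s.get('id', '?')})")
    return len(misses)
```

A cron entry would run a script that fetches the records, calls alert_on_misses, and exits; keeping the fetch separate from the filter makes the alerting logic easy to test without a live server.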
  • 32. Case-2. Divided into four sections: SLA details; error jobs; long-running jobs; running jobs.
  • 36. Long Waiting jobs – missing dependencies
  • 41. Validation job. The data pipeline also periodically runs validation jobs to validate the output. Different pipelines have different validation requirements; one example of a validation job is validating the number of click impressions against billing details.
  • 43. Reprocessing. One of the biggest requirements of a pipeline is to reprocess the whole dependent DAG. Oozie does not support any data dependencies. This makes it very difficult to rerun the whole pipeline for a particular nominal time.
  • 44. Reprocessing. To work around this Oozie limitation, they have built a job dependency DAG. It is very similar to the job explorer -> feed lookup feature. Job explorer -> feed lookup is based on the output produced by coordinator jobs; the job dependency DAG is based on the input to jobs. Currently there is no UI for this; they parse Oozie jobs daily and store the dependencies in a text file.
  • 45. Reprocessing. Rerun the failed action and all dependent coordinator jobs: easy to do; con: difficult to monitor. Create a new coordinator for the timeline which has failed: easy to monitor.
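The reprocessing slides describe a job dependency DAG derived from job inputs, used to find everything that must be rerun after a failure. A minimal Python sketch of that idea follows. The data layout (two dictionaries mapping job names to the dataset names they read and write) and the function name are assumptions; the slides only say that dependencies are parsed daily and stored in a text file.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_jobs(inputs: Dict[str, Set[str]],
                    outputs: Dict[str, Set[str]],
                    failed: str) -> List[str]:
    """Return the failed job plus every job that depends on its output.

    inputs:  job name -> datasets the job reads.
    outputs: job name -> datasets the job writes.
    Jobs are returned in breadth-first discovery order, which is not
    guaranteed to be a valid execution order for every DAG.
    """
    # Index each dataset by the jobs that consume it.
    consumers: Dict[str, Set[str]] = {}
    for job, datasets in inputs.items():
        for ds in datasets:
            consumers.setdefault(ds, set()).add(job)

    order: List[str] = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        job = queue.popleft()
        order.append(job)
        for ds in sorted(outputs.get(job, ())):
            for consumer in sorted(consumers.get(ds, ())):
                if consumer not in seen:
                    seen.add(consumer)
                    queue.append(consumer)
    return order
```

Jobs that share no dataset with the failed job's outputs, directly or transitively, are left out of the rerun set.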
  • 49. Agenda: (1) Oozie at Yahoo, (2) Data Pipelines, (3) SLA and monitoring, (4) Monitoring Limitations and User monitoring systems, (5) Future Work
  • 50. Future Work. Oozie unit testing framework: no unit tests now; directly tested by running in staging. Coordinator dependency management: better reprocessing. Aperiodic and incremental processing: managed through workarounds.
  • 51. Oozie BOF at Ballroom B
  • 52. THANK YOU. Purshotam Shah (purushah@yahoo-inc.com), Sr. Software Engineer, Yahoo Hadoop team