Building and managing complex
dependencies pipeline using
Apache Oozie
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer
Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work
Why Oozie?
• Out-of-box support for multiple job types
  • Java, shell, DistCp
  • MapReduce
    • Pipes, streaming
  • Pig, Hive, Spark
• Highly scalable
• High availability
  • Hot-Hot with rolling upgrades
  • Load balanced
• Hue integration
[Diagram: Oozie alongside HBase, Pig, Hive, Spark, YARN, HDFS, Hue, and HCatalog]
Deployment Architecture at Yahoo
[Architecture diagram. Components: load balancer, Oozie Server 1, Oozie Server 2, ZooKeeper (via Curator) for lock management, Oracle RAC, and the Hadoop cluster with HBase and HCatalog. Flows: clients submit requests to the load balancer, which redirects them to an Oozie server; the servers communicate with each other for log streaming, sharelib updates, etc. Security labels on the links: https + kerberos / cookie-based auth (two links), https + kerberos (one link), kerberos (two links).]
Scale at Yahoo
• Deployed on all clusters (production, non-production)
  • One instance per cluster
• 75 products / 2000+ projects
• 255 monthly users
• 90,00 workflow jobs daily (June 2016, one busy cluster)
  • Between 1 and 8 actions per workflow; average 4
  • Extreme use case: 100-200 workflow jobs submitted per minute
• 2,277 coordinator jobs daily (June 2016, one busy cluster)
  • Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
  • 99% of workflow jobs are kicked off by coordinators
• 97 bundle jobs daily (June 2016, one busy cluster)
Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
Data Pipelines
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Pipeline
Use Case - Data pipeline
Large Scale Data Pipeline Requirements
• Administrative
  • It should be possible to start, stop, and pause all related pipelines, or part of them, at the same time
• Dependency Management
  • BCP (business continuity planning) support
  • Data is not guaranteed; start processing even if only partial data is available
  • Mandatory and optional feeds
Large Scale Data Pipeline Requirements
• Multiple Providers
  • If data is available from multiple providers, it should be possible to specify the provider priority
  • Combining datasets from multiple providers
• SLA Management
  • Monitor pipeline processing to take immediate action in case of failures or SLA misses
  • Pipeline owners should get notified if an SLA is missed
Bundle
• The Bundle system lets the user define and execute a loosely coupled set of coordinators. The coordinators depend on each other, but the dependency is enforced only through their inputs and outputs.
• A bundle can be used to start, stop, suspend, resume, or rerun the whole pipeline
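The bundle-level operations above map onto the `oozie job` command line. A minimal sketch that only builds the command (it does not run it); the Oozie URL and bundle ID are hypothetical, "stop" is assumed to mean `-kill`, and rerun is left out because it needs extra options.

```python
# Hypothetical helper: builds (does not execute) an `oozie job` command
# for one lifecycle operation on a whole bundle.
ACTIONS = {
    "start": "-start",
    "stop": "-kill",       # assumption: "stop" on the slide corresponds to kill
    "suspend": "-suspend",
    "resume": "-resume",
}

def bundle_command(oozie_url, bundle_id, action):
    """Return the argv list for the requested bundle operation."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["oozie", "job", "-oozie", oozie_url, ACTIONS[action], bundle_id]

cmd = bundle_command("http://localhost:11000/oozie", "0000001-bundle-B", "suspend")
```

Because the operation targets the bundle ID, every coordinator in the pipeline is affected by the single command.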
Complex dependencies
OOZIE-1976: Specifying coordinator input datasets in more logical ways
BCP Support
Pull data from A or B by specifying the dataset as AorB. The action starts running as soon as either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Minimum availability processing
• Sometimes we want to process even if only partial data is available.
<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>
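The effect of `min` can be sketched as a simple count check. This models the behaviour described above, not Oozie's implementation; the instance labels are made up for illustration.

```python
def min_satisfied(available, minimum):
    """True once at least `minimum` of the requested instances exist.

    `available` maps each requested instance label to whether it has arrived.
    """
    return sum(1 for ok in available.values() if ok) >= minimum

# Six requested instances of dataset A (hypothetical labels), four have arrived.
instances = {"t-5": True, "t-4": True, "t-3": False,
             "t-2": True, "t-1": True, "t-0": False}
ready = min_satisfied(instances, minimum=4)
```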
Optional feeds
• Dataset B is optional. Oozie starts processing as soon as A is available, and includes the data from A plus whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
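A sketch of the optional-feed rule: A is mandatory, B contributes whatever it has. The function name and instance labels are hypothetical, and this only illustrates the described behaviour.

```python
def resolve_optional(mandatory, optional):
    """Return the instances to process, or None while the mandatory feed is incomplete.

    Both arguments map an instance label to whether it is available.
    """
    if not all(mandatory.values()):
        return None  # keep waiting for A
    picked = sorted(mandatory)  # every instance of A
    picked += sorted(name for name, ok in optional.items() if ok)  # whatever B has
    return picked

a = {"A/1": True, "A/2": True}
b = {"B/1": False, "B/2": True}
inputs = resolve_optional(a, b)
```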
Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
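The precedence rule amounts to taking the first available dataset in declaration order. A minimal sketch under that assumption, with a made-up availability table:

```python
def pick_by_priority(datasets, is_available):
    """Return the first available dataset in declaration order, else None."""
    for name in datasets:
        if is_available(name):
            return name
    return None

# Hypothetical state: A has not arrived, B and C have.
available = {"A": False, "B": True, "C": True}
chosen = pick_by_priority(["A", "B", "C"], lambda name: available[name])
```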
Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specified amount of time.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
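A sketch of the fallback decision. It assumes `wait` is expressed in minutes and that elapsed time is measured from some reference point such as the action's nominal time; the slide does not state either, so treat both as assumptions.

```python
def pick_with_wait(primary_ready, secondary_ready, minutes_elapsed, wait_minutes=120):
    """Prefer the primary; fall back to the secondary only after the wait expires."""
    if primary_ready:
        return "A"
    if minutes_elapsed >= wait_minutes and secondary_ready:
        return "B"
    return None  # keep waiting

early = pick_with_wait(False, True, minutes_elapsed=30)
late = pick_with_wait(False, True, minutes_elapsed=120)
```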
Combining Dataset From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
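The combine rule can be sketched per instance: take it from A if present, otherwise from B. Returning None when neither provider has an instance is an assumption made for this illustration.

```python
def combine(instances, a_available, b_available):
    """Resolve each instance from A first, then B; None if any instance is missing."""
    resolved = []
    for inst in instances:
        if inst in a_available:
            resolved.append(("A", inst))
        elif inst in b_available:
            resolved.append(("B", inst))
        else:
            return None  # neither provider has this instance yet
    return resolved

# Offsets mirror coord:current(-5) .. coord:current(-1).
wanted = [-5, -4, -3, -2, -1]
plan = combine(wanted, a_available={-5, -4, -2}, b_available={-3, -2, -1})
```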
Monitoring
• Configure to receive notifications
  • Email action
  • HTTP notifications for job status change
  • Email notification for SLA misses
  • JMS notification for SLA events
• By polling
  • CLI/REST API monitoring
    • Single job monitoring
    • Bulk monitoring for bundles and coordinators
    • SLA monitoring
Monitoring
• An email action can be added to a workflow to send mail
• Job status change notification for coordinator actions
  • oozie.coord.action.notification.url
  • oozie.coord.action.notification.proxy
• Job status change notification for workflows
  • oozie.wf.workflow.notification.url
  • oozie.wf.workflow.notification.proxy
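On the receiving side, a status-change notification is just an HTTP request to the configured URL. A minimal sketch of parsing one, assuming the workflow notification URL was configured with the `$jobId` and `$status` tokens as query parameters; the host, path, and job ID are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical value for oozie.wf.workflow.notification.url;
# $jobId and $status are filled in before the URL is called.
NOTIFICATION_URL = "http://monitor.example.com/oozie/notify?jobId=$jobId&status=$status"

def parse_callback(url):
    """Extract the job id and status from a notification request URL."""
    query = parse_qs(urlparse(url).query)
    return query["jobId"][0], query["status"][0]

job_id, status = parse_callback(
    "http://monitor.example.com/oozie/notify"
    "?jobId=0000001-160601000000000-oozie-W&status=KILLED"
)
```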
Job Monitoring - polling
• Supported for both CLI and web service
• Single job monitoring
• Bulk job monitoring
  • Multiple filter parameters, such as
    • Bundle name, bundle id, username, startcreatedtime, endcreatedtime
  • Multiple job statuses, such as
    • oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'
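A small sketch of assembling the bulk filter shown above from a script. Only the `bundle` and `actionstatus` keys from the slide are used; the helper names and the Oozie URL are hypothetical.

```python
def bulk_filter(bundle, action_statuses=()):
    """Build the semicolon-separated filter string for `oozie jobs -bulk`."""
    parts = [f"bundle={bundle}"]
    parts += [f"actionstatus={status}" for status in action_statuses]
    return ";".join(parts)

def bulk_command(oozie_url, flt):
    """Return argv as a list so the ';' separators never reach a shell."""
    return ["oozie", "jobs", "-oozie", oozie_url, "-bulk", flt]

flt = bulk_filter("bundle-app-1", ["RUNNING", "FAILED"])
cmd = bulk_command("http://localhost:11000/oozie", flt)
```

Passing the command as an argument list (for example to `subprocess.run`) avoids the quoting that an interactive shell needs around the semicolons.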
SLA Monitoring
• Oozie can actively track SLAs on a job's
  • Start time, end time, duration
• Access/filter SLA info via
  • Web-console dashboard
  • REST API
  • JMS messages
  • Email alert
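The three tracked quantities (start time, end time, duration) can be illustrated with a small classifier for a finished job. The labels follow the MET/MISS naming used for Oozie SLA events; the function itself is an illustrative sketch, not Oozie code.

```python
from datetime import datetime, timedelta

def sla_events(expected_start, expected_end, max_duration, actual_start, actual_end):
    """Classify a finished job against its start, end, and duration SLAs."""
    return [
        "START_MISS" if actual_start > expected_start else "START_MET",
        "END_MISS" if actual_end > expected_end else "END_MET",
        "DURATION_MISS" if actual_end - actual_start > max_duration else "DURATION_MET",
    ]

# Hypothetical job: expected 01:00-02:00, at most 45 minutes; ran 01:10-01:50.
events = sla_events(
    expected_start=datetime(2016, 6, 1, 1, 0),
    expected_end=datetime(2016, 6, 1, 2, 0),
    max_duration=timedelta(minutes=45),
    actual_start=datetime(2016, 6, 1, 1, 10),
    actual_end=datetime(2016, 6, 1, 1, 50),
)
```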
SLA dashboard - tabular view
SLA dashboard - graph view
Monitoring Limitations
• User view
• BCP SLA support
• No color coding
• Paging/on-call
• Threshold
• Consolidated email
• Multi-grid view
Data pipeline monitoring use case from Y!
Case-1
• Set up a cron job which periodically pulls SLA information from Oozie
• If there is any SLA miss, a notification is sent to the internal monitoring system
  • Pages and sends a mobile alert to the on-call person
  • Sends an email alert
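One iteration of the cron-based polling described above might look like the sketch below. The JSON field names (`slaSummaryList`, `slaStatus`, `appName`, `id`) are assumptions about the SLA response shape, and `fetch` and `alert` are hypothetical hooks standing in for the REST call and the internal monitoring system.

```python
import json

def find_sla_misses(sla_json):
    """Return (job id, app name) for every summary whose SLA status is MISS."""
    summaries = json.loads(sla_json).get("slaSummaryList", [])
    return [(s["id"], s["appName"]) for s in summaries if s.get("slaStatus") == "MISS"]

def poll_once(fetch, alert):
    """One cron iteration: fetch SLA info, alert on each miss, return the miss count."""
    misses = find_sla_misses(fetch())
    for job_id, app_name in misses:
        alert(f"SLA miss: {app_name} ({job_id})")
    return len(misses)

# Stand-ins for the real fetcher and pager, with made-up sample data.
sample = json.dumps({"slaSummaryList": [
    {"id": "c1@1", "appName": "feed-a", "slaStatus": "MISS"},
    {"id": "c1@2", "appName": "feed-a", "slaStatus": "MET"},
]})
sent = []
miss_count = poll_once(lambda: sample, sent.append)
```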
Case-2
• Divided into four sections
  • SLA details
  • Error jobs
  • Long running jobs
  • Running jobs
SLA information
SLA-status
Long Waiting jobs
Long Waiting jobs - missing dependencies
Error Jobs
Running job details
Job explorer
Feeds - jobs
Validation job
• The data pipeline also periodically runs validation jobs to validate the output
• Different pipelines have different validation requirements. One example of a validation job is validating the number of click impressions against billing details.
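A validation check of this kind can be as simple as comparing two counts within a tolerance. The 1% default tolerance and the function name are assumptions for illustration; the slides do not say how the comparison is done.

```python
def counts_match(click_impressions, billed_impressions, tolerance=0.01):
    """True if the two counts differ by at most `tolerance`, relative to billing."""
    if billed_impressions == 0:
        return click_impressions == 0
    return abs(click_impressions - billed_impressions) / billed_impressions <= tolerance

ok = counts_match(1005, 1000)       # 0.5% apart
mismatch = counts_match(1200, 1000) # 20% apart
```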
Alert
Reprocessing
• One of the biggest requirements of a pipeline is the ability to reprocess the whole dependent DAG.
• Oozie does not track data dependencies between jobs
• This makes it very difficult to rerun the whole pipeline for a particular nominal time.
Reprocessing
• To work around this Oozie limitation, they have built a job dependency DAG.
• It is very similar to the job explorer -> feed lookup feature.
  • Job explorer -> feed lookup is based on the output produced by coordinator jobs.
  • The job dependency DAG is based on the input to jobs.
• Currently there is no UI for this; they parse Oozie jobs daily and store the dependencies in a text file.
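A job dependency DAG of this kind can be derived by matching each job's inputs to the jobs that produce them. The sketch below is illustrative only: the job names and dataset names are made up, and the actual parsing of Oozie job definitions is not shown.

```python
from collections import defaultdict, deque

def build_dag(jobs):
    """Map each job to the jobs that consume its outputs.

    `jobs` maps a job name to (input datasets, output datasets).
    """
    producers = defaultdict(set)
    for name, (_, outputs) in jobs.items():
        for ds in outputs:
            producers[ds].add(name)
    downstream = defaultdict(set)
    for name, (inputs, _) in jobs.items():
        for ds in inputs:
            for producer in producers[ds]:
                downstream[producer].add(name)
    return downstream

def jobs_to_rerun(dag, failed):
    """Breadth-first walk: the failed job plus everything that depends on it."""
    order, seen, queue = [], {failed}, deque([failed])
    while queue:
        job = queue.popleft()
        order.append(job)
        for child in sorted(dag.get(job, ())):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

# Hypothetical pipeline.
jobs = {
    "ingest": ([], ["raw"]),
    "clean": (["raw"], ["cleaned"]),
    "audit": (["raw"], ["audit_log"]),
    "report": (["cleaned"], ["report"]),
}
rerun = jobs_to_rerun(build_dag(jobs), "ingest")
```

The resulting list is the set of coordinators that would need to be rerun for the affected nominal time.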
Reprocessing
• Rerun the failed action and all dependent coordinator jobs.
  • Easy to do
  • Cons
    • Difficult to monitor
• Create a new coordinator for the timeline which has failed
  • Easy to monitor
Consolidate SLA Monitoring
Future Work
• Oozie unit testing framework
  • No unit tests now; tested directly by running in staging
• Coordinator dependency management
  • Better reprocessing
• Aperiodic and incremental processing
  • Managed through workarounds
Oozie BOF at Ballroom B
THANK YOU
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team

More Related Content

What's hot

Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Databricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!PGConf APAC
 
Troubleshooting Tips and Tricks for Database 19c - EMEA Tour Oct 2019
Troubleshooting Tips and Tricks for Database 19c - EMEA Tour  Oct 2019Troubleshooting Tips and Tricks for Database 19c - EMEA Tour  Oct 2019
Troubleshooting Tips and Tricks for Database 19c - EMEA Tour Oct 2019Sandesh Rao
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovations
re:Invent 2022  DAT326 Deep dive into Amazon Aurora and its innovationsre:Invent 2022  DAT326 Deep dive into Amazon Aurora and its innovations
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovationsGrant McAlister
 
MySQL 8.0 Optimizer Guide
MySQL 8.0 Optimizer GuideMySQL 8.0 Optimizer Guide
MySQL 8.0 Optimizer GuideMorgan Tocker
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsAlexander Korotkov
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoDatabricks
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreDataWorks Summit
 
Oracle Extended Clusters for Oracle RAC
Oracle Extended Clusters for Oracle RACOracle Extended Clusters for Oracle RAC
Oracle Extended Clusters for Oracle RACMarkus Michalewicz
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL AdministrationCommand Prompt., Inc
 

What's hot (20)

Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!
 
Troubleshooting Tips and Tricks for Database 19c - EMEA Tour Oct 2019
Troubleshooting Tips and Tricks for Database 19c - EMEA Tour  Oct 2019Troubleshooting Tips and Tricks for Database 19c - EMEA Tour  Oct 2019
Troubleshooting Tips and Tricks for Database 19c - EMEA Tour Oct 2019
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovations
re:Invent 2022  DAT326 Deep dive into Amazon Aurora and its innovationsre:Invent 2022  DAT326 Deep dive into Amazon Aurora and its innovations
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovations
 
MySQL 8.0 Optimizer Guide
MySQL 8.0 Optimizer GuideMySQL 8.0 Optimizer Guide
MySQL 8.0 Optimizer Guide
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with Cosco
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
Oracle Extended Clusters for Oracle RAC
Oracle Extended Clusters for Oracle RACOracle Extended Clusters for Oracle RAC
Oracle Extended Clusters for Oracle RAC
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
 

Viewers also liked

August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
July 2012 HUG: Overview of Oozie Qualification Process
July 2012 HUG: Overview of Oozie Qualification ProcessJuly 2012 HUG: Overview of Oozie Qualification Process
July 2012 HUG: Overview of Oozie Qualification ProcessYahoo Developer Network
 
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for HadoopMay 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for HadoopYahoo Developer Network
 
Oozie HUG May12
Oozie HUG May12Oozie HUG May12
Oozie HUG May12mislam77
 
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas NApache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas NYahoo Developer Network
 
Planification intégrée de ressources de la production d’électricité jusqu’au ...
Planification intégrée de ressources de la production d’électricité jusqu’au ...Planification intégrée de ressources de la production d’électricité jusqu’au ...
Planification intégrée de ressources de la production d’électricité jusqu’au ...Thearkvalais
 
Hrm Od Presentation 6oct
Hrm Od Presentation 6octHrm Od Presentation 6oct
Hrm Od Presentation 6octjunaidhr
 
Talking Social TV 2 with Ed Keller and Beth Rockwood
Talking Social TV 2 with Ed Keller and Beth RockwoodTalking Social TV 2 with Ed Keller and Beth Rockwood
Talking Social TV 2 with Ed Keller and Beth RockwoodKeller Fay Group
 
Local Govt & the 1st Amendment: Legislative Voting as "Speech" & Union Grieva...
Local Govt & the 1st Amendment: Legislative Voting as "Speech" & Union Grieva...Local Govt & the 1st Amendment: Legislative Voting as "Speech" & Union Grieva...
Local Govt & the 1st Amendment: Legislative Voting as "Speech" & Union Grieva...inversecondemnation
 
Certa Servicios Periciales - Peritos de Seguros y Comisarios de Averías
Certa Servicios Periciales - Peritos de Seguros y Comisarios de AveríasCerta Servicios Periciales - Peritos de Seguros y Comisarios de Averías
Certa Servicios Periciales - Peritos de Seguros y Comisarios de Averíasathworz
 
Pinterest for Nonprofits
Pinterest for NonprofitsPinterest for Nonprofits
Pinterest for NonprofitsAnne Yurasek
 

Viewers also liked (20)

August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Oozie at Yahoo
Oozie at YahooOozie at Yahoo
Oozie at Yahoo
 
July 2012 HUG: Overview of Oozie Qualification Process
July 2012 HUG: Overview of Oozie Qualification ProcessJuly 2012 HUG: Overview of Oozie Qualification Process
July 2012 HUG: Overview of Oozie Qualification Process
 
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for HadoopMay 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
 
Loan Decisioning Transformation
Loan Decisioning TransformationLoan Decisioning Transformation
Loan Decisioning Transformation
 
High-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in HadoopHigh-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in Hadoop
 
Oozie HUG May12
Oozie HUG May12Oozie HUG May12
Oozie HUG May12
 
Oozie meetup - HA
Oozie meetup - HAOozie meetup - HA
Oozie meetup - HA
 
Advanced Oozie
Advanced OozieAdvanced Oozie
Advanced Oozie
 
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas NApache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
 
October 2014 HUG : Oozie HA
October 2014 HUG : Oozie HAOctober 2014 HUG : Oozie HA
October 2014 HUG : Oozie HA
 
Planification intégrée de ressources de la production d’électricité jusqu’au ...
Planification intégrée de ressources de la production d’électricité jusqu’au ...Planification intégrée de ressources de la production d’électricité jusqu’au ...
Planification intégrée de ressources de la production d’électricité jusqu’au ...
 
Hrm Od Presentation 6oct
Hrm Od Presentation 6octHrm Od Presentation 6oct
Hrm Od Presentation 6oct
 
Egger
EggerEgger
Egger
 
Talking Social TV 2 with Ed Keller and Beth Rockwood
Talking Social TV 2 with Ed Keller and Beth RockwoodTalking Social TV 2 with Ed Keller and Beth Rockwood
Talking Social TV 2 with Ed Keller and Beth Rockwood
 
Local Govt & the 1st Amendment: Legislative Voting as "Speech" & Union Grieva...
Local Govt & the 1st Amendment: Legislative Voting as "Speech" & Union Grieva...Local Govt & the 1st Amendment: Legislative Voting as "Speech" & Union Grieva...
Local Govt & the 1st Amendment: Legislative Voting as "Speech" & Union Grieva...
 
Tom @ Leo's Academy
Tom @ Leo's AcademyTom @ Leo's Academy
Tom @ Leo's Academy
 
Poster KOBE
Poster KOBEPoster KOBE
Poster KOBE
 
Certa Servicios Periciales - Peritos de Seguros y Comisarios de Averías
Certa Servicios Periciales - Peritos de Seguros y Comisarios de AveríasCerta Servicios Periciales - Peritos de Seguros y Comisarios de Averías
Certa Servicios Periciales - Peritos de Seguros y Comisarios de Averías
 
Pinterest for Nonprofits
Pinterest for NonprofitsPinterest for Nonprofits
Pinterest for Nonprofits
 

Similar to Building and managing complex dependencies pipeline using Apache Oozie

Working Procedure SAP BW Testing
Working Procedure SAP BW TestingWorking Procedure SAP BW Testing
Working Procedure SAP BW TestingGavaskar Selvarajan
 
The Practice of Presto & Alluxio in E-Commerce Big Data Platform
The Practice of Presto & Alluxio in E-Commerce Big Data PlatformThe Practice of Presto & Alluxio in E-Commerce Big Data Platform
The Practice of Presto & Alluxio in E-Commerce Big Data PlatformAlluxio, Inc.
 
Errors while sending packages from oltp to bi (one of error at the time of da...
Errors while sending packages from oltp to bi (one of error at the time of da...Errors while sending packages from oltp to bi (one of error at the time of da...
Errors while sending packages from oltp to bi (one of error at the time of da...bhaskarbi
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...AboutYouGmbH
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and OracleTanel Poder
 
Oracle REST Data Services Best Practices/ Overview
Oracle REST Data Services Best Practices/ OverviewOracle REST Data Services Best Practices/ Overview
Oracle REST Data Services Best Practices/ OverviewKris Rice
 
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & GeodePivotalOpenSourceHub
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Day 8.1 system_admin_tasks
Day 8.1 system_admin_tasksDay 8.1 system_admin_tasks
Day 8.1 system_admin_taskstovetrivel
 
Migrating Data Warehouse Solutions from Oracle to non-Oracle Databases
Migrating Data Warehouse Solutions from Oracle to non-Oracle DatabasesMigrating Data Warehouse Solutions from Oracle to non-Oracle Databases
Migrating Data Warehouse Solutions from Oracle to non-Oracle DatabasesJade Global
 
Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?Ludovico Caldara
 
Sap basis online training classes
Sap basis online training classesSap basis online training classes
Sap basis online training classessapehsit
 
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...Rakuten Group, Inc.
 
Deep Dive - Usage of on premises data gateway for hybrid integration scenarios
Deep Dive - Usage of on premises data gateway for hybrid integration scenariosDeep Dive - Usage of on premises data gateway for hybrid integration scenarios
Deep Dive - Usage of on premises data gateway for hybrid integration scenariosSajith C P Nair
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Sap bw lo extraction
Sap bw lo extractionSap bw lo extraction
Sap bw lo extractionObaid shaikh
 
4392091081755796971 emea10 zero_downtimeoperations
4392091081755796971 emea10 zero_downtimeoperations4392091081755796971 emea10 zero_downtimeoperations
4392091081755796971 emea10 zero_downtimeoperationsLocuto Riorama
 

Similar to Building and managing complex dependencies pipeline using Apache Oozie (20)

Working Procedure SAP BW Testing
Working Procedure SAP BW TestingWorking Procedure SAP BW Testing
Working Procedure SAP BW Testing
 
The Practice of Presto & Alluxio in E-Commerce Big Data Platform
The Practice of Presto & Alluxio in E-Commerce Big Data PlatformThe Practice of Presto & Alluxio in E-Commerce Big Data Platform
The Practice of Presto & Alluxio in E-Commerce Big Data Platform
 
Errors while sending packages from oltp to bi (one of error at the time of da...
Errors while sending packages from oltp to bi (one of error at the time of da...Errors while sending packages from oltp to bi (one of error at the time of da...
Errors while sending packages from oltp to bi (one of error at the time of da...
 
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Oracle REST Data Services Best Practices/ Overview
Oracle REST Data Services Best Practices/ OverviewOracle REST Data Services Best Practices/ Overview
Oracle REST Data Services Best Practices/ Overview
 
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
 
Nephele 2.0: How to get the most out of your Nephele results
Nephele 2.0: How to get the most out of your Nephele resultsNephele 2.0: How to get the most out of your Nephele results
Nephele 2.0: How to get the most out of your Nephele results
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Day 8.1 system_admin_tasks
Day 8.1 system_admin_tasksDay 8.1 system_admin_tasks
Day 8.1 system_admin_tasks
 
Migrating Data Warehouse Solutions from Oracle to non-Oracle Databases
Migrating Data Warehouse Solutions from Oracle to non-Oracle DatabasesMigrating Data Warehouse Solutions from Oracle to non-Oracle Databases
Migrating Data Warehouse Solutions from Oracle to non-Oracle Databases
 
Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?
 
Sap basis online training classes
Sap basis online training classesSap basis online training classes
Sap basis online training classes
 
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
 
Deep Dive - Usage of on premises data gateway for hybrid integration scenarios
Deep Dive - Usage of on premises data gateway for hybrid integration scenariosDeep Dive - Usage of on premises data gateway for hybrid integration scenarios
Deep Dive - Usage of on premises data gateway for hybrid integration scenarios
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Sap bw lo extraction
Sap bw lo extractionSap bw lo extraction
Sap bw lo extraction
 
4392091081755796971 emea10 zero_downtimeoperations
4392091081755796971 emea10 zero_downtimeoperations4392091081755796971 emea10 zero_downtimeoperations
4392091081755796971 emea10 zero_downtimeoperations
 

Building and managing complex dependencies pipeline using Apache Oozie

  • 1. Building and managing complex dependencies pipeline using Apache Oozie. Purshotam Shah (purushah@yahoo-inc.com), Sr. Software Engineer, Yahoo Hadoop team; Apache Oozie PMC member and committer.
  • 2. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and Monitoring; (4) Monitoring Limitations and User Monitoring Systems; (5) Future Work.
  • 3. Why Oozie? Out-of-the-box support for multiple job types: Java, shell, DistCp, MapReduce (Pipes, streaming), Pig, Hive, Spark. Highly scalable. High availability: hot-hot with rolling upgrades, load balanced. Hue integration. [Diagram: Oozie alongside HBase, Pig, Hive, Spark, YARN, HDFS, Hue, and HCatalog]
  • 4. Deployment Architecture at Yahoo. [Diagram components: load balancer; Oozie Server 1 and Oozie Server 2; Oracle RAC; ZooKeeper/Curator (lock management); Hadoop cluster, HBase, HCatalog. Flow labels: submit request; request redirection; inter-server communication for log streaming, sharelib update, etc. Security labels: HTTPS + Kerberos / cookie-based auth (twice); HTTPS + Kerberos; Kerberos (twice).]
  • 5. Scale at Yahoo. Deployed on all clusters (production and non-production), one instance per cluster. 75 products / 2,000+ projects. 255 monthly users. 90,00 workflow jobs daily (June 2016, one busy cluster): between 1 and 8 actions per workflow, 4 on average; in the extreme case, 100-200 workflow jobs are submitted per minute. 2,277 coordinator jobs daily (June 2016, one busy cluster): frequencies of 5, 10, and 15 minutes, hourly, daily, weekly, and monthly (25% at under 15 minutes). 99% of workflow jobs are started from a coordinator. 97 bundle jobs daily (June 2016, one busy cluster).
  • 6. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and Monitoring; (4) Monitoring Limitations and User Monitoring Systems; (5) Future Work.
  • 7. Data Pipelines. Advertisement: Ad Exchange, Ad Latency, Search Advertising. Content: Content Management, Content Optimization, Content Personalization, Flickr Video. Targeting: Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting.
  • 8. Data Pipelines: Anti Spam Content Retargeting Research Dashboards & Reports Forecasting Email Data Intelligence Data Management Audience Pipeline
  • 9. Use Case - Data pipeline
  • 10. Large-Scale Data Pipeline Requirements. Administrative: one should be able to start, stop, and pause all related pipelines, or a subset of them, at the same time. Dependency management: BCP support; data is not guaranteed, so processing should start even if only partial data is available; mandatory and optional feeds.
  • 11. Large-Scale Data Pipeline Requirements. Multiple providers: if data is available from multiple providers, the user should be able to specify the provider priority; datasets from multiple providers can be combined. SLA management: monitor pipeline processing so that immediate action can be taken on failures or SLA misses; pipeline owners should be notified if an SLA is missed.
  • 12. Bundle. The Bundle system lets the user define and execute a set of loosely coupled coordinators. The coordinators depend on each other, but the dependency is enforced through their inputs and outputs. A bundle can be used to start, stop, suspend, resume, or rerun a whole pipeline.
  • 13. Complex dependencies. OOZIE-1976: Specifying coordinator input datasets in more logical ways.
  • 14. BCP Support. Pull data from A or B by specifying the dataset as AorB. The action starts running as soon as either dataset A or dataset B is available. <input-logic> <or name="AorB"> <data-in dataset="A"/> <data-in dataset="B"/> </or> </input-logic>
  • 15. Minimum availability processing. Sometimes we want to process even if only partial data is available. <input-logic> <data-in dataset="A" min="4"/> </input-logic>
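The minimum-availability rule on slide 15 can be modelled in a few lines. This is only an illustrative sketch, not Oozie code: the function name `resolve_min` and the instance labels are hypothetical, and it assumes `min` counts how many instances of the dataset must exist before the action may start.

```python
from typing import List, Set


def resolve_min(expected: List[str], available: Set[str], minimum: int) -> List[str]:
    """Return the available instances if at least `minimum` exist, else an empty list.

    Assumption: `min` is a count of dataset instances, as the slide's min="4" suggests.
    """
    # Keep the expected order so downstream processing sees instances in sequence.
    found = [inst for inst in expected if inst in available]
    return found if len(found) >= minimum else []


if __name__ == "__main__":
    hours = ["h1", "h2", "h3", "h4", "h5", "h6"]
    # Four of six instances are present, so processing may start with those four.
    print(resolve_min(hours, {"h1", "h2", "h4", "h6"}, minimum=4))
```

An empty result here stands for "keep waiting"; a real coordinator would re-check on its next dependency poll.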
  • 16. Optional feeds. Dataset B is optional: Oozie starts processing as soon as A is available, and includes the data from A plus whatever is available from B. <input-logic> <and name="optional"> <data-in dataset="A"/> <data-in dataset="B" min="0"/> </and> </input-logic>
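A rough model of the optional-feed behaviour on slide 16 follows. It is a sketch under stated assumptions rather than Oozie's implementation: `resolve_optional` is a hypothetical name, and it assumes that a dataset without `min` needs all of its instances while `min="0"` means none are required.

```python
from typing import List, Optional, Set


def resolve_optional(
    a_expected: List[str], b_expected: List[str], available: Set[str]
) -> Optional[List[str]]:
    """Return the instances to process, or None while mandatory data is missing."""
    a_found = [inst for inst in a_expected if inst in available]
    if len(a_found) < len(a_expected):
        # Assumption: the mandatory feed A must be complete before starting.
        return None
    # The optional feed B contributes whatever happens to be available (possibly nothing).
    b_found = [inst for inst in b_expected if inst in available]
    return a_found + b_found


if __name__ == "__main__":
    print(resolve_optional(["a1", "a2"], ["b1", "b2"], {"a1", "a2", "b2"}))
```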
  • 17. Priority Among Dataset Instances. A has higher precedence than B, and B has higher precedence than C. <input-logic> <or name="AorBorC"> <data-in dataset="A"/> <data-in dataset="B"/> <data-in dataset="C"/> </or> </input-logic>
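The precedence described on slide 17 amounts to picking the first available dataset in declaration order. The sketch below illustrates that idea only; `pick_by_priority` is a hypothetical helper, and treating declaration order as priority order is taken from the slide's description.

```python
from typing import List, Optional, Set


def pick_by_priority(order: List[str], available: Set[str]) -> Optional[str]:
    """Return the first dataset in `order` that is available, or None if none is."""
    for name in order:
        if name in available:
            return name
    return None


if __name__ == "__main__":
    # A is missing, so B wins over C because it is listed earlier.
    print(pick_by_priority(["A", "B", "C"], {"B", "C"}))
```

The same function covers the plain A-or-B case from slide 14 when only two names are passed.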
  • 18. Wait for primary. Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time. <input-logic> <or name="AorB"> <data-in dataset="A" wait="120"/> <data-in dataset="B"/> </or> </input-logic>
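One way to read the wait-for-primary rule on slide 18 is sketched below. This is an assumption-laden illustration: `pick_with_wait` is hypothetical, and the slide does not state the time unit of `wait`, so the sketch simply requires `elapsed` and `primary_wait` to share a unit.

```python
from typing import Optional


def pick_with_wait(
    primary_ready: bool, secondary_ready: bool, elapsed: int, primary_wait: int
) -> Optional[str]:
    """Prefer primary dataset A; fall back to B only after the wait has elapsed."""
    if primary_ready:
        return "A"
    # Assumption: the secondary is considered only once the primary's wait is over.
    if secondary_ready and elapsed >= primary_wait:
        return "B"
    return None  # keep waiting


if __name__ == "__main__":
    print(pick_with_wait(False, True, elapsed=30, primary_wait=120))   # still waiting for A
    print(pick_with_wait(False, True, elapsed=120, primary_wait=120))  # falls back to B
```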
  • 19. Combining Datasets From Multiple Providers. The combine function first checks the instances from A, then goes to B for whatever is missing in A. <data-in name="A" dataset="dataset_A"> <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:current(-1)}</end-instance> </data-in> <data-in name="B" dataset="dataset_B"> <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:current(-1)}</end-instance> </data-in> <input-logic> <combine name="AB"> <data-in dataset="A"/> <data-in dataset="B"/> </combine> </input-logic>
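The combine behaviour on slide 19 can be pictured as a per-instance fallback. The sketch below is illustrative only: `combine` is a hypothetical function, and the integer offsets stand in for the `coord:current(-5)` to `coord:current(-1)` range from the slide.

```python
from typing import Dict, List, Set


def combine(instances: List[int], in_a: Set[int], in_b: Set[int]) -> Dict[int, str]:
    """Map each instance offset to the provider that supplies it, preferring A."""
    resolved: Dict[int, str] = {}
    for inst in instances:
        if inst in in_a:
            resolved[inst] = "A"
        elif inst in in_b:
            # B is used only for instances that are missing in A.
            resolved[inst] = "B"
    return resolved


if __name__ == "__main__":
    offsets = [-5, -4, -3, -2, -1]
    print(combine(offsets, in_a={-5, -4, -2}, in_b={-3, -2, -1}))
```

Offsets missing from both providers are simply absent from the result; whether the action should then wait is outside what this sketch models.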
  • 20. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and Monitoring; (4) Monitoring Limitations and User Monitoring Systems; (5) Future Work.
  • 21. Monitoring. Configure to receive notifications: email action; HTTP notifications for job status change; email notification for SLA misses; JMS notification for SLA events. By polling: CLI/REST API monitoring (single job monitoring; bulk monitoring for bundles and coordinators; SLA monitoring).
  • 22. Monitoring. An email action can be added to a workflow to send mail. Job status change notification for coordinator actions: oozie.coord.action.notification.url, oozie.coord.action.notification.proxy. Job status change notification for workflows: oozie.wf.workflow.notification.url, oozie.wf.workflow.notification.proxy.
  • 23. Job Monitoring - polling. Supported for both the CLI and the web service. Single job monitoring. Bulk job monitoring: filter on multiple parameters, such as bundle name, bundle id, username, startcreatedtime, and endcreatedtime, and on multiple job statuses, for example: oozie jobs -bulk bundle=bundle-app-1; actionstatus=RUNNING; actionstatus=FAILED
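A monitoring script that polls in bulk needs to assemble the filter expression shown on slide 23. The helper below is a minimal sketch: `build_bulk_filter` is a hypothetical name, only the `bundle` and `actionstatus` keys come from the slide, and emitting the filter without spaces (so it can be passed as a single argument) is an assumption.

```python
from typing import Iterable, Tuple


def build_bulk_filter(pairs: Iterable[Tuple[str, str]]) -> str:
    """Join key=value pairs with ';', repeating a key to match several statuses."""
    return ";".join(f"{key}={value}" for key, value in pairs)


if __name__ == "__main__":
    bulk = build_bulk_filter(
        [
            ("bundle", "bundle-app-1"),
            ("actionstatus", "RUNNING"),
            ("actionstatus", "FAILED"),
        ]
    )
    print(bulk)
```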
  • 24. SLA Monitoring. Oozie can actively track SLAs on a job's start time, end time, and duration. SLA information can be accessed and filtered through the web-console dashboard, the REST API, JMS messages, and email alerts.
  • 25. SLA dashboard – tabular view
  • 26. SLA dashboard – Graph view
  • 27. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and Monitoring; (4) Monitoring Limitations and User Monitoring Systems; (5) Future Work.
  • 28. Monitoring Limitations: user view; BCP SLA support; no color coding; paging/on-call; threshold; consolidated email; multi-grid view.
  • 29. Data pipeline monitoring use case from Y!
  • 30. Case 1. A cron job is set up to periodically pull SLA information from Oozie. If there is any SLA miss, a notification is sent to the internal monitoring system, which pages and sends a mobile alert to the on-call person and sends an email alert.
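The core of the cron job in Case 1 is a filter-and-notify step, sketched below. Everything here is hypothetical: the record shape (`id`, `slaStatus`), the `MISS` value, and the `send_alert` callback are assumptions standing in for whatever the SLA query returns and whatever the internal monitoring system accepts. Fetching the records is deliberately left out.

```python
from typing import Callable, Dict, List


def find_sla_misses(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records flagged as an SLA miss (assumed field and value)."""
    return [rec for rec in records if rec.get("slaStatus") == "MISS"]


def alert_on_misses(
    records: List[Dict[str, str]], send_alert: Callable[[str], None]
) -> int:
    """Send one alert per missed SLA and return how many alerts were sent."""
    misses = find_sla_misses(records)
    for rec in misses:
        send_alert(f"SLA miss for job {rec['id']}")
    return len(misses)


if __name__ == "__main__":
    sample = [
        {"id": "job-1", "slaStatus": "MET"},
        {"id": "job-2", "slaStatus": "MISS"},
    ]
    # print stands in for the paging/email integration.
    alert_on_misses(sample, print)
```

Passing the alert sender as a callback keeps the polling logic testable without a live pager or mail server.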
  • 32. Case 2. Divided into four sections: SLA details, error jobs, long-running jobs, and running jobs.
  • 36. Long Waiting jobs – missing dependencies
  • 41. Validation job. Data pipelines also periodically run validation jobs to validate their output. Different pipelines have different validation requirements; one example of a validation job is checking the number of click impressions against billing details.
  • 43. Reprocessing. One of the biggest requirements of a pipeline is the ability to reprocess the whole dependent DAG. Oozie does not support data dependencies between coordinator jobs, which makes it very difficult to rerun the whole pipeline for a particular nominal time.
  • 44. Reprocessing. To work around this Oozie limitation, they have built a job dependency DAG. It is very similar to the job explorer -> feed lookup feature: job explorer -> feed lookup is based on the output produced by coordinator jobs, whereas the job dependency DAG is based on the input to jobs. Currently there is no UI for this; they parse Oozie jobs daily and store the dependencies in a text file.
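A job dependency DAG of the kind described on slide 44 can be derived by matching each job's inputs to the jobs that produce them. The sketch below is an illustration under assumptions, not the team's tool: the function names, job names, and feed names are all hypothetical, and the inputs/outputs maps stand in for whatever is parsed from the Oozie job definitions.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def build_dag(inputs: Dict[str, Set[str]], outputs: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return producer -> consumers edges by matching job inputs to job outputs."""
    producers: Dict[str, Set[str]] = defaultdict(set)
    for job, feeds in outputs.items():
        for feed in feeds:
            producers[feed].add(job)
    downstream: Dict[str, Set[str]] = defaultdict(set)
    for job, feeds in inputs.items():
        for feed in feeds:
            for producer in producers.get(feed, ()):
                downstream[producer].add(job)
    return dict(downstream)


def jobs_to_rerun(dag: Dict[str, Set[str]], failed: str) -> List[str]:
    """Breadth-first walk from the failed job over everything that depends on it."""
    seen = {failed}
    order = [failed]
    queue = deque([failed])
    while queue:
        job = queue.popleft()
        for child in sorted(dag.get(job, ())):  # sorted for a stable order
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    job_inputs = {"clean": {"raw"}, "agg": {"cleaned"}, "report": {"aggregated", "cleaned"}}
    job_outputs = {"ingest": {"raw"}, "clean": {"cleaned"}, "agg": {"aggregated"}}
    print(jobs_to_rerun(build_dag(job_inputs, job_outputs), "clean"))
```

Given a failed job, the walk yields the set of jobs that would need to be rerun for the same nominal time.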
  • 45. Reprocessing. Either rerun the failed action and all dependent coordinator jobs, which is easy to do but difficult to monitor, or create a new coordinator for the timeline that failed, which is easy to monitor.
  • 49. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and Monitoring; (4) Monitoring Limitations and User Monitoring Systems; (5) Future Work.
  • 50. Future Work. Oozie unit testing framework: there are no unit tests now; jobs are tested directly by running them in staging. Coordinator dependency management: better reprocessing. Aperiodic and incremental processing: currently managed through workarounds.
  • 51. Oozie BOF at Ballroom B
  • 52. Thank you. Purshotam Shah (purushah@yahoo-inc.com), Sr. Software Engineer, Yahoo Hadoop team.