Scheduling Hadoop Pipelines
How to manage data processing pipelines on Hadoop.
HUG UK 2015-01-13
About Me
Name : James Grant
Hadoop Enterprise Data Warehouse Developer here at Expedia
Working with Hadoop and related technology for about 6 years
Email : jamegrant@expedia.com or james@queeg.org
Contents
Introduce the example
Schedule the example using cron-style scheduling
Look at what’s wrong with time-based scheduling
Introducing Apache Oozie
Introducing Apache Falcon
Questions
Example
Tracking marketing profit and loss (PnL)
Using
–Booking data
–Marketing spend data
–Web server logs
Producing records showing spend, revenue and profit per campaign per day
Example – Jobs to schedule
Land Booking Data to HDFS
Land Marketing spend data to HDFS
Land Web logs to HDFS
Process web logs to identify bookings and points of entry
Enrich with booking revenue and profit
Enrich with marketing spend
Attribute revenue and profit to marketing campaign
Scheduling the Example
We need to know how long each task normally takes
We also need to know how long it could possibly take
We then need to work out at what time of day to schedule the task
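The worst-case reasoning above can be sketched in a few lines. This is a minimal illustration, not the schedule used in the talk: the job names paraphrase the example pipeline, and the durations, the dependency edges and the 02:00 start are all assumed figures.

```python
from datetime import datetime, timedelta

# Hypothetical worst-case durations in minutes; the talk gives no figures.
WORST_CASE_MIN = {
    "land_bookings": 60,
    "land_spend": 30,
    "land_weblogs": 120,
    "process_weblogs": 90,
    "enrich_bookings": 45,
    "enrich_spend": 30,
    "attribute": 60,
}

# Each job lists the jobs that must have finished before it starts (assumed).
DEPENDS_ON = {
    "land_bookings": [],
    "land_spend": [],
    "land_weblogs": [],
    "process_weblogs": ["land_weblogs"],
    "enrich_bookings": ["process_weblogs", "land_bookings"],
    "enrich_spend": ["enrich_bookings", "land_spend"],
    "attribute": ["enrich_spend"],
}


def cron_start_times(first_start):
    """Return the fixed clock time each job needs in a time-based schedule."""
    starts = {}

    def start_of(job):
        if job not in starts:
            # A job may only start once every upstream worst case has elapsed.
            starts[job] = max(
                (start_of(dep) + timedelta(minutes=WORST_CASE_MIN[dep])
                 for dep in DEPENDS_ON[job]),
                default=first_start,
            )
        return starts[job]

    for job in DEPENDS_ON:
        start_of(job)
    return starts


schedule = cron_start_times(datetime(2015, 1, 2, 2, 0))
for job, start in schedule.items():
    # Minute and hour fields in crontab order, followed by the job name.
    print(f"{start:%M %H} * * *  {job}")
```

With these assumed figures the final job cannot be scheduled before 06:45, even on days when every upstream job finishes early.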
The Problem With Time-Based Scheduling
It’s brittle
–Any delay upstream means all downstream tasks fail
It’s inefficient
–All scheduling has to be on a near worst case basis
–So the final result arrives later than we would like
It’s difficult to manage at scale
–Coordinating schedules between different teams is hard
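A small simulation makes the first two points concrete. It is only a sketch: the three-step chain and every duration are assumed figures, and "data triggered" here simply means a step starts when its predecessor finishes.

```python
# Hypothetical figures in minutes for a three-step chain; not from the talk.
STEPS = [
    # (name, typical duration, worst-case duration)
    ("land_weblogs", 40, 120),
    ("process_weblogs", 30, 90),
    ("attribute", 20, 60),
]


def fixed_schedule(steps):
    """Start offsets when each step is slotted after its predecessor's worst case."""
    offsets, t = {}, 0
    for name, _typical, worst in steps:
        offsets[name] = t
        t += worst
    return offsets


def run_time_based(steps, actual):
    """Return the finish minute, or a message naming the step that started too early."""
    offsets = fixed_schedule(steps)
    prev_finish = 0
    for name, _typical, _worst in steps:
        if prev_finish > offsets[name]:
            return f"{name} started before its input was ready"
        prev_finish = offsets[name] + actual[name]
    return prev_finish


def run_data_triggered(steps, actual):
    """Each step starts as soon as its predecessor finishes."""
    return sum(actual[name] for name, _typical, _worst in steps)


typical = {name: t for name, t, _worst in STEPS}
late = dict(typical, land_weblogs=150)  # upstream overruns its worst case

print(run_time_based(STEPS, typical))      # 230
print(run_data_triggered(STEPS, typical))  # 90
print(run_time_based(STEPS, late))         # process_weblogs started before its input was ready
print(run_data_triggered(STEPS, late))     # 200
```

On a typical day the fixed schedule delivers at minute 230 instead of 90, and on the late day it fails where the data-triggered run just finishes later.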
Introducing Apache Oozie
URL: http://oozie.apache.org/
A workflow scheduler for Hadoop jobs
Describe your workflow as a DAG of actions
Trigger that workflow periodically or on dataset availability
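To illustrate what "a DAG of actions" buys you, the sketch below orders the example jobs with Python's standard `graphlib` module (Python 3.9+). It is not Oozie code; the job names and edges are assumptions based on the example pipeline.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it waits for (assumed edges).
dag = {
    "process_weblogs": {"land_weblogs"},
    "enrich_bookings": {"process_weblogs", "land_bookings"},
    "enrich_spend": {"enrich_bookings", "land_spend"},
    "attribute": {"enrich_spend"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()

# Collect "waves": every job in a wave can run in parallel with the others.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, 1):
    print(number, wave)
```

The first wave contains the three landing jobs, which is the same shape as a fork at the start of a workflow followed by a join.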
Example Oozie Coordinator
<coordinator-app name="marketing-pnl-coord" frequency="${coord:days(1)}"
start="2015-01-02T02:00Z" end="2015-12-31T02:00Z" timezone="UTC"
xmlns="uri:oozie:coordinator:0.1">
<controls>
<timeout>1080</timeout>
<concurrency>1</concurrency>
<execution>FIFO</execution>
</controls>
<datasets>
<dataset name="d_weblogs" frequency="${coord:days(1)}"
initial-instance="2009-01-01T02:00Z" timezone="UTC">
<uri-template>hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/</uri-template>
<done-flag></done-flag>
</dataset>
...
<dataset name="d_marketing-pnl" frequency="${coord:days(1)}"
initial-instance="2009-01-01T02:00Z" timezone="UTC">
<uri-template>
hdfs://data/marketing-pnl/${YEAR}/${MONTH}/${DAY}/
</uri-template>
<done-flag></done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="e_weblogs" dataset="d_weblogs">
<instance>${coord:current(0)}</instance>
</data-in>
...
</input-events>
<output-events>
<data-out name="e_marketing-pnl" dataset="d_marketing-pnl">
<instance>${coord:current(-1)}</instance>
</data-out>
</output-events>
<action>
<workflow>
<app-path>hdfs://apps/marketing/pnl/wf/</app-path>
<configuration>
<property>
<name>wf_weblogs</name>
<value>${coord:dataIn('e_weblogs')}</value>
</property>
<property>
<name>wf_output</name>
<value>${coord:dataOut('e_marketing-pnl')}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
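The way a dataset instance turns into a concrete path can be sketched as follows. This is a simplification for daily datasets only, not Oozie's own resolution logic: `resolve` and `offset_days` are hypothetical names, with `offset_days` standing in for the `n` in `coord:current(n)`.

```python
from datetime import datetime, timedelta


def resolve(uri_template, nominal_time, offset_days=0):
    """Fill ${YEAR}, ${MONTH} and ${DAY} for one daily dataset instance."""
    instance = nominal_time + timedelta(days=offset_days)
    return (uri_template
            .replace("${YEAR}", f"{instance.year:04d}")
            .replace("${MONTH}", f"{instance.month:02d}")
            .replace("${DAY}", f"{instance.day:02d}"))


# Nominal time of the first coordinator action in the example.
nominal = datetime(2015, 1, 2, 2, 0)

weblogs_in = resolve("hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/", nominal, 0)
pnl_out = resolve("hdfs://data/marketing-pnl/${YEAR}/${MONTH}/${DAY}/", nominal, -1)

print(weblogs_in)  # hdfs://data/weblogs/2015/01/02/
print(pnl_out)     # hdfs://data/marketing-pnl/2015/01/01/
```

The coordinator action waits until the resolved input path exists (an empty done-flag means the directory itself is the signal) before launching the workflow.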
Example Oozie Workflow
<workflow-app name="marketing-pnl-wf" xmlns="uri:oozie:workflow:0.1">
<start to="fork"/>
<fork name="fork">
<path start="downloadBooking"/>
<path start="downloadWeblogs"/>
<path start="downloadSpend"/>
</fork>
<action name="downloadBooking">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>downloadBooking.sh</exec>
<argument>--bookings=${e_bookings}</argument>
<file>${wf:appPath()}/downloadBooking.sh</file>
<file>${wf:appPath()}/downloadBooking.jar</file>
</shell>
<ok to="join"/>
<error to="sendErrorEmail"/>
</action>
<action name="downloadWeblogs">
...
</action>
<action name="downloadSpend">
...
</action>
...
<join name="join" to="merge"/>
<action name="sendErrorEmail">
...
</action>
<kill name="killJobAction">
<message>"Killed job : ${wf:errorMessage(wf:lastErrorNode())}"</message>
</kill>
<end name="end" />
</workflow-app>
Scheduling With Apache Oozie
Processes will be launched in a container on the cluster
There is a lot of XML
When working with multiple teams or pipelines, dataset definitions must be repeated
Introducing Apache Falcon
URL: http://falcon.apache.org/ (previously http://falcon.incubator.apache.org/)
“A data processing and management solution”
Describe datasets and processes
Processes are scheduled based on the descriptions
Uses Oozie as the scheduler
Processes can be Hive HQL scripts, Pig scripts, or Oozie workflows
Example Dataset Description
<?xml version="1.0" encoding="UTF-8"?>
<feed description="Web Logs" name="weblogs" xmlns="uri:falcon:feed:0.1">
<frequency>days(1)</frequency>
<late-arrival cut-off="hours(18)"/>
<clusters>
<cluster name="production" type="source">
<validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
<retention limit="years(5)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/data/weblogs/${YEAR}/${MONTH}/${DAY}"/>
</locations>
<ACL owner="marketing" group="etl" permission="0755"/>
<schema location="/none" provider="none"/>
<properties>
<property name="queueName" value="prod_etl"/>
</properties>
</feed>
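The effect of a retention limit such as `years(5)` with `action="delete"` can be sketched as below. This only illustrates the idea; it does not reproduce how Falcon evaluates retention, and the function names and instance dates are hypothetical.

```python
from datetime import date


def years_before(day, years):
    """Same calendar day `years` earlier; 29 February falls back to 28 February."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def instances_to_delete(instances, today, limit_years=5):
    """Return the feed instance dates that are older than the retention limit."""
    cutoff = years_before(today, limit_years)
    return [d for d in instances if d < cutoff]


# Hypothetical daily instances of the feed.
instances = [date(2014, 12, 31), date(2015, 1, 13), date(2019, 6, 1)]
print(instances_to_delete(instances, today=date(2020, 1, 13)))
```

Only the 2014-12-31 instance falls before the cutoff in this example; the instance dated exactly five years back is kept.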
Example Process Description
<?xml version="1.0" encoding="UTF-8"?>
<process name="mkgMerge" xmlns="uri:falcon:process:0.1">
<clusters>…</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>days(1)</frequency>
<inputs>
<input name="bookings" feed="mkgBookings" start="today(0,0)" end="today(0,0)" />
<input name="webActions" feed="mkgEntryBookingLog" start="today(0,0)" end="today(0,0)" />
<input name="spend" feed="mkgSpend" start="today(0,0)" end="today(0,0)" />
</inputs>
<outputs>
<output name="output" feed="mkgEnrichedLog" instance="today(0,0)" />
</outputs>
<properties>
<property name="queueName" value="prod_etl" />
</properties>
<workflow name="mkgMerge-wf" engine="oozie" path="/apps/mkg/merge" />
</process>
Benefits and Observations of Falcon
About the same amount of XML but in smaller chunks
Declare the data and processing steps and have the schedule created for you
A dataset is declared once and used by all processing steps that need it
Also handles retention (a separate process under Oozie)
Also handles replication
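The idea that declaring inputs and outputs is enough to derive the ordering can be sketched like this. It is a conceptual illustration, not Falcon's implementation (Falcon generates Oozie coordinators that wait on feed instances). `mkgMerge` and its feeds come from the process description; the other process names are hypothetical.

```python
from graphlib import TopologicalSorter

# Only mkgMerge is taken from the slides; the upstream processes are assumed.
PROCESSES = {
    "mkgLandBookings": {"inputs": [], "outputs": ["mkgBookings"]},
    "mkgLandSpend": {"inputs": [], "outputs": ["mkgSpend"]},
    "mkgProcessWeblogs": {"inputs": ["weblogs"], "outputs": ["mkgEntryBookingLog"]},
    "mkgMerge": {
        "inputs": ["mkgBookings", "mkgEntryBookingLog", "mkgSpend"],
        "outputs": ["mkgEnrichedLog"],
    },
}


def upstream_processes(processes):
    """Map each process to the processes that produce its input feeds."""
    producer = {feed: name
                for name, spec in processes.items()
                for feed in spec["outputs"]}
    return {
        name: {producer[feed] for feed in spec["inputs"] if feed in producer}
        for name, spec in processes.items()
    }


deps = upstream_processes(PROCESSES)
order = list(TopologicalSorter(deps).static_order())

print(sorted(deps["mkgMerge"]))
print(order[-1])  # mkgMerge
```

No process names another process directly: the ordering falls out of which feeds each one reads and writes, which is why a feed only needs to be declared once.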
Oozie workflows
Describe a DAG of actions to take to complete a task
Available actions include:
–Map-Reduce
–Pig
–File system
–SSH
–Java
–Shell
All actions take place in a container on the cluster
Example Workflow
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.4" name="mkgMerge-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkgMerge.sh</exec>
<argument>--partition=${nominalTime}</argument>
<argument>--bookings=${bookings}</argument>
<argument>--webActions=${webActions}</argument>
<argument>--spend=${spend}</argument>
<file>${wf:appPath()}/mkgMerge.sh</file>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Any Questions?

CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Schedule Hadoop Pipelines with Apache Oozie and Falcon

  • 1. Scheduling Hadoop Pipelines
      How to manage data process pipelines on Hadoop.
      HUG UK 2015-01-13
  • 2. About Me
      Name: James Grant
      Hadoop Enterprise Data Warehouse Developer here at Expedia
      Working with Hadoop and related technology for about 6 years
      Email: jamegrant@expedia.com or james@queeg.org
  • 3. Contents
      Introduce the example
      Schedule the example using cron style scheduling
      Look at what’s wrong with time based scheduling
      Introducing Apache Oozie
      Introducing Apache Falcon
      Questions
  • 4. Example
      Tracking marketing profit and loss (PnL)
      Using
      – Booking data
      – Marketing spend data
      – Web server logs
      Producing records showing spend, revenue and profit per campaign per day
  • 5. Example – Jobs to schedule
      Land Booking Data to HDFS
      Land Marketing spend data to HDFS
      Land Web logs to HDFS
      Process web logs to identify bookings and points of entry
      Enrich with booking revenue and profit
      Enrich with marketing spend
      Attribute revenue and profit to marketing campaign
  • 7. Scheduling the Example
      We need to know how long each task normally takes
      We also need to know how long it could possibly take
      We then need to work out at what time of day to schedule the task
  • 10. The Problem With Time Based Scheduling
      It’s brittle
      – Any delay upstream means all downstream tasks fail
      It’s inefficient
      – All scheduling has to be on a near worst case basis
      – So the final result arrives later than we would like
      Difficult to manage at scale
      – Coordinating schedules between different teams is hard
  • 11. Introducing Apache Oozie
      URL: http://oozie.apache.org/
      A workflow scheduler for Hadoop jobs
      Describe your workflow as a DAG of actions
      Trigger that workflow periodically or on dataset availability
  • 12. Example Oozie Coordinator
      <coordinator-app name="marketing-pnl-coord" frequency="${coord:days(1)}"
                       start="2015-01-02T02:00Z" end="2015-12-31T02:00Z" timezone="UTC"
                       xmlns="uri:oozie:coordinator:0.1">
        <controls>
          <timeout>1080</timeout>
          <concurrency>1</concurrency>
          <execution>FIFO</execution>
        </controls>
  • 13. Example Oozie Coordinator
        <datasets>
          <dataset name="d_weblogs" frequency="${coord:days(1)}"
                   initial-instance="2009-01-01T02:00Z" timezone="UTC">
            <uri-template>hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/</uri-template>
            <done-flag></done-flag>
          </dataset>
          ...
          <dataset name="d_marketing-pnl" frequency="${coord:days(1)}"
                   initial-instance="2009-01-01T02:00Z" timezone="UTC">
            <uri-template>
              hdfs://data/marketing-pnl/${YEAR}/${MONTH}/${DAY}/
            </uri-template>
            <done-flag></done-flag>
          </dataset>
        </datasets>
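
A note on the empty done-flag used in these dataset definitions: in Oozie, an empty <done-flag> means a dataset instance counts as available as soon as its directory exists, whereas a named flag makes the coordinator wait for that file to appear. The fragment below is a minimal sketch of the second option. It assumes the landing job writes a _SUCCESS marker as its last step, which the slides do not state.

```xml
<dataset name="d_weblogs" frequency="${coord:days(1)}"
         initial-instance="2009-01-01T02:00Z" timezone="UTC">
  <uri-template>hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/</uri-template>
  <!-- Assumption: the job that lands the web logs creates this marker file
       only after every log file for the day has been written. -->
  <done-flag>_SUCCESS</done-flag>
</dataset>
```

Waiting on a marker file rather than on the directory reduces the chance of the workflow starting while the directory is still being filled. Leaving the element out altogether also makes Oozie look for _SUCCESS.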
  • 14. Example Oozie Coordinator
        <input-events>
          <data-in name="e_weblogs" dataset="d_weblogs">
            <instance>${coord:current(0)}</instance>
          </data-in>
          ...
        </input-events>
        <output-events>
          <data-out name="e_marketing-pnl" dataset="d_marketing-pnl">
            <instance>${coord:current(-1)}</instance>
          </data-out>
        </output-events>
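
The input and output events only name dataset instances. The coordinator also needs an action that hands the resolved paths to the workflow. The fragment below sketches how that could look using Oozie's coord:dataIn and coord:dataOut EL functions. The workflowAppPath variable and the marketingPnlDir property name are hypothetical and do not come from the slides.

```xml
<action>
  <workflow>
    <!-- Hypothetical variable holding the HDFS path of the deployed workflow -->
    <app-path>${workflowAppPath}</app-path>
    <configuration>
      <property>
        <!-- Resolved URI of the e_weblogs input instance -->
        <name>e_weblogs</name>
        <value>${coord:dataIn('e_weblogs')}</value>
      </property>
      <property>
        <!-- Hypothetical property name for the resolved output directory -->
        <name>marketingPnlDir</name>
        <value>${coord:dataOut('e_marketing-pnl')}</value>
      </property>
    </configuration>
  </workflow>
</action>
```

With a property named e_weblogs, workflow actions can refer to ${e_weblogs} in the same way the downloadBooking action refers to ${e_bookings}. For the first nominal time of this coordinator, 2015-01-02T02:00Z, coord:current(0) on d_weblogs should resolve to hdfs://data/weblogs/2015/01/02/.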
  • 17. Example Oozie Workflow
      <workflow-app name="marketing-pnl-wf" xmlns="uri:oozie:workflow:0.1">
        <start to="fork"/>
        <fork name="fork">
          <path start="downloadBooking"/>
          <path start="downloadWeblogs"/>
          <path start="downloadSpend"/>
        </fork>
  • 18. Example Oozie Workflow
        <action name="downloadBooking">
          <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
              <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
              </property>
            </configuration>
            <exec>downloadBooking.sh</exec>
            <argument>--bookings=${e_bookings}</argument>
            <file>${wf:appPath()}/downloadBooking.sh</file>
            <file>${wf:appPath()}/downloadBooking.jar</file>
          </shell>
          <ok to="join"/>
          <error to="sendErrorEmail"/>
        </action>
  • 19. Example Oozie Workflow
        <action name="downloadWeblogs"> ... </action>
        <action name="downloadSpend"> ... </action>
        ...
        <join name="join" to="merge"/>
        <action name="sendErrorEmail"> ... </action>
        <kill name="killJobAction">
          <message>"Killed job : ${wf:errorMessage(wf:lastErrorNode())}"</message>
        </kill>
        <end name="end" />
      </workflow-app>
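
The body of the sendErrorEmail action is elided on the slide. For readers who have not seen one, here is a sketch of what an Oozie email action can look like. It is not the author's definition: the recipient address, subject and body text are hypothetical, and routing both transitions to killJobAction is an assumption.

```xml
<action name="sendErrorEmail">
  <email xmlns="uri:oozie:email-action:0.1">
    <!-- Hypothetical recipient; replace with a real alert address -->
    <to>etl-alerts@example.com</to>
    <subject>Workflow ${wf:id()} failed</subject>
    <body>Failed node: ${wf:lastErrorNode()}
Error: ${wf:errorMessage(wf:lastErrorNode())}</body>
  </email>
  <!-- Assumption: fail the workflow whether or not the mail was sent -->
  <ok to="killJobAction"/>
  <error to="killJobAction"/>
</action>
```

Sending the notification before reaching the kill node means the workflow still ends in a killed state, so the coordinator does not treat the run as a success.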
  • 20. Scheduling With Apache Oozie
      Processes will be launched in a container on the cluster
      There is a lot of XML
      When working with multiple teams/pipelines dataset definitions must be repeated
  • 21. Introducing Apache Falcon
      http://falcon.incubator.apache.org/
      http://falcon.apache.org/
      “A data processing and management solution”
      Describe datasets and processes
      Processes are scheduled based on the descriptions
      Uses Oozie as the scheduler
      Processes can be Hive HQL scripts, Pig scripts or Oozie workflows
  • 22. Example Dataset Description
      <?xml version="1.0" encoding="UTF-8"?>
      <feed description="Web Logs" name="weblogs" xmlns="uri:falcon:feed:0.1">
        <frequency>days(1)</frequency>
        <late-arrival cut-off="hours(18)"/>
        <clusters>
          <cluster name="production" type="source">
            <validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="years(5)" action="delete"/>
          </cluster>
        </clusters>
        <locations>
          <location type="data" path="/data/marketing-pnl/${YEAR}/${MONTH}/${DAY}"/>
        </locations>
        <ACL owner="marketing" group="etl" permission="0755"/>
        <schema location="/none" provider="none"/>
        <properties>
          <property name="queueName" value="prod_etl"/>
        </properties>
      </feed>
  • 23. Example Process Description
      <?xml version="1.0" encoding="UTF-8"?>
      <process name="mkgMerge" xmlns="uri:falcon:process:0.1">
        <clusters>…</clusters>
        <parallel>1</parallel>
        <order>FIFO</order>
        <frequency>days(1)</frequency>
        <inputs>
          <input name="bookings" feed="mkgBookings" start="today(0,0)" end="today(0,0)" />
          <input name="webActions" feed="mkgEntryBookingLog" start="today(0,0)" end="today(0,
          <input name="spend" feed="mkgSpend" start="today(0,0)" end="today(0,0)" />
        </inputs>
        <outputs>
          <output name="output" feed="mkgEnrichedLog" instance="today(0,0)" />
        </outputs>
        <properties>
          <property name="queueName" value="prod_etl" />
        </properties>
        <workflow name="mkgMerge-wf" engine="oozie" path="/apps/mkg/merge" />
      </process>
  • 24. Benefits and Observations of Falcon
      About the same amount of XML but in smaller chunks
      Declare the data and processing steps and have the schedule created for you
      A dataset is declared once and used by all processing steps that need it
      Also handles retention (a separate process under Oozie)
      Also handles replication
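
To make the retention and replication points concrete: in a Falcon feed, both are declared per cluster. The fragment below is a sketch of a feed's clusters section with a second cluster of type target, to which Falcon would replicate the feed from the source. The cluster name backup and its one-year retention are hypothetical and do not appear in the slides.

```xml
<clusters>
  <!-- Source cluster, as in the example dataset description -->
  <cluster name="production" type="source">
    <validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
    <retention limit="years(5)" action="delete"/>
  </cluster>
  <!-- Hypothetical second cluster; type="target" marks it as a replication destination -->
  <cluster name="backup" type="target">
    <validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
    <!-- Hypothetical shorter retention on the copy -->
    <retention limit="years(1)" action="delete"/>
  </cluster>
</clusters>
```

Each cluster carries its own retention limit, so the copy can be kept for a different period than the original without a separate clean-up job.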
  • 25. Oozie workflows
      Describe a DAG of actions to take to complete a task
      Available actions are:
      – Map-Reduce
      – Pig
      – File system
      – SSH
      – Java
      – Shell
      All actions take place in a container on the cluster
  • 26. Example Workflow
      <?xml version="1.0" encoding="UTF-8"?>
      <workflow-app xmlns="uri:oozie:workflow:0.4" name="mkgMerge-wf">
        <start to="shell-node"/>
        <action name="shell-node">
          <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
              <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
              </property>
            </configuration>