The document discusses scheduling Hadoop pipelines using various Apache projects. It provides an example of a marketing profit and loss (PnL) pipeline that processes booking, marketing spend, and web log data. It describes scheduling the example jobs using cron-style scheduling and the problems with time-based scheduling. It then introduces Apache Oozie and Apache Falcon for more robust workflow scheduling based on dataset availability. It provides examples of using Oozie coordinators and workflows and Falcon feeds and processes to schedule the example PnL pipeline based on when input data is available rather than fixed time schedules.
About Me
Name : James Grant
Hadoop Enterprise Data Warehouse Developer here at Expedia
Working with Hadoop and related technology for about 6 years
Email : jamegrant@expedia.com or james@queeg.org
Contents
Introduce the example
Schedule the example using cron style scheduling
Look at what’s wrong with time-based scheduling
Introducing Apache Oozie
Introducing Apache Falcon
Questions
Example
Tracking marketing profit and loss (PnL)
Using
–Booking data
–Marketing spend data
–Web server logs
Producing records showing spend, revenue and profit per
campaign per day
Example – Jobs to schedule
Land Booking Data to HDFS
Land Marketing spend data to HDFS
Land Web logs to HDFS
Process web logs to identify bookings and points of entry
Enrich with booking revenue and profit
Enrich with marketing spend
Attribute revenue and profit to marketing campaign
Scheduling the Example
We need to know how long each task normally takes
We also need to know how long it could possibly take
We then need to work out what time of day to schedule each
task
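Under time-based scheduling, the jobs above end up pinned to fixed times of day, with near worst-case slack built in between steps. A hypothetical crontab sketch (script names, paths, and times are invented for illustration, not from the original pipeline):

```
# Land source data at 01:00; each lands independently
00 01 * * * /opt/pnl/land_bookings.sh
00 01 * * * /opt/pnl/land_marketing_spend.sh
00 01 * * * /opt/pnl/land_weblogs.sh
# Allow two hours of slack for landing before processing the logs
00 03 * * * /opt/pnl/process_weblogs.sh
# Enrichment and attribution, each padded for a worst-case predecessor
00 05 * * * /opt/pnl/enrich_revenue.sh
00 06 * * * /opt/pnl/enrich_spend.sh
00 07 * * * /opt/pnl/attribute_campaigns.sh
```

Note that every gap must cover the slowest plausible run of the step before it, which is exactly why the final result arrives late on a normal day.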
The Problem With Time-Based Scheduling
It’s brittle
–Any delay upstream means all downstream tasks fail
It’s inefficient
–All scheduling has to be on a near worst case basis
–So the final result arrives later than we would like
Difficult to manage at scale
–Coordinating schedules between different teams is hard
Introducing Apache Oozie
URL: http://oozie.apache.org/
A workflow scheduler for Hadoop jobs
Describe your workflow as a DAG of actions
Trigger that workflow periodically or on dataset availability
Scheduling With Apache Oozie
Processes will be launched in a container on the cluster
There is a lot of XML
When working with multiple teams or pipelines, dataset
definitions must be repeated
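A coordinator that launches the workflow when its input dataset is available might look like the sketch below. The app name, dataset, HDFS paths, and dates are illustrative assumptions, not taken from the original deck:

```xml
<coordinator-app name="pnl-attribution-coord" frequency="${coord:days(1)}"
                 start="2014-01-01T06:00Z" end="2015-01-01T06:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One daily partition of booking data; _SUCCESS marks it complete -->
    <dataset name="bookings" frequency="${coord:days(1)}"
             initial-instance="2014-01-01T06:00Z" timezone="UTC">
      <uri-template>hdfs:///data/bookings/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The action is held until today's instance of the dataset exists -->
    <data-in name="bookings-input" dataset="bookings">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/pnl/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Every pipeline that consumes the same booking data has to repeat a `<dataset>` block like this one, which is the duplication problem noted above.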
Introducing Apache Falcon
URL: http://falcon.apache.org/ (formerly http://falcon.incubator.apache.org/)
“A data processing and management solution”
Describe datasets and processes
Processes are scheduled based on the descriptions
Uses Oozie as the scheduler
Processes can be Hive HQL scripts, Pig scripts, or Oozie
workflows
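In Falcon the dataset and the processing step are declared as separate entities. A minimal sketch of a feed and a process for one input of the PnL pipeline (entity names, the cluster name, paths, and dates are assumptions for illustration):

```xml
<!-- Feed: declared once, shared by every process that consumes it -->
<feed name="bookings" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2014-01-01T06:00Z" end="2015-01-01T06:00Z"/>
      <!-- Retention is handled here, not in each consuming pipeline -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/bookings/${YEAR}/${MONTH}/${DAY}"/>
  </locations>
  <ACL owner="etl" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

<!-- Process: Falcon generates the Oozie coordinator from this -->
<process name="attribute-pnl" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary">
      <validity start="2014-01-01T06:00Z" end="2015-01-01T06:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <input name="bookings" feed="bookings" start="today(0,0)" end="today(0,0)"/>
  </inputs>
  <workflow engine="pig" path="/apps/pnl/attribute.pig"/>
  <retry policy="periodic" delay="minutes(30)" attempts="3"/>
</process>
```

The process references the feed by name, so the dataset definition is written once rather than repeated in every coordinator.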
Benefits and Observations of Falcon
About the same amount of XML but in smaller chunks
Declare the data and processing steps and have the schedule
created for you
A dataset is declared once and used by all processing steps that
need it
Also handles retention (a separate process under Oozie)
Also handles replication
Oozie workflows
Describe a DAG of actions to take to complete a task
Available actions are:
–Map-Reduce
–Pig
–File system
–SSH
–Java
–Shell
All actions take place in a container on the cluster
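A minimal workflow DAG with a single Pig action might look like the sketch below; the workflow name, script, and parameter are hypothetical:

```xml
<workflow-app name="pnl-enrich-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="enrich-bookings"/>
  <action name="enrich-bookings">
    <!-- Pig action: runs in a launcher container on the cluster -->
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>enrich.pig</script>
      <param>input=${bookingsInput}</param>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Enrichment failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action node names its `ok` and `error` transitions, which is how the DAG of actions is expressed.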