The document discusses scheduling Hadoop pipelines using various Apache projects. It provides an example of a marketing profit and loss (PnL) pipeline that processes booking, marketing spend, and web log data. It describes scheduling the example jobs using cron-style scheduling and the problems with time-based scheduling. It then introduces Apache Oozie and Apache Falcon for more robust workflow scheduling based on dataset availability. It provides examples of using Oozie coordinators and workflows and Falcon feeds and processes to schedule the example PnL pipeline based on when input data is available rather than fixed time schedules.
About Me
Name : James Grant
Hadoop Enterprise Data Warehouse Developer here at Expedia
Working with Hadoop and related technology for about 6 years
Email : jamegrant@expedia.com or james@queeg.org
Contents
Introduce the example
Schedule the example using cron style scheduling
Look at what’s wrong with time-based scheduling
Introducing Apache Oozie
Introducing Apache Falcon
Questions
Example
Tracking marketing profit and loss (PnL)
Using
–Booking data
–Marketing spend data
–Web server logs
Producing records showing spend, revenue and profit per
campaign per day
Example – Jobs to schedule
Land Booking Data to HDFS
Land Marketing spend data to HDFS
Land Web logs to HDFS
Process web logs to identify bookings and points of entry
Enrich with booking revenue and profit
Enrich with marketing spend
Attribute revenue and profit to marketing campaign
Scheduling the Example
We need to know how long each task normally takes
We also need to know how long it could possibly take
We then need to work out what time of day to schedule each
task
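Under time-based scheduling, the jobs above end up pinned to fixed times of day, with near worst-case slack built in between steps. A hypothetical crontab sketch (script names, paths, and times are invented for illustration, not from the original pipeline):

```
# Land source data at 01:00; each lands independently
00 01 * * * /opt/pnl/land_bookings.sh
00 01 * * * /opt/pnl/land_marketing_spend.sh
00 01 * * * /opt/pnl/land_weblogs.sh
# Allow two hours of slack for landing before processing the logs
00 03 * * * /opt/pnl/process_weblogs.sh
# Enrichment and attribution, each padded for a worst-case predecessor
00 05 * * * /opt/pnl/enrich_revenue.sh
00 06 * * * /opt/pnl/enrich_spend.sh
00 07 * * * /opt/pnl/attribute_campaigns.sh
```

Note that every gap must cover the slowest plausible run of the step before it, which is exactly why the final result arrives late on a normal day.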
The Problem With Time-Based Scheduling
It’s brittle
–Any delay upstream means all downstream tasks fail
It’s inefficient
–All scheduling has to be on a near worst case basis
–So the final result arrives later than we would like
Difficult to manage at scale
–Coordinating schedules between different teams is hard
Introducing Apache Oozie
URL: http://oozie.apache.org/
A workflow scheduler for Hadoop jobs
Describe your workflow as a DAG of actions
Trigger that workflow periodically or on dataset availability
Scheduling With Apache Oozie
Processes will be launched in a container on the cluster
There is a lot of XML
When working with multiple teams or pipelines, dataset
definitions must be repeated
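A coordinator that launches the workflow when its input dataset is available might look like the sketch below. The app name, dataset, HDFS paths, and dates are illustrative assumptions, not taken from the original deck:

```xml
<coordinator-app name="pnl-attribution-coord" frequency="${coord:days(1)}"
                 start="2014-01-01T06:00Z" end="2015-01-01T06:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One daily partition of booking data; _SUCCESS marks it complete -->
    <dataset name="bookings" frequency="${coord:days(1)}"
             initial-instance="2014-01-01T06:00Z" timezone="UTC">
      <uri-template>hdfs:///data/bookings/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The action is held until today's instance of the dataset exists -->
    <data-in name="bookings-input" dataset="bookings">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/pnl/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Every pipeline that consumes the same booking data has to repeat a `<dataset>` block like this one, which is the duplication problem noted above.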
Introducing Apache Falcon
URL: http://falcon.apache.org/ (formerly http://falcon.incubator.apache.org/)
“A data processing and management solution”
Describe datasets and processes
Processes are scheduled based on the descriptions
Uses Oozie as the scheduler
Processes can be Hive HQL scripts, Pig scripts, or Oozie
workflows
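In Falcon the dataset and the processing step are declared as separate entities. A minimal sketch of a feed and a process for one input of the PnL pipeline (entity names, the cluster name, paths, and dates are assumptions for illustration):

```xml
<!-- Feed: declared once, shared by every process that consumes it -->
<feed name="bookings" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2014-01-01T06:00Z" end="2015-01-01T06:00Z"/>
      <!-- Retention is handled here, not in each consuming pipeline -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/bookings/${YEAR}/${MONTH}/${DAY}"/>
  </locations>
  <ACL owner="etl" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

<!-- Process: Falcon generates the Oozie coordinator from this -->
<process name="attribute-pnl" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary">
      <validity start="2014-01-01T06:00Z" end="2015-01-01T06:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <input name="bookings" feed="bookings" start="today(0,0)" end="today(0,0)"/>
  </inputs>
  <workflow engine="pig" path="/apps/pnl/attribute.pig"/>
  <retry policy="periodic" delay="minutes(30)" attempts="3"/>
</process>
```

The process references the feed by name, so the dataset definition is written once rather than repeated in every coordinator.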
Benefits and Observations of Falcon
About the same amount of XML but in smaller chunks
Declare the data and processing steps and have the schedule
created for you
A dataset is declared once and used by all processing steps that
need it
Also handles retention (a separate process under Oozie)
Also handles replication
Oozie workflows
Describe a DAG of actions to take to complete a task
Available actions are:
–Map-Reduce
–Pig
–File system
–SSH
–Java
–Shell
All actions take place in a container on the cluster
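A minimal workflow DAG with a single Pig action might look like the sketch below; the workflow name, script, and parameter are hypothetical:

```xml
<workflow-app name="pnl-enrich-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="enrich-bookings"/>
  <action name="enrich-bookings">
    <!-- Pig action: runs in a launcher container on the cluster -->
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>enrich.pig</script>
      <param>input=${bookingsInput}</param>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Enrichment failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action node names its `ok` and `error` transitions, which is how the DAG of actions is expressed.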