Process Scheduling on Hadoop at Expedia


An exploration of how to manage data processing pipelines on Hadoop without using cron. This talk covers the details, and pitfalls, of Apache Oozie and Apache Falcon.

  1. Scheduling Hadoop Pipelines: How to manage data process pipelines on Hadoop. HUG UK, 2015-01-13
  2. About Me: James Grant, Hadoop Enterprise Data Warehouse Developer at Expedia. Working with Hadoop and related technology for about 6 years.
  3. Contents: Introduce the example; schedule the example using cron-style scheduling; look at what's wrong with time-based scheduling; introducing Apache Oozie; introducing Apache Falcon; questions.
  4. 4. 4 Example Tracking marketing profit and loss (PnL) Using –Booking data –Marketing spend data –Web server logs Producing records showing spend, revenue and profit per campaign per day
  5. Example - Jobs to schedule: Land booking data to HDFS; land marketing spend data to HDFS; land web logs to HDFS; process web logs to identify bookings and points of entry; enrich with booking revenue and profit; enrich with marketing spend; attribute revenue and profit to marketing campaign.
  7. 7. 7 Scheduling the Example We need to know how long each task normally takes We also need to know how long it could possibly take We then need to work out at what time of day to schedule the task
  8. Scheduling the Example
  9. Scheduling the Example
  10. The Problem With Time-Based Scheduling: It's brittle: any delay upstream means all downstream tasks fail. It's inefficient: all scheduling has to be on a near-worst-case basis, so the final result arrives later than we would like. It's difficult to manage at scale: coordinating schedules between different teams is hard.
  11. Introducing Apache Oozie: A workflow scheduler for Hadoop jobs. Describe your workflow as a DAG of actions. Trigger that workflow periodically or on dataset availability.
  12. Example Oozie Coordinator:
      <coordinator-app name="marketing-pnl-coord" frequency="${coord:days(1)}"
                       start="2015-01-02T02:00Z" end="2015-12-31T02:00Z" timezone="UTC"
                       xmlns="uri:oozie:coordinator:0.1">
        <controls>
          <timeout>1080</timeout>
          <concurrency>1</concurrency>
          <execution>FIFO</execution>
        </controls>
  13. Example Oozie Coordinator:
        <datasets>
          <dataset name="d_weblogs" frequency="${coord:days(1)}"
                   initial-instance="2009-01-01T02:00Z" timezone="UTC">
            <uri-template>hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/</uri-template>
            <done-flag></done-flag>
          </dataset>
          ...
          <dataset name="d_marketing-pnl" frequency="${coord:days(1)}"
                   initial-instance="2009-01-01T02:00Z" timezone="UTC">
            <uri-template>hdfs://data/marketing-pnl/${YEAR}/${MONTH}/${DAY}/</uri-template>
            <done-flag></done-flag>
          </dataset>
        </datasets>
  14. Example Oozie Coordinator:
        <input-events>
          <data-in name="e_weblogs" dataset="d_weblogs">
            <instance>${coord:current(0)}</instance>
          </data-in>
          ...
        </input-events>
        <output-events>
          <data-out name="e_marketing-pnl" dataset="d_marketing-pnl">
            <instance>${coord:current(-1)}</instance>
          </data-out>
        </output-events>
  15. Example Oozie Coordinator:
        <action>
          <workflow>
            <app-path>hdfs://apps/marketing/pnl/wf/</app-path>
            <configuration>
              <property>
                <name>wf_weblogs</name>
                <value>${coord:dataIn('e_weblogs')}</value>
              </property>
              <property>
                <name>wf_output</name>
                <value>${coord:dataOut('e_marketing-pnl')}</value>
              </property>
            </configuration>
          </workflow>
        </action>
      </coordinator-app>
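The empty `<done-flag></done-flag>` in the datasets above tells Oozie to treat the existence of the directory itself as the signal that an instance is available; leaving the element out makes Oozie wait for a `_SUCCESS` file instead. The fragment below is a sketch of a stricter variant, not part of the original talk. It assumes whatever lands the web logs writes `_SUCCESS` when it finishes, and it introduces a hypothetical input event `e_weblogs_week` that waits for the last seven daily instances rather than just the current one.

```xml
<!-- Illustrative fragment only: e_weblogs_week is a hypothetical name. -->
<datasets>
  <dataset name="d_weblogs" frequency="${coord:days(1)}"
           initial-instance="2009-01-01T02:00Z" timezone="UTC">
    <uri-template>hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/</uri-template>
    <!-- Wait for this marker file, not merely for the directory to exist. -->
    <done-flag>_SUCCESS</done-flag>
  </dataset>
</datasets>
<input-events>
  <data-in name="e_weblogs_week" dataset="d_weblogs">
    <!-- Seven daily instances: six days back through to the current one. -->
    <start-instance>${coord:current(-6)}</start-instance>
    <end-instance>${coord:current(0)}</end-instance>
  </data-in>
</input-events>
```

With a range like this, `${coord:dataIn('e_weblogs_week')}` resolves to a comma-separated list of the seven instance URIs, so the workflow receives all of them in one property.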
  16. Example Oozie Workflow
  17. Example Oozie Workflow:
      <workflow-app name="marketing-pnl-wf" xmlns="uri:oozie:workflow:0.1">
        <start to="fork"/>
        <fork name="fork">
          <path start="downloadBooking"/>
          <path start="downloadWeblogs"/>
          <path start="downloadSpend"/>
        </fork>
  18. Example Oozie Workflow:
        <action name="downloadBooking">
          <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
              <property>
                <name></name>
                <value>${queueName}</value>
              </property>
            </configuration>
            <exec></exec>
            <argument>--bookings=${e_bookings}</argument>
            <file>${wf:appPath()}/</file>
            <file>${wf:appPath()}/downloadBooking.jar</file>
          </shell>
          <ok to="join"/>
          <error to="sendErrorEmail"/>
        </action>
  19. Example Oozie Workflow:
        <action name="downloadWeblogs"> ... </action>
        <action name="downloadSpend"> ... </action>
        ...
        <join name="join" to="merge"/>
        <action name="sendErrorEmail"> ... </action>
        <kill name="killJobAction">
          <message>"Killed job : ${wf:errorMessage(wf:lastErrorNode())}"</message>
        </kill>
        <end name="end" />
      </workflow-app>
  20. Scheduling With Apache Oozie: Processes are launched in a container on the cluster. There is a lot of XML. When working with multiple teams or pipelines, dataset definitions must be repeated.
  21. Introducing Apache Falcon: “A data processing and management solution”. Describe datasets and processes. Processes are scheduled based on the descriptions. Uses Oozie as the scheduler. Processes can be Hive HQL scripts, Pig scripts or Oozie workflows.
  22. Example Dataset Description:
      <?xml version="1.0" encoding="UTF-8"?>
      <feed description="Web Logs" name="weblogs" xmlns="uri:falcon:feed:0.1">
        <frequency>days(1)</frequency>
        <late-arrival cut-off="hours(18)"/>
        <clusters>
          <cluster name="production" type="source">
            <validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="years(5)" action="delete"/>
          </cluster>
        </clusters>
        <locations>
          <location type="data" path="/data/weblogs/${YEAR}/${MONTH}/${DAY}"/>
        </locations>
        <ACL owner="marketing" group="etl" permission="0755"/>
        <schema location="/none" provider="none"/>
        <properties>
          <property name="queueName" value="prod_etl"/>
        </properties>
      </feed>
  23. Example Process Description:
      <?xml version="1.0" encoding="UTF-8"?>
      <process name="mkgMerge" xmlns="uri:falcon:process:0.1">
        <clusters>…</clusters>
        <parallel>1</parallel>
        <order>FIFO</order>
        <frequency>days(1)</frequency>
        <inputs>
          <input name="bookings" feed="mkgBookings" start="today(0,0)" end="today(0,0)" />
          <input name="webActions" feed="mkgEntryBookingLog" start="today(0,0)" end="today(0,
          <input name="spend" feed="mkgSpend" start="today(0,0)" end="today(0,0)" />
        </inputs>
        <outputs>
          <output name="output" feed="mkgEnrichedLog" instance="today(0,0)" />
        </outputs>
        <properties>
          <property name="queueName" value="prod_etl" />
        </properties>
        <workflow name="mkgMerge-wf" engine="oozie" path="/apps/mkg/merge" />
      </process>
  24. Benefits and Observations of Falcon: About the same amount of XML, but in smaller chunks. Declare the data and processing steps and have the schedule created for you. A dataset is declared once and used by all processing steps that need it. Also handles retention (a separate process under Oozie). Also handles replication.
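As an illustration of the replication point, a feed can list more than one cluster. Falcon then copies each instance from the cluster marked `source` to the clusters marked `target`, and applies each cluster's own retention limit. This is a sketch only: the `backup` cluster, its retention limit and the feed path are assumptions rather than details from the talk, and both clusters would need their own cluster entities submitted to Falcon first.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative feed: the "backup" cluster and the path are hypothetical. -->
<feed description="Web Logs" name="weblogs" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <clusters>
    <!-- Where the data is produced. -->
    <cluster name="production" type="source">
      <validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="years(5)" action="delete"/>
    </cluster>
    <!-- Where Falcon replicates it to, with a shorter retention period. -->
    <cluster name="backup" type="target">
      <validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="years(1)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/weblogs/${YEAR}/${MONTH}/${DAY}"/>
  </locations>
  <ACL owner="marketing" group="etl" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Retention and replication both live in the feed definition, which is why no separate clean-up or copy job has to be scheduled by hand.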
  25. Oozie workflows: Describe a DAG of actions to take to complete a task. Available actions include Map-Reduce, Pig, File system, SSH, Java and Shell. All actions take place in a container on the cluster.
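As a sketch of one of the simpler action types, a file system action that clears and recreates a scratch directory before the main step might look like the following. The action name, the path and the transition targets are hypothetical and only chosen to resemble the other examples in this talk.

```xml
<!-- Illustrative only: action name, path and transitions are assumptions. -->
<action name="prepareScratch">
  <fs>
    <!-- Remove any leftovers from a previous run, then start clean. -->
    <delete path="${nameNode}/tmp/marketing-pnl/scratch"/>
    <mkdir path="${nameNode}/tmp/marketing-pnl/scratch"/>
  </fs>
  <ok to="shell-node"/>
  <error to="fail"/>
</action>
```

Every action follows the same shape: one element naming the action type, followed by an `ok` and an `error` transition that form the edges of the DAG.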
  26. Example Workflow:
      <?xml version="1.0" encoding="UTF-8"?>
      <workflow-app xmlns="uri:oozie:workflow:0.4" name="mkgMerge-wf">
        <start to="shell-node"/>
        <action name="shell-node">
          <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
              <property>
                <name></name>
                <value>${queueName}</value>
              </property>
            </configuration>
  27. Example Workflow:
            <exec></exec>
            <argument>--partition=${nominalTime}</argument>
            <argument>--bookings=${bookings}</argument>
            <argument>--webActions=${webActions}</argument>
            <argument>--spend=${spend}</argument>
            <file>${wf:appPath()}/</file>
          </shell>
          <ok to="end"/>
          <error to="fail"/>
        </action>
        <kill name="fail">
          <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
      </workflow-app>
  28. Any Questions?