Luigi presentation OA Summit

OA NYC Summit

  1. Data Workflows at Foursquare using Luigi
  2. Foursquare
     - 35 million users
     - Nearly 4 billion check-ins
     - More than 5 million check-ins per day
     - 50 million point-of-interest database
     - 100s of GB of log data per day
  3. Tools We Use
     - Hive
       - Ad hoc analytics, data dumping ground
     - Raw MapReduce
       - 100s of MapReduce jobs in our codebase
     - Pig
       - Fits between structured Hive and free-form MapReduce
     - Vertica
       - Low latency analytics
  4. Cron

     E.g.:

         0 0 * * * ./hadoop-script-1.sh
         # Wait two hours for that job to finish...
         0 2 * * * ./hadoop-script-2.sh
         # And on and on and on
  5. Cron - Problems
     - Brittle
     - Hard to reason about / visualize
     - Spend a lot of time waiting
     - Difficult to tell what succeeded or failed
     - No one likes writing Bash scripts
  6. Oozie
     - XML-based workflow engine, with support for Hadoop, Hive, and Pig
     - Workflows specify computations in a DAG, e.g. "Run this Hive query, then run these two MapReduce jobs in parallel"
     - Coordinators launch recurring workflows at a given frequency, when dependent data is available
  7. Oozie - Example
  8. Oozie - Problems
     - Workflows are all-or-nothing
       - Cannot re-run just the step that failed
       - Very little code reuse
     - Little to no extensibility
     - Limited control flow
     - Extremely verbose
     - Difficult to test
     - No one likes writing XML
  9. Luigi
     - Python framework for batch processing jobs
     - Created by Spotify, open-sourced Sept. 2012
     - Tasks are units of work that produce Targets
     - Tasks can depend on one or more other Tasks
     - A Task is only run once all of the Tasks it depends on are done
     - Tasks are idempotent
  10. Luigi - Example Task
  11. Luigi - Running the Task

          $ python word-count.py WordCount --date 2013-06-01
  12. Luigi - Scheduler
      - Central scheduler ensures each Task is only run by a single worker.
      - A Task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01)
      - Will retry failed Tasks after a configured timeout
      - Emails someone when a Task fails
  13. Luigi - Visualizer
  14. Luigi - Visualizer
  15. Luigi - Visualizer
  16. Luigi - Advantages over Cron
      - Explicit dependencies
      - No wasted time waiting
      - Easy to tell what has failed
      - Avoid duplicate work / partial failures
  17. Luigi - Advantages over Oozie
      - Explicit dependencies between workflows
      - Easier to write
      - Vastly more extensible
      - Code reuse
      - Can easily re-run individual steps
  18. Thank you!
      - Check out Luigi: https://github.com/spotify/luigi
      - Drop me a line: Joe Ennever, jennever@foursquare.com
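
The Task, Target, and scheduler behaviour described in slides 9 and 12 can be sketched with a small standard-library model. This is not Luigi's API, and it is not the code from the example slide, whose content did not survive in this transcript. The `Task` base class, the `build` function, the `RawLogs` task, and the file names are all hypothetical, chosen only to illustrate three points: dependencies run first, a task is identified by its class name plus parameters, and re-running a finished workflow does no work because the output files (the "targets") already exist.

```python
import os
import tempfile

# Hypothetical scratch directory standing in for HDFS or another data store.
WORKDIR = tempfile.mkdtemp()


class Task:
    """Toy stand-in for a workflow task (an assumption, not Luigi's Task class)."""

    def __init__(self, **params):
        self.params = params

    @property
    def task_id(self):
        # Identity is the class name plus parameters, e.g. WordCount(date=2013-06-01).
        args = ", ".join(f"{k}={v}" for k, v in sorted(self.params.items()))
        return f"{type(self).__name__}({args})"

    def requires(self):
        return []  # tasks that must be done before this one runs

    def output(self):
        raise NotImplementedError  # path of the target this task produces

    def complete(self):
        # A task counts as done when its target exists.
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run dependencies first; skip any task whose target already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(task.task_id)


class RawLogs(Task):
    def output(self):
        return os.path.join(WORKDIR, f"logs-{self.params['date']}.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("checkin venue checkin\n")


class WordCount(Task):
    def requires(self):
        return [RawLogs(date=self.params["date"])]

    def output(self):
        return os.path.join(WORKDIR, f"counts-{self.params['date']}.txt")

    def run(self):
        counts = {}
        with open(self.requires()[0].output()) as f:
            for word in f.read().split():
                counts[word] = counts.get(word, 0) + 1
        with open(self.output(), "w") as f:
            for word, n in sorted(counts.items()):
                f.write(f"{word}\t{n}\n")


log = []
build(WordCount(date="2013-06-01"), log)  # runs RawLogs, then WordCount
build(WordCount(date="2013-06-01"), log)  # no-op: both targets already exist
print(log)  # → ['RawLogs(date=2013-06-01)', 'WordCount(date=2013-06-01)']
```

The second `build` call adds nothing to the log, which is the property the slides contrast with cron and Oozie: a failed or interrupted workflow can be re-run and only the steps whose targets are missing will execute.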
