Luigi presentation OA Summit

  1. 1. Data Workflows at Foursquare using Luigi
  2. 2. Foursquare
     •  35 million users
     •  Nearly 4 billion check-ins
     •  More than 5 million check-ins per day
     •  50 million point-of-interest database
     •  Hundreds of GB of log data per day
  3. 3. Tools We Use
     •  Hive
        o  Ad hoc analytics, data dumping ground
     •  Raw MapReduce
        o  Hundreds of MapReduce jobs in our codebase
     •  Pig
        o  Fits between structured Hive and free-form MapReduce
     •  Vertica
        o  Low latency analytics
  4. 4. Cron
     E.g.
     0 0 * * * ./hadoop-script-1.sh
     # Wait two hours for that job to finish...
     0 2 * * * ./hadoop-script-2.sh
     # And on and on and on
  5. 5. Cron - Problems
     •  Brittle
     •  Hard to reason about / visualize
     •  Spend a lot of time waiting
     •  Difficult to tell what succeeded or failed
     •  No one likes writing Bash scripts
  6. 6. Oozie
     XML-based workflow engine, with support for Hadoop, Hive, and Pig.
     Workflows specify computations in a DAG, e.g. "Run this Hive query, then run these two MapReduce jobs in parallel".
     Coordinators launch recurring workflows at a given frequency, once the data they depend on is available.
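
The DAG example quoted on the Oozie slide (one Hive query, then two MapReduce jobs in parallel) can be sketched in a few lines of Python. This is only an illustration of the dependency-graph idea, not Oozie's XML format or API: it uses the standard-library `graphlib` module (Python 3.9+), and the step names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(dag):
    """Group steps into waves; steps in the same wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # get_ready() returns every step whose dependencies are all done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(workflow)
print(waves)
```

The Hive query forms the first wave on its own, and both MapReduce jobs become ready together in the second wave.
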
  7. 7. Oozie - Example
  8. 8. Oozie - Problems
     •  Workflows are all-or-nothing
        o  Cannot just run step that failed
        o  Very little code reuse
     •  Little to no extensibility
     •  Limited control flow
     •  Extremely verbose
     •  Difficult to test
     •  No one likes writing XML
  9. 9. Luigi
     •  Python framework for batch processing jobs
     •  Created by Spotify, open-sourced Sept. 2012
     •  Tasks are units of work that produce Targets
     •  Tasks can depend on one or more other Tasks
     •  A Task runs only once all the Tasks it depends on are done
     •  Tasks are idempotent
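
The Task and Target ideas on the Luigi slide can be illustrated with a toy model. The sketch below deliberately does not use Luigi's API: `ToyTask` and `build` are hypothetical names, a local file stands in for a Target, and the only assumption is that a task counts as done when its output file exists.

```python
import os
import tempfile


class ToyTask:
    """Toy stand-in for a task; one output file acts as its target."""

    def __init__(self, name, out_dir, requires=()):
        self.name = name
        self.path = os.path.join(out_dir, name + ".txt")
        self.deps = list(requires)

    def complete(self):
        # A task counts as done when its target already exists.
        return os.path.exists(self.path)

    def run(self):
        # Write to a temporary name and rename, so a crash never
        # leaves a half-written target behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write("output of " + self.name + "\n")
        os.replace(tmp, self.path)


def build(task, log):
    """Run dependencies first, then the task; skip anything already complete."""
    if task.complete():
        return
    for dep in task.deps:
        build(dep, log)
    task.run()
    log.append(task.name)


out_dir = tempfile.mkdtemp()
extract = ToyTask("extract", out_dir)
count = ToyTask("count", out_dir, requires=[extract])

first_run, second_run = [], []
build(count, first_run)   # runs extract, then count
build(count, second_run)  # nothing to do: both targets already exist
print(first_run, second_run)
```

Because completeness is decided by the target, asking for the same task twice does no extra work, which is the sense in which the tasks here are idempotent.
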
  10. 10. Luigi - Example Task
  11. 11. Luigi - Running the Task
      $ python word-count.py WordCount --date 2013-06-01
  12. 12. Luigi - Scheduler
      •  Central scheduler ensures each Task is only run by a single worker.
      •  A Task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01).
      •  Will retry failed Tasks after a configured timeout.
      •  Emails someone when a Task fails.
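
As a rough illustration of the scheduler slide, the sketch below shows how an identifier built from a class name and its parameters lets a central component hand each task to only one worker. It is a toy model, not Luigi's scheduler: `task_id`, `ToyScheduler`, and the worker names are hypothetical, and retries and email alerts are left out.

```python
import datetime


def task_id(class_name, params):
    """Build an identifier from the class name and its sorted parameters."""
    args = ", ".join(f"{key}={params[key]}" for key in sorted(params))
    return f"{class_name}({args})"


class ToyScheduler:
    """Hands each task identifier to at most one worker."""

    def __init__(self):
        self.owners = {}

    def claim(self, worker, class_name, params):
        tid = task_id(class_name, params)
        # setdefault keeps the first claimant; later workers are turned away.
        owner = self.owners.setdefault(tid, worker)
        return owner == worker


scheduler = ToyScheduler()
day = datetime.date(2013, 6, 1)
print(task_id("WordCount", {"date": day}))

a = scheduler.claim("worker-a", "WordCount", {"date": day})
b = scheduler.claim("worker-b", "WordCount", {"date": day})
c = scheduler.claim("worker-b", "WordCount", {"date": datetime.date(2013, 6, 2)})
print(a, b, c)
```

The second worker is refused the task for 2013-06-01 because the first already owns it, but it can claim the same class with a different date, since that is a different task.
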
  13. 13. Luigi - Visualizer
  14. 14. Luigi - Visualizer
  15. 15. Luigi - Visualizer
  16. 16. Luigi - Advantages over Cron
      •  Explicit dependencies
      •  No wasted time waiting
      •  Easy to tell what has failed
      •  Avoid duplicate work / partial failures
  17. 17. Luigi - Advantages over Oozie
      •  Explicit dependencies between workflows
      •  Easier to write
      •  Vastly more extensible
      •  Code reuse
      •  Can easily re-run individual steps
  18. 18. Thank you!
      Check out Luigi: https://github.com/spotify/luigi
      Drop me a line: Joe Ennever, jennever@foursquare.com
