Data Workflows
at Foursquare
using Luigi
Foursquare
•  35 million users
•  Nearly 4 billion check-ins
•  More than 5 million check-ins per day
•  Database of 50 million points of interest
•  Hundreds of GB of log data per day
Tools We Use
•  Hive
o  Ad hoc analytics, data dumping ground
•  Raw MapReduce
o  Hundreds of MapReduce jobs in our codebase
•  Pig
o  Fits between structured Hive and free-form
MapReduce
•  Vertica
o  Low latency analytics
Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on
Cron - Problems
•  Brittle
•  Hard to reason about / visualize
•  Spend a lot of time waiting
•  Difficult to tell what succeeded or failed
•  No one likes writing Bash scripts
Oozie
XML-based Workflow Engine, with support for
Hadoop, Hive, and Pig
Workflows specify computations in a DAG, e.g.
"Run this Hive query, then run these two
MapReduce jobs in parallel"
Coordinators launch recurring workflows at a
given frequency, when dependent data is
available
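The quoted workflow can be read as a small dependency graph. The sketch below is not Oozie configuration; it is a plain-Python illustration, using the standard library's graphlib module (Python 3.9+), of how a DAG determines which steps may run in parallel. The step names hive_query, mapreduce_a, and mapreduce_b are hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical names).
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(graph):
    """Group steps into waves; steps in the same wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies finished.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(workflow)
```

Here the Hive query forms the first wave on its own, and the two MapReduce jobs form the second wave together.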
Oozie - Example
Oozie - Problems
•  Workflows are all-or-nothing
o  Cannot re-run only the step that failed
o  Very little code reuse
•  Little to no extensibility
•  Limited control flow
•  Extremely verbose
•  Difficult to test
•  No one likes writing XML
Luigi
•  Python framework for batch processing jobs
•  Created by Spotify, open-sourced Sept. 2012
•  Tasks are units of work that produce Targets
•  Tasks can depend on one or more other Tasks
•  A Task is only run once all of the Tasks it depends on are done
•  Tasks are idempotent
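These ideas can be modeled in a few lines of plain Python. This is a toy sketch, not Luigi's implementation or API: FileTarget, Task, Touch, and build are hypothetical names, and "done" is assumed to mean "the output file exists".

```python
import os


class FileTarget:
    """A Target backed by a file; it is done when the file exists."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """A unit of work that produces exactly one Target."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return self.output().exists()


class Touch(Task):
    """Hypothetical task: creates an empty file after its dependencies."""

    def __init__(self, path, deps=()):
        self.path = path
        self.deps = list(deps)

    def requires(self):
        return self.deps

    def output(self):
        return FileTarget(self.path)

    def run(self):
        with open(self.path, "w"):
            pass


def build(task, log):
    """Run dependencies first, then the task; skip anything already done."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(task.output().path)
```

Because completion is checked against the Target, calling build a second time does no work, which is the sense in which the tasks are idempotent.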
Luigi - Example Task
Luigi - Running the Task
$ python word-count.py WordCount --date 2013-06-01
Luigi - Scheduler
Central scheduler ensures each Task is only
run by a single worker.
A task is uniquely identified by its class name
and its Parameters, e.g.
WordCount(date=2013-06-01)
Will retry failed Tasks after a configured timeout
Emails someone when a Task fails
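The identity rule above can be sketched in plain Python. This is a toy model under stated assumptions, not Luigi's scheduler: TaskId, task_id, and ToyScheduler are hypothetical names, and retries and email alerts are left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TaskId:
    """Identity of a task: its class name plus its parameter values."""

    family: str
    params: tuple  # sorted (name, value-as-string) pairs


def task_id(family, **params):
    return TaskId(family, tuple(sorted((k, str(v)) for k, v in params.items())))


class ToyScheduler:
    """Hands each distinct task to at most one worker."""

    def __init__(self):
        self._owners = {}

    def claim(self, tid, worker):
        """Return True if `worker` is the one allowed to run `tid`."""
        owner = self._owners.setdefault(tid, worker)
        return owner == worker


scheduler = ToyScheduler()
june_first = task_id("WordCount", date=date(2013, 6, 1))
same_task = task_id("WordCount", date=date(2013, 6, 1))
june_second = task_id("WordCount", date=date(2013, 6, 2))
```

Two workers asking for WordCount with the same date resolve to the same TaskId, so only the first claim succeeds; a different date is a different task.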
Luigi - Visualizer
Luigi - Advantages over Cron
•  Explicit dependencies
•  No wasted time waiting
•  Easy to tell what has failed
•  Avoid duplicate work / partial failures
Luigi - Advantages over Oozie
•  Explicit dependencies between workflows
•  Easier to write
•  Vastly more extensible
•  Code reuse
•  Can easily re-run individual steps
Thank you!
Check out Luigi:
https://github.com/spotify/luigi
Drop me a line:
Joe Ennever
jennever@foursquare.com

Luigi presentation OA Summit
