2.
Foursquare
• 35 million users
• Nearly 4 billion check-ins
• More than 5 million check-ins per day
• 50 million point-of-interest database
• Hundreds of GB of log data per day
3.
Tools We Use
• Hive
o Ad hoc analytics, data dumping ground
• Raw MapReduce
o Hundreds of MapReduce jobs in our codebase
• Pig
o Fits between structured Hive and free-form
MapReduce
• Vertica
o Low latency analytics
4.
Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on
5.
Cron - Problems
• Brittle
• Hard to reason about / visualize
• Spend a lot of time waiting
• Difficult to tell what succeeded or failed
• No one likes writing Bash scripts
6.
Oozie
XML-based Workflow Engine, with support for
Hadoop, Hive, and Pig
Workflows specify computations in a DAG, e.g.,
"Run this Hive query, then run these two
MapReduce jobs in parallel"
Coordinators launch recurring workflows at a
given frequency, when dependent data is
available
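The DAG example above can be sketched in a few lines of Python. This is only an illustration of the ordering idea, not Oozie's XML format or API; the step names (`hive_query`, `mapreduce_a`, `mapreduce_b`) and the helper `execution_waves` are hypothetical. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to group steps into waves, where every step in a wave may run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(dag):
    """Return a list of waves; all steps within one wave may run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # get_ready() yields every step whose dependencies are all done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(workflow)
for number, steps in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(steps)}")
```

The Hive query lands alone in the first wave, and the two MapReduce jobs share the second.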
8.
Oozie - Problems
• Workflows are all-or-nothing
o Cannot re-run just the step that failed
o Very little code reuse
• Little to no extensibility
• Limited control flow
• Extremely verbose
• Difficult to test
• No one likes writing XML
9.
Luigi
• Python framework for batch processing jobs
• Created by Spotify, open-sourced Sept. 2012
• Tasks are units of work that produce Targets
• Tasks can depend on one or more other Tasks
• A Task runs only once all the Tasks it depends on are done
• Tasks are idempotent
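The Task/Target model in the bullets above can be sketched with the standard library alone. This is a stand-in under stated assumptions, not Luigi's implementation: the method names `requires`, `output`, `run`, and `complete` borrow Luigi's vocabulary, but the `Task` base class, the `FetchLogs` task, the file names, and the `build` helper are all hypothetical. A task counts as done when its output file exists, which is what makes a second run a no-op.

```python
import os
import tempfile


class Task:
    """Hypothetical base class: a unit of work that produces one output file."""

    def __init__(self, out_dir, date):
        self.out_dir = out_dir
        self.date = date

    def requires(self):
        return []  # Tasks this one depends on

    def output(self):
        raise NotImplementedError  # path of the target this task produces

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task is done when its target exists.
        return os.path.exists(self.output())


class FetchLogs(Task):
    def output(self):
        return os.path.join(self.out_dir, f"logs-{self.date}.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("a b a\n")  # made-up log content for the sketch


class WordCount(Task):
    def requires(self):
        return [FetchLogs(self.out_dir, self.date)]

    def output(self):
        return os.path.join(self.out_dir, f"counts-{self.date}.txt")

    def run(self):
        counts = {}
        with open(self.requires()[0].output()) as f:
            for word in f.read().split():
                counts[word] = counts.get(word, 0) + 1
        with open(self.output(), "w") as f:
            for word in sorted(counts):
                f.write(f"{word}\t{counts[word]}\n")


def build(task, log):
    """Run dependencies first; skip any task whose target already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


out_dir = tempfile.mkdtemp()
first_run, second_run = [], []
build(WordCount(out_dir, "2013-06-01"), first_run)
build(WordCount(out_dir, "2013-06-01"), second_run)
print(first_run, second_run)
```

The first call runs `FetchLogs` and then `WordCount`; the second call finds the target already present and runs nothing.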
11.
Luigi - Running the Task
$ python word-count.py WordCount --date 2013-06-01
12.
Luigi - Scheduler
Central scheduler ensures each Task is only
run by a single worker.
A task is uniquely identified by its class name
and its Parameters, e.g.
WordCount(date=2013-06-01)
Will retry failed Tasks after a configured timeout
Emails someone when a Task fails
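The identity rule above (class name plus parameters) is what lets a scheduler hand each task to exactly one worker. The sketch below illustrates that idea only; `task_id`, `ToyScheduler`, and the worker names are hypothetical and do not reflect how Luigi's central scheduler is actually implemented, and retries and email alerts are left out.

```python
import datetime


def task_id(class_name, params):
    """Build an identity string from the class name and sorted parameters."""
    args = ", ".join(f"{key}={value}" for key, value in sorted(params.items()))
    return f"{class_name}({args})"


class ToyScheduler:
    """Hypothetical in-memory scheduler: grants each task id to one worker."""

    def __init__(self):
        self.owners = {}

    def claim(self, worker, class_name, params):
        key = task_id(class_name, params)
        # setdefault records the first claimant; later claimants see its name.
        return self.owners.setdefault(key, worker) == worker


scheduler = ToyScheduler()
params = {"date": datetime.date(2013, 6, 1)}
granted_a = scheduler.claim("worker-a", "WordCount", params)
granted_b = scheduler.claim("worker-b", "WordCount", params)
other_day = scheduler.claim(
    "worker-b", "WordCount", {"date": datetime.date(2013, 6, 2)}
)
print(task_id("WordCount", params), granted_a, granted_b, other_day)
```

Two workers asking for the same class and date collide on one identity, so only the first is granted the task; a different date is a different task and is granted separately.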
16.
Luigi - Advantages over Cron
• Explicit dependencies
• No wasted time waiting
• Easy to tell what has failed
• Avoid duplicate work / partial failures
17.
Luigi - Advantages over Oozie
• Explicit dependencies between workflows
• Easier to write
• Vastly more extensible
• Code reuse
• Can easily re-run individual steps
18.
Thank you!
Check out Luigi:
https://github.com/spotify/luigi
Drop me a line:
Joe Ennever
jennever@foursquare.com