Luigi workflow engine for NYC Data Science meetup

December 15, 2014
Luigi
NYC Data Science meetup

What is Luigi
Luigi is a workflow engine
If you run 10,000+ Hadoop jobs every day, you need one
If you play around with batch processing just for fun, you want one
It doesn’t help you with the code; that’s what Scalding, Pig, or anything else is good at
It helps you with the plumbing of connecting lots of tasks into complicated pipelines, especially if those tasks run on Hadoop

What do we use it for?
Music recommendations
A/B testing
Top lists
Ad targeting
Label reporting
Dashboards
… and a million other things!

Currently running 10,000+ Hadoop jobs every day
On average a Hadoop job is launched every 10s
There are 2,000+ Luigi tasks in production

Some history
… let’s go back to 2008!

The year was 2008
I was writing my master’s thesis about music recommendations
Had to run hundreds of long-running tasks to compute the output

Toy example: classify skipped tracks
$ python subsample_extract_features.py /log/endsongcleaned/2011-01-?? /tmp/subsampled
$ python train_model.py /tmp/subsampled model.pickle
$ python inspect_model.py model.pickle
[Diagram: Log d, Log d+1, ..., Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output]

Reproducibility matters
…and automation.
The previous code is really hard to run again

Let’s make it into a big workflow
$ python run_everything.py

Reality: crashes will happen
How do you resume this?

Ability to resume matters
When you are developing something interactively, you will try and fail a lot
Failures will happen, and you want to resume once you have fixed them
You want the system to figure out exactly what it has to re-run and nothing else
Atomic file operations are crucial for the ability to resume
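A minimal sketch of what resuming can look like in plain Python (this is not Luigi's API). The helper names `atomic_write` and `run_step` are hypothetical. Each step writes to a temporary file and renames it into place, so an output file either exists completely or not at all, and a re-run can skip every step whose output is already there.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text so that path either holds the full content or does not exist."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        # Rename is atomic on POSIX when source and destination share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Never leave a half-written temp file behind after a crash.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def run_step(output_path, compute):
    """Run compute() only if its output is missing. Returns True if it ran."""
    if os.path.exists(output_path):
        return False  # already done in an earlier run, nothing to redo
    atomic_write(output_path, compute())
    return True
```

After a crash, running the same script again only recomputes the steps whose output files are missing.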

So let’s make it possible to resume

But still annoying parts
Hardcoded junk

Generalization matters
You should be able to re-run your entire pipeline with a new value for a parameter
Command line integration means you can run interactive experiments

… now we’re getting something
$ python run_everything.py --date-first 2014-01-01 --date-last 2014-01-31 --n-trees 200
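A sketch of the argument parsing behind a command line like the one above, using the standard library's `argparse`. The flag names come from the command itself. The function names, the default of 10 trees, and the `YYYY-MM-DD` date parsing are assumptions for illustration.

```python
import argparse
import datetime


def parse_date(text):
    """Turn a YYYY-MM-DD string into a datetime.date."""
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


def parse_args(argv=None):
    """Parse the pipeline parameters; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Run the whole pipeline")
    parser.add_argument("--date-first", type=parse_date, required=True)
    parser.add_argument("--date-last", type=parse_date, required=True)
    # Default value is an assumption, not taken from the original script.
    parser.add_argument("--n-trees", type=int, default=10)
    return parser.parse_args(argv)
```

Every script in the pipeline ends up repeating something like this, which is where the next point comes from.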

… but it’s hardly readable
BOILERPLATE

Boilerplate matters!
We keep re-implementing the same functionality
Let’s factor it out to a framework

A lot of real-world data pipelines are a lot more complex
The ideal framework should make it trivial to build up big data pipelines where dependencies are non-trivial (e.g. depend on date algebra)

So I started thinking
Wanted to build something like GNU Make

What is Make and why is it pretty cool?
Build reusable rules
Specify what you want to build and then backtrack to find out what you need in order to get there
Reproducible runs
# the compiler: gcc for C program, define as g++ for C++
CC = gcc

# compiler flags:
# -g adds debugging information to the executable file
# -Wall turns on most, but not all, compiler warnings
CFLAGS = -g -Wall

# the build target executable:
TARGET = myprog

all: $(TARGET)

$(TARGET): $(TARGET).c
	$(CC) $(CFLAGS) -o $(TARGET) $(TARGET).c

clean:
	$(RM) $(TARGET)

We want something that works for a wide range of systems
We need to support lots of systems
“80% of data science is data munging”

Data processing needs to interact with lots of systems
Need to support practically any type of task:
Hadoop jobs
Database dumps
Ingest into Cassandra
Send email
SCP file somewhere else

My first attempt: builder
Use XML config to build up the dependency graph!

Don’t use XML
… seriously, don’t use it

Dependencies need code
Pipelines deployed in production often define dependencies between tasks in nontrivial ways
Recursion (and date algebra): BloomFilter(date=2014-05-01), BloomFilter(date=2014-04-30), Log(date=2014-04-30), Log(date=2014-04-29), ...
Date algebra: Toplist(date_interval=2014-01), Log(date=2014-01-01), Log(date=2014-01-02), ..., Log(date=2014-01-31)
Enum types: IdMap(type=artist), IdMap(type=track), IdToIdMap(from_type=artist, to_type=track)
… and many other cases
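A sketch of why such dependencies are easier to express in code than in config. This is plain Python with hypothetical function names, and tasks are represented as simple `(name, date)` tuples. The exact edges are an assumption read off the task names above: a monthly top list needs every daily log in that month, and a day's Bloom filter builds on the previous day's filter and log.

```python
import calendar
import datetime


def toplist_requires(year, month):
    """Date algebra: a monthly Toplist needs the Log of every day in the month."""
    n_days = calendar.monthrange(year, month)[1]
    return [("Log", datetime.date(year, month, day)) for day in range(1, n_days + 1)]


def bloom_filter_requires(date):
    """Recursion: BloomFilter(date) needs the previous day's filter and log.

    The previous day's filter has the same kind of dependency in turn.
    """
    previous = date - datetime.timedelta(days=1)
    return [("BloomFilter", previous), ("Log", previous)]
```

Month lengths, leap years, and the recursion itself come for free from a general purpose language.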

Don’t ever invent your own DSL
“It’s better to write domain specific code in a general purpose language, than writing general purpose code in a domain specific language” – unknown author
Oozie is a good example of how messy it gets

2009: builder2
Solved all the things I just mentioned
- Dependency graph specified in Python
- Support for arbitrary tasks
- Error emails
- Support for lots of common data plumbing stuff: Hadoop jobs, Postgres, etc
- Lots of other things :)

What were the good bits?
Build up dependency graphs and visualize them
Non-event to go from development to deployment
Built-in HDFS integration but decoupled from the core library
What went wrong?
Still too much boilerplate
Pretty bad command line integration

Introducing Luigi
A workflow engine in Python

Luigi – History at Spotify
Late 2011: Elias Freider and I build it, release it into the wild at Spotify, and people start using it
“The Python era”
Late 2012: Open source it
Early 2013: First known company outside of Spotify: Foursquare

Luigi is your friendly plumber
Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration

Luigi Task – breakdown
The business logic of the task
Where it writes output
What other tasks it depends on
Parameters for this task

Easy command line integration
So easy that you want to use Luigi for it
$ python my_task.py MyTask --param 43
INFO: Scheduled MyTask(param=43)
INFO: Scheduled SomeOtherTask(param=43)
INFO: Done scheduling tasks
INFO: [pid 20235] Running SomeOtherTask(param=43)
INFO: [pid 20235] Done SomeOtherTask(param=43)
INFO: [pid 20235] Running MyTask(param=43)
INFO: [pid 20235] Done MyTask(param=43)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$ cat /tmp/foo/bar-43.txt
hello, world
$

Let’s go back to the example
[Diagram: Log d, Log d+1, ..., Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output]

Run on the command line
$ python demo.py SubsampleFeatures --date-interval 2013-11-01
DEBUG: Checking if SubsampleFeatures(test=False, date_interval=2013-11-01) is complete
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Checking if EndSongCleaned(date_interval=2013-11-01) is complete
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 24345] Running SubsampleFeatures(test=False, date_interval=2013-11-01)
...
INFO: 13/11/08 02:15:11 INFO streaming.StreamJob: Tracking URL: http://lon2-hadoopmaster-a1.c.lon.spotify.net:50030/jobdetails.jsp?jobid=job_201310180017_157113
INFO: 13/11/08 02:15:12 INFO streaming.StreamJob: map 0% reduce 0%
...
INFO: [pid 24345] Done SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: Worker was stopped. Shutting down Keep-Alive thread
$

Step 2: Train a machine learning model

Let’s run everything on the command line from scratch
$ python luigi_workflow_full.py InspectModel --date-interval 2011-01-03
DEBUG: Checking if InspectModel(date_interval=2011-01-03, n_trees=10) is complete
INFO: Scheduled InspectModel(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-03) (PENDING)
INFO: Scheduled EndSongCleaned(date=2011-01-03) (DONE)
INFO: Running Worker with 1 processes
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: 14/12/17 02:07:20 INFO mapreduce.Job: Running job: job_1418160358293_86477
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: Job job_1418160358293_86477 running in uber mode : false
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: map 0% reduce 0%
INFO: 14/12/17 02:16:32 INFO mapreduce.Job: Job job_1418160358293_86477 completed successfully
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done SubsampleFeatures(test=False, date_interval=2011-01-03)
TrainClassifier(date_interval=2011-01-03, n_trees=10)
TrainClassifier(date_interval=2011-01-03, n_trees=10)
InspectModel(date_interval=2011-01-03, n_trees=10)
time 0.1335%
ms_played 96.9351%
shuffle 0.0728%
local_track 0.0000%
bitrate 2.8586%
InspectModel(date_interval=2011-01-03, n_trees=10)

Let’s make it more complicated – cross validation
[Diagram: Log d, Log d+1, ..., Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Cross validation; Log e, Log e+1, ..., Log e+k-1 → Subsample and extract features → Subsampled features → Cross validation]

Cross validation implementation
$ python xv.py CrossValidation --date-interval-a 2012-11-01 --date-interval-b 2012-11-02

Run on the command line
$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=10) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Running Worker with 1 processes
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02)
2011-01-01 (train) AUC: 0.9040
2011-01-02 ( test) AUC: 0.9040
username=erikbern, pid=18533) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02)
INFO: Done
INFO: Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) was stopped. Shutting down Keep-Alive thread
… no overfitting!

More trees!
$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02 --n-trees 100
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=100) (PENDING)
username=erikbern, pid=27835) running TrainClassifier(date_interval=2011-01-01, n_trees=100)
username=erikbern, pid=27835) done TrainClassifier(date_interval=2011-01-01, n_trees=100)
username=erikbern, pid=27835) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100)
2011-01-01 (train) AUC: 0.9074
2011-01-02 ( test) AUC: 0.8896
username=erikbern, pid=27835) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100)
INFO: Done
INFO: Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) was stopped. Shutting down Keep-Alive thread
… overfitting!

The nice things about Luigi

Minimal boilerplate
Overhead for a task is about 5 lines (class def + requires + output + run)
Easy command line integration

Everything is a directed acyclic graph
Makefile style
Tasks specify what they depend on, not what depends on them
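A sketch of the Makefile-style backtracking in plain Python. The function `build_order` and the dictionary representation are hypothetical and are not Luigi's internals. Each task only lists what it requires. Starting from the requested task, the traversal walks backwards through those requirements, skips anything already done, and returns the remaining tasks with dependencies first.

```python
def build_order(target, requires, is_done):
    """Return the tasks to run for target, dependencies first.

    requires maps a task to the list of tasks it depends on.
    is_done(task) tells whether the task's output already exists.
    """
    order = []
    seen = set()

    def visit(task, stack):
        if task in stack:
            # A task that (indirectly) requires itself breaks the acyclic assumption.
            raise ValueError("dependency cycle at %r" % (task,))
        if task in seen or is_done(task):
            return  # already planned, or nothing to do for it
        seen.add(task)
        for dep in requires.get(task, []):
            visit(dep, stack | {task})
        order.append(task)  # appended only after all of its dependencies

    visit(target, frozenset())
    return order
```

With the toy pipeline from earlier, where `InspectModel` requires `TrainClassifier`, which requires `SubsampleFeatures`, which requires an already existing `EndSongCleaned`, asking for `InspectModel` yields `SubsampleFeatures`, `TrainClassifier`, `InspectModel` in that order.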

Run with multiple workers
$ python dataflow.py --workers 3 AggregateArtists --date-interval 2013-W08

Process synchronization
[Diagram: Luigi worker 1 and Luigi worker 2 with task graphs over A, B, C and F, both connected to the Luigi central planner]
The central planner prevents the same task from being run simultaneously, but all execution is done by the workers.
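A sketch of that division of labor, as a hypothetical in-process stand-in rather than Luigi's scheduler code. The class `CentralPlanner` and its method names are assumptions. The planner only records which worker has claimed which task; running the task stays with the worker.

```python
import threading


class CentralPlanner:
    """Hands each task id to at most one worker at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = {}  # task id -> id of the worker that claimed it

    def try_claim(self, task_id, worker_id):
        """Return True if worker_id may run task_id, False if someone else is on it."""
        with self._lock:
            if task_id in self._running:
                return False
            self._running[task_id] = worker_id
            return True

    def release(self, task_id, worker_id):
        """Give the task back once the worker has finished (or failed)."""
        with self._lock:
            # Only the worker that holds the claim can release it.
            if self._running.get(task_id) == worker_id:
                del self._running[task_id]
```

A worker would call `try_claim` before running a task and `release` afterwards. If two workers both need task A, only one of them gets to run it.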

Luigi is a way of coordinating lots of different tasks
… but you still have to figure out how to implement and scale them!

Do general-purpose stuff
Don’t focus on a specific platform
… but comes “batteries included”

Built-in support for HDFS & Hadoop
At Spotify we’re abandoning Python for batch processing tasks, replacing it with Crunch and Scalding. Luigi is great glue!
Our team, the Lambda team: 15 engineers, running 1,000+ Hadoop jobs daily, with 400+ Luigi tasks in production.
Our recommendation pipeline is a good example: Python M/R jobs, ML algos in C++, Java M/R jobs, Scalding, ML stuff in Python using scikit-learn, import stuff into Cassandra, import stuff into Postgres, send email reports, etc.

The one time we accidentally deleted 50TB of data
We didn’t have to write a single line of code to fix it: Luigi rescheduled thousands of tasks and ran them for 3 days

Some things are still not perfect

The missing parts
Execution is tied to scheduling – you can’t schedule something to run “in the cloud”
Visualization could be a lot more useful
There’s no built-in scheduling; you have to rely on crontab
These are all things we have in the backlog

What are some ideas for the future?

Separate scheduling and execution
[Diagram: Luigi central scheduler connected to Slave, Slave, Slave, Slave, ...]

Luigi implements some core beliefs
The #1 focus is on removing all boilerplate
The #2 focus is to be as general as possible
The #3 focus is to make it easy to go from test to production
