Luigi presentation NYC Data Science

Erik Bernhardsson
December 15, 2014
Luigi
NYC Data Science meetup
What is Luigi
Luigi is a workflow engine
If you run 10,000+ Hadoop jobs every day, you need one
If you play around with batch processing just for fun, you want one
It doesn’t help you write the code – that’s what Scalding, Pig, or other tools are for
It helps you with the plumbing of connecting lots of tasks into complicated pipelines,
especially if those tasks run on Hadoop
2
What do we use it for?
Music recommendations
A/B testing
Top lists
Ad targeting
Label reporting
Dashboards
… and a million other things!
3
Currently running 10,000+ Hadoop jobs every day
On average a Hadoop job is launched every 10s
There are 2,000+ Luigi tasks in production
4
Some history
… let’s go back to 2008!
5
The year was 2008
I was writing my master’s thesis about music recommendations
Had to run hundreds of long-running tasks to compute the output
6
Toy example: classify skipped tracks
$ python subsample_extract_features.py /log/endsongcleaned/2011-01-?? /tmp/subsampled
$ python train_model.py /tmp/subsampled model.pickle
$ python inspect_model.py model.pickle
7
[Diagram: Log d, Log d+1, …, Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output]
Reproducibility matters
…and automation.
The previous code is really hard to run again
8
Let’s make it into a big workflow
9
$ python run_everything.py
Reality: crashes will happen
10
How do you resume this?
Ability to resume matters
When you are developing something interactively, you will try and fail a lot
Failures will happen, and you want to resume once you fixed it
You want the system to figure out exactly what it has to re-run and nothing else
Atomic file operations are crucial for the ability to resume
11
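Atomicity here usually means staging output in a temporary file and renaming it into place, so a resumed pipeline can trust any output file that exists. A minimal sketch in plain Python (not Luigi’s actual implementation; `atomic_write` is an illustrative name):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path atomically: stage in a temp file, then rename.

    If the process crashes mid-write, the destination is either absent
    or complete, never half-written, so re-running the pipeline can
    safely skip any output that already exists."""
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # clean up the staging file on failure
        raise
```

The staging file lives in the same directory as the destination so the rename never crosses a filesystem boundary.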
So let’s make it possible to resume
12
13
But still annoying parts: hardcoded junk
Generalization matters
You should be able to re-run your entire pipeline with a new value for a parameter
Command line integration means you can run interactive experiments
14
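A sketch of what that command-line integration could look like with argparse (the flag names match the run_everything.py command shown later in the deck; everything else, including the default for --n-trees, is assumed for illustration):

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI for a run_everything.py-style script; the flag
    # names mirror the deck, the defaults are made up.
    parser = argparse.ArgumentParser(description="Run the whole pipeline")
    parser.add_argument("--date-first", required=True)
    parser.add_argument("--date-last", required=True)
    parser.add_argument("--n-trees", type=int, default=10)
    return parser.parse_args(argv)

args = parse_args(["--date-first", "2014-01-01",
                   "--date-last", "2014-01-31",
                   "--n-trees", "200"])
# argparse maps --date-first to args.date_first, etc.
```

With this in place, re-running the whole pipeline for a new parameter value is a one-line change on the command line.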
… now we’re getting something
15
$ python run_everything.py --date-first 2014-01-01 --date-last 2014-01-31 --n-trees 200
16
… but it’s hardly readable
BOILERPLATE
Boilerplate matters!
We keep re-implementing the same functionality
Let’s factor it out to a framework
17
A lot of real-world data pipelines are a lot more complex.
The ideal framework should make it trivial to build up big data pipelines where dependencies are non-trivial (e.g. depend on date algebra).
18
So I started thinking
Wanted to build something like GNU Make
19
What is Make and why is it pretty cool?
Build reusable rules
Specify what you want to build and then backtrack to find out what you need in order to get there
Reproducible runs
20
# the compiler: gcc for C programs, define as g++ for C++
CC = gcc

# compiler flags:
#  -g     adds debugging information to the executable file
#  -Wall  turns on most, but not all, compiler warnings
CFLAGS = -g -Wall

# the build target executable:
TARGET = myprog

all: $(TARGET)

$(TARGET): $(TARGET).c
	$(CC) $(CFLAGS) -o $(TARGET) $(TARGET).c

clean:
	$(RM) $(TARGET)
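The "specify the target, backtrack through prerequisites" idea can be sketched in a few lines of Python: a toy resolver, not Make itself (the `build` function and rule format are made up for illustration):

```python
def build(target, rules, built=None):
    """Resolve a target Make-style: build its prerequisites first,
    then run the target's own recipe.

    `rules` maps target -> (list_of_prereqs, recipe_callable_or_None);
    `built` records what ran, in order, and doubles as an
    "already up to date" check."""
    if built is None:
        built = []
    if target in built:
        return built  # already built this run
    prereqs, recipe = rules.get(target, ([], None))
    for dep in prereqs:
        build(dep, rules, built)  # backtrack into dependencies
    if recipe:
        recipe()
    built.append(target)
    return built

# Toy graph mirroring the Makefile above: myprog depends on myprog.c,
# which is a leaf with no recipe.
order = build("myprog", {
    "myprog": (["myprog.c"], lambda: None),
    "myprog.c": ([], None),
})
# prerequisites run before the target itself
```

Real Make also compares file timestamps to decide what is stale; this sketch only captures the dependency backtracking.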
We want something that works for a wide range of systems
We need to support lots of systems
“80% of data science is data munging”
21
Data processing needs to interact with lots of systems
Need to support practically any type of task:
Hadoop jobs
Database dumps
Ingest into Cassandra
Send email
SCP file somewhere else
22
My first attempt: builder
Use XML config to build up the dependency graph!
23
Don’t use XML
… seriously, don’t use it
24
Dependencies need code
Pipelines deployed in production often define dependencies between tasks in nontrivial ways
… and many other cases
25
Recursion (and date algebra)
BloomFilter(date=2014-05-01)
BloomFilter(date=2014-04-30)
Log(date=2014-04-30)
Log(date=2014-04-29)
...
Date algebra
Toplist(date_interval=2014-01)
Log(date=2014-01-01)
Log(date=2014-01-02)
...
Log(date=2014-01-31)
Enum types
IdMap(type=artist) IdMap(type=track)
IdToIdMap(from_type=artist, to_type=track)
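The recursive BloomFilter dependency above can be expanded in plain Python; this is a sketch of the dependency expansion, not Luigi syntax, and `bloom_filter_deps` is an illustrative name:

```python
from datetime import date, timedelta

def bloom_filter_deps(day, first_day):
    """Expand the recursive dependency chain for BloomFilter(day):
    each day's filter depends on the previous day's filter plus that
    day's log, back to some first day. Returns task names in build
    order, oldest first."""
    if day < first_day:
        return []  # recursion bottoms out before the first day
    deps = bloom_filter_deps(day - timedelta(days=1), first_day)
    deps.append("Log(date=%s)" % day)
    deps.append("BloomFilter(date=%s)" % day)
    return deps

deps = bloom_filter_deps(date(2014, 5, 1), date(2014, 4, 30))
# 2014-04-30's log and filter are scheduled before 2014-05-01's
```

This is exactly the kind of date algebra that is easy in a general-purpose language and painful in an XML config or a DSL.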
Don’t ever invent your own DSL
“It’s better to write domain-specific code in a general-purpose language than to write general-purpose code in a domain-specific language” – unknown author
Oozie is a good example of how messy it gets
26
2009: builder2
Solved all the things I just mentioned
- Dependency graph specified in Python
- Support for arbitrary tasks
- Error emails
- Support for lots of common data plumbing stuff: Hadoop jobs, Postgres, etc
- Lots of other things :)
27
Graphs!
28
More graphs!
29
Even more graphs!
30
What were the good bits?
Build up dependency graphs and visualize them
Non-event to go from development to deployment
Built-in HDFS integration but decoupled from the core library
What went wrong?
Still too much boilerplate
Pretty bad command line integration
31
32
Introducing Luigi
A workflow engine in Python
33
Luigi – History at Spotify
Late 2011: Elias Freider and I built it, released it into the wild at Spotify, and people started using it
“The Python era”
Late 2012: Open sourced it
Early 2013: First known company outside of Spotify: Foursquare
34
Luigi is your friendly plumber
Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration
35
Luigi Task 36
Luigi Task – breakdown 37
- The business logic of the task
- Where it writes output
- What other tasks it depends on
- Parameters for this task
Easy command line integration
So easy that you want to use Luigi for it
38
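The task itself appears only as an image in the original slides. Below is a rough stand-in written in plain Python so it runs standalone: it mimics the four pieces of a Luigi task (parameters, requires, output, run), but real Luigi provides luigi.Task, luigi.Parameter, LocalTarget, and a proper scheduler; the class and file names here are illustrative.

```python
import os
import tempfile

OUT_DIR = tempfile.mkdtemp()  # stand-in for /tmp/foo in the run below

class Task:
    """Tiny mimic of luigi.Task: same shape, none of the machinery."""
    def __init__(self, param):
        self.param = param        # parameters for this task
    def requires(self):
        return []                 # what other tasks it depends on
    def complete(self):
        return os.path.exists(self.output())

def run_task(task):
    # Depth-first: satisfy requirements first, then run the task
    # itself, skipping anything whose output already exists.
    for dep in task.requires():
        run_task(dep)
    if not task.complete():
        task.run()

class SomeOtherTask(Task):
    def output(self):             # where it writes output
        return os.path.join(OUT_DIR, "other-%d.txt" % self.param)
    def run(self):                # the business logic of the task
        with open(self.output(), "w") as f:
            f.write("intermediate data")

class MyTask(Task):
    def requires(self):
        return [SomeOtherTask(self.param)]
    def output(self):
        return os.path.join(OUT_DIR, "bar-%d.txt" % self.param)
    def run(self):
        with open(self.output(), "w") as f:
            f.write("hello, world\n")

run_task(MyTask(43))  # runs SomeOtherTask(43) first, as in the log below
```

Note the task only says what it depends on; the ordering falls out of the dependency graph, which is the whole point.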
$ python my_task.py MyTask --param 43
INFO: Scheduled MyTask(param=43)
INFO: Scheduled SomeOtherTask(param=43)
INFO: Done scheduling tasks
INFO: [pid 20235] Running SomeOtherTask(param=43)
INFO: [pid 20235] Done SomeOtherTask(param=43)
INFO: [pid 20235] Running MyTask(param=43)
INFO: [pid 20235] Done MyTask(param=43)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$ cat /tmp/foo/bar-43.txt
hello, world
$
Let’s go back to the example
39
[Diagram: Log d, Log d+1, …, Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output]
Code in Luigi
40
Extract the features
41
$ python demo.py SubsampleFeatures --date-interval 2013-11-01
DEBUG: Checking if SubsampleFeatures(test=False, date_interval=2013-11-01) is complete
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Checking if EndSongCleaned(date_interval=2013-11-01) is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 24345] Running SubsampleFeatures(test=False, date_interval=2013-11-01)
...
INFO: 13/11/08 02:15:11 INFO streaming.StreamJob: Tracking URL: http://lon2-
hadoopmaster-a1.c.lon.spotify.net:50030/jobdetails.jsp?jobid=job_201310180017_157113
INFO: 13/11/08 02:15:12 INFO streaming.StreamJob: map 0% reduce 0%
INFO: 13/11/08 02:15:27 INFO streaming.StreamJob: map 2% reduce 0%
INFO: 13/11/08 02:15:30 INFO streaming.StreamJob: map 7% reduce 0%
...
INFO: 13/11/08 02:16:10 INFO streaming.StreamJob: map 100% reduce 87%
INFO: 13/11/08 02:16:13 INFO streaming.StreamJob: map 100% reduce 100%
INFO: [pid 24345] Done SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$
Run on the command line
42
Step 2: Train a machine learning model
43
Let’s run everything on the command line from scratch
$ python luigi_workflow_full.py InspectModel --date-interval 2011-01-03
DEBUG: Checking if InspectModel(date_interval=2011-01-03, n_trees=10) is complete
INFO: Scheduled InspectModel(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-03) (PENDING)
INFO: Scheduled EndSongCleaned(date=2011-01-03) (DONE)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: 14/12/17 02:07:20 INFO mapreduce.Job: Running job: job_1418160358293_86477
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: Job job_1418160358293_86477 running in uber mode : false
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: map 0% reduce 0%
INFO: 14/12/17 02:08:34 INFO mapreduce.Job: map 2% reduce 0%
INFO: 14/12/17 02:08:36 INFO mapreduce.Job: map 3% reduce 0%
INFO: 14/12/17 02:08:38 INFO mapreduce.Job: map 5% reduce 0%
INFO: 14/12/17 02:08:39 INFO mapreduce.Job: map 10% reduce 0%
INFO: 14/12/17 02:08:40 INFO mapreduce.Job: map 17% reduce 0%
INFO: 14/12/17 02:16:30 INFO mapreduce.Job: map 100% reduce 100%
INFO: 14/12/17 02:16:32 INFO mapreduce.Job: Job job_1418160358293_86477 completed successfully
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
TrainClassifier(date_interval=2011-01-03, n_trees=10)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
TrainClassifier(date_interval=2011-01-03, n_trees=10)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
InspectModel(date_interval=2011-01-03, n_trees=10)
time 0.1335%
ms_played 96.9351%
shuffle 0.0728%
local_track 0.0000%
bitrate 2.8586%
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
InspectModel(date_interval=2011-01-03, n_trees=10)
44
Let’s make it more complicated – cross validation
45
[Diagram: two branches. Log d, …, Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier. Log e, …, Log e+k-1 → Subsample and extract features → Subsampled features. The classifier and the second set of features both feed into Cross validation]
Cross validation
implementation
$ python xv.py CrossValidation --date-interval-a 2012-11-01 --date-interval-b 2012-11-02
46
Run on the command line
$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=10) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) running CrossValidation(date_interval_a=2011-01-01,
date_interval_b=2011-01-02)
2011-01-01 (train) AUC: 0.9040
2011-01-02 ( test) AUC: 0.9040
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) done CrossValidation(date_interval_a=2011-01-01,
date_interval_b=2011-01-02)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) was stopped. Shutting down Keep-Alive thread
47
… no overfitting!
More trees!
$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02 --n-trees 100
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=100) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
INFO: Done scheduling tasks
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) running TrainClassifier(date_interval=2011-01-01, n_trees=100)
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) done TrainClassifier(date_interval=2011-01-01, n_trees=100)
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02,
n_trees=100)
2011-01-01 (train) AUC: 0.9074
2011-01-02 ( test) AUC: 0.8896
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02,
n_trees=100)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern,
pid=27835) was stopped. Shutting down Keep-Alive thread
48
… overfitting!
The nice things about Luigi
50
Minimal boilerplate
Overhead for a task is about 5 lines (class def + requires + output + run)
Easy command line integration
51
Everything is a directed acyclic graph
Makefile style
Tasks specify what they depend on, not what depends on them
52
Luigi’s visualizer
53
Dive into any task
54
Run with multiple workers
$ python dataflow.py --workers 3 AggregateArtists --date-interval 2013-W08
55
Error notifications
56
Process synchronization
[Diagram: Luigi workers 1 and 2, each holding an overlapping set of tasks (A, B, C, F), coordinated by the Luigi central planner]
The central planner prevents the same task from being run simultaneously, but all execution is done by the workers.
57
Luigi is a way of coordinating lots of different tasks
… but you still have to figure out how to implement and scale them!
58
Do general-purpose stuff
Don’t focus on a specific platform
… but comes “batteries included”
59
Built-in support for HDFS & Hadoop
At Spotify we’re abandoning Python for batch processing tasks, replacing it with Crunch and Scalding. Luigi is great glue!
Our team, the Lambda team: 15 engineers, running 1,000+ Hadoop jobs daily, with 400+ Luigi tasks in production.
Our recommendation pipeline is a good example: Python M/R jobs, ML algos in C++, Java M/R jobs,
Scalding, ML stuff in Python using scikit-learn, import stuff into Cassandra, import stuff into
Postgres, send email reports, etc.
60
The one time we accidentally deleted 50TB of data
We didn’t have to write a single line of code to fix it – Luigi rescheduled thousands of tasks and ran for 3 days
61
Some things are still not perfect
62
The missing parts
Execution is tied to scheduling – you can’t schedule something to run “in the cloud”
Visualization could be a lot more useful
There’s no built-in scheduler – you have to rely on crontab
These are all things we have in the backlog
63
What are some ideas for the future?
64
Separate scheduling and execution
65
[Diagram: Luigi central scheduler dispatching work to many slave workers]
Luigi in Scala?
66
Luigi implements some core beliefs
The #1 focus is on removing all boilerplate
The #2 focus is to be as general as possible
The #3 focus is to make it easy to go from test to production
67
Join the club!
Questions?
69
MariaDB stored procedures and why they should be improvedMariaDB stored procedures and why they should be improved
MariaDB stored procedures and why they should be improved
DSD-INT 2023 FloodAdapt - A decision-support tool for compound flood risk mit... by Deltares
DSD-INT 2023 FloodAdapt - A decision-support tool for compound flood risk mit...DSD-INT 2023 FloodAdapt - A decision-support tool for compound flood risk mit...
DSD-INT 2023 FloodAdapt - A decision-support tool for compound flood risk mit...
Deltares13 views
Tridens DevOps by Tridens
Tridens DevOpsTridens DevOps
Tridens DevOps
Tridens9 views
Upgrading Incident Management with Icinga - Icinga Camp Milan 2023 by Icinga
Upgrading Incident Management with Icinga - Icinga Camp Milan 2023Upgrading Incident Management with Icinga - Icinga Camp Milan 2023
Upgrading Incident Management with Icinga - Icinga Camp Milan 2023
Icinga38 views
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t... by Deltares
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...
Deltares9 views
Elevate your SAP landscape's efficiency and performance with HCL Workload Aut... by HCLSoftware
Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...
Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...
HCLSoftware6 views
El Arte de lo Possible by Neo4j
El Arte de lo PossibleEl Arte de lo Possible
El Arte de lo Possible
Neo4j38 views
Advanced API Mocking Techniques by Dimpy Adhikary
Advanced API Mocking TechniquesAdvanced API Mocking Techniques
Advanced API Mocking Techniques
Dimpy Adhikary19 views
DSD-INT 2023 Dam break simulation in Derna (Libya) using HydroMT_SFINCS - Prida by Deltares
DSD-INT 2023 Dam break simulation in Derna (Libya) using HydroMT_SFINCS - PridaDSD-INT 2023 Dam break simulation in Derna (Libya) using HydroMT_SFINCS - Prida
DSD-INT 2023 Dam break simulation in Derna (Libya) using HydroMT_SFINCS - Prida
Deltares18 views
Copilot Prompting Toolkit_All Resources.pdf by Riccardo Zamana
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdf
Riccardo Zamana6 views
Software testing company in India.pptx by SakshiPatel82
Software testing company in India.pptxSoftware testing company in India.pptx
Software testing company in India.pptx
SakshiPatel827 views
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko... by Deltares
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...
Deltares11 views

Luigi presentation NYC Data Science

  • 14. Generalization matters. You should be able to re-run your entire pipeline with a new value for a parameter. Command-line integration means you can run interactive experiments.
  • 15. … now we’re getting something: $ python run_everything.py --date-first 2014-01-01 --date-last 2014-01-31 --n-trees 200
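A minimal sketch of how flags like these can be parsed with Python's standard argparse (the flag names come from the slide; the function name and everything else here is hypothetical):

```python
import argparse

def parse_args(argv):
    # Every pipeline parameter becomes a command-line flag, so
    # re-running with new values is just a different invocation.
    parser = argparse.ArgumentParser(description="Run the whole pipeline")
    parser.add_argument("--date-first", required=True)
    parser.add_argument("--date-last", required=True)
    parser.add_argument("--n-trees", type=int, default=10)
    return parser.parse_args(argv)

args = parse_args(["--date-first", "2014-01-01",
                   "--date-last", "2014-01-31",
                   "--n-trees", "200"])
print(args.date_first, args.date_last, args.n_trees)
```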
  • 16. … but it’s hardly readable. BOILERPLATE
  • 17. Boilerplate matters! We keep re-implementing the same functionality. Let’s factor it out into a framework.
  • 18. Many real-world data pipelines are a lot more complex. The ideal framework should make it trivial to build up big data pipelines where dependencies are non-trivial (e.g. depend on date algebra).
  • 19. So I started thinking: I wanted to build something like GNU Make.
  • 20. What is Make and why is it pretty cool? Build reusable rules. Specify what you want to build, then backtrack to find out what you need in order to get there. Reproducible runs.

    # the compiler: gcc for C programs; define as g++ for C++
    CC = gcc

    # compiler flags:
    # -g adds debugging information to the executable file
    # -Wall turns on most, but not all, compiler warnings
    CFLAGS = -g -Wall

    # the build target executable:
    TARGET = myprog

    all: $(TARGET)

    $(TARGET): $(TARGET).c
    	$(CC) $(CFLAGS) -o $(TARGET) $(TARGET).c

    clean:
    	$(RM) $(TARGET)
  • 21. We want something that works for a wide range of systems. We need to support lots of systems: “80% of data science is data munging.”
  • 22. Data processing needs to interact with lots of systems. We need to support practically any type of task: Hadoop jobs, database dumps, ingests into Cassandra, sending email, SCP-ing a file somewhere else.
  • 23. My first attempt: builder. Use XML config to build up the dependency graph!
  • 24. Don’t use XML … seriously, don’t use it.
  • 25. Dependencies need code. Pipelines deployed in production often have nontrivial ways of defining dependencies between tasks:
    Recursion (and date algebra): BloomFilter(date=2014-05-01) depends on BloomFilter(date=2014-04-30) and Log(date=2014-04-30); that filter in turn depends on Log(date=2014-04-29), and so on.
    Date algebra: Toplist(date_interval=2014-01) depends on Log(date=2014-01-01), Log(date=2014-01-02), ..., Log(date=2014-01-31).
    Enum types: IdToIdMap(from_type=artist, to_type=track) depends on IdMap(type=artist) and IdMap(type=track).
    … and many other cases.
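These dependency shapes are easy to express as ordinary Python. A toy sketch (task names from the slide; the functions only compute dependency edges, they don't run anything):

```python
from datetime import date, timedelta

def bloomfilter_deps(d):
    # Recursion: today's BloomFilter depends on yesterday's
    # BloomFilter plus yesterday's Log, and so on backwards.
    y = d - timedelta(days=1)
    return [("BloomFilter", y), ("Log", y)]

def toplist_deps(year, month):
    # Date algebra: a monthly Toplist depends on every daily Log
    # in that month.
    d, deps = date(year, month, 1), []
    while d.month == month:
        deps.append(("Log", d))
        d += timedelta(days=1)
    return deps

print(bloomfilter_deps(date(2014, 5, 1)))
print(len(toplist_deps(2014, 1)))  # 31 daily logs in January
```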
  • 26. Don’t ever invent your own DSL. “It’s better to write domain-specific code in a general-purpose language than general-purpose code in a domain-specific language” – unknown author. Oozie is a good example of how messy it gets.
  • 27. 2009: builder2. Solved all the things I just mentioned: dependency graph specified in Python, support for arbitrary tasks, error emails, support for lots of common data-plumbing stuff (Hadoop jobs, Postgres, etc.), and lots of other things :)
  • 31. What were the good bits? Build up dependency graphs and visualize them; a non-event to go from development to deployment; built-in HDFS integration, but decoupled from the core library. What went wrong? Still too much boilerplate, and pretty bad command-line integration.
  • 33. Introducing Luigi: a workflow engine in Python.
  • 34. Luigi – history at Spotify. Late 2011: Elias Freider and I build it and release it into the wild at Spotify; people start using it. “The Python era.” Late 2012: open-sourced it. Early 2013: first known company using it outside of Spotify: Foursquare.
  • 35. Luigi is your friendly plumber: simple dependency definitions, emphasis on Hadoop/HDFS integration, atomic file operations, data-flow visualization, command-line integration.
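The atomic-file-operations point deserves a sketch: write to a temporary file and rename it into place, so a crash mid-write never leaves a partial output that looks complete. This is the general pattern, not Luigi's actual code:

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory, then rename.
    # os.replace is atomic on POSIX when source and destination
    # are on the same filesystem, so `path` either holds the old
    # content or the complete new content -- never a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

A resumable pipeline can then treat "output file exists" as "task done", which is exactly the check that lets a scheduler skip finished work.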
  • 37. Luigi Task – breakdown. A task has four pieces: the business logic of the task (run), where it writes output (output), what other tasks it depends on (requires), and parameters for this task.
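To make that breakdown concrete, here is a pure-Python sketch of the contract plus a toy runner (this is not Luigi's real API; the class and task names are illustrative):

```python
class Task:
    # The contract: what a task depends on, what it produces,
    # and the business logic that produces it.
    def requires(self):
        return []
    def output(self):
        raise NotImplementedError
    def run(self):
        raise NotImplementedError

def build(task, done):
    # Run dependencies first; skip any task whose output already
    # exists. This check is what makes resuming cheap.
    for dep in task.requires():
        build(dep, done)
    if task.output() not in done:
        task.run()
        done.add(task.output())

class Subsample(Task):
    def output(self): return "subsampled-features"
    def run(self): print("subsampling")

class Train(Task):
    def __init__(self, n_trees=10):
        self.n_trees = n_trees          # a task parameter
    def requires(self): return [Subsample()]
    def output(self): return "model-%d-trees" % self.n_trees
    def run(self): print("training with", self.n_trees, "trees")

done = set()
build(Train(n_trees=10), done)  # runs Subsample, then Train
build(Train(n_trees=10), done)  # everything done: runs nothing
```

In real Luigi, output() returns a Target (e.g. an HDFS path) and completeness is checked by target existence rather than an in-memory set.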
  • 38. Easy command-line integration. So easy that you want to use Luigi for it.
    $ python my_task.py MyTask --param 43
    INFO: Scheduled MyTask(param=43)
    INFO: Scheduled SomeOtherTask(param=43)
    INFO: Done scheduling tasks
    INFO: [pid 20235] Running SomeOtherTask(param=43)
    INFO: [pid 20235] Done SomeOtherTask(param=43)
    INFO: [pid 20235] Running MyTask(param=43)
    INFO: [pid 20235] Done MyTask(param=43)
    INFO: Done
    INFO: There are no more tasks to run at this time
    INFO: Worker was stopped. Shutting down Keep-Alive thread
    $ cat /tmp/foo/bar-43.txt
    hello, world
    $
  • 39. Let’s go back to the example. (Diagram: logs for days d … d+k-1 feed “subsample and extract features”, producing subsampled features; these train a classifier; then look at the output.)
  • 42. Run on the command line:
    $ python demo.py SubsampleFeatures --date-interval 2013-11-01
    DEBUG: Checking if SubsampleFeatures(test=False, date_interval=2013-11-01) is complete
    INFO: Scheduled SubsampleFeatures(test=False, date_interval=2013-11-01)
    DEBUG: Checking if EndSongCleaned(date_interval=2013-11-01) is complete
    INFO: Done scheduling tasks
    DEBUG: Asking scheduler for work...
    DEBUG: Pending tasks: 1
    INFO: [pid 24345] Running SubsampleFeatures(test=False, date_interval=2013-11-01)
    ...
    INFO: 13/11/08 02:15:11 INFO streaming.StreamJob: Tracking URL: http://lon2-hadoopmaster-a1.c.lon.spotify.net:50030/jobdetails.jsp?jobid=job_201310180017_157113
    INFO: 13/11/08 02:15:12 INFO streaming.StreamJob: map 0% reduce 0%
    INFO: 13/11/08 02:15:27 INFO streaming.StreamJob: map 2% reduce 0%
    INFO: 13/11/08 02:15:30 INFO streaming.StreamJob: map 7% reduce 0%
    ...
    INFO: 13/11/08 02:16:10 INFO streaming.StreamJob: map 100% reduce 87%
    INFO: 13/11/08 02:16:13 INFO streaming.StreamJob: map 100% reduce 100%
    INFO: [pid 24345] Done SubsampleFeatures(test=False, date_interval=2013-11-01)
    DEBUG: Asking scheduler for work...
    INFO: Done
    INFO: There are no more tasks to run at this time
    INFO: Worker was stopped. Shutting down Keep-Alive thread
    $
  • 43. Step 2: train a machine learning model.
  • 44. Let’s run everything on the command line from scratch:
    $ python luigi_workflow_full.py InspectModel --date-interval 2011-01-03
    DEBUG: Checking if InspectModel(date_interval=2011-01-03, n_trees=10) is complete
    INFO: Scheduled InspectModel(date_interval=2011-01-03, n_trees=10) (PENDING)
    INFO: Scheduled TrainClassifier(date_interval=2011-01-03, n_trees=10) (PENDING)
    INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-03) (PENDING)
    INFO: Scheduled EndSongCleaned(date=2011-01-03) (DONE)
    INFO: Done scheduling tasks
    INFO: Running Worker with 1 processes
    INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running SubsampleFeatures(test=False, date_interval=2011-01-03)
    INFO: 14/12/17 02:07:20 INFO mapreduce.Job: Running job: job_1418160358293_86477
    INFO: 14/12/17 02:07:31 INFO mapreduce.Job: Job job_1418160358293_86477 running in uber mode : false
    INFO: 14/12/17 02:07:31 INFO mapreduce.Job: map 0% reduce 0%
    INFO: 14/12/17 02:08:34 INFO mapreduce.Job: map 2% reduce 0%
    INFO: 14/12/17 02:08:36 INFO mapreduce.Job: map 3% reduce 0%
    INFO: 14/12/17 02:08:38 INFO mapreduce.Job: map 5% reduce 0%
    INFO: 14/12/17 02:08:39 INFO mapreduce.Job: map 10% reduce 0%
    INFO: 14/12/17 02:08:40 INFO mapreduce.Job: map 17% reduce 0%
    INFO: 14/12/17 02:16:30 INFO mapreduce.Job: map 100% reduce 100%
    INFO: 14/12/17 02:16:32 INFO mapreduce.Job: Job job_1418160358293_86477 completed successfully
    INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done SubsampleFeatures(test=False, date_interval=2011-01-03)
    INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running TrainClassifier(date_interval=2011-01-03, n_trees=10)
    INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done TrainClassifier(date_interval=2011-01-03, n_trees=10)
    INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running InspectModel(date_interval=2011-01-03, n_trees=10)
    time 0.1335%
    ms_played 96.9351%
    shuffle 0.0728%
    local_track 0.0000%
    bitrate 2.8586%
    INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done InspectModel(date_interval=2011-01-03, n_trees=10)
  • 45. Let’s make it more complicated: cross-validation. (Diagram: the subsample-and-train pipeline from before for interval d, plus a second subsample-and-extract-features branch for interval e feeding a cross-validation step.)
  • 46. Cross-validation implementation: $ python xv.py CrossValidation --date-interval-a 2012-11-01 --date-interval-b 2012-11-02
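The dependency shape behind that command, as plain data (task names from the slides; the helper function is hypothetical):

```python
def cross_validation_deps(interval_a, interval_b):
    # CrossValidation needs a classifier trained on interval A
    # (which itself needs A's subsampled features) plus held-out
    # features from interval B to evaluate on.
    return [
        ("TrainClassifier", interval_a),
        ("SubsampleFeatures", interval_a),
        ("SubsampleFeatures", interval_b),
    ]

print(cross_validation_deps("2011-01-01", "2011-01-02"))
```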
  • 47. Run on the command line:
    $ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02
    INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) (PENDING)
    INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=10) (DONE)
    INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
    INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
    INFO: Done scheduling tasks
    INFO: Running Worker with 1 processes
    INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02)
    2011-01-01 (train) AUC: 0.9040
    2011-01-02 ( test) AUC: 0.9040
    INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02)
    INFO: Done
    INFO: There are no more tasks to run at this time
    INFO: Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) was stopped. Shutting down Keep-Alive thread
    … no overfitting!
  • 48. More trees!
    $ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02 --n-trees 100
    INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) (PENDING)
    INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=100) (PENDING)
    INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
    INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
    INFO: Done scheduling tasks
    INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) running TrainClassifier(date_interval=2011-01-01, n_trees=100)
    INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) done TrainClassifier(date_interval=2011-01-01, n_trees=100)
    INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100)
    2011-01-01 (train) AUC: 0.9074
    2011-01-02 ( test) AUC: 0.8896
    INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100)
    INFO: Done
    INFO: There are no more tasks to run at this time
    INFO: Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) was stopped. Shutting down Keep-Alive thread
    … overfitting!
  • 50. The nice things about Luigi
  • 51. Minimal boilerplate: overhead for a task is about 5 lines (class def + requires + output + run). Easy command-line integration.
  • 52. Everything is a directed acyclic graph. Makefile-style: tasks specify what they depend on, not what other things depend on them.
  • 54. Dive into any task.
  • 55. Run with multiple workers: $ python dataflow.py --workers 3 AggregateArtists --date-interval 2013-W08
  • 57. Process synchronization. The Luigi central planner prevents the same task from being run simultaneously, but all execution is done by the workers. (Diagram: Luigi worker 1 running tasks A, B, C and Luigi worker 2 running A, C, F, both coordinated by the Luigi central planner.)
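The planner's core job can be sketched in a few lines: hand each task id to at most one worker at a time. This is a toy model of the idea, not Luigi's actual scheduler:

```python
class CentralPlanner:
    def __init__(self):
        self.running = set()

    def acquire(self, task_id):
        # A worker asks for a task; refuse if another worker
        # is already executing the same task.
        if task_id in self.running:
            return False
        self.running.add(task_id)
        return True

    def release(self, task_id):
        # Called when the worker finishes (or fails) the task.
        self.running.discard(task_id)

planner = CentralPlanner()
print(planner.acquire("A"))  # True: worker 1 takes task A
print(planner.acquire("A"))  # False: worker 2 must wait
planner.release("A")
print(planner.acquire("A"))  # True: A can run again
```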
  • 58. Luigi is a way of coordinating lots of different tasks … but you still have to figure out how to implement and scale them!
  • 59. Do general-purpose stuff. Don’t focus on a specific platform … but Luigi comes “batteries included”.
  • 60. Built-in support for HDFS and Hadoop. At Spotify we’re abandoning Python for batch-processing tasks, replacing it with Crunch and Scalding; Luigi is a great glue! Our team, the Lambda team: 15 engineers, running 1,000+ Hadoop jobs daily, with 400+ Luigi tasks in production. Our recommendation pipeline is a good example: Python M/R jobs, ML algos in C++, Java M/R jobs, Scalding, ML stuff in Python using scikit-learn, imports into Cassandra, imports into Postgres, email reports, etc.
  • 61. The one time we accidentally deleted 50 TB of data, we didn’t have to write a single line of code to fix it – Luigi rescheduled thousands of tasks and ran them for 3 days.
  • 62. Some things are still not perfect.
  • 63. The missing parts: execution is tied to scheduling, so you can’t schedule something to run “in the cloud”; visualization could be a lot more useful; there’s no built-in scheduling, so you have to rely on crontab. These are all things we have in the backlog.
  • 64. What are some ideas for the future?
  • 65. Separate scheduling and execution. (Diagram: a Luigi central scheduler dispatching work to many slaves.)
  • 67. Luigi implements some core beliefs: the #1 focus is removing all boilerplate; the #2 focus is being as general as possible; the #3 focus is making it easy to go from test to production.