December 15, 2014
Luigi
NYC Data Science meetup
What is Luigi
Luigi is a workflow engine
If you run 10,000+ Hadoop jobs every day, you need one
If you play around with batch processing just for fun, you want one
Doesn’t help you with the code – that’s what Scalding, Pig, or anything else is good at
It helps you with the plumbing of connecting lots of tasks into complicated pipelines,
especially if those tasks run on Hadoop
2
What do we use it for?
Music recommendations
A/B testing
Top lists
Ad targeting
Label reporting
Dashboards
… and a million other things!
3
Currently running 10,000+ Hadoop jobs every day
On average a Hadoop job is launched every 10s
There’s 2,000+ Luigi tasks in production
4
Some history
… let’s go back to 2008!
5
The year was 2008
I was writing my master’s thesis
about music recommendations
Had to run hundreds of long-running
tasks to compute the output
6
Toy example: classify skipped tracks
$ python subsample_extract_features.py /log/endsongcleaned/2011-01-?? /tmp/subsampled
$ python train_model.py /tmp/subsampled model.pickle
$ python inspect_model.py model.pickle
7
[Diagram: logs for days d through d+k-1 → subsample and extract features → subsampled features → train classifier → classifier → look at the output]
Reproducibility matters
… and automation.
The previous code is really hard to run again
8
Let’s make it into a big workflow
9
$ python run_everything.py
Reality: crashes will happen
10
How do you resume this?
Ability to resume matters
When you are developing something interactively, you will try and fail a lot
Failures will happen, and you want to resume once you fixed it
You want the system to figure out exactly what it has to re-run and nothing else
Atomic file operations are crucial for the ability to resume
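Both points (skip what is already done, write outputs atomically) can be sketched in a few lines of plain Python. The helper names here are made up for illustration:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write to a temp file, then rename: readers never see partial output."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.rename(tmp, path)  # atomic on POSIX: the output exists fully or not at all

def run_step(output_path, compute):
    """Skip a step whose output already exists: this is what makes resuming work."""
    if os.path.exists(output_path):
        return  # already done in a previous (possibly crashed) run
    atomic_write(output_path, compute())
```

Because an output only appears once it is complete, “file exists” is a safe proxy for “step is done” – exactly the contract Luigi builds on.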
11
So let’s make it possible to resume
12
13
But there are still annoying parts: hardcoded junk
Generalization matters
You should be able to re-run your entire pipeline with a new value for a parameter
Command line integration means you can run interactive experiments
14
… now we’re getting something
15
$ python run_everything.py --date-first 2014-01-01 --date-last 2014-01-31 --n-trees 200
16
… but it’s hardly readable – boilerplate everywhere
Boilerplate matters!
We keep re-implementing the same functionality
Let’s factor it out to a framework
17
A lot of real-world data pipelines are a lot more complex. The ideal framework should make it trivial to build up big data pipelines where dependencies are non-trivial (e.g. depend on date algebra).
18
So I started thinking
Wanted to build something like GNU Make
19
What is Make and why is it pretty cool?
Build reusable rules
Specify what you want to build and then backtrack to find out what you need in order to get there
Reproducible runs
20
# the compiler: gcc for C programs, define as g++ for C++
CC = gcc

# compiler flags:
#  -g     adds debugging information to the executable file
#  -Wall  turns on most, but not all, compiler warnings
CFLAGS = -g -Wall

# the build target executable:
TARGET = myprog

all: $(TARGET)

$(TARGET): $(TARGET).c
	$(CC) $(CFLAGS) -o $(TARGET) $(TARGET).c

clean:
	$(RM) $(TARGET)
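Make’s core idea (specify the target, then backtrack through its prerequisites) can be sketched in a few lines of Python. The rules table below is a hypothetical stand-in for the Makefile above, not how Make is actually implemented:

```python
def build(target, rules, built=None):
    """Build `target` Make-style: recursively build its prerequisites first."""
    if built is None:
        built = []
    if target in built:
        return built  # already built during this run
    deps, action = rules.get(target, ([], None))
    for dep in deps:
        build(dep, rules, built)
    if action is not None:
        action()  # stand-in for invoking the compiler
    built.append(target)
    return built

# Hypothetical rule table mirroring the Makefile above:
# myprog depends on myprog.c; a source file has no prerequisites and no action.
rules = {
    "myprog": (["myprog.c"], lambda: None),
    "myprog.c": ([], None),
}
```

`build("myprog", rules)` visits myprog.c before myprog, which is the backtracking order described above (real Make additionally compares file timestamps to skip up-to-date targets).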
We want something that works for a wide range of systems – we need to support lots of them
“80% of data science is data munging”
21
Data processing needs to interact with lots of systems
Need to support practically any type of task:
Hadoop jobs
Database dumps
Ingest into Cassandra
Send email
SCP file somewhere else
22
My first attempt: builder
Use XML config to build up the dependency graph!
23
Don’t use XML
… seriously, don’t use it
24
Dependencies need code
Pipelines deployed in production often have nontrivial ways of defining dependencies between tasks
… and many other cases
25
Recursion (and date algebra)
BloomFilter(date=2014-05-01)
BloomFilter(date=2014-04-30)
Log(date=2014-04-30)
Log(date=2014-04-29)
...
Date algebra
Toplist(date_interval=2014-01)
Log(date=2014-01-01)
Log(date=2014-01-02)
...
Log(date=2014-01-31)
Enum types
IdMap(type=artist) IdMap(type=track)
IdToIdMap(from_type=artist, to_type=track)
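The date algebra case above, where a monthly Toplist expands into one Log dependency per day, is easy in a general purpose language and awkward in config files. A sketch (the Toplist/Log strings are just the names from the slide, not a real API):

```python
from datetime import date, timedelta

def days_in_month(year, month):
    """Yield every date in the given month: the expansion behind
    Toplist(date_interval=2014-01) requiring Log(date=2014-01-01) ... Log(date=2014-01-31)."""
    d = date(year, month, 1)
    while d.month == month:
        yield d
        d += timedelta(days=1)

def toplist_requires(year, month):
    # One Log dependency per day of the interval, as in the slide.
    return ["Log(date=%s)" % d.isoformat() for d in days_in_month(year, month)]
```

`toplist_requires(2014, 1)` produces the 31 Log dependencies shown above; recursion (BloomFilter depending on the previous day’s BloomFilter) is just as direct in code.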
Don’t ever invent your own DSL
“It’s better to write domain specific code in a general purpose language than to write general purpose code in a domain specific language” – unknown author
Oozie is a good example of how messy it gets
26
2009: builder2
Solved all the things I just mentioned
- Dependency graph specified in Python
- Support for arbitrary tasks
- Error emails
- Support for lots of common data plumbing stuff: Hadoop jobs, Postgres, etc
- Lots of other things :)
27
Graphs!
28
More graphs!
29
Even more graphs!
30
What were the good bits?
Build up dependency graphs and visualize them
Non-event to go from development to deployment
Built-in HDFS integration but decoupled from the core library
What went wrong?
Still too much boilerplate
Pretty bad command line integration
31
32
Introducing Luigi
A workflow engine in Python
33
Luigi – History at Spotify
Late 2011: Elias Freider and I build it, release it into the wild at Spotify, people start using it
“The Python era”
Late 2012: Open source it
Early 2013: First known company outside of Spotify: Foursquare
34
Luigi is your friendly plumber
Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration
35
Luigi Task 36
Luigi Task – breakdown 37
- The business logic of the task
- Where it writes output
- What other tasks it depends on
- Parameters for this task
Easy command line integration
So easy that you want to use Luigi for it
38
$ python my_task.py MyTask --param 43
INFO: Scheduled MyTask(param=43)
INFO: Scheduled SomeOtherTask(param=43)
INFO: Done scheduling tasks
INFO: [pid 20235] Running SomeOtherTask(param=43)
INFO: [pid 20235] Done SomeOtherTask(param=43)
INFO: [pid 20235] Running MyTask(param=43)
INFO: [pid 20235] Done MyTask(param=43)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$ cat /tmp/foo/bar-43.txt
hello, world
$
Let’s go back to the example
39
[Diagram: logs for days d through d+k-1 → subsample and extract features → subsampled features → train classifier → classifier → look at the output]
Code in Luigi
40
Extract the features
41
$ python demo.py SubsampleFeatures --date-interval 2013-11-01
DEBUG: Checking if SubsampleFeatures(test=False, date_interval=2013-11-01) is complete
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Checking if EndSongCleaned(date_interval=2013-11-01) is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 24345] Running SubsampleFeatures(test=False, date_interval=2013-11-01)
...
INFO: 13/11/08 02:15:11 INFO streaming.StreamJob: Tracking URL: http://lon2-
hadoopmaster-a1.c.lon.spotify.net:50030/jobdetails.jsp?jobid=job_201310180017_157113
INFO: 13/11/08 02:15:12 INFO streaming.StreamJob: map 0% reduce 0%
INFO: 13/11/08 02:15:27 INFO streaming.StreamJob: map 2% reduce 0%
INFO: 13/11/08 02:15:30 INFO streaming.StreamJob: map 7% reduce 0%
...
INFO: 13/11/08 02:16:10 INFO streaming.StreamJob: map 100% reduce 87%
INFO: 13/11/08 02:16:13 INFO streaming.StreamJob: map 100% reduce 100%
INFO: [pid 24345] Done SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$
Run on the command line
42
Step 2: Train a machine learning model
43
Let’s run everything on the command line from scratch
$ python luigi_workflow_full.py InspectModel --date-interval 2011-01-03
DEBUG: Checking if InspectModel(date_interval=2011-01-03, n_trees=10) is complete
INFO: Scheduled InspectModel(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-03) (PENDING)
INFO: Scheduled EndSongCleaned(date=2011-01-03) (DONE)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: 14/12/17 02:07:20 INFO mapreduce.Job: Running job: job_1418160358293_86477
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: Job job_1418160358293_86477 running in uber mode : false
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: map 0% reduce 0%
INFO: 14/12/17 02:08:34 INFO mapreduce.Job: map 2% reduce 0%
INFO: 14/12/17 02:08:36 INFO mapreduce.Job: map 3% reduce 0%
INFO: 14/12/17 02:08:38 INFO mapreduce.Job: map 5% reduce 0%
INFO: 14/12/17 02:08:39 INFO mapreduce.Job: map 10% reduce 0%
INFO: 14/12/17 02:08:40 INFO mapreduce.Job: map 17% reduce 0%
INFO: 14/12/17 02:16:30 INFO mapreduce.Job: map 100% reduce 100%
INFO: 14/12/17 02:16:32 INFO mapreduce.Job: Job job_1418160358293_86477 completed successfully
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
TrainClassifier(date_interval=2011-01-03, n_trees=10)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
TrainClassifier(date_interval=2011-01-03, n_trees=10)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
InspectModel(date_interval=2011-01-03, n_trees=10)
time 0.1335%
ms_played 96.9351%
shuffle 0.0728%
local_track 0.0000%
bitrate 2.8586%
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
InspectModel(date_interval=2011-01-03, n_trees=10)
44
Let’s make it more complicated – cross validation
45
[Diagram: two pipelines side by side – logs for days d…d+k-1 and e…e+k-1 are each subsampled into features; the first feature set trains the classifier, and both feature sets feed cross validation]
Cross validation
Cross validation
implementation
$ python xv.py CrossValidation --date-interval-a 2012-11-01 --date-interval-b 2012-11-02
46
Run on the command line
$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=10) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) running CrossValidation(date_interval_a=2011-01-01,
date_interval_b=2011-01-02)
2011-01-01 (train) AUC: 0.9040
2011-01-02 ( test) AUC: 0.9040
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) done CrossValidation(date_interval_a=2011-01-01,
date_interval_b=2011-01-02)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) was stopped. Shutting down Keep-Alive thread
47
… no overfitting!
More trees!
$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02 --n-trees 100
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=100) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
INFO: Done scheduling tasks
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) running TrainClassifier(date_interval=2011-01-01, n_trees=100)
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) done TrainClassifier(date_interval=2011-01-01, n_trees=100)
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02,
n_trees=100)
2011-01-01 (train) AUC: 0.9074
2011-01-02 ( test) AUC: 0.8896
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02,
n_trees=100)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern,
pid=27835) was stopped. Shutting down Keep-Alive thread
48
… overfitting!
The nice things about Luigi
50
Overhead for a task is about 5 lines (class def + requires + output + run)
Easy command line integration
Minimal boilerplate
51
Everything is a directed acyclic graph
Makefile style
Tasks specify what they depend on, not what other things depend on them
52
Luigi’s visualizer
53
Dive into any task
54
Run with multiple workers
$ python dataflow.py --workers 3 AggregateArtists --date-interval 2013-W08
55
Error notifications
56
Process synchronization
[Diagram: Luigi worker 1 running tasks A, B, C and Luigi worker 2 running tasks A, C, F, both coordinated by the Luigi central planner]
The central planner prevents the same task from being run simultaneously, but all execution is done by the workers.
57
Luigi is a way of coordinating lots of different tasks
… but you still have to figure out how to implement and scale them!
58
Do general-purpose stuff
Don’t focus on a specific platform
… but comes “batteries included”
59
Built-in support for HDFS & Hadoop
At Spotify we’re abandoning Python for batch processing tasks, replacing it with Crunch and Scalding. Luigi is a great glue!
Our team, the Lambda team: 15 engineers, running 1,000+ Hadoop jobs daily, with 400+ Luigi Tasks in production.
Our recommendation pipeline is a good example: Python M/R jobs, ML algos in C++, Java M/R jobs, Scalding, ML stuff in Python using scikit-learn, importing into Cassandra, importing into Postgres, sending email reports, etc.
60
The one time we accidentally deleted 50TB of data
We didn’t have to write a single line of code to fix it – Luigi rescheduled thousands of tasks and ran them for 3 days
61
Some things are still not perfect
62
The missing parts
Execution is tied to scheduling – you can’t schedule something to run “in the cloud”
Visualization could be a lot more useful
There’s no built-in scheduling – you have to rely on crontab
These are all things we have in the backlog
63
What are some ideas for the future?
64
Separate scheduling and execution
65
[Diagram: Luigi central scheduler dispatching work to many slave nodes]
Luigi in Scala?
66
Luigi implements some core beliefs
The #1 focus is on removing all boilerplate
The #2 focus is to be as general as possible
The #3 focus is to make it easy to go from test to production
67
Join the club!
Questions?
69
More Related Content

What's hot

Singer, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureSinger, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureDiscover Pinterest
 
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)Seongyun Byeon
 
How to Avoid Common Mistakes When Using Reactor Netty
How to Avoid Common Mistakes When Using Reactor NettyHow to Avoid Common Mistakes When Using Reactor Netty
How to Avoid Common Mistakes When Using Reactor NettyVMware Tanzu
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters MongoDB
 
Sizing Your MongoDB Cluster
Sizing Your MongoDB ClusterSizing Your MongoDB Cluster
Sizing Your MongoDB ClusterMongoDB
 
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기AWSKRUG - AWS한국사용자모임
 
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기 [데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기 choi kyumin
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
 
쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기Brian Hong
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkDatabricks
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
Funnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and DruidFunnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and DruidDatabricks
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영NAVER D2
 
Kubernetes for machine learning
Kubernetes for machine learningKubernetes for machine learning
Kubernetes for machine learningAkash Agrawal
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouseVianney FOUCAULT
 
PostgreSQL Tutorial For Beginners | Edureka
PostgreSQL Tutorial For Beginners | EdurekaPostgreSQL Tutorial For Beginners | Edureka
PostgreSQL Tutorial For Beginners | EdurekaEdureka!
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupErik Bernhardsson
 

What's hot (20)

Singer, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureSinger, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging Infrastructure
 
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
 
How to Avoid Common Mistakes When Using Reactor Netty
How to Avoid Common Mistakes When Using Reactor NettyHow to Avoid Common Mistakes When Using Reactor Netty
How to Avoid Common Mistakes When Using Reactor Netty
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters
 
Sizing Your MongoDB Cluster
Sizing Your MongoDB ClusterSizing Your MongoDB Cluster
Sizing Your MongoDB Cluster
 
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
 
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기 [데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for Performance
 
쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
Funnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and DruidFunnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and Druid
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
 
Kubernetes for machine learning
Kubernetes for machine learningKubernetes for machine learning
Kubernetes for machine learning
 
Serving ML easily with FastAPI
Serving ML easily with FastAPIServing ML easily with FastAPI
Serving ML easily with FastAPI
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 
PostgreSQL Tutorial For Beginners | Edureka
PostgreSQL Tutorial For Beginners | EdurekaPostgreSQL Tutorial For Beginners | Edureka
PostgreSQL Tutorial For Beginners | Edureka
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetup
 

Viewers also liked

[야생의 땅: 듀랑고] 지형 관리 완전 자동화 - 생생한 AWS와 Docker 체험기
[야생의 땅: 듀랑고] 지형 관리 완전 자동화 - 생생한 AWS와 Docker 체험기[야생의 땅: 듀랑고] 지형 관리 완전 자동화 - 생생한 AWS와 Docker 체험기
[야생의 땅: 듀랑고] 지형 관리 완전 자동화 - 생생한 AWS와 Docker 체험기Sumin Byeon
 
Online game server on Akka.NET (NDC2016)
Online game server on Akka.NET (NDC2016)Online game server on Akka.NET (NDC2016)
Online game server on Akka.NET (NDC2016)Esun Kim
 
영상 데이터의 처리와 정보의 추출
영상 데이터의 처리와 정보의 추출영상 데이터의 처리와 정보의 추출
영상 데이터의 처리와 정보의 추출동윤 이
 
김병관 성공캠프 SNS팀 자원봉사 후기
김병관 성공캠프 SNS팀 자원봉사 후기김병관 성공캠프 SNS팀 자원봉사 후기
김병관 성공캠프 SNS팀 자원봉사 후기Harns (Nak-Hyoung) Kim
 
게임회사 취업을 위한 현실적인 전략 3가지
게임회사 취업을 위한 현실적인 전략 3가지게임회사 취업을 위한 현실적인 전략 3가지
게임회사 취업을 위한 현실적인 전략 3가지Harns (Nak-Hyoung) Kim
 
Re:Zero부터 시작하지 않는 오픈소스 개발
Re:Zero부터 시작하지 않는 오픈소스 개발Re:Zero부터 시작하지 않는 오픈소스 개발
Re:Zero부터 시작하지 않는 오픈소스 개발Chris Ohk
 
NDC17 게임 디자이너 커리어 포스트모템: 8년, 3개의 회사, 4개의 게임
NDC17 게임 디자이너 커리어 포스트모템: 8년, 3개의 회사, 4개의 게임NDC17 게임 디자이너 커리어 포스트모템: 8년, 3개의 회사, 4개의 게임
NDC17 게임 디자이너 커리어 포스트모템: 8년, 3개의 회사, 4개의 게임Imseong Kang
 
PyCon 2017 프로그래머가 이사하는 법 2 [천원경매]
PyCon 2017 프로그래머가 이사하는 법 2 [천원경매]PyCon 2017 프로그래머가 이사하는 법 2 [천원경매]
PyCon 2017 프로그래머가 이사하는 법 2 [천원경매]Sumin Byeon
 
Behavior Tree in Unreal engine 4
Behavior Tree in Unreal engine 4Behavior Tree in Unreal engine 4
Behavior Tree in Unreal engine 4Huey Park
 
NDC16 스매싱더배틀 1년간의 개발일지
NDC16 스매싱더배틀 1년간의 개발일지NDC16 스매싱더배틀 1년간의 개발일지
NDC16 스매싱더배틀 1년간의 개발일지Daehoon Han
 
Developing Success in Mobile with Unreal Engine 4 | David Stelzer
Developing Success in Mobile with Unreal Engine 4 | David StelzerDeveloping Success in Mobile with Unreal Engine 4 | David Stelzer
Developing Success in Mobile with Unreal Engine 4 | David StelzerJessica Tams
 
Deep learning as_WaveExtractor
Deep learning as_WaveExtractorDeep learning as_WaveExtractor
Deep learning as_WaveExtractor동윤 이
 
자동화된 소스 분석, 처리, 검증을 통한 소스의 불필요한 #if - #endif 제거하기 NDC2012
자동화된 소스 분석, 처리, 검증을 통한 소스의 불필요한 #if - #endif 제거하기 NDC2012자동화된 소스 분석, 처리, 검증을 통한 소스의 불필요한 #if - #endif 제거하기 NDC2012
자동화된 소스 분석, 처리, 검증을 통한 소스의 불필요한 #if - #endif 제거하기 NDC2012Esun Kim
 
8년동안 테라에서 배운 8가지 교훈
8년동안 테라에서 배운 8가지 교훈8년동안 테라에서 배운 8가지 교훈
8년동안 테라에서 배운 8가지 교훈Harns (Nak-Hyoung) Kim
 
Profiling - 실시간 대화식 프로파일러
Profiling - 실시간 대화식 프로파일러Profiling - 실시간 대화식 프로파일러
Profiling - 실시간 대화식 프로파일러Heungsub Lee
 
Custom fabric shader for unreal engine 4
Custom fabric shader for unreal engine 4Custom fabric shader for unreal engine 4
Custom fabric shader for unreal engine 4동석 김
 
레퍼런스만 알면 언리얼 엔진이 제대로 보인다
레퍼런스만 알면 언리얼 엔진이 제대로 보인다레퍼런스만 알면 언리얼 엔진이 제대로 보인다
레퍼런스만 알면 언리얼 엔진이 제대로 보인다Lee Dustin
 
버텍스 셰이더로 하는 머리카락 애니메이션
버텍스 셰이더로 하는 머리카락 애니메이션버텍스 셰이더로 하는 머리카락 애니메이션
버텍스 셰이더로 하는 머리카락 애니메이션동석 김
 
[NDC 2009] 행동 트리로 구현하는 인공지능
[NDC 2009] 행동 트리로 구현하는 인공지능[NDC 2009] 행동 트리로 구현하는 인공지능
[NDC 2009] 행동 트리로 구현하는 인공지능Yongha Kim
 

Viewers also liked (20)

[야생의 땅: 듀랑고] 지형 관리 완전 자동화 - 생생한 AWS와 Docker 체험기
[야생의 땅: 듀랑고] 지형 관리 완전 자동화 - 생생한 AWS와 Docker 체험기[야생의 땅: 듀랑고] 지형 관리 완전 자동화 - 생생한 AWS와 Docker 체험기
[야생의 땅: 듀랑고] 지형 관리 완전 자동화 - 생생한 AWS와 Docker 체험기
 
Online game server on Akka.NET (NDC2016)
Online game server on Akka.NET (NDC2016)Online game server on Akka.NET (NDC2016)
Online game server on Akka.NET (NDC2016)
 
영상 데이터의 처리와 정보의 추출
영상 데이터의 처리와 정보의 추출영상 데이터의 처리와 정보의 추출
영상 데이터의 처리와 정보의 추출
 
김병관 성공캠프 SNS팀 자원봉사 후기
김병관 성공캠프 SNS팀 자원봉사 후기김병관 성공캠프 SNS팀 자원봉사 후기
김병관 성공캠프 SNS팀 자원봉사 후기
 
게임회사 취업을 위한 현실적인 전략 3가지
게임회사 취업을 위한 현실적인 전략 3가지게임회사 취업을 위한 현실적인 전략 3가지
게임회사 취업을 위한 현실적인 전략 3가지
 
Docker
DockerDocker
Docker
 
Re:Zero부터 시작하지 않는 오픈소스 개발
Re:Zero부터 시작하지 않는 오픈소스 개발Re:Zero부터 시작하지 않는 오픈소스 개발
Re:Zero부터 시작하지 않는 오픈소스 개발
 
NDC17 게임 디자이너 커리어 포스트모템: 8년, 3개의 회사, 4개의 게임
NDC17 게임 디자이너 커리어 포스트모템: 8년, 3개의 회사, 4개의 게임NDC17 게임 디자이너 커리어 포스트모템: 8년, 3개의 회사, 4개의 게임
NDC17 게임 디자이너 커리어 포스트모템: 8년, 3개의 회사, 4개의 게임
 
PyCon 2017 프로그래머가 이사하는 법 2 [천원경매]
PyCon 2017 프로그래머가 이사하는 법 2 [천원경매]PyCon 2017 프로그래머가 이사하는 법 2 [천원경매]
PyCon 2017 프로그래머가 이사하는 법 2 [천원경매]
 
Behavior Tree in Unreal engine 4
Behavior Tree in Unreal engine 4Behavior Tree in Unreal engine 4
Behavior Tree in Unreal engine 4
 
NDC16 스매싱더배틀 1년간의 개발일지
NDC16 스매싱더배틀 1년간의 개발일지NDC16 스매싱더배틀 1년간의 개발일지
NDC16 스매싱더배틀 1년간의 개발일지
 
Developing Success in Mobile with Unreal Engine 4 | David Stelzer
Developing Success in Mobile with Unreal Engine 4 | David StelzerDeveloping Success in Mobile with Unreal Engine 4 | David Stelzer
Developing Success in Mobile with Unreal Engine 4 | David Stelzer
 
Deep learning as_WaveExtractor
Deep learning as_WaveExtractorDeep learning as_WaveExtractor
Deep learning as_WaveExtractor
 
자동화된 소스 분석, 처리, 검증을 통한 소스의 불필요한 #if - #endif 제거하기 NDC2012
자동화된 소스 분석, 처리, 검증을 통한 소스의 불필요한 #if - #endif 제거하기 NDC2012자동화된 소스 분석, 처리, 검증을 통한 소스의 불필요한 #if - #endif 제거하기 NDC2012
자동화된 소스 분석, 처리, 검증을 통한 소스의 불필요한 #if - #endif 제거하기 NDC2012
 
8년동안 테라에서 배운 8가지 교훈
8년동안 테라에서 배운 8가지 교훈8년동안 테라에서 배운 8가지 교훈
8년동안 테라에서 배운 8가지 교훈
 
Profiling - 실시간 대화식 프로파일러
Profiling - 실시간 대화식 프로파일러Profiling - 실시간 대화식 프로파일러
Profiling - 실시간 대화식 프로파일러
 
Custom fabric shader for unreal engine 4
Custom fabric shader for unreal engine 4Custom fabric shader for unreal engine 4
Custom fabric shader for unreal engine 4
 
레퍼런스만 알면 언리얼 엔진이 제대로 보인다
레퍼런스만 알면 언리얼 엔진이 제대로 보인다레퍼런스만 알면 언리얼 엔진이 제대로 보인다
레퍼런스만 알면 언리얼 엔진이 제대로 보인다
 
버텍스 셰이더로 하는 머리카락 애니메이션
버텍스 셰이더로 하는 머리카락 애니메이션버텍스 셰이더로 하는 머리카락 애니메이션
버텍스 셰이더로 하는 머리카락 애니메이션
 
[NDC 2009] 행동 트리로 구현하는 인공지능
[NDC 2009] 행동 트리로 구현하는 인공지능[NDC 2009] 행동 트리로 구현하는 인공지능
[NDC 2009] 행동 트리로 구현하는 인공지능
 

Similar to Luigi workflow engine for NYC Data Science meetup

How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowPyData
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance PythonIan Ozsvald
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overviewprevota
 
BP206 - Let's Give Your LotusScript a Tune-Up
BP206 - Let's Give Your LotusScript a Tune-Up BP206 - Let's Give Your LotusScript a Tune-Up
BP206 - Let's Give Your LotusScript a Tune-Up Craig Schumann
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 
Network Automation: Ansible 101
Network Automation: Ansible 101Network Automation: Ansible 101
Network Automation: Ansible 101APNIC
 
Introduction to TensorFlow
Introduction to TensorFlowIntroduction to TensorFlow
Introduction to TensorFlowMatthias Feys
 
Luigi workflow engine for NYC Data Science meetup

  • 14. Generalization matters: You should be able to re-run your entire pipeline with a new value for a parameter. Command line integration means you can run interactive experiments.
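The idea of re-running a whole pipeline with new parameter values can be sketched with plain argparse. This is a hypothetical stand-in for the slide's run_everything.py, not its actual code; the flag names mirror the ones shown on the next slide:

```python
import argparse

def run_pipeline(date_first, date_last, n_trees):
    # Hypothetical pipeline stages; each would read its inputs
    # and write its outputs based on the parameters.
    print(f"subsampling logs from {date_first} to {date_last}")
    print(f"training classifier with {n_trees} trees")

def main(argv=None):
    # Every previously hardcoded value becomes a command-line flag,
    # so the whole pipeline can be re-run with new parameters.
    parser = argparse.ArgumentParser()
    parser.add_argument("--date-first", required=True)
    parser.add_argument("--date-last", required=True)
    parser.add_argument("--n-trees", type=int, default=10)
    args = parser.parse_args(argv)
    run_pipeline(args.date_first, args.date_last, args.n_trees)
    return args

if __name__ == "__main__":
    main()
```

With this in place, an interactive experiment is just a different invocation, e.g. `--n-trees 200` over a different date range.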
  • 15. … now we’re getting something: $ python run_everything.py --date-first 2014-01-01 --date-last 2014-01-31 --n-trees 200
  • 16. … but it’s hardly readable: BOILERPLATE
  • 17. Boilerplate matters! We keep re-implementing the same functionality. Let’s factor it out to a framework.
  • 18. A lot of real-world data pipelines are a lot more complex. The ideal framework should make it trivial to build up big data pipelines where dependencies are non-trivial (e.g. depend on date algebra).
  • 19. So I started thinking: I wanted to build something like GNU Make.
  • 20. What is Make and why is it pretty cool? Build reusable rules. Specify what you want to build and then backtrack to find out what you need in order to get there. Reproducible runs. # the compiler: gcc for C program, define as g++ for C++ CC = gcc # compiler flags: # -g adds debugging information to the executable file # -Wall turns on most, but not all, compiler warnings CFLAGS = -g -Wall # the build target executable: TARGET = myprog all: $(TARGET) $(TARGET): $(TARGET).c $(CC) $(CFLAGS) -o $(TARGET) $(TARGET).c clean: $(RM) $(TARGET)
  • 21. We want something that works for a wide range of systems. We need to support lots of systems: “80% of data science is data munging”.
  • 22. Data processing needs to interact with lots of systems. Need to support practically any type of task: Hadoop jobs, database dumps, ingest into Cassandra, send email, SCP file somewhere else.
  • 23. My first attempt: builder. Use XML config to build up the dependency graph!
  • 24. Don’t use XML … seriously, don’t use it.
  • 25. Dependencies need code. Pipelines deployed in production often have nontrivial ways they define dependencies between tasks … and many other cases. Recursion (and date algebra): BloomFilter(date=2014-05-01) depends on BloomFilter(date=2014-04-30) and Log(date=2014-04-30), which in turn depends on Log(date=2014-04-29), and so on recursively. Date algebra: Toplist(date_interval=2014-01) depends on Log(date=2014-01-01), Log(date=2014-01-02), ..., Log(date=2014-01-31). Enum types: IdToIdMap(from_type=artist, to_type=track) depends on IdMap(type=artist) and IdMap(type=track).
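The dependency patterns above are exactly the kind of thing that is awkward in config files but trivial in ordinary code. A stdlib-only sketch (the task names are borrowed from the slide; the functions themselves are illustrative, not Luigi's API):

```python
from datetime import date, timedelta

def toplist_deps(year, month):
    """Date algebra: a monthly Toplist depends on every daily Log in that month."""
    d = date(year, month, 1)
    deps = []
    while d.month == month:
        deps.append(f"Log(date={d.isoformat()})")
        d += timedelta(days=1)
    return deps

def bloomfilter_deps(day, first_day):
    """Recursion: each day's BloomFilter depends on the previous day's
    BloomFilter plus the previous day's Log, back until some first day."""
    if day == first_day:
        return [f"Log(date={day.isoformat()})"]
    prev = day - timedelta(days=1)
    return [f"BloomFilter(date={prev.isoformat()})",
            f"Log(date={prev.isoformat()})"]
```

Expressing this in XML (or any DSL without loops and arithmetic) would mean enumerating every edge by hand.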
  • 26. Don’t ever invent your own DSL. “It’s better to write domain specific code in a general purpose language, than writing general purpose code in a domain specific language” – unknown author. Oozie is a good example of how messy it gets.
  • 27. 2009: builder2. Solved all the things I just mentioned: dependency graph specified in Python, support for arbitrary tasks, error emails, support for lots of common data plumbing stuff (Hadoop jobs, Postgres, etc), lots of other things :)
  • 31. What were the good bits? Build up dependency graphs and visualize them. Non-event to go from development to deployment. Built-in HDFS integration, but decoupled from the core library. What went wrong? Still too much boilerplate. Pretty bad command line integration.
  • 33. Introducing Luigi: a workflow engine in Python.
  • 34. Luigi – history at Spotify. Late 2011: Elias Freider and I build it and release it into the wild at Spotify; people start using it. “The Python era.” Late 2012: open source it. Early 2013: first known company outside of Spotify: Foursquare.
  • 35. Luigi is your friendly plumber: simple dependency definitions, emphasis on Hadoop/HDFS integration, atomic file operations, data flow visualization, command line integration.
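The "atomic file operations" point ties back to the resumability argument earlier: a crashed task must never leave a half-written output that looks complete. This is not Luigi's actual implementation, just a sketch of the standard write-to-temp-then-rename pattern that idea relies on:

```python
import os
import tempfile

def write_atomically(path, data):
    # Write to a temp file in the same directory, then rename into place.
    # os.replace is atomic on POSIX filesystems, so a resuming workflow
    # either sees no output file at all, or a complete one. It never sees
    # a partial file left behind by a crash.
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Because "output exists" doubles as "task is done", atomicity is what makes re-running only the missing pieces safe.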
  • 37. Luigi Task – breakdown: the business logic of the task, where it writes output, what other tasks it depends on, parameters for this task.
  • 38. Easy command line integration So easy that you want to use Luigi for it 38 $ python my_task.py MyTask --param 43 INFO: Scheduled MyTask(param=43) INFO: Scheduled SomeOtherTask(param=43) INFO: Done scheduling tasks INFO: [pid 20235] Running SomeOtherTask(param=43) INFO: [pid 20235] Done SomeOtherTask(param=43) INFO: [pid 20235] Running MyTask(param=43) INFO: [pid 20235] Done MyTask(param=43) INFO: Done INFO: There are no more tasks to run at this time INFO: Worker was stopped. Shutting down Keep-Alive thread $ cat /tmp/foo/bar-43.txt hello, world $
  • 39. Let’s go back to the example: logs for days d … d+k-1 → subsample and extract features → subsampled features → train classifier → classifier → look at the output.
  • 42. $ python demo.py SubsampleFeatures --date-interval 2013-11-01 DEBUG: Checking if SubsampleFeatures(test=False, date_interval=2013-11-01) is complete INFO: Scheduled SubsampleFeatures(test=False, date_interval=2013-11-01) DEBUG: Checking if EndSongCleaned(date_interval=2013-11-01) is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 24345] Running SubsampleFeatures(test=False, date_interval=2013-11-01) ... INFO: 13/11/08 02:15:11 INFO streaming.StreamJob: Tracking URL: http://lon2- hadoopmaster-a1.c.lon.spotify.net:50030/jobdetails.jsp?jobid=job_201310180017_157113 INFO: 13/11/08 02:15:12 INFO streaming.StreamJob: map 0% reduce 0% INFO: 13/11/08 02:15:27 INFO streaming.StreamJob: map 2% reduce 0% INFO: 13/11/08 02:15:30 INFO streaming.StreamJob: map 7% reduce 0% ... INFO: 13/11/08 02:16:10 INFO streaming.StreamJob: map 100% reduce 87% INFO: 13/11/08 02:16:13 INFO streaming.StreamJob: map 100% reduce 100% INFO: [pid 24345] Done SubsampleFeatures(test=False, date_interval=2013-11-01) DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time INFO: Worker was stopped. Shutting down Keep-Alive thread $ Run on the command line 42
  • 43. Step 2: Train a machine learning model 43
  • 44. Let’s run everything on the command line from scratch $ python luigi_workflow_full.py InspectModel --date-interval 2011-01-03 DEBUG: Checking if InspectModel(date_interval=2011-01-03, n_trees=10) is complete INFO: Scheduled InspectModel(date_interval=2011-01-03, n_trees=10) (PENDING) INFO: Scheduled TrainClassifier(date_interval=2011-01-03, n_trees=10) (PENDING) INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-03) (PENDING) INFO: Scheduled EndSongCleaned(date=2011-01-03) (DONE) INFO: Done scheduling tasks INFO: Running Worker with 1 processes INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running SubsampleFeatures(test=False, date_interval=2011-01-03) INFO: 14/12/17 02:07:20 INFO mapreduce.Job: Running job: job_1418160358293_86477 INFO: 14/12/17 02:07:31 INFO mapreduce.Job: Job job_1418160358293_86477 running in uber mode : false INFO: 14/12/17 02:07:31 INFO mapreduce.Job: map 0% reduce 0% INFO: 14/12/17 02:08:34 INFO mapreduce.Job: map 2% reduce 0% INFO: 14/12/17 02:08:36 INFO mapreduce.Job: map 3% reduce 0% INFO: 14/12/17 02:08:38 INFO mapreduce.Job: map 5% reduce 0% INFO: 14/12/17 02:08:39 INFO mapreduce.Job: map 10% reduce 0% INFO: 14/12/17 02:08:40 INFO mapreduce.Job: map 17% reduce 0% INFO: 14/12/17 02:16:30 INFO mapreduce.Job: map 100% reduce 100% INFO: 14/12/17 02:16:32 INFO mapreduce.Job: Job job_1418160358293_86477 completed successfully INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done SubsampleFeatures(test=False, date_interval=2011-01-03) INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running TrainClassifier(date_interval=2011-01-03, n_trees=10) INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done 
TrainClassifier(date_interval=2011-01-03, n_trees=10) INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running InspectModel(date_interval=2011-01-03, n_trees=10) time 0.1335% ms_played 96.9351% shuffle 0.0728% local_track 0.0000% bitrate 2.8586% INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done InspectModel(date_interval=2011-01-03, n_trees=10) 44
  • 45. Let’s make it more complicated – cross validation: logs for days d … d+k-1 are subsampled into features that train the classifier, while logs for a second range e … e+k-1 are subsampled into features used for cross validation.
  • 46. Cross validation implementation $ python xv.py CrossValidation --date-interval-a 2012-11-01 --date-interval-b 2012-11-02 46
  • 47. Run on the command line $ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02 INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) (PENDING) INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=10) (DONE) INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE) INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE) INFO: Done scheduling tasks INFO: Running Worker with 1 processes INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) 2011-01-01 (train) AUC: 0.9040 2011-01-02 ( test) AUC: 0.9040 INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) INFO: Done INFO: There are no more tasks to run at this time INFO: Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) was stopped. Shutting down Keep-Alive thread 47 … no overfitting!
  • 48. More trees! $ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02 --n-trees 100 INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) (PENDING) INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=100) (PENDING) INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE) INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE) INFO: Done scheduling tasks INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) running TrainClassifier(date_interval=2011-01-01, n_trees=100) INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) done TrainClassifier(date_interval=2011-01-01, n_trees=100) INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) 2011-01-01 (train) AUC: 0.9074 2011-01-02 ( test) AUC: 0.8896 INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) INFO: Done INFO: There are no more tasks to run at this time INFO: Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) was stopped. Shutting down Keep-Alive thread 48 … overfitting!
  • 50. The nice things about Luigi 50
  • 51. Minimal boilerplate: overhead for a task is about 5 lines (class def + requires + output + run). Easy command line integration.
  • 52. Everything is a directed acyclic graph, Makefile style: tasks specify what they are dependent on, not what other things depend on them.
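The Makefile-style resolution described above (start from the target, backtrack through dependencies, run only what is missing) can be sketched in a few lines of plain Python. `resolve`, `complete`, and `requires` here are illustrative stand-ins, not Luigi's actual API:

```python
def resolve(task, complete, requires, plan=None):
    """Backtrack from `task`: schedule its incomplete dependencies first
    (post-order), then the task itself. `complete(t)` reports whether a
    task's output already exists; `requires(t)` lists its dependencies."""
    if plan is None:
        plan = []
    if complete(task) or task in plan:
        return plan          # output exists or already scheduled: skip
    for dep in requires(task):
        resolve(dep, complete, requires, plan)
    plan.append(task)        # run only after all dependencies
    return plan

# Example: InspectModel <- TrainClassifier <- SubsampleFeatures,
# where SubsampleFeatures already has output on disk.
deps = {"InspectModel": ["TrainClassifier"],
        "TrainClassifier": ["SubsampleFeatures"],
        "SubsampleFeatures": []}
done = {"SubsampleFeatures"}
plan = resolve("InspectModel", lambda t: t in done, lambda t: deps[t])
# plan == ["TrainClassifier", "InspectModel"]
```

Because completeness is read from outputs, this is also what makes resuming after a crash automatic: the already-finished tasks simply drop out of the plan.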
  • 54. Dive into any task 54
  • 55. Run with multiple workers $ python dataflow.py --workers 3 AggregateArtists --date-interval 2013-W08 55
  • 57. Process synchronization: the Luigi central planner prevents the same task from being run simultaneously, but all execution is being done by the workers (e.g. worker 1 running tasks A, B, C while worker 2 runs A, C, F).
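That coordination can be illustrated with a toy stdlib sketch (not the real scheduler protocol): workers ask the planner for work, and a lock ensures no task is ever claimed by two workers at once, while the execution itself stays in the workers:

```python
import threading

class CentralPlanner:
    """Toy model of the central planner: it only tracks which tasks are
    pending and which are currently claimed; it never runs anything."""

    def __init__(self, pending):
        self._lock = threading.Lock()
        self._pending = set(pending)
        self._running = set()

    def get_work(self, runnable):
        # A worker asks for work, offering the tasks it is able to run.
        with self._lock:
            for task in runnable:
                if task in self._pending and task not in self._running:
                    self._pending.discard(task)
                    self._running.add(task)
                    return task
        return None  # nothing claimable right now

    def task_done(self, task):
        with self._lock:
            self._running.discard(task)
```

Two workers offering overlapping task lists will thus always be handed disjoint tasks, which is the property the slide's diagram is getting at.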
  • 58. Luigi is a way of coordinating lots of different tasks … but you still have to figure out how to implement and scale them! 58
  • 59. Do general-purpose stuff. Don’t focus on a specific platform … but come “batteries included”.
  • 60. Built-in support for HDFS & Hadoop. At Spotify we’re abandoning Python for batch processing tasks, replacing it with Crunch and Scalding. Luigi is a great glue! Our team, the Lambda team: 15 engineers, running 1,000+ Hadoop jobs daily, with 400+ Luigi Tasks in production. Our recommendation pipeline is a good example: Python M/R jobs, ML algos in C++, Java M/R jobs, Scalding, ML stuff in Python using scikit-learn, import stuff into Cassandra, import stuff into Postgres, send email reports, etc.
  • 61. The one time we accidentally deleted 50TB of data: we didn’t have to write a single line of code to fix it – Luigi rescheduled 1,000s of tasks and ran them for 3 days.
  • 62. Some things are still not perfect 62
  • 63. The missing parts: execution is tied to scheduling – you can’t schedule something to run “in the cloud”. Visualization could be a lot more useful. There’s no built-in scheduling – you have to rely on crontab. These are all things we have in the backlog.
  • 64. What are some ideas for the future?
  • 65. Separate scheduling and execution: a Luigi central scheduler dispatching to a pool of slaves.
  • 67. Luigi implements some core beliefs: the #1 focus is on removing all boilerplate; the #2 focus is to be as general as possible; the #3 focus is to make it easy to go from test to production.