Luigi PyData presentation

Luigi
Batch data processing in Python

Elias Freider
freider@spotify.com

Background
Why did we build Luigi?

We have a lot of data
Billions of log messages (several TBs) every day
Usage and backend stats, debug information
What we want to do
A/B testing
Music recommendations
Monthly/daily/hourly reporting
Business metric dashboards
We experiment a lot – need quick development cycles

How to do it?

Hadoop
Data storage
Clean, filter, join and aggregate data
Postgres
Semi-aggregated data for dashboards
Cassandra
Time series data

Defining a data flow
A bunch of data processing tasks with inter-dependencies

Example: Artist Toplist
Input is a list of Timestamp, Track, Artist tuples

Streams -> Artist Aggregation -> Top 10 -> Database

Example: Artist Toplist
Naive (non-luigi) approach
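
The code for this slide is not in the text, so the following is only a sketch of what a naive script could look like. It assumes tab-separated input lines of timestamp, track, artist; the function names and the file names artists.tsv and top10.tsv are made up for illustration.

```python
import os
from collections import Counter


def aggregate_artists(streams_path, out_path):
    """Count streams per artist from tab-separated (timestamp, track, artist) lines."""
    counts = Counter()
    with open(streams_path) as f:
        for line in f:
            _timestamp, _track, artist = line.rstrip("\n").split("\t")
            counts[artist] += 1
    # Written in place: a crash while writing leaves a truncated file behind.
    with open(out_path, "w") as out:
        for artist, n in counts.items():
            out.write(f"{artist}\t{n}\n")


def top10(agg_path, out_path):
    """Keep the ten artists with the most streams."""
    with open(agg_path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    rows.sort(key=lambda row: int(row[1]), reverse=True)
    with open(out_path, "w") as out:
        for artist, n in rows[:10]:
            out.write(f"{artist}\t{n}\n")


def run_all(streams_path, workdir):
    """Run every step, every time, in a hard-coded order."""
    agg_path = os.path.join(workdir, "artists.tsv")
    top_path = os.path.join(workdir, "top10.tsv")
    aggregate_artists(streams_path, agg_path)
    top10(agg_path, top_path)
    return top_path
```

A script like this has no notion of failed steps, finished steps, or parameters, which is what the next slides are about.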

Errors will occur
If a failed step leaves behind broken data, we want to clean that up

Avoid duplicate work
The second step fails, you fix it, then you want to resume
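
One simple way to resume, sketched here under the assumption that each step writes exactly one output file: treat a step as done when its output exists. The helper name run_if_missing is hypothetical and not part of Luigi.

```python
import os


def run_if_missing(output_path, step):
    """Run step(output_path) only if output_path does not exist yet.

    Returns True if the step ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False
    step(output_path)
    return True
```

The check is only trustworthy if a failed step never leaves a partial file at output_path.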

Parametrization
To use data flows as command line tools

Repeat!
You want to run the dataflow for a set of similar inputs

Introducing Luigi
A Python framework for data flow definition and execution
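
To make "data flow definition and execution" concrete, here is a toy resolver in plain Python. It is not Luigi's implementation: the Task class, the build function and the use of class names as task identities are all simplifications made up for this sketch. The task names are borrowed from the example in this talk.

```python
class Task:
    """Toy task: subclasses list their upstream tasks in requires()."""

    def requires(self):
        return []

    def run(self):
        pass


class Streams(Task):
    """Stands in for input data that already exists."""


class AggregateArtists(Task):
    def requires(self):
        return [Streams()]


class Top10Artists(Task):
    def requires(self):
        return [AggregateArtists()]


def build(task, done, order):
    """Run dependencies depth-first, skipping anything already in `done`."""
    name = type(task).__name__
    if name in done:
        return
    for dep in task.requires():
        build(dep, done, order)
    task.run()
    done.add(name)
    order.append(name)
```

Calling build(Top10Artists(), {"Streams"}, []) runs AggregateArtists before Top10Artists and leaves Streams alone because it is already marked as done.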

Main features

Dead-simple dependency definitions (think GNU make)
Emphasis on Hadoop/HDFS integration
Atomic file operations
Powerful task templating using OOP
Data flow visualization
Command line integration
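
The "atomic file operations" item can be illustrated on a local file system: write to a temporary file, then rename it into place only on success. This is a generic sketch of the idea, not Luigi's code; atomic_write is a hypothetical helper name.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(path):
    """Yield a file object; `path` appears only if the block finishes cleanly."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            yield f
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems
    except BaseException:
        # Clean up the partial temp file and let the error propagate.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern a reader either sees the complete output or no output at all, so "output exists" can safely mean "task is done".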

Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running AggregateArtists()
INFO: [pid 74375] Done AggregateArtists()
INFO: Done
INFO: There are no more tasks to run at this time

Task templates and targets
Reduces code repetition by using Luigi’s object-oriented data model

Luigi comes with pre-implemented Tasks for...

Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files
Running Hive and (soon) Pig queries
Inserting data sets into Postgres

Writing a new one is as easy as defining an interface and
implementing run()
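
A rough illustration of the template idea in plain Python: the base class owns run() and the bookkeeping, and a subclass fills in one small method. LineTask and process are hypothetical names for this sketch and do not exist in Luigi.

```python
class LineTask:
    """Hypothetical template: run() handles iteration, subclasses supply process()."""

    def __init__(self, lines):
        self.lines = lines

    def process(self, line):
        raise NotImplementedError("subclasses implement process()")

    def run(self):
        # Shared behaviour lives here once: strip whitespace, skip blank lines.
        cleaned = (line.strip() for line in self.lines)
        return [self.process(line) for line in cleaned if line]


class UppercaseArtists(LineTask):
    """A concrete task only has to say what happens to one line."""

    def process(self, line):
        return line.upper()
```

For example, UppercaseArtists(["abba\n", "", " kent "]).run() gives ["ABBA", "KENT"].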

Hadoop MapReduce
Built-in Hadoop Streaming Python framework

Features

Very slim interface
Fetches error logs from Hadoop cluster and displays them to the user
Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map-side joins
Easy to send along Python modules that might not be installed on the cluster
Basic support for secondary sort
Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
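
The shape of a streaming job can be sketched without a cluster. The mapper and reducer below are plain generator functions, and run_local is a made-up stand-in for Hadoop's shuffle and sort; none of this is Luigi's actual job interface. Input lines are assumed to be tab-separated timestamp, track, artist.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (artist, 1) for every stream record."""
    _timestamp, _track, artist = line.split("\t")
    yield artist, 1


def reducer(key, values):
    """Sum the counts for one artist."""
    yield key, sum(values)


def run_local(lines):
    """Simulate map, shuffle/sort and reduce in a single process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle and sort
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _key, value in group)))
    return results
```

Running it on three records for artists A, B, A returns [("A", 2), ("B", 1)].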

Hadoop MapReduce
Built-in Hadoop Streaming Python framework

Attach Python modules
So you can use local modules remotely on the cluster

Local state is transferred
Set up local variables for map-side joins etc.
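
A sketch of what a map-side join with local state can look like. It assumes the job object, including its attributes, is shipped to the cluster in some serialized form; the class name JoinArtistNames, the artist_names dictionary and the input layout are invented for this example.

```python
class JoinArtistNames:
    """Hypothetical job: joins stream records against a small lookup table."""

    def __init__(self, artist_names):
        # Small table built locally before the job starts (artist id -> name).
        self.artist_names = artist_names

    def mapper(self, line):
        """Replace the artist id with its name; drop records with unknown ids."""
        _timestamp, _track, artist_id = line.split("\t")
        name = self.artist_names.get(artist_id)
        if name is not None:
            yield name, 1
```

Because the lookup happens inside the mapper, the small table never has to go through a reduce-side join.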

Completing the top list
Top 10 artists - Wrapped arbitrary Python code
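
The "arbitrary Python code" for this step could be as small as the sketch below. It assumes the aggregated counts are already available as a dictionary of artist to stream count; top_n is a hypothetical helper name.

```python
import heapq


def top_n(counts, n=10):
    """Return the n (artist, streams) pairs with the highest stream counts."""
    return heapq.nlargest(n, counts.items(), key=lambda pair: pair[1])
```

For example, top_n({"A": 3, "B": 5, "C": 1}, n=2) returns [("B", 5), ("A", 3)].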

Database support
Basic functionality for exporting to Postgres. Cassandra support is in the works
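
A database export has no output file to check, so a re-run needs some other record that the export already happened. One common approach, sketched here with sqlite3 standing in for Postgres, is a marker table written in the same transaction as the data. The table names top_artists and export_marker and the function export_once are assumptions for this sketch, not Luigi's schema.

```python
import sqlite3


def export_once(conn, update_id, rows):
    """Insert rows unless an export with this update_id was already recorded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS top_artists (artist TEXT, streams INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS export_marker (update_id TEXT PRIMARY KEY)"
    )
    already_done = conn.execute(
        "SELECT 1 FROM export_marker WHERE update_id = ?", (update_id,)
    ).fetchone()
    if already_done:
        return False
    # Data and marker are committed together, or not at all.
    with conn:
        conn.executemany("INSERT INTO top_artists VALUES (?, ?)", rows)
        conn.execute("INSERT INTO export_marker VALUES (?)", (update_id,))
    return True
```

Calling export_once twice with the same update_id inserts the rows only once.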

Running it all...
DEBUG: Checking if ArtistToplistToDatabase() is complete
INFO: Scheduled ArtistToplistToDatabase()
DEBUG: Checking if Top10Artists() is complete
INFO: Scheduled Top10Artists()
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
INFO: [pid 74811] Running AggregateArtists()
INFO: [pid 74811] Done AggregateArtists()
INFO: [pid 74811] Running Top10Artists()
INFO: [pid 74811] Done Top10Artists()
INFO: [pid 74811] Running ArtistToplistToDatabase()
INFO: Done writing, importing at 2013-03-13 15:41:09.407138
INFO: [pid 74811] Done ArtistToplistToDatabase()
INFO: Done
INFO: There are no more tasks to run at this time

Data flow visualization
Light-weight scheduler daemon with a web interface

The results
Imagine how cool this would be with real data...

Task Parameters
Class variables with some magic

Tasks have implicit __init__

Generates command line interface with typing and documentation
$ python dataflow.py AggregateArtists --date 2013-03-05
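
How declared parameters can turn into an implicit __init__ and a typed command line can be sketched with argparse. This is not Luigi's mechanism: the params dictionary, make_parser and task_from_args are names invented for the example.

```python
import argparse
import datetime


def parse_date(text):
    """Convert a YYYY-MM-DD string into a datetime.date."""
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


class Task:
    # name -> (converter, help text); subclasses override this declaration.
    params = {}

    def __init__(self, **kwargs):
        # The "implicit __init__": every declared parameter becomes an attribute.
        for name in self.params:
            setattr(self, name, kwargs[name])


class AggregateArtists(Task):
    params = {"date": (parse_date, "day to aggregate, as YYYY-MM-DD")}


def make_parser(task_cls):
    """Build a command line parser from the task's parameter declarations."""
    parser = argparse.ArgumentParser(prog="dataflow.py")
    for name, (convert, help_text) in task_cls.params.items():
        parser.add_argument(f"--{name}", type=convert, help=help_text, required=True)
    return parser


def task_from_args(task_cls, argv):
    """Parse argv and construct the task with typed parameter values."""
    args = make_parser(task_cls).parse_args(argv)
    return task_cls(**vars(args))
```

With this sketch, task_from_args(AggregateArtists, ["--date", "2013-03-05"]).date is a datetime.date, not a string.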

Task Parameters
Combined usage example

Task Parameters
Makes it really easy to create aggregates over time

Multiple workers
Basic multi-processing

$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
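
The value 2013-W11 in the command above reads as an ISO 8601 week. Assuming that interpretation, expanding it into the seven daily dates a weekly aggregate would depend on can be sketched as follows (dates_in_iso_week is a hypothetical helper; date.fromisocalendar needs Python 3.8 or later).

```python
import datetime


def dates_in_iso_week(spec):
    """Expand a string like '2013-W11' into the seven dates of that ISO week."""
    year_text, week_text = spec.split("-W")
    monday = datetime.date.fromisocalendar(int(year_text), int(week_text), 1)
    return [monday + datetime.timedelta(days=offset) for offset in range(7)]
```

For 2013-W11 this yields 2013-03-11 through 2013-03-17, one daily task per date.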

Large data flows
(Screenshot from web interface)

[Dependency graph: one node per task instance and parameter set, covering dates 2013-03-06 to 2013-03-17. Tasks shown: EndSongCleaned, MasterMetadata, AggregateTracks, UserLocationDay, UserLocationPeriod, JoinArtistGids, ArtistFollows, AggregateUserMatrices, AggregateByArtists, AccumulateUserMatrices, AccumulateByArtists, PlaylistVectors, RelatedArtistsTC, UserRecs, IngestUserRecs.]

Really large...
Visualization needs some work

Error notifications
Great for automated execution
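
The general idea, sketched without assuming anything about how Luigi delivers its notifications: wrap the task's run step, report the traceback through some channel, and still let the failure propagate. The function run_with_notification and the notify callback are hypothetical.

```python
import traceback


def run_with_notification(task_name, run, notify):
    """Call run(); on failure pass a subject and traceback to notify(), then re-raise."""
    try:
        run()
    except Exception:
        notify(f"Task {task_name} failed", traceback.format_exc())
        raise
```

In an unattended (e.g. cron) setup, notify would typically send an email; in a test it can simply append to a list.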

Process Synchronization
Prevents two identical tasks from running simultaneously
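
Luigi does this through a central scheduler daemon. As a much simpler stand-in for the same guarantee on a single machine, the sketch below uses an exclusive lock file per task identity; TaskLock is a hypothetical class, and this is a different technique from the one the talk describes.

```python
import os


class TaskLock:
    """Lock file keyed on a task id; only one holder at a time."""

    def __init__(self, lock_dir, task_id):
        self.path = os.path.join(lock_dir, task_id + ".lock")
        self.held = False

    def acquire(self):
        """Return True if we got the lock, False if another holder has it."""
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        os.close(fd)
        self.held = True
        return True

    def release(self):
        """Remove the lock file if this object holds the lock."""
        if self.held:
            os.remove(self.path)
            self.held = False
```

A second process asking for the same task id gets False from acquire() until the first one calls release().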

luigid
Simple task synchronization

[Diagram: Data flow 1 and Data flow 2 both depend on a common dependency task]

What we use Luigi for
Not just another Hadoop streaming framework!

Hadoop Streaming
Java Hadoop MapReduce
Hive
Pig
Local (non-hadoop) data processing
Import/Export data to/from Postgres
Insert data into Cassandra
scp/rsync/ftp data files and reports
Dump and load databases
Others using it with Scala MapReduce and MRJob as well

Luigi is open source

http://github.com/spotify/luigi
Originated at Spotify
Based on many years of experience with data processing
Recent contributions by Foursquare and Bitly
Open source since September 2012

Thank you!
For more information feel free to reach out at
http://github.com/spotify/luigi

Oh, and we’re hiring – http://spotify.com/jobs

Elias Freider
freider@spotify.com
