Luigi
Batch data processing in
Python

Elias Freider
freider@spotify.com
Background
Why did we build Luigi?


We have a lot of data
 Billions of log messages (several TBs) every day
 Usage and backend stats, debug information
 What we want to do
   AB-testing
   Music recommendations
   Monthly/daily/hourly reporting
   Business metric dashboards
 We experiment a lot – need quick development cycles
How to do it?

Hadoop
  Data storage
  Clean, filter, join and aggregate data
Postgres
  Semi-aggregated data for dashboards
Cassandra
  Time series data
Defining a data flow
A bunch of data processing tasks with inter-dependencies
Example: Artist Toplist
Input is a list of Timestamp, Track, Artist tuples




                                 Artist
      Streams
                              Aggregation




                                  Top 10             Database
Example: Artist Toplist
Naive (non-luigi) approach
Errors will occur
If a failed step leaves behind broken data, we want to clean that up
Avoid duplicate work
The second step fails, you fix it, then you want to resume
Parametrization
To use data flows as command line tools
Repeat!
You want to run the dataflow for a set of similar inputs
Introducing Luigi
A Python framework for data flow definition and execution




Main features

  Dead-simple dependency definitions (think GNU make)
  Emphasis on Hadoop/HDFS integration
  Atomic file operations
  Powerful task templating using OOP
  Data flow visualization
  Command line integration
Luigi - Aggregate Artists
Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running   AggregateArtists()
INFO: [pid 74375] Done      AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Task templates and targets
Reduces code-repetition by utilizing Luigi’s object oriented data model




Luigi comes with pre-implemented Tasks for...

  Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files
  Running Hive and (soon) Pig queries
  Inserting data sets into Postgres


Writing new ones are as easy as defining an interface and
implementing run()
Hadoop MapReduce
Built-in Hadoop Streaming Python framework




Features

 Very slim interface
 Fetches error logs from Hadoop cluster and displays them to the user
 Class instance variables can be referenced in MapReduce code, which makes it
 easy to supply extra data in dictionaries etc. for map side joins
 Easy to send along Python modules that might not be installed on the cluster
 Basic support for secondary sort
 Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
Hadoop MapReduce
Built-in Hadoop Streaming Python framework
Attach Python modules
So you can use local modules remotely on the cluster
Local state is transferred
Set up local variables for map-side joins etc.
Completing the top list
Top 10 artists - Wrapped arbitrary Python code
Database support
Basic functionality for exporting to Postgres. Cassandra support is in the works
Running it all...
   DEBUG: Checking if ArtistToplistToDatabase() is complete
   INFO: Scheduled ArtistToplistToDatabase()
   DEBUG: Checking if Top10Artists() is complete
   INFO: Scheduled Top10Artists()
   DEBUG: Checking if AggregateArtists() is complete
   INFO: Scheduled AggregateArtists()
   DEBUG: Checking if Streams() is complete
   INFO: Done scheduling tasks
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 3
   INFO: [pid 74811] Running   AggregateArtists()
   INFO: [pid 74811] Done      AggregateArtists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 2
   INFO: [pid 74811] Running   Top10Artists()
   INFO: [pid 74811] Done      Top10Artists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 1
   INFO: [pid 74811] Running   ArtistToplistToDatabase()
   INFO: Done writing, importing at 2013-03-13 15:41:09.407138
   INFO: [pid 74811] Done      ArtistToplistToDatabase()
   DEBUG: Asking scheduler for work...
   INFO: Done
   INFO: There are no more tasks to run at this time
Data flow visualization
Light-weight scheduler daemon with a web interface
The results
Imagine how cool this would be with real data...
Task Parameters
Class variables with some magic




Tasks have implicit __init__




   Generates command line interface with typing and documentation
   $ python dataflow.py AggregateArtists --date 2013-03-05
Task Parameters
Combined usage example
Task Parameters
Makes it really easy to create aggregates over time
Multiple workers
 Basic multi-processing



$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
Large data flows
            (Screenshot from web interface)




                                                                                                                                                                                                                                                                                                                  EndSongCleaned                                                                       EndSongCleaned
                                                                                                                                                                                                                                                                                                                  (date=2013-03-16)                                                                   (date=2013-03-15)




                                                                                                                                                                                                AggregateTracks                                                                                                             AggregateTracks                                                 UserLocationDay                                  UserLocationDay              UserLocationDay        UserLocationDay         UserLocationDay           UserLocationDay               UserLocationDay      UserLocationDay     UserLocationDay     UserLocationDay     UserLocationDay
                                                                                                                                                                                                   (test=False,                MasterMetadata                                                                                    (test=False,                                                    (test=False,                                   (test=False,                (test=False,           (test=False,            (test=False,               (test=False,                    (test=False,      (test=False,        (test=False,        (test=False,        (test=False,
                                                                                                                                                                                                date=2013-03-16,              (date=2013-03-15)                                                                            date=2013-03-15,                                                 date=2013-03-16,                                 date=2013-03-14,            date=2013-03-13,        date=2013-03-08,        date=2013-03-12,          date=2013-03-07,              date=2013-03-11,     date=2013-03-10,    date=2013-03-09,    date=2013-03-15,    date=2013-03-06,
                                                                                                                                                                                                test_users=False)                                                                                                          test_users=False)                                                test_users=False)                                test_users=False)           test_users=False)       test_users=False)       test_users=False)         test_users=False)             test_users=False)    test_users=False)   test_users=False)   test_users=False)   test_users=False)




                                                                                                                                                                                              JoinArtistGids                                 JoinArtistGids                                                                                                                                                                                                                                                                                       UserLocationPeriod             UserLocationPeriod
                                                                                                                                                                                               (test=False,                                     (test=False,                                                                                                                                                                                                                                                                                         (test=False,                     (test=False,
                                                                                                                                                                                            date=2013-03-16,                               date=2013-03-15,                                                                                                                                                                                                                                                                                           period=10,                        period=10,
                                                                                                                                                                                               nshards=12,                                   nshards=12,                                                                                                                                                                                                                                                                                          date=2013-03-17,               date=2013-03-16,
                                                                                                                                                                                            test_users=False)                              test_users=False)                                                                                                                                                                                                                                                                                      test_users=False)              test_users=False)




  AggregateUserMatrices       AggregateUserMatrices       AggregateUserMatrices                    AggregateUserMatrices               AggregateUserMatrices      AggregateUserMatrices
                                                                                                                                                                                            AggregateByArtists      AggregateByArtists                                        AggregateByArtists               AggregateByArtists               AggregateByArtists             AggregateByArtists                AggregateByArtists            AggregateByArtists           AggregateByArtists      AggregateByArtists       AggregateByArtists
       (test=False,                (test=False,                (test=False,                              (test=False,                        (test=False,              (test=False,
                                                                                                                                                                                               (test=False,            (test=False,                ArtistFollows                 (test=False,                     (test=False,                      (test=False,                  (test=False,                        (test=False,                (test=False,                  (test=False,            (test=False,            (test=False,
    date=2013-03-16,            date=2013-03-14,            date=2013-03-13,                          date=2013-03-15,                   date=2013-03-12,           date=2013-03-11,
                                                                                                                                                                                            date=2013-03-16,        date=2013-03-13,            (date=2013-03-14)             date=2013-03-14,                 date=2013-03-09,                 date=2013-03-08,               date=2013-03-15,                  date=2013-03-12,              date=2013-03-11,              date=2013-03-07,        date=2013-03-10,         date=2013-03-06,
index_version=1362433077,   index_version=1362433077,   index_version=1362433077,               index_version=1362433077,           index_version=1362433077,   index_version=1362433077,
                                                                                                                                                                                            test_users=False)       test_users=False)                                         test_users=False)                test_users=False)                test_users=False)              test_users=False)                 test_users=False)             test_users=False)             test_users=False)       test_users=False)       test_users=False)
    test_users=False)           test_users=False)           test_users=False)                         test_users=False)                  test_users=False)          test_users=False)




                                                                              AccumulateUserMatrices                         AccumulateUserMatrices                                                                                                     AccumulateByArtists                                                                                                                                                                                         AccumulateByArtists
                                                                                    (test=False,                                   (test=False,                                                                                                                (test=False,                                                                                                                                                                                             (test=False,
                                                                                                                                                                                                                                                                                            JoinArtistGids                                                                JoinArtistGids                                              JoinArtistGids                                                                                                                 JoinArtistGids
                                                                                 date=2013-03-16,                               date=2013-03-15,                                                                                                         date=2013-03-16,                                                          PlaylistVectors                                                                                                                   date=2013-03-15,
                                                                                                                                                                                                                                                                                                (test=False,                                                               (test=False,                                                (test=False,                                                                                                                   (test=False,
                                                                          index_version=1362433077,                         index_version=1362433077,                                                                                                    test_users=False,                                                           (test=False,                                                    RelatedArtistsTC                                                test_users=False,
                                                                                                                                                                                                                                                                                          date=2013-03-14,                                                              date=2013-03-13,                                             date=2013-03-12,                                                                                                               date=2013-03-11,
                                                                                 test_users=False,                              test_users=False,                                                                                                          max_days=90,                                                           date=2013-03-14,                                                              ()                                                    max_days=90,
                                                                                                                                                                                                                                                                                            nshards=12,                                                                   nshards=12,                                                  nshards=12,                                                                                                                    nshards=12,
                                                                                 decay_factor=0.99,                             decay_factor=0.99,                                                                                                      decay_factor=0.99,                                                index_version=1362433077)                                                                                                                 decay_factor=0.99,
                                                                                                                                                                                                                                                                                         test_users=False)                                                              test_users=False)                                            test_users=False)                                                                                                              test_users=False)
                                                                                      days=5,                                        days=5,                                                                                                                    days=10,                                                                                                                                                                                                 days=10,
                                                                              build_from_scratch=True)                       build_from_scratch=True)                                                                                                build_from_scratch=True)                                                                                                                                                                                    build_from_scratch=True)




                                                                                                                                                                                                                                                                                                                            UserRecs                                                                                 UserRecs
                                                                                                                                                                                                                                                                                                                            (test=False,                                                                         (test=False,
                                                                                                                                                                                                                                                                                                                        date=2013-03-17,                                                                   date=2013-03-16,
                                                                                                                                                                                                                                                                                                                           rec_days=5,                                                                           rec_days=5,
                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                          exp_days=10,                                                                          exp_days=10,
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-15,
                                                                                                                                                                                                                                                                                                                        test_users=False,                                                                  test_users=False,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                       force_updates=False,                                                              force_updates=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
                                                                                                                                                                                                                                                                                                                     build_from_scratch=True,                                                          build_from_scratch=True,
                                                                                                                                                                                                                                                                                                                index_path=/spotify/discover/index,                                                index_path=/spotify/discover/index,
                                                                                                                                                                                                                                                                                                                       index_version=None,                                                                index_version=None,
                                                                                                                                                                                                                                                                                                                     FOLLOWS_SCORE=5.0)                                                                 FOLLOWS_SCORE=5.0)




                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-16,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)




                                                                                                                                                                                                                                                                                                                                                     IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                    (date=2013-03-17,
                                                                                                                                                                                                                                                                                                                                                     test_users=False,
                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
Really large...
Visualization needs some work
Error notifications
Great for automated execution
Process Synchronization
Prevents two identical tasks from running simultaneously


                                luigid
                       Simple task synchronization




             Data flow 1                   Data flow 2




                        Common dependency
                             Task
What we use Luigi for
Not just another Hadoop streaming framework!

 Hadoop Streaming
 Java Hadoop MapReduce
 Hive
 Pig
 Local (non-hadoop) data processing
 Import/Export data to/from Postgres
 Insert data into Cassandra
 scp/rsync/ftp data files and reports
 Dump and load databases
Others using it with Scala MapReduce and MRJob as well
Luigi is open source

http://github.com/spotify/luigi
Originated at Spotify
Based on many years of experience with data processing
Recent contributions by Foursquare and Bitly
Open source since September 2012
Thank you!
For more information feel free to reach out at
http://github.com/spotify/luigi

Oh, and we’re hiring – http://spotify.com/jobs



Elias Freider
freider@spotify.com

Luigi PyData presentation

  • 1.
    Luigi Batch data processingin Python Elias Freider freider@spotify.com
  • 2.
    Background Why did webuild Luigi? We have a lot of data Billions of log messages (several TBs) every day Usage and backend stats, debug information What we want to do AB-testing Music recommendations Monthly/daily/hourly reporting Business metric dashboards We experiment a lot – need quick development cycles
  • 3.
    How to doit? Hadoop Data storage Clean, filter, join and aggregate data Postgres Semi-aggregated data for dashboards Cassandra Time series data
  • 4.
    Defining a dataflow A bunch of data processing tasks with inter-dependencies
  • 5.
    Example: Artist Toplist Inputis a list of Timestamp, Track, Artist tuples Artist Streams Aggregation Top 10 Database
  • 6.
    Example: Artist Toplist Naive(non-luigi) approach
  • 7.
    Errors will occur Ifa failed step leaves behind broken data, we want to clean that up
  • 8.
    Avoid duplicate work Thesecond step fails, you fix it, then you want to resume
  • 9.
    Parametrization To use dataflows as command line tools
  • 10.
    Repeat! You want torun the dataflow for a set of similar inputs
  • 11.
    Introducing Luigi A Pythonframework for data flow definition and execution Main features Dead-simple dependency definitions (think GNU make) Emphasis on Hadoop/HDFS integration Atomic file operations Powerful task templating using OOP Data flow visualization Command line integration
  • 12.
  • 13.
    Luigi - AggregateArtists Run on the command line: $ python dataflow.py AggregateArtists DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74375] Running AggregateArtists() INFO: [pid 74375] Done AggregateArtists() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 14.
    Task templates andtargets Reduces code-repetition by utilizing Luigi’s object oriented data model Luigi comes with pre-implemented Tasks for... Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files Running Hive and (soon) Pig queries Inserting data sets into Postgres Writing new ones are as easy as defining an interface and implementing run()
  • 15.
    Hadoop MapReduce Built-in HadoopStreaming Python framework Features Very slim interface Fetches error logs from Hadoop cluster and displays them to the user Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins Easy to send along Python modules that might not be installed on the cluster Basic support for secondary sort Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
  • 16.
    Hadoop MapReduce Built-in HadoopStreaming Python framework
  • 17.
    Attach Python modules Soyou can use local modules remotely on the cluster
  • 18.
    Local state istransferred Set up local variables for map-side joins etc.
  • 19.
    Completing the toplist Top 10 artists - Wrapped arbitrary Python code
  • 20.
    Database support Basic functionalityfor exporting to Postgres. Cassandra support is in the works
  • 21.
    Running it all... DEBUG: Checking if ArtistToplistToDatabase() is complete INFO: Scheduled ArtistToplistToDatabase() DEBUG: Checking if Top10Artists() is complete INFO: Scheduled Top10Artists() DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 3 INFO: [pid 74811] Running AggregateArtists() INFO: [pid 74811] Done AggregateArtists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 74811] Running Top10Artists() INFO: [pid 74811] Done Top10Artists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74811] Running ArtistToplistToDatabase() INFO: Done writing, importing at 2013-03-13 15:41:09.407138 INFO: [pid 74811] Done ArtistToplistToDatabase() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 22.
    Data flow visualization Light-weightscheduler daemon with a web interface
  • 23.
    The results Imagine howcool this would be with real data...
  • 24.
    Task Parameters Class variableswith some magic Tasks have implicit __init__ Generates command line interface with typing and documentation $ python dataflow.py AggregateArtists --date 2013-03-05
  • 25.
  • 26.
    Task Parameters Makes itreally easy to create aggregates over time
  • 27.
    Multiple workers Basicmulti-processing $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
  • 28.
    Large data flows (Screenshot from web interface) EndSongCleaned EndSongCleaned (date=2013-03-16) (date=2013-03-15) AggregateTracks AggregateTracks UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay (test=False, MasterMetadata (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, (date=2013-03-15) date=2013-03-15, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-08, date=2013-03-12, date=2013-03-07, date=2013-03-11, date=2013-03-10, date=2013-03-09, date=2013-03-15, date=2013-03-06, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) JoinArtistGids JoinArtistGids UserLocationPeriod UserLocationPeriod (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-15, period=10, period=10, nshards=12, nshards=12, date=2013-03-17, date=2013-03-16, test_users=False) test_users=False) test_users=False) test_users=False) AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, ArtistFollows (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-16, date=2013-03-13, (date=2013-03-14) date=2013-03-14, date=2013-03-09, date=2013-03-08, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-07, date=2013-03-10, date=2013-03-06, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) AccumulateUserMatrices AccumulateUserMatrices AccumulateByArtists AccumulateByArtists (test=False, (test=False, (test=False, (test=False, JoinArtistGids JoinArtistGids JoinArtistGids JoinArtistGids date=2013-03-16, date=2013-03-15, date=2013-03-16, PlaylistVectors date=2013-03-15, (test=False, (test=False, (test=False, (test=False, index_version=1362433077, index_version=1362433077, test_users=False, (test=False, RelatedArtistsTC test_users=False, date=2013-03-14, date=2013-03-13, date=2013-03-12, date=2013-03-11, test_users=False, test_users=False, max_days=90, date=2013-03-14, () max_days=90, nshards=12, nshards=12, nshards=12, nshards=12, decay_factor=0.99, decay_factor=0.99, decay_factor=0.99, index_version=1362433077) decay_factor=0.99, test_users=False) test_users=False) test_users=False) test_users=False) days=5, days=5, days=10, days=10, build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) UserRecs UserRecs (test=False, (test=False, date=2013-03-17, date=2013-03-16, rec_days=5, rec_days=5, IngestUserRecs exp_days=10, exp_days=10, (date=2013-03-15, test_users=False, test_users=False, test_users=False, force_updates=False, force_updates=False, ttl=604800) build_from_scratch=True, build_from_scratch=True, index_path=/spotify/discover/index, index_path=/spotify/discover/index, index_version=None, index_version=None, FOLLOWS_SCORE=5.0) FOLLOWS_SCORE=5.0) IngestUserRecs (date=2013-03-16, test_users=False, ttl=604800) IngestUserRecs (date=2013-03-17, test_users=False, ttl=604800)
  • 29.
  • 30.
    Error notifications Great forautomated execution
  • 31.
    Process Synchronization Prevents twoidentical tasks from running simultaneously luigid Simple task synchronization Data flow 1 Data flow 2 Common dependency Task
  • 32.
    What we useLuigi for Not just another Hadoop streaming framework! Hadoop Streaming Java Hadoop MapReduce Hive Pig Local (non-hadoop) data processing Import/Export data to/from Postgres Insert data into Cassandra scp/rsync/ftp data files and reports Dump and load databases Others using it with Scala MapReduce and MRJob as well
  • 33.
    Luigi is opensource http://github.com/spotify/luigi Originated at Spotify Based on many years of experience with data processing Recent contributions by Foursquare and Bitly Open source since September 2012
  • 34.
    Thank you! For moreinformation feel free to reach out at http://github.com/spotify/luigi Oh, and we’re hiring – http://spotify.com/jobs Elias Freider freider@spotify.com