SlideShare a Scribd company logo
1 of 34
Download to read offline
Luigi
Batch data processing in
Python

Elias Freider
freider@spotify.com
Background
Why did we build Luigi?


We have a lot of data
 Billions of log messages (several TBs) every day
 Usage and backend stats, debug information
 What we want to do
   AB-testing
   Music recommendations
   Monthly/daily/hourly reporting
   Business metric dashboards
 We experiment a lot – need quick development cycles
How to do it?

Hadoop
  Data storage
  Clean, filter, join and aggregate data
Postgres
  Semi-aggregated data for dashboards
Cassandra
  Time series data
Defining a data flow
A bunch of data processing tasks with inter-dependencies
Example: Artist Toplist
Input is a list of Timestamp, Track, Artist tuples




                                 Artist
      Streams
                              Aggregation




                                  Top 10             Database
Example: Artist Toplist
Naive (non-luigi) approach
Errors will occur
If a failed step leaves behind broken data, we want to clean that up
Avoid duplicate work
The second step fails, you fix it, then you want to resume
Parametrization
To use data flows as command line tools
Repeat!
You want to run the dataflow for a set of similar inputs
Introducing Luigi
A Python framework for data flow definition and execution




Main features

  Dead-simple dependency definitions (think GNU make)
  Emphasis on Hadoop/HDFS integration
  Atomic file operations
  Powerful task templating using OOP
  Data flow visualization
  Command line integration
Luigi - Aggregate Artists
Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running   AggregateArtists()
INFO: [pid 74375] Done      AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Task templates and targets
Reduces code-repetition by utilizing Luigi’s object oriented data model




Luigi comes with pre-implemented Tasks for...

  Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files
  Running Hive and (soon) Pig queries
  Inserting data sets into Postgres


Writing new ones are as easy as defining an interface and
implementing run()
Hadoop MapReduce
Built-in Hadoop Streaming Python framework




Features

 Very slim interface
 Fetches error logs from Hadoop cluster and displays them to the user
 Class instance variables can be referenced in MapReduce code, which makes it
 easy to supply extra data in dictionaries etc. for map side joins
 Easy to send along Python modules that might not be installed on the cluster
 Basic support for secondary sort
 Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
Hadoop MapReduce
Built-in Hadoop Streaming Python framework
Attach Python modules
So you can use local modules remotely on the cluster
Local state is transferred
Set up local variables for map-side joins etc.
Completing the top list
Top 10 artists - Wrapped arbitrary Python code
Database support
Basic functionality for exporting to Postgres. Cassandra support is in the works
Running it all...
   DEBUG: Checking if ArtistToplistToDatabase() is complete
   INFO: Scheduled ArtistToplistToDatabase()
   DEBUG: Checking if Top10Artists() is complete
   INFO: Scheduled Top10Artists()
   DEBUG: Checking if AggregateArtists() is complete
   INFO: Scheduled AggregateArtists()
   DEBUG: Checking if Streams() is complete
   INFO: Done scheduling tasks
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 3
   INFO: [pid 74811] Running   AggregateArtists()
   INFO: [pid 74811] Done      AggregateArtists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 2
   INFO: [pid 74811] Running   Top10Artists()
   INFO: [pid 74811] Done      Top10Artists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 1
   INFO: [pid 74811] Running   ArtistToplistToDatabase()
   INFO: Done writing, importing at 2013-03-13 15:41:09.407138
   INFO: [pid 74811] Done      ArtistToplistToDatabase()
   DEBUG: Asking scheduler for work...
   INFO: Done
   INFO: There are no more tasks to run at this time
Data flow visualization
Light-weight scheduler daemon with a web interface
The results
Imagine how cool this would be with real data...
Task Parameters
Class variables with some magic




Tasks have implicit __init__




   Generates command line interface with typing and documentation
   $ python dataflow.py AggregateArtists --date 2013-03-05
Task Parameters
Combined usage example
Task Parameters
Makes it really easy to create aggregates over time
Multiple workers
 Basic multi-processing



$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
Large data flows
            (Screenshot from web interface)




                                                                                                                                                                                                                                                                                                                  EndSongCleaned                                                                       EndSongCleaned
                                                                                                                                                                                                                                                                                                                  (date=2013-03-16)                                                                   (date=2013-03-15)




                                                                                                                                                                                                AggregateTracks                                                                                                             AggregateTracks                                                 UserLocationDay                                  UserLocationDay              UserLocationDay        UserLocationDay         UserLocationDay           UserLocationDay               UserLocationDay      UserLocationDay     UserLocationDay     UserLocationDay     UserLocationDay
                                                                                                                                                                                                   (test=False,                MasterMetadata                                                                                    (test=False,                                                    (test=False,                                   (test=False,                (test=False,           (test=False,            (test=False,               (test=False,                    (test=False,      (test=False,        (test=False,        (test=False,        (test=False,
                                                                                                                                                                                                date=2013-03-16,              (date=2013-03-15)                                                                            date=2013-03-15,                                                 date=2013-03-16,                                 date=2013-03-14,            date=2013-03-13,        date=2013-03-08,        date=2013-03-12,          date=2013-03-07,              date=2013-03-11,     date=2013-03-10,    date=2013-03-09,    date=2013-03-15,    date=2013-03-06,
                                                                                                                                                                                                test_users=False)                                                                                                          test_users=False)                                                test_users=False)                                test_users=False)           test_users=False)       test_users=False)       test_users=False)         test_users=False)             test_users=False)    test_users=False)   test_users=False)   test_users=False)   test_users=False)




                                                                                                                                                                                              JoinArtistGids                                 JoinArtistGids                                                                                                                                                                                                                                                                                       UserLocationPeriod             UserLocationPeriod
                                                                                                                                                                                               (test=False,                                     (test=False,                                                                                                                                                                                                                                                                                         (test=False,                     (test=False,
                                                                                                                                                                                            date=2013-03-16,                               date=2013-03-15,                                                                                                                                                                                                                                                                                           period=10,                        period=10,
                                                                                                                                                                                               nshards=12,                                   nshards=12,                                                                                                                                                                                                                                                                                          date=2013-03-17,               date=2013-03-16,
                                                                                                                                                                                            test_users=False)                              test_users=False)                                                                                                                                                                                                                                                                                      test_users=False)              test_users=False)




  AggregateUserMatrices       AggregateUserMatrices       AggregateUserMatrices                    AggregateUserMatrices               AggregateUserMatrices      AggregateUserMatrices
                                                                                                                                                                                            AggregateByArtists      AggregateByArtists                                        AggregateByArtists               AggregateByArtists               AggregateByArtists             AggregateByArtists                AggregateByArtists            AggregateByArtists           AggregateByArtists      AggregateByArtists       AggregateByArtists
       (test=False,                (test=False,                (test=False,                              (test=False,                        (test=False,              (test=False,
                                                                                                                                                                                               (test=False,            (test=False,                ArtistFollows                 (test=False,                     (test=False,                      (test=False,                  (test=False,                        (test=False,                (test=False,                  (test=False,            (test=False,            (test=False,
    date=2013-03-16,            date=2013-03-14,            date=2013-03-13,                          date=2013-03-15,                   date=2013-03-12,           date=2013-03-11,
                                                                                                                                                                                            date=2013-03-16,        date=2013-03-13,            (date=2013-03-14)             date=2013-03-14,                 date=2013-03-09,                 date=2013-03-08,               date=2013-03-15,                  date=2013-03-12,              date=2013-03-11,              date=2013-03-07,        date=2013-03-10,         date=2013-03-06,
index_version=1362433077,   index_version=1362433077,   index_version=1362433077,               index_version=1362433077,           index_version=1362433077,   index_version=1362433077,
                                                                                                                                                                                            test_users=False)       test_users=False)                                         test_users=False)                test_users=False)                test_users=False)              test_users=False)                 test_users=False)             test_users=False)             test_users=False)       test_users=False)       test_users=False)
    test_users=False)           test_users=False)           test_users=False)                         test_users=False)                  test_users=False)          test_users=False)




                                                                              AccumulateUserMatrices                         AccumulateUserMatrices                                                                                                     AccumulateByArtists                                                                                                                                                                                         AccumulateByArtists
                                                                                    (test=False,                                   (test=False,                                                                                                                (test=False,                                                                                                                                                                                             (test=False,
                                                                                                                                                                                                                                                                                            JoinArtistGids                                                                JoinArtistGids                                              JoinArtistGids                                                                                                                 JoinArtistGids
                                                                                 date=2013-03-16,                               date=2013-03-15,                                                                                                         date=2013-03-16,                                                          PlaylistVectors                                                                                                                   date=2013-03-15,
                                                                                                                                                                                                                                                                                                (test=False,                                                               (test=False,                                                (test=False,                                                                                                                   (test=False,
                                                                          index_version=1362433077,                         index_version=1362433077,                                                                                                    test_users=False,                                                           (test=False,                                                    RelatedArtistsTC                                                test_users=False,
                                                                                                                                                                                                                                                                                          date=2013-03-14,                                                              date=2013-03-13,                                             date=2013-03-12,                                                                                                               date=2013-03-11,
                                                                                 test_users=False,                              test_users=False,                                                                                                          max_days=90,                                                           date=2013-03-14,                                                              ()                                                    max_days=90,
                                                                                                                                                                                                                                                                                            nshards=12,                                                                   nshards=12,                                                  nshards=12,                                                                                                                    nshards=12,
                                                                                 decay_factor=0.99,                             decay_factor=0.99,                                                                                                      decay_factor=0.99,                                                index_version=1362433077)                                                                                                                 decay_factor=0.99,
                                                                                                                                                                                                                                                                                         test_users=False)                                                              test_users=False)                                            test_users=False)                                                                                                              test_users=False)
                                                                                      days=5,                                        days=5,                                                                                                                    days=10,                                                                                                                                                                                                 days=10,
                                                                              build_from_scratch=True)                       build_from_scratch=True)                                                                                                build_from_scratch=True)                                                                                                                                                                                    build_from_scratch=True)




                                                                                                                                                                                                                                                                                                                            UserRecs                                                                                 UserRecs
                                                                                                                                                                                                                                                                                                                            (test=False,                                                                         (test=False,
                                                                                                                                                                                                                                                                                                                        date=2013-03-17,                                                                   date=2013-03-16,
                                                                                                                                                                                                                                                                                                                           rec_days=5,                                                                           rec_days=5,
                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                          exp_days=10,                                                                          exp_days=10,
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-15,
                                                                                                                                                                                                                                                                                                                        test_users=False,                                                                  test_users=False,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                       force_updates=False,                                                              force_updates=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
                                                                                                                                                                                                                                                                                                                     build_from_scratch=True,                                                          build_from_scratch=True,
                                                                                                                                                                                                                                                                                                                index_path=/spotify/discover/index,                                                index_path=/spotify/discover/index,
                                                                                                                                                                                                                                                                                                                       index_version=None,                                                                index_version=None,
                                                                                                                                                                                                                                                                                                                     FOLLOWS_SCORE=5.0)                                                                 FOLLOWS_SCORE=5.0)




                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-16,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)




                                                                                                                                                                                                                                                                                                                                                     IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                    (date=2013-03-17,
                                                                                                                                                                                                                                                                                                                                                     test_users=False,
                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
Really large...
Visualization needs some work
Error notifications
Great for automated execution
Process Synchronization
Prevents two identical tasks from running simultaneously


                                luigid
                       Simple task synchronization




             Data flow 1                   Data flow 2




                        Common dependency
                             Task
What we use Luigi for
Not just another Hadoop streaming framework!

 Hadoop Streaming
 Java Hadoop MapReduce
 Hive
 Pig
 Local (non-hadoop) data processing
 Import/Export data to/from Postgres
 Insert data into Cassandra
 scp/rsync/ftp data files and reports
 Dump and load databases
Others using it with Scala MapReduce and MRJob as well
Luigi is open source

http://github.com/spotify/luigi
Originated at Spotify
Based on many years of experience with data processing
Recent contributions by Foursquare and Bitly
Open source since September 2012
Thank you!
For more information feel free to reach out at
http://github.com/spotify/luigi

Oh, and we’re hiring – http://spotify.com/jobs



Elias Freider
freider@spotify.com

More Related Content

What's hot

What's hot (20)

小さなサービスも契約する時代
小さなサービスも契約する時代小さなサービスも契約する時代
小さなサービスも契約する時代
 
KafkaとPulsar
KafkaとPulsarKafkaとPulsar
KafkaとPulsar
 
How to run P4 BMv2
How to run P4 BMv2How to run P4 BMv2
How to run P4 BMv2
 
グラフデータベース Neptune 使ってみた
グラフデータベース Neptune 使ってみたグラフデータベース Neptune 使ってみた
グラフデータベース Neptune 使ってみた
 
ruby-ffiについてざっくり解説
ruby-ffiについてざっくり解説ruby-ffiについてざっくり解説
ruby-ffiについてざっくり解説
 
初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
CyberAgent における OSS の CI/CD 基盤開発 myshoes #CICD2021
CyberAgent における OSS の CI/CD 基盤開発 myshoes #CICD2021CyberAgent における OSS の CI/CD 基盤開発 myshoes #CICD2021
CyberAgent における OSS の CI/CD 基盤開発 myshoes #CICD2021
 
PHPからgoへの移行で分かったこと
PHPからgoへの移行で分かったことPHPからgoへの移行で分かったこと
PHPからgoへの移行で分かったこと
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
マイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPCマイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPC
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
 
Spanner移行について本気出して考えてみた
Spanner移行について本気出して考えてみたSpanner移行について本気出して考えてみた
Spanner移行について本気出して考えてみた
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
 
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudyリペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
 
分散システムについて語らせてくれ
分散システムについて語らせてくれ分散システムについて語らせてくれ
分散システムについて語らせてくれ
 
Redisの特徴と活用方法について
Redisの特徴と活用方法についてRedisの特徴と活用方法について
Redisの特徴と活用方法について
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Luigi PyData presentation

  • 1. Luigi Batch data processing in Python Elias Freider freider@spotify.com
  • 2. Background Why did we build Luigi? We have a lot of data Billions of log messages (several TBs) every day Usage and backend stats, debug information What we want to do AB-testing Music recommendations Monthly/daily/hourly reporting Business metric dashboards We experiment a lot – need quick development cycles
  • 3. How to do it? Hadoop Data storage Clean, filter, join and aggregate data Postgres Semi-aggregated data for dashboards Cassandra Time series data
  • 4. Defining a data flow A bunch of data processing tasks with inter-dependencies
  • 5. Example: Artist Toplist Input is a list of Timestamp, Track, Artist tuples Artist Streams Aggregation Top 10 Database
  • 6. Example: Artist Toplist Naive (non-luigi) approach
  • 7. Errors will occur If a failed step leaves behind broken data, we want to clean that up
  • 8. Avoid duplicate work The second step fails, you fix it, then you want to resume
  • 9. Parametrization To use data flows as command line tools
  • 10. Repeat! You want to run the dataflow for a set of similar inputs
  • 11. Introducing Luigi A Python framework for data flow definition and execution Main features Dead-simple dependency definitions (think GNU make) Emphasis on Hadoop/HDFS integration Atomic file operations Powerful task templating using OOP Data flow visualization Command line integration
  • 12. Luigi - Aggregate Artists
  • 13. Luigi - Aggregate Artists Run on the command line: $ python dataflow.py AggregateArtists DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74375] Running AggregateArtists() INFO: [pid 74375] Done AggregateArtists() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 14. Task templates and targets Reduces code-repetition by utilizing Luigi’s object oriented data model Luigi comes with pre-implemented Tasks for... Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files Running Hive and (soon) Pig queries Inserting data sets into Postgres Writing new ones are as easy as defining an interface and implementing run()
  • 15. Hadoop MapReduce Built-in Hadoop Streaming Python framework Features Very slim interface Fetches error logs from Hadoop cluster and displays them to the user Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins Easy to send along Python modules that might not be installed on the cluster Basic support for secondary sort Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
  • 16. Hadoop MapReduce Built-in Hadoop Streaming Python framework
  • 17. Attach Python modules So you can use local modules remotely on the cluster
  • 18. Local state is transferred Set up local variables for map-side joins etc.
  • 19. Completing the top list Top 10 artists - Wrapped arbitrary Python code
  • 20. Database support Basic functionality for exporting to Postgres. Cassandra support is in the works
  • 21. Running it all... DEBUG: Checking if ArtistToplistToDatabase() is complete INFO: Scheduled ArtistToplistToDatabase() DEBUG: Checking if Top10Artists() is complete INFO: Scheduled Top10Artists() DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 3 INFO: [pid 74811] Running AggregateArtists() INFO: [pid 74811] Done AggregateArtists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 74811] Running Top10Artists() INFO: [pid 74811] Done Top10Artists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74811] Running ArtistToplistToDatabase() INFO: Done writing, importing at 2013-03-13 15:41:09.407138 INFO: [pid 74811] Done ArtistToplistToDatabase() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 22. Data flow visualization Light-weight scheduler daemon with a web interface
  • 23. The results Imagine how cool this would be with real data...
  • 24. Task Parameters Class variables with some magic Tasks have implicit __init__ Generates command line interface with typing and documentation $ python dataflow.py AggregateArtists --date 2013-03-05
  • 26. Task Parameters Makes it really easy to create aggregates over time
  • 27. Multiple workers Basic multi-processing $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
  • 28. Large data flows (Screenshot from web interface) EndSongCleaned EndSongCleaned (date=2013-03-16) (date=2013-03-15) AggregateTracks AggregateTracks UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay (test=False, MasterMetadata (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, (date=2013-03-15) date=2013-03-15, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-08, date=2013-03-12, date=2013-03-07, date=2013-03-11, date=2013-03-10, date=2013-03-09, date=2013-03-15, date=2013-03-06, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) JoinArtistGids JoinArtistGids UserLocationPeriod UserLocationPeriod (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-15, period=10, period=10, nshards=12, nshards=12, date=2013-03-17, date=2013-03-16, test_users=False) test_users=False) test_users=False) test_users=False) AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, ArtistFollows (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-16, date=2013-03-13, (date=2013-03-14) date=2013-03-14, date=2013-03-09, date=2013-03-08, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-07, date=2013-03-10, date=2013-03-06, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) AccumulateUserMatrices AccumulateUserMatrices AccumulateByArtists AccumulateByArtists (test=False, (test=False, (test=False, (test=False, JoinArtistGids JoinArtistGids JoinArtistGids JoinArtistGids date=2013-03-16, date=2013-03-15, date=2013-03-16, PlaylistVectors date=2013-03-15, (test=False, (test=False, (test=False, (test=False, index_version=1362433077, index_version=1362433077, test_users=False, (test=False, RelatedArtistsTC test_users=False, date=2013-03-14, date=2013-03-13, date=2013-03-12, date=2013-03-11, test_users=False, test_users=False, max_days=90, date=2013-03-14, () max_days=90, nshards=12, nshards=12, nshards=12, nshards=12, decay_factor=0.99, decay_factor=0.99, decay_factor=0.99, index_version=1362433077) decay_factor=0.99, test_users=False) test_users=False) test_users=False) test_users=False) days=5, days=5, days=10, days=10, build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) UserRecs UserRecs (test=False, (test=False, date=2013-03-17, date=2013-03-16, rec_days=5, rec_days=5, IngestUserRecs exp_days=10, exp_days=10, (date=2013-03-15, test_users=False, test_users=False, test_users=False, force_updates=False, force_updates=False, ttl=604800) build_from_scratch=True, build_from_scratch=True, index_path=/spotify/discover/index, index_path=/spotify/discover/index, index_version=None, index_version=None, FOLLOWS_SCORE=5.0) FOLLOWS_SCORE=5.0) IngestUserRecs (date=2013-03-16, test_users=False, ttl=604800) IngestUserRecs (date=2013-03-17, test_users=False, ttl=604800)
  • 30. Error notifications Great for automated execution
  • 31. Process Synchronization Prevents two identical tasks from running simultaneously luigid Simple task synchronization Data flow 1 Data flow 2 Common dependency Task
  • 32. What we use Luigi for Not just another Hadoop streaming framework! Hadoop Streaming Java Hadoop MapReduce Hive Pig Local (non-hadoop) data processing Import/Export data to/from Postgres Insert data into Cassandra scp/rsync/ftp data files and reports Dump and load databases Others using it with Scala MapReduce and MRJob as well
  • 33. Luigi is open source http://github.com/spotify/luigi Originated at Spotify Based on many years of experience with data processing Recent contributions by Foursquare and Bitly Open source since September 2012
  • 34. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi Oh, and we’re hiring – http://spotify.com/jobs Elias Freider freider@spotify.com