0
LuigiBatch data processing inPythonElias Freiderfreider@spotify.com
BackgroundWhy did we build Luigi?We have a lot of data Billions of log messages (several TBs) every day Usage and backend ...
How to do it?Hadoop  Data storage  Clean, filter, join and aggregate dataPostgres  Semi-aggregated data for dashboardsCass...
Defining a data flowA bunch of data processing tasks with inter-dependencies
Example: Artist ToplistInput is a list of Timestamp, Track, Artist tuples                                 Artist      Stre...
Example: Artist ToplistNaive (non-luigi) approach
Errors will occurIf a failed step leaves behind broken data, we want to clean that up
Avoid duplicate workThe second step fails, you fix it, then you want to resume
ParametrizationTo use data flows as command line tools
Repeat!You want to run the dataflow for a set of similar inputs
Introducing LuigiA Python framework for data flow definition and executionMain features  Dead-simple dependency definition...
Luigi - Aggregate Artists
Luigi - Aggregate ArtistsRun on the command line:$ python dataflow.py AggregateArtistsDEBUG: Checking if AggregateArtists(...
Task templates and targetsReduces code-repetition by utilizing Luigi’s object oriented data modelLuigi comes with pre-impl...
Hadoop MapReduceBuilt-in Hadoop Streaming Python frameworkFeatures Very slim interface Fetches error logs from Hadoop clus...
Hadoop MapReduceBuilt-in Hadoop Streaming Python framework
Attach Python modulesSo you can use local modules remotely on the cluster
Local state is transferredSet up local variables for map-side joins etc.
Completing the top listTop 10 artists - Wrapped arbitrary Python code
Database supportBasic functionality for exporting to Postgres. Cassandra support is in the works
Running it all...   DEBUG: Checking if ArtistToplistToDatabase() is complete   INFO: Scheduled ArtistToplistToDatabase()  ...
Data flow visualizationLight-weight scheduler daemon with a web interface
The resultsImagine how cool this would be with real data...
Task ParametersClass variables with some magicTasks have implicit __init__   Generates command line interface with typing ...
Task ParametersCombined usage example
Task ParametersMakes it really easy to create aggregates over time
Multiple workers Basic multi-processing$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
Large data flows            (Screenshot from web interface)                                                               ...
Really large...Visualization needs some work
Error notificationsGreat for automated execution
Process SynchronizationPrevents two identical tasks from running simultaneously                                luigid     ...
What we use Luigi forNot just another Hadoop streaming framework! Hadoop Streaming Java Hadoop MapReduce Hive Pig Local (n...
Luigi is open sourcehttp://github.com/spotify/luigiOriginated at SpotifyBased on many years of experience with data proces...
Thank you!For more information feel free to reach out athttp://github.com/spotify/luigiOh, and we’re hiring – http://spoti...
Upcoming SlideShare
Loading in...5
×

Luigi PyData presentation

6,307

Published on

Luigi is Spotify's recently open-sourced Python framework for data flow definition and execution. This presentation introduces the basics of how to use it for data processing and ETL, both locally and on the Hadoop distributed computing platform.

Published in: Technology
0 Comments
25 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
6,307
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
89
Comments
0
Likes
25
Embeds 0
No embeds

No notes for slide

Transcript of "Luigi PyData presentation"

  1. 1. LuigiBatch data processing inPythonElias Freiderfreider@spotify.com
  2. 2. BackgroundWhy did we build Luigi?We have a lot of data Billions of log messages (several TBs) every day Usage and backend stats, debug information What we want to do AB-testing Music recommendations Monthly/daily/hourly reporting Business metric dashboards We experiment a lot – need quick development cycles
  3. 3. How to do it?Hadoop Data storage Clean, filter, join and aggregate dataPostgres Semi-aggregated data for dashboardsCassandra Time series data
  4. 4. Defining a data flowA bunch of data processing tasks with inter-dependencies
  5. 5. Example: Artist ToplistInput is a list of Timestamp, Track, Artist tuples Artist Streams Aggregation Top 10 Database
  6. 6. Example: Artist ToplistNaive (non-luigi) approach
  7. 7. Errors will occurIf a failed step leaves behind broken data, we want to clean that up
  8. 8. Avoid duplicate workThe second step fails, you fix it, then you want to resume
  9. 9. ParametrizationTo use data flows as command line tools
  10. 10. Repeat!You want to run the dataflow for a set of similar inputs
  11. 11. Introducing LuigiA Python framework for data flow definition and executionMain features Dead-simple dependency definitions (think GNU make) Emphasis on Hadoop/HDFS integration Atomic file operations Powerful task templating using OOP Data flow visualization Command line integration
  12. 12. Luigi - Aggregate Artists
  13. 13. Luigi - Aggregate ArtistsRun on the command line:$ python dataflow.py AggregateArtistsDEBUG: Checking if AggregateArtists() is completeINFO: Scheduled AggregateArtists()DEBUG: Checking if Streams() is completeINFO: Done scheduling tasksDEBUG: Asking scheduler for work...DEBUG: Pending tasks: 1INFO: [pid 74375] Running AggregateArtists()INFO: [pid 74375] Done AggregateArtists()DEBUG: Asking scheduler for work...INFO: DoneINFO: There are no more tasks to run at this time
  14. 14. Task templates and targetsReduces code-repetition by utilizing Luigi’s object oriented data modelLuigi comes with pre-implemented Tasks for... Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files Running Hive and (soon) Pig queries Inserting data sets into PostgresWriting new ones are as easy as defining an interface andimplementing run()
  15. 15. Hadoop MapReduceBuilt-in Hadoop Streaming Python frameworkFeatures Very slim interface Fetches error logs from Hadoop cluster and displays them to the user Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins Easy to send along Python modules that might not be installed on the cluster Basic support for secondary sort Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
  16. 16. Hadoop MapReduceBuilt-in Hadoop Streaming Python framework
  17. 17. Attach Python modulesSo you can use local modules remotely on the cluster
  18. 18. Local state is transferredSet up local variables for map-side joins etc.
  19. 19. Completing the top listTop 10 artists - Wrapped arbitrary Python code
  20. 20. Database supportBasic functionality for exporting to Postgres. Cassandra support is in the works
  21. 21. Running it all... DEBUG: Checking if ArtistToplistToDatabase() is complete INFO: Scheduled ArtistToplistToDatabase() DEBUG: Checking if Top10Artists() is complete INFO: Scheduled Top10Artists() DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 3 INFO: [pid 74811] Running AggregateArtists() INFO: [pid 74811] Done AggregateArtists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 74811] Running Top10Artists() INFO: [pid 74811] Done Top10Artists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74811] Running ArtistToplistToDatabase() INFO: Done writing, importing at 2013-03-13 15:41:09.407138 INFO: [pid 74811] Done ArtistToplistToDatabase() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  22. 22. Data flow visualizationLight-weight scheduler daemon with a web interface
  23. 23. The resultsImagine how cool this would be with real data...
  24. 24. Task ParametersClass variables with some magicTasks have implicit __init__ Generates command line interface with typing and documentation $ python dataflow.py AggregateArtists --date 2013-03-05
  25. 25. Task ParametersCombined usage example
  26. 26. Task ParametersMakes it really easy to create aggregates over time
  27. 27. Multiple workers Basic multi-processing$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
  28. 28. Large data flows (Screenshot from web interface) EndSongCleaned EndSongCleaned (date=2013-03-16) (date=2013-03-15) AggregateTracks AggregateTracks UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay (test=False, MasterMetadata (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, (date=2013-03-15) date=2013-03-15, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-08, date=2013-03-12, date=2013-03-07, date=2013-03-11, date=2013-03-10, date=2013-03-09, date=2013-03-15, date=2013-03-06, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) JoinArtistGids JoinArtistGids UserLocationPeriod UserLocationPeriod (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-15, period=10, period=10, nshards=12, nshards=12, date=2013-03-17, date=2013-03-16, test_users=False) test_users=False) test_users=False) test_users=False) AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, ArtistFollows (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-16, date=2013-03-13, (date=2013-03-14) date=2013-03-14, date=2013-03-09, date=2013-03-08, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-07, date=2013-03-10, date=2013-03-06,index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) AccumulateUserMatrices AccumulateUserMatrices AccumulateByArtists AccumulateByArtists (test=False, (test=False, (test=False, (test=False, JoinArtistGids JoinArtistGids JoinArtistGids JoinArtistGids date=2013-03-16, date=2013-03-15, date=2013-03-16, PlaylistVectors date=2013-03-15, (test=False, (test=False, (test=False, (test=False, index_version=1362433077, index_version=1362433077, test_users=False, (test=False, RelatedArtistsTC test_users=False, date=2013-03-14, date=2013-03-13, date=2013-03-12, date=2013-03-11, test_users=False, test_users=False, max_days=90, date=2013-03-14, () max_days=90, nshards=12, nshards=12, nshards=12, nshards=12, decay_factor=0.99, decay_factor=0.99, decay_factor=0.99, index_version=1362433077) decay_factor=0.99, test_users=False) test_users=False) test_users=False) test_users=False) days=5, days=5, days=10, days=10, build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) UserRecs UserRecs (test=False, (test=False, date=2013-03-17, date=2013-03-16, rec_days=5, rec_days=5, IngestUserRecs exp_days=10, exp_days=10, (date=2013-03-15, test_users=False, test_users=False, test_users=False, force_updates=False, force_updates=False, ttl=604800) build_from_scratch=True, build_from_scratch=True, index_path=/spotify/discover/index, index_path=/spotify/discover/index, index_version=None, index_version=None, FOLLOWS_SCORE=5.0) FOLLOWS_SCORE=5.0) IngestUserRecs (date=2013-03-16, test_users=False, ttl=604800) IngestUserRecs (date=2013-03-17, test_users=False, ttl=604800)
  29. 29. Really large...Visualization needs some work
  30. 30. Error notificationsGreat for automated execution
  31. 31. Process SynchronizationPrevents two identical tasks from running simultaneously luigid Simple task synchronization Data flow 1 Data flow 2 Common dependency Task
  32. 32. What we use Luigi forNot just another Hadoop streaming framework! Hadoop Streaming Java Hadoop MapReduce Hive Pig Local (non-hadoop) data processing Import/Export data to/from Postgres Insert data into Cassandra scp/rsync/ftp data files and reports Dump and load databasesOthers using it with Scala MapReduce and MRJob as well
  33. 33. Luigi is open sourcehttp://github.com/spotify/luigiOriginated at SpotifyBased on many years of experience with data processingRecent contributions by Foursquare and BitlyOpen source since September 2012
  34. 34. Thank you!For more information feel free to reach out athttp://github.com/spotify/luigiOh, and we’re hiring – http://spotify.com/jobsElias Freiderfreider@spotify.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×