Luigi - Batch Data Processing in Python (PyData SV 2013)


Slides for a talk given by Elias Freider at PyData Silicon Valley 2013



  1. Luigi: Batch data processing in Python. Elias Freider, freider@spotify.com
  2. Background: why did we build Luigi? We have a lot of data: billions of log messages (several TBs) every day; usage and backend stats, debug information. What we want to do: AB-testing, music recommendations, monthly/daily/hourly reporting, business metric dashboards. We experiment a lot and need quick development cycles.
  3. How to do it? Hadoop: data storage; clean, filter, join and aggregate data. Postgres: semi-aggregated data for dashboards. Cassandra: time series data.
  4. Defining a data flow: a bunch of data processing tasks with inter-dependencies.
  5. Example: Artist Toplist. Input is a list of (Timestamp, Track, Artist) tuples. Stages in the diagram: Streams, Artist Aggregation, Top 10, Database.
  6. Example: Artist Toplist. Naive (non-Luigi) approach.
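
The code shown on slide 6 did not survive in this transcript. As a rough illustration only, a naive approach might be a single script that runs every stage in order. The function names and the sample rows below are assumptions made up for this sketch, not the speaker's code.

```python
from collections import Counter


def aggregate_artists(streams):
    """Count streams per artist from (timestamp, track, artist) tuples."""
    return Counter(artist for _timestamp, _track, artist in streams)


def top_artists(counts, n=10):
    """Return the n most-streamed artists as (artist, count) pairs."""
    return counts.most_common(n)


def naive_toplist(streams, n=10):
    # Every stage runs on every invocation: nothing is stored between
    # stages, so a failure halfway through means starting over.
    counts = aggregate_artists(streams)
    return top_artists(counts, n)


# Hypothetical sample input in the (timestamp, track, artist) shape.
sample_streams = [
    ("2013-03-05T10:00:00", "Track A", "Artist 1"),
    ("2013-03-05T10:03:00", "Track B", "Artist 2"),
    ("2013-03-05T10:07:00", "Track C", "Artist 1"),
]
toplist = naive_toplist(sample_streams)
```

A script like this works for one small input, but it has no answer to the concerns raised on the following slides: cleanup after failures, resuming, parametrization and repetition.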
  7. Errors will occur: if a failed step leaves behind broken data, we want to clean that up.
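
One common way to keep a failed step from leaving broken data behind is to write to a temporary file and only move it into place once the write has finished. The sketch below uses only the Python standard library; `write_atomically` is a hypothetical helper for illustration and is not taken from Luigi.

```python
import os
import tempfile


def write_atomically(path, lines):
    """Write lines to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            for line in lines:
                tmp_file.write(line + "\n")
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # A failed step removes its partial output instead of leaving it.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, the presence of the final file can be read as "this step finished", which is what makes it safe to skip the step on a later run.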
  8. Avoid duplicate work: the second step fails, you fix it, then you want to resume.
  9. Parametrization: to use data flows as command line tools.
  10. Repeat! You want to run the data flow for a set of similar inputs.
  11. Introducing Luigi: a Python framework for data flow definition and execution. Main features: dead-simple dependency definitions (think GNU make); emphasis on Hadoop/HDFS integration; atomic file operations; powerful task templating using OOP; data flow visualization; command line integration.
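
The "think GNU make" idea can be sketched without any framework: a task names its dependencies and its output, and a task whose output already exists is skipped. The toy `Task` class and `build` function below are assumptions written for this illustration using only the standard library. They are not Luigi's implementation or API, and the task names are borrowed from the talk's example only for flavor.

```python
import os
import tempfile


class Task:
    """A toy task: done when its output file exists (not Luigi's class)."""

    def requires(self):
        return []  # upstream tasks this task depends on

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError

    def complete(self):
        return os.path.exists(self.output())


def build(task, log):
    """Run missing dependencies first, then the task itself, depth-first."""
    if task.complete():
        return  # already done, so a rerun after a failure resumes here
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


WORK_DIR = tempfile.mkdtemp()  # scratch directory for the demo files


class Streams(Task):
    def output(self):
        return os.path.join(WORK_DIR, "streams.tsv")

    def run(self):
        # Hypothetical sample data: timestamp, track, artist.
        with open(self.output(), "w") as target:
            target.write("t1\tTrack A\tArtist 1\n")
            target.write("t2\tTrack B\tArtist 1\n")


class AggregateArtists(Task):
    def requires(self):
        return [Streams()]

    def output(self):
        return os.path.join(WORK_DIR, "artists.tsv")

    def run(self):
        counts = {}
        with open(self.requires()[0].output()) as source:
            for line in source:
                artist = line.rstrip("\n").split("\t")[2]
                counts[artist] = counts.get(artist, 0) + 1
        with open(self.output(), "w") as target:
            for artist, count in counts.items():
                target.write(f"{artist}\t{count}\n")


first_run, second_run = [], []
build(AggregateArtists(), first_run)   # runs Streams, then AggregateArtists
build(AggregateArtists(), second_run)  # outputs exist, so nothing runs
```

The second call does no work because both output files are already present, which is the make-style behavior that addresses the "avoid duplicate work" concern from slide 8.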
  12. Luigi - Aggregate Artists
  13. Luigi - Aggregate Artists. Run on the command line:
      $ python dataflow.py AggregateArtists
      DEBUG: Checking if AggregateArtists() is complete
      INFO: Scheduled AggregateArtists()
      DEBUG: Checking if Streams() is complete
      INFO: Done scheduling tasks
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 1
      INFO: [pid 74375] Running AggregateArtists()
      INFO: [pid 74375] Done AggregateArtists()
      DEBUG: Asking scheduler for work...
      INFO: Done
      INFO: There are no more tasks to run at this time
  14. Task templates and targets: reduces code repetition by utilizing Luigi's object-oriented data model. Luigi comes with pre-implemented Tasks for: running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files; running Hive and (soon) Pig queries; inserting data sets into Postgres. Writing new ones is as easy as defining an interface and implementing run().
  15. Hadoop MapReduce: built-in Hadoop Streaming Python framework. Features: very slim interface; fetches error logs from the Hadoop cluster and displays them to the user; class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map-side joins; easy to send along Python modules that might not be installed on the cluster; basic support for secondary sort; runs on CPython so you can use your favorite libs (numpy, pandas etc.).
  16. Hadoop MapReduce: built-in Hadoop Streaming Python framework.
  17. Attach Python modules, so you can use local modules remotely on the cluster.
  18. Local state is transferred: set up local variables for map-side joins etc.
  19. Completing the top list: Top 10 artists - wrapped arbitrary Python code.
  20. Database support: basic functionality for exporting to Postgres. Cassandra support is in the works.
  21. Running it all...
      DEBUG: Checking if ArtistToplistToDatabase() is complete
      INFO: Scheduled ArtistToplistToDatabase()
      DEBUG: Checking if Top10Artists() is complete
      INFO: Scheduled Top10Artists()
      DEBUG: Checking if AggregateArtists() is complete
      INFO: Scheduled AggregateArtists()
      DEBUG: Checking if Streams() is complete
      INFO: Done scheduling tasks
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 3
      INFO: [pid 74811] Running AggregateArtists()
      INFO: [pid 74811] Done AggregateArtists()
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 2
      INFO: [pid 74811] Running Top10Artists()
      INFO: [pid 74811] Done Top10Artists()
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 1
      INFO: [pid 74811] Running ArtistToplistToDatabase()
      INFO: Done writing, importing at 2013-03-13 15:41:09.407138
      INFO: [pid 74811] Done ArtistToplistToDatabase()
      DEBUG: Asking scheduler for work...
      INFO: Done
      INFO: There are no more tasks to run at this time
  22. Data flow visualization: light-weight scheduler daemon with a web interface.
  23. The results: imagine how cool this would be with real data...
  24. Task Parameters: class variables with some magic. Tasks have implicit __init__. Generates command line interface with typing and documentation.
      $ python dataflow.py AggregateArtists --date 2013-03-05
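
To make the idea of a typed, documented command line concrete, here is a hand-written equivalent using the standard library's `argparse`. It is only a sketch of what the generated interface saves you from writing: Luigi derives this from the task's parameters, whereas `parse_date` and `build_parser` below are hypothetical helpers defined for this example.

```python
import argparse
import datetime


def parse_date(text):
    """Convert 'YYYY-MM-DD' into a datetime.date for argparse."""
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


def build_parser():
    parser = argparse.ArgumentParser(prog="dataflow.py")
    parser.add_argument("task", help="name of the task to run")
    # The type callable gives typing; the help string gives documentation.
    parser.add_argument("--date", type=parse_date, default=None,
                        help="day to process, as YYYY-MM-DD")
    return parser


# Parse the same arguments as the command line shown on the slide.
args = build_parser().parse_args(["AggregateArtists", "--date", "2013-03-05"])
```

After parsing, `args.date` is a `datetime.date` rather than a string, so downstream code can do date arithmetic without re-parsing.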
  25. Task Parameters: combined usage example.
  26. Task Parameters: makes it really easy to create aggregates over time.
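
Aggregating over time usually comes down to expanding a period into its days and combining the per-day results. The sketch below expands an ISO week string such as 2013-W11 with the standard library (Python 3.8 or later for `date.fromisocalendar`). Both function names and the shape of `daily_counts` are assumptions for illustration, not Luigi's date interval API.

```python
import datetime


def dates_in_iso_week(week_string):
    """Expand a string such as '2013-W11' into seven dates, Monday first."""
    year_part, week_part = week_string.split("-W")
    monday = datetime.date.fromisocalendar(int(year_part), int(week_part), 1)
    return [monday + datetime.timedelta(days=offset) for offset in range(7)]


def aggregate_over(dates, daily_counts):
    """Sum per-day artist counts (date -> {artist: count}) over the dates."""
    totals = {}
    for day in dates:
        # Days with no data contribute nothing to the totals.
        for artist, count in daily_counts.get(day, {}).items():
            totals[artist] = totals.get(artist, 0) + count
    return totals


week = dates_in_iso_week("2013-W11")
```

In a task-based data flow, each date in `week` would map to one daily task, and the weekly aggregate would depend on all seven.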
  27. Multiple workers: basic multi-processing.
      $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
  28. Large data flows (screenshot from web interface). The dependency graph shows parameterized task nodes for dates between 2013-03-06 and 2013-03-17, including AggregateTracks, JoinArtistGids, AggregateUserMatrices, AggregateByArtists, UserRecs, AccumulateUserMatrices, AccumulateByArtists, IngestUserRecs, MasterMetadata, EndSongCleaned, UserLocationDay, UserLocationPeriod, ArtistFollows, RelatedArtistsTC and PlaylistVectors.
  29. Really large... Visualization needs some work.
  30. Error notifications: great for automated execution.
  31. Process synchronization: luigid, simple task synchronization. Prevents two identical tasks from running simultaneously. (Diagram: Data flow 1 and Data flow 2 share a common dependency task.)
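
The synchronization idea on slide 31 can be pictured as a central registry that hands each task identifier to at most one worker at a time. The in-process sketch below is a deliberately simplified stand-in, assuming a single shared object guarded by a lock. It is not how luigid is implemented, and `RunningTasks` is a hypothetical name.

```python
import threading


class RunningTasks:
    """Toy central registry: at most one worker may hold a given task id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()

    def try_claim(self, task_id):
        """Return True if the caller may run task_id, False if it is taken."""
        with self._lock:
            if task_id in self._running:
                return False
            self._running.add(task_id)
            return True

    def release(self, task_id):
        """Mark task_id as no longer running."""
        with self._lock:
            self._running.discard(task_id)


registry = RunningTasks()
task_id = "AggregateArtists(date=2013-03-05)"
first = registry.try_claim(task_id)    # first data flow gets the task
second = registry.try_claim(task_id)   # second data flow is turned away
registry.release(task_id)
third = registry.try_claim(task_id)    # free again after release
```

Two data flows that share a dependency would both ask the registry before running it, so the shared task runs once while the other flow waits for its output.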
  32. What we use Luigi for: not just another Hadoop streaming framework! Hadoop Streaming; Java Hadoop MapReduce; Hive; Pig; local (non-Hadoop) data processing; import/export data to/from Postgres; insert data into Cassandra; scp/rsync/ftp data files and reports; dump and load databases. Others are using it with Scala MapReduce and MRJob as well.
  33. Luigi is open source: http://github.com/spotify/luigi. Originated at Spotify. Based on many years of experience with data processing. Recent contributions by Foursquare and Bitly. Open source since September 2012.
  34. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi. Oh, and we're hiring: http://spotify.com/jobs. Elias Freider, freider@spotify.com
