Luigi - Batch Data Processing in Python (PyData SV 2013)

Slides for a talk given by Elias Freider at PyData Silicon Valley 2013

Published in Technology

Transcript

  • 1. Luigi: Batch data processing in Python. Elias Freider, freider@spotify.com
  • 2. Why did we build Luigi? Background: we have a lot of data. Billions of log messages (several TBs) every day: usage and backend stats, debug information. What we want to do: AB-testing, music recommendations, monthly/daily/hourly reporting, business metric dashboards. We experiment a lot and need quick development cycles.
  • 3. How to do it? Hadoop: data storage; clean, filter, join and aggregate data. Postgres: semi-aggregated data for dashboards. Cassandra: time series data.
  • 4. Defining a data flow: a bunch of data processing tasks with inter-dependencies.
  • 5. Example: Artist Toplist. Input is a list of (Timestamp, Track, Artist) tuples. Streams -> Artist Aggregation -> Top 10 -> Database.
  • 6. Example: Artist Toplist. Naive (non-Luigi) approach.
  • 7. Errors will occur: if a failed step leaves behind broken data, we want to clean that up.
  • 8. Avoid duplicate work: the second step fails, you fix it, then you want to resume.
  • 9. Parametrization: to use data flows as command line tools.
  • 10. Repeat! You want to run the data flow for a set of similar inputs.
  • 11. Introducing Luigi: a Python framework for data flow definition and execution. Main features: dead-simple dependency definitions (think GNU make); emphasis on Hadoop/HDFS integration; atomic file operations; powerful task templating using OOP; data flow visualization; command line integration.
  • 12. Luigi - Aggregate Artists
  • 13. Luigi - Aggregate Artists. Run on the command line:
       $ python dataflow.py AggregateArtists
       DEBUG: Checking if AggregateArtists() is complete
       INFO: Scheduled AggregateArtists()
       DEBUG: Checking if Streams() is complete
       INFO: Done scheduling tasks
       DEBUG: Asking scheduler for work...
       DEBUG: Pending tasks: 1
       INFO: [pid 74375] Running AggregateArtists()
       INFO: [pid 74375] Done AggregateArtists()
       DEBUG: Asking scheduler for work...
       INFO: Done
       INFO: There are no more tasks to run at this time
  • 14. Task templates and targets: reduces code repetition by utilizing Luigi's object oriented data model. Luigi comes with pre-implemented Tasks for: running Hadoop MapReduce utilizing Hadoop Streaming or custom jar files; running Hive and (soon) Pig queries; inserting data sets into Postgres. Writing new ones is as easy as defining an interface and implementing run().
  • 15. Hadoop MapReduce: built-in Hadoop Streaming Python framework. Features: very slim interface; fetches error logs from the Hadoop cluster and displays them to the user; class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map-side joins; easy to send along Python modules that might not be installed on the cluster; basic support for secondary sort; runs on CPython so you can use your favorite libs (numpy, pandas etc.).
  • 16. Hadoop MapReduce: built-in Hadoop Streaming Python framework.
  • 17. Attach Python modules, so you can use local modules remotely on the cluster.
  • 18. Local state is transferred: set up local variables for map-side joins etc.
  • 19. Completing the toplist: Top 10 artists, wrapped arbitrary Python code.
  • 20. Database support: basic functionality for exporting to Postgres. Cassandra support is in the works.
  • 21. Running it all...
       DEBUG: Checking if ArtistToplistToDatabase() is complete
       INFO: Scheduled ArtistToplistToDatabase()
       DEBUG: Checking if Top10Artists() is complete
       INFO: Scheduled Top10Artists()
       DEBUG: Checking if AggregateArtists() is complete
       INFO: Scheduled AggregateArtists()
       DEBUG: Checking if Streams() is complete
       INFO: Done scheduling tasks
       DEBUG: Asking scheduler for work...
       DEBUG: Pending tasks: 3
       INFO: [pid 74811] Running AggregateArtists()
       INFO: [pid 74811] Done AggregateArtists()
       DEBUG: Asking scheduler for work...
       DEBUG: Pending tasks: 2
       INFO: [pid 74811] Running Top10Artists()
       INFO: [pid 74811] Done Top10Artists()
       DEBUG: Asking scheduler for work...
       DEBUG: Pending tasks: 1
       INFO: [pid 74811] Running ArtistToplistToDatabase()
       INFO: Done writing, importing at 2013-03-13 15:41:09.407138
       INFO: [pid 74811] Done ArtistToplistToDatabase()
       DEBUG: Asking scheduler for work...
       INFO: Done
       INFO: There are no more tasks to run at this time
  • 22. Data flow visualization: light-weight scheduler daemon with a web interface.
  • 23. The results: imagine how cool this would be with real data...
  • 24. Task Parameters: class variables with some magic. Tasks have implicit __init__. Generates command line interface with typing and documentation.
       $ python dataflow.py AggregateArtists --date 2013-03-05
  • 25. Task Parameters: combined usage example.
  • 26. Task Parameters: makes it really easy to create aggregates over time.
  • 27. Multiple workers: basic multi-processing.
       $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
  • 28. Large data flows (screenshot from web interface). The dependency graph shows parameterized instances of the tasks AggregateTracks, JoinArtistGids, AggregateUserMatrices, AggregateByArtists, UserRecs, AccumulateUserMatrices, AccumulateByArtists, IngestUserRecs, MasterMetadata, EndSongCleaned, UserLocationDay, UserLocationPeriod, ArtistFollows, RelatedArtistsTC and PlaylistVectors, for dates between 2013-03-06 and 2013-03-17, for example AggregateTracks(test=False, date=2013-03-16, test_users=False).
  • 29. Really large... Visualization needs some work.
  • 30. Error notifications: great for automated execution.
  • 31. Process synchronization: prevents two identical tasks from running simultaneously. (Diagram: simple task synchronization through luigid; data flow 1 and data flow 2 share a common dependency task.)
  • 32. What we use Luigi for. Not just another Hadoop streaming framework! Hadoop Streaming; Java Hadoop MapReduce; Hive; Pig; local (non-Hadoop) data processing; import/export data to/from Postgres; insert data into Cassandra; scp/rsync/ftp data files and reports; dump and load databases. Others are using it with Scala MapReduce and MRJob as well.
  • 33. Luigi is open source. Originated at Spotify. Based on many years of experience with data processing. Recent contributions by Foursquare and Bitly. Open source since September 2012. http://github.com/spotify/luigi
  • 34. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi (Elias Freider, freider@spotify.com). Oh, and we're hiring: http://spotify.com/jobs
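
The code shown on slides 6 and 12 did not survive in this transcript. As a rough sketch of the ideas on slides 5 to 13 (tasks that declare their dependencies, skip work whose output already exists, and never leave half-written output behind), the following uses only the Python standard library. It is not Luigi's API: the names Task, requires, output_path, complete, run and build, the file names, and the sample stream data are all assumptions made for illustration.

```python
import os
import tempfile
from collections import Counter


class Task:
    """Hypothetical stand-in for a data flow task (not Luigi's real class)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        # Tasks that must be complete before this one can run.
        return []

    def output_path(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output file exists (think GNU make).
        return os.path.exists(self.output_path())

    def run(self, out):
        raise NotImplementedError


class Streams(Task):
    """Input data: tab-separated (timestamp, track, artist) lines."""

    def output_path(self):
        return os.path.join(self.workdir, "streams.tsv")

    def run(self, out):
        # Made-up sample data; in the talk this is an existing log dump.
        out.write("1\tSong A\tArtist X\n")
        out.write("2\tSong B\tArtist Y\n")
        out.write("3\tSong C\tArtist X\n")


class AggregateArtists(Task):
    """Counts streams per artist."""

    def requires(self):
        return [Streams(self.workdir)]

    def output_path(self):
        return os.path.join(self.workdir, "artists.tsv")

    def run(self, out):
        counts = Counter()
        with open(self.requires()[0].output_path()) as streams:
            for line in streams:
                _timestamp, _track, artist = line.rstrip("\n").split("\t")
                counts[artist] += 1
        for artist, plays in counts.most_common():
            out.write(f"{artist}\t{plays}\n")


def build(task, log):
    """Run missing dependencies first, then the task, writing output atomically."""
    if task.complete():
        return  # avoid duplicate work
    for dependency in task.requires():
        build(dependency, log)
    # Write to a temporary file and rename it into place only on success,
    # so a failed run never leaves a broken output file behind.
    fd, tmp_path = tempfile.mkstemp(dir=task.workdir)
    try:
        with os.fdopen(fd, "w") as out:
            task.run(out)
        os.replace(tmp_path, task.output_path())
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    log.append(type(task).__name__)
```

Calling build(AggregateArtists(some_directory), log) with an empty list as log runs Streams and then AggregateArtists; calling it a second time does nothing, because both output files already exist.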
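
Slide 24 describes parameters as class variables from which a constructor and a typed command line interface are generated. One way such a mechanism could work is sketched below with argparse from the standard library. This is only an illustration under assumptions, not how Luigi implements it: Parameter, ParamTask, parse_date and from_command_line are hypothetical names.

```python
import argparse
import datetime


class Parameter:
    """Hypothetical parameter declaration: a parser, a default and help text."""

    def __init__(self, parse=str, default=None, description=""):
        self.parse = parse
        self.default = default
        self.description = description


def parse_date(text):
    # Turns "2013-03-05" into datetime.date(2013, 3, 5).
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


class ParamTask:
    """Base class whose __init__ is derived from Parameter class variables."""

    @classmethod
    def params(cls):
        # Collect Parameter attributes from base classes first, then subclasses.
        found = {}
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if isinstance(value, Parameter):
                    found[name] = value
        return found

    def __init__(self, **kwargs):
        for name, param in self.params().items():
            setattr(self, name, kwargs.pop(name, param.default))
        if kwargs:
            raise TypeError(f"unknown parameters: {sorted(kwargs)}")

    def __repr__(self):
        args = ", ".join(f"{name}={getattr(self, name)}" for name in self.params())
        return f"{type(self).__name__}({args})"


class AggregateArtists(ParamTask):
    date = Parameter(parse=parse_date, description="day to aggregate")


def from_command_line(task_classes, argv):
    """Build one sub-command per task class and one typed flag per parameter."""
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="task", required=True)
    by_name = {cls.__name__: cls for cls in task_classes}
    for task_name, cls in by_name.items():
        sub = subparsers.add_parser(task_name)
        for name, param in cls.params().items():
            sub.add_argument(f"--{name}", type=param.parse,
                             default=param.default, help=param.description)
    values = vars(parser.parse_args(argv))
    return by_name[values.pop("task")](**values)
```

With this sketch, from_command_line([AggregateArtists], ["AggregateArtists", "--date", "2013-03-05"]) returns a task instance whose date attribute is a datetime.date, mirroring the command line shown on slide 24.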