Luigi Presentation at OSCON 2013

From OSCON 2013


Transcript

  • 1. Batch data processing in Python. Erik Bernhardsson, erikbern@spotify.com
  • 2. Btw, I’m Erik Bernhardsson. I’m at Spotify in NYC. Previously managed the “Analytics team” in Stockholm. Focusing mostly on music discovery and large-scale machine learning.
  • 3. Why did we build Luigi? We crunch a lot of data. Background: billions of log messages (several TBs) every day; usage and backend stats, debug information. What we want to do: AB testing, music recommendations, monthly/daily/hourly reporting, business metric dashboards. We experiment a lot, so we need quick development cycles.
  • 4. We like Hadoop. Our second cluster (in 2009).
  • 5. Long story short :) Our fifth cluster.
  • 6. But what about running 1000s of jobs every day? Running one job is easy. Lots of long-running processes with dependencies: need monitoring; handle failures; go from experimentation to production easily.
  • 7. But also non-Hadoop stuff. Most things are Python Map/Reduce jobs. Also Pig, Hive. SCP files from one host to another. Train a machine learning model. Put data in Cassandra.
  • 8. How not to do workflows. In the pre-Luigi world.
  • 9. Example: Artist Toplist. “Streams” is a list of (username, track, artist, timestamp) tuples. Streams → Artist Aggregation → Top 10 → Database.
  • 10. Pre-Luigi example of artist toplists. Don’t do this at home.
  • 11. OK, so chain the tasks.
  • 12. Cron nicer, yay!
  • 13. Errors will occur. That’s OK, but don’t leave broken data somewhere (btw, Luigi gives you atomic file operations locally and in HDFS).
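The "don't leave broken data somewhere" point is usually handled with a write-then-rename pattern. The sketch below shows that pattern with the Python standard library only; it is an illustration of the idea, not Luigi's own implementation, and `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # The rename is atomic on POSIX: the file appears complete or not at all.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temp file; the destination is left untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies mid-write, only the temporary file is affected, so a later run that checks for the final path will not mistake partial output for a finished result.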
  • 14. Don’t run things twice. The second step fails, you fix it, then you want to resume.
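One simple way to get resume behaviour is to treat a step as done when its output already exists. The following is a minimal sketch under that assumption; `run_steps` and `write_marker` are hypothetical names, and the slides do not show how the pre-Luigi scripts handled this.

```python
import os
import tempfile


def run_steps(steps):
    """Run (output_path, action) pairs in order, skipping finished ones.

    A step counts as done when its output file already exists, so a
    rerun after a failure resumes at the first missing output.
    """
    executed = []
    for output_path, action in steps:
        if os.path.exists(output_path):
            continue  # already produced by an earlier run
        action(output_path)
        executed.append(output_path)
    return executed


def write_marker(path):
    # Hypothetical stand-in for real work: just create the output file.
    with open(path, "w") as handle:
        handle.write("done\n")
```

Calling `run_steps` twice with the same list does nothing the second time; deleting one output makes only that step run again.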
  • 15. Parametrize tasks. To use dataflows as command line tools.
  • 16. Put tasks in loops. You want to run the dataflow for a set of similar inputs.
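Running the same flow for a set of similar inputs often means looping over dates. A small sketch of that, assuming one output file per day; the `toplist_<date>.tsv` naming scheme is made up for illustration.

```python
import datetime


def date_range(start, stop):
    """Yield each date from start up to, but not including, stop."""
    current = start
    while current < stop:
        yield current
        current += datetime.timedelta(days=1)


def toplist_path(date):
    # Hypothetical naming scheme: one output file per day.
    return "toplist_{}.tsv".format(date.isoformat())


# Three daily outputs for 1 to 3 March 2013.
paths = [toplist_path(day) for day in date_range(datetime.date(2013, 3, 1),
                                                 datetime.date(2013, 3, 4))]
```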
  • 17. Plumbing sucks
  • 18. Plumbing sucks... Graph algorithms rock!
  • 19. Who’s the world’s second most famous plumber? Hint: he wears green.
  • 20. Introducing Luigi. A Python framework for dataflow definition and execution.
  • 21. Main features. Luigi is “kind of like Makefile” in Python, on steroids and PCP ... with a toolbox of mainly Hadoop related stuff. Simple dependency definitions. Emphasis on Hadoop/HDFS integration. Atomic file operations. Dataflow visualization. Command line integration.
  • 22. Luigi Task
  • 23. Luigi - AggregateArtists
  • 24. Luigi - AggregateArtists. Run on the command line:
    $ python dataflow.py AggregateArtists
    DEBUG: Checking if AggregateArtists() is complete
    INFO: Scheduled AggregateArtists()
    DEBUG: Checking if Streams() is complete
    INFO: Done scheduling tasks
    DEBUG: Asking scheduler for work...
    DEBUG: Pending tasks: 1
    INFO: [pid 74375] Running AggregateArtists()
    INFO: [pid 74375] Done AggregateArtists()
    DEBUG: Asking scheduler for work...
    INFO: Done
    INFO: There are no more tasks to run at this time
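The code from the slides did not survive in this transcript, so the sketch below is not the talk's code and does not import Luigi. It is a plain-Python stand-in that mirrors the requires/output/run shape of a task and the "is it complete?" check visible in the log above. The file names and the tab-separated layout of the streams file are assumptions.

```python
import collections
import os
import tempfile


class Task:
    """Minimal stand-in for a workflow task (not Luigi's real class)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def complete(self):
        # A task is complete once its output file exists.
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


class Streams(Task):
    """External input: tab-separated username, track, artist, timestamp."""

    def output(self):
        return os.path.join(self.workdir, "streams.tsv")


class AggregateArtists(Task):
    def requires(self):
        return [Streams(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "artist_counts.tsv")

    def run(self):
        counts = collections.Counter()
        with open(self.requires()[0].output()) as lines:
            for line in lines:
                _user, _track, artist, _ts = line.rstrip("\n").split("\t")
                counts[artist] += 1
        with open(self.output(), "w") as out:
            for artist, count in sorted(counts.items()):
                out.write("{}\t{}\n".format(artist, count))


def build(task):
    """Depth-first: finish missing dependencies, then the task itself."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency)
    task.run()
```

With a `streams.tsv` file already in the working directory, `build(AggregateArtists(workdir))` finds `Streams` complete and runs only the aggregation, which is the same order of events the log shows.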
  • 25. Completing the toplist. Top 10 artists - wrapped arbitrary Python code.
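The "arbitrary Python code" for a top-10 step could be as small as a heap-based selection. A sketch, assuming the aggregated counts are available as a dict; `top_artists` is a hypothetical helper and not taken from the slides.

```python
import heapq


def top_artists(counts, n=10):
    """Return the n (artist, count) pairs with the highest counts."""
    # Order by count descending, then artist name so ties are deterministic.
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))
```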
  • 26. Database support. Basic functionality for exporting to Postgres. Cassandra support is in the works.
  • 27. Running it all...
    DEBUG: Checking if ArtistToplistToDatabase() is complete
    INFO: Scheduled ArtistToplistToDatabase()
    DEBUG: Checking if Top10Artists() is complete
    INFO: Scheduled Top10Artists()
    DEBUG: Checking if AggregateArtists() is complete
    INFO: Scheduled AggregateArtists()
    DEBUG: Checking if Streams() is complete
    INFO: Done scheduling tasks
    DEBUG: Asking scheduler for work...
    DEBUG: Pending tasks: 3
    INFO: [pid 74811] Running AggregateArtists()
    INFO: [pid 74811] Done AggregateArtists()
    DEBUG: Asking scheduler for work...
    DEBUG: Pending tasks: 2
    INFO: [pid 74811] Running Top10Artists()
    INFO: [pid 74811] Done Top10Artists()
    DEBUG: Asking scheduler for work...
    DEBUG: Pending tasks: 1
    INFO: [pid 74811] Running ArtistToplistToDatabase()
    INFO: Done writing, importing at 2013-03-13 15:41:09.407138
    INFO: [pid 74811] Done ArtistToplistToDatabase()
    DEBUG: Asking scheduler for work...
    INFO: Done
    INFO: There are no more tasks to run at this time
  • 28. The results. Imagine how cool this would be with real data...
  • 29. Task Parameters. Class variables with some magic. Tasks have implicit __init__. Generates command line interface with typing and documentation.
    $ python dataflow.py AggregateArtists --date 2013-03-05
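To make "command line interface with typing" concrete, here is a rough equivalent built by hand with `argparse`. This only illustrates what a typed `--date` parameter means; it is not how Luigi generates its interface, and `make_parser` and `parse_date` are hypothetical names.

```python
import argparse
import datetime


def parse_date(text):
    """Turn 'YYYY-MM-DD' into a datetime.date, as a typed CLI parameter."""
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


def make_parser():
    parser = argparse.ArgumentParser(prog="dataflow.py")
    parser.add_argument("task", help="name of the task to run")
    parser.add_argument("--date", type=parse_date,
                        default=datetime.date.today(),
                        help="day to process, as YYYY-MM-DD")
    return parser


# Same arguments as the command line shown on the slide.
args = make_parser().parse_args(["AggregateArtists", "--date", "2013-03-05"])
```

The parsed value is a `datetime.date` rather than a string, so downstream code can do date arithmetic without re-parsing.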
  • 30. Task Parameters. Combined usage example.
  • 31. Task templates and targets. Luigi comes with a toolbox of abstract Tasks for... running Hadoop MapReduce utilizing Hadoop Streaming or custom jar files; running Hive and (soon) Pig queries; inserting data sets into Postgres; ...how to run anything, really. Writing new ones is as easy as defining an interface and implementing run().
  • 32. Hadoop MapReduce. Built-in Hadoop Streaming Python framework. Features: tiny interface, just implement mapper and reducer; fetches error logs from Hadoop cluster and displays them to the user; class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins; easy to send along Python modules that might not be installed on the cluster; support for counters, secondary sort, combiners, distributed cache, etc.; runs on CPython so you can use your favorite libs (numpy, pandas etc.).
  • 33. Hadoop MapReduce. Built-in Hadoop Streaming Python framework.
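The "just implement mapper and reducer" interface can be illustrated without a cluster. The sketch below simulates map, shuffle and reduce in one process for the artist-count example. It is a local stand-in, not Luigi's Hadoop Streaming framework, and `run_local` is a hypothetical helper.

```python
import itertools


def mapper(line):
    """Emit (artist, 1) for one tab-separated stream record."""
    _user, _track, artist, _timestamp = line.split("\t")
    yield artist, 1


def reducer(key, values):
    """Sum the counts collected for one artist."""
    yield key, sum(values)


def run_local(lines, mapper, reducer):
    """Simulate map, shuffle/sort and reduce in a single process."""
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=lambda pair: pair[0])  # stands in for Hadoop's shuffle
    results = []
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        results.extend(reducer(key, (value for _key, value in group)))
    return results
```

Keeping the mapper and reducer as plain generator functions makes them easy to test locally before handing them to a real cluster.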
  • 34. More features
  • 35. Luigi’s “visualiser”
  • 36. Dive into any task
  • 37. Multiple workers. Basic multi-processing.
    $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W08
  • 38. Error notifications. Great for automated execution.
  • 39. Process Synchronization. Prevents two identical tasks from running simultaneously. Diagram labels: Luigi worker 1, Luigi worker 2, tasks A B C A C F, Luigi central planner.
  • 40. Process Synchronization ...what happens. Diagram labels: Luigi worker 1, Luigi worker 2, tasks A B C A C F.
  • 41. Process Synchronization ...what happens. Diagram labels: Luigi worker 1, Luigi worker 2, tasks A B C A C F.
  • 42. Process Synchronization ...what happens. Diagram labels: Luigi worker 1, Luigi worker 2, tasks A B C A C F.
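The diagrams are lost, but the idea of a central planner that stops two workers from running the same task can be sketched as a claim table. This is a toy model, not Luigi's scheduler. The split of tasks A, B, C to worker 1 and A, C, F to worker 2 is an assumption read off the diagram labels.

```python
class CentralPlanner:
    """Toy scheduler that hands each task id to at most one worker."""

    def __init__(self):
        self.owners = {}  # task id -> worker that claimed it first

    def claim(self, worker, task_id):
        """Return True if this worker may run task_id, else False."""
        owner = self.owners.setdefault(task_id, worker)
        return owner == worker


planner = CentralPlanner()
# Worker 1 asks first, so it gets the shared tasks A and C.
worker_1 = [t for t in ["A", "B", "C"] if planner.claim("worker 1", t)]
# Worker 2 is refused A and C and only runs F.
worker_2 = [t for t in ["A", "C", "F"] if planner.claim("worker 2", t)]
```

A real planner shared between processes would also need locking or a network service around `claim`; the sketch leaves that out to keep the core rule visible.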
  • 43. Large dataflows (screenshot from web interface)
  • 44. Things Luigi is not
  • 45. Luigi is not trying to replace mrjob. Yes, you can run Python Hadoop jobs in Luigi. But the main focus is workflow management.
  • 46. Luigi does not give you scalability. You still need to figure out how each task runs.
  • 47. Luigi does not help you transform the data. MapReduce/Pig/Hive/etc. are wonderful tools for doing this and Luigi is more than happy to delegate it to them.
  • 48. ...but it’s sort of like Oozie. Although Oozie is kind of annoying. Oozie vs. Luigi: only Hadoop (Oozie: yes!); horrible XML (Oozie: yes!); easy (Luigi: yes!); fun & powerful (Luigi: yes!).
  • 49. “Oozie example”
    <workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>
      <start to='getDirInfo' />
      <!-- STEP ONE -->
      <action name='getDirInfo'>
        <!--writes 2 properties:
            dir.num-files: returns -1 if dir doesn't exist, otherwise returns # of files in dir
            dir.age: returns -1 if dir doesn't exist, otherwise returns age of dir in days -->
        <java>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <main-class>com.navteq.oozie.GetDirInfo</main-class>
          <arg>${inputDir}</arg>
          <capture-output />
        </java>
        <ok to="makeIngestDecision" />
        <error to="fail" />
      </action>
      <!-- STEP TWO -->
      <decision name="makeIngestDecision">
        <switch>
          <!-- empty or doesn't exist -->
          <case to="end">
            ${wf:actionData('getDirInfo')['dir.num-files'] lt 0 || (wf:actionData('getDirInfo')['dir.age'] lt 1 and wf:actionData('getDirInfo')['dir.num-files'] lt 24)}
          </case>
          <!-- # of files >= 24 -->
          <case to="ingest">
            ${wf:actionData('getDirInfo')['dir.num-files'] gt 23 || wf:actionData('getDirInfo')['dir.age'] gt 6}
          </case>
          <default to="sendEmail"/>
        </switch>
      </decision>
      <!--EMAIL-->
      <action name="sendEmail">
        <java>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <main-class>com.navteq.oozie.StandaloneMailer</main-class>
          <arg>probedata2@navteq.com</arg>
          <arg>gregory.titievsky@navteq.com</arg>
          <arg>${inputDir}</arg>
          <arg>${wf:actionData('getDirInfo')['dir.num-files']}</arg>
          <arg>${wf:actionData('getDirInfo')['dir.age']}</arg>
  • 50. Luigi does not have 999 features. Instead, focus on ridiculously little boilerplate code. General, so you can build whatever on top of it. As well as rapid experimentation cycle. Once things work, trivial to put in production.
  • 51. What we use Luigi for: Hadoop Streaming; Java Hadoop MapReduce; Hive; Pig; train machine learning models; import/export data to/from Postgres; insert data into Cassandra; scp/rsync/ftp data files and reports; dump and load databases. Others using it with Scala MapReduce and MRJob as well.
  • 52. Be one of the cool kids!
  • 53. Luigi is open source. Originated at Spotify. Mainly built by me and Elias Freider. Based on many years of experience with data processing. Open source since September 2012. https://github.com/spotify/luigi
  • 54. Future plans! Pig, EC2, Scalding, Cassandra.
  • 55. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi. Oh, and we’re hiring: http://spotify.com/jobs. Erik Bernhardsson, erikbern@spotify.com