Luigi Presentation at OSCON 2013

From OSCON 2013

  1. Batch data processing in Python. Erik Bernhardsson, erikbern@spotify.com
  2. Btw, I’m Erik Bernhardsson. I’m at Spotify in NYC. Previously managed the “Analytics team” in Stockholm. Focusing mostly on music discovery and large-scale machine learning.
  3. Why did we build Luigi? Background: billions of log messages (several TBs) every day; usage and backend stats, debug information. What we want to do: AB-testing, music recommendations, monthly/daily/hourly reporting, business metric dashboards. We experiment a lot and need quick development cycles. We crunch a lot of data.
  4. We like Hadoop. Our second cluster (in 2009).
  5. Our fifth cluster. Long story short :)
  6. But what about running 1000s of jobs every day? Running one job is easy. Lots of long-running processes with dependencies. Need monitoring. Handle failures. Go from experimentation to production easily.
  7. But also non-Hadoop stuff. Most things are Python Map/Reduce jobs. Also Pig, Hive. SCP files from one host to another. Train a machine learning model. Put data in Cassandra.
  8. How not to do workflows: in the pre-Luigi world
  9. Example: Artist Toplist. “Streams” is a list of (username, track, artist, timestamp) tuples. Diagram: Streams, Artist Aggregation, Top 10, Database.
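
The toplist flow on this slide can be sketched in plain Python. This is only an illustration, not code from the talk: the sample tuples and the function names `aggregate_artists` and `top_artists` are made up for the example, and the database step is left out.

```python
from collections import Counter

# Hypothetical sample data: (username, track, artist, timestamp) tuples.
STREAMS = [
    ("alice", "Track A", "Artist X", 1362441600),
    ("bob", "Track B", "Artist Y", 1362441660),
    ("carol", "Track A", "Artist X", 1362441720),
    ("alice", "Track C", "Artist Z", 1362441780),
    ("bob", "Track A", "Artist X", 1362441840),
    ("carol", "Track B", "Artist Y", 1362441900),
]


def aggregate_artists(streams):
    """Count how many streams each artist received."""
    counts = Counter()
    for _username, _track, artist, _timestamp in streams:
        counts[artist] += 1
    return counts


def top_artists(counts, n=10):
    """Return the n most streamed artists as (artist, count) pairs."""
    return counts.most_common(n)


if __name__ == "__main__":
    for artist, count in top_artists(aggregate_artists(STREAMS)):
        print(artist, count)
```

Each function corresponds to one box in the diagram, which is what makes it natural to turn them into separate tasks later.
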
  10. Pre-Luigi example of artist toplists. Don’t do this at home.
  11. OK, so chain the tasks
  12. Cron nicer, yay!
  13. Errors will occur. That’s OK, but don’t leave broken data somewhere (btw, Luigi gives you atomic file operations locally and in HDFS).
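
The slides do not show how Luigi implements atomic file operations. A common generic technique for local files is to write to a temporary file and rename it into place; the sketch below assumes that approach, and the helper name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
        # os.replace swaps the file in as a single step on POSIX filesystems.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, drop the partial temporary file and leave path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the write fails halfway, the destination either keeps its old content or does not exist at all, so a later step never picks up broken data.
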
  14. Don’t run things twice. The second step fails, you fix it, then you want to resume.
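
One simple way to get resumable steps, assuming each step writes exactly one output file, is to skip any step whose output already exists. The helper `run_step` below is a hypothetical sketch of that idea, not Luigi code.

```python
import os
import tempfile


def run_step(output_path, compute):
    """Run compute() only if output_path does not exist yet.

    Returns True if the step ran and False if it was skipped.
    """
    if os.path.exists(output_path):
        return False
    # Write to a temporary name first so a crash cannot leave a
    # partial file that would look like a finished step.
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as out:
        out.write(compute())
    os.replace(tmp_path, output_path)
    return True


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "aggregated.txt")
    print(run_step(target, lambda: "Artist X\t3\n"))  # first call runs the step
    print(run_step(target, lambda: "Artist X\t3\n"))  # second call is skipped
```
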
  15. Parametrize tasks: to use data flows as command line tools
  16. Put tasks in loops: you want to run the dataflow for a set of similar inputs
  17. Plumbing sucks
  18. Plumbing sucks... Graph algorithms rock!
  19. Who’s the world’s second most famous plumber? Hint: he wears green
  20. Introducing Luigi: a Python framework for data flow definition and execution
  21. Main features. Luigi is “kind of like Makefile” in Python, on steroids and PCP ...with a toolbox of mainly Hadoop related stuff. Simple dependency definitions; emphasis on Hadoop/HDFS integration; atomic file operations; data flow visualization; command line integration.
  22. Luigi Task
  23. Luigi - Aggregate Artists
  24. Luigi - Aggregate Artists. Run on the command line:
      $ python dataflow.py AggregateArtists
      DEBUG: Checking if AggregateArtists() is complete
      INFO: Scheduled AggregateArtists()
      DEBUG: Checking if Streams() is complete
      INFO: Done scheduling tasks
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 1
      INFO: [pid 74375] Running AggregateArtists()
      INFO: [pid 74375] Done AggregateArtists()
      DEBUG: Asking scheduler for work...
      INFO: Done
      INFO: There are no more tasks to run at this time
  25. Completing the toplist. Top 10 artists - wrapped arbitrary Python code
  26. Database support. Basic functionality for exporting to Postgres. Cassandra support is in the works.
  27. Running it all...
      DEBUG: Checking if ArtistToplistToDatabase() is complete
      INFO: Scheduled ArtistToplistToDatabase()
      DEBUG: Checking if Top10Artists() is complete
      INFO: Scheduled Top10Artists()
      DEBUG: Checking if AggregateArtists() is complete
      INFO: Scheduled AggregateArtists()
      DEBUG: Checking if Streams() is complete
      INFO: Done scheduling tasks
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 3
      INFO: [pid 74811] Running AggregateArtists()
      INFO: [pid 74811] Done AggregateArtists()
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 2
      INFO: [pid 74811] Running Top10Artists()
      INFO: [pid 74811] Done Top10Artists()
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 1
      INFO: [pid 74811] Running ArtistToplistToDatabase()
      INFO: Done writing, importing at 2013-03-13 15:41:09.407138
      INFO: [pid 74811] Done ArtistToplistToDatabase()
      DEBUG: Asking scheduler for work...
      INFO: Done
      INFO: There are no more tasks to run at this time
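
The order in which tasks run in this log follows from the dependency chain. The toy model below is a sketch of that idea only: the `Task` base class and `build` function are simplified stand-ins written for this example, not Luigi's actual classes or scheduler, and task completion is tracked in memory instead of by checking output files.

```python
class Task:
    """Toy stand-in for a workflow task; not Luigi's actual Task class."""

    done = set()  # names of tasks whose output already "exists"

    def requires(self):
        return []

    def complete(self):
        return type(self).__name__ in Task.done

    def run(self):
        Task.done.add(type(self).__name__)


class Streams(Task):
    def complete(self):
        return True  # assumed to be external input that is already there


class AggregateArtists(Task):
    def requires(self):
        return [Streams()]


class Top10Artists(Task):
    def requires(self):
        return [AggregateArtists()]


class ArtistToplistToDatabase(Task):
    def requires(self):
        return [Top10Artists()]


def build(task, log=None):
    """Run incomplete dependencies depth-first, then the task itself."""
    log = [] if log is None else log
    if task.complete():
        return log
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)
    return log


if __name__ == "__main__":
    print(build(ArtistToplistToDatabase()))
```

Calling `build` a second time on the same task returns an empty list, because every task already reports itself as complete.
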
  28. The results. Imagine how cool this would be with real data...
  29. Task Parameters. Class variables with some magic. Tasks have implicit __init__. Generates command line interface with typing and documentation. $ python dataflow.py AggregateArtists --date 2013-03-05
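
To give a feel for what a typed command line interface involves, here is a rough hand-written equivalent using the standard library's `argparse`. It is only a sketch of the kind of interface shown on the slide: Luigi generates this from the task's parameters, whereas `build_parser` and `parse_date` here are hypothetical helpers written by hand.

```python
import argparse
import datetime


def parse_date(value):
    """Convert a YYYY-MM-DD string into a datetime.date."""
    return datetime.datetime.strptime(value, "%Y-%m-%d").date()


def build_parser():
    """Build a parser that accepts a task name and a typed --date option."""
    parser = argparse.ArgumentParser(prog="dataflow.py")
    parser.add_argument("task", help="name of the task to run")
    parser.add_argument(
        "--date", type=parse_date, help="day to process, as YYYY-MM-DD"
    )
    return parser


if __name__ == "__main__":
    # Same arguments as the command line shown on the slide.
    args = build_parser().parse_args(
        ["AggregateArtists", "--date", "2013-03-05"]
    )
    print(args.task, args.date)
```
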
  30. Task Parameters: combined usage example
  31. Task templates and targets. Luigi comes with a toolbox of abstract Tasks for... running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files; running Hive and (soon) Pig queries; inserting data sets into Postgres; ...how to run anything, really. Writing new ones is as easy as defining an interface and implementing run().
  32. Hadoop MapReduce: built-in Hadoop Streaming Python framework. Features: tiny interface, just implement mapper and reducer; fetches error logs from Hadoop cluster and displays them to the user; class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins; easy to send along Python modules that might not be installed on the cluster; support for counters, secondary sort, combiners, distributed cache, etc.; runs on CPython so you can use your favorite libs (numpy, pandas etc.).
  33. Hadoop MapReduce: built-in Hadoop Streaming Python framework
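
The code on this slide did not survive in the transcript. As a stand-in, the sketch below shows the general mapper/reducer shape for the artist count, simulated locally in one process. It does not use Luigi's Hadoop classes; the tab-separated input format and the `run_local` helper are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (artist, 1) for one tab-separated stream record."""
    _username, _track, artist, _timestamp = line.rstrip("\n").split("\t")
    yield artist, 1


def reducer(key, values):
    """Sum the counts for one artist."""
    yield key, sum(values)


def run_local(lines):
    """Simulate the map, sort, and reduce phases in a single process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    # Hadoop sorts by key between map and reduce; do the same here.
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _key, value in group)))
    return results


if __name__ == "__main__":
    sample = [
        "alice\tTrack A\tArtist X\t1362441600\n",
        "bob\tTrack B\tArtist Y\t1362441660\n",
        "carol\tTrack A\tArtist X\t1362441720\n",
    ]
    for artist, count in run_local(sample):
        print(artist, count)
```
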
  34. More features
  35. Luigi’s “visualiser”
  36. Dive into any task
  37. Multiple workers: basic multi-processing. $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W08
  38. Error notifications: great for automated execution
  39. Process Synchronization: prevents two identical tasks from running simultaneously. Diagram: Luigi worker 1, Luigi worker 2, tasks A B C and A C F, Luigi central planner.
  40. Process Synchronization ...what happens. Diagram: Luigi worker 1, Luigi worker 2, tasks A B C and A C F.
  41. Process Synchronization ...what happens. Diagram: Luigi worker 1, Luigi worker 2, tasks A B C and A C F.
  42. Process Synchronization ...what happens. Diagram: Luigi worker 1, Luigi worker 2, tasks A B C and A C F.
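
The slides only show the central planner as a diagram. A minimal sketch of the underlying idea, assuming workers ask a shared planner before starting a task, could look like this. `CentralPlanner`, `try_claim`, and `mark_done` are hypothetical names, and a real planner would sit behind a network service rather than a thread lock.

```python
import threading


class CentralPlanner:
    """Toy planner that hands each task to at most one worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()
        self._done = set()

    def try_claim(self, task_id):
        """Return True if the caller may run task_id now."""
        with self._lock:
            if task_id in self._running or task_id in self._done:
                return False
            self._running.add(task_id)
            return True

    def mark_done(self, task_id):
        """Record that task_id finished so nobody runs it again."""
        with self._lock:
            self._running.discard(task_id)
            self._done.add(task_id)


if __name__ == "__main__":
    planner = CentralPlanner()
    print(planner.try_claim("A"))  # worker 1 asks for task A
    print(planner.try_claim("A"))  # worker 2 asks for the same task
```
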
  43. Large data flows (screenshot from web interface)
  44. Things Luigi is not
  45. Luigi is not trying to replace mrjob. Yes, you can run Python Hadoop jobs in Luigi. But the main focus is workflow management.
  46. Luigi does not give you scalability. You still need to figure out how each task runs.
  47. Luigi does not help you transform the data. Mapreduce/Pig/Hive/etc are wonderful tools for doing this and Luigi is more than happy to delegate it to them.
  48. ...but it’s sort of like Oozie. Although Oozie is kind of annoying. Comparison table (columns: Oozie, Luigi): Only Hadoop: Yes! Horrible XML: Yes! Easy: Yes! Fun & powerful: Yes!
  49. “Oozie example”
      <workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>
        <start to='getDirInfo' />
        <!-- STEP ONE -->
        <action name='getDirInfo'>
          <!--writes 2 properties:
              dir.num-files: returns -1 if dir doesn't exist, otherwise returns # of files in dir
              dir.age: returns -1 if dir doesn't exist, otherwise returns age of dir in days -->
          <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.navteq.oozie.GetDirInfo</main-class>
            <arg>${inputDir}</arg>
            <capture-output />
          </java>
          <ok to="makeIngestDecision" />
          <error to="fail" />
        </action>
        <!-- STEP TWO -->
        <decision name="makeIngestDecision">
          <switch>
            <!-- empty or doesn't exist -->
            <case to="end"> ${wf:actionData('getDirInfo')['dir.num-files'] lt 0 || (wf:actionData('getDirInfo')['dir.age'] lt 1 and wf:actionData('getDirInfo')['dir.num-files'] lt 24)} </case>
            <!-- # of files >= 24 -->
            <case to="ingest"> ${wf:actionData('getDirInfo')['dir.num-files'] gt 23 || wf:actionData('getDirInfo')['dir.age'] gt 6} </case>
            <default to="sendEmail"/>
          </switch>
        </decision>
        <!--EMAIL-->
        <action name="sendEmail">
          <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.navteq.oozie.StandaloneMailer</main-class>
            <arg>probedata2@navteq.com</arg>
            <arg>gregory.titievsky@navteq.com</arg>
            <arg>${inputDir}</arg>
            <arg>${wf:actionData('getDirInfo')['dir.num-files']}</arg>
            <arg>${wf:actionData('getDirInfo')['dir.age']}</arg>
  50. Luigi does not have 999 features. Instead, focus on ridiculously little boilerplate code. General so you can build whatever on top of it. As well as rapid experimentation cycle. Once things work, trivial to put in production.
  51. What we use Luigi for: Hadoop Streaming; Java Hadoop MapReduce; Hive; Pig; train machine learning models; import/export data to/from Postgres; insert data into Cassandra; scp/rsync/ftp data files and reports; dump and load databases. Others using it with Scala MapReduce and MRJob as well.
  52. Be one of the cool kids!
  53. Luigi is open source. Originated at Spotify. Mainly built by me and Elias Freider. Based on many years of experience with data processing. Open source since September 2012. https://github.com/spotify/luigi
  54. Future plans! Pig, EC2, Scalding, Cassandra
  55. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi Oh, and we’re hiring: http://spotify.com/jobs Erik Bernhardsson, erikbern@spotify.com
