Luigi Presentation at OSCON 2013

From OSCON 2013

  1. Batch data processing in Python. Erik Bernhardsson, erikbern@spotify.com
  2. Btw, I’m Erik Bernhardsson. I’m at Spotify in NYC. Previously managed the “Analytics team” in Stockholm. Focusing mostly on music discovery and large-scale machine learning.
  3. Why did we build Luigi? We crunch a lot of data. Background: billions of log messages (several TBs) every day; usage and backend stats, debug information. What we want to do: AB-testing, music recommendations, monthly/daily/hourly reporting, business metric dashboards. We experiment a lot, so we need quick development cycles.
  4. We like Hadoop. Our second cluster (in 2009):
  5. Long story short :) Our fifth cluster
  6. But what about running 1000s of jobs every day? Running one job is easy. Lots of long-running processes with dependencies. Need monitoring. Handle failures. Go from experimentation to production easily.
  7. But also non-Hadoop stuff. Most things are Python Map/Reduce jobs. Also Pig, Hive. SCP files from one host to another. Train a machine learning model. Put data in Cassandra.
  8. How not to do workflows. In the pre-Luigi world.
  9. Example: Artist Toplist. “Streams” is a list of (username, track, artist, timestamp) tuples. Streams -> Artist Aggregation -> Top 10 -> Database
  10. Pre-Luigi example of artist toplists. Don’t do this at home.
  11. OK, so chain the tasks
  12. Cron nicer, yay!
  13. Errors will occur. That’s OK, but don’t leave broken data somewhere (btw, Luigi gives you atomic file operations locally and in HDFS).
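The "don't leave broken data somewhere" point on slide 13 can be illustrated with plain Python. This is a minimal sketch of one common way to get an atomic local write (temp file plus rename); it is not Luigi's own code, and the name `atomic_write` is made up for illustration.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text so that path holds either the old content or the new, never half of it."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
        # The rename is the only step that makes the data visible under its final name.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, discard the partial temp file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the write fails halfway, the final path is never created, so a later step cannot mistake a partial file for finished output.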
  14. Don’t run things twice. The second step fails, you fix it, then you want to resume.
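The resume behaviour from slide 14 boils down to "skip a step whose output already exists". A minimal plain-Python sketch of that idea, assuming each step writes exactly one output file; `run_pipeline` and the `(output_path, action)` pairs are hypothetical names, not part of Luigi.

```python
import os


def run_pipeline(steps):
    """Run ordered (output_path, action) steps, skipping those already done.

    Returns the list of output paths produced by this invocation.
    """
    produced = []
    for output_path, action in steps:
        if os.path.exists(output_path):
            continue  # finished in an earlier run, do not repeat it
        action(output_path)
        produced.append(output_path)
    return produced
```

Rerunning the same pipeline after fixing a broken step then only executes the steps that have no output yet.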
  15. Parametrize tasks. To use data flows as command line tools.
  16. Put tasks in loops. You want to run the data flow for a set of similar inputs.
  17. Plumbing sucks
  18. Plumbing sucks... Graph algorithms rock!
  19. Who’s the world’s second most famous plumber? Hint: he wears green.
  20. Introducing Luigi. A Python framework for data flow definition and execution.
  21. Luigi is “kind of like Makefile” in Python, on steroids and PCP ...with a toolbox of mainly Hadoop-related stuff. Main features: simple dependency definitions; emphasis on Hadoop/HDFS integration; atomic file operations; data flow visualization; command line integration.
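To make the "kind of like Makefile" comparison concrete, here is a small plain-Python sketch of Makefile-style dependency resolution. It only imitates the idea; it is not Luigi's implementation or API. The `Task` base class, the `build` function, the file names, and the three sample stream rows are all assumptions made for illustration.

```python
import os
from collections import Counter


class Task:
    """Hypothetical base class: a task names its output file and its prerequisites."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # Like a Makefile target: done if the output file exists.
        return os.path.exists(self.output())


class Streams(Task):
    def output(self):
        return os.path.join(self.workdir, "streams.tsv")

    def run(self):
        # Made-up sample data: (username, track, artist, timestamp) rows.
        rows = [("alice", "song1", "ArtistA", "2013-03-05"),
                ("bob", "song2", "ArtistA", "2013-03-05"),
                ("carol", "song3", "ArtistB", "2013-03-05")]
        with open(self.output(), "w") as out_file:
            for row in rows:
                out_file.write("\t".join(row) + "\n")


class AggregateArtists(Task):
    def requires(self):
        return [Streams(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "artist_counts.tsv")

    def run(self):
        counts = Counter()
        with open(self.requires()[0].output()) as in_file:
            for line in in_file:
                username, track, artist, timestamp = line.rstrip("\n").split("\t")
                counts[artist] += 1
        with open(self.output(), "w") as out_file:
            for artist, count in sorted(counts.items()):
                out_file.write("%s\t%d\n" % (artist, count))


def build(task, log=None):
    """Depth-first build: prerequisites first, completed tasks skipped."""
    log = [] if log is None else log
    if task.complete():
        return log
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)
    return log
```

Calling `build(AggregateArtists(some_directory))` runs `Streams` first and then `AggregateArtists`; a second call finds the output in place and runs nothing.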
  22. Luigi Task
  23. Luigi - AggregateArtists
  24. Luigi - AggregateArtists. Run on the command line:
      $ python dataflow.py AggregateArtists
      DEBUG: Checking if AggregateArtists() is complete
      INFO: Scheduled AggregateArtists()
      DEBUG: Checking if Streams() is complete
      INFO: Done scheduling tasks
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 1
      INFO: [pid 74375] Running AggregateArtists()
      INFO: [pid 74375] Done AggregateArtists()
      DEBUG: Asking scheduler for work...
      INFO: Done
      INFO: There are no more tasks to run at this time
  25. Completing the top list. Top 10 artists - wrapped arbitrary Python code.
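The "arbitrary Python code" for the top 10 step could look something like the sketch below. It assumes the aggregation step emits tab-separated `artist<TAB>count` lines; that format and the function name `top_artists` are assumptions, since the slide's code is not part of this transcript.

```python
import heapq


def top_artists(count_lines, n=10):
    """Return the n (artist, count) pairs with the highest counts, largest first."""
    pairs = []
    for line in count_lines:
        artist, count = line.rstrip("\n").split("\t")
        pairs.append((artist, int(count)))
    # nlargest avoids sorting the full list when only a short toplist is needed.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])
```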
  26. Database support. Basic functionality for exporting to Postgres. Cassandra support is in the works.
  27. Running it all...
      DEBUG: Checking if ArtistToplistToDatabase() is complete
      INFO: Scheduled ArtistToplistToDatabase()
      DEBUG: Checking if Top10Artists() is complete
      INFO: Scheduled Top10Artists()
      DEBUG: Checking if AggregateArtists() is complete
      INFO: Scheduled AggregateArtists()
      DEBUG: Checking if Streams() is complete
      INFO: Done scheduling tasks
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 3
      INFO: [pid 74811] Running AggregateArtists()
      INFO: [pid 74811] Done AggregateArtists()
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 2
      INFO: [pid 74811] Running Top10Artists()
      INFO: [pid 74811] Done Top10Artists()
      DEBUG: Asking scheduler for work...
      DEBUG: Pending tasks: 1
      INFO: [pid 74811] Running ArtistToplistToDatabase()
      INFO: Done writing, importing at 2013-03-13 15:41:09.407138
      INFO: [pid 74811] Done ArtistToplistToDatabase()
      DEBUG: Asking scheduler for work...
      INFO: Done
      INFO: There are no more tasks to run at this time
  28. The results. Imagine how cool this would be with real data...
  29. Task Parameters. Class variables with some magic. Tasks have implicit __init__. Generates command line interface with typing and documentation. $ python dataflow.py AggregateArtists --date 2013-03-05
  30. Task Parameters. Combined usage example.
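One way "class variables with some magic" can yield an implicit `__init__` and a typed command line interface is sketched below with the standard library only. This is an illustration of the mechanism, not Luigi's parameter classes; `Parameter`, `Task`, `from_command_line`, and `parse_date` are hypothetical names.

```python
import argparse
import datetime


class Parameter:
    """Hypothetical parameter declaration: a parser function plus help text."""

    def __init__(self, parse=str, description=""):
        self.parse = parse
        self.description = description


def parse_date(text):
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


class Task:
    def __init__(self, **values):
        # The implicit constructor: one keyword argument per declared Parameter.
        for name in self.param_names():
            setattr(self, name, values[name])

    @classmethod
    def param_names(cls):
        # Only looks at the class itself, which is enough for this sketch.
        return [name for name, value in vars(cls).items()
                if isinstance(value, Parameter)]

    @classmethod
    def from_command_line(cls, argv):
        # Derive one --option per declared Parameter, with its type and help text.
        parser = argparse.ArgumentParser(prog=cls.__name__)
        for name in cls.param_names():
            declaration = vars(cls)[name]
            parser.add_argument("--" + name, type=declaration.parse,
                                help=declaration.description, required=True)
        return cls(**vars(parser.parse_args(argv)))


class AggregateArtists(Task):
    date = Parameter(parse=parse_date, description="day to aggregate")
```

With this sketch, `AggregateArtists.from_command_line(["--date", "2013-03-05"])` returns a task whose `date` attribute is a `datetime.date`, mirroring the command line shown on slide 29.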
  31. Task templates and targets. Luigi comes with a toolbox of abstract Tasks for... running Hadoop MapReduce utilizing Hadoop Streaming or custom jar files; running Hive and (soon) Pig queries; inserting data sets into Postgres; ...how to run anything, really. Writing new ones is as easy as defining an interface and implementing run().
  32. Hadoop MapReduce. Built-in Hadoop Streaming Python framework. Features: Tiny interface - just implement mapper and reducer. Fetches error logs from Hadoop cluster and displays them to the user. Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map-side joins. Easy to send along Python modules that might not be installed on the cluster. Support for counters, secondary sort, combiners, distributed cache, etc. Runs on CPython so you can use your favorite libs (numpy, pandas etc.).
  33. Hadoop MapReduce. Built-in Hadoop Streaming Python framework.
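The "just implement mapper and reducer" interface can be tried out without a cluster. The sketch below imitates map, shuffle/sort, and reduce in a single process; it is not Luigi's Hadoop job class, and `ArtistCount` and `run_locally` are made-up names. The input format follows the (username, track, artist, timestamp) tuples from slide 9, assumed here to be tab-separated.

```python
from itertools import groupby
from operator import itemgetter


class ArtistCount:
    """Hypothetical job: count streams per artist."""

    def mapper(self, line):
        username, track, artist, timestamp = line.rstrip("\n").split("\t")
        yield artist, 1

    def reducer(self, key, values):
        yield key, sum(values)


def run_locally(job, lines):
    """Imitate map, shuffle/sort, and reduce in one process."""
    mapped = [pair for line in lines for pair in job.mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle and sort
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(job.reducer(key, (value for _, value in group)))
    return results
```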
  34. More features
  35. Luigi’s “visualiser”
  36. Dive into any task
  37. Multiple workers. Basic multi-processing. $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W08
  38. Error notifications. Great for automated execution.
  39. Process synchronization. Prevents two identical tasks from running simultaneously. (Diagram: Luigi worker 1 and Luigi worker 2 with tasks A, B, C and A, C, F; Luigi central planner.)
  40. Process synchronization: ...what happens. (Diagram: Luigi worker 1 and Luigi worker 2 with tasks A, B, C and A, C, F.)
  41. Process synchronization: ...what happens. (Diagram: Luigi worker 1 and Luigi worker 2 with tasks A, B, C and A, C, F.)
  42. Process synchronization: ...what happens. (Diagram: Luigi worker 1 and Luigi worker 2 with tasks A, B, C and A, C, F.)
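The idea of a central planner that keeps two workers from running the same task can be sketched with a lock-protected set. This is a simplified illustration using threads in one process, not Luigi's scheduler; `CentralPlanner`, `claim`, and `worker` are hypothetical names, and the assignment of task lists A, B, C and A, C, F to the two workers is assumed from the diagram labels.

```python
import threading


class CentralPlanner:
    """Hypothetical planner: hands each task id to at most one worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = set()

    def claim(self, task_id):
        """Return True only for the first worker that asks for task_id."""
        with self._lock:
            if task_id in self._claimed:
                return False
            self._claimed.add(task_id)
            return True


def worker(planner, task_ids, executed):
    for task_id in task_ids:
        if planner.claim(task_id):
            executed.append(task_id)  # stand-in for actually running the task


def run_two_workers():
    planner = CentralPlanner()
    executed = []
    threads = [threading.Thread(target=worker, args=(planner, ids, executed))
               for ids in (["A", "B", "C"], ["A", "C", "F"])]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return executed
```

Whichever worker asks first gets a shared task such as A or C; the other skips it, so each task runs once. A real scheduler would also track running, done, and failed states, which this sketch leaves out.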
  43. Large data flows (screenshot from web interface)
  44. Things Luigi is not
  45. Luigi is not trying to replace mrjob. Yes, you can run Python Hadoop jobs in Luigi. But the main focus is workflow management.
  46. Luigi does not give you scalability. You still need to figure out how each task runs.
  47. Luigi does not help you transform the data. MapReduce/Pig/Hive/etc. are wonderful tools for doing this and Luigi is more than happy to delegate it to them.
  48. ...but it’s sort of like Oozie. Although Oozie is kind of annoying. (Table with columns Oozie and Luigi; rows: Only Hadoop: Yes! Horrible XML: Yes! Easy: Yes! Fun & powerful: Yes!)
  49. “Oozie example”
      <workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>
        <start to='getDirInfo' />
        <!-- STEP ONE -->
        <action name='getDirInfo'>
          <!--writes 2 properties:
              dir.num-files: returns -1 if dir doesn't exist, otherwise returns # of files in dir
              dir.age: returns -1 if dir doesn't exist, otherwise returns age of dir in days -->
          <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.navteq.oozie.GetDirInfo</main-class>
            <arg>${inputDir}</arg>
            <capture-output />
          </java>
          <ok to="makeIngestDecision" />
          <error to="fail" />
        </action>
        <!-- STEP TWO -->
        <decision name="makeIngestDecision">
          <switch>
            <!-- empty or doesn't exist -->
            <case to="end"> ${wf:actionData('getDirInfo')['dir.num-files'] lt 0 || (wf:actionData('getDirInfo')['dir.age'] lt 1 and wf:actionData('getDirInfo')['dir.num-files'] lt 24)} </case>
            <!-- # of files >= 24 -->
            <case to="ingest"> ${wf:actionData('getDirInfo')['dir.num-files'] gt 23 || wf:actionData('getDirInfo')['dir.age'] gt 6} </case>
            <default to="sendEmail"/>
          </switch>
        </decision>
        <!--EMAIL-->
        <action name="sendEmail">
          <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.navteq.oozie.StandaloneMailer</main-class>
            <arg>probedata2@navteq.com</arg>
            <arg>gregory.titievsky@navteq.com</arg>
            <arg>${inputDir}</arg>
            <arg>${wf:actionData('getDirInfo')['dir.num-files']}</arg>
            <arg>${wf:actionData('getDirInfo')['dir.age']}</arg>
  50. Luigi does not have 999 features. Instead, focus on ridiculously little boilerplate code. General so you can build whatever on top of it. As well as rapid experimentation cycle. Once things work, trivial to put in production.
  51. What we use Luigi for: Hadoop Streaming; Java Hadoop MapReduce; Hive; Pig; train machine learning models; import/export data to/from Postgres; insert data into Cassandra; scp/rsync/ftp data files and reports; dump and load databases. Others using it with Scala MapReduce and MRJob as well.
  52. Be one of the cool kids!
  53. Luigi is open source. https://github.com/spotify/luigi Originated at Spotify. Mainly built by me and Elias Freider. Based on many years of experience with data processing. Open source since September 2012.
  54. Future plans! • Pig • EC2 • Scalding • Cassandra
  55. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi Oh, and we’re hiring - http://spotify.com/jobs Erik Bernhardsson, erikbern@spotify.com
