Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Building Scientific Workflows with Spotify's Luigi

2,616 views

Published on

Experiences from Building Scientific (Bioinformatics) workflows with Spotify's Luigi tool at the Workshop for e-Infrastructures for Massively Parallel Sequencing
Uppsala, Jan 19-20, 2015 (uppnex.se/twiki/do/view/Courses/EinfraMPS2015)

Accompanying Video Available at: https://youtu.be/f26PqSXZdWM

Published in: Science
  • Be the first to comment

Building Scientific Workflows with Spotify's Luigi

  1. 1. Building workflows with Spotify's Workshop for e-Infrastructures for Massively Parallel Sequencing Uppsala, Jan 19-20, 2015 Samuel Lampa UU/Dept of Pharm. Biosci. & BILS samuel.lampa@bils.se samuel.lampa.co
  2. 2. Luigi Short Facts ● Batch (No streaming) ● Both Command line and Hadoop execution ● Does not replace Hive, Pig, Cascading etc - instead used to stitch many such jobs together ● Powers 1000:s of jobs every day at Spotify ● Communication / synchronization between tasks via “targets” (Normal file, HDFS, database …) ● Dependencies hard coded in tasks ● Pull based (ask for task, and it figures out dependencies) ● Main features: Scheduling, Dependency Graph, Basic multi-node support (via central planner daemon) ● github.com/spotify/luigi ● Easy to install: pip install luigi [tornado]
  3. 3. A Basic Luigi Task
  4. 4. A Basic Luigi Task
  5. 5. Calling tasks from the commandline
  6. 6. Defining workflows: Default way
  7. 7. Dependencies: Default way TaskA() TaskB() TaskBA() TaskBB() TaskBAA() TaskBBB()TaskBAB() TaskBBA()
  8. 8. Parameters: Default way TaskA() TaskB() TaskBA() TaskBB() TaskBAA() TaskBBB()TaskBAB() TaskBBA() DependencyParameter
  9. 9. Growing list of parameters
  10. 10. Dependencies and parameters: A better way AWorkflowTask() TaskA() TaskB() TaskBA() TaskBB() TaskBAA() TaskBBB()TaskBAB() TaskBBA()
  11. 11. Supporting separate network definition
  12. 12. Using this functionality in tasks
  13. 13. Using this new functionality in workflows, and wrapping whole workflows in Tasks
  14. 14. Unit testing Luigi tasks is easy
  15. 15. Lessons Learned (April 2014) ● Only “pull” or “call/return” semantics, no “push” or “auto triggering by inputs” ● Not fully separate workflow ● Multiple inputs / outputs somewhat error-prone ● Dynamic typed language ● Check out for max processes limits ● No real distribution of compute or data (only common scheduling)
  16. 16. Lessons Learned (Update Jan 2015) ● Only “pull” or “call/return” semantics, no “push” or “auto triggering by inputs” ● Not fully separate workflow def. – We found a way around this. ● Multiple inputs / outputs somewhat error-prone – We're looking into solving this too. ● Dynamic typed language – We've found that unit testing is easy. ● Check out for max processes limits ● No real distribution of compute or data (only common scheduling) – We have found ourselves running everything through SLURM anyway
  17. 17. Thank you!

×