Building Scientific Workflows with Spotify's Luigi

Samuel Lampa, PhD Student in Pharmaceutical Bioinformatics
Building workflows with Spotify's Luigi
Workshop for e-Infrastructures for Massively Parallel Sequencing
Uppsala, Jan 19-20, 2015
Samuel Lampa
UU/Dept of Pharm. Biosci. & BILS
samuel.lampa@bils.se
samuel.lampa.co
Luigi Short Facts
● Batch (no streaming)
● Both command-line and Hadoop execution
● Does not replace Hive, Pig, Cascading etc.; instead used to stitch many such jobs together
● Powers thousands of jobs every day at Spotify
● Communication / synchronization between tasks via “targets” (normal file, HDFS, database …)
● Dependencies hard-coded in tasks
● Pull-based (ask for a task, and it figures out the dependencies)
● Main features: scheduling, dependency graph, basic multi-node support (via central planner daemon)
● github.com/spotify/luigi
● Easy to install: pip install luigi [tornado]
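The "pull-based" point can be illustrated without Luigi itself. The following is a minimal sketch in plain Python; the names `Task` and `build` are hypothetical stand-ins and not Luigi's API. Only the final task is requested, and the resolver walks back through its requirements, running whatever is not yet done.

```python
class Task:
    """Hypothetical stand-in for a workflow task (not Luigi's API)."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        # A real system would check for an existing output target instead.
        self.done = False

    def run(self, log):
        log.append(self.name)
        self.done = True


def build(task, log):
    """Pull-based execution: resolve upstream tasks first, then run this one."""
    if task.done:
        return
    for upstream in task.requires:
        build(upstream, log)
    task.run(log)


raw = Task("raw")
cleaned = Task("cleaned", requires=[raw])
report = Task("report", requires=[cleaned])

order = []
build(report, order)  # only the final task is asked for
print(order)
```

Asking for `report` a second time does nothing, since every task in the chain is already marked as done.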
A Basic Luigi Task
Calling tasks from the commandline
Defining workflows: Default way
Dependencies: Default way
[Figure: dependency graph with four levels: TaskA(); TaskB(); TaskBA(), TaskBB(); TaskBAA(), TaskBAB(), TaskBBA(), TaskBBB()]
Parameters: Default way
[Figure: the same task graph (TaskA(); TaskB(); TaskBA(), TaskBB(); TaskBAA(), TaskBAB(), TaskBBA(), TaskBBB()), with labels "Dependency" and "Parameter"]
Growing list of parameters
Dependencies and parameters: A better way
[Figure: the same task graph with AWorkflowTask() added: AWorkflowTask(); TaskA(); TaskB(); TaskBA(), TaskBB(); TaskBAA(), TaskBAB(), TaskBBA(), TaskBBB()]
Supporting separate network definition
Using this functionality in tasks
Using this new functionality in workflows, and wrapping whole workflows in Tasks
Unit testing Luigi tasks is easy
Lessons Learned (April 2014)
● Only “pull” or “call/return” semantics, no “push” or “auto-triggering by inputs”
● Not fully separate workflow definition
● Multiple inputs / outputs somewhat error-prone
● Dynamically typed language
● Watch out for max process limits
● No real distribution of compute or data (only common scheduling)
Lessons Learned (Update Jan 2015)
● Only “pull” or “call/return” semantics, no “push” or “auto-triggering by inputs”
● Not fully separate workflow definition
  – We found a way around this.
● Multiple inputs / outputs somewhat error-prone
  – We're looking into solving this too.
● Dynamically typed language
  – We've found that unit testing is easy.
● Watch out for max process limits
● No real distribution of compute or data (only common scheduling)
  – We have found ourselves running everything through SLURM anyway.
Thank you!