SlideShare a Scribd company logo
1 of 21
Snakemake
Phil Ashton, PGI Advanced Training - KEMRI Wellcome, April 2022
Why are we going to talk to you
about workflow managers?
Advantages of workflow managers
- Each job/task can run within its own environment/container
- Failures handled elegantly - and re-starting.
- Portable - can send it to collaborator and it should work (?)
- Scalable - same code can run on your laptop or an HPC with minor changes
- Efficient - parallelisable jobs run in parallel.
Workflow managers
● Snakemake
● Nextflow
● Cromwell
● Galaxy
● Ruffus
● BPipe
● …
Choosing which one to use is a
trade-off between learning curve,
feature richness, and ease of use.
One way to think about Snakemake scripts
1. What is the final thing you want out of this pipeline?
2. What is the process that will give you that thing?
3. What are the inputs for that process?
4. Where do those inputs come from?
Input files
One way to think about Snakemake scripts
Assembly
Trimmed reads
Raw reads
Process: Assembly
Input: Trimmed Reads
Output: assembly.fasta
Process: read trimming
Input: raw reads
Output: trimmed reads Given as files
Process: Gather assemblies
Input: assembly.fasta
=
Snakemake
● Based on Python & Make
● Split your workflow into separate “rules”
○ One rule per process normally
● Rules have:
○ A rule name
○ Input files
○ Output files
○ A command to turn the input into the output
An example Snakemake script
rule bwa_map:
input: "data/genome.fa", "data/samples/A.fastq"
output: "mapped_reads/A.bam"
shell: "bwa mem {input} | samtools view -Sb - > {output}"
Running Snakemake scripts
If the contents of the previous slide is saved as “my_mapping_pipeline.smk” then to
run it we would do:
`snakemake -s my_mapping_pipeline.smk`
Alternatively you can save the contents in a file called `Snakefile` and then just
run `snakemake` in the directory containing `Snakefile` and it will run the script.
A generalised Snakemake script
rule bwa_map:
input: "data/genome.fa", "data/samples/{sample}.fastq"
output: "mapped_reads/{sample}.bam"
shell: "bwa mem {input} | samtools view -Sb - > {output}"
● Snakemake uses “named wildcards”, in this case `{sample}`.
● In this case snakemake would find all the files matching the pattern
`data/samples/{sample}.fastq` and run the rule `bwa_map` on them
● The `{sample}` wildcard will “propagate” through to the output, so the output
bam name will match the input fastq name.
A more useful generalised Snakemake script
samples = [‘sample1’, ‘sample2’]
rule bwa_map:
input: ref_genome = "data/genome.fa", fastqs =
expand("data/samples/{sample}.fastq", sample = samples)
output: "mapped_reads/{sample}.bam"
shell: "bwa mem {input} | samtools view -Sb - > {output}"
The `expand` function lets you
give a “todo list” to a rule.
A two-step snakemake workflow
rule bbduk:
input: r1 = '{root_dir}/{sample}_1.fastq.gz',r2 = '{root_dir}/{sample}_2.fastq.gz'
output: r1 = '{root_dir}/{sample}_bbduk_1.fq.gz',r2 = '{root_dir}/{sample}_bbduk_2.fq.gz'
shell: 'bbduk.sh in={input.r1} in2={input.r2} out={output.r1} out2={output.r2}'
rule shovill:
input: r1 = rules.bbduk.output.r1, r2 = rules.bbduk.output.r2
output: final = '{root_dir}/{sample}/shovill_bbduk/contigs.fa',
shell: 'shovill --outdir {root_dir}/{wildcards.sample}/shovill -R1 {input.r1} -R2
{input.r2}'
A three-step snakemake workflow
rule all:
input: expand('{root_dir}/{sample}/shovill_bbduk/contigs.fa', sample = todo_list, root_dir =
root_dir)
rule bbduk:
input: r1 = '{root_dir}/{sample}_1.fastq.gz',r2 = '{root_dir}/{sample}_2.fastq.gz'
output: r1 = '{root_dir}/{sample}_bbduk_1.fq.gz',r2 = '{root_dir}/{sample}_bbduk_2.fq.gz'
shell: 'bbduk.sh in={input.r1} in2={input.r2} out={output.r1} out2={output.r2}'
rule shovill:
input: r1 = rules.bbduk.output.r1, r2 = rules.bbduk.output.r2
output: final = '{root_dir}/{sample}/shovill_bbduk/contigs.fa',
shell: 'shovill --outdir {root_dir}/{wildcards.sample}/shovill -R1 {input.r1} -R2 {input.r2}'
Directed acyclic graphs (DAGs)
When snakemake runs, it converts the snakefile into a
directed acyclic graph.
Practical exercise 1 - basic and advanced snakemake
1. Basics: An example workflow — Snakemake 7.3.2 documentation
2. Advanced: Decorating the example workflow — Snakemake 7.3.3
documentation
Using conda with snakemake
rule bwa_map:
input: ref_genome = "data/genome.fa", fastqs =
expand("data/samples/{sample}.fastq", sample = samples)
output: "mapped_reads/{sample}.bam"
conda: "../../envs/bwa.yaml"
shell: "bwa mem {input} | samtools view -Sb - > {output}"
A yaml file defining the conda packages required to run
this rule.
Practical exercise 2 - conda integration
● Modify your script from the basic conda exercise to use conda environments
for each rule
a. Distribution and Reproducibility — Snakemake 7.3.5 documentation
Using snakemake with HPC
● Submitting/managing jobs on an HPC can be a hassle
● Especially when jobs depend on each other
● Snakemake makes it much easier to use an HPC
● If you don’t have an HPC, snakemake can also be configured to use e.g.
Amazon Web Services.
Using snakemake with HPC
$ cat ~/.config/snakemake/salmonella/config.yaml
__default__:
cluster: sbatch
cpus-per-task: 1
mem-per-cpu: 4000
jobs: 100
bbduk:
cpus-per-task: 8
shovill:
cpus-per-task: 8
mem-per-cpu: 6000
snakemake -s ~/scripts/snakemake/salmonella_workflow.smk 
--cluster-config ~/.config/snakemake/salmonella_slurm/config.yaml 
--cluster "sbatch --cpus-per-task={cluster.cpus-per-task} 
--mem-per-cpu={cluster.mem-per-cpu}" 
--jobs 1000
Practical exercise 3 - using snakemake on HPC
● Modify your snakemake script from the basic workflow to execute on the HPC
○ More information - Cluster Execution — Snakemake 7.3.2 documentation
○ You will need to write a config file like on the previous slide, for this example you can just set a
sensible `__default__` for all rules.
○ Then execute using the snakemake command from previous slide as a template.
Acknowledgements & further reading
Anna Price, Cardiff/CLIMB - https://www.youtube.com/watch?v=qORviM_ELdk
Reproducible, scalable, and shareable analysis pipelines with bioinformatics
workflow managers | Nature Methods
Snakemake - Reproducible and Scalable Bioinformatic Workflows
https://snakemake.readthedocs.io/en/stable/tutorial/basics.html

More Related Content

What's hot

PCR and primer design techniques
PCR and primer design techniquesPCR and primer design techniques
PCR and primer design techniquesMesele Tilahun
 
Light Intro to the Gene Ontology
Light Intro to the Gene OntologyLight Intro to the Gene Ontology
Light Intro to the Gene Ontologynniiicc
 
hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)Shaojun Xie
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
フリーソフトで始めるNGS解析_第41・42回勉強会資料
フリーソフトで始めるNGS解析_第41・42回勉強会資料フリーソフトで始めるNGS解析_第41・42回勉強会資料
フリーソフトで始めるNGS解析_第41・42回勉強会資料Amelieff
 
Crispr handbook 2015
Crispr handbook 2015Crispr handbook 2015
Crispr handbook 2015rochonf
 
PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS...
PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS...PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS...
PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS...Equnix Business Solutions
 
WGCNA: an R package for weighted correlation network analysis
WGCNA: an R package for weighted  correlation network analysisWGCNA: an R package for weighted  correlation network analysis
WGCNA: an R package for weighted correlation network analysisAlireza Doustmohammadi
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysismikaelhuss
 
An Overview to Protein bioinformatics
An Overview to Protein bioinformaticsAn Overview to Protein bioinformatics
An Overview to Protein bioinformaticsJoel Ricci-López
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingmikaelhuss
 
IBM Tape the future of tape
IBM Tape the future of tapeIBM Tape the future of tape
IBM Tape the future of tapeJosef Weingand
 
Genome assembly: An Introduction (2016)
Genome assembly: An Introduction (2016)Genome assembly: An Introduction (2016)
Genome assembly: An Introduction (2016)Sebastian Schmeier
 
Overview of Genome Assembly Algorithms
Overview of Genome Assembly AlgorithmsOverview of Genome Assembly Algorithms
Overview of Genome Assembly AlgorithmsNtino Krampis
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
Multiple alignment
Multiple alignmentMultiple alignment
Multiple alignmentavrilcoghlan
 
Our Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.doOur Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.doMetehan Çetinkaya
 

What's hot (20)

Pcr primer design
Pcr primer designPcr primer design
Pcr primer design
 
PCR and primer design techniques
PCR and primer design techniquesPCR and primer design techniques
PCR and primer design techniques
 
Light Intro to the Gene Ontology
Light Intro to the Gene OntologyLight Intro to the Gene Ontology
Light Intro to the Gene Ontology
 
hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
フリーソフトで始めるNGS解析_第41・42回勉強会資料
フリーソフトで始めるNGS解析_第41・42回勉強会資料フリーソフトで始めるNGS解析_第41・42回勉強会資料
フリーソフトで始めるNGS解析_第41・42回勉強会資料
 
Crispr handbook 2015
Crispr handbook 2015Crispr handbook 2015
Crispr handbook 2015
 
PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS...
PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS...PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS...
PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS...
 
WGCNA: an R package for weighted correlation network analysis
WGCNA: an R package for weighted  correlation network analysisWGCNA: an R package for weighted  correlation network analysis
WGCNA: an R package for weighted correlation network analysis
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
An Overview to Protein bioinformatics
An Overview to Protein bioinformaticsAn Overview to Protein bioinformatics
An Overview to Protein bioinformatics
 
PCR Primer desining
PCR Primer desiningPCR Primer desining
PCR Primer desining
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
IBM Tape the future of tape
IBM Tape the future of tapeIBM Tape the future of tape
IBM Tape the future of tape
 
CRISPR Cas 9.pptx
CRISPR Cas 9.pptxCRISPR Cas 9.pptx
CRISPR Cas 9.pptx
 
Genome assembly: An Introduction (2016)
Genome assembly: An Introduction (2016)Genome assembly: An Introduction (2016)
Genome assembly: An Introduction (2016)
 
Overview of Genome Assembly Algorithms
Overview of Genome Assembly AlgorithmsOverview of Genome Assembly Algorithms
Overview of Genome Assembly Algorithms
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Multiple alignment
Multiple alignmentMultiple alignment
Multiple alignment
 
Our Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.doOur Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.do
 

Similar to 2022.03.24 Snakemake.pptx

Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche SparkAlex Thompson
 
HPC and HPGPU Cluster Tutorial
HPC and HPGPU Cluster TutorialHPC and HPGPU Cluster Tutorial
HPC and HPGPU Cluster TutorialDirk Hähnel
 
Getting Started with Ansible - Jake.pdf
Getting Started with Ansible - Jake.pdfGetting Started with Ansible - Jake.pdf
Getting Started with Ansible - Jake.pdfssuserd254491
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization Ganesan Narayanasamy
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017Ayush Sharma
 
Node.js basics
Node.js basicsNode.js basics
Node.js basicsBen Lin
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
FireWorks workflow software
FireWorks workflow softwareFireWorks workflow software
FireWorks workflow softwareAnubhav Jain
 
Automating with ansible (Part A)
Automating with ansible (Part A)Automating with ansible (Part A)
Automating with ansible (Part A)iman darabi
 
SaltConf14 - Ben Cane - Using SaltStack in High Availability Environments
SaltConf14 - Ben Cane - Using SaltStack in High Availability EnvironmentsSaltConf14 - Ben Cane - Using SaltStack in High Availability Environments
SaltConf14 - Ben Cane - Using SaltStack in High Availability EnvironmentsSaltStack
 
Spark optimization
Spark optimizationSpark optimization
Spark optimizationAnkit Beohar
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operationsgrim_radical
 
Improving build solutions dependency management with webpack
Improving build solutions  dependency management with webpackImproving build solutions  dependency management with webpack
Improving build solutions dependency management with webpackNodeXperts
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceTim Ellison
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamFlink Forward
 
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...HostedbyConfluent
 

Similar to 2022.03.24 Snakemake.pptx (20)

Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
HPC and HPGPU Cluster Tutorial
HPC and HPGPU Cluster TutorialHPC and HPGPU Cluster Tutorial
HPC and HPGPU Cluster Tutorial
 
Getting Started with Ansible - Jake.pdf
Getting Started with Ansible - Jake.pdfGetting Started with Ansible - Jake.pdf
Getting Started with Ansible - Jake.pdf
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017
 
Node.js basics
Node.js basicsNode.js basics
Node.js basics
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Introducing Ansible
Introducing AnsibleIntroducing Ansible
Introducing Ansible
 
FireWorks workflow software
FireWorks workflow softwareFireWorks workflow software
FireWorks workflow software
 
Automating with ansible (Part A)
Automating with ansible (Part A)Automating with ansible (Part A)
Automating with ansible (Part A)
 
SaltConf14 - Ben Cane - Using SaltStack in High Availability Environments
SaltConf14 - Ben Cane - Using SaltStack in High Availability EnvironmentsSaltConf14 - Ben Cane - Using SaltStack in High Availability Environments
SaltConf14 - Ben Cane - Using SaltStack in High Availability Environments
 
Spark optimization
Spark optimizationSpark optimization
Spark optimization
 
Spark 101
Spark 101Spark 101
Spark 101
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operations
 
Improving build solutions dependency management with webpack
Improving build solutions  dependency management with webpackImproving build solutions  dependency management with webpack
Improving build solutions dependency management with webpack
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
 

More from Philip Ashton

2022.09.26 iNTS household survey.pptx
2022.09.26 iNTS household survey.pptx2022.09.26 iNTS household survey.pptx
2022.09.26 iNTS household survey.pptxPhilip Ashton
 
2022.03.23 Conda and Conda environments.pptx
2022.03.23 Conda and Conda environments.pptx2022.03.23 Conda and Conda environments.pptx
2022.03.23 Conda and Conda environments.pptxPhilip Ashton
 
2016.11.08 doherty institute symposium
2016.11.08 doherty institute symposium2016.11.08 doherty institute symposium
2016.11.08 doherty institute symposiumPhilip Ashton
 
2016.03.08 immem talk salmonella outbreaks
2016.03.08 immem talk   salmonella outbreaks2016.03.08 immem talk   salmonella outbreaks
2016.03.08 immem talk salmonella outbreaksPhilip Ashton
 
SnapperDB ASM NGS conference
SnapperDB ASM NGS conferenceSnapperDB ASM NGS conference
SnapperDB ASM NGS conferencePhilip Ashton
 
Whole genome microbiology for Salmonella public health microbiology
Whole genome microbiology for Salmonella public health microbiologyWhole genome microbiology for Salmonella public health microbiology
Whole genome microbiology for Salmonella public health microbiologyPhilip Ashton
 

More from Philip Ashton (6)

2022.09.26 iNTS household survey.pptx
2022.09.26 iNTS household survey.pptx2022.09.26 iNTS household survey.pptx
2022.09.26 iNTS household survey.pptx
 
2022.03.23 Conda and Conda environments.pptx
2022.03.23 Conda and Conda environments.pptx2022.03.23 Conda and Conda environments.pptx
2022.03.23 Conda and Conda environments.pptx
 
2016.11.08 doherty institute symposium
2016.11.08 doherty institute symposium2016.11.08 doherty institute symposium
2016.11.08 doherty institute symposium
 
2016.03.08 immem talk salmonella outbreaks
2016.03.08 immem talk   salmonella outbreaks2016.03.08 immem talk   salmonella outbreaks
2016.03.08 immem talk salmonella outbreaks
 
SnapperDB ASM NGS conference
SnapperDB ASM NGS conferenceSnapperDB ASM NGS conference
SnapperDB ASM NGS conference
 
Whole genome microbiology for Salmonella public health microbiology
Whole genome microbiology for Salmonella public health microbiologyWhole genome microbiology for Salmonella public health microbiology
Whole genome microbiology for Salmonella public health microbiology
 

Recently uploaded

Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 

Recently uploaded (20)

Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 

2022.03.24 Snakemake.pptx

  • 1. Snakemake Phil Ashton, PGI Advanced Training - KEMRI Wellcome, April 2022
  • 2. Why are we going to talk to you about workflow managers?
  • 3. Advantages of workflow managers - Each job/task can run within its own environment/container - Failures handled elegantly - and re-starting. - Portable - can send it to collaborator and it should work (?) - Scalable - same code can run on your laptop or an HPC with minor changes - Efficient - parallelisable jobs run in parallel.
  • 4. Workflow managers ● Snakemake ● Nextflow ● Cromwell ● Galaxy ● Ruffus ● BPipe ● … Choosing which one to use is a trade-off between learning curve, feature richness, and ease of use.
  • 5. One way to think about Snakemake scripts 1. What is the final thing you want out of this pipeline? 2. What is the process that will give you that thing? 3. What are the inputs for that process? 4. Where do those inputs come from? Input files
  • 6. One way to think about Snakemake scripts Assembly Trimmed reads Raw reads Process: Assembly Input: Trimmed Reads Output: assembly.fasta Process: read trimming Input: raw reads Output: trimmed reads Given as files Process: Gather assemblies Input: assembly.fasta =
  • 7. Snakemake ● Based on Python & Make ● Split your workflow into separate “rules” ○ One rule per process normally ● Rules have: ○ A rule name ○ Input files ○ Output files ○ A command to turn the input into the output
  • 8. An example Snakemake script rule bwa_map: input: "data/genome.fa", "data/samples/A.fastq" output: "mapped_reads/A.bam" shell: "bwa mem {input} | samtools view -Sb - > {output}"
  • 9. Running Snakemake scripts If the contents of the previous slide is saved as “my_mapping_pipeline.smk” then to run it we would do: `snakemake -s my_mapping_pipeline.smk` Alternatively you can save the contents in a file called `Snakefile` and then just run `snakemake` in the directory containing `Snakefile` and it will run the script.
  • 10. A generalised Snakemake script rule bwa_map: input: "data/genome.fa", "data/samples/{sample}.fastq" output: "mapped_reads/{sample}.bam" shell: "bwa mem {input} | samtools view -Sb - > {output}" ● Snakemake uses “named wildcards”, in this case `{sample}`. ● In this case snakemake would find all the files matching the pattern `data/samples/{sample}.fastq` and run the rule `bwa_map` on them ● The `{sample}` wildcard will “propagate” through to the output, so the output bam name will match the input fastq name.
  • 11. A more useful generalised Snakemake script samples = [‘sample1’, ‘sample2’] rule bwa_map: input: ref_genome = "data/genome.fa", fastqs = expand("data/samples/{sample}.fastq", sample = samples) output: "mapped_reads/{sample}.bam" shell: "bwa mem {input} | samtools view -Sb - > {output}" The `expand` function lets you give a “todo list” to a rule.
  • 12. A two-step snakemake workflow rule bbduk: input: r1 = '{root_dir}/{sample}_1.fastq.gz',r2 = '{root_dir}/{sample}_2.fastq.gz' output: r1 = '{root_dir}/{sample}_bbduk_1.fq.gz',r2 = '{root_dir}/{sample}_bbduk_2.fq.gz' shell: 'bbduk.sh in={input.r1} in2={input.r2} out={output.r1} out2={output.r2}' rule shovill: input: r1 = rules.bbduk.output.r1, r2 = rules.bbduk.output.r2 output: final = '{root_dir}/{sample}/shovill_bbduk/contigs.fa', shell: 'shovill --outdir {root_dir}/{wildcards.sample}/shovill -R1 {input.r1} -R2 {input.r2}'
  • 13. A three-step snakemake workflow rule all: input: expand('{root_dir}/{sample}/shovill_bbduk/contigs.fa', sample = todo_list, root_dir = root_dir) rule bbduk: input: r1 = '{root_dir}/{sample}_1.fastq.gz',r2 = '{root_dir}/{sample}_2.fastq.gz' output: r1 = '{root_dir}/{sample}_bbduk_1.fq.gz',r2 = '{root_dir}/{sample}_bbduk_2.fq.gz' shell: 'bbduk.sh in={input.r1} in2={input.r2} out={output.r1} out2={output.r2}' rule shovill: input: r1 = rules.bbduk.output.r1, r2 = rules.bbduk.output.r2 output: final = '{root_dir}/{sample}/shovill_bbduk/contigs.fa', shell: 'shovill --outdir {root_dir}/{wildcards.sample}/shovill -R1 {input.r1} -R2 {input.r2}'
  • 14. Directed acyclic graphs (DAGs) When snakemake runs, it converts the snakefile into a directed acyclic graph.
  • 15. Practical exercise 1 - basic and advanced snakemake 1. Basics: An example workflow — Snakemake 7.3.2 documentation 2. Advanced: Decorating the example workflow — Snakemake 7.3.3 documentation
  • 16. Using conda with snakemake rule bwa_map: input: ref_genome = "data/genome.fa", fastqs = expand("data/samples/{sample}.fastq", sample = samples) output: "mapped_reads/{sample}.bam" conda: "../../envs/bwa.yaml" shell: "bwa mem {input} | samtools view -Sb - > {output}" A yaml file defining the conda packages required to run this rule.
  • 17. Practical exercise 2 - conda integration ● Modify your script from the basic conda exercise to use conda environments for each rule a. Distribution and Reproducibility — Snakemake 7.3.5 documentation
  • 18. Using snakemake with HPC ● Submitting/managing jobs on an HPC can be a hassle ● Especially when jobs depend on each other ● Snakemake makes it much easier to use an HPC ● If you don’t have an HPC, snakemake can also be configured to use e.g. Amazon Web Services.
  • 19. Using snakemake with HPC $ cat ~/.config/snakemake/salmonella/config.yaml __default__: cluster: sbatch cpus-per-task: 1 mem-per-cpu: 4000 jobs: 100 bbduk: cpus-per-task: 8 shovill: cpus-per-task: 8 mem-per-cpu: 6000 snakemake -s ~/scripts/snakemake/salmonella_workflow.smk --cluster-config ~/.config/snakemake/salmonella_slurm/config.yaml --cluster "sbatch --cpus-per-task={cluster.cpus-per-task} --mem-per-cpu={cluster.mem-per-cpu}" --jobs 1000
  • 20. Practical exercise 3 - using snakemake on HPC ● Modify your snakemake script from the basic workflow to execute on the HPC ○ More information - Cluster Execution — Snakemake 7.3.2 documentation ○ You will need to write a config file like on the previous slide, for this example you can just set a sensible `__default__` for all rules. ○ Then execute using the snakemake command from previous slide as a template.
  • 21. Acknowledgements & further reading Anna Price, Cardiff/CLIMB - https://www.youtube.com/watch?v=qORviM_ELdk Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers | Nature Methods Snakemake - Reproducible and Scalable Bioinformatic Workflows https://snakemake.readthedocs.io/en/stable/tutorial/basics.html

Editor's Notes

  1. Here’s a way to think about snakemake scripts. Say 1. For example, an assembly fasta file. Say 2. It’s a de novo assembly process. Say 3. The inputs might be trimmed reads. Say 4. They come from another process right? Read trimming. So then, [run through 2 to 4 again], and in this case the raw reads might be the input files. The same logic applies if you have multiple steps in your pipeline. At each stage, what’s the output you want, whats the process to generate that output, and what’s the input for that process?
  2. But, if this script is run as is, then the rule “shovill” won’t be run, because nothing requires it’s output as an input.
  3. So, it’s normal to have a rule, typically called “all” which requires the final thing you want out of the pipeline as an input. You can think of this as pulling the output you want out of the pipeline, and that “pull” goes all through the pipeline to the input you provide.