1. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017529.
Copernicus - eoSC AnaLytics Engine
C-SCALE tutorial: Snakemake
Sebastian Luna-Valero, EGI Foundation
sebastian.luna.valero@egi.eu
C-SCALE tutorial: Snakemake | 29th November 2022 | Online
3. Why workflows?
Credits: https://github.com/c-scale-community/use-case-hisea
Goals:
● from raw data to figures
○ with “one click”
● re-run with new config
○ spatial scale
○ temporal scale
● re-run half-way through
○ recover from issues
● dependency management
○ between tasks
○ software packages
4. Why workflows?
When to build a workflow?
● Re-run the same analysis over and over again, with different input parameters
● Ability to re-run the work partially; recover from intermediate failures
● Combine heterogeneous tooling in the same analysis
○ Python, R, Julia, Docker, Bash, etc.
5. Why snakemake?
• Mature workflow management system.
• Great community around it.
• Easy to learn? :)
• A Snakemake workflow scales without modification from single-core workstations and
multi-core servers to batch systems (e.g. Slurm)
• Snakemake integrates with the package manager Conda and the container engine
Singularity such that defining the software stack becomes part of the workflow itself.
• Further information: https://snakemake.readthedocs.io/
6. Let’s build a workflow!
• Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that
define how to create output files from input files.
• $ snakemake --cores 1
• The application of a rule to generate a set of output files is called a job.
Snakefile:
rule count_countries:
    input:
        "european-countries.txt"
    output:
        "number-of-countries.txt"
    shell:
        "wc --lines european-countries.txt > number-of-countries.txt"
$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
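To make the idea of a rule concrete, the job run by count_countries can be thought of as a function from an input file to an output file. The sketch below is only an illustration in plain Python, not how Snakemake executes the rule; the function name and the temporary directory are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def count_countries(input_path: Path, output_path: Path) -> None:
    """Mimic `wc --lines INPUT > OUTPUT` by counting newline characters."""
    n_lines = input_path.read_bytes().count(b"\n")
    output_path.write_text(f"{n_lines} {input_path.name}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "european-countries.txt"
        dst = Path(tmp) / "number-of-countries.txt"
        countries = ["Netherlands", "Greece", "Spain", "Portugal",
                     "Italy", "Poland", "Austria"]
        src.write_text("\n".join(countries) + "\n")
        count_countries(src, dst)
        print(dst.read_text(), end="")  # 7 european-countries.txt
```

The rule adds what the function lacks: Snakemake knows which file is the input and which is the output, so it can decide when the job has to run.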
7. Let’s build a workflow!
• Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that
define how to create output files from input files.
• $ snakemake --cores 1
• Snakemake only re-runs jobs if one of the input files is newer than one of the output files
or one of the input files will be updated by another job.
Snakefile:
rule count_countries:
    input:
        "european-countries.txt"
    output:
        "number-of-countries.txt"
    shell:
        "wc --lines european-countries.txt > number-of-countries.txt"
$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
Belgium
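The timestamp criterion above can be sketched in a few lines of Python. This is a simplified model, not Snakemake's implementation: the function name needs_rerun is hypothetical, a missing output is assumed to force a run, and the second condition (an input that will be updated by another job) is not modelled.

```python
import os
import tempfile
from pathlib import Path


def needs_rerun(inputs: list[Path], outputs: list[Path]) -> bool:
    """Simplified timestamp check: True if the job should run again."""
    # Assumption: a missing output always forces the job to run.
    if any(not out.exists() for out in outputs):
        return True
    newest_input = max(p.stat().st_mtime for p in inputs)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    # Re-run when some input is newer than some output.
    return newest_input > oldest_output


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "european-countries.txt"
        dst = Path(tmp) / "number-of-countries.txt"
        src.write_text("Netherlands\n")
        dst.write_text("1 european-countries.txt\n")
        os.utime(src, (1000, 1000))  # input older than output
        os.utime(dst, (2000, 2000))
        print(needs_rerun([src], [dst]))  # False
        os.utime(src, (3000, 3000))  # e.g. after appending "Belgium"
        print(needs_rerun([src], [dst]))  # True
```

In the slide, appending Belgium to european-countries.txt makes the input newer than number-of-countries.txt, so the next snakemake --cores 1 runs the job again.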
22. Let’s build a workflow!
• Other examples
• https://github.com/c-scale-community/c-scale-tutorial-snakemake
• https://github.com/c-scale-community/use-case-hisea/pull/41/files
23. Let’s build a workflow!
• Advanced features
• Pre-built functionality for scatter-gather jobs
• Cluster execution: snakemake --cluster qsub (on Slurm the submit command is sbatch; see SLURM docs)
• Self-contained HTML reports
• Accessing remote storage:
• Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage
• SFTP, HTTP, FTP, Dropbox, XRootD, WebDAV, GFAL, GridFTP, iRODS, etc.
• Best practices
• https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html
• FAQs: https://snakemake.readthedocs.io/en/stable/project_info/faq.html
24. Thank you for your attention.
contact@c-scale.eu
https://c-scale.eu
@C_SCALE_EU
26. Let’s build a workflow!
• Many to many with glob_wildcards:
• $ snakemake --cores 1
Snakefile:
CATEGORIES, = glob_wildcards("countries/{category}-countries.txt")
print(CATEGORIES)
rule all:
    input:
        expand("stats/number-of-{category}-countries.txt", category=CATEGORIES)
rule count_countries:
    input:
        "countries/{category}-countries.txt"
    output:
        "stats/number-of-{category}-countries.txt"
    shell:
        "wc --lines {input} > {output}"
$ cat list-of-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
[Diagram: each input file (input-1, input-2, ..., input-n) is mapped to its own output file (output-1, output-2, ..., output-n)]
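What glob_wildcards and expand contribute here can be approximated in plain Python. The sketch below is not Snakemake's code: find_categories and expand_targets are hypothetical helpers, the category names are made up, and the result is sorted only to make the output deterministic.

```python
import re
import tempfile
from pathlib import Path


def find_categories(directory: Path) -> list[str]:
    """Collect the {category} part of every '<category>-countries.txt' file."""
    pattern = re.compile(r"(?P<category>.+)-countries\.txt")
    matches = (pattern.fullmatch(p.name) for p in directory.iterdir())
    return sorted(m.group("category") for m in matches if m)


def expand_targets(categories: list[str]) -> list[str]:
    """Build one output path per category, as expand() does in the Snakefile."""
    return [f"stats/number-of-{c}-countries.txt" for c in categories]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        countries = Path(tmp) / "countries"
        countries.mkdir()
        for name in ("european", "asian"):  # hypothetical categories
            (countries / f"{name}-countries.txt").touch()
        categories = find_categories(countries)
        print(categories)  # ['asian', 'european']
        print(expand_targets(categories))
```

Each discovered category yields one target for rule all, and the wildcard in count_countries then matches each target back to its input file.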
27. Let’s build a workflow!
• Dependencies between the rules are determined automatically, creating a DAG (directed
acyclic graph) of jobs that can be automatically parallelized.
• Snakemake only re-runs jobs if one of the input files is newer than one of the output files
or one of the input files will be updated by another job.
• https://github.com/snakemake/snakemake/issues/1978
• Snakemake works backwards from requested output, and not from available input.
• Targets
• rule names can be targets
• output files can be targets
• if no target is given at the command line, Snakemake will define the first rule of the
Snakefile as the target. Hence, it is best practice to have a rule all at the top of the
workflow which has all typically desired target files as input files.
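Working backwards from a target can be sketched as a small recursive lookup. This is a toy model under stated assumptions, not Snakemake's scheduler: plan_jobs and the report rule are hypothetical, timestamps are ignored (every needed job is planned), and the rule graph is assumed to be acyclic.

```python
def plan_jobs(rules, existing_files, target=None):
    """Return rule names in the order they must run to build `target`."""
    # No target given: fall back to the first rule in the file.
    if target is None:
        target = next(iter(rules))
    # Map every output file to the rule that produces it.
    producers = {out: name for name, r in rules.items() for out in r["output"]}
    ordered = []

    def visit_rule(name):
        if name in ordered:
            return
        for path in rules[name]["input"]:
            visit_file(path)
        ordered.append(name)  # a rule runs only after its inputs are planned

    def visit_file(path):
        if path in producers:
            visit_rule(producers[path])
        elif path not in existing_files:
            raise ValueError(f"no rule produces {path!r}")

    # A target may be a rule name or an output file.
    if target in rules:
        visit_rule(target)
    else:
        visit_file(target)
    return ordered


rules = {
    "all": {"input": ["report.txt"], "output": []},
    "count_countries": {"input": ["european-countries.txt"],
                        "output": ["number-of-countries.txt"]},
    "report": {"input": ["number-of-countries.txt"],  # hypothetical rule
               "output": ["report.txt"]},
}
on_disk = {"european-countries.txt"}

print(plan_jobs(rules, on_disk))  # ['count_countries', 'report', 'all']
print(plan_jobs(rules, on_disk, target="number-of-countries.txt"))  # ['count_countries']
```

Requesting only number-of-countries.txt plans a single job, whereas the default target pulls in everything that rule all lists as input.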