2024-10-15
Chloé QUIGNOT (BIOI2 @I2BC) - ORCID: 0000-0001-8504-232X
Source: adapted from FAIRbioinfo 2021 training material of the IFB and Snakemake introduction tutorial from BIOI2
Material under CC-BY-SA licence
Snakemake = Python (aka “snake”, a programming language) + Make (a rule-based automation tool)
Workflows are made up of blocks, each block performs a specific (set of) instruction(s)
- 1 rule = 1 instruction (ideally)
- inputs and outputs are one or multiple files
- at least 1 input and/or 1 output per rule
execution order ≠ code order => Snakemake does a pick & mix of the rules it needs at execution
Rules are linked together by Snakemake using matching filenames in their input and output directives.
At execution, Snakemake creates a DAG (directed acyclic graph), that it will follow to generate the final output of your pipeline.
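To make the linking idea concrete, here is a minimal Python sketch, not Snakemake's actual implementation. The `rules` dictionary and the `link_rules` helper are hypothetical; the file names are those of the alignment example used in this material. Rules are deliberately listed in "wrong" order to show that code order does not matter.

```python
# Hypothetical, simplified rule table: rule name -> input and output file names.
# mafft is listed first on purpose: code order is not execution order.
rules = {
    "mafft": {
        "input": ["P10415_P01308_fused.fasta"],
        "output": ["P10415_P01308_aligned.fasta"],
    },
    "fusionFasta": {
        "input": ["P10415.fasta", "P01308.fasta"],
        "output": ["P10415_P01308_fused.fasta"],
    },
}


def link_rules(rules):
    """Return DAG edges (producer, consumer) found by matching file names."""
    edges = []
    for producer, p in rules.items():
        for consumer, c in rules.items():
            # An edge exists when an output of one rule is an input of another.
            if producer != consumer and set(p["output"]) & set(c["input"]):
                edges.append((producer, consumer))
    return edges


edges = link_rules(rules)
print(edges)  # [('fusionFasta', 'mafft')]
```

The single edge says that fusionFasta must run before mafft, because the file it produces is the file mafft consumes.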
Below is a workflow example using 2 tools sequentially to align 2 protein sequences:
In this example, we have:
- 2 rules: fusionFasta and Mafft
- input files: *.fasta
- intermediate files: *fused.fasta
- final output files: *aligned.fasta generated by Mafft

Snakemake (Smk) steps | running path |
---|---|
Smk creates the DAG from the Snakefile | |
Smk sees that the final output *aligned.fasta doesn’t exist but knows it can create it with the Mafft rule | |
Mafft needs files matching *fused.fasta (which don’t exist yet), but the fusionFasta rule can generate them | |
fusionFasta needs .fasta files | |
.fasta files exist! Smk stops backtracking | |
Smk runs the fusionFasta rule | |
P10415_P01308_fused.fasta exists and feeds the Mafft rule | |
the final output (P10415_P01308_aligned.fasta) is generated, the workflow has finished | |
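The backtracking steps in the table can be imitated in a few lines of Python. This is a simplified sketch under stated assumptions, not how Snakemake is implemented: `rules`, `existing` and the `plan` helper are hypothetical, and file names are matched exactly instead of by pattern.

```python
# Hypothetical rule table: rule name -> input and output file names.
rules = {
    "mafft": {
        "input": ["P10415_P01308_fused.fasta"],
        "output": ["P10415_P01308_aligned.fasta"],
    },
    "fusionFasta": {
        "input": ["P10415.fasta", "P01308.fasta"],
        "output": ["P10415_P01308_fused.fasta"],
    },
}


def plan(target, rules, existing, order=None):
    """Backtrack from a target file; return the rules to run, in run order."""
    if order is None:
        order = []
    if target in existing:
        return order  # the file exists: stop backtracking
    for name, rule in rules.items():
        if target in rule["output"]:
            # First make sure every input of this rule can be obtained...
            for needed in rule["input"]:
                plan(needed, rules, existing, order)
            # ...then schedule the rule itself (only once).
            if name not in order:
                order.append(name)
            return order
    raise FileNotFoundError(f"no rule produces {target}")


existing = {"P10415.fasta", "P01308.fasta"}
run_order = plan("P10415_P01308_aligned.fasta", rules, existing)
print(run_order)  # ['fusionFasta', 'mafft']
```

Starting from the final output, the function walks back to files that already exist, then returns the rules in the order they have to run.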
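Snakemake’s job is to make sure that everything is up-to-date; otherwise it (re-)runs the rules that need to be run.
Rules are run if, for example:
- an output file that is needed doesn’t exist
- an input file is more recent than the corresponding output file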
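As a rough illustration of the up-to-date check, the sketch below treats a rule as needing a (re-)run when one of its output files is missing or older than one of its input files. This is a simplification in plain Python, not Snakemake's actual implementation (which can take further triggers into account), and `needs_rerun` is a hypothetical helper.

```python
import os


def needs_rerun(inputs, outputs):
    """Return True if an output is missing or older than any input.

    Assumes all input files exist; a missing input raises an OSError.
    """
    # A missing output always requires the rule to run.
    if any(not os.path.exists(path) for path in outputs):
        return True
    # Otherwise compare modification times: newest input vs oldest output.
    newest_input = max(
        (os.path.getmtime(path) for path in inputs), default=float("-inf")
    )
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output
```

For example, `needs_rerun(["P10415.fasta", "P01308.fasta"], ["P10415_P01308_fused.fasta"])` would return True as long as the fused file has not been created yet.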
Many default files constitute the “Snakemake system” & there are standards on how to organise them.
They are not all necessary for a basic pipeline execution.
The most important is the Snakefile, that’s where all the code is saved.
For more information: https://github.com/snakemake-workflows/snakemake-workflow-template
rule myRuleName:
    input: "myInputFile"
    output: "myOutputFile"
    shell: "echo {input} > {output}"

=> Rules usually have a unique name which defines them
=> input, output, shell etc. are called directives
=> "myInputFile" & "myOutputFile" specify 1 or more input & output files
=> shell specifies what to do (shell commands in this case -> alternative directives exist)
=> {input} & {output} are placeholders & are replaced by input & output file names at execution
=> code alignment (= indentation) is important
=> files and shell directives should be given within quotes (', " or """ for multi-line code)
=> additional & optional directives exist, e.g.: params:, resources:, log:, etc. (we’ll see them later)
For more information: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
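The placeholder mechanism resembles Python's own string formatting. The sketch below is only an analogy in plain Python, with a hypothetical `render` helper; it assumes that several file names are joined with spaces when a placeholder is filled in.

```python
def render(template, inputs, outputs):
    """Fill {input} and {output} in a shell template (hypothetical helper)."""
    # Several files are joined with spaces so they can be passed to a command.
    return template.format(input=" ".join(inputs), output=" ".join(outputs))


command = render("echo {input} > {output}", ["myInputFile"], ["myOutputFile"])
print(command)  # echo myInputFile > myOutputFile
```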
rule mafft:
    input:
        "P10415_P01308_fused.fasta",
    output:
        "P10415_P01308_aligned.fasta",
    shell:
        """
        mafft {input} > {output}
        """

rule fusionFasta:
    input:
        p1="P10415.fasta",
        p2="P01308.fasta",
    output:
        "P10415_P01308_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """
- 2 rules: fusionFasta & mafft
- fusionFasta: 2 input files (p1 & p2) & 1 output file
- mafft: 1 input & 1 output file

NB: input & output files can be named, e.g. p1="P10415.fasta", and explicitly accessed in shell, e.g. {input.p1} or {input[0]}
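Access by name and access by position can be mimicked in plain Python, since str.format supports both attribute and index lookups. The `Files` class below is a hypothetical stand-in written for this illustration; it is not a Snakemake class.

```python
class Files(list):
    """A list of file names whose items can also be reached by name."""

    def __init__(self, *items, **named):
        super().__init__(list(items) + list(named.values()))
        for name, value in named.items():
            setattr(self, name, value)  # enables {input.p1}

    def __str__(self):
        return " ".join(self)  # {input} alone gives all files


inputs = Files(p1="P10415.fasta", p2="P01308.fasta")
outputs = Files("P10415_P01308_fused.fasta")

by_name = "cat {input.p1} {input.p2} > {output}".format(input=inputs, output=outputs)
by_index = "cat {input[0]} {input[1]} > {output}".format(input=inputs, output=outputs)
print(by_name)  # cat P10415.fasta P01308.fasta > P10415_P01308_fused.fasta
```

Both templates produce the same command; naming the files mainly makes the shell directive easier to read.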
When Snakemake is installed (how to install):
- make sure your code is saved in a file named Snakefile
- use snakemake --cores 1 to run the pipeline (--cores specifies the number of cores to use)

When you run Snakemake, you’ll get a full report printed on the screen of its progress:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job            count
-----------  -------
fusionFasta        1
mafft              1
total              2
[...]
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-02-20T150605.574089.snakemake.log
When it’s finished, a .snakemake folder will appear in your working directory (it is a hidden folder: use ls -a to see it).

To visualise the complete workflow (--dag), rule dependencies (--rulegraph) or rule dependencies with their I/O files (--filegraph), Snakemake prints the graph in dot language. Use the dot tool of the graphviz package to create a png, pdf or other format:
snakemake --dag | dot -Tpng > dag.png
snakemake --rulegraph | dot -Tpng > rule.png
snakemake --filegraph | dot -Tpng > file.png
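--dry-run option

Using this option will perform a “dry-run”, i.e. nothing will be executed but everything that would have been run is displayed on the screen.

Other options:
- -p / --printshellcmds: print the shell commands that will be executed
- -D: print a detailed summary of the files created by the workflow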
All command line options: https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options
- rules (with input and output to specify input & output files):

    rule myRuleName:
        input: "myInputFile"
        output: "myOutputFile"
        shell: "echo {input} > {output}"

- {input} & {output} placeholders within the shell directive
- the snakemake --cores 1 command (+ other options available)
- the --dag, --rulegraph, --filegraph and --dry-run options
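- wildcards, e.g. {upid}, {sample} etc.
- they are written within {}
- by default, a wildcard matches any non-empty string (regular expression ".+")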
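How a requested file name gives values to wildcards can be sketched with Python's re module. This is an illustrative approximation, not Snakemake's actual matching code: `pattern_to_regex` and `match_wildcards` are hypothetical helpers, and each wildcard is assumed to match the default ".+" expression. The pattern "{upid1}_{upid2}_fused.fasta" is the one used for the generalised fusionFasta rule.

```python
import re


def pattern_to_regex(pattern, default=".+"):
    """Turn a pattern such as '{a}_{b}.txt' into a compiled regex."""
    regex = ""
    pos = 0
    seen = set()
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])  # literal text
        name = m.group(1)
        if name in seen:
            regex += f"(?P={name})"  # a repeated wildcard must match the same text
        else:
            regex += f"(?P<{name}>{default})"
            seen.add(name)
        pos = m.end()
    regex += re.escape(pattern[pos:])
    return re.compile(regex)


def match_wildcards(pattern, filename):
    """Return the wildcard values for filename, or None if it does not match."""
    m = pattern_to_regex(pattern).fullmatch(filename)
    return m.groupdict() if m else None


wildcards = match_wildcards("{upid1}_{upid2}_fused.fasta", "P10415_P01308_fused.fasta")
print(wildcards)  # {'upid1': 'P10415', 'upid2': 'P01308'}

# The same values then fill in the input patterns of the rule.
inputs = [p.format(**wildcards) for p in ("{upid1}.fasta", "{upid2}.fasta")]
print(inputs)  # ['P10415.fasta', 'P01308.fasta']
```

Once the output pattern has matched, the rule's inputs become concrete file names, which is what allows one rule to serve any pair of proteins.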
Without wildcards:

rule fusionFasta:
    input:
        p1="P10415.fasta",
        p2="P01308.fasta",
    output:
        "P10415_P01308_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """

With wildcards:

rule fusionFasta:
    input:
        p1="{upid1}.fasta",
        p2="{upid2}.fasta",
    output:
        "{upid1}_{upid2}_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """
Objective: learn how to run an already-existing Snakefile