Introduction to Snakemake

Introduction to Snakemake workflows

2024-10-15

Chloé QUIGNOT (BIOI2 @I2BC) - ORCID: 0000-0001-8504-232X

Source: adapted from FAIRbioinfo 2021 training material of the IFB and Snakemake introduction tutorial from BIOI2

Material under CC-BY-SA licence

The principle behind Snakemake

Snakemake = Python (aka “snake”, a programming language) + Make (a rule-based automation tool)

Workflows are like legos:

Workflows are made up of blocks, each block performs a specific (set of) instruction(s)

1 “block” = 1 rule:

[Figure: single rule example]

- 1 rule = 1 instruction (ideally)
- inputs and outputs are one or multiple files
- at least 1 input and/or 1 output per rule

Linking data flows

Rule order is not important…

execution order ≠ code order => Snakemake does a pick & mix of the rules it needs at execution

…but matching file names is key!

Rules are linked together by Snakemake using matching filenames in their input and output directives.

At execution, Snakemake creates a DAG (directed acyclic graph) of the jobs to run, which it follows to generate the final output of your pipeline.
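
As a rough illustration of this filename matching, here is a minimal Python sketch. It is not Snakemake's actual implementation: the `rules` dictionary and the `link_rules` helper are hypothetical, and the file names are borrowed from the two-rule alignment example used later in this course.

```python
# Hypothetical, simplified description of two rules (not Snakemake's internals).
# Declared in "code order": mafft first, fusionFasta second.
rules = {
    "mafft": {
        "input": ["P10415_P01308_fused.fasta"],
        "output": ["P10415_P01308_aligned.fasta"],
    },
    "fusionFasta": {
        "input": ["P10415.fasta", "P01308.fasta"],
        "output": ["P10415_P01308_fused.fasta"],
    },
}


def link_rules(rules):
    """Return (producer, consumer, filename) edges for every shared file."""
    # Map each output file to the rule that produces it.
    produced_by = {f: name for name, r in rules.items() for f in r["output"]}
    edges = []
    for consumer, r in rules.items():
        for f in r["input"]:
            if f in produced_by:
                edges.append((produced_by[f], consumer, f))
    return edges


edges = link_rules(rules)
print(edges)
# [('fusionFasta', 'mafft', 'P10415_P01308_fused.fasta')]
```

The single edge found goes from fusionFasta to mafft even though mafft is declared first, which is why rule order in the code does not matter.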

A workflow example

Below is a workflow example using 2 tools sequentially to align 2 protein sequences:

[Figure: 2-step example outline]

In this example, we have:

  • 2 linked rules: fusionFasta and Mafft
  • input protein sequence files named *.fasta
  • an intermediate file generated by fusionFasta named *fused.fasta
  • the final output named *aligned.fasta generated by Mafft

[Figure: detailed illustration]

How Snakemake creates your workflow

[Figure: 2-step example outline]

How Snakemake creates your workflow (summary)

Snakemake (Smk) steps:

  1. Smk creates the DAG from the Snakefile
  2. Smk sees that the final output *aligned.fasta doesn’t exist but knows it can create it with the Mafft rule
  3. Mafft needs files matching *fused.fasta (which don’t exist) but the fusionFasta rule can generate them
  4. fusionFasta needs .fasta files
  5. the .fasta files exist! Smk stops backtracking
  6. Smk runs the fusionFasta rule
  7. P10415_P01308_fused.fasta now exists and feeds the Mafft rule
  8. the final output (P10415_P01308_aligned.fasta) is generated: the workflow has finished
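
The backtracking described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Snakemake's code: `RULES`, `plan` and the set of existing files are hypothetical, and timestamps are ignored (only file existence is checked).

```python
# Hypothetical rule table mirroring the 2-step example.
RULES = {
    "mafft": {
        "input": ["P10415_P01308_fused.fasta"],
        "output": ["P10415_P01308_aligned.fasta"],
    },
    "fusionFasta": {
        "input": ["P10415.fasta", "P01308.fasta"],
        "output": ["P10415_P01308_fused.fasta"],
    },
}


def plan(target, rules, existing, order=None):
    """Backtrack from `target` and return rule names in execution order."""
    if order is None:
        order = []
    if target in existing:
        return order  # the file is already there: stop backtracking
    for name, rule in rules.items():
        if target in rule["output"]:
            # First make sure every input of this rule can be obtained...
            for f in rule["input"]:
                plan(f, rules, existing, order)
            # ...then schedule the rule itself.
            if name not in order:
                order.append(name)
            return order
    raise FileNotFoundError(f"no rule produces {target}")


steps = plan(
    "P10415_P01308_aligned.fasta",
    RULES,
    existing={"P10415.fasta", "P01308.fasta"},
)
print(steps)
# ['fusionFasta', 'mafft']
```

If the fused file were already present, the same call would only schedule mafft, since backtracking stops at the first existing file.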

Rules are run when outputs are missing… but not only then

Snakemake’s job is to make sure that everything is up-to-date, otherwise it (re-)runs the rules that need to be run…

Rules are run if:

  • output doesn’t exist
  • output exists but is older than the input
  • changes detected in parameters, code or tool versions since last execution
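
The first two conditions can be illustrated with a small Python sketch based on file modification times. The `needs_rerun` helper and the file names are hypothetical, and the third condition (changes in parameters, code or tool versions) is deliberately left out, since Snakemake tracks it through its own metadata.

```python
import os
import tempfile


def needs_rerun(inputs, output):
    """True if `output` is missing or older than any of `inputs`."""
    if not os.path.exists(output):
        return True
    out_time = os.path.getmtime(output)
    return any(os.path.getmtime(f) > out_time for f in inputs)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.fasta")
    dst = os.path.join(tmp, "out.fasta")

    with open(src, "w"):
        pass
    missing = needs_rerun([src], dst)  # output doesn't exist yet

    with open(dst, "w"):
        pass
    os.utime(src, (1000, 1000))  # set the input's timestamp to an early time
    os.utime(dst, (2000, 2000))  # output is newer than the input
    fresh = needs_rerun([src], dst)

    os.utime(src, (3000, 3000))  # input modified after the output was made
    stale = needs_rerun([src], dst)

print(missing, fresh, stale)
# True False True
```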

The Snakemake world

Many default files constitute the “Snakemake system” & there are standards on how to organise them.

They are not all necessary for a basic pipeline execution.

The most important is the Snakefile: that’s where all the code is saved.

[Figure: the Snakemake system]

For more information: https://github.com/snakemake-workflows/snakemake-workflow-template

Within the Snakefile…

  • The Snakefile is where rules are defined
  • The basic syntax of a rule is:
rule myRuleName:
    input: "myInputFile"
    output: "myOutputFile"
    shell: "echo {input} > {output}"

=> Rules usually have a unique name which defines them
=> input, output, shell etc. are called directives
=> "myInputFile" & "myOutputFile" specify 1 or more input & output files
=> shell specifies what to do (shell commands in this case -> alternative directives exist)
=> {input} & {output} are placeholders & are replaced by input & output file names at execution

=> code alignment (= indentation) is important: directives must be indented consistently under the rule line
=> file names and shell commands must be given within quotes (', " or """ for multi-line code)
=> additional & optional directives exist, e.g.: params:, resources:, log:, etc. (we’ll see them later)

For more information: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
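
To get a feel for how the {input} & {output} placeholders behave, here is an analogy using Python's own `str.format`. It is only a sketch: Snakemake's formatting is richer than this, and the `SimpleNamespace` object and the P10415/P01308 file names (borrowed from this course's alignment example) stand in for Snakemake's real input object.

```python
from types import SimpleNamespace

# Plain placeholders: each name is replaced by the corresponding file name.
template = "echo {input} > {output}"
command = template.format(input="myInputFile", output="myOutputFile")
print(command)
# echo myInputFile > myOutputFile

# Named inputs: attribute access inside the braces picks one specific file.
# SimpleNamespace is a stand-in for Snakemake's input object (an assumption).
named = SimpleNamespace(p1="P10415.fasta", p2="P01308.fasta")
cat_command = "cat {input.p1} {input.p2} > {output}".format(
    input=named, output="P10415_P01308_fused.fasta"
)
print(cat_command)
# cat P10415.fasta P01308.fasta > P10415_P01308_fused.fasta
```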

Snakefile of the previous example

rule mafft:
    input:
        "P10415_P01308_fused.fasta",
    output:
        "P10415_P01308_aligned.fasta",
    shell:
        """
        mafft {input} > {output} 
        """

rule fusionFasta:
    input:
        p1="P10415.fasta",
        p2="P01308.fasta",
    output:
        "P10415_P01308_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """
  • 2 rules: fusionFasta & mafft
  • fusionFasta: 2 input (p1 & p2) & 1 output file
  • mafft: 1 input & 1 output file

NB: input & output files can be named (e.g. p1="P10415.fasta") and explicitly accessed in shell, either by name or by position (e.g. {input.p1} or {input[0]})

[Figure: mafft alignment pipeline]

How to run a Snakemake pipeline?

Once Snakemake is installed:

  1. move into the directory containing the Snakefile
  2. type snakemake --cores 1 to run the pipeline (--cores specifies the number of cores to use)

Snakemake’s monologue & its hidden treasure chest

When you run Snakemake, you’ll get a full report printed on the screen of its progress:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
fusionFasta    1
mafft          1
total          2

[...]

2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-02-20T150605.574089.snakemake.log


When it’s finished, a .snakemake folder will appear in your working directory:

  • it can be heavy (when using environments)
  • it can contain a lot of files (unsuited for some file systems)
  • it’s a hidden folder, so use ls -a to see it
  • don’t forget to remove it once you’re sure you’ve finished your analysis

Useful debugging options

Visualise the Snakemake DAG

Snakemake can output the complete workflow (--dag), the rule dependencies (--rulegraph) or the rule dependencies with their I/O files (--filegraph) in the dot language. Pipe the result to the dot tool of the graphviz package to create a png, pdf or other format:

snakemake --dag | dot -Tpng > dag.png
snakemake --rulegraph | dot -Tpng > rule.png
snakemake --filegraph | dot -Tpng > file.png
[Figures: dag, rule graph, file graph]

Use the --dry-run option

Using this option will perform a “dry-run” i.e. nothing will be executed but everything that would’ve been run is displayed on the screen

Other useful options for debugging when running Snakemake

  • print the shell command that is run: -p --printshellcmds
  • print a detailed summary of the files created by the workflow and their status: -D --detailed-summary

All command line options: https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options

Conclusion

So far, we’ve seen

  • Snakemake workflow = set of rules
  • Rules are written in Snakefiles
  • Snakemake links rules together by matching up common input/output files
  • Rules are defined by their name and contain directives (including input and output, which specify the input & output files):
rule myRuleName:
    input: "myInputFile"
    output: "myOutputFile"
    shell: "echo {input} > {output}"
  • Access input & output values with {input} & {output} placeholders within the shell directive
  • A Snakefile is run with the snakemake --cores 1 command (+ other options available)
  • Debugging options: --dag, --rulegraph, --filegraph and --dry-run

Where to get pipelines?

A quick side note

Rules are generalisable with wildcards

e.g. {upid}, {sample} etc.

  • wildcards replace parts of file names
  • written between braces: {}
  • automatically resolved (i.e. matched against file names using the regular expression ".+" by default)
  • specific to each rule
rule fusionFasta:
    input:
        p1="P10415.fasta",
        p2="P01308.fasta",
    output:
        "P10415_P01308_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """

With wildcards, the same rule becomes:

rule fusionFasta:
    input:
        p1="{upid1}.fasta",
        p2="{upid2}.fasta",
    output:
        "{upid1}_{upid2}_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """
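
How a requested file name is matched against an output pattern can be approximated with Python's `re` module. This is a simplified sketch rather than Snakemake's resolver: the helpers `pattern_to_regex` and `match_wildcards` are hypothetical, every wildcard is given the default ".+" expression, and each wildcard is assumed to appear only once in the pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn '{a}_{b}.txt' into a regex with one named group per wildcard."""
    # re.split with a capturing group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2:  # odd positions hold wildcard names
            regex += f"(?P<{part}>.+)"
        else:  # even positions hold literal text
            regex += re.escape(part)
    return re.compile(regex)


def match_wildcards(pattern, filename):
    """Return the wildcard values if `filename` fits `pattern`, else None."""
    m = pattern_to_regex(pattern).fullmatch(filename)
    return m.groupdict() if m else None


wildcards = match_wildcards(
    "{upid1}_{upid2}_fused.fasta", "P10415_P01308_fused.fasta"
)
print(wildcards)
# {'upid1': 'P10415', 'upid2': 'P01308'}

# The same values are then substituted into the input patterns.
inputs = [p.format(**wildcards) for p in ("{upid1}.fasta", "{upid2}.fasta")]
print(inputs)
# ['P10415.fasta', 'P01308.fasta']
```

Because ".+" matches almost anything, file names with several possible splits can be ambiguous, which is one reason to keep wildcard-bearing file names simple.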

Your turn to work!

Objective: learn how to run an already-existing Snakefile