Introduction to Snakemake

Exercise D - a few extra hacks to take home

Theme X - creating a nice report

Basic report

Once your pipeline has finished running, all you have to do is run Snakemake a second time with the --report option.

e.g.

snakemake -s ex1c_o5.smk --configfile ex1.yml --report report.html

Hint: in some browsers, the html report page overflows the window. A quick hack to fix this is to remove all mentions of h-screen like this:

sed -i -E 's/([^l]) h-screen/\1/g' report.html
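
Where sed -i is unavailable or behaves differently (BSD/macOS sed expects a backup suffix after -i), the same substitution can be sketched in Python. This is only a sketch: the function names and the default path are illustrative, and the regular expression is the one from the sed command above.

```python
import re
from pathlib import Path

# Same pattern as the sed command: " h-screen" preceded by any character
# except "l". The newline is excluded too, because sed works line by line.
H_SCREEN = re.compile(r"([^l\n]) h-screen")


def strip_h_screen(html: str) -> str:
    """Remove ' h-screen' wherever the preceding character is not 'l'."""
    return H_SCREEN.sub(r"\1", html)


def fix_report(path: str = "report.html") -> None:
    """Rewrite the report in place (the default path is an assumption)."""
    report = Path(path)
    fixed = strip_h_screen(report.read_text(encoding="utf-8"))
    report.write_text(fixed, encoding="utf-8")
```

Calling fix_report("report.html") after generating the report should have the same effect as the sed one-liner.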

The basic report contains the Snakemake DAG that was run, the commands that were run in each step and some statistics (time & resources used) per job.

Including outputs into your report

Basic syntax

For example, in the FastQC + MultiQC pipeline, we could include MultiQC’s output in the report. To do so, we have to mark the files to include within the Snakefile using the report() function.

e.g.
  output:
    report("multiqc_report.html"),
    directory("multiqc_data"),

Advanced options

The report() function accepts extra arguments such as caption, category or labels to describe the included file.

e.g.
  output:
    report(
      "multiqc_report.html",
      caption="QC.rst",
      category="0_QC",
      labels={"step": "QC", "description": "MultiQC report"},
    ),

Personalising your report

The .rst files passed as captions are just that: captions. For a more detailed write-up, containing for example references and versions of the tools used, you can add a general report file. To include this .rst file in your report, add the report directive at the beginning of your Snakefile, outside any rule.

e.g.

report: "general.rst"

general.rst could look like this for example:

My first Snakemake pipeline - final report
--------------------------------------------

Pipeline for running QC analysis on NGS input

A. Quality assessment
>>>>>>>>>>>>>>>>>>>>>

See `0_QC`_ for results

This step uses FastQC_ for individual quality assessment of raw sequencing files (`.fastq.gz` files) and MultiQC_ to generate a global report.

.. _FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
.. _MultiQC: https://multiqc.info/

Input files were taken from {{ snakemake.config["dataDir"] }}

\- FastQC
=========

- *Params:* default parameters

- *Version:* FastQC v0.12.1

- *Cite:* Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

\- MultiQC
==========

- *Params:* default parameters

- *Version:* multiqc, version 1.16

- *Cite:* Ewels P., Magnusson M., Lundin S. and Käller M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics.

Hint: for more on the rst format, see Snakemake’s cheat sheet or the Docutils quickstart guide on SourceForge.

For more information

https://snakemake.readthedocs.io/en/stable/snakefiles/reporting.html

Theme X - Controlling software environment

You’ve seen in part C of this exercise that, on an HPC system, you can use already installed software made available through modules, using the envmodules directive.

Although it’s a good idea to use modules on an HPC system, these still remain specific to the HPC system you’re working on and aren’t optimal when sharing code.

conda and container directives

Using Snakemake with conda/mamba

Snakemake also allows you to use conda or mamba environments, either from a definition file or from an existing environment on your system (the latter saves time when testing and developing your code, but for code sharing and FAIR practices, it’s best to avoid it).

e.g. we could create an environment file for our FastQC rule called fastqc_env.yaml:

name: fastQC_env
channels:
  - bioconda
  - conda-forge
dependencies:
  - fastqc=0.12.1

Then we could integrate it within our code like this:

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz"
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html",
  envmodules: "fastqc/0.12.1"
  conda: "fastqc_env.yaml" # or path to existing conda/mamba env dir
  shell: "fastqc --outdir FastQC {input}"

NB: to run Snakemake using conda environments, you have to add the --software-deployment-method conda option

By default, conda environments are created within the .snakemake directory. During re-runs of your code, if the definition file hasn’t changed, the environment isn’t regenerated.

snakemake --help has a section dedicated to conda options to change its default behaviour (e.g. --conda-prefix to specify where environments are created).

Using docker, singularity or apptainer

Just like for conda/mamba, you can use “external” containers found on Docker Hub or Sylabs, for example, or use local containers directly (the latter saves time when testing and developing your code, but for code sharing and FAIR practices, it’s best to avoid it).

e.g. for FastQC, there’s a container developed by Olivier on Sylabs:

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz"
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html",
  envmodules: "fastqc/0.12.1"
  container: "library://olivier/rnaseq_qc/fastqc" # docker://biocontainers/fastqc # or path to existing container
  shell: "fastqc --outdir FastQC {input}"

NB: to run Snakemake using containers, you have to add the --software-deployment-method apptainer option

By default, containers are created within the .snakemake directory.

snakemake --help has a section dedicated to apptainer options to change its default behaviour (e.g. --apptainer-prefix to specify where containers are stored or --apptainer-args to specify apptainer options such as binding paths or gpu usage activation).
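
The conda and apptainer options described above follow the same pattern on the command line. As a rough sketch, the helper below assembles such a call from Python; the function name is hypothetical, the Snakefile and config file names are borrowed from the report example and are only placeholders, and --cores is added on the assumption that your Snakemake version requires it.

```python
import shlex


def snakemake_command(snakefile, configfile, method, prefix=None, cores=1):
    """Assemble a snakemake call for a given software deployment method."""
    if method not in ("conda", "apptainer"):
        raise ValueError(f"unsupported deployment method: {method}")
    cmd = [
        "snakemake",
        "-s", snakefile,
        "--configfile", configfile,
        "--cores", str(cores),
        "--software-deployment-method", method,
    ]
    if prefix is not None:
        # Becomes --conda-prefix or --apptainer-prefix, depending on the method
        cmd += [f"--{method}-prefix", prefix]
    return cmd


# Print the command instead of running it; pass the list to subprocess.run()
# to actually launch Snakemake.
print(shlex.join(snakemake_command("ex1c_o5.smk", "ex1.yml", "apptainer", prefix="containers")))
```

Keeping the prefix outside the .snakemake directory lets several working directories reuse the same environments or images.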

Wrappers

Wrappers are, as the name indicates, short scripts that wrap around commonly used software, thereby simplifying its usage. Wrappers rely on conda environments that are provided with the wrapper and are stored in a common repository on GitHub. The main advantages of using wrappers are that you avoid setting up your own conda environment and that you can control software versions more easily through the repository’s commits/tags.

To use wrappers, Snakemake provides the wrapper directive that can be used instead of shell.

e.g. there’s a directory for FastQC in the repository; all we have to do is pick the commit or tag (i.e. version) we want to use:

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz"
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html",
  params:
    extra = "--quiet"
  wrapper: "v4.6.0/bio/fastqc"

In this case, we chose to use the tag v4.6.0. As you can see, each wrapper comes with its own parameters, which are usually explained in the meta.yaml file found in the wrapper’s directory in the repository.
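
To see where the files behind a wrapper string live, the sketch below builds their URLs. It assumes the default wrapper location given in the Snakemake documentation (the GitHub raw prefix below) and that a wrapper directory holds wrapper.py, meta.yaml and environment.yaml; the function name is hypothetical.

```python
# Default location Snakemake downloads wrappers from (assumed from the docs)
WRAPPER_PREFIX = "https://github.com/snakemake/snakemake-wrappers/raw/"


def wrapper_file_url(wrapper: str, filename: str = "wrapper.py") -> str:
    """Build the URL of a file stored in a wrapper directory."""
    return f"{WRAPPER_PREFIX}{wrapper.strip('/')}/{filename}"


# The wrapper script itself, its parameter description and its conda environment
for name in ("wrapper.py", "meta.yaml", "environment.yaml"):
    print(wrapper_file_url("v4.6.0/bio/fastqc", name))
```

Opening the meta.yaml URL in a browser is a quick way to check which input, output and params names a given wrapper version expects.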

NB: to run Snakemake using wrappers, you have to add the --software-deployment-method conda option as it uses conda in the background

Note that in order to run wrappers, you have to have snakemake-wrapper-utils installed.

More about wrappers can be found in the Snakemake documentation and in the Snakemake wrappers documentation. The catalog can be found on GitHub; most tools you’ll be interested in as a bioinformatician are within the “bio” subdirectory.

Theme X - Snakemake standards

As mentioned briefly in the Snakemake introduction, Snakemake has a few standards, both for syntax and for file organisation.

The basis is to use config files and profiles to set variables outside of the actual code.

Checking syntax

Getting suggestions

To check your syntax, Snakemake has a built-in option, --lint, which suggests a few improvements you can make to your code to make it more FAIR.

e.g. 

cquignot@clust-slurm-client2:~$ snakemake --lint --configfile ex1.yml 
Lints for snakefile Snakefile:
    * Absolute path "/{sample}.fastq.gz" in line 1:
      Do not define absolute paths inside of the workflow, since this renders your workflow irreproducible on other machines. Use path relative
      to the working directory instead, or make the path configurable via a config file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/{sample}.fastq.gz" in line 12:
      Do not define absolute paths inside of the workflow, since this renders your workflow irreproducible on other machines. Use path relative
      to the working directory instead, or make the path configurable via a config file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration

Lints for rule fastqc (line 19, /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/Snakefile):
    * Additionally specify a conda environment or container for each rule, environment modules are not enough:
      While environment modules allow to document and deploy the required software on a certain platform, they lock your workflow in there,
      disabling easy reproducibility on other machines that don't have exactly the same environment modules. Hence env modules (which might be
      beneficial in certain cluster environments), should always be complemented with equivalent conda environments.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule multiqc (line 60, /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/Snakefile):
    * Additionally specify a conda environment or container for each rule, environment modules are not enough:
      While environment modules allow to document and deploy the required software on a certain platform, they lock your workflow in there,
      disabling easy reproducibility on other machines that don't have exactly the same environment modules. Hence env modules (which might be
      beneficial in certain cluster environments), should always be complemented with equivalent conda environments.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Here, it mainly suggests adding alternative software control methods such as conda environments or containers. It also doesn’t like the way file paths are specified, as they should always be relative; in our case, they’re set via the config file, so it’s acceptable.
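
The path warning is triggered by the string literal "/{sample}.fastq.gz", which starts with a slash. One possible way around it, sketched here, is to let os.path.join insert the separator. Whether the linter then stays silent has not been checked, and the config value below is a hypothetical stand-in for what --configfile provides.

```python
import os.path

# Stand-in for the dict Snakemake fills from --configfile (value is hypothetical)
config = {"dataDir": "data"}

# No string literal starts with "/" any more; on POSIX systems the result is
# identical to config["dataDir"] + "/{sample}.fastq.gz"
fastq_pattern = os.path.join(config["dataDir"], "{sample}.fastq.gz")
print(fastq_pattern)
```

Since a Snakefile is Python underneath, the os.path.join(...) expression can be used directly in the input section of a rule.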

External scripts to reformat according to standards

External scripts can also be used to reformat your Snakefile according to common standards (equivalent tools exist for other languages too, e.g. Python). These standards are mainly set for code readability and sharing.

A tool developed by the Snakemake community is snakefmt. We won’t be demonstrating it here, but it’s useful to know it exists (especially if you’re planning on sharing your code).

https://github.com/snakemake/snakefmt

File organisation

We’re starting to have quite a few files:

Snakemake has defined a few standards on how they should be organised (of course, it’s not mandatory to follow them but it’s good to know they exist). The easiest is to illustrate it:
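
As a rough sketch, the layout recommended in the Snakemake documentation can be created with a few lines of Python. The folder names follow that recommendation as far as it is recalled here; the function name is hypothetical.

```python
from pathlib import Path

# Folders and files of the recommended layout (to be checked against the docs)
STANDARD_DIRS = [
    "workflow/rules",      # additional rule files included by the Snakefile
    "workflow/envs",       # conda environment definitions
    "workflow/scripts",    # external scripts
    "workflow/notebooks",  # notebooks
    "workflow/report",     # .rst captions and the general report file
    "config",              # configuration files
    "results",             # pipeline outputs
    "resources",           # retrieved or reference data
]
STANDARD_FILES = ["workflow/Snakefile", "config/config.yaml"]


def create_skeleton(root: str) -> Path:
    """Create an empty project skeleton under root and return its path."""
    base = Path(root)
    for folder in STANDARD_DIRS:
        (base / folder).mkdir(parents=True, exist_ok=True)
    for filename in STANDARD_FILES:
        (base / filename).touch(exist_ok=True)
    return base
```

With this layout, the files from this exercise would presumably go as follows: general.rst and QC.rst in workflow/report, fastqc_env.yaml in workflow/envs, and ex1.yml in config.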

In short: