Exercise D - a few extra hacks to take home
All you have to do is run Snakemake a second time, this time with the --report option, after running your pipeline.
e.g.
snakemake -s ex1c_o5.smk --configfile ex1.yml --report report.html
Hint: in some browsers, the html report page overflows the window. A quick hack to fix this is to remove all mentions of h-screen like this:
sed -i -E 's/([^l]) h-screen/\1/g' report.html
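To make the pattern in the sed call easier to read, here is a minimal Python sketch of roughly the same substitution. The function name strip_h_screen and the sample HTML are made up for illustration; only the regular expression comes from the command above.

```python
import re

# Same pattern as the sed call: one character that is not "l", then " h-screen".
H_SCREEN = re.compile(r"([^l]) h-screen")


def strip_h_screen(html: str) -> str:
    """Remove ' h-screen' while keeping the character captured just before it."""
    return H_SCREEN.sub(r"\1", html)


if __name__ == "__main__":
    # Hypothetical snippet, not taken from a real Snakemake report.
    sample = '<div class="flex h-screen w-full"><main class="min-h-screen"></main></div>'
    print(strip_h_screen(sample))
```

The captured group is put back by \1, so only the space and the class name disappear. A class such as min-h-screen is left alone because it has no space before h-screen.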
The basic report contains the Snakemake DAG that was run, the commands that were run in each step and some statistics (time & resources used) per job.
Typically, in the FastQC + MultiQC pipeline, we could include MultiQC’s output in the report. To do so, we have to specify which files to include within the Snakefile using the report() function.
output:
    report("multiqc_report.html"),
    directory("multiqc_data"),
The report() function can take extra arguments such as labels, category or caption to add descriptions to your included file.
Captions are given as reStructuredText files (.rst) and are, as the name indicates, a short description of your file (like the caption of a figure).
output:
    report(
        "multiqc_report.html",
        caption="QC.rst",
        category="0_QC",
        labels={"step": "QC", "description": "MultiQC report"},
    ),
Although .rst files included as captions are just that, captions, you can also include a more detailed report using a general report file containing, for example, references and versions of the tools used. To include this .rst file in your report, you need to add the report directive at the beginning of your Snakefile, outside any rule.
e.g.
report: "general.rst"
general.rst could look like this, for example:
My first Snakemake pipeline - final report
--------------------------------------------
Pipeline for running QC analysis on NGS input
A. Quality assessment
>>>>>>>>>>>>>>>>>>>>>
See `0_QC`_ for results
This step uses FastQC_ for individual quality assessment of raw sequencing files (`.fastq` files) and MultiQC_ to generate a global report.
.. _FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
.. _MultiQC: https://multiqc.info/
Input files were taken from {{ snakemake.config["dataDir"] }}
\- FastQC
=========
- *Params:* default parameters
- *Version:* FastQC v0.12.1
- *Cite:* Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
\- MultiQC
==========
- *Params:* default parameters
- *Version:* multiqc, version 1.16
- *Cite:* MultiQC: Summarize analysis results for multiple tools and samples in a single report Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller. Bioinformatics (2016)
Hint: for more on the rst format, there’s Snakemake’s cheat sheet or SourceForge’s quickstart guide
https://snakemake.readthedocs.io/en/stable/snakefiles/reporting.html
You’ve seen in part C of this exercise that you can use already installed software, accessible through modules on an HPC system, using the envmodules directive.
Although it’s a good idea to use modules on an HPC system, these still remain specific to the HPC system you’re working on and aren’t optimal when sharing code.
conda and container directives
conda/mamba
Snakemake also allows you to use conda or mamba environments, either from a definition file or from an existing environment on your system. The latter is useful to save time when testing and developing your code, but for code sharing and FAIR practices it’s best avoided.
e.g. we could create an environment file for our FastQC rule called fastqc_env.yaml:
name: fastQC_env
channels:
  - bioconda
  - conda-forge
dependencies:
  - fastqc=0.12.1
Then we could integrate it within our code like this:
rule fastqc:
    input:
        config["dataDir"]+"/{sample}.fastq.gz"
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html",
    envmodules: "fastqc/0.12.1"
    conda: "fastqc_env.yaml" # or path to existing conda/mamba env dir
    shell: "fastqc --outdir FastQC {input}"
NB: to run Snakemake using conda environments, you have to add the --software-deployment-method conda option.
By default, conda environments are created within the .snakemake directory. During re-runs of your code, if the definition file hasn’t changed, the environment isn’t regenerated. snakemake --help has a section dedicated to conda options to change its default behaviour (e.g. --conda-prefix to specify where environments are created).
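The "unchanged definition file, no rebuild" behaviour can be pictured as content-based caching. The sketch below is only an illustration of that general idea, not Snakemake's actual implementation. The function names, the 16-character digest and the .snakemake/conda prefix used in the demo are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def env_dir_for(definition: Path, prefix: Path) -> Path:
    """Return a directory whose name is derived from the definition file's content."""
    digest = hashlib.sha256(definition.read_bytes()).hexdigest()[:16]
    return prefix / digest


def ensure_env(definition: Path, prefix: Path) -> bool:
    """Create the environment directory if needed; return True when it was built."""
    target = env_dir_for(definition, prefix)
    if target.exists():
        return False  # same content, same hash: the existing directory is reused
    target.mkdir(parents=True)  # placeholder for the real (slow) conda install
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_file = Path(tmp) / "fastqc_env.yaml"
        prefix = Path(tmp) / ".snakemake" / "conda"  # assumed location, for the demo only
        env_file.write_text("dependencies:\n  - fastqc=0.12.1\n")
        print(ensure_env(env_file, prefix))  # first run
        print(ensure_env(env_file, prefix))  # unchanged definition file
        env_file.write_text("dependencies:\n  - fastqc=0.11.9\n")
        print(ensure_env(env_file, prefix))  # edited definition file
```

Keying the directory on the file content rather than on its name means that editing a version pin is enough to trigger a fresh environment, while renaming nothing and changing nothing costs no time on re-runs.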
Just like for conda/mamba, you can use “external” containers that you find on Docker Hub or Sylabs for example, or you can use local containers directly. The latter is useful to save time when testing and developing your code, but for code sharing and FAIR practices it’s best avoided.
e.g. for FastQC, there’s a container developed by Olivier on Sylabs:
rule fastqc:
    input:
        config["dataDir"]+"/{sample}.fastq.gz"
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html",
    envmodules: "fastqc/0.12.1"
    container: "library://olivier/rnaseq_qc/fastqc" # docker://biocontainers/fastqc # or path to existing container
    shell: "fastqc --outdir FastQC {input}"
NB: to run Snakemake using containers, you have to add the --software-deployment-method apptainer option.
By default, containers are created within the .snakemake directory. snakemake --help has a section dedicated to apptainer options to change its default behaviour (e.g. --apptainer-prefix to specify where containers are stored or --apptainer-args to specify apptainer options such as binding paths or gpu usage activation).
Wrappers are, as the name indicates, short scripts that wrap around commonly used software, thereby simplifying its usage. Wrappers rely on conda environments that are provided with the wrapper and are stored in a common repository on GitHub. The main advantages of using wrappers are that you avoid setting up your own conda environment and that software versions are easier to control through the repository’s commits/tags.
To use wrappers, Snakemake provides the wrapper directive that can be used instead of shell.
e.g. There’s a directory for FastQC, all we have to do is pick the commit (i.e. version) we want to use:
rule fastqc:
    input:
        config["dataDir"]+"/{sample}.fastq.gz"
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html",
    params:
        extra = "--quiet"
    wrapper: "v4.6.0/bio/fastqc"
In this case, we chose to use the tag v4.6.0. As you can see, each wrapper comes with its own parameters, which are usually explained in the meta.yaml file you can find in the repository.
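To give a feel for what "a short script that wraps around a tool" means, here is a simplified Python sketch. It is not the real FastQC wrapper from the repository. The build_command function and the SimpleNamespace stand-in for the snakemake object are assumptions made so that the example runs on its own.

```python
import os.path
import shlex
from types import SimpleNamespace


def build_command(snakemake) -> str:
    """Assemble a FastQC command line from a rule's input, output and params."""
    extra = getattr(snakemake.params, "extra", "")  # e.g. "--quiet" from the rule
    outdir = os.path.dirname(snakemake.output[0]) or "."
    parts = ["fastqc", extra, "--outdir", shlex.quote(outdir), shlex.quote(snakemake.input[0])]
    return " ".join(part for part in parts if part)


if __name__ == "__main__":
    # Hypothetical stand-in for the object Snakemake hands to a wrapper script.
    fake = SimpleNamespace(
        input=["data/sampleA.fastq.gz"],
        output=["FastQC/sampleA_fastqc.zip", "FastQC/sampleA_fastqc.html"],
        params=SimpleNamespace(extra="--quiet"),
    )
    print(build_command(fake))
```

The point of the sketch is the division of labour: the rule only declares input, output and params, and the wrapper turns them into the actual command line.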
NB: to run Snakemake using wrappers, you have to add the --software-deployment-method conda option as it uses conda in the background.
Note that in order to run wrappers, you need to have snakemake-wrapper-utils installed.
More about wrappers here and here. The catalog can be found on GitHub; most tools you’ll be interested in as a bioinformatician are within the “bio” subdirectory.
As mentioned briefly in the Snakemake introduction, Snakemake has a few standards, for syntax as well as for file organisation.
Using config files and profiles to set variables outside of the actual code is the basis.
To check your syntax, Snakemake has an integrated option, --lint, that suggests a few improvements you can make to your code to make it more FAIR.
e.g.
cquignot@clust-slurm-client2:~$ snakemake --lint --configfile ex1.yml
Lints for snakefile Snakefile:
* Absolute path "/{sample}.fastq.gz" in line 1:
Do not define absolute paths inside of the workflow, since this renders your workflow irreproducible on other machines. Use path relative
to the working directory instead, or make the path configurable via a config file.
Also see:
https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
* Absolute path "/{sample}.fastq.gz" in line 12:
Do not define absolute paths inside of the workflow, since this renders your workflow irreproducible on other machines. Use path relative
to the working directory instead, or make the path configurable via a config file.
Also see:
https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
Lints for rule fastqc (line 19, /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/Snakefile):
* Additionally specify a conda environment or container for each rule, environment modules are not enough:
While environment modules allow to document and deploy the required software on a certain platform, they lock your workflow in there,
disabling easy reproducibility on other machines that don't have exactly the same environment modules. Hence env modules (which might be
beneficial in certain cluster environments), should always be complemented with equivalent conda environments.
Also see:
https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers
Lints for rule multiqc (line 60, /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/Snakefile):
* Additionally specify a conda environment or container for each rule, environment modules are not enough:
While environment modules allow to document and deploy the required software on a certain platform, they lock your workflow in there,
disabling easy reproducibility on other machines that don't have exactly the same environment modules. Hence env modules (which might be
beneficial in certain cluster environments), should always be complemented with equivalent conda environments.
Also see:
https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers
Here, it mainly suggests adding alternative software control methods such as conda environments or containers. It also doesn’t like the way file paths are specified, as they should always be relative; in our case, they’re set via the config file so it’s acceptable.
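The "absolute path" message points at the string literal "/{sample}.fastq.gz", which starts with a slash once it is looked at on its own. One possible way to write the same pattern without such a literal is sketched below in plain Python. The config dictionary here is a hypothetical stand-in for the one Snakemake fills from --configfile, and whether this actually silences the lint has not been checked here.

```python
import os.path

# Hypothetical stand-in for the dictionary Snakemake fills from --configfile.
config = {"dataDir": "data"}

# Pattern used in the rules above: the literal "/{sample}.fastq.gz" starts with "/".
concatenated = config["dataDir"] + "/{sample}.fastq.gz"

# Same pattern, but no string literal begins with a path separator.
joined = os.path.join(config["dataDir"], "{sample}.fastq.gz")

if __name__ == "__main__":
    print(concatenated)
    print(joined)
```

Either way the data directory still comes from the config file, so the workflow stays portable; the second form only changes how the pattern is spelled.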
External scripts can also be used to reformat your Snakefile according to common standards (equivalents exist in other languages too, e.g. in Python). These standards are mainly set for code readability and sharing.
A tool developed by the Snakemake community is snakefmt. We won’t be demonstrating it here, but it’s useful to know it exists (especially if you’re planning on sharing your code).
https://github.com/snakemake/snakefmt
We’re starting to have quite a few files:
Snakemake has defined a few standards on how they should be organised (of course, it’s not mandatory to follow them but it’s good to know they exist). The easiest is to illustrate it:
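In short:
- the workflow code is organised in subfolders (rules, envs, report, …)
- rules can be split into separate files (.smk) that you put in the rules folder and that are included within the main Snakefile using the include directive
- report captions go in the report folder, conda environment definition files go in envs, custom scripts (e.g. python, R…) go in scripts, etc.
- watch out: paths given in these directives are relative to the file (.smk) you’re in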
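The folder names mentioned above (rules, envs, report, scripts) can be tried out with a small Python sketch that builds an empty skeleton in a temporary directory. The enclosing workflow folder and the file names qc.smk, fastqc_env.yaml and QC.rst are assumptions chosen to match this exercise, not names required by Snakemake.

```python
import tempfile
from pathlib import Path

# Sub-folders named in the text; the enclosing "workflow" folder is an assumption.
SUBFOLDERS = ("rules", "envs", "report", "scripts")


def make_skeleton(root: Path) -> list[str]:
    """Create an empty project skeleton and return the relative paths created."""
    workflow = root / "workflow"
    for name in SUBFOLDERS:
        (workflow / name).mkdir(parents=True, exist_ok=True)
    # Hypothetical file names; the include path is relative to the Snakefile.
    (workflow / "Snakefile").write_text('include: "rules/qc.smk"\n')
    (workflow / "rules" / "qc.smk").touch()
    (workflow / "envs" / "fastqc_env.yaml").touch()
    (workflow / "report" / "QC.rst").touch()
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("\n".join(make_skeleton(Path(tmp))))
```

Because paths in directives are resolved from the file that contains them, a rule inside rules/qc.smk would refer to its environment file as "../envs/fastqc_env.yaml" in this layout.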