Introduction to Snakemake workflows¶


Schedule:

  • workflow introduction
  • Snakemake introduction, the rule concept
  • Snakemake & the snakefile
  • an example with a 2-step workflow

Workflow definition¶

a pool of commands, progressively linked by their processing steps, from the input data towards the results:

[figure: a workflow]

arrow: output of tool n−1 = input for tool n

In case of data parallelization, several data flows can be processed in parallel:

[figure: a workflow with parallel data flows]

With a multi-core PC or a computational cluster (e.g. 2000 cores), one or more cores can be assigned to each workflow.

Workflow management systems¶

There are many workflow management systems, in many forms:

  • command line: shell scripts (you have to manage the parallelization yourself, not easy)
  • rule-based: Snakemake, Make, Nextflow, ...
  • graphical interface: Galaxy, Taverna, Kepler, ...

pros:

  • reproducibility: keeps track of when and how each file was generated
  • manages parallelization (and error recovery)

cons:

  • learning effort

We choose:
snakemake

  • works on files (rather than streams, reading/writing from databases, or passing variables in memory)
  • is based on Python (but knowing how to code in Python is not required)
  • has features for defining the environment of each task (running a large number of small third-party tools is common in bioinformatics)
  • scales easily from desktop to server, cluster, grid, or cloud environments without modification (i.e. develop on your single-core laptop using a small subset of the data, then run the real analysis on a cluster)

The Snakemake rule (1/2)¶

Snakemake: a mix of the Python programming language (snake) and Make, a rule-based automation tool

Good practice: one step, one rule


A rule is defined by its name and may contain directives:

  • input: lists one or more file names
  • output: lists one or more file names
  • command (run: for Python; shell: for shell, R, etc.)

The Snakemake rule (2/2)¶


rule myRuleName:
   input: myInFile
   output: myOutputFile
   shell: "cat < {input} > {output}"

Remark: with one command line, use a shell: directive; with several command lines, use a run: directive together with the Python shell("...") function.
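
For instance, a minimal sketch of a rule with several command lines in a run: directive (file and rule names are illustrative):

rule merge_and_count:
  input: "a.txt", "b.txt"
  output: "merged.txt", "counts.txt"
  run:
    # each shell() call executes one command line
    shell("cat {input} > {output[0]}")
    shell("wc -l {output[0]} > {output[1]}")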

Optional directives can be added, e.g.: params:, message:, log:, threads:, ...
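
A minimal sketch with some optional directives (the values and the pigz tool are illustrative assumptions, not from the course material):

rule compress:
  input: "{sample}.txt"
  output: "{sample}.txt.gz"
  params: level="-6"                  # compression level passed to the command
  threads: 2
  log: "logs/{sample}.log"
  message: "Compressing {input}"
  shell: "pigz {params.level} -p {threads} -c {input} > {output} 2> {log}"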

The data flow linkage and rules order¶

A Snakemake workflow links rules thanks to the file names in the input and output directives of the rules.

The output of one rule becomes the input of the next one.

Snakemake rule order: the first rule is the default target rule and specifies the result files (sketched below).

Snakemake creates a DAG (directed acyclic graph) corresponding to the rules linkage
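
A minimal sketch of this linkage (file and rule names are illustrative): the first rule, all, is the default target, and its input is produced by the second rule.

rule all:
  input: "result.txt"          # default target rule: specifies the result file

rule compute:
  input: "data.txt"
  output: "result.txt"         # matches the input of rule all
  shell: "sort {input} > {output}"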

Snakemake run options¶

  • -s mySmk: change the default snakefile name
  • -n (--dry-run): do not execute anything, display what would be done
  • -p (--printshellcmds): print the shell commands
  • -r (--reason): print the reason for each rule execution
  • -D (--detailed-summary): print a detailed summary and status of each output file
  • -j 1 (jobs) / -c 1 (cores): limit the number of parallel jobs/cores

See the Snakemake documentation for all Snakemake options.
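
For example, a typical check-then-run sequence (mySmk is the snakefile name used above):

snakemake -s mySmk -n -p -r    # dry run: show planned jobs, commands, and reasons
snakemake -s mySmk -c 1        # real run, limited to 1 core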

Snakemake output options¶

  • create HTML reports automatically (--report report.html), with runtime statistics, a visualization of the workflow topology, the software used, and data provenance information (requires the jinja2 package as a dependency)
  • use the --archive option (need git) to save your project
  • complete workflow (--dag) or rules dependencies (--rulegraph) visualizations (with the dot tool of the graphviz package):
    snakemake --dag -s mySmk | dot -Tpng > mySmk_dag.png
    snakemake --rulegraph -s mySmk | dot -Tpng > mySmk_rule.png
    [figure: DAG and rulegraph visualizations]

Snakemake environment options¶

Snakemake supports environments on a per-rule basis (created & activated on the fly):

conda:

  • add a conda: directive in the rule definition (e.g. conda: myCondaEnvironment.yml); see the sketch after this list
  • run Snakemake with the --use-conda option

docker:

  • add a container: directive in the rule definition (e.g. container: "docker://biocontainers/fastqc")
  • run Snakemake with the --use-singularity and --singularity-args "-B /path/outside/container/:/path/inside/container/" options

module:

  • add an envmodules: directive in the rule definition (e.g. envmodules: "fastqc/0.11.9")
  • run Snakemake with the --use-envmodules option
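
For example, the conda case might look like this (a sketch: the environment file name and its content are assumptions):

rule fastqc:
  input: "{sample}.fastq.gz"
  output: "{sample}_fastqc.zip"
  conda: "envs/fastqc.yml"     # created & activated on the fly with --use-conda
  shell: "fastqc {input}"

with envs/fastqc.yml containing, e.g.:

channels: [bioconda, conda-forge]
dependencies: [fastqc=0.11.9]

and the run launched with: snakemake --use-conda -c 1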

Get a Snakefile¶

The snakefile is the text file that encodes the rules, and thus the workflow.
The snakemake command runs the workflow encoded in the file named Snakefile (the default).

You can get a snakefile:

  • from github, your colleagues, ...
  • the Snakemake workflow catalog (the nf-core equivalent): https://snakemake.github.io/snakemake-workflow-catalog/ (about 2k pipelines, 177 of them tested)
  • compose with snakemake wrappers
  • by using a Nextflow workflow! (integration via snakemake-wrappers)
  • create from scratch

To run the workflow for one target file (the file to be produced): snakemake myTargetFile

Snakefile: input (output) specifications¶

enumerated:

rule all:
  input: "P10415.fasta","P01308.fasta"

python list & wildcards:

DATASETS=["P10415","P01308"]
rule all:
  input: ["{dataset}.fasta".format(dataset=dataset) for dataset in DATASETS]

expand() & wildcards:

DATASETS=["P10415","P01308"]
rule all:
  input: expand("{dataset}.fasta",dataset=DATASETS)

Snakefile: generalization with wildcards¶

Snakemake wildcards allow parts of file names to be replaced; wildcards:

  • reduce hardcoding: the input and output directives become more flexible and work on new data without modification
  • are automatically resolved (i.e. replaced by the regular expression ".+" in file names)
  • are written between {}
  • are specific to a rule

The same file can be matched by different patterns.
Ex. with the file 101/file.A.txt:
rule one: output: "{set}1/file.{grp}.txt"  # set=10, grp=A
rule two: output: "{set}/file.A.{ext}"     # set=101, ext=txt
(more on wildcards in the Snakemake documentation)

With and without wildcards example¶

without wildcards, uniprot_wow.smk:

rule get_prot:
  output: "P10415.fasta", "P01308.fasta"
  run:
    shell("wget https://www.uniprot.org/uniprot/P10415.fasta")
    shell("wget https://www.uniprot.org/uniprot/P01308.fasta")

with wildcards, uniprot_wiw.smk:

rule all:
  input: "P10415.fasta", "P01308.fasta"

rule get_prot:
  output: "{prot}.fasta"
  shell: "wget https://www.uniprot.org/uniprot/{wildcards.prot}.fasta"

Snakefile: get input file names from the file system¶

To deduce the identifiers (eg. IDs) of files in a directory, use the inbuilt glob_wildcards function, eg.:

IDs, = glob_wildcards("dirpath/{id}.txt")

glob_wildcards() matches the given pattern against the files present in the system and thereby infers the values for all wildcards in the pattern ({id} here).

Hint: don't forget the comma after the name (left-hand side, IDs here).
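
A sketch combining glob_wildcards() and expand() to process every matching file (dirpath and the counting command are illustrative):

IDs, = glob_wildcards("dirpath/{id}.txt")

rule all:
  input: expand("dirpath/{id}.counts", id=IDs)

rule count:
  input: "dirpath/{id}.txt"
  output: "dirpath/{id}.counts"
  shell: "wc -l {input} > {output}"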

Snakefile: Using a snakemake config file¶

The (optional) definition of a config file allows you to parameterize the workflow as needed (--configfile smk_config.yml).
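
A minimal sketch, assuming a smk_config.yml such as:

data_dir: "Data"
samples: ["P10415", "P01308"]

The snakefile then reads values from the config dictionary:

configfile: "smk_config.yml"   # default; can be overridden with --configfile

rule all:
  input: expand(config["data_dir"] + "/{s}.fasta", s=config["samples"])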

Subworkflows or Modules¶

It is also possible to define external workflows as modules, from which rules can be used by explicitly “importing” them.
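
A minimal sketch of the module syntax (the path and the rnaseq name/prefix are assumptions):

module rnaseq:
  snakefile: "path/to/other/Snakefile"
  config: config

use rule * from rnaseq as rnaseq_*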

The workflow example¶

We want to manage RNA-seq data with a (small) 2-step workflow:
[figure: the 2-step workflow example]

A classical analysis with fastq.gz data (in the ${PWD}/Data directory) and the creation of a ${PWD}/FastQC directory gives:
[figure: the 2-step workflow example]

Translation into a snakefile¶

[figure: the rule translation of the 2-step workflow]
3 linked rules: fastQC, multiQC, and all.
Wildcard: each rule processes one file at a time (* in the figure).
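
A possible translation of this figure (a sketch: the exact file and directory names are assumptions, not the course's snakefile):

SAMPLES, = glob_wildcards("Data/{sample}.fastq.gz")

rule all:
  input: "multiqc_report.html"

rule fastQC:
  input: "Data/{sample}.fastq.gz"
  output: "FastQC/{sample}_fastqc.zip"
  shell: "fastqc --outdir FastQC {input}"

rule multiQC:
  input: expand("FastQC/{sample}_fastqc.zip", sample=SAMPLES)
  output: "multiqc_report.html"
  shell: "multiqc {input}"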

Running path¶

Snakemake creates the DAG from the snakefile.

Running path¶

Snakemake launch: the rule all needs multiqc_report.html, which doesn't exist yet, so Snakemake links back to the multiQC rule.

Running path¶

The rule multiQC needs zip files that don't exist yet, so Snakemake links back to the fastQC rule.

Running path¶

The rule fastQC needs fastq.gz files.

Running path¶

The fastq.gz files exist: Snakemake stops going backward, runs the flow forward, and executes the fastQC rule.

Running path¶

There are 3 sequence files, so Snakemake launches 3 fastQC jobs.

Running path¶

After the 3 executions of the fastQC rule, the zip files exist and feed the multiQC rule.

Running path¶

The multiqc_report.html file constitutes the input of the rule all.

Running path¶

So the rule all is completed, and with it the whole workflow.

Timestamp¶

Snakemake automatically makes sure that everything is up to date; otherwise it launches the jobs that need to be re-run:

  • output files are re-created when an input file timestamp is newer than the output file timestamp
  • from this point, Snakemake goes on through the workflow and applies the downstream rules

[figure: backtracking]

Note: in recent Snakemake versions, "everything" includes mtime, params, input, software-env, and code (tune this with the --rerun-triggers option).
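
For example, to consider only timestamps when deciding what to re-run (a sketch, using the snakefile name from above):

snakemake -s mySmk --rerun-triggers mtime -n   # dry run to preview what would be re-run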

A Snakefile example¶

The final objective is to create a snakefile to manage a small 2-step workflow: i) fastqc, ii) multiqc.

[figure: the 2-step workflow example]

These 2 tools (from the bioinformatics domain) check the quality of NGS data.

prerequisite 1: data¶

The input data to run the workflow example are reduced RNA-seq read files. Download (wget) the data from Zenodo: get the URL from the download button, then gunzip and tar -x:

In [ ]:
%%sh
# on IFB cluster
cd ${PWD}
wget -q https://zenodo.org/record/3997237/files/FAIR_Bioinfo_data.tar.gz
gunzip FAIR_Bioinfo_data.tar.gz
tar -xvf FAIR_Bioinfo_data.tar
rm FAIR_Bioinfo_data*

prerequisite 2: snakefile¶

smk_all_samples.smk, get it from FAIR_smk

In [ ]:
%%sh
# on IFB cluster
cd ${PWD}
git clone https://github.com/clairetn/FAIR_smk

prerequisite 3: conda environment¶

Only if using conda: use the envfair.yml file from the cloned FAIR_smk repository to create the conda environment.
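
A typical way to create and activate it (envfair is an assumed environment name):

conda env create -n envfair -f FAIR_smk/envfair.yml
conda activate envfair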

prerequisite 4: Snakemake & tools¶

with module load on the IFB core cluster (version 7.8.2 of the docker container is not available):

module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12

check with: snakemake --version

In [ ]:
%%sh
# on IFB cluster
cd ${PWD}
module load snakemake/7.7.0
snakemake --version

run the workflow¶

In [ ]:
%%sh
# on IFB cluster
cd ${PWD}
module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12
snakemake -s FAIR_smk/smk_all_samples.smk \
          --configfile FAIR_smk/smk_config.yml -c1 -p

Conclusion¶

With snakemake, you can launch the same snakefile (adapting the snakemake config file) on your laptop or on a computing cluster.

Other resources:

  • a training course to create the workflow step by step
  • the workflow composed with Snakemake wrappers, cf. ex1_o8_wrapper_linted.smk