schedule: a pool of commands, progressively linked by their treatments, from the input data towards the results:
arrow: output of tool n−1 = input for tool n
In case of data parallelization, several data flows can be processed in parallel:
With a multi-core PC or a computational cluster (e.g. 2000 cores), one (or more) core can be assigned to each workflow.
There are many workflow management systems, in many forms:
pros:
cons:
Snakemake: a mix of the Python programming language (snake) and Make, a rule-based automation tool
Good practice: one step, one rule
A rule is defined by its name and may contain directives:
input: list one or more file names
output: list one or more file names
run: (for Python) or shell: (for shell, R, etc.)
rule myRuleName:
input: myInFile
output: myOutputFile
shell: "cat < {input} > {output}"
Remark: with one command line, use a shell: directive; with many command lines, use a run: directive with the Python shell("...") function.
Optional directives can be added, e.g.: params:, message:, log:, threads:, ...
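A sketch of a rule combining these optional directives (myTool, its options, and the file names are placeholders, not from the course material):
rule myRuleName:
    input: "myInFile.txt"
    output: "myOutputFile.txt"
    params: opts="--verbose"                            # extra command-line options
    message: "Processing {input} on {threads} threads"
    log: "logs/myRuleName.log"                          # stderr is redirected here
    threads: 4
    shell: "myTool {params.opts} --threads {threads} {input} > {output} 2> {log}"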
A snakemake workflow links rules thanks to the filenames of the rules' input and output directives.
Snakemake rule order: the first rule is the default target rule and specifies the result files
Snakemake creates a DAG (directed acyclic graph) corresponding to the rules linkage
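A minimal sketch of this linkage (the file names are placeholders): the first rule, all, requests result.txt, which the second rule knows how to produce from raw.txt.
rule all:
    input: "result.txt"

rule compute:
    input: "raw.txt"
    output: "result.txt"
    shell: "sort {input} > {output}"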
Some useful snakemake command-line options:
-s mySmk : to change the default snakefile name
-n (--dryrun) : dry run, nothing is executed
-p (--printshellcmds) : print the shell commands
-r (--reason) : print the reason for each executed rule
-D
-j 1 (cores: -c 1) : limit the number of parallel jobs (cores)
--report report.html : create an HTML report (report.html) with runtime statistics, a visualization of the workflow topology, used software and data provenance information (need to add the jinja2 package as a dependency)
--archive (need git) : to save your project
--dag (or --rulegraph for rule dependencies) : workflow visualizations (with the dot tool of the graphviz package):
snakemake --dag -s mySmk | dot -Tpng > mySmk_dag.png
snakemake --rulegraph -s mySmk | dot -Tpng > mySmk_rule.png
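For example, a dry run on a snakefile named mySmk, printing the shell commands and the reason for each rule, with a single core:
snakemake -s mySmk -n -p -r -c 1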
Snakemake supports environments on a per-rule basis (created & activated on the fly):
conda: add a conda: directive in the rule definition (e.g. conda: myCondaEnvironment.yml) and run with the --use-conda option
docker: add a container: directive in the rule definition (e.g. container: "docker://biocontainers/fastqc") and run with the --use-singularity and --singularity-args "-B /path/outside/container/:/path/inside/container/" options
module: add an envmodules: directive in the rule definition (e.g. envmodules: "fastqc/0.11.9") and run with the --use-envmodules option
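A sketch of the conda case (the environment file envs/qc.yml and the file paths are placeholders), to be run with snakemake --use-conda:
rule fastqc:
    input: "Data/{sample}.fastq.gz"
    output: "FastQC/{sample}_fastqc.zip"
    conda: "envs/qc.yml"          # environment created and activated on the fly for this rule
    shell: "fastqc --outdir FastQC {input}"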
The snakefile is the text file that encodes the rules, and so the workflow.
The command snakemake runs the workflow encoded in the Snakefile file.
Once you have a snakefile, you can run the workflow for one input with: snakemake myInFile
To run it for several inputs, specify all the target files in the first rule (e.g. rule all), which can be written in several ways:
enumerated:
rule all:
input: "P10415.fasta","P01308.fasta"
python list & wildcards:
DATASETS=["P10415","P01308"]
rule all:
input: ["{dataset}.fasta".format(dataset=dataset) for dataset in DATASETS]
expand() & wildcards:
DATASETS=["P10415","P01308"]
rule all:
input: expand("{dataset}.fasta",dataset=DATASETS)
Snakemake wildcards allow replacing parts of filenames:
The same file can be matched in different ways:
Ex. with the file 101/file.A.txt:
rule one: output: "{set}1/file.{grp}.txt"   # set=10, grp=A
rule two: output: "{set}/file.A.{ext}"      # set=101, ext=txt
(more on wildcards in the snakemake documentation)
Without wildcards, uniprot_wow.smk:
rule get_prot:
output: "P10415.fasta", "P01308.fasta"
run:
shell("wget https://www.uniprot.org/uniprot/P10415.fasta")
shell("wget https://www.uniprot.org/uniprot/P01308.fasta")
With wildcards, uniprot_wiw.smk:
rule all:
input: "P10415.fasta", "P01308.fasta"
rule get_prot:
output: "{prot}.fasta"
shell: "wget https://www.uniprot.org/uniprot/{wildcards.prot}.fasta"
To deduce the identifiers (e.g. IDs) of files in a directory, use the built-in glob_wildcards function, e.g.:
IDs, = glob_wildcards("dirpath/{id}.txt")
glob_wildcards() matches the given pattern against the files present in the file system and thereby infers the values for all wildcards in the pattern ({id} here).
Hint: don't forget the comma after the name (left-hand side, IDs here).
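A sketch combining glob_wildcards() and expand() (dirpath, the results directory, and the processing command are placeholders):
IDs, = glob_wildcards("dirpath/{id}.txt")      # infer the IDs from the existing files

rule all:
    input: expand("results/{id}.out", id=IDs)

rule process:
    input: "dirpath/{id}.txt"
    output: "results/{id}.out"
    shell: "wc -l {input} > {output}"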
The (optional) definition of a configfile allows parameterizing the workflow as needed (--configfile smk_config.yml).
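A minimal sketch (the samples and dataDir keys are assumptions about what smk_config.yml might contain): configuration values are read from the config dictionary inside the snakefile.
configfile: "smk_config.yml"          # default; can also be given with --configfile on the command line
SAMPLES = config["samples"]           # hypothetical key: list of sample names
DATADIR = config["dataDir"]           # hypothetical key: path to the data directory

rule all:
    input: expand(DATADIR + "/{sample}.fastq.gz", sample=SAMPLES)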
It is also possible to define external workflows as modules, from which rules can be used by explicitly “importing” them.
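A sketch of this module mechanism (Snakemake >= 6; the module name and snakefile path are placeholders):
module qc_module:
    snakefile: "path/to/other/Snakefile"

use rule * from qc_module as qc_*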
We want to manage RNAseq data with a (small) workflow with 2 steps:
A classical analysis with fastq.gz data (in the ${PWD}/Data directory) and the creation of a ${PWD}/FastQC directory gives:
3 linked rules: fastQC, multiQC, all.
Wildcard: each rule concerns one file (* in the figure)
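A possible sketch of such a snakefile (a minimal version; the actual rule names, paths and tool options in FAIR_smk may differ):
SAMPLES, = glob_wildcards("Data/{sample}.fastq.gz")    # deduce sample names from the Data directory

rule all:
    input: "multiqc_report.html"

rule fastQC:
    input: "Data/{sample}.fastq.gz"
    output: "FastQC/{sample}_fastqc.zip"
    shell: "fastqc --outdir FastQC {input}"

rule multiQC:
    input: expand("FastQC/{sample}_fastqc.zip", sample=SAMPLES)
    output: "multiqc_report.html"
    shell: "multiqc FastQC"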
Snakemake creates the DAG from the snakefile.
Snakemake launch: the rule all needs multiqc_report.html, which doesn't exist yet but links towards the multiQC rule.
The rule multiQC needs zip files that don't exist yet but link towards the fastQC rule.
The rule fastQC needs fastq.gz files.
The fastq.gz files exist, so snakemake stops ascending, forwards the flow, and executes the fastQC rule.
There are 3 sequence files, so snakemake launches 3 fastQC jobs.
After the 3 executions of the fastQC rule, the zip files exist and feed the multiQC rule.
The multiqc_report constitutes the input file of the rule all.
So the rule all is completed, and the workflow too.
Snakemake automatically makes sure that everything is up to date; otherwise it launches the jobs that need to be re-run.
Note: in recent snakemake versions, "everything" includes mtime, params, input, software-env, and code (adjust with the --rerun-triggers option).
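For example, to consider only modification times when deciding what to re-run (assuming a snakefile named mySmk):
snakemake -s mySmk -c 1 --rerun-triggers mtime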
The final objective is to create a snakefile to manage a small workflow with 2 steps: i) fastqc ii) multiqc
These 2 tools (from the bioinformatics domain) allow checking the quality of NGS data.
The input data to run the workflow example are reduced RNAseq read files. Download (wget) the data from Zenodo (get the URL from the download button), then gunzip and tar -x:
%%sh
# on IFB cluster
cd ${PWD}
wget -q https://zenodo.org/record/3997237/files/FAIR_Bioinfo_data.tar.gz
gunzip FAIR_Bioinfo_data.tar.gz
tar -xvf FAIR_Bioinfo_data.tar
rm FAIR_Bioinfo_data*
%%sh
# on IFB cluster
cd ${PWD}
git clone https://github.com/clairetn/FAIR_smk
Only if used with conda: use the envfair.yml file in the cloned FAIR_smk repository to create the conda environment.
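For example (the environment name envfair defined inside envfair.yml is an assumption here):
conda env create -f FAIR_smk/envfair.yml
conda activate envfair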
With module load on the IFB core cluster (version 7.8.2 of the docker container is not available):
module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12
check with: snakemake --version
%%sh
# on IFB cluster
cd ${PWD}
module load snakemake/7.7.0
snakemake --version
%%sh
# on IFB cluster
cd ${PWD}
module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12
snakemake -s FAIR_smk/smk_all_samples.smk \
--configfile FAIR_smk/smk_config.yml -c1 -p
With snakemake, you can launch the same snakefile (adapting the snakemake config file) on your laptop or on a computing cluster.
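For example, with Snakemake 7 on a SLURM cluster, a sketch of a submission command (the sbatch options are indicative; a --profile can be used instead):
snakemake -s FAIR_smk/smk_all_samples.smk --configfile FAIR_smk/smk_config.yml \
  --jobs 10 --cluster "sbatch --cpus-per-task={threads} --mem=4G"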
Other resources: