Schedule:
1- Data: the input data to run the workflow example are reduced RNA-seq read files. Download the data from Zenodo (here) and unpack them (gunzip FAIR_Bioinfo_data.tar.gz) in the same directory as the notebook.
2- Snakemake: to run Snakemake, use either a Docker container or a Conda environment:
Docker container: use the jupyterlab-notebook Docker image extended with snakemake, fastqc and multiqc; you may download it here:
Note: if you want to run the commands outside Jupyter, create a Conda environment based on the envfair.yml file:
name: envfair # conda environment name
channels:
- bioconda
dependencies:
- snakemake-minimal>=6.5 # workflow manager
- fastqc=0.11.9 # quality check of fastq data (java)
- multiqc=1.12 # reports aggregation (python package)
Laptop with docker:
save_jupylab_smk.tar # get the docker image archive
docker load < save_jupylab_smk.tar # load the docker image from the archive
docker run --rm -v ${PWD}:/home/jovyan -w /home/jovyan --user "$(id -u):$(id -g)" -p 8888:8888 test/jupylab_smk:1.0
snakemake ... # run
Laptop with conda:
conda env create -f envfair.yml
conda activate envfair
snakemake ... # run
IFB core cluster (version 7.8.2 of the docker container is not yet available there):
module load slurm-drmaa/1.0.8 snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12
snakemake ... # run
Check the installation: replace ... with --version:
%%sh
snakemake --version
7.8.2
A workflow is a pool of commands, progressively linked by their processing steps, from the input data towards the results:
arrow: output of tool n−1 = input for tool n
In case of data parallelization, several data flows can be processed in parallel:
With a multi-core PC or a computational cluster (e.g. 2000 cores), one (or more) core can be assigned to each workflow.
Many workflow management systems, many forms:
pros:
cons:
Snakemake: a mix of the Python programming language (snake) and Make, a rule-based automation tool
Good practice: one step, one rule
A rule is defined by its name and may contain several directives:
input: list of one or more file names
output: list of one or more file names
run: (for python) or shell: (for shell, R, etc.)
params:, message:, log:, threads:, ...
Remark: with one command line, use a shell: directive; with many command lines, use a run: directive together with the python shell("...") function (see the sketch below).
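As an illustration only, here is a minimal sketch of a rule combining several of these directives (the rule name, file names and parameter are hypothetical and not part of the exercises below):
rule count_reads:
    input: "Data/{sample}.fastq.gz"
    output: "Counts/{sample}.count.txt"
    params: lines_per_read=4                      # 4 fastq lines per read (hypothetical parameter)
    threads: 1
    message: "Counting the lines of sample {wildcards.sample}"
    log: "Logs/{sample}_count.log"
    run:
        # several command lines => run: directive with the python shell("...") function
        shell("zcat {input} | wc -l > {output} 2> {log}")
        shell("echo 'divide by {params.lines_per_read} to get the read count' >> {log}")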
A snakemake workflow links rules thanks to the file names of the rules' input and output directives, as sketched below.
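A minimal sketch (hypothetical file and rule names) of two linked rules: the output file name of the first rule matches the input file name of the second, so asking Snakemake for results.txt runs both rules in order:
rule sort_list:
    input: "raw.txt"
    output: "intermediate.txt"
    shell: "sort {input} > {output}"

rule remove_duplicates:
    input: "intermediate.txt"       # same file name as the output of sort_list => the rules are linked
    output: "results.txt"
    shell: "uniq {input} > {output}"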
Snakemake rule order: Snakemake works backward, from the requested output files to the input files, to determine which rules to run.
Snakemake automatically makes sure that everything is up to date; otherwise it launches the jobs that need to be rerun:
Snakemake uses wildcards, which allow parts of a file name to be replaced:
The same file can be matched by different wildcard patterns. Example with the file 101/file.A.txt:
rule one : output : "{set}1/file.{grp}.txt" # => set=10, grp=A
rule two : output : "{set}/file.A.{ext}"    # => set=101, ext=txt
(more on wildcards in the snakemake documentation)
without wildcards, uniprot_wow.smk:
rule get_prot:
    output: "P10415.fasta", "P01308.fasta"
    run:
        shell("wget https://www.uniprot.org/uniprot/P10415.fasta")
        shell("wget https://www.uniprot.org/uniprot/P01308.fasta")
with wildcards, uniprot_wiw.smk:
rule all:
input: "P10415.fasta", "P01308.fasta"
rule get_prot:
output: "{prot}.fasta"
shell: "wget https://www.uniprot.org/uniprot/{wildcards.prot}.fasta"
enumerated:
rule all:
input: "P10415.fasta","P01308.fasta"
python list & wildcards:
DATASETS=["P10415","P01308"]
rule all:
input: ["{dataset}.fasta".format(dataset=dataset) for dataset in DATASETS]
expand() & wildcards:
DATASETS=["P10415","P01308"]
rule all:
input: expand("{dataset}.fasta",dataset=DATASETS)
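The three variants above build the same Python list of target file names. A quick sketch of what expand() returns inside a snakefile (the print() call is only for illustration; it runs when the snakefile is parsed):
DATASETS=["P10415","P01308"]
TARGETS = expand("{dataset}.fasta", dataset=DATASETS)
print(TARGETS)    # ['P10415.fasta', 'P01308.fasta']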
The final objective is to create a snakefile to manage a small workflow with 2 steps: i) fastqc, ii) multiqc.
These two bioinformatics tools are used to check the quality of high-throughput sequencing data.
They are accessible via a Conda environment (envfair.yml) or already included in the docker image test/jupylab_smk:1.0 (see the prerequisites above).
The sequence data are in the ${PWD}/Data directory (see the prerequisites above).
Note: the snakefiles are named ex1_oX.smk, with ex for example, o for objective, and X a progressive digit.
%%sh
rm -Rf FastQC multiqc_data multiqc_report.html
Objective 1: create a snakemake file named ex1_o1.smk including the first step of the RNAseq workflow (the read quality check, thanks to the fastqc tool) on one of the RNAseq files.
Hint:
input file: SRR3099585_chr18.fastq.gz in the ${PWD}/Data directory
fastqc command: fastqc --outdir FastQCResultDirectory inputFileName
the output files (*_fastqc.zip & *_fastqc.html) will be located in your outdir and named based on the prefix of the input file (eg. SRR3099585_chr18_fastqc.zip)
Code for ex1_o1.smk:
rule fastqc:
output:
"FastQC/SRR3099585_chr18_fastqc.zip",
"FastQC/SRR3099585_chr18_fastqc.html"
input:
"Data/SRR3099585_chr18.fastq.gz"
shell: "fastqc --outdir FastQC/ {input}"
%%sh
snakemake --snakefile ex1_o1.smk --cores 1
Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads ------ ------- ------------- ------------- fastqc 1 1 1 total 1 1 1 Select jobs to execute... [Sun Jun 12 21:14:47 2022] rule fastqc: input: Data/SRR3099585_chr18.fastq.gz output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html jobid: 0 reason: Missing output files: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html resources: tmpdir=/tmp Started analysis of SRR3099585_chr18.fastq.gz Approx 5% complete for SRR3099585_chr18.fastq.gz Approx 10% complete for SRR3099585_chr18.fastq.gz Approx 15% complete for SRR3099585_chr18.fastq.gz Approx 20% complete for SRR3099585_chr18.fastq.gz Approx 25% complete for SRR3099585_chr18.fastq.gz Approx 30% complete for SRR3099585_chr18.fastq.gz Approx 35% complete for SRR3099585_chr18.fastq.gz Approx 40% complete for SRR3099585_chr18.fastq.gz Approx 45% complete for SRR3099585_chr18.fastq.gz Approx 50% complete for SRR3099585_chr18.fastq.gz Approx 55% complete for SRR3099585_chr18.fastq.gz Approx 60% complete for SRR3099585_chr18.fastq.gz Approx 65% complete for SRR3099585_chr18.fastq.gz Approx 70% complete for SRR3099585_chr18.fastq.gz Approx 75% complete for SRR3099585_chr18.fastq.gz Approx 80% complete for SRR3099585_chr18.fastq.gz Approx 85% complete for SRR3099585_chr18.fastq.gz Approx 90% complete for SRR3099585_chr18.fastq.gz Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
[Sun Jun 12 21:14:58 2022] Finished job 0. 1 of 1 steps (100%) done Complete log: .snakemake/log/2022-06-12T211445.545170.snakemake.log
Observe the result: look at the newly created FastQC directory: Snakemake creates the needed directories by itself.
Objective 2: Add a second input RNAseq file to the rule.
Hint:
Data/SRR3099586_chr18.fastq.gz
Code for ex1_o2.smk
:
rule fastqc:
output:
"FastQC/SRR3099585_chr18_fastqc.zip",
"FastQC/SRR3099585_chr18_fastqc.html",
"FastQC/SRR3099586_chr18_fastqc.zip",
"FastQC/SRR3099586_chr18_fastqc.html"
input:
"Data/SRR3099585_chr18.fastq.gz",
"Data/SRR3099586_chr18.fastq.gz"
shell: "fastqc --outdir FastQC/ {input}"
%%sh
snakemake -s ex1_o2.smk -c 1 -p # -s & -c: short forms of the --snakefile & --cores options; -p prints the shell commands
Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads ------ ------- ------------- ------------- fastqc 1 1 1 total 1 1 1 Select jobs to execute... [Sun Jun 12 21:15:39 2022] rule fastqc: input: Data/SRR3099585_chr18.fastq.gz, Data/SRR3099586_chr18.fastq.gz output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html jobid: 0 reason: Missing output files: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.zip resources: tmpdir=/tmp fastqc --outdir FastQC/ Data/SRR3099585_chr18.fastq.gz Data/SRR3099586_chr18.fastq.gz Started analysis of SRR3099585_chr18.fastq.gz Approx 5% complete for SRR3099585_chr18.fastq.gz Approx 10% complete for SRR3099585_chr18.fastq.gz Approx 15% complete for SRR3099585_chr18.fastq.gz Approx 20% complete for SRR3099585_chr18.fastq.gz Approx 25% complete for SRR3099585_chr18.fastq.gz Approx 30% complete for SRR3099585_chr18.fastq.gz Approx 35% complete for SRR3099585_chr18.fastq.gz Approx 40% complete for SRR3099585_chr18.fastq.gz Approx 45% complete for SRR3099585_chr18.fastq.gz Approx 50% complete for SRR3099585_chr18.fastq.gz Approx 55% complete for SRR3099585_chr18.fastq.gz Approx 60% complete for SRR3099585_chr18.fastq.gz Approx 65% complete for SRR3099585_chr18.fastq.gz Approx 70% complete for SRR3099585_chr18.fastq.gz Approx 75% complete for SRR3099585_chr18.fastq.gz Approx 80% complete for SRR3099585_chr18.fastq.gz Approx 85% complete for SRR3099585_chr18.fastq.gz Approx 90% complete for SRR3099585_chr18.fastq.gz Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
Started analysis of SRR3099586_chr18.fastq.gz Approx 5% complete for SRR3099586_chr18.fastq.gz Approx 10% complete for SRR3099586_chr18.fastq.gz Approx 15% complete for SRR3099586_chr18.fastq.gz Approx 20% complete for SRR3099586_chr18.fastq.gz Approx 25% complete for SRR3099586_chr18.fastq.gz Approx 30% complete for SRR3099586_chr18.fastq.gz Approx 35% complete for SRR3099586_chr18.fastq.gz Approx 40% complete for SRR3099586_chr18.fastq.gz Approx 45% complete for SRR3099586_chr18.fastq.gz Approx 50% complete for SRR3099586_chr18.fastq.gz Approx 55% complete for SRR3099586_chr18.fastq.gz Approx 60% complete for SRR3099586_chr18.fastq.gz Approx 65% complete for SRR3099586_chr18.fastq.gz Approx 70% complete for SRR3099586_chr18.fastq.gz Approx 75% complete for SRR3099586_chr18.fastq.gz Approx 80% complete for SRR3099586_chr18.fastq.gz Approx 85% complete for SRR3099586_chr18.fastq.gz Approx 90% complete for SRR3099586_chr18.fastq.gz Approx 95% complete for SRR3099586_chr18.fastq.gz
Analysis complete for SRR3099586_chr18.fastq.gz
[Sun Jun 12 21:15:49 2022] Finished job 0. 1 of 1 steps (100%) done Complete log: .snakemake/log/2022-06-12T211538.981215.snakemake.log
Observe the results:
Snakemake reran the computation for the first data file: this is a new behaviour introduced in v7.8.0.
Before v7.8.0, the decision to rerun jobs relied purely on file modification times, so the already-generated outputs for the first data file would not have been recomputed.
Since v7.8.0, all provenance information is considered to decide which jobs to rerun: parameter changes, code changes, software environment changes, and changes in the set of input files of a job (use the command-line option --rerun-triggers mtime to rely only on modification times to determine whether a job shall be executed).
Rerun without any changes:
%%sh
snakemake -s ex1_o2.smk -c 1 -p
Building DAG of jobs... Nothing to be done (all requested files are present and up to date). Complete log: .snakemake/log/2022-06-12T211609.720352.snakemake.log
Snakemake replies "Nothing to be done".
Some solutions to force a rerun:
delete the FastQC directory and rerun the snakemake command
use the --forcerun (-R) option
%%sh
rm -Rf FastQC
snakemake -s ex1_o2.smk -c 1 -p
%%sh
snakemake -s ex1_o2.smk -c 1 -p -R fastqc
Objective 3: add all the RNAseq files.
Bored of writing all the input and output file names? Use the expand() function to manage all the input RNAseq files at once.
Hint:
build a python list of the sample names (the file names without the .fastq.gz suffix): list_name = ["item1", "item2", ..., "itemN"]
use this list in the expand() function
Code for ex1_o3.smk:
SAMPLES=["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"]
rule fastqc:
output:
expand("FastQC/{sample}_fastqc.zip", sample=SAMPLES),
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES)
input:
expand("data/{sample}.fastq.gz", sample=SAMPLES)
shell: "fastqc --outdir FastQC {input}"
%%sh
snakemake -c1 -s ex1_o3.smk --rerun-triggers mtime -p
Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads ------ ------- ------------- ------------- fastqc 1 1 1 total 1 1 1 Select jobs to execute... [Sun Jun 12 21:17:03 2022] rule fastqc: input: Data/SRR3099585_chr18.fastq.gz, Data/SRR3099586_chr18.fastq.gz, Data/SRR3099587_chr18.fastq.gz output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html jobid: 0 reason: Missing output files: FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.zip resources: tmpdir=/tmp fastqc --outdir FastQC/ Data/SRR3099585_chr18.fastq.gz Data/SRR3099586_chr18.fastq.gz Data/SRR3099587_chr18.fastq.gz Started analysis of SRR3099585_chr18.fastq.gz Approx 5% complete for SRR3099585_chr18.fastq.gz Approx 10% complete for SRR3099585_chr18.fastq.gz Approx 15% complete for SRR3099585_chr18.fastq.gz Approx 20% complete for SRR3099585_chr18.fastq.gz Approx 25% complete for SRR3099585_chr18.fastq.gz Approx 30% complete for SRR3099585_chr18.fastq.gz Approx 35% complete for SRR3099585_chr18.fastq.gz Approx 40% complete for SRR3099585_chr18.fastq.gz Approx 45% complete for SRR3099585_chr18.fastq.gz Approx 50% complete for SRR3099585_chr18.fastq.gz Approx 55% complete for SRR3099585_chr18.fastq.gz Approx 60% complete for SRR3099585_chr18.fastq.gz Approx 65% complete for SRR3099585_chr18.fastq.gz Approx 70% complete for SRR3099585_chr18.fastq.gz Approx 75% complete for SRR3099585_chr18.fastq.gz Approx 80% complete for SRR3099585_chr18.fastq.gz Approx 85% complete for SRR3099585_chr18.fastq.gz Approx 90% complete for SRR3099585_chr18.fastq.gz Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
Started analysis of SRR3099586_chr18.fastq.gz Approx 5% complete for SRR3099586_chr18.fastq.gz Approx 10% complete for SRR3099586_chr18.fastq.gz Approx 15% complete for SRR3099586_chr18.fastq.gz Approx 20% complete for SRR3099586_chr18.fastq.gz Approx 25% complete for SRR3099586_chr18.fastq.gz Approx 30% complete for SRR3099586_chr18.fastq.gz Approx 35% complete for SRR3099586_chr18.fastq.gz Approx 40% complete for SRR3099586_chr18.fastq.gz Approx 45% complete for SRR3099586_chr18.fastq.gz Approx 50% complete for SRR3099586_chr18.fastq.gz Approx 55% complete for SRR3099586_chr18.fastq.gz Approx 60% complete for SRR3099586_chr18.fastq.gz Approx 65% complete for SRR3099586_chr18.fastq.gz Approx 70% complete for SRR3099586_chr18.fastq.gz Approx 75% complete for SRR3099586_chr18.fastq.gz Approx 80% complete for SRR3099586_chr18.fastq.gz Approx 85% complete for SRR3099586_chr18.fastq.gz Approx 90% complete for SRR3099586_chr18.fastq.gz Approx 95% complete for SRR3099586_chr18.fastq.gz
Analysis complete for SRR3099586_chr18.fastq.gz
Started analysis of SRR3099587_chr18.fastq.gz Approx 5% complete for SRR3099587_chr18.fastq.gz Approx 10% complete for SRR3099587_chr18.fastq.gz Approx 15% complete for SRR3099587_chr18.fastq.gz Approx 20% complete for SRR3099587_chr18.fastq.gz Approx 25% complete for SRR3099587_chr18.fastq.gz Approx 30% complete for SRR3099587_chr18.fastq.gz Approx 35% complete for SRR3099587_chr18.fastq.gz Approx 40% complete for SRR3099587_chr18.fastq.gz Approx 45% complete for SRR3099587_chr18.fastq.gz Approx 50% complete for SRR3099587_chr18.fastq.gz Approx 55% complete for SRR3099587_chr18.fastq.gz Approx 60% complete for SRR3099587_chr18.fastq.gz Approx 65% complete for SRR3099587_chr18.fastq.gz Approx 70% complete for SRR3099587_chr18.fastq.gz Approx 75% complete for SRR3099587_chr18.fastq.gz Approx 80% complete for SRR3099587_chr18.fastq.gz Approx 85% complete for SRR3099587_chr18.fastq.gz Approx 90% complete for SRR3099587_chr18.fastq.gz Approx 95% complete for SRR3099587_chr18.fastq.gz
Analysis complete for SRR3099587_chr18.fastq.gz
[Sun Jun 12 21:17:19 2022] Finished job 0. 1 of 1 steps (100%) done Complete log: .snakemake/log/2022-06-12T211703.619495.snakemake.log
Observe the result: the --rerun-triggers mtime option seems not to apply. Why?
Look at the printed command line: one fastqc command handles all the input files; see also the Job stats table (at the beginning of the snakemake output): there are many input files, but snakemake launched only one fastqc job.
This is because the fastqc rule is defined with a list of files (the return format of the expand() function) rather than with one single file, and because the fastqc tool accepts a list of files as well as a single file.
Objective 4: add a rule containing the list of output files and change the fastqc rule so that it works on a single file.
Hint: run without the --rerun-triggers mtime option to see the effect.
Code for ex1_o4.smk:
SAMPLES = ["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"]
rule all:
input:
expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES),
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES)
rule fastqc:
output:
"FastQC/{sample}_fastqc.zip",
"FastQC/{sample}_fastqc.html"
input:
"Data/{sample}.fastq.gz"
params: "FastQC"
shell: "fastqc --outdir {params} {input}"
%%sh
snakemake -c1 -s ex1_o4.smk -p
Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads ------ ------- ------------- ------------- all 1 1 1 fastqc 3 1 1 total 4 1 1 Select jobs to execute... [Sun Jun 12 21:18:06 2022] rule fastqc: input: Data/SRR3099585_chr18.fastq.gz output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html jobid: 1 reason: Set of input files has changed since last execution; Code has changed since last execution; Params have changed since last execution wildcards: sample=SRR3099585_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz Started analysis of SRR3099585_chr18.fastq.gz Approx 5% complete for SRR3099585_chr18.fastq.gz Approx 10% complete for SRR3099585_chr18.fastq.gz Approx 15% complete for SRR3099585_chr18.fastq.gz Approx 20% complete for SRR3099585_chr18.fastq.gz Approx 25% complete for SRR3099585_chr18.fastq.gz Approx 30% complete for SRR3099585_chr18.fastq.gz Approx 35% complete for SRR3099585_chr18.fastq.gz Approx 40% complete for SRR3099585_chr18.fastq.gz Approx 45% complete for SRR3099585_chr18.fastq.gz Approx 50% complete for SRR3099585_chr18.fastq.gz Approx 55% complete for SRR3099585_chr18.fastq.gz Approx 60% complete for SRR3099585_chr18.fastq.gz Approx 65% complete for SRR3099585_chr18.fastq.gz Approx 70% complete for SRR3099585_chr18.fastq.gz Approx 75% complete for SRR3099585_chr18.fastq.gz Approx 80% complete for SRR3099585_chr18.fastq.gz Approx 85% complete for SRR3099585_chr18.fastq.gz Approx 90% complete for SRR3099585_chr18.fastq.gz Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
[Sun Jun 12 21:18:14 2022] Finished job 1. 1 of 4 steps (25%) done Select jobs to execute... [Sun Jun 12 21:18:14 2022] rule fastqc: input: Data/SRR3099587_chr18.fastq.gz output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html jobid: 3 reason: Set of input files has changed since last execution; Code has changed since last execution; Params have changed since last execution wildcards: sample=SRR3099587_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz Started analysis of SRR3099587_chr18.fastq.gz Approx 5% complete for SRR3099587_chr18.fastq.gz Approx 10% complete for SRR3099587_chr18.fastq.gz Approx 15% complete for SRR3099587_chr18.fastq.gz Approx 20% complete for SRR3099587_chr18.fastq.gz Approx 25% complete for SRR3099587_chr18.fastq.gz Approx 30% complete for SRR3099587_chr18.fastq.gz Approx 35% complete for SRR3099587_chr18.fastq.gz Approx 40% complete for SRR3099587_chr18.fastq.gz Approx 45% complete for SRR3099587_chr18.fastq.gz Approx 50% complete for SRR3099587_chr18.fastq.gz Approx 55% complete for SRR3099587_chr18.fastq.gz Approx 60% complete for SRR3099587_chr18.fastq.gz Approx 65% complete for SRR3099587_chr18.fastq.gz Approx 70% complete for SRR3099587_chr18.fastq.gz Approx 75% complete for SRR3099587_chr18.fastq.gz Approx 80% complete for SRR3099587_chr18.fastq.gz Approx 85% complete for SRR3099587_chr18.fastq.gz Approx 90% complete for SRR3099587_chr18.fastq.gz Approx 95% complete for SRR3099587_chr18.fastq.gz
Analysis complete for SRR3099587_chr18.fastq.gz
[Sun Jun 12 21:18:20 2022] Finished job 3. 2 of 4 steps (50%) done Select jobs to execute... [Sun Jun 12 21:18:20 2022] rule fastqc: input: Data/SRR3099586_chr18.fastq.gz output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html jobid: 2 reason: Set of input files has changed since last execution; Code has changed since last execution; Params have changed since last execution wildcards: sample=SRR3099586_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz Started analysis of SRR3099586_chr18.fastq.gz Approx 5% complete for SRR3099586_chr18.fastq.gz Approx 10% complete for SRR3099586_chr18.fastq.gz Approx 15% complete for SRR3099586_chr18.fastq.gz Approx 20% complete for SRR3099586_chr18.fastq.gz Approx 25% complete for SRR3099586_chr18.fastq.gz Approx 30% complete for SRR3099586_chr18.fastq.gz Approx 35% complete for SRR3099586_chr18.fastq.gz Approx 40% complete for SRR3099586_chr18.fastq.gz Approx 45% complete for SRR3099586_chr18.fastq.gz Approx 50% complete for SRR3099586_chr18.fastq.gz Approx 55% complete for SRR3099586_chr18.fastq.gz Approx 60% complete for SRR3099586_chr18.fastq.gz Approx 65% complete for SRR3099586_chr18.fastq.gz Approx 70% complete for SRR3099586_chr18.fastq.gz Approx 75% complete for SRR3099586_chr18.fastq.gz Approx 80% complete for SRR3099586_chr18.fastq.gz Approx 85% complete for SRR3099586_chr18.fastq.gz Approx 90% complete for SRR3099586_chr18.fastq.gz Approx 95% complete for SRR3099586_chr18.fastq.gz
Analysis complete for SRR3099586_chr18.fastq.gz
[Sun Jun 12 21:18:27 2022] Finished job 2. 3 of 4 steps (75%) done Select jobs to execute... [Sun Jun 12 21:18:27 2022] localrule all: input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html jobid: 0 reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.zip resources: tmpdir=/tmp [Sun Jun 12 21:18:27 2022] Finished job 0. 4 of 4 steps (100%) done Complete log: .snakemake/log/2022-06-12T211806.124892.snakemake.log
Observe the result: the order of execution is not necessarily the "human" (reading) order.
Objective 5: with a second tool, it becomes a "real" analysis workflow!
The second tool, multiqc, will aggregate all the fastqc results.
Hint:
multiqc command: multiqc *_fastqc.zip
multiqc creates a multiqc_report.html file & a multiqc_data directory (to be handled with directory("multiqc_data"))
Code for ex1_o6.smk:
SAMPLES = ["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"]
rule all:
input:
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html"
rule multiqc:
output:
"multiqc_report.html",
directory("multiqc_data")
input:
expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
shell: "multiqc {input}"
rule fastqc:
output:
"FastQC/{sample}_fastqc.zip",
"FastQC/{sample}_fastqc.html"
input:
"Data/{sample}.fastq.gz"
params: "FastQC"
shell: "fastqc --outdir {params} {input}"
%%sh
# rm -Rf multiqc_data
snakemake -c1 -s ex1_o6.smk -p --rerun-triggers mtime
Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads ------- ------- ------------- ------------- all 1 1 1 multiqc 1 1 1 total 2 1 1 Select jobs to execute... [Sun Jun 12 21:18:54 2022] rule multiqc: input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip output: multiqc_report.html, multiqc_data jobid: 4 reason: Missing output files: multiqc_report.html resources: tmpdir=/tmp multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip /// MultiQC 🔍 | v1.12 | multiqc | Search path : /home/jovyan/tutoriel_smk/FastQC/SRR3099585_chr18_fastqc.zip | multiqc | Search path : /home/jovyan/tutoriel_smk/FastQC/SRR3099586_chr18_fastqc.zip | multiqc | Search path : /home/jovyan/tutoriel_smk/FastQC/SRR3099587_chr18_fastqc.zip
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 3/3
| fastqc | Found 3 reports | multiqc | Compressing plot data | multiqc | Report : multiqc_report.html | multiqc | Data : multiqc_data | multiqc | MultiQC complete [Sun Jun 12 21:19:02 2022] Finished job 4. 1 of 2 steps (50%) done Select jobs to execute... [Sun Jun 12 21:19:02 2022] localrule all: input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, multiqc_report.html jobid: 0 reason: Input files updated by another job: multiqc_report.html resources: tmpdir=/tmp [Sun Jun 12 21:19:02 2022] Finished job 0. 2 of 2 steps (100%) done Complete log: .snakemake/log/2022-06-12T211854.446695.snakemake.log
Objective 6: redirect the outputs of each rule to log files. In Unix systems, the output of a command is usually sent to 2 separate streams: the expected output to standard output (stdout, or ">"), and the error messages to standard error (stderr, or "2>"). To combine stderr and stdout into the same log, use "&>" (use it with care, because output results are often printed to stdout).
Hint: use the log: directive (possibly with named entries, e.g. std and err) and refer to them in the shell command with {log.std} and {log.err}.
Code for ex1_o7.smk:
SAMPLES = ["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"]
rule all:
input:
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html"
rule multiqc:
output:
"multiqc_report.html"
input:
expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
log:
std="Logs/multiqc.std",
err="Logs/multiqc.err"
shell: "multiqc {input} 1>{log.std} 2>{log.err}"
rule fastqc:
output:
"FastQC/{sample}_fastqc.zip",
"FastQC/{sample}_fastqc.html"
input:
"Data/{sample}.fastq.gz"
log:
std="Logs/{sample}_fastqc.std",
err="Logs/{sample}_fastqc.err"
shell: "fastqc --outdir FastQC/ {input} 1>{log.std} 2>{log.err}"
%%sh
# rm -Rf multiqc_data
snakemake -c1 -s ex1_o7.smk -p
Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads ------- ------- ------------- ------------- all 1 1 1 fastqc 3 1 1 multiqc 1 1 1 total 5 1 1 Select jobs to execute... [Sun Jun 12 21:20:01 2022] rule fastqc: input: Data/SRR3099585_chr18.fastq.gz output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err jobid: 1 reason: Code has changed since last execution; Params have changed since last execution wildcards: sample=SRR3099585_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC/ Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err [Sun Jun 12 21:20:09 2022] Finished job 1. 1 of 5 steps (20%) done Select jobs to execute... [Sun Jun 12 21:20:09 2022] rule fastqc: input: Data/SRR3099587_chr18.fastq.gz output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err jobid: 3 reason: Code has changed since last execution; Params have changed since last execution wildcards: sample=SRR3099587_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC/ Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err [Sun Jun 12 21:20:15 2022] Finished job 3. 2 of 5 steps (40%) done Select jobs to execute... [Sun Jun 12 21:20:15 2022] rule fastqc: input: Data/SRR3099586_chr18.fastq.gz output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err jobid: 2 reason: Code has changed since last execution; Params have changed since last execution wildcards: sample=SRR3099586_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC/ Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err [Sun Jun 12 21:20:22 2022] Finished job 2. 3 of 5 steps (60%) done Select jobs to execute... [Sun Jun 12 21:20:22 2022] rule multiqc: input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip output: multiqc_report.html, multiqc_data log: Logs/multiqc.std, Logs/multiqc.err jobid: 4 reason: Input files updated by another job: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip resources: tmpdir=/tmp multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err [Sun Jun 12 21:20:25 2022] Finished job 4. 4 of 5 steps (80%) done Select jobs to execute... [Sun Jun 12 21:20:25 2022] localrule all: input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, multiqc_report.html jobid: 0 reason: Input files updated by another job: FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html resources: tmpdir=/tmp [Sun Jun 12 21:20:25 2022] Finished job 0. 5 of 5 steps (100%) done Complete log: .snakemake/log/2022-06-12T212001.252433.snakemake.log
So far, we’ve seen:
From now on, we will see some more Snakemake options:
Why use a configuration file?
To gather all the hard-coded values of the snakefile (paths to files, core numbers, parameter values, etc.) in one place.
How to do it in a snakefile?
Create a yml or json file and access the defined items with config["myItem"], then run with the --configfile myConfig.yml Snakemake option (another solution is to add the directive configfile: myConfig.yml at the beginning of the snakefile, as sketched below).
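A minimal sketch of the second solution, the configfile: directive, assuming a myConfig.yml defining a dataDir entry like the one created in the next objective (the other rules are unchanged):
configfile: "myConfig.yml"       # loaded when the snakefile is parsed; no --configfile option needed

rule fastqc:
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html"
    input:
        config["dataDir"] + "{sample}.fastq.gz"
    shell: "fastqc --outdir FastQC/ {input}"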
Objective 7: create a myConfig.yml file that specifies the path of the data directory, and replace the hard-coded value "Data/" in the snakefile.
Code for myConfig.yml
:
dataDir:
"Data/"
Copy ex1_o7.smk into ex1_o7b.smk and replace Data/... in the inputs with a config call:
rule fastqc:
input:config["dataDir"]+"{sample}.fastq.gz"
%%sh
rm -Rf FastQC multiqc_data multiqc_report*
snakemake -c1 -s ex1_o7b.smk -p --configfile myConfig.yml
Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads ------- ------- ------------- ------------- all 1 1 1 fastqc 3 1 1 multiqc 1 1 1 total 5 1 1 Select jobs to execute... [Sun Jun 12 21:20:50 2022] rule fastqc: input: Data/SRR3099585_chr18.fastq.gz output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err jobid: 1 reason: Missing output files: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.zip wildcards: sample=SRR3099585_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC/ Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err [Sun Jun 12 21:20:57 2022] Finished job 1. 1 of 5 steps (20%) done Select jobs to execute... [Sun Jun 12 21:20:57 2022] rule fastqc: input: Data/SRR3099587_chr18.fastq.gz output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err jobid: 3 reason: Missing output files: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html wildcards: sample=SRR3099587_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC/ Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err [Sun Jun 12 21:21:04 2022] Finished job 3. 2 of 5 steps (40%) done Select jobs to execute... [Sun Jun 12 21:21:04 2022] rule fastqc: input: Data/SRR3099586_chr18.fastq.gz output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err jobid: 2 reason: Missing output files: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html wildcards: sample=SRR3099586_chr18 resources: tmpdir=/tmp fastqc --outdir FastQC/ Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err [Sun Jun 12 21:21:10 2022] Finished job 2. 3 of 5 steps (60%) done Select jobs to execute... [Sun Jun 12 21:21:10 2022] rule multiqc: input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip output: multiqc_report.html, multiqc_data log: Logs/multiqc.std, Logs/multiqc.err jobid: 4 reason: Missing output files: multiqc_report.html; Input files updated by another job: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip resources: tmpdir=/tmp multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err [Sun Jun 12 21:21:13 2022] Finished job 4. 4 of 5 steps (80%) done Select jobs to execute... [Sun Jun 12 21:21:13 2022] localrule all: input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, multiqc_report.html jobid: 0 reason: Input files updated by another job: multiqc_report.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html resources: tmpdir=/tmp [Sun Jun 12 21:21:13 2022] Finished job 0. 5 of 5 steps (100%) done Complete log: .snakemake/log/2022-06-12T212049.793914.snakemake.log
To deduce the identifiers (e.g. IDs) of files in a directory, use the built-in glob_wildcards function, e.g.:
IDs, = glob_wildcards("dirpath/{id}.txt")
glob_wildcards() matches the given pattern against the files present on the system and thereby infers the values of all wildcards in the pattern ({id} here).
Hint: don't forget the comma after the name on the left-hand side (IDs, here).
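A sketch applied to the tutorial data, assuming the reduced RNA-seq files are in the Data/ directory:
SAMPLES, = glob_wildcards("Data/{sample}.fastq.gz")     # note the comma after SAMPLES

rule all:
    input: expand("FastQC/{sample}_fastqc.html", sample=SAMPLES)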
Snakemake uses the dot tool (from the graphviz package) to create diagrams of the complete workflow (--dag) or of the rule dependencies (--rulegraph):
snakemake --dag -s ex1_o7.smk | dot -Tpng > ex1_o7_dag.png
snakemake --rulegraph -s ex1_o7.smk | dot -Tpng > ex1_o7_rule.png
Snakemake supports using explicit conda environments on a per-rule basis:
add a conda: myRuleEnvironment.yml directive to the rule
run snakemake with the --use-conda option
The specified environment will be created and activated on the fly by Snakemake, and the rule will then run inside this conda environment (see the sketch below).
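A minimal sketch reusing the fastqc rule of the exercises and the envfair.yml file from the prerequisites (to be run with the --use-conda option):
rule fastqc:
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html"
    input:
        "Data/{sample}.fastq.gz"
    conda: "envfair.yml"         # the environment is built and activated on the fly by snakemake
    shell: "fastqc --outdir FastQC/ {input}"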
Some useful Snakemake options:
-n (--dryrun)
-p (--printshellcmds)
-r (--reason)
-D
-j 1 (cores: -c 1)
generate an HTML report (--report report.html) containing runtime statistics, a visualization of the workflow topology, the used software and data provenance information (you need to add the jinja2 package as a dependency)