Exercise B - improving your pipeline
In this Exercise, we’ll be starting where we left off in Exercise A and we’ll be using the snakefile you wrote in the objective 6.
For this practical exercise, we will:
Reminder about the final pipeline:
In the following objectives, we will be building on the Snakefile from Exercise 1A which successively runs FastQC then MultiQC on a set of RNA-seq data.
How this exercise is organised:
Each step will reply to an objective. Thus, we will be doing several
cycles of executing snakemake, observing the results and improving the
code. Each code version will be noted ex1b_oX.smk
, with
X
a progressive digit corresponding to the objective
number.
It’s the same setup as for Exercise A, we will work on a node of the cluster with Snakemake, FastQC and MultiQC modules loaded.
As a reminder, you should be in your working directory from Exercise
A in which you have the Data/
folder containing all example
files.
login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session$ ls -a
. .. .snakemake Data FastQC multiqc_data multiqc_report.html
You should also have snakemake, FastQC and MultiQC loaded.
module load snakemake/8.9.0 fastqc/0.12.1 multiqc/1.13 graphviz/2.40.1
NB: If you’ve skipped Exercise A, you can copy-paste the example solution script from Exercise A objective 6.
Make Snakemake find your input files for you, using the
glob_wildcards()
function (similar to the glob
module in Python).
SAMPLES
: try using the
glob_wildcards()
syntax to create your list instead of
listing all the “variable” parts of your sample file names manually with
SAMPLES = ["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"]
.glob_wildcards()
Let’s imagine we have a folder that looks like this:
john.doe@PC1:~$ ls mydir/
batchA_sample1.txt batchB_sample1.txt batchB_sample3.txt
batchA_sample2.txt batchB_sample2.txt
Defining the search pattern:
Look for common roots between your files. In this case, the search pattern for all batchA files could look like this:
samplelist = glob_wildcards("mydir/batchA_sample{sampleid}.txt")
The above function will search through the current directory for
files fitting this pattern and infer the possible values for your
{sampleid}
wildcard.
The output format:
Whatever the number of wildcards in your pattern,
glob_wildcards()
will always output a “tuple”
(=immutable list) of lists (that is why the comma in
“samplelist, =
” above is important). Each list corresponds
to the output of a wildcard.
In the example above, we only have 1 wildcard, thus it will return a
tuple containing a single list of all possible values for
{sampleid}
: ["1", "2"]
.
A search pattern with 2 wildcards:
To include batchB, we can use a second wildcard like this:
batchlist, samplelist = glob_wildcards("mydir/batch{batchid}_sample{sampleid}.txt")
In this example, the output will look like this:
["A", "A", "B", "B", "B"]
(for batchlist
) and
["1", "2", "1", "2", "3"]
(for
samplelist
).
SAMPLES, = glob_wildcards("Data/{sample}.fastq.gz")
rule all:
input:
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html"
rule fastqc:
input:
"Data/{sample}.fastq.gz"
output:
"FastQC/{sample}_fastqc.zip",
"FastQC/{sample}_fastqc.html"
shell: "fastqc --outdir FastQC {input}"
rule multiqc:
input:
expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
output:
"multiqc_report.html",
directory("multiqc_data")
shell: "multiqc {input}"
As you might have noticed, we simplified the input of the “all” target rule to just 2 files: the html outputs of the fastqc rule and the html output of the multiqc rule, since the zip files are always generated with the html outputs in fastqc and the multiqc_data directory with the html report in multiqc.
Next, let’s check if your pipeline still works as it should using the following command:
snakemake -s ex1b_o1.smk -c 1 -p
# -c is short for --cores
You should see something similar to the following output on your screen.
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
------- -------
all 1
fastqc 3
multiqc 1
total 5
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:00:35 2024]
localrule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
jobid: 6
reason: Missing output files: FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.zip
wildcards: sample=SRR3105697_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz
application/gzip
Started analysis of SRR3105697_chr18.fastq.gz
Approx 5% complete for SRR3105697_chr18.fastq.gz
Approx 10% complete for SRR3105697_chr18.fastq.gz
Approx 15% complete for SRR3105697_chr18.fastq.gz
Approx 20% complete for SRR3105697_chr18.fastq.gz
Approx 25% complete for SRR3105697_chr18.fastq.gz
Approx 30% complete for SRR3105697_chr18.fastq.gz
Approx 35% complete for SRR3105697_chr18.fastq.gz
Approx 40% complete for SRR3105697_chr18.fastq.gz
Approx 45% complete for SRR3105697_chr18.fastq.gz
Approx 50% complete for SRR3105697_chr18.fastq.gz
Approx 55% complete for SRR3105697_chr18.fastq.gz
Approx 60% complete for SRR3105697_chr18.fastq.gz
Approx 65% complete for SRR3105697_chr18.fastq.gz
Approx 70% complete for SRR3105697_chr18.fastq.gz
Approx 75% complete for SRR3105697_chr18.fastq.gz
Approx 80% complete for SRR3105697_chr18.fastq.gz
Approx 85% complete for SRR3105697_chr18.fastq.gz
Approx 90% complete for SRR3105697_chr18.fastq.gz
Approx 95% complete for SRR3105697_chr18.fastq.gz
Analysis complete for SRR3105697_chr18.fastq.gz
[Thu Oct 3 17:00:44 2024]
Finished job 6.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:00:45 2024]
localrule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
jobid: 4
reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.zip
wildcards: sample=SRR3105698_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz
application/gzip
Started analysis of SRR3105698_chr18.fastq.gz
Approx 5% complete for SRR3105698_chr18.fastq.gz
Approx 10% complete for SRR3105698_chr18.fastq.gz
Approx 15% complete for SRR3105698_chr18.fastq.gz
Approx 20% complete for SRR3105698_chr18.fastq.gz
Approx 25% complete for SRR3105698_chr18.fastq.gz
Approx 30% complete for SRR3105698_chr18.fastq.gz
Approx 35% complete for SRR3105698_chr18.fastq.gz
Approx 40% complete for SRR3105698_chr18.fastq.gz
Approx 45% complete for SRR3105698_chr18.fastq.gz
Approx 50% complete for SRR3105698_chr18.fastq.gz
Approx 55% complete for SRR3105698_chr18.fastq.gz
Approx 60% complete for SRR3105698_chr18.fastq.gz
Approx 65% complete for SRR3105698_chr18.fastq.gz
Approx 70% complete for SRR3105698_chr18.fastq.gz
Approx 75% complete for SRR3105698_chr18.fastq.gz
Approx 80% complete for SRR3105698_chr18.fastq.gz
Approx 85% complete for SRR3105698_chr18.fastq.gz
Approx 90% complete for SRR3105698_chr18.fastq.gz
Approx 95% complete for SRR3105698_chr18.fastq.gz
Analysis complete for SRR3105698_chr18.fastq.gz
[Thu Oct 3 17:00:55 2024]
Finished job 4.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:00:55 2024]
localrule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
jobid: 2
reason: Missing output files: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
wildcards: sample=SRR3105699_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz
application/gzip
Started analysis of SRR3105699_chr18.fastq.gz
Approx 5% complete for SRR3105699_chr18.fastq.gz
Approx 10% complete for SRR3105699_chr18.fastq.gz
Approx 15% complete for SRR3105699_chr18.fastq.gz
Approx 20% complete for SRR3105699_chr18.fastq.gz
Approx 25% complete for SRR3105699_chr18.fastq.gz
Approx 30% complete for SRR3105699_chr18.fastq.gz
Approx 35% complete for SRR3105699_chr18.fastq.gz
Approx 40% complete for SRR3105699_chr18.fastq.gz
Approx 45% complete for SRR3105699_chr18.fastq.gz
Approx 50% complete for SRR3105699_chr18.fastq.gz
Approx 55% complete for SRR3105699_chr18.fastq.gz
Approx 60% complete for SRR3105699_chr18.fastq.gz
Approx 65% complete for SRR3105699_chr18.fastq.gz
Approx 70% complete for SRR3105699_chr18.fastq.gz
Approx 75% complete for SRR3105699_chr18.fastq.gz
Approx 80% complete for SRR3105699_chr18.fastq.gz
Approx 85% complete for SRR3105699_chr18.fastq.gz
Approx 90% complete for SRR3105699_chr18.fastq.gz
Approx 95% complete for SRR3105699_chr18.fastq.gz
Analysis complete for SRR3105699_chr18.fastq.gz
[Thu Oct 3 17:01:05 2024]
Finished job 2.
3 of 5 steps (60%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:01:05 2024]
localrule multiqc:
input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
jobid: 7
reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip; Set of input files has changed since last execution
resources: tmpdir=/tmp
multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip
/// MultiQC 🔍 | v1.13
| multiqc | MultiQC Version v1.25.1 now available!
| multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099585_chr18_fastqc.zip
| multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105699_chr18_fastqc.zip
| multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099586_chr18_fastqc.zip
| multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105698_chr18_fastqc.zip
| multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099587_chr18_fastqc.zip
| multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105697_chr18_fastqc.zip
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 6/6
| fastqc | Found 6 reports
| multiqc | Compressing plot data
| multiqc | Report : multiqc_report.html
| multiqc | Data : multiqc_data
| multiqc | MultiQC complete
[Thu Oct 3 17:01:09 2024]
Finished job 7.
4 of 5 steps (80%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:01:09 2024]
localrule all:
input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html
jobid: 0
reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html
resources: tmpdir=/tmp
[Thu Oct 3 17:01:09 2024]
Finished job 0.
5 of 5 steps (100%) done Complete log: .snakemake/log/2024-10-03T170034.983240.snakemake.log
As you can see, Snakemake found 3 extra files in you
Data/
directory to run FastQC on and regenerated the
MultiQC report to include them.
Adapt your pipeline (ex1b_o2.smk
) to use a configuration
file called myConfig.yml
that specifies the path of the
input data directory, and replace the hard-coding values
(Data/
) in the snakefile.
config["myItem"]
(“myItem” being the name
of your variable in the configfile).--configfile ex1.yml
to Snakemake’s
command line option. NB: Alternativelly, you can add the directive
configfile: ex1.yml
at the beginning of your snakefile
instead*Why use a configuration file? In this file, you will place all hard-coding values of the snakefile (paths to files, core numbers, parameter values, etc). In this way, you have a generalised analysis pipeline that you can easily apply to new input data by simply changing the configfile rather than the snakefile directly.
Your configuration file
Your ex1.yml
configfile should look like this (YAML
syntax):
dataDir:
"Data"
The snakefile:
Your code in ex1b_o2.smk
should look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")
rule all:
input:
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html",
rule fastqc:
input:
config["dataDir"]+"/{sample}.fastq.gz",
output:
"FastQC/{sample}_fastqc.zip",
"FastQC/{sample}_fastqc.html"
shell: "fastqc --outdir FastQC {input}"
rule multiqc:
input:
expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES),
output:
"multiqc_report.html",
directory("multiqc_data")
shell: "multiqc {input}"
Next, let’s check if your pipeline still works as it should using the following command:
snakemake -s ex1b_o2.smk -c 1 -p --configfile ex1.yml
You should see something similar to the following output on your screen.
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
As you can see, Snakemake doesn’t see anything that changed that might influence the outputs that were already generated and tells you that there’s “Nothing to be done”.
The configfile can be used to specify as many parameters as you want,
for example, the name of the output directory in the fastqc rule. In
order to add parameters to the shell command line, you can use the
params
directive:
Your configfile ex1.yml
could look like this:
dataDir:
"Data"
fastqcOutDir:
"FastQC"
And your code in ex1b_o2b.smk
could look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")
rule all:
input:
expand(config["fastqcOutDir"]+"/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html",
rule fastqc:
input:
config["dataDir"]+"/{sample}.fastq.gz",
output:
config["fastqcOutDir"]+"/{sample}_fastqc.zip",
config["fastqcOutDir"]+"/{sample}_fastqc.html",
params:
outdir=config["fastqcOutDir"],
shell: "fastqc --outdir {params.outdir} {input} "
rule multiqc:
input:
expand(config["fastqcOutDir"]+"/{sample}_fastqc.zip", sample = SAMPLES),
output:
"multiqc_report.html",
directory("multiqc_data"),
shell: "multiqc {input}"
Let’s take a moment to compare the various “wildcards” that we’ve used within the shell calls and the way we specify the files in our directives. For example, in the fastqc rule:
rule fastqc:
input:
config["dataDir"]+"/{sample}.fastq.gz"
output:
config["fastqcOutDir"]+"/{sample}_fastqc.zip",
config["fastqcOutDir"]+"/{sample}_fastqc.html"
params:
outdir=config["fastqcOutDir"]
shell: "fastqc --outdir {params.outdir} {input} "
As you can see, you are free to add as many files to your directives
as you wish: input
has only 1, output
has 2
and params
has only 1 (and we’ve named it “outdir”). Note
that if you provide more than one file to a directive, these should be
separated by a comma (see output
directive above for an
example). If you are afraid of forgetting it, you can systematically add
a comma at the end of your directive lines.
To use the files specified in your directives within the shell
directive, you can use the wildcard syntax ({variable}
). In
the case of our input directive, as we only have 1 file, we used the
directive name directly ({input}
). However, as soon as you
have several files, you can use Python’s list syntax. For example, if we
had to explicitly mention our output files in the shell command, we
would have specified them with {output[0]}
and
{output[1]}
. We can also give names to each element if
there are several to avoid using indexing (i.e. [0]
,
[1]
etc.). For example, in the params
directive, we chose to name the file, so we use the syntax
{directive.name}
instead.
And finally, it’s important to note that the language Snakemake uses
is Python, so don’t hesitate to use Python’s modules within the
snakefile if you’re familiar with the language. For example, you could
use the os.path
module’s join()
function
instead of concatenating your paths with the +
operator.
You could also use Python’s string formats or triple quote strings
too.
Here’s an example with a few Python “upgrades”
ex1_o7c.smk
:
from os.path import join
SAMPLES, = glob_wildcards(join(config["dataDir"],"{sample}.fastq.gz"))
rule all:
input:
expand(join(config["fastqcOutDir"],"{sample}_fastqc.html"), sample=SAMPLES),
"multiqc_report.html"
rule fastqc:
input:
join(config["dataDir"],"{sample}.fastq.gz")
output:
join(config["fastqcOutDir"],"{sample}_fastqc.zip"),
join(config["fastqcOutDir"],"{sample}_fastqc.html")
shell:
f"""fastqc --outdir {config["fastqcOutDir"]} {{input}}"""
rule multiqc:
input:
expand(join(config["fastqcOutDir"],"{sample}_fastqc.zip"), sample = SAMPLES)
output:
"multiqc_report.html",
directory("multiqc_data")
shell: "multiqc {input}"
Note that if you use Python’s string formatting
(e.g. f"mystring"
), you should replace braces in wildcard
annotations with double braces (e.g. f"{{input}}"
).
Create a new snakefile named ex1b_o3.smk
in which we
redirect the standard output and error streams to log files.
log:
” directive (similar to the already existing
input:
or output:
directives) with two
variables, out
and err
, to separately redirect
each stream.1> stdout.txt
and
2> stderr.txt
in the shell command lines of your rules.
Use wildcards to specify the chosen file names (e.g.
“1>{log.std} 2>{log.err}
”).Your code for ex1b_o3.smk
should look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")
rule all:
input:
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html"
rule fastqc:
input:
config["dataDir"]+"/{sample}.fastq.gz"
output:
"FastQC/{sample}_fastqc.zip",
"FastQC/{sample}_fastqc.html"
log:
"Logs/{sample}_fastqc.std",
"Logs/{sample}_fastqc.err"
shell: "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"
rule multiqc:
input:
expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
output:
"multiqc_report.html",
directory("multiqc_data")
log:
std="Logs/multiqc.std",
err="Logs/multiqc.err"
shell: "multiqc {input} 1>{log.std} 2>{log.err}"
As you can see, we specify the log files differently in the fastqc
rule and in the multiqc rule (for demonstration reasons). In the multiqc
rule, both log files are named (“std” and “err”) and are used in the
shell directive like so: “{log.std}
” and
“{log.err}
”. In the fastqc rule, we don’t specify names and
use them in the shell directive with Python’s list syntax instead:
“{log[0]}
” and “{log[1]}
”.
Next, let’s check if your pipeline works as expected:
snakemake -s ex1b_o3.smk -c 1 -p --configfile ex1.yml
You should see something similar to the following output on your screen.
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
------- -------
all 1
fastqc 6
multiqc 1
total 8
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:05:46 2024]
localrule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
jobid: 6
reason: Code has changed since last execution; Params have changed since last execution
wildcards: sample=SRR3105697_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
[Thu Oct 3 17:05:57 2024]
Finished job 6.
1 of 8 steps (12%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:05:57 2024]
localrule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
jobid: 2
reason: Code has changed since last execution; Params have changed since last execution
wildcards: sample=SRR3105699_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
[Thu Oct 3 17:06:07 2024]
Finished job 2.
2 of 8 steps (25%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:06:07 2024]
localrule fastqc:
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
jobid: 1
reason: Code has changed since last execution; Params have changed since last execution
wildcards: sample=SRR3099585_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
[Thu Oct 3 17:06:17 2024]
Finished job 1.
3 of 8 steps (38%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:06:17 2024]
localrule fastqc:
input: Data/SRR3099587_chr18.fastq.gz
output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
jobid: 5
reason: Code has changed since last execution; Params have changed since last execution
wildcards: sample=SRR3099587_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
[Thu Oct 3 17:06:27 2024]
Finished job 5.
4 of 8 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:06:27 2024]
localrule fastqc:
input: Data/SRR3099586_chr18.fastq.gz
output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
jobid: 3
reason: Code has changed since last execution; Params have changed since last execution
wildcards: sample=SRR3099586_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
[Thu Oct 3 17:06:37 2024]
Finished job 3.
5 of 8 steps (62%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:06:37 2024]
localrule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
jobid: 4
reason: Code has changed since last execution; Params have changed since last execution
wildcards: sample=SRR3105698_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
[Thu Oct 3 17:06:48 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:06:48 2024]
localrule multiqc:
input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
log: Logs/multiqc.std, Logs/multiqc.err
jobid: 7
reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip
resources: tmpdir=/tmp
multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
[Thu Oct 3 17:06:53 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...
[Thu Oct 3 17:06:53 2024]
localrule all:
input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html
jobid: 0
reason: Input files updated by another job: multiqc_report.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html
resources: tmpdir=/tmp
[Thu Oct 3 17:06:53 2024]
Finished job 0.
8 of 8 steps (100%) done Complete log: .snakemake/log/2024-10-03T170546.019954.snakemake.log
As you can see, the fastqc steps don’t generate as much text as
before. Also, if you have a look at your working directory, you should
see a Logs
folder in there now, containing all the
individual logs of your input files and rules:
john.doe@cluster-i2bc:/data/work/I2BC/john.doe/snakemake_tutorial$ ls Logs/
multiqc.err SRR3099586_chr18_fastqc.err SRR3105697_chr18_fastqc.err SRR3105699_chr18_fastqc.err
multiqc.std SRR3099586_chr18_fastqc.std SRR3105697_chr18_fastqc.std SRR3105699_chr18_fastqc.std
SRR3099585_chr18_fastqc.err SRR3099587_chr18_fastqc.err SRR3105698_chr18_fastqc.err
SRR3099585_chr18_fastqc.std SRR3099587_chr18_fastqc.std SRR3105698_chr18_fastqc.std
In this objective, write a new script ex1b_o4.smk
and
modify the fastqc rule to clean up automatically intermediate files from
within Snakemake.
temp()
function around the output file
(e.g. output: temp("myoutput.txt")
).Your code for ex1b_o4.smk
should look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")
rule all:
input:
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html"
rule fastqc:
input:
config["dataDir"]+"/{sample}.fastq.gz"
output:
temp("FastQC/{sample}_fastqc.zip"),
temp("FastQC/{sample}_fastqc.html")
log:
"Logs/{sample}_fastqc.std",
"Logs/{sample}_fastqc.err"
shell: "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"
rule multiqc:
input:
expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
output:
"multiqc_report.html",
directory("multiqc_data")
log:
std="Logs/multiqc.std",
err="Logs/multiqc.err"
shell: "multiqc {input} 1>{log.std} 2>{log.err}"
As you can see, we just modified the outputs of the fastqc rule,
wrapping them within the temp()
function.
snakemake -s ex1b_o4.smk -c 1 -p
You should see something similar to the following output on your screen:
...
[Tue Oct 15 12:00:31 2024]
Finished job 7.
7 of 8 steps (88%) done
Removing temporary output FastQC/SRR3099585_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3105699_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3099586_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3105698_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3099587_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3105697_chr18_fastqc.zip.
Select jobs to execute...
Execute 1 jobs...
[Tue Oct 15 12:00:31 2024]
localrule all:
input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html
jobid: 0
reason: Input files updated by another job: FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html
resources: tmpdir=/tmp
[Tue Oct 15 12:00:31 2024]
Finished job 0.
8 of 8 steps (100%) done
Removing temporary output FastQC/SRR3099585_chr18_fastqc.html.
Removing temporary output FastQC/SRR3105699_chr18_fastqc.html.
Removing temporary output FastQC/SRR3099586_chr18_fastqc.html.
Removing temporary output FastQC/SRR3105698_chr18_fastqc.html.
Removing temporary output FastQC/SRR3099587_chr18_fastqc.html.
Removing temporary output FastQC/SRR3105697_chr18_fastqc.html.
Complete log: .snakemake/log/2024-10-15T115938.316012.snakemake.log
As you can see, Snakemake removes the intermediate files at 2 different moments in the pipeline:
NB: if you want to exceptionally keep temporary files (for debugging
reasons for example), you can run snakemake with the
--notemp
option to ignore all temp()
in the
code.
In this objective, we’ll learn how to control when Snakemake re-runs rules.
Have you noticed how Snakemake sometimes decides to re-run everything (although output files already exist) and sometimes not?
Snakemake always justifies its choices in the output log. Sometimes its because the files are missing, other times it might because the code has changed, etc.. For example:
[Tue Feb 20 15:35:31 2024]
localrule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
jobid: 2
reason: Code has changed since last execution
wildcards: sample=SRR3105698_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Or:
[Tue Feb 20 15:06:06 2024]
localrule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
jobid: 2
reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
wildcards: sample=SRR3105698_chr18
resources: tmpdir=/tmp
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz
This behaviour is because Snakemake takes into account all provenance information to define which jobs to rerun and not solely the fact that outputs exist or not. Thus, parameter, code and software environment changes, as well as changes in the set of input files of a job will make Snakemake rerun all the steps, even if the output files already exist.
Of note: in previous versions of Snakemake (before v.7.8.0), rerunning jobs relied purely on file modification times so Snakemake would only re-execute the pipeline if the timestamp of output files were older than the ones of input files.
There is a command line option that you can use in Snakemake in order
to specify your re-running criteria: --rerun-triggers
. By
default, all triggers are used
(code,input,mtime,params,software-env
), which guarantees
that results are consistent with the workflow code and configuration. To
revert to Snakemake’s behaviour before v.7.8.0, you can use
--rerun-triggers mtime
. This option will tell Snakemake to
only use modification time when determining whether a job should be
executed or not. For example:
snakemake -s ex1b_o4.smk -c 1 -p --rerun-triggers mtime
If you rerun your snakemake command line now, without changing
anything to the code (with or without the
--rerun-trigger mtime
option), you should see a message
from Snakemake telling you that nothing needs doing:
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
However, if you’d like to force Snakemake to re-run the whole pipeline, you can do one of the following:
– delete all output folders and results before re-running the Snakemake command
rm -rf FastQC multiqc*
snakemake -s ex1b_o4.smk -c 1 -p --configfile ex1.yml
– use Snakemake’s --forcerun
(-R
) or
--forceall
(-F
) options when you run the
Snakemake command. --forcerun
reruns a specific rule or
input which you will have to specify in the command line.
--forceall
forces everything to be re-run. For example:
# specify a rule
snakemake -s ex1b_o4.smk -c 1 -p -R fastqc --configfile ex1.yml
# specify an input/output file
snakemake -s ex1b_o4.smk -c 1 -p -R FastQC/SRR3099585_chr18_fastqc.zip --configfile ex1.yml
# rerun everything
snakemake -s ex1b_o4.smk -c 1 -p -F --configfile ex1.yml
In this exercise, we’ve seen:
– print the reason for each rule execution: -r --reason
(for Snakemake version <8 only)
– print a summary and status of a rule: -D
– automatically create HTML reports
(--report report.html
) containing runtime statistics, a
visualisation of the workflow topology, used software and data
provenance information (need to add the jinja2
package as a
dependency)