Introduction to Snakemake

Exercise B - improving your pipeline

Introduction

In this Exercise, we’ll be starting where we left off in Exercise A and we’ll be using the snakefile you wrote in the objective 6.

For this practical exercise, we will:

Reminder about the final pipeline:

In the following objectives, we will be building on the Snakefile from Exercise 1A which successively runs FastQC then MultiQC on a set of RNA-seq data.

final workflow

How this exercise is organised:

Each step will reply to an objective. Thus, we will be doing several cycles of executing snakemake, observing the results and improving the code. Each code version will be noted ex1b_oX.smk, with X a progressive digit corresponding to the objective number.

Setup

It’s the same setup as for Exercise A, we will work on a node of the cluster with Snakemake, FastQC and MultiQC modules loaded.

As a reminder, you should be in your working directory from Exercise A in which you have the Data/ folder containing all example files.

login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session$ ls -a
.  ..  .snakemake  Data  FastQC  multiqc_data  multiqc_report.html

You should also have snakemake, FastQC and MultiQC loaded.

module load snakemake/8.9.0 fastqc/0.12.1 multiqc/1.13 graphviz/2.40.1

NB: If you’ve skipped Exercise A, you can copy-paste the example solution script from Exercise A objective 6.

Objective 1 - glob_wildcards()

Make Snakemake find your input files for you, using the glob_wildcards() function (similar to the glob module in Python).

Where to start?

Click to see an example using glob_wildcards()

Let’s imagine we have a folder that looks like this:

john.doe@PC1:~$ ls mydir/
batchA_sample1.txt   batchB_sample1.txt   batchB_sample3.txt 
batchA_sample2.txt   batchB_sample2.txt   

Defining the search pattern:

Look for common roots between your files. In this case, the search pattern for all batchA files could look like this:

samplelist = glob_wildcards("mydir/batchA_sample{sampleid}.txt")

The above function will search through the current directory for files fitting this pattern and infer the possible values for your {sampleid} wildcard.

The output format:

Whatever the number of wildcards in your pattern, glob_wildcards() will always output a “tuple” (=immutable list) of lists (that is why the comma in “samplelist, =” above is important). Each list corresponds to the output of a wildcard.

In the example above, we only have 1 wildcard, thus it will return a tuple containing a single list of all possible values for {sampleid}: ["1", "2"].

A search pattern with 2 wildcards:

To include batchB, we can use a second wildcard like this:

batchlist, samplelist = glob_wildcards("mydir/batch{batchid}_sample{sampleid}.txt")

In this example, the output will look like this: ["A", "A", "B", "B", "B"] (for batchlist) and ["1", "2", "1", "2", "3"] (for samplelist).

Click to see an example solution
SAMPLES, = glob_wildcards("Data/{sample}.fastq.gz")

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html"

rule fastqc:
  input:
    "Data/{sample}.fastq.gz"
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html"
  shell: "fastqc --outdir FastQC {input}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell: "multiqc {input}"

As you might have noticed, we simplified the input of the “all” target rule to just 2 files: the html outputs of the fastqc rule and the html output of the multiqc rule, since the zip files are always generated with the html outputs in fastqc and the multiqc_data directory with the html report in multiqc.

Test the script

Next, let’s check if your pipeline still works as it should using the following command:

snakemake -s ex1b_o1.smk -c 1 -p
# -c is short for --cores

Observe the output

You should see something similar to the following output on your screen.

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         3
multiqc        1
total          5

Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:00:35 2024]
localrule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    jobid: 6
    reason: Missing output files: FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.zip
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz
application/gzip
Started analysis of SRR3105697_chr18.fastq.gz
Approx 5% complete for SRR3105697_chr18.fastq.gz
Approx 10% complete for SRR3105697_chr18.fastq.gz
Approx 15% complete for SRR3105697_chr18.fastq.gz
Approx 20% complete for SRR3105697_chr18.fastq.gz
Approx 25% complete for SRR3105697_chr18.fastq.gz
Approx 30% complete for SRR3105697_chr18.fastq.gz
Approx 35% complete for SRR3105697_chr18.fastq.gz
Approx 40% complete for SRR3105697_chr18.fastq.gz
Approx 45% complete for SRR3105697_chr18.fastq.gz
Approx 50% complete for SRR3105697_chr18.fastq.gz
Approx 55% complete for SRR3105697_chr18.fastq.gz
Approx 60% complete for SRR3105697_chr18.fastq.gz
Approx 65% complete for SRR3105697_chr18.fastq.gz
Approx 70% complete for SRR3105697_chr18.fastq.gz
Approx 75% complete for SRR3105697_chr18.fastq.gz
Approx 80% complete for SRR3105697_chr18.fastq.gz
Approx 85% complete for SRR3105697_chr18.fastq.gz
Approx 90% complete for SRR3105697_chr18.fastq.gz
Approx 95% complete for SRR3105697_chr18.fastq.gz
Analysis complete for SRR3105697_chr18.fastq.gz
[Thu Oct  3 17:00:44 2024]
Finished job 6.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:00:45 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    jobid: 4
    reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.zip
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz
application/gzip
Started analysis of SRR3105698_chr18.fastq.gz
Approx 5% complete for SRR3105698_chr18.fastq.gz
Approx 10% complete for SRR3105698_chr18.fastq.gz
Approx 15% complete for SRR3105698_chr18.fastq.gz
Approx 20% complete for SRR3105698_chr18.fastq.gz
Approx 25% complete for SRR3105698_chr18.fastq.gz
Approx 30% complete for SRR3105698_chr18.fastq.gz
Approx 35% complete for SRR3105698_chr18.fastq.gz
Approx 40% complete for SRR3105698_chr18.fastq.gz
Approx 45% complete for SRR3105698_chr18.fastq.gz
Approx 50% complete for SRR3105698_chr18.fastq.gz
Approx 55% complete for SRR3105698_chr18.fastq.gz
Approx 60% complete for SRR3105698_chr18.fastq.gz
Approx 65% complete for SRR3105698_chr18.fastq.gz
Approx 70% complete for SRR3105698_chr18.fastq.gz
Approx 75% complete for SRR3105698_chr18.fastq.gz
Approx 80% complete for SRR3105698_chr18.fastq.gz
Approx 85% complete for SRR3105698_chr18.fastq.gz
Approx 90% complete for SRR3105698_chr18.fastq.gz
Approx 95% complete for SRR3105698_chr18.fastq.gz
Analysis complete for SRR3105698_chr18.fastq.gz
[Thu Oct  3 17:00:55 2024]
Finished job 4.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:00:55 2024]
localrule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    jobid: 2
    reason: Missing output files: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz
application/gzip
Started analysis of SRR3105699_chr18.fastq.gz
Approx 5% complete for SRR3105699_chr18.fastq.gz
Approx 10% complete for SRR3105699_chr18.fastq.gz
Approx 15% complete for SRR3105699_chr18.fastq.gz
Approx 20% complete for SRR3105699_chr18.fastq.gz
Approx 25% complete for SRR3105699_chr18.fastq.gz
Approx 30% complete for SRR3105699_chr18.fastq.gz
Approx 35% complete for SRR3105699_chr18.fastq.gz
Approx 40% complete for SRR3105699_chr18.fastq.gz
Approx 45% complete for SRR3105699_chr18.fastq.gz
Approx 50% complete for SRR3105699_chr18.fastq.gz
Approx 55% complete for SRR3105699_chr18.fastq.gz
Approx 60% complete for SRR3105699_chr18.fastq.gz
Approx 65% complete for SRR3105699_chr18.fastq.gz
Approx 70% complete for SRR3105699_chr18.fastq.gz
Approx 75% complete for SRR3105699_chr18.fastq.gz
Approx 80% complete for SRR3105699_chr18.fastq.gz
Approx 85% complete for SRR3105699_chr18.fastq.gz
Approx 90% complete for SRR3105699_chr18.fastq.gz
Approx 95% complete for SRR3105699_chr18.fastq.gz
Analysis complete for SRR3105699_chr18.fastq.gz
[Thu Oct  3 17:01:05 2024]
Finished job 2.
3 of 5 steps (60%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:01:05 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip; Set of input files has changed since last execution
    resources: tmpdir=/tmp

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip

  /// MultiQC 🔍 | v1.13

|           multiqc | MultiQC Version v1.25.1 now available!
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099585_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105699_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099586_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105698_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099587_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105697_chr18_fastqc.zip
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 6/6  
|            fastqc | Found 6 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete
[Thu Oct  3 17:01:09 2024]
Finished job 7.
4 of 5 steps (80%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:01:09 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html
    resources: tmpdir=/tmp

[Thu Oct  3 17:01:09 2024]
Finished job 0.
5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-10-03T170034.983240.snakemake.log 

As you can see, Snakemake found 3 extra files in you Data/ directory to run FastQC on and regenerated the MultiQC report to include them.

Objective 2 - config file

Adapt your pipeline (ex1b_o2.smk) to use a configuration file called myConfig.yml that specifies the path of the input data directory, and replace the hard-coding values (Data/) in the snakefile.

Where to start?

Why use a configuration file? In this file, you will place all hard-coding values of the snakefile (paths to files, core numbers, parameter values, etc). In this way, you have a generalised analysis pipeline that you can easily apply to new input data by simply changing the configfile rather than the snakefile directly.

Click to see an example solution

Your configuration file

Your ex1.yml configfile should look like this (YAML syntax):

dataDir:
  "Data"

The snakefile:

Your code in ex1b_o2.smk should look like this:

SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html",

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz",
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html"
  shell: "fastqc --outdir FastQC {input}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES),
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell: "multiqc {input}"

Test the script

Next, let’s check if your pipeline still works as it should using the following command:

snakemake -s ex1b_o2.smk -c 1 -p --configfile ex1.yml

Observe the output

You should see something similar to the following output on your screen.

Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

As you can see, Snakemake doesn’t see anything that changed that might influence the outputs that were already generated and tells you that there’s “Nothing to be done”.

To go further - more parameters

The configfile can be used to specify as many parameters as you want, for example, the name of the output directory in the fastqc rule. In order to add parameters to the shell command line, you can use the params directive:

Click to see an example solution

Your configfile ex1.yml could look like this:

dataDir:
  "Data"
fastqcOutDir:
  "FastQC"

And your code in ex1b_o2b.smk could look like this:

SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
  input:
    expand(config["fastqcOutDir"]+"/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html",

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz",
  output:
    config["fastqcOutDir"]+"/{sample}_fastqc.zip",
    config["fastqcOutDir"]+"/{sample}_fastqc.html",
  params:
    outdir=config["fastqcOutDir"],
  shell: "fastqc --outdir {params.outdir} {input} "

rule multiqc:
  input:
    expand(config["fastqcOutDir"]+"/{sample}_fastqc.zip", sample = SAMPLES),
  output:
    "multiqc_report.html",
    directory("multiqc_data"),
  shell: "multiqc {input}"

How values of directives are used in “shell”

Let’s take a moment to compare the various “wildcards” that we’ve used within the shell calls and the way we specify the files in our directives. For example, in the fastqc rule:

rule fastqc: 
  input: 
    config["dataDir"]+"/{sample}.fastq.gz" 
  output:  
    config["fastqcOutDir"]+"/{sample}_fastqc.zip",  
    config["fastqcOutDir"]+"/{sample}_fastqc.html"  
  params:
    outdir=config["fastqcOutDir"] 
  shell: "fastqc --outdir {params.outdir} {input} "

As you can see, you are free to add as many files to your directives as you wish: input has only 1, output has 2 and params has only 1 (and we’ve named it “outdir”). Note that if you provide more than one file to a directive, these should be separated by a comma (see output directive above for an example). If you are afraid of forgetting it, you can systematically add a comma at the end of your directive lines.

To use the files specified in your directives within the shell directive, you can use the wildcard syntax ({variable}). In the case of our input directive, as we only have 1 file, we used the directive name directly ({input}). However, as soon as you have several files, you can use Python’s list syntax. For example, if we had to explicitly mention our output files in the shell command, we would have specified them with {output[0]} and {output[1]}. We can also give names to each element if there are several to avoid using indexing (i.e. [0], [1] etc.). For example, in the params directive, we chose to name the file, so we use the syntax {directive.name} instead.

more about wildcards

Transfer your Python skills - bonus for Python addicts only 😉

And finally, it’s important to note that the language Snakemake uses is Python, so don’t hesitate to use Python’s modules within the snakefile if you’re familiar with the language. For example, you could use the os.path module’s join() function instead of concatenating your paths with the + operator. You could also use Python’s string formats or triple quote strings too.

Click to see an example solution

Here’s an example with a few Python “upgrades” ex1_o7c.smk:

from os.path import join

SAMPLES, = glob_wildcards(join(config["dataDir"],"{sample}.fastq.gz"))

rule all:
  input:
    expand(join(config["fastqcOutDir"],"{sample}_fastqc.html"), sample=SAMPLES),
    "multiqc_report.html"

rule fastqc:
  input:
    join(config["dataDir"],"{sample}.fastq.gz")
  output:
    join(config["fastqcOutDir"],"{sample}_fastqc.zip"),
    join(config["fastqcOutDir"],"{sample}_fastqc.html")
  shell: 
    f"""fastqc --outdir {config["fastqcOutDir"]} {{input}}"""

rule multiqc:
  input:
    expand(join(config["fastqcOutDir"],"{sample}_fastqc.zip"), sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell: "multiqc {input}"

Note that if you use Python’s string formatting (e.g. f"mystring"), you should replace braces in wildcard annotations with double braces (e.g. f"{{input}}").

Objective 3 - logs

Create a new snakefile named ex1b_o3.smk in which we redirect the standard output and error streams to log files.

Where to start?

Click to see an example solution

Your code for ex1b_o3.smk should look like this:

SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html"

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz"
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html"
  log:
    "Logs/{sample}_fastqc.std",
    "Logs/{sample}_fastqc.err"
  shell: "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  log:
    std="Logs/multiqc.std",
    err="Logs/multiqc.err"
  shell: "multiqc {input} 1>{log.std} 2>{log.err}"

As you can see, we specify the log files differently in the fastqc rule and in the multiqc rule (for demonstration reasons). In the multiqc rule, both log files are named (“std” and “err”) and are used in the shell directive like so: “{log.std}” and “{log.err}”. In the fastqc rule, we don’t specify names and use them in the shell directive with Python’s list syntax instead: “{log[0]}” and “{log[1]}”.

Test the script

Next, let’s check if your pipeline works as expected:

snakemake -s ex1b_o3.smk -c 1 -p --configfile ex1.yml

Observe the output

You should see something similar to the following output on your screen.

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         6
multiqc        1
total          8

Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:05:46 2024]
localrule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
    jobid: 6
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
[Thu Oct  3 17:05:57 2024]
Finished job 6.
1 of 8 steps (12%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:05:57 2024]
localrule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
    jobid: 2
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
[Thu Oct  3 17:06:07 2024]
Finished job 2.
2 of 8 steps (25%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:07 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
    jobid: 1
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3099585_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
[Thu Oct  3 17:06:17 2024]
Finished job 1.
3 of 8 steps (38%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:17 2024]
localrule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
    jobid: 5
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3099587_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
[Thu Oct  3 17:06:27 2024]
Finished job 5.
4 of 8 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:27 2024]
localrule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
    jobid: 3
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3099586_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
[Thu Oct  3 17:06:37 2024]
Finished job 3.
5 of 8 steps (62%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:37 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 4
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
[Thu Oct  3 17:06:48 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:48 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    log: Logs/multiqc.std, Logs/multiqc.err
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip
    resources: tmpdir=/tmp

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
[Thu Oct  3 17:06:53 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:53 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: multiqc_report.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html
    resources: tmpdir=/tmp

[Thu Oct  3 17:06:53 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-10-03T170546.019954.snakemake.log

As you can see, the fastqc steps don’t generate as much text as before. Also, if you have a look at your working directory, you should see a Logs folder in there now, containing all the individual logs of your input files and rules:

john.doe@cluster-i2bc:/data/work/I2BC/john.doe/snakemake_tutorial$ ls Logs/
multiqc.err                  SRR3099586_chr18_fastqc.err  SRR3105697_chr18_fastqc.err  SRR3105699_chr18_fastqc.err
multiqc.std                  SRR3099586_chr18_fastqc.std  SRR3105697_chr18_fastqc.std  SRR3105699_chr18_fastqc.std
SRR3099585_chr18_fastqc.err  SRR3099587_chr18_fastqc.err  SRR3105698_chr18_fastqc.err
SRR3099585_chr18_fastqc.std  SRR3099587_chr18_fastqc.std  SRR3105698_chr18_fastqc.std

Objective 4 - temp files

In this objective, write a new script ex1b_o4.smk and modify the fastqc rule to clean up automatically intermediate files from within Snakemake.

Where to start?

Click to see an example solution

Your code for ex1b_o4.smk should look like this:

SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html"

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz"
  output:
    temp("FastQC/{sample}_fastqc.zip"),
    temp("FastQC/{sample}_fastqc.html")
  log:
    "Logs/{sample}_fastqc.std",
    "Logs/{sample}_fastqc.err"
  shell: "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  log:
    std="Logs/multiqc.std",
    err="Logs/multiqc.err"
  shell: "multiqc {input} 1>{log.std} 2>{log.err}"

As you can see, we just modified the outputs of the fastqc rule, wrapping them within the temp() function.

Test the script

snakemake -s ex1b_o4.smk -c 1 -p

Observe the output

You should see something similar to the following output on your screen:

...
[Tue Oct 15 12:00:31 2024]
Finished job 7.
7 of 8 steps (88%) done
Removing temporary output FastQC/SRR3099585_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3105699_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3099586_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3105698_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3099587_chr18_fastqc.zip.
Removing temporary output FastQC/SRR3105697_chr18_fastqc.zip.
Select jobs to execute...
Execute 1 jobs...

[Tue Oct 15 12:00:31 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html
    resources: tmpdir=/tmp

[Tue Oct 15 12:00:31 2024]
Finished job 0.
8 of 8 steps (100%) done
Removing temporary output FastQC/SRR3099585_chr18_fastqc.html.
Removing temporary output FastQC/SRR3105699_chr18_fastqc.html.
Removing temporary output FastQC/SRR3099586_chr18_fastqc.html.
Removing temporary output FastQC/SRR3105698_chr18_fastqc.html.
Removing temporary output FastQC/SRR3099587_chr18_fastqc.html.
Removing temporary output FastQC/SRR3105697_chr18_fastqc.html.
Complete log: .snakemake/log/2024-10-15T115938.316012.snakemake.log

As you can see, Snakemake removes the intermediate files at 2 different moments in the pipeline:

NB: if you want to exceptionally keep temporary files (for debugging reasons for example), you can run snakemake with the --notemp option to ignore all temp() in the code.

Objective 5 - rerun

In this objective, we’ll learn how to control when Snakemake re-runs rules.

Motivation

Have you noticed how Snakemake sometimes decides to re-run everything (although output files already exist) and sometimes not?

Snakemake always justifies its choices in the output log. Sometimes its because the files are missing, other times it might because the code has changed, etc.. For example:

[Tue Feb 20 15:35:31 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 2
    reason: Code has changed since last execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err

Or:

[Tue Feb 20 15:06:06 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    jobid: 2
    reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz

This behaviour is because Snakemake takes into account all provenance information to define which jobs to rerun and not solely the fact that outputs exist or not. Thus, parameter, code and software environment changes, as well as changes in the set of input files of a job will make Snakemake rerun all the steps, even if the output files already exist.

Of note: in previous versions of Snakemake (before v.7.8.0), rerunning jobs relied purely on file modification times so Snakemake would only re-execute the pipeline if the timestamp of output files were older than the ones of input files.

How to control re-running criteria?

There is a command line option that you can use in Snakemake in order to specify your re-running criteria: --rerun-triggers. By default, all triggers are used (code,input,mtime,params,software-env), which guarantees that results are consistent with the workflow code and configuration. To revert to Snakemake’s behaviour before v.7.8.0, you can use  --rerun-triggers mtime. This option will tell Snakemake to only use modification time when determining whether a job should be executed or not. For example:

snakemake -s ex1b_o4.smk -c 1 -p --rerun-triggers mtime

Force re-run

If you rerun your snakemake command line now, without changing anything to the code (with or without the --rerun-trigger mtime option), you should see a message from Snakemake telling you that nothing needs doing:

Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

However, if you’d like to force Snakemake to re-run the whole pipeline, you can do one of the following:

– delete all output folders and results before re-running the Snakemake command

rm -rf FastQC multiqc*
snakemake -s ex1b_o4.smk -c 1 -p --configfile ex1.yml

– use Snakemake’s --forcerun (-R) or  --forceall (-F) options when you run the Snakemake command. --forcerun reruns a specific rule or input which you will have to specify in the command line. --forceall forces everything to be re-run. For example:

# specify a rule
snakemake -s ex1b_o4.smk -c 1 -p -R fastqc --configfile ex1.yml
# specify an input/output file
snakemake -s ex1b_o4.smk -c 1 -p -R FastQC/SRR3099585_chr18_fastqc.zip --configfile ex1.yml
# rerun everything
snakemake -s ex1b_o4.smk -c 1 -p -F --configfile ex1.yml

Recap

In this exercise, we’ve seen:

Some more useful options when running Snakemake

– print the reason for each rule execution: -r --reason (for Snakemake version <8 only)

– print a summary and status of a rule: -D

– automatically create HTML reports (--report report.html) containing runtime statistics, a visualisation of the workflow topology, used software and data provenance information (need to add the jinja2 package as a dependency)

all Snakemake options