Introduction to Snakemake

Exercise C - adapting your pipeline to an HPC environment

Introduction

In this exercise, we’ll start where we left off in Exercise B, using the Snakefile you wrote in objective 3, which runs FastQC and then MultiQC on a set of RNA-seq data, and adapt it to an HPC environment, namely the IFB cluster.

[Figure: final workflow]

For this practical exercise, we will:

Motivation:

Up until now, our workflow just runs on a single processor. Thus, each job is run sequentially, which takes time and is frustrating when you know that more resources are available on the cluster (=> using more processors reduces computation time in most cases).

There are two ways of scaling up your workflow:

  1. run multiple jobs in parallel: if you have several inputs and each of them can be processed independently from each other by a specific rule (e.g. the fastqc rule in this Exercise), then you can run all of these jobs simultaneously instead of sequentially (=> 1 processor per job)
  2. run steps multithreaded: if you’re using a tool in your rule that handles multithreading (e.g. it has an option like --threads), you can run this rule on more than one processor (=> several processors per job)

How this exercise is organised

As for the previous exercises, each step will address an objective. Thus, we will be doing several cycles of snakemake execution, observing the results and improving the code. Each code version will be noted ex1c_oX.smk, with X a progressive digit corresponding to the objective number.

Warning: keep in mind that we’ll be using commands that are specific to the IFB cluster’s scheduler system (they might be different on other clusters).

Setup

The setup is the same as for Exercises A and B: we will work on a node of the cluster with the Snakemake, FastQC and MultiQC modules loaded.

As a reminder, you should be in your working directory from Exercise B in which you have the Data/ folder containing all example files.

login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session$  ls -a
.  ..  Data  FastQC  Logs multiqc_data  multiqc_report.html .snakemake 

You should also have snakemake, FastQC and MultiQC loaded.

module load snakemake/8.9.0 fastqc/0.12.1 multiqc/1.13

NB: If you’ve skipped Exercise A and B, you can copy-paste the example solution script from Exercise B objective 3.

Objective 1 - multithreading with --cores

Learn how to tell Snakemake that multiple processors are available locally using the --cores option.

Where to start?

Remember --cores/-c from your previous commands? This option tells Snakemake how many processors (/CPUs/cores/threads) it may use to run your jobs locally. It’s a mandatory option so if you run Snakemake without it, you’ll get an error message.

1. For now, we only have 1 processor available:

cquignot@cpu-node-40:~$ module load reportseff
cquignot@cpu-node-40:~$ reportseff -u $USER --format "JobID,JobName,State,ReqNodes,ReqCPUS"
  JobID       JobName       State     ReqNodes   ReqCPUS
 42017684      bash        RUNNING       1          1   

2. So let’s log out of our current session:

cquignot@cpu-node-40:~$ exit
cquignot@clust-slurm-client:~$

3. And reconnect with 2 processors instead:

cquignot@clust-slurm-client:~$ srun --cpus-per-task 2 --account=2417_wf4bioinfo --pty bash
srun: job 42017685 queued and waiting for resources
srun: job 42017685 has been allocated resources

cquignot@cpu-node-40:~$ module load reportseff
cquignot@cpu-node-40:~$ reportseff -u $USER --format "JobID,JobName,State,ReqNodes,ReqCPUS"
  JobID       JobName       State     ReqNodes   ReqCPUS
 42017684      bash        COMPLETED     1          1     
 42017685      bash        RUNNING       1          2   

4. Finally, make sure the modules are still loaded and move to your working directory if you’re not already there:

cquignot@cpu-node-40:~$ module load snakemake/8.9.0 fastqc/0.12.1 multiqc/1.13
cquignot@cpu-node-40:~$ module list
Currently Loaded Modulefiles:
 1) snakemake/8.9.0   2) multiqc/1.13   3) fastqc/0.12.1 
cquignot@cpu-node-40:~$ cd /shared/projects/2417_wf4bioinfo/$USER/day2-session
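
Inside the interactive session, the number of processors you reserved is a sensible upper bound for --cores. The sketch below is one way to look it up from Python. It assumes that Slurm exports the SLURM_CPUS_PER_TASK variable (which it normally does when --cpus-per-task is given); the function name is ours and is not part of Snakemake.

```python
import os


def cores_for_snakemake(default: int = 1) -> int:
    """Number of CPUs reserved for the current Slurm job, or `default`."""
    # Assumption: Slurm sets this variable when --cpus-per-task is used.
    value = os.environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    # Outside a Slurm job (or if the variable is missing), fall back.
    return default


# Print the option you could pass to Snakemake in this session.
print(f"snakemake --cores {cores_for_snakemake()}")
```

In the 2-processor session opened above, this would be expected to suggest --cores 2; outside of Slurm it falls back to 1.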

Run Snakemake on 2 local processors

Running Snakemake on 2 processors instead of just one is quite straightforward (we’ll add -R fastqc to force Snakemake to re-run everything so you can observe the changes):

snakemake -s ex1b_o3.smk -p --configfile ex1.yml -R fastqc --cores 2

Your output should look like this:

Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         6
multiqc        1
total          8

Select jobs to execute...

Execute 2 jobs...

[Wed Feb 21 13:37:38 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 2
    reason: Forced execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/tmp


[Wed Feb 21 13:37:38 2024]
localrule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
    jobid: 6
    reason: Forced execution
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=/tmp

[Wed Feb 21 13:37:52 2024]
Finished job 6.
1 of 8 steps (12%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:37:52 2024]
localrule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
    jobid: 1
    reason: Forced execution
    wildcards: sample=SRR3099586_chr18
    resources: tmpdir=/tmp

[Wed Feb 21 13:37:54 2024]
Finished job 2.
2 of 8 steps (25%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:37:54 2024]
localrule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
    jobid: 5
    reason: Forced execution
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=/tmp

[Wed Feb 21 13:38:02 2024]
Finished job 1.
3 of 8 steps (38%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:38:02 2024]
localrule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
    jobid: 3
    reason: Forced execution
    wildcards: sample=SRR3099587_chr18
    resources: tmpdir=/tmp

[Wed Feb 21 13:38:04 2024]
Finished job 5.
4 of 8 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:38:04 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
    jobid: 4
    reason: Forced execution
    wildcards: sample=SRR3099585_chr18
    resources: tmpdir=/tmp

[Wed Feb 21 13:38:14 2024]
Finished job 3.
5 of 8 steps (62%) done
[Wed Feb 21 13:38:15 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:38:15 2024]
localrule multiqc:
    input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    log: Logs/multiqc.std, Logs/multiqc.err
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip
    resources: tmpdir=/tmp

[Wed Feb 21 13:38:37 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:38:37 2024]
localrule all:
    input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html
    resources: tmpdir=/tmp

[Wed Feb 21 13:38:37 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-02-21T133728.474300.snakemake.log

Observe the output

You can now see in the log that Snakemake registered the 2 processors: Provided cores: 2. It’s difficult to tell whether your jobs are actually running simultaneously just by looking at the log, but you can clearly see that Snakemake starts the first 2 jobs at the same time: Execute 2 jobs...

Running Snakemake on 2 processors doesn’t reduce the computation time considerably in this case since we only have 6 input files to process and the tools that we are running are already quite fast to execute.
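
To make the link between --cores and the number of simultaneous jobs concrete, here is a minimal Python sketch (not Snakemake code). It assumes identical jobs and mirrors the message printed in the log above, “Rules claiming more threads will be scaled down”; the function name and the example numbers are made up for illustration.

```python
def concurrent_jobs(cores: int, threads_per_job: int) -> tuple[int, int]:
    """Return (threads actually given to each job, jobs that fit side by side)."""
    if cores < 1 or threads_per_job < 1:
        raise ValueError("cores and threads_per_job must be positive")
    # A job cannot get more threads than the total core budget.
    effective_threads = min(threads_per_job, cores)
    # Identical jobs run side by side as long as their threads fit in the budget.
    return effective_threads, cores // effective_threads


# Our case: single-threaded fastqc jobs on 2 cores -> 2 jobs at once.
print(concurrent_jobs(cores=2, threads_per_job=1))  # (1, 2)
# Hypothetical multithreaded rule: 2 threads per job on 4 cores -> 2 jobs at once.
print(concurrent_jobs(cores=4, threads_per_job=2))  # (2, 2)
# Asking for more threads than cores: the job is scaled down and runs alone.
print(concurrent_jobs(cores=2, threads_per_job=8))  # (2, 1)
```

This matches what the log shows: with single-threaded jobs and --cores 2, two fastqc jobs are started together.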

A little recap before moving on…

Local execution:

Everything we’ve seen up until now could technically also be run on a local computer, even what we’ve just seen above if you have several processors on your PC (and provided you have the necessary software installed).

There are better ways than --cores to parallelise:

We’ve also seen that using --cores N (N being the number of processors to use) enables us to distribute the workload and, thus, to accelerate the computation. However, when running on a cluster, this is not the best way to parallelise your pipeline.

Why? Let’s take what we just did as an example:

[Figure: schema of core usage]

All 6 fastqc jobs can be evenly distributed over the 2 processors but the second rule (multiqc) is only using 1 of the 2 that we’ve reserved because there’s only one job for this rule. It’s not that bad when you’re using tools that run fast but becomes more problematic for software with much larger execution times…
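
The effect can be put into numbers with a small simulation. The sketch below is an illustration, not Snakemake’s scheduler: it assumes a simple greedy strategy (each job starts on the first core that becomes free) and uses job durations read off the timestamps of the --cores 2 log above, so they include some Snakemake overhead and are only approximate.

```python
import heapq


def simulate(fastqc_durations, multiqc_duration, cores):
    """Greedy schedule: each job starts on the first core that becomes free."""
    free_at = [0] * cores  # time at which each core becomes free
    heapq.heapify(free_at)
    for duration in fastqc_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    fastqc_end = max(free_at)  # multiqc needs every fastqc output
    total = fastqc_end + multiqc_duration
    busy = sum(fastqc_durations) + multiqc_duration
    idle = cores * total - busy  # reserved core-seconds left unused
    return fastqc_end, total, idle


# Durations in seconds, in the order the jobs were started in the log
# (jobs 2, 6, 1, 5, 3, 4), followed by the single multiqc job.
FASTQC = [16, 14, 10, 10, 12, 11]
MULTIQC = 22

fastqc_end, total, idle = simulate(FASTQC, MULTIQC, cores=2)
print(f"fastqc stage: {fastqc_end} s, whole run: {total} s, idle: {idle} core-seconds")
# fastqc stage: 37 s, whole run: 59 s, idle: 23 core-seconds
```

Of the 23 idle core-seconds, 22 come from the multiqc step, during which the second reserved core has nothing to do.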

What’s the solution? It would be more efficient to let the scheduler (Slurm in our case) deal with the distribution of jobs according to the resources available on the cluster. With this system, we’ll also be able to adapt the resources that are reserved to the amount each tool actually needs to run, thereby leaving unused resources free for others to use.

Disconnect from the current job

Let’s disconnect from the current interactive session before we move on. Snakemake itself doesn’t need many resources and can be run directly from the login node (i.e. clust-slurm-client) without problems, as long as the rules in your Snakefile are still run on the compute nodes. Don’t forget to load the snakemake module!

cquignot@cpu-node-40:~$ exit
cquignot@clust-slurm-client:~$

Objective 2 - communicating with slurm

Learn how to dispatch each individual job onto separate processors of the cluster using the --executor option.

Where to start?

Run your Snakefile

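Running Snakemake on the cluster only takes a few extra options: --executor and --cluster-generic-submit-cmd tell Snakemake how to submit each job to Slurm, and --jobs sets the maximum number of jobs running in parallel (NB: we’ll add -R fastqc to force Snakemake to re-run everything so you can observe the changes):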

snakemake -s ex1b_o3.smk --executor "cluster-generic" --cluster-generic-submit-cmd "sbatch --cpus-per-task=1 --mem 500Mb" --jobs 6 --configfile ex1.yml -p -R fastqc 

Observe the output

Your output should look like this:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job        count
-------  -------
all            1
fastqc         6
multiqc        1
total          8

Select jobs to execute...
Execute 6 jobs...

[Sat Oct 12 11:43:38 2024]
rule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
    jobid: 1
    reason: Forced execution
    wildcards: sample=SRR3099585_chr18
    resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 1 with external jobid 'Submitted batch job 42161132'.

[Sat Oct 12 11:43:38 2024]
rule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
    jobid: 5
    reason: Forced execution
    wildcards: sample=SRR3099587_chr18
    resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00

fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 5 with external jobid 'Submitted batch job 42161133'.

[Sat Oct 12 11:43:38 2024]
rule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
    jobid: 2
    reason: Forced execution
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 2 with external jobid 'Submitted batch job 42161134'.

[Sat Oct 12 11:43:38 2024]
rule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
    jobid: 6
    reason: Forced execution
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 6 with external jobid 'Submitted batch job 42161135'.

[Sat Oct 12 11:43:39 2024]
rule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
    jobid: 3
    reason: Forced execution
    wildcards: sample=SRR3099586_chr18
    resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00

fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 3 with external jobid 'Submitted batch job 42161136'.

[Sat Oct 12 11:43:39 2024]
rule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 4
    reason: Forced execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 4 with external jobid 'Submitted batch job 42161137'.
[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
    message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 1
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err (check log file(s) for error details)
    shell:
        fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: Submitted batch job 42161132

[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
    message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 5
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err (check log file(s) for error details)
    shell:
        fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: Submitted batch job 42161133

[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
    message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 2
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err (check log file(s) for error details)
    shell:
        fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: Submitted batch job 42161134

[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
    message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 6
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err (check log file(s) for error details)
    shell:
        fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: Submitted batch job 42161135

[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
    message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 3
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err (check log file(s) for error details)
    shell:
        fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: Submitted batch job 42161136

[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
    message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 4
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err (check log file(s) for error details)
    shell:
        fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: Submitted batch job 42161137

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-10-12T114335.517604.snakemake.log
WorkflowError:
At least one job did not complete successfully.

Are you getting red error messages too? Don’t worry, that was expected 😉

Let’s first have a look at the output log. There are several indicators that the jobs were submitted correctly to Slurm: “Provided remote nodes: 6” and “Submitted job 5 with external jobid 'Submitted batch job 42161133'.” for example. We can see that each job was submitted individually to Slurm (they all have individual “external jobids”). We can also see all Slurm’s output/error files generated in our current working directory:

cquignot@clust-slurm-client:/shared/projects/2417_wf4bioinfo/cquignot/day2-session$ ls
Data     ex1.yml       ex1.smk              slurm-42161133.out  slurm-42161135.out  slurm-42161137.out
Logs     multiqc_data  multiqc_report.html  slurm-42161134.out  slurm-42161136.out  slurm-42161132.out
   

So what went wrong? The log tells us that the fastqc jobs didn’t end properly. Their logs are redirected into files stored in the Logs/ folder, let’s have a look at one of them:

cquignot@clust-slurm-client:/shared/projects/2417_wf4bioinfo/cquignot/day2-session$ more Logs/SRR3099586_chr18_fastqc.err 
/usr/bin/bash: fastqc: command not found

Aha! Do you remember the module loads we did at the beginning of this exercise? Here, each job is executed on a new processor on which we haven’t loaded the necessary software for fastqc to run. We’ll see how to fix this in the next objective.
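
If you want to check for this kind of problem yourself, one option is to test whether the tools are visible on the PATH of the environment a job runs in. The sketch below uses only Python’s standard library; the function name is ours, and the tool names are those used in this exercise.

```python
import shutil


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Check the two programs called by the rules of our Snakefile.
missing = missing_tools(["fastqc", "multiqc"])
if missing:
    print("not found on PATH:", ", ".join(missing))
else:
    print("all tools found")
```

Run in an environment where the fastqc module is not loaded, this would be expected to report fastqc as missing, which is the situation behind the “command not found” error above.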

Objective 3 - controlling software environment

Controlling the software environment in Snakemake. Create a snakefile called ex1c_o3.smk in which we will use the envmodules directive to load multiqc and fastqc before running each of these rules.

Where to start?

So that Snakemake automatically loads the correct software with module load before running the actual command line for a given rule, we can add the envmodules directive to the rules.


For example, instead of using module load mySoftware/myVersion, we could integrate it in the rule like this:

rule ruleName:
    input:
        "inputFile.txt"
    output:
        "outputFile.txt"
    envmodules:
        "mySoftware/myVersion"
    shell:
        """
        mySoftware {input} > {output}
        """

Your code for ex1c_o3.smk should look like this:

SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html"

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz"
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html"
  log:
    "Logs/{sample}_fastqc.std",
    "Logs/{sample}_fastqc.err"  
  envmodules: "fastqc/0.12.1"
  shell: "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  log:
    std="Logs/multiqc.std",
    err="Logs/multiqc.err"
  envmodules: "multiqc/1.13"
  shell: "multiqc {input} 1>{log.std} 2>{log.err}"

Run your Snakefile

Now let’s run Snakemake again with the -R fastqc option. Don’t forget to also add --software-deployment-method env-modules:

snakemake -s ex1c_o3.smk --software-deployment-method env-modules --executor "cluster-generic" --cluster-generic-submit-cmd "sbatch --cpus-per-task=1 --mem 500Mb" --jobs 6 --configfile ex1.yml -p -R fastqc        

Observe the output

Your output should look like this:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job        count
-------  -------
all            1
fastqc         6
multiqc        1
total          8

Select jobs to execute...
Execute 6 jobs...

[Wed Feb 21 22:36:11 2024]
rule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
    jobid: 6
    reason: Forced execution
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=<TBD>

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 6 with external jobid '748703'.

[Wed Feb 21 22:36:11 2024]
rule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
    jobid: 1
    reason: Forced execution
    wildcards: sample=SRR3099586_chr18
    resources: tmpdir=<TBD>

fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 1 with external jobid '748704'.

[Wed Feb 21 22:36:11 2024]
rule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
    jobid: 4
    reason: Forced execution
    wildcards: sample=SRR3099585_chr18
    resources: tmpdir=<TBD>

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 4 with external jobid '748705'.

[Wed Feb 21 22:36:11 2024]
rule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
    jobid: 3
    reason: Forced execution
    wildcards: sample=SRR3099587_chr18
    resources: tmpdir=<TBD>

fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 3 with external jobid '748706'.

[Wed Feb 21 22:36:11 2024]
rule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 2
    reason: Forced execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=<TBD>

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 2 with external jobid '748707'.

[Wed Feb 21 22:36:11 2024]
rule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
    jobid: 5
    reason: Forced execution
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=<TBD>

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 5 with external jobid '748708'.
[Wed Feb 21 22:36:40 2024]
Finished job 6.
1 of 8 steps (12%) done
[Wed Feb 21 22:36:40 2024]
Finished job 1.
2 of 8 steps (25%) done
[Wed Feb 21 22:36:40 2024]
Finished job 4.
3 of 8 steps (38%) done
[Wed Feb 21 22:36:40 2024]
Finished job 3.
4 of 8 steps (50%) done
[Wed Feb 21 22:36:40 2024]
Finished job 2.
5 of 8 steps (62%) done
[Wed Feb 21 22:36:41 2024]
Finished job 5.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 22:36:41 2024]
rule multiqc:
    input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    log: Logs/multiqc.std, Logs/multiqc.err
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip
    resources: tmpdir=<TBD>

multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
Submitted job 7 with external jobid '748713'.
Will exit after finishing currently running jobs (scheduler).

[Wed Feb 21 22:37:11 2024]
Finished job 7.
7 of 8 steps (88%) done
Will exit after finishing currently running jobs (scheduler).
Shutting down, this might take some time.

Congratulations! You’ve run your first Snakefile through the SLURM scheduler!

Now let’s see how we can simplify the command line because it’s starting to get really long…

Objective 4 - creating a profile

Create a profile for Snakemake.

Where to start?

To avoid typing cluster-specific and basic options such as -p or --executor "cluster-generic" in the command line every time you run a Snakefile on the IFB cluster, we can add them all to a profile file instead and then forget about them. We’ll call this file config.yaml and we’ll put it in our home directory in $HOME/.config/snakemake/slurm/. You might need to create the directory first:

mkdir -p $HOME/.config/snakemake/slurm/

Inside this file ($HOME/.config/snakemake/slurm/config.yaml), we’ll put all the options we routinely use to run Snakemake as well as those we use specifically in conjunction with Slurm:

# max number of jobs in parallel
jobs: 10

# cluster-specific options:
executor: cluster-generic
cluster-generic-submit-cmd:
  mkdir -p slurm_output/ && 
  sbatch
    --partition={resources.partition}
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --job-name={rule}-{wildcards}
    --output=slurm_output/{rule}-{wildcards}-%j.out
    --time={resources.time}
    
# define default resources **per job**
default-resources:
  - mem_mb=1000
  - threads=1
  - partition=fast
  - time="02:00:00"

# software option: use modules
software-deployment-method: env-modules

# print all commands
printshellcmds: True

Since we’re creating a “general” profile, we have to be able to adjust the parameters given to the sbatch command from within the Snakefile. To do so, we “generalise” the submission command with placeholders (e.g. {threads}, {resources.mem_mb}) that Snakemake fills in for each job.
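
To see what this substitution amounts to, here is a minimal Python sketch that mimics it with str.format. It is an illustration under our own assumptions, not Snakemake’s actual implementation: the command is a shortened version of the one in the profile (the {wildcards} part is left out), and the 4000 MB value is hypothetical.

```python
from types import SimpleNamespace

# Shortened version of the submit command from the profile above.
SUBMIT_CMD = (
    "sbatch --partition={resources.partition} --cpus-per-task={threads} "
    "--mem={resources.mem_mb} --time={resources.time} --job-name={rule}"
)


def fill_submit_cmd(template, rule, threads, resources):
    """Substitute per-job values into the command template."""
    # SimpleNamespace lets the template use attribute access: {resources.mem_mb}
    return template.format(
        rule=rule, threads=threads, resources=SimpleNamespace(**resources)
    )


# Default resources, as defined in the profile.
defaults = {"partition": "fast", "mem_mb": 1000, "time": "02:00:00"}

print(fill_submit_cmd(SUBMIT_CMD, "fastqc", 1, defaults))
# sbatch --partition=fast --cpus-per-task=1 --mem=1000 --time=02:00:00 --job-name=fastqc

# A rule asking for more memory only changes the value that is filled in.
print(fill_submit_cmd(SUBMIT_CMD, "multiqc", 1, {**defaults, "mem_mb": 4000}))
```

The same template therefore serves every rule: only the values coming from the Snakefile or from default-resources differ from one job to the next.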

more on cluster execution with Snakemake

Run your Snakefile

Let’s try running your Snakefile again (don’t forget the --profile slurm):

snakemake -s ex1c_o3.smk --configfile ex1.yml -R fastqc --profile slurm

Why --profile slurm? You can either give a path (the directory containing the profile’s config.yaml) or just a profile name (the name of the directory in which config.yaml is saved). Snakemake will automatically look in the default places where a profile could be stored (including the directory in which we just placed ours).
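
The lookup can be pictured with the following Python sketch. It is a simplified assumption of ours, not Snakemake’s code: it only checks the path as given and the per-user directory used in this exercise, whereas Snakemake also searches other default locations, and the search order shown here is a guess.

```python
from pathlib import Path


def resolve_profile(profile, home=None):
    """Return the directory holding config.yaml for `profile`, or None."""
    home = Path(home) if home is not None else Path.home()
    candidates = [
        Path(profile),  # profile given as a path
        home / ".config" / "snakemake" / profile,  # profile given as a name
    ]
    for directory in candidates:
        if (directory / "config.yaml").is_file():
            return directory
    return None


# With the profile created above, this should point to
# $HOME/.config/snakemake/slurm (or print None if it does not exist).
print(resolve_profile("slurm"))
```

This is why the short name slurm is enough on the command line: the directory name under $HOME/.config/snakemake/ is the profile name.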

Observe the output

Your output should look like this:

Using profile slurm for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job        count
-------  -------
all            1
fastqc         6
multiqc        1
total          8

Select jobs to execute...
Execute 6 jobs...

[Wed Feb 21 23:18:34 2024]
rule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
    jobid: 4
    reason: Forced execution
    wildcards: sample=SRR3099585_chr18
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, threads=1, mem=1Gb

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 4 with external jobid '748790'.

[Wed Feb 21 23:18:35 2024]
rule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
    jobid: 5
    reason: Forced execution
    wildcards: sample=SRR3105697_chr18
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 5 with external jobid '748791'.

[Wed Feb 21 23:18:35 2024]
rule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
    jobid: 1
    reason: Forced execution
    wildcards: sample=SRR3099586_chr18
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb

fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 1 with external jobid '748792'.

[Wed Feb 21 23:18:35 2024]
rule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
    jobid: 3
    reason: Forced execution
    wildcards: sample=SRR3099587_chr18
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb

fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 3 with external jobid '748793'.

[Wed Feb 21 23:18:35 2024]
rule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 2
    reason: Forced execution
    wildcards: sample=SRR3105698_chr18
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 2 with external jobid '748794'.

[Wed Feb 21 23:18:35 2024]
rule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
    jobid: 6
    reason: Forced execution
    wildcards: sample=SRR3105699_chr18
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 6 with external jobid '748795'.
[Wed Feb 21 23:18:54 2024]
Finished job 4.
1 of 8 steps (12%) done
[Wed Feb 21 23:18:54 2024]
Finished job 5.
2 of 8 steps (25%) done
[Wed Feb 21 23:18:54 2024]
Finished job 1.
3 of 8 steps (38%) done
[Wed Feb 21 23:18:54 2024]
Finished job 3.
4 of 8 steps (50%) done
[Wed Feb 21 23:18:54 2024]
Finished job 2.
5 of 8 steps (62%) done
[Wed Feb 21 23:18:54 2024]
Finished job 6.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 23:18:54 2024]
rule multiqc:
    input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    log: Logs/multiqc.std, Logs/multiqc.err
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, threads=1, mem=1Gb

multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
Submitted job 7 with external jobid '748796'.
[Wed Feb 21 23:19:04 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 23:19:04 2024]
localrule all:
    input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, threads=1, mem=1Gb

[Wed Feb 21 23:19:04 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-02-21T231834.549059.snakemake.log

As you can see in the log output, all jobs were run with the default resources set in our profile: “resources: threads=1, mem=1Gb”. In the next and last objective of this exercise, we’ll see how to specify different resources for each rule in the Snakefile.
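Rather than scanning the log by eye, you could summarise the resources reported for each rule with a few lines of Python. The sketch below is only an illustration: the resources_by_rule helper is hypothetical, it assumes the log layout shown above (a "rule name:" header followed by an indented "resources:" line), and the sample text is a shortened stand-in for a real log such as the one saved under .snakemake/log/.

```python
import re

# Matches both "rule fastqc:" and "localrule all:" headers.
RULE_HEADER = re.compile(r"^(?:local)?rule (\w+):")

def resources_by_rule(log_text):
    """Map each rule name to the set of distinct 'resources:' lines seen for its jobs."""
    summary = {}
    current_rule = None
    for line in log_text.splitlines():
        header = RULE_HEADER.match(line)
        stripped = line.strip()
        if header:
            current_rule = header.group(1)
        elif current_rule and stripped.startswith("resources:"):
            value = stripped[len("resources:"):].strip()
            summary.setdefault(current_rule, set()).add(value)
    return summary

# Shortened stand-in for a real Snakemake log.
sample_log = """\
rule fastqc:
    jobid: 4
    resources: threads=1, mem=1Gb

rule multiqc:
    jobid: 7
    resources: threads=1, mem=1Gb
"""

for rule, seen in resources_by_rule(sample_log).items():
    print(rule, sorted(seen))
```

With the default profile, every rule should report the same resources line, which is what this objective's log shows.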

Objective 5 - adjusting resources for each rule

Create the ex1c_o5.smk Snakefile in which we will specify custom resources for each rule.

Where to start?

These resources can be added using the threads (number of processors) and resources (memory, walltime, etc.) directives.

Click to see help on how to go about it

For example, to specify for rule ruleName that it should use 1 thread, 100Mb of memory and shouldn’t last more than 5 minutes:

rule ruleName:
    input:
        "inputFile.txt"
    output:
        "outputFile.txt"
    envmodules:
        "mySoftware/myVersion",
    threads: 1
    resources:
        mem="100Mb",
        time="00:05:00",
    shell:
        """
            mySoftware {input} > {output}
        """

The threads directive only controls the number of CPUs/processors/threads, whereas resources is where you specify all the other Slurm resources. However, make sure that the resource names you use in your rules match the wildcards used in your profile.
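One way to check this matching is to compare the {resources.NAME} wildcards of the submission command with the names a rule defines. The Python sketch below assumes a hypothetical command template and a hypothetical missing_resources helper; it only illustrates the idea and is not part of Snakemake.

```python
import re

def missing_resources(template, rule_resources, default_resources=()):
    """Return the resource names used in the template but defined nowhere."""
    wanted = set(re.findall(r"\{resources\.(\w+)\}", template))
    available = set(rule_resources) | set(default_resources)
    return sorted(wanted - available)

# Hypothetical template; check the one in your own profile.
template = "sbatch --mem={resources.mem} --time={resources.time}"

# Names match the template: nothing is missing.
ok = missing_resources(template, {"mem": "100Mb", "time": "00:05:00"})
# Different names (mem_mb, runtime): the template's wildcards cannot be filled.
bad = missing_resources(template, {"mem_mb": 100, "runtime": 5})
print(ok)
print(bad)
```

A name can also come from the profile's default resources, which is why the helper accepts a default_resources argument.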

Click to see an example solution

Your code for ex1c_o5.smk should look like this:

SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html"

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz"
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html",
  log:
    "Logs/{sample}_fastqc.std",
    "Logs/{sample}_fastqc.err",
  envmodules: "fastqc/0.12.1"
  threads: 1
  resources:
    mem="500Mb",
    time="00:05:00",
  shell: "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data"),
  log:
    std="Logs/multiqc.std",
    err="Logs/multiqc.err",
  envmodules: "multiqc/1.13"
  threads: 1
  resources:
    mem="1Gb",
    time="00:10:00",
  shell: "multiqc {input} 1>{log.std} 2>{log.err}"

Run your Snakefile

Let’s try running your Snakefile again:

snakemake -s ex1c_o5.smk --configfile ex1.yml -R fastqc --profile slurm

Your output should look like this:

Using profile slurm for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job        count
-------  -------
all            1
fastqc         6
multiqc        1
total          8

Select jobs to execute...
Execute 6 jobs...

[Wed Feb 21 23:41:09 2024]
rule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
    jobid: 1
    reason: Forced execution
    wildcards: sample=SRR3099586_chr18
    resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00

fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 1 with external jobid '748821'.

[Wed Feb 21 23:41:09 2024]
rule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
    jobid: 6
    reason: Forced execution
    wildcards: sample=SRR3105699_chr18
    resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 6 with external jobid '748822'.

[Wed Feb 21 23:41:09 2024]
rule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 2
    reason: Forced execution
    wildcards: sample=SRR3105698_chr18
    resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 2 with external jobid '748823'.

[Wed Feb 21 23:41:09 2024]
rule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
    jobid: 5
    reason: Forced execution
    wildcards: sample=SRR3105697_chr18
    resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 5 with external jobid '748824'.

[Wed Feb 21 23:41:09 2024]
rule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
    jobid: 3
    reason: Forced execution
    wildcards: sample=SRR3099587_chr18
    resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00

fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 3 with external jobid '748825'.

[Wed Feb 21 23:41:09 2024]
rule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
    jobid: 4
    reason: Forced execution
    wildcards: sample=SRR3099585_chr18
    resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 4 with external jobid '748826'.
[Wed Feb 21 23:41:38 2024]
Finished job 1.
1 of 8 steps (12%) done
[Wed Feb 21 23:41:38 2024]
Finished job 6.
2 of 8 steps (25%) done
[Wed Feb 21 23:41:38 2024]
Finished job 2.
3 of 8 steps (38%) done
[Wed Feb 21 23:41:38 2024]
Finished job 5.
4 of 8 steps (50%) done
[Wed Feb 21 23:41:38 2024]
Finished job 3.
5 of 8 steps (62%) done
[Wed Feb 21 23:41:39 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 23:41:39 2024]
rule multiqc:
    input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    log: Logs/multiqc.std, Logs/multiqc.err
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb, time=00:10:00

multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
Submitted job 7 with external jobid '748827'.
[Wed Feb 21 23:41:49 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 23:41:49 2024]
localrule all:
    input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, threads=1, mem=1Gb

[Wed Feb 21 23:41:49 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-02-21T234108.721811.snakemake.log

Observe the output

As you can see in the log output, the fastqc and multiqc jobs weren’t run with the same resources (cf. highlighted lines above, e.g. for fastqc: resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, threads=1, mem=500Mb, time=00:05:00).

To find out how many resources your jobs actually used, you can use the cluster’s reportseff or sacct commands:

[mhennion @ clust-slurm-client2 13:32]$ WF4Bioinfo : sacct --format=JobID,JobName,Start,Elapsed,CPUTime,NCPUS,NodeList,MaxRSS,ReqMeM,State
JobID           JobName               Start    Elapsed    CPUTime      NCPUS        NodeList     MaxRSS     ReqMem      State 
------------ ---------- ------------------- ---------- ---------- ---------- --------------- ---------- ---------- ---------- 
41907135     fastqc-sa+ 2024-09-24T12:15:35   00:00:15   00:00:15          1     cpu-node-40                  500M  COMPLETED 
41907135.ba+      batch 2024-09-24T12:15:35   00:00:15   00:00:15          1     cpu-node-40      3232K             COMPLETED 
41907136     fastqc-sa+ 2024-09-24T12:15:35   00:00:14   00:00:14          1     cpu-node-40                  500M  COMPLETED 
41907136.ba+      batch 2024-09-24T12:15:35   00:00:14   00:00:14          1     cpu-node-40      3208K             COMPLETED 
41907137     fastqc-sa+ 2024-09-24T12:15:35   00:00:16   00:00:16          1     cpu-node-34                  500M  COMPLETED 
41907137.ba+      batch 2024-09-24T12:15:35   00:00:16   00:00:16          1     cpu-node-34      3228K             COMPLETED 
41907138     fastqc-sa+ 2024-09-24T12:15:35   00:00:18   00:00:18          1     cpu-node-34                  500M  COMPLETED 
41907138.ba+      batch 2024-09-24T12:15:35   00:00:18   00:00:18          1     cpu-node-34      3244K             COMPLETED 
41907141       multiqc- 2024-09-24T12:15:57   00:00:24   00:00:24          1     cpu-node-38                  500M  COMPLETED 
41907141.ba+      batch 2024-09-24T12:15:57   00:00:24   00:00:24          1     cpu-node-38      3228K             COMPLETED 
41907158     fastqc-sa+ 2024-09-24T12:20:25   00:00:15   00:00:15          1     cpu-node-45                  286M  COMPLETED 
41907158.ba+      batch 2024-09-24T12:20:25   00:00:15   00:00:15          1     cpu-node-45      3304K             COMPLETED 
41907159     fastqc-sa+ 2024-09-24T12:20:25   00:00:33   00:00:33          1     cpu-node-38                  286M  COMPLETED 
41907159.ba+      batch 2024-09-24T12:20:25   00:00:33   00:00:33          1     cpu-node-38    221708K             COMPLETED 
41907160     fastqc-sa+ 2024-09-24T12:20:25   00:00:19   00:00:19          1     cpu-node-39                  286M  COMPLETED 
41907160.ba+      batch 2024-09-24T12:20:25   00:00:19   00:00:19          1     cpu-node-39      3212K             COMPLETED 
41907161     fastqc-sa+ 2024-09-24T12:20:25   00:00:15   00:00:15          1     cpu-node-40                  286M  COMPLETED 
41907161.ba+      batch 2024-09-24T12:20:25   00:00:15   00:00:15          1     cpu-node-40      3192K             COMPLETED 
41907162     fastqc-sa+ 2024-09-24T12:20:25   00:00:14   00:00:14          1     cpu-node-40                  286M  COMPLETED 
41907162.ba+      batch 2024-09-24T12:20:25   00:00:14   00:00:14          1     cpu-node-40      3244K             COMPLETED 
41907163     fastqc-sa+ 2024-09-24T12:20:25   00:00:17   00:00:17          1     cpu-node-40                  286M  COMPLETED 
41907163.ba+      batch 2024-09-24T12:20:25   00:00:17   00:00:17          1     cpu-node-40      3224K             COMPLETED 
41907169       multiqc- 2024-09-24T12:21:06   00:00:06   00:00:06          1     cpu-node-45                   95M OUT_OF_ME+ 
41907169.ba+      batch 2024-09-24T12:21:06   00:00:06   00:00:06          1     cpu-node-45      3240K            OUT_OF_ME+ 
41907178     fastqc-sa+ 2024-09-24T12:22:47   00:00:12   00:00:12          1     cpu-node-45                  286M  COMPLETED 
41907178.ba+      batch 2024-09-24T12:22:47   00:00:12   00:00:12          1     cpu-node-45      3208K             COMPLETED 
41907179     fastqc-sa+ 2024-09-24T12:22:47   00:00:25   00:00:25          1     cpu-node-38                  286M  COMPLETED 
41907179.ba+      batch 2024-09-24T12:22:47   00:00:25   00:00:25          1     cpu-node-38      3284K             COMPLETED 
41907180     fastqc-sa+ 2024-09-24T12:22:47   00:00:15   00:00:15          1     cpu-node-39                  286M  COMPLETED 
41907180.ba+      batch 2024-09-24T12:22:47   00:00:15   00:00:15          1     cpu-node-39      3260K             COMPLETED 
41907181     fastqc-sa+ 2024-09-24T12:22:47   00:00:14   00:00:14          1     cpu-node-40                  286M  COMPLETED 
41907181.ba+      batch 2024-09-24T12:22:47   00:00:14   00:00:14          1     cpu-node-40      3200K             COMPLETED 
41907182     fastqc-sa+ 2024-09-24T12:22:47   00:00:12   00:00:12          1     cpu-node-40                  286M  COMPLETED 
41907182.ba+      batch 2024-09-24T12:22:47   00:00:12   00:00:12          1     cpu-node-40      3268K             COMPLETED 
41907183     fastqc-sa+ 2024-09-24T12:22:47   00:00:13   00:00:13          1     cpu-node-40                  286M  COMPLETED 
41907183.ba+      batch 2024-09-24T12:22:47   00:00:13   00:00:13          1     cpu-node-40      3192K             COMPLETED 
41907184       multiqc- 2024-09-24T12:23:17   00:00:07   00:00:07          1     cpu-node-45                  191M  COMPLETED 
41907184.ba+      batch 2024-09-24T12:23:17   00:00:07   00:00:07          1     cpu-node-45      3184K             COMPLETED 
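To turn the MaxRSS and ReqMem columns into a single figure, you can compute the fraction of the requested memory that a job actually used. The Python sketch below is a minimal illustration: the to_mib and mem_efficiency helpers are hypothetical and only handle the K/M/G/T suffixes that appear in the table above. Keep in mind that MaxRSS is sampled periodically by Slurm, so very short jobs may report less memory than they really used.

```python
# Conversion factors from the unit suffix to MiB.
UNITS_IN_MIB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}

def to_mib(value):
    """Convert a size such as '3232K' or '500M' to MiB."""
    value = value.strip()
    unit = value[-1].upper()
    if unit in UNITS_IN_MIB:
        return float(value[:-1]) * UNITS_IN_MIB[unit]
    # No suffix: assume the value is given in bytes.
    return float(value) / (1024 * 1024)

def mem_efficiency(max_rss, req_mem):
    """Fraction of the requested memory that was actually used."""
    return to_mib(max_rss) / to_mib(req_mem)

# Values taken from job 41907159 in the table above.
eff = mem_efficiency("221708K", "286M")
print(f"{eff:.0%} of the requested memory was used")
```

An efficiency close to 100% means the request was tight, whereas a very low value suggests that less memory could be requested for that rule.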

Tip: Make an alias in your ~/.bashrc with your favorite formatting.

alias sa="sacct --format=JobID,JobName,Start,Elapsed,CPUTime,NCPUS,NodeList,MaxRSS,ReqMeM,State"
module load reportseff

With reportseff, you can restrict the analysis to a folder, a time window, a partition, a user, etc. See the documentation. For instance, to analyse all jobs with outputs in slurm_output/:

[mhennion @ clust-slurm-client2 13:28]$ WF4Bioinfo : reportseff --format "+Start,CPUTime,NCPUS,NodeList,MaxRSS,ReqMeM" --modified-sort slurm_output/
[Screenshot: reportseff output]

The job ids given by the cluster are also listed in the log that’s printed on your screen (cf. highlighted lines above, e.g.: Submitted job 1 with external jobid '748821').
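If you saved the Snakemake log, these job ids can also be collected automatically and passed to sacct. The Python sketch below assumes the "Submitted job N with external jobid 'ID'" wording shown in the log above; the external_jobids helper and the sample text are hypothetical, for illustration only.

```python
import re

# Captures Snakemake's internal job id and the id assigned by the cluster.
SUBMITTED = re.compile(r"Submitted job (\d+) with external jobid '(\d+)")

def external_jobids(log_text):
    """Map each Snakemake job id to the job id given by the cluster."""
    return {int(job): ext for job, ext in SUBMITTED.findall(log_text)}

# Shortened stand-in for a real Snakemake log.
sample_log = """\
Submitted job 1 with external jobid '748821'.
Submitted job 7 with external jobid '748827'.
"""

ids = external_jobids(sample_log)
# sacct accepts a comma-separated list of job ids with -j.
sacct_cmd = "sacct -j " + ",".join(ids.values())
print(sacct_cmd)
```

This restricts the sacct report to the jobs of one particular Snakemake run instead of listing all your recent jobs.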

Go ahead and compare the resources used in ex1c_o3.smk (default resources) and ex1c_o5.smk (customised resources) for a given job. You will see that jobs run in objective 5 are more efficient in resource usage than in objective 3. In particular, you can use MemEff, CPUEff and TimeEff keywords in the reportseff formatting for this.

Recap

We’ve seen

We also learnt in the previous parts of this exercise:

Some extra information about software environment

Of note, Snakemake also supports conda/mamba environments and containers (Docker, Singularity and Apptainer) in the same way as it supports module environments.