Exercise C - adapting your pipeline to an HPC environment
In this exercise, we’ll start where we left off in Exercise B, using the Snakefile you wrote in objective 3, which successively runs FastQC then MultiQC on a set of RNA-seq data, and adapt it to an HPC environment, namely the IFB’s cluster.
Motivation:
Up until now, our workflow has run on a single processor: each job runs sequentially, which takes time and is frustrating when you know that more resources are available on the cluster (using more processors reduces computation time in most cases).
There are two ways of scaling up your workflow: running several jobs in parallel on the processors available locally, or letting the cluster’s scheduler dispatch each job onto the cluster’s nodes. We’ll explore both below.
How this exercise is organised
As for the previous exercises, each step will address an objective. We will thus do several cycles of Snakemake execution, observing the results and improving the code. Each code version will be named ex1c_oX.smk, with X a digit corresponding to the objective number.
Warning: keep in mind that we’ll be using commands that are specific to the IFB cluster’s scheduler system (they might be different on other clusters).
The setup is the same as for Exercises A and B: we will work on a node of the cluster with the Snakemake, FastQC and MultiQC modules loaded.
As a reminder, you should be in your working directory from Exercise B, in which you have the Data/ folder containing all the example files.
login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session$ ls -a
. .. Data FastQC Logs multiqc_data multiqc_report.html .snakemake
You should also have the snakemake, FastQC and MultiQC modules loaded:
module load snakemake/8.9.0 fastqc/0.12.1 multiqc/1.13
NB: If you’ve skipped Exercise A and B, you can copy-paste the example solution script from Exercise B objective 3.
--cores
Learn how to tell Snakemake that multiple processors are available locally using the --cores option.
Remember --cores/-c from your previous commands? This option tells Snakemake how many processors (/CPUs/cores/threads) it may use to run your jobs locally. It’s a mandatory option, so if you run Snakemake without it, you’ll get an error message.
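If you first want to check what would be run without executing anything, you can do a dry run with the -n/--dry-run flag (assuming your Snakefile from Exercise B objective 3 is named ex1b_o3.smk, as in the commands below):
snakemake -s ex1b_o3.smk -p --configfile ex1.yml --cores 2 -n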
First, let’s check how many CPUs our current interactive session has reserved, then request a new session with 2 CPUs:
cquignot@cpu-node-40:~$ module load reportseff
cquignot@cpu-node-40:~$ reportseff -u $USER --format "JobID,JobName,State,ReqNodes,ReqCPUS"
JobID JobName State ReqNodes ReqCPUS
42017684 bash RUNNING 1 1
cquignot@cpu-node-40:~$ exit
cquignot@clust-slurm-client:~$
cquignot@clust-slurm-client:~$ srun --cpus-per-task 2 --account=2417_wf4bioinfo --pty bash
srun: job 42017685 queued and waiting for resources
srun: job 42017685 has been allocated resources
cquignot@cpu-node-40:~$ module load reportseff
cquignot@cpu-node-40:~$ reportseff -u $USER --format "JobID,JobName,State,ReqNodes,ReqCPUS"
JobID JobName State ReqNodes ReqCPUS
42017684 bash COMPLETED 1 1
42017685 bash RUNNING 1 2
cquignot@cpu-node-40:~$ module load snakemake/8.9.0 fastqc/0.12.1 multiqc/1.13
cquignot@cpu-node-40:~$ module list
Currently Loaded Modulefiles:
1) snakemake/8.9.0 2) multiqc/1.13 3) fastqc/0.12.1
cquignot@cpu-node-40:~$ cd /shared/projects/2417_wf4bioinfo/$USER/day2-session1
Running Snakemake on 2 processors instead of just one is quite straightforward (we’ll add -R fastqc to force Snakemake to re-run everything so you can observe the changes):
snakemake -s ex1b_o3.smk -p --configfile ex1.yml -R fastqc --cores 2
Your output should look like this:
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job stats:
job count
------- -------
all 1
fastqc 6
multiqc 1
total 8
Select jobs to execute...
Execute 2 jobs...
[Wed Feb 21 13:37:38 2024]
localrule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
jobid: 2
reason: Forced execution
wildcards: sample=SRR3105698_chr18
resources: tmpdir=/tmp
[Wed Feb 21 13:37:38 2024]
localrule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
jobid: 6
reason: Forced execution
wildcards: sample=SRR3105699_chr18
resources: tmpdir=/tmp
[Wed Feb 21 13:37:52 2024]
Finished job 6.
1 of 8 steps (12%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 13:37:52 2024]
localrule fastqc:
input: Data/SRR3099586_chr18.fastq.gz
output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
jobid: 1
reason: Forced execution
wildcards: sample=SRR3099586_chr18
resources: tmpdir=/tmp
[Wed Feb 21 13:37:54 2024]
Finished job 2.
2 of 8 steps (25%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 13:37:54 2024]
localrule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
jobid: 5
reason: Forced execution
wildcards: sample=SRR3105697_chr18
resources: tmpdir=/tmp
[Wed Feb 21 13:38:02 2024]
Finished job 1.
3 of 8 steps (38%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 13:38:02 2024]
localrule fastqc:
input: Data/SRR3099587_chr18.fastq.gz
output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
jobid: 3
reason: Forced execution
wildcards: sample=SRR3099587_chr18
resources: tmpdir=/tmp
[Wed Feb 21 13:38:04 2024]
Finished job 5.
4 of 8 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 13:38:04 2024]
localrule fastqc:
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
jobid: 4
reason: Forced execution
wildcards: sample=SRR3099585_chr18
resources: tmpdir=/tmp
[Wed Feb 21 13:38:14 2024]
Finished job 3.
5 of 8 steps (62%) done
[Wed Feb 21 13:38:15 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 13:38:15 2024]
localrule multiqc:
input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
log: Logs/multiqc.std, Logs/multiqc.err
jobid: 7
reason: Input files updated by another job: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip
resources: tmpdir=/tmp
[Wed Feb 21 13:38:37 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 13:38:37 2024]
localrule all:
input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
jobid: 0
reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html
resources: tmpdir=/tmp
[Wed Feb 21 13:38:37 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-02-21T133728.474300.snakemake.log
You can now see in the log that Snakemake registered the 2 processors: "Provided cores: 2". It’s difficult to tell whether your jobs are actually running simultaneously just by looking at the log, but you can clearly see that Snakemake starts the first 2 jobs at the same time: "Execute 2 jobs...".
Running Snakemake on 2 processors doesn’t reduce the computation time considerably in this case since we only have 6 input files to process and the tools that we are running are already quite fast to execute.
Local execution:
Everything we’ve seen up until now could technically also be run on a local computer, including what we’ve just done above, provided your PC has several processors (and the necessary software installed).
There are better ways than --cores to parallelise:
We’ve also seen that using --cores N (N being the number of processors to use) enables us to distribute the workload and thus accelerate the computation. However, when running on a cluster, this is not the optimal way to parallelise your pipeline.
Why? Let’s take what we just did as an example:
All 6 fastqc jobs can be evenly distributed over the 2 processors, but the second rule (multiqc) only uses 1 of the 2 processors we’ve reserved because there’s only one job for this rule. That’s not too bad with fast-running tools, but it becomes more problematic for software with much longer execution times…
What’s the solution? It would be more efficient to let the scheduler (Slurm in our case) deal with the distribution of jobs according to the resources available on the cluster. With this system, we’ll also be able to match the resources we reserve to what each tool actually needs, thereby leaving unused resources free for others to use.
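To make the contrast concrete, submitting a single FastQC job to Slurm by hand would look something like the line below (a hypothetical one-off submission shown for illustration only; in the next objective we’ll let Snakemake generate this kind of submission for us):
sbatch --cpus-per-task=1 --mem 500Mb --wrap "fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz"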
Disconnect from the current job
Let’s disconnect from the current interactive session before we move on. Snakemake itself doesn’t need many resources to run and can be run directly from the login node (i.e. clust-slurm-client) without problems, as long as the rules in your Snakefile are still run on the compute nodes. Don’t forget to load the snakemake module!
cquignot@cpu-node-40:~$ exit
cquignot@clust-slurm-client:~$
Learn how to dispatch each individual job onto separate processors of the cluster using the executor option.
--executor cluster-generic is used to tell Snakemake that we would like to use a “cluster-generic” executor rather than the local shell.
The --cluster-generic-* options are used to specify how Snakemake should use the given executor.
--jobs enables you to specify the maximum number of jobs that are allowed to run at the same time.
Running Snakemake with these extra options is quite straightforward (NB: we’ll add -R fastqc to force Snakemake to re-run everything so you can observe the changes):
snakemake -s ex1b_o3.smk --executor "cluster-generic" --cluster-generic-submit-cmd "sbatch --cpus-per-task=1 --mem 500Mb" --jobs 6 --configfile ex1.yml -p -R fastqc
Your output should look like this:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job count
------- -------
all 1
fastqc 6
multiqc 1
total 8
Select jobs to execute...
Execute 6 jobs...
[Sat Oct 12 11:43:38 2024]
rule fastqc:
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
jobid: 1
reason: Forced execution
wildcards: sample=SRR3099585_chr18
resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00
fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 1 with external jobid 'Submitted batch job 42161132'.
[Sat Oct 12 11:43:38 2024]
rule fastqc:
input: Data/SRR3099587_chr18.fastq.gz
output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
jobid: 5
reason: Forced execution
wildcards: sample=SRR3099587_chr18
resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00
fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 5 with external jobid 'Submitted batch job 42161133'.
[Sat Oct 12 11:43:38 2024]
rule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
jobid: 2
reason: Forced execution
wildcards: sample=SRR3105699_chr18
resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 2 with external jobid 'Submitted batch job 42161134'.
[Sat Oct 12 11:43:38 2024]
rule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
jobid: 6
reason: Forced execution
wildcards: sample=SRR3105697_chr18
resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 6 with external jobid 'Submitted batch job 42161135'.
[Sat Oct 12 11:43:39 2024]
rule fastqc:
input: Data/SRR3099586_chr18.fastq.gz
output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
jobid: 3
reason: Forced execution
wildcards: sample=SRR3099586_chr18
resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00
fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 3 with external jobid 'Submitted batch job 42161136'.
[Sat Oct 12 11:43:39 2024]
rule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
jobid: 4
reason: Forced execution
wildcards: sample=SRR3105698_chr18
resources: tmpdir=<TBD>, mem=500Mb, mem_mb=477, time=00:05:00
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 4 with external jobid 'Submitted batch job 42161137'.
[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 1
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err (check log file(s) for error details)
shell:
fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: Submitted batch job 42161132
[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 5
input: Data/SRR3099587_chr18.fastq.gz
output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err (check log file(s) for error details)
shell:
fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: Submitted batch job 42161133
[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 2
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err (check log file(s) for error details)
shell:
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: Submitted batch job 42161134
[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 6
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err (check log file(s) for error details)
shell:
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: Submitted batch job 42161135
[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 3
input: Data/SRR3099586_chr18.fastq.gz
output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err (check log file(s) for error details)
shell:
fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: Submitted batch job 42161136
[Sat Oct 12 11:43:55 2024]
Error in rule fastqc:
message: For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 4
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err (check log file(s) for error details)
shell:
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: Submitted batch job 42161137
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-10-12T114335.517604.snakemake.log
WorkflowError: At least one job did not complete successfully.
Are you getting red error messages too? Don’t worry, that was expected 😉
Let’s first have a look at the output log. There are several indicators that the jobs were submitted correctly to Slurm: "Provided remote nodes: 6" and "Submitted job 5 with external jobid 'Submitted batch job 42161133'.", for example. We can see that each job was submitted individually to Slurm (they all have individual “external jobids”). We can also see all of Slurm’s output/error files generated in our current working directory:
cquignot@clust-slurm-client:/shared/projects/2417_wf4bioinfo/cquignot/day2-session$ ls
Data ex1.yml ex1.smk slurm-42161133.out slurm-42161135.out slurm-42161137.out
Logs multiqc_data multiqc_report.html slurm-42161134.out slurm-42161136.out slurm-42161132.out
So what went wrong? The log tells us that the fastqc jobs didn’t end properly. Their logs are redirected into files stored in the Logs/ folder, so let’s have a look at one of them:
cquignot@clust-slurm-client:/shared/projects/2417_wf4bioinfo/cquignot/day2-session$ more Logs/SRR3099586_chr18_fastqc.err
/usr/bin/bash: fastqc: command not found
Aha! Do you remember the module loads we did at the beginning of this exercise? Here, each job is executed on a new node, on which we haven’t loaded the necessary software for fastqc to run. We’ll see how to fix this in the next objective.
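If you want to reproduce the error outside of Snakemake, you can try launching fastqc in a one-off job from the login node (assuming you haven’t loaded the fastqc module there); it should fail with a similar “command not found” error:
cquignot@clust-slurm-client:~$ srun --account=2417_wf4bioinfo fastqc --version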
Controlling the software environment in Snakemake: create a Snakefile called ex1c_o3.smk in which we will use the envmodules directive to load multiqc and fastqc before running each of these rules.
So that Snakemake automatically loads the correct software with module load before running the actual command line of a given rule, we can add the envmodules directive to the rules. For example, instead of running module load mySoftware/myVersion beforehand, we could integrate it into the rule like this:
rule ruleName:
    input:
        "inputFile.txt"
    output:
        "outputFile.txt"
    envmodules:
        "mySoftware/myVersion",
    shell:
        """
        mySoftware {input} > {output}
        """
Your code for ex1c_o3.smk should look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
    input:
        expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
        "multiqc_report.html"

rule fastqc:
    input:
        config["dataDir"]+"/{sample}.fastq.gz"
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html"
    log:
        "Logs/{sample}_fastqc.std",
        "Logs/{sample}_fastqc.err"
    envmodules: "fastqc/0.12.1"
    shell: "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"

rule multiqc:
    input:
        expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    log:
        std="Logs/multiqc.std",
        err="Logs/multiqc.err"
    envmodules: "multiqc/1.13"
    shell: "multiqc {input} 1>{log.std} 2>{log.err}"
Now let’s run Snakemake again with the -R fastqc option. Don’t forget to also add --software-deployment-method env-modules:
snakemake -s ex1c_o3.smk --software-deployment-method env-modules --executor "cluster-generic" --cluster-generic-submit-cmd "sbatch --cpus-per-task=1 --mem 500Mb" --jobs 6 --configfile ex1.yml -p -R fastqc
Your output should look like this:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job count
------- -------
all 1
fastqc 6
multiqc 1
total 8
Select jobs to execute...
Execute 6 jobs...
[Wed Feb 21 22:36:11 2024]
rule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
jobid: 6
reason: Forced execution
wildcards: sample=SRR3105699_chr18
resources: tmpdir=
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 6 with external jobid '748703'.
[Wed Feb 21 22:36:11 2024]
rule fastqc:
input: Data/SRR3099586_chr18.fastq.gz
output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
jobid: 1
reason: Forced execution
wildcards: sample=SRR3099586_chr18
resources: tmpdir=
fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 1 with external jobid '748704'.
[Wed Feb 21 22:36:11 2024]
rule fastqc:
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
jobid: 4
reason: Forced execution
wildcards: sample=SRR3099585_chr18
resources: tmpdir=
fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 4 with external jobid '748705'.
[Wed Feb 21 22:36:11 2024]
rule fastqc:
input: Data/SRR3099587_chr18.fastq.gz
output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
jobid: 3
reason: Forced execution
wildcards: sample=SRR3099587_chr18
resources: tmpdir=
fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 3 with external jobid '748706'.
[Wed Feb 21 22:36:11 2024]
rule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
jobid: 2
reason: Forced execution
wildcards: sample=SRR3105698_chr18
resources: tmpdir=
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 2 with external jobid '748707'.
[Wed Feb 21 22:36:11 2024]
rule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
jobid: 5
reason: Forced execution
wildcards: sample=SRR3105697_chr18
resources: tmpdir=
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 5 with external jobid '748708'.
[Wed Feb 21 22:36:40 2024]
Finished job 6.
1 of 8 steps (12%) done
[Wed Feb 21 22:36:40 2024]
Finished job 1.
2 of 8 steps (25%) done
[Wed Feb 21 22:36:40 2024]
Finished job 4.
3 of 8 steps (38%) done
[Wed Feb 21 22:36:40 2024]
Finished job 3.
4 of 8 steps (50%) done
[Wed Feb 21 22:36:40 2024]
Finished job 2.
5 of 8 steps (62%) done
[Wed Feb 21 22:36:41 2024]
Finished job 5.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 22:36:41 2024]
rule multiqc:
input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
log: Logs/multiqc.std, Logs/multiqc.err
jobid: 7
reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip
resources: tmpdir=
multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
Submitted job 7 with external jobid '748713'.
Will exit after finishing currently running jobs (scheduler).
[Wed Feb 21 22:37:11 2024]
Finished job 7.
7 of 8 steps (88%) done
Will exit after finishing currently running jobs (scheduler).
Shutting down, this might take some time.
Congratulations! You’ve run your first Snakefile through the SLURM scheduler!
Now let’s see how we can simplify the command line because it’s starting to get really long…
Create a profile for Snakemake.
To avoid typing cluster-specific and basic options such as -p or --executor "cluster-generic" on the command line every time you run a Snakefile on the IFB cluster, we can add them all to a profile file instead and then forget about them. We’ll call this file config.yaml and put it in our home directory in $HOME/.config/snakemake/slurm/. You might need to create the directory first:
mkdir -p $HOME/.config/snakemake/slurm/
Inside this file ($HOME/.config/snakemake/slurm/config.yaml), we’ll put all the options we routinely use to run Snakemake as well as those we use specifically in conjunction with Slurm:
# max number of jobs in parallel
jobs: 10

# cluster-specific options:
executor: cluster-generic
cluster-generic-submit-cmd:
  mkdir -p slurm_output/ &&
  sbatch
  --partition={resources.partition}
  --cpus-per-task={threads}
  --mem={resources.mem_mb}
  --job-name={rule}-{wildcards}
  --output=slurm_output/{rule}-{wildcards}-%j.out
  --time={resources.time}

# define default resources **per job**
default-resources:
  - mem_mb=1000
  - threads=1
  - partition=fast
  - time="02:00:00"

# software option: use modules
software-deployment-method: env-modules

# print all commands
printshellcmds: True
Since we’re creating a “general” profile, we have to be able to adjust the parameters given to the sbatch command from within the Snakefile. Thus, we “generalise” the submission command with wildcards (e.g. {threads}, {resources.mem_mb}).
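For illustration, with the profile above and its default resources, the submit command for one of the fastqc jobs would expand to something like this (illustrative values; Snakemake appends the path of the actual job script at the end):
mkdir -p slurm_output/ && sbatch --partition=fast --cpus-per-task=1 --mem=1000 --job-name=fastqc-sample=SRR3099585_chr18 --output=slurm_output/fastqc-sample=SRR3099585_chr18-%j.out --time=02:00:00 <job script>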
You can find more on cluster execution with Snakemake in the Snakemake documentation.
Let’s try running your Snakefile again (don’t forget the --profile slurm):
snakemake -s ex1c_o3.smk --configfile ex1.yml -R fastqc --profile slurm
Why --profile slurm? You can either specify a path (the directory in which the profile is) or just a profile name (the name is given by the parent directory in which the profile is saved). Snakemake will automatically look in the default places where profiles can be stored (including the directory in which we just placed ours).
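In our case, the two following invocations should therefore be equivalent (the second spells out the profile directory explicitly; we’ll stick with the short form):
snakemake -s ex1c_o3.smk --configfile ex1.yml -R fastqc --profile slurm
snakemake -s ex1c_o3.smk --configfile ex1.yml -R fastqc --profile $HOME/.config/snakemake/slurm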
Your output should look like this:
Using profile slurm for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job count
------- -------
all 1
fastqc 6
multiqc 1
total 8
Select jobs to execute...
Execute 6 jobs...
[Wed Feb 21 23:18:34 2024]
rule fastqc:
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
jobid: 4
reason: Forced execution
wildcards: sample=SRR3099585_chr18
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb
fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 4 with external jobid '748790'.
[Wed Feb 21 23:18:35 2024]
rule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
jobid: 5
reason: Forced execution
wildcards: sample=SRR3105697_chr18
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 5 with external jobid '748791'.
[Wed Feb 21 23:18:35 2024]
rule fastqc:
input: Data/SRR3099586_chr18.fastq.gz
output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
jobid: 1
reason: Forced execution
wildcards: sample=SRR3099586_chr18
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb
fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 1 with external jobid '748792'.
[Wed Feb 21 23:18:35 2024]
rule fastqc:
input: Data/SRR3099587_chr18.fastq.gz
output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
jobid: 3
reason: Forced execution
wildcards: sample=SRR3099587_chr18
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb
fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 3 with external jobid '748793'.
[Wed Feb 21 23:18:35 2024]
rule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
jobid: 2
reason: Forced execution
wildcards: sample=SRR3105698_chr18
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 2 with external jobid '748794'.
[Wed Feb 21 23:18:35 2024]
rule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
jobid: 6
reason: Forced execution
wildcards: sample=SRR3105699_chr18
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 6 with external jobid '748795'.
[Wed Feb 21 23:18:54 2024]
Finished job 4.
1 of 8 steps (12%) done
[Wed Feb 21 23:18:54 2024]
Finished job 5.
2 of 8 steps (25%) done
[Wed Feb 21 23:18:54 2024]
Finished job 1.
3 of 8 steps (38%) done
[Wed Feb 21 23:18:54 2024]
Finished job 3.
4 of 8 steps (50%) done
[Wed Feb 21 23:18:54 2024]
Finished job 2.
5 of 8 steps (62%) done
[Wed Feb 21 23:18:54 2024]
Finished job 6.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 23:18:54 2024]
rule multiqc:
input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
log: Logs/multiqc.std, Logs/multiqc.err
jobid: 7
reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, threads=1, mem=1Gb
multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
Submitted job 7 with external jobid '748796'.
[Wed Feb 21 23:19:04 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 23:19:04 2024]
localrule all:
input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
jobid: 0
reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, threads=1, mem=1Gb
[Wed Feb 21 23:19:04 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-02-21T231834.549059.snakemake.log
As you can see in the log output, all jobs were run with the default resources set in our profile: "resources: threads=1, mem=1Gb". In the next and last objective of this exercise, we’ll see how to specify different resources for each rule in the Snakefile.
Create the ex1c_o5.smk Snakefile in which we will specify custom resources for each rule.
These resources can be added using the threads (number of processors) and resources (memory, walltime, etc.) directives. For example, to specify that rule ruleName should use 1 thread and 100Mb of memory, and shouldn’t run for more than 5 minutes:
rule ruleName:
    input:
        "inputFile.txt"
    output:
        "outputFile.txt"
    envmodules:
        "mySoftware/myVersion",
    threads: 1
    resources:
        mem="100Mb",
        time="00:05:00",
    shell:
        """
        mySoftware {input} > {output}
        """
The threads directive only controls the number of CPUs/processors/threads, whereas resources is where you specify all the rest of Slurm’s resources. Make sure, however, that the resource names match the wildcards used in your profile’s submission command.
Your code for ex1c_o5.smk should look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
    input:
        expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
        "multiqc_report.html"

rule fastqc:
    input:
        config["dataDir"]+"/{sample}.fastq.gz"
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html",
    log:
        "Logs/{sample}_fastqc.std",
        "Logs/{sample}_fastqc.err",
    envmodules: "fastqc/0.12.1"
    threads: 1
    resources:
        mem="500Mb",
        time="00:05:00",
    shell: "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"

rule multiqc:
    input:
        expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
    output:
        "multiqc_report.html",
        directory("multiqc_data"),
    log:
        std="Logs/multiqc.std",
        err="Logs/multiqc.err",
    envmodules: "multiqc/1.13"
    threads: 1
    resources:
        mem="1Gb",
        time="00:10:00",
    shell: "multiqc {input} 1>{log.std} 2>{log.err}"
Let’s try running your Snakefile again:
snakemake -s ex1c_o5.smk --configfile ex1.yml -R fastqc --profile slurm
Your output should look like this:
Using profile slurm for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job count
------- -------
all 1
fastqc 6
multiqc 1
total 8
Select jobs to execute...
Execute 6 jobs...
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3099586_chr18.fastq.gz
output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
jobid: 1
reason: Forced execution
wildcards: sample=SRR3099586_chr18
resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00
fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 1 with external jobid '748821'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
jobid: 6
reason: Forced execution
wildcards: sample=SRR3105699_chr18
resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 6 with external jobid '748822'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
jobid: 2
reason: Forced execution
wildcards: sample=SRR3105698_chr18
resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 2 with external jobid '748823'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
jobid: 5
reason: Forced execution
wildcards: sample=SRR3105697_chr18
resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 5 with external jobid '748824'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3099587_chr18.fastq.gz
output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
jobid: 3
reason: Forced execution
wildcards: sample=SRR3099587_chr18
resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00
fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 3 with external jobid '748825'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
jobid: 4
reason: Forced execution
wildcards: sample=SRR3099585_chr18
resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=500Mb, time=00:05:00
fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 4 with external jobid '748826'.
[Wed Feb 21 23:41:38 2024]
Finished job 1.
1 of 8 steps (12%) done
[Wed Feb 21 23:41:38 2024]
Finished job 6.
2 of 8 steps (25%) done
[Wed Feb 21 23:41:38 2024]
Finished job 2.
3 of 8 steps (38%) done
[Wed Feb 21 23:41:38 2024]
Finished job 5.
4 of 8 steps (50%) done
[Wed Feb 21 23:41:38 2024]
Finished job 3.
5 of 8 steps (62%) done
[Wed Feb 21 23:41:39 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 23:41:39 2024]
rule multiqc:
input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
log: Logs/multiqc.std, Logs/multiqc.err
jobid: 7
reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb, time=00:10:00
multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
Submitted job 7 with external jobid '748827'.
[Wed Feb 21 23:41:49 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 23:41:49 2024]
localrule all:
input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
jobid: 0
reason: Input files updated by another job: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, threads=1, mem=1Gb
[Wed Feb 21 23:41:49 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-02-21T234108.721811.snakemake.log
As you can see in the log output (cf. highlighted lines above), the fastqc and multiqc jobs were each run with their own resources rather than the profile defaults, e.g. for fastqc: resources: mem_mb=500, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, threads=1, mem=500Mb, time=00:05:00.
In order to know how many resources your jobs actually used, you can use the cluster’s reportseff or sacct commands:
sacct
[mhennion @ clust-slurm-client2 13:32]$ WF4Bioinfo : sacct --format=JobID,JobName,Start,Elapsed,CPUTime,NCPUS,NodeList,MaxRSS,ReqMeM,State
JobID JobName Start Elapsed CPUTime NCPUS NodeList MaxRSS ReqMem State
------------ ---------- ------------------- ---------- ---------- ---------- --------------- ---------- ---------- ----------
41907135 fastqc-sa+ 2024-09-24T12:15:35 00:00:15 00:00:15 1 cpu-node-40 500M COMPLETED
41907135.ba+ batch 2024-09-24T12:15:35 00:00:15 00:00:15 1 cpu-node-40 3232K COMPLETED
41907136 fastqc-sa+ 2024-09-24T12:15:35 00:00:14 00:00:14 1 cpu-node-40 500M COMPLETED
41907136.ba+ batch 2024-09-24T12:15:35 00:00:14 00:00:14 1 cpu-node-40 3208K COMPLETED
41907137 fastqc-sa+ 2024-09-24T12:15:35 00:00:16 00:00:16 1 cpu-node-34 500M COMPLETED
41907137.ba+ batch 2024-09-24T12:15:35 00:00:16 00:00:16 1 cpu-node-34 3228K COMPLETED
41907138 fastqc-sa+ 2024-09-24T12:15:35 00:00:18 00:00:18 1 cpu-node-34 500M COMPLETED
41907138.ba+ batch 2024-09-24T12:15:35 00:00:18 00:00:18 1 cpu-node-34 3244K COMPLETED
41907141 multiqc- 2024-09-24T12:15:57 00:00:24 00:00:24 1 cpu-node-38 500M COMPLETED
41907141.ba+ batch 2024-09-24T12:15:57 00:00:24 00:00:24 1 cpu-node-38 3228K COMPLETED
41907158 fastqc-sa+ 2024-09-24T12:20:25 00:00:15 00:00:15 1 cpu-node-45 286M COMPLETED
41907158.ba+ batch 2024-09-24T12:20:25 00:00:15 00:00:15 1 cpu-node-45 3304K COMPLETED
41907159 fastqc-sa+ 2024-09-24T12:20:25 00:00:33 00:00:33 1 cpu-node-38 286M COMPLETED
41907159.ba+ batch 2024-09-24T12:20:25 00:00:33 00:00:33 1 cpu-node-38 221708K COMPLETED
41907160 fastqc-sa+ 2024-09-24T12:20:25 00:00:19 00:00:19 1 cpu-node-39 286M COMPLETED
41907160.ba+ batch 2024-09-24T12:20:25 00:00:19 00:00:19 1 cpu-node-39 3212K COMPLETED
41907161 fastqc-sa+ 2024-09-24T12:20:25 00:00:15 00:00:15 1 cpu-node-40 286M COMPLETED
41907161.ba+ batch 2024-09-24T12:20:25 00:00:15 00:00:15 1 cpu-node-40 3192K COMPLETED
41907162 fastqc-sa+ 2024-09-24T12:20:25 00:00:14 00:00:14 1 cpu-node-40 286M COMPLETED
41907162.ba+ batch 2024-09-24T12:20:25 00:00:14 00:00:14 1 cpu-node-40 3244K COMPLETED
41907163 fastqc-sa+ 2024-09-24T12:20:25 00:00:17 00:00:17 1 cpu-node-40 286M COMPLETED
41907163.ba+ batch 2024-09-24T12:20:25 00:00:17 00:00:17 1 cpu-node-40 3224K COMPLETED
41907169 multiqc- 2024-09-24T12:21:06 00:00:06 00:00:06 1 cpu-node-45 95M OUT_OF_ME+
41907169.ba+ batch 2024-09-24T12:21:06 00:00:06 00:00:06 1 cpu-node-45 3240K OUT_OF_ME+
41907178 fastqc-sa+ 2024-09-24T12:22:47 00:00:12 00:00:12 1 cpu-node-45 286M COMPLETED
41907178.ba+ batch 2024-09-24T12:22:47 00:00:12 00:00:12 1 cpu-node-45 3208K COMPLETED
41907179 fastqc-sa+ 2024-09-24T12:22:47 00:00:25 00:00:25 1 cpu-node-38 286M COMPLETED
41907179.ba+ batch 2024-09-24T12:22:47 00:00:25 00:00:25 1 cpu-node-38 3284K COMPLETED
41907180 fastqc-sa+ 2024-09-24T12:22:47 00:00:15 00:00:15 1 cpu-node-39 286M COMPLETED
41907180.ba+ batch 2024-09-24T12:22:47 00:00:15 00:00:15 1 cpu-node-39 3260K COMPLETED
41907181 fastqc-sa+ 2024-09-24T12:22:47 00:00:14 00:00:14 1 cpu-node-40 286M COMPLETED
41907181.ba+ batch 2024-09-24T12:22:47 00:00:14 00:00:14 1 cpu-node-40 3200K COMPLETED
41907182 fastqc-sa+ 2024-09-24T12:22:47 00:00:12 00:00:12 1 cpu-node-40 286M COMPLETED
41907182.ba+ batch 2024-09-24T12:22:47 00:00:12 00:00:12 1 cpu-node-40 3268K COMPLETED
41907183 fastqc-sa+ 2024-09-24T12:22:47 00:00:13 00:00:13 1 cpu-node-40 286M COMPLETED
41907183.ba+ batch 2024-09-24T12:22:47 00:00:13 00:00:13 1 cpu-node-40 3192K COMPLETED
41907184 multiqc- 2024-09-24T12:23:17 00:00:07 00:00:07 1 cpu-node-45 191M COMPLETED
41907184.ba+ batch 2024-09-24T12:23:17 00:00:07 00:00:07 1 cpu-node-45 3184K COMPLETED
Tip: make an alias in your ~/.bashrc with your favorite formatting.
alias sa="sacct --format=JobID,JobName,Start,Elapsed,CPUTime,NCPUS,NodeList,MaxRSS,ReqMeM,State"
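Once the alias is defined (and your ~/.bashrc re-sourced), you can combine it with sacct’s usual options, for instance restricting it to specific jobs with -j (jobids below are examples, substitute your own):
sa -j 748821,748827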
reportseff
module load reportseff
With reportseff, you can limit the analysis to a folder, a time window, a partition, a user, etc. (see the documentation). For instance, to analyse all jobs with outputs in slurm_output/:
[mhennion @ clust-slurm-client2 13:28]$ WF4Bioinfo : reportseff --format "+Start,CPUTime,NCPUS,NodeList,MaxRSS,ReqMeM" --modified-sort slurm_output/
The job ids given by the cluster are also listed in the log that’s printed on your screen (cf. highlighted lines above, e.g.: Submitted job 1 with external jobid '748821').
Go ahead and compare the resources used in ex1c_o3.smk (default resources) and ex1c_o5.smk (customised resources) for a given job. You should see that the jobs run in objective 5 use resources more efficiently than those in objective 3. In particular, you can use the MemEff, CPUEff and TimeEff keywords in reportseff’s formatting for this, as sketched below.
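For instance, to compare two specific jobs by their external jobids (the ids below are taken from the logs above; substitute your own):
module load reportseff
reportseff --format "JobID,JobName,State,Elapsed,CPUEff,MemEff,TimeEff" 748792 748821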
We also learnt in the previous parts of this exercise how to run jobs in parallel locally with --cores, how to submit each job to Slurm with the cluster-generic executor, how to load software automatically with the envmodules directive, and how to store routine options in a profile.
Of note, Snakemake also supports conda/mamba environments and containers (Docker, Singularity and Apptainer) in the same way as it supports module environments, as sketched below.
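For instance, a rule can point to a conda environment file instead of a module (a minimal sketch, assuming a hypothetical envs/fastqc.yaml environment file that you would write yourself); you would then run Snakemake with --software-deployment-method conda instead of env-modules:
rule fastqc:
    input:
        "Data/{sample}.fastq.gz"
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html"
    conda:
        "envs/fastqc.yaml"    # hypothetical file listing fastqc and its version
    shell: "fastqc --outdir FastQC {input}"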