14/10/2024
ssh username@core.cluster.france-bioinformatique.fr
To make working on the cluster easier, an OnDemand instance is available. This way, you can access the cluster, modify your files, run your scripts, see your results, etc., from a simple web browser.
Launch
The launcher allows you to start a Terminal that can be used for the rest of this course.
To view your files in your file manager and modify them directly with any local text editor, you can also connect via SFTP.
Please see the instructions for - Windows - Mac - Linux.
Be careful
Never use a word processor (like Microsoft Word or LibreOffice Writer) to modify your code, and never copy/paste code to/from such software. Use only text editors and UTF-8 encoding.
Security warning
Never leave your computer unattended with your session open and the HPC server connected.
Where are you on the cluster?
pwd
Then explore the /shared
folder
tree -L 1 /shared
The /shared/bank folder contains commonly used data and resources. Explore it by yourself with commands like ls or cd.
Can you see the first 10 lines of the mm10.fa
file?
(mm10.fa = mouse genomic sequence version 10)
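One possible way to answer (the exact location of mm10.fa under /shared/bank is an assumption here; locate the file first, then use head, which shows 10 lines by default):
find /shared/bank -maxdepth 4 -name "mm10.fa" 2>/dev/null   # locate the file
head /shared/bank/mus_musculus/mm10/fasta/mm10.fa           # assumed path; prints the first 10 lines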
There is a 2417_wf4bioinfo project accessible to you: navigate to this folder and list what is inside.
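One possible way to navigate there and list its contents:
cd /shared/projects/2417_wf4bioinfo/
ls -lh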
Then go to one of your projects and create a folder named 2417_wf4bioinfo. This is where you will do all the exercises. If you don't have a project, you can create a folder named with your login in the 2417_wf4bioinfo folder and work there.
cd /shared/projects/2417_wf4bioinfo/
mkdir -p $USER/day1-slurm_module
cd $USER/day1-slurm_module
sinfo
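sinfo lists the partitions and the state of their nodes. To restrict it to a single partition, for example the fast partition used later in this course, you can use -p:
sinfo -p fast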
sbatch allows you to send an executable file to be run on a computation node.
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
echo ??
Exo 1: Starting from this minimal example, make a script named flatter.sh that prints “What a nice training !”
Then run the script:
sbatch flatter.sh
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
echo "What a nice training !"
The output that would normally appear on your screen has been redirected to slurm-xxxxx.out, but this name can be changed using SBATCH options.
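To read it from the terminal, you can simply display the file (replace xxxxx by your actual job number):
cat slurm-xxxxx.out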
Exo 2: Modify flatter.sh to change the Slurm output file name.
Hint: use #SBATCH --output (-o for short), then run it.
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --output flatter.out
echo "What a nice training !"
Exo 3: Using sbatch, run the command hostname so that the sbatch output file is called hostname.out.
What is the output? How does it differ from typing hostname directly in the terminal, and why?
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --output hostname.out
# -- COMMANDS --
hostname
Options | Flag | Function
---|---|---
--account | -A | account to run the job
--partition | -p | partition to run the job
--job-name | -J | give your job a name
--output | -o | output file name
--error | -e | error file name
--chdir | -D | set the working directory before running
--time | -t | limit the total run time (fast partition: 24h)
--mem | | memory that your job will have access to (per node)
To find out more, see the Slurm manual (man sbatch) or https://slurm.schedmd.com/sbatch.html.
The sleep command does nothing (i.e. waits) for the given number of seconds.
Exo 4: Starting from your previous sbatch script, launch a simple job that runs sleep 600.
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --job-name=sleep
#SBATCH --output %x-%j.out
# -- COMMANDS --
sleep 600
In your terminal, type
squeue
The ST column gives the status of the job:
R = Running
PD = Pending
To see the jobs on the fast partition:
squeue -p fast
To see only the jobs of a given user (here untel):
squeue -u untel
To see only your jobs:
squeue --me
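If you want squeue to refresh its output periodically, it also has an --iterate (-i) option:
squeue --me -i 5   # refresh every 5 seconds, Ctrl-C to stop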
To cancel a job that you started, use the scancel command followed by the jobID (the number given by Slurm, visible in squeue):
scancel jobID
You can stop the previous sleep
job with this
command.
To cancel all your jobs at once, use --me
.
scancel --me
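You can also cancel jobs by name with scancel's --name option, for instance the sleep job launched above:
scancel --name=sleep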
Exo 5: What is the line
#SBATCH --output %x-%j.out
doing?
This option controls the name of the Slurm output file: %x is replaced by the job name (sleep here) and %j by the jobID (e.g. 41994442).
sbatch allows for a filename pattern to contain one or more
replacement symbols, which are a percent sign %
followed by
a letter.
Replacement symbols | Function
---|---
%j | jobid of the running job
%J | jobid.stepid of the running job (e.g. 128.0)
%u | User name
%x | Job name
Find out more in the Slurm documentation.
Re-run sleep.sh
and type
sacct
You can pass the option --format to list the information that you want to display, including memory usage, run time, etc.
For instance
sacct --format=JobID,JobName,Start,Elapsed,CPUTime,NCPUS,NodeList,MaxRSS,ReqMeM,State
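To restrict the report to a single job, sacct also accepts a job ID with -j (the job number below is only an illustration):
sacct -j 41994442 --format=JobID,JobName,Elapsed,MaxRSS,State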
To see all available fields, run sacct --helpformat. We'll see another useful command to monitor your jobs right after the module introduction!
A lot of tools are installed on the cluster. To list them, use one of the following commands.
module available
module avail
module av
You can limit the search to a specific tool; for example, look for the different versions of multiqc on the cluster using module av multiqc.
[mhennion @ clust-slurm-client 11:17]$ day1-slurm_module : module av multiqc
----------------- /shared/software/modulefiles ------------------------
multiqc/1.3 multiqc/1.6 multiqc/1.7 multiqc/1.9 multiqc/1.11 multiqc/1.12 multiqc/1.13
You can specify a version of the tool.
module load tool/1.3
You can load several tools at once.
module load tool1 tool2 tool3
Note that the order of the tools can matter: if several tools need Python, for instance, the Python version used will be the one of the last tool loaded. To avoid conflicts, you can load the first tool, use it, then unload it (see the sketch below) and load the next one.
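For instance, a sketch of that load/use/unload pattern (tool1 and tool2 are placeholder module names, as below):
module load tool1
# ... commands that need tool1 ...
module unload tool1
module load tool2
# ... commands that need tool2 ...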
module list # list the currently loaded modules
module unload tool1 # unload one module
module purge # unload all modules
After the run, the reportseff command allows you to access information about the efficiency of one or several jobs.
Exo 6: Load the module reportseff and check the resource usage of previous jobs.
module load reportseff
reportseff .
reportseff JobID
reportseff --format "+Start,CPUTime,NCPUS,NodeList,MaxRSS,ReqMeM" --modified-sort
Module best practice: load your modules within your sbatch file for consistency.
Exo 7: Run an alignment using STAR version 2.7.5a.
Starting from the following script, write a sbatch script to align reads.
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --job-name=Alignment
#SBATCH --output=star-alignment-%j.out
#SBATCH --error=star-alignment-%j.err
#SBATCH ?? # increase memory to 30G
# -- MODULES --
module purge
module load ?? # find appropriate STAR module (2.7.5a)
# -- VARIABLES --
pathToIndex= ?? # look for the path of the index for homo sapiens (hg38) made for STAR
pathToFastq1= ?? # look in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq to get the path to the R1 fastq.gz file
pathToFastq2= ?? # look in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq to get the path to the R2 fastq.gz file
outputFileName= ?? # choose your output file name
# -- COMMANDS --
STAR --genomeDir $pathToIndex \
--readFilesIn $pathToFastq1 $pathToFastq2 \
--outFileNamePrefix $outputFileName \
--readFilesCommand zcat
The test FASTQ files are in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq. Check the resources that were used using reportseff.
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --job-name=Alignment
#SBATCH --output=star-alignment-%j.out
#SBATCH --error=star-alignment-%j.err
#SBATCH --mem=30G # increase memory to 30G
# -- MODULES --
module purge
module load star/2.7.5a # find appropriate STAR module (2.7.5a)
# -- VARIABLES --
pathToIndex=/shared/bank/homo_sapiens/hg38/star-2.7.5a # look for the path of the index for homo sapiens (hg38) made for STAR
pathToFastq1=/shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq/D192red_2M_R1.fastq.gz # look in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq to get the path to the R1 fastq.gz file
pathToFastq2=/shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq/D192red_2M_R2.fastq.gz # look in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq to get the path to the R2 fastq.gz file
outputFileName=STAR_results/D192red # choose your output file name
# -- COMMANDS --
STAR --genomeDir $pathToIndex \
--readFilesIn $pathToFastq1 $pathToFastq2 \
--outFileNamePrefix $outputFileName \
--readFilesCommand zcat
Options | Default | Function
---|---|---
--nodes | 1 | Number of nodes required (or min-max)
--nodelist | | Select one or several nodes
--ntasks-per-node | 1 | Number of tasks invoked on each node
--mem | 2GB | Memory required per node
--cpus-per-task | 1 | Number of CPUs allocated to each task
--mem-per-cpu | 2GB | Memory required per allocated CPU
--array | | Submit multiple jobs to be executed with identical parameters
Some tools allow multi-threading, i.e. the use of several CPUs to accelerate one task. This is the case for STAR, with its --runThreadN option.
Exo 8: Modify the previous sbatch file to use 4 threads to align the FASTQ files on the reference. Run and check time and memory usage.
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --job-name=Alignment
#SBATCH --output=star-alignment-%j.out
#SBATCH --error=star-alignment-%j.err
#SBATCH --mem=30G # increase memory to 30G
#SBATCH --cpus-per-task=4
# -- MODULES --
module purge
module load star/2.7.5a # find appropriate STAR module (2.7.5a)
# -- VARIABLES --
pathToIndex=/shared/bank/homo_sapiens/hg38/star-2.7.5a # look for the path of the index for homo sapiens (hg38) made for STAR
pathToFastq1=/shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq/D192red_2M_R1.fastq.gz # look in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq to get the path to the R1 fastq.gz file
pathToFastq2=/shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq/D192red_2M_R2.fastq.gz # look in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq to get the path to the R2 fastq.gz file
outputFileName=STAR_results/D192red # choose your output file name
# -- COMMANDS --
STAR --genomeDir $pathToIndex \
--readFilesIn $pathToFastq1 $pathToFastq2 \
--outFileNamePrefix $outputFileName \
--readFilesCommand zcat \
--runThreadN 4
To save resources, we have generated a reduced genome; you can find it at /shared/projects/2417_wf4bioinfo/Slurm-training/star-2.7.5a_hg38_chr22.
Modify your script to use this index. You can now reduce the RAM to 3 GB.
The Slurm controller will set some variables in the environment of
the batch script. They can be very useful. For instance, you can improve
the previous script using $SLURM_CPUS_PER_TASK
.
Exo 9: Modify the previous sbatch file to use the reduced
index and $SLURM_CPUS_PER_TASK
.
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --job-name=Alignment
#SBATCH --output=star-alignment-%j.out
#SBATCH --error=star-alignment-%j.err
#SBATCH --mem=3G # reduce memory to 3G
#SBATCH --cpus-per-task=4
# -- MODULES --
module purge
module load star/2.7.5a # find appropriate STAR module (2.7.5a)
# -- VARIABLES --
pathToIndex=/shared/projects/2417_wf4bioinfo/Slurm-training/star-2.7.5a_hg38_chr22 # Use the index of hg38 chr22 only
pathToFastq1=/shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq/D192red_2M_R1.fastq.gz # look in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq to get the path to the R1 fastq.gz file
pathToFastq2=/shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq/D192red_2M_R2.fastq.gz # look in /shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq to get the path to the R2 fastq.gz file
outputFileName=STAR_results/D192red # choose your output file name
# -- COMMANDS --
STAR --genomeDir $pathToIndex \
--readFilesIn $pathToFastq1 $pathToFastq2 \
--outFileNamePrefix $outputFileName \
--readFilesCommand zcat \
--runThreadN $SLURM_CPUS_PER_TASK
The full list of variables is given in the Slurm documentation.
Some useful ones (a small script printing them is sketched after this list):
- $SLURM_CPUS_PER_TASK
- $SLURM_JOB_ID
- $SLURM_JOB_ACCOUNT
- $SLURM_JOB_NAME
- $SLURM_JOB_PARTITION
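As a quick check, here is a minimal sketch of an sbatch script that simply prints some of these variables (it reuses the account and output-naming conventions from this course; the job name and CPU count are arbitrary choices):
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --job-name=ShowVars
#SBATCH --output=%x-%j.out
#SBATCH --cpus-per-task=2
# -- COMMANDS --
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) on account $SLURM_JOB_ACCOUNT"
echo "Partition: $SLURM_JOB_PARTITION - CPUs per task: $SLURM_CPUS_PER_TASK"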
Of note, Bash shell variables can also be used in the sbatch
script:
- $USER
- $HOME
- $HOSTNAME
- $PWD
- $PATH
Job arrays allow you to run the same job many times (same executable, same resources), for example on different files. If you add the following line to your script, the job will be launched 6 times (simultaneously), with the variable $SLURM_ARRAY_TASK_ID taking values 0 to 5.
#SBATCH --array=0-5
Exo 10: Starting from the following draft, make a simple script launching 6 jobs in parallel.
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --array= ?? # to adjust to select the samples to process (here all 6)
#SBATCH --output=HelloArray_%A_%a.out # "%A" will be replaced by the job ID and "%a" by the task number
#SBATCH --job-name=ArrayExample
# -- VARIABLES --
SAMPLE_LIST=(S01 S02 S03 S04 S05 S06)
SAMPLE=${SAMPLE_LIST[$SLURM_ARRAY_TASK_ID]} # take the nth element of the list, n being the task number
# -- COMMANDS --
echo "Hello I am the task number $SLURM_ARRAY_TASK_ID from the job array $?." # $? Look for the Slurm variable for the job ID.
sleep 20
echo "And I will process sample $SAMPLE."
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --array=0-5 # to adjust to the number of samples (here all 6)
#SBATCH --output=HelloArray_%A_%a.out # "%A" will be replaced by the job ID and "%a" by the task number
#SBATCH --job-name=ArrayExample
# -- VARIABLES --
SAMPLE_LIST=(S01 S02 S03 S04 S05 S06)
SAMPLE=${SAMPLE_LIST[$SLURM_ARRAY_TASK_ID]} # take the nth element of the list, n being the task number
# -- COMMANDS --
echo "Hello I am the task number $SLURM_ARRAY_TASK_ID from the job array $SLURM_ARRAY_JOB_ID." # $? Look for the Slurm variable for the job ID.
sleep 20
echo "And I will process sample $SAMPLE."
It is possible to limit the number of jobs running at the same time using %max_running_jobs in the #SBATCH --array option.
Exo 11: Modify your script to run only 2 jobs at a time.
You will see with the squeue command that some of the tasks are pending until the others finish.
[user @ clust-slurm-client 11:28]$ star : squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
42162738_[2-5%2] fast ArrayExa mhennion PD 0:00 1 (JobArrayTaskLimit)
42162738_0 fast ArrayExa mhennion R 0:03 1 cpu-node-61
42162738_1 fast ArrayExa mhennion R 0:03 1 cpu-node-61
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --array=0-5%2 # Run maximum 2 tasks at the time with "%2"
#SBATCH --output=HelloArray_%A_%a.out # "%A" will be replaced by the job ID and "%a" by the task number
#SBATCH --job-name=ArrayExample
# -- VARIABLES --
SAMPLE_LIST=(S01 S02 S03 S04 S05 S06)
SAMPLE=${SAMPLE_LIST[$SLURM_ARRAY_TASK_ID]} # take the nth element of the list, n being the task number
# -- COMMANDS --
echo "Hello I am the task number $SLURM_ARRAY_TASK_ID from the job array $SLURM_ARRAY_JOB_ID." # $? Look for the Slurm variable for the job ID.
sleep 20
echo "And I will process sample $SAMPLE."
#!/bin/bash
# -- SBATCH OPTIONS --
#SBATCH --account 2417_wf4bioinfo
#SBATCH --array=0-7 # if 8 files to process
FASTQFOLDER=/shared/projects/2417_wf4bioinfo/Slurm-training/test_fastq
cd $FASTQFOLDER
FQ=(*fastq.gz) #Create a bash array
echo ${FQ[@]} # Echo the array contents
INPUT=$(basename -s .fastq.gz "${FQ[$SLURM_ARRAY_TASK_ID]}") # Each element of the array is indexed (from 0 to n-1) by Slurm
echo $INPUT # Echo the simplified name of the fastq file
You can alternatively use ls or find to identify the files to process and get the nth one with sed (or awk, as sketched below).
#SBATCH --array=1-4 # If 4 files, as sed line numbers start at 1
INPUT=$(ls $FASTQFOLDER/*.fastq.gz | sed -n ${SLURM_ARRAY_TASK_ID}p)
echo $INPUT
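An equivalent with awk (a sketch; like sed, awk numbers lines starting at 1):
INPUT=$(ls $FASTQFOLDER/*.fastq.gz | awk -v n=$SLURM_ARRAY_TASK_ID 'NR==n')
echo $INPUT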
A few things to keep in mind when using job arrays:
- Make sure each task writes to its own output file (%a or %J in the name will do the trick).
- Add %50 (for example) at the end of your indexes to limit the number of tasks (here to 50) running at the same time. The 51st will start as soon as one finishes!
- The requested resources apply to each task: #SBATCH --mem=25G is for each task.
- To find out more, read the SLURM manual (man sbatch) or https://slurm.schedmd.com/sbatch.html
Ask for help or signal problems on the cluster : https://community.france-bioinformatique.fr
IFB cluster documentation: https://ifb-elixirfr.gitlab.io/cluster/doc/