Practical session
There are two ways to connect to the IFB cluster: SSH and Open OnDemand.
SSH
Replace mylogin by your IFB cluster login (available in your IFB account):
ssh -X -o ServerAliveInterval=60 -l mylogin core.cluster.france-bioinformatique.fr
-X: enable X11 forwarding (for GUI applications)
-o ServerAliveInterval=60: avoid timeouts by sending a message to the SSH server every 60 seconds
-l mylogin: define the account to use
You can find more information on the IFB website: https://my.cluster.france-bioinformatique.fr/manager2/login
Open OnDemand
You can find more information on the IFB website: https://ifb-elixirfr.gitlab.io/cluster/doc/software/openondemand/
Exercise 1:
Connect to the IFB cluster using SSH or Open on demand.
srun --account=2521_wf4bioinfo --pty bash
module load nextflow/25.04.7
module load fastqc/0.12.1
module load multiqc/1.13
nextflow -v
mkdir /shared/projects/2521_wf4bioinfo/participants/$USER
cd /shared/projects/2521_wf4bioinfo/participants/$USER
cp -rp /shared/projects/2521_wf4bioinfo/course-material/atelier-nextflow/day2/tp/tp_partie_1 tp
cp -rp /shared/projects/2521_wf4bioinfo/course-material/atelier-nextflow/day2/tp/tp_partie_2/recipes tp/
cd tp
bash fastqc.sh ./data/*_R{1,2}.fastq.gz
bash multiqc.sh
Exercise 2:
Load the Nextflow, FastQC and MultiQC modules and execute the fastqc.sh and multiqc.sh scripts on the IFB cluster.
Now you are ready to start.
The goal of this practical session is to build a Nextflow pipeline for FASTQ quality control. To do this, we will use the tools FastQC and MultiQC. First, we are going to build a minimal workflow that involves inputs, processes and outputs.
Exercise 3:
Create these subdirectories and empty files:
mkdir conf modules
touch main.nf nextflow.config conf/base.config modules/fastqc.nf modules/multiqc.nf
Now, we need to write the main script: main.nf. This script needs two parts:
1. The workflow skeleton.
📖 Workflow skeleton in the Nextflow documentation
// Insert here include declarations
workflow Quality_Checker { // The name is used for sub-workflows
[take]: // Optional; used for sub-workflows
[main]: // Processes
[emit]: // Optional; used for sub-workflows
}
2. The Nextflow modules to include. In this practical session, there are two modules: fastqc.nf and multiqc.nf.
📖 Include declarations in the Nextflow documentation
include { FASTQC } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'Exercice 4: Create a main.nf script with workflow skeleton and module imports.
include { FASTQC } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'
workflow Quality_Checker {
main:
FASTQC()
MULTIQC()
}
workflow { // Entry workflow
Quality_Checker()
}
In this part, we will write the first module of the workflow: modules/fastqc.nf.
//include
process FASTQC {
[directive]
[input]:
[output]:
[script|exec]:
}
process FASTQC {
input:
tuple val(sampleId), path(fastqs)
output:
path(result_fastqc)
script:
"""
...
"""
}
Exercise 5: Create a modules/fastqc.nf script with a FASTQC process that executes the same command line as the fastqc.sh shell script.
💡 Note 1: The input of the FASTQC process will be created using the Channel.fromFilePairs method, so a typical input will look like:
[DRUP01_SUB2, [./data/DRUP01_SUB2_R1.fastq.gz, ./data/DRUP01_SUB2_R2.fastq.gz]]
[HG002_SUB1, [./data/HG002_SUB1_R1.fastq.gz, ./data/HG002_SUB1_R2.fastq.gz]]
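As a minimal sketch (assuming the FASTQ files sit under ./data/ as in the exercise), these tuples can be produced and inspected like this:

```groovy
// Build [sampleId, [R1, R2]] tuples from paired FASTQ files and print each one
fastqs = Channel.fromFilePairs('./data/*_R{1,2}.fastq.gz')
fastqs.view()
```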
💡 Note 2: In the script section of a Nextflow process, shell variables starting with the "$" character (e.g. $CWD) must be prefixed with "\". Otherwise, they will be interpreted as Nextflow variables.
process FASTQC {
input:
tuple val(sampleId), path(fastqs)
output:
tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results // emit: exposes this output under a name usable in the workflow file
path("versions.txt") , emit: versions
script:
"""
echo \$(fastqc --version) > versions.txt # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is passed through to the Bash script
fastqc -q --threads 1 ${fastqs} # Here ${fastqs} is a Nextflow variable that refers to the input of this process
"""
}
Now, we will write the second module of the workflow: modules/multiqc.nf.
Exercise 6: Create a modules/multiqc.nf script with a MULTIQC process that executes the same command lines as the multiqc.sh shell script.
process MULTIQC {
input:
path(multiqc_config)
path(multiqc_inputs) // The outputs of FastQC are passed as inputs but are not referenced explicitly in the script:
// Nextflow stages (links) them into the task's working directory,
// so MultiQC simply scans the current directory and finds the FastQC results there.
output:
path "*multiqc_report.html", emit: report
path "*_data" , emit: data
path "*_plots" , optional:true, emit: plots
path "versions.txt" , emit: versions
script:
"""
multiqc --version > versions.txt
multiqc --force $multiqc_config $multiqc_inputs
"""
}
We now have the two modules of the workflow.
In order to use the workflow, we need inputs: the FASTQ files. To do this, we must modify the main.nf file and add an input channel.
The Nextflow documentation explains how to import paired files: https://nextflow.io/docs/latest/reference/channel.html#fromfilepairs
💡 Note: You can show the content of a channel on the terminal using the view() operator. It is very useful to inspect the inputs and outputs of the processes. Example:
fastqs.view()
Exercise 7: Add an input channel for the FASTQ files to your workflow and connect the FASTQC and MULTIQC processes.
💡 Note: To ensure that the MULTIQC process is executed only once, you need to use some channel operators (collect() and maybe map()). See the Nextflow documentation for more information about the operators.
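A minimal sketch of what these operators do (the file names are hypothetical): map{} transforms each item, and collect() gathers all emissions into a single list, so a process receiving the collected channel runs only once.

```groovy
// map{} keeps only the file part of each [sampleId, file] tuple;
// collect() turns the N emissions into one emission containing a list
Channel.of(['s1', 'a_fastqc.zip'], ['s2', 'b_fastqc.zip'])
    .map { it -> it[1] }
    .collect()
    .view()   // emits [a_fastqc.zip, b_fastqc.zip] once
```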
workflow Quality_Checker {
input_ch = Channel.fromFilePairs("./data/*_R{1,2}.fastq.gz")
}
process test {
output:
path("*.*"), emit: toto
script:
"""
echo test > log.txt
"""
}
workflow {
test()
ch_res = test.out.toto
}
include { FASTQC } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'
workflow Quality_Checker {
main:
fastqs = Channel.fromFilePairs("./data/*_R{1,2}.fastq.gz")
multiqc_config = Channel.fromPath('./assets/multiqc_config.yml')
FASTQC(fastqs)
MULTIQC(multiqc_config, FASTQC.out.results.map{it -> it[1]}.collect())
}
workflow { // Entry workflow
Quality_Checker()
}
You can now run your new pipeline with the following command line:
nextflow run main.nf
Exercise 8: Launch the Nextflow workflow with the previous command line. Check that the MULTIQC process has been executed only once.
You will obtain output like this in a few seconds:
N E X T F L O W ~ version 25.04.7
Launching `main.nf` [thirsty_almeida] DSL2 - revision: 03f8331e18
executor > local (3)
[1f/1c54fa] Quality_Checker:FASTQC (1) [100%] 2 of 2 ✔
[b0/51040a] Quality_Checker:MULTIQC (1) [100%] 1 of 1 ✔
If you want to access the working directory of the MultiQC process, for example, you can do this:
cd work/b0/
ls
You will find a directory whose name starts with 51040a, e.g. 51040a19b731196497167817770a4a.
Exercise 9: Relaunch the Nextflow workflow with the -resume option. What are the differences?
💡 Note: With the nextflow clean -f command, you can delete old work directories.
In Nextflow, the stub section of a process allows you to simulate the execution without actually running the script. Instead, it creates empty files that represent the expected outputs of the process. This is useful for testing the pipeline's structure and flow without waiting for each step to complete. It's especially helpful during development, such as when working on a laptop or when the process involves long runtimes, unless you're using a small dataset.
We can do this for the MULTIQC process. We know that:
If a "stub" section exists in a process definition, Nextflow will create fake files and folders when launched with the -stub option on the command line. As an example, you can simulate the execution of your workflow with the following command:
nextflow run main.nf -stub
Exercise 10: Add a stub: section to your MULTIQC process definition.
process MULTIQC {
input:[...]
output:
path "*multiqc_report.html", emit: report
path "*_data" , emit: data
path "*_plots" , optional:true, emit: plots
path "versions.txt" , emit: versions
script: [...]
stub:
"""
multiqc --version > versions.txt
touch "multiqc_report.html"
mkdir multiqc_data
"""
}
We now have a functional workflow, but the inputs are hard-coded.
params.fastq_input is a Nextflow variable that is set with the "--" flag on the command line, for example:
nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz'
💡 Note: In the previous command line, the FASTQ file path is quoted to avoid shell expansion. With quotes, the glob pattern is interpreted by Nextflow instead of Bash.
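To see the difference, here is a small demonstration with echo and throwaway files (the demo_data/ path and file names are made up for illustration; Bash expands unquoted braces and globs the same way):

```shell
# Create two dummy FASTQ files for the demonstration
mkdir -p demo_data
touch demo_data/sample_R1.fastq.gz demo_data/sample_R2.fastq.gz
# Unquoted: the shell expands the pattern before Nextflow ever sees it
echo demo_data/sample_R*.fastq.gz
# Quoted: the pattern reaches Nextflow verbatim, and Nextflow expands it itself
echo 'demo_data/sample_R*.fastq.gz'
```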
Exercise 11: Use params.fastq_input to implement a more flexible workflow.
include { FASTQC } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'
workflow Quality_Checker {
take:
qc_input
main:
fastqs = Channel.fromFilePairs(qc_input)
multiqc_config = Channel.fromPath('./assets/multiqc_config.yml')
FASTQC(fastqs)
MULTIQC(multiqc_config, FASTQC.out.results.map{it -> it[1]}.collect())
}
workflow { // Entry workflow
Quality_Checker(params.fastq_input)
}
You have a flexible workflow, but its results are still buried in the work directories.
To choose where your outputs are published, you can use the publishDir directive: 📖 https://nextflow.io/docs/edge/reference/process.html#publishdir
Exercise 12: Try to implement this functionality for the FASTQC and MULTIQC processes.
process test {
publishDir "${params.outDir}/", mode: 'copy', saveAs: { filename ->
if (filename == "log.txt") return "test_$filename"
else return null }
output:
path("*.*"), emit: toto
script:
"""
echo test > log.txt
"""
}
Here the outputs are saved in params.outDir.
nextflow run main.nf --outDir ./results
process FASTQC {
publishDir "${params.outDir}/$sampleId", mode: 'copy'
input:
tuple val(sampleId), path(fastqs)
output:
tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results // emit: exposes this output under a name usable in the workflow file
path("versions.txt") , emit: versions
script:
"""
echo \$(fastqc --version) > versions.txt # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is passed through to the Bash script
fastqc -q --threads 1 ${fastqs} # Here ${fastqs} is a Nextflow variable that refers to the input of this process
"""
}
process MULTIQC {
publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
filename ->
if (filename.contains(".html")) return 'multiqc_report.html'
else if (filename == "versions.txt") return filename
else return null
}
input:
path(multiqc_config)
path(results) // The outputs of FastQC are passed as inputs but are not referenced explicitly in the script:
// Nextflow stages (links) them into the task's working directory,
// so MultiQC simply scans the current directory and finds the FastQC results there.
output:
path "*multiqc_report.html", emit: report
path "*_data" , emit: data
path "*_plots" , optional:true, emit: plots
path "versions.txt" , emit: versions
script:
"""
multiqc --version > versions.txt
multiqc --force $multiqc_config .
"""
stub:
"""
multiqc --version > versions.txt
touch "multiqc_report.html"
mkdir multiqc_data
"""
}
nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz' --outDir ./results
You can store default values like outDir in the nextflow.config file. 📖 https://nextflow.io/docs/edge/config.html
Exercise 13: Define the outDir variable in nextflow.config to publish outputs in the results subdirectory.
includeConfig './conf/base.config'
params {
fastq_input = null
outDir = "$projectDir/results"
}
You can use labels in your processes to share configuration. For example, you may have multiple processes where some need few resources and others need many. You can define two labels:
process {
withLabel: little_prog {
cpus = 1
memory = 4.GB
}
withLabel: big_prog {
cpus = 4
memory = 8.GB
}
}
process FASTQC {
label "big_prog"
[...]
}
process MULTIQC {
label 'little_prog'
[...]
}
Exercise 14: Add the 'big_prog' and 'little_prog' labels to the FASTQC and MULTIQC processes and define their resources in the configuration.
process FASTQC {
label 'big_prog'
publishDir "${params.outDir}/$sampleId", mode: 'copy'
input:
tuple val(sampleId), path(fastqs)
output:
tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results // emit: exposes this output under a name usable in the workflow file
path("versions.txt") , emit: versions
script:
"""
echo \$(fastqc --version) > versions.txt # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is passed through to the Bash script
fastqc -q --threads 1 ${fastqs} # Here ${fastqs} is a Nextflow variable that refers to the input of this process
"""
}
process MULTIQC {
label 'little_prog'
publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
filename ->
if (filename.contains(".html")) return 'multiqc_report.html'
else if (filename == "versions.txt") return filename
else return null
}
input:
path(multiqc_config)
path(results) // The outputs of FastQC are passed as inputs but are not referenced explicitly in the script:
// Nextflow stages (links) them into the task's working directory,
// so MultiQC simply scans the current directory and finds the FastQC results there.
output:
path "*multiqc_report.html", emit: report
path "*_data" , emit: data
path "*_plots" , optional:true, emit: plots
path "versions.txt" , emit: versions
script:
"""
multiqc --version > versions.txt
multiqc --force $multiqc_config .
"""
stub:
"""
multiqc --version > versions.txt
touch "multiqc_report.html"
mkdir multiqc_data
"""
}
params {
fastq_input = null
outDir = "$projectDir/results"
}
process {
withLabel: little_prog {
cpus = 1
memory = 4.GB
}
withLabel: big_prog {
cpus = 16
memory = 64.GB
}
}
nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz'
Profiles are configurations for different environments. For instance, depending on whether you work on a cluster or on your laptop, the location of your annotations, your memory configuration and the available CPUs may differ.
You can split the configuration into different files with the includeConfig function. 📖 https://nextflow.io/docs/edge/config.html#config-profiles
Your cluster can run one of several schedulers, such as SLURM or PBS. Nextflow handles the scheduler-specific configuration for developers, but you must specify which executor to use. We will create a cluster profile in nextflow.config. 📖 https://nextflow.io/docs/latest/executor.html#slurm
profiles {
cluster {
includeConfig './conf/cluster.config'
}
}
process {
executor = 'slurm'
queue = 'fast'
}
You can override variables in the cluster config file, for example to use more CPUs.
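As a sketch, conf/cluster.config could look like this (the executor and queue come from the example above; the overridden resources for big_prog are illustrative values to adapt to your cluster):

```groovy
// conf/cluster.config: submit every task to SLURM on the 'fast' partition
process {
    executor = 'slurm'
    queue = 'fast'
    // Override the label-based resources defined in the base configuration
    withLabel: big_prog {
        cpus = 16
        memory = 64.GB
    }
}
```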
Exercise 15: Add the "cluster" profile to your nextflow.config file and create the conf/cluster.config configuration file. Execute the workflow using a SLURM submission.
nextflow run main.nf -profile cluster --fastq_input './data/*_R{1,2}.fastq.gz'
You can use Conda environments in your Nextflow processes:
📖 https://nextflow.io/docs/latest/conda.html
📖 https://nextflow.io/docs/latest/reference/config.html#conda
Following the best practices of the documentation, it is recommended to create a conda profile in a configuration file. As in the cluster part, we will declare it in the Nextflow configuration file:
profiles {
conda {
includeConfig './conf/conda.config'
}
}
conda {
cacheDir = "$projectDir/conda-cache-nextflow" // specify where the Conda environments are created
createTimeout = "1 h"
enabled = true
}
process {
withLabel:fastqc{ conda = "$projectDir/recipes/conda/fastqc.yml"}
withLabel:multiqc{ conda = "$projectDir/recipes/conda/multiqc.yml"}
}
As we will execute FastQC and MultiQC using Conda, we can now unload the FastQC and MultiQC modules from your PATH:
module unload fastqc multiqc
Exercise 16: Add the "conda" profile to your nextflow.config file, create the conf/conda.config configuration file, and add the fastqc and multiqc labels to the processes.
process FASTQC {
label 'fastqc'
label 'big_prog'
publishDir "${params.outDir}/$sampleId", mode: 'copy'
input:
tuple val(sampleId), path(fastqs)
output:
tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results // emit: exposes this output under a name usable in the workflow file
path("versions.txt") , emit: versions
script:
"""
echo \$(fastqc --version) > versions.txt # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is passed through to the Bash script
fastqc -q --threads 1 ${fastqs} # Here ${fastqs} is a Nextflow variable that refers to the input of this process
"""
}
process MULTIQC {
label 'multiqc'
label 'little_prog'
publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
filename ->
if (filename.contains(".html")) return 'multiqc_report.html'
else if (filename == "versions.txt") return filename
else return null
}
input:
path(multiqc_config)
path(results) // The outputs of FastQC are passed as inputs but are not referenced explicitly in the script:
// Nextflow stages (links) them into the task's working directory,
// so MultiQC simply scans the current directory and finds the FastQC results there.
output:
path "*multiqc_report.html", emit: report
path "*_data" , emit: data
path "*_plots" , optional:true, emit: plots
path "versions.txt" , emit: versions
script:
"""
multiqc --version > versions.txt
multiqc --force $multiqc_config .
"""
stub:
"""
multiqc --version > versions.txt
touch "multiqc_report.html"
mkdir multiqc_data
"""
}
nextflow run main.nf -profile cluster,conda --fastq_input './data/*_R{1,2}.fastq.gz'
You can use Singularity containers in your Nextflow processes:
📖 https://nextflow.io/docs/latest/container.html#singularity
📖 https://nextflow.io/docs/latest/reference/config.html#config-singularity
Following the best practices of the documentation, it is recommended to create a singularity profile in a configuration file. As in the cluster part, we will declare it in the Nextflow configuration file:
profiles {
singularity {
includeConfig './conf/singularity.config'
}
}
singularity {
enabled = true
autoMounts = true
runOptions = '--containall'
}
process {
withLabel:fastqc{ container = "$projectDir/recipes/singularity/fastqc.sif"}
withLabel:multiqc{ container = "$projectDir/recipes/singularity/multiqc.sif"}
}
Exercise 17: Add the "singularity" profile to your nextflow.config file and create the conf/singularity.config configuration file. Execute the workflow using SLURM submissions and Singularity to run FastQC and MultiQC.
nextflow run main.nf -profile cluster,singularity --fastq_input './data/*_R{1,2}.fastq.gz'
You can enable execution metrics for your workflow with Nextflow. To turn this on, use the command line or set it in the Nextflow configuration file.
// From nf-core
timeline {
enabled = true
overwrite = true
file = "${params.summaryDir}/trace/timeline.html"
}
report {
enabled = true
overwrite = true
file = "${params.summaryDir}/trace/report.html"
}
trace {
enabled = true
raw = true
overwrite = true
fields = 'process,task_id,hash,native_id,module,container,tag,name,\
status,exit,submit,start,complete,duration,realtime,%cpu,%mem,rss,vmem\
,peak_rss,peak_vmem,rchar,wchar,syscr,syscw,read_bytes,write_bytes,attempt,\
workdir,scratch,queue,cpus,memory,disk,time,env'
file = "${params.summaryDir}/trace/trace.txt"
}
dag {
enabled = true
overwrite = true
file = "${params.summaryDir}/trace/DAG.pdf"
}
Graphviz is required to render the execution DAG as an image, so we need to load the module before relaunching the workflow:
module load graphviz
Exercise 18: Add the previous example of metrics configuration to your nextflow.config, and relaunch your workflow.
nextflow run main.nf -profile cluster,singularity --fastq_input './data/*_R{1,2}.fastq.gz' --summaryDir .
To go further, you can have a look at production pipelines such as:
https://github.com/bioinfo-pf-curie/
https://github.com/bioinfo-pf-curie/raw-qc