Introduction to Nextflow

Practical session

Connecting to the IFB cluster

There are two ways to connect to the IFB cluster: SSH and Open OnDemand.

ssh -X -o ServerAliveInterval=60 -l mylogin core.cluster.france-bioinformatique.fr

The IFB cluster web portal is available at: https://my.cluster.france-bioinformatique.fr/manager2/login

More information about Open OnDemand is available in the IFB documentation: https://ifb-elixirfr.gitlab.io/cluster/doc/software/openondemand/

Exercise 1:

Connect to the IFB cluster using SSH or Open on demand.

Connecting to a compute node

  1. Open a session on a compute node:
srun --account=2521_wf4bioinfo --pty bash
  2. Load Nextflow in your environment:
module load nextflow/25.04.7
module load fastqc/0.12.1
module load multiqc/1.13
  3. Display the version of Nextflow:
nextflow -v
  4. Create and go to your working directory:
mkdir /shared/projects/2521_wf4bioinfo/participants/$USER
cd /shared/projects/2521_wf4bioinfo/participants/$USER
  5. Copy the training data into your working directory:
cp -rp /shared/projects/2521_wf4bioinfo/course-material/atelier-nextflow/day2/tp/tp_partie_1 tp
cp -rp /shared/projects/2521_wf4bioinfo/course-material/atelier-nextflow/day2/tp/tp_partie_2/recipes tp/
cd tp
  6. Test FastQC and MultiQC on the fastq.gz files:
bash fastqc.sh ./data/*_R{1,2}.fastq.gz
bash multiqc.sh

Exercise 2:

Load the Nextflow, FastQC and MultiQC modules and execute the fastqc.sh and multiqc.sh scripts on the IFB cluster.

→ Now you are ready to start

Objectives

The goal of this practical session is to build a Nextflow pipeline for FASTQ quality control. To do this we will use the tools FastQC and MultiQC. First, we will build a minimal workflow that involves inputs, processes and outputs.

The DAG (Directed Acyclic Graph) of the workflow

Input and output of FastQC and MultiQC

Workflow files

Exercise 3:

Create these subdirectories and empty files:

Click to see an exercise solution
mkdir conf modules 
touch main.nf nextflow.config conf/base.config modules/fastqc.nf modules/multiqc.nf

The Main script: main.nf

Now, we need to write the main script: main.nf. This script needs two parts:

  1. The workflow skeleton

→ 🔗 Workflow skeleton in Nextflow documentation

Click to see an example of workflow skeleton
// Insert here include declarations

workflow Quality_Checker { // Named workflows can be called as sub-workflows

    [take]: // Optional; declares inputs when used as a sub-workflow

    [main]: // Process calls

    [emit]: // Optional; declares outputs when used as a sub-workflow

}


2. The Nextflow modules to include. In this practical session, there are two modules: fastqc.nf and multiqc.nf.

→ 🔗 Include declaration in Nextflow documentation

Click to see required module includes
include { FASTQC  } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'

Exercise 4: Create a main.nf script with the workflow skeleton and module imports.

Click to see an exercise solution
include { FASTQC  } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'


workflow Quality_Checker {

    main:

        FASTQC() 
        MULTIQC()
}

workflow { // Entry workflow
    Quality_Checker()
}

Module FastQC: modules/fastqc.nf

In this part we will write the first module of the workflow: modules/fastqc.nf.

  1. We will first create the process from a skeleton
    → Process in Nextflow documentation
Click to see an example
//include

process FASTQC{ 
    [directive]

    [input]: 

    [output]:

    [script|exec]:

}
  2. We will copy the command lines from the fastqc.sh shell script into the script section of the module
  3. Set the input of the process
    → 🔗 Inputs in Nextflow documentation
  4. Set the output of the process
    → 🔗 Outputs in Nextflow documentation
❓ Click to see a hint on how to set inputs and outputs for the FASTQC process
process FASTQC{
    input:
    tuple val(sampleId), path(fastqs)

    output:
    path(result_fastqc)

    script:
    """
    ...
    """
}

Exercise 5: Create a modules/fastqc.nf script with a FASTQC process that executes the same command line as the fastqc.sh shell script.

💡 Note 1: The input of the FASTQC process will be created using the Channel.fromFilePairs method. A typical input will look like:

[DRUP01_SUB2, [./data/DRUP01_SUB2_R1.fastq.gz, ./data/DRUP01_SUB2_R2.fastq.gz]]
[HG002_SUB1, [./data/HG002_SUB1_R1.fastq.gz, ./data/HG002_SUB1_R2.fastq.gz]]

💡 Note 2: In the script sections of a Nextflow workflow, shell variables starting with the "$" character (e.g. $CWD) must be prefixed with "\". Otherwise, they will be interpreted as Nextflow variables.
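For instance, escaping can be seen in a hypothetical process (ESCAPE_DEMO is an illustration only, not part of the pipeline):

```nextflow
process ESCAPE_DEMO {
    input:
    path(fastqs)

    script:
    """
    echo "Staged in: \$PWD"      # \$PWD stays a Bash variable
    echo "Inputs:   ${fastqs}"   # ${fastqs} is interpolated by Nextflow
    """
}
```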

Click to see a solution of the exercise
process FASTQC {

  input:
  tuple val(sampleId), path(fastqs) 
  
  output:
  tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results        // emit: names this output so the workflow can reference it
  path("versions.txt")       , emit: versions
  
 
  script:
    """
    echo \$(fastqc --version) > versions.txt  # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is kept in the Bash script
    fastqc -q --threads 1 ${fastqs}           # Here ${fastqs} is a Nextflow variable that refers to the input of this process
    """
}

Module MultiQC: modules/multiqc.nf

Now, we will write the second module of the workflow: modules/multiqc.nf

Exercise 6: Create a modules/multiqc.nf script with a MULTIQC process that executes the same command lines as the multiqc.sh shell script.

Click to see a solution of the exercise
process MULTIQC {

  input:
  path(multiqc_config)
  path(multiqc_inputs) // The FastQC outputs are passed as inputs but are not referenced explicitly in the script:
                // Nextflow stages them (as symlinks) into the task work directory,
                // and MultiQC simply scans the current directory for the FastQC results.

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script:
  """
  multiqc --version > versions.txt 
  multiqc --force $multiqc_config $multiqc_inputs
  """
}

Main start

We now have the main script and the two modules. In order to run the pipeline, we need inputs: the FASTQ files. To provide them, we must modify the main.nf file and add an input channel.

The Nextflow documentation explains how to import paired files: https://nextflow.io/docs/latest/reference/channel.html#fromfilepairs

💡 Note: You can display the content of a channel on the terminal with the view() operator. It is very useful for inspecting the inputs and outputs of processes.

Example:

fastqs.view()

Exercise 7:

  1. Create a channel for the FASTQ files (in the data subdirectory) and one for the MultiQC configuration file.
  2. Call the FASTQC and MULTIQC processes with the new channels; for MULTIQC, also pass the output channel of the FASTQC process
  3. The MULTIQC process must be executed only once

💡 Note: To ensure that the MULTIQC process is executed only once, you need to use some channel operators (collect() and maybe map()). See the Nextflow documentation for more information about these operators.
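A quick illustration of these two operators, using hypothetical sample ids and file names (not the exercise data):

```nextflow
// Hypothetical channel: one [sampleId, file] pair per FASTQC task.
Channel.of(['s1', 'a_fastqc.zip'], ['s2', 'b_fastqc.zip'])
    .map { it -> it[1] }   // drop the sample id, keep only the file part
    .collect()             // gather all emissions into a single list
    .view()                // emits one item, e.g. [a_fastqc.zip, b_fastqc.zip]
```

Because collect() emits a single list, a process fed with it runs only once.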

❓ Click to see a hint
workflow Quality_Checker {
    input_ch = Channel.fromFilePairs("./data/*_R{1,2}.fastq.gz")
}
process test {
    output:
    path("*.*"), emit: toto

    script:
    """
    echo test > log.txt
    """
}

workflow {
    test()
    ch_res = test.out.toto
}
Click to see a solution
include { FASTQC  } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'

workflow Quality_Checker {

    main:

        fastqs = Channel.fromFilePairs("./data/*_R{1,2}.fastq.gz")
        multiqc_config = Channel.fromPath('./assets/multiqc_config.yml')

        FASTQC(fastqs)
        MULTIQC(multiqc_config, FASTQC.out.results.map{ it -> it[1] }.collect())
}

workflow { // Entry workflow
    Quality_Checker()
}

Running

You can now run your new pipeline with the following command line:

nextflow run main.nf 

Exercise 8: Launch the Nextflow workflow with the previous command line. Check that the MULTIQC process has been executed only once.

After a few seconds, you will obtain output like this:


 N E X T F L O W   ~  version 25.04.7

Launching `main.nf` [thirsty_almeida] DSL2 - revision: 03f8331e18

executor >  local (3)
[1f/1c54fa] Quality_Checker:FASTQC (1)  [100%] 2 of 2 ✔
[b0/51040a] Quality_Checker:MULTIQC (1) [100%] 1 of 1 ✔

If you want to access the working directory of the MultiQC process, for example, you can do this:

cd work/b0/
ls 

You will find a directory whose name starts with 51040a, e.g. 51040a19b731196497167817770a4a.

Exercise 9: Relaunch the Nextflow workflow with the -resume option. What are the differences?

💡 Note: With the nextflow clean -f command, you can delete old work directories.

Module MultiQC extra part: stub

In Nextflow, the stub section of a process allows you to simulate the execution without actually running the script. Instead, it creates empty files that represent the expected outputs of the process. This is useful for testing the pipeline's structure and flow without waiting for each step to complete. It is especially helpful during development, for example when working on a laptop, or when a process involves long runtimes and no small test dataset is available.

We can do this for the MULTIQC process.

If a stub section exists in a process definition, Nextflow creates the fake files and folders instead of running the real script when you launch it with the -stub option on the command line. As an example, you can simulate the execution of your workflow with the following command:

nextflow run main.nf -stub

Exercise 10:

  1. Add a stub: section to your MULTIQC process definition.
  2. Relaunch the workflow.
  3. Check the output of the MULTIQC process in the work directory.
Click to see a solution of the exercise
process MULTIQC {

  input:[...]

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script: [...]

  stub:
  """
  multiqc --version > versions.txt 
  touch "multiqc_report.html"
  mkdir multiqc_data
  """
}

Set inputs as workflow parameters

We now have a functional workflow, but the inputs are hard-coded.

params.fastq_input is a Nextflow parameter that is set with the '--' flag on the command line, for example:

nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz'

💡 Note: In the previous command line, the FASTQ file pattern is quoted to avoid shell expansion. With quotes, the glob pattern is interpreted by Nextflow instead of Bash.
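The effect of quoting can be checked with plain Bash (a small throw-away demo in /tmp, independent of Nextflow):

```shell
# Create a scratch directory with two FASTQ-like file names.
mkdir -p /tmp/glob_demo
touch /tmp/glob_demo/a_R1.fastq.gz /tmp/glob_demo/a_R2.fastq.gz
cd /tmp/glob_demo

# Unquoted: Bash expands the glob before the program sees it.
echo *_R1.fastq.gz      # prints: a_R1.fastq.gz

# Quoted: the literal pattern is passed through untouched,
# which is what Nextflow needs to build the channel itself.
echo '*_R1.fastq.gz'    # prints: *_R1.fastq.gz
```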

Exercise 11: Use params.fastq_input to implement a more flexible workflow.

Click to see a solution of the exercise
include { FASTQC  } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'

workflow Quality_Checker {

    take:
    qc_input


    main:
        fastqs =  Channel.fromFilePairs(qc_input)
        multiqc_config = Channel.fromPath('./assets/multiqc_config.yml')

        FASTQC(fastqs) 
        MULTIQC(multiqc_config, FASTQC.out.results.map{it -> it[1]}.collect())
}

workflow { // Entry workflow
    Quality_Checker(params.fastq_input)
}

PublishDir

You have a flexible workflow, but the results are still buried in the work directories.

To choose where the results are published, you can use the publishDir directive: 🔗 https://nextflow.io/docs/edge/reference/process.html#publishdir

Exercise 12: Implement this functionality for the FastQC and MultiQC processes.

❓ Click to see an example
process test {
    publishDir "${params.outDir}/", mode: 'copy', saveAs: { filename ->
        if (filename == "log.txt") return "test_$filename"
        else return null }

    output:
    path("*.*"), emit: toto

    script:
    """
    echo test > log.txt
    """
}

Here the outputs are saved in params.outDir:

nextflow run main.nf --outDir ./results

Click to see a solution of the exercise
process FASTQC {
  publishDir "${params.outDir}/$sampleId", mode: 'copy'

  input:
  tuple val(sampleId), path(fastqs) 
  
  output:
  tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results        // emit: names this output so the workflow can reference it
  path("versions.txt")       , emit: versions
  
 
  script:
    """
    echo \$(fastqc --version) > versions.txt  # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is kept in the Bash script
    fastqc -q --threads 1 ${fastqs}           # Here ${fastqs} is a Nextflow variable that refers to the input of this process
    """
}
process MULTIQC {
  publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
    filename -> 
    if (filename.contains(".html")) return 'multiqc_report.html'
    else if (filename == "versions.txt") return filename
    else return null
    }

  input:
  path(multiqc_config)
  path(results) // The FastQC outputs are passed as inputs but are not referenced explicitly in the script:
                // Nextflow stages them (as symlinks) into the task work directory,
                // and MultiQC simply scans the current directory for the FastQC results.

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script:
  """
  multiqc --version > versions.txt 
  multiqc --force $multiqc_config .
  """

  stub:
  """
  multiqc --version > versions.txt 
  touch "multiqc_report.html"
  mkdir multiqc_data
  """
}
nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz' --outDir ./results

You can store default values like outDir in the nextflow.config file. 🔗 https://nextflow.io/docs/edge/config.html

Exercise 13:

  1. Define the value of the outDir variable in nextflow.config to publish outputs in the results subdirectory.
  2. Include the empty conf/base.config file in the nextflow.config configuration file.
Click to see an example of solution
includeConfig './conf/base.config'

params {
    fastq_input = null

    outDir = "$projectDir/results"
}

Labels in configuration file

You can use labels in your processes to share configuration. For example, you may have several processes where some need few resources and others need more. You can define two labels:

process{
    withLabel: little_prog {
        cpus = 1
        memory = 4.GB
    }
    withLabel: big_prog {
        cpus = 4
        memory = 8.GB
    }
}
process FASTQC {
    label "big_prog"

    [...]
}

process MULTIQC {
    label 'little_prog'

    [...]
}

Exercise 14:

  1. Add to your nextflow.config file the section that defines the little_prog and big_prog labels.
  2. Use these labels in the FastQC and MultiQC processes.
  3. Execute the workflow.
Click to see a solution of the exercise
process FASTQC {
  label 'big_prog'
  publishDir "${params.outDir}/$sampleId", mode: 'copy'

  input:
  tuple val(sampleId), path(fastqs) 
  
  output:
  tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results        // emit: names this output so the workflow can reference it
  path("versions.txt")       , emit: versions
  
 
  script:
    """
    echo \$(fastqc --version) > versions.txt  # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is kept in the Bash script
    fastqc -q --threads 1 ${fastqs}           # Here ${fastqs} is a Nextflow variable that refers to the input of this process
    """
}
process MULTIQC {
  label 'little_prog'
  publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
    filename -> 
    if (filename.contains(".html")) return 'multiqc_report.html'
    else if (filename == "versions.txt") return filename
    else return null
    }

  input:
  path(multiqc_config)
  path(results) // The FastQC outputs are passed as inputs but are not referenced explicitly in the script:
                // Nextflow stages them (as symlinks) into the task work directory,
                // and MultiQC simply scans the current directory for the FastQC results.

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script:
  """
  multiqc --version > versions.txt 
  multiqc --force $multiqc_config .
  """

  stub:
  """
  multiqc --version > versions.txt 
  touch "multiqc_report.html"
  mkdir multiqc_data
  """
}
params {
    fastq_input = null

    outDir = "$projectDir/results"
}

process{
    withLabel: little_prog {
        cpus = 1
        memory = 4.GB
    }
    withLabel: big_prog {
        cpus = 4
        memory = 8.GB
    }
}
nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz'

Profiles

Profiles are configurations tied to different environments. For instance, depending on whether you work on a cluster or on your laptop, the location of your annotations, the available memory, and the number of CPUs may differ.

You can split the configuration into different files with the includeConfig statement. 🔗 https://nextflow.io/docs/edge/config.html#config-profiles

Profiles Cluster

Your cluster can have different schedulers, such as SLURM, PBS, etc. Nextflow handles the scheduler submission for you, but you must specify which executor to use. We will create a cluster profile in nextflow.config. 🔗 https://nextflow.io/docs/latest/executor.html#slurm

// In nextflow.config
profiles {
    cluster {
        includeConfig './conf/cluster.config'
    }
}

// In conf/cluster.config
process {
    executor = 'slurm'
    queue = 'fast'
}

You can also override variables in the cluster config file, for example to use more CPUs.
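For example, a hypothetical conf/cluster.config could both select SLURM and raise the resources attached to the big_prog label (the values here are illustrative, not prescribed by the exercise):

```nextflow
// Hypothetical conf/cluster.config: select the SLURM executor
// and override the resources of the 'big_prog' label on the cluster.
process {
    executor = 'slurm'
    queue    = 'fast'

    withLabel: big_prog {
        cpus   = 16
        memory = 64.GB
    }
}
```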

Exercise 15: Add the "cluster" profile to your nextflow.config file and create the conf/cluster.config configuration file. Execute the workflow using a SLURM submission.

Click to see a solution of the exercise
nextflow run main.nf -profile cluster --input './data/*_R{1,2}.fastq.gz'

Profiles Conda

You can use Conda environments in your Nextflow processes: https://nextflow.io/docs/latest/conda.html https://nextflow.io/docs/latest/reference/config.html#conda

Following the best practices in the documentation, it is recommended to create a conda profile in a configuration file. As in the cluster part, we will create it in the Nextflow configuration file:

// In nextflow.config
profiles {
    conda {
        includeConfig './conf/conda.config'
    }
}

// In conf/conda.config
conda {
    cacheDir = "$projectDir/conda-cache-nextflow" // where the conda environments are created
    createTimeout = "1 h"
    enabled = true
}

process {
    withLabel:fastqc { conda = "$projectDir/recipes/conda/fastqc.yml" }
    withLabel:multiqc { conda = "$projectDir/recipes/conda/multiqc.yml" }
}

As we will execute FastQC and MultiQC through Conda, we can now unload the FastQC and MultiQC modules from our environment:

module unload fastqc multiqc

Exercise 16:

  1. Add the "conda" profile to your nextflow.config file and create the conf/conda.config configuration file.
  2. Add labels defined in conf/conda.config to FASTQC and MULTIQC processes.
  3. Execute the workflow using SLURM submissions and Conda to execute FastQC and MultiQC.
Click to see a solution of the exercise
process FASTQC {
  label 'fastqc'
  label 'big_prog'
  publishDir "${params.outDir}/$sampleId", mode: 'copy'

  input:
  tuple val(sampleId), path(fastqs) 
  
  output:
  tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results        // emit: names this output so the workflow can reference it
  path("versions.txt")       , emit: versions
  
 
  script:
    """
    echo \$(fastqc --version) > versions.txt  # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is kept in the Bash script
    fastqc -q --threads 1 ${fastqs}           # Here ${fastqs} is a Nextflow variable that refers to the input of this process
    """
}
process MULTIQC {
  label 'multiqc'
  label 'little_prog'
  publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
    filename -> 
    if (filename.contains(".html")) return 'multiqc_report.html'
    else if (filename == "versions.txt") return filename
    else return null
    }

  input:
  path(multiqc_config)
  path(results) // The FastQC outputs are passed as inputs but are not referenced explicitly in the script:
                // Nextflow stages them (as symlinks) into the task work directory,
                // and MultiQC simply scans the current directory for the FastQC results.

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script:
  """
  multiqc --version > versions.txt 
  multiqc --force $multiqc_config .
  """

  stub:
  """
  multiqc --version > versions.txt 
  touch "multiqc_report.html"
  mkdir multiqc_data
  """
}
nextflow run main.nf -profile cluster,conda --fastq_input './data/*_R{1,2}.fastq.gz'

Profiles Singularity

You can use Singularity containers in your Nextflow processes:

🔗 https://nextflow.io/docs/latest/container.html#singularity

🔗 https://nextflow.io/docs/latest/reference/config.html#config-singularity

Following the best practices in the documentation, it is recommended to create a singularity profile in a configuration file. As in the cluster part, we will create it in the Nextflow configuration file:

// In nextflow.config
profiles {
    singularity {
        includeConfig './conf/singularity.config'
    }
}

// In conf/singularity.config
singularity {
  enabled = true
  autoMounts = true
  runOptions = '--containall'
}

process {
    withLabel:fastqc { container = "$projectDir/recipes/singularity/fastqc.sif" }
    withLabel:multiqc { container = "$projectDir/recipes/singularity/multiqc.sif" }
}

Exercise 17: Add the "singularity" profile to your nextflow.config file and create the conf/singularity.config configuration file. Execute the workflow using SLURM submissions and Singularity to execute FastQC and MultiQC.

Click to see a solution of the exercise
nextflow run main.nf -profile cluster,singularity --fastq_input './data/*_R{1,2}.fastq.gz'

Metrics

You can enable execution metrics (timeline, report, trace, and DAG) for your workflow. You can turn them on with command-line options (-with-timeline, -with-report, -with-trace, -with-dag) or in the Nextflow configuration file:


// From nf-core
timeline {
  enabled = true
  overwrite = true
  file = "${params.summaryDir}/trace/timeline.html"
}

report {
  enabled = true
  overwrite = true
  file = "${params.summaryDir}/trace/report.html"
}

trace {
  enabled = true
  raw = true
  overwrite = true
  fields = 'process,task_id,hash,native_id,module,container,tag,name,status,exit,submit,start,complete,duration,realtime,%cpu,%mem,rss,vmem,peak_rss,peak_vmem,rchar,wchar,syscr,syscw,read_bytes,write_bytes,attempt,workdir,scratch,queue,cpus,memory,disk,time,env'
  file = "${params.summaryDir}/trace/trace.txt"
}

dag {
  enabled = true
  overwrite = true
  file = "${params.summaryDir}/trace/DAG.pdf"
}

Graphviz is required to render the execution DAG as an image. So we need to load the module before relaunching the workflow:

module load graphviz

Exercise 18: Add the previous example of metrics configuration to your nextflow.config, and relaunch your workflow.

Click to see a solution of the exercise
nextflow run main.nf -profile cluster,singularity --fastq_input './data/*_R{1,2}.fastq.gz' --summaryDir .

Example Curie pipeline

https://github.com/bioinfo-pf-curie/

https://github.com/bioinfo-pf-curie/raw-qc

https://github.com/bioinfo-pf-curie/geniac-demo-dsl2

https://geniac.readthedocs.io/en/version-3.6.0/