Introduction to Nextflow

Practical session

Connecting to the IFB cluster

There are two ways to connect to the IFB cluster: SSH and Open OnDemand.

ssh -X -o ServerAliveInterval=60 -l mylogin core.cluster.france-bioinformatique.fr

The IFB cluster web portal is available at: https://my.cluster.france-bioinformatique.fr/manager2/login

More information about Open OnDemand is available in the IFB documentation: https://ifb-elixirfr.gitlab.io/cluster/doc/software/openondemand/

Exercise 1:

Connect to the IFB cluster using SSH or Open on demand.

Connecting to a compute node

  1. Open a session on a compute node:
srun --account=2521_wf4bioinfo --pty bash
  2. Load Nextflow in your environment:
module load nextflow/25.04.7
module load fastqc/0.12.1
module load multiqc/1.13
  3. Display the version of Nextflow:
nextflow -v
  4. Create and go to your working directory:
mkdir /shared/projects/2521_wf4bioinfo/participants/$USER
cd /shared/projects/2521_wf4bioinfo/participants/$USER
  5. Copy the training data into your working directory:
cp -rp /shared/projects/2521_wf4bioinfo/course-material/atelier-nextflow/day2/tp/tp_partie_1 tp
cp -rp /shared/projects/2521_wf4bioinfo/course-material/atelier-nextflow/day2/tp/tp_partie_2/recipes tp/
cd tp
  6. Test FastQC and MultiQC on the fastq.gz files:
bash fastqc.sh ./data/*_R{1,2}.fastq.gz
bash multiqc.sh

Exercise 2:

Load the Nextflow, FastQC and MultiQC modules and execute the fastqc.sh and multiqc.sh scripts on the IFB cluster.

→ Now you are ready to start

Objectives

The goal of this practical session is to build a Nextflow pipeline for FASTQ quality control. To do this we will use the tools FastQC and MultiQC. First, we will build a minimal workflow that involves inputs, processes and outputs.

The DAG (Directed Acyclic Graph) of the workflow

Input and output of FastQC and MultiQC

Workflow files

Exercise 3:

Create these subdirectories and empty files:

Click to see an exercise solution
mkdir conf modules 
touch main.nf nextflow.config conf/base.config modules/fastqc.nf modules/multiqc.nf

The Main script: main.nf

Now, we need to write the main script: main.nf. This script needs two parts:

  1. The workflow skeleton

→ 🔗 Workflow skeleton in Nextflow documentation

Click to see an example of workflow skeleton
// Insert here include declarations

workflow Quality_Checker { // Named workflows can be called as sub-workflows

    [take]: // Optional; declares inputs when used as a sub-workflow

    [main]: // Process calls

    [emit]: // Optional; declares outputs when used as a sub-workflow

}


2. The Nextflow modules to include. In this practical session, there are two modules: fastqc.nf and multiqc.nf.

→ 🔗 Include declaration in Nextflow documentation

Click to see required module includes
include { FASTQC  } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'

Exercise 4: Create a main.nf script with the workflow skeleton and module imports.

Click to see an exercise solution
include { FASTQC  } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'


workflow Quality_Checker {

    main:

        FASTQC() 
        MULTIQC()
}

workflow { // Entry workflow
    Quality_Checker()
}

Module FastQC: modules/fastqc.nf

In this part we will write the first module of the workflow: modules/fastqc.nf.

  1. We will first create the process from a skeleton
    → Process in Nextflow documentation
Click to see an example
//include

process FASTQC{ 
    [directive]

    [input]: 

    [output]:

    [script|exec]:

}
  2. We will copy the command lines from the fastqc.sh shell script into the script section of the module
  3. Set the input of the process
    → 🔗 Inputs in Nextflow documentation
  4. Set the output of the process
    → 🔗 Outputs in Nextflow documentation
❓ Click to see a hint on how to set inputs and outputs for the FASTQC process
process FASTQC{
    input:
    tuple val(sampleId), path(fastqs)

    output:
    path(result_fastqc)

    script:
    """
    ...
    """
}

Exercise 5: Create a modules/fastqc.nf script with a FASTQC process that executes the same command line as the fastqc.sh shell script.

💡 Note 1: The input of the FASTQC process will be created using the Channel.fromFilePairs method. A typical input will look like:

[DRUP01_SUB2, [./data/DRUP01_SUB2_R1.fastq.gz, ./data/DRUP01_SUB2_R2.fastq.gz]]
[HG002_SUB1, [./data/HG002_SUB1_R1.fastq.gz, ./data/HG002_SUB1_R2.fastq.gz]]

💡 Note 2: In the script sections of a Nextflow workflow, shell variables starting with the "$" character (e.g. $CWD) must be prefixed with "\". Otherwise, they will be interpreted as Nextflow variables.
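For instance, escaping can be seen in a hypothetical process (ESCAPE_DEMO is an illustration only, not part of the pipeline):

```nextflow
process ESCAPE_DEMO {
    input:
    path(fastqs)

    script:
    """
    echo "Staged in: \$PWD"      # \$PWD stays a Bash variable
    echo "Inputs:   ${fastqs}"   # ${fastqs} is interpolated by Nextflow
    """
}
```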

Click to see a solution of the exercise
process FASTQC {

  input:
  tuple val(sampleId), path(fastqs) 
  
  output:
  tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results        // emit: names this output so the workflow can reference it
  path("versions.txt")       , emit: versions
  
 
  script:
    """
    echo \$(fastqc --version) > versions.txt  # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is kept in the Bash script
    fastqc -q --threads 1 ${fastqs}           # Here ${fastqs} is a Nextflow variable that refers to the input of this process
    """
}

Module MultiQC: modules/multiqc.nf

Now, we will write the second module of the workflow: modules/multiqc.nf

Exercise 6: Create a modules/multiqc.nf script with a MULTIQC process that executes the same command lines as the multiqc.sh shell script.

Click to see a solution of the exercise
process MULTIQC {

  input:
  path(multiqc_config)
  path(multiqc_inputs) // The FastQC outputs are passed as inputs but are not referenced explicitly in the script:
                // Nextflow stages them (as symlinks) into the task work directory,
                // and MultiQC simply scans the current directory for the FastQC results.

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script:
  """
  multiqc --version > versions.txt 
  multiqc --force $multiqc_config $multiqc_inputs
  """
}

Main start

We now have the main script and the two modules. In order to run the pipeline, we need inputs: the FASTQ files. To provide them, we must modify the main.nf file and add an input channel.

The Nextflow documentation explains how to import paired files: https://nextflow.io/docs/latest/reference/channel.html#fromfilepairs

💡 Note: You can display the content of a channel on the terminal with the view() operator. It is very useful for inspecting the inputs and outputs of processes.

Example:

fastqs.view()

Exercise 7:

  1. Create a channel for the FASTQ files (in the data subdirectory) and one for the MultiQC configuration file.
  2. Call the FASTQC and MULTIQC processes with the new channels; for MULTIQC, also pass the output channel of the FASTQC process
  3. The MULTIQC process must be executed only once

💡 Note: To ensure that the MULTIQC process is executed only once, you need to use some channel operators (collect() and maybe map()). See the Nextflow documentation for more information about these operators.
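A quick illustration of these two operators, using hypothetical sample ids and file names (not the exercise data):

```nextflow
// Hypothetical channel: one [sampleId, file] pair per FASTQC task.
Channel.of(['s1', 'a_fastqc.zip'], ['s2', 'b_fastqc.zip'])
    .map { it -> it[1] }   // drop the sample id, keep only the file part
    .collect()             // gather all emissions into a single list
    .view()                // emits one item, e.g. [a_fastqc.zip, b_fastqc.zip]
```

Because collect() emits a single list, a process fed with it runs only once.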

❓ Click to see a hint
workflow Quality_Checker {
    input_ch = Channel.fromFilePairs("./data/*_R{1,2}.fastq.gz")
}
process test {
    output:
    path("*.*"), emit: toto

    script:
    """
    echo test > log.txt
    """
}

workflow {
    test()
    ch_res = test.out.toto
}
Click to see a solution
include { FASTQC  } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'

workflow Quality_Checker {

    main:

        fastqs = Channel.fromFilePairs("./data/*_R{1,2}.fastq.gz")
        multiqc_config = Channel.fromPath('./assets/multiqc_config.yml')

        FASTQC(fastqs)
        MULTIQC(multiqc_config, FASTQC.out.results.map{ it -> it[1] }.collect())
}

workflow { // Entry workflow
    Quality_Checker()
}

Running

You can now run your new pipeline with the following command line:

nextflow run main.nf 

Exercise 8: Launch the Nextflow workflow with the previous command line. Check that the MULTIQC process has been executed only once.

After a few seconds, you will obtain output like this:


 N E X T F L O W   ~  version 25.04.7

Launching `main.nf` [thirsty_almeida] DSL2 - revision: 03f8331e18

executor >  local (3)
[1f/1c54fa] Quality_Checker:FASTQC (1)  [100%] 2 of 2 ✔
[b0/51040a] Quality_Checker:MULTIQC (1) [100%] 1 of 1 ✔

If you want to access the working directory of the MultiQC process, for example, you can do this:

cd work/b0/
ls 

You will find a directory whose name starts with 51040a, e.g. 51040a19b731196497167817770a4a.

Exercise 9: Relaunch the Nextflow workflow with the -resume option. What are the differences?

💡 Note: With the nextflow clean -f command, you can delete old work directories.

Module MultiQC extra part: stub

In Nextflow, the stub section of a process allows you to simulate the execution without actually running the script. Instead, it creates empty files that represent the expected outputs of the process. This is useful for testing the pipeline's structure and flow without waiting for each step to complete. It is especially helpful during development, for example when working on a laptop, or when a process involves long runtimes and no small test dataset is available.

We can do this for the MULTIQC process.

If a stub section exists in a process definition, Nextflow creates the fake files and folders instead of running the real script when you launch it with the -stub option on the command line. As an example, you can simulate the execution of your workflow with the following command:

nextflow run main.nf -stub

Exercise 10:

  1. Add a stub: section to your MULTIQC process definition.
  2. Relaunch the workflow.
  3. Check the output of the MULTIQC process in the work directory.
Click to see a solution of the exercise
process MULTIQC {

  input:[...]

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script: [...]

  stub:
  """
  multiqc --version > versions.txt 
  touch "multiqc_report.html"
  mkdir multiqc_data
  """
}

Set inputs as workflow parameters

We now have a functional workflow, but the inputs are hard-coded.

params.fastq_input is a Nextflow parameter that is set with the '--' flag on the command line, for example:

nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz'

💡 Note: In the previous command line, the FASTQ file pattern is quoted to avoid shell expansion. With quotes, the glob pattern is interpreted by Nextflow instead of Bash.
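The effect of quoting can be checked with plain Bash (a small throw-away demo in /tmp, independent of Nextflow):

```shell
# Create a scratch directory with two FASTQ-like file names.
mkdir -p /tmp/glob_demo
touch /tmp/glob_demo/a_R1.fastq.gz /tmp/glob_demo/a_R2.fastq.gz
cd /tmp/glob_demo

# Unquoted: Bash expands the glob before the program sees it.
echo *_R1.fastq.gz      # prints: a_R1.fastq.gz

# Quoted: the literal pattern is passed through untouched,
# which is what Nextflow needs to build the channel itself.
echo '*_R1.fastq.gz'    # prints: *_R1.fastq.gz
```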

Exercise 11: Use params.fastq_input to implement a more flexible workflow.

Click to see a solution of the exercise
include { FASTQC  } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'

workflow Quality_Checker {

    take:
    qc_input


    main:
        fastqs =  Channel.fromFilePairs(qc_input)
        multiqc_config = Channel.fromPath('./assets/multiqc_config.yml')

        FASTQC(fastqs) 
        MULTIQC(multiqc_config, FASTQC.out.results.map{it -> it[1]}.collect())
}

workflow { // Entry workflow
    Quality_Checker(params.fastq_input)
}

PublishDir

You have a flexible workflow, but the results are still buried in the work directories.

To choose where the results are published, you can use the publishDir directive: 🔗 https://nextflow.io/docs/edge/reference/process.html#publishdir

Exercise 12: Implement this functionality for the FastQC and MultiQC processes.

❓ Click to see an example
process test {
    publishDir "${params.outDir}/", mode: 'copy', saveAs: { filename ->
        if (filename == "log.txt") return "test_$filename"
        else return null }

    output:
    path("*.*"), emit: toto

    script:
    """
    echo test > log.txt
    """
}

Here the outputs are saved in params.outDir:

nextflow run main.nf --outDir ./results

Click to see a solution of the exercise
process FASTQC {
  publishDir "${params.outDir}/$sampleId", mode: 'copy'

  input:
  tuple val(sampleId), path(fastqs) 
  
  output:
  tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results        // emit: names this output so the workflow can reference it
  path("versions.txt")       , emit: versions
  
 
  script:
    """
    echo \$(fastqc --version) > versions.txt  # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is kept in the Bash script
    fastqc -q --threads 1 ${fastqs}           # Here ${fastqs} is a Nextflow variable that refers to the input of this process
    """
}
process MULTIQC {
  publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
    filename -> 
    if (filename.contains(".html")) return 'multiqc_report.html'
    else if (filename == "versions.txt") return filename
    else return null
    }

  input:
  path(multiqc_config)
  path(results) // The FastQC outputs are passed as inputs but are not referenced explicitly in the script:
                // Nextflow stages them (as symlinks) into the task work directory,
                // and MultiQC simply scans the current directory for the FastQC results.

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script:
  """
  multiqc --version > versions.txt 
  multiqc --force $multiqc_config .
  """

  stub:
  """
  multiqc --version > versions.txt 
  touch "multiqc_report.html"
  mkdir multiqc_data
  """
}
nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz' --outDir ./results

You can store default values like outDir in the nextflow.config file. 🔗 https://nextflow.io/docs/edge/config.html

Exercise 13:

  1. Define the value of the outDir variable in nextflow.config to publish outputs in the results subdirectory.
  2. Include the empty conf/base.config file in the nextflow.config configuration file.
Click to see an example of solution
includeConfig './conf/base.config'

params {
    fastq_input = null

    outDir = "$projectDir/results"
}

Labels in configuration file

You can use labels in your processes to share configuration. For example, you may have several processes where some need few resources and others need more. You can define two labels:

process{
    withLabel: little_prog {
        cpus = 1
        memory = 4.GB
    }
    withLabel: big_prog {
        cpus = 4
        memory = 8.GB
    }
}
process FASTQC {
    label "big_prog"

    [...]
}

process MULTIQC {
    label 'little_prog'

    [...]
}

Exercise 14:

  1. Add to your nextflow.config file the section that defines the little_prog and big_prog labels.
  2. Use these labels in the FastQC and MultiQC processes.
  3. Execute the workflow.
Click to see a solution of the exercise
process FASTQC {
  label 'big_prog'
  publishDir "${params.outDir}/$sampleId", mode: 'copy'

  input:
  tuple val(sampleId), path(fastqs) 
  
  output:
  tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results        // emit: names this output so the workflow can reference it
  path("versions.txt")       , emit: versions
  
 
  script:
    """
    echo \$(fastqc --version) > versions.txt  # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is kept in the Bash script
    fastqc -q --threads 1 ${fastqs}           # Here ${fastqs} is a Nextflow variable that refers to the input of this process
    """
}
process MULTIQC {
  label 'little_prog'
  publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
    filename -> 
    if (filename.contains(".html")) return 'multiqc_report.html'
    else if (filename == "versions.txt") return filename
    else return null
    }

  input:
  path(multiqc_config)
  path(results) // The FastQC outputs are passed as inputs but are not referenced explicitly in the script:
                // Nextflow stages them (as symlinks) into the task work directory,
                // and MultiQC simply scans the current directory for the FastQC results.

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script:
  """
  multiqc --version > versions.txt 
  multiqc --force $multiqc_config .
  """

  stub:
  """
  multiqc --version > versions.txt 
  touch "multiqc_report.html"
  mkdir multiqc_data
  """
}
params {
    fastq_input = null

    outDir = "$projectDir/results"
}

process{
    withLabel: little_prog {
        cpus = 1
        memory = 4.GB
    }
    withLabel: big_prog {
        cpus = 4
        memory = 8.GB
    }
}
nextflow run main.nf --fastq_input './data/*_R{1,2}.fastq.gz'

Profiles

Profiles are configurations tied to different environments. For instance, depending on whether you work on a cluster or on your laptop, the location of your annotations, the available memory, and the number of CPUs may differ.

You can split the configuration into different files with the includeConfig statement. 🔗 https://nextflow.io/docs/edge/config.html#config-profiles

Profiles Cluster

Your cluster can have different schedulers, such as SLURM, PBS, etc. Nextflow handles the scheduler submission for you, but you must specify which executor to use. We will create a cluster profile in nextflow.config. 🔗 https://nextflow.io/docs/latest/executor.html#slurm

// In nextflow.config
profiles {
    cluster {
        includeConfig './conf/cluster.config'
    }
}

// In conf/cluster.config
process {
    executor = 'slurm'
    queue = 'fast'
}

You can also override variables in the cluster config file, for example to use more CPUs.
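For example, a hypothetical conf/cluster.config could both select SLURM and raise the resources attached to the big_prog label (the values here are illustrative, not prescribed by the exercise):

```nextflow
// Hypothetical conf/cluster.config: select the SLURM executor
// and override the resources of the 'big_prog' label on the cluster.
process {
    executor = 'slurm'
    queue    = 'fast'

    withLabel: big_prog {
        cpus   = 16
        memory = 64.GB
    }
}
```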

Exercise 15: Add the "cluster" profile to your nextflow.config file and create the conf/cluster.config configuration file. Execute the workflow using a SLURM submission.

Click to see a solution of the exercise
nextflow run main.nf -profile cluster --input './data/*_R{1,2}.fastq.gz'

Profiles Conda

You can use Conda environments in your Nextflow processes: https://nextflow.io/docs/latest/conda.html https://nextflow.io/docs/latest/reference/config.html#conda

Following the best practices in the documentation, it is recommended to create a conda profile in a configuration file. As in the cluster part, we will create it in the Nextflow configuration file:

// In nextflow.config
profiles {
    conda {
        includeConfig './conf/conda.config'
    }
}

// In conf/conda.config
conda {
    cacheDir = "$projectDir/conda-cache-nextflow" // where the conda environments are created
    createTimeout = "1 h"
    enabled = true
}

process {
    withLabel:fastqc { conda = "$projectDir/recipes/conda/fastqc.yml" }
    withLabel:multiqc { conda = "$projectDir/recipes/conda/multiqc.yml" }
}

As we will execute FastQC and MultiQC through Conda, we can now unload the FastQC and MultiQC modules from our environment:

module unload fastqc multiqc

Exercise 16:

  1. Add the "conda" profile to your nextflow.config file and create the conf/conda.config configuration file.
  2. Add labels defined in conf/conda.config to FASTQC and MULTIQC processes.
  3. Execute the workflow using SLURM submissions and Conda to execute FastQC and MultiQC.
Click to see a solution of the exercise
process FASTQC {
  label 'fastqc'
  label 'big_prog'
  publishDir "${params.outDir}/$sampleId", mode: 'copy'

  input:
  tuple val(sampleId), path(fastqs) 
  
  output:
  tuple val(sampleId), path("*_fastqc.{zip,html}"), emit: results        // emit: names this output so the workflow can reference it
  path("versions.txt")       , emit: versions
  
 
  script:
    """
    echo \$(fastqc --version) > versions.txt  # The backslash before the dollar sign prevents variable interpolation by Nextflow; the dollar sign is kept in the Bash script
    fastqc -q --threads 1 ${fastqs}           # Here ${fastqs} is a Nextflow variable that refers to the input of this process
    """
}
process MULTIQC {
  label 'multiqc'
  label 'little_prog'
  publishDir "${params.outDir}/multiqc", mode: 'copy', saveAs: {
    filename -> 
    if (filename.contains(".html")) return 'multiqc_report.html'
    else if (filename == "versions.txt") return filename
    else return null
    }

  input:
  path(multiqc_config)
  path(results) // The FastQC outputs are passed as inputs but are not referenced explicitly in the script:
                // Nextflow stages them (as symlinks) into the task work directory,
                // and MultiQC simply scans the current directory for the FastQC results.

  output:
  path "*multiqc_report.html", emit: report
  path "*_data"              , emit: data
  path "*_plots"             , optional:true, emit: plots
  path "versions.txt"        , emit: versions

  script:
  """
  multiqc --version > versions.txt 
  multiqc --force $multiqc_config .
  """

  stub:
  """
  multiqc --version > versions.txt 
  touch "multiqc_report.html"
  mkdir multiqc_data
  """
}
nextflow run main.nf -profile cluster,conda --fastq_input './data/*_R{1,2}.fastq.gz'

Profiles Singularity

You can use Singularity containers in your Nextflow processes:

🔗 https://nextflow.io/docs/latest/container.html#singularity

🔗 https://nextflow.io/docs/latest/reference/config.html#config-singularity

Following the best practices in the documentation, it is recommended to create a singularity profile in a configuration file. As in the cluster part, we will create it in the Nextflow configuration file:

// In nextflow.config
profiles {
    singularity {
        includeConfig './conf/singularity.config'
    }
}

// In conf/singularity.config
singularity {
  enabled = true
  autoMounts = true
  runOptions = '--containall'
}

process {
    withLabel:fastqc { container = "$projectDir/recipes/singularity/fastqc.sif" }
    withLabel:multiqc { container = "$projectDir/recipes/singularity/multiqc.sif" }
}

Exercise 17: Add the "singularity" profile to your nextflow.config file and create the conf/singularity.config configuration file. Execute the workflow using SLURM submissions and Singularity to execute FastQC and MultiQC.

Click to see a solution of the exercise
nextflow run main.nf -profile cluster,singularity --fastq_input './data/*_R{1,2}.fastq.gz'

Metrics

You can enable execution metrics (timeline, report, trace, and DAG) for your workflow. You can turn them on with command-line options (-with-timeline, -with-report, -with-trace, -with-dag) or in the Nextflow configuration file:


// From nf-core
timeline {
  enabled = true
  overwrite = true
  file = "${params.summaryDir}/trace/timeline.html"
}

report {
  enabled = true
  overwrite = true
  file = "${params.summaryDir}/trace/report.html"
}

trace {
  enabled = true
  raw = true
  overwrite = true
  fields = 'process,task_id,hash,native_id,module,container,tag,name,status,exit,submit,start,complete,duration,realtime,%cpu,%mem,rss,vmem,peak_rss,peak_vmem,rchar,wchar,syscr,syscw,read_bytes,write_bytes,attempt,workdir,scratch,queue,cpus,memory,disk,time,env'
  file = "${params.summaryDir}/trace/trace.txt"
}

dag {
  enabled = true
  overwrite = true
  file = "${params.summaryDir}/trace/DAG.pdf"
}

Graphviz is required to render the execution DAG as an image. So we need to load the module before relaunching the workflow:

module load graphviz

Exercise 18: Add the previous example of metrics configuration to your nextflow.config, and relaunch your workflow.

Click to see a solution of the exercise
nextflow run main.nf -profile cluster,singularity --fastq_input './data/*_R{1,2}.fastq.gz' --summaryDir .

Example Curie pipeline

https://github.com/bioinfo-pf-curie/

https://github.com/bioinfo-pf-curie/raw-qc

https://github.com/bioinfo-pf-curie/geniac-demo-dsl2

https://geniac.readthedocs.io/en/version-3.6.0/