Nextflow introduction

2025-10-30

Frédéric Jarlier, Julien Roméjon, Philippe Hupé, Laurent Jourdren, Quentin Duvert

Table of contents

  • Workflow manager

  • Nextflow key concepts

  • Channel & Operator

  • Process & Directive

  • Configuration & Profile

  • Debug & Metrics

  • Conclusion

Workflow manager

Keep the focus on “What” instead of “How”


What?

  • algorithm

  • bioinformatics tools

  • analysis parameters

  • genomes, thresholds

  • biological tuning

How?

  • cluster

  • error management

  • parallel execution

  • containers

  • technical tuning

  • log management, reproducibility


“Write the biological part and delegate the technical part to the workflow manager”
-> BIO (what) - INFO (how)

Workflow manager

“Feel the spirit”

  • a workflow manager adds complexity to your pipeline stack

  • make sure you get real benefits from it


“Change your habits”

  • never paste pre-existing code into your workflow manager

  • “it works!” is not a good enough reason


“Use the natural way”

  • try to use as many of your workflow manager’s features as you can

  • always ask yourself what the most Nextflow-idiomatic way to do something is

Workflow manager

Main features

  • reproducibility
  • portability
  • tools management
    • conda
    • container (docker/singularity)
    • environment modules
  • execution report
    • resources usage
  • very (too) fast evolution
    • deprecation happens quickly
    • DSL2
    • strict syntax

Nextflow key concepts

                                                                                                                                 
Five key concepts: Channel, Operator, Process, Directive, Workflow
  • Operators work on Channels

  • Directives work on Processes

  • Workflows chain Processes by injecting Channels


  • Nextflow is a dataflow programming language that helps build complex workflows

  • The idea is to chain multiple tasks, like piping commands in a *nix system

  • Nextflow > Groovy > Java / main file: main.nf
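
For instance, a minimal main.nf chaining a channel into a process inside a workflow (a sketch; the process name and script are illustrative):

// a channel of names feeds the process; each value becomes one task
process sayHello {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, ${name}!"
    """
}

workflow {
    Channel.of('alice', 'bob') | sayHello | view
}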

Channel & Operator

Channel (Dataflow), 2 kinds:

  • Queue Channel (Dataflow channel)

    • asynchronous sequence of values (FIFO)

    • iteration, consumption

  • Value Channel (Dataflow value)

    • unique value

    • never empty, no consumption

Usage

  • Initialisation -> Factories

  • Manipulation -> Operators

Example

// 4 emissions
Channel.of(9, 10, 11, 12).view()

// 1 emission
Channel.of([13, 14, 15, 16]).view()

// 4 emissions
Channel.fromList([5, 6, 7, 8]).view()

// error
// Channel.fromList(1, 2, 3, 4).view()

// always 1 emission
Channel.value(['foo', 'bar']).view()

Channel & Operator

Real life example…

  • Filesystem manipulations

  • Paired files, recursive search…

files = Channel.fromPath('data/**.fa')
moreFiles = Channel.fromPath('data/**/*.fa')

Channel
  .fromFilePairs('/my/data/SRR*_{1,2}.fastq')
  .view()
// output similar to: 
// [SRR493366, [/my/data/SRR493366_1.fastq, /my/data/SRR493366_2.fastq]]
// [SRR493367, [/my/data/SRR493367_1.fastq, /my/data/SRR493367_2.fastq]]
// [SRR493368, [/my/data/SRR493368_1.fastq, /my/data/SRR493368_2.fastq]]
// [SRR493369, [/my/data/SRR493369_1.fastq, /my/data/SRR493369_2.fastq]]
// [SRR493370, [/my/data/SRR493370_1.fastq, /my/data/SRR493370_2.fastq]]
// [SRR493371, [/my/data/SRR493371_1.fastq, /my/data/SRR493371_2.fastq]]

Channel & Operator

Operator

  • Filtering
    • .filter{...}, .first(), .unique()
  • Transforming
    • .map{...}, .groupTuple(), .collect(), .flatten()
  • Splitting
    • .splitCsv(), .splitFasta(...)
  • Combining
    • .join(...), .mix(...), .concat(...)
  • Forking
    • .multimap{...}, .branch{...}
  • Maths
    • .count(), .min(), .max()
  • Other
    • .dump(), .set{...}, .ifEmpty(...), .view()

Example

Channel.of(1, 2, 3, 4).collect().view().set{ testCh }
// expected output:
// [1, 2, 3, 4] 

testCh.flatten().filter{ it -> it%2 }.set{ outCh }
outCh.view()
// expected output:
// 1
// 3

tab = ['atchoum', 'simplet', 'prof', 'joyeux']
outCh.map{ val -> [val, tab[val]] }.view()
// expected output:
// [1, simplet]
// [3, joyeux]
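
The Splitting and Combining operators listed above deserve a quick sketch too (the CSV file and its content are made up):

// samples.csv (illustrative):
// id,fastq
// s1,/data/s1.fastq
// s2,/data/s2.fastq
samplesCh = Channel.fromPath('samples.csv')
    .splitCsv(header: true)
    .map{ row -> [row.id, row.fastq] }

statusCh = Channel.of(['s1', 'ok'], ['s2', 'failed'])

// .join() matches items sharing the same first element (the key)
samplesCh.join(statusCh).view()
// expected output:
// [s1, /data/s1.fastq, ok]
// [s2, /data/s2.fastq, failed]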

Process & Directive

Skeleton

process <name> {
    [directive]

    input:
    <input qualifier> <input name>

    output:
    <output qualifier> <output name>[, emit: <name>]

    script:
    <script to be executed>
}

“A process starts when all the inputs are ready”


“Path outputs must have been produced at the end of the process”

Process

  • Basic unit
  • Runs a script, by default a bash command line
  • Processes inputs
  • Produces outputs
  • “one tool per process”

Inputs

  • Qualifiers: val, path, env, stdin, tuple, each (see the sketch below)
  • Inputs are (almost) mandatory
  • “function arguments”

Outputs

  • Qualifiers: val, path, env, stdout, tuple, eval
  • Outputs are optional
  • “function return”
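
As promised above, a sketch of the each qualifier, which repeats the task for every value in a collection (the file name and patterns are illustrative):

process countMatches {
    input:
    path(txt)
    each pattern

    output:
    stdout

    script:
    """
    grep -c '${pattern}' ${txt} || true
    """
}

workflow {
    // one task per pattern: countMatches runs twice here
    countMatches(Channel.fromPath('data.txt'), ['foo', 'bar'])
}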

Process & Directive

Real life example…

process extract {
    input:
    val x
    path y

    output:
    tuple val(x), path("${x}.txt")

    script:
    """
    cat ${y} | grep ${x} > ${x}.txt
    """
}
process manta {
    tag "${meta.id}"
    label 'manta'
    label 'highCpu'
    label 'highMem'
    
    input:
    tuple val(meta), path(tumor_bam), path(tumor_bai)
    path(fasta)
    path(fastaFai)
    
    output:
    tuple val(meta), path("*.vcf.gz"), path("*.vcf.gz.tbi"), emit:svVcf

    [...]
}

Process & Directive

Directive

  • Optional settings to customize process execution

  • Different kinds

    • General: tag, label, publishDir
    • Execution context: executor, clusterOptions, maxForks, queue…
    • Resources management: cpus, memory, time…
    • Software dependencies: conda, container, module, containerOptions…
    • Error/Debug: errorStrategy, maxErrors, maxRetries, debug, cache…
  • Dynamic values
    • closure
    • special variable “task”

process foo {
    executor 'slurm'
    queue { entries > 100 ? 'long' : 'short' }

    input:
    tuple val(entries), path(bam), path(bai)

    [...]
}

process foo {
    tag "$barcode"
    cpus 4
    memory '16 GB'

    input:
    tuple val(barcode), ...

    [...]
}

process foo {
    memory { task.attempt > 1 ? task.previousTrace.memory * 2 : (1.GB) }
    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
    maxRetries 3

    [...]

Workflow

Connect everything!

process foo {
    input:
    val x
    val y

    output:
    path "${y}.txt"

    script:
    """
    echo "${x} and ${y}" > ${y}.txt
    """
}
process bar {
    input:
    path x

    output:
    path "${x}.zip", emit: zf
    stdout emit: size

    script:
    """
    zip -q ${x}.zip ${x}
    stat -c%s ${x}
    """
}
workflow {
    u = Channel.value('1')
    v = Channel.of('a', 'b')
    foo(u, v)   // foo runs twice: the value channel is reused for each item of v
}
workflow {
    u = Channel.of(1)
    v = Channel.of('a', 'b')
    result = foo(u, v)   // foo runs once: the queue channel u is consumed after one item
    bar(result)
    // foo(u, v) | bar
}
workflow bouh {
    foo(1, 'b') | bar   // literal arguments are implicitly wrapped in value channels
    bar.out.size.view()
}

workflow {
    bouh()
}
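
Named workflows can also declare an explicit interface with take: and emit:; a minimal sketch reusing the foo and bar processes above (the workflow and channel names are illustrative):

workflow zipIt {
    take:
    inputCh

    main:
    foo(Channel.value('1'), inputCh)
    bar(foo.out)

    emit:
    zipped = bar.out.zf
}

workflow {
    zipIt(Channel.of('a', 'b'))
    zipIt.out.zipped.view()
}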

Configuration & Profile

  • Externalize many settings to a dedicated file: nextflow.config
  • At runtime, Nextflow looks for configuration files
    • $HOME/.nextflow/config
    • nextflow.config in projectDir, then launchDir
    • files specified using the -c <config-files> option from the command line
    • -C <config-files> <=> ignore all the other config files
  • Syntax
    • assignment and/or blocks; with scopes (i.e. namespaces)
    • mixing files: includeConfig 'path/extra.config'
params.runName = "default name"

process {
    executor = 'slurm'

    withLabel: big_mem {
        memory = 64.GB
        queue = 'long'
    }
}

report.enabled = true
  • Different scopes for different usages…
    • define default values for pipeline parameters: params
    • process directives: process
    • context: timeline, report, trace, workflow, dag
    • software dependencies: conda, apptainer, docker
    • cloud: aws, azure, google, tower

Profile <=> activate specific parts of configuration files from the command line
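
For instance, a hedged sketch of a profiles block (the profile names and values are illustrative):

profiles {
    clst {
        process.executor = 'slurm'
    }
    docker {
        docker.enabled = true
        process.container = 'ubuntu:22.04'
    }
}

// activate one or several profiles at launch:
// nextflow run main.nf -profile clst,docker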

Debug & Metrics

  • Main log file

    • .nextflow.log
  • Get more metrics with custom config

trace.enabled = true
timeline.enabled = true
report.enabled = true
dag.enabled = true
  • Place to debug

    • work directory
  • Useful options for debug

    • -resume (cache)

    • -stub-run (dry run)

Debug & Metrics

Enjoy the work directory!
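
Each task gets its own unique work subdirectory; alongside the task’s inputs and outputs, Nextflow writes hidden helper files there (an indicative listing, the hash path is made up):

$ ls -a work/a1/b2c3d4.../
.command.sh     # the actual script that was executed
.command.run    # wrapper script (environment, container, scheduler glue)
.command.out    # captured stdout
.command.err    # captured stderr
.command.log    # combined execution log
.exitcode       # exit status of the task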

Summary

$ nextflow -C custom.config run main.nf -profile sgu,clst --run G553 --bclDir /path

  • -C custom.config: nextflow option to load only this configuration file, ignoring all others

  • run: nextflow command

  • main.nf: main script

  • -profile sgu,clst: option of the run command to activate 2 profiles defined in configuration files

  • --run G553, --bclDir /path: pipeline parameters, overriding default values from params scope

Channel & Operator || Queue vs Value; be sure about the content of your channels


Process & Directive || Execution order based on input availability


Configuration & Debug || Make the work directory your friend

Conclusion

Key features

  • automatic parallelization

  • software dependencies: conda, docker, apptainer/singularity, etc…

  • high portability: scheduler (pbs, slurm, …) / cloud (google, aws, azure, etc…)

  • error management: error recovery with “-resume”, dry run with “-stub-run”, debug via the work directory files

Extra resources