Nextflow introduction

2025-10-30

Frédéric Jarlier, Julien Roméjon, Philippe Hupé, Laurent Jourdren, Quentin Duvert

Table of contents

  • Workflow manager

  • Nextflow key concepts

  • Channel & Operator

  • Process & Directive

  • Configuration & Profile

  • Debug & Metrics

  • Conclusion

Workflow manager

Keep the focus on “What” instead of “How”


What?

  • algorithm

  • bioinformatics tools

  • analysis parameters

  • genomes, thresholds

  • biological tuning

How?

  • cluster

  • error management

  • parallel execution

  • containers

  • technical tuning

  • log management, reproducibility


“Write the biological part and delegate the technical part to the workflow manager”
-> BIO (what) - INFO (how)

Workflow manager

“Feel the spirit”

  • a workflow manager adds complexity to your pipeline stack

  • make sure you get real benefits from it


“Change your habits”

  • never paste pre-existing code into your workflow manager

  • “it works!” is not a good enough reason


“Use the natural way”

  • try to use as many of your workflow manager’s features as you can

  • always ask yourself what the most Nextflow-idiomatic way to do something is

Workflow manager

Main features

  • reproducibility
  • portability
  • tools management
    • conda
    • container (docker/singularity)
    • environment modules
  • execution report
    • resources usage
  • very (too) fast evolution
    • deprecation happens quickly
    • DSL2
    • strict syntax

Nextflow key concepts

                                                                                                                                 
Five key concepts: Channel, Operator, Process, Directive, Workflow
  • Operators work on Channels

  • Directives work on Processes

  • Workflows chain Processes by injecting Channels


  • Nextflow is a dataflow programming language that helps build complex workflows

  • The idea is to chain multiple tasks, like piping commands in a *nix system

  • Nextflow > Groovy > Java / main file: main.nf
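
For instance, a minimal main.nf chaining a channel into a process inside a workflow (a sketch; the process name and script are illustrative):

// a channel of names feeds the process; each value becomes one task
process sayHello {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, ${name}!"
    """
}

workflow {
    Channel.of('alice', 'bob') | sayHello | view
}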

Channel & Operator

Channel (Dataflow), 2 kinds:

  • Queue Channel (Dataflow channel)

    • asynchronous sequence of values (FIFO)

    • iteration, consumption

  • Value Channel (Dataflow value)

    • unique value

    • never empty, no consumption

Usage

  • Initialisation -> Factories

  • Manipulation -> Operators

Example

// 4 emissions
Channel.of(9, 10, 11, 12).view()

// 1 emission
Channel.of([13, 14, 15, 16]).view()

// 4 emissions
Channel.fromList([5, 6, 7, 8]).view()

// error
// Channel.fromList(1, 2, 3, 4).view()

// always 1 emission
Channel.value(['foo', 'bar']).view()

Channel & Operator

Real life example…

  • Filesystem manipulations

  • Paired files, recursive search…

files = Channel.fromPath('data/**.fa')
moreFiles = Channel.fromPath('data/**/*.fa')

Channel
  .fromFilePairs('/my/data/SRR*_{1,2}.fastq')
  .view()
// output similar to: 
// [SRR493366, [/my/data/SRR493366_1.fastq, /my/data/SRR493366_2.fastq]]
// [SRR493367, [/my/data/SRR493367_1.fastq, /my/data/SRR493367_2.fastq]]
// [SRR493368, [/my/data/SRR493368_1.fastq, /my/data/SRR493368_2.fastq]]
// [SRR493369, [/my/data/SRR493369_1.fastq, /my/data/SRR493369_2.fastq]]
// [SRR493370, [/my/data/SRR493370_1.fastq, /my/data/SRR493370_2.fastq]]
// [SRR493371, [/my/data/SRR493371_1.fastq, /my/data/SRR493371_2.fastq]]

Channel & Operator

Operator

  • Filtering
    • .filter{...}, .first(), .unique()
  • Transforming
    • .map{...}, .groupTuple(), .collect(), .flatten()
  • Splitting
    • .splitCsv(), .splitFasta(...)
  • Combining
    • .join(...), .mix(...), .concat(...)
  • Forking
    • .multimap{...}, .branch{...}
  • Maths
    • .count(), .min(), .max()
  • Other
    • .dump(), .set{...}, .ifEmpty(...), .view()

Example

Channel.of(1, 2, 3, 4).collect().view().set{ testCh }
// expected output:
// [1, 2, 3, 4] 

testCh.flatten().filter{ it -> it%2 }.set{ outCh }
outCh.view()
// expected output:
// 1
// 3

tab = ['atchoum', 'simplet', 'prof', 'joyeux']
outCh.map{ val -> [val, tab[val]] }.view()
// expected output:
// [1, simplet]
// [3, joyeux]
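
The Splitting and Combining operators listed above deserve a quick sketch too (the CSV file and its content are made up):

// samples.csv (illustrative):
// id,fastq
// s1,/data/s1.fastq
// s2,/data/s2.fastq
samplesCh = Channel.fromPath('samples.csv')
    .splitCsv(header: true)
    .map{ row -> [row.id, row.fastq] }

statusCh = Channel.of(['s1', 'ok'], ['s2', 'failed'])

// .join() matches items sharing the same first element (the key)
samplesCh.join(statusCh).view()
// expected output:
// [s1, /data/s1.fastq, ok]
// [s2, /data/s2.fastq, failed]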

Process & Directive

Skeleton

process <name> {
    [directive]

    input:
    <input qualifier> <input name>

    output:
    <output qualifier> <output name>[, emit: <name>]

    script:
    <script to be executed>
}

“A process starts when all the inputs are ready”


“Path outputs must have been produced at the end of the process”

Process

  • Basic unit
  • Runs a script, by default a bash command line
  • Processes inputs
  • Produces outputs
  • “one tool per process”

Inputs

  • Qualifiers: val, path, env, stdin, tuple, each (see the sketch below)
  • Inputs are (almost) mandatory
  • “function arguments”

Outputs

  • Qualifiers: val, path, env, stdout, tuple, eval
  • Outputs are optional
  • “function return”
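
As promised above, a sketch of the each qualifier, which repeats the task for every value in a collection (the file name and patterns are illustrative):

process countMatches {
    input:
    path(txt)
    each pattern

    output:
    stdout

    script:
    """
    grep -c '${pattern}' ${txt} || true
    """
}

workflow {
    // one task per pattern: countMatches runs twice here
    countMatches(Channel.fromPath('data.txt'), ['foo', 'bar'])
}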

Process & Directive

Real life example…

process extract {
    input:
    val x
    path y

    output:
    tuple val(x), path("${x}.txt")

    script:
    """
    cat ${y} | grep ${x} > ${x}.txt
    """
}
process manta {
    tag "${meta.id}"
    label 'manta'
    label 'highCpu'
    label 'highMem'
    
    input:
    tuple val(meta), path(tumor_bam), path(tumor_bai)
    path(fasta)
    path(fastaFai)
    
    output:
    tuple val(meta), path("*.vcf.gz"), path("*.vcf.gz.tbi"), emit:svVcf

    [...]
}

Process & Directive

Directive

  • Optional settings to customize process execution

  • Different kinds

    • General: tag, label, publishDir
    • Execution context: executor, clusterOptions, maxForks, queue…
    • Resources management: cpus, memory, time…
    • Software dependencies: conda, container, module, containerOptions…
    • Error/Debug: errorStrategy, maxErrors, maxRetries, debug, cache…
  • Dynamic values
    • closure
    • special variable “task”

process foo {
    executor 'slurm'
    queue { entries > 100 ? 'long' : 'short' }

    input:
    tuple val(entries), path(bam), path(bai)

    [...]
}

process foo {
    tag "$barcode"
    cpus 4
    memory '16 GB'

    input:
    tuple val(barcode), ...

    [...]
}

process foo {
    memory { task.attempt > 1 ? task.previousTrace.memory * 2 : (1.GB) }
    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
    maxRetries 3

    [...]

Workflow

Connect everything!

process foo {
    input:
    val x
    val y

    output:
    path "${y}.txt"

    script:
    """
    echo "${x} and ${y}" > ${y}.txt
    """
}
process bar {
    input:
    path x

    output:
    path "${x}.zip", emit: zf
    stdout emit: size

    script:
    """
    zip -q ${x}.zip ${x}
    stat -c%s ${x}
    """
}
workflow {
    u = Channel.value('1')
    v = Channel.of('a', 'b')
    foo(u, v)   // foo runs twice: the value channel is reused for each item of v
}
workflow {
    u = Channel.of(1)
    v = Channel.of('a', 'b')
    result = foo(u, v)   // foo runs once: the queue channel u is consumed after one item
    bar(result)
    // foo(u, v) | bar
}
workflow bouh {
    foo(1, 'b') | bar   // literal arguments are implicitly wrapped in value channels
    bar.out.size.view()
}

workflow {
    bouh()
}
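
Named workflows can also declare an explicit interface with take: and emit:; a minimal sketch reusing the foo and bar processes above (the workflow and channel names are illustrative):

workflow zipIt {
    take:
    inputCh

    main:
    foo(Channel.value('1'), inputCh)
    bar(foo.out)

    emit:
    zipped = bar.out.zf
}

workflow {
    zipIt(Channel.of('a', 'b'))
    zipIt.out.zipped.view()
}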

Configuration & Profile

  • Externalize many settings to a dedicated file: nextflow.config
  • At runtime, Nextflow looks for configuration files
    • $HOME/.nextflow/config
    • nextflow.config in projectDir, then launchDir
    • files specified using the -c <config-files> option from the command line
    • -C <config-files> <=> ignore all the other config files
  • Syntax
    • assignment and/or blocks; with scopes (i.e. namespaces)
    • mixing files: includeConfig 'path/extra.config'
params.runName = "default name"

process {
    executor = 'slurm'

    withLabel: big_mem {
        memory = 64.GB
        queue = 'long'
    }
}

report.enabled = true
  • Different scopes for different usages…
    • define default values for pipeline parameters: params
    • process directives: process
    • context: timeline, report, trace, workflow, dag
    • software dependencies: conda, apptainer, docker
    • cloud: aws, azure, google, tower

Profile <=> activate specific parts of configuration files from the command line
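
For instance, a hedged sketch of a profiles block (the profile names and values are illustrative):

profiles {
    clst {
        process.executor = 'slurm'
    }
    docker {
        docker.enabled = true
        process.container = 'ubuntu:22.04'
    }
}

// activate one or several profiles at launch:
// nextflow run main.nf -profile clst,docker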

Debug & Metrics

  • Main log file

    • .nextflow.log
  • Get more metrics with custom config

trace.enabled = true
timeline.enabled = true
report.enabled = true
dag.enabled = true
  • Place to debug

    • work directory
  • Useful options for debug

    • -resume (cache)

    • -stub-run (dry run)

Debug & Metrics

Enjoy the work directory!
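
Each task gets its own unique work subdirectory; alongside the task’s inputs and outputs, Nextflow writes hidden helper files there (an indicative listing, the hash path is made up):

$ ls -a work/a1/b2c3d4.../
.command.sh     # the actual script that was executed
.command.run    # wrapper script (environment, container, scheduler glue)
.command.out    # captured stdout
.command.err    # captured stderr
.command.log    # combined execution log
.exitcode       # exit status of the task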

Summary

$ nextflow -C custom.config run main.nf -profile sgu,clst --run G553 --bclDir /path

  • -C custom.config: nextflow option to load only this configuration file, ignoring all others

  • run: nextflow command

  • main.nf: main script

  • -profile sgu,clst: option of the run command to activate 2 profiles defined in configuration files

  • --run G553, --bclDir /path: pipeline parameters, overriding default values from params scope

Channel & Operator || Queue vs Value; be sure about the content of your channels


Process & Directive || Execution order based on input availability


Configuration & Debug || Make the work directory your friend

Conclusion

Key features

  • automatic parallelization

  • software dependencies: conda, docker, apptainer/singularity, etc…

  • high portability: scheduler (pbs, slurm, …) / cloud (google, aws, azure, etc…)

  • error management: error recovery with “-resume”, dry run with “-stub-run”, debug via the work directory files

Extra resources