{ "cells": [ { "cell_type": "markdown", "id": "c18a119f-ab4b-4f73-ac3e-7e0ba72cbf6b", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Introduction to Snakemake workflows" ] }, { "cell_type": "markdown", "id": "c4d94a31-f945-4967-82cb-35170feffc3e", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "**schedule:**\n", "- workflow introduction\n", "- snakemake introduction, rule concept\n", "- snakemake & snakefile\n", "- example with a 2-step workflow" ] }, { "cell_type": "markdown", "id": "e4455298-a049-4f8b-ad6f-adbe3f956650", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Workflow definition\n", "\n", "A set of commands, progressively linked by processing steps, leading from the input data to the results.\n", "\n", "_arrow: output of tool n−1 = input for tool n_\n", "\n", "With data parallelization, several data flows can be processed in parallel.\n", "\n", "With a multi-core PC or a computational cluster (e.g. 2000 cores), one or more cores can be assigned to each workflow." ] }, { "cell_type": "markdown", "id": "326231e7-0fcd-4cc4-a9d6-3a2e8f3a348a", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Workflow management systems\n", "\n", "Many workflow management systems exist, in many forms:\n", "- command line: shell (you need to script the parallelization yourself, which is not easy)\n", "- rule: \"snakemake\", \"c-make\", \"nextflow\", ...\n", "- graphical interface: \"Galaxy\", Taverna, Kepler, ...\n", "\n", "**pros:**
\n", "- reproducibility: keeps track of when and how each file was generated
\n", "- manages parallelization and error recovery\n", "\n", "**cons:**
\n", "- learning effort" ] }, { "cell_type": "markdown", "id": "22492525-eea3-4c7a-ba3c-c05e967c6f52", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "We choose:
\n", "**Snakemake**, which:\n", "\n", "- works on **files** (rather than streams, reading/writing from databases, or passing variables in memory)
\n", "- is based on **Python** (but knowing how to code in Python is not required)
\n", "- has features for defining the **environment** of each task (running a large number of small third-party tools is common in bioinformatics)
\n", "- is easily **scaled** from a single-core laptop to server, cluster, grid or cloud environments without modification (i.e. develop on a laptop using a small subset of data, then run the real analysis on a cluster)" ] }, { "cell_type": "markdown", "id": "4c05912d-7ca9-4086-aeba-81dd8b28f845", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## The Snakemake rule (1/2)\n", "\n", "Snakemake: a mix of the programming language Python (snake) and [Make](https://www.gnu.org/software/make/manual/), a rule-based automation tool\n", "\n", "Good practice: one step, one rule\n", "\n", "A rule is defined by its name and may contain **directives**:\n", "- `input:` lists one or more file names\n", "- `output:` lists one or more file names\n", "- command (`run:` for Python; `shell:` for shell, R, etc.)" ] }, { "cell_type": "markdown", "id": "85ebbbde-82f5-4ed5-875b-4d23165391cf", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## The Snakemake rule (2/2)\n", "\n", "```\n", "rule myRuleName:\n", "    input: myInFile\n", "    output: myOutputFile\n", "    shell: \"cat < {input} > {output}\"\n", "```\n", "\n", "Remark: for a single command line, use a `shell:` directive; for several command lines, use a `run:` directive with the Python `shell(\"...\")` function\n", "\n", "Optional directives can be added, e.g.: `params:`, `message:`, `log:`, `threads:`, ..."
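, "\n", "\n",
"A minimal sketch of a rule using some of these optional directives; the rule name, the file names and the `wc` command are hypothetical and only serve as an illustration:\n",
"```\n",
"# hypothetical rule: count the lines of one text file\n",
"rule countLines:\n",
"    input: \"data/sample.txt\"\n",
"    output: \"results/sample.count\"\n",
"    params: opt=\"-l\"                       # extra option passed to the command\n",
"    log: \"logs/countLines.log\"             # file receiving the error stream\n",
"    threads: 1                              # cores reserved for this job\n",
"    message: \"Counting lines of {input}\"   # text displayed when the job starts\n",
"    shell: \"wc {params.opt} {input} > {output} 2> {log}\"\n",
"```\n",
"In the command, `{params.opt}`, `{input}`, `{output}` and `{log}` are replaced by the values declared in the corresponding directives."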
] }, { "cell_type": "markdown", "id": "5a5b100e-4ab1-4311-b2ea-81e757af9331", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## The data flow linkage and rules order\n", "\n", "A snakemake workflow links rules through the file names in their `input:` and `output:` directives: the output of one rule is the input of the next.\n", "\n", "Snakemake rules order: the first rule is the default target rule and specifies the result files\n", "\n", "Snakemake creates a **DAG** (directed acyclic graph) corresponding to the rules linkage" ] }, { "cell_type": "markdown", "id": "c08211e8-64ca-4d54-977e-7bd78a7d8470", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Snakemake run options\n", "\n", "- `-s mySmk` to use a snakefile other than the default `Snakefile`\n", "- dry-run, do not execute anything, display what would be done: `-n --dryrun`
\n", "- print the shell command: `-p --printshellcmds`
\n", "- print the reason for each rule execution: `-r --reason`
\n", "- print a detailed summary of the files created by the workflow and their status: `-D --detailed-summary`
\n", "- limit the number of jobs in parallel: `-j 1` (cores: `-c 1`)
\n", "\n", "[all Snakemake options](https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options)" ] }, { "cell_type": "markdown", "id": "89f2540c-3afb-47ac-b30f-24a6efeff97e", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Snakemake output options\n", "\n", "- to automatically create HTML reports (`--report report.html`) with runtime statistics, a visualization of the workflow topology, used software and data provenance information (need to add the `jinja2` package as a dependency)
\n", "- use the `--archive` option (need git) to save your project\n", "- complete workflow (`--dag`) or rules dependencies (`--rulegraph`) visualizations (with the `dot` tool of the `graphviz` package):\n", "```\n", "snakemake --dag -s mySmk | dot -Tpng > mySmk_dag.png\n", "snakemake --rulegraph -s mySmk | dot -Tpng > mySmk_rule.png\n", "```\n", "\"DAG\" \"rules\"" ] }, { "cell_type": "markdown", "id": "a287263a-a1dd-4552-abfd-2def6e2d1e9e", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Snakemake environment options\n", "\n", "Snakemake supports environments on a per-rule basis (created & activated on the fly):
\n", "\n", "**conda:**
\n", "- add a `conda:` directive in the rule definition (_e.g._ `conda: \"myCondaEnvironment.yml\"`)
\n", "- run Snakemake with the `--use-conda` option
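\n", "\n",
"As a sketch, a rule with its own conda environment could look as follows; the paths and the file `envs/fastqc.yml` (assumed to list the `bioconda` channel and a `fastqc` dependency) are hypothetical:\n",
"```\n",
"# hypothetical rule: FastQC runs inside the environment described by envs/fastqc.yml\n",
"rule fastqc:\n",
"    input: \"Data/{sample}.fastq.gz\"\n",
"    output: \"FastQC/{sample}_fastqc.zip\"\n",
"    conda: \"envs/fastqc.yml\"\n",
"    shell: \"fastqc --outdir FastQC {input}\"\n",
"```\n",
"The environment is only created and activated when Snakemake is started with `--use-conda`.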
\n", "\n", "**docker:**\n", "- add a `container:` directive in the rule definition (_eg._ `container: \"docker://biocontainers/fastqc\"`)
\n", "- run Snakemake with the `--use-singularity` and `--singularity-args \"-B /path/outside/container/:/path/inside/container/\"` options
\n", "\n", "**module:**
\n", "- add an `envmodules:` directive in the rule definition (_e.g._ `envmodules: \"fastqc/0.11.9\"`)
\n", "- run Snakemake with the `--use-envmodules` option" ] }, { "cell_type": "markdown", "id": "5ecf402c-a3ce-4106-b936-27e843d97435", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Get a Snakefile\n", "\n", "The **snakefile** is the **text file** that encodes the rules, and so the workflow.
\n", "The command `snakemake` runs the workflow encoded in the `Snakefile` file.\n", "\n", "You can get a snakefile:
\n", "- from github, your colleagues, ...
\n", "- snakemake \"core\" ([nf-core](https://nf-co.re) equivalent): https://snakemake.github.io/snakemake-workflow-catalog/ (2k pipelines, 177 tested)
\n", "- compose with [snakemake wrappers](https://snakemake-wrappers.readthedocs.io/)
\n", "- by using a Nextflow workflow! (integration via snakemake-wrappers)
\n", "- create from scratch
\n", "\n", "To run the workflow for a single target file, give the name of the file to produce: `snakemake myOutputFile`" ] }, { "cell_type": "markdown", "id": "7ff4afc8-9456-4032-ac20-4ddc6f0bbb7e", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Snakefile: input (output) specifications\n", "enumerated:\n", "```\n", "rule all:\n", "    input: \"P10415.fasta\", \"P01308.fasta\"\n", "```\n", "\n", "python list & wildcards:\n", "```\n", "DATASETS = [\"P10415\", \"P01308\"]\n", "rule all:\n", "    input: [\"{dataset}.fasta\".format(dataset=dataset) for dataset in DATASETS]\n", "```\n", "\n", "expand() & wildcards:\n", "```\n", "DATASETS = [\"P10415\", \"P01308\"]\n", "rule all:\n", "    input: expand(\"{dataset}.fasta\", dataset=DATASETS)\n", "```" ] }, { "cell_type": "markdown", "id": "6cb95bbd-966b-4810-a7d1-99653a41db16", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Snakefile: generalization with wildcards\n", "Snakemake _wildcards_ replace parts of file names. They:\n", "- reduce hardcoding: more flexible input and output directives, work on new data without modification\n", "- are automatically resolved (i.e. replaced by the regular expression \".+\" in file names)\n", "- are written inside `{}`\n", "- are specific to a rule\n", "\n", "The same file can be matched by different patterns:
\n", "E.g. with the file `101/file.A.txt`:
\n", "rule one : `output : \"{set}1/file.{grp}.txt\" # set=10, grp=A`
\n", "rule two : `output : \"{set}/file.A.{ext}\" # set=101, ext=txt`
\n", "(more on [wildcards](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards) in the snakemake documentation)" ] }, { "cell_type": "markdown", "id": "5cda3e9f-c631-47ef-ba6e-61e05fd195c7", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "### With and without wildcards example\n", "\n", "without wildcards, `uniprot_wow.smk`:\n", "```\n", "rule get_prot:\n", "    output: \"P10415.fasta\", \"P01308.fasta\"\n", "    run:\n", "        shell(\"wget https://www.uniprot.org/uniprot/P10415.fasta\")\n", "        shell(\"wget https://www.uniprot.org/uniprot/P01308.fasta\")\n", "```\n", "\n", "with wildcards, `uniprot_wiw.smk`:\n", "```\n", "rule all:\n", "    input: \"P10415.fasta\", \"P01308.fasta\"\n", "\n", "rule get_prot:\n", "    output: \"{prot}.fasta\"\n", "    shell: \"wget https://www.uniprot.org/uniprot/{wildcards.prot}.fasta\"\n", "```" ] }, { "cell_type": "markdown", "id": "ce5c36a3-5383-4d5e-97c1-dee60969a0f7", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Snakefile: get input file names from the file system\n", "\n", "To deduce the identifiers (e.g. IDs) of files in a directory, use the built-in `glob_wildcards` function, e.g.:\n", "```\n", "IDs, = glob_wildcards(\"dirpath/{id}.txt\")\n", "```\n", "`glob_wildcards()` matches the given pattern against the files present in the system and thereby infers the values for all wildcards in the pattern (`{id}` here).\n", "\n", "**Hint:** Don’t forget the **comma** after the variable name (left-hand side, `IDs` here)."
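, "\n", "\n",
"A minimal sketch combining `glob_wildcards()` with `expand()`; the `results/` directory, the rule name and the `sort` command are hypothetical:\n",
"```\n",
"# glob_wildcards() returns one list per wildcard; the trailing comma unpacks the single list\n",
"IDs, = glob_wildcards(\"dirpath/{id}.txt\")\n",
"\n",
"# request one result file per identifier found in dirpath/\n",
"rule all:\n",
"    input: expand(\"results/{id}.sorted.txt\", id=IDs)\n",
"\n",
"# hypothetical processing step applied to each file\n",
"rule sortFile:\n",
"    input: \"dirpath/{id}.txt\"\n",
"    output: \"results/{id}.sorted.txt\"\n",
"    shell: \"sort {input} > {output}\"\n",
"```"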
] }, { "cell_type": "markdown", "id": "eadec312-f6db-4146-8cda-085f167fd04c", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Snakefile: Using a snakemake config file\n", "\n", "An (optional) config file makes it possible to parameterize the workflow as needed (`--configfile smk_config.yml`)\n", "\n", "## Subworkflows or Modules\n", "\n", "It is also possible to define external workflows as modules, from which rules can be used by explicitly “importing” them." ] }, { "cell_type": "markdown", "id": "9f1fb3a4-26d8-44ae-bbfc-ed585452da50", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## The workflow example\n", "\n", "We want to process RNAseq data with a (small) workflow with 2 steps: FastQC, then MultiQC.
\n", "\n", "A _classical_ analysis uses `fastq.gz` data (in the `${PWD}/Data` directory) and creates a `${PWD}/FastQC` directory.
" ] }, { "cell_type": "markdown", "id": "03354c4b-263a-461c-83c6-ab8f6c223142", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Translation in snakefile\n", "
\n", "3 linked rules: fastQC, multiQC, all.
\n", "Wildcard: the rule applies to one file at a time (`*` in the figure)" ] }, { "cell_type": "markdown", "id": "452900bf-c28e-4085-a9d5-996a11e43e0b", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "Snakemake creates the DAG from the snakefile
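\n", "\n",
"A possible sketch of such a snakefile with the 3 linked rules; the paths and commands are assumptions for illustration, not the content of `smk_all_samples.smk`:\n",
"```\n",
"# sample names deduced from the fastq.gz files (assumed to be in Data/)\n",
"SAMPLES, = glob_wildcards(\"Data/{sample}.fastq.gz\")\n",
"\n",
"# target rule: asks for the final report\n",
"rule all:\n",
"    input: \"multiqc_report.html\"\n",
"\n",
"# one job per fastq.gz file, thanks to the {sample} wildcard\n",
"rule fastQC:\n",
"    input: \"Data/{sample}.fastq.gz\"\n",
"    output: \"FastQC/{sample}_fastqc.zip\", \"FastQC/{sample}_fastqc.html\"\n",
"    shell: \"fastqc --outdir FastQC {input}\"\n",
"\n",
"# a single job that gathers all the FastQC zip files\n",
"rule multiQC:\n",
"    input: expand(\"FastQC/{sample}_fastqc.zip\", sample=SAMPLES)\n",
"    output: \"multiqc_report.html\"\n",
"    shell: \"multiqc {input}\"\n",
"```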
" ] }, { "cell_type": "markdown", "id": "31793217-81d1-4ff6-b494-ff229b12f844", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "Snakemake launch: the rule all needs `multiqc_report.html`, which doesn't exist yet but is an output of the multiQC rule
" ] }, { "cell_type": "markdown", "id": "5d3e437b-3426-4ace-94e8-e32c877acc54", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "The rule multiQC needs zip files, which don't exist yet but are outputs of the fastQC rule" ] }, { "cell_type": "markdown", "id": "b447b236-bb7c-463e-90d5-07380ab54a85", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "The rule fastQC needs fastq.gz files
" ] }, { "cell_type": "markdown", "id": "b3468713-d958-465a-a8d9-e295b92313ce", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "The fastq.gz files exist, so snakemake stops going backward, moves forward through the flow and executes the fastQC rule.
" ] }, { "cell_type": "markdown", "id": "cfb2c813-f94f-427b-b317-8302e8f6663b", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "There are 3 sequence files, so snakemake launches the fastQC rule 3 times
" ] }, { "cell_type": "markdown", "id": "69500618-7c17-4942-8a8c-5441a912da77", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "After 3 executions of the fastQC rule, the zip files exist and feed the multiQC rule.
" ] }, { "cell_type": "markdown", "id": "9298c34b-bc76-4c27-adf0-c125c3dfef98", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "The multiqc_report is the input file of the rule all.
" ] }, { "cell_type": "markdown", "id": "8838cb04-b761-45fe-a5ba-a08a7c61e0fc", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Running path\n", "\n", "So the rule all is completed, and so is the workflow.
" ] }, { "cell_type": "markdown", "id": "76f0db65-52af-477d-a9aa-98be24ac806f", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Timestamp\n", "\n", "Snakemake automatically makes sure that everything is up to date; otherwise it launches the jobs that need to be re-run:\n", "\n", "- output files have to be re-created when the input file timestamp is newer than the output file timestamp
\n", "- and from this point, Snakemake goes on through the workflow and applies rules
\n", "\n", "**note:** in recent snakemake versions, _everything_ includes mtime, params, input, software-env and code (restrict this with the `--rerun-triggers` option)" ] }, { "cell_type": "markdown", "id": "e51eace3-65c4-44b7-847b-3d8283380da3", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## A Snakefile example\n", "\n", "The final objective is to create a snakefile to manage a small workflow with 2 steps: i) fastqc ii) multiqc\n", "\n", "These 2 tools (bioinformatics domain) are used to check the quality of NGS data." ] }, { "cell_type": "markdown", "id": "647bfb62-1299-44b5-862d-1a25becaabd5", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "### prerequisite 1: data\n", "\n", "The input data to run the workflow example are reduced RNASeq read files. Download (`wget`) the data from [zenodo here](https://zenodo.org/record/3997237): get the URL from the _download_ button, then `gunzip` and `tar -x`" ] }, { "cell_type": "code", "execution_count": null, "id": "84c8f1a0-a950-42ea-a22f-d3de8b4bc04c", "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "%%sh\n", "# on IFB cluster\n", "cd ${PWD}\n", "wget -q https://zenodo.org/record/3997237/files/FAIR_Bioinfo_data.tar.gz\n", "gunzip FAIR_Bioinfo_data.tar.gz\n", "tar -xvf FAIR_Bioinfo_data.tar\n", "rm FAIR_Bioinfo_data*" ] }, { "cell_type": "markdown", "id": "20051a6a-c31c-4b80-b469-d7e8928fadd5", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "### prerequisite 2: snakefile\n", "\n", "`smk_all_samples.smk`, get it from [FAIR_smk](https://github.com/clairetn/FAIR_smk)" ] }, { "cell_type": "code", "execution_count": null, "id": "0424e490-963c-4989-a8b3-0f14de8ad199", "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "%%sh\n", "# on IFB cluster\n", "cd ${PWD}\n", "git clone https://github.com/clairetn/FAIR_smk"
] }, { "cell_type": "markdown", "id": "9b38bec5-b8dc-4a35-a38d-b84ab3512249", "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "### prerequisite 3: conda environment\n", "\n", "Only when using conda: use `envfair.yml` from the cloned `FAIR_smk` repository to create the conda environment" ] }, { "cell_type": "markdown", "id": "e5b2a460-04fd-43f1-b461-f37636226fe8", "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [] }, "source": [ "### prerequisite 4: Snakemake\n", "\n", "Laptop with docker:\n", "```\n", "# first get the docker image archive save_jupylab_smk.tar\n", "docker load < save_jupylab_smk.tar # create the docker image\n", "docker run --rm -v ${PWD}:/home/jovyan -w /home/jovyan --user \"$(id -u):$(id -g)\" -p 8888:8888 test/jupylab_smk:1.0\n", "```\n", "Laptop with conda:\n", "```\n", "conda env create -f envfair.yml\n", "conda activate envfair\n", "```\n", "IFB core cluster (_version 7.8.2 of the docker container is not available_):\n", "```\n", "module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12\n", "```\n", "check with: `snakemake --version`" ] }, { "cell_type": "markdown", "id": "6bc8d354-2adf-49c5-b3ae-a787902a036f", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "### prerequisite 4: Snakemake & tools\n", "\n", "with `module load` on IFB core cluster (_version 7.8.2 of the docker container is not available_):\n", "```\n", "module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12\n", "```\n", "check with: `snakemake --version`" ] }, { "cell_type": "code", "execution_count": null, "id": "028f1583-9f4e-40f8-8904-b0db6d167168", "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "%%sh\n", "# on IFB cluster\n", "cd ${PWD}\n", "module load snakemake/7.7.0\n", "snakemake --version " ] }, { "cell_type": "markdown", "id": "692e7742-4215-4abb-b67f-e286dbc03c02", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ 
"### run the workflow\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "855a806f-630f-4cc7-a338-a1327444ecc2", "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "%%sh\n", "# on IFB cluster\n", "cd ${PWD}\n", "module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12\n", "snakemake -s FAIR_smk/smk_all_samples.smk \\\n", " --configfile FAIR_smk/smk_config.yml -c1 -p" ] }, { "cell_type": "markdown", "id": "4fbb58ad-39fd-46d0-9893-c375a2227f66", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Conclusion\n", "\n", "With snakemake, you can launch the same snakefile (adapting the snakemake config file) on your laptop or on a computing cluster.\n", "\n", "\n", "Other ressources:\n", "- a formation to [create the workflow step-by-step](https://moodle.france-bioinformatique.fr/mod/resource/view.php?id=68)\n", "- the workflow composed with snakemake wrappers cf. [ex1_o8_wrapper_linted.smk](https://github.com/clairetn/FAIR_smk)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 5 }