{
"cells": [
{
"cell_type": "markdown",
"id": "c18a119f-ab4b-4f73-ac3e-7e0ba72cbf6b",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"# Introduction to Snakemake workflow\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "c4d94a31-f945-4967-82cb-35170feffc3e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"**schedule:**\n",
"- workflow introduction\n",
"- snakemake introduction, rule concept\n",
"- snakemake & snakefile\n",
"- example with a 2-steps workflow"
]
},
{
"cell_type": "markdown",
"id": "e4455298-a049-4f8b-ad6f-adbe3f956650",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Workflow definition\n",
"\n",
"a pool of commands, progressively linked by the treatments, from the input data towards the results:\n",
"\n",
"\n",
"\n",
"_arrow: output of tool n−1 = input for tool n_\n",
"\n",
"In case of data paralelization, several data flows can be processed in parallel:\n",
"\n",
"\n",
"\n",
"With a multi-cores PC or a computational cluster (ex. 2000 cores), one (or more) core can be attributed to one workflow."
]
},
{
"cell_type": "markdown",
"id": "326231e7-0fcd-4cc4-a9d6-3a2e8f3a348a",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Workflow management systems\n",
"\n",
"Many workflow management systems, many forms:\n",
"- command line: shell (need to script allelization alone, not easy)\n",
"- rule: , , , ...\n",
"- graphic interface: , Taverna, Keppler, ...\n",
"\n",
"**pros:**
\n",
"- reproducibility: keep track (when file was generated & how)
\n",
"- manage parallelization (error recovery)\n",
"\n",
"**cons:**
\n",
"- learning effort"
]
},
{
"cell_type": "markdown",
"id": "22492525-eea3-4c7a-ba3c-c05e967c6f52",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"We choose:
\n",
"\n",
"\n",
"- works on **files** (rather than streams, reading/writing from databases, or passing variables in memory)
\n",
"- is based on **Python** (but know how to code in Python is not required)
\n",
"- has features for defining the **environment** for each task (running a large number of small third-party tools is current in bioinformatics)
\n",
"- is easily to be **scaled** from desktop to server, cluster, grid or cloud environments without modification from your single core laptop (ie. develop on laptop using a small subset of data, run the real analysis on a cluster) "
]
},
{
"cell_type": "markdown",
"id": "4c05912d-7ca9-4086-aeba-81dd8b28f845",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## The Snakemake rule (1/2)\n",
"\n",
"Snakemake: mix of the programming language Python (snake) and the [Make](https://www.gnu.org/software/make/manual/), a rule-based automation tool \n",
"\n",
"Good practice: one step, one rule\n",
"\n",
"\n",
"\n",
"A rule is defined by it name and may contain **directives**:\n",
"- `input:` list one or more file names\n",
"- `output:` list one or more file names\n",
"- command (`run:` for python ; `shell:` for shell, R, etc)"
]
},
{
"cell_type": "markdown",
"id": "85ebbbde-82f5-4ed5-875b-4d23165391cf",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## The Snakemake rule (2/2)\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"```\n",
"rule myRuleName:\n",
" input: myInFile\n",
" output: myOutputFile\n",
" shell: \"cat < {input} > {output}\"\n",
"```\n",
"\n",
"Remark: with 1 command line, use a `shell:` directive ; with many command lines, use a `run:` directive with the python `shell(”...”)` function\n",
"\n",
"Optional directives can be added, eg.: `params:`, `message:`, `log:`, `threads:`, ..."
]
},
{
"cell_type": "markdown",
"id": "5a5b100e-4ab1-4311-b2ea-81e757af9331",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## The data flow linkage and rules order\n",
"\n",
"A snakemake workflow links rules thank to the filenames of the rule input and output directives.\n",
"\n",
" \n",
"\n",
"Snakemake rules order: the first rule is the default target rule and specifies the result files\n",
"\n",
"Snakemake creates a **DAG** (directed acyclic graph) corresponding to the rules linkage"
]
},
{
"cell_type": "markdown",
"id": "c08211e8-64ca-4d54-977e-7bd78a7d8470",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Snakemake run options\n",
"\n",
"- `-s mySmk` to change the default snakefile name\n",
"- dry-run, do not execute anything, display what would be done: `-n --dryrun`
\n",
"- print the shell command: `-p --printshellcmds`
\n",
"- print the reason for each rule execution: `-r --reason`
\n",
"- print a summary and status of rule: `-D`
\n",
"- limit the number of jobs in parallel: `-j 1` (cores: `-c 1`)
\n",
"\n",
"[all Snakemake options](https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options)"
]
},
{
"cell_type": "markdown",
"id": "89f2540c-3afb-47ac-b30f-24a6efeff97e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Snakemake output options\n",
"\n",
"- to automatically create HTML reports (`--report report.html`) with runtime statistics, a visualization of the workflow topology, used software and data provenance information (need to add the `jinja2` package as a dependency)
\n",
"- use the `--archive` option (need git) to save your project\n",
"- complete workflow (`--dag`) or rules dependencies (`--rulegraph`) visualizations (with the `dot` tool of the `graphviz` package):\n",
"```\n",
"snakemake --dag -s mySmk | dot -Tpng > mySmk_dag.png\n",
"snakemake --rulegraph -s mySmk | dot -Tpng > mySmk_rule.png\n",
"```\n",
" "
]
},
{
"cell_type": "markdown",
"id": "a287263a-a1dd-4552-abfd-2def6e2d1e9e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Snakemake environment options\n",
"\n",
"Snakemake supports environments on a per-rule basis (created & activated on the fly):
\n",
"\n",
"**conda:**
\n",
"- add a `conda:` directive in the rule definition (_eg._ `conda: myCondaEnvironment.yml`)
\n",
"- run Snakemake with the `--use-conda` option
\n",
"\n",
"**docker:**\n",
"- add a `container:` directive in the rule definition (_eg._ `container: \"docker://biocontainers/fastqc\"`)
\n",
"- run Snakemake with the `--use-singularity` and `--singularity-args \"-B /path/outside/container/:/path/inside/container/\"` options
\n",
"\n",
"**module:**
\n",
"- add a `envmodules:` directive in the rule definition (_eg._ `envmodules: \"fastqc/0.11.9\"`)
\n",
"- run Snakemake with the `--use-envmodules` option"
]
},
{
"cell_type": "markdown",
"id": "5ecf402c-a3ce-4106-b936-27e843d97435",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Get a Snakefile\n",
"\n",
"The **snakefile** is the **text file** that encodes the rules, and so the workflow.
\n",
"The command `snakemake` runs the workflow encoded in the `Snakefile` file.\n",
"\n",
"You can get a snakefile:
\n",
"- from github, your colleagues, ...
\n",
"- snakemake \"core\" ([nf-core](https://nf-co.re) equivalent) : https://snakemake.github.io/snakemake-workflow-catalog/ (2k pipelines, 177 testés)
\n",
"- compose with [snakemake wrappers](https://snakemake-wrappers.readthedocs.io/)
\n",
"- by using a Nextflow workflow! (integration via snakemake-wrappers)
\n",
"- create from scratch
\n",
"\n",
"To run the workflow for one input: `snakemake myInFile`"
]
},
{
"cell_type": "markdown",
"id": "7ff4afc8-9456-4032-ac20-4ddc6f0bbb7e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Snakefile: input (output) specifications\n",
"enumerated:\n",
"```\n",
"rule all:\n",
" input: \"P10415.fasta\",\"P01308.fasta\"\n",
"```\n",
"\n",
"python list & wildcards:\n",
"```\n",
"DATASETS=[\"P10415\",\"P01308\"]\n",
"rule all:\n",
" input: [\"{dataset}.fasta\".format(dataset=dataset) for dataset in DATASETS]\n",
"```\n",
"\n",
"expand() & wildcards:\n",
"```\n",
"DATASETS=[\"P10415\",\"P01308\"]\n",
"rule all:\n",
" input: expand(\"{dataset}.fasta\",dataset=DATASETS)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "6cb95bbd-966b-4810-a7d1-99653a41db16",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Snakefile: generalization with wilcards\n",
"Snakemake use _wildcards_ allow to replace parts of filename:\n",
"- reduce hardcoding: more flexible input and output directives, work on new data without modification\n",
"- are automatically resolved (ie. replaced by regular expression \".+\" in filenames)\n",
"- are writing into {}\n",
"- are specific to a rule\n",
"\n",
"A same file can be accessed by different matchings:
\n",
"Ex. with the file `101/file.A.txt` :
\n",
"rule one : `output : \"{set}1/file.{grp}.txt\" # set=10, grp=A`
\n",
"rule two : `output : \"{set}/file.A.{ext}\" # set=101, ext=txt`
\n",
"(more on [wildcards](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards) in the snakemake documentation)"
]
},
{
"cell_type": "markdown",
"id": "5cda3e9f-c631-47ef-ba6e-61e05fd195c7",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### With and without wildcards example\n",
"\n",
"without wildcards, `uniprot_wow.smk`:\n",
"```\n",
"rule get_prot:\n",
" output: \"P10415.fasta\", \"P01308.fasta\"\n",
" run :\n",
" shell(\"wget https://www.uniprot.org/uniprot/P10415.fasta\")\n",
" shell(\"wget https://www.uniprot.org/uniprot/P01308.fasta\")\n",
"```\n",
"\n",
"with wildcards, `uniprot_wiw.smk`:\n",
"```\n",
"rule all:\n",
" input: \"P10415.fasta\", \"P01308.fasta\"\n",
"\n",
"rule get_prot:\n",
" output: \"{prot}.fasta\"\n",
" shell: \"wget https://www.uniprot.org/uniprot/{wildcards.prot}.fasta\"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "ce5c36a3-5383-4d5e-97c1-dee60969a0f7",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Snakefile: get input file names from the file system\n",
"\n",
"To deduce the identifiers (eg. IDs) of files in a directory, use the inbuilt `glob_wildcards` function, eg.:\n",
"```\n",
"IDs, = glob_wildcards(\"dirpath/{id}.txt\")\n",
"```\n",
"`glob_wildcards()` matches the given pattern against the files present in the system and thereby infers the values for all wildcards in the pattern (`{id}` here).\n",
"\n",
"**Hint:** Don’t forget the **coma** after the name (left hand side, IDs here)."
]
},
{
"cell_type": "markdown",
"id": "eadec312-f6db-4146-8cda-085f167fd04c",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Snakefile: Using a snakemake config file\n",
"\n",
"The (optional) definition of a configfile allows to parameterize the workflow as needed (`--configfile smk_config.yml`)\n",
"\n",
"## Subworkflows or Modules\n",
"\n",
"It is also possible to define external workflows as modules, from which rules can be used by explicitly “importing” them."
]
},
{
"cell_type": "markdown",
"id": "9f1fb3a4-26d8-44ae-bbfc-ed585452da50",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## The workflow example\n",
"\n",
"We want to manage RNAseq data with a (small) workflow with 2 steps:
\n",
"\n",
"\n",
"A _classical_ analysis with `fastq.gz` data (in the `${PWD}/Data` repository) and the creation of a `${PWD}/FastQC` repository gives:
\n",
""
]
},
{
"cell_type": "markdown",
"id": "03354c4b-263a-461c-83c6-ab8f6c223142",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Translation in snakefile\n",
"\n",
"
\n",
"
\n",
"3 linked rules : fastQC, multiQC, all.
\n",
"Wildcard: rule concerns one file (`*` in figure)"
]
},
{
"cell_type": "markdown",
"id": "452900bf-c28e-4085-a9d5-996a11e43e0b",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"Snakemake create the DAG from the snakefile
\n",
""
]
},
{
"cell_type": "markdown",
"id": "31793217-81d1-4ff6-b494-ff229b12f844",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"Snakemake launch: the rule all need `multiqc_report.html` that doesn't exist but links towards the multiQC rule
\n",
""
]
},
{
"cell_type": "markdown",
"id": "5d3e437b-3426-4ace-94e8-e32c877acc54",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"The rule multiQC need zip files that doesn't exist but links towards the fastQC rule\n",
""
]
},
{
"cell_type": "markdown",
"id": "b447b236-bb7c-463e-90d5-07380ab54a85",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"The rule fastQC need fastq.gz files
\n",
""
]
},
{
"cell_type": "markdown",
"id": "b3468713-d958-465a-a8d9-e295b92313ce",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"fastq.gz files exists, snakemake stops ascending to forward the flow and execute the fastQC rule.
\n",
""
]
},
{
"cell_type": "markdown",
"id": "cfb2c813-f94f-427b-b317-8302e8f6663b",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"There are 3 sequence files so snakemake launch 3 fastQC rules
\n",
""
]
},
{
"cell_type": "markdown",
"id": "69500618-7c17-4942-8a8c-5441a912da77",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"After 3 executions of the fastQC rule, the zip files exist and feed the multiQC rule.
\n",
""
]
},
{
"cell_type": "markdown",
"id": "9298c34b-bc76-4c27-adf0-c125c3dfef98",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"the multiqc_report constitutes the input file of the rule all:
\n",
""
]
},
{
"cell_type": "markdown",
"id": "8838cb04-b761-45fe-a5ba-a08a7c61e0fc",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Running path\n",
"\n",
"So the rule all is completed, and the workflow too:
\n",
""
]
},
{
"cell_type": "markdown",
"id": "76f0db65-52af-477d-a9aa-98be24ac806f",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Timestamp\n",
"\n",
"Snakemake automatically makes sure that everything is up to date, otherwise it launch the jobs that need to be:\n",
"\n",
"- output files have to be re-created when the input file timestamp is newer than the output file timestamp
\n",
"- and from this point, Snakemake goes on through the workflow and applies rules
\n",
"\n",
"\n",
"\n",
"**note:** in last snakemake versions, _everything_ includes mtime, params, input, software-env, code (fix with the `--rerun-triggers` option)"
]
},
{
"cell_type": "markdown",
"id": "e51eace3-65c4-44b7-847b-3d8283380da3",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## A Snakefile example\n",
"\n",
"The final objective is to create a snakefile to manage a small workflow with 2 steps: i) fastqc ii) multiqc \n",
"\n",
"\n",
"\n",
"These 2 tools (boinformatics domain) allow to check the quality of NGS data. "
]
},
{
"cell_type": "markdown",
"id": "647bfb62-1299-44b5-862d-1a25becaabd5",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### prerequisite 1: data\n",
"\n",
"input data to run the workflow example are reduced RNASeq reads files. Download (`wget`) data from [zenodo here](https://zenodo.org/record/3997237): get url on the _download_ button, next `gunzip` and `tar -x`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84c8f1a0-a950-42ea-a22f-d3de8b4bc04c",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"%%sh\n",
"# on IFB cluster\n",
"cd ${PWD}\n",
"wget -q https://zenodo.org/record/3997237/files/FAIR_Bioinfo_data.tar.gz\n",
"gunzip FAIR_Bioinfo_data.tar.gz\n",
"tar -xvf FAIR_Bioinfo_data.tar\n",
"rm FAIR_Bioinfo_data*"
]
},
{
"cell_type": "markdown",
"id": "20051a6a-c31c-4b80-b469-d7e8928fadd5",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### prerequisite 2: snakefile\n",
"\n",
"`smk_all_samples.smk`, get it from [FAIR_smk](https://github.com/clairetn/FAIR_smk)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0424e490-963c-4989-a8b3-0f14de8ad199",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"%%sh\n",
"# on IFB cluster\n",
"cd ${PWD}\n",
"git clone https://github.com/clairetn/FAIR_smk"
]
},
{
"cell_type": "markdown",
"id": "9b38bec5-b8dc-4a35-a38d-b84ab3512249",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"### prerequisite 3: conda environment\n",
"\n",
"only if use with conda: use `envfair.yml` in the `FAIR_smk` repository cloned to create the conda environment"
]
},
{
"cell_type": "markdown",
"id": "e5b2a460-04fd-43f1-b461-f37636226fe8",
"metadata": {
"slideshow": {
"slide_type": "skip"
},
"tags": []
},
"source": [
"### pre-requite 4: Snakemake\n",
"\n",
"Laptop with docker:\n",
"```\n",
"save_jupylab_smk.tar # get the docker image archive\n",
"docker load < save_jupylab_smk.tar # create the docker image\n",
"docker run --rm -v ${PWD}:/home/jovyan -w /home/jovyan --user \"$(id -u):$(id -g)\" -p 8888:8888 test/jupylab_smk:1.0\n",
"```\n",
"Laptop with conda:\n",
"```\n",
"conda create env -f envfair.yml\n",
"conda activate envfair\n",
"```\n",
"IFB core cluster (_version 7.8.2 of the docker container is not available_):\n",
"```\n",
"module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12\n",
"```\n",
"check with: `snakemake --version`"
]
},
{
"cell_type": "markdown",
"id": "6bc8d354-2adf-49c5-b3ae-a787902a036f",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### prerequisite 4: Snakemake & tools\n",
"\n",
"with `module load` on IFB core cluster (_version 7.8.2 of the docker container is not available_):\n",
"```\n",
"module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12\n",
"```\n",
"check with: `snakemake --version`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "028f1583-9f4e-40f8-8904-b0db6d167168",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"%%sh\n",
"# on IFB cluster\n",
"cd ${PWD}\n",
"module load snakemake/7.7.0\n",
"snakemake --version "
]
},
{
"cell_type": "markdown",
"id": "692e7742-4215-4abb-b67f-e286dbc03c02",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### run the workflow\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "855a806f-630f-4cc7-a338-a1327444ecc2",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"%%sh\n",
"# on IFB cluster\n",
"cd ${PWD}\n",
"module load snakemake/7.7.0 fastqc/0.11.9 multiqc/1.12\n",
"snakemake -s FAIR_smk/smk_all_samples.smk \\\n",
" --configfile FAIR_smk/smk_config.yml -c1 -p"
]
},
{
"cell_type": "markdown",
"id": "4fbb58ad-39fd-46d0-9893-c375a2227f66",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Conclusion\n",
"\n",
"With snakemake, you can launch the same snakefile (adapting the snakemake config file) on your laptop or on a computing cluster.\n",
"\n",
"\n",
"Other ressources:\n",
"- a formation to [create the workflow step-by-step](https://moodle.france-bioinformatique.fr/mod/resource/view.php?id=68)\n",
"- the workflow composed with snakemake wrappers cf. [ex1_o8_wrapper_linted.smk](https://github.com/clairetn/FAIR_smk)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}