QC - Roscoff 2022 - HackMD

<style>
div > .slides{ width: 80% !important; }
h1, h2, h3, h4{ color:#4b898d !important; }
div{ --r-main-font-size: 35px; }
cite{ font-size:0.5em; }
</style>

## Read quality control

_2024-05-21_

**Olivier Rué** **Christophe Klopp**

_EBAII 2024 - Genome assembly school_

![](https://hackmd.io/_uploads/rydQAbWwc.png =x70) ![](https://maiage.inrae.fr/sites/default/files/Logo_MaIAGE.png) ![](https://i.imgur.com/Aed1qRi.png =x80) ![](https://i.imgur.com/ucEa8FB.png =x80) ![](https://genoweb.toulouse.inra.fr/~klopp/Sigenae/sigenae.png =x60)

---

#### The truth about bioinformatics

![](https://hackmd.io/_uploads/ryUN7QA1i.png =x400)

<cite>https://training.galaxyproject.org/training-material/topics/assembly/tutorials/get-started-genome-assembly/slides.html#4</cite>

---

#### QC is the first step of any sequence analysis

![](https://hackmd.io/_uploads/HJyHPg6Jo.png =x400)

<cite>https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#7</cite>

---

#### QC is the first step for all sequence analyses

* Seems to be one of the easiest steps in bioinformatics (because it is standard) :sunglasses: ...
* ... but it is one of the most important :warning:
* Know what to expect so that you can check whether everything is OK
* It indicates how to clean the reads when needed
* It reveals possible sequencing problems
* Not all possible problems are well documented: manufacturers prefer the bright side
* QC results must be interpreted in light of what has been sequenced :thinking_face:

---

#### Read characteristics

* length (fixed or variable, range,...)
* nucleotide content:
    * biological sample
    * technical artifacts (primer, adapter, tag, vector, restriction site,...)
    * contamination
    * organelles (chloroplast, mitochondria,...)
* Average error rate
* Error rate profile (along the reads,...)
* randomness
    * GC (read GC content ~ average genome GC content)
    * kmer content

---

#### Reads are not perfect (error rate profile)

![](https://hackmd.io/_uploads/HJp78Vkgo.png =x200) ![](https://hackmd.io/_uploads/SJY7AlRJs.png =x150)

<cite>https://doi.org/10.1093/nargab/lqab019</cite>

---

#### First contact with your sequences

* The sequencing facility provides you with files containing your reads
* **FASTQ** format
    * Standard format for storing the output of high-throughput sequencing instruments
* Sometimes other file types (BAM, HDF5,...) with tools to extract FASTQ files
* One or two files per sample (Illumina paired-end) :warning:

---

#### FASTQ format

```bash
@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***
```

Four lines per sequence:
- header starting with '@'
- sequence line (nucleotides)
- '+' separator
- quality line (one quality character per nucleotide)

```bash
@Identifier1 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
@Identifier2 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
```

---

#### Quality score encoding

![](https://hackmd.io/_uploads/BkFNtlpys.png)

- The base quality encoding scheme depends on the sequencer version.
- Most files produced these days are Sanger compliant (Phred+33).
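
To make the Sanger encoding concrete, here is a minimal Python sketch. It assumes an ASCII offset of 33 (Phred+33) and the standard Phred definition Q = -10 log10(P). The function names `decode_quality` and `error_probability` are illustrative only and do not come from any tool shown in these slides. The quality string is the one from the first read of the FASTQ example above.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Phred+33 encoding


def decode_quality(quality_line, offset=PHRED_OFFSET):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_line]


def error_probability(phred_score):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


# Quality line of the first read in the FASTQ example
scores = decode_quality("***<<*AEF???***")
print(scores)

# Expected number of wrong bases in this read (sum of per-base error probabilities)
print(sum(error_probability(q) for q in scores))
```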
---

#### Quality score

Measure of the quality of the identification of the nucleobases generated by automated DNA sequencing

![](https://hackmd.io/_uploads/BkrQKlT1j.png)

<cite>https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#12</cite>

---

#### FASTQ compression

* Compression is essential to manage FASTQ files (it reduces disk usage)
* compressed files: _<pre>filename<b>.fastq.gz</b></pre><pre>filename<b>.fq.gz</b></pre>_
* Almost all tools can handle compressed files :+1:

---

#### Answers to (not always) simple questions:

- Is the data as I expect?
    - Number of files/samples :ballot_box_with_check:
    - Number of reads in files :ballot_box_with_check:
    - Quality/Length/Composition of reads :ballot_box_with_check:
- Residual presence of adapters or indexes (non-biological information)?
- Are there (un)expected technical biases?
- Are there (un)expected biological biases?

---

#### Data for learning to assemble reads

![](https://hackmd.io/_uploads/ryuAE3Oxo.png =x100)

* Sequencing of the Saccharomyces cerevisiae genome
* Species of yeast (single-celled fungal microorganism)
* Genome of about 12,156,677 bp and 6,275 genes, compactly organized on 16 chromosomes
* GC content ≈ 38-39%
* Illumina/PacBio/ONT datasets :grin:

---

#### Sequencing data

![](https://hackmd.io/_uploads/rJPjsBkli.png =x200) ![](https://hackmd.io/_uploads/ryf_sSJgi.png =x200) ![](https://hackmd.io/_uploads/H1BKxF1eo.png =x200)

* Subsampled to 30x coverage to reduce computing time

---

#### FastQC

* Provides graphics to spot problems originating from the sequencer, library preparation, contamination...
![](https://hackmd.io/_uploads/Hy78nbCyj.png =x300)

<cite>https://www.bioinformatics.babraham.ac.uk/projects/fastqc/</cite>

---

#### Hands-on

* Log in to [Galaxy](https://usegalaxy.fr/)
* Create a history called QC
* Upload the data (next slide)
* Run **FastQC** on each FASTQ file
* Run **MultiQC** on Illumina data

---

#### Shared data

* Shared Data / Data Libraries
    * EBAII A&A 2022
        * Assembly
            * Hifi PacBio / SRR13577847_subreads.30x.fastq
            * ONT / SRR18726953_1.30x.fastq
            * Illumina Miseq PE
                * SRR15597408_1.30x.fastq
                * SRR15597408_2.30x.fastq

---

#### Basic statistics

![](https://hackmd.io/_uploads/HkTDDJ1Gj.png =x200) ![](https://hackmd.io/_uploads/BJP9w1JGo.png =x200)

<hr>

![](https://hackmd.io/_uploads/r1CluJkzo.png =x200) ![](https://hackmd.io/_uploads/HyjGO1kzi.png =x200)

---

#### Per base sequence quality

![](https://hackmd.io/_uploads/ryx416ugs.png =x200) ![](https://hackmd.io/_uploads/BJYUJa_lj.png =x200)

<hr>

![](https://hackmd.io/_uploads/SkpOJ6_eo.png =x200) ![](https://hackmd.io/_uploads/ryqqk6dxi.png =x200)

---

#### Per base sequence quality - Illumina

* Comparison R1/R2

![](https://hackmd.io/_uploads/Hyjx02uej.png =x400)

---

#### GC content

![](https://hackmd.io/_uploads/H1VUfpdlo.png =x200) ![](https://hackmd.io/_uploads/SkX_f6dgo.png =x200)

<hr>

![](https://hackmd.io/_uploads/H1ScGp_eo.png =x200) ![](https://hackmd.io/_uploads/SyJnMTdlj.png =x200)

---

#### GC content / contamination

![](https://hackmd.io/_uploads/HyH5xSdWi.png =x300) ![](https://hackmd.io/_uploads/r1uuWSObo.png =x300)

---

#### Per base sequence content

![](https://hackmd.io/_uploads/HyZ1m6dei.png =x200) ![](https://hackmd.io/_uploads/ByM-mpOls.png =x200)

<hr>

![](https://hackmd.io/_uploads/BJu4X6_xo.png =x200) ![](https://hackmd.io/_uploads/HkcUXauxi.png =x200)

---

#### Other QC tools

* SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
    - <cite>https://doi.org/10.1371/journal.pone.0163962</cite>

![](https://hackmd.io/_uploads/SJmp-5Oes.png =x200)

---

#### Meta QC tools

* MultiQC: Summarize analysis results for multiple tools and samples in a single report ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
    - <cite>http://dx.doi.org/10.1093/bioinformatics/btw354</cite>

![](https://hackmd.io/_uploads/S1OtKp_lj.png =x400)

---

#### Nanopore QC tools

- NanoPlot: Plotting tool for long read sequencing data and alignments ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
    - <cite>https://doi.org/10.1093/bioinformatics/bty149</cite>

![](https://hackmd.io/_uploads/SynIIT_xi.png =x400)

---

#### Kmer based QC tools

- GenomeScope: Reference-free profiling of polyploid genomes ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
    - <cite>https://doi.org/10.1038/s41467-020-14998-3</cite>
    - <cite>https://github.com/tbenavi1/genomescope2.0</cite>

![](https://tolqc.cog.sanger.ac.uk/darwin/mammals/Rattus_norvegicus/genomic_data/mRatNor1/10x/kmer/k31/mRatNor1.k31_transformed_linear_plot.png =x300)![](https://user-images.githubusercontent.com/2330901/101622431-533b8300-3a17-11eb-811e-6dcace3e5252.png =x300)

---

#### Other Kmer based QC tools

- KAT: The K-mer Analysis Toolkit ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
    - <cite>https://doi.org/10.1093/bioinformatics/btw663</cite>

![](https://kat.readthedocs.io/en/latest/_images/simulated_r1_v_r2.png =x300)![](https://kat.readthedocs.io/en/latest/_images/real_r1_v_r2.png =x300)

---

#### Tools for cleaning reads: trimming & co

![](https://hackmd.io/_uploads/HytJ-1Hxj.png =x200)

- fastp: Trim reads by quality, length, remove adapters...
![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
    - <cite>https://doi.org/10.1093/bioinformatics/bty560</cite>

---

#### Tools for cleaning reads: decontamination

![](https://hackmd.io/_uploads/BksJ9R_xs.png =x100)

- Kraken: System for assigning taxonomic labels to short DNA sequences ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
    - <cite>https://dx.doi.org/10.1186/gb-2014-15-3-r46</cite>
- ReadItAndKeep: rapid decontamination of SARS-CoV-2 sequencing reads
    - <cite>https://doi.org/10.1093/bioinformatics/btac311</cite>
- BWA: Fast and accurate short read alignment with Burrows-Wheeler transform ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
    - <cite>https://doi.org/10.1093/bioinformatics/btp324</cite>

---

#### Take home messages

- Don't skip read QC!
- Use tools adapted to your reads (platform, experiment,...)
- QC lets you distinguish potential problems by severity
    - serious: go back to the sequencing facility
    - medium: adapt your strategy (contamination, trimming...)