<style>
div > .slides{
width: 80% !important;
}
h1, h2, h3, h4{
color:#4b898d !important;
}
div{
--r-main-font-size: 35px;
}
cite{
font-size:0.5em;
}
</style>
<!--{%hackmd BJnkBCB9F %}-->
## Read quality control
_2024-05-21_
**Olivier Rué**
**Christophe Klopp**
_EBAII 2024 - Genome assembly school_
![](https://hackmd.io/_uploads/rydQAbWwc.png =x70) ![](https://maiage.inrae.fr/sites/default/files/Logo_MaIAGE.png)
![](https://i.imgur.com/Aed1qRi.png =x80) ![](https://i.imgur.com/ucEa8FB.png =x80) ![](https://genoweb.toulouse.inra.fr/~klopp/Sigenae/sigenae.png =x60)
---
#### The truth about bioinformatics
![](https://hackmd.io/_uploads/ryUN7QA1i.png =x400)
<cite>https://training.galaxyproject.org/training-material/topics/assembly/tutorials/get-started-genome-assembly/slides.html#4</cite>
---
#### QC is the first step of any sequence analysis
![](https://hackmd.io/_uploads/HJyHPg6Jo.png =x400)
<cite>https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#7</cite>
---
#### QC is the first step for all sequence analyses
* Seems to be one of the easiest steps in bioinformatics (because it is standard) :sunglasses: ...
* ... but it is one of the most important :warning:
* You should know what to expect in order to check that everything is OK
* It tells you how to clean the reads when needed
* It reveals possible sequencing problems
* Not all possible problems are well documented: manufacturers prefer the bright side
* QC results must be interpreted in light of what has been sequenced :thinking_face:
---
#### Read characteristics
* length (fixed or variable, range,...)
* nucleotide content:
    * biological sample
    * technical artifacts (primer, adapter, tag, vector, restriction site,...)
    * contamination
    * organelles (chloroplast, mitochondria,...)
* average error rate
* error rate profile (along the reads,...)
* randomness
* GC (read GC content ~ average genome GC content)
* k-mer content
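
Two of these characteristics, read length and GC content, are simple to compute by hand. The sketch below is only an illustration: the function name `gc_fraction` and the two toy reads are assumptions, not part of any QC tool.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy reads used only for illustration.
reads = ["CTTGGTCATTTAGAG", "CATTGGCCATATCAT"]
lengths = [len(r) for r in reads]
gc_values = [gc_fraction(r) for r in reads]

print(lengths)    # read lengths
print(gc_values)  # GC fraction per read, to compare with the expected genome GC
```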
---
#### Reads are not perfect (error rate profile)
![](https://hackmd.io/_uploads/HJp78Vkgo.png =x200) ![](https://hackmd.io/_uploads/SJY7AlRJs.png =x150)
<cite>https://doi.org/10.1093/nargab/lqab019</cite>
---
#### First contact with your sequences
* The sequencing facility provides you with files containing your reads
* **FASTQ** format
* Standard format for storing the output of high-throughput sequencing instruments
* Sometimes other file types (BAM, HDF5,...) with tools to extract FASTQ files
* One or two files per sample (two for Illumina paired-end) :warning:
---
#### FASTQ format
```bash
@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***
```
Four lines per sequence:
- header starting with '@'
- sequence line (nucleotides)
- '+' separator
- quality line (one quality character per nucleotide)
```bash
@Identifier1 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
@Identifier2 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
```
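
The four-line layout can be read with a few lines of Python. This is a minimal sketch, assuming each record uses exactly four lines (no wrapped sequences); `read_fastq` is a hypothetical helper name, not a library function.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a blank line): stop
            return
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield header[1:], seq, qual


# The two records shown on this slide.
EXAMPLE = (
    "@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC\n"
    "CTTGGTCATTTAGAG\n+\n***<<*AEF???***\n"
    "@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC\n"
    "CATTGGCCATATCAT\n+\nAAAE??<<*???***\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
for name, seq, qual in records:
    print(name.split()[0], len(seq))
```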
---
#### Quality score encoding
![](https://hackmd.io/_uploads/BkFNtlpys.png)
- The base quality encoding scheme depends on the sequencer version.
- Most files produced these days are Sanger-compliant (Phred+33).
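
As an illustration of the Sanger encoding, each quality character can be decoded by subtracting an ASCII offset of 33. The sketch below assumes Phred+33 input; `phred_scores` is a hypothetical helper name.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is the Sanger convention; older Illumina files
    may need offset=64 (assumption to check against your data).
    """
    return [ord(ch) - offset for ch in quality]


scores = phred_scores("***<<*AEF???***")
print(scores)
# → [9, 9, 9, 27, 27, 9, 32, 36, 37, 30, 30, 30, 9, 9, 9]
```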
---
#### Quality score
Measure of the quality of the identification of the nucleobases generated by automated DNA sequencing
![](https://hackmd.io/_uploads/BkrQKlT1j.png)
<cite>https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#12</cite>
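
A Phred score Q relates to the base-calling error probability P by Q = -10 log10(P). The following sketch shows the conversion in both directions; the function names are illustrative assumptions.

```python
import math
from typing import Iterable


def error_probability(q: float) -> float:
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_from_probability(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def mean_error_rate(scores: Iterable[int]) -> float:
    """Average the error probabilities, not the Phred scores themselves."""
    probs = [error_probability(q) for q in scores]
    return sum(probs) / len(probs)


print(error_probability(20))      # Q20: about 1 error in 100 bases
print(error_probability(30))      # Q30: about 1 error in 1000 bases
print(mean_error_rate([10, 30]))  # higher than the error rate at the mean score (Q20)
```

Averaging the probabilities matters because the scale is logarithmic: a few low-quality bases dominate the mean error rate.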
---
#### FASTQ compression
* Compression is essential to manage FASTQ files (reduce disk storage)
* compressed files: _<pre>filename<b>.fastq.gz</b></pre><pre>filename<b>.fq.gz</b></pre>_
* Almost all tools can read compressed files directly :+1:
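
Reading a compressed FASTQ file does not require decompressing it on disk first. The sketch below uses Python's standard `gzip` module; `open_fastq`, `count_reads` and the temporary file name are assumptions for the example, and the read count assumes four lines per record.

```python
import gzip
import os
import tempfile


def open_fastq(path: str):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path: str) -> int:
    """Count reads, assuming exactly four lines per FASTQ record."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Demonstration on a small temporary file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fastq.gz")
    with gzip.open(path, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
    n_reads = count_reads(path)

print(n_reads)  # → 2
```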
---
#### Answer to (not always) simple questions:
- Are the data as I expect?
    - Number of files/samples :ballot_box_with_check:
    - Number of reads in files :ballot_box_with_check:
    - Quality/Length/Composition of reads :ballot_box_with_check:
- Are there residual adapters or indexes (non-biological information)?
- Are there (un)expected technical biases?
- Are there (un)expected biological biases?
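
As a rough illustration of the adapter question, one can count the reads that contain a known adapter sequence. This sketch makes strong assumptions: the adapter prefix below is the start of a commonly used Illumina adapter and may not match your library, and only exact matches are counted, whereas dedicated tools also handle mismatches and partial adapters at read ends.

```python
from typing import Iterable

# Assumption: first bases of a common Illumina adapter; check your library kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def fraction_with_adapter(seqs: Iterable[str], adapter: str = ADAPTER_PREFIX) -> float:
    """Fraction of reads containing an exact copy of the adapter."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    hits = sum(1 for s in seqs if adapter in s.upper())
    return hits / len(seqs)


# Toy reads: the first one carries the adapter, the second does not.
toy_reads = ["ACGTAGATCGGAAGAGCAC", "ACGTACGTACGTACGT"]
print(fraction_with_adapter(toy_reads))  # → 0.5
```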
---
#### Data for learning to assemble reads
![](https://hackmd.io/_uploads/ryuAE3Oxo.png =x100)
* Sequencing of the *Saccharomyces cerevisiae* genome
* A species of yeast (single-celled fungus)
* Genome of about 12.1 Mb (12,156,677 bp) and 6,275 genes, compactly organized on 16 chromosomes
* GC content ≈ 38-39%
* Illumina/PacBio/ONT datasets :grin:
---
#### Sequencing data
![](https://hackmd.io/_uploads/rJPjsBkli.png =x200) ![](https://hackmd.io/_uploads/ryf_sSJgi.png =x200) ![](https://hackmd.io/_uploads/H1BKxF1eo.png =x200)
* Subsampled to 30x coverage to reduce computing time
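
Coverage is the total number of sequenced bases divided by the genome size, so the fraction of reads to keep for a target depth follows directly. The sketch below uses the 12,156,677 bp genome size quoted earlier in these slides; the total base count is a made-up value for illustration.

```python
GENOME_SIZE = 12_156_677  # S. cerevisiae genome size in bp, as quoted in the slides


def coverage(total_bases: int, genome_size: int = GENOME_SIZE) -> float:
    """Average sequencing depth: sequenced bases / genome size."""
    return total_bases / genome_size


def subsample_fraction(current_cov: float, target_cov: float = 30.0) -> float:
    """Fraction of reads to keep to reach the target coverage (at most 1.0)."""
    return min(1.0, target_cov / current_cov)


total_bases = 1_215_667_700  # hypothetical dataset, chosen to give 100x
cov = coverage(total_bases)
print(cov)                      # average depth of the hypothetical dataset
print(subsample_fraction(cov))  # fraction of reads to keep for 30x
```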
---
#### FastQC
* Provides graphics to spot problems originating from the sequencer, library preparation, contamination...
![](https://hackmd.io/_uploads/Hy78nbCyj.png =x300)
<cite>https://www.bioinformatics.babraham.ac.uk/projects/fastqc/</cite>
---
<!-- .slide: data-background="#AAAA" -->
#### Hands-on
* Log in to [Galaxy](https://usegalaxy.fr/)
* Create a history called QC
* Upload the data (next slide)
* Run **FastQC** on each FASTQ file
* Run **MultiQC** on Illumina data
---
#### Shared data
* Shared Data / Data Libraries
    * EBAII A&A 2022
        * Assembly
            * Hifi PacBio / SRR13577847_subreads.30x.fastq
            * ONT / SRR18726953_1.30x.fastq
            * Illumina Miseq PE
                * SRR15597408_1.30x.fastq
                * SRR15597408_2.30x.fastq
---
#### Basic statistics
![](https://hackmd.io/_uploads/HkTDDJ1Gj.png =x200) ![](https://hackmd.io/_uploads/BJP9w1JGo.png =x200)
<hr>
![](https://hackmd.io/_uploads/r1CluJkzo.png =x200) ![](https://hackmd.io/_uploads/HyjGO1kzi.png =x200)
---
#### Per base sequence quality
![](https://hackmd.io/_uploads/ryx416ugs.png =x200) ![](https://hackmd.io/_uploads/BJYUJa_lj.png =x200)
<hr>
![](https://hackmd.io/_uploads/SkpOJ6_eo.png =x200) ![](https://hackmd.io/_uploads/ryqqk6dxi.png =x200)
---
#### Per base sequence quality - Illumina
* Comparison R1/R2
![](https://hackmd.io/_uploads/Hyjx02uej.png =x400)
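
The idea behind a per-base quality plot can be sketched as a mean Phred score per read position. This is a simplified illustration assuming Phred+33 encoding: FastQC draws box plots per position, while this sketch only computes means, and `mean_quality_per_position` is a hypothetical name.

```python
from typing import List


def mean_quality_per_position(quals: List[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position across all quality strings."""
    max_len = max((len(q) for q in quals), default=0)
    totals = [0] * max_len
    counts = [0] * max_len
    for q in quals:
        for i, ch in enumerate(q):
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings: 'I' decodes to Q40 and '5' to Q20 with offset 33.
profile = mean_quality_per_position(["II5", "I5"])
print(profile)  # → [40.0, 30.0, 20.0]
```

Running the same function separately on R1 and R2 gives two profiles that can be compared position by position.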
---
#### GC content
![](https://hackmd.io/_uploads/H1VUfpdlo.png =x200) ![](https://hackmd.io/_uploads/SkX_f6dgo.png =x200)
<hr>
![](https://hackmd.io/_uploads/H1ScGp_eo.png =x200) ![](https://hackmd.io/_uploads/SyJnMTdlj.png =x200)
---
#### GC content / contamination
![](https://hackmd.io/_uploads/HyH5xSdWi.png =x300) ![](https://hackmd.io/_uploads/r1uuWSObo.png =x300)
---
#### Per base sequence content
![](https://hackmd.io/_uploads/HyZ1m6dei.png =x200) ![](https://hackmd.io/_uploads/ByM-mpOls.png =x200)
<hr>
![](https://hackmd.io/_uploads/BJu4X6_xo.png =x200) ![](https://hackmd.io/_uploads/HkcUXauxi.png =x200)
---
#### Other QC tools
* SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
- <cite>https://doi.org/10.1371/journal.pone.0163962</cite>
![](https://hackmd.io/_uploads/SJmp-5Oes.png =x200)
---
#### Meta QC tools
* MultiQC: Summarize analysis results for multiple tools and samples in a single report ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
- <cite>http://dx.doi.org/10.1093/bioinformatics/btw354</cite>
![](https://hackmd.io/_uploads/S1OtKp_lj.png =x400)
---
#### Nanopore QC tools
- NanoPlot: Plotting tool for long read sequencing data and alignments ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
- <cite>https://doi.org/10.1093/bioinformatics/bty149</cite>
![](https://hackmd.io/_uploads/SynIIT_xi.png =x400)
---
#### K-mer based QC tools
- GenomeScope: Reference-free genome profiling from k-mer spectra (genome size, heterozygosity, repeat content) ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
- <cite>https://doi.org/10.1038/s41467-020-14998-3</cite>
- <cite>https://github.com/tbenavi1/genomescope2.0</cite>
![](https://tolqc.cog.sanger.ac.uk/darwin/mammals/Rattus_norvegicus/genomic_data/mRatNor1/10x/kmer/k31/mRatNor1.k31_transformed_linear_plot.png =x300)![](https://user-images.githubusercontent.com/2330901/101622431-533b8300-3a17-11eb-811e-6dcace3e5252.png =x300)
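
The plots above are built from a k-mer spectrum: how many distinct k-mers occur once, twice, and so on in the reads. The sketch below is a toy illustration only: it does not merge a k-mer with its reverse complement, it holds everything in memory, and real datasets call for dedicated k-mer counters.

```python
from collections import Counter
from typing import Iterable


def count_kmers(seqs: Iterable[str], k: int) -> Counter:
    """Count every k-mer of length k in the reads, skipping k-mers containing N."""
    counts: Counter = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


counts = count_kmers(["ACGTACGT"], k=4)
spectrum = kmer_spectrum(counts)
print(dict(spectrum))  # three k-mers seen once, one (ACGT) seen twice
```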
---
#### Other k-mer based QC tools
- KAT: The K-mer Analysis Toolkit ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
- <cite>https://doi.org/10.1093/bioinformatics/btw663</cite>
![](https://kat.readthedocs.io/en/latest/_images/simulated_r1_v_r2.png =x300)![](https://kat.readthedocs.io/en/latest/_images/real_r1_v_r2.png =x300)
---
#### Tools for cleaning reads: trimming & co
![](https://hackmd.io/_uploads/HytJ-1Hxj.png =x200)
- fastp: Trim reads by quality, length, remove adapters... ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
- <cite>https://doi.org/10.1093/bioinformatics/bty560</cite>
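
To make the idea of quality trimming concrete, here is a minimal sketch that removes low-quality bases from the 3' end of a read and then applies a length filter. It is not the algorithm fastp uses; the thresholds, function names and Phred+33 encoding are assumptions for the example.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq: str, min_len: int = 10) -> bool:
    """Length filter applied after trimming."""
    return len(seq) >= min_len


# First read of the FASTQ example: the last three bases have Q9.
seq, qual = trim_3prime("CTTGGTCATTTAGAG", "***<<*AEF???***")
print(seq, qual)       # → CTTGGTCATTTA ***<<*AEF???
print(keep_read(seq))  # → True
```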
---
#### Tools for cleaning reads: decontamination
![](https://hackmd.io/_uploads/BksJ9R_xs.png =x100)
- Kraken: System for assigning taxonomic labels to short DNA sequences ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
- <cite>https://dx.doi.org/10.1186/gb-2014-15-3-r46</cite>
- ReadItAndKeep: rapid decontamination of SARS-CoV-2 sequencing reads
- <cite>https://doi.org/10.1093/bioinformatics/btac311</cite>
- BWA: Fast and accurate short read alignment with Burrows-Wheeler transform ![](https://hackmd.io/_uploads/r1I2JB_Wo.png =x20)
- <cite>https://doi.org/10.1093/bioinformatics/btp324</cite>
---
#### Take home messages
- Don't skip read QC!
- Use tools adapted to your reads (platform, experiment,...)
- QC helps you sort potential problems by severity
    - serious: go back to the sequencing facility
    - medium: adapt your strategy (contamination, trimming...)