Description:
This file describes the different steps to the quality control based on:
Input data: Seurat object containing all cells (filtered count matrix)
Output data: Seurat object annotated for the cells
to filter out for downstream analysis, called
sobj_TD3A_qc_annotated.rds
Next steps: Cells annotation for cell type, doublet status and cell cycle status
In this file, we used a dataset from the Paiva et al. publication.
The study concerns thymus autonomy:
The thymus is an “organ of passage”, critical in its function to the adaptive immune system for the maturation of T cell lymphocytes.
This maturation involves two main steps, performed thanks to macrophages:
Thymus autonomy is a natural mechanism that allows to create T cells in the thymus by differentiation and cell competition, even when normal progenitors from the bone marrow are lacking, in critical conditions.
This mechanism is known in its effects, but the cells involved in are not.
This study is of importance in the health field, as this mechanism relies on a temporary loss of control of the cell normal functions.
The consequence is that if thymus is in autonomy for too long (few weeks), this is a prelude for leukemia !
Organism is : mus musculus
Individuals are : mice in development, grafted
The design corresponds to two conditions (Test / control)
You will mainly work on the KO sample (‘TD3A’). The input data consists in a count matrix, as a gzipped tabular text file, that contains everything needed to create a basic Seurat object : * The expressions counts * The feature names (here, gene symbols) * The barcode names
This matrix has already been filtered for empty droplets.
To obtain this file and starts the TD, copy-paste the folder in your project, either using the graphical interface, or in the Terminal:
cp -r /shared/projects/2422_ebaii_n1/atelier_scrnaseq/RMD/Preproc.2 /shared/projects/[your_projects]/TD_Preproc.2
cd /shared/projects/[your_projects]/TD_Preproc.2
The tree should looks like:
ls
## data
## paiva_wt_ko.png
## resources
## TD_QC_metrics.Rmd
## thymus_autonomy.png
We load the package of interest:
library(dplyr)
library(ggplot2)
.libPaths()
## [1] "/shared/home/aonfroy/R/x86_64-conda-linux-gnu-library/4.4"
## [2] "/shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/R/library"
We load the count matrix:
# Set input data
input_data = "./data/GSM4861194_gex_2_raw_gene_expression.tsv.gz"
# Load data
mat = read.table(input_data,
header = TRUE,
sep = "\t")
mat = as.matrix(mat)
mat = Matrix::Matrix(mat,
sparse = TRUE)
dim(mat)
## [1] 31053 4587
mat[c(1:5), c(1:5)]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## AAACCTGAGACGCTTT.1 AAACCTGAGGCATTGG.1 AAACCTGGTCAACATC.1
## Xkr4 . . .
## Gm1992 . . .
## Gm37381 . . .
## Rp1 . . .
## Sox17 . . .
## AAACCTGTCGAGGTAG.1 AAACCTGTCGATCCCT.1
## Xkr4 . .
## Gm1992 . .
## Gm37381 . .
## Rp1 . .
## Sox17 . .
We build a Seurat object from the count matrix:
sobj = Seurat::CreateSeuratObject(counts = mat,
assay = "RNA",
project = "TD3A")
sobj
## An object of class Seurat
## 31053 features across 4587 samples within 1 assay
## Active assay: RNA (31053 features, 0 variable features)
## 1 layer present: counts
We do not need the count matrix:
rm(mat)
We define the output directory to save the Seurat object in this end of this file:
outdir = "./data"
if (!dir.exists(outdir)) {
dir.create(outdir, recursive = FALSE)
}
We set the filtering thresholds based on quality control-related metrics. Adjust them as necessary based on the figures.
cut_log10_nCount_RNA = 3
cut_nFeature_RNA = 750
cut_percent.mt = 5
cut_percent.rb = 30
cut_percent.st = 6
We define a nice palette to visualize the QC metrics:
color_palette = c("lightgray", "#FDBB84", "#EF6548", "#7F0000", "black")
For the QC metrics related to the proportion of UMI belongs to a specific gene sets, we need to load the gene sets.
mito_symbols = readRDS("./resources/mus_musculus_mito_symbols_20191015.rds")
mito_symbols
## [1] "mt-Atp6" "mt-Atp8" "mt-Co1" "mt-Co2" "mt-Co3" "mt-Cytb" "mt-Nd1"
## [8] "mt-Nd2" "mt-Nd3" "mt-Nd4" "mt-Nd4l" "mt-Nd5" "mt-Nd6" "mt-Rnr1"
## [15] "mt-Rnr2" "mt-Ta" "mt-Tc" "mt-Td" "mt-Te" "mt-Tf" "mt-Tg"
## [22] "mt-Th" "mt-Ti" "mt-Tk" "mt-Tl1" "mt-Tl2" "mt-Tm" "mt-Tn"
## [29] "mt-Tp" "mt-Tq" "mt-Tr" "mt-Ts1" "mt-Ts2" "mt-Tt" "mt-Tv"
## [36] "mt-Tw" "mt-Ty"
ribo_symbols = readRDS("./resources/mus_musculus_cribo_symbols_20191015.rds")
ribo_symbols
## [1] "Rpsa" "Rps2" "Rps3" "Rps3a" "Rps4x" "Rps5" "Rps6" "Rps7"
## [9] "Rps8" "Rps9" "Rps10" "Rps11" "Rps12" "Rps13" "Rps14" "Rps15"
## [17] "Rps15a" "Rps16" "Rps17" "Rps18" "Rps19" "Rps20" "Rps21" "Rps23"
## [25] "Rps24" "Rps25" "Rps26" "Rps27" "Rps27a" "Rps28" "Rps29" "Rps30"
## [33] "Rpl3" "Rpl4" "Rpl5" "Rpl6" "Rpl7" "Rpl7a" "Rpl8" "Rpl9"
## [41] "Rpl10" "Rpl10a" "Rpl11" "Rpl12" "Rpl13" "Rpl13a" "Rpl14" "Rpl15"
## [49] "Rpl17" "Rpl18" "Rpl18a" "Rpl19" "Rpl21" "Rpl22" "Rpl23" "Rpl23a"
## [57] "Rpl24" "Rpl26" "Rpl27" "Rpl27a" "Rpl28" "Rpl29" "Rpl30" "Rpl31"
## [65] "Rpl32" "Rpl34" "Rpl35" "Rpl35a" "Rpl36" "Rpl44" "Rpl37" "Rpl37a"
## [73] "Rpl38" "Rpl39" "Rpl40" "Rpl41" "Rplp0" "Rplp1" "Rplp2"
stress_symbols = readRDS("./resources/mus_musculus_stress_symbols_20200224.rds")
stress_symbols
## [1] "Atf3" "Cmss1" "Diaph1" "Egr1" "Fos" "Gls"
## [7] "Gm48099" "lsp90aa1" "lsp90ab1" "lfrd1" "Jak1" "Junb"
## [13] "Kpna1" "Nup210l" "Peak1" "Stat3" "Actb" "Camk1d"
## [19] "Chka" "Clic4" "Fodl2" "Hspa8" "Jun" "Jund"
## [25] "Klf6" "Litaf" "Pecam1" "Ptma" "Sik3" "Spag9"
## [31] "Arih1" "Azin1" "Brd2" "Chd4" "Ctnnb1" "Dennd4a"
## [37] "Elf1" "G3bp1" "Ints6" "Kdm6b" "Lsmem1" "Man1a"
## [43] "Med13" "Nfkbia" "Nfkbiz" "Nop58" "Piezo1" "Ppp1r15a"
## [49] "Prkcg" "Sqstm1" "Taf4b" "Tmsb4x" "Ubc" "Zfp36"
## [55] "Abtb2" "Adamts1" "Adamts9" "Ahnak" "Ankrd28" "Atp2a2"
## [61] "Baz1a" "Btg2" "Ccdc138" "Cd44" "Cdkn1a" "Elf2"
## [67] "Ep400" "Epas1" "Erf" "Ern1" "Fabp4" "Fosb"
## [73] "Gm10073" "Gsk3a" "H3f3a" "H3f3b" "Hivep2" "Klf4"
## [79] "Lmna" "lapkapk2" "Mapre1" "Msn" "Mylip" "Nabp1"
## [85] "Ncl" "Nsd3" "Nufip2" "Plekhg2" "Ppp1cb" "Rnf19b"
## [91] "Rps20" "Rtn4" "Runx1" "Sertad2" "Sptan1" "Top1"
## [97] "Vcl" "Zbtb11"
We keep only the gene symbols available in our data:
mito_symbols = intersect(mito_symbols, rownames(sobj))
mito_symbols
## [1] "mt-Atp6" "mt-Atp8" "mt-Co1" "mt-Co2" "mt-Co3" "mt-Cytb" "mt-Nd1"
## [8] "mt-Nd2" "mt-Nd3" "mt-Nd4" "mt-Nd4l" "mt-Nd5" "mt-Nd6"
ribo_symbols = intersect(ribo_symbols, rownames(sobj))
ribo_symbols
## [1] "Rpsa" "Rps2" "Rps3" "Rps4x" "Rps5" "Rps6" "Rps7" "Rps8"
## [9] "Rps9" "Rps10" "Rps11" "Rps12" "Rps13" "Rps14" "Rps15" "Rps15a"
## [17] "Rps16" "Rps17" "Rps18" "Rps19" "Rps20" "Rps21" "Rps23" "Rps24"
## [25] "Rps25" "Rps26" "Rps27" "Rps27a" "Rps28" "Rps29" "Rpl3" "Rpl4"
## [33] "Rpl5" "Rpl6" "Rpl7" "Rpl7a" "Rpl8" "Rpl9" "Rpl10" "Rpl10a"
## [41] "Rpl11" "Rpl12" "Rpl13" "Rpl13a" "Rpl14" "Rpl15" "Rpl17" "Rpl18"
## [49] "Rpl18a" "Rpl19" "Rpl21" "Rpl22" "Rpl23" "Rpl23a" "Rpl24" "Rpl26"
## [57] "Rpl27" "Rpl27a" "Rpl28" "Rpl29" "Rpl30" "Rpl31" "Rpl32" "Rpl34"
## [65] "Rpl35" "Rpl35a" "Rpl36" "Rpl37" "Rpl37a" "Rpl38" "Rpl39" "Rpl41"
## [73] "Rplp0" "Rplp1" "Rplp2"
stress_symbols = intersect(stress_symbols, rownames(sobj))
stress_symbols
## [1] "Atf3" "Cmss1" "Diaph1" "Egr1" "Fos" "Gls"
## [7] "Gm48099" "Jak1" "Junb" "Kpna1" "Nup210l" "Peak1"
## [13] "Stat3" "Actb" "Camk1d" "Chka" "Clic4" "Hspa8"
## [19] "Jun" "Jund" "Klf6" "Litaf" "Pecam1" "Ptma"
## [25] "Sik3" "Spag9" "Arih1" "Azin1" "Brd2" "Chd4"
## [31] "Ctnnb1" "Dennd4a" "Elf1" "G3bp1" "Ints6" "Kdm6b"
## [37] "Lsmem1" "Man1a" "Med13" "Nfkbia" "Nfkbiz" "Nop58"
## [43] "Piezo1" "Ppp1r15a" "Prkcg" "Sqstm1" "Taf4b" "Tmsb4x"
## [49] "Ubc" "Zfp36" "Abtb2" "Adamts1" "Adamts9" "Ahnak"
## [55] "Ankrd28" "Atp2a2" "Baz1a" "Btg2" "Ccdc138" "Cd44"
## [61] "Cdkn1a" "Elf2" "Ep400" "Epas1" "Erf" "Ern1"
## [67] "Fabp4" "Fosb" "Gm10073" "Gsk3a" "H3f3a" "H3f3b"
## [73] "Hivep2" "Klf4" "Lmna" "Mapre1" "Msn" "Mylip"
## [79] "Nabp1" "Ncl" "Nsd3" "Nufip2" "Plekhg2" "Ppp1cb"
## [85] "Rnf19b" "Rps20" "Rtn4" "Runx1" "Sertad2" "Sptan1"
## [91] "Top1" "Vcl" "Zbtb11"
Note: A more robust analysis may use the gene identifiers instead of gene symbols.
To visualize the low quality cells, we generate a projection. Understanding the commands below is the purpose of next course sessions. So just run:
sobj = Seurat::NormalizeData(sobj)
sobj = Seurat::ScaleData(sobj)
sobj = Seurat::FindVariableFeatures(sobj)
sobj = Seurat::RunPCA(sobj)
sobj = Seurat::RunUMAP(sobj, dims = c(1:20))
sobj
## An object of class Seurat
## 31053 features across 4587 samples within 1 assay
## Active assay: RNA (31053 features, 2000 variable features)
## 3 layers present: counts, data, scale.data
## 2 dimensional reductions calculated: pca, umap
We can now visualize the cells on a 2D projection:
Seurat::DimPlot(sobj,
reduction = "umap",
cols = "black") +
ggplot2::theme(aspect.ratio = 1,
legend.position = "none")
What is already available in the Seurat object ?
head(sobj@meta.data)
## orig.ident nCount_RNA nFeature_RNA
## AAACCTGAGACGCTTT.1 TD3A 2813 1594
## AAACCTGAGGCATTGG.1 TD3A 2072 1365
## AAACCTGGTCAACATC.1 TD3A 2025 1341
## AAACCTGTCGAGGTAG.1 TD3A 1877 1241
## AAACCTGTCGATCCCT.1 TD3A 2216 1441
## AAACGGGCACTTACGA.1 TD3A 2445 1428
How do the two first QC metrics vary ?
summary(sobj@meta.data)
## orig.ident nCount_RNA nFeature_RNA
## TD3A:4587 Min. : 502 Min. : 22
## 1st Qu.: 1986 1st Qu.:1291
## Median : 2397 Median :1476
## Mean : 3573 Mean :1638
## 3rd Qu.: 3134 3rd Qu.:1733
## Max. :48875 Max. :5975
In the column nCount_RNA
, the maximum is far from the
third quartile. For visualization purpose, we transform this column to
log10 scale.
sobj$log10_nCount_RNA = log10(sobj$nCount_RNA)
summary(sobj@meta.data)
## orig.ident nCount_RNA nFeature_RNA log10_nCount_RNA
## TD3A:4587 Min. : 502 Min. : 22 Min. :2.701
## 1st Qu.: 1986 1st Qu.:1291 1st Qu.:3.298
## Median : 2397 Median :1476 Median :3.380
## Mean : 3573 Mean :1638 Mean :3.429
## 3rd Qu.: 3134 3rd Qu.:1733 3rd Qu.:3.496
## Max. :48875 Max. :5975 Max. :4.689
We compute the percentage of UMI related for each of the three list of genes. First, we compute the proportion of transcripts related to the genes encoded in the mitochondria, per cell.
sobj = Seurat::PercentageFeatureSet(
sobj,
assay = "RNA",
features = mito_symbols,
col.name = "percent.mt")
# Alternative way to do (almost) the same:
# sobj = Seurat::PercentageFeatureSet(
# sobj,
# assay = "RNA",
# pattern = "^mt",
# col.name = "percent.mt")
summary(sobj$percent.mt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.985 2.489 2.997 3.125 97.250
Then, we compute the proportion of transcripts related to the genes encoding ribosomal units, per cell:
sobj = Seurat::PercentageFeatureSet(
sobj,
assay = "RNA",
features = ribo_symbols,
col.name = "percent.rb")
# Alternative way to do (almost) the same:
# sobj = Seurat::PercentageFeatureSet(
# sobj,
# assay = "RNA",
# pattern = "^Rp[l|s][a-z]?[0-9]*[a,x]?$",
# col.name = "percent.rb")
summary(sobj$percent.rb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.884 9.577 10.880 12.167 42.010
Finally, we compute the proportion of transcripts related to stress signature, per cell:
sobj = Seurat::PercentageFeatureSet(
sobj,
assay = "RNA",
features = stress_symbols,
col.name = "percent.st")
summary(sobj$percent.st)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.946 3.338 3.368 3.749 10.198
We know have all the QC-related metrics:
summary(sobj@meta.data)
## orig.ident nCount_RNA nFeature_RNA log10_nCount_RNA percent.mt
## TD3A:4587 Min. : 502 Min. : 22 Min. :2.701 Min. : 0.000
## 1st Qu.: 1986 1st Qu.:1291 1st Qu.:3.298 1st Qu.: 1.985
## Median : 2397 Median :1476 Median :3.380 Median : 2.489
## Mean : 3573 Mean :1638 Mean :3.429 Mean : 2.997
## 3rd Qu.: 3134 3rd Qu.:1733 3rd Qu.:3.496 3rd Qu.: 3.125
## Max. :48875 Max. :5975 Max. :4.689 Max. :97.250
## percent.rb percent.st
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.884 1st Qu.: 2.946
## Median : 9.577 Median : 3.338
## Mean :10.880 Mean : 3.368
## 3rd Qu.:12.167 3rd Qu.: 3.749
## Max. :42.010 Max. :10.198
We identify the cells that do not pass the quality control. This will be used for the visualization and for filtering. If the filtering thresholds are modified, do not forget to run again this chunk.
fail_percent.mt = sobj@meta.data %>%
dplyr::filter(percent.mt > cut_percent.mt) %>%
rownames()
fail_percent.rb = sobj@meta.data %>%
dplyr::filter(percent.rb > cut_percent.rb) %>%
rownames()
fail_percent.st = sobj@meta.data %>%
dplyr::filter(percent.st > cut_percent.st) %>%
rownames()
fail_log10_nCount_RNA = sobj@meta.data %>%
dplyr::filter(log10_nCount_RNA < cut_log10_nCount_RNA) %>%
rownames()
fail_nFeature_RNA = sobj@meta.data %>%
dplyr::filter(nFeature_RNA < cut_nFeature_RNA) %>%
rownames()
This is difficult to handle the distribution of these metrics across cells. We opt for various visualization ways:
You may choose one of these visualization ways.
We write a lot of code to create our figures of interest:
# Histogram showing:
# - the distribution of the metric
# - an estimated density
# - the filtering threshold as a straight line
p_hist = ggplot2::ggplot(sobj@meta.data, aes(x = log10_nCount_RNA)) +
ggplot2::geom_histogram(aes(y = ggplot2::after_stat(density)),
colour = "black", fill = "#F8766D", bins = 100) +
ggplot2::geom_density(alpha = 0, col = "blue", lwd = 0.75) +
ggplot2::geom_vline(xintercept = cut_log10_nCount_RNA, col = "red") +
ggplot2::labs(title = paste0("Threshold for log10_nCount_RNA is: ", cut_log10_nCount_RNA)) +
ggplot2::theme_classic() +
ggplot2::theme(plot.title = element_text(hjust = 0.5))
# Violin plot:
# - is easily accessible in the Seurat package
# - can or not display the cells (set `pt.size = 0` to hide the cells)
# - is useful when several datasets have been merged
p_violin = Seurat::VlnPlot(sobj, features = "log10_nCount_RNA") +
ggplot2::geom_hline(yintercept = cut_log10_nCount_RNA, col = "red") +
ggplot2::theme(axis.title.x = element_blank(),
legend.position = "none")
# Box plot is similar to violin plot but not in the Seurat package
p_boxplot = ggplot2::ggplot(sobj@meta.data, aes(y = log10_nCount_RNA,
x = orig.ident)) +
ggplot2::geom_boxplot(colour = "black", fill = "#F8766D") +
ggplot2::geom_jitter(width = 0.3, size = 0.001) +
ggplot2::geom_hline(yintercept = cut_log10_nCount_RNA, col = "red") +
ggplot2::theme_classic() +
ggplot2::theme(axis.title.x = element_blank())
# This 2D representation shows the distribution of the metric over cells
p_umap = Seurat::FeaturePlot(sobj,
reduction = "umap",
features = "log10_nCount_RNA") +
ggplot2::scale_color_gradientn(colors = color_palette) +
ggplot2::theme(aspect.ratio = 1)
# This 2D representation shows the low quality cells in color
# `order = "fail"` is used to display failing cells on front
# before, we need to define the "failorpass" column in sobj@meta.data
# It corresponds to "fail" for cells that do no pass the threshold,
# or "pass" otherwise
sobj$failorpass = ifelse(colnames(sobj) %in% fail_log10_nCount_RNA,
yes = "fail", no = "pass") %>%
as.factor()
p_fail = Seurat::DimPlot(sobj,
group.by = "failorpass",
order = "fail") +
ggplot2::scale_color_manual(values = c("#F8766D", "gray80"),
breaks = levels(sobj$failorpass)) +
ggplot2::labs(title = "log10_nCount_RNA",
subtitle = paste0(length(fail_log10_nCount_RNA), " cells fail (",
round(100*length(fail_log10_nCount_RNA)/ncol(sobj), 2), " %)")) +
ggplot2::theme(aspect.ratio = 1,
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
sobj$failorpass = NULL # remove the column (it was temporary)
# We use the patchwork package to arrange all figures together
patchwork::wrap_plots(p_umap, p_fail, p_hist, p_violin, p_boxplot) +
patchwork::plot_layout(nrow = 1, widths = c(1, 1, 2, 1, 1))
Instead of copying-pasting this section of code for all QC-metrics, we design a function using the template below:
my_function_name = function(param1, param2) {
# do something with the parameter values
output = "something"
return(output)
}
Here is the function:
print_1_qc_metric = function(object = sobj,
qc = "log10_nCount_RNA",
cut_qc = cut_log10_nCount_RNA,
failing_cells = fail_log10_nCount_RNA) {
# Description of the parameters:
# - sobj : the Seurat object, with default value to the one
# - qc : CHARACTER : the QC metric, must be a column in sobj@meta.data
# - cut_qc : NUMERIC : the filtering threshold for the QC metric
# - failing_cells : CHARACTER VECTOR : the cells that fail the QC
# Histogram
p_hist = ggplot2::ggplot(object@meta.data, aes(x = .data[[qc]])) +
ggplot2::geom_histogram(aes(y = ggplot2::after_stat(density)),
colour = "black", fill = "#F8766D", bins = 100) +
ggplot2::geom_density(alpha = 0, col = "blue", lwd = 0.75) +
ggplot2::geom_vline(xintercept = cut_qc, col = "red") +
ggplot2::labs(title = paste0("Threshold for ", qc, " is: ", cut_qc)) +
ggplot2::theme_classic() +
ggplot2::theme(plot.title = element_text(hjust = 0.5))
# Violin plot
p_violin = Seurat::VlnPlot(object, features = qc) +
ggplot2::geom_hline(yintercept = cut_qc, col = "red") +
ggplot2::theme(axis.title.x = element_blank(),
legend.position = "none")
# Box plot
p_boxplot = ggplot2::ggplot(object@meta.data, aes(y = .data[[qc]],
x = "orig.ident")) +
ggplot2::geom_boxplot(colour = "black", fill = "#F8766D") +
ggplot2::geom_jitter(width = 0.3, size = 0.001) +
ggplot2::geom_hline(yintercept = cut_qc, col = "red") +
ggplot2::theme_classic() +
ggplot2::theme(axis.title.x = element_blank())
# Feature plot
p_umap = Seurat::FeaturePlot(object,
reduction = "umap",
features = qc) +
ggplot2::scale_color_gradientn(colors = color_palette) +
ggplot2::theme(aspect.ratio = 1)
# Dim plot
object$failorpass = ifelse(colnames(object) %in% failing_cells,
yes = "fail", no = "pass") %>%
as.factor()
p_fail = Seurat::DimPlot(object,
group.by = "failorpass",
order = "fail") +
ggplot2::scale_color_manual(values = c("#F8766D", "gray80"),
breaks = levels(object$failorpass)) +
ggplot2::labs(title = qc,
subtitle = paste0(length(failing_cells), " cells fail (",
round(100*length(failing_cells)/ncol(sobj), 2), " %)")) +
ggplot2::theme(aspect.ratio = 1,
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
# Patchwork
p = patchwork::wrap_plots(p_umap, p_fail, p_hist, p_violin, p_boxplot) +
patchwork::plot_layout(nrow = 1, widths = c(1, 1, 2, 1, 1))
return(p)
}
This is working for log10_nCount_RNA
, because it is used
to define default parameter values:
print_1_qc_metric()
but also works by specifying the parameter values:
print_1_qc_metric(sobj,
qc = "log10_nCount_RNA",
cut_qc = cut_log10_nCount_RNA,
failing_cells = fail_log10_nCount_RNA)
So we can copy-paste only this chunk for the next QC metrics !
print_1_qc_metric(sobj,
qc = "nFeature_RNA",
cut_qc = cut_nFeature_RNA,
failing_cells = fail_nFeature_RNA)
print_1_qc_metric(sobj,
qc = "percent.mt",
cut_qc = cut_percent.mt,
failing_cells = fail_percent.mt)
print_1_qc_metric(sobj,
qc = "percent.rb",
cut_qc = cut_percent.rb,
failing_cells = fail_percent.rb)
We could filter out cells based on these 5 QC metrics now. It is also possible to wait, perform various annotations such as cell type annotation or cell cycle phase scoring, to better characterize the low quality cells.
To filter a Seurat object, we use the subset
function:
sobj_filtered = subset(sobj, invert = TRUE,
cells = unique(c(fail_log10_nCount_RNA, fail_nFeature_RNA,
fail_percent.mt, fail_percent.rb, fail_percent.st)))
sobj_filtered
## An object of class Seurat
## 31053 features across 4216 samples within 1 assay
## Active assay: RNA (31053 features, 2000 variable features)
## 3 layers present: counts, data, scale.data
## 2 dimensional reductions calculated: pca, umap
We are going to save the object annotated for cells that fail the
quality control, regardless the metrics. So we add a single column to
sobj@meta.data
:
sobj$fail_qc = ifelse(colnames(sobj) %in% c(fail_log10_nCount_RNA, fail_nFeature_RNA,
fail_percent.mt, fail_percent.rb, fail_percent.st),
yes = "fail", no = "pass")
table(sobj$fail_qc)
##
## fail pass
## 371 4216
We save the non-filtered Seurat object:
saveRDS(sobj, file = paste0(outdir, "/sobj_TD3A_qc_annotated.rds"))
This Seurat object can then be the input object for annotation, definition of a new projection and downstream analyses.
This is a good practice to show the version of the packages used in this notebook.
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.5.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] deldir_2.0-4 pbapply_1.7-2 gridExtra_2.3
## [4] rlang_1.1.4 magrittr_2.0.3 RcppAnnoy_0.0.22
## [7] spatstat.geom_3.3-3 matrixStats_1.4.1 ggridges_0.5.6
## [10] compiler_4.4.1 png_0.1-8 vctrs_0.6.5
## [13] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
## [16] fastmap_1.2.0 labeling_0.4.3 utf8_1.2.4
## [19] promises_1.3.0 rmarkdown_2.28 ggbeeswarm_0.7.2
## [22] purrr_1.0.2 xfun_0.48 cachem_1.1.0
## [25] jsonlite_1.8.9 goftest_1.2-3 highr_0.11
## [28] later_1.3.2 spatstat.utils_3.1-0 irlba_2.3.5.1
## [31] parallel_4.4.1 cluster_2.1.6 R6_2.5.1
## [34] ica_1.0-3 spatstat.data_3.1-2 bslib_0.8.0
## [37] stringi_1.8.4 RColorBrewer_1.1-3 reticulate_1.39.0
## [40] spatstat.univar_3.0-1 parallelly_1.38.0 lmtest_0.9-40
## [43] jquerylib_0.1.4 scattermore_1.2 Rcpp_1.0.13
## [46] knitr_1.48 tensor_1.5 future.apply_1.11.2
## [49] zoo_1.8-12 sctransform_0.4.1 httpuv_1.6.15
## [52] Matrix_1.7-1 splines_4.4.1 igraph_2.1.1
## [55] tidyselect_1.2.1 abind_1.4-8 rstudioapi_0.17.0
## [58] yaml_2.3.10 spatstat.random_3.3-2 spatstat.explore_3.3-2
## [61] codetools_0.2-20 miniUI_0.1.1.1 listenv_0.9.1
## [64] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [67] shiny_1.9.1 withr_3.0.1 ROCR_1.0-11
## [70] ggrastr_1.0.2 evaluate_1.0.1 Rtsne_0.17
## [73] future_1.34.0 fastDummies_1.7.4 survival_3.7-0
## [76] polyclip_1.10-7 fitdistrplus_1.2-1 pillar_1.9.0
## [79] Seurat_5.1.0 KernSmooth_2.23-24 plotly_4.10.4
## [82] generics_0.1.3 RcppHNSW_0.6.0 sp_2.1-4
## [85] munsell_0.5.1 scales_1.3.0 globals_0.16.3
## [88] xtable_1.8-4 glue_1.8.0 lazyeval_0.2.2
## [91] tools_4.4.1 data.table_1.16.2 RSpectra_0.16-2
## [94] RANN_2.6.2 leiden_0.4.3.1 dotCall64_1.2
## [97] cowplot_1.1.3 grid_4.4.1 tidyr_1.3.1
## [100] colorspace_2.1-1 nlme_3.1-165 patchwork_1.3.0
## [103] beeswarm_0.4.0 vipor_0.4.7 cli_3.6.3
## [106] spatstat.sparse_3.1-0 spam_2.11-0 fansi_1.0.6
## [109] viridisLite_0.4.2 uwot_0.2.2 gtable_0.3.5
## [112] sass_0.4.9 digest_0.6.37 progressr_0.14.0
## [115] ggrepel_0.9.6 htmlwidgets_1.6.4 SeuratObject_5.0.2
## [118] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
## [121] httr_1.4.7 mime_0.12 MASS_7.3-61