EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (II)
-
Quality control : Counts & metrics
EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (II)
-
Quality control : Counts & metrics
1 PREAMBLE
1.1 Purpose of this session
This file describes the different steps to perform the quality control of a single cell dataset, based on (per cell) :
- the number of UMI (~transcripts) detected
- the number of expressed genes detected
- the proportion of UMI corresponding to mitochondrial genes
- the proportion of UMI corresponding to riboprotein-coding genes
- the proportion of transcripts related to stress signature
Input data: A filtered, flat count matrix retrieved from GEO.
Output data: A Seurat object annotated for the “bad” cells to filter out for downstream analysis.
2 Start Rstudio
- Using the OpenOnDemand/Rstudio cheat sheet, connect to the OpenOnDemand portal and create a Rstudio session with the right resource requirements, thanks to the cheat sheet.
3 Warm-up
- We now set common parameters as new variables, once and for all for this session :
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))Show output
PATH CHECK : TRUE
4 Prepare the data structure
4.1 Main directory
# maindir
## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Creating the root directory
# dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)Show output
[1] "/shared/projects/ebaii_sc_teachers/SC_TD"
4.2 Current session
# sessiondir
## Creating the session (Preproc.2) directory
session_dir <- paste0(TD_dir, "/02_Preproc.2")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)Show output
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/02_Preproc.2"
4.3 Input directory
# indir
## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)Show output
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/02_Preproc.2/DATA"
4.4 Genelists directory
This is a directory where we will store additional information from knowledge bases about genes used to estimate the cell cycle phase of cells.
# resdir
res_dir <- paste0(TD_dir, "/Resources")
glist_dir <- paste0(res_dir, "/Genelists")
## Create the directory
dir.create(path = glist_dir, recursive = TRUE)
## Print the resources directory on-screen
print(glist_dir)Show output
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/Resources/Genelists"
5 Prerequisites
We have to retrieve the input data file
# mat_dl
local <- FALSE
## The raw count matrix we will start from
scmat_source <- "GSM4861194_gex_2_raw_gene_expression.tsv.gz"
## Download the file from Zenodo
if (!local) {
### ZenID
zen_id <- "14033941"
### Zen Path
zen_backup_file <- paste0("https://zenodo.org/records/",
zen_id,
"/files/",
scmat_source)
## The path to the locally saved input file
scmat_file <- paste0(input_dir,
'/',
scmat_source)
## Download the file
download.file(url = zen_backup_file,
destfile = scmat_file)
} else {
ebaii_session <- '2538_eb3i_n1_2025'
scmat_file <- paste0(
'/shared/projects/',
ebaii_session,
'/atelier_scrnaseq/TD/BACKUP/TSV/',
scmat_source)
}6 Load the genelists resources
We retrieve the genelists
# gl_dl
local <- FALSE
## The genelist files
mito_source <- "mus_musculus_mito_symbols_20191015.rds"
ribo_source <- "mus_musculus_cribo_symbols_20191015.rds"
stress_source <- "mus_musculus_stress_symbols_20200224.rds"
gl_sources <- c(mito_source, ribo_source, stress_source)
## The (future) local files
mito_file <- paste0(input_dir, '/', mito_source)
ribo_file <- paste0(input_dir, '/', ribo_source)
stress_file <- paste0(input_dir, '/', stress_source)
gl_files <- c(mito_file, ribo_file, stress_file)
## Download the file from Zenodo
if (!local) {
### ZenID
zen_id <- "14037355"
### Looping on files
for (glf in seq_along(gl_sources)) {
### Zen Path
zen_backup_file <- paste0("https://zenodo.org/records/",
zen_id,
"/files/",
gl_sources[glf])
## Download the file
download.file(url = zen_backup_file,
destfile = gl_files[glf])
}
rm(gl_files)
} else { ## Local mode
ebaii_session <- '2538_eb3i_n1_2025'
localbackup_dir <- paste0('/shared/projects/', ebaii_session, '/atelier_scrnaseq/TD/RESOURCES/GENELISTS/')
mito_file <- paste0(localbackup_dir, '/mus_musculus_mito_symbols_20191015.rds')
ribo_source <- paste0(input_dir, '/mus_musculus_cribo_symbols_20191015.rds')
stress_source <- paste0(input_dir, '/mus_musculus_stress_symbols_20200224.rds')
}7 Biological context
For this series of training sessions, we will work on data from this Paiva et al. publication.
The study concerns thymus autonomy:
- The thymus is an “organ of passage”, critical in its function to the adaptive immune system for the maturation of T cell lymphocytes.
- This maturation involves two main steps, performed thanks to macrophages:
- Positive selection : keeping cells that successfully develop react appropriately with MHC immune receptors of the body
- Negative selection : keeping cells that do not react against natural proteins of the body.
- Thymus autonomy is a natural mechanism that allows to create T cells in the thymus by differentiation and cell competition, even when normal progenitors from the bone marrow are lacking, in critical conditions.
- This mechanism is known in its effects, but the cells involved in are not.
- This study is of importance in the health field, as this mechanism relies on a temporary loss of control of the cell normal functions.
- The consequence is that if thymus is in autonomy for too long (few weeks), this is a prelude for leukemia !
- Organism is : mus musculus
- Individuals are : newborn mice, grafted
- The design corresponds to two conditions (Test / Control)
- Control : thymus from wild type newborn mice transplanted into wild type juvenile mouse. In this control case, donor T-cells progenitors (DN3) were replaced by host cells 3 weeks after transplantation.
- Test : thymus from wild type newborn mice transplanted in KO Rag-/- type juvenile mouse (the KO partially impairs their ability to produce T-cell progenitors in normal amounts). In this test case, donor T-cells progenitors (DN3) were replaced by host cells 9 weeks after transplantation, showing that the donor DN3 cells outlived their normal lifespan by ~6 weeks.
You will mainly work on the KO sample : TD3A.
The input data consists in a count matrix, as a gzipped tabular text file, that contains everything needed to create a basic Seurat object :
- The expressions counts
- The feature names (here, gene symbols)
- The barcode names
This matrix has already been filtered for empty droplets.
9 Data
We load the count matrix:
# mat_load
## Loading the matrix directly as a sparseMatrix
scmat <- scuttle::readSparseCounts(
file = scmat_file,
sep = "\t")
## Displaying its size in-memory (this is a basic matrix)
format(utils::object.size(scmat), units = "auto")Show output
[1] "88.3 Mb"
Show output
[1] 31053 4587
Show output
5 x 5 sparse Matrix of class "dgCMatrix"
AAACCTGAGACGCTTT.1 AAACCTGAGGCATTGG.1 AAACCTGGTCAACATC.1
Xkr4 . . .
Gm1992 . . .
Gm37381 . . .
Rp1 . . .
Sox17 . . .
AAACCTGTCGAGGTAG.1 AAACCTGTCGATCCCT.1
Xkr4 . .
Gm1992 . .
Gm37381 . .
Rp1 . .
Sox17 . .
We want to build a Seurat object from the count matrix.
But how do we do ??
Show output
An object of class Seurat
31053 features across 4587 samples within 1 assay
Active assay: RNA (31053 features, 0 variable features)
1 layer present: counts
We do not need the count matrix anymore :
We set the filtering thresholds based on quality control-related metrics.
Adjust them as necessary based on the figures.
# set_thresholds
cut_nCount_RNA = 1000
cut_log10_nCount_RNA = log10(cut_nCount_RNA)
cut_nFeature_RNA = 750
cut_percent_mt = 5
cut_percent_rb = 20
cut_percent_st = 6We define a nice palette to visualize the QC metrics:
For the QC metrics related to the proportion of UMI belongs to a specific gene sets, we need to load the gene sets.
Show output
[1] "mt-Atp6" "mt-Atp8" "mt-Co1" "mt-Co2" "mt-Co3" "mt-Cytb" "mt-Nd1"
[8] "mt-Nd2" "mt-Nd3" "mt-Nd4" "mt-Nd4l" "mt-Nd5" "mt-Nd6" "mt-Rnr1"
[15] "mt-Rnr2" "mt-Ta" "mt-Tc" "mt-Td" "mt-Te" "mt-Tf" "mt-Tg"
[22] "mt-Th" "mt-Ti" "mt-Tk" "mt-Tl1" "mt-Tl2" "mt-Tm" "mt-Tn"
[29] "mt-Tp" "mt-Tq" "mt-Tr" "mt-Ts1" "mt-Ts2" "mt-Tt" "mt-Tv"
[36] "mt-Tw" "mt-Ty"
Show output
[1] "Rpsa" "Rps2" "Rps3" "Rps3a" "Rps4x" "Rps5" "Rps6" "Rps7"
[9] "Rps8" "Rps9" "Rps10" "Rps11" "Rps12" "Rps13" "Rps14" "Rps15"
[17] "Rps15a" "Rps16" "Rps17" "Rps18" "Rps19" "Rps20" "Rps21" "Rps23"
[25] "Rps24" "Rps25" "Rps26" "Rps27" "Rps27a" "Rps28" "Rps29" "Rps30"
[33] "Rpl3" "Rpl4" "Rpl5" "Rpl6" "Rpl7" "Rpl7a" "Rpl8" "Rpl9"
[41] "Rpl10" "Rpl10a" "Rpl11" "Rpl12" "Rpl13" "Rpl13a" "Rpl14" "Rpl15"
[49] "Rpl17" "Rpl18" "Rpl18a" "Rpl19" "Rpl21" "Rpl22" "Rpl23" "Rpl23a"
[57] "Rpl24" "Rpl26" "Rpl27" "Rpl27a" "Rpl28" "Rpl29" "Rpl30" "Rpl31"
[65] "Rpl32" "Rpl34" "Rpl35" "Rpl35a" "Rpl36" "Rpl44" "Rpl37" "Rpl37a"
[73] "Rpl38" "Rpl39" "Rpl40" "Rpl41" "Rplp0" "Rplp1" "Rplp2"
Show output
[1] "Atf3" "Cmss1" "Diaph1" "Egr1" "Fos" "Gls"
[7] "Gm48099" "lsp90aa1" "lsp90ab1" "lfrd1" "Jak1" "Junb"
[13] "Kpna1" "Nup210l" "Peak1" "Stat3" "Actb" "Camk1d"
[19] "Chka" "Clic4" "Fodl2" "Hspa8" "Jun" "Jund"
[25] "Klf6" "Litaf" "Pecam1" "Ptma" "Sik3" "Spag9"
[31] "Arih1" "Azin1" "Brd2" "Chd4" "Ctnnb1" "Dennd4a"
[37] "Elf1" "G3bp1" "Ints6" "Kdm6b" "Lsmem1" "Man1a"
[43] "Med13" "Nfkbia" "Nfkbiz" "Nop58" "Piezo1" "Ppp1r15a"
[49] "Prkcg" "Sqstm1" "Taf4b" "Tmsb4x" "Ubc" "Zfp36"
[55] "Abtb2" "Adamts1" "Adamts9" "Ahnak" "Ankrd28" "Atp2a2"
[61] "Baz1a" "Btg2" "Ccdc138" "Cd44" "Cdkn1a" "Elf2"
[67] "Ep400" "Epas1" "Erf" "Ern1" "Fabp4" "Fosb"
[73] "Gm10073" "Gsk3a" "H3f3a" "H3f3b" "Hivep2" "Klf4"
[79] "Lmna" "lapkapk2" "Mapre1" "Msn" "Mylip" "Nabp1"
[85] "Ncl" "Nsd3" "Nufip2" "Plekhg2" "Ppp1cb" "Rnf19b"
[91] "Rps20" "Rtn4" "Runx1" "Sertad2" "Sptan1" "Top1"
[97] "Vcl" "Zbtb11"
We keep only the gene symbols available in our data :
Show output
[1] "mt-Atp6" "mt-Atp8" "mt-Co1" "mt-Co2" "mt-Co3" "mt-Cytb" "mt-Nd1"
[8] "mt-Nd2" "mt-Nd3" "mt-Nd4" "mt-Nd4l" "mt-Nd5" "mt-Nd6"
Show output
[1] "Rpsa" "Rps2" "Rps3" "Rps4x" "Rps5" "Rps6" "Rps7" "Rps8"
[9] "Rps9" "Rps10" "Rps11" "Rps12" "Rps13" "Rps14" "Rps15" "Rps15a"
[17] "Rps16" "Rps17" "Rps18" "Rps19" "Rps20" "Rps21" "Rps23" "Rps24"
[25] "Rps25" "Rps26" "Rps27" "Rps27a" "Rps28" "Rps29" "Rpl3" "Rpl4"
[33] "Rpl5" "Rpl6" "Rpl7" "Rpl7a" "Rpl8" "Rpl9" "Rpl10" "Rpl10a"
[41] "Rpl11" "Rpl12" "Rpl13" "Rpl13a" "Rpl14" "Rpl15" "Rpl17" "Rpl18"
[49] "Rpl18a" "Rpl19" "Rpl21" "Rpl22" "Rpl23" "Rpl23a" "Rpl24" "Rpl26"
[57] "Rpl27" "Rpl27a" "Rpl28" "Rpl29" "Rpl30" "Rpl31" "Rpl32" "Rpl34"
[65] "Rpl35" "Rpl35a" "Rpl36" "Rpl37" "Rpl37a" "Rpl38" "Rpl39" "Rpl41"
[73] "Rplp0" "Rplp1" "Rplp2"
Show output
[1] "Atf3" "Cmss1" "Diaph1" "Egr1" "Fos" "Gls"
[7] "Gm48099" "Jak1" "Junb" "Kpna1" "Nup210l" "Peak1"
[13] "Stat3" "Actb" "Camk1d" "Chka" "Clic4" "Hspa8"
[19] "Jun" "Jund" "Klf6" "Litaf" "Pecam1" "Ptma"
[25] "Sik3" "Spag9" "Arih1" "Azin1" "Brd2" "Chd4"
[31] "Ctnnb1" "Dennd4a" "Elf1" "G3bp1" "Ints6" "Kdm6b"
[37] "Lsmem1" "Man1a" "Med13" "Nfkbia" "Nfkbiz" "Nop58"
[43] "Piezo1" "Ppp1r15a" "Prkcg" "Sqstm1" "Taf4b" "Tmsb4x"
[49] "Ubc" "Zfp36" "Abtb2" "Adamts1" "Adamts9" "Ahnak"
[55] "Ankrd28" "Atp2a2" "Baz1a" "Btg2" "Ccdc138" "Cd44"
[61] "Cdkn1a" "Elf2" "Ep400" "Epas1" "Erf" "Ern1"
[67] "Fabp4" "Fosb" "Gm10073" "Gsk3a" "H3f3a" "H3f3b"
[73] "Hivep2" "Klf4" "Lmna" "Mapre1" "Msn" "Mylip"
[79] "Nabp1" "Ncl" "Nsd3" "Nufip2" "Plekhg2" "Ppp1cb"
[85] "Rnf19b" "Rps20" "Rtn4" "Runx1" "Sertad2" "Sptan1"
[91] "Top1" "Vcl" "Zbtb11"
Note: A more robust analysis may use the gene identifiers (like EntrezID) rather than gene symbols.
10 Quality controls
10.1 Compute metrics
What is already available in the Seurat object ?
Show output
orig.ident nCount_RNA nFeature_RNA
AAACCTGAGACGCTTT.1 TD3A 2813 1594
AAACCTGAGGCATTGG.1 TD3A 2072 1365
AAACCTGGTCAACATC.1 TD3A 2025 1341
AAACCTGTCGAGGTAG.1 TD3A 1877 1241
AAACCTGTCGATCCCT.1 TD3A 2216 1441
AAACGGGCACTTACGA.1 TD3A 2445 1428
How do the two first QC metrics vary ?
Show output
orig.ident nCount_RNA nFeature_RNA
TD3A:4587 Min. : 502 Min. : 22
1st Qu.: 1986 1st Qu.:1291
Median : 2397 Median :1476
Mean : 3573 Mean :1638
3rd Qu.: 3134 3rd Qu.:1733
Max. :48875 Max. :5975
In the column nCount_RNA, the maximum is far from the third quartile. For visualization purpose, we transform this column to log10 scale.
Show output
orig.ident nCount_RNA nFeature_RNA log10_nCount_RNA
TD3A:4587 Min. : 502 Min. : 22 Min. :2.701
1st Qu.: 1986 1st Qu.:1291 1st Qu.:3.298
Median : 2397 Median :1476 Median :3.380
Mean : 3573 Mean :1638 Mean :3.429
3rd Qu.: 3134 3rd Qu.:1733 3rd Qu.:3.496
Max. :48875 Max. :5975 Max. :4.689
We compute the percentage of UMI related to each of the three list of genes.
First, we compute the proportion of transcripts related to mitochondrial genes, per cell.
# percent_mt
sobj = Seurat::PercentageFeatureSet(
object = sobj,
assay = "RNA",
features = mito_symbols,
col.name = "percent_mt")
summary(sobj$percent_mt)Show output
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.985 2.489 2.997 3.125 97.250
Then, we compute the proportion of transcripts related to riboprotein-coding genes, per cell:
# percent_rb
sobj = Seurat::PercentageFeatureSet(
object = sobj,
assay = "RNA",
features = ribo_symbols,
col.name = "percent_rb")
summary(sobj$percent_rb)Show output
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 7.884 9.577 10.880 12.167 42.010
Finally, we compute the proportion of transcripts related to mechanical stress signature, per cell :
# percent_st
sobj = Seurat::PercentageFeatureSet(
sobj,
assay = "RNA",
features = stress_symbols,
col.name = "percent_st")
summary(sobj$percent_st)Show output
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.946 3.338 3.368 3.749 10.198
We now have all the QC-related metrics:
Show output
orig.ident nCount_RNA nFeature_RNA log10_nCount_RNA percent_mt
TD3A:4587 Min. : 502 Min. : 22 Min. :2.701 Min. : 0.000
1st Qu.: 1986 1st Qu.:1291 1st Qu.:3.298 1st Qu.: 1.985
Median : 2397 Median :1476 Median :3.380 Median : 2.489
Mean : 3573 Mean :1638 Mean :3.429 Mean : 2.997
3rd Qu.: 3134 3rd Qu.:1733 3rd Qu.:3.496 3rd Qu.: 3.125
Max. :48875 Max. :5975 Max. :4.689 Max. :97.250
percent_rb percent_st
Min. : 0.000 Min. : 0.000
1st Qu.: 7.884 1st Qu.: 2.946
Median : 9.577 Median : 3.338
Mean :10.880 Mean : 3.368
3rd Qu.:12.167 3rd Qu.: 3.749
Max. :42.010 Max. :10.198
10.2 Failing cells
We identify the cells that do not pass the quality controls. This will be used for the visualization and later for cells filtering.
If the filtering thresholds are modified, do not forget to run again this chunk.
# fail_cells
fail_percent_mt <- rownames(sobj@meta.data)[sobj$percent_mt > cut_percent_mt]
fail_percent_rb <- rownames(sobj@meta.data)[sobj$percent_rb > cut_percent_rb]
fail_percent_st <- rownames(sobj@meta.data)[sobj$percent_st > cut_percent_st]
fail_nFeature_RNA <- rownames(sobj@meta.data)[sobj$nFeature_RNA < cut_nFeature_RNA]
fail_nCount_RNA <- rownames(sobj@meta.data)[sobj$nCount_RNA < cut_nCount_RNA]10.3 Visualization
This is difficult to handle the distribution of these metrics across cells. We opt for various visualization ways:
- histogram, showing the distribution of the metric
- violin plot, showing the distribution of the metric, useful if several datasets are considered
- UMAP (or alternatively, tSNE), showing the distribution of the metric over a 2D projection of cells
You may choose one of these visualization ways.
To generate the UMAP 2D projection, we need to run multiple Seurat commands in a row. Understanding these commands is not the purpose of the current course, but will be detailed in the next few ones. So just run :
# quick_processing
sobj_tmp = Seurat::NormalizeData(sobj, verbose = FALSE)
sobj_tmp = Seurat::ScaleData(sobj_tmp, verbose = FALSE)
sobj_tmp = Seurat::FindVariableFeatures(sobj_tmp, verbose = FALSE)
sobj_tmp = Seurat::RunPCA(sobj_tmp, npcs = 21, verbose = FALSE)
sobj_tmp = Seurat::RunUMAP(sobj_tmp, dims = c(1:20), verbose = FALSE)We can now visualize the cells on a 2D projection:
Show plot
10.3.1 Define a R function
We want to visualize our different metrics, but using the same kind of plots. Instead of copying-pasting and editing the same code multiple times (this is time-consuming and error-prone), we design a function using the template below:
# my_function_name
my_function_name = function(param1, param2) {
# do something with the parameter values
output = "something"
return(output)
}Here is the function we will use :
# qc_print_function
print_1_qc_metric = function(object = sobj,
qc = "log10_nCount_RNA",
cut_qc = cut_log10_nCount_RNA,
failing_cells = fail_nCount_RNA) {
# Description of the parameters:
# - sobj : the Seurat object, with default value to the one
# - qc : CHARACTER : the QC metric, must be a column in sobj@meta.data
# - cut_qc : NUMERIC : the filtering threshold for the QC metric
# - failing_cells : CHARACTER VECTOR : the cells that fail the QC
# Histogram
p_hist = ggplot(object@meta.data, aes(x = .data[[qc]])) +
geom_histogram(aes(y = after_stat(density)),
colour = "black", fill = "#F8766D", bins = 100) +
geom_density(alpha = 0, col = "blue", lwd = 0.75) +
geom_vline(xintercept = cut_qc, col = "red") +
labs(title = paste0("Threshold for ", qc, " is: ", cut_qc)) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
# Violin plot
p_violin = Seurat::VlnPlot(object, features = qc) +
geom_hline(yintercept = cut_qc, col = "red") +
theme(axis.title.x = element_blank(),
legend.position = "none")
# Feature plot
p_umap = Seurat::FeaturePlot(object,
reduction = "umap",
features = qc) +
scale_color_gradientn(colors = color_palette) +
theme(aspect.ratio = 1)
# Dim plot
object$failorpass = as.factor(
ifelse(test = colnames(object) %in% failing_cells,
yes = "fail",
no = "pass")
)
p_fail = Seurat::DimPlot(object,
group.by = "failorpass",
order = "fail") +
scale_color_manual(values = c("#F8766D", "gray80"),
breaks = levels(object$failorpass)) +
labs(title = qc,
subtitle = paste0(length(failing_cells), " cells fail (",
round(100*length(failing_cells)/ncol(sobj), 2), " %)")) +
theme(aspect.ratio = 1,
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
## If we applied the filter
objectf = subset(object, cells = colnames(object)[object$failorpass == 'pass'])
objectf = Seurat::ScaleData(objectf, verbose = FALSE)
objectf = Seurat::FindVariableFeatures(objectf, verbose = FALSE)
objectf = Seurat::RunPCA(objectf, npcs = 21, verbose = FALSE)
objectf = Seurat::RunUMAP(objectf, dims = c(1:20), verbose = FALSE)
p_umapf = Seurat::FeaturePlot(object = objectf,
reduction = "umap",
features = qc) +
scale_color_gradientn(colors = color_palette) +
theme(aspect.ratio = 1)
# Patchwork
p = patchwork::wrap_plots(p_umap, p_fail, p_hist, p_violin, p_umapf) +
patchwork::plot_layout(nrow = 1, widths = c(1, 1, 2, 1, 1))
return(p)
}10.3.2 Number of UMI
10.3.2.1 Check the function
This function should directly work for log10_nCount_RNA, as it is set for this metric by default (qc = "log10_nCount_RNA") :
Show plot
Of course, it will work the same by expliciting all of the parameters value:
# check2_log10_nCount_RNA
print_1_qc_metric(object = sobj_tmp,
qc = "log10_nCount_RNA",
cut_qc = cut_log10_nCount_RNA,
failing_cells = fail_nCount_RNA)Show plot
So we can copy-paste only this chunk for the next QC metrics !
10.3.3 Number of expressed genes
# see_nFeature_RNA
print_1_qc_metric(object = sobj_tmp,
qc = "nFeature_RNA",
cut_qc = cut_nFeature_RNA,
failing_cells = fail_nFeature_RNA)Show plot
10.3.4 Mitochondrial genes expression
# see_percent_mt
print_1_qc_metric(object = sobj_tmp,
qc = "percent_mt",
cut_qc = cut_percent_mt,
failing_cells = fail_percent_mt)Show plot
10.3.5 Riboprotein-coding genes expression
# see_percent_rb
print_1_qc_metric(object = sobj_tmp,
qc = "percent_rb",
cut_qc = cut_percent_rb,
failing_cells = fail_percent_rb)Show plot
10.3.6 Mechanical stress genes expression
# see_percent_st
print_1_qc_metric(object = sobj_tmp,
qc = "percent_st",
cut_qc = cut_percent_st,
failing_cells = fail_percent_st)Show plot
As we have finished with the visualization, we can discard our temporary object :
10.3.7 All metrics effect
We can visualize metrics filtering effect as an upset-plot
# bcupset
## Create a list of all cells filtered OUT for each criterion
up_list <-list(
"nFeature" = fail_nFeature_RNA,
"nCount" = fail_nCount_RNA,
"%MITO" = fail_percent_mt,
"%RIBO" = fail_percent_rb,
"%STRESS" = fail_percent_st)
## Create an upset-plot
UpSetR::upset(data = UpSetR::fromList(up_list),
nintersects = NA,
sets = rev(names(up_list)),
keep.order = TRUE,
order.by = "freq")Show plot
11 Filtering ?
We could filter out cells based on these 5 QC metrics now. It is also possible to wait, perform various annotations such as cell type annotation or cell cycle phase scoring, to better characterize the low quality cells.
To filter a Seurat object, we use the Seurat::subset() function:
# filter_sobj
sobj_filtered = subset(sobj, invert = TRUE,
cells = unique(c(fail_nCount_RNA,
fail_nFeature_RNA,
fail_percent_mt,
# fail_percent_rb, actually we decided not to filter on this parameter
fail_percent_st
)))
sobj_filteredShow output
An object of class Seurat
31053 features across 4278 samples within 1 assay
Active assay: RNA (31053 features, 0 variable features)
1 layer present: counts
We are not going to save this filtered Seurat object, but we want to store in our sobj Seurat object the cells that failed the QC as a new cell metadata :
# add_fail_qc
sobj$fail_qc = ifelse(test = colnames(sobj) %in% colnames(sobj_filtered),
yes = "pass",
no = "fail")
table(sobj$fail_qc)Show output
fail pass
309 4278
12 Save
We save the unfiltered Seurat object:
# saverds1
## Save our Seurat object (rich naming)
out_name <- paste0(
output_dir, "/", paste(
c("02", Seurat::Project(sobj), "S5",
"Metrics", paste(
dim(sobj),
collapse = '.'
)
), collapse = "_"),
".RDS")
## Check
print(out_name)[1] "/shared/projects/ebaii_sc_teachers/SC_TD/02_Preproc.2/RESULTS/02_TD3A_S5_Metrics_31053.4587.RDS"
This Seurat object would then be loaded as the input for further analyses.
13 R session
This is a good practice to show the version of the packages used in this notebook.
Show output
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.29.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Seurat_5.3.0 SeuratObject_5.1.0 sp_2.2-0 ggplot2_3.5.2
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.17.1
[3] jsonlite_2.0.0 magrittr_2.0.3
[5] ggbeeswarm_0.7.2 spatstat.utils_3.1-4
[7] farver_2.1.2 rmarkdown_2.29
[9] zlibbioc_1.50.0 vctrs_0.6.5
[11] ROCR_1.0-11 DelayedMatrixStats_1.26.0
[13] spatstat.explore_3.4-3 htmltools_0.5.8.1
[15] S4Arrays_1.4.1 SparseArray_1.4.8
[17] sass_0.4.10 sctransform_0.4.2
[19] parallelly_1.45.0 KernSmooth_2.23-24
[21] bslib_0.9.0 htmlwidgets_1.6.4
[23] ica_1.0-3 plyr_1.8.9
[25] plotly_4.10.4 zoo_1.8-14
[27] cachem_1.1.0 igraph_2.1.4
[29] mime_0.13 lifecycle_1.0.4
[31] pkgconfig_2.0.3 Matrix_1.7-3
[33] R6_2.6.1 fastmap_1.2.0
[35] GenomeInfoDbData_1.2.12 MatrixGenerics_1.16.0
[37] fitdistrplus_1.2-2 future_1.49.0
[39] shiny_1.10.0 digest_0.6.37
[41] patchwork_1.3.0 S4Vectors_0.42.1
[43] tensor_1.5 RSpectra_0.16-2
[45] irlba_2.3.5.1 GenomicRanges_1.56.2
[47] beachmat_2.20.0 labeling_0.4.3
[49] progressr_0.15.1 spatstat.sparse_3.1-0
[51] httr_1.4.7 polyclip_1.10-7
[53] abind_1.4-8 compiler_4.4.1
[55] withr_3.0.2 BiocParallel_1.38.0
[57] UpSetR_1.4.0 fastDummies_1.7.5
[59] MASS_7.3-65 DelayedArray_0.30.1
[61] tools_4.4.1 vipor_0.4.7
[63] lmtest_0.9-40 beeswarm_0.4.0
[65] httpuv_1.6.15 future.apply_1.11.3
[67] goftest_1.2-3 glue_1.8.0
[69] nlme_3.1-165 promises_1.3.2
[71] grid_4.4.1 Rtsne_0.17
[73] cluster_2.1.6 reshape2_1.4.4
[75] generics_0.1.4 gtable_0.3.6
[77] spatstat.data_3.1-6 rmdformats_1.0.4
[79] tidyr_1.3.1 data.table_1.17.4
[81] XVector_0.44.0 BiocGenerics_0.50.0
[83] spatstat.geom_3.4-1 RcppAnnoy_0.0.22
[85] ggrepel_0.9.6 RANN_2.6.2
[87] pillar_1.10.2 stringr_1.5.1
[89] spam_2.11-1 RcppHNSW_0.6.0
[91] later_1.4.2 splines_4.4.1
[93] dplyr_1.1.4 lattice_0.22-6
[95] survival_3.7-0 deldir_2.0-4
[97] tidyselect_1.2.1 SingleCellExperiment_1.26.0
[99] miniUI_0.1.2 scuttle_1.14.0
[101] pbapply_1.7-2 knitr_1.50
[103] gridExtra_2.3 bookdown_0.39
[105] IRanges_2.38.1 SummarizedExperiment_1.34.0
[107] scattermore_1.2 stats4_4.4.1
[109] xfun_0.52 Biobase_2.64.0
[111] matrixStats_1.5.0 UCSC.utils_1.0.0
[113] stringi_1.8.7 lazyeval_0.2.2
[115] yaml_2.3.10 evaluate_1.0.3
[117] codetools_0.2-20 tibble_3.2.1
[119] cli_3.6.5 uwot_0.2.3
[121] xtable_1.8-4 reticulate_1.42.0
[123] jquerylib_0.1.4 GenomeInfoDb_1.40.1
[125] dichromat_2.0-0.1 Rcpp_1.0.14
[127] globals_0.18.0 spatstat.random_3.4-1
[129] png_0.1-8 ggrastr_1.0.2
[131] spatstat.univar_3.1-3 parallel_4.4.1
[133] dotCall64_1.2 sparseMatrixStats_1.16.0
[135] listenv_0.9.1 viridisLite_0.4.2
[137] scales_1.4.0 ggridges_0.5.6
[139] purrr_1.0.4 crayon_1.5.3
[141] rlang_1.1.6 cowplot_1.1.3