EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (III)
-
Filtering, cell cyle, cell doublets
EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (III)
-
Filtering, cell cyle, cell doublets
1 PREAMBLE
1.1 Purpose of this session
This file describes the different steps to perform the third part of the single cell RNAseq data analysis training course for the EB3I n1 2025, covering these steps :
Filtering of bad cells and extreme low-expression features
Estimation of the cell cycle phase
Identification and removal of cell doublets
2 Start Rstudio
- Using the OpenOnDemand/Rstudio cheat sheet, connect to the OpenOnDemand portal and create a Rstudio session with the right resource requirements.
3 Warm-up
- We set common parameters we will use throughout this session :
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))Show output
PATH CHECK : TRUE
4 Prepare the data structure
We will do the same as for former steps, just changing the session name.
4.1 Main directory
# maindir
## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Creating the root directory
# dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)[1] "/shared/projects/ebaii_sc_teachers/SC_TD"
4.2 Current session
# sessiondir
## Creating the session (Preproc.3) directory
session_dir <- paste0(TD_dir, "/03_Preproc.3")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)[1] "/shared/projects/ebaii_sc_teachers/SC_TD/03_Preproc.3"
4.3 Input directory
# indir
## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)[1] "/shared/projects/ebaii_sc_teachers/SC_TD/03_Preproc.3/DATA"
4.4 Genelists directory
This is a directory where we will store additional information from knowledge bases about genes used to estimate the cell cycle phase of cells.
# resdir
res_dir <- paste0(TD_dir, "/Resources")
glist_dir <- paste0(res_dir, "/Genelists")
## Create the directory
dir.create(path = glist_dir, recursive = TRUE)
## Print the resources directory on-screen
print(glist_dir)[1] "/shared/projects/ebaii_sc_teachers/SC_TD/Resources/Genelists"
5 Reload the Seurat Object
- We can reload the object we saved at the former step
# dataload
## This is the path to the current EB3I backup
sessionid <- '2538_eb3i_n1_2025'
## The latest Seurat object saved as RDS (name)
sobj_file <- "02_TD3A_S5_Metrics_31053.4587.RDS"
## The latest Seurat object saved as RDS (full path)
sobj_path <- paste0(TD_dir,
"/02_Preproc.2/RESULTS/",
sobj_file)
force <- FALSE ## To force a re-download of a Zenodo-hosted backup
local <- FALSE ## To force a loading from a local backup
## In case of error/lost data : force a reload from a Zenodo backup repository
if(force) {
zen_id <- "14035293"
zen_backup_file <- paste0("https://zenodo.org/records/",
zen_id,
"/files/",
sobj_file)
## Recreate the expected path if it does not exist
dir.create(path = dirname(sobj_path), recursive = TRUE)
## Download the file
download.file(url = zen_backup_file,
destfile = sobj_path)
}
## In case of error/lost data : force a reload from a local backup repository
if(local) {
sobj_path <- paste0(
"/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/RDS/",
sobj_file)
}
## Load the object
sobj <- readRDS(file = sobj_path)Now, we finally have all the metrics and bias sources we could use, so we can actually get to the filtering step.
6 Filtering
6.1 Features
As already explained, the sparsity of the count matrix is high, very high.
To the point that there probably are some features that have so low expression level that they were only counted in very few cells. As such, they have absolutely no chance to characterize any cell type, nor harbor some variation between different cell types.
As such, we should discard these features
What are the actual dimensions of our expression data ?
Show output
[1] 31053 4587
We want to remove the features that are expressed in less than 5 cells.
First, we quantify, for each feature, the number of cells with 0 count.
# nfeatdiff
## Compute the amount of 0s per feature
nFeat_zero <- sparseMatrixStats::rowCounts(
x = SeuratObject::GetAssayData(
object = sobj,
assay = "RNA",
layer = "counts"),
value = 0)
## Inversion : computing the number of cells with at least one count, per feature
nFeat_nonzero <- ncol(sobj) - nFeat_zero
## Identify those with at least 5 cells with expression
nFeat_keep <- nFeat_nonzero >= min_cells
## Quantify the selected ones
table(nFeat_keep)Show output
nFeat_keep
FALSE TRUE
18545 12508
And now we can discard features expressed in less than 5 cells :
# prefeatfiltviz
## Create a temporary object
sobj_VIZ_pre <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ_pre <- Seurat::NormalizeData(object = sobj_VIZ_pre, verbose = FALSE)
sobj_VIZ_pre <- Seurat::ScaleData(object = sobj_VIZ_pre, verbose = FALSE)
sobj_VIZ_pre <- Seurat::FindVariableFeatures(object = sobj_VIZ_pre, verbose = FALSE)
sobj_VIZ_pre <- Seurat::RunPCA(object = sobj_VIZ_pre, npcs = 21, verbose = FALSE)
sobj_VIZ_pre <- Seurat::RunUMAP(object = sobj_VIZ_pre, dims = c(1:20), verbose = FALSE)
## Cell space visualization
dp_pre <- Seurat::DimPlot(
object = sobj_VIZ_pre,
reduction = 'umap') + Seurat::DarkTheme()
print(dp_pre)Show plot
# featsel
## Restrict the Seurat object
sobj <- subset(x = sobj, features = rownames(sobj)[nFeat_keep])What are the Seurat object dimensions, now ?
Show output
[1] 12508 4587
We can visualize the cell space after this features filtering
# postfeatfiltviz
## Create a temporary object
sobj_VIZ_post <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ_post <- Seurat::NormalizeData(object = sobj_VIZ_post, verbose = FALSE)
sobj_VIZ_post <- Seurat::ScaleData(object = sobj_VIZ_post, verbose = FALSE)
sobj_VIZ_post <- Seurat::FindVariableFeatures(object = sobj_VIZ_post, verbose = FALSE)
sobj_VIZ_post <- Seurat::RunPCA(object = sobj_VIZ_post, npcs = 21, verbose = FALSE)
sobj_VIZ_post <- Seurat::RunUMAP(object = sobj_VIZ_post, dims = c(1:20), verbose = FALSE)
## Cell space visualization
dp_post <- Seurat::DimPlot(
object = sobj_VIZ_post,
reduction = 'umap') + Seurat::DarkTheme()
print(dp_post)Show plot
Merging plots for ease of visualization :
# c5_pw
## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
list(
dp_pre & ggplot2::ggtitle(label = "BEFORE"),
dp_post & ggplot2::ggtitle(label = "AFTER")),
nrow = 1)Show plot
Question :
6.2 Cells
We are now able to apply all the filtering strategies we established for each QC metric.
Apply the filter
Show output
[1] 12508 4587## Apply cell filtering sobj <- subset(x = sobj, cells = colnames(sobj)[sobj$fail_qc == "pass"]) ## Seurat object AFTER cell filtering dim(sobj)Show output
[1] 12508 4278
We can visualize the cell space since this cell filtering
# postcellfiltviz
## Create a temporary object
sobj_VIZ <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ <- Seurat::NormalizeData(object = sobj_VIZ, verbose = FALSE)
sobj_VIZ <- Seurat::ScaleData(object = sobj_VIZ, verbose = FALSE)
sobj_VIZ <- Seurat::FindVariableFeatures(object = sobj_VIZ, verbose = FALSE)
sobj_VIZ <- Seurat::RunPCA(object = sobj_VIZ, npcs = 21, verbose = FALSE)
sobj_VIZ <- Seurat::RunUMAP(object = sobj_VIZ, dims = c(1:20), verbose = FALSE)
## Cell space visualization
dp_VIZ <- Seurat::DimPlot(
object = sobj_VIZ,
reduction = 'umap') + Seurat::DarkTheme()
print(dp_VIZ)Show plot
One can associate the visualization after filtering cells :
# cfilt_umaps
## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
list(
dp_post & ggplot2::ggtitle(label = "Features (count<5) filtered"),
dp_VIZ & ggplot2::ggtitle(label = "Cells filtered (metrics)")),
nrow = 1)Show plot
- Question
```{.r .question}
# q_vizfilt
Do you see any difference when comparing before vs after features filtering ?
```
<br>
```{.r .fold-hide .answer}
# q_postcellfilt
## . Not much changed as well (we did not discard many cells)
##
## . The biggest cluster structure seems more defined
```
<br>
7 Save the Seurat object
We will save our Seurat object that now contains filtered cells and features :
# saverds1
## Save our Seurat object (rich naming)
out_name <- paste0(
output_dir, "/", paste(
c("03", Seurat::Project(sobj), "S5",
"Metrics.Filtered", paste(
dim(sobj),
collapse = '.'
)
), collapse = "_"),
".RDS")
## Check
print(out_name)[1] "/shared/projects/ebaii_sc_teachers/SC_TD/03_Preproc.3/RESULTS/03_TD3A_S5_Metrics.Filtered_12508.4278.RDS"
8 Cell cycle scores
We are currently analyzing independent profiles from isolated cells, from a sample dissociation
As such, cells were most probably not synchronized, thus the effect of their cell cycle state on genes expression may be strong, to the point that it can bias the data (ie, mask some lower amplitude biological variation).
In order to assess (and maybe, remove) this bias, we have to quantify it.
We will perform this estimation thanks to heuristics based on knowledge : Seurat includes a method that evaluates the cell cycle phase of cells through scores for the S and G2M phases, each based on phase-specific gene signatures.
For this step, we will use additional gene lists from knowledge (cell cycle phase), hosted in a Zenodo respository (Id : 14037355)
8.1 Download gene lists
We will directly retrieve data from Zenodo to your
input_dir:# dlzengl ## Zenodo ID zen_id <- '14101506' ### Named files (will be used later on !) cc_file <- "mus_musculus_Seurat_cc.genes_20191031.rds" ## Filename(s) to retrieve toget_files <- c(cc_file) ## Folder to store retrieved files local_folder <- glist_dir ## Use local backup ? backup <- FALSE if(backup) message("Using local backup !") ## Force download ? force <- FALSE if(force) message("Forcing (re)download !") ### Define remote folder remote_folder <- if (backup) { paste0("/shared/projects/", sessionid, "/atelier_scrnaseq/TD/RESOURCES/GENELISTS/") } else { paste0("https://zenodo.org/records/", zen_id, "/files/") } ### Reconstruct the input paths remote_path <- paste0(remote_folder, "/", toget_files) ### Reconstruct the output paths local_path <- paste0(local_folder, "/", toget_files) ## Retrieve files (if they don't exist), in loop for (tg in seq_along(toget_files)) { ## If the file does not locally exist if (!file.exists(local_path[tg]) | force) { ## Retrieve data if(backup) { file.copy(from = remote_path[tg], to = local_path[tg]) } else { download.file(url = remote_path[tg], destfile = local_path[tg]) } ## Check if downloaded files exist locally if(file.exists(local_path[tg])) message("\tOK") } else message(paste0(toget_files[tg], " already downloaded !")) }
8.2 Load gene lists
# cc_load
## The cell cycle gene lists file
cc_file <- paste0(glist_dir,
"/",
cc_file)
## Load the cell cycle reference genes lists
cc_genes <- readRDS(file = cc_file)
## Have a look on its content
str(cc_genes)Show output
List of 2
$ s.genes : chr [1:43] "Mcl5" "Pcna" "Tyms" "Fen1" ...
$ g2m.genes: chr [1:54] "Hmgb2" "Cdk1" "Nusap1" "Ube2c" ...
Question
# q_cc_check As explained, we will use gene lists extracted from community knowledge. Our data contain values for genes also, so we will cross them. Is there something we should check ?# a_cc_check ## . We may check if the genes in our gene lists are effectively ## present in our Seurat object ! ## ## . This is expected, as we removed features with almost no expression
Beyond :
Write a code that performs this check
Add a code to adjust the content of the gene lists accordingly (remove genes from the gene lists that are not present in our dataset)
(NOTE : this is actually not needed as already checked and corrected by the cell cyle estimation method we will use)
# b_cc_answer
## Check if our data genes cover the gene lists
lapply(cc_genes, function(x) x %in% rownames(sobj))
## Remove genes not in sobj
cc_genes <- lapply(cc_genes, function(gl) { gl[gl %in% rownames(sobj)] })
## Check the modification
str(cc_genes)
## Check that all genes are available in sobj
all(unique(unlist(cc_genes)) %in% rownames(sobj))8.3 Estimation
Let’s perform this estimation. But how ?
Run the cell-cycle estimation
# cc_run ## Creating a temporary object sobj_CCS <- sobj sobj_CCS <- Seurat::NormalizeData(object = sobj_CCS, verbose = FALSE) ## Perform the estimation set.seed(my_seed) sobj_CCS <- Seurat::CellCycleScoring( object = sobj_CCS, s.features = cc_genes$s.genes, g2m.features = cc_genes$g2m.genes, # assay = 'RNA, ## Seurat v4 layer = 'RNA', ## Seurat v5 nbin = 24, seed = my_seed) ## Transfering score from the temp object to the real one sobj$CC_Seurat_S.Score <- sobj_CCS$S.Score sobj$CC_Seurat_G2M.Score <- sobj_CCS$G2M.Score sobj$CC_Seurat_Phase <- as.factor(sobj_CCS$Phase) ## Add "S minus G2M" score sobj$CC_Seurat_SmG2M.Score <- sobj$CC_Seurat_S.Score - sobj$CC_Seurat_G2M.Score ## Discard temp object rm(sobj_CCS)
Description of the object to see the data added