---
title: "
EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (III)
-
Filtering, cell cyle, cell doublets"
date: "2025-16-21.22"
author:
- name: "EB3I n1 scRNAseq Team"
- name: "Bastien JOB"
email: "bastien.job@gustaveroussy.fr"
- name: "Lilia YOUNSI"
email: "lilia.younsi@inserm.fr"
output:
rmdformats::readthedown:
fig_width: 8
fig_height: 6
highlight: tango ## Theme for the code chunks
embed_fonts: TRUE
number_sections: true ## Adds number to headers (sections)
theme: flatly ## CSS theme for the HTML page
collapsed: true ## By default, the TOC is folded
toc_depth: 3
smooth_scroll: true ## Smooth scroll of the HTML page
self_contained: true ## Includes all plots/images within the HTML
code_download: true ## Adds a button to download the Rmd
code_folding: show
thumbnails: false
lightbox: true
fig_caption: false
gallery: true
use_bookdown: true
always_allow_html: true ## Allow plain HTML code in the Rmd
editor_options:
markdown:
wrap: 72
---
```{r knit_setup, echo = FALSE}
knitr::opts_chunk$set(
echo = TRUE, # Print the code
eval = TRUE, # Run command lines
message = FALSE, # Print messages
prompt = FALSE, # Do not display prompt
comment = NA, # No comments on this section
warning = FALSE, # Display warnings
tidy = FALSE,
fig.align="center",
# results = 'hide',
width = 100 # Number of characters per line
)
```
```{css, echo=FALSE}
.notrun {
background-color: lightgrey !important;
border: 3px solid black !important;
}
.notruno {
background-color: lightgrey !important;
color : black !important;
}
.question {
background-color: aquamarine !important;
color : black !important;
border: 3px solid limegreen !important;
}
.questiono {
background-color: aquamarine !important;
color : black !important;
}
.answer {
background-color: navajowhite !important;
border: 3px solid brown !important;
}
.answero {
background-color: navajowhite !important;
color : black !important;
}
.beyond {
background-color: violet !important;
border: 3px solid purple !important;
}
.beyondo {
background-color: violet !important;
color : black !important;
}
```
```{r knit_hook, echo = FALSE}
hooks = knitr::knit_hooks$get()
hook_foldable = function(type) {
force(type)
function(x, options) {
res = hooks[[type]](x, options)
if (isFALSE(options[[paste0("fold.", type)]])) return(res)
paste0(
"Show ", type, "
\n\n",
res,
"\n\n "
)
}
}
knitr::knit_hooks$set(
output = hook_foldable("output"),
plot = hook_foldable("plot")
)
```
------------------------------------------------------------------------
------------------------------------------------------------------------

------------------------------------------------------------------------
------------------------------------------------------------------------
# PREAMBLE
## Purpose of this session
This file describes the different steps to perform the **third part** of
the single cell RNAseq data analysis training course for the **EB3I n1
2025**, covering these steps :
- **Filtering** of bad cells and extreme low-expression features
- Estimation of the **cell cycle** phase
- Identification and removal of **cell doublets**
------------------------------------------------------------------------
------------------------------------------------------------------------
# Start Rstudio
- Using the [OpenOnDemand/Rstudio cheat
sheet](https://moodle.france-bioinformatique.fr/pluginfile.php/1475/mod_folder/content/0/OoD_R_Rstudio.html){target="_blank"},
connect to the [OpenOnDemand
portal](https://ondemand.cluster.france-bioinformatique.fr){target="_blank"} and
**create a Rstudio session** with the right resource requirements.
------------------------------------------------------------------------
------------------------------------------------------------------------
# Warm-up
- We set **common parameters** we will use throughout this session :
```{r setparam}
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))
## Seed for the RNG
my_seed <- 1337L
## Minimum cells with counts to keep a feature
min_cells <- 5
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Prepare the data structure
We will do the same as for former steps, just changing the session name.
## Main directory
```{r maindir, fold.output = FALSE}
# maindir
## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Creating the root directory
# dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)
```
## Current session
```{r sessiondir, fold.output = FALSE}
# sessiondir
## Creating the session (Preproc.3) directory
session_dir <- paste0(TD_dir, "/03_Preproc.3")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)
```
## Input directory
```{r indir, fold.output = FALSE}
# indir
## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)
```
## Genelists directory
This is a directory where we will store additional information from
knowledge bases about genes used to estimate the cell cycle phase of cells.
```{r resdir, fold.output = FALSE}
# resdir
res_dir <- paste0(TD_dir, "/Resources")
glist_dir <- paste0(res_dir, "/Genelists")
## Create the directory
dir.create(path = glist_dir, recursive = TRUE)
## Print the resources directory on-screen
print(glist_dir)
```
## Output directory
```{r outdir, fold.output = FALSE}
# outdir
## Creating the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)
## Print the output directory on-screen
print(output_dir)
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Reload the Seurat Object
- We can reload the object we saved at the former step
```{r dataload}
# dataload
## This is the path to the current EB3I backup
sessionid <- '2538_eb3i_n1_2025'
## The latest Seurat object saved as RDS (name)
sobj_file <- "02_TD3A_S5_Metrics_31053.4587.RDS"
## The latest Seurat object saved as RDS (full path)
sobj_path <- paste0(TD_dir,
"/02_Preproc.2/RESULTS/",
sobj_file)
force <- FALSE ## To force a re-download of a Zenodo-hosted backup
local <- FALSE ## To force a loading from a local backup
## In case of error/lost data : force a reload from a Zenodo backup repository
if(force) {
zen_id <- "14035293"
zen_backup_file <- paste0("https://zenodo.org/records/",
zen_id,
"/files/",
sobj_file)
## Recreate the expected path if it does not exist
dir.create(path = dirname(sobj_path), recursive = TRUE)
## Download the file
download.file(url = zen_backup_file,
destfile = sobj_path)
}
## In case of error/lost data : force a reload from a local backup repository
if(local) {
sobj_path <- paste0(
"/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/RDS/",
sobj_file)
}
## Load the object
sobj <- readRDS(file = sobj_path)
```
Now, we finally have all the metrics and bias sources we could
use, so we can actually get to the filtering step.
# Filtering
## Features
- As already explained, the **sparsity** of the count matrix is high, _very_
high.
- To the point that there probably are some features that have **so
low expression** level that they were only counted in **very few cells**. As
such, they have absolutely **no chance** to characterize any cell type, nor
harbor some variation between different cell types.
- As such, we should **discard** these features
What are the actual dimensions of our expression data ?
```{r seudim1}
# seudim1
## Dimensions of our Seurat object
dim(sobj)
```
We want to remove the features that are expressed in less than **`r min_cells`**
cells.
First, we quantify, for each feature, the number of cells with `0` count.
```{r nfeatdiff}
# nfeatdiff
## Compute the amount of 0s per feature
nFeat_zero <- sparseMatrixStats::rowCounts(
x = SeuratObject::GetAssayData(
object = sobj,
assay = "RNA",
layer = "counts"),
value = 0)
## Inversion : computing the number of cells with at least one count, per feature
nFeat_nonzero <- ncol(sobj) - nFeat_zero
## Identify those with at least 5 cells with expression
nFeat_keep <- nFeat_nonzero >= min_cells
## Quantify the selected ones
table(nFeat_keep)
```
And now we can **discard** features expressed in less than `r min_cells` cells :
```{r prefeatfiltviz, class.source="notrun", class.output="notruno"}
# prefeatfiltviz
## Create a temporary object
sobj_VIZ_pre <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ_pre <- Seurat::NormalizeData(object = sobj_VIZ_pre, verbose = FALSE)
sobj_VIZ_pre <- Seurat::ScaleData(object = sobj_VIZ_pre, verbose = FALSE)
sobj_VIZ_pre <- Seurat::FindVariableFeatures(object = sobj_VIZ_pre, verbose = FALSE)
sobj_VIZ_pre <- Seurat::RunPCA(object = sobj_VIZ_pre, npcs = 21, verbose = FALSE)
sobj_VIZ_pre <- Seurat::RunUMAP(object = sobj_VIZ_pre, dims = c(1:20), verbose = FALSE)
## Cell space visualization
dp_pre <- Seurat::DimPlot(
object = sobj_VIZ_pre,
reduction = 'umap') + Seurat::DarkTheme()
print(dp_pre)
## Remove temporary object
rm(sobj_VIZ_pre)
```
```{r featsel}
# featsel
## Restrict the Seurat object
sobj <- subset(x = sobj, features = rownames(sobj)[nFeat_keep])
```
What are the Seurat object dimensions, now ?
```{r seudim2, class.source="notrun", class.output="notruno"}
# seudim2
dim(sobj)
```
We can **visualize the cell space** after this features filtering
```{r postfeatfiltviz}
# postfeatfiltviz
## Create a temporary object
sobj_VIZ_post <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ_post <- Seurat::NormalizeData(object = sobj_VIZ_post, verbose = FALSE)
sobj_VIZ_post <- Seurat::ScaleData(object = sobj_VIZ_post, verbose = FALSE)
sobj_VIZ_post <- Seurat::FindVariableFeatures(object = sobj_VIZ_post, verbose = FALSE)
sobj_VIZ_post <- Seurat::RunPCA(object = sobj_VIZ_post, npcs = 21, verbose = FALSE)
sobj_VIZ_post <- Seurat::RunUMAP(object = sobj_VIZ_post, dims = c(1:20), verbose = FALSE)
## Cell space visualization
dp_post <- Seurat::DimPlot(
object = sobj_VIZ_post,
reduction = 'umap') + Seurat::DarkTheme()
print(dp_post)
## Remove temporary object
rm(sobj_VIZ_post)
```
Merging plots for ease of visualization :
```{r c5_pw, fig.width = 15, fig.height = 6, class.source=c("fold-hide", "notrun"), class.output="notruno"}
# c5_pw
## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
list(
dp_pre & ggplot2::ggtitle(label = "BEFORE"),
dp_post & ggplot2::ggtitle(label = "AFTER")),
nrow = 1)
```
- **Question** :
```{r q_mincell, eval = FALSE, class.source="question", eval = FALSE}
# q_mincell
Do you see any difference when comparing before vs after features filtering ?
```
```{r a_mincell, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_mincell
## . Not much changed
##
## . This is expected, as we removed features with almost no expression
```
## Cells
We are now able to **apply all the filtering strategies** we established for
each QC metric.
- Apply the filter
```{r cellfilt}
# cellfilt
## Seurat object BEFORE cell filtering
dim(sobj)
## Apply cell filtering
sobj <- subset(x = sobj, cells = colnames(sobj)[sobj$fail_qc == "pass"])
## Seurat object AFTER cell filtering
dim(sobj)
```
We can **visualize the cell space** since this cell filtering
```{r postcellfiltviz}
# postcellfiltviz
## Create a temporary object
sobj_VIZ <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ <- Seurat::NormalizeData(object = sobj_VIZ, verbose = FALSE)
sobj_VIZ <- Seurat::ScaleData(object = sobj_VIZ, verbose = FALSE)
sobj_VIZ <- Seurat::FindVariableFeatures(object = sobj_VIZ, verbose = FALSE)
sobj_VIZ <- Seurat::RunPCA(object = sobj_VIZ, npcs = 21, verbose = FALSE)
sobj_VIZ <- Seurat::RunUMAP(object = sobj_VIZ, dims = c(1:20), verbose = FALSE)
## Cell space visualization
dp_VIZ <- Seurat::DimPlot(
object = sobj_VIZ,
reduction = 'umap') + Seurat::DarkTheme()
print(dp_VIZ)
## Discard the temp object
rm(sobj_VIZ)
```
One can associate the visualization after filtering cells :
```{r cfilt_umaps, fig.width = 12, fig.height = 6, class.source="notrun", class.output="notruno"}
# cfilt_umaps
## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
list(
dp_post & ggplot2::ggtitle(label = "Features (count<5) filtered"),
dp_VIZ & ggplot2::ggtitle(label = "Cells filtered (metrics)")),
nrow = 1)
```
- **Question**
```{r q_vizfilt, eval = FALSE, class.source="question", eval = FALSE}
# q_vizfilt
Do you see any difference when comparing before vs after features filtering ?
```
```{r q_postcellfilt, class.source = c("fold-hide", "answer"), eval = FALSE}
# q_postcellfilt
## . Not much changed as well (we did not discard many cells)
##
## . The biggest cluster structure seems more defined
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Save the Seurat object
We will **save our Seurat object** that now contains **filtered cells and
features** :
```{r saverds1, fold.output = FALSE}
# saverds1
## Save our Seurat object (rich naming)
out_name <- paste0(
output_dir, "/", paste(
c("03", Seurat::Project(sobj), "S5",
"Metrics.Filtered", paste(
dim(sobj),
collapse = '.'
)
), collapse = "_"),
".RDS")
## Check
print(out_name)
## Write on disk
saveRDS(object = sobj,
file = out_name)
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Cell cycle scores
- We are currently analyzing **independent** profiles from **isolated** cells, from a sample dissociation
- As such, cells were most probably **not synchronized**, thus the effect of their **cell cycle state** on genes expression may be strong, to the point that it can **bias the data** (ie, mask some lower amplitude biological variation).
- In order to **assess** (and maybe, _remove_) this bias, we have to **quantify** it.
- We will perform this estimation thanks to heuristics based on **knowledge** : Seurat includes a method that evaluates the cell cycle phase of cells through **scores for the S and G2M phases**, each based on phase-specific **gene signatures**.
- For this step, we will use additional gene lists from knowledge
(cell cycle phase), hosted in a Zenodo respository (Id :
[14037355](https://zenodo.org/records/14037355 "Zenodo gene lists"){target="_blank"})
## Download gene lists
- We will directly retrieve data from Zenodo to your `input_dir` :
```{r dlzengl}
# dlzengl
## Zenodo ID
zen_id <- '14101506'
### Named files (will be used later on !)
cc_file <- "mus_musculus_Seurat_cc.genes_20191031.rds"
## Filename(s) to retrieve
toget_files <- c(cc_file)
## Folder to store retrieved files
local_folder <- glist_dir
## Use local backup ?
backup <- FALSE
if(backup) message("Using local backup !")
## Force download ?
force <- FALSE
if(force) message("Forcing (re)download !")
### Define remote folder
remote_folder <- if (backup) {
paste0("/shared/projects/", sessionid, "/atelier_scrnaseq/TD/RESOURCES/GENELISTS/")
} else {
paste0("https://zenodo.org/records/", zen_id, "/files/")
}
### Reconstruct the input paths
remote_path <- paste0(remote_folder, "/", toget_files)
### Reconstruct the output paths
local_path <- paste0(local_folder, "/", toget_files)
## Retrieve files (if they don't exist), in loop
for (tg in seq_along(toget_files)) {
## If the file does not locally exist
if (!file.exists(local_path[tg]) | force) {
## Retrieve data
if(backup) {
file.copy(from = remote_path[tg],
to = local_path[tg])
} else {
download.file(url = remote_path[tg],
destfile = local_path[tg])
}
## Check if downloaded files exist locally
if(file.exists(local_path[tg])) message("\tOK")
} else message(paste0(toget_files[tg], " already downloaded !"))
}
```
## Load gene lists
```{r cc_load}
# cc_load
## The cell cycle gene lists file
cc_file <- paste0(glist_dir,
"/",
cc_file)
## Load the cell cycle reference genes lists
cc_genes <- readRDS(file = cc_file)
## Have a look on its content
str(cc_genes)
```
- **Question**
```{r q_cc_check, class.source="question", eval = FALSE}
# q_cc_check
As explained, we will use gene lists extracted from community knowledge.
Our data contain values for genes also, so we will cross them.
Is there something we should check ?
```
```{r a_cc_check, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_cc_check
## . We may check if the genes in our gene lists are effectively
## present in our Seurat object !
##
## . This is expected, as we removed features with almost no expression
```
**Beyond** :
1. Write a code that performs this check
2. Add a code to adjust the content of the gene lists accordingly (remove genes from the gene lists that are not present in our dataset)
3. *(NOTE : this is actually not needed as already checked and
corrected by the cell cyle estimation method we will use)*
```{r b_cc_answer, class.source = c("fold-hide", "beyond"), class.output="beyondo", eval = FALSE}
# b_cc_answer
## Check if our data genes cover the gene lists
lapply(cc_genes, function(x) x %in% rownames(sobj))
## Remove genes not in sobj
cc_genes <- lapply(cc_genes, function(gl) { gl[gl %in% rownames(sobj)] })
## Check the modification
str(cc_genes)
## Check that all genes are available in sobj
all(unique(unlist(cc_genes)) %in% rownames(sobj))
```
## Estimation
- Let's perform this estimation. But how ?
```{r h_CellCycleScoring, class.source = "notrun", eval = FALSE}
# h_CellCycleScoring
## Reading the function help page
?Seurat::CellCycleScoring
```
- Run the cell-cycle estimation
```{r cc_run}
# cc_run
## Creating a temporary object
sobj_CCS <- sobj
sobj_CCS <- Seurat::NormalizeData(object = sobj_CCS, verbose = FALSE)
## Perform the estimation
set.seed(my_seed)
sobj_CCS <- Seurat::CellCycleScoring(
object = sobj_CCS,
s.features = cc_genes$s.genes,
g2m.features = cc_genes$g2m.genes,
# assay = 'RNA, ## Seurat v4
layer = 'RNA', ## Seurat v5
nbin = 24,
seed = my_seed)
## Transfering score from the temp object to the real one
sobj$CC_Seurat_S.Score <- sobj_CCS$S.Score
sobj$CC_Seurat_G2M.Score <- sobj_CCS$G2M.Score
sobj$CC_Seurat_Phase <- as.factor(sobj_CCS$Phase)
## Add "S minus G2M" score
sobj$CC_Seurat_SmG2M.Score <- sobj$CC_Seurat_S.Score - sobj$CC_Seurat_G2M.Score
## Discard temp object
rm(sobj_CCS)
```
- Description of the object to see the data added
```{r desc_cc, class.source="notrun", class.output="notruno", eval = FALSE}
# desc_cc
View()
```
## Visualization
As usual, we can visualize the results as violins :
```{r vlncc, fig.width = 12, class.source="notrun", class.output="notruno"}
#vlncc
Seurat::VlnPlot(object = sobj,
features = c("CC_Seurat_S.Score",
"CC_Seurat_G2M.Score",
"CC_Seurat_SmG2M.Score"))
```
But it's not that easy to interpret...
Let's plot it in the cell space
## {.unnumbered .tabset .tabset-fade .tabset-pills}
### Estimated cell phase :
```{r vizccp}
# vizccp
## Creating a temporary object.
## This time we will keep it to speed up further plots (next chunks)
sobj_VIZ_cc <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ_cc <- Seurat::NormalizeData(object = sobj_VIZ_cc, verbose = FALSE)
sobj_VIZ_cc <- Seurat::ScaleData(object = sobj_VIZ_cc, verbose = FALSE)
sobj_VIZ_cc <- Seurat::FindVariableFeatures(object = sobj_VIZ_cc, verbose = FALSE)
sobj_VIZ_cc <- Seurat::RunPCA(object = sobj_VIZ_cc, npcs = 21, verbose = FALSE)
sobj_VIZ_cc <- Seurat::RunUMAP(object = sobj_VIZ_cc, dims = c(1:20), verbose = FALSE)
## Cell cycle phase feature plot
dp_ccp <- Seurat::DimPlot(
object = sobj_VIZ_cc,
reduction = 'umap',
group.by = 'CC_Seurat_Phase') + Seurat::DarkTheme()
print(dp_ccp)
```
### SmG2M ("S minus G2M") score :
```{r vizccs}
# vizccs
## SmG2M feature plot
fp_cc <- Seurat::FeaturePlot(
object = sobj_VIZ_cc,
reduction = 'umap',
features = 'CC_Seurat_SmG2M.Score') + Seurat::DarkTheme()
print(fp_cc)
## We can discard the temp object
rm(sobj_VIZ_cc)
```
## {.unnumbered}
**Questions**
```{r q_cell1, class.source="question", eval = FALSE}
# q_cell1
Does the structure of the cells in this representation seem to have
a link with these cell cycle phases/scores ?
```
```{r a_cell1, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_cell1
## It's an absolute yes !
```
```{r q_cell2, class.source="question", eval = FALSE}
# q_cell2
Do you think this is the result of an artifact, or
biology-related ?
```
```{r a_cell2, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_cell2
##
## ¯\_(ツ)_/¯
##
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Save the Seurat object
We will save our Seurat object that now contains the cell cycle
states/scores :
```{r saverds2, fold.output = FALSE}
# saverds2
## Save our Seurat object (rich naming)
out_name <- paste0(
output_dir, "/", paste(
c("04", Seurat::Project(sobj), "S5",
"CC", paste(
dim(sobj),
collapse = '.'
)
), collapse = "_"),
".RDS")
## Check
print(out_name)
## Write on disk
saveRDS(object = sobj,
file = out_name)
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Cell doublets
We will use **two different methods** to detect and remove cell doublets :
- [`scds`](https://www.bioconductor.org/packages/release/bioc/html/scds.html){target="_blank"} (in its "hybrid" mode) : more efficient at detecting **homotypic** doublets
- [`scDblFinder`](https://www.bioconductor.org/packages/release/bioc/html/scDblFinder.html){target="_blank"} : more efficient at detecting **heterotypic** doublets
None of the methods accepts a `SeuratObject` as input, but a [`SingleCellExperiment`](https://www.bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html){target="_blank"} object.
Hopefully :
- Seurat has a function to perform the conversion
- The output results can be integrated into our Seurat object with ease
## Two methods {.tabset .tabset-fade .tabset-pills}
### `scds`
```{r scds}
# scds
## Fix seed
set.seed(my_seed)
## Run scds
sobj$doublet_scds.hybrid <- unname(
scds::cxds_bcds_hybrid(
Seurat::as.SingleCellExperiment(
sobj, assay = "RNA"))$hybrid_score > 1)
## Contingencies
table(sobj$doublet_scds.hybrid)
```
### `scDblFinder`
```{r scdbl}
# scdbl
## Fix seed
set.seed(my_seed)
## Run scDblFinder (which needs another object type)
sobj$doublet_scDblFinder <- scDblFinder::scDblFinder(
sce = Seurat::as.SingleCellExperiment(
x = sobj,
assay = "RNA"),
returnType = "table")$class == "doublet"
## Contingencies
table(sobj$doublet_scDblFinder)
```
# {.unnumbered}
------------------------------------------------------------------------
------------------------------------------------------------------------
## Merge results
We merge results of the two methods
```{r dblmerge}
# dblmerge
## Logical union of both methods
sobj$doublet_union <- sobj$doublet_scds.hybrid | sobj$doublet_scDblFinder
## Quantify doublets
table(sobj$doublet_union)
```
We can assess **tool-specific** and **common** doublets
```{r dbl_types, class.source="notrun", class.output="notruno"}
# dbl_types
### Singlets by default
sobj$doublet_viz <- "singlet"
### Union
sobj$doublet_viz[sobj$doublet_union] <- "both"
### scds-specific
sobj$doublet_viz[sobj$doublet_scds.hybrid & !sobj$doublet_scDblFinder] <- "scds"
### scDblFinder-specific
sobj$doublet_viz[sobj$doublet_scDblFinder & !sobj$doublet_scds.hybrid] <- "scDblFinder"
## Convert to factor
sobj$doublet_viz <- as.factor(sobj$doublet_viz)
## Contingencies
table(sobj$doublet_viz)
```
**Beyond** : Create an upset-plot for the doublet status according to the two methods used
```{r b_upset, class.source=c("fold-hide", "beyond"), class.output="beyondo"}
# b_upset
## Build a list of cells tagged by one tool, the other, or both
dbl_list <-list(
"scds" = colnames(sobj)[sobj$doublet_scds.hybrid & !sobj$doublet_scDblFinder],
"scDblFinder" = colnames(sobj)[!sobj$doublet_scds.hybrid & sobj$doublet_scDblFinder],
"both" = colnames(sobj)[sobj$doublet_scds.hybrid & sobj$doublet_scDblFinder])
## Draw the upset plot
UpSetR::upset(data = UpSetR::fromList(dbl_list),
nintersects = NA,
sets = rev(names(dbl_list)),
keep.order = TRUE,
order.by = "freq")
```
Now we can remove barcodes identified as cell doublets, and visualize the cell space before and after.
## Doublets filtering {.tabset .tabset-fade .tabset-pills}
### BEFORE
```{r dblviz1}
# dblviz1
### Doublets viz (before removal)
## Create a temporary object
sobj_VIZ_dbl <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ_dbl <- Seurat::NormalizeData(object = sobj_VIZ_dbl, verbose = FALSE)
sobj_VIZ_dbl <- Seurat::ScaleData(object = sobj_VIZ_dbl, verbose = FALSE)
sobj_VIZ_dbl <- Seurat::FindVariableFeatures(object = sobj_VIZ_dbl, verbose = FALSE)
sobj_VIZ_dbl <- Seurat::RunPCA(object = sobj_VIZ_dbl, npcs = 21, verbose = FALSE)
sobj_VIZ_dbl <- Seurat::RunUMAP(object = sobj_VIZ_dbl, dims = c(1:20), verbose = FALSE)
## Doublets visualization
dp_dbl <- Seurat::DimPlot(
object = sobj_VIZ_dbl,
reduction = 'umap',
group.by = 'doublet_viz') + Seurat::DarkTheme()
## Plot
print(dp_dbl)
## Discard temp object
rm(sobj_VIZ_dbl)
```
### Remove doublets
```{r dblrm, fold.output = FALSE}
# dblrm
## Dimensions before removal
dim(sobj)
## Perform the removal
# sobj <- sobj[,!sobj$doublet_union]
sobj <- subset(
x = sobj,
cells = colnames(sobj)[sobj$doublet_union],
invert = TRUE)
## Dimensions after
dim(sobj)
```
### AFTER
```{r dblviz2, class.source = "notrun", class.output="notruno"}
# dblviz2
### Doublets viz (after removal)
## Create a temporary object
sobj_VIZ_dblf <- sobj
## Usual set of "U can't C me" functions
sobj_VIZ_dblf <- Seurat::NormalizeData(object = sobj_VIZ_dblf, verbose = FALSE)
sobj_VIZ_dblf <- Seurat::ScaleData(object = sobj_VIZ_dblf, verbose = FALSE)
sobj_VIZ_dblf <- Seurat::FindVariableFeatures(object = sobj_VIZ_dblf, verbose = FALSE)
sobj_VIZ_dblf <- Seurat::RunPCA(object = sobj_VIZ_dblf, npcs = 21, verbose = FALSE)
sobj_VIZ_dblf <- Seurat::RunUMAP(object = sobj_VIZ_dblf, dims = c(1:20), verbose = FALSE)
## Doublets-filtered visualization
dp_dblf <- Seurat::DimPlot(
object = sobj_VIZ_dblf,
reduction = 'umap',
group.by = 'doublet_viz') + Seurat::DarkTheme()
## Plot
print(dp_dblf)
## Discard temp object
rm(sobj_VIZ_dblf)
```
## {.unnumbered}
------------------------------------------------------------------------
------------------------------------------------------------------------
Merge plots for ease of use :
```{r dbl_umaps, fig.width = 12, fig.height = 6, class.source = "notrun", class.output="notruno"}
# dbl_umaps
## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
list(
dp_dbl & ggplot2::ggtitle(label = "Cell doublets (unfiltered)"),
dp_dblf & ggplot2::ggtitle(label = "Cell doublets (filtered)")),
nrow = 1)
```
**Question**
```{r q_dblcomp, class.source="question", eval=FALSE}
# q_dblcomp
What do you observe when comparing before and after the doublets filtering ?
```
```{r a_dblcomp, class.source = c("fold-hide", "answer"), eval=FALSE}
# a_dblcomp
## . Actually not that much !
##
## . This is due to the fact that this dataset did not contain
## a lot of heterotypic doublets
##
## . When a dataset is heavily contaminated with lots of heterotypic
## doublets, their removal can drastically remodel the topology.
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Save the Seurat object
We will save our Seurat object that is now filtered for doublets :
```{r saverds3}
# saverds3
## Save our Seurat object (rich naming)
out_name <- paste0(
output_dir, "/", paste(
c("05", Seurat::Project(sobj), "S5",
"Doublets.Filtered", paste(
dim(sobj),
collapse = '.'
)
), collapse = "_"),
".RDS")
## Check
print(out_name)
## Write on disk
saveRDS(object = sobj,
file = out_name)
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Rsession
```{r rsession, class.source="notrun", class.output="notruno"}
# rsession
utils::sessionInfo()
```