EB3I n1 2025 scRNAseq
-
CELL ANNOTATION
1 PREAMBLE
1.1 Purpose of this session
We are now reaching a challenging step of the analysis (and a very exciting one !).
- What types of cells did we capture in the analysis ?
- Do we identify the expected cell types and can we distinguish different sub-populations ?
- Do we identify “novel” or “unexpected” cell types ?
The aim of this session is to understand the different methods that will help you to explore the biological cell types captured by your dataset.
2 Start Rstudio
- Using the OpenOnDemand/Rstudio cheat sheet, connect to the OpenOnDemand portal and create a Rstudio session with the right resource requirements.
3 Warm-up
- We set common parameters we will use throughout this session :
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))
PATH CHECK : TRUE
4 Prepare the data structure
We will do the same as for former steps, just changing the session name.
4.1 Main directory
# maindir
## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Creating the root directory (already exists at this step)
# dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD"
4.2 Current session
# sessiondir
## Creating the session (Cell.Annotation) directory
session_dir <- paste0(TD_dir, "/08_Cell.Annotation")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/08_Cell.Annotation"
4.3 Input directory
# indir
## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/08_Cell.Annotation/DATA"
4.4 Genelists directory
This is a directory where we will store additional information from knowledge bases about genes used to estimate the cell cycle phase of cells.
# resdir
res_dir <- paste0(TD_dir, "/Resources")
glist_dir <- paste0(res_dir, "/Genelists")
## Create the directory
dir.create(path = glist_dir, recursive = TRUE)
## Print the resources directory on-screen
print(glist_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/Resources/Genelists"
5 Reload the Seurat Object
- We can reload the object we saved at the former step
# dataload
## This is the path to the current EB3I backup
sessionid <- '2538_eb3i_n1_2025'
## The latest Seurat object saved as RDS (name)
sobj_file <- "12_TD3A.TDCT_S5_Integrated_12926.3886.RDS"
## The latest Seurat object saved as RDS (full path)
sobj_path <- paste0(TD_dir,
"/06_Integration/RESULTS/",
sobj_file)
force <- FALSE ## To force a re-download of a Zenodo-hosted backup
local <- FALSE ## To force a loading from a local backup
## In case of error/lost data : force a reload from a Zenodo backup repository
if(force) {
zen_id <- "14035293"
zen_backup_file <- paste0("https://zenodo.org/records/",
zen_id,
"/files/",
sobj_file)
## Recreate the expected path if it does not exist
dir.create(path = dirname(sobj_path), recursive = TRUE)
## Download the file
download.file(url = zen_backup_file,
destfile = sobj_path)
}
## In case of error/lost data : force a reload from a local backup repository
if(local) {
sobj_path <- paste0(
"/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/RDS/",
sobj_file)
}
## Load the object
sobj <- readRDS(file = sobj_path)
6 Overview of the scRNAseq pipeline
At this step of the analysis, we have :
- a gene expression matrix : for each cell, gene expression is available
- a reduced space : the gene expression matrix is summarized in N dimensions
- a clustering : each cell belongs to a specific cluster
- a 2D space : cells can be visualized on a 2D representation (UMAP)
On the cell visualization, we also searched for clusters of cells. The clustering showed multiple cell clusters, at several resolutions, that we can now associate with cell types.
For this you need :
- your biological knowledge of your dataset
- an internet connection :)
7 Different methods to annotate cell types
The annotation methods aim at defining marker genes that help to identify the cell types in each cluster.
The logic across methods is similar :
You identify genes or sets of genes whose expression pattern is specific to a cluster and shared by a large fraction of the cells of that cluster.
Different methods exist :
- MANUAL : You can annotate the clusters yourself, using sets of marker genes from the literature, or use other datasets that have already been annotated to transfer their annotation to your clustering if similar expression patterns are found.
- AUTOMATIC : You use a published database that collects sets of marker genes for cell types, or a published reference single cell atlas that is already annotated, or published RNAseq data on a specific cell type that you know is in your dataset…
For this practical session, we will try both approaches. We use the dataset previously filtered and pre-processed (the Seurat object contains the 3 dimension reductions previously performed - pca, umap and harmony).
7.1 Manual annotation
We will perform manual annotation using differential expression.
7.1.1 Overview of the analysis
For manual annotation of cell clusters using marker genes, you need :
- A Seurat object with normalized counts.
- To choose a clustering resolution that you want to use for the annotation.
- A reduced space to visualize the results.
7.1.2 Overview of the functions to be used
# funcs
## Seurat functions to be used in the TD for manual annotation
Seurat::FindClusters()
SeuratObject::JoinLayers()
Seurat::FindAllMarkers()
## Seurat functions to be used for visualization purpose
### Function to color cell by a "dimension" of their metadata
### (ex : sample of origin ie the annotation called "orig.ident")
Seurat::DimPlot()
### Function to color cell by the expression of one feature
Seurat::FeaturePlot()
### Function to built a heatmap of expression of the marker genes identified
Seurat::DoHeatmap()
### Function to visualize how feature expression changes across different identity classes
Seurat::DotPlot()
7.1.3 Current Seurat object
An object of class Seurat
12926 features across 3886 samples within 1 assay
Active assay: RNA (12926 features, 2000 variable features)
5 layers present: counts.TD3A, counts.TDCT, data.TD3A, data.TDCT, scale.data
6 dimensional reductions calculated: pca, CCAIntegration, umap, RPCAIntegration, HarmonyIntegration, HarmonyStandalone
You can see that the calculated reductions include new ones that you did not use before. These reductions come from the integration step, used to merge the 2 samples into a common representation with less batch effect.
# dp1
## Look at the content of the Seurat object post integration
Seurat::DimPlot(sobj,
## use the reduction umap of the integrated object
reduction = "umap",
## color the cells by their sample of origin
group.by = c("orig.ident"))
Clusters to annotate
Manual annotation is based on the identification of marker genes that characterize a cell population. For this, we must define which groups of cells we want to annotate.
This choice is based on the clustering (Look back at Proc.2). After the integration step, we can re-perform the 2 steps to get a clustering as we now have an object with the integration of both TD3A and TDCT samples.
# reclusters
## Find neighbors to prepare the data for clustering, using the first 20 dimensions of the Harmony integration
sobj <- Seurat::FindNeighbors(
object = sobj,
dims = 1:20,
reduction = "HarmonyIntegration")
## Louvain resolutions to test
resol <- c(0.1,0.2,0.3,0.8,1.5)
## Clustering
sobj <- Seurat::FindClusters(
object = sobj,
## Vector of different resolutions defined above
resolution = resol,
## Makes the function quiet
verbose = FALSE,
## Algorithm for modularity optimization :
## . 1 : original Louvain algorithm
## . 2 : Louvain algorithm with multilevel refinement
## . 3 : SLM algorithm
## . 4 : Leiden algorithm. Requires the "leidenalg" python library
algorithm = 1)
Multiplot with the different clustering results :
# diffresol
## Plot the clustering results
Seurat::DimPlot(object = sobj,
reduction = "umap",
group.by = paste0('RNA_snn_res.', resol),
pt.size = 1,
label = TRUE,
repel = TRUE
) + ggplot2::theme(legend.position = "bottom")
Now we have all the data needed to find marker genes per cluster, but we must first choose the clustering resolution we want to annotate.
Advice : Always start with a low resolution to get a first idea of the broad cell types you are capturing.
Here I chose the resolution 0.2 to get simpler classes. It could be wise to start with the resolution 0.1 to get an even more general idea of the marker genes per cluster.
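To help choose a resolution, it can be useful to count the clusters obtained at each resolution and to see how the clusters of a low resolution split at the next one. The sketch below uses a small toy metadata table (hypothetical values) that mimics the `RNA_snn_res.*` columns created by `FindClusters()`; on the real object, `sobj@meta.data` would take the place of `toy_meta`.

```r
## Toy metadata standing in for sobj@meta.data (hypothetical values, 8 cells)
toy_meta <- data.frame(
  RNA_snn_res.0.1 = factor(c(0, 0, 0, 1, 1, 1, 1, 0)),
  RNA_snn_res.0.2 = factor(c(0, 0, 1, 2, 2, 2, 2, 1)),
  RNA_snn_res.0.8 = factor(c(0, 1, 2, 3, 3, 4, 4, 2))
)

## Columns holding one clustering per resolution
res_cols <- grep("^RNA_snn_res\\.", colnames(toy_meta), value = TRUE)

## Number of clusters found at each resolution
n_clusters <- sapply(toy_meta[res_cols], nlevels)
print(n_clusters)

## How the clusters found at resolution 0.1 split at resolution 0.2
split_tab <- table(res_0.1 = toy_meta$RNA_snn_res.0.1,
                   res_0.2 = toy_meta$RNA_snn_res.0.2)
print(split_tab)
```

A cluster that stays in one piece when the resolution increases is usually a good candidate for a broad cell type.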
Differential expression between clusters
One way to annotate the clusters of cells is to look at the genes highly expressed in one cluster of cells compared to all the other cells : we can do a differential expression (DE) analysis.
DEA (differential expression analysis) is performed on the normalized count matrix (“data”), which is available in our integrated Seurat object.
To perform the manual annotation, we will use the Seurat::FindAllMarkers function.
This function compares each cluster against all the others to identify differentially expressed genes that are potential marker genes.
WARNING : In Seurat v5, the integrated object still contains the 2 normalized datasets in separate layers (one per sample of origin). To identify marker genes, we must join these layers together with the command JoinLayers.
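The joining step is not shown on this page. A minimal sketch of what it could look like is given below, assuming the object `sobj` and the clustering column `RNA_snn_res.0.2` from this session. The helper name `prepare_for_markers` is hypothetical. Because `FindClusters()` was run with several resolutions, the active identities may not be those of the resolution you chose, so the sketch also sets them explicitly.

```r
## Hypothetical helper : join the per-sample layers, then select the
## clustering to annotate. Assumes a Seurat v5 object with an "RNA" assay.
prepare_for_markers <- function(sobj, ident_col = "RNA_snn_res.0.2") {
  ## Merge the counts.* / data.* layers of the RNA assay into single layers
  sobj <- SeuratObject::JoinLayers(sobj, assay = "RNA")
  ## Use the chosen clustering (a metadata column) as the active identity
  SeuratObject::Idents(sobj) <- ident_col
  sobj
}

## Usage (requires Seurat v5 and the object loaded earlier) :
## sobj <- prepare_for_markers(sobj)
```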
Now, we can use the function FindAllMarkers to perform a differential gene expression analysis of each cluster against all the others. The objective is to find genes “specific” to each cluster to try to annotate them.
# FindAllMarkers
## find markers for every cluster compared to all remaining cells,
## report only genes with positive DE
all_markers = Seurat::FindAllMarkers(
object = sobj,
## Only keep genes more expressed in the cluster of interest than the others
only.pos = TRUE,
## Minimum % of cells expressing the marker in the cluster of interest
min.pct = 0.25,
## Minimum absolute logFC between conditions
logfc.threshold = 0.25)
# This command can take up to 2 mins !
WARNING : the execution time for this command can take several minutes !
Once the markers per cluster have been identified, we can look at the number of markers identified by cluster.
0 1 2 3 4 5 6
656 211 554 1725 150 118 1093
In this table, the first line shows the cluster name, and the second line gives the number of marker genes (i.e. DEGs) identified for this cluster.
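A table like this one is presumably obtained by counting the rows of the marker table per cluster. The sketch below shows the idea on a toy data frame (hypothetical genes and values) that mimics the `gene`, `cluster` and `avg_log2FC` columns of the `FindAllMarkers()` result; on the real data, `all_markers` would replace `toy_markers`.

```r
## Toy stand-in for the FindAllMarkers() result (hypothetical genes and values)
toy_markers <- data.frame(
  gene       = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  cluster    = factor(c(0, 0, 1, 1, 1)),
  avg_log2FC = c(2.1, 0.8, 1.5, 3.0, 0.4)
)

## Number of marker genes identified per cluster
marker_counts <- table(toy_markers$cluster)
print(marker_counts)
```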
Here we see that many markers (i.e. differentially expressed genes) have been identified. We cannot look at all of them, but we can choose to look at the top 10 markers per cluster and use our biological knowledge to identify cell populations.
# top10
`%>%` <- magrittr::`%>%`
top10_markers = as.data.frame(all_markers %>%
dplyr::group_by(cluster) %>%
dplyr::top_n(n = 10, wt = avg_log2FC))
# top10h
## Visualize the top 10 marker gene expression per cluster using the default heatmap function of Seurat.
Seurat::DoHeatmap(sobj, features = top10_markers$gene) + Seurat::NoLegend()
# top10dt
## A dotplot
Seurat::DotPlot(sobj, features = unique(top10_markers$gene)) +
## Just aesthetics
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90,
vjust = 1,
size = 8,
hjust = 1))
On this plot we can see that some markers display a very high and specific expression in a single cluster, while others are expressed in 2 or more clusters.
Biology in this plot
What is the function of each marker gene ? Is it a known marker gene for a cell type ?
Is there literature about its pattern of expression ?
Here we need biological knowledge, and back-and-forth marker gene computation at different clustering resolutions, to understand which cell populations are present in the data.
At this step, using known marker genes from your experiment/knowledge is also useful, and easy to perform.
Let’s focus on 2 genes :
- Rag1, a key gene of T-cell maturation, involved in the variable/diversity/joining (V[D]J) rearrangement
- Cd5, a marker of self-reactivity of T-cells
Here we can see that those markers may actually be very informative to distinguish the stages of the T-cell maturation process.
… At this point, using your biological knowledge of your dataset is critical. You can also test manually any marker of your choice !
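The code that displays Rag1 and Cd5 is not shown on this page. One possible way to do it is sketched below with `Seurat::FeaturePlot()` (listed in the functions overview), preceded by a check that the genes exist in the object. Both helper names, `missing_genes` and `plot_known_markers`, are hypothetical.

```r
## Hypothetical helper : which of the requested genes are absent ?
missing_genes <- function(genes, available) {
  setdiff(genes, available)
}

## Hypothetical wrapper : color the cells of the UMAP by the expression
## of known marker genes, after checking that they are in the object
plot_known_markers <- function(sobj, genes = c("Rag1", "Cd5")) {
  absent <- missing_genes(genes, rownames(sobj))
  if (length(absent) > 0) {
    stop("Genes not found in the object : ", paste(absent, collapse = ", "))
  }
  Seurat::FeaturePlot(object = sobj, features = genes, reduction = "umap")
}

## Toy check of the gene lookup (no Seurat object needed)
print(missing_genes(c("Rag1", "Cd5", "NotAGene"), c("Rag1", "Cd5", "Cd4")))

## Usage on the real object :
## plot_known_markers(sobj, genes = c("Rag1", "Cd5"))
```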
7.1.4 Conclusion
Perform differential expression for each cluster VS all the others with a normalized matrix
Look at the gene expression of the markers identified in the 2D representation to validate specificity and representation
Find the cell population corresponding to these markers and annotate this cluster
| Advantages | Limits |
|---|---|
| | |
7.2 Automatic annotation using reference markers
For some tissues, the different cell types have already been largely described and databases exist with referenced marker genes.
Another way to annotate your dataset will be to find a database with relevant annotation for your dataset and use tools of automatic annotation to annotate your clusters.
Let’s see how it works in practice !
We will use a database focused on immunological cell types called ImmGen, thanks to the celldex R package that “provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq”
NOTE : In the following chunk of code, you will load the annotation file from the IFB server where we downloaded it. In real life, and with a version of dbplyr lower than or equal to 2.3, you can also use this command :
annotation = celldex::ImmGenData(ensembl = FALSE)
# Set ensembl to TRUE to get Ensembl gene IDs ; FALSE returns the annotation with gene symbols
# refloading
## Loading the ImmGen database
annotation = readRDS(paste0(
'/shared/projects/',
sessionid,
'/atelier_scrnaseq/TD/RESOURCES/ImmGenData.RDS'))
## A quick description of the db
annotation
class: SummarizedExperiment
dim: 22134 830
metadata(0):
assays(1): logcounts
rownames(22134): Zglp1 Vmn2r65 ... Tiparp Kdm1a
rowData names(0):
colnames(830):
GSM1136119_EA07068_260297_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_1.CEL
GSM1136120_EA07068_260298_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_2.CEL ...
GSM920654_EA07068_201214_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_1.CEL
GSM920655_EA07068_201215_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_2.CEL
colData names(3): label.main label.fine label.ont
This database contains 3 levels of granularity :
A “main” level (coarse grain)
A “fine” level (self-explanatory)
The “ONT” level (data are mapped to a defined ontology)
As we are in a context of sorted cells of the same lineage, we’re going to use the fine label.
Let’s see how many cell types are described in this ImmGen database :
[1] 253
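This number is presumably obtained by counting the distinct values of the `label.fine` column of the reference. A toy illustration is given below, with a hypothetical `toy_coldata` table standing in for the column data of the ImmGen object; on the real object, `length(unique(annotation$label.fine))` should give the count shown above.

```r
## Toy stand-in for the column data of the reference (hypothetical samples)
toy_coldata <- data.frame(
  label.main = c("T cells", "T cells", "T cells", "DC", "DC"),
  label.fine = c("T cells (T.DPsm)", "T cells (T.ISP)", "T cells (T.DPsm)",
                 "DC (DC.8-4-11B-)", "DC (DC.8-4-11B-)")
)

## Number of distinct cell types at each level of granularity
n_main <- length(unique(toy_coldata$label.main))
n_fine <- length(unique(toy_coldata$label.fine))
print(c(main = n_main, fine = n_fine))
```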
The tool we will use to perform the automatic cell type annotation, SingleR, works better with normalized data. Thus, we will extract the normalized matrix from our Seurat object :
# norm_mat
## Extraction of the normalized data
norm_exp_mat = Seurat::GetAssayData(
object = sobj,
assay = "RNA",
## In Seurat v5, "layer" replaces the deprecated "slot" argument
layer = "data"
)
## Matrix dimensions
dim(norm_exp_mat)
[1] 12926 3886
We are ready to start the annotation.
The following SingleR command performs the prediction of cell types for each cell of the dataset.
# annpred
## Run in ~ 3-5 min depending on the number of CPU and memory defined
ann_predictions = SingleR::SingleR(
## our normalized matrix
test = norm_exp_mat,
## The ImmGen DB
ref = annotation,
## The annotation grain level
labels = annotation$label.fine,
## Marker detection scheme
de.method="classic",
assay.type.test = "logcounts",
assay.type.ref = "logcounts",
BPPARAM = BiocParallel::SerialParam())
The resulting object is a special kind of data.frame (a DataFrame). Each row corresponds to a cell and contains the prediction scores computed by SingleR. The label assigned to each cell is stored in the column $labels
How many different kinds of labels were identified ?
[1] 35
Besides scoring, SingleR assesses the score quality, and prunes bad results.
How many cells got a poor quality annotation ?
Mode FALSE TRUE
logical 3847 39
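The two questions above can be answered from the `labels` and `pruned.labels` columns of the prediction table, where a pruned (poor quality) annotation is stored as `NA`. The sketch below illustrates this on a toy table (hypothetical cells); on the real result, `ann_predictions` would replace `toy_pred`.

```r
## Toy stand-in for the SingleR result (hypothetical cells)
toy_pred <- data.frame(
  labels        = c("T cells (T.DPsm)", "T cells (T.DPsm)",
                    "T cells (T.ISP)", "T cells (T.DN3A)"),
  pruned.labels = c("T cells (T.DPsm)", NA,
                    "T cells (T.ISP)", "T cells (T.DN3A)"),
  row.names     = paste0("cell_", 1:4)
)

## How many different labels were assigned ?
n_labels <- length(unique(toy_pred$labels))
print(n_labels)

## How many cells had their label pruned (NA in pruned.labels) ?
n_pruned <- sum(is.na(toy_pred$pruned.labels))
print(summary(is.na(toy_pred$pruned.labels)))
```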
Annotation diagnostic
SingleR provides some diagnostic plots :
- We can visualize the score of each cell, split by cell type label, as a heatmap :
How do you interpret this heatmap ?
Add the annotation to the Seurat object
We add a new metadata column containing the annotation of each cell to our Seurat object.
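The code for this step is not shown on this page. The column is named `singler_cells_labels` in the plots of this session, so the assignment is presumably something like `sobj$singler_cells_labels <- ann_predictions$labels`. The toy sketch below (hypothetical cells) shows a slightly more defensive variant that matches predictions to cells by name rather than by position, which is an assumption on our side and not necessarily what the original code does.

```r
## Toy cell metadata (stand-in for sobj@meta.data, hypothetical cells)
toy_meta <- data.frame(
  RNA_snn_res.0.2 = factor(c(0, 1, 0)),
  row.names       = c("cell_a", "cell_b", "cell_c")
)

## Toy predictions, deliberately stored in a different cell order
toy_pred <- data.frame(
  labels    = c("T cells (T.ISP)", "T cells (T.DPsm)", "T cells (T.DPsm)"),
  row.names = c("cell_b", "cell_c", "cell_a")
)

## Match by cell name, so that a different row order cannot shift the labels
toy_meta$singler_cells_labels <- toy_pred[rownames(toy_meta), "labels"]
print(toy_meta)
```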
Visualization of our annotation on UMAP
We can visualize the cell annotation on the UMAP.
# UMAPlabeled
## Just for aesthetics (sets a color palette to use)
seeable_palette = setNames(
c(RColorBrewer::brewer.pal(name = "Dark2", n = 8),
c(1:(length(unique(ann_predictions$labels)) - 8))),
nm = names(sort(table(ann_predictions$labels), decreasing = TRUE)))
## UMAP with the predicted annotation by cell
ann_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = "singler_cells_labels",
pt.size = 2,
cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")
# UMAP with the cluster numbers (before annotation)
clust_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = "RNA_snn_res.0.2",
pt.size = 2,
label = TRUE,
repel = TRUE
)
print(ann_plot + clust_plot)
A look at the contingency table of cells projected onto the ImmGen reference :
DC (DC.8-4-11B-) Macrophages (MF.11CLOSER.SALM3)
1 1
NKT (NKT.44+NK1.1-) T cells (T.4.Pa)
1 1
T cells (T.4FP3+25+) T cells (T.4Nve)
1 2
T cells (T.4SP24int) T cells (T.8EFF.OT1.48HR.LISOVA)
22 2
T cells (T.8MEM.OT1.D45.LISOVA) T cells (T.8Mem)
1 2
T cells (T.8MEMKLRG1-CD127+.D8.LISOVA) T cells (T.8NVE.OT1)
1 1
T cells (T.8Nve) T cells (T.8SP24-)
25 8
T cells (T.8SP24int) T cells (T.8SP69+)
3 7
T cells (T.CD4.5H) T cells (T.CD4TESTCJ)
5 5
T cells (T.CD8.5H) T cells (T.CD8.CTR)
2 2
T cells (T.DN2B) T cells (T.DN3-4)
2 1
T cells (T.DN3A) T cells (T.DN3B)
48 6
T cells (T.DN4) T cells (T.DP)
1 1
T cells (T.DP69+) T cells (T.DPbl)
218 71
T cells (T.DPsm) T cells (T.ISP)
3174 263
T cells (T.Tregs) Tgd (Tgd.imm.VG1+VD6+)
1 2
Tgd (Tgd.VG2+) Tgd (Tgd.VG5+24AHI)
2 2
Tgd (Tgd)
1
From this rapid prediction, it seems that our dataset contains mostly T-cells, and particularly T.DPsm, T.ISP and T.DP69+.
Maybe the annotation is not perfectly suited to our dataset. Some cell populations in the annotation are closely related, and this leads to annotation competition for our cells.
It is possible to run the annotation at the cluster level : it will be cleaner than the single cell level annotation. But make sure that the clustering is not merging several different cell populations.
We can check the number of cells attributed to each label in each cluster :
0 1 2 3 4 5 6
DC (DC.8-4-11B-) 0 0 0 0 0 1 0
Macrophages (MF.11CLOSER.SALM3) 0 0 0 0 0 1 0
NKT (NKT.44+NK1.1-) 0 0 0 1 0 0 0
T cells (T.4.Pa) 0 0 1 0 0 0 0
T cells (T.4FP3+25+) 0 0 1 0 0 0 0
T cells (T.4Nve) 0 0 2 0 0 0 0
T cells (T.4SP24int) 1 0 21 0 0 0 0
T cells (T.8EFF.OT1.48HR.LISOVA) 0 0 1 1 0 0 0
T cells (T.8MEM.OT1.D45.LISOVA) 0 0 1 0 0 0 0
T cells (T.8Mem) 0 0 2 0 0 0 0
T cells (T.8MEMKLRG1-CD127+.D8.LISOVA) 0 0 1 0 0 0 0
T cells (T.8NVE.OT1) 0 0 1 0 0 0 0
T cells (T.8Nve) 0 0 25 0 0 0 0
T cells (T.8SP24-) 0 0 8 0 0 0 0
T cells (T.8SP24int) 0 0 2 1 0 0 0
T cells (T.8SP69+) 0 0 7 0 0 0 0
T cells (T.CD4.5H) 0 0 5 0 0 0 0
T cells (T.CD4TESTCJ) 0 0 5 0 0 0 0
T cells (T.CD8.5H) 0 0 2 0 0 0 0
T cells (T.CD8.CTR) 0 0 2 0 0 0 0
T cells (T.DN2B) 0 0 0 1 0 0 1
T cells (T.DN3-4) 0 0 0 1 0 0 0
T cells (T.DN3A) 0 0 0 2 0 0 46
T cells (T.DN3B) 0 1 0 5 0 0 0
T cells (T.DN4) 0 0 1 0 0 0 0
T cells (T.DP) 0 0 0 0 0 1 0
T cells (T.DP69+) 2 3 211 0 2 0 0
T cells (T.DPbl) 0 17 0 54 0 0 0
T cells (T.DPsm) 2067 827 137 0 73 70 0
T cells (T.ISP) 1 4 0 258 0 0 0
T cells (T.Tregs) 0 0 1 0 0 0 0
Tgd (Tgd.imm.VG1+VD6+) 0 0 1 1 0 0 0
Tgd (Tgd.VG2+) 0 0 1 0 0 0 1
Tgd (Tgd.VG5+24AHI) 0 0 2 0 0 0 0
Tgd (Tgd) 0 0 0 0 0 0 1
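A table like the one above is a simple cross-tabulation of the predicted label against the cluster of each cell. The toy sketch below (hypothetical cells, shortened labels) shows the idea and also extracts the dominant label of each cluster; on the real object, `sobj$singler_cells_labels` and `sobj$RNA_snn_res.0.2` would be used instead of the toy vectors.

```r
## Toy per-cell labels and clusters (hypothetical values, 6 cells)
toy_labels   <- c("T.DPsm", "T.DPsm", "T.ISP", "T.DPsm", "T.DP69+", "T.ISP")
toy_clusters <- factor(c(0, 0, 1, 0, 0, 1))

## Labels in rows, clusters in columns
label_by_cluster <- table(toy_labels, toy_clusters)
print(label_by_cluster)

## Most frequent label in each cluster
dominant <- apply(label_by_cluster, 2, function(x) names(x)[which.max(x)])
print(dominant)
```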
We can also check whether some clusters contain multiple cell types. We compute the proportion of each cell type in each cluster. If a cluster is composed of two cell types (or more), maybe the clustering resolution is too low ?
# propcell
## Compute the proportion of cell types per cluster
pop_by_cluster = prop.table(table(sobj$singler_cells_labels,
sobj$RNA_snn_res.0.2),
margin = 2)
## Print the number of cell types that account for more than 30% of each cluster
colSums(pop_by_cluster > 0.3)
0 1 2 3 4 5 6
1 1 2 1 1 1 1
Beware :
small weird clusters of cells : they might be of interest BUT they can also be clustering artefacts
very large clusters of cells : if you notice that marker genes are representative of only a fraction of this large cluster, you might need to adjust the clustering parameters to be more discriminating.
7.2.1 Conclusion
Find a good marker gene reference (PanglaoDB, CellMarker, CancerSEA…)
Select a tool / model : classifier, scoring function …
Annotate your dataset
| Advantages | Limitations |
|---|---|
| | |
7.3 Automatic annotation using a reference scRNAseq
Another possibility is to use a published single-cell dataset as a reference for the cluster annotation.
This is very useful when you work on a tissue that is close to a tissue already studied, or if you work on another species and you want a quick overview of what the predicted annotation would look like. Multiple tools exist to transfer the annotations to your own dataset (SingleR, Azimuth, Symphony, classifiers like SVMs …). Start with the method you know well, or with the one that people in your lab / bioinformatics team use, so that you can get help if needed (then you can try others…).
We are not going to use this method today but you might want to use it for your practicals.
Here are the main commands from SingleR.
# singlerpub
# Load the reference dataset in RDS format (it can also be loaded in another format, see the SingleR documentation to convert your reference to a suitable format for the prediction)
## We assume here that the reference is stored as a Seurat object
REF_SNRNASEQ = readRDS("reference_scRNAseq.RDS")
## This command removes from the object the cells whose metadata "Cell.type" is NA.
REF_SNRNASEQ = REF_SNRNASEQ[,!is.na(REF_SNRNASEQ$Cell.type)]
## Create a SingleCellExperiment object for your reference dataset
REF_SCE = Seurat::as.SingleCellExperiment(REF_SNRNASEQ)
## Normalize the library (SingleR needs normalized data)
REF_SCE <- scuttle::logNormCounts(REF_SCE)
## Create a SingleCellExperiment object for your dataset (sx is your own Seurat object)
sx_sce = Seurat::as.SingleCellExperiment(sx)
## RUN SINGLER
## The labels come from the same metadata column as the one filtered above
pred.grun = SingleR::SingleR(test = sx_sce,
ref = REF_SCE,
labels = REF_SCE$Cell.type,
de.method = "wilcox")
7.3.1 Conclusion
Find a quality reference dataset : several bulk RNA-seq data, one scRNAseq…
Select a tool to transfer annotation (SingleR, …)
Annotate your dataset
| Advantages | Limitations |
|---|---|
| | |
8 General Conclusion
| Method | Advantages | Limitations |
|---|---|---|
| Manual cluster annotation using differential expression | | |
| Automatic annotation using reference markers | | |
| Automatic annotation using reference dataset | | |
A few pieces of advice :)
- It is recommended to combine multiple methods to annotate your data
- Use manual cluster annotation to quickly identify your cell populations
- Identify good markers for each cell population → your reference markers
- Use automatic cell annotation with your set of markers → your reference dataset
- Use your references to annotate new datasets and go back to manual annotation to refine your analysis.
Sometimes, annotation reveals that the dataset would benefit from a re-clustering : for instance when one cluster seems to group 2 cell types or, on the contrary, when two different clusters express very similar markers and should be merged.
During annotation, do not hesitate to look at the expression of mitochondrial or ribosomal genes (or any other set of genes) in your clusters. It might help you to identify a cluster of cells that looks weird to you. Clusters of “artificial” cells (cells of low quality) could lead to the identification of weird (novel?!) cells that have no real biological significance. But be careful : a cluster with a high expression of mitochondrial or ribosomal genes can sometimes have a biological meaning.
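As an illustration of this check, the sketch below computes the percentage of mitochondrial counts per cell on a toy count matrix, then averages it per cluster. All values are hypothetical, and the `^mt-` prefix is an assumption that matches mouse gene symbols (human symbols use `MT-`). On a real Seurat object, `Seurat::PercentageFeatureSet(sobj, pattern = "^mt-")` computes the same per-cell percentage.

```r
## Toy raw count matrix : genes in rows, cells in columns (hypothetical values)
toy_counts <- matrix(
  c(10,  0, 5,
    20,  2, 8,
     5, 30, 1,
     5,  8, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Cd3e", "Rag1", "mt-Co1", "mt-Nd1"),
                  c("cell_1", "cell_2", "cell_3"))
)
## Toy cluster assignment of the 3 cells
toy_clusters <- factor(c("0", "1", "0"))

## Mitochondrial genes, identified by their symbol prefix (mouse convention)
mito_genes <- grepl("^mt-", rownames(toy_counts))

## Percentage of counts coming from mitochondrial genes, per cell
percent_mito <- 100 * colSums(toy_counts[mito_genes, , drop = FALSE]) /
  colSums(toy_counts)
print(percent_mito)

## Average percentage per cluster
mito_by_cluster <- tapply(percent_mito, toy_clusters, mean)
print(mito_by_cluster)
```

In this toy example, cluster 1 is dominated by mitochondrial counts, which would call for a closer look before interpreting it as a cell type.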
Note about automatic annotation :
If you are working with non-model species or with multiple species : it is not trivial to transfer an annotation from one species to another. Marker genes are not always conserved across evolution. In this case, manual annotation is a very important sanity check of any automatic annotation !!
9 Optional Part
9.1 Annotation transfer
SingleR can transfer cell annotations from a reference to your query dataset, at the cluster level.
9.1.1 Cluster level annotation
9.1.1.1 Explanation
It is possible to run the prediction of cell types with SingleR per cluster instead of per cell. The idea is similar, but instead of annotating every single cell with its best match in the reference dataset, it annotates every cluster of your query dataset with its average best match in the reference dataset (SingleR summarizes the expression profiles of all cells from the same cluster, and then assesses the resulting aggregate).
Note : we run the same command as before (SingleR), we only add the parameter “clusters” to the SingleR function to annotate by cluster and not by cell.
Advantage(s)
- Much faster at the cluster level : SingleR can be time-consuming to run on every single cell of your dataset, particularly if the reference and the query datasets are big. Sometimes, you might want to perform a first quick prediction at the cluster level just to get a general idea of how the prediction works with your dataset, and of which cell types you captured.
- Check if the reference dataset will be of any help : before running a long prediction of cell annotation with SingleR (again, if your dataset is big), you want to know if the reference you are using helps at the cluster level. If it fails to identify cell types, even “general” cell types, forget it.
9.1.1.2 Code :
# repred
# Rerun a prediction using clustering information
# This command is much faster because the prediction is only performed for the 7 clusters and not for each cell.
clust_ann_predictions =
SingleR::SingleR(
test = norm_exp_mat,
clusters = sobj$RNA_snn_res.0.2,
ref = annotation,
labels = annotation$label.fine,
assay.type.test = "logcounts",
assay.type.ref = "logcounts",
BPPARAM = BiocParallel::SerialParam()
)
How many clusters have been labelled for each annotation label ?
# cclab
## EXPLANATION OF THE COMMAND BELOW
## This command takes the annotation labels (clust_ann_predictions$labels)
## It uses the function table to build a contingency table counting how many clusters were assigned each label of the reference dataset
## Then the function sort orders this table in decreasing order, so that the most frequently assigned labels come first
## Finally, the function head shows only the first entries (6 by default) of the sorted table
head(sort(table(clust_ann_predictions$labels), decreasing = TRUE))
T cells (T.DPsm) T cells (T.DN3A) T cells (T.DP69+) T cells (T.ISP)
4 1 1 1
For how many clusters was the annotation of poor quality ?
# pruned_clusters
## EXPLANATION OF THE COMMAND BELOW
## This command takes the column "pruned.labels" from the table of predictions
## then the function is.na flags the NA values in this column (NA = label pruned for poor quality)
## Finally, the function summary counts the FALSE and TRUE values
summary(is.na(clust_ann_predictions$pruned.labels))
Mode FALSE
logical 7
Annotation diagnostic
We can visualize the scores of each cluster for each cell type, as a heatmap :
# heatbc
## Heatmap using the annotation prediction by cluster
SingleR::plotScoreHeatmap(clust_ann_predictions)
What do you observe here ? What is the difference with the annotation by cell ?
Add annotation to metadata
We add the annotation to our Seurat object.
# add2
## Save the name of future annotation
clust_labels_col = "singler_clust_labels"
## Create a column with this name in the metadata and fill it with the cluster levels of each cell
sobj@meta.data[[clust_labels_col]] = sobj@meta.data$RNA_snn_res.0.2
## Associate each cluster with its annotation
levels(sobj@meta.data[[clust_labels_col]]) = clust_ann_predictions$labels
[1] "T cells (T.DPsm)" "T cells (T.DPsm)" "T cells (T.DP69+)"
[4] "T cells (T.ISP)" "T cells (T.DPsm)" "T cells (T.DPsm)"
[7] "T cells (T.DN3A)"
[1] "T cells (T.DPsm)" "T cells (T.DP69+)" "T cells (T.ISP)"
[4] "T cells (T.DN3A)"
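The two outputs above illustrate how assigning labels to factor levels works : the 7 clusters receive 7 labels, and clusters that share a label are merged into a single level (4 remain). The toy sketch below (hypothetical clusters) reproduces this behavior in base R, and shows an alternative that looks labels up by cluster name. The alternative is our suggestion, useful if you are not sure that the rows of the prediction table follow the order of the factor levels.

```r
## Toy clusters for 6 cells, and toy cluster-level predictions
toy_clusters <- factor(c("0", "1", "2", "1", "0", "2"))
toy_pred <- data.frame(
  labels    = c("T cells (T.DPsm)", "T cells (T.DPsm)", "T cells (T.ISP)"),
  row.names = c("0", "1", "2")
)

## Same approach as above : rename the levels (position-based).
## Levels that receive the same label are merged.
relabelled <- toy_clusters
levels(relabelled) <- toy_pred$labels
print(levels(relabelled))

## Alternative : look up each cell's label by its cluster name
by_name <- factor(toy_pred[as.character(toy_clusters), "labels"])
print(table(relabelled, by_name))
```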
Visualization
We can visualize the cell annotation on the 2D projection :
# umapcomp
ann_cluster_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = clust_labels_col,
pt.size = 2,
label = FALSE,
cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")
ann_cell_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = "singler_cells_labels",
pt.size = 2,
label = FALSE,
repel = TRUE,
cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")
ann_cluster_plot + ann_cell_plot
Save your annotated Seurat object
We save the annotated Seurat object :
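# saverds3
## Creating the output (RESULTS) directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)
## Build the file name of our Seurat object (rich naming)
out_name <- paste0(
output_dir, "/", paste(
c("13", Seurat::Project(sobj), "S5",
"Integrated_Annotated"
), collapse = "_"),
".RDS")
## Write the object to disk
saveRDS(object = sobj, file = out_name)
## Check
print(out_name)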
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/08_Cell.Annotation/RESULTS/13_TD3A.TDCT_S5_Integrated_Annotated.RDS"
10 References
Good practices for single cell analysis : https://www.sc-best-practices.org/preamble.html
Sanger Single cell course : https://www.singlecellcourse.org/index.html
11 Resources
For human :
GeneCard : https://www.genecards.org
Human Protein Atlas : https://www.proteinatlas.org/search/H2-K1
12 Rsession
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.29.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] SummarizedExperiment_1.34.0 Biobase_2.64.0
[3] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
[5] IRanges_2.38.1 S4Vectors_0.42.1
[7] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
[9] matrixStats_1.5.0 future_1.49.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.17.1
[3] jsonlite_2.0.0 magrittr_2.0.3
[5] spatstat.utils_3.1-4 farver_2.1.2
[7] rmarkdown_2.29 zlibbioc_1.50.0
[9] vctrs_0.6.5 ROCR_1.0-11
[11] DelayedMatrixStats_1.26.0 spatstat.explore_3.4-3
[13] S4Arrays_1.4.1 htmltools_0.5.8.1
[15] SparseArray_1.4.8 sass_0.4.10
[17] sctransform_0.4.2 parallelly_1.45.0
[19] KernSmooth_2.23-24 bslib_0.9.0
[21] htmlwidgets_1.6.4 ica_1.0-3
[23] plyr_1.8.9 plotly_4.10.4
[25] zoo_1.8-14 cachem_1.1.0
[27] igraph_2.1.4 mime_0.13
[29] lifecycle_1.0.4 pkgconfig_2.0.3
[31] rsvd_1.0.5 Matrix_1.7-3
[33] R6_2.6.1 fastmap_1.2.0
[35] GenomeInfoDbData_1.2.12 fitdistrplus_1.2-2
[37] shiny_1.10.0 digest_0.6.37
[39] patchwork_1.3.0 Seurat_5.3.0
[41] tensor_1.5 RSpectra_0.16-2
[43] irlba_2.3.5.1 beachmat_2.20.0
[45] labeling_0.4.3 progressr_0.15.1
[47] spatstat.sparse_3.1-0 httr_1.4.7
[49] polyclip_1.10-7 abind_1.4-8
[51] compiler_4.4.1 withr_3.0.2
[53] BiocParallel_1.38.0 viridis_0.6.5
[55] fastDummies_1.7.5 MASS_7.3-65
[57] DelayedArray_0.30.1 tools_4.4.1
[59] lmtest_0.9-40 httpuv_1.6.15
[61] future.apply_1.11.3 goftest_1.2-3
[63] glue_1.8.0 nlme_3.1-165
[65] promises_1.3.2 grid_4.4.1
[67] Rtsne_0.17 cluster_2.1.6
[69] reshape2_1.4.4 generics_0.1.4
[71] gtable_0.3.6 spatstat.data_3.1-6
[73] rmdformats_1.0.4 tidyr_1.3.1
[75] data.table_1.17.4 ScaledMatrix_1.12.0
[77] BiocSingular_1.20.0 XVector_0.44.0
[79] sp_2.2-0 spatstat.geom_3.4-1
[81] RcppAnnoy_0.0.22 ggrepel_0.9.6
[83] RANN_2.6.2 pillar_1.10.2
[85] stringr_1.5.1 spam_2.11-1
[87] RcppHNSW_0.6.0 limma_3.60.6
[89] later_1.4.2 splines_4.4.1
[91] dplyr_1.1.4 lattice_0.22-6
[93] survival_3.7-0 deldir_2.0-4
[95] tidyselect_1.2.1 miniUI_0.1.2
[97] pbapply_1.7-2 knitr_1.50
[99] gridExtra_2.3 bookdown_0.39
[101] scattermore_1.2 xfun_0.52
[103] statmod_1.5.0 pheatmap_1.0.12
[105] UCSC.utils_1.0.0 stringi_1.8.7
[107] lazyeval_0.2.2 yaml_2.3.10
[109] evaluate_1.0.3 codetools_0.2-20
[111] tibble_3.2.1 cli_3.6.5
[113] uwot_0.2.3 xtable_1.8-4
[115] reticulate_1.42.0 jquerylib_0.1.4
[117] dichromat_2.0-0.1 Rcpp_1.0.14
[119] globals_0.18.0 spatstat.random_3.4-1
[121] png_0.1-8 spatstat.univar_3.1-3
[123] parallel_4.4.1 ggplot2_3.5.2
[125] presto_1.0.0 SingleR_2.6.0
[127] dotCall64_1.2 sparseMatrixStats_1.16.0
[129] listenv_0.9.1 viridisLite_0.4.2
[131] scales_1.4.0 ggridges_0.5.6
[133] crayon_1.5.3 SeuratObject_5.1.0
[135] purrr_1.0.4 rlang_1.1.6
[137] cowplot_1.1.3