EB3I n1 2025 scRNAseq
-
CELL ANNOTATION





1 PREAMBLE

1.1 Purpose of this session

We are now reaching a challenging step of the analysis (and a very exciting one !).

  • What types of cells did we capture in the analysis ?
  • Do we identify the expected cell types and can we distinguish different sub-populations ?
  • Do we identify “novel” or “unexpected” cell types ?

The aim of this session is to understand the different methods that will help you to explore the biological cell types captured by your dataset.



2 Start Rstudio



3 Warm-up

  • We set common parameters we will use throughout this session :
# setparam


## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"

## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))
PATH CHECK :  TRUE
## The current EB3I session ID
sessionid <- '2538_eb3i_n1_2025'


4 Prepare the data structure

We proceed as in the previous steps, changing only the session name.

4.1 Main directory

# maindir

## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")

## Creating the root directory (already exists at this step)
# dir.create(path = TD_dir, recursive = TRUE)

## Print the root directory on-screen
print(TD_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD"

4.2 Current session

# sessiondir

## Creating the session (Cell.Annotation) directory
session_dir <- paste0(TD_dir, "/08_Cell.Annotation")
dir.create(path = session_dir, recursive = TRUE)

## Print the session directory on-screen
print(session_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/08_Cell.Annotation"

4.3 Input directory

# indir

## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)

## Print the input directory on-screen
print(input_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/08_Cell.Annotation/DATA"

4.4 Genelists directory

This is a directory where we will store additional information from knowledge bases about genes used to estimate the cell cycle phase of cells.

# resdir

res_dir <- paste0(TD_dir, "/Resources")
glist_dir <- paste0(res_dir, "/Genelists")

## Create the directory
dir.create(path = glist_dir, recursive = TRUE)

## Print the resources directory on-screen
print(glist_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/Resources/Genelists"

4.5 Output directory

# outdir

## Creating the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)

## Print the output directory on-screen
print(output_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/08_Cell.Annotation/RESULTS"
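
As a complement, here is a minimal sketch of how the directory steps above could be wrapped in a small helper. The function name make_dir and the demo_* variables are hypothetical (they are not part of the course material), and a temporary folder stands in for the cluster path.

```r
## Hypothetical helper, not part of the course material :
## builds a path from its components, creates it if needed, returns it
make_dir <- function(...) {
  path <- file.path(...)
  ## showWarnings = FALSE keeps the call silent when the directory already exists
  dir.create(path = path, recursive = TRUE, showWarnings = FALSE)
  path
}

## A temporary folder stands in for TD_dir in this demo
demo_root <- file.path(tempdir(), "SC_TD")

demo_session <- make_dir(demo_root, "08_Cell.Annotation")
demo_input   <- make_dir(demo_session, "DATA")
demo_output  <- make_dir(demo_session, "RESULTS")

## Both directories should now exist
print(dir.exists(c(demo_input, demo_output)))
```

Using file.path() instead of paste0() avoids having to write the "/" separators by hand.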


5 Reload the Seurat Object

  • We can reload the object we saved at the previous step
# dataload


## The current EB3I session ID (used to build the path to the local backup)
sessionid <- '2538_eb3i_n1_2025'


## The latest Seurat object saved as RDS (name)
sobj_file <- "12_TD3A.TDCT_S5_Integrated_12926.3886.RDS"

## The latest Seurat object saved as RDS (full path)
sobj_path <- paste0(TD_dir, 
                    "/06_Integration/RESULTS/",
                    sobj_file)

force <- FALSE  ## To force a re-download of a Zenodo-hosted backup
local <- FALSE  ## To force a loading from a local backup

## In case of error/lost data : force a reload from a Zenodo backup repository
if(force) {
  zen_id <- "14035293"
  zen_backup_file <- paste0("https://zenodo.org/records/",
                            zen_id,
                            "/files/",
                            sobj_file)
  ## Recreate the expected path if it does not exist
  dir.create(path = dirname(sobj_path), recursive = TRUE)
  ## Download the file
  download.file(url = zen_backup_file,
                destfile = sobj_path)
}

## In case of error/lost data : force a reload from a local backup repository
if(local) {
  sobj_path <- paste0(
    "/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/RDS/",
    sobj_file)
}

## Load the object
sobj <- readRDS(file = sobj_path)
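
If the file is missing, readRDS fails with a rather terse message. A possible guard is sketched below, assuming you prefer an explicit error before loading. The helper name read_rds_checked is hypothetical, and the demo uses a toy object saved in a temporary file rather than the real Seurat object.

```r
## Hypothetical helper : stop with a readable message when the RDS is missing
read_rds_checked <- function(path) {
  if (!file.exists(path)) {
    stop("File not found : ", path,
         "\nSet force or local to TRUE to restore a backup.")
  }
  readRDS(file = path)
}

## Demo on a toy object saved in a temporary file
demo_path <- tempfile(fileext = ".RDS")
saveRDS(object = list(n_cells = 3886), file = demo_path)
demo_obj <- read_rds_checked(demo_path)
print(demo_obj$n_cells)

## A missing file triggers the error, caught here for the demo
demo_err <- tryCatch(
  read_rds_checked(file.path(tempdir(), "missing_file.RDS")),
  error = function(e) conditionMessage(e))
print(demo_err)
```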


6 Overview of the scRNAseq pipeline

Workflow before annotation

At this step of the analysis, we have :

  • a gene expression matrix : for each cell, gene expression is available

  • a reduced space : gene expression matrix is summarized in N dimensions

  • a clustering : each cell belongs to a specific cluster

  • a 2D space : cells can be visualized on a 2D representation (UMAP)

On the cell visualization, we also searched for clusters of cells. The clustering revealed multiple cell clusters, which we can now associate with cell types.

For this you need :

  • your biological knowledge on your dataset

  • an internet connection :)



7 Different methods to annotate cell types

The annotation methods aim at defining marker genes that help to identify the cell types in each cluster.

The logic is similar across methods :

You identify genes, or sets of genes, whose expression pattern is specific to a cluster and shared by a large fraction of the cells in that cluster.
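
To make the two criteria concrete (specificity, and the fraction of cells concerned), here is a minimal base R sketch on a toy matrix. All gene names and values are invented for illustration; the real analysis relies on Seurat::FindAllMarkers, which is presented later.

```r
## Toy normalized expression matrix (genes x cells), values invented
expr <- matrix(c(2.1, 1.8, 2.5, 0.0, 0.0, 0.0,   # GeneA : expressed in cluster "a" only
                 1.0, 0.0, 1.2, 0.9, 1.1, 0.0),  # GeneB : expressed in both clusters
               nrow = 2, byrow = TRUE,
               dimnames = list(c("GeneA", "GeneB"), paste0("cell", 1:6)))
cluster <- c("a", "a", "a", "b", "b", "b")

## Cells belonging to the cluster of interest
in_clust <- cluster == "a"

## Fraction of cells expressing each gene inside / outside the cluster
pct_in  <- rowMeans(expr[, in_clust] > 0)
pct_out <- rowMeans(expr[, !in_clust] > 0)

## Difference of mean expression (inside minus outside)
mean_diff <- rowMeans(expr[, in_clust]) - rowMeans(expr[, !in_clust])

print(data.frame(pct_in, pct_out, mean_diff))
```

In this toy case GeneA is a good marker candidate for cluster "a" (detected in all of its cells and in none of the others), while GeneB is not.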

Different methods exist :

  1. MANUAL : You can do it manually, using sets of marker genes from the literature, or use other datasets that have already been annotated to transfer the annotation to your clustering if similar expression patterns are found.

  2. AUTOMATIC : You use a published database to collect sets of marker genes for cell types, or a published reference single-cell atlas that is already annotated, or a published RNAseq dataset of a specific cell type that you know is in your dataset…

For this practical session, we will try both approaches. We use the dataset previously filtered and pre-processed (the Seurat object contains the dimension reductions previously computed, including pca, umap and the Harmony integration).

7.1 Manual annotation

We will perform manual annotation using differential expression.

7.1.1 Overview of the analysis

For manual annotation of cell clusters using marker genes, you need :

  • A Seurat object with normalized counts.
  • To choose a clustering resolution that you want to use for the annotation.
  • A reduced space to visualize the results.

7.1.2 Overview of the functions to be used

# funcs

## Seurat functions to be used in the TD for manual annotation
Seurat::FindClusters()
SeuratObject::JoinLayers()
Seurat::FindAllMarkers()

## Seurat functions to be used for visualization purpose 
### Function to color cell by a "dimension" of their metadata
### (ex : sample of origin ie the annotation called "orig.ident")
Seurat::DimPlot()
### Function to color cell by the expression of one feature
Seurat::FeaturePlot()
### Function to build a heatmap of expression of the marker genes identified
Seurat::DoHeatmap()
### Function to visualize how feature expression changes across different identity classes
Seurat::DotPlot()

7.1.3 Current Seurat object

# descobj

## A quick overview of the object
sobj
An object of class Seurat 
12926 features across 3886 samples within 1 assay 
Active assay: RNA (12926 features, 2000 variable features)
 5 layers present: counts.TD3A, counts.TDCT, data.TD3A, data.TDCT, scale.data
 6 dimensional reductions calculated: pca, CCAIntegration, umap, RPCAIntegration, HarmonyIntegration, HarmonyStandalone

You can see that the computed reductions include new ones you did not use before. These reductions come from the integration steps used to merge the 2 samples into a shared representation with less batch effect.

# dp1

## Look at the content of the Seurat object post integration 
Seurat::DimPlot(sobj, 
                ## use the reduction umap of the integrated object
                reduction = "umap",
                ## color the cells by their sample of origin
                group.by = c("orig.ident"))

Clusters to annotate

Manual annotation is based on the identification of marker genes that characterize a cell population. For this, we must define which groups of cells we want to annotate.

This choice is based on the clustering (look back at Proc.2). After the integration step, we can rerun the 2 clustering steps (neighbor search, then clustering), as we now have an object integrating both the TD3A and TDCT samples.

# reclusters

## Find neighbors to prepare the data for clustering, using the first 20 dimensions of the Harmony reduction
sobj <- Seurat::FindNeighbors(
  object = sobj, 
  dims = 1:20,
  reduction = "HarmonyIntegration")

## Louvain resolutions to test
resol <- c(0.1,0.2,0.3,0.8,1.5)

## Clustering
sobj <- Seurat::FindClusters(
  object = sobj,
  ## Vector of different resolutions defined above
  resolution = resol,
  ## Makes the function quiet
  verbose = FALSE,
  ## Algorithm for modularity optimization :
  ## . 1 : original Louvain algorithm
  ## . 2 : Louvain algorithm with multilevel refinement
  ## . 3 : SLM algorithm
  ## . 4 : Leiden algorithm. Requires the "leidenalg" python library
  algorithm = 1)

Multiplot with the different clustering results :

# diffresol

## Plot the clustering results
Seurat::DimPlot(object = sobj, 
                reduction = "umap",
                group.by = paste0('RNA_snn_res.', resol),
                pt.size = 1,
                label = TRUE,
                repel = TRUE
) + ggplot2::theme(legend.position = "bottom")

Now we have all the data needed to find marker genes per cluster, but we must first choose the clustering resolution we want to annotate.

Advice : Always start with a low resolution to get a first idea of the broad cell types you are capturing.

Here I chose the resolution 0.2 to get simpler classes. It could also be wise to start with the resolution 0.1 to get an even more general idea of the marker genes per cluster.

# setidents

Seurat::Idents(sobj) = sobj$RNA_snn_res.0.2
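
To help choose a resolution, it can be useful to count how many clusters each one produces. The sketch below does this on a toy metadata table. The data.frame meta is an invented stand-in for sobj@meta.data, in which FindClusters stores one RNA_snn_res.* column per resolution.

```r
## Toy metadata standing in for sobj@meta.data (cluster labels invented)
meta <- data.frame(
  RNA_snn_res.0.1 = factor(c(0, 0, 0, 1, 1, 1)),
  RNA_snn_res.0.2 = factor(c(0, 0, 1, 2, 2, 2)),
  RNA_snn_res.0.8 = factor(c(0, 1, 2, 3, 3, 4)))

## Columns holding the clustering results
res_cols <- grep("^RNA_snn_res\\.", colnames(meta), value = TRUE)

## Number of clusters found at each resolution
n_clusters <- sapply(meta[res_cols], function(x) length(unique(x)))
print(n_clusters)
```

With the real object, replacing meta by sobj@meta.data should give the same kind of summary.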

Differential expression between clusters

One way to annotate the clusters of cells is to look at the genes highly expressed in one cluster compared to all the other cells : we can do a differential expression (DE) analysis.

DEA (differential expression analysis) is performed on the normalized count matrix (“data”), which is available in our integrated Seurat object.

To perform the manual annotation, we will use the Seurat::FindAllMarkers function.

This function compares each cluster against all the others to identify differentially expressed genes that are potential marker genes.

WARNING : In Seurat v5, the integrated object still stores the 2 normalized datasets in separate layers, one per sample of origin (data.TD3A and data.TDCT). To identify marker genes, we must join these layers together with the command JoinLayers.

# joinlayers

sobj = SeuratObject::JoinLayers(sobj)

Now, we can use the function FindAllMarkers to perform differential gene expression of each cluster against all the others. The objective is to find genes “specific” to each cluster in order to annotate them.

# FindAllMarkers

## find markers for every cluster compared to all remaining cells,
## report only genes with positive DE 
all_markers = Seurat::FindAllMarkers(
  object = sobj,
  ## Only keep genes more expressed in the cluster of interest than the others
  only.pos = TRUE,
  ## Minimum % of cells expressing the marker in the cluster of interest
  min.pct = 0.25,
  ## Minimum absolute logFC between conditions
  logfc.threshold = 0.25) 

# This command can take up to 2 mins !

WARNING : this command can take several minutes to run !

Once the markers per cluster have been identified, we can look at the number of markers identified by cluster.

# table_markers

# Number of markers identified by cluster
table(all_markers$cluster)

   0    1    2    3    4    5    6 
 656  211  554 1725  150  118 1093 

In this table, the first row shows the cluster name, and the second row gives the number of marker genes (ie DEG) identified for this cluster.

Here we see that many markers (ie differentially expressed genes) have been identified. We cannot look at all of them, but we can choose to look at the top 10 markers per cluster and use our biological knowledge to identify cell populations.

# top10

`%>%` <- magrittr::`%>%`
top10_markers = as.data.frame(all_markers %>% 
                               dplyr::group_by(cluster) %>% 
                               dplyr::top_n(n = 10, wt = avg_log2FC))
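
For readers less familiar with dplyr, the same selection can be sketched in base R. The table toy_markers below is invented and only mimics a few columns returned by FindAllMarkers. The filter on p_val_adj is an extra step that the chunk above does not perform, and the 0.05 cut-off is an arbitrary choice for this demo.

```r
## Toy table mimicking some columns returned by FindAllMarkers (values invented)
toy_markers <- data.frame(
  cluster    = c("0", "0", "0", "1", "1", "1"),
  gene       = c("G1", "G2", "G3", "G4", "G5", "G6"),
  avg_log2FC = c(2.5, 0.8, 1.7, 3.1, 0.4, 1.2),
  p_val_adj  = c(1e-10, 1e-3, 0.2, 1e-8, 1e-4, 1e-6))

## Keep significant markers only (0.05 is an arbitrary cut-off for this demo)
sig <- toy_markers[toy_markers$p_val_adj < 0.05, ]

## Sort by cluster, then by decreasing fold change
sig <- sig[order(sig$cluster, -sig$avg_log2FC), ]

## Keep the top 2 markers of each cluster
top2 <- do.call(rbind, lapply(split(sig, sig$cluster), head, n = 2))
print(as.character(top2$gene))
```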
# top10h

## Visualize the top 10 marker gene expression per cluster using the default heatmap function of Seurat.
Seurat::DoHeatmap(sobj, features = top10_markers$gene) + Seurat::NoLegend()

# top10dt

## A dotplot
Seurat::DotPlot(sobj, features = unique(top10_markers$gene)) +
  ## Just aesthetics
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, 
                                                     vjust = 1,
                                                     size = 8, 
                                                     hjust = 1)) 

On this plot we can see that some markers display a very high and specific expression in a single cluster, while others are expressed in 2 or more clusters.

Biology in this plot

  • What is the function of each marker gene ? Is it a known marker gene for a cell type ?

  • Is there literature about its pattern of expression ?

Here we need biological knowledge, and to go back and forth, recomputing marker genes at different clustering resolutions, to understand which cell populations are present in the data.

At this step, using known marker genes from your experiment/knowledge is also useful, and easy to perform.

Let’s focus on 2 genes :

  • Rag1, a key gene of T-cell maturation, involved in the variable/diversity/joining (V[D]J) rearrangement

  • Cd5, a marker of self-reactivity of T-cells

# UMAP2markers

Seurat::FeaturePlot(
  object = sobj, 
  features = c("Cd5", "Rag1"), 
  reduction = "umap")

Here we can see that these markers may actually be very informative to distinguish the stages of the T-cell maturation process.

At this point, using your biological knowledge of your dataset is critical. You can also manually test any marker of your choice !

7.1.4 Conclusion

  • Perform differential expression for each cluster VS all the others with a normalized matrix

  • Look at the gene expression of the markers identified in the 2D representation to validate specificity and representation

  • Find the cell population corresponding to these markers and annotate this cluster

Advantages :

  • Easy to implement

  • Sometimes the only solution (ex : novel tissue)

  • Everything is possible

Limits :

  • Clustering : resolution, merged clusters, “bio-informatic” cluster

  • Change clustering ? Change annotation…

  • Knowledge : time-consuming

7.2 Automatic annotation using reference markers

For some tissues, the different cell types have already been largely described and databases exist with referenced marker genes.

Another way to annotate your dataset will be to find a database with relevant annotation for your dataset and use tools of automatic annotation to annotate your clusters.

Let’s see how it works in practice !

We will use a database focused on immunological cell types called ImmGen, thanks to the celldex R package that “provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq”

NOTE : In the following chunk of code, you will load the annotation file from the IFB server where we downloaded it. In real life, and with a version of dbplyr lower than or equal to 2.3, you can also use this command :

annotation = celldex::ImmGenData(ensembl = FALSE) 
# ensembl : TRUE to get ENSEMBL gene IDs, FALSE to get gene symbols
# refloading

## Loading the ImmGen database
annotation = readRDS(paste0(
  '/shared/projects/',
  sessionid,
  '/atelier_scrnaseq/TD/RESOURCES/ImmGenData.RDS'))

## A quick description of the db
annotation
class: SummarizedExperiment 
dim: 22134 830 
metadata(0):
assays(1): logcounts
rownames(22134): Zglp1 Vmn2r65 ... Tiparp Kdm1a
rowData names(0):
colnames(830):
  GSM1136119_EA07068_260297_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_1.CEL
  GSM1136120_EA07068_260298_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_2.CEL ...
  GSM920654_EA07068_201214_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_1.CEL
  GSM920655_EA07068_201215_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_2.CEL
colData names(3): label.main label.fine label.ont

This database contains 3 levels of granularity :

  • A “main” level (coarse grain)

  • A “fine” level (self-explanatory)

  • The “ONT” level (data are mapped to a defined ontology)

As we are in a context of sorted cells of the same lineage, we’re going to use the fine label.

Let’s see how many cell types are described in this ImmGen database :

# Immgen_ct

length(unique(annotation$label.fine))
[1] 253

The tool we will use to perform the automatic cell type annotation, SingleR, works better with normalized data. Thus, we will extract the normalized matrix from our Seurat object :

# norm_mat

## Extraction of the normalized data 
norm_exp_mat = Seurat::GetAssayData(
  object = sobj,
  assay = "RNA",
  layer = "data"
)

## Matrix dimensions
dim(norm_exp_mat)
[1] 12926  3886

We are ready to start the annotation.

The following SingleR command performs the prediction of cell types for each cell of the dataset.

# annpred

## Run in ~ 3-5 min depending on the number of CPU and memory defined
ann_predictions = SingleR::SingleR(
  ## our normalized matrix  
  test = norm_exp_mat, 
  ## The ImmGen DB
  ref = annotation, 
  ## The annotation grain level
  labels = annotation$label.fine, 
  ## Marker detection scheme 
  de.method="classic",
  assay.type.test = "logcounts",
  assay.type.ref = "logcounts",
  BPPARAM = BiocParallel::SerialParam())
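
To give an intuition of what happens in this step, here is a deliberately simplified sketch : each query cell is compared to each reference profile with a Spearman correlation, and receives the label of the best-scoring profile. This is only the core idea. The real SingleR method adds further steps (such as marker gene selection and fine-tuning) that this toy example skips. All matrices and names below are invented.

```r
## Toy reference profiles (genes x cell types), values invented
ref <- matrix(c(9, 1, 1,
                8, 2, 1,
                1, 9, 2,
                2, 8, 1,
                1, 1, 9),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("g", 1:5), c("typeA", "typeB", "typeC")))

## Toy query cells (genes x cells), values invented
query <- matrix(c(7, 1,
                  9, 2,
                  2, 8,
                  1, 9,
                  0, 3),
                nrow = 5, byrow = TRUE,
                dimnames = list(paste0("g", 1:5), c("cell1", "cell2")))

## Spearman correlation of every query cell (rows) against every reference profile (columns)
scores <- cor(query, ref, method = "spearman")

## Best-scoring reference label for each cell
toy_pred <- colnames(scores)[max.col(scores, ties.method = "first")]
names(toy_pred) <- rownames(scores)
print(toy_pred)
```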

The resulting object is a special kind of data.frame (a DataFrame). Each row corresponds to a cell and contains the prediction scores computed by SingleR. The label assigned to each cell is stored in the column $labels.

How many different kinds of labels were identified ?

# nlabels

length(unique(ann_predictions$labels))
[1] 35

Besides scoring, SingleR assesses the score quality, and prunes bad results.

How many cells got a poor quality annotation ?

# prunedcells

summary(is.na(ann_predictions$pruned.labels))
   Mode   FALSE    TRUE 
logical    3847      39 
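
Pruned cells get an NA in the pruned.labels column. One possible way to handle them, sketched here on toy data, is to give them an explicit category before tabulating. The vector toy_pruned is an invented stand-in for ann_predictions$pruned.labels, and the "unassigned" name is an arbitrary choice.

```r
## Toy vector mimicking ann_predictions$pruned.labels (NA = pruned cell)
toy_pruned <- c("T.DPsm", NA, "T.ISP", "T.DP69+", NA)

## Replace pruned (NA) labels by an explicit category
toy_clean <- ifelse(is.na(toy_pruned), "unassigned", toy_pruned)

## Count cells per label, including the unassigned ones
print(table(toy_clean))

## Fraction of cells that lost their label
frac_pruned <- mean(is.na(toy_pruned))
print(frac_pruned)
```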

Annotation diagnostic

SingleR provides some diagnostic plots :

  • We can visualize the score of each cell, split by cell type label, as a heatmap :
# heatmap_predicition

SingleR::plotScoreHeatmap(ann_predictions)

How do you interpret this heatmap ?

Add the annotation to the Seurat object

We add a new metadata containing the annotation of each cell to our Seurat object.

# predmeta

sobj$singler_cells_labels = ann_predictions$labels

Visualization of our annotation on UMAP

We can visualize the cell annotation on the UMAP.

# UMAPlabeled

## Just for aesthetics (sets a color palette to use)
seeable_palette = setNames(
  c(RColorBrewer::brewer.pal(name = "Dark2", n = 8),
    c(1:(length(unique(ann_predictions$labels)) - 8))),
  nm = names(sort(table(ann_predictions$labels), decreasing = TRUE)))

## UMAP with the predicted annotation by cell
ann_plot = Seurat::DimPlot(
  object = sobj, 
  reduction = "umap", 
  group.by = "singler_cells_labels",  
  pt.size = 2,
  cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")

# UMAP with the cluster numbers (before annotation)
clust_plot = Seurat::DimPlot(
  object = sobj, 
  reduction = "umap", 
  group.by = "RNA_snn_res.0.2",
  pt.size = 2,
  label = TRUE,
  repel = TRUE
)

print(ann_plot + clust_plot)

A look at the number of cells assigned to each ImmGen reference label

# projIG

table(sobj$singler_cells_labels)

                      DC (DC.8-4-11B-)        Macrophages (MF.11CLOSER.SALM3) 
                                     1                                      1 
                   NKT (NKT.44+NK1.1-)                       T cells (T.4.Pa) 
                                     1                                      1 
                  T cells (T.4FP3+25+)                       T cells (T.4Nve) 
                                     1                                      2 
                  T cells (T.4SP24int)       T cells (T.8EFF.OT1.48HR.LISOVA) 
                                    22                                      2 
       T cells (T.8MEM.OT1.D45.LISOVA)                       T cells (T.8Mem) 
                                     1                                      2 
T cells (T.8MEMKLRG1-CD127+.D8.LISOVA)                   T cells (T.8NVE.OT1) 
                                     1                                      1 
                      T cells (T.8Nve)                     T cells (T.8SP24-) 
                                    25                                      8 
                  T cells (T.8SP24int)                     T cells (T.8SP69+) 
                                     3                                      7 
                    T cells (T.CD4.5H)                  T cells (T.CD4TESTCJ) 
                                     5                                      5 
                    T cells (T.CD8.5H)                    T cells (T.CD8.CTR) 
                                     2                                      2 
                      T cells (T.DN2B)                      T cells (T.DN3-4) 
                                     2                                      1 
                      T cells (T.DN3A)                       T cells (T.DN3B) 
                                    48                                      6 
                       T cells (T.DN4)                         T cells (T.DP) 
                                     1                                      1 
                     T cells (T.DP69+)                       T cells (T.DPbl) 
                                   218                                     71 
                      T cells (T.DPsm)                        T cells (T.ISP) 
                                  3174                                    263 
                     T cells (T.Tregs)                 Tgd (Tgd.imm.VG1+VD6+) 
                                     1                                      2 
                        Tgd (Tgd.VG2+)                    Tgd (Tgd.VG5+24AHI) 
                                     2                                      2 
                             Tgd (Tgd) 
                                     1 

From this quick prediction, it seems that our dataset contains mostly T cells, and particularly T.DPsm, T.ISP and T.DP69+.

Maybe the annotation is not perfectly suited to our dataset. Some cell populations in the annotation are closely related, which leads to competition between labels for our cells.

It is possible to run the annotation at the cluster level : it will be cleaner than the single-cell level annotation. But make sure that the clustering does not merge several different cell populations.

We can check the number of cells attributed to labels from each cluster :

# cellpercluster

table(sobj$singler_cells_labels,
      sobj$RNA_snn_res.0.2)
                                        
                                            0    1    2    3    4    5    6
  DC (DC.8-4-11B-)                          0    0    0    0    0    1    0
  Macrophages (MF.11CLOSER.SALM3)           0    0    0    0    0    1    0
  NKT (NKT.44+NK1.1-)                       0    0    0    1    0    0    0
  T cells (T.4.Pa)                          0    0    1    0    0    0    0
  T cells (T.4FP3+25+)                      0    0    1    0    0    0    0
  T cells (T.4Nve)                          0    0    2    0    0    0    0
  T cells (T.4SP24int)                      1    0   21    0    0    0    0
  T cells (T.8EFF.OT1.48HR.LISOVA)          0    0    1    1    0    0    0
  T cells (T.8MEM.OT1.D45.LISOVA)           0    0    1    0    0    0    0
  T cells (T.8Mem)                          0    0    2    0    0    0    0
  T cells (T.8MEMKLRG1-CD127+.D8.LISOVA)    0    0    1    0    0    0    0
  T cells (T.8NVE.OT1)                      0    0    1    0    0    0    0
  T cells (T.8Nve)                          0    0   25    0    0    0    0
  T cells (T.8SP24-)                        0    0    8    0    0    0    0
  T cells (T.8SP24int)                      0    0    2    1    0    0    0
  T cells (T.8SP69+)                        0    0    7    0    0    0    0
  T cells (T.CD4.5H)                        0    0    5    0    0    0    0
  T cells (T.CD4TESTCJ)                     0    0    5    0    0    0    0
  T cells (T.CD8.5H)                        0    0    2    0    0    0    0
  T cells (T.CD8.CTR)                       0    0    2    0    0    0    0
  T cells (T.DN2B)                          0    0    0    1    0    0    1
  T cells (T.DN3-4)                         0    0    0    1    0    0    0
  T cells (T.DN3A)                          0    0    0    2    0    0   46
  T cells (T.DN3B)                          0    1    0    5    0    0    0
  T cells (T.DN4)                           0    0    1    0    0    0    0
  T cells (T.DP)                            0    0    0    0    0    1    0
  T cells (T.DP69+)                         2    3  211    0    2    0    0
  T cells (T.DPbl)                          0   17    0   54    0    0    0
  T cells (T.DPsm)                       2067  827  137    0   73   70    0
  T cells (T.ISP)                           1    4    0  258    0    0    0
  T cells (T.Tregs)                         0    0    1    0    0    0    0
  Tgd (Tgd.imm.VG1+VD6+)                    0    0    1    1    0    0    0
  Tgd (Tgd.VG2+)                            0    0    1    0    0    0    1
  Tgd (Tgd.VG5+24AHI)                       0    0    2    0    0    0    0
  Tgd (Tgd)                                 0    0    0    0    0    0    1

We can also check whether some clusters contain multiple cell types. We compute the proportion of each cell type in each cluster. If a cluster is composed of two cell types (or more), maybe the clustering resolution is too low ?

# propcell

## Compute the proportion of cell types per cluster
pop_by_cluster = prop.table(table(sobj$singler_cells_labels,
                                  sobj$RNA_snn_res.0.2),
                            margin = 2)

## Print the number of cell types that each represent more than 30% of a cluster
colSums(pop_by_cluster > 0.3)
0 1 2 3 4 5 6 
1 1 2 1 1 1 1 
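
The counts above tell how many labels pass the threshold, but not which ones. A small sketch on toy data shows one way to list their names. The vectors toy_label and toy_cluster are invented stand-ins for sobj$singler_cells_labels and sobj$RNA_snn_res.0.2.

```r
## Toy per-cell labels and clusters (invented for illustration)
toy_label   <- c("T.DPsm", "T.DPsm", "T.DPsm", "T.ISP",
                 "T.DP69+", "T.DP69+", "T.DPsm", "T.DPsm")
toy_cluster <- c("0", "0", "0", "0",
                 "2", "2", "2", "2")

## Proportion of each label within each cluster (each column sums to 1)
toy_prop <- prop.table(table(toy_label, toy_cluster), margin = 2)

## Names of the labels representing more than 30% of each cluster
dominant <- lapply(colnames(toy_prop), function(cl) {
  rownames(toy_prop)[toy_prop[, cl] > 0.3]
})
names(dominant) <- colnames(toy_prop)
print(dominant)
```

In this toy case, cluster "0" is dominated by a single label while cluster "2" is shared by two labels, which is the situation that may call for a higher clustering resolution.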

Beware :

  • small weird clusters of cells : they might be of interest BUT they can also be clustering artefacts

  • very large clusters of cells : if you notice that marker genes are representative of only a fraction of this large cluster, you might need to adjust the clustering parameters to be more discriminating.

7.2.1 Conclusion

  1. Find a good marker gene reference (PanglaoDB, CellMarker, CancerSEA…)

  2. Select a tool / model : classifier, scoring function

  3. Annotate your dataset

Advantages :

  • Annotation for every single cell is possible

  • Design your own reference

Limitations :

  • Find the good reference markers

  • Cell types arborescence

  • Limited number of cell types : all cells are annotated, or “unknown” is possible ?

7.3 Automatic annotation using a reference scRNAseq

Another possibility is to use a published single-cell dataset as a reference for the cluster annotation.

This is very useful when you work on a tissue that is close to a tissue already studied, or if you work on another species and want a quick overview of what the predicted annotation would look like. Multiple tools exist to transfer annotations to your own dataset (SingleR, Azimuth, Symphony, classifiers like SVMs…). Start with the one you know well, or the one used by people in your lab / bioinformatics team, so that you can get help if needed (then you can try others…).

Annotation transfer with a reference dataset

We are not going to use this method today but you might want to use it for your practicals.

Here are the main commands from SingleR.

# singlerpub

## Load the reference dataset in RDS format (it can also be loaded in another format, see the doc of SingleR to convert your reference to a suitable format for the prediction)
## Assumption : the reference was saved as a Seurat object
REF_SNRNASEQ = readRDS("reference_scRNAseq.RDS")

## Remove from the object the cells whose metadata "Cell.type" is NA
REF_SNRNASEQ = REF_SNRNASEQ[, !is.na(REF_SNRNASEQ$Cell.type)]

## Create a SingleCellExperiment object for your reference dataset
REF_SCE = Seurat::as.SingleCellExperiment(REF_SNRNASEQ)

## Normalize the reference (SingleR needs log-normalized data)
REF_SCE = scuttle::logNormCounts(REF_SCE)

## Create a SingleCellExperiment object for your dataset
## (sx is your own Seurat object, already normalized)
sx_sce = Seurat::as.SingleCellExperiment(sx)

## RUN SINGLER
## The labels must come from the same metadata column as the one filtered above
pred.grun = SingleR::SingleR(test = sx_sce,
                             ref = REF_SCE,
                             labels = REF_SCE$Cell.type,
                             de.method = "wilcox")

7.3.1 Conclusion

  1. Find a quality reference dataset : several bulk RNA-seq data, one scRNAseq…

  2. Select a tool to transfer annotation (SingleR, …)

  3. Annotate your dataset

Advantages :

  • Single cell level

  • Design your own reference

Limitations :

  • Find the good reference dataset

  • Limited number of cell types (you can only find cell types present in the reference dataset)

  • Never trust the prediction of this automatic annotation 100% : some tools do not have the option to say “no correspondence found” and will force a label on every cell.



8 General Conclusion

Manual cluster annotation using differential expression

Advantages :

  • Easy to implement

  • May be the only solution

  • Everything is possible

Limitations :

  • Clustering : resolution, merged clusters, “bio-informatic” cluster

  • Change clustering ? Change annotation…

  • Knowledge : time-consuming

Automatic annotation using reference markers

Advantages :

  • Single cell level is possible

  • Design your own reference

Limitations :

  • Find the good reference markers

  • Cell types arborescence

  • Limited number of cell types : all cells are annotated, or “unknown” ?

Automatic annotation using reference dataset

Advantages :

  • Single cell level

  • Design your own reference

Limitations :

  • Find the good reference dataset

  • Limited number of cell types : all cells are annotated, or “unknown” ?

A few pieces of advice :)

  • It is recommended to combine multiple methods to annotate your data

    • Use manual cluster annotation to quickly identify your cell populations

    • Identify good markers for each cell population → your reference markers

    • Use automatic cell annotation with your set of markers → your reference dataset

    • Use your references to annotate new datasets and go back to manual annotation to refine your analysis.

  • Sometimes, annotation reveals that the dataset would benefit from a re-clustering : some clusters may group 2 cell types or, on the contrary, two different clusters may express very similar markers and should be merged.

  • During annotation, do not hesitate to look at the expression of mitochondrial or ribosomal genes (or any other set of genes) in your clusters. It might help you to identify a cluster of cells that looks weird to you. Clusters of “artificial” cells (cells of low quality) could lead to the identification of weird (novel ?!) cells that have no real biological significance. But be careful : a cluster with a high expression of mitochondrial or ribosomal genes can sometimes have biological meaning.
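
As an illustration of the last point, here is a minimal base R sketch computing the percentage of counts coming from mitochondrial and ribosomal genes in each cell. The count matrix is invented, and the gene name patterns ("mt-" for mitochondrial genes, "Rps" / "Rpl" for ribosomal protein genes) are an assumption based on mouse gene symbols; adapt them to your species and annotation.

```r
## Toy count matrix (genes x cells); mouse-style gene symbols, counts invented
counts <- matrix(c(10,  2,
                    5, 40,
                    5,  8,
                   30, 50),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("mt-Co1", "Rps3", "Rpl5", "Cd5"),
                                 c("cell1", "cell2")))

## Assumed naming patterns for mouse gene symbols
is_mito <- grepl("^mt-", rownames(counts))
is_ribo <- grepl("^Rp[sl]", rownames(counts))

## Percentage of the counts of each cell coming from each gene set
pct_mito <- 100 * colSums(counts[is_mito, , drop = FALSE]) / colSums(counts)
pct_ribo <- 100 * colSums(counts[is_ribo, , drop = FALSE]) / colSums(counts)

print(rbind(pct_mito, pct_ribo))
```

These per-cell percentages can then be averaged per cluster, or displayed on the UMAP, to spot clusters driven by such genes.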

Note about automatic annotation :

If you are working with non-model species or with multiple species : it is not trivial to transfer an annotation from one species to another. Marker genes are not always conserved across evolution. In this case, manual annotation is a very important sanity check of any automatic annotation !!



9 Optional Part

9.1 Annotation transfer

SingleR can transfer cell annotations from a reference dataset to your query dataset, here at the cluster level.

9.1.1 Cluster level annotation

9.1.1.1 Explanation

It is possible to run the cell type prediction with SingleR per cluster instead of per cell. The idea is similar, but instead of annotating every single cell with its best match in the reference dataset, it annotates every cluster of your query dataset with its best match in the reference dataset (SingleR summarizes the expression profiles of all cells from the same cluster, then assesses the resulting aggregate).

Note : we run the same command as before (SingleR), we only add the “clusters” parameter to the SingleR function to annotate by cluster and not by cell.
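To make the idea of "aggregate, then compare" more concrete, here is a deliberately simplified sketch in base R. It sums the expression of the cells of each cluster, then picks for each cluster the reference profile with the highest Spearman correlation. This is only an illustration of the principle : the real SingleR also selects marker genes and refines its scores, which is not reproduced here. All matrices, gene names and cell type names below are made-up toy values.

```r
## Toy query expression matrix (made-up values) : 5 genes x 6 cells
query <- matrix(
  c(9, 8, 9, 1, 2, 1,   # geneA
    7, 9, 8, 2, 1, 1,   # geneB
    1, 2, 1, 9, 8, 9,   # geneC
    2, 1, 1, 8, 9, 7,   # geneD
    5, 5, 4, 5, 4, 5),  # geneE
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:5]), paste0("cell", 1:6))
)

## Toy cluster assignment (hypothetical) : one cluster per cell
clusters <- factor(c("0", "0", "0", "1", "1", "1"))

## Toy reference profiles (hypothetical cell types "typeX" and "typeY")
reference <- matrix(
  c(10, 1,
     6, 2,
     1, 9,
     2, 7,
     5, 5),
  nrow = 5, byrow = TRUE,
  dimnames = list(rownames(query), c("typeX", "typeY"))
)

## Step 1 : aggregate cells per cluster (rowsum works on rows, hence the transpositions)
cluster_profiles <- t(rowsum(t(query), group = clusters))

## Step 2 : Spearman correlation of each cluster profile with each reference profile
scores <- cor(cluster_profiles, reference, method = "spearman")

## Step 3 : keep, for each cluster, the reference label with the highest score
best_label <- colnames(scores)[apply(scores, 1, which.max)]
names(best_label) <- rownames(scores)
print(best_label)
```

Only one comparison per cluster is needed here, instead of one per cell, which is why the cluster-level mode is so much faster.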

Advantage(s)

  • Much faster at the cluster level : SingleR can be time-consuming to run on every single cell of your dataset, particularly if the reference and the query datasets are big. Sometimes, you might want to perform a first quick prediction at the cluster level, just to get a general idea of how the prediction behaves on your dataset and which cell types you captured.

  • Check if the reference dataset will be of any help : before running a long cell-level prediction with SingleR (again, if your dataset is big), you want to know whether the reference you are using is informative at the cluster level. If it fails to identify cell types, even “general” ones, it is unlikely to do better at the cell level : look for another reference.

9.1.1.2 Code :

# repred

# Rerun a prediction using clustering information 
# This command is much faster because the prediction is only performed for the 7 clusters and not for each cell.
clust_ann_predictions = SingleR::SingleR(
  test = norm_exp_mat,
  clusters = sobj$RNA_snn_res.0.2,
  ref = annotation,
  labels = annotation$label.fine,
  assay.type.test = "logcounts",
  assay.type.ref = "logcounts",
  BPPARAM = BiocParallel::SerialParam()
)


How many clusters have been labelled for each annotation label ?

# cclab

## EXPLANATION OF THE COMMAND BELOW

head(sort(table(clust_ann_predictions$labels), decreasing = TRUE))
Show output

 T cells (T.DPsm)  T cells (T.DN3A) T cells (T.DP69+)   T cells (T.ISP) 
                4                 1                 1                 1 
## This command takes the vector of annotation labels (clust_ann_predictions$labels)
## The function table builds a contingency table counting how many clusters were assigned each label from the reference dataset
## Then the function sort orders this table in decreasing order, so that the most frequently assigned labels come first
## Finally the function head shows only the first entries of the sorted table (6 by default)
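The nested command can be easier to read when each function is applied in its own step. The sketch below does this on a plain character vector containing the seven cluster labels displayed in this session, so it runs without the SingleR result. The intermediate variable names are only illustrative.

```r
## The seven cluster-level labels displayed in this session (one per cluster)
labels <- c(
  "T cells (T.DPsm)", "T cells (T.DPsm)", "T cells (T.DP69+)",
  "T cells (T.ISP)", "T cells (T.DPsm)", "T cells (T.DPsm)",
  "T cells (T.DN3A)"
)

## Step 1 : count how many clusters received each label
label_counts <- table(labels)

## Step 2 : put the most frequently assigned labels first
sorted_counts <- sort(label_counts, decreasing = TRUE)

## Step 3 : keep only the first entries (at most 6 by default)
top_counts <- head(sorted_counts)
print(top_counts)
```

Here only four distinct labels exist, so head returns all of them. With a finer clustering or a richer reference, it would truncate the table to the six most frequent labels.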

For how many clusters was the annotation of poor quality ?

# pruned_clusters

## EXPLANATION OF THE COMMAND BELOW

summary(is.na(clust_ann_predictions$pruned.labels))
Show output
   Mode   FALSE 
logical       7 
## This command takes the column "pruned.labels" from the table of predictions
## then the function "is.na" flags the NA values in this column (a pruned, i.e. low-quality, label is set to NA)
## Finally, the function summary counts the FALSE and TRUE values : here all 7 clusters are FALSE, so no label was pruned
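In this session no cluster label was pruned, so the output only shows FALSE. To see what the same commands return when some labels are pruned, here is a small sketch on a hypothetical vector of pruned labels (the cluster names and the NA positions are invented for the example).

```r
## Hypothetical pruned labels for 5 clusters : NA means the label was pruned
pruned_labels <- c(
  "0" = "T cells (T.DPsm)",
  "1" = NA,
  "2" = "T cells (T.ISP)",
  "3" = NA,
  "4" = "T cells (T.DN3A)"
)

## TRUE where the label was pruned
is_pruned <- is.na(pruned_labels)

## On a logical vector, summary counts the FALSE and TRUE values
print(summary(is_pruned))

## Number of pruned clusters, and which ones they are
n_pruned <- sum(is_pruned)
pruned_clusters <- names(pruned_labels)[is_pruned]
print(pruned_clusters)
```

Knowing which clusters were pruned is usually more useful than the count alone, since these are the clusters to inspect manually.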

Annotation diagnostic

We can visualize the scores of each cell type, for each cluster, as a heatmap :

# heatbc

## Heatmap using the annotation prediction by cluster
SingleR::plotScoreHeatmap(clust_ann_predictions)
Show plot

What do you observe here ? What is the difference with the annotation by cell ?

Add annotation to metadata

We add the annotation to our Seurat object.

# add2

## Save the name of future annotation
clust_labels_col = "singler_clust_labels"

## Create a column with this name in the metadata and fill it with the cluster of each cell
sobj@meta.data[[clust_labels_col]] = sobj@meta.data$RNA_snn_res.0.2

## Replace each cluster level with its predicted annotation
## (this assumes that the predictions are in the same order as the cluster levels)
levels(sobj@meta.data[[clust_labels_col]]) = clust_ann_predictions$labels
# displaypred

clust_ann_predictions$labels
Show output
[1] "T cells (T.DPsm)"  "T cells (T.DPsm)"  "T cells (T.DP69+)"
[4] "T cells (T.ISP)"   "T cells (T.DPsm)"  "T cells (T.DPsm)" 
[7] "T cells (T.DN3A)" 
levels(sobj@meta.data[[clust_labels_col]])
Show output
[1] "T cells (T.DPsm)"  "T cells (T.DP69+)" "T cells (T.ISP)"  
[4] "T cells (T.DN3A)" 
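Replacing the factor levels works as long as the predicted labels come in the same order as the cluster levels. A more defensive variant is to look up each cell's label by cluster name. The sketch below shows this lookup on toy vectors (hypothetical values, deliberately listed out of order). On the real objects, the names could presumably be taken from the row names of the prediction table, assuming they hold the cluster identifiers.

```r
## Toy cluster assignment (hypothetical) : one cluster per cell
cell_clusters <- factor(c("0", "2", "1", "0", "2", "1", "1"))

## Toy cluster-level predictions, named by cluster and NOT in level order
cluster_labels <- c(
  "2" = "T cells (T.ISP)",
  "0" = "T cells (T.DPsm)",
  "1" = "T cells (T.DPsm)"
)

## Look up the label of each cell by the NAME of its cluster, not by position
cell_labels <- factor(unname(cluster_labels[as.character(cell_clusters)]))

## Check : number of cells per annotation label
print(table(cell_labels))
```

Because the lookup goes through the cluster names, the result stays correct even if the predictions are reordered or if a cluster level is missing from the data.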

Visualization

We can visualize the cell annotations on the 2D projection, at the cluster level (left) and at the cell level (right) :

# umapcomp

ann_cluster_plot = Seurat::DimPlot(
  object = sobj, 
  reduction = "umap", 
  group.by = clust_labels_col,
  pt.size = 2,
  label = FALSE, 
  cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")

ann_cell_plot = Seurat::DimPlot(
  object = sobj, 
  reduction = "umap", 
  group.by = "singler_cells_labels",
  pt.size = 2,
  label = FALSE,
  repel = TRUE, 
  cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")

ann_cluster_plot + ann_cell_plot
Show plot
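Besides the visual comparison, the two annotations can be cross-tabulated to quantify how often they agree. The sketch below uses two short toy vectors (hypothetical labels for six cells). On the Seurat object, the same `table()` call could be applied to the two metadata columns used in the plots above.

```r
## Toy annotations (hypothetical) for the same 6 cells
cluster_level <- c("T.DPsm", "T.DPsm", "T.DPsm", "T.ISP", "T.ISP", "T.ISP")
cell_level    <- c("T.DPsm", "T.DPsm", "T.DP69+", "T.ISP", "T.ISP", "T.DPsm")

## Contingency table : rows = cluster-level label, columns = cell-level label
comparison <- table(cluster_level, cell_level)
print(comparison)

## Fraction of cells for which both annotations give the same label
agreement <- mean(cluster_level == cell_level)
print(round(agreement, 2))
```

Cells falling off the matching entries of the table are those where the cell-level prediction disagrees with the label of its cluster. They can point to heterogeneous clusters or to closely related cell types.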

Save your annotated Seurat object

We save the annotated Seurat object :

# saverds3

## Save our Seurat object (rich naming)
out_name <- paste0(
          output_dir, "/", paste(
            c("13", Seurat::Project(sobj), "S5", 
              "Integrated_Annotated"
            ), collapse = "_"),
            ".RDS")

## Check
print(out_name)
Show output
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/08_Cell.Annotation/RESULTS/13_TD3A.TDCT_S5_Integrated_Annotated.RDS"
## Write on disk
saveRDS(object = sobj, 
        file = out_name)

10 References

Good practices for single cell analysis : https://www.sc-best-practices.org/preamble.html

Sanger Single cell course : https://www.singlecellcourse.org/index.html

SingleR : https://bioconductor.org/books/3.12/SingleRBook/

11 Resources

For human :

GeneCards : https://www.genecards.org

Human Protein Atlas : https://www.proteinatlas.org/search/H2-K1






12 Rsession

# rsession

utils::sessionInfo()
Show output
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.29.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SummarizedExperiment_1.34.0 Biobase_2.64.0             
 [3] GenomicRanges_1.56.2        GenomeInfoDb_1.40.1        
 [5] IRanges_2.38.1              S4Vectors_0.42.1           
 [7] BiocGenerics_0.50.0         MatrixGenerics_1.16.0      
 [9] matrixStats_1.5.0           future_1.49.0              

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3        rstudioapi_0.17.1        
  [3] jsonlite_2.0.0            magrittr_2.0.3           
  [5] spatstat.utils_3.1-4      farver_2.1.2             
  [7] rmarkdown_2.29            zlibbioc_1.50.0          
  [9] vctrs_0.6.5               ROCR_1.0-11              
 [11] DelayedMatrixStats_1.26.0 spatstat.explore_3.4-3   
 [13] S4Arrays_1.4.1            htmltools_0.5.8.1        
 [15] SparseArray_1.4.8         sass_0.4.10              
 [17] sctransform_0.4.2         parallelly_1.45.0        
 [19] KernSmooth_2.23-24        bslib_0.9.0              
 [21] htmlwidgets_1.6.4         ica_1.0-3                
 [23] plyr_1.8.9                plotly_4.10.4            
 [25] zoo_1.8-14                cachem_1.1.0             
 [27] igraph_2.1.4              mime_0.13                
 [29] lifecycle_1.0.4           pkgconfig_2.0.3          
 [31] rsvd_1.0.5                Matrix_1.7-3             
 [33] R6_2.6.1                  fastmap_1.2.0            
 [35] GenomeInfoDbData_1.2.12   fitdistrplus_1.2-2       
 [37] shiny_1.10.0              digest_0.6.37            
 [39] patchwork_1.3.0           Seurat_5.3.0             
 [41] tensor_1.5                RSpectra_0.16-2          
 [43] irlba_2.3.5.1             beachmat_2.20.0          
 [45] labeling_0.4.3            progressr_0.15.1         
 [47] spatstat.sparse_3.1-0     httr_1.4.7               
 [49] polyclip_1.10-7           abind_1.4-8              
 [51] compiler_4.4.1            withr_3.0.2              
 [53] BiocParallel_1.38.0       viridis_0.6.5            
 [55] fastDummies_1.7.5         MASS_7.3-65              
 [57] DelayedArray_0.30.1       tools_4.4.1              
 [59] lmtest_0.9-40             httpuv_1.6.15            
 [61] future.apply_1.11.3       goftest_1.2-3            
 [63] glue_1.8.0                nlme_3.1-165             
 [65] promises_1.3.2            grid_4.4.1               
 [67] Rtsne_0.17                cluster_2.1.6            
 [69] reshape2_1.4.4            generics_0.1.4           
 [71] gtable_0.3.6              spatstat.data_3.1-6      
 [73] rmdformats_1.0.4          tidyr_1.3.1              
 [75] data.table_1.17.4         ScaledMatrix_1.12.0      
 [77] BiocSingular_1.20.0       XVector_0.44.0           
 [79] sp_2.2-0                  spatstat.geom_3.4-1      
 [81] RcppAnnoy_0.0.22          ggrepel_0.9.6            
 [83] RANN_2.6.2                pillar_1.10.2            
 [85] stringr_1.5.1             spam_2.11-1              
 [87] RcppHNSW_0.6.0            limma_3.60.6             
 [89] later_1.4.2               splines_4.4.1            
 [91] dplyr_1.1.4               lattice_0.22-6           
 [93] survival_3.7-0            deldir_2.0-4             
 [95] tidyselect_1.2.1          miniUI_0.1.2             
 [97] pbapply_1.7-2             knitr_1.50               
 [99] gridExtra_2.3             bookdown_0.39            
[101] scattermore_1.2           xfun_0.52                
[103] statmod_1.5.0             pheatmap_1.0.12          
[105] UCSC.utils_1.0.0          stringi_1.8.7            
[107] lazyeval_0.2.2            yaml_2.3.10              
[109] evaluate_1.0.3            codetools_0.2-20         
[111] tibble_3.2.1              cli_3.6.5                
[113] uwot_0.2.3                xtable_1.8-4             
[115] reticulate_1.42.0         jquerylib_0.1.4          
[117] dichromat_2.0-0.1         Rcpp_1.0.14              
[119] globals_0.18.0            spatstat.random_3.4-1    
[121] png_0.1-8                 spatstat.univar_3.1-3    
[123] parallel_4.4.1            ggplot2_3.5.2            
[125] presto_1.0.0              SingleR_2.6.0            
[127] dotCall64_1.2             sparseMatrixStats_1.16.0 
[129] listenv_0.9.1             viridisLite_0.4.2        
[131] scales_1.4.0              ggridges_0.5.6           
[133] crayon_1.5.3              SeuratObject_5.1.0       
[135] purrr_1.0.4               rlang_1.1.6              
[137] cowplot_1.1.3            
