sincellTE 2024 - TP Secondary Analysis: cell annotation

Published

Oct 22, 2024

Objectives

Cell annotation is one of the first steps in exploring the biological information captured in your dataset. It allows you to assign a type, and sometimes a subtype or even a cell state (e.g., proliferating, apoptotic), to individual cells.

The objective of this session is to gain hands-on experience with various methods for cell annotation. Precise and reliable cell annotation takes time. Here, we will focus on understanding and critically evaluating these different methods.

Methods for Cell Annotation

Cell annotation methods are based on a common principle: they use genes or gene sets specific to certain cell types to assign an identity to the cells.

However, these methods differ in how they accomplish this. They can be classified into two main approaches:

Manual Annotation: This method involves using marker gene sets from the literature, which are manually curated. In your dataset, clusters with high expression of these markers are annotated with the corresponding cell type.
Automatic Annotation: This approach utilizes marker sets from databases, or published single-cell reference atlases or datasets. Tools then automatically annotate the cells based on this information.

Note

These two approaches are not mutually exclusive and can be combined.

For this session, we will use the dataset prepared in the previous practical sessions.

suppressWarnings({
    suppressMessages({
      library(tidyverse)
      library(S4Vectors)
      library(Seurat)
      library(ggplot2)
      library(patchwork)
    })
})

set.seed(123)

user.id <- Sys.getenv("USER")
input_file <-paste("/shared/home2", user.id, "cell_annotation/data/SC26_FMTG_FILTERED_SCT_CLUSTERS.rda", sep = "/")
output_dir <- paste0("/shared/home", user.id, "cell_annotation/results/", sep = "/")
load(input_file)

1. Manual Annotation

One way to annotate cell clusters is to examine the genes that are highly expressed in one cluster compared to all other cells, and then infer the corresponding cell type. These markers can be obtained from:

Knowledge of the field (e.g., CD3A for T cells, CD8A for CD8+ T cells).
Relevant literature related to your project.
Differentially expressed genes (DEGs): Performing differential expression analysis on different clusters can reveal specific markers that help identify each cluster. However, not everyone agrees that DEG analysis should be used this way.

1.1 Using One Marker per Cell Type

Here, we have a list of markers from the dataset’s associated publication, allowing us to annotate 9 cell populations:

Cell type	Marker
Proximal enterocytes	Apoa4
Distal enterocytes	Enpep
Goblet cells	Muc2
Tuft cells	Alox5ap
Stem cells	Lgr5
Naive CD4+ T cells	Ccr7
Memory CD4+ T cells	S100a4
CD8+ T cells	Cd8a
B cells	Ms4a1

In previous practical sessions, cell clustering was performed. Clustering at a resolution of 0.1 yielded 9 clusters, matching the number of cell types we are aiming to identify. This clustering will be used for the first manual annotation step.

Violin plots and feature plots are useful for visualizing marker expression across the different clusters.

## define the markers
feats <- c('Muc2', 'Enpep', 'Apoa4', "Alox5ap",
              'Ccr7', 'S100a4', 'Cd8a', 'Lgr5', 'Ms4a1')

## set the cells default identity to clustering
Idents(sobj) <- "SCT_snn_res.0.1"
DimPlot(sobj)

FeaturePlot(sobj, features=feats)

VlnPlot(sobj, feats, layer = "data")

After visualizing these plots, we can propose an initial annotation:

Cluster	Marker	Cell type
0	Cd8a	CD8+ T cells
1	Ms4a1	B cells
2	Cd8a	CD8+ T cells
3	Enpep + strong expression of Apo4	mix of proximal and distal enterocytes
4	Ccr7 + Muc2	Naive CD4 + T cells and Goblet cells
5	S100a4	Naive CD4 + T cells
6	Ms4a1	B cells
7	Alox5ap + Lgr5	Tuft cells + stem cells

Using one marker per cell type allowed us to form initial hypotheses about the cell types in each cluster, but also revealed some limitations:

Some markers are not very specific: e.g., Apoa4 is expressed in a large proportion of clusters, though its expression is highest in cluster 2.

sobj[["Apoa4+"]] <- as.numeric((sobj@assays$SCT@data['Apoa4',] > 0))

sobj@meta.data %>%
  ## group by cluster
  group_by(SCT_snn_res.0.2) %>% 
  ## compute the percentage of cells expressing the gene
  summarize(percentage = round(mean(`Apoa4+` == 1) * 100))

SCT_snn_res.0.2	percentage
0	92
1	92
2	92
3	99
4	92
5	71
6	61
7	46

Some markers are specific but not expressed in all cells within a cluster: e.g., Lgr5 is detected almost exclusively in cluster 6, but only in 40% of the cells.

sobj[["Lgr5+"]] <- as.numeric((sobj@assays$SCT@data['Lgr5',] > 0))

sobj@meta.data %>%
  ## group by cluster
  group_by(SCT_snn_res.0.1) %>% 
  ## compute the percentage of cells expressing the gene
  summarise(percentage = round(mean(`Lgr5+` == 1) * 100))

SCT_snn_res.0.1	percentage
0	0
1	0
2	0
3	0
4	2
5	0
6	33

Question

Why a marker such as Lgr5+ might not be found in all the cells of a cluster ?

1.2 Using a Marker List

Using only one marker per cell type can be risky. Instead, using a list of markers can help mitigate biases due to unexpressed genes or unspecific markers. To improve cluster annotation, we will now use a list of markers with several genes for each cell type.

marker_genes <- list(
    CD8 = c('Cd8a', 'Gzmb', 'Gzma', 'Cd7') ,
    CD4_naive = c('Ccr7', 'Ms4a4b', 'Cd27', 'Cd28'),
    CD4_memory = c('S100a4', 'Cd44', 'Ccl9'),
    B_cells = c('Ms4a1', 'Cd79a', 'Cd74'),
    Enterocyte_distal = c('Enpep', 'Anpep', 'Ace2'),
    Enterocyte_proximal = c('Apoa4', 'Dstn', 'Fabp2'),
    Goblet_cells = c('Muc2', 'Agr2', 'Spink4', 'Tff3'),
    Stem_cells = c('Lgr5', 'Hck', 'Nrgn'),
    Tuft_cells = c('Alox5ap','l1b', 'S100a9'))

## keep only genes also found in the count matrix
marker_genes <- lapply(marker_genes, function(vec) { 
  intersect(vec, rownames(sobj))
})

A dot plot is useful for assessing the expression of multiple genes simultaneously in populations of cells.

DotPlot(sobj, features = marker_genes) +
  theme(axis.text.x=element_text(angle = 45, vjust=0.5, hjust = 1),
        strip.text = element_text(angle = 45, hjust = 0.5))

Warning: The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
ℹ Please use the `rows` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
  Please report the issue at <https://github.com/satijalab/seurat/issues>.

The AddModuleScore() function calculates a score for each cell based on the expression of a list of genes. This score, automatically added to the metadata, can then be plotted on the UMAP.

sobj <- AddModuleScore(sobj, features = marker_genes,
                                pool = NULL,
                                nbin = 5,
                                seed = 1,
                                ctrl = length(marker_genes),
                                k = FALSE,
                                name = "program")

## score for each list of gene is named program1, program2...
## rename the columns in metadata to get more interpretable names
vec_rename <- paste0("program", seq(1:length(marker_genes)))
names(vec_rename) <- names(marker_genes)
sobj@meta.data <- sobj@meta.data %>% 
  dplyr::rename(all_of(vec_rename))

head(sobj@meta.data)

	orig.ident	nCount_RNA	nFeature_RNA	min_features	min_counts	percent_mt	percent_rb	Seurat.SmG2M.Score	Seurat.S.Score	Seurat.G2M.Score	Seurat.Phase	Cyclone.SmG2M.Score	Cyclone.Phase	Cyclone.G1.Score	Cyclone.S.Score	Cyclone.G2M.Score	scDblFinder.class	scds.score	scds.class	doublets_consensus.class	nCount_SCT	nFeature_SCT	SCT_snn_res.0.1	SCT_snn_res.0.2	SCT_snn_res.0.5	SCT_snn_res.1	seurat_clusters	Apoa4+	CD8	CD4_naive	CD4_memory	B_cells	Enterocyte_distal	Enterocyte_proximal	Goblet_cells	Stem_cells	Tuft_cells
AAACCCAAGAAGCGGG-1	SC26_FMTG	3562	1511	TRUE	TRUE	13.50365	17.574396	0.0355307	0.0231535	-0.0123772	S	-0.0274207	G1	0.7463582	0.1131105	0.1405313	FALSE	0.2900607	FALSE	FALSE	4355	1506	2	2	2	2	2	1	2.1977709	-0.0559695	-0.0663615	-0.3977087	-0.1404695	0.2184639	0.0079912	-0.0256721	0.0000000
AAACCCAAGAGGGTCT-1	SC26_FMTG	3541	1254	TRUE	TRUE	17.36798	26.715617	-0.0580402	-0.0438693	0.0141709	G2M	0.0309859	G1	0.9389671	0.0460094	0.0150235	FALSE	0.1230478	FALSE	FALSE	4411	1246	1	1	1	1	1	1	-0.0559695	-0.3929514	-0.0513442	2.2868583	-0.3573649	0.3792493	0.0850075	-0.0513442	0.0000000
AAACCCACAAGAGTTA-1	SC26_FMTG	5825	2004	TRUE	TRUE	13.80258	18.008584	-0.0792198	-0.0640550	0.0151648	G2M	0.2658875	G1	0.7297297	0.2680789	0.0021914	FALSE	0.1829081	FALSE	FALSE	5213	2000	0	0	0	0	0	1	2.7722634	-0.2237544	-0.1283606	-0.4889099	-0.4588492	-0.0276023	0.0737446	-0.0256721	-0.1155245
AAACGAACAGTTGTTG-1	SC26_FMTG	2260	1030	TRUE	TRUE	13.98230	3.672566	-0.0230277	-0.0625898	-0.0395621	G1	0.2186520	G1	0.7735110	0.2225705	0.0039185	FALSE	0.3076467	FALSE	FALSE	4308	1071	3	3	3	5	5	1	-0.1214092	-0.2020111	0.0000000	0.0576368	-0.0292990	2.4933740	0.1525850	-0.0513442	-0.0610340
AAACGAAGTTCTCCAC-1	SC26_FMTG	5696	1961	TRUE	TRUE	16.92416	19.592697	-0.0360164	-0.0347169	0.0012994	G2M	0.4043750	G1	0.5943750	0.4050000	0.0006250	FALSE	0.5117774	FALSE	FALSE	5176	1955	2	2	2	2	2	1	2.5625411	-0.2808202	-0.0770164	-0.4597078	-0.3807917	0.3803345	0.1671197	-0.0406893	-0.0995422
AAACGAATCTCGACCT-1	SC26_FMTG	7439	2261	TRUE	TRUE	10.04167	22.153515	0.0131680	0.0661095	0.0529415	S	0.2776978	G1	0.7194245	0.2791367	0.0014388	FALSE	0.3469976	FALSE	FALSE	5370	2233	0	0	0	0	0	1	2.9931588	-0.3563888	-0.0770164	-0.5132974	-0.1220680	0.4712829	0.0770164	-0.1433778	-0.0385082

DimPlot(sobj, group.by = "SCT_snn_res.0.2")

FeaturePlot(sobj, features = names(marker_genes))

With this new information, we can refine the initial annotation by assigning each cluster the cell type with the highest score.

Cluster	Cell type
0	CD8+ T cells
1	B cells
2	CD8+ T cells
3	proximal and distal enterocytes with higher score for distal enterocytes. Will be called enterocytes.
4	naive CD4+ T cells
5	Memory CD4+ T cells + Goblet cells
6	B cells
7	Stem cells and Tuft cells

Based on these observations, the clusters can now be manually annotated.

sobj[["seurat_clusters"]] <- sobj[["SCT_snn_res.0.2"]]

sobj@meta.data <- sobj@meta.data %>% 
  mutate(
    manual_annotation = case_when(
        seurat_clusters == "0" ~ "CD8+ T cells",
        seurat_clusters == "1" ~ "B cells",
        seurat_clusters == "2" ~ "CD8+ T cells",
        seurat_clusters == "3" ~ "Enterocytes",
        seurat_clusters == "4" ~ "Naive CD4+ T cells",
        seurat_clusters == "5" ~ "Memory CD4+ T cells + Goblet cells",
        seurat_clusters == "6" ~ "B cells",
        seurat_clusters == "7" ~ "Stem cells + Tuft cells"))

Idents(sobj) <-"manual_annotation"
DimPlot(sobj,
        pt.size = 0.1,
        label = FALSE ,
        repel = TRUE ,
        label.size = 3)

1.3 Summary

Advantages	Limits
Easy to implement	Working at the cluster resolution: risk of merging or splitting actual populations
Sometimes the only solution (ex : novel tissue)	Change clustering ? Change annotation
Flexible	Time-consuming, requires knowledge

1.4 Beyond cluster resolution

Since each cell has a score for the different gene sets, annotation could also be performed at the cell level. Try to think of a few commands to perform this annotation.

Answer

df <- sobj@meta.data %>%
  dplyr::select(names(marker_genes))

sobj[["module_annotation"]] <- apply(df, 1, function(row) names(df)[which.max(row)])

2. Automatic annotation

For some tissues, the different cell types have already been largely described and databases exist with lists of referenced marker genes or datasets. This is particularly true for human and mouse.

Instead of manual annotation, these datasets can be used, along with published tools to call the cell types in your data. This process is called automatic annotation.

2.1 Using a list of marker genes.

In this example, we’ll annotate cells using a list of marker genes from the Panglao database, combined with the CelliD package.

The database

Panglao is an online gene-centric database, meaning it focuses solely on genes rather than entire single-cell datasets.

library(CelliD)

df_panglao <- read_tsv("https://panglaodb.se/markers/PanglaoDB_markers_27_Mar_2020.tsv.gz")

head(df_panglao)

species	official gene symbol	cell type	nicknames	ubiquitousness index	product description	gene type	canonical marker	germ layer	organ	sensitivity_human	sensitivity_mouse	specificity_human	specificity_mouse
Mm Hs	CTRB1	Acinar cells	CTRB	0.017	chymotrypsinogen B1	protein-coding gene	1	Endoderm	Pancreas	1.000000	0.957143	0.0006289	0.0159201
Mm Hs	KLK1	Acinar cells	Klk6	0.013	kallikrein 1	protein-coding gene	1	Endoderm	Pancreas	0.833333	0.314286	0.0050314	0.0128263
Mm Hs	RBPJL	Acinar cells	RBP-L\|SUHL\|RBPSUHL	0.001	recombination signal binding protein for immunoglobulin kappa J region like	protein-coding gene	1	Endoderm	Pancreas	0.000000	0.000000	0.0000000	0.0000000
Mm Hs	PTF1A	Acinar cells	PTF1-p48\|bHLHa29	0.001	pancreas associated transcription factor 1a	protein-coding gene	1	Endoderm	Pancreas	0.000000	0.157143	0.0006289	0.0007734
Mm	TRY4	Acinar cells	NA	0.007	trypsin 4	protein coding gene	1	Endoderm	Pancreas	NA	NA	NA	NA
Mm Hs	CELA3A	Acinar cells	ELA3\|ELA3A	0.001	chymotrypsin like elastase family member 3A	protein-coding gene	1	Endoderm	Pancreas	0.833333	0.128571	0.0000000	0.0000000

In Panglao, each row corresponds to a gene and the each column to one of its characteritics:

species indicate if the gene is expressed in mice and/or human
cell type is the final label used to annotate our dataset.
gene type indicates the biotype of the gene: protein-coding, long non coding RNA… Can be used for markers selection.
organ selecting the right organs is good practice to avoid aberrant annotations.

The quality and specificity of the reference are important to unmeaningful annotations. Here the columns species and organ will be used to select the subset of markers useful for our annotation.

df_markers <- df_panglao %>%
  ## restricting to human specific genes
  filter(str_detect(species,"Hs"),
         organ %in% c("Immune system", "GI tract")) %>%
  mutate(`official gene symbol` = str_to_title(`official gene symbol`)) %>%
  ## converting dataframes into a list of vectors, which is the format needed as input for CelliD
  group_by(`cell type`) %>%
  summarise(geneset = list(`official gene symbol`))

The method

Here the package CelliD will be used. It is a clustering-free method developped by Cortal et al.. It calculates a per-cell gene signature score, which enables cell identity recognition. This method can also be applied to explore functional pathway enrichment.

CelliD can take Seurat objects as input. The reference for annotation must be provided as a named list of genes.

annot_gs <- setNames(df_markers$geneset, df_markers$`cell type`)
annot_gs <- annot_gs[sapply(annot_gs, length) >= 10]

annot_gs[c("B cells", "B cells memory")]

$`B cells`
 [1] "Cd2"       "Cd5"       "Ms4a1"     "Cr2"       "Cd22"      "Fcer2"    
 [7] "Cd40"      "Cd69"      "Cd70"      "Cd79a"     "Cd79b"     "Cd80"     
[13] "Cd86"      "Tnfrsf9"   "Sdc1"      "Tnfsf4"    "Tnfrsf13b" "Tnfrsf13c"
[19] "Pdcd1"     "Ighd"      "Ighm"      "Rasgrp3"   "Hla-Dra"   "Ltb"      
[25] "Hla-Dqa1"  "Fli1"      "Cd14"      "Sema6d"    "Lair1"     "Ifit3"    
[31] "Igll1"     "Dntt"      "Mme"       "Spn"       "Cd19"      "Cd24"     
[37] "Cd27"      "B3gat1"    "Cd72"      "Mum1"      "Pax5"      "Jchain"   
[43] "Mzb1"      "Ly6d"      "Fcmr"      "Bank1"     "Edem1"     "Vpreb3"   
[49] "Pou2af1"   "Creld2"    "Derl3"     "Ralgps2"   "Fchsd2"    "Pold4"    
[55] "Tnfrsf17"  "Hvcn1"     "Fcrla"     "Edem2"     "Blnk"      "Txndc11"  
[61] "Btla"      "Smap2"     "Fkbp11"    "Sec61a1"   "Spcs3"     "Spib"     
[67] "Eaf2"      "Cxcr4"     "Birc3"     "Iglc2"     "Iglc3"     "Iglc1"    
[73] "Il21r"     "Igkc"      "Vpreb1"    "Lrmp"      "Klhl6"     "Slamf6"   
[79] "Fam129c"   "Bst1"      "Msh5"      "Dok3"      "Bach2"     "Pxk"      
[85] "Ighg1"     "Ighg3"     "Ighg4"     "Cd38"      "Ptprc"     "Ebf1"     
[91] "Bcl11a"    "Ccr7"      "Cd55"      "Cd74"      "Cd52"      "Tlr9"     
[97] "Swap70"    "Hmga1"    

$`B cells memory`
 [1] "Cd38"      "Cd80"      "Cd84"      "Cd86"      "Nt5e"      "Pax5"     
 [7] "Ptprc"     "Spib"      "Adam28"    "Aim2"      "Alox5"     "Bach2"    
[13] "Bank1"     "Blk"       "Ccr6"      "Cd180"     "Cd19"      "Cd1c"     
[19] "Cd22"      "Cd27"      "Cd37"      "Cd69"      "Cd72"      "Cd79a"    
[25] "Cd79b"     "Clca3p"    "Cr2"       "Cxcr5"     "Dennd5b"   "Fcgr2b"   
[31] "Fcrl2"     "Frk"       "Gng7"      "Gpr18"     "Gusbp11"   "Hhex"     
[37] "Hla-Dob"   "Ifna10"    "Ighd"      "Ighm"      "Igkc"      "Igll3p"   
[43] "Il7"       "Irf8"      "Ltb"       "Ly86"      "Mbl2"      "Ms4a1"    
[49] "Nmbr"      "Npipb15"   "P2rx5"     "Pnoc"      "Ptprcap"   "Ralgps2"  
[55] "Rasgrp2"   "Sik1"      "Sit1"      "Slc12a1"   "Sp140"     "Stap1"    
[61] "Tmem156"   "Tnfrsf13b" "Tnfrsf17"  "Traf4"     "Vpreb3"    "Zbtb32"

CelliD relies on a dimensionality reduction of the expression matrix in a low-dimension space common to cells AND genes with the method MCA. In such space, the closer a gene is to a cell, the more specific it is to this cell. Thus, a gene ranking is obtained for each cell based on their distance to the cell in the MCA space. For each cell, the closest genes can be regarded as a unique cell identity card and used to annotate it.

More precisely, a statistical test (hypergeometric test) associates to each cell, given its more specific genes, a probability to be of the different cell type. The cell type with the lowest p-value is selected.

## run MCA to project cells and genes in a common low-dimensionality space
sobj <- RunMCA(sobj)

2.18 sec elapsed

5.869 sec elapsed

0.592 sec elapsed

# Performing per-cell hypergeometric tests against the gene signature collection
HGT_gs <- RunCellHGT(sobj, pathways = annot_gs, dims = 1:50, n.features = 200)

# For each cell, select the signature with the lowest corrected p-value (max -log10 corrected p-value)
gs_prediction <- rownames(HGT_gs)[apply(HGT_gs, 2, which.max)]

# For each cell, evaluate if the lowest p-value is significant
gs_prediction_signif <- ifelse(apply(HGT_gs, 2, max)>2, yes = gs_prediction, "unassigned")

# Save cell type predictions as metadata within the Seurat object
sobj$cellid_annotation <- gs_prediction_signif

Note

If CelliD cannot attribute a significant annotation to a cell, it will be labelled “unassigned”. They can still be manually annotated.

The results

CelliD results can be visualized on the UMAP.

p1 <- DimPlot(sobj,
        group.by = "SCT_snn_res.0.1",
        pt.size = 0.1,
        label = TRUE ,
        repel = TRUE ,
        label.size = 3) + NoLegend()

p2 <- DimPlot(sobj,
        group.by = "cellid_annotation",
        pt.size = 0.1,
        label = TRUE ,
        repel = TRUE ,
        label.size = 3)

p1 / p2

There appears to be many cells types on a single plot. Visualizing them all in the same time can be difficult. Instead, it is also possible to facet the plot in order to visualize 1 cell type at a time.

list_plots <- lapply(unique(sobj$cellid_annotation), function(celltype) {
  chl <- setNames(list(subset(sobj, cellid_annotation == celltype) %>% 
                         colnames()),
                celltype)
  
  return(DimPlot(sobj, group.by = "cellid_annotation", cells.highlight = chl))
})

wrap_plots(list_plots, ncol = 4)

CelliD returned many more populations than the 9 populations of the manual annotation. This is in line with the number and variety of cell types in the reference list of genes.

names(annot_gs)

 [1] "B cells"                          "B cells memory"                  
 [3] "B cells naive"                    "Basophils"                       
 [5] "Crypt cells"                      "Dendritic cells"                 
 [7] "Enteric glia cells"               "Enteric neurons"                 
 [9] "Enterochromaffin cells"           "Enterocytes"                     
[11] "Enteroendocrine cells"            "Eosinophils"                     
[13] "Foveolar cells"                   "Gamma delta T cells"             
[15] "Gastric chief cells"              "Goblet cells"                    
[17] "Macrophages"                      "Mast cells"                      
[19] "Megakaryocytes"                   "Microfold cells"                 
[21] "Monocytes"                        "Myeloid-derived suppressor cells"
[23] "NK cells"                         "Natural killer T cells"          
[25] "Neutrophils"                      "Nuocytes"                        
[27] "Paneth cells"                     "Parietal cells"                  
[29] "Plasma cells"                     "Plasmacytoid dendritic cells"    
[31] "Red pulp macrophages"             "T cells"                         
[33] "T follicular helper cells"        "T helper cells"                  
[35] "T memory cells"                   "T regulatory cells"              
[37] "Tuft cells"

These results show why manual curation can still be needed after automatic annotation:

Some annotations concern only a very small number of cells. It might be false positives, artifacts or actual cell populations expected to be small. Biological knowledge is needed to decide whether the cells should be kept or renamed. Be careful with interpretation.

sobj@meta.data %>% 
  group_by(cellid_annotation) %>% 
  tally() %>% 
  arrange(n) %>% 
  head()

cellid_annotation	n
Natural killer T cells	1
B cells memory	3
Dendritic cells	3
Plasma cells	4
Crypt cells	5
T regulatory cells	5

Some annotated cell populations are closely related, which can lead to competition between cell types for accurate annotation. For example the B cells populations might probably all be merged into 1 B cell label. For cluster 0, the situation is more complicated:
- does it contain NK cells or T cells ?
- if it contains T cells, which ones ?

# Example cluster 0
sobj@meta.data %>% 
  filter(SCT_snn_res.0.2 == 0) %>% 
  group_by(cellid_annotation) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  head()

cellid_annotation	n
NK cells	449
unassigned	89
Gamma delta T cells	37
T cells	5
T regulatory cells	5
Macrophages	1

To answer this question, there are several possibilities:

count the labels occurence and choose the majoritary label(here NK cells would win).
include a biologist in the process and use biological knowledge: which cell type is expected in this dataset ? Why ? Can very specific markers help to identify the cells ?
use other references and annotations tools and compare the results.

Note

Cell level annotations might look noisy. For smoother results, some tools can run at the cluster level. Or you can manually annotate clusters with the main label returned by your annotation tool. However, make sure that your clustering level is the right one and does not merge several populations.

2.2 Using a reference dataset

Some tools are able to use already annotated datasets to annotate new ones, on basis of the transcriptional profiles of the cells. This process is sometimes called annotation transfer.

Their use is facilitated by the increasing number of published datas and the existence of atlases integrating several datasets. If a dataset suitable for your own data exist, it can be used as a reference for automatic annotation

In this example, cells will be annotated using a dataset from the package celldex using the tool SingleR.

The database

celldex provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq.

library(celldex)
library(SingleR)

immgen_ref <- celldex::ImmGenData()

Typically, each reference provides three levels of cell type annotation in its column metadata:

label.main, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.
label.fine, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.
label.ont, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.

Here the dataset ImmGennData will be used.

The method

Here the package SinglR will be used.

SingleR usually works with log normalized data. Data were already normalized with SCT but, for this application, the previous normalization will be applied to all gene couts.

DefaultAssay(sobj) <- "RNA"

sobj <- sobj %>% 
  FindVariableFeatures() %>% 
  NormalizeData() %>% 
  ScaleData()

Finding variable features for layer counts

Normalizing layer: counts

Centering and scaling data matrix

## Fro Seurat v5, join layers for integrated objects
sobj[["RNA"]] <- JoinLayers(sobj[["RNA"]])

SingleR does not take Seurat objects as input but either matrices or SummarizedExperiment objects. The reference is already in this SummarizedExperiment format. For the input data, the expression matrix needs to be extracted or converted to a SingleCellExperiment object, a format inherited from SummarizedExperiment, with the command line.

sce <- as.SingleCellExperiment(sobj)

Then, labels can be computed just by running the SingleR() function.

df_labels <- SingleR(test = GetAssayData(sobj, assay = "RNA", layer = "data"),
                       ref = immgen_ref,
                       labels = immgen_ref$label.main,
                       assay.type.test = "logcounts")

head(df_labels) %>% 
  as.data.frame()

	scores.B.cells	scores.B.cells..pro	scores.Basophils	scores.DC	scores.Endothelial.cells	scores.Eosinophils	scores.Epithelial.cells	scores.Fibroblasts	scores.ILC	scores.Macrophages	scores.Mast.cells	scores.Microglia	scores.Monocytes	scores.Neutrophils	scores.NK.cells	scores.NKT	scores.Stem.cells	scores.Stromal.cells	scores.T.cells	scores.Tgd	labels	delta.next	pruned.labels
AAACCCAAGAAGCGGG-1	0.2033111	0.2068373	0.2066186	0.1981340	0.0906038	0.1722631	0.1196504	0.0488273	0.3444374	0.1696821	0.1621333	0.1262688	0.1765475	0.1260725	0.3210943	0.3437948	0.2019263	0.0446823	0.3237717	0.3400036	ILC	0.0353467	ILC
AAACCCAAGAGGGTCT-1	0.3367376	0.2181210	0.1524551	0.2240873	0.1277862	0.1491067	0.1243731	0.0735197	0.1990659	0.2004364	0.1408415	0.1508748	0.2016722	0.1301909	0.2132847	0.1955743	0.1951414	0.0852191	0.2068749	0.2136054	B cells	0.1126503	B cells
AAACCCACAAGAGTTA-1	0.2523571	0.2417808	0.2293900	0.2229132	0.1084866	0.1861911	0.1190182	0.0664974	0.3869385	0.1704448	0.1863966	0.1304647	0.1901410	0.1264326	0.3789434	0.3744221	0.2412139	0.0600474	0.3598439	0.3758250	ILC	0.0613775	ILC
AAACGAACAGTTGTTG-1	0.0993034	0.0758659	0.1156172	0.1171548	0.1189611	0.1166718	0.1582854	0.1238863	0.1311276	0.1325085	0.1246085	0.0674073	0.1285736	0.1206987	0.0996414	0.0913978	0.1249268	0.1151023	0.1103719	0.1111212	Epithelial cells	0.0224571	NA
AAACGAAGTTCTCCAC-1	0.2279274	0.2064368	0.2157174	0.2136378	0.1122832	0.1870847	0.1228304	0.0595659	0.3580839	0.1684061	0.1765575	0.1092343	0.1808137	0.1220009	0.3350064	0.3577826	0.2201776	0.0652464	0.3417077	0.3606864	ILC	0.0202611	ILC
AAACGAATCTCGACCT-1	0.2598720	0.2512336	0.2418633	0.2487234	0.1312708	0.2096719	0.1372808	0.0766834	0.3830688	0.2023382	0.2026497	0.1520056	0.2184256	0.1402733	0.3747529	0.3568818	0.2467829	0.0811662	0.3420645	0.3653939	ILC	0.0403528	ILC

invisible(gc())

The results

SingleR() returns a table with scores and prediction results for each cell. Labels are shown before (labels) and after a pruning step (pruned.labels), which removes potentially poor-quality or ambiguous assignments.

These results can be added to the Seurat object and then visualized.

df <- df_labels %>% 
  as.data.frame() %>% 
  mutate(singler_annotation = pruned.labels)

sobj <- AddMetaData(sobj, df["singler_annotation"])

rm(df)

Question

Plot the SingleR labels on a UMAP. What observations can you make about the SingleR labels ?

Here again, there appears to be many cells types. The same kind of curation than with the CelliD example can be applied.

SingleR also comes with it own diagnosis tools. The function plotScoreHeatmap() displays the scores for all cells across all reference labels, which allows to inspect the confidence of the predicted labels.

plotScoreHeatmap(df_labels)

Ideally, each cell should have one score that is significantly higher than the rest, indicating a clear assignment to a single label. However, when multiple similar scores are present for a given cell, it suggests uncertainty in the assignment.

In this case, we observe that the labels for T cells, ILCs, and NK cells are closely related, which is where domain expertise and highly specific marker sets become crucial. Additionally, the labels for Monocytes and Macrophages could potentially be merged due to their similarity.

2.3 Summary

Advantages	Limits
Works at the single cell level	Results heavily rely on the quality and completeness of the reference (be careful of missing labels)
Many resources for reference	Possible to mislabel populations
Flexible	Results can be noisy

saveRDS(sobj, file="seurat_object_annotated.rds")

References and documentation

CellID: https://www.bioconductor.org/packages/release/bioc/html/CelliD.html

SingleR: https://www.bioconductor.org/packages/release/bioc/html/SingleR.html

Many other tools. A review summarizing many annotation methods: A Review of Single-Cell RNA-Seq Annotation, Integration, and Cell–Cell Communication

Human Cell Atlas: https://www.humancellatlas.org/

Tabula muris: https://tabula-muris-senis.ds.czbiohub.org/

GeneCard : https://www.genecards.org

Session info

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /shared/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SingleR_2.6.0               celldex_1.14.0             
 [3] CelliD_1.12.0               SingleCellExperiment_1.26.0
 [5] SummarizedExperiment_1.34.0 Biobase_2.64.0             
 [7] GenomicRanges_1.56.1        GenomeInfoDb_1.40.1        
 [9] IRanges_2.38.1              MatrixGenerics_1.16.0      
[11] matrixStats_1.4.1           patchwork_1.3.0            
[13] Seurat_5.1.0                SeuratObject_5.0.2         
[15] sp_2.1-4                    S4Vectors_0.42.1           
[17] BiocGenerics_0.50.0         lubridate_1.9.3            
[19] forcats_1.0.0               stringr_1.5.1              
[21] dplyr_1.1.4                 purrr_1.0.2                
[23] readr_2.1.5                 tidyr_1.3.1                
[25] tibble_3.2.1                ggplot2_3.5.1              
[27] tidyverse_2.0.0            

loaded via a namespace (and not attached):
  [1] spatstat.sparse_3.1-0     httr_1.4.7               
  [3] RColorBrewer_1.1-3        tools_4.4.1              
  [5] sctransform_0.4.1         alabaster.base_1.4.1     
  [7] utf8_1.2.4                R6_2.5.1                 
  [9] HDF5Array_1.32.0          lazyeval_0.2.2           
 [11] uwot_0.2.2                rhdf5filters_1.16.0      
 [13] withr_3.0.1               gridExtra_2.3            
 [15] tictoc_1.2.1              progressr_0.14.0         
 [17] cli_3.6.3                 spatstat.explore_3.3-2   
 [19] fastDummies_1.7.4         alabaster.se_1.4.1       
 [21] labeling_0.4.3            spatstat.data_3.1-2      
 [23] ggridges_0.5.6            pbapply_1.7-2            
 [25] askpass_1.2.0             scater_1.32.1            
 [27] parallelly_1.38.0         rstudioapi_0.16.0        
 [29] RSQLite_2.3.7             generics_0.1.3           
 [31] ica_1.0-3                 spatstat.random_3.3-2    
 [33] vroom_1.6.5               Matrix_1.7-0             
 [35] ggbeeswarm_0.7.2          fansi_1.0.6              
 [37] abind_1.4-8               lifecycle_1.0.4          
 [39] yaml_2.3.10               rhdf5_2.48.0             
 [41] SparseArray_1.4.8         BiocFileCache_2.12.0     
 [43] Rtsne_0.17                grid_4.4.1               
 [45] blob_1.2.4                promises_1.3.0           
 [47] ExperimentHub_2.12.0      crayon_1.5.3             
 [49] miniUI_0.1.1.1            lattice_0.22-6           
 [51] beachmat_2.20.0           cowplot_1.1.3            
 [53] KEGGREST_1.44.1           pillar_1.9.0             
 [55] knitr_1.48                fgsea_1.30.0             
 [57] future.apply_1.11.2       codetools_0.2-20         
 [59] fastmatch_1.1-4           leiden_0.4.3.1           
 [61] glue_1.8.0                RcppArmadillo_14.0.2-1   
 [63] spatstat.univar_3.0-1     data.table_1.16.0        
 [65] gypsum_1.0.1              vctrs_0.6.5              
 [67] png_0.1-8                 spam_2.10-0              
 [69] gtable_0.3.5              cachem_1.1.0             
 [71] xfun_0.48                 S4Arrays_1.4.1           
 [73] mime_0.12                 survival_3.7-0           
 [75] pheatmap_1.0.12           fitdistrplus_1.2-1       
 [77] ROCR_1.0-11               nlme_3.1-165             
 [79] bit64_4.0.5               alabaster.ranges_1.4.1   
 [81] filelock_1.0.3            RcppAnnoy_0.0.22         
 [83] irlba_2.3.5.1             vipor_0.4.7              
 [85] KernSmooth_2.23-24        colorspace_2.1-1         
 [87] DBI_1.2.3                 ggrastr_1.0.2            
 [89] tidyselect_1.2.1          bit_4.0.5                
 [91] compiler_4.4.1            curl_5.2.2               
 [93] httr2_1.0.1               BiocNeighbors_1.22.0     
 [95] DelayedArray_0.30.1       plotly_4.10.4            
 [97] scales_1.3.0              lmtest_0.9-40            
 [99] rappdirs_0.3.3            digest_0.6.37            
[101] goftest_1.2-3             spatstat.utils_3.1-0     
[103] alabaster.matrix_1.4.1    rmarkdown_2.28           
[105] XVector_0.44.0            htmltools_0.5.8.1        
[107] pkgconfig_2.0.3           umap_0.2.10.0            
[109] sparseMatrixStats_1.16.0  dbplyr_2.5.0             
[111] fastmap_1.2.0             rlang_1.1.4              
[113] htmlwidgets_1.6.4         UCSC.utils_1.0.0         
[115] shiny_1.9.1               DelayedMatrixStats_1.26.0
[117] farver_2.1.2              zoo_1.8-12               
[119] jsonlite_1.8.9            BiocParallel_1.38.0      
[121] BiocSingular_1.20.0       magrittr_2.0.3           
[123] scuttle_1.14.0            GenomeInfoDbData_1.2.12  
[125] dotCall64_1.1-1           Rhdf5lib_1.26.0          
[127] munsell_0.5.1             Rcpp_1.0.13              
[129] viridis_0.6.5             reticulate_1.39.0        
[131] stringi_1.8.4             alabaster.schemas_1.4.0  
[133] zlibbioc_1.50.0           MASS_7.3-61              
[135] AnnotationHub_3.12.0      plyr_1.8.9               
[137] parallel_4.4.1            listenv_0.9.1            
[139] ggrepel_0.9.6             deldir_2.0-4             
[141] Biostrings_2.72.1         splines_4.4.1            
[143] tensor_1.5                hms_1.1.3                
[145] igraph_2.0.3              spatstat.geom_3.3-3      
[147] RcppHNSW_0.6.0            reshape2_1.4.4           
[149] ScaledMatrix_1.12.0       BiocVersion_3.19.1       
[151] evaluate_1.0.0            BiocManager_1.30.25      
[153] tzdb_0.4.0                httpuv_1.6.15            
[155] RANN_2.6.2                openssl_2.2.1            
[157] polyclip_1.10-7           future_1.34.0            
[159] scattermore_1.2           rsvd_1.0.5               
[161] xtable_1.8-4              RSpectra_0.16-2          
[163] later_1.3.2               viridisLite_0.4.2        
[165] memoise_2.0.1             beeswarm_0.4.0           
[167] AnnotationDbi_1.66.0      cluster_2.1.6            
[169] timechange_0.3.0          globals_0.16.3