sincellTE 2024 - TP Secondary Analysis: cell annotation

Published

Oct 22, 2024

Objectives

Cell annotation is one of the first steps in exploring the biological information captured in your dataset. It allows you to assign a type, and sometimes a subtype or even a cell state (e.g., proliferating, apoptotic), to individual cells.

The objective of this session is to gain hands-on experience with various methods for cell annotation. Precise and reliable cell annotation takes time. Here, we will focus on understanding and critically evaluating these different methods.

Methods for Cell Annotation

Cell annotation methods are based on a common principle: they use genes or gene sets specific to certain cell types to assign an identity to the cells.

However, these methods differ in how they accomplish this. They can be classified into two main approaches:

  • Manual Annotation: This method involves using marker gene sets from the literature, which are manually curated. In your dataset, clusters with high expression of these markers are annotated with the corresponding cell type.

  • Automatic Annotation: This approach utilizes marker sets from databases, or published single-cell reference atlases or datasets. Tools then automatically annotate the cells based on this information.

Note

These two approaches are not mutually exclusive and can be combined.

For this session, we will use the dataset prepared in the previous practical sessions.

suppressWarnings({
    suppressMessages({
      library(tidyverse)
      library(S4Vectors)
      library(Seurat)
      library(ggplot2)
      library(patchwork)
    })
})

set.seed(123)
user.id <- Sys.getenv("USER")
input_file <-paste("/shared/home2", user.id, "cell_annotation/data/SC26_FMTG_FILTERED_SCT_CLUSTERS.rda", sep = "/")
output_dir <- paste0("/shared/home", user.id, "cell_annotation/results/", sep = "/")
load(input_file)

1. Manual Annotation

One way to annotate cell clusters is to examine the genes that are highly expressed in one cluster compared to all other cells, and then infer the corresponding cell type. These markers can be obtained from:

  • Knowledge of the field (e.g., CD3A for T cells, CD8A for CD8+ T cells).
  • Relevant literature related to your project.
  • Differentially expressed genes (DEGs): Performing differential expression analysis on different clusters can reveal specific markers that help identify each cluster. However, not everyone agrees that DEG analysis should be used this way.

1.1 Using One Marker per Cell Type

Here, we have a list of markers from the dataset’s associated publication, allowing us to annotate 9 cell populations:

Cell type Marker
Proximal enterocytes Apoa4
Distal enterocytes Enpep
Goblet cells Muc2
Tuft cells Alox5ap
Stem cells Lgr5
Naive CD4+ T cells Ccr7
Memory CD4+ T cells S100a4
CD8+ T cells Cd8a
B cells Ms4a1

In previous practical sessions, cell clustering was performed. Clustering at a resolution of 0.1 yielded 9 clusters, matching the number of cell types we are aiming to identify. This clustering will be used for the first manual annotation step.

Violin plots and feature plots are useful for visualizing marker expression across the different clusters.

## define the markers
feats <- c('Muc2', 'Enpep', 'Apoa4', "Alox5ap",
              'Ccr7', 'S100a4', 'Cd8a', 'Lgr5', 'Ms4a1')

## set the cells default identity to clustering
Idents(sobj) <- "SCT_snn_res.0.1"
DimPlot(sobj)

FeaturePlot(sobj, features=feats)

VlnPlot(sobj, feats, layer = "data")

After visualizing these plots, we can propose an initial annotation:

Cluster Marker Cell type
0 Cd8a CD8+ T cells
1 Ms4a1 B cells
2 Cd8a CD8+ T cells
3 Enpep + strong expression of Apo4 mix of proximal and distal enterocytes
4 Ccr7 + Muc2 Naive CD4 + T cells and Goblet cells
5 S100a4 Naive CD4 + T cells
6 Ms4a1 B cells
7 Alox5ap + Lgr5 Tuft cells + stem cells

Using one marker per cell type allowed us to form initial hypotheses about the cell types in each cluster, but also revealed some limitations:

  • Some markers are not very specific: e.g., Apoa4 is expressed in a large proportion of clusters, though its expression is highest in cluster 2.
sobj[["Apoa4+"]] <- as.numeric((sobj@assays$SCT@data['Apoa4',] > 0))

sobj@meta.data %>%
  ## group by cluster
  group_by(SCT_snn_res.0.2) %>% 
  ## compute the percentage of cells expressing the gene
  summarize(percentage = round(mean(`Apoa4+` == 1) * 100))
SCT_snn_res.0.2 percentage
0 92
1 92
2 92
3 99
4 92
5 71
6 61
7 46
  • Some markers are specific but not expressed in all cells within a cluster: e.g., Lgr5 is detected almost exclusively in cluster 6, but only in 40% of the cells.
sobj[["Lgr5+"]] <- as.numeric((sobj@assays$SCT@data['Lgr5',] > 0))

sobj@meta.data %>%
  ## group by cluster
  group_by(SCT_snn_res.0.1) %>% 
  ## compute the percentage of cells expressing the gene
  summarise(percentage = round(mean(`Lgr5+` == 1) * 100))
SCT_snn_res.0.1 percentage
0 0
1 0
2 0
3 0
4 2
5 0
6 33
Question

Why a marker such as Lgr5+ might not be found in all the cells of a cluster ?

1.2 Using a Marker List

Using only one marker per cell type can be risky. Instead, using a list of markers can help mitigate biases due to unexpressed genes or unspecific markers. To improve cluster annotation, we will now use a list of markers with several genes for each cell type.

marker_genes <- list(
    CD8 = c('Cd8a', 'Gzmb', 'Gzma', 'Cd7') ,
    CD4_naive = c('Ccr7', 'Ms4a4b', 'Cd27', 'Cd28'),
    CD4_memory = c('S100a4', 'Cd44', 'Ccl9'),
    B_cells = c('Ms4a1', 'Cd79a', 'Cd74'),
    Enterocyte_distal = c('Enpep', 'Anpep', 'Ace2'),
    Enterocyte_proximal = c('Apoa4', 'Dstn', 'Fabp2'),
    Goblet_cells = c('Muc2', 'Agr2', 'Spink4', 'Tff3'),
    Stem_cells = c('Lgr5', 'Hck', 'Nrgn'),
    Tuft_cells = c('Alox5ap','l1b', 'S100a9'))

## keep only genes also found in the count matrix
marker_genes <- lapply(marker_genes, function(vec) { 
  intersect(vec, rownames(sobj))
})

A dot plot is useful for assessing the expression of multiple genes simultaneously in populations of cells.

DotPlot(sobj, features = marker_genes) +
  theme(axis.text.x=element_text(angle = 45, vjust=0.5, hjust = 1),
        strip.text = element_text(angle = 45, hjust = 0.5))
Warning: The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
ℹ Please use the `rows` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
  Please report the issue at <https://github.com/satijalab/seurat/issues>.

The AddModuleScore() function calculates a score for each cell based on the expression of a list of genes. This score, automatically added to the metadata, can then be plotted on the UMAP.

sobj <- AddModuleScore(sobj, features = marker_genes,
                                pool = NULL,
                                nbin = 5,
                                seed = 1,
                                ctrl = length(marker_genes),
                                k = FALSE,
                                name = "program")

## score for each list of gene is named program1, program2...
## rename the columns in metadata to get more interpretable names
vec_rename <- paste0("program", seq(1:length(marker_genes)))
names(vec_rename) <- names(marker_genes)
sobj@meta.data <- sobj@meta.data %>% 
  dplyr::rename(all_of(vec_rename))

head(sobj@meta.data)
orig.ident nCount_RNA nFeature_RNA min_features min_counts percent_mt percent_rb Seurat.SmG2M.Score Seurat.S.Score Seurat.G2M.Score Seurat.Phase Cyclone.SmG2M.Score Cyclone.Phase Cyclone.G1.Score Cyclone.S.Score Cyclone.G2M.Score scDblFinder.class scds.score scds.class doublets_consensus.class nCount_SCT nFeature_SCT SCT_snn_res.0.1 SCT_snn_res.0.2 SCT_snn_res.0.5 SCT_snn_res.1 seurat_clusters Apoa4+ Lgr5+ CD8 CD4_naive CD4_memory B_cells Enterocyte_distal Enterocyte_proximal Goblet_cells Stem_cells Tuft_cells
AAACCCAAGAAGCGGG-1 SC26_FMTG 3562 1511 TRUE TRUE 13.50365 17.574396 0.0355307 0.0231535 -0.0123772 S -0.0274207 G1 0.7463582 0.1131105 0.1405313 FALSE 0.2900607 FALSE FALSE 4355 1506 2 2 2 2 2 1 0 2.1977709 -0.0559695 -0.0663615 -0.3977087 -0.1404695 0.2184639 0.0079912 -0.0256721 0.0000000
AAACCCAAGAGGGTCT-1 SC26_FMTG 3541 1254 TRUE TRUE 17.36798 26.715617 -0.0580402 -0.0438693 0.0141709 G2M 0.0309859 G1 0.9389671 0.0460094 0.0150235 FALSE 0.1230478 FALSE FALSE 4411 1246 1 1 1 1 1 1 0 -0.0559695 -0.3929514 -0.0513442 2.2868583 -0.3573649 0.3792493 0.0850075 -0.0513442 0.0000000
AAACCCACAAGAGTTA-1 SC26_FMTG 5825 2004 TRUE TRUE 13.80258 18.008584 -0.0792198 -0.0640550 0.0151648 G2M 0.2658875 G1 0.7297297 0.2680789 0.0021914 FALSE 0.1829081 FALSE FALSE 5213 2000 0 0 0 0 0 1 0 2.7722634 -0.2237544 -0.1283606 -0.4889099 -0.4588492 -0.0276023 0.0737446 -0.0256721 -0.1155245
AAACGAACAGTTGTTG-1 SC26_FMTG 2260 1030 TRUE TRUE 13.98230 3.672566 -0.0230277 -0.0625898 -0.0395621 G1 0.2186520 G1 0.7735110 0.2225705 0.0039185 FALSE 0.3076467 FALSE FALSE 4308 1071 3 3 3 5 5 1 0 -0.1214092 -0.2020111 0.0000000 0.0576368 -0.0292990 2.4933740 0.1525850 -0.0513442 -0.0610340
AAACGAAGTTCTCCAC-1 SC26_FMTG 5696 1961 TRUE TRUE 16.92416 19.592697 -0.0360164 -0.0347169 0.0012994 G2M 0.4043750 G1 0.5943750 0.4050000 0.0006250 FALSE 0.5117774 FALSE FALSE 5176 1955 2 2 2 2 2 1 0 2.5625411 -0.2808202 -0.0770164 -0.4597078 -0.3807917 0.3803345 0.1671197 -0.0406893 -0.0995422
AAACGAATCTCGACCT-1 SC26_FMTG 7439 2261 TRUE TRUE 10.04167 22.153515 0.0131680 0.0661095 0.0529415 S 0.2776978 G1 0.7194245 0.2791367 0.0014388 FALSE 0.3469976 FALSE FALSE 5370 2233 0 0 0 0 0 1 0 2.9931588 -0.3563888 -0.0770164 -0.5132974 -0.1220680 0.4712829 0.0770164 -0.1433778 -0.0385082
DimPlot(sobj, group.by = "SCT_snn_res.0.2")

FeaturePlot(sobj, features = names(marker_genes))

With this new information, we can refine the initial annotation by assigning each cluster the cell type with the highest score.

Cluster Cell type
0 CD8+ T cells
1 B cells
2 CD8+ T cells
3 proximal and distal enterocytes with higher score for distal enterocytes. Will be called enterocytes.
4 naive CD4+ T cells
5 Memory CD4+ T cells + Goblet cells
6 B cells
7 Stem cells and Tuft cells

Based on these observations, the clusters can now be manually annotated.

sobj[["seurat_clusters"]] <- sobj[["SCT_snn_res.0.2"]]

sobj@meta.data <- sobj@meta.data %>% 
  mutate(
    manual_annotation = case_when(
        seurat_clusters == "0" ~ "CD8+ T cells",
        seurat_clusters == "1" ~ "B cells",
        seurat_clusters == "2" ~ "CD8+ T cells",
        seurat_clusters == "3" ~ "Enterocytes",
        seurat_clusters == "4" ~ "Naive CD4+ T cells",
        seurat_clusters == "5" ~ "Memory CD4+ T cells + Goblet cells",
        seurat_clusters == "6" ~ "B cells",
        seurat_clusters == "7" ~ "Stem cells + Tuft cells"))

Idents(sobj) <-"manual_annotation"
DimPlot(sobj,
        pt.size = 0.1,
        label = FALSE ,
        repel = TRUE ,
        label.size = 3)

1.3 Summary

Advantages Limits
Easy to implement Working at the cluster resolution: risk of merging or splitting actual populations
Sometimes the only solution (ex : novel tissue) Change clustering ? Change annotation
Flexible Time-consuming, requires knowledge

1.4 Beyond cluster resolution

Since each cell has a score for the different gene sets, annotation could also be performed at the cell level. Try to think of a few commands to perform this annotation.

df <- sobj@meta.data %>%
  dplyr::select(names(marker_genes))

sobj[["module_annotation"]] <- apply(df, 1, function(row) names(df)[which.max(row)])

2. Automatic annotation

For some tissues, the different cell types have already been largely described and databases exist with lists of referenced marker genes or datasets. This is particularly true for human and mouse.

Instead of manual annotation, these datasets can be used, along with published tools to call the cell types in your data. This process is called automatic annotation.

2.1 Using a list of marker genes.

In this example, we’ll annotate cells using a list of marker genes from the Panglao database, combined with the CelliD package.

The database

Panglao is an online gene-centric database, meaning it focuses solely on genes rather than entire single-cell datasets.

library(CelliD)

df_panglao <- read_tsv("https://panglaodb.se/markers/PanglaoDB_markers_27_Mar_2020.tsv.gz")

head(df_panglao)
species official gene symbol cell type nicknames ubiquitousness index product description gene type canonical marker germ layer organ sensitivity_human sensitivity_mouse specificity_human specificity_mouse
Mm Hs CTRB1 Acinar cells CTRB 0.017 chymotrypsinogen B1 protein-coding gene 1 Endoderm Pancreas 1.000000 0.957143 0.0006289 0.0159201
Mm Hs KLK1 Acinar cells Klk6 0.013 kallikrein 1 protein-coding gene 1 Endoderm Pancreas 0.833333 0.314286 0.0050314 0.0128263
Mm Hs RBPJL Acinar cells RBP-L|SUHL|RBPSUHL 0.001 recombination signal binding protein for immunoglobulin kappa J region like protein-coding gene 1 Endoderm Pancreas 0.000000 0.000000 0.0000000 0.0000000
Mm Hs PTF1A Acinar cells PTF1-p48|bHLHa29 0.001 pancreas associated transcription factor 1a protein-coding gene 1 Endoderm Pancreas 0.000000 0.157143 0.0006289 0.0007734
Mm TRY4 Acinar cells NA 0.007 trypsin 4 protein coding gene 1 Endoderm Pancreas NA NA NA NA
Mm Hs CELA3A Acinar cells ELA3|ELA3A 0.001 chymotrypsin like elastase family member 3A protein-coding gene 1 Endoderm Pancreas 0.833333 0.128571 0.0000000 0.0000000

In Panglao, each row corresponds to a gene and the each column to one of its characteritics:

  • species indicate if the gene is expressed in mice and/or human
  • cell type is the final label used to annotate our dataset.
  • gene type indicates the biotype of the gene: protein-coding, long non coding RNA… Can be used for markers selection.
  • organ selecting the right organs is good practice to avoid aberrant annotations.

The quality and specificity of the reference are important to unmeaningful annotations. Here the columns species and organ will be used to select the subset of markers useful for our annotation.

df_markers <- df_panglao %>%
  ## restricting to human specific genes
  filter(str_detect(species,"Hs"),
         organ %in% c("Immune system", "GI tract")) %>%
  mutate(`official gene symbol` = str_to_title(`official gene symbol`)) %>%
  ## converting dataframes into a list of vectors, which is the format needed as input for CelliD
  group_by(`cell type`) %>%
  summarise(geneset = list(`official gene symbol`))

The method

Here the package CelliD will be used. It is a clustering-free method developped by Cortal et al.. It calculates a per-cell gene signature score, which enables cell identity recognition. This method can also be applied to explore functional pathway enrichment.

CelliD can take Seurat objects as input. The reference for annotation must be provided as a named list of genes.

annot_gs <- setNames(df_markers$geneset, df_markers$`cell type`)
annot_gs <- annot_gs[sapply(annot_gs, length) >= 10]

annot_gs[c("B cells", "B cells memory")]
$`B cells`
 [1] "Cd2"       "Cd5"       "Ms4a1"     "Cr2"       "Cd22"      "Fcer2"    
 [7] "Cd40"      "Cd69"      "Cd70"      "Cd79a"     "Cd79b"     "Cd80"     
[13] "Cd86"      "Tnfrsf9"   "Sdc1"      "Tnfsf4"    "Tnfrsf13b" "Tnfrsf13c"
[19] "Pdcd1"     "Ighd"      "Ighm"      "Rasgrp3"   "Hla-Dra"   "Ltb"      
[25] "Hla-Dqa1"  "Fli1"      "Cd14"      "Sema6d"    "Lair1"     "Ifit3"    
[31] "Igll1"     "Dntt"      "Mme"       "Spn"       "Cd19"      "Cd24"     
[37] "Cd27"      "B3gat1"    "Cd72"      "Mum1"      "Pax5"      "Jchain"   
[43] "Mzb1"      "Ly6d"      "Fcmr"      "Bank1"     "Edem1"     "Vpreb3"   
[49] "Pou2af1"   "Creld2"    "Derl3"     "Ralgps2"   "Fchsd2"    "Pold4"    
[55] "Tnfrsf17"  "Hvcn1"     "Fcrla"     "Edem2"     "Blnk"      "Txndc11"  
[61] "Btla"      "Smap2"     "Fkbp11"    "Sec61a1"   "Spcs3"     "Spib"     
[67] "Eaf2"      "Cxcr4"     "Birc3"     "Iglc2"     "Iglc3"     "Iglc1"    
[73] "Il21r"     "Igkc"      "Vpreb1"    "Lrmp"      "Klhl6"     "Slamf6"   
[79] "Fam129c"   "Bst1"      "Msh5"      "Dok3"      "Bach2"     "Pxk"      
[85] "Ighg1"     "Ighg3"     "Ighg4"     "Cd38"      "Ptprc"     "Ebf1"     
[91] "Bcl11a"    "Ccr7"      "Cd55"      "Cd74"      "Cd52"      "Tlr9"     
[97] "Swap70"    "Hmga1"    

$`B cells memory`
 [1] "Cd38"      "Cd80"      "Cd84"      "Cd86"      "Nt5e"      "Pax5"     
 [7] "Ptprc"     "Spib"      "Adam28"    "Aim2"      "Alox5"     "Bach2"    
[13] "Bank1"     "Blk"       "Ccr6"      "Cd180"     "Cd19"      "Cd1c"     
[19] "Cd22"      "Cd27"      "Cd37"      "Cd69"      "Cd72"      "Cd79a"    
[25] "Cd79b"     "Clca3p"    "Cr2"       "Cxcr5"     "Dennd5b"   "Fcgr2b"   
[31] "Fcrl2"     "Frk"       "Gng7"      "Gpr18"     "Gusbp11"   "Hhex"     
[37] "Hla-Dob"   "Ifna10"    "Ighd"      "Ighm"      "Igkc"      "Igll3p"   
[43] "Il7"       "Irf8"      "Ltb"       "Ly86"      "Mbl2"      "Ms4a1"    
[49] "Nmbr"      "Npipb15"   "P2rx5"     "Pnoc"      "Ptprcap"   "Ralgps2"  
[55] "Rasgrp2"   "Sik1"      "Sit1"      "Slc12a1"   "Sp140"     "Stap1"    
[61] "Tmem156"   "Tnfrsf13b" "Tnfrsf17"  "Traf4"     "Vpreb3"    "Zbtb32"   

CelliD relies on a dimensionality reduction of the expression matrix in a low-dimension space common to cells AND genes with the method MCA. In such space, the closer a gene is to a cell, the more specific it is to this cell. Thus, a gene ranking is obtained for each cell based on their distance to the cell in the MCA space. For each cell, the closest genes can be regarded as a unique cell identity card and used to annotate it.

illustration_cellid.png

More precisely, a statistical test (hypergeometric test) associates to each cell, given its more specific genes, a probability to be of the different cell type. The cell type with the lowest p-value is selected.

## run MCA to project cells and genes in a common low-dimensionality space
sobj <- RunMCA(sobj)
2.18 sec elapsed
5.869 sec elapsed
0.592 sec elapsed
# Performing per-cell hypergeometric tests against the gene signature collection
HGT_gs <- RunCellHGT(sobj, pathways = annot_gs, dims = 1:50, n.features = 200)

# For each cell, select the signature with the lowest corrected p-value (max -log10 corrected p-value)
gs_prediction <- rownames(HGT_gs)[apply(HGT_gs, 2, which.max)]

# For each cell, evaluate if the lowest p-value is significant
gs_prediction_signif <- ifelse(apply(HGT_gs, 2, max)>2, yes = gs_prediction, "unassigned")

# Save cell type predictions as metadata within the Seurat object
sobj$cellid_annotation <- gs_prediction_signif
Note

If CelliD cannot attribute a significant annotation to a cell, it will be labelled “unassigned”. They can still be manually annotated.

The results

CelliD results can be visualized on the UMAP.

p1 <- DimPlot(sobj,
        group.by = "SCT_snn_res.0.1",
        pt.size = 0.1,
        label = TRUE ,
        repel = TRUE ,
        label.size = 3) + NoLegend()

p2 <- DimPlot(sobj,
        group.by = "cellid_annotation",
        pt.size = 0.1,
        label = TRUE ,
        repel = TRUE ,
        label.size = 3)

p1 / p2

There appears to be many cells types on a single plot. Visualizing them all in the same time can be difficult. Instead, it is also possible to facet the plot in order to visualize 1 cell type at a time.

list_plots <- lapply(unique(sobj$cellid_annotation), function(celltype) {
  chl <- setNames(list(subset(sobj, cellid_annotation == celltype) %>% 
                         colnames()),
                celltype)
  
  return(DimPlot(sobj, group.by = "cellid_annotation", cells.highlight = chl))
})

wrap_plots(list_plots, ncol = 4)

CelliD returned many more populations than the 9 populations of the manual annotation. This is in line with the number and variety of cell types in the reference list of genes.

names(annot_gs)
 [1] "B cells"                          "B cells memory"                  
 [3] "B cells naive"                    "Basophils"                       
 [5] "Crypt cells"                      "Dendritic cells"                 
 [7] "Enteric glia cells"               "Enteric neurons"                 
 [9] "Enterochromaffin cells"           "Enterocytes"                     
[11] "Enteroendocrine cells"            "Eosinophils"                     
[13] "Foveolar cells"                   "Gamma delta T cells"             
[15] "Gastric chief cells"              "Goblet cells"                    
[17] "Macrophages"                      "Mast cells"                      
[19] "Megakaryocytes"                   "Microfold cells"                 
[21] "Monocytes"                        "Myeloid-derived suppressor cells"
[23] "NK cells"                         "Natural killer T cells"          
[25] "Neutrophils"                      "Nuocytes"                        
[27] "Paneth cells"                     "Parietal cells"                  
[29] "Plasma cells"                     "Plasmacytoid dendritic cells"    
[31] "Red pulp macrophages"             "T cells"                         
[33] "T follicular helper cells"        "T helper cells"                  
[35] "T memory cells"                   "T regulatory cells"              
[37] "Tuft cells"                      

These results show why manual curation can still be needed after automatic annotation:

  • Some annotations concern only a very small number of cells. It might be false positives, artifacts or actual cell populations expected to be small. Biological knowledge is needed to decide whether the cells should be kept or renamed. Be careful with interpretation.
sobj@meta.data %>% 
  group_by(cellid_annotation) %>% 
  tally() %>% 
  arrange(n) %>% 
  head()
cellid_annotation n
Natural killer T cells 1
B cells memory 3
Dendritic cells 3
Plasma cells 4
Crypt cells 5
T regulatory cells 5
  • Some annotated cell populations are closely related, which can lead to competition between cell types for accurate annotation. For example the B cells populations might probably all be merged into 1 B cell label. For cluster 0, the situation is more complicated:

    • does it contain NK cells or T cells ?
    • if it contains T cells, which ones ?
# Example cluster 0
sobj@meta.data %>% 
  filter(SCT_snn_res.0.2 == 0) %>% 
  group_by(cellid_annotation) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  head()
cellid_annotation n
NK cells 449
unassigned 89
Gamma delta T cells 37
T cells 5
T regulatory cells 5
Macrophages 1

To answer this question, there are several possibilities:

  • count the labels occurence and choose the majoritary label(here NK cells would win).

  • include a biologist in the process and use biological knowledge: which cell type is expected in this dataset ? Why ? Can very specific markers help to identify the cells ?

  • use other references and annotations tools and compare the results.

Note

Cell level annotations might look noisy. For smoother results, some tools can run at the cluster level. Or you can manually annotate clusters with the main label returned by your annotation tool. However, make sure that your clustering level is the right one and does not merge several populations.

2.2 Using a reference dataset

Some tools are able to use already annotated datasets to annotate new ones, on basis of the transcriptional profiles of the cells. This process is sometimes called annotation transfer.

illustration_cellid.png

Their use is facilitated by the increasing number of published datas and the existence of atlases integrating several datasets. If a dataset suitable for your own data exist, it can be used as a reference for automatic annotation

In this example, cells will be annotated using a dataset from the package celldex using the tool SingleR.

The database

celldex provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq.

library(celldex)
library(SingleR)

immgen_ref <- celldex::ImmGenData()

Typically, each reference provides three levels of cell type annotation in its column metadata:

  • label.main, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.
  • label.fine, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.
  • label.ont, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.

Here the dataset ImmGennData will be used.

The method

Here the package SinglR will be used.

SingleR usually works with log normalized data. Data were already normalized with SCT but, for this application, the previous normalization will be applied to all gene couts.

DefaultAssay(sobj) <- "RNA"

sobj <- sobj %>% 
  FindVariableFeatures() %>% 
  NormalizeData() %>% 
  ScaleData()
Finding variable features for layer counts
Normalizing layer: counts
Centering and scaling data matrix
## Fro Seurat v5, join layers for integrated objects
sobj[["RNA"]] <- JoinLayers(sobj[["RNA"]])

SingleR does not take Seurat objects as input but either matrices or SummarizedExperiment objects. The reference is already in this SummarizedExperiment format. For the input data, the expression matrix needs to be extracted or converted to a SingleCellExperiment object, a format inherited from SummarizedExperiment, with the command line.

sce <- as.SingleCellExperiment(sobj)

Then, labels can be computed just by running the SingleR() function.

df_labels <- SingleR(test = GetAssayData(sobj, assay = "RNA", layer = "data"),
                       ref = immgen_ref,
                       labels = immgen_ref$label.main,
                       assay.type.test = "logcounts")

head(df_labels) %>% 
  as.data.frame()
scores.B.cells scores.B.cells..pro scores.Basophils scores.DC scores.Endothelial.cells scores.Eosinophils scores.Epithelial.cells scores.Fibroblasts scores.ILC scores.Macrophages scores.Mast.cells scores.Microglia scores.Monocytes scores.Neutrophils scores.NK.cells scores.NKT scores.Stem.cells scores.Stromal.cells scores.T.cells scores.Tgd labels delta.next pruned.labels
AAACCCAAGAAGCGGG-1 0.2033111 0.2068373 0.2066186 0.1981340 0.0906038 0.1722631 0.1196504 0.0488273 0.3444374 0.1696821 0.1621333 0.1262688 0.1765475 0.1260725 0.3210943 0.3437948 0.2019263 0.0446823 0.3237717 0.3400036 ILC 0.0353467 ILC
AAACCCAAGAGGGTCT-1 0.3367376 0.2181210 0.1524551 0.2240873 0.1277862 0.1491067 0.1243731 0.0735197 0.1990659 0.2004364 0.1408415 0.1508748 0.2016722 0.1301909 0.2132847 0.1955743 0.1951414 0.0852191 0.2068749 0.2136054 B cells 0.1126503 B cells
AAACCCACAAGAGTTA-1 0.2523571 0.2417808 0.2293900 0.2229132 0.1084866 0.1861911 0.1190182 0.0664974 0.3869385 0.1704448 0.1863966 0.1304647 0.1901410 0.1264326 0.3789434 0.3744221 0.2412139 0.0600474 0.3598439 0.3758250 ILC 0.0613775 ILC
AAACGAACAGTTGTTG-1 0.0993034 0.0758659 0.1156172 0.1171548 0.1189611 0.1166718 0.1582854 0.1238863 0.1311276 0.1325085 0.1246085 0.0674073 0.1285736 0.1206987 0.0996414 0.0913978 0.1249268 0.1151023 0.1103719 0.1111212 Epithelial cells 0.0224571 NA
AAACGAAGTTCTCCAC-1 0.2279274 0.2064368 0.2157174 0.2136378 0.1122832 0.1870847 0.1228304 0.0595659 0.3580839 0.1684061 0.1765575 0.1092343 0.1808137 0.1220009 0.3350064 0.3577826 0.2201776 0.0652464 0.3417077 0.3606864 ILC 0.0202611 ILC
AAACGAATCTCGACCT-1 0.2598720 0.2512336 0.2418633 0.2487234 0.1312708 0.2096719 0.1372808 0.0766834 0.3830688 0.2023382 0.2026497 0.1520056 0.2184256 0.1402733 0.3747529 0.3568818 0.2467829 0.0811662 0.3420645 0.3653939 ILC 0.0403528 ILC
invisible(gc())

The results

SingleR() returns a table with scores and prediction results for each cell. Labels are shown before (labels) and after a pruning step (pruned.labels), which removes potentially poor-quality or ambiguous assignments.

These results can be added to the Seurat object and then visualized.

df <- df_labels %>% 
  as.data.frame() %>% 
  mutate(singler_annotation = pruned.labels)

sobj <- AddMetaData(sobj, df["singler_annotation"])

rm(df)
Question

Plot the SingleR labels on a UMAP. What observations can you make about the SingleR labels ?

Here again, there appears to be many cells types. The same kind of curation than with the CelliD example can be applied.

SingleR also comes with it own diagnosis tools. The function plotScoreHeatmap() displays the scores for all cells across all reference labels, which allows to inspect the confidence of the predicted labels.

plotScoreHeatmap(df_labels)

Ideally, each cell should have one score that is significantly higher than the rest, indicating a clear assignment to a single label. However, when multiple similar scores are present for a given cell, it suggests uncertainty in the assignment.

In this case, we observe that the labels for T cells, ILCs, and NK cells are closely related, which is where domain expertise and highly specific marker sets become crucial. Additionally, the labels for Monocytes and Macrophages could potentially be merged due to their similarity.

2.3 Summary

Advantages Limits
Works at the single cell level Results heavily rely on the quality and completeness of the reference (be careful of missing labels)
Many resources for reference Possible to mislabel populations
Flexible Results can be noisy
saveRDS(sobj, file="seurat_object_annotated.rds")

References and documentation

CellID: https://www.bioconductor.org/packages/release/bioc/html/CelliD.html

SingleR: https://www.bioconductor.org/packages/release/bioc/html/SingleR.html

Many other tools. A review summarizing many annotation methods: A Review of Single-Cell RNA-Seq Annotation, Integration, and Cell–Cell Communication

Human Cell Atlas: https://www.humancellatlas.org/

Tabula muris: https://tabula-muris-senis.ds.czbiohub.org/

GeneCard : https://www.genecards.org

Session info

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /shared/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SingleR_2.6.0               celldex_1.14.0             
 [3] CelliD_1.12.0               SingleCellExperiment_1.26.0
 [5] SummarizedExperiment_1.34.0 Biobase_2.64.0             
 [7] GenomicRanges_1.56.1        GenomeInfoDb_1.40.1        
 [9] IRanges_2.38.1              MatrixGenerics_1.16.0      
[11] matrixStats_1.4.1           patchwork_1.3.0            
[13] Seurat_5.1.0                SeuratObject_5.0.2         
[15] sp_2.1-4                    S4Vectors_0.42.1           
[17] BiocGenerics_0.50.0         lubridate_1.9.3            
[19] forcats_1.0.0               stringr_1.5.1              
[21] dplyr_1.1.4                 purrr_1.0.2                
[23] readr_2.1.5                 tidyr_1.3.1                
[25] tibble_3.2.1                ggplot2_3.5.1              
[27] tidyverse_2.0.0            

loaded via a namespace (and not attached):
  [1] spatstat.sparse_3.1-0     httr_1.4.7               
  [3] RColorBrewer_1.1-3        tools_4.4.1              
  [5] sctransform_0.4.1         alabaster.base_1.4.1     
  [7] utf8_1.2.4                R6_2.5.1                 
  [9] HDF5Array_1.32.0          lazyeval_0.2.2           
 [11] uwot_0.2.2                rhdf5filters_1.16.0      
 [13] withr_3.0.1               gridExtra_2.3            
 [15] tictoc_1.2.1              progressr_0.14.0         
 [17] cli_3.6.3                 spatstat.explore_3.3-2   
 [19] fastDummies_1.7.4         alabaster.se_1.4.1       
 [21] labeling_0.4.3            spatstat.data_3.1-2      
 [23] ggridges_0.5.6            pbapply_1.7-2            
 [25] askpass_1.2.0             scater_1.32.1            
 [27] parallelly_1.38.0         rstudioapi_0.16.0        
 [29] RSQLite_2.3.7             generics_0.1.3           
 [31] ica_1.0-3                 spatstat.random_3.3-2    
 [33] vroom_1.6.5               Matrix_1.7-0             
 [35] ggbeeswarm_0.7.2          fansi_1.0.6              
 [37] abind_1.4-8               lifecycle_1.0.4          
 [39] yaml_2.3.10               rhdf5_2.48.0             
 [41] SparseArray_1.4.8         BiocFileCache_2.12.0     
 [43] Rtsne_0.17                grid_4.4.1               
 [45] blob_1.2.4                promises_1.3.0           
 [47] ExperimentHub_2.12.0      crayon_1.5.3             
 [49] miniUI_0.1.1.1            lattice_0.22-6           
 [51] beachmat_2.20.0           cowplot_1.1.3            
 [53] KEGGREST_1.44.1           pillar_1.9.0             
 [55] knitr_1.48                fgsea_1.30.0             
 [57] future.apply_1.11.2       codetools_0.2-20         
 [59] fastmatch_1.1-4           leiden_0.4.3.1           
 [61] glue_1.8.0                RcppArmadillo_14.0.2-1   
 [63] spatstat.univar_3.0-1     data.table_1.16.0        
 [65] gypsum_1.0.1              vctrs_0.6.5              
 [67] png_0.1-8                 spam_2.10-0              
 [69] gtable_0.3.5              cachem_1.1.0             
 [71] xfun_0.48                 S4Arrays_1.4.1           
 [73] mime_0.12                 survival_3.7-0           
 [75] pheatmap_1.0.12           fitdistrplus_1.2-1       
 [77] ROCR_1.0-11               nlme_3.1-165             
 [79] bit64_4.0.5               alabaster.ranges_1.4.1   
 [81] filelock_1.0.3            RcppAnnoy_0.0.22         
 [83] irlba_2.3.5.1             vipor_0.4.7              
 [85] KernSmooth_2.23-24        colorspace_2.1-1         
 [87] DBI_1.2.3                 ggrastr_1.0.2            
 [89] tidyselect_1.2.1          bit_4.0.5                
 [91] compiler_4.4.1            curl_5.2.2               
 [93] httr2_1.0.1               BiocNeighbors_1.22.0     
 [95] DelayedArray_0.30.1       plotly_4.10.4            
 [97] scales_1.3.0              lmtest_0.9-40            
 [99] rappdirs_0.3.3            digest_0.6.37            
[101] goftest_1.2-3             spatstat.utils_3.1-0     
[103] alabaster.matrix_1.4.1    rmarkdown_2.28           
[105] XVector_0.44.0            htmltools_0.5.8.1        
[107] pkgconfig_2.0.3           umap_0.2.10.0            
[109] sparseMatrixStats_1.16.0  dbplyr_2.5.0             
[111] fastmap_1.2.0             rlang_1.1.4              
[113] htmlwidgets_1.6.4         UCSC.utils_1.0.0         
[115] shiny_1.9.1               DelayedMatrixStats_1.26.0
[117] farver_2.1.2              zoo_1.8-12               
[119] jsonlite_1.8.9            BiocParallel_1.38.0      
[121] BiocSingular_1.20.0       magrittr_2.0.3           
[123] scuttle_1.14.0            GenomeInfoDbData_1.2.12  
[125] dotCall64_1.1-1           Rhdf5lib_1.26.0          
[127] munsell_0.5.1             Rcpp_1.0.13              
[129] viridis_0.6.5             reticulate_1.39.0        
[131] stringi_1.8.4             alabaster.schemas_1.4.0  
[133] zlibbioc_1.50.0           MASS_7.3-61              
[135] AnnotationHub_3.12.0      plyr_1.8.9               
[137] parallel_4.4.1            listenv_0.9.1            
[139] ggrepel_0.9.6             deldir_2.0-4             
[141] Biostrings_2.72.1         splines_4.4.1            
[143] tensor_1.5                hms_1.1.3                
[145] igraph_2.0.3              spatstat.geom_3.3-3      
[147] RcppHNSW_0.6.0            reshape2_1.4.4           
[149] ScaledMatrix_1.12.0       BiocVersion_3.19.1       
[151] evaluate_1.0.0            BiocManager_1.30.25      
[153] tzdb_0.4.0                httpuv_1.6.15            
[155] RANN_2.6.2                openssl_2.2.1            
[157] polyclip_1.10-7           future_1.34.0            
[159] scattermore_1.2           rsvd_1.0.5               
[161] xtable_1.8-4              RSpectra_0.16-2          
[163] later_1.3.2               viridisLite_0.4.2        
[165] memoise_2.0.1             beeswarm_0.4.0           
[167] AnnotationDbi_1.66.0      cluster_2.1.6            
[169] timechange_0.3.0          globals_0.16.3