sincellTE 2024 - TP Secondary Analysis: cell annotation
Objectives
Cell annotation is one of the first steps in exploring the biological information captured in your dataset. It allows you to assign a type, and sometimes a subtype or even a cell state (e.g., proliferating, apoptotic), to individual cells.
The objective of this session is to gain hands-on experience with various methods for cell annotation. Precise and reliable cell annotation takes time. Here, we will focus on understanding and critically evaluating these different methods.
Methods for Cell Annotation
Cell annotation methods are based on a common principle: they use genes or gene sets specific to certain cell types to assign an identity to the cells.
However, these methods differ in how they accomplish this. They can be classified into two main approaches:
Manual Annotation: This method involves using marker gene sets from the literature, which are manually curated. In your dataset, clusters with high expression of these markers are annotated with the corresponding cell type.
Automatic Annotation: This approach utilizes marker sets from databases, or published single-cell reference atlases or datasets. Tools then automatically annotate the cells based on this information.
For this session, we will use the dataset prepared in the previous practical sessions.
suppressWarnings({
suppressMessages({
library(tidyverse)
library(S4Vectors)
library(Seurat)
library(ggplot2)
library(patchwork)
})
})
set.seed(123)
<- Sys.getenv("USER")
user.id <-paste("/shared/home2", user.id, "cell_annotation/data/SC26_FMTG_FILTERED_SCT_CLUSTERS.rda", sep = "/")
input_file <- paste0("/shared/home", user.id, "cell_annotation/results/", sep = "/")
output_dir load(input_file)
1. Manual Annotation
One way to annotate cell clusters is to examine the genes that are highly expressed in one cluster compared to all other cells, and then infer the corresponding cell type. These markers can be obtained from:
- Knowledge of the field (e.g., CD3A for T cells, CD8A for CD8+ T cells).
- Relevant literature related to your project.
- Differentially expressed genes (DEGs): Performing differential expression analysis on different clusters can reveal specific markers that help identify each cluster. However, not everyone agrees that DEG analysis should be used this way.
1.1 Using One Marker per Cell Type
Here, we have a list of markers from the dataset’s associated publication, allowing us to annotate 9 cell populations:
Cell type | Marker |
---|---|
Proximal enterocytes | Apoa4 |
Distal enterocytes | Enpep |
Goblet cells | Muc2 |
Tuft cells | Alox5ap |
Stem cells | Lgr5 |
Naive CD4+ T cells | Ccr7 |
Memory CD4+ T cells | S100a4 |
CD8+ T cells | Cd8a |
B cells | Ms4a1 |
In previous practical sessions, cell clustering was performed. Clustering at a resolution of 0.1 yielded 9 clusters, matching the number of cell types we are aiming to identify. This clustering will be used for the first manual annotation step.
Violin plots and feature plots are useful for visualizing marker expression across the different clusters.
## define the markers
<- c('Muc2', 'Enpep', 'Apoa4', "Alox5ap",
feats 'Ccr7', 'S100a4', 'Cd8a', 'Lgr5', 'Ms4a1')
## set the cells default identity to clustering
Idents(sobj) <- "SCT_snn_res.0.1"
DimPlot(sobj)
FeaturePlot(sobj, features=feats)
VlnPlot(sobj, feats, layer = "data")
After visualizing these plots, we can propose an initial annotation:
Cluster | Marker | Cell type |
---|---|---|
0 | Cd8a | CD8+ T cells |
1 | Ms4a1 | B cells |
2 | Cd8a | CD8+ T cells |
3 | Enpep + strong expression of Apo4 | mix of proximal and distal enterocytes |
4 | Ccr7 + Muc2 | Naive CD4 + T cells and Goblet cells |
5 | S100a4 | Naive CD4 + T cells |
6 | Ms4a1 | B cells |
7 | Alox5ap + Lgr5 | Tuft cells + stem cells |
Using one marker per cell type allowed us to form initial hypotheses about the cell types in each cluster, but also revealed some limitations:
- Some markers are not very specific: e.g., Apoa4 is expressed in a large proportion of clusters, though its expression is highest in cluster 2.
"Apoa4+"]] <- as.numeric((sobj@assays$SCT@data['Apoa4',] > 0))
sobj[[
@meta.data %>%
sobj## group by cluster
group_by(SCT_snn_res.0.2) %>%
## compute the percentage of cells expressing the gene
summarize(percentage = round(mean(`Apoa4+` == 1) * 100))
SCT_snn_res.0.2 | percentage |
---|---|
0 | 92 |
1 | 92 |
2 | 92 |
3 | 99 |
4 | 92 |
5 | 71 |
6 | 61 |
7 | 46 |
- Some markers are specific but not expressed in all cells within a cluster: e.g., Lgr5 is detected almost exclusively in cluster 6, but only in 40% of the cells.
"Lgr5+"]] <- as.numeric((sobj@assays$SCT@data['Lgr5',] > 0))
sobj[[
@meta.data %>%
sobj## group by cluster
group_by(SCT_snn_res.0.1) %>%
## compute the percentage of cells expressing the gene
summarise(percentage = round(mean(`Lgr5+` == 1) * 100))
SCT_snn_res.0.1 | percentage |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 2 |
5 | 0 |
6 | 33 |
1.2 Using a Marker List
Using only one marker per cell type can be risky. Instead, using a list of markers can help mitigate biases due to unexpressed genes or unspecific markers. To improve cluster annotation, we will now use a list of markers with several genes for each cell type.
<- list(
marker_genes CD8 = c('Cd8a', 'Gzmb', 'Gzma', 'Cd7') ,
CD4_naive = c('Ccr7', 'Ms4a4b', 'Cd27', 'Cd28'),
CD4_memory = c('S100a4', 'Cd44', 'Ccl9'),
B_cells = c('Ms4a1', 'Cd79a', 'Cd74'),
Enterocyte_distal = c('Enpep', 'Anpep', 'Ace2'),
Enterocyte_proximal = c('Apoa4', 'Dstn', 'Fabp2'),
Goblet_cells = c('Muc2', 'Agr2', 'Spink4', 'Tff3'),
Stem_cells = c('Lgr5', 'Hck', 'Nrgn'),
Tuft_cells = c('Alox5ap','l1b', 'S100a9'))
## keep only genes also found in the count matrix
<- lapply(marker_genes, function(vec) {
marker_genes intersect(vec, rownames(sobj))
})
A dot plot is useful for assessing the expression of multiple genes simultaneously in populations of cells.
DotPlot(sobj, features = marker_genes) +
theme(axis.text.x=element_text(angle = 45, vjust=0.5, hjust = 1),
strip.text = element_text(angle = 45, hjust = 0.5))
Warning: The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
ℹ Please use the `rows` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
Please report the issue at <https://github.com/satijalab/seurat/issues>.
The AddModuleScore()
function calculates a score for each cell based on the expression of a list of genes. This score, automatically added to the metadata, can then be plotted on the UMAP.
<- AddModuleScore(sobj, features = marker_genes,
sobj pool = NULL,
nbin = 5,
seed = 1,
ctrl = length(marker_genes),
k = FALSE,
name = "program")
## score for each list of gene is named program1, program2...
## rename the columns in metadata to get more interpretable names
<- paste0("program", seq(1:length(marker_genes)))
vec_rename names(vec_rename) <- names(marker_genes)
@meta.data <- sobj@meta.data %>%
sobj::rename(all_of(vec_rename))
dplyr
head(sobj@meta.data)
orig.ident | nCount_RNA | nFeature_RNA | min_features | min_counts | percent_mt | percent_rb | Seurat.SmG2M.Score | Seurat.S.Score | Seurat.G2M.Score | Seurat.Phase | Cyclone.SmG2M.Score | Cyclone.Phase | Cyclone.G1.Score | Cyclone.S.Score | Cyclone.G2M.Score | scDblFinder.class | scds.score | scds.class | doublets_consensus.class | nCount_SCT | nFeature_SCT | SCT_snn_res.0.1 | SCT_snn_res.0.2 | SCT_snn_res.0.5 | SCT_snn_res.1 | seurat_clusters | Apoa4+ | Lgr5+ | CD8 | CD4_naive | CD4_memory | B_cells | Enterocyte_distal | Enterocyte_proximal | Goblet_cells | Stem_cells | Tuft_cells | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAACCCAAGAAGCGGG-1 | SC26_FMTG | 3562 | 1511 | TRUE | TRUE | 13.50365 | 17.574396 | 0.0355307 | 0.0231535 | -0.0123772 | S | -0.0274207 | G1 | 0.7463582 | 0.1131105 | 0.1405313 | FALSE | 0.2900607 | FALSE | FALSE | 4355 | 1506 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 2.1977709 | -0.0559695 | -0.0663615 | -0.3977087 | -0.1404695 | 0.2184639 | 0.0079912 | -0.0256721 | 0.0000000 |
AAACCCAAGAGGGTCT-1 | SC26_FMTG | 3541 | 1254 | TRUE | TRUE | 17.36798 | 26.715617 | -0.0580402 | -0.0438693 | 0.0141709 | G2M | 0.0309859 | G1 | 0.9389671 | 0.0460094 | 0.0150235 | FALSE | 0.1230478 | FALSE | FALSE | 4411 | 1246 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | -0.0559695 | -0.3929514 | -0.0513442 | 2.2868583 | -0.3573649 | 0.3792493 | 0.0850075 | -0.0513442 | 0.0000000 |
AAACCCACAAGAGTTA-1 | SC26_FMTG | 5825 | 2004 | TRUE | TRUE | 13.80258 | 18.008584 | -0.0792198 | -0.0640550 | 0.0151648 | G2M | 0.2658875 | G1 | 0.7297297 | 0.2680789 | 0.0021914 | FALSE | 0.1829081 | FALSE | FALSE | 5213 | 2000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2.7722634 | -0.2237544 | -0.1283606 | -0.4889099 | -0.4588492 | -0.0276023 | 0.0737446 | -0.0256721 | -0.1155245 |
AAACGAACAGTTGTTG-1 | SC26_FMTG | 2260 | 1030 | TRUE | TRUE | 13.98230 | 3.672566 | -0.0230277 | -0.0625898 | -0.0395621 | G1 | 0.2186520 | G1 | 0.7735110 | 0.2225705 | 0.0039185 | FALSE | 0.3076467 | FALSE | FALSE | 4308 | 1071 | 3 | 3 | 3 | 5 | 5 | 1 | 0 | -0.1214092 | -0.2020111 | 0.0000000 | 0.0576368 | -0.0292990 | 2.4933740 | 0.1525850 | -0.0513442 | -0.0610340 |
AAACGAAGTTCTCCAC-1 | SC26_FMTG | 5696 | 1961 | TRUE | TRUE | 16.92416 | 19.592697 | -0.0360164 | -0.0347169 | 0.0012994 | G2M | 0.4043750 | G1 | 0.5943750 | 0.4050000 | 0.0006250 | FALSE | 0.5117774 | FALSE | FALSE | 5176 | 1955 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 2.5625411 | -0.2808202 | -0.0770164 | -0.4597078 | -0.3807917 | 0.3803345 | 0.1671197 | -0.0406893 | -0.0995422 |
AAACGAATCTCGACCT-1 | SC26_FMTG | 7439 | 2261 | TRUE | TRUE | 10.04167 | 22.153515 | 0.0131680 | 0.0661095 | 0.0529415 | S | 0.2776978 | G1 | 0.7194245 | 0.2791367 | 0.0014388 | FALSE | 0.3469976 | FALSE | FALSE | 5370 | 2233 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2.9931588 | -0.3563888 | -0.0770164 | -0.5132974 | -0.1220680 | 0.4712829 | 0.0770164 | -0.1433778 | -0.0385082 |
DimPlot(sobj, group.by = "SCT_snn_res.0.2")
FeaturePlot(sobj, features = names(marker_genes))
With this new information, we can refine the initial annotation by assigning each cluster the cell type with the highest score.
Cluster | Cell type |
---|---|
0 | CD8+ T cells |
1 | B cells |
2 | CD8+ T cells |
3 | proximal and distal enterocytes with higher score for distal enterocytes. Will be called enterocytes. |
4 | naive CD4+ T cells |
5 | Memory CD4+ T cells + Goblet cells |
6 | B cells |
7 | Stem cells and Tuft cells |
Based on these observations, the clusters can now be manually annotated.
"seurat_clusters"]] <- sobj[["SCT_snn_res.0.2"]]
sobj[[
@meta.data <- sobj@meta.data %>%
sobjmutate(
manual_annotation = case_when(
== "0" ~ "CD8+ T cells",
seurat_clusters == "1" ~ "B cells",
seurat_clusters == "2" ~ "CD8+ T cells",
seurat_clusters == "3" ~ "Enterocytes",
seurat_clusters == "4" ~ "Naive CD4+ T cells",
seurat_clusters == "5" ~ "Memory CD4+ T cells + Goblet cells",
seurat_clusters == "6" ~ "B cells",
seurat_clusters == "7" ~ "Stem cells + Tuft cells"))
seurat_clusters
Idents(sobj) <-"manual_annotation"
DimPlot(sobj,
pt.size = 0.1,
label = FALSE ,
repel = TRUE ,
label.size = 3)
1.3 Summary
Advantages | Limits |
---|---|
Easy to implement | Working at the cluster resolution: risk of merging or splitting actual populations |
Sometimes the only solution (ex : novel tissue) | Change clustering ? Change annotation |
Flexible | Time-consuming, requires knowledge |
1.4 Beyond cluster resolution
Since each cell has a score for the different gene sets, annotation could also be performed at the cell level. Try to think of a few commands to perform this annotation.
2. Automatic annotation
For some tissues, the different cell types have already been largely described and databases exist with lists of referenced marker genes or datasets. This is particularly true for human and mouse.
Instead of manual annotation, these datasets can be used, along with published tools to call the cell types in your data. This process is called automatic annotation.
2.1 Using a list of marker genes.
In this example, we’ll annotate cells using a list of marker genes from the Panglao database, combined with the CelliD
package.
The database
Panglao is an online gene-centric database, meaning it focuses solely on genes rather than entire single-cell datasets.
library(CelliD)
<- read_tsv("https://panglaodb.se/markers/PanglaoDB_markers_27_Mar_2020.tsv.gz")
df_panglao
head(df_panglao)
species | official gene symbol | cell type | nicknames | ubiquitousness index | product description | gene type | canonical marker | germ layer | organ | sensitivity_human | sensitivity_mouse | specificity_human | specificity_mouse |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mm Hs | CTRB1 | Acinar cells | CTRB | 0.017 | chymotrypsinogen B1 | protein-coding gene | 1 | Endoderm | Pancreas | 1.000000 | 0.957143 | 0.0006289 | 0.0159201 |
Mm Hs | KLK1 | Acinar cells | Klk6 | 0.013 | kallikrein 1 | protein-coding gene | 1 | Endoderm | Pancreas | 0.833333 | 0.314286 | 0.0050314 | 0.0128263 |
Mm Hs | RBPJL | Acinar cells | RBP-L|SUHL|RBPSUHL | 0.001 | recombination signal binding protein for immunoglobulin kappa J region like | protein-coding gene | 1 | Endoderm | Pancreas | 0.000000 | 0.000000 | 0.0000000 | 0.0000000 |
Mm Hs | PTF1A | Acinar cells | PTF1-p48|bHLHa29 | 0.001 | pancreas associated transcription factor 1a | protein-coding gene | 1 | Endoderm | Pancreas | 0.000000 | 0.157143 | 0.0006289 | 0.0007734 |
Mm | TRY4 | Acinar cells | NA | 0.007 | trypsin 4 | protein coding gene | 1 | Endoderm | Pancreas | NA | NA | NA | NA |
Mm Hs | CELA3A | Acinar cells | ELA3|ELA3A | 0.001 | chymotrypsin like elastase family member 3A | protein-coding gene | 1 | Endoderm | Pancreas | 0.833333 | 0.128571 | 0.0000000 | 0.0000000 |
In Panglao, each row corresponds to a gene and the each column to one of its characteritics:
species
indicate if the gene is expressed in mice and/or humancell type
is the final label used to annotate our dataset.gene type
indicates the biotype of the gene: protein-coding, long non coding RNA… Can be used for markers selection.organ
selecting the right organs is good practice to avoid aberrant annotations.
The quality and specificity of the reference are important to unmeaningful annotations. Here the columns species
and organ
will be used to select the subset of markers useful for our annotation.
<- df_panglao %>%
df_markers ## restricting to human specific genes
filter(str_detect(species,"Hs"),
%in% c("Immune system", "GI tract")) %>%
organ mutate(`official gene symbol` = str_to_title(`official gene symbol`)) %>%
## converting dataframes into a list of vectors, which is the format needed as input for CelliD
group_by(`cell type`) %>%
summarise(geneset = list(`official gene symbol`))
The method
Here the package CelliD
will be used. It is a clustering-free method developped by Cortal et al.. It calculates a per-cell gene signature score, which enables cell identity recognition. This method can also be applied to explore functional pathway enrichment.
CelliD
can take Seurat
objects as input. The reference for annotation must be provided as a named list of genes.
<- setNames(df_markers$geneset, df_markers$`cell type`)
annot_gs <- annot_gs[sapply(annot_gs, length) >= 10]
annot_gs
c("B cells", "B cells memory")] annot_gs[
$`B cells`
[1] "Cd2" "Cd5" "Ms4a1" "Cr2" "Cd22" "Fcer2"
[7] "Cd40" "Cd69" "Cd70" "Cd79a" "Cd79b" "Cd80"
[13] "Cd86" "Tnfrsf9" "Sdc1" "Tnfsf4" "Tnfrsf13b" "Tnfrsf13c"
[19] "Pdcd1" "Ighd" "Ighm" "Rasgrp3" "Hla-Dra" "Ltb"
[25] "Hla-Dqa1" "Fli1" "Cd14" "Sema6d" "Lair1" "Ifit3"
[31] "Igll1" "Dntt" "Mme" "Spn" "Cd19" "Cd24"
[37] "Cd27" "B3gat1" "Cd72" "Mum1" "Pax5" "Jchain"
[43] "Mzb1" "Ly6d" "Fcmr" "Bank1" "Edem1" "Vpreb3"
[49] "Pou2af1" "Creld2" "Derl3" "Ralgps2" "Fchsd2" "Pold4"
[55] "Tnfrsf17" "Hvcn1" "Fcrla" "Edem2" "Blnk" "Txndc11"
[61] "Btla" "Smap2" "Fkbp11" "Sec61a1" "Spcs3" "Spib"
[67] "Eaf2" "Cxcr4" "Birc3" "Iglc2" "Iglc3" "Iglc1"
[73] "Il21r" "Igkc" "Vpreb1" "Lrmp" "Klhl6" "Slamf6"
[79] "Fam129c" "Bst1" "Msh5" "Dok3" "Bach2" "Pxk"
[85] "Ighg1" "Ighg3" "Ighg4" "Cd38" "Ptprc" "Ebf1"
[91] "Bcl11a" "Ccr7" "Cd55" "Cd74" "Cd52" "Tlr9"
[97] "Swap70" "Hmga1"
$`B cells memory`
[1] "Cd38" "Cd80" "Cd84" "Cd86" "Nt5e" "Pax5"
[7] "Ptprc" "Spib" "Adam28" "Aim2" "Alox5" "Bach2"
[13] "Bank1" "Blk" "Ccr6" "Cd180" "Cd19" "Cd1c"
[19] "Cd22" "Cd27" "Cd37" "Cd69" "Cd72" "Cd79a"
[25] "Cd79b" "Clca3p" "Cr2" "Cxcr5" "Dennd5b" "Fcgr2b"
[31] "Fcrl2" "Frk" "Gng7" "Gpr18" "Gusbp11" "Hhex"
[37] "Hla-Dob" "Ifna10" "Ighd" "Ighm" "Igkc" "Igll3p"
[43] "Il7" "Irf8" "Ltb" "Ly86" "Mbl2" "Ms4a1"
[49] "Nmbr" "Npipb15" "P2rx5" "Pnoc" "Ptprcap" "Ralgps2"
[55] "Rasgrp2" "Sik1" "Sit1" "Slc12a1" "Sp140" "Stap1"
[61] "Tmem156" "Tnfrsf13b" "Tnfrsf17" "Traf4" "Vpreb3" "Zbtb32"
CelliD
relies on a dimensionality reduction of the expression matrix in a low-dimension space common to cells AND genes with the method MCA. In such space, the closer a gene is to a cell, the more specific it is to this cell. Thus, a gene ranking is obtained for each cell based on their distance to the cell in the MCA space. For each cell, the closest genes can be regarded as a unique cell identity card and used to annotate it.
More precisely, a statistical test (hypergeometric test) associates to each cell, given its more specific genes, a probability to be of the different cell type. The cell type with the lowest p-value is selected.
## run MCA to project cells and genes in a common low-dimensionality space
<- RunMCA(sobj) sobj
2.18 sec elapsed
5.869 sec elapsed
0.592 sec elapsed
# Performing per-cell hypergeometric tests against the gene signature collection
<- RunCellHGT(sobj, pathways = annot_gs, dims = 1:50, n.features = 200)
HGT_gs
# For each cell, select the signature with the lowest corrected p-value (max -log10 corrected p-value)
<- rownames(HGT_gs)[apply(HGT_gs, 2, which.max)]
gs_prediction
# For each cell, evaluate if the lowest p-value is significant
<- ifelse(apply(HGT_gs, 2, max)>2, yes = gs_prediction, "unassigned")
gs_prediction_signif
# Save cell type predictions as metadata within the Seurat object
$cellid_annotation <- gs_prediction_signif sobj
The results
CelliD
results can be visualized on the UMAP.
<- DimPlot(sobj,
p1 group.by = "SCT_snn_res.0.1",
pt.size = 0.1,
label = TRUE ,
repel = TRUE ,
label.size = 3) + NoLegend()
<- DimPlot(sobj,
p2 group.by = "cellid_annotation",
pt.size = 0.1,
label = TRUE ,
repel = TRUE ,
label.size = 3)
/ p2 p1
There appears to be many cells types on a single plot. Visualizing them all in the same time can be difficult. Instead, it is also possible to facet the plot in order to visualize 1 cell type at a time.
<- lapply(unique(sobj$cellid_annotation), function(celltype) {
list_plots <- setNames(list(subset(sobj, cellid_annotation == celltype) %>%
chl colnames()),
celltype)
return(DimPlot(sobj, group.by = "cellid_annotation", cells.highlight = chl))
})
wrap_plots(list_plots, ncol = 4)
CelliD
returned many more populations than the 9 populations of the manual annotation. This is in line with the number and variety of cell types in the reference list of genes.
names(annot_gs)
[1] "B cells" "B cells memory"
[3] "B cells naive" "Basophils"
[5] "Crypt cells" "Dendritic cells"
[7] "Enteric glia cells" "Enteric neurons"
[9] "Enterochromaffin cells" "Enterocytes"
[11] "Enteroendocrine cells" "Eosinophils"
[13] "Foveolar cells" "Gamma delta T cells"
[15] "Gastric chief cells" "Goblet cells"
[17] "Macrophages" "Mast cells"
[19] "Megakaryocytes" "Microfold cells"
[21] "Monocytes" "Myeloid-derived suppressor cells"
[23] "NK cells" "Natural killer T cells"
[25] "Neutrophils" "Nuocytes"
[27] "Paneth cells" "Parietal cells"
[29] "Plasma cells" "Plasmacytoid dendritic cells"
[31] "Red pulp macrophages" "T cells"
[33] "T follicular helper cells" "T helper cells"
[35] "T memory cells" "T regulatory cells"
[37] "Tuft cells"
These results show why manual curation can still be needed after automatic annotation:
- Some annotations concern only a very small number of cells. It might be false positives, artifacts or actual cell populations expected to be small. Biological knowledge is needed to decide whether the cells should be kept or renamed. Be careful with interpretation.
@meta.data %>%
sobjgroup_by(cellid_annotation) %>%
tally() %>%
arrange(n) %>%
head()
cellid_annotation | n |
---|---|
Natural killer T cells | 1 |
B cells memory | 3 |
Dendritic cells | 3 |
Plasma cells | 4 |
Crypt cells | 5 |
T regulatory cells | 5 |
Some annotated cell populations are closely related, which can lead to competition between cell types for accurate annotation. For example the B cells populations might probably all be merged into 1
B cell
label. For cluster 0, the situation is more complicated:- does it contain NK cells or T cells ?
- if it contains T cells, which ones ?
# Example cluster 0
@meta.data %>%
sobjfilter(SCT_snn_res.0.2 == 0) %>%
group_by(cellid_annotation) %>%
tally() %>%
arrange(desc(n)) %>%
head()
cellid_annotation | n |
---|---|
NK cells | 449 |
unassigned | 89 |
Gamma delta T cells | 37 |
T cells | 5 |
T regulatory cells | 5 |
Macrophages | 1 |
To answer this question, there are several possibilities:
count the labels occurence and choose the majoritary label(here NK cells would win).
include a biologist in the process and use biological knowledge: which cell type is expected in this dataset ? Why ? Can very specific markers help to identify the cells ?
use other references and annotations tools and compare the results.
2.2 Using a reference dataset
Some tools are able to use already annotated datasets to annotate new ones, on basis of the transcriptional profiles of the cells. This process is sometimes called annotation transfer.
Their use is facilitated by the increasing number of published datas and the existence of atlases integrating several datasets. If a dataset suitable for your own data exist, it can be used as a reference for automatic annotation
In this example, cells will be annotated using a dataset from the package celldex
using the tool SingleR
.
The database
celldex
provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq.
library(celldex)
library(SingleR)
<- celldex::ImmGenData() immgen_ref
Typically, each reference provides three levels of cell type annotation in its column metadata:
label.main
, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.label.fine
, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.label.ont
, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.
Here the dataset ImmGennData
will be used.
The method
Here the package SinglR
will be used.
SingleR
usually works with log normalized data. Data were already normalized with SCT but, for this application, the previous normalization will be applied to all gene couts.
DefaultAssay(sobj) <- "RNA"
<- sobj %>%
sobj FindVariableFeatures() %>%
NormalizeData() %>%
ScaleData()
Finding variable features for layer counts
Normalizing layer: counts
Centering and scaling data matrix
## Fro Seurat v5, join layers for integrated objects
"RNA"]] <- JoinLayers(sobj[["RNA"]]) sobj[[
SingleR
does not take Seurat objects as input but either matrices or SummarizedExperiment
objects. The reference is already in this SummarizedExperiment
format. For the input data, the expression matrix needs to be extracted or converted to a SingleCellExperiment
object, a format inherited from SummarizedExperiment
, with the command line.
<- as.SingleCellExperiment(sobj) sce
Then, labels can be computed just by running the SingleR()
function.
<- SingleR(test = GetAssayData(sobj, assay = "RNA", layer = "data"),
df_labels ref = immgen_ref,
labels = immgen_ref$label.main,
assay.type.test = "logcounts")
head(df_labels) %>%
as.data.frame()
scores.B.cells | scores.B.cells..pro | scores.Basophils | scores.DC | scores.Endothelial.cells | scores.Eosinophils | scores.Epithelial.cells | scores.Fibroblasts | scores.ILC | scores.Macrophages | scores.Mast.cells | scores.Microglia | scores.Monocytes | scores.Neutrophils | scores.NK.cells | scores.NKT | scores.Stem.cells | scores.Stromal.cells | scores.T.cells | scores.Tgd | labels | delta.next | pruned.labels | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAACCCAAGAAGCGGG-1 | 0.2033111 | 0.2068373 | 0.2066186 | 0.1981340 | 0.0906038 | 0.1722631 | 0.1196504 | 0.0488273 | 0.3444374 | 0.1696821 | 0.1621333 | 0.1262688 | 0.1765475 | 0.1260725 | 0.3210943 | 0.3437948 | 0.2019263 | 0.0446823 | 0.3237717 | 0.3400036 | ILC | 0.0353467 | ILC |
AAACCCAAGAGGGTCT-1 | 0.3367376 | 0.2181210 | 0.1524551 | 0.2240873 | 0.1277862 | 0.1491067 | 0.1243731 | 0.0735197 | 0.1990659 | 0.2004364 | 0.1408415 | 0.1508748 | 0.2016722 | 0.1301909 | 0.2132847 | 0.1955743 | 0.1951414 | 0.0852191 | 0.2068749 | 0.2136054 | B cells | 0.1126503 | B cells |
AAACCCACAAGAGTTA-1 | 0.2523571 | 0.2417808 | 0.2293900 | 0.2229132 | 0.1084866 | 0.1861911 | 0.1190182 | 0.0664974 | 0.3869385 | 0.1704448 | 0.1863966 | 0.1304647 | 0.1901410 | 0.1264326 | 0.3789434 | 0.3744221 | 0.2412139 | 0.0600474 | 0.3598439 | 0.3758250 | ILC | 0.0613775 | ILC |
AAACGAACAGTTGTTG-1 | 0.0993034 | 0.0758659 | 0.1156172 | 0.1171548 | 0.1189611 | 0.1166718 | 0.1582854 | 0.1238863 | 0.1311276 | 0.1325085 | 0.1246085 | 0.0674073 | 0.1285736 | 0.1206987 | 0.0996414 | 0.0913978 | 0.1249268 | 0.1151023 | 0.1103719 | 0.1111212 | Epithelial cells | 0.0224571 | NA |
AAACGAAGTTCTCCAC-1 | 0.2279274 | 0.2064368 | 0.2157174 | 0.2136378 | 0.1122832 | 0.1870847 | 0.1228304 | 0.0595659 | 0.3580839 | 0.1684061 | 0.1765575 | 0.1092343 | 0.1808137 | 0.1220009 | 0.3350064 | 0.3577826 | 0.2201776 | 0.0652464 | 0.3417077 | 0.3606864 | ILC | 0.0202611 | ILC |
AAACGAATCTCGACCT-1 | 0.2598720 | 0.2512336 | 0.2418633 | 0.2487234 | 0.1312708 | 0.2096719 | 0.1372808 | 0.0766834 | 0.3830688 | 0.2023382 | 0.2026497 | 0.1520056 | 0.2184256 | 0.1402733 | 0.3747529 | 0.3568818 | 0.2467829 | 0.0811662 | 0.3420645 | 0.3653939 | ILC | 0.0403528 | ILC |
invisible(gc())
The results
SingleR()
returns a table with scores and prediction results for each cell. Labels are shown before (labels
) and after a pruning step (pruned.labels
), which removes potentially poor-quality or ambiguous assignments.
These results can be added to the Seurat
object and then visualized.
<- df_labels %>%
df as.data.frame() %>%
mutate(singler_annotation = pruned.labels)
<- AddMetaData(sobj, df["singler_annotation"])
sobj
rm(df)
Here again, there appears to be many cells types. The same kind of curation than with the CelliD
example can be applied.
SingleR
also comes with it own diagnosis tools. The function plotScoreHeatmap()
displays the scores for all cells across all reference labels, which allows to inspect the confidence of the predicted labels.
plotScoreHeatmap(df_labels)
Ideally, each cell should have one score that is significantly higher than the rest, indicating a clear assignment to a single label. However, when multiple similar scores are present for a given cell, it suggests uncertainty in the assignment.
In this case, we observe that the labels for T cells, ILCs, and NK cells are closely related, which is where domain expertise and highly specific marker sets become crucial. Additionally, the labels for Monocytes and Macrophages could potentially be merged due to their similarity.
2.3 Summary
Advantages | Limits |
---|---|
Works at the single cell level | Results heavily rely on the quality and completeness of the reference (be careful of missing labels) |
Many resources for reference | Possible to mislabel populations |
Flexible | Results can be noisy |
saveRDS(sobj, file="seurat_object_annotated.rds")
References and documentation
CellID: https://www.bioconductor.org/packages/release/bioc/html/CelliD.html
SingleR: https://www.bioconductor.org/packages/release/bioc/html/SingleR.html
Many other tools. A review summarizing many annotation methods: A Review of Single-Cell RNA-Seq Annotation, Integration, and Cell–Cell Communication
Human Cell Atlas: https://www.humancellatlas.org/
Tabula muris: https://tabula-muris-senis.ds.czbiohub.org/
GeneCard : https://www.genecards.org
Session info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS/LAPACK: /shared/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] SingleR_2.6.0 celldex_1.14.0
[3] CelliD_1.12.0 SingleCellExperiment_1.26.0
[5] SummarizedExperiment_1.34.0 Biobase_2.64.0
[7] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
[9] IRanges_2.38.1 MatrixGenerics_1.16.0
[11] matrixStats_1.4.1 patchwork_1.3.0
[13] Seurat_5.1.0 SeuratObject_5.0.2
[15] sp_2.1-4 S4Vectors_0.42.1
[17] BiocGenerics_0.50.0 lubridate_1.9.3
[19] forcats_1.0.0 stringr_1.5.1
[21] dplyr_1.1.4 purrr_1.0.2
[23] readr_2.1.5 tidyr_1.3.1
[25] tibble_3.2.1 ggplot2_3.5.1
[27] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] spatstat.sparse_3.1-0 httr_1.4.7
[3] RColorBrewer_1.1-3 tools_4.4.1
[5] sctransform_0.4.1 alabaster.base_1.4.1
[7] utf8_1.2.4 R6_2.5.1
[9] HDF5Array_1.32.0 lazyeval_0.2.2
[11] uwot_0.2.2 rhdf5filters_1.16.0
[13] withr_3.0.1 gridExtra_2.3
[15] tictoc_1.2.1 progressr_0.14.0
[17] cli_3.6.3 spatstat.explore_3.3-2
[19] fastDummies_1.7.4 alabaster.se_1.4.1
[21] labeling_0.4.3 spatstat.data_3.1-2
[23] ggridges_0.5.6 pbapply_1.7-2
[25] askpass_1.2.0 scater_1.32.1
[27] parallelly_1.38.0 rstudioapi_0.16.0
[29] RSQLite_2.3.7 generics_0.1.3
[31] ica_1.0-3 spatstat.random_3.3-2
[33] vroom_1.6.5 Matrix_1.7-0
[35] ggbeeswarm_0.7.2 fansi_1.0.6
[37] abind_1.4-8 lifecycle_1.0.4
[39] yaml_2.3.10 rhdf5_2.48.0
[41] SparseArray_1.4.8 BiocFileCache_2.12.0
[43] Rtsne_0.17 grid_4.4.1
[45] blob_1.2.4 promises_1.3.0
[47] ExperimentHub_2.12.0 crayon_1.5.3
[49] miniUI_0.1.1.1 lattice_0.22-6
[51] beachmat_2.20.0 cowplot_1.1.3
[53] KEGGREST_1.44.1 pillar_1.9.0
[55] knitr_1.48 fgsea_1.30.0
[57] future.apply_1.11.2 codetools_0.2-20
[59] fastmatch_1.1-4 leiden_0.4.3.1
[61] glue_1.8.0 RcppArmadillo_14.0.2-1
[63] spatstat.univar_3.0-1 data.table_1.16.0
[65] gypsum_1.0.1 vctrs_0.6.5
[67] png_0.1-8 spam_2.10-0
[69] gtable_0.3.5 cachem_1.1.0
[71] xfun_0.48 S4Arrays_1.4.1
[73] mime_0.12 survival_3.7-0
[75] pheatmap_1.0.12 fitdistrplus_1.2-1
[77] ROCR_1.0-11 nlme_3.1-165
[79] bit64_4.0.5 alabaster.ranges_1.4.1
[81] filelock_1.0.3 RcppAnnoy_0.0.22
[83] irlba_2.3.5.1 vipor_0.4.7
[85] KernSmooth_2.23-24 colorspace_2.1-1
[87] DBI_1.2.3 ggrastr_1.0.2
[89] tidyselect_1.2.1 bit_4.0.5
[91] compiler_4.4.1 curl_5.2.2
[93] httr2_1.0.1 BiocNeighbors_1.22.0
[95] DelayedArray_0.30.1 plotly_4.10.4
[97] scales_1.3.0 lmtest_0.9-40
[99] rappdirs_0.3.3 digest_0.6.37
[101] goftest_1.2-3 spatstat.utils_3.1-0
[103] alabaster.matrix_1.4.1 rmarkdown_2.28
[105] XVector_0.44.0 htmltools_0.5.8.1
[107] pkgconfig_2.0.3 umap_0.2.10.0
[109] sparseMatrixStats_1.16.0 dbplyr_2.5.0
[111] fastmap_1.2.0 rlang_1.1.4
[113] htmlwidgets_1.6.4 UCSC.utils_1.0.0
[115] shiny_1.9.1 DelayedMatrixStats_1.26.0
[117] farver_2.1.2 zoo_1.8-12
[119] jsonlite_1.8.9 BiocParallel_1.38.0
[121] BiocSingular_1.20.0 magrittr_2.0.3
[123] scuttle_1.14.0 GenomeInfoDbData_1.2.12
[125] dotCall64_1.1-1 Rhdf5lib_1.26.0
[127] munsell_0.5.1 Rcpp_1.0.13
[129] viridis_0.6.5 reticulate_1.39.0
[131] stringi_1.8.4 alabaster.schemas_1.4.0
[133] zlibbioc_1.50.0 MASS_7.3-61
[135] AnnotationHub_3.12.0 plyr_1.8.9
[137] parallel_4.4.1 listenv_0.9.1
[139] ggrepel_0.9.6 deldir_2.0-4
[141] Biostrings_2.72.1 splines_4.4.1
[143] tensor_1.5 hms_1.1.3
[145] igraph_2.0.3 spatstat.geom_3.3-3
[147] RcppHNSW_0.6.0 reshape2_1.4.4
[149] ScaledMatrix_1.12.0 BiocVersion_3.19.1
[151] evaluate_1.0.0 BiocManager_1.30.25
[153] tzdb_0.4.0 httpuv_1.6.15
[155] RANN_2.6.2 openssl_2.2.1
[157] polyclip_1.10-7 future_1.34.0
[159] scattermore_1.2 rsvd_1.0.5
[161] xtable_1.8-4 RSpectra_0.16-2
[163] later_1.3.2 viridisLite_0.4.2
[165] memoise_2.0.1 beeswarm_0.4.0
[167] AnnotationDbi_1.66.0 cluster_2.1.6
[169] timechange_0.3.0 globals_0.16.3