For this session, we will be using the R packages Seurat and SingleR to explore to scRNASeq data and use various reference databases for automated cell type annotations. We will use a subset of the lung samples from the Mouse Cell Atlas as example.
The datasets are available in the folder: /shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation
We will use a subset of 2 lung samples from the Mouse Cell Atlas (MCA). I have used the standard Seurat pipeline to process the samples. The R code used is presented below, so that you reproduce the workflow during open analysis.
For this practical session, we will load the Seurat object mca from the the mca.rds file located here: /shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/mca.rds
library(Seurat)
## Registered S3 method overwritten by 'spatstat.geom':
## method from
## print.boxx cli
## Attaching SeuratObject
library(pheatmap)
library(dplyr)
#Change your working directory to the course directory
user.id <- Sys.getenv('USER')
setwd(paste0("/shared/projects/sincellte_2022/", user.id, "/Secondary_analyses"))
The new version of Seurat allows to store data from multiple modalities, such as RNAseq, spatial transcriptomics, ATAC-seq…For our session today, we will focus on RNAseq only.
The various assays are stored in the assays slot.
The data is stored in the MCA folder. We will first need to read in the count data for each sample, and format it for input into Seurat: - Read in the file. - Create a dataframe for the cell labels - Remove the cell labels from the count table - Transpose the data to have genes in row and cell barcodes in columns
lung1 = read.delim("/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/MCA/MCA_lung_S1_filter.txt",
stringsAsFactors = F, row.names = 1)
lung1.t = t(lung1)
head(lung1.t[,1:10])
## Lung_1.AAAACGAGATGGAGGACT Lung_1.AAAACGCCGACGTAGAGA
## labels "AT2 Cell" "AT2 Cell"
## X0610007P14Rik "0" "0"
## X0610009B22Rik "0" "0"
## X0610009L18Rik "0" "0"
## X0610009O20Rik "0" "0"
## X0610010F05Rik "0" "0"
## Lung_1.AAAACGCCTAGAGTGGTA Lung_1.AAAACGCCTTTCAACGCC
## labels "Ciliated cell" "AT2 Cell"
## X0610007P14Rik "0" "0"
## X0610009B22Rik "0" "0"
## X0610009L18Rik "0" "0"
## X0610009O20Rik "0" "0"
## X0610010F05Rik "0" "0"
## Lung_1.AAAACGCCTTTCTTTAGG Lung_1.AAAACGCGTATTGCCCTC
## labels "T Cell_Cd8b1 high" "B Cell"
## X0610007P14Rik "0" "0"
## X0610009B22Rik "0" "0"
## X0610009L18Rik "0" "0"
## X0610009O20Rik "0" "0"
## X0610010F05Rik "0" "0"
## Lung_1.AAAACGGGGCGAATTCCA Lung_1.AAAACGTAGTCGAACGCC
## labels "AT2 Cell" "Alveolar macrophage"
## X0610007P14Rik "0" "0"
## X0610009B22Rik "0" "0"
## X0610009L18Rik "0" "0"
## X0610009O20Rik "0" "0"
## X0610010F05Rik "0" "0"
## Lung_1.AAAACGTAGTCGGGCTGC Lung_1.AAAACGTCAAAGGAGGAG
## labels "Alveolar macrophage" "AT2 Cell"
## X0610007P14Rik "0" "0"
## X0610009B22Rik "0" "0"
## X0610009L18Rik "0" "0"
## X0610009O20Rik "0" "0"
## X0610010F05Rik "0" "0"
head(lung1[,1:5])
## labels X0610007P14Rik X0610009B22Rik
## Lung_1.AAAACGAGATGGAGGACT AT2 Cell 0 0
## Lung_1.AAAACGCCGACGTAGAGA AT2 Cell 0 0
## Lung_1.AAAACGCCTAGAGTGGTA Ciliated cell 0 0
## Lung_1.AAAACGCCTTTCAACGCC AT2 Cell 0 0
## Lung_1.AAAACGCCTTTCTTTAGG T Cell_Cd8b1 high 0 0
## Lung_1.AAAACGCGTATTGCCCTC B Cell 0 0
## X0610009L18Rik X0610009O20Rik
## Lung_1.AAAACGAGATGGAGGACT 0 0
## Lung_1.AAAACGCCGACGTAGAGA 0 0
## Lung_1.AAAACGCCTAGAGTGGTA 0 0
## Lung_1.AAAACGCCTTTCAACGCC 0 0
## Lung_1.AAAACGCCTTTCTTTAGG 0 0
## Lung_1.AAAACGCGTATTGCCCTC 0 0
labels.1 = lung1[,"labels"]
names(labels.1) = rownames(lung1)
table(labels.1)
## labels.1
## Alveolar macrophage AT1 Cell AT2 Cell B Cell
## 191 83 928 167
## Ciliated cell Clara Cell T Cell_Cd8b1 high
## 17 20 211
lung1.t = t(lung1[,-1])
Now, we can create a new Seurat object with our lung count data, and add the cell labels in the meta.Data slot.
Filtering step: keep cells with at least 100 genes, and genes expressed in at least 5 cells.
Visually inspect your data.
l1 = CreateSeuratObject(count=lung1.t, project="MCA_lung1", min.cells=5, min.features =100)
l1$sample <- "S1"
l1$cell.label <- labels.1
head(l1@meta.data)
## orig.ident nCount_RNA nFeature_RNA sample
## Lung_1.AAAACGAGATGGAGGACT Lung 1103 589 S1
## Lung_1.AAAACGCCGACGTAGAGA Lung 578 300 S1
## Lung_1.AAAACGCCTAGAGTGGTA Lung 2760 1343 S1
## Lung_1.AAAACGCCTTTCAACGCC Lung 1713 789 S1
## Lung_1.AAAACGCCTTTCTTTAGG Lung 771 470 S1
## Lung_1.AAAACGCGTATTGCCCTC Lung 449 309 S1
## cell.label
## Lung_1.AAAACGAGATGGAGGACT AT2 Cell
## Lung_1.AAAACGCCGACGTAGAGA AT2 Cell
## Lung_1.AAAACGCCTAGAGTGGTA Ciliated cell
## Lung_1.AAAACGCCTTTCAACGCC AT2 Cell
## Lung_1.AAAACGCCTTTCTTTAGG T Cell_Cd8b1 high
## Lung_1.AAAACGCGTATTGCCCTC B Cell
VlnPlot(object = l1, features = "nCount_RNA")
VlnPlot(object = l1, features = "nFeature_RNA")
median(l1$nCount_RNA)
## [1] 961
Repeat the same process for the second lung sample.
lung2 =
read.delim("/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/MCA/MCA_lung_S2_filter.txt",
stringsAsFactors = F, row.names = 1)
labels.2 = lung2[,"labels"]
names(labels.2) = rownames(lung2)
table(labels.2)
## labels.2
## Alveolar macrophage AT1 Cell AT2 Cell B Cell
## 124 40 365 130
## Ciliated cell Clara Cell T Cell_Cd8b1 high
## 41 40 70
lung2.t = t(lung2[,-1])
l2 = CreateSeuratObject(count=lung2.t, project="MCA_lung2", min.cells=5, min.features =100)
l2$sample <- "S2"
l2$cell.label <- labels.2
head(l2@meta.data)
## orig.ident nCount_RNA nFeature_RNA sample
## Lung_2.AAAACGAAAGTTCAAAGT Lung 638 474 S2
## Lung_2.AAAACGAAGCGGAAGTAC Lung 1016 571 S2
## Lung_2.AAAACGATTCCAGTTGCC Lung 1137 586 S2
## Lung_2.AAAACGCATCCCCCGCTA Lung 984 581 S2
## Lung_2.AAAACGGCGTCCACCTGA Lung 1211 695 S2
## Lung_2.AAAACGTGAAGCAAAACG Lung 1164 655 S2
## cell.label
## Lung_2.AAAACGAAAGTTCAAAGT Alveolar macrophage
## Lung_2.AAAACGAAGCGGAAGTAC AT2 Cell
## Lung_2.AAAACGATTCCAGTTGCC AT2 Cell
## Lung_2.AAAACGCATCCCCCGCTA AT2 Cell
## Lung_2.AAAACGGCGTCCACCTGA AT2 Cell
## Lung_2.AAAACGTGAAGCAAAACG AT2 Cell
VlnPlot(object = l2, features = "nCount_RNA")
VlnPlot(object = l2, features = "nFeature_RNA")
median(l2$nCount_RNA)
## [1] 884.5
Merge the 2 samples using the Merge function in Seurat, and visually inspect the data to look for sample bias.
mca = merge(l1,l2)
table(mca$sample)
##
## S1 S2
## 1617 810
VlnPlot(mca, "nCount_RNA", group.by="sample")
VlnPlot(mca, "nFeature_RNA", group.by="sample")
The median UMI count in this data is low. We will normalize the data to the median UMI counts. Look at the parameters to verify that the normalization was performed.
mca <- NormalizeData(mca,
scale.factor=median(mca$nCount_RNA))
mca@commands
## $NormalizeData.RNA
## Command: NormalizeData(mca, scale.factor = median(mca$nCount_RNA))
## Time: 2022-01-11 10:16:35
## assay : RNA
## normalization.method : LogNormalize
## scale.factor : 936
## margin : 1
## verbose : TRUE
Now, we will run the complete analysis using defaults parameters. The steps will be:
First, we select the most variable feature using the FindVariableFeatures. How many genes did we select?
What are the top 10 most variable genes? Can you highlight them on a Variance vs Average expression plot?
mca <- FindVariableFeatures(mca)
length(VariableFeatures(mca))
## [1] 2000
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(mca), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(mca)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
Center and scale the data, then apply a PCA. Look at the data in the PCA space: do you see any sample bias? Here, we could also regress out confounding variables such as cell cycle score if needed.
mca <- ScaleData(mca)
## Centering and scaling data matrix
mca <- RunPCA(mca)
## PC_ 1
## Positive: Chil3, Ctss, Ccl6, Ear2, Vim, Tyrobp, Ctsd, Laptm5, Fcer1g, Plet1
## Atp6v0d2, Lpl, Ear1, Cd68, Car4, Lgals3, Alox5ap, Fxyd5, Wfdc21, Mpeg1
## Clec4n, Sgk1, Psap, Tlr2, Cd9, Lilrb4a, Clec7a, Cybb, Tmsb10, Plin2
## Negative: Sftpa1, Sftpb, Sftpd, Cbr2, Cxcl15, Chil1, Wfdc2, Slc34a2, Npc2, Napsa
## Rnase4, Sfta2, Scd1, Lamp3, Chchd10, Atp1b1, Lpcat1, S100g, Ptprf, Hc
## Elovl1, Lcn2, Tceal9, Mid1ip1, Dram1, Ces1d, Selenbp1, Ager, Prnp, Lyz2
## PC_ 2
## Positive: Cxcl15, Napsa, Sftpb, Slc34a2, Lyz2, Sftpa1, Chil1, Scd1, Sftpd, Npc2
## Sfta2, Lamp3, S100g, Lcn2, Hc, Elovl1, Lpcat1, Cd74, Fabp5, Dram1
## H2.Aa, Cd36, Rnase4, Ctsc, Gm43085, Bex4, Apoe, Mid1ip1, H2.Ab1, Egfl6
## Negative: Tmem212, Ccdc153, Dynlrb2, Fam183b, Cfap126, X1110017D15Rik, Mlf1, Rsph1, Riiad1, Efcab10
## X1700007K13Rik, Cdhr4, Drc3, Nme9, X1700016K19Rik, Meig1, Foxj1, Lrrc23, Cyp2s1, Tuba1a
## Ccdc78, Sntn, Odf3b, Aldh1a1, Tppp3, Ccdc113, X1700088E04Rik, Tekt1, Cdhr3, Tm4sf1
## PC_ 3
## Positive: Sparc, Ager, Vegfa, Hopx, Gprc5a, Pmp22, Igfbp2, Spock2, Pdpn, Clic5
## Igfbp7, Emp2, Sdpr, Cav1, Timp3, Krt7, Cyp2b10, Col4a3, Tspan8, Aqp5
## Rtkn2, S100a6, Scnn1g, Cryab, Agrn, Pdgfa, Fbln5, Icam1, Limch1, Nbl1
## Negative: Tmem212, Ccdc153, Dynlrb2, X1110017D15Rik, Cfap126, Riiad1, Fam183b, Cbr2, Rsph1, Efcab10
## Mlf1, Cd74, Cdhr4, Nme9, Lrrc23, Sntn, X1700007K13Rik, X1700088E04Rik, Meig1, Drc3
## Foxj1, Cd24a, Elof1, X1700016K19Rik, Ccdc113, Ighm, X3300002A11Rik, Tekt1, Odf3b, Ccdc17
## PC_ 4
## Positive: Ighm, Igkc, Cd79a, Cd52, Ly6d, Tmsb10, Ccr7, Iglc3, Ltb, Cd79b
## Ms4a1, Coro1a, Rac2, Cd37, Ighd, Trbc2, Klf2, Ptprcap, Iglc2, Il2rg
## Sell, Srgn, H2.DMb2, H2.Eb1, Fcmr, Ms4a4b, Stk17b, Ets1, Plac8, H2.Q7
## Negative: Lyz2, Sftpb, Sftpa1, Fth1, Cxcl15, Sftpd, Npc2, Slc34a2, Ftl1, Chil1
## Cbr2, Ctsc, Lamp3, Scd1, Napsa, Fabp5, Sfta2, Chil3, Cd9, mt.Rnr2
## Cd36, Wfdc2, S100g, Rnase4, Ccl6, Mgst1, Elovl1, Lpcat1, Hc, Mt1
## PC_ 5
## Positive: Cd74, Igkc, H2.Aa, Cd79a, Ighm, H2.Eb1, H2.Ab1, Ly6d, Iglc3, Cd79b
## Ms4a1, Ighd, Iglc2, Fcmr, H2.DMb2, Ebf1, Cd19, Iglc1, H2.Ob, Siglecg
## Mzb1, Spib, Cd37, Mef2c, Lat2, Plac8, Fcrla, Cd83, Ptp4a3, Syk
## Negative: Trbc2, Ms4a4b, Cd8b1, Cd3d, Thy1, Trac, Trbc1, Lck, Cd3g, Ms4a6b
## Tcf7, Prkcq, Nkg7, Lat, Dapl1, Skap1, Il7r, Cd28, Ly6c2, Cd8a
## Fyb, Tmsb10, Lef1, Smc4, Igfbp4, Ctsw, Hcst, Il2rb, Selplg, H2.Q7
DimPlot(mca, reduction="pca",group.by="sample")
Look at the Elbow plot to estimate the number of PCs to keep.
ElbowPlot(object = mca)
We will keep 6 PCs for downstream analysis.
Finally, run the clustering algorithm, compute the tSNE/UMAP, and explore the results. Do you think we need to apply additional sample integration?
You can use the following markers to explore the different clusters: Sftpb, Cav1, Pdpn, Foxj1, Scgb1a1, Cd74, Marco, Cd8b1
mca <- FindNeighbors(object = mca, dims = 1:6,do.plot = TRUE)
## Computing nearest neighbor graph
## Computing SNN
## Warning in FindNeighbors.Seurat(object = mca, dims = 1:6, do.plot = TRUE):
## Please compute a tSNE for SNN visualization. See RunTSNE().
mca <- FindClusters(object = mca)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2427
## Number of edges: 76915
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8336
## Number of communities: 11
## Elapsed time: 0 seconds
mca <- RunUMAP(object = mca, dims = 1:6)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 10:16:40 UMAP embedding parameters a = 0.9922 b = 1.112
## 10:16:40 Read 2427 rows and found 6 numeric columns
## 10:16:40 Using Annoy for neighbor search, n_neighbors = 30
## 10:16:40 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 10:16:40 Writing NN index file to temp file /tmp/RtmpeLFnVF/filef7c14a80f43d
## 10:16:40 Searching Annoy index using 1 thread, search_k = 3000
## 10:16:41 Annoy recall = 100%
## 10:16:42 Commencing smooth kNN distance calibration using 1 thread
## 10:16:42 Initializing from normalized Laplacian + noise
## 10:16:43 Commencing optimization for 500 epochs, with 99494 positive edges
## 10:16:46 Optimization finished
DimPlot(mca)
DimPlot(object = mca, group.by="sample")
FeaturePlot(mca, features="nCount_RNA")
FeaturePlot(mca, c("Sftpb","Cav1"))
FeaturePlot(mca,c("Pdpn", "Scgb1a1"))
FeaturePlot(mca,c("Cd79a", "Marco"))
VlnPlot(mca, c("Foxj1","Scgb1a1"))
Let’s find differentially expressed genes between our clusters. We will use the FindAllMarkers function in Seurat. This function offers several statistical tests for this task. The default is a Wilcoxon Rank Sum test. The result tables contains statistical results such as p-values and logFC, but also the % of cells in each group expressing the markers. Checking the % of detection of the markers in a cluster can help you select genes specifically expressed in a cell type, and not in others.
markers <- FindAllMarkers(mca)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
head(markers)
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## Sftpc 4.926125e-158 1.428352 0.998 0.452 4.351246e-154 0 Sftpc
## Lyz2 1.071828e-117 1.332891 0.985 0.554 9.467455e-114 0 Lyz2
## Cxcl15 1.416146e-102 1.113001 0.985 0.505 1.250882e-98 0 Cxcl15
## Chil1 1.589916e-96 1.016751 0.963 0.480 1.404373e-92 0 Chil1
## Sftpa1 3.238716e-93 1.153204 0.998 0.479 2.860758e-89 0 Sftpa1
## Sftpb 2.336633e-90 1.108758 0.998 0.509 2.063948e-86 0 Sftpb
Then, we can select the top markers for each cluster and generate a heatmap based on these genes.
top.markers <- markers %>%
group_by(cluster) %>%
top_n(n = 5, wt = avg_log2FC)
DoHeatmap(mca, features = top.markers$gene) + NoLegend()
## Warning in DoHeatmap(mca, features = top.markers$gene): The following features
## were omitted as they were not found in the scale.data slot for the RNA assay:
## Rps24, Rps15a, Rps29, Rps27, Sftpc
If you have very small clusters, it is informative to generate a heatmap based on the average profiles to look at the top markers.
ave.exp = AverageExpression(mca, return.seurat = T)
## Centering and scaling data matrix
DoHeatmap(ave.exp,features = top.markers$gene) + NoLegend()
What can you conclude on your clustering based on this heatmap? Let’s compare your clustering results to the original cell labels.
table(mca$RNA_snn_res.0.8, mca$cell.label)
##
## Alveolar macrophage AT1 Cell AT2 Cell B Cell Ciliated cell Clara Cell
## 0 0 0 463 0 0 0
## 1 0 0 445 4 0 0
## 2 1 3 383 0 0 1
## 3 0 0 0 4 0 2
## 4 181 0 0 1 0 0
## 5 0 0 1 178 0 0
## 6 133 0 0 0 0 0
## 7 0 120 1 1 0 0
## 8 0 0 0 109 0 0
## 9 0 0 0 0 3 57
## 10 0 0 0 0 55 0
##
## T Cell_Cd8b1 high
## 0 0
## 1 0
## 2 0
## 3 281
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
We can now either repeat the clustering steps with different parameters to avoid over clustering, or we could group the cluster manually and assign labels based on the known marker genes. We can also use automated annotation tools to help us for cell annotations if needed.
Let’s use the cell identity provided with the dataset as cluster labels and repeat the analysis.
Idents(mca) <- mca[["cell.label"]]
table(Idents(mca))
##
## AT2 Cell Ciliated cell T Cell_Cd8b1 high B Cell
## 1293 58 281 297
## Alveolar macrophage AT1 Cell Clara Cell
## 315 123 60
cells.markers <- FindAllMarkers(mca)
## Calculating cluster AT2 Cell
## Calculating cluster Ciliated cell
## Calculating cluster T Cell_Cd8b1 high
## Calculating cluster B Cell
## Calculating cluster Alveolar macrophage
## Calculating cluster AT1 Cell
## Calculating cluster Clara Cell
cells.top.markers <- cells.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
DoHeatmap(mca, features = cells.top.markers$gene) + NoLegend()
## Warning in DoHeatmap(mca, features = cells.top.markers$gene): The following
## features were omitted as they were not found in the scale.data slot for the RNA
## assay: Rplp1, Rps16, Rps24, Rps7, Rpl18a, Rps15a, Rps29, Rps27, Sftpc
saveRDS(mca, file="mca.rds")
Let’s load our processed dataset and generate a couple of graphs to understand what Take some time to explore the differences between cell types using graphs. Take a close look at the number of genes and the number of molecule in each cell type. Do you see any difference in cell type proportions between the 2 samples?
mca <- readRDS(file="/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/mca.rds")
head(mca@meta.data)
## orig.ident nCount_RNA nFeature_RNA sample
## Lung_1.AAAACGAGATGGAGGACT Lung 1103 589 S1
## Lung_1.AAAACGCCGACGTAGAGA Lung 578 300 S1
## Lung_1.AAAACGCCTAGAGTGGTA Lung 2760 1343 S1
## Lung_1.AAAACGCCTTTCAACGCC Lung 1713 789 S1
## Lung_1.AAAACGCCTTTCTTTAGG Lung 771 470 S1
## Lung_1.AAAACGCGTATTGCCCTC Lung 449 309 S1
## cell.label RNA_snn_res.0.8 seurat_clusters
## Lung_1.AAAACGAGATGGAGGACT AT2 Cell 2 2
## Lung_1.AAAACGCCGACGTAGAGA AT2 Cell 0 0
## Lung_1.AAAACGCCTAGAGTGGTA Ciliated cell 10 10
## Lung_1.AAAACGCCTTTCAACGCC AT2 Cell 2 2
## Lung_1.AAAACGCCTTTCTTTAGG T Cell_Cd8b1 high 3 3
## Lung_1.AAAACGCGTATTGCCCTC B Cell 8 8
table(mca$sample)
##
## S1 S2
## 1617 810
DimPlot(mca, label = T)
DimPlot(mca, group.by="sample")
## Calculate the % of each cell type by sample
tt = table(Original.Label=mca$cell.label,
Sample=mca$sample)
round(sweep(tt*100,2,as.numeric(colSums(tt)), FUN="/"),1)
## Sample
## Original.Label S1 S2
## Alveolar macrophage 11.8 15.3
## AT1 Cell 5.1 4.9
## AT2 Cell 57.4 45.1
## B Cell 10.3 16.0
## Ciliated cell 1.1 5.1
## Clara Cell 1.2 4.9
## T Cell_Cd8b1 high 13.0 8.6
DimPlot(mca, split.by = "sample")
## create a new variable combining sample and cell type for EDA
## Inspect known marker genes distribution to verify specificity
VlnPlot(mca, "nFeature_RNA")
VlnPlot(mca, "nCount_RNA")
VlnPlot(mca, c("Sftpb", "Foxj1", "Cd79a"), group.by="cell.label")
Determining cell types can be challenging. There are now several tools available to help you with this task. We will use the package SingleR to perform automated annotations of our cells. Several reference are provided in the package celldex. If these references do not correspond to your biological system, you can also create your own reference set to use in SingleR (from single cell or bulk RNAseq).
Let’s look at the Mouse reference dataset that can be of interest for us: ImmGenData This reference contains normalized genes expression profiles of murine cell populations that were sorted and analyzed using microarrays.
First, we will load the ImmGenData. Several levels of annotations are provided: one broad level level.main, and one precise level label.fine. Cell ontology ids are also available.
library(SingleR)
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
##
## count
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:tidyr':
##
## expand
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:purrr':
##
## reduce
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
##
## Attaching package: 'SummarizedExperiment'
## The following object is masked from 'package:SeuratObject':
##
## Assays
## The following object is masked from 'package:Seurat':
##
## Assays
library(celldex)
##
## Attaching package: 'celldex'
## The following objects are masked from 'package:SingleR':
##
## BlueprintEncodeData, DatabaseImmuneCellExpressionData,
## HumanPrimaryCellAtlasData, ImmGenData, MonacoImmuneData,
## MouseRNAseqData, NovershternHematopoieticData
immgen.ref <- ImmGenData()
## snapshotDate(): 2021-10-19
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
immgen.ref
## class: SummarizedExperiment
## dim: 22134 830
## metadata(0):
## assays(1): logcounts
## rownames(22134): Zglp1 Vmn2r65 ... Tiparp Kdm1a
## rowData names(0):
## colnames(830):
## GSM1136119_EA07068_260297_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_1.CEL
## GSM1136120_EA07068_260298_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_2.CEL ...
## GSM920654_EA07068_201214_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_1.CEL
## GSM920655_EA07068_201215_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_2.CEL
## colData names(3): label.main label.fine label.ont
table(immgen.ref$label.main)
##
## B cells B cells, pro Basophils DC
## 79 1 6 88
## Endothelial cells Eosinophils Epithelial cells Fibroblasts
## 20 4 25 21
## ILC Macrophages Mast cells Microglia
## 23 79 20 3
## Monocytes Neutrophils NK cells NKT
## 33 23 38 22
## Stem cells Stromal cells T cells Tgd
## 36 7 231 71
head(sort(table(immgen.ref$label.fine), decreasing = T),20)
##
## DC (DC) Fibroblasts (FRC) Neutrophils (GN)
## 43 11 11
## Endothelial cells (LEC) Endothelial cells (BEC) Macrophages (MF)
## 10 9 9
## Mast cells (MC) Monocytes (MO.6C+II-) T cells (T.8Nve)
## 9 9 9
## DC (DC.PDC.8+) DC (DC.8-4-11B-) Macrophages (MF.103-11B+)
## 8 7 7
## NKT (NKT.4-) NKT (NKT.4+) Stem cells (MLP)
## 7 7 7
## B cells (B.CD19CONTROL) B cells (B.Fo) B cells (B1a)
## 6 6 6
## Basophils (BA) Neutrophils (GN.ARTH)
## 6 6
This package provides a single function for cell type annotations. It takes as input a count matrix, or a SingleCellExperiment object. For test data, we can provide either raw or normalized count data, since singleR is based on Spearman correlation. If you are using a full-length protocol (such as Smart-Seq), and you want to annotate your data using a reference obtained using UMI-based protocol, it is recommended to adjust your count data for gene length and provide TPMs to SingleR.
immgen.pred = SingleR(GetAssayData(mca, slot="data"), ref=immgen.ref, label=immgen.ref$label.main)
colnames(immgen.pred)
## [1] "scores" "first.labels" "tuning.scores" "labels"
## [5] "pruned.labels"
sort(table(immgen.pred$labels), decreasing=TRUE)
##
## Epithelial cells B cells Macrophages T cells
## 1411 284 267 234
## Endothelial cells Neutrophils Monocytes Fibroblasts
## 67 28 27 24
## Stromal cells Tgd ILC NKT
## 18 16 14 13
## Eosinophils Basophils NK cells B cells, pro
## 12 5 4 1
## DC Stem cells
## 1 1
The package provides nice graphical output to assess the quality of the classification. We can easily generate a heatmap of correlation scores for each cell against the reference, to look for ambiguous assignments.
plotScoreHeatmap(immgen.pred)
We can also display cell meta data at the top of the graph, to look for batch effects etc…
my.labels = data.frame(Sample=mca@meta.data[,c("sample")])
rownames(my.labels) = rownames(mca@meta.data)
plotScoreHeatmap(immgen.pred,
annotation_col=my.labels)
Let’s check how this cell type annotation compares with the original labels.
res.tab = table(mca$cell.label,immgen.pred$labels)
res.tab
##
## B cells B cells, pro Basophils DC Endothelial cells
## Alveolar macrophage 0 0 1 0 0
## AT1 Cell 0 0 0 0 29
## AT2 Cell 1 0 1 0 37
## B Cell 282 1 1 0 0
## Ciliated cell 0 0 0 0 0
## Clara Cell 0 0 1 1 1
## T Cell_Cd8b1 high 1 0 1 0 0
##
## Eosinophils Epithelial cells Fibroblasts ILC Macrophages
## Alveolar macrophage 11 0 0 0 254
## AT1 Cell 0 77 3 0 1
## AT2 Cell 0 1222 16 0 6
## B Cell 1 5 0 2 2
## Ciliated cell 0 56 1 0 1
## Clara Cell 0 51 3 0 1
## T Cell_Cd8b1 high 0 0 1 12 2
##
## Monocytes Neutrophils NK cells NKT Stem cells
## Alveolar macrophage 23 26 0 0 0
## AT1 Cell 0 1 0 0 0
## AT2 Cell 2 1 0 0 0
## B Cell 2 0 0 0 0
## Ciliated cell 0 0 0 0 0
## Clara Cell 0 0 0 0 1
## T Cell_Cd8b1 high 0 0 4 13 0
##
## Stromal cells T cells Tgd
## Alveolar macrophage 0 0 0
## AT1 Cell 12 0 0
## AT2 Cell 6 0 1
## B Cell 0 1 0
## Ciliated cell 0 0 0
## Clara Cell 0 1 0
## T Cell_Cd8b1 high 0 232 15
pheatmap(log10(res.tab+1))
As we can see in the previous example, existing databases may not match your data very well. We will see in this part how to build you own reference, either from your own data that you have been able to manually annotate previously, or from a public repository.
Let’s start by testing if we can annotate Sample 2 against the other one.
First, we will create the reference annotation. We need to extract the log normalized counts for Sample1 from the mca object, and we will use the original cell labels as cell type annotations. Once our reference ready, we annotate sample 2.
Inspect quality metrics and graphs to check if the cell type predictions are good.
sample1 = subset(mca, subset = sample=="S1")
Idents(sample1) = sample1$cell.label
sample1.norm = GetAssayData(sample1, slot="data")
sample2 = subset(mca, subset = sample=="S2")
sample2.pred = SingleR(as.SingleCellExperiment(sample2),
ref = sample1.norm, labels = sample1$cell.label)
plotScoreHeatmap(sample2.pred)
table(NewLabel=sample2.pred$labels, Original.Label=sample2$cell.label)
## Original.Label
## NewLabel Alveolar macrophage AT1 Cell AT2 Cell B Cell
## Alveolar macrophage 124 0 0 0
## AT1 Cell 0 36 0 0
## AT2 Cell 0 4 362 0
## B Cell 0 0 0 116
## Ciliated cell 0 0 0 0
## Clara Cell 0 0 0 0
## T Cell_Cd8b1 high 0 0 3 14
## Original.Label
## NewLabel Ciliated cell Clara Cell T Cell_Cd8b1 high
## Alveolar macrophage 0 0 0
## AT1 Cell 0 0 0
## AT2 Cell 0 0 0
## B Cell 0 0 0
## Ciliated cell 35 0 0
## Clara Cell 6 40 0
## T Cell_Cd8b1 high 0 0 70
Our dataset is a subset of the Mouse Cell Atlas, generated using MicroWell-Seq technology. In this exercise, we will try to annotate our MCA sample using the Tabula Muris cell atlas. I have extracted the lung data generated using the Smart-Seq2 technology from the Tabula Muris website. We can now read in the data, generate log normalized counts using the scuttle package, and use the normalized data as reference set.
tab.muris = read.csv("/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/TabulaMuris/Lung-counts.csv", row.names = 1)
head(tab.muris[,1:5])
## A7.MAA000526.3_9_M.1.1 B12.MAA000526.3_9_M.1.1
## 0610005C13Rik 0 0
## 0610007C21Rik 310 200
## 0610007L01Rik 0 45
## 0610007N19Rik 0 119
## 0610007P08Rik 0 0
## 0610007P14Rik 0 185
## D4.MAA000526.3_9_M.1.1 E11.MAA000526.3_9_M.1.1
## 0610005C13Rik 0 0
## 0610007C21Rik 0 485
## 0610007L01Rik 0 1
## 0610007N19Rik 0 81
## 0610007P08Rik 0 117
## 0610007P14Rik 17 0
## G6.MAA000526.3_9_M.1.1
## 0610005C13Rik 0
## 0610007C21Rik 546
## 0610007L01Rik 0
## 0610007N19Rik 0
## 0610007P08Rik 0
## 0610007P14Rik 16
TM.annot.cells = read.delim("/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/TabulaMuris/annotations_lung.txt")
rownames(TM.annot.cells) = TM.annot.cells$cell
head(TM.annot.cells)
## cell cell_ontology_class
## A2.MAA001847.3_39_F.1.1 A2.MAA001847.3_39_F.1.1 B cell
## A21.D043522.3_39_F.1.1 A21.D043522.3_39_F.1.1 B cell
## A21.MAA000526.3_9_M.1.1 A21.MAA000526.3_9_M.1.1 B cell
## A21.MAA000891.3_10_M.1.1 A21.MAA000891.3_10_M.1.1 B cell
## A22.D043522.3_39_F.1.1 A22.D043522.3_39_F.1.1 B cell
## B2.MAA001889.3_38_F.1.1 B2.MAA001889.3_38_F.1.1 B cell
## cell_ontology_id cluster.ids free_annotation
## A2.MAA001847.3_39_F.1.1 CL:0000236 13
## A21.D043522.3_39_F.1.1 CL:0000236 13
## A21.MAA000526.3_9_M.1.1 CL:0000236 13
## A21.MAA000891.3_10_M.1.1 CL:0000236 13
## A22.D043522.3_39_F.1.1 CL:0000236 13
## B2.MAA001889.3_38_F.1.1 CL:0000236 13
## We will keep only cells for which we have an annotations, and align the tables
tab.muris = tab.muris[, TM.annot.cells$cell]
library(SingleCellExperiment)
TM = SingleCellExperiment(list(counts=tab.muris),
colData=data.frame(TabulaMuris.SmartSeq=TM.annot.cells[,2]))
TM.norm = scuttle::logNormCounts(TM)
TM.pred = SingleR(as.SingleCellExperiment(sample1),
ref = TM.norm,
labels = colData(TM.norm)[,1])
plotScoreHeatmap(TM.pred)
TM.res = table(TM.label = TM.pred$labels, Original.label = sample1$cell.label)
pheatmap(log10(TM.res+10))
https://azimuth.hubmapconsortium.org/
Azimuth is a web application that can be used to automate the processing, analysis, and interpretation of a new single-cell RNA-seq experiment.The analysis pipeline is based on Seurat, and several Human or Mouse reference dataset are provided. Azimuth analysis results can be downloaded and integrated in your Seurat analysis.
I have uploaded our mca data into Azimuth, and used the Human lung atlas reference. The annotation results are available in the “azimuth_pred.tsv” file. You can read in the file and see how it performed, even though it’s not the same model organism.
predictions <- read.delim('/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/azimuth_pred.tsv', row.names = 1)
mca <- AddMetaData(
object = mca,
metadata = predictions)
table(Original.Label = mca$cell.label,
Azimuth = mca$predicted.annotation.l1)
## Azimuth
## Original.Label Alveolar Epithelial Type 1 Alveolar Epithelial Type 2
## Alveolar macrophage 0 7
## AT1 Cell 107 16
## AT2 Cell 1 1287
## B Cell 0 5
## Ciliated cell 0 2
## Clara Cell 0 2
## T Cell_Cd8b1 high 1 1
## Azimuth
## Original.Label B CD14+ Monocyte CD16+ Monocyte CD4 T CD8 T Ciliated
## Alveolar macrophage 3 160 3 2 0 0
## AT1 Cell 0 0 0 0 0 0
## AT2 Cell 0 4 0 0 0 0
## B Cell 203 18 1 59 0 0
## Ciliated cell 0 3 0 0 0 35
## Clara Cell 0 1 0 0 0 0
## T Cell_Cd8b1 high 1 8 0 261 9 0
## Azimuth
## Original.Label Club Dendritic Macrophage Mucous Plasma
## Alveolar macrophage 0 49 91 0 0
## AT1 Cell 0 0 0 0 0
## AT2 Cell 1 0 0 0 0
## B Cell 0 10 0 0 1
## Ciliated cell 13 0 0 5 0
## Clara Cell 44 0 1 12 0
## T Cell_Cd8b1 high 0 0 0 0 0
Seurat tutorials: https://satijalab.org/seurat/
SingleR book: http://bioconductor.org/books/release/SingleRBook/
Pokedex for Cell Types: http://bioconductor.org/packages/release/data/experiment/vignettes/celldex/inst/doc/userguide.html
Azimuth: https://azimuth.hubmapconsortium.org/
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS/LAPACK: /shared/software/miniconda/envs/r-4.1.1/lib/libopenblasp-r0.3.18.so
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.16.0 celldex_1.4.0
## [3] SingleR_1.8.0 SummarizedExperiment_1.24.0
## [5] Biobase_2.54.0 GenomicRanges_1.46.1
## [7] GenomeInfoDb_1.30.0 IRanges_2.28.0
## [9] S4Vectors_0.32.3 BiocGenerics_0.40.0
## [11] MatrixGenerics_1.6.0 matrixStats_0.61.0
## [13] pheatmap_1.0.12 SeuratObject_4.0.4
## [15] Seurat_4.0.6 forcats_0.5.1
## [17] stringr_1.4.0 dplyr_1.0.7
## [19] purrr_0.3.4 readr_2.1.1
## [21] tidyr_1.1.4 tibble_3.1.6
## [23] ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 reticulate_1.22
## [3] tidyselect_1.1.1 RSQLite_2.2.9
## [5] AnnotationDbi_1.56.2 htmlwidgets_1.5.4
## [7] grid_4.1.1 BiocParallel_1.28.3
## [9] Rtsne_0.15 munsell_0.5.0
## [11] ScaledMatrix_1.2.0 codetools_0.2-18
## [13] ica_1.0-2 future_1.23.0
## [15] miniUI_0.1.1.1 withr_2.4.3
## [17] colorspace_2.0-2 filelock_1.0.2
## [19] highr_0.9 knitr_1.37
## [21] rstudioapi_0.13 ROCR_1.0-11
## [23] tensor_1.5 listenv_0.8.0
## [25] labeling_0.4.2 GenomeInfoDbData_1.2.7
## [27] polyclip_1.10-0 bit64_4.0.5
## [29] farver_2.1.0 parallelly_1.30.0
## [31] vctrs_0.3.8 generics_0.1.1
## [33] xfun_0.29 BiocFileCache_2.2.0
## [35] R6_2.5.1 rsvd_1.0.5
## [37] cachem_1.0.6 bitops_1.0-7
## [39] spatstat.utils_2.3-0 DelayedArray_0.20.0
## [41] assertthat_0.2.1 promises_1.2.0.1
## [43] scales_1.1.1 gtable_0.3.0
## [45] beachmat_2.10.0 globals_0.14.0
## [47] goftest_1.2-3 rlang_0.4.12
## [49] splines_4.1.1 lazyeval_0.2.2
## [51] spatstat.geom_2.3-1 broom_0.7.11
## [53] BiocManager_1.30.16 yaml_2.2.1
## [55] reshape2_1.4.4 abind_1.4-5
## [57] modelr_0.1.8 backports_1.4.1
## [59] httpuv_1.6.5 tools_4.1.1
## [61] ellipsis_0.3.2 spatstat.core_2.3-2
## [63] jquerylib_0.1.4 RColorBrewer_1.1-2
## [65] ggridges_0.5.3 Rcpp_1.0.7
## [67] plyr_1.8.6 sparseMatrixStats_1.6.0
## [69] zlibbioc_1.40.0 RCurl_1.98-1.5
## [71] rpart_4.1-15 deldir_1.0-6
## [73] viridis_0.6.2 pbapply_1.5-0
## [75] cowplot_1.1.1 zoo_1.8-9
## [77] haven_2.4.3 ggrepel_0.9.1
## [79] cluster_2.1.2 fs_1.5.2
## [81] magrittr_2.0.1 data.table_1.14.2
## [83] RSpectra_0.16-0 scattermore_0.7
## [85] lmtest_0.9-39 reprex_2.0.1
## [87] RANN_2.6.1 fitdistrplus_1.1-6
## [89] hms_1.1.1 patchwork_1.1.1
## [91] mime_0.12 evaluate_0.14
## [93] xtable_1.8-4 readxl_1.3.1
## [95] gridExtra_2.3 compiler_4.1.1
## [97] KernSmooth_2.23-20 crayon_1.4.2
## [99] htmltools_0.5.2 mgcv_1.8-38
## [101] later_1.3.0 tzdb_0.2.0
## [103] lubridate_1.8.0 DBI_1.1.2
## [105] ExperimentHub_2.2.0 dbplyr_2.1.1
## [107] rappdirs_0.3.3 MASS_7.3-54
## [109] Matrix_1.3-4 cli_3.1.0
## [111] parallel_4.1.1 igraph_1.2.11
## [113] pkgconfig_2.0.3 scuttle_1.4.0
## [115] plotly_4.10.0 spatstat.sparse_2.1-0
## [117] xml2_1.3.3 bslib_0.3.1
## [119] XVector_0.34.0 rvest_1.0.2
## [121] digest_0.6.29 sctransform_0.3.2
## [123] RcppAnnoy_0.0.19 Biostrings_2.62.0
## [125] spatstat.data_2.1-2 rmarkdown_2.11
## [127] cellranger_1.1.0 leiden_0.3.9
## [129] uwot_0.1.11 DelayedMatrixStats_1.16.0
## [131] curl_4.3.2 shiny_1.7.1
## [133] lifecycle_1.0.1 nlme_3.1-153
## [135] jsonlite_1.7.2 BiocNeighbors_1.12.0
## [137] viridisLite_0.4.0 limma_3.50.0
## [139] fansi_1.0.0 pillar_1.6.4
## [141] lattice_0.20-45 KEGGREST_1.34.0
## [143] fastmap_1.1.0 httr_1.4.2
## [145] survival_3.2-13 interactiveDisplayBase_1.32.0
## [147] glue_1.6.0 png_0.1-7
## [149] BiocVersion_3.14.0 bit_4.0.4
## [151] stringi_1.7.6 sass_0.4.0
## [153] blob_1.2.2 AnnotationHub_3.2.0
## [155] BiocSingular_1.10.0 memoise_2.0.1
## [157] irlba_2.3.5 future.apply_1.8.1