We are now reaching a challenging (and very exciting !) step of the analysis.
What types of cells did we capture in the analysis ? Do we identify the expected cell types, and can we distinguish different sub-populations ? Do we identify “novel” or “surprising” cell types ?
The aim of this session is to understand the different methods that will help you to explore the biological cell types captured by your dataset.
At this step of the analysis, we have :
a gene expression matrix : for each cell, gene expression is available
a reduced space : gene expression matrix is summarized in N dimensions
a clustering : each cell belongs to a specific cluster
a 2D space : cells can be visualized on a 2D representation
On the cell visualization, we also searched for clusters of cells. The chosen clustering resolution shows multiple cell clusters that we can now associate with cell types.
For this you need :
your biological knowledge on your dataset
an internet connection :)
The annotation methods aim at defining marker genes that help to identify the cell type of each cluster.
Whatever the method, the logic is similar :
You identify genes or sets of genes whose expression pattern is specific to a cluster and shared by a large fraction of the cells of that cluster.
Different methods exist :
1 - MANUAL : You can either do it manually, use sets of marker genes from the literature, or use other annotated datasets and transfer their annotation to your clustering if similar expression patterns are found.
2 - AUTOMATIC : You use published databases and collect sets of marker genes for cell types, or published reference single-cell atlases that are already annotated, or you can also use published RNA-seq data for a specific cell type that you know is in your dataset…
For this practical we will try both approaches. We use the dataset previously filtered and pre-processed (the Seurat object contains the dimension reductions previously computed, including pca, umap and the integrated reductions such as harmony).
## Set the directory to the directory of your project
## Work on this with the output of Integration project
project_dir = "/shared/projects/ebaii_sc_teachers/SC_TD/"
# In your TD directory you should have another folder called "Integration" or "06_Integration". In this folder you should have the folder "RESULTS" with the Seurat object that you saved on Wednesday afternoon.
# If you have it, you can execute the next instruction
## Loading the Seurat object into the variable "sobj"
sobj = base::readRDS(
file = paste0(project_dir, "06_Integration/RESULTS/12_TD3A.TDCT_S5_Integrated_12926.3886.RDS"))
## A quick overview of the object
sobj
## An object of class Seurat
## 12926 features across 3886 samples within 1 assay
## Active assay: RNA (12926 features, 2000 variable features)
## 5 layers present: counts.TD3A, counts.TDCT, data.TD3A, data.TDCT, scale.data
## 6 dimensional reductions calculated: pca, CCAIntegration, umap, RPCAIntegration, HarmonyIntegration, HarmonyStandalone
### IN THE TERMINAL
## You will first check if you have the folder 06_Integration/RESULTS in your TD directory
## If not : you can create it with this command :
# mkdir -p /shared/projects/<your-project>/<TD_dir_if_any>/06_Integration/RESULTS/
## Then go to this directory :
# cd /shared/projects/<your-project>/<TD_dir_if_any>/06_Integration/RESULTS/
## Then copy the backup object into this directory :
# cp /shared/projects/2422_ebaii_n1/atelier_scrnaseq/TD/BACKUP/RDS/12_TD3A.TDCT_S5_Integrated_12926.3886.RDS ./
Clusters to annotate
Manual annotation is based on the identification of marker genes that characterise a cell population. For this, we must define which groups of cells we want to annotate.
This choice is based on the clustering (See Proc.2). After the integration step, we can re-run the 2 clustering steps, as we now have an object that integrates the 2 samples.
## Compute a SNN graph using the first 20 dimensions of the integrated reduction
sobj <- Seurat::FindNeighbors(
object = sobj,
dims = 1:20, ## number of dimensions to take into account
reduction = "HarmonyIntegration") ## name of the reduction used to build the nearest-neighbour graph (here we must use an integrated reduction)
## Louvain resolutions to test
resol <- c(0.1,0.2,0.3, 0.8, 1.5)
## Clustering
sobj <- Seurat::FindClusters(
object = sobj,
resolution = resol, ## Vector of different resolutions defined above
verbose = FALSE, ## silence the function, otherwise it writes long outputs
algorithm = 1) ## Algorithm for modularity optimization (1 = original Louvain algorithm; 2 = Louvain algorithm with multilevel refinement; 3 = SLM algorithm; 4 = Leiden algorithm). Leiden requires the leidenalg Python module.
## Plot the clustering results obtained with resolutions 0.1, 0.2, 0.8 and 1.5
Seurat::DimPlot(object = sobj,
reduction = "umap", # we use the reduction "UMAP" to plot the data
group.by = c("RNA_snn_res.0.1", "RNA_snn_res.0.2", "RNA_snn_res.0.8", "RNA_snn_res.1.5"), #color cells according to the different clustering resolution
pt.size = 1,
label = TRUE,
repel = TRUE
) + ggplot2::theme(legend.position = "bottom")
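To choose between resolutions, it can help to count how many clusters each one produces before looking at the plots. The following is a minimal sketch in base R on a toy metadata table: the cluster assignments are invented, and only the column naming pattern (`RNA_snn_res.<resolution>`) is taken from the practical.

```r
# Toy metadata mimicking the clustering columns stored in the Seurat metadata
# (hypothetical cluster assignments for 6 cells)
meta <- data.frame(
  RNA_snn_res.0.1 = factor(c(0, 0, 0, 1, 1, 1)),
  RNA_snn_res.0.8 = factor(c(0, 0, 1, 2, 2, 3)),
  RNA_snn_res.1.5 = factor(c(0, 1, 2, 3, 4, 5))
)

# Select the clustering columns by their common prefix
res_cols <- grep("^RNA_snn_res\\.", colnames(meta), value = TRUE)

# Number of clusters found at each resolution
n_clusters <- sapply(meta[res_cols], nlevels)
print(n_clusters)

# Number of cells in each cluster at one chosen resolution
cluster_sizes <- table(meta$RNA_snn_res.0.8)
print(cluster_sizes)
```

On the real object, the same lines should work by replacing `meta` with `sobj@meta.data`.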
Differential expression between clusters
One way to annotate the clusters of cells is to look at the genes highly expressed in one cluster of cells compared to all the other cells: we can do a differential expression (DE) analysis.
DE analysis is performed on the normalized count matrix (“data”).
In our case, the dataset is already Normalised (cf previous practicals).
# Verify that your object is normalised :
sobj
## An object of class Seurat
## 12926 features across 3886 samples within 1 assay
## Active assay: RNA (12926 features, 2000 variable features)
## 5 layers present: counts.TD3A, counts.TDCT, data.TD3A, data.TDCT, scale.data
## 6 dimensional reductions calculated: pca, CCAIntegration, umap, RPCAIntegration, HarmonyIntegration, HarmonyStandalone
# Here you should see "data" layers (data.TD3A, data.TDCT) in the list of layers : they hold the normalised counts
# DO NOT RUN THIS CODE
# sobj = NormalizeData(sobj)
To perform the manual annotation, we will use Seurat and the function FindAllMarkers.
This function compares each cluster against all the other clusters to identify differentially expressed genes that are potential marker genes.
We can now run the function FindAllMarkers()
# Marker gene identification will be performed cluster by cluster.
# You must decide which clustering resolution you use for this step.
Idents(sobj) = sobj$`RNA_snn_res.0.2`
In Seurat v5, the object keeps the 2 integrated datasets in separate layers of the assay (one counts layer and one data layer per sample). To call marker genes, we must join these layers together with the command JoinLayers.
## To perform DEG and Markers analysis, you must join the layers into a unique one:
sobj = JoinLayers(sobj)
Now, we can use the function FindAllMarkers to perform differential gene expression of each cluster against all the others. The objective is to find genes “specific” to each cluster in order to annotate them.
# Find markers for every cluster compared to all remaining cells, report only genes with positive DE
all_markers = FindAllMarkers(object = sobj,
only.pos = TRUE, # keep only genes more expressed in the cluster than in the remaining cells
min.pct = 0.25, # minimum fraction of cells expressing the gene (in the cluster or in the remaining cells)
logfc.threshold = 0.25) # minimum log2 fold change between the two groups
# This command can take up to 2 mins
Time warning : this command takes a while to run if the number of cells and clusters is high.
Note : If this command takes only ~5s, this is a sign that you probably forgot to merge the layers of your object with the command JoinLayers, and your object all_markers is probably empty.
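To make the thresholds `min.pct` and `logfc.threshold` more concrete, here is a small base R sketch computing, for a single gene, the fraction of expressing cells inside and outside a cluster and a log2 fold change. The expression values are invented, and the fold-change formula is a simplification (back-transform, average, pseudocount of 1); the exact computation done by Seurat may differ in its details.

```r
# Toy log-normalised expression of one gene in 8 cells (hypothetical values)
expr <- c(2.1, 1.8, 0, 2.5, 0, 0.3, 0, 0)
cluster <- factor(c("A", "A", "A", "A", "B", "B", "B", "B"))

# Cells of the cluster of interest VS all the other cells
in_cluster <- cluster == "A"

# Fraction of cells expressing the gene (expression > 0) inside / outside the cluster
pct_in  <- mean(expr[in_cluster] > 0)
pct_out <- mean(expr[!in_cluster] > 0)

# Simplified log2 fold change: back-transform the log1p values, average them,
# add a pseudocount of 1 and compare the two groups on the log2 scale
mean_in  <- mean(expm1(expr[in_cluster]))
mean_out <- mean(expm1(expr[!in_cluster]))
log2fc <- log2(mean_in + 1) - log2(mean_out + 1)

# Apply the same kind of filters as min.pct = 0.25 and logfc.threshold = 0.25
passes <- max(pct_in, pct_out) >= 0.25 && log2fc >= 0.25
print(c(pct_in = pct_in, pct_out = pct_out, log2fc = log2fc))
```

A gene detected in most cells of the cluster and in few of the remaining cells, with a clearly positive fold change, is the kind of candidate marker we are looking for.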
Once the markers per cluster have been identified, we can look at the number of markers identified by cluster.
# Number of markers identified by cluster
table(all_markers$cluster)
##
## 0 1 2 3 4 5 6
## 656 211 554 1725 150 118 1093
In this “table” the first line gives the cluster number, and the second line gives the number of markers (ie DEG) identified for this cluster.
Here we see that many markers (ie differentially expressed genes) have been identified. We cannot look at all of them, but we can choose to look at the top 5 markers per cluster and use our biological knowledge to identify cell populations.
# Save in a table the 5 most differentially expressed genes of each cluster VS all the other clusters
top5_markers = as.data.frame(all_markers %>%
group_by(cluster) %>%
top_n(n = 5, wt = avg_log2FC))
# Create a dotplot to visualise the expression of genes by cluster
Seurat::DotPlot(sobj, features = unique(top5_markers$gene)) +
# this second part of the code is just for aesthetics :
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90,
vjust = 1,
size = 8,
hjust = 1)) +
Seurat::NoLegend()
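Ranking on fold change alone can promote genes that are not statistically supported, so it may be worth filtering on the adjusted p-value first. The sketch below does this in base R on an invented marker table; only the column names (`cluster`, `gene`, `avg_log2FC`, `p_val_adj`) follow the output of FindAllMarkers, and the 0.05 cut-off is an assumption chosen for illustration.

```r
# Toy marker table with the same column names as the FindAllMarkers output
# (hypothetical genes and values)
markers <- data.frame(
  cluster    = factor(c("0", "0", "0", "1", "1", "1")),
  gene       = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"),
  avg_log2FC = c(2.0, 3.5, 0.8, 1.2, 4.1, 2.9),
  p_val_adj  = c(1e-10, 1e-20, 0.2, 1e-5, 1e-30, 0.6)
)

# Keep only the markers with a significant adjusted p-value
sig <- markers[markers$p_val_adj < 0.05, ]

# Order by cluster, then by decreasing fold change
sig <- sig[order(sig$cluster, -sig$avg_log2FC), ]

# Rank the genes within each cluster and keep the 2 best ones
rank_in_cluster <- ave(seq_len(nrow(sig)), sig$cluster, FUN = seq_along)
top2 <- sig[rank_in_cluster <= 2, ]
print(top2)
```

GeneF has a high fold change but is dropped because its adjusted p-value is not significant.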
On this plot we can see that some markers display a very high and specific expression in one cluster, while others are expressed in 2 clusters.
Biology in this plot :
What is the function of each marker gene ? Is it a known marker gene for a cell type ?
Is there literature about its pattern of expression ?
Here we need biological knowledge and back-and-forth marker gene computation at different clustering resolutions to understand which cell populations are present in the data.
At this step, using known marker genes from your experiment/knowledge is also useful, and easy to perform.
Let’s focus on 2 genes :
Rag1, a key gene of T cell maturation (variable/diversity/joining (V[D]J) rearrangement)
Cd5, a gene expressed in T cells and a marker of T cell self-reactivity
## The function FeaturePlot shows the expression of "Features" = "Genes" on the dimension reduction you specify, here the "umap".
FeaturePlot(sobj,
features = c("Cd5", "Rag1"),
reduction = "umap"
)
Here we can see that those markers may actually be very informative to distinguish the steps of the T cell maturation process.
… at this moment using your biological knowledge of your dataset is critical, you can also manually test any marker of your choice !
Perform differential expression for each cluster VS all the others with a normalized matrix
Look at the gene expression of the markers identified in the 2D representation to validate specificity and representativeness
Find the cell population corresponding to these markers and annotate this cluster
Advantages | Limits |
---|---|
|
|
For some tissues, the different cell types have already been largely described and databases exist with referenced marker genes. Another way to annotate your dataset will be to find a database with relevant annotation for your dataset and use tools of automatic annotation to annotate your clusters.
Let’s see how it works in practice !
We will use a database focused on immunological cell types called ImmGen, thanks to the celldex R package that “provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq”
Note : In the following code chunk, you will load the annotation file from the IFB server where we downloaded it. In real life, and with a version of dbplyr lower than or equal to 2.3, you can also use this command :
annotation = celldex::ImmGenData(ensembl = FALSE)
# ensembl = TRUE returns the annotation with ENSEMBL gene IDs, FALSE returns it with gene symbols
# Loading the ImmGen database
annotation = readRDS("/shared/projects/2422_ebaii_n1/atelier_scrnaseq/TD/RESOURCES/ImmGenData.RDS")
## A quick description of the db
annotation
## class: SummarizedExperiment
## dim: 22134 830
## metadata(0):
## assays(1): logcounts
## rownames(22134): Zglp1 Vmn2r65 ... Tiparp Kdm1a
## rowData names(0):
## colnames(830):
## GSM1136119_EA07068_260297_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_1.CEL
## GSM1136120_EA07068_260298_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_2.CEL ...
## GSM920654_EA07068_201214_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_1.CEL
## GSM920655_EA07068_201215_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_2.CEL
## colData names(3): label.main label.fine label.ont
This database contains 3 levels of granularity :
A “main” level (coarse grain)
A “fine” level (self-explanatory)
The “ONT” level (data are mapped to a defined ontology)
As we are in a context of sorted cells of the same lineage, we’re going to use the fine label.
Let’s see how many cell types are described in this ImmGen database :
length(unique(annotation$label.fine))
## [1] 253
The tool we will use to perform the automatic cell type annotation, SingleR, works better with normalized data. Thus, we will extract the normalized matrix from our Seurat object :
# Extraction of the normalised data
norm_exp_mat = Seurat::GetAssayData(
object = sobj,
assay = "RNA",
layer = "data" # normalised count matrix ("layer" replaces the deprecated "slot" argument in Seurat v5)
)
# We can check the characteristics of this new object with dim()
# dim() outputs the number of rows (genes) and columns (cells)
dim(norm_exp_mat)
## [1] 12926 3886
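Before running the prediction, it can be useful to check how many genes of the query are also present in the reference, since the comparison can only rely on genes found in both. Here is a minimal base R sketch with invented gene lists; with the real objects one would compare `rownames(norm_exp_mat)` and `rownames(annotation)`.

```r
# Toy gene lists (hypothetical): genes of our dataset and genes of the reference
query_genes <- c("Cd5", "Rag1", "Cd4", "Cd8a", "Gm12345")
ref_genes   <- c("Cd4", "Cd8a", "Rag1", "Ptprc", "Cd5")

# Genes present in both the query and the reference
shared <- intersect(query_genes, ref_genes)

# Fraction of the query genes that can be used for the comparison
frac_shared <- length(shared) / length(query_genes)

# Genes of the query that are absent from the reference
missing_from_ref <- setdiff(query_genes, ref_genes)

print(shared)
print(frac_shared)
print(missing_from_ref)
```

A very low fraction of shared genes usually points to a naming mismatch (gene symbols VS ENSEMBL IDs, or different species) rather than to a biological difference.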
We are ready to start the annotation.
The following command of SingleR performs the prediction of celltypes for each cell of the dataset.
# Runs in ~ 3-5 min depending on the number of CPUs and the memory allocated
# This command performs the prediction of cell types for each cell of the dataset
ann_predictions = SingleR::SingleR(
test = norm_exp_mat, # use the normalised matrix
ref = annotation, # use the annotation previously downloaded
labels = annotation$label.fine, # specify which annotation we use in annotation
de.method="classic", # use the "classic" marker detection scheme
assay.type.test = "logcounts",
assay.type.ref = "logcounts",
BPPARAM = BiocParallel::SerialParam()
)
The resulting object is a special kind of data.frame. Each row contains the ID of a cell and the prediction scores computed by SingleR. The cell type label assigned to each cell is stored in the column `$labels`.
How many different kinds of labels were identified ?
# Length of the list of unique labels => number of different cell types predicted in our dataset
length(unique(ann_predictions$labels))
## [1] 35
Besides scoring, SingleR assesses the score quality, and prunes bad results.
How many cells got a poor quality annotation ?
# Print the number of cells with bad prediction scores (not labelled)
summary(is.na(ann_predictions$pruned.labels))
## Mode FALSE TRUE
## logical 3847 39
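Pruned cells get an NA in the `pruned.labels` column, which can be turned into an explicit category before plotting or counting. The sketch below uses invented label vectors; the name "Unassigned" is an arbitrary choice made for this example.

```r
# Toy prediction results (hypothetical): raw labels and pruned labels for 5 cells
labels <- c("T.DPsm", "T.DPsm", "T.ISP", "T.DP69+", "T.DPsm")
pruned <- c("T.DPsm", NA,       "T.ISP", "T.DP69+", "T.DPsm")

# Fraction of cells whose prediction was pruned (poor quality)
frac_pruned <- mean(is.na(pruned))

# Replace the NA by an explicit category so that these cells remain visible
clean <- ifelse(is.na(pruned), "Unassigned", pruned)

print(frac_pruned)
print(table(clean))
```

With the real results, `pruned` would be `ann_predictions$pruned.labels`.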
Annotation diagnostic
SingleR provides some diagnostic plots :
SingleR::plotScoreHeatmap(ann_predictions)
How do you interpret this heatmap ?
Saving the annotation in your Seurat object
We add a column with the annotation for each cell to our Seurat object.
# Add the columns $labels of the prediction to the metadata of the single cell object
sobj$singler_cells_labels = ann_predictions$labels
Visualization of our annotation on UMAP
We can visualize the cell annotation on the UMAP.
# The 4 lines below are just aesthetics (they set the color palette to use)
seeable_palette = setNames(
c(RColorBrewer::brewer.pal(name = "Dark2", n = 8),
c(1:(length(unique(ann_predictions$labels)) - 8))),
nm = names(sort(table(ann_predictions$labels), decreasing = TRUE)))
# UMAP with the predicted annotation by cell
ann_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = "singler_cells_labels", #we want to color cells by their annotation
pt.size = 2,
cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")
# UMAP with the cluster numbers (before annotation)
clust_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = "RNA_snn_res.0.2", #color cells with their cluster number
pt.size = 2,
label = TRUE,
repel = TRUE
)
print(ann_plot + clust_plot)
## Look at the number of cells projected to the ImmGenData reference.
table(sobj$singler_cells_labels)
##
## DC (DC.8-4-11B-) Macrophages (MF.11CLOSER.SALM3)
## 1 1
## NKT (NKT.44+NK1.1-) T cells (T.4.Pa)
## 1 1
## T cells (T.4FP3+25+) T cells (T.4Nve)
## 1 2
## T cells (T.4SP24int) T cells (T.8EFF.OT1.48HR.LISOVA)
## 22 2
## T cells (T.8MEM.OT1.D45.LISOVA) T cells (T.8Mem)
## 1 2
## T cells (T.8MEMKLRG1-CD127+.D8.LISOVA) T cells (T.8NVE.OT1)
## 1 1
## T cells (T.8Nve) T cells (T.8SP24-)
## 25 8
## T cells (T.8SP24int) T cells (T.8SP69+)
## 3 7
## T cells (T.CD4.5H) T cells (T.CD4TESTCJ)
## 5 5
## T cells (T.CD8.5H) T cells (T.CD8.CTR)
## 2 2
## T cells (T.DN2B) T cells (T.DN3-4)
## 2 1
## T cells (T.DN3A) T cells (T.DN3B)
## 48 6
## T cells (T.DN4) T cells (T.DP)
## 1 1
## T cells (T.DP69+) T cells (T.DPbl)
## 218 71
## T cells (T.DPsm) T cells (T.ISP)
## 3174 263
## T cells (T.Tregs) Tgd (Tgd.imm.VG1+VD6+)
## 1 2
## Tgd (Tgd.VG2+) Tgd (Tgd.VG5+24AHI)
## 2 2
## Tgd (Tgd)
## 1
From this rapid prediction, it seems that our dataset contains mostly T cells, and particularly T.DPsm and T.DP69+ as well as T.ISP cells.
Maybe the annotation is not perfectly suited for our dataset. Some cell populations in the annotation are closely related, and this leads to annotation competition for our cells.
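Many labels in the table above are assigned to only one or two cells, which makes plots and tables hard to read. One possible way to handle this, shown here as a base R sketch on invented labels, is to group the rare labels under a single category. The threshold `min_cells` and the name "Other" are assumptions for this example, not part of the practical.

```r
# Toy vector of predicted labels (hypothetical counts)
labels <- c(rep("T cells (T.DPsm)", 10),
            rep("T cells (T.ISP)", 4),
            "T cells (T.Tregs)",
            "Tgd (Tgd)")

# Number of cells per label
counts <- table(labels)

# Labels supported by fewer cells than the threshold are considered rare
min_cells <- 3
rare <- names(counts)[counts < min_cells]

# Group all rare labels under a single "Other" category
collapsed <- ifelse(labels %in% rare, "Other", labels)

print(table(collapsed))
```

The original labels should be kept in their own metadata column, so that the rare predictions can still be inspected later.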
It is possible to run the annotation at the cluster level : it will be cleaner than the single cell level annotation. But make sure that the clustering is not merging several cell populations.
We can check the number of cells attributed to each label in each cluster :
# Create a contingency table with in rows the cell labels and in columns the cluster numbers
table(sobj$singler_cells_labels,
sobj$RNA_snn_res.0.2)
##
## 0 1 2 3 4 5 6
## DC (DC.8-4-11B-) 0 0 0 0 0 1 0
## Macrophages (MF.11CLOSER.SALM3) 0 0 0 0 0 1 0
## NKT (NKT.44+NK1.1-) 0 0 0 1 0 0 0
## T cells (T.4.Pa) 0 0 1 0 0 0 0
## T cells (T.4FP3+25+) 0 0 1 0 0 0 0
## T cells (T.4Nve) 0 0 2 0 0 0 0
## T cells (T.4SP24int) 1 0 21 0 0 0 0
## T cells (T.8EFF.OT1.48HR.LISOVA) 0 0 1 1 0 0 0
## T cells (T.8MEM.OT1.D45.LISOVA) 0 0 1 0 0 0 0
## T cells (T.8Mem) 0 0 2 0 0 0 0
## T cells (T.8MEMKLRG1-CD127+.D8.LISOVA) 0 0 1 0 0 0 0
## T cells (T.8NVE.OT1) 0 0 1 0 0 0 0
## T cells (T.8Nve) 0 0 25 0 0 0 0
## T cells (T.8SP24-) 0 0 8 0 0 0 0
## T cells (T.8SP24int) 0 0 2 1 0 0 0
## T cells (T.8SP69+) 0 0 7 0 0 0 0
## T cells (T.CD4.5H) 0 0 5 0 0 0 0
## T cells (T.CD4TESTCJ) 0 0 5 0 0 0 0
## T cells (T.CD8.5H) 0 0 2 0 0 0 0
## T cells (T.CD8.CTR) 0 0 2 0 0 0 0
## T cells (T.DN2B) 0 0 0 1 0 0 1
## T cells (T.DN3-4) 0 0 0 1 0 0 0
## T cells (T.DN3A) 0 0 0 2 0 0 46
## T cells (T.DN3B) 0 1 0 5 0 0 0
## T cells (T.DN4) 0 0 1 0 0 0 0
## T cells (T.DP) 0 0 0 0 0 1 0
## T cells (T.DP69+) 2 3 211 0 2 0 0
## T cells (T.DPbl) 0 17 0 54 0 0 0
## T cells (T.DPsm) 2067 827 137 0 73 70 0
## T cells (T.ISP) 1 4 0 258 0 0 0
## T cells (T.Tregs) 0 0 1 0 0 0 0
## Tgd (Tgd.imm.VG1+VD6+) 0 0 1 1 0 0 0
## Tgd (Tgd.VG2+) 0 0 1 0 0 0 1
## Tgd (Tgd.VG5+24AHI) 0 0 2 0 0 0 0
## Tgd (Tgd) 0 0 0 0 0 0 1
We can also check whether some clusters contain multiple cell types. We compute the proportion of each cell type in each cluster. If a cluster is composed of two cell types (or more), maybe the clustering resolution is too low ?
# Compute the proportion of cell types per cluster
pop_by_cluster = prop.table(table(sobj$singler_cells_labels,
sobj$RNA_snn_res.0.2),
margin = 2)
# Print the number of cell types that each represent more than 30% of the cells of a cluster
colSums(pop_by_cluster > 0.3)
## 0 1 2 3 4 5 6
## 1 1 2 1 1 1 1
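The same proportion table can also tell us which label dominates each cluster and how "pure" the cluster is. Below is a minimal base R sketch on an invented contingency table; the 70% purity threshold is an arbitrary assumption for illustration.

```r
# Toy contingency table (hypothetical): labels in rows, clusters in columns
tab <- matrix(c(90,  5,  5,    # cluster 0
                10, 80, 10,    # cluster 1
                45, 50,  5),   # cluster 2
              nrow = 3,
              dimnames = list(label   = c("T.DPsm", "T.ISP", "T.DN3A"),
                              cluster = c("0", "1", "2")))

# Proportion of each label within each cluster (columns sum to 1)
prop <- prop.table(tab, margin = 2)

# Most frequent label of each cluster
dominant <- rownames(prop)[apply(prop, 2, which.max)]
names(dominant) <- colnames(prop)

# Proportion of cells carrying the dominant label in each cluster
purity <- apply(prop, 2, max)

# Clusters in which the dominant label covers less than 70% of the cells
ambiguous <- names(purity)[purity < 0.7]

print(dominant)
print(purity)
print(ambiguous)
```

In this toy example, cluster 2 is split almost evenly between two labels and would be a candidate for a higher clustering resolution.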
Be aware of :
small weird clusters of cells : they might be of interest BUT they can also be clustering artefacts
very large clusters of cells : if you notice that marker genes are representative of only a fraction of this large cluster, you might need to adjust the clustering parameters to be more discriminating.
Find a good marker gene reference (PanglaoDB, CellMarker, CancerSEA…)
Select a tool / model : classifier, scoring function …
Annotate your dataset
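The steps above can be illustrated with a very simple scoring function: for each cell type, average the expression of its reference markers per cell, then per cluster, and keep the best scoring cell type for each cluster. This is only a sketch in base R with invented genes, marker sets and expression values; dedicated tools use more robust scores.

```r
# Toy normalised expression matrix (hypothetical): genes in rows, cells in columns
expr <- matrix(c(3, 3, 0, 0,
                 2, 3, 0, 1,
                 0, 0, 4, 3,
                 0, 1, 3, 4),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("GeneA", "GeneB", "GeneC", "GeneD"),
                               paste0("cell", 1:4)))
cluster <- factor(c("0", "0", "1", "1"))

# Hypothetical reference marker sets, one per cell type
marker_sets <- list(TypeX = c("GeneA", "GeneB"),
                    TypeY = c("GeneC", "GeneD"))

# Score per cell: mean expression of the markers of each cell type
scores <- sapply(marker_sets,
                 function(genes) colMeans(expr[genes, , drop = FALSE]))

# Score per cluster: sum the cell scores by cluster, divide by the cluster size
cluster_scores <- rowsum(scores, group = cluster) / as.vector(table(cluster))

# Best scoring cell type for each cluster
best_type <- colnames(cluster_scores)[apply(cluster_scores, 1, which.max)]
names(best_type) <- rownames(cluster_scores)

print(cluster_scores)
print(best_type)
```

The reference marker sets would typically come from one of the databases listed above.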
Advantages | Limitations |
---|---|
|
|
Another possibility is to use a published single-cell dataset as a reference for the cluster annotation. This is very useful when you work on a tissue that is close to a tissue already studied, or if you work on another species and you want to have a quick overview of what the predicted annotation would look like. Multiple tools exist to transfer the annotations to your own dataset (SingleR, Azimuth, Symphony, classifiers such as SVM…). Many methods exist : first choose the one you know well, or the one that people in your lab / bioinformatics team use, so that you can get help if needed (then you can try others…).
We are not going to use this method today but you might want to use it for your practicals.
Here are the main commands from SingleR.
# # Load the reference dataset in RDS format (it can also be loaded in another format, see the doc of SingleR to convert your reference to a suitable format for the prediction)
# REF_SNRNASEQ = readRDS("reference_scRNAseq.RDS")
# # Removing unlabelled libraries from the reference dataset
# ## This command removes from the object the cells with a metadata "Cell.type" that is NA.
# REF_SNRNASEQ = REF_SNRNASEQ[,!is.na(REF_SNRNASEQ$Cell.type)]
# # Create a SingleCellExperiment object for your reference dataset (assumed here to be a Seurat object)
# REF_SCE = Seurat::as.SingleCellExperiment(REF_SNRNASEQ)
# # Log-normalise the reference (the SingleR function works on normalised counts)
# REF_SCE <- scuttle::logNormCounts(REF_SCE)
# # Create a SingleCellExperiment object for your dataset (sx = your Seurat object)
# sx_sce = Seurat::as.SingleCellExperiment(sx)
# # RUN SINGLER
# pred.grun = SingleR::SingleR(test=sx_sce,
# ref=REF_SCE, # ref single cell data in "ref="
# labels=REF_SCE$Cell.type,
# de.method="wilcox") # recommended for single-cell references (the default is "classic")
Find a good reference dataset : several bulk RNA-seq, one scRNA-seq…
Select a tool to transfer annotation (SingleR, …)
Annotate your dataset
Advantages | Limitations |
---|---|
|
|
Method | Advantages | Limitations |
---|---|---|
Manual cluster annotation using differential expression |
|
|
Automatic annotation using reference markers |
|
|
Automatic annotation using reference dataset |
|
|
A few pieces of advice :)
It is recommended to combine multiple methods to annotate your data
Use manual cluster annotation to quickly identify your cell populations
Identify good markers for each cell population → your reference markers
Use automatic cell annotation with your set of markers → your reference dataset
Use your references to annotate new datasets and go back to manual annotation to refine your analysis.
Sometimes, annotation reveals that the dataset would benefit from a re-clustering, for instance if you realize that one cluster groups 2 cell types or, on the contrary, that two different clusters express very similar markers and should be merged.
During annotation, do not hesitate to look at the expression of mitochondrial or ribosomal genes (or any other set of genes) in your clusters. It might help you to identify a cluster of cells that looks weird to you. Clusters of “artificial” cells (cells of low quality) could lead to the identification of weird (novel?!) cells that have no real biological significance. But be careful : a cluster with a high expression of mitochondrial or ribosomal genes can sometimes have biological meaning.
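As an illustration of this kind of check, the sketch below computes the percentage of mitochondrial counts per cell and averages it per cluster, in base R on an invented count matrix. The `^mt-` prefix is an assumption that fits mouse gene symbols (human symbols use `MT-`); on a Seurat object, `Seurat::PercentageFeatureSet()` computes the per-cell percentage directly.

```r
# Toy raw count matrix (hypothetical): genes in rows, cells in columns
counts <- matrix(c(10, 0, 5,  1,
                    8, 2, 6,  0,
                    1, 9, 0, 10,
                    1, 9, 9,  9),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("Cd5", "Rag1", "mt-Co1", "mt-Nd1"),
                                 paste0("cell", 1:4)))
cluster <- factor(c("0", "1", "0", "1"))

# Mitochondrial genes are identified by their symbol prefix (mouse convention)
mito_genes <- grepl("^mt-", rownames(counts))

# Percentage of the counts of each cell that come from mitochondrial genes
pct_mito <- 100 * colSums(counts[mito_genes, , drop = FALSE]) / colSums(counts)

# Average percentage per cluster
mean_by_cluster <- tapply(pct_mito, cluster, mean)

print(round(pct_mito, 1))
print(mean_by_cluster)
```

A cluster with a much higher mitochondrial percentage than the others, like cluster 1 in this toy example, deserves a closer look before being given a biological interpretation.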
Note about automatic annotation :
If you are working with non-model species or with multiple species : it is not trivial to transfer an annotation from one species to another. Marker genes are not always conserved across evolution. In this case, manual annotation is a very important sanity check of any automatic annotation !!
It is possible to run the prediction of cell types with SingleR per cluster instead of per cell. The idea is similar, but instead of annotating every single cell with its best match in the reference dataset, it annotates every cluster of your query dataset with its best match in the reference dataset (SingleR summarizes the expression profiles of all cells from the same cluster, and then assesses the resulting aggregate) :
Note : we run the same command as before (SingleR), we only add the parameter “clusters” to the SingleR function to annotate by cluster and not by cell.
Advantage(s)
Much faster at the cluster level : SingleR can be time-consuming to run on every single cell of your dataset, particularly if the reference and the query dataset are big. Sometimes, you might want to perform a first quick prediction at the cluster level just to have a general idea of how the prediction works with your dataset, and which cell types you capture.
Check if the reference dataset will be of any help : before running a long prediction of cell annotation with SingleR (again, if your dataset is big), you want to know if the reference you are using helps at the cluster level. If it fails to identify cell types, even “general” cell types, forget it.
# Rerun a prediction using clustering information
# This command is much faster because the prediction is only performed for the 7 clusters and not for each cell.
clust_ann_predictions =
SingleR::SingleR(
test = norm_exp_mat,
clusters = sobj$RNA_snn_res.0.2,
ref = annotation,
labels = annotation$label.fine,
assay.type.test = "logcounts",
assay.type.ref = "logcounts",
BPPARAM = BiocParallel::SerialParam()
)
How many clusters have been labelled for each annotation label ?
## EXPLANATION OF THE COMMAND BELOW
head(sort(table(clust_ann_predictions$labels), decreasing = TRUE))
##
## T cells (T.DPsm) T cells (T.DN3A) T cells (T.DP69+) T cells (T.ISP)
## 4 1 1 1
## This command takes the vector of annotation labels (clust_ann_predictions$labels)
## It uses the function table to create a contingency table saying "how many times each "label" from the reference dataset was assigned across clusters"
## Then the function sort orders this table in "decreasing" order, to have the most assigned labels first
## Finally we show only the first lines (at most 6) of the sorted table using the function head
For how many clusters was the annotation of poor quality ?
## EXPLANATION OF THE COMMAND BELOW
summary(is.na(clust_ann_predictions$pruned.labels))
## Mode FALSE
## logical 7
## This command takes the column "pruned.labels" from the table of predictions
## then the command "is.na" looks for NA values in the column "pruned.labels"
## Finally, the function summary counts the TRUE (pruned) and FALSE (kept) values.
Annotation diagnostic
We can visualize the scores of each cell type for each cluster, as a heatmap :
# Heatmap using the annotation prediction by cluster
SingleR::plotScoreHeatmap(clust_ann_predictions)
What do you observe here ? What is the difference with the annotation by cell ?
Add annotation to metadata
We add the annotation to our Seurat object.
# Save the name of future annotation
clust_labels_col = "singler_clust_labels"
# Create a column with this name in the metadata and fill it with the cluster levels of each cell
sobj@meta.data[[clust_labels_col]] = sobj@meta.data$RNA_snn_res.0.2
# Associate each cluster with its annotation (this assumes that the rows of clust_ann_predictions are in the same order as the cluster levels)
levels(sobj@meta.data[[clust_labels_col]]) = clust_ann_predictions$labels
clust_ann_predictions$labels
## [1] "T cells (T.DPsm)" "T cells (T.DPsm)" "T cells (T.DP69+)"
## [4] "T cells (T.ISP)" "T cells (T.DPsm)" "T cells (T.DPsm)"
## [7] "T cells (T.DN3A)"
levels(sobj@meta.data[[clust_labels_col]])
## [1] "T cells (T.DPsm)" "T cells (T.DP69+)" "T cells (T.ISP)"
## [4] "T cells (T.DN3A)"
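Replacing the factor levels relies on the order of the predictions matching the order of the cluster levels. A variant that does not depend on the order is to look up each cell's label by cluster name. The sketch below shows the idea in base R on invented data; the named vector `cluster_labels` stands for a hypothetical cluster-to-label mapping.

```r
# Toy data (hypothetical): cluster of each cell, and one predicted label per cluster
cell_clusters <- factor(c("0", "2", "1", "0", "3", "2"))
cluster_labels <- c("0" = "T.DPsm",
                    "1" = "T.DPsm",
                    "2" = "T.DP69+",
                    "3" = "T.ISP")

# Look up the label of each cell by the NAME of its cluster.
# as.character() matters: indexing with the factor itself would use its integer codes.
cell_labels <- unname(cluster_labels[as.character(cell_clusters)])

print(cell_labels)
print(table(cell_clusters, cell_labels))
```

Two clusters sharing the same label (here clusters 0 and 1) are simply merged under that label, as in the output above.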
Visualization
We can visualize the cell annotation on the 2D projection :
ann_cluster_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = clust_labels_col,
pt.size = 2,
label = FALSE,
cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")
ann_cell_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = "singler_cells_labels",
pt.size = 2,
label = FALSE,
repel = TRUE,
cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")
ann_cluster_plot + ann_cell_plot
Save your annotated Seurat object
We save the annotated Seurat object :
## SAVE THE OBJECT
saveRDS(object = sobj, file = "RESULTS/12_TD3A.TDCT_S5_Integrated_Annotated.RDS")
# path <- "/shared/projects/<your_project>/<etc>/"
# base::saveRDS(
# object = sobj,
# file = paste0(path, "/Scaled_Normalized_Harmony_Clustering_Annotated_Seurat_Object.RDS")
# )
Good practices for single cell analysis : https://www.sc-best-practices.org/preamble.html
Sanger Single cell course : https://www.singlecellcourse.org/index.html
For human :
GeneCard : https://www.genecards.org
Human Protein Atlas : https://www.proteinatlas.org/search/H2-K1
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.1.4 Seurat_5.1.0
## [3] SeuratObject_5.0.2 sp_2.1-4
## [5] SingleR_2.6.0 SummarizedExperiment_1.34.0
## [7] Biobase_2.64.0 GenomicRanges_1.56.2
## [9] GenomeInfoDb_1.40.1 IRanges_2.38.1
## [11] S4Vectors_0.42.1 BiocGenerics_0.50.0
## [13] MatrixGenerics_1.16.0 matrixStats_1.4.1
## [15] ggplot2_3.5.1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.17.0
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] spatstat.utils_3.1-0 farver_2.1.2
## [7] rmarkdown_2.28 zlibbioc_1.50.0
## [9] vctrs_0.6.5 ROCR_1.0-11
## [11] spatstat.explore_3.3-2 DelayedMatrixStats_1.26.0
## [13] htmltools_0.5.8.1 S4Arrays_1.4.1
## [15] SparseArray_1.4.8 sass_0.4.9
## [17] sctransform_0.4.1 parallelly_1.38.0
## [19] KernSmooth_2.23-24 bslib_0.8.0
## [21] htmlwidgets_1.6.4 ica_1.0-3
## [23] plyr_1.8.9 plotly_4.10.4
## [25] zoo_1.8-12 cachem_1.1.0
## [27] igraph_2.1.1 mime_0.12
## [29] lifecycle_1.0.4 pkgconfig_2.0.3
## [31] rsvd_1.0.5 Matrix_1.7-1
## [33] R6_2.5.1 fastmap_1.2.0
## [35] GenomeInfoDbData_1.2.12 fitdistrplus_1.2-1
## [37] future_1.34.0 shiny_1.9.1
## [39] digest_0.6.37 colorspace_2.1-1
## [41] patchwork_1.3.0 tensor_1.5
## [43] RSpectra_0.16-2 irlba_2.3.5.1
## [45] beachmat_2.20.0 labeling_0.4.3
## [47] progressr_0.14.0 spatstat.sparse_3.1-0
## [49] fansi_1.0.6 polyclip_1.10-7
## [51] httr_1.4.7 abind_1.4-8
## [53] compiler_4.4.1 withr_3.0.1
## [55] BiocParallel_1.38.0 viridis_0.6.5
## [57] fastDummies_1.7.4 highr_0.11
## [59] MASS_7.3-61 DelayedArray_0.30.1
## [61] tools_4.4.1 lmtest_0.9-40
## [63] httpuv_1.6.15 future.apply_1.11.2
## [65] goftest_1.2-3 glue_1.8.0
## [67] nlme_3.1-165 promises_1.3.0
## [69] grid_4.4.1 Rtsne_0.17
## [71] cluster_2.1.6 reshape2_1.4.4
## [73] generics_0.1.3 spatstat.data_3.1-2
## [75] gtable_0.3.5 tidyr_1.3.1
## [77] data.table_1.16.2 BiocSingular_1.20.0
## [79] ScaledMatrix_1.12.0 utf8_1.2.4
## [81] XVector_0.44.0 spatstat.geom_3.3-3
## [83] RcppAnnoy_0.0.22 ggrepel_0.9.6
## [85] RANN_2.6.2 pillar_1.9.0
## [87] stringr_1.5.1 limma_3.60.6
## [89] spam_2.11-0 RcppHNSW_0.6.0
## [91] later_1.3.2 splines_4.4.1
## [93] lattice_0.22-6 deldir_2.0-4
## [95] survival_3.7-0 tidyselect_1.2.1
## [97] miniUI_0.1.1.1 pbapply_1.7-2
## [99] knitr_1.48 gridExtra_2.3
## [101] scattermore_1.2 xfun_0.48
## [103] statmod_1.5.0 pheatmap_1.0.12
## [105] stringi_1.8.4 UCSC.utils_1.0.0
## [107] lazyeval_0.2.2 yaml_2.3.10
## [109] evaluate_1.0.1 codetools_0.2-20
## [111] tibble_3.2.1 cli_3.6.3
## [113] uwot_0.2.2 xtable_1.8-4
## [115] reticulate_1.39.0 munsell_0.5.1
## [117] jquerylib_0.1.4 Rcpp_1.0.13
## [119] spatstat.random_3.3-2 globals_0.16.3
## [121] png_0.1-8 spatstat.univar_3.0-1
## [123] parallel_4.4.1 dotCall64_1.2
## [125] sparseMatrixStats_1.16.0 listenv_0.9.1
## [127] viridisLite_0.4.2 scales_1.3.0
## [129] ggridges_0.5.6 leiden_0.4.3.1
## [131] purrr_1.0.2 crayon_1.5.3
## [133] rlang_1.1.4 cowplot_1.1.3