---
title: "scRNA-seq : Cell type annotation"
author: "SingleCell RNA-Seq Team - EBAII n1 2023"
date: "2023-11-05"
output: 
   html_document:
    toc: true
    theme: united
---

![](illustration/header_markdown.png)

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(ggplot2)
library(SingleR)
library(Seurat)
library(dplyr)
```

### Objectives

We are reaching out a challenging task of the analysis (and a very exciting one !).

What types of cells did we capture in the analysis ? Do we identify the expected cell types and can we distinguish different sub-population ? Do we identify "novel", "surprising" cell types ?

The aim of this session is to understand the different methods that will help you to explore the biological cell types captured by your dataset.

#### Overview of the scRNAseq pipeline

At this step of the analysis, we have :

-   **a gene expression matrix** : for each cell, gene expression is available

-   **a reduced space** : gene expression matrix is summarized in N dimensions

-   **a clustering** : each cell belongs to a specific cluster

-   **a 2D space** : cells can be visualized on a 2D representation

![](illustration/workflow_overview.png)

On the cell visualization, we also searched for clusters of cell. The clustering resolution show multiple cell clusters that we can now associate to cell types.

For this you need :

-   **your biological knowledge** on your dataset

-   **an internet connection** :)

-   

### Different methods to annotate cell types

The annotation methods aim at defining **marker genes** that help to identify the cell types in each cluster.

But the logic across methods is similar :

You identify genes or set of genes that have a pattern of expression **specific** and that **represents a large number of cells** for the cluster.

Different methods exist :

-   1 - **MANUAL :** You can either do it manually, use set of marker genes from bibliography, or use other datasets that have been annotated to transfert the annotation to your clustering if similar expression patterns are found.

-   2 - **AUTOMATIC** : You use published database and collect sets of marker genes for cell types, or published reference single cell atlas already annotated, or you can also use published RNAseq on a specific cell type that you know is in your dataset...

-   

For this practical we will try both approaches. We use the dataset previously filtered and pre-processed (the Seurat object contains the 3 dimension reductions previously performed - pca, umap and harmony).

```{r load dataset}
## Loading the Seurat object into the variable "sobj"
sobj = base::readRDS(
  file = "/shared/projects/2325_ebaii/SingleCell/Annotation/data/TD3A.TDCT_Scal2K.IntH20.Clust0.8.RDS")

## A quick overview of the object
sobj
```

### 1 - Manual annotation using differential expression

**Differential expression between clusters**

One way to annotate the clusters of cell is to look at the genes highly expressed in one cluster of cells compared to all the other cells: we can do a differential expression (DE) analysis.

DE analysis is performed on the **normalized count matrix** ("data").

In our case, the dataset is already Normalised (cf previous practicals).

```{r normalise, echo=TRUE}
# DO NOT RUN THIS CODE
# sobj = NormalizeData(sobj)

```

```{r UMAP, echo=TRUE, fig.height=8, fig.width=12}
Seurat::DimPlot(object = sobj, 
                reduction = "umap", 
                group.by = "RNA_snn_res.0.8", #color cells with their cluster number
                pt.size = 2,
                label = TRUE,
                repel = TRUE
) + ggplot2::theme(legend.position = "bottom")
```

To perform the manual annotation, we will use **Seurat** and the function [*FindAllMarkers*](https://satijalab.org/seurat/reference/findallmarkers)***.***

This function compare each cluster against all the other clusters to identify genes differentially expressed that are potentially marker genes.

We can now run the function FindAllMarkers()

```{r FindAllMarkers, message=FALSE}
# find markers for every cluster compared to all remaining cells, report only genes with positive DE 
all_markers = FindAllMarkers(object = sobj, 
                          only.pos = TRUE, # genes more expressed in the cluster compared
                          min.pct = 0.25, # % of cell expressing the marker
                          logfc.threshold = 0.25) 

# This command can take up to 2 mins
```

***Time warning** : this command takes a while to run if the number of cells and cluster is high.*

Once the markers per cluster have been identify, we can look at the number of markers identified by cluster.

```{r table markers, echo=TRUE}
# Number of markers identified by cluster
table(all_markers$cluster)
```

Here we see that many markers (ie deferentially expressed genes) have been identified. We cannot look at all of them, but we can choose to look at the top 3 markers per cluster and use our biological knowledge to identify cell populations.

```{r top3 markers}
# Save in a table 3 genes the most differentially expressed in one cluster VS all the other clusters
top3_markers = as.data.frame(all_markers %>% 
                               group_by(cluster) %>% 
                               top_n(n = 3, wt = avg_log2FC))

```

```{r dotplot top3 markers, fig.height=8, fig.width=12}
# Create a dotplot the vidualise the expression of genes by cluster
Seurat::DotPlot(sobj, features = unique(top3_markers$gene)) +
  # this second part of the code is just for esthetics :
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, 
                                                     vjust = 1,
                                                     size = 8, 
                                                     hjust = 1)) +
  Seurat::NoLegend()

```

On this plot we can see that some markers display a very high and specific expression one cluster.

***Biology in this plot :***

-   What is the function of each marker gene ? is it a know marker gene for a cell type ?

-   Is there litterature about its pattern of expression ?

Let's focus on 2 genes :

-   Cd5, a gene expressed in T cell and is a marker of self reactivity of T cells

-   Rag1, a key gene of T cells maturation (variable/diversity/joining (V[D]J) rearrangement)

```{r UMAP 2 markers, fig.width=12, fig.height=8}
FeaturePlot(sobj, 
            features = c("Cd5", "Rag1"), 
            reduction = "umap"
            )

```

... at this moment using your biological knowledge on your dataset is critical, you can also test manually any marker of your choice !

#### **Conclusion - Manual annotation :**

-   Perform differential expression for each cluster VS all the others with a normalized matrix

-   Look at the gene expression of the markers identified in the 2D representation to **validate specificity and representativness**

-   Find the cell population corresponding to these markers and annotate this cluster

+---------------------------------------------------------+----------------------------------------------------------------------------+
| Advantages                                              | Limits                                                                     |
+=========================================================+============================================================================+
| -   **Easy** to implement                               | -   **Clustering** : resolution, merged clusters, "bio-informatic" cluster |
|                                                         |                                                                            |
| -   Sometimes the **only** solution (ex : novel tissue) | -   Change clustering ? Change annotation...                               |
|                                                         |                                                                            |
| -   **Everything** is possible                          | -   **Knowledge** : time-consuming                                         |
+---------------------------------------------------------+----------------------------------------------------------------------------+

### 2 - Automatic annotation using reference markers

For some tissues, the different cell types have already been largely described and databases exist with referenced marker genes. Another way to annotate your dataset will be to find a database with relevant annotation for your dataset and use tools of automatic annotation to annotate your clusters.

**Let's see how it works in practice !**

We will use a database focused on immunological cell types called [**ImmGen**](https://www.immgen.org/), thanks to the [celldex](http://bioconductor.org/packages/release/data/experiment/html/celldex.html) R package that *"provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq"*

***Note** : In the following chunk code, you will load the annotation file from the IFB server where we downloaded it. In real life and with a version of dbdypr inf or equal to 2.3, you can also use this command :*

```         
annotation = ImmGenData(ensembl = FALSE) 
# ensembl set to TRUE if we want ENSEMBL ID gene names, FALSE will get the annotation with Gene Symbols
```

```{r ref loading, echo=TRUE}
# Loading the ImmGen database
annotation = readRDS("/shared/projects/2325_ebaii/SingleCell/Annotation/data/ImmGenData.RDS")

## A quick description of the db
annotation
```

This database contains 3 levels of granularity :

-   A "main" level (coarse grain)

-   A "fine" level (self-explanatory)

-   The "ONT" level (data are mapped to a defined ontology)

As we are in a context of sorted cells of the same lineage, we're going to use the **fine** label.

Let's see how many cell types are described in this ImmGen database :

```{r cell types in Immgen, echo=TRUE}
length(unique(annotation$label.fine))
```

The tool we will use to perform the automatic cell type annotation, [SingleR](https://bioconductor.org/packages/release/bioc/html/SingleR.html) works better with the normalized data. Thus, we will extract the normalized matrix from our Seurat object :

```{r norm exp mat, warning=FALSE}
# Extraction of the normanised data 
norm_exp_mat = Seurat::GetAssayData(
  object = sobj,
  assay = "RNA",
  slot = "data" #normalised count matrix
)
# We can check the caracteristics of this new object with dim() 
# dim() outputs the number of columns (cells) and rows (Cells)
dim(norm_exp_mat)
```

We are ready to start the annotation.

The following command of SingleR performs the prediction of celltypes for each cell of the dataset.

```{r ann pred, warning=FALSE}
# Run in ~ 1min
# This command uses performs the prediction of celltypes for each cell of the dataset
ann_predictions = SingleR::SingleR(
    test = norm_exp_mat, # use the normalised matrix
    ref = annotation, # use the annotation previously downloaded 
    labels = annotation$label.fine, # specify which annotation we use in annotation
    de.method="classic", # use marker dectection scheme 
    assay.type.test = "logcounts",
    assay.type.ref = "logcounts",
    BPPARAM = BiocParallel::SerialParam()
  )

```

The resulting object is a special kind of `data.frame` . Each row contains the ID of a cell and the prediction score associated by SingleR. Cell labels associated to each cell are stored in the column \``$labels`\`

How many different kind of labels were identified ?

```{r n labels identified}
# Lenght of the list of labels => n of different cell types in our dataset
length(unique(ann_predictions$labels))
```

Besides scoring, SingleR assesses the score quality, and prunes bad results.

How many cells got a poor quality annotation ?

```{r pruned cells}
# Print the number of cells with bad prediction scores (not labelled)
summary(is.na(ann_predictions$pruned.labels))

```

**Annotation diagnostic**

SingleR allows to visualize some control plots :

-   We can visualize the score of each cell, split by cell type label, as a heatmap :

```{r heatmap predicition, fig.width=12, fig.height=8}
SingleR::plotScoreHeatmap(ann_predictions)

```

*How do you interpret this heatmap ?*

-   

**Saving the annotation in your Seurat object**

We add a column with the annotation for each cell to our Seurat object.

```{r add pred to metadata}
# Add the columns $labels of the prediction to the metadata of the single cell object
sobj$singler_cells_labels = ann_predictions$labels
```

**Visualization of our annotation on UMAP**

We can visualize cells annotation the the UMAP.

```{r UMAP labeled, fig.height=8, fig.width=12}
# The 4 lines under are just esthetics (set color palette to use)
seeable_palette = setNames(
  c(RColorBrewer::brewer.pal(name = "Dark2", n = 8),
    c(1:(length(unique(ann_predictions$labels)) - 8))),
  nm = names(sort(table(ann_predictions$labels), decreasing = TRUE)))

# UMAP with the predicted annotation by cell
ann_plot = Seurat::DimPlot(
  object = sobj, 
  reduction = "umap", 
  group.by = "singler_cells_labels", #we want to color cells by their annotation 
  pt.size = 2,
  cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")

# UMAP with the cluster numbers (before annotation)
clust_plot = Seurat::DimPlot(
  object = sobj, 
  reduction = "umap", 
  group.by = "RNA_snn_res.0.8", #color cells with their cluster number
  pt.size = 2,
  label = TRUE,
  repel = TRUE
)

print(ann_plot + clust_plot)
```

Maybe the annotation is not perfectly suited for our dataset. Some cell populations in the annotation are closely related, and this leads to annotation competition for our cells.

It is possible to run the annotation at the cluster level : it will be cleaner than the single cell level annotation. But, be sure that the clustering is not merging several cell populations.

We can check the number of cell attributed to labels from each cluster :

```{r table cells annot per clusters}
# Create a contingency table with in rows the cell labels and in columns the cluster numbers
table(sobj$singler_cells_labels,
      sobj$RNA_snn_res.0.8)
```

We can eventually check if some clusters contain multiple cell types. We compute the proportion of each cell type in each cluster. If a cluster is composed of two cell types (or more), maybe this resolution for the clustering is too low ?

```{r prop cell type per cluster}
# Compute the proportion of cell types per cluster
pop_by_cluster = prop.table(table(sobj$singler_cells_labels,
                                  sobj$RNA_snn_res.0.8),
                            margin = 2)
# Print number of cell types per cluster with >=30% from this cluster
colSums(pop_by_cluster > 0.3)
```

Be aware of :

-   **small weird clusters of cells** : they might be of interest BUT they can also be clustering artefacts

-   **very large clusters of cells** : if you notice that marker genes are representative of only a fraction of this large cluster, you might need to adjust the clustering parameters to be more discriminating.

-   

**Cluster level annotation**

We will now run the annotation at the cluster level (SingleR will summarize the expression profiles of all cells from the same cluster, and then assess the resulting aggregation) :

*Note : we run the same command as before (SingleR), we only add the parameter "cluster" to SingleR function to annotate by cluster and not by cell.*

```{r annotation by cluster, warning=FALSE}
# Rerun a prediction using clustering information 
clust_ann_predictions  =
    SingleR::SingleR(
    test = norm_exp_mat,
    clusters = sobj$RNA_snn_res.0.8,
    ref = annotation,
    labels = annotation$label.fine,
    assay.type.test = "logcounts",
    assay.type.ref = "logcounts",
    BPPARAM = BiocParallel::SerialParam()
  )

```

How many clusters have been labelled for each annotation label ?

```{r n cluster labels}
head(sort(table(clust_ann_predictions$labels), decreasing = TRUE))

```

For how many clusters was the annotation of poor quality ?

```{r pruned clusters}
summary(is.na(clust_ann_predictions$pruned.labels))

```

**Annotation diagnostic**

We can visualize the scores for each cell type, to each cell, as a heatmap :

```{r heatmap annot by cluster, fig.width=12, fig.height=8}
# Heatmap using the annotation prediction by cluster
SingleR::plotScoreHeatmap(clust_ann_predictions)

```

*What do you observe here ? What is the difference with the annotation by cell ?*

**Add annotation to metadata**

We add the annotation to our Seurat object.

```{r add annot by cluster to seurat metadata}
# Save the name of future annotation
clust_labels_col = "singler_clust_labels"
# Create a column with this name in the metadata and fill it with the cluster levels of each cell
sobj@meta.data[[clust_labels_col]] = sobj@meta.data$RNA_snn_res.0.8
# Fill associate each cluster with its annotation 
levels(sobj@meta.data[[clust_labels_col]]) = clust_ann_predictions$labels

```

**Visualization**

We can visualize cells annotation the the 2D projection :

```{r UMAP comparison annotation, fig.width=12, fig.height=8}
ann_cluster_plot = Seurat::DimPlot(
  object = sobj, 
  reduction = "umap", 
  group.by = clust_labels_col,
  pt.size = 2,
  label = FALSE, 
  cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")

ann_cell_plot = Seurat::DimPlot(
  object = sobj, 
  reduction = "umap", 
  group.by = "singler_cells_labels",
  pt.size = 2,
  label = FALSE,
  repel = TRUE, 
  cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")

ann_cluster_plot + ann_cell_plot
```

**Save your Seurat object annotated**

We save the annotated Seurat object :

```{r save Seurat}
# path <- "/shared/projects/<your_project>/<etc>/"
# base::saveRDS(
#   object = sobj,
#   file = paste0(path, "/Scaled_Normalized_Harmony_Clustering_Annotated_Seurat_Object.RDS")
# )
```

#### Conclusion - Automatic annotation using reference markers

1.  Find a good marker gene reference (PanglaoDB, CellMarker, CancerSEA...)

2.  Select a tool / model : **classifier**, **scoring function** ... 

3.  Annotate your dataset

+--------------------------------------------------+--------------------------------------------------------------------------------------------+
| Advantages                                       | Limitations                                                                                |
+==================================================+============================================================================================+
| -   Annotation for every single cell is possible | -   Find the good reference markers                                                        |
|                                                  |                                                                                            |
| -   Design your **own reference**                | -   Cell types **arborescence**                                                            |
|                                                  |                                                                                            |
|                                                  | -   **Limited** number of cell types : all cells are annotated, or "unknown" is possible ? |
+--------------------------------------------------+--------------------------------------------------------------------------------------------+

### 3 - Automatic annotation using a reference scRNAseq

Another possibility is to use a published single-cell dataset as a reference for the cluster annotation. Multiple tools exist to transfer the annotations on your own dataset.

![From : <https://doi.org/10.1186/s12859-021-04165-w> Fig1](illustration/automatic_annotation_refdataset.png){width="393"}

We are not going to use this method today but you might want to use it for your practicals.

Here are the main command from Single R.

```{r SingleR using published scRNAseq}
# # Load the reference packages from REF paper
# REF_SNRNASEQ = readRDS("reference_scRNAseq.RDS")

# # removing unlabelled libraries
# REF_SNRNASEQ = REF_SNRNASEQ[,!is.na(REF_SNRNASEQ$Cell.type)]

# #log normalise the library 
# REF_SNRNASEQ <- logNormCounts(REF_SNRNASEQ)

# # Create SingleCellExperiment object for your dataset 
# sx_sce = as.SingleCellExperiment(sx)

# # RUN SINGLE R 
# pred.grun = SingleR(test=sx_sce, 
#                     ref=REF_SNRNASEQ, # ref single cell data in "ref="
#                     labels=REF_SNRNASEQ$Cell.type.labels, 
#                     de.method="wilcox") # default method, you can choose others

```

#### **Conclusion - Automatic annotation using a reference scRNAseq**

1.  Find a good reference dataset : several bulk RNA-seq, one scRNA-seq...

2.  Select a tool to transfer annotation (SingleR, ...)

3.  Annotate your dataset

+-----------------------------------+------------------------------------------------------------------------------------------------------+
| Advantages                        | Limitations                                                                                          |
+===================================+======================================================================================================+
| -   **Single cell** level         | -   Find the good reference dataset                                                                  |
|                                   |                                                                                                      |
| -   Design your **own reference** | -   **Limited** number of cell types (you can only find cell types present in the reference dataset) |
+-----------------------------------+------------------------------------------------------------------------------------------------------+

### Conclusion

+-------------------------------------------------------------+---------------------------------------+--------------------------------------------------------------------------------+
| Method                                                      | Advantages                            | Limitations                                                                    |
+=============================================================+=======================================+================================================================================+
| **Manual cluster annotation** using differential expression | -   **Easy** to implement             | -   **Clustering** : resolution, merged clusters, "bio-informatic" cluster     |
|                                                             |                                       |                                                                                |
|                                                             | -   May be the **only** solution      | -   Change clustering ? Change annotation...                                   |
|                                                             |                                       |                                                                                |
|                                                             | -   **Everything** is possible        | -   **Knowledge** : time-consuming                                             |
+-------------------------------------------------------------+---------------------------------------+--------------------------------------------------------------------------------+
| Automatic annotation using **reference markers**            | -   **Single cell** level is possible | -   Find the good reference markers                                            |
|                                                             |                                       |                                                                                |
|                                                             | -   Design your **own reference**     | -   Cell types **arborescence**                                                |
|                                                             |                                       |                                                                                |
|                                                             |                                       | -   **Limited** number of cell types : all cells are annotated, or "unknown" ? |
+-------------------------------------------------------------+---------------------------------------+--------------------------------------------------------------------------------+
| Automatic annotation using **reference dataset**            | -   **Single cell** level             | -   Find the good reference dataset                                            |
|                                                             |                                       |                                                                                |
|                                                             | -   Design your **own reference**     | -   **Limited** number of cell types : all cells are annotated, or "unknown" ? |
+-------------------------------------------------------------+---------------------------------------+--------------------------------------------------------------------------------+

***A few advices :)***

-   It is recommended to combine multiple methods to annotate your data

    -   Use manual cluster annotation to identify quickly your cell populations

    -   Identify good markers for each cell populations → your reference markers

    -   Use automatic cell annotation using your set of marker → your reference dataset

    -   Use your references to annotate new dataset and go back to manual annotation to refine your analysis.

-   Sometimes, annotation reveals that the dataset would benefit from a re-clustering if you realize that some cluster could group 2 cell types or on the contrary, when two different cluster expressed very similar markers and should be merged.

-   During annotation, do not hesitate to look at the expression of Mitochondrial or Ribosomal genes (or any other set of genes) in your clusters. It might help you to identify a cluster of cells that are looks weird to you. Clusters of "artificial" cells - cells of low quality- could lead to the identification of weird (novel?!) cells that have no real biological significance. But **be careful** a cluster with a high expression of mitochondrial or ribosomal genes can have biological meaning sometimes.

[*Note about automatic annotation :*]{.underline}

*If you are working with **non model species** or with **multiple species** : it is not trivial to transfert an annotation from one species to another. Genes markers are not always conserved across the evolution. In this case, **manual annotation** is a very important sanity check of any automatic annotation !!*

### References

Good practices for single cell analysis : <https://www.sc-best-practices.org/preamble.html>

Sanger Single cell course : <https://www.singlecellcourse.org/index.html>

SingleR : <https://bioconductor.org/books/3.12/SingleRBook/>

### Ressources

***For human :***

GeneCard : <https://www.genecards.org>

Human Protein Atlas : <https://www.proteinatlas.org/search/H2-K1>

### Session

```{r session info }
sessionInfo()
```