---
title: "sincellTE 2024 - TP Secondary Analysis: cell annotation"
editor: source
date: 10/22/2024
date-format: "MMM D, YYYY"
format:
  html:
    theme: default
    css: style.css
    toc: true
    toc-location: left
    toc-depth: 2
    toc-title: "Contents"
    fontsize: 1em
    linestretch: 1.5
    code-fold: false
    self-contained: true
    df-print: kable
execute:
  eval: true
---

```{css, echo = FALSE}
.wide-table {
  width: 50%; /* This sets the table width to 100% of the page */
  max-width: 800px; /* Limit the maximum width */
  margin-left: auto; /* Center the table */
  margin-right: auto;
}

.justify {
  text-align: justify-all
}
```
# Objectives

Cell annotation is one of the first steps in exploring the biological information captured in your dataset. It allows you to assign a type, and sometimes a subtype or even a cell state (e.g., proliferating, apoptotic), to individual cells.

The objective of this session is to gain hands-on experience with various methods for cell annotation. Precise and reliable cell annotation takes time. Here, we will focus on understanding and critically evaluating these different methods.

# Methods for Cell Annotation

Cell annotation methods are based on a common principle: they use genes or gene sets specific to certain cell types to assign an identity to the cells.

However, these methods differ in how they accomplish this. They can be classified into two main approaches:

 - **Manual Annotation**: This method involves using marker gene sets from the literature, which are manually curated. In your dataset, clusters with high expression of these markers are annotated with the corresponding cell type.

 - **Automatic Annotation**: This approach utilizes marker sets from databases, or published single-cell reference atlases or datasets. Tools then automatically annotate the cells based on this information.

:::{.callout-note}
These two approaches are not mutually exclusive and can be combined.
:::

For this session, we will use the dataset prepared in the previous practical sessions.

```{r setup}
suppressWarnings({
    suppressMessages({
      library(tidyverse)
      library(S4Vectors)
      library(Seurat)
      library(ggplot2)
      library(patchwork)
    })
})

set.seed(123)
```

```{r load data}
user.id <- Sys.getenv("USER")
input_file <-paste("/shared/home2", user.id, "cell_annotation/data/SC26_FMTG_FILTERED_SCT_CLUSTERS.rda", sep = "/")
output_dir <- paste0("/shared/home2", user.id, "cell_annotation/results/", sep = "/")
load(input_file)
```

# 1. Manual Annotation

One way to annotate cell clusters is to examine the genes that are highly expressed in one cluster compared to all other cells, and then infer the corresponding cell type. These markers can be obtained from:

 - Knowledge of the field (e.g., CD3A for T cells, CD8A for CD8+ T cells).
 - Relevant literature related to your project.
 - Differentially expressed genes (DEGs): Performing differential expression analysis on different clusters can reveal specific markers that help identify each cluster. However, not everyone agrees that DEG analysis should be used this way.

## 1.1 Using One Marker per Cell Type

Here, we have a list of markers from the dataset's associated publication, allowing us to annotate 9 cell populations:

| Cell type            | Marker |
| -------------------- | ------ |
| **Proximal enterocytes** | Apoa4  |
| **Distal enterocytes** | Enpep |
| **Goblet cells** | Muc2 |
| **Tuft cells** | Alox5ap |
| **Stem cells** | Lgr5 |
| **Naive CD4+ T cells** | Ccr7 |
| **Memory CD4+ T cells** | S100a4 |
| **CD8+ T cells** | Cd8a |
| **B cells** | Ms4a1 |

In previous practical sessions, cell clustering was performed. Clustering at a resolution of 0.1 yielded 9 clusters, matching the number of cell types we are aiming to identify. This clustering will be used for the first manual annotation step.

Violin plots and feature plots are useful for visualizing marker expression across the different clusters.

```{r plot manual annotation individual markers}
## define the markers
feats <- c('Muc2', 'Enpep', 'Apoa4', "Alox5ap",
              'Ccr7', 'S100a4', 'Cd8a', 'Lgr5', 'Ms4a1')

## set the cells default identity to clustering
Idents(sobj) <- "SCT_snn_res.0.1"
DimPlot(sobj)
FeaturePlot(sobj, features=feats)
VlnPlot(sobj, feats, layer = "data")
```

After visualizing these plots, we can propose an initial annotation:

| Cluster |  Marker | Cell type            |
| ------- | ------- | -------------------- |
| 0 | Cd8a |  CD8+ T cells |
| 1 | Ms4a1 | B cells |
| 2 | Cd8a | CD8+ T cells |
| 3 | Enpep + strong expression of Apo4 | mix of proximal and distal enterocytes |
| 4 | Ccr7 + Muc2 | Naive CD4 + T cells and Goblet cells |
| 5 | S100a4  | Naive CD4 + T cells |
| 6 | Ms4a1 | B cells |
| 7 | Alox5ap + Lgr5 | Tuft cells + stem cells |

Using one marker per cell type allowed us to form initial hypotheses about the cell types in each cluster, but also revealed some limitations:

 - Some markers are not very specific: e.g., Apoa4 is expressed in a large proportion of clusters, though its expression is highest in cluster 2.

```{r proportion cells expressing apoa4}
sobj[["Apoa4+"]] <- as.numeric((sobj@assays$SCT@data['Apoa4',] > 0))

sobj@meta.data %>%
  ## group by cluster
  group_by(SCT_snn_res.0.2) %>% 
  ## compute the percentage of cells expressing the gene
  summarize(percentage = round(mean(`Apoa4+` == 1) * 100))
```
 - Some markers are specific but not expressed in all cells within a cluster: e.g., Lgr5 is detected almost exclusively in cluster 6, but only in 40% of the cells.

```{r proportion cells expressing lgr5}
sobj[["Lgr5+"]] <- as.numeric((sobj@assays$SCT@data['Lgr5',] > 0))

sobj@meta.data %>%
  ## group by cluster
  group_by(SCT_snn_res.0.1) %>% 
  ## compute the percentage of cells expressing the gene
  summarise(percentage = round(mean(`Lgr5+` == 1) * 100))
```

:::{.callout-warning}
### Question

Why a marker such as Lgr5+ might not be found in all the cells of a cluster ?
:::

## 1.2 Using a Marker List

Using only one marker per cell type can be risky. Instead, using a list of markers can help mitigate biases due to unexpressed genes or unspecific markers. To improve cluster annotation, we will now use a list of markers with several genes for each cell type.

```{r markers list}
marker_genes <- list(
    CD8 = c('Cd8a', 'Gzmb', 'Gzma', 'Cd7') ,
    CD4_naive = c('Ccr7', 'Ms4a4b', 'Cd27', 'Cd28'),
    CD4_memory = c('S100a4', 'Cd44', 'Ccl9'),
    B_cells = c('Ms4a1', 'Cd79a', 'Cd74'),
    Enterocyte_distal = c('Enpep', 'Anpep', 'Ace2'),
    Enterocyte_proximal = c('Apoa4', 'Dstn', 'Fabp2'),
    Goblet_cells = c('Muc2', 'Agr2', 'Spink4', 'Tff3'),
    Stem_cells = c('Lgr5', 'Hck', 'Nrgn'),
    Tuft_cells = c('Alox5ap','l1b', 'S100a9'))

## keep only genes also found in the count matrix
marker_genes <- lapply(marker_genes, function(vec) { 
  intersect(vec, rownames(sobj))
})
```

A dot plot is useful for assessing the expression of multiple genes simultaneously in populations of cells.

```{r dot plot marker list}
#| fig-height: 12
#| fig-width: 16

DotPlot(sobj, features = marker_genes) +
  theme(axis.text.x=element_text(angle = 45, vjust=0.5, hjust = 1),
        strip.text = element_text(angle = 45, hjust = 0.5))
```

The `AddModuleScore()` function calculates a score for each cell based on the expression of a list of genes. This score, automatically added to the metadata, can then be plotted on the UMAP.

```{r compute module score}
sobj <- AddModuleScore(sobj, features = marker_genes,
                                pool = NULL,
                                nbin = 5,
                                seed = 1,
                                ctrl = length(marker_genes),
                                k = FALSE,
                                name = "program")

## score for each list of gene is named program1, program2...
## rename the columns in metadata to get more interpretable names
vec_rename <- paste0("program", seq(1:length(marker_genes)))
names(vec_rename) <- names(marker_genes)
sobj@meta.data <- sobj@meta.data %>% 
  dplyr::rename(all_of(vec_rename))

head(sobj@meta.data)
```

```{r plot module score}
DimPlot(sobj, group.by = "SCT_snn_res.0.2")
FeaturePlot(sobj, features = names(marker_genes))
```
With this new information, we can refine the initial annotation by assigning each cluster the cell type with the highest score.

| Cluster |  Cell type |
| ------- | ---------- |
| 0 | CD8+ T cells |
| 1 | B cells |
| 2 | CD8+ T cells |
| 3 | proximal and distal enterocytes with higher score for distal enterocytes. Will be called enterocytes. |
| 4 | naive CD4+ T cells |
| 5 | Memory CD4+ T cells + Goblet cells |
| 6 | B cells |
| 7 | Stem cells and Tuft cells |

Based on these observations, the clusters can now be manually annotated.

```{r cluster annotation}
sobj[["seurat_clusters"]] <- sobj[["SCT_snn_res.0.2"]]

sobj@meta.data <- sobj@meta.data %>% 
  mutate(
    manual_annotation = case_when(
        seurat_clusters == "0" ~ "CD8+ T cells",
        seurat_clusters == "1" ~ "B cells",
        seurat_clusters == "2" ~ "CD8+ T cells",
        seurat_clusters == "3" ~ "Enterocytes",
        seurat_clusters == "4" ~ "Naive CD4+ T cells",
        seurat_clusters == "5" ~ "Memory CD4+ T cells + Goblet cells",
        seurat_clusters == "6" ~ "B cells",
        seurat_clusters == "7" ~ "Stem cells + Tuft cells"))

Idents(sobj) <-"manual_annotation"
DimPlot(sobj,
        pt.size = 0.1,
        label = FALSE ,
        repel = TRUE ,
        label.size = 3)
```

## 1.3 Summary

::: {.wide-table}
| Advantages | Limits |
| ---------- | ------ |
| Easy to implement | Working at the cluster resolution: risk of merging or splitting actual populations |
| Sometimes the only solution (ex : novel tissue) | Change clustering ? Change annotation |
| Flexible | Time-consuming, requires knowledge |
 |
:::

## 1.4 Beyond cluster resolution

Since each cell has a score for the different gene sets, annotation could also be performed at the cell level. Try to think of a few commands to perform this annotation.

:::{.callout-tip collapse=true}

### Answer

```r
df <- sobj@meta.data %>%
  dplyr::select(names(marker_genes))

sobj[["module_annotation"]] <- apply(df, 1, function(row) names(df)[which.max(row)])
```
:::

# 2. Automatic annotation 

For some tissues, the different cell types have already been largely described and databases exist with lists of referenced marker genes or datasets. This is particularly true for human and mouse.

Instead of manual annotation, these datasets can be used, along with published tools to call the cell types in your data. This process is called automatic annotation.

## 2.1 Using a list of marker genes.

In this example, we'll annotate cells using a list of marker genes from the *Panglao* database, combined with the `CelliD` package.

#### *The database*

Panglao is an online gene-centric database, meaning it focuses solely on genes rather than entire single-cell datasets.

```{r download panglao}
#| message: false
#| warning: false

library(CelliD)

df_panglao <- read_tsv("https://panglaodb.se/markers/PanglaoDB_markers_27_Mar_2020.tsv.gz")

head(df_panglao)
```

In *Panglao*, each row corresponds to a gene and the each column to one of its characteritics:

 - `species` indicate if the gene is expressed in mice and/or human
 - `cell type` is the final label used to annotate our dataset.
 - `gene type` indicates the biotype of the gene: protein-coding, long non coding RNA... Can be used for markers selection.
 - `organ` selecting the right organs is good practice to avoid aberrant annotations.
 
The quality and specificity of the reference are important to unmeaningful annotations. Here the columns `species` and `organ` will be used to select the subset of markers useful for our annotation.

```{r select and markers}
df_markers <- df_panglao %>%
  ## restricting to human specific genes
  filter(str_detect(species,"Hs"),
         organ %in% c("Immune system", "GI tract")) %>%
  mutate(`official gene symbol` = str_to_title(`official gene symbol`)) %>%
  ## converting dataframes into a list of vectors, which is the format needed as input for CelliD
  group_by(`cell type`) %>%
  summarise(geneset = list(`official gene symbol`))
```

#### *The method*

Here the package `CelliD` will be used. It is a clustering-free method developped by Cortal *et al.*. It calculates a per-cell gene signature score, which enables cell identity recognition. This method can also be applied to explore functional pathway enrichment.

`CelliD` can take `Seurat` objects as input. The reference for annotation must be provided as a named list of genes.

```{r format reference}
annot_gs <- setNames(df_markers$geneset, df_markers$`cell type`)
annot_gs <- annot_gs[sapply(annot_gs, length) >= 10]

annot_gs[c("B cells", "B cells memory")]
```

`CelliD` relies on a dimensionality reduction of the expression matrix in a low-dimension space common to cells AND genes with the method MCA. In such space, the closer a gene is to a cell, the more specific it is to this cell. Thus, a gene ranking is obtained for each cell based on their distance to the cell in the MCA space. For each cell, the closest genes can be regarded as a unique cell identity card and used to annotate it.

![illustration_cellid.png](./resources/images/illustration_cellid.png)

More precisely, a statistical test (hypergeometric test) associates to each cell, given its more specific genes, a probability to be of the different cell type. The cell type with the lowest p-value is selected.

```{r cellid annotation}
#| message: false
#| warning: false

## run MCA to project cells and genes in a common low-dimensionality space
sobj <- RunMCA(sobj)

# Performing per-cell hypergeometric tests against the gene signature collection
HGT_gs <- RunCellHGT(sobj, pathways = annot_gs, dims = 1:50, n.features = 200)

# For each cell, select the signature with the lowest corrected p-value (max -log10 corrected p-value)
gs_prediction <- rownames(HGT_gs)[apply(HGT_gs, 2, which.max)]

# For each cell, evaluate if the lowest p-value is significant
gs_prediction_signif <- ifelse(apply(HGT_gs, 2, max)>2, yes = gs_prediction, "unassigned")

# Save cell type predictions as metadata within the Seurat object
sobj$cellid_annotation <- gs_prediction_signif
```

:::{.callout-note}
If `CelliD` cannot attribute a significant annotation to a cell, it will be labelled "unassigned". They can still be manually annotated.
:::

#### *The results*

`CelliD` results can be visualized on the UMAP.

```{r plot cellid annotation dimplot}
#| fig-height: 16
#| fig-width: 12

p1 <- DimPlot(sobj,
        group.by = "SCT_snn_res.0.1",
        pt.size = 0.1,
        label = TRUE ,
        repel = TRUE ,
        label.size = 3) + NoLegend()

p2 <- DimPlot(sobj,
        group.by = "cellid_annotation",
        pt.size = 0.1,
        label = TRUE ,
        repel = TRUE ,
        label.size = 3)

p1 / p2
```

There appears to be many cells types on a single plot. Visualizing them all in the same time can be difficult. Instead, it is also possible to facet the plot in order to visualize 1 cell type at a time.

```{r plot cellid annotation}
#| fig-height: 18
#| fig-width: 16

list_plots <- lapply(unique(sobj$cellid_annotation), function(celltype) {
  chl <- setNames(list(subset(sobj, cellid_annotation == celltype) %>% 
                         colnames()),
                celltype)
  
  return(DimPlot(sobj, group.by = "cellid_annotation", cells.highlight = chl))
})

wrap_plots(list_plots, ncol = 4)
```

`CelliD` returned many more populations than the 9 populations of the manual annotation. This is in line with the number and variety of cell types in the reference list of genes. 

```{r reference list length}
names(annot_gs)
```

These results show why manual curation can still be needed after automatic annotation:

 - Some annotations concern only a very small number of cells. It might be false positives, artifacts or actual cell populations expected to be small. Biological knowledge is needed to decide whether the cells should be kept or renamed. Be careful with interpretation.
 
```{r small populations}
sobj@meta.data %>% 
  group_by(cellid_annotation) %>% 
  tally() %>% 
  arrange(n) %>% 
  head()
```
 
 - Some annotated cell populations are closely related, which can lead to competition between cell types for accurate annotation. For example the B cells populations might probably all be merged into 1 `B cell` label. For cluster 0, the situation is more complicated: 
 
    - does it contain NK cells or T cells ? 
    - if it contains T cells, which ones ? 

```{r cluter 0 composition}
# Example cluster 0
sobj@meta.data %>% 
  filter(SCT_snn_res.0.2 == 0) %>% 
  group_by(cellid_annotation) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  head()
```

To answer this question, there are several possibilities:

  - count the labels occurence and choose the majoritary label(here NK cells would win). 

  - include a biologist in the process and use biological knowledge: which cell type is expected in this dataset ? Why ? Can very specific markers help to identify the cells ?

  - use other references and annotations tools and compare the results.


:::{.callout-note}
Cell level annotations might look noisy. For smoother results, some tools can run at the cluster level. Or you can manually annotate clusters with the main label returned by your annotation tool. However, make sure that your clustering level is the right one and does not merge several populations.
:::

## 2.2 Using a reference dataset

Some tools are able to use already annotated datasets to annotate new ones, on basis of the transcriptional profiles of the cells. This process is sometimes called annotation transfer.

![illustration_cellid.png](./resources/images/illustration_automatic_annotation_dataset.png)

Their use is facilitated by the increasing number of published datas and the existence of atlases integrating several datasets. If a dataset suitable for your own data exist, it can be used as a reference for automatic annotation

In this example, cells will be annotated using a dataset from the package `celldex` using the tool `SingleR`.

#### *The database*

`celldex` provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq.

```{r load celldex dataset immgenndata}
#| message: false
#| warning: false

library(celldex)
library(SingleR)

immgen_ref <- celldex::ImmGenData()
```

Typically, each reference provides three levels of cell type annotation in its column metadata:

 - `label.main`, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.
 - `label.fine`, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.
 - `label.ont`, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.

Here the dataset `ImmGennData` will be used.

#### *The method*

Here the package `SinglR` will be used.

`SingleR` usually works with log normalized data. Data were already normalized with SCT but, for this application, the previous normalization will be applied to all gene couts.

```{r compute log norm}
DefaultAssay(sobj) <- "RNA"

sobj <- sobj %>% 
  FindVariableFeatures() %>% 
  NormalizeData() %>% 
  ScaleData()

## Fro Seurat v5, join layers for integrated objects
sobj[["RNA"]] <- JoinLayers(sobj[["RNA"]])
```

`SingleR` does not take Seurat objects as input but either matrices or `SummarizedExperiment` objects. The reference is already in this `SummarizedExperiment` format. For the input data, the expression matrix needs to be extracted or converted to a `SingleCellExperiment` object, a format inherited from `SummarizedExperiment`, with the command line.

```r 
sce <- as.SingleCellExperiment(sobj)
```

Then, labels can be computed just by running the `SingleR()` function.

```{r compute singler annotation}
df_labels <- SingleR(test = GetAssayData(sobj, assay = "RNA", layer = "data"),
                       ref = immgen_ref,
                       labels = immgen_ref$label.main,
                       assay.type.test = "logcounts")

head(df_labels) %>% 
  as.data.frame()

invisible(gc())
```

#### *The results*

`SingleR()` returns a table with scores and prediction results for each cell. Labels are shown before (`labels`) and after a pruning  step (`pruned.labels`), which removes potentially poor-quality or ambiguous assignments.

These results can be added to the `Seurat` object and then visualized.

```{r add singler annotation}
df <- df_labels %>% 
  as.data.frame() %>% 
  mutate(singler_annotation = pruned.labels)

sobj <- AddMetaData(sobj, df["singler_annotation"])

rm(df)
```

:::{.callout-warning}
### Question

Plot the SingleR labels on a UMAP. What observations can you make about the SingleR labels ?
:::


```{r plot singler annotations}

```

Here again, there appears to be many cells types. The same kind of curation than with the `CelliD` example can be applied.

`SingleR` also comes with it own diagnosis tools. The function `plotScoreHeatmap()` displays the scores for all cells across all reference labels, which allows to inspect the confidence of the predicted labels.

```{r plot scores heatmap}
plotScoreHeatmap(df_labels)
```
Ideally, each cell should have one score that is significantly higher than the rest, indicating a clear assignment to a single label. However, when multiple similar scores are present for a given cell, it suggests uncertainty in the assignment.

In this case, we observe that the labels for T cells, ILCs, and NK cells are closely related, which is where domain expertise and highly specific marker sets become crucial. Additionally, the labels for Monocytes and Macrophages could potentially be merged due to their similarity.

## 2.3 Summary

::: {.wide-table}

| Advantages | Limits |
| ---------- | ------ |
| Works at the single cell level |  Results heavily rely on the quality and completeness of the reference (be careful of missing labels) |
| Many resources for reference | Possible to mislabel populations |
| Flexible | Results can be noisy |

:::

```r 
#export annotated object
saveRDS(sobj, file=file.path(output_dir, "seurat_object_annotated.rds"))
```

# References and documentation

CellID: [https://www.bioconductor.org/packages/release/bioc/html/CelliD.html](https://www.bioconductor.org/packages/release/bioc/html/CelliD.html)

SingleR: [https://www.bioconductor.org/packages/release/bioc/html/SingleR.html](https://www.bioconductor.org/packages/release/bioc/html/SingleR.html)

Many other tools. A review summarizing many annotation methods: [A Review of Single-Cell RNA-Seq Annotation, Integration, and Cell–Cell Communication](https://www.mdpi.com/2073-4409/12/15/1970)

Human Cell Atlas: [https://www.humancellatlas.org/](https://www.humancellatlas.org/)

Tabula muris: [https://tabula-muris-senis.ds.czbiohub.org/](https://tabula-muris-senis.ds.czbiohub.org/)

GeneCard : [https://www.genecards.org](https://www.genecards.org)

# Session info

```{r session infos}
sessionInfo()
```