---
title: "sincellTE 2024 - TP Secondary Analysis: differential expression, functional enrichment"
editor: source
date: 10/22/2024
date-format: "MMM D, YYYY"
format:
  html:
    toc: true
    toc-depth: 4
    toc-expand: 4
    theme: default
    fontsize: 1em
    linestretch: 1.5
    code-fold: false
    self-contained: true
    df-print: kable
    math: mathjax
execute:
  eval: true
---

# Objectives

After identifying cell populations, the next crucial steps—differential gene expression (DGE) and functional enrichment—uncover the molecular signatures of cells and elucidate their functional roles. These steps are vital for exploring cellular heterogeneity and extracting meaningful biological insights from data.

In this session, we will perform DGE analysis to identify genes that serve as specific markers for cell populations or are significantly altered between conditions. Additionally, we will explore functional enrichment methods to interpret these findings in the context of biological pathways.

```{r setup}
suppressWarnings({
    suppressMessages({
      library(tidyverse)
      library(Seurat)
      library(clusterProfiler)
      library(enrichplot)
      library(org.Mm.eg.db)
      library(msigdbr)
      library(AUCell)
      library(ggplot2)
    })
})

options(future.globals.maxSize = 2.5 * 1e9)
```

# 1. Differential expression

The dataset used for these sessions comes from the article "p53 drives a transcriptional program that elicits a non-cell-autonomous response and alters cell state in vivo" by S.M. Moyer, et al. PNAS 2020. (https://doi.org/10.1073/pnas.2008474117).

It contains 2 single cell samples corresponding to 2 conditions: cells with overactivated p53 (FMTG) and wild-type cells (TG).

For the needs of this session, we will to load an object with both datasets already integrated.

```{r load data}
user.id <- Sys.getenv("USER")
input_file <-paste("/shared/home2", user.id, "resources/data/seurat_object_annotated.rds", sep = "/")
output_dir <- paste0("/shared/home2", user.id, "deg/results/", sep = "/")
sobj <- readRDS(input_file)
```

## 1.1 Cell population markers

After clustering or annotating cells, we can identify genes specific to each population by detecting differentially expressed genes (DEGs) between a given population and all other cells. The function `FindAllMarkers()` facilitates this analysis for each cell population.

![](./resources/images/illustration_specific_population_markers.png)

First, we need to set the cell identities to indicate the populations in our Seurat object. We will utilize the feature `celltype` for this purpose.

```{r set up identity}
sobj <- SetIdent(sobj, value = "celltype")
levels(sobj@active.ident)
```

Next, we choose a method for DEG computation. Several statistical tests are available in Seurat. Methods with fewer assumptions generally yield more robust results. Here, we will use the default Wilcoxon test but many other methods are available: `wilcox`, `wilcox_limma`, `bimod`, `roc`, `t`, `negbinom`, `poisson`, `LR`, `MAST`, `DESeq2`...

Some methods, like wilcox, require normalized data. We have two options:

 - Use log-normalized data.
 
```r
DefaultAssay(sobj) <- "RNA"

## get log normalized data if not already done
# sobj <- sobj %>% 
#   FindVariableFeatures() %>% 
#   NormalizeData() %>% 
#   ScaleData()

## compute population markers
df_markers <- FindAllMarkers(sobj,
                             assay = "RNA",
                             logfc.threshold = log2(1.5),
                             min.pct = 0.5,
                             only.pos = T,
                             verbose=F)
```
 
 - Use SCT normalized data. If working with integrated datasets, an extra-step is needed to reverse the SCT regression model to UMI counts.
 
```{r compute cell type markers}
#| warning: false

sobj <- PrepSCTFindMarkers(sobj, assay="SCT", verbose = F)
df_markers <- FindAllMarkers(sobj,
                             assay = "SCT",
                             logfc.threshold = log2(1.5),
                             min.pct = 0.5,
                             only.pos = T,
                             verbose=F)
```

By default, Seurat tests all expressed genes. However, you can modify the behavior with arguments like:

 - `min.pct`: Only test genes detected in at least min.pct of cells (default is 0.1).
 - `logfc.threshold`: Limit testing to genes with a minimum log-fold change (default is 0.1).
 - `only.pos`: Return only positive markers (FALSE by default).

For this session, we will adjust these arguments to focus on strong population markers.

The genes with significant differential expression can then be selected by keeping only adjusted p-values that are small enough.

```{r select significant genes}
df_markers_signif <- df_markers %>%
    filter(p_val_adj < 0.05)

head(df_markers_signif)
```

Significant DEGs can be identified by filtering based on adjusted p-values. The results table includes:

 - Gene name
 - The p-value and adjusted p-value. To be considered differentially expressed, genes adjusted p-values must be smaller than a chosen threshold are. 
 - Log-fold change: $ \log \left( \frac{\text{expr(gene)}_{\text{condition 2}}}{\text{expr(gene)}_{\text{condition 1}}} \right) $.
 - Percent of cells expressing the gene in each condition

```{r compute number marker genes}
nb_markers <- df_markers_signif %>% 
  group_by(cluster) %>% 
  tally()

nb_markers

ggplot(nb_markers, aes(x = cluster, y=n, fill=cluster)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
```

The table of DEG is not automatically added to the Seurat object. It is possible to add it manually, for example in the slot:

```{r add deg table to seurat object}
sobj@misc$marker_genes <- df_markers_signif
```

The DEG results can be saved as a CSV file for further exploration:

```
write.csv(df_markers, file.path(output_dir, "de_genes_FindAllMarkers.csv"), row.names = F, quote = F)
```

Finally, visualizations include:

 - **Violin plot** and **feature plot**.

```{r top markers violin plots}
#| fig-height: 8
#| fig-width: 16
genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=1) %>% 
  pull(gene, name=cluster)

DefaultAssay(sobj) <- "SCT"
VlnPlot(sobj, features = genes_of_interest, slot="data", ncol=4)
```

```{r top markers feature plot}
#| fig-height: 8
#| fig-width: 16
FeaturePlot(sobj, features = genes_of_interest,
            max.cutoff = 'q90', ncol = 4)
```

 - **Dot plot** for top markers per cell type.

```{r top 3 markers dot plot}
#| fig-height: 6
#| fig-width: 12

genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=3) %>% 
  pull(gene)

DotPlot(sobj, features = genes_of_interest,
        col.min = 0) +
  RotatedAxis()
```

 - **Heatmaps** for a comprehensive view of markers.

```{r heatmap of cell types top 50 markers}
genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=50) %>% 
  pull(gene)

DoHeatmap(sobj, features = genes_of_interest, label=F) +
  theme(axis.text.y = element_blank()) # comment to display gene names
```

:::{.callout-note}
Seurat uses the scale.data layer for heatmaps, containing only the most variable genes identified during normalization. If certain genes aren't displayed, they may not be highly variable so this should not be a problem. To include them, rerun log-normalization on specific genes.

```r
sobj <- sobj %>% 
  FindVariableFeatures(nfeatures = nrow(sobj), verbose = FALSE) %>% 
  NormalizeData() %>% 
  ScaleData(seurat_obj, features = genes_to_scale)
```

:::

## 1.2 Differential expression between populations or conditions

Having identified markers for each cell type, we can also perform pairwise comparisons to test for differential gene expression between:

 - two cell populations.
 
![](./resources/images/illustration_population_degs.png){fig-align="center"}

 - cells of the same population under different conditions (e.g., treated vs. control).

![](./resources/images/illustration_condition_degs.png){fig-align="center"}

For instance, we can examine the effect of p53 overactivation on enterocytes by comparing enterocytes with overactivated p53 (FMTG) to wild-type enterocytes (TG).

The `FindMarkers()` function enables this analysis. A few command lines will allow us to test the two conditions for the enterocyte population.

```{r deg single cell method wilcoxon}
## subset the dataset to keep only population of interest
sobj_entero <- subset(sobj, idents = "Enterocyte_proximal")

## the condition to test is the genotype
Idents(sobj_entero) <- "orig.ident"
levels(sobj_entero)

## let's work on log-normalized counts
DefaultAssay(sobj_entero) <- "RNA"
sobj_entero <- sobj_entero %>%
  FindVariableFeatures() %>%
  NormalizeData() %>%
  ScaleData()

# deg_entero <- PrepSCTFindMarkers(sobj_entero)
deg_entero <- FindMarkers(sobj_entero, 
                       ident.1 = "SC26_FMTG",
                       ident.2 = "SC26_TG",
                       test.use = "wilcox",
                       logfc.threshold = log2(1.5),
                       min.pct = 0.5) %>% 
                       # recorrect_umi = FALSE) %>% 
  filter(p_val_adj < 0.05) %>%
  arrange(p_val_adj, desc(avg_log2FC))

sobj@misc$deg_entero <- deg_entero
```

# 2. Function enrichment

## 2.1 In a cell population

Once DEGs are identified, functional enrichment helps interpret the biological significance of these genes. It tests whether genes belonging to certain biological functions are over- or under-expressed beyond what would be expected by chance.

We will use the `clusterProfiler` package for this analysis, employing over-representation analysis (ORA) and Gene Set Enrichment Analysis (GSEA).

### Over representation analysis (ORA)

ORA is a common method for functional enrichment, assessing whether more genes in your list of DEGs are associated with a specific biological function than expected by chance.

Key inputs include:

 - **the list of DEGs**: Generally requires genes in `Entrez ID` format. We can convert gene names from Symbol format using `clusterProfiler::bitr()`
 
```{r gene names convertion to entrez id}
## add symbol gene names as a columns
deg_entero <- deg_entero %>% 
  rownames_to_column("SYMBOL")

## get entrez gene names thanks to bitr
trad_sbl_entrez <- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db")

## join both data tables using symbol gene names as key
deg_entero <- deg_entero %>% 
  inner_join(trad_sbl_entrez)

head(deg_entero)

## get list of DEG using their entrez id
list_deg_entero <- deg_entero$ENTREZID
```

:::{.callout-note}
The annotation types supported by the by the `org.Mm.eg.db` database ca be listed:

```{r list possible labels for genes}
AnnotationDbi::keytypes(org.Mm.eg.db)
```
:::

 - **a reference**: A list of biological functions represented as annotated gene sets (e.g., Gene Ontology (GO), KEGG).

We will query the GO database using the `org.Mm.eg.db` reference to perform enrichment tests.

:::{.callout-tip}
For other organisms, you can find available OrgDbs on bioconductor.
:::

```{r perform ORA with GO-BP}
## perform ORA
go <- enrichGO(list_deg_entero, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300)

## avoid redundancy in result terms
go <- simplify(go)
```

Key parameters include:

 - `ont`: Choose a GO subcategory between Biological Process ("BP"), Molecular Function ("MF"), Cellular Component ("CC"), or all ("ALL").
 - `minGSSize`: Exclude gene sets with fewer than minGSSize genes (removes small sets prone to noise).
 - `maxGSSize`: Exclude gene sets with more than maxGSSize genes (removes overly broad sets).

```{r ORA results}
head(go@result)
```

The function returns an `enrichResult` object with ORA table stored in the `@result` slot. Important columns include:

 - `ID` and `Description`: GO identifier and brief description of the biological process.
 - `Count`: Number of genes from the input dataset annotated to the specific GO term.
 - `GeneRatio`: Ratio of genes in your input dataset associated with the GO term.
 - `BgRatio`: Ratio of genes associated with the GO term in the reference gene set.
 - `p.adjust` and `qvalue`: Adjusted p-values for selecting significantly enriched functions.

Results can be visualized using a dot plot.

```{r dotplot GO}
#| fig-height: 8
dotplot(go, showCategory = 20)
```

Prebuilt functions like `enrichGO()` and `enrichKEGG()` enable users to query widely used ontologies. The `enricher()` function facilitates the testing of over-representation for any gene set, for example MSigDB.

```{r load MSigDB}
## load MSigDBr database, subcategory H
gmt <- msigdbr(species = "mouse", category = "H")
head(gmt)
```

```{r perform ORA with msigdb}
ora <- enricher(gene = list_deg_entero, pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, TERM2GENE = gmt[,c("gs_name", "entrez_gene")], minGSSize = 10, maxGSSize = 300)

head(ora@result)
```
:::

:::{.callout-tip}
ORA does not indicate the direction of differential regulation. To gain more insight, you can filter for only significantly up- or down-regulated genes.
:::

### Gene Set Enrichment Analysis (GSEA)

GSEA evaluates whether a set of genes shows coordinated expression changes, analyzing all genes even if they do not reach significance thresholds.

For GSEA, prepare an ordered list of all genes. They are typically sorted by fold change but other criteria could be chosen, such as the (fold change * adjusted p-value) product. 

```{r prepare gsea list}
## rerun DEG identification to keep as many genes as possible
## even not significant genes
deg_entero <- FindMarkers(sobj_entero, 
                       ident.1 = "SC26_FMTG",
                       ident.2 = "SC26_TG",
                       test.use = "wilcox") %>% 
  arrange(desc(avg_log2FC)) %>% 
  rownames_to_column("SYMBOL")

## get entrez gene names thanks to bitr
trad_sbl_entrez <- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db")

## join both data tables using symbol gene names as key
deg_entero <- deg_entero %>% 
  inner_join(trad_sbl_entrez)

## get list of DEG using their entrez id
gsea_input_list <- deg_entero %>%
  pull(avg_log2FC, name = ENTREZID)
```

We can run GSEA on the GO ontology.

```{r GSEA GP-BP}
gsea_go <- gseGO(gsea_input_list, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300) %>% 
  simplify()

head(gsea_go@result)
```

Important columns include:

 - `setSize`: Size of the gene set in the input list.
 - `ES` and `NES`: Enrichment score and its normalized version. A positive score indicates over-expression and a negative score indicates under-expression.
 - `padj` and `qvalues`: Adjusted p-values associated with the test.

Results can be visualized with plots.

```{r gsea dot plot}
#| fig-height: 12

## separate function with mainly over-expressed genes
## from functions with mainly under-expressed genes
gsea_go@result <- gsea_go@result %>% 
    mutate(nes_sign = ifelse(NES > 0, "Positive", "Negative"))

dotplot(gsea_go, showCategory = 10, split = "nes_sign")  +
  facet_grid(~nes_sign)
```

:::{.callout-warning}
### Question

What observations can you make about the enrichment function results ?
:::

## 2.2 At single cell level gene set activity

Some methods calculate the activity of pathways or gene sets at the single-cell level. This can be used to visualize the activation of some functions of interest across a whole sample, cell annotation or cell state identification.

Here we will show how to perform this type of analysis with the package `AUCell`. For time purpose, we will focus on 1 apoptosis signature suggested by the authors of the article of origin.

```{r create a custom gene signature}
## create a list of gene sets per GO term
gene_set <- list("p53_signature" = c("Ccng1", "Mdm2", "Eda2r", "Gtse1", "Polk", "Psrc1", "Zfp365"))
```

:::{.collout-note}
It would be possible to construct a list of gene sets taken from popular ontologies.
:::

Then the GSEA at a single cell level is performed.

```{r compute scGSEA}
## get express matrix
exprMatrix <- GetAssayData(sobj, assay="RNA", layer = "count")

## computations
cells_rankings <- AUCell_buildRankings(exprMatrix, plotStats=F)
cells_AUC <- AUCell_calcAUC(gene_set, cells_rankings)
```

`AUCell` has a few functions to plot the results.

```{r plot results with AUCell}
AUCell_plotTSNE(Embeddings(sobj, reduction="umap")[,c(1,2)], exprMatrix, cells_AUC)
```

Alternatively, the results can be added to the `Seurat` object metadata and plotted with Seurat functions.

```{r plot results with Seurat}
auc_df <- AUCell::getAUC(cells_AUC) %>% 
  t() %>% 
  as.data.frame()

sobj <- AddMetaData(sobj, metadata = auc_df)

VlnPlot(sobj, features = "p53_signature")
FeaturePlot(sobj, features = "p53_signature", max.cutoff = 'q95')
```

# References and documentation

clusterProfiler: [https://yulab-smu.top/biomedical-knowledge-mining-book/index.html](https://yulab-smu.top/biomedical-knowledge-mining-book/index.html)

AUCell: [https://www.bioconductor.org/packages/devel/bioc/vignettes/AUCell/inst/doc/AUCell.html](https://www.bioconductor.org/packages/devel/bioc/vignettes/AUCell/inst/doc/AUCell.html)

scGSVA: a tool for single cell level GSEA coming with many functions for references building and results representation, actually more user-friendly than AUCell: [https://github.com/guokai8/scGSVA](https://github.com/guokai8/scGSVA)

# Session info

```{r session info}
sessionInfo()
```


# Bonus: pseudo-bulk example

```{r pseudo-bulk example}
#| eval: false

## load data
## the data is pro B cells in 2 conditions: 
## healthy donors (PBMC) and leukemia patient (ETV6-RunX1)
## they come from this paper
## Caron M et al. Scientific Reports. 2020;10:1–12. Available from: http://dx.doi.org/10.1038/s41598-020-64929-x
proB <- readRDS("resources/data/proB.rds")

## 1 cell type only: pro B cells
unique(proB@meta.data$SingleR_annot)
unique(proB@meta.data$type)

## 3 replicates
unique(proB@meta.data$orig.ident)

Idents(proB) <- "orig.ident"

## method 1: stay in Seurat
sobj_pseudoblk <- AggregateExpression(proB,
                            assays = "RNA",
                            slot = "counts",
                            return.seurat = T)

sobj_pseudoblk@assays$RNA@layers$counts

sobj_pseudoblk@meta.data <- sobj_pseudoblk@meta.data %>% 
  mutate(type = ifelse(str_detect(orig.ident, "PBMMC"),
                       "PBMC",
                       "ETV6"))

Idents(sobj_pseudoblk) <- "type"

markers <- FindMarkers(sobj_pseudoblk, 
                       ident.1 = "PBMC", 
                       ident.2 = "ETV6",
                       test.use = "DESeq2")

markers_signifs <- markers %>% 
  filter(p_val_adj < 0.05)

## method 2: just get raw counts and use you favorite bulk RNA-Seq tool
cts <- AggregateExpression(proB,
                            assays = "RNA",
                            slot = "counts",
                            return.seurat = F)
cts <- cts$RNA

dim(cts)
colnames(cts)

coldata <- data.frame(sample = colnames(cts)) %>%
    mutate(condition = ifelse(grepl("PBMMC", sample), "ctrl", "cancer")) %>%
    column_to_rownames("sample")

library(DESeq2)

dds <- DESeqDataSetFromMatrix(cts, colData = coldata, design = ~ condition)

keep <- rowSums(cts) >= 10
dds <- dds[keep,]
dds <- DESeq(dds)

markers_singif2 <- results(dds, name = "condition_ctrl_vs_cancer") %>%
    as.data.frame() %>%
    arrange(padj) %>%
    filter(padj < 0.05)
```