sincellTE 2024 - TP Secondary Analysis: differential expression, functional enrichment

Published

Oct 22, 2024

Objectives

After identifying cell populations, the next crucial steps—differential gene expression (DGE) and functional enrichment—uncover the molecular signatures of cells and elucidate their functional roles. These steps are vital for exploring cellular heterogeneity and extracting meaningful biological insights from data.

In this session, we will perform DGE analysis to identify genes that serve as specific markers for cell populations or are significantly altered between conditions. Additionally, we will explore functional enrichment methods to interpret these findings in the context of biological pathways.

suppressWarnings({
    suppressMessages({
      library(tidyverse)
      library(Seurat)
      library(clusterProfiler)
      library(enrichplot)
      library(org.Mm.eg.db)
      library(msigdbr)
      library(AUCell)
      library(ggplot2)
    })
})

options(future.globals.maxSize = 2.5 * 1e9)

1. Differential expression

The dataset used for these sessions comes from the article “p53 drives a transcriptional program that elicits a non-cell-autonomous response and alters cell state in vivo” by S.M. Moyer, et al. PNAS 2020. (https://doi.org/10.1073/pnas.2008474117).

It contains 2 single cell samples corresponding to 2 conditions: cells with overactivated p53 (FMTG) and wild-type cells (TG).

For the needs of this session, we will to load an object with both datasets already integrated.

user.id <- Sys.getenv("USER")
input_file <-paste("/shared/home2", user.id, "/deg/resources/data/seurat_object_annotated.rds", sep = "/")
output_dir <- paste0("/shared/home2", user.id, "deg/results/", sep = "/")
sobj <- readRDS(input_file)

1.1 Cell population markers

After clustering or annotating cells, we can identify genes specific to each population by detecting differentially expressed genes (DEGs) between a given population and all other cells. The function FindAllMarkers() facilitates this analysis for each cell population.

First, we need to set the cell identities to indicate the populations in our Seurat object. We will utilize the feature celltype for this purpose.

sobj <- SetIdent(sobj, value = "celltype")
levels(sobj@active.ident)
[1] "B_cells"             "CD8"                 "Enterocyte_proximal"
[4] "Goblet_cells"        "CD4_memory"          "Tuft_cells"         
[7] "Stem_cells"          "Enterocyte_distal"   "CD4_naive"          

Next, we choose a method for DEG computation. Several statistical tests are available in Seurat. Methods with fewer assumptions generally yield more robust results. Here, we will use the default Wilcoxon test but many other methods are available: wilcox, wilcox_limma, bimod, roc, t, negbinom, poisson, LR, MAST, DESeq2

Some methods, like wilcox, require normalized data. We have two options:

  • Use log-normalized data.
DefaultAssay(sobj) <- "RNA"

## get log normalized data if not already done
# sobj <- sobj %>% 
#   FindVariableFeatures() %>% 
#   NormalizeData() %>% 
#   ScaleData()

## compute population markers
df_markers <- FindAllMarkers(sobj,
                             assay = "RNA",
                             logfc.threshold = log2(1.5),
                             min.pct = 0.5,
                             only.pos = T,
                             verbose=F)
  • Use SCT normalized data. If working with integrated datasets, an extra-step is needed to reverse the SCT regression model to UMI counts.
sobj <- PrepSCTFindMarkers(sobj, assay="SCT", verbose = F)
df_markers <- FindAllMarkers(sobj,
                             assay = "SCT",
                             logfc.threshold = log2(1.5),
                             min.pct = 0.5,
                             only.pos = T,
                             verbose=F)

By default, Seurat tests all expressed genes. However, you can modify the behavior with arguments like:

  • min.pct: Only test genes detected in at least min.pct of cells (default is 0.1).
  • logfc.threshold: Limit testing to genes with a minimum log-fold change (default is 0.1).
  • only.pos: Return only positive markers (FALSE by default).

For this session, we will adjust these arguments to focus on strong population markers.

The genes with significant differential expression can then be selected by keeping only adjusted p-values that are small enough.

df_markers_signif <- df_markers %>%
    filter(p_val_adj < 0.05)

head(df_markers_signif)
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
Cd79a 0 9.390336 0.966 0.012 0 B_cells Cd79a
Ebf1 0 9.287841 0.906 0.005 0 B_cells Ebf1
H2-Aa 0 5.548739 0.979 0.130 0 B_cells H2-Aa
Mef2c 0 7.201354 0.858 0.018 0 B_cells Mef2c
Ms4a1 0 8.508328 0.842 0.009 0 B_cells Ms4a1
Cd79b 0 5.283376 0.944 0.128 0 B_cells Cd79b

Significant DEGs can be identified by filtering based on adjusted p-values. The results table includes:

  • Gene name
  • The p-value and adjusted p-value. To be considered differentially expressed, genes adjusted p-values must be smaller than a chosen threshold are.
  • Log-fold change: $ ( ) $.
  • Percent of cells expressing the gene in each condition
nb_markers <- df_markers_signif %>% 
  group_by(cluster) %>% 
  tally()

nb_markers
cluster n
B_cells 163
CD8 322
Enterocyte_proximal 383
Goblet_cells 266
CD4_memory 733
Tuft_cells 212
Stem_cells 758
Enterocyte_distal 3
CD4_naive 159
ggplot(nb_markers, aes(x = cluster, y=n, fill=cluster)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

The table of DEG is not automatically added to the Seurat object. It is possible to add it manually, for example in the slot:

sobj@misc$marker_genes <- df_markers_signif

The DEG results can be saved as a CSV file for further exploration:

write.csv(df_markers, file.path(output_dir, "de_genes_FindAllMarkers.csv"), row.names = F, quote = F)

Finally, visualizations include:

  • Violin plot and feature plot.
genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=1) %>% 
  pull(gene, name=cluster)

DefaultAssay(sobj) <- "SCT"
VlnPlot(sobj, features = genes_of_interest, slot="data", ncol=4)
Warning: The `slot` argument of `VlnPlot()` is deprecated as of Seurat 5.0.0.
ℹ Please use the `layer` argument instead.

FeaturePlot(sobj, features = genes_of_interest,
            max.cutoff = 'q90', ncol = 4)

  • Dot plot for top markers per cell type.
genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=3) %>% 
  pull(gene)

DotPlot(sobj, features = genes_of_interest,
        col.min = 0) +
  RotatedAxis()

  • Heatmaps for a comprehensive view of markers.
genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=50) %>% 
  pull(gene)

DoHeatmap(sobj, features = genes_of_interest, label=F) +
  theme(axis.text.y = element_blank()) # comment to display gene names
Warning in DoHeatmap(sobj, features = genes_of_interest, label = F): The
following features were omitted as they were not found in the scale.data slot
for the SCT assay: Rpl34, Rps4x, Rpl5, Rps3a1, Rpl37a, Rpl19, Rps21, Rps27a,
Rps7, Rps18, Rps10, Rpl18, Rplp2, Rps3, Rpl30, Rps14, Rpl35a, Rplp0, Rpl23,
Rps16, Tmsb10, Nup88, Slc44a1, Lysmd2, Samd14, Itprid1, Chat, Tmprss2, Camsap3,
Liph, Fermt1, Slc22a23, Ttc39a, Cideb, Ntan1, H3f3a, Fau, Rps5, Rpl18a, Rpl8

Note

Seurat uses the scale.data layer for heatmaps, containing only the most variable genes identified during normalization. If certain genes aren’t displayed, they may not be highly variable so this should not be a problem. To include them, rerun log-normalization on specific genes.

sobj <- sobj %>% 
  FindVariableFeatures(nfeatures = nrow(sobj), verbose = FALSE) %>% 
  NormalizeData() %>% 
  ScaleData(seurat_obj, features = genes_to_scale)

1.2 Differential expression between populations or conditions

Having identified markers for each cell type, we can also perform pairwise comparisons to test for differential gene expression between:

  • two cell populations.

  • cells of the same population under different conditions (e.g., treated vs. control).

For instance, we can examine the effect of p53 overactivation on enterocytes by comparing enterocytes with overactivated p53 (FMTG) to wild-type enterocytes (TG).

The FindMarkers() function enables this analysis. A few command lines will allow us to test the two conditions for the enterocyte population.

## subset the dataset to keep only population of interest
sobj_entero <- subset(sobj, idents = "Enterocyte_proximal")

## the condition to test is the genotype
Idents(sobj_entero) <- "orig.ident"
levels(sobj_entero)
[1] "SC26_TG"   "SC26_FMTG"
## let's work on log-normalized counts
DefaultAssay(sobj_entero) <- "RNA"
sobj_entero <- sobj_entero %>%
  FindVariableFeatures() %>%
  NormalizeData() %>%
  ScaleData()
Finding variable features for layer counts
Normalizing layer: counts
Centering and scaling data matrix
Warning: Different features in new layer data than already exists for
scale.data
# deg_entero <- PrepSCTFindMarkers(sobj_entero)
deg_entero <- FindMarkers(sobj_entero, 
                       ident.1 = "SC26_FMTG",
                       ident.2 = "SC26_TG",
                       test.use = "wilcox",
                       logfc.threshold = log2(1.5),
                       min.pct = 0.5) %>% 
                       # recorrect_umi = FALSE) %>% 
  filter(p_val_adj < 0.05) %>%
  arrange(p_val_adj, desc(avg_log2FC))

sobj@misc$deg_entero <- deg_entero

2. Function enrichment

2.1 In a cell population

Once DEGs are identified, functional enrichment helps interpret the biological significance of these genes. It tests whether genes belonging to certain biological functions are over- or under-expressed beyond what would be expected by chance.

We will use the clusterProfiler package for this analysis, employing over-representation analysis (ORA) and Gene Set Enrichment Analysis (GSEA).

Over representation analysis (ORA)

ORA is a common method for functional enrichment, assessing whether more genes in your list of DEGs are associated with a specific biological function than expected by chance.

Key inputs include:

  • the list of DEGs: Generally requires genes in Entrez ID format. We can convert gene names from Symbol format using clusterProfiler::bitr()
## add symbol gene names as a columns
deg_entero <- deg_entero %>% 
  rownames_to_column("SYMBOL")

## get entrez gene names thanks to bitr
trad_sbl_entrez <- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db")
'select()' returned 1:1 mapping between keys and columns
Warning in bitr(deg_entero$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
3.51% of input gene IDs are fail to map...
## join both data tables using symbol gene names as key
deg_entero <- deg_entero %>% 
  inner_join(trad_sbl_entrez)
Joining with `by = join_by(SYMBOL)`
head(deg_entero)
SYMBOL p_val avg_log2FC pct.1 pct.2 p_val_adj ENTREZID
Zg16 0 -3.0775553 0.665 0.934 0 69036
Ubb 0 -0.6940597 0.978 0.992 0 22187
Fth1 0 -1.3476178 0.989 1.000 0 14319
Prdx1 0 -1.0359897 0.797 0.886 0 18477
Malat1 0 0.9994706 0.997 0.989 0 72289
Ftl1 0 -0.9831995 0.929 0.972 0 14325
## get list of DEG using their entrez id
list_deg_entero <- deg_entero$ENTREZID
Note

The annotation types supported by the by the org.Mm.eg.db database ca be listed:

AnnotationDbi::keytypes(org.Mm.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"     
  • a reference: A list of biological functions represented as annotated gene sets (e.g., Gene Ontology (GO), KEGG).

We will query the GO database using the org.Mm.eg.db reference to perform enrichment tests.

Tip

For other organisms, you can find available OrgDbs on bioconductor.

## perform ORA
go <- enrichGO(list_deg_entero, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300)

## avoid redundancy in result terms
go <- simplify(go)

Key parameters include:

  • ont: Choose a GO subcategory between Biological Process (“BP”), Molecular Function (“MF”), Cellular Component (“CC”), or all (“ALL”).
  • minGSSize: Exclude gene sets with fewer than minGSSize genes (removes small sets prone to noise).
  • maxGSSize: Exclude gene sets with more than maxGSSize genes (removes overly broad sets).
head(go@result)
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
GO:0009060 GO:0009060 aerobic respiration 11/107 199/28905 0 2e-07 1e-07 17448/17449/15251/13063/12858/103172/67680/17995/66694/66445/69875 11
GO:0140236 GO:0140236 translation at presynapse 7/107 47/28905 0 2e-07 1e-07 57294/54127/67671/19981/54217/67281/27370 7
GO:0046166 GO:0046166 glyceraldehyde-3-phosphate biosynthetic process 5/107 12/28905 0 2e-07 1e-07 17448/17449/14433/13806/18648 5
GO:0140241 GO:0140241 translation at synapse 7/107 48/28905 0 2e-07 1e-07 57294/54127/67671/19981/54217/67281/27370 7
GO:0140242 GO:0140242 translation at postsynapse 7/107 48/28905 0 2e-07 1e-07 57294/54127/67671/19981/54217/67281/27370 7
GO:0006094 GO:0006094 gluconeogenesis 8/107 95/28905 0 8e-07 6e-07 17448/17449/15251/14433/13806/21416/18648/67800 8

The function returns an enrichResult object with ORA table stored in the @result slot. Important columns include:

  • ID and Description: GO identifier and brief description of the biological process.
  • Count: Number of genes from the input dataset annotated to the specific GO term.
  • GeneRatio: Ratio of genes in your input dataset associated with the GO term.
  • BgRatio: Ratio of genes associated with the GO term in the reference gene set.
  • p.adjust and qvalue: Adjusted p-values for selecting significantly enriched functions.

Results can be visualized using a dot plot.

dotplot(go, showCategory = 20)

Prebuilt functions like enrichGO() and enrichKEGG() enable users to query widely used ontologies. The enricher() function facilitates the testing of over-representation for any gene set, for example MSigDB.

## load MSigDBr database, subcategory H
gmt <- msigdbr(species = "mouse", category = "H")
head(gmt)
gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene human_gene_symbol human_entrez_gene human_ensembl_gene gs_id gs_pmid gs_geoid gs_exact_source gs_url gs_description taxon_id ortholog_sources num_ortholog_sources
H HALLMARK_ADIPOGENESIS Abca1 11303 ENSMUSG00000015243 ABCA1 19 ENSG00000165029 M5905 26771021 Genes up-regulated during adipocyte differentiation (adipogenesis). 10090 EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam 11
H HALLMARK_ADIPOGENESIS Abcb8 74610 ENSMUSG00000028973 ABCB8 11194 ENSG00000197150 M5905 26771021 Genes up-regulated during adipocyte differentiation (adipogenesis). 10090 EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam 12
H HALLMARK_ADIPOGENESIS Acaa2 52538 ENSMUSG00000036880 ACAA2 10449 ENSG00000167315 M5905 26771021 Genes up-regulated during adipocyte differentiation (adipogenesis). 10090 EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam 11
H HALLMARK_ADIPOGENESIS Acadl 11363 ENSMUSG00000026003 ACADL 33 ENSG00000115361 M5905 26771021 Genes up-regulated during adipocyte differentiation (adipogenesis). 10090 EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam 12
H HALLMARK_ADIPOGENESIS Acadm 11364 ENSMUSG00000062908 ACADM 34 ENSG00000117054 M5905 26771021 Genes up-regulated during adipocyte differentiation (adipogenesis). 10090 EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam 12
H HALLMARK_ADIPOGENESIS Acads 11409 ENSMUSG00000029545 ACADS 35 ENSG00000122971 M5905 26771021 Genes up-regulated during adipocyte differentiation (adipogenesis). 10090 EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam 11
ora <- enricher(gene = list_deg_entero, pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, TERM2GENE = gmt[,c("gs_name", "entrez_gene")], minGSSize = 10, maxGSSize = 300)

head(ora@result)
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
HALLMARK_OXIDATIVE_PHOSPHORYLATION HALLMARK_OXIDATIVE_PHOSPHORYLATION HALLMARK_OXIDATIVE_PHOSPHORYLATION 10/52 199/4394 0.0000865 0.0032005 0.0024584 17448/17449/13063/18674/12858/18984/67680/17995/66694/66445 10
HALLMARK_ADIPOGENESIS HALLMARK_ADIPOGENESIS HALLMARK_ADIPOGENESIS 9/52 200/4394 0.0004685 0.0057778 0.0044381 17448/20280/103172/18984/67680/18107/66445/20655/11637 9
HALLMARK_GLYCOLYSIS HALLMARK_GLYCOLYSIS HALLMARK_GLYCOLYSIS 9/52 200/4394 0.0004685 0.0057778 0.0044381 17448/17449/13806/93692/14595/12368/18648/20655/21786 9
HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 4/52 49/4394 0.0024946 0.0230753 0.0177250 18477/14325/93692/20655 4
HALLMARK_MTORC1_SIGNALING HALLMARK_MTORC1_SIGNALING HALLMARK_MTORC1_SIGNALING 7/52 201/4394 0.0087632 0.0527135 0.0404912 18477/14433/13806/93692/12450/18107/22433 7
HALLMARK_XENOBIOTIC_METABOLISM HALLMARK_XENOBIOTIC_METABOLISM HALLMARK_XENOBIOTIC_METABOLISM 7/52 203/4394 0.0092332 0.0527135 0.0404912 19692/11847/12408/18984/18107/12368/17161 7

:::

Tip

ORA does not indicate the direction of differential regulation. To gain more insight, you can filter for only significantly up- or down-regulated genes.

Gene Set Enrichment Analysis (GSEA)

GSEA evaluates whether a set of genes shows coordinated expression changes, analyzing all genes even if they do not reach significance thresholds.

For GSEA, prepare an ordered list of all genes. They are typically sorted by fold change but other criteria could be chosen, such as the (fold change * adjusted p-value) product.

## rerun DEG identification to keep as many genes as possible
## even not significant genes
deg_entero <- FindMarkers(sobj_entero, 
                       ident.1 = "SC26_FMTG",
                       ident.2 = "SC26_TG",
                       test.use = "wilcox") %>% 
  arrange(desc(avg_log2FC)) %>% 
  rownames_to_column("SYMBOL")

## get entrez gene names thanks to bitr
trad_sbl_entrez <- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db")
'select()' returned 1:1 mapping between keys and columns
Warning in bitr(deg_entero$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
4.65% of input gene IDs are fail to map...
## join both data tables using symbol gene names as key
deg_entero <- deg_entero %>% 
  inner_join(trad_sbl_entrez)
Joining with `by = join_by(SYMBOL)`
## get list of DEG using their entrez id
gsea_input_list <- deg_entero %>%
  pull(avg_log2FC, name = ENTREZID)

We can run GSEA on the GO ontology.

gsea_go <- gseGO(gsea_input_list, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300) %>% 
  simplify()
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
head(gsea_go@result)
ID Description setSize enrichmentScore NES pvalue p.adjust qvalue rank leading_edge core_enrichment
GO:0030593 GO:0030593 neutrophil chemotaxis 57 -0.6503270 -2.603697 0e+00 0.0000014 1.30e-06 603 tags=25%, list=6%, signal=23% 13143/14127/20963/20202/13615/13614/56221/20296/20305/20308/20201/12986/20310/16176
GO:0097530 GO:0097530 granulocyte migration 78 -0.5878065 -2.550661 0e+00 0.0000014 1.30e-06 279 tags=21%, list=3%, signal=20% 13143/14127/68279/20963/20202/13615/13614/56221/20296/20305/20308/20201/16952/12986/20310/16176
GO:1990266 GO:1990266 neutrophil migration 70 -0.6108187 -2.548082 0e+00 0.0000014 1.30e-06 603 tags=21%, list=6%, signal=20% 13143/14127/68279/20963/20202/13615/13614/56221/20296/20305/20308/20201/12986/20310/16176
GO:0017156 GO:0017156 calcium-ion regulated exocytosis 38 0.7392215 2.387048 0e+00 0.0000015 1.30e-06 1420 tags=50%, list=14%, signal=43% 20614/54525/80976/116838/58226/12286/380714/53420/60425/382018/27062/11551/20907/229521/320405/18999/70450/109685/18439
GO:0070098 GO:0070098 chemokine-mediated signaling pathway 33 -0.7049968 -2.506332 2e-07 0.0001060 9.55e-05 123 tags=33%, list=1%, signal=33% 20303/80901/16923/13614/56221/12772/20296/20305/20308/12768/20310
GO:0006636 GO:0006636 unsaturated fatty acid biosynthetic process 20 -0.7863604 -2.427158 2e-07 0.0001087 9.79e-05 791 tags=45%, list=8%, signal=42% 66469/76267/18783/19224/16592/13615/13614/16952/16176

Important columns include:

  • setSize: Size of the gene set in the input list.
  • ES and NES: Enrichment score and its normalized version. A positive score indicates over-expression and a negative score indicates under-expression.
  • padj and qvalues: Adjusted p-values associated with the test.

Results can be visualized with plots.

## separate function with mainly over-expressed genes
## from functions with mainly under-expressed genes
gsea_go@result <- gsea_go@result %>% 
    mutate(nes_sign = ifelse(NES > 0, "Positive", "Negative"))

dotplot(gsea_go, showCategory = 10, split = "nes_sign")  +
  facet_grid(~nes_sign)

Question

What observations can you make about the enrichment function results ?

2.2 At single cell level gene set activity

Some methods calculate the activity of pathways or gene sets at the single-cell level. This can be used to visualize the activation of some functions of interest across a whole sample, cell annotation or cell state identification.

Here we will show how to perform this type of analysis with the package AUCell. For time purpose, we will focus on 1 apoptosis signature suggested by the authors of the article of origin.

## create a list of gene sets per GO term
gene_set <- list("p53_signature" = c("Ccng1", "Mdm2", "Eda2r", "Gtse1", "Polk", "Psrc1", "Zfp365"))

It would be possible to construct a list of gene sets taken from popular ontologies.

Then the GSEA at a single cell level is performed.

## get express matrix
exprMatrix <- GetAssayData(sobj, assay="RNA", layer = "count")

## computations
cells_rankings <- AUCell_buildRankings(exprMatrix, plotStats=F)
cells_AUC <- AUCell_calcAUC(gene_set, cells_rankings)

AUCell has a few functions to plot the results.

AUCell_plotTSNE(Embeddings(sobj, reduction="umap")[,c(1,2)], exprMatrix, cells_AUC)

Alternatively, the results can be added to the Seurat object metadata and plotted with Seurat functions.

auc_df <- AUCell::getAUC(cells_AUC) %>% 
  t() %>% 
  as.data.frame()

sobj <- AddMetaData(sobj, metadata = auc_df)

VlnPlot(sobj, features = "p53_signature")

FeaturePlot(sobj, features = "p53_signature", max.cutoff = 'q95')

References and documentation

clusterProfiler: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html

AUCell: https://www.bioconductor.org/packages/devel/bioc/vignettes/AUCell/inst/doc/AUCell.html

scGSVA: a tool for single cell level GSEA coming with many functions for references building and results representation, actually more user-friendly than AUCell: https://github.com/guokai8/scGSVA

Session info

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /shared/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] AUCell_1.26.0          msigdbr_7.5.1          org.Mm.eg.db_3.19.1   
 [4] AnnotationDbi_1.66.0   IRanges_2.38.1         S4Vectors_0.42.1      
 [7] Biobase_2.64.0         BiocGenerics_0.50.0    enrichplot_1.24.4     
[10] clusterProfiler_4.12.6 Seurat_5.1.0           SeuratObject_5.0.2    
[13] sp_2.1-4               lubridate_1.9.3        forcats_1.0.0         
[16] stringr_1.5.1          dplyr_1.1.4            purrr_1.0.2           
[19] readr_2.1.5            tidyr_1.3.1            tibble_3.2.1          
[22] ggplot2_3.5.1          tidyverse_2.0.0       

loaded via a namespace (and not attached):
  [1] segmented_2.1-0             fs_1.6.4                   
  [3] matrixStats_1.4.1           spatstat.sparse_3.1-0      
  [5] HDO.db_0.99.1               httr_1.4.7                 
  [7] RColorBrewer_1.1-3          tools_4.4.1                
  [9] sctransform_0.4.1           utf8_1.2.4                 
 [11] R6_2.5.1                    lazyeval_0.2.2             
 [13] uwot_0.2.2                  withr_3.0.1                
 [15] gridExtra_2.3               progressr_0.14.0           
 [17] cli_3.6.3                   spatstat.explore_3.3-2     
 [19] fastDummies_1.7.4           scatterpie_0.2.4           
 [21] labeling_0.4.3              spatstat.data_3.1-2        
 [23] ggridges_0.5.6              pbapply_1.7-2              
 [25] yulab.utils_0.1.7           gson_0.1.0                 
 [27] DOSE_3.30.1                 R.utils_2.12.3             
 [29] parallelly_1.38.0           limma_3.60.5               
 [31] rstudioapi_0.16.0           RSQLite_2.3.7              
 [33] generics_0.1.3              gridGraphics_0.5-1         
 [35] ica_1.0-3                   spatstat.random_3.3-2      
 [37] GO.db_3.19.1                Matrix_1.7-0               
 [39] ggbeeswarm_0.7.2            fansi_1.0.6                
 [41] abind_1.4-8                 R.methodsS3_1.8.2          
 [43] lifecycle_1.0.4             yaml_2.3.10                
 [45] SummarizedExperiment_1.34.0 qvalue_2.36.0              
 [47] SparseArray_1.4.8           Rtsne_0.17                 
 [49] grid_4.4.1                  blob_1.2.4                 
 [51] promises_1.3.0              crayon_1.5.3               
 [53] miniUI_0.1.1.1              lattice_0.22-6             
 [55] cowplot_1.1.3               annotate_1.82.0            
 [57] KEGGREST_1.44.1             pillar_1.9.0               
 [59] knitr_1.48                  fgsea_1.30.0               
 [61] GenomicRanges_1.56.1        future.apply_1.11.2        
 [63] codetools_0.2-20            fastmatch_1.1-4            
 [65] leiden_0.4.3.1              glue_1.8.0                 
 [67] ggfun_0.1.6                 spatstat.univar_3.0-1      
 [69] data.table_1.16.0           vctrs_0.6.5                
 [71] png_0.1-8                   treeio_1.28.0              
 [73] spam_2.10-0                 gtable_0.3.5               
 [75] kernlab_0.9-32              cachem_1.1.0               
 [77] xfun_0.48                   S4Arrays_1.4.1             
 [79] mime_0.12                   tidygraph_1.3.1            
 [81] survival_3.7-0              statmod_1.5.0              
 [83] fitdistrplus_1.2-1          ROCR_1.0-11                
 [85] nlme_3.1-165                ggtree_3.12.0              
 [87] bit64_4.0.5                 RcppAnnoy_0.0.22           
 [89] GenomeInfoDb_1.40.1         irlba_2.3.5.1              
 [91] vipor_0.4.7                 KernSmooth_2.23-24         
 [93] colorspace_2.1-1            DBI_1.2.3                  
 [95] ggrastr_1.0.2               tidyselect_1.2.1           
 [97] bit_4.0.5                   compiler_4.4.1             
 [99] httr2_1.0.1                 graph_1.82.0               
[101] DelayedArray_0.30.1         plotly_4.10.4              
[103] shadowtext_0.1.3            scales_1.3.0               
[105] lmtest_0.9-40               rappdirs_0.3.3             
[107] digest_0.6.37               goftest_1.2-3              
[109] mixtools_2.0.0              spatstat.utils_3.1-0       
[111] rmarkdown_2.28              XVector_0.44.0             
[113] htmltools_0.5.8.1           pkgconfig_2.0.3            
[115] sparseMatrixStats_1.16.0    MatrixGenerics_1.16.0      
[117] fastmap_1.2.0               rlang_1.1.4                
[119] htmlwidgets_1.6.4           UCSC.utils_1.0.0           
[121] shiny_1.9.1                 DelayedMatrixStats_1.26.0  
[123] farver_2.1.2                zoo_1.8-12                 
[125] jsonlite_1.8.9              BiocParallel_1.38.0        
[127] GOSemSim_2.30.2             R.oo_1.26.0                
[129] magrittr_2.0.3              GenomeInfoDbData_1.2.12    
[131] ggplotify_0.1.2             dotCall64_1.1-1            
[133] patchwork_1.3.0             munsell_0.5.1              
[135] Rcpp_1.0.13                 ape_5.8                    
[137] babelgene_22.9              viridis_0.6.5              
[139] reticulate_1.39.0           stringi_1.8.4              
[141] ggraph_2.2.1                zlibbioc_1.50.0            
[143] MASS_7.3-61                 plyr_1.8.9                 
[145] parallel_4.4.1              listenv_0.9.1              
[147] ggrepel_0.9.6               deldir_2.0-4               
[149] Biostrings_2.72.1           graphlayouts_1.1.1         
[151] splines_4.4.1               tensor_1.5                 
[153] hms_1.1.3                   igraph_2.0.3               
[155] spatstat.geom_3.3-3         RcppHNSW_0.6.0             
[157] reshape2_1.4.4              XML_3.99-0.17              
[159] evaluate_1.0.0              tzdb_0.4.0                 
[161] tweenr_2.0.3                httpuv_1.6.15              
[163] RANN_2.6.2                  polyclip_1.10-7            
[165] future_1.34.0               scattermore_1.2            
[167] ggforce_0.4.2               xtable_1.8-4               
[169] RSpectra_0.16-2             tidytree_0.4.6             
[171] later_1.3.2                 viridisLite_0.4.2          
[173] aplot_0.2.3                 beeswarm_0.4.0             
[175] memoise_2.0.1               cluster_2.1.6              
[177] timechange_0.3.0            globals_0.16.3             
[179] GSEABase_1.66.0            

Bonus: pseudo-bulk example

## load data
## the data is pro B cells in 2 conditions: 
## healthy donors (PBMC) and leukemia patient (ETV6-RunX1)
## they come from this paper
## Caron M et al. Scientific Reports. 2020;10:1–12. Available from: http://dx.doi.org/10.1038/s41598-020-64929-x
proB <- readRDS("resources/data/proB.rds")

## 1 cell type only: pro B cells
unique(proB@meta.data$SingleR_annot)
unique(proB@meta.data$type)

## 3 replicates
unique(proB@meta.data$orig.ident)

Idents(proB) <- "orig.ident"

## method 1: stay in Seurat
sobj_pseudoblk <- AggregateExpression(proB,
                            assays = "RNA",
                            slot = "counts",
                            return.seurat = T)

sobj_pseudoblk@assays$RNA@layers$counts

sobj_pseudoblk@meta.data <- sobj_pseudoblk@meta.data %>% 
  mutate(type = ifelse(str_detect(orig.ident, "PBMMC"),
                       "PBMC",
                       "ETV6"))

Idents(sobj_pseudoblk) <- "type"

markers <- FindMarkers(sobj_pseudoblk, 
                       ident.1 = "PBMC", 
                       ident.2 = "ETV6",
                       test.use = "DESeq2")

markers_signifs <- markers %>% 
  filter(p_val_adj < 0.05)

## method 2: just get raw counts and use you favorite bulk RNA-Seq tool
cts <- AggregateExpression(proB,
                            assays = "RNA",
                            slot = "counts",
                            return.seurat = F)
cts <- cts$RNA

dim(cts)
colnames(cts)

coldata <- data.frame(sample = colnames(cts)) %>%
    mutate(condition = ifelse(grepl("PBMMC", sample), "ctrl", "cancer")) %>%
    column_to_rownames("sample")

library(DESeq2)

dds <- DESeqDataSetFromMatrix(cts, colData = coldata, design = ~ condition)

keep <- rowSums(cts) >= 10
dds <- dds[keep,]
dds <- DESeq(dds)

markers_singif2 <- results(dds, name = "condition_ctrl_vs_cancer") %>%
    as.data.frame() %>%
    arrange(padj) %>%
    filter(padj < 0.05)