sincellTE 2024 - TP Secondary Analysis: differential expression, functional enrichment

Published

Oct 22, 2024

Objectives

After identifying cell populations, the next crucial steps—differential gene expression (DGE) and functional enrichment—uncover the molecular signatures of cells and elucidate their functional roles. These steps are vital for exploring cellular heterogeneity and extracting meaningful biological insights from data.

In this session, we will perform DGE analysis to identify genes that serve as specific markers for cell populations or are significantly altered between conditions. Additionally, we will explore functional enrichment methods to interpret these findings in the context of biological pathways.

suppressWarnings({
    suppressMessages({
      library(tidyverse)
      library(Seurat)
      library(clusterProfiler)
      library(enrichplot)
      library(org.Mm.eg.db)
      library(msigdbr)
      library(AUCell)
      library(ggplot2)
    })
})

options(future.globals.maxSize = 2.5 * 1e9)

1. Differential expression

The dataset used for these sessions comes from the article “p53 drives a transcriptional program that elicits a non-cell-autonomous response and alters cell state in vivo” by S.M. Moyer, et al. PNAS 2020. (https://doi.org/10.1073/pnas.2008474117).

It contains 2 single cell samples corresponding to 2 conditions: cells with overactivated p53 (FMTG) and wild-type cells (TG).

For the needs of this session, we will to load an object with both datasets already integrated.

user.id <- Sys.getenv("USER")
input_file <-paste("/shared/home2", user.id, "/deg/resources/data/seurat_object_annotated.rds", sep = "/")
output_dir <- paste0("/shared/home2", user.id, "deg/results/", sep = "/")
sobj <- readRDS(input_file)

1.1 Cell population markers

After clustering or annotating cells, we can identify genes specific to each population by detecting differentially expressed genes (DEGs) between a given population and all other cells. The function FindAllMarkers() facilitates this analysis for each cell population.

First, we need to set the cell identities to indicate the populations in our Seurat object. We will utilize the feature celltype for this purpose.

sobj <- SetIdent(sobj, value = "celltype")
levels(sobj@active.ident)

[1] "B_cells"             "CD8"                 "Enterocyte_proximal"
[4] "Goblet_cells"        "CD4_memory"          "Tuft_cells"         
[7] "Stem_cells"          "Enterocyte_distal"   "CD4_naive"

Next, we choose a method for DEG computation. Several statistical tests are available in Seurat. Methods with fewer assumptions generally yield more robust results. Here, we will use the default Wilcoxon test but many other methods are available: wilcox, wilcox_limma, bimod, roc, t, negbinom, poisson, LR, MAST, DESeq2…

Some methods, like wilcox, require normalized data. We have two options:

Use log-normalized data.

DefaultAssay(sobj) <- "RNA"

## get log normalized data if not already done
# sobj <- sobj %>% 
#   FindVariableFeatures() %>% 
#   NormalizeData() %>% 
#   ScaleData()

## compute population markers
df_markers <- FindAllMarkers(sobj,
                             assay = "RNA",
                             logfc.threshold = log2(1.5),
                             min.pct = 0.5,
                             only.pos = T,
                             verbose=F)

Use SCT normalized data. If working with integrated datasets, an extra-step is needed to reverse the SCT regression model to UMI counts.

sobj <- PrepSCTFindMarkers(sobj, assay="SCT", verbose = F)
df_markers <- FindAllMarkers(sobj,
                             assay = "SCT",
                             logfc.threshold = log2(1.5),
                             min.pct = 0.5,
                             only.pos = T,
                             verbose=F)

By default, Seurat tests all expressed genes. However, you can modify the behavior with arguments like:

min.pct: Only test genes detected in at least min.pct of cells (default is 0.1).
logfc.threshold: Limit testing to genes with a minimum log-fold change (default is 0.1).
only.pos: Return only positive markers (FALSE by default).

For this session, we will adjust these arguments to focus on strong population markers.

The genes with significant differential expression can then be selected by keeping only adjusted p-values that are small enough.

df_markers_signif <- df_markers %>%
    filter(p_val_adj < 0.05)

head(df_markers_signif)

	avg_log2FC	pct.1	pct.2	cluster	gene
Cd79a	9.390336	0.966	0.012	B_cells	Cd79a
Ebf1	9.287841	0.906	0.005	B_cells	Ebf1
H2-Aa	5.548739	0.979	0.130	B_cells	H2-Aa
Mef2c	7.201354	0.858	0.018	B_cells	Mef2c
Ms4a1	8.508328	0.842	0.009	B_cells	Ms4a1
Cd79b	5.283376	0.944	0.128	B_cells	Cd79b

Significant DEGs can be identified by filtering based on adjusted p-values. The results table includes:

Gene name
The p-value and adjusted p-value. To be considered differentially expressed, genes adjusted p-values must be smaller than a chosen threshold are.
Log-fold change: $ ( ) $.
Percent of cells expressing the gene in each condition

nb_markers <- df_markers_signif %>% 
  group_by(cluster) %>% 
  tally()

nb_markers

cluster	n
B_cells	163
CD8	322
Enterocyte_proximal	383
Goblet_cells	266
CD4_memory	733
Tuft_cells	212
Stem_cells	758
Enterocyte_distal	3
CD4_naive	159

ggplot(nb_markers, aes(x = cluster, y=n, fill=cluster)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

The table of DEG is not automatically added to the Seurat object. It is possible to add it manually, for example in the slot:

sobj@misc$marker_genes <- df_markers_signif

The DEG results can be saved as a CSV file for further exploration:

write.csv(df_markers, file.path(output_dir, "de_genes_FindAllMarkers.csv"), row.names = F, quote = F)

Finally, visualizations include:

Violin plot and feature plot.

genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=1) %>% 
  pull(gene, name=cluster)

DefaultAssay(sobj) <- "SCT"
VlnPlot(sobj, features = genes_of_interest, slot="data", ncol=4)

Warning: The `slot` argument of `VlnPlot()` is deprecated as of Seurat 5.0.0.
ℹ Please use the `layer` argument instead.

FeaturePlot(sobj, features = genes_of_interest,
            max.cutoff = 'q90', ncol = 4)

Dot plot for top markers per cell type.

genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=3) %>% 
  pull(gene)

DotPlot(sobj, features = genes_of_interest,
        col.min = 0) +
  RotatedAxis()

Heatmaps for a comprehensive view of markers.

genes_of_interest <- df_markers_signif %>%
  group_by(cluster) %>%
  arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>% 
  slice_head(n=50) %>% 
  pull(gene)

DoHeatmap(sobj, features = genes_of_interest, label=F) +
  theme(axis.text.y = element_blank()) # comment to display gene names

Warning in DoHeatmap(sobj, features = genes_of_interest, label = F): The
following features were omitted as they were not found in the scale.data slot
for the SCT assay: Rpl34, Rps4x, Rpl5, Rps3a1, Rpl37a, Rpl19, Rps21, Rps27a,
Rps7, Rps18, Rps10, Rpl18, Rplp2, Rps3, Rpl30, Rps14, Rpl35a, Rplp0, Rpl23,
Rps16, Tmsb10, Nup88, Slc44a1, Lysmd2, Samd14, Itprid1, Chat, Tmprss2, Camsap3,
Liph, Fermt1, Slc22a23, Ttc39a, Cideb, Ntan1, H3f3a, Fau, Rps5, Rpl18a, Rpl8

Note

Seurat uses the scale.data layer for heatmaps, containing only the most variable genes identified during normalization. If certain genes aren’t displayed, they may not be highly variable so this should not be a problem. To include them, rerun log-normalization on specific genes.

sobj <- sobj %>% 
  FindVariableFeatures(nfeatures = nrow(sobj), verbose = FALSE) %>% 
  NormalizeData() %>% 
  ScaleData(seurat_obj, features = genes_to_scale)

1.2 Differential expression between populations or conditions

Having identified markers for each cell type, we can also perform pairwise comparisons to test for differential gene expression between:

two cell populations.

cells of the same population under different conditions (e.g., treated vs. control).

For instance, we can examine the effect of p53 overactivation on enterocytes by comparing enterocytes with overactivated p53 (FMTG) to wild-type enterocytes (TG).

The FindMarkers() function enables this analysis. A few command lines will allow us to test the two conditions for the enterocyte population.

## subset the dataset to keep only population of interest
sobj_entero <- subset(sobj, idents = "Enterocyte_proximal")

## the condition to test is the genotype
Idents(sobj_entero) <- "orig.ident"
levels(sobj_entero)

[1] "SC26_TG"   "SC26_FMTG"

## let's work on log-normalized counts
DefaultAssay(sobj_entero) <- "RNA"
sobj_entero <- sobj_entero %>%
  FindVariableFeatures() %>%
  NormalizeData() %>%
  ScaleData()

Finding variable features for layer counts

Normalizing layer: counts

Centering and scaling data matrix

Warning: Different features in new layer data than already exists for
scale.data

# deg_entero <- PrepSCTFindMarkers(sobj_entero)
deg_entero <- FindMarkers(sobj_entero, 
                       ident.1 = "SC26_FMTG",
                       ident.2 = "SC26_TG",
                       test.use = "wilcox",
                       logfc.threshold = log2(1.5),
                       min.pct = 0.5) %>% 
                       # recorrect_umi = FALSE) %>% 
  filter(p_val_adj < 0.05) %>%
  arrange(p_val_adj, desc(avg_log2FC))

sobj@misc$deg_entero <- deg_entero

2. Function enrichment

2.1 In a cell population

Once DEGs are identified, functional enrichment helps interpret the biological significance of these genes. It tests whether genes belonging to certain biological functions are over- or under-expressed beyond what would be expected by chance.

We will use the clusterProfiler package for this analysis, employing over-representation analysis (ORA) and Gene Set Enrichment Analysis (GSEA).

Over representation analysis (ORA)

ORA is a common method for functional enrichment, assessing whether more genes in your list of DEGs are associated with a specific biological function than expected by chance.

Key inputs include:

the list of DEGs: Generally requires genes in Entrez ID format. We can convert gene names from Symbol format using clusterProfiler::bitr()

## add symbol gene names as a columns
deg_entero <- deg_entero %>% 
  rownames_to_column("SYMBOL")

## get entrez gene names thanks to bitr
trad_sbl_entrez <- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db")

'select()' returned 1:1 mapping between keys and columns

Warning in bitr(deg_entero$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
3.51% of input gene IDs are fail to map...

## join both data tables using symbol gene names as key
deg_entero <- deg_entero %>% 
  inner_join(trad_sbl_entrez)

Joining with `by = join_by(SYMBOL)`

head(deg_entero)

SYMBOL	avg_log2FC	pct.1	pct.2	ENTREZID
Zg16	-3.0775553	0.665	0.934	69036
Ubb	-0.6940597	0.978	0.992	22187
Fth1	-1.3476178	0.989	1.000	14319
Prdx1	-1.0359897	0.797	0.886	18477
Malat1	0.9994706	0.997	0.989	72289
Ftl1	-0.9831995	0.929	0.972	14325

## get list of DEG using their entrez id
list_deg_entero <- deg_entero$ENTREZID

Note

The annotation types supported by the by the org.Mm.eg.db database ca be listed:

AnnotationDbi::keytypes(org.Mm.eg.db)

 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

a reference: A list of biological functions represented as annotated gene sets (e.g., Gene Ontology (GO), KEGG).

We will query the GO database using the org.Mm.eg.db reference to perform enrichment tests.

Tip

For other organisms, you can find available OrgDbs on bioconductor.

## perform ORA
go <- enrichGO(list_deg_entero, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300)

## avoid redundancy in result terms
go <- simplify(go)

Key parameters include:

ont: Choose a GO subcategory between Biological Process (“BP”), Molecular Function (“MF”), Cellular Component (“CC”), or all (“ALL”).
minGSSize: Exclude gene sets with fewer than minGSSize genes (removes small sets prone to noise).
maxGSSize: Exclude gene sets with more than maxGSSize genes (removes overly broad sets).

head(go@result)

	ID	Description	GeneRatio	BgRatio	p.adjust	qvalue	geneID	Count
GO:0009060	GO:0009060	aerobic respiration	11/107	199/28905	2e-07	1e-07	17448/17449/15251/13063/12858/103172/67680/17995/66694/66445/69875	11
GO:0140236	GO:0140236	translation at presynapse	7/107	47/28905	2e-07	1e-07	57294/54127/67671/19981/54217/67281/27370	7
GO:0046166	GO:0046166	glyceraldehyde-3-phosphate biosynthetic process	5/107	12/28905	2e-07	1e-07	17448/17449/14433/13806/18648	5
GO:0140241	GO:0140241	translation at synapse	7/107	48/28905	2e-07	1e-07	57294/54127/67671/19981/54217/67281/27370	7
GO:0140242	GO:0140242	translation at postsynapse	7/107	48/28905	2e-07	1e-07	57294/54127/67671/19981/54217/67281/27370	7
GO:0006094	GO:0006094	gluconeogenesis	8/107	95/28905	8e-07	6e-07	17448/17449/15251/14433/13806/21416/18648/67800	8

The function returns an enrichResult object with ORA table stored in the @result slot. Important columns include:

ID and Description: GO identifier and brief description of the biological process.
Count: Number of genes from the input dataset annotated to the specific GO term.
GeneRatio: Ratio of genes in your input dataset associated with the GO term.
BgRatio: Ratio of genes associated with the GO term in the reference gene set.
p.adjust and qvalue: Adjusted p-values for selecting significantly enriched functions.

Results can be visualized using a dot plot.

dotplot(go, showCategory = 20)

Prebuilt functions like enrichGO() and enrichKEGG() enable users to query widely used ontologies. The enricher() function facilitates the testing of over-representation for any gene set, for example MSigDB.

## load MSigDBr database, subcategory H
gmt <- msigdbr(species = "mouse", category = "H")
head(gmt)

gs_cat	gs_name	gene_symbol	entrez_gene	ensembl_gene	human_gene_symbol	human_entrez_gene	human_ensembl_gene	gs_id	gs_pmid	gs_description	taxon_id	ortholog_sources	num_ortholog_sources
H	HALLMARK_ADIPOGENESIS	Abca1	11303	ENSMUSG00000015243	ABCA1	19	ENSG00000165029	M5905	26771021	Genes up-regulated during adipocyte differentiation (adipogenesis).	10090	EggNOG\|Ensembl\|HGNC\|HomoloGene\|Inparanoid\|NCBI\|OMA\|OrthoDB\|OrthoMCL\|Panther\|Treefam	11
H	HALLMARK_ADIPOGENESIS	Abcb8	74610	ENSMUSG00000028973	ABCB8	11194	ENSG00000197150	M5905	26771021	Genes up-regulated during adipocyte differentiation (adipogenesis).	10090	EggNOG\|Ensembl\|HGNC\|HomoloGene\|Inparanoid\|NCBI\|OMA\|OrthoDB\|OrthoMCL\|Panther\|PhylomeDB\|Treefam	12
H	HALLMARK_ADIPOGENESIS	Acaa2	52538	ENSMUSG00000036880	ACAA2	10449	ENSG00000167315	M5905	26771021	Genes up-regulated during adipocyte differentiation (adipogenesis).	10090	EggNOG\|Ensembl\|HGNC\|HomoloGene\|Inparanoid\|NCBI\|OMA\|OrthoDB\|OrthoMCL\|Panther\|Treefam	11
H	HALLMARK_ADIPOGENESIS	Acadl	11363	ENSMUSG00000026003	ACADL	33	ENSG00000115361	M5905	26771021	Genes up-regulated during adipocyte differentiation (adipogenesis).	10090	EggNOG\|Ensembl\|HGNC\|HomoloGene\|Inparanoid\|NCBI\|OMA\|OrthoDB\|OrthoMCL\|Panther\|PhylomeDB\|Treefam	12
H	HALLMARK_ADIPOGENESIS	Acadm	11364	ENSMUSG00000062908	ACADM	34	ENSG00000117054	M5905	26771021	Genes up-regulated during adipocyte differentiation (adipogenesis).	10090	EggNOG\|Ensembl\|HGNC\|HomoloGene\|Inparanoid\|NCBI\|OMA\|OrthoDB\|OrthoMCL\|Panther\|PhylomeDB\|Treefam	12
H	HALLMARK_ADIPOGENESIS	Acads	11409	ENSMUSG00000029545	ACADS	35	ENSG00000122971	M5905	26771021	Genes up-regulated during adipocyte differentiation (adipogenesis).	10090	EggNOG\|Ensembl\|HGNC\|HomoloGene\|Inparanoid\|NCBI\|OMA\|OrthoDB\|OrthoMCL\|Panther\|Treefam	11

ora <- enricher(gene = list_deg_entero, pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, TERM2GENE = gmt[,c("gs_name", "entrez_gene")], minGSSize = 10, maxGSSize = 300)

head(ora@result)

	ID	Description	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	geneID	Count
HALLMARK_OXIDATIVE_PHOSPHORYLATION	HALLMARK_OXIDATIVE_PHOSPHORYLATION	HALLMARK_OXIDATIVE_PHOSPHORYLATION	10/52	199/4394	0.0000865	0.0032005	0.0024584	17448/17449/13063/18674/12858/18984/67680/17995/66694/66445	10
HALLMARK_ADIPOGENESIS	HALLMARK_ADIPOGENESIS	HALLMARK_ADIPOGENESIS	9/52	200/4394	0.0004685	0.0057778	0.0044381	17448/20280/103172/18984/67680/18107/66445/20655/11637	9
HALLMARK_GLYCOLYSIS	HALLMARK_GLYCOLYSIS	HALLMARK_GLYCOLYSIS	9/52	200/4394	0.0004685	0.0057778	0.0044381	17448/17449/13806/93692/14595/12368/18648/20655/21786	9
HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY	HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY	HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY	4/52	49/4394	0.0024946	0.0230753	0.0177250	18477/14325/93692/20655	4
HALLMARK_MTORC1_SIGNALING	HALLMARK_MTORC1_SIGNALING	HALLMARK_MTORC1_SIGNALING	7/52	201/4394	0.0087632	0.0527135	0.0404912	18477/14433/13806/93692/12450/18107/22433	7
HALLMARK_XENOBIOTIC_METABOLISM	HALLMARK_XENOBIOTIC_METABOLISM	HALLMARK_XENOBIOTIC_METABOLISM	7/52	203/4394	0.0092332	0.0527135	0.0404912	19692/11847/12408/18984/18107/12368/17161	7

:::

Tip

ORA does not indicate the direction of differential regulation. To gain more insight, you can filter for only significantly up- or down-regulated genes.

Gene Set Enrichment Analysis (GSEA)

GSEA evaluates whether a set of genes shows coordinated expression changes, analyzing all genes even if they do not reach significance thresholds.

For GSEA, prepare an ordered list of all genes. They are typically sorted by fold change but other criteria could be chosen, such as the (fold change * adjusted p-value) product.

## rerun DEG identification to keep as many genes as possible
## even not significant genes
deg_entero <- FindMarkers(sobj_entero, 
                       ident.1 = "SC26_FMTG",
                       ident.2 = "SC26_TG",
                       test.use = "wilcox") %>% 
  arrange(desc(avg_log2FC)) %>% 
  rownames_to_column("SYMBOL")

## get entrez gene names thanks to bitr
trad_sbl_entrez <- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db")

'select()' returned 1:1 mapping between keys and columns

Warning in bitr(deg_entero$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
4.65% of input gene IDs are fail to map...

## join both data tables using symbol gene names as key
deg_entero <- deg_entero %>% 
  inner_join(trad_sbl_entrez)

Joining with `by = join_by(SYMBOL)`

## get list of DEG using their entrez id
gsea_input_list <- deg_entero %>%
  pull(avg_log2FC, name = ENTREZID)

We can run GSEA on the GO ontology.

gsea_go <- gseGO(gsea_input_list, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300) %>% 
  simplify()

using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

preparing geneSet collections...

GSEA analysis...

leading edge analysis...

done...

head(gsea_go@result)

	ID	Description	setSize	enrichmentScore	NES	pvalue	p.adjust	qvalue	rank	leading_edge	core_enrichment
GO:0030593	GO:0030593	neutrophil chemotaxis	57	-0.6503270	-2.603697	0e+00	0.0000014	1.30e-06	603	tags=25%, list=6%, signal=23%	13143/14127/20963/20202/13615/13614/56221/20296/20305/20308/20201/12986/20310/16176
GO:0097530	GO:0097530	granulocyte migration	78	-0.5878065	-2.550661	0e+00	0.0000014	1.30e-06	279	tags=21%, list=3%, signal=20%	13143/14127/68279/20963/20202/13615/13614/56221/20296/20305/20308/20201/16952/12986/20310/16176
GO:1990266	GO:1990266	neutrophil migration	70	-0.6108187	-2.548082	0e+00	0.0000014	1.30e-06	603	tags=21%, list=6%, signal=20%	13143/14127/68279/20963/20202/13615/13614/56221/20296/20305/20308/20201/12986/20310/16176
GO:0017156	GO:0017156	calcium-ion regulated exocytosis	38	0.7392215	2.387048	0e+00	0.0000015	1.30e-06	1420	tags=50%, list=14%, signal=43%	20614/54525/80976/116838/58226/12286/380714/53420/60425/382018/27062/11551/20907/229521/320405/18999/70450/109685/18439
GO:0070098	GO:0070098	chemokine-mediated signaling pathway	33	-0.7049968	-2.506332	2e-07	0.0001060	9.55e-05	123	tags=33%, list=1%, signal=33%	20303/80901/16923/13614/56221/12772/20296/20305/20308/12768/20310
GO:0006636	GO:0006636	unsaturated fatty acid biosynthetic process	20	-0.7863604	-2.427158	2e-07	0.0001087	9.79e-05	791	tags=45%, list=8%, signal=42%	66469/76267/18783/19224/16592/13615/13614/16952/16176

Important columns include:

setSize: Size of the gene set in the input list.
ES and NES: Enrichment score and its normalized version. A positive score indicates over-expression and a negative score indicates under-expression.
padj and qvalues: Adjusted p-values associated with the test.

Results can be visualized with plots.

## separate function with mainly over-expressed genes
## from functions with mainly under-expressed genes
gsea_go@result <- gsea_go@result %>% 
    mutate(nes_sign = ifelse(NES > 0, "Positive", "Negative"))

dotplot(gsea_go, showCategory = 10, split = "nes_sign")  +
  facet_grid(~nes_sign)

Question

What observations can you make about the enrichment function results ?

2.2 At single cell level gene set activity

Some methods calculate the activity of pathways or gene sets at the single-cell level. This can be used to visualize the activation of some functions of interest across a whole sample, cell annotation or cell state identification.

Here we will show how to perform this type of analysis with the package AUCell. For time purpose, we will focus on 1 apoptosis signature suggested by the authors of the article of origin.

## create a list of gene sets per GO term
gene_set <- list("p53_signature" = c("Ccng1", "Mdm2", "Eda2r", "Gtse1", "Polk", "Psrc1", "Zfp365"))

It would be possible to construct a list of gene sets taken from popular ontologies.

Then the GSEA at a single cell level is performed.

## get express matrix
exprMatrix <- GetAssayData(sobj, assay="RNA", layer = "count")

## computations
cells_rankings <- AUCell_buildRankings(exprMatrix, plotStats=F)
cells_AUC <- AUCell_calcAUC(gene_set, cells_rankings)

AUCell has a few functions to plot the results.

AUCell_plotTSNE(Embeddings(sobj, reduction="umap")[,c(1,2)], exprMatrix, cells_AUC)

Alternatively, the results can be added to the Seurat object metadata and plotted with Seurat functions.

auc_df <- AUCell::getAUC(cells_AUC) %>% 
  t() %>% 
  as.data.frame()

sobj <- AddMetaData(sobj, metadata = auc_df)

VlnPlot(sobj, features = "p53_signature")

FeaturePlot(sobj, features = "p53_signature", max.cutoff = 'q95')

References and documentation

clusterProfiler: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html

AUCell: https://www.bioconductor.org/packages/devel/bioc/vignettes/AUCell/inst/doc/AUCell.html

scGSVA: a tool for single cell level GSEA coming with many functions for references building and results representation, actually more user-friendly than AUCell: https://github.com/guokai8/scGSVA

Session info

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /shared/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] AUCell_1.26.0          msigdbr_7.5.1          org.Mm.eg.db_3.19.1   
 [4] AnnotationDbi_1.66.0   IRanges_2.38.1         S4Vectors_0.42.1      
 [7] Biobase_2.64.0         BiocGenerics_0.50.0    enrichplot_1.24.4     
[10] clusterProfiler_4.12.6 Seurat_5.1.0           SeuratObject_5.0.2    
[13] sp_2.1-4               lubridate_1.9.3        forcats_1.0.0         
[16] stringr_1.5.1          dplyr_1.1.4            purrr_1.0.2           
[19] readr_2.1.5            tidyr_1.3.1            tibble_3.2.1          
[22] ggplot2_3.5.1          tidyverse_2.0.0       

loaded via a namespace (and not attached):
  [1] segmented_2.1-0             fs_1.6.4                   
  [3] matrixStats_1.4.1           spatstat.sparse_3.1-0      
  [5] HDO.db_0.99.1               httr_1.4.7                 
  [7] RColorBrewer_1.1-3          tools_4.4.1                
  [9] sctransform_0.4.1           utf8_1.2.4                 
 [11] R6_2.5.1                    lazyeval_0.2.2             
 [13] uwot_0.2.2                  withr_3.0.1                
 [15] gridExtra_2.3               progressr_0.14.0           
 [17] cli_3.6.3                   spatstat.explore_3.3-2     
 [19] fastDummies_1.7.4           scatterpie_0.2.4           
 [21] labeling_0.4.3              spatstat.data_3.1-2        
 [23] ggridges_0.5.6              pbapply_1.7-2              
 [25] yulab.utils_0.1.7           gson_0.1.0                 
 [27] DOSE_3.30.1                 R.utils_2.12.3             
 [29] parallelly_1.38.0           limma_3.60.5               
 [31] rstudioapi_0.16.0           RSQLite_2.3.7              
 [33] generics_0.1.3              gridGraphics_0.5-1         
 [35] ica_1.0-3                   spatstat.random_3.3-2      
 [37] GO.db_3.19.1                Matrix_1.7-0               
 [39] ggbeeswarm_0.7.2            fansi_1.0.6                
 [41] abind_1.4-8                 R.methodsS3_1.8.2          
 [43] lifecycle_1.0.4             yaml_2.3.10                
 [45] SummarizedExperiment_1.34.0 qvalue_2.36.0              
 [47] SparseArray_1.4.8           Rtsne_0.17                 
 [49] grid_4.4.1                  blob_1.2.4                 
 [51] promises_1.3.0              crayon_1.5.3               
 [53] miniUI_0.1.1.1              lattice_0.22-6             
 [55] cowplot_1.1.3               annotate_1.82.0            
 [57] KEGGREST_1.44.1             pillar_1.9.0               
 [59] knitr_1.48                  fgsea_1.30.0               
 [61] GenomicRanges_1.56.1        future.apply_1.11.2        
 [63] codetools_0.2-20            fastmatch_1.1-4            
 [65] leiden_0.4.3.1              glue_1.8.0                 
 [67] ggfun_0.1.6                 spatstat.univar_3.0-1      
 [69] data.table_1.16.0           vctrs_0.6.5                
 [71] png_0.1-8                   treeio_1.28.0              
 [73] spam_2.10-0                 gtable_0.3.5               
 [75] kernlab_0.9-32              cachem_1.1.0               
 [77] xfun_0.48                   S4Arrays_1.4.1             
 [79] mime_0.12                   tidygraph_1.3.1            
 [81] survival_3.7-0              statmod_1.5.0              
 [83] fitdistrplus_1.2-1          ROCR_1.0-11                
 [85] nlme_3.1-165                ggtree_3.12.0              
 [87] bit64_4.0.5                 RcppAnnoy_0.0.22           
 [89] GenomeInfoDb_1.40.1         irlba_2.3.5.1              
 [91] vipor_0.4.7                 KernSmooth_2.23-24         
 [93] colorspace_2.1-1            DBI_1.2.3                  
 [95] ggrastr_1.0.2               tidyselect_1.2.1           
 [97] bit_4.0.5                   compiler_4.4.1             
 [99] httr2_1.0.1                 graph_1.82.0               
[101] DelayedArray_0.30.1         plotly_4.10.4              
[103] shadowtext_0.1.3            scales_1.3.0               
[105] lmtest_0.9-40               rappdirs_0.3.3             
[107] digest_0.6.37               goftest_1.2-3              
[109] mixtools_2.0.0              spatstat.utils_3.1-0       
[111] rmarkdown_2.28              XVector_0.44.0             
[113] htmltools_0.5.8.1           pkgconfig_2.0.3            
[115] sparseMatrixStats_1.16.0    MatrixGenerics_1.16.0      
[117] fastmap_1.2.0               rlang_1.1.4                
[119] htmlwidgets_1.6.4           UCSC.utils_1.0.0           
[121] shiny_1.9.1                 DelayedMatrixStats_1.26.0  
[123] farver_2.1.2                zoo_1.8-12                 
[125] jsonlite_1.8.9              BiocParallel_1.38.0        
[127] GOSemSim_2.30.2             R.oo_1.26.0                
[129] magrittr_2.0.3              GenomeInfoDbData_1.2.12    
[131] ggplotify_0.1.2             dotCall64_1.1-1            
[133] patchwork_1.3.0             munsell_0.5.1              
[135] Rcpp_1.0.13                 ape_5.8                    
[137] babelgene_22.9              viridis_0.6.5              
[139] reticulate_1.39.0           stringi_1.8.4              
[141] ggraph_2.2.1                zlibbioc_1.50.0            
[143] MASS_7.3-61                 plyr_1.8.9                 
[145] parallel_4.4.1              listenv_0.9.1              
[147] ggrepel_0.9.6               deldir_2.0-4               
[149] Biostrings_2.72.1           graphlayouts_1.1.1         
[151] splines_4.4.1               tensor_1.5                 
[153] hms_1.1.3                   igraph_2.0.3               
[155] spatstat.geom_3.3-3         RcppHNSW_0.6.0             
[157] reshape2_1.4.4              XML_3.99-0.17              
[159] evaluate_1.0.0              tzdb_0.4.0                 
[161] tweenr_2.0.3                httpuv_1.6.15              
[163] RANN_2.6.2                  polyclip_1.10-7            
[165] future_1.34.0               scattermore_1.2            
[167] ggforce_0.4.2               xtable_1.8-4               
[169] RSpectra_0.16-2             tidytree_0.4.6             
[171] later_1.3.2                 viridisLite_0.4.2          
[173] aplot_0.2.3                 beeswarm_0.4.0             
[175] memoise_2.0.1               cluster_2.1.6              
[177] timechange_0.3.0            globals_0.16.3             
[179] GSEABase_1.66.0

Bonus: pseudo-bulk example

## load data
## the data is pro B cells in 2 conditions: 
## healthy donors (PBMC) and leukemia patient (ETV6-RunX1)
## they come from this paper
## Caron M et al. Scientific Reports. 2020;10:1–12. Available from: http://dx.doi.org/10.1038/s41598-020-64929-x
proB <- readRDS("resources/data/proB.rds")

## 1 cell type only: pro B cells
unique(proB@meta.data$SingleR_annot)
unique(proB@meta.data$type)

## 3 replicates
unique(proB@meta.data$orig.ident)

Idents(proB) <- "orig.ident"

## method 1: stay in Seurat
sobj_pseudoblk <- AggregateExpression(proB,
                            assays = "RNA",
                            slot = "counts",
                            return.seurat = T)

sobj_pseudoblk@assays$RNA@layers$counts

sobj_pseudoblk@meta.data <- sobj_pseudoblk@meta.data %>% 
  mutate(type = ifelse(str_detect(orig.ident, "PBMMC"),
                       "PBMC",
                       "ETV6"))

Idents(sobj_pseudoblk) <- "type"

markers <- FindMarkers(sobj_pseudoblk, 
                       ident.1 = "PBMC", 
                       ident.2 = "ETV6",
                       test.use = "DESeq2")

markers_signifs <- markers %>% 
  filter(p_val_adj < 0.05)

## method 2: just get raw counts and use you favorite bulk RNA-Seq tool
cts <- AggregateExpression(proB,
                            assays = "RNA",
                            slot = "counts",
                            return.seurat = F)
cts <- cts$RNA

dim(cts)
colnames(cts)

coldata <- data.frame(sample = colnames(cts)) %>%
    mutate(condition = ifelse(grepl("PBMMC", sample), "ctrl", "cancer")) %>%
    column_to_rownames("sample")

library(DESeq2)

dds <- DESeqDataSetFromMatrix(cts, colData = coldata, design = ~ condition)

keep <- rowSums(cts) >= 10
dds <- dds[keep,]
dds <- DESeq(dds)

markers_singif2 <- results(dds, name = "condition_ctrl_vs_cancer") %>%
    as.data.frame() %>%
    arrange(padj) %>%
    filter(padj < 0.05)