suppressWarnings({
suppressMessages({
library(tidyverse)
library(Seurat)
library(clusterProfiler)
library(enrichplot)
library(org.Mm.eg.db)
library(msigdbr)
library(AUCell)
library(ggplot2)
})
})
options(future.globals.maxSize = 2.5 * 1e9)sincellTE 2024 - TP Secondary Analysis: differential expression, functional enrichment
Objectives
After identifying cell populations, the next crucial steps—differential gene expression (DGE) and functional enrichment—uncover the molecular signatures of cells and elucidate their functional roles. These steps are vital for exploring cellular heterogeneity and extracting meaningful biological insights from data.
In this session, we will perform DGE analysis to identify genes that serve as specific markers for cell populations or are significantly altered between conditions. Additionally, we will explore functional enrichment methods to interpret these findings in the context of biological pathways.
1. Differential expression
The dataset used for these sessions comes from the article “p53 drives a transcriptional program that elicits a non-cell-autonomous response and alters cell state in vivo” by S.M. Moyer, et al. PNAS 2020. (https://doi.org/10.1073/pnas.2008474117).
It contains 2 single cell samples corresponding to 2 conditions: cells with overactivated p53 (FMTG) and wild-type cells (TG).
For the needs of this session, we will to load an object with both datasets already integrated.
user.id <- Sys.getenv("USER")
input_file <-paste("/shared/home2", user.id, "/deg/resources/data/seurat_object_annotated.rds", sep = "/")
output_dir <- paste0("/shared/home2", user.id, "deg/results/", sep = "/")
sobj <- readRDS(input_file)1.1 Cell population markers
After clustering or annotating cells, we can identify genes specific to each population by detecting differentially expressed genes (DEGs) between a given population and all other cells. The function FindAllMarkers() facilitates this analysis for each cell population.
First, we need to set the cell identities to indicate the populations in our Seurat object. We will utilize the feature celltype for this purpose.
sobj <- SetIdent(sobj, value = "celltype")
levels(sobj@active.ident)[1] "B_cells" "CD8" "Enterocyte_proximal"
[4] "Goblet_cells" "CD4_memory" "Tuft_cells"
[7] "Stem_cells" "Enterocyte_distal" "CD4_naive"
Next, we choose a method for DEG computation. Several statistical tests are available in Seurat. Methods with fewer assumptions generally yield more robust results. Here, we will use the default Wilcoxon test but many other methods are available: wilcox, wilcox_limma, bimod, roc, t, negbinom, poisson, LR, MAST, DESeq2…
Some methods, like wilcox, require normalized data. We have two options:
- Use log-normalized data.
DefaultAssay(sobj) <- "RNA"
## get log normalized data if not already done
# sobj <- sobj %>%
# FindVariableFeatures() %>%
# NormalizeData() %>%
# ScaleData()
## compute population markers
df_markers <- FindAllMarkers(sobj,
assay = "RNA",
logfc.threshold = log2(1.5),
min.pct = 0.5,
only.pos = T,
verbose=F)- Use SCT normalized data. If working with integrated datasets, an extra-step is needed to reverse the SCT regression model to UMI counts.
sobj <- PrepSCTFindMarkers(sobj, assay="SCT", verbose = F)
df_markers <- FindAllMarkers(sobj,
assay = "SCT",
logfc.threshold = log2(1.5),
min.pct = 0.5,
only.pos = T,
verbose=F)By default, Seurat tests all expressed genes. However, you can modify the behavior with arguments like:
min.pct: Only test genes detected in at least min.pct of cells (default is 0.1).logfc.threshold: Limit testing to genes with a minimum log-fold change (default is 0.1).only.pos: Return only positive markers (FALSE by default).
For this session, we will adjust these arguments to focus on strong population markers.
The genes with significant differential expression can then be selected by keeping only adjusted p-values that are small enough.
df_markers_signif <- df_markers %>%
filter(p_val_adj < 0.05)
head(df_markers_signif)| p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene | |
|---|---|---|---|---|---|---|---|
| Cd79a | 0 | 9.390336 | 0.966 | 0.012 | 0 | B_cells | Cd79a |
| Ebf1 | 0 | 9.287841 | 0.906 | 0.005 | 0 | B_cells | Ebf1 |
| H2-Aa | 0 | 5.548739 | 0.979 | 0.130 | 0 | B_cells | H2-Aa |
| Mef2c | 0 | 7.201354 | 0.858 | 0.018 | 0 | B_cells | Mef2c |
| Ms4a1 | 0 | 8.508328 | 0.842 | 0.009 | 0 | B_cells | Ms4a1 |
| Cd79b | 0 | 5.283376 | 0.944 | 0.128 | 0 | B_cells | Cd79b |
Significant DEGs can be identified by filtering based on adjusted p-values. The results table includes:
- Gene name
- The p-value and adjusted p-value. To be considered differentially expressed, genes adjusted p-values must be smaller than a chosen threshold are.
- Log-fold change: $ ( ) $.
- Percent of cells expressing the gene in each condition
nb_markers <- df_markers_signif %>%
group_by(cluster) %>%
tally()
nb_markers| cluster | n |
|---|---|
| B_cells | 163 |
| CD8 | 322 |
| Enterocyte_proximal | 383 |
| Goblet_cells | 266 |
| CD4_memory | 733 |
| Tuft_cells | 212 |
| Stem_cells | 758 |
| Enterocyte_distal | 3 |
| CD4_naive | 159 |
ggplot(nb_markers, aes(x = cluster, y=n, fill=cluster)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))The table of DEG is not automatically added to the Seurat object. It is possible to add it manually, for example in the slot:
sobj@misc$marker_genes <- df_markers_signifThe DEG results can be saved as a CSV file for further exploration:
write.csv(df_markers, file.path(output_dir, "de_genes_FindAllMarkers.csv"), row.names = F, quote = F)
Finally, visualizations include:
- Violin plot and feature plot.
genes_of_interest <- df_markers_signif %>%
group_by(cluster) %>%
arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>%
slice_head(n=1) %>%
pull(gene, name=cluster)
DefaultAssay(sobj) <- "SCT"
VlnPlot(sobj, features = genes_of_interest, slot="data", ncol=4)Warning: The `slot` argument of `VlnPlot()` is deprecated as of Seurat 5.0.0.
ℹ Please use the `layer` argument instead.
FeaturePlot(sobj, features = genes_of_interest,
max.cutoff = 'q90', ncol = 4)- Dot plot for top markers per cell type.
genes_of_interest <- df_markers_signif %>%
group_by(cluster) %>%
arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>%
slice_head(n=3) %>%
pull(gene)
DotPlot(sobj, features = genes_of_interest,
col.min = 0) +
RotatedAxis()- Heatmaps for a comprehensive view of markers.
genes_of_interest <- df_markers_signif %>%
group_by(cluster) %>%
arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>%
slice_head(n=50) %>%
pull(gene)
DoHeatmap(sobj, features = genes_of_interest, label=F) +
theme(axis.text.y = element_blank()) # comment to display gene namesWarning in DoHeatmap(sobj, features = genes_of_interest, label = F): The
following features were omitted as they were not found in the scale.data slot
for the SCT assay: Rpl34, Rps4x, Rpl5, Rps3a1, Rpl37a, Rpl19, Rps21, Rps27a,
Rps7, Rps18, Rps10, Rpl18, Rplp2, Rps3, Rpl30, Rps14, Rpl35a, Rplp0, Rpl23,
Rps16, Tmsb10, Nup88, Slc44a1, Lysmd2, Samd14, Itprid1, Chat, Tmprss2, Camsap3,
Liph, Fermt1, Slc22a23, Ttc39a, Cideb, Ntan1, H3f3a, Fau, Rps5, Rpl18a, Rpl8
Seurat uses the scale.data layer for heatmaps, containing only the most variable genes identified during normalization. If certain genes aren’t displayed, they may not be highly variable so this should not be a problem. To include them, rerun log-normalization on specific genes.
sobj <- sobj %>%
FindVariableFeatures(nfeatures = nrow(sobj), verbose = FALSE) %>%
NormalizeData() %>%
ScaleData(seurat_obj, features = genes_to_scale)1.2 Differential expression between populations or conditions
Having identified markers for each cell type, we can also perform pairwise comparisons to test for differential gene expression between:
- two cell populations.
- cells of the same population under different conditions (e.g., treated vs. control).
For instance, we can examine the effect of p53 overactivation on enterocytes by comparing enterocytes with overactivated p53 (FMTG) to wild-type enterocytes (TG).
The FindMarkers() function enables this analysis. A few command lines will allow us to test the two conditions for the enterocyte population.
## subset the dataset to keep only population of interest
sobj_entero <- subset(sobj, idents = "Enterocyte_proximal")
## the condition to test is the genotype
Idents(sobj_entero) <- "orig.ident"
levels(sobj_entero)[1] "SC26_TG" "SC26_FMTG"
## let's work on log-normalized counts
DefaultAssay(sobj_entero) <- "RNA"
sobj_entero <- sobj_entero %>%
FindVariableFeatures() %>%
NormalizeData() %>%
ScaleData()Finding variable features for layer counts
Normalizing layer: counts
Centering and scaling data matrix
Warning: Different features in new layer data than already exists for
scale.data
# deg_entero <- PrepSCTFindMarkers(sobj_entero)
deg_entero <- FindMarkers(sobj_entero,
ident.1 = "SC26_FMTG",
ident.2 = "SC26_TG",
test.use = "wilcox",
logfc.threshold = log2(1.5),
min.pct = 0.5) %>%
# recorrect_umi = FALSE) %>%
filter(p_val_adj < 0.05) %>%
arrange(p_val_adj, desc(avg_log2FC))
sobj@misc$deg_entero <- deg_entero2. Function enrichment
2.1 In a cell population
Once DEGs are identified, functional enrichment helps interpret the biological significance of these genes. It tests whether genes belonging to certain biological functions are over- or under-expressed beyond what would be expected by chance.
We will use the clusterProfiler package for this analysis, employing over-representation analysis (ORA) and Gene Set Enrichment Analysis (GSEA).
Over representation analysis (ORA)
ORA is a common method for functional enrichment, assessing whether more genes in your list of DEGs are associated with a specific biological function than expected by chance.
Key inputs include:
- the list of DEGs: Generally requires genes in
Entrez IDformat. We can convert gene names from Symbol format usingclusterProfiler::bitr()
## add symbol gene names as a columns
deg_entero <- deg_entero %>%
rownames_to_column("SYMBOL")
## get entrez gene names thanks to bitr
trad_sbl_entrez <- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db")'select()' returned 1:1 mapping between keys and columns
Warning in bitr(deg_entero$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
3.51% of input gene IDs are fail to map...
## join both data tables using symbol gene names as key
deg_entero <- deg_entero %>%
inner_join(trad_sbl_entrez)Joining with `by = join_by(SYMBOL)`
head(deg_entero)| SYMBOL | p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | ENTREZID |
|---|---|---|---|---|---|---|
| Zg16 | 0 | -3.0775553 | 0.665 | 0.934 | 0 | 69036 |
| Ubb | 0 | -0.6940597 | 0.978 | 0.992 | 0 | 22187 |
| Fth1 | 0 | -1.3476178 | 0.989 | 1.000 | 0 | 14319 |
| Prdx1 | 0 | -1.0359897 | 0.797 | 0.886 | 0 | 18477 |
| Malat1 | 0 | 0.9994706 | 0.997 | 0.989 | 0 | 72289 |
| Ftl1 | 0 | -0.9831995 | 0.929 | 0.972 | 0 | 14325 |
## get list of DEG using their entrez id
list_deg_entero <- deg_entero$ENTREZIDThe annotation types supported by the by the org.Mm.eg.db database ca be listed:
AnnotationDbi::keytypes(org.Mm.eg.db) [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
[16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
[21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
- a reference: A list of biological functions represented as annotated gene sets (e.g., Gene Ontology (GO), KEGG).
We will query the GO database using the org.Mm.eg.db reference to perform enrichment tests.
For other organisms, you can find available OrgDbs on bioconductor.
## perform ORA
go <- enrichGO(list_deg_entero, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300)
## avoid redundancy in result terms
go <- simplify(go)Key parameters include:
ont: Choose a GO subcategory between Biological Process (“BP”), Molecular Function (“MF”), Cellular Component (“CC”), or all (“ALL”).minGSSize: Exclude gene sets with fewer than minGSSize genes (removes small sets prone to noise).maxGSSize: Exclude gene sets with more than maxGSSize genes (removes overly broad sets).
head(go@result)| ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
|---|---|---|---|---|---|---|---|---|---|
| GO:0009060 | GO:0009060 | aerobic respiration | 11/107 | 199/28905 | 0 | 2e-07 | 1e-07 | 17448/17449/15251/13063/12858/103172/67680/17995/66694/66445/69875 | 11 |
| GO:0140236 | GO:0140236 | translation at presynapse | 7/107 | 47/28905 | 0 | 2e-07 | 1e-07 | 57294/54127/67671/19981/54217/67281/27370 | 7 |
| GO:0046166 | GO:0046166 | glyceraldehyde-3-phosphate biosynthetic process | 5/107 | 12/28905 | 0 | 2e-07 | 1e-07 | 17448/17449/14433/13806/18648 | 5 |
| GO:0140241 | GO:0140241 | translation at synapse | 7/107 | 48/28905 | 0 | 2e-07 | 1e-07 | 57294/54127/67671/19981/54217/67281/27370 | 7 |
| GO:0140242 | GO:0140242 | translation at postsynapse | 7/107 | 48/28905 | 0 | 2e-07 | 1e-07 | 57294/54127/67671/19981/54217/67281/27370 | 7 |
| GO:0006094 | GO:0006094 | gluconeogenesis | 8/107 | 95/28905 | 0 | 8e-07 | 6e-07 | 17448/17449/15251/14433/13806/21416/18648/67800 | 8 |
The function returns an enrichResult object with ORA table stored in the @result slot. Important columns include:
IDandDescription: GO identifier and brief description of the biological process.Count: Number of genes from the input dataset annotated to the specific GO term.GeneRatio: Ratio of genes in your input dataset associated with the GO term.BgRatio: Ratio of genes associated with the GO term in the reference gene set.p.adjustandqvalue: Adjusted p-values for selecting significantly enriched functions.
Results can be visualized using a dot plot.
dotplot(go, showCategory = 20)Prebuilt functions like enrichGO() and enrichKEGG() enable users to query widely used ontologies. The enricher() function facilitates the testing of over-representation for any gene set, for example MSigDB.
## load MSigDBr database, subcategory H
gmt <- msigdbr(species = "mouse", category = "H")
head(gmt)| gs_cat | gs_subcat | gs_name | gene_symbol | entrez_gene | ensembl_gene | human_gene_symbol | human_entrez_gene | human_ensembl_gene | gs_id | gs_pmid | gs_geoid | gs_exact_source | gs_url | gs_description | taxon_id | ortholog_sources | num_ortholog_sources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H | HALLMARK_ADIPOGENESIS | Abca1 | 11303 | ENSMUSG00000015243 | ABCA1 | 19 | ENSG00000165029 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam | 11 | ||||
| H | HALLMARK_ADIPOGENESIS | Abcb8 | 74610 | ENSMUSG00000028973 | ABCB8 | 11194 | ENSG00000197150 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam | 12 | ||||
| H | HALLMARK_ADIPOGENESIS | Acaa2 | 52538 | ENSMUSG00000036880 | ACAA2 | 10449 | ENSG00000167315 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam | 11 | ||||
| H | HALLMARK_ADIPOGENESIS | Acadl | 11363 | ENSMUSG00000026003 | ACADL | 33 | ENSG00000115361 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam | 12 | ||||
| H | HALLMARK_ADIPOGENESIS | Acadm | 11364 | ENSMUSG00000062908 | ACADM | 34 | ENSG00000117054 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam | 12 | ||||
| H | HALLMARK_ADIPOGENESIS | Acads | 11409 | ENSMUSG00000029545 | ACADS | 35 | ENSG00000122971 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam | 11 |
ora <- enricher(gene = list_deg_entero, pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, TERM2GENE = gmt[,c("gs_name", "entrez_gene")], minGSSize = 10, maxGSSize = 300)
head(ora@result)| ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
|---|---|---|---|---|---|---|---|---|---|
| HALLMARK_OXIDATIVE_PHOSPHORYLATION | HALLMARK_OXIDATIVE_PHOSPHORYLATION | HALLMARK_OXIDATIVE_PHOSPHORYLATION | 10/52 | 199/4394 | 0.0000865 | 0.0032005 | 0.0024584 | 17448/17449/13063/18674/12858/18984/67680/17995/66694/66445 | 10 |
| HALLMARK_ADIPOGENESIS | HALLMARK_ADIPOGENESIS | HALLMARK_ADIPOGENESIS | 9/52 | 200/4394 | 0.0004685 | 0.0057778 | 0.0044381 | 17448/20280/103172/18984/67680/18107/66445/20655/11637 | 9 |
| HALLMARK_GLYCOLYSIS | HALLMARK_GLYCOLYSIS | HALLMARK_GLYCOLYSIS | 9/52 | 200/4394 | 0.0004685 | 0.0057778 | 0.0044381 | 17448/17449/13806/93692/14595/12368/18648/20655/21786 | 9 |
| HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY | HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY | HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY | 4/52 | 49/4394 | 0.0024946 | 0.0230753 | 0.0177250 | 18477/14325/93692/20655 | 4 |
| HALLMARK_MTORC1_SIGNALING | HALLMARK_MTORC1_SIGNALING | HALLMARK_MTORC1_SIGNALING | 7/52 | 201/4394 | 0.0087632 | 0.0527135 | 0.0404912 | 18477/14433/13806/93692/12450/18107/22433 | 7 |
| HALLMARK_XENOBIOTIC_METABOLISM | HALLMARK_XENOBIOTIC_METABOLISM | HALLMARK_XENOBIOTIC_METABOLISM | 7/52 | 203/4394 | 0.0092332 | 0.0527135 | 0.0404912 | 19692/11847/12408/18984/18107/12368/17161 | 7 |
:::
ORA does not indicate the direction of differential regulation. To gain more insight, you can filter for only significantly up- or down-regulated genes.
Gene Set Enrichment Analysis (GSEA)
GSEA evaluates whether a set of genes shows coordinated expression changes, analyzing all genes even if they do not reach significance thresholds.
For GSEA, prepare an ordered list of all genes. They are typically sorted by fold change but other criteria could be chosen, such as the (fold change * adjusted p-value) product.
## rerun DEG identification to keep as many genes as possible
## even not significant genes
deg_entero <- FindMarkers(sobj_entero,
ident.1 = "SC26_FMTG",
ident.2 = "SC26_TG",
test.use = "wilcox") %>%
arrange(desc(avg_log2FC)) %>%
rownames_to_column("SYMBOL")
## get entrez gene names thanks to bitr
trad_sbl_entrez <- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db")'select()' returned 1:1 mapping between keys and columns
Warning in bitr(deg_entero$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
4.65% of input gene IDs are fail to map...
## join both data tables using symbol gene names as key
deg_entero <- deg_entero %>%
inner_join(trad_sbl_entrez)Joining with `by = join_by(SYMBOL)`
## get list of DEG using their entrez id
gsea_input_list <- deg_entero %>%
pull(avg_log2FC, name = ENTREZID)We can run GSEA on the GO ontology.
gsea_go <- gseGO(gsea_input_list, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300) %>%
simplify()using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
head(gsea_go@result)| ID | Description | setSize | enrichmentScore | NES | pvalue | p.adjust | qvalue | rank | leading_edge | core_enrichment | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GO:0030593 | GO:0030593 | neutrophil chemotaxis | 57 | -0.6503270 | -2.603697 | 0e+00 | 0.0000014 | 1.30e-06 | 603 | tags=25%, list=6%, signal=23% | 13143/14127/20963/20202/13615/13614/56221/20296/20305/20308/20201/12986/20310/16176 |
| GO:0097530 | GO:0097530 | granulocyte migration | 78 | -0.5878065 | -2.550661 | 0e+00 | 0.0000014 | 1.30e-06 | 279 | tags=21%, list=3%, signal=20% | 13143/14127/68279/20963/20202/13615/13614/56221/20296/20305/20308/20201/16952/12986/20310/16176 |
| GO:1990266 | GO:1990266 | neutrophil migration | 70 | -0.6108187 | -2.548082 | 0e+00 | 0.0000014 | 1.30e-06 | 603 | tags=21%, list=6%, signal=20% | 13143/14127/68279/20963/20202/13615/13614/56221/20296/20305/20308/20201/12986/20310/16176 |
| GO:0017156 | GO:0017156 | calcium-ion regulated exocytosis | 38 | 0.7392215 | 2.387048 | 0e+00 | 0.0000015 | 1.30e-06 | 1420 | tags=50%, list=14%, signal=43% | 20614/54525/80976/116838/58226/12286/380714/53420/60425/382018/27062/11551/20907/229521/320405/18999/70450/109685/18439 |
| GO:0070098 | GO:0070098 | chemokine-mediated signaling pathway | 33 | -0.7049968 | -2.506332 | 2e-07 | 0.0001060 | 9.55e-05 | 123 | tags=33%, list=1%, signal=33% | 20303/80901/16923/13614/56221/12772/20296/20305/20308/12768/20310 |
| GO:0006636 | GO:0006636 | unsaturated fatty acid biosynthetic process | 20 | -0.7863604 | -2.427158 | 2e-07 | 0.0001087 | 9.79e-05 | 791 | tags=45%, list=8%, signal=42% | 66469/76267/18783/19224/16592/13615/13614/16952/16176 |
Important columns include:
setSize: Size of the gene set in the input list.ESandNES: Enrichment score and its normalized version. A positive score indicates over-expression and a negative score indicates under-expression.padjandqvalues: Adjusted p-values associated with the test.
Results can be visualized with plots.
## separate function with mainly over-expressed genes
## from functions with mainly under-expressed genes
gsea_go@result <- gsea_go@result %>%
mutate(nes_sign = ifelse(NES > 0, "Positive", "Negative"))
dotplot(gsea_go, showCategory = 10, split = "nes_sign") +
facet_grid(~nes_sign)What observations can you make about the enrichment function results ?
2.2 At single cell level gene set activity
Some methods calculate the activity of pathways or gene sets at the single-cell level. This can be used to visualize the activation of some functions of interest across a whole sample, cell annotation or cell state identification.
Here we will show how to perform this type of analysis with the package AUCell. For time purpose, we will focus on 1 apoptosis signature suggested by the authors of the article of origin.
## create a list of gene sets per GO term
gene_set <- list("p53_signature" = c("Ccng1", "Mdm2", "Eda2r", "Gtse1", "Polk", "Psrc1", "Zfp365"))It would be possible to construct a list of gene sets taken from popular ontologies.
Then the GSEA at a single cell level is performed.
## get express matrix
exprMatrix <- GetAssayData(sobj, assay="RNA", layer = "count")
## computations
cells_rankings <- AUCell_buildRankings(exprMatrix, plotStats=F)
cells_AUC <- AUCell_calcAUC(gene_set, cells_rankings)AUCell has a few functions to plot the results.
AUCell_plotTSNE(Embeddings(sobj, reduction="umap")[,c(1,2)], exprMatrix, cells_AUC)Alternatively, the results can be added to the Seurat object metadata and plotted with Seurat functions.
auc_df <- AUCell::getAUC(cells_AUC) %>%
t() %>%
as.data.frame()
sobj <- AddMetaData(sobj, metadata = auc_df)
VlnPlot(sobj, features = "p53_signature")FeaturePlot(sobj, features = "p53_signature", max.cutoff = 'q95')References and documentation
clusterProfiler: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
AUCell: https://www.bioconductor.org/packages/devel/bioc/vignettes/AUCell/inst/doc/AUCell.html
scGSVA: a tool for single cell level GSEA coming with many functions for references building and results representation, actually more user-friendly than AUCell: https://github.com/guokai8/scGSVA
Session info
sessionInfo()R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS/LAPACK: /shared/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] AUCell_1.26.0 msigdbr_7.5.1 org.Mm.eg.db_3.19.1
[4] AnnotationDbi_1.66.0 IRanges_2.38.1 S4Vectors_0.42.1
[7] Biobase_2.64.0 BiocGenerics_0.50.0 enrichplot_1.24.4
[10] clusterProfiler_4.12.6 Seurat_5.1.0 SeuratObject_5.0.2
[13] sp_2.1-4 lubridate_1.9.3 forcats_1.0.0
[16] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[19] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[22] ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] segmented_2.1-0 fs_1.6.4
[3] matrixStats_1.4.1 spatstat.sparse_3.1-0
[5] HDO.db_0.99.1 httr_1.4.7
[7] RColorBrewer_1.1-3 tools_4.4.1
[9] sctransform_0.4.1 utf8_1.2.4
[11] R6_2.5.1 lazyeval_0.2.2
[13] uwot_0.2.2 withr_3.0.1
[15] gridExtra_2.3 progressr_0.14.0
[17] cli_3.6.3 spatstat.explore_3.3-2
[19] fastDummies_1.7.4 scatterpie_0.2.4
[21] labeling_0.4.3 spatstat.data_3.1-2
[23] ggridges_0.5.6 pbapply_1.7-2
[25] yulab.utils_0.1.7 gson_0.1.0
[27] DOSE_3.30.1 R.utils_2.12.3
[29] parallelly_1.38.0 limma_3.60.5
[31] rstudioapi_0.16.0 RSQLite_2.3.7
[33] generics_0.1.3 gridGraphics_0.5-1
[35] ica_1.0-3 spatstat.random_3.3-2
[37] GO.db_3.19.1 Matrix_1.7-0
[39] ggbeeswarm_0.7.2 fansi_1.0.6
[41] abind_1.4-8 R.methodsS3_1.8.2
[43] lifecycle_1.0.4 yaml_2.3.10
[45] SummarizedExperiment_1.34.0 qvalue_2.36.0
[47] SparseArray_1.4.8 Rtsne_0.17
[49] grid_4.4.1 blob_1.2.4
[51] promises_1.3.0 crayon_1.5.3
[53] miniUI_0.1.1.1 lattice_0.22-6
[55] cowplot_1.1.3 annotate_1.82.0
[57] KEGGREST_1.44.1 pillar_1.9.0
[59] knitr_1.48 fgsea_1.30.0
[61] GenomicRanges_1.56.1 future.apply_1.11.2
[63] codetools_0.2-20 fastmatch_1.1-4
[65] leiden_0.4.3.1 glue_1.8.0
[67] ggfun_0.1.6 spatstat.univar_3.0-1
[69] data.table_1.16.0 vctrs_0.6.5
[71] png_0.1-8 treeio_1.28.0
[73] spam_2.10-0 gtable_0.3.5
[75] kernlab_0.9-32 cachem_1.1.0
[77] xfun_0.48 S4Arrays_1.4.1
[79] mime_0.12 tidygraph_1.3.1
[81] survival_3.7-0 statmod_1.5.0
[83] fitdistrplus_1.2-1 ROCR_1.0-11
[85] nlme_3.1-165 ggtree_3.12.0
[87] bit64_4.0.5 RcppAnnoy_0.0.22
[89] GenomeInfoDb_1.40.1 irlba_2.3.5.1
[91] vipor_0.4.7 KernSmooth_2.23-24
[93] colorspace_2.1-1 DBI_1.2.3
[95] ggrastr_1.0.2 tidyselect_1.2.1
[97] bit_4.0.5 compiler_4.4.1
[99] httr2_1.0.1 graph_1.82.0
[101] DelayedArray_0.30.1 plotly_4.10.4
[103] shadowtext_0.1.3 scales_1.3.0
[105] lmtest_0.9-40 rappdirs_0.3.3
[107] digest_0.6.37 goftest_1.2-3
[109] mixtools_2.0.0 spatstat.utils_3.1-0
[111] rmarkdown_2.28 XVector_0.44.0
[113] htmltools_0.5.8.1 pkgconfig_2.0.3
[115] sparseMatrixStats_1.16.0 MatrixGenerics_1.16.0
[117] fastmap_1.2.0 rlang_1.1.4
[119] htmlwidgets_1.6.4 UCSC.utils_1.0.0
[121] shiny_1.9.1 DelayedMatrixStats_1.26.0
[123] farver_2.1.2 zoo_1.8-12
[125] jsonlite_1.8.9 BiocParallel_1.38.0
[127] GOSemSim_2.30.2 R.oo_1.26.0
[129] magrittr_2.0.3 GenomeInfoDbData_1.2.12
[131] ggplotify_0.1.2 dotCall64_1.1-1
[133] patchwork_1.3.0 munsell_0.5.1
[135] Rcpp_1.0.13 ape_5.8
[137] babelgene_22.9 viridis_0.6.5
[139] reticulate_1.39.0 stringi_1.8.4
[141] ggraph_2.2.1 zlibbioc_1.50.0
[143] MASS_7.3-61 plyr_1.8.9
[145] parallel_4.4.1 listenv_0.9.1
[147] ggrepel_0.9.6 deldir_2.0-4
[149] Biostrings_2.72.1 graphlayouts_1.1.1
[151] splines_4.4.1 tensor_1.5
[153] hms_1.1.3 igraph_2.0.3
[155] spatstat.geom_3.3-3 RcppHNSW_0.6.0
[157] reshape2_1.4.4 XML_3.99-0.17
[159] evaluate_1.0.0 tzdb_0.4.0
[161] tweenr_2.0.3 httpuv_1.6.15
[163] RANN_2.6.2 polyclip_1.10-7
[165] future_1.34.0 scattermore_1.2
[167] ggforce_0.4.2 xtable_1.8-4
[169] RSpectra_0.16-2 tidytree_0.4.6
[171] later_1.3.2 viridisLite_0.4.2
[173] aplot_0.2.3 beeswarm_0.4.0
[175] memoise_2.0.1 cluster_2.1.6
[177] timechange_0.3.0 globals_0.16.3
[179] GSEABase_1.66.0
Bonus: pseudo-bulk example
## load data
## the data is pro B cells in 2 conditions:
## healthy donors (PBMC) and leukemia patient (ETV6-RunX1)
## they come from this paper
## Caron M et al. Scientific Reports. 2020;10:1–12. Available from: http://dx.doi.org/10.1038/s41598-020-64929-x
proB <- readRDS("resources/data/proB.rds")
## 1 cell type only: pro B cells
unique(proB@meta.data$SingleR_annot)
unique(proB@meta.data$type)
## 3 replicates
unique(proB@meta.data$orig.ident)
Idents(proB) <- "orig.ident"
## method 1: stay in Seurat
sobj_pseudoblk <- AggregateExpression(proB,
assays = "RNA",
slot = "counts",
return.seurat = T)
sobj_pseudoblk@assays$RNA@layers$counts
sobj_pseudoblk@meta.data <- sobj_pseudoblk@meta.data %>%
mutate(type = ifelse(str_detect(orig.ident, "PBMMC"),
"PBMC",
"ETV6"))
Idents(sobj_pseudoblk) <- "type"
markers <- FindMarkers(sobj_pseudoblk,
ident.1 = "PBMC",
ident.2 = "ETV6",
test.use = "DESeq2")
markers_signifs <- markers %>%
filter(p_val_adj < 0.05)
## method 2: just get raw counts and use you favorite bulk RNA-Seq tool
cts <- AggregateExpression(proB,
assays = "RNA",
slot = "counts",
return.seurat = F)
cts <- cts$RNA
dim(cts)
colnames(cts)
coldata <- data.frame(sample = colnames(cts)) %>%
mutate(condition = ifelse(grepl("PBMMC", sample), "ctrl", "cancer")) %>%
column_to_rownames("sample")
library(DESeq2)
dds <- DESeqDataSetFromMatrix(cts, colData = coldata, design = ~ condition)
keep <- rowSums(cts) >= 10
dds <- dds[keep,]
dds <- DESeq(dds)
markers_singif2 <- results(dds, name = "condition_ctrl_vs_cancer") %>%
as.data.frame() %>%
arrange(padj) %>%
filter(padj < 0.05)