suppressWarnings({
suppressMessages({
library(tidyverse)
library(Seurat)
library(clusterProfiler)
library(enrichplot)
library(org.Mm.eg.db)
library(msigdbr)
library(AUCell)
library(ggplot2)
})
})
options(future.globals.maxSize = 2.5 * 1e9)
sincellTE 2024 - TP Secondary Analysis: differential expression, functional enrichment
Objectives
After identifying cell populations, the next crucial steps—differential gene expression (DGE) and functional enrichment—uncover the molecular signatures of cells and elucidate their functional roles. These steps are vital for exploring cellular heterogeneity and extracting meaningful biological insights from data.
In this session, we will perform DGE analysis to identify genes that serve as specific markers for cell populations or are significantly altered between conditions. Additionally, we will explore functional enrichment methods to interpret these findings in the context of biological pathways.
1. Differential expression
The dataset used for these sessions comes from the article “p53 drives a transcriptional program that elicits a non-cell-autonomous response and alters cell state in vivo” by S.M. Moyer, et al. PNAS 2020. (https://doi.org/10.1073/pnas.2008474117).
It contains 2 single cell samples corresponding to 2 conditions: cells with overactivated p53 (FMTG) and wild-type cells (TG).
For the needs of this session, we will to load an object with both datasets already integrated.
<- Sys.getenv("USER")
user.id <-paste("/shared/home2", user.id, "/deg/resources/data/seurat_object_annotated.rds", sep = "/")
input_file <- paste0("/shared/home2", user.id, "deg/results/", sep = "/")
output_dir <- readRDS(input_file) sobj
1.1 Cell population markers
After clustering or annotating cells, we can identify genes specific to each population by detecting differentially expressed genes (DEGs) between a given population and all other cells. The function FindAllMarkers()
facilitates this analysis for each cell population.
First, we need to set the cell identities to indicate the populations in our Seurat object. We will utilize the feature celltype
for this purpose.
<- SetIdent(sobj, value = "celltype")
sobj levels(sobj@active.ident)
[1] "B_cells" "CD8" "Enterocyte_proximal"
[4] "Goblet_cells" "CD4_memory" "Tuft_cells"
[7] "Stem_cells" "Enterocyte_distal" "CD4_naive"
Next, we choose a method for DEG computation. Several statistical tests are available in Seurat. Methods with fewer assumptions generally yield more robust results. Here, we will use the default Wilcoxon test but many other methods are available: wilcox
, wilcox_limma
, bimod
, roc
, t
, negbinom
, poisson
, LR
, MAST
, DESeq2
…
Some methods, like wilcox, require normalized data. We have two options:
- Use log-normalized data.
DefaultAssay(sobj) <- "RNA"
## get log normalized data if not already done
# sobj <- sobj %>%
# FindVariableFeatures() %>%
# NormalizeData() %>%
# ScaleData()
## compute population markers
<- FindAllMarkers(sobj,
df_markers assay = "RNA",
logfc.threshold = log2(1.5),
min.pct = 0.5,
only.pos = T,
verbose=F)
- Use SCT normalized data. If working with integrated datasets, an extra-step is needed to reverse the SCT regression model to UMI counts.
<- PrepSCTFindMarkers(sobj, assay="SCT", verbose = F)
sobj <- FindAllMarkers(sobj,
df_markers assay = "SCT",
logfc.threshold = log2(1.5),
min.pct = 0.5,
only.pos = T,
verbose=F)
By default, Seurat tests all expressed genes. However, you can modify the behavior with arguments like:
min.pct
: Only test genes detected in at least min.pct of cells (default is 0.1).logfc.threshold
: Limit testing to genes with a minimum log-fold change (default is 0.1).only.pos
: Return only positive markers (FALSE by default).
For this session, we will adjust these arguments to focus on strong population markers.
The genes with significant differential expression can then be selected by keeping only adjusted p-values that are small enough.
<- df_markers %>%
df_markers_signif filter(p_val_adj < 0.05)
head(df_markers_signif)
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene | |
---|---|---|---|---|---|---|---|
Cd79a | 0 | 9.390336 | 0.966 | 0.012 | 0 | B_cells | Cd79a |
Ebf1 | 0 | 9.287841 | 0.906 | 0.005 | 0 | B_cells | Ebf1 |
H2-Aa | 0 | 5.548739 | 0.979 | 0.130 | 0 | B_cells | H2-Aa |
Mef2c | 0 | 7.201354 | 0.858 | 0.018 | 0 | B_cells | Mef2c |
Ms4a1 | 0 | 8.508328 | 0.842 | 0.009 | 0 | B_cells | Ms4a1 |
Cd79b | 0 | 5.283376 | 0.944 | 0.128 | 0 | B_cells | Cd79b |
Significant DEGs can be identified by filtering based on adjusted p-values. The results table includes:
- Gene name
- The p-value and adjusted p-value. To be considered differentially expressed, genes adjusted p-values must be smaller than a chosen threshold are.
- Log-fold change: $ ( ) $.
- Percent of cells expressing the gene in each condition
<- df_markers_signif %>%
nb_markers group_by(cluster) %>%
tally()
nb_markers
cluster | n |
---|---|
B_cells | 163 |
CD8 | 322 |
Enterocyte_proximal | 383 |
Goblet_cells | 266 |
CD4_memory | 733 |
Tuft_cells | 212 |
Stem_cells | 758 |
Enterocyte_distal | 3 |
CD4_naive | 159 |
ggplot(nb_markers, aes(x = cluster, y=n, fill=cluster)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
The table of DEG is not automatically added to the Seurat object. It is possible to add it manually, for example in the slot:
@misc$marker_genes <- df_markers_signif sobj
The DEG results can be saved as a CSV file for further exploration:
write.csv(df_markers, file.path(output_dir, "de_genes_FindAllMarkers.csv"), row.names = F, quote = F)
Finally, visualizations include:
- Violin plot and feature plot.
<- df_markers_signif %>%
genes_of_interest group_by(cluster) %>%
arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>%
slice_head(n=1) %>%
pull(gene, name=cluster)
DefaultAssay(sobj) <- "SCT"
VlnPlot(sobj, features = genes_of_interest, slot="data", ncol=4)
Warning: The `slot` argument of `VlnPlot()` is deprecated as of Seurat 5.0.0.
ℹ Please use the `layer` argument instead.
FeaturePlot(sobj, features = genes_of_interest,
max.cutoff = 'q90', ncol = 4)
- Dot plot for top markers per cell type.
<- df_markers_signif %>%
genes_of_interest group_by(cluster) %>%
arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>%
slice_head(n=3) %>%
pull(gene)
DotPlot(sobj, features = genes_of_interest,
col.min = 0) +
RotatedAxis()
- Heatmaps for a comprehensive view of markers.
<- df_markers_signif %>%
genes_of_interest group_by(cluster) %>%
arrange(p_val_adj, desc(avg_log2FC), .by_group = TRUE) %>%
slice_head(n=50) %>%
pull(gene)
DoHeatmap(sobj, features = genes_of_interest, label=F) +
theme(axis.text.y = element_blank()) # comment to display gene names
Warning in DoHeatmap(sobj, features = genes_of_interest, label = F): The
following features were omitted as they were not found in the scale.data slot
for the SCT assay: Rpl34, Rps4x, Rpl5, Rps3a1, Rpl37a, Rpl19, Rps21, Rps27a,
Rps7, Rps18, Rps10, Rpl18, Rplp2, Rps3, Rpl30, Rps14, Rpl35a, Rplp0, Rpl23,
Rps16, Tmsb10, Nup88, Slc44a1, Lysmd2, Samd14, Itprid1, Chat, Tmprss2, Camsap3,
Liph, Fermt1, Slc22a23, Ttc39a, Cideb, Ntan1, H3f3a, Fau, Rps5, Rpl18a, Rpl8
1.2 Differential expression between populations or conditions
Having identified markers for each cell type, we can also perform pairwise comparisons to test for differential gene expression between:
- two cell populations.
- cells of the same population under different conditions (e.g., treated vs. control).
For instance, we can examine the effect of p53 overactivation on enterocytes by comparing enterocytes with overactivated p53 (FMTG) to wild-type enterocytes (TG).
The FindMarkers()
function enables this analysis. A few command lines will allow us to test the two conditions for the enterocyte population.
## subset the dataset to keep only population of interest
<- subset(sobj, idents = "Enterocyte_proximal")
sobj_entero
## the condition to test is the genotype
Idents(sobj_entero) <- "orig.ident"
levels(sobj_entero)
[1] "SC26_TG" "SC26_FMTG"
## let's work on log-normalized counts
DefaultAssay(sobj_entero) <- "RNA"
<- sobj_entero %>%
sobj_entero FindVariableFeatures() %>%
NormalizeData() %>%
ScaleData()
Finding variable features for layer counts
Normalizing layer: counts
Centering and scaling data matrix
Warning: Different features in new layer data than already exists for
scale.data
# deg_entero <- PrepSCTFindMarkers(sobj_entero)
<- FindMarkers(sobj_entero,
deg_entero ident.1 = "SC26_FMTG",
ident.2 = "SC26_TG",
test.use = "wilcox",
logfc.threshold = log2(1.5),
min.pct = 0.5) %>%
# recorrect_umi = FALSE) %>%
filter(p_val_adj < 0.05) %>%
arrange(p_val_adj, desc(avg_log2FC))
@misc$deg_entero <- deg_entero sobj
2. Function enrichment
2.1 In a cell population
Once DEGs are identified, functional enrichment helps interpret the biological significance of these genes. It tests whether genes belonging to certain biological functions are over- or under-expressed beyond what would be expected by chance.
We will use the clusterProfiler
package for this analysis, employing over-representation analysis (ORA) and Gene Set Enrichment Analysis (GSEA).
Over representation analysis (ORA)
ORA is a common method for functional enrichment, assessing whether more genes in your list of DEGs are associated with a specific biological function than expected by chance.
Key inputs include:
- the list of DEGs: Generally requires genes in
Entrez ID
format. We can convert gene names from Symbol format usingclusterProfiler::bitr()
## add symbol gene names as a columns
<- deg_entero %>%
deg_entero rownames_to_column("SYMBOL")
## get entrez gene names thanks to bitr
<- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db") trad_sbl_entrez
'select()' returned 1:1 mapping between keys and columns
Warning in bitr(deg_entero$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
3.51% of input gene IDs are fail to map...
## join both data tables using symbol gene names as key
<- deg_entero %>%
deg_entero inner_join(trad_sbl_entrez)
Joining with `by = join_by(SYMBOL)`
head(deg_entero)
SYMBOL | p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | ENTREZID |
---|---|---|---|---|---|---|
Zg16 | 0 | -3.0775553 | 0.665 | 0.934 | 0 | 69036 |
Ubb | 0 | -0.6940597 | 0.978 | 0.992 | 0 | 22187 |
Fth1 | 0 | -1.3476178 | 0.989 | 1.000 | 0 | 14319 |
Prdx1 | 0 | -1.0359897 | 0.797 | 0.886 | 0 | 18477 |
Malat1 | 0 | 0.9994706 | 0.997 | 0.989 | 0 | 72289 |
Ftl1 | 0 | -0.9831995 | 0.929 | 0.972 | 0 | 14325 |
## get list of DEG using their entrez id
<- deg_entero$ENTREZID list_deg_entero
- a reference: A list of biological functions represented as annotated gene sets (e.g., Gene Ontology (GO), KEGG).
We will query the GO database using the org.Mm.eg.db
reference to perform enrichment tests.
## perform ORA
<- enrichGO(list_deg_entero, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300)
go
## avoid redundancy in result terms
<- simplify(go) go
Key parameters include:
ont
: Choose a GO subcategory between Biological Process (“BP”), Molecular Function (“MF”), Cellular Component (“CC”), or all (“ALL”).minGSSize
: Exclude gene sets with fewer than minGSSize genes (removes small sets prone to noise).maxGSSize
: Exclude gene sets with more than maxGSSize genes (removes overly broad sets).
head(go@result)
ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|
GO:0009060 | GO:0009060 | aerobic respiration | 11/107 | 199/28905 | 0 | 2e-07 | 1e-07 | 17448/17449/15251/13063/12858/103172/67680/17995/66694/66445/69875 | 11 |
GO:0140236 | GO:0140236 | translation at presynapse | 7/107 | 47/28905 | 0 | 2e-07 | 1e-07 | 57294/54127/67671/19981/54217/67281/27370 | 7 |
GO:0046166 | GO:0046166 | glyceraldehyde-3-phosphate biosynthetic process | 5/107 | 12/28905 | 0 | 2e-07 | 1e-07 | 17448/17449/14433/13806/18648 | 5 |
GO:0140241 | GO:0140241 | translation at synapse | 7/107 | 48/28905 | 0 | 2e-07 | 1e-07 | 57294/54127/67671/19981/54217/67281/27370 | 7 |
GO:0140242 | GO:0140242 | translation at postsynapse | 7/107 | 48/28905 | 0 | 2e-07 | 1e-07 | 57294/54127/67671/19981/54217/67281/27370 | 7 |
GO:0006094 | GO:0006094 | gluconeogenesis | 8/107 | 95/28905 | 0 | 8e-07 | 6e-07 | 17448/17449/15251/14433/13806/21416/18648/67800 | 8 |
The function returns an enrichResult
object with ORA table stored in the @result
slot. Important columns include:
ID
andDescription
: GO identifier and brief description of the biological process.Count
: Number of genes from the input dataset annotated to the specific GO term.GeneRatio
: Ratio of genes in your input dataset associated with the GO term.BgRatio
: Ratio of genes associated with the GO term in the reference gene set.p.adjust
andqvalue
: Adjusted p-values for selecting significantly enriched functions.
Results can be visualized using a dot plot.
dotplot(go, showCategory = 20)
Prebuilt functions like enrichGO()
and enrichKEGG()
enable users to query widely used ontologies. The enricher()
function facilitates the testing of over-representation for any gene set, for example MSigDB.
## load MSigDBr database, subcategory H
<- msigdbr(species = "mouse", category = "H")
gmt head(gmt)
gs_cat | gs_subcat | gs_name | gene_symbol | entrez_gene | ensembl_gene | human_gene_symbol | human_entrez_gene | human_ensembl_gene | gs_id | gs_pmid | gs_geoid | gs_exact_source | gs_url | gs_description | taxon_id | ortholog_sources | num_ortholog_sources |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
H | HALLMARK_ADIPOGENESIS | Abca1 | 11303 | ENSMUSG00000015243 | ABCA1 | 19 | ENSG00000165029 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam | 11 | ||||
H | HALLMARK_ADIPOGENESIS | Abcb8 | 74610 | ENSMUSG00000028973 | ABCB8 | 11194 | ENSG00000197150 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam | 12 | ||||
H | HALLMARK_ADIPOGENESIS | Acaa2 | 52538 | ENSMUSG00000036880 | ACAA2 | 10449 | ENSG00000167315 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam | 11 | ||||
H | HALLMARK_ADIPOGENESIS | Acadl | 11363 | ENSMUSG00000026003 | ACADL | 33 | ENSG00000115361 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam | 12 | ||||
H | HALLMARK_ADIPOGENESIS | Acadm | 11364 | ENSMUSG00000062908 | ACADM | 34 | ENSG00000117054 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam | 12 | ||||
H | HALLMARK_ADIPOGENESIS | Acads | 11409 | ENSMUSG00000029545 | ACADS | 35 | ENSG00000122971 | M5905 | 26771021 | Genes up-regulated during adipocyte differentiation (adipogenesis). | 10090 | EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam | 11 |
<- enricher(gene = list_deg_entero, pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, TERM2GENE = gmt[,c("gs_name", "entrez_gene")], minGSSize = 10, maxGSSize = 300)
ora
head(ora@result)
ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|
HALLMARK_OXIDATIVE_PHOSPHORYLATION | HALLMARK_OXIDATIVE_PHOSPHORYLATION | HALLMARK_OXIDATIVE_PHOSPHORYLATION | 10/52 | 199/4394 | 0.0000865 | 0.0032005 | 0.0024584 | 17448/17449/13063/18674/12858/18984/67680/17995/66694/66445 | 10 |
HALLMARK_ADIPOGENESIS | HALLMARK_ADIPOGENESIS | HALLMARK_ADIPOGENESIS | 9/52 | 200/4394 | 0.0004685 | 0.0057778 | 0.0044381 | 17448/20280/103172/18984/67680/18107/66445/20655/11637 | 9 |
HALLMARK_GLYCOLYSIS | HALLMARK_GLYCOLYSIS | HALLMARK_GLYCOLYSIS | 9/52 | 200/4394 | 0.0004685 | 0.0057778 | 0.0044381 | 17448/17449/13806/93692/14595/12368/18648/20655/21786 | 9 |
HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY | HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY | HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY | 4/52 | 49/4394 | 0.0024946 | 0.0230753 | 0.0177250 | 18477/14325/93692/20655 | 4 |
HALLMARK_MTORC1_SIGNALING | HALLMARK_MTORC1_SIGNALING | HALLMARK_MTORC1_SIGNALING | 7/52 | 201/4394 | 0.0087632 | 0.0527135 | 0.0404912 | 18477/14433/13806/93692/12450/18107/22433 | 7 |
HALLMARK_XENOBIOTIC_METABOLISM | HALLMARK_XENOBIOTIC_METABOLISM | HALLMARK_XENOBIOTIC_METABOLISM | 7/52 | 203/4394 | 0.0092332 | 0.0527135 | 0.0404912 | 19692/11847/12408/18984/18107/12368/17161 | 7 |
:::
Gene Set Enrichment Analysis (GSEA)
GSEA evaluates whether a set of genes shows coordinated expression changes, analyzing all genes even if they do not reach significance thresholds.
For GSEA, prepare an ordered list of all genes. They are typically sorted by fold change but other criteria could be chosen, such as the (fold change * adjusted p-value) product.
## rerun DEG identification to keep as many genes as possible
## even not significant genes
<- FindMarkers(sobj_entero,
deg_entero ident.1 = "SC26_FMTG",
ident.2 = "SC26_TG",
test.use = "wilcox") %>%
arrange(desc(avg_log2FC)) %>%
rownames_to_column("SYMBOL")
## get entrez gene names thanks to bitr
<- bitr(deg_entero$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Mm.eg.db") trad_sbl_entrez
'select()' returned 1:1 mapping between keys and columns
Warning in bitr(deg_entero$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
4.65% of input gene IDs are fail to map...
## join both data tables using symbol gene names as key
<- deg_entero %>%
deg_entero inner_join(trad_sbl_entrez)
Joining with `by = join_by(SYMBOL)`
## get list of DEG using their entrez id
<- deg_entero %>%
gsea_input_list pull(avg_log2FC, name = ENTREZID)
We can run GSEA on the GO ontology.
<- gseGO(gsea_input_list, OrgDb = "org.Mm.eg.db", ont = "BP", minGSSize = 10, maxGSSize = 300) %>%
gsea_go simplify()
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
head(gsea_go@result)
ID | Description | setSize | enrichmentScore | NES | pvalue | p.adjust | qvalue | rank | leading_edge | core_enrichment | |
---|---|---|---|---|---|---|---|---|---|---|---|
GO:0030593 | GO:0030593 | neutrophil chemotaxis | 57 | -0.6503270 | -2.603697 | 0e+00 | 0.0000014 | 1.30e-06 | 603 | tags=25%, list=6%, signal=23% | 13143/14127/20963/20202/13615/13614/56221/20296/20305/20308/20201/12986/20310/16176 |
GO:0097530 | GO:0097530 | granulocyte migration | 78 | -0.5878065 | -2.550661 | 0e+00 | 0.0000014 | 1.30e-06 | 279 | tags=21%, list=3%, signal=20% | 13143/14127/68279/20963/20202/13615/13614/56221/20296/20305/20308/20201/16952/12986/20310/16176 |
GO:1990266 | GO:1990266 | neutrophil migration | 70 | -0.6108187 | -2.548082 | 0e+00 | 0.0000014 | 1.30e-06 | 603 | tags=21%, list=6%, signal=20% | 13143/14127/68279/20963/20202/13615/13614/56221/20296/20305/20308/20201/12986/20310/16176 |
GO:0017156 | GO:0017156 | calcium-ion regulated exocytosis | 38 | 0.7392215 | 2.387048 | 0e+00 | 0.0000015 | 1.30e-06 | 1420 | tags=50%, list=14%, signal=43% | 20614/54525/80976/116838/58226/12286/380714/53420/60425/382018/27062/11551/20907/229521/320405/18999/70450/109685/18439 |
GO:0070098 | GO:0070098 | chemokine-mediated signaling pathway | 33 | -0.7049968 | -2.506332 | 2e-07 | 0.0001060 | 9.55e-05 | 123 | tags=33%, list=1%, signal=33% | 20303/80901/16923/13614/56221/12772/20296/20305/20308/12768/20310 |
GO:0006636 | GO:0006636 | unsaturated fatty acid biosynthetic process | 20 | -0.7863604 | -2.427158 | 2e-07 | 0.0001087 | 9.79e-05 | 791 | tags=45%, list=8%, signal=42% | 66469/76267/18783/19224/16592/13615/13614/16952/16176 |
Important columns include:
setSize
: Size of the gene set in the input list.ES
andNES
: Enrichment score and its normalized version. A positive score indicates over-expression and a negative score indicates under-expression.padj
andqvalues
: Adjusted p-values associated with the test.
Results can be visualized with plots.
## separate function with mainly over-expressed genes
## from functions with mainly under-expressed genes
@result <- gsea_go@result %>%
gsea_gomutate(nes_sign = ifelse(NES > 0, "Positive", "Negative"))
dotplot(gsea_go, showCategory = 10, split = "nes_sign") +
facet_grid(~nes_sign)
2.2 At single cell level gene set activity
Some methods calculate the activity of pathways or gene sets at the single-cell level. This can be used to visualize the activation of some functions of interest across a whole sample, cell annotation or cell state identification.
Here we will show how to perform this type of analysis with the package AUCell
. For time purpose, we will focus on 1 apoptosis signature suggested by the authors of the article of origin.
## create a list of gene sets per GO term
<- list("p53_signature" = c("Ccng1", "Mdm2", "Eda2r", "Gtse1", "Polk", "Psrc1", "Zfp365")) gene_set
It would be possible to construct a list of gene sets taken from popular ontologies.
Then the GSEA at a single cell level is performed.
## get express matrix
<- GetAssayData(sobj, assay="RNA", layer = "count")
exprMatrix
## computations
<- AUCell_buildRankings(exprMatrix, plotStats=F)
cells_rankings <- AUCell_calcAUC(gene_set, cells_rankings) cells_AUC
AUCell
has a few functions to plot the results.
AUCell_plotTSNE(Embeddings(sobj, reduction="umap")[,c(1,2)], exprMatrix, cells_AUC)
Alternatively, the results can be added to the Seurat
object metadata and plotted with Seurat functions.
<- AUCell::getAUC(cells_AUC) %>%
auc_df t() %>%
as.data.frame()
<- AddMetaData(sobj, metadata = auc_df)
sobj
VlnPlot(sobj, features = "p53_signature")
FeaturePlot(sobj, features = "p53_signature", max.cutoff = 'q95')
References and documentation
clusterProfiler: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
AUCell: https://www.bioconductor.org/packages/devel/bioc/vignettes/AUCell/inst/doc/AUCell.html
scGSVA: a tool for single cell level GSEA coming with many functions for references building and results representation, actually more user-friendly than AUCell: https://github.com/guokai8/scGSVA
Session info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS/LAPACK: /shared/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] AUCell_1.26.0 msigdbr_7.5.1 org.Mm.eg.db_3.19.1
[4] AnnotationDbi_1.66.0 IRanges_2.38.1 S4Vectors_0.42.1
[7] Biobase_2.64.0 BiocGenerics_0.50.0 enrichplot_1.24.4
[10] clusterProfiler_4.12.6 Seurat_5.1.0 SeuratObject_5.0.2
[13] sp_2.1-4 lubridate_1.9.3 forcats_1.0.0
[16] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[19] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[22] ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] segmented_2.1-0 fs_1.6.4
[3] matrixStats_1.4.1 spatstat.sparse_3.1-0
[5] HDO.db_0.99.1 httr_1.4.7
[7] RColorBrewer_1.1-3 tools_4.4.1
[9] sctransform_0.4.1 utf8_1.2.4
[11] R6_2.5.1 lazyeval_0.2.2
[13] uwot_0.2.2 withr_3.0.1
[15] gridExtra_2.3 progressr_0.14.0
[17] cli_3.6.3 spatstat.explore_3.3-2
[19] fastDummies_1.7.4 scatterpie_0.2.4
[21] labeling_0.4.3 spatstat.data_3.1-2
[23] ggridges_0.5.6 pbapply_1.7-2
[25] yulab.utils_0.1.7 gson_0.1.0
[27] DOSE_3.30.1 R.utils_2.12.3
[29] parallelly_1.38.0 limma_3.60.5
[31] rstudioapi_0.16.0 RSQLite_2.3.7
[33] generics_0.1.3 gridGraphics_0.5-1
[35] ica_1.0-3 spatstat.random_3.3-2
[37] GO.db_3.19.1 Matrix_1.7-0
[39] ggbeeswarm_0.7.2 fansi_1.0.6
[41] abind_1.4-8 R.methodsS3_1.8.2
[43] lifecycle_1.0.4 yaml_2.3.10
[45] SummarizedExperiment_1.34.0 qvalue_2.36.0
[47] SparseArray_1.4.8 Rtsne_0.17
[49] grid_4.4.1 blob_1.2.4
[51] promises_1.3.0 crayon_1.5.3
[53] miniUI_0.1.1.1 lattice_0.22-6
[55] cowplot_1.1.3 annotate_1.82.0
[57] KEGGREST_1.44.1 pillar_1.9.0
[59] knitr_1.48 fgsea_1.30.0
[61] GenomicRanges_1.56.1 future.apply_1.11.2
[63] codetools_0.2-20 fastmatch_1.1-4
[65] leiden_0.4.3.1 glue_1.8.0
[67] ggfun_0.1.6 spatstat.univar_3.0-1
[69] data.table_1.16.0 vctrs_0.6.5
[71] png_0.1-8 treeio_1.28.0
[73] spam_2.10-0 gtable_0.3.5
[75] kernlab_0.9-32 cachem_1.1.0
[77] xfun_0.48 S4Arrays_1.4.1
[79] mime_0.12 tidygraph_1.3.1
[81] survival_3.7-0 statmod_1.5.0
[83] fitdistrplus_1.2-1 ROCR_1.0-11
[85] nlme_3.1-165 ggtree_3.12.0
[87] bit64_4.0.5 RcppAnnoy_0.0.22
[89] GenomeInfoDb_1.40.1 irlba_2.3.5.1
[91] vipor_0.4.7 KernSmooth_2.23-24
[93] colorspace_2.1-1 DBI_1.2.3
[95] ggrastr_1.0.2 tidyselect_1.2.1
[97] bit_4.0.5 compiler_4.4.1
[99] httr2_1.0.1 graph_1.82.0
[101] DelayedArray_0.30.1 plotly_4.10.4
[103] shadowtext_0.1.3 scales_1.3.0
[105] lmtest_0.9-40 rappdirs_0.3.3
[107] digest_0.6.37 goftest_1.2-3
[109] mixtools_2.0.0 spatstat.utils_3.1-0
[111] rmarkdown_2.28 XVector_0.44.0
[113] htmltools_0.5.8.1 pkgconfig_2.0.3
[115] sparseMatrixStats_1.16.0 MatrixGenerics_1.16.0
[117] fastmap_1.2.0 rlang_1.1.4
[119] htmlwidgets_1.6.4 UCSC.utils_1.0.0
[121] shiny_1.9.1 DelayedMatrixStats_1.26.0
[123] farver_2.1.2 zoo_1.8-12
[125] jsonlite_1.8.9 BiocParallel_1.38.0
[127] GOSemSim_2.30.2 R.oo_1.26.0
[129] magrittr_2.0.3 GenomeInfoDbData_1.2.12
[131] ggplotify_0.1.2 dotCall64_1.1-1
[133] patchwork_1.3.0 munsell_0.5.1
[135] Rcpp_1.0.13 ape_5.8
[137] babelgene_22.9 viridis_0.6.5
[139] reticulate_1.39.0 stringi_1.8.4
[141] ggraph_2.2.1 zlibbioc_1.50.0
[143] MASS_7.3-61 plyr_1.8.9
[145] parallel_4.4.1 listenv_0.9.1
[147] ggrepel_0.9.6 deldir_2.0-4
[149] Biostrings_2.72.1 graphlayouts_1.1.1
[151] splines_4.4.1 tensor_1.5
[153] hms_1.1.3 igraph_2.0.3
[155] spatstat.geom_3.3-3 RcppHNSW_0.6.0
[157] reshape2_1.4.4 XML_3.99-0.17
[159] evaluate_1.0.0 tzdb_0.4.0
[161] tweenr_2.0.3 httpuv_1.6.15
[163] RANN_2.6.2 polyclip_1.10-7
[165] future_1.34.0 scattermore_1.2
[167] ggforce_0.4.2 xtable_1.8-4
[169] RSpectra_0.16-2 tidytree_0.4.6
[171] later_1.3.2 viridisLite_0.4.2
[173] aplot_0.2.3 beeswarm_0.4.0
[175] memoise_2.0.1 cluster_2.1.6
[177] timechange_0.3.0 globals_0.16.3
[179] GSEABase_1.66.0
Bonus: pseudo-bulk example
## load data
## the data is pro B cells in 2 conditions:
## healthy donors (PBMC) and leukemia patient (ETV6-RunX1)
## they come from this paper
## Caron M et al. Scientific Reports. 2020;10:1–12. Available from: http://dx.doi.org/10.1038/s41598-020-64929-x
<- readRDS("resources/data/proB.rds")
proB
## 1 cell type only: pro B cells
unique(proB@meta.data$SingleR_annot)
unique(proB@meta.data$type)
## 3 replicates
unique(proB@meta.data$orig.ident)
Idents(proB) <- "orig.ident"
## method 1: stay in Seurat
<- AggregateExpression(proB,
sobj_pseudoblk assays = "RNA",
slot = "counts",
return.seurat = T)
@assays$RNA@layers$counts
sobj_pseudoblk
@meta.data <- sobj_pseudoblk@meta.data %>%
sobj_pseudoblkmutate(type = ifelse(str_detect(orig.ident, "PBMMC"),
"PBMC",
"ETV6"))
Idents(sobj_pseudoblk) <- "type"
<- FindMarkers(sobj_pseudoblk,
markers ident.1 = "PBMC",
ident.2 = "ETV6",
test.use = "DESeq2")
<- markers %>%
markers_signifs filter(p_val_adj < 0.05)
## method 2: just get raw counts and use you favorite bulk RNA-Seq tool
<- AggregateExpression(proB,
cts assays = "RNA",
slot = "counts",
return.seurat = F)
<- cts$RNA
cts
dim(cts)
colnames(cts)
<- data.frame(sample = colnames(cts)) %>%
coldata mutate(condition = ifelse(grepl("PBMMC", sample), "ctrl", "cancer")) %>%
column_to_rownames("sample")
library(DESeq2)
<- DESeqDataSetFromMatrix(cts, colData = coldata, design = ~ condition)
dds
<- rowSums(cts) >= 10
keep <- dds[keep,]
dds <- DESeq(dds)
dds
<- results(dds, name = "condition_ctrl_vs_cancer") %>%
markers_singif2 as.data.frame() %>%
arrange(padj) %>%
filter(padj < 0.05)