In this presentation, there will be screen captures for you to follow the lesson. There will also be every single R command lines. Do not take care of the command lines if you find them too challenging. Our goal here, is to understand the main mechanism of Differential Expression Analysis. R is just a tool.
Below are the libraries we need to perform this whole session:
base::library(package = "BiocParallel") # Optionally multithread some steps
base::library(package = "DT") # Display nice table in HTML
base::library(package = "ggplot2") # Draw graphs and plots
base::library(package = "ggpubr") # Draw nicer graphs
base::library(package = "rstatix") # Base R statistics
base::library(package = "knitr") # Build this presentation
base::library(package = "dplyr") # Handle big tables
base::library(package = "Seurat") # Handle SingleCell analyses
base::library(package = "SeuratObject") # Handle SingleCell objects
base::library(package = "SingleCellExperiment") # Handle SingleCell file formats
base::library(package = "escape") # Perform exploratory enrichment analysis
base::library(package = "clusterProfiler") # Perform GSEA analysis
base::library(package = "dittoSeq") # Draw nice plots based on Seurat
base::library(package = "org.Mm.eg.db") # Mouse genome annotation
base::library(package = "Cairo") # Graphs library
base::library(package = "pathview") # For the whole pathway graphFirst, we load Seurat object:
Then we launch enrichment exploration on all counts:
# Acquire gene sets
mh_hallmark <- escape::getGeneSets(
species = "Mus musculus",
library = "H"
)
# Run enrichment
enrichment_matrix <- escape::escape.matrix(
input.data = sobj,
gene.sets = mh_hallmark,
method = "ssGSEA",
groups = 1000,
min.size = 5,
BPPARAM = BiocParallel::SnowParam(workers = 2),
)[1] "Using sets of 1000 cells. Running 4 times."
Up to now, we have:
We would like to identify the functions of genes among several clusters, or differentially expressed genes.
At the end of this session you will know:
Let’s search information about this gene on the web. For mice, one of the best web-site for human is: MGI.
If we search for the gene Plac8, we find the following:
As we click on “Phenotype reference”, we can see that the MGI database can give us PubMeb IDs related to the gene Plca8. Below, on the same page, we can see regulation pathways, protein ontology, gene interactions, etc.
We illustrate here, that GSEA does not only include gene-to-function, or gene-to-pathway. They usually include many more information.
There is a database for pretty much everything you might want. From pharmacology, to regulatory objects, etc.
Our dataset contains gene expression. But the same methods can be applied to other kind of datasets. E.g.
Protein-protein interactions, drug interactions, SNP-interactions, etc.
Within R, you may find genome databases for a wide range of organisms:
For this session, we are going to use MSigBD.
There are two main types of Gene Set Enrichment Analysis: 1. A blind search among all genes in order to identify pathways or mechanisms among cells. 1. A specific search through genes of interest
The first is called “Exploratory” analysis and starts on the
Seurat object.
Many packages perform this kind of exploratory analysis, I chose to
present escape
for its simple and intuitive usage.
What really matters is the GSEA method used in back-end. There are
several very well known methods: fgsea,
gsva,
DOSE,
and the MSigDB’s
tool. While choosing your favorite R package to perform GSEA, if that
package uses one of these methods, then you should not have odd
questions through your publication process.
These methods are very alike, based on the same statistical resorts, and suffer for the same limitations. Let’s illustrate it with examples.
Up to now, we already define what’s a gene set: a group of
genes of interest. We also know these king of relationships are stored
in GMT
files or TSV
files.
Since the beginning of our week, we’ve been working with R, and the question naturally comes: how to make/use gene sets in R?
Very easily. We need a “named list
of vectors”,
all coming from base R package.
# We use the function list from base package
pathways <- base::list(
# We use the function `c` from base package
proliferation = base::c(
"Ifng", "Kras", "Mgat5",
"Prf1", "Trem1", "Trp53"
),
growth = base::c(
"Cdkn2a", "Fos", "Ifnb1",
"Rras", "Apc", "Sell"
)
)
utils::head(pathways)$proliferation
[1] "Ifng" "Kras" "Mgat5" "Prf1" "Trem1" "Trp53"
$growth
[1] "Cdkn2a" "Fos" "Ifnb1" "Rras" "Apc" "Sell"
This method is very useful when you want to compare your cells toward unpublished data, gene sets from an article not sourced in official databases, etc.
We can also load a GMT file from a file. This is very usefull when we expect both known and new pathways to be tested, or known pathways that should be updated.
This is done with the function read.gmt
from the clusterProfiler
package.
pathways <- clusterProfiler::read.gmt(
gmtfile = "m5.mpt.v2023.1.Mm.symbols.gmt"
)
pathways <- BiocGenerics::lapply(
S4Vectors::split(
pathways[-1], pathways$term
),
function(x) x[x != ""]
)
utils::head(pathways)$MP_ABNORMAL_TUMOR_VASCULARIZATION
[1] "Adamts12" "Aggf1" "Amotl1" "Anxa1" "Arhgef4" "Cav2"
[7] "Ccm2l" "Cd82" "Dock4" "Fkbpl" "Gpr4" "H2ax"
[13] "Idh2" "Ifngr1" "Il12rb2" "Il1a" "Il1b" "Itga6"
[19] "Itgb3" "Jcad" "Mmrn2" "Myct1" "Ntn4" "Peak1"
[25] "Pld1" "Rhoj" "Rras" "S1pr2" "Stard13"
$MP_DECREASED_CIRCULATING_TUMOR_NECROSIS_FACTOR_LEVEL
[1] "Adam17" "Apba3" "Arid5a" "Bbs12" "C3" "Cd14"
[7] "Cd19" "Cd44" "Clic4" "Enpp2" "F3" "Ffar2"
[13] "Foxn1" "Foxo1" "Gba" "Hectd3" "Ifi204" "Ifi35"
[19] "Ifit2" "Ifngr1" "Ikbkb" "Il22" "Il6ra" "Irak2"
[25] "Irak4" "Irf5" "Lacc1" "Lep" "Lrrc19" "Lrrc8e"
[31] "Mapkapk2" "Mbl1" "Mcpt1" "Mif" "Msmp" "Mstn"
[37] "Myd88" "Nfkbib" "Nkg7" "Nmi" "Npy1r" "Parp1"
[43] "Peli1" "Pilrb1" "Plekhf1" "Prkce" "Rhbdf2" "Sgms1"
[49] "Siglec1" "Snx8" "Sra1" "Thbd" "Ticam1" "Tlr2"
[55] "Tnf" "Tradd" "Zdhhc1"
$MP_DECREASED_INCIDENCE_OF_INDUCED_TUMORS
[1] "Ccng1" "Cd151" "Cdk1" "Cebpb" "Ct55" "Cyp2c23" "Eif2s2"
[8] "H2-Ob" "Il6" "Klk6" "Krt10" "Lrig2" "Pgr" "Terc"
[15] "Uqcc3" "Zmiz1" "a"
$MP_DECREASED_INCIDENCE_OF_TUMORS_BY_CHEMICAL_INDUCTION
[1] "Agtr2" "Ahrr" "Ar" "Arl6ip5" "Birc5" "Brca2"
[7] "Cacfd1" "Camk2g" "Ccl2" "Cd151" "Cd34" "Cd96"
[13] "Cdkn1b" "Cdkn2a" "Cebpb" "Chd1l" "Cyp1a2" "Cyp1b1"
[19] "Cyp2e1" "Dek" "Dnajc27" "Endov" "Ephx1" "Fgf22"
[25] "Fntb" "Foxm1" "Foxn1" "Fxyd5" "Fzr1" "Gata6os"
[31] "Hcst" "Hipk1" "Hras" "Ier3" "Il12b" "Il1r1"
[37] "Il23a" "Il6" "Il6ra" "Itgb1" "Kdm4c" "Klk6"
[43] "Lonp1" "Loxl2" "Lypd3" "Mapkapk2" "Mgat3" "Mif"
[49] "Mir21a" "Mir301" "Mki67" "Mmp11" "Mmp19" "Mmp1a"
[55] "Mpg" "Mtdh" "Muc4" "Myd88" "Neu3" "Pard3"
[61] "Pik3ca" "Pla2g4a" "Plce1" "Prdx1" "Pten" "Ptger1"
[67] "Ptger2" "Ptger4" "Ptgs2" "Ptk2" "Ptp4a3" "Ptprj"
[73] "Pvr" "Pycard" "Ralgds" "Retnlb" "Skil" "Slc3a2"
[79] "Spp1" "Stag1" "Stat3" "Strap" "Terc" "Tert"
[85] "Tiam1" "Tlr4" "Tnf" "Tnfaip8l3" "Tnik" "Trem1"
[91] "Trim27" "Trp53" "Uri1"
$MP_DECREASED_LIVER_TUMOR_INCIDENCE
[1] "Foxm1" "Mtdh" "Myd88" "Tert" "Tlr2"
$MP_DECREASED_LUNG_TUMOR_INCIDENCE
[1] "Ggta1" "Kras" "Lexm" "Met" "Mtdh" "Tlr2"
We can also get pathways from the web using the function getGeneSets
from package escape.
# Pay attention to the genome name, it's case sensitive
# and follows MSigDB naming scheme
mh_hallmark <- escape::getGeneSets(
species = "Mus musculus",
library = "H"
)
utils::head(pathways)$MP_ABNORMAL_TUMOR_VASCULARIZATION
[1] "Adamts12" "Aggf1" "Amotl1" "Anxa1" "Arhgef4" "Cav2"
[7] "Ccm2l" "Cd82" "Dock4" "Fkbpl" "Gpr4" "H2ax"
[13] "Idh2" "Ifngr1" "Il12rb2" "Il1a" "Il1b" "Itga6"
[19] "Itgb3" "Jcad" "Mmrn2" "Myct1" "Ntn4" "Peak1"
[25] "Pld1" "Rhoj" "Rras" "S1pr2" "Stard13"
$MP_DECREASED_CIRCULATING_TUMOR_NECROSIS_FACTOR_LEVEL
[1] "Adam17" "Apba3" "Arid5a" "Bbs12" "C3" "Cd14"
[7] "Cd19" "Cd44" "Clic4" "Enpp2" "F3" "Ffar2"
[13] "Foxn1" "Foxo1" "Gba" "Hectd3" "Ifi204" "Ifi35"
[19] "Ifit2" "Ifngr1" "Ikbkb" "Il22" "Il6ra" "Irak2"
[25] "Irak4" "Irf5" "Lacc1" "Lep" "Lrrc19" "Lrrc8e"
[31] "Mapkapk2" "Mbl1" "Mcpt1" "Mif" "Msmp" "Mstn"
[37] "Myd88" "Nfkbib" "Nkg7" "Nmi" "Npy1r" "Parp1"
[43] "Peli1" "Pilrb1" "Plekhf1" "Prkce" "Rhbdf2" "Sgms1"
[49] "Siglec1" "Snx8" "Sra1" "Thbd" "Ticam1" "Tlr2"
[55] "Tnf" "Tradd" "Zdhhc1"
$MP_DECREASED_INCIDENCE_OF_INDUCED_TUMORS
[1] "Ccng1" "Cd151" "Cdk1" "Cebpb" "Ct55" "Cyp2c23" "Eif2s2"
[8] "H2-Ob" "Il6" "Klk6" "Krt10" "Lrig2" "Pgr" "Terc"
[15] "Uqcc3" "Zmiz1" "a"
$MP_DECREASED_INCIDENCE_OF_TUMORS_BY_CHEMICAL_INDUCTION
[1] "Agtr2" "Ahrr" "Ar" "Arl6ip5" "Birc5" "Brca2"
[7] "Cacfd1" "Camk2g" "Ccl2" "Cd151" "Cd34" "Cd96"
[13] "Cdkn1b" "Cdkn2a" "Cebpb" "Chd1l" "Cyp1a2" "Cyp1b1"
[19] "Cyp2e1" "Dek" "Dnajc27" "Endov" "Ephx1" "Fgf22"
[25] "Fntb" "Foxm1" "Foxn1" "Fxyd5" "Fzr1" "Gata6os"
[31] "Hcst" "Hipk1" "Hras" "Ier3" "Il12b" "Il1r1"
[37] "Il23a" "Il6" "Il6ra" "Itgb1" "Kdm4c" "Klk6"
[43] "Lonp1" "Loxl2" "Lypd3" "Mapkapk2" "Mgat3" "Mif"
[49] "Mir21a" "Mir301" "Mki67" "Mmp11" "Mmp19" "Mmp1a"
[55] "Mpg" "Mtdh" "Muc4" "Myd88" "Neu3" "Pard3"
[61] "Pik3ca" "Pla2g4a" "Plce1" "Prdx1" "Pten" "Ptger1"
[67] "Ptger2" "Ptger4" "Ptgs2" "Ptk2" "Ptp4a3" "Ptprj"
[73] "Pvr" "Pycard" "Ralgds" "Retnlb" "Skil" "Slc3a2"
[79] "Spp1" "Stag1" "Stat3" "Strap" "Terc" "Tert"
[85] "Tiam1" "Tlr4" "Tnf" "Tnfaip8l3" "Tnik" "Trem1"
[91] "Trim27" "Trp53" "Uri1"
$MP_DECREASED_LIVER_TUMOR_INCIDENCE
[1] "Foxm1" "Mtdh" "Myd88" "Tert" "Tlr2"
$MP_DECREASED_LUNG_TUMOR_INCIDENCE
[1] "Ggta1" "Kras" "Lexm" "Met" "Mtdh" "Tlr2"
Let’s time to practice. Use the function escape.matrix
from the package escape in order to
perform GSEA on your Seurat object. Perform
the analysis on cell cycle phase if possible. Save the results in a
variable, for example sobj_enrich.
It is not possible to specify what group we are considering. GSEA/Enrichment analysis do not perform any differential analysis.
# Save in a variable called `sobj_enrich`
# the result of the function `runEscape` from package `escape`
enrichment_matrix <- escape::escape.matrix(
# Out Seurat object
input.data = sobj,
# Provide gene sets
gene.sets = mh_hallmark,
# Select your favorite method
method = "ssGSEA",
# ssGSEA parameters
groups = 1000,
min.size = 5,
# Use 2 threads/CPU/workers
BPPARAM = BiocParallel::SnowParam(workers = 2),
)[1] "Using sets of 1000 cells. Running 4 times."
If you have reserved more than one core, this function can be multi-threaded, making it faster !
We can visualise the gene sets enriched in cell clusters using the
function FeaturePlot
from Seurat package. We just need to color the cells with
ssGSEA’s results:
In order to understand what happened, where do these results come from, we need to go step by step and perform both enrichment and GSE analysis manually.
In general, the exploratory method is never used on a raw Seurat object, we usually filter this object in order to search pathways that are related to our biological question.
With the previous method, what ever our biological question was, the cell cycle phase was one of the top gene expression drivers.
Let’s search information about the gene ARP2.
So, this gene exists in multiple organisms. Thanks, we know we are working on mice, and we told gene set retriever function to consider ‘mouse’ datasets.
So in the Human genome, ARP2 refers to multiple genes through the
alias names. When we searched for gene sets with [escape],
we did not perform any disambiguation.
The same with arabidopsis thaliana, ARP2 can refer to actin related protein, or a root development protein.
In mus musculus, ARP2 refers to an actin related protein on chromosome 11, or a subunit adaptor protein on chromosome 5.
We now understand that escape analysis may be completely
wrong since it used human-intelligible gene names. These names include
synonyms, homonyms, within a single organism, or between multiple
ones.
We need to use unique identifiers. These identifiers are called gene
identifiers and we usually consider EnsemblID
or EntrezID.
For example:
ENSMUSG00000020152
at Ensembl.ENSMUSG00000019518
at Ensembl.These identifiers are very unpleasant to use for human beings. On
monday meeting, please talk about ARP2 and not
ENSMUSG00000020152. However, when machines are working,
please given them EnsemblID or EntrezID.
Let’s translate all gene identifiers and evaluate possible number of errors in escape analysis.
First, let’s load the DEA results based on Wilcoxon analysis:
# We load our differential expression analysis result
# using the function `readRDS` from `base` package
sobj_wilcox <- base::readRDS(file = "sobj_wilcox.RDS")
# Copy gene names in a column it makes the future operation easier
sobj_wilcox$SYMBOL <- base::rownames(sobj_wilcox)Then, we use the function bitr
(standing for biological translator) from the package clusterProfiler.
# We store the result of `bitr` function
# from `clusterProfiler` package
# in a variable called `annotation`
annotation <- clusterProfiler::bitr(
# We provide the variable pointing to gene names
geneID = base::rownames(sobj_wilcox),
# We tell the translator to which gene identifiers
# translation must be done
toType = base::c("ENTREZID", "ENSEMBL"),
# We tell the translator which type of gene name we have
fromType = "SYMBOL",
# We provide mmu database
OrgDb = org.Mm.eg.db
)Now we would like to have these annotation alongside with adjusted
pvalues and fold change information. In order to do so, we use the
function merge from base package. Not from Seurat
package ! This is used to merger Seurat objects, but we have dataframes
here!
# We filter these results on **ADJUSTED** pvalue
sobj_wilcox <- sobj_wilcox[sobj_wilcox$p_val_adj <= 0.05, ]
# Use the function `merge` from package `base` in order
# to add annotation to wixocon DEA result.
sobj_wilcox <- base::merge(
# Variable pointing to the wilcoxon results
x = sobj_wilcox,
# Variable pointing to the annotation results
y = annotation,
# We tell how to merge sobj_wilcox
by.x = "SYMBOL",
# We tell how to merge annotation
by.y = "SYMBOL"
)As we perform the analysis, we are going to provide a numeric variable to the GSEA/Enrichment.
We have the following columns:
p_val: Ignore this column. Always ignore raw p-values.
Look at corrected ones, and if they are missing, then compute them.avg_log2FC: Average Log2(FoldChange). Illustrates how
much a gene is differentially expessed between samples in each
condition.pct.1: Percent of cells with gene expression in
condition one, here in “G1” phase.pct.2: Percent of cells with gene expression in
condition two, here in “S” phase.p_val_adj: The confidence we have in the result. The
closer to 0, the lesser is the risk of error.Is it a good idea to use p_val ? What are the
consequences ?
No. Never. Never ever use raw P-Value. It is never a good idea.
Is it a good idea to use avg_log2FC ? What are the
consequences ?
It is a good idea, we could answer biological questions like : “Considering differentially expressed genes, what are the pathways with highest expression change ?”
Is it a good idea to use pct.1 ? What are the
consequences ?
It is a good idea, we could answer biological questions like : “Considering differentially expressed genes, what are the expression of genes in pathway XXX in the first condition ?”
Is it a good idea to use pct.2 ? What are the
consequences ?
It is a good idea, we could answer biological questions like : “Considering differentially expressed genes, what are the expression of genes in pathway XXX in the second condition ?”
Is it a good idea to use p_val_adj ? What are the
consequences ?
It is a good idea, we could answer biological questions like : “Which pathways are differentially expressed with highest confidence interval ?”
But in order to perform surch test, use
-log10(Adjusted P-Value) instead of the raw adjusted
p-value. Remember, 0 is a good confidence interval, and 1 a bad one. So
we need the values close to 0 to be highly positive.
Previously, we used a function called enrichIt. We
talked about enrichment analysis, yet this talk is called GSEA.
This is due to the fact that both techniques exists and are different.
Enrichment tests a list of genes. Their expression, confidence interval, etc. Nothing else matters than their name.
For each cell, enrichIt removes zeros and tests the list
of remaining genes. Their expression does not matter. Their order in the
gene list does not matter. Their impact on the pathway does not matter.
Their differential expression status does not matter.
Behind the scenes, it’s a very simple tests answering the following question: “Among expressed genes, what are the odds that it belongs to the pathway XXX?”
We can do enrichment tests on our wilxoc results, using the function
enrichGO
from the package clusterProfiler:
# We store the result of the function `enrichGO` from package `clusterProfiler`
# in the function `enrich_wilcox`
enrich_wilcox <- clusterProfiler::enrichGO(
# We provide the list of gene identifiers to test
gene = sobj_wilcox$ENTREZID,
# We provide the complete list of genes we quantified
universe = annotation$ENTREZID,
# We provide mouse annotations
OrgDb = org.Mm.eg.db,
# We tell the function that we use entrez-identifiers
keyType = "ENTREZID",
# We search results on Biological Process"
ont = "BP",
# We want all the results
pvalueCutoff = 1,
# We are humans, we wan human-readable gene names
readable = TRUE
)