Forewords

TLDR: R command lines

In this presentation, there will be screen captures for you to follow the lesson. There will also be every single R command lines. Do not take care of the command lines if you find them too challenging. Our goal here, is to understand the main mechanism of Differential Expression Analysis. R is just a tool.

Below are the libraries we need to perform this whole session:

base::library(package = "BiocParallel")    # Optionally multithread some steps
base::library(package = "DT")              # Display nice table in HTML
base::library(package = "ggplot2")         # Draw graphs and plots
base::library(package = "ggpubr")          # Draw nicer graphs
base::library(package = "rstatix")         # Base R statistics
base::library(package = "knitr")           # Build this presentation
base::library(package = "dplyr")           # Handle big tables
base::library(package = "Seurat")          # Handle SingleCell analyses
base::library(package = "SeuratObject")    # Handle SingleCell objects
base::library(package = "SingleCellExperiment") # Handle SingleCell file formats
base::library(package = "escape")          # Perform exploratory enrichment analysis
base::library(package = "clusterProfiler") # Perform GSEA analysis
base::library(package = "dittoSeq")        # Draw nice plots based on Seurat
base::library(package = "org.Mm.eg.db")    # Mouse genome annotation
base::library(package = "Cairo")           # Graphs library
base::library(package = "pathview")        # For the whole pathway graph

First, we load Seurat object:

sobj <- base::readRDS(
  # Path to the RDS file
  file = "DEA_Scaled_Normalized_Filtered.RDS"
)

Then we launch enrichment exploration on all counts:

# Acquire gene sets
mh_hallmark <- escape::getGeneSets(
  species = "mus_musculus",
  library = "MH"
)

# Run enrichment
esobj <- escape::enrichIt(
  obj = sobj,
  gene.sets = mh_hallmark,
  cores = 3
)

Purpose of this session

Up to now, we have:

Identified to which cell each sequenced reads come from
Identified to which gene each read come from
Identified possible bias in gene expression for each cell
Filtered and corrected these bias as well as we can
Found differentially expressed genes across multiple conditions
Annotated cell clusters

We would like to identify the functions of genes among several clusters, or differentially expressed genes.

At the end of this session you will know:

What is gene set analysis
How to choose a Gene Set database
How to perform an enrichment analysis
How to read Gene set analysis results

Select a database

Gene: Jund

Let’s search information about this gene on the web. For mice, one of the best web-site for human is: MGI.

If we search for the gene Jund, we find the following:

jund_mgi

As we click on “Phenotype reference”, we can see that the MGI database can give us PubMeb IDs related to the gene Jund. Below, on the same page, we can see regulation pathways, protein ontology, gene interactions, etc.

jund_pathway

We illustreate here, that GSEA does not only include gene-to-function, or gene-to-pathway. They usually include many more information.

Database types

db_types

There is a database for pretty much everything you might want. From pharmacology, to regulatory objects, etc.

Our dataset contains gene expression. But the same methods can be applied to other kind of datasets. E.g.

ppis

Protein-protein interactions, drug interactions, SNP-interactions, etc.

Within R, you may find genome databases for a wide range of organisms:

org_bd

Some noticeable databases

For this session, we are going to use MSigBD.

How to perform GSEA

There are two main types of Gene Set Enrichment Analysis: 1. A blind search among all genes in order to identify pathways or mechanisms among cells. 1. A specific search through genes of interest

The first is called “Exploratory” analysis and starts on the Seurat object.

Many packages perform this kind of exploratory analysis, I chose to present escape for its simple and intuitive usage.

What really matters is the GSEA method used in back-end. There are several very well known methods: fgsea, gsva, DOSE, and the MSigDB’s tool. While choosing your favorite R package to perform GSEA, if that package uses one of these methods, then you should not have odd questions through your publication process.

These methods are very alike, based on the same statistical resorts, and suffer for the same limitations. Let’s illustrate it with examples.

Exploratory analysis

Gene sets

Up to now, we already define what’s a gene set: a group of genes of interest. We also know these king of relationships are stored in GMT files or TSV files.

Since the beginning of our week, we’ve been working with R, and the question naturally comes: how to make/use gene sets in R?

Create your own gene set

Very easily. We need a “named list of vectors”, all coming from base R package.

# We use the function list from base package
pathways <- base::list(
  # We use the function `c` from base package
  proliferation = base::c(
    "Ifng", "Kras", "Mgat5",    
    "Prf1", "Trem1",    "Trp53"
  ),
  growth = base::c(
    "Cdkn2a", "Fos", "Ifnb1",
    "Rras", "Apc", "Sell"
  )
)
utils::head(pathways)

$proliferation
[1] "Ifng"  "Kras"  "Mgat5" "Prf1"  "Trem1" "Trp53"

$growth
[1] "Cdkn2a" "Fos"    "Ifnb1"  "Rras"   "Apc"    "Sell"

This method is very useful when you want to compare your cells toward unpublished data, gene sets from an article not sourced in official databases, etc.

Load a gene set from disk

We can also load a GMT file from a file. This is very usefull when we expect both known and new pathways to be tested, or known pathways that should be updated.

This is done with the function read.gmt from the clusterProfiler package.

pathways <- clusterProfiler::read.gmt(
  gmtfile = "m5.mpt.v2023.1.Mm.symbols.gmt"
)
pathways <- BiocGenerics::lapply(
  S4Vectors::split(
    pathways[-1], pathways$term
  ),
  function(x) x[x != ""]
)
utils::head(pathways)

$MP_ABNORMAL_TUMOR_VASCULARIZATION
 [1] "Adamts12" "Aggf1"    "Amotl1"   "Anxa1"    "Arhgef4"  "Cav2"    
 [7] "Ccm2l"    "Cd82"     "Dock4"    "Fkbpl"    "Gpr4"     "H2ax"    
[13] "Idh2"     "Ifngr1"   "Il12rb2"  "Il1a"     "Il1b"     "Itga6"   
[19] "Itgb3"    "Jcad"     "Mmrn2"    "Myct1"    "Ntn4"     "Peak1"   
[25] "Pld1"     "Rhoj"     "Rras"     "S1pr2"    "Stard13" 

$MP_DECREASED_CIRCULATING_TUMOR_NECROSIS_FACTOR_LEVEL
 [1] "Adam17"   "Apba3"    "Arid5a"   "Bbs12"    "C3"       "Cd14"    
 [7] "Cd19"     "Cd44"     "Clic4"    "Enpp2"    "F3"       "Ffar2"   
[13] "Foxn1"    "Foxo1"    "Gba"      "Hectd3"   "Ifi204"   "Ifi35"   
[19] "Ifit2"    "Ifngr1"   "Ikbkb"    "Il22"     "Il6ra"    "Irak2"   
[25] "Irak4"    "Irf5"     "Lacc1"    "Lep"      "Lrrc19"   "Lrrc8e"  
[31] "Mapkapk2" "Mbl1"     "Mcpt1"    "Mif"      "Msmp"     "Mstn"    
[37] "Myd88"    "Nfkbib"   "Nkg7"     "Nmi"      "Npy1r"    "Parp1"   
[43] "Peli1"    "Pilrb1"   "Plekhf1"  "Prkce"    "Rhbdf2"   "Sgms1"   
[49] "Siglec1"  "Snx8"     "Sra1"     "Thbd"     "Ticam1"   "Tlr2"    
[55] "Tnf"      "Tradd"    "Zdhhc1"  

$MP_DECREASED_INCIDENCE_OF_INDUCED_TUMORS
 [1] "Ccng1"   "Cd151"   "Cdk1"    "Cebpb"   "Ct55"    "Cyp2c23" "Eif2s2" 
 [8] "H2-Ob"   "Il6"     "Klk6"    "Krt10"   "Lrig2"   "Pgr"     "Terc"   
[15] "Uqcc3"   "Zmiz1"   "a"      

$MP_DECREASED_INCIDENCE_OF_TUMORS_BY_CHEMICAL_INDUCTION
 [1] "Agtr2"     "Ahrr"      "Ar"        "Arl6ip5"   "Birc5"     "Brca2"    
 [7] "Cacfd1"    "Camk2g"    "Ccl2"      "Cd151"     "Cd34"      "Cd96"     
[13] "Cdkn1b"    "Cdkn2a"    "Cebpb"     "Chd1l"     "Cyp1a2"    "Cyp1b1"   
[19] "Cyp2e1"    "Dek"       "Dnajc27"   "Endov"     "Ephx1"     "Fgf22"    
[25] "Fntb"      "Foxm1"     "Foxn1"     "Fxyd5"     "Fzr1"      "Gata6os"  
[31] "Hcst"      "Hipk1"     "Hras"      "Ier3"      "Il12b"     "Il1r1"    
[37] "Il23a"     "Il6"       "Il6ra"     "Itgb1"     "Kdm4c"     "Klk6"     
[43] "Lonp1"     "Loxl2"     "Lypd3"     "Mapkapk2"  "Mgat3"     "Mif"      
[49] "Mir21a"    "Mir301"    "Mki67"     "Mmp11"     "Mmp19"     "Mmp1a"    
[55] "Mpg"       "Mtdh"      "Muc4"      "Myd88"     "Neu3"      "Pard3"    
[61] "Pik3ca"    "Pla2g4a"   "Plce1"     "Prdx1"     "Pten"      "Ptger1"   
[67] "Ptger2"    "Ptger4"    "Ptgs2"     "Ptk2"      "Ptp4a3"    "Ptprj"    
[73] "Pvr"       "Pycard"    "Ralgds"    "Retnlb"    "Skil"      "Slc3a2"   
[79] "Spp1"      "Stag1"     "Stat3"     "Strap"     "Terc"      "Tert"     
[85] "Tiam1"     "Tlr4"      "Tnf"       "Tnfaip8l3" "Tnik"      "Trem1"    
[91] "Trim27"    "Trp53"     "Uri1"     

$MP_DECREASED_LIVER_TUMOR_INCIDENCE
[1] "Foxm1" "Mtdh"  "Myd88" "Tert"  "Tlr2" 

$MP_DECREASED_LUNG_TUMOR_INCIDENCE
[1] "Ggta1" "Kras"  "Lexm"  "Met"   "Mtdh"  "Tlr2"

Load a gene set from the web

We can also get pathways from the web using the function getGeneSets from package escape.

# Pay attention to the genome name, it's case sensitive
# and follows MSigDB naming scheme
pathways <- escape::getGeneSets(
  species = "Mus musculus", 
  library = "H"
)
utils::head(pathways)

GeneSetCollection
  names: HALLMARK_ADIPOGENESIS, HALLMARK_ALLOGRAFT_REJECTION, ..., HALLMARK_APICAL_SURFACE (6 total)
  unique identifiers: Abca1, Abcb8, ..., Sulf2 (754 total)
  types in collection:
    geneIdType: NullIdentifier (1 total)
    collectionType: NullCollection (1 total)

Run exploratory analysis

Let’s time to practice. Use the function enrichIt from the package escape in order to perform GSEA on your Seurat object. Perform the analysis on cell cycle phase if possible. Save the results in a variable, for example sobj_enrich.

Answer

It is not possible to specify what group we are considering. GSEA/Enrichment analysis do not perform any differential analysis.

# Save in a variable called `sobj_enrich`
# the result of the function `enrichIt` from package `escape`
sobj_enrich <- escape::enrichIt(
  obj = sobj, # variable pointing to the seurat object
  gene.sets = pathways # variable pointing to gene sets
)

[1] "Using sets of 1000 cells. Running 4 times."
Setting parallel calculations through a SnowParam back-end
with workers=2 and tasks=100.
Estimating ssGSEA scores for 49 gene sets.
[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."
Setting parallel calculations through a SnowParam back-end
with workers=2 and tasks=100.
Estimating ssGSEA scores for 49 gene sets.
[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."
Setting parallel calculations through a SnowParam back-end
with workers=2 and tasks=100.
Estimating ssGSEA scores for 49 gene sets.
[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."
Setting parallel calculations through a SnowParam back-end
with workers=2 and tasks=100.
Estimating ssGSEA scores for 49 gene sets.
[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."

If you have reserved more than one core, this function can be multi-threaded, making it faster !

What happened ? Let’s print results with the function head from utils package. Here we select only two columns from the list of available ones, since there is a lot of them. We choose naturally two cell-cycle related pathways, since that’s what we are interested into.

# We display the first 5 rows using the function `head` from package `utils`
utils::head(
  # on two of the columns of the enrichment result.
  # these columns were selected with function `c` of `base` package.
  sobj_enrich[, base::c("HALLMARK_G2M_CHECKPOINT", "HALLMARK_E2F_TARGETS")]
)

          HALLMARK_G2M_CHECKPOINT HALLMARK_E2F_TARGETS
TD3A.1024               -78.76517            -98.68292
TD3A.3642              -123.30913           -410.25727
TD3A.1731              -171.55852           -271.71857
TD3A.2667              -152.05351           -258.06802
TD3A.4023                49.36263           -221.98633
TD3A.2052               -49.21145           -173.40665

Just like DEA results, let’s add the enrichment analysis results into the Seurat object, using the function AddMetaData.

# we reassign the variable `sobj` which contained the Seurat object
# with the result of the function `AddMetaData` of `Seurat` package
# i now contains the previous Seurat object + escape enrichment results
sobj <- Seurat::AddMetaData(
  # We provide the variable pointing to seurat object
  object=sobj,
  # We provide the variable pointing to the enrichment results
  metadata=sobj_enrich
)

Visualize results

Now we’d like to display nice graphs. Here, we use the dittoSeq package for these images, but there are many other packages designed to perform nice graphs from Seurat objects.

Let’s start with a general heatmaps to understand our results. We have a function called dittoHeatmap designed for that.

# We use the function `dittoHeatmap` from `dittoSeq` package
dittoSeq::dittoHeatmap(
  # We provide the variable pointing to Seurat object
  object=sobj,
  # We tell the function, we do not want to plot genes
  gene = NULL,
  # We tell the function, we want to plot enrichment results.
  # This is done with the function `names` from `base` package
  # over the variable pointing to the escape enrichment results.
  metas = base::names(sobj_enrich),
  # Remember, we are interested in cell cycle phases
  annot.by = "cc_seurat.Phase",
  # This package can perform clusterization of cols
  cluster_cols = TRUE,
  # Use complet heatmap rather that pheatmap
  # for additional parameters and graph size
  complex = TRUE
)

There are some other cell-cycle related terms. We can focus on them with the function multi_dittoPlot from dittoSeq package:

# We use the function `multi_dittoPlot` from package `dittoSeq`
dittoSeq::multi_dittoPlot(
  # We provide variable pointing to seurat object
  object = sobj,
  # We select the pathways of interest
  vars = base::c(
    "HALLMARK_G2M_CHECKPOINT", 
    "HALLMARK_E2F_TARGETS",
    "HALLMARK_MITOTIC_SPINDLE"
  ),
  # We color samples with cell cycle phase
  group.by = "cc_seurat.Phase",
  # We select the plot types
  plots = base::c("jutter", "vlnplot", "boxplot"),
  # We change y axis name
  ylab = "Enrichment score",
  # We make the graph nice-looking with GGplot2
  theme = theme_classic()
)

On purpose analysis

In order to understand what happened, where do these results come from, we need to go step by step and perform both enrichment and GSE analysis manually.

In general, the exploratory method is never used on a raw Seurat object, we usually filter this object in order to search pathways that are related to our biological question.

With the previous method, what ever our biological question was, the cell cycle phase was one of the top gene expression drivers.

Gene Names and Gene identifiers

Let’s search information about the gene ARP2.

So, this gene exists in multiple organisms. Thanks, we know we are working on mice, and we told gene set retriever function to consider ‘mouse’ datasets.

So in the Human genome, ARP2 refers to multiple genes through the alias names. When we searched for gene sets with [escape], we did not perform any disambiguation.

The same with arabidopsis thaliana, ARP2 can refer to actin related protein, or a root development protein.

In mus musculus, ARP2 refers to an actin related protein on chromosome 11, or a subunit adaptor protein on chromosome 5.

We now understand that escape analysis may be completely wrong since it used human-intelligible gene names. These names include synonyms, homonyms, within a single organism, or between multiple ones.

We need to use unique identifiers. These identifiers are called gene identifiers and we usually consider EnsemblID or EntrezID.

For example:

ARP2 from chromosome 11 equals ENSMUSG00000020152 at Ensembl.
ARP2 from chromosome 5 equals ENSMUSG00000019518 at Ensembl.

These identifiers are very unpleasant to use for human beings. On monday meeting, please talk about ARP2 and not ENSMUSG00000020152. However, when machines are working, please given them EnsemblID or EntrezID.

Let’s translate all gene identifiers and evaluate possible number of errors in escape analysis.

First, let’s load the DEA results based on Wilcoxon analysis:

# We load our differential expression analysis result
# using the function `readRDS` from `base` package
sobj_wilcox <- base::readRDS(file = "sobj_wilcox.RDS")
# Copy gene names in a column it makes the future operation easier
sobj_wilcox$SYMBOL <- base::rownames(sobj_wilcox)

Then, we use the function bitr (standing for biological translator) from the package clusterProfiler.

# We store the result of `bitr` function
# from `clusterProfiler` package
# in a variable called `annotation`
annotation <- clusterProfiler::bitr(
  # We provide the variable pointing to gene names
  geneID = base::rownames(sobj_wilcox),
  # We tell the translator to which gene identifiers
  # translation must be done
  toType = base::c("ENTREZID", "ENSEMBL"),
  # We tell the translator which type of gene name we have
  fromType = "SYMBOL",
  # We provide mmu database
  OrgDb = org.Mm.eg.db
)

Now we would like to have these annotation alongside with adjusted pvalues and fold change information. In order to do so, we use the function merge from base package. Not from Seurat package ! This is used to merger Seurat objects, but we have dataframes here!

# We filter these results on **ADJUSTED** pvalue
sobj_wilcox <- sobj_wilcox[sobj_wilcox$p_val_adj <= 0.05, ]
# Use the function `merge` from package `base` in order
# to add annotation to wixocon DEA result.
sobj_wilcox <- base::merge(
  # Variable pointing to the wilcoxon results
  x = sobj_wilcox,
  # Variable pointing to the annotation results
  y = annotation,
  # We tell how to merge sobj_wilcox
  by.x = "SYMBOL",
  # We tell how to merge annotation
  by.y = "SYMBOL"
)

Restricted sed of genes

As we perform the analysis, we are going to provide a numeric variable to the GSEA/Enrichment.

We have the following columns:

p_val: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.
avg_log2FC: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in each condition.
pct.1: Percent of cells with gene expression in condition one, here in “G1” phase.
pct.2: Percent of cells with gene expression in condition two, here in “S” phase.
p_val_adj: The confidence we have in the result. The closer to 0, the lesser is the risk of error.

Is it a good idea to use p_val ? What are the consequences ?

Answer

No. Never. Never ever use raw P-Value. It is never a good idea.

Is it a good idea to use avg_log2FC ? What are the consequences ?

Answer

It is a good idea, we could answer biological questions like : “Considering differentially expressed genes, what are the pathways with highest expression change ?”

Is it a good idea to use pct.1 ? What are the consequences ?

Answer

It is a good idea, we could answer biological questions like : “Considering differentially expressed genes, what are the expression of genes in pathway XXX in the first condition ?”

Is it a good idea to use pct.2 ? What are the consequences ?

Answer

It is a good idea, we could answer biological questions like : “Considering differentially expressed genes, what are the expression of genes in pathway XXX in the second condition ?”

Is it a good idea to use p_val_adj ? What are the consequences ?

Answer

It is a good idea, we could answer biological questions like : “Which pathways are differentially expressed with highest confidence interval ?”

But in order to perform surch test, use -log10(Adjusted P-Value) instead of the raw adjusted p-value. Remember, 0 is a good confidence interval, and 1 a bad one. So we need the values close to 0 to be highly positive.

Enrichment vs GSEA

Previously, we used a function called enrichIt. We talked about enrichment analysis, yet this talk is called GSEA.

This is due to the fact that both techniques exists and are different.

Enrichment tests a list of genes. Their expression, confidence interval, etc. Nothing else matters than their name.

For each cell, enrichIt removes zeros and tests the list of remaining genes. Their expression does not matter. Their order in the gene list does not matter. Their impact on the pathway does not matter. Their differential expression status does not matter.

Behind the scenes, it’s a very simple tests answering the following question: “Among expressed genes, what are the odds that it belongs to the pathway XXX?”

We can do enrichment tests on our wilxoc results, using the function enrichGO from the package clusterProfiler:

# We store the result of the function `enrichGO` from package `clusterProfiler`
# in the function `enrich_wilcox`
enrich_wilcox <- clusterProfiler::enrichGO(
  # We provide the list of gene identifiers to test
  gene = sobj_wilcox$ENTREZID,
  # We provide the complete list of genes we quantified
  universe = annotation$ENTREZID,
  # We provide mouse annotations
  OrgDb = org.Mm.eg.db,
  # We tell the function that we use entrez-identifiers
  keyType = "ENTREZID",
  # We search results on Biological Process"
  ont = "BP",
  # We want all the results
  pvalueCutoff = 1,
  # We are humans, we wan human-readable gene names
  readable = TRUE
)

We can have a look at pathways including the word G2M, using the functions with and grepl from base package.

# Get the result table
results_enrich_wilcox <- enrich_wilcox@result

# We store the result of the line selection
# in a variable called `g2m_enrich`
g2m_enrich <- results_enrich_wilcox[
  # We select rows containing a given information
  # using the function `with` from `base` package
  base::with(
    # We provide the variable pointing to enrichment results
    data = results_enrich_wilcox,
    # We provide the term we are searching and the column in which
    # the term should be searched
    base::grepl("g2m", Description)
  ), # Do not forget his comma, since we are filtering a dataframe on rows
]

GSEA tests the gene names, and uses the numeric value associated with that gene in order to weight the results. It tests the name, the rank, and the numeric value.

First, we need to create a named list of gene weights. For this example, we use the average fold change as weights.

# We extract gene average fold change
gene_list <- sobj_wilcox$avg_log2FC
# We extract genes Entrez ID
base::names(gene_list) <- sobj_wilcox$ENTREZID
# We sort that list deacreasingly
gene_list <- base::sort(gene_list, decreasing=TRUE)

# Check the results with the function `head` from `utils` package
utils::head(gene_list)

    12507     12487    252838     93694     98267     16341 
0.7377889 0.4651682 0.4525624 0.4366415 0.4185094 0.3428277

And now, we can run the gseGO from clusterProfiler. You’ll see, the interfacce is almost the same as for the enrichment:

# We use the function `gseGO` from the package `clusterProfiler`
# and store the result in a variable called `gsea_wilcox`
gsea_wilcox <- clusterProfiler::gseGO(
  # We provide the variable pointing to named gene list
  geneList = gene_list,
  # We provide mouse annotations
  OrgDb = org.Mm.eg.db,
  # We tell the function that we use entrez-identifiers
  keyType = "ENTREZID",
  # We search results on Biological Process"
  ont = "BP",
  # We want all the results
  pvalueCutoff = 1
)

Just like before, we can have a look at pathways including the word G2M, using the functions with and grepl from base package.

# Get the result table
results_gse_wilcox <- enrich_wilcox@result

# We store the result of the line selection
# in a variable called `g2m_enrich`
g2m_gse <- results_gse_wilcox[
  # We select rows containing a given information
  # using the function `with` from `base` package
  base::with(
    # We provide the variable pointing to enrichment results
    data = results_gse_wilcox,
    # We provide the term we are searching and the column in which
    # the term should be searched
    base::grepl("g2m", Description)
  ), # Do not forget his comma, since we are filtering a dataframe on rows !
]

# Check the results with the function `head` from `utils` package
utils::head(g2m_gse)

[1] ID          Description GeneRatio   BgRatio     pvalue      p.adjust   
[7] qvalue      geneID      Count      
<0 lignes> (ou 'row.names' de longueur nulle)

Visualize results

This makes a lot of tables. Let’s make a lot of graphs !

We can use the function barplot from package graphics ; and not the ones from any other package, or it won’t work !

# We use the function `barplot` from package `enrichplot`
graphics::barplot(
  # We provide the variable pointing to enrichment results
  height = enrich_wilcox,
  # We display the best 15 results
  howCategory=15
)

We can also use the function ditplot from package enrichplot ; and not the ones from any other package, or it won’t work ! Note the change in the input parameter name:

# We use the function `barplot` from package `enrichplot`
enrichplot::dotplot(
  # We provide the variable pointing to enrichment results
  object = enrich_wilcox,
  # We display the best 15 results
  showCategory=15
)

We can display traditional GSEA graph using gseaplot2 from package enrichplot:

# We use the function `barplot` from package `enrichplot`
enrichplot::gseaplot2(
  # We provide the variable pointing to enrichment results
  x = gsea_wilcox,
  # We display the best result
  geneSetID = 1
)

With GSEA, you dot not test if a pathway is up or down regulated. A pathway contains both enhancers and suppressors genes. An up-regulation of enhancer genes and a down-regulation of suppressor genes will lead to a “bad” enrichment score. However, this will lead to a strong change in your pathway activity!

If your favorite pathway does not have a “good enrichment score”, it does not mean that pathway is not affected.

The heatplot displays both a heatmap of enriched pathways and their genes in commons:

# We use the function `heatplot` from `enrichplot` package
enrichplot::heatplot(
  # We probide the variable pointing to GSEA results
  x = gsea_wilcox,
  # We show the 15 best results
  showCategory = 15,
  # We color according to FoldChange
  foldChange = gene_list
)

The genes in common between multiple gene sets are also visible through an uspet plot:

# We use the function `upsetplot` from `enrichplot` package
enrichplot::upsetplot(
  # We probide the variable pointing to GSEA results
  x = gsea_wilcox,
  # We show the 10 best results
  n = 10
)

Finally, we can display whole pathways using KEGG database:

# We use the function pathview from pathview package
pathview::pathview(
  gene.data = gene_list, # Our gene list
  pathway.id = "mmu04110", # Our pathway
  species = "mmu", # Our organism
  # The color limits
  limit = list(gene=max(abs(gene_list))),
  gene.idtype = "ENTREZID" # The genes identifiers
)

pathview_results

Session Info

This list of all packages used while you work should be included in each en every R presentation:

utils::sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS/LAPACK: /home/silivren/Projects/EBAII/2023/ebaiin1/SingleCell/DEA_GSEA/ebaii/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] pathview_1.40.0             Cairo_1.6-1                
 [3] org.Mm.eg.db_3.17.0         AnnotationDbi_1.62.2       
 [5] dittoSeq_1.12.0             clusterProfiler_4.8.1      
 [7] escape_1.10.0               SingleCellExperiment_1.22.0
 [9] SummarizedExperiment_1.30.2 Biobase_2.60.0             
[11] GenomicRanges_1.52.0        GenomeInfoDb_1.36.1        
[13] IRanges_2.34.1              S4Vectors_0.38.1           
[15] BiocGenerics_0.46.0         MatrixGenerics_1.12.2      
[17] matrixStats_1.0.0           SeuratObject_4.1.4         
[19] Seurat_4.4.0                dplyr_1.1.3                
[21] knitr_1.44                  rstatix_0.7.2              
[23] ggpubr_0.6.0                ggplot2_3.4.3              
[25] DT_0.28                     BiocParallel_1.34.2        

loaded via a namespace (and not attached):
  [1] fs_1.6.3                  GSVA_1.48.2              
  [3] spatstat.sparse_3.0-3     bitops_1.0-7             
  [5] enrichplot_1.20.0         doParallel_1.0.17        
  [7] HDO.db_0.99.1             httr_1.4.7               
  [9] RColorBrewer_1.1-3        Rgraphviz_2.44.0         
 [11] tools_4.3.1               sctransform_0.4.1        
 [13] backports_1.4.1           utf8_1.2.4               
 [15] R6_2.5.1                  HDF5Array_1.28.1         
 [17] lazyeval_0.2.2            uwot_0.1.16              
 [19] GetoptLong_1.0.5          rhdf5filters_1.12.1      
 [21] withr_2.5.1               sp_2.1-1                 
 [23] gridExtra_2.3             progressr_0.14.0         
 [25] cli_3.6.1                 spatstat.explore_3.2-5   
 [27] scatterpie_0.2.1          labeling_0.4.3           
 [29] sass_0.4.7                KEGGgraph_1.60.0         
 [31] spatstat.data_3.0-3       ggridges_0.5.4           
 [33] pbapply_1.7-2             yulab.utils_0.1.0        
 [35] ggupset_0.3.0             gson_0.1.0               
 [37] DOSE_3.26.1               parallelly_1.36.0        
 [39] RSQLite_2.3.1             shape_1.4.6              
 [41] gridGraphics_0.5-1        generics_0.1.3           
 [43] crosstalk_1.2.0           ica_1.0-3                
 [45] spatstat.random_3.2-1     car_3.1-2                
 [47] GO.db_3.17.0              Matrix_1.6-1.1           
 [49] fansi_1.0.5               abind_1.4-5              
 [51] lifecycle_1.0.3           yaml_2.3.7               
 [53] carData_3.0-5             rhdf5_2.44.0             
 [55] qvalue_2.32.0             Rtsne_0.16               
 [57] grid_4.3.1                blob_1.2.4               
 [59] promises_1.2.1            crayon_1.5.2             
 [61] miniUI_0.1.1.1            lattice_0.22-5           
 [63] beachmat_2.16.0           msigdbr_7.5.1            
 [65] cowplot_1.1.1             annotate_1.78.0          
 [67] KEGGREST_1.40.0           ComplexHeatmap_2.16.0    
 [69] pillar_1.9.0              fgsea_1.26.0             
 [71] rjson_0.2.21              future.apply_1.11.0      
 [73] codetools_0.2-19          fastmatch_1.1-4          
 [75] leiden_0.4.3              glue_1.6.2               
 [77] downloader_0.4            ggfun_0.1.3              
 [79] data.table_1.14.8         treeio_1.24.1            
 [81] vctrs_0.6.4               png_0.1-8                
 [83] gtable_0.3.4              cachem_1.0.8             
 [85] xfun_0.40                 S4Arrays_1.0.4           
 [87] mime_0.12                 tidygraph_1.2.3          
 [89] survival_3.5-7            pheatmap_1.0.12          
 [91] iterators_1.0.14          ellipsis_0.3.2           
 [93] fitdistrplus_1.1-11       ROCR_1.0-11              
 [95] nlme_3.1-163              ggtree_3.8.0             
 [97] bit64_4.0.5               RcppAnnoy_0.0.21         
 [99] bslib_0.5.1               irlba_2.3.5.1            
[101] KernSmooth_2.23-22        colorspace_2.1-0         
[103] DBI_1.1.3                 UCell_2.4.0              
[105] tidyselect_1.2.0          bit_4.0.5                
[107] compiler_4.3.1            graph_1.78.0             
[109] BiocNeighbors_1.18.0      DelayedArray_0.26.6      
[111] plotly_4.10.3             shadowtext_0.1.2         
[113] scales_1.2.1              lmtest_0.9-40            
[115] stringr_1.5.0             digest_0.6.33            
[117] goftest_1.2-3             spatstat.utils_3.0-4     
[119] rmarkdown_2.25            XVector_0.40.0           
[121] htmltools_0.5.6.1         pkgconfig_2.0.3          
[123] sparseMatrixStats_1.12.2  fastmap_1.1.1            
[125] GlobalOptions_0.1.2       rlang_1.1.1              
[127] htmlwidgets_1.6.2         shiny_1.7.5.1            
[129] DelayedMatrixStats_1.22.1 farver_2.1.1             
[131] jquerylib_0.1.4           zoo_1.8-12               
[133] jsonlite_1.8.7            GOSemSim_2.26.0          
[135] BiocSingular_1.16.0       RCurl_1.98-1.12          
[137] magrittr_2.0.3            GenomeInfoDbData_1.2.10  
[139] ggplotify_0.1.2           patchwork_1.1.3          
[141] Rhdf5lib_1.22.0           munsell_0.5.0            
[143] Rcpp_1.0.11               ape_5.7-1                
[145] babelgene_22.9            viridis_0.6.4            
[147] reticulate_1.34.0         stringi_1.7.12           
[149] ggraph_2.1.0              zlibbioc_1.46.0          
[151] MASS_7.3-60               org.Hs.eg.db_3.17.0      
[153] plyr_1.8.9                parallel_4.3.1           
[155] listenv_0.9.0             ggrepel_0.9.4            
[157] deldir_1.0-9              Biostrings_2.68.1        
[159] graphlayouts_1.0.1        splines_4.3.1            
[161] tensor_1.5                circlize_0.4.15          
[163] igraph_1.5.1              spatstat.geom_3.2-7      
[165] ggsignif_0.6.4            reshape2_1.4.4           
[167] ScaledMatrix_1.8.1        XML_3.99-0.14            
[169] evaluate_0.22             foreach_1.5.2            
[171] tweenr_2.0.2              httpuv_1.6.12            
[173] RANN_2.6.1                tidyr_1.3.0              
[175] purrr_1.0.2               polyclip_1.10-6          
[177] clue_0.3-65               future_1.33.0            
[179] scattermore_1.2           ggforce_0.4.1            
[181] rsvd_1.0.5                broom_1.0.5              
[183] xtable_1.8-4              tidytree_0.4.5           
[185] later_1.3.1               viridisLite_0.4.2        
[187] snow_0.4-4                tibble_3.2.1             
[189] aplot_0.2.2               memoise_2.0.1            
[191] cluster_2.1.4             globals_0.16.2           
[193] GSEABase_1.62.0

Gene Set Enrichment Analysis

Thibault DAYRIS

Bastien JOB

2023-10-25

Forewords

TLDR: R command lines

Purpose of this session

Select a database

Gene: Jund

Database types

Some noticeable databases

How to perform GSEA

Exploratory analysis

Gene sets

Create your own gene set

Load a gene set from disk

Load a gene set from the web

Run exploratory analysis

Visualize results

On purpose analysis

Gene Names and Gene identifiers

Restricted sed of genes

Enrichment vs GSEA

Visualize results

Session Info