This file is associated with the course on Functional Analysis,
accessible at this link: https://moodle.france-bioinformatique.fr/course/view.php?id=37.
This file shows the code to perform over-representation analysis (ORA)
and gene set enrichment analysis (GSEA). The analyses are based on the
clusterProfiler package:
citation("clusterProfiler")
## Please cite S. Xu (2024) for using clusterProfiler. In addition, please
## cite G. Yu (2010) when using GOSemSim, G. Yu (2015) when using DOSE and
## G. Yu (2015) when using ChIPseeker.
##
## G Yu. Thirteen years of clusterProfiler. The Innovation. 2024,
## 5(6):100722
##
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R
## Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize
## multiomics data. Nature Protocols. 2024, 19(11):3292-3320
##
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L
## Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal
## enrichment tool for interpreting omics data. The Innovation. 2021,
## 2(3):100141
##
## Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
## clusterProfiler: an R package for comparing biological themes among
## gene clusters. OMICS: A Journal of Integrative Biology 2012,
## 16(5):284-287
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
We load packages of interest:
library(clusterProfiler) # Make enrichment analysis
library(enrichplot) # Awesome graphs
library(org.At.tair.db) # A. Thaliana annotation
.libPaths()
## [1] "/shared/ifbstor1/software/miniconda/envs/r-4.5.1/lib/R/library"
We load the data from the differential expression analysis.
deseq_genes = read.table(
file = "./tables/KOvsWT.complete.txt",
sep = "\t",
header = TRUE
)
We assess the dimensions of the data. First, the column names:
colnames(deseq_genes)
## [1] "Id" "WT1" "WT2" "WT3"
## [5] "KO1" "KO2" "KO3" "norm.WT1"
## [9] "norm.WT2" "norm.WT3" "norm.KO1" "norm.KO2"
## [13] "norm.KO3" "baseMean" "WT" "KO"
## [17] "FoldChange" "log2FoldChange" "stat" "pvalue"
## [21] "padj" "dispGeneEst" "dispFit" "dispMAP"
## [25] "dispersion" "betaConv" "maxCooks"
Second, the number of rows:
nrow(deseq_genes)
## [1] 27655
Third, the first elements from the Id column:
head(deseq_genes$Id)
## [1] "gene:AT1G01010" "gene:AT1G01020" "gene:AT1G01030" "gene:AT1G01040"
## [5] "gene:AT1G01050" "gene:AT1G01060"
We explore the data:
deseq_genes[deseq_genes$Id == "gene:AT1G61580", ]
## Id WT1 WT2 WT3 KO1 KO2 KO3 norm.WT1 norm.WT2 norm.WT3 norm.KO1
## 5120 gene:AT1G61580 248 231 205 119 131 125 229 210 215 131
## norm.KO2 norm.KO3 baseMean WT KO FoldChange log2FoldChange stat
## 5120 131 123 173.19 218 128 0.588 -0.766 -4.48
## pvalue padj dispGeneEst dispFit dispMAP dispersion betaConv
## 5120 7.465947e-06 0.0001156724 0 0.0311 0.0149 0.0149 TRUE
## maxCooks
## 5120 0.0222
For a computer, gene:AT1G01010 is not
AT1G01010. To interact properly with the database, we
remove the gene: string:
head(deseq_genes$Id)
## [1] "gene:AT1G01010" "gene:AT1G01020" "gene:AT1G01030" "gene:AT1G01040"
## [5] "gene:AT1G01050" "gene:AT1G01060"
deseq_genes$Id = sub(pattern = "gene:",
replacement = "",
x = deseq_genes$Id)
head(deseq_genes$Id)
## [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060"
We need to filter differentially expressed genes in order to perform ORA. How many genes are in our data ?
nrow(deseq_genes)
## [1] 27655
How many genes are significantly differentially expressed, given an adjusted p-value threshold set to 0.001 ?
de_genes = deseq_genes[deseq_genes[, "padj"] <= 0.001, ]
de_genes = de_genes[!is.na(de_genes[, "log2FoldChange"]), ]
nrow(de_genes)
## [1] 1807
In this table, there are up- and down-regulated genes:
summary(de_genes$log2FoldChange)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.4190 -1.5990 -0.5370 -0.3679 1.0510 6.6990
Our custom gene set corresponds to the up-regulated genes only:
de_genes = de_genes[de_genes[, "log2FoldChange"] > 0, ]
nrow(de_genes)
## [1] 880
We perform the ORA using the gene ontology for biological processes:
ego = clusterProfiler::enrichGO(
gene = de_genes$Id, # gene list
universe = deseq_genes$Id, # all genes
OrgDb = org.At.tair.db, # annotation
keyType = "TAIR", # nature of the genes ID
ont = "BP", # Biological Processes
pvalueCutoff = 1, # significance threshold (take all)
pAdjustMethod = "BH", # p-value adjustment method
readable = TRUE # For human beings
)
What is stored in ego object ?
View(ego)
What is stored in the ego@result table ?
head(ego@result, 3)
## ID Description GeneRatio BgRatio
## GO:0010087 GO:0010087 phloem or xylem histogenesis 19/711 130/21364
## GO:0009736 GO:0009736 cytokinin-activated signaling pathway 13/711 76/21364
## GO:0009735 GO:0009735 response to cytokinin 17/711 134/21364
## RichFactor FoldEnrichment zScore pvalue p.adjust
## GO:0010087 0.1461538 4.391604 7.196732 6.233419e-08 0.0001004827
## GO:0009736 0.1710526 5.139759 6.707933 1.219050e-06 0.0009825546
## GO:0009735 0.1268657 3.812037 6.058607 2.330987e-06 0.0011947380
## qvalue
## GO:0010087 8.733348e-05
## GO:0009736 8.539769e-04
## GO:0009735 1.038394e-03
## geneID
## GO:0010087 CORD2/ATHB-15/APL/DOT1/DAR2/AGC1-3/AtSEOR1/OPS/FL2/ACS6/FL3/ATERF6/AtERF#100/BAM3/ATHB-8/ACL5/AVB1/PXY/DOF5.6
## GO:0009736 ARR4/ZFP5/ARR7/ATPUP14/ABCG14/GIS3/AHK4/ARR12/ARR5/ARR9/APRR8/ANAC068/ARR6
## GO:0009735 ATGRXS13/ARR4/ZFP5/ATST4B/ARR7/ATPUP14/ABCG14/GIS3/AHK4/ARR12/ARR5/ARR9/APRR8/ANAC068/ABIG1/ATMYB33/ARR6
## Count
## GO:0010087 19
## GO:0009736 13
## GO:0009735 17
We visualize the top 5 gene ontologies are a barplot:
graphics::barplot(ego, showCategory = 5)
## Warning in fortify(object, showCategory = showCategory, by = x, ...): Arguments in `...` must be used.
## ✖ Problematic argument:
## • by = x
## ℹ Did you misspell an argument name?
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the enrichplot package.
## Please report the issue at
## <https://github.com/GuangchuangYu/enrichplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We visualize the top 5 gene ontologies are a dotplot:
enrichplot::dotplot(ego, showCategory = 5)
We need to build a named vector which contains sorted numbers. So, we explore results to guess the right column to extract:
colnames(deseq_genes)
## [1] "Id" "WT1" "WT2" "WT3"
## [5] "KO1" "KO2" "KO3" "norm.WT1"
## [9] "norm.WT2" "norm.WT3" "norm.KO1" "norm.KO2"
## [13] "norm.KO3" "baseMean" "WT" "KO"
## [17] "FoldChange" "log2FoldChange" "stat" "pvalue"
## [21] "padj" "dispGeneEst" "dispFit" "dispMAP"
## [25] "dispersion" "betaConv" "maxCooks"
We choose to use the stat column
geneList = as.numeric(de_genes$stat)
names(geneList) = de_genes$Id
geneList = sort(geneList, decreasing = TRUE)
head(geneList)
## AT2G17820 AT5G19600 AT2G25760 AT3G19670 AT3G48110 AT5G11800
## 18.377 16.078 16.002 15.616 15.249 14.443
We perform the GSEA using the gene ontology for biological processes:
gsea = clusterProfiler::gseGO(
geneList = geneList, # ranked gene list
ont = "BP", # Biological Processes
OrgDb = org.At.tair.db, # annotation
keyType = "TAIR", # nature of the genes ID
pAdjustMethod = "BH", # p-value adjustment method
pvalueCutoff = 1, # significance threshold (take all)
seed = 1 # fix randomness for permutations
)
What is stored in gsea object ?
View(gsea)
What is stored in the gsea@result table ?
head(gsea@result, 3)
## ID Description setSize
## GO:0072522 GO:0072522 purine-containing compound biosynthetic process 11
## GO:1901293 GO:1901293 nucleoside phosphate biosynthetic process 15
## GO:0000375 GO:0000375 RNA splicing, via transesterification reactions 20
## enrichmentScore NES pvalue p.adjust qvalue rank
## GO:0072522 0.6545202 2.234595 0.0002822140 0.04609509 0.03609856 115
## GO:1901293 0.5816372 2.162722 0.0006942031 0.04609509 0.03609856 115
## GO:0000375 0.5279490 2.158984 0.0006641180 0.04609509 0.03609856 105
## leading_edge
## GO:0072522 tags=64%, list=13%, signal=56%
## GO:1901293 tags=53%, list=13%, signal=47%
## GO:0000375 tags=50%, list=12%, signal=45%
## core_enrichment
## GO:0072522 AT1G80050/AT4G22570/AT2G17320/AT1G12350/AT2G35390/AT1G70570/AT2G17340
## GO:1901293 AT1G80050/AT4G22570/AT2G17320/AT3G27190/AT1G12350/AT2G35390/AT1G70570/AT2G17340
## GO:0000375 AT3G19670/AT1G07350/AT3G54230/AT3G01150/AT1G10320/AT2G33435/AT4G38780/AT5G45990/AT1G09660/AT4G34140
What is the most highly and significantly enriched gene set ?
top1_gsea = gsea@result %>%
dplyr::filter(p.adjust < 0.05) %>%
dplyr::filter(NES == max(NES)) %>%
dplyr::select(ID, Description, NES, p.adjust, setSize)
top1_gsea
## ID Description NES
## GO:0072522 GO:0072522 purine-containing compound biosynthetic process 2.234595
## p.adjust setSize
## GO:0072522 0.04609509 11
We can draw the curve associated with this gene set:
enrichplot::gseaplot2(
x = gsea,
geneSetID = top1_gsea$ID,
title = top1_gsea$Description
)
## Warning: `aes_()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`
## ℹ The deprecated feature was likely used in the enrichplot package.
## Please report the issue at
## <https://github.com/GuangchuangYu/enrichplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the enrichplot package.
## Please report the issue at
## <https://github.com/GuangchuangYu/enrichplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
In this section, we propose other ways to visualize the results from
ORA (ego object) or GSEA (gsea object).
enrichplot::gseaplot2(
x = gsea,
geneSetID = c(1:3),
title = "Most enriched terms"
)
enrichplot::heatplot(
x = ego, # Our ORA
showCategory = phloem_names, # Gene sets of interest
foldChange = setNames(nm = de_genes$Id,
de_genes$log2FoldChange) # Our fold changes
)
ego = enrichplot::pairwise_termsim(ego)
enrichplot::upsetplot(x = ego, # Our ORA
n = 10) # Nb of terms to display
enrichplot::cnetplot(ego,
showCategory = phloem_names,
foldChange = setNames(nm = de_genes$Id,
de_genes$log2FoldChange))
When interacting with databases, you may need TAIR ID, Ensembl ID, ENTREZ ID, UniProt ID… For instance, we could convert TAIR ID to ENTREZ ID and gene symbol:
annotation = clusterProfiler::bitr(
geneID = deseq_genes$Id, # Our gene list
fromType = "TAIR", # We have TAIR ID
toType = c("ENTREZID", "SYMBOL"), # What we want
OrgDb = org.At.tair.db) # Our annotation
## 'select()' returned 1:many mapping between keys and columns
## Warning in clusterProfiler::bitr(geneID = deseq_genes$Id, fromType = "TAIR", :
## 3.42% of input gene IDs are fail to map...
head(annotation)
## TAIR ENTREZID SYMBOL
## 1 AT1G01010 839580 ANAC001
## 2 AT1G01010 839580 NAC001
## 3 AT1G01010 839580 NTL10
## 4 AT1G01020 839569 ARV1
## 5 AT1G01030 839321 NGA3
## 6 AT1G01040 839574 ASU1
We merge this correspondence table without our data:
deseq_genes_with_symbol = merge(
x = deseq_genes,
y = annotation,
by.x = "Id", # In deseq_genes, TAIR IDs are stored in the Id column
by.y = "TAIR") # In annotation, TAIR IDs are stored in the TAIR column
head(deseq_genes_with_symbol)
## Id WT1 WT2 WT3 KO1 KO2 KO3 norm.WT1 norm.WT2 norm.WT3 norm.KO1
## 1 AT1G01010 533 541 473 931 1052 1124 493 492 496 1023
## 2 AT1G01010 533 541 473 931 1052 1124 493 492 496 1023
## 3 AT1G01010 533 541 473 931 1052 1124 493 492 496 1023
## 4 AT1G01020 54 54 42 56 56 63 50 49 44 62
## 5 AT1G01030 24 14 18 9 15 10 22 13 19 10
## 6 AT1G01040 342 355 276 359 391 371 316 323 289 395
## norm.KO2 norm.KO3 baseMean WT KO FoldChange log2FoldChange stat
## 1 1050 1108 777.09 494 1060 2.149 1.104 9.276
## 2 1050 1108 777.09 494 1060 2.149 1.104 9.276
## 3 1050 1108 777.09 494 1060 2.149 1.104 9.276
## 4 56 62 53.78 48 60 1.253 0.325 1.239
## 5 15 10 14.75 18 12 0.647 -0.627 -1.249
## 6 390 366 346.53 309 384 1.238 0.308 2.172
## pvalue padj dispGeneEst dispFit dispMAP dispersion betaConv
## 1 1.765350e-20 2.582102e-18 0 0.0210 0.0087 0.0087 TRUE
## 2 1.765350e-20 2.582102e-18 0 0.0210 0.0087 0.0087 TRUE
## 3 1.765350e-20 2.582102e-18 0 0.0210 0.0087 0.0087 TRUE
## 4 2.152436e-01 4.433923e-01 0 0.0597 0.0311 0.0311 TRUE
## 5 2.115810e-01 4.387259e-01 0 0.1696 0.1105 0.1105 TRUE
## 6 2.983500e-02 1.151557e-01 0 0.0246 0.0116 0.0116 TRUE
## maxCooks ENTREZID SYMBOL
## 1 0.0187 839580 ANAC001
## 2 0.0187 839580 NAC001
## 3 0.0187 839580 NTL10
## 4 0.0341 839569 ARV1
## 5 0.3564 839321 NGA3
## 6 0.0356 839574 ASU1
It looks similar, BUT number of rows differ:
dim(deseq_genes)
## [1] 27655 27
dim(deseq_genes_with_symbol)
## [1] 38169 29
This is due to 1:many mapping:
head(deseq_genes_with_symbol[, c("Id", "SYMBOL", "ENTREZID")])
## Id SYMBOL ENTREZID
## 1 AT1G01010 ANAC001 839580
## 2 AT1G01010 NAC001 839580
## 3 AT1G01010 NTL10 839580
## 4 AT1G01020 ARV1 839569
## 5 AT1G01030 NGA3 839321
## 6 AT1G01040 ASU1 839574
And there are also NA values, which won’t be taken into account in the downstream analyses:
table(is.na(deseq_genes_with_symbol$SYMBOL))
##
## FALSE TRUE
## 26440 11729
table(is.na(deseq_genes_with_symbol$ENTREZID))
##
## FALSE
## 38169
To be able to re-run the analysis or to understand why outputs are different between two compilations, it is important to display the version of the packages we used:
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.5.1/lib/libopenblasp-r0.3.30.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.At.tair.db_3.21.0 AnnotationDbi_1.70.0 IRanges_2.42.0
## [4] S4Vectors_0.46.0 Biobase_2.68.0 BiocGenerics_0.54.1
## [7] generics_0.1.4 enrichplot_1.28.4 clusterProfiler_4.16.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 gson_0.1.0 rlang_1.1.6
## [4] magrittr_2.0.4 DOSE_4.2.0 compiler_4.5.1
## [7] RSQLite_2.4.3 png_0.1-8 vctrs_0.6.5
## [10] reshape2_1.4.4 stringr_1.5.2 pkgconfig_2.0.3
## [13] crayon_1.5.3 fastmap_1.2.0 XVector_0.48.0
## [16] labeling_0.4.3 rmarkdown_2.30 UCSC.utils_1.4.0
## [19] purrr_1.1.0 bit_4.6.0 xfun_0.53
## [22] cachem_1.1.0 aplot_0.2.9 GenomeInfoDb_1.44.3
## [25] jsonlite_2.0.0 blob_1.2.4 BiocParallel_1.42.2
## [28] parallel_4.5.1 R6_2.6.1 bslib_0.9.0
## [31] stringi_1.8.7 RColorBrewer_1.1-3 jquerylib_0.1.4
## [34] GOSemSim_2.34.0 Rcpp_1.1.0 knitr_1.50
## [37] ggtangle_0.0.7 R.utils_2.13.0 Matrix_1.7-4
## [40] splines_4.5.1 igraph_2.2.0 tidyselect_1.2.1
## [43] qvalue_2.40.0 rstudioapi_0.17.1 dichromat_2.0-0.1
## [46] yaml_2.3.10 codetools_0.2-20 lattice_0.22-7
## [49] tibble_3.3.0 plyr_1.8.9 treeio_1.32.0
## [52] withr_3.0.2 KEGGREST_1.48.1 S7_0.2.0
## [55] evaluate_1.0.5 gridGraphics_0.5-1 ggupset_0.4.1
## [58] Biostrings_2.76.0 pillar_1.11.1 ggtree_3.16.3
## [61] ggfun_0.2.0 ggplot2_4.0.0 scales_1.4.0
## [64] tidytree_0.4.6 glue_1.8.0 lazyeval_0.2.2
## [67] tools_4.5.1 data.table_1.17.8 fgsea_1.34.2
## [70] fs_1.6.6 fastmatch_1.1-6 cowplot_1.2.0
## [73] grid_4.5.1 tidyr_1.3.1 ape_5.8-1
## [76] nlme_3.1-168 GenomeInfoDbData_1.2.14 patchwork_1.3.2
## [79] cli_3.6.5 rappdirs_0.3.3 dplyr_1.1.4
## [82] gtable_0.3.6 R.methodsS3_1.8.2 yulab.utils_0.2.1
## [85] sass_0.4.10 digest_0.6.37 ggrepel_0.9.6
## [88] ggplotify_0.1.3 farver_2.1.2 memoise_2.0.1
## [91] htmltools_0.5.8.1 R.oo_1.27.1 lifecycle_1.0.4
## [94] httr_1.4.7 GO.db_3.21.0 bit64_4.6.0-1