---
title: "Joint Dimensional Reduction methods for multi-omics data analysis"
author: "Arnaud Gloaguen - Vincent Guillemot - Jimmy Vandel"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
  rmdformats::material:
    code_folding: show
    use_bookdown: true
    thumbnails: false
    css: "ETBII.css"
header-includes:
  - \usepackage{subfig}
  - \usepackage{booktabs}
  - \usepackage{xcolor}
---

```{r setup, eval = TRUE, echo = FALSE}
workDir <- "C:/Users/agloague/Documents/ETBII/etbii/THEMATIC-SESSIONS/Multivariate_analysis/02_TP/"
# workDir <- "~/Bureau/Bilille/Formation/ETBII/2024/Git/etbii/THEMATIC-SESSIONS/Multivariate_analysis/02_TP/"
knitr::opts_knit$set(root.dir = workDir) ## = setwd()
knitr::opts_chunk$set(cache = TRUE)
options(knitr.table.format = "html")
```

# Libraries and environment

## Environment

This report was generated using:

```{r librariesEnvironment, echo = FALSE, message = FALSE, warning = FALSE}
library("rmarkdown")
library("knitr")
library("rmdformats")
library("bookdown")
```

- R: ``r R.version$version.string``
- FactoMineR: ``r utils::packageVersion("FactoMineR")``
- factoextra: ``r utils::packageVersion("factoextra")``
- ggpubr: ``r utils::packageVersion("ggpubr")``
- corrplot: ``r utils::packageVersion("corrplot")``
- pheatmap: ``r utils::packageVersion("pheatmap")``
- grid: ``r utils::packageVersion("grid")``
- gridExtra: ``r utils::packageVersion("gridExtra")``
- RGCCA: ``r utils::packageVersion("RGCCA")``
- ggplot2: ``r utils::packageVersion("ggplot2")``
- MOFA2: ``r utils::packageVersion("MOFA2")``
- MOFAdata: ``r utils::packageVersion("MOFAdata")``
- DT: ``r utils::packageVersion("DT")``
- psych: ``r utils::packageVersion("psych")``
- GGally: ``r utils::packageVersion("GGally")``
- NMF: ``r utils::packageVersion("NMF")``
- r.jive: ``r utils::packageVersion("r.jive")``
- dplyr: ``r utils::packageVersion("dplyr")``

## Load libraries
**DIY**: Load the required libraries.

```{r loadLibrariesRGCCA, class.source = 'fold-hide', results = "hide", message = FALSE, warning = FALSE}
library("FactoMineR")
library("factoextra")
library("ggpubr")
library("corrplot")
library("pheatmap")
library("grid")
library("gridExtra")
library("RGCCA")
library("ggplot2")
library("MOFA2")
library("MOFAdata") # for CLL_data and enrichment
library("DT")       # for the formatRound function
library("psych")    # for correlations
library("GGally")   # for ggplot extensions
library("NMF")
library("r.jive")
library("dplyr")
```

# Overview of the Tutorial
**General Statement**: In this tutorial, Principal Component Analysis (**PCA**) is explored along with three multi-omics techniques: Regularized Generalized Canonical Correlation Analysis (**RGCCA**), Multi-Omics Factor Analysis (**MOFA**) and Joint and Individual Variance Explained (**JIVE**). These methods have in common that they all belong to the field of **multivariate statistics**, but also that, in different ways, they can be seen as **generalizations of PCA to more than one omic/matrix/block/table**. Indeed, similarly to PCA, they all aim at estimating a reduced/latent space that describes the different omics well. Part of the literature refers to these methods as **Joint Dimension Reduction (JDR) methods** because their goal is to estimate this reduced space such that it is common to all the omic data used in the analysis (this is less true for RGCCA). Note that applying PCA separately to each omic dataset is not considered a JDR method, but applying PCA to the concatenation of multiple omic datasets acquired on the same set of individuals is.
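To fix ideas, the simplest possible JDR analysis is a single PCA run on the column-wise concatenation of all the omics. The sketch below (not evaluated, since the data are only loaded in section \@ref(loadata)) assumes a named list `blocks` of sample $\times$ variable matrices, the same structure used throughout this tutorial.

```{r pcaConcatSketch, eval = FALSE}
# Minimal sketch, assuming `blocks` is a named list of matrices sharing the
# same samples in rows (built later in this tutorial).
scaled_blocks <- lapply(blocks, scale)          # scale each omic separately
concatenated  <- do.call(cbind, scaled_blocks)  # one big samples x variables table

# A single PCA on the concatenated table: its components live in a space
# common to all omics, which is the core idea behind JDR methods.
pca_concat <- PCA(concatenated, scale.unit = FALSE, ncp = 5, graph = FALSE)
head(pca_concat$ind$coord)
```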

Below is a figure **summarizing the key concept behind JDR methods**. This figure was taken from an interesting **benchmark** that compares multiple JDR methods (among which RGCCA/MOFA/JIVE) on different multi-omics datasets.

```{r JDRprinciple, echo = FALSE, fig.align = "center", out.width = "100%", fig.cap = "Summary of the Joint Dimension Reduction (JDR) principle. A.: Multi-omics are profiled from the same samples. Each omic corresponds to a different matrix $X^i$. JDR methods factorize the matrices $X^i$ into the product of a factor matrix $F$ and weight matrices $A^i$. These matrices can then be used to cluster samples and identify molecular processes. Figure and caption were taken from: Cantini, L., Zakeri, P., Hernandez, C. et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat Commun 12, 124 (2021). https://doi.org/10.1038/s41467-020-20430-7"}
knitr::include_graphics("images/JDR_principle.PNG")
```

The tutorial is organized as follows:

- In section \@ref(data), you will choose the dataset you are going to work with for the entire week.
- In section \@ref(mdd), we present the biological context of the dataset (not available to participants) used as the example in this tutorial.
- In section \@ref(loadata), you will load the dataset you have chosen.
- In section \@ref(pca), you will apply PCA to your dataset.
- In section \@ref(rgcca), you will apply RGCCA to your dataset.
- In section \@ref(mofa), you will apply MOFA to your dataset.
- In section \@ref(jive), you will apply JIVE to your dataset.
- Section \@ref(comparison) aims at comparing the results of the different approaches.

If you are interested in going further on the subject, here is a non-exhaustive list of publications that might interest you:

- About RGCCA:
  - Original RGCCA paper: [Tenenhaus, A., Tenenhaus, M. Regularized Generalized Canonical Correlation Analysis. Psychometrika 76, 257–284 (2011)](https://doi.org/10.1007/s11336-011-9206-8)
  - Presentation of the sparse version of RGCCA: [Tenenhaus A, Philippe C, Guillemot V, Le Cao KA, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics. 2014](https://doi.org/10.1093/biostatistics/kxu001)
  - Exploration of the links between RGCCA and several multi-block methods: [Tenenhaus, M., Tenenhaus, A. & Groenen, P.J.F. Regularized Generalized Canonical Correlation Analysis: A Framework for Sequential Multiblock Component Methods. Psychometrika 82, 737–777 (2017)](https://doi.org/10.1007/s11336-017-9573-x)
- About MOFA:
  - Original MOFA paper: [Argelaguet, R. et al. Multi-omics factor analysis-a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14, e8124 (2018)](https://doi.org/10.15252/msb.20178124)
- About JIVE:
  - Original JIVE paper: [Lock EF, Hoadley KA, Marron JS, Nobel AB. Joint and Individual Variation Explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013](https://doi.org/10.1214/12-aoas597)
  - A supervised version of JIVE: [Elise F. Palzer, Christine H. Wendt, Russell P. Bowler, Craig P. Hersh, Sandra E. Safo, Eric F. Lock, sJIVE: Supervised joint and individual variation explained, Computational Statistics & Data Analysis, Volume 175, 2022](https://doi.org/10.1016/j.csda.2022.107547)
- Other references:
  - The benchmark on JDR methods mentioned above: [Cantini, L., Zakeri, P., Hernandez, C. et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat Commun 12, 124 (2021)](https://doi.org/10.1038/s41467-020-20430-7)
  - Another, broader benchmark on multi-omics methods (contains JDR methods but not only): [Morgane Pierre-Jean, Jean-François Deleuze, Edith Le Floch, Florence Mauger, Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration, Briefings in Bioinformatics, Volume 21, Issue 6, November 2020](https://doi.org/10.1093/bib/bbz138)
  - A book exploring multiblock data fusion: [Age K. Smilde, Tormod Næs and Kristian Hovde Liland, Multiblock Data Fusion in Statistics and Machine Learning - Applications in the Natural and Life Sciences](https://onlinelibrary.wiley.com/doi/book/10.1002/9781119600978)

# Choose your dataset and your modality {#data}
In this tutorial we use the Major Depressive Disorder (MDD) dataset. For the practical session, however, you will pick another dataset among the ones proposed below and adapt the following sections accordingly.

```{r data-plot, echo = FALSE, fig.align = "center", out.width = "80%", fig.cap = "Five datasets are available: **Metagenomic** data from Tara Ocean (image from Sunagawa et al., 2015), **Breast cancer** data from TCGA (image from TCGA [website](https://portal.gdc.cancer.gov/)), **CLL** data (Dietrich et al., 2018), **tomato** data and **Arabidopsis thaliana** data (Wallomics dataset).", fig.show = "hold"}
##include_graphics("Figures/datasets.png")
```

## TCGA breast cancer dataset

This dataset includes measurements of mRNA, miRNA, and protein abundances taken from samples representing different molecular subtypes of breast cancer. This dataset was obtained from [Mibiomics](https://gitlab.univ-nantes.fr/combi-ls2n/mibiomics/-/tree/master/data/data_breast).
Download the dataset from the [summer school's GitHub repository](https://github.com/sib-swiss/summer-school-multiomics-data-analysis-and-integration/tree/master/general/datasets/BreastCancer_WGCNA)

For this dataset, we will use 3 modalities:

* `mrna.csv`, which contains mRNA expression data
* `mirna.csv`, which contains miRNA expression data
* `protein.csv`, which contains protein abundance data

We will also use the file `clinical.csv`, which contains the sample metadata, including molecular subtypes.

## TARA ocean dataset

Tara Ocean is a scientific initiative that aims to study and understand marine ecosystems on a global scale. It involves the collection of ocean samples from various regions to analyze the diversity of organisms, their genetic content, and their interactions with the environment.

In this dataset, we will have access to a **NOG (Non-supervised Orthologous Groups) abundance dataset**, measured across various oceans and at various depths.

NOGs are clusters of genes from different species that share a common evolutionary origin and similar functions. These groups are formed computationally, without relying on manual annotations, and are valuable for understanding gene relationships and functions across diverse organisms. NOGs play a key role in comparative genomics and evolutionary studies by revealing insights into the conservation of genes and their roles in different species.

NOGs have been measured in **various oceans** (SPO: South Pacific Ocean, NAO: North Atlantic Ocean, IO: Indian Ocean, RS: Red Sea, MS: Mediterranean Sea, NPO: North Pacific Ocean, SO: Southern Ocean, SAO: South Atlantic Ocean) and at **various depths** (DCM: Deep Chlorophyll Maximum, MES: Mesopelagic Zone, MIX: Mixing Layer, SRF: Surface Layer).

This dataset was obtained from [Mibiomics](https://gitlab.univ-nantes.fr/combi-ls2n/mibiomics/-/tree/master/data/data_tara)
Download the dataset from the [summer school's GitHub repository](https://github.com/sib-swiss/summer-school-multiomics-data-analysis-and-integration/tree/master/general/datasets/TaraOceans)

## CLL dataset

The Chronic lymphocytic leukaemia (CLL) dataset includes mRNA expression, methylation data and viability values in response to different drugs from CLL samples. To access this dataset, you need to install and load the `MOFAdata` R library.

```{r CLLdata, eval=FALSE, message=FALSE}
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("MOFAdata")
library("MOFAdata")
data("CLL_data")
names(CLL_data)
```

We will use three modalities:

* `CLL_data$mRNA`, which contains mRNA data for various CLL samples,
* `CLL_data$Methylation`, which contains methylation data from the same samples,
* `CLL_data$Drugs`, which contains viability response to various drugs and concentrations.
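Note that the `CLL_data` matrices follow the MOFA convention (features in rows, samples in columns). If you pick this dataset and also want to analyze it with RGCCA, which expects samples in rows (see section \@ref(loadata)), a minimal sketch of the required transposition could look like this (not evaluated here):

```{r CLLtransposeSketch, eval = FALSE}
# Sketch, assuming CLL_data has been loaded as above: transpose each selected
# modality so that samples end up in rows, as expected by RGCCA.
cll_blocks <- lapply(CLL_data[c("mRNA", "Methylation", "Drugs")], t)
sapply(cll_blocks, dim)  # rows = samples, columns = features
```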
The sample metadata are not available from `MOFAdata`; you will have to download them from the [summer school's GitHub repository](https://github.com/sib-swiss/summer-school-multiomics-data-analysis-and-integration/tree/master/general/datasets/CLL).

Use the command `?CLL_data` to know more about this dataset.

## Tomato dataset

The tomato dataset was obtained from the authors of the following publication: Isma Belouah and others, Modeling Protein Destiny in Developing Fruit, Plant Physiology, Volume 180, Issue 3, July 2019, Pages 1709–1724, https://doi.org/10.1104/pp.19.00086

It contains mRNA and protein abundance data from tomato plants (*Solanum lycopersicum*) at various development stages. In this dataset, the mRNA and proteins are paired, meaning that corresponding gene expression levels at both the mRNA and protein levels are examined.

The dataset encompasses information gathered from tomato plants at different developmental points, representing various **Days Post Anthesis (DPA)** and distinct **Growth Stages**. 'Days Post Anthesis' (DPA) signifies the count of days that have elapsed since the opening of a flower, serving as a marker to track the temporal progression of plant development. On the other hand, 'Growth Stages' denote specific phases within the tomato plant's lifecycle, spanning germination, vegetative growth, flowering, fruit development, and ripening.
The dataset was kindly provided to us by the authors. We have made slight adjustments to it, and made it accessible on the [summer school's GitHub repository](https://github.com/sib-swiss/summer-school-multiomics-data-analysis-and-integration/tree/master/general/datasets/Tomato)

For this dataset, we have access to 2 modalities:

* `mrna.tsv`, which contains mRNA expression data (already normalized)
* `prots.csv`, which contains protein abundance data (already normalized)

We will also use the file `samples_metadata.tsv`, which contains the sample metadata (DPA, Growth Stage).

## WallOmics dataset

The WallOmics dataset was obtained from the `WallomicsData` R library. This dataset contains phenomics, metabolomics, proteomics and transcriptomics data collected from two organs of five ecotypes of the model plant *Arabidopsis thaliana* exposed to two temperature growth conditions. For more information, you can refer to the following publications:

* Duruflé, Harold, et al. "A powerful framework for an integrative study with heterogeneous omics data: from univariate statistics to multi-block analysis." Briefings in Bioinformatics 22.3 (2021): bbaa166.
* Duruflé, Harold, et al. "An integrative study showing the adaptation to sub-optimal growth conditions of natural populations of Arabidopsis thaliana: A focus on cell wall changes." Cells 9.10 (2020): 2249.

For the analysis to run faster, we will use the CellWall datasets (CW), which only consider the expression of cell wall proteins. The datasets can be accessed with the following R commands:

```{r WallomicsData, eval=FALSE, message=FALSE}
install.packages("WallomicsData")
library("WallomicsData")
data("Transcriptomics_Rosettes_CW")
data("Transcriptomics_Stems_CW")
data("Proteomics_Rosettes_CW")
data("Proteomics_Stems_CW")
```

Use `?Transcriptomics_Rosettes_CW` and `?Transcriptomics_Stems_CW` for more information on the transcriptomics datasets. Use `?Proteomics_Rosettes_CW` and `?Proteomics_Stems_CW` for more information on the proteomics datasets.

Additionally, we will use the other available data as metadata. This includes:

* the `Ecotype` dataset
* the `Temperature` dataset
* the `Genetic_Cluster` dataset
* the `Altitude_Cluster` dataset
* the `Metabolomics_Rosettes` dataset for Rosettes
* the `Metabolomics_Stems` dataset for Stems
* the `Phenomics_Rosettes` dataset for Rosettes
* the `Phenomics_Stems` dataset for Stems

## ProMetIS dataset

A total of 42 mice from the Lat-/- (knock-out for LAT, a gene involved in neurodevelopmental diseases) and Mx2-/- (knock-out for Mx2, a gene modelling Down syndrome in mice) genotypes, as well as the wild-type controls, and from both sexes, were analyzed by a series of phenomic (preclinical) measurements. Liver and plasma samples from all mice were further analyzed at the molecular level by proteomics and metabolomics [(Alyssa Imbert & al. 2021)](https://doi.org/10.1038/s41597-021-01095-3).

The dataset is composed of 9 matrices: 1 preclinical, 2 proteomics (liver and plasma) and 6 metabolomics (4 on plasma and 2 on liver). There are multiple metabolomics datasets because they depend on the chromatographic column (c18hypersil: Hypersil GOLD C18, hilic: ZIC-pHILIC, and c18acquity: Acquity HSS-T3) and the ionization mode (pos: positive, and neg: negative). For this tutorial, we will focus on only one metabolomics dataset per tissue: the one acquired with the same chromatographic column and the same ionization mode in both tissues and that contains the fewest variables. In the end, 4 modalities will be explored:

- `preclinical`
- `proteomics_liver`
- `metabolomics_liver_hilic_neg`
- `metabolomics_plasma_hilic_neg`

The `proteomics_plasma` modality was put aside as it does not contain all individuals.
The following code can be used to retrieve these matrices.

```{r ProMetISdata, eval=FALSE, message=FALSE}
devtools::install_github("IFB-ElixirFr/ProMetIS")
library(ProMetIS)
modalities = c("preclinical", "proteomics_liver",
               "metabolomics_liver_hilic_neg", "metabolomics_plasma_hilic_neg")
data_dir.c = ProMetIS::post_processed_dir.c()
data.mds = phenomis::reading(data_dir.c, subsets.vc = modalities,
                             output.c = "set", report.c = "none")
blocks = sapply(names(data.mds), function(x) t(Biobase::exprs(data.mds[[x]])))
knock_out_gene = factor(Biobase::pData(data.mds[["preclinical"]])[, "gene"])
```

For further information on how to retrieve all the data, you can visit the GitHub page: https://github.com/IFB-ElixirFr/ProMetIS, and especially check the tutorial: https://github.com/IFB-ElixirFr/ProMetIS/blob/master/vignettes/tutorial.Rmd.

Again, in the upcoming sections, our primary focus will be on the Major Depressive Disorder (MDD) dataset. However, it is important to note that **the procedure we will discuss can also be applied to the other datasets**.

**Team organization:** The methods presented here (RGCCA, MOFA and JIVE) are independent, so feel free to start with whichever method you like, or to split your group into three, each subgroup working on a different method.

# Biological context (Major Depressive Disorder) {#mdd}

Major depressive disorder (MDD) is a leading cause of disability and reduced life expectancy, with a two-fold increase in prevalence in women compared to men. Over the last few years, identifying reliable molecular biomarkers of MDD has proved challenging, likely reflecting the fact that, in addition to sex differences, a variety of environmental and genetic risk factors are implicated. Recently, epigenetic processes have been proposed as mediators of the impact of life experiences on functional regulation of the genome, with the potential to contribute to MDD biomarker development. In this context, gene expression data were characterized and integrated with two upstream mechanisms of epigenomic regulation: DNA methylation (DNAm) and microRNAs (miRNAs). The 3 molecular layers were analyzed in peripheral blood samples from a well-characterized cohort of individuals with MDD (n=80) and healthy controls (n=89) [(Mokhtari A. 2024)](https://doi.org/10.21203/rs.3.rs-3978037/v1).

```{r MMDoverview, echo = FALSE, fig.align = "center", out.width = "100%", fig.cap = "Overview of the Major Depressive Disorder (MDD) data-set. Figure extracted from: Amazigh Mokhtari, `Signatures moléculaires multi-omiques issues de prélévements sanguins de patients atteints de troubles de l'humeur`, 2023, PhD manuscript. https://theses.fr/2022UNIP5226"}
include_graphics("images/MMDoverview.PNG")
```
Our objective is to build a multi-omic molecular signature discriminating MDD patients from controls.

# Input data {#loadata}

## In summary

For **RGCCA**, the data must be a list of **matrices** with a specific shape:

- **samples** (e.g. patients, organisms, ...) are displayed on the **rows**
- **variables** (e.g. genes, proteins, ...) are displayed on the **columns**

For **MOFA and JIVE**, matrices must be the other way round (**samples** in **columns** and **variables** in **rows**).

## Load your dataset

Choose a dataset and load all the modalities into R, except for the meta/clinical data. You will probably need to read a csv/tsv file: look at the documentation of the function `read.table`:

```{r doc_read.table}
?read.table
```

If your data are not stored in a csv/tsv file and you do not know how to load them, do not hesitate to ask us.
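As an illustration, here is a hedged sketch of how the csv-based datasets (e.g. the TCGA breast cancer files) could be read into a list of matrices. The file names, separators and `row.names` column are assumptions that you will need to adapt to the files you actually downloaded.

```{r loadCsvSketch, eval = FALSE}
# Hypothetical file names and arguments: adapt them to your own dataset.
data_dir <- "./data/BreastCancer/"
read_modality <- function(file) {
  as.matrix(read.table(file.path(data_dir, file),
                       header = TRUE, sep = ",", row.names = 1))
}
blocks <- list(mrna    = read_modality("mrna.csv"),
               mirna   = read_modality("mirna.csv"),
               protein = read_modality("protein.csv"))
# Quick sanity check: samples should be in rows and match across blocks
# (transpose with t() if samples turn out to be in columns).
sapply(blocks, dim)
```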
**DIY**: Load all the modalities, except for the meta/clinical data, as a list of **matrices**.

```{r loadData, class.source = 'fold-hide'}
data_path = "./data/MDD/"
blocks = readRDS(file = paste0(data_path, "data.train"))
blocks$cli = NULL
```

**DIY**: Load the meta/clinical data separately.

```{r loadClinical, class.source = 'fold-hide'}
clinical_data = readRDS(file = paste0(data_path, "pd_mdd_bin.RDS"))
clinical_data = clinical_data[match(rownames(blocks$mRNA), clinical_data$Sample_ID), ]
rownames(clinical_data) = clinical_data$Sample_ID
covariates_explored = c("Sample_Group", "BMI", "BMI.bin", "Age", "Age.bin", "Sex",
                        "Array", "Slide", "CD4", "CD8", "MO", "B", "NK", "GR")
clinical_data = clinical_data[, covariates_explored]
```

# Principal Component Analysis (PCA) {#pca}

## In summary

Principal Component Analysis (PCA) is an **unsupervised** method for **multivariate statistical analysis**. It is good practice to first explore your data with PCA, especially if you plan to carry on with **Joint Dimension Reduction** (JDR) methods such as RGCCA, MOFA or JIVE. Indeed, all three of them can, in a sense, be seen as generalizations of PCA to more than one omic (block). Applying a separate PCA to each of your omics will help you understand what information (phenotype/co-variable) drives the main variations in each omic dataset. If **all their main variations are driven by the same information**, going further with a JDR method is especially interesting, as it will be able to **take advantage of the redundancy of this information** across omics to catch a stronger signal.

## Run PCA on a first omic modality

In the MDD example, we will go with the transcriptomic block (mRNA); **choose yours** in the dataset you selected.
**DIY**: Run a PCA with the `PCA` function of the `FactoMineR` package. You will have to set:

- `ncp`: the number of components.
- `scale.unit`: a boolean specifying whether or not to scale all your variables separately (to unit variance) to make them comparable. You may skip this step if you previously performed a more expert normalization.
- `graph`: to avoid unwanted graphical output.
**Important Reminder**: Centering all your variables when performing PCA is mandatory; otherwise you are no longer extracting the components with the highest variance.
```{r mrnaPCA, class.source = 'fold-hide', fig.align = 'center'}
mrna_pca_results <- PCA(blocks$mRNA, scale.unit = TRUE, ncp = 25, graph = F)
```

### Sample/individual Space
**DIY**: Display the first 2 Principal Components (PCs) against each other with the function `fviz_pca_ind` from the `factoextra` package. You will have to set:

- `axes`: the two components you want to display.

This type of representation is also called the sample space, as each dot represents a sample (individual/subject...).

```{r mrna_show_PCA, class.source = 'fold-hide', fig.align = 'center'}
fviz_pca_ind(mrna_pca_results, axes = c(1,2))
```

Depending on the modality and the dataset you selected, you may see a group of samples that is already separated from the others, as above. In order to understand whether this group is associated with an already known piece of information (phenotype/co-variable), you may add this information to the plot.
**DIY**: Select a phenotype/co-variable and color your samples depending on the category of this selected phenotype/co-variable they belong to. If you choose a continuous variable, you may want to categorize it (by making bins).

```{r mrna_show_PCA_with_sex, class.source = 'fold-hide', fig.align = 'center'}
fviz_pca_ind(mrna_pca_results, axes = c(1,2)) +
  geom_point(aes(colour = clinical_data$Sex)) +
  guides(colour = guide_legend(title = "color"))
```

We can notice in the figure above that there is a strong "Sex" effect. Several methods exist in the literature to correct for Sex (or other batch) effects (ex: ComBat, for more details see [Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007](https://doi.org/10.1093/biostatistics/kxj037)). For the sake of simplicity, we will limit ourselves to the gender with the highest number of samples, which corresponds to the Females.
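For illustration only, a batch-correction approach such as ComBat (from the `sva` Bioconductor package) could be sketched as below. It is not run in this tutorial (we simply keep the female samples instead), and the covariate protected in the model matrix is an assumption you would need to adapt.

```{r combatSexSketch, eval = FALSE}
# Hedged sketch: correct the mRNA block for the Sex effect with ComBat.
# ComBat expects a features x samples matrix, hence the transpositions.
library("sva")
mod <- model.matrix(~ Sample_Group, data = clinical_data)  # protect the case/control effect
mrna_corrected <- t(ComBat(dat = t(blocks$mRNA),
                           batch = clinical_data$Sex,
                           mod = mod))
```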
**DIY**: If needed, extract a subset of your data-set as here for the female group.

```{r keep_only_females, class.source = 'fold-hide', fig.align = 'center'}
idx_female = which(clinical_data$Sex == "F")
idx_sex = which(colnames(clinical_data) == "Sex")
clinical_data = clinical_data[idx_female, -idx_sex]
blocks = lapply(blocks, function(x) x[idx_female, ])
```

**DIY**: You can plot the sample space again once the subset has been extracted. Do not forget to compute the PCA again. This time, samples are colored by their status (case vs. control).

```{r mrna_show_PCA_with_subtypes, class.source = 'fold-hide', fig.align = 'center'}
mrna_pca_results <- PCA(blocks$mRNA, scale.unit = TRUE, ncp = 25, graph = F)
fviz_pca_ind(mrna_pca_results, axes = c(1,2)) +
  geom_point(aes(colour = clinical_data$Sample_Group)) +
  guides(colour = guide_legend(title = "color"))
```

### Variable Space

Once components have been identified as bearing interesting information, you may want to understand how the different variables contribute to the computation of these components.
**DIY**: Display the top 50 variables contributing to the component of your choice with the function `fviz_contrib` from the `factoextra` package. You will have to set:

- `choice = 'var'`: to specify that you are interested in the contribution of the variables.
- `axes`: the PC you are interested in.
- `top`: the number of variables contributing the most to this PC.

```{r mrna_PCA_var_space, class.source = 'fold-hide', fig.align = 'center'}
fviz_contrib(mrna_pca_results, choice = "var", axes = 1, top = 50)
```

This contribution can also be plotted in 2D in order to show, at the same time, the contribution of one variable to two different PCs. This is called the **PCA correlation circle**.
**Important Reminder**: In PCA, the contribution (weight/coefficient) of a variable to a PC equals the correlation between the variable and the PC. This makes it possible to plot the contributions of variables to 2 different PCs on a correlation circle, as all contributions will lie between -1 and 1.
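A quick, hedged way to check this property on your own results (assuming the PCA was run with `scale.unit = TRUE`) is to compare the variable coordinates returned by `FactoMineR` with correlations computed by hand:

```{r pcaCorCheckSketch, eval = FALSE}
# For a standardized PCA, the variable coordinates should coincide with the
# correlations between each variable and the principal components.
manual_cor <- cor(blocks$mRNA[, 1:5], mrna_pca_results$ind$coord[, 1:2])
fm_coord   <- mrna_pca_results$var$coord[1:5, 1:2]
round(manual_cor - fm_coord, 10)  # expected to be (numerically) zero
```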
**DIY**: Display the **PCA correlation circle** for the top 50 variables contributing to the 2 components of your choice with the function `fviz_pca_var` from the `factoextra` package. You will have to set:

- `axes`: the two PCs you are interested in.
- `select.var`: to specify the number of variables contributing the most to these PCs.
- `repel`: if the names of the variables are not displayed properly (are overlapping).

```{r mrna_PCA_correlation_circle, class.source = 'fold-hide', fig.align = 'center'}
fviz_pca_var(mrna_pca_results, axes = c(1, 2), repel = T, select.var = list(contrib = 50))
```

### Biplot

It is also possible to display the individual and the variable spaces on the same plot. In that case, variables are no longer displayed inside a correlation circle, in order to adapt to the scale of the individual space: the coordinates of the variables, representing their contributions to the PCs, are artificially rescaled. The direction of each variable is kept, however, which still allows you to see whether a variable contributes more to one PC or the other. This type of representation is particularly useful when a group of samples separates from the others and you want to understand whether a group of variables especially drives this separation.
**DIY**: Display a biplot for the top 50 variables contributing to the 2 components of your choice with the function `fviz_pca_biplot` from the `factoextra` package. You will have to set:

- `axes`: the two PCs you are interested in.
- `select.var`: to specify the number of variables contributing the most to these PCs.
- `repel`: if the names of the variables are not displayed properly.

Furthermore, color your samples depending on the category of the selected phenotype/co-variable they belong to.

```{r mrna_PCA_biplot, fig.height=12, fig.width=10, class.source = 'fold-hide', fig.align = 'center'}
fviz_pca_biplot(mrna_pca_results, repel = T, select.var = list(contrib = 50), axes = c(1,2)) +
  geom_point(aes(colour = clinical_data$Sample_Group)) +
  guides(colour = guide_legend(title = "color"))
```

### Scree plot

Each PC explains a certain percentage of the total variance of the dataset. This explained variance decreases with the rank of the PC. The scree plot summarizes the explained variance of the top PCs. It can be interesting to visualize it in order to understand how many components carry most of the information.
**DIY**: Display a scree plot.

```{r mrna_PCA_scrreeplot, class.source = 'fold-hide', fig.align = 'center'}
fviz_eig(mrna_pca_results)
```
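To choose a number of components more quantitatively, you can also inspect the table of eigenvalues returned by `FactoMineR`, as in the following minimal sketch (the 50% threshold is arbitrary and only used for illustration):

```{r pcaEigSketch, eval = FALSE}
# Eigenvalue, percentage of variance and cumulative percentage per component.
head(mrna_pca_results$eig)
# Number of components needed to reach 50% of the total variance.
which(mrna_pca_results$eig[, "cumulative percentage of variance"] >= 50)[1]
```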

### Going Further

When trying to understand the information behind each PC, it might be interesting to directly test the association between the PCs and a series of phenotypes/co-variables. This is the idea behind the function `champ.svd` in the [ChAMP package](https://www.bioconductor.org/packages/devel/bioc/vignettes/ChAMP/inst/doc/ChAMP.html), where a univariate statistical test is performed between each PC and a designated variable of interest. The chosen test depends on the nature of the variable:

- **continuous**: an [F-test](https://en.wikipedia.org/wiki/F-test).
- **discrete**: a Kruskal-Wallis test, which generalizes the [Wilcoxon-Mann-Whitney test](https://fr.wikipedia.org/wiki/Test_de_Wilcoxon-Mann-Whitney) to an arbitrary number of groups.

Inspired by the `ChAMP` package, a home-made function `interpret_PC` was created that takes as arguments:

- `component_matrix`: a matrix of size `number of samples` $\times$ `number of components`.
- `clinical_df`: a dataframe of size `number of samples` $\times$ `number of variables`.
- `title`: the title of your plot.
- `display_legend`: whether or not to display the legend for the p-values.
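Before looking at the full helper function below, here is a minimal sketch of what a single cell of the resulting heatmap corresponds to, for one component and one covariate of each type (using the PCA results and clinical data from the previous sections):

```{r pcaSingleAssocSketch, eval = FALSE}
pc1 <- mrna_pca_results$ind$coord[, 1]
# Discrete covariate: Kruskal-Wallis test.
kruskal.test(pc1 ~ as.factor(clinical_data$Sample_Group))$p.value
# Continuous covariate: p-value of the slope of a simple linear regression
# (equivalent to an F-test with a single predictor).
summary(lm(pc1 ~ clinical_data$Age))$coefficients[2, 4]
```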
```{r, fig.height=12, fig.width=10, warning=FALSE, message=FALSE}
# Function taken from CHAMP to help interpret factors: draws a heatmap of
# (log10) p-values, one row per covariate, one column per component.
drawheatmap <- function(svdPV.m, title = NULL, silent = FALSE, show_rownames = TRUE,
                        show_colnames = TRUE, show_legend = TRUE) {
  myPalette = c("darkred", "red", "orange", "pink", "white")
  breaks.v = c(-10000, -10, -5, -2, log10(0.05), 0)
  rownames(svdPV.m) = paste("Comp.", 1:dim(svdPV.m)[1])
  tmp_plot = pheatmap(log10(t(svdPV.m)), cluster_rows = F, cluster_cols = F,
                      breaks = breaks.v, color = myPalette, silent = T, legend = F,
                      main = paste("Interpretation of the Components -", title),
                      show_rownames = show_rownames, show_colnames = show_colnames)
  if (show_legend){
    leg = grid::legendGrob(c(expression("p < 1x" ~ 10^{-10}),
                             expression("p < 1x" ~ 10^{-5}),
                             "p < 0.01", "p < 0.05", "p > 0.05"),
                           nrow = 5, pch = 15,
                           gp = grid::gpar(fontsize = 10, col = myPalette))
  }else{
    leg = NULL
  }
  final_plot = arrangeGrob(tmp_plot$gtable, leg, ncol = 2, widths = c(5,1))
  if (!silent){
    plot.new()
    grid::grid.draw(final_plot)
  }
  return(final_plot)
}

# Kruskal-Wallis test for qualitative covariates, p-value of the slope of a
# simple linear regression for quantitative covariates.
check_association_CHAMP <- function(x, y){
  if(!is.numeric(y)){
    return(kruskal.test(x ~ as.factor(y))$p.value)
  }else{
    return(summary(lm(x ~ y))$coeff[2, 4])
  }
}

# Test the association between each component and each clinical covariate,
# then display the resulting p-value matrix as a heatmap.
interpret_PC <- function(component_matrix, clinical_df, title = NULL, silent = FALSE,
                         show_rownames = TRUE, show_colnames = TRUE, show_legend = TRUE){
  tmp = apply(component_matrix, 2,
              function(x) sapply(clinical_df, function(y) check_association_CHAMP(x, y)))
  drawheatmap(t(tmp), title = title, silent = silent, show_rownames = show_rownames,
              show_colnames = show_colnames, show_legend = show_legend)
}
```

**DIY**: Apply this `interpret_PC` function on the result of your PCA in order to evaluate the associations with your clinical data. The component matrix can be retrieved through `mrna_pca_results$ind$coord`.

```{r mrna_PCA_champ, class.source = 'fold-hide', fig.align = 'center'}
interpret_PC(component_matrix = mrna_pca_results$ind$coord,
             clinical_df = clinical_data,
             title = "mRNA")
```

This is not the case here, as `Sample_Group` is composed of only two categories, but sometimes you can see a strong link with a qualitative variable without really knowing which category is best represented. In that case, it can be interesting to add intermediary variables describing each category. More generally, if you have a qualitative variable with multiple categories, it can be interesting to try to decipher whether a PC describes one or more of these categories.
**DIY**: Create an intermediary variable for each category of the clinical variable/co-variable you selected. Each new intermediary variable must indicate "this sample belongs to this category or not". Do not forget to convert these variables into factors.

```{r mrna_PCA_intermediary_subtypes, class.source = 'fold-hide', fig.align = 'center'}
# clinical_data$Sample_Group = as.factor(clinical_data$Sample_Group)
# each_subtypes_vs_ALL = model.matrix(~ subtype - 1, data = clinical_data)
# each_subtypes_vs_ALL = apply(each_subtypes_vs_ALL, 2, as.factor)
# clinical_data = cbind(clinical_data, each_subtypes_vs_ALL)
```

**DIY**: Draw again the previous plot with these new intermediary variables.

```{r mrna_PCA_champ_intermediary_subtypes, class.source = 'fold-hide', fig.align = 'center'}
# interpret_PC(component_matrix = mrna_pca_results$ind$coord, clinical_df = clinical_data, title = "mRNA")
```

## PCA for other modalities
**DIY** Explore your other omic modalities with PCA.

```{r mirna_protein_PCA, eval = TRUE, message = FALSE, class.source = 'fold-hide', results='hide'}
mirna_pca_results = PCA(blocks$miRNA, scale.unit = TRUE, ncp = 25, graph = F)
protein_pca_results = PCA(blocks$DNAm, scale.unit = TRUE, ncp = 25, graph = F)
```

### Sample/individual Space
```{r mirna_protein_show_PCA_with_subtypes, fig.height=8, fig.width=10, class.source = 'fold-hide', fig.align = 'center'} a = fviz_pca_ind(mirna_pca_results, axes = c(1,2), subtitle = "miRNA") + geom_point(aes(colour = clinical_data$Sample_Group)) + guides(colour = guide_legend(title = "color")) b = fviz_pca_ind(protein_pca_results, axes = c(1,2), subtitle = "DNAm") + geom_point(aes(colour = clinical_data$Sample_Group)) + guides(colour = guide_legend(title = "color")) ggarrange(a, b, ncol = 2, nrow = 1) rm(a,b) ```

### Variable Space
```{r mirna_protein_PCA_var_space, fig.height=6, fig.width=10, class.source = 'fold-hide', fig.align = 'center'} a = fviz_contrib(mirna_pca_results, choice = "var", axes = 1, top = 50, subtitle = "miRNA") b = fviz_contrib(protein_pca_results, choice = "var", axes = 1, top = 50, subtitle = "DNAm") ggarrange(a, b, ncol = 2, nrow = 1) rm(a,b) ```

```{r mirna_protein_PCA_correlation_circle, fig.height=6, fig.width=10, class.source = 'fold-hide', fig.align = 'center'} a = fviz_pca_var(mirna_pca_results, axes = c(1, 2), repel = T, select.var = list(contrib = 50), subtitle = "miRNA") b = fviz_pca_var(protein_pca_results, axes = c(1, 2), repel = T, select.var = list(contrib = 50), subtitle = "DNAm") ggarrange(a, b, ncol = 2, nrow = 1) rm(a,b) ```

### Biplot
```{r mirna_protein_PCA_biplot, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'} a = fviz_pca_biplot(mirna_pca_results, repel = T, select.var = list(contrib = 50), axes = c(1,2), subtitle = "miRNA")+ geom_point(aes(colour = clinical_data$Sample_Group)) + guides(colour = guide_legend(title = "color")) b = fviz_pca_biplot(protein_pca_results, repel = T, select.var = list(contrib = 50), axes = c(1,2), subtitle = "DNAm")+ geom_point(aes(colour = clinical_data$Sample_Group)) + guides(colour = guide_legend(title = "color")) ggarrange(a, b, ncol = 2, nrow = 1) rm(a,b) ```

### Scree plot
```{r mirna_protein_PCA_scrreeplot, class.source = 'fold-hide', fig.align = 'center'} a = fviz_eig(mirna_pca_results, subtitle = "miRNA") b = fviz_eig(protein_pca_results, subtitle = "DNAm") ggarrange(a, b, ncol = 2, nrow = 1) rm(a,b) ```

### Going Further
```{r mirna_protein_PCA_champ_intermediary_subtypes, fig.height=8, fig.width=20, class.source = 'fold-hide', fig.align = 'center'} a = interpret_PC(component_matrix = mirna_pca_results$ind$coord, clinical_df = clinical_data, title = "miRNA", silent = T, show_rownames = F, show_legend = F) b = interpret_PC(component_matrix = protein_pca_results$ind$coord, clinical_df = clinical_data, title = "DNAm", silent = T) ggarrange(a, b, ncol = 2, nrow = 1) rm(a,b) ```

# Regularized Generalized Canonical Correlation Analysis (RGCCA) {#rgcca}

## First application of RGCCA
**DIY** Store all your omic tables in a list of matrices and verify that they all have the same number of samples, that samples are in rows and that sample_IDs match across matrices.

```{r RGCCA_first_verifications, class.source = 'fold-hide', fig.align = 'center'}
sapply(blocks, dim)
all(rownames(blocks$DNAm) == rownames(blocks$mRNA))
all(rownames(blocks$DNAm) == rownames(blocks$miRNA))
```

**DIY** Apply the `rgcca` function from the `RGCCA` package and extract more than 1 component per block. Also, let us print the progression of the algorithm. You will have to set the parameters:

- `blocks`: the list of matrices of size `samples` $\times$ `variables` you want to analyze.
- `ncomp`: the number of components extracted per block.
- `verbose`: whether or not the progression of the algorithm is displayed.

```{r RGCCA, class.source = 'fold-hide', fig.align = 'center'}
fit_RGCCA = rgcca(blocks = blocks, ncomp = 2, verbose = T)
```

Setting `verbose = TRUE` resulted in the progression of the algorithm being printed both as text and as a graph. In the text, three pieces of information are displayed on each line: (i) the number of the current iteration, (ii) the value of the optimization criterion of RGCCA (which we want to maximize) and (iii) the difference between the current and the previous value of the criterion. Furthermore, we can see that the algorithm is monotonically convergent (on the plots, the criterion always increases or stays the same; in the text, `Dif` is always positive or null). We can also see the sequential nature of the computation of the components: the first one is estimated until convergence, and only then is the second component estimated. In the end, convergence is obtained in a small number of iterations and relatively quickly. The algorithm stops when `Dif` falls below an arbitrarily low threshold (`1e-8`).

Let us explore the parameters of the `rgcca` function, their default values and what they stand for (these explanations are mostly taken from `??rgcca`, except for the comments in bold); an illustrative call gathering several of these parameters is given after the list:

- `tau` = `r fit_RGCCA$call$tau`: If tau is a vector, tau[j] is used for the constraints applied to all the block weight vectors associated to block $X_j$. If tau is a matrix, tau[k, j] is associated with the constraints applied to the kth block weight vector corresponding to block $X_j$.
- `tol` = `r fit_RGCCA$call$tol`: The stopping value for the convergence of the algorithm. **The lower this value, the more precise the estimates should be, but the longer it takes to estimate them.**
- `init` = `r fit_RGCCA$call$init`: A string giving the type of initialization to use in the RGCCA algorithm. It can be either a Singular Value Decomposition ("svd") or a random initialization ("random"). **The latter is actually quicker in situations where a block has a high number of variables and observations.**
- `bias` = `r fit_RGCCA$call$bias`: A logical value for a biased $(1/n)$ or unbiased $(1/(n-1))$ estimator of the variance/covariance.
- `quiet` = `r fit_RGCCA$call$quiet`: A logical value indicating if some diagnostic messages are reported.
- `scale` = `r fit_RGCCA$call$scale`: A logical value indicating if each variable of each block has to be standardized **(actually, variables are going to be at least centered whatever the value of this parameter is)**.
- `ncomp` = `r fit_RGCCA$call$ncomp`: A numerical value or a vector of length $J$ indicating the number of components per block (**meaning that a different number of components can be extracted for each block**). If a single value is provided, the same number of components is extracted for every block.
- `scheme` = `r fit_RGCCA$call$scheme`: A string or a function specifying the scheme function applied to covariance maximization among "horst" (the identity function), "factorial" (the square function - default value), "centroid" (the absolute value function). The scheme function can be any continuously differentiable convex function and it is possible to design the scheme function explicitly (e.g. function(x) x^4) as argument of the function.
- `method` = `r fit_RGCCA$call$method`: A string specifying which multiblock component method to consider. Possible values are found using `RGCCA::available_methods()`. **When a method is chosen, the values of `tau`, `connection` and `scheme` are automatically defined.**
- `verbose` = `r fit_RGCCA$call$verbose`: A logical value indicating if the progress of the algorithm is reported while computing.
- `sparsity` = `r fit_RGCCA$call$sparsity`: **Only relevant when you want to use a sparse version of RGCCA.**
- `response` = `r fit_RGCCA$call$response`: A numerical value giving the position of the response block. When the response argument is filled, the supervised mode is automatically activated. **In particular, it means that in such a situation, the connection matrix is automatically designed so that all blocks are linked to the response block and the response block only.**
- `NA_method` = `r fit_RGCCA$call$NA_method`: A string indicating the method used for handling missing values **("na.ignore": handle missing values, "na.omit": keep only subjects without any missing value across all blocks).**
- `comp_orth` = `r fit_RGCCA$call$comp_orth`: A logical value indicating if the deflation should lead to orthogonal block components or orthogonal block weight vectors.
- `n_iter_max` = `r fit_RGCCA$call$n_iter_max`: Integer giving the algorithm's maximum number of iterations.
- `connection` = `r fit_RGCCA$call$connection`: A $(J\times J)$ symmetric matrix describing the network of connections between blocks. **Usually $c_{kl} = c_{lk} = 1$ if blocks $k$ and $l$ are connected and $0$ otherwise. Most of the time $c_{kk}=0$; otherwise, the algorithm also tries to maximize the variance of the block component (when $tau[k]\neq 0$).**
- `superblock` = `r fit_RGCCA$call$superblock`: A logical value indicating if the superblock option is used. **The superblock corresponds to the concatenation of all the blocks provided. When it is used, all blocks are connected to the superblock and the superblock only. It is used to estimate components computed with all the variables across all the blocks. For examples using the superblock, see [(Garali & al. 2018)](https://doi.org/10.1093/bib/bbx060) and [(Cantini & al. 2021)](https://doi.org/10.1038/s41467-020-20430-7).**
- `scale_block` = `r fit_RGCCA$call$scale_block`: A logical value or a string indicating if each block is scaled. If TRUE or "inertia", each block is divided by the sum of the eigenvalues of its empirical covariance matrix. If "lambda1", each block is divided by the square root of the highest eigenvalue of its empirical covariance matrix. If standardization is applied (scale = TRUE), the block scaling applies to the standardized blocks. **In multi-block analysis, normalizing variables is important, but normalizing blocks is also mandatory!**
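To make these options concrete, here is a hedged, illustrative call that sets several of them explicitly (the particular values are examples, not recommendations, and this model is not used in the rest of the tutorial):

```{r rgccaExplicitSketch, eval = FALSE}
# Illustrative sketch: fully connected design (zero diagonal), factorial
# scheme, tau = 1 for every block and two components per block.
J <- length(blocks)
C <- matrix(1, nrow = J, ncol = J) - diag(J)
fit_sketch <- rgcca(blocks = blocks,
                    connection = C,
                    scheme = "factorial",
                    tau = 1,
                    ncomp = 2,
                    scale = TRUE,
                    scale_block = "inertia",
                    verbose = FALSE)
```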

## Explore the graphical representations of RGCCA

### Sample space

Similarly to PCA, the sample space can be explored.
**DIY** Display the sample space associated with your first RGCCA analysis with the function `plot`. Also color the samples depending on the phenotype/co-variable you are interested in (you can select the same one as in the PCA section). You will have to set the parameters:

- `type`: to select the type of space you want to display.
- `response`: corresponding to the phenotype/co-variable you selected.

```{r RGCCA_sample_space, class.source = 'fold-hide', fig.align = 'center'}
plot(fit_RGCCA, type = "sample", response = clinical_data$Sample_Group)
```

You can see here that, by default, the first component of the first block is displayed against the second component of the second block.
**DIY** Try a different representation of your sample space. You will have to set the parameters:

- `block`: to select the blocks you want to display.
- `comp`: to select the components you want to display.

```{r RGCCA_sample_space_play_with_comp, class.source = 'fold-hide', fig.align = 'center'}
plot(fit_RGCCA, type = "sample", response = clinical_data$Sample_Group, block = c(1, 3), comp = c(1, 1))
```

### Variable Space

Similarly, the variable space associated with only 1 component can be displayed.
**DIY** Display the variable space associated with your first RGCCA analysis with the function `plot`. Choose the block and the component you want to visualize and restrict the representation to the top 50 variables. You will have to set the parameters:

- `type`: to select the type of space you want to display.
- `block`: to select the block you want to display.
- `comp`: to select the component you want to display.
- `n_mark`: the number of variables contributing the most to this component.

```{r RGCCA_variable_space, class.source = 'fold-hide', fig.align = 'center'}
plot(fit_RGCCA, type = "weights", block = 3, comp = 2, n_mark = 50)
```

If you compare the variable space between PCA and RGCCA, you may notice that for RGCCA the weights can change sign. This is because for PCA the percentage of contribution was displayed (based on the absolute value of the weights), whereas here the weights are shown with their sign (and they are ordered by their absolute value).

Actually, the weights of multiple blocks can be shown at the same time.
**DIY** Display the variable space associated with all the blocks for the component of your choice. You will have to play with `n_mark` for all the blocks to really appear.

```{r RGCCA_variable_space_all_blocks, class.source = 'fold-hide', fig.align = 'center'}
plot(fit_RGCCA, type = "weights", comp = 1, n_mark = 200)
```

Similarly to PCA, you can display the correlation circle. However, RGCCA does not have the same property as PCA: the variable weights are no longer equal to the correlations between the components and the variables: `r stats::cor(fit_RGCCA$Y$mRNA[, c(1,2)], fit_RGCCA$call$blocks$mRNA[, 1])` $\neq$ `r fit_RGCCA$a$mRNA[1, c(1, 2)]`. So, what is actually displayed in the correlation circle proposed by RGCCA is the correlation between each component and the corresponding variable.
**DIY** Display the correlation circle for the block of your choice. You will have to set the parameters:

- `type`: to select the type of space you want to display.
- `display_blocks`: to select the block you want to display.

Unfortunately, the `n_marks` parameter is not implemented yet for this type of representation.

```{r RGCCA_variable_cor_cirle, class.source = 'fold-hide', fig.align = 'center'}
plot(fit_RGCCA, type = "cor_circle", display_blocks = 1)
```

**DIY** Display the correlation circle for all the blocks at the same time.

```{r RGCCA_variable_cor_cirle_all_blocks, class.source = 'fold-hide', fig.align = 'center'}
plot(fit_RGCCA, type = "cor_circle")
```

### Biplot

Similarly to PCA, a biplot representation can be displayed.
**DIY** Display a biplot for the block of your choice. You will have to set the parameters:

- `type`: to select the type of space you want to display.
- `response`: corresponding to the phenotype/co-variable you selected.
- `block`: to select the block you want to display.
- `comp`: to select the component you want to display.
- `show_arrows`: to optionally remove the arrows classically displayed on a correlation circle plot.

Unfortunately, the `n_marks` parameter is not implemented yet for this type of representation.

```{r RGCCA_biplot, class.source = 'fold-hide', fig.align = 'center'}
plot(fit_RGCCA, type = "biplot", show_arrows = F, block = 2, response = clinical_data$Sample_Group)
```

### Screeplot

Similarly to PCA, a screeplot can be displayed. An important property of PCA is that, when extracting the $j^{th}$ component $\bf{y}_j$ from a block $\bf{X} = \left[\bf{x}_1, \ldots, \bf{x}_p\right]$ of dimension $n\times p$, the variance of $\bf{y}_j$ equals $\lambda_j^2$, where $\lambda_j$ is the $j^{th}$ singular value of $\bf{X}$. Hence, the percentage of variance explained by this component can be computed as: $$\frac{\lambda_j^2}{\sum_{k=1}^{min(n,p)}\lambda_k^2}$$ Furthermore, as each new singular value is necessarily lower than or equal to the previous one and greater than or equal to the next one, the percentage of explained variance is guaranteed to decrease with the rank of the component. These properties are no longer true for RGCCA, especially because a block component no longer focuses solely on maximizing the variance extracted from its block, but also on being highly correlated with the components of the connected blocks (no link between a component and the singular values of its block can be made in the general case). A different formula is therefore used to compute the explained variance of the extracted components, which is strictly equivalent to the above formula in the case of PCA (i.e. components computed with PCA): $$\frac{1}{\sum_{k=1}^p var(\bf{x}_k)}\sum_{l=1}^p var(\bf{x}_l) cor(\bf{x}_l, \bf{y}_j)^2$$
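To connect this formula with the RGCCA output, the hedged sketch below recomputes the explained variance of the first mRNA component by hand and compares it with the value stored in the fitted object. The `AVE$AVE_X` slot name is an assumption based on the `RGCCA` package documentation, and the blocks are assumed to have been standardized by `rgcca` (the default `scale = TRUE`), so that the formula reduces to a mean squared correlation.

```{r rgccaAveSketch, eval = FALSE}
# Mean squared correlation between the mRNA variables and the first mRNA
# component (valid when the variables were standardized by rgcca).
y1 <- fit_RGCCA$Y$mRNA[, 1]
ave_manual <- mean(cor(blocks$mRNA, y1)^2)
ave_manual
# Value reported by the package (AVE per block and per component) -- slot name
# assumed from the package documentation.
fit_RGCCA$AVE$AVE_X$mRNA[1]
```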
**DIY** Display the average variance explained by each block and each component. You will have to correctly define the parameter `type` in the function `plot` for this.

```{r RGCCA_screeplot, class.source = 'fold-hide', fig.align = 'center'}
plot(fit_RGCCA, type = "ave")
```

Indeed, the average variance explained by each component is not necessarily decreasing with the rank of the component.

### Going Further
**DIY** Using the function `interpret_PC` on the result of RGCCA, try to evaluate the associations between each extracted component of each block and your clinical data. The component matrix can be retrieved through `fit_RGCCA$Y$mRNA` for example. ```{r RGCCA_going_further, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'} a = interpret_PC(component_matrix = fit_RGCCA$Y$mRNA, clinical_df = clinical_data, title = "mRNA", silent = T, show_rownames = F, show_legend = F) b = interpret_PC(component_matrix = fit_RGCCA$Y$miRNA, clinical_df = clinical_data, title = "miRNA", silent = T, show_rownames = F, show_legend = F) c = interpret_PC(component_matrix = fit_RGCCA$Y$DNAm, clinical_df = clinical_data, title = "DNAm", silent = T) ggarrange(a, b, c, nrow = 1, ncol = 3) rm(a,b,c) ```

## Playing with the parameters

### Impact of the parameter $\tau$
**DIY** For a higher number of components than previously, compute two RGCCA models, one with $\tau = 1$ and one with an arbitrarily low value of $\tau$. You will have to play with the parameter `tau`.

```{r RGCCA_impact_tau, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
ncomp = 5
fit_RGCCA_tau_1 = rgcca(blocks = blocks, tau = 1, ncomp = ncomp)
fit_RGCCA_tau_0.1 = rgcca(blocks = blocks, tau = 0.1, ncomp = ncomp)
```

**DIY** Apply the function `interpret_PC` on your two previous RGCCA models in order to see if there are any differences. ```{r RGCCA_impact_tau_fig, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'} a = interpret_PC(component_matrix = fit_RGCCA_tau_1$Y$mRNA, clinical_df = clinical_data, title = "mRNA \n tau = 1", silent = T, show_rownames = F, show_legend = F) b = interpret_PC(component_matrix = fit_RGCCA_tau_1$Y$miRNA, clinical_df = clinical_data, title = "miRNA \n tau = 1", silent = T, show_rownames = F, show_legend = F) c = interpret_PC(component_matrix = fit_RGCCA_tau_1$Y$DNAm, clinical_df = clinical_data, title = "DNAm \n tau = 1", silent = T) d = interpret_PC(component_matrix = fit_RGCCA_tau_0.1$Y$mRNA, clinical_df = clinical_data, title = "mRNA \n tau = 0.1", silent = T, show_rownames = F, show_legend = F) e = interpret_PC(component_matrix = fit_RGCCA_tau_0.1$Y$miRNA, clinical_df = clinical_data, title = "miRNA \n tau = 0.1", silent = T, show_rownames = F, show_legend = F) f = interpret_PC(component_matrix = fit_RGCCA_tau_0.1$Y$DNAm, clinical_df = clinical_data, title = "DNAm \n tau = 0.1", silent = T) ggarrange(a, b, c, d, e, f, nrow = 2, ncol = 3) rm(a,b,c,d,e,f) ```

### Impact of the function $g$
**DIY** Now, compute two RGCCA models with two different values of the function $g$. You will have to play with the parameter `scheme`.

```{r RGCCA_impact_g, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
fit_RGCCA_tau_factorial = rgcca(blocks = blocks, scheme = "factorial", ncomp = ncomp)
fit_RGCCA_tau_horst = rgcca(blocks = blocks, scheme = "horst", ncomp = ncomp)
```

**DIY** Apply the function `interpret_PC` to your two previous RGCCA models in order to see if there are any differences.

```{r RGCCA_impact_g_fig, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
a = interpret_PC(component_matrix = fit_RGCCA_tau_factorial$Y$mRNA, clinical_df = clinical_data,
                 title = "mRNA \n scheme = factorial", silent = T, show_rownames = F, show_legend = F)
b = interpret_PC(component_matrix = fit_RGCCA_tau_factorial$Y$miRNA, clinical_df = clinical_data,
                 title = "miRNA \n scheme = factorial", silent = T, show_rownames = F, show_legend = F)
c = interpret_PC(component_matrix = fit_RGCCA_tau_factorial$Y$DNAm, clinical_df = clinical_data,
                 title = "DNAm \n scheme = factorial", silent = T)
d = interpret_PC(component_matrix = fit_RGCCA_tau_horst$Y$mRNA, clinical_df = clinical_data,
                 title = "mRNA \n scheme = horst", silent = T, show_rownames = F, show_legend = F)
e = interpret_PC(component_matrix = fit_RGCCA_tau_horst$Y$miRNA, clinical_df = clinical_data,
                 title = "miRNA \n scheme = horst", silent = T, show_rownames = F, show_legend = F)
f = interpret_PC(component_matrix = fit_RGCCA_tau_horst$Y$DNAm, clinical_df = clinical_data,
                 title = "DNAm \n scheme = horst", silent = T)
ggarrange(a, b, c, d, e, f, nrow = 2, ncol = 3)
rm(a,b,c,d,e,f)
```

### Impact of the connection matrix $\bf{C}$
**DIY** Build two connection matrices: one with all the blocks connected to each other, and one with all the blocks connected to a single (common) one. Do not forget to put zeros on the diagonal of these matrices (otherwise, the variance of each component will also be maximized).

```{r RGCCA_impact_connection_matrix, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
L = length(blocks)
C_1 = matrix(1, nrow = L, ncol = L) - diag(L)
C_2 = matrix(0, nrow = L, ncol = L)
C_2[, L] = C_2[L, ] = 1
C_2[L, L] = 0
```

**DIY** Now, compute two RGCCA models, one with each connection matrix you created earlier. You will have to play with the parameter `connection`.

```{r RGCCA_impact_connection, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
fit_RGCCA_tau_complete = rgcca(blocks = blocks, connection = C_1, ncomp = ncomp)
fit_RGCCA_tau_hierarchical = rgcca(blocks = blocks, connection = C_2, ncomp = ncomp)
```

**DIY** Apply the function `interpret_PC` to your two previous RGCCA models in order to see if there are any differences.

```{r RGCCA_impact_connection_fig, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
a = interpret_PC(component_matrix = fit_RGCCA_tau_complete$Y$mRNA, clinical_df = clinical_data,
                 title = "mRNA \n complete design", silent = T, show_rownames = F, show_legend = F)
b = interpret_PC(component_matrix = fit_RGCCA_tau_complete$Y$miRNA, clinical_df = clinical_data,
                 title = "miRNA \n complete design", silent = T, show_rownames = F, show_legend = F)
c = interpret_PC(component_matrix = fit_RGCCA_tau_complete$Y$DNAm, clinical_df = clinical_data,
                 title = "DNAm \n complete design", silent = T)
d = interpret_PC(component_matrix = fit_RGCCA_tau_hierarchical$Y$mRNA, clinical_df = clinical_data,
                 title = "mRNA \n hierarchical design", silent = T, show_rownames = F, show_legend = F)
e = interpret_PC(component_matrix = fit_RGCCA_tau_hierarchical$Y$miRNA, clinical_df = clinical_data,
                 title = "miRNA \n hierarchical design", silent = T, show_rownames = F, show_legend = F)
f = interpret_PC(component_matrix = fit_RGCCA_tau_hierarchical$Y$DNAm, clinical_df = clinical_data,
                 title = "DNAm \n hierarchical design", silent = T)
ggarrange(a, b, c, d, e, f, nrow = 2, ncol = 3)
rm(a,b,c,d,e,f)
```

## Using permutation to tune the parameters in an unsupervised setting

In an unsupervised setting, a permutation procedure is provided by the `RGCCA` package in order to tune the parameters.
**DIY** Use the function `rgcca_permutation` to tune the parameter $\tau$. Depending on your previous results, choose the function $g$, the number of components and the connection matrix $\bf{C}$ you want to work with. You will have to set:

- `ncomp`: the number of components.

By default, 20 permutations are performed (use `n_perms` if you want to change it) to tune `tau` (modify `par_type` if you want to tune another parameter).

```{r RGCCA_perm_tau, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
# This function call is commented out because it takes too long
# perm_fit_RGCCA_tau = rgcca_permutation(blocks = blocks, ncomp = 10)
# Let us load the precomputed results. They were obtained with
# par_value = sapply(blocks, function(x) pracma:::logseq(x1 = 1e-4, x2 = 0.5, n = 20))
# n_perms = 50
# ncomp = 10
# init = "random"
# tol = 1e-5
perm_fit_RGCCA_tau = readRDS("data/MDD/perm_fit_RGCCA_tau.rds")
```

By default, $10$ values of the tuned parameter are tested and forced to be the same for all blocks and components. If needed, you can provide your own matrix of size `(number of tested sets of parameters)` $\times$ `(number of blocks)` to the parameter `par_value` of `rgcca_permutation`. If this procedure takes too long, you can play with the following parameters (an illustrative call is given after the list):

- `n_cores`: RGCCA is run $\left(n_{perms}+1\right)\times n_{par\_value}$ times, so it can be interesting to run these in parallel, which is possible with `rgcca_permutation`. `n_cores` corresponds to the number of cores you want to use in parallel.
- `tol`: the stopping value for the convergence of the algorithm. As we are mainly interested in relative differences between the values of the RGCCA criterion in permuted and unpermuted situations, as long as `tol` is sufficiently lower than this relative difference, it should not change the conclusion of the permutation test. Thus, it might be interesting to set `tol` to higher values than the default one ($1e-8$) in order to save some computation time.
- `init`: the type of initialization used for the RGCCA algorithm. The block-weight vectors can either be initialized with a Singular Value Decomposition (SVD) of their block or randomly. Initializing with the SVD can take more time (when there is a high number of variables and observations).
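As an illustration, a hedged version of the permutation call using these speed-ups could look like the sketch below (the number of permutations, tolerance and number of cores are arbitrary example values):

```{r rgccaPermSketch, eval = FALSE}
# Hedged sketch: fewer permutations, a looser tolerance, random initialization
# and several cores, to keep the computation time reasonable.
perm_sketch = rgcca_permutation(blocks = blocks,
                                par_type = "tau",
                                n_perms = 20,
                                ncomp = 10,
                                tol = 1e-5,
                                init = "random",
                                n_cores = 4)
```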

**DIY** With the function `plot`, display the result of this permutation and identify the optimal value for $\tau$. ```{r RGCCA_perm_tau_plot, fig.height=6, fig.width=4, class.source = 'fold-hide', fig.align = 'center'} plot(perm_fit_RGCCA_tau) ```

Each line of the graph above is associated with a set of tested parameters. For each line, the boxplot corresponds to the permuted distribution of RGCCA's criterion and the triangle to the un-permuted value of the criterion. At the time of writing, the `RGCCA` package does not allow tuning several parameters at the same time (e.g. `tau` and `ncomp`). This is not ideal, but a workaround is to perform the tuning sequentially: for example, tune `tau` with an arbitrarily high number of components and then, with `tau` fixed at its best value, tune the number of components by permutation.

## Bootstrapping: a way of assessing the robustness of the tuned model

Now that we have optimized the parameters of our model, we wish to evaluate its robustness. This will be done through bootstrapping.
**DIY** Compute an RGCCA model associated with the optimal parameter values estimated through permutation. You can access these optimal values through `perm_fit_RGCCA_tau$best_params`.

```{r RGCCA_opt_model, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
optim_fit_RGCCA = rgcca(blocks = blocks, tau = perm_fit_RGCCA_tau$best_params, ncomp = 10, init = "random")
```

**DIY** Use the function `rgcca_bootstrap` to compute the confidence interval associated with each feature for the optimal (according to the permutation procedure) RGCCA model. You will have to set:

- `rgcca_res`: the model you want to bootstrap.
- `n_boot`: the number of bootstrap re-samplings.

```{r RGCCA_opt_model_bootstrap, fig.height=12, fig.width=20, class.source = 'fold-hide', fig.align = 'center'}
# This call is commented out because it takes too long
# bootstrap_fit_RGCCA = rgcca_bootstrap(rgcca_res = optim_fit_RGCCA, n_boot = 100)

# Let us load the precomputed results. They were obtained with
# n_boot = 2000
bootstrap_fit_RGCCA = readRDS("data/MDD/bootstrap_fit_RGCCA.rds")
```

**DIY** Using the `plot` function, display the confidence interval associated with the block of your choice (parameter `block`), the component of your choice (parameter `comp`) and corresponding to the top `n_mark` features (by default, variables are ranked in decreasing order of the absolute value of their weight estimated on the whole (non-re-sampled) data-set).

```{r RGCCA_opt_model_bootstrap_fig, fig.height=6, fig.width=8, class.source = 'fold-hide', fig.align = 'center'}
plot(bootstrap_fit_RGCCA, block = 3, n_mark = 100, comp = 1)
```

The figure above shows the first 100 weights (black dots, estimated on the full cohort without re-sampling) together with their estimated confidence intervals (grey lines) for the first component of the DNAm block.

**DIY** You can also use the `summary` function to explore your bootstrapped results in detail and possibly identify significant weights. You can play with the parameters:

- `block`: the block you want to explore.
- `comp`: the component you want to explore.
- `adj.method`: the method used to adjust the p-values for multiple testing.

```{r RGCCA_opt_model_bootstrap_significant_weights, fig.height=6, fig.width=8, class.source = 'fold-hide', fig.align = 'center', results='hide'}
summary(bootstrap_fit_RGCCA, block = 2, comp = 1, adj.method = "fdr")
```

By default, variables are displayed in decreasing order of the absolute value of their estimated weights.

Have you identified one or more components built from variables whose weights are significantly different from 0 in multiple blocks? Congrats! You have built your first (or not o_0) multi-omics signature!! Now, you have to understand which signal it represents. You can use `interpret_PC` again to identify links between this component (or these components) and co-variables... or you can perform an enrichment analysis!!

Nothing significant here? Do not worry, there are plenty of things you can try!:

- Have you removed all individuals having one or more NAs? You can keep them: RGCCA can actually handle them.
- Can you rethink the pre-processing of your data? Maybe something went wrong there.
- Parameters can be better tuned. By default, RGCCA forces them to be the same across all components and blocks.
- Try the extensions of RGCCA that allow variable selection (see the parameter `sparsity` in `??rgcca`).
- Can you easily remove batch effects and/or other unwanted effects? (for example associations with age, site effects...)
- Can you supervise your model? (see the parameter `response` in `??rgcca`)
- **But before all that, let us try MOFA and JIVE!!**

# Multi-Omics Factor Analysis (MOFA) {#mofa}

Multi-Omics Factor Analysis (MOFA) is a multi-omics data integration approach providing a computational framework for the unsupervised discovery of the principal axes of biological and technical variation when multiple omics assays are applied to the same samples. Its extension, MOFA+, is designed for the integration of single-cell multi-modal data.

![MOFA: model overview and downstream analyses](https://i.imgur.com/4gG6vXG.jpg)

Argelaguet et al., Mol Syst Biol, 2018. *MOFA decomposes each omics into the product of a factor matrix and omics-specific weight matrices.*

Link to the MOFA article: https://www.embopress.org/doi/full/10.15252/msb.20178124

Link to the MOFA+ article: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02015-1

We will perform the MOFA analysis and interpret the results using an object containing various data modalities. Refer to the MOFA2 R package documentation for details about each function used in this section: https://www.bioconductor.org/packages/release/bioc/manuals/MOFA2/man/MOFA2.pdf

## First steps

### Vocabulary

In this MOFA section, we will use the following vocabulary:

* view or modality: omic type that will compose a block of data in the analysis (e.g. transcriptomic, proteomic, genomic...)
* feature: variable of a specific view (e.g. transcript, protein, locus...)
* factor: similar to a component in PCA, a factor is a weighted vector of all available features
* factor values: compose the Z matrix
* weight values (or loadings): compose the W matrix

### Input data

You should use a multiview matrix that contains all the data, cleaned and normalized. Each view of the matrix is a block of data corresponding to a modality, with features as rows and samples as columns. No concordance between rows across modalities is required, but columns should correspond to the same samples across modalities (you can still have some missing data). MOFA accepts several formats for this multiview matrix: data.frame, list of matrices, MultiAssayExperiment...
**DIY**: write the code to create such a multiview matrix for your data. ```{r MOFA_data_loading, class.source = 'fold-hide', results = "hide", message = FALSE, warning = FALSE} #we just have to transpose the blocks multiview_mat=list(methylation=t(blocks$DNAm), mrna=t(blocks$mRNA), mirna=t(blocks$miRNA)) ```

## Prepare MOFA

Before running MOFA, you should first create a MOFA object including the different data modalities and then specify the options.

### Create MOFA object

You can create a MOFA object from an artificially generated dataset with the ```make_example_data``` function, or from a real dataset such as the multiview matrix you just created. Each sub-matrix contains one modality, with samples as columns.
**DIY**: use the ```create_mofa``` function to create a *MOFAobject* containing the input data, then use the ```plot_data_overview``` function on this object to check the available information for each modality.

```{r MOFAobject_creation, message=FALSE, warning=FALSE}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Create the MOFA object                                                     ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
MOFAobject <- create_mofa(multiview_mat)

# Plot overview of the input data, including the number of samples, views,
# features, and the missing assays.
plot_data_overview(MOFAobject)
# everything matches here, but we could have kept non-matching samples with MOFA
```

NB: by default no group is defined among samples, however you can specify groups if necessary using the `groups` option of the ```create_mofa``` function (see the sketch below). Group information is useful to analyse single-cell datasets by specifying groups of samples (*i.e.* cells) coming from the same biological sample, or to take into account batches or experimental conditions.
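For illustration, here is a minimal sketch (not evaluated); `sample_batch` is a hypothetical vector with one label per sample, it does not exist in this data-set.

```{r MOFAobject_creation_groups, eval = FALSE}
# Sketch only: `sample_batch` is a hypothetical vector (one batch or cohort
# label per sample) -- it is not defined in this tutorial.
MOFAobject_grouped <- create_mofa(multiview_mat, groups = sample_batch)
plot_data_overview(MOFAobject_grouped)
```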

### MOFA options {.tabset}

We have to define the MOFA options before running it; they are grouped into 3 categories:

* Data options
* Model options
* Training options

You will have to create a list for each of these option categories.

#### Data options

**Main data options are:**

* scale_views: logical indicating whether to scale views to the same unit variance. As long as the scale differences between the views are not too high, this is not required. Default is FALSE.
* scale_groups: logical indicating whether to scale groups to the same unit variance. As long as the scale differences between the groups are not too high, this is not required. Default is FALSE.
* center_groups: logical indicating whether to center groups to the same mean. As long as the mean differences between the groups are not too high, this is not required. Default is FALSE.

You can also change the names of views and groups through the 'data_options' field of the MOFA object.
**DIY**: retrieve the list of default data options associated to your MOFAobject in a dedicated variable (eg named *data_options*) with the ```get_default_data_options``` function and modify any fields of this list if necessary. ```{r MOFA_dataoptions} ##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––## ## MOFA data options ## ##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––## data_options <- get_default_data_options(MOFAobject) data_options$scale_views <- TRUE data_options ```

#### Model options

**Main model options are:**

* likelihoods: data nature per view (possible options are “gaussian”, “poisson”, “bernoulli”). 'gaussian' is for continuous data, 'bernoulli' for binary data and 'poisson' for count/discrete data. By default they are inferred automatically.
* num_factors: number of factors to use for the factorization.

You can also specify, through 'model_options', the use of sparsity priors on the weight and factor matrices (W and Z), and which type of prior to use (ARD, spike-and-slab, or both). In the original MOFA approach, sparsity is only applied to the W matrix (ARD + spike-and-slab). In MOFA+ a sparsity prior is also applied to the factor matrix Z when considering several groups. The corresponding options are: spikeslab_factors, spikeslab_weights, ard_factors and ard_weights.
**DIY**: retrieve the list of default model options associated with your MOFAobject in a dedicated variable (*e.g.* named *model_options*) with the ```get_default_model_options``` function; first check the components of this list and, if necessary, modify the sparsity levels to consider in your model. The likelihoods and the number of factors will be modified in the next steps.

```{r MOFA_get_modeloptions}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## MOFA model options                                                         ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
model_options <- get_default_model_options(MOFAobject)
model_options$spikeslab_weights <- TRUE
model_options
```

**Likelihoods to use.**
The model flexibly handles different data distributions, so it is important to choose the likelihoods corresponding to the data (you can visually check the data distributions for that). You can also let MOFA decide for you, but it's better if you don't.

**DIY**: visually check the distribution of each data modality using a basic function such as ```hist```.

*Visual inspection of mRNA data*

```{r MOFA_countdistribmRNA}
## mRNA histogram ###################################
hist(MOFAobject@data$mrna$group1, breaks = 100, main = "Histogram of mRNA count values stored in MOFA object")
```

*Visual inspection of miRNA data*

```{r MOFA_countdistribmiRNA}
## miRNA histogram ###################################
hist(MOFAobject@data$mirna$group1, breaks = 100, main = "Histogram of miRNA values stored in MOFA object")
```

*Visual inspection of methylation data*

```{r MOFA_countdistribMethylation}
## methylation histogram ###################################
hist(MOFAobject@data$methylation$group1, breaks = 100, main = "Histogram of methylation values stored in MOFA object")
```

*Open question: should we model all these data modalities with a Gaussian distribution?*

**DIY**: if you decide to, change the corresponding likelihood for a given modality in the *model_options* variable. To change this option, you can use *model_options$likelihoods* <- "gaussian", "poisson" or "bernoulli".

```{r MOFA_changedistribMutations}
## as an example
# model_options$likelihoods["Mutations"] <- "bernoulli"
```

You can also transform your data to get a distribution closer to the expected one (e.g. for the Gaussian likelihood). It is better to anticipate this step before creating your MOFAobject, but you can also modify the corresponding fields in ```MOFAobject@data```.

```{r MOFA_changedistribmRNA}
## as an example
# MOFAobject@data$mrna$group1 <- log(MOFAobject@data$mrna$group1)
# hist(MOFAobject@data$mrna$group1, breaks = 100, main = "Histogram of log mRNA count values stored in MOFA object")
```

**Number of factors**
For the number of factors, you can orient your choice using matrix factorization functions that estimate an "optimal" factorization rank, for example the ```nmfEstimateRank``` function from the ```NMF``` package. Different strategies can be used to decide the optimal rank based on the results produced by ```nmfEstimateRank``` (read section 2.6 in https://cran.r-project.org/web/packages/NMF/vignettes/NMF-vignette.pdf).

**DIY**: use the ```nmfEstimateRank``` function on each modality to decide the number of factors to use with MOFA [optional]. **Warning: the execution of such an approach can be time consuming, so keep a reasonable range (*i.e.* fewer than 10 different values).**

```{r MOFA_estimateRank, eval=FALSE}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Estimate matrix ranks                                                      ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
rankEstimate_mrna <- nmfEstimateRank(
  MOFAobject@data$mrna$group1[, which(apply(MOFAobject@data$mrna$group1, 2, function(x) sum(is.na(x))) == 0)],
  seq(5, 10), method = 'brunet', nrun = 5, seed = 123456)
plot(rankEstimate_mrna)

rankEstimate_miRNA <- nmfEstimateRank(
  MOFAobject@data$mirna$group1[, which(apply(MOFAobject@data$mirna$group1, 2, function(x) sum(is.na(x))) == 0)],
  seq(5, 10), method = 'brunet', nrun = 5, seed = 123456)
plot(rankEstimate_miRNA)

rankEstimate_methylation <- nmfEstimateRank(
  MOFAobject@data$methylation$group1[, which(apply(MOFAobject@data$methylation$group1, 2, function(x) sum(is.na(x))) == 0)],
  seq(5, 10), method = 'brunet', nrun = 5, seed = 123456)
plot(rankEstimate_methylation)
```

Then you can decide the number of factors to use in MOFA based on the rank of each data modality, for example by taking the maximum rank over all modalities. You can also simply use ~10 factors by default: MOFA can automatically remove uninformative factors during training (see training options).

**DIY**: set the number of factors in the corresponding field of your *model_options* variable.

```{r MOFA_set_modeloptions}
## Model options ###################################
model_options$num_factors <- 10
```

#### Training options

**Main training options are:**

* maxiter: maximum number of iterations (default is 1000).
* convergence_mode: "fast", "medium" or "slow". For exploration, the fast mode is good enough.
* startELBO: initial iteration at which to compute the ELBO (the objective function used to assess convergence).
* freqELBO: frequency of computation of the ELBO.
* drop_factor_threshold: threshold on the fraction of variance explained below which a factor is dropped during training (e.g. 0.01 for 1%). Default is -1 (no dropping of factors). One can change it to discard factors that explain a low percentage of variance in all views.
* stochastic: logical indicating whether to use stochastic variational inference (only required for very big data sets, default is FALSE).
* seed: numeric indicating the seed for reproducibility (default is 42).
**DIY**: retrieve the list of default training options associated to your MOFAobject in a dedicated variable (*eg* named *train_options*) with the ```get_default_training_options``` function and modify this list if necessary. ```{r MOFA_trainingoptions} ##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––## ## MOFA training options ## ##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––## train_options <- get_default_training_options(MOFAobject) train_options$convergence_mode <- "slow" train_options$freqELBO <- 5 train_options$startELBO <- 400 train_options$seed <- 42 train_options$drop_factor_threshold <- -1 train_options ```

## Run MOFA ### Prepare the object and run MOFA
**DIY**: use the ```prepare_mofa``` function to wrap all the previous variables, including *MOFAobject*, *data_options*, *model_options* and *train_options*. Then run MOFA using the ```run_mofa``` function on the *MOFAobject*.

```{r prepare_MOFA, warning=FALSE, message=FALSE}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Prepare the MOFA object                                                    ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
MOFAobject <- prepare_mofa(MOFAobject,
  data_options = data_options,
  model_options = model_options,
  training_options = train_options
)
```

```{r run_MOFA, warning=FALSE, message=FALSE}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Run MOFA                                                                   ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
MOFAobject <- run_mofa(MOFAobject,
  outfile = "/home/vandelj/Bureau/Bilille/Formation/ETBII/2024/Git/etbii/THEMATIC-SESSIONS/Multivariate_analysis/02_TP/data/MDD/MOFA2_TCGA.hdf5",
  use_basilisk = TRUE)
```

**NB:** You can also load a pre-computed MOFA model from an hdf5 file with the ```load_model``` function (see the sketch below).
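For instance, a minimal sketch (not evaluated); the relative path below is an assumption, adapt it to wherever your hdf5 file was written by `run_mofa`.

```{r MOFA_load_model, eval = FALSE}
# Sketch only: reload a trained MOFA model from the hdf5 file written by
# run_mofa (adjust the path to your own `outfile`).
MOFAobject <- load_model("data/MDD/MOFA2_TCGA.hdf5")
```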

## Results interpretation

### Get familiar with the MOFA object

#### Slots of MOFA object

The resulting *MOFAobject* is an S4 object used to store all the data relevant for analysing a MOFA model.
**DIY**: display the available attributes using the ```slotNames``` function.

```{r MOFA_available_slots}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## MOFA object attributes                                                     ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
slotNames(MOFAobject)
```

Object attributes, accessible using '@':

* data: input data (either a list of matrices or a MultiAssayExperiment)
* intercepts: feature intercepts
* imputed_data: imputed data (filled by running impute on the object, which is not necessary)
* samples_metadata and features_metadata (basically groups)
* expectations: expectations of the factors and weights ($Z and $W matrices)
* training_stats: statistics about the iterative training process (time, ELBO...)
* data_options: data processing options (given as parameters)
* model_options: model options (given as parameters)
* training_options: training options (given as parameters)
* stochastic_options: stochastic training options (used when the stochastic option is TRUE in training_options)
* dimensions: dimensions for the number of views (M), groups (G), samples (N), features per view (D) and factors (K)
* on_disk: logical indicating whether data is loaded from disk
* dim_red: non-linear dimensionality reduction manifolds
* cache: variance explained by factors
* status: auxiliary variable indicating whether the model has been trained
* mefisto_options, covariates, covariates_warped and interpolated_Z: optional slots to store sample covariates, only used when training with MEFISTO

#### Functions to get infos from MOFA object Some interesting functions to access these attributes: * ```factors_names()```: to get or set factor names * ```features_names()```: to get or set feature names * ```samples_names()```: to get or set sample names * ```views_names()```: to get or set view names * ```get_dimensions()```: to get dimensions (number of samples, features, etc.) * ```get_factors()```: to get model factors * ```get_weights()```: to get model weights
**DIY**: try some of these functions on *MOFAobject*. ```{r MOFA_basic_function_tests} ## Function tests ################################### views_names(MOFAobject) get_dimensions(MOFAobject) ```
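A couple more getters, as a minimal sketch (not evaluated), to see how the Z and W matrices are stored:

```{r MOFA_more_getters, eval = FALSE}
# Sketch only: the factors (Z matrix) come back as one matrix per group and
# the weights (W matrices) as one matrix per view.
factors_names(MOFAobject)
str(get_factors(MOFAobject), max.level = 1)
str(get_weights(MOFAobject), max.level = 1)
```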

#### Add sample metadata to the model
**DIY**: use the ```samples_metadata``` function to access and rewrite (or add) metadata on the *MOFAobject*.

```{r MOFA_addMetadata}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Metadata                                                                   ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Add sample metadata to the model
temp_metadata <- readRDS(file = paste0(data_path, "pd_mdd_bin.RDS"))
colnames(temp_metadata)[2] = "sample" # one metadata column should contain the sample IDs matching the input data and be labelled "sample"

# remove metadata for samples without omic information
temp_metadata <- temp_metadata[which(temp_metadata[, 2] %in% colnames(MOFAobject@data$mrna$group1)), ]
# which(temp_metadata[,2]=="302-1") # two samples with the same number but different names in the men+women cohort
# temp_metadata <- temp_metadata[-95,]

samples_metadata(MOFAobject) <- temp_metadata
```

### Results To interpret these results it is important to have information about the samples [Biological context (Major Depressive Disorder)]. #### General factors plots {.tabset} First, we'll analyze all the factors at a high level to get an overview of the main trends in decomposition. ##### Factors correlation plot
**DIY**: use the ```plot_factor_cor``` function to display the correlations between factors.

```{r MOFA_CorrelationPlot}
## Factor correlation plot ###################################
plot_factor_cor(MOFAobject)
```

Normally, factors should not be too correlated. If they are, you have probably chosen too many factors, creating redundancy between some of them.

##### Variance decomposition

Displaying the percentage of variance explained by each factor in each view is a key MOFA plot and should always be done before inspecting factor values or weights.
**DIY**: use the ```plot_variance_explained``` function to display the variance explained by each factor in each view.

```{r MOFA_varianceExplainedFactors, warning=FALSE}
## Variance explained values ###################################
plot_variance_explained(MOFAobject, max_r2=20)
# plot_variance_explained(MOFAobject, plot_total = T)[[1]]
```

**DIY**: you can get the numerical values from the cache attribute of the *MOFAobject*, in the ```variance_explained``` field.

```{r MOFA_varianceExplained, warning=FALSE}
## Variance explained values ###################################
MOFAobject@cache$variance_explained$r2_per_factor
```

We can ask whether the model provides a good fit to the data. To check this, we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc. Some rules of thumb:

* Noisy datasets will result in a lower amount of variance explained (<10%).
* The higher the number of samples, the smaller the total variance explained.
* The higher the number of factors, the higher the total variance explained.

The objective is not to reach 100%, even with a large number of factors. Explaining ~50% of a view with a linear model is already good.

**DIY**: use the ```plot_variance_explained``` function to display the total variance explained for each view.

```{r MOFA_varianceHisto, warning=FALSE}
## Variance explained histogram ###################################
plot_variance_explained(MOFAobject, plot_total = T)[[2]]
```

The latent factors inferred by MOFA represent the underlying principal sources of variation among the samples. A factor can capture a source of variability that is present across all data modalities; its etiology is then likely to be something very important for the studied phenomenon. Conversely, a factor can capture a very strong source of variation that is specific to a single modality.
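As a numerical complement to the plots above, the total variance explained per view should also be stored in the same cache; here is a minimal sketch (not evaluated), assuming the field is named `r2_total` alongside `r2_per_factor` (check `names(MOFAobject@cache$variance_explained)` if it differs in your MOFA2 version).

```{r MOFA_varianceExplained_total, eval = FALSE}
# Sketch only: numerical counterpart of the total-variance histogram.
# The field name `r2_total` is an assumption; list the available fields with
# names(MOFAobject@cache$variance_explained) if needed.
MOFAobject@cache$variance_explained$r2_total
```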

We can also access the variance explained per sample (only available with Gaussian data).
**DIY**: if your data are exclusively Gaussian (regarding the model_options parameters), use the ```calculate_variance_explained_per_sample``` function to display the variance explained per sample.

```{r MOFA_varianceExplainedSample, warning=FALSE}
## Variance explained per sample ###################################
# calculate_variance_explained(MOFAobject)
datatable(calculate_variance_explained_per_sample(MOFAobject)$group1) # only with purely gaussian data
```

##### Summary plot of factors We can look at associations (correlations) between MOFA factors and some metadata such as Gender, Age... these associations are a good starting point to interpret factors.
**DIY**: use the ```correlate_factors_with_covariates``` function to search for such associations with the available metadata.

```{r MOFA_metadataCorrelations, fig.width = 10, warning=FALSE}
## Factor correlation with metadata ###################################
correlate_factors_with_covariates(MOFAobject,
  covariates = names(clinical_data),
  plot = "log_pval"
)
```

We can also stay at the sample level and display factor values for each sample (Z matrix) crossed with metadata.
**DIY**: use the ```plot_factor``` function to display the factor values and use metadata to color the dots. For a given factor, coloring the dots by a metadata variable that was significant in the previous correlation analysis should give you a sharp separation on this plot.

```{r MOFA_factorValues, fig.width = 10, warning=FALSE}
## Plot factor values ###################################
p <- plot_factor(MOFAobject,
  factors = c(1:5),
  color_by = "GR",
  dot_size = 3,      # change dot size
  dodge = T,         # dodge points with different colors
  legend = T,        # show the legend
  # add_violin = T,  # add violin plots (here we have too few points to make violins, let's do boxplots)
  # violin_alpha = 0.25  # transparency of violin plots
  add_boxplot = T
)

# The output of plot_factor is a ggplot2 object that we can edit
print(p)
```

##### Combinations of factors

We can study pairwise combinations of factor values using the ```plot_factors``` (with an s) function, to detect associations between pairs of factors and a metadata variable (similar to a PCA representation).
**DIY**:try ```plot_factors``` on the *MOFAobject* still coloring dots using metadata. ```{r MOFA_factorCombination, fig.width = 10, fig.height = 9, warning = FALSE, message=FALSE} ## Plot combinations of factor values ################################### plot_factors(MOFAobject, factors = 1:5, color_by = "GR" ) ```

#### Explore individual factors {.tabset}

After this general overview of all factors, you can explore a specific factor of interest in detail.

##### Useful functions

**To explore loadings (W matrix)**

* plot_top_weights(): plot the top loadings for a given factor and view
* plot_weights(): plot all loadings for a given factor and view
* plot_weights_heatmap(): plot a heatmap of the loadings from multiple factors of a given view

**To explore data**

* plot_data_heatmap(): heatmap of the training data using only the top features of a given factor
* plot_data_scatter(): scatterplot of the data using only the top features of a given factor

The use of these functions is illustrated in the next sections.

##### Table of weights/factors

The weights measure how strongly each feature relates to each factor:

* no association = 0
* strong association = high absolute values

The sign of the weight gives the direction of the effect (*e.g.* a positive weight indicates that the feature has higher levels in the samples with positive factor values).
**DIY**: access the 'W' field of the ```expectations``` *MOFAobject* attribute and display the weights associated with a given modality. Use the ```datatable``` function for a nicer visualization.

```{r MOFA_weightTable}
## Weight values ###################################
datatable(MOFAobject@expectations$W$mrna[order(MOFAobject@expectations$W$mrna[, 1], decreasing = TRUE), ]) %>%
  formatRound(c(1:dim(MOFAobject@expectations$W$mrna)[2]), 3)
```
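If you prefer to avoid the raw slot, the same information can be obtained through the dedicated getter; here is a minimal sketch (not evaluated), assuming ```get_weights``` accepts `views` and `factors` arguments (see `?get_weights`).

```{r MOFA_weights_getter, eval = FALSE}
# Sketch only: extract the mRNA weight matrix with the getter and rank the
# features by their absolute weight on the first factor.
W_mrna <- get_weights(MOFAobject, views = "mrna", factors = "all")[[1]]
head(W_mrna[order(abs(W_mrna[, 1]), decreasing = TRUE), , drop = FALSE], 10)
```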

```{r MOFA_factorToStudy}
## Factor to study ###################################
factorToStudy = 6
```

In the next section we will focus on factor `r factorToStudy`; however, feel free to choose another factor.

##### Exploration of factor `r factorToStudy`

**Factor values**

We can first display the factor `r factorToStudy` values for each sample (Z matrix).
**DIY**: use the ```plot_factor``` function to display the factor `r factorToStudy` values and color the dots according to this value. You can also group the samples according to a qualitative metadata variable to identify biological associations.

```{r MOFA_factorValuePlot}
## Plot factor values ###################################
plot_factor(MOFAobject,
  factors = factorToStudy,
  color_by = paste("Factor", factorToStudy, sep = ""),
  group_by = "Sample_Group"
)
```

**Weight of associated features** Then we can search for features associated to factor `r factorToStudy` in each modality to give a "signature" of the factor.
**DIY**: visualize the weight distribution for factor `r factorToStudy` in the different modalities using the ```plot_weights``` function.

```{r MOFA_distribWeights}
## Plot weight values distribution ###################################
plot_weights(MOFAobject,
  view = "mrna",
  factor = factorToStudy,
  nfeatures = 10,  # Top number of features to highlight
  scale = T        # Scale weights from -1 to 1
)
```

**NB:** you can specify the number of features to highlight in this distribution through the 'nfeatures' option.

**DIY**: use the ```plot_top_weights``` function to focus on the top weights (up and down) for factor `r factorToStudy` in each modality. The weights measure how strongly each feature relates to each factor (no association = 0; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the feature has higher levels in the samples with positive factor values).

```{r MOFA_topWeights}
## Plot top weight values ###################################
plot_top_weights(MOFAobject,
  view = "mrna",
  factor = factorToStudy,
  nfeatures = 10,  # Top number of features to highlight
  scale = T        # Scale weights from -1 to 1
)
```

**Raw values of associated features** Instead of looking at weight values associated to each feature, we can also display raw values of features associated to factor `r factorToStudy`.
**DIY**: use the ```plot_data_heatmap``` function to display the raw values of the top features (*e.g.* the top 20-25) associated with factor `r factorToStudy` in each modality. You can also cluster features and/or samples based on these values.

```{r MOFA_featureHeatmap}
## Plot feature heatmap ###################################
plot_data_heatmap(MOFAobject,
  view = "mrna",
  factor = factorToStudy,
  features = 25,
  cluster_rows = TRUE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  denoise = TRUE,
  scale = "row"
)
```

To focus on a limited number of features you can also display raw values as a scatter plot and cross information with metadata or factors.
**DIY**: use the ```plot_data_scatter``` function to plot the raw values of the top 4 features with positive weights on factor `r factorToStudy` in each modality, and color samples based on metadata.

```{r MOFA_posWeightScatter}
## Plot positive weight feature scatter ###################################
plot_data_scatter(MOFAobject,
  view = "mrna",
  factor = factorToStudy,
  features = 4,
  sign = "positive",
  color_by = "GR"
) + labs(y = "mRNA expression")
```

**DIY**: do the same for the top 4 features with negative weights on factor `r factorToStudy` in each modality, and color samples based on metadata.

```{r MOFA_negWeightScatter}
## Plot negative weight feature scatter ###################################
plot_data_scatter(MOFAobject,
  view = "mrna",
  factor = factorToStudy,
  features = 4,
  sign = "negative",
  color_by = "GR"
) + labs(y = "mRNA expression")
```

### To go further...

Before going any further, consider [Joint and Individual Variation Explained (JIVE)] and [Regularized Generalized Canonical Correlation Analysis (RGCCA)] if you haven't tried them yet.

#### Play with data pre-processing and MOFA options

You can modify the pre-processing of the data and/or the model/training options to check their effect on the factors.

#### Ontology enrichment on features

You can run a gene set enrichment analysis using the groups of genes with high weights for a given factor, to get indications about the biological meaning of the factors.
**DIY**: using the ```run_enrichment``` and ```plot_enrichment_heatmap``` functions, try to identify ontology enrichments associated with each factor. Some pre-built gene sets are implemented in MOFAdata (see https://github.com/bioFAM/MOFAdata for details); they can be used as feature.sets in the ```run_enrichment``` function.

```{r MOFA_enrichment, eval=FALSE}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Enrichment                                                                 ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
data("reactomeGS")

## perform enrichment analysis
fsea.out <- run_enrichment(
  MOFAobject,
  view = "mrna",
  feature.sets = reactomeGS,
  alpha = 0.01
)
# plot_enrichment(fsea.out, 1, alpha = 0.01)

## plot enrichment heatmap
plot_enrichment_heatmap(fsea.out, alpha = 0.01)

## plot enrichment lollipop
plot_enrichment(fsea.out, factor = factorToStudy, alpha = 0.5)

## plot enrichment with feature details
plot_enrichment_detailed(fsea.out, factor = factorToStudy, alpha = 0.5)
```

#### Cluster samples

MOFA is an unsupervised approach; as with PCA, you can try to cluster samples based on their projection onto the MOFA factor space.
**DIY**: using the ```cluster_samples``` function identify clusters of samples based on their projection on the first factors. Then use the ```plot_factors``` function to display the result of this clustering by coloring samples according to clusters. ```{r MOFA_cluster, eval=FALSE} ##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––## ## Clustering ## ##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––## ## plot sample projection on factor space plot_factors(MOFAobject, factors = 1:2, color_by = "CD4") ## cluster samples based on factor projection clusters <- cluster_samples(MOFAobject, k=2, factors=1:2) ## plot sample projection on factor space with cluster information plot_factors(MOFAobject, factors=1:2, color_by=clusters$cluster) ```

#### Non-linear reduction (UMAP/tSNE)

The MOFA factors are linear (as in PCA). Nevertheless, they can be used as input to other methods that learn compact non-linear manifolds (such as t-SNE or UMAP).
**DIY**: use the ```run_tsne``` or ```run_umap``` function to run t-SNE or UMAP on the MOFA factors, then use the ```plot_dimred``` function on the resulting object to display the t-SNE/UMAP representation.

```{r MOFA_UMAP_TSNE, eval=FALSE}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## UMAP/tSNE reduction                                                        ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Compute non-linear dimensionality reduction
set.seed(42)
model <- run_umap(MOFAobject)
# model <- run_tsne(MOFAobject, factors = 1:5)

## Plot non-linear dimensionality reduction
plot_dimred(model,
  method = "UMAP",  # method can be either "TSNE" or "UMAP"
  color_by = "CD4"
)
```

# Joint and Individual Variation Explained (JIVE) {#jive}

JIVE is, like RGCCA and MOFA, a JDR method, but with some particularities. The main characteristic of JIVE is its ability to extract, from the data, a joint structure (called $\mathbf J$) that contains the signal common to all blocks, and individual structures (called $\mathbf A_k$) that contain the signal specific to block $k$. In other words, each block is decomposed as $\mathbf X_k = \mathbf J_k + \mathbf A_k + \boldsymbol{\varepsilon}_k$, where the joint parts $\mathbf J_k$ share a common low-rank structure across blocks, each $\mathbf A_k$ is a low-rank structure specific to block $k$, and $\boldsymbol{\varepsilon}_k$ is residual noise.

## Apply the regular version of JIVE

We first apply the regular version of JIVE (from the package `r.jive`) to the MDD data. Because of the shift in notations (rows are variables and columns are observations), we need to transpose all the blocks. Be aware that it takes a non-negligible amount of time to run (more than 1 hour on my computer)! If you are running out of time, run the method with `rankJ = 3` and a handful of components for each block (say `rankA = c(5, 5, 5)`). In this case, do not forget to set the parameter `method` to `"given"`.

```{r regular jive, echo = TRUE, eval = FALSE, cache = TRUE}
set.seed(456) # For permutation reproducibility
blocks.jive <- lapply(blocks, t) # r.jive expects variables in rows and observations in columns
res.jive <- jive(data = blocks.jive)
```

```{r save and load regular jive, echo = FALSE, eval = TRUE}
blocks.jive <- lapply(blocks, t)
# saveRDS(res.jive, "res.jive.rds")
res.jive <- readRDS("data/MDD/res.jive.rds")
```

The permutation procedure ended up estimating $r_J = 3$ joint components and $\mathbf{r}_A = [13, 8, 36]$ individual components, based on partial results of the permutation method. These are the results that will be shown in the following. If you chose different parameters, you may obtain different results (!).

### "Variation explained"

The explained variances (by the joint and individual components) are depicted on the following graph.

```{r explained variance}
plot(res.jive)
```

A good portion of the variability is shared between the mRNA and miRNA blocks, while the DNAm block stands on its own. The following table shows the individual block ranks, with more than 30 for DNAm.

```{r jive ranks}
knitr::kable(summary(res.jive)$Ranks)
```

### Heatmaps

Then we plot the joint and individual approximations of each block. The result is impressive and shows a beautiful joint structure, while the individual structure looks more complex to read.

```{r jive heatmaps, width = 40, height = 40, echo = TRUE, eval = FALSE}
r.jive:::plot.jive(res.jive, "heat")
```

```{r jive heatmaps display, width = 40, height = 40, echo = FALSE, eval = TRUE}
include_graphics("data/MDD/heatmap_jive_MDD.png")
```

### Joint structure

One key characteristic of JIVE is its ability to extract joint and individual characteristics from the blocks. Here we plot the joint structure: it needs a little bit of work, so do not hesitate to take a peek at the code to help you. In a nutshell, we

1. extracted the components and loadings with `jive.predict`,
2. plotted the joint components with `ggplot2`.

```{r joint}
pred.results <- jive.predict(blocks.jive, res.jive)
J <- data.frame(
  J = t(pred.results$joint.scores),
  group = clinical_data$Sample_Group,
  row.names = rownames(blocks[[1]]))
J %>%
  ggplot(aes(J.1, J.2, color = group)) +
  geom_point() +
  theme_bw()
```

The joint structure could be related to other clinical features. Go ahead and check this hypothesis with the `interpret_PC` heatmaps that were used in the previous sections.
```{r check joint structure}
interpret_PC(
  component_matrix = data.frame(J = t(pred.results$joint.scores)),
  clinical_df = clinical_data,
  title = "JIVE joint dimensions")
```

The joint structure does not really look like it explains the difference between MDD patients and controls. To explore the individual structures, we perform an MFA on the three blocks of individual structures and show the individual factor map below.

```{r mfa individuals}
indivlist <- lapply(pred.results$indiv.scores, t)
names(indivlist) <- names(blocks)
A <- data.frame(
  indivlist,
  group = clinical_data$Sample_Group,
  row.names = rownames(blocks[[1]]))
res.mfa.indiv <- MFA(A, c(res.jive$rankA, 1),
  type = c(rep("s", 3), "n"),
  name.group = c(names(blocks), "group"),
  num.group.sup = 4, graph = FALSE)
labz <- sprintf(
  "Dim %i (%0.2f%%)",
  1:nrow(res.mfa.indiv$eig),
  res.mfa.indiv$eig[, 2])
dat.mfa <- data.frame(
  res.mfa.indiv$ind$coord,
  group = A$group)
dat.mfa %>%
  ggplot(aes(Dim.1, Dim.2, color = group)) +
  geom_hline(yintercept = 0, color = "grey") +
  geom_vline(xintercept = 0, color = "grey") +
  geom_point() +
  labs(x = labz[1], y = labz[2]) +
  stat_ellipse() +
  coord_equal() +
  theme_bw()
```

And we also apply the heatmaps previously used.

```{r check joint structure 2}
A <- data.frame(t(Reduce("rbind", pred.results$indiv.scores)))
namesA <- unlist(setNames(lapply(res.jive$rankA, seq), names(blocks)))
colnames(A) <- names(namesA)
interpret_PC(
  component_matrix = A,
  clinical_df = clinical_data,
  title = "JIVE individual dimensions")
```

Finally, we extract the top features associated with each individual component.

```{r top features}
load2.mRNA <- pred.results$indiv.load[[1]][, 2]
load2.miRNA <- pred.results$indiv.load[[2]][, 1]
load2.DNAm <- pred.results$indiv.load[[3]][, 1]

dat.load2 <- rbind(
  tibble(
    ID = colnames(blocks[[1]]),
    sqload = load2.mRNA,
    block = "mRNA") %>%
    slice_max(load2.mRNA, n = 10),
  tibble(
    ID = colnames(blocks[[2]]),
    sqload = load2.miRNA,
    block = "miRNA") %>%
    slice_max(load2.miRNA, n = 10),
  tibble(
    ID = colnames(blocks[[3]]),
    sqload = load2.DNAm,
    block = "DNAm") %>%
    slice_max(load2.DNAm, n = 10)
)

gload_mRNA <- dat.load2 %>%
  filter(block == "mRNA") %>%
  ggplot(aes(x = sqload, y = reorder(ID, sqload))) +
  geom_point() +
  geom_segment(aes(xend = sqload, yend = reorder(ID, sqload)), x = 0) +
  theme_bw() +
  labs(x = "Squared loading", y = "")

gload_miRNA <- dat.load2 %>%
  filter(block == "miRNA") %>%
  ggplot(aes(x = sqload, y = reorder(ID, sqload))) +
  geom_point() +
  geom_segment(aes(xend = sqload, yend = reorder(ID, sqload)), x = 0) +
  theme_bw() +
  labs(x = "Squared loading", y = "")

gload_DNAm <- dat.load2 %>%
  filter(block == "DNAm") %>%
  ggplot(aes(x = sqload, y = reorder(ID, sqload))) +
  geom_point() +
  geom_segment(aes(xend = sqload, yend = reorder(ID, sqload)), x = 0) +
  theme_bw() +
  labs(x = "Squared loading", y = "")

ggpubr::ggarrange(gload_mRNA, gload_miRNA, gload_DNAm)
```

# Comparison between PCA/RGCCA/MOFA/JIVE {#comparison}

Let's compare the methods, all parameterized to act as "unsupervised", in order not to favor RGCCA and JIVE. First, extract the components from RGCCA (pick your favorite unsupervised RGCCA model, or just re-run RGCCA with a few components). Also extract the components that you want to include in this comparison from `MOFAobject` and `pred.results`. You can compare the dimensions with a number of tools: PCA, correlograms, heatmaps, hierarchical clustering and much more! We opted for a simple PCA and a correlogram on a few components.
```{r rgcca + jive + mofa} res.rgcca.final <- rgcca(blocks = blocks, ncomp = c(10, 10, 10)) comp.rgcca <- data.frame(res.rgcca.final$Y) names(comp.rgcca) <- paste0("RGCCA_",names(comp.rgcca)) comp.mofa <- get_factors(MOFAobject)$group1 colnames(comp.mofa) <- paste0("MOFA_", colnames(comp.mofa)) comp.jive <- cbind(J %>% select(-group), A) names(comp.jive) <- paste0("JIVE_", names(comp.jive)) comp.all <- cbind(comp.rgcca, comp.mofa, comp.jive) res.pca.comparison <- PCA(comp.all, graph = FALSE) ## Correlation circle fviz_pca_var(res.pca.comparison, repel = TRUE, select.var = list(cos2 = 0.6)) ## Correlogram corrplot::corrplot( cor( comp.rgcca, data.frame(comp.mofa, comp.jive %>% select(-contains("DNAm")) ) ) ) ``` Then, extract the loadings (aka weights) that are used by each method to compute the (joint) first component of mRNA and use a Venn diagram to compare the first 5%. ```{r ggvenn} ensemlb <- colnames(blocks$mRNA) wrgcca_mRNA <- res.rgcca.final$a$mRNA[, 1] wmofa_mRNA <- get_weights(MOFAobject)$mrna[, 1] wjive_mRNA <- pred.results$joint.load[1:2000, 1] five_percent_list <- list( RGCCA = ensemlb[order(-wrgcca_mRNA**2)][1:100], MOFA = ensemlb[order(-wmofa_mRNA**2)][1:100], JIVE = ensemlb[order(-wjive_mRNA**2)][1:100]) ggvenn::ggvenn(five_percent_list) ```
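For a single overall picture, a clustered heatmap of all pairwise correlations between the extracted components can complement the correlogram. Here is a minimal sketch (not evaluated), reusing `comp.all` from the chunk above and the `pheatmap` package loaded at the beginning of the tutorial.

```{r comparison_heatmap, eval = FALSE}
# Sketch only: pairwise correlations between all RGCCA / MOFA / JIVE
# components, clustered so that similar dimensions end up next to each other.
pheatmap(cor(comp.all),
         breaks = seq(-1, 1, length.out = 101),
         main = "Correlations between RGCCA / MOFA / JIVE components")
```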