1 Overview of the Tutorial

General Statement:

In this tutorial, Principal Component Analysis (PCA), along with two multi-omics techniques : Regularized Generalized Canonical Correlation Analysis (RGCCA) and Multi-Omics Factor Analysis (MOFA), are going to be explored. These methods have in common that they all belong to the field of multivariate statistics, but also that, in different ways, they can be seen as generalizations of PCA to more than one omic/matrix/block/table. Indeed, similarly to PCA, they all aim at estimating a reduced/latent space that describes well the different omics. Some part of the literature refer to these methods as Joint Dimension Reduction (JDR) methods because their goal is to estimate this reduced space such that it is common to all the omic data used in the analysis (this is less true for RGCCA). Actually, applying PCA separatly on different omic data in not considered as a JDR method, but applying PCA on the concatenation of multiple omic data acquired on the same set of individuals is.


Bellow is figure summarizing the key concept behind JDR methods. This figure was taken from an interesting benchmark that compares, on different multi-omic data-sets, multiple JDR methods (among which RGCCA and MOFA).

Summarizing Joint Dimension Reduction (JDR) pinciple. A.: Multiomics are profiled from the same sample. Each omics corresponds to a different matrix Xi. jDR methods factorize the matrices $X^i$ into the product of a factor matrix $F$ and weight matrices $A^i$. These matrices can then be used to cluster samples and identify molecular processes. Figure and caption were taken from: Cantini, L., Zakeri, P., Hernandez, C. et al. [Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer](https://doi.org/10.1038/s41467-020-20430-7). Nat Commun 12, 124 (2021).

Figure 1.1: Summarizing Joint Dimension Reduction (JDR) pinciple. A.: Multiomics are profiled from the same sample. Each omics corresponds to a different matrix Xi. jDR methods factorize the matrices \(X^i\) into the product of a factor matrix \(F\) and weight matrices \(A^i\). These matrices can then be used to cluster samples and identify molecular processes. Figure and caption were taken from: Cantini, L., Zakeri, P., Hernandez, C. et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat Commun 12, 124 (2021).

The tutorial is organized as follow:

  • First a brief presentation of Multi-Omics Factor Analysis (MOFA).
  • In section 3, we present the biological context data-set used as example in this tutorial.
  • In section 5, you will apply MOFA on the data-set.

2 Libraries

2.1 Load libraries

DIY: Load required libraries : ….

library("FactoMineR") #for PCA
library("factoextra") #for fviz_pca_ind
library("MOFA2")      #for MOFA
library("DT")
library("rmarkdown")
library("knitr")
library("rmdformats")
library("bookdown")


3 TCGA breast cancer dataset

This dataset includes measurements of mRNA, miRNA, and protein abundances taken from samples representing different molecular subtypes of breast cancer.

This dataset was obtained from Mibiomics

Download the dataset from the IFB/SIB summer school’s GitHub repository


For this dataset, we will use 3 modalities:

  • mrna.csv, contains mRNA expression data
  • mirna.csv, contains miRNA expression data
  • protein.csv, contains protein abundance data

We will also use the file clinical.csv, which contains samples metadata, including molecular subtypes.

Again, in the upcoming sections, our primary focus will be on TCGA dataset. However, in the rest of the week, the procedure we will discuss can also be applied to other datasets.


TCGA.

Our objective is to build a multi-omic molecular signature discriminating TCGA patients


4 Input data

4.1 In summary

For MOFA, data must be a list of matrices with specific shapes:

  • variables (e.g. genes, proteins …) are displayed on the rows
  • samples (e.g. patients, organisms …) are displayed on the columns

4.2 Load your dataset

Load all the modalities of the data-set into R, except for the meta/clinical data. You will probably need to read a csv/tsv file: look at the documentation for the function read.table:

?read.table

If your data is not stored into a csv/tsv file and you don’t know how to load it, don’t hesitate to ask us.

DIY: Load all the modalities, except for the meta/clinical data, as a list of matrices.

data_path  = "../data/TCGA/"
modalities    = c("mrna", "mirna", "protein")
data_files    = paste0(data_path, modalities,  ".csv")
blocks        = lapply(data_files, function(x) t(read.table(x, sep = ",", header = T, row.names = 1)))
names(blocks) = modalities


DIY: Load separately the meta/clinical data.

clinical_data = read.table(paste0(data_path,"clinical.csv"), sep = ",", header= T, row.names = 1)

# We rename the variables as they are too long
colnames(clinical_data) = c("age_at_diagnosis", "neoplasm_subdivision", "gender", "vital_status", "subtype")


5 Multi-Omics Factor Analysis (MOFA)

Multi-Omics Factor Analysis (MOFA) is a multi-omics data integration approach giving a computational framework for unsupervised discovery of the principal axes of biological and technical variation when multiple omics assays are applied to the same samples. Its extension, MOFA+, is designed for integration of single-cell multi-modal data.

MOFA: model overview and downstream analyses Argelaguet et al., Mol Syst Biol, 2018.

MOFA decomposes each omics into the product of a factor matrix and omics-specific weight matrices.

If you are interested going further on the subject, here are an non exhaustive list of publications that might interest you:

We will perform the MOFA analysis and interpret the results using an object containing various data modalities.

5.1 First steps

5.1.1 Vocabulary

In this MOFA section, we will use the following vocabulary :

  • view or modality : omic type that will compose a block of data in the analysis (eg transcriptomic, proteomic, genomic…)
  • feature : variable of a specific view (eg transcipt, protein, locus…)
  • factor values : compose the Z matrix (similar to components/scores in PCA)
  • weight values : compose the W matrix (similar to wieghts/loadings in PCA)

5.1.2 Input data

You should use a multiview matrix that contains all data cleaned and normalized. Each view of the matrix is a block of data corresponding to a modality, with features as lines and samples as columns. No concordance between lines across modalities is required, but columns should correspond to the same sample across modalities (you can still have some missing data).

MOFA accept several formats for this multiview matrix: data.frame, list of matrices, multiAssayExperiments…

DIY: write the code to create such a multiview matrix for your data.

#we just have to transpose the corresponding blocks 
multiview_mat=list(mrna=t(blocks$mrna), mirna=t(blocks$mirna), protein=t(blocks$protein))


5.2 Prepare MOFA

Before running MOFA, you should first create a MOFA object including the different data modalities and then specify options.

5.2.1 Create MOFA object

You can create a MOFA object from an artificially generated dataset with make_example_data function or using a real dataset as the multiview matrix you just created. Each sub-matrix contains one modality with sample as columns.

DIY: use the create_mofa function to create a MOFAobject containing input data, then use the plot_data_overview function over this object to check available information for each modality.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                          Create the MOFA object                            ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

MOFAobject <- create_mofa(multiview_mat)

#Plot overview of the input data, including the number of samples, views, features, and the missing assays.
plot_data_overview(MOFAobject) # everything matches here, but we could have kept non matching samples with MOFA

NB: by default no group is defined among samples, however you can specify some groups if necessary using “groups” option in the create_mofa function. Group information is useful to analyse single cell dataset by specifying groups of sample (ie cell) coming from the same biological sample or to take into account batch or experimental conditions.


5.2.2 MOFA options

We have to define MOFA options before running it, they are grouped in 3 categories :

  • Data options
  • Model options
  • Training options

You will have to create a list for each of these option categories.

5.2.2.1 Data options

Main data options are:

  • scale_views: logical indicating whether to scale views to have the same unit variance. As long as the scale differences between the views is not too high, this is not required. Default is FALSE.
  • scale_groups: logical indicating whether to scale groups to have the same unit variance. As long as the scale differences between the groups is not too high, this is not required. Default is FALSE.
  • center_groups: logical indicating whether to center groups to have the same mean. As long as the mean differences between the groups is not too high, this is not required. Default is TRUE.

Beware, scale_views and center_groups are only used for Gaussian likelihoods, see code.

You can also change the name of views and groups through the ‘data_options’ field of the MOFA object.

DIY: retrieve the list of default data options associated to your MOFAobject in a dedicated variable (eg named data_options) with the get_default_data_options function and modify any fields of this list if necessary.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                              MOFA data options                             ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

data_options <- get_default_data_options(MOFAobject)
data_options$scale_views <- TRUE
data_options
## $scale_views
## [1] TRUE
## 
## $scale_groups
## [1] FALSE
## 
## $center_groups
## [1] TRUE
## 
## $use_float32
## [1] TRUE
## 
## $views
## [1] "mrna"    "mirna"   "protein"
## 
## $groups
## [1] "group1"


5.2.2.2 Model options

Main model options are:

  • likelihoods: data nature per view (possible options are “gaussian”, “poisson”, “bernoulli”). ‘gaussian’ is for continuous data, ‘bernoulli’ for binary data and ‘poisson’ for discrete data. By default they are inferred automatically.
  • num_factors: number of factors to use for the factorization

You can also specified through ‘model_options’ the use of sparsity priors on the weight W, and which type of prior to use (none, Spike-and-slab, Automatic Relevance Determination (ARD) or both). In the original MOFA approach, sparsity is only applied on the W matrix (both with ARD and spike-and-slab). In MOFA+ sparsity prior is also applied to the factor matrix Z when considering several groups.

Corresponding options are : spikeslab_factors, spikeslab_weights, ard_factors and ard_weights.

DIY: retrieve the list of default model options associated to your MOFAobject in a dedicated variable (eg named model_options) with the get_default_model_options function and check first the components of these list and modify if necessary the sparsity levels to consider in your model. Likelihoods and number of factors will be modified in the next steps.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                              MOFA model options                            ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

model_options <- get_default_model_options(MOFAobject)
model_options$spikeslab_weights <- TRUE
model_options
## $likelihoods
##       mrna      mirna    protein 
## "gaussian" "gaussian" "gaussian" 
## 
## $num_factors
## [1] 15
## 
## $spikeslab_factors
## [1] FALSE
## 
## $spikeslab_weights
## [1] TRUE
## 
## $ard_factors
## [1] FALSE
## 
## $ard_weights
## [1] TRUE


Likelihoods to use.

The model flexibly handles different data distribution, it is important to choose the likelihoods corresponding to the data (you can visually check data distribution for that). You can also let MOFA deciding for you, but it’s better if you don’t. Here is the function used by MOFA to automatically decipher likelihoods.

DIY: check visually the distribution of each data modality using basic function such as hist.

Visual inspection of mRNA data

## mRNA histogram
###################################

hist(MOFAobject@data$mrna$group1 , breaks = 100, main = "Histogram of mRNA count values stored in MOFA object")

Visual inspection of miRNA data

## miRNA histogram
###################################

hist(MOFAobject@data$mirna$group1 , breaks = 100, main = "Histogram of miRNA values stored in MOFA object")

Visual inspection of miRNA data ==> Do not forget that we are going to work with centered data

## miRNA histogram
###################################

hist(scale(t(MOFAobject@data$mirna$group1), center = T, scale = F) , breaks = 100, main = "Histogram of centered miRNA values stored in MOFA object")

Visual inspection of protein data

## protein histogram
###################################

hist(MOFAobject@data$protein$group1 , breaks = 100, main = "Histogram of protein values stored in MOFA object")

Open question: Should we model all these data modalities with a gaussian distribution?

DIY: if you decide it, change the corresponding likelihood for a given modality in the model_options variable.

To change this option, you can use model_options$likelihoods <- “gaussian” or “poisson” or “bernoulli”

## as an example
#model_options$likelihoods["mrna"] <- "gaussian" or "poisson" or "bernoulli"

You can also transform your data to get a distribution closer than the expected distribution (for gaussian likelihood), it’s better to anticipate this step before creating your MOFAobject, or you can modify the corresponding fields in MOFAobject@data.

## as an example
#MOFAobject@data$mirna$group1<-log(MOFAobject@data$mirna$group1)
#hist(MOFAobject@data$mirna$group1 , breaks = 100, main = "Histogram of log miRNA count values stored in MOFA object")


Number of factors

Defining the number of factors to extract is not an easy task. One can think about a Cross-validation procedure, i.e. setting this number to maximize a certain prediction performance. However, when you are doing unsupervised analysis this is no longer possible. Stability of the results might be a solution. You might change the initial condition several times and evaluate which factors are robust and which are not. We will try something is that direction later. Or you can rely on unsupervised analysis of each block and make assumptions on what will happen on your cohort. In that sense, Principal Component Analysis (PCA) is of great help.


DIY: Run a PCA with the PCA function of the FactoMineR package. You will have to set:

  • ncp: the number of components.
  • scale.unit: a boolean specifying if you want or not to scale (to unit variance) all your variables separately to make them comparable. You may skip this step if you previously performed a more expert normalization. Actually, scaling here could biais the extrapolation of results from PCA to MOFA as MOFA does not allow to perform such scaling. Though, MOFA guidelines suggested that for For count-based data such as RNA-seq or ATAC-seq we recommend size factor normalisation + variance stabilisation. In order to be coherent between PCA and MOFA normalisation steps and making the hypothesis that an operation allowing to harmonize variance across features (as genes) has already been performed, this will be put to FALSE.
  • graph: to avoid unwanted graphical output.


Important Reminder: Centering all your variables when performing PCA is mandatory, otherwize you are not extracting the components with the highest variance anymore.

mrna_pca_results <- PCA(t(multiview_mat$mrna), scale.unit = FALSE, ncp = 25, graph = F) 


Important Reminder: Looking at mrna_pca_results$eig, what is the number of components from which subsequent components represent less than 2% of the total variance ? How much of the total vairance do they represent ?

nb_mrna_comp              = max(which(mrna_pca_results$eig[, "percentage of variance"]>2))
mrna_ratio_total_variance = mrna_pca_results$eig[nb_mrna_comp, "cumulative percentage of variance"]

print(paste("Number of PCA components from which subsequent components represent less than 2% of the total variance:", nb_mrna_comp))
## [1] "Number of PCA components from which subsequent components represent less than 2% of the total variance: 7"
print(paste("This represent the following pourcentage of the total variance:", mrna_ratio_total_variance))
## [1] "This represent the following pourcentage of the total variance: 46.369948511909"


Important Reminder: Same question for the 2 other blocks

#miRNA
mirna_pca_results <- PCA(t(multiview_mat$mirna), scale.unit = FALSE, ncp = 25, graph = F) 

nb_mirna_comp              = max(which(mirna_pca_results$eig[, "percentage of variance"]>2))
mirna_ratio_total_variance = mirna_pca_results$eig[nb_mirna_comp, "cumulative percentage of variance"]

print(paste("Number of PCA components from which subsequent components represent less than 2% of the total variance:", nb_mirna_comp))
## [1] "Number of PCA components from which subsequent components represent less than 2% of the total variance: 11"
print(paste("This represent the following pourcentage of the total variance:", mirna_ratio_total_variance))
## [1] "This represent the following pourcentage of the total variance: 62.8754841156554"
#Protein
protein_pca_results <- PCA(t(multiview_mat$protein), scale.unit = FALSE, ncp = 25, graph = F) 

nb_protein_comp              = max(which(protein_pca_results$eig[, "percentage of variance"]>2))
protein_ratio_total_variance = protein_pca_results$eig[nb_mirna_comp, "cumulative percentage of variance"]

print(paste("Number of PCA components from which subsequent components represent less than 2% of the total variance:", nb_protein_comp))
## [1] "Number of PCA components from which subsequent components represent less than 2% of the total variance: 10"
print(paste("This represent the following pourcentage of the total variance:", protein_ratio_total_variance))
## [1] "This represent the following pourcentage of the total variance: 70.004723510038"


In the worst case scenario where no commonalities are found across omics, if we decide to extract 7+11+10 and we decide to drop every component explaining less than 2%, we get a result close to PCA, representing at least 46% of the total variance of each bloc. We hope factors will be dropped due to commonalities found accross omics.

DIY: set the number of factors in the corresponding field in your model_options variable.

## Model options
###################################

model_options$num_factors <- 28  


5.2.2.3 Training options

Main training options are:

  • maxiter: maximum number of iterations (default is 1000).

  • convergence_mode: “fast” (\(5\times10^{-4}\)), “medium” (\(5\times10^{-5}\)), “slow” (\(5\times10^{-6}\)). For exploration, the fast mode is good enough.

  • startELBO: initial iteration to compute the ELBO (the objective function used to assess convergence)

  • freqELBO: frequency of computations of the ELBO (the objective function used to assess convergence)

  • drop_factor_threshold: minimum fraction of variance explained criteria to drop factors while training (eg 0.01 for 1%). Default is -1 (no dropping of factors). One can change it to discard factors that explain a low percentage of variance in all views. Explained variance per factor and view is computed as follows:

\[R(\mathbf{z}_k, RNA) = 1 - \frac{\sum_j^{p_{RNA}}\sum_i^{n_{samples}} \left(\hat{y}_{i,j}^{rna} - z_{ik}\times w_{kj}^{RNA}\right)^2}{\sum_j^{p_{RNA}}\sum_i^{n_{samples}} \hat{y}_{i,j}^{rna^2}}\] where \(\mathbf{z}_k\) is the \(k^{th}\) factor, \(z_{ik}\) it \(i^{th}\) element, \(w_{kj}^{RNA}\) the weight associated with factor \(k\), feature \(j\) on the \(RNA\) view. \(\hat{y}_{i,j}^{RNA}\) is the pseudo-data associated with inividual \(j\) on feature \(j\) of the RNA view. If RNA’s likelihood is Gaussian, pseudo-data egals original data, otherwise, it undergoes a dynamics transformation (all along the algorithm’s iterations) to make it look like Gaussian data. Total explained variance for view RNA is computed as:

\[R(\mathbf{Z}, RNA) = 1 - \frac{\sum_j^{p_{RNA}}\sum_i^{n_{samples}} \left(\hat{y}_{i,j}^{rna} - \sum_k^Kz_{ik}\times w_{kj}^{RNA}\right)^2}{\sum_j^{p_{RNA}}\sum_i^{n_{samples}} \hat{y}_{i,j}^{rna^2}}\]

  • stochastic: logical indicating whether to use stochastic variational inference (only required for very big data sets, default is FALSE).

  • seed : numeric indicating the seed for reproducibility (default is 42).

DIY: retrieve the list of default training options associated to your MOFAobject in a dedicated variable (eg named train_options) with the get_default_training_options function and modify this list if necessary. As mentioned earlier, let us drop factors if they represent less than 2% of the total variance accross all views.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                          MOFA training options                             ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

train_options <- get_default_training_options(MOFAobject)
train_options$convergence_mode <- "fast"
train_options$freqELBO <- 5
train_options$startELBO <- 1
train_options$seed <- 53
train_options$drop_factor_threshold <- 0.02

train_options
## $maxiter
## [1] 1000
## 
## $convergence_mode
## [1] "fast"
## 
## $drop_factor_threshold
## [1] 0.02
## 
## $verbose
## [1] FALSE
## 
## $startELBO
## [1] 1
## 
## $freqELBO
## [1] 5
## 
## $stochastic
## [1] FALSE
## 
## $gpu_mode
## [1] FALSE
## 
## $gpu_device
## NULL
## 
## $seed
## [1] 53
## 
## $outfile
## NULL
## 
## $weight_views
## [1] FALSE
## 
## $save_interrupted
## [1] FALSE


5.3 Run MOFA

5.3.1 Prepare the object and run MOFA

DIY: use the prepare_mofa function to wrap all the previous variables including MOFAobject, data_options, model_options and train_options. Then run MOFA using the run_mofa function on the MOFAobject.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                          Prepare the MOFA object                           ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

prepare_MOFAobject <- prepare_mofa(MOFAobject,
                                   data_options     = data_options,
                                   model_options    = model_options,
                                   training_options = train_options
)
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                                  Run MOFA                                  ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

MOFAobject <- run_mofa(prepare_MOFAobject, "MOFA_models/MOFA2_TCGA.hdf5", use_basilisk = TRUE)
## 
##         #########################################################
##         ###           __  __  ____  ______                    ### 
##         ###          |  \/  |/ __ \|  ____/\    _             ### 
##         ###          | \  / | |  | | |__ /  \ _| |_           ### 
##         ###          | |\/| | |  | |  __/ /\ \_   _|          ###
##         ###          | |  | | |__| | | / ____ \|_|            ###
##         ###          |_|  |_|\____/|_|/_/    \_\              ###
##         ###                                                   ### 
##         ######################################################### 
##        
##  
##         
## use_float32 set to True: replacing float64 arrays by float32 arrays to speed up computations...
## 
## Scaling views to unit variance...
## 
## Successfully loaded view='mrna' group='group1' with N=379 samples and D=2000 features...
## Successfully loaded view='mirna' group='group1' with N=379 samples and D=184 features...
## Successfully loaded view='protein' group='group1' with N=379 samples and D=142 features...
## 
## 
## Model options:
## - Automatic Relevance Determination prior on the factors: False
## - Automatic Relevance Determination prior on the weights: True
## - Spike-and-slab prior on the factors: False
## - Spike-and-slab prior on the weights: True
## Likelihoods:
## - View 0 (mrna): gaussian
## - View 1 (mirna): gaussian
## - View 2 (protein): gaussian
## 
## 
## 
## 
## ######################################
## ## Training the model with seed 53 ##
## ######################################
## 
## 
## ELBO before training: -15493277.37 
## 
## Iteration 1: time=0.22, ELBO=-1152093.06, deltaELBO=14341184.312 (92.56391641%), Factors=27
## Iteration 2: time=0.24, Factors=26
## Iteration 3: time=0.21, Factors=25
## Iteration 4: time=0.20, Factors=24
## Iteration 5: time=0.19, Factors=23
## Iteration 6: time=0.18, ELBO=-954270.09, deltaELBO=197822.970 (1.27683101%), Factors=22
## Iteration 7: time=0.17, Factors=21
## Iteration 8: time=0.19, Factors=20
## Iteration 9: time=0.17, Factors=19
## Iteration 10: time=0.16, Factors=18
## Iteration 11: time=0.16, ELBO=-944857.04, deltaELBO=9413.050 (0.06075570%), Factors=17
## Iteration 12: time=0.14, Factors=16
## Iteration 13: time=0.13, Factors=15
## Iteration 14: time=0.13, Factors=14
## Iteration 15: time=0.15, Factors=13
## Iteration 16: time=0.14, ELBO=-963820.35, deltaELBO=-18963.312 (0.12239703%), Factors=12
## Warning, lower bound is decreasing...
## Iteration 17: time=0.10, Factors=11
## Iteration 18: time=0.10, Factors=10
## Iteration 19: time=0.09, Factors=9
## Iteration 20: time=0.08, Factors=8
## Iteration 21: time=0.08, ELBO=-988307.10, deltaELBO=-24486.747 (0.15804756%), Factors=7
## Warning, lower bound is decreasing...
## Iteration 22: time=0.07, Factors=7
## Iteration 23: time=0.07, Factors=7
## Iteration 24: time=0.07, Factors=7
## Iteration 25: time=0.08, Factors=7
## Iteration 26: time=0.07, ELBO=-983164.78, deltaELBO=5142.318 (0.03319064%), Factors=7
## Iteration 27: time=0.07, Factors=7
## Iteration 28: time=0.08, Factors=7
## Iteration 29: time=0.08, Factors=7
## Iteration 30: time=0.08, Factors=7
## Iteration 31: time=0.08, ELBO=-982171.89, deltaELBO=992.885 (0.00640849%), Factors=7
## Iteration 32: time=0.06, Factors=7
## Iteration 33: time=0.08, Factors=7
## Iteration 34: time=0.07, Factors=7
## Iteration 35: time=0.07, Factors=7
## Iteration 36: time=0.07, ELBO=-981954.93, deltaELBO=216.963 (0.00140037%), Factors=7
## Iteration 37: time=0.06, Factors=7
## Iteration 38: time=0.07, Factors=7
## Iteration 39: time=0.06, Factors=7
## Iteration 40: time=0.07, Factors=7
## Iteration 41: time=0.07, ELBO=-981804.08, deltaELBO=150.848 (0.00097364%), Factors=7
## Iteration 42: time=0.08, Factors=7
## Iteration 43: time=0.07, Factors=7
## Iteration 44: time=0.07, Factors=7
## Iteration 45: time=0.09, Factors=7
## Iteration 46: time=0.08, ELBO=-981677.32, deltaELBO=126.765 (0.00081819%), Factors=7
## Iteration 47: time=0.07, Factors=7
## Iteration 48: time=0.07, Factors=7
## Iteration 49: time=0.08, Factors=7
## Iteration 50: time=0.07, Factors=7
## Iteration 51: time=0.08, ELBO=-974034.42, deltaELBO=7642.900 (0.04933043%), Factors=7
## Iteration 52: time=0.08, Factors=7
## Iteration 53: time=0.07, Factors=7
## Iteration 54: time=0.07, Factors=7
## Iteration 55: time=0.07, Factors=7
## Iteration 56: time=0.08, ELBO=-973408.98, deltaELBO=625.436 (0.00403682%), Factors=7
## Iteration 57: time=0.07, Factors=7
## Iteration 58: time=0.07, Factors=7
## Iteration 59: time=0.07, Factors=7
## Iteration 60: time=0.07, Factors=7
## Iteration 61: time=0.07, ELBO=-973294.59, deltaELBO=114.391 (0.00073833%), Factors=7
## Iteration 62: time=0.08, Factors=7
## Iteration 63: time=0.07, Factors=7
## Iteration 64: time=0.07, Factors=7
## Iteration 65: time=0.07, Factors=7
## Iteration 66: time=0.07, ELBO=-973219.81, deltaELBO=74.780 (0.00048266%), Factors=7
## Iteration 67: time=0.07, Factors=7
## Iteration 68: time=0.07, Factors=7
## Iteration 69: time=0.07, Factors=7
## Iteration 70: time=0.07, Factors=7
## Iteration 71: time=0.07, ELBO=-973157.72, deltaELBO=62.085 (0.00040072%), Factors=7
## 
## Converged!
## 
## 
## 
## #######################
## ## Training finished ##
## #######################
## 
## 
## Warning: Output file MOFA_models/MOFA2_TCGA.hdf5 already exists, it will be replaced
## Saving model in MOFA_models/MOFA2_TCGA.hdf5...

NB: You can also load a pre-computed MOFA model with the load_Model function from an hdf5 file.


5.4 Results interpretation

5.4.1 Get familiar with the MOFA object

5.4.1.1 Slots of MOFA object

Resulting MOFAobject is a S4 class used to store all relevant data to analyse a MOFA model.

DIY: display available attributes using the slotNames function.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                            MOFA object attributes                          ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

slotNames(MOFAobject)
##  [1] "data"               "covariates"         "covariates_warped" 
##  [4] "intercepts"         "imputed_data"       "interpolated_Z"    
##  [7] "samples_metadata"   "features_metadata"  "expectations"      
## [10] "training_stats"     "data_options"       "model_options"     
## [13] "training_options"   "stochastic_options" "mefisto_options"   
## [16] "dimensions"         "on_disk"            "dim_red"           
## [19] "cache"              "status"

Object attributes accessible using ‘@’ :

  • data : input data (either a list of matrices or a MultiAssayExperiment)
  • intercepts : feature intercepts
  • imputed_data: imputed data (filled by running impute on the object, which is not necessary)
  • samples_ and features_metadata (basically groups)
  • expectations: expectations of the factors and weights ($Z and $W matrices)
  • training_stats: statistics about iterative training process (time, elbo..)
  • data_options: data processing options (given as parameters)
  • model_options: model options (given as parameters)
  • training_options: training options (given as parameters)
  • stochastic_options: training options (given as parameters is stochastic option is TRUE in training_options)
  • dimensions : dimensions for the number of views (M), groups (G), samples (N), features per view (D) and factors (K)
  • on_disk : logical indicating whether data is loaded from disk.
  • dim_red : non-linear dimensionality reduction manifolds.
  • cache : variance explained by factors.
  • status : auxiliary variable indicating whether the model has been trained.
  • mefisto_options, covariates, covariates_warped and interpolated_Z are optional slot to store sample covariate only for training in MEFISTO


5.4.1.2 Functions to get infos from MOFA object

Some interesting functions to access these attributes:

  • factors_names(): to get or set factor names
  • features_names(): to get or set feature names
  • samples_names(): to get or set sample names
  • views_names(): to get or set view names
  • get_dimensions(): to get dimensions (number of samples, features, etc.)
  • get_factors(): to get model factors
  • get_weights(): to get model weights

DIY: try some of these functions on MOFAobject.

## Function tests
###################################
views_names(MOFAobject)
## [1] "mrna"    "mirna"   "protein"
get_dimensions(MOFAobject)
## $M
## [1] 3
## 
## $G
## [1] 1
## 
## $N
## group1 
##    379 
## 
## $D
##    mrna   mirna protein 
##    2000     184     142 
## 
## $K
## [1] 7


5.4.1.3 Add sample metadata to the model

DIY: use the samples_metadata function to access and rewrite (or add) metadata on the MOFAobject.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                                 Metadata                                   ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

## Add sample metadata to the model
temp_metadata <- cbind(rownames(clinical_data),clinical_data)
colnames(temp_metadata)[1]="sample" #one metadata column should contains IDs of samples matching with input data and labelled as "sample"
samples_metadata(MOFAobject) <- temp_metadata
rm(temp_metadata)


5.4.2 Results

To interpret these results it is important to have information about the samples [Biological context (TCGA)].

5.4.2.1 General factors plots

First, we’ll analyze all the factors at a high level to get an overview of the main trends in decomposition.

5.4.2.1.1 Factors correlation plot

DIY: use the plot_factor_cor to display correlations between factors.

## Factor correlation plot
###################################

plot_factor_cor(MOFAobject)

Normally, factors should not be too correlated. If they are, you’ve probably chosen too many factors, creating redundancy between some of them.


5.4.2.1.2 Variance decomposition

Representing percentage of variance explained by each factor in each view is a key plot of MOFA and should always be done before inspecting factor values or weights.

DIY: use the plot_variance_explained to display variance explained by each factor for each view.

## Variance explained values
###################################

plot_variance_explained(MOFAobject, max_r2=20)

#plot_variance_explained(MOFAobject, plot_total = T)[[1]]

DIY: you can get the numerical values from the cache attribute of the MOFAobject, in the variance_explained field.

## Variance explained values
###################################

MOFAobject@cache$variance_explained$r2_per_factor
## $group1
##              mrna      mirna    protein
## Factor1 15.730399 12.3123467 16.6931450
## Factor2 11.788946  5.5113316  3.1709313
## Factor3  5.845827  3.4777701  1.1063159
## Factor4  3.380114  2.7437747  2.0575702
## Factor5  2.006245  1.9598901  2.7783573
## Factor6  4.302490  0.8132339  0.3555894
## Factor7  2.556664  1.6396344  1.1929691

As opposed to PCA, a factor does not necessarily explain less variance than the previous one.

We can ask whether the model provides a good fit to the data. To check it we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc….

We can identify some rules about this amount:

  • Noisy datasets will result in a lower amounts of variance explained.
  • Higher the number of factors –> higher the total variance explained.

The objective is not to reach 100%, even with a large number of factors. Explaining ~40% for a view thanks to a linear model is already good.

DIY: use the plot_variance_explained to display total variance explained for each view.

## Variance explained histogram
###################################

plot_variance_explained(MOFAobject, plot_total = T)[[2]]

The latent factors inferred by MOFA represent the underlying principal sources of variation among the samples.

Some factor can capture a source of variability that is present across all data modalities. Thus, its etiology is likely to be something very important for the studied phenomenon. Whereas some factor can capture a very strong source of variation in only a specific modality


We can also access the variance explained per sample (only available with Gaussian data).

DIY: if your data are exclusively Gaussian (regarding model_options parameters) use the calculate_variance_explained_per_sample function to display variance explained per sample.

## Variance explained per sample
###################################

#calculate_variance_explained(MOFAobject)
datatable(calculate_variance_explained_per_sample(MOFAobject)$group1) #only with pure gaussian data


5.4.2.1.3 Summary plot of factors

We can look at associations (correlations) between MOFA factors and some metadata such as Gender, Age… these associations are a good starting point to interpret factors.

DIY: use the correlate_factors_with_covariates function to search such associations with available metadata.

## Factor correlation with metadata
###################################

correlate_factors_with_covariates(MOFAobject, 
  covariates = names(clinical_data), 
  plot="log_pval"
)


ATTENTION: to compute such correlations, categorical are simply converted to numeric with as.numeric().

We can also stay at the sample level and display factor values for each sample (Z matrix) crossed with metadata.

DIY: use the plot_factor function to display factors values and use metadata to color dots.

Using metadata variable significant in the previous correlation analysis for a given factor to color dots should give you a sharp separation for the same factor on this plot.

## Plot factor values
###################################

p <- plot_factor(MOFAobject, 
  factors = c(1:5),
  color_by = "subtype",
  dot_size = 3,        # change dot size
  dodge = T,           # dodge points with different colors
  legend = T,          # remove legend
#  add_violin = T,      # add violin plots (here we have too few points to make violin, let's do boxplots)
#  violin_alpha = 0.25  # transparency of violin plots
  add_boxplot = T
)

# The output of plot_factor is a ggplot2 object that we can edit
print(p)


5.4.2.1.4 Combinations of factors

We can study pairwise combination of factors values using the plot_factors (with an s) function to detect pairwise association between factors and a metadata variable (similar to PCA representation).

DIY:try plot_factors on the MOFAobject still coloring dots using metadata.

## Plot combinations of factor values
###################################

plot_factors(MOFAobject, 
  factors = 1:5,
  color_by = "subtype"
)


5.4.2.2 Explore individual factors

After this general overview about all factors, you can explore in details a specific factor of interest.

5.4.2.2.1 Useful functions

To explore loadings (W matrix)

  • plot_top_weights(): plot top weights for a given factor and view
  • plot_weights(): plot all weights for a given factor and view
  • plot_weights_heatmap(): plot a heatmap of the weights from multiple factors of a given view

To explore data

  • plot_data_heatmap(): heatmap of the training data using only top features of a given factor
  • plot_data_scatter(): scatterplot of the data using only top features of a given factor

The use of theses functions is illustrated in the next sections.

5.4.2.2.2 Table of weights/factors

The weights measure how strong each feature relates to each factor :

  • no association = 0
  • strong association = high absolute values

The sign of the weight gives the direction of the effect (eg a positive weight indicates that the feature has higher levels in the sample with positive factor values).

DIY: access the ‘W’ field of the expectation MOFAobject attribute and display weights associated to a given modality. Use the datatable function for a nicer visualization.

## Weight values
###################################

datatable(MOFAobject@expectations$W$mrna[order(MOFAobject@expectations$W$mrna[,1], decreasing = TRUE ),]) %>% formatRound(c(1:dim(MOFAobject@expectations$W$mrna)[2]),3)


## Factor to study
###################################

factorToStudy=1

In the next section we will focus on factor 1, however feel free to choose another factor.

5.4.2.2.3 Exploration of factor 1

Factor values

We can first display factor 1 values for each sample (Z matrix).

DIY: use the plot_factor function to display factor 1 values and color dots according to this value.

You can also group the sample according to qualitative metadata variable to identify biological association.

## Plot factor values
###################################

plot_factor(MOFAobject, 
  factors =  factorToStudy, 
  color_by = paste("Factor",factorToStudy,sep=""),
  group_by = "subtype"
) 


Weight of associated features

Then we can search for features associated to factor 1 in each modality to give a “signature” of the factor.

DIY: visualize weight distribution for factor 1 in the different modalities using plot_weights function.

## Plot weight values distribution
###################################

plot_weights(MOFAobject,
 view = "mrna",
 factor = factorToStudy,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)

NB: you can specify a given number of features to display in this distribution through ‘nfeatures’ option.

DIY: use the plot_top_weights function to focus on top weights (up and down) for factor 1 in each modality.

The weights measure how strong each feature relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the feature has higher levels in the sample with positive factor values).

## Plot top weight values
###################################

plot_top_weights(MOFAobject,
 view = "mrna",
 factor = factorToStudy,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)


Raw values of associated features

Instead of looking at weight values associated to each feature, we can also display raw values of features associated to factor 1.

DIY: use the plot_data_heatmap function to display raw values of the top25 features associated to factor 1 in each modality. You can also cluster features and/or sample based on these values.

## Plot feature heatmap
###################################

plot_data_heatmap(MOFAobject, 
  view = "mrna",
  factor = factorToStudy,  
  features = 25,
  cluster_rows = TRUE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  denoise=TRUE,
  scale = "row"
)


To focus on a limited number of features you can also display raw values as a scatter plot and cross information with metadata or factors.

DIY: use the plot_data_scatter function to plot raw values of top4 features having positive factor 1 values in each modality, and color samples based on metadata.

## Plot positive weight feature scatter
###################################

plot_data_scatter(MOFAobject, 
  view = "mrna",
  factor = factorToStudy,  
  features = 4,
  sign = "positive",
  color_by = "subtype",
) + labs(y="mRNA expression")



DIY: do the same for top4 features having negative factor 1 values in each modality, and color samples based on metadata.

## Plot negative weight feature scatter
###################################

plot_data_scatter(MOFAobject, 
  view = "mrna",
  factor = factorToStudy,  
  features = 4,
  sign = "negative",
  color_by = "subtype",
) + labs(y="mRNA expression")


5.4.3 To go further …

Before going any further, consider [Regularized Generalized Canonical Correlation Analysis (RGCCA)] if you haven’t tried them yet.

5.4.3.1 Play with data pre-processing and MOFA options

You can decide to modify the pre-processing of the data and/or the model/training options to check their effect on the factors.

5.4.3.2 Cluster samples

MOFA is an unsupervised approach, as in PCA you can try to cluster sample based on their projection in the MOFA factor space.

DIY: The function cluster_samples is based on k-means. Use this function to function identify clusters of samples based on their projection all infered factors. Then use the plot_factors function to display the result of this clustering by coloring samples according to clusters.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                                 Clustering                                 ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

## cluster samples based on factor projection
clusters <- cluster_samples(MOFAobject, k=4, factors = "all")

## plot sample projection on factor space with cluster information
plot_factors(MOFAobject,
  factors=1:2,
  color_by=clusters$cluster)


DIY: Try different value of k, the number of clusters to extract. With the proxy::dist function, compute the distance from each sample to the barycenter of its clusters and compute the averaged silhouette score out of each clustering. Find the number of extracted cluster leading to the best silhouette score.

if (!require("proxy")){
  install.packages("proxy")
}

if (!require("Silhouette")){
  install.packages("Silhouette")
}

res_silhouette  = c()
tested_k_values = 2:10
for (k in tested_k_values){
    ## cluster samples based on factor projection
    clusters       = cluster_samples(MOFAobject, k=k, factors = "all")
    dist_matrix    = proxy::dist(MOFAobject@expectations$Z$group1, clusters$centers)
    sil            = Silhouette(dist_matrix)
    res_silhouette = c(res_silhouette, (mean(sil$sil_width)))
}
names(res_silhouette) = tested_k_values
print(res_silhouette)
##         2         3         4         5         6         7         8         9 
## 0.3483627 0.2817135 0.3169978 0.2829063 0.2810880 0.2755839 0.2731449 0.2805127 
##        10 
## 0.2798896


DIY: Re-lauch the clustering on the optimal number of clusters and compute both a contingency table between TCGA’ subtypes and the inferred clusters and the Rand Index between the two.

if (!require("aricode")){
  install.packages("aricode")
}

clusters = cluster_samples(MOFAobject, k=tested_k_values[which.max(res_silhouette)], factors = "all")

print("Contingency Table")
## [1] "Contingency Table"
sortPairs(clinical_data$subtype, clusters$cluster, spMat = T)$spMat
## 4 x 2 sparse Matrix of class "dgCMatrix"
##        c2
## c1        1  2
##   LumA  130 58
##   Basal  40 36
##   Her2   25 13
##   LumB   28 49
print("Rand Index")
## [1] "Rand Index"
RI(clinical_data$subtype, clusters$cluster)
## [1] 0.5233907


5.4.3.3 Evaluate stability of number of extracted factors

DIY: MOFA is not randomly initialized as MOFA’s latent space \(\mathbf{Z}\) and weight matrix \(\mathbf{W}\) are initialized based on PCA. So if you launch 2 subsequent MOFA without dropping factors, results will be exactly identicals. However, when you drop factors, if at a specific iteration, several factors are to be dropped, a single one is and picked randomly. Using this variability, apply multiple MOFA models with different seeds. Each time allow factors to be dropped.

list_MOFA_models          = list()
number_of_initializations = 10

for (i in 1:number_of_initializations){
  prepare_MOFAobject@training_options$seed = 53*i
  list_MOFA_models[[i]]                    = run_mofa(object = prepare_MOFAobject, use_basilisk = TRUE) 
}


DIY: Use the function compare_factors to compare these models according to their factors.

compare_factors(list_MOFA_models)


5.4.3.4 Ontology enrichment on features

DIY: You can run gene set enrichment analysis using group of genes having a high weight for a given factor to get indications about the biological meaning of factors.

##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                                 Enrichment                                 ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

if (!require("dplyr")){
  install.packages("dplyr")
}

if (!require("fgsea")){
  install.packages("fgsea")
}

annot_table  = read.csv("../data/TCGA/annotation_table_mrna.csv", row.names = 1)
# check for the first component

res = c()
for (comp in 1:dim(MOFAobject@expectations$W$mrna)[2]){
  a1           = MOFAobject@expectations$W$mrna[, comp]
  
  ## Check order
  all(annot_table$symbols_old == gsub("^mRNA_", "", names(a1)))
  names(a1)     = annot_table$entrez
  kegg_pathways = readRDS("../data/TCGA/kegg_pathways.rds")
  
  kegg_hsa_name <- kegg_pathways %>%
    dplyr::select(pathway_id, pathway_name) %>%
    dplyr::distinct()
  
  sum(names(a1) %in% kegg_pathways$entrez_gene)
  
  fgsea_res <- fgseaMultilevel(
    pathways = split(kegg_pathways$entrez_gene, kegg_pathways$pathway_id),
    stats = a1**2,
    minSize = 15,
    maxSize = 500,
    eps = 0
  )
  
  fgsea_res <- fgsea_res |>
    left_join(kegg_hsa_name, by = c("pathway" = "pathway_id")) %>%
    relocate(pathway_name, .after = 1)
  
  res = rbind(res, fgsea_res$pval)
}

res_pval_adjusted = matrix(p.adjust(c(res), method = "BH"), nrow = dim(res)[1], ncol = dim(res)[2], byrow = F)
colnames(res_pval_adjusted) = fgsea_res$pathway_name
rownames(res_pval_adjusted) = paste0("Factor_", 1:dim(res_pval_adjusted)[1])

res_pval_adjusted[which(res_pval_adjusted > 0.05, arr.ind = T)] = NA

pheatmap::pheatmap(-log10(res_pval_adjusted), cluster_cols = F, cluster_rows = F)