General Statement:
In this tutorial, we will explore Principal Component Analysis (PCA), along with two multi-omics techniques: Regularized Generalized Canonical Correlation Analysis (RGCCA) and Multi-Omics Factor Analysis (MOFA). These methods have in common that they all belong to the field of multivariate statistics, but also that, in different ways, they can be seen as generalizations of PCA to more than one omic/matrix/block/table. Indeed, similarly to PCA, they all aim at estimating a reduced/latent space that describes the different omics well. Part of the literature refers to these methods as Joint Dimension Reduction (JDR) methods, because their goal is to estimate this reduced space such that it is common to all the omic data used in the analysis (this is less true for RGCCA). Note that applying PCA separately to different omic data is not considered a JDR method, but applying PCA to the concatenation of multiple omic datasets acquired on the same set of individuals is.
Below is a figure summarizing the key concept behind JDR methods. This figure was taken from an interesting benchmark that compares multiple JDR methods (among which RGCCA and MOFA) on different multi-omics datasets.
Figure 1.1: Summarizing the Joint Dimension Reduction (JDR) principle. A.: Multi-omics are profiled from the same sample. Each omic corresponds to a different matrix \(X^i\). jDR methods factorize the matrices \(X^i\) into the product of a factor matrix \(F\) and weight matrices \(A^i\). These matrices can then be used to cluster samples and identify molecular processes. Figure and caption taken from: Cantini, L., Zakeri, P., Hernandez, C. et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat Commun 12, 124 (2021).
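As a toy illustration of this factorization (a sketch with random matrices, not taken from the benchmark; the names X1, X2, F_mat, A1, A2 are ours): PCA on the concatenation of two blocks, computed here via SVD, yields a shared factor matrix and block-specific weight matrices.

```r
set.seed(1)
n  <- 20                                  # samples shared by both omics
X1 <- matrix(rnorm(n * 10), nrow = n)     # omic 1: 10 features
X2 <- matrix(rnorm(n * 5),  nrow = n)     # omic 2: 5 features

# PCA on the concatenated blocks is the simplest JDR method
X <- cbind(X1, X2)
s <- svd(scale(X, center = TRUE, scale = FALSE))

k     <- 3
F_mat <- s$u[, 1:k] %*% diag(s$d[1:k])    # shared factor matrix (n x k)
A1    <- s$v[1:10, 1:k]                   # weight matrix of omic 1
A2    <- s$v[11:15, 1:k]                  # weight matrix of omic 2

# Each block is approximated by the product of F and its weight matrix
dim(F_mat %*% t(A1))                      # same shape as X1
```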
The tutorial is organized as follows:
This dataset includes measurements of mRNA, miRNA, and protein abundances taken from samples representing different molecular subtypes of breast cancer.
This dataset was obtained from Mibiomics
Download the dataset from the IFB/SIB summer school’s GitHub repository
For this dataset, we will use 3 modalities:
mrna.csv, contains mRNA expression data
mirna.csv, contains miRNA expression data
protein.csv, contains protein abundance data

We will also use the file clinical.csv, which contains sample metadata, including molecular subtypes.
Again, in the upcoming sections, our primary focus will be on the TCGA dataset. However, during the rest of the week, the procedure we discuss can also be applied to other datasets.
TCGA.
Figure 3.1: The Cancer Genome Atlas (TCGA) overview
Our objective is to build a multi-omic molecular signature discriminating TCGA patients
For MOFA, data must be a list of matrices with specific shapes:
Load all the modalities of the dataset into R, except for the meta/clinical data.
You will probably need to read a csv/tsv file: look at the documentation for the function read.table:
If your data is not stored in a csv/tsv file and you don’t know how to load it, don’t hesitate to ask us.
DIY: Load all the modalities, except for the meta/clinical data, as a list of matrices.
DIY: Load separately the meta/clinical data.
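A minimal sketch of this loading step, assuming comma-separated files with features as rows and samples as columns (check sep, header and row.names against the actual files). To keep the snippet self-contained we first write toy CSVs to a temporary directory; in practice you would point it at the downloaded mrna.csv, mirna.csv and protein.csv.

```r
# Write toy stand-ins for the real files (features x samples)
dir <- tempdir()
write.csv(matrix(rnorm(6), 3, 2, dimnames = list(paste0("gene", 1:3), c("s1", "s2"))),
          file.path(dir, "mrna.csv"))
write.csv(matrix(rnorm(4), 2, 2, dimnames = list(paste0("mir", 1:2),  c("s1", "s2"))),
          file.path(dir, "mirna.csv"))

# Load every modality as a named list of matrices
modalities <- c("mrna", "mirna")
multiview_mat <- lapply(modalities, function(m) {
  as.matrix(read.csv(file.path(dir, paste0(m, ".csv")), row.names = 1))
})
names(multiview_mat) <- modalities

sapply(multiview_mat, dim)   # features x samples for each modality
```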
Multi-Omics Factor Analysis (MOFA) is a multi-omics data integration approach providing a computational framework for unsupervised discovery of the principal axes of biological and technical variation when multiple omics assays are applied to the same samples. Its extension, MOFA+, is designed for the integration of single-cell multi-modal data.
Argelaguet et al., Mol Syst Biol, 2018.
MOFA decomposes each omics into the product of a factor matrix and omics-specific weight matrices.
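In symbols, with \(M\) omics and \(K\) factors, each (centered) data matrix \(Y^m\) (samples \(\times\) features) is modelled as
\[Y^m \approx Z W^{m\top}, \qquad m = 1, \dots, M,\]
where \(Z\) (\(n \times K\)) is the factor matrix shared by all omics and \(W^m\) (\(p_m \times K\)) the weight matrix specific to omic \(m\), plus modality-specific noise.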
If you are interested in going further on the subject, here is a non-exhaustive list of publications that might interest you:
Link to MOFA article
Link to MOFA+ article
Link to MOFA tutorials
Link to MOFA FAQ
Link to Python code
If you want to deep dive into MOFA’s equations, good luck! Here are some links that can help you. First, the PhD manuscript of Ricard Argelaguet, the first author of MOFA, along with the appendix of MOFA and the appendix of MOFA+. Here is also a presentation of the Multi-Omics Transfer Learning (MOTL) paper, with a theoretical part that relies heavily on MOFA.
We will perform the MOFA analysis and interpret the results using an object containing various data modalities.
In this MOFA section, we will use the following vocabulary:
You should use a multiview matrix containing all data, cleaned and normalized. Each view of the matrix is a block of data corresponding to one modality, with features as rows and samples as columns. No concordance between rows across modalities is required, but columns should correspond to the same samples across modalities (some missing data is allowed).
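Before creating the object, it is worth verifying that requirement programmatically; a small sketch on toy views (names and shapes are made up):

```r
# Two toy views sharing the same samples as columns
multiview_mat <- list(
  mrna  = matrix(rnorm(6), 3, 2, dimnames = list(paste0("gene", 1:3), c("s1", "s2"))),
  mirna = matrix(rnorm(4), 2, 2, dimnames = list(paste0("mir", 1:2),  c("s1", "s2")))
)

# Rows (features) may differ freely, but columns (samples) must match across views
same_samples <- length(unique(lapply(multiview_mat, colnames))) == 1
same_samples   # TRUE: safe to build the object
```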
MOFA accepts several formats for this multiview matrix: data.frame, list of matrices, MultiAssayExperiment…
DIY: write the code to create such a multiview matrix for your data.
Before running MOFA, you should first create a MOFA object including the different data modalities and then specify options.
You can create a MOFA object from an artificially generated dataset with the make_example_data function, or from a real dataset such as the multiview matrix you just created. Each sub-matrix contains one modality, with samples as columns.
DIY: use the create_mofa function to create a MOFAobject containing input data, then use the plot_data_overview function over this object to check available information for each modality.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Create the MOFA object ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
MOFAobject <- create_mofa(multiview_mat)
#Plot overview of the input data, including the number of samples, views, features, and the missing assays.
plot_data_overview(MOFAobject) # everything matches here, but we could have kept non-matching samples with MOFA

NB: by default, no group is defined among samples; however, you can specify groups if necessary using the “groups” option of the create_mofa function. Group information is useful to analyse single-cell datasets, by specifying groups of samples (i.e. cells) coming from the same biological sample, or to take batches or experimental conditions into account.
We have to define the MOFA options before running it; they are grouped in 3 categories:
You will have to create a list for each of these option categories.
Main data options are:
Beware, scale_views and center_groups are only used for Gaussian likelihoods, see code.
You can also change the name of views and groups through the ‘data_options’ field of the MOFA object.
DIY: retrieve the list of default data options associated with your MOFAobject in a dedicated variable (e.g. named data_options) with the get_default_data_options function and modify any field of this list if necessary.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## MOFA data options ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
data_options <- get_default_data_options(MOFAobject)
data_options$scale_views <- TRUE
data_options
## $scale_views
## [1] TRUE
##
## $scale_groups
## [1] FALSE
##
## $center_groups
## [1] TRUE
##
## $use_float32
## [1] TRUE
##
## $views
## [1] "mrna" "mirna" "protein"
##
## $groups
## [1] "group1"
Main model options are:
You can also specify, through ‘model_options’, the use of sparsity priors on the weights W, and which type of prior to use (none, spike-and-slab, Automatic Relevance Determination (ARD), or both). In the original MOFA approach, sparsity is only applied to the W matrix (both with ARD and spike-and-slab). In MOFA+, a sparsity prior is also applied to the factor matrix Z when considering several groups.
Corresponding options are: spikeslab_factors, spikeslab_weights, ard_factors and ard_weights.
DIY: retrieve the list of default model options associated with your MOFAobject in a dedicated variable (e.g. named model_options) with the get_default_model_options function. First check the components of this list, then modify the sparsity levels to use in your model if necessary. Likelihoods and number of factors will be modified in the next steps.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## MOFA model options ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
model_options <- get_default_model_options(MOFAobject)
model_options$spikeslab_weights <- TRUE
model_options
## $likelihoods
## mrna mirna protein
## "gaussian" "gaussian" "gaussian"
##
## $num_factors
## [1] 15
##
## $spikeslab_factors
## [1] FALSE
##
## $spikeslab_weights
## [1] TRUE
##
## $ard_factors
## [1] FALSE
##
## $ard_weights
## [1] TRUE
The model flexibly handles different data distributions; it is important to choose the likelihoods corresponding to the data (you can visually check the data distributions for that). You can also let MOFA decide for you, but it is better if you don’t. Here is the function used by MOFA to automatically determine likelihoods.
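The rules it applies boil down to a simple inspection of the values; a simplified sketch of that logic (our own re-implementation for illustration, not MOFA’s actual code):

```r
# Toy heuristic: binary -> bernoulli, non-negative integers -> poisson, else gaussian
guess_likelihood <- function(x) {
  x <- x[!is.na(x)]
  if (all(x %in% c(0, 1)))              "bernoulli"
  else if (all(x >= 0 & x == round(x))) "poisson"
  else                                  "gaussian"
}

guess_likelihood(c(0, 1, 1, 0))       # "bernoulli"
guess_likelihood(c(0, 3, 10, 2))      # "poisson"
guess_likelihood(c(-1.5, 0.2, 3.7))   # "gaussian"
```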
DIY: check visually the distribution of each data modality using basic functions such as hist.
Visual inspection of mRNA data
## mRNA histogram
###################################
hist(MOFAobject@data$mrna$group1, breaks = 100, main = "Histogram of mRNA count values stored in MOFA object")

Visual inspection of miRNA data
## miRNA histogram
###################################
hist(MOFAobject@data$mirna$group1 , breaks = 100, main = "Histogram of miRNA values stored in MOFA object")
Visual inspection of miRNA data ==> Do not forget that we are going to work with centered data
## miRNA histogram
###################################
hist(scale(t(MOFAobject@data$mirna$group1), center = T, scale = F), breaks = 100, main = "Histogram of centered miRNA values stored in MOFA object")

Visual inspection of protein data
## protein histogram
###################################
hist(MOFAobject@data$protein$group1, breaks = 100, main = "Histogram of protein values stored in MOFA object")

Open question: should we model all these data modalities with a Gaussian distribution?
DIY: if you decide to, change the corresponding likelihood for a given modality in the model_options variable.
To change this option, you can use model_options$likelihoods <- "gaussian", "poisson" or "bernoulli"
You can also transform your data to get a distribution closer to the expected one (for a Gaussian likelihood). It is better to anticipate this step before creating your MOFAobject, but you can also modify the corresponding fields in MOFAobject@data.
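As an example of such a transform (a generic sketch, not specific to this dataset): log-transforming right-skewed positive data often brings it much closer to Gaussian.

```r
set.seed(4)
raw    <- exp(rnorm(1000, mean = 2, sd = 0.5))  # toy right-skewed abundances
logged <- log(raw)                              # exactly Gaussian here by construction

# Simple moment-based skewness
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skew(raw), logged = skew(logged))       # skewness shrinks toward 0 after logging
```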
Number of factors
Defining the number of factors to extract is not an easy task. One can think of a cross-validation procedure, i.e. setting this number to maximize a certain prediction performance. However, in an unsupervised analysis this is no longer possible. Stability of the results might be a solution: change the initial conditions several times and evaluate which factors are robust and which are not. We will try something in that direction later. Or you can rely on an unsupervised analysis of each block and make assumptions on what will happen for your cohort. In that sense, Principal Component Analysis (PCA) is of great help.
DIY: Run a PCA with the PCA function of the FactoMineR package. You will have to set:
ncp: the number of components.
scale.unit: a boolean specifying whether or not to scale (to unit variance) all your variables separately to make them comparable. You may skip this step if you previously performed a more expert normalization. Actually, scaling here could bias the extrapolation of results from PCA to MOFA, as MOFA does not allow performing such scaling. However, the MOFA guidelines suggest that "for count-based data such as RNA-seq or ATAC-seq we recommend size factor normalisation + variance stabilisation". In order to stay coherent between the PCA and MOFA normalisation steps, and under the hypothesis that an operation harmonizing variance across features (such as genes) has already been performed, this will be set to FALSE.
graph: to avoid unwanted graphical output.

Important Reminder: Centering all your variables when performing PCA is mandatory; otherwise you are no longer extracting the components with the highest variance.
DIY: looking at mrna_pca_results$eig, what is the number of components from which subsequent components represent less than 2% of the total variance? How much of the total variance do they represent?
#mRNA
mrna_pca_results <- PCA(t(multiview_mat$mrna), scale.unit = FALSE, ncp = 25, graph = F)
nb_mrna_comp = max(which(mrna_pca_results$eig[, "percentage of variance"]>2))
mrna_ratio_total_variance = mrna_pca_results$eig[nb_mrna_comp, "cumulative percentage of variance"]
print(paste("Number of PCA components from which subsequent components represent less than 2% of the total variance:", nb_mrna_comp))
## [1] "Number of PCA components from which subsequent components represent less than 2% of the total variance: 7"
print(paste("This represents the following percentage of the total variance:", mrna_ratio_total_variance))
## [1] "This represents the following percentage of the total variance: 46.369948511909"
DIY: same question for the two other blocks.
#miRNA
mirna_pca_results <- PCA(t(multiview_mat$mirna), scale.unit = FALSE, ncp = 25, graph = F)
nb_mirna_comp = max(which(mirna_pca_results$eig[, "percentage of variance"]>2))
mirna_ratio_total_variance = mirna_pca_results$eig[nb_mirna_comp, "cumulative percentage of variance"]
print(paste("Number of PCA components from which subsequent components represent less than 2% of the total variance:", nb_mirna_comp))
## [1] "Number of PCA components from which subsequent components represent less than 2% of the total variance: 11"
print(paste("This represents the following percentage of the total variance:", mirna_ratio_total_variance))
## [1] "This represents the following percentage of the total variance: 62.8754841156554"
#Protein
protein_pca_results <- PCA(t(multiview_mat$protein), scale.unit = FALSE, ncp = 25, graph = F)
nb_protein_comp = max(which(protein_pca_results$eig[, "percentage of variance"]>2))
protein_ratio_total_variance = protein_pca_results$eig[nb_protein_comp, "cumulative percentage of variance"]
print(paste("Number of PCA components from which subsequent components represent less than 2% of the total variance:", nb_protein_comp))
## [1] "Number of PCA components from which subsequent components represent less than 2% of the total variance: 10"
print(paste("This represents the following percentage of the total variance:", protein_ratio_total_variance))
## [1] "This represents the following percentage of the total variance: 70.004723510038"
In the worst-case scenario where no commonalities are found across omics, if we decide to extract 7+11+10 = 28 factors and to drop every component explaining less than 2%, we get a result close to PCA, representing at least 46% of the total variance of each block. We hope factors will be dropped thanks to commonalities found across omics.
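The counting rule applied above can be reproduced on any eigenvalue table; a sketch with made-up per-component percentages (the pct values are invented):

```r
# Stand-in for pca_results$eig[, "percentage of variance"]
pct <- c(20, 12, 8, 5, 3, 1.5, 1, 0.5)

n_comp  <- max(which(pct > 2))     # last component still above the 2% threshold
cum_pct <- cumsum(pct)[n_comp]     # cumulative variance retained

c(n_comp = n_comp, cum_pct = cum_pct)   # 5 components covering 48% of the variance
```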
DIY: set the number of factors in the corresponding field in your model_options variable.
Main training options are:
maxiter: maximum number of iterations (default is 1000).
convergence_mode: “fast” (\(5\times10^{-4}\)), “medium” (\(5\times10^{-5}\)), “slow” (\(5\times10^{-6}\)). For exploration, the fast mode is good enough.
startELBO: initial iteration to compute the ELBO (the objective function used to assess convergence)
freqELBO: frequency of computations of the ELBO (the objective function used to assess convergence)
drop_factor_threshold: minimum fraction of variance explained required to keep a factor during training (e.g. 0.01 for 1%). Default is -1 (no dropping of factors). One can change it to discard factors that explain a low percentage of variance in all views. Explained variance per factor and view is computed as follows:
\[R(\mathbf{z}_k, RNA) = 1 - \frac{\sum_j^{p_{RNA}}\sum_i^{n_{samples}} \left(\hat{y}_{i,j}^{RNA} - z_{ik}\times w_{kj}^{RNA}\right)^2}{\sum_j^{p_{RNA}}\sum_i^{n_{samples}} \left(\hat{y}_{i,j}^{RNA}\right)^2}\] where \(\mathbf{z}_k\) is the \(k^{th}\) factor, \(z_{ik}\) its \(i^{th}\) element, and \(w_{kj}^{RNA}\) the weight associated with factor \(k\) and feature \(j\) of the \(RNA\) view. \(\hat{y}_{i,j}^{RNA}\) is the pseudo-data associated with individual \(i\) on feature \(j\) of the RNA view. If the RNA likelihood is Gaussian, the pseudo-data equals the original data; otherwise, it undergoes a dynamic transformation (along the algorithm’s iterations) to make it look like Gaussian data. Total explained variance for the RNA view is computed as:
\[R(\mathbf{Z}, RNA) = 1 - \frac{\sum_j^{p_{RNA}}\sum_i^{n_{samples}} \left(\hat{y}_{i,j}^{RNA} - \sum_k^K z_{ik}\times w_{kj}^{RNA}\right)^2}{\sum_j^{p_{RNA}}\sum_i^{n_{samples}} \left(\hat{y}_{i,j}^{RNA}\right)^2}\]
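For a Gaussian view, where the pseudo-data equals the (centered) data, these two formulas translate directly into base R; a sketch on simulated Z, W and Y (all names are ours):

```r
set.seed(2)
n <- 50; p <- 30; K <- 4
Z <- matrix(rnorm(n * K), n, K)     # factor matrix (n x K)
W <- matrix(rnorm(p * K), p, K)     # weight matrix (p x K)
Y <- Z %*% t(W) + matrix(rnorm(n * p, sd = 0.1), n, p)  # centered Gaussian view

# Variance explained by factor k alone, as in the first formula
r2_factor <- function(k) 1 - sum((Y - Z[, k] %*% t(W[, k]))^2) / sum(Y^2)

# Total variance explained by all K factors, as in the second formula
r2_total <- 1 - sum((Y - Z %*% t(W))^2) / sum(Y^2)

round(c(sapply(1:K, r2_factor), total = r2_total), 3)
```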
stochastic: logical indicating whether to use stochastic variational inference (only required for very big data sets, default is FALSE).
seed: numeric indicating the seed for reproducibility (default is 42).
DIY: retrieve the list of default training options associated with your MOFAobject in a dedicated variable (e.g. named train_options) with the get_default_training_options function and modify this list if necessary. As mentioned earlier, let us drop factors if they represent less than 2% of the total variance across all views.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## MOFA training options ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
train_options <- get_default_training_options(MOFAobject)
train_options$convergence_mode <- "fast"
train_options$freqELBO <- 5
train_options$startELBO <- 1
train_options$seed <- 53
train_options$drop_factor_threshold <- 0.02
train_options
## $maxiter
## [1] 1000
##
## $convergence_mode
## [1] "fast"
##
## $drop_factor_threshold
## [1] 0.02
##
## $verbose
## [1] FALSE
##
## $startELBO
## [1] 1
##
## $freqELBO
## [1] 5
##
## $stochastic
## [1] FALSE
##
## $gpu_mode
## [1] FALSE
##
## $gpu_device
## NULL
##
## $seed
## [1] 53
##
## $outfile
## NULL
##
## $weight_views
## [1] FALSE
##
## $save_interrupted
## [1] FALSE
DIY: use the prepare_mofa function to wrap all the previous variables, including MOFAobject, data_options, model_options and train_options. Then run MOFA on the prepared object using the run_mofa function.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Prepare the MOFA object ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
prepare_MOFAobject <- prepare_mofa(MOFAobject,
data_options = data_options,
model_options = model_options,
training_options = train_options
)
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Run MOFA ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
MOFAobject <- run_mofa(prepare_MOFAobject, "MOFA_models/MOFA2_TCGA.hdf5", use_basilisk = TRUE)
##
## #########################################################
## ### __ __ ____ ______ ###
## ### | \/ |/ __ \| ____/\ _ ###
## ### | \ / | | | | |__ / \ _| |_ ###
## ### | |\/| | | | | __/ /\ \_ _| ###
## ### | | | | |__| | | / ____ \|_| ###
## ### |_| |_|\____/|_|/_/ \_\ ###
## ### ###
## #########################################################
##
##
##
## use_float32 set to True: replacing float64 arrays by float32 arrays to speed up computations...
##
## Scaling views to unit variance...
##
## Successfully loaded view='mrna' group='group1' with N=379 samples and D=2000 features...
## Successfully loaded view='mirna' group='group1' with N=379 samples and D=184 features...
## Successfully loaded view='protein' group='group1' with N=379 samples and D=142 features...
##
##
## Model options:
## - Automatic Relevance Determination prior on the factors: False
## - Automatic Relevance Determination prior on the weights: True
## - Spike-and-slab prior on the factors: False
## - Spike-and-slab prior on the weights: True
## Likelihoods:
## - View 0 (mrna): gaussian
## - View 1 (mirna): gaussian
## - View 2 (protein): gaussian
##
##
##
##
## ######################################
## ## Training the model with seed 53 ##
## ######################################
##
##
## ELBO before training: -15493277.37
##
## Iteration 1: time=0.22, ELBO=-1152093.06, deltaELBO=14341184.312 (92.56391641%), Factors=27
## Iteration 2: time=0.24, Factors=26
## Iteration 3: time=0.21, Factors=25
## Iteration 4: time=0.20, Factors=24
## Iteration 5: time=0.19, Factors=23
## Iteration 6: time=0.18, ELBO=-954270.09, deltaELBO=197822.970 (1.27683101%), Factors=22
## Iteration 7: time=0.17, Factors=21
## Iteration 8: time=0.19, Factors=20
## Iteration 9: time=0.17, Factors=19
## Iteration 10: time=0.16, Factors=18
## Iteration 11: time=0.16, ELBO=-944857.04, deltaELBO=9413.050 (0.06075570%), Factors=17
## Iteration 12: time=0.14, Factors=16
## Iteration 13: time=0.13, Factors=15
## Iteration 14: time=0.13, Factors=14
## Iteration 15: time=0.15, Factors=13
## Iteration 16: time=0.14, ELBO=-963820.35, deltaELBO=-18963.312 (0.12239703%), Factors=12
## Warning, lower bound is decreasing...
## Iteration 17: time=0.10, Factors=11
## Iteration 18: time=0.10, Factors=10
## Iteration 19: time=0.09, Factors=9
## Iteration 20: time=0.08, Factors=8
## Iteration 21: time=0.08, ELBO=-988307.10, deltaELBO=-24486.747 (0.15804756%), Factors=7
## Warning, lower bound is decreasing...
## Iteration 22: time=0.07, Factors=7
## Iteration 23: time=0.07, Factors=7
## Iteration 24: time=0.07, Factors=7
## Iteration 25: time=0.08, Factors=7
## Iteration 26: time=0.07, ELBO=-983164.78, deltaELBO=5142.318 (0.03319064%), Factors=7
## Iteration 27: time=0.07, Factors=7
## Iteration 28: time=0.08, Factors=7
## Iteration 29: time=0.08, Factors=7
## Iteration 30: time=0.08, Factors=7
## Iteration 31: time=0.08, ELBO=-982171.89, deltaELBO=992.885 (0.00640849%), Factors=7
## Iteration 32: time=0.06, Factors=7
## Iteration 33: time=0.08, Factors=7
## Iteration 34: time=0.07, Factors=7
## Iteration 35: time=0.07, Factors=7
## Iteration 36: time=0.07, ELBO=-981954.93, deltaELBO=216.963 (0.00140037%), Factors=7
## Iteration 37: time=0.06, Factors=7
## Iteration 38: time=0.07, Factors=7
## Iteration 39: time=0.06, Factors=7
## Iteration 40: time=0.07, Factors=7
## Iteration 41: time=0.07, ELBO=-981804.08, deltaELBO=150.848 (0.00097364%), Factors=7
## Iteration 42: time=0.08, Factors=7
## Iteration 43: time=0.07, Factors=7
## Iteration 44: time=0.07, Factors=7
## Iteration 45: time=0.09, Factors=7
## Iteration 46: time=0.08, ELBO=-981677.32, deltaELBO=126.765 (0.00081819%), Factors=7
## Iteration 47: time=0.07, Factors=7
## Iteration 48: time=0.07, Factors=7
## Iteration 49: time=0.08, Factors=7
## Iteration 50: time=0.07, Factors=7
## Iteration 51: time=0.08, ELBO=-974034.42, deltaELBO=7642.900 (0.04933043%), Factors=7
## Iteration 52: time=0.08, Factors=7
## Iteration 53: time=0.07, Factors=7
## Iteration 54: time=0.07, Factors=7
## Iteration 55: time=0.07, Factors=7
## Iteration 56: time=0.08, ELBO=-973408.98, deltaELBO=625.436 (0.00403682%), Factors=7
## Iteration 57: time=0.07, Factors=7
## Iteration 58: time=0.07, Factors=7
## Iteration 59: time=0.07, Factors=7
## Iteration 60: time=0.07, Factors=7
## Iteration 61: time=0.07, ELBO=-973294.59, deltaELBO=114.391 (0.00073833%), Factors=7
## Iteration 62: time=0.08, Factors=7
## Iteration 63: time=0.07, Factors=7
## Iteration 64: time=0.07, Factors=7
## Iteration 65: time=0.07, Factors=7
## Iteration 66: time=0.07, ELBO=-973219.81, deltaELBO=74.780 (0.00048266%), Factors=7
## Iteration 67: time=0.07, Factors=7
## Iteration 68: time=0.07, Factors=7
## Iteration 69: time=0.07, Factors=7
## Iteration 70: time=0.07, Factors=7
## Iteration 71: time=0.07, ELBO=-973157.72, deltaELBO=62.085 (0.00040072%), Factors=7
##
## Converged!
##
##
##
## #######################
## ## Training finished ##
## #######################
##
##
## Warning: Output file MOFA_models/MOFA2_TCGA.hdf5 already exists, it will be replaced
## Saving model in MOFA_models/MOFA2_TCGA.hdf5...
NB: you can also load a pre-computed MOFA model from an hdf5 file with the load_model function.
The resulting MOFAobject is an S4 object used to store all relevant data for analysing a MOFA model.
DIY: display available attributes using the slotNames function.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## MOFA object attributes ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
slotNames(MOFAobject)
## [1] "data" "covariates" "covariates_warped"
## [4] "intercepts" "imputed_data" "interpolated_Z"
## [7] "samples_metadata" "features_metadata" "expectations"
## [10] "training_stats" "data_options" "model_options"
## [13] "training_options" "stochastic_options" "mefisto_options"
## [16] "dimensions" "on_disk" "dim_red"
## [19] "cache" "status"
Object attributes accessible using ‘@’:
Some interesting functions to access these attributes:
factors_names(): to get or set factor names
features_names(): to get or set feature names
samples_names(): to get or set sample names
views_names(): to get or set view names
get_dimensions(): to get dimensions (number of samples, features, etc.)
get_factors(): to get model factors
get_weights(): to get model weights

DIY: try some of these functions on MOFAobject.
## [1] "mrna" "mirna" "protein"
## $M
## [1] 3
##
## $G
## [1] 1
##
## $N
## group1
## 379
##
## $D
## mrna mirna protein
## 2000 184 142
##
## $K
## [1] 7
DIY: use the samples_metadata function to access and rewrite (or add) metadata on the MOFAobject.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Metadata ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Add sample metadata to the model
temp_metadata <- cbind(rownames(clinical_data),clinical_data)
colnames(temp_metadata)[1]="sample" #one metadata column should contain the sample IDs matching the input data and be labelled "sample"
samples_metadata(MOFAobject) <- temp_metadata
rm(temp_metadata)

To interpret these results, it is important to have information about the samples [Biological context (TCGA)].
First, we’ll analyze all the factors at a high level to get an overview of the main trends in decomposition.
DIY: use the plot_factor_cor function to display correlations between factors.
Normally, factors should not be too correlated. If they are, you’ve probably chosen too many factors, creating redundancy between some of them.
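What plot_factor_cor displays is essentially the Pearson correlation matrix of the columns of Z; an equivalent base-R check, with a random matrix standing in for get_factors(MOFAobject)[[1]]:

```r
set.seed(3)
Z <- matrix(rnorm(100 * 5), 100, 5)      # stand-in for get_factors(MOFAobject)[[1]]
colnames(Z) <- paste0("Factor", 1:5)

fac_cor <- cor(Z)                        # the matrix behind plot_factor_cor
max(abs(fac_cor[upper.tri(fac_cor)]))    # should stay small for well-separated factors
```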
Displaying the percentage of variance explained by each factor in each view is a key plot of MOFA and should always be done before inspecting factor values or weights.
DIY: use the plot_variance_explained function to display the variance explained by each factor for each view.
## Variance explained values
###################################
plot_variance_explained(MOFAobject, max_r2=20)

DIY: you can get the numerical values from the cache attribute of the MOFAobject, in the variance_explained field.
## Variance explained values
###################################
MOFAobject@cache$variance_explained$r2_per_factor
## $group1
## mrna mirna protein
## Factor1 15.730399 12.3123467 16.6931450
## Factor2 11.788946 5.5113316 3.1709313
## Factor3 5.845827 3.4777701 1.1063159
## Factor4 3.380114 2.7437747 2.0575702
## Factor5 2.006245 1.9598901 2.7783573
## Factor6 4.302490 0.8132339 0.3555894
## Factor7 2.556664 1.6396344 1.1929691
As opposed to PCA, a factor does not necessarily explain less variance than the previous one.
We can ask whether the model provides a good fit to the data. To check this, we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the dataset, the number of samples, the number of factors, etc.
We can identify some rules about this amount:
The objective is not to reach 100%, even with a large number of factors. Explaining ~40% for a view thanks to a linear model is already good.
DIY: use the plot_variance_explained function to display the total variance explained for each view.
## Variance explained histogram
###################################
plot_variance_explained(MOFAobject, plot_total = T)[[2]]

The latent factors inferred by MOFA represent the underlying principal sources of variation among the samples.
Some factors capture a source of variability that is present across all data modalities; their etiology is then likely to be central to the studied phenomenon. Other factors capture a strong source of variation in only one specific modality.
We can also access the variance explained per sample (only available with Gaussian data).
DIY: if your data are exclusively Gaussian (regarding the model_options parameters), use the calculate_variance_explained_per_sample function to display the variance explained per sample.
We can look at associations (correlations) between MOFA factors and metadata such as gender, age, etc. These associations are a good starting point to interpret factors.
DIY: use the correlate_factors_with_covariates function to search such associations with available metadata.
## Factor correlation with metadata
###################################
correlate_factors_with_covariates(MOFAobject,
covariates = names(clinical_data),
plot="log_pval"
)

ATTENTION: to compute such correlations, categorical variables are simply converted to numeric with as.numeric().
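A quick illustration of why this matters: as.numeric() on a categorical variable returns arbitrary (alphabetical) level codes, so the resulting correlations reflect level ordering rather than any intrinsic scale.

```r
subtype <- factor(c("LumA", "Basal", "Her2", "LumA"))  # toy categorical metadata
levels(subtype)        # alphabetical: "Basal" "Her2" "LumA"
as.numeric(subtype)    # 3 1 2 3: codes follow that ordering, not biology
```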
We can also stay at the sample level and display factor values for each sample (Z matrix) crossed with metadata.
DIY: use the plot_factor function to display factors values and use metadata to color dots.
Using a metadata variable that was significant for a given factor in the previous correlation analysis to color dots should give you a sharp separation along the same factor on this plot.
## Plot factor values
###################################
p <- plot_factor(MOFAobject,
factors = c(1:5),
color_by = "subtype",
dot_size = 3, # change dot size
dodge = T, # dodge points with different colors
legend = T, # show legend
# add_violin = T, # add violin plots (here we have too few points to make violin, let's do boxplots)
# violin_alpha = 0.25 # transparency of violin plots
add_boxplot = T
)
# The output of plot_factor is a ggplot2 object that we can edit
print(p)

We can study pairwise combinations of factor values using the plot_factors (with an s) function to detect pairwise associations between factors and a metadata variable (similar to a PCA representation).
DIY: try plot_factors on the MOFAobject, still coloring dots using metadata.
## Plot combinations of factor values
###################################
plot_factors(MOFAobject,
factors = 1:5,
color_by = "subtype"
)

After this general overview of all the factors, you can explore a specific factor of interest in detail.
To explore loadings (W matrix)
To explore data
The use of these functions is illustrated in the next sections.
The weights measure how strongly each feature relates to each factor:
The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the feature has higher levels in samples with positive factor values).
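A toy numeric check of this interpretation (values invented): reconstructing a feature as factor value × weight, a positive weight makes the feature highest in samples with positive factor values, while a negative weight inverts the relationship.

```r
z     <- c(-2, -1, 0, 1, 2)   # factor values for five samples
w_pos <-  0.8                 # positive weight
w_neg <- -0.8                 # negative weight

z * w_pos   # feature level rises with the factor: maximal for the last sample
z * w_neg   # feature level falls with the factor: maximal for the first sample
```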
DIY: access the ‘W’ field of the ‘expectations’ attribute of the MOFAobject and display the weights associated with a given modality. Use the datatable function for a nicer visualization.
In the next section we will focus on factor 1, however feel free to choose another factor.
Factor values
We can first display factor 1 values for each sample (Z matrix).
DIY: use the plot_factor function to display factor 1 values and color dots according to this value.
You can also group the samples according to a qualitative metadata variable to identify biological associations.
## Plot factor values
###################################
factorToStudy <- 1 # the factor explored in this section (change it to study another one)

plot_factor(MOFAobject,
factors = factorToStudy,
color_by = paste("Factor", factorToStudy, sep=""),
group_by = "subtype"
)

Weight of associated features
Then we can search for features associated with factor 1 in each modality to derive a “signature” of the factor.
DIY: visualize weight distribution for factor 1 in the different modalities using plot_weights function.
## Plot weight values distribution
###################################
plot_weights(MOFAobject,
view = "mrna",
factor = factorToStudy,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
NB: you can specify the number of features to highlight in this distribution through the ‘nfeatures’ option.
DIY: use the plot_top_weights function to focus on top weights (up and down) for factor 1 in each modality.
The weights measure how strongly each feature relates to each factor (no association = 0; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the feature has higher levels in samples with positive factor values).
## Plot top weight values
###################################
plot_top_weights(MOFAobject,
view = "mrna",
factor = factorToStudy,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
Raw values of associated features
Instead of looking at the weight values associated with each feature, we can also display the raw values of the features associated with factor 1.
DIY: use the plot_data_heatmap function to display the raw values of the top 25 features associated with factor 1 in each modality. You can also cluster features and/or samples based on these values.
## Plot feature heatmap
###################################
plot_data_heatmap(MOFAobject,
view = "mrna",
factor = factorToStudy,
features = 25,
cluster_rows = TRUE, cluster_cols = FALSE,
show_rownames = TRUE, show_colnames = TRUE,
denoise=TRUE,
scale = "row"
)
To focus on a limited number of features you can also display raw values as a scatter plot and cross this information with metadata or factors.
DIY: use the plot_data_scatter function to plot the raw values of the top 4 features with a positive weight on factor 1 in each modality, and color samples based on metadata.
## Plot positive weight feature scatter
###################################
plot_data_scatter(MOFAobject,
view = "mrna",
factor = factorToStudy,
features = 4,
sign = "positive",
color_by = "subtype"
) + labs(y="mRNA expression")
DIY: do the same for the top 4 features with a negative weight on factor 1 in each modality, and color samples based on metadata.
## Plot negative weight feature scatter
###################################
plot_data_scatter(MOFAobject,
view = "mrna",
factor = factorToStudy,
features = 4,
sign = "negative",
color_by = "subtype"
) + labs(y="mRNA expression")
Before going any further, consider [Regularized Generalized Canonical Correlation Analysis (RGCCA)] if you haven’t tried it yet.
You can decide to modify the pre-processing of the data and/or the model/training options to check their effect on the factors.
MOFA is an unsupervised approach; as with PCA, you can try to cluster samples based on their projection in the MOFA factor space.
DIY: The cluster_samples function is based on k-means. Use it to identify clusters of samples based on their projection onto all inferred factors. Then use the plot_factors function to display the result of this clustering by coloring samples according to their cluster.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Clustering ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## cluster samples based on factor projection
clusters <- cluster_samples(MOFAobject, k=4, factors = "all")
## plot sample projection on factor space with cluster information
plot_factors(MOFAobject,
factors=1:2,
color_by=clusters$cluster)
DIY: Try different values of k, the number of clusters to extract. With the proxy::dist function, compute the distance from each sample to the centroid of its cluster and compute the average silhouette score of each clustering. Find the number of clusters leading to the best silhouette score.
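Before turning to the MOFA factors, the centroid-based silhouette can be sketched in base R on synthetic 2-D data (the data, kmeans call and k values below are purely illustrative assumptions, not the tutorial's solution):

```r
## Self-contained base-R sketch of the centroid-based average silhouette
## (on synthetic 2-D data, NOT on the MOFA factors).
## For each sample: a = distance to its own cluster centroid,
##                  b = distance to the closest other centroid,
##                  silhouette = (b - a) / max(a, b).
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))

avg_sil <- function(X, k) {
  km <- kmeans(X, centers = k, nstart = 10)
  ## n x k matrix of sample-to-centroid distances
  d <- sapply(seq_len(k), function(j)
    sqrt(rowSums(sweep(X, 2, km$centers[j, ])^2)))
  a <- d[cbind(seq_len(nrow(X)), km$cluster)]
  b <- sapply(seq_len(nrow(X)), function(i) min(d[i, -km$cluster[i]]))
  mean((b - a) / pmax(a, b))
}

scores <- sapply(2:5, function(k) avg_sil(X, k))
names(scores) <- 2:5
print(scores)
best_k <- as.integer(names(scores)[which.max(scores)])
print(best_k)  # the two simulated groups are best described by k = 2
```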
if (!require("proxy")){
install.packages("proxy")
}
if (!require("Silhouette")){
install.packages("Silhouette")
}
res_silhouette = c()
tested_k_values = 2:10
for (k in tested_k_values){
## cluster samples based on factor projection
clusters = cluster_samples(MOFAobject, k=k, factors = "all")
dist_matrix = proxy::dist(MOFAobject@expectations$Z$group1, clusters$centers)
sil = Silhouette(dist_matrix)
res_silhouette = c(res_silhouette, (mean(sil$sil_width)))
}
names(res_silhouette) = tested_k_values
print(res_silhouette)
##         2         3         4         5         6         7         8         9
## 0.3483627 0.2817135 0.3169978 0.2829063 0.2810880 0.2755839 0.2731449 0.2805127
## 10
## 0.2798896
DIY: Re-launch the clustering with the optimal number of clusters and compute both a contingency table between the TCGA subtypes and the inferred clusters, and the Rand index between the two partitions.
if (!require("aricode")){
install.packages("aricode")
}
clusters = cluster_samples(MOFAobject, k=tested_k_values[which.max(res_silhouette)], factors = "all")
print("Contingency Table")
## [1] "Contingency Table"
## 4 x 2 sparse Matrix of class "dgCMatrix"
## c2
## c1 1 2
## LumA 130 58
## Basal 40 36
## Her2 25 13
## LumB 28 49
## [1] "Rand Index"
## [1] 0.5233907
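As a sanity check on values like the one above, the Rand index can also be computed by hand: it is the fraction of sample pairs on which two partitions agree (grouped together in both, or separated in both). A minimal base-R sketch on toy labels (illustrative, not the TCGA subtypes):

```r
## Rand index: fraction of sample pairs on which two partitions agree
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)                # all pairs of samples
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]    # pair together in partition a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]    # pair together in partition b?
  mean(same_a == same_b)
}
## toy labels, not the TCGA data
truth <- c(1, 1, 1, 2, 2, 3)
pred  <- c(1, 1, 2, 2, 2, 3)
print(table(truth, pred))       # contingency table between the two partitions
print(rand_index(truth, pred))  # 11 of the 15 pairs agree: 11/15
```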
DIY: MOFA is not randomly initialized: its latent space \(\mathbf{Z}\) and weight matrix \(\mathbf{W}\) are initialized based on PCA, so if you launch 2 successive MOFA runs without dropping factors, the results will be exactly identical. However, when factors can be dropped and several of them qualify for dropping at a given iteration, a single one is picked at random. To exploit this variability, train multiple MOFA models with different seeds, each time allowing factors to be dropped.
DIY: Use the function compare_factors to compare these models according to their factors.
DIY: You can run a gene set enrichment analysis using the groups of genes with a high weight for a given factor to get indications about the biological meaning of the factors.
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Enrichment ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
if (!require("dplyr")){
install.packages("dplyr")
}
if (!require("fgsea")){
install.packages("fgsea")
}
annot_table = read.csv("../data/TCGA/annotation_table_mrna.csv", row.names = 1)
kegg_pathways = readRDS("../data/TCGA/kegg_pathways.rds")
kegg_hsa_name <- kegg_pathways %>%
dplyr::select(pathway_id, pathway_name) %>%
dplyr::distinct()
# run the enrichment analysis on each factor
res = c()
for (comp in 1:ncol(MOFAobject@expectations$W$mrna)){
a1 = MOFAobject@expectations$W$mrna[, comp]
## check that the weights and the annotation table are in the same order
stopifnot(all(annot_table$symbols_old == gsub("^mRNA_", "", names(a1))))
names(a1) = annot_table$entrez
## number of genes mapped to at least one KEGG pathway
sum(names(a1) %in% kegg_pathways$entrez_gene)
## squared weights are used as the ranking statistic, so that features with
## large weights of either sign drive the enrichment
fgsea_res <- fgseaMultilevel(
pathways = split(kegg_pathways$entrez_gene, kegg_pathways$pathway_id),
stats = a1**2,
minSize = 15,
maxSize = 500,
eps = 0
)
fgsea_res <- fgsea_res %>%
left_join(kegg_hsa_name, by = c("pathway" = "pathway_id")) %>%
relocate(pathway_name, .after = 1)
res = rbind(res, fgsea_res$pval)
}
res_pval_adjusted = matrix(p.adjust(c(res), method = "BH"), nrow = dim(res)[1], ncol = dim(res)[2], byrow = F)
colnames(res_pval_adjusted) = fgsea_res$pathway_name
rownames(res_pval_adjusted) = paste0("Factor_", 1:dim(res_pval_adjusted)[1])
res_pval_adjusted[which(res_pval_adjusted > 0.05, arr.ind = T)] = NA
pheatmap::pheatmap(-log10(res_pval_adjusted), cluster_cols = F, cluster_rows = F)
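Note the design choice in the correction step above: the p-values of all factors and all pathways are pooled into a single vector, adjusted once with Benjamini-Hochberg, then reshaped back into a factor x pathway matrix (column-major, matching c()). A minimal sketch with made-up p-values:

```r
## made-up p-values for 2 factors (rows) x 3 pathways (columns)
pvals <- matrix(c(0.001, 0.20, 0.04, 0.50, 0.01, 0.80), nrow = 2)
## pool everything, adjust once over all tests, reshape back (column-major,
## like c(pvals)), exactly as done for res_pval_adjusted above
adj <- matrix(p.adjust(c(pvals), method = "BH"),
              nrow = nrow(pvals), ncol = ncol(pvals))
print(adj)
```

Adjusting globally rather than per factor keeps the false discovery rate controlled over the whole heatmap of factor/pathway tests.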