---
title: "MOFA2 analysis of Corces-Buenrostro AML dataset, Part 1"
author: "Delphine Potier & Carl Herrmann"
date: "`r date()`"
output: 
  html_document:
    toc: true
    toc_float: true
    toc_depth: 4
    number_sections: false
    code_folding: show
    highlight: zenburn
params:
  work_dir: "~/mydatalocal/"
editor_options: 
  chunk_output_type: console
---
Multi-Omics Factor Analysis (MOFA) is a multi-omics data integration approach giving a computational framework for unsupervised discovery of the principal axes of biological and technical variation when multiple omics assays are applied to the same samples. 

![MOFA: model overview and downstream analyses](https://i.imgur.com/4gG6vXG.jpg)
Argelaguet et al., Mol Syst Biol, 2018.

*MOFA decomposes each omics into the product of a factor matrix and omics-specific weight matrices.*

Link to MOFA article : https://www.embopress.org/doi/full/10.15252/msb.20178124

Link to MOFA+ article : https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02015-1


In Multivariate analysis workshop section II, we will first perform the MOFA analysis and interpret the results using an object containing RNA-seq and ATAC-seq matrices already pre-possessed (data normalized and features selected according to MOFA recommendations).

Then in the second step you will have the opportunity to play with the different pre-processing steps (and with the various MOFA options) to investigate how they can impact the results.


```{r setup, include=FALSE}
knitr::opts_knit$set(root.dir=params$work_dir)
knitr::opts_chunk$set(echo = TRUE)
options(knitr.table.format="html")
setwd(params$work_dir)
```

# Library and data loading

## Load libraries
```{r libraries_loading, warning=FALSE, message=FALSE, results='hide'}
# Libraries already installed on the VM
library(data.table)
library(ggplot2)
library(DESeq2) 
library(gage)
library(MOFA2)
library(FactoMineR)
library(factoextra)
library(ComplexHeatmap)
library(viridis)
library(DT)
library(msigdbr)
library(GGally)
library(dplyr)
library(tidyr)
```


## Load RNA-seq and ATAC-seq multiview matrix

This multiview matrix contains RNA-seq and ATAC-seq cleaned and normalized matrix for the selected features.

```{r data_loading}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                   Load multi view matrix + annotations                     ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

multiview.norm.mat <- readRDS("~/mydatalocal/data/multiview.RDS")
rna.atac.annot <- readRDS("~/mydatalocal/data/atac_annotations.RDS")
```

# Run MOFA

MOFA can help to identify patterns and relationships between different data types.


## Create MOFA object


```{r MOFAobject_creation, message=FALSE}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                          Create the MOFA object.                           ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

MOFAobject <- create_mofa(multiview.norm.mat)


#Plot overview of the input data, including the number of samples, views, features, and the missing assays.
plot_data_overview(MOFAobject) # everything matches, but we could have kept non matching samples with MOFA
```

## MOFA options {.tabset}

First, we have to define the options, there classified in 3 categories :

* Data options
* Model options
* Training options


### Data options

Data options important arguments:

* scale_views: logical indicating whether to scale views to have the same unit variance. As long as the scale differences between the views is not too high, this is not required. Default is FALSE.
* scale_groups: logical indicating whether to scale groups to have the same unit variance. As long as the scale differences between the groups is not too high, this is not required. Default is FALSE.


```{r MOFA_dataoptions}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                                 MOFA options                               ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
## Data options
###################################
data_opts <- get_default_data_options(MOFAobject)
# data_opts$scale_views <- TRUE
data_opts
```

### Model options

Model options important arguments:

* likelihoods: likelihood per view (options are “gaussian”, “poisson”, “bernoulli”). 
'gaussian' is for continuous data (Default for all views), 'bernoulli' for binary data and 'poisson' for count data.
* num_factors: number of factors


$~$

The model flexibly handles different data types; it is important to choose the likelihoods option in function of the data (think about check data distribution).

For the number of factors, with ButchR method we found optimal factorization rank was 7. Here we will go up to 10 and inspect them visually.

```{r MOFA_modeloptions}
## Model options
###################################
model_opts <- get_default_model_options(MOFAobject)
model_opts$num_factors <- 10  

model_opts
```


**Visual inspection of RNA-seq centered/normalized counts data**

```{r MOFA_countdistribRNA}
hist(MOFAobject@data$rna$group1 , breaks = 100, main = "Histogram of RNA count values stored in MOFA object")
#hist(matTopVarGenes - rowMeans(matTopVarGenes) , breaks = 80, main = "Histogram of VST normalized and centered counts")
```


**Visual inspection of ATAC-seq centered/normalized counts data**

```{r MOFA_countdistribATAC}
# While selecting only 5000 regions, the distribution of ATAC count value is no a gaussian anymore
hist(MOFAobject@data$atac$group1 , breaks = 100, main = "Histogram of ATAC count values stored in MOFA object")
#hist(matTopVarRegs[, match(rna.atac.annot$original.atacID, colnames(matTopVarRegs))], breaks = 80, main = "Histogram of VST normalized and centered counts for 5000 selected features")


# Distribution if we would select more regions
# matTopVarRegs20000 <-atac.norm.mat[head(order(rowVars(atac.norm.mat), decreasing = TRUE), 20000),]
# matTopVarRegs20000 <- matTopVarRegs20000 - rowMeans(matTopVarRegs20000)
# hist(matTopVarRegs20000, breaks = 80, main = "Histogram of VST normalized and centered counts (top 20000 features)")
```

Should we model these data with a gaussian distribution? 

*Open question : The ATAC-seq after selection more look like bimodal, so should we switch to bernoulli?*

*to change this option, one can use model_opts$likelihoods <- "Gaussian" or "poisson" or "bernoulli"*


### Training options

Training important options:

* maxiter: maximum number of iterations (default is 1000). 

* convergence_mode: "fast", "medium", "slow". For exploration, the fast mode is good enough.

* startELBO: initial iteration to compute the ELBO (the objective function used to assess convergence)

* freqELBO: frequency of computations of the ELBO (the objective function used to assess convergence)

* drop_factor_threshold: minimum variance explained criteria to drop factors while training. Default is -1 (no dropping of factors). 
*One can change it to discard factors that explain a low percentage of variance in all views.*


```{r MOFA_trainingoptions}
## Training options
###################################
train_opts <- get_default_training_options(MOFAobject)
train_opts$convergence_mode <- "slow"
train_opts$freqELBO <- 1
train_opts$seed <- 42
#train_opts$drop_factor_threshold <- 0.001

train_opts
```


## Prepare the object and run MOFA

```{r run_MOFA, warning=FALSE}
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                          Prepare the MOFA object                           ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
MOFAobject <- prepare_mofa(MOFAobject,
  data_options = data_opts,
  model_options = model_opts,
  training_options = train_opts
)


##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##
##                                  Run MOFA                                  ##
##––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––##

MOFAobject <- run_mofa(MOFAobject, outfile="/home/rstudio/mydatalocal/MOFA2.hdf5", use_basilisk = TRUE) 
```

Use the load_Model() function to load a trained MOFA model from an hdf5 file


# Results interpretation


## Get familiar with the MOFA object

### Slots of MOFA object

MOFAobject is a S4 class used to store all relevant data to analyse a MOFA model. 
Get available slots

```{r available_slots}
slotNames(MOFAobject)
```

Object slots are accessible using @ or $

* data : input data (either a list of matrices or a MultiAssayExperiment)
* intercepts : feature intercepts
* imputed_data: imputed data (filled by running impute on the object)
*samples_ and features_metadata
* expectations: expectations of the factors and weights ($Z and $W)
* TrainStats: training statistics
* data_options: data processing options
* model_options: model options
* training_options: training options
* dimensions : dimensions for the number of views (M), groups (G), samples (N), features per view (D) and factors (K)
* dim_red : non-linear dimensionality reduction manifolds.
* cache : variance explained by factors
* status : Auxiliary variable indicating whether the model has been trained.
* covariates, covariates_warped and interpolated_Z are optional slot to store sample covariate only for training in MEFISTO

### Functions to get infos from MOFA object

Some interesting functions : 

* factors_names(): to get or set factor names
* features_names(): to get or set feature names
* samples_names(): to get or set sample names
* views_names(): to get or set view names
* get_dimensions(): to get dimensions (number of samples, features, etc.)
* get_factors(): to get model factors
* get_weights(): to get model weights

**Check some slots of the trained MOFA model**

```{r}
###data
views_names(MOFAobject)
get_dimensions(MOFAobject)

#Add sample metadata to the model
samples_metadata(MOFAobject) <- rna.atac.annot
```

## Results

To interprete these data it is important to have informations about the samples.

![Scheme representing hematopoiesis *(from Corces et al. Nat Genet. 2016)*](https://i.imgur.com/qb5Tvlb.png)


### General factors plots {.tabset}

#### Factors correlation plot

Correlation plot between factors.

Ideally, they should be uncorrelated.

```{r}
#Check that the Factors are uncorrelated
plot_factor_cor(MOFAobject)
```


#### Variance decomposition

Percentage of variance explained by factors in each view.
This is a key plot of MOFA and should always be done before inspecting factors or weights.

```{r , warning=FALSE}
#variance decomposition analysis. 
plot_variance_explained(MOFAobject, max_r2=45)
#plot_variance_explained(MOFAobject, plot_total = T)[[1]]

MOFAobject@cache$variance_explained$r2_per_factor
```

We can ask whether the model provides a good fit to the data. To check it we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc...:
 
* Noisy datasets will result in a lower amounts of variance explained.
* Higher the number of samples --> smaller the total variance explained.
* Higher the number of factors --> higher the total variance explained.

In this data set, using only K=7 factors the model explains up to ~90% of the variation in the RNAseq data and ~96% of the variation in the ATAC data.


```{r , warning=FALSE}
d <- plot_variance_explained(MOFAobject, plot_total = T)[[2]]
#d$data
d
```


The latent factors inferred by MOFA represent the underlying principal sources of variation among the samples. Some of these factors may be shared across multiple data types, while others may be specific to a particular kind of data.


In our case, a number of factors are shared by both genes expression and regulatory regions accessibility modalities.


We can also access the variance explained per sample :

```{r , warning=FALSE}
#calculate_variance_explained(MOFAobject)
calculate_variance_explained_per_sample(MOFAobject)
```


#### Summary plot of factors

```{r , fig.width = 10, warning=FALSE}
col.table <- rna.atac.annot
col.table <- unique(col.table[,c(2,6)])
rownames(col.table) <- col.table[,1]
col.table <- col.table[,2, drop = FALSE]
colors <- col.table[order(row.names(col.table)),]

# Plot
p <- plot_factor(MOFAobject, 
  factors = c(1:10),
  color_by = "Celltype",
  dot_size = 3,        # change dot size
  dodge = T,           # dodge points with different colors
  legend = T,          # remove legend
#  add_violin = T,      # add violin plots (here we have too few points to make violin, let's do boxplots)
#  violin_alpha = 0.25  # transparency of violin plots
  add_boxplot = T
)

# The output of plot_factor is a ggplot2 object that we can edit
p <- p +  scale_fill_manual(values=colors)

print(p)
```


**Factor 1** separates lymphoid lineage mature cells from immature cells.

**Factor 2** separates CLP (+ monocytes and B cells) from other cell types.

**Factor 3** separates monocytes from CLP.

**Factor 4** separates NK from CD4/CD8 T cells.

**Factor 5** distinguishes MEP cells (most negative cell types are HSC and LMPP).

**Factor 6** distinguishes B cells and separate also HSC and MPP from GMP/LMPP/MEP.

**Factor 7** distinguishes B cells, GMP and LMPP cells.

**Factor 8, 9 and 10** represent noise: we observe a lot of variability among the same cell types.

#### Combinations of factors

Pairwise combination of scatterplots between multiple factors (up to factor 7)
 
 
```{r, , fig.width = 10, fig.height = 9, warning = FALSE, message=FALSE}
plot_factors(MOFAobject, 
  factors = 1:7,
  color_by = "Celltype"
)
```


### Explore individual factors {.tabset}

#### Useful functions

**To explore loadings**

* plot_top_weights(): plot top loadings for a given factor/view
* plot_weights(): plot all loadings for a given factor/view
* plot_weights_heatmap(): plot a heatmap of the loadings from multiple factors of a given view

**To explore data**

* plot_data_heatmap(): heatmap of the training data using only top features of a given factor
* plot_data_scatter(): scatterplot of the data using only top features of a given factor

Functions  usage are shown in the following individual factors tabs 

#### Table of weights/factors

The weights mesure how strong each feature (gene or region in our case) relates to each factor : 

*no association = 0 ; 
*strong association = high absolute values

The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the gene has higher levels in the cells with positive factor values).

**Table of gene weigths**
```{r}
datatable(MOFAobject@expectations$W$rna[order(MOFAobject@expectations$W$rna[,1], decreasing = TRUE ),]) %>% formatRound(c(1,c(1:dim(MOFAobject@expectations$W$rna)[2])))
```

**Table of region weigths**
```{r}
datatable(MOFAobject@expectations$W$atac[order(MOFAobject@expectations$W$atac[,1], decreasing = TRUE),]) %>% formatRound(c(1,c(1:dim(MOFAobject@expectations$W$atac)[2])))
```

#### Factor1

**Plot factor 1 values for the different samples grouped by cell types**

```{r}
plot_factor(MOFAobject, 
  factors = 1, 
  color_by = "Factor1",
  group_by = "Celltype"
) 
```

Factor 1 separates lymphoid lineage mature cells from immature cells.
Here the maturity state is a driver of variation in gene expression and chromatin accessibility.

$~$
$~$

**Find top genes involved in factor 1**

* Visualize weight distribution for factor 1

```{r}
plot_weights(MOFAobject,
 view = "rna",
 factor = 1,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 1
```{r}
plot_top_weights(MOFAobject,
 view = "rna",
 factor = 1,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 genes having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the gene has higher levels in the cells with positive factor values).


$~$
$~$


**Check expression level of genes involved in factor 1**

```{r}
plot_data_heatmap(MOFAobject, 
  view = "rna",
  factor = 1,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Individual plot of the top 4 genes having positive factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 1,  
  features = 4,
  sign = "positive",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```

CD2 has a positive weight. This means that samples with positive Factor 1 values have CD2 high expression whereas samples with negative Factor 1 values do not have.

CD2 is a known marker of T cells and NK cells

$~$
$~$


Individual plot of the top 4 genes having negative factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 1,  
  features = 4,
  sign = "negative",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```

CD34 has a negative weight. This means that samples with negative Factor 1 values have CD34 high expression whereas samples with negative Factor 1 values do not have.

CD34 is a well known marker of immature cells in the hematopoietic lineage


$~$
$~$


**Find top regions involved in factor 1**

* Visualize weight distribution for factor 1
```{r}
plot_weights(MOFAobject,
 view = "atac",
 factor = 1,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 1
```{r}
plot_top_weights(MOFAobject,
 view = "atac",
 factor = 1,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 regions having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the region has higher accesibility levels in the cells with positive factor values).


```{r}
plot_data_heatmap(MOFAobject, 
  view = "atac",
  factor = 1,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Factor 1 has a strong effect on the lymphoid lineage. 


#### Factor2

**Plot factor 2 values for the different samples grouped by cell types**

```{r}
plot_factor(MOFAobject, 
  factors = 2, 
  color_by = "Factor2",
  group_by = "Celltype"
) 
```

Factor 2 separates monocytes from CLP (can be checked with CD14)


$~$
$~$

**Find top genes involved in factor 1**

* Visualize weight distribution for factor 2

```{r}
plot_weights(MOFAobject,
 view = "rna",
 factor = 2,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 2

```{r}
plot_top_weights(MOFAobject,
 view = "rna",
 factor = 2,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 genes having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the gene has higher levels in the cells with positive factor values).


$~$
$~$


**Check expression level of genes involved in factor 2**

```{r}
plot_data_heatmap(MOFAobject, 
  view = "rna",
  factor = 2,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Individual plot of the top 4 genes having positive factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 2,  
  features = 4,
  sign = "positive",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


Individual plot of the top 4 genes having negative factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 2,  
  features = 4,
  sign = "negative",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


**Find top regions involved in factor 2**

* Visualize weight distribution for factor 2

```{r}
plot_weights(MOFAobject,
 view = "atac",
 factor = 2,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 2
```{r}
plot_top_weights(MOFAobject,
 view = "atac",
 factor = 2,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 regions having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the region has higher accesibility levels in the cells with positive factor values).


```{r}
plot_data_heatmap(MOFAobject, 
  view = "atac",
  factor = 2,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Factor 2 has a strong effect on monocytes and CLP


#### Factor3

**Plot factor 3 values for the different samples grouped by cell types**

```{r}
plot_factor(MOFAobject, 
  factors = 3, 
  color_by = "Factor3",
  group_by = "Celltype"
) 
```

Factor 3 separates monocytes and CLP from other cell types.

$~$
$~$

**Find top genes involved in factor 3**

* Visualize weight distribution for factor 3

```{r}
plot_weights(MOFAobject,
 view = "rna",
 factor = 3,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 3
```{r}
plot_top_weights(MOFAobject,
 view = "rna",
 factor = 3,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 genes having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the gene has higher levels in the cells with positive factor values).


$~$
$~$


**Check expression level of genes involved in factor 3**

```{r}
plot_data_heatmap(MOFAobject, 
  view = "rna",
  factor = 3,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Individual plot of the top 4 genes having positive factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 3,  
  features = 4,
  sign = "positive",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


Individual plot of the top 4 genes having negative factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 3,  
  features = 4,
  sign = "negative",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


**Find top regions involved in factor 3**

* Visualize weight distribution for factor 3
```{r}
plot_weights(MOFAobject,
 view = "atac",
 factor = 3,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 3
```{r}
plot_top_weights(MOFAobject,
 view = "atac",
 factor = 3,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 regions having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the region has higher accesibility levels in the cells with positive factor values).


```{r}
plot_data_heatmap(MOFAobject, 
  view = "atac",
  factor = 3,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```


#### Factor4

**Plot factor 4 values for the different samples grouped by cell types**

```{r}
plot_factor(MOFAobject, 
  factors = 4, 
  color_by = "Factor4",
  group_by = "Celltype"
) 
```

Factor 4 separates NK from CD4/CD8 T cells.

$~$
$~$

**Find top genes involved in factor 4**

* Visualize weight distribution for factor 4

```{r}
plot_weights(MOFAobject,
 view = "rna",
 factor = 4,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 4
```{r}
plot_top_weights(MOFAobject,
 view = "rna",
 factor = 4,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 genes having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the gene has higher levels in the cells with positive factor values).


$~$
$~$


**Check expression level of genes involved in factor 4**

```{r}
plot_data_heatmap(MOFAobject, 
  view = "rna",
  factor = 4,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Individual plot of the top 4 genes having positive factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 4,  
  features = 4,
  sign = "positive",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


Individual plot of the top 4 genes having negative factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 4,  
  features = 4,
  sign = "negative",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


**Find top regions involved in factor 4**

* Visualize weight distribution for factor 4
```{r}
plot_weights(MOFAobject,
 view = "atac",
 factor = 4,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 4
```{r}
plot_top_weights(MOFAobject,
 view = "atac",
 factor = 4,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 regions having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the region has higher accesibility levels in the cells with positive factor values).


```{r}
plot_data_heatmap(MOFAobject, 
  view = "atac",
  factor = 4,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```


#### Factor5

**Plot factor 5 values for the different samples grouped by cell types**

```{r}
plot_factor(MOFAobject, 
  factors = 5, 
  color_by = "Factor5",
  group_by = "Celltype"
) 
```

Factor 5 distinguishes MEP cells (most negative cell types are HSC and LMPP).

$~$
$~$

**Find top genes involved in factor 5**

* Visualize weight distribution for factor 5

```{r}
plot_weights(MOFAobject,
 view = "rna",
 factor = 5,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 5
```{r}
plot_top_weights(MOFAobject,
 view = "rna",
 factor = 5,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 genes having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the gene has higher levels in the cells with positive factor values).


$~$
$~$


**Check expression level of genes involved in factor 5**

```{r}
plot_data_heatmap(MOFAobject, 
  view = "rna",
  factor = 5,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Individual plot of the top 5 genes having positive factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 5,  
  features = 4,
  sign = "positive",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


Individual plot of the top 4 genes having negative factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 5,  
  features = 4,
  sign = "negative",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


**Find top regions involved in factor 5**

* Visualize weight distribution for factor 5
```{r}
plot_weights(MOFAobject,
 view = "atac",
 factor = 5,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 5
```{r}
plot_top_weights(MOFAobject,
 view = "atac",
 factor = 5,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 regions having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the region has higher accesibility levels in the cells with positive factor values).


```{r}
plot_data_heatmap(MOFAobject, 
  view = "atac",
  factor = 5,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```


#### Factor6

**Plot factor 6 values for the different samples grouped by cell types**

```{r}
plot_factor(MOFAobject, 
  factors = 6, 
  color_by = "Factor6",
  group_by = "Celltype"
) 
```

Factor 6 distinguishes B cells.

$~$
$~$

**Find top genes involved in factor 6**

* Visualize weight distribution for factor 6

```{r}
plot_weights(MOFAobject,
 view = "rna",
 factor = 6,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 6
```{r}
plot_top_weights(MOFAobject,
 view = "rna",
 factor = 6,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 genes having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the gene has higher levels in the cells with positive factor values).


$~$
$~$


**Check expression level of genes involved in factor 6**

```{r}
plot_data_heatmap(MOFAobject, 
  view = "rna",
  factor = 6,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Individual plot of the top 6 genes having positive factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 6,  
  features = 4,
  sign = "positive",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


Individual plot of the top 4 genes having negative factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 6,  
  features = 4,
  sign = "negative",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


**Find top regions involved in factor 6**

* Visualize weight distribution for factor 6
```{r}
plot_weights(MOFAobject,
 view = "atac",
 factor = 6,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 6
```{r}
plot_top_weights(MOFAobject,
 view = "atac",
 factor = 6,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 regions having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the region has higher accesibility levels in the cells with positive factor values).


```{r}
plot_data_heatmap(MOFAobject, 
  view = "atac",
  factor = 6,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```


#### Factor7

**Plot factor 7 values for the different samples grouped by cell types**

```{r}
plot_factor(MOFAobject, 
  factors = 7, 
  color_by = "Factor7",
  group_by = "Celltype"
) 
```

Factor 7 distinguishes B cells, GMP and LMPP cells.

$~$
$~$

**Find top genes involved in factor 7**

* Visualize weight distribution for factor 7

```{r}
plot_weights(MOFAobject,
 view = "rna",
 factor = 7,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 7
```{r}
plot_top_weights(MOFAobject,
 view = "rna",
 factor = 7,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 genes having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the gene has higher levels in the cells with positive factor values).


$~$
$~$


**Check expression level of genes involved in factor 7**

```{r}
plot_data_heatmap(MOFAobject, 
  view = "rna",
  factor = 7,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```

Individual plot of the top 4 genes having positive factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 7,  
  features = 4,
  sign = "positive",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


Individual plot of the top 4 genes having negative factor values

```{r}
plot_data_scatter(MOFAobject, 
  view = "rna",
  factor = 7,  
  features = 4,
  sign = "negative",
  color_by = "Celltype",
) + labs(y="RNA expression") + scale_fill_manual(values = colors)
```


$~$
$~$


**Find top regions involved in factor 7**

* Visualize weight distribution for factor 7
```{r}
plot_weights(MOFAobject,
 view = "atac",
 factor = 7,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```


* Zoom on top 10 absolute weight for factor 7
```{r}
plot_top_weights(MOFAobject,
 view = "atac",
 factor = 7,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)
```

We plot the 10 regions having the highest absolute weight. The weights mesure how strong each feature/gene relates to each factor (no association = 0 ; strong association = high absolute values). The sign of the weight gives the direction of the effect (e.g. a positive weight indicates that the region has higher accesibility levels in the cells with positive factor values).


```{r}
plot_data_heatmap(MOFAobject, 
  view = "atac",
  factor = 7,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = TRUE,
  scale = "row"
)
```


## To go futher ...

### Play with data pre-processing and MOFA options

Now you can **go to Multivariate2_MOFA_part2.Rmd** that contains all the code for RNA-seq and ATAC-seq data pre-processing and modify for example data normalization or feature selection or play with MOFA options to compare the differnt results.

### Link to the other workshops
#### Link to the semantic web {.tabset}

##### RNA-seq data

It is possible to run geneset enrichment analysis using group of genes having a high weight for a given factor.

**Example**

```{r}
# In case you want to load specific gene set annotations
# # First we define the gene set matrix. We’ll use the C7 category and the IMMUNESIGDB subcategory from the MSigDB data base (categories can be checked with datatable(msigdbr_collections()))
# gs.msigDB = msigdbr(species = "human", category = "C7", subcategory = "IMMUNESIGDB")
# 
# 
# msigdbr_t2g = gs.msigDB %>% dplyr::distinct(gs_name, gene_symbol) %>% as.data.frame()
# enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)


# set ftp url to RNA-seq data
gmt_url <- file.path("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/msigdb.v6.2.symbols.gmt")

read_delim_gz <- function(file_url) {
  con <- gzcon(url(file_url))
  txt <- readLines(con)
  return(read.delim(textConnection(txt), row.names = 1))
}

# read in data matrix
gs.msigDB <- readList(url(gmt_url))

#run GAGE analysis
rna_msigDB_enrichment <- gage(MOFAobject@expectations$W$rna, gsets=gs.msigDB, same.dir=TRUE)


#Drop NAs for upregulated
k.opt = 7
rna_msigDB_enrichment <- as.data.frame(rna_msigDB_enrichment$greater)
rna_msigDB_enrichment <- rna_msigDB_enrichment[!is.na(rna_msigDB_enrichment$p.geomean),]
rna_msigDB_enrichment <- rna_msigDB_enrichment[, paste0("Factor", 1:k.opt)]

# Select only more enriched terms in one signature compared to the others
idx <- apply(rna_msigDB_enrichment, 1, function(term){
  term <- -log10(term)
  # Change 0 to small value to avoid NAs
  term[term == 0] <- 1e-40
  # find if this term is more enriched in one signature compared to others
  is.enrich <- sapply(term, function(x){
    # p-value 5 times greater than at least 5 other signatures
    sum(x/term > 10) > 3
  })
  any(is.enrich)
})

rna_msigDB_enrichment2 <- rna_msigDB_enrichment[idx,]

# Print table
datatable(rna_msigDB_enrichment2, filter="top",
          extensions = 'Buttons',
          options = list(dom = 'Bfrtip',
                         buttons = list(list(extend = 'collection',
                                             buttons = c('excel', 'csv'),
                                             text = 'DOWNLOAD DATA')))) %>%
  formatSignif(columns=colnames(rna_msigDB_enrichment2), digits=3)

```


##### ATAC-seq data

It is also possible to run geneset/functionnal enrichment analysis using group of DNA region having a high weight for a given factor. To do so, first the regions has to be linked to genes they might regulates (tricky and unoptimal part, most of the time  the closest genes are choose); then a geneset enrichment analysis can be run.

**Example with factor 1**

1- Select factor 1 genomic regions

```{r extract_HVR}
#Extract top 250 genomic region positively associated to factor 1
df.regs <- as.data.frame(head(MOFAobject@expectations$W$atac[order(MOFAobject@expectations$W$atac[,1], decreasing = TRUE ),], n=250))
df.regs$peak <- rownames(df.regs)
df.regs$coordinate <- rownames(df.regs)
df.regs <- df.regs %>% separate(coordinate,sep = "_",into = c("chr","start","end"))
#write.table(df.regs[,c("chr","start","end","peak")], file='test.tsv', quote=FALSE, sep='\t', row.names = FALSE, col.names = FALSE)

# Pass regions to granges format
peaks.granges <- df.regs[,c("chr","start","end","peak")] %>%
     makeGRangesFromDataFrame(keep.extra.columns = TRUE, ignore.strand = TRUE)
```


2- Perform GREAT analysis 

```{r perform_GREAT, warning=FALSE, message=FALSE}
# We have to install an extra package (here we use rGREAT, but there are many others)
BiocManager::install("rGREAT", update = FALSE)
library(rGREAT)

# Run GREAT
job = submitGreatJob(peaks.granges, species = "hg19")
```


3- Get and plot association between genomic regions and genes 


```{r association_plot, warning=FALSE, message=FALSE}
# Get the DNA regions-->genes associations
getRegionGeneAssociations(job)

# Plot association between DNA regions and genes 
plotRegionGeneAssociations(job)
#Can also be done for a specific term if needed
#plotRegionGeneAssociations(job, ontology = "GO Biological Process", term_id = "GO:0002376")

```


4-  Check enrichment analysis results

```{r enrichment, warning=FALSE, message=FALSE}
# Check enrichment analysis results
## BP enrichment
tbl = getEnrichmentTables(job)
datatable(tbl$`GO Biological Process`)
plotVolcano(job, ontology = "GO Biological Process")

# See other available ontologies
availableOntologies(job)
tbl2 = getEnrichmentTables(job, ontology = c("Human Phenotype"))
datatable(tbl2$`Human Phenotype`)
plotVolcano(job, ontology = "Human Phenotype")
```


#### Link to the nework 

An interesting aspect of MOFA is that the various type of data can participate to a proportion of variance of the same factor, therefore it is possible to explore the data concomitantly.

Here with RNA-seq and ATAC-seq, MOFA can be used to identify common patterns and relationships between gene expression and regulatory DNA elements, such as promoter binding and transcription factor binding sites. This can provide insight into the mechanisms underlying changes in gene expression and how different regulatory elements contribute to these changes.

We saw in the results that it is the case for our factors (see [Variance decomposition]), meaning that chromatin accessibility and gene expression are linked.

therefore, we could, for example, reconstruct gene regulatory networks (then we might have to explore an unbalanced number of features per modalities) by including factors information at the region to genes association step. On top of that, we could also add a layer of DNA motif enrichment analysis (with RcisTarget for exemple)/