---
title: "Weighted Gene Co-expression Network Analysis (WGCNA)"
author: "Morgane Térézol - Galadriel Brière"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
  rmdformats::material:
    code_folding: show
    use_bookdown: true
    thumbnails: false
    css: "ETBII_WGCNA.css"
header-includes:
  - \usepackage{subfig}
  - \usepackage{booktabs}
  - \usepackage{xcolor}
---
```{r setup, eval = TRUE, echo = FALSE}
workDir <- "~/SummerSchool_IFB-SIB/02_WGCNA"
knitr::opts_knit$set(root.dir = workDir) ## = setwd()
knitr::opts_chunk$set(cache = TRUE)
options(knitr.table.format = "html")
```
# Libraries and environment
## Environment
This report was generated using:
```{r librariesEnvironment, echo = FALSE, message = FALSE, warning = FALSE}
library("rmarkdown")
library("knitr")
library("rmdformats")
library("bookdown")
library("WGCNA")
library("pheatmap")
library("compositions")
```
- R: ``r R.version$version.string``
- WGCNA: ``r packageVersion("WGCNA")``
- pheatmap: ``r packageVersion("pheatmap")``
You might also need the `compositions` library for data normalization.
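If any of these packages are missing, one possible way to install them is sketched below (adapt it to your environment):
```{r installPackages, eval = FALSE}
# One possible installation route; WGCNA pulls in several Bioconductor dependencies
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("WGCNA")
install.packages(c("pheatmap", "compositions"))
```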
## Load libraries
**DIY**: Load the WGCNA and pheatmap libraries.
```{r loadLibraries, class.source = 'fold-hide', results = "hide", message = FALSE, warning = FALSE}
library("WGCNA")
library("pheatmap")
```
# General principle of WGCNA
**Important Reminder**: While WGCNA was initially developed for processing gene expression data, it can be adapted for other data types as well.
Throughout this tutorial, we will strive to present the material in a manner that is broadly applicable to different data modalities.
However, it is worth noting that certain WGCNA functions have names related to transcriptomics data (e.g., moduleEigen**genes** instead of eigen**vectors**). Moreover, we may use the terms **correlation** and **co-expression**, or **gene** and **feature**, interchangeably.
Consequently, certain phrasings used in this tutorial may specifically reference transcriptomic data, even though the overall intent is to apply the methods to diverse data types. Please keep this in mind while following the tutorial, especially when applying it to non-transcriptomic data.
**Weighted Gene Co-expression Network Analysis (WGCNA)** is an approach dedicated to identifying **correlation** patterns between features of a dataset. In most cases, it is applied to gene expression data to find clusters, referred to as **modules**, of highly correlated genes. In transcriptomics, genes that exhibit strong correlations in their expression profiles are said to be **co-expressed**.
One of the underlying principles of WGCNA is the concept of **guilt-by-association**. This principle suggests that genes sharing similar expression patterns are likely to be **functionally related** or involved in similar biological processes. By leveraging guilt-by-association through correlation, WGCNA provides a powerful tool for identifying potential functional relationships among genes and deciphering their roles in complex biological systems.
WGCNA builds a **co-expression** network where:
- each feature (transcript/gene in transcriptomics data) is represented as a node
- edges represent a correlation pattern between two features (nodes) and are weighted with the corresponding correlation coefficient, indicating the strength of their observed correlation.
The main analysis steps are presented in the Figure \@ref(fig:WGCNAprinciple).
1. First, it constructs a **feature correlation network** (for transcriptomics data, a **gene co-expression network** based on correlation analysis). The correlation network is **(soft)-thresholded** to obtain a **scale-free** network.
2. Then, it identifies **modules**, i.e. groups of features that are highly and tightly correlated.
3. Next, those modules are characterized:
- the modules are related to **external data**, such as **sample metadata** and **feature metadata**
- users can examine **module-module relationships** to reveal a **higher-order organization of the transcriptome**
- users can identify **key driver** features in modules of interest
```{r WGCNAprinciple, echo = FALSE, fig.align = "center", out.width = "100%", fig.cap = "WGCNA overview (Langfelder, Peter, and Steve Horvath. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics 9.1 (2008))"}
knitr::include_graphics("Figures/WGCNA_Overview.png")
```
## Application on other omics
WGCNA is a method dedicated to transcriptomics data. However, some studies have shown its potential for inferring correlation networks from other omics datasets (see: [MiBiOmics](https://doi.org/10.1186/s12859-020-03921-8)). In this tutorial, we will see how it can be used on various omics datasets to reveal correlated modules across different data modalities.
## Resources
If you are interested in WGCNA, extensive tutorials are available online:
1. [Data input and cleaning](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf)
2. [Network construction](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-02-networkConstr-man.pdf)
3. [Modules characterization](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-03-relateModsToExt.pdf)
4. [Interfacing network analysis with other data](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-04-Interfacing.pdf)
5. [Network visualisation](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-05-Visualization.pdf)
6. [Export to external visualisation software](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-06-ExportNetwork.pdf)
- All official WGCNA tutorials can be found [here](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html)
- Publication: [PMID: 19114008](https://pubmed.ncbi.nlm.nih.gov/19114008//)
# Choose your dataset and your modality
For this tutorial, you can pick a dataset among the ones proposed below.
```{r data-plot, echo = FALSE, fig.align = "center", out.width = "80%", fig.cap = "Five datasets are available: **Metagenomic** data from Tara Ocean (image from Sunagawa et al., 2015), **Breast cancer** data from TCGA (image from TCGA [website](https://portal.gdc.cancer.gov/)), **CLL** data (Dietrich et al., 2018), **tomato** data and **Arabidopsis thaliana** data (Wallomics dataset).", fig.show = "hold"}
include_graphics("Figures/datasets.png")
```
## TCGA breast cancer dataset
This dataset includes measurements of mRNA, miRNA, and protein abundances taken from samples representing different molecular subtypes of breast cancer.
This dataset was obtained from [Mibiomics](https://gitlab.univ-nantes.fr/combi-ls2n/mibiomics/-/tree/master/data/data_breast)
Access the dataset from IFB on demand: `/shared/projects/tp_etbii_2024_165650/Networks/BreastCancer_WGCNA/`
For this dataset, we will use 3 modalities:
* `mrna.csv`, contains mRNA expression data
* `mirna.csv`, contains miRNA expression data
* `protein.csv`, contains protein abundance data
We will also use the file `clinical.csv`, which contains samples metadata, including molecular subtypes.
## TARA ocean dataset
Tara Ocean is a scientific initiative that aims to study and understand marine ecosystems on a global scale. It involves the collection of ocean samples from various regions to analyze the diversity of organisms, their genetic content, and their interactions with the environment.
In this dataset, we will have access to a **NOG (Non-supervised Orthologous Groups) abundance dataset**, measured across various oceans and at various depths. NOGs are clusters of genes from different species that share a common evolutionary origin and similar functions. These groups are formed computationally, without relying on manual annotations, and are valuable for understanding gene relationships and functions across diverse organisms. NOGs play a key role in comparative genomics and evolutionary studies by revealing insights into the conservation of genes and their roles in different species.
NOGs have been measured in **various oceans** (SPO: South Pacific Ocean, NAO: North Atlantic Ocean, IO: Indian Ocean, RS: Red Sea, MS: Mediterranean Sea, NPO: North Pacific Ocean, SO: Southern Ocean, SAO: South Atlantic Ocean) and at **various depths** (DCM: Deep Chlorophyll Maximum, MES: Mesopelagic Zone, MIX: Mixing Layer, SRF: Surface Layer)
This dataset was obtained from [Mibiomics](https://gitlab.univ-nantes.fr/combi-ls2n/mibiomics/-/tree/master/data/data_tara)
Access the dataset from IFB on demand: `/shared/projects/tp_etbii_2024_165650/Networks/TaraOcean_mibiomics/`
For this dataset, we can only apply WGCNA to the `TARAoceans_proNOGS.csv` file. Therefore, the integrative analysis part of this tutorial cannot be applied to the TARA data. However, you can still run the rest of the tutorial on the proNOGS dataset, together with the metadata contained in `TARAoceans_metadata.csv`.
## CLL dataset
The Chronic lymphocytic leukaemia (CLL) dataset includes mRNA expression, methylation data and viability values in response to different drugs from CLL samples.
To access this dataset, you need to install and load the `MOFAdata` R library.
```{r CLLdata, eval=FALSE, message=FALSE}
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("MOFAdata")
library("MOFAdata")
data("CLL_data")
names(CLL_data)
```
We will use three modalities:
* `CLL_data$mRNA`, which contains mRNA data for various CLL samples,
* `CLL_data$Methylation`, which contains methylation data from the same samples,
* `CLL_data$Drugs`, which contains viability response to various drugs and concentrations.
The sample metadata are not available from `MOFAdata`; you will have to access them from IFB on demand: `/shared/projects/tp_etbii_2024_165650/Networks/CLL/`
Use the command `?CLL_data` to learn more about this dataset.
## Tomato dataset
The tomato dataset was obtained from the authors of the following publication:
Isma Belouah and others, Modeling Protein Destiny in Developing Fruit, Plant Physiology, Volume 180, Issue 3, July 2019, Pages 1709–1724, https://doi.org/10.1104/pp.19.00086
It contains mRNA and protein abundance data from tomato plants (*Solanum lycopersicum*) at various development stages. In this dataset, the mRNA and proteins are paired, meaning that expression is measured for the same genes at both the mRNA and protein levels.
The dataset encompasses information gathered from tomato plants at different developmental points, representing various **Days Post Anthesis (DPA)** and distinct **Growth Stages**. 'Days Post Anthesis' (DPA) signifies the count of days that have elapsed since the opening of a flower, serving as a marker to track the temporal progression of plant development. On the other hand, 'Growth Stages' denote specific phases within the tomato plant's lifecycle, spanning germination, vegetative growth, flowering, fruit development, and ripening.
The dataset was kindly provided to us by the authors. We have made slight adjustments to it, and made it accessible on IFB on demand: `/shared/projects/tp_etbii_2024_165650/Networks/Tomato/`
For this dataset, we have access to 2 modalities:
* `mrna.tsv`, contains mRNA expression data (already normalized)
* `prots.csv`, contains protein abundance data (already normalized)
We will also use the file `samples_metadata.tsv`, which contains samples metadata (DPA, Growth Stage).
## WallOmics dataset
The WallOmics dataset was obtained from the `WallomicsData` R library. This dataset contains phenomics, metabolomics, proteomics and transcriptomics data collected from two organs of five ecotypes of the model plant *Arabidopsis thaliana* exposed to two temperature growth conditions. For more information, you can refer to the following publications:
* Duruflé, Harold, et al. "A powerful framework for an integrative study with heterogeneous omics data: from univariate statistics to multi-block analysis." Briefings in Bioinformatics 22.3 (2021): bbaa166.
* Duruflé, Harold, et al. "An integrative study showing the adaptation to sub-optimal growth conditions of natural populations of Arabidopsis thaliana: A focus on cell wall changes." Cells 9.10 (2020): 2249.
For this dataset, we can run WGCNA on the transcriptomics and proteomics data obtained from the two organs. For the analysis to run faster, we will use the CellWall (CW) datasets, which only contain the expression of cell wall proteins. The datasets can be accessed with the following R commands:
```{r WallomicsData, eval=FALSE, message=FALSE}
install.packages("WallomicsData")
library("WallomicsData")
data("Transcriptomics_Rosettes_CW")
data("Transcriptomics_Stems_CW")
data("Proteomics_Rosettes_CW")
data("Proteomics_Stems_CW")
```
Use `?Transcriptomics_Rosettes_CW` and `?Transcriptomics_Stems_CW` for more information on the transcriptomics datasets.
Use `?Proteomics_Rosettes_CW` and `?Proteomics_Stems_CW` for more information on the proteomics datasets.
Additionally, we will use the other data available as metadata. This includes:
* the `Ecotype` dataset
* the `Temperature` dataset
* the `Genetic_Cluster` dataset
* the `Altitude_Cluster` dataset
* the `Metabolomics_Rosettes` dataset for rosettes
* the `Metabolomics_Stems` dataset for stems
* the `Phenomics_Rosettes` dataset for rosettes
* the `Phenomics_Stems` dataset for stems
In the upcoming sections, our primary focus will be on the breast cancer dataset obtained from TCGA. However, it's important to note that **the procedure we will discuss can also be applied to the other datasets**.
## Team organization
To run the multi-omics analysis, it is necessary to repeat the WGCNA pipeline for each modality. To facilitate this, we suggest **dividing into smaller groups**, with **each group assigned to a specific dataset**. Within each group, **individual members can handle one modality of the multi-omics dataset**, before the **whole group handles the data integration part**.
# Biological context (Breast cancer dataset)
The Cancer Genome Atlas (TCGA) is a cancer genomics project that catalogs cancer-associated genomic profiles using several approaches (e.g. gene expression, copy number variation, DNA methylation). The website is available [here](https://portal.gdc.cancer.gov).
```{r TCGA, echo = FALSE, fig.cap = "Screenshot of the TCGA main page", fig.align = "center", out.width = "100%"}
include_graphics("Figures/TCGA_Overview.png")
```
TCGA makes a large amount of data available. In this tutorial, we will use mRNA expression data, miRNA expression data and protein abundance data from breast cancer samples. Metadata about the samples are also available, such as the cancer subtype, age at diagnosis and so on.
Our objective is to build a correlation network for each modality to identify modules of tightly correlated features associated with our clinical metadata. Then, another correlation analysis between the modules from the different modalities will help us identify multi-omics associations in breast cancer.
# Input data
## In summary
**Preparation** of the data is the first and most **important** step before running any analysis. Indeed, the data need to be prepared correctly in order to extract relevant information.
1. **Normalization** of the data, according to the data type and the analysis to perform. For RNA-seq data, WGCNA recommends using the `varianceStabilizingTransformation` from the DESeq2 package, or a log-transformation of normalized counts (RPKM/FPKM); see the sketch after this list.
2. Removing **missing values**
3. Removing **outliers** based on sample hierarchical clustering
4. Adding **external information**
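For instance, for raw RNA-seq counts, a possible normalization sketch (not run here; `counts` is a hypothetical matrix with genes in rows and samples in columns) could be:
```{r vstSketch, eval = FALSE}
# Hypothetical sketch for raw RNA-seq counts: apply DESeq2's variance
# stabilizing transformation, then transpose so that samples end up in rows
library("DESeq2")
vstCounts <- varianceStabilizingTransformation(as.matrix(counts))
datExpr <- t(vstCounts)
```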
Data must be **matrices** with specific shapes:
- **samples** (e.g. patients, organisms ...) are displayed on the **rows**
- **variables** (e.g. genes, proteins ...) are displayed on the **columns**
## Load your dataset
Choose a dataset and a modality, and load it into R.
You will probably need to read a csv/tsv file: look at the documentation for the function ```read.table```:
```{r doc_read.table}
?read.table
```
If your data is not stored into a csv/tsv file and you don't know how to load it, don't hesitate to ask us.
**DIY**: Load the modality you selected.
```{r loadData, class.source = 'fold-hide'}
dataset = read.table("TCGA_data/mrna.csv", sep=",", header=T)
```
**DIY**: Reshape the matrix. We need the samples in rows and the genes in columns. You can use sample ids as row names and gene names as column names.
```{r reshapeData, class.source = 'fold-hide'}
rownames(dataset) = dataset$X
dataset = dataset[,-1]
dataset = t(dataset)
head(dataset[,1:5])
```
**DIY**: Check if your dataset is normalized. If not, do it!
If you don't know which type of normalization/transformation to apply to your omics dataset, you can use the centered log-ratio transformation: the `clr` function from the `compositions` R library.
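As an illustration, here is a minimal sketch of how the clr transformation could be applied (only if your modality is not already normalized; the `+1` offset is an assumption to avoid taking the logarithm of zero):
```{r clrSketch, eval = FALSE}
library("compositions")
# clr() treats each row (one sample) as a composition; the +1 offset avoids log(0)
datasetClr <- unclass(clr(dataset + 1))
head(datasetClr[, 1:5])
```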
For the mRNA modality of the breast cancer dataset, the data is already normalized, but we still need to log-transform it. To avoid errors, we usually add an offset of 1 when log-transforming, which ensures that the logarithm is never applied to zero values.
```{r log2}
dataset = log2(1+dataset)
head(dataset[,1:5])
```
## Run WGCNA quality check
WGCNA provides the function **goodSamplesGenes** to check the quality of the input data and remove features and samples with too many missing values, as well as genes with zero variance.
**DIY**: Read the documentation for this function using ```?goodSamplesGenes``` and check whether any sample and/or feature should be removed.
```{r checkWGCNA, eval = TRUE, message = FALSE, class.source = 'fold-hide', results='hide'}
gsg <- goodSamplesGenes(datExpr = dataset, verbose = 3)
print(paste("All ok?", gsg$allOK))
```
**DIY**: According to your results, you may need to remove problematic samples and features from your dataset!
To do so, keep the samples and features contained in ```gsg$goodSamples``` and ```gsg$goodGenes```.
```{r rmSamplesGenes, eval = TRUE, message = FALSE, results = "hide", class.source = 'fold-hide'}
sprintf("Removing %d features", ncol(dataset) - sum(gsg$goodGenes))
sprintf("Removing %d samples", nrow(dataset) - sum(gsg$goodSamples))
dataset = dataset[gsg$goodSamples, gsg$goodGenes]
```
## Checking for outliers
After running WGCNA quality check, we also need to check if we have any outlier sample.
1. We compute the **Euclidean distance** between each **pair of samples** using the ```dist``` function.
2. Then, we perform a **hierarchical clustering** on the distance data using the ```hclust``` function.
3. We visualize the dendrogram to identify potential outliers.
**DIY**: Try to perform the first two steps. Don't forget to use ```?``` to know which input give to each function!
```{r hclust, class.source = 'fold-hide'}
samplesTree <- hclust(d = dist(dataset), method = "average")
```
Now, let's plot the dendrogram.
```{r dendrogramPatients, fig.cap = "Samples clustering using hierarchical clustering method, to detect outliers", fig.align = "center", out.width = "70%"}
plot(samplesTree, main = "Samples dendrogram", sub = "", xlab = "", cex.lab = 1, cex.axis = 1, cex.main = 1, cex = 0.5)
abline(h = 28, col = "red")
```
What do you think?
**DIY**: Try different thresholds on the Height to cut the dendrogram using the ```abline``` function, and choose one that fits your dataset.
**DIY**: Now let's cluster our samples using the selected height, using the function ```cutreeStatic```.
```{r curTree, class.source = 'fold-hide'}
clust <- cutreeStatic(samplesTree, cutHeight = 28)
table(clust)
```
**DIY**: Let's remove samples from cluster 0 (outliers) and save the final number of samples and features into variables `nSamples` and `nFeatures`.
```{r rmSamples, class.source = 'fold-hide'}
keep <- clust == 1
dataset <- dataset[keep,]
nFeatures <- ncol(dataset)
nSamples <- nrow(dataset)
sprintf("%d features, %d samples",nFeatures, nSamples)
```
## Metadata
We have access to metadata information for every sample.
This time, we may have some categorical data (e.g. subtype), so let's use the ```stringsAsFactors``` parameter of the ```read.table``` function to specify that strings should be treated as factors (classes).
**DIY**: Load your metadata file. Make sure any categorical data is read as a *factor*.
```{r metadata, class.source = 'fold-hide', results='hide'}
metadata <- read.table(file = "TCGA_data/clinical.csv",
sep = ",", head = TRUE, stringsAsFactors = TRUE)
head(metadata)
class(metadata$subtype)
```
If needed, reshape your metadata so that it matches the samples in your dataset. For the breast cancer dataset, we also want to rename our clinical features with better colnames.
```{r metadataReshape, class.source = 'fold-hide'}
rownames(metadata) = metadata$X # Keep our sample ids as rownames
metadata = metadata[,-1] # Now we can remove the X col
names(metadata) = c("AgeDiagnosis", "AnatomicNeoplasm", "Gender", "VitalStatus", "Subtype") # Rename columns
metadata = metadata[rownames(dataset),] # We keep only samples that are present in our dataset
head(metadata)
```
**Note for the CLL dataset**: Notice that there are several missing values in the metadata. This will cause issues in the downstream analysis (notably when characterizing our modules), so we should impute them. To do so, we will replace the missing values with:
* mean values for numerical data
* randomly sampled values for categorical data
Here are some useful functions to do so:
* `is.na` for determining if a value is NA
* `which` to locate elements of a vector that satisfy a condition
* `na.omit` to remove NA values from a vector
* `sample` to randomly select elements from a list
* `mean` to compute the mean value of a vector
Try to combine those functions to impute your missing values.
Here are two examples:
```{r missingDatImpute, class.source = 'fold-hide', eval=FALSE}
metadata$Gender[which(is.na(metadata$Gender))] = sample(na.omit(metadata$Gender),1)
metadata$age[which(is.na(metadata$age))] = mean(na.omit(metadata$age))
```
Do the same for any column that has missing values. Don't forget that in the end, all categorical data should be a factor! You can use the `class` function to determine the class of a column.
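As an illustration, a possible generic loop imputing every column at once could look like this (a sketch, adapt it to your metadata):
```{r imputeAllSketch, eval = FALSE, class.source = 'fold-hide'}
# Impute every column that has missing values: mean for numeric columns,
# randomly sampled observed values for categorical columns
for (col in names(metadata)) {
  missing <- which(is.na(metadata[[col]]))
  if (length(missing) == 0) next
  if (is.numeric(metadata[[col]])) {
    metadata[[col]][missing] <- mean(na.omit(metadata[[col]]))
  } else {
    metadata[[col]][missing] <- sample(na.omit(metadata[[col]]), length(missing), replace = TRUE)
  }
}
```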
Let's replot our sample dendrogram and visualize the metadata associated to each sample.
**DIY**: Create a new dataframe from your metadata and convert categorical traits into colors using the `labels2colors` function. Convert numerical traits into colors using the function `numbers2colors`.
```{r trait2Color, class.source = 'fold-hide'}
traitColors <- labels2colors(metadata)
traitColors[,1] <- numbers2colors(metadata[,1]) # The first column is numerical
head(traitColors)
```
Now we can plot the sample dendrogram again and visualize the traits associated to each sample with the `plotDendroAndColors` function.
```{r dataDendrogram, fig.cap = "Samples clustering using hierarchical clustering method with metadata information", fig.align = "center", out.width = "70%"}
samplesTree <- hclust(d = dist(dataset), method = "average")
plotDendroAndColors(dendro = samplesTree, colors = traitColors, groupLabels = names(metadata),
main = "Patients dendogram and trait heatmap",cex.dendroLabels = 0.5, cex.colorLabels = 0.5)
```
# Construction of the correlation network
## Correlation for dummies
WGCNA detects "co-expressed" features by performing a correlation analysis. Let's visualise what two co-expressed features looks like. Use the **plot** function to visualize the associations of two variables, and compute the Pearson and Spearman correlation coefficient using the **cor** function. If you are working on the mRNA dataset from TCGA, let's look at genes COL1A1 and COL1A2, two collagen-coding genes, which are known to be co-expressed.
```{r cor}
plot(dataset[,"COL1A1"],dataset[,"COL1A2"])
cor(dataset[,"COL1A1"],dataset[,"COL1A2"], method = "pearson")
cor(dataset[,"COL1A1"],dataset[,"COL1A2"], method = "spearman")
```
**What is the difference between the Pearson and Spearman methods?** The Pearson correlation measures **linear** associations between variables, whereas the Spearman correlation measures **monotonic** associations. Which one should you choose for your dataset? Both can give interesting results, so it's completely up to you!
By default, WGCNA uses the Pearson correlation. If you want to use the Spearman correlation, don't forget to specify it whenever a WGCNA function exposes a parameter controlling the correlation method!
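For example, base R's `cor` takes a `method` argument, while WGCNA functions such as `adjacency` or `pickSoftThreshold` expose it through their `corFnc`/`corOptions` arguments. A sketch (not run; the power of 6 is only a placeholder):
```{r spearmanSketch, eval = FALSE}
# Spearman correlation with base R
cor(dataset[, "COL1A1"], dataset[, "COL1A2"], method = "spearman")
# Spearman correlation within a WGCNA function, passed through corOptions
adjSpearman <- adjacency(dataset, power = 6,
                         corOptions = "use = 'p', method = 'spearman'")
```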
## Computing the correlation network
The first step of the analysis consists of creating the **feature correlation network**. It is composed of three main substeps:
1. **Power** selection
- **soft thresholding** is a strategy introduced by WGCNA to transform correlation coefficients for building a weighted correlation network
- the threshold should be set such that it makes the network **scale-free**
2. **Similarity matrix** construction
- first, **correlations** are calculated between each **pair of features**
- then, the correlations are **transformed** using the selected soft-power
3. **Topological overlap matrix** construction (TOM)
- the similarity matrix is transformed into a TOM to facilitate the module detection
The following command allows multi-threading within WGCNA. Skip it if you are running RStudio or another third-party R environment.
```{r parallel, echo = TRUE, eval = FALSE}
enableWGCNAThreads()
```
## Power selection
Correlation networks are by nature complete, meaning that every feature is connected to every other feature, with a weight given by the correlation coefficient. To extract only meaningful associations, the correlation network must be thresholded.
The most straightforward way of thresholding a correlation network is **hard-thresholding**, i.e. two features are considered correlated if their absolute correlation coefficient exceeds a user-defined threshold.
WGCNA proposes another strategy, called **soft-thresholding**. It consists of raising the correlation coefficients to a certain **power**. This shrinks low correlation coefficients towards 0 while preserving the high ones, emphasizing strong associations over weak ones.
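A quick numeric illustration of this effect (a power of 6 is taken here as an arbitrary example):
```{r softPowerToy}
# Raising correlations to a power shrinks weak correlations much more than strong ones
c(weak = 0.2, strong = 0.9)^6
```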
Furthermore, WGCNA presents a function called ```pickSoftThreshold``` that aims to automatically determine the ideal power for your dataset. The optimal power is defined as the one that ensures your network exhibits a **scale-free** property. Let's use it!
First, we need to define a range of powers to try.
```{r powerSelection}
powers <- c(c(1:10), seq(from = 12, to = 20, by = 2))
powers
```
**DIY**: Use the `pickSoftThreshold` function to analyze the network topology of the correlation network for various soft-powers.
```{r thresholdCalculation, class.source = 'fold-hide'}
sft <- pickSoftThreshold(dataset, powerVector = powers, verbose = 5)
```
**DIY**: Which power should be used for your dataset? (Don't forget to look at the documentation of the `pickSoftThreshold` function to know how.)
```{r optPower, class.source = 'fold-hide'}
sprintf("Optimal soft-power = %d", sft$powerEstimate)
```
The optimal soft-power to use is defined by the topology of the soft-thresholded network, which has to be **scale-free**!
Scale-free networks are networks in which the majority of nodes have a low degree, and a small number of nodes have a large degree. Multiple studies have shown that most biological networks are scale-free!
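If you want to check the scale-free property yourself, one optional sketch (assuming `sft$powerEstimate` is not `NA`) uses the `softConnectivity` and `scaleFreePlot` functions:
```{r scaleFreeCheck, eval = FALSE}
# Optional check: compute each node's connectivity at the chosen power and
# inspect how well its distribution fits a scale-free topology
k <- softConnectivity(dataset, power = sft$powerEstimate)
scaleFreePlot(k)
```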
Let's visualize the difference between the raw correlations and the soft-thresholded correlations:
```{r rawVSsoft, fig.width=8, fig.height=5}
corRaw = abs(cor(dataset))
corSoft = abs(corRaw**sft$powerEstimate)
par(mfrow=c(1,2))
hist(corRaw, main="Raw correlations")
hist(corSoft, main="Soft-thresholded correlations")
```
Sometimes, WGCNA can't find the optimal soft-threshold. In such cases, the variable ```sft$powerEstimate``` contains ```NA```. If the optimal soft-power was estimated but seems low, you might want to increase it.
In both cases, you can plot the following figures to help you decide:
```{r powerPlots, out.width = "50%", fig.show = "hold", fig.cap = "Soft Threshold Indices. Left: Scale Independence plot - Right: Mean Connectivity plot", fig.align = "center"}
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2],
xlab = "Soft Threshold power", ylab = "Scale Free Topology Model Fit, R²",
type = "n", main = "Scale independence", cex.lab = 1.3)
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2], labels = powers, cex = 1, col = "black")
abline(h = 0.80, col = "red")
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab = "Soft Threshold power", ylab = "Mean connectivity",
type = "n", main = "Mean connectivity", cex.lab = 1.3)
text(sft$fitIndices[,1], sft$fitIndices[,5], labels = powers, cex = 1, col = "black")
```
The **Scale Independence plot** (left) shows the scale-free topology fit index for each power. The authors recommend selecting a power with **R² > 0.8**; the red line corresponds to this R² = 0.80 cut-off.
The **Mean Connectivity plot** (right) shows the average connectivity for each power. A balance must be found between a network that is too densely connected and one that is too sparse.
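If `sft$powerEstimate` is `NA` (or looks too low), you can set the power manually based on these plots; a minimal sketch, where the value 9 is only a placeholder:
```{r manualPower, eval = FALSE}
# Manually override the power when no estimate was returned: pick the smallest
# power reaching R² > 0.8 on the Scale Independence plot above
if (is.na(sft$powerEstimate)) {
  sft$powerEstimate <- 9
}
```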
## Similarity matrix construction
WGCNA provides the function ```adjacency``` to compute a feature similarity matrix. This function operates in three steps:
1. Compute the feature **correlation** coefficients
2. Transform the correlation coefficients depending on whether you are interested in **absolute correlations** (unsigned), **signed correlations** (signed) or **positive correlations** (signed hybrid). For more details on these 3 types, check ```?adjacency```. By default, the constructed network is an **unsigned network**, which means that the sign of the correlation is discarded (absolute correlations).
3. Finally, the transformed correlations are raised to the previously estimated soft-threshold power.
**DIY**: Use the `adjacency` function to compute the similarity matrix. Don't forget to provide the soft-power you defined.
```{r simMatrix, class.source = 'fold-hide'}
softPower = sft$powerEstimate
similarityMat <- adjacency(dataset, power = softPower)
head(similarityMat[, c(1:5)])
```
## Topological overlap matrix (TOM)
From the **Similarity matrix**, WGCNA proposes to extract correlated feature modules (co-expressed gene modules) based on **Topological Overlap**. The idea is that two features (genes) that share a high number of correlated (co-expressed) neighbors are likely to participate in the same correlation (co-expression) module.
The **Topological Overlap Matrix (TOM)** is a concept used in network analysis to quantify the similarity or interconnectedness between nodes in a network.
**DIY**: Transform the similarity matrix into a **Topological Overlap Matrix** that incorporates the level of overlapping between the neighborhoods of two nodes (features/genes), using the function ```TOMsimilarity```.
Notice that the output does not contain our row and column names (i.e. node names). Make sure to add them back!
```{r TOM, message = "hold", warning = FALSE, class.source = 'fold-hide'}
TOM <- TOMsimilarity(similarityMat)
colnames(TOM) <- colnames(similarityMat)
rownames(TOM) <- rownames(similarityMat)
head(TOM[, c(1:5)])
```
Because most clustering strategies operate on distances (rather than similarities), we transform the TOM into a **dissimilarity matrix**.
**DIY**: Hint: distance = 1 - similarity
```{r dissTOM, class.source = 'fold-hide'}
dissTOM <- 1 - TOM
head(dissTOM[, c(1:5)])
```
# Module detection
## In summary
Modules are groups of highly-connected nodes (features). This step is composed of two substeps:
1. **Clustering**: getting correlated feature modules
- Use the ```hclust``` function to perform **hierarchical clustering**
- Display the dendrogram
- Use the **Dynamic Tree Cut** method (```cutreeDynamic``` function) to obtain the modules
2. **Merging of modules**: merging close modules
- If they are too similar, modules should be merged
- We compute module-module similarities by correlating their eigenvectors (eigengenes)
- We merge two modules if their eigenvectors are highly correlated (the authors suggest a Pearson **correlation r > 0.75**)
## Clustering
First, a **hierarchical clustering** is performed to group genes based on the TOM dissimilarity matrix. The corresponding dendrogram is displayed into the Figure \@ref(fig:genesClustering).
```{r genesClustering, fig.cap = "Genes clustering using hierarchical clustering using the TOM dissimilarity matrix.", fig.align = "center", out.width = "70%"}
geneTree <- hclust(as.dist(dissTOM), method = "average")
plot(geneTree, xlab = "", sub = "", main = "Gene clustering on TOM-based dissimilarity", labels = FALSE, hang = 0.04)
```
The leaves represent the features (genes). The dendrogram branches group together densely interconnected and highly correlated features.
**DIY**: To identify modules from dendrograms, several methods exist. Here, we use the `cutreeDynamic` function.
First, we have to set the minimum module size `minClusterSize` (i.e. the minimum number of genes inside a group/module). The default is 20, but you can try several values and choose according to your results. There are other parameters you can play with, such as `deepSplit`. Look at the documentation to select the parameters.
```{r dynamicMods, class.source = 'fold-hide'}
minModuleSize <- 30
dynamicMods <- cutreeDynamic(dendro = geneTree,
distM = dissTOM,
deepSplit = 2,
pamRespectsDendro = FALSE,
minClusterSize = minModuleSize)
table(dynamicMods)
```
After running `cutreeDynamic`, we have obtained our correlation modules.
Note that **module 0 is not a correlation module**! All unassigned nodes are automatically labelled 0.
**DIY**: Instead of using numbers as module names, we will use the function `labels2colors` to assign a color to each module. The biggest module is assigned `turquoise`.
```{r label2colors, class.source = 'fold-hide'}
dynamicColors <- labels2colors(dynamicMods)
table(dynamicColors)
```
Note that the **grey module** (module 0 in the previous step) **is not a correlation module**! All unassigned nodes are automatically labelled 0/grey.
Let's visualize the obtained modules on the previous dendrogram:
```{r moduleIdentificationDendrogram, fig.cap = "Modules identification using dynamic tree cut function. A color is assigned to each module. The grey one contains only unassigned genes.", fig.align = "center", out.width = "70%"}
plotDendroAndColors(dendro = geneTree, colors = dynamicColors, groupLabels = "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
```
## Merging modules {#MEs}
Sometimes, especially when dealing with smaller clusters (for instance, when the `minClusterSize` parameter is not set properly), merging modules with similar expression profiles might be useful.
To do so, we first compute each module's **eigenvector**. A module eigenvector (or **eigengene** if you are dealing with gene data) summarizes the expression profile of the whole corresponding module (it is defined as its first principal component). We can use the correlation between module eigenvectors to define a module-similarity metric and merge close modules.
**DIY**: To compute the module eigenvectors, use the function ```moduleEigengenes```. Order the obtained eigenvectors using ```orderMEs```. This function reorders input eigenvectors based on their correlation.
```{r MEs, class.source = 'fold-hide'}
MEList <- moduleEigengenes(dataset, colors = dynamicColors)
MEs <- MEList$eigengenes
head(MEs)
MEs <- orderMEs(MEs)
head(MEs)
```
**DIY**: Now let's perform a **hierarchical clustering** using the correlation between module eigenvectors.
Remember that ```hclust``` needs **distances**! Compute those distances and transform them with ```as.dist``` before running ```hclust```.
```{r MEdiss, class.source = 'fold-hide'}
MEDiss <- 1 - cor(MEs)
head(MEDiss)
METree <- hclust(as.dist(MEDiss))
```
**DIY**: WGCNA's authors recommend merging modules with a correlation greater than **0.75**, hence a distance lower than **0.25**. Let's look at the tree with ```plot``` and ```abline```, just as we did previously in this tutorial.
```{r MEdissPlot, class.source = 'fold-hide'}
MEDissCut <- 0.25
plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")
abline(h = MEDissCut, col = "red")
```
**In our case, we don't need to merge our modules, but if you do:**
If you want to merge modules, you can use the `mergeCloseModules` function as below:
```{r MEMerging, eval = FALSE, echo = TRUE}
merge <- mergeCloseModules(dataset, dynamicColors, cutHeight = MEDissCut, verbose = 3)
mergedColors <- merge$colors
mergedMEs <- merge$newMEs
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
```
```{r MEMergingObject, eval = FALSE, echo = TRUE}
moduleColors <- mergedColors
colorOrder <- c("grey", standardColors(50))
moduleLabels <- match(moduleColors, colorOrder) - 1
MEs <- mergedMEs
```
## Save the results
**DIY**: Let's save those clustering results! Choose your favorite export format (Rdata, csv, tsv, ...) and save the results. Don't forget to make sure you have all the information you need (in this case, we need the node names and their cluster labels).
```{r saveModules, class.source = 'fold-hide'}
moduleColors = dynamicColors # Or mergedColors, if some modules have been merged
names(moduleColors) = colnames(dataset)
head(moduleColors)
save(moduleColors, file = "results_TCGA/modules_mrna.rda")
```
**DIY**: We also need to save our module eigenvectors!
```{r saveMEs, class.source = 'fold-hide'}
save(MEs, file = "results_TCGA/MEs_mrna.rda")
```
# Module characterization
## In summary
We identified several modules in the previous step. Now we want to **characterize** them:
- Associations between **modules** and **external information**
- Are some modules associated with a trait of interest?
- Associations between **genes**, **external information** and **modules**
- **Gene Significance**: association between features (genes for expression data) and external information
- **Module membership**: features correlation to each module eigenvector
For all of the above, we can use correlation!
## Association between modules and phenotypic traits
We want to identify which modules are significantly associated with the (clinical) metadata we have. For this, we will use correlation.
For numeric variables, a simple scaling should be enough as pre-processing.
While not ideal, for categorical variables we will binarize the data using the ```binarizeCategoricalVariable``` function. This function provides two options: ```includePairwise``` to compare all classes against each other, and ```includeLevelVsAll``` to compare each class against all the others. Here, to limit the number of associations to test, we will only include level-vs-all comparisons.
However, note that binarization is not optimal, and other tests might be more appropriate, such as multinomial logistic regression models.
**DIY**: First, scale numeric data, and put the results in a new dataframe.
```{r metadataNum1, class.source = 'fold-hide'}
metadataNum <- data.frame("AgeDiagnosis"=scale(metadata$AgeDiagnosis), row.names = rownames(metadata))
```
**DIY**: Now, binarize categorical data and add it to the new dataframe.
```{r metadataNum2, class.source = 'fold-hide'}
AnatomicNeoplasm = binarizeCategoricalVariable(metadata$AnatomicNeoplasm,includePairwise = FALSE, includeLevelVsAll = TRUE)
rownames(AnatomicNeoplasm) = rownames(metadata)
metadataNum = cbind(metadataNum, AnatomicNeoplasm)
```
Do the same for all categorical variables:
```{r metadataNum3, class.source = 'fold-hide'}
# Gender and vital status are already binary
metadataNum$Gender = ifelse(metadata$Gender=="female", 0, 1)
metadataNum$VitalStatus = ifelse(metadata$VitalStatus=="alive", 1, 0)
Subtype = binarizeCategoricalVariable(metadata$Subtype,includePairwise = FALSE, includeLevelVsAll = TRUE)
rownames(Subtype) = rownames(metadata)
metadataNum = cbind(metadataNum, Subtype)
# We also include a numeric representation of the subtypes
metadataNum$Subtype = as.numeric(metadata$Subtype)-1
```
Now we can compute correlations between the modules eigenvectors and the metadata!
**DIY**: Use the functions ```cor``` and ```corPvalueStudent``` to compute correlations and associated p-values.
```{r traitCor, class.source = 'fold-hide'}
moduleTraitCor <- cor(MEs, metadataNum)
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nSamples)
```
Let's plot the results:
```{r plotTraitCor, fig.width=8, fig.height=5}
textMatrix <- paste(signif(moduleTraitCor, 2), "\n(", signif(moduleTraitPvalue, 1), ")", sep = "")
dim(textMatrix) <- dim(moduleTraitCor)
colnames(textMatrix) <- colnames(moduleTraitCor)
rownames(textMatrix) <- rownames(moduleTraitCor)
# Mask correlations that are not significant (p-value > 0.05)
for(r in row.names(moduleTraitCor)){
  for(c in colnames(moduleTraitCor)){
    if(moduleTraitPvalue[r, c] > 0.05){
      moduleTraitCor[r, c] <- 0
      textMatrix[r, c] <- ""
    }
  }
}
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(metadataNum),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = colorRampPalette(c("darkgreen", "white", "darkmagenta"))(50),
textMatrix = textMatrix,
setStdMargins = TRUE,
cex.text = 0.5,
zlim = c(-1, 1),
main = "Module-trait relationship")
```
## Gene relationship to traits and modules
Two measures are computed to quantify the association between nodes, traits, and modules:
- **Module Membership** (MM): correlation between the module eigengene and the features
- **Gene Significance** (GS): correlation between the features and a selected trait of interest
**DIY**: First, we compute the Module Membership. You need the module eigenvectors (MEs) and the features from the input dataset. Compute correlations using the ```cor``` function. Then, compute p-values using the ```corPvalueStudent``` function. Store the results in a dataframe.
```{r moduleMembership, class.source = 'fold-hide'}
moduleMembership <- as.data.frame(cor(dataset, MEs))
MMPvalue <- as.data.frame(corPvalueStudent(as.matrix(moduleMembership), nSamples))
modNames <- substring(names(MEs), 3)
names(moduleMembership) <- paste("MM", modNames, sep = "")
names(MMPvalue) <- paste("p.MM", modNames, sep = "")
head(moduleMembership[,1:5])
head(MMPvalue[,1:5])
```
**DIY**: Very similarly, we can compute the node significance with respect to our traits of interest contained in the metadata. Compute the correlations and p-values between your dataset and the metadata using the same functions as above.
```{r GS, class.source = 'fold-hide'}
geneTraitSignificance <- as.data.frame(cor(dataset, metadataNum, use = "p"))
GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
names(geneTraitSignificance) <- paste("GS", names(metadataNum), sep = "")
names(GSPvalue) <- paste("p.GS", names(metadataNum), sep = "")
```
**DIY**: We can visualize the results with a scatter plot. Pick a module and extract the set of nodes composing the selected module. Then pick a trait of interest. Plot the module membership values against the node significance using the `verboseScatterplot` function. You can do this for various modules and traits of interest.
```{r GSturquoise, class.source = 'fold-hide'}
module='turquoise'
moduleGenes = names(moduleColors)[which(moduleColors==module)]
geneTraitSignifColumn = "GSSubtype"
verboseScatterplot(abs(moduleMembership[moduleGenes, paste('MM', module, sep="")]),
abs(geneTraitSignificance[moduleGenes, geneTraitSignifColumn]),
xlab = paste("Module Membership (MM) in", module, "module"),
ylab = paste("Gene Significance (GS) for", geneTraitSignifColumn),
main = paste("Module Membership (MM) vs Gene Significance (GS)\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
```
**DIY**: Now that we have computed our correlated modules and characterized them, we can export our results for other downstream analysis.
```{r exportMMGS, class.source = 'fold-hide'}
resultsModuleMembership = cbind(moduleMembership, MMPvalue)
mmNames = names(resultsModuleMembership)
mmNames = mmNames[order(gsub("p\\.", "", mmNames))]
resultsModuleMembership = resultsModuleMembership[,mmNames]
resultsSignificance = cbind(geneTraitSignificance, GSPvalue)
gsNames = names(resultsSignificance)
gsNames = gsNames[order(gsub("p\\.", "", gsNames))]
resultsSignificance = resultsSignificance[,gsNames]
results = data.frame("Gene"=names(moduleColors), "Module"=unlist(moduleColors),
resultsModuleMembership, resultsSignificance)
results = results[order(results$Module),]
write.table(results, file="results_TCGA/TCGA_mRNA_results.tsv", sep="\t", row.names=F, col.names=T, quote=F)
head(results[,1:6])
```
# Integrative analysis
After applying the pipeline for various modalities, we can use the module eigenvectors computed for each modality to perform an integrative analysis and identify correlated multi-omic modules.
**DIY**: Load the module eigenvectors obtained for the various modalities.
```{r loadMEs, class.source = 'fold-hide'}
load("results_TCGA/MEs_mirna.rda")
mirna = MEs
names(mirna) = paste("mirna", names(mirna), sep="_")
load("results_TCGA/MEs_proteins.rda")
protein = MEs
names(protein) = paste("protein", names(protein), sep="_")
load("results_TCGA/MEs_mrna.rda")
mrna = MEs
names(mrna) = paste("mrna", names(mrna), sep="_")
```
**DIY**: Some samples might have been removed during the preprocessing step for some of the modalities. Just keep the samples that are present in all three datasets.
```{r filterMEs, class.source = 'fold-hide'}
samples = Reduce("intersect", list(rownames(mirna), rownames(protein), rownames(mrna)))
MEs = do.call("cbind", list(mirna[samples,], mrna[samples,], protein[samples,]))
```
**DIY**: Now we can compute module-module correlations across modalities! Do it with the `adjacency` function used at the beginning of this tutorial.
```{r integrationMEs, class.source = 'fold-hide'}
softPower = 1 # this time, we don't need a scale-free topology, so we just set softPower to 1
similarityMat <- adjacency(MEs, power = softPower)
```
We can visualize the correlations using `pheatmap`.
**DIY**: First, create an annotation dataframe to indicate the provenance (i.e. modality) of each module.
```{r annotDF, class.source = 'fold-hide'}
annot = data.frame("Modality"=gsub("_.*", "", colnames(similarityMat)))
rownames(annot) = colnames(similarityMat)
head(annot)
```
**DIY**: Now, plot the module correlation matrix using `pheatmap`. Look at the documentation for this function to know how to include your annotation dataframe, and play with the various parameters (e.g: clustering of the heatmap).
```{r heatmap, class.source = 'fold-hide'}
heatmap <- pheatmap(similarityMat, show_rownames = TRUE, show_colnames = TRUE, annotation_row = annot, annotation_col = annot, legend = T, cluster_cols = T, cutree_cols = 5, cluster_rows = T, cutree_rows = 5)
```
**DIY**: Extract the clusters of correlated modules.
```{r clustMap, class.source = 'fold-hide'}
clust <- cbind(similarityMat, cluster = cutree(heatmap$tree_col, k = 5))
clust = clust[,"cluster"]
clust[order(clust)]
```