This report was generated using:
R version 4.3.1 (2023-06-16)
1.72.1
1.0.12
You might also need the compositions
library for data normalization.
DIY: Load the WGCNA and pheatmap libraries.
library("WGCNA")
library("pheatmap")
Important Reminder: While WGCNA was initially developed for processing gene expression data, it can be adapted for other data types as well.
Throughout this tutorial, we will strive to present the material in a manner that is broadly applicable to different data modalities.
However, it is worth noting that certain WGCNA functions have names related to transcriptomics data (e.g., moduleEigengenes instead of eigenvectors). Moreover, we might use interchangeably the terms correlation and co-expression, or gene and feature.
Consequently, certain phrasings used in this tutorial may specifically reference transcriptomic data, even though the overall intent is to apply the methods to diverse data types. Please keep this in mind while following the tutorial, especially when applying it to non-transcriptomic data.
Weighted Gene Co-expression Network Analysis (WGCNA) is an approach dedicated to identifying correlation patterns between features of a dataset. In most cases, it is applied to gene expression data to find clusters -referred to as modules- of highly correlated genes. In transcriptomics, genes that exhibit strong correlations in their expression profiles are said co-expressed.
One of the underlying principles of WGCNA is the concept of guilt-by-association. This principle suggests that genes sharing similar expression patterns are likely to be functionally related or involved in similar biological processes. By leveraging guilt-by-association through correlation, WGCNA provides a powerful tool for identifying potential functional relationships among genes and deciphering their roles in complex biological systems.
WGCNA builds a co-expression network where:
The main analysis steps are presented in the Figure 2.1.
WGCNA is a method dedicated to transcriptomics data. However, some studies have shown its potential for inferring correlation networks on other omic datasets (see: MiBiOmics). In this tutorial, we will see how it can be used on various omics datasets to reveal correlated modules across different data modalities.
If you are interested in WGCNA, extensive tutorials are available online:
For this tutorial, you can pick a dataset among the ones proposed below.
This dataset includes measurements of mRNA, miRNA, and protein abundances taken from samples representing different molecular subtypes of breast cancer.
This dataset was obtained from Mibiomics
Download the dataset from the summer school’s GitHub repository
For this dataset, we will use 3 modalities:
mrna.csv
, contains mRNA expression datamirna.csv
, contains miRNA expression dataprotein.csv
, contains protein abundance dataWe will also use the file clinical.csv
, which contains samples metadata, including molecular subtypes.
Tara Ocean is a scientific initiative that aims to study and understand marine ecosystems on a global scale. It involves the collection of ocean samples from various regions to analyze the diversity of organisms, their genetic content, and their interactions with the environment.
In this dataset, we will have access to a NOG (Non-supervised Orthologous Groups) abundance dataset, measure accros various oceans and various depths. NOGs are clusters of genes from different species that share a common evolutionary origin and similar functions. These groups are formed computationally, without relying on manual annotations, and are valuable for understanding gene relationships and functions across diverse organisms. NOGs play a key role in comparative genomics and evolutionary studies by revealing insights into the conservation of genes and their roles in different species.
NOGs have been measured in various oceans (SPO: South Pacific Ocean, NAO: North Atlantic Ocean, IO: Indian Ocean, RS: Red Sea, MS: Mediterranean Sea, NPO: North Pacific Ocean, SO: Southern Ocean, SAO: South Atlantic Ocean) and at various depths (DCM: Deep Chlorophyll Maximum, MES: Mesopelagic Zone, MIX: Mixing Layer, SRF: Surface Layer)
This dataset was obtained from Mibiomics
Download the dataset from the summer school’s GitHub repository
For this dataset, we can only apply WGCNA on the TARAoceans_proNOGS.csv
. Therefore, the integrative analysis part of this tutorial cannot be applied on TARA data. However, you can still run the tutorial from the proNOGS dataset, using also the metadata contained in TARAoceans_metadata.csv
.
The Chronic lymphocytic leukaemia (CLL) dataset includes mRNA expression, methylation data and viability values in response to different drugs from CLL samples.
To access this dataset, you need to install and load the MOFAdata
R library.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("MOFAdata")
BiocManager
library("MOFAdata")
data("CLL_data")
names(CLL_data)
We will use three modalities:
CLL_data$mRNA
, which contains mRNA data for various CLL samples,CLL_data$Methylation
, which contains methylation data from the same samples,CLL_data$Drugs
, which contains viability response to various drugs and concentrations.The samples metadata are not available from MOFAdata
, you’ll have to download it from the summer school’s GitHub repository
Use the command ?CLL_data
to know more about this dataset.
The tomato dataset was obtained from the authors of the following publication: Isma Belouah and others, Modeling Protein Destiny in Developing Fruit, Plant Physiology, Volume 180, Issue 3, July 2019, Pages 1709–1724, https://doi.org/10.1104/pp.19.00086
It contains mRNA and protein abundance data from tomato plants (Solanum lycopersicum) at various development stages. In this dataset, the mRNA and proteins are paired, meaning that corresponding gene expression levels at both the mRNA and protein levels are examined.
The dataset encompasses information gathered from tomato plants at different developmental points, representing various Days Post Anthesis (DPA) and distinct Growth Stages. ‘Days Post Anthesis’ (DPA) signifies the count of days that have elapsed since the opening of a flower, serving as a marker to track the temporal progression of plant development. On the other hand, ‘Growth Stages’ denote specific phases within the tomato plant’s lifecycle, spanning germination, vegetative growth, flowering, fruit development, and ripening.
The dataset was kindly provided to us by the authors. We have made slight adjustments to it, and made it accessible on the summer school’s GitHub repository
For this dataset, we have access to 2 modalities:
mrna.tsv
, contains mRNA expression data (already normalized)prots.csv
, contains protein abundance data (already normalized)We will also use the file samples_metadata.tsv
, which contains samples metadata (DPA, Growth Stage).
In the upcoming sections, our primary focus will be on the Breast cancer dataset obtained from WGCNA. However, it’s important to note that the procedure we will discuss can also be applied to other datasets.
To run the multi-omics analysis, it is necessary to repeat the WGCNA pipeline for each modality. To facilitate this, we suggest dividing into smaller groups, with each group assigned to a specific dataset. Within each group, individual members can handle one modality of the multi-omics dataset, before the whole group handles the data integration part.
The Cancer Genome Atlas (TCGA) is a cancer genomic project that catalogs cancer-associated genomics profiles with several approaches (e.g. gene expression, copy number variation, DNA methylation, etc.). You can find the web site here.
TCGA makes available a high amount of data. In this tutorial, we will use mRNA expression data, miRNA expression data and protein abundance data from breast cancer samples. Metadata about the samples are also available, such as the subtype of cancer, age at diagnosis and so on.
Our objective is to build a correlation network for each modality to identify modules of tightly correlated features associated to our clinical metadata. Then, another correlation analysis between the modules from different modalities will help us identify multi-omics associations in breast cancer.
Preparation of the data is the first and most important step before running any analysis. Indeed, data need to be prepared correctly in order to extract relevant and pertinent information.
Data must be matrices with specific shapes:
Choose a dataset and a modality, and load it into R.
You will probably need to read a csv/tsv file: look at the documentation for the function read.table
:
?read.table
If your data is not stored into a csv/tsv file and you don’t know how to load it, don’t hesitate to ask us.
DIY: Load the modality you selected.
= read.table("TCGA_data/mrna.csv", sep=",", header=T) dataset
DIY: Reshape the matrix. We need the samples in rows and the genes in columns. You can use sample ids as row names and gene names as column names.
rownames(dataset) = dataset$X
= dataset[,-1]
dataset = t(dataset)
dataset head(dataset[,1:5])
## STC2 LOC96610 APOD FADS2 RPL9
## A0SH 6.809810 7.139342 6.399172 5.316216 5.916535
## A0SJ 9.802976 9.538014 10.667394 5.537015 3.204786
## A0SK 5.495795 7.440094 1.722198 8.555220 6.702598
## A0SO 2.926594 8.674870 4.771886 8.911656 6.122295
## A04N 11.579173 8.457158 6.579026 6.459865 3.572641
## A04P 5.011630 11.149950 3.497919 6.820260 8.007429
DIY: Check if your dataset is normalized. If not, do it!
If you don’t know which type of normalization/transformation to apply your omic dataset, use the Centered log ratio transformation: function clr
from the compositions
R library.
For the mRNA dataset of the breast cancer dataset, the data is already normalized, but we still need to log-transform it. To avoid errors, we usually define an offset of 1 when log-transforming the data, as this ensures that the logarithm is only applied to values greater than zero.
= log2(1+dataset)
dataset head(dataset[,1:5])
## STC2 LOC96610 APOD FADS2 RPL9
## A0SH 2.965288 3.024912 2.887364 2.659061 2.790050
## A0SJ 3.433357 3.397531 3.544410 2.708632 2.072032
## A0SK 2.699506 3.077259 1.444772 3.256289 2.945345
## A0SO 1.973278 3.274242 2.529043 3.309126 2.832342
## A04N 3.652965 3.241407 2.922013 2.899150 2.193028
## A04P 2.587756 3.602879 2.169258 2.967217 3.171115
WGCNA provides a function goodSamplesGenes to check the quality of input data and remove features and samples with too many missing data as well as genes with zero variance.
DIY: Read the documentation for this function using?goodSamplesGenes
and check if any sample and/or feature should be removed.
<- goodSamplesGenes(datExpr = dataset, verbose = 3)
gsg print(paste("All ok?", gsg$allOK))
DIY: According to your results, you may need to remove problematic samples and features from your dataset!
To do so, keep the samples and features contained in gsg$goodSamples
and gsg$goodGenes
.
sprintf("Removing %d features", ncol(dataset) - sum(gsg$goodGenes))
sprintf("Removing %d samples", nrow(dataset) - sum(gsg$goodSamples))
= dataset[gsg$goodSamples, gsg$goodGenes] dataset
After running WGCNA quality check, we also need to check if we have any outlier sample.
dist
function.hclust
function.DIY: Try to perform the first two steps. Don’t forget to use ?<functionName>
to know which input give to each function!
<- hclust(d = dist(dataset), method = "average") samplesTree
Now, let’s plot the dendrogram.
plot(samplesTree, main = "Samples dendrogram", sub = "", xlab = "", cex.lab = 1, cex.axis = 1, cex.main = 1, cex = 0.5)
abline(h = 28, col = "red")
What do you think?
DIY: Try different thresholds on the Height to cut the dendrogram using the abline
function, and choose one that fits your dataset.
DIY: Now let’s cluster our samples using the selected height, using the function cutreeStatic
.
<- cutreeStatic(samplesTree, cutHeight = 28)
clust table(clust)
## clust
## 0 1
## 1 378
DIY: Let’s remove samples from cluster 0 (outliers) and save the final number of samples and features into variables nSamples
and nFeatures
.
<- clust == 1
keep <- dataset[keep,]
dataset <- ncol(dataset)
nFeatures <- nrow(dataset)
nSamples sprintf("%d features, %d samples",nFeatures, nSamples)
## [1] "2000 features, 378 samples"
We have access to metadata information for every sample.
This time, we may have some Categorial data (e.g: subtype), so let’s use the stringAsFactors
parameters from the read.table
function, to specify that strings should be seen as factors (classes).
DIY: Load your metadata file. Make sure any categorial data is seed as a factor.
<- read.table(file = "TCGA_data/clinical.csv",
metadata sep = ",", head = TRUE, stringsAsFactors = TRUE)
head(metadata)
class(metadata$subtype)
If needed, reshape your metadata so that it matches the samples in your dataset. For the breast cancer dataset, we also want to rename our clinical features with better colnames.
rownames(metadata) = metadata$X # Keep our sample ids as rownames
= metadata[,-1] # Now we can remove the X col
metadata names(metadata) = c("AgeDiagnosis", "AnatomicNeoplasm", "Gender", "VitalStatus", "Subtype") # Rename columns
= metadata[rownames(dataset),] # We keep only samples that are present in our dataset
metadata head(metadata)
## AgeDiagnosis AnatomicNeoplasm Gender VitalStatus Subtype
## A0SH 39 left upper inner quadrant female alive LumA
## A0SJ 39 left female alive LumA
## A0SK 54 right upper outer quadrant female dead Basal
## A0SO 67 right female alive Basal
## A04N 66 right female alive LumA
## A04P 36 left female dead Basal
Note for the CLL dataset: Notice that we have several missing data in the metadata. This will cause some issues for downstream analysis (notably when characterizing our modules). We should impute these missing data to avoid these. To do so, we will replace the missing values with:
Here are some useful functions to do so:
is.na
for determining if a value is NAwhich
to locate elements of a vector that satisfy a conditionna.omit
to remove NA values from a vectorsample
to randomly select elements from a listmean
to compute the mean value of a vectorTry to combine those functions to impute your missing values.
Here are two examples:
$Gender[which(is.na(metadata$Gender))] = sample(na.omit(metadata$Gender),1)
metadata$age[which(is.na(metadata$age))] = mean(na.omit(metadata$age)) metadata
Do the same for any column that has missing values. Don’t forget that in the end, all categorical data should be a factor! You can une the class
function to determine the class of a column.
Let’s replot our sample dendrogram and visualize the metadata associated to each sample.
DIY: Create a new dataframe from your metadata and convert categorical traits into colors using the labels2colors
function. Convert numerical traits into colors using the function numbers2colors
.
<- labels2colors(metadata)
traitColors 1] <- numbers2colors(metadata[,1]) # The first column is numerical
traitColors[,head(traitColors)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "#FFD5CB" "mediumpurple2" "skyblue2" "antiquewhite4" "lightcoral"
## [2,] "#FFD5CB" "skyblue1" "skyblue2" "antiquewhite4" "lightcoral"
## [3,] "#FFA38D" "plum3" "skyblue2" "mediumorchid" "coral2"
## [4,] "#FF7A58" "darkolivegreen4" "skyblue2" "antiquewhite4" "coral2"
## [5,] "#FF7E5E" "darkolivegreen4" "skyblue2" "antiquewhite4" "lightcoral"
## [6,] "#FFE2DB" "skyblue1" "skyblue2" "mediumorchid" "coral2"
Now we can plot the sample dendrogram again and visualize the traits associated to each sample with the plotDendroAndColors
function.
<- hclust(d = dist(dataset), method = "average")
samplesTree plotDendroAndColors(dendro = samplesTree, colors = traitColors, groupLabels = names(metadata),
main = "Patients dendogram and trait heatmap",cex.dendroLabels = 0.5, cex.colorLabels = 0.5)
WGCNA detects “co-expressed” features by performing a correlation analysis. Let’s visualise what two co-expressed features looks like. Use the plot function to visualize the associations of two variables, and compute the Pearson and Spearman correlation coefficient using the cor function. If you are working on the mRNA dataset from TCGA, let’s look at genes COL1A1 and COL1A2, two collagen-coding genes, which are known to be co-expressed.
plot(dataset[,"COL1A1"],dataset[,"COL1A2"])
cor(dataset[,"COL1A1"],dataset[,"COL1A2"], method = "pearson")
## [,1]
## [1,] 0.9879923
cor(dataset[,"COL1A1"],dataset[,"COL1A2"], method = "spearman")
## [,1]
## [1,] 0.9865966
What is the difference between Pearson and Spearman methods? The Pearson correlation measures linear associations between variables, whereas Spearman correlation measures monotonous associations between variables. Which one do you want chose for your dataset? Both can have interesting results, so it’s completely up to you! By default, WGCNA uses the Pearson correlation. If you want to use the Spearman correlation, don’t forget to specify it each time you notice a method parameter controlling the correlation method in one of the WGCNA functions!
This first step of the analysis consists in creating the feature correlation network. This first step is composed of tree main substeps:
Skip this line if you run Rstudio or other third-party R environments. Allow multi-threading within WGCNA.
enableWGCNAThreads()
Correlation networks are by nature complete, which means every feature is associated to every other with a weight given by the correlation coefficient. To extract only meaningful associations, the correlation network must be thresholded.
The most straight-forward way of thresholding a correlation dataset is to perform a hard-thresholding, i.e. two features are considered correlated if their absolute correlation coefficient exceeds a user-defined threshold.
WGCNA propose another strategy, called soft-thresholding. It consists of raising the correlation coefficients to a certain power. This process transforms low correlation coefficients to values close to 0, while high correlation coefficients are transformed to values close to 1.
Furthermore, WGCNA presents a function called pickSoftThreshold
that aims to automatically determine the ideal power for your dataset. The optimal power is defined as the one that ensures your network exhibits a scale-free property. Let’s use it!
First, we need to define a range of powers to try.
<- c(c(1:10), seq(from = 12, to = 20, by = 2))
powers powers
## [1] 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
DIY: Use the pickSoftThreshold
function to analyze the network topology of the correlation network for various soft-powers.
<- pickSoftThreshold(dataset, powerVector = powers, verbose = 5) sft
## pickSoftThreshold: will use block size 2000.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 2000 of 2000
## Warning: executing %dopar% sequentially: no parallel backend registered
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.00469 0.170 0.749 337.000 3.46e+02 565.00
## 2 2 0.26100 -0.922 0.751 94.300 9.08e+01 229.00
## 3 3 0.59300 -1.200 0.879 35.300 2.96e+01 120.00
## 4 4 0.81800 -1.400 0.979 16.200 1.12e+01 79.90
## 5 5 0.86500 -1.550 0.982 8.590 4.49e+00 61.00
## 6 6 0.90100 -1.580 0.979 5.100 1.96e+00 48.70
## 7 7 0.90600 -1.560 0.975 3.280 8.97e-01 39.90
## 8 8 0.91200 -1.540 0.981 2.250 4.21e-01 33.40
## 9 9 0.91300 -1.490 0.975 1.610 2.07e-01 28.40
## 10 10 0.91100 -1.470 0.972 1.200 1.07e-01 24.50
## 11 12 0.90900 -1.430 0.967 0.718 3.04e-02 18.60
## 12 14 0.91000 -1.410 0.970 0.464 8.50e-03 14.60
## 13 16 0.89400 -1.410 0.932 0.317 2.57e-03 11.70
## 14 18 0.91000 -1.380 0.966 0.225 8.32e-04 9.47
## 15 20 0.92100 -1.340 0.976 0.165 2.84e-04 7.79
DIY: Which power should be used for your dataset? (don’t forget to look at the documentation of the pickSoftThreshold
fucntion to known how).
sprintf("Optimal soft-power = %d", sft$powerEstimate)
## [1] "Optimal soft-power = 5"
The optimal soft-power to use is defined by the topology of the soft-thresholded network, which has to be scale-free! Scale-free networks are networks in which the majority of nodes have a low degree, and a small number of nodes have a large degree. Multiple studies have shown that most biological networks are scale-free!
Let’s visualize the difference between the raw correlations and the soft-thresholded correlations:
= abs(cor(dataset))
corRaw = abs(corRaw**sft$powerEstimate)
corSoft par(mfrow=c(1,2))
hist(corRaw, main="Raw correlations")
hist(corSoft, main="Soft-thresholded correlations")
Sometimes, WGCNA can’t find the optimal soft-threshold. In such cases, the variable sft$powerEstimate
contains NA
. If the optimal soft-power was estimated but seems low, you might want to increase it.
In both cases, you can plot the following figures to help you decide:
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2],
xlab = "Soft Threshold power", ylab = "Scale Free Topology Model Fit, R²",
type = "n", main = "Scale independence", cex.lab = 1.3)
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2], labels = powers, cex = 1, col = "black")
abline(h = 0.80, col = "red")
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab = "Soft Threshold power", ylab = "Mean connectivity",
type = "n", main = "Mean connectivity", cex.lab = 1.3)
text(sft$fitIndices[,1], sft$fitIndices[,5], labels = powers, cex = 1, col = "black")
The Scale Independence plot (left) shows the fit indices for scale free topology for each power. The authors recommend to select a power with a R² > 0.8. The red line corresponds to using an R² cut-off of R²=0.80.
The Mean Connectivity plot (right) shows the average of connectivity for each power. A balance needs to be identified between a fully connected network and too much disconnected edges.
WGCNA provides the function adjacency
to compute a features similarity matrix. This function operates in three steps:
?adjacency
. By default, the constructed network is an unsigned network which means that sign are not reported into the network (absolute correlations).DIY: Use the adjacency
function to compute the similarity matrix. Don’t forget to provide the soft-power you defined.
= sft$powerEstimate
softPower <- adjacency(dataset, power = softPower)
similarityMat head(similarityMat[, c(1:5)])
## STC2 LOC96610 APOD FADS2 RPL9
## STC2 1.000000e+00 2.412996e-03 5.044921e-07 2.202682e-03 1.160268e-08
## LOC96610 2.412996e-03 1.000000e+00 3.146534e-04 1.155482e-04 1.256985e-07
## APOD 5.044921e-07 3.146534e-04 1.000000e+00 4.584558e-07 1.401583e-08
## FADS2 2.202682e-03 1.155482e-04 4.584558e-07 1.000000e+00 2.535449e-05
## RPL9 1.160268e-08 1.256985e-07 1.401583e-08 2.535449e-05 1.000000e+00
## ADAM6 6.429423e-04 4.443884e-01 1.380167e-03 5.118278e-06 3.567073e-07
From the Similarity matrix, WGCNA proposes to extract correlated feature modules (co-expressed gene modules) based on Topological Overlap. The idea is that two features (genes) that share a high number of correlated (co-expressed) neighbors are likely to participate in the same correlation (co-expression) module.
The Topological Overlap Matrix (TOM) is a concept used in network analysis to quantify the similarity or interconnectedness between nodes in a network.
DIY: Transform the similarity matrix into a Topological Overlap Matrix that incorporates the level of overlapping between the neighborhoods of two nodes (features/genes), using the function TOMsimilarity
.
Notice that the output does not contains our row names and column names (i.e. node names). Make sure to rename it!
<- TOMsimilarity(similarityMat) TOM
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
colnames(TOM) <- colnames(similarityMat)
rownames(TOM) <- rownames(similarityMat)
head(TOM[, c(1:5)])
## STC2 LOC96610 APOD FADS2 RPL9
## STC2 1.000000e+00 2.890102e-03 8.557899e-04 5.767491e-03 2.170248e-05
## LOC96610 2.890102e-03 1.000000e+00 2.225817e-03 1.041126e-03 1.103988e-05
## APOD 8.557899e-04 2.225817e-03 1.000000e+00 2.281934e-04 7.573427e-06
## FADS2 5.767491e-03 1.041126e-03 2.281934e-04 1.000000e+00 3.570641e-05
## RPL9 2.170248e-05 1.103988e-05 7.573427e-06 3.570641e-05 1.000000e+00
## ADAM6 1.196107e-03 1.319133e-01 1.809321e-03 3.135948e-04 4.804337e-06
DIY: Hint: distance = 1 - similarity
<- 1 - TOM
dissTOM head(dissTOM[, c(1:5)])
## STC2 LOC96610 APOD FADS2 RPL9
## STC2 0.0000000 0.9971099 0.9991442 0.9942325 0.9999783
## LOC96610 0.9971099 0.0000000 0.9977742 0.9989589 0.9999890
## APOD 0.9991442 0.9977742 0.0000000 0.9997718 0.9999924
## FADS2 0.9942325 0.9989589 0.9997718 0.0000000 0.9999643
## RPL9 0.9999783 0.9999890 0.9999924 0.9999643 0.0000000
## ADAM6 0.9988039 0.8680867 0.9981907 0.9996864 0.9999952
Modules are groups of highly-connected nodes (features). This step is composed of two substeps:
hclust
function to perform hierarchical clusteringDynamic Tree Cut
function to obtain the modulesFirst, a hierarchical clustering is performed to group genes based on the TOM dissimilarity matrix. The corresponding dendrogram is displayed into the Figure 7.1.
<- hclust(as.dist(dissTOM), method = "average")
geneTree plot(geneTree, xlab = "", sub = "", main = "Gene clustering on TOM-based dissimilarity", labels = FALSE, hang = 0.04)
The leaves represent the features (genes). The dendrogram branches group together densely interconnected and highly correlated features.
DIY: To identify modules from dendrograms, several methods exist. Here, we use the cutreeDynamic
function.
First, we have to set the minimum module size minClusterSize
(i.e. minimum number of genes inside a group/module). The default is 20, but you can try several values and choose according to you results. There are other parameters you can play with, such as deepSplit
. Look at the documentation to select the parameters.
<- 30
minModuleSize <- cutreeDynamic(dendro = geneTree,
dynamicMods distM = dissTOM,
deepSplit = 2,
pamRespectsDendro = FALSE,
minClusterSize = minModuleSize)
## ..cutHeight not given, setting it to 0.997 ===> 99% of the (truncated) height range in dendro.
## ..done.
table(dynamicMods)
## dynamicMods
## 0 1 2 3 4 5 6 7 8
## 121 457 414 371 303 147 91 56 40
After running cutreeDynamic
, we have obtained our correlation modules.
Note that module 0 is not a correlation module !! All unassigned nodes are automatically labelled 0.
DIY: Instead of using numbers as module names, we will use the function labels2colors
to assign a color to each module. The biggest module is assigned turquoise
.
<- labels2colors(dynamicMods)
dynamicColors table(dynamicColors)
## dynamicColors
## black blue brown green grey pink red turquoise
## 56 414 371 147 121 40 91 457
## yellow
## 303
Note that module grey (module 0 at previous step) is not a correlation module !! All unassigned nodes are automatically labelled 0/grey.
Let’s visualize the obtained modules on the previous dendrogram:
plotDendroAndColors(dendro = geneTree, colors = dynamicColors, groupLabels = "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
Sometimes, especially when dealing with smaller clusters (for instance when the minClusterSize parameter is not set properly), merging modules with similar expression profiles might be usefull.
To do so, we first compute each module eigenvectors. A module eigenvector (or eigengene if you are dealing with gene data) represents the mean expression profile of the whole corresponding module. We can use the correlation of modules eigengenes to define a module-similarity metric that we can use to merge close modules.
DIY: To compute the module eigenvectors, use the function moduleEigengenes
. Order the obtained eigenvectors using orderMEs
. This function reorders input eigenvectors based on their correlation.
<- moduleEigengenes(dataset, colors = dynamicColors)
MEList <- MEList$eigengenes
MEs head(MEs)
## MEblack MEblue MEbrown MEgreen MEgrey
## A0SH -0.05397373 -0.001812781 8.704370e-02 0.009554327 -0.04249332
## A0SJ -0.02352202 -0.016655975 2.679801e-02 0.025707793 0.01779475
## A0SK -0.17404817 -0.127446084 -7.661646e-02 -0.008179712 0.06381009
## A0SO -0.07745815 -0.079496161 -1.054047e-01 -0.013849727 0.06230835
## A04N -0.03146649 -0.003822502 4.806086e-02 0.005912970 0.02994775
## A04P 0.01736130 0.027381270 9.912705e-05 -0.131843977 0.02085333
## MEpink MEred MEturquoise MEyellow
## A0SH -0.09733225 -0.009707618 -0.006016097 -0.09115749
## A0SJ -0.08105159 0.064320489 0.020362698 -0.00463625
## A0SK -0.03738968 -0.100005686 -0.089068340 0.11699671
## A0SO -0.07455363 -0.060439328 -0.088204094 0.09227040
## A04N 0.05760808 0.022801142 0.041043282 -0.04263349
## A04P 0.07390190 0.030873637 -0.092340083 0.03871340
<- orderMEs(MEs)
MEs head(MEs)
## MEyellow MEbrown MEred MEblack MEblue
## A0SH -0.09115749 8.704370e-02 -0.009707618 -0.05397373 -0.001812781
## A0SJ -0.00463625 2.679801e-02 0.064320489 -0.02352202 -0.016655975
## A0SK 0.11699671 -7.661646e-02 -0.100005686 -0.17404817 -0.127446084
## A0SO 0.09227040 -1.054047e-01 -0.060439328 -0.07745815 -0.079496161
## A04N -0.04263349 4.806086e-02 0.022801142 -0.03146649 -0.003822502
## A04P 0.03871340 9.912705e-05 0.030873637 0.01736130 0.027381270
## MEpink MEgreen MEturquoise MEgrey
## A0SH -0.09733225 0.009554327 -0.006016097 -0.04249332
## A0SJ -0.08105159 0.025707793 0.020362698 0.01779475
## A0SK -0.03738968 -0.008179712 -0.089068340 0.06381009
## A0SO -0.07455363 -0.013849727 -0.088204094 0.06230835
## A04N 0.05760808 0.005912970 0.041043282 0.02994775
## A04P 0.07390190 -0.131843977 -0.092340083 0.02085333
DIY: Now let’s perform a hierarchical clustering using the correlation between module eigenvectors.
Remember that for using hclust
we need distances! Compute those distances and transform then with as.dist
before running hclust
.
<- 1 - cor(MEs)
MEDiss head(MEDiss)
## MEyellow MEbrown MEred MEblack MEblue MEpink
## MEyellow 0.0000000 1.314690e+00 1.4820148 1.0237382 9.691794e-01 1.067170e+00
## MEbrown 1.3146899 5.551115e-16 0.3435587 0.9154158 5.352808e-01 1.083306e+00
## MEred 1.4820148 3.435587e-01 0.0000000 0.9144473 6.159154e-01 9.229435e-01
## MEblack 1.0237382 9.154158e-01 0.9144473 0.0000000 4.880131e-01 9.557461e-01
## MEblue 0.9691794 5.352808e-01 0.6159154 0.4880131 3.330669e-16 9.807942e-01
## MEpink 1.0671697 1.083306e+00 0.9229435 0.9557461 9.807942e-01 7.771561e-16
## MEgreen MEturquoise MEgrey
## MEyellow 1.3131925 1.6050999 1.0337210
## MEbrown 0.9062175 1.0346148 0.9743145
## MEred 0.7740776 0.7698319 0.9730385
## MEblack 0.9683373 0.8827360 1.1884568
## MEblue 1.0837290 1.3047868 0.9926956
## MEpink 0.8743927 0.7882817 0.7369518
<- hclust(as.dist(MEDiss)) METree
DIY: WGCNA’s authors recommend to merge cluster with a correlation greater than 0.75, hence a distance lower than 0.25. Let’s look at the tree with plot
and abline
, just as we did previously in this tutorial.
<- 0.25
MEDissCut plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")
abline(h = MEDissCut, col = "red")
In our case, we dont need our modules merged, but if you need it:
If you want to merge modules, you can use the mergeCloseModules
function as below:
<- mergeCloseModules(dataset, dynamicColors, cutHeight = MEDissCut, verbose = 3)
merge <- merge$colors
mergedColors <- merge$newMEs
mergedMEs plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
<- mergedColors
moduleColors <- c("grey", standardColors(50))
colorOrder <- match(moduleColors, colorOrder) - 1
moduleLabels <- mergedMEs MEs
DIY: Let’s save those clustering results! Choose your favorite export format (Rdata, csv, tsv, …) and save the results. Don’t forget to make sure you have all the information you need (in this case, we need the node names and their cluster labels).
= dynamicColors # Or mergedColors, if some modules have been merged
moduleColors names(moduleColors) = colnames(dataset)
head(moduleColors)
## STC2 LOC96610 APOD FADS2 RPL9 ADAM6
## "turquoise" "blue" "red" "yellow" "grey" "blue"
save(moduleColors, file = "results_TCGA/modules_mrna.rda")
DIY: We also need to save our module eigenvectors!
save(MEs, file = "results_TCGA/MEs_mrna.rda")
We identified several modules in the previous step. Now we want to characterize them:
For all of the above, we can use correlation!
We want to identify which modules are significantly associated with the (clinical) metadata we have. For this, we will use correlation.
For pre-processing numeric variables, a simple normalization should be enough.
While not ideal, for categorical variables, we will binarize the data using the binarizeCategoricalVariable
. This function provides two options: includePairwise
to compare all classes against all, and a includeLevelVsAll
to compare each classes to all others. Here, to limit the number of associations to test, we will only include LevelVsAll comparisons.
However, note that binarization is not optimal, and other test might be more appropriate, such as multinomial logistic regression models.
DIY: First, scale numeric data, and put the results in a new dataframe.
<- data.frame("AgeDiagnosis"=scale(metadata$AgeDiagnosis), row.names = rownames(metadata)) metadataNum
DIY: Now, binarize categorical data and add it to the new dataframe.
= binarizeCategoricalVariable(metadata$AnatomicNeoplasm,includePairwise = FALSE, includeLevelVsAll = TRUE)
AnatomicNeoplasm rownames(AnatomicNeoplasm) = rownames(metadata)
= cbind(metadataNum, AnatomicNeoplasm) metadataNum
Do the same for all categorial variables:
# Gender and vital status are already binary
$Gender = ifelse(metadata$Gender=="female", 0, 1)
metadataNum$VitalStatus = ifelse(metadata$VitalStatus=="alive", 1, 0)
metadataNum
= binarizeCategoricalVariable(metadata$Subtype,includePairwise = FALSE, includeLevelVsAll = TRUE)
Subtype rownames(Subtype) = rownames(metadata)
= cbind(metadataNum, Subtype)
metadataNum
# We also include a numeric representation of the subtypes
$Subtype = as.numeric(metadata$Subtype)-1 metadataNum
Now we can compute correlations between the modules eigenvectors and the metadata!
DIY: Use the functions cor
and corPvalueStudent
to compute correlations and associated p-values.
<- cor(MEs, metadataNum)
moduleTraitCor <- corPvalueStudent(moduleTraitCor, nSamples) moduleTraitPvalue
Let’s plot the results:
<- paste(signif(moduleTraitCor, 2), "\n(", signif(moduleTraitPvalue, 1), ")", sep = "")
textMatrix dim(textMatrix) <- dim(moduleTraitCor)
colnames(textMatrix) <- colnames(moduleTraitCor)
rownames(textMatrix) <- rownames(moduleTraitCor)
for(r in row.names(moduleTraitCor)){
for(c in colnames(moduleTraitCor)){
if(moduleTraitPvalue[r, c] > 0.05){
<- 0
moduleTraitCor[r, c] <- ""
textMatrix[r, c]
}
}
}labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(metadataNum),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = colorRampPalette(c("darkgreen", "white", "darkmagenta"))(50),
textMatrix = textMatrix,
setStdMargins = TRUE,
cex.text = 0.5,
zlim = c(-1, 1),
main = "Module-trait relationship")
Two measures are computed to quantify the association between nodes, traits, and modules:
Module Membership (MM): correlation between the module eigengene and the features
Gene Significance (GS): correlation between the features and a selected trait of interest
DIY: Fist, we compute the Module Membership. You need the module eigenvectors (MEs) and the features from the input dataset. Compute correlations using the cor
function. Then, compute p-values using the function corPvalueStudent
. Store the results in a dataframe.
<- as.data.frame(cor(dataset, MEs))
moduleMembership <- as.data.frame(corPvalueStudent(as.matrix(moduleMembership), nSamples))
MMPvalue
<- substring(names(MEs), 3)
modNames names(moduleMembership) <- paste("MM", modNames, sep = "")
names(MMPvalue) <- paste("p.MM", modNames, sep = "")
head(moduleMembership[,1:5])
## MMyellow MMbrown MMred MMblack MMblue
## STC2 -0.43217262 0.04748568 0.1473983 0.03871435 -0.223619330
## LOC96610 0.23596084 0.18676692 0.1702518 0.31262959 0.640187631
## APOD -0.23191026 0.32615271 0.4728507 0.15203720 0.188067120
## FADS2 0.37235530 0.01179123 -0.1840167 0.01959520 0.141079060
## RPL9 -0.09474407 0.03462862 0.1189073 -0.09396184 0.006553046
## ADAM6 0.11173668 0.18504141 0.2314137 0.26154606 0.539568707
head(MMPvalue[,1:5])
## p.MMyellow p.MMbrown p.MMred p.MMblack p.MMblue
## STC2 1.239744e-18 3.572129e-01 4.079331e-03 4.529648e-01 1.138938e-05
## LOC96610 3.516884e-06 2.608997e-04 8.892539e-04 5.147623e-10 5.602733e-45
## APOD 5.209184e-06 8.102363e-11 1.874090e-22 3.042640e-03 2.358109e-04
## FADS2 7.081956e-14 8.192604e-01 3.223981e-04 7.041318e-01 6.004037e-03
## RPL9 6.575713e-02 5.020747e-01 2.075703e-02 6.802883e-02 8.989521e-01
## ADAM6 2.985413e-02 2.980548e-04 5.463571e-06 2.492590e-07 6.099177e-30
DIY: Very similarly, we can compute the node significance with respect to our traits of interest contained in the metadata. Compute the correlation and p-values between your dataset and the metadata using the sames functions as above.
<- as.data.frame(cor(dataset, metadataNum, use = "p"))
geneTraitSignificance <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
GSPvalue names(geneTraitSignificance) <- paste("GS", names(metadataNum), sep = "")
names(GSPvalue) <- paste("p.GS", names(metadataNum), sep = "")
DIY: We can visualize the results with a scatter plot. Pick a module and extract the set of nodes composing the selected module. Then pick a trait of interest. Plot the module membership values against the node significance using the verboseScatterplot
function. You can do this for various modules and traits of interest.
='turquoise'
module= names(moduleColors)[which(moduleColors==module)]
moduleGenes = "GSSubtype"
geneTraitSignifColumn verboseScatterplot(abs(moduleMembership[moduleGenes, paste('MM', module, sep="")]),
abs(geneTraitSignificance[moduleGenes, geneTraitSignifColumn]),
xlab = paste("Module Membership (MM) in", module, "module"),
ylab = paste("Gene Significance (GS) for", geneTraitSignifColumn),
main = paste("Module Membership (MM) vs Gene Significance (GS)\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
DIY: Now that we have computed our correlated modules and characterized them, we can export our results for other downstream analysis.
= cbind(moduleMembership, MMPvalue)
resultsModuleMembership = names(resultsModuleMembership)
mmNames = mmNames[order(gsub("p\\.", "", mmNames))]
mmNames = resultsModuleMembership[,mmNames]
resultsModuleMembership
= cbind(geneTraitSignificance, GSPvalue)
resultsSignificance = names(resultsSignificance)
gsNames = gsNames[order(gsub("p\\.", "", gsNames))]
gsNames = resultsSignificance[,gsNames]
resultsSignificance
= data.frame("Gene"=names(moduleColors), "Module"=unlist(moduleColors),
results
resultsModuleMembership, resultsSignificance)= results[order(results$Module),]
results write.table(results, file="results_TCGA/TCGA_mRNA_results.tsv", sep="\t", row.names=F, col.names=T, quote=F)
head(results[,1:6])
## Gene Module MMblack p.MMblack MMblue p.MMblue
## IFI6 IFI6 black 0.8573286 1.765762e-110 0.2630361 2.117437e-07
## ISG15 ISG15 black 0.8279306 1.804073e-96 0.2974812 3.665833e-09
## IFI27 IFI27 black 0.8207346 1.918631e-93 0.3937153 1.822846e-15
## MX1 MX1 black 0.8834454 7.349794e-126 0.4205653 1.237612e-17
## IFIT1 IFIT1 black 0.9117530 2.263150e-147 0.3267135 7.489300e-11
## HERC6 HERC6 black 0.7728591 2.880394e-76 0.4226841 8.185472e-18
After applying the pipeline for various modalities, we can use the module eigenvectors computed for each modality to perform an integrative analysis and identify correlated multi-omic modules.
DIY: Load the module eigenvectors obtain for the various modalities.
load("results_TCGA/MEs_mirna.rda")
= MEs
mirna names(mirna) = paste("mirna", names(mirna), sep="_")
load("results_TCGA/MEs_proteins.rda")
= MEs
protein names(protein) = paste("protein", names(protein), sep="_")
load("results_TCGA/MEs_mrna.rda")
= MEs
mrna names(mrna) = paste("mrna", names(mrna), sep="_")
DIY: Some samples might have been removed during the preprocessing step for some of the modalities. Just keep the samples that are present in all three datasets.
= Reduce("intersect", list(rownames(mirna), rownames(protein), rownames(mrna)))
samples = do.call("cbind", list(mirna[samples,], mrna[samples,], protein[samples,])) MEs
DIY: Now we can compute module-module correlations across modalities! Do it with the adjacency
function used in the beginning of this tutorial.
= 1 # this time, we don't need a scale-free topology, so we just set softPower to 1
softPower <- adjacency(MEs, power = softPower) similarityMat
We can visualize the correlations using pheatmap
.
DIY: First, create an annotation dataframe to indicate the provenance (i.e. modality) of each module.
= data.frame("Modality"=gsub("_.*", "", colnames(similarityMat)))
annot rownames(annot) = colnames(similarityMat)
head(annot)
## Modality
## mirna_MEyellow mirna
## mirna_MEbrown mirna
## mirna_MEturquoise mirna
## mirna_MEblue mirna
## mirna_MEgreen mirna
## mirna_MEgrey mirna
DIY: Now, plot the module correlation matrix using pheatmap
. Look at the documentation for this function to know how to include your annotation dataframe, and play with the various parameters (e.g: clustering of the heatmap).
<- pheatmap(similarityMat, show_rownames = TRUE, show_colnames = TRUE, annotation_row = annot, annotation_col = annot, legend = T, cluster_cols = T, cutree_cols = 5, cluster_rows = T, cutree_rows = 5) heatmap
DIY: Extract the clusters of correlated modules.
<- cbind(similarityMat, cluster = cutree(heatmap$tree_col, k = 5))
clust = clust[,"cluster"]
clust order(clust)] clust[
## mirna_MEyellow mirna_MEbrown mirna_MEturquoise mrna_MEyellow
## 1 1 1 1
## mrna_MEturquoise protein_MEbrown mirna_MEblue mrna_MEbrown
## 1 1 2 2
## mrna_MEred protein_MEturquoise mirna_MEgreen mirna_MEgrey
## 2 2 3 3
## mrna_MEpink mrna_MEgrey protein_MEgrey mrna_MEblack
## 3 3 3 4
## mrna_MEblue mrna_MEgreen protein_MEblue
## 4 5 5