1 Libraries and environment

1.1 Environment

This report was generated using:

R: R version 4.3.2 (2023-10-31)
WGCNA: 1.72.5
pheatmap: 1.0.12

You might also need the compositions library for data normalization.

1.2 Load libraries

DIY: Load the WGCNA and pheatmap libraries.

library("WGCNA")
library("pheatmap")

2 General principle of WGCNA

Important Reminder: While WGCNA was initially developed for processing gene expression data, it can be adapted for other data types as well.

Throughout this tutorial, we will strive to present the material in a manner that is broadly applicable to different data modalities.

However, it is worth noting that certain WGCNA functions have names related to transcriptomics data (e.g., moduleEigengenes instead of eigenvectors). Moreover, we might use interchangeably the terms correlation and co-expression, or gene and feature.

Consequently, certain phrasings used in this tutorial may specifically reference transcriptomic data, even though the overall intent is to apply the methods to diverse data types. Please keep this in mind while following the tutorial, especially when applying it to non-transcriptomic data.

Weighted Gene Co-expression Network Analysis (WGCNA) is an approach dedicated to identifying correlation patterns between features of a dataset. In most cases, it is applied to gene expression data to find clusters -referred to as modules- of highly correlated genes. In transcriptomics, genes that exhibit strong correlations in their expression profiles are said co-expressed.

One of the underlying principles of WGCNA is the concept of guilt-by-association. This principle suggests that genes sharing similar expression patterns are likely to be functionally related or involved in similar biological processes. By leveraging guilt-by-association through correlation, WGCNA provides a powerful tool for identifying potential functional relationships among genes and deciphering their roles in complex biological systems.

WGCNA builds a co-expression network where:

each feature (transcript/gene in transcriptomics data) is represented as a node
edges represent a correlation pattern between two features (nodes) and are weighted with the corresponding correlation coefficient, indicating the strength of their observed correlation.

The main analysis steps are presented in the Figure 2.1.

First, it constructs a feature correlation network (for transcriptomics data, a gene co-expression network based on correlation analysis). The correlation network is (soft)-thresholded to obtain a scale-free network.
Then, it identifies modules, i.e. groups of features that are highly and tightly correlated.
Next, those modules are characterized:

the modules are related to external data, such as samples metadata but also features metadata
the users can examine module-module relationships, to reveal a higher-order organization of the transcriptome
the users can find key driver features in modules of interest

WGCNA overview (Langfelder, Peter, and Steve Horvath. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics 9.1 (2008))

Figure 2.1: WGCNA overview (Langfelder, Peter, and Steve Horvath. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics 9.1 (2008))

2.1 Application on other omics

WGCNA is a method dedicated to transcriptomics data. However, some studies have shown its potential for inferring correlation networks on other omic datasets (see: MiBiOmics). In this tutorial, we will see how it can be used on various omics datasets to reveal correlated modules across different data modalities.

2.2 Ressources

If you are interested in WGCNA, extensive tutorials are available online:

All official WGCNA tutorials can be found here
Publication: PMID: 19114008

3 Choose your dataset and your modality

For this tutorial, you can pick a dataset among the ones proposed below.

Five datasets are available: **Metagenomic** data from Tara Ocean (image from Sunagawa et al., 2015), **Breast cancer** data from TCGA (image from TCGA [website](https://portal.gdc.cancer.gov/)), **CLL** data (Dietrich et al., 2018), **tomato** data and **Arabidopsis thaliana** data (Wallomics dataset).

Figure 3.1: Five datasets are available: Metagenomic data from Tara Ocean (image from Sunagawa et al., 2015), Breast cancer data from TCGA (image from TCGA website), CLL data (Dietrich et al., 2018), tomato data and Arabidopsis thaliana data (Wallomics dataset).

3.1 TCGA breast cancer dataset

This dataset includes measurements of mRNA, miRNA, and protein abundances taken from samples representing different molecular subtypes of breast cancer.

This dataset was obtained from Mibiomics

Access the dataset from IFB on demand: /shared/projects/tp_etbii_2024_165650/Networks/BreastCancer_WGCNA/

For this dataset, we will use 3 modalities:

mrna.csv, contains mRNA expression data
mirna.csv, contains miRNA expression data
protein.csv, contains protein abundance data

We will also use the file clinical.csv, which contains samples metadata, including molecular subtypes.

3.2 TARA ocean dataset

Tara Ocean is a scientific initiative that aims to study and understand marine ecosystems on a global scale. It involves the collection of ocean samples from various regions to analyze the diversity of organisms, their genetic content, and their interactions with the environment.

In this dataset, we will have access to a NOG (Non-supervised Orthologous Groups) abundance dataset, measure accros various oceans and various depths. NOGs are clusters of genes from different species that share a common evolutionary origin and similar functions. These groups are formed computationally, without relying on manual annotations, and are valuable for understanding gene relationships and functions across diverse organisms. NOGs play a key role in comparative genomics and evolutionary studies by revealing insights into the conservation of genes and their roles in different species.

NOGs have been measured in various oceans (SPO: South Pacific Ocean, NAO: North Atlantic Ocean, IO: Indian Ocean, RS: Red Sea, MS: Mediterranean Sea, NPO: North Pacific Ocean, SO: Southern Ocean, SAO: South Atlantic Ocean) and at various depths (DCM: Deep Chlorophyll Maximum, MES: Mesopelagic Zone, MIX: Mixing Layer, SRF: Surface Layer)

This dataset was obtained from Mibiomics

Access the dataset from IFB on demand: /shared/projects/tp_etbii_2024_165650/Networks/TaraOcean_mibiomics/

For this dataset, we can only apply WGCNA on the TARAoceans_proNOGS.csv. Therefore, the integrative analysis part of this tutorial cannot be applied on TARA data. However, you can still run the tutorial from the proNOGS dataset, using also the metadata contained in TARAoceans_metadata.csv.

3.3 CLL dataset

The Chronic lymphocytic leukaemia (CLL) dataset includes mRNA expression, methylation data and viability values in response to different drugs from CLL samples.

To access this dataset, you need to install and load the MOFAdata R library.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MOFAdata")

library("MOFAdata")
data("CLL_data")
names(CLL_data)

We will use three modalities:

CLL_data$mRNA, which contains mRNA data for various CLL samples,
CLL_data$Methylation, which contains methylation data from the same samples,
CLL_data$Drugs, which contains viability response to various drugs and concentrations.

The samples metadata are not available from MOFAdata, you’ll have to access it from IFB on demand: /shared/projects/tp_etbii_2024_165650/Networks/CLL/

Use the command ?CLL_data to know more about this dataset.

3.4 Tomato dataset

The tomato dataset was obtained from the authors of the following publication: Isma Belouah and others, Modeling Protein Destiny in Developing Fruit, Plant Physiology, Volume 180, Issue 3, July 2019, Pages 1709–1724, https://doi.org/10.1104/pp.19.00086

It contains mRNA and protein abundance data from tomato plants (Solanum lycopersicum) at various development stages. In this dataset, the mRNA and proteins are paired, meaning that corresponding gene expression levels at both the mRNA and protein levels are examined.

The dataset encompasses information gathered from tomato plants at different developmental points, representing various Days Post Anthesis (DPA) and distinct Growth Stages. ‘Days Post Anthesis’ (DPA) signifies the count of days that have elapsed since the opening of a flower, serving as a marker to track the temporal progression of plant development. On the other hand, ‘Growth Stages’ denote specific phases within the tomato plant’s lifecycle, spanning germination, vegetative growth, flowering, fruit development, and ripening.

The dataset was kindly provided to us by the authors. We have made slight adjustments to it, and made it accessible on IFB on demand: /shared/projects/tp_etbii_2024_165650/Networks/Tomato/

For this dataset, we have access to 2 modalities:

mrna.tsv, contains mRNA expression data (already normalized)
prots.csv, contains protein abundance data (already normalized)

We will also use the file samples_metadata.tsv, which contains samples metadata (DPA, Growth Stage).

3.5 WallOmics dataset

The tomato dataset was obtained from the WallomicsData R library. This dataset contains phenomics, metabolomics, proteomics and transcriptomics data collected from two organs of five ecotypes of the model plant Arabidopsis thaliana exposed to two temperature growth conditions. For more information, you can refer to the following publications:

Duruflé, Harold, et al. “A powerful framework for an integrative study with heterogeneous omics data: from univariate statistics to multi-block analysis.” Briefings in Bioinformatics 22.3 (2021): bbaa166.
Duruflé, Harold, et al. “An integrative study showing the adaptation to sub-optimal growth conditions of natural populations of Arabidopsis thaliana: A focus on cell wall changes.” Cells 9.10 (2020): 2249.

For this dataset, we can run WGCNA on the transcriptomics and proteomics data obtained from two organs. For the analysis to run faster, we will use the CellWall datasets (CW), that only consider the expression of cell wall proteins. The datasets can be accessed with the following R commands:

intall.packages("WallomicsData")

library("WallomicsData")
data("Transcriptomics_Rosettes_CW")
data("Transcriptomics_Stems_CW")
data("Proteomics_Rosettes_CW")
data("Proteomics_Stems_CW")

Use ?Transcriptomics_Rosettes_CW and ?Transcriptomics_Stems_CW for more information on the transciptomics dataset. Use ?Proteomics_Rosettes_CW and ?Proteomics_Stems_CW for more information on the transciptomics dataset.

Additionally, we will use the other data available as metadata. This includes:

the Ecotype dataset
the Temperature dataset
the Genetic_Cluster dataset
the Altitude_Cluster dataset
the Genetic_Cluster dataset
the Altitude_Cluster dataset
the Metabolomics_Rosettes dataset for Rosettes
the Metabolomics_Stems dataset for Stem
the Phenomics_Rosettes dataset for Rosettes
the Phenomics_Stems dataset for Stem

In the upcoming sections, our primary focus will be on the Breast cancer dataset obtained from WGCNA. However, it’s important to note that the procedure we will discuss can also be applied to other datasets.

3.6 Team organization

To run the multi-omics analysis, it is necessary to repeat the WGCNA pipeline for each modality. To facilitate this, we suggest dividing into smaller groups, with each group assigned to a specific dataset. Within each group, individual members can handle one modality of the multi-omics dataset, before the whole group handles the data integration part.

4 Biological context (Breast cancer dataset)

The Cancer Genome Atlas (TCGA) is a cancer genomic project that catalogs cancer-associated genomics profiles with several approaches (e.g. gene expression, copy number variation, DNA methylation, etc.). You can find the web site here.

Figure 4.1: Screenshot of the TCGA main page

TCGA makes available a high amount of data. In this tutorial, we will use mRNA expression data, miRNA expression data and protein abundance data from breast cancer samples. Metadata about the samples are also available, such as the subtype of cancer, age at diagnosis and so on.

Our objective is to build a correlation network for each modality to identify modules of tightly correlated features associated to our clinical metadata. Then, another correlation analysis between the modules from different modalities will help us identify multi-omics associations in breast cancer.

5 Input data

5.1 In summary

Preparation of the data is the first and most important step before running any analysis. Indeed, data need to be prepared correctly in order to extract relevant and pertinent information.

Normalization of data according to the type of data and the type of analysis to perform. For RNA-seq data, WGCNA recommends to use the varianceStabilizingTransformation from the DESeq2 package, or to perform a log-transformation of normalized counts (RPFK/FPKM).
Removing missing values
Removing outliers based on sample hierarchical clustering
Add external information

Data must be matrices with specific shapes:

samples (e.g. patients, organisms …) are displayed on the rows
variables (e.g. genes, proteins …) are displayed on the columns

5.2 Load your dataset

Choose a dataset and a modality, and load it into R. You will probably need to read a csv/tsv file: look at the documentation for the function read.table:

?read.table

If your data is not stored into a csv/tsv file and you don’t know how to load it, don’t hesitate to ask us.

DIY: Load the modality you selected.

dataset = read.table("TCGA_data/mrna.csv", sep=",", header=T)

DIY: Reshape the matrix. We need the samples in rows and the genes in columns. You can use sample ids as row names and gene names as column names.

rownames(dataset) = dataset$X
dataset = dataset[,-1]
dataset = t(dataset)
head(dataset[,1:5])

##           STC2  LOC96610      APOD    FADS2     RPL9
## A0SH  6.809810  7.139342  6.399172 5.316216 5.916535
## A0SJ  9.802976  9.538014 10.667394 5.537015 3.204786
## A0SK  5.495795  7.440094  1.722198 8.555220 6.702598
## A0SO  2.926594  8.674870  4.771886 8.911656 6.122295
## A04N 11.579173  8.457158  6.579026 6.459865 3.572641
## A04P  5.011630 11.149950  3.497919 6.820260 8.007429

DIY: Check if your dataset is normalized. If not, do it! If you don’t know which type of normalization/transformation to apply your omic dataset, use the Centered log ratio transformation: function clr from the compositions R library.

For the mRNA dataset of the breast cancer dataset, the data is already normalized, but we still need to log-transform it. To avoid errors, we usually define an offset of 1 when log-transforming the data, as this ensures that the logarithm is only applied to values greater than zero.

dataset = log2(1+dataset)
head(dataset[,1:5])

##          STC2 LOC96610     APOD    FADS2     RPL9
## A0SH 2.965288 3.024912 2.887364 2.659061 2.790050
## A0SJ 3.433357 3.397531 3.544410 2.708632 2.072032
## A0SK 2.699506 3.077259 1.444772 3.256289 2.945345
## A0SO 1.973278 3.274242 2.529043 3.309126 2.832342
## A04N 3.652965 3.241407 2.922013 2.899150 2.193028
## A04P 2.587756 3.602879 2.169258 2.967217 3.171115

5.3 Run WGCNA quality check

WGCNA provides a function goodSamplesGenes to check the quality of input data and remove features and samples with too many missing data as well as genes with zero variance.

DIY: Read the documentation for this function using?goodSamplesGenes and check if any sample and/or feature should be removed.

gsg <- goodSamplesGenes(datExpr = dataset, verbose = 3)
print(paste("All ok?", gsg$allOK))

DIY: According to your results, you may need to remove problematic samples and features from your dataset! To do so, keep the samples and features contained in gsg$goodSamples and gsg$goodGenes.

sprintf("Removing %d features", ncol(dataset) - sum(gsg$goodGenes))
sprintf("Removing %d samples", nrow(dataset) - sum(gsg$goodSamples))
dataset = dataset[gsg$goodSamples, gsg$goodGenes]

5.4 Checking for outliers

After running WGCNA quality check, we also need to check if we have any outlier sample.

We compute the euclidean distance between each pair of samples using the dist function.
Then, we perform a hierarchical clustering on the distance data using the hclust function.
We visualize the dendrogram to identify potential outliers.

DIY: Try to perform the first two steps. Don’t forget to use ?<functionName> to know which input give to each function!

samplesTree <- hclust(d = dist(dataset), method = "average")

Now, let’s plot the dendrogram.

plot(samplesTree, main = "Samples dendrogram", sub = "", xlab = "", cex.lab = 1, cex.axis = 1, cex.main = 1, cex = 0.5)
abline(h = 28, col = "red")

Figure 5.1: Samples clustering using hierarchical clustering method, to detect outliers

What do you think?

DIY: Try different thresholds on the Height to cut the dendrogram using the abline function, and choose one that fits your dataset.

DIY: Now let’s cluster our samples using the selected height, using the function cutreeStatic.

clust <- cutreeStatic(samplesTree, cutHeight = 28)
table(clust)

## clust
##   0   1 
##   1 378

DIY: Let’s remove samples from cluster 0 (outliers) and save the final number of samples and features into variables nSamples and nFeatures.

keep <- clust == 1
dataset <- dataset[keep,]
nFeatures <- ncol(dataset)
nSamples <- nrow(dataset)
sprintf("%d features, %d samples",nFeatures, nSamples)

## [1] "2000 features, 378 samples"

5.5 Metadata

We have access to metadata information for every sample. This time, we may have some Categorial data (e.g: subtype), so let’s use the stringAsFactors parameters from the read.table function, to specify that strings should be seen as factors (classes).

DIY: Load your metadata file. Make sure any categorial data is seed as a factor.

metadata <- read.table(file = "TCGA_data/clinical.csv", 
                           sep = ",", head = TRUE, stringsAsFactors = TRUE)
head(metadata)
class(metadata$subtype)

If needed, reshape your metadata so that it matches the samples in your dataset. For the breast cancer dataset, we also want to rename our clinical features with better colnames.

rownames(metadata) = metadata$X # Keep our sample ids as rownames
metadata = metadata[,-1] # Now we can remove the X col
names(metadata) = c("AgeDiagnosis", "AnatomicNeoplasm", "Gender", "VitalStatus", "Subtype") # Rename columns
metadata = metadata[rownames(dataset),] # We keep only samples that are present in our dataset
head(metadata)

##      AgeDiagnosis           AnatomicNeoplasm Gender VitalStatus Subtype
## A0SH           39  left upper inner quadrant female       alive    LumA
## A0SJ           39                       left female       alive    LumA
## A0SK           54 right upper outer quadrant female        dead   Basal
## A0SO           67                      right female       alive   Basal
## A04N           66                      right female       alive    LumA
## A04P           36                       left female        dead   Basal

Note for the CLL dataset: Notice that we have several missing data in the metadata. This will cause some issues for downstream analysis (notably when characterizing our modules). We should impute these missing data to avoid these. To do so, we will replace the missing values with:

mean values for numerical data
randomly sampled values for categorical data

Here are some useful functions to do so:

is.na for determining if a value is NA
which to locate elements of a vector that satisfy a condition
na.omit to remove NA values from a vector
sample to randomly select elements from a list
mean to compute the mean value of a vector

Try to combine those functions to impute your missing values.

Here are two examples:

metadata$Gender[which(is.na(metadata$Gender))] = sample(na.omit(metadata$Gender),1)
metadata$age[which(is.na(metadata$age))] = mean(na.omit(metadata$age))

Do the same for any column that has missing values. Don’t forget that in the end, all categorical data should be a factor! You can une the class function to determine the class of a column.

Let’s replot our sample dendrogram and visualize the metadata associated to each sample.

DIY: Create a new dataframe from your metadata and convert categorical traits into colors using the labels2colors function. Convert numerical traits into colors using the function numbers2colors.

traitColors <- labels2colors(metadata)
traitColors[,1] <- numbers2colors(metadata[,1]) # The first column is numerical
head(traitColors)

##      [,1]      [,2]              [,3]       [,4]            [,5]        
## [1,] "#FFD5CB" "mediumpurple2"   "skyblue2" "antiquewhite4" "lightcoral"
## [2,] "#FFD5CB" "skyblue1"        "skyblue2" "antiquewhite4" "lightcoral"
## [3,] "#FFA38D" "plum3"           "skyblue2" "mediumorchid"  "coral2"    
## [4,] "#FF7A58" "darkolivegreen4" "skyblue2" "antiquewhite4" "coral2"    
## [5,] "#FF7E5E" "darkolivegreen4" "skyblue2" "antiquewhite4" "lightcoral"
## [6,] "#FFE2DB" "skyblue1"        "skyblue2" "mediumorchid"  "coral2"

Now we can plot the sample dendrogram again and visualize the traits associated to each sample with the plotDendroAndColors function.

samplesTree <- hclust(d = dist(dataset), method = "average")
plotDendroAndColors(dendro = samplesTree, colors = traitColors, groupLabels = names(metadata), 
                    main = "Patients dendogram and trait heatmap",cex.dendroLabels = 0.5, cex.colorLabels = 0.5)

Figure 5.2: Samples clustering using hierarchical clustering method with metadata information

6 Construction of the correlation network

6.1 Correlation for dummies

WGCNA detects “co-expressed” features by performing a correlation analysis. Let’s visualise what two co-expressed features looks like. Use the plot function to visualize the associations of two variables, and compute the Pearson and Spearman correlation coefficient using the cor function. If you are working on the mRNA dataset from TCGA, let’s look at genes COL1A1 and COL1A2, two collagen-coding genes, which are known to be co-expressed.

plot(dataset[,"COL1A1"],dataset[,"COL1A2"])

cor(dataset[,"COL1A1"],dataset[,"COL1A2"], method = "pearson")

##           [,1]
## [1,] 0.9879923

cor(dataset[,"COL1A1"],dataset[,"COL1A2"], method = "spearman")

##           [,1]
## [1,] 0.9865966

What is the difference between Pearson and Spearman methods? The Pearson correlation measures linear associations between variables, whereas Spearman correlation measures monotonous associations between variables. Which one do you want chose for your dataset? Both can have interesting results, so it’s completely up to you! By default, WGCNA uses the Pearson correlation. If you want to use the Spearman correlation, don’t forget to specify it each time you notice a method parameter controlling the correlation method in one of the WGCNA functions!

6.2 Computing the correlation network

This first step of the analysis consists in creating the feature correlation network. This first step is composed of tree main substeps:

Power selection
- soft thresholding is a strategy introduced by WGCNA to transform correlation coefficients for building a weighted correlation network
- the threshold should be set such that it makes the network scale-free
Similarity matrix construction
- first, correlation are calculated between each pair of feature
- then, correlation or distance are transformed, using the selected soft-power
Topological overlap matrix construction (TOM)
- the similarity matrix is transformed into a TOM to facilitate the module detection

Skip this line if you run Rstudio or other third-party R environments. Allow multi-threading within WGCNA.

enableWGCNAThreads()

6.3 Power selection

Correlation networks are by nature complete, which means every feature is associated to every other with a weight given by the correlation coefficient. To extract only meaningful associations, the correlation network must be thresholded.

The most straight-forward way of thresholding a correlation dataset is to perform a hard-thresholding, i.e. two features are considered correlated if their absolute correlation coefficient exceeds a user-defined threshold.

WGCNA propose another strategy, called soft-thresholding. It consists of raising the correlation coefficients to a certain power. This process transforms low correlation coefficients to values close to 0, while high correlation coefficients are transformed to values close to 1.

Furthermore, WGCNA presents a function called pickSoftThreshold that aims to automatically determine the ideal power for your dataset. The optimal power is defined as the one that ensures your network exhibits a scale-free property. Let’s use it!

First, we need to define a range of powers to try.

powers <- c(c(1:10), seq(from = 12, to = 20, by = 2))
powers

##  [1]  1  2  3  4  5  6  7  8  9 10 12 14 16 18 20

DIY: Use the pickSoftThreshold function to analyze the network topology of the correlation network for various soft-powers.

sft <- pickSoftThreshold(dataset, powerVector = powers, verbose = 5)

## pickSoftThreshold: will use block size 2000.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 2000 of 2000

## Warning: executing %dopar% sequentially: no parallel backend registered

##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1  0.00469  0.170          0.749 337.000  3.46e+02 565.00
## 2      2  0.26100 -0.922          0.751  94.300  9.08e+01 229.00
## 3      3  0.59300 -1.200          0.879  35.300  2.96e+01 120.00
## 4      4  0.81800 -1.400          0.979  16.200  1.12e+01  79.90
## 5      5  0.86500 -1.550          0.982   8.590  4.49e+00  61.00
## 6      6  0.90100 -1.580          0.979   5.100  1.96e+00  48.70
## 7      7  0.90600 -1.560          0.975   3.280  8.97e-01  39.90
## 8      8  0.91200 -1.540          0.981   2.250  4.21e-01  33.40
## 9      9  0.91300 -1.490          0.975   1.610  2.07e-01  28.40
## 10    10  0.91100 -1.470          0.972   1.200  1.07e-01  24.50
## 11    12  0.90900 -1.430          0.967   0.718  3.04e-02  18.60
## 12    14  0.91000 -1.410          0.970   0.464  8.50e-03  14.60
## 13    16  0.89400 -1.410          0.932   0.317  2.57e-03  11.70
## 14    18  0.91000 -1.380          0.966   0.225  8.32e-04   9.47
## 15    20  0.92100 -1.340          0.976   0.165  2.84e-04   7.79

DIY: Which power should be used for your dataset? (don’t forget to look at the documentation of the pickSoftThreshold fucntion to known how).

sprintf("Optimal soft-power = %d", sft$powerEstimate)

## [1] "Optimal soft-power = 5"

The optimal soft-power to use is defined by the topology of the soft-thresholded network, which has to be scale-free! Scale-free networks are networks in which the majority of nodes have a low degree, and a small number of nodes have a large degree. Multiple studies have shown that most biological networks are scale-free!

Let’s visualize the difference between the raw correlations and the soft-thresholded correlations:

corRaw = abs(cor(dataset))
corSoft = abs(corRaw**sft$powerEstimate)
par(mfrow=c(1,2))
hist(corRaw, main="Raw correlations")
hist(corSoft, main="Soft-thresholded correlations")

Sometimes, WGCNA can’t find the optimal soft-threshold. In such cases, the variable sft$powerEstimate contains NA. If the optimal soft-power was estimated but seems low, you might want to increase it. In both cases, you can plot the following figures to help you decide:

plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2],
     xlab = "Soft Threshold power", ylab = "Scale Free Topology Model Fit, R²",
     type = "n", main = "Scale independence", cex.lab = 1.3)
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2], labels = powers, cex = 1, col = "black")
abline(h = 0.80, col = "red")
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab = "Soft Threshold power", ylab = "Mean connectivity", 
     type = "n", main = "Mean connectivity", cex.lab = 1.3)
text(sft$fitIndices[,1], sft$fitIndices[,5], labels = powers, cex = 1, col = "black")

Figure 6.1: Soft Threshold Indices. Left: Scale Independence plot - Right: Mean Connectivity plot

The Scale Independence plot (left) shows the fit indices for scale free topology for each power. The authors recommend to select a power with a R² > 0.8. The red line corresponds to using an R² cut-off of R²=0.80.

The Mean Connectivity plot (right) shows the average of connectivity for each power. A balance needs to be identified between a fully connected network and too much disconnected edges.

6.4 Similarity matrix construction

WGCNA provides the function adjacency to compute a features similarity matrix. This function operates in three steps:

Compute features correlation coefficients
Transform the correlation coefficients depending on whether you are interested in absolute correlations (unsigned), signed correlations (signed) or positive correlations (signed hybrid). To have more details on this 3 types, check ?adjacency. By default, the constructed network is an unsigned network which means that sign are not reported into the network (absolute correlations).
Finally, the transformed correlations are raised using the previously estimated soft-threshold power.

DIY: Use the adjacency function to compute the similarity matrix. Don’t forget to provide the soft-power you defined.

softPower = sft$powerEstimate
similarityMat <- adjacency(dataset, power = softPower)
head(similarityMat[, c(1:5)])

##                  STC2     LOC96610         APOD        FADS2         RPL9
## STC2     1.000000e+00 2.412996e-03 5.044921e-07 2.202682e-03 1.160268e-08
## LOC96610 2.412996e-03 1.000000e+00 3.146534e-04 1.155482e-04 1.256985e-07
## APOD     5.044921e-07 3.146534e-04 1.000000e+00 4.584558e-07 1.401583e-08
## FADS2    2.202682e-03 1.155482e-04 4.584558e-07 1.000000e+00 2.535449e-05
## RPL9     1.160268e-08 1.256985e-07 1.401583e-08 2.535449e-05 1.000000e+00
## ADAM6    6.429423e-04 4.443884e-01 1.380167e-03 5.118278e-06 3.567073e-07

6.5 Topological overlap matrix (TOM)

From the Similarity matrix, WGCNA proposes to extract correlated feature modules (co-expressed gene modules) based on Topological Overlap. The idea is that two features (genes) that share a high number of correlated (co-expressed) neighbors are likely to participate in the same correlation (co-expression) module.

The Topological Overlap Matrix (TOM) is a concept used in network analysis to quantify the similarity or interconnectedness between nodes in a network.

DIY: Transform the similarity matrix into a Topological Overlap Matrix that incorporates the level of overlapping between the neighborhoods of two nodes (features/genes), using the function TOMsimilarity. Notice that the output does not contains our row names and column names (i.e. node names). Make sure to rename it!

TOM <- TOMsimilarity(similarityMat)

## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.

colnames(TOM) <- colnames(similarityMat)
rownames(TOM) <- rownames(similarityMat)
head(TOM[, c(1:5)])

##                  STC2     LOC96610         APOD        FADS2         RPL9
## STC2     1.000000e+00 2.890102e-03 8.557899e-04 5.767491e-03 2.170248e-05
## LOC96610 2.890102e-03 1.000000e+00 2.225817e-03 1.041126e-03 1.103988e-05
## APOD     8.557899e-04 2.225817e-03 1.000000e+00 2.281934e-04 7.573427e-06
## FADS2    5.767491e-03 1.041126e-03 2.281934e-04 1.000000e+00 3.570641e-05
## RPL9     2.170248e-05 1.103988e-05 7.573427e-06 3.570641e-05 1.000000e+00
## ADAM6    1.196107e-03 1.319133e-01 1.809321e-03 3.135948e-04 4.804337e-06

Because most clustering strategies are performed on distance (rather than similarities), we transform the TOM matrix into a dissimilarity matrix.

DIY: Hint: distance = 1 - similarity

dissTOM <- 1 - TOM
head(dissTOM[, c(1:5)])

##               STC2  LOC96610      APOD     FADS2      RPL9
## STC2     0.0000000 0.9971099 0.9991442 0.9942325 0.9999783
## LOC96610 0.9971099 0.0000000 0.9977742 0.9989589 0.9999890
## APOD     0.9991442 0.9977742 0.0000000 0.9997718 0.9999924
## FADS2    0.9942325 0.9989589 0.9997718 0.0000000 0.9999643
## RPL9     0.9999783 0.9999890 0.9999924 0.9999643 0.0000000
## ADAM6    0.9988039 0.8680867 0.9981907 0.9996864 0.9999952

7 Module detection

7.1 In summary

Modules are groups of highly-connected nodes (features). This step is composed of two substeps:

Clustering: getting correlated feature modules
- Use the hclust function to perform hierarchical clustering
- Display the dendrogram
- Use the Dynamic Tree Cut function to obtain the modules
Merging of modules: merging close modules
- If they are too similar, modules should be merged
- We compute module-module similarities by computing their eigenvectors (eigengene)
- We merge two modules if their eigenvectors are correlated (the authors suggest a pearson corr r > 0.75)

7.2 Clustering

First, a hierarchical clustering is performed to group genes based on the TOM dissimilarity matrix. The corresponding dendrogram is displayed into the Figure 7.1.

geneTree <- hclust(as.dist(dissTOM), method = "average")
plot(geneTree, xlab = "", sub = "", main = "Gene clustering on TOM-based dissimilarity", labels = FALSE, hang = 0.04)

Figure 7.1: Genes clustering using hierarchical clustering using the TOM dissimilarity matrix.

The leaves represent the features (genes). The dendrogram branches group together densely interconnected and highly correlated features.

DIY: To identify modules from dendrograms, several methods exist. Here, we use the cutreeDynamic function.

First, we have to set the minimum module size minClusterSize (i.e. minimum number of genes inside a group/module). The default is 20, but you can try several values and choose according to you results. There are other parameters you can play with, such as deepSplit. Look at the documentation to select the parameters.

minModuleSize <- 30
dynamicMods <- cutreeDynamic(dendro = geneTree, 
                             distM = dissTOM, 
                             deepSplit = 2, 
                             pamRespectsDendro = FALSE, 
                             minClusterSize = minModuleSize)

##  ..cutHeight not given, setting it to 0.997  ===>  99% of the (truncated) height range in dendro.
##  ..done.

table(dynamicMods)

## dynamicMods
##   0   1   2   3   4   5   6   7   8 
## 121 457 414 371 303 147  91  56  40

After running cutreeDynamic, we have obtained our correlation modules.

Note that module 0 is not a correlation module !! All unassigned nodes are automatically labelled 0.

DIY: Instead of using numbers as module names, we will use the function labels2colors to assign a color to each module. The biggest module is assigned turquoise.

dynamicColors <- labels2colors(dynamicMods)
table(dynamicColors)

## dynamicColors
##     black      blue     brown     green      grey      pink       red turquoise 
##        56       414       371       147       121        40        91       457 
##    yellow 
##       303

Note that module grey (module 0 at previous step) is not a correlation module !! All unassigned nodes are automatically labelled 0/grey.

Let’s visualize the obtained modules on the previous dendrogram:

plotDendroAndColors(dendro = geneTree, colors = dynamicColors, groupLabels = "Dynamic Tree Cut", 
        dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05, 
        main = "Gene dendrogram and module colors")

Modules identification using dynamic tree cut function. A color is assigned to each module. The grey one contains only unassigned genes.

Figure 7.2: Modules identification using dynamic tree cut function. A color is assigned to each module. The grey one contains only unassigned genes.

7.3 Merging modules

Sometimes, especially when dealing with smaller clusters (for instance when the minClusterSize parameter is not set properly), merging modules with similar expression profiles might be usefull.

To do so, we first compute each module eigenvectors. A module eigenvector (or eigengene if you are dealing with gene data) represents the mean expression profile of the whole corresponding module. We can use the correlation of modules eigengenes to define a module-similarity metric that we can use to merge close modules.

DIY: To compute the module eigenvectors, use the function moduleEigengenes. Order the obtained eigenvectors using orderMEs. This function reorders input eigenvectors based on their correlation.

MEList <- moduleEigengenes(dataset, colors = dynamicColors)
MEs <- MEList$eigengenes
head(MEs)

##          MEblack       MEblue       MEbrown      MEgreen      MEgrey
## A0SH -0.05397373 -0.001812781  8.704370e-02  0.009554327 -0.04249332
## A0SJ -0.02352202 -0.016655975  2.679801e-02  0.025707793  0.01779475
## A0SK -0.17404817 -0.127446084 -7.661646e-02 -0.008179712  0.06381009
## A0SO -0.07745815 -0.079496161 -1.054047e-01 -0.013849727  0.06230835
## A04N -0.03146649 -0.003822502  4.806086e-02  0.005912970  0.02994775
## A04P  0.01736130  0.027381270  9.912705e-05 -0.131843977  0.02085333
##           MEpink        MEred  MEturquoise    MEyellow
## A0SH -0.09733225 -0.009707618 -0.006016097 -0.09115749
## A0SJ -0.08105159  0.064320489  0.020362698 -0.00463625
## A0SK -0.03738968 -0.100005686 -0.089068340  0.11699671
## A0SO -0.07455363 -0.060439328 -0.088204094  0.09227040
## A04N  0.05760808  0.022801142  0.041043282 -0.04263349
## A04P  0.07390190  0.030873637 -0.092340083  0.03871340

MEs <- orderMEs(MEs)
head(MEs)

##         MEyellow       MEbrown        MEred     MEblack       MEblue
## A0SH -0.09115749  8.704370e-02 -0.009707618 -0.05397373 -0.001812781
## A0SJ -0.00463625  2.679801e-02  0.064320489 -0.02352202 -0.016655975
## A0SK  0.11699671 -7.661646e-02 -0.100005686 -0.17404817 -0.127446084
## A0SO  0.09227040 -1.054047e-01 -0.060439328 -0.07745815 -0.079496161
## A04N -0.04263349  4.806086e-02  0.022801142 -0.03146649 -0.003822502
## A04P  0.03871340  9.912705e-05  0.030873637  0.01736130  0.027381270
##           MEpink      MEgreen  MEturquoise      MEgrey
## A0SH -0.09733225  0.009554327 -0.006016097 -0.04249332
## A0SJ -0.08105159  0.025707793  0.020362698  0.01779475
## A0SK -0.03738968 -0.008179712 -0.089068340  0.06381009
## A0SO -0.07455363 -0.013849727 -0.088204094  0.06230835
## A04N  0.05760808  0.005912970  0.041043282  0.02994775
## A04P  0.07390190 -0.131843977 -0.092340083  0.02085333

DIY: Now let’s perform a hierarchical clustering using the correlation between module eigenvectors. Remember that for using hclust we need distances! Compute those distances and transform then with as.dist before running hclust.

MEDiss <- 1 - cor(MEs)
head(MEDiss)

##           MEyellow      MEbrown     MEred   MEblack       MEblue       MEpink
## MEyellow 0.0000000 1.314690e+00 1.4820148 1.0237382 9.691794e-01 1.067170e+00
## MEbrown  1.3146899 5.551115e-16 0.3435587 0.9154158 5.352808e-01 1.083306e+00
## MEred    1.4820148 3.435587e-01 0.0000000 0.9144473 6.159154e-01 9.229435e-01
## MEblack  1.0237382 9.154158e-01 0.9144473 0.0000000 4.880131e-01 9.557461e-01
## MEblue   0.9691794 5.352808e-01 0.6159154 0.4880131 3.330669e-16 9.807942e-01
## MEpink   1.0671697 1.083306e+00 0.9229435 0.9557461 9.807942e-01 7.771561e-16
##            MEgreen MEturquoise    MEgrey
## MEyellow 1.3131925   1.6050999 1.0337210
## MEbrown  0.9062175   1.0346148 0.9743145
## MEred    0.7740776   0.7698319 0.9730385
## MEblack  0.9683373   0.8827360 1.1884568
## MEblue   1.0837290   1.3047868 0.9926956
## MEpink   0.8743927   0.7882817 0.7369518

METree <- hclust(as.dist(MEDiss))

DIY: WGCNA’s authors recommend to merge cluster with a correlation greater than 0.75, hence a distance lower than 0.25. Let’s look at the tree with plot and abline, just as we did previously in this tutorial.

MEDissCut <- 0.25
plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")
abline(h = MEDissCut, col = "red")

In our case, we dont need our modules merged, but if you need it:

If you want to merge modules, you can use the mergeCloseModules function as below:

merge <- mergeCloseModules(dataset, dynamicColors, cutHeight = MEDissCut, verbose = 3)
mergedColors <- merge$colors
mergedMEs <- merge$newMEs
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
        c("Dynamic Tree Cut", "Merged dynamic"),
        dendroLabels = FALSE, hang = 0.03,
        addGuide = TRUE, guideHang = 0.05)

moduleColors <- mergedColors
colorOrder <- c("grey", standardColors(50))
moduleLabels <- match(moduleColors, colorOrder) - 1 
MEs <- mergedMEs

7.4 Save the results

DIY: Let’s save those clustering results! Choose your favorite export format (Rdata, csv, tsv, …) and save the results. Don’t forget to make sure you have all the information you need (in this case, we need the node names and their cluster labels).

moduleColors = dynamicColors # Or mergedColors, if some modules have been merged
names(moduleColors) = colnames(dataset)
head(moduleColors)

##        STC2    LOC96610        APOD       FADS2        RPL9       ADAM6 
## "turquoise"      "blue"       "red"    "yellow"      "grey"      "blue"

save(moduleColors, file = "results_TCGA/modules_mrna.rda")

DIY: We also need to save our module eigenvectors!

save(MEs, file = "results_TCGA/MEs_mrna.rda")

8 Module characterization

8.1 In summary

We identified several modules in the previous step. Now we want to characterize them:

Associations between modules and external information
- Are some modules associated to a trait of interest?
Associations between genes, external information and modules
- Gene Significance: association between features (genes for expression data) and external information
- Module membership: features correlation to each module eigenvector

For all of the above, we can use correlation!

8.2 Association between modules and phenotypic traits

We want to identify which modules are significantly associated with the (clinical) metadata we have. For this, we will use correlation.

For pre-processing numeric variables, a simple normalization should be enough.

While not ideal, for categorical variables, we will binarize the data using the binarizeCategoricalVariable. This function provides two options: includePairwise to compare all classes against all, and a includeLevelVsAll to compare each classes to all others. Here, to limit the number of associations to test, we will only include LevelVsAll comparisons.

However, note that binarization is not optimal, and other test might be more appropriate, such as multinomial logistic regression models.

DIY: First, scale numeric data, and put the results in a new dataframe.

metadataNum <- data.frame("AgeDiagnosis"=scale(metadata$AgeDiagnosis), row.names = rownames(metadata))

DIY: Now, binarize categorical data and add it to the new dataframe.

AnatomicNeoplasm = binarizeCategoricalVariable(metadata$AnatomicNeoplasm,includePairwise = FALSE, includeLevelVsAll = TRUE)
rownames(AnatomicNeoplasm) = rownames(metadata)
metadataNum = cbind(metadataNum, AnatomicNeoplasm)

Do the same for all categorial variables:

# Gender and vital status are already binary
metadataNum$Gender = ifelse(metadata$Gender=="female", 0, 1) 
metadataNum$VitalStatus = ifelse(metadata$VitalStatus=="alive", 1, 0) 

Subtype = binarizeCategoricalVariable(metadata$Subtype,includePairwise = FALSE, includeLevelVsAll = TRUE)
rownames(Subtype) = rownames(metadata)
metadataNum = cbind(metadataNum, Subtype)

# We also include a numeric representation of the subtypes
metadataNum$Subtype = as.numeric(metadata$Subtype)-1

Now we can compute correlations between the modules eigenvectors and the metadata!

DIY: Use the functions cor and corPvalueStudent to compute correlations and associated p-values.

moduleTraitCor <- cor(MEs, metadataNum)
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nSamples)

Let’s plot the results:

textMatrix <- paste(signif(moduleTraitCor, 2), "\n(", signif(moduleTraitPvalue, 1), ")", sep = "")
dim(textMatrix) <- dim(moduleTraitCor)
colnames(textMatrix) <- colnames(moduleTraitCor)
rownames(textMatrix) <- rownames(moduleTraitCor)
for(r in row.names(moduleTraitCor)){
  for(c in colnames(moduleTraitCor)){
    if(moduleTraitPvalue[r, c] > 0.05){
      moduleTraitCor[r, c] <- 0
      textMatrix[r, c] <- ""
    }
  }
}
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(metadataNum),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors =  colorRampPalette(c("darkgreen", "white", "darkmagenta"))(50),
               textMatrix = textMatrix,
               setStdMargins = TRUE,
               cex.text = 0.5,
               zlim = c(-1, 1),
               main = "Module-trait relationship")

8.3 Gene relationship to traits and modules

Two measures are computed to quantify the association between nodes, traits, and modules:

Module Membership (MM): correlation between the module eigengene and the features
Gene Significance (GS): correlation between the features and a selected trait of interest

DIY: Fist, we compute the Module Membership. You need the module eigenvectors (MEs) and the features from the input dataset. Compute correlations using the cor function. Then, compute p-values using the function corPvalueStudent. Store the results in a dataframe.

moduleMembership <- as.data.frame(cor(dataset, MEs))
MMPvalue <- as.data.frame(corPvalueStudent(as.matrix(moduleMembership), nSamples))

modNames <- substring(names(MEs), 3)
names(moduleMembership) <- paste("MM", modNames, sep = "")
names(MMPvalue) <- paste("p.MM", modNames, sep = "")

head(moduleMembership[,1:5])

##             MMyellow    MMbrown      MMred     MMblack       MMblue
## STC2     -0.43217262 0.04748568  0.1473983  0.03871435 -0.223619330
## LOC96610  0.23596084 0.18676692  0.1702518  0.31262959  0.640187631
## APOD     -0.23191026 0.32615271  0.4728507  0.15203720  0.188067120
## FADS2     0.37235530 0.01179123 -0.1840167  0.01959520  0.141079060
## RPL9     -0.09474407 0.03462862  0.1189073 -0.09396184  0.006553046
## ADAM6     0.11173668 0.18504141  0.2314137  0.26154606  0.539568707

head(MMPvalue[,1:5])

##            p.MMyellow    p.MMbrown      p.MMred    p.MMblack     p.MMblue
## STC2     1.239744e-18 3.572129e-01 4.079331e-03 4.529648e-01 1.138938e-05
## LOC96610 3.516884e-06 2.608997e-04 8.892539e-04 5.147623e-10 5.602733e-45
## APOD     5.209184e-06 8.102363e-11 1.874090e-22 3.042640e-03 2.358109e-04
## FADS2    7.081956e-14 8.192604e-01 3.223981e-04 7.041318e-01 6.004037e-03
## RPL9     6.575713e-02 5.020747e-01 2.075703e-02 6.802883e-02 8.989521e-01
## ADAM6    2.985413e-02 2.980548e-04 5.463571e-06 2.492590e-07 6.099177e-30

DIY: Very similarly, we can compute the node significance with respect to our traits of interest contained in the metadata. Compute the correlation and p-values between your dataset and the metadata using the sames functions as above.

geneTraitSignificance <- as.data.frame(cor(dataset, metadataNum, use = "p"))
GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
names(geneTraitSignificance) <- paste("GS", names(metadataNum), sep = "")
names(GSPvalue) <- paste("p.GS", names(metadataNum), sep = "")

DIY: We can visualize the results with a scatter plot. Pick a module and extract the set of nodes composing the selected module. Then pick a trait of interest. Plot the module membership values against the node significance using the verboseScatterplot function. You can do this for various modules and traits of interest.

module='turquoise'
moduleGenes = names(moduleColors)[which(moduleColors==module)]
geneTraitSignifColumn = "GSSubtype"
verboseScatterplot(abs(moduleMembership[moduleGenes, paste('MM', module, sep="")]),
                   abs(geneTraitSignificance[moduleGenes, geneTraitSignifColumn]),
                   xlab = paste("Module Membership (MM) in", module, "module"),
                   ylab = paste("Gene Significance (GS) for", geneTraitSignifColumn),
                   main = paste("Module Membership (MM) vs Gene Significance (GS)\n"),
                   cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

DIY: Now that we have computed our correlated modules and characterized them, we can export our results for other downstream analysis.

resultsModuleMembership = cbind(moduleMembership, MMPvalue)
mmNames = names(resultsModuleMembership)
mmNames = mmNames[order(gsub("p\\.", "", mmNames))]
resultsModuleMembership = resultsModuleMembership[,mmNames]

resultsSignificance = cbind(geneTraitSignificance, GSPvalue)
gsNames = names(resultsSignificance)
gsNames = gsNames[order(gsub("p\\.", "", gsNames))]
resultsSignificance = resultsSignificance[,gsNames]


results = data.frame("Gene"=names(moduleColors), "Module"=unlist(moduleColors), 
                     resultsModuleMembership, resultsSignificance)
results = results[order(results$Module),]
write.table(results, file="results_TCGA/TCGA_mRNA_results.tsv", sep="\t", row.names=F, col.names=T, quote=F)
head(results[,1:6])

##        Gene Module   MMblack     p.MMblack    MMblue     p.MMblue
## IFI6   IFI6  black 0.8573286 1.765762e-110 0.2630361 2.117437e-07
## ISG15 ISG15  black 0.8279306  1.804073e-96 0.2974812 3.665833e-09
## IFI27 IFI27  black 0.8207346  1.918631e-93 0.3937153 1.822846e-15
## MX1     MX1  black 0.8834454 7.349794e-126 0.4205653 1.237612e-17
## IFIT1 IFIT1  black 0.9117530 2.263150e-147 0.3267135 7.489300e-11
## HERC6 HERC6  black 0.7728591  2.880394e-76 0.4226841 8.185472e-18

9 Integrative analysis

After applying the pipeline for various modalities, we can use the module eigenvectors computed for each modality to perform an integrative analysis and identify correlated multi-omic modules.

DIY: Load the module eigenvectors obtain for the various modalities.

load("results_TCGA/MEs_mirna.rda")
mirna = MEs
names(mirna) = paste("mirna", names(mirna), sep="_")
load("results_TCGA/MEs_proteins.rda")
protein = MEs
names(protein) = paste("protein", names(protein), sep="_")
load("results_TCGA/MEs_mrna.rda")
mrna = MEs
names(mrna) = paste("mrna", names(mrna), sep="_")

DIY: Some samples might have been removed during the preprocessing step for some of the modalities. Just keep the samples that are present in all three datasets.

samples = Reduce("intersect", list(rownames(mirna), rownames(protein), rownames(mrna)))
MEs = do.call("cbind", list(mirna[samples,], mrna[samples,], protein[samples,]))

DIY: Now we can compute module-module correlations across modalities! Do it with the adjacency function used in the beginning of this tutorial.

softPower = 1 # this time, we don't need a scale-free topology, so we just set softPower to 1
similarityMat <- adjacency(MEs, power = softPower)

We can visualize the correlations using pheatmap.

DIY: First, create an annotation dataframe to indicate the provenance (i.e. modality) of each module.

annot = data.frame("Modality"=gsub("_.*", "", colnames(similarityMat)))
rownames(annot) = colnames(similarityMat)
head(annot)

##                   Modality
## mirna_MEyellow       mirna
## mirna_MEbrown        mirna
## mirna_MEturquoise    mirna
## mirna_MEblue         mirna
## mirna_MEgreen        mirna
## mirna_MEgrey         mirna

DIY: Now, plot the module correlation matrix using pheatmap. Look at the documentation for this function to know how to include your annotation dataframe, and play with the various parameters (e.g: clustering of the heatmap).

heatmap <- pheatmap(similarityMat, show_rownames = TRUE, show_colnames = TRUE, annotation_row = annot, annotation_col = annot, legend = T, cluster_cols = T, cutree_cols = 5, cluster_rows = T, cutree_rows  = 5)

DIY: Extract the clusters of correlated modules.

clust <- cbind(similarityMat, cluster = cutree(heatmap$tree_col, k = 5))
clust = clust[,"cluster"]
clust[order(clust)]

##      mirna_MEyellow       mirna_MEbrown   mirna_MEturquoise       mrna_MEyellow 
##                   1                   1                   1                   1 
##    mrna_MEturquoise     protein_MEbrown        mirna_MEblue        mrna_MEbrown 
##                   1                   1                   2                   2 
##          mrna_MEred protein_MEturquoise       mirna_MEgreen        mirna_MEgrey 
##                   2                   2                   3                   3 
##         mrna_MEpink         mrna_MEgrey      protein_MEgrey        mrna_MEblack 
##                   3                   3                   3                   4 
##         mrna_MEblue        mrna_MEgreen      protein_MEblue 
##                   4                   5                   5

Weighted Gene Co-expression Network Analysis (WGCNA)

Morgane Térézol - Galadriel Brière

March 23, 2024