1 Libraries and environment

1.1 Load environment

Libraries used to create and generate this report:

R: R version 4.1.0 (2021-05-18)
rmarkdown: 2.19
knitr: 1.41
rmdformats: 1.0.4
bookdown: 0.31
kableExtra: 1.3.4

1.2 Load libraries

Libraries used to analyse data:

library("WGCNA")
library("pheatmap")

WGCNA: 1.71
pheatmap: 1.0.12

1.3 Cytoscape Apps

Cytoscape is used for visualization. You need to have the Cytoscape v3.9.1 and a few Cytoscape apps:

yFiles Layout Algorithm: 1.1.2
LegendCreator: 1.1.6
stringApps: 1.5.1
Omics visualize: 1.3.0

2 General principle of the WGCNA method

Weighted Gene Co-expression Network Analysis (WGCNA) is a system biology approach dedicated to identifyong correlation patterns between features in a set of samples. In most cases, it is applied on gene expression data to find clusters (called modules) of highly correlated genes.

WGCNA builds a co-expressed gene network:

each node corresponds to a gene
two genes are connected by an edge if their expression are considered as significantly correlated
it’s an undirected network

WGCNA is the most famous approach to create gene co-expression networks.

The main analysis steps are presented in the Figure 2.1.

First, it constructs a gene co-expression network based on correlation
Then, it identifies modules, that are highly connected sub-networks
Next, modules are characterized:

the modules are related to external information
the users can examine module relationships
the users can find key driver genes in modules of interest

Finally, the users can export the network and do some visualization

Figure 2.1: WGCNA overview

If you are interested in WGCNA, extensive tutorials are available online:

Tutorial page: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html
Publication: PMID: 19114008

3 Biological context

The Cancer Genome Atlas (TCGA) is a cancer genomic project that catalogs cancer-associated genomics profiles with several approaches (e.g. gene expression, copy number variation, DNA methylation, etc.). You can find the web site here.

Figure 3.1: Screenshot of the TCGA main page

TCGA makes available a high amount of data. We choose to use expression genes data from breast cancer. Some clinical information are available too, as subtype of cancer, gender and so on. These are considered as metadata.

Our objective is to build a co-expression network using expression data, identify modules of tighly co-expressed genes, and relate them to clinical metadata.

4 Input data

4.1 In summary

Preparation of the data is the first and most important step before running any analysis. Indeed, data need to be prepared correctly in order to extract relevant and pertinent information.

Normalization of data according to the type of data
Removing missing values
Removing outliers based on sample hierarchical clustering
Add external information

Data must matrices with specific shapes:

samples (e.g. patients, organisms …) in rows
variables (e.g. genes, proteins …) in columns

Different from the shape usually used!

4.2 Expression data

Note of Caution!

We assume that the data have already been normalized.

Data are downloaded from: MiBiOmics gitlab

4.2.1 Loading expression data

This dataset contains the expression data of different patient with breast cancer.

exprData <- read.table(file = "00_Data/TCGA_Breast_mibiomics/TCGA_Breast_mrna.csv", head = TRUE, row.names = 1, sep = ",")
exprData[c(1:5), c(1:5)]

##              A0SH      A0SJ     A0SK     A0SO      A04N
## STC2     6.809810  9.802976 5.495795 2.926594 11.579173
## LOC96610 7.139342  9.538014 7.440094 8.674870  8.457158
## APOD     6.399172 10.667394 1.722198 4.771886  6.579026
## FADS2    5.316216  5.537015 8.555220 8.911656  6.459865
## RPL9     5.916535  3.204786 6.702598 6.122295  3.572641

Data are composed of 2000 genes and 379 patients. Pay attention to the data shape.

exprData <- t(exprData)
exprData[c(1:5), c(1:5)]

##           STC2 LOC96610      APOD    FADS2     RPL9
## A0SH  6.809810 7.139342  6.399172 5.316216 5.916535
## A0SJ  9.802976 9.538014 10.667394 5.537015 3.204786
## A0SK  5.495795 7.440094  1.722198 8.555220 6.702598
## A0SO  2.926594 8.674870  4.771886 8.911656 6.122295
## A04N 11.579173 8.457158  6.579026 6.459865 3.572641

4.3 Removing missing values

You need to check if the dataset contains missing data. Genes and samples with too many missing data are removed and also genes with zero variance.

gsg <- goodSamplesGenes(datExpr = exprData, verbose = 3)
gsg$allOK

If gsg$allOK == TRUE that means all genes and samples are kept.

If you run WGCNA with other data and you have missing data, check the WGCNA tutorial to have the function to remove them.

4.4 Checking for outliers

After removing missing values, outliers need to be check.

An euclidean distance is calculated between each pair of patients/samples.
Then, a hierarchical clustering is performed to group the patients/samples.
The groups are visualized to identify potential outliers.

patientsTree <- hclust(d = dist(exprData), method = "average")

The dendrogram of patients is displayed into the Figure 4.1. Two patients seem to be outliers. They can be removed by a cut height to 95 (red line in the dendogram).

plot(patientsTree, main = "Patients dendrogram", sub = "", xlab = "", cex.lab = 1, cex.axis = 1, cex.main = 1, cex = 0.5)
abline(h = 95, col = "red")

$Patients clustering using hierarchical clustering method, to detect outliers$

Figure 4.1: Patients clustering using hierarchical clustering method, to detect outliers

So, we decide to remove these two patients by cutting the tree in two parts. The first part contains the two outliers and the other part contains all the remaining patients/samples.

clust <- cutreeStatic(patientsTree, cutHeight = 95, minSize = 10)
table(clust)

## clust
##   0   1 
##   2 377

We keep only patients that are in the second groups.

keepPatients <- clust == 1
exprData <- exprData[keepPatients,]
nGenes <- ncol(exprData)
nPatients <- nrow(exprData)

Now, we have 2000 genes and 377 patients.

4.5 Clinical data

We have access to clinical metadata information for every patient/sample.

clinicalTCGA <- read.table(file = "00_Data/TCGA_Breast_mibiomics/TCGA_Breast_clinical.csv", 
                           sep = ",", head = TRUE, stringsAsFactors = TRUE)

We have both categorical data (e.g. subtype of cancer, gender) and numerical data (e.g. age of patient).

head(clinicalTCGA)

##   PatientIDs AgeDiagnosis           AnatomicNeoplasm Gender VitalStatus Subtype
## 1       A0SH           39  left upper inner quadrant female       alive    LumA
## 2       A0SJ           39                       left female       alive    LumA
## 3       A0SK           54 right upper outer quadrant female        dead   Basal
## 4       A0SO           67                      right female       alive   Basal
## 5       A04N           66                      right female       alive    LumA
## 6       A04P           36                       left female        dead   Basal

We have information for 379 patients.

The dimensions of the clinical dataframe and the expression dataframe have to be the same. We need to remove the outliers of the clincical dataframe.

patients <- row.names(exprData)
clinicalRows <- match(patients, clinicalTCGA$PatientIDs)
clinicalData <- clinicalTCGA[clinicalRows, -1]
rownames(clinicalData) <- clinicalTCGA[clinicalRows, 1]
head(clinicalData)

##      AgeDiagnosis           AnatomicNeoplasm Gender VitalStatus Subtype
## A0SH           39  left upper inner quadrant female       alive    LumA
## A0SJ           39                       left female       alive    LumA
## A0SK           54 right upper outer quadrant female        dead   Basal
## A0SO           67                      right female       alive   Basal
## A04N           66                      right female       alive    LumA
## A04P           36                       left female        dead   Basal

The expression data is in the variable exprData and the clinical information is in the variable clinicalData. We can visualize the relationships between the clinical information and patients dendrogram (Figure 4.2).

patientsTree2 <- hclust(d = dist(exprData), method = "average")
traitColors <- labels2colors(clinicalData)
traitColors[,1] <- numbers2colors(clinicalData[1])
plotDendroAndColors(dendro = patientsTree2, colors = traitColors, groupLabels = names(clinicalData), 
                    main = "Patients dendogram and trait heatmap",cex.dendroLabels = 0.5, cex.colorLabels = 0.5)

Figure 4.2: Patients clustering using hierarchical clustering method with clinical information

5 Construction of the gene co-expression network

5.1 In summary

This first step of the analysis consists in creating the gene co-expression network. This first step is composed of tree main substeps:

Power selection
- soft thresholding function helps to pick-up a power for weighted networks
- hard thresholding function helps to pick-up a power for unweighted networks
- functions based on the criteria approximate scale-free topology
- Elbow method like: a network is constructed for a range of power
Similarity matrix construction
- first, correlation or distance are calculated between each pair of genes
- you can choose correlation or distance mesure you want
- then, correlation or distance are transformed, according to the targeted network type (e.g. unsigned, signed) and the chosen power
Topological overlap matrix construction (TOM)
- similarity matrix doesn’t take into account the neighbors
- traduces level of overlapping between the neighboring of two genes

Skip this line if you run Rstudio or other third-party R environments. Allow multi-threading within WGCNA.

enableWGCNAThreads()

5.2 Power selection

We want a weighted network, i.e., a weighted interaction between each pair of genes. So, as the authors suggested, we use the pickSoftThreshold function to select an appropriate power for the network construction.

First, we define a range of powers.

powers <- c(c(1:10), seq(from = 12, to = 20, by = 2))
powers

##  [1]  1  2  3  4  5  6  7  8  9 10 12 14 16 18 20

Then, the pickSoftThreshold function analyses the network topology of our expression data for multiple powers.

sft <- pickSoftThreshold(exprData, powerVector = powers, verbose = 5)

## Warning: executing %dopar% sequentially: no parallel backend registered

Different indices are calculated and then ploted in the Figure 5.1.

plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2],
     xlab = "Soft Threshold power", ylab = "Scale Free Topology Model Fit, R²",
     type = "n", main = "Scale independence", cex.lab = 1.3)
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2], labels = powers, cex = 1, col = "black")
abline(h = 0.90, col = "red")
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab = "Soft Threshold power", ylab = "Mean connectivity", 
     type = "n", main = "Mean connectivity", cex.lab = 1.3)
text(sft$fitIndices[,1], sft$fitIndices[,5], labels = powers, cex = 1, col = "black")

Figure 5.1: Soft Threshold Indices. Left: Scale Independence plot - Right: Mean Connectivity plot

The Scale Independence plot (Figure 5.1 - left) shows the fit indices for scale free topology for each power. The authors recommend to select a power with a R² > 0.8. The red line corresponds to using an R² cut-off of R²=0.90.

The Mean Connectivity plot (Figure 5.1 - right) shows the average of connectivity for each power. A balance needs to be identified between a fully connected network and too much disconnected edges.

Here, we choose the power 6, which is the lowest power for which the scale-free topology fit index reaches 0.90.

5.3 Similarity matrix construction

The similarity matrix is constructed in two steps:

Correlation between each pair of genes is calculated using Pearson’s correlation.
Then, correlations are transformed into a similarity matrix according the type of network we want to construct and the power choosen. By default, the constructed network is an unsigned network which means that correlations sign are not reported into the network.

The similarity matrix is between 0 and 1 where 1 means strong similarity between two genes.

softPower <- 6
similarityMat <- adjacency(exprData, power = softPower)
similarityMat[c(1:5), c(1:5)]

##                  STC2     LOC96610         APOD        FADS2         RPL9
## STC2     1.000000e+00 9.800808e-04 6.282220e-10 1.020088e-03 1.212973e-09
## LOC96610 9.800808e-04 1.000000e+00 8.826956e-06 1.702183e-05 4.049903e-09
## APOD     6.282220e-10 8.826956e-06 1.000000e+00 2.134894e-08 7.213511e-13
## FADS2    1.020088e-03 1.702183e-05 2.134894e-08 1.000000e+00 3.873576e-06
## RPL9     1.212973e-09 4.049903e-09 7.213511e-13 3.873576e-06 1.000000e+00

5.4 Topological overlap matrix (TOM)

The Similarity matrix describes the similarity between each pair of genes based on correlations. This matrix is next transformed into a Topological Overlap Matrix that incorporates the level of overlapping between the neighboring of two genes.

For instance, genes with high level of similarity are grouped together (value close to 1 in the similarity matrix). If their communities also highly overlap, these two genes will be closer.

TOM <- TOMsimilarity(similarityMat)

## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.

colnames(TOM) <- colnames(similarityMat)
rownames(TOM) <- rownames(similarityMat)
TOM[c(1:5), c(1:5)]

##                  STC2     LOC96610         APOD        FADS2         RPL9
## STC2     1.000000e+00 1.409307e-03 1.216776e-04 2.430333e-03 1.172863e-06
## LOC96610 1.409307e-03 1.000000e+00 3.377810e-04 1.925742e-04 1.026022e-06
## APOD     1.216776e-04 3.377810e-04 1.000000e+00 3.177204e-05 3.227948e-07
## FADS2    2.430333e-03 1.925742e-04 3.177204e-05 1.000000e+00 4.802354e-06
## RPL9     1.172863e-06 1.026022e-06 3.227948e-07 4.802354e-06 1.000000e+00

The TOM is then transformed into a dissimilarity TOM: that means that two genes highly similar have a similarity value close to 0.

dissTOM <- 1 - TOM
dissTOM[c(1:5), c(1:5)]

##               STC2  LOC96610      APOD     FADS2      RPL9
## STC2     0.0000000 0.9985907 0.9998783 0.9975697 0.9999988
## LOC96610 0.9985907 0.0000000 0.9996622 0.9998074 0.9999990
## APOD     0.9998783 0.9996622 0.0000000 0.9999682 0.9999997
## FADS2    0.9975697 0.9998074 0.9999682 0.0000000 0.9999952
## RPL9     0.9999988 0.9999990 0.9999997 0.9999952 0.0000000

The TOM matrix corresponds to the co-expression genes network. It’s the basis of the analysis.

6 Module detection

6.1 In summary

We constructed a network of genes with the corresponding TOM matrix that gives us the weight of connections. This network is used to detect modules.

Modules are groups of highly-connected genes. This step is composed of three main substeps:

Clustering
- use hierarchical clustering
- display the clustering using dendrogram
- genes with similar expression and similar neighborhood community will be together
Modules identification
- identification of groups of genes
- use the Dynamic Tree Cut function
- define a minimum size of modules
** Merging of modules **
- if they are too close
- eigengenes calculation of each module
- authors recommend the merging of modules with a pearson corr r > 0.75

6.2 Clustering

First, a hierarchical clustering is performed to group genes based on the TOM dissimilarity matrix. The corresponding dendrogram is displayed into the Figure 6.1.

geneTree <- hclust(as.dist(dissTOM), method = "average")
plot(geneTree, xlab = "", sub = "", main = "Gene clustering on TOM-based dissimilarity", labels = FALSE, hang = 0.04)

Figure 6.1: Genes clustering using hierarchical clustering using the TOM dissimilarity matrix.

The leaves represent the genes. The dendrogram branches group together densely interconnected and highly co-expressed genes.

Visually, we can identify at least 5 or 6 modules. Let’s see what the identification will find.

6.3 Module identification

A module is a group of genes. Several methods exist to identify them. Here, we use the Dynamic Tree Cut (from the dynamicTreeCut R package).

First, we set the minimum module size (i.e. minimum number of genes inside a group/module) at 30.

minModuleSize <- 30
dynamicMods <- cutreeDynamic(dendro = geneTree, 
                 distM = dissTOM, 
                 deepSplit = 2, 
                 pamRespectsDendro = FALSE, 
                 minClusterSize = minModuleSize)

##  ..cutHeight not given, setting it to 0.997  ===>  99% of the (truncated) height range in dendro.
##  ..done.

The method identifies 5 modules. The cluster 0 contains the unassigned genes. There is a big module (the first one) with 908 genes.

table(dynamicMods)

## dynamicMods
##   0   1   2   3   4   5 
## 300 908 371 280  89  52

We assign a color to each module. The bigest module is assigned turquoise.

dynamicColors <- labels2colors(dynamicMods)
table(dynamicColors)

## dynamicColors
##      blue     brown     green      grey turquoise    yellow 
##       371       280        52       300       908        89

The identification results is displayed using a dendrogram and the assigned color in the Figure 6.2.

plotDendroAndColors(dendro = geneTree, colors = dynamicColors, groupLabels = "Dynamic Tree Cut", 
        dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05, 
        main = "Gene dendrogram and module colors")

Modules identification using dynamic tree cut function. A color is assigned to each module. The grey one contains only unassigned genes.

Figure 6.2: Modules identification using dynamic tree cut function. A color is assigned to each module. The grey one contains only unassigned genes.

6.4 Merging modules

It may be prudent to merge modules that have very similar expression profiles. To quantify co-expression similarity of entire modules, we calculate their eigengenes and cluster them based on their correlation.

First, we calculate the eigengenes of each module. Eigengene is the first principal component of a module (dimension reduction).

MEList <- moduleEigengenes(exprData, colors = dynamicColors)
MEs <- MEList$eigengenes
MEs <- orderMEs(MEs)
head(MEs)

##            MEblue     MEgreen MEturquoise     MEbrown    MEyellow      MEgrey
## A0SH -0.003763839 -0.05955536  0.02312983  0.10465496 -0.01212317 -0.05516091
## A0SJ -0.020917560 -0.02408432  0.01715609  0.02630589  0.06563727 -0.03604684
## A0SK -0.104121719 -0.12921101 -0.09806951 -0.07988094 -0.09619792  0.09957517
## A0SO -0.077953493 -0.06860749 -0.09164466 -0.10106987 -0.05746363  0.04673793
## A04N -0.007276244 -0.03752692  0.04488648  0.05265763  0.01783336  0.03871323
## A04P  0.024766791  0.01468890 -0.08919126 -0.01027211  0.02749866  0.04509383

Then, we calculate correlation between each pair of module (Pearson correlation). If modules are close/similar, the correlation will be closed to one. We compute a distance between modules.

MEDiss <- 1 - cor(MEs)
head(MEDiss)

##                   MEblue   MEgreen MEturquoise   MEbrown  MEyellow       MEgrey
## MEblue      2.220446e-16 0.5124083   1.2620744 0.6117870 0.6379130 9.251374e-01
## MEgreen     5.124083e-01 0.0000000   0.9700988 0.9681992 0.9719671 1.158490e+00
## MEturquoise 1.262074e+00 0.9700988   0.0000000 0.8456227 0.6656522 1.121457e+00
## MEbrown     6.117870e-01 0.9681992   0.8456227 0.0000000 0.3500885 1.034015e+00
## MEyellow    6.379130e-01 0.9719671   0.6656522 0.3500885 0.0000000 8.944425e-01
## MEgrey      9.251374e-01 1.1584903   1.1214573 1.0340146 0.8944425 2.220446e-16

Finally, we perform a hierarchical clustering using this distance between modules. Authors recommend to merge cluster with a correlation greater than 0.75, hence a distance lower than 0.25.

METree <- hclust(as.dist(MEDiss))
MEDissTree <- 0.25
plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")
abline(h = MEDissTree, col = "red")

Figure 6.3: Modules clustering based on correlation. The red line corresponds to a correlation at 0.75.

In our case, no modules need to be merged.

If you want to merge modules, you can use the mergeCloseModules function (see the tutorial for more information).

7 Module characterization

Pay attention!

Use the right module clustering. If you merged modules, you may need to redo some step with the new set of modules, such as the eigengenes calculation.

7.1 In summary

5 modules have been identified in the previous step. They need to be characterized to be further analysed.

Association between modules and external information
- continuous or categorical data?
- Pearson’s correlation
- Which module can be interested to study in-depth?
Gene relationship to external information and modules
- Calculate two measurement:
  - Gene Significance: link between genes and external information
  - Module membership: how the genes contribute to a module?
- Based on correlation (Pearson)

7.2 Association between modules and phenotypic traits

We want to identify which modules are significantly associated with the clinical metadata we have.

For each patient/sample we have the following information:

##      AgeDiagnosis           AnatomicNeoplasm Gender VitalStatus Subtype
## A0SH           39  left upper inner quadrant female       alive    LumA
## A0SJ           39                       left female       alive    LumA
## A0SK           54 right upper outer quadrant female        dead   Basal
## A0SO           67                      right female       alive   Basal
## A04N           66                      right female       alive    LumA
## A04P           36                       left female        dead   Basal

We have both categorical and continuous data. For the correlation calculation, clinical data need to be numerical.

clinicalDataNum <- clinicalData
clinicalDataNum$AnatomicNeoplasm <- as.numeric(clinicalData$AnatomicNeoplasm)
clinicalDataNum$Gender <- as.numeric(clinicalData$Gender)
clinicalDataNum$VitalStatus <- as.numeric(clinicalData$VitalStatus)
clinicalDataNum$Subtype <- as.numeric(clinicalData$Subtype)

Now, we can calculate the Pearson correlations between the identified modules and clinical information. P-values are then calculated to find the most significant correlations.

moduleTraitCor <- cor(MEs, clinicalDataNum, use = "p")
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nPatients)

The correlation results are displayed into the Figure 7.1. For better visualization, non significant correlation are set to zero.

textMatrix <- paste(signif(moduleTraitCor, 2), "\n(", signif(moduleTraitPvalue, 1), ")", sep = "")
dim(textMatrix) <- dim(moduleTraitCor)
colnames(textMatrix) <- colnames(moduleTraitCor)
rownames(textMatrix) <- rownames(moduleTraitCor)
for(r in row.names(moduleTraitCor)){
  for(c in colnames(moduleTraitCor)){
    if(moduleTraitPvalue[r, c] > 0.05){
      moduleTraitCor[r, c] <- 0
      textMatrix[r, c] <- ""
      }
  }
}

labeledHeatmap(Matrix = moduleTraitCor,
   xLabels = names(clinicalDataNum),
   yLabels = names(MEs),
   ySymbols = names(MEs),
   colorLabels = FALSE,
   colors =  colorRampPalette(c("darkgreen", "white", "darkmagenta"))(50),
   textMatrix = textMatrix,
   setStdMargins = TRUE,
   cex.text = 0.8,
   zlim = c(-1, 1),
   main = "Module-trait relationship")

Figure 7.1: Correlation between identified modules and clinical information

Legend:

White: no correlation
Darkpurple: strong positive correlation
Darkgreen: strong negative correlation

There is no correlation between modules and gender. But, we can see a strong correlation between the turquoise module and the breast cancer subtype.

We will concentrate on subtype as the trait of interest for further analysis.

7.3 Gene relationship to traits and modules

Two measurements are calculated to quantify the association between genes and our trait of interest (here cancer subtypes) and the similarity of all genes to every module.

These two measures are:

Gene Significance (GS): correlation between the gene and the selected trait
Module Membership (MM): correlation between the module eigengene and the gene expression profile

First, we define a variable subtype:

subtype = as.data.frame(clinicalDataNum$Subtype)
names(subtype) <- "subtype"

Then, we calculate the Module Membership (MM) using a Pearson correlation between genes and identified modules. P-values are calculation for those correlations.

modNames <- substring(names(MEs), 3)
geneModuleMembership <- as.data.frame(cor(exprData, MEs, use = "p"))
MMPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nPatients))
names(geneModuleMembership) <- paste("MM", modNames, sep = "")
names(MMPvalue) <- paste("p.MM", modNames, sep = "")

And finally, we calculate the Gene Significance (GS) for every genes. A Pearson correlation is calculated between genes and each cancer subtype. Then, a p-value is calculated for those correlations.

geneTraitSignificance <- as.data.frame(cor(exprData, subtype, use = "p"))
GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nPatients))
names(geneTraitSignificance) <- paste("GS", names(subtype), sep = "")
names(GSPvalue) <- paste("p.GS", names(subtype), sep = "")

A gene with a high MM correlation means a high membership to this module. A gene with a high GS correlation means a high association with the selected trait.

These two measures help to identify genes that have a high significance for cancer subtypes as well as high module membership in interesting modules (e.g. turquoise module). A scatterplot of GS vs. MM in the turquoise module is displayed in the Figure 7.2.

module <- "turquoise"
column <- match(module, modNames) 
moduleGenes <- dynamicColors == module
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
       abs(geneTraitSignificance[moduleGenes, 1]),
       xlab = paste("Module Membership (MM) in", module, "module"),
       ylab = "Gene Significance (GS) for breast cancer subtype",
       main = paste("Module Membership (MM) vs Gene Significance (GS)\n"),
       cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

Figure 7.2: Scatterplot of Gene Significance (GS) for subtype vs Module Membership (MM) in turquoise module.

GS and MM are highly correlated: genes highly significantly associated with a trait are often also the most important (central) elements of modules associated with the trait.

We create a file that summarizes all results. Modules are ordered by their significance for subtype.

geneInfo0 <- data.frame(geneSymbol = row.names(geneTraitSignificance),
                        moduleColor = dynamicColors,
                        geneTraitSignificance,
                        GSPvalue)
modOrder <- order(-abs(cor(MEs, subtype, use = "p")))
for(mod in 1:ncol(geneModuleMembership))
{
oldNames <- names(geneInfo0)
geneInfo0 <- data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]],
              MMPvalue[, modOrder[mod]])
names(geneInfo0) <- c(oldNames, 
            paste("MM.", modNames[modOrder[mod]], sep = ""),
            paste("p.MM", modNames[modOrder[mod]], sep = ""))
}
geneOrder <- order(geneInfo0$moduleColor, -abs(geneInfo0$p.GSsubtype))
geneInfo <- geneInfo0[geneOrder, ]
write.csv(x = geneInfo, file = "02_Results/WGCNA_genesInfo.csv", quote = FALSE)

Results are stored into a csv file.

7.4 Enrichment analysis

Now, we have identified the contribution of each gene to each module (MM). You can now perform functional enrichment analyses. For instance, you can use gprofiler or STRING. You can also perform an gene set enrichment analysis because you have a weight of contribution for each gene.

We will perform this analysis later directly in Cytoscape.

7.5 Turquoise module

The turquoise module is related to breast cancer subtypes. It could be interesting to look for the expression data of the genes inside this module. On the left in the Figure 7.3, you can see the heatmap of gene expression.

selectedGenes <- geneInfo[geneInfo$moduleColor == "turquoise", ]$geneSymbol
turquoiseExprData <- t(exprData[,selectedGenes])
pheatmap(turquoiseExprData, show_rownames = FALSE, show_colnames = FALSE, annotation_col = clinicalData["Subtype"], 
         legend = FALSE)
heatmap <- pheatmap(turquoiseExprData, show_rownames = FALSE, show_colnames = FALSE, 
                    annotation_col = clinicalData["Subtype"], cutree_cols = 4)

Figure 7.3: Expression heatmap of genes that are involved in the turquoise module

On the right (Figure 7.3), patients are grouped into 4 clusters based on the expression data. Colors represent the patient subtypes.

Data are split into 4 groups and results are saved into a file.

heat_cluster <- as.data.frame(cutree(heatmap$tree_row, k = 4))
write.table(x = heat_cluster, file = "02_Results/TurquoiseModule_clusters.txt", quote = FALSE, sep = "\t", col.names = FALSE, row.names = TRUE)

8 Network visualization

8.1 In summary

Different visualizations are possible, depending on what you whish to focus on.

Eigengenes visualization
- focus on modules
- display modules and selected trait relationships
- need to calculate module eigengenes (section 6.4)
Genes network visualization
- TOM matrix (section 5.4)
- display the heatmap
- Cytoscape

8.2 Eigengenes network

It could be interesting to study the relationship between the identified modules. You can generate a summary plot of the eigengenes network. It is usually informative to add a selected traits to the eigengenes to see how the traits fit into eigengenes network.

MEs <- moduleEigengenes(exprData, dynamicColors)$eigengenes
MET <- orderMEs(cbind(MEs, subtype))
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0), plotHeatmaps =  FALSE)
plotEigengeneNetworks(MET, "Eigengene similarity heatmap", marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90)

Figure 8.1: Visualisation of the eigengene network - Left: Eigengene dendrogram - Right: Eigengene similarity heatmap

The Figure 8.1 shows the eigengene dendogram and heatmap. They identify groups of correlated eigengenes.
For instance, the turquoise module is highly correlated to the subtypes as we seen previously.

8.3 Gene networks

To visualize a weighted network, you need the TOM similarity between each pair of genes. The first way is the heatmap.
To make moderately strong connections more visible, TOM is transformed using a power. Then, the diagonal are set to NA.

plotTOM <- -dissTOM^7
diag(plotTOM) <- NA
TOMplot(plotTOM, geneTree, dynamicColors, main = "Network heatmap plot, all genes")

There are 2 000 genes, so the heatmap will take time to be displayed. Here, we are displaying only the heatmap for the turquoise module. The heatmap is displayed in the Figure 8.2.

selectedGenes <- geneInfo[geneInfo$moduleColor == "turquoise", ]$geneSymbol
selectTOM <- dissTOM[selectedGenes, selectedGenes]
selectTree <- hclust(as.dist(selectTOM), method = "average")
plotDiss <- -selectTOM^7
diag(plotDiss) <- NA
TOMplot(plotDiss, selectTree, main = "Network heatmap plot, turquoise module")

Figure 8.2: Visualisation of gene network for turquoise module using heatmap plot.

Each row and column of the heatmap correspond to a single gene. Light colors represents low similarity and darker colors higher similarity. Blocks of darker colors along the diagonal are the modules. The gene dendogram and module assignment are shown along the left side and the top.

It might be interesting to cluster this turquoise module again.

8.4 Cytoscape

Cytoscape needs an edge file to build the network. We extracted the whole network.

cyt <- exportNetworkToCytoscape(TOM,
                    edgeFile = "02_Results/CytoscapeInput_edges.txt",
                    weighted = TRUE,
                    threshold = 0.05,
                    nodeNames = colnames(exprData))

In Cytoscape

Import co-expression network edges
- File -> Import Network from File -> CytoscapeInput_edges.txt
  - Column 1: Source node
  - Column 2: Target node
  - Column 3: Edge attributes
  - Column 4: Interaction Type
  - Columns 5-6: Not imported
Import WGCNA_genesInfo.csv
- File -> Import Table from File -> WGCNA_genesInfo.csv
  - Column 1: Key
  - Other columns: Attribute
  - /! Import Data as Node Table Columns
Import Table from File TurquoiseModule_clusters.txt too.
- Pay attention to the column names
Color the genes according their module (Style tab)
- Lock node width and height
- Shape: Elipse
- Fill Color: moduleColor
Select only turquoise module
- Filter
  - Add column filter
  - Node: moduleColor -> turquoise
- File -> New Network -> From Selected Nodes, All edges
Enrichment of this module
- Apps -> STRING -> STRINGify network
- geneSymbol and Homo sapiens
- Change Style if you want
- Apps -> STRING enrichment -> Retrieve functional enrichment (default)
New tab : STRING enrichment
- Filter Enrichment table (e.g. GO:BP)
- Draw charts
Color nodes according the clustering

Figure 8.3: Cytoscape

As we have seen before, blue and green modules are close, and brown and yellow modules are close to each other too and a bit similar to the turquoise module.

9 Going further: additional potential studies

Now, you know how to perform a WGCNA analysis.

It could be interesting to try several different things, and see how they can impact (or not) the results. Here are a list of potential additional studies:

1. Categorical data information

External information play key role in the WGCNA analysis. But to perform a correlation analysis you need numerical data, and sometimes external information are not numerical.

As we did, you can just transform these categorical values into numbers. WGCNA’s authors propose a function called binarizeCategoricalVariable. Maybe it could be interesting to use this function.

2. Try with another power

In this training, we use a power = 6 and we saw that the turquoise module maybe need to be split again, in a recursive clustering.

3. Try different correlation methods

4. Use the distance calculation instead of correlation to construct the network

5. Check other trait

We focus our analysis on the turquoise module. It could be interesting to go in depth with other module related to other trait.

6. Consensus WGCNA

WGCNA can be perform on several data and make a consensus of the data. Data should be the same and the same type (e.g. same list of genes) with different condition: control vs patient or male vs female.

How can you design your analysis to perform a Consensus WGCNA ?

To help you, see the part 2 of the tutorial page (here).

Weighted Gene Co-expression Network Analysis (WGCNA) on Breast Cancer data

Morgane Térézol - Anaïs Baudot

janvier 16, 2023