---
title: "Weighted Gene Co-expression Network Analysis (WGCNA) on Breast Cancer data"
author: "Morgane Térézol - Anaïs Baudot"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: 
  rmdformats::material:
    use_bookdown: true
    thumbnails: false
    css: "ETBII_WGCNA.css"
header-includes:
  - \usepackage{subfig}
  - \usepackage{booktabs}
  - \usepackage{xcolor}
---

```{r setup, eval = TRUE, echo = FALSE}
workDir <- "/home/morgane/Documents/01_Projects/04_IFB/ETBII/etbii/THEMATIC-SESSIONS/Networks/WGCNA/"
workDir <- "/Morgane/Work/MMG/01_Projects/04_IFB/ETBII/etbii/THEMATIC-SESSIONS/Networks/WGCNA/"
knitr::opts_knit$set(root.dir = workDir) ## = setwd()
knitr::opts_chunk$set(cache = TRUE)
options(knitr.table.format = "html")
```

# Libraries and environment

## Load environment

Libraries used to create and generate this report: 

```{r librariesEnvironment, echo = FALSE, message = FALSE, warning = FALSE}
library("rmarkdown")
library("knitr")
library("rmdformats")
library("bookdown")
```

- R: ``r R.version$version.string``
- rmarkdown: ``r packageVersion("rmarkdown")``
- knitr: ``r packageVersion("knitr")``
- rmdformats: ``r packageVersion("rmdformats")``
- bookdown: ``r packageVersion("bookdown")``
- kableExtra: ``r packageVersion("kableExtra")``

## Load libraries

Libraries used to analyse data: 

```{r librariesAnalysis, echo = TRUE, message = FALSE, warning = FALSE}
library("WGCNA")
library("pheatmap")
```

- WGCNA: ``r packageVersion("WGCNA")``
- pheatmap: ``r packageVersion("pheatmap")``

## Cytoscape Apps

Cytoscape is used for visualization. You need to have the `Cytoscape v3.9.1` and a few Cytoscape apps: 

- yFiles Layout Algorithm: `1.1.2`
- LegendCreator: `1.1.6`
- stringApps: `1.5.1`
- Omics visualize: `1.3.0`


# General principle of the WGCNA method

Weighted Gene Co-expression Network Analysis (WGCNA) is a system biology approach dedicated to identifyong **correlation** patterns between features in a set of samples. In most cases, it is applied on gene expression data to find clusters (called **modules**) of highly correlated genes. 

WGCNA builds a **co-expressed gene** network: 

- each node corresponds to a gene
- two genes are connected by an edge if their expression are considered as significantly correlated
- it's an **undirected** network

WGCNA is the most famous approach to create gene co-expression networks. 


The main analysis steps are presented in the Figure \@ref(fig:WGCNAprinciple). 

1. First, it constructs a **gene co-expression network** based on correlation
2. Then,  it identifies **modules**, that are highly connected sub-networks
3. Next, modules are characterized: 
  - the modules are related to **external information**
  - the users can examine **module relationships**
  - the users can find **key driver** genes in modules of interest
4. Finally, the users can export the network and do some **visualization**  

```{r WGCNAprinciple, echo = FALSE, fig.align = "center", out.width = "100%", fig.cap = "WGCNA overview"}
include_graphics("01_Figures/WGCNA_Overview.png")
```

If you are interested in WGCNA, extensive tutorials are available online: 

1. [Data input and cleaning](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf)

2. [Network construction](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-02-networkConstr-man.pdf)

3. [Modules characterization](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-03-relateModsToExt.pdf)

4. [Interfacing network analysis with other data](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-04-Interfacing.pdf)

5. [Network visualisation](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-05-Visualization.pdf)

6. [Export to external visualisation software](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-06-ExportNetwork.pdf)

- Tutorial page: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html
- Publication: [PMID: 19114008](https://pubmed.ncbi.nlm.nih.gov/19114008//)

# Biological context

The Cancer Genome Atlas (TCGA) is a cancer genomic project that catalogs cancer-associated genomics profiles with several approaches (e.g. gene expression, copy number variation, DNA methylation, etc.). You can find the web site [here](https://portal.gdc.cancer.gov). 

```{r TCGA, echo = FALSE, fig.cap = "Screenshot of the TCGA main page", fig.align = "center", out.width = "100%"}
include_graphics("01_Figures/TCGA.png")
```

TCGA makes available a high amount of data. We choose to use expression genes data from breast cancer. Some clinical information are available too, as subtype of cancer, gender and so on. These are considered as metadata.

:::: {.blackbox data-latex=""}
::: {.center data-latex=""}
**Our objective is to build a co-expression network using expression data, identify modules of tighly co-expressed genes, and relate them to clinical metadata.**
::: 
::::

# Input data

## In summary

**Preparation** of the data is the first and most **important** step before running any analysis. Indeed, data need to be prepared correctly in order to extract relevant and pertinent information. 

1. **Normalization** of data according to the type of data
2. Removing **missing values**
3. Removing **outliers** based on sample hierarchical clustering 
4. Add external information 

Data must be **matrices** with specific shapes: 

- **samples** (e.g. patients, organisms ...) in **rows** 
- **variables** (e.g. genes, proteins ...) in **columns**

Different from the shape usually used! 

## Expression data

:::: {.blackbox data-latex=""}
::: {.center data-latex=""}
**Note of Caution!**
::: 
:::{.center data-latex=""}
We assume that the **data have already been normalized**. 
:::
::::

Data are downloaded from: [MiBiOmics gitlab](https://gitlab.univ-nantes.fr/combi-ls2n/mibiomics)

### Loading expression data

This dataset contains the expression data of different patient with breast cancer. 

```{r inputData}
exprData <- read.table(file = "00_Data/TCGA_Breast_mibiomics/TCGA_Breast_mrna.csv", head = TRUE, row.names = 1, sep = ",")
exprData[c(1:5), c(1:5)]
```

Data are composed of ``r nrow(exprData)`` genes and ``r ncol(exprData)`` patients. Pay attention to the **data shape**. 

```{r inputDataT}
exprData <- t(exprData)
exprData[c(1:5), c(1:5)]
```

## Removing missing values

You need to check if the dataset contains missing data. Genes and samples with too many missing data are removed and also genes with zero variance.

```{r checkNA, eval = TRUE, message = FALSE, results = "hide"}
gsg <- goodSamplesGenes(datExpr = exprData, verbose = 3)
gsg$allOK
```

If `gsg$allOK == TRUE` that means all genes and samples are kept.

*If you run WGCNA with other data and you have missing data, check the WGCNA tutorial to have the function to remove them.*  

```{r removeNA, echo = FALSE, eval = FALSE}
if(!gsg$allOK)
{
  if(sum(!gsg$goodGenes)>0)
    printFlush(paste("Removing genes:", paste(names(datExpr0[!gsg$goodGenes], collapse = ", "))))
  if(sum(!gsg$goodSamples)>0)
    printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")))
  datExpr0 <- datExpr0[gsg$goodSamples, gsg$goodGenes]
}
```

## Checking for outliers 

After removing missing values, outliers need to be check. 

1. An **euclidean distance** is calculated between each **pair of patients/samples**. 
2. Then, a **hierarchical clustering** is performed to group the patients/samples.  
3. The groups are visualized to identify potential outliers. 

```{r outlier}
patientsTree <- hclust(d = dist(exprData), method = "average")
```

The dendrogram of patients is displayed into the Figure \@ref(fig:dendrogramPatients). Two patients seem to be outliers. They can be removed by a cut height to 95 (red line in the dendogram). 

```{r dendrogramPatients, fig.cap = "Patients clustering using hierarchical clustering method, to detect outliers", fig.align = "center", out.width = "70%"}
plot(patientsTree, main = "Patients dendrogram", sub = "", xlab = "", cex.lab = 1, cex.axis = 1, cex.main = 1, cex = 0.5)
abline(h = 95, col = "red")
```

So, we decide to remove these two patients by cutting the tree in two parts. The first part contains the two outliers and the other part contains all the remaining patients/samples. 

```{r outlierClustering}
clust <- cutreeStatic(patientsTree, cutHeight = 95, minSize = 10)
table(clust)
```

We keep only patients that are in the second groups. 

```{r outlierRemoving}
keepPatients <- clust == 1
exprData <- exprData[keepPatients,]
nGenes <- ncol(exprData)
nPatients <- nrow(exprData)
```

Now, we have ``r nGenes`` genes and ``r nPatients`` patients. 

## Clinical data

We have access to clinical metadata information for every patient/sample. 

```{r clinicalData}
clinicalTCGA <- read.table(file = "00_Data/TCGA_Breast_mibiomics/TCGA_Breast_clinical.csv", 
                           sep = ",", head = TRUE, stringsAsFactors = TRUE)
```

We have both **categorical data** (e.g. subtype of cancer, gender) and **numerical data** (e.g. age of patient). 

```{r clinicalDataHead}
head(clinicalTCGA)
```

We have information for ``r nrow(clinicalTCGA)`` patients. 

The dimensions of the clinical dataframe and the expression dataframe have to be the same. We need to remove the outliers of the clincical dataframe. 
  
```{r clinicalRemoveOutliers}
patients <- row.names(exprData)
clinicalRows <- match(patients, clinicalTCGA$PatientIDs)
clinicalData <- clinicalTCGA[clinicalRows, -1]
rownames(clinicalData) <- clinicalTCGA[clinicalRows, 1]
head(clinicalData)
```

The expression data is in the variable `exprData` and the clinical information is in the variable `clinicalData`. We can visualize the relationships between  the clinical information and patients dendrogram (Figure \@ref(fig:dataDendrogram)). 

```{r dataDendrogram, fig.cap = "Patients clustering using hierarchical clustering method with clinical information", fig.align = "center", out.width = "70%"}
patientsTree2 <- hclust(d = dist(exprData), method = "average")
traitColors <- labels2colors(clinicalData)
traitColors[,1] <- numbers2colors(clinicalData[1])
plotDendroAndColors(dendro = patientsTree2, colors = traitColors, groupLabels = names(clinicalData), 
                    main = "Patients dendogram and trait heatmap",cex.dendroLabels = 0.5, cex.colorLabels = 0.5) 
```

# Construction of the gene co-expression network

## In summary

This first step of the analysis consists in creating the **gene co-expression network**. This first step is composed of tree main substeps: 

1. **Power** selection
    - **soft thresholding** function helps to pick-up a power for **weighted** networks
    - **hard thresholding** function helps to pick-up a power for **unweighted** networks
    - functions based on the criteria approximate scale-free topology
    - Elbow method like: a network is constructed for a range of power
2. **Similarity matrix** construction
    - first, **correlation** or distance are calculated between each **pair of genes**
    - you can choose correlation or distance mesure you want
    - then, correlation or distance are **transformed**, according to the targeted network type (e.g. unsigned, signed) and the chosen power
3. **Topological overlap matrix** construction (TOM)
    - similarity matrix doesn't take into account the neighbors
    - traduces level of overlapping between the neighboring of two genes

Skip this line if you run Rstudio or other third-party R environments. Allow multi-threading within WGCNA.

```{r parallel, echo = TRUE, eval = FALSE}
enableWGCNAThreads()
```

## Power selection

We want a **weighted network**, i.e., a weighted interaction between each pair of genes. So, as the authors suggested, we use the `pickSoftThreshold` function to select an appropriate power for the network construction. 

First, we define a range of powers.

```{r powerSelection}
powers <- c(c(1:10), seq(from = 12, to = 20, by = 2))
powers
```

Then, the `pickSoftThreshold` function analyses the network topology of our expression data for multiple powers. 

```{r thresholdCalculation, results = "hide"}
sft <- pickSoftThreshold(exprData, powerVector = powers, verbose = 5)
```

Different indices are calculated and then ploted in the Figure \@ref(fig:powerPlots). 

```{r powerPlots, out.width = "50%", fig.show = "hold", fig.cap = "Soft Threshold Indices. Left: Scale Independence plot - Right: Mean Connectivity plot", fig.align = "center"}
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2],
     xlab = "Soft Threshold power", ylab = "Scale Free Topology Model Fit, R²",
     type = "n", main = "Scale independence", cex.lab = 1.3)
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2], labels = powers, cex = 1, col = "black")
abline(h = 0.90, col = "red")
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab = "Soft Threshold power", ylab = "Mean connectivity", 
     type = "n", main = "Mean connectivity", cex.lab = 1.3)
text(sft$fitIndices[,1], sft$fitIndices[,5], labels = powers, cex = 1, col = "black")
```

The **Scale Independence plot** (Figure \@ref(fig:powerPlots) - left) shows the fit indices for scale free topology for each power. The authors recommend to select a power with a **R² > 0.8**. The red line corresponds to using an R² cut-off of R²=0.90.

The **Mean Connectivity plot** (Figure \@ref(fig:powerPlots) - right) shows the average of connectivity for each power. A balance needs to be identified between a fully connected network and too much disconnected edges. 

Here, we choose the **power 6**, which is the lowest power for which the scale-free topology fit index reaches 0.90. 

## Similarity matrix construction

The similarity matrix is constructed in two steps: 

1. **Correlation** between each pair of genes is calculated using Pearson's correlation. 
2. Then, correlations are **transformed** into a similarity matrix according the type of network we want to construct and the power choosen. By default, the constructed network is an **unsigned network** which means that correlations sign are not reported into the network. 

The **similarity matrix** is between 0 and 1 where 1 means strong similarity between two genes.   

```{r simMatrix}
softPower <- 6
similarityMat <- adjacency(exprData, power = softPower)
similarityMat[c(1:5), c(1:5)]
```

## Topological overlap matrix (TOM) {#TOM}

The **Similarity matrix** describes the similarity between each pair of genes based on correlations. This matrix is next transformed into a **Topological Overlap Matrix** that incorporates the level of overlapping between the neighboring of two genes. 

For instance, genes with high level of similarity are grouped together (value close to 1 in the similarity matrix). If their communities also highly overlap, these two genes will be closer. 

```{r TOM, message = "hold", warning = FALSE}
TOM <- TOMsimilarity(similarityMat)
colnames(TOM) <- colnames(similarityMat)
rownames(TOM) <- rownames(similarityMat)
TOM[c(1:5), c(1:5)]
```

The TOM is then transformed into a dissimilarity TOM: that means that two genes **highly similar** have a similarity value **close to 0**.

```{r dissTOM}
dissTOM <- 1 - TOM
dissTOM[c(1:5), c(1:5)]
```

The **TOM** matrix corresponds to the **co-expression genes network**. It's the basis of the analysis. 

# Module detection

## In summary

We constructed a **network of genes** with the corresponding TOM matrix that gives us the **weight** of connections. 
This network is used to detect **modules**. 

Modules are groups of highly-connected genes. This step is composed of three main substeps: 

1. **Clustering**
    - use **hierarchical clustering**
    - display the clustering using dendrogram
    - genes with **similar expression** and **similar neighborhood** community will be together
2. **Modules identification**
    - identification of groups of genes
    - use the `Dynamic Tree Cut` function
    - define a minimum size of modules
3. **Merging of modules**
    - if they are too close
    - eigengenes calculation of each module 
    - authors recommend the merging of modules with a pearson **corr r > 0.75**
    
## Clustering

First, a **hierarchical clustering** is performed to group genes based on the TOM dissimilarity matrix. The corresponding dendrogram is displayed into the Figure \@ref(fig:genesClustering).

```{r genesClustering, fig.cap = "Genes clustering using hierarchical clustering using the TOM dissimilarity matrix.", fig.align = "center", out.width = "70%"}
geneTree <- hclust(as.dist(dissTOM), method = "average")
plot(geneTree, xlab = "", sub = "", main = "Gene clustering on TOM-based dissimilarity", labels = FALSE, hang = 0.04)
```

The leaves represent the genes. The dendrogram branches group together densely interconnected and highly co-expressed genes. 

Visually, we can identify at least 5 or 6 modules. Let's see what the identification will find. 

## Module identification

A module is a group of genes. Several methods exist to identify them. Here, we use the **Dynamic Tree Cut** (from the `dynamicTreeCut` R package). 

First, we set the minimum module size (i.e. minimum number of genes inside a group/module) at `30`. 

```{r moduleIdentifiation, message = FALSE, warning = FALSE}
minModuleSize <- 30
dynamicMods <- cutreeDynamic(dendro = geneTree, 
                 distM = dissTOM, 
                 deepSplit = 2, 
                 pamRespectsDendro = FALSE, 
                 minClusterSize = minModuleSize)
```

The method identifies **5 modules**. The cluster 0 contains the unassigned genes. There is a big module (the first one) with `908` genes.

```{r moduleIdentifiationResults}
table(dynamicMods)
```

We assign a color to each module. The bigest module is assigned `turquoise`. 

```{r moduleIdentificationColors}
dynamicColors <- labels2colors(dynamicMods)
table(dynamicColors)
```

The identification results is displayed using a dendrogram and the assigned color in the Figure \@ref(fig:moduleIdentificationDendrogram). 

```{r moduleIdentificationDendrogram, fig.cap = "Modules identification using dynamic tree cut function. A color is assigned to each module. The grey one contains only unassigned genes.", fig.align = "center", out.width = "70%"}
plotDendroAndColors(dendro = geneTree, colors = dynamicColors, groupLabels = "Dynamic Tree Cut", 
        dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05, 
        main = "Gene dendrogram and module colors")
```

## Merging modules {#MEs}

It may be prudent to merge modules that have very similar expression profiles. 
To quantify co-expression similarity of entire modules, we calculate their **eigengenes** and **cluster** them based on their **correlation**.

First, we calculate the **eigengenes** of each module. Eigengene is the first principal component of a module (dimension reduction).

```{r MECalculation}
MEList <- moduleEigengenes(exprData, colors = dynamicColors)
MEs <- MEList$eigengenes
MEs <- orderMEs(MEs)
head(MEs)
```

Then, we calculate **correlation** between each pair of module (Pearson correlation). If modules are close/similar, the correlation will be closed to one. We compute a distance between modules. 

```{r MEDistance}
MEDiss <- 1 - cor(MEs)
head(MEDiss)
```

Finally, we perform a **hierarchical clustering** using this distance between modules. Authors recommend to merge cluster with a correlation greater than **0.75**, hence a distance lower than **0.25**. 

```{r MEClustering, fig.cap = "Modules clustering based on correlation. The red line corresponds to a correlation at 0.75.", fig.align = "center", out.width = "50%", fig.show = "hold"}
METree <- hclust(as.dist(MEDiss))
MEDissTree <- 0.25
plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")
abline(h = MEDissTree, col = "red")
```

**In our case, no modules need to be merged.** 

If you want to merge modules, you can use the `mergeCloseModules` function (see the tutorial for more information).

```{r MEMerging, eval = FALSE, echo = FALSE}
merge <- mergeCloseModules(exprData, dynamicColors, cutHeight = MEDissTree, verbose = 3)
mergedColors <- merge$colors
mergedMEs <- merge$newMEs
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
        c("Dynamic Tree Cut", "Merged dynamic"),
        dendroLabels = FALSE, hang = 0.03,
        addGuide = TRUE, guideHang = 0.05)
```

```{r MEMergingObject, eval = FALSE, echo = FALSE}
moduleColors <- mergedColors
colorOrder <- c("grey", standardColors(50))
moduleLables <- match(moduleColors, colorOrder) - 1 
MEs <- mergedMEs
```

# Module characterization


:::: {.blackbox data-latex=""}
::: {.center data-latex=""}
**Pay attention!**
::: 
:::{.center data-latex=""}
Use the right module clustering. If you merged modules, you may need to redo some step with the new set of modules, such as the eigengenes calculation.  
:::
::::


## In summary

**5 modules** have been identified in the previous step. They need to be **characterized** to be further analysed. 

- Association between modules and **external information**
    - continuous or categorical data? 
    - Pearson's correlation 
    - Which module can be interested to study in-depth?
- **Gene** relationship to external information and modules
    - Calculate two measurement: 
      - **Gene Significance**: link between genes and external information
      - **Module membership**: how the genes contribute to a module?
    - Based on correlation (Pearson)
    
## Association between modules and phenotypic traits

We want to identify which modules are significantly associated with the clinical metadata we have. 

For each patient/sample we have the following information: 

```{r, echo = FALSE}
head(clinicalData)
```

We have both categorical and continuous data. For the correlation calculation, clinical data need to be **numerical**.

```{r clinicalInfoNames}
clinicalDataNum <- clinicalData
clinicalDataNum$AnatomicNeoplasm <- as.numeric(clinicalData$AnatomicNeoplasm)
clinicalDataNum$Gender <- as.numeric(clinicalData$Gender)
clinicalDataNum$VitalStatus <- as.numeric(clinicalData$VitalStatus)
clinicalDataNum$Subtype <- as.numeric(clinicalData$Subtype)
```

Now, we can calculate the **Pearson correlations** between the identified modules and clinical information. P-values are then calculated to find the most significant correlations. 

```{r modulesTraitsAssociation}
moduleTraitCor <- cor(MEs, clinicalDataNum, use = "p")
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nPatients)
```

The correlation results are displayed into the Figure \@ref(fig:modulesTraitsAssociationPlot). For better visualization, non significant correlation are set to zero.

```{r corrSelection}
textMatrix <- paste(signif(moduleTraitCor, 2), "\n(", signif(moduleTraitPvalue, 1), ")", sep = "")
dim(textMatrix) <- dim(moduleTraitCor)
colnames(textMatrix) <- colnames(moduleTraitCor)
rownames(textMatrix) <- rownames(moduleTraitCor)
for(r in row.names(moduleTraitCor)){
  for(c in colnames(moduleTraitCor)){
    if(moduleTraitPvalue[r, c] > 0.05){
      moduleTraitCor[r, c] <- 0
      textMatrix[r, c] <- ""
      }
  }
}
```

```{r modulesTraitsAssociationPlot, fig.cap = "Correlation between identified modules and clinical information", fig.align = "center", out.width = "70%"}
labeledHeatmap(Matrix = moduleTraitCor,
   xLabels = names(clinicalDataNum),
   yLabels = names(MEs),
   ySymbols = names(MEs),
   colorLabels = FALSE,
   colors =  colorRampPalette(c("darkgreen", "white", "darkmagenta"))(50),
   textMatrix = textMatrix,
   setStdMargins = TRUE,
   cex.text = 0.8,
   zlim = c(-1, 1),
   main = "Module-trait relationship")
```

Legend:

  - White: no correlation
  - Darkpurple: strong positive correlation
  - Darkgreen: strong negative correlation

There is no correlation between modules and gender. But, we can see a **strong correlation** between the **turquoise** module and the breast cancer **subtype**. 

We will concentrate on `subtype` as the trait of interest for further analysis. 

## Gene relationship to traits and modules

Two measurements are calculated to quantify the association between genes and our trait of interest (here cancer subtypes) and the similarity of all genes to every module. 

These two measures are: 

- **Gene Significance** (GS): correlation between the gene and the selected trait
- **Module Membership** (MM): correlation between the module eigengene and the gene expression profile

First, we define a variable `subtype`:

```{r subtypesData}
subtype = as.data.frame(clinicalDataNum$Subtype)
names(subtype) <- "subtype"
```

Then, we calculate the **Module Membership** (MM) using a **Pearson correlation** between genes and identified modules. P-values are calculation for those correlations.

```{r MMCalculation}
modNames <- substring(names(MEs), 3)
geneModuleMembership <- as.data.frame(cor(exprData, MEs, use = "p"))
MMPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nPatients))
names(geneModuleMembership) <- paste("MM", modNames, sep = "")
names(MMPvalue) <- paste("p.MM", modNames, sep = "")
```

And finally, we calculate the **Gene Significance** (GS) for every genes. A **Pearson correlation** is calculated between genes and each cancer subtype. Then, a p-value is calculated for those correlations. 

```{r GSCalculation}
geneTraitSignificance <- as.data.frame(cor(exprData, subtype, use = "p"))
GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nPatients))
names(geneTraitSignificance) <- paste("GS", names(subtype), sep = "")
names(GSPvalue) <- paste("p.GS", names(subtype), sep = "")
```

A gene with a high MM correlation means a high membership to this module. 
A gene with a high GS correlation means a high association with the selected trait. 

These two measures help to identify genes that have a high significance for cancer subtypes as well as high module membership in interesting modules (e.g. turquoise module). A scatterplot of GS vs. MM in the turquoise module is displayed in the Figure  \@ref(fig:MMvsGSplots).

```{r MMvsGSplots, fig.show = "hold", fig.cap = "Scatterplot of Gene Significance (GS) for subtype vs Module Membership (MM) in turquoise module.", out.width = "50%", fig.align = "center"}
module <- "turquoise"
column <- match(module, modNames) 
moduleGenes <- dynamicColors == module
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
       abs(geneTraitSignificance[moduleGenes, 1]),
       xlab = paste("Module Membership (MM) in", module, "module"),
       ylab = "Gene Significance (GS) for breast cancer subtype",
       main = paste("Module Membership (MM) vs Gene Significance (GS)\n"),
       cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
```

GS and MM are highly correlated: genes highly significantly associated with a trait are often also the most important (central) elements of modules associated with the trait. 

We create a file that **summarizes** all results. Modules are ordered by their significance for subtype. 

```{r}
geneInfo0 <- data.frame(geneSymbol = row.names(geneTraitSignificance),
                        moduleColor = dynamicColors,
                        geneTraitSignificance,
                        GSPvalue)
modOrder <- order(-abs(cor(MEs, subtype, use = "p")))
for(mod in 1:ncol(geneModuleMembership))
{
oldNames <- names(geneInfo0)
geneInfo0 <- data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]],
              MMPvalue[, modOrder[mod]])
names(geneInfo0) <- c(oldNames, 
            paste("MM.", modNames[modOrder[mod]], sep = ""),
            paste("p.MM", modNames[modOrder[mod]], sep = ""))
}
geneOrder <- order(geneInfo0$moduleColor, -abs(geneInfo0$p.GSsubtype))
geneInfo <- geneInfo0[geneOrder, ]
write.csv(x = geneInfo, file = "02_Results/WGCNA_genesInfo.csv", quote = FALSE)
```

Results are stored into a csv file. 

## Enrichment analysis

Now, we have identified the contribution of each gene to each module (MM). You can now perform functional enrichment analyses. For instance, you can use gprofiler or STRING. You can also perform an gene set enrichment analysis because you have a weight of contribution for each gene. 

We will perform this analysis later directly in Cytoscape.

## Turquoise module 

The **turquoise** module is related to breast cancer subtypes. It could be interesting to look for the expression data of the genes inside this module. On the left in the Figure \@ref(fig:turquoiseHeatmap), you can see the heatmap of gene expression. 

```{r turquoiseHeatmap, fig.cap = "Expression heatmap of genes that are involved in the turquoise module", fig.align = "center", out.width = "50%", fig.show = "hold"}
selectedGenes <- geneInfo[geneInfo$moduleColor == "turquoise", ]$geneSymbol
turquoiseExprData <- t(exprData[,selectedGenes])
pheatmap(turquoiseExprData, show_rownames = FALSE, show_colnames = FALSE, annotation_col = clinicalData["Subtype"], 
         legend = FALSE)
heatmap <- pheatmap(turquoiseExprData, show_rownames = FALSE, show_colnames = FALSE, 
                    annotation_col = clinicalData["Subtype"], cutree_cols = 4)
```

On the right (Figure \@ref(fig:turquoiseHeatmap)), patients are grouped into 4 clusters based on the expression data. Colors represent the patient subtypes.  

Data are split into 4 groups and results are saved into a file. 

```{r}
heat_cluster <- as.data.frame(cutree(heatmap$tree_row, k = 4))
write.table(x = heat_cluster, file = "02_Results/TurquoiseModule_clusters.txt", quote = FALSE, sep = "\t", col.names = FALSE, row.names = TRUE)
```

# Network visualization

## In summary

Different visualizations are possible, depending on what you whish to focus on. 

- **Eigengenes** visualization
    - focus on modules
    - display modules and selected trait relationships 
    - need to calculate module eigengenes (section \@ref(MEs))
- **Genes** network visualization
    - TOM matrix (section \@ref(TOM))
    - display the heatmap
    - Cytoscape
    
    
## Eigengenes network

It could be interesting to study the relationship between the identified modules. You can generate a summary plot of the eigengenes network. It is usually informative to add a selected traits to the eigengenes to see how the traits fit into eigengenes network. 

```{r eigengeneNetwork, fig.show = "hold", fig.cap = "Visualisation of the eigengene network - Left: Eigengene dendrogram - Right: Eigengene similarity heatmap", out.width = "50%", fig.align = "center"}
MEs <- moduleEigengenes(exprData, dynamicColors)$eigengenes
MET <- orderMEs(cbind(MEs, subtype))
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0), plotHeatmaps =  FALSE)
plotEigengeneNetworks(MET, "Eigengene similarity heatmap", marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90)
```

| The Figure \@ref(fig:eigengeneNetwork) shows the eigengene dendogram and heatmap. They identify groups of correlated eigengenes.
| For instance, the turquoise module is highly correlated to the subtypes as we seen previously.

## Gene networks

| To visualize a weighted network, you need the TOM similarity between each pair of genes. The first way is the heatmap.
| To make moderately strong connections more visible, TOM is transformed using a power. Then, the diagonal are set to NA. 

```{r geneNetworkHeatmap, fig.cap = "Visualisation of Gene network using heatmap plot", fig.align = "center", fig.show = "hide"}
plotTOM <- -dissTOM^7
diag(plotTOM) <- NA
TOMplot(plotTOM, geneTree, dynamicColors, main = "Network heatmap plot, all genes")
```

There are 2 000 genes, so the heatmap will take time to be displayed. Here, we are displaying only the heatmap for the turquoise module. The heatmap is displayed in the Figure \@ref(fig:geneNetworkHeatmapTurquoise). 

```{r geneNetworkHeatmapTurquoise, fig.cap = "Visualisation of gene network for turquoise module using heatmap plot.", fig.align = "center"}
selectedGenes <- geneInfo[geneInfo$moduleColor == "turquoise", ]$geneSymbol
selectTOM <- dissTOM[selectedGenes, selectedGenes]
selectTree <- hclust(as.dist(selectTOM), method = "average")
plotDiss <- -selectTOM^7
diag(plotDiss) <- NA
TOMplot(plotDiss, selectTree, main = "Network heatmap plot, turquoise module")
```

Each row and column of the heatmap correspond to a single gene. Light colors represents low similarity and darker colors higher similarity. Blocks of darker colors along the diagonal are the modules. The gene dendogram and module assignment are shown along the left side and the top.

It might be interesting to cluster this turquoise module again. 

## Cytoscape

Cytoscape needs an edge file to build the network. We extracted the whole network. 

```{r export2Cytoscape}
cyt <- exportNetworkToCytoscape(TOM,
                    edgeFile = "02_Results/CytoscapeInput_edges.txt",
                    weighted = TRUE,
                    threshold = 0.05,
                    nodeNames = colnames(exprData))
```

In Cytoscape

1. Import co-expression network edges
    - File -> Import Network from File -> **CytoscapeInput_edges.txt**
      - Column 1: Source node
      - Column 2: Target node
      - Column 3: Edge attributes
      - Column 4: Interaction Type 
      - Columns 5-6: Not imported

2. Import **WGCNA_genesInfo.csv** 
    - File -> Import Table from File -> **WGCNA_genesInfo.csv**
      - Column 1: Key
      - Other columns: Attribute
      - **/!\ Import Data as Node Table Columns**

3. Import Table from File **TurquoiseModule_clusters.txt** too.
     - **Pay attention to the column names**
     
4. Color the genes according their module (Style tab)
    - Lock node width and height
    - Shape: Elipse
    - Fill Color: moduleColor

5. Select only turquoise module
    - Filter
      - Add column filter
      - Node: moduleColor -> turquoise
    - File -> New Network -> From Selected Nodes, All edges
    
6. Enrichment of this module
    - Apps -> STRING -> STRINGify network
    - geneSymbol and Homo sapiens
    - Change Style if you want
    - Apps -> STRING enrichment -> Retrieve functional enrichment (default)

7. New tab : STRING enrichment
    - Filter Enrichment table (e.g. GO:BP)
    - Draw charts

8. Color nodes according the clustering

```{r cytoscapeW, echo = FALSE, fig.align = "center", out.width = "50%", fig.cap = "Cytoscape", fig.show = "hold"}
include_graphics("03_Cytoscape/WGCNA_cytoscape_moduleColor.png")
include_graphics("03_Cytoscape/WGCNA_cytoscape_turquoiseModule.png")
```

As we have seen before, blue and green modules are close, and brown and yellow modules are close to each other too and a bit similar to the turquoise module. 


# Going further: additional potential studies

Now, you know how to perform a WGCNA analysis. 

It could be interesting to try several different things, and see how they can impact (or not) the results. Here are a list of potential additional studies:

**1. Categorical data information**

External information play key role in the WGCNA analysis. But to perform a correlation analysis you need numerical data, and sometimes external information are not numerical. 

As we did, you can just transform these categorical values into numbers. WGCNA's authors propose a function called `binarizeCategoricalVariable`. Maybe it could be interesting to use this function. 

**2. Try with another power**

In this training, we use a `power = 6` and we saw that the turquoise module maybe need to be split again, in a recursive clustering. 

**3. Try different correlation methods**

**4. Use the distance calculation instead of correlation to construct the network**

**5. Check other trait**

We focus our analysis on the turquoise module. It could be interesting to go in depth with other module related to other trait.

**6. Consensus WGCNA**

WGCNA can be perform on several data and make a consensus of the data. Data should be the same and the same type (e.g. same list of genes) with different condition: control vs patient or male vs female.

How can you design your analysis to perform a Consensus WGCNA ? 

To help you, see the part 2 of the tutorial page ([here](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html)).