Libraries used to create and generate this report:
R version 4.1.0 (2021-05-18)
2.19
1.41
1.0.4
0.31
1.3.4
Libraries used to analyse data:
library("WGCNA")
library("pheatmap")
1.71
1.0.12
Cytoscape is used for visualization. You need to have the Cytoscape v3.9.1
and a few Cytoscape apps:
1.1.2
1.1.6
1.5.1
1.3.0
Weighted Gene Co-expression Network Analysis (WGCNA) is a system biology approach dedicated to identifyong correlation patterns between features in a set of samples. In most cases, it is applied on gene expression data to find clusters (called modules) of highly correlated genes.
WGCNA builds a co-expressed gene network:
WGCNA is the most famous approach to create gene co-expression networks.
The main analysis steps are presented in the Figure 2.1.
If you are interested in WGCNA, extensive tutorials are available online:
The Cancer Genome Atlas (TCGA) is a cancer genomic project that catalogs cancer-associated genomics profiles with several approaches (e.g. gene expression, copy number variation, DNA methylation, etc.). You can find the web site here.
TCGA makes available a high amount of data. We choose to use expression genes data from breast cancer. Some clinical information are available too, as subtype of cancer, gender and so on. These are considered as metadata.
Our objective is to build a co-expression network using expression data, identify modules of tighly co-expressed genes, and relate them to clinical metadata.
Preparation of the data is the first and most important step before running any analysis. Indeed, data need to be prepared correctly in order to extract relevant and pertinent information.
Data must matrices with specific shapes:
Different from the shape usually used!
Note of Caution!
We assume that the data have already been normalized.
Data are downloaded from: MiBiOmics gitlab
This dataset contains the expression data of different patient with breast cancer.
<- read.table(file = "00_Data/TCGA_Breast_mibiomics/TCGA_Breast_mrna.csv", head = TRUE, row.names = 1, sep = ",")
exprData c(1:5), c(1:5)] exprData[
## A0SH A0SJ A0SK A0SO A04N
## STC2 6.809810 9.802976 5.495795 2.926594 11.579173
## LOC96610 7.139342 9.538014 7.440094 8.674870 8.457158
## APOD 6.399172 10.667394 1.722198 4.771886 6.579026
## FADS2 5.316216 5.537015 8.555220 8.911656 6.459865
## RPL9 5.916535 3.204786 6.702598 6.122295 3.572641
Data are composed of 2000
genes and 379
patients. Pay attention to the data shape.
<- t(exprData)
exprData c(1:5), c(1:5)] exprData[
## STC2 LOC96610 APOD FADS2 RPL9
## A0SH 6.809810 7.139342 6.399172 5.316216 5.916535
## A0SJ 9.802976 9.538014 10.667394 5.537015 3.204786
## A0SK 5.495795 7.440094 1.722198 8.555220 6.702598
## A0SO 2.926594 8.674870 4.771886 8.911656 6.122295
## A04N 11.579173 8.457158 6.579026 6.459865 3.572641
You need to check if the dataset contains missing data. Genes and samples with too many missing data are removed and also genes with zero variance.
<- goodSamplesGenes(datExpr = exprData, verbose = 3)
gsg $allOK gsg
If gsg$allOK == TRUE
that means all genes and samples are kept.
If you run WGCNA with other data and you have missing data, check the WGCNA tutorial to have the function to remove them.
After removing missing values, outliers need to be check.
<- hclust(d = dist(exprData), method = "average") patientsTree
The dendrogram of patients is displayed into the Figure 4.1. Two patients seem to be outliers. They can be removed by a cut height to 95 (red line in the dendogram).
plot(patientsTree, main = "Patients dendrogram", sub = "", xlab = "", cex.lab = 1, cex.axis = 1, cex.main = 1, cex = 0.5)
abline(h = 95, col = "red")
So, we decide to remove these two patients by cutting the tree in two parts. The first part contains the two outliers and the other part contains all the remaining patients/samples.
<- cutreeStatic(patientsTree, cutHeight = 95, minSize = 10)
clust table(clust)
## clust
## 0 1
## 2 377
We keep only patients that are in the second groups.
<- clust == 1
keepPatients <- exprData[keepPatients,]
exprData <- ncol(exprData)
nGenes <- nrow(exprData) nPatients
Now, we have 2000
genes and 377
patients.
We have access to clinical metadata information for every patient/sample.
<- read.table(file = "00_Data/TCGA_Breast_mibiomics/TCGA_Breast_clinical.csv",
clinicalTCGA sep = ",", head = TRUE, stringsAsFactors = TRUE)
We have both categorical data (e.g. subtype of cancer, gender) and numerical data (e.g. age of patient).
head(clinicalTCGA)
## PatientIDs AgeDiagnosis AnatomicNeoplasm Gender VitalStatus Subtype
## 1 A0SH 39 left upper inner quadrant female alive LumA
## 2 A0SJ 39 left female alive LumA
## 3 A0SK 54 right upper outer quadrant female dead Basal
## 4 A0SO 67 right female alive Basal
## 5 A04N 66 right female alive LumA
## 6 A04P 36 left female dead Basal
We have information for 379
patients.
The dimensions of the clinical dataframe and the expression dataframe have to be the same. We need to remove the outliers of the clincical dataframe.
<- row.names(exprData)
patients <- match(patients, clinicalTCGA$PatientIDs)
clinicalRows <- clinicalTCGA[clinicalRows, -1]
clinicalData rownames(clinicalData) <- clinicalTCGA[clinicalRows, 1]
head(clinicalData)
## AgeDiagnosis AnatomicNeoplasm Gender VitalStatus Subtype
## A0SH 39 left upper inner quadrant female alive LumA
## A0SJ 39 left female alive LumA
## A0SK 54 right upper outer quadrant female dead Basal
## A0SO 67 right female alive Basal
## A04N 66 right female alive LumA
## A04P 36 left female dead Basal
The expression data is in the variable exprData
and the clinical information is in the variable clinicalData
. We can visualize the relationships between the clinical information and patients dendrogram (Figure 4.2).
<- hclust(d = dist(exprData), method = "average")
patientsTree2 <- labels2colors(clinicalData)
traitColors 1] <- numbers2colors(clinicalData[1])
traitColors[,plotDendroAndColors(dendro = patientsTree2, colors = traitColors, groupLabels = names(clinicalData),
main = "Patients dendogram and trait heatmap",cex.dendroLabels = 0.5, cex.colorLabels = 0.5)
This first step of the analysis consists in creating the gene co-expression network. This first step is composed of tree main substeps:
Skip this line if you run Rstudio or other third-party R environments. Allow multi-threading within WGCNA.
enableWGCNAThreads()
We want a weighted network, i.e., a weighted interaction between each pair of genes. So, as the authors suggested, we use the pickSoftThreshold
function to select an appropriate power for the network construction.
First, we define a range of powers.
<- c(c(1:10), seq(from = 12, to = 20, by = 2))
powers powers
## [1] 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
Then, the pickSoftThreshold
function analyses the network topology of our expression data for multiple powers.
<- pickSoftThreshold(exprData, powerVector = powers, verbose = 5) sft
## Warning: executing %dopar% sequentially: no parallel backend registered
Different indices are calculated and then ploted in the Figure 5.1.
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2],
xlab = "Soft Threshold power", ylab = "Scale Free Topology Model Fit, R²",
type = "n", main = "Scale independence", cex.lab = 1.3)
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])* sft$fitIndices[,2], labels = powers, cex = 1, col = "black")
abline(h = 0.90, col = "red")
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab = "Soft Threshold power", ylab = "Mean connectivity",
type = "n", main = "Mean connectivity", cex.lab = 1.3)
text(sft$fitIndices[,1], sft$fitIndices[,5], labels = powers, cex = 1, col = "black")
The Scale Independence plot (Figure 5.1 - left) shows the fit indices for scale free topology for each power. The authors recommend to select a power with a R² > 0.8. The red line corresponds to using an R² cut-off of R²=0.90.
The Mean Connectivity plot (Figure 5.1 - right) shows the average of connectivity for each power. A balance needs to be identified between a fully connected network and too much disconnected edges.
Here, we choose the power 6, which is the lowest power for which the scale-free topology fit index reaches 0.90.
The similarity matrix is constructed in two steps:
The similarity matrix is between 0 and 1 where 1 means strong similarity between two genes.
<- 6
softPower <- adjacency(exprData, power = softPower)
similarityMat c(1:5), c(1:5)] similarityMat[
## STC2 LOC96610 APOD FADS2 RPL9
## STC2 1.000000e+00 9.800808e-04 6.282220e-10 1.020088e-03 1.212973e-09
## LOC96610 9.800808e-04 1.000000e+00 8.826956e-06 1.702183e-05 4.049903e-09
## APOD 6.282220e-10 8.826956e-06 1.000000e+00 2.134894e-08 7.213511e-13
## FADS2 1.020088e-03 1.702183e-05 2.134894e-08 1.000000e+00 3.873576e-06
## RPL9 1.212973e-09 4.049903e-09 7.213511e-13 3.873576e-06 1.000000e+00
The Similarity matrix describes the similarity between each pair of genes based on correlations. This matrix is next transformed into a Topological Overlap Matrix that incorporates the level of overlapping between the neighboring of two genes.
For instance, genes with high level of similarity are grouped together (value close to 1 in the similarity matrix). If their communities also highly overlap, these two genes will be closer.
<- TOMsimilarity(similarityMat) TOM
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
colnames(TOM) <- colnames(similarityMat)
rownames(TOM) <- rownames(similarityMat)
c(1:5), c(1:5)] TOM[
## STC2 LOC96610 APOD FADS2 RPL9
## STC2 1.000000e+00 1.409307e-03 1.216776e-04 2.430333e-03 1.172863e-06
## LOC96610 1.409307e-03 1.000000e+00 3.377810e-04 1.925742e-04 1.026022e-06
## APOD 1.216776e-04 3.377810e-04 1.000000e+00 3.177204e-05 3.227948e-07
## FADS2 2.430333e-03 1.925742e-04 3.177204e-05 1.000000e+00 4.802354e-06
## RPL9 1.172863e-06 1.026022e-06 3.227948e-07 4.802354e-06 1.000000e+00
The TOM is then transformed into a dissimilarity TOM: that means that two genes highly similar have a similarity value close to 0.
<- 1 - TOM
dissTOM c(1:5), c(1:5)] dissTOM[
## STC2 LOC96610 APOD FADS2 RPL9
## STC2 0.0000000 0.9985907 0.9998783 0.9975697 0.9999988
## LOC96610 0.9985907 0.0000000 0.9996622 0.9998074 0.9999990
## APOD 0.9998783 0.9996622 0.0000000 0.9999682 0.9999997
## FADS2 0.9975697 0.9998074 0.9999682 0.0000000 0.9999952
## RPL9 0.9999988 0.9999990 0.9999997 0.9999952 0.0000000
The TOM matrix corresponds to the co-expression genes network. It’s the basis of the analysis.
We constructed a network of genes with the corresponding TOM matrix that gives us the weight of connections. This network is used to detect modules.
Modules are groups of highly-connected genes. This step is composed of three main substeps:
Dynamic Tree Cut
functionFirst, a hierarchical clustering is performed to group genes based on the TOM dissimilarity matrix. The corresponding dendrogram is displayed into the Figure 6.1.
<- hclust(as.dist(dissTOM), method = "average")
geneTree plot(geneTree, xlab = "", sub = "", main = "Gene clustering on TOM-based dissimilarity", labels = FALSE, hang = 0.04)
The leaves represent the genes. The dendrogram branches group together densely interconnected and highly co-expressed genes.
Visually, we can identify at least 5 or 6 modules. Let’s see what the identification will find.
A module is a group of genes. Several methods exist to identify them. Here, we use the Dynamic Tree Cut (from the dynamicTreeCut
R package).
First, we set the minimum module size (i.e. minimum number of genes inside a group/module) at 30
.
<- 30
minModuleSize <- cutreeDynamic(dendro = geneTree,
dynamicMods distM = dissTOM,
deepSplit = 2,
pamRespectsDendro = FALSE,
minClusterSize = minModuleSize)
## ..cutHeight not given, setting it to 0.997 ===> 99% of the (truncated) height range in dendro.
## ..done.
The method identifies 5 modules. The cluster 0 contains the unassigned genes. There is a big module (the first one) with 908
genes.
table(dynamicMods)
## dynamicMods
## 0 1 2 3 4 5
## 300 908 371 280 89 52
We assign a color to each module. The bigest module is assigned turquoise
.
<- labels2colors(dynamicMods)
dynamicColors table(dynamicColors)
## dynamicColors
## blue brown green grey turquoise yellow
## 371 280 52 300 908 89
The identification results is displayed using a dendrogram and the assigned color in the Figure 6.2.
plotDendroAndColors(dendro = geneTree, colors = dynamicColors, groupLabels = "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
It may be prudent to merge modules that have very similar expression profiles. To quantify co-expression similarity of entire modules, we calculate their eigengenes and cluster them based on their correlation.
First, we calculate the eigengenes of each module. Eigengene is the first principal component of a module (dimension reduction).
<- moduleEigengenes(exprData, colors = dynamicColors)
MEList <- MEList$eigengenes
MEs <- orderMEs(MEs)
MEs head(MEs)
## MEblue MEgreen MEturquoise MEbrown MEyellow MEgrey
## A0SH -0.003763839 -0.05955536 0.02312983 0.10465496 -0.01212317 -0.05516091
## A0SJ -0.020917560 -0.02408432 0.01715609 0.02630589 0.06563727 -0.03604684
## A0SK -0.104121719 -0.12921101 -0.09806951 -0.07988094 -0.09619792 0.09957517
## A0SO -0.077953493 -0.06860749 -0.09164466 -0.10106987 -0.05746363 0.04673793
## A04N -0.007276244 -0.03752692 0.04488648 0.05265763 0.01783336 0.03871323
## A04P 0.024766791 0.01468890 -0.08919126 -0.01027211 0.02749866 0.04509383
Then, we calculate correlation between each pair of module (Pearson correlation). If modules are close/similar, the correlation will be closed to one. We compute a distance between modules.
<- 1 - cor(MEs)
MEDiss head(MEDiss)
## MEblue MEgreen MEturquoise MEbrown MEyellow MEgrey
## MEblue 2.220446e-16 0.5124083 1.2620744 0.6117870 0.6379130 9.251374e-01
## MEgreen 5.124083e-01 0.0000000 0.9700988 0.9681992 0.9719671 1.158490e+00
## MEturquoise 1.262074e+00 0.9700988 0.0000000 0.8456227 0.6656522 1.121457e+00
## MEbrown 6.117870e-01 0.9681992 0.8456227 0.0000000 0.3500885 1.034015e+00
## MEyellow 6.379130e-01 0.9719671 0.6656522 0.3500885 0.0000000 8.944425e-01
## MEgrey 9.251374e-01 1.1584903 1.1214573 1.0340146 0.8944425 2.220446e-16
Finally, we perform a hierarchical clustering using this distance between modules. Authors recommend to merge cluster with a correlation greater than 0.75, hence a distance lower than 0.25.
<- hclust(as.dist(MEDiss))
METree <- 0.25
MEDissTree plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")
abline(h = MEDissTree, col = "red")
In our case, no modules need to be merged.
If you want to merge modules, you can use the mergeCloseModules
function (see the tutorial for more information).
Pay attention!
Use the right module clustering. If you merged modules, you may need to redo some step with the new set of modules, such as the eigengenes calculation.
5 modules have been identified in the previous step. They need to be characterized to be further analysed.
We want to identify which modules are significantly associated with the clinical metadata we have.
For each patient/sample we have the following information:
## AgeDiagnosis AnatomicNeoplasm Gender VitalStatus Subtype
## A0SH 39 left upper inner quadrant female alive LumA
## A0SJ 39 left female alive LumA
## A0SK 54 right upper outer quadrant female dead Basal
## A0SO 67 right female alive Basal
## A04N 66 right female alive LumA
## A04P 36 left female dead Basal
We have both categorical and continuous data. For the correlation calculation, clinical data need to be numerical.
<- clinicalData
clinicalDataNum $AnatomicNeoplasm <- as.numeric(clinicalData$AnatomicNeoplasm)
clinicalDataNum$Gender <- as.numeric(clinicalData$Gender)
clinicalDataNum$VitalStatus <- as.numeric(clinicalData$VitalStatus)
clinicalDataNum$Subtype <- as.numeric(clinicalData$Subtype) clinicalDataNum
Now, we can calculate the Pearson correlations between the identified modules and clinical information. P-values are then calculated to find the most significant correlations.
<- cor(MEs, clinicalDataNum, use = "p")
moduleTraitCor <- corPvalueStudent(moduleTraitCor, nPatients) moduleTraitPvalue
The correlation results are displayed into the Figure 7.1. For better visualization, non significant correlation are set to zero.
<- paste(signif(moduleTraitCor, 2), "\n(", signif(moduleTraitPvalue, 1), ")", sep = "")
textMatrix dim(textMatrix) <- dim(moduleTraitCor)
colnames(textMatrix) <- colnames(moduleTraitCor)
rownames(textMatrix) <- rownames(moduleTraitCor)
for(r in row.names(moduleTraitCor)){
for(c in colnames(moduleTraitCor)){
if(moduleTraitPvalue[r, c] > 0.05){
<- 0
moduleTraitCor[r, c] <- ""
textMatrix[r, c]
}
} }
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(clinicalDataNum),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = colorRampPalette(c("darkgreen", "white", "darkmagenta"))(50),
textMatrix = textMatrix,
setStdMargins = TRUE,
cex.text = 0.8,
zlim = c(-1, 1),
main = "Module-trait relationship")
Legend:
There is no correlation between modules and gender. But, we can see a strong correlation between the turquoise module and the breast cancer subtype.
We will concentrate on subtype
as the trait of interest for further analysis.
Two measurements are calculated to quantify the association between genes and our trait of interest (here cancer subtypes) and the similarity of all genes to every module.
These two measures are:
First, we define a variable subtype
:
= as.data.frame(clinicalDataNum$Subtype)
subtype names(subtype) <- "subtype"
Then, we calculate the Module Membership (MM) using a Pearson correlation between genes and identified modules. P-values are calculation for those correlations.
<- substring(names(MEs), 3)
modNames <- as.data.frame(cor(exprData, MEs, use = "p"))
geneModuleMembership <- as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nPatients))
MMPvalue names(geneModuleMembership) <- paste("MM", modNames, sep = "")
names(MMPvalue) <- paste("p.MM", modNames, sep = "")
And finally, we calculate the Gene Significance (GS) for every genes. A Pearson correlation is calculated between genes and each cancer subtype. Then, a p-value is calculated for those correlations.
<- as.data.frame(cor(exprData, subtype, use = "p"))
geneTraitSignificance <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nPatients))
GSPvalue names(geneTraitSignificance) <- paste("GS", names(subtype), sep = "")
names(GSPvalue) <- paste("p.GS", names(subtype), sep = "")
A gene with a high MM correlation means a high membership to this module. A gene with a high GS correlation means a high association with the selected trait.
These two measures help to identify genes that have a high significance for cancer subtypes as well as high module membership in interesting modules (e.g. turquoise module). A scatterplot of GS vs. MM in the turquoise module is displayed in the Figure 7.2.
<- "turquoise"
module <- match(module, modNames)
column <- dynamicColors == module
moduleGenes verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership (MM) in", module, "module"),
ylab = "Gene Significance (GS) for breast cancer subtype",
main = paste("Module Membership (MM) vs Gene Significance (GS)\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
GS and MM are highly correlated: genes highly significantly associated with a trait are often also the most important (central) elements of modules associated with the trait.
We create a file that summarizes all results. Modules are ordered by their significance for subtype.
<- data.frame(geneSymbol = row.names(geneTraitSignificance),
geneInfo0 moduleColor = dynamicColors,
geneTraitSignificance,
GSPvalue)<- order(-abs(cor(MEs, subtype, use = "p")))
modOrder for(mod in 1:ncol(geneModuleMembership))
{<- names(geneInfo0)
oldNames <- data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]],
geneInfo0
MMPvalue[, modOrder[mod]])names(geneInfo0) <- c(oldNames,
paste("MM.", modNames[modOrder[mod]], sep = ""),
paste("p.MM", modNames[modOrder[mod]], sep = ""))
}<- order(geneInfo0$moduleColor, -abs(geneInfo0$p.GSsubtype))
geneOrder <- geneInfo0[geneOrder, ]
geneInfo write.csv(x = geneInfo, file = "02_Results/WGCNA_genesInfo.csv", quote = FALSE)
Results are stored into a csv file.
Now, we have identified the contribution of each gene to each module (MM). You can now perform functional enrichment analyses. For instance, you can use gprofiler or STRING. You can also perform an gene set enrichment analysis because you have a weight of contribution for each gene.
We will perform this analysis later directly in Cytoscape.
The turquoise module is related to breast cancer subtypes. It could be interesting to look for the expression data of the genes inside this module. On the left in the Figure 7.3, you can see the heatmap of gene expression.
<- geneInfo[geneInfo$moduleColor == "turquoise", ]$geneSymbol
selectedGenes <- t(exprData[,selectedGenes])
turquoiseExprData pheatmap(turquoiseExprData, show_rownames = FALSE, show_colnames = FALSE, annotation_col = clinicalData["Subtype"],
legend = FALSE)
<- pheatmap(turquoiseExprData, show_rownames = FALSE, show_colnames = FALSE,
heatmap annotation_col = clinicalData["Subtype"], cutree_cols = 4)
On the right (Figure 7.3), patients are grouped into 4 clusters based on the expression data. Colors represent the patient subtypes.
Data are split into 4 groups and results are saved into a file.
<- as.data.frame(cutree(heatmap$tree_row, k = 4))
heat_cluster write.table(x = heat_cluster, file = "02_Results/TurquoiseModule_clusters.txt", quote = FALSE, sep = "\t", col.names = FALSE, row.names = TRUE)
Different visualizations are possible, depending on what you whish to focus on.
It could be interesting to study the relationship between the identified modules. You can generate a summary plot of the eigengenes network. It is usually informative to add a selected traits to the eigengenes to see how the traits fit into eigengenes network.
<- moduleEigengenes(exprData, dynamicColors)$eigengenes
MEs <- orderMEs(cbind(MEs, subtype))
MET plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0), plotHeatmaps = FALSE)
plotEigengeneNetworks(MET, "Eigengene similarity heatmap", marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90)
<- -dissTOM^7
plotTOM diag(plotTOM) <- NA
TOMplot(plotTOM, geneTree, dynamicColors, main = "Network heatmap plot, all genes")
There are 2 000 genes, so the heatmap will take time to be displayed. Here, we are displaying only the heatmap for the turquoise module. The heatmap is displayed in the Figure 8.2.
<- geneInfo[geneInfo$moduleColor == "turquoise", ]$geneSymbol
selectedGenes <- dissTOM[selectedGenes, selectedGenes]
selectTOM <- hclust(as.dist(selectTOM), method = "average")
selectTree <- -selectTOM^7
plotDiss diag(plotDiss) <- NA
TOMplot(plotDiss, selectTree, main = "Network heatmap plot, turquoise module")
Each row and column of the heatmap correspond to a single gene. Light colors represents low similarity and darker colors higher similarity. Blocks of darker colors along the diagonal are the modules. The gene dendogram and module assignment are shown along the left side and the top.
It might be interesting to cluster this turquoise module again.
Cytoscape needs an edge file to build the network. We extracted the whole network.
<- exportNetworkToCytoscape(TOM,
cyt edgeFile = "02_Results/CytoscapeInput_edges.txt",
weighted = TRUE,
threshold = 0.05,
nodeNames = colnames(exprData))
In Cytoscape
As we have seen before, blue and green modules are close, and brown and yellow modules are close to each other too and a bit similar to the turquoise module.
Now, you know how to perform a WGCNA analysis.
It could be interesting to try several different things, and see how they can impact (or not) the results. Here are a list of potential additional studies:
1. Categorical data information
External information play key role in the WGCNA analysis. But to perform a correlation analysis you need numerical data, and sometimes external information are not numerical.
As we did, you can just transform these categorical values into numbers. WGCNA’s authors propose a function called binarizeCategoricalVariable
. Maybe it could be interesting to use this function.
2. Try with another power
In this training, we use a power = 6
and we saw that the turquoise module maybe need to be split again, in a recursive clustering.
3. Try different correlation methods
4. Use the distance calculation instead of correlation to construct the network
5. Check other trait
We focus our analysis on the turquoise module. It could be interesting to go in depth with other module related to other trait.
6. Consensus WGCNA
WGCNA can be perform on several data and make a consensus of the data. Data should be the same and the same type (e.g. same list of genes) with different condition: control vs patient or male vs female.
How can you design your analysis to perform a Consensus WGCNA ?
To help you, see the part 2 of the tutorial page (here).