Libraries used to create and generate this report:
R version 4.1.0 (2021-05-18)
2.19
1.41
1.0.4
0.31
1.3.4
Libraries used to analyse data:
library("SNFtool")
library("network")
library("igraph")
library("ggplot2")
library("ggpmisc")
library("pheatmap")
library("cccd")
2.3.1
1.18.0
1.3.5
3.4.0
0.5.2
1.0.12
1.6
Cytoscape is used for visualization. You need to have the Cytoscape v3.9.1
and a few Cytoscape apps:
1.1.2
1.1.6
Similarity Network Fusion (SNF) builds networks of samples for each data type. Then, it fuses them into one network, which represents the full spectrum of underlying data.
In other words, SNF integrates several types of data (e.g. omics data) into one network or relationships between samples.
As you can see on the Figure 2.1, SNF has 3 main steps (note that patient=sample in the description below):
In the final fused network (e), you can identify which data hold weigth edges:
the blue edge information are hold by the mRNA data
the pink edge information come from methylation data
and orange edge information are hold by both data: mRNA and methylation.
GitHub page: https://github.com/maxconway/SNFtool
Publication: PMID: 24464287
Data are coming from TaraOcean project, and correspond to metagenomic data: 139
samples were collected in 8
different oceans around the world (3.1). These samples are prokaryote-enriched with more than 35 thousand species.
In a previous analysis, Sunagawa et al. [PMID: 25999513] identified a stratification mostly driven by the temperature (i.e., vertical stratification) rather than geography or other environmental factors.
Samples come from 4 different layers with different temperature:
The 8 different ocean and their localisation are presented in the Figure 3.1.
We have two types of data:
Does an integrative analysis of these two datasets retrieve the stratification driven by the layers? and also a geographical clustering ?
The preprocessing step is the most important part of the analysis. Data need to be prepared correctly in order to extract relevant information and produce a correct and pertinent interpretation of the results.
Data have to be matrices with a specific shape:
Different from the shape usually used!
NOTICE!
We assume that data have already been preprocessed (i.e., we assume that exploratory analysis, preparation and normalization are done).
Data are downloaded from: MiBiOmics gitlab
We load the first dataset that contains abundance of orthologous genes.
<- read.table(file = "00_Data/TaraOcean_mibiomics/TARAoceans_proNOGS.csv", sep = ",", head = TRUE, row.names = 1)
data1 c(1:5), c(1:5)] data1[
## NOG317682 NOG135470 NOG85325 NOG285859 NOG147792
## TARA_109_SRF 0 2.390962e-05 0 4.663604e-08 1.800215e-07
## TARA_149_MES 0 4.339824e-06 0 5.182915e-07 4.190123e-06
## TARA_110_MES 0 1.348252e-05 0 6.000043e-07 2.218342e-07
## TARA_102_MES 0 6.380711e-06 0 3.816016e-07 0.000000e+00
## TARA_142_SRF 0 9.484144e-06 0 6.437103e-08 1.132431e-06
Data are composed of 139
samples around the world and 638
groups of orthologous genes.
Each column (features) need to have mean 0 and standard deviation 1. Pay attention on the data shape.
<- standardNormalization(data1) data1_scaled
Distribution of the data before and after scaling:
hist(as.matrix(data1), nclass = 100, main = "Data 1", xlab = "values")
hist(data1_scaled, nclass = 100, main = "Scaled data 1", xlab = "scaled values")
You can retrieve functional annotation of OGs in EggNOG database.
Then, we load the dataset n°2 which contains composition of each sample.
<- read.table(file = "00_Data/TaraOcean_mibiomics/TARAoceans_proPhylo.csv", sep = ",", head = TRUE, row.names = 1)
data2 c(1:5), c(1:3)] data2[
## EU638706.1.1353 JN537192.1.1500 CAFJ01000195.23.1515
## TARA_109_SRF 0 0 0
## TARA_149_MES 0 0 0
## TARA_110_MES 0 4 0
## TARA_102_MES 0 1 0
## TARA_142_SRF 0 3 0
Data are composed of 139
samples around the world and 356
16S rRNA sequences.
Each column (features) need to have mean 0 and standard deviation 1. Pay attention on the data shape.
<- standardNormalization(data2) data2_scaled
Distribution of the data before and after scaling:
hist(as.matrix(data2), nclass = 100, main = "Data 2", xlab = "values")
hist(data2_scaled, nclass = 100, main = "Scaled data 2", xlab = "scaled values")
Annotations come from Silva website
We have some metadata, i.e., environmental information about the ocean samples (depth and ocean region).
<- read.table(file = "00_Data/TaraOcean_mibiomics/TARAoceans_metadata.csv", sep = ",", head = TRUE, row.names = 1)
metadata head(metadata)
## ocean depth
## TARA_109_SRF SPO SRF
## TARA_149_MES NAO MES
## TARA_110_MES SPO MES
## TARA_102_MES SPO MES
## TARA_142_SRF NAO SRF
## TARA_109_DCM SPO DCM
In this section, we want to construct a network of samples based on the similarity between samples.
We define a network G with nodes (or vertices) called V and links (edges) called E. The graph is defined using:
Indeed, the edge weight between two samples gets higher with their similarity.
First, we need to calculate the distance between all the pairs of samples. Pay attention to the type of variables : numerical, continuous, discrete and so on.
Wang et al. suggest:
The dist2
function performs a squared Euclidean distances between samples:
<- dist2(as.matrix(data1_scaled), as.matrix(data1_scaled))
data1_dist <- dist2(as.matrix(data2_scaled), as.matrix(data2_scaled)) data2_dist
High distance values mean distant samples whereas small distance values implies similar samples.
Distance matrices are converted into similarity matrices in which high values mean similar samples. Two parameters are needed:
The K neighbors is used to set the affinities outside of the neighborhood to zero. The alpha parameter allows scaling exponential similarity kernel, which is used to calculate the similarity.
= 20
K = 0.5 alpha
Similarity matrix calculation:
<- affinityMatrix(data1_dist, K, alpha)
data1_W <- affinityMatrix(data2_dist, K, alpha) data2_W
We can display these similarity (weight) values using an heatmap (Figure 5.1). We are using log transformation of the values to increase the signal. Samples are grouped using hierarchical clustering.
pheatmap(log(data1_W), show_rownames = FALSE, show_colnames = FALSE, annotation = metadata, main = "Data1")
pheatmap(log(data2_W), show_rownames = FALSE, show_colnames = FALSE, annotation = metadata, main = "Data2")
Legend:
For data1, we can see two main groups of samples. Each group seems to be composed of the same ocean depth. It’s not clear for the ocean regions.
For data2, we do not identify clear groups. But, samples seem to be group by the depth too.
Try different parameters !
Now, we will create a fused network from the similarity network of each data type. It iteratively integrates these networks using a fusion method.
First, we set the number of iteration T. It should be between 10 and 20 (as recommended by the authors).
= 10 T
Then, we perform the fusion. We give the list of similarity networks, the number of neighbors and the number of iterations.
<- SNF(list(data1_W, data2_W), K, T)
W c(1:5), c(1:5)] W[
## TARA_109_SRF TARA_149_MES TARA_110_MES TARA_102_MES TARA_142_SRF
## TARA_109_SRF 5.000000e-01 9.778842e-05 0.0002535809 0.0004410605 0.0019200171
## TARA_149_MES 9.778842e-05 5.000000e-01 0.0129882372 0.0126485110 0.0004977051
## TARA_110_MES 2.535809e-04 1.298824e-02 0.5000000000 0.0266185584 0.0029147232
## TARA_102_MES 4.410605e-04 1.264851e-02 0.0266185584 0.5000000000 0.0009179439
## TARA_142_SRF 1.920017e-03 4.977051e-04 0.0029147232 0.0009179439 0.5000000000
You can display the weight of each interaction using an heatmap (Figure 6.1). Samples are grouped using a hierarchical clustering.
pheatmap(log(W), show_rownames = FALSE, show_colnames = FALSE, annotation = metadata)
Legend:
Groups are more clearly defined in this fused similarity matrix. One corresponds to the depth MESO and other to a mix of DCM and SRF.
The fused network is now built, but in this fused networks, all samples are connected together (fully connected). We need to decide a threshold to create a sparser version of the network. This threshold will decide for which weight two samples are considered as linked or not linked. We want a network with enough edges but not too much.
How to choose the good threshold? It’s an open question. Several possibilities:
You can decide to keep only the first 5 neighbors or the 10% of the higher weigths. This way is totally arbitrary. You can display the weight distribution to help you to choose.
hist(W, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(W[W > 0.02], nclass = 100, main = "Fused network weight distribution - W > 0.02", xlab = "weights")
In the Figure 6.2 (left), you can see “two clusters” :
If we remove the weak links, we can better see the distribution (Figure 6.2 - right).
Here, we select the top 10% of the best weights. Distribution of the selected weights is displayed in the Figure 6.3.
<- quantile(x = W, 0.90) W_q90
hist(W[W > W_q90], nclass = 100, main = "Fused network weight distribution - 10% of best weights", xlab = "weights")
19321
1931
Advantages:
Drawbacks:
You can choose a threshold using basic metrics. We decide to try two: median and third quantiles. Weights distribution is ploted into Figure 6.4 with the median and third quantile values. Weights are log-transformed to have a better visualization of the distribution.
<- median(x = W)
W_median <- quantile(x = W, 0.75) W_q75
## Raw weights
hist(W, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = W_median, col = "blue", lwd = 3)
text(W_median, 10000, pos = 4, "Median", col = "blue", cex = 1.5)
## Log10(weights)
hist(log(W, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(W_median, 10), col = "blue", lwd = 3)
text(log(W_median, 10)-0.3, 800, pos = 2, "Median", col = "blue", cex = 1.5)
abline(v = log(W_q75, 10), col = "purple", lwd = 3)
text(log(W_q75, 10)+0.5, 400, pos = 4, "quantile 75%", col = "purple", cex = 1.5)
Median is the value that splits data into two groups with the same number of data. The third quantile corresponds to the values corresponding to the top 25% of highest similarity data.
Advantages:
Drawbacks:
Zahoranszky-Kohalmi et al., propose a method to determine the best threshold according to the topology of the network [PMID: 27030802]. It calculates the Average Clustering Coefficient (ACC) for a whole graph. The ACC represents a global parameter that characterizes the overall network topology.
The approach is working like an Elbow approach: an ACC value is calculated for a range of thresholds and the obtained values are displayed.
Two functions are needed:
## CC calculation function
<- function(node, graph){
CCCalculation <- degree(graph = graph, v = node, loops = FALSE)
degNode if(degNode > 1){
<- neighbors(graph = graph, v = node)
neighborNames <- subgraph(graph = graph, vids = neighborNames)
graph_s <- sum(degree(graph_s, loops = FALSE))
neighborNb <- neighborNb / (degNode * (degNode-1))
CC else{CC <- 0}
}return(CC)
}## ACC calculation function
<- function(graph){
ACCCalculation <- V(graph)
nodes <- do.call(sum, lapply(nodes, CCCalculation, graph)) / length(nodes)
ACC return(ACC)
}
First, we need to determine a range of thresholds. The precision of the ACC increases with the number of thresholds chosen (at least 100).
summary(c(W))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000570 0.0004456 0.0021613 0.0071942 0.0042193 0.5000000
<- seq(0, 0.1, 0.0005) thresholds
Then, we create a network object with the fused network W. We only used the half of the matrix and don’t take the diagonal values: no duplicated links nor loops in the graph.
<- graph_from_adjacency_matrix(W, weighted = TRUE, mode = "upper", diag = FALSE) W_net
For each threshold, an ACC value is calculated on the obtained network. Then, the ACC values are displayed into a line plot (Figure 6.5).
<- do.call(rbind, lapply(thresholds, function(t, graph){
ACC_W <- subgraph.edges(graph, E(graph)[weight >= t])
graph_sub <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
df return(df)
}, W_net))
plot(x = ACC_W$thresholds, y = ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = ACC_W$thresholds[1], y = ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = ACC_W$thresholds[21], y = ACC_W$ACC[21], col = "pink", pch = 16, cex = 1.2)
points(x = ACC_W$thresholds[29], y = ACC_W$ACC[29], col = "purple", pch = 16, cex = 1.2)
abline(v = ACC_W$thresholds[29], col = "purple")
text(ACC_W$thresholds[29], 0.6, pos = 4, paste0("ACCmax = ", ACC_W$thresholds[29]), col = "purple")
plot(x = ACC_W$thresholds, y = ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = ACC_W$thresholds[29], col = "purple")
text(ACC_W$thresholds[29], 2000, pos = 4, paste0("ACCmax = ", ACC_W$thresholds[29]), col = "purple")
text(ACC_W$thresholds[29], 1200, pos = 4, paste0("EN = ", ACC_W[29, "EN"]), col = "purple")
On the Figure 6.5, you can see a peak that stands out in comparison with the others.
Advantages:
Drawbacks:
We can do the same for the others similarity network (data1 and data2) to display these networks too.
<- graph_from_adjacency_matrix(data1_W, weighted = TRUE, mode = "upper", diag = FALSE)
data1_net <- seq(0, 0.008, 0.00002)
thresholds <- do.call(rbind, lapply(thresholds, function(t, graph){
ACC_data1 <- subgraph.edges(graph, E(graph)[weight >= t])
graph_sub <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
df return(df)
}, data1_net))plot(x = ACC_data1$thresholds, y = ACC_data1$ACC, xlab = "thresholds", ylab = "ACC", main = "Data 1", type = "o")
plot(x = ACC_data1$thresholds, y = ACC_data1$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of data1", type = "o")
<- graph_from_adjacency_matrix(data2_W, weighted = TRUE, mode = "upper", diag = FALSE)
data2_net <- seq(0, 0.012, 0.00005)
thresholds <- do.call(rbind, lapply(thresholds, function(t, graph){
ACC_data2 <- subgraph.edges(graph, E(graph)[weight >= t])
graph_sub <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
df return(df)
}, data2_net))plot(x = ACC_data2$thresholds, y = ACC_data2$ACC, xlab = "thresholds", ylab = "ACC", main = "Data 2", type = "o")
plot(x = ACC_data2$thresholds, y = ACC_data2$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of data2", type = "o")
To visualize and perform downstream analysis, you can export the similarity matrices (fully connected network).
write.table(as_data_frame(W_net), "02_Results/TARAOcean_W_edgesList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
write.table(as_data_frame(data1_net), "02_Results/TARAOcean_data1_edgesList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
write.table(as_data_frame(data1_net), "02_Results/TARAOcean_data2_edgesList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
We have the fused network similarity matrix. With this, we can perform downstream analysis as network’s algorithm, clustering, visualization, add some external data and so on.
Wang et al. propose a clustering method in their SNFtool
package.
Samples are clustered together according to their similarity (e.g. for disease subtyping). The number of clusters needs to be defined. According to our data and the information we have, we choose 4 and 8 clusters. Obviously, this cluster number can be chosen using an Elbow approach, which would be less arbitrary.
<- 4
C <- data.frame(Groups = spectralClustering(W, C))
group row.names(group) <- colnames(W)
<- merge(metadata, group, by = 0)
dataGroups write.table(dataGroups, "02_Results/TARAOcean_clusters4.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
<- 8
C <- data.frame(Groups = spectralClustering(W, C))
group row.names(group) <- colnames(W)
<- merge(metadata, group, by = 0)
dataGroups write.table(dataGroups, "02_Results/TARAOcean_clusters8.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
Clustering results are saved into files.
In this part, we are using Cytoscape to visualize our similarity matrices.
As expected, samples in the same layers of water are more similar (Figure 7.1 left). And there are more similarity inside MESO layer.
On the right of the Figure 7.1, samples from the same ocean regions seem to be closer. If you check the similarity network of the data separetly, it is less clear.
Your turn now !
Create a visualization for each dataset and play with the threshold! Enjoy!
This kind of analysis can be do automatically using RCy3. But Cytoscape needs to be open in background.
You know how to perform a Similarity Network Fusion.
It could be interesting to try different normalization and distance to choose the most-adapted preprocessing of the data. Indeed, when we are using public data, we might not be 100% sure about how the preprocessing was performed.
In fact, the data we use has not been properly preprocessed. There are two main adapted methods for this type of data :
Moreover, other distance distances exist and are more appropriate for ecological data than Euclidean distance:
So, we propose to try two others normalizations and create a fused network for each of them. The objective is to analyse the effect of the normalization and the distance on the fused network.
The scaling to mean 0 and standard deviation 1 could be not necessary. Be careful of the range of the distance calculated.
For your information, the distance()
, tss()
and clr()
functions are available on these libraries :
library("ecodist")
library("hilldiv")
library("compositions")
It could be interesting to know which data type supports which similarity between two samples in the fused network.
<- as_data_frame(W_net)
W_df <- as_data_frame(data1_net)
data1_df <- as_data_frame(data2_net)
data2_df
<- merge(data1_df, data2_df, by = c(1,2), suffixes = c("_data1", "_data2"))
tmp <- merge(tmp, W_df, by = c(1,2))
weights_df head(weights_df)
## from to weight_data1 weight_data2 weight
## 1 TARA_004_DCM TARA_138_DCM 5.941851e-04 2.261158e-04 0.0093268270
## 2 TARA_004_SRF TARA_004_DCM 1.514222e-03 2.973412e-04 0.0151933836
## 3 TARA_004_SRF TARA_133_MES 7.162040e-06 1.011284e-05 0.0001358067
## 4 TARA_004_SRF TARA_138_DCM 8.719842e-05 1.805816e-04 0.0028421617
## 5 TARA_007_DCM TARA_004_DCM 1.163166e-05 3.450973e-05 0.0030039624
## 6 TARA_007_DCM TARA_004_SRF 1.160152e-05 3.159833e-05 0.0025836564
<- do.call(rbind, apply((weights_df), 1, function(line){
edgeSources <- c(as.numeric(line[3]), as.numeric(line[4]))
weights_v names(weights_v) <- c("weight_data1", "weight_data2")
<- sort(weights_v, decreasing = TRUE)
weights_v if((weights_v[1] - weights_v[2]) >= (weights_v[2]*10/100)){
= names(weights_v[1])
source
}else{
= "both_data"
source
}<- data.frame("from" = line["from"],
df "to" = line["to"],
"source" = source,
"data1" = weights_v["weight_data1"],
"data2" = weights_v["weight_data2"],
"W" = line["weight"])
return(df)
}))
The majority of the links are supported by data 2:
as.data.frame(table(edgeSources$source))
## Var1 Freq
## 1 both_data 660
## 2 weight_data1 3623
## 3 weight_data2 5308
Results are stored into a .txt file and will be used for the visualization of the fused network in Cytoscape.
write.table(edgeSources, "02_Results/TaraOcean_sourceOfEgdes.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")