1 Libraries and environment

1.1 Load environment

Libraries used to create and generate this report:

  • R : R version 4.1.0 (2021-05-18)
  • rmarkdown : 2.19
  • knitr : 1.41
  • rmdformats : 1.0.4
  • bookdown : 0.31
  • kableExtra : 1.3.4

1.2 Load libraries

Libraries used to analyse data:

library("SNFtool")
library("network")
library("igraph")
library("ggplot2")
library("ggpmisc")
library("pheatmap")
library("cccd")
  • SNFtool: 2.3.1
  • network: 1.18.0
  • igraph: 1.3.5
  • ggplot2: 3.4.0
  • ggpmisc: 0.5.2
  • pheatmap: 1.0.12
  • cccd: 1.6

1.3 Cytoscape Apps

Cytoscape is used for visualization. You need to have the Cytoscape v3.9.1 and a few Cytoscape apps:

  • yFiles Layout Algorithm: 1.1.2
  • LegendCreator: 1.1.6

2 General principle of the SNF method

Similarity Network Fusion (SNF) builds networks of samples for each data type. Then, it fuses them into one network, which represents the full spectrum of underlying data.

In other words, SNF integrates several types of data (e.g. omics data) into one network or relationships between samples.

As you can see on the Figure 2.1, SNF has 3 main steps (note that patient=sample in the description below):

  1. First, for each dataset given as input (a), it calculates the patient similarity matrix (b)
  2. Then, from each similarity matrix (b), it creates the patient similarity network (c)
  3. Finally, it fuses the different patient similarity networks (d) into one fused network (e)
SNF overview

Figure 2.1: SNF overview

In the final fused network (e), you can identify which data hold weigth edges:

  • the blue edge information are hold by the mRNA data

  • the pink edge information come from methylation data

  • and orange edge information are hold by both data: mRNA and methylation.

  • GitHub page: https://github.com/maxconway/SNFtool

  • Publication: PMID: 24464287

3 Biological context

Data are coming from TaraOcean project, and correspond to metagenomic data: 139 samples were collected in 8 different oceans around the world (3.1). These samples are prokaryote-enriched with more than 35 thousand species.

In a previous analysis, Sunagawa et al. [PMID: 25999513] identified a stratification mostly driven by the temperature (i.e., vertical stratification) rather than geography or other environmental factors.

Tara Ocean sampling stations over oceans

Figure 3.1: Tara Ocean sampling stations over oceans

Samples come from 4 different layers with different temperature:

  • SRF: Surface Water Layer (0-5 meters)
  • DCM: Deep Chlorophyll Maximum (peak of chlorophyll, 0-600 meters)
  • MIX: Subsurface epipelagic Mixed Layer
  • MES(O): Mesopelagic zone (from 500/1000 meters)

The 8 different ocean and their localisation are presented in the Figure 3.1.

We have two types of data:

  • dataset n°1: orthologous genes
  • dataset n°2: phylogenetic profil
The dataset n°1 contains the relative abundance of groups of orthologous genes (OGs) for each ocean samples.
The dataset n°2 contains the counts of S16 rRNA for each ocean samples.

Does an integrative analysis of these two datasets retrieve the stratification driven by the layers? and also a geographical clustering ?

4 Input data

4.1 In summary

The preprocessing step is the most important part of the analysis. Data need to be prepared correctly in order to extract relevant information and produce a correct and pertinent interpretation of the results.

  1. Prepare the data (e.g. remove outliers, correct batch effect etc …)
  2. Remove and/or impute missing data
    • remove variables/samples if more than 20% of missing data
    • impute using K-nearest neighbor method (KNN)
  3. Normalize according data type
  4. Scale (mean = 0 and sd = 1)
Distribution examples expected after preprocessingDistribution examples expected after preprocessingDistribution examples expected after preprocessing

Figure 4.1: Distribution examples expected after preprocessing

Data have to be matrices with a specific shape:

  • samples (e.g. patients, organisms …) in rows
  • variables (e.g. genes, proteins …) in columns

Different from the shape usually used!

4.2 Tara Ocean data

NOTICE!

We assume that data have already been preprocessed (i.e., we assume that exploratory analysis, preparation and normalization are done).

Data are downloaded from: MiBiOmics gitlab

4.2.1 Dataset n°1: Orthologous genes

We load the first dataset that contains abundance of orthologous genes.

data1 <- read.table(file = "00_Data/TaraOcean_mibiomics/TARAoceans_proNOGS.csv", sep = ",", head = TRUE, row.names = 1)
data1[c(1:5), c(1:5)]
##              NOG317682    NOG135470 NOG85325    NOG285859    NOG147792
## TARA_109_SRF         0 2.390962e-05        0 4.663604e-08 1.800215e-07
## TARA_149_MES         0 4.339824e-06        0 5.182915e-07 4.190123e-06
## TARA_110_MES         0 1.348252e-05        0 6.000043e-07 2.218342e-07
## TARA_102_MES         0 6.380711e-06        0 3.816016e-07 0.000000e+00
## TARA_142_SRF         0 9.484144e-06        0 6.437103e-08 1.132431e-06

Data are composed of 139 samples around the world and 638 groups of orthologous genes.

Each column (features) need to have mean 0 and standard deviation 1. Pay attention on the data shape.

data1_scaled <- standardNormalization(data1)

Distribution of the data before and after scaling:

hist(as.matrix(data1), nclass = 100, main = "Data 1", xlab = "values")
hist(data1_scaled, nclass = 100, main = "Scaled data 1", xlab = "scaled values")
Data 1 distribution before and after scalingData 1 distribution before and after scaling

Figure 4.2: Data 1 distribution before and after scaling

You can retrieve functional annotation of OGs in EggNOG database.

4.2.2 Dataset n°2: phylogenetic profile

Then, we load the dataset n°2 which contains composition of each sample.

data2 <- read.table(file = "00_Data/TaraOcean_mibiomics/TARAoceans_proPhylo.csv", sep = ",", head = TRUE, row.names = 1)
data2[c(1:5), c(1:3)]
##              EU638706.1.1353 JN537192.1.1500 CAFJ01000195.23.1515
## TARA_109_SRF               0               0                    0
## TARA_149_MES               0               0                    0
## TARA_110_MES               0               4                    0
## TARA_102_MES               0               1                    0
## TARA_142_SRF               0               3                    0

Data are composed of 139 samples around the world and 356 16S rRNA sequences.

Each column (features) need to have mean 0 and standard deviation 1. Pay attention on the data shape.

data2_scaled <- standardNormalization(data2)

Distribution of the data before and after scaling:

hist(as.matrix(data2), nclass = 100, main = "Data 2", xlab = "values")
hist(data2_scaled, nclass = 100, main = "Scaled data 2", xlab = "scaled values")
Data 2 distribution before and after scalingData 2 distribution before and after scaling

Figure 4.3: Data 2 distribution before and after scaling

Annotations come from Silva website

4.2.3 Environmental information

We have some metadata, i.e., environmental information about the ocean samples (depth and ocean region).

metadata <- read.table(file = "00_Data/TaraOcean_mibiomics/TARAoceans_metadata.csv", sep = ",", head = TRUE, row.names = 1)
head(metadata)
##              ocean depth
## TARA_109_SRF   SPO   SRF
## TARA_149_MES   NAO   MES
## TARA_110_MES   SPO   MES
## TARA_102_MES   SPO   MES
## TARA_142_SRF   NAO   SRF
## TARA_109_DCM   SPO   DCM

5 Similarity networks

5.1 In summary

In this section, we want to construct a network of samples based on the similarity between samples.

We define a network G with nodes (or vertices) called V and links (edges) called E. The graph is defined using:

\(G = (V, E)\)
  • Node: sample
  • Edge: link between two samples
  • Edge weight: similarity weight of samples pairs

Indeed, the edge weight between two samples gets higher with their similarity.

  1. Distances are calculated between all the pairs of samples. Small value means high similarity between two samples.
  2. Distances are transformed into weights.
    • The neighborhood of each sample (K) is used as well as an hyperparameter (alpha) to transform distances into weights.
    • The new matrix contains the weights of the corresponding similarity graph.
    • High value means high similarity between two samples.
  3. You can visualize this similarity graph.

5.2 Distance calculation

First, we need to calculate the distance between all the pairs of samples. Pay attention to the type of variables : numerical, continuous, discrete and so on.

Wang et al. suggest:

  • continuous variables:
    • Distance (e.g. Euclidean)
    • Correlation (e.g. pearson)
  • discrete variables: chi-squared distance
  • binary variables: agreement_based measure

The dist2 function performs a squared Euclidean distances between samples:

data1_dist <- dist2(as.matrix(data1_scaled), as.matrix(data1_scaled))
data2_dist <- dist2(as.matrix(data2_scaled), as.matrix(data2_scaled))

High distance values mean distant samples whereas small distance values implies similar samples.

5.3 Similarity graph construction

Distance matrices are converted into similarity matrices in which high values mean similar samples. Two parameters are needed:

  • K : number of neighbors (between 10 and 30)
  • alpha : hyperparameter (between 0.3 and 0.8)

The K neighbors is used to set the affinities outside of the neighborhood to zero. The alpha parameter allows scaling exponential similarity kernel, which is used to calculate the similarity.

K = 20
alpha = 0.5

Similarity matrix calculation:

data1_W <- affinityMatrix(data1_dist, K, alpha)
data2_W <- affinityMatrix(data2_dist, K, alpha)

We can display these similarity (weight) values using an heatmap (Figure 5.1). We are using log transformation of the values to increase the signal. Samples are grouped using hierarchical clustering.

pheatmap(log(data1_W), show_rownames = FALSE, show_colnames = FALSE, annotation = metadata, main = "Data1")
pheatmap(log(data2_W), show_rownames = FALSE, show_colnames = FALSE, annotation = metadata, main = "Data2")
Similarity matrix heatmapSimilarity matrix heatmap

Figure 5.1: Similarity matrix heatmap

Legend:

  • Red: high similarity
  • Blue: low similarity
  • Each row and each line corresponds to a sample
  • Diagonal: similarity between a sample and itself

For data1, we can see two main groups of samples. Each group seems to be composed of the same ocean depth. It’s not clear for the ocean regions.

For data2, we do not identify clear groups. But, samples seem to be group by the depth too.

Try different parameters !

6 Fusion

6.1 In summary

  1. Similarity matrices (weight) are created for each dataset.
  2. These matrices are integrated together using an iterative fusion method.
    • Each matrix is iteratively updated with information from other matrices, making them more similar at each step.
    • You can define the number of iteration (T).
  3. The method converges into the final fused network (called W). The new obtained matrix contains the weights of the corresponding fused network.
  4. Threshold selection: in order to go to fully connected network to a sparser network
  5. You can visualize the fused network W.

6.2 Creation of the final fused network

Now, we will create a fused network from the similarity network of each data type. It iteratively integrates these networks using a fusion method.

First, we set the number of iteration T. It should be between 10 and 20 (as recommended by the authors).

T = 10

Then, we perform the fusion. We give the list of similarity networks, the number of neighbors and the number of iterations.

W <- SNF(list(data1_W, data2_W), K, T)
W[c(1:5), c(1:5)]
##              TARA_109_SRF TARA_149_MES TARA_110_MES TARA_102_MES TARA_142_SRF
## TARA_109_SRF 5.000000e-01 9.778842e-05 0.0002535809 0.0004410605 0.0019200171
## TARA_149_MES 9.778842e-05 5.000000e-01 0.0129882372 0.0126485110 0.0004977051
## TARA_110_MES 2.535809e-04 1.298824e-02 0.5000000000 0.0266185584 0.0029147232
## TARA_102_MES 4.410605e-04 1.264851e-02 0.0266185584 0.5000000000 0.0009179439
## TARA_142_SRF 1.920017e-03 4.977051e-04 0.0029147232 0.0009179439 0.5000000000

You can display the weight of each interaction using an heatmap (Figure 6.1). Samples are grouped using a hierarchical clustering.

pheatmap(log(W), show_rownames = FALSE, show_colnames = FALSE, annotation = metadata)
Fused network weights heatmap

Figure 6.1: Fused network weights heatmap

Legend:

  • Red: high similarity
  • Blue: high dissimilarity
  • Each row and each line corresponds to a sample
  • Diagonal: similarity between a sample and itself

Groups are more clearly defined in this fused similarity matrix. One corresponds to the depth MESO and other to a mix of DCM and SRF.

6.3 Threshold

The fused network is now built, but in this fused networks, all samples are connected together (fully connected). We need to decide a threshold to create a sparser version of the network. This threshold will decide for which weight two samples are considered as linked or not linked. We want a network with enough edges but not too much.

How to choose the good threshold? It’s an open question. Several possibilities:

  • arbitrary threshold (top links)
  • basic metrics (median, quantiles)
  • methods based on network topology

6.3.1 Top first neighbors

You can decide to keep only the first 5 neighbors or the 10% of the higher weigths. This way is totally arbitrary. You can display the weight distribution to help you to choose.

hist(W, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(W[W > 0.02], nclass = 100, main = "Fused network weight distribution - W > 0.02", xlab = "weights")
Fused network weight distributionFused network weight distribution

Figure 6.2: Fused network weight distribution

In the Figure 6.2 (left), you can see “two clusters” :

  • on the right, you have the links with the best weights, corresponding to strong connections (in fact they are loops)
  • on the left, a lot of weak connections

If we remove the weak links, we can better see the distribution (Figure 6.2 - right).

Here, we select the top 10% of the best weights. Distribution of the selected weights is displayed in the Figure 6.3.

W_q90 <- quantile(x = W, 0.90)
hist(W[W > W_q90], nclass = 100, main = "Fused network weight distribution - 10% of best weights", xlab = "weights")
Fused network weight distribution. Only the 10% of best weights are selected.

Figure 6.3: Fused network weight distribution. Only the 10% of best weights are selected.

Number of total links : 19321
Number of links selected : 1931

Advantages:

  • Easy to implement
  • Fast

Drawbacks:

  • Arbitrary
  • Doesn’t take account of the topology of the network

6.3.2 Quantiles

You can choose a threshold using basic metrics. We decide to try two: median and third quantiles. Weights distribution is ploted into Figure 6.4 with the median and third quantile values. Weights are log-transformed to have a better visualization of the distribution.

W_median <- median(x = W)
W_q75 <- quantile(x = W, 0.75)
## Raw weights
hist(W, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = W_median, col = "blue", lwd = 3)
text(W_median, 10000, pos = 4, "Median", col = "blue", cex = 1.5)
## Log10(weights)
hist(log(W, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(W_median, 10), col = "blue", lwd = 3)
text(log(W_median, 10)-0.3, 800, pos = 2, "Median", col = "blue", cex = 1.5)
abline(v = log(W_q75, 10), col = "purple", lwd = 3)
text(log(W_q75, 10)+0.5, 400, pos = 4, "quantile 75%", col = "purple", cex = 1.5)
Fused network weight distributionFused network weight distribution

Figure 6.4: Fused network weight distribution

Median is the value that splits data into two groups with the same number of data. The third quantile corresponds to the values corresponding to the top 25% of highest similarity data.

Advantages:

  • Easy to implement
  • Fast
  • Fitted to the data

Drawbacks:

  • Arbitrary (but fitted to the data)
  • Doesn’t take account the topology

6.3.3 Topology networks

Zahoranszky-Kohalmi et al., propose a method to determine the best threshold according to the topology of the network [PMID: 27030802]. It calculates the Average Clustering Coefficient (ACC) for a whole graph. The ACC represents a global parameter that characterizes the overall network topology.

The approach is working like an Elbow approach: an ACC value is calculated for a range of thresholds and the obtained values are displayed.

Two functions are needed:

  • CCCalculation function: it calculates the Clustering Coefficient (CC) for each nodes
  • ACCCalculation function: it averages the CC for a graph, in order to obtain the ACC
## CC calculation function
CCCalculation <- function(node, graph){
  degNode <- degree(graph = graph, v = node, loops = FALSE)
  if(degNode > 1){
    neighborNames <- neighbors(graph = graph, v = node)
    graph_s <- subgraph(graph = graph, vids = neighborNames)
    neighborNb <- sum(degree(graph_s, loops = FALSE))
    CC <- neighborNb / (degNode * (degNode-1))
  }else{CC <- 0}
  return(CC)
}
## ACC calculation function
ACCCalculation <- function(graph){
  nodes <- V(graph)
  ACC <- do.call(sum, lapply(nodes, CCCalculation, graph)) / length(nodes)
  return(ACC)
}

First, we need to determine a range of thresholds. The precision of the ACC increases with the number of thresholds chosen (at least 100).

summary(c(W))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000570 0.0004456 0.0021613 0.0071942 0.0042193 0.5000000
thresholds <- seq(0, 0.1, 0.0005)

Then, we create a network object with the fused network W. We only used the half of the matrix and don’t take the diagonal values: no duplicated links nor loops in the graph.

W_net <- graph_from_adjacency_matrix(W, weighted = TRUE, mode = "upper", diag = FALSE)

For each threshold, an ACC value is calculated on the obtained network. Then, the ACC values are displayed into a line plot (Figure 6.5).

ACC_W <- do.call(rbind, lapply(thresholds, function(t, graph){
  graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
  df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
  return(df)
}, W_net))
plot(x = ACC_W$thresholds, y = ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = ACC_W$thresholds[1], y = ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = ACC_W$thresholds[21], y = ACC_W$ACC[21], col = "pink", pch = 16, cex = 1.2)
points(x = ACC_W$thresholds[29], y = ACC_W$ACC[29], col = "purple", pch = 16, cex = 1.2)
abline(v = ACC_W$thresholds[29], col = "purple")
text(ACC_W$thresholds[29], 0.6, pos = 4, paste0("ACCmax = ",  ACC_W$thresholds[29]), col = "purple")
plot(x = ACC_W$thresholds, y = ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = ACC_W$thresholds[29], col = "purple")
text(ACC_W$thresholds[29], 2000, pos = 4, paste0("ACCmax = ",  ACC_W$thresholds[29]), col = "purple")
text(ACC_W$thresholds[29], 1200, pos = 4, paste0("EN = ",  ACC_W[29, "EN"]), col = "purple")
Average Clustering Coeeficient (ACC) on the left and Number of edges on the right.Average Clustering Coeeficient (ACC) on the left and Number of edges on the right.

Figure 6.5: Average Clustering Coeeficient (ACC) on the left and Number of edges on the right.

On the Figure 6.5, you can see a peak that stands out in comparison with the others.

  • Red dot: no filter, every sample is connected to each other
  • Pink dot: the smallest value before the peak
  • Purple dot: the local maxima of ACC (we want this one)

Advantages:

  • Takes account of the topology of the obtained network

Drawbacks:

  • Might be time consuming
  • Need to be used to this kind of analysis for interpretation
  • Sometimes, no peak …

We can do the same for the others similarity network (data1 and data2) to display these networks too.

data1_net <- graph_from_adjacency_matrix(data1_W, weighted = TRUE, mode = "upper", diag = FALSE)
thresholds <- seq(0, 0.008, 0.00002)
ACC_data1 <- do.call(rbind, lapply(thresholds, function(t, graph){
  graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
  df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
  return(df)
}, data1_net))
plot(x = ACC_data1$thresholds, y = ACC_data1$ACC, xlab = "thresholds", ylab = "ACC", main = "Data 1", type = "o")
plot(x = ACC_data1$thresholds, y = ACC_data1$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of data1", type = "o")
data2_net <- graph_from_adjacency_matrix(data2_W, weighted = TRUE, mode = "upper", diag = FALSE)
thresholds <- seq(0, 0.012, 0.00005) 
ACC_data2 <- do.call(rbind, lapply(thresholds, function(t, graph){
  graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
  df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
  return(df)
}, data2_net))
plot(x = ACC_data2$thresholds, y = ACC_data2$ACC, xlab = "thresholds", ylab = "ACC", main = "Data 2", type = "o")
plot(x = ACC_data2$thresholds, y = ACC_data2$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of data2", type = "o")

6.3.4 Save the fused network

To visualize and perform downstream analysis, you can export the similarity matrices (fully connected network).

write.table(as_data_frame(W_net), "02_Results/TARAOcean_W_edgesList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
write.table(as_data_frame(data1_net), "02_Results/TARAOcean_data1_edgesList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
write.table(as_data_frame(data1_net), "02_Results/TARAOcean_data2_edgesList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")

7 Downstream analyses

We have the fused network similarity matrix. With this, we can perform downstream analysis as network’s algorithm, clustering, visualization, add some external data and so on.

7.1 Clustering

Wang et al. propose a clustering method in their SNFtool package.

Samples are clustered together according to their similarity (e.g. for disease subtyping). The number of clusters needs to be defined. According to our data and the information we have, we choose 4 and 8 clusters. Obviously, this cluster number can be chosen using an Elbow approach, which would be less arbitrary.

C <- 4 
group <- data.frame(Groups = spectralClustering(W, C)) 
row.names(group) <- colnames(W) 
dataGroups <- merge(metadata, group, by = 0) 
write.table(dataGroups, "02_Results/TARAOcean_clusters4.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") 
C <- 8
group <- data.frame(Groups = spectralClustering(W, C)) 
row.names(group) <- colnames(W) 
dataGroups <- merge(metadata, group, by = 0) 
write.table(dataGroups, "02_Results/TARAOcean_clusters8.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") 

Clustering results are saved into files.

7.2 Visualization (Cytoscape)

In this part, we are using Cytoscape to visualize our similarity matrices.

  1. Import fused network W
    • File -> Import Network from File -> TARAOcean_W_edgesList.txt
      • Column 1: Source node
      • Column 2: Target node
      • Column 3: Edge attributes
    • Layout -> yFiles Organic Layout
  2. Import samples information
    • File -> Import Table from File -> TARAOcean_clusters4.txt
      • Column 1: Key
      • Column 2: Attribute
      • Column 3: Attribute
      • Column 4: Attribute - Rename Groups into Clusters_4
      • /! Import Data as Node Table Columns
    • Do the same with TARAOcean_clusters8.txt
  3. Custom style
    • Lock node width and height
    • Shape: Ellipse
    • Fill Color: depth (Pastel1)
  4. Select strong weights
    • Filter
      • Add column filter
      • Edge: W -> threshold determined
    • Select -> Nodes -> Select All nodes
    • File -> New Network -> From Selected Nodes, Selected edges
    • Layout -> yFiles Organic Layout (for example)
    • Style of Edge
      • Stroke Color (Unselected): interaction (Set2)
        • #4A1486 (purple): both data
        • #7BCCC4 (blue): data1
        • #F768A1 (pink): data2
CytoscapeCytoscape

Figure 7.1: Cytoscape

As expected, samples in the same layers of water are more similar (Figure 7.1 left). And there are more similarity inside MESO layer.

On the right of the Figure 7.1, samples from the same ocean regions seem to be closer. If you check the similarity network of the data separetly, it is less clear.

Your turn now !

Create a visualization for each dataset and play with the threshold! Enjoy!

This kind of analysis can be do automatically using RCy3. But Cytoscape needs to be open in background.

8 Go further

8.1 Playing with the normalization and the distance calculation

You know how to perform a Similarity Network Fusion.

It could be interesting to try different normalization and distance to choose the most-adapted preprocessing of the data. Indeed, when we are using public data, we might not be 100% sure about how the preprocessing was performed.

In fact, the data we use has not been properly preprocessed. There are two main adapted methods for this type of data :

  • Centered log ration method (CLR)
  • Total sum-scaling (TSS)

Moreover, other distance distances exist and are more appropriate for ecological data than Euclidean distance:

  • Bray-Curtis distance
  • Hellinger distance

So, we propose to try two others normalizations and create a fused network for each of them. The objective is to analyse the effect of the normalization and the distance on the fused network.

  • CLR + euclidean distance
  • TSS + Bray-Curtis distance

The scaling to mean 0 and standard deviation 1 could be not necessary. Be careful of the range of the distance calculated.

For your information, the distance(), tss() and clr() functions are available on these libraries :

library("ecodist")
library("hilldiv")
library("compositions")

8.2 Source of information

It could be interesting to know which data type supports which similarity between two samples in the fused network.

  • The edge is considered supported by a single data if its weight in that data type’s network is more than 10% higher than the weight of the same edge in the other data type’s networks.
  • If the difference between two highest edge weights from the corresponding data types is less than 10%, the edge is considered supported by those 2 data types.
  • If the difference is less than 10% between all data type, the edge is supported by all data types.
W_df <- as_data_frame(W_net)
data1_df <- as_data_frame(data1_net) 
data2_df <- as_data_frame(data2_net) 

tmp <- merge(data1_df, data2_df, by = c(1,2), suffixes = c("_data1", "_data2"))
weights_df <- merge(tmp, W_df, by = c(1,2))
head(weights_df)
##           from           to weight_data1 weight_data2       weight
## 1 TARA_004_DCM TARA_138_DCM 5.941851e-04 2.261158e-04 0.0093268270
## 2 TARA_004_SRF TARA_004_DCM 1.514222e-03 2.973412e-04 0.0151933836
## 3 TARA_004_SRF TARA_133_MES 7.162040e-06 1.011284e-05 0.0001358067
## 4 TARA_004_SRF TARA_138_DCM 8.719842e-05 1.805816e-04 0.0028421617
## 5 TARA_007_DCM TARA_004_DCM 1.163166e-05 3.450973e-05 0.0030039624
## 6 TARA_007_DCM TARA_004_SRF 1.160152e-05 3.159833e-05 0.0025836564
edgeSources <- do.call(rbind, apply((weights_df), 1, function(line){
  weights_v <- c(as.numeric(line[3]), as.numeric(line[4]))
  names(weights_v) <- c("weight_data1", "weight_data2")
  weights_v <- sort(weights_v, decreasing = TRUE)
  if((weights_v[1] - weights_v[2]) >= (weights_v[2]*10/100)){
    source = names(weights_v[1])
  }
  else{
    source = "both_data"
  }
  df <- data.frame("from" = line["from"],
                   "to" = line["to"],
                   "source" = source,
                   "data1" = weights_v["weight_data1"],
                   "data2" = weights_v["weight_data2"],
                   "W" = line["weight"])
  return(df)
}))

The majority of the links are supported by data 2:

as.data.frame(table(edgeSources$source))
##           Var1 Freq
## 1    both_data  660
## 2 weight_data1 3623
## 3 weight_data2 5308

Results are stored into a .txt file and will be used for the visualization of the fused network in Cytoscape.

write.table(edgeSources, "02_Results/TaraOcean_sourceOfEgdes.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")