Libraries used to create and generate this report:
- R version 4.3.3 (2024-02-29)
- 2.21
- 1.42
- 1.0.4
- 0.34
- 1.3.4

Libraries used to analyse data:
- 2.3.1
- 1.0.12
- 1.4.2

Libraries used to load data:
- 1.16.1
- 1.14.8
- 6.24.0
- 1.0

Cytoscape is used for visualization. Figures were generated using Cytoscape v3.9.1 and several Cytoscape apps:
- 1.1.3
- 1.1.6
Similarity Network Fusion (SNF) builds networks of samples for each data type. Then, it fuses them into one network, which represents the full spectrum of underlying data.
In other words, SNF integrates several types of data (e.g. omics data) into one network which represents the relationships between samples.
The SNF method can be decomposed into three main steps, displayed in Figure 2.1 (note that patient = sample in the description below):
In the final fused similarity network (e), you can identify which data type contributes to which edge:
You can retrieve more information in:
Choose the dataset on which you want to apply SNF!!
Different datasets are available. Note that each dataset has its own specificities, and some analysis steps should be adapted accordingly.
To retrieve the data: files are available in the /shared/projects/tp_etbii_2024_165650/Networks/TaraOcean_mibiomics directory on the IFB server.
dataset:
- TARAoceans_proNOGS.csv
- TARAoceans_proPhylo.csv
metadata:
- TARAoceans_metadata.csv
Samples come from eight oceans around the world (SPO: South Pacific Ocean, NAO: North Atlantic Ocean, IO: Indian Ocean, RS: Red Sea, MS: Mediterranean Sea, NPO: North Pacific Ocean, SO: Southern Ocean, SAO: South Atlantic Ocean).
Samples can come from different layers with different temperatures:
In a previous analysis (Sunagawa et al., 2015), the authors identified a stratification mostly driven by temperature rather than by geography or other environmental factors.
We have two types of data:
Does an integrative analysis of these two data types retrieve the stratification driven by the layers? Does it also find a geographical clustering?
Data come from the MiBiOmics GitLab.
To retrieve data:
dataset: using data("breast.TCGA") from the mixOmics R package
- breast.TCGA$data.train$mirna
- breast.TCGA$data.train$mrna
- breast.TCGA$data.train$protein
metadata: using data("breast.TCGA") from the mixOmics R package
- breast.TCGA$data.train$subtype
Human breast cancer is a heterogeneous disease. Breast tumors can be classified into several subtypes (PAM50 classification), according to the mRNA expression level (Sorlie et al., 2001). In this dataset, we have three subtypes:
We have three types of data:
Does an integrative analysis of these three data types retrieve the breast cancer classification, or does it find another one?
Data come from the mixOmics R package. The full data can be downloaded here.
To retrieve data:
dataset: using data("CLL_data") from the MOFAdata R package
- CLL_data_t$Drugs
- CLL_data_t$Methylation
- CLL_data_t$mRNA
- CLL_data_t$Mutations
metadata: file is available in the /shared/projects/tp_etbii_2024_165650/Networks/CLL directory on the IFB server.
- sample_metadata.txt
Chronic Lymphocytic Leukaemia (CLL) is a type of blood and bone marrow cancer. The full data are explained in Dietrich et al., 2018 and available here.
We have four types of data:
To retrieve the data: files are available in the /shared/projects/tp_etbii_2024_165650/Networks/Tomato directory on the IFB server.
dataset:
- mrna.tsv
- prots.tsv
metadata:
- samples_metadata.tsv
In order to study protein turnover in developing tomato fruit (Solanum lycopersicum), Belouah et al. collected two omics data types:
Each data type was collected at nine different developmental stages: GR1, GR2, GR3, GR4, GR5, GR6, GR7, GR8 and GR9. For each developmental stage, we have three replicates.
Does an integrative analysis of these data types retrieve the different developmental stages?
Data come from Belouah et al., 2019.
To retrieve data:
dataset: using data("X") from the WallomicsData R package, where X is one of:
- Phenomics_Rosettes
- Transcriptomics_Rosettes_CW
- Proteomics_Rosettes_CW
metadata: also using data("X") from the WallomicsData R package, where X is one of:
- Altitude_Cluster
- Ecotype
In order to study the cell wall plasticity of Arabidopsis thaliana plants exposed to different temperature growth conditions, four omics data types were collected:
Each data type was collected for
To see all the available data, have a look at the manual.
Data are explained in Duruflé et al., 2019, 2020 and 2021.
The preprocessing step is the most important part of the analysis. Data need to be prepared correctly in order to extract relevant information and produce a correct and pertinent interpretation of the results.
Data preprocessing could be summarized by four main steps:
The data must conform to a specific matrix shape:
HELP! Use ?functionName() to display the documentation of a function.
For this tutorial, we assume that the data have already been prepared: outliers have been removed and there is no batch effect.
To load data from a file, you can use read.table() and specify the file name, the sep character and other parameters if necessary.
To load data from a package, you can use data(dataName). Don’t forget to first load the corresponding package with library(packageName).
To load data from a website, you can use fread(url) from the data.table package.
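For the package and website methods, here is a minimal sketch (the URL is a placeholder, not a real data source; a read.table() example is shown in the next subsection):
## From a package: load the package, then the dataset
library(mixOmics)
data("breast.TCGA")
## From a website: fread() reads directly from a URL
library(data.table)
myData <- fread("https://example.org/myData.csv")  # placeholder URL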
The metadata contain complementary information about the samples. You can load the metadata from a package or a file. See section 4.2.1 about data loading.
We mainly use the metadata for visualization, so we suggest following these recommendations:
- store the metadata in a data frame (use the data.frame() or as.data.frame() functions)
- set the sample names as row names (when loading with read.table(), use the row.names = 1 parameter)
- make sure columns are of type character or numeric (check using the str() function)
For instance, to load the orthologous gene data from Tara Ocean, the command could be:
tara_nog <- read.table(file = "../00_Data/TaraOcean_mibiomics/TARAoceans_proNOGS.csv", sep = ",", head = TRUE, row.names = 1)
tara_nog[c(1:5), c(1:5)]
## NOG317682 NOG135470 NOG85325 NOG285859 NOG147792
## TARA_109_SRF 0 2.390962e-05 0 4.663604e-08 1.800215e-07
## TARA_149_MES 0 4.339824e-06 0 5.182915e-07 4.190123e-06
## TARA_110_MES 0 1.348252e-05 0 6.000043e-07 2.218342e-07
## TARA_102_MES 0 6.380711e-06 0 3.816016e-07 0.000000e+00
## TARA_142_SRF 0 9.484144e-06 0 6.437103e-08 1.132431e-06
Don’t hesitate to look at the first rows of your data regularly using head(). It can be more convenient to display only the first 5 rows and columns when the data are big (tara_nog[c(1:5), c(1:5)]).
Practice:
These functions could help you: nrow(), ncol(), lapply(), dim(), t(), names(), typeof(), as.character() and unique().
In the SNF paper, the authors recommend filtering out samples with more than 20% missing data in a given data type, and filtering out features with more than 20% missing data across samples. Then, they impute the remaining missing data using K-nearest-neighbor (KNN) imputation.
In this tutorial, we decided to remove samples with at least one missing value. To remove samples with missing data, we propose the following NARemoving() function. Its input parameters are:
- data: the data type
- margin: a vector giving the subscripts the function will be applied over (1 indicates rows, 2 indicates columns)
- threshold: percentage of missing data above which samples/features are deleted

NARemoving <- function(data, margin, threshold){
  #' NA removing
  #'
  #' Calculate the percentage of NA
  #' Remove NA from rows (margin = 1) or columns (margin = 2)
  #'
  #' @param data data.frame.
  #' @param margin int. 1 = row and 2 = column
  #' @param threshold int. Percentage of missing data accepted
  #'
  #' @return Return a data.frame with at most threshold percent of NA per row/column
  ## Percentage of NA by row or column
  data_na <- apply(data, MARGIN = margin, FUN = function(v){sum(is.na(v)) / length(v) * 100})
  ## Names of the rows/columns above the threshold
  toRemove <- names(data_na[data_na > threshold])
  if(margin == 1){
    data_withoutNa <- data[!(row.names(data) %in% toRemove),]
    print(paste0("Remove ", length(toRemove), " samples."))
  }
  if(margin == 2){
    data_withoutNa <- data[,!(colnames(data) %in% toRemove)]
    print(paste0("Remove ", length(toRemove), " features."))
  }
  return(data_withoutNa)
}
For instance, the CLL drug data contain missing data (sample H024).
## D_001_1 D_001_2 D_001_3 D_001_4 D_001_5
## H045 0.02363938 0.04623274 0.3187471 0.8237027 0.8962777
## H109 0.07359900 0.10623002 0.2732891 0.7171379 0.8850003
## H024 NA NA NA NA NA
## H056 0.05813930 0.09022028 0.2322145 0.7225736 0.7957497
## H079 0.02042077 0.04750543 0.3638962 0.8073907 0.8794886
To remove samples with missing data in the drug data, we use the following command line, with:
- data = CLL_data_t$Drugs: remove the samples with missing data in the drug data
- margin = 1: samples are in rows, so we apply the function over the rows
- threshold = 0: we remove samples with at least one missing value

## [1] "Remove 16 samples."
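The corresponding call is a sketch here; the message above and the code below suggest the result is stored in a variable such as CLL_drug:
CLL_drug <- NARemoving(data = CLL_data_t$Drugs, margin = 1, threshold = 0)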
The H024 sample is no longer in the CLL drug data.
## D_001_1 D_001_2 D_001_3 D_001_4 D_001_5
## H045 0.02363938 0.04623274 0.3187471 0.8237027 0.8962777
## H109 0.07359900 0.10623002 0.2732891 0.7171379 0.8850003
## H056 0.05813930 0.09022028 0.2322145 0.7225736 0.7957497
## H079 0.02042077 0.04750543 0.3638962 0.8073907 0.8794886
## H164 0.02962725 0.08054628 0.4725991 0.8179143 0.8927961
We repeat this step for each data type. Then, we filter out samples that are not present in all data types.
sampleNames <- Reduce(intersect, list(rownames(CLL_drug), rownames(CLL_mrna)))
CLL_drug <- CLL_drug[rownames(CLL_drug) %in% sampleNames,]
CLL_mrna <- CLL_mrna[rownames(CLL_mrna) %in% sampleNames,]
Practice:
These functions could help you: is.na(), lapply(), table() and View().
Normalization should be adapted to the data type. For the data used here, we assume that the data have already been normalized according to their type. This step is really important and must be done correctly before any type of data integration.
Each feature (column) needs to have a mean equal to zero and a standard deviation equal to one. For that, the SNF package provides the standardNormalization() function.
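For instance, scaling the breast cancer miRNA data could look like this minimal sketch (tcga_mirna is assumed to hold the raw miRNA matrix):
library(SNFtool)
tcga_mirna_scaled <- standardNormalization(as.matrix(tcga_mirna))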
In Figure 4.2, you can see the data distribution of the breast cancer miRNA data before (left) and after (right) scaling. After scaling, the distribution should look closer to a standard normal distribution.
hist(as.matrix(tcga_mirna), nclass = 100, main = "Breast cancer miRNA data", xlab = "values")
hist(as.matrix(tcga_mirna_scaled), nclass = 100, main = "Breast cancer miRNA scaled data", xlab = "values")
Practice:
Scale each data type using standardNormalization().
These functions could help you: hist() and as.matrix().
In this section, we create a sample network for each type of data, based on the similarity between samples. The main steps are (Figure 5.1):
The similarity network is a sample network. It is defined as a network G with nodes (or vertices) called V and connections (or edges) called E. The connections between samples are weighted. Weights come from the similarity matrix.
We can define the similarity network G like this:
First, we calculate the distance between each pair of samples for each data type using the preprocessed data. The distance method used needs to be adapted to the feature type (e.g. continuous, discrete).
In the SNF paper (Wang et al.), the authors suggest using:
In the SNF R package, the dist2() function computes squared Euclidean distances between samples.
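A minimal sketch for one data type, assuming tara_nog_scaled is the scaled orthologous gene matrix:
tara_nog_dist <- dist2(as.matrix(tara_nog_scaled), as.matrix(tara_nog_scaled))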
Practice:
These functions could help you: as.matrix(), dim(), nrow() and ncol().
The distance matrix D is then transformed into the similarity matrix W. Distances are converted into weights using the scaled exponential similarity kernel, with a hyperparameter µ and a term ε based on the average distances between samples and their nearest neighbors:

\[ W = \exp\left(-\frac{D^2}{\mu\,\varepsilon}\right) \]

In other words, distances are converted into weights according to the distances between each sample pair and their nearest neighbors.
The SNF R package proposes the affinityMatrix() function to calculate this similarity matrix. In this case, affinity and similarity are equivalent. This function needs three parameters:
- diff: the distance matrix D
- K: the number of nearest neighbors (between 10 and 30)
- sigma: a hyperparameter, or variance (between 0.3 and 0.8)
The K nearest neighbors are used to set the similarities outside of the neighborhood to zero. The sigma parameter scales the exponential similarity kernel used to calculate the similarity.
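A minimal sketch of this step, assuming tara_nog_dist is the distance matrix computed above:
tara_nog_W <- affinityMatrix(tara_nog_dist, K = 20, sigma = 0.5)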
Then, you can visualize the similarity matrix using the pheatmap() function. This function creates a heatmap of the samples. By default, samples are clustered using hierarchical clustering. For a better visualization, we recommend to:
- hide the sample names with show_rownames = FALSE and show_colnames = FALSE
- add the metadata with the annotation parameter
- log-transform the weights (e.g. log(x, 10))
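A minimal sketch of this visualization, assuming tara_nog_W is the similarity matrix and tara_metadata the metadata data frame:
library(pheatmap)
pheatmap(log(tara_nog_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tara_metadata)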
Practice:
Create the similarity matrix using K = 20 and sigma = 0.5 for each data type.

Previously, we created a similarity matrix W and its corresponding similarity network G for each data type (Figure 6.1).
Now, we integrate these similarity matrices (in Figure 6.1, there are two data types). For that, we use an iterative fusion method. The number of iterations, T, needs to be defined.
First, two matrices are created from the similarity matrix of each data type:
The P matrix carries the full information about the similarity of each sample to all the others.
The S matrix carries, for each sample, the similarity to its K most similar samples (i.e. the topology). The similarities between non-neighboring nodes are set to zero, because the authors assume that local similarities (high weights) are more reliable than remote ones.
Then, the P matrix of each data type is iteratively updated with information from the P matrices of the other data types, making them more similar at each step. Figure 6.2 shows an example with two data types. It is a bit more complex with more data types (see the paper if you are interested).
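For two data types, the update rule from the SNF paper can be written as:
\[ P^{(1)}_{t+1} = S^{(1)} \times P^{(2)}_{t} \times (S^{(1)})^{T} \qquad P^{(2)}_{t+1} = S^{(2)} \times P^{(1)}_{t} \times (S^{(2)})^{T} \]
where the superscript indicates the data type and t the iteration.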
Finally, after T iterations, the P matrices of the data types are merged together to create the final fused similarity matrix and the corresponding fused similarity network.
Then, you can visualize the fused similarity network using Cytoscape.
First, we have to define the number of iterations, called T. It should be between 10 and 20 (as recommended by the authors).
Then, to perform the fusion, the SNF R package proposes the SNF() function. You have to provide:
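Based on the SNFtool documentation, the call takes a list of similarity matrices, the number of neighbors K and the number of iterations t. A minimal sketch, assuming tara_nog_W and tara_phy_W are the two similarity matrices:
tara_W <- SNF(list(tara_nog_W, tara_phy_W), K = 20, t = 10)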
Practice:
Perform the fusion of the similarity matrices (use T = 10).
These functions could help you: list() and length().
You can visualize the fused similarity network using Cytoscape.
First, you need to convert the fused similarity matrix into the corresponding fused similarity network. The igraph R package allows you to create and manage networks.
You can create the fused similarity network using the graph_from_adjacency_matrix() function. This function uses a similarity matrix to create the corresponding similarity network.
We don’t want duplicated information about connections between samples, nor connections of samples to themselves (self-loops), but we want to keep the connection weight values. You can use these parameters:
- diag = FALSE
- mode = "upper"
- weighted = TRUE
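A minimal sketch of this conversion, assuming tara_W is the fused similarity matrix:
library(igraph)
tara_W_net <- graph_from_adjacency_matrix(as.matrix(tara_W), mode = "upper", weighted = TRUE, diag = FALSE)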
Then, you can save the fused similarity network into an edge list file using the write.table() function.
This is an example of the saving command line:
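A minimal sketch (the file name is illustrative), using igraph's as_data_frame() to obtain an edge list:
tara_W_df <- as_data_frame(tara_W_net)
write.table(tara_W_df, "TaraOcean_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")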
To create a network using Cytoscape, use the following steps:
- Import the *_W_edgeList.txt file and define: the source node, the target node and the edge attributes.
- Import the *_metadata.txt file (Import Data as Node Table Columns).
In the Style tab:
- select a column from the metadata (here it's growth_stage from the tomato dataset) and choose a Mapping Type
- set the node shape to Ellipse and check Lock node width and height
Apply your favorite layout. For this step, we recommend trying at least the yFiles Organic Layout.
Practice:
In Cytoscape, you see that the fused similarity network is fully connected. Indeed, each sample is connected to all other samples. Remember, connections between samples are weighted: this weight represents the similarity between samples. A high weight value means a strong similarity between two samples whereas a low weight value means a weak similarity between two samples.
For a better visualization, we can select a threshold to decide which connections to keep, based on this weight value.
How do we choose a good threshold? It’s an open question. There are some possibilities:
In this example, we select a threshold based on the weight distribution of the fused similarity network.
First, we extract the edges from the network object. Indeed, unlike the corresponding fused similarity matrix W, the network object contains neither duplicate edges nor self-loops.
For instance, we extract the weight values from the Tara Ocean fused similarity network. This network contains 9591 connections in total.
Then, we display the corresponding weight histogram in Figure 7.1 (left). You can see a large number of small weight values. We can remove the values between 0 and 0.01, where the majority of the small values seem to be. With the 0.01 threshold, 777 connections are kept.
In Figure 7.1 (right), we log-transform the weights. The weight distribution is bimodal and we would like to cut it into two parts. With the threshold 0.001 (log(0.001, 10) = -3), 6440 connections are kept.
## Raw weights
hist(weights, nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights")
abline(v = 0.01, col = "red", lwd = 3)
## log10 weights
hist(log(weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "log10(weights)")
abline(v = -3.1, col = "red", lwd = 3)
| Advantages | Drawbacks |
|---|---|
| 1. Easy to implement | 1. Arbitrary |
| 2. Fast | 2. Doesn’t take the network topology into account |
| 3. Visual | |
As an example for this part, we select the median and the third quartile as thresholds. We calculate the median and the third quartile (75% quantile) of the weight values.
The weight distribution and the log-transformed weight distribution are displayed in Figure 7.2 (left and right, respectively).
## Raw weights
hist(weights, nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights")
abline(v = W_median, col = "blue", lwd = 3)
text(W_median, 2000, pos = 4, "Median", col = "blue", cex = 1)
## log10 weights
hist(log(weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "log10(weights)")
abline(v = log(W_median, 10), col = "blue", lwd = 3)
text(log(W_median, 10), 400, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(W_q75, 10), col = "purple", lwd = 3)
text(log(W_q75, 10), 350, pos = 4, "quantile 75%", col = "purple", cex = 1)
The median is the value that splits the data into two groups containing the same number of values. With this threshold (0.002143), we select 4795 connections.
The values above the third quartile (0.0041194) are in the top 25% of the highest weights. With this threshold, we select 2398 connections.
| Advantages | Drawbacks |
|---|---|
| 1. Easy to implement | 1. Arbitrary (but fitted to the data) |
| 2. Fast | 2. Doesn’t take the network topology into account |
| 3. Fitted to the data | |
In the Zahoranszky-Kohalmi et al. paper, the authors propose a method to determine the best threshold according to the topology of the network. This method calculates the Average Clustering Coefficient (ACC) of the whole network. The ACC is a global parameter that characterizes the overall network topology.
As in an Elbow approach, an ACC value is calculated for a range of thresholds. The obtained values are displayed and we choose the threshold that seems best!
For help, we created two functions to calculate these ACC values and to choose the best threshold based on the topology:
- CCCalculation(): this function calculates the Clustering Coefficient (CC) for each node
- ACCCalculation(): this function averages the CC over a network, in order to obtain the ACC

## CC calculation function
CCCalculation <- function(node, graph){
  #' Clustering Coefficient (CC) calculation
  #'
  #' Calculate the Clustering Coefficient (CC) for each node in a network
  #'
  #' @param node str. Node name
  #' @param graph igraph. Network object (e.g. the fused network object)
  #'
  #' @return Return the corresponding CC value
  degNode <- degree(graph = graph, v = node, loops = FALSE)
  if(degNode > 1){
    neighborNames <- neighbors(graph = graph, v = node)
    graph_s <- subgraph(graph = graph, vids = neighborNames)
    ## Sum of degrees = twice the number of edges between the node's neighbors
    neighborNb <- sum(degree(graph_s, loops = FALSE))
    CC <- neighborNb / (degNode * (degNode - 1))
  }else{CC <- 0}
  return(CC)
}
## ACC calculation function
ACCCalculation <- function(graph){
  #' Average Clustering Coefficient (ACC) calculation
  #'
  #' It averages the Clustering Coefficient (CC) of a network
  #'
  #' @param graph igraph. Network object (e.g. the fused network object)
  #'
  #' @return Return the corresponding ACC value
  nodes <- V(graph)
  ## Average the CC over all nodes
  ACC <- do.call(sum, lapply(nodes, CCCalculation, graph)) / length(nodes)
  return(ACC)
}
First, we determine a range of thresholds. The ACC precision increases with the number of thresholds tested (choose at least 100).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.696e-05 4.389e-04 2.143e-03 3.623e-03 4.119e-03 1.414e-01
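The thresholds vector used below is not shown; a minimal sketch, assuming 100 evenly spaced values across the observed weight range:
thresholds <- seq(min(weights), max(weights), length.out = 100)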
Calculate the ACC values for each selected threshold.
ACC_W <- do.call(rbind, lapply(thresholds, function(t, net){
net_sub <- subgraph.edges(net, E(net)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub)))
return(df)
}, tara_W_net))
The ACC values are displayed in the Figure 7.3 (left). The number of network edges for each threshold is displayed in the Figure 7.3 (right).
plot(x = ACC_W$thresholds, y = ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = ACC_W$thresholds[1], y = ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = ACC_W$thresholds[21], y = ACC_W$ACC[21], col = "pink", pch = 16, cex = 1.2)
points(x = ACC_W$thresholds[29], y = ACC_W$ACC[29], col = "purple", pch = 16, cex = 1.2)
abline(v = ACC_W$thresholds[29], col = "purple")
text(ACC_W$thresholds[29], 0.7, pos = 4, paste0("Threshold = ", ACC_W$thresholds[29]), col = "purple")
text(ACC_W$thresholds[29], 0.6, pos = 4, paste0("ACCmax = ", round(ACC_W$ACC[29], 2)), col = "purple")
plot(x = ACC_W$thresholds, y = ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = ACC_W$thresholds[29], col = "purple")
text(ACC_W$thresholds[29], 2800, pos = 4, paste0("Threshold = ", ACC_W$thresholds[29]), col = "purple")
text(ACC_W$thresholds[29], 2000, pos = 4, paste0("ACCmax = ", round(ACC_W$ACC[29], 2)), col = "purple")
text(ACC_W$thresholds[29], 1200, pos = 4, paste0("EN = ", ACC_W[29, "EN"]), col = "purple")
In Figure 7.3, you can see a peak that stands out in comparison with the others.
| Advantages | Drawbacks |
|---|---|
| 1. Takes the network topology into account | 1. Might be time-consuming |
| | 2. Interpretation (you need to get used to this method) |
| | 3. Sometimes, no peak stands out… |
The fused similarity network is already imported into Cytoscape, and you have already changed the network style and layout. If not, go back to the previous Cytoscape section.
We have a fully connected network. To select edges based on their weights, use the following steps.
In the Filter tab:
- add a Column Filter
- select the Edge: weight column name
- create a new network from the selection (Selected Nodes, Selected Edges)
Now, it’s your turn to find the best threshold!!
Practice:
1. Determine the threshold:
  1.1. Select a threshold based on the weight distribution.
  1.2. Use the median as threshold.
  1.3. Determine the threshold based on the topology of the network.
2. Visualize these thresholds in Cytoscape.
3. For each network in Cytoscape, determine:
  3.1. the number of edges and nodes,
  3.2. the number of isolated samples.
4. Which threshold do you think is best?
5. For each data type:
  5.1. Create a network from the similarity matrix.
  5.2. Repeat steps 1-4.
Keep in mind that you can change this threshold in Cytoscape anytime.
You integrated your data and created the corresponding fused similarity network: CONGRATULATIONS!!
This fused similarity network can be used for downstream analyses based on network algorithms, such as clustering, retrieval or classification. You can also visualize the network and add some external data.
In this hands-on, we give you two examples to illustrate what you can do with the fused similarity network: clustering and visualization using Cytoscape.
In the SNF paper, the authors propose a clustering method called spectralClustering(). This function takes three input parameters:
- affinity: the fused similarity matrix W
- K: the number of clusters we want
- type: the variant of spectral clustering to use (default 3)
Samples are clustered together according to their similarity. The number of clusters needs to be defined. This cluster number could also be chosen using an Elbow approach, which would be less arbitrary.
First, define the number of clusters. Here, we define four clusters.
Then, perform the clustering. Results are stored in the group variable.
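The corresponding commands could look like this minimal sketch, assuming tara_W is the fused similarity matrix:
nbClusters <- 4
group <- spectralClustering(affinity = tara_W, K = nbClusters)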
Next, merge the clustering results with the metadata.
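A minimal sketch of this merge, assuming tara_metadata has the samples as row names (merging by row names produces the Row.names column shown below):
dataGroups <- merge(tara_metadata, data.frame(Groups = group, row.names = rownames(tara_W)), by = "row.names")
head(dataGroups)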
## Row.names ocean depth Groups
## 1 TARA_004_DCM NAO DCM 1
## 2 TARA_004_SRF NAO SRF 1
## 3 TARA_007_DCM MS DCM 1
## 4 TARA_007_SRF MS SRF 1
## 5 TARA_009_DCM MS DCM 1
## 6 TARA_009_SRF MS SRF 1
Finally, save the merged data into a file.
write.table(dataGroups, "TARAOcean_4clusters.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
Practice:
You might merge the clustering results to create one result file, using the merge() function.
As a reminder: you already loaded the fused similarity network in Cytoscape and removed the weak edges. To visualize the clustering results, follow these steps:
- Import the *_clusters.txt file to the Network Collection (information will be added to every network in the collection).
- Scan Network
- Refresh Legend
This is a network visualization example for each dataset:
Practice:
Now you know how to integrate different types of data using the Similarity Network Fusion (SNF) method.
You are aware of the important steps, such as data preprocessing (e.g. normalization and scaling, data shape), similarity network creation, and fusion.
You also saw how to perform clustering on the fused similarity network and how to visualize the results.
This part is to go further in your analysis and your interpretation of the results.
At the beginning of the hands-on, we assumed that the data had already been properly normalized. We also used only the Euclidean distance, whatever the data type.
It could be interesting to try different normalization and distance methods, to choose the preprocessing best adapted to the data.
For instance, the two main adapted normalization methods for the Tara Ocean data are:
Moreover, other distances exist that are more appropriate for ecological data than the Euclidean distance:
These functions could help you: distance(), tss() or clr(). They are available in the ecodist, hilldiv and compositions packages, respectively.
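A minimal sketch of two of these alternatives, assuming tara_nog holds the raw abundance data (check each package's manual for the expected input):
library(ecodist)
library(compositions)
tara_nog_bc <- distance(tara_nog, method = "bray-curtis")  # Bray-Curtis distances
tara_nog_clr <- clr(tara_nog)  # centered log-ratio transformation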
Practice:
Did you notice the edge colors in the Figure 8.3? Colors represent the contribution of each data type in each sample association.
In the SNF paper, the authors did it like this:
Depending on the number of integrated data types, we suggest two workflows:
Transform the similarity networks into data frames (to transform a similarity matrix into a similarity network, see section 6.3.1).
tara_W_df <- as_data_frame(tara_W_net)
tara_nog_df <- as_data_frame(tara_nog_net)
tara_phy_df <- as_data_frame(tara_phy_net)
Merge data frames into one data frame:
tmp <- merge(tara_nog_df, tara_phy_df, by = c(1,2), suffixes = c("_tara_nog", "_tara_phy"))
weights_df <- merge(tmp, tara_W_df, by = c(1,2))
head(weights_df)
## from to weight_tara_nog weight_tara_phy weight
## 1 TARA_004_DCM TARA_138_DCM 5.941851e-04 2.261158e-04 0.0093268270
## 2 TARA_004_SRF TARA_004_DCM 1.514222e-03 2.973412e-04 0.0151933836
## 3 TARA_004_SRF TARA_133_MES 7.162040e-06 1.011284e-05 0.0001358067
## 4 TARA_004_SRF TARA_138_DCM 8.719842e-05 1.805816e-04 0.0028421617
## 5 TARA_007_DCM TARA_004_DCM 1.163166e-05 3.450973e-05 0.0030039624
## 6 TARA_007_DCM TARA_004_SRF 1.160152e-05 3.159833e-05 0.0025836564
Determine the data type contribution for each edge:
sources_df <- as.data.frame(do.call(rbind, apply(weights_df, 1, function(s){
  sources <- as.numeric(s[c(3,4)])
  ## Initiate the keep vector with FALSE values
  keep <- rep(FALSE, length(sources))
  ## Where is the max value?
  max_ind <- which.max(sources)
  ## Keep it
  max <- max(sources)
  keep[max_ind] <- TRUE
  ## Retrieve the indices of all non-max values
  tmp <- seq(1, length(sources))
  tmp <- tmp[!tmp %in% max_ind]
  ## Keep every other weight that is within 10% of the max weight
  for(i in tmp){
    w <- as.numeric(sources[i])
    if(max > w + (w * 10 / 100)){keep[i] <- FALSE}
    else{keep[i] <- TRUE}
  }
  ## Compare the kept weights by pairs
  ## If two weights differ by more than 10%, they are not selected anymore,
  ## except if one of them is the max value
  keep_df <- as.data.frame(table(keep))
  if(keep_df[keep_df$keep == TRUE, "Freq"] > 1){
    ind <- which(keep == TRUE)
    ind_comb <- combn(x = ind, m = 2)
    for(i in seq(1, ncol(ind_comb))){
      vector <- ind_comb[,i]
      v1 <- as.numeric(sources[vector[1]])
      v2 <- as.numeric(sources[vector[2]])
      if(abs(v1 - v2) < (v1 * 10 / 100) && abs(v1 - v2) < (v2 * 10 / 100)){
        keep[vector[1]] <- TRUE
        keep[vector[2]] <- TRUE
      }else{
        keep[vector[1]] <- FALSE
        keep[vector[2]] <- FALSE
        keep[max_ind] <- TRUE
      }
    }
  }
  names(keep) <- names(s[c(3,4)])
  final <- c(s[1], s[2], keep)
  return(final)
}, simplify = FALSE)))
Encode the data type contribution to help the visualization (1 = proNOGs, 2 = phylogenomic; the sum column is 3 when both contribute):
sources_df$tara_nog <- ifelse(sources_df$tara_nog == TRUE, 1, 0)
sources_df$tara_phy <- ifelse(sources_df$tara_phy == TRUE, 2, 0)
sources_df$sum <- apply(sources_df[,c(3,4)], 1, sum)
Write results into a file:
tara_W_df_withSource <- merge(tara_W_df, sources_df, by = c(1,2))
write.table(tara_W_df_withSource, "TaraOcean_W_edgeList_withSource.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
Count the number of each combination of data types:
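A minimal sketch of this count, using the encoded sum column:
as.data.frame(table(tara_W_df_withSource$sum))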
Finally, load this file into Cytoscape and change the edge colors (see section 6.3.2 to import files and change node color).
Transform the similarity networks into data frames (to transform a similarity matrix into a similarity network, see section 6.3.1).
tcga_W_df <- as_data_frame(tcga_W_net)
tcga_mirna_df <- as_data_frame(tcga_mirna_net)
tcga_mrna_df <- as_data_frame(tcga_mrna_net)
tcga_prot_df <- as_data_frame(tcga_prot_net)
Merge data frames into one data frame:
tmp1 <- merge(tcga_mirna_df, tcga_mrna_df, by = c(1,2), suffixes = c("_tcga_miRNA", "_tcga_mRNA"))
tmp2 <- merge(tcga_prot_df, tcga_W_df, by = c(1, 2), suffixes = c("_tcga_prot", ""))
weights_df <- merge(tmp1, tmp2, by = c(1,2))
head(weights_df)
## from to weight_tcga_miRNA weight_tcga_mRNA weight_tcga_prot weight
## 1 A03L A08O 3.664207e-05 4.127975e-04 1.376957e-03 0.007833912
## 2 A03L A0BS 9.365460e-05 8.151721e-05 3.543471e-04 0.003016587
## 3 A03L A0E1 7.392441e-05 3.924428e-04 4.981989e-04 0.004073567
## 4 A03L A0FS 1.971148e-05 1.154763e-04 5.335656e-04 0.001935859
## 5 A03L A0H7 2.216591e-04 3.549924e-04 4.890533e-05 0.005945000
## 6 A03L A0W4 6.924449e-05 2.586121e-04 8.181917e-04 0.002147077
Determine the data type contribution for each edge:
sources_df <- as.data.frame(do.call(rbind, apply(weights_df, 1, function(s){
  sources <- as.numeric(s[c(3,4,5)])
  ## Initiate the keep vector with FALSE values
  keep <- rep(FALSE, length(sources))
  ## Where is the max value?
  max_ind <- which.max(sources)
  ## Keep it
  max <- max(sources)
  keep[max_ind] <- TRUE
  ## Retrieve the indices of all non-max values
  tmp <- seq(1, length(sources))
  tmp <- tmp[!tmp %in% max_ind]
  ## Keep every other weight that is within 10% of the max weight
  for(i in tmp){
    w <- as.numeric(sources[i])
    if(max > w + (w * 10 / 100)){keep[i] <- FALSE}
    else{keep[i] <- TRUE}
  }
  ## Compare the kept weights by pairs
  ## If two weights differ by more than 10%, they are not selected anymore,
  ## except if one of them is the max value
  keep_df <- as.data.frame(table(keep))
  if(keep_df[keep_df$keep == TRUE, "Freq"] > 1){
    ind <- which(keep == TRUE)
    ind_comb <- combn(x = ind, m = 2)
    for(i in seq(1, ncol(ind_comb))){
      vector <- ind_comb[,i]
      v1 <- as.numeric(sources[vector[1]])
      v2 <- as.numeric(sources[vector[2]])
      if(abs(v1 - v2) < (v1 * 10 / 100) && abs(v1 - v2) < (v2 * 10 / 100)){
        keep[vector[1]] <- TRUE
        keep[vector[2]] <- TRUE
      }else{
        keep[vector[1]] <- FALSE
        keep[vector[2]] <- FALSE
        keep[max_ind] <- TRUE
      }
    }
  }
  names(keep) <- names(s[c(3,4,5)])
  final <- c(s[1], s[2], keep)
  return(final)
}, simplify = FALSE)))
Encode the data type contribution to help the visualization (1 = miRNA, 2 = mRNA, 4 = protein; the sum column then encodes each combination, e.g. 3 = miRNA + mRNA, 7 = all three):
sources_df$weight_tcga_miRNA <- ifelse(sources_df$weight_tcga_miRNA == TRUE, 1, 0)
sources_df$weight_tcga_mRNA <- ifelse(sources_df$weight_tcga_mRNA == TRUE, 2, 0)
sources_df$weight_tcga_prot <- ifelse(sources_df$weight_tcga_prot == TRUE, 4, 0)
sources_df$sum <- apply(sources_df[,c(3,4,5)], 1, sum)
Write results into a file:
W_df_withSource <- merge(tcga_W_df, sources_df, by = c(1,2))
write.table(W_df_withSource, "TCGA_W_edgeList_withSource.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
Count the number of each combination of data types:
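This count presumably comes from a call like the following sketch:
as.data.frame(table(W_df_withSource$sum))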
## Var1 Freq
## 1 1 3051
## 2 2 2115
## 3 3 280
## 4 4 5090
## 5 5 317
## 6 6 289
## 7 7 33
Finally, load this file into Cytoscape and change the edge colors (see section 6.3.2 to import files and change node color).
Practice: