--- title: "Similarity Network Fusion (SNF)" author: "Morgane Térézol - Galadriel Brière - Samuel Chaffron" date: '`r format(Sys.Date(), "%B %d, %Y")`' output: rmdformats::material: #code_folding: hide use_bookdown: true thumbnails: false css: "ETBII_2024_SNF.css" header-includes: - \usepackage{subfig} - \usepackage{booktabs} - \usepackage{xcolor} - \usepackage{tcolorbox} --- ```{r setup, eval = TRUE, echo = FALSE} workDir <- "01_Scripts/" knitr::opts_knit$set(root.dir = workDir) ## = setwd() knitr::opts_chunk$set(cache = TRUE) options(knitr.table.format = "html") load(file = "ETBII_2024_SNF.RData") ``` # Libraries and environment ## Load environment Libraries used to create and generate this report: ```{r lib-env, echo = FALSE, message = FALSE, warning = FALSE} library("rmarkdown") library("knitr") library("rmdformats") library("bookdown") library("kableExtra") ``` - R : ``r R.version$version.string`` - rmarkdown : ``r packageVersion("rmarkdown")`` - knitr : ``r packageVersion("knitr")`` - rmdformats : ``r packageVersion("rmdformats")`` - bookdown : ``r packageVersion("bookdown")`` - kableExtra : ``r packageVersion("kableExtra")`` ## Load libraries Libraries used to **analyse data**: ```{r lib-analysis, echo = TRUE, message = FALSE, warning = FALSE} library("SNFtool") library("pheatmap") library("igraph") ``` - SNFtool: ``r packageVersion("SNFtool")`` - pheatmap: ``r packageVersion("pheatmap")`` - igraph: ``r packageVersion("igraph")`` Libraries used to **load data**: ```{r lib-data, echo = TRUE, message = FALSE, warning = FALSE} library("MOFAdata") library("data.table") library("mixOmics") ``` - MOFAdata: ``r packageVersion("MOFAdata")`` - data.table: ``r packageVersion("data.table")`` - mixOmics: ``r packageVersion("mixOmics")`` ## Visualization using Cytoscape Cytoscape is used for visualization. Figures were generated using the `Cytoscape v3.9.1` and several Cytoscape apps: - yFiles Layout Algorithm: `1.1.3` - LegendCreator: `1.1.6` # Functions In this tutorial, we decide to remove samples with at least one missing data. To remove samples with missing data, we propose the following ``NARemoving()`` function. Input parameters are: - `data`: the data type - `margin`: a vector giving the subscripts which the function will be applied over (e.g. 1 indicates rows and 2 indicates columns) - `threshold`: threshold above which samples/features are deleted ```{r} NARemoving <- function(data, margin, threshold){ #' NA removing #' #' Calculate percentage of na #' Remove na from rows (margin = 1) or column (margin = 2) #' #' @param data data.frame. #' @param margin int. 1 = row and 2 = column #' @param threshold int. Number of missing data accepted #' #' @return Return data.frame with a specific number of na by row/column data_na <- apply(data, MARGIN = margin, FUN = function(v){sum(is.na(v)) / length(v) * 100}) # print(table(data_na)) toRemove <- split(names(data_na[data_na > threshold]), " ")[[1]] if(margin == 1){ data_withoutNa <- data[!(row.names(data) %in% toRemove),] print(paste0("Remove ", as.character(length(toRemove)), " samples.")) } if(margin == 2){ data_withoutNa <- data[,!(colnames(data) %in% toRemove)] print(paste0("Remove ", as.character(length(toRemove)), " features")) } return(data_withoutNa) } ``` For help, we created two functions to calculate these ACC values and choose the best threshold based on the topology: - ``CCCalculation()``: this function calculates the Clustering Coefficient (CC) for each node - ``ACCCalculation()``; this function averages the CC for a network, in order to obtain the ACC ```{r} ## CC calculation function CCCalculation <- function(node, graph){ #' Clustering Coefficient (CC) calculation #' #' Calculate the Clustering Coefficient (CC) for each node in a network #' #' @param node str. #' @param graph igraph. Network object (e.g. the fused network object) #' #' @return Return the corresponding CC value degNode <- degree(graph = graph, v = node, loops = FALSE) if(degNode > 1){ neighborNames <- neighbors(graph = graph, v = node) graph_s <- subgraph(graph = graph, vids = neighborNames) neighborNb <- sum(degree(graph_s, loops = FALSE)) CC <- neighborNb / (degNode * (degNode-1)) }else{CC <- 0} return(CC) } ## ACC calculation function ACCCalculation <- function(graph){ #' Average Clustering Coefficient (ACC) calculation #' #' It average the Clustering Coefficient (CC) of a network #' #' @param graph igraph. Network object (e.g. the fused network object) #' #' @return Return the corresponding ACC value nodes <- V(graph) ACC <- do.call(sum, lapply(nodes, CCCalculation, graph)) / length(nodes) return(ACC) } ``` # Choose your datasets :::: {.blackbox data-latex=""} ::: {.center data-latex=""} **Choose the dataset on which you want to apply SNF!!** ::: ::::
Different datasets are available. Note that each dataset has its specificity and some analysis steps should be adapted. ```{r data-plot, echo = FALSE, fig.align = "center", out.width = "80%", fig.cap = "Four datasets are available: **Metagenomic** dataset from Tara Ocean (image from Sunagawa et al., 2015), **Breast cancer** dataset from TCGA (image from TCGA [website](https://portal.gdc.cancer.gov/)), **CLL** dataset (Dietrich et al., 2018) and **tomato plant** dataset (figure from google image).", fig.show = "hold"} include_graphics("00_plots/Data_figure.png") ``` ## Metagenomic dataset from Tara Ocean project {#tara} To retrieve data: files are available in `/shared/projects/tp_etbii_2024_165650/Networks/TaraOcean_mibiomics` directory path in the IFB server. - dataset: - `TARAoceans_proNOGS.cvs` - `TARAoceans_proPhylo.csv` - metadata: - `TARAoceans_metadata.csv` Samples come from **eight oceans** around the world (SPO: South Pacific Ocean, NAO: North Atlantic Ocean, IO: Indian Ocean, RS: Red Sea, MS: Mediterranean Sea, NPO: North Pacific Ocean, SO: Southern Ocean, SAO: South Atlantic Ocean). Samples can come from **different layers** with different temperatures: - SRF: Surface Water Layer (0-5 meters) - DCM: Deep Chlorophyll Maximum (peak of chlorophyll, 0-600 meters) - MIX: Subsurface epipelagic Mixed Layer - MES(O): Mesopelagic zone (from 500/1000 meters) In a previous analysis (Sunagawa *et al.*, 2015), they identified a stratification mostly driven by the temperature rather than geography or other environmental factors. We have two types of data: - **orthologous genes**: the relative abundance of groups of orthologous genes (OGs) - **phylogenetic profil**: counts of S16 rRNA Does an integrative analysis of these two data types retrieve the stratification driver by the layers? Does it also find a geographical clustering? Data are coming from: [MiBiOmics gitlab](https://gitlab.univ-nantes.fr/combi-ls2n/mibiomics). ## Breast cancer dataset from The Cancer Genome Atlas To retrieve data: - dataset: using `data("breast.TCGA")` from the `mixOmics` R package - `breast.TCGA$data.train$mirna` - `breast.TCGA$data.train$mrna` - `breast.TCGA$data.train$protein` - metadata: using `data("breast.TCGA")` from the `mixOmics` R package - `breast.TCGA$data.train$subtype` Human breast cancer is a heterogeneous disease. Breast tumors can be classified into several subtypes (PAM50 classification), according to the mRNA expression level (Sorlie *et al.,* 2001). In this dataset, we have **three subtypes**: - Basal: considered more aggressive than LumA - Her2: tend to grow faster than LumA and can have a worse prognosis, but are usually successfully treated - LumA: tend to grow more slowly than other cancers, be lower grade, and have a good prognosis We have three types of data: - **mRNA**: mRNA expression level - **miRNA**: microRNA expression level - **protein**: protein abundance Does an integrative analysis of these three data types retrieve the classification of the breast cancer? Or find another classification? Data are coming from the `mixOmics` R package. The full data can be downloaded [here](http://mixomics.org/wp-content/uploads/2016/08/TCGA.normalised.mixDIABLO.RData_.zip). ## Chronic Lymphocytic Leukaemia (CLL) dataset To retrieve data: - dataset: using `data("CLL_data")` from the `MOFAdata` R package - `CLL_data_t$Drugs` - `CLL_data_t$Methylation` - `CLL_data_t$mRNA` - `CLL_data_t$Mutations` - metadata: file is available in `/shared/projects/tp_etbii_2024_165650/Networks/CLL` directory path in the IFB server. - `sample_metadata.txt` The Chronic Lymphocytic Leukaemia (CLL) is type of blood and bone marrow cancer. The full data are explained in Dietrich *et al.*, 2018 and available [here](http://bioconductor.org/packages/release/data/experiment/html/BloodCancerMultiOmics2017.html). We have four types of data: - **mRNA**: transcriptom expression level - **methylation**: DNA methylation assays - **drug**: drug response measurements - **mutation**: sommatic mutation status ## Tomato plant dataset To retrieve data: files are available in `/shared/projects/tp_etbii_2024_165650/Networks/Tomato` directory path in the IFB server. - dataset: - `mrna.tsv` - `prots.tsv` - metadata: - `samples_metadata.tsv` In order to study the protein turnover in developing tomato fruit (*Solanum lycopersicum*) in Belouah *et al.*, two omics data types were collected: - **transcript** data: gene abundance - **protein** data: protein abundance Each data type was collected in **nine** different developmental stages: GR1, GR2, GR3, GR4, GR5, GR6, GR7, GR8 and GR9. For each developmental stages, we have three replicates. Does an integrative analysis of these data types retrieve the different developmental stages? Data are coming from Belouah *et al.*, 2019. # Tara Ocean dataset ## Input data The metagenomic dataset from the Tara Ocean project contains **2 data types**: - **orthologous genes**: the relative abundance of groups of orthologous genes (OGs) - **phylogenetic profil**: the counts of S16 rRNA Data files are available in `/shared/projects/tp_etbii_2024_165650/Networks/TaraOcean_mibiomics` directory path in the IFB server. ### Load dataset First, we load the data type from `TARAoceans_proNOGS.cvs` and `TARAoceans_proPhylo.csv` files. #### Orthologous genes (nog) data The data file contains header (`head = TRUE`) and the first column contains row names (`row.names = 1`). Below, the first rows and columns are displayed. ```{r} tara_nog <- read.table(file = "../00_Data/TaraOcean_mibiomics/TARAoceans_proNOGS.csv", sep = ",", head = TRUE, row.names = 1) tara_nog[c(1:5), c(1:5)] ``` Dimensions of the data. ```{r} dim(tara_nog) ``` The nog data contain ``r nrow(tara_nog)`` samples (rows) and ``r ncol(tara_nog)`` features (columns). Data are in the right shape: **samples** in **rows** and **features** in **columns**. #### Phylogentic profil (phy) data The data file contains header (`head = TRUE`) and the first column contains row names (`row.names = 1`). Below, the first rows and columns are displayed. ```{r} tara_phy <- read.table(file = "../00_Data/TaraOcean_mibiomics/TARAoceans_proPhylo.csv", sep = ",", head = TRUE, row.names = 1) tara_phy[c(1:5), c(1:3)] ``` Number of rows in the data. ```{r} nrow(tara_phy) ``` Number of columns in the data. ```{r} ncol(tara_phy) ``` The phy data contain ``r nrow(tara_phy)`` samples (rows) and ``r ncol(tara_phy)`` features (columns). Data are in the right shape: **samples** in **rows** and **features** in **columns**. ### Load metadata The `TARAoceans_metadata.csv` file contains metadata. It contains header (`head = TRUE`) and row names (`row.names = 1`). ```{r} tara_metadata <- read.table(file = "../00_Data/TaraOcean_mibiomics/TARAoceans_metadata.csv", sep = ",", head = TRUE, row.names = 1) head(tara_metadata) ``` The metadata contains two types of information: **ocean** and **depth**. ```{r} names(tara_metadata) ``` Samples come from `8` oceans. ```{r} unique(tara_metadata$ocean) ``` Samples were collected in `4` different depths. ```{r} unique(tara_metadata$depth) ``` ### Missing data The data don't contain missing values. We can go to the following steps. ```{r} table(is.na(tara_nog)) ``` ```{r} table(is.na(tara_phy)) ``` ### Scaling We assume that data have been already **prepared** and **normalized**. #### nog data Nog data are scaled: each column will scaled to have the mean equals to zero and the standard deviation equals to one. ```{r} tara_nog_scaled <- standardNormalization(tara_nog) ``` The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling. ```{r, fig.show = "hold", out.width = "50%"} hist(as.matrix(tara_nog), nclass = 100, main = "Orthologous genes - Prepared data", xlab = "values") hist(tara_nog_scaled, nclass = 100, main = "Orthologous genes - Scaled data", xlab = "values") ``` Before scaling, data values are almost all closed to zero. After scaling, data values seem to follow something close to a normal distribution. Data values are centered to zero. #### phy data Phy data are scaled: each column will scaled to have the mean equals to zero and the standard deviation equals to one. ```{r} tara_phy_scaled <- standardNormalization(tara_phy) ``` The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling. ```{r, fig.show = "hold", out.width = "50%"} hist(as.matrix(tara_phy), nclass = 100, main = "Phylogenetic profile - Prepared data", xlab = "values") hist(tara_phy_scaled, nclass = 100, main = "Phylogenetic profile - Scaled data", xlab = "values") ``` This kind of data are sparse: there are a lot of zero values. Majority of the values are in the first range on the histogram. After scaling, data values seem to follow something close to a normal distribution. Data values are centered to zero. ## Similarity network In this part, we create the similarity network for each data type. ### Distance calculation We calculate the Euclidean distance between each pair of samples for each type of data. ```{r} tara_nog_dist <- dist2(tara_nog_scaled, tara_nog_scaled) tara_phy_dist <- dist2(tara_phy_scaled, tara_phy_scaled) ``` The distance matrix dimensions are ``r nrow(tara_nog_dist)`` rows and ``r ncol(tara_nog_dist)`` columns. Indeed, we calculated pairwise distance, so the matrix contains samples in rows and in columns. ```{r} dim(tara_nog_dist) ``` The diagonal is composed of zero values (or values very closed). Indeed, there is no distance between the same sample. ```{r} tara_nog_dist[c(1:5), c(1:5)] ``` High distance values mean that samples are not similar. And small distance values mean that samples are similar. ### Similarity calculation The distance matrix is then transformed into similarity matrix for each data type. We set two parameters: - `K = 20`: number of nearest neighbors - `sigma = 0.5`: hyperparameter ```{r} K <- 20 signma <- 0.5 ``` The `affinityMatrix()` function transforms the distance into similarity according the distance with the nearest neighbors. ```{r} tara_nog_W <- affinityMatrix(tara_nog_dist, K, signma) tara_phy_W <- affinityMatrix(tara_phy_dist, K, signma) ``` The following figures are the heatmap of the similarity matrix (W) of each data type. The left heatmap are the nog data and the right heatmap are the phy data. Samples are clustered using hierarchical clustering. For a better visualization, we log-transform similarities. ```{r, fig.show = "hold", out.width = "50%"} pheatmap(log(tara_nog_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tara_metadata, main = "Orthologous genes - log10-transformed similarity values") pheatmap(log(tara_phy_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tara_metadata, main = "Phylogenetic profil - log10-transformed similarity values") ``` Red color means a high similarity value between two samples whereas blue color means a small similarity value between two samples. The two heatmaps are different. Data seem to probably carry different type of information about the samples: - for the nog data (left), we can see two main groups of samples. Each group seems to be composed of the same ocean depth. It’s not clear for the ocean regions. - for the phy data (right), we do not identify clear groups. But, samples seem to be group by the depth too. ## Fusion We created a similarity matrix for each data type. We saw that each network carries common information and its own information. Now, we will integrate all this information into only one fused similarity matrix. ### Create the fused similarity matrix We create the fused similarity matrix using these three parameters: - list that contains the nog and phy similarity matrices - `K = 20`: number of the nearest neighbors - `T = 10`: number of iterations ```{r} K <- 20 T <- 10 ``` ```{r} tara_W <- SNF(list(tara_nog_W, tara_phy_W), K, T) tara_W[c(1:5), c(1:5)] ``` The dimension of the fused similarity matrix are ``r nrow(tara_W)`` rows and ``r ncol(tara_W)`` columns, such as the previous similarity matrices. The fused similarity matrix contains similarities between samples, we can also called them **weights**. The fused similarity matrix contains ``r length(tara_W)`` weights. ```{r} length(tara_W) ``` The fused similarity matrix doesn’t contain zero: ```{r} length(tara_W[length(tara_W) == 0]) ``` The following figure is the heatmap of the fused similarity matrix. Samples are automatically clustered with a hierarchical clustering. Weights are log-transformed for a better visualization. ```{r, fig.align = "center"} pheatmap(log(tara_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tara_metadata, main = "Tara Ocean data - log10-transformed fused similarity matrix") ``` Groups are more clearly defined in this fused similarity matrix. One corresponds to the depth MESO and other to a mix of DCM and SRF. The fused similarity matrix seems to be a mix of each similarity data type matrix. ### Visualized the fused similarity network Now, we create a fused similarity network from the fused similarity matrix. Self loops are remove (`diag = FALSE`) and only the upper values of the matrix are taken (`mode = "upper"`, avoid duplicate information). ```{r} tara_W_net <- graph_from_adjacency_matrix(tara_W, weighted = TRUE, mode = "upper", diag = FALSE) ``` Then, the fused similarity network is saved into a the `TaraOcean_W_edgeList.txt` file: ```{r} write.table(as_data_frame(tara_W_net), "../02_Results/01_TaraOcean/TaraOcean_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` This files is loaded into Cytoscape. The Figure \@ref(fig:tara-cytoscape1) shows the fused similarity network of the Tara Ocean dataset. ```{r tara-cytoscape1, echo = FALSE, fig.align = "center", fig.cap = "The fused similarity network of the Tara Ocean dataset."} include_graphics("../02_Results/01_TaraOcean/Cytoscape/TaraOcean_W.png") ``` According to Cytoscape, the network contains ``139`` samples (nodes) and ``9591`` connections (edges). The number of edges is smaller than in the similarity matrix because in the similarity matrix the weights are duplicates. The similarity matrix contains also the weights for each sample compare to itself (self loops). For now, the network is fully connected: each sample is connected to every sample. Connections between samples are weights: some connections are strong (samples are similar) some other are weak (samples are not similar). ## Threshold selection So in this section, we will choose a threshold to keep the strongest connections. ### Fused similarity network #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} tara_weights <- edge.attributes(tara_W_net)$weight hist(tara_weights, nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights") hist(log(tara_weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights") abline(v = log(0.001, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: `0.001` With this threshold, we select ``r length(tara_weights[tara_weights>=0.001])`` connections. #### Mean and third quantile We calculate the median of the weights. ```{r} tara_W_median <- median(x = tara_weights) tara_W_median ``` With the mean (``r tara_W_median``) as threshold, we select ``r length(tara_weights[tara_weights>=tara_W_median])`` connections. ```{r} length(tara_weights[tara_weights>=tara_W_median]) ``` Calculate the third quantile of the weights: ```{r} tara_W_q75 <- quantile(x = tara_weights, 0.75) tara_W_q75 ``` With the third quantile (``r tara_W_q75``) as threshold, we select ``r length(tara_weights[tara_weights>=tara_W_q75])`` connections. ```{r} length(tara_weights[tara_weights>=tara_W_q75]) ``` In the following figure, we display the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(tara_weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "log10(weights)") abline(v = log(tara_W_median, 10), col = "blue", lwd = 3) text(log(tara_W_median, 10), 400, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(tara_W_q75, 10), col = "purple", lwd = 3) text(log(tara_W_q75, 10), 400, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(tara_W)) ``` We set the range between `0` and `0.0005`. ```{r} thresholds <- seq(0, 0.1, 0.0005) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tara_W_ACC <- do.call(rbind, lapply(thresholds, function(t, net){ net_sub <- subgraph.edges(net, E(net)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub))) return(df) }, tara_W_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} plot(x = tara_W_ACC$thresholds, y = tara_W_ACC$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = tara_W_ACC$thresholds[1], y = tara_W_ACC$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tara_W_ACC$thresholds[21], y = tara_W_ACC$ACC[21], col = "pink", pch = 16, cex = 1.2) points(x = tara_W_ACC$thresholds[29], y = tara_W_ACC$ACC[29], col = "purple", pch = 16, cex = 1.2) abline(v = tara_W_ACC$thresholds[29], col = "purple") text(tara_W_ACC$thresholds[29], 0.7, pos = 4, paste0("Threshold = ", tara_W_ACC$thresholds[29]), col = "purple") text(tara_W_ACC$thresholds[29], 0.6, pos = 4, paste0("ACCmax = ", round(tara_W_ACC$ACC[29], 2)), col = "purple") plot(x = tara_W_ACC$thresholds, y = tara_W_ACC$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = tara_W_ACC$thresholds[29], col = "purple") text(tara_W_ACC$thresholds[29], 2800, pos = 4, paste0("Threshold = ", tara_W_ACC$thresholds[29]), col = "purple") text(tara_W_ACC$thresholds[29], 2000, pos = 4, paste0("ACCmax = ", round(tara_W_ACC$ACC[29], 2)), col = "purple") text(tara_W_ACC$thresholds[29], 1200, pos = 4, paste0("EN = ", tara_W_ACC[29, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. If we selected the purple local maxima, we will have ``r length(tara_weights[tara_weights>=tara_W_ACC$thresholds[29]])`` edges. It could be not enough edges. Let’s see during the visualization. ```{r} tara_W_ACC$thresholds[29] ``` #### Visualization using Cytoscape The network visualization on the left was created with the third quantile (``r tara_W_q75``). The network visualization on the right was created with the ACC method (``r tara_W_ACC$thresholds[29]``). ```{r tara-cytoscape2, echo = FALSE, fig.cap = "The fused network of the Tara Ocean dataset.", out.width = "50%", fig.show = "hold", fig.align = "center"} include_graphics("../02_Results/01_TaraOcean/Cytoscape/TaraOcean_W_004.png") include_graphics("../02_Results/01_TaraOcean/Cytoscape/TaraOcean_W_014.png") ``` The left network contains lot of edges and it’s difficult to see clear connection between samples. Nevertheless, we can see two groups of samples. It could be interesting to color the nodes with the depth information. This trend is also shows in the right network. Moreover, ocean samples seem to be groups together. Network visualizations are available in the `TARAocean_cytoscape.cys` file. ### Orthologous gene data Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} tara_nog_net <- graph_from_adjacency_matrix(tara_nog_W, weighted = TRUE, mode = "upper", diag = FALSE) write.table(as_data_frame(tara_nog_net), "../02_Results/01_TaraOcean/TaraOcean_nog_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} tara_weights <- edge.attributes(tara_nog_net)$weight hist(tara_weights, nclass = 100, main = "nog similarity network weight distribution", xlab = "weights") hist(log(tara_weights, 10), nclass = 100, main = "nog similarity network weight distribution", xlab = "weights") abline(v = log(1.584893e-05, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a kind of normal distribution. We would probably like to cut in the middle of the peak, or just before or after. If we cut in the middle, the corresponding weight is: `1.584893e-05`. With this threshold, we select ``r length(tara_weights[tara_weights>=1.584893e-05])`` connections. #### Mean and third quantile Calculate the median. ```{r} tara_nog_median <- median(x = tara_weights) tara_nog_median ``` Number of selected edges with the median as threshold. ```{r} length(tara_weights[tara_weights >= tara_nog_median]) ``` Calculate the third quantile. ```{r} tara_nog_q75 <- quantile(x = tara_weights, 0.75) tara_nog_q75 ``` Number of selected edges with the third quantile as threshold: ```{r} length(tara_weights[tara_weights >= tara_nog_q75]) ``` The following figures show where are these two threshold in the weight distribution. ```{r, fig.align = "center"} hist(log(tara_weights, 10), nclass = 100, main = "nog weight distribution", xlab = "log10(weights)") abline(v = log(tara_nog_median, 10), col = "blue", lwd = 3) text(log(tara_nog_median, 10), 350, pos = 4, "Median", col = "blue", cex = 1) abline(v = log(tara_nog_q75, 10), col = "purple", lwd = 3) text(log(tara_nog_q75, 10), 250, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine the range of the threshold, we check the weights. ```{r} summary(tara_weights) ``` We define the threshold range to try. ```{r} thresholds <- seq(0, 0.008, 0.00005) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tara_nog_ACC <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, tara_nog_net)) ``` ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = tara_nog_ACC$thresholds, y = tara_nog_ACC$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC of orthologous gene data", type = "o") points(x = tara_nog_ACC$thresholds[1], y = tara_nog_ACC$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tara_nog_ACC$thresholds[9], y = tara_nog_ACC$ACC[9], col = "pink", pch = 16, cex = 1.2) points(x = tara_nog_ACC$thresholds[10], y = tara_nog_ACC$ACC[10], col = "purple", pch = 16, cex = 1.2) abline(v = tara_nog_ACC$thresholds[10], col = "purple") text(tara_nog_ACC$thresholds[10], 0.8, pos = 4, paste0("Threshold = ", tara_nog_ACC$thresholds[10]), col = "purple") text(tara_nog_ACC$thresholds[10], 0.7, pos = 4, paste0("ACCmax = ", round(tara_nog_ACC$ACC[10], 2)), col = "purple") ## EN plot(x = tara_nog_ACC$thresholds, y = tara_nog_ACC$EN, xlab = "thresholds", ylab = "number of edges", main = "Edge number of orthologous genes data", type = "o") abline(v = tara_nog_ACC$thresholds[10], col = "purple") text(tara_nog_ACC$thresholds[10], 2800, pos = 4, paste0("Threshold = ", tara_nog_ACC$thresholds[10]), col = "purple") text(tara_nog_ACC$thresholds[10], 2000, pos = 4, paste0("ACCmax = ", round(tara_nog_ACC$ACC[10], 2)), col = "purple") text(tara_nog_ACC$thresholds[10], 1200, pos = 4, paste0("EN = ", tara_nog_ACC[10, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. The local maxima threshold is: ```{r} tara_nog_ACC$thresholds[10] ``` And the number of selected egdes are: ```{r} tara_nog_ACC$EN[10] ``` Network visualizations are available in the `TARAocean_cytoscape.cys` file. ### Phylogenetic profil data Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} tara_phy_net <- graph_from_adjacency_matrix(tara_phy_W, weighted = TRUE, mode = "upper", diag = FALSE) write.table(as_data_frame(tara_phy_net), "../02_Results/01_TaraOcean/TaraOcean_phy_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} tara_weights <- edge.attributes(tara_phy_net)$weight hist(tara_phy_W, nclass = 100, main = "phy similarity network weight distribution", xlab = "weights") hist(log(tara_weights, 10), nclass = 100, main = "phy similarity network weight distribution", xlab = "log10(weights)") abline(v = log(3.162278e-05, 10), col = "cyan", lwd = 3) ``` Number of connections: ```{r} length(tara_weights[tara_weights>= 3.162278e-05]) ``` #### Mean and third quantile Calculate the median. ```{r} tara_phy_median <- median(x = tara_weights) tara_phy_median ``` Number of selected edges with the median as threshold. ```{r} length(tara_weights[tara_weights>= tara_phy_median]) ``` Calculate the third quantile. ```{r} tara_phy_q75 <- quantile(x = tara_weights, 0.75) tara_phy_q75 ``` Number of selected edges with the third quantile as threshold. ```{r} length(tara_weights[tara_weights>= tara_phy_q75]) ``` The following figure show where are these two threshold in the weight distribution. ```{r, fig.align = "center"} hist(log(tara_weights, 10), nclass = 100, main = "phy similarity network weight distribution", xlab = "log10(weights)") abline(v = log(tara_phy_median, 10), col = "blue", lwd = 3) text(log(tara_phy_median, 10), 370, pos = 4, "Median", col = "blue", cex = 1) abline(v = log(tara_phy_q75, 10), col = "purple", lwd = 3) text(log(tara_phy_q75, 10), 250, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network We define the threshold range to try. ```{r} thresholds <- seq(0, 0.008, 0.00005) length(thresholds) ``` ```{r} tara_phy_ACC <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, tara_phy_net)) ``` ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = tara_phy_ACC$thresholds, y = tara_phy_ACC$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC of phylogenetic profil data", type = "o") points(x = tara_phy_ACC$thresholds[1], y = tara_phy_ACC$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tara_phy_ACC$thresholds[17], y = tara_phy_ACC$ACC[17], col = "pink", pch = 16, cex = 1.2) points(x = tara_phy_ACC$thresholds[22], y = tara_phy_ACC$ACC[22], col = "purple", pch = 16, cex = 1.2) abline(v = tara_phy_ACC$thresholds[22], col = "purple") text(tara_phy_ACC$thresholds[22], 0.9, pos = 4, paste0("Threshold = ", tara_phy_ACC$thresholds[22]), col = "purple") text(tara_phy_ACC$thresholds[22], 0.85, pos = 4, paste0("ACCmax = ", round(tara_phy_ACC$ACC[22], 2)), col = "purple") ## EN plot(x = tara_phy_ACC$thresholds, y = tara_phy_ACC$EN, xlab = "thresholds", ylab = "number of edges", main = "Edge number of phylogenetic profil data", type = "o") abline(v = tara_phy_ACC$thresholds[22], col = "purple") text(tara_phy_ACC$thresholds[22], 2800, pos = 4, paste0("Threshold = ", tara_phy_ACC$thresholds[22]), col = "purple") text(tara_phy_ACC$thresholds[22], 2000, pos = 4, paste0("ACCmax = ", round(tara_phy_ACC$ACC[22], 2)), col = "purple") text(tara_phy_ACC$thresholds[22], 1200, pos = 4, paste0("EN = ", tara_phy_ACC[22, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. The local maxima threshold is: ```{r} tara_phy_ACC$thresholds[22] ``` And the number of selected egdes are: ```{r} tara_phy_ACC$EN[22] ``` Network visualizations are available in the `TARAocean_cytoscape.cys` file. ## Downstream analysis ### Clustering Samples are clustered together according to their similarity. According to our data and the information we have, we choose `4` and `8` clusters. Indeed, data are coming from four different depths and eight different oceans. #### With 4 clusters ```{r} C <- 4 group <- data.frame(Groups = spectralClustering(tara_W, C)) row.names(group) <- colnames(tara_W) tara_dataGroups4 <- merge(tara_metadata, group, by = 0) ``` #### With 8 clusters ```{r} C <- 8 group <- data.frame(Groups = spectralClustering(tara_W, C)) row.names(group) <- colnames(tara_W) tara_dataGroups8 <- merge(tara_metadata, group, by = 0) ``` #### Save results Results are saved into the same file. ```{r} clusters <- merge(x = tara_dataGroups4, y = tara_dataGroups8[c(1,4)], by = "Row.names", suffixes = c("_4clusters", "_8clusters")) write.table(clusters, "../02_Results/01_TaraOcean/TaraOcean_clusters.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` ### Visualization with Cytoscape These are two examples of network visualization for the Tara ocean dataset. ```{r, echo = FALSE, fig.cap = "**Left network**: edge weights > 0.014 and node color according ocean. **Right network**: edge weights > 0.014 and node color according the depth.", fig.align = "center", fig.show = "hold"} include_graphics("../02_Results/01_TaraOcean/Cytoscape/TaraOcean_W_dataContribution_014_ocean.png") include_graphics("../02_Results/01_TaraOcean/Cytoscape/TaraOcean_W_dataContribution_014_depth.png") ``` - **Node color** represents: - the ocean (left network) - the depth (right network) - **Node label** are the sample names. - **Edge color** represents the data type contribution for each edge. We can see a high connected subnetwork on the left, connected to a sparser subnetwork on the right. The highly connected subnetwork corresponds to the MES samples (deeper layer) and the other subnetwork to DCM and SRF (surface layers). Samples from the same ocean seem to be grouped together. We can't see this stratification if we analysis one type of data alone. # Breast cancer dataset The breast cancer dataset from The Cancer Genome Atlas (TCGA) contains 3 data types: - **mRNA**: mRNA expression level - **miRNA**: microRNA expression level - **protein**: protein abundance Data are available in the R package `mixOmics`. The metadata are also available in this package. ## Input data ### Load dataset We load the breast cancer dataset: ```{r} data(breast.TCGA) ``` The ``breast.TCGA`` object contains 3 types of data and one metadata: ```{r} names(breast.TCGA$data.train) ``` Dimensions of the data are different: ```{r} lapply(breast.TCGA$data.train, dim) ``` Data are extracted into single data frame: ```{r} tcga_mirna = breast.TCGA$data.train$mirna tcga_mrna = breast.TCGA$data.train$mrna tcga_prot = breast.TCGA$data.train$protein ``` - the `tcga_miRNA` data contain ``r nrow(tcga_mirna)`` samples in rows and ``r ncol(tcga_mirna)`` features in columns. - the `tcga_mRNA` data contain ``r nrow(tcga_mrna)`` samples in rows and ``r ncol(tcga_mrna)`` features in columns. - the `tcga_prot` data contain ``r nrow(tcga_prot)`` samples in rows and ``r ncol(tcga_prot)`` features in columns. Data are already well shaped. ```{r} tcga_mrna[c(1:5), c(1:5)] ``` ### Load metadata We extract metadat from the `breast.TCGA$data.train` object. ```{r} tcga_metadata = breast.TCGA$data.train$subtype ``` The metadata contain the subtype of the breast cancer for each sample. ```{r} head(tcga_metadata) ``` For each subtype, there are `45`, `30` and `75` samples: ```{r} summary(tcga_metadata) ``` Metadata should be stored in a data frame: ```{r} tcga_metadata_df <- data.frame("subtype" = tcga_metadata) row.names(tcga_metadata_df) <- row.names(tcga_mirna) ``` We save the metadata into a file. This file will be useful for the visualization. ```{r} write.table(tcga_metadata_df, "../02_Results/03_BreastTCGA/TCGA_metadata.txt", quote = FALSE, row.names = TRUE, col.names = NA, sep = "\t") ``` ### Missing data Data don't contain missing value. We can go to the following steps. ```{r} table(is.na(tcga_mirna)) ``` ```{r} table(is.na(tcga_mrna)) ``` ```{r} table(is.na(tcga_prot)) ``` ### Scaling We assume that data have been already **prepared** and **normalized**. #### miRNA data miRNA data are scaled: each column will scaled to have the mean equals to zero and the standard deviation equals to one. ```{r} tcga_mirna_scaled <- standardNormalization(x = tcga_mirna) ``` The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling. ```{r, fig.show = "hold", out.width = "50%"} hist(tcga_mirna, nclass = 100, main = "TCGA miRNA data - Data distribution before scaling", xlab = "values") hist(tcga_mirna_scaled, nclass = 100, main = "TCGA miRNA data - Data distribution after scaling", xlab = "scaled values") ``` After scaling, data values seem to follow a normal distribution. Data values are centered to zero. #### mRNA data mrna data are scaled: ```{r} tcga_mrna_scaled <- standardNormalization(x = tcga_mrna) ``` The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling. ```{r, fig.show = "hold", out.width = "50%"} hist(tcga_mrna, nclass = 100, main = "TCGA mRNA data - Data distribution before scaling", xlab = "values") hist(tcga_mrna_scaled, nclass = 100, main = "TCGA mRNA data - Data distribution after scaling", xlab = "scaled values") ``` After scaling, data values seem to follow a normal distribution. Data values are centered to zero. #### Protein data Protein data are scaled: ```{r} tcga_prot_scaled <- standardNormalization(x = tcga_prot) ``` The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling. ```{r, fig.show = "hold", out.width = "50%"} hist(tcga_prot, nclass = 100, main = "TCGA proteomic data - Data distribution before scaling", xlab = "values") hist(tcga_prot_scaled, nclass = 100, main = "TCGA proteomic data - Data distribution after scaling", xlab = "scaled values") ``` The protein data seem to be already scaled. So for the following steps, we will used `tcga_prot` variable. ## Similarity network In this part, we create the similarity network for each data type. ### Distance calculation We calculate the Euclidean distance between each pair of samples for each type of data. ```{r} tcga_mirna_dist <- dist2(tcga_mirna_scaled, tcga_mirna_scaled) tcga_mrna_dist <- dist2(tcga_mrna_scaled, tcga_mrna_scaled) tcga_prot_dist <- dist2(tcga_prot, tcga_prot) ``` Distance matrices have ``r nrow(tcga_mirna_dist)`` rows and ``r ncol(tcga_mirna_dist)`` columns. We calculated pairwise distance, so the matrix has samples in rows and in columns. ```{r} dim(tcga_mirna_dist) ``` The diagonal of the distance matrix contains the distance between sample and itself. So the distance is equal (or very close) to zero. ```{r} tcga_mirna_dist[c(1:5), c(1:5)] ``` High distance values mean that samples are not similar. And small distance values mean that samples are similar. ### Similarity calculation The distance matrix is then transformed into similarity matrix for each data type. We set two parameters: - `K = 20`: number of nearest neighbors - `signma = 0.5`: hyperparameter ```{r} K <- 20 sigma <- 0.5 ``` The `affinityMatrix()` function transforms the distance into similarity according the distance with the nearest neighbors. ```{r} tcga_mirna_W <- affinityMatrix(tcga_mirna_dist, K, sigma) tcga_mrna_W <- affinityMatrix(tcga_mrna_dist, K, sigma) tcga_prot_W <- affinityMatrix(tcga_prot_dist, K, sigma) ``` The following figures are the heatmap of the similarity matrix (W) of each data type. Samples are clustered using hierarchical clustering. For a better visualization, we log-transform similarities. ```{r, echo = TRUE, fig.show = "hold", fig.align = "center", out.width = "50%", fig.cap = "TCGA dataset - log-transformed similarity matrix heatmap"} pheatmap(log(tcga_mirna_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tcga_metadata_df, main = "TCGA miRNA data") pheatmap(log(tcga_mrna_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tcga_metadata_df, main = "TCGA mRNA data") pheatmap(log(tcga_prot_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tcga_metadata_df, main = "TCGA proteomic data") ``` Red color means a high similarity value between two samples whereas blue color means a small similarity value between two samples. Heatmaps are different between data types. - for miRNA data (top left), there are several groups of samples. Two of them are very different (basal vs lumA) - for mRNA data (top right), we can see two different groups (basal vs lumA) - for protein data (bottom), samples are grouped by their cancer subtypes. ## Fusion We created a similarity matrix for each data type. We saw that each network carries common information and its own information. Now, we will integrate all this information into only one fused similarity matrix. ### Create the fused similarity matrix We create the fused similarity matrix using these three parameters: - list that contains the miRNA, mRNA and pro similarity matrices - `K = 20`: number of nearest neighbors - `T = 10`: number of iterations ```{r} K = 20 T = 10 tcga_W <- SNF(list(tcga_mirna_W, tcga_mrna_W, tcga_prot_W), K, T) tcga_W[c(1:5), c(1:5)] ``` The dimensions of the fused network are ```r nrow(tcga_W)``` rows and ```r ncol(tcga_W)``` columns, such as the previous similarity matrices. The fused similarity matrix contains similarities between samples, we can also called them **weights**. The fused similarity matrix contains ``r length(tcga_W)`` weights. ```{r} length(tcga_W) ``` The fused similarity matrix doesn’t contain zero. ```{r} table(tcga_W == 0) ``` The following figure is the heatmap of the fused similarity matrix. Samples are automatically clustered with a hierarchical clustering. Weights are log-transformed for a better visualization. ```{r, echo = TRUE, fig.show = "hold", fig.align = "center", out.width = "50%"} pheatmap(log(tcga_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tcga_metadata_df, main = "TCGA - Fused similarity matrix W") ``` Read color means a high similarity between samples. Blue color means a small similarity between samples. In this heatmap, samples seem to be well clustered, according the cancer subtype. Basal samples are very different from LumA samples. ### Visualize the fused similarity network Now, we create a fused similarity network from the fused similarity matrix. Self loops are remove (`diag = FALSE`) and only the upper values of the matrix are taken (`mode = "upper"`, avoid duplicate information). ```{r} tcga_W_net <- graph_from_adjacency_matrix(tcga_W, weighted = TRUE, mode = "upper", diag = FALSE) ``` Then, the fused similarity network is saved into a the `TCGA_W_edgeList.txt` file: ```{r} write.table(as_data_frame(tcga_W_net), "../02_Results/03_BreastTCGA/TCGA_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` This files is loaded into Cytoscape. The Figure \@ref(fig:tcga-cyto1) shows the fused similarity network of the Tara Ocean dataset. ```{r tcga-cyto1, echo = FALSE, fig.align = "center", fig.cap = "Fused similarity network of the breast cancer dataset. Visualization using Cytoscape.", out.width = "50%"} include_graphics("../02_Results/03_BreastTCGA/Cytoscape/TCGA_W.png") ``` According Cytoscape, the fused similarity network contains `150` nodes (samples) and ``11175`` edges (connections) between samples. The connections number is smaller in Cytoscape. Indeed, in the similarity matrix weights are duplicates. The similarity matrix contains also the weights for each sample compare to itself (self loops). For now, the fused similarity network is fully connected: each sample is connected to every other samples. Connections between samples are weighted: some connections are strong (samples are similar) and some other are weak (samples are not similar). ## Threshold selection In this section, we will determine a threshold to select the strongest connections between samples. ### Fused similarity network #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} tcga_weights <- edge.attributes(tcga_W_net)$weight hist(tcga_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights") hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights") abline(v = log(0.0039, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: `0.0039`. With this threshold, we select ``r length(tcga_weights[tcga_weights >= 0.0039])`` connections. #### Mean and third quantile Calculate the median of the weights: ```{r} tcga_W_median <- median(x = tcga_weights) tcga_W_median ``` With the mean (``r tcga_W_median``) as threshold, we select ``r length(tcga_weights[tcga_weights >= tcga_W_median])`` connections. ```{r} length(tcga_weights[tcga_weights >= tcga_W_median]) ``` Calculate the third quantile of the weights: ```{r} tcga_W_q75 <- quantile(x = tcga_weights, 0.75) tcga_W_q75 ``` With the third quantile (``r tcga_W_q75``) as threshold, we select ``r length(tcga_weights[tcga_weights >= tcga_W_q75])`` connections. ```{r} length(tcga_weights[tcga_weights >= tcga_W_q75]) ``` The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)") abline(v = log(tcga_W_median, 10), col = "blue", lwd = 3) text(log(tcga_W_median, 10), 160, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(tcga_W_q75, 10), col = "purple", lwd = 3) text(log(tcga_W_q75, 10), 160, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(tcga_W)) ``` We define a vector of threshold range to try (at least 100 values). ```{r} thresholds <- seq(0, 0.02, 0.0002) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tcga_ACC_W <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, tcga_W_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = tcga_ACC_W$thresholds, y = tcga_ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = tcga_ACC_W$thresholds[1], y = tcga_ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tcga_ACC_W$thresholds[55], y = tcga_ACC_W$ACC[55], col = "pink", pch = 16, cex = 1.2) points(x = tcga_ACC_W$thresholds[58], y = tcga_ACC_W$ACC[58], col = "purple", pch = 16, cex = 1.2) text(tcga_ACC_W$thresholds[58], 0.5, pos = 4, paste0("Threshold = ", tcga_ACC_W$thresholds[58]), col = "purple") text(tcga_ACC_W$thresholds[58], 0.4, pos = 4, paste0("ACCmax = ", tcga_ACC_W$ACC[58]), col = "purple") ## EN plot(x = tcga_ACC_W$thresholds, y = tcga_ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = tcga_ACC_W$thresholds[58], col = "purple") text(tcga_ACC_W$thresholds[58], 6000, pos = 4, paste0("Threshold = ", tcga_ACC_W$thresholds[58]), col = "purple") text(tcga_ACC_W$thresholds[58], 5000, pos = 4, paste0("ACCmax = ", tcga_ACC_W$ACC[58]), col = "purple") text(tcga_ACC_W$thresholds[58], 4000, pos = 4, paste0("EN = ", tcga_ACC_W[58, "EN"]), col = "purple") ``` We don't have obvious and clear local maxima with this dataset. - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. If we selected the purple local maxima, we will have ``r tcga_ACC_W[58, "EN"]`` edges. It could be not enough edges. Let's see during the visualization. ```{r} tcga_ACC_W$thresholds[58] ``` #### Visualization using Cytoscape The network visualization on the left was created with the third quantile (``r tcga_W_q75``). The network visualization on the right was created with the ACC method (``r tcga_ACC_W$thresholds[58]``). ```{r, echo = FALSE, fig.show = "hold", out.width = "50%"} include_graphics("../02_Results/03_BreastTCGA/Cytoscape/TCGA_W_0046.png") include_graphics("../02_Results/03_BreastTCGA/Cytoscape/TCGA_W_011.png") ``` The left network contains lot of edges and it's difficult to see clear connection between samples. Nevertheless, we can see that LumA samples are not connected (very few edges) to the Basal samples. This trend is also shows in the right network. Samples from the same cancer subtypes are connected together. The two representations could be interesting. Network visualizations are available in the `TCGA_cytoscape.cys` file. ### miRNA data Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} tcga_mirna_net <- graph_from_adjacency_matrix(tcga_mirna_W, weighted = TRUE, mode = "upper", diag = FALSE) write.table(as_data_frame(tcga_mirna_net), "../02_Results/03_BreastTCGA/TCGA_miRNA_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} tcga_weights <- edge.attributes(tcga_mirna_net)$weight hist(tcga_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights") hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights") abline(v = log(0.0001, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a kind of normal distribution. We would probably like to cut in the middle of the peak, or just before or after. If we cut in the middle, the corresponding weight is: `0.0001`. With this threshold, we select ``r length(tcga_weights[tcga_weights >= 0.0001])`` connections. #### Mean and third quantile Calculate the median of the weights: ```{r} tcga_mirna_median <- median(x = tcga_weights) tcga_mirna_median ``` Number of selected edges with the median as threshold. ```{r} length(tcga_weights[tcga_weights >= tcga_mirna_median]) ``` Calculate the third quantile of the weights: ```{r} tcga_mirna_q75 <- quantile(x = tcga_weights, 0.75) tcga_mirna_q75 ``` Number of selected edges with the third quantile as threshold: ```{r} length(tcga_weights[tcga_weights >= tcga_mirna_q75]) ``` The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)") abline(v = log(tcga_mirna_median, 10), col = "blue", lwd = 3) text(log(tcga_mirna_median, 10), 400, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(tcga_mirna_q75, 10), col = "purple", lwd = 3) text(log(tcga_mirna_q75, 10), 400, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(tcga_mirna_W)) ``` We define a vector of threshold range to try (at least 100 values). ```{r} thresholds <- seq(0, 0.002, 0.00002) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tcga_ACC_mirna <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, tcga_mirna_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = tcga_ACC_mirna$thresholds, y = tcga_ACC_mirna$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = tcga_ACC_mirna$thresholds[1], y = tcga_ACC_mirna$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tcga_ACC_mirna$thresholds[38], y = tcga_ACC_mirna$ACC[38], col = "pink", pch = 16, cex = 1.2) points(x = tcga_ACC_mirna$thresholds[41], y = tcga_ACC_mirna$ACC[41], col = "purple", pch = 16, cex = 1.2) text(tcga_ACC_mirna$thresholds[41], 0.5, pos = 4, paste0("Threshold = ", tcga_ACC_mirna$thresholds[41]), col = "purple") text(tcga_ACC_mirna$thresholds[41], 0.4, pos = 4, paste0("ACCmax = ", tcga_ACC_mirna$ACC[41]), col = "purple") ## EN plot(x = tcga_ACC_mirna$thresholds, y = tcga_ACC_mirna$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = tcga_ACC_mirna$thresholds[41], col = "purple") text(tcga_ACC_mirna$thresholds[41], 6000, pos = 4, paste0("Threshold = ", tcga_ACC_mirna$thresholds[41]), col = "purple") text(tcga_ACC_mirna$thresholds[41], 5000, pos = 4, paste0("ACCmax = ", tcga_ACC_mirna$ACC[41]), col = "purple") text(tcga_ACC_mirna$thresholds[41], 4500, pos = 4, paste0("EN = ", tcga_ACC_mirna[41, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. The local maxima threshold is: ```{r} tcga_ACC_mirna$thresholds[41] ``` And the number of selected egdes are: ```{r} tcga_ACC_mirna$EN[41] ``` Network visualizations are available in the ``TCGA_cytoscape.cys`` file. ### mRNA data Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} tcga_mrna_net <- graph_from_adjacency_matrix(tcga_mrna_W, weighted = TRUE, mode = "upper", diag = FALSE) write.table(as_data_frame(tcga_mrna_net), "../02_Results/03_BreastTCGA/TCGA_mRNA_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} tcga_weights <- edge.attributes(tcga_mrna_net)$weight hist(tcga_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights") hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights") abline(v = log(0.0001, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: `0.0001`. With this threshold, we select ``r length(tcga_weights[tcga_weights >= 0.0001])`` connections. #### Mean and third quantile Calculate the median of the weights: ```{r} tcga_mrna_median <- median(x = tcga_weights) tcga_mrna_median ``` Number of selected edges with the median as threshold. ```{r} length(tcga_weights[tcga_weights >= tcga_mrna_median]) ``` Calculate the third quantile of the weights: ```{r} tcga_mrna_q75 <- quantile(x = tcga_weights, 0.75) tcga_mrna_q75 ``` Number of selected edges with the third quantile as threshold. ```{r} length(tcga_weights[tcga_weights >= tcga_mrna_q75]) ``` The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)") abline(v = log(tcga_mrna_median, 10), col = "blue", lwd = 3) text(log(tcga_mrna_median, 10) - 0.2, 380, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(tcga_mrna_q75, 10), col = "purple", lwd = 3) text(log(tcga_mrna_q75, 10), 380, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(tcga_mrna)) ``` We define a vector of threshold range to try (at least 100 values). ```{r} thresholds <- seq(0, 0.0025, 0.00002) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tcga_ACC_mrna <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, tcga_mrna_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = tcga_ACC_mrna$thresholds, y = tcga_ACC_mrna$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = tcga_ACC_mrna$thresholds[1], y = tcga_ACC_mrna$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tcga_ACC_mrna$thresholds[54], y = tcga_ACC_mrna$ACC[54], col = "pink", pch = 16, cex = 1.2) points(x = tcga_ACC_mrna$thresholds[56], y = tcga_ACC_mrna$ACC[56], col = "purple", pch = 16, cex = 1.2) text(tcga_ACC_mrna$thresholds[56], 0.5, pos = 4, paste0("Threshold = ", tcga_ACC_mrna$thresholds[56]), col = "purple") text(tcga_ACC_mrna$thresholds[56], 0.4, pos = 4, paste0("ACCmax = ", tcga_ACC_mrna$ACC[56]), col = "purple") ## EN plot(x = tcga_ACC_mrna$thresholds, y = tcga_ACC_mrna$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = tcga_ACC_mrna$thresholds[56], col = "purple") text(tcga_ACC_mrna$thresholds[56], 6000, pos = 4, paste0("Threshold = ", tcga_ACC_mrna$thresholds[56]), col = "purple") text(tcga_ACC_mrna$thresholds[56], 5000, pos = 4, paste0("ACCmax = ", tcga_ACC_mrna$ACC[56]), col = "purple") text(tcga_ACC_mrna$thresholds[56], 4500, pos = 4, paste0("EN = ", tcga_ACC_mrna[56, "EN"]), col = "purple") ``` We don't have obvious and clear local maxima with this dataset. - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. The local maxima threshold is: ```{r} tcga_ACC_mrna$thresholds[56] ``` And the number of selected egdes are: ```{r} tcga_ACC_mrna$EN[56] ``` Network visualizations are available in the ``TCGA_cytoscape.cys`` file. ### Protein data Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} tcga_prot_net <- graph_from_adjacency_matrix(tcga_prot_W, weighted = TRUE, mode = "upper", diag = FALSE) write.table(as_data_frame(tcga_prot_net), "../02_Results/03_BreastTCGA/TCGA_prot_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} tcga_weights <- edge.attributes(tcga_prot_net)$weight hist(tcga_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights") hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights") abline(v = log(7.943282e-05, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: `7.943282e-05`. With this threshold, we select ``r length(tcga_weights[tcga_weights >= 7.943282e-05])`` connections. #### Mean and third quantile Calculate the median of the weights: ```{r} tcga_prot_median <- median(x = tcga_weights) tcga_prot_median ``` Number of selected edges with the median as threshold. ```{r} length(tcga_weights[tcga_weights >= tcga_prot_median]) ``` Calculate the third quantile of the weights: ```{r} tcga_prot_q75 <- quantile(x = tcga_weights, 0.75) tcga_prot_q75 ``` Number of selected edges with the third quantile as threshold. ```{r} length(tcga_weights[tcga_weights >= tcga_prot_q75]) ``` The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)") abline(v = log(tcga_prot_median, 10), col = "blue", lwd = 3) text(log(tcga_prot_median, 10), 280, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(tcga_prot_q75, 10), col = "purple", lwd = 3) text(log(tcga_prot_q75, 10), 260, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(tcga_prot)) ``` We define a vector of threshold range to try (at least 100 values). ```{r} thresholds <- seq(0, 0.01, 0.0001) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tcga_ACC_prot <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, tcga_prot_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = tcga_ACC_prot$thresholds, y = tcga_ACC_prot$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = tcga_ACC_prot$thresholds[1], y = tcga_ACC_prot$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tcga_ACC_prot$thresholds[36], y = tcga_ACC_prot$ACC[36], col = "pink", pch = 16, cex = 1.2) points(x = tcga_ACC_prot$thresholds[39], y = tcga_ACC_prot$ACC[39], col = "purple", pch = 16, cex = 1.2) text(tcga_ACC_prot$thresholds[39], 0.5, pos = 4, paste0("Threshold = ", tcga_ACC_prot$thresholds[39]), col = "purple") text(tcga_ACC_prot$thresholds[39], 0.4, pos = 4, paste0("ACCmax = ", tcga_ACC_prot$ACC[39]), col = "purple") ## EN plot(x = tcga_ACC_prot$thresholds, y = tcga_ACC_prot$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = tcga_ACC_prot$thresholds[39], col = "purple") text(tcga_ACC_prot$thresholds[39], 6000, pos = 4, paste0("ACCmax = ", tcga_ACC_prot$thresholds[39]), col = "purple") text(tcga_ACC_prot$thresholds[39], 5000, pos = 4, paste0("ACCmax = ", tcga_ACC_prot$ACC[39]), col = "purple") text(tcga_ACC_prot$thresholds[39], 4500, pos = 4, paste0("EN = ", tcga_ACC_prot[39, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. The local maxima threshold is: ```{r} tcga_ACC_prot$thresholds[39] ``` And the number of selected egdes are: ```{r} tcga_ACC_prot$EN[39] ``` Network visualizations are available in the ``TCGA_cytoscape.cys`` file. ## Downstream analysis ### Clustering Samples are clustered together according to their similarity.We know that there are three breast cancer subtypes in the dataset. So we decide to perform a clustering with three and four clusters. #### With 3 clusters ```{r} C <- 3 group <- data.frame(Groups = spectralClustering(tcga_W, C)) row.names(group) <- colnames(tcga_W) tcga_dataGroups3 <- merge(tcga_metadata_df, group, by = 0) ``` #### With 4 clusters ```{r} C <- 4 group <- data.frame(Groups = spectralClustering(tcga_W, C)) row.names(group) <- colnames(tcga_W) tcga_dataGroups4 <- merge(tcga_metadata_df, group, by = 0) ``` #### Save results Results are saved into the same file. ```{r} clusters <- merge(x = tcga_dataGroups3, y = tcga_dataGroups4[-2], by = "Row.names", suffixes = c("_3clusters", "_4clusters")) write.table(clusters, "../02_Results/03_BreastTCGA/TCGA_clusters.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` ### Visualization with Cytoscape These are two examples of network visualization for the breast cancer dataset. ```{r, echo = FALSE, fig.show = "hold", out.width = "50%", fig.cap = "**Left network**: edge weights > 0.011 and node color according cancer subtypes. **Right network**: edge weights > 0.011 and node color according clustering results.", fig.align = "center"} include_graphics("../02_Results/03_BreastTCGA/Cytoscape/TCGA_W_dataContribution_011.png") include_graphics("../02_Results/03_BreastTCGA/Cytoscape/TCGA_W_dataContribution_011_3clusters.png") ``` - **Node color** represents - the cancer subtypes (left network) - the clustering results (right network), we selected ``three clusters``. - **Node label** are the sample names. - **Edge color** represents the data type contribution for each edge. We can see three interconnected groups. These groups are consistent with the cancer subtypes. We can also assign one cancer subtype per clusters, found by SNF. Protein data support a lot of edges in this network. And miRNA and mRNA data seem to capture same kind of information (light green edges). # CLL dataset The Chronic Lymphocytic Leukaemia (CLL) dataset contains 4 data types: - **mRNA:** transcriptom expression level - **methylation**: DNA methylation assays - **drug**: drug response measurements - **mutation**: sommatic mutation status Data are available in the R package `MOFAdata`. Metadata file is available in `/shared/projects/tp_etbii_2024_165650/Networks/CLL` directory path in the IFB server. ## Input data ### Load dataset The CLL data are available in the `MOFAdata` R package. ```{r} data("CLL_data") ``` The ``CLL_data`` object contains 4 types of data with different dimensions: ```{r} lapply(CLL_data, dim) ``` Rows are features (e.g. drug, genes) and columns are samples. There are `200` samples. We have to change the shape of the data. ```{r} CLL_data_t <- lapply(CLL_data, t) ``` Now, samples are in rows: ```{r} CLL_data_t$mRNA[c(1:5), c(1:5)] ``` ### Load metadata The `sample_metadata.txt` file contains metadata. It contains header (`head = TRUE`) and row names (`row.names = 1`). ```{r, results = "hold"} CLL_metadata <- read.table("../00_Data/CLL/sample_metadata.txt", head = TRUE, sep = "\t", row.names = 1) head(CLL_metadata) ``` We have information for each sample about: ```{r} names(CLL_metadata) ``` For visualization, columns should be numerical, logical or character. ```{r} str(CLL_metadata) ``` ```{r} CLL_metadata$died <- as.character(CLL_metadata$died) CLL_metadata$IGHV <- as.character(CLL_metadata$IGHV) CLL_metadata$trisomy12 <- as.character(CLL_metadata$trisomy12) ``` ### Missing data #### Drug data Overview of the drug data: ```{r} CLL_data_t$Drugs[c(1:5), c(1:5)] ``` The CLL drug data contains missing data. ```{r} table(is.na(CLL_data_t$Drugs)) ``` #### Methylation data Overview of the methylation data: ```{r} CLL_data_t$Methylation[c(1:5), c(1:5)] ``` The CLL methylation data contains missing data. ```{r} table(is.na(CLL_data_t$Methylation)) ``` #### mRNA data Overview of the mRNA data: ```{r} CLL_data_t$mRNA[c(1:5), c(1:5)] ``` The CLL mRNA data contains missing data. ```{r} table(is.na(CLL_data_t$mRNA)) ``` #### Mutation data Overview of the mutation data: ```{r} CLL_data_t$Mutations[c(1:5), c(1:5)] ``` The CLL mutation data contains missing data. ```{r} table(is.na(CLL_data_t$Mutations)) ``` #### Remove missing data We remove samples with at least one missing data in each data type using the `NARemoving()` function. We set: - `margin = 1` because samples are in row - `threshold = 0` because we don't want missing data at all ```{r} CLL_drug <- NARemoving(data = CLL_data_t$Drugs, margin = 1, threshold = 0) ``` ```{r} CLL_meth <- NARemoving(data = CLL_data_t$Methylation, margin = 1, threshold = 0) ``` ```{r} CLL_mrna <- NARemoving(data = CLL_data_t$mRNA, margin = 1, threshold = 0) ``` ```{r} CLL_muta <- NARemoving(data = CLL_data_t$Mutations, margin = 1, threshold = 0) ``` We decide to not use the mutation data because this data contains a lot of missing data. Data types need to have the same set of samples. ```{r} sampleNames <- Reduce(intersect, list(rownames(CLL_drug), rownames(CLL_meth), rownames(CLL_mrna))) CLL_drug <- CLL_drug[rownames(CLL_drug) %in% sampleNames,] CLL_meth <- CLL_meth[rownames(CLL_meth) %in% sampleNames,] CLL_mrna <- CLL_mrna[rownames(CLL_mrna) %in% sampleNames,] lapply(list("Drugs" = CLL_drug, "Meth" = CLL_meth, "mRNA" = CLL_mrna), dim) ``` We will run SNF with ``r nrow(CLL_drug)`` samples and three different data types. ### Scaling We assume that data have been already **prepared** and **normalized.** #### Drug data Drug data are scaled. Each column will have the mean equals to zero and the standard deviation equals to one. ```{r} CLL_drug_scaled <- standardNormalization(x = CLL_drug) ``` The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling. ```{r, fig.show = "hold", fig.align = "center", out.width = "50%"} hist(CLL_drug, nclass = 100, main = "CLL drug data - Data distribution before scaling", xlab = "values") hist(CLL_drug_scaled, nclass = 100, main = "CLL drug data - Data distribution after scaling", xlab = "scaled values") ``` After scaling, drug data follow a normal distribution. #### Methylation data Methylation data are scaled. ```{r} CLL_meth_scaled <- standardNormalization(x = CLL_meth) ``` The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling. ```{r, fig.show = "hold", fig.align = "center", out.width = "50%"} hist(CLL_meth, nclass = 100, main = "CLL methylation data - Data distribution before scaling", xlab = "values") hist(CLL_meth_scaled, nclass = 100, main = "CLL methylation data - Data distribution after scaling", xlab = "scaled values") ``` Here, we can see more a binomial distribution after scaling. But, the data are centered. #### mRNA data mRNA data are scaled. ```{r} CLL_mrna_scaled <- standardNormalization(x = CLL_mrna) ``` The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling. ```{r, fig.show = "hold", fig.align = "center", out.width = "50%"} hist(CLL_mrna, nclass = 100, main = "CLL mRNA data - Data distribution before scaling", xlab = "values") hist(CLL_mrna_scaled, nclass = 100, main = "CLL mRNA data - Data distribution after scaling", xlab = "scaled values") ``` After scaling, mRNA data follow a normal distribution. ## Similarity network In this part, we create the similarity network for each data type. ### Distance calculation We calculate the Euclidean distance between each pair of samples for each type of data. ```{r} CLL_drug_dist <- dist2(CLL_drug_scaled, CLL_drug_scaled) CLL_meth_dist <- dist2(CLL_meth_scaled, CLL_meth_scaled) CLL_mrna_dist <- dist2(CLL_mrna_scaled, CLL_mrna_scaled) ``` Distance matrices have ``r nrow(CLL_drug_dist)`` rows (samples) and ``r ncol(CLL_drug_dist)`` columns (samples). We calculated pairwise distance, so the matrix has samples in rows and in columns. ```{r} dim(CLL_drug_dist) ``` The diagonal of the distance matrix contains the distance between sample and itself. So the distance is equal (or very close) to zero. ```{r} CLL_drug_dist[c(1:5), c(1:5)] ``` High distance values mean that samples are not similar. And small distance values mean that samples are similar. ### Similarity calculation The distance values are transformed according the neighbors of the samples. We set two parameters: - `K = 20`: number of nearest neighbors - `sigma = 0.5`: hyperparameter ```{r} K = 20 sigma = 0.5 ``` The ``affinityMatrix()`` function transforms the distance into similarity according the distance with the nearest neighbors. ```{r} CLL_drug_W <- affinityMatrix(CLL_drug_dist, K, sigma) CLL_meth_W <- affinityMatrix(CLL_meth_dist, K, sigma) CLL_mrna_W <- affinityMatrix(CLL_mrna_dist, K, sigma) ``` The following figures are the heatmap of the similarity matrix (W) of each data type. Samples are clustered using hierarchical clustering. For a better visualization, we log-transform similarities. ```{r, echo = TRUE, fig.show = "hold", fig.align = "center", out.width = "50%", fig.cap = "CLL dataset - log-transformed similarity matrix heatmap"} pheatmap(log(CLL_drug_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = CLL_metadata[c(1, 2, 6, 7, 8)], main = "CLL Drugs") pheatmap(log(CLL_meth_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = CLL_metadata[c(1, 2, 6, 7, 8)], main = "CLL Methylation") pheatmap(log(CLL_mrna_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = CLL_metadata[c(1, 2, 6, 7, 8)], main = "CLL mRNA") ``` The red color means high similarity between two samples. The blue color means small similarity between two samples. Heatmaps are different between data types. Each data seems to carry different information about samples. - **drug data**: the heatmap shows two groups of similar samples, but non of them seems to be related to a specific metadata. - **methylation data**: the heatmap shows two or maybe three groups of similar samples. Groups seem to be related to the IGHV status. - **mRNA data**: the heatmap shows small groups but not clear one. ## Fusion We created a similarity matrix for each data type. We saw that each network carries common information and its own information. Now, we will integrate all this information into only one fused similarity matrix. ### Create the fused similarity matrix To create the fused similarity matrix, we set three parameters: - list of similarity matrices (drug, methylation and mRNA) - `K = 20`: number of nearest neighbors - `T = 10`: number of iterations ```{r} K <- 20 T <- 10 CLL_W <- SNF(list(CLL_drug_W, CLL_meth_W, CLL_mrna_W), K, T) CLL_W[c(1:5), c(1:5)] ``` The dimensions of the fused network are ``r nrow(CLL_W)`` rows and ``r ncol(CLL_W)`` columns, such as the previous similarity matrices. The fused similarity matrix contains similarities between samples, we can also called them **weights**. ```{r} dim(CLL_W) ``` The fused similarity matrix contains ``r length(CLL_W)`` weights. ```{r} length(CLL_W) ``` The fused similarity matrix doesn't contain zero. ```{r} length(CLL_W[length(CLL_W) == 0]) ``` The following figure is the heatmap of the fused similarity matrix. Samples are automatically clustered with a hierarchical clustering. Weights are log-transformed for a better visualization. ```{r, echo = TRUE, fig.show = "hold", fig.align = "center", out.width = "50%"} pheatmap(log(CLL_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = CLL_metadata[c(1, 2, 6, 7, 8)], main = "CLL - Fused similarity matrix W") ``` The red color means high similarity between two samples. The blue color means small similarity between two samples. The heatmap shows two main groups. Samples between groups are very different. Groups seem to be related to the IGHV status. This heatmap doesn't give us obvious information. We can make several assumptions: - it's the result - a previous step wasn't the best for one or several data type (normalization, scaling, distance etc) - parameters used (K, sigma and T) are not the most adapted. It could be interesting to try another distance and/or try with different parameters. ### Visualized the fused similarity network Now, we create a fused similarity network from the fused similarity matrix. Self loops are remove (`diag = FALSE`) and only the upper values of the matrix are taken (`mode = "upper"`, avoid duplicate information). ```{r} CLL_W_net <- graph_from_adjacency_matrix(CLL_W, weighted = TRUE, mode = "upper", diag = FALSE) ``` Then, the fused similarity network is saved into a the `CLL_W_edgeList.txt` file: ```{r} write.table(as_data_frame(CLL_W_net), "../02_Results/02_CLL/CLL_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` This files is loaded into Cytoscape. The Figure \@ref(fig:CLL-cyto1) shows the fused similarity network of the CLL dataset. ```{r CLL-cyto1, echo = FALSE, fig.align = "center", out.width = "50%", fig.cap = "First Cytoscape visualization of the fused similarity network of CLL dataset."} include_graphics("../02_Results/02_CLL/Cytoscape/CLL_W.png") ``` According Cytoscape, the fused similarity network contains `121` nodes (samples) and `7260` edges (connections) between samples. The connections number is smaller in Cytoscape. Indeed, in the similarity matrix weights are duplicates. The similarity matrix contains also the weights for each sample compare to itself (self loops). For now, the fused similarity network is fully connected: each sample is connected to every other samples. Connections between samples are weighted: some connections are strong (samples are similar) and some other are weak (samples are not similar). ## Threshold selection In this section, we will determine a threshold to select the strongest connections between samples. ### Fused similarity network #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} CLL_weights <- edge.attributes(CLL_W_net)$weight hist(CLL_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights") hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights") abline(v = log(0.004466836, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: `0.004466836`. With this threshold, we select ``r length(CLL_weights[CLL_weights >= 0.004466836])`` connections. #### Mean and third quantile We calculate the median of the weights. ```{r} CLL_W_median <- median(x = CLL_weights) CLL_W_median ``` With the mean (``r CLL_W_median``) as threshold, we select ``r length(CLL_weights[CLL_weights >= CLL_W_median])`` connections. ```{r} length(CLL_weights[CLL_weights >= CLL_W_median]) ``` Calculate the third quantile of the weights: ```{r} CLL_W_q75 <- quantile(x = CLL_weights, 0.75) CLL_W_q75 ``` With the third quantile (``r CLL_W_q75``) as threshold, we select ``r length(CLL_weights[CLL_weights >= CLL_W_q75])`` connections. ```{r} length(CLL_weights[CLL_weights >= CLL_W_q75]) ``` The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(CLL_weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "log10(weights)") abline(v = log(CLL_W_median, 10), col = "blue", lwd = 3) text(log(CLL_W_median, 10), 160, pos = 4, "Median", col = "blue", cex = 1) abline(v = log(CLL_W_q75, 10), col = "purple", lwd = 3) text(log(CLL_W_q75, 10), 160, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(CLL_W)) ``` We define a vector of threshold range to try (at least 100 values). ```{r} thresholds <- seq(0, 0.03, 0.0003) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} CLL_ACC_W <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, CLL_W_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = CLL_ACC_W$thresholds, y = CLL_ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = CLL_ACC_W$thresholds[1], y = CLL_ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = CLL_ACC_W$thresholds[41], y = CLL_ACC_W$ACC[41], col = "pink", pch = 16, cex = 1.2) points(x = CLL_ACC_W$thresholds[42], y = CLL_ACC_W$ACC[42], col = "purple", pch = 16, cex = 1.2) text(CLL_ACC_W$thresholds[42], 0.5, pos = 4, paste0("Threshold = ", CLL_ACC_W$thresholds[42]), col = "purple") text(CLL_ACC_W$thresholds[42], 0.4, pos = 4, paste0("ACCmax = ", CLL_ACC_W$ACC[42]), col = "purple") ## EN plot(x = CLL_ACC_W$thresholds, y = CLL_ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = CLL_ACC_W$thresholds[42], col = "purple") text(CLL_ACC_W$thresholds[42], 6000, pos = 4, paste0("Threshold = ", CLL_ACC_W$thresholds[42]), col = "purple") text(CLL_ACC_W$thresholds[42], 5000, pos = 4, paste0("ACCmax = ", CLL_ACC_W$ACC[42]), col = "purple") text(CLL_ACC_W$thresholds[42], 4500, pos = 4, paste0("EN = ", CLL_ACC_W[42, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. If we selected the purple local maxima, we will have ``r CLL_ACC_W[42, "EN"]`` edges. It could be not enough edges. Let's see during the visualization. ```{r} CLL_ACC_W$thresholds[42] ``` #### Visualization using Cytoscape The network visualization on the left was created with the third quantile (``r CLL_W_q75``). The network visualization on the right was created with the ACC method (``r CLL_ACC_W$thresholds[42]``). ```{r, echo = FALSE, fig.show = "hold", out.width = "50%"} include_graphics("../02_Results/02_CLL/Cytoscape/CLL_W_006.png") include_graphics("../02_Results/02_CLL/Cytoscape/CLL_W_0123.png") ``` The left network contains lot of edges and it's difficult to see clear connection between samples. Nevertheless, we can see two groups of samples, according the IGVH status. This trend is also shows in the right network. Samples with same IGVH status are connected together. It could be interesting to map other metadata on the network. Network visualizations are available in the `CLL_cytoscape.cys` file. ### Drug data Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} CLL_drug_net <- graph_from_adjacency_matrix(CLL_drug_W, weighted = TRUE, mode = "upper", diag = FALSE) write.table(as_data_frame(CLL_drug_net), "../02_Results/02_CLL/CLL_drug_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} CLL_weights <- edge.attributes(CLL_drug_net)$weight hist(CLL_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights") hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights") abline(v = log(0.00007, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: `0.00007`. With this threshold, we select ``r length(CLL_weights[CLL_weights >= 0.00007])`` connections. #### Mean and third quantile We calculate the median of the weights. ```{r} CLL_drug_median <- median(x = CLL_weights) CLL_drug_median ``` Number of selected edges with the median as threshold. ```{r} length(CLL_weights[CLL_weights >= CLL_drug_median]) ``` Calculate the third quantile of the weights: ```{r} CLL_drug_q75 <- quantile(x = CLL_weights, 0.75) CLL_drug_q75 ``` Number of selected edges with the third quantile as threshold: ```{r} length(CLL_weights[CLL_weights >= CLL_drug_q75]) ``` The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)") abline(v = log(CLL_drug_median, 10), col = "blue", lwd = 3) text(log(CLL_drug_median, 10) - 0.5, 200, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(CLL_drug_q75, 10), col = "purple", lwd = 3) text(log(CLL_drug_q75, 10), 230, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(CLL_drug_W)) ``` We define a vector of threshold range to try (at least 100 values). ```{r} thresholds <- seq(0, 0.0015, 0.00001) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} CLL_ACC_drug <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, CLL_drug_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = CLL_ACC_drug$thresholds, y = CLL_ACC_drug$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = CLL_ACC_drug$thresholds[1], y = CLL_ACC_drug$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = CLL_ACC_drug$thresholds[39], y = CLL_ACC_drug$ACC[39], col = "pink", pch = 16, cex = 1.2) points(x = CLL_ACC_drug$thresholds[45], y = CLL_ACC_drug$ACC[45], col = "purple", pch = 16, cex = 1.2) text(CLL_ACC_drug$thresholds[45], 0.5, pos = 4, paste0("Threshold = ", CLL_ACC_drug$thresholds[45]), col = "purple") text(CLL_ACC_drug$thresholds[45], 0.4, pos = 4, paste0("ACCmax = ", CLL_ACC_drug$ACC[45]), col = "purple") ## EN plot(x = CLL_ACC_drug$thresholds, y = CLL_ACC_drug$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = CLL_ACC_drug$thresholds[42], col = "purple") text(CLL_ACC_drug$thresholds[45], 5300, pos = 4, paste0("Threshold = ", CLL_ACC_drug$thresholds[45]), col = "purple") text(CLL_ACC_drug$thresholds[45], 5000, pos = 4, paste0("ACCmax = ", CLL_ACC_drug$ACC[45]), col = "purple") text(CLL_ACC_drug$thresholds[45], 4700, pos = 4, paste0("EN = ", CLL_ACC_drug[45, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. If we selected the purple local maxima, we will have ``r CLL_ACC_drug[45, "EN"]`` edges. It could be not enough edges. Let's see during the visualization. ```{r} CLL_ACC_drug$thresholds[45] ``` Network visualizations are available in the `CLL_cytoscape.cys` file. ### Methylation data Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} CLL_meth_net <- graph_from_adjacency_matrix(CLL_meth_W, weighted = TRUE, mode = "upper", diag = FALSE) write.table(as_data_frame(CLL_meth_net), "../02_Results/02_CLL/CLL_meth_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} CLL_weights <- edge.attributes(CLL_meth_net)$weight hist(CLL_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights") hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights") abline(v = log(1.258925e-05, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a kind of normal distribution. We probably would like to cut the peaks and choose the corresponding weight: `1.258925e-05`. With this threshold, we select ``r length(CLL_weights[CLL_weights >= 1.258925e-05])`` connections. #### Mean and third quantile We calculate the median of the weights. ```{r} CLL_meth_median <- median(x = CLL_weights) CLL_meth_median ``` Number of selected edges with the median as threshold. ```{r} length(CLL_weights[CLL_weights >= CLL_meth_median]) ``` Calculate the third quantile of the weights: ```{r} CLL_meth_q75 <- quantile(x = CLL_weights, 0.75) CLL_meth_q75 ``` Number of selected edges with the third quantile as threshold. ```{r} length(CLL_weights[CLL_weights >= CLL_meth_q75]) ``` The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)") abline(v = log(CLL_meth_median, 10), col = "blue", lwd = 3) text(log(CLL_meth_median, 10), 200, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(CLL_meth_q75, 10), col = "purple", lwd = 3) text(log(CLL_meth_q75, 10), 230, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(CLL_meth_W)) ``` We define a vector of threshold range to try (at least 100 values). ```{r} thresholds <- seq(0, 0.00006, 0.0000005) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} CLL_ACC_meth <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, CLL_meth_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = CLL_ACC_meth$thresholds, y = CLL_ACC_meth$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = CLL_ACC_meth$thresholds[1], y = CLL_ACC_meth$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = CLL_ACC_meth$thresholds[41], y = CLL_ACC_meth$ACC[41], col = "pink", pch = 16, cex = 1.2) points(x = CLL_ACC_meth$thresholds[42], y = CLL_ACC_meth$ACC[42], col = "purple", pch = 16, cex = 1.2) text(CLL_ACC_meth$thresholds[42], 0.55, pos = 4, paste0("Threshold = ", CLL_ACC_meth$thresholds[42]), col = "purple") text(CLL_ACC_meth$thresholds[42], 0.45, pos = 4, paste0("ACCmax = ", CLL_ACC_meth$ACC[42]), col = "purple") ## EN plot(x = CLL_ACC_meth$thresholds, y = CLL_ACC_meth$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = CLL_ACC_meth$thresholds[42], col = "purple") text(CLL_ACC_meth$thresholds[42], 5500, pos = 4, paste0("Threshold = ", CLL_ACC_meth$thresholds[42]), col = "purple") text(CLL_ACC_meth$thresholds[42], 5000, pos = 4, paste0("ACCmax = ", CLL_ACC_meth$ACC[42]), col = "purple") text(CLL_ACC_meth$thresholds[42], 4500, pos = 4, paste0("EN = ", CLL_ACC_meth[42, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. If we selected the purple local maxima, we will have ``r CLL_ACC_meth[42, "EN"]`` edges. It could be not enough edges. Let's see during the visualization. ```{r} CLL_ACC_meth$thresholds[42] ``` Network visualizations are available in the `CLL_cytoscape.cys` file. ### mRNA data Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} CLL_mrna_net <- graph_from_adjacency_matrix(CLL_mrna_W, weighted = TRUE, mode = "upper", diag = FALSE) write.table(as_data_frame(CLL_mrna_net), "../02_Results/02_CLL/CLL_mrna_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} CLL_weights <- edge.attributes(CLL_mrna_net)$weight hist(CLL_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights") hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights") abline(v = log(5.011872e-06, 10), col = "cyan", lwd = 3) ``` The log-transformed weight distribution shows a kind of binomial distribution. We probably would like to cut between the two peaks and choose the corresponding weight: `5.011872e-06`. With this threshold, we select ``r length(CLL_weights[CLL_weights >= 5.011872e-06])`` connections. #### Mean and third quantile We calculate the median of the weights. ```{r} CLL_mrna_median <- median(x = CLL_weights) CLL_mrna_median ``` Number of selected edges with the median as threshold. ```{r} length(CLL_weights[CLL_weights >= CLL_mrna_median]) ``` Calculate the third quantile of the weights: ```{r} CLL_mrna_q75 <- quantile(x = CLL_weights, 0.75) CLL_mrna_q75 ``` Number of selected edges with the third quantile as threshold. ```{r} length(CLL_weights[CLL_weights >= CLL_mrna_q75]) ``` The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers. ```{r, fig.align = "center"} hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)") abline(v = log(CLL_mrna_median, 10), col = "blue", lwd = 3) text(log(CLL_mrna_median, 10), 200, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(CLL_mrna_q75, 10), col = "purple", lwd = 3) text(log(CLL_mrna_q75, 10), 190, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine a range of thresholds to try, we check the weights. ```{r} summary(c(CLL_mrna_W)) ``` We define a vector of threshold range to try (at least 100 values). ```{r} thresholds <- seq(0, 0.00004, 0.0000004) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} CLL_ACC_mrna <- do.call(rbind, lapply(thresholds, function(t, graph){ graph_sub <- subgraph.edges(graph, E(graph)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub))) return(df) }, CLL_mrna_net)) ``` Calculated values are displayed in the following figures. ```{r, fig.show = "hold", out.width = "50%", fig.cap = "**Left**: Average Clustering Coeeficient (ACC) values for each threshold - **Right**: Number of edges on network for each threshold", fig.align = "center"} ## ACC plot(x = CLL_ACC_mrna$thresholds, y = CLL_ACC_mrna$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = CLL_ACC_mrna$thresholds[1], y = CLL_ACC_mrna$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = CLL_ACC_mrna$thresholds[46], y = CLL_ACC_mrna$ACC[46], col = "pink", pch = 16, cex = 1.2) points(x = CLL_ACC_mrna$thresholds[48], y = CLL_ACC_mrna$ACC[48], col = "purple", pch = 16, cex = 1.2) text(CLL_ACC_mrna$thresholds[48], 0.55, pos = 4, paste0("Threshold = ", CLL_ACC_mrna$thresholds[48]), col = "purple") text(CLL_ACC_mrna$thresholds[48], 0.45, pos = 4, paste0("ACCmax = ", CLL_ACC_mrna$ACC[48]), col = "purple") ## EN plot(x = CLL_ACC_mrna$thresholds, y = CLL_ACC_mrna$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = CLL_ACC_mrna$thresholds[48], col = "purple") text(CLL_ACC_mrna$thresholds[48], 5500, pos = 4, paste0("Threshold = ", CLL_ACC_mrna$thresholds[48]), col = "purple") text(CLL_ACC_mrna$thresholds[48], 5000, pos = 4, paste0("ACCmax = ", CLL_ACC_mrna$ACC[48]), col = "purple") text(CLL_ACC_mrna$thresholds[48], 4500, pos = 4, paste0("EN = ", CLL_ACC_mrna[48, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. If we selected the purple local maxima, we will have ``r CLL_ACC_mrna[48, "EN"]`` edges. It could be not enough edges. Let's see during the visualization. ```{r} CLL_ACC_mrna$thresholds[48] ``` Network visualizations are available in the `CLL_cytoscape.cys` file. ## Downstream analysis ### Clustering We decided to perform a clustering analysis with two and three clusters. #### With 2 clusters ```{r} C <- 2 group <- data.frame(Groups = spectralClustering(CLL_W, C)) row.names(group) <- colnames(CLL_W) CLL_dataGroups2 <- merge(CLL_metadata, group, by = 0) ``` #### With 3 clusters ```{r} C <- 3 group <- data.frame(Groups = spectralClustering(CLL_W, C)) row.names(group) <- colnames(CLL_W) CLL_dataGroups3 <- merge(CLL_metadata, group, by = 0) ``` #### Save results Then, results are save into the same file. ```{r} clusters <- merge(x = CLL_dataGroups2, y = CLL_dataGroups3[c(1,7)], by = "Row.names", suffixes = c("_2clusters", "_3clusters")) write.table(clusters, "../02_Results/02_CLL/CLL_clusters.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` ### Visualization with Cytoscape These are two examples of network visualization for the CLL dataset. ```{r, echo = FALSE, fig.show = "hold", out.width = "50%"} include_graphics("../02_Results/02_CLL/Cytoscape/CLL_W_dataContribution_0123.png") include_graphics("../02_Results/02_CLL/Cytoscape/CLL_W_dataContribution_0123_2clusters.png") ``` - **Node color** represents - IGVH status (left network) - clustering results (right network), we selected two clusters. - **Node shape** represents trisomy12 status. - **Node labels** are sample names. - **Edge color** represents data type contribution for each edge. IGHV status is driving the clustering (right network). Inside this two groups, we can see a subnetwork that contains sample with trisomy12. With this network, we can predicted the possible IGVH status of samples without this information. # Tomato plant dataset The omic tomato plant dataset contains two omics data types: - **transcript** data - **protein** data Data files are available in `/shared/projects/tp_etbii_2024_165650/Networks/Tomato` directory path in the IFB server. ## Input data ### Load dataset #### Transcript data First, we load the transcript data that are in the `mrna.tsv` file. This file contains header (`head = TRUE`) and the first column contains row names (`row.names = 1`). Below, the first columns and rows are displayed. ```{r, results = "hold"} tomato_mrna <- read.table("../00_Data/Tomato/mrna.tsv", head = TRUE, row.names = 1) tomato_mrna[c(1:5), c(1:5)] ``` Transcript data dimensions are ``nrow(tomato_mrna)`` rows and ``ncol(tomato_mrna)``: ```{r} dim(tomato_mrna) ``` Samples (``r nrow(tomato_mrna)``) are in columns and transcripts (``r nrow(tomato_mrna)``) are in rows. We need to transpose this matrix. ```{r} tomato_mrna_t <- t(tomato_mrna) ``` The transposed matrix dimensions are ``nrow(tomato_mrna_t)`` rows and ``ncol(tomato_mrna_t)``. Samples are in columns and transcripts are in rows. ```{r} dim(tomato_mrna_t) ``` Below, the first five rows and columns are displayed: ```{r} tomato_mrna_t[c(1:5), c(1:5)] ``` #### Protein data Then, we load the protein data, available in the file `prots.tsv`. The file contains column heads (`head = TRUE`) and the first column contains row names (`row.names = 1`). Below, the first five columns and rows are displayed. ```{r} tomato_prot <- read.table("../00_Data/Tomato/prots.tsv", head = TRUE, row.names = 1) tomato_prot[c(1:5), c(1:5)] ``` Protein data have ``r ncol(tomato_prot)`` columns: ```{r} ncol(tomato_prot) ``` Protein data have ``r nrow(tomato_prot)`` rows: ```{r} nrow(tomato_prot) ``` Rows are proteins and columns are samples. To continue the analysis, data need to have the samples in rows and features in columns. So, we transpose the protein data. ```{r} tomato_prot_t <- t(tomato_prot) ``` Now, protein data are in the right shape, as you can see below: ```{r} tomato_prot_t[c(1:5), c(1:5)] ``` ### Load metadata Finally, we load the metadata that contain information about samples. Metadata are stored in the `samples_metadata.csv` file. This file is semicolon-separated (`sep = ";"`) and contains column heads (`head = TRUE`). The first column is also row names (`row.names = 1`). ```{r} tomato_metadata <- read.table("../00_Data/Tomato/samples_metadata.csv", head = TRUE, row.names = 1, sep = ";") ``` The metadata file contains information about: - dpa: days post anthesis (after the flower opening) - growth_stage: growth stage of the tomato plant ```{r} names(tomato_metadata) ``` There are ``r length(unique(tomato_metadata$dpa))`` different dpa: ```{r} unique(tomato_metadata$dpa) ``` There are three replicates per growth stage: ```{r} table(tomato_metadata$growth_stage) ``` ### Missing data Transcript data don't contain missing value: ```{r} table(is.na(tomato_mrna_t)) ``` Protein data don't contain neither missing value: ```{r} table(is.na(tomato_prot_t)) ``` We can go to the following steps. ### Scaling We assume that data habe been already **prepared** and **normalized**. #### Transcript data Transcript data are scaled: each column will scaled to have the mean equals to zero and the standard deviation equals to one. ```{r} tomato_mrna_scaled <- standardNormalization(tomato_mrna_t) ``` Below, we show the data distribution before and after scaling. We expected to have a normal distribution of the data after scaling. ```{r, fig.show = "hold", out.width = "50%"} hist(tomato_mrna_t, nclass = 100, main = "Tomato fruit - Transcript data - Prepared data", xlab = "values") hist(tara_phy_scaled, nclass = 100, main = "Tomato fruit - Transcript data - Scaled data", xlab = "values") ``` Data seem to be already scaled. So we will use the transposed data `tomato_mrna_t` for the following analysis. #### Protein data Protein data are scaled. ```{r} tomato_prot_scaled <- standardNormalization(tomato_prot_t) ``` Below, we show the data distribution before and after scaling. We expected to have a normal distribution of the data after scaling. ```{r, fig.show = "hold", out.width = "50%"} hist(tomato_prot_t, nclass = 100, main = "Tomato fruit - Protein data - Prepared data", xlab = "values") hist(tomato_prot_scaled, nclass = 100, main = "Tomato fruit - Protein data - Scaled data", xlab = "values") ``` These data seem also already scaled. So we will use the transposed data `tomato_prot_t` for the following analysis. ## Similarity network In this part, we create the similarity network for each data type. ### Distance calculation First, we calculate the Euclidean distance between each sample for each data type. ```{r} tomato_mrna_dist <- dist2(tomato_mrna_t, tomato_mrna_t) tomato_prot_dist <- dist2(tomato_prot_t, tomato_prot_t) ``` The created distance matrix dimensions are ``r nrow(tomato_mrna_dist)`` rows and ``r ncol(tomato_mrna_dist)`` columns. We calculated pairwise distance, so the matrix has samples in rows and in columns. ```{r} dim(tomato_mrna_dist) ``` The diagonal of the distance matrix contains the distance between sample and itself. There is distance between a sample and itself, so the distance is equal (or very close) to zero. ```{r} tomato_mrna_dist[c(1:5), c(1:5)] ``` High distance values mean that samples are not similar. And small distance values mean that samples are similar. ### Similarity calculation The distance matrix is then transformed into similarity matrix for each data type. We set two parameters: - `K = 20`: number of nearest neighbors - `signma = 0.5`: hyperparameter ```{r} K <- 20 sigma <- 0.5 ``` The `affinityMatrix()` function transforms the distance into similarity according the distance with the nearest neighbors. ```{r} tomato_mrna_W <- affinityMatrix(tomato_mrna_dist, K = K, sigma = sigma) tomato_prot_W <- affinityMatrix(tomato_prot_dist, K = K, sigma = sigma) ``` The following figures are the heatmap of the similarity matrix (W) of each data type. The left heatmap are the mrna data and the right heatmap are the protein data. Samples are clustered using hierarchical clustering. For a better visualization, we log-transform similarities. ```{r, fig.show = "hold", out.width = "50%"} pheatmap(log(tomato_mrna_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tomato_metadata, main = "Transcript data - log10-transformed similarity values") pheatmap(log(tomato_prot_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tomato_metadata, main = "Protein data - log10-transformed similarity values") ``` Red color means a high similarity value between two samples whereas blue color means a small similarity value between two samples. Heatmaps are different between the transcript and the protein data. Each data type carries different kind of sample information. The clustering shows that one group is clearly retrieve in both data type (samples in last dpa). Protein data seem to define better the development cycle of the tomato fruit. ## Fusion We created a similarity matrix for each data type. We saw that each network carries common information and its own information. Now, we will integrate all this information into only one fused similarity matrix. ### Create the fused similarity matrix We create the fused similarity matrix using these three parameters: - the list of mrna and protein similarity matrices - `K = 20`: number of nearest neighbors - `t = 10`: number of iterations ```{r} tomato_W <- SNF(list(tomato_mrna_W, tomato_prot_W), K = 20, t = 10) ``` The dimension of the fused network are ``r nrow(tomato_W)`` rows and ``r ncol(tomato_W)`` columns, such as the previous similarity matrices. The fused similarity matrix contains similarities between samples, we can also called them weights. The fused similarity network contains ``r length(tomato_W)`` weights. ```{r} length(tomato_W) ``` The fused similarity network doesn't contain zero: ```{r} table(tomato_W == 0) ``` The following figure is the heatmap of the fused similarity matrix. Samples are automatically clustered with a hierarchical clustering. Weights are log-transformed for a better visualization. ```{r, fig.align = "center"} pheatmap(log(tomato_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tomato_metadata, main = "Fused similarity matrix - log10-transformed similarity values") ``` Read color means a high similarity between samples. Blue color means a small similarity between samples. This heatmap seems to be a perfect mix between the two previous individual heatmaps. We still see two main groups: one with the last dpa and one other big with the other development stages. But, in this big group, now stages seem to be well grouped. ### Visualize the fused similarity network We create a fused similarity network from the fused similarity matrix. Self loops are remove (`diag = FALSE`) and only the upper values of the matrix are taken (`mode = "upper"`, avoid duplicate information). ```{r} tomato_W_net <- graph_from_adjacency_matrix(tomato_W, diag = FALSE, mode = "upper", weighted = TRUE) ``` Then, the fused similarity network is saved into a the `Tomato_W_edgeList.txt` file: ```{r} write.table(as_data_frame(tomato_W_net), "../02_Results/04_Tomato/Tomato_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` This files is loaded into Cytoscape. The Figure \@ref(fig:tomato-cyto1) shows the fused similarity network of the tomato plant dataset. ```{r tomato-cyto1, echo = FALSE, fig.align = "center", out.width = "50%", fig.cap = "First Cytoscape visualization of the fused similarity network of Tomato fruit dataset."} include_graphics("../02_Results/04_Tomato/Cytoscape/Tomato_W.png") ``` According Cytoscape, the fused network contains `27` samples (nodes) and `351` connections (edges) between samples. The edge number is smaller in Cytoscape because we removed the self loop and took only half of the similarity matrix. For now, the network is fully connected: each sample is connected to every sample. Connections between samples are weights: some connections are strong (samples are similar) some other are weak (samples are not similar). ## Threshold selection So in this section, we will choose a threshold to keep the strongest connections. ### Fused network #### Arbitrary threshold We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights. ```{r, fig.show = "hold", out.width = "50%"} tomato_weights <- edge.attributes(tomato_W_net)$weight hist(tomato_weights, nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights") hist(log(tomato_weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights") ``` It's not obvious how to choose the threshold with the weight distribution. Let's see other methods. #### Mean and third quantile Calculate the median: ```{r} tomato_W_median <- median(x = tomato_weights) tomato_W_median ``` Number of selected edges with the median as threshold: ```{r} length(tomato_weights[tomato_weights >= tomato_W_median]) ``` Calculate the third quantile: ```{r} tomato_W_q75 <- quantile(x = tomato_weights, 0.75) tomato_W_q75 ``` Number of selected edges with the third quantile as threshold: ```{r} length(tomato_weights[tomato_weights >= tomato_W_q75]) ``` The following figures show where are these two threshold in the weight distribution. ```{r, fig.show = "hold", out.width = "50%", fig.align = "center"} hist(log(tomato_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)") abline(v = log(tomato_W_median, 10), col = "blue", lwd = 3) text(log(tomato_W_median, 10), 10, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(tomato_W_q75, 10), col = "purple", lwd = 3) text(log(tomato_W_q75, 10), 12, pos = 4, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine the range of the threshold, we check the weights: ```{r} summary(tomato_weights) ``` We define the threshold range to try: ```{r} thresholds <- seq(0, 0.095629, 0.0005) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tomato_ACC_W <- do.call(rbind, lapply(thresholds, function(t, net){ net_sub <- subgraph.edges(net, E(net)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub))) return(df) }, tomato_W_net)) ``` Calculated values are displayed in the following figures: ```{r, fig.show = "hold", out.width = "50%"} plot(x = tomato_ACC_W$thresholds, y = tomato_ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o") points(x = tomato_ACC_W$thresholds[1], y = tomato_ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tomato_ACC_W$thresholds[73], y = tomato_ACC_W$ACC[73], col = "pink", pch = 16, cex = 1.2) points(x = tomato_ACC_W$thresholds[80], y = tomato_ACC_W$ACC[80], col = "purple", pch = 16, cex = 1.2) points(x = tomato_ACC_W$thresholds[71], y = tomato_ACC_W$ACC[71], col = "cyan", pch = 16, cex = 1.2) abline(v = tomato_ACC_W$thresholds[80], col = "purple") text(tomato_ACC_W$thresholds[80], 0.5, pos = 4, paste0("Threshold = ", tomato_ACC_W$thresholds[80]), col = "purple") text(tomato_ACC_W$thresholds[80], 0.4, pos = 4, paste0("ACCmax = ", tomato_ACC_W$ACC[80]), col = "purple") plot(x = tomato_ACC_W$thresholds, y = tomato_ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o") abline(v = tomato_ACC_W$thresholds[80], col = "purple") text(tomato_ACC_W$thresholds[80], 350, pos = 4, paste0("Threshold = ", tomato_ACC_W$thresholds[80]), col = "purple") text(tomato_ACC_W$thresholds[80], 300, pos = 4, paste0("ACCmax = ", tomato_ACC_W$ACC[80]), col = "purple") text(tomato_ACC_W$thresholds[80], 250, pos = 4, paste0("EN = ", tomato_ACC_W[80, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. - The blue dot value is another local maxima. If we selected the purple local maxima, we will have `r tomato_ACC_W$EN[80]` edges. It could be not enough edges. ```{r} tomato_ACC_W$thresholds[80] ``` We can try with another local maxima. ```{r} tomato_ACC_W$thresholds[71] ``` ```{r} tomato_ACC_W$EN[71] ``` It could be interesting to try another threshold more. ```{r} tomato_ACC_W$thresholds[18] ``` ```{r} tomato_ACC_W$EN[18] ``` The following figures are filtered network using `0.035` (left) and `0.013` (right) as thresholds. ```{r, echo = FALSE, fig.show = "hold", out.width = "50%"} include_graphics("../02_Results/04_Tomato/Cytoscape/Tomato_W_035.png") include_graphics("../02_Results/04_Tomato/Cytoscape/Tomato_W_013.png") ``` We think that the right network doesn't have enough edges. We loose to much information between samples. We will probably use the network on the left for the following visualization. Network visualizations are available in the `Tomato_cytoscape.cys` file. ### Transcript data #### Arbitrary threshold Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} tomato_mrna_net <- graph_from_adjacency_matrix(tomato_mrna_W, diag = FALSE, mode = "upper", weighted = TRUE) write.table(as_data_frame(tomato_mrna_net), "../02_Results/04_Tomato/Tomato_mrna_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` It's not obvious how to choose the threshold with the weight distribution. Let's see other methods. ```{r, fig.show = "hold", out.width = "50%"} tomato_weights <- edge.attributes(tomato_mrna_net)$weight hist(tomato_weights, nclass = 100, main = "Transcript weight distribution", xlab = "weights") hist(log(tomato_weights, 10), nclass = 100, main = "Transcript weight distribution", xlab = "weights") ``` #### Mean and third quantile Calculate the median: ```{r} tomato_mrna_median <- median(x = tomato_weights) tomato_mrna_median ``` Number of selected edges with the median as threshold: ```{r} length(tomato_weights[tomato_weights >= tomato_mrna_median]) ``` Calculate the third quantile: ```{r} tomato_mrna_q75 <- quantile(x = tomato_weights, 0.75) tomato_mrna_q75 ``` Number of selected edges with the third quantile as threshold: ```{r} length(tomato_weights[tomato_weights >= tomato_mrna_q75]) ``` The following figures show where are these two threshold in the weight distribution. ```{r, fig.show = "hold", out.width = "50%", fig.align = "center"} hist(log(tomato_weights, 10), nclass = 100, main = "Transcript weight distribution", xlab = "log10(weights)") abline(v = log(tomato_mrna_median, 10), col = "blue", lwd = 3) text(log(tomato_mrna_median, 10), 10, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(tomato_mrna_q75, 10), col = "purple", lwd = 3) text(log(tomato_mrna_q75, 10), 12, pos = 2, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine the range of the threshold, we check the weights: ```{r} summary(tomato_weights) ``` We define the threshold range to try: ```{r} thresholds <- seq(0, 2.522e-03, 0.00002) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tomato_ACC_mrna <- do.call(rbind, lapply(thresholds, function(t, net){ net_sub <- subgraph.edges(net, E(net)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub))) return(df) }, tomato_mrna_net)) ``` Calculated values are displayed in the following figures: ```{r, fig.show = "hold", out.width = "50%"} plot(x = tomato_ACC_mrna$thresholds, y = tomato_ACC_mrna$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Transcript W", type = "o") points(x = tomato_ACC_mrna$thresholds[1], y = tomato_ACC_mrna$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tomato_ACC_mrna$thresholds[36], y = tomato_ACC_mrna$ACC[36], col = "pink", pch = 16, cex = 1.2) points(x = tomato_ACC_mrna$thresholds[37], y = tomato_ACC_mrna$ACC[37], col = "purple", pch = 16, cex = 1.2) abline(v = tomato_ACC_mrna$thresholds[37], col = "purple") text(tomato_ACC_mrna$thresholds[37], 0.5, pos = 2, paste0("Threshold = ", tomato_ACC_mrna$thresholds[37]), col = "purple") text(tomato_ACC_mrna$thresholds[37], 0.4, pos = 2, paste0("ACCmax = ", tomato_ACC_mrna$ACC[37]), col = "purple") plot(x = tomato_ACC_mrna$thresholds, y = tomato_ACC_mrna$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Transcript W", type = "o") abline(v = tomato_ACC_mrna$thresholds[37], col = "purple") text(tomato_ACC_mrna$thresholds[37], 350, pos = 4, paste0("Threshold = ", tomato_ACC_mrna$thresholds[37]), col = "purple") text(tomato_ACC_mrna$thresholds[37], 300, pos = 4, paste0("ACCmax = ", tomato_ACC_mrna$ACC[37]), col = "purple") text(tomato_ACC_mrna$thresholds[37], 250, pos = 4, paste0("EN = ", tomato_ACC_mrna[37, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. The local maxima threshold is: ```{r} tomato_ACC_mrna$thresholds[37] ``` And the number of selected egdes are: ```{r} tomato_ACC_mrna$EN[37] ``` Network visualizations are available in the `Tomato_cytoscape.cys` file. ### Protein data #### Arbitrary threshold Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (`mode = "upper"`) of the similarity matrix (avoid redundancies) and remove the self loop (`diag = FALSE`). ```{r} tomato_prot_net <- graph_from_adjacency_matrix(tomato_prot_W, diag = FALSE, mode = "upper", weighted = TRUE) write.table(as_data_frame(tomato_prot_net), "../02_Results/04_Tomato/Tomato_prot_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` It's not obvious how to choose the threshold with the weight distribution. Let's see other methods. ```{r, fig.show = "hold", out.width = "50%"} tomato_weights <- edge.attributes(tomato_prot_net)$weight hist(tomato_weights, nclass = 100, main = "Protein weight distribution", xlab = "weights") hist(log(tomato_weights, 10), nclass = 100, main = "Protein weight distribution", xlab = "weights") ``` #### Mean and third quantile Calculate the median: ```{r} tomato_prot_median <- median(x = tomato_weights) tomato_prot_median ``` Number of selected edges with the median as threshold: ```{r} length(tomato_weights[tomato_weights >= tomato_prot_median]) ``` Calculate the third quantile: ```{r} tomato_prot_q75 <- quantile(x = tomato_weights, 0.75) tomato_prot_q75 ``` Number of selected edges with the third quantile as threshold: ```{r} length(tomato_weights[tomato_weights >= tomato_prot_q75]) ``` The following figures show where are these two threshold in the weight distribution. ```{r, fig.show = "hold", out.width = "50%", fig.align = "center"} hist(log(tomato_weights, 10), nclass = 100, main = "Protein weight distribution", xlab = "log10(weights)") abline(v = log(tomato_mrna_median, 10), col = "blue", lwd = 3) text(log(tomato_mrna_median, 10), 12, pos = 2, "Median", col = "blue", cex = 1) abline(v = log(tomato_mrna_q75, 10), col = "purple", lwd = 3) text(log(tomato_mrna_q75, 10), 12, pos = 2, "quantile 75%", col = "purple", cex = 1) ``` #### Topology network To determine the range of the threshold, we check the weights: ```{r} summary(tomato_weights) ``` We define the threshold range to try: ```{r} thresholds <- seq(0, 0.001, 0.00001) length(thresholds) ``` Then, we calculate the Average Clustering Coefficient for each threshold. ```{r} tomato_ACC_prot <- do.call(rbind, lapply(thresholds, function(t, net){ net_sub <- subgraph.edges(net, E(net)[weight >= t]) df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub))) return(df) }, tomato_prot_net)) ``` Calculated values are displayed in the following figures: ```{r, fig.show = "hold", out.width = "50%"} plot(x = tomato_ACC_prot$thresholds, y = tomato_ACC_prot$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the protein W", type = "o") points(x = tomato_ACC_prot$thresholds[1], y = tomato_ACC_prot$ACC[1], col = "red", pch = 16, cex = 1.2) points(x = tomato_ACC_prot$thresholds[7], y = tomato_ACC_prot$ACC[7], col = "pink", pch = 16, cex = 1.2) points(x = tomato_ACC_prot$thresholds[9], y = tomato_ACC_prot$ACC[9], col = "purple", pch = 16, cex = 1.2) abline(v = tomato_ACC_prot$thresholds[9], col = "purple") text(tomato_ACC_prot$thresholds[9], 0.8, pos = 4, paste0("Threshold = ", tomato_ACC_prot$thresholds[9]), col = "purple") text(tomato_ACC_prot$thresholds[9], 0.7, pos = 4, paste0("ACCmax = ", tomato_ACC_prot$ACC[9]), col = "purple") plot(x = tomato_ACC_prot$thresholds, y = tomato_ACC_prot$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the protein W", type = "o") abline(v = tomato_ACC_prot$thresholds[9], col = "purple") text(tomato_ACC_prot$thresholds[9], 350, pos = 4, paste0("Threshold = ", tomato_ACC_prot$thresholds[9]), col = "purple") text(tomato_ACC_prot$thresholds[9], 300, pos = 4, paste0("ACCmax = ", tomato_ACC_prot$ACC[9]), col = "purple") text(tomato_ACC_prot$thresholds[9], 250, pos = 4, paste0("EN = ", tomato_ACC_prot[9, "EN"]), col = "purple") ``` - The red dot value corresponds to the fully connected network. - The pink dot value is the smallest value before the local maxima. - The purple dot value is the local maxima. The one we are interested in. The local maxima threshold is: ```{r} tomato_ACC_prot$thresholds[9] ``` And the number of selected egdes are: ```{r} tomato_ACC_prot$EN[9] ``` Network visualizations are available in the `Tomato_cytoscape.cys` file. ## Downstream analysis ### Clustering In the Belouah *et al.* paper, they define three development stages. According that, we will run a clustering with three clusters. ```{r} C <- 3 group <- data.frame(Groups = spectralClustering(tomato_W, C)) row.names(group) <- colnames(tomato_W) dataGroups <- merge(tomato_metadata, group, by = 0) head(dataGroups) ``` Then, we save the results into a result file: ```{r} write.table(dataGroups, "../02_Results/04_Tomato/Tomato_3clusters.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") ``` ### Visualization with Cytoscape These are two examples of network visualization for the tomato fruit dataset. ```{r, echo = FALSE, fig.show = "hold", out.width = "50%", fig.cap = "**Left network**: edge weights > 0.013. **Right network**: edge weights > 0.032.", fig.align = "center"} include_graphics("../02_Results/04_Tomato/Cytoscape/Tomato_W_dataContribution_013_3clusters.png") include_graphics("../02_Results/04_Tomato/Cytoscape/Tomato_W_dataContribution_032_3clusters.png") ``` - **Node color** represents the clustering results. Here, we selected ``three clusters``. - **Node label** are the growth stat of the tomato fruit. - **Edge color** represents the data type contribution for each edge. Overall, we see two groups of nodes: one with the late growth stages (GR7, GR8 and GR9) and one other with the early growth stages (GR1-GR6). This two groups seem to be very different because there are few connections between them. We already saw these two groups with the individual heatmaps. In the Belouah *et al.* paper, the three last growth state correspond to the ripening stage (appearance of fruit color). With the clustering, we also detect the three stage level that they describe: early, mid and late stages of fruit development. With the network visualization, we can see a kind of kinetic from the early stage (GR1 only connected to GR2) to the mid stage. We can also see that the protein data type carries the most information in this network, inside the groups.