Slides 3 -> 5 of “RGCCA - Theory”.
Starts by doing a simple PCA on gene expression matrix.
##############
# Starts from here
data.train = readRDS(file = "data/data.train")
DNAm_covariates = readRDS("data/pd_mdd_bin.RDS")
Sex = as.factor(data.train$cli$Sex)
levels(Sex) = c("Female", "Male")
# PCA on mRNA with full data-set
pca_results <- PCA(data.train$mRNA, scale.unit=TRUE, ncp=25, graph=F)
fviz_pca_ind(pca_results, axes = c(1,2))+
geom_point(aes(colour = Sex)) +
guides(colour = guide_legend(title = "color"))
==> There is a strong “Sex” effect. Several methods exist in the literature to correct for Sex (or other batch) effects (ex: ComBat Johnson, Li, and Rabinovic (2006))
For the sake of simplicity, we will limit ourselves to the gender with the highest number of samples –> Females.
# Function taken from CHAMP to help interpret factors
drawheatmap <- function(svdPV.m, title = NULL, display_legend = FALSE) {
myPalette <- c("darkred", "red", "orange", "pink", "white")
breaks.v <- c(-10000, -10, -5, -2, log10(0.05), 0)
image(x = 1:nrow(svdPV.m), y = 1:ncol(svdPV.m), z = log10(svdPV.m),
col = myPalette, breaks = breaks.v, xlab = "", ylab = "",
axes = FALSE, main = paste0("Interpretation of the Components-", title))
axis(1, at = 1:nrow(svdPV.m), labels = paste("Comp-",
1:nrow(svdPV.m), sep = ""), las = 2)
suppressWarnings(axis(2, at = 1:ncol(svdPV.m), labels = colnames(svdPV.m),
las = 2))
if(display_legend){
legend(x = 17, y = 10, legend = c(expression("p < 1x" ~ 10^{-10}),
expression("p < 1x" ~ 10^{-5}),
"p < 0.01", "p < 0.05", "p > 0.05"),
fill = c("darkred","red", "orange", "pink", "white"), par("usr")[2],
par("usr")[4], xpd = NA)
}
}
check_association_CHAMP <- function(x, y){
if(!is.numeric(y)){
return(kruskal.test(x ~ as.factor(y))$p.value)
}else{
return(summary(lm(x ~ y))$coeff[2, 4])
}
}
Once we are limited to Females, let’s look back at PCA. Let us start with only two blocks (messenger RNA (mRNA) and DNA methylation (DNAm)). In order to better understand what the different PCA components represent, lets us borrow a technique from the R package Chip Analysis Methylation Pipeline (ChAMP) (Morris et al. (2013) ; function “champ.SVD”; this function allows to test the association between PCA components (continuous) and either continuous (F-test) or discrete (Kruskal-Wallis test) covariates; see slides 7 -> 9 of “RGCCA - Theory”).
Figure below: Components are interpreted in three cases : mRNA, DNAm and the concatenation of mRNA & DNAm. The last case is to see what a naïve integration can achieve and what are the limitations. Multiple remarks :
# Keep only female samples
data.train.female = lapply(data.train,
function(x) x[which(data.train$cli$Sex == F), ])
DNAm_covariates_female = DNAm_covariates[match(rownames(data.train.female$mRNA),
DNAm_covariates$Sample_ID), ]
covariates_explored = c("Sample_Group", "BMI", "BMI.bin", "Age", "Age.bin",
"Array", "Slide", "CD4", "CD8", "MO", "B", "NK", "GR")
DNAm_covariates_explored_female =
as.data.frame(DNAm_covariates_female[, covariates_explored])
# mRNA
pca_female_mRNA <- PCA(data.train.female$mRNA, scale.unit=TRUE, ncp=20, graph=F)
interp_pca_female_mRNA = apply(pca_female_mRNA$ind$coord, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
# DNAm
pca_female_DNAm <- PCA(data.train.female$DNAm, scale.unit=TRUE, ncp=20, graph=F)
interp_pca_female_DNAm = apply(pca_female_DNAm$ind$coord, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
# mRNA & DNAm
pca_female_mRNA_DNAm <- PCA(cbind(data.train.female$mRNA,
data.train.female$DNAm), scale.unit=TRUE,
ncp=20, graph=F)
interp_pca_female_mRNA_DNAm = apply(pca_female_mRNA_DNAm$ind$coord, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(3, 1), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_pca_female_mRNA), title = "Block mRNA")
drawheatmap(t(interp_pca_female_DNAm), title = "Block DNAm", display_legend = T)
drawheatmap(t(interp_pca_female_mRNA_DNAm), title = "Block mRNA & DNAm")
Figure below: Contribution of variables for Principal Component 7 for either the mRNA block or the concatenation of the mRNA and DNAm blocks. Remarks:
=> It might be interesting to look for methods that treat each omic as their own and allow to separately interpret the influence of each omics contributing to a “signal” of interest.
# "var" = variables contributions.
weights_pca_female_mRNA_DNAm = fviz_contrib(pca_female_mRNA_DNAm,
choice = "var", axes = 7,
top = 100) +
labs(title =paste0("Block mRNA & DNAm - Contribution of variables to PC7",
"\nVariance Explained : ", pca_female_mRNA_DNAm$eig[7, 2]))
weights_pca_female_mRNA = fviz_contrib(pca_female_mRNA, choice = "var",
axes = 7, top = 100) +
labs(title =paste0("Block mRNA - Contribution of variables to PC7",
"\nVariance Explained : ", pca_female_mRNA$eig[7, 2]))
ggarrange(weights_pca_female_mRNA, weights_pca_female_mRNA_DNAm,
ncol = 1)
Figure below: A Scree plot allows to visualize the variance explained by each component. Below are two Scree plots, one for the mRNA and one for the DNAm block. We can see that, apart from the first component, all principal components of the DNAm block explain a very low percentage of the total variance which is diffused across all components. Whereas for the mRNA block, the first 10 components seem to bear a signal. This might explain the fact that when the mRNA and DNAm blocks are concatenated, the influence of the mRNA block is greater.
par(mfrow = c(2, 1), omi = c(0, 0, 0, 0))
barplot(pca_female_mRNA$eig[, 2], names.arg=1:nrow(pca_female_mRNA$eig),
main = "mRNA - Scree Plot",
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="steelblue", ylim = c(0, 25))
# Add connected line segments to the plot
lines(x = 1:nrow(pca_female_mRNA$eig), pca_female_mRNA$eig[, 2],
type="b", pch=19, col = "red")
barplot(pca_female_DNAm$eig[, 2], names.arg=1:nrow(pca_female_DNAm$eig),
main = "DNAm - Scree Plot",
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="steelblue", ylim = c(0, 25))
# Add connected line segments to the plot
lines(x = 1:nrow(pca_female_DNAm$eig), pca_female_DNAm$eig[, 2],
type="b", pch=19, col = "red")
Slides 10 -> 14 of “RGCCA - Theory”.
So Now, lets us apply PLS and CCA that we have just learnt about to the MDD data-set.
Figure below: For both PLS and CCA, 10 components are extracted for each block. Similarly to PCA, we can interpret the influence of each computed components:
=> The interpretation of the components for CCA seems really alike, components must be really correlated.
set.seed(27) #favorite number
resampling = caret::createDataPartition(data.train.female$cli$Group, list = TRUE, p = 0.2)$Resample1
data.validation.female = lapply(data.train.female, function(x) x[-resampling, ])
data.test.female = lapply(data.train.female, function(x) x[resampling, ])
PLS_female_mRNA_DNAm = rgcca(blocks = list(mRNA = data.train.female$mRNA,
DNAm = data.train.female$DNAm),
connection = matrix(1, 2, 2) - diag(2),
tau = c(1, 1),
scheme = "horst", verbose = F, ncomp = 10)
CCA_female_mRNA_DNAm = rgcca(blocks = list(mRNA = data.validation.female$mRNA,
DNAm = data.validation.female$DNAm),
connection = matrix(1, 2, 2) - diag(2),
tau = c(0, 0),
scheme = "horst", verbose = F, ncomp = 10)
interp_PLS_female_mRNA = apply(PLS_female_mRNA_DNAm$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_PLS_female_DNAm = apply(PLS_female_mRNA_DNAm$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_CCA_female_mRNA = apply(CCA_female_mRNA_DNAm$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[-resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_CCA_female_DNAm = apply(CCA_female_mRNA_DNAm$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female[-resampling, ],
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(2, 2), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_PLS_female_mRNA), title = "PLS - Block mRNA")
drawheatmap(t(interp_PLS_female_DNAm), title = "PLS - Block DNAm")
drawheatmap(t(interp_CCA_female_mRNA), title = "CCA - Block mRNA")
drawheatmap(t(interp_CCA_female_DNAm), title = "CCA - Block DNAm")
Figure below: Correlation between components for CCA between the DNAm and mRNA block for a train set (left) and a test set (right). On the train set, components are perfectly pairwise correlated (and uncorrelated to their unmatching components), though this is not the case at all on the test set.
=> The model is overfitting !!
# a = plot(CCA_female_mRNA_DNAm, type = "samples", comp = 1, bloc = 1, cex = 1.5)
# b = plot(CCA_female_mRNA_DNAm, type = "samples", comp = 2, bloc = 1, cex = 1.5)
# c = plot(CCA_female_mRNA_DNAm, type = "samples", comp = 3, bloc = 1, cex = 1.5)
# d = plot(CCA_female_mRNA_DNAm, type = "samples", comp = 4, bloc = 1, cex = 1.5)
# ggarrange(a, b, c, d , ncol = 2, nrow = 2)
CCA_female_mRNA_DNAm_validation = rgcca_transform(rgcca_res = CCA_female_mRNA_DNAm,
blocks_test = list(mRNA = data.test.female$mRNA,
DNAm = data.test.female$DNAm))
colnames(CCA_female_mRNA_DNAm_validation$mRNA) =
colnames(CCA_female_mRNA_DNAm$Y$mRNA) =
paste0("mRNA - comp", 1:dim(CCA_female_mRNA_DNAm_validation$mRNA)[2])
colnames(CCA_female_mRNA_DNAm_validation$DNAm) =
colnames(CCA_female_mRNA_DNAm$Y$DNAm) =
paste0("DNAm - comp", 1:dim(CCA_female_mRNA_DNAm_validation$mRNA)[2])
par(mfrow = c(1, 2), omi = c(0, 0, 0, 0))
corrplot(Reduce(cor, CCA_female_mRNA_DNAm$Y), method = "number", main = "\n\n\n\nTrain Set")
corrplot(Reduce(cor, CCA_female_mRNA_DNAm_validation), method = "number", main = "\n\n\n\nTest Set")
Slides 15 -> 22 of “RGCCA - Theory”.
=> Now let us apply R-CCA to the data-set, with an arbitrary low value of \(\tau\) (0.1) for each block.
Figure below: Correlation between each mRNA component and each DNAm component, for both TRAIN and TEST sets and for CCA/R-CCA/PLS. As mentioned before, CCA suffers from overfitting. PLS and R-CCA also but their first component seems to be spared, which is important as it is the first signal captured, so the main one.
PLS_female_mRNA_DNAm_2 =
rgcca(blocks = list(mRNA = data.validation.female$mRNA,
DNAm = data.validation.female$DNAm),
connection = matrix(1, 2, 2) - diag(2),
tau = c(1, 1),
scheme = "horst", verbose = F, ncomp = 10)
CCA_female_mRNA_DNAm_regu_2 =
rgcca(blocks = list(mRNA = data.validation.female$mRNA,
DNAm = data.validation.female$DNAm),
connection = matrix(1, 2, 2) - diag(2),
tau = c(0.1, 0.1),
scheme = "horst", verbose = F, ncomp = 10)
PLS_female_mRNA_DNAm_validation =
rgcca_transform(rgcca_res = PLS_female_mRNA_DNAm_2,
blocks_test = list(mRNA = data.test.female$mRNA,
DNAm = data.test.female$DNAm))
CCA_female_mRNA_DNAm_validation_regu =
rgcca_transform(rgcca_res = CCA_female_mRNA_DNAm_regu_2,
blocks_test = list(mRNA = data.test.female$mRNA,
DNAm = data.test.female$DNAm))
colnames(PLS_female_mRNA_DNAm_validation$mRNA) =
colnames(PLS_female_mRNA_DNAm_2$Y$mRNA) =
paste0("mRNA - comp", 1:dim(PLS_female_mRNA_DNAm_validation$mRNA)[2])
colnames(PLS_female_mRNA_DNAm_validation$DNAm) =
colnames(PLS_female_mRNA_DNAm_2$Y$DNAm) =
paste0("DNAm - comp", 1:dim(PLS_female_mRNA_DNAm_validation$mRNA)[2])
colnames(CCA_female_mRNA_DNAm_validation_regu$mRNA) =
colnames(CCA_female_mRNA_DNAm_regu_2$Y$mRNA) =
paste0("mRNA - comp", 1:dim(CCA_female_mRNA_DNAm_validation_regu$mRNA)[2])
colnames(CCA_female_mRNA_DNAm_validation_regu$DNAm) =
colnames(CCA_female_mRNA_DNAm_regu_2$Y$DNAm) =
paste0("DNAm - comp", 1:dim(CCA_female_mRNA_DNAm_validation_regu$mRNA)[2])
par(mfrow = c(3, 2), omi = c(0, 0, 0, 0))
corrplot(Reduce(cor, CCA_female_mRNA_DNAm$Y), method = "number", main = "\nTrain Set - CCA")
corrplot(Reduce(cor, CCA_female_mRNA_DNAm_validation), method = "number", main = "\nTest Set - CCA")
corrplot(Reduce(cor, CCA_female_mRNA_DNAm_regu_2$Y), method = "number", main = "\nTrain Set - RCCA")
corrplot(Reduce(cor, CCA_female_mRNA_DNAm_validation_regu), method = "number", main = "\nTest Set - RCCA")
corrplot(Reduce(cor, PLS_female_mRNA_DNAm_2$Y), method = "number", main = "\nTrain Set - PLS")
corrplot(Reduce(cor, PLS_female_mRNA_DNAm_validation), method = "number", main = "\nTest Set - PLS")
Figure below: For both PLS and R-CCA, 10 components are extracted for each block. Similarly to PCA, we can interpret the influence of each computed components:
=> Is there any differences ? -> This is not obvious but as expected, R-CCA leads to components bearing slightly more the same meaning; whereas for PLS we have information specific to each bloc (this is light, but differences are essentially in components 4/5/6)
=> GOOD NEWS: the DNAm now seems to bear a component that explains also the Patient/Control effect. Furthermore, this component seems to be more specific to the group effect than before (associated with only 2 other covariates, whereas before it was 3). Though, it is still only the 7th component (see next figure to see the corresponding % of variance explained by this component).
CCA_female_mRNA_DNAm_regu = rgcca(blocks = list(mRNA = data.train.female$mRNA,
DNAm = data.train.female$DNAm),
connection = matrix(1, 2, 2) - diag(2),
tau = c(0.1, 0.1),
scheme = "horst", verbose = F, ncomp = 10)
interp_CCA_female_mRNA_regu = apply(CCA_female_mRNA_DNAm_regu$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_CCA_female_DNAm_regu = apply(CCA_female_mRNA_DNAm_regu$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(2, 2), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_PLS_female_mRNA), title = "PLS - Block mRNA")
drawheatmap(t(interp_PLS_female_DNAm), title = "PLS - Block DNAm")
drawheatmap(t(interp_CCA_female_mRNA_regu), title = "R-CCA - Block mRNA")
drawheatmap(t(interp_CCA_female_DNAm_regu), title = "R-CCA - Block DNAm")
Figure below: Let us look closer at what is inside component 7. For both methods, we can display the sample space (here component 7 of the mRNA block against component 7 of the DNAM block). We can see that these two components are quite correlated and indeed seems do be catching a group effect. Then we can also display the contribution (block-weight vector) of each variable to the component of its block. For the sake of space, we only display the top 10 contributing variable for each method and each block. This helps interpreting what variables were used to compute these components. Then it is possible to try going deeper into the interpretation. For example “ENSG00000153234” seems interesting (see below information taken from https://www.uniprot.org):
Protein: Nuclear receptor subfamily 4 group A member 2
Gene: NR4A2
Function: Transcriptional regulator which is important for the differentiation and maintenance of meso-diencephalic dopaminergic (mdDA) neurons during development. It is crucial for expression of a set of genes such as SLC6A3, SLC18A2, TH and DRD2 which are essential for development of mdDA neurons (By similarity).
This gene was already present in top genes of PCA.
==> But how can we add the micro-RNA (miRNA) block now ?
a = plot(PLS_female_mRNA_DNAm, type = "samples", response = data.train.female$cli$Group, cex = 1.5, comp = 7) + ylim(c(-1, 1)) + xlim(c(-1.5, 1.5)) + ggtitle("PLS\nSample space")
Slides 23 -> 27 of “RGCCA - Theory”.
Now let us apply RGCCA to our data-set. In order to be as close as possible to the methods above, let us apply it when:
Figure below: Differences: once again RGCCA \(\tau = 0.1\) seems to have components describing the same information across all blocks. This is particularly true for the miRNA blocks that seems to have specific information when \(\tau = 1\) (located mainly on the first components and the covariates describing cell composition).
=> GOOD NEWS : The components mainly linked to the group effect has raised in position (3 for \(\tau = 1\) - 4 and 9 for \(\tau = 0.1\)). This may be interpreted as the fact that this group effect is indeed common to each block and as such, is sooner/better catch by RGCCA. This has to be mitigated by the fact that a quite strong link to covariate “MO” is also catch by this component in the mRNA block. Furthermore see next figure for the pourcentage of explained variance by each component.
set.seed(53)
RGCCA_PLS_female = rgcca(blocks = list(mRNA = data.train.female$mRNA,
miRNA = data.train.female$miRNA,
DNAm = data.train.female$DNAm),
connection = matrix(1, 3, 3) - diag(3),
tau = c(1,1,1), scheme = "horst", verbose = F, ncomp = 10)
interp_RGCCA_PLS_female_mRNA = apply(RGCCA_PLS_female$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_female_miRNA = apply(RGCCA_PLS_female$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_female_DNAm = apply(RGCCA_PLS_female$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
RGCCA_CCA_female = rgcca(blocks = list(mRNA = data.train.female$mRNA,
miRNA = data.train.female$miRNA,
DNAm = data.train.female$DNAm),
connection = matrix(1, 3, 3) - diag(3),
tau = c(0.1,0.1,0.1), scheme = "horst", verbose = F, ncomp = 10, init = "random")
interp_RGCCA_CCA_female_mRNA = apply(RGCCA_CCA_female$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_CCA_female_miRNA = apply(RGCCA_CCA_female$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_CCA_female_DNAm = apply(RGCCA_CCA_female$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(3, 2), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_RGCCA_PLS_female_mRNA), title = "\nRGCCA - tau = 1\nBlock mRNA")
drawheatmap(t(interp_RGCCA_CCA_female_mRNA), title = "\nRGCCA - tau = 0.1\nBlock mRNA")
drawheatmap(t(interp_RGCCA_PLS_female_miRNA), title = "Block miRNA")
drawheatmap(t(interp_RGCCA_CCA_female_miRNA), title = "Block miRNA")
drawheatmap(t(interp_RGCCA_PLS_female_DNAm), title = "Block DNAm")
drawheatmap(t(interp_RGCCA_CCA_female_DNAm), title = "Block DNAm")
Figure below: Though raised in position, the components of interest, i.e. associated with Patient/Control (3 for \(\tau = 1\) - 4 and 9 for \(\tau = 0.1\)), still explain a low percentage of the total variance.
a = plot(RGCCA_PLS_female, type = "ave", comp = 4) + ggtitle("RGCCA - tau = 1\nAverage Variance Explained")
Until now, all blocks were connected (a.k.a. Complete design). It might be too constraining. Indeed, each block might bear the group effect but not in the same way. Let us try a second design matrix where all blocks are connected to the mRNA bock only (referred to as a Hierarchical design).
Figure below: Interpretation of the components for each block in the Hierarchical (left) or Complete (right) design. Observations:
RGCCA_PLS_H_female = rgcca(blocks = list(mRNA = data.train.female$mRNA,
miRNA = data.train.female$miRNA,
DNAm = data.train.female$DNAm),
connection = matrix(c(0, 1, 1, 1, 0, 0, 1, 0, 0), ncol = 3, byrow = T),
tau = c(1,1,1), scheme = "horst", verbose = F, ncomp = 10)
interp_RGCCA_PLS_H_female_mRNA = apply(RGCCA_PLS_H_female$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_H_female_miRNA = apply(RGCCA_PLS_H_female$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_H_female_DNAm = apply(RGCCA_PLS_H_female$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
RGCCA_PLS_C_female = rgcca(blocks = list(mRNA = data.train.female$mRNA,
miRNA = data.train.female$miRNA,
DNAm = data.train.female$DNAm),
connection = matrix(1, 3, 3) - diag(3),
tau = c(1,1,1), scheme = "horst", verbose = F, ncomp = 10)
interp_RGCCA_PLS_C_female_mRNA = apply(RGCCA_PLS_C_female$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_C_female_miRNA = apply(RGCCA_PLS_C_female$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_C_female_DNAm = apply(RGCCA_PLS_C_female$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(3, 2), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_RGCCA_PLS_H_female_mRNA), title = "\nRGCCA - tau = 1 - Hierarchical\nBlock mRNA")
drawheatmap(t(interp_RGCCA_PLS_C_female_mRNA), title = "\nRGCCA - tau = 1 - Complete\nBlock mRNA")
drawheatmap(t(interp_RGCCA_PLS_H_female_miRNA), title = "Block miRNA")
drawheatmap(t(interp_RGCCA_PLS_C_female_miRNA), title = "Block miRNA")
drawheatmap(t(interp_RGCCA_PLS_H_female_DNAm), title = "Block DNAm")
drawheatmap(t(interp_RGCCA_PLS_C_female_DNAm), title = "Block DNAm")
Lets us compare two schemes:
The particular interest of the Factorial scheme is that it allows to catch high effect that are positively or negatively correlated between blocks (Horst allows only positively). It also have the effect of decreasing low links (covariance below 1) and increasing higher ones (covariance above 1).
Figure below: Interpretation of the components for each block with the Horst (left) or the Factorial (right) scheme. Observations:
RGCCA_PLS_C_Horst_female = rgcca(blocks = list(mRNA = data.train.female$mRNA,
miRNA = data.train.female$miRNA,
DNAm = data.train.female$DNAm),
connection = matrix(1, 3, 3) - diag(3),
tau = c(1,1,1), scheme = "horst", verbose = F, ncomp = 10)
interp_RGCCA_PLS_C_Horst_female_mRNA = apply(RGCCA_PLS_C_Horst_female$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_C_Horst_female_miRNA = apply(RGCCA_PLS_C_Horst_female$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_C_Horst_female_DNAm = apply(RGCCA_PLS_C_Horst_female$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
RGCCA_PLS_C_Factorial_female = rgcca(blocks = list(mRNA = data.train.female$mRNA,
miRNA = data.train.female$miRNA,
DNAm = data.train.female$DNAm),
connection = matrix(1, 3, 3) - diag(3),
tau = c(1,1,1), scheme = "factorial", verbose = F, ncomp = 10)
interp_RGCCA_PLS_C_Factorial_female_mRNA = apply(RGCCA_PLS_C_Factorial_female$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_C_Factorial_female_miRNA = apply(RGCCA_PLS_C_Factorial_female$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_PLS_C_Factorial_female_DNAm = apply(RGCCA_PLS_C_Factorial_female$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(3, 2), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_RGCCA_PLS_C_Horst_female_mRNA), title = "\nRGCCA - tau = 1 - Complete - Horst\nBlock mRNA")
drawheatmap(t(interp_RGCCA_PLS_C_Factorial_female_mRNA), title = "\nRGCCA - tau = 1 - Complete - Factorial\nBlock mRNA")
drawheatmap(t(interp_RGCCA_PLS_C_Horst_female_miRNA), title = "Block miRNA")
drawheatmap(t(interp_RGCCA_PLS_C_Factorial_female_miRNA), title = "Block miRNA")
drawheatmap(t(interp_RGCCA_PLS_C_Horst_female_DNAm), title = "Block DNAm")
drawheatmap(t(interp_RGCCA_PLS_C_Factorial_female_DNAm), title = "Block DNAm")
Slides 28 -> 29 of “RGCCA - Theory”.
Now, let us apply this permutation to our situation in order to tune the parameter \(\tau\) (the RGCCA package also allows to optimize for the number of component). We set the RGCCA model to the Complete design with the Factorial scheme (Horst scheme was giving better results but we believe in the necessity to retrieve both highly positively or negatively correlated events). We choose to explore 20 values for \(\tau\) in the interval [1e-4; 0.5], sampled in a logspace manner. For the sake of computational time, each block and each component are associated with the same value of tau (though one value of \(\tau\) could have been specified for each combination). 50 permutations are performed per tested parameters.
Figure below: Best set of parameters is:
mRNA miRNA DNAm
0.05315656 0.05315656 0.05315656
set_tau = sapply(data.train.female[1:3], function(x) pracma:::logseq(x1 = 1e-4, x2 = 0.5, n = 20))
# # Not run
# set.seed(53)
# permutation_RGCCA_C_Factorial_female = rgcca_permutation(blocks = list(mRNA = data.train.female$mRNA,
# miRNA = data.train.female$miRNA,
# DNAm = data.train.female$DNAm),
# connection = matrix(c(0, 1, 1, 1, 0, 1, 1, 1, 0), ncol = 3, byrow = T),
# par_value = set_tau,
# n_perms = 50, n_cores = 24, ncomp = 10, init = "random")
#
# saveRDS(object = permutation_RGCCA_C_Factorial_female, "pre_computed_results/permutation_RGCCA_C_Factorial_female_V3.rds")
permutation_RGCCA_C_Factorial_female = readRDS("pre_computed_results/permutation_RGCCA_C_Factorial_female_V3.rds")
plot(permutation_RGCCA_C_Factorial_female)
Now that we optimized the set of parameters for our model, we wish to evaluate its robustness. This is going to be performed through bootstrapping.
Slides 30 -> 31 of “RGCCA - Theory”.
Figure below: 2000 bootstrap samples were generated. We are looking at the first 100 weights (black dots: they were estimated on the regular cohort, without resampling) with their estimated Confidence Interval (grey lines) for components 1/3/4 (the last two because it is where the association with the group effect is supposed to be).
=> Nothing is significant at all.
# NOT RUN
RGCCA_C_Factorial_female_bestPerm = rgcca(blocks = list(mRNA = data.train.female$mRNA,
miRNA = data.train.female$miRNA,
DNAm = data.train.female$DNAm),
connection = matrix(c(0, 1, 1, 1, 0, 1, 1, 1, 0), ncol = 3, byrow = T),
tau = permutation_RGCCA_C_Factorial_female$best_params, scheme = "factorial",
verbose = F, ncomp = 10, init = "random")
# set.seed(6464)
# bootstrap_RGCCA_C_Factorial_female_bestPerm = rgcca_bootstrap(RGCCA_C_Factorial_female_bestPerm , n_cores = 24, n_boot = 2000)
#
# saveRDS(object = bootstrap_RGCCA_C_Factorial_female_bestPerm, file = "pre_computed_results/bootstrap_RGCCA_C_Factorial_female_bestPerm_V3.rds")
bootstrap_RGCCA_C_Factorial_female_bestPerm = readRDS("pre_computed_results/bootstrap_RGCCA_C_Factorial_female_bestPerm_light.rds")
a = plot(bootstrap_RGCCA_C_Factorial_female_bestPerm, n = 100, bloc = 1, comp = 1)
=> What can we do to get out of this situation. Actually a lot of things:
Here, our goal is to predict for Patient vs. Control, so we will supervise accordingly.
Slides 32 -> 35 of “RGCCA - Theory”.
In such situation, it is required to perform Cross-Validation in order to tune parameters to obtain the model with the best generalization capacity.
We optimize for the parameter \(\tau\) (for the same set of values as before). The prediction model used in the second step is a Linear Discriminant Analysis (LDA).
We still select the Factorial scheme. Here, as we supervise, we are not forced to extract a high number of components and we can limit ourselves to 3 components per bloc (it is also less time consuming). Cross Validation was performed on the F1-score with 5 repetitions of a 5-folds Cross-Validation (CV). Furthermore, before performing the CV, we extracted 20% of the data-set as a test set to validate our model optimized on the train set (optimized with CV). The resampling to separated train and test was performed with stratification (in order for each sample to be representative of the original proportion of patients vs controls).
Figure below: the best set of extracted parameters is:
mRNA miRNA DNAm group
1e-04 1e-04 1e-04 0e+00
=> Actually this set is on the borders of the grid… It would be interesting to explore lower values.
resampling = readRDS(file = "pre_computed_results/resampling4CV")
data.validation.female = lapply(data.train.female, function(x) x[-resampling, ])
data.test.female = lapply(data.train.female, function(x) x[resampling, ])
# # NOT RUN
# set.seed(53)
# crossValidation_RGCCA_HiearchicalSupervised_Factorial_female <- rgcca_cv(blocks = list(mRNA = data.validation.female$mRNA,
# miRNA = data.validation.female$miRNA,
# DNAm = data.validation.female$DNAm,
# group = data.validation.female$cli$Group),
# response = 4, method = "rgcca",
# par_type = "tau", par_value = cbind(set_tau, 0),
# prediction_model = "lda", #caret::modelLookup()
# metric = "F1",
# k=5, n_run = 5,
# verbose = TRUE, n_cores = 24, ncomp = c(3, 3, 3, 1))
#
# saveRDS(object = crossValidation_RGCCA_HiearchicalSupervised_Factorial_female, "pre_computed_results/crossValidation_RGCCA_HiearchicalSupervised_Factorial_female_V3.rds")
crossValidation_RGCCA_HiearchicalSupervised_Factorial_female = readRDS("pre_computed_results/crossValidation_RGCCA_HiearchicalSupervised_Factorial_female_V3.rds")
plot(crossValidation_RGCCA_HiearchicalSupervised_Factorial_female)
The Best model is learnt and then applied on the test set. The F1 score on the test set is rather high, “0.7272727”, though, overfitting may still be present as the F1 score on the train set is “1”.
RGCCA_HiearchicalSupervised_Factorial_female_bestCV = rgcca(crossValidation_RGCCA_HiearchicalSupervised_Factorial_female)
RGCCA_HiearchicalSupervised_Factorial_female_bestCV$call$init = "random"
predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV =
rgcca_predict(rgcca_res = RGCCA_HiearchicalSupervised_Factorial_female_bestCV,
blocks_test = list(mRNA = data.test.female$mRNA,
miRNA = data.test.female$miRNA,
DNAm = data.test.female$DNAm,
group = data.test.female$cli$Group),
prediction_model = "lda", metric = "F1")
print(predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$metric)
## $train
## group
## Accuracy 1.0000000
## Kappa 1.0000000
## F1 1.0000000
## Sensitivity 1.0000000
## Specificity 1.0000000
## Pos_Pred_Value 1.0000000
## Neg_Pred_Value 1.0000000
## Precision 1.0000000
## Recall 1.0000000
## Detection_Rate 0.5797101
## Balanced_Accuracy 1.0000000
##
## $test
## group
## Accuracy 0.6666667
## Kappa 0.3076923
## F1 0.7272727
## Sensitivity 0.8000000
## Specificity 0.5000000
## Pos_Pred_Value 0.6666667
## Neg_Pred_Value 0.6666667
## Precision 0.6666667
## Recall 0.8000000
## Detection_Rate 0.4444444
## Balanced_Accuracy 0.6500000
Figure below: interpretation of the components for each bloc and on TRAIN or TEST sets. As expected by the supervision effect, the first component is highly linked to the group effect and this is the only effect explained. On the TEST set, this illustrates the overfitting. Though, as expected by the fact that the F1 score was high on the test set, there are still some relevant information extracted from the TRAIN set. The learnt model still seems to bear a certain generalization power (not as strong as expected on the TRAIN set obviously). Especially for the mRNA bloc, a little for the miRNA block and not at all for the DNAm block.
interp_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_mRNA =
apply(RGCCA_HiearchicalSupervised_Factorial_female_bestCV$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[-resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_miRNA =
apply(RGCCA_HiearchicalSupervised_Factorial_female_bestCV$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[-resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_DNAm =
apply(RGCCA_HiearchicalSupervised_Factorial_female_bestCV$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female[-resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_mRNA =
apply(predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$projection$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_miRNA =
apply(predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$projection$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_DNAm =
apply(predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$projection$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female[resampling, ],
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(3, 2), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_mRNA), title = "\nTRAIN SET - RGCCA - tau Opt CV - Complete - Factorial\nBlock mRNA")
drawheatmap(t(interp_predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_mRNA), title = "\nTEST SET - RGCCA - tau Opt CV - Complete - Factorial\nBlock mRNA")
drawheatmap(t(interp_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_miRNA), title = "Block miRNA")
drawheatmap(t(interp_predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_miRNA), title = "Block miRNA")
drawheatmap(t(interp_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_DNAm), title = "Block DNAm")
drawheatmap(t(interp_predictOnTest_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_DNAm), title = "Block DNAm")
We can still perform a bootstrap procedure in order to evaluate the robustness of the block-weight vectors and identify robust variables contributing to each component for each block.
# # NOT RUN
# set.seed(53)
# bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV = rgcca_bootstrap(RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n_cores = 24, n_boot = 2000)
#
# saveRDS(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, "pre_computed_results/bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_V3.rds")
bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV = readRDS("pre_computed_results/bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV_light.rds")
a = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 1, comp = 1)
b = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 2, comp = 1)
c = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 3, comp = 1)
d = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 1, comp = 2)
e = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 2, comp = 2)
f = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 3, comp = 2)
g = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 1, comp = 3)
h = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 2, comp = 3)
i = plot(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, n = 100, bloc = 3, comp = 3)
Bellow are the significant variables for each bock and each component. We can now try to perform enrichment analysis on each list associated with each component.
weights_idx = which(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$stats$type == "weights")
signif_weights = which((p.adjust(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$stats$pval[weights_idx], method = "fdr")<=0.05))
print("################# Component 1 #################")
## [1] "################# Component 1 #################"
idx_comp_1 = which(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$stats$comp == 1)
bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$stats$var[intersect(signif_weights, idx_comp_1)]
## [1] "ENSG00000009790" "ENSG00000013441" "ENSG00000018280" "ENSG00000072952"
## [5] "ENSG00000073756" "ENSG00000092094" "ENSG00000100288" "ENSG00000100445"
## [9] "ENSG00000100764" "ENSG00000102908" "ENSG00000103502" "ENSG00000104960"
## [13] "ENSG00000110344" "ENSG00000115523" "ENSG00000117281" "ENSG00000118960"
## [17] "ENSG00000122435" "ENSG00000122642" "ENSG00000123689" "ENSG00000124102"
## [21] "ENSG00000125347" "ENSG00000131368" "ENSG00000132155" "ENSG00000132522"
## [25] "ENSG00000132825" "ENSG00000135441" "ENSG00000137947" "ENSG00000138829"
## [29] "ENSG00000139631" "ENSG00000139725" "ENSG00000140396" "ENSG00000142396"
## [33] "ENSG00000143933" "ENSG00000144655" "ENSG00000152926" "ENSG00000160321"
## [37] "ENSG00000160613" "ENSG00000161960" "ENSG00000162413" "ENSG00000163154"
## [41] "ENSG00000164620" "ENSG00000165548" "ENSG00000165792" "ENSG00000167216"
## [45] "ENSG00000168970" "ENSG00000169403" "ENSG00000173068" "ENSG00000173114"
## [49] "ENSG00000173334" "ENSG00000173825" "ENSG00000174837" "ENSG00000175538"
## [53] "ENSG00000175793" "ENSG00000178951" "ENSG00000179388" "ENSG00000179909"
## [57] "ENSG00000179933" "ENSG00000180879" "ENSG00000182108" "ENSG00000184602"
## [61] "ENSG00000186130" "ENSG00000186594" "ENSG00000187037" "ENSG00000187808"
## [65] "ENSG00000188186" "ENSG00000196313" "ENSG00000204261" "ENSG00000210082"
## [69] "ENSG00000211459" "ENSG00000213190" "ENSG00000213462" "ENSG00000213983"
## [73] "ENSG00000223551" "ENSG00000225972" "ENSG00000233230" "ENSG00000240225"
## [77] "ENSG00000242125" "ENSG00000248323" "ENSG00000250510" "ENSG00000253352"
## [81] "ENSG00000257878" "ENSG00000258682" "ENSG00000260007" "ENSG00000269352"
## [85] "ENSG00000269688" "ENSG00000274008" "ENSG00000274213" "hsa-let-7d-3p"
## [89] "hsa-miR-125b-5p" "hsa-miR-140-5p" "hsa-miR-152-3p" "hsa-miR-16-2-3p"
## [93] "hsa-miR-190a-5p" "hsa-miR-19a-3p" "hsa-miR-222-3p" "hsa-miR-223-5p"
## [97] "hsa-miR-374b-5p" "hsa-miR-4286" "hsa-miR-491-5p" "cg24984195"
## [101] "cg22992730" "cg26607620" "cg13462158" "cg20293942"
## [105] "cg18727742" "cg23480021" "cg22117893" "cg05946535"
## [109] "cg01602139" "cg05745656" "cg18349077" "cg22687569"
## [113] "cg08200543" "cg27638615" "cg09766721" "cg07709590"
## [117] "cg16301894" "cg10833299" "cg07275437" "cg03249723"
## [121] "cg10751070" "cg27088726" "cg08506353" "cg19978674"
## [125] "cg04414766" "cg13108341" "cg14768206" "cg07414487"
## [129] "cg22509179" "cg23900144" "cg17134397" "cg22776392"
## [133] "cg17828921" "cg13132497" "cg05280698" "cg06068545"
## [137] "cg02920129" "cg26610739" "cg18891815" "cg03345454"
## [141] "cg17667591" "Patient"
## [1] "################# Component 2 #################"
idx_comp_2 = which(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$stats$comp == 2)
bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$stats$var[intersect(signif_weights, idx_comp_2)]
## [1] "ENSG00000126746" "ENSG00000163154" "Patient"
## [1] "################# Component 3 #################"
idx_comp_3 = which(bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$stats$comp == 3)
bootstrap_RGCCA_HiearchicalSupervised_Factorial_female_bestCV$stats$var[intersect(signif_weights, idx_comp_3)]
## [1] "Patient"
Slides 36 -> 39 of “RGCCA - Theory”.
This time, permutation is applied in order to tune the sparsity parameters s. Similarly as before, we set the SGCCA model to the Complete design with the Factorial scheme. We choose to explore 20 values for “s” in the interval \(\left[\frac{1}{\sqrt{J_j}-1}; 1\right]\), sampled in a logspace manner. For the sake of computational time, each block and each component are associated with the same level of sparsity (i.e. not exactly the same value of “s” but around the same ratio of sparse values; though one value of “s” could have been specified for each combination). 50 permutations are performed per tested parameters.
Figure below: Best set of parameters is:
set_sparsity = sapply(data.train.female[1:3], function(x) pracma:::logseq(x1 = 1/sqrt(ncol(x)-1), x2 = 1, n = 20))
# # Not run
# set.seed(53)
# permutation_SGCCA_C_Factorial_female = rgcca_permutation(blocks = list(mRNA = data.train.female$mRNA,
# miRNA = data.train.female$miRNA,
# DNAm = data.train.female$DNAm),
# connection = matrix(c(0, 1, 1, 1, 0, 1, 1, 1, 0), ncol = 3, byrow = T),
# par_value = set_sparsity, par_type = "sparsity",
# n_perms = 50, n_cores = 36, ncomp = 10, init = "random")
#
# saveRDS(object = permutation_SGCCA_C_Factorial_female, "permutation_SGCCA_C_Factorial_female_V3.rds")
permutation_SGCCA_C_Factorial_female = readRDS("pre_computed_results/permutation_SGCCA_C_Factorial_female_V3.rds")
permutation_SGCCA_C_Factorial_female$best_params
## mRNA miRNA DNAm
## 0.07426336 0.13492302 0.07426336
rgcca_stability()
: using bootstrapping to assess if
variables are consistenly selected accross resamplingsThis time, bootstrapping is not used to estimated a confidence interval for each variable, but rather to assess if some of them are consistently selected across resamplings.
Figure below: 2000 bootstrap samples were generated. For each bootstrap sample the Variable Importance in Projection (VIP) is computed per variable and averaged across re-samples. Per block, only a certain percentage of variables are kept, in decreasing order of their averaged VIP. This percentage is taken as the proportion of selected variables per block (averaged across components) in the original model with the optimized (by permutation) sparse parameters. We are looking at the weights (for components 1/3/4) per block of all kept variables according to this procedure. These weights comes also from this original model.
=> Limits: this procedure allows to identify the most stables variables. It does not guarantee that they are stable.
# # NOT RUN
# SGCCA_C_Factorial_female_bestPerm = rgcca(blocks = list(mRNA = data.train.female$mRNA,
# miRNA = data.train.female$miRNA,
# DNAm = data.train.female$DNAm),
# connection = matrix(c(0, 1, 1, 1, 0, 1, 1, 1, 0), ncol = 3, byrow = T),
# sparsity = permutation_SGCCA_C_Factorial_female$best_params, scheme = "factorial",
# verbose = F, ncomp = 10, init = "random")
# set.seed(6464)
# stability_SGCCA_C_Factorial_female_bestPerm = rgcca_stability(SGCCA_C_Factorial_female_bestPerm , n_cores = 36, n_boot = 2000)
#
# saveRDS(object = stability_SGCCA_C_Factorial_female_bestPerm, file = "stability_SGCCA_C_Factorial_female_bestPerm_V3.rds")
stability_SGCCA_C_Factorial_female_bestPerm = readRDS("pre_computed_results/stability_SGCCA_C_Factorial_female_bestPerm_light.rds")
a = plot(stability_SGCCA_C_Factorial_female_bestPerm, n = 100, bloc = 1, comp = 1)
The variable selection step forbid to estimate confidence intervals
for each variable. Indeed, in the previous resampling procedure, each
SGCCA model is composed of a different set of variables. Thus
aggregating their weights across resamplings is not advised. Here, the
goal is to estimate confidence intervals in a RGCCA model (no variable
selection anymore) comprising only the kept variables according to the
previous procedure with rgcca_stability()
.
Figure below: 2000 bootstrap samples were generated. We are looking at the weights (black dots: they were estimated on the regular cohort, without resampling) of the kept variables with their estimated Confidence Interval (grey lines) for components 1/3/4.
# # Not run
# set.seed(53)
# bootstrap_SGCCA_C_Factorial_female_bestPerm = rgcca_bootstrap(stability_SGCCA_C_Factorial_female_bestPerm , n_cores = 36, n_boot = 2000)
#
# saveRDS(object = bootstrap_SGCCA_C_Factorial_female_bestPerm, file = "bootstrap_SGCCA_C_Factorial_female_bestPerm_V3.rds")
bootstrap_SGCCA_C_Factorial_female_bestPerm = readRDS("pre_computed_results/bootstrap_SGCCA_C_Factorial_female_bestPerm_light.rds")
a = plot(bootstrap_SGCCA_C_Factorial_female_bestPerm, n = 100, bloc = 1, comp = 1)
Bellow are the significant variables for each bock and each
component. This time (thanks to rgcca_stability()
),
significant variables are identify for each block on the first three
components.
==> Let us try to understand what signal was grasp by these components.
weights_idx = which(bootstrap_SGCCA_C_Factorial_female_bestPerm$stats$type == "weights")
signif_weights = which((p.adjust(bootstrap_SGCCA_C_Factorial_female_bestPerm$stats$pval[weights_idx], method = "fdr")<=0.05))
for (comp in 1:10){
print(sprintf("################# Component %i #################", comp))
idx_comp = which(bootstrap_SGCCA_C_Factorial_female_bestPerm$stats$comp == comp)
print(bootstrap_SGCCA_C_Factorial_female_bestPerm$stats$var[intersect(signif_weights, idx_comp)])
}
## [1] "################# Component 1 #################"
## [1] "ENSG00000118900" "ENSG00000148841" "ENSG00000163220" "ENSG00000167434"
## [5] "ENSG00000185862" "ENSG00000137575" "ENSG00000134830" "ENSG00000131355"
## [9] "ENSG00000134686" "hsa-miR-145-5p" "hsa-miR-143-3p" "hsa-miR-338-5p"
## [13] "hsa-miR-2115-3p" "hsa-miR-199b-5p" "cg14516183" "cg19411998"
## [17] "cg25935979" "cg08279189" "cg07945473" "cg11965692"
## [21] "cg19813868" "cg25065997" "cg07305485" "cg02233041"
## [25] "cg15285175"
## [1] "################# Component 2 #################"
## [1] "ENSG00000268205" "ENSG00000260711" "ENSG00000136541" "ENSG00000230590"
## [5] "ENSG00000196329" "ENSG00000260257" "ENSG00000129197" "hsa-miR-1301-3p"
## [9] "hsa-miR-23a-5p" "hsa-miR-150-5p" "hsa-miR-671-3p" "hsa-miR-1275"
## [13] "hsa-miR-20a-5p" "cg06531573" "cg27636669"
## [1] "################# Component 3 #################"
## [1] "ENSG00000204644" "cg15570656" "cg20228636"
## [1] "################# Component 4 #################"
## character(0)
## [1] "################# Component 5 #################"
## character(0)
## [1] "################# Component 6 #################"
## character(0)
## [1] "################# Component 7 #################"
## character(0)
## [1] "################# Component 8 #################"
## character(0)
## [1] "################# Component 9 #################"
## character(0)
## [1] "################# Component 10 #################"
## character(0)
Figure below: interpretation of the components for each block. Unfortunately, the three first components, where significant variables are, mainly care information about blood cell composition and marginally with age (miRNA/DNAm) and batch effects (DNAm). The components link to patient/control (i.e. component 9 for mRNA, 4 for miRNA and 10 for DNAm) do not have significant variables associated with them.
model_stability_SGCCA_C_Factorial_female_bestPerm = rgcca(blocks = list(mRNA = data.train.female$mRNA[, stability_SGCCA_C_Factorial_female_bestPerm$keepVar[[1]]],
miRNA = data.train.female$miRNA[, stability_SGCCA_C_Factorial_female_bestPerm$keepVar[[2]]],
DNAm = data.train.female$DNAm[, stability_SGCCA_C_Factorial_female_bestPerm$keepVar[[3]]]),
connection = matrix(c(0, 1, 1, 1, 0, 1, 1, 1, 0), ncol = 3, byrow = T),
tau = c(1, 1, 1), scheme = "factorial",
verbose = F, ncomp = 10, init = "random")
interp_model_stability_SGCCA_C_Factorial_female_bestPerm_mRNA =
apply(model_stability_SGCCA_C_Factorial_female_bestPerm$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_model_stability_SGCCA_C_Factorial_female_bestPerm_miRNA =
apply(model_stability_SGCCA_C_Factorial_female_bestPerm$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
interp_model_stability_SGCCA_C_Factorial_female_bestPerm_DNAm =
apply(model_stability_SGCCA_C_Factorial_female_bestPerm$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female,
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(3, 1), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_model_stability_SGCCA_C_Factorial_female_bestPerm_mRNA), title = "SGCCA - Sparse parameters Opt Perm - Complete - Factorial\nBlock mRNA")
drawheatmap(t(interp_model_stability_SGCCA_C_Factorial_female_bestPerm_miRNA), title = "Block miRNA")
drawheatmap(t(interp_model_stability_SGCCA_C_Factorial_female_bestPerm_DNAm), title = "Block DNAm")
Here, our goal is to predict for Patient vs. Control, so we will supervise accordingly.
Similarly as before, we perform Cross-Validation in order to tune the sparse parameters s (for the same set of values as before) to obtain the model with the best generalization capacity. The prediction model used in the second step is again a Linear Discriminant Analysis (LDA).
We still select the Factorial scheme. As mentioned before, as we supervise, we are not forced to extract a high number of components and we can limit ourselves to 3 components per bloc (it is also less time consuming). Cross Validation was performed on the F1-score with 5 repetitions of a 5-folds Cross-Validation (CV). Furthermore, before performing the CV, we extracted 20% of the data-set as a test set (the exact same resampling as with RGCCA) to validate our model optimized on the train set (optimized with CV). The resampling to separated train and test was performed with stratification (in order for each sample to be representative of the original proportion of patients vs controls).
Figure below: the best set of extracted parameters is (the \(0\) for the “group” block, the response block, means no supervision and \(\tau=0\) to focus more on correlation):
resampling = readRDS(file = "pre_computed_results/resampling4CV")
# # NOT RUN
# set.seed(53)
# crossValidation_SGCCA_HiearchicalSupervised_Factorial_female <- rgcca_cv(blocks = list(mRNA = data.validation.female$mRNA,
# miRNA = data.validation.female$miRNA,
# DNAm = data.validation.female$DNAm,
# group = data.validation.female$cli$Group),
# response = 4, method = "sgcca",
# par_type = "sparsity", par_value = cbind(set_sparsity, 1),
# prediction_model = "lda", #caret::modelLookup()
# metric = "F1",
# k=5, n_run = 5,
# verbose = TRUE, n_cores = 36, ncomp = c(3, 3, 3, 1))
#
# saveRDS(object = crossValidation_SGCCA_HiearchicalSupervised_Factorial_female, "crossValidation_SGCCA_HiearchicalSupervised_Factorial_female_V3.rds")
crossValidation_SGCCA_HiearchicalSupervised_Factorial_female = readRDS("pre_computed_results/crossValidation_SGCCA_HiearchicalSupervised_Factorial_female_V3.rds")
crossValidation_SGCCA_HiearchicalSupervised_Factorial_female$best_params
## mRNA miRNA DNAm group
## 0.03336726 0.07284843 0.03336726 0.00000000
The Best model is learnt and then applied on the test set. The F1 score on the test set is rather high, “0.7” (against “0.7272727” with RGCCA), though, overfitting may still be present as the F1 score on the train set is “1” (even though this time, the selected set of parameters is not on the border of the interval).
SGCCA_HiearchicalSupervised_Factorial_female_bestCV = rgcca(crossValidation_SGCCA_HiearchicalSupervised_Factorial_female)
SGCCA_HiearchicalSupervised_Factorial_female_bestCV$call$init = "random"
predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV =
rgcca_predict(rgcca_res = SGCCA_HiearchicalSupervised_Factorial_female_bestCV,
blocks_test = list(mRNA = data.test.female$mRNA,
miRNA = data.test.female$miRNA,
DNAm = data.test.female$DNAm,
group = data.test.female$cli$Group),
prediction_model = "lda", metric = "F1")
print(predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV$metric)
## $train
## group
## Accuracy 1.0000000
## Kappa 1.0000000
## F1 1.0000000
## Sensitivity 1.0000000
## Specificity 1.0000000
## Pos_Pred_Value 1.0000000
## Neg_Pred_Value 1.0000000
## Precision 1.0000000
## Recall 1.0000000
## Detection_Rate 0.5797101
## Balanced_Accuracy 1.0000000
##
## $test
## group
## Accuracy 0.6666667
## Kappa 0.3250000
## F1 0.7000000
## Sensitivity 0.7000000
## Specificity 0.6250000
## Pos_Pred_Value 0.7000000
## Neg_Pred_Value 0.6250000
## Precision 0.7000000
## Recall 0.7000000
## Detection_Rate 0.3888889
## Balanced_Accuracy 0.6625000
Figure below: interpretation of the components for each bloc and on TRAIN or TEST sets. On the TRAIN set, the main caught affect is the group effect, as expected, though more residual effects are also present in the first component for example (Age for DNAm and Slide for miRNA… though a batch effect for DNAm) whereas none were present for this component with RGCCA. This might be linked to the fact that with SGCCA, \(\tau\) is forced to be equal to \(1\), thus less focusing on correlation. On the TEST set, this illustrates again the overfitting. This time, the learnt model seems only to bear a generalization power for the mRNA block (not for the miRNA bock anymore).
interp_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_mRNA =
apply(SGCCA_HiearchicalSupervised_Factorial_female_bestCV$Y$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[-resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_miRNA =
apply(SGCCA_HiearchicalSupervised_Factorial_female_bestCV$Y$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[-resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_DNAm =
apply(SGCCA_HiearchicalSupervised_Factorial_female_bestCV$Y$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female[-resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_mRNA =
apply(predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV$projection$mRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_miRNA =
apply(predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV$projection$miRNA, 2,
function(x)
sapply(DNAm_covariates_explored_female[resampling, ],
function(y)
check_association_CHAMP(x, y)))
interp_predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_DNAm =
apply(predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV$projection$DNAm, 2,
function(x)
sapply(DNAm_covariates_explored_female[resampling, ],
function(y)
check_association_CHAMP(x, y)))
par(mfrow = c(3, 2), omi = c(0, 0.7, 0, 0))
drawheatmap(t(interp_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_mRNA), title = "\nTRAIN SET - SGCCA - Sparse parameters Opt CV - Hiearchical - Factorial\nBlock mRNA")
drawheatmap(t(interp_predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_mRNA), title = "\nTEST SET - SGCCA - Sparse parameters Opt CV - Hiearchical - Factorial\nBlock mRNA")
drawheatmap(t(interp_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_miRNA), title = "Block miRNA")
drawheatmap(t(interp_predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_miRNA), title = "Block miRNA")
drawheatmap(t(interp_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_DNAm), title = "Block DNAm")
drawheatmap(t(interp_predictOnTest_SGCCA_HiearchicalSupervised_Factorial_female_bestCV_DNAm), title = "Block DNAm")
Once again, it is possible to use bootstrapping to evaluate the stability of the variable selection process.
Figure below: 2000 bootstrap samples were generated. Below are the weights (for components 1/3/4) per block of all kept variables according to the procedure evaluating the stability of the selection process as described in section 5.3.
=> Very few variables were kept: \(4\), \(3\) and \(5\) for mRNA, miRNA and DNAm respectively.
# # NOT RUN
# SGCCA_C_Factorial_female_bestCV = rgcca(blocks = list(mRNA = data.train.female$mRNA,
# miRNA = data.train.female$miRNA,
# DNAm = data.train.female$DNAm,
# group = data.train.female$cli$Group),
# response = 4, method = "sgcca",
# sparsity = crossValidation_SGCCA_HiearchicalSupervised_Factorial_female$best_params, scheme = "factorial",
# verbose = F, ncomp = c(3, 3, 3, 1), init = "random")
# set.seed(6464)
# stability_SGCCA_C_Factorial_female_bestCV = rgcca_stability(SGCCA_C_Factorial_female_bestCV , n_cores = 36, n_boot = 2000)
#
# saveRDS(object = stability_SGCCA_C_Factorial_female_bestCV, file = "stability_SGCCA_C_Factorial_female_bestCV_V3.rds")
stability_SGCCA_C_Factorial_female_bestCV = readRDS("pre_computed_results/stability_SGCCA_C_Factorial_female_bestCV_light.rds")
a = plot(stability_SGCCA_C_Factorial_female_bestCV, n = 100, bloc = 1, comp = 1)
Once again, bootstap can be used to estimate confidence intervals for the most stable variables identified with rgcca_stability().
Figure below: 2000 bootstrap samples were generated. We are looking at the weights (black dots: they were estimated on the regular cohort, without resampling) of the kept variables with their estimated Confidence Interval (grey lines) for components 1/2/3.
# # Not run
# set.seed(53)
# bootstrap_SGCCA_C_Factorial_female_bestCV = rgcca_bootstrap(stability_SGCCA_C_Factorial_female_bestCV , n_cores = 36, n_boot = 2000)
#
# saveRDS(object = bootstrap_SGCCA_C_Factorial_female_bestCV, file = "bootstrap_SGCCA_C_Factorial_female_bestCV_V3.rds")
bootstrap_SGCCA_C_Factorial_female_bestCV = readRDS("pre_computed_results/bootstrap_SGCCA_C_Factorial_female_bestCV_V3.rds")
a = plot(bootstrap_SGCCA_C_Factorial_female_bestCV, n = 100, bloc = 1, comp = 1)
It appears that all stable variables are significant for the first components.
weights_idx = which(bootstrap_SGCCA_C_Factorial_female_bestCV$stats$type == "weights")
signif_weights = which((p.adjust(bootstrap_SGCCA_C_Factorial_female_bestCV$stats$pval[weights_idx], method = "fdr")<=0.05))
for (comp in 1:3){
print(sprintf("################# Component %i #################", comp))
idx_comp = which(bootstrap_SGCCA_C_Factorial_female_bestCV$stats$comp == comp)
print(bootstrap_SGCCA_C_Factorial_female_bestCV$stats$var[intersect(signif_weights, idx_comp)])
}
## [1] "################# Component 1 #################"
## [1] "ENSG00000269688" "ENSG00000173114" "ENSG00000179933" "ENSG00000274008"
## [5] "hsa-miR-15b-5p" "hsa-let-7d-3p" "hsa-miR-144-5p" "cg10833299"
## [9] "cg08506353" "cg23480021" "cg18727742" "cg05946535"
## [13] "Patient"
## [1] "################# Component 2 #################"
## [1] "ENSG00000173114" "Patient"
## [1] "################# Component 3 #################"
## [1] "Patient"
signif_weights_SGCCA_C_Factorial_female_bestCV = unique(bootstrap_SGCCA_C_Factorial_female_bestCV$stats$var[signif_weights])
If we try to intersect these variables with the significant ones identified through bootstrap with RGCCA in a supervised setting, it appears that all mRNA and DNAm variable were already present. However, only one miRNA amongst the three identified as stable was present previously. Sparsity may be more carefully tuned, especially for the miRNA block.
intersect(signif_weights_RGCCA_HiearchicalSupervised_Factorial_female_bestCV, signif_weights_SGCCA_C_Factorial_female_bestCV)
## [1] "ENSG00000173114" "ENSG00000179933" "ENSG00000269688" "ENSG00000274008"
## [5] "hsa-let-7d-3p" "cg18727742" "cg23480021" "cg05946535"
## [9] "cg10833299" "cg08506353" "Patient"
Slides 30 -> ? of “RGCCA - Theory”.