Data Integration and Dimension Reduction with RGCCA

Author

Vincent Guillemot

Before you start

This notebook complements two other workshop files:

  • reminders.qmd introduces the core statistical ideas needed before touching RGCCA,
  • more_theory.qmd gives a compact theory-oriented overview of RGCCA and SGCCA, with parameters and practical links to the package.

Introduction

Goals

The goals of this part of the workshop are to:

  • identify the kinds of biological questions that can be addressed with RGCCA and the kinds that require other methods,
  • implement RGCCA on simple and realistic data integration problems,
  • understand the main parameters of the method,
  • learn how to choose reasonable parameter values,
  • interpret the main graphical outputs,
  • and use sparse extensions to highlight variables of interest.

Why use an R Markdown notebook?

A notebook is portable, keeps code and interpretation in one place, and supports reproducible analyses.

I decided to use the Quarto (qmd) format for the notebooks of the RGCCA workshop, but for all intents and purposes it is almost like an “R Markdown” document.

Types of examples used here

This document relies on two types of examples:

  • small simulated datasets where the correlation structure is fully controlled,
  • and TCGA breast cancer data, which are also used in other workshop sessions.

The big picture

RGCCA is a method for integrating several data blocks measured on the same samples.
Each block can represent a different type of information, such as gene expression, methylation, proteomics, clinical variables, or any other set of features collected on the same individuals.

The goal is to build a few synthetic components per block that:

  • summarize the main variation within each block,
  • capture the relationships between connected blocks,
  • and help identify the variables that drive these shared patterns.

In practice, an RGCCA analysis requires four main ingredients (mapped onto a schematic call after this list):

  • Blocks: the datasets to integrate;
  • A design matrix: which tells the model which blocks are connected;
  • A mode of analysis: exploratory or driven by a response block;
  • Regularization or sparsity parameters: which control stability and variable selection.
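
As a sketch (the block objects and design matrix below are placeholders; the argument names are the ones used throughout this notebook), the four ingredients map onto arguments of rgcca():

Code
# Schematic only -- not run. block1, block2, block3 and C are placeholders.
# fit <- rgcca(
#   blocks     = list(B1 = block1, B2 = block2, B3 = block3),
#   connection = C,          # design matrix: which blocks are connected
#   response   = 3,          # supervised mode: block 3 is the response
#   tau        = c(1, 1, 1)  # regularization; use `sparsity` instead for SGCCA
# )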

A useful way to think about RGCCA is the following ladder (sketched in code after the list):

  • PCA summarizes one block,
  • CCA relates two blocks,
  • RGCCA extends this logic to several connected blocks,
  • SGCCA adds variable selection to focus on the most informative features.
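
These special cases map onto the method argument of rgcca(); a hedged sketch (check ?rgcca for the exact list of supported methods):

Code
# Sketch only -- X and Y are placeholder blocks; not run.
# rgcca(blocks = list(X = X), method = "pca")          # one block: PCA
# rgcca(blocks = list(X = X, Y = Y), method = "cca")   # two blocks: CCA
# rgcca(blocks = blocks, method = "sgcca")             # several blocks, sparse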

At the end of the analysis, the main outputs are:

  • block components for the samples,
  • relationships between blocks,
  • variable loadings or selected features,
  • and graphical displays that help interpret the biological structure of the data.

A quick tour of the package

The RGCCA package is available on CRAN.

install.packages("RGCCA")

A development version with additional methods is available on GitHub. Unless you already feel comfortable with RGCCA, you don’t need to go on GitHub.

# remotes::install_github("rgcca-factory/RGCCA")

The main function is called rgcca(). Before using it on real data, it is worth reading the help page and the vignette.

?RGCCA::rgcca

The examples that we discuss along the way will also be recorded in the shared workshop document.

Below is the example from the manual, on Russett’s data (from “Russett BM. Inequality and Instability: The Relation of Land Tenure to Politics. World Politics. 1964;16(3):442-454. doi:10.2307/2009581”). Run this example to check that everything works!

Code
data(Russett)
blocks <- list(
  agriculture = Russett[, seq(3)],
  industry = Russett[, 4:5],
  politic = Russett[, 6:11]
)

politic <- as.factor(apply(Russett[, 9:11], 1, which.max))

# RGCCA with default values: blocks are fully connected, factorial scheme,
# tau = 1 for all blocks, one component per block.
fit_rgcca <- rgcca(blocks = blocks)

print(fit_rgcca)
Fitted RGCCA model. 
The algorithm converged to a stationnary point after 3 iterations.
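
The fitted object stores the block components in fit_rgcca$Y and the weight vectors in fit_rgcca$a; both slots are used later in this notebook. A quick look:

Code
# Block components (one matrix of sample scores per block)
lapply(fit_rgcca$Y, head, 3)
# Weight vectors defining the components (one matrix per block)
lapply(fit_rgcca$a, head, 3)
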
Code
plot(fit_rgcca, type = "weight", block = 1:3)

Code
plot(fit_rgcca,
  type = "sample", block = 1:2,
  comp = rep(1, 2), resp = politic
)

In a few words, RGCCA is designed for block-structured data:

  • a block is a coherent set of variables,
  • all blocks are measured on the same samples,
  • and blocks can be connected according to the scientific question.

Below is an example of a design with 4 blocks of data (A, B, P, and R)1. The goal of RGCCA is to summarize each block by components (which I denote Y_A, Y_B, Y_P, and Y_R). The links between the blocks represent the “design matrix”, meaning the correlations between components that I want to impose. With this design, I am telling RGCCA to maximize all the correlations between pairs of components except two (A – R and B – P). These links represent the “inner” model.
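
As a concrete illustration, this design can be written as a symmetric 0/1 matrix and passed to rgcca() through its connection argument:

Code
# Design matrix for the A, B, P, R example: 1 = connected, 0 = not.
# All pairs are connected except A - R and B - P.
C <- matrix(
  c(0, 1, 1, 0,
    1, 0, 0, 1,
    1, 0, 0, 1,
    0, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "P", "R"), c("A", "B", "P", "R")))
C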

One of the strengths of RGCCA is that it places unsupervised and supervised multiblock approaches in the same general framework. In practice, this is useful when one analysis needs to combine exploratory integration, prediction-oriented modeling, and feature interpretation.

Practical questions to keep in mind

Before running any model, ask the following:

  • What is a biologically meaningful block in my study (e.g. a coherent set of variables measured together)?
  • Are all blocks measured on exactly the same samples?
  • Should all blocks be connected, or only some of them?
  • Is my goal exploration, prediction, feature selection, or a mix of these?
  • Do I need all variables, or do I want a sparse solution?

Types of questions addressed in the workshop

In the shared document, I prepared a table of anonymized biological questions collected from a previous RGCCA workshop.

Your turn: based on these examples, formulate your own data integration question, even if you do not have your data with you. Throughout the workshop, we will return to these questions and connect them to concrete analytical choices.

A useful habit is to express each question in terms of:

  • samples,
  • blocks,
  • target outcome or response, if any,
  • expected relationships between blocks,
  • and expected outputs, such as components, selected variables, or interpretable pathways.

The parent method: Principal Component Analysis

Many questions answered with PCA are exploratory and unsupervised.

Typical question                      PCA output or visualization
Are my data informative?              Scree plot
Is there a batch effect?              Sample map
Are groups of variables correlated?   Correlation circle

PCA is the right starting point because RGCCA also builds low-dimensional components. The main difference is that PCA summarizes one block at a time, whereas RGCCA summarizes several blocks jointly while taking their relationships into account.

A first toy example

Assume that we are performing a (standardized) PCA on a dataset with 3 variables. The only information that I am giving you is the correlation matrix below. Based on this correlation matrix, what is the variance of the first component and what are the weights2?

Code
mini_corrplot(matrix(0.8, 3, 3) + diag(0.2, 3))

Same exercise with this second matrix.

Code
mini_corrplot(matrix(c(1, 0.8, 0, 0.8, 1, 0, 0, 0, 1), 3, 3))

Bonus: in R, use eigen() to decompose these correlation matrices and check that your intuition was correct. PCA is based on the eigenvalue decomposition of the covariance or correlation matrix.

Code
eigen(matrix(0.8, 3, 3) + diag(0.2, 3))
## eigen() decomposition
## $values
## [1] 2.6 0.2 0.2
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.5773503  0.0000000  0.8164966
## [2,] -0.5773503 -0.7071068 -0.4082483
## [3,] -0.5773503  0.7071068 -0.4082483
eigen(matrix(c(1, 0.8, 0, 0.8, 1, 0, 0, 0, 1), 3, 3))
## eigen() decomposition
## $values
## [1] 1.8 1.0 0.2
## 
## $vectors
##           [,1] [,2]       [,3]
## [1,] 0.7071068    0  0.7071068
## [2,] 0.7071068    0 -0.7071068
## [3,] 0.0000000    1  0.0000000

Key intuition to carry into RGCCA

PCA teaches several ideas that remain essential for RGCCA (see the short demo after this list):

  • highly correlated variables tend to move together,
  • a component is a weighted combination (summary) of variables,
  • the first component captures the dominant pattern,
  • variables unrelated to the dominant pattern usually receive little weight,
  • and visualization of samples and variables often reveals structure faster than a table of coefficients.
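
A minimal demo of these ideas (the numbers are arbitrary): three correlated variables plus one unrelated variable, and the weights of the first principal component.

Code
set.seed(1)
demo <- MASS::mvrnorm(
  n = 100, mu = rep(0, 4),
  Sigma = matrix(c(1, 0.8, 0.8, 0,
                   0.8, 1, 0.8, 0,
                   0.8, 0.8, 1, 0,
                   0, 0, 0, 1), 4, 4),
  empirical = TRUE)
colnames(demo) <- c("W1", "W2", "W3", "W4")
# W1-W3 move together and dominate PC1; W4 should get a weight near 0.
round(prcomp(demo, scale. = TRUE)$rotation[, 1], 2)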

Two or three tiny blocks

Let’s simulate some easy-to-handle data, in which we control everything, and ask RGCCA a few questions.

One block with several variables and one block with one variable

Suppose the data are organized into two blocks:

  • a first block with variables X1, X2, X3, and X4,
  • a second block with a single variable V.

Assume the variables follow the correlation structure below.

Code
R_xv <- matrix(
  data = c(
    1, 0.8, 0.8, 0, -0.85,
    0.8, 1, 0.8, 0, -0.85,
    0.8, 0.8, 1, 0, -0.85,
    0, 0, 0, 1, 0,
    -0.85, -0.85, -0.85, 0, 1),
  nrow = 5,
  ncol = 5,
  dimnames = list(
    c("X1", "X2", "X3", "X4", "V"),
    c("X1", "X2", "X3", "X4", "V")))

mini_corrplot(R_xv)

My question is the following: what is the combination of X1, X2, and X3 that is most correlated with V?

RGCCA is well suited to answer this question because, like PCA, it creates components with lower dimension than the original blocks. But, unlike PCA, it tries to balance two goals:

  1. each component should (optimally) summarize its own block,
  2. and the components that we decide to connect should be as strongly related as possible.

Because RGCCA builds components, it also computes variable weights. In this example, we would expect:

  • X1, X2, and X3 to receive large weights of similar magnitude,
  • X4 to receive a weight close to zero,
  • and V to define the second block component by itself.

Let us test our intuition with simulated (multivariate Gaussian) data. To simulate these data, I use MASS::mvrnorm(): this function is the multivariate analogue of rnorm().

Below is a simple example showing the histogram of a random univariate Gaussian sample (\(\mathcal N(0, 1)\)) of size 100.

Code
tibble(x = rnorm(100)) %>%
  ggplot(aes(x)) + 
  geom_histogram(
    bins = 10, 
    fill = "#2b9b81",
    color = "white")

A multivariate sample needs a vector of theoretical means and a theoretical covariance matrix. Here we will generate a sample of size \(n = 27\) from a \(p = 5\)-dimensional multivariate Gaussian with mean 0 and the covariance matrix that we defined previously. Remember your STATS courses: we could denote this multivariate variable as below.

\[ X \sim \mathcal{N} \left( \mu = \left[ \begin{array}{c} 0\\0\\0\\0\\0 \end{array} \right], \Sigma = \left[ \begin{array}{ccccc} 1 & 0.8 & 0.8 & 0 & -0.85\\ 0.8 & 1 & 0.8 & 0 & -0.85\\ 0.8 & 0.8 & 1 & 0 & -0.85\\ 0 & 0 & 0 & 1 & 0\\ -0.85 & -0.85 & -0.85 & 0 & 1 \\ \end{array} \right] \right) \]

Code
set.seed(413)
dat_xv <- MASS::mvrnorm(n = 27, mu = rep(0, 5), Sigma = R_xv, empirical = TRUE)
blocks <- list(X = dat_xv[, 1:4], V = dat_xv[, 5, drop = FALSE])

Running RGCCA on this example requires at least one object: the list of “blocks”. On your real dataset, you would start by defining this list with the blocks of data you want to analyze with RGCCA.

Code
fit.rgcca <- rgcca(blocks = blocks)

We will take a look at the following graphs:

  • AVE plot: how much of the relevant structure is captured by the components?
  • Sample plot: do the two block components show a clear relationship?
  • Loadings plot: which variables drive each block component?

The AVE plot shows that the first component of V explains its own block very well. It’s not surprising, because V contains only one variable.

It also shows that the first component of X explains \(65\%\) of its own block, which is not unexpected: since the component is based on 3 out of the block’s 4 variables, it can explain at most \(75\%\).

Code
plot(fit.rgcca, type = "ave")

Here we see that the correlation between the first components of the two blocks is negative. It is directly related to the correlation of \(-0.85\) between V and each of X1, X2, and X3.

Code
plot(fit.rgcca, type = "samples")

Finally, the loadings are as expected:

  • high, similar loadings for X1, X2, and X3,
  • a loading close to 0 for X4,
  • and 1 for V.
Code
plot(fit.rgcca, type = "loadings")
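
We can also read the weights directly from the fitted object (the a element, also used later in this notebook) to check this numerically:

Code
# Weight vectors per block, rounded: X4 near 0, V exactly 1.
lapply(fit.rgcca$a, round, 2)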

Two simple blocks with two variables

Consider the new correlation matrix below for two blocks with two variables each: block A with A1 and A2, and block B with B1 and B2.

Before running the model, predict which variables will dominate the first component of each block. Then simulate some data, run rgcca(), and test whether your intuition was correct!

Code
R_ab <- matrix(
  data = c(
    1, 0.8, 0.9, 0,
    0.8, 1, 0.9, 0,
    0.9, 0.9, 1, 0,
    0, 0, 0, 1),
  nrow = 4,
  ncol = 4,
  dimnames = list(
    c("A1", "A2", "B1", "B2"),
    c("A1", "A2", "B1", "B2")))

mini_corrplot(R_ab)

Spoiler alert: don’t look below if you want to do it yourself!

Code
set.seed(413)

dat_ab <- MASS::mvrnorm(n = 27, mu = rep(0, 4), Sigma = R_ab, empirical = TRUE)
blocks_ab <- list(A = dat_ab[, 1:2], B = dat_ab[, 3:4])
fit.rgcca_ab <- RGCCA::rgcca(blocks = blocks_ab)

plot(fit.rgcca_ab, type = "ave")

Code
plot(fit.rgcca_ab, type = "samples")

Code
plot(fit.rgcca_ab, type = "loadings")

Three tiny blocks, with one block of noise

Let’s add a block to the mix (containing only noise).

Code
Z <- matrix(
  data = rnorm(27 * 3), 
  nrow = 27, ncol = 3, 
  dimnames = list(1:27, sprintf("Z%i", 1:3)))

Run RGCCA on three blocks, with the default design (all blocks are connected).

Code
blocks_abz <- list(A = dat_ab[, 1:2], B = dat_ab[, 3:4], Z = Z)
fit.rgcca_abz <- RGCCA::rgcca(blocks = blocks_abz)

plot(fit.rgcca_abz, type = "ave")

Code
plot(fit.rgcca_abz, type = "samples", block = 1:2)

Code
plot(fit.rgcca_abz, type = "samples", block = c(1, 3))

Code
plot(fit.rgcca_abz, type = "samples", block = c(2, 3))

Code
plot(fit.rgcca_abz, type = "loadings")

Let’s use a fancy design, where blocks A and Z are disconnected. The difference is not striking, because we see what we want: the strong correlation between A and B.

Code
C <- matrix(
  data = c(
    0, 1, 0,
    1, 0, 1,
    0, 1, 0), 3, 3)

fit.rgcca_abz_fancy <- RGCCA::rgcca(blocks = blocks_abz, connection = C)

plot(fit.rgcca_abz_fancy, type = "ave")

Code
plot(fit.rgcca_abz_fancy, type = "samples", block = 1:2)

Code
plot(fit.rgcca_abz_fancy, type = "samples", block = c(1, 3))

Code
plot(fit.rgcca_abz_fancy, type = "samples", block = c(2, 3))

Code
plot(fit.rgcca_abz_fancy, type = "loadings")

The message of this exercise on the design matrix is the following: if changing the design has a very noticeable effect on the result, you should really dive into the data to understand why.

Understanding the main outputs

A simple rule for interpretation

  • High outer AVE means strong within-block summarization.
  • High inner AVE means strong between-block integration.
  • A useful model usually balances both according to the scientific objective.
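
On a fitted model, these quantities can be inspected directly; assuming the AVE values are stored in the AVE element of the fit, as in recent versions of the package, a quick check:

Code
# Outer AVE (per block) and inner AVE stored in the fitted object
str(fit.rgcca_abz$AVE)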

Three less small blocks and more on the role of design

The design matrix is one of the most important ideas in RGCCA. It encodes which blocks are assumed to be connected.

This is not just a technical detail. It is how the scientific question enters the model.

A larger simulated example

In this example, we have:

  • 150 samples split into two known groups,
  • block A with 7 variables and substantial within-block correlation,
  • block B with 5 variables and substantial within-block correlation,
  • a response block G describing group membership,
  • variables that truly differ between groups,
  • and variables that are correlated but not directly informative about the group structure.

The data are generated by the script scripts/ABG.R and stored in the following files:

  • data/simul/A.csv
  • data/simul/B.csv
  • data/simul/G.csv
Code
A <- read.csv("../data/simul/A.csv")
B <- read.csv("../data/simul/B.csv")
G <- read.csv("../data/simul/G.csv")

After loading the data, take a few minutes to inspect them:

  • what are the dimensions of each block,
  • do the empirical correlations match your expectations,
  • are some variables obviously redundant,
  • do the groups already separate in block-specific exploratory plots?

Two questions naturally arise here:

  • What are the relationships between the blocks?
  • Which variables matter for separating the groups?

Below, the third block is treated as a response block.

A word of caution: when using a response variable, the default design changes. It no longer connects all the blocks together. Instead, each block is connected only to the response.
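
For three blocks with block 3 as the response, the implied design therefore looks like the matrix below (a sketch of the description above; rgcca() builds it for you when you pass response = 3).

Code
# Design implied by response = 3: A and B each connect to G only.
C_resp <- matrix(
  c(0, 0, 1,
    0, 0, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "G"), c("A", "B", "G")))
C_resp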

Code
blocks_abg <- list(A = A, B = B, G = G)

fit_rgcca_abg <- rgcca(
  blocks_abg,
  connection = matrix(1, 3, 3),
  ncomp = c(5, 5, 1),
  response = 3
)

plot(fit_rgcca_abg, "ave")

Code
plot(fit_rgcca_abg, comp = 1)

Code
plot(fit_rgcca_abg, comp = 2)

Code
plot(fit_rgcca_abg, comp = 3)

Code
plot(fit_rgcca_abg, type = "samples", response = G$G)

Code
plot(fit_rgcca_abg, type = "samples", block = 1, response = G$G)

Code
plot(fit_rgcca_abg, type = "samples", block = 2, response = G$G)

Code
plot(fit_rgcca_abg, type = "cor_circle", block = 1) + coord_equal()

Code
plot(fit_rgcca_abg, type = "cor_circle", block = 2) + coord_equal()

Interpreting these plots

  • The global sample map shows whether the integrated structure aligns with the response.
  • Block-specific sample maps show which blocks contribute most to separation.
  • Correlation circles help identify variables associated with each component.
  • Component-specific plots help distinguish the first dominant pattern from later patterns.

Important concept: supervised versus unsupervised RGCCA

When a block is declared as a response block, the analysis becomes partly supervised. This changes the interpretation:

  • the model is no longer only searching for correlated components,
  • it is also encouraged to align the components with the response,
  • which makes the output more useful for discrimination or prediction.

Comparing two design matrices

By default, RGCCA estimates components by maximizing relationships between connected blocks. But this may not match your scientific knowledge.

Suppose we do not want to force a direct link between blocks A and B. We can compare two designs:

  • a fully connected design,
  • and a more selective design in which a noise block is isolated from one of the biological blocks.

To make the contrast visible, we add a noise block Z. The graph below shows that the first component of block Z is not the same when we change the design. Explore the difference between the two results in more detail.

Code
Z <- matrix(
  data = round(rnorm(10 * 150), 1),
  nrow = 150,
  ncol = 10,
  dimnames = list(1:150, sprintf("Z%02i", 1:10))
)

blocks_abz <- list(A = A, B = B, Z = Z)

C_comp <- matrix(
  data = 1, nrow = 3, ncol = 3,
  dimnames = list(names(blocks_abz), names(blocks_abz))
)
diag(C_comp) <- 0

C_smart <- matrix(
  data = c(
    0, 1, 1,
    1, 0, 0,
    1, 0, 0
  ),
  nrow = 3,
  ncol = 3,
  dimnames = list(names(blocks_abz), names(blocks_abz))
)

fit_comp <- rgcca(blocks = blocks_abz, connection = C_comp, ncomp = 5)
fit_smart <- rgcca(blocks = blocks_abz, connection = C_smart, ncomp = 5)

data.frame(
  Z1_comp = fit_comp$Y$Z[, 1],
  Z1_smart = fit_smart$Y$Z[, 1],
  Y = G$G) %>%
  ggplot(aes(Z1_comp, Z1_smart, color = Y)) +
  geom_point()

RGCCA is not only about extracting components; it is also about encoding prior knowledge. A fully connected design can sometimes create relationships that are mathematically convenient but biologically unconvincing. A more selective design may produce components that are less global but more faithful to the study design.

Real data example: TCGA-BRCA

Now that we have explored key RGCCA features on simulated data, we move to real breast cancer data from The Cancer Genome Atlas.

The blocks include:

  • clinical data,
  • gene expression,
  • microRNA abundance,
  • and protein measurements.
Code
cli <- read.csv("../data/TCGA/clinical.csv", row.names = 1)
mrna <- t(read.csv("../data/TCGA/mrna.csv", row.names = 1))
colnames(mrna) <- HGNChelper::checkGeneSymbols(
  colnames(mrna), unmapped.as.na = FALSE)$Suggested.Symbol
mirna <- t(read.csv("../data/TCGA/mirna.csv", row.names = 1))
prot <- t(read.csv("../data/TCGA/protein.csv", row.names = 1))

Exploration and quick quality control

Before integrating the blocks, inspect each one separately.

Questions to ask:

  • Are sample identifiers aligned across blocks?
  • Are there outliers or distributional problems?
  • Are some groups strongly unbalanced?
  • Do we need filtering, transformation, or scaling?
  • Are there missing values that need to be handled first?

Spoiler alert: there are a few suggestions below, so don’t look if you want to do it yourself. We will put the best ideas in the shared document.

First we check that rownames are the same in all 4 blocks.

Code
table(rownames(cli) == rownames(mrna))

TRUE 
 379 
Code
table(rownames(cli) == rownames(mirna))

TRUE 
 379 
Code
table(rownames(cli) == rownames(prot))

TRUE 
 379 

To evaluate how balanced the groups are, we count the number of patients per subtype.

Code
cli %>% count(subtype)
  subtype   n
1   Basal  76
2    Her2  38
3    LumA 188
4    LumB  77

Are there any missing values?

Code
sum(is.na(cli))
[1] 0
Code
sum(is.na(mrna))
[1] 0
Code
sum(is.na(mirna))
[1] 0
Code
sum(is.na(prot))
[1] 0

Below is an example plot showing gender, age, and subtype from the clinical block.

Code
cli %>%
  ggplot(aes(subtype, patient.age_at_initial_pathologic_diagnosis)) +
  geom_violin(aes(color = subtype, fill = subtype), trim = FALSE, alpha = 0.5) +
  geom_beeswarm(aes(shape = patient.gender)) +
  # ggpubr::stat_pwc(hide.ns = TRUE, y = 100) +
  scale_shape_manual(values = c(4, 10)) +
  labs(
    x = "Cancer Subtype",
    y = "Age at Initial Diagnosis",
    fill = "",
    color = "",
    shape = "Gender"
  )

Below is the histogram of the values in each of the blocks.

Code
g_hist_mrna <- as_tibble(scale(mrna, scale = TRUE)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 50, color = "white") +
  labs(x = "Gene Expression")

g_hist_mirna <- as_tibble(scale(mirna, scale = TRUE)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 50, color = "white") +
  labs(x = "miRNA Expression")

g_hist_prot <- as_tibble(scale(prot, scale = TRUE)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 50, color = "white") +
  labs(x = "Protein")

(g_hist_mrna + g_hist_mirna) /
  (g_hist_prot + plot_spacer())

First multiblock model

The blocks are combined in a list, like earlier.

Code
blocks_tcga <- list(mRNA = mrna, miRNA = mirna, PROT = prot, Subtype = cli$subtype)

We first ask for a model predicting the subtype from the information in the mRNA, miRNA, and PROT blocks. Beware that with this model, we do not ask for the shared structure between the omics blocks. Nonetheless, we get a very interesting structure in the component space, with a clear (though not perfect) separation between the groups.

Code
fit_rgcca_tcga <- rgcca(blocks_tcga, response = 4, ncomp = c(5, 5, 5, 1))
plot(fit_rgcca_tcga, type = "ave")

Code
plot(fit_rgcca_tcga, type = "samples", block = 1)

Code
plot(fit_rgcca_tcga, type = "samples", block = 1, comp = 3:4)

Code
plot(fit_rgcca_tcga, type = "samples", block = 1, comp = 4:5)

Code
plot(fit_rgcca_tcga, type = "samples", block = 2)

Code
plot(fit_rgcca_tcga, type = "samples", block = 2, comp = 3:4)

Code
plot(fit_rgcca_tcga, type = "samples", block = 3)

Code
plot(fit_rgcca_tcga, type = "samples", block = 3, comp = 3:4)

Why preprocessing matters so much here

For omics data, preprocessing is often as important as the model itself. In practice, you may need:

  • feature filtering to remove near-constant variables,
  • log transformation or variance stabilization,
  • sample matching across all blocks,
  • scaling so that one block does not dominate only because of its measurement scale,
  • and possibly batch correction before integration.

If these steps are ignored, the extracted components can mostly reflect technical structure rather than biology.
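
As a hedged sketch (the cutoff and the choice to rescale are assumptions that depend on how your data were produced), a minimal preprocessing pass could look like this:

Code
# Hypothetical preprocessing sketch -- adapt to your own data.
# 1. Drop near-constant features (arbitrary cutoff).
keep <- apply(mrna, 2, sd) > 1e-3
mrna_filtered <- mrna[, keep]
# 2. Standardize so that no block dominates by measurement scale alone
#    (note that rgcca() also scales blocks by default; check ?rgcca).
mrna_scaled <- scale(mrna_filtered)
dim(mrna_scaled)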

From RGCCA to SGCCA

When the number of variables is (very) large, a dense solution can be difficult to interpret. This is where sparse RGCCA, usually referred to as SGCCA, becomes very useful.

The goal is not only to build integrated components, but also to retain a reduced subset of variables that contribute the most.

In SGCCA, there is a “sparsity” parameter that is directly linked to the number of zeros that we want to impose on the weights: a sparsity parameter close to 0 means that we want many weights to be null, while a sparsity parameter close to 1 means that we do not want the weights to be forced towards zero.
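
To see this concretely, you can fit SGCCA at a small sparsity value and count the non-zero weights (the weight vectors live in the a element of the fit, as used later in this notebook); the value 0.2 below is arbitrary.

Code
fit_sgcca_s02 <- rgcca(
  blocks = blocks_tcga, response = 4,
  method = "sgcca", sparsity = c(0.2, 0.2, 0.2, 1))
# Number of mRNA variables with a non-zero weight on each component
colSums(fit_sgcca_s02$a$mRNA != 0)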

Tuning sparsity by cross-validation

We use cross-validation to estimate the optimal value of the sparsity parameter. It consists in repeatedly splitting the data at random into training and test sets, so that different sparsity values can be compared fairly.

The result, obtained by plotting the output of rgcca_cv(), is shown below.

Code
set.seed(84251)

res.rgcca.cv <- rgcca_cv(
  blocks = blocks_tcga,
  ncomp = 3,
  method = "sgcca",
  prediction_model = "lda",
  metric = "Accuracy",
  response = 4,
  par_type = "sparsity",
  par_value = cbind(
    mRNA = seq(0.1, 1, by = 0.1),
    miRNA = seq(0.1, 1, by = 0.1),
    PROT = seq(0.1, 1, by = 0.1),
    Subtype = 1
  )
)

plot(res.rgcca.cv)

Fit a sparse model using the selected values

Code
fit_sgcca_tcga_best <- rgcca(
  blocks = blocks_tcga,
  response = 4,
  ncomp = c(3, 3, 3, 1),
  method = "sgcca",
  sparsity = c(0.4, 0.4, 0.4, 1)
)

plot(fit_sgcca_tcga_best, type = "samples", block = 1)

Code
plot(fit_sgcca_tcga_best, type = "samples", block = 2)

Code
plot(fit_sgcca_tcga_best, type = "samples", block = 3)

Code
plot(fit_sgcca_tcga_best, type = "samples", block = c(1, 2))

Code
plot(fit_sgcca_tcga_best, type = "samples", block = c(1, 3))

Code
plot(fit_sgcca_tcga_best, type = "ave")

How to interpret sparse results

A sparse solution is especially useful when you want:

  • shorter lists of candidate biomarkers,
  • clearer biological interpretation,
  • more focused downstream enrichment analysis,
  • or a predictive model that is easier to communicate.

But sparsity also introduces choices and trade-offs:

  • too much sparsity can miss relevant variables,
  • too little sparsity can reduce interpretability,
  • and the best tuning often depends on the downstream goal.

Downstream interpretation with pathway analysis

A common next step is to use the weights from one block to rank variables and perform enrichment analysis.

Code
annot_table <- read.csv("../data/TCGA/annotation_table_mrna.csv", row.names = 1)
kegg_pathways <- readRDS("../data/TCGA/kegg_pathways.rds")

a1 <- fit_sgcca_tcga_best$a$mRNA[, 1]
all(annot_table$symbols_new == gsub("^mRNA_", "", names(a1)))
[1] TRUE
Code
names(a1) <- annot_table$entrez


kegg_hsa_name <- kegg_pathways %>%
  dplyr::group_by(pathway_id, pathway_name) %>%
  summarize(n = length(entrez_gene))

# sum(names(a1) %in% kegg_pathways$entrez_gene)

fgsea_res <- fgsea::fgseaMultilevel(
  pathways = split(kegg_pathways$entrez_gene, kegg_pathways$pathway_id),
  stats = a1^2,
  minSize = 15,
  maxSize = 500,
  eps = 0
)

fgsea_res <- fgsea_res |>
  left_join(kegg_hsa_name, by = c("pathway" = "pathway_id")) %>%
  relocate(pathway_name, .after = 1)

fgsea_res %>% 
  filter(padj <= 0.05) %>%
  ggplot(aes(x = padj, y = reorder(pathway_name, padj))) +
  scale_y_discrete(labels = scales::label_wrap(20)) +
  geom_point(aes(size = n)) + 
  labs(x = "Adjusted p-value", y = "")

RGCCA and SGCCA help identify multiblock structure, but biological interpretation usually requires one more layer. Pathway analysis, gene-set analysis, and annotation-driven summaries help connect components back to mechanisms.

Common pitfalls and good habits

Typical pitfalls

  • Treating all blocks as equally informative without checking quality.
  • Connecting all blocks by default without biological justification.
  • Using too many components and over-interpreting late components.
  • Ignoring scaling and preprocessing differences across blocks.
  • Interpreting weights without checking stability.
  • Confusing a good visual separation with robust prediction.

Good habits

  • Start with block-wise exploratory plots.
  • Make the design matrix explicit and justify it.
  • Compare a dense RGCCA solution with a sparse SGCCA solution.
  • Inspect both sample space and variable space.
  • Use cross-validation when tuning sparsity for prediction-related goals.
  • Connect selected variables to pathway or annotation analyses.

Suggested workflow for beginners

  1. Define the samples and blocks.
  2. Check sample alignment and data quality.
  3. Explore each block with simple graphics.
  4. Decide whether the question is exploratory, supervised, or predictive.
  5. Write down a first design matrix based on biology.
  6. Run a simple RGCCA model.
  7. Inspect sample plots, loadings, and AVE summaries.
  8. Adjust the design or number of components if needed.
  9. Move to SGCCA if interpretability or variable selection becomes important.
  10. Use downstream interpretation tools on selected features.

Footnotes

  1. notice that all blocks have the same height (i.e. the same number of observations) but different widths (i.e. different numbers of variables).

  2. recall that a component in PCA is a linear combination (which needs weights) of the variables.