RGCCA (Regularized Generalized Canonical Correlation Analysis) is easier to understand if a few core statistical ideas are already clear: variance, covariance, and correlation; data matrices and the products \(\mathbf X^\top \mathbf X\) and \(\mathbf X^\top \mathbf Y\); components as linear combinations of variables; eigenvectors; and regularization and sparsity.
This document is a beginner-friendly reminder of these notions. The goal is not to provide a full course in multivariate analysis, but to build the minimal intuition needed before learning RGCCA and SGCCA.
Suppose we measure \(p\) variables on the same \(n\) individuals. We store the data in a matrix
\[ \mathbf X \in \mathbb R^{n \times p}, \]
where each of the \(n\) rows corresponds to an individual and each of the \(p\) columns to a variable.
For example, if we measure 3 variables on 20 individuals, then \(\mathbf X\) has 20 rows and 3 columns.
In R, a data frame or a matrix often plays this role.
For one variable observed on the same \(n\) individuals,
\[ \mathbf x = (x_1, \dots, x_n)^\top, \]
the sample mean is
\[ \bar x = \frac{1}{n}\sum_{i=1}^n x_i, \]
and the sample variance is
\[ \mathrm{Var}(\mathbf x) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2. \]
The variance measures how much the values spread around their mean.
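As a quick check, both formulas can be reproduced by hand in R; the small vector below is made up purely for illustration.

```r
# Hand-computing the mean and variance of a small made-up vector
x <- c(2, 4, 6, 8)
n <- length(x)
mean(x)                         # (2 + 4 + 6 + 8) / 4 = 5
sum((x - mean(x))^2) / (n - 1)  # 20 / 3, identical to var(x)
var(x)
```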
If we observe two variables on the same individuals,
\[ \mathbf x = (x_1, \dots, x_n)^\top, \qquad \mathbf y = (y_1, \dots, y_n)^\top, \]
the sample covariance is
\[ \mathrm{Cov}(\mathbf x, \mathbf y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y). \]
Interpretation: a positive covariance means the two variables tend to increase together, a negative covariance means that one tends to decrease when the other increases, and a value close to 0 indicates little linear co-variation.
Unlike correlation, covariance depends on the units of measurement.
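A minimal sketch of the covariance formula, again on made-up vectors:

```r
# Hand-computed covariance versus the built-in cov()
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # 14 / 3
cov(x, y)                                             # identical value
```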
The Pearson correlation coefficient is a standardized covariance:
\[ r(\mathbf x, \mathbf y) = \frac{\displaystyle\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)}{\left(\displaystyle\sum_{i=1}^n (x_i - \bar x)^2\sum_{i=1}^n (y_i - \bar y)^2\right)^{1/2}}. \]
It always lies between \(-1\) and \(1\): values close to \(1\) indicate a strong positive linear relationship, values close to \(-1\) a strong negative one, and values close to \(0\) a weak linear relationship.
In R, the function cor() computes Pearson correlations by default.
A correlation close to 0 does not necessarily mean that two variables are unrelated. It only means that their linear relationship is weak.
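The same toy vectors illustrate that Pearson correlation is just a covariance rescaled by the two standard deviations:

```r
# Pearson correlation from the formula above, then with cor()
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
num <- sum((x - mean(x)) * (y - mean(y)))
den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
num / den                    # the formula above
cor(x, y)                    # identical value
cov(x, y) / (sd(x) * sd(y))  # covariance standardized by the two sds
```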
This distinction matters in practice: RGCCA is a linear multiblock method. It looks for linear combinations of variables that summarize relationships between blocks.
```r
library(MASS)  # provides mvrnorm()

n <- 20  # number of individuals
p <- 3   # number of variables
R <- matrix(c(   1, 0.5, -0.8,
               0.5,   1,  0.1,
              -0.8, 0.1,    1), p, p)
set.seed(124)
# empirical = TRUE forces the sample mean and covariance of the
# simulated data to match mu and Sigma exactly
dat <- data.frame(mvrnorm(n = n, mu = rep(0, p), Sigma = R, empirical = TRUE))
GGally::ggpairs(dat)  # scatterplot matrix of the three variables
```
The correlation matrix contains the pairwise correlations between all variables in a dataset.
For \(p\) variables \(X_1, \dots, X_p\), it is the \(p \times p\) matrix
\[ \mathbf{R} = \left[ \begin{array}{ccc} r(\mathbf{x}_1, \mathbf{x}_1) & \cdots & r(\mathbf{x}_1, \mathbf{x}_p) \\ \vdots & \ddots & \vdots \\ r(\mathbf{x}_p, \mathbf{x}_1) & \cdots & r(\mathbf{x}_p, \mathbf{x}_p) \end{array} \right]. \]
The diagonal is always equal to 1. In R, on our example, the correlation matrix looks like this:
```r
mini_corrplot(R)  # plot the correlation matrix

cor(dat)
#>      X1  X2   X3
#> X1  1.0 0.5 -0.8
#> X2  0.5 1.0  0.1
#> X3 -0.8 0.1  1.0
```
The covariance matrix plays a similar role:
\[ \mathbf S = \frac{1}{n-1}\mathbf X_c^\top \mathbf X_c, \]
where \(\mathbf X_c\) is the centered data matrix.
Its diagonal contains the variances of the variables, and the off-diagonal elements contain covariances.
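On our simulated example, this formula can be checked directly (using the dat and n objects created above):

```r
# The covariance matrix as a product of the centered data with itself
Xc <- scale(dat, center = TRUE, scale = FALSE)  # centered, not scaled
S  <- t(Xc) %*% Xc / (n - 1)
all.equal(S, cov(dat), check.attributes = FALSE)  # TRUE
```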
The covariance matrix is central in multivariate methods. Many methods, including PCA, CCA, and RGCCA, are built from matrices such as \(\mathbf X^\top \mathbf X\) (relationships within a block) and \(\mathbf X^\top \mathbf Y\) (relationships between blocks).
Before applying many multivariate methods, variables are often centered and sometimes scaled.
Centering means subtracting the mean of each variable:
\[ \mathbf X_c = \mathbf X - \mathbf 1_n \bar{\mathbf x}^\top. \]
After centering, each variable has mean 0.
Scaling usually means dividing each centered variable by its standard deviation. After scaling, each variable has variance 1.
In R:

```r
head(scale(dat))
#>              X1         X2         X3
#> [1,] -1.7199993 -0.9954269  1.2253007
#> [2,]  0.5861590  1.1542574 -0.3382172
#> [3,] -0.6599118  0.1716074  0.7793796
#> [4,]  1.2293427  2.6302732  0.3120024
#> [5,]  1.2502357 -0.2586077 -1.5313015
#> [6,]  0.6358610 -0.1830621 -0.8394275
```
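A quick way to confirm the effect of scaling:

```r
# After scaling, every column has mean 0 and standard deviation 1
Xs <- scale(dat)
round(colMeans(Xs), 12)  # numerically zero
apply(Xs, 2, sd)         # all equal to 1
```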
Why is this useful? Centering removes differences in location, and scaling removes differences in units, so that no variable dominates the analysis simply because it is measured on a larger scale.
Multivariate methods become much easier once we accept matrix notation.
A data matrix is written as
\[ \mathbf X = [\mathbf x_1, \dots, \mathbf x_p], \]
where each \(\mathbf x_j\) is a column vector.
Two matrix products are especially important. The first is \(\mathbf X^\top \mathbf X\), a \(p \times p\) matrix describing relationships between variables within the same block.
If the variables are centered, then
\[ \frac{1}{n-1}\mathbf X^\top \mathbf X \]
is the sample covariance matrix.
If \(\mathbf X\) and \(\mathbf Y\) are two blocks measured on the same individuals, then
\[ \mathbf X^\top \mathbf Y \]
summarizes cross-relationships between variables from the two blocks.
This is one of the key objects behind CCA and RGCCA.
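To make these products concrete, we can artificially split our toy data into a two-variable block X and a one-variable block Y; the split is purely illustrative.

```r
# An artificial split of dat into two blocks measured on the same individuals
X <- as.matrix(dat[, c("X1", "X2")])
Y <- as.matrix(dat[, "X3", drop = FALSE])
# The columns of dat already have mean 0 (empirical = TRUE above),
# so dividing the cross-products by n - 1 gives covariances directly
t(X) %*% X / (n - 1)  # within-block covariance, 2 x 2
t(X) %*% Y / (n - 1)  # cross-block covariance, 2 x 1
```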
A central idea in PCA, CCA, PLS, and RGCCA is to replace many variables by a smaller number of components.
A component is a linear combination of variables. If \(\mathbf X\) is a block and \(\mathbf a\) is a vector of weights, then
\[ \mathbf y = \mathbf X \mathbf a \]
is a component (also called a score vector, latent variable, or block component depending on context).
Interpretation: each individual \(i\) receives a single score \(y_i\), and the weight \(a_j\) indicates how strongly variable \(j\) contributes to that score.
This is the core mechanism of RGCCA: for each block, the method estimates a weight vector and therefore a component.
If \(\mathbf y = \mathbf X\mathbf a\), then the variance of the component is
\[ \mathrm{Var}(\mathbf y) \propto \mathbf a^\top \mathbf X^\top \mathbf X \mathbf a. \]
This quantity tells us how much variability is captured by the linear combination defined by \(\mathbf a\).
Many multivariate methods optimize some criterion involving this type of quadratic form.
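A short sketch with an arbitrary, made-up weight vector shows the quadratic form at work:

```r
# A component y = X a and its variance as a quadratic form
a <- c(1, -1, 0.5)       # arbitrary weights, chosen only for illustration
y <- as.matrix(dat) %*% a
var(y)                   # variance of the component
t(a) %*% cov(dat) %*% a  # identical value: a' S a
```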
Suppose two blocks are available:
\[ \mathbf X \in \mathbb R^{n \times p}, \qquad \mathbf Y \in \mathbb R^{n \times q}. \]
If we define two components
\[ \mathbf t = \mathbf X\mathbf a, \qquad \mathbf u = \mathbf Y\mathbf b, \]
then their covariance is proportional to
\[ \mathbf a^\top \mathbf X^\top \mathbf Y \mathbf b. \]
This is fundamental for understanding two-block methods: PLS looks for weights that maximize the covariance between \(\mathbf t\) and \(\mathbf u\), while CCA maximizes their correlation.
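Using the same artificial block split as before, the identity can be verified numerically (the weights are again made up):

```r
# Covariance between two block components t = X a and u = Y b
X <- as.matrix(dat[, c("X1", "X2")])
Y <- as.matrix(dat[, "X3", drop = FALSE])
a <- c(0.7, 0.3); b <- 1
cov(X %*% a, Y %*% b)                  # sample covariance of the components
t(a) %*% (t(X) %*% Y / (n - 1)) %*% b  # identical value
```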
An eigenvector of a square matrix \(\mathbf A\) is a non-zero vector \(\mathbf v\) such that
\[ \mathbf A \mathbf v = \lambda \mathbf v, \]
where \(\lambda\) is the corresponding eigenvalue.
In practice, eigenvectors give the optimal directions that many multivariate methods search for, and the associated eigenvalues measure how good each direction is with respect to the criterion being optimized.
For example, PCA finds directions maximizing variance. These directions are eigenvectors of the covariance matrix.
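In R, the eigendecomposition of our toy covariance matrix is one line:

```r
# Eigendecomposition of the covariance matrix of dat
e <- eigen(cov(dat))
e$values        # variance captured along each eigenvector
e$vectors[, 1]  # the direction of maximal variance
```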
Principal Component Analysis (PCA) works on one block only.
Its first component is the linear combination of variables with the largest possible variance:
\[ \mathbf t_1 = \mathbf X\mathbf a_1. \]
PCA is useful here because it introduces three ideas that also matter in RGCCA: a component is a linear combination of the original variables, it is entirely determined by a weight vector, and that weight vector is obtained by maximizing a criterion (here, variance) under a norm constraint.
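A minimal PCA on the toy data connects these ideas (prcomp() centers the data by default):

```r
# PCA: the first component's variance equals the largest eigenvalue
pca <- prcomp(dat)
t1  <- pca$x[, 1]          # scores of the first component
var(t1)
eigen(cov(dat))$values[1]  # identical value
pca$rotation[, 1]          # the weight vector a_1 (up to sign)
```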
Canonical Correlation Analysis (CCA) works with two blocks.
It looks for weight vectors \(\mathbf a\) and \(\mathbf b\) such that the two components
\[ \mathbf t = \mathbf X\mathbf a, \qquad \mathbf u = \mathbf Y\mathbf b \]
are as correlated as possible.
So, while PCA maximizes variance within a single block, CCA maximizes correlation between two blocks.
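Base R provides cancor() for classical CCA; here is a minimal illustration on the artificial two-block split used earlier:

```r
# Classical CCA between the two artificial blocks
X <- as.matrix(dat[, c("X1", "X2")])
Y <- as.matrix(dat[, "X3", drop = FALSE])
cc <- cancor(X, Y)
cc$cor    # the largest achievable correlation between X a and Y b
cc$xcoef  # weights a
cc$ycoef  # weights b
```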
In RGCCA, we no longer have only two blocks. We may have three, four, or many more blocks, for example several omics datasets measured together with clinical variables.
Each block is measured on the same individuals. The goal is to construct one component per block so that connected blocks produce components that are strongly related.
This is why the notion of a design matrix is important in RGCCA: it specifies which blocks should be connected and which ones should not.
In omics and other high-dimensional datasets, variables within the same block are often strongly correlated. This is called multicollinearity.
Consequences: the within-block covariance matrix becomes ill-conditioned or even singular, and the estimated weight vectors become unstable and hard to interpret.
This is one important reason why regularization is useful in RGCCA.
In many modern datasets, the number of variables is much larger than the number of individuals.
Examples: transcriptomics data with tens of thousands of genes measured on a few dozen samples, or other omics and imaging datasets.
In that setting, the matrix \(\mathbf X^\top \mathbf X\) is singular, classical CCA is no longer well defined, and empirical covariance estimates become very noisy.
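A tiny simulation (independent of our example) makes the singularity visible:

```r
# With n = 10 individuals and p = 50 variables, X'X cannot be inverted
set.seed(1)
Xbig <- matrix(rnorm(10 * 50), nrow = 10, ncol = 50)
G <- crossprod(Xbig)  # t(Xbig) %*% Xbig, a 50 x 50 matrix
qr(G)$rank            # at most 10, far below 50: G is singular
# solve(G) would therefore fail with an error
```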
RGCCA was designed to be usable in such settings by introducing regularization.
Regularization means adding constraints or penalties to stabilize estimation.
The main idea is simple: accept a small amount of bias in exchange for a large reduction in variance, so that the estimated weights become stable.
For RGCCA, regularization is controlled by parameters such as \(\tau\) at the block level.
At a very intuitive level, \(\tau\) close to 1 favors stable, variance-driven components (a covariance-like criterion), while \(\tau\) close to 0 favors correlation-driven components as in CCA, at the price of stability when there are many variables.
A very common regularization idea is shrinkage.
Instead of using a raw empirical covariance matrix, we move it toward a simpler target matrix that is more stable.
A generic form is
\[ \mathbf S_{\text{shrunk}} = (1 - \tau)\mathbf S + \tau \mathbf T, \]
where \(\mathbf S\) is the empirical covariance matrix, \(\mathbf T\) is a simple and well-conditioned target (often the identity matrix), and \(\tau \in [0, 1]\) controls the amount of shrinkage.
This kind of idea is directly relevant for RGCCA, because the method uses block-specific regularization parameters.
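A minimal sketch on our toy covariance matrix, with an arbitrary value of \(\tau\):

```r
# Shrinking the empirical covariance matrix toward the identity
S   <- cov(dat)
tau <- 0.3  # arbitrary shrinkage level, for illustration only
S_shrunk <- (1 - tau) * S + tau * diag(ncol(S))
kappa(S)         # condition number of the raw matrix
kappa(S_shrunk)  # smaller: the shrunk matrix is better conditioned
```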
In very high-dimensional data, a component can involve thousands of variables. This may be hard to interpret.
A sparse method tries to keep only a subset of variables with non-zero weights.
Benefits: components built from few variables are much easier to interpret, and variable selection happens as part of the estimation itself.
This is the role of SGCCA: a sparse version of GCCA/RGCCA.
To understand sparsity, two norms are especially useful.
For a vector \(\mathbf a = (a_1, \dots, a_p)^\top\):
\[ \|\mathbf a\|_2 = \left(\sum_{j=1}^p a_j^2\right)^{1/2} \]
is the Euclidean norm, and
\[ \|\mathbf a\|_1 = \sum_{j=1}^p |a_j| \]
is the L1 norm.
The L2 norm controls the global size of the vector. The L1 norm encourages sparsity.
That is why SGCCA uses constraints involving both norms.
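Two made-up weight vectors show why bounding the L1 norm favors sparsity: among vectors with the same L2 norm, the sparse one has a much smaller L1 norm.

```r
# L2 and L1 norms of a dense versus a sparse weight vector
a_dense  <- c(0.5, -0.5, 0.5, 0.5)
a_sparse <- c(1, 0, 0, 0)
sqrt(sum(a_dense^2)); sqrt(sum(a_sparse^2))  # both L2 norms equal 1
sum(abs(a_dense)); sum(abs(a_sparse))        # L1 norms: 2 versus 1
```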
Many multivariate methods solve problems of the form
\[ \max_{\mathbf a} f(\mathbf a) \qquad \text{subject to} \qquad \mathbf a \in \mathcal C, \]
where \(f\) is the criterion to maximize (a variance, a covariance, or a correlation) and \(\mathcal C\) is the set of allowed weight vectors.
Typical constraints include a unit L2 norm, \(\|\mathbf a\|_2 = 1\), and, for sparse methods, an additional bound on the L1 norm, \(\|\mathbf a\|_1 \le s\).
RGCCA and SGCCA both fit naturally in this framework.
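As a sketch of this framework, take \(f(\mathbf a) = \mathbf a^\top \mathbf S \mathbf a\) with a unit-norm constraint: the maximum is attained by the first eigenvector.

```r
# Maximizing a' S a over unit-norm vectors a
S <- cov(dat)
a_best <- eigen(S)$vectors[, 1]
t(a_best) %*% S %*% a_best  # equals the largest eigenvalue of S
set.seed(42)
a_rand <- rnorm(3); a_rand <- a_rand / sqrt(sum(a_rand^2))
t(a_rand) %*% S %*% a_rand  # never exceeds the value above
```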
At a beginner level, RGCCA can be understood as follows: for each block, estimate a weight vector and hence a component, so that the components of connected blocks are as related as possible, while regularization keeps the estimation stable.
This is why the previous concepts are the essential prerequisites.
Before studying RGCCA in detail, make sure the following notions are comfortable: variance, covariance, and correlation; the products \(\mathbf X^\top \mathbf X\) and \(\mathbf X^\top \mathbf Y\); components as linear combinations defined by weight vectors; eigenvectors and eigenvalues; regularization and shrinkage; and the L1 and L2 norms.
If these ideas are clear, the theory of RGCCA becomes much more accessible.
As hands-on practice, try scale() on a small dataset, and compute t(X) %*% X and t(X) %*% Y in a small toy example. Once these basics are in place, the next step is to study the RGCCA criterion itself, the design matrix connecting the blocks, and the sparse variant SGCCA.