A few theoretical concepts and tools

Author

Vincent Guillemot

Why this document?

RGCCA (Regularized Generalized Canonical Correlation Analysis) is easier to understand if a few core statistical ideas are already clear:

  • what variance, covariance, and correlation measure,
  • how a dataset is written as a matrix,
  • what a linear combination of variables is,
  • why we often center and scale variables,
  • why high-dimensional data require regularization,
  • and why sparsity can help interpretation.

This document is a beginner-friendly reminder of these notions. The goal is not to provide a full course in multivariate analysis, but to build the minimal intuition needed before learning RGCCA and SGCCA.

One dataset, many variables

Suppose we measure \(p\) variables on the same \(n\) individuals. We store the data in a matrix

\[ \mathbf X \in \mathbb R^{n \times p}, \]

where:

  • rows correspond to individuals,
  • columns correspond to variables.

For example, if we measure 3 variables on 20 individuals, then \(\mathbf X\) has 20 rows and 3 columns.

In R, a data frame or a matrix often plays this role.
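
As a minimal sketch, such a matrix can be created directly in R (the values below are arbitrary random numbers):

Code
# 20 individuals in rows, 3 variables in columns, filled with random values
X <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3,
            dimnames = list(NULL, c("X1", "X2", "X3")))
dim(X)  # returns 20 and 3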

Mean and variance

For one variable observed on the same \(n\) individuals,

\[ \mathbf x = (x_1, \dots, x_n)^\top, \]

the sample mean is

\[ \bar x = \frac{1}{n}\sum_{i=1}^n x_i, \]

and the sample variance is

\[ \mathrm{Var}(\mathbf x) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2. \]

The variance measures how much the values spread around their mean.
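
As a small sketch in R, the base functions mean() and var() compute these two quantities (the values of x are arbitrary):

Code
x <- c(2, 4, 6, 8, 10)
mean(x)                                  # sample mean
var(x)                                   # sample variance, with the 1/(n - 1) factor
sum((x - mean(x))^2) / (length(x) - 1)   # the same value, computed by hand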

Covariance

If we observe two variables on the same individuals,

\[ \mathbf x = (x_1, \dots, x_n)^\top, \qquad \mathbf y = (y_1, \dots, y_n)^\top, \]

the sample covariance is

\[ \mathrm{Cov}(\mathbf x, \mathbf y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y). \]

Interpretation:

  • positive covariance: the two variables tend to increase together,
  • negative covariance: one tends to increase when the other decreases,
  • covariance near 0: no clear linear co-variation.

Unlike correlation, covariance depends on the units of measurement.
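
A small sketch in R with the base function cov(), including the effect of a change of units (the simulated values are arbitrary):

Code
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)    # y tends to increase with x: positive covariance
cov(x, y)                 # sample covariance, with the 1/(n - 1) factor
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # same value by hand
cov(1000 * x, y)          # changing the units of x changes the covariance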

Correlation coefficient

The Pearson correlation coefficient is a standardized covariance:

\[ r(\mathbf x, \mathbf y) = \frac{\displaystyle\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)}{\left(\displaystyle\sum_{i=1}^n (x_i - \bar x)^2\sum_{i=1}^n (y_i - \bar y)^2\right)^{1/2}}. \]

It always lies between \(-1\) and \(1\):

  • \(r = 1\): perfect positive linear relationship,
  • \(r = -1\): perfect negative linear relationship,
  • \(r = 0\): no linear relationship.

In R, the function cor() computes Pearson correlations by default.

Correlation is about linear relationships

A correlation close to 0 does not necessarily mean that two variables are unrelated. It only means that their linear relationship is weak.

This distinction matters in practice: RGCCA is a linear multiblock method. It looks for linear combinations of variables that summarize relationships between blocks.

Simulated example

Code
n <- 20
p <- 3
R <- matrix(c( 1.0, 0.5, -0.8,
               0.5, 1.0,  0.1,
              -0.8, 0.1,  1.0), p, p)
set.seed(124)
# mvrnorm() comes from the MASS package; empirical = TRUE makes the sample
# correlations of the simulated data match R exactly
dat <- data.frame(MASS::mvrnorm(n = n, mu = rep(0, p), Sigma = R, empirical = TRUE))
GGally::ggpairs(dat)

Correlation matrix

The correlation matrix contains the pairwise correlations between all variables in a dataset.

For \(p\) variables \(X_1, \dots, X_p\), it is the \(p \times p\) matrix

\[ \mathbf{R} = \left[ \begin{array}{ccc} r(\mathbf{x}_1, \mathbf{x}_1) & \cdots & r(\mathbf{x}_1, \mathbf{x}_p) \\ \vdots & \ddots & \vdots \\ r(\mathbf{x}_p, \mathbf{x}_1) & \cdots & r(\mathbf{x}_p, \mathbf{x}_p) \end{array} \right]. \]

The diagonal is always equal to 1. In R, for our simulated example, the correlation matrix looks like this:

Code
mini_corrplot(R)

Code
cor(dat)
     X1  X2   X3
X1  1.0 0.5 -0.8
X2  0.5 1.0  0.1
X3 -0.8 0.1  1.0

Covariance matrix

The covariance matrix plays a similar role:

\[ \mathbf S = \frac{1}{n-1}\mathbf X_c^\top \mathbf X_c, \]

where \(\mathbf X_c\) is the centered data matrix.

Its diagonal contains the variances of the variables, and the off-diagonal elements contain covariances.

The covariance matrix is central in multivariate methods. Many methods, including PCA, CCA, and RGCCA, are built from matrices such as:

  • \(\mathbf X^\top \mathbf X\) for within-block structure,
  • \(\mathbf X^\top \mathbf Y\) for between-block relationships.
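
As a quick check of the formula for \(\mathbf S\) on the simulated data (a small sketch):

Code
Xc <- scale(dat, center = TRUE, scale = FALSE)    # centered data matrix
S  <- crossprod(Xc) / (nrow(dat) - 1)             # (1/(n-1)) t(Xc) %*% Xc
all.equal(S, cov(dat), check.attributes = FALSE)  # identical to cov()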

Centering and scaling

Before applying many multivariate methods, variables are often centered and sometimes scaled.

Centering

Centering means subtracting the mean of each variable:

\[ \mathbf X_c = \mathbf X - \mathbf 1_n \bar{\mathbf x}^\top. \]

After centering, each variable has mean 0.
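
A minimal sketch of this operation in R, on the simulated data:

Code
Xc <- sweep(as.matrix(dat), 2, colMeans(dat))  # subtract each column mean
round(colMeans(Xc), 10)                        # all means are (numerically) zero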

Scaling

Scaling usually means dividing each centered variable by its standard deviation. After scaling, each variable has variance 1.

In R:

Code
head(scale(dat))
             X1         X2         X3
[1,] -1.7199993 -0.9954269  1.2253007
[2,]  0.5861590  1.1542574 -0.3382172
[3,] -0.6599118  0.1716074  0.7793796
[4,]  1.2293427  2.6302732  0.3120024
[5,]  1.2502357 -0.2586077 -1.5313015
[6,]  0.6358610 -0.1830621 -0.8394275

Why is this useful?

  • if variables are measured on very different scales, those with large units may dominate the analysis,
  • correlation-based methods implicitly rely on standardized variables,
  • scaling often makes variables more comparable.

Vectors, matrices, and matrix products

Multivariate methods become much easier to write down and reason about once we adopt matrix notation.

A data matrix is written as

\[ \mathbf X = [\mathbf x_1, \dots, \mathbf x_p], \]

where each \(\mathbf x_j\) is a column vector.

Two matrix products are especially important:

\(\mathbf X^\top \mathbf X\)

This is a \(p \times p\) matrix describing relationships between variables within the same block.

If the variables are centered, then

\[ \frac{1}{n-1}\mathbf X^\top \mathbf X \]

is the sample covariance matrix.

\(\mathbf X^\top \mathbf Y\)

If \(\mathbf X\) and \(\mathbf Y\) are two blocks measured on the same individuals, then

\[ \mathbf X^\top \mathbf Y \]

summarizes cross-relationships between variables from the two blocks.

This is one of the key objects behind CCA and RGCCA.
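
A small sketch with two toy blocks measured on the same individuals (the blocks and their sizes are purely illustrative):

Code
set.seed(42)
n <- 20
X <- scale(matrix(rnorm(n * 3), n, 3), scale = FALSE)  # centered block, 3 variables
Y <- scale(matrix(rnorm(n * 2), n, 2), scale = FALSE)  # centered block, 2 variables
crossprod(X) / (n - 1)     # t(X) %*% X / (n - 1): within-block covariance matrix
crossprod(X, Y) / (n - 1)  # t(X) %*% Y / (n - 1): between-block covariances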

Linear combinations and components

A central idea in PCA, CCA, PLS, and RGCCA is to replace many variables by a smaller number of components.

A component is a linear combination of variables. If \(\mathbf X\) is a block and \(\mathbf a\) is a vector of weights, then

\[ \mathbf y = \mathbf X \mathbf a \]

is a component (also called a score vector, latent variable, or block component depending on context).

Interpretation:

  • \(\mathbf a\) tells us how each variable contributes,
  • \(\mathbf y\) gives one synthetic score for each individual.

This is the core mechanism of RGCCA: for each block, the method estimates a weight vector and therefore a component.
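
For instance, with the simulated block dat and an arbitrary weight vector (a sketch; these weights have no particular meaning):

Code
a <- c(1, -1, 0.5)          # one weight per variable
y <- as.matrix(dat) %*% a   # one synthetic score per individual
head(y)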

Variance of a component

If \(\mathbf y = \mathbf X\mathbf a\) and the columns of \(\mathbf X\) are centered, then the variance of the component is

\[ \mathrm{Var}(\mathbf y) = \frac{1}{n-1}\,\mathbf a^\top \mathbf X^\top \mathbf X \mathbf a \;\propto\; \mathbf a^\top \mathbf X^\top \mathbf X \mathbf a. \]

This quadratic form tells us how much variability is captured by the linear combination defined by \(\mathbf a\).

Many multivariate methods optimize some criterion involving this type of quadratic form.
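
A quick numerical check of this identity on the simulated data, using the weight vector a defined above (a sketch assuming centered columns):

Code
Xc <- scale(dat, scale = FALSE)                # centered block
y  <- Xc %*% a                                 # component
var(as.vector(y))                              # variance of the component
t(a) %*% crossprod(Xc) %*% a / (nrow(Xc) - 1)  # a' X'X a / (n - 1): same value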

Covariance between two components

Suppose two blocks are available:

\[ \mathbf X \in \mathbb R^{n \times p}, \qquad \mathbf Y \in \mathbb R^{n \times q}. \]

If we define two components

\[ \mathbf t = \mathbf X\mathbf a, \qquad \mathbf u = \mathbf Y\mathbf b, \]

then, provided the columns of \(\mathbf X\) and \(\mathbf Y\) are centered, their covariance is proportional to

\[ \mathrm{Cov}(\mathbf t, \mathbf u) = \frac{1}{n-1}\,\mathbf a^\top \mathbf X^\top \mathbf Y \mathbf b \;\propto\; \mathbf a^\top \mathbf X^\top \mathbf Y \mathbf b. \]

This is fundamental for understanding two-block methods:

  • CCA looks for components that are highly correlated,
  • PLS tends to emphasize covariance,
  • RGCCA generalizes this logic to several blocks and lets the user choose the scheme and the block connections.
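
A numerical check of the covariance identity above, with two toy centered blocks and arbitrary weights (a sketch):

Code
set.seed(7)
n <- 30
X <- scale(matrix(rnorm(n * 4), n, 4), scale = FALSE)  # centered block, 4 variables
Y <- scale(matrix(rnorm(n * 2), n, 2), scale = FALSE)  # centered block, 2 variables
a <- c(1, 0, -1, 2)
b <- c(0.5, 1)
cov(X %*% a, Y %*% b)                     # covariance of the two components
t(a) %*% crossprod(X, Y) %*% b / (n - 1)  # a' X'Y b / (n - 1): same value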

Eigenvalues and eigenvectors: why they appear everywhere

An eigenvector of a square matrix \(\mathbf A\) is a non-zero vector \(\mathbf v\) such that

\[ \mathbf A \mathbf v = \lambda \mathbf v, \]

where \(\lambda\) is the corresponding eigenvalue.

In practice:

  • eigenvectors define important directions in the data,
  • eigenvalues measure how important these directions are.

For example, PCA finds directions maximizing variance. These directions are eigenvectors of the covariance matrix.
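
A small sketch with eigen() on the covariance matrix of the simulated data:

Code
S   <- cov(dat)
eig <- eigen(S)
eig$values        # eigenvalues, in decreasing order
eig$vectors[, 1]  # eigenvector associated with the largest eigenvalue
# check the defining relation A v = lambda v for this first pair
all.equal(as.vector(S %*% eig$vectors[, 1]),
          eig$values[1] * eig$vectors[, 1])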

PCA intuition

Principal Component Analysis (PCA) works on one block only.

Its first component is the linear combination of the (centered) variables with the largest possible variance, under a unit-norm constraint on the weight vector \(\mathbf a_1\):

\[ \mathbf t_1 = \mathbf X\mathbf a_1. \]

PCA is useful here because it introduces three ideas that also matter in RGCCA:

  • replacing many variables by a few components,
  • optimizing a criterion under a constraint,
  • interpreting weight vectors and component scores.
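
A minimal sketch with prcomp() on the simulated data, to connect these ideas to actual output:

Code
pca <- prcomp(dat)   # PCA on the centered (but unscaled) data
pca$rotation[, 1]    # weight vector a1 of the first component
head(pca$x[, 1])     # scores t1 = X a1, one per individual
var(pca$x[, 1])      # its variance: the largest eigenvalue of cov(dat)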

CCA intuition

Canonical Correlation Analysis (CCA) works with two blocks.

It looks for weight vectors \(\mathbf a\) and \(\mathbf b\) such that the two components

\[ \mathbf t = \mathbf X\mathbf a, \qquad \mathbf u = \mathbf Y\mathbf b \]

are as correlated as possible.
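
Base R offers cancor() for classical CCA. Here is a small sketch on two toy blocks (the blocks are purely illustrative, and \(n\) must exceed the number of variables in each block):

Code
set.seed(3)
n <- 30
X <- matrix(rnorm(n * 3), n, 3)
Y <- matrix(rnorm(n * 2), n, 2)
cca <- cancor(X, Y)
cca$cor    # canonical correlations between the pairs of components
cca$xcoef  # weight vectors for the X block
cca$ycoef  # weight vectors for the Y block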

So:

  • PCA summarizes one block,
  • CCA relates two blocks,
  • RGCCA extends the logic to several blocks.

What changes with several blocks?

In RGCCA, we no longer have only two blocks. We may have:

  • transcriptomics,
  • methylation,
  • proteomics,
  • clinical variables,
  • imaging,
  • metabolomics,
  • and so on.

Each block is measured on the same individuals. The goal is to construct one component per block so that connected blocks produce components that are strongly related.

This is why the notion of a design matrix is important in RGCCA: it specifies which blocks should be connected and which ones should not.
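
As a purely hypothetical illustration, a design matrix for three blocks in which blocks B1 and B2 are each connected to B3, but not to each other, could look like this:

Code
C <- matrix(c(0, 0, 1,
              0, 0, 1,
              1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("B1", "B2", "B3"), c("B1", "B2", "B3")))
C  # a 1 indicates that two blocks should be connected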

Multicollinearity

In omics and other high-dimensional datasets, variables within the same block are often strongly correlated. This is called multicollinearity.

Consequences:

  • unstable estimates,
  • redundant information,
  • numerical problems,
  • poor interpretability.

This is one important reason why regularization is useful in RGCCA.

High-dimensional data: when \(p \gg n\)

In many modern datasets, the number of variables is much larger than the number of individuals.

Examples:

  • 100 samples and 20,000 genes,
  • 80 patients and 500,000 methylation probes.

In that setting:

  • covariance matrices can become ill-conditioned or singular,
  • classical methods may become unstable,
  • overfitting is a major risk.

RGCCA was designed to be usable in such settings by introducing regularization.
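
A quick numerical illustration of the singularity problem (the dimensions are chosen arbitrarily):

Code
set.seed(10)
n <- 10
p <- 50
X <- matrix(rnorm(n * p), n, p)  # many more variables than individuals
S <- cov(X)                      # 50 x 50 covariance matrix
qr(S)$rank                       # at most n - 1 = 9: S is singular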

Regularization

Regularization means adding constraints or penalties to stabilize estimation.

The main idea is simple:

  • allow some flexibility,
  • but avoid solutions that are too unstable or too sensitive to noise.

For RGCCA, regularization is controlled by parameters such as \(\tau\) at the block level.

At a very intuitive level:

  • low regularization keeps the method close to correlation-based criteria,
  • stronger regularization stabilizes the problem and can move the method toward covariance-based behavior.

Shrinkage intuition

A very common regularization idea is shrinkage.

Instead of using a raw empirical covariance matrix, we move it toward a simpler target matrix that is more stable.

A generic form is

\[ \mathbf S_{\text{shrunk}} = (1 - \tau)\mathbf S + \tau \mathbf T, \]

where:

  • \(\mathbf S\) is the empirical covariance matrix,
  • \(\mathbf T\) is a simpler target,
  • \(\tau \in [0,1]\) controls the amount of shrinkage.

This kind of idea is directly relevant for RGCCA, because the method uses block-specific regularization parameters.
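
A small sketch of this formula, using the identity matrix as a deliberately simple target (the value of \(\tau\) is arbitrary):

Code
set.seed(11)
X   <- matrix(rnorm(10 * 50), 10, 50)  # fewer individuals than variables
S   <- cov(X)                          # empirical covariance: singular
tau <- 0.5                             # amount of shrinkage, chosen arbitrarily
S_shrunk <- (1 - tau) * S + tau * diag(ncol(S))
qr(S)$rank         # rank-deficient
qr(S_shrunk)$rank  # full rank: the shrunk matrix is invertible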

Why sparsity matters

In very high-dimensional data, a component can involve thousands of variables. This may be hard to interpret.

A sparse method tries to keep only a subset of variables with non-zero weights.

Benefits:

  • easier interpretation,
  • variable selection,
  • better focus on the strongest signals.

This is the role of SGCCA: a sparse version of GCCA/RGCCA.

L1 and L2 norms

To understand sparsity, two norms are especially useful.

For a vector \(\mathbf a = (a_1, \dots, a_p)^\top\):

\[ \|\mathbf a\|_2 = \left(\sum_{j=1}^p a_j^2\right)^{1/2} \]

is the Euclidean norm, and

\[ \|\mathbf a\|_1 = \sum_{j=1}^p |a_j| \]

is the L1 norm.

The L2 norm controls the overall size of the vector. The L1 norm, when used as a constraint or penalty, tends to push many weights to exactly zero and therefore encourages sparsity.

That is why SGCCA uses constraints involving both norms.
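
A small sketch: when weight vectors are rescaled to unit L2 norm, sparser vectors tend to have a smaller L1 norm, which is exactly what an L1 constraint exploits (the weights below are arbitrary):

Code
a_dense  <- c(0.4, -0.3, 0.5, 0.2, -0.1)       # all weights non-zero
a_sparse <- c(0.8,  0.0, 0.0, 0.6,  0.0)       # only two non-zero weights
a_dense  <- a_dense  / sqrt(sum(a_dense^2))    # rescale to unit L2 norm
a_sparse <- a_sparse / sqrt(sum(a_sparse^2))
sum(abs(a_dense))    # L1 norm of the dense vector (larger)
sum(abs(a_sparse))   # L1 norm of the sparse vector (smaller)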

Optimization under constraints

Many multivariate methods solve problems of the form

\[ \max_{\mathbf a} f(\mathbf a) \qquad \text{subject to} \qquad \mathbf a \in \mathcal C, \]

where:

  • \(f(\mathbf a)\) is a quantity we want to maximize,
  • \(\mathcal C\) is a set of admissible solutions.

Typical constraints include:

  • unit norm constraints,
  • orthogonality constraints,
  • sparsity constraints.

RGCCA and SGCCA both fit naturally in this framework.
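
A small numerical illustration of such a problem (a sketch): maximizing \(\mathbf a^\top \mathbf S \mathbf a\) over unit-norm vectors, where the maximum is attained by the leading eigenvector of \(\mathbf S\).

Code
S <- cov(dat)                            # criterion: f(a) = t(a) %*% S %*% a
set.seed(5)
a_rand <- rnorm(3)
a_rand <- a_rand / sqrt(sum(a_rand^2))   # a random vector satisfying ||a||_2 = 1
a_best <- eigen(S)$vectors[, 1]          # leading eigenvector, also unit norm
t(a_rand) %*% S %*% a_rand               # one admissible value of the criterion
t(a_best) %*% S %*% a_best               # the maximum: the largest eigenvalue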

A simple bridge to RGCCA

At a beginner level, RGCCA can be understood as follows:

  1. split the data into several blocks measured on the same individuals,
  2. compute one linear combination per block,
  3. force these combinations to satisfy normalization constraints,
  4. optimize a criterion that rewards relationships between connected blocks,
  5. regularize the problem to remain stable,
  6. optionally impose sparsity to keep only the most informative variables.

This is why the previous concepts are the essential prerequisites.

Summary of key ideas

Before studying RGCCA in detail, make sure the following notions are comfortable:

  • mean and variance,
  • covariance and correlation,
  • covariance matrix and correlation matrix,
  • centering and scaling,
  • linear combination of variables,
  • component scores and weight vectors,
  • matrix products such as \(\mathbf X^\top \mathbf X\) and \(\mathbf X^\top \mathbf Y\),
  • eigenvalues/eigenvectors,
  • PCA intuition,
  • CCA intuition,
  • multicollinearity,
  • high-dimensional setting,
  • regularization,
  • sparsity.

If these ideas are clear, the theory of RGCCA becomes much more accessible.

Optional exercises

  1. Simulate two strongly correlated variables and verify that their Pearson correlation is close to 1.
  2. Simulate two variables with different measurement scales and compare the covariance and correlation.
  3. Center and scale a matrix manually, then compare the result to scale().
  4. Construct a linear combination of two variables and interpret the resulting component.
  5. Compare the matrices t(X) %*% X and t(X) %*% Y in a small toy example.

Further reading

Once these basics are in place, the next step is to study:

  • the RGCCA optimization criterion,
  • the meaning of the design matrix,
  • the role of the scheme function,
  • the role of the block regularization parameter \(\tau\),
  • and the sparsity constraints used in SGCCA.