CCA is a technique for finding linear combinations of two sets of variables such that the correlation between the combined variables is maximised. It is a powerful tool for extracting salient patterns from data of interest, but it also has often-overlooked limitations. The selection of CCA techniques here is based on Zhuang et al., 2020¹.
1. Partial Least Squares (PLS) regression
Since CCA is in fact a special case of PLS regression, let’s first take a quick look at the general formulation of a PLS model². Basically, PLS methods project two variables onto a space of maximum covariance, or in other words, onto a set of latent variables with maximum covariance.
Given $N$ observations of $I$ features $X = \{x_i\}_{i=1}^{I}$ and $J$ target/dependent variables $Y = \{y_j\}_{j=1}^{J}$, we would like to decompose them into:
\[X = \Xi P^T \text{, } Y = \Omega Q^T \text{ maximising } \mathrm{Cov}(\Xi, \Omega)\]where $\Xi$ and $\Omega$ are the projections while $P$ and $Q$ are the loading matrices for $X$ and $Y$ respectively.
In order to find the direction of maximal covariance, we are interested in the cross-covariance matrix $X^TY$, assuming that data have been centred. The direction of maximal covariance can be figured out via compact singular value decomposition (SVD) of the cross-covariance matrix:
\[C = X^T Y = U_r \Sigma_r V_r^T\]The first (and largest) singular value, together with the corresponding first singular vectors $u_r$ and $v_r$, can be used to maximise the covariance between $X$ and $Y$. This gives us the corresponding projections of the data:
\[\xi_r = Xu_r \text{, } \omega_r = Yv_r\]This also lets us compute the rank-one approximation of the data with a least-squares regression (“partial” because only the first singular vectors were used):
\[\begin{aligned} \hat{X}_r &= \xi_r p_r^T = \xi_r (\xi_r^T \xi_r)^{-1} \xi_r^T X_r \\ \hat{Y}_r &= \omega_r q_r^T = \omega_r (\omega_r^T \omega_r)^{-1} \omega_r^T Y_r \end{aligned}\]The decomposition procedure consists of iteratively computing the first singular vectors, the projections, the rank-one approximations, and the residual data matrices with the approximations subtracted. Also see Appendix A for the power method, which finds the first singular vectors without computing the entire SVD.
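To make the iterative procedure concrete, here is a minimal numpy sketch of a single deflation step (the function name and structure are illustrative, assuming centred data matrices):

```python
import numpy as np

def pls_w2a_step(X, Y):
    """One deflation step: first singular vectors of the cross-covariance,
    projections, rank-one approximations, and the residuals for the next step."""
    U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    u, v = U[:, 0], Vt[0]                               # first singular vectors
    xi, omega = X @ u, Y @ v                            # projections (scores)
    p = X.T @ xi / (xi @ xi)                            # loadings from least-squares fits
    q = Y.T @ omega / (omega @ omega)
    X_hat, Y_hat = np.outer(xi, p), np.outer(omega, q)  # rank-one approximations
    return u, v, xi, omega, X - X_hat, Y - Y_hat        # residuals feed the next step
```

Calling this repeatedly on the returned residuals yields one component per call.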
In the sklearn.cross_decomposition package, the PLSCanonical class implements the PLS-W2A algorithm, as is described above. The PLSSVD class implements the PLS-SVD algorithm, where SVD is only computed once to generate all singular vectors to be used, instead of iteratively estimating new decompositions. The PLSRegression class implements the PLS1 and PLS2 algorithms, where the rank-one approximation for $Y$ is also computed using $\xi_r$ (i.e., no $\omega_r$ is actually computed).
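A quick usage sketch with PLSCanonical (the data here are randomly generated for illustration):

```python
import numpy as np
from sklearn.cross_decomposition import PLSCanonical

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ rng.normal(size=(5, 3)) + 0.5 * rng.normal(size=(100, 3))

pls = PLSCanonical(n_components=2)   # PLS-W2A
pls.fit(X, Y)
xi, omega = pls.transform(X, Y)      # latent projections Xi and Omega
print(np.corrcoef(xi[:, 0], omega[:, 0])[0, 1])  # covariance-maximising pair
```

PLSSVD and PLSRegression expose the same fit/transform interface.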
2. Canonical Correlation Analysis (CCA)
PLS can be used in two modes, Mode A (e.g., in the PLS-W2A algorithm) and Mode B. In Mode A, simple regression models are fitted to reconstruct the data; in Mode B, multiple regression models are used instead, which is equivalent to CCA. Starting from the PLS-W2A algorithm, the only change we need to make is how $u$ and $v$ are computed. Using the power method (details in Appendix A), the singular vector update step becomes:
\[\begin{aligned} u_k &= X^{+} \omega_k \text{, } u_k = u_k / ||u_k|| \\ v_k &= Y^{+} \xi_k \text{, } v_k = v_k / ||v_k|| \end{aligned}\]where $X^{+}$ denotes the (pseudo-)inverse, usually computed as $\mathrm{pinv}(X) = (X^TX)^{-1}X^T$. This is implemented in the sklearn.cross_decomposition.CCA class.
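As a sketch, the Mode B (CCA) version of the power loop differs from the Mode A loop in Appendix A only in the weight updates; the helper below is illustrative and assumes centred data:

```python
import numpy as np

def cca_first_weights(X, Y, tol=1e-8, max_iter=500):
    """Mode B power iteration: weight updates use the pseudo-inverse of the data."""
    u = np.ones(X.shape[1]); u /= np.linalg.norm(u)
    v = np.ones(Y.shape[1]); v /= np.linalg.norm(v)
    X_pinv, Y_pinv = np.linalg.pinv(X), np.linalg.pinv(Y)
    for _ in range(max_iter):
        u_old = u
        xi, omega = X @ u, Y @ v                       # latent scores
        u = X_pinv @ omega; u /= np.linalg.norm(u)     # Mode B update (multiple regression)
        v = Y_pinv @ xi;    v /= np.linalg.norm(v)
        if np.linalg.norm(u - u_old) < tol:
            break
    return u, v
```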
Originally, CCA was formulated to optimise correlation between the latent variables of $X$ and $Y$, i.e., to maximise:
\[\rho = \mathrm{Corr}(Xu, Yv) = \frac{(Xu)^T Yv}{\sqrt{(Xu)^T Xu}\sqrt{(Yv)^T Yv}} = \frac{u^T \Sigma_{xy} v}{\sqrt{u^T \Sigma_{xx} u} \sqrt{v^T \Sigma_{yy} v}}\]Solving the maximisation of this correlation cost function can be done in its Lagrangian form (details in Appendix B). We will find that $u$ and $v$ are the eigenvectors of the matrices $\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{xy}^T$ and $\Sigma_{yy}^{-1} \Sigma_{xy}^T \Sigma_{xx}^{-1} \Sigma_{xy}$ respectively, with the same eigenvalue $\rho^2$. This again lets us compute the latent variables by finding the first singular vectors from an SVD of $\Sigma_{xx}^{-\frac{1}{2}} \Sigma_{xy} \Sigma_{yy}^{-\frac{1}{2}}$ (with the weights recovered by rescaling the singular vectors by $\Sigma_{xx}^{-\frac{1}{2}}$ and $\Sigma_{yy}^{-\frac{1}{2}}$).
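A minimal numpy sketch of this SVD route (the function name, the small ridge term, and the defaults are illustrative):

```python
import numpy as np

def cca_svd(X, Y, n_components=1, reg=1e-8):
    """CCA weights from the SVD of Sigma_xx^{-1/2} Sigma_xy Sigma_yy^{-1/2}."""
    X = X - X.mean(axis=0); Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])   # ridge for numerical stability
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):                                     # symmetric inverse square root
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Sxx_isqrt, Syy_isqrt = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Sxx_isqrt @ Sxy @ Syy_isqrt)
    u = Sxx_isqrt @ U[:, :n_components]                  # rescale singular vectors
    v = Syy_isqrt @ Vt[:n_components].T                  # to obtain canonical weights
    return u, v, s[:n_components]                        # s are the canonical correlations
```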
3. Sparse CCA
Due to the use of multiple regression, the decomposition in CCA becomes unstable in the presence of multicollinear features or when the number of observations $N$ is smaller than the number of features. The instability can be alleviated by adding a penalty to $\Sigma_{xx}$ and $\Sigma_{yy}$². A common variant of CCA constrained with an L1 (lasso) penalty is sparse CCA.
The cost function is in the same form as above, with added constraints:
\[\text{max } \rho = \mathrm{Corr}(Xu, Yv) \text{; s.t. } ||u||_1 < c_1 \text{, } ||v||_1 < c_2\]In practice, the stability issue is often tackled by further replacing $\Sigma_{xx}$ and $\Sigma_{yy}$ with identity matrices, which makes the formulation equivalent to sparse PLS.
To solve a constrained CCA, iterative optimisation techniques are required instead. For sparse CCA, the CCA-Zoo package provides an implementation using iterative penalised least squares in the cca_zoo.linear.SCCA_IPLS class.
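The constrained problem has no closed-form solution, so the sketch below only illustrates the identity-covariance simplification mentioned above: alternating soft-thresholded power-iteration updates on the cross-covariance (in the spirit of penalised matrix decomposition). It is not the SCCA_IPLS algorithm itself, and the penalty values are placeholders that would normally be tuned.

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca_identity(X, Y, lam_u=0.1, lam_v=0.1, tol=1e-8, max_iter=500):
    """Sparse CCA with Sigma_xx and Sigma_yy replaced by identity (i.e. sparse PLS)."""
    X = X - X.mean(axis=0); Y = Y - Y.mean(axis=0)
    C = X.T @ Y                                   # cross-covariance (up to scaling)
    v = np.ones(Y.shape[1]); v /= np.linalg.norm(v)
    for _ in range(max_iter):
        v_old = v
        u = soft_threshold(C @ v, lam_u)          # L1 penalty zeroes out small weights
        u /= np.linalg.norm(u) + 1e-12
        v = soft_threshold(C.T @ u, lam_v)
        v /= np.linalg.norm(v) + 1e-12
        if np.linalg.norm(v - v_old) < tol:
            break
    return u, v
```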
4. Nonlinear CCA
To model nonlinear relationships, a family of nonlinear variants of CCA can be used.
4.1. Kernel CCA
By applying kernel functions to the data, we can first project $X$ and $Y$ into a higher-dimensional feature space, $\phi(X)$ and $\phi(Y)$, before finding their latent variables. This converts the cost function to:
\[\text{max } \rho = \mathrm{Corr}(\phi(X)u, \phi(Y)v) = \frac{a^T H_x H_y b}{\sqrt{a^T H_x^2 a} \sqrt{b^T H_y^2 b}}\]where $H_x = \phi(X) \phi(X)^T$ and $H_y = \phi(Y) \phi(Y)^T$ are the kernel (Gram) matrices, and the weights are expressed through dual coefficients as $u = \phi(X)^T a$ and $v = \phi(Y)^T b$. Similar convenient constraints can be set for this cost function to solve it in Lagrangian form for $a$ and $b$, which again leads to eigen-equations in $H_x$ and $H_y$. Without regularisation these only admit the trivial solution $\rho = 1$ (whenever $H_x$ and $H_y$ are invertible), so a regularisation term is added to $H_x$ and $H_y$. Depending on the kernel used, hyperparameter tuning may be necessary as well.
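A minimal numpy sketch of regularised kernel CCA with an RBF kernel, assuming one common regularised formulation (the eigen-equation in the comment and the helper name are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kcca_first_pair(X, Y, gamma=1.0, reg=0.1):
    """First canonical pair in kernel space via dual coefficients a and b."""
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n               # centring matrix
    Hx = J @ rbf_kernel(X, X, gamma=gamma) @ J        # centred Gram matrices
    Hy = J @ rbf_kernel(Y, Y, gamma=gamma) @ J
    # a is the leading eigenvector of (Hx + reg I)^{-1} Hy (Hy + reg I)^{-1} Hx
    A = np.linalg.solve(Hx + reg * np.eye(n), Hy)
    B = np.linalg.solve(Hy + reg * np.eye(n), Hx)
    evals, evecs = np.linalg.eig(A @ B)
    idx = np.argmax(evals.real)
    a = evecs[:, idx].real
    b = (B @ a).real
    return Hx @ a, Hy @ b                             # latent variables (scores)
```

Without the reg term the recovered scores would be perfectly correlated regardless of the data, which is the trivial solution mentioned above.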
4.2. Deep CCA
Making use of deep learning techniques, we can also transform $X$ and $Y$ through neural network layers first. A training sample is required to learn the layer parameters via, e.g., gradient descent. An implementation can be found in the cca_zoo.deep.DCCA class.
5. Multiset CCA
Multiset CCA extends conventional CCA to model relationships among more than two variables. In neuroscience, for instance, multiset CCA can help to relate brain data obtained from different modalities. A basic implementation sums the correlation between all pairs of $K$ variables, deriving a SUMCOR cost function:
\[\text{max } \rho = \sum_{i, j, i \ne j}^K \mathrm{Corr}(X_i u_i, X_j u_j)\]The correlation entries now need to be expressed in a matrix with elements $\hat{\Sigma}_{ij} = u_i^T \Sigma_{ij} u_j$.
Another commonly used objective function is SSQCOR, which maximises the sum of squared pairwise correlations $\sum_{i,j}^K \hat{\Sigma}_{ij}^2$. Other objective functions involving the correlation matrix include maximising the largest eigenvalue, minimising the smallest eigenvalue, and minimising the determinant. All of these objective functions have analytical solutions derived in similar ways to conventional CCA. Implementations can be found in the cca_zoo.linear.MCCA class and the meegkit.cca class.
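As a sketch of the analytical route, a commonly used relaxation solves a generalised eigenvalue problem built from the stacked cross-covariances of all views; the helper below is illustrative (numpy/scipy, centred views assumed):

```python
import numpy as np
from scipy.linalg import eigh

def multiset_cca(views, n_components=1, reg=1e-6):
    """Multiset CCA via the generalised eigenproblem C w = lambda D w, where C holds
    all (cross-)covariance blocks and D is block-diagonal with within-set covariances."""
    views = [V - V.mean(axis=0) for V in views]
    dims = [V.shape[1] for V in views]
    starts = np.concatenate([[0], np.cumsum(dims)])
    p = starts[-1]
    C, D = np.zeros((p, p)), np.zeros((p, p))
    for i, Vi in enumerate(views):
        for j, Vj in enumerate(views):
            block = Vi.T @ Vj / (Vi.shape[0] - 1)
            C[starts[i]:starts[i+1], starts[j]:starts[j+1]] = block
            if i == j:
                D[starts[i]:starts[i+1], starts[j]:starts[j+1]] = block + reg * np.eye(dims[i])
    evals, evecs = eigh(C, D)                       # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :n_components]            # keep the leading eigenvectors
    return [W[starts[i]:starts[i+1]] for i in range(len(views))]
```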
5.1. Multiset CCA with reference
Often in practice, we want to relate one behavioural or outcome variable to multiple feature variables. Multiset CCA with reference addresses this by specifying that behavioural or outcome variable as the reference variable $Y_{ref}$. The objective function additionally accounts for the correlation between each feature variable and the reference variable³:
\[\text{max } \rho = \sum_{i, j, i \ne j}^K [||\mathrm{Corr}(X_i u_i, X_j u_j)||_2^2 + 2\lambda ||\mathrm{Corr}(X_i u_i, Y_{ref})||_2^2]\]

Appendix
A. The power method
The power method involves an inner loop for estimating the singular vectors within each outer loop of component definition. To begin, $u_0$ and $v_0$ can be initialised as vectors of ones.
For each iteration, update the latent scores by projecting the data:
\[\xi_k = X u_{k-1} \text{, } \omega_k = Y v_{k-1}\]Then, update the singular vector estimates with a least-squares regression from each other:
\[\begin{aligned} u_k &= X^T \omega_k (\omega_k^T \omega_k)^{-1} \text{, } u_k = u_k / ||u_k|| \\ v_k &= Y^T \xi_k (\xi_k^T \xi_k)^{-1} \text{, } v_k = v_k / ||v_k|| \end{aligned}\]Finally, check convergence based on $u_k - u_{k-1}$.
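A direct translation of this loop into numpy (illustrative helper, assuming centred data):

```python
import numpy as np

def pls_first_singular_vectors(X, Y, tol=1e-8, max_iter=500):
    """Power method for the first pair of singular vectors of X^T Y (Mode A)."""
    u = np.ones(X.shape[1]); u /= np.linalg.norm(u)
    v = np.ones(Y.shape[1]); v /= np.linalg.norm(v)
    for _ in range(max_iter):
        u_old = u
        xi, omega = X @ u, Y @ v                 # latent scores
        u = X.T @ omega / (omega @ omega)        # least-squares regression updates
        u /= np.linalg.norm(u)
        v = Y.T @ xi / (xi @ xi)
        v /= np.linalg.norm(v)
        if np.linalg.norm(u - u_old) < tol:      # convergence check on u_k - u_{k-1}
            break
    return u, v
```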
B. Optimisation of the correlation function
Quite obviously, the correlation function is invariant to the scales of the variables. Therefore, an easy way to continue is to set a convenient constraint on the scales of the latent variables:
\[\begin{aligned} (Xu)^T Xu &= u^T \Sigma_{xx} u = 1 \\ (Yv)^T Yv &= v^T \Sigma_{yy} v = 1 \end{aligned}\]This means that the denominator of the correlation function is actually $1$, so that $\rho = u^T \Sigma_{xy} v$. This also lets us convert the correlation function into its Lagrangian form:
\[\begin{aligned} L(\rho, u, v) = f - \lambda g &= \rho - \lambda_u (u^T \Sigma_{xx} u - 1) - \lambda_v (v^T \Sigma_{yy} v - 1) \\ &= u^T \Sigma_{xy} v - \lambda_u (u^T \Sigma_{xx} u - 1) - \lambda_v (v^T \Sigma_{yy} v - 1) \end{aligned}\]At the critical point, all partial derivatives are zero (ignoring those with respect to the $\lambda$s, since they just recover the constraints):
\[\begin{aligned} \frac{\partial{L(\rho, u, v)}}{\partial{u}} = 0 &= \Sigma_{xy} v - \lambda_u (2 \Sigma_{xx} u) \\ \frac{\partial{L(\rho, u, v)}}{\partial{v}} = 0 &= \Sigma_{xy}^T u - \lambda_v (2 \Sigma_{yy} v) \\ \end{aligned}\]Combining the two equations shows that:
\[\begin{aligned} u^T \Sigma_{xy} v &= 2 \lambda_u u^T \Sigma_{xx} u = 2 \lambda_v v^T\Sigma_{yy} v \\ \rho & = 2 \lambda_u = 2 \lambda_v \\ \end{aligned}\]So the solutions can be written as
\[\begin{aligned} \Sigma_{xy} v &= \rho \Sigma_{xx} u \\ \Sigma_{xy}^T u &= \rho \Sigma_{yy} v \\ \end{aligned}\]For decomposition purposes, we would like to write these equations in the form of eigen-equations:
\[\begin{aligned} \Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{xy}^T u &= \rho^2 u \\ \Sigma_{yy}^{-1} \Sigma_{xy}^T \Sigma_{xx}^{-1} \Sigma_{xy} v &= \rho^2 v \\ \end{aligned}\]which means that $u$ and $v$ are the eigenvectors of the matrices $\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{xy}^T$ and $\Sigma_{yy}^{-1} \Sigma_{xy}^T \Sigma_{xx}^{-1} \Sigma_{xy}$ respectively, with the same eigenvalue $\rho^2$. Similar to PLS, we can compute these vectors as the (rescaled) singular vectors from an SVD of $\Sigma_{xx}^{-\frac{1}{2}} \Sigma_{xy} \Sigma_{yy}^{-\frac{1}{2}}$.
1. Zhuang et al., 2020. A technical review of canonical correlation analysis for neuroscience applications. Hum. Brain Mapp. https://doi.org/10.1002/hbm.25090
2. Wegelin, 2000. A survey of partial least squares (PLS) methods, with emphasis on the two-block case.
3. Qi et al., 2018. Multimodal fusion with reference: Searching for joint neuromarkers of working memory deficits in schizophrenia. IEEE Trans. Med. Imaging. https://doi.org/10.1109/TMI.2017.2725306