This set of notes includes explanations and derivations for the backward-model interpretation approach from Haufe et al. 2014.
1. Forward Models & Backward Models
To start, we define forward and backward models with respect to the same data. Assume we have $N$ samples (or subjects), each with $M$ channels (or features), and $K$ latent factors (or target variables). Each data vector is $\boldsymbol{x}(n) = [x_1(n), …, x_M(n)]^T \in \mathbb{R}^{M}$ and each target vector is $\boldsymbol{y}(n) = [y_1(n), …, y_K(n)]^T \in \mathbb{R}^{K}$, for $n = 1, …, N$; stacking them column-wise gives an $M \times N$ data matrix and a $K \times N$ target matrix.
Forward models, also called generative models, represent the observed data as functions of latent variables. A linear forward model, in our case, would be $\boldsymbol{x}(n) = \boldsymbol{A}\boldsymbol{y}(n) + \epsilon(n)$, where the activation pattern is $\boldsymbol{A} = [\boldsymbol{a}_1, …, \boldsymbol{a}_K] \in \mathbb{R}^{M \times K}$ and the noise vector is $\epsilon(n) \in \mathbb{R}^M$. The parameters of this forward model ($\boldsymbol{A}$) are directly interpretable, i.e. their magnitudes can be used to assess how strongly the latent variables are expressed in each channel of the data.
Backward models, also called discriminative models, represent the target variables as functions of the observed data. In many scenarios, backward models must be employed because forward models are difficult to formulate or simply intractable. However, parameters of backward models cannot be interpreted simply by looking at their magnitudes (see the next section for an example from the original paper). In our case, a linear backward model would be $\boldsymbol{y}(n) = \boldsymbol{W}^T\boldsymbol{x}(n)$, where the transformation matrix $\boldsymbol{W} \in \mathbb{R}^{M \times K}$.
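As a concrete illustration of the two model types, here is a minimal numpy sketch; the dimensions, the noise level, and the ordinary-least-squares estimate of $\boldsymbol{W}$ are illustrative assumptions, not taken from the paper. It simulates data from a linear forward model and then fits a linear backward model to recover the latent factors.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 6, 2, 1000                      # channels, latent factors, samples (illustrative)

# Forward (generative) model: x(n) = A y(n) + eps(n)
A_true = rng.normal(size=(M, K))          # activation pattern A
Y = rng.normal(size=(K, N))               # latent factors, one column per sample
X = A_true @ Y + 0.5 * rng.normal(size=(M, N))   # observed data, M x N

# Backward (discriminative) model: y(n) = W^T x(n),
# with W fitted here by ordinary least squares (one possible choice of filter)
W = np.linalg.lstsq(X.T, Y.T, rcond=None)[0]     # shape (M, K)
Y_hat = W.T @ X                                  # recovered latent factors
print(np.corrcoef(Y[0], Y_hat[0])[0, 1])         # close to 1: good recovery
```

Even though the fitted $\boldsymbol{W}$ recovers the latent factors well, its entries need not resemble those of $\boldsymbol{A}$; interpreting them as if they did is exactly the pitfall illustrated next.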
2. A Classification Example
Here is an example from Haufe et al. 2014, which demonstrates how a classifier can assign larger (and positive) weights to features that do not contain the signal of interest than to the signal-related features. Consider data consisting of 2 classes, which are multivariate Gaussian distributed with different means (the positive class has $\mu_+ = [1.5, 0]^T$) but equal covariance matrices ($\Sigma = \begin{bmatrix} 1.02 & -0.30 \\ -0.30 & 0.15 \end{bmatrix}$). These are the blue and red data points in Figure 1.
By applying linear discriminant analysis (LDA) to the data, one can obtain a neat separation between the 2 classes, with a projection vector $\boldsymbol{w}_{LDA} \propto [1, 2]^T$ (purple arrow in Figure 1, resulting in the purple dashed line as the separating boundary). With this projection, we achieve a correlation of $r = 0.92$ between the projected data and the class labels ($y(n) \in \{-1, +1\}$). Simply looking at the magnitude of the weights, it may seem that channel $x_2$ is more important than (or twice as important as) channel $x_1$. Nevertheless, the large weight on $x_2$ mainly serves to cancel the noise that $x_2$ shares with $x_1$ (note the negative covariance between the two channels); it does not indicate that $x_2$ itself carries class information.
To see which channel really contains the signal of interest, we can project the data onto the two channels separately. These correspond to the light gray (projection onto $x_1$) and black (projection onto $x_2$) arrows/lines in Figure 1, giving data-class correlations of $r = 0.83$ for the projection onto $x_1$ and $r = 0.04$ for the projection onto $x_2$. Clearly, channel $x_1$ contains the class-specific information that drives the class separation here.
Figure 1 (Haufe et al. 2014): Two-dimensional example of a binary classification setting
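The toy example above can be reproduced numerically. The sketch below assumes the negative-class mean is $\mu_- = -\mu_+$ (an assumption on my part, consistent with the reported correlations) and uses the closed-form LDA direction $\boldsymbol{w}_{LDA} \propto \Sigma^{-1}(\mu_+ - \mu_-)$; exact correlation values will vary slightly with sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                   # samples per class (illustrative)
mu = np.array([1.5, 0.0])                  # class means assumed to be +/- mu
Sigma = np.array([[1.02, -0.30],
                  [-0.30, 0.15]])

X_pos = rng.multivariate_normal(mu, Sigma, N)
X_neg = rng.multivariate_normal(-mu, Sigma, N)
X = np.vstack([X_pos, X_neg])              # (2N, 2) data matrix
y = np.r_[np.ones(N), -np.ones(N)]         # class labels in {-1, +1}

# LDA direction for equal class covariances: w ∝ Sigma^{-1} (mu_+ - mu_-)
w_lda = np.linalg.solve(Sigma, 2 * mu)
print("w_LDA (normalized):", w_lda / w_lda[0])            # roughly [1, 2]

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("r(LDA projection, labels):", corr(X @ w_lda, y))   # roughly 0.92
print("r(x1, labels):", corr(X[:, 0], y))                 # roughly 0.83: signal channel
print("r(x2, labels):", corr(X[:, 1], y))                 # near 0: no class information
```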
3. Constructing Activation Patterns for Backward Models
In order to interpret backward model parameters, we can obtain a corresponding activation pattern matrix $\boldsymbol{A}$ based on $\boldsymbol{W}$. In the general case, where $K \le M$, we need to solve for the $\boldsymbol{A}$ that satisfies the linear forward model $\boldsymbol{x}(n) = \boldsymbol{A}\boldsymbol{y}(n) + \epsilon(n)$.
We will make 3 assumptions (without loss of generality):
- $\mathbb{E}[\boldsymbol{x}(n)]_n = \mathbb{E}[\boldsymbol{y}(n)]_n = \mathbb{E}[\epsilon(n)]_n = 0$, so that their covariance matrices are given by $\Sigma_{\boldsymbol{x}} = \mathbb{E}[\boldsymbol{x}(n)\boldsymbol{x}(n)^T]_n$, etc.
- the components of $\boldsymbol{y}(n)$ are linearly independent, i.e. $\Sigma_{\boldsymbol{y}}$ has full rank $K$ (for the backward model, this amounts to $\text{rank}(\boldsymbol{W}) = K$)
- $\epsilon(n)$ and $\boldsymbol{y}(n)$ are uncorrelated, i.e. $\mathbb{E}[\epsilon(n)\boldsymbol{y}(n)^T]_n = 0$
This leads us to Theorem 1, which states that the corresponding forward model for our backward model is unique and that its parameters are given by $\boldsymbol{A} = \Sigma_{\boldsymbol{x}}\boldsymbol{W}\Sigma_{\boldsymbol{y}}^{-1}$.
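Before going through the proof, here is a numerical sanity check of Theorem 1 on simulated data (a sketch; the dimensions, noise level, and least-squares backward model are illustrative assumptions). Note that $\Sigma_{\boldsymbol{y}}$ is the covariance of the backward-model output $\boldsymbol{W}^T\boldsymbol{x}(n)$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 6, 2, 100_000                   # large N so sample covariances are accurate

A_true = rng.normal(size=(M, K))          # ground-truth activation pattern
Y = rng.normal(size=(K, N))
X = A_true @ Y + 0.1 * rng.normal(size=(M, N))   # forward model with small noise

# Backward model: least-squares filter W with y(n) ≈ W^T x(n)
W = np.linalg.lstsq(X.T, Y.T, rcond=None)[0]     # (M, K)
Y_hat = W.T @ X                                  # backward-model output

# Theorem 1: A = Sigma_x W Sigma_y^{-1}, with Sigma_y the covariance of W^T x(n)
Sigma_x = np.cov(X)                              # (M, M)
Sigma_y = np.cov(Y_hat)                          # (K, K)
A_hat = Sigma_x @ W @ np.linalg.inv(Sigma_y)

print(np.allclose(W.T @ A_hat, np.eye(K)))       # True: W^T A = I (see proof below)
print(np.max(np.abs(A_hat - A_true)))            # small, up to noise and finite samples
```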
3.1. Proof of existence of a corresponding forward model
Assumptions 1 and 2 hold in general for properly preprocessed data. Due to assumption 2, we know that the covariance matrix $\Sigma_{\boldsymbol{y}}$ is invertible; hence, $\boldsymbol{A}$ is well-defined.
Now, we can show that assumption 3 holds when the forward model $\boldsymbol{x}(n) = \boldsymbol{A}\boldsymbol{y}(n) + \epsilon(n)$ is paired with the activation pattern $\boldsymbol{A} = \Sigma_{\boldsymbol{x}}\boldsymbol{W}\Sigma_{\boldsymbol{y}}^{-1}$. Specifically, these two formulations lead to
\[\epsilon(n) = \boldsymbol{x}(n) - \boldsymbol{A} \boldsymbol{y}(n) = \boldsymbol{x}(n) - \Sigma_{\boldsymbol{x}} \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} \boldsymbol{y}(n)\]

This in turn leads to
\[\begin{aligned} \mathbb{E} [\epsilon(n) \boldsymbol{y}(n)^T]_n & = \mathbb{E} [\boldsymbol{x}(n) \boldsymbol{y}(n)^T] - \Sigma_{\boldsymbol{x}} \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} \mathbb{E} [\boldsymbol{y}(n) \boldsymbol{y}(n)^T] \\ & = \mathbb{E} [\boldsymbol{x}(n) \boldsymbol{y}(n)^T] - \Sigma_{\boldsymbol{x}} \boldsymbol{W} \quad \text{(assumption 1)} \\ & = \mathbb{E} [\boldsymbol{x}(n) \boldsymbol{y}(n)^T] - \mathbb{E} [\boldsymbol{x}(n) \boldsymbol{x}(n)^T] \boldsymbol{W} \quad \text{(assumption 1)} \\ & = \mathbb{E} [\boldsymbol{x}(n) \boldsymbol{y}(n)^T] - \mathbb{E} [\boldsymbol{x}(n) \boldsymbol{y}(n)^T] \quad \text{(backward model)} \\ & = 0 \end{aligned}\]

3.2. Proof of Theorem 1
Assuming the corresponding forward model exists, we can start proving Theorem 1 by combining the formulations of the forward and backward models:
\[\boldsymbol{y}(n) = \boldsymbol{W}^T \boldsymbol{x}(n) = \boldsymbol{W}^T (\boldsymbol{A} \boldsymbol{y}(n) + \epsilon(n)) = \boldsymbol{W}^T \boldsymbol{A} \boldsymbol{y}(n) + \boldsymbol{W}^T \epsilon(n)\]

Then, we multiply both sides of the equation by $\boldsymbol{y}(n)^T$ and take the expected values over samples:
\[\begin{aligned} \mathbb{E} [\boldsymbol{y}(n) \boldsymbol{y}(n)^T] & = \mathbb{E} [\boldsymbol{W}^T \boldsymbol{A} \boldsymbol{y}(n) \boldsymbol{y}(n)^T] + \mathbb{E} [\boldsymbol{W}^T \epsilon(n) \boldsymbol{y}(n)^T] \\ & = \boldsymbol{W}^T \boldsymbol{A} \mathbb{E} [\boldsymbol{y}(n) \boldsymbol{y}(n)^T] + \boldsymbol{W}^T \mathbb{E} [\epsilon(n) \boldsymbol{y}(n)^T] \quad \text {($\boldsymbol{W}$ and $\boldsymbol{A}$ persist across samples)} \\ & = \boldsymbol{W}^T \boldsymbol{A} \mathbb{E} [\boldsymbol{y}(n) \boldsymbol{y}(n)^T] \quad \text {(assumption 3)} \\ \end{aligned}\]

Since $\boldsymbol{y}(n)$ are linearly independent (assumption 2), $\mathbb{E} [\boldsymbol{y}(n) \boldsymbol{y}(n)^T]$ has full rank. For the equation above to hold, it then follows that $\boldsymbol{W}^T \boldsymbol{A} = \boldsymbol{I}$.
Similarly, we can insert the backward model equation into the forward model equation:
\[\boldsymbol{x}(n) = \boldsymbol{A} \boldsymbol{y}(n) + \epsilon(n) = \boldsymbol{A} \boldsymbol{W}^T \boldsymbol{x}(n) + \epsilon(n) \\ \text{or} \\ \epsilon(n) = \boldsymbol{x}(n) - \boldsymbol{A} \boldsymbol{y}(n) = \boldsymbol{x}(n) - \boldsymbol{A} \boldsymbol{W}^T \boldsymbol{x}(n) = (\boldsymbol{I} - \boldsymbol{A} \boldsymbol{W}^T) \boldsymbol{x}(n)\]

Then, we can show that
\[\begin{aligned} \boldsymbol{W}^T \mathbb{E} [\epsilon(n) \epsilon(n)^T] & = \boldsymbol{W}^T \mathbb{E} [ (\boldsymbol{I} - \boldsymbol{A} \boldsymbol{W}^T) \boldsymbol{x}(n) \epsilon(n)^T] \\ & = \boldsymbol{W}^T (\boldsymbol{I} - \boldsymbol{A} \boldsymbol{W}^T) \mathbb{E} [\boldsymbol{x}(n) \epsilon(n)^T] \\ & = (\boldsymbol{W}^T - \boldsymbol{W}^T \boldsymbol{A} \boldsymbol{W}^T) \mathbb{E} [ \boldsymbol{x}(n) \epsilon(n)^T] \\ & = (\boldsymbol{W}^T - \boldsymbol{W}^T) \mathbb{E} [ \boldsymbol{x}(n) \epsilon(n)^T] \quad \text{since $\boldsymbol{W}^T \boldsymbol{A} = \boldsymbol{I}$} \\ & = 0 \end{aligned}\]

Finally, let's see if we can obtain $\boldsymbol{A}$:
\[\begin{aligned} \Sigma_{\boldsymbol{x}} \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} & = \mathbb{E} [\boldsymbol{x}(n) \boldsymbol{x}(n)^T] \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} \\ & = \mathbb{E} [(\boldsymbol{A} \boldsymbol{y}(n) + \epsilon(n)) (\boldsymbol{A} \boldsymbol{y}(n) + \epsilon(n))^T] \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} \quad \text {(forward model)} \\ & = (\boldsymbol{A} \mathbb{E} [\boldsymbol{y}(n) \boldsymbol{y}(n)^T] \boldsymbol{A}^T + \boldsymbol{A} \mathbb{E} [\boldsymbol{y}(n) \epsilon(n)^T] + \mathbb{E} [ \epsilon(n) \boldsymbol{y}(n)^T] \boldsymbol{A}^T + \mathbb{E} [\epsilon(n) \epsilon(n)^T]) \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} \\ & = (\boldsymbol{A} \Sigma_{\boldsymbol{y}} \boldsymbol{A}^T + \boldsymbol{A} \mathbb{E} [\boldsymbol{y}(n) \epsilon(n)^T] + \mathbb{E} [ \epsilon(n) \boldsymbol{y}(n)^T] \boldsymbol{A}^T + \Sigma_{\epsilon}) \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} \quad \text {(assumption 1)} \\ & = (\boldsymbol{A} \Sigma_{\boldsymbol{y}} \boldsymbol{A}^T + 0 + 0 + \Sigma_{\epsilon}) \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} \quad \text {(assumption 3)} \\ & = \boldsymbol{A} \Sigma_{\boldsymbol{y}} \boldsymbol{A}^T \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} + \Sigma_{\epsilon} \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1}\\ & = \boldsymbol{A} + \Sigma_{\epsilon} \boldsymbol{W} \Sigma_{\boldsymbol{y}}^{-1} \quad \text{since $\boldsymbol{A}^T \boldsymbol{W} = (\boldsymbol{W}^T \boldsymbol{A})^T = \boldsymbol{I}$} \\ & = \boldsymbol{A} + 0 \quad \text {since $\Sigma_{\epsilon} \boldsymbol{W} = (\boldsymbol{W}^T \Sigma_{\epsilon})^T = 0$, as $\Sigma_{\epsilon}$ is symmetric} \\ & = \boldsymbol{A} \end{aligned}\]

3.3. Special cases
- In the special case where $K = M$, $\boldsymbol{W}^T \boldsymbol{A} = \boldsymbol{I}$ with both matrices square, so we can simply compute $\boldsymbol{A} = (\boldsymbol{W}^T)^{-1}$ (this case and the uncorrelated case below are checked numerically in the sketch after this list)
- If the components of $\boldsymbol{y}(n)$ are uncorrelated (trivially true for $K = 1$), $\Sigma_{\boldsymbol{y}}$ is diagonal, so each column of $\boldsymbol{A}$ is proportional to the corresponding column of $\Sigma_{\boldsymbol{x}} \boldsymbol{W} = \mathbb{E} [\boldsymbol{x}(n) \boldsymbol{y}(n)^T]_n$
- If both $\boldsymbol{x}(n)$ and $\boldsymbol{y}(n)$ are white (uncorrelated components with equal variances), then $\boldsymbol{A} \propto \boldsymbol{W}$. However, this is hardly ever the case for neuroimaging data.
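To make these special cases concrete, here is a small numpy check (again a sketch on simulated data; the dimensions, noise level, and the least-squares filter in the second part are illustrative assumptions): the square case recovers $\boldsymbol{A} = (\boldsymbol{W}^T)^{-1}$, and the $K = 1$ case recovers the pattern direction from $\Sigma_{\boldsymbol{x}}\boldsymbol{w}$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = K = 4
N = 200_000

# Square case (K = M): noiseless x(n) = A y(n), so W = A^{-T} gives W^T x(n) = y(n)
A_true = rng.normal(size=(M, K))
Y = rng.normal(size=(K, N))
X = A_true @ Y
W = np.linalg.inv(A_true).T
Sigma_x, Sigma_y = np.cov(X), np.cov(W.T @ X)
A_general = Sigma_x @ W @ np.linalg.inv(Sigma_y)              # Theorem 1
print(np.allclose(A_general, np.linalg.inv(W.T), atol=1e-6))  # True: A = (W^T)^{-1}

# K = 1 case (trivially uncorrelated y): the single pattern column a is
# proportional to Sigma_x w = E[x(n) y(n)]
a_true = rng.normal(size=M)
y1 = rng.normal(size=N)
X1 = np.outer(a_true, y1) + 0.5 * rng.normal(size=(M, N))     # forward model with noise
w = np.linalg.lstsq(X1.T, y1, rcond=None)[0]                  # least-squares filter
a_hat = np.cov(X1) @ w                                        # Sigma_x w
print(np.allclose(a_hat / np.linalg.norm(a_hat),
                  a_true / np.linalg.norm(a_true), atol=1e-2))  # same direction
```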