Structural Equation Modeling
Structural equation modeling (SEM) is a method for modeling causal relationships. The model usually includes observed variables (or measurements) causally connected to one or more latent variables. For instance, a general intelligence model may include a latent variable ‘intelligence’, which causally impacts the observed scores from various intelligence tests. SEM is an umbrella term for modeling methods of this general form, covering methods such as confirmatory factor analysis, path analysis, hierarchical modeling, etc.
Confirmatory factor analysis (CFA)
CFA assumes a known relationship between the latent variable and the observed variables, which is then evaluated against (and ideally confirmed by) the data set. For instance, the general intelligence model may be specified as $Intelligence =\sim Test_1 + Test_2 + Test_3$, considering three different intelligence test measurements. In the regression form $Y = WX + \epsilon$, the test scores $Test_1$ to $Test_3$ would be the dependent variables $Y$, while the latent variable $Intelligence$ would be the independent variable $X$.
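To make the setup concrete, here is a minimal Python sketch that simulates data from this one-factor model; the loadings and error variances are made-up values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000                              # number of observations
W = np.array([0.8, 0.6, 0.7])         # hypothetical factor loadings
theta = np.array([0.36, 0.64, 0.51])  # hypothetical error variances

# Latent variable X ('intelligence'), one value per observation.
X = rng.normal(0.0, 1.0, size=n)

# Errors epsilon, independent across the three tests.
eps = rng.normal(0.0, np.sqrt(theta), size=(n, 3))

# Observed test scores: Y = W X + epsilon (one row per observation).
Y = np.outer(X, W) + eps
```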
Maximum likelihood estimation (MLE)
MLE is a common method for parameter estimation in CFA. First, we assume that the latent variables $X$ and the errors $\epsilon$ are normally distributed; since $Y = WX + \epsilon$ is a linear combination of the two, the observed variables $Y$ are then also normally distributed. Considering $N$ observations of $P$ observed variables, we can compute the log likelihood of the data given the parameters $\theta$:
\[\begin{aligned} L &= \log P(Y | \theta) \\ &= \sum^N_{i=1} \log [(2\pi)^{-\frac{P}{2}} |\Sigma_{\theta}|^{-\frac{1}{2}} e^{-\frac{1}{2} (Y_i - \mathbb{E}(Y))^T \Sigma_{\theta}^{-1} (Y_i - \mathbb{E}(Y))}] \\ &= \sum^N_{i=1} [-\frac{P}{2} \log (2\pi) - \frac{1}{2} \log |\Sigma_{\theta}| -\frac{1}{2} (Y_i - \mathbb{E}(Y))^T \Sigma_{\theta}^{-1} (Y_i - \mathbb{E}(Y))] \\ &= -\frac{NP}{2} \log (2\pi) - \frac{N}{2} \Big[ \log |\Sigma_{\theta}| + \frac{1}{N} \sum^N_{i=1} (Y_i - \mathbb{E}(Y))^T \Sigma_{\theta}^{-1} (Y_i - \mathbb{E}(Y)) \Big] \\ &= -\frac{NP}{2} \log (2\pi) - \frac{N}{2} \Big\{ \log |\Sigma_{\theta}| + \operatorname{tr} \Big[ \frac{1}{N} \sum^N_{i=1} (Y_i - \mathbb{E}(Y)) (Y_i - \mathbb{E}(Y))^T \Sigma_{\theta}^{-1} \Big] \Big\} \\ &= -\frac{NP}{2} \log (2\pi) - \frac{N}{2} [ \log |\Sigma_{\theta}| + \operatorname{tr} ( \Sigma_s \Sigma_{\theta}^{-1} ) ] \\\end{aligned}\]The trace step works because each quadratic form is a scalar (hence equal to its own trace) and the trace is invariant under cyclic permutation. Note that $\Sigma_s$ is the sample covariance (computed from the data), while $\Sigma_{\theta}$ is the model-implied covariance (a function of the free parameters $\theta$). We denote the covariance matrix of $X$ as $\Phi$, and that of $\epsilon$ as $\Theta$. This gives us the covariance matrix of the normal distribution:
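The final trace form translates directly into numpy. A minimal sketch, assuming the rows of `Y` are the $N$ observations and $\Sigma_{\theta}$ is full rank:

```python
import numpy as np

def log_likelihood(Y, Sigma_theta):
    """Gaussian log likelihood in the trace form derived above:
    -NP/2 log(2*pi) - N/2 [log|Sigma_theta| + tr(Sigma_s Sigma_theta^-1)]."""
    n, p = Y.shape
    Yc = Y - Y.mean(axis=0)        # Y_i - E(Y)
    Sigma_s = (Yc.T @ Yc) / n      # sample covariance
    sign, logdet = np.linalg.slogdet(Sigma_theta)
    trace = np.trace(Sigma_s @ np.linalg.inv(Sigma_theta))
    return -n * p / 2 * np.log(2 * np.pi) - n / 2 * (logdet + trace)
```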
\[\begin{aligned} \Sigma_{\theta} &= W \Phi W^T + \Theta \end{aligned}\]This follows from $Y = WX + \epsilon$: assuming $X$ and $\epsilon$ are uncorrelated, $\mathrm{Cov}(WX + \epsilon) = W\,\mathrm{Cov}(X)\,W^T + \mathrm{Cov}(\epsilon)$. With $M$ latent variables, $W$ is a $P \times M$ loading matrix and $\Phi$ is $M \times M$. If the model involves only one latent variable ($M=1$), $\Phi$ reduces to a scalar variance $\phi$ and the covariance matrix can be computed faster as a scaled outer product:
\[\begin{aligned} \Sigma_{\theta} &= \phi \, W W^T + \Theta \end{aligned}\]Finally, the factor loadings $W$ can be estimated by maximising $L$ (or minimising $-L$).
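Putting the pieces together, the sketch below estimates $W$ and $\Theta$ for the one-factor case by minimising $-L$ (dropping the constant term, which does not affect the optimum) with scipy. Fixing $\phi = 1$ is one common identification choice, since the scale of the latent variable is otherwise arbitrary; the log-parameterisation simply keeps the error variances positive:

```python
import numpy as np
from scipy.optimize import minimize

def fit_one_factor(Y):
    """Estimate loadings W and diagonal error variances Theta for a
    single latent variable (M = 1, phi fixed to 1 for identification)."""
    n, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    Sigma_s = (Yc.T @ Yc) / n                  # sample covariance

    def neg_log_lik(params):
        W = params[:p]                         # factor loadings
        theta = np.exp(params[p:])             # error variances (positive)
        Sigma = np.outer(W, W) + np.diag(theta)  # phi * W W^T + Theta, phi = 1
        sign, logdet = np.linalg.slogdet(Sigma)
        if sign <= 0:
            return np.inf
        # Equals -2L/N up to an additive constant, so it has the same minimiser.
        return logdet + np.trace(Sigma_s @ np.linalg.inv(Sigma))

    x0 = np.concatenate([np.full(p, 0.5), np.zeros(p)])
    result = minimize(neg_log_lik, x0, method="L-BFGS-B")
    return result.x[:p], np.exp(result.x[p:])

# Example: recover the hypothetical loadings from the simulated data above.
# W_hat, theta_hat = fit_one_factor(Y)
```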