Journal Club Notes by Jianxiao

This note includes general derivations [1] and specific derivations for the segmentation application [2].

1. Generic Auto-Encoding Variational Bayes (AEVB)

1.1. Problem scenario


Figure 1. Directed graphical model considered

Consider a model consisting of a dataset $X = \{ x^{(i)} \}^N_{i=1}$ and its associated latent variables $z$, shown in Figure 1. Values of variable $z$ are generated from a prior distribution $p_\theta(z)$; values of variable $x$ are generated from the conditional distribution $p_\theta(x|z)$.

Inference based on this model can be intractable. Specifically, the marginal likelihood $p_\theta(x) = \int p_\theta(z)p_\theta(x|z) \mathop{dz}$ is intractable as it requires integrating over the latent variable $z$. As a result, the posterior density $p_\theta(z|x) = \frac{p_\theta(x|z)p_\theta(z)}{p_\theta(x)}$ is also intractable as it involves evaluating $p_\theta(x)$ first.

The Auto-Encoding Variational Bayes (AEVB) approach is proposed for inference of the marginal likelihood $p_\theta(x)$ and the posterior density $p_\theta(z|x)$; their approximations also help to estimate the parameters $\theta$ using ML or MAP estimation.

1.2. Variational lower bound

We start by introducing an approximation $q_\phi(z|x)$ to the intractable true posterior $p_\theta(z|x)$. We can consider the KL divergence between these two distributions:

\[\begin{aligned} KL [q_\phi(z|x) || p_\theta(z|x)] &= \mathbb{E}_q [\log q_\phi(z|x) - \log p_\theta(z|x)] \\ & = \mathbb{E}_q [\log q_\phi(z|x) - \log \frac{p_\theta(x,z)}{p_\theta(x)}] \\ & = \mathbb{E}_q [\log q_\phi(z|x) - \log p_\theta(x,z) + \log p_\theta(x)] \\ \end{aligned}\]

Rearranging the terms, and noting that $\log p_\theta(x)$ does not depend on $z$ (so $\mathbb{E}_q [\log p_\theta(x)] = \log p_\theta(x)$), gives:

\[\log p_\theta(x) = KL [q_\phi(z|x) || p_\theta(z|x)] - \mathbb{E}_q [\log q_\phi(z|x) - \log p_\theta(x,z)]\]

Since the first term, the KL divergence, is non-negative, the second term is a lower bound on the log marginal likelihood $\log p_\theta(x)$, called the variational lower bound. The variational lower bound can be written in two forms:

1) The joint-contrastive form

\[\begin{aligned} \mathcal{L}(\theta,\phi;x) &= \mathbb{E}_q [-\log q_\phi(z|x) + \log p_\theta(x,z)] \end{aligned}\]

2) The prior-contrastive form

\[\begin{aligned} \mathcal{L}(\theta,\phi;x) &= \mathbb{E}_q [-\log q_\phi(z|x) + \log p_\theta(x,z)] \\ & = \mathbb{E}_q [-\log q_\phi(z|x) + \log (p_\theta(x|z)p_\theta(z))] \\ & = \mathbb{E}_q [-\log q_\phi(z|x) + \log p_\theta(x|z) + \log p_\theta(z)] \\ & = \mathbb{E}_q [\log p_\theta(x|z)] - \mathbb{E}_q [\log q_\phi(z|x) - \log p_\theta(z)] \\ & = \mathbb{E}_q [\log p_\theta(x|z)] - KL [q_\phi(z|x) || p_\theta(z)] \end{aligned}\]

The prior-contrastive form can be related to auto-encoders: the KL term acts as a regularisation term, while the expectation term is the (negative) reconstruction error ($z$ is sampled given $x$ via $g_\phi(\epsilon,x)$; $x$ is then reconstructed from $z$ via $p_\theta(x|z)$).

1.3. Stochastic Gradient Variational Bayes (SGVB) estimator

In order to optimise $\mathcal{L}(\theta,\phi;x)$, we need to differentiate it w.r.t. $\theta$ and $\phi$. However, differentiating w.r.t. $\phi$ is problematic, as the usual (score-function) Monte Carlo gradient estimator exhibits high variance [3]. This estimator handles the \(\mathbb{E}_q[\cdot]\) terms by sampling $z$ from $q_\phi(z|x)$:

\[\begin{aligned} \nabla_\phi \mathbb{E}_q [f(z)] &= \mathbb{E}_q [f(z) \nabla_\phi \log q_\phi(z|x)] \\ & \approx \frac{1}{L} \sum^L_{l=1} f(z[l]) \nabla_\phi \log q_\phi(z[l]|x) \end{aligned}\]

Instead, we use a reparametrisation trick so that the Monte Carlo estimate becomes differentiable. We reparametrise $z$ as:

\[z = g_\phi(\epsilon,x) \quad \text{with} \quad \epsilon \sim p(\epsilon)\]

where the transformation $g_\phi(\cdot)$ and auxiliary noise $\epsilon$ are chosen based on the form of $q_\phi(z|x)$.


Figure 2. How the reparametrisation trick changes the network structure

Essentially, we are assuming that sampling $z$ from $q_\phi(z|x)$ is equivalent to sampling $\epsilon$ from $p(\epsilon)$ and transforming it deterministically with $g_\phi(\epsilon,x)$ (Figure 2). As a result, we can rewrite integrals involving $q_\phi(z|x)$ in terms of $\epsilon$ instead:

\[\begin{aligned} \int q_\phi(z|x) f(z) \mathop{dz} &= \int p(\epsilon) f(g_\phi(\epsilon,x)) \mathop{d\epsilon} \\ & \approx \frac{1}{L} \sum^L_{l=1} f(g_\phi(\epsilon[l],x)) \end{aligned}\]
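
As a sanity check, the snippet below compares the two gradient estimators on a toy example where the exact gradient is known. The setting is my own choice for illustration (it is not from either paper): $q_\phi(z) = \mathcal{N}(\mu, \sigma^2)$ and $f(z) = z^2$, so that $\nabla_\mu \mathbb{E}_q[f(z)] = 2\mu$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (illustrative only): q(z) = N(mu, sigma^2), f(z) = z^2,
# so E_q[f(z)] = mu^2 + sigma^2 and the exact gradient w.r.t. mu is 2*mu.
mu, sigma, L = 1.5, 1.0, 1000

# 1) Score-function estimator: (1/L) * sum_l f(z_l) * d/dmu log q(z_l), z_l ~ q
z = rng.normal(mu, sigma, size=L)
grad_score = np.mean(z**2 * (z - mu) / sigma**2)

# 2) Reparametrised estimator: z = mu + sigma*eps, eps ~ N(0,1),
#    so df(z)/dmu = 2*z and the estimate is (1/L) * sum_l 2*z_l
z_rep = mu + sigma * rng.normal(size=L)
grad_rep = np.mean(2 * z_rep)

print("exact gradient          :", 2 * mu)
print("score-function estimate :", grad_score)
print("reparametrised estimate :", grad_rep)
```

Across repeated runs the score-function estimate fluctuates much more than the reparametrised one, which is the motivation for the trick.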

We can then compute the SGVB estimator for both forms of the variational lower bound, where $z[l] = g_\phi(\epsilon[l],x)$ with $\epsilon[l] \sim p(\epsilon)$:

1) The joint-contrastive form

\[\begin{aligned} \widetilde{\mathcal{L}}(\theta,\phi;x) = \frac{1}{L} \sum^L_{l=1} (-\log q_\phi(z[l]|x) + \log p_\theta(x,z[l])) \end{aligned}\]

2) The prior-contrastive form

\[\begin{aligned} \widetilde{\mathcal{L}}(\theta,\phi;x) =\frac{1}{L} \sum^L_{l=1} (\log p_\theta(x|z[l])) - KL [q_\phi(z|x) || p_\theta(z)] \end{aligned}\]

The prior-contrastive form typically has lower variance, but relies on the KL term having a closed form. In particular, if both the prior and the approximate posterior are Gaussian, the KL term can be computed analytically (see Appendix A).
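
To make the bound concrete, here is a minimal numerical sketch on a toy model of my own choosing (not from the papers): $p(z) = \mathcal{N}(0,1)$ and $p(x|z) = \mathcal{N}(z,1)$, so the exact marginal is $p(x) = \mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The prior-contrastive SGVB estimate should match $\log p(x)$ when $q$ equals the true posterior, and fall below it otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative): p(z) = N(0,1), p(x|z) = N(z,1)  =>  p(x) = N(0,2).
x = 1.3
log_px_exact = -0.5 * (np.log(2 * np.pi * 2.0) + x**2 / 2.0)

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def elbo(mu_q, var_q, L=100_000):
    """Prior-contrastive SGVB estimate with q(z|x) = N(mu_q, var_q)."""
    z = mu_q + np.sqrt(var_q) * rng.normal(size=L)       # reparametrised samples
    recon = np.mean(log_normal(x, z, 1.0))               # (1/L) sum_l log p(x|z_l)
    kl = -0.5 * (1 + np.log(var_q) - mu_q**2 - var_q)    # closed-form KL vs N(0,1)
    return recon - kl

print("log p(x)                 :", log_px_exact)
print("ELBO with the true post. :", elbo(x / 2, 0.5))    # matches log p(x)
print("ELBO with q = N(0,1)     :", elbo(0.0, 1.0))      # strictly smaller
```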

2. Application to biomedical segmentation

2.1. Problem scenario

CNN-based biomedical segmentation usually requires a large set of annotated data and does not take anatomical knowledge into account. In the Dalca et al. paper [2], a two-step AEVB approach is proposed to carry out unsupervised segmentation learning with an anatomical prior incorporated.

Consider a model consisting of image data $x$, its segmentation map $s$, and the latent variable $z$, which represents the underlying shape embedding (Figure 3).


Figure 3 Generative model for the segmentation problem

The model generates values based on the following prior and likelihood densities:

1) the prior probability of the latent variable $z$ is modeled by a standard normal distribution

\[p(z) = \mathcal{N}(z;0,\mathrm{I})\]

2) the labels $s$ are drawn from a categorical distribution given $z$

\[p_{\theta_{s|z}}(s|z) = \prod_j f_{j,s[j]} (z|\theta_{s|z})\]

where $f_{j,m}(\cdot|\theta_{s|z})$ is the probability of label $m$ at voxel $j$

3) the voxel intensity values $x$ are drawn from normal distributions with diagonal covariance matrix

\[p_{\theta_{x|s}}(x|s) = \prod_j \prod_m \mathcal{N} (x[j]; \mu_m, \sigma_m^2)^{\delta(s[j]=m)}\]

where $\delta(s[j]=m) = 1$ if $s[j]=m$ and $0$ otherwise.

We aim to learn segmentation from an image using maximum a posteriori (MAP) estimation:

\[\begin{aligned} \hat{s} &= \mathop{\mathrm{arg\,max}}_{s} \log p_\theta(s|x) \\ & =\mathop{\mathrm{arg\,max}}_{s} \log \frac{p_\theta(s, x)}{p(x)} \\ & =\mathop{\mathrm{arg\,max}}_{s} \log p_\theta(s, x) \end{aligned}\]

However, $p_\theta(s,x)$ is intractable as it requires integrating over $z$:

\[p_\theta(s, x) = p_\theta(x|s)p(s) = p_\theta(x|s) \int p_\theta(s|z)p(z) \mathop{dz}\]

Similarly, $p_\theta(z|x,s)$ is also intractable, which means the EM algorithm does not work in this case.

Instead, we will optimise $\log p_\theta(x)$ and $\log p_\theta(s)$ by optimising their respective variational lower bounds. Note that we cannot optimise the two together because we do not have paired sets of $x$ and $s$ in the unsupervised learning setting. Nevertheless, we want to learn mappings between the latent space and both the image space and the label space, so that we can apply the anatomical knowledge learnt from the label-latent mapping to new images.

2.2. SGVB estimator derivations

2.2.1. Learning anatomical prior

Using the AEVB framework, we approximate the true posterior $p_\theta(z|s)$ with $q_\phi(z|s)$, modeled as a normal distribution with a diagonal covariance matrix:

\[q_\phi(z|s) = \mathcal{N} (z; \mu_{z|s}, \sigma_{z|s}^2)\]

The respective variational lower bound is:

\[\mathcal{L}(\theta,\phi;s) = \mathbb{E}_q [\log p_\theta(s|z)] - KL [q_\phi(z|s) || p(z)]\]

Using the reparametrisation trick, we can estimate the expectation term as:

\[\begin{aligned} \mathbb{E}_q [\log p_\theta(s|z)] &\approx \frac{1}{L} \sum^L_{l=1} \log p_\theta(s|z[l]) \\ & = \frac{1}{L} \sum_l \log (\prod_j f_{j,s[j]} (z[l]|\theta_{s|z})) \\ & = \frac{1}{L} \sum_l \sum_j \log f_{j,s[j]} (z[l]|\theta_{s|z}) \end{aligned}\]

Since both the true prior $p(z)$ and the approximate posterior $q_\phi(z|s)$ are Gaussian, the KL term has a closed form (see Appendix A for the proof; here $j$ indexes the latent dimensions):

\[KL [q_\phi(z|s) || p(z)] = -\frac{1}{2} \sum_j (1 + \log (\sigma_{z|s}[j]^2) - \mu_{z|s}[j]^2 - \sigma_{z|s}[j]^2)\]
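
As an illustration, the Monte Carlo estimate of the expectation term amounts to indexing the decoder output $f_{j,m}(z[l])$ at the observed labels $s[j]$ and averaging over samples. The sketch below uses random stand-ins for $f$ and $s$ with toy sizes; these are not the paper's network outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
J, M, L = 6, 4, 5                       # toy numbers of voxels, labels and samples

s = rng.integers(0, M, size=J)          # input segmentation labels s[j]

# Stand-in decoder outputs: f[l, j, m] = f_{j,m}(z[l]), a probability over labels m
logits = rng.normal(size=(L, J, M))
f = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# (1/L) * sum_l sum_j log f_{j, s[j]}(z[l])
recon = np.mean(np.sum(np.log(f[:, np.arange(J), s]), axis=1))
print(recon)
```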

2.2.2. Unsupervised learning

Similar to the previous section, we approximate the true posterior $p_\theta(z|x)$ with $q_\phi(z|x)$, modeled as a normal distribution with a diagonal covariance matrix:

\[q_\phi(z|x) = \mathcal{N} (z; \mu_{z|x}, \sigma_{z|x}^2)\]

The respective variational lower bound is:

\[\mathcal{L}(\theta,\phi;x) = \mathbb{E}_q [\log p_\theta(x|z)] - KL [q_\phi(z|x) || p(z)]\]

Again, the KL term has a closed-form solution:

\[KL [q_\phi(z|x) || p(z)] = -\frac{1}{2} \sum_j (1 + \log (\sigma_{z|x}[j]^2) - \mu_{z|x}[j]^2 - \sigma_{z|x}[j]^2)\]

The expectation term, on the other hand, is intractable, as computing $p_\theta(x|z)$ requires integrating over all possible segmentations $s$. Instead, we will use a lower bound of the term, derived using Jensen’s inequality (more details in Appendix B):

\[\begin{aligned} \mathbb{E}_q [\log p_\theta(x|z)] &= \mathbb{E}_q [\log \int_s p_\theta(x,s|z) \mathop{ds}] \\ & = \mathbb{E}_q [\log \int_s p_\theta(s|z) p_\theta(x|s) \mathop{ds}] \\ & \ge \mathbb{E}_q [\int_s p_\theta(s|z) \log p_\theta(x|s) \mathop{ds}] \end{aligned}\]

For a given $z$, the inner integral evaluates to:

\[\begin{aligned} \int_s p_\theta(s|z) \log p_\theta(x|s) \mathop{ds} &= \int_s \prod_j f_{j,s[j]} (z|\theta_{s|z}) \log \prod_j \prod_m \mathcal{N} (x[j]; \mu_m, \sigma_m^2)^{\delta(s[j]=m)} \mathop{ds} \\ & = \int_s \prod_j f_{j,s[j]} (z|\theta_{s|z}) \sum_j \sum_m \delta(s[j]=m) \log \mathcal{N} (x[j]; \mu_m, \sigma_m^2) \mathop{ds} \\ & = \sum_j \sum_m \left( \int_s \delta(s[j]=m) \prod_{j'} f_{j',s[j']} (z|\theta_{s|z}) \mathop{ds} \right) \log \mathcal{N} (x[j]; \mu_m, \sigma_m^2) \\ & = \sum_j \sum_m f_{j,m} (z|\theta_{s|z}) \log \mathcal{N} (x[j]; \mu_m, \sigma_m^2) \\ & = \sum_j \sum_m f_{j,m} (z|\theta_{s|z}) \left( \frac{-(x[j]-\mu_m)^2}{2\sigma_m^2} - \frac{1}{2} \log (2\pi\sigma_m^2) \right) \end{aligned}\]

Dropping the normalisation terms $-\frac{1}{2} \log (2\pi\sigma_m^2)$ and sampling $z$ with the reparametrisation trick, we can then estimate the expectation term as:

\[\mathbb{E}_q [\int_s p_\theta(s|z) \log p_\theta(x|s) \mathop{ds}] \approx \frac{1}{L} \sum_l \sum_j \sum_m f_{j,m} (z[l]|\theta_{s|z}) \frac{-(x[j]-\mu_m)^2}{2\sigma_m^2}\]
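
The reconstruction term can be computed with a couple of array operations. The sketch below follows the estimator above (normalisation terms dropped), again with toy stand-ins for the decoder output $f$, the per-label parameters $\mu_m, \sigma_m$ and a single sample $z[l]$:

```python
import numpy as np

rng = np.random.default_rng(1)
J, M = 6, 4                                    # toy numbers of voxels and labels

x = rng.normal(size=J)                         # observed voxel intensities x[j]
mu_m = rng.normal(size=M)                      # per-label means (learnt in practice)
sigma_m = 0.5 + np.abs(rng.normal(size=M))     # per-label standard deviations

# Stand-in decoder output for one sample z[l]: f[j, m] = f_{j,m}(z[l])
logits = rng.normal(size=(J, M))
f = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# sum_j sum_m f_{j,m}(z[l]) * -(x[j] - mu_m)^2 / (2 * sigma_m^2)
recon = np.sum(f * -((x[:, None] - mu_m[None, :]) ** 2) / (2 * sigma_m[None, :] ** 2))
print(recon)
```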

2.3. Network architecture

Two CNNs are used: one for learning the anatomical prior and one for unsupervised learning of segmentation from images.

Figure 4 shows the auto-encoder architecture for learning anatomical prior. This network learns the mapping between the latent space and the label space, using only segmentation maps.


Figure 4 Auto-encoder to learn anatomical prior

Given an input segmentation map $s$, the encoder outputs the parameters of the posterior $q_\phi(z|s)$: $\mu_{z|s}$ and $\sigma_{z|s}$. The latent variable $z$ is then sampled using the reparametrisation trick, $z = g(\epsilon,s) = \mu_{z|s} + \sigma_{z|s} \epsilon$ with $\epsilon \sim \mathcal{N}(0,\mathrm{I})$. The transformation $g(\cdot)$ produces a sample from the normal distribution by adding scaled noise to the mean. Finally, $s$ is reconstructed by the decoder via $p_\theta(s|z)$, represented by the top output layer in Figure 4.

An additional prior is added in the actual implementation, represented by the bottom output layer in Figure 4. This location prior $p_{loc}(s)$ is computed as the voxel-wise frequency of each label in the prior training dataset. The output segmentation is obtained by multiplying the prior $p_{loc}(s)$ with the reconstructed segmentation probabilities from $p_\theta(s|z)$ (adding the logarithms in the actual implementation).
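
A minimal sketch of this combination step, assuming we already have the decoder's per-voxel label probabilities and a table of label counts from the prior training set (both random stand-ins here):

```python
import numpy as np

rng = np.random.default_rng(2)
J, M = 6, 4                                          # toy numbers of voxels and labels

log_f = np.log(rng.dirichlet(np.ones(M), size=J))    # decoder output: log p_theta(s[j]=m | z)
counts = rng.integers(1, 10, size=(J, M))            # label counts from the prior training set
log_p_loc = np.log(counts / counts.sum(axis=1, keepdims=True))   # log voxel-wise frequencies

# Multiply the two probabilities by adding their logarithms,
# then take the voxel-wise arg max as the output segmentation.
s_hat = np.argmax(log_f + log_p_loc, axis=1)
print(s_hat)
```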

The parameters of $q_\phi(z|s)$ and $p_\theta(s|z)$, as well as the output segmentation labels $s[j]$, are then used to estimate and optimise the SGVB lower bound.
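
For concreteness, below is a heavily simplified stand-in for the prior auto-encoder: fully connected layers instead of the paper's convolutional encoder/decoder, toy sizes, and no location prior. It is only meant to show how the encoder heads, the reparametrised sampling and the two loss terms fit together; all names and sizes are my own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

J, M, D = 6, 4, 3                                   # toy voxel count, label count, latent size

class PriorAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(J * M, 32)             # encoder trunk (one-hot segmentation input)
        self.mu = nn.Linear(32, D)                  # heads for mu_{z|s} and log sigma^2_{z|s}
        self.log_var = nn.Linear(32, D)
        self.dec = nn.Linear(D, J * M)              # decoder producing per-voxel label scores

    def forward(self, s_onehot):
        h = F.relu(self.enc(s_onehot.flatten(1)))
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparametrisation trick
        logits = self.dec(z).view(-1, J, M)         # softmax over m gives f_{j,m}(z)
        return logits, mu, log_var

model = PriorAutoEncoder()
s = torch.randint(0, M, (2, J))                     # a batch of two toy segmentation maps
logits, mu, log_var = model(F.one_hot(s, M).float())

# Reconstruction term: sum_j log f_{j, s[j]}(z) (cross-entropy gives its negative)
recon = -F.cross_entropy(logits.view(-1, M), s.view(-1), reduction="sum")
# Closed-form KL[q(z|s) || N(0, I)] from section 2.2.1
kl = -0.5 * torch.sum(1 + log_var - mu**2 - log_var.exp())

loss = -(recon - kl)                                # minimise the negative lower bound
loss.backward()
```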

Figure 5 shows the architecture for unsupervised learning of segmentation from input images. The decoder part of this network is copied from the prior-learning network in Figure 4, which helps to incorporate the anatomical knowledge learnt.


Figure 5 Architecture for unsupervised learning

The image encoder is first trained in a similar fashion to the prior auto-encoder; the pre-trained weights are then used for initialisation during the actual training.

Given an input scan $x$, the image encoder outputs the parameters of the posterior $q_\phi(z|x)$: $\mu_{z|x}$ and $\sigma_{z|x}$. The latent variable $z$ is then sampled using the reparametrisation trick, $z = g(\epsilon,x) = \mu_{z|x} + \sigma_{z|x} \epsilon$. The already-trained prior decoder then outputs $s$, which is used to generate the reconstructed $x$ via $p_\theta(x|s)$.

The parameters of $q_\phi(z|x)$ and $p_\theta(x|s)$, as well as the reconstructed image $x$, are then used to estimate and optimise the SGVB lower bound.

2.4. Results

The networks are tested on two datasets (Figure 6):

1) T1w scan dataset: more than 14,000 T1-weighted MRI scans from ADNI, ABIDE, GSP, etc., all resampled to 256x256x256 (1mm isotropic) and cropped to 160x192x224 to remove entirely-background voxels.

2) T2-FLAIR scan dataset: 3800 T2-FLAIR scans from ADNI (5mm slice spacing), linearly registered to T1w images (using ANTs). No ground truth.


Figure 6 Example T1w and T2-FLAIR images

A prior training set is composed of 5000 T1w images with ground truth segmentation maps generated by FreeSurfer (with manual correction and QC). This set is used to train the prior auto-encoder. The rest of the T1w images are split into training, validation and test sets for unsupervised learning. In the T2 scan case, subjects common to both datasets are excluded from the prior training set.

Figure 7 shows segmentation results for 3 subjects in the T1w scan dataset, in comparison to their respective ground truths. Figure 8 shows one subject’s segmentation results from the T2-FLAIR scan dataset. The estimated segmentation boundaries are generally aligned with the ground truth boundaries, but are much smoother and lack fine detail.


Figure 7 Example segmentation results for T1w scan dataset


Figure 8 Example segmentation results for T2-FLAIR scan dataset

Appendix

A. Closed-form solution for KL divergence

When two distributions are both Gaussian, the KL divergence between them has a closed form. I will start with the proof for the generic case, and then show the solution for a special case.

Consider two multivariate normal distributions of dimension $k$, $p(x) = \mathcal{N} (\mu_1, \Sigma_1)$ and $q(x) = \mathcal{N} (\mu_2, \Sigma_2)$. There is no constraint on the form of the covariance matrices. Since the KL divergence is defined as $KL(p(x)||q(x)) = \mathbb{E}_p [\log p(x) - \log q(x)]$ for continuous distributions, we start by expressing the log-likelihoods of the two distributions.

For $p(x)$:

\[\begin{aligned} \log p(x) &= \log \left(\frac{1}{(2\pi)^{\frac{k}{2}} |\Sigma_1|^{\frac{1}{2}}} \exp\left(-\frac{1}{2} (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\right)\right) \\ & = -\frac{k}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_1| - \frac{1}{2} (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) \end{aligned}\]

where \(\vert \cdot \vert\) is the matrix determinant.

Similarly, for $q(x)$:

\[\log q(x) = -\frac{k}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_2| - \frac{1}{2} (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\]

Then we have:

\[\begin{aligned} KL(p(x)||q(x)) &= \mathbb{E}_p [\log p(x) - \log q(x)] \\ & = \mathbb{E}_p [( -\frac{k}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_1| - \frac{1}{2} (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)) \\ & \quad - (-\frac{k}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_2| - \frac{1}{2} (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2))] \\ & = \mathbb{E}_p [\frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + \frac{1}{2} (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)] \\ & = \frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} \mathbb{E}_p [\mathrm{Tr} ((x-\mu_1)(x-\mu_1)^T \Sigma_1^{-1})] + \frac{1}{2} \mathbb{E}_p [(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)] \\ & = \frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} \mathrm{Tr} (\mathbb{E}_p [(x-\mu_1)(x-\mu_1)^T] \Sigma_1^{-1}) + \frac{1}{2} \mathbb{E}_p [(x-\mu_1+\mu_1-\mu_2)^T \Sigma_2^{-1} (x-\mu_1+\mu_1-\mu_2)] \\ & = \frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} \mathrm{Tr} (\Sigma_1 \Sigma_1^{-1}) + \frac{1}{2} \mathbb{E}_p [(x-\mu_1)^T\Sigma_2^{-1}(x-\mu_1) \\ & \quad + 2(x-\mu_1)^T \Sigma_2^{-1} (\mu_1-\mu_2) + (\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2)] \\ & = \frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{k}{2} + \frac{1}{2} \mathrm{Tr} (\Sigma_1 \Sigma_2^{-1}) + 0 + \frac{1}{2}(\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2) \\ & = \frac{1}{2} [\log \frac{|\Sigma_2|}{|\Sigma_1|} - k + \mathrm{Tr} (\Sigma_1 \Sigma_2^{-1}) + (\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2)] \end{aligned}\]

Some tricks used are:

1) the trace trick: a scalar equals its own trace, so $\mathbb{E}_p [(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)] = \mathbb{E}_p [\mathrm{Tr} ((x-\mu_1)(x-\mu_1)^T \Sigma_1^{-1})] = \mathrm{Tr} (\mathbb{E}_p [(x-\mu_1)(x-\mu_1)^T] \Sigma_1^{-1}) = \mathrm{Tr} (\Sigma_1 \Sigma_1^{-1})$, using the cyclic property of the trace and the linearity of expectation (the same trick gives the $\mathrm{Tr}(\Sigma_1 \Sigma_2^{-1})$ term)

2) $\mathbb{E}_p [(x-\mu_1)^T \Sigma_2^{-1} (\mu_1-\mu_2)] = 0$: the expression is linear in $(x-\mu_1)$, and $\mathbb{E}_p [x-\mu_1] = 0$.

Now, consider the special case where $p(x) = \mathcal{N} (\mu_1, \mathrm{diag}(\sigma_1^2))$, a Gaussian with diagonal covariance, and $q(x) = \mathcal{N} (0, \mathrm{I})$. In this case, the solution simplifies even further:

\[\begin{aligned} KL(p(x)||q(x)) &= \frac{1}{2} [\log \frac{|\Sigma_2|}{|\Sigma_1|} - k + \mathrm{Tr} (\Sigma_1 \Sigma_2^{-1}) + (\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2)] \\ & = \frac{1}{2} [\log \frac{|\mathrm{I}|}{|\mathrm{diag}(\sigma_1^2)|} - k + \mathrm{Tr} (\mathrm{diag}(\sigma_1^2) \, \mathrm{I}^{-1}) + (\mu_1-0)^T \mathrm{I}^{-1} (\mu_1-0)] \\ & = \frac{1}{2} \sum_j (-\log \sigma_1[j]^2 - 1 + \sigma_1[j]^2 + \mu_1[j]^2) \\ & = -\frac{1}{2} \sum_j (1 + \log \sigma_1[j]^2 - \mu_1[j]^2 - \sigma_1[j]^2) \end{aligned}\]

This is the solution used in section 2.2.
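
A quick numerical check of this closed form, comparing it against a plain Monte Carlo estimate of $\mathbb{E}_p[\log p(x) - \log q(x)]$ for an arbitrary diagonal Gaussian (values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

mu_1 = np.array([0.5, -1.0, 2.0])                  # p(x) = N(mu_1, diag(sigma_1^2))
sigma_1 = np.array([1.2, 0.8, 0.5])                # q(x) = N(0, I)

# Closed form: -1/2 * sum_j (1 + log sigma_1[j]^2 - mu_1[j]^2 - sigma_1[j]^2)
kl_closed = -0.5 * np.sum(1 + np.log(sigma_1**2) - mu_1**2 - sigma_1**2)

# Monte Carlo: KL = E_p[log p(x) - log q(x)], with x ~ p
x = mu_1 + sigma_1 * rng.normal(size=(200_000, mu_1.size))
log_p = -0.5 * np.sum(np.log(2 * np.pi * sigma_1**2) + ((x - mu_1) / sigma_1) ** 2, axis=1)
log_q = -0.5 * np.sum(np.log(2 * np.pi) + x**2, axis=1)
kl_mc = np.mean(log_p - log_q)

print(kl_closed, kl_mc)   # the two agree up to Monte Carlo error
```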

B. Jensen’s inequality

Jensen’s inequality states that the secant line of a convex function lies above the graph of the function. The concept is intuitive.


Mathematically, for any convex function $f$, Jensen’s inequality is:

\[f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)\]

Conversely, for any concave function $g$, the sign is simply flipped:

\[g(\lambda x_1 + (1-\lambda) x_2) \ge \lambda g(x_1) + (1-\lambda) g(x_2)\]

When probability density functions are involved (e.g. when computing expectation), Jensen’s inequality is:

\[f\left(\int_x p(x) g(x) \mathop{dx}\right) \le \int_x p(x) f(g(x)) \mathop{dx}\]

where $f$ is the convex function, $g$ is any real-valued measurable function, and $p(x)$ is the probability density function (i.e. $\int_x p(x) \mathop{dx} = 1$).

In section 2.2.2, $\log$ is the concave function, and $p_\theta(s|z)$ is the probability density function over $s$. Therefore:

\[\log \left(\int_s p_\theta(s|z)\, p_\theta(x|s) \mathop{ds}\right) \ge \int_s p_\theta(s|z) \log p_\theta(x|s) \mathop{ds}\]
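
A one-line numerical check of this inequality with a discrete stand-in for $p_\theta(s|z)$ (three possible segmentations with arbitrary probabilities and likelihood values):

```python
import numpy as np

p_s_given_z = np.array([0.2, 0.5, 0.3])     # a toy discrete p(s|z) over three segmentations
p_x_given_s = np.array([0.1, 0.4, 0.05])    # corresponding toy likelihood values p(x|s)

lhs = np.log(np.sum(p_s_given_z * p_x_given_s))    # log of the expectation
rhs = np.sum(p_s_given_z * np.log(p_x_given_s))    # expectation of the log
print(lhs, rhs, lhs >= rhs)                        # True: log is concave
```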

References

1. Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: ArXiv e-prints (2013). arXiv: 1312.6114.

2. Adrian V. Dalca, John Guttag, and Mert R. Sabuncu. “Anatomical Priors in Convolutional Networks for Unsupervised Biomedical Segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 9290–9299.

3. David M. Blei, Michael I. Jordan, and John W. Paisley. “Variational Bayesian Inference with Stochastic Search”. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). 2012, pp. 1367–1374.