Before diving into DDPMs its best to start our journey from the Variational Autoencoder (VAE) model which was introduced by Diederik P. Kingma and Max Welling in 2013 as some of the intuitions we develop from looking at the VAE will help us understand how the DDPM works and will be useful to draw comparisons between these models later.
Much of generative modelling is tasked with the challenge of sampling from an unknown data distribution $p_{\text{data}}(x)$ given only a set of i.i.d datapoints from this distribution. Note that this is different from the case when we have some approximate analytical distribution to start off with which is easy to evaluate but hard to sample from. In such cases algorithms like Metropolis Hastings Algorithm and Annealed Importance Sampling (AIS) help.
Consider a dataset $X = [x^{(i)}]_{i=1}^N$ consisting of $N$ i.i.d samples of some continuous or discrete variable $x$. The assumption is that these datapoints are generated by a process that uses continuous random variables $z$ lying on a lower dimensional manifold (manifold hypothesis). This process occurs in two steps :
- First $z^i$ is generated from the prior distribution $p(z)$.
- $x^i$ is generated from the conditional distribution $p(x|z)$.
We make another assumption that the prior and likelihood come form a parametric family of distributions and their probability density functions (PDFs) are differentiable wrt $\theta$ and $z$. The marginal likelihood
\[p_{\theta}(x) = \int p_{\theta}(x, z) dz = \int p_{\theta}(x|z) p_{\theta}(z) dz\]is typically intractable to compute directly since we do not have access to $p_{\theta}(x|z)$. Even if did somehow manage to learn this model, it would likely be complex, still leading to an intractable calculation. In extension, the posterior $p_{\theta}(z|x)$ computed using Bayes rule also becomes intractable.
\[p_{\theta}(z|x) = \frac{p_{\theta}(x|z)p_{\rm{prior}}(z)}{p_{\theta}(x)}\]Kigma and Welling came up with a solution to these problems by approximating the true posterior $p_{\theta}(z|x)$ with an approximate $q_{\phi}(z|x)$ modelled using a neural network. In the paper, the authors refer to this neural network as a probabilitic encoder since given a datapoint $x$, the encoder produces a distribution over the values of $z$ from which $x$ could have been generated. Similarly given a $z$ we would like to know the distribution over the corresponding $x$ values. This is done by another neural network which the authors refer to as a probabilistic decoder.
$q_\phi(z|x)$ is modelled as multivariate Gaussian distribution and the probabilistic encoder actually models the mean and standard deviation of this distribution. Similarly, $q_{\phi}(x|z)$ is also modelled as a multivariate Gaussian distribution but with fixed variance $\sigma^2$. The reparametrization technique uses the predicted mean and standard deviation to generate samples of $z$ from the probabilistic encoder. This clever use of the reparametrization technique also solves the issue of how to backpropagate errors for updating the weights of the probabilistic encoder. After training, we can sample $z$ from $p_{\theta}(z)$, pass the sampled vectors through the probabilistic decoder and hopefully get samples that resemble those from the desired data distribution $p_{\text{data}}$.
How do we train the probabilistic encoders and decoders ?
Looking back at the marginal likelihood expression we can rewrite it as
\[\begin{align} p_{\theta}(x) &= \int q_{\phi}(z|x)\frac{p_{\theta}(x, z)}{q_{\phi}(z|x)}dz \\ p_{\theta}(x) &= \mathbb{E}_{q_{\phi}(z|x)} \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)} \\ \log p_{\theta}(x) &= \log \mathbb{E}_{q_{\phi}(z|x)} \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)} \end{align}\]From Jensen's inequality $\log \mathbb{E}[p] \geq \mathbb{E}[\log p]$
\[\begin{align} \log p_{\theta}(x) &\geq \mathbb{E}_{q_{\phi}(z|x)} \log \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)} \\ \log p_{\theta}(x) &\geq \mathbb{E}_{q_{\phi}(z|x)} \log \frac{p_{\theta}(x | z) p_{\theta}(z)}{q_{\phi}(z|x)} \\ \log p_{\theta}(x) &\geq \mathbb{E}_{q_{\phi}(z|x)} [ \log p_{\theta}(x | z) ] + \mathbb{E}_{q_{\phi}(z|x)} \left[ \log \frac{p_{\theta}(z)}{q_{\phi}(z|x)} \right] \\ \log p_{\theta}(x) &\geq \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [ \log p_{\theta}(x | z) ]}_{\text{Reconstruction term}} - \underbrace{D_{KL}(q_\phi(z|x) || p_\theta(z))}_{\text{Divergence of samples from approx. vs. true}} \end{align}\]The reconstruction and the KL divergence terms form the lower bound of the evidence (or) data distribution which is why they are typically referred to as Evidence Lower Bound (ELBO). VoilĂ ! We found a clever solution to estimate the log likelihood of the data distribution. All we have to do is maximize the ELBO ! However in machine learning since we are used to minimizing objective functions, we instead minimize the negative of the ELBO
\[\mathcal{L}_{\text{ELBO}} = -\mathbb{E}_{q_{\phi}(z|x)} [ \log p_{\theta}(x | z) ] + D_{KL}(q_\phi(z|x) || p_\theta(z))\]The Kullback Libeler Divergence takes on an easy to evaluate closed form expression when using the standard normal distribution as the prior. Since we assumed the probabilistic decoder is a multivariate Gaussian we can expand the terms to get an objective function we can evaluate and compute gradients to update neural network parameters
\[\mathcal{L}_{\text{ELBO}} = \frac{1}{2\sigma^2}\mathbb{E}_{q_{\phi}(z|x)} \left[(x - \mu_\phi(z))^2 \right] + \frac{1}{2}\sum_{i=1}^k \left[ \sigma_i^2 + \mu_i^2 -1 - \log \sigma_i^2 \right]\]While the auto encoding variational Bayes algorithm offers an ingenious solution to estimating the log likelihood of the data distribution, in practice we observe that the quality of the generated samples is poor. The issues typically stem from the choice of prior and the choice of multivariate Gaussians for the probabilistic encoder and decoder. In most cases when using standard normal distribution as prior and enforcing the probabilistic encoder to generate samples conforming to this distribution can lead to cases where two distinct data samples $x$ and $x'$ are mapped to overlapping regions in the latent space. Such that when $z$ is sampled from this overlapping region, the probabilistic decoder is generating an averaged sample over multiple unrelated inputs. A similar effect also plays out in case of Generative Adversarial Networks (GANs) called as mode collapse where the GAN is unable to learn the different modes in $p_{\text{data}}$ leading to repetitive or poor diversity in generated samples (Note to self : Maybe a discussion point in another blog ...)
So how can we fix this issue ? Well some strategies maybe to use more complicated priors like Mixtures of Gaussians or von Mises-Fischer distribution. However, these strategies are useful when we know there is an underlying bias in the dataset generation process that we can exploit. What if we don't have such biases ? Is there a simpler approach to estimate $p_{\text{data}}$ ?
Denoising Diffusion Probabilistic Model (DDPM)
Unlike VAEs, diffusion probabilistic models (or simply diffusion models) define a forward Markov chain that progressively adds gaussian noise to data samples until the samples resemble data from a standard normal distribution. A reverse Markov process uses samples from this distribution and progressively denoises them for a set number of transitions until (hopefully!) we get samples resembling from the desired data distribution. Sounds like magic right ! I felt too until I really pulled up my sleeves and dug into the math.
The Forward Markov Chain