Nikhil Kumar Thota

Before diving into DDPMs its best to start our journey from the Variational Autoencoder (VAE) model introduced by Diederik P. Kingma and Max Welling in 2013. This will help us develop intuitions to understand how DDPM models work. In addition, it will also allow us to draw comparisons between both models later.

Much of generative modelling is tasked with the challenge of sampling from an unknown data distribution $p_{\text{data}}(x)$ given only a set of (hopefully) identically and independently distributed (i.i.d) datapoints from this distribution. Note that this is different from the case when we have some approximate analytical distribution to start off with which is easy to evaluate but hard to sample from. In such cases algorithms like Metropolis Hastings and Annealed Importance Sampling (AIS) help.

Consider a dataset $X = [x^{(i)}]_{i=1}^N$ consisting of $N$ i.i.d samples of some continuous or discrete variable $x$. The first assumption we make is that these datapoints are generated by a process that uses continuous random variables $z$ lying on a lower dimensional manifold (manifold hypothesis). This process occurs in two steps :

First $z^i$ is generated from the prior distribution $p(z)$.
$x^i$ is generated from the conditional distribution $p(x|z)$.

We make another assumption that the prior and likelihood come form a parametric family of distributions. Using these assumptions we then theoretically could compute $p_{\text{data}}(x)$ from the marginal likelihood expression.

\[p_{\text{data}}(x) = \int p_{\theta}(x, z) dz = \int p_{\theta}(x|z) p_{\theta}(z) dz\]

However in practice this never works out because the integral is intractable (does not have a closed form expression and cannot be computed numerically). The problem arises because we typically do not have access to the likelihood $p_{\theta}(x|z)$ and even if did somehow manage to learn a likelihood model, it would likely be complex, still leading to an intractable calculation.

Kingma and Welling came up with a solution to the problem of learning the likelihood function by first training a neural network to approximate the true posterior. In the paper, the authors refer to this neural network as a probabilistic encoder. Using the $z$ variables sampled from the approximate posterior $q_{\phi}(z|x)$, another neural network parametrizing the likelihood $p_{\theta}(x|z)$ learns to transform these variables back to the $x$ space. This neural network is referred to as a probabilistic decoder.

Figure 1. Variational Autoencoder

$q_\phi(z|x)$ is modelled as multivariate Gaussian distribution and the probabilistic encoder predicts the mean and standard deviation of this distribution. Similarly, $p_{\theta}(x|z)$ is also modelled as a multivariate Gaussian distribution but with fixed variance $\sigma^2$. The reparametrization technique uses the predicted mean and standard deviation of $q_\phi(z|x)$ to generate samples of $z$ from the probabilistic encoder. This clever use of the reparametrization technique also solves the issue of how to backpropagate errors for updating the weights of the probabilistic encoder. After training, we can sample $z$ from $p_{\theta}(z)$, pass the sampled vectors through the probabilistic decoder and hopefully get samples that resemble those from the desired data distribution $p_{\text{data}}$.

How to train the probabilistic encoders and decoders ?

Looking back at the marginal likelihood expression we can rewrite it as

\[\begin{align} p_{\theta}(x) &= \int q_{\phi}(z|x)\frac{p_{\theta}(x, z)}{q_{\phi}(z|x)}dz \\ p_{\theta}(x) &= \mathbb{E}_{q_{\phi}(z|x)} \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)} \\ \log p_{\theta}(x) &= \log \mathbb{E}_{q_{\phi}(z|x)} \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)} \end{align}\]

From Jensen's inequality $\log \mathbb{E}[p] \geq \mathbb{E}[\log p]$

\[\begin{align} \log p_{\theta}(x) &\geq \mathbb{E}_{q_{\phi}(z|x)} \log \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)} \\ \log p_{\theta}(x) &\geq \mathbb{E}_{q_{\phi}(z|x)} \log \frac{p_{\theta}(x | z) p_{\theta}(z)}{q_{\phi}(z|x)} \\ \log p_{\theta}(x) &\geq \mathbb{E}_{q_{\phi}(z|x)} [ \log p_{\theta}(x | z) ] + \mathbb{E}_{q_{\phi}(z|x)} \left[ \log \frac{p_{\theta}(z)}{q_{\phi}(z|x)} \right] \\ \log p_{\theta}(x) &\geq \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [ \log p_{\theta}(x | z) ]}_{\text{Reconstruction term}} - \underbrace{D_{KL}(q_\phi(z|x) || p_\theta(z))}_{\text{Divergence of samples from approx. vs. true}} \end{align}\]

The reconstruction and the KL divergence terms form the lower bound of the evidence (or) data distribution which is why they are typically referred to as Evidence Lower Bound (ELBO). Voilà ! We found a clever solution to train the probabilistic encoders and decoders to maximize the likelihood the samples they generate belong to $p_{\text{data}}(x)$. All we have to do is maximize the ELBO ! However in machine learning since we are used to minimizing objective functions, we minimize the negative of the ELBO

\[\mathcal{L}_{\text{ELBO}} = -\mathbb{E}_{q_{\phi}(z|x)} [ \log p_{\theta}(x | z) ] + D_{KL}(q_\phi(z|x) || p_\theta(z))\]

The Kullback Libeler Divergence takes on an easy to evaluate closed form expression when using the standard normal distribution as the prior. Since we assumed the probabilistic decoder is a multivariate Gaussian we can expand the terms to get an objective function we can evaluate and compute gradients to update neural network parameters

\[\mathcal{L}_{\text{ELBO}} = \frac{-1}{2\sigma^2}\mathbb{E}_{q_{\phi}(z|x)} \left[(x - \mu_\phi(z))^2 \right] + \frac{1}{2}\sum_{i=1}^k \left[ \sigma_i^2 + \mu_i^2 -1 - \log \sigma_i^2 \right]\]

The above model and training procedure is referred to as the auto encoding variational Bayes algorithm. While it offers an ingenious solution to overcome the intractability of computing $p_{\text{data}}(x)$, in practice we observe that the quality of the generated samples is poor. The issues typically stem from the choice of prior and the choice of multivariate Gaussians for the probabilistic encoder and decoder. In most cases when using standard normal distribution as prior and enforcing the probabilistic encoder to generate samples conforming to this distribution can lead to cases where two distinct data samples $x$ and $x'$ are mapped to overlapping regions in the latent space. When $z$ is sampled from this overlapping region, the probabilistic decoder is generating an averaged sample over multiple unrelated inputs. In case of Generative Adversarial Networks (GANs) an effect called mode collapse occurs where the GAN is unable to learn the different modes in $p_{\text{data}}$ leading to repetitive or poor diversity in generated samples (Note to self : Maybe a discussion point in another blog ...)

So how can we fix this issue ? Well some strategies maybe to use more complicated priors like Mixtures of Gaussians or von Mises-Fischer distribution. However, these strategies are useful when we know there is an underlying bias in the dataset generation process that we can exploit. What if we don't have such biases ? Is there a simpler approach to estimate $p_{\text{data}}$ ?

Denoising Diffusion Probabilistic Model (DDPM)

We saw that VAEs operate based on an intermediate random variable $z$. Diffusion probabilistic models (or simply diffusion models) also operate based on intermediate variables but instead of just 1, they have $T$ random variables $x_{0:T} = {x_0, x_1 \dots x_T}$ all with the same dimensionality. The insight is that by having multiple random variables we can chain the transition densities ($p_\theta(x_{t-1}|x_t)$ and $p_\theta(x_t|x_{t-1})$) to form a Markov chain that we can use to transform samples from any starting distribution to any desired data distribution no matter how complex it might be. Does this ring a bell ? This is exactly what Markov Chain Monte Carlo does ! DDPMs define these transition densities as multivariate Gaussian distributions that progressively corrupt the data samples with Gaussian noise until the samples resemble those from a standard normal distribution. This is called as the forward Markov Chain. A reverse Markov chain is also defined using multivariate Gaussian distributions which take the corrupted data samples as input and progressively denoise them for a set number of transitions until (hopefully!) we get samples resembling that from the desired data distribution.

Forward and Reverse Markov Processes of DDPM

Figure 2. Forward and Reverse Markov Processes in the Denoising Diffusion Probabilistic Model

How to train the neural networks in the forward and reverse Markov chains of the DDPM ?

We start with a few clean data samples $x_0$ for which we would like to estimate the density $p_{\theta}(x_0)$. We train the diffusion model to maximize the likelihood that the samples it produces belong to this data distribution. To frame this into an objective function that we can minimize via Gradient Descent, we again start with the marginal likelihood equation just as we did in the VAE case. The difference here though is that the joint distribution comprises of $T$ random variables and we integrate over all but the $x_0$ variable.

\[p_{\theta}(x_0) = \int p_{\theta}(x_{0:T})dx_{1:T}\]

Now multiply and divide by $p_{\theta}(x_{1:T}|x_0) = \prod_{i=1}^T p_{\theta}(x_t|x_{t-1})$ which is the forward Markov chain.

\[p_{\theta}(x_0) = \int p_{\theta}(x_{1:T}|x_0) \frac{p_{\theta}(x_{0:T})}{p_{\theta}(x_{1:T}|x_0)} dx_{1:T}\]

The authors in the paper use a fixed forward Markov chain which simplifies the loss calculation so lets stick with that. I rewrite $p_{\theta}(x_{1:T}|x_0)$ as $q(x_{1:T}|x_0)$ to indicate that there are no learnable parameters.

\[\begin{align} p_{\theta}(x_0) &= \int q(x_{1:T}|x_0) \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} dx_{1:T} \\ p_{\theta}(x_0) &= \mathbb{E}_{q} \left[ \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \right] \end{align}\]

Applying log on both sides and subsequently Jensen's inequality we get

\[\log (p_{\theta}(x_0)) \geq \mathbb{E}_{q} \left[ \log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \right]\]

We multiply both sides of the equation by $-1$ to reframe the optimization as minimizing the variational bound.

\[- \log (p_{\theta}(x_0)) \leq \mathbb{E}_{q} \left[ - \log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \right]\]

Now both the numerator and denominator can be expanded as a product of conditionals.

\[\begin{align} - \log (p_{\theta}(x_0)) &\leq \mathbb{E}_{q} \left[ - \log \frac{p_{\theta}(x_T)\prod_{t=1}^{T} p_{\theta}(x_{t-1}|x_t)}{\prod_{t=1}^{T} q(x_t|x_{t-1})} \right] \\ - \log (p_{\theta}(x_0)) &\leq \mathbb{E}_{q} \left[ - \log p_{\theta}(x_T) - \sum_{i=1}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \right] \end{align}\]

If we use Bayes theorem and substitute the posterior as $q(x_t|x_{t-1}) = \frac{q(x_{t-1}|x_{t})q(x_t)}{q(x_{t-1})}$ we get

\[- \log (p_{\theta}(x_0)) \leq \mathbb{E}_{q} \left[ -\log p_{\theta}(x_T) - \sum_{i=1}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_{t})} \frac{q(x_{t-1})}{q(x_t)} \right]\]

While everything may seem alright at first, if you look closely we actually ran into an obstacle. We don't have access to the marginals $q(x_t)$ and $q(x_{t-1})$. This is where we come across another crucial insight that led to the development of DDPM models which is conditioning the probability densities based on clean data $x_0$. Then we see that the intractable formula becomes mathematically tractable.

\[q(x_{t-1}|x_{t}, x_0) = q(x_{t}|x_{t-1}, x_0) \frac{q(x_{t-1}| x_0)}{q(x_{t}| x_0)}\]

The tractability arises from two key properties of the forward process: the forward transition processes depend only on the previous state $q(x_{t}|x_{t-1}, x_0) = q(x_{t}|x_{t-1})$ and the Gaussian nature of all distributions involved. As a result $q(x_{t-1}|x_{t}, x_0)$ is also Gaussian and admits a closed form expression (See Section Sampling from $q(x_{t-1}|x_t, x_0)$). This subtle change makes the previously intractable KL Divergence term into a tractable one. The main point to note here, without going into the derivation, is that minimizing the KL divergence between marginal distributions is mathematically identical to minimizing the KL divergence between specific conditional distributions. Now substituting the probability densities conditioned on clean data $x_0$ and expanding out the terms we get

\[- \log (p_{\theta}(x_0)) \leq \mathbb{E}_{q} \left[ -\log p_{\theta}(x_T) - \sum_{i>1}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_{t}, x_0)} \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} - \log \frac{p_{\theta}(x_{0}|x_1)}{\cancel{q(x_{0}|x_{1}, x_0)}} \frac{\cancel{q(x_{0}|x_0)}}{q(x_1|x_0)} \right] \\\]

Essentially all the marginals cancel out except the $q(x_T|x_0)$ which gets observed into the first term.

\[\begin{align} - \log (p_{\theta}(x_0)) &\leq \mathbb{E}_{q} \left[ -\log \frac{p_{\theta}(x_T)}{q(x_T|x_0)} - \sum_{i>1}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_{t}, x_0)} - \log p_{\theta}(x_{0}|x_1) \right] \\ - \log (p_{\theta}(x_0)) &\leq \mathbb{E}_{q} \left[ \underbrace{D_{KL}(q(x_T|x_0) || p_{\theta}(x_T))}_{L_T} + \underbrace{\sum_{i>1}^{T} D_{KL}(q(x_{t-1}|x_{t}, x_0) || p_{\theta}(x_{t-1}|x_t) )}_{L_{t-1}} - \underbrace{\log p_{\theta}(x_{0}|x_1)}_{L_0} \right] \\ \end{align}\]

And that completes it ! The right hand side of the inequality represents the variational bound to minimize which the authors break down into 3 terms $L_T$, $L_{t-1}$ and $L_0$. The authors were able to simplify this objective function even further but we will get to that later. Right now lets appreciate the fact that since all the KL Divergence terms are comparisons between Gaussians, we can derive analytic closed form expressions without having to rely on high variance Monte Carlo estimates.

Sampling from $q(x_t|x_{t-1})$

The transition densities in the forward chain are defined such that the random variable $x_t$ depends on scaled amounts of $x_{t-1}$ and Gaussian noise. This scaling is controlled by the variance $\beta_t$ which is defined a priori through a variance schedule $\beta_1, \beta_2 \dots \beta_T$.

\[q(x_t | x_{t-1}) = \mathcal{N}(x_t|\sqrt{(1-\beta_t)}x_{t-1}, \beta_t I )\]

Again we can use the reparametrization technique to sample from this distribution.

\[x_t = \sqrt{(1-\beta_t)}x_{t-1} + \sqrt{\beta_t}\epsilon \quad \epsilon \sim \mathcal{N}(0, I)\]

What would the reparametrization formula look like for sampling from $q(x_t|x_0)$ ? To start lets look at the equation for sampling $x_{t-1}$. Lets also define a new variable $\alpha_t = 1 - \beta_t$.

\[\begin{align} x_{t-1} &= \sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1 - \alpha_{t-1}}\epsilon \end{align}\]

Substituting this in $x_t$ we get

$$ \begin{align} x_t &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1 - \alpha_{t-1}}\epsilon) + \sqrt{1 - \alpha_t}\epsilon \\ x_t &= \sqrt{\alpha_t \alpha_{t-1}}x_{t-2} + \sqrt{\alpha_t - \alpha_t \alpha_{t-1}}\epsilon + \sqrt{1 - \alpha_t}\epsilon x_t \\ x_t &= \sqrt{\alpha_t \alpha_{t-1}}(\sqrt{\alpha_{t-2}}x_{t-3} + \sqrt{1 - \alpha_{t-2}}\epsilon ) + \sqrt{\alpha_t - \alpha_t \alpha_{t-1}}\epsilon + \sqrt{1 - \alpha_t}\epsilon \\ x_t &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}}x_{t-3} + \sqrt{\alpha_t \alpha_{t-1} - \alpha_t \alpha_{t-1} \alpha_{t-2}}\epsilon + \sqrt{\alpha_t - \alpha_t \alpha_{t-1}}\epsilon + \sqrt{1 - \alpha_t}\epsilon \end{align} $$

By now you might have caught the recursion pattern.

$$ x_t = \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2} \dots \alpha_0}x_0 + \text{sum of normal distributions} $$

Now the sum of normal distributions is also a normal distribution with its variance given as the sum of variances of the individual normals.

$$ \begin{align} \mathcal{N}(0, \sigma_1^2 I) + \mathcal{N}(0, \sigma_2^2 I) + \mathcal{N}(0, \sigma_3^2 I) &= \mathcal{N}(0, (\sigma_1^2 + \sigma_2^2 + \sigma_3^2)I) \\ \mathcal{N}(0, (\sigma_1^2 + \sigma_2^2 + \sigma_3^2)I) &= \mathcal{N}(0, (\alpha_t \alpha_{t-1} - \alpha_t \alpha_{t-1} \alpha_{t-2} + \alpha_t - \alpha_t \alpha_{t-1} + 1 - \alpha_t)I) \\ &= \mathcal{N}(0, (1 - \alpha_t \alpha_{t-1} \alpha_{t-2})I) \end{align} $$

Substituting this new variance and defining $\overline{\alpha_t} = \prod_{i=0}^{t}\alpha_i $, we can rewrite the reparametrization as

$$ x_t = \sqrt{\overline{\alpha_t}}x_{0} + \sqrt{1 - \overline{\alpha_t}}\epsilon $$

This completes the derivation. When the noise schedule is an increasing sequence $\{\beta_i\}_i^T$ then its easy to figure out that $q(x_t|x_0)=\mathcal{N}(x_t|\sqrt{\overline{\alpha_t}}x_0, (1 - \overline{\alpha_t})I)$ converges to $\mathcal{N}(0, I)$ as $T \to \infty$. Thus validating our choice of prior as the standard Normal distribution.

Sampling from $q(x_{t-1}|x_{t}, x_0)$

Lets now turn our attention to deriving the reparametrization formula for sampling from $q(x_{t-1}|x_t, x_0)$. This distribution as mentioned previously takes the form of a multivariate Gaussian distribution with a mean function $\tilde\mu_t(x_t, x_0)$ and variance $\tilde\beta_t$.

$$ \begin{align} q(x_{t-1}|x_t, x_0) &= \mathcal{N}(x_{t-1}|\tilde\mu_t(x_t, x_0), \tilde\beta_tI) \\ q(x_{t-1}|x_t, x_0) &\propto \exp \left[ -\frac{1}{2} \frac{(x_{t-1} - \tilde\mu_t(x_t, x_0))^2}{\tilde\beta_t} \right] \\ \end{align} $$

To derive what $\tilde\mu_t(x_t, x_0)$ and $\tilde\beta_t$ should be, lets look back at the Bayes theorem on conditioned probability distributions.

$$ q(x_{t-1}|x_{t}, x_0) = q(x_{t}|x_{t-1}) \frac{q(x_{t-1}| x_0)}{q(x_{t}| x_0)} $$

We already know what all the distributions on the right hand side look like.

$$ \begin{align} q(x_t|x_{t-1}) &\propto \exp \left[-\frac{1}{2} \frac{(x_t - \sqrt{\alpha_t}x_{t-1})^2}{(1-\alpha_t)} \right] \\ q(x_{t-1}|x_{0}) &\propto \exp \left[-\frac{1}{2} \frac{(x_{t-1} - \sqrt{\overline{\alpha}_{t-1}}x_0)^2}{(1-\overline{\alpha}_{t-1})} \right] \\ q(x_t|x_{0}) &\propto \exp \left[-\frac{1}{2} \frac{(x_t - \sqrt{\overline{\alpha}_t}x_0)^2}{(1-\overline{\alpha}_{t})} \right] \\ \end{align} $$

Substituting these back in the Bayes formula and expanding out the terms we get

$$ \begin{align} q(x_{t-1}|x_{t}, x_0) &\propto \exp [-\frac{1}{2} ( \frac{x_t^2}{(1-\alpha_t)} - \frac{2x_t\sqrt{\alpha_t}x_{t-1}}{(1-\alpha_t)} + \frac{\alpha_t x_{t-1}^2}{(1-\alpha_t)} \\ &\quad \quad \quad \quad + \frac{x_{t-1}^2}{(1-\overline{\alpha}_{t-1})} - \frac{2x_{t-1}\sqrt{\overline{\alpha}_{t-1}}x_{0}}{(1-\overline{\alpha}_{t-1})} + \frac{\overline{\alpha}_{t-1} x_{0}^2}{(1-\overline{\alpha}_{t-1})} \\ &\quad \quad \quad \quad + \frac{x_t^2}{(1-\overline{\alpha}_t)} - \frac{2x_t\sqrt{\overline{\alpha}_t}x_{0}}{(1-\overline{\alpha}_t)} + \frac{\overline{\alpha}_t x_{0}^2}{(1-\overline{\alpha}_t)} ) ] \end{align} $$

Expanding out the terms in $q(x_{t-1}|x_t, x_0)$ and comparing with what we have above, we directly get the equations for the mean and variance.

$$ \begin{align} \frac{1}{\tilde\beta_t} &= \frac{\alpha_t}{1 - \alpha_t} + \frac{1}{1 - \overline{\alpha}_{t-1}} \\ \frac{1}{\tilde\beta_t} &= \frac{\alpha_t(1 - \overline{\alpha}_{t-1}) + (1 - \alpha_t)}{(1 - \alpha_t)(1 - \overline{\alpha}_{t-1})} \\ \frac{1}{\tilde\beta_t} &= \frac{1 - \overline{\alpha}_t}{(1 - \alpha_t)(1 - \overline{\alpha}_{t-1})} \\ \tilde\beta_t &= \frac{(1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t}\beta_t \end{align} $$ $$ \begin{align} \frac{\tilde\mu_t(x_t, x_0)}{\tilde\beta_t} &= \frac{x_t\sqrt{\alpha_t}}{(1-\alpha_t)} + \frac{\sqrt{\overline{\alpha}_{t-1}}x_{0}}{(1-\overline{\alpha}_{t-1})} \\ \tilde\mu_t(x_t, x_0) &= \frac{\sqrt{\overline{\alpha}_{t-1}}x_{0}}{(1-\overline{\alpha}_{t-1})}\tilde\beta_t + \frac{x_t\sqrt{\alpha_t}}{(1-\alpha_t)}\tilde\beta_t \\ \tilde\mu_t(x_t, x_0) &= \frac{\sqrt{\overline{\alpha}_{t-1}}x_{0}}{\cancel{(1-\overline{\alpha}_{t-1})}} \frac{\cancel{(1 - \overline{\alpha}_{t-1})}}{1 - \overline{\alpha}_t}\beta_t + \frac{x_t\sqrt{\alpha_t}}{\cancel{(1-\alpha_t)}} \frac{(1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t}\cancel{\beta_t} \\ \tilde\mu_t(x_t, x_0) &= \frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1 - \overline{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t}x_t \end{align} $$

Thus we finally arrive at the reparametrization equation for sampling $x_{t-1}$ from $q(x_{t-1}|x_t, x_0)$

$$ \begin{align} x_{t-1} &= \tilde\mu_t(x_t, x_0) + \sqrt{\tilde\beta_t} \epsilon \quad \epsilon \sim \mathcal{N}(0, I) \\ x_{t-1} &= \frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1 - \overline{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t}x_t + \sqrt{\frac{(1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t}\beta_t} \epsilon \\ \end{align} $$

However, we can simplify this even further by substituting for $x_0$ using the forward sampling equation $x_t = \sqrt{\overline{\alpha}_t}x_0 + \sqrt{1 - \overline{\alpha}_t}\epsilon$ so that we get an equation only in $x_t$

$$ x_0 = \frac{x_t}{\sqrt{\overline{\alpha}_t}} - \frac{\sqrt{1 - \overline{\alpha}_t}}{\sqrt{\overline{\alpha}_t}}\epsilon $$ $$ \begin{align} x_{t-1} &= \frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1 - \overline{\alpha}_t} \left( \frac{x_t}{\sqrt{\overline{\alpha}_t}} - \frac{\sqrt{1 - \overline{\alpha}_t}}{\sqrt{\overline{\alpha}_t}}\epsilon \right) + \frac{\sqrt{\alpha_t}(1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t}x_t + \sqrt{\frac{(1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t}\beta_t} \epsilon \\ x_{t-1} &= \underbrace{\frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \overline{\alpha}_t}}\epsilon \right)}_{\tilde\mu_t(x_t, \epsilon)} + \sqrt{\frac{(1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t}\beta_t} \epsilon \end{align} $$

Simplifying the training objective

Looking back at our original optimization function we see that the $L_T$ term is a constant as there are no learnable parameters. To compute the $L_{t-1}$ term we first define $p_{\theta}(x_{t-1}|x_{t})$ as $\mathcal{N}(x_{t-1}|\mu_\theta(x_t, t), \sum_{\theta}(x_t, t))$. The authors set $\sum_{\theta}(x_t, t) = \sigma_t^2I$ and tested using $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde \beta_t$. They found both options produced similar results.

$$ \begin{align} L_{t-1} &= \mathbb{E}_{q} \left[ \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_{t})} \right] \\ L_{t-1} &= \mathbb{E}_{q} \left[ \log \frac{\exp \frac{-(x_{t-1} - \tilde\mu(x_t, x_0))^2}{2\sigma_t^2}}{\exp \frac{-(x_{t-1} - \mu_{\theta}(x_t, t))^2}{2\sigma_t^2}} \right] \\ L_{t-1} &= \mathbb{E}_{q} \left[ \frac{- (x_{t-1} - \tilde\mu(x_t, x_0))^2 + (x_{t-1} - \mu_{\theta}(x_t, t))^2}{2\sigma_t^2} \right] \\ L_{t-1} &= \mathbb{E}_{q} \left[ \frac{|| \tilde\mu(x_t, x_0) - \mu_{\theta}(x_t, t)||^2}{2\sigma_t^2} \right] + C \end{align} $$

$C$ is a constant that does not depend on $\theta$. Since we derived what $\tilde\mu(x_t, \epsilon)$ is we can substitute into the above equation.

$$ \begin{align} L_{t-1} - C &= \mathbb{E}_{q} \left[ \frac{1}{2\sigma_t^2} \left|\left| \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \overline{\alpha}_t}}\epsilon \right) - \mu_{\theta}(x_t, t) \right|\right|^2 \right] \end{align} $$

For the above objective function to be minimized then

$$ \mu_{\theta}(x_t, t) = \tilde\mu(x_t, x_0) = \tilde\mu(x_t, \epsilon) $$

From here we observe that $\mu_{\theta}(x_t, t)$ must have a similar functional form as $\tilde\mu(x_t, \epsilon)$.

$$ \mu_{\theta}(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left[ x_t - \frac{\beta_t}{\sqrt{1 - \overline{\alpha}_t}}\epsilon_{\theta}(x_t, t) \right] $$

$\epsilon_{\theta}(x_t, t)$ is called the learned gradient of the data density or score. Substituting these derivations back into $L_{t-1} - C$ we get a very interesting objective function that resembles denoising score matching over multiple noise levels (small to large) indexed by $t$.

$$ L_{t-1} - C = \mathbb{E}_{q} \left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \overline{\alpha}_t)} \left|\left| \epsilon - \epsilon_{\theta}(x_t, t) \right|\right|^2 \right] $$

We only need to train a neural network to predict $\epsilon$. Then using the predicted noise we can sample the denoised random variable $x_{t-1}$ from $p_{\theta}(x_{t-1}|x_t)$ through $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \overline{\alpha}_t}}\epsilon_{\theta}(x_t, t)\right) + \sigma_tz$ where $z \sim \mathcal{N}(0, 1)$

We see that an alternative optimization could have been defined that predicts the clean data $x_0$ since $\mu_{\theta}(x_t, t) = \tilde\mu(x_t, x_0) = \tilde\mu(x_t, \epsilon)$ but the authors found that this leads to poor sample quality. As quoted in the paper " $\epsilon$ prediction parametrization both resembles Langevin dynamics and simplifies the diffusion model's variational bound to an objective that resembles denoising score matching."

Simplifying the $L_T$, $L_{t-1}$ and $L_0$ the authors arrived at the following variant of the variational bound $L_{\text{simple}}(\theta)$ which is much simpler to train.

$$ \mathcal{L}_{\text{simple}}(\theta) = \underbrace{\mathbb{E}_{t, x_0, \epsilon}}_{\text{average over} \epsilon \text{predict at diff. $t$ and diff. $x_0$}} \left[ \left|\left| \epsilon - \epsilon_{\theta}(x_t, t) \right|\right|^2 \right] $$

The $t=T$ case is ignored because it has a constant loss. The $1 < t < T$ case is the denoising score matching but the unweighted version compared to what we saw from $L_{t-1}$. This results in down weighting terms corresponding to small $t$ causing model to focus more on the early stages of denoising. The $t=1$ corresponds to $L_0$ which is approximated by a Guassian probability density function times the bin width, ignoring $\sigma_1^2$ and edge effects. The snippet below summarizes the training and sampling process in a DDPM.

Figure 2. Training and Sampling in DDPM

This is all I have for now. If you never heard about a DDPM, I atleast hope you learned why these models are so popular ranging from uses in image generation to predicting 3D biomolecular structures just from sequence (AlphaFold3). If you have experience working with these models but never quite understood the math, I hope this blog helped you get a new perspective on the DDPM paper. In the future I plan to put this math in action and illustrate it using a simple example.