## Introduction

Recently, diffusion models have demonstrated remarkable performance on several generative tasks, ranging from unconditional or text-guided image generation, audio synthesis, natural language generation, and human motion synthesis to 3D point-cloud generation, most of which were previously dominated by GAN-based models. Research has also shown that diffusion models are considerably better than GANs in terms of generation quality while also offering control over the results [1]. In this blog, we will go through the mathematics and intuition behind this compelling generative model.

**Figure 1**. Overview of Diffusion Model.

## Mechanism behind Diffusion Model

The underlying principle of diffusion models is based on a well-known idea from non-equilibrium statistical thermodynamics: using a Markov chain to gradually convert one distribution into another [2]. Intuitively, it is called "diffusion" because data points in the original data space are gradually diffused from one position to another until they reach a state of pure noise. Unlike common VAE or GAN models, where the data is typically compressed into a low-dimensional latent space, diffusion models use a fixed forward procedure and a high-dimensional latent variable (of the same dimensionality as the original data; see Figure 1). Now let's define the forward diffusion process.

### Forward diffusion process

**Figure 2**. Example of forward diffusion.

Given a sample from our real data distribution $x_0 \sim q(x_0)$, we define the forward diffusion process as a Markov chain that gradually adds random Gaussian noise (with diagonal covariance) to the data under a pre-defined schedule over $t=1,2,\dots,T$ steps. Specifically, the distribution of the noised sample at step $t$ depends only on the previous step $t-1$:

$$q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\,\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\big) \tag{1}$$

where $\beta_t$ is the noise schedule controlling the step size, i.e., how quickly noise is added in the process. Typically, the $\beta_t$ lie in $(0,1)$ and satisfy:

$$0 < \beta_1 < \beta_2 < \dots < \beta_T < 1$$

An interesting property of the forward process, which makes it a very powerful generative model, is that when the noise step size is small enough and $T\rightarrow \infty$, the distribution at the final step $x_T$ becomes a standard Gaussian distribution $\mathcal{N}(0,I)$. In other words, we can easily sample from this (*pure*) noise distribution and then follow a reverse process to reconstruct the real data sample, which we will discuss later.
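We can check this limiting behavior numerically. The sketch below assumes the linear schedule from the DDPM paper [4] ($\beta_1 = 10^{-4}$ rising to $\beta_T = 0.02$ over $T = 1000$ steps); the cumulative product of $1-\beta_t$ (the retained signal) collapses to nearly zero:

```python
import numpy as np

# Linear noise schedule (values follow the DDPM paper's choice:
# beta_1 = 1e-4 rising to beta_T = 0.02 over T = 1000 steps).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal retained after t steps

# The signal coefficient shrinks to ~0 and the noise variance grows to ~1,
# i.e. x_T is (almost) pure standard Gaussian noise.
print(alpha_bar[-1])        # a few times 1e-5 -- essentially zero
print(1.0 - alpha_bar[-1])  # essentially one
```

With larger $T$ or larger step sizes, the final distribution gets even closer to $\mathcal{N}(0,I)$.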

Another notable property of this process is that we can directly sample $x_t$ at any time step $t$ without going through the whole chain. Specifically, let $\alpha_t = 1 -\beta_t$; from **Eq. 1**, we can derive the sample using the reparameterization trick $(x\sim\mathcal{N}(\mu,\sigma^2) \Leftrightarrow x = \mu + \sigma\epsilon )$ as follows:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} = \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \left[\sqrt{\alpha_t(1-\alpha_{t-1})}\,\epsilon_{t-2} + \sqrt{1-\alpha_t}\,\epsilon_{t-1}\right]$$

where $\epsilon_{t-1}, \epsilon_{t-2}, \dots \sim \mathcal{N}(0,I)$. Note that when we add two independent Gaussians with variances $\sigma_1^2$ and $\sigma_2^2$, the sum is also Gaussian with variance $\sigma_1^2 + \sigma_2^2$. Therefore, we obtain the merged variance of the bracketed sum $[\dots]$ in the above equation as:

$$\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$$

Repeating this merging step all the way down to $x_0$, $x_t$ can be written as:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,I)$$

where $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$. We can also re-write the forward diffusion as a distribution conditioned on $x_0$:

$$q(x_t|x_0) = \mathcal{N}\big(x_t;\,\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)I\big)$$

This nice property makes training much more efficient, since we only need to compute the loss at some random timestep $t$ instead of over the whole process, as we will see later.
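As a quick illustration, the closed form above lets us jump straight to any $x_t$ in one shot. A minimal sketch (the linear schedule and the toy "data" are illustrative assumptions):

```python
import numpy as np

# Sample x_t directly from x_0 using
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(4)                # toy data sample
x500 = q_sample(x0, 500, rng)  # noised sample at step t = 500, no chain needed
```

During training, one would draw a random $t$ per example and call `q_sample` once, rather than simulating all $t$ intermediate steps.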

### Reverse diffusion process

We have seen that the forward diffusion essentially pushes a sample off the data manifold and slowly transforms it into noise. The main goal of the reverse process is to learn a trajectory back onto the manifold, reconstructing a sample from the original data distribution out of the noise.

Feller [3] demonstrated that for a Gaussian diffusion process with sufficiently small step size $\beta$, the reversal of the diffusion has an identical functional form to the forward process. This means that since $q(x_t|x_{t-1})$ is a Gaussian distribution, if $\beta_t$ is small, then $q(x_{t-1}|x_t)$ will also be a Gaussian distribution.

Unfortunately, estimating the reverse distribution $q(x_{t-1}|x_t)$ directly is very difficult: in the early steps of the reverse process (when $t$ is near $T$), our sample is essentially random noise with high variance, so there are too many possible trajectories back to the original data $x_0$, and computing the distribution exactly would require integrating over the entire dataset. Our final goal, therefore, is to learn a model that approximates these reverse distributions in order to build the generative reverse denoising process, specifically:

$$p_\theta(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\,\mu_\theta(x_t, t),\,\Sigma_\theta(x_t, t)\big)$$

where $\mu_\theta(x_t, t)$ is the mean predicted by the model (a neural network with parameters $\theta$), and $\Sigma_\theta(x_t, t)$ is the variance, which is typically kept fixed for efficient learning.
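To make the direction of the reverse process concrete, here is a minimal sketch of the sampling loop. The `model` below is only a placeholder for the trained network $\mu_\theta$ (a real implementation would use a learned denoiser), and the variance is fixed to $\beta_t$, one common simple choice:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule

def model(x_t, t):
    return x_t  # placeholder for the learned mean mu_theta(x_t, t)

def reverse_step(x_t, t, rng):
    mu = model(x_t, t)
    z = rng.standard_normal(x_t.shape) if t > 0 else 0.0  # no noise at t = 0
    return mu + np.sqrt(betas[t]) * z  # reparameterized Gaussian sample

rng = np.random.default_rng(0)
x = rng.standard_normal(4)  # start from pure noise x_T ~ N(0, I)
for t in reversed(range(1000)):
    x = reverse_step(x, t, rng)  # x_T -> x_{T-1} -> ... -> x_0
```

With a trained $\mu_\theta$ in place of the placeholder, this loop walks the noise back onto the data manifold step by step.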

Fortunately, by conditioning on $x_0$, the posterior of the reverse process becomes tractable and is a Gaussian distribution [4]. Our target now is to find the posterior mean $\tilde{\mu}_t(x_t,x_0)$ and the variance $\tilde{\beta}_t$. Following Bayes' rule, we can write:

$$q(x_{t-1}|x_t,x_0) = q(x_t|x_{t-1},x_0)\,\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}$$

where

$q(x_t | x_{t-1}, x_0) = q(x_t | x_{t-1})$ because the process is a Markov chain: the future state ($x_t$) is conditionally independent of all earlier states ($x_0,\dots,x_{t-2}$) given the present state ($x_{t-1}$). We also leverage the fact that all three distributions on the right-hand side are Gaussian, which means we can apply the density function ($\mathcal{N}(x;\mu,\sigma^2) \propto \exp(-\frac{(x-\mu)^2}{2\sigma^2})$) as follows:

$$q(x_{t-1}|x_t,x_0) \propto \exp\left(-\frac{1}{2}\left(\frac{(x_t-\sqrt{\alpha_t}\,x_{t-1})^2}{\beta_t} + \frac{(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\,x_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(x_t-\sqrt{\bar{\alpha}_t}\,x_0)^2}{1-\bar{\alpha}_t}\right)\right)$$

Expanding the squares and grouping terms by powers of $x_{t-1}$ gives:

$$= \exp\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)\color{green}{x_{t-1}^2} - 2\left(\frac{\sqrt{\alpha_t}}{\beta_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\,x_0\right)\color{orange}{x_{t-1}} + O(x_0,x_t)\right)\right)$$

where $O(x_0,x_t)$ is a function depending only on $x_0$ and $x_t$, not on $x_{t-1}$, so it can be effectively ignored (it only contributes to the normalizing constant). Therefore we have:

$$q(x_{t-1}|x_t,x_0) \propto \exp\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}}{\beta_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\,x_0\right)x_{t-1}\right)\right)$$

Comparing with the Gaussian density function $\exp(-\frac{(x-\mu)^2}{2\sigma^2}) = \exp(-\frac{1}{2} (\frac{\color{green}{x^2}}{\sigma^2} -\frac{2\mu \color{orange}{x}}{\sigma^2} + \frac{\mu^2}{\sigma^2}))$, we can see the correspondence. Recall that $\bar{\alpha}_t = \prod_{i=1}^t\alpha_i = \alpha_t\bar{\alpha}_{t-1}$ and $\alpha_t + \beta_t = 1$. Matching the variance term (the coefficient of $x_{t-1}^2$), we obtain the posterior variance:

$$\tilde{\beta}_t = \left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)^{-1} = \left(\frac{\alpha_t(1-\bar{\alpha}_{t-1})+\beta_t}{\beta_t(1-\bar{\alpha}_{t-1})}\right)^{-1} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

We also obtain the mean by matching the coefficient of $x_{t-1}$, which equals $\frac{\tilde{\mu}_t}{\tilde{\beta}_t}$:

$$\tilde{\mu}_t(x_t,x_0) = \left(\frac{\sqrt{\alpha_t}}{\beta_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\,x_0\right)\tilde{\beta}_t = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0$$

Now we obtain the final formula for the reverse posterior distribution:

$$q(x_{t-1}|x_t,x_0) = \mathcal{N}\big(x_{t-1};\,\tilde{\mu}_t(x_t,x_0),\,\tilde{\beta}_t I\big)$$

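This closed-form posterior translates directly into code. A small sketch (the linear schedule values are illustrative assumptions) that computes $\tilde{\mu}_t$ and $\tilde{\beta}_t$:

```python
import numpy as np

# Closed-form posterior q(x_{t-1} | x_t, x_0) from the derivation above.
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def posterior(x0, xt, t):
    """Return (mean, variance) of q(x_{t-1} | x_t, x_0) for t >= 1."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]          # beta_tilde_t
    mean = (np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t) * xt
            + np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t) * x0)
    return mean, var

mean, var = posterior(np.zeros(4), np.zeros(4), 500)  # toy example
```

Note that `var` depends only on the schedule, not on the data, which is why the posterior variance can be precomputed for every $t$.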
So far we have covered the core concepts behind the two essential constituents of the diffusion model: the forward process and the reverse process. In the next post, we will go into detail about the training loss as well as the sampling algorithm of the diffusion model.

## References

[1] Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis. In NeurIPS 2021.

[2] Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML 2015.

[3] Feller W. On the theory of stochastic processes, with particular reference to applications. In Selected Papers I 2015 (pp. 769-798). Springer, Cham.

[4] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. In NeurIPS 2020.