Introduction to Diffusion Models

A gentle introduction to diffusion models and the key ideas behind them.
type: theory
level: advanced

Introduction

Recently, diffusion models have demonstrated remarkable performance on a wide range of generative tasks, from unconditional and text-guided image generation, audio synthesis, natural language generation, and human motion synthesis to 3D point cloud generation, most of which were previously dominated by GAN-based models. Research has also shown that diffusion models can outperform GANs in generation quality while offering better control over the results [1]. In this blog, we will go through the mathematics and intuition behind this compelling generative model.

Figure 1. Overview of Diffusion Model.

Mechanism behind Diffusion Model

The underlying principle of diffusion models comes from a well-known idea in non-equilibrium statistical thermodynamics: a Markov chain can gradually convert one distribution into another [2]. Intuitively, the model is called "diffusion" because data points in the original data space gradually diffuse from one position to another until they become nothing but noise. Unlike common VAE or GAN models, where the data is typically compressed into a low-dimensional latent space, diffusion models use a fixed noising procedure and latent variables with the same dimensionality as the original data (see Figure 1). Let's now define the forward diffusion process.

Forward diffusion process

Figure 2. Example of forward diffusion

Given a sample from our real data distribution $x_0 \sim q(x_0)$, we define the forward diffusion process as a Markov chain which gradually adds random Gaussian noise (with diagonal covariance) to the data under a pre-defined schedule over $t = 1, 2, \dots, T$ steps. Specifically, the distribution of the noised sample at step $t$ only depends on the previous step $t-1$:

$$\color{red}{q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)} \tag{1}$$

where $\beta_t$ is the noise schedule controlling the step size, i.e., how quickly noise is added in the process. The values are typically in $(0,1)$ and satisfy:

$$0 < \beta_1 < \beta_2 < \dots < \beta_T < 1$$
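To make this concrete, here is a minimal NumPy sketch of a single forward step under Eq. (1). The linear schedule and the specific values of $T$, $\beta_1$, $\beta_T$ below are illustrative assumptions (borrowed from common DDPM setups), not something fixed by the theory above.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_1 < beta_2 < ... < beta_T, all in (0, 1)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t (used later)

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * eps

# Running the chain: after enough steps the sample is close to pure Gaussian noise.
x = np.ones(4)                       # a toy "data point"
for t in range(T):
    x = forward_step(x, t)
```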

An interesting property of the forward process, which makes it possible to build a powerful generative model on top of it, is that when the noise step size is small enough and $T \rightarrow \infty$, the distribution at the final step $x_T$ approaches a standard Gaussian distribution $\mathcal{N}(0, I)$. In other words, we can easily sample from this (pure) noise distribution and then follow a reverse process to reconstruct a real data sample, which we will discuss later.

Another notable property of this process is that we can directly sample $x_t$ at any time step $t$ without going through the whole chain. Specifically, let $\alpha_t = 1 - \beta_t$; from Eq. 1, we can derive the sample using the reparameterization trick ($x \sim \mathcal{N}(\mu, \sigma^2) \Leftrightarrow x = \mu + \sigma\epsilon$) as follows:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1} \\ = \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon_{t-2}\right) + \sqrt{1 - \alpha_t}\, \epsilon_{t-1} \\ = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \left[\sqrt{\alpha_t(1 - \alpha_{t-1})}\, \epsilon_{t-2} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}\right]$$

where $\epsilon \sim \mathcal{N}(0, I)$. Note that when we add two independent zero-mean Gaussians with variances $\sigma_1^2$ and $\sigma_2^2$, the result is again a Gaussian with variance $\sigma_1^2 + \sigma_2^2$. Therefore, the merged variance of the bracketed sum $[\dots]$ in the equation above is:

$$\alpha_t(1 - \alpha_{t-1}) + 1 - \alpha_t = 1 - \alpha_t\alpha_{t-1}$$
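As a quick Monte Carlo sanity check of this identity (the specific values of $\alpha_t$ and $\alpha_{t-1}$ below are arbitrary, chosen only for illustration):

```python
import numpy as np

# sqrt(alpha_t (1 - alpha_{t-1})) * eps1 + sqrt(1 - alpha_t) * eps2 should have
# variance alpha_t (1 - alpha_{t-1}) + (1 - alpha_t) = 1 - alpha_t * alpha_{t-1}.
rng = np.random.default_rng(0)
alpha_t, alpha_tm1 = 0.98, 0.985
eps1 = rng.standard_normal(1_000_000)
eps2 = rng.standard_normal(1_000_000)
merged = np.sqrt(alpha_t * (1 - alpha_tm1)) * eps1 + np.sqrt(1 - alpha_t) * eps2
print(merged.var(), 1 - alpha_t * alpha_tm1)   # both ~0.0347
```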

Thus $x_t$ can be written as:

$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon}_{t-2} = \dots \\ = \sqrt{\alpha_t \alpha_{t-1} \cdots \alpha_1}\, x_0 + \sqrt{1 - \alpha_t \alpha_{t-1} \cdots \alpha_1}\, \epsilon \\ = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \tag{2}$$

where $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$. We can also rewrite the forward diffusion as a distribution conditioned directly on $x_0$:

$$\color{red}{q(x_t \vert x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)} \tag{3}$$

This nice property makes training much more efficient, since we only need to compute the loss at a randomly sampled timestep $t$ instead of running through the whole chain, as we will see later.
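For instance, a training loop could noise $x_0$ in one shot for a randomly drawn $t$ using Eq. (2). The sketch below reuses the linear schedule assumed in the earlier snippet; the function name `q_sample` is purely illustrative.

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 via Eq. (2)/(3):
    x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # \bar{alpha}_t for the assumed schedule
x0 = np.ones(4)
t = rng.integers(0, 1000)            # during training, t is drawn uniformly at random
x_t, eps = q_sample(x0, t, alpha_bars, rng)
```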

Reverse diffusion process

We have seen that the forward diffusion essentially pushes a sample off the data manifold and slowly transforms it into noise. The main goal of the reverse process is to learn a trajectory back to the manifold and thus reconstruct the sample in the original data distribution from the noise.

Feller [3] demonstrated that for a Gaussian diffusion process with sufficiently small step sizes $\beta_t$, the reverse of the diffusion process has the same functional form as the forward process. In other words, since $q(x_t | x_{t-1})$ is a Gaussian distribution, if $\beta_t$ is small then $q(x_{t-1} | x_t)$ will also be a Gaussian distribution.

Unfortunately, estimating the reverse distribution $q(x_{t-1} | x_t)$ directly is very difficult: in the early steps of the reverse process (when $t$ is near $T$), our sample is just random noise with high variance, and there are far too many possible trajectories leading back to the original data $x_0$; computing them exactly would require marginalizing over the entire data distribution. Ultimately, our goal is to learn a model that approximates these reverse distributions in order to build the generative reverse (denoising) process, specifically:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t), \quad p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

where $\mu_\theta(x_t, t)$ is the mean predicted by the model (a neural network with parameters $\theta$), and $\Sigma_\theta(x_t, t)$ is the variance, which is typically kept fixed for efficient learning.
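As a rough sketch, one reverse step could look like the following. Here `mu_theta` is only a stand-in for the learned mean predictor, and fixing the per-step variance to $\beta_t$ is one common choice (using $\tilde{\beta}_t$ from Eq. (4) below is the other standard option), so treat this as an assumption rather than the prescribed method.

```python
import numpy as np

def p_sample(x_t, t, mu_theta, betas, rng):
    """One reverse step: x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I).

    mu_theta is a placeholder for the learned network predicting the mean;
    the variance is fixed to beta_t here (an illustrative, common choice)."""
    mean = mu_theta(x_t, t)
    if t == 0:                        # no noise is added at the final step
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```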

Fortunately, by conditioning on $x_0$, the posterior of the reverse process becomes tractable and is a Gaussian distribution [4]. Our target now is to find the posterior mean $\tilde{\mu}_t(x_t, x_0)$ and variance $\tilde{\beta}_t$. Following Bayes' rule, we can write:

$$q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0)\, q(x_{t-1} | x_0)}{q(x_t | x_0)}$$

where

$$\begin{cases} q(x_t | x_{t-1}, x_0) = q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_{t-1}, \beta_t I), \\ q(x_{t-1} | x_0) = \mathcal{N}(x_{t-1}; \sqrt{\bar{\alpha}_{t-1}}\, x_0, (1 - \bar{\alpha}_{t-1}) I), \\ q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I) \end{cases}$$

Here $q(x_t | x_{t-1}, x_0) = q(x_t | x_{t-1})$ because the process is a Markov chain, where the future state ($x_t$) is conditionally independent of all earlier states ($x_0, \dots, x_{t-2}$) given the present state ($x_{t-1}$). We also leverage the fact that all three distributions on the right-hand side are Gaussian, so we can plug in the density function ($\mathcal{N}(x; \mu, \sigma) \propto \exp(-\frac{(x-\mu)^2}{2\sigma^2})$) as follows:

$$q(x_{t-1} | x_t, x_0) \propto \exp\left[-\frac{(x_t - \sqrt{\alpha_t}\, x_{t-1})^2}{2\beta_t} - \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0)^2}{2(1 - \bar{\alpha}_{t-1})} + \frac{(x_t - \sqrt{\bar{\alpha}_t}\, x_0)^2}{2(1 - \bar{\alpha}_t)}\right] \\ = \exp\left[-\frac{1}{2}\left(\frac{x_t^2 - 2\sqrt{\alpha_t}\, x_t x_{t-1} + \alpha_t x_{t-1}^2}{\beta_t} + \frac{x_{t-1}^2 - 2\sqrt{\bar{\alpha}_{t-1}}\, x_0 x_{t-1} + \bar{\alpha}_{t-1} x_0^2}{1 - \bar{\alpha}_{t-1}} - \frac{(x_t - \sqrt{\bar{\alpha}_t}\, x_0)^2}{1 - \bar{\alpha}_t}\right)\right] \\ = \exp\left[-\frac{1}{2}\left(\frac{\alpha_t x_{t-1}^2}{\beta_t} + \frac{x_{t-1}^2}{1 - \bar{\alpha}_{t-1}} - \frac{2\sqrt{\alpha_t}\, x_t x_{t-1}}{\beta_t} - \frac{2\sqrt{\bar{\alpha}_{t-1}}\, x_0 x_{t-1}}{1 - \bar{\alpha}_{t-1}}\right) + O(x_0, x_t)\right]$$

where $O(x_0, x_t)$ is a term that depends only on $x_0$ and $x_t$ but not on $x_{t-1}$, so it can be absorbed into the normalizing constant and ignored. Therefore we have:

$$q(x_{t-1} | x_t, x_0) \propto \exp\left[-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\right)\color{green}{x_{t-1}^2} - 2\left(\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} x_0\right)\color{orange}{x_{t-1}}\right)\right]$$

Comparing with the Gaussian density function $\exp(-\frac{(x-\mu)^2}{2\sigma^2}) = \exp(-\frac{1}{2}(\frac{\color{green}{x^2}}{\sigma^2} - \frac{2\mu \color{orange}{x}}{\sigma^2} + \frac{\mu^2}{\sigma^2}))$, we can see the similarities. Recall that $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i = \alpha_t \bar{\alpha}_{t-1}$ and $\beta_t + \alpha_t = 1$. Matching the quadratic term, we obtain the posterior variance:

$$\color{green}{\tilde{\beta}_t} = \frac{1}{\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}} = \frac{1}{\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t(1 - \bar{\alpha}_{t-1})}} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \tag{4}$$

Matching the linear term in the same way, we also obtain the mean:

$$\frac{\color{orange}{\tilde{\mu}_t(x_t, x_0)}}{\color{green}{\tilde{\beta}_t}} = \frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} x_0 \\ \rightarrow \tilde{\mu}_t(x_t, x_0) = \left(\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} x_0\right)\tilde{\beta}_t = \left(\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} x_0\right)\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t$$
$$\rightarrow \color{orange}{\tilde{\mu}_t(x_t, x_0)} = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 \tag{5}$$

Putting everything together, we obtain the final formula for the tractable reverse posterior:

$$\color{blue}{q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)} \tag{6}$$
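To tie Eqs. (4)-(6) together, here is a small sketch that computes the posterior mean and variance given $x_t$, $x_0$, and the schedule; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def q_posterior(x_t, x0, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from Eqs. (4)-(5).

    By convention, alpha_bar_{t-1} is taken to be 1 when t == 0."""
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]             # Eq. (4)
    mu_tilde = (np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t) * x_t
                + np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t) * x0)     # Eq. (5)
    return mu_tilde, beta_tilde
```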

So far we have covered the core concepts behind the two essential constituents of diffusion models: the forward process and the reverse process. In the next post, we will go into detail about the training loss as well as the sampling algorithm of diffusion models.

References

[1] Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis. In NeurIPS 2021.

[2] Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML 2015.

[3] Feller W. On the theory of stochastic processes, with particular reference to applications. In Selected Papers I 2015 (pp. 769-798). Springer, Cham.

[4] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. In NeurIPS 2020.