Diffusion models are a new class of state-of-the-art generative models that generate diverse high-resolution images. They have attracted a lot of attention since OpenAI, Nvidia, and Google managed to train large-scale models. Example architectures based on diffusion models are GLIDE, DALL-E 2, Imagen, and the fully open-source Stable Diffusion.

But what is the main principle behind them?

In this blog post, we will dig our way up from the basic principles. There are already a bunch of different diffusion-based architectures. We will focus on the most prominent one, the Denoising Diffusion Probabilistic Model (DDPM), as introduced by Sohl-Dickstein et al. and then developed by Ho et al. 2020. Other approaches, such as Stable Diffusion and score-based models, will be discussed to a smaller extent.

Diffusion models are fundamentally different from previous generative methods. Intuitively, they aim to decompose the image generation process (sampling) into many small “denoising” steps.

The intuition behind this is that the model can correct itself over these small steps and gradually produce a good sample. To some extent, this idea of refining the representation has already been used in models like AlphaFold. But hey, nothing comes at zero cost. This iterative process makes them slow at sampling, at least compared to GANs.

Diffusion process

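The basic idea behind diffusion models is rather simple. They take the input image $\mathbf{x}_0$ and gradually add Gaussian noise to it through a series of $T$ steps. We will call this the forward process.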

Afterward, a neural network is trained to recover the original data by reversing the noising process. By being able to model the reverse process, we can generate new data. This is the so-called reverse diffusion process or, in general, the sampling process of a generative model.

How? Let’s dive into the math to make it crystal clear.

Forward diffusion

Diffusion models can be seen as latent variable models. Latent means that we are referring to a hidden continuous feature space. In such a way, they may look similar to variational autoencoders (VAEs).

In practice, they are formulated using a Markov chain of $T$ steps. Here, a Markov chain means that each step only depends on the previous one, which is a mild assumption. Importantly, we are not constrained to using a specific type of neural network, unlike flow-based models.

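Given a data point $\mathbf{x}_0$ sampled from the real data distribution $q(\mathbf{x})$, the forward diffusion process adds Gaussian noise with variance $\beta_t$ to $\mathbf{x}_{t-1}$ at each step of the Markov chain, producing a new latent variable $\mathbf{x}_t$ with distribution $q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$:

$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_t=\sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \boldsymbol{\Sigma}_t = \beta_t\mathbf{I})$$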




Forward diffusion process. Image adapted from Ho et al. 2020

Since we are in the multi-dimensional scenario, $\mathbf{I}$ is the identity matrix, indicating that each dimension has the same variance $\beta_t$.

Thus, we can go from the input data $\mathbf{x}_0$ to $\mathbf{x}_T$ in a tractable way. Mathematically, the joint distribution of the whole trajectory factorizes as:

$$q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$$

The symbol $:$ in $q(\mathbf{x}_{1:T})$ states that we apply $q$ repeatedly from timestep $1$ to $T$.

So far, so good? Well, nah! For timestep $t=500 < T$ we need to apply $q$ 500 times in order to sample $\mathbf{x}_t$.

The reparameterization trick provides a magic remedy to this.

The reparameterization trick: tractable closed-form sampling at any timestep

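If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, with $\boldsymbol{\epsilon}_0, \dots, \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, we can apply the reparameterization trick recursively to show that:

$$\begin{aligned}
\mathbf{x}_t
&=\sqrt{1 - \beta_t} \mathbf{x}_{t-1} + \sqrt{\beta_t}\boldsymbol{\epsilon}_{t-1}\\
&= \sqrt{\alpha_t \alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\boldsymbol{\epsilon}_{t-2} \\
&= \dots \\
&= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_0
\end{aligned}$$

In the second line, $\boldsymbol{\epsilon}_{t-2}$ stands for the merged noise: the sum of two independent zero-mean Gaussians with variances $\alpha_t(1 - \alpha_{t-1})$ and $1 - \alpha_t$ is again Gaussian, with variance $1 - \alpha_t \alpha_{t-1}$.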

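Note: Since the noise at every timestep is drawn from the same standard Gaussian, we will only use the symbol $\boldsymbol{\epsilon}$ from now on.

Thus, to produce a sample $\mathbf{x}_t$ we can use the following distribution:

$$\mathbf{x}_t \sim q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})$$

Since $\beta_t$ is a hyperparameter, we can precompute $\alpha_t$ and $\bar{\alpha}_t$ for all timesteps. This means that we can sample the latent variable $\mathbf{x}_t$ at any arbitrary timestep in one go.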
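To make the shortcut concrete, here is a minimal NumPy sketch (not the authors' code) that draws $\mathbf{x}_t$ both ways: once with the closed-form expression and once by applying the one-step kernel $t$ times. The linear schedule, the toy one-pixel "image", and the function names `q_sample` and `q_sample_iterative` are assumptions made for illustration.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 1-based timestep."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

def q_sample_iterative(x0, t, betas, rng):
    """Same distribution, obtained by applying q(x_s | x_{s-1}) t times."""
    x = x0
    for beta in betas[:t]:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)       # alpha_bar_t = prod_s alpha_s
x0 = np.full(20000, 2.0)                  # toy data: many copies of one pixel value
xt_closed, _ = q_sample(x0, 500, alpha_bar, rng)
xt_iter = q_sample_iterative(x0, 500, betas, rng)
```

Both arrays should have approximately the same mean, $\sqrt{\bar{\alpha}_{500}}\,\mathbf{x}_0$, and the same standard deviation, $\sqrt{1 - \bar{\alpha}_{500}}$, but the closed form needs a single noise draw instead of 500.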

Variance schedule

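The variance parameter $\beta_t$ can be fixed to a constant or chosen as a schedule over the $T$ timesteps. A variance schedule can be, for example, linear or cosine. The original DDPM authors used a linear schedule increasing from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$, while Nichol & Dhariwal 2021 showed that a cosine schedule can work even better.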




Latent samples from linear (top) and cosine (bottom) schedules respectively. Source: Nichol & Dhariwal 2021
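As a rough sketch of the two schedules in the figure, the snippet below builds a linear and a cosine $\beta_t$ sequence. The cosine form defines $\bar{\alpha}_t$ through a squared cosine and derives $\beta_t$ from consecutive ratios; the offset `s=0.008`, the clipping value `0.999`, and the linear endpoints are assumptions based on our reading of the respective papers, so treat them as illustrative defaults.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Betas increasing linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Define alpha_bar via a squared cosine, then derive betas from its ratios."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)   # avoid a singular last step

T = 1000
ab_linear_mid = np.cumprod(1.0 - linear_beta_schedule(T))[T // 2 - 1]
ab_cosine_mid = np.cumprod(1.0 - cosine_beta_schedule(T))[T // 2 - 1]
```

Halfway through the chain, `ab_cosine_mid` is noticeably larger than `ab_linear_mid`, which matches the figure: the cosine schedule destroys the image more slowly.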

Reverse diffusion

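As $T \to \infty$, the latent $\mathbf{x}_T$ is nearly an isotropic Gaussian distribution. Therefore, if we manage to learn the reverse distribution $q(\mathbf{x}_{t-1} \vert \mathbf{x}_{t})$, we can sample $\mathbf{x}_T$ from $\mathcal{N}(\mathbf{0}, \mathbf{I})$, run the reverse process, and obtain a sample from $q(\mathbf{x}_0)$, generating a novel data point from the original data distribution.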

The question is how we can model the reverse diffusion process.

Approximating the reverse process with a neural network

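In practical terms, we don’t know $q(\mathbf{x}_{t-1} \vert \mathbf{x}_{t})$. It is intractable, since estimating it requires computations involving the whole data distribution.

Instead, we approximate $q(\mathbf{x}_{t-1} \vert \mathbf{x}_{t})$ with a parameterized model $p_\theta$ (e.g. a neural network). Since $q(\mathbf{x}_{t-1} \vert \mathbf{x}_{t})$ is also Gaussian for small enough $\beta_t$, we can choose $p_\theta$ to be Gaussian and just parameterize its mean and variance:

$$p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$$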




Reverse diffusion process. Image adapted from Ho et al. 2020

If we apply the reverse formula for all timesteps ($p_\theta(\mathbf{x}_{0:T})$, also called the trajectory), we can go from $\mathbf{x}_T$ to the data distribution:

$$p_\theta(\mathbf{x}_{0:T}) = p_{\theta}(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$$

By additionally conditioning the model on timestep $t$, it will learn to predict the Gaussian parameters (meaning the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ and the covariance matrix $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$) for each timestep.

But how do we train such a model?

Training a diffusion model

If we take a step back, we can notice that the combination of $q$ and $p$ is very similar to a variational autoencoder (VAE). Thus, we can train it by optimizing the negative log-likelihood of the training data. After a series of calculations, which we won’t analyze here, we can write the evidence lower bound (ELBO) as follows:

$$\begin{aligned}
\log p(\mathbf{x}) \geq\; &\mathbb{E}_{q(\mathbf{x}_1 \vert \mathbf{x}_0)} [\log p_{\theta} (\mathbf{x}_0 \vert \mathbf{x}_1)] \\
&- D_{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \,\vert\vert\, p(\mathbf{x}_T)) \\
&- \sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_t \vert \mathbf{x}_0)} [D_{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \,\vert\vert\, p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)) ] \\
=\; & L_0 - L_T - \sum_{t=2}^T L_{t-1}
\end{aligned}$$

Let’s analyze these terms:

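  1. The $\mathbb{E}_{q(\mathbf{x}_1 \vert \mathbf{x}_0)} [\log p_{\theta} (\mathbf{x}_0 \vert \mathbf{x}_1)]$ term can be characterized as a reconstruction term, similar to the one in the ELBO of a variational autoencoder.

  2. $D_{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \,\vert\vert\, p(\mathbf{x}_T))$ shows how close $\mathbf{x}_T$ is to the standard Gaussian. Note that this term has no trainable parameters, so it is ignored during training.

  3. The third term $\sum_{t=2}^T L_{t-1}$, also referred to as $L_t$, measures the difference between the learned denoising steps $p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ and the tractable targets $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$.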

It is evident that through the ELBO, maximizing the likelihood boils down to learning the denoising steps $L_t$.

Important note: Even though $q(\mathbf{x}_{t-1} \vert \mathbf{x}_{t})$ is intractable, Sohl-Dickstein et al. showed that additionally conditioning on $\mathbf{x}_0$ makes it tractable.

Intuitively, a painter (our generative model) needs a reference image ($\mathbf{x}_0$) to slowly draw (reverse diffusion step $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$) an image. Thus, we can take a small step backward, from noise toward an image, only if we have $\mathbf{x}_0$ as a reference.

In other words, we can sample $\mathbf{x}_t$ at noise level $t$ conditioned on $\mathbf{x}_0$, and the reverse step conditioned on both is again Gaussian:

$$\begin{aligned}
q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}) \\
\tilde{\beta}_t &= \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \\
\tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0) &= \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t
\end{aligned}$$

Note that $\alpha_t$ and $\bar{\alpha}_t$ depend only on $\beta_t$, so they can be precomputed.

This little trick provides us with a fully tractable ELBO. The above property has one more important side effect: as we already saw in the reparameterization trick, we can represent $\mathbf{x}_0$ as

$$\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}),$$

where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$.

By combining the last two equations, each timestep will now have a mean $\tilde{\boldsymbol{\mu}}_t$ (our target) that depends only on $\mathbf{x}_t$ and the noise $\boldsymbol{\epsilon}$:

$$\tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t) = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon} \Big)$$

Therefore, we can use a neural network $\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)$ to approximate $\boldsymbol{\epsilon}$ and, consequently, the mean:

$$\boldsymbol{\mu}_{\theta}( \mathbf{x}_t,t) = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t) \Big)$$
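The two ways of writing the posterior mean (in terms of $\mathbf{x}_0$, or in terms of the noise $\boldsymbol{\epsilon}$) can be checked numerically. The sketch below is only a sanity check under assumed settings: a linear schedule, a random 16-dimensional toy vector in place of an image, and the helper names `posterior_mean` and `mean_from_eps`, which are ours.

```python
import numpy as np

def posterior_mean(x0, xt, t, betas):
    """Mean of q(x_{t-1} | x_t, x_0) for a 1-based timestep t >= 2."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = np.sqrt(ab_prev) * betas[t - 1] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x0 + coef_xt * xt

def mean_from_eps(xt, t, eps, betas):
    """The same mean, written in terms of the noise eps instead of x_0."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    coef = betas[t - 1] / np.sqrt(1.0 - alpha_bar[t - 1])
    return (xt - coef * eps) / np.sqrt(alphas[t - 1])

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
t = 300
x0 = rng.standard_normal(16)               # toy data vector
eps = rng.standard_normal(16)
xt = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
gap = np.max(np.abs(posterior_mean(x0, xt, t, betas) - mean_from_eps(xt, t, eps, betas)))
```

Here `gap` is zero up to floating-point error, which is exactly why a network that predicts $\boldsymbol{\epsilon}$ is enough to recover the mean of the reverse step.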

Thus, the loss function (the denoising term in the ELBO) can be expressed as:

$$\begin{aligned}
L_t &= \mathbb{E}_{\mathbf{x}_0,t,\boldsymbol{\epsilon}}\Big[\frac{1}{2\|\boldsymbol{\Sigma}_\theta (\mathbf{x}_t,t)\|_2^2} \|\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\|_2^2 \Big] \\
&= \mathbb{E}_{\mathbf{x}_0,t,\boldsymbol{\epsilon}}\Big[\frac{\beta_t^2}{2\alpha_t (1 - \bar{\alpha}_t) \|\boldsymbol{\Sigma}_\theta\|^2_2} \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t ) \|^2 \Big]
\end{aligned}$$

This effectively shows us that instead of predicting the mean of the distribution, the model will predict the noise $\boldsymbol{\epsilon}$ at each timestep $t$.

Ho et al. 2020 simplified the actual loss term by ignoring the weighting term:

$$L_t^\text{simple} = \mathbb{E}_{\mathbf{x}_0, t, \boldsymbol{\epsilon}} \Big[\|\boldsymbol{\epsilon}- \boldsymbol{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t ) \|^2 \Big]$$

The authors found that optimizing this simplified objective works better than optimizing the original ELBO. The proof for both equations can be found in this excellent post by Lilian Weng or in Luo 2022.
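As a minimal illustration of the simplified objective, the following NumPy sketch computes a Monte Carlo estimate of $L_t^\text{simple}$ for a batch. It is not a training loop: `eps_model` is a hypothetical callable standing in for the network $\boldsymbol{\epsilon}_\theta$, and the all-zeros toy dataset is chosen only because the optimal noise predictor is then known in closed form.

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bar, rng):
    """Monte Carlo estimate of L_simple for a batch x0 of shape (B, D)."""
    B = x0.shape[0]
    t = rng.integers(1, len(alpha_bar) + 1, size=B)   # one uniform timestep per sample
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1][:, None]
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps     # closed-form forward sample
    return np.mean(np.sum((eps - eps_model(xt, t)) ** 2, axis=1))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed linear schedule
x0 = np.zeros((4000, 8))                                      # toy dataset: all-zero vectors
zero_model = lambda xt, t: np.zeros_like(xt)                  # predicts no noise at all
oracle = lambda xt, t: xt / np.sqrt(1.0 - alpha_bar[t - 1])[:, None]  # exact noise when x0 = 0
loss_zero = simple_loss(zero_model, x0, alpha_bar, rng)
loss_oracle = simple_loss(oracle, x0, alpha_bar, rng)
```

A model that always predicts zero pays roughly the data dimension (8 here) in loss, while the oracle drives the loss to zero. A trained network sits somewhere in between.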

Additionally, Ho et al. 2020 decided to keep the variance fixed and have the network learn only the mean. This was later improved by Nichol & Dhariwal 2021, who let the network learn the covariance matrix $(\boldsymbol{\Sigma})$ as well (by modifying $L_t^\text{simple}$), achieving better results.




Training and sampling algorithms of DDPMs. Source: Ho et al. 2020
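The sampling algorithm can be sketched in a few lines of NumPy. This is a simplified reading of the procedure, assuming the fixed variance $\sigma_t^2 = \beta_t$ and a hypothetical callable `eps_model(x, t)` in place of the trained U-Net. To keep the example checkable, the "model" below is the exact noise predictor for a dataset that contains a single point `c`.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: start from pure noise and denoise for t = T, ..., 1."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                       # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)  # no noise at the last step
        coef = betas[t - 1] / np.sqrt(1.0 - alpha_bar[t - 1])
        mean = (x - coef * eps_model(x, t)) / np.sqrt(alphas[t - 1])
        x = mean + np.sqrt(betas[t - 1]) * z             # assumes sigma_t^2 = beta_t
    return x

betas = np.linspace(1e-4, 0.02, 200)                     # assumed short linear schedule
alpha_bar = np.cumprod(1.0 - betas)
c = 1.5                                                  # the only "image" in the toy dataset

def oracle_eps(x, t):
    """Exact noise predictor when every training sample equals c."""
    return (x - np.sqrt(alpha_bar[t - 1]) * c) / np.sqrt(1.0 - alpha_bar[t - 1])

samples = ddpm_sample(oracle_eps, (1000,), betas, np.random.default_rng(0))
```

With this oracle every sample lands on `c`, as expected for a one-point dataset. With a learned $\boldsymbol{\epsilon}_\theta$ the same loop produces new images.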

Architecture

One thing that we haven’t mentioned so far is what the model’s architecture looks like. Notice that the model’s input and output should be of the same size.

To this end, Ho et al. employed a U-Net. If you are unfamiliar with U-Nets, feel free to check out our past article on the major U-Net architectures. In a few words, a U-Net is a symmetric architecture with input and output of the same spatial size that uses skip connections between encoder and decoder blocks of corresponding feature dimension. Usually, the input image is first downsampled and then upsampled until reaching its initial size.

In the original implementation of DDPMs, the U-Net consists of Wide ResNet blocks, group normalization as well as self-attention blocks.

The diffusion timestep $t$ is specified by adding a sinusoidal position embedding into each residual block. For more details, feel free to visit the official GitHub repository. For a detailed implementation of the diffusion model, check out this awesome post by Hugging Face.
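For intuition, a sinusoidal timestep embedding can look like the sketch below. This follows the common Transformer-style construction and is an assumption about the general recipe, not a copy of the official implementation: the name `timestep_embedding`, the `max_period` value, and the sine-then-cosine layout are illustrative choices.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps t of shape (B,) to embeddings of shape (B, dim); dim must be even."""
    half = dim // 2
    # geometrically spaced frequencies from 1 down to about 1 / max_period
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding(np.array([0, 10, 999]), dim=16)
```

Each row of `emb` is a smooth, bounded code for one timestep. In a U-Net it would typically pass through a small MLP before being added to the features of a residual block.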




The U-Net architecture. Source: Ronneberger et al.

Conditional Image Generation: Guided Diffusion

A crucial aspect of image generation is conditioning the sampling process to manipulate the generated samples. Here, this is also referred to as guided diffusion.

There have even been methods that incorporate image embeddings into the diffusion in order to “guide” the generation. Mathematically, guidance refers to conditioning a prior data distribution $p(\mathbf{x})$ with a condition $y$, i.e. the class label or an image/text embedding, resulting in $p(\mathbf{x} \vert y)$.

To turn a diffusion model $p_\theta$ into a conditional diffusion model, we can add the conditioning information $y$ at each diffusion step:

$$p_\theta(\mathbf{x}_{0:T} \vert y) = p_\theta(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t, y)$$

The fact that the conditioning is being seen at each timestep may be a good justification for the excellent samples from a text prompt.

In general, guided diffusion models aim to learn $\nabla \log p_\theta( \mathbf{x}_t \vert y)$. Using Bayes’ rule, we can write:

$$\begin{aligned}
\nabla_{\mathbf{x}_{t}} \log p_\theta(\mathbf{x}_t \vert y) &= \nabla_{\mathbf{x}_{t}} \log \Big(\frac{p_\theta(y \vert \mathbf{x}_t) p_\theta(\mathbf{x}_t) }{p_\theta(y)}\Big) \\
&= \nabla_{\mathbf{x}_{t}} \log p_\theta(\mathbf{x}_t) + \nabla_{\mathbf{x}_{t}} \log p_\theta( y \vert\mathbf{x}_t )
\end{aligned}$$

The term $p_\theta(y)$ drops out because it does not depend on $\mathbf{x}_t$, so its gradient with respect to $\mathbf{x}_t$ is zero.

And by adding a guidance scalar term $s$, we have:

$$\nabla \log p_\theta(\mathbf{x}_t \vert y) = \nabla \log p_\theta(\mathbf{x}_t) + s \cdot \nabla \log p_\theta( y \vert\mathbf{x}_t )$$

Using this formulation, let’s make a distinction between classifier and classifier-free guidance. Next, we will present these two families of methods, which aim at injecting label information.

Classifier guidance

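Sohl-Dickstein et al. and later Dhariwal and Nichol showed that we can use a second model, a classifier $f_\phi(y \vert \mathbf{x}_t, t)$, to guide the diffusion toward the target class $y$. To achieve that, we train the classifier on noisy images $\mathbf{x}_t$ to predict their class $y$, and then use the gradients $\nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t, t)$ to guide the sampling.

We can build a class-conditional diffusion model with mean $\mu_\theta(\mathbf{x}_t \vert y)$ and variance $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t \vert y)$.

Since $p_\theta \sim \mathcal{N}(\mu_{\theta}, \Sigma_{\theta})$, we can show, using the guidance formulation from the previous section, that the mean is perturbed by the gradients of $\log f_\phi(y \vert \mathbf{x}_t)$ of class $y$, resulting in:

$$\hat{\mu}(\mathbf{x}_t \vert y) =\mu_\theta(\mathbf{x}_t \vert y) + s \cdot \boldsymbol{\Sigma}_\theta(\mathbf{x}_t \vert y) \nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t, t)$$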
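The mean shift above is easy to try on a toy problem. In the sketch below the "classifier" is a logistic model $\sigma(\mathbf{w} \cdot \mathbf{x})$ whose log-gradient is known analytically. It merely stands in for $f_\phi(y \vert \mathbf{x}_t, t)$, which in practice is a neural network differentiated with autograd. The diagonal covariance and all names are assumptions for illustration.

```python
import numpy as np

def grad_log_sigmoid_classifier(x, w):
    """Gradient w.r.t. x of log sigmoid(w . x), a toy stand-in for log f_phi(y | x_t, t)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (1.0 - p) * w

def guided_mean(mean, var, grad_log_classifier, scale):
    """Shift the reverse-process mean along the classifier gradient (diagonal covariance)."""
    return mean + scale * var * grad_log_classifier

rng = np.random.default_rng(0)
w = rng.standard_normal(4)            # toy classifier weights
mean = rng.standard_normal(4)         # toy reverse-process mean
var = 0.01 * np.ones(4)               # toy diagonal of Sigma_theta
grad = grad_log_sigmoid_classifier(mean, w)
shifted = guided_mean(mean, var, grad, scale=5.0)
```

The shifted mean has a higher classifier logit than the original one (`shifted @ w > mean @ w`), and a larger guidance scale `s` pushes it further toward the target class.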

In the famous GLIDE paper by Nichol et al., the authors expanded on this idea and used CLIP embeddings to guide the diffusion. CLIP, as proposed by Radford et al., consists of an image encoder $g$ and a text encoder $h$. It produces image and text embeddings $g(\mathbf{x}_t)$ and $h(c)$, respectively, where $c$ is the text caption.

Therefore, we can perturb the gradients with their dot product:

$$\hat{\mu}(\mathbf{x}_t \vert c) =\mu(\mathbf{x}_t \vert c) + s \cdot \boldsymbol{\Sigma}_\theta(\mathbf{x}_t \vert c) \nabla_{\mathbf{x}_t} \big( g(\mathbf{x}_t) \cdot h(c) \big)$$

As a result, they manage to “steer” the generation process toward a user-defined text caption.




Algorithm of classifier-guided diffusion sampling. Source: Dhariwal & Nichol 2021

Classifier-free guidance

Using the same formulation as before, we can define a classifier-free guided diffusion model as:

$$\nabla \log p(\mathbf{x}_t \vert y) = s \cdot \nabla \log p(\mathbf{x}_t \vert y) + (1-s) \cdot \nabla \log p(\mathbf{x}_t)$$

Guidance can be achieved without a second classifier model, as proposed by Ho & Salimans. Instead of training a separate classifier, the authors trained a conditional diffusion model $\boldsymbol{\epsilon}_\theta (\mathbf{x}_t \vert y)$ together with an unconditional model $\boldsymbol{\epsilon}_\theta (\mathbf{x}_t \vert 0)$, using the same neural network for both. During training, the class $y$ is randomly set to $0$, so that the model is exposed to both the conditional and the unconditional setup:

$$\begin{aligned}
\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t \vert y) & = s \cdot \boldsymbol{\epsilon}_\theta(\mathbf{x}_t \vert y) + (1-s) \cdot \boldsymbol{\epsilon}_\theta(\mathbf{x}_t \vert 0) \\
&= \boldsymbol{\epsilon}_\theta(\mathbf{x}_t \vert 0) + s \cdot (\boldsymbol{\epsilon}_\theta(\mathbf{x}_t \vert y) -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t \vert 0) )
\end{aligned}$$

Note that this can also be used to “inject” text embeddings as we showed in classifier guidance.

This admittedly “weird” process has two major advantages:

  • It uses only a single model to guide the diffusion.

  • It simplifies guidance when conditioning on information that is difficult to predict with a classifier (such as text embeddings).
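In code, the classifier-free combination is a one-liner applied to two forward passes of the same network. The sketch below uses random arrays in place of real network outputs, and the function name `cfg_eps` is ours.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, s):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return eps_uncond + s * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((2, 4))     # stand-in for eps_theta(x_t | y)
eps_uncond = rng.standard_normal((2, 4))   # stand-in for eps_theta(x_t | 0)
guided = cfg_eps(eps_cond, eps_uncond, s=3.0)
```

With `s=0` we recover the unconditional prediction and with `s=1` the conditional one. Values above 1 extrapolate past the conditional prediction, which is the regime typically used to strengthen the conditioning.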

Imagen, as proposed by Saharia et al., relies heavily on classifier-free guidance, as the authors find that it is a key contributor to generating samples with strong image-text alignment. For more info on the approach of Imagen, check out the video from AI Coffee Break with Letitia.

Scaling up diffusion models

You might be asking what is the problem with these models. Well, it’s computationally very expensive to scale these U-Nets to high-resolution images. This brings us to two methods for scaling up diffusion models to higher resolutions: cascade diffusion models and latent diffusion models.

Cascade diffusion models

Ho et al. 2021 introduced cascade diffusion models in an effort to produce high-fidelity images. A cascade diffusion model consists of a pipeline of many sequential diffusion models that generate images of increasing resolution. Each model generates a sample of higher quality than the previous one by successively upsampling the image and adding higher-resolution details. To generate an image, we sample sequentially from each diffusion model.




Cascade diffusion model pipeline. Source: Ho & Saharia et al.

To acquire good results with cascaded architectures, strong data augmentations on the input of each super-resolution model are crucial. Why? Because they alleviate the compounding error from the previous models in the cascade, as well as the train-test mismatch.

It was found that Gaussian blurring is a critical transformation toward achieving high fidelity. The authors refer to this technique as conditioning augmentation.

Stable diffusion: Latent diffusion models

Latent diffusion models are based on a rather simple idea: instead of applying the diffusion process directly on a high-dimensional input, we project the input into a smaller latent space and apply the diffusion there.

In more detail, Rombach et al. proposed to use an encoder network to encode the input into a latent representation, i.e. $\mathbf{z}_t = g(\mathbf{x}_t)$. The intuition behind this decision is to lower the computational demands of training diffusion models by processing the input in a lower-dimensional space. Afterward, a standard diffusion model (U-Net) is applied to generate new data, which are upsampled by a decoder network.

If the loss for a typical diffusion model (DM) is formulated as:

$$L_{DM} = \mathbb{E}_{\mathbf{x}, t, \boldsymbol{\epsilon}} \Big[\| \boldsymbol{\epsilon}- \boldsymbol{\epsilon}_{\theta}( \mathbf{x}_t, t ) \|^2 \Big]$$

then given an encoder $\mathcal{E}$ and a latent representation $z$, the loss for a latent diffusion model (LDM) is:

$$L_{LDM} = \mathbb{E}_{ \mathcal{E}(\mathbf{x}), t, \boldsymbol{\epsilon}} \Big[\| \boldsymbol{\epsilon}- \boldsymbol{\epsilon}_{\theta}( \mathbf{z}_t, t ) \|^2 \Big]$$




Latent diffusion models. Source: Rombach et al.


Score-based generative models

Around the same time as the DDPM paper, Song and Ermon proposed a different type of generative model that appears to have many similarities with diffusion models. Score-based models tackle generative learning using score matching and Langevin dynamics.

Score-matching refers to the process of modeling the gradient of the log probability density function, also known as the score function. Langevin dynamics is an iterative process that can draw samples from a distribution using only its score function.

$$\mathbf{x}_t=\mathbf{x}_{t-1}+\frac{\delta}{2} \nabla_{\mathbf{x}} \log p\left(\mathbf{x}_{t-1}\right)+\sqrt{\delta} \boldsymbol{\epsilon}, \quad \text{where } \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

where $\delta$ is the step size.
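To see Langevin dynamics at work, the sketch below samples from a one-dimensional Gaussian using nothing but its score, which for $\mathcal{N}(\mu, \sigma^2)$ is $-(x - \mu)/\sigma^2$. The target parameters, the step size, and the number of steps are arbitrary choices for illustration. In a real score-based model, `score_fn` would be a trained network.

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size, n_steps, rng):
    """Iterate x <- x + (delta / 2) * score(x) + sqrt(delta) * eps."""
    x = np.array(x_init, dtype=np.float64)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * noise
    return x

mu, sigma = 3.0, 0.5                               # assumed toy target N(mu, sigma^2)
score = lambda x: -(x - mu) / sigma**2             # analytic score of the target
rng = np.random.default_rng(0)
x_init = 4.0 * rng.standard_normal(5000)           # start far from the target
samples = langevin_sample(score, x_init, step_size=0.01, n_steps=500, rng=rng)
```

After enough steps the empirical mean and standard deviation of `samples` are close to `mu` and `sigma`, up to a small bias that shrinks with the step size.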

Suppose that we have a probability density $p(x)$ and that we define the score function to be $\nabla_x \log p(x)$. We can then train a neural network $\mathbf{s}_\theta$ to estimate $\nabla_x \log p(x)$ without estimating $p(x)$ first. The training objective can be formulated as follows:

$$\mathbb{E}_{p(\mathbf{x})}[\| \nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \|_2^2] = \int p(\mathbf{x}) \| \nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \|_2^2 \mathrm{d}\mathbf{x}$$

Then by using Langevin dynamics, we can directly sample from $p(x)$ using the approximated score function.

In case you missed it, guided diffusion models use this formulation of score-based models, as they directly learn $\nabla_x \log p(x)$.

Adding noise to score-based models: Noise Conditional Score Networks (NCSN)

The problem so far: the estimated score functions are usually inaccurate in low-density regions, where few data points are available. As a result, the quality of data sampled using Langevin dynamics is not good.

Their solution was to perturb the data points with noise and train score-based models on the noisy data points instead. As a matter of fact, they used multiple scales of Gaussian noise perturbations.

Thus, adding noise is the key to making both DDPMs and score-based models work.




Score-based generative modeling with score matching + Langevin dynamics. Source: Generative Modeling by Estimating Gradients of the Data Distribution

Mathematically, given the data distribution $p(x)$, we perturb with Gaussian noise $\mathcal{N}(\mathbf{0}, \sigma_i^2 I)$, where $i=1,2,\cdots,L$, to obtain a noise-perturbed distribution:

$$p_{\sigma_i}(\mathbf{x}) = \int p(\mathbf{y}) \mathcal{N}(\mathbf{x}; \mathbf{y}, \sigma_i^2 I) \mathrm{d} \mathbf{y}$$

Then we train a network $\mathbf{s}_\theta(\mathbf{x},i)$, known as a Noise Conditional Score Network (NCSN), to estimate the score function $\nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x})$. The training objective is a weighted sum of Fisher divergences over all noise scales:

$$\sum_{i=1}^L \lambda(i) \mathbb{E}_{p_{\sigma_i}(\mathbf{x})}[\| \nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, i) \|_2^2]$$
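A hedged sketch of how such an objective can be estimated in practice: since the score of the perturbed distribution is not available directly, the code below uses the denoising score matching target, i.e. the score of the Gaussian perturbation kernel, $-(\tilde{\mathbf{x}} - \mathbf{x})/\sigma_i^2$. The geometric noise scales, the weighting $\lambda(i) = \sigma_i^2$, the uniform sampling of $i$ (which estimates the sum up to the constant factor $L$), and the callable `score_model(x, i)` are all assumptions for illustration.

```python
import numpy as np

def geometric_sigmas(sigma_max, sigma_min, L):
    """L noise scales decreasing geometrically from sigma_max to sigma_min."""
    return np.geomspace(sigma_max, sigma_min, L)

def ncsn_loss(score_model, x, sigmas, rng):
    """Denoising score matching over random noise levels, weighted by sigma_i^2."""
    B = x.shape[0]
    idx = rng.integers(0, len(sigmas), size=B)         # one noise level per sample
    sigma = sigmas[idx][:, None]
    noise = rng.standard_normal(x.shape)
    x_noisy = x + sigma * noise
    target = -noise / sigma                            # score of N(x_noisy; x, sigma^2 I)
    pred = score_model(x_noisy, idx)
    return np.mean(np.sum((pred - target) ** 2, axis=1) * sigma[:, 0] ** 2)

rng = np.random.default_rng(0)
sigmas = geometric_sigmas(1.0, 0.01, 10)
x = np.zeros((4000, 8))                                # toy dataset: all-zero vectors
zero_model = lambda x_noisy, idx: np.zeros_like(x_noisy)
oracle = lambda x_noisy, idx: -x_noisy / sigmas[idx][:, None] ** 2   # exact score when data is 0
loss_zero = ncsn_loss(zero_model, x, sigmas, rng)
loss_oracle = ncsn_loss(oracle, x, sigmas, rng)
```

The $\sigma_i^2$ weighting keeps every noise level on the same scale: the zero model's loss is about the data dimension regardless of which $\sigma_i$ is drawn, while the oracle score reaches zero.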

Score-based generative modeling through stochastic differential equations (SDE)

Song et al. 2021 explored the connection of score-based models with diffusion models. In an effort to encapsulate both NCSNs and DDPMs under the same umbrella, they proposed the following:

Instead of perturbing data with a finite number of noise distributions, we use a continuum of distributions that evolve over time according to a diffusion process. This process is modeled by a prescribed stochastic differential equation (SDE) that does not depend on the data and has no trainable parameters. By reversing the process, we can generate new samples.




Score-based generative modeling through stochastic differential equations (SDE). Source: Song et al. 2021

We can define the diffusion process $\{ \mathbf{x}(t) \}_{t\in [0, T]}$ as an SDE in the following form:

$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t) \mathrm{d}t + g(t) \mathrm{d} \mathbf{w}$$

where $\mathbf{w}$ is the Wiener process (a.k.a. Brownian motion), $\mathbf{f}(\cdot, t)$ is a vector-valued function called the drift coefficient of $\mathbf{x}(t)$, and $g(\cdot)$ is a scalar function known as the diffusion coefficient of $\mathbf{x}(t)$. Note that the SDE typically has a unique strong solution.

To make sense of why we use an SDE, here is a tip: the SDE is inspired by the Brownian motion, in which a number of particles move randomly inside a medium. This randomness of the particles’ motion models the continuous noise perturbations on the data.

After perturbing the original data distribution for a sufficiently long time, the perturbed distribution becomes close to a tractable noise distribution.

To generate new samples, we need to reverse the diffusion process. The SDE was chosen to have a corresponding reverse SDE in closed form:

$$\mathrm{d}\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g^2(t) \nabla_\mathbf{x} \log p_t(\mathbf{x})]\mathrm{d}t + g(t) \mathrm{d} \mathbf{w}$$

To compute the reverse SDE, we need to estimate the score function $\nabla_\mathbf{x} \log p_t(\mathbf{x})$. This is done with a time-dependent score-based model $\mathbf{s}_\theta(\mathbf{x}, t)$, trained with a continuous weighted combination of Fisher divergences:

$$\mathbb{E}_{t \sim \mathcal{U}(0, T)}\mathbb{E}_{p_t(\mathbf{x})}[\lambda(t) \| \nabla_\mathbf{x} \log p_t(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, t) \|_2^2]$$

where $\mathcal{U}(0, T)$ denotes a uniform distribution over the time interval, and $\lambda$ is a positive weighting function. Once we have the score function, we can plug it into the reverse SDE and solve it in order to sample $\mathbf{x}(0)$ from the original data distribution $p_0(\mathbf{x})$.
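As one possible way to solve the reverse SDE, the sketch below applies a plain Euler-Maruyama discretization, stepping backward in time. To keep it verifiable, it assumes a toy setup that is not from the paper: a forward SDE with constant coefficients, $\mathbf{f}(\mathbf{x}, t) = -\tfrac{1}{2}\beta \mathbf{x}$ and $g(t) = \sqrt{\beta}$, and one-dimensional Gaussian data $\mathcal{N}(m, s_0^2)$, so that $p_t$ and its score are known analytically and no network is needed.

```python
import numpy as np

beta, m, s0, T = 1.0, 2.0, 0.3, 8.0     # toy constants (assumptions)

def marginal(t):
    """Mean and variance of p_t for the toy forward SDE started from N(m, s0^2)."""
    decay = np.exp(-beta * t)
    return m * np.sqrt(decay), s0**2 * decay + (1.0 - decay)

def score(x, t):
    """Analytic score of the Gaussian marginal p_t; a trained s_theta would replace this."""
    mean, var = marginal(t)
    return -(x - mean) / var

def reverse_sde_sample(n, n_steps, rng):
    """Euler-Maruyama on the reverse SDE, integrating from t = T down to t = 0."""
    dt = T / n_steps
    x = rng.standard_normal(n)                          # p_T is close to N(0, 1)
    for i in range(n_steps):
        t = T - i * dt
        drift = -0.5 * beta * x - beta * score(x, t)    # f(x, t) - g(t)^2 * score
        x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n)
    return x

samples = reverse_sde_sample(5000, 2000, np.random.default_rng(0))
```

Starting from pure noise, the samples end up distributed approximately as the data distribution $\mathcal{N}(2, 0.3^2)$. Swapping the analytic `score` for a learned $\mathbf{s}_\theta(\mathbf{x}, t)$ gives the generative procedure described above.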

There are a number of options to solve the reverse SDE which we won’t analyze here. Make sure to check the original paper or this excellent blog post by the author.




Overview of score-based generative modeling through SDEs. Source: Song et al. 2021

Summary

Let’s do a quick sum-up of the main points we learned in this blogpost:

  • Diffusion models work by gradually adding Gaussian noise to the original image through a series of $T$ steps, a process known as diffusion.

  • To sample new data, we approximate the reverse diffusion process using a neural network.

  • The training of the model is based on maximizing the evidence lower bound (ELBO).

  • We can condition the diffusion models on image labels or text embeddings in order to “guide” the diffusion process.

  • Cascade and latent diffusion are two approaches to scale up models to high resolutions.

  • Cascade diffusion models are sequential diffusion models that generate images of increasing resolution.

  • Latent diffusion models (like stable diffusion) apply the diffusion process on a smaller latent space for computational efficiency using a variational autoencoder for the up and downsampling.

  • Score-based models also apply a sequence of noise perturbations to the original image. But they are trained using score-matching and Langevin dynamics. Nonetheless, they end up in a similar objective.

  • The diffusion process can be formulated as an SDE. Solving the reverse SDE allows us to generate new samples.

Finally, for more associations between diffusion models and VAE or AE check out these really nice blogs.

Cite as

@article{karagiannakos2022diffusionmodels,
    title = "Diffusion models: toward state-of-the-art image generation",
    author = "Karagiannakos, Sergios and Adaloglou, Nikolaos",
    journal = "https://theaisummer.com/",
    year = "2022",
    howpublished = {https://theaisummer.com/diffusion-models/},
}

References

[1] Sohl-Dickstein, Jascha, et al. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. arXiv:1503.03585, arXiv, 18 Nov. 2015

[2] Ho, Jonathan, et al. Denoising Diffusion Probabilistic Models. arXiv:2006.11239, arXiv, 16 Dec. 2020

[3] Nichol, Alex, and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv:2102.09672, arXiv, 18 Feb. 2021

[4] Dhariwal, Prafulla, and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, arXiv, 1 June 2021

[5] Nichol, Alex, et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, arXiv, 8 Mar. 2022

[6] Ho, Jonathan, and Tim Salimans. Classifier-Free Diffusion Guidance. 2021. openreview.net

[7] Ramesh, Aditya, et al. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125, arXiv, 12 Apr. 2022

[8] Saharia, Chitwan, et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487, arXiv, 23 May 2022

[9] Rombach, Robin, et al. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752, arXiv, 13 Apr. 2022

[10] Ho, Jonathan, et al. Cascaded Diffusion Models for High Fidelity Image Generation. arXiv:2106.15282, arXiv, 17 Dec. 2021

[11] Weng, Lilian. What Are Diffusion Models? 11 July 2021

[12] O’Connor, Ryan. Introduction to Diffusion Models for Machine Learning. AssemblyAI Blog, 12 May 2022

[13] Rogge, Niels, and Kashif Rasul. The Annotated Diffusion Model. Hugging Face Blog, 7 June 2022

[14] Das, Ayan. “An Introduction to Diffusion Probabilistic Models.” Ayan Das, 4 Dec. 2021

[15] Song, Yang, and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. arXiv:1907.05600, arXiv, 10 Oct. 2020

[16] Song, Yang, and Stefano Ermon. Improved Techniques for Training Score-Based Generative Models. arXiv:2006.09011, arXiv, 23 Oct. 2020

[17] Song, Yang, et al. Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456, arXiv, 10 Feb. 2021

[18] Song, Yang. Generative Modeling by Estimating Gradients of the Data Distribution, 5 May 2021

[19] Luo, Calvin. Understanding Diffusion Models: A Unified Perspective. 25 Aug. 2022
