Advanced Techniques of Diffusion Models: Accelerated Sampling, Conditional Generation

How to accelerate the sampling process

What makes a good generative model?

  • Fast sampling from the generative model
  • Mode coverage / diversity (the generative model captures most of the major modes of the data distribution)
  • High quality / high fidelity samples
[Figure: the generative learning trilemma — GANs (fast sampling, high quality, weak mode coverage), VAEs / Normalizing Flows (fast sampling, mode coverage, lower sample quality), Diffusion Models (mode coverage, high quality, slow sampling)]

Accelerating Diffusion Models

Naive acceleration methods, such as reducing the number of diffusion timesteps in training or sampling only every k-th timestep at inference, immediately lead to worse performance.

Advanced forward process

  • Does the noise schedule have to be predefined?
  • Does the forward process of DDPM have to be a Markovian process?
  • Is there any faster mixing diffusion process?

Covering: VDM, DDIM, Critically-damped Langevin diffusion

Variational Diffusion Models

Variational Diffusion Models: https://arxiv.org/abs/2107.00630

Learnable diffusion process => Include Learnable Parameters in the encoder

  • Given the forward process $q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)\mathbf{I})$
  • Directly parametrize the variance through a learned function $\gamma_\eta$:
    • $(1 - \bar\alpha_t) = \text{sigmoid}(\gamma_\eta(t))$
    • $\gamma_\eta(t)$ is a monotonic MLP
      • Strictly positive weights and monotonic activations (e.g. sigmoid)
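A minimal sketch of such a monotonic network (the layer sizes and the softplus reparametrization are illustrative assumptions, not the exact VDM architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Monotonic MLP for gamma_eta(t): strictly positive weights (enforced via
# softplus) plus monotonic activations make the output monotonic in t.
class MonotonicMLP(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(hidden, 1) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(1, hidden) * 0.1)
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, t):  # t: (B, 1), values in [0, 1]
        h = torch.sigmoid(F.linear(t, F.softplus(self.w1), self.b1))
        return F.linear(h, F.softplus(self.w2), self.b2)  # gamma_eta(t): (B, 1)
```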

New parametrization of training objectives

Recall that diffusion models can be interpreted from the perspective of SDEs (stochastic differential equations), and that there is a close connection between diffusion models and denoising score matching. This implies that diffusion models can also be defined in a continuous-time setting.

  • Optimizing the variational upper bound of diffusion models can be simplified to the following training objective:
    • $\mathcal{L}_T=\frac{T}{2} \mathbb{E}_{\mathbf{x}_0, \epsilon, t}\left[\left(\exp \left(\gamma_\eta(t)-\gamma_\eta(t-1)\right)-1\right)\left\|\epsilon-\epsilon_\theta\left(\mathbf{x}_t, t\right)\right\|_2^2\right]$
  • Letting $T \rightarrow \infty$ leads to the variational upper bound in continuous time
    • When $T \rightarrow \infty$ we have an infinite number of diffusion steps, which corresponds to a continuous-time setting, and the variational upper bound becomes
    • $\mathcal{L}_\infty=\frac{1}{2} \mathbb{E}_{\mathbf{x}_0, \epsilon, t}\left[\gamma_\eta'(t)\|\epsilon-\epsilon_\theta(\mathbf{x}_t, t)\|_2^2\right], \quad \gamma_\eta'(t) = d\gamma_\eta(t)/dt$
  • The bound is shown to depend on the noise schedule only through the signal-to-noise ratio $\text{SNR}(t) = \bar\alpha_t/(1-\bar\alpha_t) = \exp(-\gamma_\eta(t))$ at the endpoints; it is invariant to the noise schedule in between. This means we only need to optimize the SNR at the beginning and at the end of the forward process.
  • The continuous-time noise schedule can be learned to minimize the variance of the training objective for faster training (see the sketch below).
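As a rough illustration (not the official VDM implementation), the continuous-time loss above can be computed with autograd, using the learned schedule to obtain the weight $\gamma_\eta'(t)$; `gamma_eta` (e.g. the monotonic MLP above) and `eps_model` are placeholder modules:

```python
import torch

# L_inf = 0.5 * E[ gamma'(t) * ||eps - eps_hat(x_t, t)||^2 ]
def vdm_continuous_loss(gamma_eta, eps_model, x0):
    b = x0.shape[0]
    t = torch.rand(b, 1, requires_grad=True)                  # t ~ U[0, 1]
    gamma_t = gamma_eta(t)
    # d gamma / d t, used as the per-sample loss weight
    gamma_prime = torch.autograd.grad(gamma_t.sum(), t, create_graph=True)[0]

    # forward process: 1 - alpha_bar_t = sigmoid(gamma_t)
    one_minus_abar = torch.sigmoid(gamma_t).view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = (1 - one_minus_abar).sqrt() * x0 + one_minus_abar.sqrt() * eps

    eps_hat = eps_model(x_t, t)
    sq_err = (eps - eps_hat).pow(2).flatten(1).sum(-1)
    return 0.5 * (gamma_prime.view(b) * sq_err).mean()
```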

SOTA likelihood estimation (significant improvements in log-likelihoods)

  • Appending Fourier features to the input of the U-Net
    • $f^n_{i,j,k} = \sin(x_{i,j,k}2^n\pi),\quad g^n_{i,j,k} = \cos(x_{i,j,k}2^n\pi),\quad n = 7, 8$
  • Hypothesis: to get good likelihoods, the model needs to model all the bits of the input signal (both perceptual and imperceptible details), but neural nets are usually bad at modeling small changes to their inputs (see the sketch below).
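A minimal sketch of those Fourier features (the channel-wise concatenation and the exact input scaling are assumptions):

```python
import torch

# High-frequency sin/cos features of the input, concatenated channel-wise
# before feeding the U-Net, to expose tiny input changes to the network.
def fourier_features(x, freqs=(7, 8)):
    feats = [x]
    for n in freqs:
        feats.append(torch.sin(x * (2 ** n) * torch.pi))
        feats.append(torch.cos(x * (2 ** n) * torch.pi))
    return torch.cat(feats, dim=1)  # (B, C * (1 + 2 * len(freqs)), H, W)
```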

Denoising Diffusion Implicit Models (DDIM)

Denoising Diffusion Implicit Models: https://arxiv.org/abs/2010.02502

Non-Markovian Diffusion Process

  • Define a family of non-Markovian diffusion processes and corresponding reverse processes.
  • The process is designed such that the model can be optimized by the same surrogate objective as the original diffusion model.
    • Recall the objective of the original diffusion model: $\mathcal{L}_{simple}(\theta) := \mathbb{E}_{t,x_0,\epsilon}[||\epsilon-\epsilon_\theta(\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\epsilon, t)||^2]$
    • Therefore one can take a pretrained diffusion model but use a wider choice of sampling procedures.

Defining the non-Markovian forward process

Recall the objective of diffusion models:

$$L=\mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p\left(\mathbf{x}_T\right)\right)}_{L_T}+\sum_{t>1} \underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)}_{L_{t-1}} \underbrace{-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}_{L_0}\Big]$$

KL divergence in the variational upper bound can be written as

$$\mathcal{L}_{t-1}= D_{KL}\left(q(x_{t-1}|x_{t}, x_0)\,\|\,p_\theta(x_{t-1}|x_t)\right) = \mathbb{E}_q\left[\frac{1}{2\sigma^2_t}||\tilde{\mu}_t(x_t, x_0)-\mu_\theta(x_t, t)||^2\right] + C$$

  • $q(x_{t-1}|x_t, x_0)$ is the posterior distribution
  • $p_\theta(x_{t-1}|x_t)$ is the denoising distribution
  • Since both distributions are Gaussians with the same variance $\sigma_t^2$, the KL can be written as the squared L2 distance between the means of the two distributions, times the constant $\frac{1}{2\sigma^2_t}$:
    • $\mathbb{E}_q\left[\frac{1}{2\sigma^2_t}||\tilde{\mu}_t(x_t, x_0)-\mu_\theta(x_t, t)||^2\right]$

https://vinesmsuic.github.io/paper-ddpm/#Objective-function-of-Diffusion

Recall that the two mean functions $\tilde{\mu}_t(x_t, x_0)$ and $\mu_\theta(x_t, t)$ have been parametrized as simple linear combinations of $x_t$ and the (predicted) noise:

  • $\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon\right)$
  • $\mu_\theta(x_t, t) = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t, t)\right)$

Then we can rewrite $\mathbb{E}_q\left[\frac{1}{2\sigma^2_t}||\tilde{\mu}_t(x_t, x_0)-\mu_\theta(x_t, t)||^2\right] + C$ as

$$\mathbb{E}_{x_0\sim q(x_0),\,\epsilon\sim\mathcal{N}(0,\mathbf{I})} \left[\lambda_t||\epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon, t)||^2\right] + C$$

where

  • $x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon$

If we allow the loss weightings $\lambda_t$ to take arbitrary values (the surrogate objective simply sets them to 1), the above formulation holds as long as

  • $q(x_t|x_0)$ follows the normal distribution $\mathcal{N}(x_t;\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)\mathbf{I})$
    • (to make sure $x_t$ equals $\sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon$)

Then we have two assumptions:

  • Forward process: $q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde\mu_t(x_t, x_0), \tilde\sigma_t^2\mathbf{I})$, where the mean of the Gaussian is a linear combination of $x_t$ and the noise, i.e. $\tilde\mu_t(x_t, x_0) = a x_t + b\epsilon$
  • Reverse process: $p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \tilde\sigma_t^2\mathbf{I})$, where the mean of the Gaussian is likewise a linear combination of $x_t$ and the predicted noise, i.e. $\mu_{\theta}(x_t, t) = a x_t + b\epsilon_\theta(x_t, t)$

Since $x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon$, we have

$$\epsilon = \frac{x_t - \sqrt{\bar\alpha_t}x_0}{\sqrt{1-\bar{\alpha}_t}}$$

We can then rewrite the forward-process and reverse-process means as

  • $\tilde\mu_t(x_t, x_0) = ax_t + b\,\frac{x_t - \sqrt{\bar\alpha_t}x_0}{\sqrt{1-\bar{\alpha}_t}}$
  • $\mu_\theta(x_t, t) = ax_t + b\,\frac{x_t - \sqrt{\bar\alpha_t}\hat x_0}{\sqrt{1-\bar\alpha_t}}$
  • (assuming $x_t = \sqrt{\bar\alpha_t}\hat x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)$)

This means we do not need to specify $q(x_{t}|x_{t-1})$ as a Markovian process.

[Figure: graphical model of the non-Markovian forward process]

  • Now each $x_t$ depends on both $x_{t-1}$ and $x_0$
  • For the linear combination $\tilde\mu_t(x_t, x_0) = ax_t + b\,\frac{x_t - \sqrt{\bar\alpha_t}x_0}{\sqrt{1-\bar{\alpha}_t}}$, we need to choose $a, b$ such that $q(x_{t}|x_0) = \mathcal{N}(x_{t}; \sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)\mathbf{I})$

Therefore, with these constraints, we can define a family of forward processes that meets the above requirement:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\left(\sqrt{\bar\alpha_{t-1}}x_0 + \sqrt{1- \bar\alpha_{t-1}-\tilde\sigma^2_t} \cdot \frac{x_t - \sqrt{\bar\alpha_t}x_0}{\sqrt{1- \bar\alpha_t}},\; \tilde\sigma_t^2\mathbf{I}\right)$$

The corresponding reverse process is

$$p(x_{t-1}|x_t) = \mathcal{N}\left(\sqrt{\bar\alpha_{t-1}}\hat x_0 + \sqrt{1- \bar\alpha_{t-1}-\tilde\sigma^2_t} \cdot \frac{x_t - \sqrt{\bar\alpha_t}\hat x_0}{\sqrt{1- \bar\alpha_t}},\; \tilde\sigma_t^2\mathbf{I}\right)$$

DDIM Sampler - Deterministic generative process

Starting from

$$p(x_{t-1}|x_t) = \mathcal{N}\left(\sqrt{\bar\alpha_{t-1}}\hat x_0 + \sqrt{1- \bar\alpha_{t-1}-\tilde\sigma^2_t} \cdot \frac{x_t - \sqrt{\bar\alpha_t}\hat x_0}{\sqrt{1- \bar\alpha_t}},\; \tilde\sigma_t^2\mathbf{I}\right),$$

if we set $\tilde\sigma_t^2 = 0$ for all timesteps $t$, we obtain the DDIM sampler: a deterministic generative process whose only randomness comes from the initial sample at $t = T$.

$$p(x_{t-1}|x_t) = \mathcal{N}\left(\sqrt{\bar\alpha_{t-1}}\hat x_0 + \sqrt{1- \bar\alpha_{t-1}} \cdot \frac{x_t - \sqrt{\bar\alpha_t}\hat x_0}{\sqrt{1- \bar\alpha_t}},\; 0\right)$$
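A minimal sketch of one such deterministic step (assumptions: `eps_model(x, t)` is a pretrained noise predictor, and `abar_t`, `abar_prev` are the cumulative $\bar\alpha$ values at the current and previous timesteps on the chosen sampling grid):

```python
import torch

# One deterministic DDIM step (sigma_t = 0).
@torch.no_grad()
def ddim_step(eps_model, x_t, t, abar_t, abar_prev):
    eps = eps_model(x_t, t)                                      # predicted noise
    x0_hat = (x_t - (1 - abar_t) ** 0.5 * eps) / abar_t ** 0.5   # predicted x_0
    # mean of p(x_{t-1} | x_t) with zero variance
    return abar_prev ** 0.5 * x0_hat + (1 - abar_prev) ** 0.5 * eps
```

Because the training objective is unchanged, the same pretrained model can be sampled on a strided subsequence of timesteps (e.g. 50 of the original 1000 steps).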

ODE interpretation - Deterministic generative process

Generative Probability Flow ODE (deterministic):

  • $dx_t = -\frac{1}{2}\beta(t)[x_t + s_\theta(x_t, t)]dt$

DDIM Sampler can be considered as an integration rule of the following ODE:

$$d\bar{x}(t) = \epsilon^{(t)}_\theta \left(\frac{\bar{x}(t)}{\sqrt{\eta^2 + 1}}\right)d\eta(t)$$

where

  • $\bar{x} = \frac{x}{\sqrt{\bar\alpha}}$
    • Simply applying a scaling factor
  • $\eta = \frac{\sqrt{1-\bar{\alpha}}}{\sqrt{\bar\alpha}}$
    • Square root of the inverse SNR

If $\epsilon^{(t)}_\theta$ is optimal, we have an optimal model.

With the optimal model, this ODE is equivalent to the probability flow ODE of a "variance-exploding" SDE:

$$d\bar{x} = -\frac{1}{2}g(t)^2 \nabla_{\bar{x}}\log p_t(\bar{x})\,dt$$

where $g(t) = \sqrt{\frac{d\eta^2(t)}{dt}}$

Although, with the optimal model, the ODE is equivalent to the probability flow ODE of a "variance-exploding" SDE, the sampling procedure can differ from the standard Euler method: the update is taken with respect to $d\eta(t)$ rather than $dt$.

In practice, the ODE works better than the SDE here because it depends less on the value of $t$: it depends directly on the SNR of the current timestep.

DDIM Sampler - Faster and low curvature

  • Karras et al. argue that the ODE of DDIM is favorable, as the tangent of the solution trajectory always points towards the denoiser output
  • This leads to largely linear solution trajectories with low curvature
  • Low curvature means fewer truncation errors accumulate over the trajectory

Critically-damped Langevin diffusion

Score-Based Generative Modeling with Critically-Damped Langevin Diffusion: https://arxiv.org/abs/2112.07068

Find a “fast mixing diffusion process”

Recall the regular forward diffusion process as SDE

$$dx_t = -\frac{1}{2}\beta(t)x_t\,dt + \sqrt{\beta(t)}\,dw_t$$

It is a special case of (overdamped) Langevin dynamics

$$dx_t = \frac{1}{2}\beta(t)\nabla_{x_t} \log p_{EQ}(x_t)\,dt + \sqrt{\beta(t)}\,dw_t$$

if we assume $p_{EQ}(x_t) = \mathcal{N}(x_t; 0, \mathbf{I}) \propto e^{-\frac{1}{2}\|x_t\|^2}$
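As a quick sanity check of the forward SDE above, a simple Euler-Maruyama simulation (the linear $\beta(t)$ schedule is an illustrative assumption):

```python
import torch

# Discretize dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw over t in [0, 1].
def simulate_forward_sde(x0, n_steps=1000, beta_min=0.1, beta_max=20.0):
    x = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        beta = beta_min + t * (beta_max - beta_min)
        x = x - 0.5 * beta * x * dt + (beta * dt) ** 0.5 * torch.randn_like(x)
    return x  # approaches the equilibrium distribution N(0, I)
```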

“Momentum-based” diffusion - introduce a velocity variable and run diffusion in extended space

With this formulation, we can design a more efficient forward process:

  • Introduce an auxiliary velocity variable $v$
  • The diffusion process is defined in the joint space of the velocity and the input
  • During the forward process, noise is injected only in the velocity space
    • while the image (input) space is perturbed only through the coupling between the data and the velocity
[Figure: forward diffusion trajectories in the joint data-velocity space]

Result:

  • The process in the velocity space is still zig-zag
  • But the process in the image (input) space is much smoother
  • Faster mixing and faster traversal of the joint space
    • Smooth and efficient forward process

Analogous to Hamiltonian component / momentum in momentum-based optimizers

Advanced reverse process

  • We assume the denoising distributions are always Gaussian. If we want to use fewer diffusion timesteps, is this normal approximation of the reverse process still accurate?
    • No, the assumption holds only when the noise added between adjacent steps is small.
[Figure: with fewer steps (i.e. larger noise between adjacent steps), the true denoising distribution is no longer well approximated by a Gaussian]
  • We need more expressive functional approximators if we want fewer diffusion steps

Covering: Denoising Diffusion GANs, Diffusion energy-based models

Denoising Diffusion GANs

Tackling the Generative Learning Trilemma with Denoising Diffusion GANs: https://nvlabs.github.io/denoising-diffusion-gan/

Approximating reverse process by conditional GANs

  • Since the conditional GAN only needs to model the conditional distribution $p(x_{t-1}|x_t)$, the learning problem is simpler for both the generator and the discriminator

    • Stronger mode coverage and better training stability

Diffusion energy-based models

Learning Energy-Based Models by Diffusion Recovery Likelihood: https://arxiv.org/abs/2012.08125

Approximating reverse process by conditional energy-based models

Recall an energy-based model (EBM) is in the form

$$p_\theta(x) = \frac{1}{Z_\theta}\exp(f_\theta(x)) = \frac{1}{Z_\theta}\exp(-E_\theta(x))$$

where

  • $Z_\theta$ is the partition function, which is analytically intractable
  • $E_\theta(x)$ is the energy function

Optimizing energy-based models requires MCMC sampling from the current model $p_\theta(x)$:

$$\nabla_\theta \log p_\theta(x) = \nabla_\theta f_\theta(x) - \mathbb{E}_{p_\theta(x')}[\nabla_\theta f_\theta(x')]$$

So if we want to parametrize the denoising distribution by a conditional energy-based model, we assume that at each diffusion timestep the marginal distribution of the data follows an EBM in the standard formulation $p_\theta(x) = \frac{1}{Z_\theta}\exp(f_\theta(x))$.

Let $\tilde{x} = x + \sigma\epsilon$ (data at a higher noise level).

So we can derive the conditional energy-based model by Bayes' rule:

$$p_\theta(x|\tilde{x}) = \frac{1}{Z_\theta(\tilde{x})}\exp\left(f_\theta(x) - \frac{1}{2\sigma^2}||\tilde{x} - x||^2\right)$$

where

  • The quadratic term $\frac{1}{2\sigma^2}||\tilde{x} - x||^2$ localizes the otherwise highly multimodal energy landscape, so the conditional landscape is closer to unimodal, with its mode concentrated around the higher-noise-level signal $\tilde{x}$
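For completeness, a one-line sketch of that Bayes' rule step, using $p(\tilde{x}|x) = \mathcal{N}(\tilde{x}; x, \sigma^2\mathbf{I})$ from the forward perturbation:

$$p_\theta(x|\tilde{x}) \propto p_\theta(x)\,p(\tilde{x}|x) \propto \exp(f_\theta(x))\exp\left(-\frac{1}{2\sigma^2}\|\tilde{x}-x\|^2\right) = \exp\left(f_\theta(x) - \frac{1}{2\sigma^2}\|\tilde{x}-x\|^2\right)$$

with all $\tilde{x}$-dependent normalizers absorbed into $Z_\theta(\tilde{x})$.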

To learn the model we simply maximize the conditional log-likelihoods: $\mathcal{J}(\theta) = \frac{1}{n}\sum^n_{i=1} \log p_\theta(x_i|\tilde{x}_i)$

After training, we simply generate samples by progressively sampling from the conditional EBMs, from high noise levels down to low noise levels.

Compared to a single EBM:

  • Sampling is friendlier and converges more easily
  • Training is more efficient
  • Well-formed energy potential

Compared to diffusion models:

  • Much less diffusion steps

Model distillation

  • Can we do model distillation for fast sampling?

Covering: Progressive distillation

Progressive Distillation

Progressive Distillation for Fast Sampling of Diffusion Models: https://arxiv.org/abs/2202.00512

  • Distill a deterministic DDIM sampler into the same model architecture
  • At each stage, a “student” model is learned to distill two adjacent sampling steps of the “teacher” model into one sampling step (see the sketch below)
  • At the next stage, the “student” model from the previous stage becomes the new “teacher” model, and the procedure repeats
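A rough sketch of one distillation stage (not the paper's exact parametrization or weighting; `ddim_step` follows the earlier sketch, `ddim_step_grad` is assumed to be the same update without `@torch.no_grad()`, and `timesteps`/`abar` describe the student's sampling grid):

```python
import copy
import torch

def distill_stage(teacher, data_loader, timesteps, abar, lr=1e-4):
    student = copy.deepcopy(teacher)               # student starts from the teacher
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for x0 in data_loader:
        i = torch.randint(2, len(timesteps), (1,)).item()
        t, t_mid, t_prev = timesteps[i], timesteps[i - 1], timesteps[i - 2]
        eps = torch.randn_like(x0)
        x_t = abar[t] ** 0.5 * x0 + (1 - abar[t]) ** 0.5 * eps

        with torch.no_grad():                      # two teacher steps -> one target
            x_mid = ddim_step(teacher, x_t, t, abar[t], abar[t_mid])
            target = ddim_step(teacher, x_mid, t_mid, abar[t_mid], abar[t_prev])

        pred = ddim_step_grad(student, x_t, t, abar[t], abar[t_prev])  # one student step
        loss = (pred - target).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return student                                 # becomes the teacher for the next stage
```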

Hybrid models

  • Can we lift the diffusion model to a latent space that is faster to diffuse?

Covering: LDM

Latent-space diffusion models (LDM)

Score-based Generative Modeling in Latent Space: https://nvlabs.github.io/LSGM/

High-Resolution Image Synthesis with Latent Diffusion Models: https://arxiv.org/abs/2112.10752

[Figure: Variational Autoencoder (VAE) + score-based prior]

Main idea: Lift the diffusion models to a latent space which is more friendly to the diffusion process

  • Encoder maps the input data to an embedding space
  • Denoising diffusion models are applied in the latent space
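A minimal sketch of this pipeline (the `encoder`, `decoder`, and `denoiser` modules and their signatures are placeholders, not the released LSGM/LDM code):

```python
import torch

# Training: diffuse in latent space instead of pixel space.
def ldm_train_step(encoder, denoiser, x, abar):
    z0 = encoder(x)                                  # map image to latent
    t = torch.randint(0, len(abar), (z0.shape[0],))
    a = abar[t].view(-1, *([1] * (z0.dim() - 1)))
    eps = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps
    return (eps - denoiser(z_t, t)).pow(2).mean()    # standard eps-prediction loss

# Sampling: run the reverse diffusion in latent space, then decode once.
@torch.no_grad()
def ldm_sample(decoder, sample_latent_fn, shape):
    z0 = sample_latent_fn(shape)                     # e.g. DDIM sampling in latent space
    return decoder(z0)
```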

Advantages:

  • The distribution of latent embeddings is close to a Normal distribution

    • Simpler denoising and faster synthesis
  • Augmented latent space

    • More expressivity
  • Tailored Autoencoders

    • More expressivity, Application to any data type (e.g. graphs, text, 3d data etc.)

Training objective of LDM: score matching for the cross-entropy term

$$\begin{aligned} \mathcal{L}(\mathbf{x}, \boldsymbol{\phi}, \boldsymbol{\theta}, \boldsymbol{\psi}) & =\mathbb{E}_{q_{\boldsymbol{\phi}}\left(\mathbf{z}_0 \mid \mathbf{x}\right)}\left[-\log p_{\boldsymbol{\psi}}\left(\mathbf{x} \mid \mathbf{z}_0\right)\right]+\mathrm{KL}\left(q_{\boldsymbol{\phi}}\left(\mathbf{z}_0 \mid \mathbf{x}\right) \| p_{\boldsymbol{\theta}}\left(\mathbf{z}_0\right)\right) \\ & =\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}\left(\mathbf{z}_0 \mid \mathbf{x}\right)}\left[-\log p_{\boldsymbol{\psi}}\left(\mathbf{x} \mid \mathbf{z}_0\right)\right]}_{\text {reconstruction term }}+\underbrace{\mathbb{E}_{q_\phi\left(\mathbf{z}_0 \mid \mathbf{x}\right)}\left[\log q_{\boldsymbol{\phi}}\left(\mathbf{z}_0 \mid \mathbf{x}\right)\right]}_{\text {negative encoder entropy }}+\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}\left(\mathbf{z}_0 \mid \mathbf{x}\right)}\left[-\log p_{\boldsymbol{\theta}}\left(\mathbf{z}_0\right)\right]}_{\text {cross entropy }} \end{aligned}$$

where

  • It is optimized by minimizing the variational upper bound on the negative log-likelihood
  • the reconstruction term and the negative encoder entropy are similar to the training objective of a VAE
  • the cross-entropy term corresponds to the training objective of diffusion models

Then the objective can be written as

$$CE\left(q\left(\mathbf{z}_0 \mid \mathbf{x}\right) \| p\left(\mathbf{z}_0\right)\right)=\mathbb{E}_{t \sim \mathcal{U}[0,1]}\left[\frac{g(t)^2}{2} \mathbb{E}_{q\left(\mathbf{z}_t, \mathbf{z}_0 \mid \mathbf{x}\right)}\left[\left\|\nabla_{\mathbf{z}_t} \log q\left(\mathbf{z}_t \mid \mathbf{z}_0\right)-\nabla_{\mathbf{z}_t} \log p\left(\mathbf{z}_t\right)\right\|_2^2\right]\right]+\frac{D}{2} \log \left(2 \pi e \sigma_0^2\right)$$

where

  • $\mathbb{E}_{t \sim \mathcal{U}[0,1]}$ is the time sampling
  • $\mathbb{E}_{q(\mathbf{z}_t, \mathbf{z}_0 \mid \mathbf{x})}$ is the forward diffusion
  • $\nabla_{\mathbf{z}_t} \log q(\mathbf{z}_t \mid \mathbf{z}_0)$ is the diffusion kernel
  • $\nabla_{\mathbf{z}_t} \log p(\mathbf{z}_t)$ is the trainable score function
  • $\frac{D}{2} \log(2 \pi e \sigma_0^2)$ is a constant

Conditional Diffusion Models

How to do high-resolution (conditional) generation?

  • Reverse process is changed

Reverse process (taking conditional input $c$):

$$p_\theta(x_{0:T}|c) = p(x_T)\prod^T_{t=1}p_\theta(x_{t-1}|x_t, c)$$

$$p_\theta(x_{t-1}|x_t, c) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t, c), \Sigma_\theta(x_t, t, c))$$

Incorporate conditions into the U-Net of the diffusion model

  • Scalar conditioning: encode scalar as a vector embedding, simple spatial addition or adaptive group normalization layers
  • Image conditioning: channel-wise concatenation of the conditional image
  • Text conditioning:
    • single vector embedding: spatial addition or adaptive group norm
    • a sequence of vector embeddings: cross-attention
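As an illustration of the last mechanism, a minimal cross-attention block (dimensions and module names are illustrative, not taken from a specific codebase):

```python
import torch
import torch.nn as nn

# Spatial U-Net features attend to a sequence of text token embeddings.
class CrossAttention(nn.Module):
    def __init__(self, dim, ctx_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                          vdim=ctx_dim, batch_first=True)

    def forward(self, x, ctx):
        # x:   (B, C, H, W) feature map      -> queries
        # ctx: (B, L, ctx_dim) text embeddings -> keys / values
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        out, _ = self.attn(q, ctx, ctx)                   # attend positions to tokens
        return x + out.transpose(1, 2).view(b, c, h, w)   # residual connection
```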

Classifier Guidance

Diffusion models beat GANs on image synthesis: https://arxiv.org/abs/2105.05233

Using the gradient of a trained classifier as guidance

Main idea:

  • For class-conditional modeling of $p(x_t | c)$, train an extra classifier $p(c|x_t)$

    • Mix its gradient with the diffusion/score model during sampling
  • Sample with a modified score: $\nabla_{x_t} [\log p(x_t|c) + \omega \log p(c|x_t)]$

  • This approximates samples from the distribution $\tilde{p}(x_t|c) \propto p(x_t|c)\,p(c|x_t)^\omega$
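A sketch of how the modified score is assembled at sampling time (`score_model` and `classifier` are placeholder modules, not from a specific repository):

```python
import torch

# guided score = s_theta(x_t, t, c) + w * grad_x log p(c | x_t)
def guided_score(score_model, classifier, x_t, t, c, w):
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x, t), dim=-1)
        selected = log_probs[torch.arange(x.shape[0]), c].sum()
        grad = torch.autograd.grad(selected, x)[0]        # grad_x log p(c | x_t)
    return score_model(x_t, t, c) + w * grad
```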

Classifier-Free Guidance

Classifier-Free Diffusion Guidance: https://arxiv.org/abs/2207.12598

Get guidance by Bayes’ rule on conditional diffusion models

Main idea:

  • Instead of training an additional classifier, get an “implicit classifier” by jointly training a conditional and an unconditional diffusion model
    • $p(c|x_t) \propto p(x_t|c)/p(x_t)$
      • where $p(x_t|c)$ is the conditional diffusion model and $p(x_t)$ is the unconditional diffusion model
  • In practice, $p(x_t|c)$ and $p(x_t)$ are trained jointly by randomly dropping the condition of the diffusion model with some probability
  • The modified score with this implicit classifier included is:
    • $\nabla_{x_t} [\log p(x_t|c) + \omega \log p(c|x_t)] = \nabla_{x_t} [\log p(x_t|c) + \omega (\log p(x_t|c) - \log p(x_t))]$
      • $= \nabla_{x_t} [(1+\omega)\log p(x_t|c) - \omega \log p(x_t)]$
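A minimal sketch of this at sampling time, in the common $\epsilon$-prediction form (assumption: passing `c=None` to the placeholder `eps_model` gives the unconditional, “dropped condition” prediction):

```python
import torch

# (1 + w) * conditional - w * unconditional, mirroring the modified score above.
@torch.no_grad()
def cfg_eps(eps_model, x_t, t, c, w):
    eps_cond = eps_model(x_t, t, c)
    eps_uncond = eps_model(x_t, t, None)
    return (1 + w) * eps_cond - w * eps_uncond
```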

Trade-off for sample quality and sample diversity

  • Large guidance weight $\omega$ usually leads to better individual sample quality but less sample diversity

Cascaded generation

Cascaded Diffusion Models for High Fidelity Image Generation: https://cascaded-diffusion.github.io

Main idea:

  • Cascaded Diffusion Models (CDM) are pipelines of diffusion models that generate images of increasing resolution.
  • CDMs yield high fidelity samples superior to BigGAN-deep and VQ-VAE-2 in terms of both FID score and classification accuracy score on class-conditional ImageNet generation. These results are achieved with pure generative models without any classifier.
  • Introduces conditioning augmentation, a data augmentation technique found to be critical for achieving high sample fidelity.

Noise conditioning augmentation: Reduce compounding error

Need robust super-resolution model:

  • Training conditional on original low-res images from the dataset
  • Inference on low-res images generated by the low-res model.

^ Artifacts in the low-resolution samples generated at inference will hurt sample quality, due to the mismatch between the conditioning inputs seen at these two points (training vs. inference).

To alleviate this problem:

Noise conditioning augmentation

  • During training, add varying amounts of Gaussian noise (or blurring by a Gaussian kernel) to the low-res images
  • During inference, sweep over the amount of noise added to the low-res images to find the optimal value
  • BSR-degradation process: applies JPEG compression noise, camera sensor noise, different image interpolations for downsampling, Gaussian blur kernels, and Gaussian noise in a random order to an image.
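A minimal sketch of the noise variant of conditioning augmentation for a super-resolution stage (`sr_model` and its `loss(...)` signature are assumptions; the point is that the corruption level is randomized at training time and exposed as a knob to sweep at inference):

```python
import torch

def sr_train_step(sr_model, x_hr, x_lr, max_sigma=0.5):
    # per-example noise level, swept at inference time
    sigma = torch.rand(x_lr.shape[0], 1, 1, 1) * max_sigma
    x_lr_aug = x_lr + sigma * torch.randn_like(x_lr)     # corrupted conditioning input
    return sr_model.loss(x_hr, cond=x_lr_aug, cond_noise=sigma)
```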