Score-based Generative Modeling with Differential Equations

“Score-Based Generative Modeling through Stochastic Differential Equations”

First, let's revisit the Forward Diffusion Process:

$$q(x_t|x_{t-1})=\mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_tI)$$

Now consider the limit of many small steps:

$$q(x_t|x_{t-1})=\mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_tI)$$

Applying the reparametrization trick:

$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\mathcal{N}(0,I)$$

We define the stepsize $\beta_t$ as $\beta(t)\Delta t$:

$$x_t = \sqrt{1-\beta(t)\Delta t}\, x_{t-1}+\sqrt{\beta(t)\Delta t}\,\mathcal{N}(0,I)$$

If there are many, many timesteps, $\Delta t$ goes towards 0 and we can perform a Taylor expansion of $\sqrt{1-\beta(t)\Delta t}$:

$$x_t \approx x_{t-1} - \frac{\beta(t)\Delta t}{2}x_{t-1} + \sqrt{\beta(t)\Delta t}\,\mathcal{N}(0,I)$$

We can interpret this as an iterative update: the new $x_t$ is given by the old $x_{t-1}$, minus a term that depends on $x_{t-1}$ itself, plus some added noise.

This iterative update corresponds to a particular discretization of a Stochastic Differential Equation, in particular this one:

$$dx_t = -\frac{1}{2}\beta(t)x_t\,dt + \sqrt{\beta(t)}\,d\omega_t$$
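To make this concrete, here is a minimal NumPy sketch of the discretized forward update above, assuming a hypothetical linear schedule $\beta(t)$ (the schedule, step count, and starting value are illustrative choices, not from the source). Repeated updates drive the samples toward a standard normal:

```python
import numpy as np

# Hypothetical linear noise schedule beta(t) on t in [0, 1] (an assumption, not from the source).
beta = lambda t: 0.1 + 19.9 * t   # beta_min = 0.1, beta_max = 20.0

T, n_steps = 1.0, 1000
dt = T / n_steps

rng = np.random.default_rng(0)
x = np.full(10_000, 2.0)          # start all samples at the data value x_0 = 2

for i in range(n_steps):
    t = i * dt
    # Discretized forward update: x_t = sqrt(1 - beta*dt) * x_{t-1} + sqrt(beta*dt) * noise
    x = np.sqrt(1.0 - beta(t) * dt) * x + np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)

print(x.mean(), x.std())          # approximately 0 and 1: x_T is close to N(0, I)
```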

Stochastic Differential Equation

  • Describes the diffusion in the infinitesimal (extremely small step) limit

Ordinary Differential Equation (ODE):

$$\frac{dx}{dt}=f(x,t) \quad\text{or}\quad dx= f(x,t)\,dt$$

  • $x$ is the state we are interested in (e.g., the pixels of an image)
  • $t$ is a continuous time variable along which the state $x$ changes/evolves
  • We can integrate the equation over time to obtain the final state $x$ as a function of $t$.
    • However, in practice $f$ is often a highly complex nonlinear function (e.g., a neural network)

img

The analytical solution (which generally cannot be found in closed form):

$$x(t) = x(0) + \int^t_0 f(x, \tau)\,d\tau$$

Iterative Numerical Solution

$$x(t + \Delta t) \approx x(t) + f(x(t), t)\,\Delta t$$
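For illustration, a minimal sketch of this Euler update for a toy ODE; the drift $f(x,t) = -x$ is an arbitrary example chosen only because its analytical solution is known:

```python
import numpy as np

def f(x, t):
    # Toy drift: exponential decay dx/dt = -x (illustrative choice only).
    return -x

x, t, dt = 1.0, 0.0, 0.01
while t < 5.0:
    x = x + f(x, t) * dt   # Euler step: x(t + dt) ≈ x(t) + f(x(t), t) * dt
    t += dt

print(x, np.exp(-5.0))     # numerical solution vs. the analytical solution e^{-t}
```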

Stochastic Differential Equation (SDE):

$$\frac{dx}{dt}= f(x,t) + \sigma(x,t)\,\omega_t$$

  • $\omega_t$ is called the Wiener process (in practice, Gaussian white noise)
  • $f(x,t)$ is the drift coefficient (which pulls the state towards a mode)
  • $\sigma(x,t)$ is the diffusion coefficient, which scales the injected noise

img

To solve it, we proceed similarly to the iterative numerical solution above:

$$x(t + \Delta t) \approx x(t) + f(x(t), t)\,\Delta t + \sigma(x(t), t)\, \sqrt{\Delta t}\,\mathcal{N}(0,I)$$

But because noise proportional to the diffusion coefficient is added through the Wiener process at every step, there is no unique solution as in the ODE case; every run produces a different trajectory.
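A minimal sketch of this stochastic update for the same toy drift plus a constant diffusion coefficient (both illustrative assumptions); running it with two different seeds gives two different trajectories:

```python
import numpy as np

def f(x, t):      # toy drift (illustrative)
    return -x

def sigma(x, t):  # toy constant diffusion coefficient (illustrative)
    return 0.5

def simulate(seed, x0=1.0, dt=0.01, T=5.0):
    rng = np.random.default_rng(seed)
    x, t = x0, 0.0
    while t < T:
        # Euler-Maruyama step: drift * dt plus noise scaled by sqrt(dt)
        x = x + f(x, t) * dt + sigma(x, t) * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return x

print(simulate(seed=0), simulate(seed=1))  # different endpoints: no unique solution
```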

img

Forward Diffusion Process as SDE

As mentioned, the forward diffusion process can be written as an SDE:

$$dx_t = -\frac{1}{2}\beta(t)x_t\,dt + \sqrt{\beta(t)}\,d\omega_t$$

  • $-\frac{1}{2}\beta(t)x_t$ is the drift term (which pulls towards the mode)
  • $\sqrt{\beta(t)}$ is the diffusion coefficient (which injects noise)

img

This is a special case of the more general SDEs used in generative diffusion models:

$$dx_t = f(t)x_t\,dt + g(t)\,d\omega_t$$

Generative Reverse SDE

$$dx_t = -\frac{1}{2}\beta(t)\left[x_t + 2\nabla_{x_t} \log q_t (x_t)\right] dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$$

where

  • $-\frac{1}{2}\beta(t)\left[x_t + 2\nabla_{x_t} \log q_t (x_t)\right]$ is the drift term
  • $\sqrt{\beta(t)}\,d\bar{\omega}_t$ is the diffusion term
  • $\nabla_{x_t} \log q_t (x_t)$ is the score function

But how do we get the score function $\nabla_{x_t} \log q_t (x_t)$?

  • Learn a neural network

Score Matching

$$\min _{\boldsymbol{\theta}} \mathbb{E}_{t \sim \mathcal{U}(0, T)} \mathbb{E}_{\mathbf{x}_t \sim q_t\left(\mathbf{x}_t\right)}\left\|\mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)-\nabla_{\mathbf{x}_t} \log q_t\left(\mathbf{x}_t\right)\right\|_2^2$$

where

  • $\mathbb{E}_{t \sim \mathcal{U}(0, T)}$ is the expectation over the diffusion time $t$
  • $\mathbb{E}_{\mathbf{x}_t \sim q_t\left(\mathbf{x}_t\right)}$ is the expectation over the diffused data $\mathbf{x}_t$
  • $\mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)$ is the neural network
  • $\nabla_{\mathbf{x}_t} \log q_t\left(\mathbf{x}_t\right)$ is the score of the diffused data (marginal)

But $\nabla_{\mathbf{x}_t} \log q_t\left(\mathbf{x}_t\right)$ (the score of the marginal diffused density $q_t(x_t)$) is not tractable.

  • Instead, diffuse individual data points $x_0$. The diffused conditional $q_t(x_t|x_0)$ is tractable.

Denoising Score Matching

“Variance Preserving” SDE:

$$dx_t = -\frac{1}{2}\beta(t)x_t\,dt + \sqrt{\beta(t)}\,d\omega_t$$

$$q_t(x_t|x_0) = \mathcal{N}(x_t;\gamma_t x_0, \sigma^2_t I)$$

$$\gamma_t = e^{-\frac{1}{2}\int^t_0\beta(s)\,ds}$$

$$\sigma^2_t = 1 - e^{-\int^t_0 \beta(s)\,ds}$$
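These closed-form coefficients make it cheap to sample $x_t$ directly given $x_0$. A minimal sketch, again assuming a hypothetical linear $\beta(t)$ schedule so that $\int_0^t \beta(s)\,ds$ has a closed form:

```python
import numpy as np

beta_min, beta_max = 0.1, 20.0           # hypothetical linear schedule (assumption)

def int_beta(t):
    # Closed-form integral of beta(s) = beta_min + (beta_max - beta_min) * s over [0, t]
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def gamma(t):
    return np.exp(-0.5 * int_beta(t))

def sigma2(t):
    return 1.0 - np.exp(-int_beta(t))

def sample_xt(x0, t, rng):
    # q_t(x_t | x_0) = N(gamma_t * x_0, sigma_t^2 I): one reparametrized sample
    eps = rng.standard_normal(x0.shape)
    return gamma(t) * x0 + np.sqrt(sigma2(t)) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)              # stand-in for a data point
xt, eps = sample_xt(x0, t=0.5, rng=rng)
```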

$$\min _{\boldsymbol{\theta}} \mathbb{E}_{t \sim \mathcal{U}(0, T)} \mathbb{E}_{\mathbf{x}_0 \sim q_0\left(\mathbf{x}_0\right)} \mathbb{E}_{\mathbf{x}_t \sim q_t\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}\left\| \mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)-\nabla_{\mathbf{x}_t} \log q_t\left(\mathbf{x}_t \mid \mathbf{x}_0\right) \right\|_2^2$$

where

  • $\mathbb{E}_{t \sim \mathcal{U}(0, T)}$ is the expectation over the diffusion time $t$
  • $\mathbb{E}_{\mathbf{x}_0 \sim q_0\left(\mathbf{x}_0\right)}$ is the expectation over the data sample $x_0$
  • $\mathbb{E}_{\mathbf{x}_t \sim q_t\left(\mathbf{x}_t|\mathbf{x}_0\right)}$ is the expectation over the diffused data $x_t$
  • $\mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)$ is the neural network
  • $\nabla_{\mathbf{x}_t} \log q_t\left(\mathbf{x}_t|\mathbf{x}_0\right)$ is the score of the diffused data sample

=> After taking the expectations, the learned network approximates the marginal score:

$$\mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right) \approx \nabla_{\mathbf{x}_t} \log q_t(\mathbf{x}_t)$$

Implementation 1: Noise Prediction

From

$$\min _{\boldsymbol{\theta}} \mathbb{E}_{t \sim \mathcal{U}(0, T)} \mathbb{E}_{\mathbf{x}_0 \sim q_0\left(\mathbf{x}_0\right)} \mathbb{E}_{\mathbf{x}_t \sim q_t\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}\left\| \mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)-\nabla_{\mathbf{x}_t} \log q_t\left(\mathbf{x}_t \mid \mathbf{x}_0\right) \right\|_2^2$$

Reparametrized sampling: $x_t = \gamma_t x_0 + \sigma_t\epsilon$, $\epsilon \sim \mathcal{N}(0,I)$

Score function: $\nabla_{\mathbf{x}_t} \log q_t\left(\mathbf{x}_t|\mathbf{x}_0\right) = - \nabla_{\mathbf{x}_t} \frac{(x_t - \gamma_t x_0)^2}{2\sigma^2_t} = -\frac{x_t - \gamma_t x_0}{\sigma^2_t}$

Neural network model: $\mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right) := -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}$

=> Substituting $x_t = \gamma_t x_0 + \sigma_t\epsilon$ gives $\nabla_{\mathbf{x}_t} \log q_t(\mathbf{x}_t|\mathbf{x}_0) = -\frac{\epsilon}{\sigma_t}$, so the model is trained to predict the noise values $\epsilon$:

$$\min _{\boldsymbol{\theta}} \mathbb{E}_{t \sim \mathcal{U}(0, T)} \mathbb{E}_{\mathbf{x}_0 \sim q_0\left(\mathbf{x}_0\right)} \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \frac{1}{\sigma_t^2}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right\|_2^2$$

If our network can predict the noise values that were used for the perturbation, then we can denoise and reconstruct the original data point $x_0$ from $x_t$.
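A minimal PyTorch sketch of one training step with this noise-prediction objective. The tiny MLP `EpsNet`, the linear $\beta(t)$ schedule, and the toy data batch are all illustrative assumptions; a real model would use a U-Net or transformer:

```python
import torch
import torch.nn as nn

# Hypothetical linear noise schedule beta(t) = beta_min + (beta_max - beta_min) * t (an assumption).
beta_min, beta_max = 0.1, 20.0

def int_beta(t):
    # Closed-form integral of the linear beta(s) over [0, t]
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

class EpsNet(nn.Module):
    """Toy stand-in for epsilon_theta(x_t, t)."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

model = EpsNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(256, 2)                       # stand-in for a batch of data points
t = torch.rand(256, 1).clamp(min=1e-5)         # t ~ U(0, T); tiny clamp for numerical safety
gamma_t = torch.exp(-0.5 * int_beta(t))
sigma_t = torch.sqrt(1.0 - torch.exp(-int_beta(t)))

eps = torch.randn_like(x0)
x_t = gamma_t * x0 + sigma_t * eps             # reparametrized sample from q_t(x_t | x_0)

# Noise-prediction objective with the 1/sigma_t^2 weighting from above
# (loss weightings and time cut-offs are discussed in the following sections).
loss = ((eps - model(x_t, t)) ** 2 / sigma_t ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
```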

Implementation 2: Loss Weightings

Denoising Score Matching objective with loss weighting $\lambda(t)$:

$$\min _{\boldsymbol{\theta}} \mathbb{E}_{t \sim \mathcal{U}(0, T)} \mathbb{E}_{\mathbf{x}_0 \sim q_0\left(\mathbf{x}_0\right)} \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \frac{\lambda(t)}{\sigma_t^2}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right\|_2^2$$

Different loss weightings trade off between models with good perceptual quality and models with high log-likelihood:

  • Perceptual quality: $\lambda(t) = \sigma^2_t$
  • Maximum log-likelihood: $\lambda(t) = \beta(t)$ (negative ELBO)

These are the same objectives as derived with the variational approach.

Implementation 3: Variance Reduction and Numerical Stability

$$\min _{\boldsymbol{\theta}} \mathbb{E}_{t \sim \mathcal{U}(0, T)} \mathbb{E}_{\mathbf{x}_0 \sim q_0\left(\mathbf{x}_0\right)} \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \frac{\lambda(t)}{\sigma_t^2}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right\|_2^2$$

Notice that $\sigma^2_t \rightarrow 0$ as $t \rightarrow 0$, so the loss is heavily amplified when sampling $t$ close to $0$ (for $\lambda(t)=\beta(t)$), leading to high variance. Because of this we would not want to use Implementation 2 right away; we add some tricks instead.

Trick 1: Train with a small time cut-off $\eta$ ($\approx 10^{-5}$):

$$\min _{\boldsymbol{\theta}} \mathbb{E}_{t \sim \mathcal{U}(\eta, T)} \mathbb{E}_{\mathbf{x}_0 \sim q_0\left(\mathbf{x}_0\right)} \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \frac{\lambda(t)}{\sigma_t^2}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right\|_2^2$$

Trick 2: Variance reduction by importance sampling:

  • Importance sampling distribution: $r(t) \propto \frac{\lambda(t)}{\sigma^2_t}$

$$\min _{\boldsymbol{\theta}} \mathbb{E}_{t \sim r(t)} \mathbb{E}_{\mathbf{x}_0 \sim q_0\left(\mathbf{x}_0\right)} \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \frac{1}{r(t)}\frac{\lambda(t)}{\sigma_t^2}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right\|_2^2$$
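One way such an importance-sampling distribution over $t$ could be implemented is to tabulate $r(t) \propto \lambda(t)/\sigma_t^2$ on a grid and sample via the inverse CDF. A minimal sketch, assuming $\lambda(t) = \beta(t)$ and the hypothetical linear schedule used earlier:

```python
import numpy as np

beta_min, beta_max, eta, T = 0.1, 20.0, 1e-5, 1.0        # hypothetical schedule and cut-off

def beta(t):
    return beta_min + (beta_max - beta_min) * t

def sigma2(t):
    ib = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    return 1.0 - np.exp(-ib)

# Unnormalized density r(t) proportional to lambda(t) / sigma_t^2 with lambda(t) = beta(t), on a grid.
grid = np.linspace(eta, T, 10_000)
dens = beta(grid) / sigma2(grid)
cdf = np.cumsum(dens)
cdf /= cdf[-1]

def sample_t(rng, n):
    # Inverse-CDF sampling: map uniform u to the grid point where the CDF first reaches u.
    u = rng.random(n)
    return grid[np.searchsorted(cdf, u)]

rng = np.random.default_rng(0)
t = sample_t(rng, 5)   # these t's are then reweighted by 1/r(t) in the objective
```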

img

Probability Flow ODE

Consider the reverse generative diffusion SDE: $dx_t = -\frac{1}{2}\beta(t)\left[x_t + 2\nabla_{x_t} \log q_t (x_t)\right] dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$

In distribution, this is equivalent to the Probability Flow ODE:

$$dx_t = -\frac{1}{2}\beta(t)\left[x_t + \nabla_{\mathbf{x}_t} \log q_t(x_t)\right]dt$$

img

Probability Flow ODE: Diffusion Models as Continuous Normalizing Flows

So why should we care about and use this probability flow ODE framework? It turns out that an ordinary differential equation allows the use of advanced ODE solvers, which makes it easier to work with than an SDE.

  • Enables use of advanced ODE solvers
  • Deterministic encoding and generation (semantic image interpolation, etc.)
    • Allows encoding a data point into the latent space
      • Continuous changes in the latent space $x_T$ result in continuous, semantically meaningful changes in the data space $x_0$
  • Log-likelihood computation (instantaneous change of variables); see the sketch after this list:
    • $\log p_{\boldsymbol{\theta}}\left(\mathbf{x}_0\right)=\log p_T\left(\mathbf{x}_T\right)-\int_0^T \operatorname{Tr}\left(\frac{1}{2} \beta(t) \frac{\partial}{\partial \mathbf{x}_t}\left[\mathbf{x}_t+\mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right]\right) \mathrm{d} t$
  • Diffusion models can be considered CNFs trained with score matching
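A rough sketch of how this log-likelihood could be estimated in practice, using a forward Euler discretization of the probability flow ODE and a Hutchinson estimator for the trace term. The `score_model`, the linear $\beta(t)$ schedule, and the step count are hypothetical placeholders:

```python
import torch

def beta(t):
    # Hypothetical linear schedule (an assumption, not from the source)
    return 0.1 + 19.9 * t

def log_likelihood(score_model, x0, n_steps=500):
    """Estimate log p_theta(x_0) by integrating the probability flow ODE from t=0 to T=1
    with forward Euler, using a Hutchinson estimator for the trace/divergence term."""
    x = x0.clone()
    delta_logp = torch.zeros(x.shape[0])
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        v = torch.randn_like(x)                          # Hutchinson probe vector
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            drift = -0.5 * beta(t) * (x_in + score_model(x_in, t))   # PF-ODE drift
            # v^T (d drift / dx) v estimates Tr(d drift / dx) in expectation over v
            div = (torch.autograd.grad(drift, x_in, grad_outputs=v)[0] * v).sum(dim=-1)
        x = x + drift.detach() * dt                      # Euler step x_0 -> x_T
        delta_logp = delta_logp + div.detach() * dt      # accumulate the divergence integral
    d = x.shape[-1]
    log_prior = -0.5 * (x ** 2).sum(dim=-1) - 0.5 * d * torch.log(torch.tensor(2 * torch.pi))
    return log_prior + delta_logp                        # log p_T(x_T) + integral of divergence
```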

Synthesis with SDE vs ODE

img

In the SDE case the trajectories zigzag while following the distribution, while the ODE yields more deterministic trajectories. (Both land in the modes of the data distribution.)

  • Generative Reverse Diffusion SDE (stochastic):
    • $dx_t = -\frac{1}{2}\beta(t)[x_t + 2s_\theta(x_t,t)]dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$
  • Generative Probability Flow ODE (deterministic):
    • $dx_t = -\frac{1}{2}\beta(t)[x_t + s_\theta(x_t, t)]dt$

Solving generative SDE or ODE in practice

Sampling from “Continuous-Time” Diffusion Models: How to solve the generative SDE or ODE in practice?

Generative Reverse Diffusion SDE (stochastic):

  • $dx_t = -\frac{1}{2}\beta(t)[x_t + 2s_\theta(x_t,t)]dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$

Most naive way: Euler-Maruyama:

$$x_{t-1} = x_t + \frac{1}{2}\beta(t)[x_t + 2s_\theta(x_t, t)]\,\Delta t+\sqrt{\beta(t)\Delta t}\, \mathcal{N}(0, I)$$
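A minimal sketch of this Euler-Maruyama sampler, assuming a trained (hypothetical) `score_model` and the same illustrative linear $\beta(t)$ schedule as before; the loop integrates from $t = T$ down to $t \approx 0$:

```python
import torch

def beta(t):
    # Hypothetical linear schedule (an assumption, not from the source)
    return 0.1 + 19.9 * t

@torch.no_grad()
def sample_sde(score_model, shape, n_steps=1000, T=1.0):
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    dt = T / n_steps
    for i in reversed(range(n_steps)):
        t = torch.full((shape[0], 1), (i + 1) * dt)
        drift = 0.5 * beta(t) * (x + 2.0 * score_model(x, t))
        x = x + drift * dt + torch.sqrt(beta(t) * dt) * torch.randn_like(x)   # Euler-Maruyama step
    return x
```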

Generative Probability Flow ODE (deterministic):

  • $dx_t = -\frac{1}{2}\beta(t)[x_t + s_\theta(x_t, t)]dt$

The naive way uses Euler's method:

$$x_{t-1} = x_t + \frac{1}{2}\beta(t)[x_t + s_\theta(x_t, t)]\,\Delta t$$
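The corresponding deterministic Euler sketch for the probability flow ODE, with the same hypothetical `score_model` and schedule; the only differences are the missing noise term and the factor of 2 on the score:

```python
import torch

def beta(t):
    # Hypothetical linear schedule (an assumption, not from the source)
    return 0.1 + 19.9 * t

@torch.no_grad()
def sample_ode(score_model, shape, n_steps=100, T=1.0):
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    dt = T / n_steps
    for i in reversed(range(n_steps)):
        t = torch.full((shape[0], 1), (i + 1) * dt)
        x = x + 0.5 * beta(t) * (x + score_model(x, t)) * dt  # deterministic Euler step
    return x
```

With a good score model the ODE typically tolerates far fewer steps than the SDE, which is why fast samplers build on it.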

=> In practice: higher-order ODE solvers (Runge-Kutta, linear multistep methods, exponential integrators, ...)

What should we use to solve them?

Reconsider the generative diffusion SDE: $dx_t = -\frac{1}{2}\beta(t)[x_t + 2s_\theta(x_t,t)]dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$

We can actually decompose it into two terms:

$$dx_t = -\frac{1}{2}\beta(t)[x_t + s_\theta(x_t,t)]\,dt - \frac{1}{2}\beta(t)s_\theta(x_t, t)\,dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$$

where

  • $-\frac{1}{2}\beta(t)[x_t + s_\theta(x_t,t)]\,dt$ is the Probability Flow ODE part
  • $- \frac{1}{2}\beta(t)s_\theta(x_t, t)\,dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$ is the Langevin Diffusion SDE part (combining the two is sketched below)
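One way to exploit this decomposition, in the spirit of predictor-corrector samplers (not spelled out in this section, so treat this as an assumption), is to alternate a deterministic probability-flow step with a stochastic Langevin correction step driven by the same learned score. A minimal sketch, reusing the hypothetical `score_model` interface from the samplers above:

```python
import torch

def beta(t):
    # Hypothetical linear schedule (an assumption, not from the source)
    return 0.1 + 19.9 * t

@torch.no_grad()
def sample_predictor_corrector(score_model, shape, n_steps=500, T=1.0, langevin_eps=1e-3):
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    dt = T / n_steps
    for i in reversed(range(n_steps)):
        t = torch.full((shape[0], 1), (i + 1) * dt)
        # Deterministic part: one probability flow ODE step
        x = x + 0.5 * beta(t) * (x + score_model(x, t)) * dt
        # Stochastic part: one Langevin step pushing samples towards q_t with the same score
        x = x + langevin_eps * score_model(x, t) + (2 * langevin_eps) ** 0.5 * torch.randn_like(x)
    return x
```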

SDE vs ODE Sampling: Pros and Cons

SDE Sampling:

  • Pro: Continuous noise injection can help to compensate for errors made during the diffusion process (Langevin sampling actively pushes samples towards the correct distribution).
  • Con: Often slower, because the stochastic terms themselves require a fine discretization during the solve.

ODE Sampling:

  • Pro: Can leverage fast ODE solvers. Best when targeting very fast sampling.
  • Con: No “stochastic” error correction, often slightly lower performance than stochastic sampling.

Diffusion Models as Energy-based Models

  • Assume an Energy-based Model (EBM): $p_{\boldsymbol{\theta}}(\mathbf{x}, t)=\frac{e^{-E_{\boldsymbol{\theta}}(\mathbf{x}, t)}}{\mathcal{Z}_{\boldsymbol{\theta}}(t)}$
  • Sample from the EBM via Langevin dynamics (sketched below): $x_{i+1} = x_i - \eta\nabla_{\mathbf{x}}E_\theta(\mathbf{x}_i, t) + \sqrt{2\eta}\,\mathcal{N}(0,I)$
  • Requires only the gradient of the energy $\nabla_{\mathbf{x}}E_\theta(\mathbf{x}_i, t)$, not $E_\theta(x,t)$ itself, nor the partition function $\mathcal{Z}_{\boldsymbol{\theta}}(t)$
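A minimal sketch of this Langevin sampler for a toy quadratic energy (a standard Gaussian EBM); the energy, step size, and step count are illustrative choices, not from the source:

```python
import torch

def energy(x, t):
    # Toy quadratic energy E(x, t) = ||x||^2 / 2, i.e. a standard Gaussian EBM (illustrative).
    return 0.5 * (x ** 2).sum(dim=-1)

def langevin_sample(x0, t, n_steps=1000, eta=1e-2):
    x = x0.clone()
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x, t).sum(), x)[0]     # gradient of the energy
        # Langevin update: gradient step on the energy plus scaled Gaussian noise
        x = x.detach() - eta * grad + (2 * eta) ** 0.5 * torch.randn_like(x)
    return x

samples = langevin_sample(torch.randn(1000, 2) * 3.0, t=0.0)
print(samples.mean(dim=0), samples.std(dim=0))   # approaches N(0, I), the EBM's distribution
```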

In diffusion models, we learn “energy gradients” for all diffused distributions directly:

$$\nabla_{\mathbf{x}}\log q_t(x) \approx s_\theta(x,t) =: \nabla_{\mathbf{x}}\log p_\theta(x,t) = -\nabla_{\mathbf{x}}E_\theta(x,t) - \nabla_{\mathbf{x}}\log \mathcal{Z}_{\boldsymbol{\theta}}(t) = -\nabla_{\mathbf{x}}E_\theta(x,t)$$

where $\nabla_{\mathbf{x}}\log \mathcal{Z}_{\boldsymbol{\theta}}(t) = 0$, since the partition function does not depend on $\mathbf{x}$.

=> Diffusion models model the energy gradient directly, along the entire diffusion process, and avoid modeling the partition function. The different noise levels along the diffusion are analogous to annealed sampling in EBMs.

Unique Identifiability of Diffusion Models

The model is supposed to approximate the score function of the diffused data distribution $q_t(x_t)$.

This denoising model is in principle uniquely determined by the data that we’re given and the forward diffusion process.

  • The denoising model $s_\theta(x_t, t)$ and the deterministic data encodings are uniquely determined by the data and the fixed forward diffusion
  • Even with different architectures and initializations, we recover identical model outputs and encodings (given sufficient training data, model capacity and optimization accuracy), in contrast to GANs, VAEs, etc.

img

Summary

Why use Differential Equation Framework?

Advantages of the Differential Equation framework for Diffusion models

  • Can leverage the broad existing literature on advanced and fast SDE and ODE solvers when sampling from the model; this accelerates sampling from diffusion models, which is crucial because sampling can be slow
  • Allows us to construct deterministic Probability Flow ODE
    • Deterministic Data Encodings
    • Log-likelihood Estimation like Continuous Normalizing Flows, etc.
  • Clean mathematical framework based on Diffusion Processes and Score Matching; connections to Neural ODEs, Continuous Normalizing Flows and Energy-based Models