Score-based Generative Modeling with Differential Equations
“Score-Based Generative Modeling through Stochastic Differential Equations”
First, let's revisit the forward diffusion process:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$
Now consider the limit of many small steps:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$
Applying the reparametrization trick:
$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\mathcal{N}(0, I)$$
We define the step size $\beta_t$ as $\beta(t)\Delta t$:
$$x_t = \sqrt{1-\beta(t)\Delta t}\,x_{t-1} + \sqrt{\beta(t)\Delta t}\,\mathcal{N}(0, I)$$
With many, many timesteps, $\Delta t$ goes towards 0 and we can Taylor-expand the square root, $\sqrt{1-\beta(t)\Delta t} \approx 1 - \tfrac{1}{2}\beta(t)\Delta t$:
$$x_t \approx x_{t-1} - \tfrac{1}{2}\beta(t)\Delta t\,x_{t-1} + \sqrt{\beta(t)\Delta t}\,\mathcal{N}(0, I)$$
We can interpret this as an iterative update: the new state $x_t$ is given by the old state $x_{t-1}$, minus a term that depends on $x_{t-1}$ itself, plus some added noise.
This iterative update corresponds to a particular discretization of a Stochastic Differential Equation, namely:
$$dx_t = -\tfrac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,d\omega_t$$
Stochastic Differential Equation
Describes the diffusion in the infinitesimal (continuous-time) limit
Ordinary Differential Equation (ODE):
$$\frac{dx}{dt} = f(x, t) \quad \text{or} \quad dx = f(x, t)\,dt$$
$x$ is the state we are interested in (e.g. the pixels of an image)
$t$ is a continuous time variable along which the state $x$ changes/evolves
We can integrate the equation to obtain the solution $x(t)$.
However in practice this f function is often a highly complex nonlinear function (e.g. a neural network)
The Analytical Solution (which in general cannot be found in closed form)
$$x(t) = x(0) + \int_0^t f(x, \tau)\,d\tau$$
Iterative Numerical Solution
$$x(t + \Delta t) \approx x(t) + f(x(t), t)\,\Delta t$$
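This update rule is the forward Euler method. A minimal sketch in Python, using the toy dynamics $f(x, t) = -x$ (chosen here only because its analytical solution $x(t) = x(0)e^{-t}$ is known):

```python
import numpy as np

def euler_ode(f, x0, t0, t1, n_steps):
    """Integrate dx/dt = f(x, t) with the forward Euler method."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + f(x, t) * dt  # the iterative update from above
        t += dt
    return x

# Toy check against the known solution x(1) = exp(-1)
x1 = euler_ode(lambda x, t: -x, x0=1.0, t0=0.0, t1=1.0, n_steps=1000)
print(x1, np.exp(-1.0))  # the two values agree up to discretization error
```

Smaller $\Delta t$ (more steps) reduces the discretization error, at the cost of more function evaluations.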
Stochastic Differential Equation (SDE):
$$\frac{dx}{dt} = f(x, t) + \sigma(x, t)\,\omega_t$$
$\omega_t$ is a Wiener process (in practice, Gaussian white noise)
$f(x, t)$ is the drift coefficient (which pulls towards the mode)
$\sigma(x, t)$ is the diffusion coefficient, which scales the noise
To solve it, we proceed similarly to the iterative numerical solution above:
$$x(t + \Delta t) \approx x(t) + f(x(t), t)\,\Delta t + \sigma(x(t), t)\,\sqrt{\Delta t}\,\mathcal{N}(0, I)$$
But since the diffusion coefficient scales a Wiener-process increment, noise proportional to $\sigma$ is added at every step, so each solve yields a different trajectory; there is no unique solution as in the ODE case.
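This is the Euler–Maruyama scheme. A sketch on a toy Ornstein–Uhlenbeck process $dx = -x\,dt + d\omega_t$, whose stationary variance is known to be $1/2$ (all names here are illustrative):

```python
import numpy as np

def euler_maruyama(f, sigma, x0, t0, t1, n_steps, rng):
    """Simulate one realization of dx = f(x,t) dt + sigma(x,t) dw_t."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        # Wiener increments have variance dt, hence the sqrt(dt) scaling
        noise = rng.standard_normal(np.shape(x))
        x = x + f(x, t) * dt + sigma(x, t) * np.sqrt(dt) * noise
        t += dt
    return x

rng = np.random.default_rng(0)
x0 = np.zeros(5000)  # 5000 independent sample paths, all starting at 0
xT = euler_maruyama(lambda x, t: -x, lambda x, t: 1.0, x0, 0.0, 5.0, 1000, rng)
print(xT.var())  # ≈ 0.5, the stationary variance of this OU process
```

Rerunning with a different seed gives different trajectories but the same statistics, which is exactly the "no unique solution" point above.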
Forward Diffusion Process as SDE
As mentioned, the forward diffusion process can be written as an SDE:
$$dx_t = -\tfrac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,d\omega_t$$
$f(x_t, t) = -\tfrac{1}{2}\beta(t)\,x_t$ is the drift coefficient (which pulls towards the mode)
$\sigma(t) = \sqrt{\beta(t)}$ is the diffusion coefficient (which injects noise)
Special case of more general SDEs used in generative diffusion models:
Notice that $\sigma_t^2 \to 0$ as $t \to 0$, so the loss is heavily amplified when sampling $t$ close to 0 (for $\lambda(t) = \beta(t)$), leading to high variance. We therefore wouldn't want to use implementation 2 right away; some tricks are added instead.
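As a sanity check, we can simulate this forward SDE with Euler–Maruyama and verify that it transports data to approximately $\mathcal{N}(0, I)$; the linear $\beta(t)$ schedule below is just an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = lambda t: 0.1 + 19.9 * t               # assumed linear schedule
x = 2.0 + 3.0 * rng.standard_normal(10000)    # "data": N(mean=2, var=9)

T, n = 1.0, 1000
dt = T / n
for i in range(n):
    b = beta(i * dt)
    # Euler-Maruyama step of dx = -1/2 beta(t) x dt + sqrt(beta(t)) dw
    x = x - 0.5 * b * x * dt + np.sqrt(b * dt) * rng.standard_normal(x.size)

print(x.mean(), x.var())  # ≈ 0 and ≈ 1: the diffused distribution is ~N(0, I)
```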
Equivalent in distribution to the Probability Flow ODE:
$$dx_t = -\tfrac{1}{2}\beta(t)\left[x_t + \nabla_{x_t}\log q_t(x_t)\right]dt$$
Probability Flow ODE: Diffusion Models as Continuous Normalizing Flows
So why should we care about the Probability Flow ODE framework? Because an ordinary differential equation admits advanced, off-the-shelf ODE solvers, making it easier to work with than an SDE.
Enables use of advanced ODE solvers
Deterministic encoding and generation (semantic image interpolation, etc.)
Allows encoding datapoints in the latent space
Continuous changes in latent space xT result in continuous, semantically meaningful changes in data space x0
Log-likelihood computation (instantaneous change of variables): $\log p_0(x_0) = \log p_T(x_T) + \int_0^T \nabla_x \cdot f(x_t, t)\,dt$
Diffusion models can be considered CNFs trained with score matching
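A sketch of the change-of-variables likelihood computation on a 1-D toy problem where the score is available in closed form (data $\mathcal{N}(0, s^2)$ under a VP diffusion with constant $\beta$; in a real model, the analytic score and divergence below would come from the network and a trace estimator):

```python
import numpy as np

def gauss_logpdf(x, var):
    return -0.5 * (np.log(2 * np.pi * var) + x**2 / var)

beta, s2 = 1.0, 4.0       # constant beta(t); data distribution q_0 = N(0, s2)
T, n = 1.0, 20000
dt = T / n

def var_t(t):             # marginal variance of q_t under the VP forward SDE
    a = np.exp(-beta * t)
    return a * s2 + (1 - a)

x0 = 1.5                  # data point whose model likelihood we compute
x, logdet = x0, 0.0
for i in range(n):
    v = var_t(i * dt)
    f = -0.5 * beta * (x - x / v)      # PF ODE drift; score of N(0,v) is -x/v
    dfdx = -0.5 * beta * (1 - 1 / v)   # its divergence (in 1-D: just df/dx)
    x += f * dt                        # transport the point forward in time
    logdet += dfdx * dt                # accumulate the log-density change

log_p0 = gauss_logpdf(x, var_t(T)) + logdet
print(log_p0, gauss_logpdf(x0, s2))   # the two values agree
```

With the exact score, the estimated log-likelihood matches the true data log-density up to discretization error.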
Synthesis with SDE vs ODE
In the SDE, trajectories zigzag while following the distribution; in the ODE, trajectories are deterministic and smooth. (Both land in the modes of the data distribution.)
Generative Reverse Diffusion SDE (stochastic):
$$dx_t = -\tfrac{1}{2}\beta(t)\left[x_t + 2\,s_\theta(x_t, t)\right]dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$$
Generative Probability Flow ODE (deterministic):
$$dx_t = -\tfrac{1}{2}\beta(t)\left[x_t + s_\theta(x_t, t)\right]dt$$
Solving generative SDE or ODE in practice
Sampling from “Continuous-Time” Diffusion Models: How to solve the generative SDE or ODE in practice?
$-\tfrac{1}{2}\beta(t)\left[x_t + s_\theta(x_t, t)\right]dt$ is the Probability Flow ODE part
$-\tfrac{1}{2}\beta(t)\,s_\theta(x_t, t)\,dt + \sqrt{\beta(t)}\,d\bar{\omega}_t$ is the Langevin diffusion part
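The split is a purely algebraic identity (the two drifts sum to the reverse-SDE drift $-\tfrac{1}{2}\beta(t)[x_t + 2\,s_\theta]$); a quick numeric check with arbitrary placeholder values:

```python
import numpy as np

# Arbitrary placeholder values for the state, score, and beta(t)
x, s, b = 0.7, -1.3, 0.5
reverse_sde_drift = -0.5 * b * (x + 2 * s)   # drift of the reverse SDE
pf_ode_drift = -0.5 * b * (x + s)            # Probability Flow ODE part
langevin_drift = -0.5 * b * s                # Langevin diffusion part
print(np.isclose(reverse_sde_drift, pf_ode_drift + langevin_drift))  # True
```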
SDE vs ODE Sampling: Pro’s and Con’s
SDE Sampling:
Pro: Continuous noise injection can help compensate for errors made during the diffusion process (Langevin sampling actively pushes towards the correct distribution).
Con: Often slower, because the stochastic terms themselves require a fine discretization during the solve.
ODE Sampling:
Pro: Can leverage fast ODE solvers. Best when targeting very fast sampling.
Con: No “stochastic” error correction, often slightly lower performance than stochastic sampling.
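Both samplers side by side on a 1-D toy problem where the score is known analytically (data $\mathcal{N}(0, 4)$, constant $\beta = 1$; in a real model, the learned $s_\theta$ replaces the analytic score):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, s2, T, n = 1.0, 4.0, 8.0, 2000
dt = T / n

def score(x, t):
    # Analytic score of the diffused marginal N(0, var_t)
    var_t = np.exp(-beta * t) * s2 + (1 - np.exp(-beta * t))
    return -x / var_t

x_ode = rng.standard_normal(20000)  # start both samplers from the prior N(0, I)
x_sde = x_ode.copy()
for i in range(n):                  # integrate backwards from t = T to t = 0
    t = T - i * dt
    x_ode = x_ode + 0.5 * beta * (x_ode + score(x_ode, t)) * dt
    x_sde = (x_sde + 0.5 * beta * (x_sde + 2 * score(x_sde, t)) * dt
             + np.sqrt(beta * dt) * rng.standard_normal(x_sde.size))

print(x_ode.var(), x_sde.var())  # both ≈ 4, the variance of the data
```

The ODE sampler maps each prior sample to a fixed data sample (deterministic encoding), while the SDE sampler injects fresh noise at every step.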
Diffusion Models as Energy-based Models
Assume an Energy-based Model (EBM): $p_\theta(x, t) = \frac{e^{-E_\theta(x, t)}}{Z_\theta(t)}$
Sample from the EBM via Langevin dynamics: $x_{i+1} = x_i - \eta\,\nabla_x E_\theta(x_i, t) + \sqrt{2\eta}\,\mathcal{N}(0, I)$
This requires only the gradient of the energy $\nabla_x E_\theta(x_i, t)$, not $E_\theta(x, t)$ itself, nor the partition function $Z_\theta(t)$
In diffusion models, we learn the "energy gradients" $s_\theta(x_t, t) \approx \nabla_{x_t}\log q_t(x_t) = -\nabla_{x_t} E(x_t, t)$ for all diffused distributions directly:
=> Diffusion models model the energy gradient directly, along the entire diffusion process, and avoid modeling the partition function. The different noise levels along the diffusion are analogous to annealed sampling in EBMs.
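A minimal Langevin sampler for a toy EBM with energy $E(x) = x^2/2$, so that $p(x) \propto e^{-E(x)}$ is $\mathcal{N}(0, 1)$ (names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
grad_E = lambda x: x                  # gradient of the energy E(x) = x^2 / 2

eta = 1e-3                            # step size
x = 3.0 * rng.standard_normal(5000)   # arbitrary initialization, 5000 chains
for _ in range(5000):
    # Langevin update: gradient descent on E plus sqrt(2*eta)-scaled noise
    x = x - eta * grad_E(x) + np.sqrt(2 * eta) * rng.standard_normal(x.size)

print(x.mean(), x.var())  # ≈ 0 and ≈ 1: samples follow N(0, 1)
```

Note that only `grad_E` appears in the update; neither the energy itself nor the partition function is ever evaluated.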
Unique Identifiability of Diffusion Models
The model is supposed to approximate the score function of the diffused data $q_t(x_t)$.
This denoising model is in principle uniquely determined by the data that we’re given and the forward diffusion process.
Denoising model sθ(xt,t) and deterministic data encodings uniquely determined by data and fixed forward diffusion
Even with different architectures and initializations, we recover identical model outputs and encodings (given sufficient training data, model capacity, and optimization accuracy), in contrast to GANs, VAEs, etc.
Summary
Why use Differential Equation Framework?
Advantages of the Differential Equation framework for Diffusion models
Can leverage the broad existing literature on advanced, fast SDE and ODE solvers when sampling from the model; this accelerates sampling, which is crucial because diffusion models can be slow
Allows us to construct deterministic Probability Flow ODE
Deterministic Data Encodings
Log-likelihood Estimation like Continuous Normalizing Flows, etc.
Clean mathematical framework based on Diffusion Processes and Score Matching; connections to Neural ODEs, Continuous Normalizing Flows and Energy-based Models