Let p(x) be the probability density function (PDF) of a dataset, which we want to approximate with p_\theta(x):

p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}

  • Z_\theta is the normalization constant that makes the PDF integrate to 1
  • e^{-f_\theta(x)} keeps the expression non-negative, matching a basic property of a PDF
    • f_\theta(x) is learned so that e^{-f_\theta(x)} reflects how dense the probability is around x; a high value means a dense region, a low value a sparse region

Similar to the normal distribution \mathcal{N}(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}: the exponential makes it non-negative and the denominator normalizes it so the PDF integrates to 1.
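To make the parameterization concrete, here is a minimal sketch (in PyTorch, purely as an illustration) that writes a standard normal in the e^{-f_\theta(x)}/Z_\theta form with f(x) = \frac{(x-\mu)^2}{2\sigma^2}, and checks numerically that dividing by Z_\theta makes the density integrate to 1. The grid bounds and resolution are arbitrary choices.

```python
import math
import torch

# Write N(x | mu, sigma) in the energy-based form p(x) = exp(-f(x)) / Z.
mu, sigma = 0.0, 1.0

def f(x):
    # Gaussian "energy": f(x) = (x - mu)^2 / (2 * sigma^2)
    return (x - mu) ** 2 / (2 * sigma ** 2)

# Analytic normalizer of the Gaussian: Z = sqrt(2 * pi * sigma^2)
Z_analytic = math.sqrt(2 * math.pi * sigma ** 2)

# Numerical normalizer: integrate exp(-f(x)) over a wide 1-D grid (Riemann sum).
x = torch.linspace(-10.0, 10.0, 100_001)
dx = x[1] - x[0]
Z_numeric = (torch.exp(-f(x)) * dx).sum()

print(Z_analytic, Z_numeric.item())                       # both ~2.5066
print((torch.exp(-f(x)) / Z_numeric * dx).sum().item())   # ~1.0: the PDF integrates to 1
```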

Computing Z_\theta requires integrating e^{-f_\theta(x)} over the entire (potentially very high-dimensional) space of x, which is intractable.

  • Work with the score instead, so that Z_\theta drops out

Score

The score is the gradient of the log PDF with respect to x:

\nabla_x \log p_\theta(x) = \nabla_x \log \frac{e^{-f_\theta(x)}}{Z_\theta}

Using the logarithm rules \log\frac{a}{b} = \log a - \log b and \log e^{a} = a (log and ln are used interchangeably here, both meaning the natural logarithm):

\nabla_x \log \frac{e^{-f_\theta(x)}}{Z_\theta} = \nabla_x \log e^{-f_\theta(x)} - \nabla_x \log Z_\theta = -\nabla_x f_\theta(x) - \nabla_x \log Z_\theta

The second term is zero because Z_\theta does not depend on x.

This is similar to f(x) = x: \frac{df(x)}{dy} = 0 because f does not depend on y. Therefore \nabla_x \log Z_\theta = 0.

-\nabla_x f_\theta(x) - \nabla_x \log Z_\theta = -\nabla_x f_\theta(x) - 0 = s_\theta(x)

We define s_\theta(x) = -\nabla_x f_\theta(x).
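As a quick sanity check, here is a small PyTorch sketch (illustrative only) showing that the score does not depend on Z_\theta: autograd on \log p_\theta(x) = -f_\theta(x) - \log Z_\theta gives the same result for any value of Z_\theta, namely -\nabla_x f_\theta(x).

```python
import math
import torch

# Toy energy f(x) = x^2 / 2, so the score should be s(x) = -f'(x) = -x,
# no matter what value the normalizer Z takes.
def f(x):
    return x ** 2 / 2

x = torch.tensor([-2.0, 0.0, 1.5], requires_grad=True)

for Z in (1.0, 123.4):                     # two arbitrary values of Z
    log_p = -f(x) - math.log(Z)            # log p(x) = -f(x) - log Z
    score = torch.autograd.grad(log_p.sum(), x)[0]
    print(Z, score)                        # -x at every point, for both values of Z
```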

Score Matching

Score matching basically means minimizing the difference between the true score \nabla_x \log p(x) and the predicted score s_\theta(x):

\frac{1}{2}\mathbb{E}_{p(x)}[||\nabla_x \log p(x) - s_\theta(x)||^2_2]

The score function \nabla_x \log p_\theta(x) tells you in which direction to move a data point to increase its probability (the logarithm is monotonically increasing, so increasing \log p(x) increases p(x)).
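When the true density is known, this objective can be estimated directly by Monte Carlo. A tiny PyTorch sketch, assuming p(x) = \mathcal{N}(0, 1) (so the true score is -x) and a hypothetical one-parameter score model s_\theta(x) = \theta x: the loss is smallest at \theta = -1.

```python
import torch

# Explicit score matching for p(x) = N(0, 1); the true score is -x.
# Hypothetical toy score model: s_theta(x) = theta * x.
torch.manual_seed(0)
x = torch.randn(10_000)              # samples from p(x)
true_score = -x

for theta in (-1.5, -1.0, -0.5, 0.0):
    pred_score = theta * x
    loss = 0.5 * ((true_score - pred_score) ** 2).mean()
    print(theta, loss.item())        # the loss bottoms out at theta = -1
```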

But the problem is that we do not know the true score \nabla_x \log p(x). We need a math trick.

Recall that an expectation under p(x) is \mathbb{E}_{p(x)}[g(x)] = \int p(x)\, g(x)\, dx, so the objective can be rewritten as

\frac{1}{2} \int p(x) \left(\nabla_x \log p(x) - s_\theta(x)\right)^2 dx

Since (a-b)^2 = a^2 + b^2 - 2ab, the equation can be further expanded:

\frac{1}{2} \int p(x) \left[ (\nabla_x \log p(x))^2 + s_\theta(x)^2 - 2\, \nabla_x \log p(x)\, s_\theta(x) \right] dx

Separating the three terms, we get

\frac{1}{2} \int p(x) (\nabla_x \log p(x))^2 \, dx + \frac{1}{2} \int p(x)\, s_\theta(x)^2 \, dx - \frac{1}{2} \int p(x)\, 2\, \nabla_x \log p(x)\, s_\theta(x) \, dx

The 2 in the last term cancels the \frac{1}{2}:

\frac{1}{2} \int p(x) (\nabla_x \log p(x))^2 \, dx + \frac{1}{2} \int p(x)\, s_\theta(x)^2 \, dx - \int p(x)\, \nabla_x \log p(x)\, s_\theta(x) \, dx

Looking at the last term: by the chain rule \frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)}, we get \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}, so

\int p(x)\, \nabla_x \log p(x)\, s_\theta(x) \, dx = \int p(x)\, \frac{\nabla_x p(x)}{p(x)}\, s_\theta(x) \, dx = \int \nabla_x p(x)\, s_\theta(x) \, dx

By the integration by parts formula \int u \, dv = uv \big|^{b}_{a} - \int v \, du, with u = s_\theta(x) and dv = \nabla_x p(x)\, dx:

\int \nabla_x p(x)\, s_\theta(x) \, dx = p(x)\, s_\theta(x) \Big|^{\infty}_{x=-\infty} - \int p(x)\, \nabla_x s_\theta(x) \, dx = -\int p(x)\, \nabla_x s_\theta(x) \, dx

The boundary term p(x)\, s_\theta(x) \big|^{\infty}_{x=-\infty} vanishes because probability densities typically decay to 0 as x \to \pm\infty. Think of a normal distribution: far out in the left and right tails the density is essentially 0.
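A quick numerical check of this integration-by-parts step in PyTorch, assuming p(x) = \mathcal{N}(0, 1) (score -x) and an arbitrary smooth test function s(x) = \tanh(x) standing in for the score model: both sides of \int p\, \nabla_x \log p \; s \, dx = -\int p\, \nabla_x s \, dx agree.

```python
import math
import torch

# Check  ∫ p(x) ∇log p(x) s(x) dx  =  -∫ p(x) ∇s(x) dx
# for p = N(0, 1) (score -x) and the test function s(x) = tanh(x).
x = torch.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
p = torch.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)

lhs = (p * (-x) * torch.tanh(x) * dx).sum()            # ∫ p ∇log p s dx
rhs = -(p * (1 - torch.tanh(x) ** 2) * dx).sum()       # -∫ p ∇s dx  (tanh' = 1 - tanh^2)
print(lhs.item(), rhs.item())                          # the two values match
```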

Plugging this back into the previous equation, the two minus signs cancel and the last term becomes +\int p(x)\, \nabla_x s_\theta(x)\, dx:

\frac{1}{2} \int p(x) (\nabla_x \log p(x))^2 \, dx + \frac{1}{2} \int p(x)\, s_\theta(x)^2 \, dx + \int p(x)\, \nabla_x s_\theta(x) \, dx

Since the first term can be regarded as a constant for the optimization (no s_\theta(x) involved), it can be discarded:

\frac{1}{2} \int p(x)\, s_\theta(x)^2 \, dx + \int p(x)\, \nabla_x s_\theta(x) \, dx

Rewriting this in expectation form, \mathbb{E}_{p(x)}[g(x)] = \int p(x)\, g(x)\, dx:

\frac{1}{2}\mathbb{E}_{p(x)}[s_\theta(x)^2] + \mathbb{E}_{p(x)}[\nabla_x s_\theta(x)]

Note that we now have an objective equivalent (up to a constant) to \frac{1}{2}\mathbb{E}_{p(x)}[||\nabla_x \log p(x) - s_\theta(x)||^2_2], without having to know \nabla_x \log p(x).

  • \mathbb{E}_{p(x)}[\nabla_x s_\theta(x)]: minimizing this term pushes the derivative of the score to be negative at data points, i.e. the score should be decreasing there (data points should sit at local maxima of the modeled density)
  • \mathbb{E}_{p(x)}[s_\theta(x)^2]: minimizing this term pushes the predicted scores themselves to be 0 at data points

The goal is to learn a model that, given some point in data space, predicts the direction to move in to get closer to the data.
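Below is a minimal 1-D sketch of training with this objective, assuming PyTorch, data drawn from \mathcal{N}(2, 1), and an arbitrary small MLP; the derivative \nabla_x s_\theta(x) is taken with autograd. The learned score should roughly approximate the true score -(x - 2).

```python
import torch
import torch.nn as nn

# Implicit score matching in 1-D: minimize  E[ 0.5 * s_theta(x)^2 + d s_theta(x)/dx ].
torch.manual_seed(0)
data = 2.0 + torch.randn(2048, 1)            # samples from N(2, 1); true score is -(x - 2)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x = data.clone().requires_grad_(True)
    s = model(x)                             # predicted scores s_theta(x)
    # d s_theta / dx via autograd; cheap here because the input is 1-dimensional
    ds_dx = torch.autograd.grad(s.sum(), x, create_graph=True)[0]
    loss = (0.5 * s ** 2 + ds_dx).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

test = torch.tensor([[1.0], [2.0], [3.0]])
print(model(test).detach().squeeze())        # roughly (1, 0, -1), i.e. -(x - 2)
```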

However, this objective is very expensive to compute: the term \mathbb{E}_{p(x)}[\nabla_x s_\theta(x)] requires the trace of the Jacobian of s_\theta(x), i.e. one backpropagation pass per input dimension (see the sketch below). This becomes a big problem when the input space is very large (e.g. images).

  • Expensive Training
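To see where the cost comes from, here is an illustrative PyTorch sketch (the model and dimensionality are placeholders) of computing the trace of the Jacobian of a D-dimensional score model exactly: it needs one backward pass per input dimension.

```python
import torch
import torch.nn as nn

# Exact trace of the Jacobian of s_theta: one backward pass per input dimension.
D = 8                                        # placeholder; think D = 3*256*256 for images
model = nn.Sequential(nn.Linear(D, 64), nn.Tanh(), nn.Linear(64, D))

x = torch.randn(16, D, requires_grad=True)   # a batch of 16 points
s = model(x)                                 # predicted scores, shape (16, D)

trace = torch.zeros(x.shape[0])
for i in range(D):                           # D backward passes -> expensive for large D
    grad_i = torch.autograd.grad(s[:, i].sum(), x, retain_graph=True)[0]
    trace = trace + grad_i[:, i]             # collect the diagonal of the Jacobian

print(trace.mean().item())                   # average of tr(∇_x s_theta(x)) over the batch
```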

There is also another problem. Looking at the data densities: during training, the samples come from the dense regions of the data distribution, so the model only becomes good at estimating the score near the data.

But when we start sampling from a random position, it can be far away from those dense regions, where the estimated scores are unreliable:

  • Low Coverage of Data Space

Noise Perturbation

One way to tackle the problem of low coverage of the data space is to add noise to the data, \tilde{x} = x + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2 I); the data distribution p(x) then becomes the perturbed distribution p_\sigma(\tilde{x}).

  • Adding noise to the data covers more of the data space
  • Adding too much noise may make the distribution too different from the original (a trade-off)
The score matching objective is then written for the perturbed distribution:

\frac{1}{2}\mathbb{E}_{p_\sigma(\tilde{x})}[||\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) - s_\theta(\tilde{x})||^2_2] \rightarrow \frac{1}{2}\mathbb{E}_{p_\sigma(\tilde{x})}[s_\theta(\tilde{x})^2] + \mathbb{E}_{p_\sigma(\tilde{x})}[\nabla_{\tilde{x}} s_\theta(\tilde{x})]
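A tiny PyTorch illustration of the trade-off, assuming a narrow 1-D dataset near x = 3: larger \sigma spreads the perturbed samples over more of the space, but also moves p_\sigma(\tilde{x}) further from the original p(x).

```python
import torch

# Perturb a narrow 1-D dataset with Gaussian noise of increasing sigma.
torch.manual_seed(0)
x = 3.0 + 0.1 * torch.randn(10_000)              # data concentrated around x = 3

for sigma in (0.0, 0.5, 2.0):
    x_tilde = x + sigma * torch.randn_like(x)    # x~ = x + eps,  eps ~ N(0, sigma^2)
    print(sigma, x_tilde.min().item(), x_tilde.max().item(), x_tilde.std().item())
# Larger sigma -> wider coverage of the space, but a distribution further from p(x).
```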

Denoising Score Matching

Connecting Score Matching and Denoising Autoencoders.

Denoising Autoencoders were used to learn representative features of data

  • Train a regular autoencoder, but corrupt the input with noise and train the decoder to reconstruct the original, clean input
Denoising score matching replaces the unknown marginal score \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) with the conditional score \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}|x); the two objectives are equivalent up to a constant that does not depend on \theta:

\frac{1}{2}\mathbb{E}_{p_\sigma(\tilde{x})}[||\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) - s_\theta(\tilde{x})||^2_2] \rightarrow \frac{1}{2}\mathbb{E}_{x\sim p(x),\, \tilde{x} \sim p_\sigma(\tilde{x}|x)}[||s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}|x)||^2_2]

For Gaussian noise, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2}, so the regression target is available in closed form.
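A minimal 1-D sketch of this denoising objective, assuming PyTorch, data from \mathcal{N}(2, 1), a single fixed noise level \sigma, and an arbitrary small MLP; the regression target is the closed-form conditional score above, so \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) is never needed.

```python
import torch
import torch.nn as nn

# Denoising score matching in 1-D with Gaussian noise:
#   target = ∇_x~ log p_sigma(x~ | x) = -(x~ - x) / sigma^2
torch.manual_seed(0)
sigma = 0.5
data = 2.0 + torch.randn(4096, 1)                  # samples from N(2, 1)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x = data
    x_tilde = x + sigma * torch.randn_like(x)      # x~ ~ p_sigma(x~ | x)
    target = -(x_tilde - x) / sigma ** 2           # closed-form conditional score
    loss = 0.5 * ((model(x_tilde) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The model approximates ∇ log p_sigma, which for this data is roughly -(x - 2) / (1 + sigma^2).
print(model(torch.tensor([[1.0], [2.0], [3.0]])).detach().squeeze())
```

Note that no derivative of s_\theta appears in this objective, so the Jacobian cost of the implicit objective disappears as well.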

Reference

Diffusion Models From Scratch | Score-Based Generative Models Explained | Math Explained