Let p(x) be the probability density function (PDF) of a dataset, which we want to approximate with p_\theta(x):

p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}

  • Z_\theta is the normalization constant that makes the PDF integrate to 1
  • e^{-f_\theta(x)} keeps the expression non-negative, matching a basic property of a PDF
    • f_\theta(x) is learned so that e^{-f_\theta(x)} reflects how dense the probability is around x; a high value means a dense region, a low value a sparse region

Similar to the normal distribution \mathcal{N}(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}: the exponential makes it non-negative and the denominator normalizes it so the PDF integrates to 1.
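To make the parameterization concrete, here is a minimal sketch (in PyTorch, purely as an illustration) that writes a standard normal in the e^{-f_\theta(x)}/Z_\theta form with f(x) = \frac{(x-\mu)^2}{2\sigma^2}, and checks numerically that dividing by Z_\theta makes the density integrate to 1. The grid bounds and resolution are arbitrary choices.

```python
import math
import torch

# Write N(x | mu, sigma) in the energy-based form p(x) = exp(-f(x)) / Z.
mu, sigma = 0.0, 1.0

def f(x):
    # Gaussian "energy": f(x) = (x - mu)^2 / (2 * sigma^2)
    return (x - mu) ** 2 / (2 * sigma ** 2)

# Analytic normalizer of the Gaussian: Z = sqrt(2 * pi * sigma^2)
Z_analytic = math.sqrt(2 * math.pi * sigma ** 2)

# Numerical normalizer: integrate exp(-f(x)) over a wide 1-D grid (Riemann sum).
x = torch.linspace(-10.0, 10.0, 100_001)
dx = x[1] - x[0]
Z_numeric = (torch.exp(-f(x)) * dx).sum()

print(Z_analytic, Z_numeric.item())                       # both ~2.5066
print((torch.exp(-f(x)) / Z_numeric * dx).sum().item())   # ~1.0: the PDF integrates to 1
```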

Computing Z_\theta requires integrating e^{-f_\theta(x)} over the entire (potentially very high-dimensional) space of x, which is intractable.

  • Work with the score instead, so that Z_\theta drops out

Score

The score is the gradient of the log PDF with respect to x:

\nabla_x \log p_\theta(x) = \nabla_x \log \frac{e^{-f_\theta(x)}}{Z_\theta}

Using the logarithm rules \log\frac{a}{b} = \log a - \log b and \log e^{a} = a (log and ln are used interchangeably here, both meaning the natural logarithm):

\nabla_x \log \frac{e^{-f_\theta(x)}}{Z_\theta} = \nabla_x \log e^{-f_\theta(x)} - \nabla_x \log Z_\theta = -\nabla_x f_\theta(x) - \nabla_x \log Z_\theta

The second term is zero because Z_\theta does not depend on x.

This is similar to f(x) = x: \frac{df(x)}{dy} = 0 because f does not depend on y. Therefore \nabla_x \log Z_\theta = 0.

-\nabla_x f_\theta(x) - \nabla_x \log Z_\theta = -\nabla_x f_\theta(x) - 0 = s_\theta(x)

We define s_\theta(x) = -\nabla_x f_\theta(x).
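As a quick sanity check, here is a small PyTorch sketch (illustrative only) showing that the score does not depend on Z_\theta: autograd on \log p_\theta(x) = -f_\theta(x) - \log Z_\theta gives the same result for any value of Z_\theta, namely -\nabla_x f_\theta(x).

```python
import math
import torch

# Toy energy f(x) = x^2 / 2, so the score should be s(x) = -f'(x) = -x,
# no matter what value the normalizer Z takes.
def f(x):
    return x ** 2 / 2

x = torch.tensor([-2.0, 0.0, 1.5], requires_grad=True)

for Z in (1.0, 123.4):                     # two arbitrary values of Z
    log_p = -f(x) - math.log(Z)            # log p(x) = -f(x) - log Z
    score = torch.autograd.grad(log_p.sum(), x)[0]
    print(Z, score)                        # -x at every point, for both values of Z
```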

Score Matching

Score matching basically means minimizing the difference between the true score \nabla_x \log p(x) and the predicted score s_\theta(x):

\frac{1}{2}\mathbb{E}_{p(x)}[||\nabla_x \log p(x) - s_\theta(x)||^2_2]

The score function \nabla_x \log p_\theta(x) tells you in which direction to move a data point to increase its probability (the logarithm is monotonically increasing, so increasing \log p(x) increases p(x)).
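When the true density is known, this objective can be estimated directly by Monte Carlo. A tiny PyTorch sketch, assuming p(x) = \mathcal{N}(0, 1) (so the true score is -x) and a hypothetical one-parameter score model s_\theta(x) = \theta x: the loss is smallest at \theta = -1.

```python
import torch

# Explicit score matching for p(x) = N(0, 1); the true score is -x.
# Hypothetical toy score model: s_theta(x) = theta * x.
torch.manual_seed(0)
x = torch.randn(10_000)              # samples from p(x)
true_score = -x

for theta in (-1.5, -1.0, -0.5, 0.0):
    pred_score = theta * x
    loss = 0.5 * ((true_score - pred_score) ** 2).mean()
    print(theta, loss.item())        # the loss bottoms out at theta = -1
```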

But the problem is that we do not know the true score \nabla_x \log p(x). We need a math trick.

Recall that an expectation under p(x) is \mathbb{E}_{p(x)}[g(x)] = \int p(x)\, g(x)\, dx, so the objective can be rewritten as

\frac{1}{2} \int p(x) \left(\nabla_x \log p(x) - s_\theta(x)\right)^2 dx

Since (a-b)^2 = a^2 + b^2 - 2ab, the equation can be further expanded:

\frac{1}{2} \int p(x) \left[ (\nabla_x \log p(x))^2 + s_\theta(x)^2 - 2\, \nabla_x \log p(x)\, s_\theta(x) \right] dx

Separating the three terms, we get

\frac{1}{2} \int p(x) (\nabla_x \log p(x))^2 \, dx + \frac{1}{2} \int p(x)\, s_\theta(x)^2 \, dx - \frac{1}{2} \int p(x)\, 2\, \nabla_x \log p(x)\, s_\theta(x) \, dx

The 2 in the last term cancels the \frac{1}{2}:

\frac{1}{2} \int p(x) (\nabla_x \log p(x))^2 \, dx + \frac{1}{2} \int p(x)\, s_\theta(x)^2 \, dx - \int p(x)\, \nabla_x \log p(x)\, s_\theta(x) \, dx

Looking at the last term: by the chain rule \frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)}, we get \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}, so

\int p(x)\, \nabla_x \log p(x)\, s_\theta(x) \, dx = \int p(x)\, \frac{\nabla_x p(x)}{p(x)}\, s_\theta(x) \, dx = \int \nabla_x p(x)\, s_\theta(x) \, dx

By the integration by parts formula \int u \, dv = uv \big|^{b}_{a} - \int v \, du, with u = s_\theta(x) and dv = \nabla_x p(x)\, dx:

\int \nabla_x p(x)\, s_\theta(x) \, dx = p(x)\, s_\theta(x) \Big|^{\infty}_{x=-\infty} - \int p(x)\, \nabla_x s_\theta(x) \, dx = -\int p(x)\, \nabla_x s_\theta(x) \, dx

The boundary term p(x)\, s_\theta(x) \big|^{\infty}_{x=-\infty} vanishes because probability densities typically decay to 0 as x \to \pm\infty. Think of a normal distribution: far out in the left and right tails the density is essentially 0.
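A quick numerical check of this integration-by-parts step in PyTorch, assuming p(x) = \mathcal{N}(0, 1) (score -x) and an arbitrary smooth test function s(x) = \tanh(x) standing in for the score model: both sides of \int p\, \nabla_x \log p \; s \, dx = -\int p\, \nabla_x s \, dx agree.

```python
import math
import torch

# Check  ∫ p(x) ∇log p(x) s(x) dx  =  -∫ p(x) ∇s(x) dx
# for p = N(0, 1) (score -x) and the test function s(x) = tanh(x).
x = torch.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
p = torch.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)

lhs = (p * (-x) * torch.tanh(x) * dx).sum()            # ∫ p ∇log p s dx
rhs = -(p * (1 - torch.tanh(x) ** 2) * dx).sum()       # -∫ p ∇s dx  (tanh' = 1 - tanh^2)
print(lhs.item(), rhs.item())                          # the two values match
```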

Plugging this back into the previous equation, the two minus signs cancel and the last term becomes +\int p(x)\, \nabla_x s_\theta(x)\, dx:

\frac{1}{2} \int p(x) (\nabla_x \log p(x))^2 \, dx + \frac{1}{2} \int p(x)\, s_\theta(x)^2 \, dx + \int p(x)\, \nabla_x s_\theta(x) \, dx

Since the first term can be regarded as a constant for the optimization (no s_\theta(x) involved), it can be discarded:

\frac{1}{2} \int p(x)\, s_\theta(x)^2 \, dx + \int p(x)\, \nabla_x s_\theta(x) \, dx

Rewriting this in expectation form, \mathbb{E}_{p(x)}[g(x)] = \int p(x)\, g(x)\, dx:

\frac{1}{2}\mathbb{E}_{p(x)}[s_\theta(x)^2] + \mathbb{E}_{p(x)}[\nabla_x s_\theta(x)]

Note that we now have an objective equivalent (up to a constant) to \frac{1}{2}\mathbb{E}_{p(x)}[||\nabla_x \log p(x) - s_\theta(x)||^2_2], without having to know \nabla_x \log p(x).

  • \mathbb{E}_{p(x)}[\nabla_x s_\theta(x)]: minimizing this term pushes the derivative of the score to be negative at data points, i.e. the score should be decreasing there (data points should sit at local maxima of the modeled density)
  • \mathbb{E}_{p(x)}[s_\theta(x)^2]: minimizing this term pushes the predicted scores themselves to be 0 at data points

The goal is to learn a model that, given some point in data space, predicts the direction to move in to get closer to the data.
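Below is a minimal 1-D sketch of training with this objective, assuming PyTorch, data drawn from \mathcal{N}(2, 1), and an arbitrary small MLP; the derivative \nabla_x s_\theta(x) is taken with autograd. The learned score should roughly approximate the true score -(x - 2).

```python
import torch
import torch.nn as nn

# Implicit score matching in 1-D: minimize  E[ 0.5 * s_theta(x)^2 + d s_theta(x)/dx ].
torch.manual_seed(0)
data = 2.0 + torch.randn(2048, 1)            # samples from N(2, 1); true score is -(x - 2)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x = data.clone().requires_grad_(True)
    s = model(x)                             # predicted scores s_theta(x)
    # d s_theta / dx via autograd; cheap here because the input is 1-dimensional
    ds_dx = torch.autograd.grad(s.sum(), x, create_graph=True)[0]
    loss = (0.5 * s ** 2 + ds_dx).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

test = torch.tensor([[1.0], [2.0], [3.0]])
print(model(test).detach().squeeze())        # roughly (1, 0, -1), i.e. -(x - 2)
```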

However, this objective is very expensive to compute: the term \mathbb{E}_{p(x)}[\nabla_x s_\theta(x)] requires the trace of the Jacobian of s_\theta(x), i.e. one backpropagation pass per input dimension (see the sketch below). This becomes a big problem when the input space is very large (e.g. images).

  • Expensive Training
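To see where the cost comes from, here is an illustrative PyTorch sketch (the model and dimensionality are placeholders) of computing the trace of the Jacobian of a D-dimensional score model exactly: it needs one backward pass per input dimension.

```python
import torch
import torch.nn as nn

# Exact trace of the Jacobian of s_theta: one backward pass per input dimension.
D = 8                                        # placeholder; think D = 3*256*256 for images
model = nn.Sequential(nn.Linear(D, 64), nn.Tanh(), nn.Linear(64, D))

x = torch.randn(16, D, requires_grad=True)   # a batch of 16 points
s = model(x)                                 # predicted scores, shape (16, D)

trace = torch.zeros(x.shape[0])
for i in range(D):                           # D backward passes -> expensive for large D
    grad_i = torch.autograd.grad(s[:, i].sum(), x, retain_graph=True)[0]
    trace = trace + grad_i[:, i]             # collect the diagonal of the Jacobian

print(trace.mean().item())                   # average of tr(∇_x s_theta(x)) over the batch
```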

There is also another problem. Looking at the data densities: during training, the samples come from the dense regions of the data distribution, so the model only becomes good at estimating the score near the data.

But when we start sampling from a random position, it can be far away from those dense regions, where the estimated scores are unreliable:

  • Low Coverage of Data Space

Noise Perturbation

One way to tackle the problem of low coverage of the data space is to add noise to the data, \tilde{x} = x + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2 I); the data distribution p(x) then becomes the perturbed distribution p_\sigma(\tilde{x}).

  • Adding noise to the data covers more of the data space
  • Adding too much noise may make the distribution too different from the original (a trade-off)
The score matching objective is then written for the perturbed distribution:

\frac{1}{2}\mathbb{E}_{p_\sigma(\tilde{x})}[||\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) - s_\theta(\tilde{x})||^2_2] \rightarrow \frac{1}{2}\mathbb{E}_{p_\sigma(\tilde{x})}[s_\theta(\tilde{x})^2] + \mathbb{E}_{p_\sigma(\tilde{x})}[\nabla_{\tilde{x}} s_\theta(\tilde{x})]
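A tiny PyTorch illustration of the trade-off, assuming a narrow 1-D dataset near x = 3: larger \sigma spreads the perturbed samples over more of the space, but also moves p_\sigma(\tilde{x}) further from the original p(x).

```python
import torch

# Perturb a narrow 1-D dataset with Gaussian noise of increasing sigma.
torch.manual_seed(0)
x = 3.0 + 0.1 * torch.randn(10_000)              # data concentrated around x = 3

for sigma in (0.0, 0.5, 2.0):
    x_tilde = x + sigma * torch.randn_like(x)    # x~ = x + eps,  eps ~ N(0, sigma^2)
    print(sigma, x_tilde.min().item(), x_tilde.max().item(), x_tilde.std().item())
# Larger sigma -> wider coverage of the space, but a distribution further from p(x).
```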

Denoising Score Matching

Connecting Score Matching and Denoising Autoencoders.

Denoising Autoencoders were used to learn representative features of data

  • Train a regular autoencoder, but corrupt the input with noise and train the decoder to reconstruct the original, clean input
Denoising score matching replaces the unknown marginal score \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) with the conditional score \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}|x); the two objectives are equivalent up to a constant that does not depend on \theta:

\frac{1}{2}\mathbb{E}_{p_\sigma(\tilde{x})}[||\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) - s_\theta(\tilde{x})||^2_2] \rightarrow \frac{1}{2}\mathbb{E}_{x\sim p(x),\, \tilde{x} \sim p_\sigma(\tilde{x}|x)}[||s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}|x)||^2_2]

For Gaussian noise, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2}, so the regression target is available in closed form.
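A minimal 1-D sketch of this denoising objective, assuming PyTorch, data from \mathcal{N}(2, 1), a single fixed noise level \sigma, and an arbitrary small MLP; the regression target is the closed-form conditional score above, so \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) is never needed.

```python
import torch
import torch.nn as nn

# Denoising score matching in 1-D with Gaussian noise:
#   target = ∇_x~ log p_sigma(x~ | x) = -(x~ - x) / sigma^2
torch.manual_seed(0)
sigma = 0.5
data = 2.0 + torch.randn(4096, 1)                  # samples from N(2, 1)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x = data
    x_tilde = x + sigma * torch.randn_like(x)      # x~ ~ p_sigma(x~ | x)
    target = -(x_tilde - x) / sigma ** 2           # closed-form conditional score
    loss = 0.5 * ((model(x_tilde) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The model approximates ∇ log p_sigma, which for this data is roughly -(x - 2) / (1 + sigma^2).
print(model(torch.tensor([[1.0], [2.0], [3.0]])).detach().squeeze())
```

Note that no derivative of s_\theta appears in this objective, so the Jacobian cost of the implicit objective disappears as well.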

Reference

Diffusion Models From Scratch | Score-Based Generative Models Explained | Math Explained