Let $p(x)$ be the Probability Density Function (PDF) of a dataset, which we want to approximate with $p_\theta(x)$:
$$p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}$$
$Z_\theta$ is the normalizing constant that makes sure the PDF integrates to 1
$e^{-f_\theta(x)}$ matches the non-negativity property of a PDF
$f_\theta(x)$ is learned so that $p_\theta(x)$ outputs how dense the probability is around $x$: high means a dense area, low means a sparse area
Similar to the Normal Distribution $\mathcal{N}(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$: the exponential term forces the value to be non-negative, and the denominator normalizes it so the PDF integrates to 1.
Since computing $Z_\theta = \int e^{-f_\theta(x)}\, dx$ requires integrating over the entire (potentially extremely high-dimensional) space of $x$, it is intractable to compute.
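To get a feel for why, here is a minimal sketch (the energy $f_\theta(x) = x^2/2$ and the grid size are arbitrary choices for illustration): in 1-D the integral can be brute-forced on a grid, but the number of grid points needed grows exponentially with the dimension.

```python
import numpy as np

# Hypothetical 1-D energy, chosen only for illustration.
f_theta = lambda x: 0.5 * x**2

# In 1-D we can brute-force Z_theta = ∫ exp(-f_theta(x)) dx with a Riemann sum.
xs = np.linspace(-10.0, 10.0, 10_001)
dx = xs[1] - xs[0]
Z = np.exp(-f_theta(xs)).sum() * dx
print(Z)  # ≈ sqrt(2*pi) ≈ 2.5066 for this particular energy

# For d-dimensional x the same grid would need ~10_000**d evaluations,
# which is why Z_theta is intractable for high-dimensional data such as images.
```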
The workaround is to avoid $Z_\theta$ altogether by working with the Score.
Score
The score is the gradient of the log PDF with respect to $x$:
$$\nabla_x \log p_\theta(x) = \nabla_x \log \frac{e^{-f_\theta(x)}}{Z_\theta}$$
Using the logarithm rules $\log\frac{a}{b} = \log a - \log b$ and $\ln(e^x) = x$ (log and ln are used interchangeably here), this expands to $-\nabla_x f_\theta(x) - \nabla_x \log Z_\theta$.
The second term becomes zero because it does not depend on x.
This is similar to how $f(x) = x$ has $\frac{d f(x)}{dy} = 0$, since it does not depend on $y$. Therefore $\nabla_x \log Z_\theta = 0$.
$$-\nabla_x f_\theta(x) - \nabla_x \log Z_\theta = -\nabla_x f_\theta(x) - 0 = s_\theta(x)$$
We let $s_\theta(x) = -\nabla_x f_\theta(x)$ be our score model.
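As a minimal sketch (the tiny `Energy` network and its sizes are hypothetical, just to make the code runnable), the score of an energy-based model can be obtained directly from $f_\theta$ with automatic differentiation, without ever touching $Z_\theta$:

```python
import torch
import torch.nn as nn

# Hypothetical tiny energy network f_theta: R^d -> R, for illustration only.
class Energy(nn.Module):
    def __init__(self, d=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # energy per sample, shape (batch,)

def score(f_theta, x):
    """s_theta(x) = -∇_x f_theta(x), computed with autograd (no Z_theta needed)."""
    x = x.requires_grad_(True)
    energy = f_theta(x).sum()  # summing over the batch gives per-sample gradients
    grad_x = torch.autograd.grad(energy, x, create_graph=True)[0]
    return -grad_x             # shape (batch, d)

f_theta = Energy(d=2)
x = torch.randn(8, 2)
print(score(f_theta, x).shape)  # torch.Size([8, 2])
```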
Score Matching
Score matching basically means minimizing the difference between the true score $\nabla_x \log p(x)$ and the predicted score $s_\theta(x)$:
$$\frac{1}{2}\mathbb{E}_{p(x)}\left[\left\|\nabla_x \log p(x) - s_\theta(x)\right\|_2^2\right]$$
The score function $\nabla_x \log p_\theta(x)$ tells you which direction to move a data point to increase its probability, since the logarithm increases monotonically.
But the problem is that we do not know the true score $\nabla_x \log p(x)$. We need a mathematical trick.
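For intuition, here is a toy case where the true score is known analytically (a standard normal, chosen only for illustration): the score always points back toward the high-density region around the mean.

```python
# For a standard normal, log p(x) = -x**2 / 2 + const, so the true score is -x.
true_score = lambda x: -x

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x = {x:+.1f}, score = {true_score(x):+.1f}")  # sign points toward x = 0, the densest region
```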
Recalling that the expectation is $\mathbb{E}_{p(x)}[g(x)] = \int p(x)\, g(x)\, dx$, the objective can be rewritten as
$$\frac{1}{2}\int p(x)\left(\nabla_x \log p(x) - s_\theta(x)\right)^2 dx$$
Since $(a-b)^2 = a^2 + b^2 - 2ab$, the equation can be further expanded, and the cross term can then be handled with integration by parts (worked out below).
The boundary term $p(x)\, s_\theta(x)\,\big|_{x=-\infty}^{\infty}$ becomes 0, because probability densities typically decay to 0 as $x$ goes to infinity. Think of a normal distribution: at the far left and far right the density is essentially 0.
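Written out in the one-dimensional case, the expansion and the integration-by-parts step look like this:

```latex
\frac{1}{2}\int p(x)\big(\nabla_x \log p(x) - s_\theta(x)\big)^2 dx
  = \underbrace{\frac{1}{2}\int p(x)\big(\nabla_x \log p(x)\big)^2 dx}_{\text{constant w.r.t. }\theta}
  + \frac{1}{2}\int p(x)\, s_\theta(x)^2\, dx
  - \int p(x)\, \nabla_x \log p(x)\, s_\theta(x)\, dx

% cross term, using p(x)\,\nabla_x \log p(x) = \nabla_x p(x) and integration by parts:
-\int \nabla_x p(x)\, s_\theta(x)\, dx
  = -\Big[p(x)\, s_\theta(x)\Big]_{-\infty}^{\infty} + \int p(x)\, \nabla_x s_\theta(x)\, dx
  = 0 + \int p(x)\, \nabla_x s_\theta(x)\, dx
```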
Since the first term is constant with respect to the optimization (no $s_\theta(x)$ involved), it can be discarded.
$$\frac{1}{2}\int p(x)\, s_\theta(x)^2\, dx + \int p(x)\, \nabla_x s_\theta(x)\, dx$$
Rewriting it back in expectation form $\mathbb{E}_{p(x)}[g(x)] = \int p(x)\, g(x)\, dx$:
$$\frac{1}{2}\mathbb{E}_{p(x)}\left[s_\theta(x)^2\right] + \mathbb{E}_{p(x)}\left[\nabla_x s_\theta(x)\right]$$
Note that we now have an objective equivalent (up to a constant) to $\frac{1}{2}\mathbb{E}_{p(x)}\left[\left\|\nabla_x \log p(x) - s_\theta(x)\right\|_2^2\right]$, without having to know $\nabla_x \log p(x)$.
$\mathbb{E}_{p(x)}\left[\nabla_x s_\theta(x)\right]$: minimizing it pushes the derivative of the score at data points to be negative, so that data points sit at local maxima of the modeled density
$\frac{1}{2}\mathbb{E}_{p(x)}\left[s_\theta(x)^2\right]$: minimizing it means we want the predicted scores of the score model to be close to 0 at data points
The goal is to learn a model that, given some point in the data space, predicts the direction to move to get closer to the data.
However, the objective is very expensive to compute, because the term $\mathbb{E}_{p(x)}\left[\nabla_x s_\theta(x)\right]$ requires calculating the (trace of the) Jacobian, i.e. one backpropagation per input dimension. This becomes a big problem when the input space is very large (e.g. images).
Expensive Training
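A minimal sketch of this tractable objective in PyTorch (the 2-D toy `score_net` is hypothetical; real score models for images are far larger) makes the cost explicit: the exact Jacobian term needs one backward pass per input dimension.

```python
import torch
import torch.nn as nn

# Hypothetical 2-D score model s_theta: R^d -> R^d, just to make the sketch runnable.
score_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 2))

def score_matching_loss(score_fn, x):
    """1/2 * E[||s_theta(x)||^2] + E[tr(grad_x s_theta(x))], the tractable objective above."""
    x = x.requires_grad_(True)
    s = score_fn(x)                           # predicted scores, shape (batch, d)
    sq_norm = 0.5 * (s ** 2).sum(dim=1)       # first term, per sample

    trace = x.new_zeros(x.shape[0])
    for i in range(x.shape[1]):               # exact Jacobian trace: one backward pass per dimension
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0][:, i]
        trace = trace + grad_i                # second term, per sample

    return (sq_norm + trace).mean()

x = torch.randn(16, 2)                        # toy batch; for images, d is huge and the loop explodes
loss = score_matching_loss(score_net, x)
loss.backward()                               # gradients flow back into the score network
print(loss.item())
```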
There is also another problem when looking at the data densities: during training, the data points lie in the dense regions of the distribution, so the model only becomes good at learning the score there.
But when we start sampling from a random position, it can be far away from the data-dense regions, where the estimated scores are inaccurate and give poor guidance.
Low Coverage of Data Space
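To see why this matters, here is a naive sketch of "sampling" by just following the predicted score from a random starting point (practical samplers also inject noise at each step, which is omitted here): if the start lies far from the data, every step relies on score estimates the model never learned well.

```python
import torch

def follow_score(score_fn, x0, step_size=0.01, n_steps=1000):
    """Move a point along the predicted score, i.e. gradient ascent on log p_theta(x)."""
    x = x0.clone()
    for _ in range(n_steps):
        x = x + step_size * score_fn(x)  # far from the data, score_fn is poorly trained
    return x

# Hypothetical usage with a trained score model `score_net` from the sketch above:
# x0 = torch.randn(16, 2) * 10.0       # random start, likely far from the data density
# samples = follow_score(score_net, x0)
```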
Noise Perturbation
One way to tackle the problem of Low Coverage of Data Space is to add noise to the data, $\tilde{x} = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, so that the distribution $p(x)$ becomes the perturbed $p_\sigma(\tilde{x})$ (a minimal sketch follows after the notes below).
Noising the data will cover more of the data space
Adding too much noise may make the distribution too different from the original (a trade-off)
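A minimal sketch of the perturbation (the 2-D toy data and the three noise levels are arbitrary choices for illustration): larger $\sigma$ spreads the perturbed samples over more of the space, at the cost of drifting further from the original distribution.

```python
import torch

def perturb(x, sigma):
    """x_tilde = x + eps, eps ~ N(0, sigma^2 I): a sample from the smoothed distribution p_sigma."""
    return x + sigma * torch.randn_like(x)

x = 0.1 * torch.randn(1000, 2) + 1.0       # toy data concentrated around (1, 1)
for sigma in (0.1, 0.5, 2.0):
    x_tilde = perturb(x, sigma)
    # spread grows with sigma: more coverage of the space, but less faithful to the original data
    print(f"sigma = {sigma}: std per dim = {x_tilde.std(dim=0)}")
```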