Gradient Descent

Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function.

  • Gradient descent is used to find the values of a function’s parameters (coefficients) that minimize a cost function as much as possible.
  • Our goal is to find the minimum of this function using the gradient descent methodology.

The simplest and most fundamental optimization algorithm is gradient descent.

  • The gradient is the multivariate generalization of the derivative.

The basic idea - 1 parameter GD

An easy example, using only one variable (1 parameter):

Suppose we have the function

$$f(x) = 5x^2 + 2x - 1$$

We want to use gradient descent to find the minimum point. To do that, we make use of the slope, since at the minimum point the slope is 0.

First we take the derivative (the slope of the curve at a given point $x$):

$$f'(x) = \frac{df(x)}{dx} = 10x + 2$$

Then we choose an arbitrary number to represent $x$.

This “choose any arbitrary number” step is called initialization: the process in which we set the initial values of the weights.

  • If $f'(x)$ is +ve, the slope is / and we want to move left to slide toward the minimum point.
  • If $f'(x)$ is -ve, the slope is \ and we want to move right to slide toward the minimum point.

The update rule is like this:

$$x_{k+1} = x_k - \eta f'(x_k)$$

where $\eta$ is the learning rate. The learning rate lets us adjust how far we slide on each step.

Some people write it like this:

$$x = x + \Delta x$$

where $\Delta$ denotes the change.

$$\text{new weight} = \text{existing weight} - \text{learning rate} \times \text{gradient}$$

Let our initial value be $x_0 = 3$ and the learning rate $\eta = 0.06$.

The 1st iteration: $x_1 = 3 - 32 \times 0.06 = 1.08$ ($f'(x_k)$ is +ve -> Slope is / -> Point goes left)

The 2nd iteration: $x_2 = 1.08 - 12.8 \times 0.06 = 0.312$ ($f'(x_k)$ is +ve -> Slope is / -> Point goes left)

The 3rd iteration: $x_3 = 0.312 - 5.12 \times 0.06 = 0.0048$ ($f'(x_k)$ is +ve -> Slope is / -> Point goes left)

The 4th iteration: $x_4 = 0.0048 - 2.048 \times 0.06 = -0.11808$ ($f'(x_k)$ is +ve -> Slope is / -> Point goes left)

The 5th iteration: $x_5 = -0.11808 - 0.8192 \times 0.06 = -0.167232$ ($f'(x_k)$ is still +ve, since $x_4 > -0.2$ -> Slope is / -> Point goes left)

  • If $x_{k+1}$ is smaller than $x_k$, $x_{k+1}$ will be on the left side of $x_k$.
  • If $x_{k+1}$ is larger than $x_k$, $x_{k+1}$ will be on the right side of $x_k$.

The process goes on and on; eventually the value will stop updating.

  • because the first derivative of the function will be very close to 0 when we have reached the minimum.
  • Usually, we iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.
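To make that stopping criterion concrete, here is a minimal sketch for our 1-parameter example that iterates until the update itself becomes negligible (the tolerance value is my own assumption, not from the analysis above):

```python
# A minimal sketch of iterating until convergence.
learning_rate = 0.06
x = 3.0

while True:
    step = learning_rate * (10 * x + 2)  # eta * f'(x)
    x = x - step
    if abs(step) < 1e-6:                 # the update has become negligible
        break

print(x)  # ~ -0.2, where f'(x) = 0
```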

Let’s visualize it. I coded a simple one using matplotlib.

The 1st iteration:

The 3rd iteration:

We can see the point moving toward the minimum point.

The 10th iteration:

It is reaching the minimum point.

To conclude, we can start from any arbitrary point and descend to the minimum. The speed of minimization depends on the learning rate $\eta$.

```python
# Quickly hardcode a program to check whether x really stops updating
learning_rate = 0.06
x = 3

for i in range(50):
    diff = (10 * x) + 2             # f'(x) = 10x + 2
    x = x - (diff * learning_rate)

print(x)
# x gets stuck around -0.19999999999999998, i.e. the analytic minimum:
# f'(x) = 0  =>  10x + 2 = 0  =>  x = -0.2
```

How do we initialize the starting $x$?

As we mentioned, the initial $x$ is chosen arbitrarily, since it will slide during GD anyway.

Picking $x$ arbitrarily is good enough in this case, but not in all cases. An inappropriate initialization can result in a poorly optimized model. There are other ways of initializing the weights, such as Xavier initialization (to be explained in the future).

What if GD initialization gets stuck?

If Gradient Descent gets initialized in such a way that it starts at a local maximum (or a saddle point, or a local minimum) with gradient zero, then it will simply stay stuck there. Variations of GD, such as Stochastic GD and Mini-batch GD, try to work around this by adding an element of randomness to the search, but even those aren’t guaranteed to escape a zero-gradient region if the shape of the gradient is weird enough.

In practice the only way to solve this is to reinitialize your search with new weights or parameters that start in a completely new region of the search space. This isn’t hard to do: if you get stuck in such a zero-gradient area, you will notice very quickly that the training error isn’t changing at all, and you will know you need to start over.
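As a rough sketch of that restart idea (the random range and restart count are arbitrary assumptions), one can rerun GD from several random starting points and keep the best endpoint. Our example function is convex, so every restart lands in the same place; the mechanism matters for non-convex losses.

```python
import random

def f(x):
    return 5 * x**2 + 2 * x - 1

def gd(x, learning_rate=0.06, steps=100):
    # plain gradient descent on f, using f'(x) = 10x + 2
    for _ in range(steps):
        x = x - learning_rate * (10 * x + 2)
    return x

# Restart from several random initializations and keep the best endpoint.
candidates = [gd(random.uniform(-10, 10)) for _ in range(5)]
print(min(candidates, key=f))  # all restarts end near -0.2 for this convex f
```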

How about the learning rate?

How do we choose the learning rate?

Reducing Loss: Learning Rate

TLDR:

  • If you pick a learning rate that is too small, learning will take too long.
  • If you pick a learning rate that is too large, it may overshoot the minimum entirely, or oscillate and never reach the minimum.
  • Generally, we want the learning rate to be high enough to reach the closest minimum in a reasonable amount of time, but low enough that it does not oscillate around the minimum.
  • The ideal learning rate in one dimension is $\frac{1}{f''(x)}$ (the inverse of the second derivative of $f(x)$ at $x$); see the quick check after this list.
  • The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (the matrix of second partial derivatives).
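As a quick sanity check of the $\frac{1}{f''(x)}$ rule on our earlier example: $f(x) = 5x^2 + 2x - 1$ has $f''(x) = 10$ everywhere, so the ideal learning rate is $0.1$, and for a quadratic a single step lands exactly on the minimum:

```python
# Quick check: f(x) = 5x^2 + 2x - 1 has f''(x) = 10, so eta = 1/10.
eta = 1 / 10
x = 3                        # same starting point as before
x = x - eta * (10 * x + 2)   # one gradient step
print(x)                     # -0.2: one step lands exactly on the minimum
```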

With a learning rate that is large but not too large, the point could descend like this:

It will also reach the minimum point eventually.

The interactive visualizer in this image: Reducing Loss: Optimizing Learning Rate - Graph Visualizer by Google Developers

The ML approach - N parameters GD

Video : The mathematics behind Linear Regression with Gradient Descent

Now let’s move to an ML approach.

In machine learning, the problem often contains more parameters. In this example we will look at a linear model using 2-parameter GD.

$$y = w_1 x + w_0$$

  • For 2 model parameters, we only need to solve 2 simultaneous equations.
  • For n model parameters, we need to solve n simultaneous equations.
    • e.g. $y = w_2 x_2 + w_1 x_1 + w_0$ -> 3 simultaneous equations

To make it clearer, I use $m$ for $w_1$ and $b$ for $w_0$.

Let’s look at this linear model.

The formula: $\hat{y} = mx + b$

$y = \text{true value}$

$\hat{y} = \text{predicted value}$

$$\text{Loss function} = L = (y - \hat{y})^2$$

  • For simplicity, let’s assume we have only one data point here ($N = 1$).
  • So the loss function is simply $(y - \hat{y})^2$.

Further reading: Objective function, cost function, loss function: are they the same thing?

We want to minimize the loss function (find the minimum).

This brings us to gradient descent.

The Update rule:

$$\begin{array}{l} m_{i+1} = m_i - \eta \nabla_m L \\ b_{i+1} = b_i - \eta \nabla_b L \end{array}$$

Some people write it like this:

$$m = m + \Delta m$$

$$b = b + \Delta b$$

where $\Delta$ denotes the change.

So we want to know how we can change the value of $m$ in $y = mx + b$ in order to make the error smaller.

Our final goal is to solve for $m$ and $b$ by driving $\Delta m$ and $\Delta b$ to 0, just like what we did in the 1-parameter case.

Let’s define the error $E$ as $y - \hat{y}$:

$$\text{Error} = E = y - \hat{y} = y - (mx + b)$$

So now our loss function $L$ looks like:

$$L = E^2$$

Now we want to find the derivative of $L$ with respect to $b$.

The power rule is used here: $x^2 \rightarrow 2x$.

Then the chain rule comes in.

$L$ is a function of the error $E$, and $E$ is a function of $m$ and $b$, so:

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial E} \times \frac{\partial E}{\partial b}$$

$$\frac{\partial L}{\partial b} = 2E \times \frac{\partial E}{\partial b}$$

Since $E = y - mx - b$,

$$\frac{\partial E}{\partial b} = -1$$

Therefore:

$$\frac{\partial L}{\partial b} = 2E \times -1$$

Then we multiply by the learning rate $\eta$. Since we step against the gradient, $\Delta b = -\eta \frac{\partial L}{\partial b}$:

$$-\frac{\partial L}{\partial b} \times \eta = E \times \eta \times 2$$

Since we are multiplying by a constant learning rate $\eta$ anyway, the factor 2 can be absorbed into $\eta$:

$$\Delta b = -\frac{\partial L}{\partial b} \times \eta = E \times \eta$$

$$\Delta b = \eta E$$

Now we do the same for $m$: find the derivative of $L$ with respect to $m$ ($m$ is $w_1$).

The power rule is used here: $x^2 \rightarrow 2x$.

Then the chain rule comes in.

$L$ is a function of the error $E$, and $E$ is a function of $m$ and $b$, so:

$$\frac{\partial L}{\partial m} = \frac{\partial L}{\partial E} \times \frac{\partial E}{\partial m}$$

$$\frac{\partial L}{\partial m} = 2E \times \frac{\partial E}{\partial m}$$

Since $E = y - mx - b$,

$$\frac{\partial E}{\partial m} = -x$$

Therefore:

$$\frac{\partial L}{\partial m} = 2E \times -x$$

Then we multiply by the learning rate $\eta$. Again $\Delta m = -\eta \frac{\partial L}{\partial m}$:

$$-\frac{\partial L}{\partial m} \times \eta = E \times x \times \eta \times 2$$

Since we are multiplying by a constant learning rate $\eta$ anyway, the factor 2 can be absorbed into $\eta$:

$$\Delta m = -\frac{\partial L}{\partial m} \times \eta = E \times x \times \eta$$

$$\Delta m = \eta x E$$

Therefore, during training:

$$m_{k+1} = m_k + \eta x E$$

$$b_{k+1} = b_k + \eta E$$

In code representation (from https://youtu.be/L-Lsfu4ab74?t=1199)

```javascript
function gradientDescent() {
  var learning_rate = 0.05;

  for (var i = 0; i < data.length; i++) {
    var x = data[i].x;
    var y = data[i].y;
    var guess = m * x + b;                // prediction: y_hat = m*x + b
    var error = y - guess;                // E = y - y_hat
    m = m + (error * x) * learning_rate;  // m += eta * x * E
    b = b + (error) * learning_rate;      // b += eta * E
  }
}
```
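Note that this snippet updates $m$ and $b$ once per data point inside the loop, so it behaves like the stochastic, per-sample flavor of GD mentioned earlier rather than a single batch update averaged over all $N$ points.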

Now let’s look at a practical example.

Gradient Descent in Linear / Logistic regression

Linear relationship:

$$\hat{y} = ax + b$$

Cost function: mean squared error (MSE)

  • The squared difference between the observed data ($y$) and the predicted data ($ax + b$)
  • Let’s assume our data is not a single point but multiple points ($N$ points)

$$\text{Cost function} = \frac{1}{N} \sum_{i=1}^{N} (y_i - a x_i - b)^2$$

Let $J$ be the cost function.
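As a small illustration (the helper name and the data are my own placeholders), $J$ can be computed directly from its definition:

```python
def cost(a, b, xs, ys):
    # mean squared error between observed ys and predictions a*x + b
    return sum((y - a * x - b) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up points lying exactly on y = 2x + 1:
print(cost(2, 1, [0, 1, 2], [1, 3, 5]))  # 0.0, a perfect fit
```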

Problem definition:

Find $a$ and $b$ so that the cost function is minimized:

$$J(a, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - a x_i - b)^2$$

To minimize the cost function, we solve for $a$ and $b$ by considering

$$\frac{\partial J}{\partial a} = 0, \quad \frac{\partial J}{\partial b} = 0$$

In other words, we look for the point where the slope is 0; that is where the cost function attains its minimum.

  • For 2 model parameters, we only need to solve 2 simultaneous equations.
  • For n model parameters, we need to solve n simultaneous equations.
    • e.g. $y = w_2 x_2 + w_1 x_1 + w_0$ -> 3 simultaneous equations

First, look at $b$ ($b$ is $w_0$).

How do we take a derivative through the summation operator?

It is a finite sum, so we just differentiate each term and sum up.

The idea is like this:

$$\frac{d}{dt} \sum_{n=0}^{k-1} f_n(t) = \frac{d}{dt}\left(f_0(t) + \cdots + f_{k-1}(t)\right) = \frac{df_0}{dt} + \cdots + \frac{df_{k-1}}{dt} = \sum_{n=0}^{k-1} \frac{df_n}{dt}$$

$$\sum \left( f_n(t) + g_n(t) \right) = \sum f_n(t) + \sum g_n(t)$$

Therefore we need not worry about the summation operator; we just do the differentiation first.

$$\frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \left( 2 \left( y_i - a x_i - b \right) \times (-1) \right)$$

To solve for $b$:

$$\frac{2}{N} \sum_{i=1}^{N} \left( \left( y_i - a x_i - b \right) \times (-1) \right) = 0$$

$$\sum_{i=1}^{N} (-y_i + a x_i + b) = -\sum_{i=1}^{N} y_i + a \sum_{i=1}^{N} x_i + N b = 0$$

$$b = \frac{1}{N} \left( \sum_{i=1}^{N} y_i - a \sum_{i=1}^{N} x_i \right)$$

Then multiply by the learning rate $\eta$ (with $\Delta b = -\eta \frac{\partial J}{\partial b}$, as in the update rule at the end of this section):

$$\Delta b = -\frac{\partial J}{\partial b} \times \eta = \eta \times \frac{2}{N} \sum_{i=1}^{N} \left( y_i - a x_i - b \right)$$

Then look at $a$ ($a$ is $w_1$).

$$\frac{\partial J}{\partial a} = \frac{1}{N} \sum_{i=1}^{N} \left( 2 \left( y_i - a x_i - b \right) \times (-x_i) \right)$$

To solve for $a$:

$$\frac{2}{N} \sum_{i=1}^{N} \left( \left( y_i - a x_i - b \right) \times (-x_i) \right) = 0$$

$$\sum_{i=1}^{N} (y_i x_i - a x_i^2 - b x_i) = \sum_{i=1}^{N} y_i x_i - a \sum_{i=1}^{N} x_i^2 - b \sum_{i=1}^{N} x_i = 0$$

$$a \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} y_i x_i - b \sum_{i=1}^{N} x_i$$

Since $b = \frac{1}{N} \left( \sum_{i=1}^{N} y_i - a \sum_{i=1}^{N} x_i \right)$, substitute it into the formula:

$$a \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} y_i x_i - \left( \frac{1}{N} \left( \sum_{i=1}^{N} y_i - a \sum_{i=1}^{N} x_i \right) \right) \sum_{i=1}^{N} x_i$$

$$a \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} y_i x_i - \frac{1}{N} \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i + \frac{1}{N} a \left( \sum_{i=1}^{N} x_i \right)^2$$

$$a \left( \sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2 \right) = \sum_{i=1}^{N} y_i x_i - \frac{1}{N} \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i$$

$$a = \frac{\sum_{i=1}^{N} y_i x_i - \frac{1}{N} \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2}$$

Multiplying numerator and denominator by $N$:

$$a = \frac{N \sum_{i=1}^{N} y_i x_i - \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i}{N \sum_{i=1}^{N} x_i^2 - \left( \sum_{i=1}^{N} x_i \right)^2}$$
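Before returning to the gradient-descent updates, here is a small sketch (the data points are made up) that sanity-checks these closed-form formulas for $a$ and $b$:

```python
# Closed-form least-squares fit using the formulas derived above (made-up data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 2x + 1

N = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

a = (N * sum_xy - sum_y * sum_x) / (N * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / N

print(a, b)  # a = 1.94, b = 1.15 for this data
```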

Then multiply by the learning rate $\eta$ (with $\Delta a = -\eta \frac{\partial J}{\partial a}$):

$$\Delta a = -\frac{\partial J}{\partial a} \times \eta = \eta \times \frac{2}{N} \sum_{i=1}^{N} \left( y_i - a x_i - b \right) x_i$$

Therefore, during the training stage:

$$\begin{array}{c} a_{k+1} = a_k + \Delta a \\ b_{k+1} = b_k + \Delta b \end{array}$$

where

$$\begin{array}{l} \Delta a = -\eta \frac{\partial J}{\partial a} \\ \Delta b = -\eta \frac{\partial J}{\partial b} \end{array}$$
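Putting the training-stage updates together, here is a minimal batch GD sketch (the learning rate, iteration count, and data are my own assumptions) built from the $\Delta a$ and $\Delta b$ expressions above; it converges to the same values as the closed-form solution:

```python
# Batch gradient descent for y_hat = a*x + b using the derived gradients.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

a, b = 0.0, 0.0   # arbitrary initialization
eta = 0.01        # learning rate (an assumption)
N = len(xs)

for _ in range(10000):
    errors = [y - (a * x + b) for x, y in zip(xs, ys)]
    delta_a = eta * (2 / N) * sum(e * x for e, x in zip(errors, xs))  # -eta * dJ/da
    delta_b = eta * (2 / N) * sum(errors)                             # -eta * dJ/db
    a, b = a + delta_a, b + delta_b

print(a, b)  # approaches the closed-form a = 1.94, b = 1.15
```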