Understanding ML - Gradient Descent from a Math Perspective
Gradient Descent
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function.
- Gradient descent is used to find the values of a function's parameters (coefficients) that minimize a cost function.
- Our goal is to find the minimum of this function using the gradient descent methodology.
Gradient descent is the simplest and most fundamental optimization algorithm.
- The gradient is the multivariate generalization of the derivative concept.
The basic idea - 1 parameter GD
An easy example, using only one variable (1 parameter):
Suppose we have a function $f(x)$.
We want to use gradient descent to find its minimum point. To find the minimum point, we need to make use of the slope, since the minimum point is exactly where slope = 0.
First we take the derivative $f'(x)$ (which gives the slope of the curve at a given point).
Then we choose an arbitrary number as the starting value of $x$.
This “choose any arbitrary number” step is called Initialization: the process in which we set the initial values of the weights.
- If $f'(x)$ is +ve, it means the slope is `/`, and we want to move left to slide to the minimum point.
- If $f'(x)$ is -ve, it means the slope is `\`, and we want to move right to slide to the minimum point.
The update rule is like this:

$$x_{new} = x_{old} - \alpha \cdot f'(x_{old})$$

where $\alpha$ is the learning rate. The learning rate allows us to adjust how far we slide on each step.

Some people will notate it as:

$$\Delta x = -\alpha \cdot f'(x)$$

where $\Delta$ means change. In words:

$$\text{new weight} = \text{existing weight} - \text{learning rate} \times \text{gradient}$$
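As a quick worked example (the function, starting point, and learning rate here are my own illustrative choices, not necessarily the post's original numbers), take $f(x) = x^2$, so $f'(x) = 2x$, with $x_0 = 3$ and $\alpha = 0.1$:

$$x_1 = x_0 - \alpha \cdot f'(x_0) = 3 - 0.1 \times 6 = 2.4$$
$$x_2 = x_1 - \alpha \cdot f'(x_1) = 2.4 - 0.1 \times 4.8 = 1.92$$

Each step moves $x$ a little closer to the minimizer $x = 0$.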
Let our initial $x$ value be $x_0$, with learning rate $\alpha$.
The 1st iteration: ($f'(x)$ is +ve -> slope is `/` -> the point goes left)
The 2nd iteration: ($f'(x)$ is +ve -> slope is `/` -> the point goes left)
The 3rd iteration: ($f'(x)$ is +ve -> slope is `/` -> the point goes left)
The 4th iteration: ($f'(x)$ is +ve -> slope is `/` -> the point goes left)
The 5th iteration: ($f'(x)$ is -ve -> slope is `\` -> the point goes right)
…
- If $x_{new}$ is smaller than $x_{old}$, $x_{new}$ will be on the left side of $x_{old}$.
- If $x_{new}$ is larger than $x_{old}$, $x_{new}$ will be on the right side of $x_{old}$.
The process goes on and on; eventually the $x$ value will stop updating.
- This is because the first derivative of the function will be very close to 0 when we have reached the minimum.
- Usually, we iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.
Let’s visualize it. I coded a simple one using matplotlib.
(Plots of the 1st, 3rd, and 10th iterations: we can see the point running towards the minimum point, and by the 10th iteration it is reaching the minimum point.)
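A minimal sketch of what that matplotlib visualization could look like ($f(x) = x^2$, the starting point, and the learning rate are my own illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative assumptions: f(x) = x^2, starting at x = 3, lr = 0.1
f = lambda x: x ** 2
df = lambda x: 2 * x

x, lr = 3.0, 0.1
trajectory = [x]
for _ in range(10):
    x -= lr * df(x)        # the gradient descent update rule
    trajectory.append(x)

xs = np.linspace(-3.5, 3.5, 200)
plt.plot(xs, f(xs), label="f(x) = x^2")
plt.scatter(trajectory, [f(p) for p in trajectory], color="red", label="iterates")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()
```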
To conclude, we can start from any arbitrary point and descend to the minimum, and the speed of minimization depends on the learning rate $\alpha$.
Just to quickly hardcode a program and check whether it really stops updating (a minimal sketch; $f(x) = x^2$ and the numbers below are illustrative assumptions):
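```python
# A quick check that x really stops updating once the gradient is ~0.
# f(x) = x^2, x0 = 3, and lr = 0.1 are illustrative assumptions.
df = lambda x: 2 * x          # derivative of f(x) = x^2

x, lr = 3.0, 0.1
for i in range(200):
    step = lr * df(x)         # how far this iteration moves x
    x -= step
    if abs(step) < 1e-8:      # derivative ~ 0 -> x has stopped updating
        print(f"stopped updating at iteration {i}: x = {x:.10f}")
        break
```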
How do we initialize the starting $x$?
As we mentioned, the initial $x$ is chosen arbitrarily, as it will slide toward the minimum during GD.
Picking it arbitrarily is good enough in this case, but not for all cases. An inappropriate initialization can result in an unoptimized model. There are other ways of initializing the weights, such as Xavier Initialization (to be explained in the future).
If Gradient Descent gets initialized in such a way that it starts at a local maximum (or a saddle point, or a local minimum) with gradient zero, then it will simply stay stuck there. Variations of GD, such as Stochastic GD and Mini-batch GD try to work around this by adding an element of randomness to the search, but even those aren’t guaranteed to escape a zero gradient region if the shape of the gradient is weird enough.
In practice the only way to solve this is to reinitialize your search with new weights or parameters that start in a completely new region of the search space. This won’t be hard to do, since if you do get stuck in such a zero-gradient area you will notice very quickly that the error in your training isn’t changing at all, and you would know that you need to start over.
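To see this stuck-at-zero-gradient behaviour concretely, here is a tiny toy example (my own illustrative function, not from the post): $f(x) = x^3 - 3x$ has a local maximum at $x = -1$, where $f'(-1) = 0$, so plain GD initialized exactly there never moves.

```python
# f(x) = x^3 - 3x has f'(x) = 3x^2 - 3, which is exactly 0 at x = -1
# (a local maximum). Plain gradient descent initialized there is stuck.
df = lambda x: 3 * x ** 2 - 3

x, lr = -1.0, 0.1
for _ in range(10):
    x -= lr * df(x)   # df(-1) == 0, so the update is always 0
print(x)              # still -1.0
```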
How about the learning rate?
How to choose the learning rate?
TLDR:
- If you pick a learning rate that is too small, learning will take too long.
- If you pick a learning rate that is too large, it may totally overshoot the minimum, or oscillate forever and never reach the minimum.
- Generally, we want the learning rate to be high enough that we can reach the closest minimum in a reasonable amount of time, but low enough that it will not oscillate around the minimum.
- The ideal learning rate in one dimension is $\frac{1}{f''(x)}$ (the inverse of the second derivative of $f(x)$ at $x$).
- The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (the matrix of second partial derivatives).
If you have a large, but not too large, learning rate, the point could descend like this:
(figure: the point bounces from side to side across the valley while descending)
It will also reach the minimum point eventually.
The interactive visualizer for this image: Reducing Loss: Optimizing Learning Rate - Graph Visualizer by Google Developers
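A quick sketch of these three regimes (using $f(x) = x^2$ as an illustrative assumption; note $f''(x) = 2$, so the "ideal" one-dimensional rate $1/f''(x) = 0.5$ reaches the minimum in a single step):

```python
# Effect of the learning rate on f(x) = x^2, whose derivative is 2x.
def descend(lr, x=3.0, steps=20):
    for _ in range(steps):
        x -= lr * 2 * x   # x_new = x - lr * f'(x)
    return x

print(descend(0.01))   # too small: still far from 0 after 20 steps
print(descend(0.5))    # ideal here (1 / f''(x)): jumps straight to 0.0
print(descend(1.1))    # too large: overshoots, oscillates, and diverges
```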
The ML approach - N parameters GD
Video : The mathematics behind Linear Regression with Gradient Descent
Now let's move to the ML approach.
In Machine Learning, the problem often contains more parameters. In this example we will take a look at a linear model using 2-parameter GD.
- For 2 model parameters, we only need to solve 2 simultaneous equations.
- For n model parameters, we need to solve n simultaneous equations.
- e.g. 3 model parameters -> 3 simultaneous equations
To make it clearer to see, I use $m$ for the slope and $b$ for the intercept.
Let's look at this linear model, with the formula:
$$y = mx + b$$
- For simplicity, let's assume we only have 1 data point here (data point $(x, y)$).
- So the Loss function will simply be equal to $$L = \text{Error}^2$$
Further reading : Objective function, cost function, loss function: are they the same thing?
We want to minimize the loss function (find the minimum).
This brings us to Gradient Descent.
The update rule:
$$m_{new} = m - \alpha \frac{\partial L}{\partial m}, \qquad b_{new} = b - \alpha \frac{\partial L}{\partial b}$$
Some people will notate it as:
$$\Delta m = -\alpha \frac{\partial L}{\partial m}, \qquad \Delta b = -\alpha \frac{\partial L}{\partial b}$$
where $\Delta$ means change.
So I want to know in what way I could change the values of $m$ and $b$ in order to make the error less.
Our final goal is to solve for $m$ and $b$ by considering $\frac{\partial L}{\partial m}$ and $\frac{\partial L}{\partial b}$, just like what we did in the 1-parameter example.
Let's substitute $\text{Error} = y - (mx + b)$ into $L$.
So now our Loss function looks like:
$$L = \text{Error}^2 = \big(y - (mx + b)\big)^2$$
Now we want to find the derivative of $L$ with respect to $m$.
The Power rule is used here:
$$\frac{\partial L}{\partial \text{Error}} = 2 \cdot \text{Error}$$
Then the Chain rule occurs, since $L$ is a function of $\text{Error}$, and $\text{Error}$ is a function of $m$ and $b$:
$$\frac{\partial L}{\partial m} = \frac{\partial L}{\partial \text{Error}} \cdot \frac{\partial \text{Error}}{\partial m}$$
Since
$$\frac{\partial \text{Error}}{\partial m} = -x$$
Therefore:
$$\frac{\partial L}{\partial m} = -2 \cdot \text{Error} \cdot x$$
Then multiply with a learning rate $\alpha$:
$$\Delta m = -\alpha \cdot \frac{\partial L}{\partial m} = 2\alpha \cdot \text{Error} \cdot x$$
Since we are multiplying by a constant learning rate $\alpha$, it is pointless to keep the 2 here (it can be absorbed into $\alpha$):
$$\Delta m = \alpha \cdot \text{Error} \cdot x$$
Now we do the same on $b$: find the derivative of $L$ with respect to $b$.
The Power rule is used here:
$$\frac{\partial L}{\partial \text{Error}} = 2 \cdot \text{Error}$$
Then the Chain rule occurs, since $L$ is a function of $\text{Error}$, and $\text{Error}$ is a function of $m$ and $b$:
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \text{Error}} \cdot \frac{\partial \text{Error}}{\partial b}$$
Since
$$\frac{\partial \text{Error}}{\partial b} = -1$$
Therefore:
$$\frac{\partial L}{\partial b} = -2 \cdot \text{Error}$$
Then multiply with a learning rate $\alpha$:
$$\Delta b = -\alpha \cdot \frac{\partial L}{\partial b} = 2\alpha \cdot \text{Error}$$
Since we are multiplying by a constant learning rate $\alpha$, it is pointless to keep the 2 here either:
$$\Delta b = \alpha \cdot \text{Error}$$
Therefore, during the training:
$$m_{new} = m + \alpha \cdot \text{Error} \cdot x$$
$$b_{new} = b + \alpha \cdot \text{Error}$$
In code representation (from https://youtu.be/L-Lsfu4ab74?t=1199); a sketch reconstructed from the video, where `data`, `m`, `b`, and `learning_rate` are defined elsewhere in that sketch:

```javascript
// Sketch reconstructed from the linked video.
function gradientDescent() {
  for (let i = 0; i < data.length; i++) {
    const x = data[i].x;
    const y = data[i].y;
    const guess = m * x + b;            // prediction with current parameters
    const error = y - guess;            // Error = y - (mx + b)
    m = m + error * x * learning_rate;  // m += alpha * Error * x
    b = b + error * learning_rate;      // b += alpha * Error
  }
}
```
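The algebra above is easy to get backwards, so here is a small numeric sanity check (my own addition, not from the video): it compares the analytic gradients $-2 \cdot \text{Error} \cdot x$ and $-2 \cdot \text{Error}$ against finite differences at an arbitrary test point.

```python
# Check dL/dm = -2*E*x and dL/db = -2*E for L = (y - (m*x + b))**2.
# The test-point values are arbitrary illustrative choices.
x, y, m, b, h = 2.0, 5.0, 0.3, 0.1, 1e-6

L = lambda m, b: (y - (m * x + b)) ** 2
E = y - (m * x + b)

print((L(m + h, b) - L(m - h, b)) / (2 * h), -2 * E * x)  # both ~ -17.2
print((L(m, b + h) - L(m, b - h)) / (2 * h), -2 * E)      # both ~ -8.6
```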
Now let's look at a practical example.
Gradient Descent in Linear / Logistic regression
Linear relationship:
$$y = ax + b$$
Cost function: mean squared error (MSE)
- the squared difference between the observed data ($y$) and the predicted data ($ax + b$)
- let's assume our data is not 1 point only, but multiple points ($N$ points)
Let $J$ be the Cost function:
$$J(a, b) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - (a x_i + b)\big)^2$$
Problem definition:
Find $a$ and $b$ so that the Cost function is minimized.
To minimize the cost function, we solve for $a$ and $b$ by considering $\frac{\partial J}{\partial a}$ and $\frac{\partial J}{\partial b}$.
In other words, we find the point where slope = 0. That minimum is where the cost function converges.
- For 2 model parameters, we only need to solve 2 simultaneous equations.
- For n model parameters, we need to solve n simultaneous equations.
- e.g. 3 model parameters -> 3 simultaneous equations
First look at $\frac{\partial J}{\partial b}$:
($b$ is the intercept.)
How do we take a derivative with a summation operator?
It is a finite sum: just differentiate each term and sum up!
The idea is like this:
$$\frac{d}{db} \sum_{i=1}^{N} f_i(b) = \sum_{i=1}^{N} \frac{d}{db} f_i(b)$$
Therefore we need not worry about the summation operator; just do the differentiation first.
To solve for $\frac{\partial J}{\partial b}$:
$$\frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} 2\big(y_i - (a x_i + b)\big) \cdot (-1) = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (a x_i + b)\big)$$
Then multiply with a learning rate $\alpha$:
$$b_{new} = b - \alpha \frac{\partial J}{\partial b} = b + \alpha \frac{2}{N} \sum_{i=1}^{N} \big(y_i - (a x_i + b)\big)$$
Then look at $\frac{\partial J}{\partial a}$:
($a$ is the slope.)
To solve for $\frac{\partial J}{\partial a}$:
$$\frac{\partial J}{\partial a} = \frac{1}{N} \sum_{i=1}^{N} 2\big(y_i - (a x_i + b)\big) \cdot (-x_i) = -\frac{2}{N} \sum_{i=1}^{N} x_i \big(y_i - (a x_i + b)\big)$$
Since
$$E_i = y_i - (a x_i + b)$$
substitute it into the formula:
$$\frac{\partial J}{\partial a} = -\frac{2}{N} \sum_{i=1}^{N} x_i E_i$$
Then multiply with a learning rate $\alpha$:
$$a_{new} = a - \alpha \frac{\partial J}{\partial a} = a + \alpha \frac{2}{N} \sum_{i=1}^{N} x_i E_i$$
Therefore, during the training stage:
$$a_{new} = a + \alpha \frac{2}{N} \sum_{i=1}^{N} x_i E_i$$
$$b_{new} = b + \alpha \frac{2}{N} \sum_{i=1}^{N} E_i$$
where
$$E_i = y_i - (a x_i + b)$$
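Putting the two update rules together, here is a compact sketch of the whole training loop on $N$ points (the synthetic data and hyperparameters are my own illustrative choices):

```python
import numpy as np

# Synthetic data roughly on the line y = 2x + 1 (illustrative choice).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)

a, b, lr, N = 0.0, 0.0, 0.01, len(x)
for _ in range(2000):
    E = y - (a * x + b)                 # per-point errors E_i
    a += lr * (2 / N) * np.sum(x * E)   # a_new = a - lr * dJ/da
    b += lr * (2 / N) * np.sum(E)       # b_new = b - lr * dJ/db
print(a, b)                             # should approach (2, 1)
```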