Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. In machine learning, it is used to find the values of a function's parameters (coefficients) that minimize a cost function. It is the simplest and most fundamental optimization algorithm, and the gradient is the multivariate generalization of the derivative. Our goal is to find the minimum of a function using the gradient descent methodology.
The basic idea - 1 parameter GD
An easy example, with only 1 variable (1 parameter):
Suppose we have the function
$f(x) = 5x^2 + 2x - 1$
We want to use gradient descent to find the minimum point. To find it, we make use of the slope, since at the minimum point the slope is 0.
First we take the first derivative (the slope of the curve at a given point $x$):
$f'(x) = \frac{df(x)}{dx} = 10x + 2$
Then we choose an arbitrary number to represent $x$.
This "choose an arbitrary number" step is called initialization: the process in which we set the initial values of the weights.
If $f'(x)$ is +ve, the slope looks like / and we want to move left to slide toward the minimum point.
If $f'(x)$ is -ve, the slope looks like \ and we want to move right to slide toward the minimum point.
The update rule is:
$x_{k+1} = x_k - \eta f'(x_k)$
where $\eta$ is the learning rate. The learning rate allows us to adjust how far we slide on each step.
Let our initial value be $x_0 = 3$ and the learning rate $\eta = 0.06$.
The 1st iteration: $x_1 = 3 - 32 \times 0.06 = 1.08$ ($f'(x_k)$ is +ve -> Slope is / -> Point goes left)
The 2nd iteration: $x_2 = 1.08 - 12.8 \times 0.06 = 0.312$ ($f'(x_k)$ is +ve -> Slope is / -> Point goes left)
The 3rd iteration: $x_3 = 0.312 - 5.12 \times 0.06 = 0.0048$ ($f'(x_k)$ is +ve -> Slope is / -> Point goes left)
The 4th iteration: $x_4 = 0.0048 - 2.048 \times 0.06 = -0.11808$ ($f'(x_k)$ is +ve -> Slope is / -> Point goes left)
The 5th iteration: $x_5 = -0.11808 - 0.8192 \times 0.06 = -0.167232$ ($f'(x_k)$ is still +ve -> Slope is / -> Point goes left)
…
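The iterations above can be reproduced with a few lines of Python (a quick sketch using the function and learning rate from this example):

```python
# Reproduce the first few gradient descent iterations for f(x) = 5x^2 + 2x - 1
def f_prime(x):
    return 10 * x + 2  # derivative of f

eta = 0.06  # learning rate
x = 3       # initial value x0

for k in range(1, 6):
    x = x - eta * f_prime(x)
    print(f"Iteration {k}: x = {x}")
```

This prints the same values as the hand calculation (up to floating-point rounding): 1.08, 0.312, 0.0048, -0.11808, -0.167232.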
If $x_{k+1}$ is smaller than $x_k$, $x_{k+1}$ will be to the left of $x_k$.
If $x_{k+1}$ is larger than $x_k$, $x_{k+1}$ will be to the right of $x_k$.
The process goes on and on, and eventually the value will stop updating, because the first derivative of the function will be very close to 0 once we have reached the minimum.
Usually, we iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.
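In code, a common way to detect convergence is to stop when the update becomes smaller than some tolerance. A minimal sketch (the tolerance value here is an arbitrary choice):

```python
# Iterate until the change in x is negligible, i.e. the model has converged
def f_prime(x):
    return 10 * x + 2  # derivative of f(x) = 5x^2 + 2x - 1

eta = 0.06
x = 3.0
tolerance = 1e-8

while True:
    x_new = x - eta * f_prime(x)
    if abs(x_new - x) < tolerance:
        break
    x = x_new

print(x)  # converges near -0.2, where f'(x) = 0
```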
Let’s visualize it. I coded a simple visualization using matplotlib.
The 1st iteration:
The 3rd iteration:
We can see the point moving toward the minimum point.
The 10th iteration:
It is reaching the minimum point.
To conclude, we can start from any arbitrary point and descend to the minimum. The speed of minimization depends on the learning rate $\eta$.
```python
# Just quickly hardcode a program to check whether it really stops updating
learning_rate = 0.06
x = 3
for i in range(50):
    diff = (10 * x) + 2
    x = x - (diff * learning_rate)

print(x)  # x gets stuck at around -0.19999999999999998
```
How do we initialize the starting x?
As we mentioned, the initial x is chosen arbitrarily, since it will slide during GD.
Picking x arbitrarily is good enough in this case, but not in all cases. An inappropriate initialization can result in a poorly optimized model. There are other ways of initializing the weights, such as Xavier initialization (to be explained in the future).
If Gradient Descent gets initialized in such a way that it starts at a local maximum (or a saddle point, or a local minimum) with gradient zero, then it will simply stay stuck there. Variations of GD, such as Stochastic GD and Mini-batch GD try to work around this by adding an element of randomness to the search, but even those aren’t guaranteed to escape a zero gradient region if the shape of the gradient is weird enough.
In practice the only way to solve this is to reinitialize your search with new weights or parameters that start in a completely new region of the search space. This won’t be hard to do, since if you do get stuck in such a zero-gradient area you will notice very quickly that the error in your training isn’t changing at all, and you would know that you need to start over.
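To see the zero-gradient problem concretely, here is a small sketch. The function $f(x) = x^4 - 2x^2$ is my own example, not from the text above: it has a local maximum at $x = 0$ where $f'(0) = 0$, so gradient descent initialized exactly there never moves, while a reinitialized start escapes to a minimum.

```python
# f(x) = x^4 - 2x^2 has a local maximum at x = 0 (the gradient is exactly 0 there)
# and two minima at x = -1 and x = 1.
def f_prime(x):
    return 4 * x**3 - 4 * x

def gradient_descent(x, eta=0.05, steps=200):
    for _ in range(steps):
        x = x - eta * f_prime(x)
    return x

print(gradient_descent(0.0))   # stays stuck at 0.0: the gradient there is zero
print(gradient_descent(0.5))   # a reinitialized start descends toward the minimum at x = 1
```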
If you pick a learning rate that is too small, learning will take too long.
If you pick a learning rate that is too large, it may totally overshoot the minimum, or oscillate and never reach the minimum.
Generally, we want the learning rate to be high enough so we can reach the closest minimum in a rational amount of time, but low enough so it will not oscillate around the minimum.
The ideal learning rate in one dimension is $\frac{1}{f''(x)}$ (the inverse of the second derivative of $f(x)$ at $x$).
The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (matrix of second partial derivatives).
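For our example function this is easy to check: $f''(x) = 10$ everywhere, so the ideal learning rate is $\eta = 1/10 = 0.1$, and with it gradient descent lands on the minimum in a single step (a quick sketch):

```python
# With eta = 1/f''(x) = 0.1, one step reaches the minimum of f(x) = 5x^2 + 2x - 1
def f_prime(x):
    return 10 * x + 2

eta = 0.1   # the ideal learning rate: inverse of f''(x) = 10
x0 = 3.0    # any starting point works
x1 = x0 - eta * f_prime(x0)
print(x1)   # approximately -0.2, the minimum (where f'(-0.2) = 0)
```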
If the learning rate is large but not too large, the point can descend like this:
In machine learning, the problem often contains more parameters. In this example we will take a look at a linear model using 2-parameter GD.
$y = w_1 x + w_0$
For 2 model parameters, we only need to solve 2 simultaneous equations.
For n model parameters, we need to solve n simultaneous equations.
e.g. $y = w_2 x_2 + w_1 x_1 + w_0$ -> 3 simultaneous equations
To make it clearer, I use $m$ for $w_1$ and $b$ for $w_0$.
Let's look at this linear model.
The formula: $\hat{y} = mx + b$
$y$ = true value
$\hat{y}$ = predicted value
Loss function: $L = (y - \hat{y})^2$
For simplicity, let's assume we only have 1 data point here (number of data points $N = 1$).
So the loss function will simply equal $(y - \hat{y})^2$.
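For this single-point loss, the chain rule gives $\frac{\partial L}{\partial m} = -2x(y - \hat{y})$ and $\frac{\partial L}{\partial b} = -2(y - \hat{y})$. A quick sketch checking these analytic gradients against a finite-difference approximation (the sample values of $x$, $y$, $m$, $b$ are arbitrary):

```python
# Check the analytic gradients of L = (y - (m*x + b))**2 numerically
x, y = 2.0, 5.0    # one arbitrary data point
m, b = 0.3, -0.1   # arbitrary current parameters

def loss(m, b):
    return (y - (m * x + b)) ** 2

# Analytic gradients via the chain rule
error = y - (m * x + b)
dL_dm = -2 * x * error
dL_db = -2 * error

# Central finite-difference approximations
h = 1e-6
num_dm = (loss(m + h, b) - loss(m - h, b)) / (2 * h)
num_db = (loss(m, b + h) - loss(m, b - h)) / (2 * h)

print(abs(dL_dm - num_dm) < 1e-4)  # True
print(abs(dL_db - num_db) < 1e-4)  # True
```

Note that the gradient descent code that follows updates $m$ by $+\,\text{error} \cdot x \cdot \eta$, i.e. it steps against the gradient and absorbs the constant factor 2 into the learning rate.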
```javascript
function gradientDescent() {
  var learning_rate = 0.05;
  for (var i = 0; i < data.length; i++) {
    var x = data[i].x;
    var y = data[i].y;
    var guess = m * x + b;
    var error = y - guess;
    m = m + (error * x) * learning_rate;
    b = b + (error) * learning_rate;
  }
}
```
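A self-contained Python version of the same idea, for comparison. The data here is a made-up example following $y = 2x + 1$; running several passes over the data lets $m$ and $b$ converge:

```python
# 2-parameter gradient descent on made-up data generated from y = 2x + 1
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

m, b = 0.0, 0.0
learning_rate = 0.05

for epoch in range(2000):       # many passes over the data
    for x, y in data:
        guess = m * x + b
        error = y - guess
        m = m + (error * x) * learning_rate
        b = b + error * learning_rate

print(m, b)  # approaches m = 2, b = 1
```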
Now let's look at a practical example.
Gradient Descent in Linear / Logistic regression
Linear relationship:
$\hat{y} = ax + b$
Cost function: mean squared error (MSE)
The squared difference between the observed data ($y$) and the predicted data ($ax + b$)
Let's assume our data is not just 1 point but multiple points ($N$ points).
$\text{Cost function} = \frac{1}{N}\sum_{i=1}^{N}(y_i - ax_i - b)^2$
Let J be the Cost function.
Problem definition:
Find a and b so that Cost function is minimized
$J(a, b) = \frac{1}{N}\sum_{i=1}^{N}(y_i - ax_i - b)^2$
To minimize the cost function, we solve for a and b by considering
$\frac{\partial J}{\partial a} = 0, \quad \frac{\partial J}{\partial b} = 0$
In other words, we look for a point where the slope is 0. That minimum is where the cost function converges.
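As a sanity check, here is a sketch that minimizes $J(a, b)$ by full-batch gradient descent on a few made-up points drawn from $y = 3x + 2$, and confirms the result matches the slope and intercept of the data:

```python
# Minimize J(a, b) = (1/N) * sum((y_i - a*x_i - b)^2) with full-batch gradient descent
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]   # made-up data from y = 3x + 2
N = len(xs)

a, b = 0.0, 0.0
eta = 0.02

for _ in range(20000):
    # Partial derivatives of J with respect to a and b
    dJ_da = (-2.0 / N) * sum(x * (y - a * x - b) for x, y in zip(xs, ys))
    dJ_db = (-2.0 / N) * sum((y - a * x - b) for x, y in zip(xs, ys))
    a -= eta * dJ_da
    b -= eta * dJ_db

print(a, b)  # approaches a = 3, b = 2
```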
First, look at $b$:
($b$ is $w_0$)
How do we take a derivative with a summation operator?
It is a finite sum: just differentiate each term and sum up!
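Applied to the cost function, differentiating term by term gives (a sketch of the step being set up here):

$$\frac{\partial J}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial}{\partial b}(y_i - ax_i - b)^2 = \frac{1}{N}\sum_{i=1}^{N}(-2)(y_i - ax_i - b) = -\frac{2}{N}\sum_{i=1}^{N}(y_i - ax_i - b)$$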