Understanding ML - the Prerequisites and Math Notations
Prerequisites - Math Notations
Inspired by: MachineLearningMastery: Basics of Mathematical Notation for Machine Learning
Motivation:
When I was trying to learn about Machine Learning, I found that the hardest part was understanding the math notation. I have a weak math background, and I often got confused by it.
But we cannot avoid mathematical notation when reading descriptions of machine learning methods, so I decided to write this article to help myself.
The Greek letters
Wiki Page: Greek letters used in mathematics, science, and engineering
Common notations:
- μ: mean
- σ: standard deviation
Common conventions of letter case usage
- Areas and volumes often use capital letters, while lengths often use small letters
- Bold letters are usually used to mean vectors, though some writers do not use bold for vectors, and others mark them in a different way
- In linear algebra, a matrix is usually a capital letter
- In calculus, small letters are usually used for functions, with a capital letter often reserved for a function which is the integral of some other function
- In geometry, capital letters are usually used for points, small english letters for lines and small greek letters for planes (including π)
- π is usually reserved for the number π ≈ 3.14159 in places where that number is likely to appear. In other places, it is used to mean a probability or a profit or a projection or a permutation.
- small e is not a common choice for a letter in a scientific formula, because it can be used to mean the number e ≈ 2.71828. However, in ordinary algebra it can be used for a coefficient, especially if you have already used a, b, c, d
Another point about good writing is that in one piece of writing, you shouldn’t change your notation mid-way. This is very frustrating to the reader! (Imagine if an author changed the name of their main character partway through a novel!)
The final point, which is actually a proper rule, is that capital/small/bold/curly letters are not interchangeable. If a capital letter is used for something in a piece of writing, then you can’t use the small letter elsewhere to mean the same thing. In short, maths is “case sensitive”.
I was once frustrated when a professor changed notation mid-way; he seemed to be using two different sets of materials for his lectures.
Sequence Operations
Some trivial stuff, but I spent half a semester in year 1 to understand it well.
Sum over a sequence

$\sum_{i=1}^{n} a_i$

This is the sum of the sequence a from element 1 to element n.
Multiplication over a sequence

$\prod_{i=1}^{n} a_i$

This is the product of the sequence a from element 1 to element n.
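As a quick sanity check, here is a minimal Python sketch of both operations (the sequence a is just made-up sample data):

```python
import math

a = [2, 3, 5, 7]  # a sample sequence a_1 ... a_n

total = sum(a)          # sum over the sequence: 2 + 3 + 5 + 7 = 17
product = math.prod(a)  # product over the sequence: 2 * 3 * 5 * 7 = 210

print(total, product)
```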
Set Membership
Usually we see something like this:
x ∈ ℝ: x is defined as being a member of the set ℝ, the set of real numbers.
ℝ denotes the set of all real numbers. This set includes all rational numbers, together with all irrational numbers (that is, algebraic numbers that cannot be rewritten as fractions, such as √2, as well as transcendental numbers such as π and e).
x ∈ ℕ: x is defined as being a member of the set ℕ, the set of natural numbers.
ℕ denotes the set of all natural numbers: ℕ = {0, 1, 2, 3, …} (sometimes defined excluding 0).
x ∈ ℤ: x is defined as being a member of the set ℤ, the set of integers.
ℤ denotes the set of all integers (whether positive, negative or zero): ℤ = {…, −2, −1, 0, 1, 2, …}.
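The ∈ symbol behaves just like Python’s `in` operator on a finite set; a tiny sketch (the set below is a small finite stand-in, since ℕ itself is infinite):

```python
N = {0, 1, 2, 3, 4, 5}  # a finite stand-in for the natural numbers

print(3 in N)   # True  -- the analogue of 3 ∈ ℕ
print(-1 in N)  # False -- -1 is an integer, but not a natural number
```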
Vector
In order to understand gradients, we need the concept of a vector: an ordered list of numbers, such as v = (v₁, v₂, …, vₙ). In ML, both feature inputs and gradients are vectors.
Calculus
Differentiation (derivatives)
Not covered in detail here, but remember these ideas (I know some of you are going to be as rusty as me):
- Power rule: $\frac{d}{dx} x^n = n x^{n-1}$
- Chain rule: $\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$ (a quick SymPy check follows)
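If these rules are rusty, one way to check them is symbolically with SymPy (assuming `sympy` is installed; the example functions are my own picks):

```python
import sympy as sp

x = sp.symbols('x')

# Power rule: d/dx x^5 = 5x^4
print(sp.diff(x**5, x))          # 5*x**4

# Chain rule: d/dx sin(x^2) = cos(x^2) * 2x
print(sp.diff(sp.sin(x**2), x))  # 2*x*cos(x**2)
```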
We focus on the usage of differentiation in ML.
- A derivative basically finds the slope of a function. (First Derivative Test)
- We want to find a point where slope = 0.
Limit: to estimate the slope at a point, we take 2 points that are very close to each other, x and x + h.
Think of it like rise over run, $\frac{f(x+h) - f(x)}{h}$, but with the 2 points very close to each other:

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
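A minimal numeric sketch of this idea in Python (the step size h = 1e-6 is an arbitrary small value I picked):

```python
def slope_estimate(f, x, h=1e-6):
    """Estimate f'(x) as the slope between two very close points."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2           # exact derivative is 2x
print(slope_estimate(f, 3.0))  # ~6.0
```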
Visualizing the slope:
- The +ve slope goes like this:
/
- The -ve slope goes like this:
\
- Slope = 0 goes like this:
一
- Finding local minima (Second Derivative Test)
In Machine Learning, we want to find the local minima of our loss.
At all the stationary points, the gradient is the same (= zero). We may want to know if we found a maximum point, a minimum point or a point of inflection.
Therefore the gradient at either side of the stationary point needs to be looked at (alternatively, we can use the second derivative).
The second derivative gives us a way to test a point where the slope is zero (a SymPy sketch follows the list below).
When a function’s slope is zero at x, and the second derivative at x is:
- less than 0, it is a local maximum
- greater than 0, it is a local minimum
- equal to 0, then the test fails (there may be other ways of finding out though)
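A short sketch of the test with SymPy (assuming `sympy` is installed; the function f(x) = x³ − 3x is just an example with one maximum and one minimum):

```python
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x

f1 = sp.diff(f, x)     # first derivative: 3x^2 - 3
f2 = sp.diff(f, x, 2)  # second derivative: 6x

for point in sp.solve(f1, x):  # stationary points, where slope = 0
    curvature = f2.subs(x, point)
    if curvature > 0:
        print(point, "-> local minimum")
    elif curvature < 0:
        print(point, "-> local maximum")
    else:
        print(point, "-> test fails")
```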
For regression problems, the loss function is usually convex, so a local minimum is also the global minimum.
Partial differentiation (partial derivatives)
The symbol ∂ means partial differentiation.
If you don’t understand how partial differentiation works but you know ordinary differentiation, look at this clear picture.
So you just treat the other variables as constants (substitute 1, for example), differentiate as usual, and then put the variables back afterwards.
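For instance, a quick SymPy check (the function f(x, y) = x²y + y³ is my own made-up example):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y**3

# Partial derivative with respect to x: treat y as a constant
print(sp.diff(f, x))  # 2*x*y
# Partial derivative with respect to y: treat x as a constant
print(sp.diff(f, y))  # x**2 + 3*y**2
```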
Gradient
The Del symbol: ∇
Gradient (vector derivative)
The gradient of a function, denoted ∇f, is the vector of partial derivatives with respect to all of the independent variables:

$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$
Using the example above, f(x, y) = x²y + y³, the gradient ∇f is a vector-valued function. Therefore:

$\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = \left( 2xy, \; x^2 + 3y^2 \right)$
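And a minimal numeric sketch of the gradient using finite differences (the function, point, and step size here are my own choices):

```python
def gradient(f, point, h=1e-6):
    """Estimate the gradient of f at a point by finite differences."""
    grad = []
    for i in range(len(point)):
        shifted = list(point)
        shifted[i] += h  # nudge one variable at a time
        grad.append((f(shifted) - f(point)) / h)
    return grad

# f(x, y) = x^2 * y + y^3, whose exact gradient is (2xy, x^2 + 3y^2)
f = lambda p: p[0] ** 2 * p[1] + p[1] ** 3
print(gradient(f, [1.0, 2.0]))  # ~[4.0, 13.0]
```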