Prerequisites - Math Notations

Inspired by: MachineLearningMastery: Basics of Mathematical Notation for Machine Learning

Motivation:

When I was trying to learn about Machine Learning, I found the hardest part was understanding the math notation. I have a weak math background, and the notation often confused me.

But we cannot avoid mathematical notation when reading descriptions of machine learning methods, so I decided to write this article to help myself.

The Greek letters

Wiki Page: Greek letters used in mathematics, science, and engineering

Common notations:

  • $\mu$ : mean
  • $\sigma$ : standard deviation

Common conventions of letter case usage

StackExchange: Usage of capital and small letters

  • Areas and volumes often use capital letters, while lengths often use small letters
  • Bold letters are usually used to mean vectors, though some writers do not use bold for vectors, and others mark them in a different way
  • In linear algebra, a matrix is usually a capital letter
  • In calculus, small letters are usually used for functions, with a capital letter often reserved for a function which is the integral of some other function
  • In geometry, capital letters are usually used for points, small English letters for lines, and small Greek letters for planes (including $\pi$)
  • $\pi$ is usually reserved for the number $\pi$ in places where that number is likely to appear. In other places, it can mean a probability, a profit, a projection, or a permutation.

Small $e$ is not a common choice for a letter in a scientific formula, because it can be taken to mean the number $e$. However, in ordinary algebra it can be used for a coefficient, especially if you have already used $a$, $b$, $c$, and $d$.

Another point about good writing is that in one piece of writing, you shouldn’t change your notation mid-way. This is very frustrating to the reader! (Imagine if an author changed the name of their main character partway through a novel!)

The final point, which is actually a proper rule, is that capital/small/bold/curly letters are not interchangeable. If a capital $E$ is used for something in a piece of writing, then you can’t use a small $e$ elsewhere to mean the same thing. In short, maths is “case sensitive”.

I once had the frustration of a professor changing notation midway through a course. He seemed to be using two different sets of materials for his lectures.

Sequence Operations

These are trivial things, but it took me half a semester in year 1 to understand them well.

Sum over a sequence

$$\sum_{i=1}^{n} a_{i}$$

This is the sum of the sequence $a$ from element 1 to element $n$.

$$\sum_{i=1}^{n} a_{i} = a_1 + a_2 + a_3 + \cdots + a_n$$
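To make this concrete, here is a minimal Python sketch of the summation (the list `a` is just an example sequence I made up):

```python
# Sum over a sequence: sum_{i=1}^{n} a_i
a = [2, 4, 6, 8]   # a_1, a_2, ..., a_n (example values)
total = sum(a)     # 2 + 4 + 6 + 8
print(total)       # 20
```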

Product over a sequence

$$\prod_{i=1}^{n} a_{i}$$

This is the product of the sequence $a$ from element 1 to element $n$.

$$\prod_{i=1}^{n} a_{i} = a_1 \times a_2 \times a_3 \times \cdots \times a_n$$
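The same idea in Python for the product, using `math.prod` from the standard library (again with a made-up example list):

```python
# Product over a sequence: prod_{i=1}^{n} a_i
import math

a = [2, 4, 6, 8]        # a_1, a_2, ..., a_n (example values)
product = math.prod(a)  # 2 * 4 * 6 * 8
print(product)          # 384
```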

Set Membership

Wiki Page: Set (mathematics)

Usually we see something like this:

$$x \in \mathbb{R}$$

$x$ is defined as a member of the set $\mathbb{R}$, the set of real numbers.

$\mathbb{R}$ denotes the set of all real numbers. This set includes all rational numbers, together with all irrational numbers (that is, algebraic numbers that cannot be written as fractions, such as √2, as well as transcendental numbers such as π and e).

$$x \in \mathbb{N}$$

$x$ is defined as a member of the set $\mathbb{N}$, the set of natural numbers.

$\mathbb{N}$ denotes the set of all natural numbers: $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$ (sometimes defined to exclude 0).

$$x \in \mathbb{Z}$$

$x$ is defined as a member of the set $\mathbb{Z}$, the set of integers.

$\mathbb{Z}$ denotes the set of all integers (positive, negative, or zero): $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$.
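As a rough Python analogy (not a formal definition), the abstract base classes in the standard `numbers` module can stand in for these sets:

```python
# Rough analogy only: Python floats model ℝ, Python ints model ℤ.
from numbers import Real, Integral

x = 2.5
print(isinstance(x, Real))      # True:  2.5 ∈ ℝ
print(isinstance(x, Integral))  # False: 2.5 ∉ ℤ

n = 7
print(isinstance(n, Integral))             # True: 7 ∈ ℤ
print(isinstance(n, Integral) and n >= 0)  # True: 7 ∈ ℕ (taking ℕ to include 0)
```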

Vector

KhanAcademy: Vector fields

In order to understand the gradient, we first need the concept of a vector: a quantity with both magnitude and direction, which in ML we usually represent as an array of numbers.
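As a quick sketch, in ML code a vector is usually just an array of numbers, e.g. with NumPy (assuming numpy is installed):

```python
import numpy as np

v = np.array([3.0, 4.0])         # a 2-dimensional vector
print(np.linalg.norm(v))         # its magnitude (length): 5.0
print(2 * v)                     # scaling:  [6. 8.]
print(v + np.array([1.0, 1.0]))  # addition: [4. 5.]
```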

Calculus

Differentiation (derivatives)

Finding Maxima and Minima using Derivatives - Math is Fun

$$\frac{df}{dx}$$

I won’t go into detail here, but remember these rules (I know some of you will be as rusty as I am); a quick SymPy check of both follows the list:

  • Power rule

$$\frac{d}{dx} x^{n} = n x^{n-1}$$

  • Chain rule

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

where $\frac{dy}{dx}$ is the derivative of $y$ with respect to $x$, $\frac{dy}{du}$ is the derivative of $y$ with respect to $u$, and $\frac{du}{dx}$ is the derivative of $u$ with respect to $x$.
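Here is a quick SymPy sanity check of both rules (assuming SymPy is installed; the example functions are my own choices):

```python
import sympy as sp

x = sp.symbols('x')

# Power rule: d/dx x^n = n x^(n-1), with n = 4
print(sp.diff(x**4, x))          # 4*x**3

# Chain rule: y = sin(u), u = x^2, so dy/dx = cos(x^2) * 2x
print(sp.diff(sp.sin(x**2), x))  # 2*x*cos(x**2)
```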

Here we focus on how differentiation is used in ML.

  • A derivative basically finds the slope of a function. (First Derivative Test)
    • We want to find a point where slope = 0.

Limit: to estimate the slope at a point, we take two nearby points $(x, f(x))$ and $(a, f(a))$:

$$\lim_{x \rightarrow a} \frac{f(x) - f(a)}{x - a}$$

Think of it like $m = \frac{y_1 - y_2}{x_1 - x_2}$, but with the two points very close to each other.
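The limit idea translates directly into a numerical approximation: pick a second point a tiny step `h` away and compute the difference quotient. A minimal sketch (the function and step size are my own choices):

```python
def slope(f, a, h=1e-6):
    """Approximate f'(a) with the difference quotient (f(a + h) - f(a)) / h."""
    return (f(a + h) - f(a)) / h

f = lambda x: x**2
print(slope(f, 3.0))  # ~6.0, since d/dx x^2 = 2x
```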

Visualizing the slope:

  • The +ve slope goes like this: /
  • The -ve slope goes like this: \
  • Slope = 0 goes like this: _

  • Finding local minima (Second Derivative Test)

In Machine Learning, we want to find the local minima of our loss.

All stationary points have the same gradient (zero), so we may want to know whether we have found a maximum point, a minimum point, or a point of inflection.

Therefore we need to look at the gradient on either side of the stationary point (alternatively, we can use the second derivative).

The second derivative gives us a way to test the point where the slope is zero.

When a function’s slope is zero at x, and the second derivative at x is:

  • less than 0, it is a local maximum
  • greater than 0, it is a local minimum
  • equal to 0, then the test fails (there may be other ways of finding out though)
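A small SymPy sketch of the test, using $f(x) = x^3 - 3x$ as an example I chose because it has one maximum and one minimum:

```python
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x

stationary_points = sp.solve(sp.diff(f, x), x)  # where f'(x) = 0: [-1, 1]
for p in stationary_points:
    second = sp.diff(f, x, 2).subs(x, p)        # f''(p)
    if second > 0:
        print(p, "local minimum")   # x = 1
    elif second < 0:
        print(p, "local maximum")   # x = -1
    else:
        print(p, "test fails")
```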

For regression problems, the loss function is usually convex, so a local minimum is also the global minimum.

Partial differentiation (partial derivatives)

KhanAcademy: Introduction to partial derivatives

$\partial$ denotes partial differentiation.

$$\frac{\partial f}{\partial x}$$

If you don’t understand how partial differentiation works but you know differentiation, look at this clear picture.

So you just treat the other variables as constants (substituting 1, for example), and after differentiating you put the variables back.
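A minimal SymPy sketch, using $f(x, y) = x^4 y^3$ (the same function used in the gradient section below):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**4 * y**3

print(sp.diff(f, x))  # ∂f/∂x = 4*x**3*y**3  (y treated as a constant)
print(sp.diff(f, y))  # ∂f/∂y = 3*x**4*y**2  (x treated as a constant)
```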

Gradient

KhanAcademy: The gradient

The Del Symbol

The symbol $\nabla$ (del) denotes the gradient, a vector derivative.

The gradient of a function, denoted as follows, is the vector of partial derivatives with respect to all of the independent variables:

$$\nabla f = \left[\begin{array}{c} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \\ \vdots \end{array}\right]$$

Using the above $f(x, y) = x^4 y^3$ example:

$$\nabla f(x,y) = \left(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y)\right) = \left(4x^3y^3,\ 3x^4y^2\right)$$

$\nabla f$ is a vector-valued function.

Therefore:

$$\nabla_x f(x,y) = \frac{\partial f}{\partial x}(x,y) = 4x^3y^3$$

$$\nabla_y f(x,y) = \frac{\partial f}{\partial y}(x,y) = 3x^4y^2$$
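Putting it together, a short SymPy sketch that builds the gradient vector and evaluates it at a point (the point $(1, 2)$ is just an example):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**4 * y**3

grad = sp.Matrix([sp.diff(f, x), sp.diff(f, y)])  # vector of partial derivatives
print(grad)                     # Matrix([[4*x**3*y**3], [3*x**4*y**2]])
print(grad.subs({x: 1, y: 2}))  # gradient at (1, 2): Matrix([[32], [12]])
```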