Matrix-Vector Revisit

Matrix and Vector Norm

Vector Norm

\|\boldsymbol{x}\|_{p}=\left(\sum_{i}\left|x_{i}\right|^{p}\right)^{\frac{1}{p}}, \quad p \in \mathbb{R},\ p \geq 1

  • measures the size of a vector (the distance from the origin to the point x)
  • When p = 2, we have the Euclidean norm

Matrix Norm

\|A\|_{F}=\sqrt{\sum_{i, j} A_{i, j}^{2}}

  • Frobenius norm measures the size of a matrix
  • If A is symmetric, the Frobenius norm measures the total variation along all dimensions.
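
As a quick check, here is a minimal NumPy sketch of both norms (the vector and matrix values are illustrative only):

```python
import numpy as np

x = np.array([3.0, -4.0])
p = 2
# p-norm: (sum_i |x_i|^p)^(1/p); p = 2 gives the Euclidean norm
print(np.sum(np.abs(x) ** p) ** (1 / p))   # 5.0
print(np.linalg.norm(x, ord=p))            # same value via NumPy

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
# Frobenius norm: square root of the sum of squared entries
print(np.sqrt(np.sum(A ** 2)))
print(np.linalg.norm(A, ord='fro'))        # same value via NumPy
```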

Multiplying Matrix and Vector

Dot product of two vectors

\begin{aligned} z &=\mathbf{x}^{\top} \mathbf{w}=\mathbf{x} \cdot \mathbf{w}=\sum_{i=1}^{D} x_{i} w_{i} \\ &=\|\mathbf{x}\|\|\mathbf{w}\| \cos \theta \end{aligned}

Matrix-vector multiplication

\mathbf{z}=\mathbf{W} \mathbf{x}= (\mathbf{x}^{\top} \mathbf{W}^{\top})^{\top}

Matrix-matrix multiplication

\mathbf{O}=f\left(\mathbf{W} \mathbf{X}+\left[\begin{array}{llll} \mathbf{b} & \mathbf{b} & \cdots & \mathbf{b} \end{array}\right]\right)

  • X is a D × N matrix
  • W is an M × D matrix
  • WX is an M × N matrix
  • O is an M × N matrix
  • N is the batch size, i.e., the number of input vectors
img
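
Below is a small NumPy sketch of this batched affine map; tanh stands in for the activation f, and the shapes and values are illustrative only:

```python
import numpy as np

D, M, N = 4, 3, 5                     # input dim, output dim, batch size (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((M, D))       # M x D weight matrix
X = rng.standard_normal((D, N))       # D x N batch: N input column vectors
b = rng.standard_normal((M, 1))       # bias column, broadcast across the batch

O = np.tanh(W @ X + b)                # M x N output; tanh stands in for f
print(O.shape)                        # (3, 5)

# Single-vector case and the transpose identity Wx = (x^T W^T)^T
x = X[:, [0]]
assert np.allclose(W @ x, (x.T @ W.T).T)
```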

Vector Space

  • A vector space is a set of objects V called vectors, which may be added together and multiplied (“scaled”) by numbers, called scalars.

Example: a vector in a 3-D vector space, [X = 2, Y = 3, Z = 5]

img

Subspace

  • In linear algebra, a linear subspace is a vector space that is a subset of some larger vector space.

img

  • Linear = Subspace
  • Non-linear = Manifold

Is a curved surface (a 2-D manifold) a linear subspace of a 3-dimensional space?

  • No. It is a non-linear subspace.

Revisit Random Variables and Probability Distributions

Random Variables (RV)

  • A Random Variable is a variable whose value is subject to variation due to chance

  • Random Variable can be discrete:

    • E.g., a Random Variable could be X = no. of heads after flipping a fair coin 2 times; then X can only be 0, 1, or 2.
    • E.g., throw a fair die once
      • img
    • Discrete Probability Distribution is also called Probability Mass Function (PMF)
  • Random Variable can also be continuous:

    • E.g. Weight of a person; can take on any value within an interval
      • img
      • The most likely weight is 200 in this case (the peak of the PDF).
    • Continuous Probability Distribution is also called Probability Density Function (PDF)
  • The function describing the possible values of an Random Variable and their associated probabilities is known as a probability distribution.

Histogram and Probability Density Function (PDF)

  • For a continuous random variable, the histogram converges to the PDF as the bin size goes to zero
img

Conditional Probability

img
  • Pr(x, y): joint probability density function of x and y
  • Pr(x ∣ y = y₁) and Pr(x ∣ y = y₂): the two conditional probability distributions
  • These are formed by extracting the appropriate slice from the joint PDF and normalizing so that the area is one. A similar operation can be performed for discrete distributions.

Rules of Probability

p(X, Y): joint probability
p(X): marginal probability
p(Y ∣ X): conditional probability

Sum Rule

p(X)=\sum_{Y} p(X, Y)

Discrete case : Summation

If Y = \{\text{Male}, \text{Female}\}, then

p(X)=\sum_{Y} p(X, Y) = p(X, Y=\text{Male}) + p(X, Y=\text{Female})

Continuous case : Integration

If Y is continuous, then

p(X)=\int p(X, Y)\, dY

Product Rule

p(X, Y)=p(Y \mid X)\, p(X)
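
A tiny NumPy sketch of both rules on a hypothetical discrete joint distribution (the probability table is made up for illustration):

```python
import numpy as np

# Hypothetical joint distribution p(X, Y): rows index X in {0, 1, 2},
# columns index Y in {Male, Female}; entries sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])

# Sum rule: marginalize out one variable
p_x = p_xy.sum(axis=1)                 # p(X)
p_y = p_xy.sum(axis=0)                 # p(Y)

# Product rule: p(X, Y) = p(Y | X) p(X)
p_y_given_x = p_xy / p_x[:, None]      # p(Y | X)
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)
print(p_x, p_y)
```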

Discrete RV in Machine Learning

  • In machine learning, discrete probability distributions are often used for modeling binary and multi-class classification problems

  • E.g., gender classification from a facial image x (binary classification problem)

    • P(G = Male ∣ x) = p and P(G = Female ∣ x) = 1 − p, where 0 ≤ p ≤ 1
    • GG is a random variable with possible outcomes “Male” or “Female”.
    • GG follows a Bernoulli distribution.
  • E.g. Classification of handwritten digit with image x (Multi-class classification problem)

    • P(D=k \mid \mathbf{x}), where k=0, \ldots, 9 and \sum_{k=0}^{9} P(D=k \mid \mathbf{x})=1
    • DD is a random variable with possible outcomes {0,1,⋯,9}
    • DD follows a Multinoulli (categorical) distribution.

Continuous RV in Machine Learning

  • In machine learning, continuous probability distributions are often used for modeling feature vectors

  • E.g. Assume that facial image x follows a Gaussian distribution

  • \mathbf{x} \sim \mathcal{N}(\mathbf{x} ; \boldsymbol{\mu}, \boldsymbol{\Sigma})=\frac{1}{(2 \pi)^{\frac{D}{2}}|\boldsymbol{\Sigma}|^{\frac{1}{2}}} \exp \left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}, where D is the dimension of \mathbf{x}

  • x is a random vector (variable) with an infinite number of possible outcomes

    • img
    • Values near the mean μ have the highest probability density
    • Note \int \mathcal{N}(\mathbf{x} ; \boldsymbol{\mu}, \boldsymbol{\Sigma})\, d\mathbf{x} = 1, just like any PDF.

Expectation for Discrete RV

For a discrete RV, the expectation (expected value) of a function f(x) with respect to a probability distribution P(x) is the average value that f takes on when x is drawn from P:

  • Upper case P is used to denote a discrete probability distribution

\mathbb{E}_{\mathrm{x} \sim P}[f(x)]=\sum_{x} P(x) f(x)

  • If P is uniform, expected value = mean

  • If P is not uniform, expected value = weighted mean

    • E.g. Throw a fair dice

    • x \in\{1,2,3,4,5,6\}, \quad f(x)=x, \quad P(x)=\frac{1}{6}
    • \mathbb{E}_{x \sim P}[f(x)]=\frac{1}{6} \cdot 1+\frac{1}{6} \cdot 2+\cdots+\frac{1}{6} \cdot 6=\frac{21}{6}=3.5
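
The fair-die example above, as a one-line NumPy computation:

```python
import numpy as np

x = np.arange(1, 7)           # outcomes of a fair die
P = np.full(6, 1 / 6)         # uniform PMF
f = x                         # f(x) = x

print(np.sum(P * f))          # weighted sum = 3.5
```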

Expectation for Continuous RV

For a continuous RV, the expectation (expected value) of a function f(x) with respect to a probability distribution p(x) is the average value that f takes on when x is drawn from p:

  • Lower case p is used to denote a continuous probability distribution

\mathbb{E}_{\mathrm{x} \sim p}[f(x)]=\int p(x) f(x)\, d x

  • Expectation of samples drawn from a Gaussian distribution
    • img
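
A minimal Monte Carlo sketch of a continuous expectation, assuming x ~ N(μ, σ²) with illustrative parameters and f(x) = x²:

```python
import numpy as np

# Monte Carlo estimate of E[f(x)] for x ~ N(mu, sigma^2) with f(x) = x^2.
# Parameters are illustrative; the exact value is mu^2 + sigma^2 = 5.
mu, sigma, n = 1.0, 2.0, 1_000_000
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=n)

print(np.mean(samples ** 2))  # close to 5.0
```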

Variance and Covariance

Variance gives a measure of how much the values of a function f of a random variable x vary as we draw samples from the probability distribution of x:

\operatorname{Var}(f(x))=\mathbb{E}\left[(f(x)-\mathbb{E}[f(x)])^{2}\right]

Variance of f(x) = Expectation[ ( f(x) - Expectation[f(x)] )^2 ]

Note: Expectation[f(x)] = mean μ

The covariance measures how much two functions of RVs (x and y) are linearly related to each other, as well as the scale of these variables:

\operatorname{Cov}(f(x), g(y))=\mathbb{E}[(f(x)-\mathbb{E}[f(x)])(g(y)-\mathbb{E}[g(y)])]

Covariance of f(x) and g(y) = Expectation[ ( f(x) - Expectation[f(x)] ) ( g(y) - Expectation[g(y)] ) ]

  • Positive covariance means when x is large (small), y is also large (small)
  • Negative covariance means when x is large (small), y is small (large)
  • Covariance in Statistics: What is it? Example - Statistics How To

Covariance in vector form:

\operatorname{Cov}(\mathbf{f}(\mathbf{x}), \mathbf{g}(\mathbf{x}))=\mathbb{E}[(\mathbf{f}(\mathbf{x})-\mathbb{E}[\mathbf{f}(\mathbf{x})])(\mathbf{g}(\mathbf{x})-\mathbb{E}[\mathbf{g}(\mathbf{x})])^{\top}]

\operatorname{Cov}(f(\mathbf{x}), g(\mathbf{y}))=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p(\mathbf{x}, \mathbf{y})}\left\{\left[f(\mathbf{x})-\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} f(\mathbf{x})\right]\left[g(\mathbf{y})-\mathbb{E}_{\mathbf{y} \sim p(\mathbf{y})} g(\mathbf{y})\right]^{\top}\right\}

  • Note that Cov(f(x), g(x)) is now a matrix

Note the symmetry of the covariance matrix.

E.g.

For x=\left[\begin{array}{l} x_{1} \\ x_{2} \end{array}\right]

\operatorname{cov}(x, x)=\left[\begin{array}{cc} \sigma_{11}^{2} & \sigma_{12} \\ \sigma_{21} & \sigma_{22}^{2} \end{array}\right] = \left[\begin{array}{cc} var(x_1) & cov(x_1,x_2) \\ cov(x_2,x_1) & var(x_2) \end{array}\right]

For x=\left[\begin{array}{l} x_1 \\ x_2 \\ x_3 \end{array}\right]

\operatorname{cov}(x, x)=\left[\begin{array}{lll} \sigma_{11}^{2} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22}^{2} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}^{2} \end{array}\right] = \left[\begin{array}{lll} var(x_1) & cov(x_1,x_2) & cov(x_1,x_3) \\ cov(x_2,x_1) & var(x_2) & cov(x_2,x_3) \\ cov(x_3,x_1) & cov(x_3,x_2) & var(x_3) \end{array}\right]

Note

  • σ₁₃ = σ₃₁ = cov(x₁, x₃) = cov(x₃, x₁)
  • σ₁₂ = σ₂₁ = cov(x₁, x₂) = cov(x₂, x₁)
  • σ₂₃ = σ₃₂ = cov(x₂, x₃) = cov(x₃, x₂)
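
A short NumPy sketch that estimates such a covariance matrix from synthetic data and confirms its symmetry (the correlation strengths are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x1 = rng.normal(0.0, 1.0, N)
x2 = 0.8 * x1 + rng.normal(0.0, 0.5, N)   # positively correlated with x1
x3 = -0.5 * x1 + rng.normal(0.0, 1.0, N)  # negatively correlated with x1

C = np.cov(np.stack([x1, x2, x3]))        # 3 x 3 covariance matrix
print(np.round(C, 2))
assert np.allclose(C, C.T)                # symmetric: cov(x_i, x_j) = cov(x_j, x_i)
```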

Variance and Covariance Relation to Machine Learning

  • Many machine learning problems are multivariate, e.g., an image with 28 x 28 pixels can be viewed as a 784-dim vector
  • Neighboring pixels are positively correlated for smooth images.
  • We use a covariance matrix to quantify the variances and the covariances of different pixels:

img

Bayes Theorem and Its Role in Classification Systems

Bayes theorem states the following:

\text{Posterior} \propto \text{Prior} \times \text{Likelihood}

  • Prior: Probability distribution representing knowledge or uncertainty of a data object prior or before observing it
  • Posterior: Conditional probability distribution representing what parameters are likely after observing the data object
  • Likelihood: The probability (or density) of observing the data given a specific category or class.

p(X, Y)=p(Y \mid X) p(X)=p(X \mid Y) p(Y)

Bayes Theorem

img

E.g.

Assume X is our measurement (weight),

Y = \{\text{Male}, \text{Female}\}

Prior = p(Y=\text{Male})

Likelihood = p(X \mid Y=\text{Male})

Posterior ∝ Likelihood × Prior

If Y is discrete:

p(X)=\sum_{Y} p(X \mid Y) p(Y)

e.g., X is the height of a person and Y is his/her gender (‘M’ or ‘F’)

If Y is continuous:

p(X)=\int p(X \mid Y) p(Y)\, \mathrm{d} Y

e.g., X is the weight of a person and Y his/her wealth

Note: When X is continuous, the likelihood p(X ∣ Y) is a density, not a probability, and could be greater than 1

E.g., if the PDF is concentrated in a very narrow range of X (e.g., a width of 0.0001), p(X ∣ Y) must exceed 1 somewhere so that the area under the PDF still equals 1.

Only in the continuous case is the likelihood different from a probability.

If x is a continuous random variable, could the likelihood of x be larger than 1?

  • Yes

Example:

img

\text{Posterior} \propto \text{Prior} \times \text{Likelihood}

Bayes Theorem:

p(Y \mid X)=\frac{p(X \mid Y) p(Y)}{p(X)}

Therefore:

p(Y \mid X) = p(\text{Male} \mid \text{Weight}=70) = \frac{p(\text{Weight}=70 \mid \text{Male})\, p(\text{Male})}{p(\text{Weight}=70)} = \frac{p(X \mid Y) p(Y)}{p(X)}

Marginal density: p(X) = p(X \mid Y=\text{Male}) P(\text{Male}) + p(X \mid Y=\text{Female}) P(\text{Female})

p(X) = 0.01 \times 0.4 + 0.001 \times 0.6

p(\text{Male} \mid \text{Weight}=70) = \frac{0.01 \times 0.4}{0.01 \times 0.4 + 0.001 \times 0.6} \approx 0.87
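
The same computation in a few lines of Python, using the numbers from this example:

```python
# Numbers from the worked example above.
prior_male, prior_female = 0.4, 0.6
lik_male, lik_female = 0.01, 0.001           # p(Weight=70 | Male), p(Weight=70 | Female)

evidence = lik_male * prior_male + lik_female * prior_female   # p(Weight=70)
posterior_male = lik_male * prior_male / evidence
print(round(posterior_male, 2))              # 0.87
```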

Bayes Decision Theory

Decision Rule for 2-class problems: (ω₁ means class 1, ω₂ means class 2)

  • where P(ω) is the prior.
img

Decision Rule for K-class problems: (ωₖ means class k)

  • where P(ω) is the prior.
img

Gaussian Mixture Model

  • Sometimes the distribution may not be Gaussian.
  • A Gaussian mixture model is a linear combination of M Gaussian densities:

img

In the formula above, w denotes the mixture weights and M is the number of Gaussian components.

img

http://ethen8181.github.io/machine-learning/clustering/GMM/GMM.html

In a GMM, each cluster corresponds to a probability distribution, in this case the Gaussian distribution. What we want to do is to learn the parameters of these distributions, which are the Gaussian's mean μ and variance σ².

Note The mathematical form of the Gaussian distribution in 1-dimension (univariate Gaussian) can be written as:

N(x \mid \mu, \sigma)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^{2}}{2 \sigma^{2}}}

  • This is also referred to as the probability density function (PDF).
  • The Gaussian distribution is commonly referred to as the Normal distribution, hence that's where the N comes from.
  • x refers to the random observation over which this distribution is placed.
  • The mean μ controls the Gaussian's “center position” and the variance σ² controls its “shape”. To be precise, it is actually the standard deviation σ, i.e., the square root of the variance, that controls the distribution's shape.

In higher dimensions, a Gaussian is fully specified by a mean vector μ and a d-by-d covariance matrix Σ (do not confuse this symbol with ∑, which is used to denote summation). |Σ| refers to the determinant of the covariance matrix. E.g., in two dimensions, the Gaussian's parameters might look like this:

N\left[\left(\begin{array}{l} \mu_{1} \\ \mu_{2} \end{array}\right),\left(\begin{array}{cc} \sigma_{1}^{2} & \sigma_{12} \\ \sigma_{21} & \sigma_{2}^{2} \end{array}\right)\right]

The mean vector, containing elements μ₁ and μ₂, centers the distribution along every dimension. On the other hand, the covariance matrix specifies the spread and orientation of the distribution. Along the diagonal of this covariance matrix we have the variance terms σ₁² and σ₂², representing the shape (spread) along each dimension. We also have the off-diagonal terms σ₁₂ and σ₂₁ (these two take the same value because this is a symmetric matrix), which specify the correlation structure of the distribution.

Remember the covariance matrix is symmetric.

  • Diagonal covariance matrix (off-diagonal terms are zero)
  • Full covariance matrix (off-diagonal terms are non-zero)
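
A minimal sketch of the two cases using SciPy's multivariate_normal (the mean and covariance values here are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
diag_cov = np.array([[1.0, 0.0],
                     [0.0, 2.0]])     # diagonal: axis-aligned spread, no correlation
full_cov = np.array([[1.0, 0.8],
                     [0.8, 2.0]])     # full: off-diagonal terms tilt the ellipse

x = np.array([0.5, -0.5])
print(multivariate_normal.pdf(x, mean=mu, cov=diag_cov))
print(multivariate_normal.pdf(x, mean=mu, cov=full_cov))
```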

Gaussian Mixture Model Classifier (GMM Classifier)

Given a data set comprising the samples of K classes, we partition the data into K class-dependent subsets:

\mathcal{X}=\left\{\mathcal{X}_{1}, \ldots, \mathcal{X}_{k}, \ldots, \mathcal{X}_{K}\right\}

We use the samples in Class k to train the k-th GMM:

\mathcal{X}_{k}=\left\{\mathbf{x}_{k, 1}, \ldots, \mathbf{x}_{k, N_{k}}\right\} \xrightarrow{\text{ EM Algorithm }} \Lambda_{k}=\left\{w_{k j}, \boldsymbol{\mu}_{k j}, \boldsymbol{\Sigma}_{k j}\right\}_{j=1}^{M}

During classification, given an unknown vector x, we use the Bayes theorem to decide its class label c:

c=\arg \max _{k} P\left(C_{k} \mid \mathbf{x}\right)=\arg \max _{k} p\left(\mathbf{x} \mid C_{k}\right) P\left(C_{k}\right)

p\left(\mathbf{x} \mid C_{k}\right)=\sum_{j=1}^{M} w_{k j}\, p\left(\mathbf{x} \mid \boldsymbol{\mu}_{k j}, \boldsymbol{\Sigma}_{k j}\right)

Note: the training of individual GMMs is unsupervised, but the construction of the GMM classifier is supervised.

  • The training of a GMM is unsupervised in that only the feature vectors \mathcal{X}_{k}=\left\{\mathbf{x}_{k, 1}, \ldots, \mathbf{x}_{k, N_{k}}\right\} are used for training the k-th GMM.
  • We also did not train the GMM to map x to a class label c.
  • Instead, we train the k-th GMM to generate the data in the k-th class.
  • To use GMMs for classification, we train one GMM per class. So, the label information of the training data is implicitly used when constructing a GMM classifier.
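
A sketch of a GMM classifier built this way with scikit-learn's GaussianMixture; the two-class 2-D data set, class priors, and component count are illustrative choices, not part of the original notes:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(X_by_class, n_components=4):
    """Fit one GMM per class (unsupervised within each class)."""
    return [GaussianMixture(n_components=n_components, covariance_type='full',
                            random_state=0).fit(Xk) for Xk in X_by_class]

def classify(gmms, priors, x):
    """Pick argmax_k  log p(x | C_k) + log P(C_k)."""
    x = np.atleast_2d(x)
    scores = [gmm.score_samples(x)[0] + np.log(p) for gmm, p in zip(gmms, priors)]
    return int(np.argmax(scores))

# Illustrative two-class toy data
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(200, 2))
X1 = rng.normal(3.0, 1.0, size=(200, 2))
gmms = train_gmm_classifier([X0, X1])
print(classify(gmms, priors=[0.5, 0.5], x=[2.8, 3.1]))   # most likely 1
```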

Support Vector Machines

  • During the second AI winter, the support vector machine (SVM) was an important machine learning model.
  • It is very good at performing classification tasks
  • Even with the recent advances in deep neural networks, SVM still plays an essential role in many practical systems.
  • In many real-world problems, we do not have enough data to train a DNN for the classification task.
    • SVMs are suitable for small amounts of data
  • One solution is to train a DNN using a massive amount of data from another, similar data source and use the DNN as a feature extractor (the extracted features are called embeddings). Then, we use the small number of feature vectors to train an SVM for our task.

Margin Maximization

  • SVM is trained by maximizing the margin between the hyperplanes of two classes
img
  • We aim to find a decision plane f(x) = 0 that maximizes the margin d = 2/||w|| to obtain

f(\mathbf{x})=\sum_{i \in \mathcal{S}} \alpha_{i} y_{i} \mathbf{x}_{i}^{\top} \mathbf{x}+b=\mathbf{w}^{\top} \mathbf{x}+b, \quad \mathcal{S}=\left\{j \text{ where } \alpha_{j}>0\right\}

SVM Classifier

  • SVM is good for binary classification.
  • To classify multiple classes, we can use the one-vs-rest approach, which combines K binary classifiers into a K-class classifier

img
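
A minimal one-vs-rest sketch using scikit-learn (OneVsRestClassifier wrapping a linear SVC); the three-blob data set is made up for illustration:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy 3-class data (illustrative): three well-separated blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)

# One-vs-rest: K binary SVMs combined into a K-class classifier
clf = OneVsRestClassifier(SVC(kernel='linear', C=1.0)).fit(X, y)
print(clf.predict([[2.9, 3.1]]))   # most likely class 1
```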

Explain why it is always possible to use a plane (linear SVM) to divide 3 linearly independent samples into two classes. What happens when the 3 samples are linearly dependent?

  • When the 3 samples are not linearly dependent, they span a plane in the 3-D space, i.e., they become the corners of a triangle. As a result, a linear SVM can classify the 3 points perfectly.
  • On the other hand, when the 3 points are co-linear (linearly dependent), they form a line in the 3-D space. All points on the line can be specified by the first two dimensions, because the third dimension depends on the first two. As we now have more points (N = 3) than the dimension of the feature vectors (D = 2), we need a non-linear classifier such as an RBF-SVM. The following diagram illustrates such a situation.
    • img

Are the support vectors of an SVM usually near the decision boundary?

Yes.

MLP vs GMM vs SVM

  • You may consider GMMs and SVMs as special forms of an MLP with only one hidden layer, one output, and a special activation function

img

img

Why can we consider Gaussian mixture models (GMMs) and support vector machines with non-linear kernel functions (e.g., RBF-SVM) as neural networks with a single hidden layer?

The forward propagation equations of GMMs and RBF-SVMs are linear weighted sums of non-linear functions. Therefore, the equations have the same form as a neural network with one hidden layer.

img

Questions

img

1(a)

For distributions with a common covariance structure, the decision boundaries are hyper-planes.

img

img

1(b)

For distributions with arbitrary covariance structures, the decision boundaries are quadratic surfaces (here, hyper-spheres).

img

img

img

2(a)

As none of the central pixels (14, 14) of the 5,000 training samples of Digit ‘0’ contains a stroke, there is no intensity in the range 0.6–1 (the stroke-level intensity). Because the digits were contaminated by Gaussian noise, the frequency of exactly 0 intensity is also 0. As the bin size approaches 0, the histogram approaches a Gaussian PDF whose mass falls neither in the 0.6–1 range nor exactly at 0 (because of the contamination).

img

img

2(b)

Recall the formula of the GMM classifier:

Given a data set comprising the samples of K classes, we partition the data into K class-dependent subsets:

\mathcal{X}=\left\{\mathcal{X}_{1}, \ldots, \mathcal{X}_{k}, \ldots, \mathcal{X}_{K}\right\}

We use the samples in Class k to train the k-th GMM:

\mathcal{X}_{k}=\left\{\mathbf{x}_{k, 1}, \ldots, \mathbf{x}_{k, N_{k}}\right\} \xrightarrow{\text{ EM Algorithm }} \Lambda_{k}=\left\{w_{k j}, \boldsymbol{\mu}_{k j}, \boldsymbol{\Sigma}_{k j}\right\}_{j=1}^{M}

During classification, given an unknown vector x, we use the Bayes theorem to decide its class label c:

c=\arg \max _{k} P\left(C_{k} \mid \mathbf{x}\right)=\arg \max _{k} p\left(\mathbf{x} \mid C_{k}\right) P\left(C_{k}\right)

p\left(\mathbf{x} \mid C_{k}\right)=\sum_{j=1}^{M} w_{k j}\, p\left(\mathbf{x} \mid \boldsymbol{\mu}_{k j}, \boldsymbol{\Sigma}_{k j}\right)

Therefore:

We first convert the training images into vectors by stacking the pixels, which results in 5,000 training vectors per digit. Then, we train 10 Gaussian models (density functions) with mean vectors and covariance matrices \Lambda=\left\{\boldsymbol{\mu}_{i}, \boldsymbol{\Sigma}_{i}\right\}_{i=0}^{9}, one for each digit. A Gaussian classifier is formed as follows:

\text{Recognized digit} = \underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} P\left(C_{k} \mid \mathbf{x}\right)

= \underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} \frac{P\left(C_{k}\right) \mathcal{N}\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{\sum_{j=0}^{9} P\left(C_{j}\right) \mathcal{N}\left(\mathbf{x} ; \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)} \quad \text{(Bayes rule)}

= \underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} \frac{P\left(C_{k}\right) \mathcal{N}\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{p(\mathbf{x})} \quad \text{(law of total probability)}

= \underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} P\left(C_{k}\right)\, \mathcal{N}\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right) \quad (p(\mathbf{x}) \text{ is independent of } k)

img

Given a test vector x, the Gaussian classifier determines the class label of x by the following formula:

l(\mathbf{x}) = \underset{i}{\operatorname{argmax}}\left\{\begin{array}{ll} P(i \text{ is even })\, p\left(\mathbf{x} \mid \boldsymbol{\mu}_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text{ is even } \\ P(i \text{ is odd })\, p\left(\mathbf{x} \mid \boldsymbol{\mu}_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text{ is odd } \end{array}\right\} = \underset{i}{\operatorname{argmax}}\left\{\begin{array}{ll} \frac{2}{3}\, p\left(\mathbf{x} \mid \boldsymbol{\mu}_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text{ is even } \\ \frac{1}{3}\, p\left(\mathbf{x} \mid \boldsymbol{\mu}_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text{ is odd } \end{array}\right\}

where p\left(\mathbf{x} \mid \boldsymbol{\mu}_{i}, \boldsymbol{\Sigma}_{i}\right) is the likelihood of x given the Gaussian model of Digit i. The prior probabilities are obtained as follows:

P_{e}=2 P_{o} \quad \text{and} \quad P_{e}+P_{o}=1 \;\Rightarrow\; 2 P_{o}+P_{o}=1 \;\Rightarrow\; P_{o}=\frac{1}{3},\; P_{e}=\frac{2}{3}
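
A sketch of this prior-weighted Gaussian classifier in Python; the 2-D means and identity covariances are hypothetical stand-ins for the actual per-digit models:

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_digit(x, means, covs):
    """Prior-weighted Gaussian classifier with P(even) = 2/3 and P(odd) = 1/3."""
    priors = [2 / 3 if i % 2 == 0 else 1 / 3 for i in range(10)]
    scores = [np.log(priors[i]) + multivariate_normal.logpdf(x, means[i], covs[i])
              for i in range(10)]
    return int(np.argmax(scores))

# Hypothetical 2-D stand-ins for the 10 per-digit Gaussian models
rng = np.random.default_rng(0)
means = [rng.normal(i, 0.1, size=2) for i in range(10)]
covs = [np.eye(2) for _ in range(10)]
print(classify_digit(rng.normal(3.0, 0.1, size=2), means, covs))   # most likely 3
```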

img

2(d)

No. This is because 100 samples cannot produce a full covariance matrix with a valid inverse.

The number of samples limits the rank of the covariance matrix: with only 100 samples, the 784 × 784 covariance estimate is rank-deficient and therefore singular.

img

2(e)

Change the full covariance matrices to diagonal covariance matrices. Because each diagonal element is now estimated from 100 samples, the variance estimates are reliable and will not be 0.

A diagonal matrix is invertible as long as no diagonal term is zero, and each variance here is a positive, non-zero number, so the diagonal covariance matrix is reliable.
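
A quick numerical illustration of why this works, assuming 784-dimensional vectors and 100 hypothetical noisy samples:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 784, 100                               # 28x28 pixels, 100 noisy samples (hypothetical)
X = 0.5 + 0.1 * rng.standard_normal((N, D))   # stand-in for noise-contaminated images

full_cov = np.cov(X, rowvar=False)            # 784 x 784 estimate
print(np.linalg.matrix_rank(full_cov))        # at most N-1 = 99, so it is singular

diag_var = X.var(axis=0, ddof=1)              # one variance per pixel, all positive
print(diag_var.min() > 0)                     # True: diag(diag_var) is invertible
```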

img

2(f)

When the number of training samples per digit is 1, using diagonal covariance matrices will not help because the variances are still 0. In such a situation, we may use a common diagonal covariance matrix for all 10 digits so that each diagonal element is estimated from 10 numbers. Then, the variance will not be 0. A better solution is NOT to use a Gaussian classifier at all, e.g., to use a linear SVM instead.