Revisit Basic Math of Machine Learning
Matrix-Vector Revisit
Matrix and Vector Norm
Vector Norm
- The $L^p$ norm measures the size of a vector (the distance from the origin to the point $\mathbf{x}$): $\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
- When $p = 2$, we have the Euclidean norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$
Matrix Norm
- The Frobenius norm measures the size of a matrix: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$
- If $A$ is symmetric (e.g., a covariance matrix), the Frobenius norm summarizes the variation along all dimensions, since $\|A\|_F^2 = \sum_i \lambda_i^2$, where $\lambda_i$ are the eigenvalues of $A$.
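As a quick numerical check (a minimal NumPy sketch; the array values are made up for illustration), the vector and Frobenius norms above can be computed as follows:

```python
import numpy as np

x = np.array([3.0, 4.0])          # example vector (made-up values)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])        # example symmetric matrix (made-up values)

# L2 (Euclidean) norm of x: sqrt(3^2 + 4^2) = 5
print(np.linalg.norm(x, ord=2))

# General Lp norm, e.g., p = 1
print(np.linalg.norm(x, ord=1))

# Frobenius norm of A: sqrt(sum of squared entries)
print(np.linalg.norm(A, ord='fro'))

# For a symmetric A, ||A||_F^2 equals the sum of squared eigenvalues
eigvals = np.linalg.eigvalsh(A)
print(np.sum(eigvals**2), np.linalg.norm(A, ord='fro')**2)
```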
Multiplying Matrix and Vector
Dot product of two vectors: $\mathbf{x}^\top \mathbf{y} = \sum_i x_i y_i$
Matrix-vector multiplication: $\mathbf{y} = A\mathbf{x}$, where $y_i = \sum_j A_{ij} x_j$
Matrix-matrix multiplication: $C = AB$
- $A$ is an $m \times n$ matrix
- $B$ is an $n \times p$ matrix
- $C$ is an $m \times p$ matrix
- In machine learning, we often compute $Y = WX$, where the columns of $X$ are the input vectors
- $N$, the number of columns of $X$, is the batch size or the number of input vectors
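A small NumPy sketch of the batch interpretation above (the dimensions M, D, N are arbitrary choices for illustration):

```python
import numpy as np

M, D, N = 4, 3, 5                 # output dim, input dim, batch size (made-up)
W = np.random.randn(M, D)         # weight matrix (M x D)
X = np.random.randn(D, N)         # N input vectors stacked as columns (D x N)

Y = W @ X                         # matrix-matrix product, shape (M, N)
print(Y.shape)                    # (4, 5): one M-dim output per input vector

# Each column of Y is the matrix-vector product of W with one input column
print(np.allclose(Y[:, 0], W @ X[:, 0]))
```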
Vector Space
- A vector space is a set of objects called vectors, which may be added together and multiplied (“scaled”) by numbers, called scalars.
Example: a vector in a 3-D vector space, e.g., [X=2, Y=3, Z=5]
Subspace
- In linear algebra, a linear subspace is a vector space that is a subset of some larger vector space.
- Linear structure = subspace
- Non-linear structure = manifold
Is a curved surface (a 2-D manifold) a linear subspace of a 3-dimensional space?
- No. It is a non-linear manifold, not a linear subspace, because it is not closed under vector addition and scalar multiplication.
Revisit Random Variables and Probability Distributions
Random Variables (RV)
- A Random Variable is a variable whose value is subject to variation due to chance.
- A Random Variable can be discrete:
- E.g., a Random Variable could be X = the number of heads after flipping a fair coin 2 times; then X can only be 0, 1, or 2.
- E.g., throw a fair die once; then X can only be 1, 2, 3, 4, 5, or 6.
- A Discrete Probability Distribution is also called a Probability Mass Function (PMF).
- A Random Variable can also be continuous:
- E.g., the weight of a person; it can take on any value within an interval.
- In the example distribution, the most likely weight (the mode) is 200, i.e., 200 has the highest probability density.
- A Continuous Probability Distribution is also called a Probability Density Function (PDF).
- The function describing the possible values of a Random Variable and their associated probabilities is known as a probability distribution.
Histogram and Probability Density Function (PDF)
- For a continuous random variable, the (normalized) histogram converges to the PDF when the bin size goes to zero and the number of samples goes to infinity
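The convergence of the normalized histogram to the PDF can be checked numerically (a sketch assuming a standard Gaussian; the sample size and bin count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)

# Normalized histogram (density=True makes the bars integrate to 1)
counts, edges = np.histogram(samples, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Compare with the true standard Gaussian PDF at the bin centers
pdf = np.exp(-0.5 * centers**2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(counts - pdf)))   # small, and shrinks with more samples/bins
```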
Conditional Probability
- $p(x, y)$: the joint probability density function of $x$ and $y$
- $p(x \mid y)$ and $p(y \mid x)$: the two conditional probability distributions
- These are formed by extracting the appropriate slice from the joint PDF and normalizing so that the area is one, e.g., $p(x \mid y) = \frac{p(x, y)}{p(y)}$. A similar operation can be performed for discrete distributions.
Rules of Probability
$P(X, Y)$: joint probability
$P(X)$: marginal probability
$P(Y \mid X)$: conditional probability
Sum Rule
Discrete case (summation): $P(X) = \sum_{Y} P(X, Y)$
Continuous case (integration): if $Y$ is continuous, $p(X) = \int p(X, Y)\, dY$
Product Rule
$P(X, Y) = P(Y \mid X)\, P(X)$
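A tiny numerical check of the sum and product rules for a discrete joint distribution (the table values are made up but sum to one):

```python
import numpy as np

# Joint probability table P(X, Y): rows index X, columns index Y (made-up values)
P_XY = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# Sum rule: marginalize out Y by summing over columns
P_X = P_XY.sum(axis=1)            # P(X) = sum_Y P(X, Y)
print(P_X)                        # [0.3, 0.7]

# Product rule: P(X, Y) = P(Y | X) P(X)
P_Y_given_X = P_XY / P_X[:, None] # conditional distribution over Y for each X
print(np.allclose(P_Y_given_X * P_X[:, None], P_XY))
```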
Discrete RV in Machine Learning
- In machine learning, discrete probability distributions are often used for modeling binary and multi-class classification problems.
- E.g., gender classification from a facial image $\mathbf{x}$ (binary classification problem)
- The class label $y \in \{\text{Male}, \text{Female}\}$, i.e., $y$ is a random variable with possible outcomes "Male" or "Female".
- $y$ follows a Bernoulli distribution.
- E.g., classification of a handwritten digit image $\mathbf{x}$ (multi-class classification problem)
- The class label $y \in \{0, 1, \dots, 9\}$, where $P(y = k \mid \mathbf{x}) \geq 0$ and $\sum_{k=0}^{9} P(y = k \mid \mathbf{x}) = 1$
- $y$ is a random variable with possible outcomes $\{0, 1, \dots, 9\}$.
- $y$ follows a Multinoulli (categorical) distribution.
Continuous RV in Machine Learning
- In machine learning, continuous probability distributions are often used for modeling feature vectors.
- E.g., assume that a facial image $\mathbf{x}$ follows a Gaussian distribution:
- $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, i.e., $p(\mathbf{x}) = \dfrac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
- $\mathbf{x}$ is a random vector (variable) with an infinite number of possible outcomes
- Vectors near the mean $\boldsymbol{\mu}$ have the highest chance of occurring
- Note $\int p(\mathbf{x})\, d\mathbf{x} = 1$, just like any PDF.
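A short sketch evaluating a multivariate Gaussian density with SciPy (the 2-D mean and covariance are made-up stand-ins for the image example above):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                     # made-up mean vector
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])                # made-up covariance matrix

gauss = multivariate_normal(mean=mu, cov=Sigma)

# The density is highest at the mean and lower away from it
print(gauss.pdf(mu), gauss.pdf(mu + 2.0))

# Samples drawn from the distribution scatter around the mean
x = gauss.rvs(size=5, random_state=0)
print(x.shape)                                # (5, 2)
```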
Expectation for Discrete RV
For a discrete RV, the expectation (expected value) of a function $f(x)$ with respect to a probability distribution $P(x)$ is the average value that $f$ takes on when $x$ is drawn from $P$: $\mathbb{E}_{x \sim P}[f(x)] = \sum_{x} P(x)\, f(x)$
- Upper case $P$ is used to denote a discrete probability distribution
- If $P$ is uniform, the expected value equals the (unweighted) mean
- If $P$ is not uniform, the expected value is a weighted mean
- E.g., throw a fair die: $P(x) = \frac{1}{6}$ for $x \in \{1, 2, \dots, 6\}$
- $\mathbb{E}[x] = \sum_{x=1}^{6} \frac{1}{6}\, x = 3.5$
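The fair-die expectation can be reproduced directly from the definition (a minimal sketch):

```python
import numpy as np

x = np.arange(1, 7)          # possible outcomes of a fair die
P = np.full(6, 1 / 6)        # uniform PMF

expectation = np.sum(P * x)  # E[x] = sum_x P(x) * x
print(expectation)           # 3.5
```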
Expectation for Continuous RV
For a continuous RV, the expectation (expected value) of a function $f(x)$ with respect to a probability distribution $p(x)$ is the average value that $f$ takes on when $x$ is drawn from $p$: $\mathbb{E}_{x \sim p}[f(x)] = \int p(x)\, f(x)\, dx$
- Lower case $p$ is used to denote a continuous probability distribution (a PDF)
- E.g., the expectation of samples drawn from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ is the mean $\mu$
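A Monte Carlo sketch of the continuous case: the sample average of draws from $\mathcal{N}(\mu, \sigma^2)$ approaches $\mu$ (the values of $\mu$ and $\sigma$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                       # made-up Gaussian parameters

samples = rng.normal(loc=mu, scale=sigma, size=1_000_000)
print(samples.mean())                      # close to mu = 2.0

# Expectation of a function f(x) = x^2 under the same distribution
print(np.mean(samples**2), mu**2 + sigma**2)  # Monte Carlo vs. closed form
```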
Variance and Covariance
Variance gives a measure of how much the values of a function $f$ of a random variable $x$ vary as we draw samples of $x$ from its probability distribution:
$\mathrm{Var}[f(x)] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$
Note $\mathbb{E}[f(x)]$ is the mean of $f(x)$.
The covariance measures how much two functions of RVs ($x$ and $y$) are linearly related to each other, as well as the scale of these variables:
$\mathrm{Cov}[f(x), g(y)] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])\,(g(y) - \mathbb{E}[g(y)])\big]$
- Positive covariance means when x is large (small), y is also large (small)
- Negative covariance means when x is large (small), y is small (large)
Covariance in vector form (the covariance matrix of a random vector $\mathbf{x}$): $\boldsymbol{\Sigma} = \mathrm{Cov}[\mathbf{x}] = \mathbb{E}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^\top\big]$
- Note that $\mathrm{Cov}[\mathbf{x}]$ is now a matrix
Note the symmetry property of the covariance matrix: $\Sigma_{ij} = \Sigma_{ji}$.
E.g., for a 2-D random vector, positively correlated components give positive off-diagonal entries in $\boldsymbol{\Sigma}$, while negatively correlated components give negative off-diagonal entries. Note that in both cases $\Sigma_{12} = \Sigma_{21}$.
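A NumPy sketch of estimating a covariance matrix from samples (the true covariance below is made up; note the symmetric off-diagonal entries):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([0.0, 0.0])
true_Sigma = np.array([[1.0, 0.8],
                       [0.8, 2.0]])        # positive off-diagonal: x1, x2 correlated

X = rng.multivariate_normal(true_mu, true_Sigma, size=100_000)   # (N, 2) samples

# Sample covariance: E[(x - E[x])(x - E[x])^T]
Xc = X - X.mean(axis=0)
Sigma_hat = (Xc.T @ Xc) / (X.shape[0] - 1)
print(Sigma_hat)                           # close to true_Sigma, and symmetric
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))
```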
Variance and Covariance Relation to Machine Learning
- Many machine learning problems are multivariate, e.g., an image with 28 x 28 pixels can be viewed as a 784-dim vector
- Neighboring pixels are positively correlated for smooth images.
- We use a covariance matrix to quantify the variances and the covariances of different pixels: $\boldsymbol{\Sigma} \in \mathbb{R}^{784 \times 784}$ for a 784-dim image vector.
Bayes Theorem and Its Role in Classification Systems
Bayes theorem states the following: $P(Y \mid X) = \dfrac{P(X \mid Y)\, P(Y)}{P(X)}$
- Prior $P(Y)$: probability distribution representing knowledge or uncertainty about a data object prior to (before) observing it
- Posterior $P(Y \mid X)$: conditional probability distribution representing what parameters (or classes) are likely after observing the data object
- Likelihood $P(X \mid Y)$: the probability of observing the data object given a specific category or class
Bayes Theorem
E.g., assume $X$ is our measurement (weight) and $Y$ is the class label:
Prior = $P(Y)$
Likelihood = $p(X \mid Y)$
Posterior $P(Y \mid X) \propto$ Likelihood $\times$ Prior
If Y is discrete: $P(Y \mid X) = \dfrac{p(X \mid Y)\, P(Y)}{\sum_{Y'} p(X \mid Y')\, P(Y')}$
e.g., X is the height of a person and Y is his/her gender ('M' or 'F')
If Y is continuous: $p(Y \mid X) = \dfrac{p(X \mid Y)\, p(Y)}{\int p(X \mid Y')\, p(Y')\, dY'}$
e.g., X is the weight of a person and Y is his/her wealth
Note: When $X$ is continuous, the likelihood $p(X \mid Y)$ is a density, not a probability, and could be greater than 1.
E.g., when the spread of $X$ is very small (e.g., a standard deviation of 0.0001), $p(X \mid Y)$ needs to be higher than 1 over that narrow range to compensate, so that the area under the PDF is still equal to 1.
Only in the continuous case is the likelihood different from a probability.
If x is a continuous random variable, can the likelihood of x be larger than 1?
- Yes.
Example: suppose we know the likelihood $p(x \mid y)$ and the prior $p(y)$.
Bayes Theorem: $p(y \mid x) = \dfrac{p(x \mid y)\, p(y)}{p(x)}$
Therefore:
Marginal Density = $p(x) = \int p(x \mid y)\, p(y)\, dy$ (or $\sum_{y} p(x \mid y)\, P(y)$ if $y$ is discrete)
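A small numerical Bayes example in the spirit of the height/gender case above (all numbers are made up): class-conditional Gaussian likelihoods plus priors give the posterior.

```python
import numpy as np
from scipy.stats import norm

# Made-up model: height (cm) given gender follows a Gaussian
prior = {"M": 0.5, "F": 0.5}
likelihood = {"M": norm(loc=175.0, scale=7.0),   # p(height | M)
              "F": norm(loc=162.0, scale=6.0)}   # p(height | F)

x = 170.0                                        # observed height

# Bayes theorem: posterior is proportional to likelihood * prior; normalize by p(x)
joint = {c: likelihood[c].pdf(x) * prior[c] for c in prior}
evidence = sum(joint.values())                   # marginal density p(x)
posterior = {c: joint[c] / evidence for c in joint}
print(posterior)
```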
Bayes Decision Theory
Decision Rule for 2-class problems ($\omega_1$ means class 1, $\omega_2$ means class 2): Decide $\omega_1$ if $p(\mathbf{x} \mid \omega_1)\, P(\omega_1) > p(\mathbf{x} \mid \omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$.
- Where $P(\omega_i)$ is the prior.
Decision Rule for K-class problems ($\omega_k$ means class $k$): Decide $\omega_{k^*}$, where $k^* = \arg\max_{k} p(\mathbf{x} \mid \omega_k)\, P(\omega_k)$.
- Where $P(\omega_k)$ is the prior.
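A direct implementation sketch of the K-class decision rule (the likelihood functions and priors below are hypothetical placeholders):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 3-class problem with 1-D Gaussian class-conditional densities
class_likelihoods = [norm(loc=-2.0, scale=1.0),
                     norm(loc=0.0, scale=1.0),
                     norm(loc=3.0, scale=2.0)]
priors = np.array([0.2, 0.5, 0.3])

def decide(x):
    """Return the class k that maximizes p(x | class k) * P(class k)."""
    scores = np.array([lik.pdf(x) for lik in class_likelihoods]) * priors
    return int(np.argmax(scores))

print(decide(-1.5), decide(0.2), decide(4.0))   # 0, 1, 2 for these inputs
```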
Gaussian Mixture Model
- Sometimes the distribution may not be (a single) Gaussian.
- A Gaussian mixture model is a linear combination of M Gaussian densities: $p(\mathbf{x}) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$
- $\pi_m$ is the mixture weight (with $\sum_{m=1}^{M} \pi_m = 1$) and $M$ is the number of Gaussian distributions
http://ethen8181.github.io/machine-learning/clustering/GMM/GMM.html
In GMM, each cluster corresponds to a probability distribution, in this case the Gaussian distribution. What we want to do is to learn the parameters of these distributions, which are the Gaussian's mean $\mu$ and variance $\sigma^2$.
Note: The mathematical form of the Gaussian distribution in one dimension (univariate Gaussian) can be written as: $\mathcal{N}(x \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$
- This is also referred to as the probability density function (pdf).
- The Gaussian distribution is commonly referred to as the Normal distribution, hence that's where the $\mathcal{N}$ comes from.
- $x$ refers to the random observation over which this distribution is placed.
- The mean $\mu$ controls the Gaussian's "center position" and the variance $\sigma^2$ controls its "shape". To be precise, it is actually the standard deviation $\sigma$, i.e., the square root of the variance, that controls the distribution's shape.
In higher dimensions, a Gaussian is fully specified by a mean vector $\boldsymbol{\mu}$ and a d-by-d covariance matrix $\boldsymbol{\Sigma}$ (do not confuse this symbol with $\sum$, which is used for denoting summing a bunch of stuff). $|\boldsymbol{\Sigma}|$ refers to the determinant of the covariance matrix. E.g., in two dimensions, the Gaussian's parameters might look like this: $\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}$
The mean vector, containing elements $\mu_1$ and $\mu_2$, centers the distribution along every dimension. On the other hand, the covariance matrix specifies the spread and orientation of the distribution. Along the diagonal of this covariance matrix we have the variance terms $\sigma_1^2$ and $\sigma_2^2$ representing the shape (spread) along each of the dimensions. But then we also have the off-diagonal terms, $\sigma_{12}$ and $\sigma_{21}$ (these two actually take the same value because this is a symmetric matrix), that specify the correlation structure of the distribution.
Remember the covariance matrix is symmetric.
- Diagonal covariance matrix (off-diagonal terms are zero)
- Full covariance matrix (off-diagonal terms are non-zero)
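A minimal sketch of fitting a GMM with scikit-learn on synthetic 2-D data (the data generation and the choice of M = 2 components are assumptions for illustration; the `covariance_type` argument mirrors the full vs. diagonal choice above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from two made-up Gaussian clusters
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(200, 2)),
               rng.normal(loc=[3, 3], scale=1.0, size=(200, 2))])

# M = 2 mixture components with full covariance matrices
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

print(gmm.weights_)               # mixture weights pi_m (sum to 1)
print(gmm.means_)                 # component means mu_m
print(gmm.score_samples(X[:3]))   # log p(x) under the mixture for 3 samples
```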
Gaussian Mixture Model Classifier (GMM Classifier)
Given a data set comprising the samples of K classes, we partition the data into K class-dependent subsets: $\mathcal{X}_k = \{\mathbf{x}_i : \text{label}(\mathbf{x}_i) = k\}$, for $k = 1, \dots, K$
We use the samples in Class $k$ to train the $k$-th GMM: $p(\mathbf{x} \mid k) = \sum_{m=1}^{M} \pi_{km}\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_{km}, \boldsymbol{\Sigma}_{km})$
During classification, given an unknown vector x, we use the Bayes theorem to decide its class label $k^*$: $k^* = \arg\max_{k} P(k \mid \mathbf{x}) = \arg\max_{k} p(\mathbf{x} \mid k)\, P(k)$
Note the training of individual GMMs is unsupervised,
but the construction of the GMM classifier is supervised.
- The training of a GMM is unsupervised in that only the feature vectors of the k-th class (not their labels) are used for training the k-th GMM.
- We also did not train the GMM to map x to a class label c.
- Instead, we train the k-th GMM to generate the data in the k-th class.
- To use GMMs for classification, we train one GMM per class. So, the label information of the training data is implicitly used when constructing a GMM classifier.
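A sketch of the GMM classifier described above, using scikit-learn's GaussianMixture (the synthetic two-class data, the number of components, and the equal priors are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Made-up training data for K = 2 classes
X0 = rng.normal(loc=[0, 0], scale=0.7, size=(300, 2))     # class 0 samples
X1 = rng.normal(loc=[4, 1], scale=1.0, size=(300, 2))     # class 1 samples

# Train one GMM per class (unsupervised within each class)
gmms = [GaussianMixture(n_components=2, random_state=0).fit(Xk) for Xk in (X0, X1)]
log_priors = np.log([0.5, 0.5])                            # equal class priors (assumed)

def classify(x):
    """Bayes rule: argmax_k log p(x | k) + log P(k)."""
    x = np.atleast_2d(x)
    scores = [gmm.score_samples(x)[0] + lp for gmm, lp in zip(gmms, log_priors)]
    return int(np.argmax(scores))

print(classify([0.2, -0.1]), classify([4.5, 1.2]))         # 0, 1
```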
Support Vector Machines
- The support vector machine (SVM) was an important machine learning model during the second AI winter.
- It is very good at performing classification tasks.
- Even with the recent advances in deep neural networks, SVMs still play an essential role in many practical systems.
- In many real-world problems, we do not have enough data to train a DNN for the classification task.
- SVMs are suitable for small amounts of data.
- One solution is to train a DNN using a massive amount of data from another, similar data source and use the DNN as a feature extractor (its outputs are called embeddings). Then, we use the small number of feature vectors to train an SVM for our task.
Margin Maximization
- An SVM is trained by maximizing the margin between the hyperplanes of the two classes.
- We aim to find a decision plane that maximizes the margin so as to obtain better generalization to unseen data.
SVM Classifier
- SVM is good for binary classification.
- To classify multiple classes, we can use the one-vs-rest approach to convert a K-class classification into K binary classification problems
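A brief scikit-learn sketch of the one-vs-rest approach with RBF-SVMs (the 2-D toy data and hyperparameters are made up):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
# Made-up 3-class toy data in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 4])])
y = np.repeat([0, 1, 2], 100)

# One-vs-rest: K binary RBF-SVMs combined into a K-class classifier
clf = OneVsRestClassifier(SVC(kernel='rbf', C=1.0, gamma='scale')).fit(X, y)
print(clf.predict([[0.1, 0.2], [3.9, -0.1], [2.0, 4.2]]))   # [0 1 2]
```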
Explain why it is always possible to use a plane (linear SVM) to divide 3 linearly independent samples into two classes. What happens when the 3 samples are linearly dependent?
- When the 3 samples are not linearly dependent, they span a plane in the 3-D space, i.e., they become the corners of a triangle. As a result, a linear SVM can classify the 3 points perfectly.
- On the other hand, when the 3 points are co-linear (linearly dependent), they form a line in the 3-D space. All points on the line can be specified by the first two dimensions, because the third dimension depends on the first two. As we now have more points (N = 3) than the dimension of the feature vectors (D = 2), we need a non-linear classifier such as an RBF-SVM. The following diagram illustrates such a situation.
Are the support vectors of an SVM usually near the decision boundary?
Yes.
MLP vs GMM vs SVM
- You may consider GMM and SVM as special forms of an MLP with only one hidden layer, one output, and a special activation function
Why can we consider Gaussian mixture models (GMMs) and support vector machines with non-linear kernel functions (e.g., RBF-SVM) as neural networks with a single hidden layer?
The forward propagation equations of GMMs and RBF-SVMs are linear weighted sums of some non-linear functions. Therefore, the equations have the same form as a neural network with one hidden layer.
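To make the analogy concrete, both score functions can be written as a weighted sum of non-linear "hidden units" (a sketch; the symbols follow the GMM notation used earlier, and the RBF-SVM form uses the standard kernel-expansion notation with support vectors $\mathbf{x}_s$):

```latex
% GMM: the hidden units are the M Gaussian densities; the output weights are the mixture weights
p(\mathbf{x}) = \sum_{m=1}^{M} \pi_m \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)

% RBF-SVM: the hidden units are RBF kernels centered at the S support vectors
f(\mathbf{x}) = \sum_{s=1}^{S} \alpha_s y_s \, \exp\!\left(-\gamma \lVert \mathbf{x} - \mathbf{x}_s \rVert^2\right) + b
```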
Questions
1(a)
For distributions with a common covariance structure, the decision boundaries are hyper-planes.
1(b)
For distributions with arbitrary covariance structures, the decision boundaries are quadratic surfaces (hyper-quadrics, e.g., hyper-ellipsoids or hyper-spheres).
2(a)
None of the central pixels (14, 14) in the 5,000 training samples contains the strokes of Digit '0', so there are no intensities in the stroke range of 0.6 to 1. Because the digits were contaminated by Gaussian noise, the frequency of the intensity being exactly 0 is also 0. Therefore, as the bin size goes to 0, the histogram approaches a Gaussian-like PDF concentrated near 0, with no mass in the stroke-intensity range (0.6 to 1) and no spike at exactly 0 (because of the noise contamination).
2(b)
Recall the formulation of the GMM classifier:
Given a data set comprising the samples of K classes, we partition the data into K class-dependent subsets: $\mathcal{X}_k = \{\mathbf{x}_i : \text{label}(\mathbf{x}_i) = k\}$, for $k = 1, \dots, K$
We use the samples in Class $k$ to train the $k$-th GMM: $p(\mathbf{x} \mid k) = \sum_{m=1}^{M} \pi_{km}\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_{km}, \boldsymbol{\Sigma}_{km})$
During classification, given an unknown vector x, we use the Bayes theorem to decide its class label $k^*$: $k^* = \arg\max_{k} P(k \mid \mathbf{x}) = \arg\max_{k} p(\mathbf{x} \mid k)\, P(k)$
Therefore:
We first convert the training images into vectors by stacking the pixels, which results in 5,000 training vectors per digit. Then, we train 10 Gaussian models (density functions) with mean vectors $\boldsymbol{\mu}_k$ and covariance matrices $\boldsymbol{\Sigma}_k$, one for each digit ($k = 0, \dots, 9$). A Gaussian classifier is formed as follows: $k^* = \arg\max_{k} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, P(k)$
2(c)
Given a test vector x, the Gaussian classifier determines the class label of x by the following formula: $k^* = \arg\max_{k} p(\mathbf{x} \mid k)\, P(k)$
where $p(\mathbf{x} \mid k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the likelihood of x given the Gaussian model of Digit $k$. The prior probability is obtained as follows: $P(k) = \dfrac{N_k}{\sum_{j=0}^{9} N_j}$, where $N_k$ is the number of training samples of Digit $k$.
2(d)
No. With only 100 samples per digit, we cannot estimate a valid (invertible) full covariance matrix. The sample covariance matrix of the 784-dimensional vectors estimated from 100 samples is rank-deficient (rank at most 99), so it is singular and has no inverse.
2(e)
Change the full covariance matrices to diagonal covariance matrices. Each diagonal element (a per-pixel variance) is now estimated from 100 samples, so the estimate is reliable and will not be 0.
A diagonal matrix is always invertible as long as none of its diagonal terms is zero; because of the Gaussian noise, every pixel's variance is a positive, non-zero number, so the diagonal covariance matrix can always be inverted.
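A NumPy illustration of the point above (the data are synthetic Gaussian noise with D = 784 pixels and N = 100 samples, matching the question): the full sample covariance from too few samples is singular, while the diagonal version is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 784, 100                       # feature dimension and samples per digit
X = rng.normal(size=(N, D))           # synthetic "pixel" vectors with Gaussian noise

Xc = X - X.mean(axis=0)
full_cov = (Xc.T @ Xc) / (N - 1)      # 784 x 784 full covariance from 100 samples
print(np.linalg.matrix_rank(full_cov))        # at most 99 < 784, i.e., singular

diag_cov = np.diag(Xc.var(axis=0, ddof=1))    # keep only the per-pixel variances
print(np.all(np.diag(diag_cov) > 0))          # True: every variance is positive
inv_diag = np.diag(1.0 / np.diag(diag_cov))   # diagonal matrices invert element-wise
print(np.allclose(inv_diag @ diag_cov, np.eye(D)))
```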
2(f)
When the number of training samples per digit is 1, using diagonal covariance matrices will not help because the variances are still 0. In such a situation, we may use a common diagonal covariance matrix shared by all of the 10 digits, so that each diagonal element will be estimated from 10 numbers (one per digit); then the variance will not be 0. A better solution is NOT to use a Gaussian classifier at all, e.g., to use a linear SVM instead.