Gaussian Classifier / GMM

Q0. Assume that you are given $N$ independent handwritten-digit images, each with a size of $28 \times 28$ pixels. Assume also that you have converted the images into vectors $\mathcal{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$.

Q0(a) Suggest a way to convert the images into vectors such that each pixel of an image corresponds to an element in the corresponding vector.

For each image, flatten the $28 \times 28$ pixel array into a $784$-dimensional vector by stacking its rows or columns.
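
For illustration, here is a minimal NumPy sketch of this flattening; the array name `images` and its shape `(N, 28, 28)` are assumptions made for the example.

```python
import numpy as np

# Hypothetical input: N grayscale images of size 28 x 28.
N = 1000
images = np.random.randint(0, 256, size=(N, 28, 28))

# Flatten each image into a 784-dim vector by stacking its rows.
X = images.reshape(N, 28 * 28)      # shape: (N, 784)

# Stacking columns instead: images.transpose(0, 2, 1).reshape(N, -1)
print(X.shape)                      # (1000, 784)
```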

Q0(b) If we want to use a Gaussian density function with a full covariance matrix to approximate the distribution of $\mathcal{X}$, what will be the minimum value of $N$? Briefly explain your answer.

We need to ensure that the number of training samples is larger than the feature dimension ($N > D$) in order to avoid the curse of dimensionality.

Number of training samples = $N$

The feature dimension is $D = 28 \times 28 = 784$.

In order to have an invertible covariance matrix $\boldsymbol{\Sigma}$, we need $\boldsymbol{\Sigma}$ to be of full rank.

As the dimension of $\boldsymbol{\Sigma}$ is $784 \times 784$, we need at least $784 + 1 = 785$ samples to make it full rank (a sample covariance computed with the sample mean has rank at most $N-1$).
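
As a quick sanity check of this counting argument, the NumPy snippet below (with synthetic data and a deliberately small dimension, both assumptions for the example) shows that the sample covariance has rank at most $N-1$, so $N \ge D + 1$ is needed for it to be invertible.

```python
import numpy as np

D = 50                                   # small dimension, just for a quick check
for N in (D, D + 1):
    X = np.random.randn(N, D)            # N independent samples
    Sigma = np.cov(X, rowvar=False)      # D x D sample covariance (uses the sample mean)
    rank = np.linalg.matrix_rank(Sigma)
    print(f"N = {N}: rank(Sigma) = {rank} (full rank needs {D})")
# With N = D the rank is D - 1 (singular); with N = D + 1 it is D (invertible).
```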

Q0(c) What will be your answer in Q0(b) if $\mathbf{x}_1 = 3\mathbf{x}_2 + 2\mathbf{x}_3$? Briefly explain your answer.

This means that:

  • $\mathbf{x}_{1}$ depends on $\mathbf{x}_{2}$ and $\mathbf{x}_{3}$
  • $\mathbf{x}_{1}$ is redundant if we know $\mathbf{x}_{2}$ and $\mathbf{x}_{3}$
    • Therefore, 1 sample, $\mathbf{x}_{1}$, is redundant in our data

As a result, we need 1 extra sample to make sure that $\operatorname{rank}(\boldsymbol{\Sigma})=784$.

Therefore, the minimum value of $N$ is $785 + 1 = 786$.

Q0(d) How would you reduce the dimension of $\mathbf{x}_i$, $i = 1, \ldots, N$, to 5 using principal component analysis?

Recall the procedure of PCA:

Given a set of training vectors, if we want to find the subspace:

  • Let $u_1$ be the vector representing the first axis of the subspace
  • Let $X=\{x_1, \ldots, x_N\}$ be the training vectors
  • The mean of the projected data is $u_1^T \bar{x}$ (note that $u_1^T$ is the transpose of $u_1$), where

\bar{x} = x_{avg} = \frac{1}{N} \sum^N_{n=1} x_n

In this case:

First, we compute the covariance matrix

\boldsymbol{\Sigma}=\frac{1}{N} \sum_{i=1}^{N}\left(\mathbf{x}_{i}-\boldsymbol{\mu}\right)\left(\mathbf{x}_{i}-\boldsymbol{\mu}\right)^{\top}

where $\boldsymbol{\mu}=\frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_{i}$.

Then, we compute the eigenvectors and eigenvalues of $\boldsymbol{\Sigma}$, select the top-5 eigenvectors whose eigenvalues are the largest, and project the vectors using

\mathbf{y}_{i}=\mathbf{W}^{\top}\left(\mathbf{x}_{i}-\boldsymbol{\mu}\right), \quad \mathbf{y}_{i} \in \mathbb{R}^{5},

where $\mathbf{W}$ comprises the top-5 eigenvectors in its columns.
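
A minimal NumPy sketch of this procedure; the data matrix `X` of shape `(N, 784)` and the random placeholder data are assumptions for the example.

```python
import numpy as np

def pca_project(X, P=5):
    """Project the rows of X onto the top-P principal components."""
    mu = X.mean(axis=0)                              # mean vector mu
    Xc = X - mu                                      # center the data
    Sigma = Xc.T @ Xc / X.shape[0]                   # covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # eigh returns ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:P]]    # top-P eigenvectors as columns of W
    Y = Xc @ W                                       # y_i = W^T (x_i - mu) for every row
    return Y, W, mu

# Random data standing in for the 784-dim digit vectors.
X = np.random.randn(200, 784)
Y, W, mu = pca_project(X, P=5)
print(Y.shape)   # (200, 5)
```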

Q0(e)

img

The larger the variation of the data along a principal component, the larger its eigenvalue.

Figure 1(a) is the eigenvector corresponding to the last PC. This is because the variation in the central part of the image is very small when compared with that in Figure 1(b).

Figure 1(b) is the eigenvector corresponding to the first PC. This is because the variation in the central part of the image is very large when compared with that in Figure 1(a).

Q0(f) Assume that N = 10,000. What is the maximum number of eigenvectors (with non-zero eigenvalues) that you could obtain? Briefly explain your answer.

The maximum number of eigenvectors with non-zero eigenvalues is 784 because the dimension of $\mathbf{x}_i$ is 784 and $N = 10{,}000 > 784$, so $\boldsymbol{\Sigma}$ can be of full rank.

img

Q1(a) If $K = 100$, what is the theoretical minimum value of N for the Gaussian classifier to recognize the persons in this dataset? Show how you calculate this value.

We need to ensure that the number of training samples is larger than the feature dimension ($N > D$) in order to avoid the curse of dimensionality.

Number of training samples = $K \times N$

The feature dimension is $D = 200 \times 200 = 40000$.

As $K = 100$:

$100 N > 200 \times 200$

$\Rightarrow N > 400$

Therefore, the minimum value of $N$ is $400 + 1 = 401$.

Q1(b) Determine the theoretical minimum value of N if we have the following relationships:

xk,5=2xk,4+3xk,3+1xk,4=3xk,3+xk,2\begin{aligned} \mathbf{x}_{k, 5} &=2 \mathbf{x}_{k, 4}+3 \mathbf{x}_{k, 3}+\mathbf{1} \\ \mathbf{x}_{k, 4} &=3 \mathbf{x}_{k, 3}+\mathbf{x}_{k, 2} \end{aligned}

where $k=1, \ldots, K$ and $\mathbf{1}$ is a vector containing all 1's. Briefly explain your answer.

This means that:

  • $\mathbf{x}_{k,5}$ depends on $\mathbf{x}_{k,2}$ and $\mathbf{x}_{k,3}$
  • $\mathbf{x}_{k,4}$ depends on $\mathbf{x}_{k,2}$ and $\mathbf{x}_{k,3}$
  • $\mathbf{x}_{k,4}$ and $\mathbf{x}_{k,5}$ are redundant if we know $\mathbf{x}_{k,2}$ and $\mathbf{x}_{k,3}$.
    • Therefore, 2 samples, $\mathbf{x}_{k,4}$ and $\mathbf{x}_{k,5}$, are redundant in our data

As a result, we need 2 extra samples for each person to make sure that $\operatorname{rank}(\boldsymbol{\Sigma}) = 200 \times 200 = 40000$.

Therefore, the minimum value of $N$ is $401 + 2 = 403$.

Q1(c) Briefly describe the procedure for training the Gaussian classifier. Write the equation that computes the mean vectors of the Gaussian classifier. Also state the parameters of the classifier.

We use $\mathcal{X}_{k}$ to estimate the mean vector $\boldsymbol{\mu}_{k}$ of person $k$, where $k=1, \ldots, K$. Specifically, we compute

\boldsymbol{\mu}_{k}=\frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_{k, i} .

Then, a Gaussian classifier can be constructed by using the parameters $\left\{\boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}\right\}_{k=1}^{K}$, where $\boldsymbol{\Sigma}$ is a covariance matrix shared by all classes.
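
Below is a sketch of this training procedure in NumPy. The per-class data layout (`X_list[k]` holding person $k$'s vectors) and the pooled, per-class-centered estimate of the shared covariance $\boldsymbol{\Sigma}$ are assumptions made for the example rather than details stated above.

```python
import numpy as np

def train_gaussian_classifier(X_list):
    """X_list[k]: (N_k, D) array of training vectors for person k."""
    means = [Xk.mean(axis=0) for Xk in X_list]               # class means mu_k
    # Shared covariance Sigma: pool the per-class-centered vectors of all classes.
    centered = np.vstack([Xk - mu for Xk, mu in zip(X_list, means)])
    Sigma = centered.T @ centered / centered.shape[0]
    return np.array(means), Sigma                            # parameters {mu_k, Sigma}

# Toy example: K = 3 persons, D = 5 features, N = 20 vectors per person.
rng = np.random.default_rng(0)
X_list = [rng.normal(loc=k, size=(20, 5)) for k in range(3)]
means, Sigma = train_gaussian_classifier(X_list)
print(means.shape, Sigma.shape)   # (3, 5) (5, 5)
```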

Q1(d) Are the decision boundaries produced by this Gaussian classifier linear or non-linear? Briefly explain your answer.

The decision boundaries are linear because the covariance matrices of all classes (Gaussian models) are the same.

Q1(e) Assume that all of the fingerprint vectors are independent. Describe how you would use principal component analysis (PCA) to reduce the dimension of the vectors. State the equation for projecting $\mathbf{x}_{k,i}$ to a $P$-dimensional vector $\mathbf{y}_{k,i}$.

Recall the procedure of PCA:

Given a set of training vectors, if we want to find the subspace:

  • Let $u_1$ be the vector representing the first axis of the subspace
  • Let $X=\{x_1, \ldots, x_N\}$ be the training vectors
  • The mean of the projected data is $u_1^T \bar{x}$ (note that $u_1^T$ is the transpose of $u_1$), where

\bar{x} = x_{avg} = \frac{1}{N} \sum^N_{n=1} x_n

In this case, we apply eigen-decomposition to the global covariance matrix $\boldsymbol{\Sigma}$ to obtain at most $K \times N$ eigenvectors with non-zero eigenvalues, sorted in decreasing order of their eigenvalues.

Then, we select the first $P$ eigenvectors $\{u_i\}^P_{i=1}$ with the largest eigenvalues to form a transformation matrix $U = [u_1, u_2, \ldots, u_P]$.

Then, we project every fingerprint vector as follows:

\mathbf{y}_{k, i}=\mathbf{U}^{\top}\left(\mathbf{x}_{k, i}-\boldsymbol{\mu}\right), \quad \mathbf{y}_{k, i} \in \mathbb{R}^{P}

Note that this naive method is impractical here, as the dimension of $\boldsymbol{\Sigma}$ ($40000 \times 40000$) is too high.

Q1(f) Denote $P(C_k)$ as the prior probability of Person $k$. Describe how you would use the Gaussian classifier and Bayes' theorem to classify an unknown fingerprint vector $\mathbf{x}$. Hint: You may use Bayes' theorem to compute the posterior probability of Person $k$ given $\mathbf{x}$.

The posterior probability of Person k given x is given by

P\left(C_{k} \mid \mathbf{x}\right)=\frac{P\left(C_{k}\right) p\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}\right)}{\sum_{j=1}^{K} P\left(C_{j}\right) p\left(\mathbf{x} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}\right)}

where $p\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}\right)$ is the output of the $k$-th Gaussian model in the Gaussian classifier. Then, we assign $\mathbf{x}$ to Person $l$ according to

l=\arg \max _{k} P\left(C_{k} \mid \mathbf{x}\right)
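
A sketch of this decision rule using SciPy's multivariate normal density; the toy means, covariance, and priors are placeholders, and the computation is done in the log domain for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(x, means, Sigma, priors):
    """Return l = argmax_k P(C_k | x) for a Gaussian classifier with a shared covariance."""
    # log P(C_k) + log p(x | mu_k, Sigma); the denominator p(x) is common to all k.
    scores = [np.log(priors[k]) + multivariate_normal.logpdf(x, mean=means[k], cov=Sigma)
              for k in range(len(means))]
    return int(np.argmax(scores))

# Toy example: K = 3 classes in D = 5 dimensions with equal priors.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 5))
Sigma = np.eye(5)
priors = np.full(3, 1.0 / 3)
print(classify(np.zeros(5), means, Sigma, priors))
```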

img

Q2(a) Express $\log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ in terms of $\mathbf{x}$, the mean vector $\boldsymbol{\mu}_j$, and the covariance matrix $\boldsymbol{\Sigma}_j$.

\log \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)=-\frac{D}{2} \log (2 \pi)-\frac{1}{2} \log \left|\boldsymbol{\Sigma}_{j}\right|-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{j}\right)^{\top} \boldsymbol{\Sigma}_{j}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{j}\right)

Q2(b) Assume that there is a special sample $\mathbf{x}_0$ that is very far away from all of the mean vectors $\boldsymbol{\mu}_j$'s. Use your answer in Q2(a) to explain the numerical problem that may occur when computing the log-likelihood $\log p(\mathbf{x}_0 \mid \Lambda)$.

When $\mathbf{x}$ is far away from $\boldsymbol{\mu}_j$ for all $j$, $\log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ will be a very large negative number for all $j$. As a result, $\exp\{\log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\}$ underflows to 0 for all $j$, and the resulting $\log p(\mathbf{x} \mid \Lambda)$ becomes $-\infty$.
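
A small demonstration of this underflow and of the usual remedy, the log-sum-exp trick; the per-component log-likelihood values below are made-up placeholders.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical values of log[w_j * N(x0 | mu_j, Sigma_j)] for a sample x0
# that is very far away from every component mean.
log_terms = np.array([-1500.0, -1520.0, -1490.0])

naive = np.log(np.sum(np.exp(log_terms)))   # exp() underflows to 0, so log(0) = -inf
stable = logsumexp(log_terms)               # shift by the max before exponentiating

print(naive)    # -inf (with a divide-by-zero warning)
print(stable)   # approximately -1490.0
```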

Q2(c) If $D = 100$ and $M = 1$, what is the minimum number of training samples $N$ for the EM algorithm to compute the parameters in $\Lambda$?

We need to ensure that the number of training samples is larger than the feature dimension ($N > D$) in order to avoid the curse of dimensionality.

The minimum number of independent training samples is $N = 100 + 1 = 101$.

Q2(d) If $D = 100$, $M = 1$, and $\boldsymbol{\Sigma}_1$ is a diagonal matrix, what is the minimum number of training samples $N$ for the EM algorithm to compute the parameters in $\Lambda$?

For a diagonal covariance matrix, the situation is different: each diagonal element is just the variance of a single feature, so it only needs to be non-zero rather than being part of a full-rank matrix.

The minimum number of independent training samples is $N = 2$.

img

Q3(a) Sketch the distribution (histogram) of the 407th pixel as indicated by the arrow shown in Fig. Q3. You may not need to specify the actual values in the y-axis in your histogram. But your answer should specify the range of the x-axis.

Note that the data are clean, so the 407th pixel can only be either 255 or 0 (white or black). The histogram therefore consists of two spikes, at 0 and 255, with the x-axis ranging from 0 to 255.

img

Q3(b) Explain why Gaussian classifiers are not appropriate for classifying the digits in this set of data.

The histograms of the pixels do not follow a Gaussian distribution: the peaks are so sharp that they cannot be modeled even by a GMM. Therefore, Gaussian density functions cannot fit the distribution well.

Q3© How would you process the data so that they become suitable for being classified by Gaussian classifiers.

Add Gaussian noise to the data so that the histograms of pixels look more like a Gaussian distribution. Alternatively, you may perform PCA to reduce the dimension and hope that the first few principal components have sufficient variance to produce a non-singular covariance matrix.
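
A minimal sketch of the first suggestion (adding Gaussian noise), with synthetic clean data standing in for the digit images; the pixel layout and noise level are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clean data: 100 vectors of 784 pixels; the first 300 pixels are
# constant background (0), the remaining pixels are either 0 or 255.
X = np.zeros((100, 784))
X[:, 300:] = rng.choice([0.0, 255.0], size=(100, 484))

# Add small-variance Gaussian noise so that every pixel has a non-zero variance
# and its histogram looks roughly Gaussian around 0 or 255.
X_noisy = X + rng.normal(scale=5.0, size=X.shape)

print(X.var(axis=0).min(), X_noisy.var(axis=0).min())   # 0.0 vs. a positive value
```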

Q4. Assume that you have built a Gaussian classifier for a two-class problem. Assume also that the means ($\mu_k$) and covariance matrices ($\Sigma_k$) in your Gaussian classifier are as follows:
Class 1: $\mu_{1}=\left[\begin{array}{l}1 \\ 1\end{array}\right]$ and $\Sigma_{1}=\left[\begin{array}{ll}1 & 0 \\ 0 & 1\end{array}\right]$
Class 2: $\mu_{2}=\left[\begin{array}{l}-1 \\ -1\end{array}\right]$ and $\Sigma_{2}=\left[\begin{array}{ll}2 & 0 \\ 0 & 2\end{array}\right]$
Assume that both classes have equal prior, i.e., $P(C_1)=P(C_2)$.

Q4(a) Mark the location of the class means and sketch the decision boundary of the two classes, i.e., plot the locus of $p(\mathbf{x} \mid \mu_1, \Sigma_1) = p(\mathbf{x} \mid \mu_2, \Sigma_2)$, where $\mathbf{x}$ are random vectors from either Class 1 or Class 2.

Note:

  • the mean $\mu$ determines the center of the distribution
  • the covariance matrix $\Sigma$ determines the spread of the distribution

Therefore:

For the distribution of class 1:

  • the mean is at (1, 1)
  • spread: horizontal and vertical both 1

For the distribution of class 2:

  • the mean is at (-1, -1)
  • spread: horizontal and vertical both 2

Since $\Sigma_1 \neq \Sigma_2$, the locus $p(\mathbf{x} \mid \mu_1, \Sigma_1) = p(\mathbf{x} \mid \mu_2, \Sigma_2)$ is quadratic: a circle that encloses the class-1 mean (the class with the smaller variance).
img

Q4(b)

img

For distributions with arbitrary covariance structures, the decision boundaries are hyperquadrics (e.g., ellipses, parabolas, or hyperbolas in 2-D).

Therefore:

For the distribution of class 1:

  • the mean is at (0, 0)
  • spread: horizontal 1 and vertical 2

For the distribution of class 2:

  • the mean is at (1, 1)
  • spread: horizontal and vertical both 1
img

img

5(a)

As none of the central pixels (14, 14) of the 5,000 training samples contains a stroke of Digit '0', there are no intensities in the stroke range of about 0.6 to 1. Because the digits were contaminated by Gaussian noise, the frequency of intensity exactly 0 is also 0. We can therefore sketch the histogram as a Gaussian-shaped curve centered near 0 (the noise around the background level), with no mass at exactly 0 and no mass in the stroke range 0.6 to 1.

img

img

5(b)

Recall Formula of GMM Classifier:

Given a data set comprising the samples of K classes, we partition the data into K class-dependent subsets:

\mathcal{X}=\left\{\mathcal{X}_{1}, \ldots, \mathcal{X}_{k}, \ldots, \mathcal{X}_{K}\right\}

We use the samples in Class $k$ to train the $k$-th GMM:

\mathcal{X}_{k}=\left\{\mathbf{x}_{k, 1}, \ldots, \mathbf{x}_{k, N_{k}}\right\} \xrightarrow{\text{ EM Algorithm }} \Lambda_{k}=\left\{w_{k j}, \boldsymbol{\mu}_{k j}, \boldsymbol{\Sigma}_{k j}\right\}_{j=1}^{M}

During classification, given an unknown vector $\mathbf{x}$, we use Bayes' theorem to decide its class label $c$:

c=\arg \max _{k} P\left(C_{k} \mid \mathbf{x}\right)=\arg \max _{k} p\left(\mathbf{x} \mid C_{k}\right) P\left(C_{k}\right)

p\left(\mathbf{x} \mid C_{k}\right)=\sum_{j=1}^{M} w_{k j} p\left(\mathbf{x} \mid \boldsymbol{\mu}_{k j}, \boldsymbol{\Sigma}_{k j}\right)

Therefore:

We first convert the training images into vectors by stacking the pixels, which results in 5,000 training vectors per digit. Then, we train 10 Gaussian models (density functions) with mean vectors and covariance matrices $\Lambda=\left\{\mu_{i}, \Sigma_{i}\right\}_{i=0}^{9}$, one for each digit. A Gaussian classifier is formed as follows:

\text { Recognized digit }=\underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} P\left(C_{k} \mid \mathbf{x}\right)

\begin{array}{l} =\underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} \frac{P\left(C_{k}\right) N\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{\sum_{j=0}^{9} P\left(C_{j}\right) N\left(\mathbf{x} ; \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)} \quad \text { Bayes' rule } \\ =\underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} \frac{P\left(C_{k}\right) N\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{p(\mathbf{x})} \quad \text { Law of total probability } \\ =\underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} P\left(C_{k}\right) N\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right) \quad p(\mathbf{x}) \text { is independent of } k \end{array}

img

Given a test vector x, the Gaussian classifier determines the class label of x by the following formula:

\begin{aligned} l(\mathbf{x})=& \underset{i}{\operatorname{argmax}}\left\{\begin{array}{cc} P(i \text { is even }) p\left(\mathbf{x} \mid \mu_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text { is even } \\ P(i \text { is odd }) p\left(\mathbf{x} \mid \mu_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text { is odd } \end{array}\right\} \\ &=\underset{i}{\operatorname{argmax}}\left\{\begin{array}{cc} \frac{2}{3} p\left(\mathbf{x} \mid \mu_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text { is even } \\ \frac{1}{3} p\left(\mathbf{x} \mid \mu_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text { is odd } \end{array}\right\} \end{aligned}

where $p\left(\mathbf{x} \mid \mu_{i}, \Sigma_{i}\right)$ is the likelihood of $\mathbf{x}$ given the Gaussian model of Digit $i$. The prior probabilities are obtained as follows:

\begin{array}{l} P_{e}=2 P_{o} \quad \text { and } \quad P_{e}+P_{o}=1 \\ \Rightarrow 2 P_{o}+P_{o}=1 \\ \Rightarrow P_{o}=\frac{1}{3} \\ \Rightarrow P_{e}=\frac{2}{3} \end{array}

img

5(d)

No. This is because 100 samples cannot produce a full covariance matrix with a valid inverse: the rank of a covariance matrix estimated from 100 samples is at most 99, which is far smaller than the feature dimension, so the matrix is singular.

img

5(e)

Change the full covariance matrices to diagonal covariance matrices. Because each element on the diagonal is now estimated from 100 samples, the variance estimate is reliable and will not be 0.

A diagonal matrix is always invertible as long as none of its diagonal entries is zero, and here each variance is a positive, non-zero number. So the diagonal covariance matrices can be inverted reliably.

img

5(f)

When the number of training samples per digit is 1, using diagonal covariance matrices will not help because the variances are still 0. In such a situation, we may use a common diagonal covariance matrix for all of the 10 digits so that each diagonal element is estimated from 10 values (one per digit); then the variances will not be 0. A better solution is not to use a Gaussian classifier at all, e.g., to use a linear SVM instead.

SVM

Q1. The output of a support vector machine (SVM) is given by

f(\mathbf{x})=\sum_{i \in \mathcal{S}} a_{i} K\left(\mathbf{x}, \mathbf{x}_{i}\right)+b

where $\mathbf{x}$ is an input vector in $\mathbb{R}^{D}$, $\mathcal{S}$ comprises the indexes of a set of support vectors, $b$ is a bias term, the $a_{i}$'s are SVM parameters, and $K(\cdot, \cdot)$ is a kernel.

Q1(a) If $K(\mathbf{x}, \mathbf{x}_i)$ is a radial basis function (RBF) kernel, express $K(\mathbf{x}, \mathbf{x}_i)$ in terms of $\mathbf{x}$, $\mathbf{x}_i$, and the kernel width $\sigma$.

Note:

For Linear Kernel,

K(\mathbf{x}_i, \mathbf{x}) = \mathbf{x}_i^T \mathbf{x} \quad\quad \phi(\mathbf{x}) = \mathbf{x}

For RBF Kernel,

K\left(\mathbf{x}, \mathbf{x}_{i}\right)=\exp \left\{-\frac{1}{2 \sigma^{2}}\left\|\mathbf{x}-\mathbf{x}_{i}\right\|^{2}\right\}

For Polynomial Kernel,

K\left(\mathbf{x}_{i}, \mathbf{x}\right)=\left(1+\mathbf{x} \cdot \mathbf{x}_{i}\right)^{p}, \quad p>0

Therefore, if $K(\mathbf{x}, \mathbf{x}_i)$ is a radial basis function (RBF) kernel,

K\left(\mathbf{x}, \mathbf{x}_{i}\right)=\exp \left\{-\frac{1}{2 \sigma^{2}}\left\|\mathbf{x}-\mathbf{x}_{i}\right\|^{2}\right\}
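
For concreteness, a direct NumPy implementation of this kernel (a small sketch, independent of any particular SVM library):

```python
import numpy as np

def rbf_kernel(x, xi, sigma=1.0):
    """K(x, x_i) = exp(-||x - x_i||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

x  = np.array([1.0, 2.0])
xi = np.array([0.0, 0.0])
print(rbf_kernel(x, xi, sigma=1.0))   # exp(-2.5) ~= 0.0821
```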

Q1(b) Explain why $f(\mathbf{x})$ is a nonlinear function of $\mathbf{x}$ if $K(\mathbf{x}, \mathbf{x}_i)$ is an RBF kernel.

Because

\left\|\mathbf{x}-\mathbf{x}_{i}\right\|^{2} = (\mathbf{x}-\mathbf{x}_i)^{T}(\mathbf{x}-\mathbf{x}_i)

which is quadratic in $\mathbf{x}$.

Also, the exponential function is non-linear.

Therefore, $f(\mathbf{x})$ is a nonlinear function of $\mathbf{x}$ if $K(\mathbf{x}, \mathbf{x}_i)$ is an RBF kernel.

Q1(c)

img

Q1(d) Repeat Q1(c) when $\sigma \rightarrow 0$.

Answer to Q1(c) and Q1(d):

img

Key points:

  • when $\sigma$ is significantly larger than the range of the data on the 2-D plane, the RBF kernel acts like a linear kernel.

  • when $\sigma$ is significantly small, the SVM overfits the data.

img

Q2(a) Assume that you train a set of one-versus-rest support vector machines (SVMs) to construct an SVM classifier for classifying the digits. Draw a block diagram illustrating the architecture of your SVM classifier.

img

Q2(b) Suggest the most appropriate kernel function (i.e., linear, polynomial or radial basis function (RBF)) for the SVMs. Briefly explain your choice.

  • D = 28 x 28 = 784
  • N = 10 x 10 = 100

In this case, the feature dimension is larger than the number of training samples ($N < D$), so a hyperplane can perfectly separate the training data into two classes. A linear kernel is therefore the most appropriate.
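
The sketch below builds such a one-versus-rest classifier with scikit-learn's `OneVsRestClassifier` and a linear-kernel `SVC`; the random data standing in for the digit images are an assumption for the example.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical data: 10 digit classes, 10 training images each, flattened to 784-dim.
X = rng.normal(size=(100, 784))
y = np.repeat(np.arange(10), 10)

# One-versus-rest: one linear SVM per digit, each trained on "this digit vs. the rest".
clf = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(clf.predict(X[:5]))
```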

img

Q3(a) What are the possible values of $y_i$?

As an SVM classifies the data into two classes, $y_i$ will be either $-1$ or $+1$.

Q3(b) Suggest an appropriate kernel for the SVM. Briefly explain your answer.

  • D = 1000
  • N = 10

In this case, the feature dimension is larger than the number of training samples ($N < D$), so a hyperplane can perfectly separate the training data into two classes. A linear kernel is therefore the most appropriate.

Q3(c) Can Fisher discriminant analysis (FDA) solve this classification problem? Briefly explain your answer.

FDA cannot solve this problem because the rank of the within-class covariance $S_w$ is at most $N - 1$, which is significantly smaller than $D$. As a result, $S_w$ does not have a valid inverse $S_w^{-1}$.

Q4

img

Q4(a) If $\mathbf{x}$ is a two-dimensional vector, what is the dimension of $\phi_p(\mathbf{x})$?

If $\mathbf{x}$ is a two-dimensional vector, the dimension of $\phi_p(\mathbf{x})$ is 6, because expanding $(1+\mathbf{x}_i^T\mathbf{x})^2$ produces 6 distinct terms.

Q4(b) If $\mathbf{x} = [x_1 \ x_2]^T$, express $\phi_p(\mathbf{x})$ in a vector form in terms of $x_1$ and $x_2$.

Denote $\mathbf{x}_{i}=\left[\begin{array}{ll}u_{i} & v_{i}\end{array}\right]^{T}$ and $\mathbf{x}=\left[\begin{array}{ll}u & v\end{array}\right]^{T}$.

Then,

\begin{aligned} K\left(\mathbf{x}_{i}, \mathbf{x}\right) &=\left(1+\mathbf{x}_{i}^{T} \mathbf{x}\right)^{2}=\left(1+\left[\begin{array}{ll} u_{i} & v_{i} \end{array}\right]\left[\begin{array}{c} u \\ v \end{array}\right]\right)^{2} \\ &=\left(1+u_{i} u+v_{i} v\right)\left(1+u_{i} u+v_{i} v\right) \\ &=1+2 u_{i} u+2 v_{i} v+2 u_{i} u v_{i} v+u_{i}^{2} u^{2}+v_{i}^{2} v^{2} \\ &=\left[\begin{array}{cccccc} 1 & \sqrt{2} u_{i} & \sqrt{2} v_{i} & \sqrt{2} u_{i} v_{i} & u_{i}^{2} & v_{i}^{2} \end{array}\right]\left[\begin{array}{c} 1 \\ \sqrt{2} u \\ \sqrt{2} v \\ \sqrt{2} u v \\ u^{2} \\ v^{2} \end{array}\right] \\ &=\phi_{p}\left(\mathbf{x}_{i}\right)^{T} \phi_{p}(\mathbf{x}) \end{aligned}

Therefore, $\phi_{p}(\mathbf{x})=\left[\begin{array}{llllll}1 & \sqrt{2} x_{1} & \sqrt{2} x_{2} & \sqrt{2} x_{1} x_{2} & x_{1}^{2} & x_{2}^{2}\end{array}\right]^{T}$, where $\mathbf{x}=\left[\begin{array}{ll}x_{1} & x_{2}\end{array}\right]^{T}$.
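
A quick NumPy check that this explicit feature map reproduces the kernel value; the test vectors are arbitrary.

```python
import numpy as np

def phi_p(x):
    """Explicit feature map of the degree-2 polynomial kernel (1 + x_i . x)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

xi = np.array([0.5, -1.0])
x  = np.array([2.0, 3.0])

k_direct  = (1.0 + xi @ x) ** 2          # kernel evaluated directly
k_feature = phi_p(xi) @ phi_p(x)         # inner product of the feature maps
print(k_direct, k_feature)               # both equal 1.0 for these vectors
```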

Q4© Explain why decision planes formed by linear SVMs are linear.

The decision function of a linear SVM is given by

\begin{aligned} f(\mathbf{x}) &=\sum_{i \in \mathcal{S}} y_{i} \alpha_{i} K\left(\mathbf{x}_{i}, \mathbf{x}\right)+b \\ &=\sum_{i \in \mathcal{S}} y_{i} \alpha_{i} \mathbf{x}_{i}^{T} \mathbf{x}+b \\ &=\mathbf{w}^{T} \mathbf{x}+b \end{aligned}

where the SVM weight w\mathbf{w} is

\mathbf{w}=\sum_{i \in \mathcal{S}} y_{i} \alpha_{i} \mathbf{x}_{i}

Note that $f(\mathbf{x})=\mathbf{w}^{T} \mathbf{x}+b$ is linear in $\mathbf{x}$.

Q4(d) Explain why the decision planes formed by polynomial SVMs are nonlinear.

Because in polynomial SVMs, $K\left(\mathbf{x}_{i}, \mathbf{x}\right)=\left\langle\phi_{p}\left(\mathbf{x}_{i}\right), \phi_{p}(\mathbf{x})\right\rangle$ involves the squares and cross-products of the elements of $\mathbf{x}$ and $\mathbf{x}_{i}$, the decision function $f(\mathbf{x})=\sum_{i \in \mathcal{S}} y_{i} \alpha_{i}\left\langle\phi_{p}\left(\mathbf{x}_{i}\right), \phi_{p}(\mathbf{x})\right\rangle+b$ is non-linear in $\mathbf{x}$.

Q4(e) Given the following 2-class problem in 2-D space, sketch the decision boundaries produced by

  • (i) a polynomial SVM with $d = 2$
  • (ii) an RBF-SVM with $\sigma = 1$.
  • (iii) Can the problem be solved by a linear SVM?
img

(i)

img
(ii)
img

(iii)
The problem cannot be solved by linear SVMs because there is no straight line that can separate the two classes without making substantial errors.

Q5 Explain why it is always possible to use a plane (linear SVM) to divide 3 linearly independent samples into two classes. What happens when the 3 samples are linearly dependent?

When the 3 samples are linearly independent, they define a plane in the 3-D space, i.e., they become the corners of a triangle. As a result, a linear SVM can classify the 3 points perfectly.

On the other hand, when the 3 points are co-linear (linearly dependent), they lie on a line in the 3-D space. As we now have more points (N = 3) than the effective dimension of the data, some labellings (e.g., the middle point belonging to a different class from the two outer points) cannot be separated by a plane, and we need a non-linear classifier such as an RBF-SVM.

img

K-means clustering

img

Q1

Randomly pick K samples from the training data and consider them as the initial centroids. In this question, K = 2.

  • $\mu_1$ and $\mu_2$ are the centroids

In the first iteration, we use the centroids to assign each sample to a cluster, and then use the resulting clusters to re-compute the centroids. In the next iteration, we do the same: use the current centroids to assign the clusters, then re-compute the centroids.
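
A minimal NumPy sketch of this assign / re-compute loop; the toy 2-D data and the fixed number of iterations are assumptions (in practice, one iterates until the assignments stop changing).

```python
import numpy as np

def kmeans(X, K=2, n_iters=10, seed=0):
    """Plain K-means: pick K random samples as centroids, then alternate
    the assignment step and the centroid re-computation step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each sample goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned samples
        # (keep the old centroid if a cluster happens to be empty).
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return centroids, labels

# Toy 2-D data with two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, K=2)
print(centroids)
```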

img

Q2

Denote $\mathcal{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$ as a set of training vectors. The K-means algorithm aims to partition $\mathcal{X}$ into $K$ disjoint sets $\mathcal{X}_{k}$, where $k=1, \ldots, K$, such that the following mean squared error is minimized:

E=\sum_{k=1}^{K} \sum_{\mathbf{x} \in \mathcal{X}_{k}}\left\|\mathbf{x}-\boldsymbol{\mu}_{k}\right\|^{2}

where

\boldsymbol{\mu}_{k}=\frac{\sum_{\mathbf{x} \in \mathcal{X}_{k}} \mathbf{x}}{\sum_{\mathbf{x} \in \mathcal{X}_{k}} 1}=\frac{1}{N_{k}} \sum_{\mathbf{x} \in \mathcal{X}_{k}} \mathbf{x}

where $N_{k}$ is the number of vectors in $\mathcal{X}_{k}$.

Q2(a) Given a Gaussian mixture model $\Lambda = \{\pi_k, \mu_k, \Sigma_k\}^K_{k=1}$ with $K$ mixture components, where $\pi_k$, $\mu_k$, and $\Sigma_k$ denote the mixture coefficient, mean vector, and covariance matrix of the $k$-th Gaussian, respectively. Under what conditions does the EM algorithm for estimating $\Lambda$ become the K-means algorithm? Briefly explain your answer.

Hints: You may answer this question by stating the EM algorithm for GMM and specifying the conditions for the posterior probabilities of the mixture components $P(\operatorname{mix}=k \mid \mathbf{x})$ and the covariance matrices $\boldsymbol{\Sigma}_{k}$'s such that the EM algorithm reduces to the K-means algorithm.

img

EM becomes K-means when the following conditions (C1 and C2) are satisfied.

C1: $P(k \mid \mathbf{x})=\left\{\begin{array}{cc}1 & \text { if } \mathbf{x} \in S_{k} \\ 0 & \text { if } \mathbf{x} \notin S_{k}\end{array}\right.$ or $P(k \mid \mathbf{x})=\left\{\begin{array}{cc}1 & \text { if } P(k \mid \mathbf{x})>P(j \mid \mathbf{x}) \ \forall j \neq k \\ 0 & \text { otherwise }\end{array}\right.$
C2: $\Sigma_{k}=\mathbf{I} \quad \forall k$

Condition (1) means that each sample can belong to one and only one cluster, which is what K-means is about. Plugging Condition 1 into Eq. 3, we obtain the update equation for K-means

\boldsymbol{\mu}_{k}=\frac{\sum_{\mathbf{x} \in S_{k}} \mathbf{x}}{\sum_{\mathbf{x} \in S_{k}} 1}=\frac{1}{N_{k}} \sum_{\mathbf{x} \in S_{k}} \mathbf{x}

Condition (2) makes the comparison between each sample and the $K$ centroids based on the Euclidean distance. It also changes the maximization of the log-likelihood of the training data $\mathcal{X}$ into the minimization of the total squared Euclidean distance, as in the objective function of K-means.

Q3

img

Instead of being assigned to only one cluster, each vector is assigned to multiple clusters, with the posterior probability of being in the $k$-th cluster equal to

\begin{array}{l} P(\text { mixture }=k \mid \mathbf{x})=\frac{\pi_{k} p\left(\mathbf{x} \mid \hat{\mu}_{k}, \Sigma_{k}\right)}{\sum_{r=1}^{K} \pi_{r} p\left(\mathbf{x} \mid \hat{\mu}_{r}, \Sigma_{r}\right)} \quad k=1, \ldots, K \\ p\left(\mathbf{x} \mid \hat{\mu}_{k}, \Sigma_{k}\right)=\frac{1}{(2 \pi)^{D / 2}\left|\Sigma_{k}\right|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\mu_{k}\right)^{T} \Sigma_{k}^{-1}\left(\mathbf{x}-\mu_{k}\right)\right\} \end{array}

Because $0 \leq P(k \mid \mathbf{x}) \leq 1$, the assignment decisions are “soft”.
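
A short sketch of this soft assignment (the responsibilities) for a toy two-component GMM; the parameter values are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A toy 2-component GMM in 2-D (the parameter values are arbitrary, for illustration).
pis    = np.array([0.6, 0.4])
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

def responsibilities(x):
    """Soft assignment P(mixture = k | x) for each component k."""
    weighted = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    return weighted / weighted.sum()

r = responsibilities(np.array([1.5, 1.5]))
print(r, r.sum())   # two probabilities between 0 and 1 that sum to 1
```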