Gaussian Classifier / GMM

Q0. Assume that you are given $N$ independent handwritten-digit images, each with a size of $28 \times 28$ pixels. Assume also that you have converted the images into vectors $\mathcal{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$.

Q0(a) Suggest a way to convert the images into vectors such that each pixel of an image corresponds to an element in the corresponding vector.

For each image, flatten the $28 \times 28$ pixel array into a $784$-dimensional vector by stacking its rows or columns.
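
For illustration, here is a minimal NumPy sketch of this flattening; the array name `images` and its shape `(N, 28, 28)` are assumptions made for the example.

```python
import numpy as np

# Hypothetical input: N grayscale images of size 28 x 28.
N = 1000
images = np.random.randint(0, 256, size=(N, 28, 28))

# Flatten each image into a 784-dim vector by stacking its rows.
X = images.reshape(N, 28 * 28)      # shape: (N, 784)

# Stacking columns instead: images.transpose(0, 2, 1).reshape(N, -1)
print(X.shape)                      # (1000, 784)
```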

Q0(b) If we want to use a Gaussian density function with a full covariance matrix to approximate the distribution of $\mathcal{X}$, what will be the minimum value of $N$? Briefly explain your answer.

We need to ensure that the number of training samples is larger than the feature dimension ($N > D$) in order to avoid the curse of dimensionality.

Number of training samples = $N$

The feature dimension is $D = 28 \times 28 = 784$.

In order to have an invertible covariance matrix $\boldsymbol{\Sigma}$, we need $\boldsymbol{\Sigma}$ to be of full rank.

As the dimension of $\boldsymbol{\Sigma}$ is $784 \times 784$, we need at least $784 + 1 = 785$ samples to make it full rank (a sample covariance computed with the sample mean has rank at most $N-1$).
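
As a quick sanity check of this counting argument, the NumPy snippet below (with synthetic data and a deliberately small dimension, both assumptions for the example) shows that the sample covariance has rank at most $N-1$, so $N \ge D + 1$ is needed for it to be invertible.

```python
import numpy as np

D = 50                                   # small dimension, just for a quick check
for N in (D, D + 1):
    X = np.random.randn(N, D)            # N independent samples
    Sigma = np.cov(X, rowvar=False)      # D x D sample covariance (uses the sample mean)
    rank = np.linalg.matrix_rank(Sigma)
    print(f"N = {N}: rank(Sigma) = {rank} (full rank needs {D})")
# With N = D the rank is D - 1 (singular); with N = D + 1 it is D (invertible).
```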

Q0(c) What will be your answer in Q0(b) if $\mathbf{x}_1 = 3\mathbf{x}_2 + 2\mathbf{x}_3$? Briefly explain your answer.

This means that:

  • $\mathbf{x}_{1}$ depends on $\mathbf{x}_{2}$ and $\mathbf{x}_{3}$
  • $\mathbf{x}_{1}$ is redundant if we know $\mathbf{x}_{2}$ and $\mathbf{x}_{3}$
    • Therefore, 1 sample, $\mathbf{x}_{1}$, is redundant in our data

As a result, we need 1 extra sample to make sure that $\operatorname{rank}(\boldsymbol{\Sigma})=784$.

Therefore, the minimum value of $N$ is $785 + 1 = 786$.

Q0(d) How would you reduce the dimension of $\mathbf{x}_i$, $i = 1, \ldots, N$, to 5 using principal component analysis?

Recall the procedure of PCA:

Given a set of training vectors, if we want to find the subspace:

  • Let $u_1$ be the vector representing the first axis of the subspace
  • Let $X=\{x_1, \ldots, x_N\}$ be the training vectors
  • The mean of the projected data is $u_1^T \bar{x}$ (note that $u_1^T$ is the transpose of $u_1$), where

\bar{x} = x_{avg} = \frac{1}{N} \sum^N_{n=1} x_n

In this case:

First, we compute the covariance matrix

\boldsymbol{\Sigma}=\frac{1}{N} \sum_{i=1}^{N}\left(\mathbf{x}_{i}-\boldsymbol{\mu}\right)\left(\mathbf{x}_{i}-\boldsymbol{\mu}\right)^{\top}

where $\boldsymbol{\mu}=\frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_{i}$.

Then, we compute the eigenvectors and eigenvalues of $\boldsymbol{\Sigma}$, select the top-5 eigenvectors whose eigenvalues are the largest, and project the vectors using

\mathbf{y}_{i}=\mathbf{W}^{\top}\left(\mathbf{x}_{i}-\boldsymbol{\mu}\right), \quad \mathbf{y}_{i} \in \mathbb{R}^{5},

where $\mathbf{W}$ comprises the top-5 eigenvectors in its columns.
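
A minimal NumPy sketch of this procedure; the data matrix `X` of shape `(N, 784)` and the random placeholder data are assumptions for the example.

```python
import numpy as np

def pca_project(X, P=5):
    """Project the rows of X onto the top-P principal components."""
    mu = X.mean(axis=0)                              # mean vector mu
    Xc = X - mu                                      # center the data
    Sigma = Xc.T @ Xc / X.shape[0]                   # covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # eigh returns ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:P]]    # top-P eigenvectors as columns of W
    Y = Xc @ W                                       # y_i = W^T (x_i - mu) for every row
    return Y, W, mu

# Random data standing in for the 784-dim digit vectors.
X = np.random.randn(200, 784)
Y, W, mu = pca_project(X, P=5)
print(Y.shape)   # (200, 5)
```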

Q0(e)

img

The larger the variation of the data along a principal component, the larger its eigenvalue.

Figure 1(a) is the eigenvector corresponding to the last PC. This is because the variation in the central part of the image is very small when compared with that in Figure 1(b).

Figure 1(b) is the eigenvector corresponding to the first PC. This is because the variation in the central part of the image is very large when compared with that in Figure 1(a).

Q0(f) Assume that N = 10,000. What is the maximum number of eigenvectors (with non-zero eigenvalues) that you could obtain? Briefly explain your answer.

The maximum number of eigenvectors with non-zero eigenvalues is 784 because the dimension of $\mathbf{x}_i$ is 784 and $N = 10{,}000 > 784$, so $\boldsymbol{\Sigma}$ can be of full rank.

img

Q1(a) If $K = 100$, what is the theoretical minimum value of N for the Gaussian classifier to recognize the persons in this dataset? Show how you calculate this value.

We need to ensure that the number of training samples is larger than the feature dimension ($N > D$) in order to avoid the curse of dimensionality.

Number of training samples = $K \times N$

The feature dimension is $D = 200 \times 200 = 40000$.

As $K = 100$:

$100 N > 200 \times 200$

$\Rightarrow N > 400$

Therefore, the minimum value of $N$ is $400 + 1 = 401$.

Q1(b) Determine the theoretical minimum value of N if we have the following relationships:

xk,5=2xk,4+3xk,3+1xk,4=3xk,3+xk,2\begin{aligned} \mathbf{x}_{k, 5} &=2 \mathbf{x}_{k, 4}+3 \mathbf{x}_{k, 3}+\mathbf{1} \\ \mathbf{x}_{k, 4} &=3 \mathbf{x}_{k, 3}+\mathbf{x}_{k, 2} \end{aligned}

where $k=1, \ldots, K$ and $\mathbf{1}$ is a vector containing all 1's. Briefly explain your answer.

This means that:

  • $\mathbf{x}_{k,5}$ depends on $\mathbf{x}_{k,2}$ and $\mathbf{x}_{k,3}$
  • $\mathbf{x}_{k,4}$ depends on $\mathbf{x}_{k,2}$ and $\mathbf{x}_{k,3}$
  • $\mathbf{x}_{k,4}$ and $\mathbf{x}_{k,5}$ are redundant if we know $\mathbf{x}_{k,2}$ and $\mathbf{x}_{k,3}$.
    • Therefore, 2 samples, $\mathbf{x}_{k,4}$ and $\mathbf{x}_{k,5}$, are redundant in our data

As a result, we need 2 extra samples for each person to make sure that $\operatorname{rank}(\boldsymbol{\Sigma}) = 200 \times 200 = 40000$.

Therefore, the minimum value of $N$ is $401 + 2 = 403$.

Q1(c) Briefly describe the procedure for training the Gaussian classifier. Write the equation that computes the mean vectors of the Gaussian classifier. Also state the parameters of the classifier.

We use $\mathcal{X}_{k}$ to estimate the mean vector $\boldsymbol{\mu}_{k}$ of person $k$, where $k=1, \ldots, K$. Specifically, we compute

\boldsymbol{\mu}_{k}=\frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_{k, i} .

Then, a Gaussian classifier can be constructed by using the parameters $\left\{\boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}\right\}_{k=1}^{K}$, where $\boldsymbol{\Sigma}$ is a covariance matrix shared by all classes.
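
Below is a sketch of this training procedure in NumPy. The per-class data layout (`X_list[k]` holding person $k$'s vectors) and the pooled, per-class-centered estimate of the shared covariance $\boldsymbol{\Sigma}$ are assumptions made for the example rather than details stated above.

```python
import numpy as np

def train_gaussian_classifier(X_list):
    """X_list[k]: (N_k, D) array of training vectors for person k."""
    means = [Xk.mean(axis=0) for Xk in X_list]               # class means mu_k
    # Shared covariance Sigma: pool the per-class-centered vectors of all classes.
    centered = np.vstack([Xk - mu for Xk, mu in zip(X_list, means)])
    Sigma = centered.T @ centered / centered.shape[0]
    return np.array(means), Sigma                            # parameters {mu_k, Sigma}

# Toy example: K = 3 persons, D = 5 features, N = 20 vectors per person.
rng = np.random.default_rng(0)
X_list = [rng.normal(loc=k, size=(20, 5)) for k in range(3)]
means, Sigma = train_gaussian_classifier(X_list)
print(means.shape, Sigma.shape)   # (3, 5) (5, 5)
```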

Q1(d) Are the decision boundaries produced by this Gaussian classifier linear or non-linear? Briefly explain your answer.

The decision boundaries are linear because the covariance matrices of all classes (Gaussian models) are the same.

Q1(e) Assume that all of the fingerprint vectors are independent. Describe how you would use principal component analysis (PCA) to reduce the dimension of the vectors. State the equation for projecting $\mathbf{x}_{k,i}$ to a $P$-dimensional vector $\mathbf{y}_{k,i}$.

Recall the procedure of PCA:

Given a set of training vectors, if we want to find the subspace:

  • Let $u_1$ be the vector representing the first axis of the subspace
  • Let $X=\{x_1, \ldots, x_N\}$ be the training vectors
  • The mean of the projected data is $u_1^T \bar{x}$ (note that $u_1^T$ is the transpose of $u_1$), where

\bar{x} = x_{avg} = \frac{1}{N} \sum^N_{n=1} x_n

In this case, we apply eigen-decomposition to the global covariance matrix $\boldsymbol{\Sigma}$ to obtain at most $K \times N$ eigenvectors with non-zero eigenvalues, sorted in decreasing order of their eigenvalues.

Then, we select the first $P$ eigenvectors $\{u_i\}^P_{i=1}$ with the largest eigenvalues to form a transformation matrix $U = [u_1, u_2, \ldots, u_P]$.

Then, we project every fingerprint vector as follows:

\mathbf{y}_{k, i}=\mathbf{U}^{\top}\left(\mathbf{x}_{k, i}-\boldsymbol{\mu}\right), \quad \mathbf{y}_{k, i} \in \mathbb{R}^{P}

Note that this naive method is impractical here, as the dimension of $\boldsymbol{\Sigma}$ ($40000 \times 40000$) is too high.

Q1(f) Denote $P(C_k)$ as the prior probability of Person $k$. Describe how you would use the Gaussian classifier and Bayes' theorem to classify an unknown fingerprint vector $\mathbf{x}$. Hint: You may use Bayes' theorem to compute the posterior probability of Person $k$ given $\mathbf{x}$.

The posterior probability of Person k given x is given by

P\left(C_{k} \mid \mathbf{x}\right)=\frac{P\left(C_{k}\right) p\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}\right)}{\sum_{j=1}^{K} P\left(C_{j}\right) p\left(\mathbf{x} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}\right)}

where $p\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}\right)$ is the output of the $k$-th Gaussian model in the Gaussian classifier. Then, we assign $\mathbf{x}$ to Person $l$ according to

l=\arg \max _{k} P\left(C_{k} \mid \mathbf{x}\right)
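
A sketch of this decision rule using SciPy's multivariate normal density; the toy means, covariance, and priors are placeholders, and the computation is done in the log domain for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(x, means, Sigma, priors):
    """Return l = argmax_k P(C_k | x) for a Gaussian classifier with a shared covariance."""
    # log P(C_k) + log p(x | mu_k, Sigma); the denominator p(x) is common to all k.
    scores = [np.log(priors[k]) + multivariate_normal.logpdf(x, mean=means[k], cov=Sigma)
              for k in range(len(means))]
    return int(np.argmax(scores))

# Toy example: K = 3 classes in D = 5 dimensions with equal priors.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 5))
Sigma = np.eye(5)
priors = np.full(3, 1.0 / 3)
print(classify(np.zeros(5), means, Sigma, priors))
```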

img

Q2(a) Express $\log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ in terms of $\mathbf{x}$, the mean vector $\boldsymbol{\mu}_j$, and the covariance matrix $\boldsymbol{\Sigma}_j$.

\log \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)=-\frac{D}{2} \log (2 \pi)-\frac{1}{2} \log \left|\boldsymbol{\Sigma}_{j}\right|-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{j}\right)^{\top} \boldsymbol{\Sigma}_{j}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{j}\right)

Q2(b) Assume that there is a special sample $\mathbf{x}_0$ that is very far away from all of the mean vectors $\boldsymbol{\mu}_j$'s. Use your answer in Q2(a) to explain the numerical problem that may occur when computing the log-likelihood $\log p(\mathbf{x}_0 \mid \Lambda)$.

When $\mathbf{x}$ is far away from $\boldsymbol{\mu}_j$ for all $j$, $\log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ will be a very large negative number for all $j$. As a result, $\exp\{\log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\}$ underflows to 0 for all $j$, and the resulting $\log p(\mathbf{x} \mid \Lambda)$ becomes $-\infty$.
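
A small demonstration of this underflow and of the usual remedy, the log-sum-exp trick; the per-component log-likelihood values below are made-up placeholders.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical values of log[w_j * N(x0 | mu_j, Sigma_j)] for a sample x0
# that is very far away from every component mean.
log_terms = np.array([-1500.0, -1520.0, -1490.0])

naive = np.log(np.sum(np.exp(log_terms)))   # exp() underflows to 0, so log(0) = -inf
stable = logsumexp(log_terms)               # shift by the max before exponentiating

print(naive)    # -inf (with a divide-by-zero warning)
print(stable)   # approximately -1490.0
```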

Q2(c) If $D = 100$ and $M = 1$, what is the minimum number of training samples $N$ for the EM algorithm to compute the parameters in $\Lambda$?

We need to ensure that the number of training samples is larger than the feature dimension ($N > D$) in order to avoid the curse of dimensionality.

The minimum number of independent training samples is $N = 100 + 1 = 101$.

Q2(d) If $D = 100$, $M = 1$, and $\boldsymbol{\Sigma}_1$ is a diagonal matrix, what is the minimum number of training samples $N$ for the EM algorithm to compute the parameters in $\Lambda$?

For a diagonal covariance matrix, the situation is different: each diagonal element is just the variance of a single feature, so it only needs to be non-zero rather than being part of a full-rank matrix.

The minimum number of independent training samples is $N = 2$.

img

Q3(a) Sketch the distribution (histogram) of the 407th pixel as indicated by the arrow shown in Fig. Q3. You may not need to specify the actual values in the y-axis in your histogram. But your answer should specify the range of the x-axis.

Note that the data are clean, so the 407th pixel can only be either 255 or 0 (white or black). The histogram therefore consists of two spikes, at 0 and 255, with the x-axis ranging from 0 to 255.

img

Q3(b) Explain why Gaussian classifiers are not appropriate for classifying the digits in this set of data.

The histograms of the pixels do not follow a Gaussian distribution: the peaks are so sharp that they cannot be modeled even by a GMM. Therefore, Gaussian density functions cannot fit the distribution well.

Q3© How would you process the data so that they become suitable for being classified by Gaussian classifiers.

Add Gaussian noise to the data so that the histograms of pixels look more like a Gaussian distribution. Alternatively, you may perform PCA to reduce the dimension and hope that the first few principal components have sufficient variance to produce a non-singular covariance matrix.
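
A minimal sketch of the first suggestion (adding Gaussian noise), with synthetic clean data standing in for the digit images; the pixel layout and noise level are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clean data: 100 vectors of 784 pixels; the first 300 pixels are
# constant background (0), the remaining pixels are either 0 or 255.
X = np.zeros((100, 784))
X[:, 300:] = rng.choice([0.0, 255.0], size=(100, 484))

# Add small-variance Gaussian noise so that every pixel has a non-zero variance
# and its histogram looks roughly Gaussian around 0 or 255.
X_noisy = X + rng.normal(scale=5.0, size=X.shape)

print(X.var(axis=0).min(), X_noisy.var(axis=0).min())   # 0.0 vs. a positive value
```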

Q4. Assume that you have built a Gaussian classifier for a two-class problem. Assume also that the means ($\mu_k$) and covariance matrices ($\Sigma_k$) in your Gaussian classifier are as follows:
Class 1: $\mu_{1}=\left[\begin{array}{l}1 \\ 1\end{array}\right]$ and $\Sigma_{1}=\left[\begin{array}{ll}1 & 0 \\ 0 & 1\end{array}\right]$
Class 2: $\mu_{2}=\left[\begin{array}{l}-1 \\ -1\end{array}\right]$ and $\Sigma_{2}=\left[\begin{array}{ll}2 & 0 \\ 0 & 2\end{array}\right]$
Assume that both classes have equal prior, i.e., $P(C_1)=P(C_2)$.

Q4(a) Mark the location of the class means and sketch the decision boundary of the two classes, i.e., plot the locus of $p(\mathbf{x} \mid \mu_1, \Sigma_1) = p(\mathbf{x} \mid \mu_2, \Sigma_2)$, where $\mathbf{x}$ are random vectors from either Class 1 or Class 2.

Note:

  • the mean $\mu$ determines the center of the distribution
  • the covariance matrix $\Sigma$ determines the spread of the distribution

Therefore:

For the distribution of class 1:

  • the mean is at (1, 1)
  • spread: horizontal and vertical both 1

For the distribution of class 2:

  • the mean is at (-1, -1)
  • spread: horizontal and vertical both 2

Since $\Sigma_1 \neq \Sigma_2$, the locus $p(\mathbf{x} \mid \mu_1, \Sigma_1) = p(\mathbf{x} \mid \mu_2, \Sigma_2)$ is quadratic: a circle that encloses the class-1 mean (the class with the smaller variance).
img

Q4(b)

img

For distributions with arbitrary covariance structures, the decision boundaries are hyperquadrics (e.g., ellipses, parabolas, or hyperbolas in 2-D).

Therefore:

For the distribution of class 1:

  • the mean is at (0, 0)
  • spread: horizontal 1 and vertical 2

For the distribution of class 2:

  • the mean is at (1, 1)
  • spread: horizontal and vertical both 1
img

img

5(a)

As none of the central pixels (14, 14) of the 5,000 training samples contains a stroke of Digit '0', there are no intensities in the stroke range of about 0.6 to 1. Because the digits were contaminated by Gaussian noise, the frequency of intensity exactly 0 is also 0. We can therefore sketch the histogram as a Gaussian-shaped curve centered near 0 (the noise around the background level), with no mass at exactly 0 and no mass in the stroke range 0.6 to 1.

img

img

5(b)

Recall Formula of GMM Classifier:

Given a data set comprising the samples of K classes, we partition the data into K class-dependent subsets:

\mathcal{X}=\left\{\mathcal{X}_{1}, \ldots, \mathcal{X}_{k}, \ldots, \mathcal{X}_{K}\right\}

We use the samples in Class $k$ to train the $k$-th GMM:

\mathcal{X}_{k}=\left\{\mathbf{x}_{k, 1}, \ldots, \mathbf{x}_{k, N_{k}}\right\} \xrightarrow{\text{ EM Algorithm }} \Lambda_{k}=\left\{w_{k j}, \boldsymbol{\mu}_{k j}, \boldsymbol{\Sigma}_{k j}\right\}_{j=1}^{M}

During classification, given an unknown vector $\mathbf{x}$, we use Bayes' theorem to decide its class label $c$:

c=\arg \max _{k} P\left(C_{k} \mid \mathbf{x}\right)=\arg \max _{k} p\left(\mathbf{x} \mid C_{k}\right) P\left(C_{k}\right)

p\left(\mathbf{x} \mid C_{k}\right)=\sum_{j=1}^{M} w_{k j} p\left(\mathbf{x} \mid \boldsymbol{\mu}_{k j}, \boldsymbol{\Sigma}_{k j}\right)

Therefore:

We first convert the training images into vectors by stacking the pixels, which results in 5,000 training vectors per digit. Then, we train 10 Gaussian models (density functions) with mean vectors and covariance matrices $\Lambda=\left\{\mu_{i}, \Sigma_{i}\right\}_{i=0}^{9}$, one for each digit. A Gaussian classifier is formed as follows:

\text { Recognized digit }=\underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} P\left(C_{k} \mid \mathbf{x}\right)

\begin{array}{l} =\underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} \frac{P\left(C_{k}\right) N\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{\sum_{j=0}^{9} P\left(C_{j}\right) N\left(\mathbf{x} ; \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)} \quad \text { Bayes' rule } \\ =\underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} \frac{P\left(C_{k}\right) N\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{p(\mathbf{x})} \quad \text { Law of total probability } \\ =\underset{k \in\{0, \ldots, 9\}}{\operatorname{argmax}} P\left(C_{k}\right) N\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right) \quad p(\mathbf{x}) \text { is independent of } k \end{array}

img

Given a test vector x, the Gaussian classifier determines the class label of x by the following formula:

\begin{aligned} l(\mathbf{x})=& \underset{i}{\operatorname{argmax}}\left\{\begin{array}{cc} P(i \text { is even }) p\left(\mathbf{x} \mid \mu_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text { is even } \\ P(i \text { is odd }) p\left(\mathbf{x} \mid \mu_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text { is odd } \end{array}\right\} \\ &=\underset{i}{\operatorname{argmax}}\left\{\begin{array}{cc} \frac{2}{3} p\left(\mathbf{x} \mid \mu_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text { is even } \\ \frac{1}{3} p\left(\mathbf{x} \mid \mu_{i}, \boldsymbol{\Sigma}_{i}\right) & i \text { is odd } \end{array}\right\} \end{aligned}

where $p\left(\mathbf{x} \mid \mu_{i}, \Sigma_{i}\right)$ is the likelihood of $\mathbf{x}$ given the Gaussian model of Digit $i$. The prior probabilities are obtained as follows:

\begin{array}{l} P_{e}=2 P_{o} \quad \text { and } \quad P_{e}+P_{o}=1 \\ \Rightarrow 2 P_{o}+P_{o}=1 \\ \Rightarrow P_{o}=\frac{1}{3} \\ \Rightarrow P_{e}=\frac{2}{3} \end{array}

img

5(d)

No. This is because 100 samples cannot produce a full covariance matrix with a valid inverse: the rank of a covariance matrix estimated from 100 samples is at most 99, which is far smaller than the feature dimension, so the matrix is singular.

img

5(e)

Change the full covariance matrices to diagonal covariance matrices. Because each element on the diagonal is now estimated from 100 samples, the variance estimate is reliable and will not be 0.

A diagonal matrix is always invertible as long as none of its diagonal entries is zero, and here each variance is a positive, non-zero number. So the diagonal covariance matrices can be inverted reliably.

img

5(f)

When the number of training samples per digit is 1, using diagonal covariance matrices will not help because the variances are still 0. In such a situation, we may use a common diagonal covariance matrix for all of the 10 digits so that each diagonal element is estimated from 10 values (one per digit); then the variances will not be 0. A better solution is not to use a Gaussian classifier at all, e.g., to use a linear SVM instead.

SVM

Q1. The output of a support vector machine (SVM) is given by

f(\mathbf{x})=\sum_{i \in \mathcal{S}} a_{i} K\left(\mathbf{x}, \mathbf{x}_{i}\right)+b

where $\mathbf{x}$ is an input vector in $\mathbb{R}^{D}$, $\mathcal{S}$ comprises the indexes of a set of support vectors, $b$ is a bias term, the $a_{i}$'s are SVM parameters, and $K(\cdot, \cdot)$ is a kernel.

Q1(a) If $K(\mathbf{x}, \mathbf{x}_i)$ is a radial basis function (RBF) kernel, express $K(\mathbf{x}, \mathbf{x}_i)$ in terms of $\mathbf{x}$, $\mathbf{x}_i$, and the kernel width $\sigma$.

Note:

For Linear Kernel,

K(\mathbf{x}_i, \mathbf{x}) = \mathbf{x}_i^T \mathbf{x} \quad\quad \phi(\mathbf{x}) = \mathbf{x}

For RBF Kernel,

K\left(\mathbf{x}, \mathbf{x}_{i}\right)=\exp \left\{-\frac{1}{2 \sigma^{2}}\left\|\mathbf{x}-\mathbf{x}_{i}\right\|^{2}\right\}

For Polynomial Kernel,

K\left(\mathbf{x}_{i}, \mathbf{x}\right)=\left(1+\mathbf{x} \cdot \mathbf{x}_{i}\right)^{p}, \quad p>0

Therefore, if $K(\mathbf{x}, \mathbf{x}_i)$ is a radial basis function (RBF) kernel,

K\left(\mathbf{x}, \mathbf{x}_{i}\right)=\exp \left\{-\frac{1}{2 \sigma^{2}}\left\|\mathbf{x}-\mathbf{x}_{i}\right\|^{2}\right\}
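
For concreteness, a direct NumPy implementation of this kernel (a small sketch, independent of any particular SVM library):

```python
import numpy as np

def rbf_kernel(x, xi, sigma=1.0):
    """K(x, x_i) = exp(-||x - x_i||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

x  = np.array([1.0, 2.0])
xi = np.array([0.0, 0.0])
print(rbf_kernel(x, xi, sigma=1.0))   # exp(-2.5) ~= 0.0821
```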

Q1(b) Explain why $f(\mathbf{x})$ is a nonlinear function of $\mathbf{x}$ if $K(\mathbf{x}, \mathbf{x}_i)$ is an RBF kernel.

Because

\left\|\mathbf{x}-\mathbf{x}_{i}\right\|^{2} = (\mathbf{x}-\mathbf{x}_i)^{T}(\mathbf{x}-\mathbf{x}_i)

which is quadratic in $\mathbf{x}$.

Also, the exponential function is non-linear.

Therefore, $f(\mathbf{x})$ is a nonlinear function of $\mathbf{x}$ if $K(\mathbf{x}, \mathbf{x}_i)$ is an RBF kernel.

Q1(c)

img

Q1(d) Repeat Q1(c) when $\sigma \rightarrow 0$.

Answer to Q1(c) and Q1(d):

img

Key points:

  • when $\sigma$ is significantly larger than the range of the data on the 2-D plane, the RBF kernel acts like a linear kernel.

  • when $\sigma$ is significantly small, the SVM overfits the data.

img

Q2(a) Assume that you train a set of one-versus-rest support vector machines (SVMs) to construct an SVM classifier for classifying the digits. Draw a block diagram illustrating the architecture of your SVM classifier.

img

Q2(b) Suggest the most appropriate kernel function (i.e., linear, polynomial or radial basis function (RBF)) for the SVMs. Briefly explain your choice.

  • D = 28 x 28 = 784
  • N = 10 x 10 = 100

In this case, the feature dimension is larger than the number of training samples ($N < D$), so a hyperplane can perfectly separate the training data into two classes. A linear kernel is therefore the most appropriate.
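
The sketch below builds such a one-versus-rest classifier with scikit-learn's `OneVsRestClassifier` and a linear-kernel `SVC`; the random data standing in for the digit images are an assumption for the example.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical data: 10 digit classes, 10 training images each, flattened to 784-dim.
X = rng.normal(size=(100, 784))
y = np.repeat(np.arange(10), 10)

# One-versus-rest: one linear SVM per digit, each trained on "this digit vs. the rest".
clf = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(clf.predict(X[:5]))
```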

img

Q3(a) What are the possible values of $y_i$?

As an SVM classifies the data into two classes, $y_i$ will be either $-1$ or $+1$.

Q3(b) Suggest an appropriate kernel for the SVM. Briefly explain your answer.

  • D = 1000
  • N = 10

In this case, the feature dimension is larger than the number of training samples ($N < D$), so a hyperplane can perfectly separate the training data into two classes. A linear kernel is therefore the most appropriate.

Q3(c) Can Fisher discriminant analysis (FDA) solve this classification problem? Briefly explain your answer.

FDA cannot solve this problem because the rank of the within-class covariance $S_w$ is at most $N - 1$, which is significantly smaller than $D$. As a result, $S_w$ does not have a valid inverse $S_w^{-1}$.

Q4

img

Q4(a) If $\mathbf{x}$ is a two-dimensional vector, what is the dimension of $\phi_p(\mathbf{x})$?

If $\mathbf{x}$ is a two-dimensional vector, the dimension of $\phi_p(\mathbf{x})$ is 6, because expanding $(1+\mathbf{x}_i^T\mathbf{x})^2$ produces 6 distinct terms.

Q4(b) If $\mathbf{x} = [x_1 \ x_2]^T$, express $\phi_p(\mathbf{x})$ in a vector form in terms of $x_1$ and $x_2$.

Denote $\mathbf{x}_{i}=\left[\begin{array}{ll}u_{i} & v_{i}\end{array}\right]^{T}$ and $\mathbf{x}=\left[\begin{array}{ll}u & v\end{array}\right]^{T}$.

Then,

\begin{aligned} K\left(\mathbf{x}_{i}, \mathbf{x}\right) &=\left(1+\mathbf{x}_{i}^{T} \mathbf{x}\right)^{2}=\left(1+\left[\begin{array}{ll} u_{i} & v_{i} \end{array}\right]\left[\begin{array}{c} u \\ v \end{array}\right]\right)^{2} \\ &=\left(1+u_{i} u+v_{i} v\right)\left(1+u_{i} u+v_{i} v\right) \\ &=1+2 u_{i} u+2 v_{i} v+2 u_{i} u v_{i} v+u_{i}^{2} u^{2}+v_{i}^{2} v^{2} \\ &=\left[\begin{array}{cccccc} 1 & \sqrt{2} u_{i} & \sqrt{2} v_{i} & \sqrt{2} u_{i} v_{i} & u_{i}^{2} & v_{i}^{2} \end{array}\right]\left[\begin{array}{c} 1 \\ \sqrt{2} u \\ \sqrt{2} v \\ \sqrt{2} u v \\ u^{2} \\ v^{2} \end{array}\right] \\ &=\phi_{p}\left(\mathbf{x}_{i}\right)^{T} \phi_{p}(\mathbf{x}) \end{aligned}

Therefore, $\phi_{p}(\mathbf{x})=\left[\begin{array}{llllll}1 & \sqrt{2} x_{1} & \sqrt{2} x_{2} & \sqrt{2} x_{1} x_{2} & x_{1}^{2} & x_{2}^{2}\end{array}\right]^{T}$, where $\mathbf{x}=\left[\begin{array}{ll}x_{1} & x_{2}\end{array}\right]^{T}$.
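
A quick NumPy check that this explicit feature map reproduces the kernel value; the test vectors are arbitrary.

```python
import numpy as np

def phi_p(x):
    """Explicit feature map of the degree-2 polynomial kernel (1 + x_i . x)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

xi = np.array([0.5, -1.0])
x  = np.array([2.0, 3.0])

k_direct  = (1.0 + xi @ x) ** 2          # kernel evaluated directly
k_feature = phi_p(xi) @ phi_p(x)         # inner product of the feature maps
print(k_direct, k_feature)               # both equal 1.0 for these vectors
```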

Q4© Explain why decision planes formed by linear SVMs are linear.

The decision function of a linear SVM is given by

\begin{aligned} f(\mathbf{x}) &=\sum_{i \in \mathcal{S}} y_{i} \alpha_{i} K\left(\mathbf{x}_{i}, \mathbf{x}\right)+b \\ &=\sum_{i \in \mathcal{S}} y_{i} \alpha_{i} \mathbf{x}_{i}^{T} \mathbf{x}+b \\ &=\mathbf{w}^{T} \mathbf{x}+b \end{aligned}

where the SVM weight w\mathbf{w} is

\mathbf{w}=\sum_{i \in \mathcal{S}} y_{i} \alpha_{i} \mathbf{x}_{i}

Note that $f(\mathbf{x})=\mathbf{w}^{T} \mathbf{x}+b$ is linear in $\mathbf{x}$.

Q4(d) Explain why the decision planes formed by polynomial SVMs are nonlinear.

Because in polynomial SVMs, $K\left(\mathbf{x}_{i}, \mathbf{x}\right)=\left\langle\phi_{p}\left(\mathbf{x}_{i}\right), \phi_{p}(\mathbf{x})\right\rangle$ involves the squares and cross-products of the elements of $\mathbf{x}$ and $\mathbf{x}_{i}$, the decision function $f(\mathbf{x})=\sum_{i \in \mathcal{S}} y_{i} \alpha_{i}\left\langle\phi_{p}\left(\mathbf{x}_{i}\right), \phi_{p}(\mathbf{x})\right\rangle+b$ is non-linear in $\mathbf{x}$.

Q4(e) Given the following 2-class problem in 2-D space, sketch the decision boundaries produced by

  • (i) a polynomial SVM with $d = 2$
  • (ii) an RBF-SVM with $\sigma = 1$.
  • (iii) Can the problem be solved by a linear SVM?
img

(i)

img
(ii)
img

(iii)
The problem cannot be solved by linear SVMs because there is no straight line that can separate the two classes without making substantial errors.

Q5 Explain why it is always possible to use a plane (linear SVM) to divide 3 linearly independent samples into two classes. What happens when the 3 samples are linearly dependent?

When the 3 samples are linearly independent, they define a plane in the 3-D space, i.e., they become the corners of a triangle. As a result, a linear SVM can classify the 3 points perfectly.

On the other hand, when the 3 points are co-linear (linearly dependent), they lie on a line in the 3-D space. As we now have more points (N = 3) than the effective dimension of the data, some labellings (e.g., the middle point belonging to a different class from the two outer points) cannot be separated by a plane, and we need a non-linear classifier such as an RBF-SVM.

img

K-means clustering

img

Q1

Randomly pick K samples from the training data and consider them as the initial centroids. In this question, K = 2.

  • $\mu_1$ and $\mu_2$ are the centroids

In the first iteration, we use the centroids to assign each sample to a cluster, and then use the resulting clusters to re-compute the centroids. In the next iteration, we do the same: use the current centroids to assign the clusters, then re-compute the centroids.
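
A minimal NumPy sketch of this assign / re-compute loop; the toy 2-D data and the fixed number of iterations are assumptions (in practice, one iterates until the assignments stop changing).

```python
import numpy as np

def kmeans(X, K=2, n_iters=10, seed=0):
    """Plain K-means: pick K random samples as centroids, then alternate
    the assignment step and the centroid re-computation step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each sample goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned samples
        # (keep the old centroid if a cluster happens to be empty).
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return centroids, labels

# Toy 2-D data with two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, K=2)
print(centroids)
```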

img

Q2

Denote $\mathcal{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$ as a set of training vectors. The K-means algorithm aims to partition $\mathcal{X}$ into $K$ disjoint sets $\mathcal{X}_{k}$, where $k=1, \ldots, K$, such that the following mean squared error is minimized:

E=\sum_{k=1}^{K} \sum_{\mathbf{x} \in \mathcal{X}_{k}}\left\|\mathbf{x}-\boldsymbol{\mu}_{k}\right\|^{2}

where

\boldsymbol{\mu}_{k}=\frac{\sum_{\mathbf{x} \in \mathcal{X}_{k}} \mathbf{x}}{\sum_{\mathbf{x} \in \mathcal{X}_{k}} 1}=\frac{1}{N_{k}} \sum_{\mathbf{x} \in \mathcal{X}_{k}} \mathbf{x}

where $N_{k}$ is the number of vectors in $\mathcal{X}_{k}$.

Q2(a) Given a Gaussian mixture model $\Lambda = \{\pi_k, \mu_k, \Sigma_k\}^K_{k=1}$ with $K$ mixture components, where $\pi_k$, $\mu_k$, and $\Sigma_k$ denote the mixture coefficient, mean vector, and covariance matrix of the $k$-th Gaussian, respectively. Under what conditions does the EM algorithm for estimating $\Lambda$ become the K-means algorithm? Briefly explain your answer.

Hints: You may answer this question by stating the EM algorithm for GMM and specifying the conditions for the posterior probabilities of the mixture components $P(\operatorname{mix}=k \mid \mathbf{x})$ and the covariance matrices $\boldsymbol{\Sigma}_{k}$'s such that the EM algorithm reduces to the K-means algorithm.

img

EM becomes K-means when the following conditions (C1 and C2) are satisfied.

C1: $P(k \mid \mathbf{x})=\left\{\begin{array}{cc}1 & \text { if } \mathbf{x} \in S_{k} \\ 0 & \text { if } \mathbf{x} \notin S_{k}\end{array}\right.$ or $P(k \mid \mathbf{x})=\left\{\begin{array}{cc}1 & \text { if } P(k \mid \mathbf{x})>P(j \mid \mathbf{x}) \ \forall j \neq k \\ 0 & \text { otherwise }\end{array}\right.$
C2: $\Sigma_{k}=\mathbf{I} \quad \forall k$

Condition (1) means that each sample can belong to one and only one cluster, which is what K-means is about. Plugging Condition 1 into Eq. 3, we obtain the update equation for K-means

\boldsymbol{\mu}_{k}=\frac{\sum_{\mathbf{x} \in S_{k}} \mathbf{x}}{\sum_{\mathbf{x} \in S_{k}} 1}=\frac{1}{N_{k}} \sum_{\mathbf{x} \in S_{k}} \mathbf{x}

Condition (2) makes the comparison between each sample and the $K$ centroids based on the Euclidean distance. It also changes the maximization of the log-likelihood of the training data $\mathcal{X}$ into the minimization of the total squared Euclidean distance, as in the objective function of K-means.

Q3

img

Instead of being assigned to only one cluster, each vector is assigned to multiple clusters, with the posterior probability of being in the $k$-th cluster equal to

\begin{array}{l} P(\text { mixture }=k \mid \mathbf{x})=\frac{\pi_{k} p\left(\mathbf{x} \mid \hat{\mu}_{k}, \Sigma_{k}\right)}{\sum_{r=1}^{K} \pi_{r} p\left(\mathbf{x} \mid \hat{\mu}_{r}, \Sigma_{r}\right)} \quad k=1, \ldots, K \\ p\left(\mathbf{x} \mid \hat{\mu}_{k}, \Sigma_{k}\right)=\frac{1}{(2 \pi)^{D / 2}\left|\Sigma_{k}\right|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\mu_{k}\right)^{T} \Sigma_{k}^{-1}\left(\mathbf{x}-\mu_{k}\right)\right\} \end{array}

Because $0 \leq P(k \mid \mathbf{x}) \leq 1$, the assignment decisions are “soft”.
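
A short sketch of this soft assignment (the responsibilities) for a toy two-component GMM; the parameter values are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A toy 2-component GMM in 2-D (the parameter values are arbitrary, for illustration).
pis    = np.array([0.6, 0.4])
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

def responsibilities(x):
    """Soft assignment P(mixture = k | x) for each component k."""
    weighted = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    return weighted / weighted.sum()

r = responsibilities(np.array([1.5, 1.5]))
print(r, r.sum())   # two probabilities between 0 and 1 that sum to 1
```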