Some Practice Questions on GMM, SVM and K-means
Gaussian Classifier / GMM
Assume that you are given $N$ independent handwritten-digit images, each with a size of $28 \times 28$ pixels. Assume also that you have converted the images into vectors $\mathbf{x}_1, \dots, \mathbf{x}_N$.
Q0(a) Suggest a way to convert the images into vectors such that each pixel of an image corresponds to an element in the corresponding vector.
For each image, flatten the pixels into a 784-dimensional vector by stacking the rows or columns of the image.
Q0(b) If we want to use a Gaussian density function with a full covariance matrix to approximate the distribution of $\mathbf{x}$, what will be the minimum value of $N$? Briefly explain your answer.
We need to ensure that the number of training samples is larger than the feature dimension (N > D) in order to avoid the curse of dimensionality.
Number of training samples = N.
The feature dimension D = 28 × 28 = 784.
In order to have an invertible covariance matrix $\boldsymbol{\Sigma}$, we need $\boldsymbol{\Sigma}$ to be of full rank.
As the dimension of $\mathbf{x}$ is 784, we need at least 784 + 1 = 785 samples to make $\boldsymbol{\Sigma}$ full rank.
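As a quick numerical check of this argument, the following sketch uses randomly generated data in place of the digit vectors (so the data themselves are only illustrative) and shows how the rank of the sample covariance matrix depends on the number of samples:

```python
import numpy as np

# Sketch: check when the sample covariance of 784-dim vectors becomes full
# rank (and hence invertible). Random data stand in for the digit vectors.
D = 784
rng = np.random.default_rng(0)

for N in (100, 784, 785, 1000):
    X = rng.normal(size=(N, D))          # N samples of dimension D
    Sigma = np.cov(X, rowvar=False)      # D x D sample covariance
    rank = np.linalg.matrix_rank(Sigma)
    print(f"N = {N:4d}  rank = {rank}  full rank: {rank == D}")

# With N <= D the sample covariance has rank at most N - 1, so at least
# N = D + 1 = 785 samples are needed for an invertible covariance matrix.
```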
Q0(c) What will be your answer in Q0(b) if ? Briefly explain your answer.
This means that:
- one of the vectors depends on two other vectors;
- it is redundant if we know those two vectors;
- therefore, 1 sample is redundant in our data.
As a result, we need 1 extra sample to make sure that $\boldsymbol{\Sigma}$ is still of full rank.
Therefore, the minimum value of N is 785 + 1 = 786.
Q0(d) How would you reduce the dimension of $\mathbf{x}$ from 784 to 5 using principal component analysis?
Recall the procedure of PCA. Given a set of training vectors, if we want to find a lower-dimensional subspace:
- Let $\mathbf{u}_1$ be the vector representing the first axis of the subspace.
- Let $\mathbf{x}_1, \dots, \mathbf{x}_N$ be the training vectors.
- The mean of the projected data is $\mathbf{u}_1^{\top}\bar{\mathbf{x}}$, where $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$ (note that $\mathbf{u}_1^{\top}$ is the transpose of $\mathbf{u}_1$).
In this case:
First, we compute the covariance matrix
$$\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^{\top},$$
where $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$.
Then, compute the eigenvectors and eigenvalues of $\mathbf{S}$ and select the top-5 eigenvectors whose eigenvalues are the largest. Then, project the vectors using
$$\mathbf{y}_n = \mathbf{U}^{\top}(\mathbf{x}_n - \bar{\mathbf{x}}),$$
where $\mathbf{U}$ comprises the top-5 eigenvectors in its columns.
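A minimal sketch of this projection in Python, assuming the flattened images are stored in an (N, 784) array `X` (the array name and the use of NumPy's `eigh` are illustrative choices, not part of the original question):

```python
import numpy as np

# Sketch of Q0(d): project 784-dim digit vectors onto the top-5 principal
# components. X is assumed to be an (N, 784) array of flattened images.
def pca_project(X, n_components=5):
    x_bar = X.mean(axis=0)                    # global mean vector
    S = np.cov(X, rowvar=False)               # 784 x 784 covariance matrix
    _, eigvecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :n_components]    # top-5 eigenvectors as columns
    return (X - x_bar) @ U                    # (N, 5) projected vectors

# Example with random data standing in for the digit images.
X = np.random.default_rng(0).normal(size=(1000, 784))
print(pca_project(X).shape)                   # (1000, 5)
```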
Q0(e)
A larger variation of the data along the direction of an eigenvector corresponds to a larger eigenvalue.
Figure 1(a) is the eigenvector corresponding to the last PC. This is because the variation in the central part of the image is very small when compared with that in Figure 1(b).
Figure 1(b) is the eigenvector corresponding to the first PC. This is because the variation in the central part of the image is very large when compared with that in Figure 1(a).
Q0(f) Assume that N = 10,000. What is the maximum number of eigenvectors (with non-zero eigenvalues) that you could obtain? Briefly explain your answer.
The maximum number of eigenvectors with non-zero eigenvalues is 784 because the dimension of $\mathbf{x}$ is 784, which is smaller than N.
Q1(a) If , what is the theoretical minimum value of N for the Gaussian classifier to recognize the persons in this dataset? Show how you calculate this value.
We need to ensure that the number of training samples is larger than the feature dimension (N > D) in order to avoid the curse of dimensionality.
Number of training samples = N.
The feature dimension = D.
As N must be greater than D for each person's covariance matrix to be invertible, the theoretical minimum value of N is D + 1 samples per person.
Q1(b) Determine the theoretical minimum value of N if we have the following relationships:
where and is a vector containing all 1’s. Briefly explain your answer.
This means that:
- one of the vectors depends on two other vectors;
- another vector also depends on two other vectors;
- these two vectors are redundant if we know the vectors they depend on;
- therefore, 2 samples are redundant in our data.
As a result, we need 2 extra samples for each person to make sure that the covariance matrices are still of full rank.
Therefore, the minimum value of N is 2 more than the answer in Q1(a).
Q1(c) Briefly describe the procedure for training the Gaussian classifier. Write the equation that computes the mean vectors of the Gaussian classifier. Also state the parameters of the classifier.
We use the training vectors of person $k$ to estimate the mean vector $\boldsymbol{\mu}_k$, where $k = 1, \dots, K$. Specifically, we compute
$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)},$$
where $\mathbf{x}_n^{(k)}$ is the $n$-th training vector of person $k$ and $N_k$ is the number of training vectors of person $k$. Then, a Gaussian classifier can be constructed by using the parameters $\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$.
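A minimal sketch of this training step, assuming the training vectors are stored in an (N, D) array `X` with the person labels in `y` (both names are hypothetical):

```python
import numpy as np

# Sketch of Q1(c): estimate the mean vector and covariance matrix of each
# person from that person's training vectors.
def train_gaussian_classifier(X, y):
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]                          # training vectors of person k
        mu_k = Xk.mean(axis=0)                  # mean vector of person k
        Sigma_k = np.cov(Xk, rowvar=False)      # covariance matrix of person k
        params[int(k)] = (mu_k, Sigma_k)
    return params
```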
Q1(d) Are the decision boundaries produced by this Gaussian classifier linear or non-linear? Briefly explain your answer.
The decision boundaries are linear because the covariance matrices of all classes (Gaussian models) are the same.
Q1(e) Assume that all of the fingerprint vectors are independent. Describe how you would use principal component analysis (PCA) to reduce the dimension of the vectors. State the equation for projecting $\mathbf{x}$ to a P-dimensional vector $\mathbf{y}$.
Recall the procedure of PCA. Given a set of training vectors, if we want to find a lower-dimensional subspace:
- Let $\mathbf{u}_1$ be the vector representing the first axis of the subspace.
- Let $\mathbf{x}_1, \dots, \mathbf{x}_N$ be the training vectors.
- The mean of the projected data is $\mathbf{u}_1^{\top}\bar{\mathbf{x}}$, where $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$ (note that $\mathbf{u}_1^{\top}$ is the transpose of $\mathbf{u}_1$).
In this case, we apply eigen-decomposition on the global covariance matrix $\mathbf{S}$ to obtain eigenvectors with decreasing eigenvalues. Then, we select the first $P$ eigenvectors with the largest eigenvalues to form a transformation matrix $\mathbf{U} = [\mathbf{u}_1, \dots, \mathbf{u}_P]$. Then, we project every fingerprint vector as follows:
$$\mathbf{y}_n = \mathbf{U}^{\top}(\mathbf{x}_n - \bar{\mathbf{x}}).$$
Note that this method is impractical when the dimension of $\mathbf{x}$ is too high.
Q1(f) Denote $P(\text{Person } k)$ as the prior probability of Person $k$. Describe how you would use the Gaussian classifier and Bayes' theorem to classify an unknown fingerprint vector $\mathbf{x}$. Hint: You may use Bayes' theorem to compute the posterior probability of Person $k$ given $\mathbf{x}$.
The posterior probability of Person $k$ given $\mathbf{x}$ is given by
$$P(\text{Person } k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \text{Person } k)\, P(\text{Person } k)}{\sum_{j=1}^{K} p(\mathbf{x} \mid \text{Person } j)\, P(\text{Person } j)},$$
where $p(\mathbf{x} \mid \text{Person } k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the output of the $k$-th Gaussian model in the Gaussian classifier. Then, we assign $\mathbf{x}$ to Person $k^*$ according to
$$k^* = \arg\max_{k} P(\text{Person } k \mid \mathbf{x}).$$
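A sketch of this decision rule, reusing the hypothetical `params` dictionary from the training sketch above and a `priors` mapping for the prior probabilities (SciPy's `multivariate_normal` supplies the Gaussian likelihoods):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of Q1(f): Bayes classification with the per-person Gaussian models.
# params: k -> (mu_k, Sigma_k); priors: k -> P(Person k).
def classify(x, params, priors):
    scores = {}
    for k, (mu_k, Sigma_k) in params.items():
        likelihood = multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
        scores[k] = likelihood * priors[k]      # unnormalised posterior
    return max(scores, key=scores.get)          # arg max over persons
```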
Q2(a) Express $\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ in terms of $\mathbf{x}_n$, the mean vector $\boldsymbol{\mu}_k$, and the covariance matrix $\boldsymbol{\Sigma}_k$.
Q2(b) Assume that there is a special sample $\mathbf{x}_s$ that is very far away from all of the mean vectors $\boldsymbol{\mu}_k$'s. Use your answer in Q2(a) to explain the numerical problem that may occur when computing the log-likelihood $\log p(\mathcal{X} \mid \Lambda)$.
When $\mathbf{x}_s$ is far away from $\boldsymbol{\mu}_k$ for all $k$, $\mathcal{N}(\mathbf{x}_s \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ will tend to 0 for all $k$. As a result, $p(\mathbf{x}_s \mid \Lambda) \approx 0$. Then the resulting $\log p(\mathbf{x}_s \mid \Lambda)$ will be $-\infty$, i.e., numerical underflow occurs.
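The sketch below reproduces this underflow with toy two-component mixture parameters (chosen purely for illustration) and shows the usual remedy of working in the log domain with the log-sum-exp trick:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# Toy 2-component GMM; the parameter values are assumptions for illustration.
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
x_far = np.full(2, 100.0)                       # a sample far from both means

# Naive likelihood: every component density underflows to 0, so log gives -inf.
p = sum(pi * multivariate_normal.pdf(x_far, mean=mu, cov=S)
        for pi, mu, S in zip(pis, mus, Sigmas))
with np.errstate(divide="ignore"):
    print(np.log(p))                            # -inf (numerical underflow)

# Working in the log domain avoids the underflow.
log_terms = [np.log(pi) + multivariate_normal.logpdf(x_far, mean=mu, cov=S)
             for pi, mu, S in zip(pis, mus, Sigmas)]
print(logsumexp(log_terms))                     # finite log-likelihood
```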
Q2(c) If and , what is the minimum number of training samples for the EM algorithm to compute the parameters in $\Lambda$?
We need to ensure that the number of training samples is larger than the feature dimension (N > D) in order to avoid the curse of dimensionality.
Minimum number of independent training samples is .
Q2(d) If and $\boldsymbol{\Sigma}_k$ is a diagonal matrix, what is the minimum number of training samples for the EM algorithm to compute the parameters in $\Lambda$?
For diagonal covariance matrices, the case is different because only the variances on the diagonal need to be estimated.
Minimum number of independent training samples is .
Q3(a) Sketch the distribution (histogram) of the 407th pixel as indicated by the arrow shown in Fig. Q3. You may not need to specify the actual values in the y-axis in your histogram. But your answer should specify the range of the x-axis.
Note that the data are clean, so the 407th pixel can only take the values 255 or 0 (white or black). The histogram therefore consists of two sharp peaks at 0 and 255, and the x-axis ranges from 0 to 255.
Q3(b) Explain why Gaussian classifiers are not appropriate for classifying the digits in this set of data.
The histograms of the pixels do not follow a Gaussian distribution. The peaks are so sharp (each pixel takes only the values 0 and 255) that they cannot be modeled well even by a GMM. Therefore, Gaussian density functions cannot fit the distribution well.
Q3(c) How would you process the data so that they become suitable for being classified by Gaussian classifiers?
Add Gaussian noise to the data so that the histograms of pixels look more like a Gaussian distribution. Alternatively, you may perform PCA to reduce the dimension and hope that the first few principal components have sufficient variance to produce a non-singular covariance matrix.
Q4. Assume that you have built a Gaussian classifier for a two-class problem. Assume also that the means and covariance matrices in your Gaussian classifier are as follows:
Class 1: mean $\boldsymbol{\mu}_1 = (1, 1)^{\top}$ and covariance $\boldsymbol{\Sigma}_1$ (isotropic with spread 1).
Class 2: mean $\boldsymbol{\mu}_2 = (-1, -1)^{\top}$ and covariance $\boldsymbol{\Sigma}_2$ (isotropic with spread 2).
Assume that both classes have equal prior, i.e., $P(C_1) = P(C_2) = 0.5$.
Q4(a) Mark the location of the class means and sketch the decision boundary of the two classes, i.e., plot the locus of $P(C_1 \mid \mathbf{x}) = P(C_2 \mid \mathbf{x})$, where $\mathbf{x}$ are random vectors from either Class 1 or Class 2.
Note:
- The mean determines the center of the distribution.
- The covariance matrix determines the size (spread) of the distribution.
Therefore:
For the distribution of class 1:
- the mean is at (1, 1)
- the spread is 1 in both the horizontal and vertical directions
For the distribution of class 2:
- the mean is at (-1, -1)
- the spread is 2 in both the horizontal and vertical directions
Q4(b)
For distributions with arbitrary (unequal) covariance structures, the decision boundaries are hyperquadrics (curved surfaces such as circles or ellipses) rather than straight lines.
Therefore:
For the distribution of class 1:
- the mean is at (0, 0)
- the spread is 1 in the horizontal direction and 2 in the vertical direction
For the distribution of class 2:
- the mean is at (1, 1)
- the spread is 1 in both the horizontal and vertical directions
Q5(a)
As none of the central pixels (14, 14) in the 5,000 training samples of Digit '0' contains a stroke, there is no intensity in the range 0.6 to 1 (the stroke-level intensity). Moreover, because the digits were contaminated by Gaussian noise, the frequency of exactly 0 intensity is also 0. Hence the histogram looks like a Gaussian PDF centred near (but not exactly at) 0, and it does not extend into the stroke-level range of 0.6 to 1.
Q5(b)
Recall the formulation of a Gaussian/GMM classifier:
Given a data set comprising the samples of K classes, we partition the data into K class-dependent subsets $\mathcal{X}_1, \dots, \mathcal{X}_K$. We use the samples in Class $k$ to train the $k$-th model $\Lambda_k$. During classification, given an unknown vector $\mathbf{x}$, we use Bayes' theorem to decide its class label $k^*$:
$$k^* = \arg\max_{k} P(C_k \mid \mathbf{x}) = \arg\max_{k} p(\mathbf{x} \mid \Lambda_k)\, P(C_k).$$
Therefore:
We first convert the training images into vectors by stacking the pixels, which results in 5,000 training vectors per digit. Then, we train 10 Gaussian models (density functions) with mean vectors $\boldsymbol{\mu}_k$ and covariance matrices $\boldsymbol{\Sigma}_k$, $k = 0, \dots, 9$, one for each digit. A Gaussian classifier is formed from these 10 models.
Q5(c)
Given a test vector $\mathbf{x}$, the Gaussian classifier determines the class label of $\mathbf{x}$ by the following formula:
$$k^* = \arg\max_{k} p(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, P(\text{Digit } k),$$
where $p(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the likelihood of $\mathbf{x}$ given the Gaussian model of Digit $k$. The prior probability is obtained from the proportion of training samples of Digit $k$, i.e., $P(\text{Digit } k) = 5{,}000 / 50{,}000 = 0.1$ for all $k$.
Q5(d)
No. This is because 100 samples cannot produce a full covariance matrix with a valid inverse: the rank of the 784 × 784 covariance matrix estimated from 100 samples is at most 99, so it is singular.
Q5(e)
Change the full covariance matrices to diagonal covariance matrices. Each element on the diagonal is now estimated from 100 samples, so the variance estimates are reliable and will not be 0.
A diagonal matrix is invertible as long as none of its diagonal terms is zero; because the data are contaminated by noise, every variance is a positive, non-zero number, so the diagonal covariance matrices have valid inverses.
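A small sketch contrasting Q5(d) and Q5(e), with random noise standing in for the digit data (100 samples of dimension 784 are assumed):

```python
import numpy as np

# With 100 samples of 784-dim data, the full covariance matrix is singular,
# but the diagonal covariance (per-pixel variances) is not.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 784))                 # 100 noisy training vectors

Sigma_full = np.cov(X, rowvar=False)            # 784 x 784, rank <= 99
print(np.linalg.matrix_rank(Sigma_full))        # far below 784 -> singular

variances = X.var(axis=0, ddof=1)               # 784 per-dimension variances
Sigma_diag = np.diag(variances)                 # diagonal covariance matrix
print(np.all(variances > 0))                    # True -> invertible
```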
Q5(f)
When the number of training samples per digit is 1, using diagonal covariance matrices will not help because the variances are still 0. In such a situation, we may use a common diagonal covariance matrix for all of the 10 digits so that each diagonal element is estimated from 10 numbers. Then, the variances will not be 0. A better solution is NOT to use a Gaussian classifier at all, e.g., to use a linear SVM instead.
SVM
The output of a support vector machine (SVM) is given by
$$f(\mathbf{x}) = \sum_{i \in \mathcal{S}} \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b,$$
where $\mathbf{x}$ is an input vector, $\mathcal{S}$ comprises the indexes to a set of support vectors, $b$ is a bias term, the $\alpha_i$'s are SVM parameters, and $K(\mathbf{x}, \mathbf{x}_i)$ is a kernel.
Q1(a) If $K(\mathbf{x}, \mathbf{x}_i)$ is a radial basis function (RBF) kernel, express $f(\mathbf{x})$ in terms of $\mathbf{x}$, $\mathbf{x}_i$, and the kernel width $\sigma$.
Note:
- For a linear kernel, $K(\mathbf{x}, \mathbf{x}_i) = \mathbf{x}^{\top}\mathbf{x}_i$.
- For an RBF kernel, $K(\mathbf{x}, \mathbf{x}_i) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2}\right)$.
- For a polynomial kernel, $K(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x}^{\top}\mathbf{x}_i + 1)^d$.
Therefore, if $K(\mathbf{x}, \mathbf{x}_i)$ is a radial basis function (RBF) kernel,
$$f(\mathbf{x}) = \sum_{i \in \mathcal{S}} \alpha_i \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2}\right) + b.$$
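A minimal sketch of evaluating this RBF-SVM output; `support_vecs`, `alphas`, and `b` are assumed to come from an already-trained SVM:

```python
import numpy as np

# Sketch of Q1(a): evaluate f(x) for an SVM with an RBF kernel.
def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * sigma ** 2))

def svm_output(x, support_vecs, alphas, b, sigma=1.0):
    return sum(a * rbf_kernel(x, sv, sigma)
               for a, sv in zip(alphas, support_vecs)) + b
```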
Q1(b) Explain why $f(\mathbf{x})$ is a nonlinear function of $\mathbf{x}$ if $K(\mathbf{x}, \mathbf{x}_i)$ is an RBF kernel.
Because
$$\|\mathbf{x} - \mathbf{x}_i\|^2 = \mathbf{x}^{\top}\mathbf{x} - 2\mathbf{x}^{\top}\mathbf{x}_i + \mathbf{x}_i^{\top}\mathbf{x}_i,$$
which is quadratic in $\mathbf{x}$. Also, the exponential function is non-linear.
Therefore, $f(\mathbf{x})$ is a nonlinear function of $\mathbf{x}$ when $K(\mathbf{x}, \mathbf{x}_i)$ is an RBF kernel.
Q1(c)
Q1(d) Repeat Q1(c) when .
Answer to Q1(c) and Q1(d):
Key point:
- When $\sigma$ is significantly larger than the range of the data on the 2-D plane, the RBF kernel acts like a linear kernel, so the decision boundary is almost linear.
- When $\sigma$ is significantly small, the SVM overfits the data: the decision boundary tightly wraps around individual training samples.
Q2(a) Assume that you train a set of one-versus-rest support vector machines (SVMs) to construct an SVM classifier for classifying the digits. Draw a block diagram illustrating the architecture of your SVM classifier.
Q2(b) Suggest the most appropriate kernel function (i.e., linear, polynomial or radial basis function (RBF)) for the SVMs. Briefly explain your choice.
- D = 28 x 28 = 784
- N = 10 x 10 = 100
In this case, the feature dimension is larger than the number of training samples (N < D), so a hyperplane can perfectly separate the training data into two classes. The linear kernel is the most appropriate.
Q3(a) What are the possible values of $y_i$?
As the SVM classifies the data into two classes, $y_i$ will be either −1 or +1.
Q3(b) Suggest an appropriate kernel for the SVM. Briefly explain your answer.
- D = 1000
- N = 10
In this case, the feature dimension is larger than the number of training samples (N < D), so a hyperplane can perfectly separate the training data into two classes. The linear kernel is the most appropriate.
Q3(c) Can Fisher discriminant analysis (FDA) solve this classification problem? Briefly explain your answer.
FDA cannot solve this problem because the rank of the within-class covariance matrix $\mathbf{S}_W$ is at most $N - 2 = 8$ for a two-class problem, which is significantly smaller than $D = 1000$. As a result, $\mathbf{S}_W$ does not have a valid inverse $\mathbf{S}_W^{-1}$.
Q4
Q4(a) If $\mathbf{x}$ is a two-dimensional vector, what is the dimension of $\boldsymbol{\phi}(\mathbf{x})$?
If $\mathbf{x} = (x_1, x_2)^{\top}$ is a two-dimensional vector, the dimension of $\boldsymbol{\phi}(\mathbf{x})$ is 6 because $\boldsymbol{\phi}(\mathbf{x})$ has 6 terms.
Q4(b) If $\mathbf{x} = (x_1, x_2)^{\top}$, express $\boldsymbol{\phi}(\mathbf{x})$ in a vector form in terms of $x_1$ and $x_2$.
Denote $\mathbf{x} = (x_1, x_2)^{\top}$ and $\mathbf{z} = (z_1, z_2)^{\top}$. Then,
$$K(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^{\top}\mathbf{z})^2 = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2 = \boldsymbol{\phi}(\mathbf{x})^{\top}\boldsymbol{\phi}(\mathbf{z}).$$
Therefore
$$\boldsymbol{\phi}(\mathbf{x}) = \left(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\right)^{\top}.$$
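The identity above can be checked numerically. The sketch below assumes the degree-2 polynomial kernel $(1 + \mathbf{x}^{\top}\mathbf{z})^2$ and the 6-term feature map derived above:

```python
import numpy as np

# Check that (1 + x.z)^2 equals the inner product of the 6-term feature map.
def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print((1 + x @ z) ** 2)            # kernel value
print(phi(x) @ phi(z))             # same value via the feature map
```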
Q4(c) Explain why the decision planes formed by linear SVMs are linear.
The decision function of a linear SVM is given by
$$f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b,$$
where the SVM weight vector is
$$\mathbf{w} = \sum_{i \in \mathcal{S}} \alpha_i \mathbf{x}_i.$$
Note that $f(\mathbf{x})$ is linear in $\mathbf{x}$.
Q4(d) Explain why the decision planes formed by polynomial SVMs are nonlinear.
Because in polynomial SVMs, $K(\mathbf{x}, \mathbf{x}_i)$ involves the squares and cross-products of the elements in $\mathbf{x}$, so $f(\mathbf{x})$ is non-linear in $\mathbf{x}$.
Q4(e) Given the following 2-class problem in 2-D space, sketch the decision boundaries produced by
- (i) a polynomial SVM with
- (ii) an RBF-SVM for .
- (iii) Can the problem be solved by a linear SVM?
(i)
(ii)
(iii)
The problem cannot be solved by linear SVMs because there is no straight line that can separate the two classes without making substantial errors.
Q5 Explain why it is always possible to use a plane (linear SVM) to divide 3 linearly independent samples into two classes. What happens when the 3 samples are linearly dependent?
When the 3 samples are not linearly dependent, they span a plane in the 3-D space, i.e., they become the corners of a triangle. As a result, a linear SVM can classify the 3 points perfectly.
On the other hand, when the 3 points are co-linear (linearly dependent), they form a line in the 3-D space. As we now have more points (N = 3) than the dimension of the feature vectors (D = 2), we need a non-linear classifier such as an RBF-SVM.
K-means clustering
Q1
Randomly pick K samples from the training data and consider them as the initial centroids. In this question, K = 2.
- Two of the training samples are chosen as the centroids.
In the first iteration, we first use the centroids to assign each sample to a cluster. Then, we use the clusters to re-compute the centroids. In the next iteration, we do the same: use the centroids to assign the clusters, then re-compute the centroids.
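A minimal sketch of these two alternating steps (the array `X`, the fixed number of iterations, and the random initialisation are illustrative assumptions; empty clusters are not handled):

```python
import numpy as np

# Sketch of K-means with K = 2: assign samples to the nearest centroid,
# then re-compute each centroid as the mean of its cluster.
def kmeans(X, K=2, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every sample to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the samples assigned to each cluster.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return centroids, labels
```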
Q2
Denote $\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ as a set of training vectors. The K-means algorithm aims to partition $\mathcal{X}$ into $K$ disjoint sets $\mathcal{X}_1, \dots, \mathcal{X}_K$, where $\mathcal{X} = \bigcup_{k=1}^{K}\mathcal{X}_k$, such that the following mean squared error is minimized:
$$E = \sum_{k=1}^{K}\sum_{\mathbf{x}_n \in \mathcal{X}_k} \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2,$$
where
$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{\mathbf{x}_n \in \mathcal{X}_k}\mathbf{x}_n,$$
where $N_k$ is the number of vectors in $\mathcal{X}_k$.
Q2(a) Given a Gaussian mixture model $\Lambda = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$ with $K$ mixture components, where $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$ denote the mixture coefficient, mean vector, and covariance matrix of the k-th Gaussian, respectively, under what conditions does the EM algorithm for estimating $\Lambda$ become the K-means algorithm? Briefly explain your answer.
Hints: You may answer this question by stating the EM algorithm for GMMs and specifying the conditions for the posterior probabilities of the mixture components and the covariance matrices $\boldsymbol{\Sigma}_k$'s such that the EM algorithm reduces to the K-means algorithm.
EM becomes K-means when the following conditions (C1 and C2) are satisfied.
C1: The posterior probability of each mixture component for each sample is either 0 or 1, i.e., $\gamma_{nk} \in \{0, 1\}$ (hard assignment).
C2: All covariance matrices are identical and isotropic, i.e., $\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}$ for all $k$.
Condition C1 means that each sample can belong to one and only one cluster, which is what K-means is about. Plugging Condition C1 into the EM mean-update equation, we obtain the centroid-update equation of K-means.
Condition C2 makes the comparison between each sample and the centroids based on Euclidean distance. It also changes the maximization of the log-likelihood of the training data into the minimization of the total Euclidean distance, as in the objective function of K-means.
Q3
Instead of assigning $\mathbf{x}_n$ to only one cluster, each vector is assigned to multiple clusters, with the posterior probability of $\mathbf{x}_n$ being in the k-th cluster equal to
$$\gamma_{nk} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.$$
Because $0 \le \gamma_{nk} \le 1$, the assignment decisions are "soft".
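A small sketch of this soft assignment, with toy two-component mixture parameters assumed for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Posterior probability (responsibility) of each sample for each component.
def soft_assign(X, pis, mus, Sigmas):
    resp = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                            for pi, mu, S in zip(pis, mus, Sigmas)])
    return resp / resp.sum(axis=1, keepdims=True)   # each row sums to 1

X = np.random.default_rng(0).normal(size=(5, 2))
gamma = soft_assign(X, [0.5, 0.5], [np.zeros(2), np.ones(2)], [np.eye(2)] * 2)
print(gamma)        # soft memberships of the 5 samples across the 2 clusters
```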