Common approaches in Dimension Reduction
Curse-of-dimensionality problem
- The required number of training samples increases exponentially with the feature dimensionality.
- The inverse of the covariance matrix exists only if the number of data samples is larger than the feature dimension (i.e., N > D).
- If the number of data samples is not larger than the feature dimension (N ≤ D), the inverse of the covariance matrix does not exist, because there are not enough data to estimate the covariance matrix.
- For example, given two 3-dimensional vectors (N = 2 < D = 3), the covariance matrix of these 2 vectors is not invertible. This is because there are not enough data to estimate an invertible covariance matrix.
This problem also affects the classifiers that we can train.
- If the number of training samples is larger than the feature dimension (N > D), we can train a GMM with full covariance matrices.
- If the number of training samples is smaller than the feature dimension (N < D), we cannot train a GMM with full covariance matrices. This is because the data samples cannot produce a full covariance matrix with a valid inverse (curse of dimensionality); see the sketch below.
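A minimal numpy sketch of this point (the data values are arbitrary; only the shapes matter): with N = 2 samples in D = 3 dimensions, the sample covariance matrix is rank-deficient and has no inverse.

```python
import numpy as np

# Toy example: N = 2 samples in D = 3 dimensions (N < D); values are arbitrary.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0]])                     # shape (N, D)

S = np.cov(X, rowvar=False)                         # 3 x 3 sample covariance matrix
print("rank of S:", np.linalg.matrix_rank(S))       # at most N - 1 = 1, so S is singular
print("det  of S:", np.linalg.det(S))               # ~0, so S has no valid inverse

# np.linalg.inv(S) would fail (or return garbage due to round-off), which is why
# a GMM with full covariance matrices cannot be trained when N <= D.
```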
Dimension Reduction
Why Dimension Reduction?
- High-dim feature vectors can easily cause the curse-of-dimensionality problem.
- Some feature elements in the feature vectors might be:
- Irrelevant to the classification task that we want to do
- Redundant, i.e. highly correlated with each other
Principal Component Analysis (PCA)
A classical approach to reducing dimension is to find a small number of axes in which the data have the highest variability.
The axes may not be parallel to the original axes.
E.g., projecting the data from a 3-D space to a 2-D space.
- PCA is mainly used for dimension reduction
- PCA can project data onto at most a D-dimensional space, where D is the dimension of the feature vectors
How PCA works
- PCA uses unsupervised learning (Eigen-decomposition of data covariance matrix) to find its parameters (projection directions)
- To project onto M-dimensional space, we find the M eigenvectors whose corresponding eigenvalues are the largest.
- The idea of PCA is that any data can be represented by a weighted sum of the first M eigenvectors plus an offset (the global mean).
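As a worked equation for the last point (standard PCA notation; the weights $a_m$ are introduced here for illustration): with global mean $\bar{\mathbf{x}}$ and the top-M eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_M$,

$$\mathbf{x} \approx \bar{\mathbf{x}} + \sum_{m=1}^{M} a_m \mathbf{u}_m, \qquad a_m = \mathbf{u}_m^\top(\mathbf{x} - \bar{\mathbf{x}}),$$

where the weights $a_m$ are the PCA-projected coordinates of $\mathbf{x}$.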
Procedure
Given a set of training vectors, if we want to find the first axis of the subspace:
- Let $\mathbf{u}_1$ be the unit vector representing the first axis of the subspace
- Let $\mathbf{x}_1, \ldots, \mathbf{x}_N$ be the training vectors, with global mean $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$
- The mean of the projected data is $\mathbf{u}_1^\top\bar{\mathbf{x}}$ (Note $\mathbf{u}_1^\top$ is the transpose of $\mathbf{u}_1$)
- The variance of the projected data is $\frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{u}_1^\top\mathbf{x}_n - \mathbf{u}_1^\top\bar{\mathbf{x}}\right)^2 = \mathbf{u}_1^\top\mathbf{S}\mathbf{u}_1$, where $\mathbf{S}$ is the data covariance matrix
- Then we find the best $\mathbf{u}_1$ that maximizes this variance, subject to $\mathbf{u}_1^\top\mathbf{u}_1 = 1$
- Skipping some math on constrained optimization (a Lagrange multiplier), we get the solution: $\mathbf{S}\mathbf{u}_1 = \lambda_1\mathbf{u}_1$
- ^Where $\lambda_1$ is the eigenvalue.
- So, the best axis $\mathbf{u}_1$ is an eigenvector of the covariance matrix $\mathbf{S}$.
- This eigenvector has the largest eigenvalue among all the eigenvalues, since the variance of the projected data is $\mathbf{u}_1^\top\mathbf{S}\mathbf{u}_1 = \lambda_1$.
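A minimal numpy sketch of this procedure (function and variable names are illustrative, not from the notes): compute the covariance matrix, eigen-decompose it, and keep the M eigenvectors with the largest eigenvalues.

```python
import numpy as np

def pca_fit(X, M):
    """Return the global mean and the top-M principal axes of X (shape N x D)."""
    x_bar = X.mean(axis=0)                        # global mean
    S = np.cov(X, rowvar=False)                   # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)          # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues, largest first
    U = eigvecs[:, order[:M]]                     # D x M projection matrix
    return x_bar, U

def pca_project(X, x_bar, U):
    """Project mean-subtracted data onto the M-dimensional PCA subspace."""
    return (X - x_bar) @ U                        # N x M projected vectors

# Toy example with random data (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
x_bar, U = pca_fit(X, M=2)
Z = pca_project(X, x_bar, U)
print(Z.shape)                                    # (100, 2)
```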
Limitations of PCA
- PCA will fail if the subspace is non-linear.
- For a non-linear subspace, we might use a non-linear embedding such as kernel PCA, ISOMAP, or a DNN (autoencoder); see the sketch below.
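A hedged sketch of the kernel-PCA option using scikit-learn, assuming scikit-learn is available; the RBF kernel and the gamma value are arbitrary illustrative choices:

```python
# A possible non-linear alternative to plain PCA (sketch, not from the notes).
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a non-linear "subspace" that linear PCA cannot unfold.
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)
print(Z.shape)    # (200, 2)
```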
Fisher Discriminant Analysis (FDA)
FDA is a classification method to separate data into two classes.
- But FDA could also be considered as a supervised dimension reduction method that reduces the dimension to 1.
- The data are projected onto a line
Example:
How FDA works
- The idea of FDA is to find a 1-D line so that the projected data give a large separation between the class means while also giving a small variance within each class, thereby minimizing the class overlap.
Skipping lots of math, the projection direction is given by $\mathbf{w} \propto \mathbf{S}_w^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$
Where $\mathbf{S}_w$ is the within-class covariance matrix and $\mathbf{m}_1, \mathbf{m}_2$ are the class means
Given a vector $\mathbf{x}$, the projected scalar is $y = \mathbf{w}^\top\mathbf{x}$
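A minimal numpy sketch of the two-class FDA projection, assuming the within-class scatter is the sum of the two class scatter matrices (names and data are illustrative):

```python
import numpy as np

def fda_direction(X1, X2):
    """Fisher discriminant direction for two classes (rows are samples)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of the two classes
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m2 - m1)      # w proportional to S_w^{-1} (m2 - m1)
    return w / np.linalg.norm(w)

# Toy 2-class example (illustrative data only)
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
X2 = rng.normal(loc=[3.0, 2.0], size=(50, 2))
w = fda_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w                    # projected scalars y = w^T x
print(w, y1.mean(), y2.mean())
```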
Linear Discriminant Analysis (LDA)
- LDA is a generalization of FDA. It is typically used as a supervised classifier.
- LDA also allows us to project data to low dimensional space
- LDA can only project data onto at most a (K − 1)-dimensional subspace, where K is the number of classes.
How LDA works
Skipping a lot of math, the projection matrix of LDA is $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_{K-1}]$
$\mathbf{W}$ comprises the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$
Where $\mathbf{S}_w$ is the within-class covariance matrix and $\mathbf{S}_b$ is the between-class covariance matrix.
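A minimal numpy sketch of this eigen-decomposition (the scatter-matrix definitions follow the usual LDA formulation; names and data are illustrative):

```python
import numpy as np

def lda_projection_matrix(X, y, M):
    """Top-M LDA directions (M <= K - 1). X: N x D data, y: N class labels."""
    classes = np.unique(y)
    D = X.shape[1]
    mean_all = X.mean(axis=0)
    S_w = np.zeros((D, D))                                # within-class scatter
    S_b = np.zeros((D, D))                                # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)
    # Eigen-decomposition of S_w^{-1} S_b (S_w may need regularization if singular)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1]                # largest eigenvalues first
    return eigvecs[:, order[:M]].real                     # D x M projection matrix

# Toy 3-class example (illustrative data only)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, size=(30, 3)) for m in ([0, 0, 0], [4, 0, 0], [0, 4, 0])])
y = np.repeat([0, 1, 2], 30)
W = lda_projection_matrix(X, y, M=2)
print((X @ W).shape)                                      # (90, 2)
```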
Limitations of LDA
- LDA can only project data onto at most a (K − 1)-dimensional subspace, where K is the number of classes.
- LDA assumes that the covariance matrices of all classes are the same. If the distributions of different classes are significantly non-Gaussian, the LDA projection will not be able to preserve any complex structure of the data, which may lead to misclassification. For example, LDA will fail in the following 2-class problems.
- E.g., when the class means overlap: the between-class scatter is then (almost) zero, so LDA cannot find a discriminative projection direction.
Exercise Question Set
Summary
PCA: mainly used for dimension reduction
- PCA can project data onto at most a D-dimensional space, where D is the dimension of the feature vectors, in an unsupervised manner (Eigen-decomposition of the data covariance matrix)
- PCA will fail if the subspace is non-linear.
FDA: find a 1-D line so that the projected data give a large separation between the class means while also giving a small variance within each class, thereby minimizing the class overlap.
LDA: mainly used for classification, generalization of FDA
- LDA uses supervised learning (Eigen-decomposition of $\mathbf{S}_w^{-1}\mathbf{S}_b$, which requires labelled data to compute).
- LDA can only project data onto at most a (K − 1)-dimensional subspace, where K is the number of classes.
Q1
Q1 You are given a dataset comprising the facial images of ten persons. Each person has 100 images and each image has size 128×128 pixels. Assume that for each image, a facial vector is obtained by stacking the columns of the image.
(a) What will be the dimension of the resulting vectors?
- 128x128 = 16384
(b) Assume that you train a set of one-versus-rest support vector machines (SVMs) to construct an SVM classifier for classifying the ten persons. Draw a block diagram illustrating the architecture of your SVM classifier.
(c) Suggest the most appropriate kernel function for the SVMs. Briefly explain your choice.
The linear kernel is the most appropriate. This is because the feature dimension (16384) is larger than the number of training samples (1000).
- When the feature dimension is larger than the number of training samples (N < D), a hyperplane can already perfectly separate the training data into two classes, so a non-linear kernel is unnecessary.
- Rule of thumb: use a linear kernel when the feature dimension is larger than the number of training samples (N < D). A sketch of the resulting classifier is shown below.
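A possible sketch of the one-versus-rest SVM classifier from parts (b) and (c), using scikit-learn; the random data are a smaller stand-in for the real facial vectors:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# Smaller random stand-in for the facial vectors (10 persons, 20 images each here;
# the question has 100 images each). Each vector is 128 x 128 = 16384-dim.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16384))
y = np.repeat(np.arange(10), 20)          # person labels 0..9

# One linear SVM per person (one-versus-rest); the class with the largest
# decision score wins.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(X, y)
print(clf.predict(X[:5]))
```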
(d) State two disadvantages of converting a facial image to a facial vector by stacking its columns.
- Stacking the columns will destroy the local spatial information of the images.
- This method produces feature vectors of very large dimension.
(e) Assume that principal component analysis (PCA) is applied to project the facial vectors onto a lower dimensional space. What will be the maximum dimension of the PCA-projected vectors? You may assume that all of the images (facial vectors) are independent.
- The maximum dimension is 999. The covariance matrix is computed from 1000 mean-subtracted facial vectors, so its rank is at most 1000 − 1 = 999; the 1000-th eigenvalue is 0.
Q2
2(a) A dataset comprises 1,000 facial images of size 100×200. What is the maximum number of valid eigenfaces that could be obtained from this dataset? Briefly explain your answer and state the assumption(s) that you have made.
- The maximum number of eigenfaces is 999. Because the feature dimension (20000) > the number of training vectors (1000), the maximum rank of the covariance matrix is 1000 − 1 = 999. The assumption is that these 1000 images are all different and independent.
2(b) You are given a dataset comprising the images of 10,000 handwritten digits, where each digit has 1,000 samples. Assume that the image size is 28 × 28. Discuss how you will use linear discriminant analysis (LDA) and support vector machines (SVMs) to construct a handwritten digit recognizer. What are the pros and cons of using LDA in this application?
- We first convert the images into 784-dimensional vectors by stacking the columns (or rows) of the images. Then, we compute the within-class scatter matrix $\mathbf{S}_w$ and the between-class scatter matrix $\mathbf{S}_b$. Next, we compute the LDA projection matrix by finding the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$. Finally, we project each vector by the LDA projection matrix and train 10 one-vs-rest SVMs (one for each digit) on the LDA-projected vectors; see the sketch below.
- Pros: The LDA-projected vectors are class discriminative because the projection matrix is found by maximizing the between-class scatter and minimizing the within-class scatter.
- Cons: The maximum dimension of the LDA-projected vectors is 9, which may be too small for the SVMs.
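A possible sketch of the LDA + one-vs-rest SVM pipeline using scikit-learn (the random data and reduced sample count are stand-ins for the real digit images):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Smaller random stand-in for the digit images (100 samples per digit here;
# the question has 1,000). Each 28 x 28 image is stacked into a 784-dim vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))
y = np.repeat(np.arange(10), 100)

# LDA projects onto at most K - 1 = 9 dimensions; the one-vs-rest SVMs are then
# trained on the 9-dim LDA-projected vectors.
model = make_pipeline(
    LinearDiscriminantAnalysis(n_components=9),
    OneVsRestClassifier(SVC(kernel="rbf")),
)
model.fit(X, y)
print(model.predict(X[:5]))
```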
Q3
3a) Assign the following covariance matrices to the clusters
We can tell by the signs of the off-diagonal (covariance) elements: a negative off-diagonal element corresponds to a cluster elongated along a negatively sloped direction, and a positive one to a positively sloped direction.
3b) Draw on the 2-D scatter plot the first two LDA projection directions. Given:
- the first two eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ are:
- the global mean of the data is
Note: $\mathbf{S}_w$ is the within-class covariance matrix and $\mathbf{S}_b$ is the between-class covariance matrix.
3c) Sketch the probability density of the data projected onto $\mathbf{w}_1$ and $\mathbf{w}_2$, respectively.
Just draw Gaussian distributions along each projection axis.
3d) What is the value of the eigenvalue corresponding to $\mathbf{w}_3$?
Zero. The maximum number of projection directions in LDA is K − 1 (number of classes minus one), which equals 3 − 1 = 2, so the third eigenvalue is zero.
3e) Which of the eigenvalues is the biggest?
Q4
- State the key differences between PCA and LDA.
- In the context of applications, PCA is mainly used for dimension reduction, whereas LDA is mainly used for classification.
- PCA can project data onto at most a D-dimensional space, where D is the dimension of the feature vectors, whereas LDA can project data onto at most K − 1 dimensions, where K is the number of classes.
- PCA uses unsupervised learning (Eigen-decomposition of the data covariance matrix) to find its parameters (projection directions), whereas LDA uses supervised learning (Eigen-decomposition of $\mathbf{S}_w^{-1}\mathbf{S}_b$). Note that both $\mathbf{S}_w$ and $\mathbf{S}_b$ require data labels to compute.
- Note that LDA is a generalization of FDA, whose job is to find a 1-D line so that the projected data give a large separation between the class means while also giving a small variance within each class, thereby minimizing the class overlap.
Q5
- What is the relationship between the eigenvalues and the variances of projected data?
The eigenvalues are equal to the variances of the projected data.
More specifically, $\lambda_i = \sigma_i^2$,
where $\sigma_i$ is the standard deviation of the data projected onto the $i$-th eigenvector $\mathbf{u}_i$.
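A quick numerical check of this relationship with numpy (random correlated data for illustration):

```python
import numpy as np

# Numerically check that each eigenvalue of the covariance matrix equals the
# variance of the data projected onto the corresponding eigenvector.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated toy data

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)                      # columns of eigvecs are u_i

for lam, u in zip(eigvals, eigvecs.T):
    projected = X @ u
    print(lam, projected.var(ddof=1))                     # lambda_i == sigma_i^2
```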
Q6
6a) What is the maximum number of eigenfaces (with non-zero eigenvalues) that you can possibly obtain from this dataset?
Dataset:
- 10 classes
- each class has 20 images
- Total number of images = 10 × 20 = 200
Since the covariance matrix is computed from 200 mean-subtracted facial vectors, its rank is at most 200 − 1 = 199, so the eigenvalue of the 200-th eigenface is (almost) 0 in practice.
Therefore, the maximum number of eigenfaces (with non-zero eigenvalues) = 200 − 1 = 199.
6b) Which of the images in Fig. Q6 corresponds to the eigenface with a larger eigenvalue? Briefly explain your answer.
An eigenface that captures more facial structure (i.e., more of the data's variance) has a larger eigenvalue.
Fig. Q6(a) corresponds to the eigenface with the larger eigenvalue because it contains more facial features than Fig. Q6(b).
6c) Describe how the top-five eigenfaces (eigenvectors with the top-5 eigenvalues) can be used for face recognition.
Project the mean-subtracted facial vectors to the 5-dimensional space defined by the top-5 eigenfaces. Then, use a one-versus-rest SVM classifier, a GMM classifier, or a Gaussian classifier to classify the 5-dim projected vectors into 10 classes.
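A possible sketch of this recognizer using scikit-learn; the image size (and hence the vector dimension) is an assumption, since Q6 does not state it:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Random stand-in for the dataset: 10 persons x 20 images; a 64 x 64 image size
# (4096-dim facial vectors) is assumed here since Q6 does not state it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4096))
y = np.repeat(np.arange(10), 20)

# Top-5 eigenfaces: PCA keeps the 5 eigenvectors with the largest eigenvalues.
# PCA subtracts the global mean internally, so the 5-dim projected vectors are
# the mean-subtracted projections described above.
pca = PCA(n_components=5)
Z = pca.fit_transform(X)                  # 200 x 5 projected vectors

# Any of the classifiers mentioned above would do; a one-versus-rest SVM is used here.
clf = OneVsRestClassifier(SVC(kernel="linear"))
clf.fit(Z, y)
print(clf.predict(Z[:5]))
```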