Curse-of-dimensionality problem

  • The required number of training samples increases exponentially with the feature dimensionality.

  • The inverse of the covariance matrix exists only if the number of data samples is larger than the feature dimension (i.e. N > D).

  • If the number of data samples is smaller than the feature dimension (N < D), the inverse of the covariance matrix does not exist, because we do not have enough data to estimate a full-rank covariance matrix.

    • For example, given only two 3-dimensional vectors, the covariance matrix of these 2 vectors is not invertible (N < D).

This problem affects the classifiers that we can train:

  • If the number of training samples is larger than the feature dimension (N > D), we can train a GMM with full covariance matrices.
  • If the number of training samples is smaller than the feature dimension (N < D), we cannot train a GMM with full covariance matrices. This is because the data samples cannot produce a full covariance matrix with a valid inverse (curse of dimensionality); see the sketch below.
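
Below is a minimal NumPy sketch (toy data, not from the notes) showing that with N = 2 samples in D = 3 dimensions the sample covariance matrix is rank-deficient and therefore has no inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# N = 2 samples in D = 3 dimensions (N < D), matching the example above
X = rng.normal(size=(2, 3))

# The sample covariance matrix is D x D, but its rank is at most N - 1 = 1
S = np.cov(X, rowvar=False)
print("rank of S:", np.linalg.matrix_rank(S))   # 1, so the 3x3 matrix S is singular
print("det of S:", np.linalg.det(S))            # ~0: no (numerically meaningful) inverse
```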

Dimension Reduction

Why Dimension Reduction?

  • High-dim feature vectors can easily cause the curse-of-dimensionality problem.
  • Some feature elements in the feature vectors might be:
    • Irrelevant to the classification task that we want to do
    • Redundant, i.e. highly correlated with each other

Principal Component Analysis (PCA)

A classical approach to reducing dimension is to find a small number of axes in which the data have the highest variability.

The axes may not be parallel to the original axes.

E.g. project the data from 3D to 2D space:

img

  • PCA is mainly used for dimension reduction
  • PCA can project data onto at most a D-dimensional space, where D is the dimension of the feature vectors

How PCA works

  • PCA uses unsupervised learning (Eigen-decomposition of data covariance matrix) to find its parameters (projection directions)
img
  • To project onto M-dimensional space, we find the M eigenvectors whose corresponding eigenvalues are the largest.
  • The idea of PCA is that any data can be represented by a weighted sum of the first M eigenvectors plus an offset (the global mean).
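
A minimal NumPy sketch of this idea, using toy 3-D data and made-up variable names: eigen-decompose the data covariance matrix, keep the top-M eigenvectors, and reconstruct each sample as the global mean plus a weighted sum of those eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ A        # correlated toy 3-D data
M = 2                                    # project from D = 3 down to M = 2

x_bar = X.mean(axis=0)                   # global mean (the offset)
S = np.cov(X, rowvar=False)              # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh, since S is symmetric
order = np.argsort(eigvals)[::-1]        # sort eigenvalues, largest first
U = eigvecs[:, order[:M]]                # top-M eigenvectors = projection directions

Y = (X - x_bar) @ U                      # M-dimensional projected data
X_hat = x_bar + Y @ U.T                  # mean + weighted sum of the M eigenvectors
print("mean reconstruction error:", np.mean((X - X_hat) ** 2))
```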

Procedure

Given a set of training vectors, we find the subspace as follows:

  • Let u_1 be the vector representing the first axis of the subspace
  • Let X = \{x_1, ..., x_N\} be the training vectors
  • The mean of the projected data is u_1^T \bar{x} (note that u_1^T is the transpose of u_1), where

\bar{x} = x_{avg} = \frac{1}{N} \sum^{N}_{n=1} x_n

  • The variance of the projected data is

\sigma_{y}^{2}=\frac{1}{N} \sum_{n=1}^{N}\left\{\mathbf{u}_{1}^{\mathrm{T}} \mathbf{x}_{n}-\mathbf{u}_{1}^{\mathrm{T}} \overline{\mathbf{x}}\right\}^{2}=\mathbf{u}_{1}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{1}

where \mathbf{S} is the data covariance matrix: \mathbf{S}=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{\mathrm{T}}

  • Then we find the best u_1 that maximizes this variance:

\mathbf{u}_{1}^{*}=\underset{\mathbf{u}_{1}}{\operatorname{argmax}}\ \mathbf{u}_{1}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{1} \quad \text{subject to } \mathbf{u}_{1}^{\mathrm{T}} \mathbf{u}_{1}=1

  • Skipping some math on constrained optimization, we get the solution:

\mathbf{S} \mathbf{u}_{1}^{*} = \lambda_{1} \mathbf{u}_{1}^{*}

  • where \lambda_1 is the corresponding eigenvalue.
  • So, the best axis is an eigenvector of the covariance matrix S.
  • This eigenvector has the largest eigenvalue \lambda_1 among all the eigenvalues of S.

Limitations of PCA

  • PCA will fail if the subspace is non-linear.
  • For non-linear subspace, we might use non-linear embedding such as kernel PCA, ISOMAP, or DNN (autoencoder)
img

Fisher Discriminant Analysis (FDA)

FDA is a classification method to separate data into two classes.

  • But FDA could also be considered as a supervised dimension reduction method that reduces the dimension to 1.
    • The data are projected onto a line

Example:

img

How FDA works

  • The idea of FDA is to find a 1-D line so that the projected data give a large separation between the class means while also giving a small variance within each class, thereby minimizing the class overlap.

Skipping lots of math, the projection direction is given by

\mathbf{w} \propto \mathbf{S}_{\mathrm{W}}^{-1}\left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)

\mathbf{S}_{\mathrm{W}}=\sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\mathbf{m}_{1}\right)\left(\mathbf{x}_{n}-\mathbf{m}_{1}\right)^{\mathrm{T}}+\sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\mathbf{m}_{2}\right)\left(\mathbf{x}_{n}-\mathbf{m}_{2}\right)^{\mathrm{T}}

\mathbf{m}_{1}=\frac{1}{N_{1}} \sum_{n \in \mathcal{C}_{1}} \mathbf{x}_{n}, \quad \mathbf{m}_{2}=\frac{1}{N_{2}} \sum_{n \in \mathcal{C}_{2}} \mathbf{x}_{n}

where S_W is the within-class covariance matrix and m_1, m_2 are the class means.

Given a vector x_n, the projected scalar is y_n = w^T x_n.
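
A minimal NumPy sketch of the FDA projection under these formulas, with toy two-class, 2-D data (all data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=[2.0, 0.5], size=(100, 2))   # class 1
X2 = rng.normal(loc=[3.0, 3.0], scale=[2.0, 0.5], size=(100, 2))   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                # class means
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance S_W

w = np.linalg.solve(Sw, m2 - m1)      # w proportional to S_W^{-1} (m2 - m1)
w = w / np.linalg.norm(w)             # unit-length projection direction

y1, y2 = X1 @ w, X2 @ w               # projected scalars y_n = w^T x_n
print("separation of projected class means:", abs(y1.mean() - y2.mean()))
```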

Linear Discriminant Analysis (LDA)

  • LDA is a generalization of FDA. It is typically used as a supervised classifier.
  • LDA also allows us to project data to low dimensional space
    • LDA can only project data onto at most a (K−1)-dimensional subspace, where K is the number of classes.

How LDA works

Skipping a lot of math, the projection matrix of LDA is

\mathbf{S}_{B} \mathbf{W}=\Lambda \mathbf{S}_{W} \mathbf{W} \Rightarrow \mathbf{S}_{W}^{-1} \mathbf{S}_{B} \mathbf{W}=\Lambda \mathbf{W} \Rightarrow \mathbf{W} comprises the eigenvectors of \mathbf{S}_{W}^{-1} \mathbf{S}_{B}

\mathbf{S}_{\mathrm{B}}=\sum_{k=1}^{K} N_{k}\left(\mathbf{m}_{k}-\mathbf{m}\right)\left(\mathbf{m}_{k}-\mathbf{m}\right)^{\mathrm{T}}

\mathbf{S}_{W}=\sum_{k=1}^{K} \sum_{\mathbf{x} \in C_{k}}\left(\mathbf{x}-\mathbf{m}_{k}\right)\left(\mathbf{x}-\mathbf{m}_{k}\right)^{T}

where S_W is the within-class covariance matrix and S_B is the between-class covariance matrix.
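
A minimal NumPy sketch of this recipe (toy data; the function name lda_projection is made up): build S_W and S_B from labelled data, take the eigenvectors of S_W^{-1} S_B, and keep at most K − 1 of them.

```python
import numpy as np

def lda_projection(X, labels, n_dims):
    """Projection matrix W from the eigenvectors of S_W^{-1} S_B."""
    classes = np.unique(labels)
    m = X.mean(axis=0)                                    # global mean
    D = X.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for k in classes:
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        Sb += len(Xk) * np.outer(mk - m, mk - m)          # between-class covariance S_B
        Sw += (Xk - mk).T @ (Xk - mk)                     # within-class covariance S_W
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]                # largest eigenvalues first
    return eigvecs[:, order[:n_dims]].real                # top (at most K - 1) directions

# Toy usage: K = 3 classes in 4-D, so at most K - 1 = 2 useful directions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 4))
               for c in ([0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0])])
labels = np.repeat([0, 1, 2], 50)
W = lda_projection(X, labels, n_dims=2)
Y = X @ W                                                 # LDA-projected vectors
print(Y.shape)                                            # (150, 2)
```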

Limitations of LDA

  • LDA can only project data onto at most a (K−1)-dimensional subspace, where K is the number of classes.
  • LDA assumes that the covariance matrices of all classes are the same. If the distributions of different classes are significantly non-Gaussian, the LDA projection will not be able to preserve any complex structure of the data, which may lead to misclassification. For example, LDA will fail in the following 2-class problems.

img

  • In these cases the class means overlap, so a projection based on separating the class means cannot separate the classes.

Exercise Question Set

Summary

PCA: mainly used for dimension reduction

  • PCA can project data onto at most a D-dimensional space, where D is the dimension of the feature vectors, in an unsupervised manner (eigen-decomposition of the data covariance matrix)
  • PCA will fail if the subspace is non-linear.

FDA: find a 1-D line so that the projected data give a large separation between the class means while also giving a small variance within each class, thereby minimizing the class overlap.

LDA: mainly used for classification, generalization of FDA

  • LDA uses supervised learning (eigen-decomposition of S_W^{-1} S_B, which requires labelled data to compute).
  • LDA can only project data onto at most a (K−1)-dimensional subspace, where K is the number of classes.

Q1

Q1 You are given a dataset comprising the facial images of ten persons. Each person has 100 images and each image has size 128×128 pixels. Assume that for each image, a facial vector is obtained by stacking the columns of the image.

(a) What will be the dimension of the resulting vectors?

  • 128x128 = 16384

(b) Assume that you train a set of one-versus-rest support vector machines (SVMs) to construct an SVM classifier for classifying the ten persons. Draw a block diagram illustrating the architecture of your SVM classifier.

img

(c) Suggest the most appropriate kernel function for the SVMs. Briefly explain your choice.

The linear kernel is the most appropriate. This is because the feature dimension (16384) is larger than the number of training samples (1000).

  • When the feature dimension is larger than the number of training samples (N < D), a hyperplane can perfectly separate the training data into two classes, so the linear kernel is the most appropriate.
    • Rule of thumb: use a linear kernel when the feature dimension is larger than the number of training samples (N < D).

(d) State two disadvantages of converting a facial image to a facial vector by stacking its columns.

  • Stacking the columns will destroy the local spatial information of the images.
  • This method produces feature vectors of very large dimension.

(e) Assume that principal component analysis (PCA) is applied to project the facial vectors onto a lower dimensional space. What will be the maximum dimension of the PCA-projected vectors? You may assume that all of the images (facial vectors) are independent.

  • The maximum dimension is 999. The 1000-th eigenvalue is 0 because the 1000 mean-subtracted facial vectors span at most a 999-dimensional subspace.
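
A scaled-down numerical check of the same rank argument (a NumPy sketch using much smaller, made-up dimensions than the 16384-dimensional facial vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down analogue of Q1(e): N = 10 independent vectors in D = 50 dimensions
N, D = 10, 50
X = rng.normal(size=(N, D))

Xc = X - X.mean(axis=0)                   # mean subtraction removes one degree of freedom
print(np.linalg.matrix_rank(Xc.T @ Xc))   # at most N - 1 = 9 non-zero eigenvalues
```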

Q2

2(a) A dataset comprises 1,000 facial images of size 100×200. What is the maximum number of valid eigenfaces that could be obtained from this dataset? Briefly explain your answer and state the assumption(s) that you have made.

  • The maximum number of eigenfaces is 999. Because the feature dimension (20000) > the number of training vectors (1000), the maximum rank of the covariance matrix is 1000 − 1 = 999. The assumption is that these 1000 images are all different and independent.

2(b) You are given a dataset comprising the images of 10,000 handwritten digits, where each digit has 1,000 samples. Assume that the image size is 28 × 28. Discuss how you will use linear discriminant analysis (LDA) and support vector machines (SVMs) to construct a handwritten digit recognizer. What are the pros and cons of using LDA in this application?

  • We first convert the images into 784-dimensional vectors by stacking the columns (or rows) of the images. Then, we compute the within-class scatter matrix \mathbf{S}_{w} and the between-class scatter matrix \mathbf{S}_{b}. Next, we compute the LDA projection matrix by finding the eigenvectors of \mathbf{S}_{w}^{-1} \mathbf{S}_{b}. Finally, we project each vector with the LDA projection matrix and train 10 one-vs-rest SVMs (one for each digit) using the LDA-projected vectors.
  • Pros: The LDA-projected vectors are class discriminative because the projection matrix is found by maximizing the between-class scatter and minimizing the within-class scatter.
  • Cons: The maximum dimension of the LDA-projected vectors is 9, which may be too small for the SVMs.

Q3

img

3a) Assign the following covariance matrices to the clusters

\Sigma_{?}=\left[\begin{array}{ccc} 6 & 2 & 0 \\ 2 & 2 & 0 \\ 0 & 0 & 0.5 \end{array}\right] \quad \Sigma_{?}=\left[\begin{array}{ccc} 6 & 2 & 0 \\ 2 & 2 & 0 \\ 0 & 0 & 0.5 \end{array}\right] \quad \Sigma_{?}=\left[\begin{array}{ccc} 6 & -2 & 0 \\ -2 & 2 & 0 \\ 0 & 0 & 0.5 \end{array}\right]

We can tell by the signs of the off-diagonal elements: the cluster whose first two dimensions are negatively correlated gets the covariance matrix with the −2 entries.

img

3b) Draw on the 2-D scatter plot the first two LDA projection directions. Given:

  • the first two eigenvectors of S_W^{-1} S_B are:
    • u_1 = [-0.35,\ -0.93,\ -0.06]^T
    • u_2 = [-0.65,\ -0.76,\ -0.03]^T
  • the global mean of the data is \mu = [3.62,\ 4.69,\ -0.02]^T

Note that S_W is the within-class covariance matrix and S_B is the between-class covariance matrix.

img

3c) Sketch the probability density of the data projected onto u_1 and u_3, respectively.

Just draw Gaussian densities along each projection axis.

img

3d) What is the value of the eigenvalue corresponding to u_3?

Zero. Because the maximum number of projection directions in LDA is K − 1 (the number of classes minus one), which equals 3 − 1 = 2.

3e) Which of the eigenvalues is the biggest?

The eigenvalue corresponding to u_1.

Q4

  1. State the key differences between PCA and LDA.
  • In terms of application, PCA is mainly used for dimension reduction, whereas LDA is mainly used for classification.
  • PCA can project data onto at most a D-dimensional space, where D is the dimension of the feature vectors, whereas LDA can project data onto at most K − 1 dimensions.
  • PCA uses unsupervised learning (eigen-decomposition of the data covariance matrix) to find its parameters (projection directions), whereas LDA uses supervised learning (eigen-decomposition of S_W^{-1} S_B). Note that both S_W and S_B require data labels to compute.
  • Note that LDA is a generalization of FDA, whose job is to find a 1-D line so that the projected data give a large separation between the class means while also giving a small variance within each class, thereby minimizing the class overlap.

Q5

  1. What is the relationship between the eigenvalues and the variances of projected data?

The eigenvalues are equal to the variances of the projected data.

More specifically, \sigma_i^2 = \lambda_i

where \sigma_i^2 is the variance of the data projected onto u_i.
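
A quick numerical check of this relationship (a NumPy sketch with toy data): the variance of the data projected onto an eigenvector of the covariance matrix matches the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # correlated toy data

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)

u1 = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue
y = X @ u1                             # data projected onto u_1
print(np.var(y, ddof=1), eigvals[-1])  # the two numbers match: sigma_1^2 = lambda_1
```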

Q6

img

6a) What is the maximum number of eigenfaces (with non-zero eigenvalues) that you can possibly obtain from this dataset?

Dataset:

  • 10 classes
  • each class has 20 images
  • total number of images = 10 × 20 = 200

Since the covariance matrix is computed from 200 mean-subtracted facial vectors, its rank is at most 200 − 1, so the eigenvalue of the 200-th eigenface is (almost) zero in practice.

Therefore, the maximum number of eigenfaces with non-zero eigenvalues is 200 − 1 = 199.

6b) Which of the images in Fig. Q6 corresponds to the eigenface with a larger eigenvalue? Briefly explain your answer.

The more facial features an eigenface captures, the larger its eigenvalue.
Fig. Q6(a) corresponds to the eigenface with the larger eigenvalue because it contains more facial features than Fig. Q6(b).

6c) Describe how the top-five eigenfaces (eigenvectors with the top-5 eigenvalues) can be used for face recognition.

Project the mean-subtracted facial vectors to the 5-dimensional space defined by the top-5 eigenfaces. Then, use a one-versus-rest SVM classifier, a GMM classifier, or a Gaussian classifier to classify the 5-dim projected vectors into 10 classes.
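
A minimal sketch of this recipe using scikit-learn (the notes do not prescribe a library, so this is an assumption); the matrix faces and labels person_ids are hypothetical names, with random data standing in for real images.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Hypothetical data: 200 facial vectors (10 persons x 20 images), D = 64*64 for illustration
rng = np.random.default_rng(0)
faces = rng.normal(size=(200, 64 * 64))
person_ids = np.repeat(np.arange(10), 20)

pca = PCA(n_components=5)                 # top-5 eigenfaces; PCA subtracts the mean internally
Z = pca.fit_transform(faces)              # 5-dimensional projected vectors

clf = OneVsRestClassifier(SVC(kernel="linear")).fit(Z, person_ids)
print(clf.predict(pca.transform(faces[:3])))   # classify (here: training) images into the 10 classes
```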