CNN

CNN (Math)

img

img

102 - 5 + 1 = 98

Why?

Imagine the number of time points is 10, while the length of the convolutional filter is still 5 and the stride is still 1.

img

10 - 5 + 1 = 6 time points after convolution.

Therefore, if the original number of time points is 102:

102 - 5 + 1 = 98 time points after convolution.
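
As a quick check of this length formula, here is a minimal sketch (using PyTorch, which these notes do not prescribe) that applies a kernel of size 5 with stride 1 and no padding to inputs of length 10 and 102:

```python
# Minimal check of the 1D convolution output-length formula
# L_out = L_in - K + 1 (kernel size K = 5, stride 1, no padding).
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=5, stride=1, padding=0)

for L_in in (10, 102):
    x = torch.randn(1, 1, L_in)      # (batch, channels, time points)
    y = conv(x)
    print(L_in, "->", y.shape[-1])   # 10 -> 6, 102 -> 98
```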

img

Since x_1(n) and k_1(n) have the same length, the convolution slides only once.

The output = x_1(n) * k_1(n) + x_2(n) * k_2(n) + x_3(n) * k_3(n)

x_1(n) * k_1(n) = (1\times0+2\times1+1\times1+2\times1+0\times0) = 0+2+1+2+0 = 5
x_2(n) * k_2(n) = (1\times0+0\times1+1\times0+2\times1+0\times0) = 0+0+0+2+0 = 2
x_3(n) * k_3(n) = (1\times0+1\times0+0\times1+0\times1+2\times2) = 0+0+0+0+4 = 4

x_1(n) * k_1(n) + x_2(n) * k_2(n) + x_3(n) * k_3(n) = 5+2+4 = 11

Therefore the output of the convolutional filters is 11.
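
The same worked example can be verified numerically. Below is a small NumPy sketch (NumPy is an assumed choice); the signal and kernel values are exactly those in the equations above:

```python
# Re-computing the worked example: three channels, each multiplied element-wise
# with its own kernel of the same length (a single convolution step) and summed.
import numpy as np

x = np.array([[1, 2, 1, 2, 0],    # x1(n)
              [1, 0, 1, 2, 0],    # x2(n)
              [1, 1, 0, 0, 2]])   # x3(n)
k = np.array([[0, 1, 1, 1, 0],    # k1(n)
              [0, 1, 0, 1, 0],    # k2(n)
              [0, 0, 1, 1, 2]])   # k3(n)

per_channel = (x * k).sum(axis=1)      # array([5, 2, 4])
print(per_channel, per_channel.sum())  # [5 2 4] 11
```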

If you have 5 time series and you apply a 1D-CNN to process them, how many convolutional filters are there in the first layer of the 1D-CNN?

  • 5

Given a color image of size 28 x 28 x 3 pixels, how many convolutional filters are there in the first layer of a CNN if the first layer’s output tensor has size 26 x 26 x 64?

  • 64
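
As a quick sanity check (using PyTorch, which these notes do not prescribe), 64 filters of size 3 x 3 with stride 1 and no padding map a 28 x 28 x 3 image to a 26 x 26 x 64 tensor:

```python
# 64 filters of size 3 x 3 (stride 1, no padding) applied to a 28 x 28 x 3 image
# produce a 26 x 26 x 64 output tensor.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=0)
x = torch.randn(1, 3, 28, 28)     # (batch, channels, height, width)
print(conv(x).shape)              # torch.Size([1, 64, 26, 26])
```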

CNN (Concept)

Is it appropriate to apply 1D-CNN on images?

False

The number of weights in the convolutional layers of a CNN is much smaller than the number of weights in a hidden layer of a fully connected network. Is it true?

True

In CNN, the pooling operation can reduce the number of weights. Is it true?

False.

In 2D-CNN, the convolutional operation is to capture the spatial features of images. Is it true?

True

Name two methods that enable a CNN or TDNN to accept input matrices of variable size.

  • Statistics pooling
  • Global average pooling

img

For time series prediction, we want to capture the short-term temporal relationship of the samples across time instead of capturing the spatial relationship across the signal amplitudes. This can be achieved by using a 1D-CNN with a large kernel size, say 10. In other words, it is more important to capture the dynamics of x_{j}(n) for n=0,1,\cdots,N than to capture the correlation between x_{1}(n) and x_{2}(n), or the correlation between x_{2}(n) and x_{3}(n), and so on. If we use a 2D-CNN for the same task, we assume that the order of \left[x_{1}(n), x_{2}(n), x_{3}(n), \cdots\right] has some meaning (similar to the vertical axis of an image). However, for time series prediction or classification, the result should be the same even if we reshuffle the time series, e.g., \left[x_{3}(n), x_{2}(n), x_{1}(n), \cdots\right]. Therefore, we should not use a 2D-CNN for multi-dimensional time series prediction.
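
A minimal sketch of this setup (assuming PyTorch; the 16 output channels and the sequence length of 200 are arbitrary illustrative values): each time series is treated as one input channel, and the kernel of size 10 slides along the time axis only.

```python
# Each of the 5 time series becomes one input channel of a 1D-CNN; the kernel
# (size 10) slides along the time axis only, capturing short-term dynamics.
import torch
import torch.nn as nn

num_series, T = 5, 200
conv = nn.Conv1d(in_channels=num_series, out_channels=16, kernel_size=10)

x = torch.randn(1, num_series, T)  # (batch, series as channels, time)
print(conv(x).shape)               # torch.Size([1, 16, 191]), i.e., T - 10 + 1
```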

CNN (Math + Concept)

img

img

(b)(i)

Number of shared weights = (kernel weights + bias) × number of output channels

Therefore, number of shared weights = (3\times 3 + 1) \times 64 = 640
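
A quick parameter-count check (assuming PyTorch and a single-channel input to this layer, which is what makes each filter carry 3 x 3 = 9 kernel weights plus one bias):

```python
# Parameter count of a layer with 64 filters of size 3 x 3 plus one bias per
# filter, assuming a single input channel (so each filter has 3*3 = 9 weights).
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, bias=True)
print(sum(p.numel() for p in conv.parameters()))   # (3*3 + 1) * 64 = 640
```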

(b)(ii)

Convolutional layers are used for extracting spatial features that are relevant to the classification task.

(b)(iii)

Max-pooling layers perform subsampling to reduce the amount of computation performed in the network.

(b)(iv)

The fully connected layers act as a non-linear classifier that classifies the features extracted from the last convolutional layer or last max-pooling layer of the CNN.

Recurrent Neural Networks

The number of weights in an RNN depends on the number of unfoldings (time steps) we apply to the RNN. Is it True?

False
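
This can be illustrated with a small sketch (assuming PyTorch; the layer sizes are arbitrary): the parameter count of an RNN is fixed by its input and hidden dimensions, and the same weights process sequences of any length.

```python
# The weights of an RNN are shared across time steps, so the parameter count
# is fixed regardless of how many steps the RNN is unrolled.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
print(sum(p.numel() for p in rnn.parameters()))   # fixed number of weights

for T in (5, 50, 500):                # the same weights handle any length
    out, h = rnn(torch.randn(1, T, 8))
    print(T, out.shape)               # (1, T, 16)
```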

LSTM is a special kind of RNN that can model long-term dependence. Is it True?

True

LSTM can only look at events/information that happened in the past. Is it True?

Yes

Bi-directional LSTM uses information in the past and the future to perform prediction. Is it True?

Yes

An LSTM is better than an RNN in modeling long-term dependence because the LSTM has cell states and hidden states while RNN only has hidden states. Is it True?

Yes

Transformers

Transformer is based on the seq2seq model in that both use the encoder-decoder architecture. Is it True?

True

Which modules in the Transformer are responsible for representing the dependency of input vectors in a sequence?

  • Self-attention layers

Which module in the Transformer is responsible for encoding the relative positions of words in a sentence?

  • Positional encoder

Computation (Concept)

img

If we have N words, the sizes of the matrices Q and K can be derived as follows:

Q = X W^Q

Size of X: N \times D

Size of W^Q: D \times d_k

where D is the dimension of the input word-embedding vectors, i.e., the first dimension of the weight matrices W^Q and W^K.

Therefore the sizes of Q and K are both N \times d_k.

For the self-attention in the Transformer model, the attention weights are given by \text{Softmax}(\frac{QK^T}{\sqrt{d_k}}).

Size of Q is N \times d_k

Size of K^T is d_k \times N

So the size of QK^T is N \times N

Therefore the complexity is O(N^2).

Note that if Q = K (e.g., the same projection matrix is used for queries and keys), QK^T is symmetric and we only need to compute about half of its entries, i.e., \frac{N(N+1)}{2} dot products. The complexity is still O(N^2).

Recall the rules of complexity:

O\left(\frac{N(N+1)}{2}\right) = O\left(\frac{N^2}{2} + \frac{N}{2}\right) = O(N^2)

since constant factors and lower-order terms are dropped. If N is big enough, they are practically the same.
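
A small numerical illustration of the quadratic growth (assuming NumPy; d_k = 64 is an arbitrary choice): the score matrix QK^T has N \times N entries, so doubling N roughly quadruples the work.

```python
# The attention score matrix QK^T has N x N entries, so doubling the number of
# words N roughly quadruples the amount of work (d_k is fixed at 64 here).
import numpy as np

d_k = 64
for N in (10, 20, 40):
    Q = np.random.randn(N, d_k)
    K = np.random.randn(N, d_k)
    scores = Q @ K.T
    print(N, scores.shape, scores.size)   # (N, N), N*N entries
```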

img

The scaling by \sqrt{d_{k}} is to prevent the value of \mathbf{q}_{i}^{\top} \mathbf{k}_{j} (a dot product) from getting too big, which may exceed the numerical precision of the CPU or GPU. Note the dimensions of the various matrices in the Transformer:

\begin{array}{c} \mathbf{X}: N \times D_{e} \\ \mathbf{Q}: N \times d_{k} \\ \mathbf{K}: N \times d_{k} \\ \mathbf{V}: N \times d_{v} \\ \mathbf{Z}: N \times d_{v} \\ \mathbf{W}_{Q}: D_{e} \times d_{k} \\ \mathbf{W}_{K}: D_{e} \times d_{k} \\ \mathbf{W}_{V}: D_{e} \times d_{v} \\ \mathbf{Z}=\operatorname{Softmax}\left(\frac{\mathbf{Q K}^{\top}}{\sqrt{d_{k}}}\right) \mathbf{V} \end{array}

where D_{e} is the dimension of the word embedding. Note that because of the identity connection (ResNet structure), d_{v}=D_{e}.
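
A shape check of these dimensions (assuming NumPy; the sizes N = 6, D_e = 8, d_k = 4 are small illustrative values):

```python
# Shape check of Z = Softmax(QK^T / sqrt(d_k)) V using the dimensions above,
# with small illustrative sizes: N = 6 words, D_e = 8, d_k = 4, d_v = D_e = 8.
import numpy as np

N, D_e, d_k = 6, 8, 4
d_v = D_e                               # so Z can be added back to X (residual)

X   = np.random.randn(N, D_e)
W_Q = np.random.randn(D_e, d_k)
W_K = np.random.randn(D_e, d_k)
W_V = np.random.randn(D_e, d_v)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)                                  # N x N
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
Z = A @ V
print(Q.shape, K.shape, V.shape, Z.shape)   # (6, 4) (6, 4) (6, 8) (6, 8)
```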

img

The feedforward operation and the gradient computation in Transformer can be written as matrix multiplications and additions, which can be easily parallelized. Also, the multiple heads can be executed in parallel before they are finally combined through a weight matrix. On the other hand, in LSTM, the words in a sentence should be presented to the LSTM cell sequentially. It is therefore much harder to parallelize the computation in LSTM.

Multiheads (Concept)

img

Multi-head attention aims to increase the diversity of the attention model, i.e., each head attempts to encode a different kind of dependency across the words in a sentence. If the weight matrices are identity matrices, all heads will be the same, which defeats the purpose of having multiple heads.

The attention weights are calculated as follows:

\operatorname{Softmax}\left(\frac{\mathbf{Q K}^{\top}}{\sqrt{d_{k}}}\right)=\operatorname{Softmax}\left(\frac{\mathbf{X} \mathbf{W}_{Q} \mathbf{W}_{K}^{\top} \mathbf{X}^{\top}}{\sqrt{d_{k}}}\right)

If \mathbf{W}_{Q} \mathbf{W}_{K}^{\top} = \mathbf{I} (the identity matrix), the attention weights become

\operatorname{Softmax}\left(\frac{\mathbf{X}\mathbf{I}\mathbf{X}^{\top}}{\sqrt{d_{k}}}\right) = \operatorname{Softmax}\left(\frac{\mathbf{X} \mathbf{X}^{\top}}{\sqrt{d_{k}}}\right)

which is independent of the weight matrices \mathbf{W}_{Q} and \mathbf{W}_{K}. All heads are then the same (except for possible differences in \mathbf{W}_{V}), which again defeats the purpose of multi-head attention.
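
A small numerical check of this identity (assuming NumPy, and taking d_k = D_e so that \mathbf{W}_{Q} \mathbf{W}_{K}^{\top} can be made exactly the identity matrix):

```python
# Numerical check: if W_Q W_K^T = I, the attention weights equal
# Softmax(X X^T / sqrt(d_k)) regardless of the individual W_Q and W_K.
# Here d_k = D_e and W_K is chosen so that W_Q @ W_K.T is exactly the identity.
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

N, D_e = 5, 4
d_k = D_e
X = np.random.randn(N, D_e)
W_Q = np.random.randn(D_e, d_k)
W_K = np.linalg.inv(W_Q).T              # forces W_Q @ W_K.T == I

A_full  = softmax_rows(X @ W_Q @ W_K.T @ X.T / np.sqrt(d_k))
A_plain = softmax_rows(X @ X.T / np.sqrt(d_k))
print(np.allclose(A_full, A_plain))     # True
```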

If Q = X, K = X, and V = X in the Transformer, is it still meaningful to use multiple heads in the attention model?

No.

Originally:

Q = X W^Q

K = X W^K

V = X W^V

Without these weight matrices, all heads would be identical, so there is no point in using multiple heads in the attention model.

Embedding and Pooling

Embedding

In some applications, we may need to summarize the information embedded in a group of feature vectors.

  • For example, summarize the speaker/language information in a sequence of acoustic vectors from an utterance for speaker/language recognition
  • For healthcare, we want to use a single vector to determine whether the speaker has dementia/autism or not
  • We may want to use a single vector to summarize the snoring sounds of a person to determine if he/she has OSA
  • For computer vision, we want to use a vector to determine if the input image is a real or a fake image.

The process of finding a representation of a group of feature vectors is known as embedding.

Speaker embedding

  • Speaker embedding aims to represent the characteristics of a speaker from a variable-length utterance.
  • By making the output of the network produce the posterior probabilities of languages or sounds, we obtain language and sound embeddings, respectively.

img

  • ^ the x-vector is the embedding.

Besides TDNN, ResNet could also be used.

img

Pooling

Statistics Pooling

  • Statistics pooling aims to aggregate the frame-level information into segment-level information.
  • In the x-vector network, it converts a C×T matrix into a 2C-dimensional vector
    • E.g., C = 100 channels gives a 200-dimensional vector (see the sketch below)

img

u is the utterance-level representation.
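
Below is a minimal statistics-pooling sketch (assuming NumPy; the function name is illustrative): concatenating the per-channel mean and standard deviation over time turns a C×T matrix into a 2C-dimensional vector for any T, which is also why such pooling lets a CNN accept variable-length inputs.

```python
# Statistics pooling: concatenate the per-channel mean and standard deviation
# over time, so a C x T matrix becomes a 2C-dimensional vector for any T.
import numpy as np

def statistics_pooling(H):
    """H: C x T frame-level features -> 2C-dimensional segment-level vector."""
    return np.concatenate([H.mean(axis=1), H.std(axis=1)])

C = 100
for T in (250, 1000):                   # utterances of different lengths
    u = statistics_pooling(np.random.randn(C, T))
    print(T, u.shape)                   # (200,) in both cases
```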

With statistics pooling or global average pooling, a CNN requires a fixed-size input. Is it true?

False.