CNN

CNN (Math)

img

img

102 - 5 + 1 = 98

Why?

Imagine the number of time points is 10, while the length of the convolutional filter is still 5 and the stride is still 1.

img

10 - 5 + 1 = 6 time points after convolution.

Therefore, if the original number of time points is 102:

102 - 5 + 1 = 98 time points after convolution.
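
As a quick check of this length formula, here is a minimal sketch (using PyTorch, which these notes do not prescribe) that applies a kernel of size 5 with stride 1 and no padding to inputs of length 10 and 102:

```python
# Minimal check of the 1D convolution output-length formula
# L_out = L_in - K + 1 (kernel size K = 5, stride 1, no padding).
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=5, stride=1, padding=0)

for L_in in (10, 102):
    x = torch.randn(1, 1, L_in)      # (batch, channels, time points)
    y = conv(x)
    print(L_in, "->", y.shape[-1])   # 10 -> 6, 102 -> 98
```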

img

Since x_1(n) and k_1(n) have the same length, the convolution slides only once.

The output = x_1(n) * k_1(n) + x_2(n) * k_2(n) + x_3(n) * k_3(n)

x_1(n) * k_1(n) = (1\times0+2\times1+1\times1+2\times1+0\times0) = 0+2+1+2+0 = 5
x_2(n) * k_2(n) = (1\times0+0\times1+1\times0+2\times1+0\times0) = 0+0+0+2+0 = 2
x_3(n) * k_3(n) = (1\times0+1\times0+0\times1+0\times1+2\times2) = 0+0+0+0+4 = 4

x_1(n) * k_1(n) + x_2(n) * k_2(n) + x_3(n) * k_3(n) = 5+2+4 = 11

Therefore the output of the convolutional filters is 11.
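
The same worked example can be verified numerically. Below is a small NumPy sketch (NumPy is an assumed choice); the signal and kernel values are exactly those in the equations above:

```python
# Re-computing the worked example: three channels, each multiplied element-wise
# with its own kernel of the same length (a single convolution step) and summed.
import numpy as np

x = np.array([[1, 2, 1, 2, 0],    # x1(n)
              [1, 0, 1, 2, 0],    # x2(n)
              [1, 1, 0, 0, 2]])   # x3(n)
k = np.array([[0, 1, 1, 1, 0],    # k1(n)
              [0, 1, 0, 1, 0],    # k2(n)
              [0, 0, 1, 1, 2]])   # k3(n)

per_channel = (x * k).sum(axis=1)      # array([5, 2, 4])
print(per_channel, per_channel.sum())  # [5 2 4] 11
```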

If you have 5 time series and you apply a 1D-CNN to process them, how many convolutional filters are there in the first layer of the 1D-CNN?

  • 5

Given a color image of size 28 x 28 x 3 pixels, how many convolutional filters are there in the first layer of a CNN if the first layer’s output tensor has size 26 x 26 x 64?

  • 64
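
As a quick sanity check (using PyTorch, which these notes do not prescribe), 64 filters of size 3 x 3 with stride 1 and no padding map a 28 x 28 x 3 image to a 26 x 26 x 64 tensor:

```python
# 64 filters of size 3 x 3 (stride 1, no padding) applied to a 28 x 28 x 3 image
# produce a 26 x 26 x 64 output tensor.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=0)
x = torch.randn(1, 3, 28, 28)     # (batch, channels, height, width)
print(conv(x).shape)              # torch.Size([1, 64, 26, 26])
```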

CNN (Concept)

Is it appropriate to apply 1D-CNN on images?

False

The number of weights in the convolutional layers of a CNN is much smaller than the number of weights in a hidden layer of a fully connected network. Is it true?

True

In CNN, the pooling operation can reduce the number of weights. Is it true?

False.

In 2D-CNN, the convolutional operation is to capture the spatial features of images. Is it true?

True

Name two methods that enable a CNN or TDNN to accept input matrices of variable size.

  • Statistics pooling
  • Global average pooling

img

For time series prediction, we want to capture the short-term temporal relationship of the samples across time instead of capturing the spatial relationship across the signal amplitudes. This can be achieved by using a 1D-CNN with a large kernel size, say 10. In other words, it is more important to capture the dynamics of x_{j}(n) for n=0,1,\cdots,N than to capture the correlation between x_{1}(n) and x_{2}(n), or the correlation between x_{2}(n) and x_{3}(n), and so on. If we use a 2D-CNN for the same task, we assume that the order of \left[x_{1}(n), x_{2}(n), x_{3}(n), \cdots\right] has some meaning (similar to the vertical axis of an image). However, for time series prediction or classification, the result should be the same even if we reshuffle the time series, e.g., \left[x_{3}(n), x_{2}(n), x_{1}(n), \cdots\right]. Therefore, we should not use a 2D-CNN for multi-dimensional time series prediction.
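
A minimal sketch of this setup (assuming PyTorch; the 16 output channels and the sequence length of 200 are arbitrary illustrative values): each time series is treated as one input channel, and the kernel of size 10 slides along the time axis only.

```python
# Each of the 5 time series becomes one input channel of a 1D-CNN; the kernel
# (size 10) slides along the time axis only, capturing short-term dynamics.
import torch
import torch.nn as nn

num_series, T = 5, 200
conv = nn.Conv1d(in_channels=num_series, out_channels=16, kernel_size=10)

x = torch.randn(1, num_series, T)  # (batch, series as channels, time)
print(conv(x).shape)               # torch.Size([1, 16, 191]), i.e., T - 10 + 1
```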

CNN (Math + Concept)

img

img

(b)(i)

Number of shared weights = (kernel weights + bias) × number of output channels

Therefore, number of shared weights = (3\times 3 + 1) \times 64 = 640
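
A quick parameter-count check (assuming PyTorch and a single-channel input to this layer, which is what makes each filter carry 3 x 3 = 9 kernel weights plus one bias):

```python
# Parameter count of a layer with 64 filters of size 3 x 3 plus one bias per
# filter, assuming a single input channel (so each filter has 3*3 = 9 weights).
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, bias=True)
print(sum(p.numel() for p in conv.parameters()))   # (3*3 + 1) * 64 = 640
```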

(b)(ii)

Convolutional layers are used for extracting spatial features that are relevant to the classification task.

(b)(iii)

Max-pooling layers perform subsampling to reduce the amount of computation performed in the network.

(b)(iv)

The fully connected layers act as a non-linear classifier that classifies the features extracted from the last convolutional layer or last max-pooling layer of the CNN.

Recurrent Neural Networks

The number of weights in an RNN depends on the number of unfoldings (time steps) we apply to the RNN. Is it True?

False
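
This can be illustrated with a small sketch (assuming PyTorch; the layer sizes are arbitrary): the parameter count of an RNN is fixed by its input and hidden dimensions, and the same weights process sequences of any length.

```python
# The weights of an RNN are shared across time steps, so the parameter count
# is fixed regardless of how many steps the RNN is unrolled.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
print(sum(p.numel() for p in rnn.parameters()))   # fixed number of weights

for T in (5, 50, 500):                # the same weights handle any length
    out, h = rnn(torch.randn(1, T, 8))
    print(T, out.shape)               # (1, T, 16)
```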

LSTM is a special kind of RNN that can model long-term dependence. Is it True?

True

LSTM can only look at events/information that happened in the past. Is it True?

Yes

Bi-directional LSTM uses information in the past and the future to perform prediction. Is it True?

Yes

An LSTM is better than an RNN in modeling long-term dependence because the LSTM has cell states and hidden states while RNN only has hidden states. Is it True?

Yes

Transformers

Transformer is based on the seq2seq model in that both use the encoder-decoder architecture. Is it True?

True

Which modules in the Transformer are responsible for representing the dependency of input vectors in a sequence?

  • Self-attention layers

Which module in the Transformer is responsible for encoding the relative positions of words in a sentence?

  • Positional encoder

Computation (Concept)

img

If we have N words, the sizes of the matrices Q and K can be derived as follows:

Q = X W^Q

Size of X: N \times D

Size of W^Q: D \times d_k

where D is the dimension of the input word-embedding vectors, i.e., the first dimension of the weight matrices W^Q and W^K.

Therefore the sizes of Q and K are both N \times d_k.

For the self-attention in the Transformer model, the attention weights are given by \text{Softmax}(\frac{QK^T}{\sqrt{d_k}}).

Size of Q is N \times d_k

Size of K^T is d_k \times N

So the size of QK^T is N \times N

Therefore the complexity is O(N^2).

Note that if Q = K (e.g., the same projection matrix is used for queries and keys), QK^T is symmetric and we only need to compute about half of its entries, i.e., \frac{N(N+1)}{2} dot products. The complexity is still O(N^2).

Recall the rules of complexity:

O\left(\frac{N(N+1)}{2}\right) = O\left(\frac{N^2}{2} + \frac{N}{2}\right) = O(N^2)

since constant factors and lower-order terms are dropped. If N is big enough, they are practically the same.
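
A small numerical illustration of the quadratic growth (assuming NumPy; d_k = 64 is an arbitrary choice): the score matrix QK^T has N \times N entries, so doubling N roughly quadruples the work.

```python
# The attention score matrix QK^T has N x N entries, so doubling the number of
# words N roughly quadruples the amount of work (d_k is fixed at 64 here).
import numpy as np

d_k = 64
for N in (10, 20, 40):
    Q = np.random.randn(N, d_k)
    K = np.random.randn(N, d_k)
    scores = Q @ K.T
    print(N, scores.shape, scores.size)   # (N, N), N*N entries
```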

img

The scaling by \sqrt{d_{k}} is to prevent the value of \mathbf{q}_{i}^{\top} \mathbf{k}_{j} (a dot product) from getting too big, which may exceed the numerical precision of the CPU or GPU. Note the dimensions of the various matrices in the Transformer:

\begin{array}{c} \mathbf{X}: N \times D_{e} \\ \mathbf{Q}: N \times d_{k} \\ \mathbf{K}: N \times d_{k} \\ \mathbf{V}: N \times d_{v} \\ \mathbf{Z}: N \times d_{v} \\ \mathbf{W}_{Q}: D_{e} \times d_{k} \\ \mathbf{W}_{K}: D_{e} \times d_{k} \\ \mathbf{W}_{V}: D_{e} \times d_{v} \\ \mathbf{Z}=\operatorname{Softmax}\left(\frac{\mathbf{Q K}^{\top}}{\sqrt{d_{k}}}\right) \mathbf{V} \end{array}

where D_{e} is the dimension of the word embedding. Note that because of the identity connection (ResNet structure), d_{v}=D_{e}.
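
A shape check of these dimensions (assuming NumPy; the sizes N = 6, D_e = 8, d_k = 4 are small illustrative values):

```python
# Shape check of Z = Softmax(QK^T / sqrt(d_k)) V using the dimensions above,
# with small illustrative sizes: N = 6 words, D_e = 8, d_k = 4, d_v = D_e = 8.
import numpy as np

N, D_e, d_k = 6, 8, 4
d_v = D_e                               # so Z can be added back to X (residual)

X   = np.random.randn(N, D_e)
W_Q = np.random.randn(D_e, d_k)
W_K = np.random.randn(D_e, d_k)
W_V = np.random.randn(D_e, d_v)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)                                  # N x N
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
Z = A @ V
print(Q.shape, K.shape, V.shape, Z.shape)   # (6, 4) (6, 4) (6, 8) (6, 8)
```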

img

The feedforward operation and the gradient computation in Transformer can be written as matrix multiplications and additions, which can be easily parallelized. Also, the multiple heads can be executed in parallel before they are finally combined through a weight matrix. On the other hand, in LSTM, the words in a sentence should be presented to the LSTM cell sequentially. It is therefore much harder to parallelize the computation in LSTM.

Multiheads (Concept)

img

Multi-head attention aims to increase the diversity of the attention model, i.e., each head attempts to encode a different kind of dependency across the words in a sentence. If the weight matrices are identity matrices, all heads will be the same, which defeats the purpose of having multiple heads.

The attention weights are calculated as follows:

\operatorname{Softmax}\left(\frac{\mathbf{Q K}^{\top}}{\sqrt{d_{k}}}\right)=\operatorname{Softmax}\left(\frac{\mathbf{X} \mathbf{W}_{Q} \mathbf{W}_{K}^{\top} \mathbf{X}^{\top}}{\sqrt{d_{k}}}\right)

If \mathbf{W}_{Q} \mathbf{W}_{K}^{\top} = \mathbf{I} (the identity matrix), the attention weights become

\operatorname{Softmax}\left(\frac{\mathbf{X}\mathbf{I}\mathbf{X}^{\top}}{\sqrt{d_{k}}}\right) = \operatorname{Softmax}\left(\frac{\mathbf{X} \mathbf{X}^{\top}}{\sqrt{d_{k}}}\right)

which is independent of the weight matrices \mathbf{W}_{Q} and \mathbf{W}_{K}. All heads are then the same (except for possible differences in \mathbf{W}_{V}), which again defeats the purpose of multi-head attention.
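
A small numerical check of this identity (assuming NumPy, and taking d_k = D_e so that \mathbf{W}_{Q} \mathbf{W}_{K}^{\top} can be made exactly the identity matrix):

```python
# Numerical check: if W_Q W_K^T = I, the attention weights equal
# Softmax(X X^T / sqrt(d_k)) regardless of the individual W_Q and W_K.
# Here d_k = D_e and W_K is chosen so that W_Q @ W_K.T is exactly the identity.
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

N, D_e = 5, 4
d_k = D_e
X = np.random.randn(N, D_e)
W_Q = np.random.randn(D_e, d_k)
W_K = np.linalg.inv(W_Q).T              # forces W_Q @ W_K.T == I

A_full  = softmax_rows(X @ W_Q @ W_K.T @ X.T / np.sqrt(d_k))
A_plain = softmax_rows(X @ X.T / np.sqrt(d_k))
print(np.allclose(A_full, A_plain))     # True
```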

If Q = X, K = X, and V = X in the Transformer, is it still meaningful to use multiple heads in the attention model?

No.

Originally:

Q = X W^Q

K = X W^K

V = X W^V

Without these weight matrices, all heads would be identical, so there is no point in using multiple heads in the attention model.

Embedding and Pooling

Embedding

In some applications, we may need to summarize the information embedded in a group of feature vectors.

  • For example, summarize the speaker/language information in a sequence of acoustic vectors from an utterance for speaker/language recognition
  • For healthcare, we want to use a single vector to determine whether the speaker has dementia/autism or not
  • We may want to use a single vector to summarize the snoring sounds of a person to determine if he/she has OSA
  • For computer vision, we want to use a vector to determine if the input image is a real or a fake image.

The process of finding a representation of a group of feature vectors is known as embedding.

Speaker embedding

  • Speaker embedding aims to represent the characteristics of a speaker from a variable-length utterance.
  • By making the output of the network produce the posterior probabilities of languages or sounds, we obtain language and sound embeddings, respectively.

img

  • ^ the x-vector is the embedding.

Besides TDNN, ResNet could also be used.

img

Pooling

Statistics Pooling

  • Statistics pooling aims to aggregate the frame-level information into segment-level information.
  • In the x-vector network, it converts a C×T matrix into a 2C-dimensional vector
    • E.g., C = 100 channels gives a 200-dimensional vector (see the sketch below)

img

u is the utterance-level representation.
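
Below is a minimal statistics-pooling sketch (assuming NumPy; the function name is illustrative): concatenating the per-channel mean and standard deviation over time turns a C×T matrix into a 2C-dimensional vector for any T, which is also why such pooling lets a CNN accept variable-length inputs.

```python
# Statistics pooling: concatenate the per-channel mean and standard deviation
# over time, so a C x T matrix becomes a 2C-dimensional vector for any T.
import numpy as np

def statistics_pooling(H):
    """H: C x T frame-level features -> 2C-dimensional segment-level vector."""
    return np.concatenate([H.mean(axis=1), H.std(axis=1)])

C = 100
for T in (250, 1000):                   # utterances of different lengths
    u = statistics_pooling(np.random.randn(C, T))
    print(T, u.shape)                   # (200,) in both cases
```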

With statistics pooling or global average pooling, a CNN requires a fixed-size input. Is it true?

False.