Some Practice Questions on common Deep Learning architectures
CNN
CNN (Math)
102 - 5 + 1 = 98
Why?
Imagine that the number of time points is 10, while the convolutional filter size is still 5 and the stride is still 1.
10 - 5 + 1 = 6 time points after convolution.
Therefore, if the original number of time points is 102:
102 - 5 + 1 = 98 time points after convolution.
Since the input and the filter are of the same length, the convolution slides only once.
The output = a single dot product between the input and the filter.
Therefore, the output of the convolutional filter is 1 × 1.
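As a quick sanity check, here is a minimal sketch (assuming PyTorch, a single input channel, and a single filter) that reproduces the output-length rule L_out = L_in - kernel_size + 1 for stride 1 and no padding:

```python
# Sketch (assumed PyTorch): verify L_out = L_in - kernel_size + 1
# for a 1D convolution with stride 1 and no padding.
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=5, stride=1)

for length in (10, 102, 5):
    x = torch.randn(1, 1, length)        # (batch, channels, time points)
    y = conv(x)
    print(length, "->", y.shape[-1])     # 10 -> 6, 102 -> 98, 5 -> 1
```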
If you have 5 time series and you apply 1D-CNN to process the time series, how many convolutional filters in the first layer of the 1D-CNN?
- 5
Given a color image of size 28 x 28 x 3 pixels, how many convolutional filters in the first layer of a CNN if the first layer’s output tensor has size 26 x 26 x 64?
- 64
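As an illustration, the sketch below (assuming PyTorch and a 3 × 3 kernel, which is consistent with 28 - 3 + 1 = 26) shows that 64 output channels correspond to 64 filters in the first layer:

```python
# Sketch (assumed PyTorch, assumed 3x3 kernel): 64 filters map a
# 28x28x3 image to a 26x26x64 output tensor.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1)
img = torch.randn(1, 3, 28, 28)          # (batch, RGB channels, height, width)
print(conv(img).shape)                   # torch.Size([1, 64, 26, 26])
```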
CNN (Concept)
Is it appropriate to apply a 1D-CNN to images?
False
The number of weights in the convolutional layers of a CNN is much smaller than the number of weights in a hidden layer of a fully connected network. Is it true?
True
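A rough comparison, as a sketch with assumed layer sizes (a 3 × 3 convolution with 64 filters vs. a fully connected layer with 64 hidden units on a 28 × 28 × 3 input), shows the gap in weight counts:

```python
# Sketch (assumed sizes): parameter count of a conv layer vs. a fully
# connected layer on a 28x28x3 input, both producing 64 outputs/channels.
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3)     # (3*3*3 + 1) * 64 = 1,792 weights
fc = nn.Linear(28 * 28 * 3, 64)            # (2352 + 1) * 64 = 150,592 weights

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(conv), count(fc))              # 1792 150592
```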
In a CNN, the pooling operation can reduce the number of weights. Is it true?
False. Pooling reduces the size of the feature maps (and hence the computation), but it has no weights of its own, so it does not reduce the number of weights.
In a 2D-CNN, the convolutional operation captures the spatial features of images. Is it true?
True
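A small sketch (assuming PyTorch) makes the pooling point concrete: a max-pooling layer has no trainable weights and only shrinks the feature maps:

```python
# Sketch (assumed PyTorch): max pooling has zero trainable parameters;
# it halves the spatial size of the feature maps instead.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)
print(list(pool.parameters()))             # [] -> no weights at all
x = torch.randn(1, 64, 26, 26)
print(pool(x).shape)                       # torch.Size([1, 64, 13, 13])
```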
Name two methods that enable a CNN or TDNN to accept input matrices of variable size.
- Statistics pooling
- Global average pooling (see the sketch below)
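Below is a minimal sketch (assuming PyTorch and a 5-channel input) of the second option: global average pooling collapses the time axis into a fixed-size vector, so the same network can handle inputs of different lengths:

```python
# Sketch (assumed PyTorch): global average pooling over time gives a
# fixed-size vector for variable-length inputs.
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=5, out_channels=32, kernel_size=5)
fc = nn.Linear(32, 2)

def forward(x):                                  # x: (batch, 5, T), T can vary
    h = torch.relu(conv(x))                      # (batch, 32, T - 4)
    h = h.mean(dim=-1)                           # global average pooling -> (batch, 32)
    return fc(h)

print(forward(torch.randn(1, 5, 100)).shape)     # torch.Size([1, 2])
print(forward(torch.randn(1, 5, 37)).shape)      # torch.Size([1, 2])
```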
For time series prediction, we want to capture the short-term temporal relationship of the samples across time instead of capturing the spatial relationship across the signal amplitudes. This can be achieved by using a 1D-CNN with a large kernel size, say 10. In other words, it is more important to capture the dynamics of x_{t,i} across time t = 1, ..., T than to capture the correlation between x_{t,1} and x_{t,2}, or the correlation between x_{t,2} and x_{t,3}, and so on. If we use a 2D-CNN for the same task, we assume that the order of the dimensions x_{t,1}, x_{t,2}, ..., x_{t,D} has some meaning (similar to the vertical axis of an image). However, for time series prediction or classification, the result should be the same even if we reshuffle the dimensions of the time series, e.g., reordering them as x_{t,3}, x_{t,1}, x_{t,2}, .... Therefore, we should not use a 2D-CNN for multi-dimensional time series prediction.
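The following sketch (assuming PyTorch, a 5-dimensional time series of length 200, and a kernel size of 10) contrasts the two choices: Conv1d treats the dimensions as unordered channels and convolves along time only, whereas Conv2d would also convolve across the dimension axis as if the dimensions were ordered like image rows:

```python
# Sketch (assumed PyTorch): 1D vs. 2D convolution on a 5-dimensional
# time series of length 200.
import torch
import torch.nn as nn

x = torch.randn(1, 5, 200)                       # (batch, dimensions, time)

conv1d = nn.Conv1d(in_channels=5, out_channels=16, kernel_size=10)
print(conv1d(x).shape)                           # torch.Size([1, 16, 191]); slides along time only

conv2d = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=(3, 10))
print(conv2d(x.unsqueeze(1)).shape)              # torch.Size([1, 16, 3, 191]); also slides across dimensions
```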
CNN (Math + Concept)
(b)(i)
Number of shared weights = (number of kernel weights + 1 bias) × number of output channels
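Because the original kernel and channel sizes are not reproduced here, the sketch below (assuming a 5 × 5 kernel, 3 input channels, and 16 output channels) simply verifies the formula against a framework's own parameter count:

```python
# Sketch (assumed 5x5 kernel, 3 input channels, 16 output channels):
# shared weights = (kernel weights + 1 bias) x number of output channels.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)
n_shared = (5 * 5 * 3 + 1) * 16                       # 1,216
assert n_shared == sum(p.numel() for p in conv.parameters())
print(n_shared)                                       # 1216
```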
(b)(ii)
Convolutional layers are used for extracting spatial features that are relevant to the classification task.
(b)(iii)
Max-pooling layers perform subsampling to reduce the amount of computation performed in the network.
(b)(iv)
The fully connected layers act as a non-linear classifier that classifies the features extracted from the last convolutional layer or last max-pooling layer of the CNN.
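Putting (b)(ii) to (b)(iv) together, here is a minimal sketch (assuming PyTorch, a 28 × 28 × 3 input, and 10 classes) of a CNN in which convolutional layers extract spatial features, max-pooling layers subsample them, and fully connected layers act as the non-linear classifier:

```python
# Sketch (assumed PyTorch, 28x28x3 input, 10 classes): conv -> pool -> FC.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(),       # spatial feature extraction (28 -> 26)
    nn.MaxPool2d(2),                                  # subsampling (26 -> 13)
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),      # higher-level features (13 -> 11)
    nn.MaxPool2d(2),                                  # subsampling (11 -> 5)
    nn.Flatten(),
    nn.Linear(64 * 5 * 5, 128), nn.ReLU(),            # fully connected, non-linear classifier
    nn.Linear(128, 10),                               # class scores
)
print(model(torch.randn(1, 3, 28, 28)).shape)         # torch.Size([1, 10])
```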
Recurrent Neural Networks
The number of weights in an RNN depends on the number of unfoldings (time steps) we apply to the RNN. Is it True?
False. The same weight matrices are shared across all time steps, so unfolding the RNN does not create new weights.
LSTM is a special kind of RNN that can model long term dependence. Is it True?
True
LSTM can only look at events/information that happened in the past. Is it True?
Yes
Bi-directional LSTM uses information in the past and the future to perform prediction. Is it True?
Yes
An LSTM is better than an RNN in modeling long-term dependence because the LSTM has cell states and hidden states while RNN only has hidden states. Is it True?
Yes
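A short sketch (assuming PyTorch) shows both points: an LSTM carries a cell state in addition to a hidden state, and setting bidirectional=True lets it use past and future context:

```python
# Sketch (assumed PyTorch): LSTM returns hidden AND cell states;
# a bi-directional LSTM doubles the output features (forward + backward).
import torch
import torch.nn as nn

x = torch.randn(1, 20, 8)                              # (batch, time, features)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, (h_n, c_n) = lstm(x)                              # hidden state h_n and cell state c_n
print(out.shape, h_n.shape, c_n.shape)                 # [1, 20, 16] [1, 1, 16] [1, 1, 16]

bilstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
out, _ = bilstm(x)
print(out.shape)                                       # [1, 20, 32]
```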
Transformers
Transformer is based on the seq2seq model in that both use the encoder-decoder architecture. Is it True?
True
Which module in the Transformer is responsible for representing the dependency among the input vectors in a sequence?
- Self-attention layers
Which module in the Transformer is responsible for encoding the relative positions of words in a sentence?
- Positional encoder
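As an illustration of one common choice, the sketch below (assuming NumPy) computes the sinusoidal positional encoding from the original Transformer paper, PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)):

```python
# Sketch (assumed NumPy): sinusoidal positional encoding of the
# original Transformer (d_model assumed even).
import numpy as np

def positional_encoding(T, d_model):
    pos = np.arange(T)[:, None]                        # (T, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (T, d_model/2)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

print(positional_encoding(50, 512).shape)              # (50, 512)
```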
Computation (Concept)
If we have T words, the sizes of the matrices Q and K will be:
Size of Q: T × d_k
Size of K: T × d_k
where d_k is one of the dimensions of the weight matrices W^Q and W^K.
Therefore, the sizes of Q and K will both be T × d_k.
For the self-attention in the Transformer model, the equation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
Size of Q is T × d_k.
Size of K^T is d_k × T.
So the size of QK^T will be T × T.
Therefore, the complexity is O(T^2).
Note that when QK^T is symmetric (e.g., when Q = K), we actually only need to compute half of the matrix plus the diagonal, so the count is T(T + 1)/2 dot products. But the complexity is still O(T^2).
Recall the rules of complexity: constant factors are dropped, so O(T^2 / 2) = O(T^2).
If T is big enough, practically they are the same.
The scaling by sqrt(d_k) is to prevent the values of QK^T (dot products) from getting too big, which may exceed the numerical precision of the CPU or GPU. Note the dimensions of the various matrices in the Transformer: X is T × d_model, Q and K are T × d_k, and V is T × d_v, where d_model is the dimension of the word embedding. Note that because of the identity connection (ResNet structure), the output of the attention layer must have the same dimension as the input, i.e., d_model.
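The following sketch (assuming NumPy, with T = 6 words and d_k = 4) makes the shapes, the sqrt(d_k) scaling, and the T × T cost of QK^T explicit:

```python
# Sketch (assumed NumPy): scaled dot-product attention with explicit shapes.
import numpy as np

T, d_k = 6, 4
Q = np.random.randn(T, d_k)                            # size of Q: T x d_k
K = np.random.randn(T, d_k)                            # size of K: T x d_k
V = np.random.randn(T, d_k)

scores = Q @ K.T / np.sqrt(d_k)                        # (T, T): T^2 dot products -> O(T^2)
w = np.exp(scores - scores.max(axis=1, keepdims=True)) # stable row-wise softmax
A = w / w.sum(axis=1, keepdims=True)
out = A @ V                                            # (T, d_k)
print(scores.shape, out.shape)                         # (6, 6) (6, 4)
```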
The feedforward operation and the gradient computation in Transformer can be written as matrix multiplications and additions, which can be easily parallelized. Also, the multiple heads can be executed in parallel before they are finally combined through a weight matrix. On the other hand, in LSTM, the words in a sentence should be presented to the LSTM cell sequentially. It is therefore much harder to parallelize the computation in LSTM.
Multiheads (Concept)
Multi-head attention aims to increase the diversity of the attention model, i.e., each head attempts to encode a different kind of dependency across the words in a sentence. If the weight matrices of all heads are identical, all heads will produce the same output, which defeats the purpose of having multiple heads.
The attention weights of head i are calculated as follows:
A_i = softmax(Q_i K_i^T / sqrt(d_k)) = softmax(X W_i^Q (W_i^K)^T X^T / sqrt(d_k))
If W_i^Q (W_i^K)^T is an identity matrix, then Q_i K_i^T = X X^T, and the attention weights become
A_i = softmax(X X^T / sqrt(d_k)),
which is independent of the weight matrices W_i^Q and W_i^K. Again, all heads are the same (except for the possible difference in W_i^V), which defeats the purpose of multi-head attention.
If Q = X, K = X, and V = X in the Transformer, is it still meaningful to use multiple heads in the attention model?
False.
Originally: Q_i = X W_i^Q, K_i = X W_i^K, and V_i = X W_i^V for head i.
Without these weight matrices (i.e., Q = K = V = X), every head computes the same output, so there is no point in using multiple heads in the attention model.
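A small sketch (assuming NumPy; the head count of 4 and the word/embedding sizes are arbitrary) confirms that with Q = K = V = X every head produces the same output:

```python
# Sketch (assumed NumPy): with Q = K = V = X and no projection matrices,
# every head computes exactly the same output.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V

X = np.random.randn(6, 8)                              # 6 words, embedding dim 8
heads = [attention(X, X, X) for _ in range(4)]         # 4 "heads" without W^Q, W^K, W^V
print(all(np.allclose(h, heads[0]) for h in heads))    # True -> the extra heads add nothing
```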
Embedding and Pooling
Embedding
In some applications, we may need to summarize the information embedded in a group of feature vectors.
- For example, summarize the speaker/language information in a sequence of acoustic vectors from an utterance for speaker/language recognition
- For healthcare, we want to use a single vector to determine whether the speaker has dementia/autism or not
- We may want to use a single vector to summarize the snoring sounds of a person to determine if he/she has OSA (obstructive sleep apnea)
- For computer vision, we want to use a vector to determine whether the input image is real or fake
The process of finding a representation of a group of feature vectors is known as embedding.
Speaker embedding
- Speaker embedding aims to represent the characteristics of a speaker from a variable-length utterance.
- By making the output of the network produce the posterior probabilities of languages or sound classes, we obtain language and sound embeddings, respectively.
- The x-vector, taken from a hidden layer after the statistics-pooling layer, is the embedding.
Besides a TDNN, a ResNet can also be used as the embedding network.
Pooling
Statistics Pooling
- Statistics pooling aims to aggregate the frame-level information into segment-level information.
- In the x-vector network, it converts a C×T matrix to a (2*C)-dim vector
- E.g., with C = 100 channels, statistics pooling gives a 200-dim vector
The pooled vector (the concatenated mean and standard deviation) is the utterance-level representation.
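A minimal sketch (assuming PyTorch; the channel and frame counts are arbitrary) of statistics pooling, which maps a C × T frame-level matrix to a 2C-dimensional utterance-level vector:

```python
# Sketch (assumed PyTorch): statistics pooling = concatenation of the
# per-channel mean and standard deviation over the T frames.
import torch

def statistics_pooling(h):                             # h: (C, T), T may vary
    return torch.cat([h.mean(dim=1), h.std(dim=1)])    # (2C,)

h = torch.randn(100, 312)                              # C = 100 channels, T = 312 frames
print(statistics_pooling(h).shape)                     # torch.Size([200])
```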
With statistics pooling or global average pooling, a CNN requires inputs of fixed size. Is it true?
False. These pooling operations produce a fixed-size vector regardless of the input length, so the network can accept inputs of variable size.