Self-Attention and Transformer
Self-Attention
Self-Attention is a method that provides a learnable receptive field in deep learning. Using attention scores, it can exploit the relations between inputs.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Input and Output of Self-Attention models
Vector Set as Input
- The input can be a set containing a variable number of vectors
Example:
- Language processing
  - One-hot encoding for each word => assumes each word has no relation to the others
    - No semantic information
  - Word embedding => each word has a vector according to its meaning
    - Has semantic information
- Speech processing
  - A speech signal can be represented as a set of vectors (frames)
  - E.g. take 25ms of signal as a frame window, then shift the window by 10ms each move; 1s of speech gives roughly 100 frames
  - Each frame vector can be produced as 400 sample points, a 39-dim MFCC, or an 80-dim filter bank output
- A graph is also a set of vectors; consider each node as a vector
- E.g. Social Network, Drug Discovery
Possible output
There are 3 cases:
- Each vector has a label
  - e.g. POS tagging (determine the part of speech of each word in a sentence)
- The whole sequence has a label
  - e.g. Sentiment analysis
- The model decides the number of labels itself (Seq2Seq)
  - e.g. Language translation
Why Self-Attention?
- How can we consider the relations between the inputs?
- Feeding a window over the whole sequence into a Fully Connected layer creates an enormous number of parameters => easier to overfit
- A Self-Attention layer followed by a Fully Connected layer => good results
We can use a Self-Attention layer followed by a Fully Connected layer, and stack this pattern (Self-Attention => FC => Self-Attention => FC), applying the Self-Attention layer as many times as needed.
- The Self-Attention layer processes the information of the whole sequence
- The Fully Connected layer processes the information at a particular position
Types of Self-Attention
The Self-Attention layer uses attention scores to consider the relations between inputs.
There are many types of Self-Attention, including:
- Dot-Product Self-Attention (Most used)
- Additive Self-Attention
Dot-Product Self-Attention
We are given three learnable parameter matrices $W^q$, $W^k$ and $W^v$, which determine the query $q^i$, key $k^i$ and value $v^i$ of each input vector $a^i$.
- Compute the query of each input: $q^i = W^q a^i$
- Compute the key of each input: $k^i = W^k a^i$
- Compute the value of each input: $v^i = W^v a^i$
- For each query, compute the attention scores $\alpha$, then use the attention scores and the values to produce a weighted sum
The attention scores are computed using the dot product of query $q$ and key $k$.
An example, computing the attention scores using the query $q^1$:
$$\alpha_{1,i} = q^1 \cdot k^i$$
- Usually people compute the attention score of an input with itself ($\alpha_{1,1}$) as well, though it is not a must.
The attention score $\alpha_{i,j}$ represents how strong the relation between input $i$ and input $j$ is.
Then apply softmax (ReLU can be used instead of softmax) to normalize the attention scores:
$$\alpha'_{1,i} = \frac{\exp(\alpha_{1,i})}{\sum_j \exp(\alpha_{1,j})}$$
To extract information based on the attention scores, we use the values $v^i$ and the normalized attention scores $\alpha'_{1,i}$ to calculate the output.
- The higher $\alpha'_{1,i}$ is, the more the corresponding $v^i$ dominates the output.
The output is the weighted sum of the values $v^i$, with the normalized attention scores $\alpha'_{1,i}$ as weights.
In this case,
$$b^1 = \sum_i \alpha'_{1,i} \, v^i$$
So the attention output for a single query $q$ can be written as
$$\mathrm{Attention}(q, K, V) = \sum_i \frac{\exp(q \cdot k^i)}{\sum_j \exp(q \cdot k^j)} \, v^i$$
Similarly, the same logic applies to the other outputs $b^2$, $b^3$, $b^4$.
- Note that all the outputs are computed in parallel.
Therefore, from a matrix operation perspective, packing the inputs as columns of $I$:
$$Q = W^q I, \quad K = W^k I, \quad V = W^v I$$
$$A = K^T Q, \quad A' = \mathrm{softmax}(A), \quad O = V A'$$
Note $K^T$ means the transpose of $K$.
The full picture:
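To make the matrix form concrete, here is a minimal NumPy sketch of (unscaled) dot-product self-attention following the column-packing convention above; the matrix names and dimensions are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def softmax(x, axis=0):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_self_attention(I, Wq, Wk, Wv):
    """I: (d_in, n) inputs packed as columns. Returns O: (d_v, n)."""
    Q = Wq @ I                     # (d_k, n) queries
    K = Wk @ I                     # (d_k, n) keys
    V = Wv @ I                     # (d_v, n) values
    A = K.T @ Q                    # (n, n) attention scores, A[j, i] = k^j . q^i
    A_prime = softmax(A, axis=0)   # normalize over keys for each query (column)
    O = V @ A_prime                # (d_v, n) outputs, column i = sum_j A'[j, i] * v^j
    return O

# toy usage: 4 inputs of dimension 6, d_k = d_v = 3 (all sizes are illustrative)
rng = np.random.default_rng(0)
I = rng.normal(size=(6, 4))
Wq, Wk, Wv = (rng.normal(size=(3, 6)) for _ in range(3))
print(dot_product_self_attention(I, Wq, Wk, Wv).shape)  # (3, 4)
```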
Scaled Dot-Product Self-Attention
Scaled dot-product attention is an attention mechanism where the dot products are scaled down by $\sqrt{d_k}$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
where $d_k$ is the dimension of the keys (here $Q$, $K$ and $V$ pack the queries, keys and values as rows, following the original paper).
- Identical to dot-product attention, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$
- Used in the Transformer
If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$. Since we would prefer these values to have variance 1, we divide by $\sqrt{d_k}$.
The scaling is applied before the softmax.
Dot-product attention is much faster and more space-efficient in practice compared to additive attention, since it can be implemented using highly optimized matrix multiplication code.
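A quick numerical check of the variance argument above (a sketch; the sample size and dimensions are arbitrary):

```python
import numpy as np

# empirical check: the variance of q . k grows linearly with d_k,
# and dividing by sqrt(d_k) brings it back to roughly 1
rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.normal(size=(10_000, d_k))   # components ~ N(0, 1)
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)           # 10k sampled dot products
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
# printed variances are approximately d_k and 1, respectively
```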
Multi-Head Self-Attention
Multi-Head Self-Attention is an advanced type of self-attention.
- Multiple heads allow the model to use multiple sets of queries (and keys/values) to capture different types of relevance between inputs.
- Number of heads is a hyperparameter.
- Some tasks perform better in more heads, some tasks perform better in less heads.
An example - 2 heads
- Each head has its own projection matrices; attention scores and weighted sums are computed head by head (head 1 only attends within head 1, head 2 within head 2)
Finally, concatenate the outputs of all heads and transform them back to the output dimension using a matrix $W^O$.
In math representation:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) \, W^O$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
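Below is a minimal NumPy sketch of multi-head self-attention in the row convention of the formula above; the head count and all dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention; rows are positions
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n)
    return softmax(scores, axis=-1) @ V      # (n, d_v)

def multi_head_self_attention(X, WQ, WK, WV, WO):
    """X: (n, d_model); WQ/WK/WV: per-head projection lists; WO: (h*d_head, d_model)."""
    heads = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO   # (n, d_model)

# toy usage: 2 heads, d_model = 8, per-head dimension 4
rng = np.random.default_rng(0)
n, d_model, h, d_head = 5, 8, 2, 4
X = rng.normal(size=(n, d_model))
WQ = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WO = rng.normal(size=(h * d_head, d_model))
print(multi_head_self_attention(X, WQ, WK, WV, WO).shape)  # (5, 8)
```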
Positional Encoding
The Problem of self-attention?
- There is no position information in self-attention.
- For some tasks, position information is important, we need to use positional encoding.
Each position $i$ has a unique positional vector $e^i$, which is added to the corresponding input vector.
- Positional encoding can be hand-crafted or learned from data
- Position representation methods include:
- Sinusoidal (hand-crafted)
- Position Embedding (learned from data)
More info can be found in Paper: Learning to Encode Position for Transformer with Continuous Dynamical Model
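As a concrete illustration of the sinusoidal (hand-crafted) encoding listed above, here is a minimal sketch following the original Transformer's formulation; the sequence length and dimension are arbitrary, and an even model dimension is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...). Assumes even d_model."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# each row is the unique positional vector for that position;
# it is added to the corresponding input vector
print(sinusoidal_positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```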
Self-attention for Image
An image can also be considered as a set of vectors: treat each pixel as a vector whose dimension equals the number of channels (e.g. an H x W RGB image is a set of H x W three-dimensional vectors).
Thus we can use self-attention for image.
Some examples:
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Self-Attention GAN
- DEtection Transformer (DETR)
Self-attention VS CNN
Comparing Self-Attention and CNN, a CNN can be viewed as a restricted type of Self-Attention.
- CNN: Self-Attention that can only attend within a receptive field (the kernel size, which is hand-crafted)
- Self-Attention: a CNN with a learnable receptive field
That's why we can say a CNN is a simplified Self-Attention, or that Self-Attention is the complex / flexible version of a CNN.
More info can be found in Paper: On the Relationship between Self-Attention and Convolutional Layers
- With suitable hyperparameters, Self-Attention can do whatever a CNN can do.
Note that a more flexible model requires more data; otherwise, overfitting will happen.
- Flexible model - Good for more data
- Less Flexible model - Good for less data
An approach in between CNN and Self-Attention is the Conformer.
Self-attention VS RNN
RNN
- Non-parallel processing
- Early inputs in the sequence might not be kept in memory, making them hard to take into account
Self-Attention
- Parallel processing
- Early inputs in the sequence can easily be taken into account
More info can be found in the paper Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, which shows how Self-Attention can mimic an RNN.
Seq2Seq
Sequence-to-Sequence Model
- Input a sequence, output a sequence (the output length is determined by the model)
- E.g. Speech Recognition, Machine Translation, Speech Translation, ChatBot, Question Answering
- Has a Encoder and a Decoder
What can Seq2Seq model do?
Seq2Seq for Multi-label Classification
- An object can belong to multiple classes.
- The model may pick more than one class label.
Seq2Seq for Object Detection
Transformer
Transformer is a Sequence-to-Sequence (Seq2Seq) model.
A full picture:
Encoder of Transformer
Transformer’s Encoder uses Self-Attention.
- You can actually use an RNN or CNN for the Encoder instead.
Multi-Head Attention
As described in the Multi-Head Self-Attention section above, multiple heads use multiple sets of queries to capture different types of relevance between inputs.
- Number of heads is a hyperparameter.
Residual Connection (Add) and Layer Normalization (Norm)
Add
Add the input of the block to its output
- Known as residual connection
Norm
- Layer Normalization
$$x'_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ is the mean, $\sigma$ is the standard deviation, and $x_i$ are the inputs (statistics computed over the feature dimension of a single example)
Why don't we use other normalization methods? Further reading:
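A minimal sketch of the layer normalization described above (without the learnable scale and bias that implementations usually add):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each example over its feature dimension: (x - mean) / std."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # approximately zero mean, unit variance per row
```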
Position-wise Feed-Forward Network
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
- A Fully Connected layer (FC) with a ReLU activation, followed by another Fully Connected layer (FC): $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$
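A minimal sketch of the position-wise feed-forward network; the sizes follow the original paper's d_model = 512 and d_ff = 2048, but any dimensions work.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position (row) identically."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10
x = rng.normal(size=(n, d_model))                  # n positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```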
Auto-regressive Decoder of Transformer
Auto-regressive
- The decoder's previous outputs are fed back as the decoder's inputs
- Errors might propagate (error accumulation)
Masked Multi-Head Attention
- Same as Multi-Head Attention, but the attention scores and weighted sums are calculated only over the current and previous positions
- This is because the decoder's inputs are generated one by one
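A minimal sketch of the masking, assuming the row-convention attention used earlier: scores for positions after the current one are set to negative infinity before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Each position i may only attend to positions j <= i."""
    n, d_k = Q.shape[0], Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # block future positions
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = masked_attention(Q, K, V)
print(out.shape)  # (4, 8); row 0 depends only on position 0, row 1 on positions 0-1, etc.
```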
Add & Norm
- Same as Residual Connection and Layer Normalization
Softmax
- Produces a normalized distribution that sums to 1
- The class with the highest probability is taken as the output
Encoder-Decoder Interaction of Transformer
Cross Attention
- The query comes from the decoder
- The key and value come from the encoder
- They are then used to perform Multi-Head Attention, as sketched below
More different types of Cross Attention: Layer-Wise Multi-View Decoding for Natural Language Generation
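A minimal single-head sketch of cross attention under the same row convention as earlier; the state shapes and projection matrices are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec, enc, Wq, Wk, Wv):
    """dec: (n_dec, d); enc: (n_enc, d). Queries from decoder, keys/values from encoder."""
    Q, K, V = dec @ Wq, enc @ Wk, enc @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_dec, n_enc)
    return softmax(scores, axis=-1) @ V       # (n_dec, d_head)

rng = np.random.default_rng(0)
d, d_head = 8, 4
dec, enc = rng.normal(size=(3, d)), rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)  # (3, 4)
```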
Revisit the full picture of Transformer
Training and Inference of Transformer
Training - Cross Entropy
- Use the cross entropy between the one-hot vector of the ground-truth token and the decoder's output softmax distribution as the loss
- We want to minimize it
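A small sketch of this per-token loss; the vocabulary size and probability values are made up for illustration.

```python
import numpy as np

def cross_entropy(one_hot, probs, eps=1e-12):
    """CE between the one-hot ground truth and the decoder's softmax distribution."""
    return -np.sum(one_hot * np.log(probs + eps))

vocab = 5
probs = np.array([0.05, 0.70, 0.10, 0.10, 0.05])  # decoder softmax output
target = np.eye(vocab)[1]                         # ground-truth token is index 1
print(cross_entropy(target, probs))               # -log(0.70) ~= 0.357
```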
Optimizer
- Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$
- Learning rate schedule: $lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5})$, with warmup_steps = 4000
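A minimal sketch of this warmup schedule (constants taken from the original paper; the schedule can be plugged into any optimizer):

```python
import numpy as np

def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step_num = max(step_num, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# the learning rate rises linearly during warmup, then decays as 1/sqrt(step)
for step in (1, 1000, 4000, 10000, 100000):
    print(step, round(transformer_lrate(step), 6))
```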
Regularization
- Residual dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized
- Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks
Tips on Training Seq2Seq model
Copy Mechanism
- Copying parts of the input to the output
- Usage E.g. Article Summarization
Guided Attention
- Force the attention to follow a specific pattern
- Monotonic Attention, Location-aware attention
- Usage E.g. TTS, Speech Recognition
Beam Search
- Greedy decoding picks the most probable token at each step, but this path is not necessarily the best one overall
- The globally best path may require picking a locally less probable token first
- However, it is not possible to check all the paths
- So Beam Search is used, though it is not guaranteed to give better performance
- Beam Search tends to help:
  - Tasks with a single correct answer, e.g. speech recognition
- Beam Search tends not to help:
  - Tasks that need randomness (more than one acceptable answer), e.g. Text-to-Speech (TTS), sentence completion
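A toy sketch of beam search over a made-up next-token distribution; the scoring function, vocabulary size and end-of-sequence token are illustrative assumptions standing in for a real decoder.

```python
import numpy as np

def next_token_probs(prefix, vocab_size=4, seed=0):
    # toy stand-in for a decoder: deterministic pseudo-random distribution per prefix
    rng = np.random.default_rng(abs(hash((seed,) + tuple(prefix))) % (2**32))
    p = rng.random(vocab_size)
    return p / p.sum()

def beam_search(beam_width=2, max_len=5, vocab_size=4, eos=0):
    beams = [((), 0.0)]                      # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == eos:       # finished sequences are kept as-is
                candidates.append((seq, logp))
                continue
            probs = next_token_probs(seq, vocab_size)
            for tok in range(vocab_size):
                candidates.append((seq + (tok,), logp + np.log(probs[tok])))
        # keep only the beam_width most probable (partial) sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, logp in beam_search():
    print(seq, round(logp, 3))
```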
Evaluation Metrics
BLEU score
Compare the sentence output by the decoder with the ground-truth sentence.
- Minimizing cross entropy is not the same as maximizing the BLEU score
- The two are not directly related
- Therefore during training we minimize cross entropy, but for validation and model selection we use the BLEU score
- Note that the BLEU score is not differentiable, so it cannot be used as a loss function