Self-Attention and Transformer
Self-Attention
Self-Attention is a method that provides a learnable receptive field in deep learning. Using attention scores, it can exploit the relations between inputs.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Input and Output of Self-Attention models
Vector Set as Input
- The input can be a set containing a variable number of vectors
Example:
- Language processing
  - One-hot encoding for each word => assumes each word has no relation to the others
    - No semantic information
  - Word embedding => each word has a vector according to its meaning
    - Has semantic information
- Speech processing
  - A speech signal can be represented as a set of vectors (frames)
  - E.g. take 25ms of signal as a frame window, then shift the window by 10ms each move; 1s of speech gives roughly 100 frames
  - Each frame vector can be produced as 400 sample points, a 39-dim MFCC, or an 80-dim filter bank output
- A graph is also a set of vectors; consider each node as a vector
- E.g. Social Network, Drug Discovery
Possible output
There are 3 cases:
- Each vector has a label
  - e.g. POS tagging (determine the part of speech of each word in a sentence)
- The whole sequence has a label
  - e.g. Sentiment analysis
- The model decides the number of labels itself (Seq2Seq)
  - e.g. Language translation
Why Self-Attention?
- How can we consider the relations between the inputs?
- Feeding a window over the whole sequence into a Fully Connected layer creates an enormous number of parameters => easier to overfit
- A Self-Attention layer followed by a Fully Connected layer => good results
We can use a Self-Attention layer followed by a Fully Connected layer, and stack this pattern (Self-Attention => FC => Self-Attention => FC), applying the Self-Attention layer as many times as needed.
- The Self-Attention layer processes the information of the whole sequence
- The Fully Connected layer processes the information at a particular position
Types of Self-Attention
The Self-Attention layer uses attention scores to consider the relations between inputs.
There are many types of Self-Attention, including:
- Dot-Product Self-Attention (Most used)
- Additive Self-Attention
Dot-Product Self-Attention
We are given three learnable parameter matrices $W^q$, $W^k$ and $W^v$, which determine the query $q^i$, key $k^i$ and value $v^i$ of each input vector $a^i$.
- Compute the query of each input: $q^i = W^q a^i$
- Compute the key of each input: $k^i = W^k a^i$
- Compute the value of each input: $v^i = W^v a^i$
- For each query, compute the attention scores $\alpha$, then use the attention scores and the values to produce a weighted sum
The attention scores are computed using the dot product of query $q$ and key $k$.
An example, computing the attention scores using the query $q^1$:
$$\alpha_{1,i} = q^1 \cdot k^i$$
- Usually people compute the attention score of an input with itself ($\alpha_{1,1}$) as well, though it is not a must.
The attention score $\alpha_{i,j}$ represents how strong the relation between input $i$ and input $j$ is.
Then apply softmax (ReLU can be used instead of softmax) to normalize the attention scores:
$$\alpha'_{1,i} = \frac{\exp(\alpha_{1,i})}{\sum_j \exp(\alpha_{1,j})}$$
To extract information based on the attention scores, we use the values $v^i$ and the normalized attention scores $\alpha'_{1,i}$ to calculate the output.
- The higher $\alpha'_{1,i}$ is, the more the corresponding $v^i$ dominates the output.
The output is the weighted sum of the values $v^i$, with the normalized attention scores $\alpha'_{1,i}$ as weights.
In this case,
$$b^1 = \sum_i \alpha'_{1,i} \, v^i$$
So the attention output for a single query $q$ can be written as
$$\mathrm{Attention}(q, K, V) = \sum_i \frac{\exp(q \cdot k^i)}{\sum_j \exp(q \cdot k^j)} \, v^i$$
Similarly, the same logic applies to the other outputs $b^2$, $b^3$, $b^4$.
- Note that all the outputs are computed in parallel.
Therefore, from a matrix operation perspective, packing the inputs as columns of $I$:
$$Q = W^q I, \quad K = W^k I, \quad V = W^v I$$
$$A = K^T Q, \quad A' = \mathrm{softmax}(A), \quad O = V A'$$
Note $K^T$ means the transpose of $K$.
The full picture:
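To make the matrix form concrete, here is a minimal NumPy sketch of (unscaled) dot-product self-attention following the column-packing convention above; the matrix names and dimensions are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def softmax(x, axis=0):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_self_attention(I, Wq, Wk, Wv):
    """I: (d_in, n) inputs packed as columns. Returns O: (d_v, n)."""
    Q = Wq @ I                     # (d_k, n) queries
    K = Wk @ I                     # (d_k, n) keys
    V = Wv @ I                     # (d_v, n) values
    A = K.T @ Q                    # (n, n) attention scores, A[j, i] = k^j . q^i
    A_prime = softmax(A, axis=0)   # normalize over keys for each query (column)
    O = V @ A_prime                # (d_v, n) outputs, column i = sum_j A'[j, i] * v^j
    return O

# toy usage: 4 inputs of dimension 6, d_k = d_v = 3 (all sizes are illustrative)
rng = np.random.default_rng(0)
I = rng.normal(size=(6, 4))
Wq, Wk, Wv = (rng.normal(size=(3, 6)) for _ in range(3))
print(dot_product_self_attention(I, Wq, Wk, Wv).shape)  # (3, 4)
```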
Scaled Dot-Product Self-Attention
Scaled dot-product attention is an attention mechanism where the dot products are scaled down by $\sqrt{d_k}$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
where $d_k$ is the dimension of the keys (here $Q$, $K$ and $V$ pack the queries, keys and values as rows, following the original paper).
- Identical to dot-product attention, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$
- Used in the Transformer
If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$. Since we would prefer these values to have variance 1, we divide by $\sqrt{d_k}$.
The scaling is applied before the softmax.
Dot-product attention is much faster and more space-efficient in practice compared to additive attention, since it can be implemented using highly optimized matrix multiplication code.
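A quick numerical check of the variance argument above (a sketch; the sample size and dimensions are arbitrary):

```python
import numpy as np

# empirical check: the variance of q . k grows linearly with d_k,
# and dividing by sqrt(d_k) brings it back to roughly 1
rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.normal(size=(10_000, d_k))   # components ~ N(0, 1)
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)           # 10k sampled dot products
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
# printed variances are approximately d_k and 1, respectively
```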
Multi-Head Self-Attention
Multi-Head Self-Attention is an advanced type of self-attention.
- Multiple heads allow the model to use multiple sets of queries (and keys/values) to capture different types of relevance between inputs.
- Number of heads is a hyperparameter.
- Some tasks perform better in more heads, some tasks perform better in less heads.
An example - 2 heads
- Each head has its own projection matrices; attention scores and weighted sums are computed head by head (head 1 only attends within head 1, head 2 within head 2)
Finally, concatenate the outputs of all heads and transform them back to the output dimension using a matrix $W^O$.
In math representation:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) \, W^O$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
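Below is a minimal NumPy sketch of multi-head self-attention in the row convention of the formula above; the head count and all dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention; rows are positions
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n)
    return softmax(scores, axis=-1) @ V      # (n, d_v)

def multi_head_self_attention(X, WQ, WK, WV, WO):
    """X: (n, d_model); WQ/WK/WV: per-head projection lists; WO: (h*d_head, d_model)."""
    heads = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO   # (n, d_model)

# toy usage: 2 heads, d_model = 8, per-head dimension 4
rng = np.random.default_rng(0)
n, d_model, h, d_head = 5, 8, 2, 4
X = rng.normal(size=(n, d_model))
WQ = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WO = rng.normal(size=(h * d_head, d_model))
print(multi_head_self_attention(X, WQ, WK, WV, WO).shape)  # (5, 8)
```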
Positional Encoding
The Problem of self-attention?
- There is no position information in self-attention.
- For some tasks, position information is important, we need to use positional encoding.
Each position $i$ has a unique positional vector $e^i$, which is added to the corresponding input vector.
- Positional encoding can be hand-crafted or learned from data
- Position representation methods include:
- Sinusoidal (hand-crafted)
- Position Embedding (learned from data)
More info can be found in Paper: Learning to Encode Position for Transformer with Continuous Dynamical Model
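As a concrete illustration of the sinusoidal (hand-crafted) encoding listed above, here is a minimal sketch following the original Transformer's formulation; the sequence length and dimension are arbitrary, and an even model dimension is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...). Assumes even d_model."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# each row is the unique positional vector for that position;
# it is added to the corresponding input vector
print(sinusoidal_positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```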
Self-attention for Image
An image can also be considered as a set of vectors: treat each pixel as a vector whose dimension equals the number of channels (e.g. an H x W RGB image is a set of H x W three-dimensional vectors).
Thus we can use self-attention for image.
Some examples:
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Self-Attention GAN
- DEtection Transformer (DETR)
Self-attention VS CNN
Comparing Self-Attention and CNN, a CNN can be viewed as a restricted type of Self-Attention.
- CNN: Self-Attention that can only attend within a receptive field (the kernel size, which is hand-crafted)
- Self-Attention: a CNN with a learnable receptive field
That's why we can say a CNN is a simplified Self-Attention, or that Self-Attention is the complex / flexible version of a CNN.
More info can be found in Paper: On the Relationship between Self-Attention and Convolutional Layers
- With suitable hyperparameters, Self-Attention can do whatever a CNN can do.
Note that a more flexible model requires more data; otherwise, overfitting will happen.
- Flexible model - Good for more data
- Less Flexible model - Good for less data
An approach in between CNN and Self-Attention is the Conformer.
Self-attention VS RNN
RNN
- Non-parallel processing
- Early inputs in the sequence might not be kept in memory, making them hard to take into account
Self-Attention
- Parallel processing
- Early inputs in the sequence can easily be taken into account
More info can be found in the paper Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, which shows how Self-Attention can mimic an RNN.
Seq2Seq
Sequence-to-Sequence Model
- Input a sequence, output a sequence (the output length is determined by the model)
- E.g. Speech Recognition, Machine Translation, Speech Translation, ChatBot, Question Answering
- Has a Encoder and a Decoder
What can Seq2Seq model do?
Seq2Seq for Multi-label Classification
- An object can belong to multiple classes.
- The model may pick more than one class label.
Seq2Seq for Object Detection
Transformer
Transformer is a Sequence-to-Sequence (Seq2Seq) model.
A full picture:
Encoder of Transformer
Transformer’s Encoder uses Self-Attention.
- You can actually use an RNN or CNN for the Encoder instead.
Multi-Head Attention
As described in the Multi-Head Self-Attention section above, multiple heads use multiple sets of queries to capture different types of relevance between inputs.
- Number of heads is a hyperparameter.
Residual Connection (Add) and Layer Normalization (Norm)
Add
Add the input of the block to its output
- Known as residual connection
Norm
- Layer Normalization
$$x'_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ is the mean, $\sigma$ is the standard deviation, and $x_i$ are the inputs (statistics computed over the feature dimension of a single example)
Why don't we use other normalization methods? Further reading:
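A minimal sketch of the layer normalization described above (without the learnable scale and bias that implementations usually add):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each example over its feature dimension: (x - mean) / std."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # approximately zero mean, unit variance per row
```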
Position-wise Feed-Forward Network
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
- A Fully Connected layer (FC) with a ReLU activation, followed by another Fully Connected layer (FC): $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$
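A minimal sketch of the position-wise feed-forward network; the sizes follow the original paper's d_model = 512 and d_ff = 2048, but any dimensions work.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position (row) identically."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10
x = rng.normal(size=(n, d_model))                  # n positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```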
Auto-regressive Decoder of Transformer
Auto-regressive
- The decoder's previous outputs are fed back as the decoder's inputs
- Errors might propagate (error accumulation)
Masked Multi-Head Attention
- Same as Multi-Head Attention, but the attention scores and weighted sums are calculated only over the current and previous positions
- This is because the decoder's inputs are generated one by one
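A minimal sketch of the masking, assuming the row-convention attention used earlier: scores for positions after the current one are set to negative infinity before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Each position i may only attend to positions j <= i."""
    n, d_k = Q.shape[0], Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # block future positions
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = masked_attention(Q, K, V)
print(out.shape)  # (4, 8); row 0 depends only on position 0, row 1 on positions 0-1, etc.
```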
Add & Norm
- Same as Residual Connection and Layer Normalization
Softmax
- Produces a normalized distribution that sums to 1
- The class with the highest probability is taken as the output
Encoder-Decoder Interaction of Transformer
Cross Attention
- The query comes from the decoder
- The key and value come from the encoder
- They are then used to perform Multi-Head Attention, as sketched below
More different types of Cross Attention: Layer-Wise Multi-View Decoding for Natural Language Generation
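A minimal single-head sketch of cross attention under the same row convention as earlier; the state shapes and projection matrices are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec, enc, Wq, Wk, Wv):
    """dec: (n_dec, d); enc: (n_enc, d). Queries from decoder, keys/values from encoder."""
    Q, K, V = dec @ Wq, enc @ Wk, enc @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_dec, n_enc)
    return softmax(scores, axis=-1) @ V       # (n_dec, d_head)

rng = np.random.default_rng(0)
d, d_head = 8, 4
dec, enc = rng.normal(size=(3, d)), rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)  # (3, 4)
```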
Revisit the full picture of Transformer
Training and Inference of Transformer
Training - Cross Entropy
- Use the cross entropy between the one-hot vector of the ground-truth token and the decoder's output softmax distribution as the loss
- We want to minimize it
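A small sketch of this per-token loss; the vocabulary size and probability values are made up for illustration.

```python
import numpy as np

def cross_entropy(one_hot, probs, eps=1e-12):
    """CE between the one-hot ground truth and the decoder's softmax distribution."""
    return -np.sum(one_hot * np.log(probs + eps))

vocab = 5
probs = np.array([0.05, 0.70, 0.10, 0.10, 0.05])  # decoder softmax output
target = np.eye(vocab)[1]                         # ground-truth token is index 1
print(cross_entropy(target, probs))               # -log(0.70) ~= 0.357
```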
Optimizer
- Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$
- Learning rate schedule: $lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5})$, with warmup_steps = 4000
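A minimal sketch of this warmup schedule (constants taken from the original paper; the schedule can be plugged into any optimizer):

```python
import numpy as np

def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step_num = max(step_num, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# the learning rate rises linearly during warmup, then decays as 1/sqrt(step)
for step in (1, 1000, 4000, 10000, 100000):
    print(step, round(transformer_lrate(step), 6))
```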
Regularization
- Residual dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized
- Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks
Tips on Training Seq2Seq model
Copy Mechanism
- Copying parts of the input to the output
- Usage E.g. Article Summarization
Guided Attention
- Force the attention to follow a specific pattern
- Monotonic Attention, Location-aware attention
- Usage E.g. TTS, Speech Recognition
Beam Search
- Greedy decoding picks the most probable token at each step, but this path is not necessarily the best one overall
- The globally best path may require picking a locally less probable token first
- However, it is not possible to check all the paths
- So Beam Search is used, though it is not guaranteed to give better performance
- Beam Search tends to help:
  - Tasks with a single correct answer, e.g. speech recognition
- Beam Search tends not to help:
  - Tasks that need randomness (more than one acceptable answer), e.g. Text-to-Speech (TTS), sentence completion
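A toy sketch of beam search over a made-up next-token distribution; the scoring function, vocabulary size and end-of-sequence token are illustrative assumptions standing in for a real decoder.

```python
import numpy as np

def next_token_probs(prefix, vocab_size=4, seed=0):
    # toy stand-in for a decoder: deterministic pseudo-random distribution per prefix
    rng = np.random.default_rng(abs(hash((seed,) + tuple(prefix))) % (2**32))
    p = rng.random(vocab_size)
    return p / p.sum()

def beam_search(beam_width=2, max_len=5, vocab_size=4, eos=0):
    beams = [((), 0.0)]                      # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == eos:       # finished sequences are kept as-is
                candidates.append((seq, logp))
                continue
            probs = next_token_probs(seq, vocab_size)
            for tok in range(vocab_size):
                candidates.append((seq + (tok,), logp + np.log(probs[tok])))
        # keep only the beam_width most probable (partial) sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, logp in beam_search():
    print(seq, round(logp, 3))
```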
Evaluation Metrics
BLEU score
Compare the sentence output by the decoder with the ground-truth sentence.
- Minimizing cross entropy is not the same as maximizing the BLEU score
- The two are not directly related
- Therefore during training we minimize cross entropy, but for validation and model selection we use the BLEU score
- Note that the BLEU score is not differentiable, so it cannot be used as a loss function