Self-Attention

Self-Attention is a method that provides a learnable receptive field in deep learning. Using attention scores, it can exploit the relations between inputs.

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Input and Output of Self-Attention models

Vector Set as Input

  • Input can be a set of variable vectors

Example:

  • Language processing
    • One-hot encoding for each word => assumes words have no relation!
      • No semantic information
    • Word embedding => each word gets a vector according to its meaning
      • Has semantic information
  • Speech processing
    • A speech signal is a set of vectors (frames)
    • E.g. take 25ms of signal as a frame window, then shift by 10ms each step; 1s => 100 frames
      • Each frame vector can be produced as 400 sample points, a 39-dim MFCC, or an 80-dim filter bank output
  • A graph is also a set of vectors; consider each node as a vector
    • E.g. Social Network, Drug Discovery

Possible output

There are 3 cases.

  • Each vector has a label

    • e.g. POS tagging (determine the part of speech of each word in a sentence)
  • The whole sequence has a label

    • e.g. Sentiment analysis
  • Let the model decide the number of labels itself (Seq2Seq)

    • e.g. Language Translation

Why Self-Attention?

  • How to consider the relations between inputs?
    • Using a window over the whole sequence fed into a Fully Connected layer => creates a large number of extra parameters => easier to overfit
    • Using a Self-Attention layer followed by a Fully Connected layer => good result

We can use a Self-Attention layer followed by a Fully Connected layer. The Self-Attention layer uses attention scores to consider the relations between inputs.

img

We can build an architecture like Self-attention => FC => Self-attention => FC, using the Self-attention layer many times.

  • The Self-attention layer processes information across the whole sequence
  • The Fully Connected layer processes the information of a particular position

Types of Self-Attention

The Self-Attention layer uses Attention Score to consider the relation between inputs.

img

There are many types of Self-Attention, including:

  • Dot-Product Self-Attention (Most used)
  • Additive Self-Attention

Dot-Product Self-Attention

We are given 3 learnable parameter matrices $W^q$, $W^k$ and $W^v$, which determine the query $(q)$, key $(k)$ and value $(v)$.

  • Compute the query of each input
  • Compute the key of each input
  • Compute the value of each input
  • For each query, compute the attention score $\alpha$, use the attention score and value to produce a weighted sum

The attention scores are computed using the dot product of query $(q)$ and key $(k)$.

img

An example of computing the attention scores for a given query:

  • Usually the attention score of a vector with itself is computed as well, though it is not a must.

img

The attention score represents how strong the relation between the inputs is.

Then apply softmax (ReLU can be used instead of softmax) to normalize the attention scores.

img

To extract information based on the attention scores, we use the values $(v)$ and the normalized attention scores $(\alpha')$ to calculate the output.

  • The higher $\alpha'$ is, the more the corresponding $v$ dominates the output.

The output is equal to the sum of the values $(v)$ weighted by the normalized attention scores $(\alpha')$.

In this case, $\boldsymbol{b}^{1}=\sum_{i} \alpha_{1, i}^{\prime} \boldsymbol{v}^{i}$

So the attention for a single query is:

$$\operatorname{attention}(q, \boldsymbol{k}, \boldsymbol{v})=\sum_{i} \operatorname{similarity}\left(q, k_{i}\right) \times v_{i}$$

img

Similarly, the same logic applies to the other outputs $b^2$, $b^3$, $b^4$.

img

$$\boldsymbol{b}^{j}=\sum_{i} \alpha_{j, i}^{\prime} \boldsymbol{v}^{i}$$

  • Note that all the outputs $b^j$ are computed in parallel.

Therefore, from a matrix operation perspective:

img

$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(Q K^{T}\right) V$$

Note $T$ means transpose.

$$A = Q K^{T}$$

img

$$\operatorname{softmax}\left(Q K^{T}\right) V = \operatorname{softmax}(A) V = O$$

img

The full picture:

img
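
To make the matrix form concrete, here is a minimal NumPy sketch of single-head dot-product self-attention. The shapes and variable names are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Dot-product self-attention.
    X: (n, d_in) sequence of n input vectors
    Wq, Wk, Wv: (d_in, d) learnable parameter matrices
    Returns O: (n, d)"""
    Q = X @ Wq                     # queries
    K = X @ Wk                     # keys
    V = X @ Wv                     # values
    A = Q @ K.T                    # attention scores, (n, n)
    A_prime = softmax(A, axis=-1)  # normalized attention scores
    return A_prime @ V             # weighted sum of values

# toy usage: 4 input vectors of dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq = rng.normal(size=(8, 8)); Wk = rng.normal(size=(8, 8)); Wv = rng.normal(size=(8, 8))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```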

Scaled Dot-Product Self-Attention

Scaled dot-product attention is an attention mechanism where the dot products are scaled down by $\sqrt{d_{k}}$.

  • Identical to dot-product attention, except for a scaling factor of $\frac{1}{\sqrt{d_k}}$
  • Used in the Transformer

where $d_k$ is the dimension of the keys.

$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$

If we assume that $q$ and $k$ are $d_{k}$-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, $q \cdot k=\sum_{i=1}^{d_{k}} q_{i} k_{i}$, has mean 0 and variance $d_{k}$. Since we would prefer these values to have variance 1, we divide by $\sqrt{d_{k}}$.
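
A quick numerical check of this variance argument (a sketch; the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# many independent pairs of d_k-dimensional vectors with mean 0, variance 1
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))

dots = (q * k).sum(axis=1)             # dot products q . k
print(dots.var())                      # ~ d_k (about 512)
print((dots / np.sqrt(d_k)).var())     # ~ 1 after scaling
```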

The scaling is applied before the softmax.

img

Dot-product attention is much faster and more space-efficient than additive attention in practice, since it can be implemented using highly optimized matrix multiplication code.

Multi-Head Self-Attention

Multi-Head Self-Attention is an advanced type of self-attention.

  • Multiple heads allow the use of more than one query to capture different types of relevance between the inputs.
  • The number of heads is a hyperparameter.
    • Some tasks perform better with more heads, others with fewer.
img

An example - 2 heads

  • Compute head by head

img

img

Finally, concatenate the heads and project the result back to the output dimension using $W^O$.

img

In Math representation:

$$\begin{aligned} \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(\operatorname{head}_{1}, \ldots, \operatorname{head}_{h}\right) W^{O} \\ \text{where } \operatorname{head}_{i} &=\operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned}$$
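
A minimal NumPy sketch of this formula, assuming the common convention $d_k = d_{model} / h$; the function and parameter names are illustrative, not the lecture's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, Wq_heads, Wk_heads, Wv_heads, Wo):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    Q, K, V: (n, d_model); Wq_heads/Wk_heads/Wv_heads: per-head projection
    matrices of shape (d_model, d_k); Wo: (h * d_k, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(Wq_heads, Wk_heads, Wv_heads):
        Qi, Ki, Vi = Q @ Wq, K @ Wk, V @ Wv
        A = softmax(Qi @ Ki.T / np.sqrt(Ki.shape[-1]))  # scaled dot-product per head
        heads.append(A @ Vi)                            # head_i
    return np.concatenate(heads, axis=-1) @ Wo          # concat, then project with W^O

# toy usage: 2 heads, d_model = 8, d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq_heads = [rng.normal(size=(8, 4)) for _ in range(2)]
Wk_heads = [rng.normal(size=(8, 4)) for _ in range(2)]
Wv_heads = [rng.normal(size=(8, 4)) for _ in range(2)]
Wo = rng.normal(size=(8, 8))
print(multi_head_attention(X, X, X, Wq_heads, Wk_heads, Wv_heads, Wo).shape)  # (4, 8)
```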

Positional Encoding

The problem with self-attention?

  • There is no position information in self-attention.
  • For some tasks position information is important, so we need to use positional encoding.

Each position has a unique positional vector $e^i$

img
  • Positional encoding can be hand-crafted or learned from data
  • Position representation methods include:
    • Sinusoidal (hand-crafted)
    • Position embedding (learned from data)

More info can be found in Paper: Learning to Encode Position for Transformer with Continuous Dynamical Model
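
As a concrete illustration of the sinusoidal (hand-crafted) option, here is a minimal sketch of the positional encoding used in Attention Is All You Need, where $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Hand-crafted positional vectors e^i, one per position (Vaswani et al., 2017)."""
    assert d_model % 2 == 0, "assumes an even model dimension"
    pos = np.arange(n_positions)[:, None]           # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return pe

# the positional vector e^i is simply added to the input vector at position i
print(sinusoidal_positional_encoding(4, 8))
```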

Self-attention for Image

An image can also be considered as a set of vectors: a $W \times H \times C$ tensor can be viewed as $W \times H$ vectors, each of dimension $C$.

Thus we can use self-attention for images.

Some examples:

Self-attention VS CNN

Comparing self-attention and CNN, a CNN is actually a special case of self-attention.

  • CNN: self-attention that can only attend within a receptive field (kernel size, hand-crafted)
  • Self-attention: a CNN with a learnable receptive field

That is why we can say a CNN is a simplified self-attention, or that self-attention is a more complex / flexible version of CNN.

More info can be found in Paper: On the Relationship between Self-Attention and Convolutional Layers

  • With suitable hyperparameters, self-attention can do what a CNN does.

Note that a more flexible model requires more data, otherwise overfitting will happen.

  • Flexible model - Good for more data
  • Less Flexible model - Good for less data
img

An approach in between CNN and self-attention is the Conformer.

Self-attention VS RNN

RNN

  • Non-parallel processing
  • The earliest values (marked in red) might not be kept in memory, so they are hard to take into account

img

Self-Attention

  • Parallel processing
  • The earliest values (marked in red) can easily be taken into account

img

More info can be found in Paper: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, which explains how self-attention can mimic an RNN.

Seq2Seq

Sequence-to-Sequence Model

  • Input a sequence, output a sequence (output length is determined by model)
    • E.g. Speech Recognition, Machine Translation, Speech Translation, ChatBot, Question Answering
img
  • Has an Encoder and a Decoder

What can Seq2Seq model do?

Seq2Seq for Multi-label Classification

  • An object can belong to multiple classes.
  • The model may pick more than one class label.

Seq2Seq for Object Detection

Transformer

Transformer is a Sequence-to-Sequence (Seq2Seq) model.

img

A full picture:

img

Encoder of Transformer

Transformer’s Encoder uses Self-Attention.

  • You can actually use an RNN or CNN for the Encoder.

img

Multi-Head Attention

Multi-Head Self-Attention, as described earlier, is used here.

  • Multiple heads allow the use of more than one query to capture different types of relevance between the inputs.
  • The number of heads is a hyperparameter.

Residual Connection (Add) and Layer Normalization (Norm)

img

Add

Add the input to the output

  • Known as residual connection

Norm

  • Layer Normalization

$$x_{i}^{\prime}=\frac{x_{i}-m}{\sigma}$$

where $m$ is the mean, $\sigma$ is the standard deviation, and $x_i$ are the inputs.
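
A minimal sketch of this layer normalization for a single input vector; the small eps term and the omission of the learnable gain and bias found in most implementations are simplifying assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize one input vector over its feature dimension."""
    m = x.mean()           # mean over the features of this one example
    sigma = x.std()        # standard deviation over the same features
    return (x - m) / (sigma + eps)

# in the encoder block this is applied to (input + sub-layer output),
# i.e. after the residual Add
x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x))
```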

Why don't we use other normalization methods here? Further reading:

Position-wise Feed-Forward Network

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

  • Fully Connected Layer (FC) with a ReLU activation + a Fully Connected layer (FC).

$$\operatorname{FFN}(x)=\max \left(0, x W_{1}+b_{1}\right) W_{2}+b_{2}$$
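
A direct sketch of this position-wise feed-forward network ($d_{model} = 512$ and $d_{ff} = 2048$ are the sizes from the paper, used here only for illustration):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048            # sizes from the paper, illustrative here
x = rng.normal(size=(10, d_model))   # 10 positions
W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```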

Auto-regressive Decoder of Transformer

img

Auto-regressive

  • The decoder's previous output is used as the decoder's input
    • Errors might propagate
img

Masked Multi-Head Attention

  • Same as Multi-Head Attention, but the attention scores and weighted sums are only computed over the current and previous positions (see the sketch below)
  • This is because the decoder's input is generated one token at a time.
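
A sketch of one common way to implement the mask (an assumption about the implementation, not the lecture's code): scores for future positions are set to $-\infty$ before the softmax, so their normalized weights become 0.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    """Self-attention where position j only attends to positions i <= j."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = Q @ K.T / np.sqrt(K.shape[-1])                    # scaled attention scores
    future = np.triu(np.ones(A.shape, dtype=bool), k=1)   # entries above the diagonal
    A[future] = -np.inf                                   # masked scores get weight 0 after softmax
    return softmax(A) @ V
```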

Add & Norm

  • Same as Residual Connection and Layer Normalization

Softmax

  • Creates a normalized distribution that sums to 1
    • The class with the highest value is the output

Encoder-Decoder Interaction of Transformer

img

Cross Attention

  • The query $(q)$ comes from the decoder
  • The key $(k)$ and value $(v)$ come from the encoder
  • Then use them to perform Multi-Head Attention (a sketch follows below)
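
A minimal single-head sketch of cross attention under these assumptions; the parameter names are illustrative:

```python
import numpy as np

def cross_attention(decoder_x, encoder_out, Wq, Wk, Wv):
    """q comes from the decoder, k and v come from the encoder outputs."""
    Q = decoder_x @ Wq          # (n_dec, d) queries from decoder vectors
    K = encoder_out @ Wk        # (n_enc, d) keys from encoder outputs
    V = encoder_out @ Wv        # (n_enc, d) values from encoder outputs
    A = Q @ K.T / np.sqrt(K.shape[-1])
    A = A - A.max(axis=-1, keepdims=True)
    A_prime = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)  # softmax
    return A_prime @ V          # (n_dec, d)
```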

More types of Cross Attention can be found in Paper: Layer-Wise Multi-View Decoding for Natural Language Generation

Revisit the full picture of Transformer

img

Training and Inference of Transformer

Training - Cross Entropy

  • Use the cross entropy between the one-hot vector of the ground truth and the decoder's output softmax distribution (a sketch follows below)
    • We want to minimize it
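
A minimal NumPy sketch of that cross-entropy loss for a single output sequence; the shapes are illustrative assumptions:

```python
import numpy as np

def seq_cross_entropy(logits, target_ids):
    """Average cross entropy between the decoder's softmax distribution at each
    output position and the one-hot ground-truth token at that position.
    logits: (seq_len, vocab_size); target_ids: (seq_len,) integer token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

# toy usage: 5 output positions, vocabulary of 10 tokens
rng = np.random.default_rng(0)
print(seq_cross_entropy(rng.normal(size=(5, 10)), np.array([3, 1, 4, 1, 5])))
```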

Optimizer

  • Adam with warmup_steps (a sketch of the learning-rate schedule follows below)
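
A small sketch of the warmup schedule described in Attention Is All You Need ($d_{model} = 512$ and warmup_steps = 4000 are the paper's values):

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate grows linearly for the first warmup_steps,
    then decays proportionally to 1 / sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 4000, 40000):
    print(s, transformer_lrate(s))
```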

Regularization

  • Residual dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized
  • Dropout is applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks

Tips on Training Seq2Seq model

Copy Mechanism

Guided Attention

  • Force a specific way of attention
    • Monotonic Attention, Location-aware attention
  • Usage examples: TTS, Speech Recognition
img

Beam Search

  • The red path is greedy decoding
  • The green path is the best path
  • But it is not possible to check all the paths
    • So beam search is used, though it is not guaranteed to give better performance (a sketch follows below)
      • Beam search tends to help:
        • Tasks with a single correct answer, e.g. speech recognition
      • Beam search tends to hurt:
        • Tasks that need randomness (more than one acceptable answer), e.g. Text-to-Speech (TTS), sentence completion
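
A minimal sketch of beam search over a hypothetical decoder; step_log_probs_fn and the toy vocabulary are illustrative stand-ins, not part of the lecture:

```python
import numpy as np

def beam_search(step_log_probs_fn, beam_size=3, max_len=10, eos_id=0):
    """Keep the beam_size best partial output sequences at every decoding step,
    instead of only the single best one (greedy decoding).
    step_log_probs_fn(seq) is a hypothetical stand-in for the decoder: it returns
    log-probabilities over the vocabulary for the next token given seq."""
    beams = [([], 0.0)]                                     # (token sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos_id:                   # finished sequences stay as they are
                candidates.append((seq, score))
                continue
            log_probs = step_log_probs_fn(seq)
            for tok in np.argsort(log_probs)[-beam_size:]:  # top-k next tokens
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                                      # best-scoring sequence found

# toy usage with a dummy "decoder" over a 5-token vocabulary
rng = np.random.default_rng(0)
print(beam_search(lambda seq: np.log(rng.dirichlet(np.ones(5)))))
```

Setting beam_size = 1 recovers greedy decoding (the red path).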

Evaluation Metrics

BLEU score

Compare the ground-truth sentence with the sentence output by the decoder.

  • Minimizing Cross-Entropy =/= Maximizing BLEU score
    • They are not directly related
  • Therefore during training we minimize cross-entropy, but for validation we maximize the BLEU score.
  • Note that the BLEU score is not differentiable, so it cannot be used as a loss function.

More Variants of Transformers

Reference

YouTube: Hung-yi Lee - Machine Learning 2021 (機器學習2021)

Paper: Attention Is All You Need