Explain why the Transformer can achieve the same tasks as Bi-LSTM and Seq2Seq models, and why, in many cases, the Transformer performs better.

A Seq2Seq model takes a sequence as input and produces a sequence as output. The input sequence and the output sequence do not always have the same length; in other words, the length of the output sequence is determined by the model. For problems such as Machine Translation and Speech Recognition, a Seq2Seq model with an encoder-decoder architecture is needed.

The Transformer can achieve such tasks because it also has an encoder-decoder architecture. The encoder processes the input sequence into hidden states, which provide information for the decoder to predict the output sequence. The Transformer uses an auto-regressive decoder, which takes the previously predicted outputs as input to generate the next output. Because the decoder can emit a special end-of-sequence token, the Transformer determines the length of the output by itself and can therefore solve Seq2Seq problems like Speech Recognition.
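As a rough illustration (not code from the original post), the minimal sketch below uses PyTorch's `nn.Transformer` with an untrained toy model to show the auto-regressive loop: the encoder runs once over the input, and the decoder repeatedly consumes its own previous predictions until it emits an end-of-sequence token, which is what lets the model decide the output length. The vocabulary size, the `bos_id`/`eos_id` values, and the omission of positional encodings are simplifications assumed here.

```python
import torch
import torch.nn as nn

# Toy vocabulary and special tokens -- these ids and sizes are arbitrary choices for this sketch.
vocab_size, d_model, bos_id, eos_id = 100, 64, 1, 2

embed = nn.Embedding(vocab_size, d_model)            # token embedding (positional encoding omitted)
proj = nn.Linear(d_model, vocab_size)                # maps decoder states back to token logits
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2)

src_tokens = torch.randint(0, vocab_size, (7, 1))    # (src_len, batch=1) dummy input sequence
memory = transformer.encoder(embed(src_tokens))      # the encoder runs once over the whole input

ys = torch.tensor([[bos_id]])                        # decoding starts from a begin-of-sequence token
for _ in range(20):                                  # hard cap on the output length
    tgt_mask = transformer.generate_square_subsequent_mask(ys.size(0))
    out = transformer.decoder(embed(ys), memory, tgt_mask=tgt_mask)
    next_token = proj(out[-1]).argmax(dim=-1, keepdim=True)  # greedy pick for the newest position
    ys = torch.cat([ys, next_token], dim=0)          # feed the prediction back in (auto-regressive)
    if next_token.item() == eos_id:                  # emitting EOS is how the model ends the sequence
        break
```

Since the model here is untrained, the generated tokens are meaningless; the point is only the loop mechanics of auto-regressive decoding.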

The Transformer performs better mainly because it uses an Attention mechanism. The Attention mechanism mimics retrieving a value for a query based on a key in a database, but in a probabilistic manner. The Transformer uses Scaled Dot-Product Self-Attention to consider the relations between inputs. The query, key, and value matrices are computed from the input matrix using weight matrices, and these weight matrices are the learnable parameters.

$$
\begin{aligned}
Q &= W^{q} I \\
K &= W^{k} I \\
V &= W^{v} I
\end{aligned}
$$
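For concreteness, here is a minimal sketch of these projections in PyTorch, assuming the input matrix $I$ stores one $d$-dimensional token vector per column as in the equations above; in a real model the weight matrices would be learned `nn.Linear` layers rather than random tensors.

```python
import torch

d_model, seq_len = 64, 5
I = torch.randn(d_model, seq_len)    # input matrix: one token vector per column
W_q = torch.randn(d_model, d_model)  # learnable parameters (randomly initialized here)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q = W_q @ I                          # Q = W^q I
K = W_k @ I                          # K = W^k I
V = W_v @ I                          # V = W^v I
```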

The Scaled Dot-Product Self-Attention module is formulated as the following expression:

$$
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V
$$

The attention score matrix is computed as the matrix product of the query matrix and the transposed key matrix, and it represents how strongly each input is related to the others. The score matrix is then divided by the square root of the key dimension $d_k$ to keep the variance close to one, and the softmax function is applied to normalize the scores further. Finally, the output is computed as the matrix product of the normalized attention score matrix and the value matrix. The linked post at the end includes a visualization of the Scaled Dot-Product Self-Attention module: the left side shows a complete picture of the equation, while the right side shows the computation of one sample query from the query matrix.
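The following small, self-contained sketch implements the formula above. It follows the common row convention (one query/key/value vector per row, so the matrices are transposed relative to the column-wise $Q = W^{q}I$ notation), which is an assumption of this sketch rather than something stated in the post.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # attention score matrix, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)            # normalize each query's scores to sum to 1
    return weights @ V                             # weighted sum of the value vectors

seq_len, d_k = 5, 64
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))  # one vector per row
out = scaled_dot_product_attention(Q, K, V)              # shape: (seq_len, d_k)
```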

Since RNNs such as Bi-LSTM process the sequence recurrently, vanishing or exploding gradients may occur, so information from earlier inputs can fade from memory over long sequences. Such problems are largely avoided in Transformer models because Self-Attention connects every position to every other position directly, and all positions can be processed in parallel rather than step by step.

Furthermore, the Transformer uses Multi-head Scaled Dot-Product Self-Attention to increase the number of queries and capture different types of relevance between inputs. In other words, the attention mechanism is run several times in parallel, allowing the heads to jointly attend to different parts of the sequence, which improves performance. However, the number of heads is a hyperparameter, and setting a large number of heads does not guarantee the best performance: some tasks perform better with more heads, while others perform better with fewer heads.

$$
\begin{aligned}
\operatorname{MultiHead}(Q, K, V) &= \operatorname{Concat}(\operatorname{head}_{1}, \ldots, \operatorname{head}_{h}) W^{O} \\
\text{where } \operatorname{head}_{i} &= \operatorname{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})
\end{aligned}
$$

The computation of Multi-head Scaled Dot-Product Self-Attention runs a collection of Scaled Dot-Product Self-Attention operations in parallel, concatenates their outputs into one vector, and feeds the result through another fully connected (linear) layer.
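Below is a compact sketch of this multi-head computation, written as a plain PyTorch module for illustration (not code from the original post). It assumes `d_model` is split evenly across the heads and packs the per-head projections into single linear layers.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """A compact sketch of Multi-head Scaled Dot-Product Self-Attention."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # per-head W^Q packed into one matrix
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # the output projection W^O

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        # Project, then split d_model into (num_heads, d_head) so each head attends independently.
        q, k, v = [proj(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
                   for proj in (self.w_q, self.w_k, self.w_v)]
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = scores.softmax(dim=-1)
        heads = weights @ v                      # (batch, num_heads, seq_len, d_head)
        concat = heads.transpose(1, 2).reshape(b, n, self.num_heads * self.d_head)
        return self.w_o(concat)                  # Concat(head_1, ..., head_h) W^O

x = torch.randn(2, 5, 64)                        # (batch, seq_len, d_model)
out = MultiHeadSelfAttention(d_model=64, num_heads=4)(x)
```

PyTorch also ships an equivalent built-in module, `torch.nn.MultiheadAttention`, which would normally be used in practice.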

To conclude, the Transformer can achieve the same tasks as Bi-LSTM and other Seq2Seq models because it has an encoder-decoder architecture. It makes use of Multi-head Scaled Dot-Product Self-Attention to attend to both short-term and long-term dependencies within a sequence. Moreover, the properties of Attention largely avoid the problems encountered in RNN models, such as vanishing gradients.

Details can be found in this post: https://vinesmsuic.github.io/transformer/#why-self-attention