Lecture 10 - Transformer Model: Motivation for Transformers, Principles, and Design of the Transformer Model
Maninda Edirisooriya
May 07, 2024
About This Presentation
Learn about the limitations of earlier Deep Sequence Models like RNNs, GRUs, and LSTMs, and the evolution of the Attention mechanism into the Transformer Model with the paper "Attention Is All You Need". This was one of the lectures of a full course I taught at the University of Moratuwa, Sri Lanka, in the first half of 2024.
Limitations of RNN Models
•Slow computation for longer sequences, as the computation cannot be done in parallel due to the dependencies between timesteps
•As there are a significant number of timesteps, the backpropagation depth increases, which worsens the Vanishing Gradient and Exploding Gradient problems
•As information from the history is passed as a hidden state vector, the amount of information is limited by the size of that vector
•As information passed from the history gets updated at each time step, the history is forgotten after a number of time steps
Attention-based Models
•Instead of processing all the time steps with the same weight, models that give certain time steps an exponentially higher weight while processing any given time step performed well; these are known as Attention Models
•Though Attention Models were significantly better, their processing requirement (complexity) was quadratic (i.e. proportional to the square of the number of time steps), which was an extra slowdown
•However, the paper "Attention Is All You Need" by Vaswani et al., 2017 proposed that RNN units can be replaced with a higher-performance mechanism built around "Attention" alone
•This model is known as a Transformer Model
Transformer Model Architecture
[Figure: the Transformer architecture, with its Encoder and Decoder stacks]
Transformer Model
•The original paper defined this model (with both Encoder and Decoder) for the
application of Natural Language Translation
•However, the Encoder and the Decoder were also used independently in some later models for different tasks
Source: https://pub.aimind.so/unraveling-the-power-of-language-models-understanding-llms-and-transformer-variants-71bfc42e0b21
Encoder Only (Autoencoding) Models
•Only the Encoder of the Transformer is used
•Pre-Trained with the Masked Language Modeling objective (see the sketch after this list)
•Some random tokens of the input sequence are masked
•The model tries to predict the missing (masked) tokens to reconstruct the original sequence
•This process learns the Bidirectional Context of the tokens in a sequence (the probabilities of appearing around certain tokens, to both the left and the right)
•Used in applications like Sentence Classification for Sentiment Analysis and token-level operations like Named Entity Recognition
•BERT and RoBERTa are some examples
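To make the masked-token objective concrete, here is a minimal sketch using the Hugging Face `transformers` pipeline API; the library and the "bert-base-uncased" checkpoint are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch: masked-token prediction with a pre-trained encoder-only model.
# Assumes the Hugging Face `transformers` library (and PyTorch) is installed.
from transformers import pipeline

# "bert-base-uncased" is an illustrative checkpoint, not one named in the lecture.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT marks the hidden position with its [MASK] token; the model uses context
# from both the left and the right (bidirectional context) to predict it.
for prediction in fill_mask("The movie was absolutely [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```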
Decoder Only (Autoregressive) Models
•Only the Decoder of the Transformer is used
•Pre-Trained with the Causal Language Modeling objective (see the sketch after this list)
•The last token of the input sequence is masked
•The model tries to predict the last token to reconstruct the original sequence
•Also known as a Full Language Model
•This process learns the Unidirectional Context of the tokens in a sequence (the probability of being the next token given the tokens to the left)
•Used in applications like Text Generation
•GPT and BLOOM are some examples
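Similarly, a minimal sketch of the causal (next-token) objective at work, using the Hugging Face `transformers` pipeline and the "gpt2" checkpoint (both are illustrative assumptions):

```python
# Minimal sketch: autoregressive text generation with a decoder-only model.
# Assumes the Hugging Face `transformers` library (and PyTorch) is installed.
from transformers import pipeline

# "gpt2" is an illustrative checkpoint, not one named in the lecture.
generator = pipeline("text-generation", model="gpt2")

# Each new token is predicted only from the tokens to its left (unidirectional context).
print(generator("Transformers are", max_new_tokens=20)[0]["generated_text"])
```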
Encoder Decoder (Sequence-to-Sequence)
Models
•Uses both the Encoder and the Decoder of the Transformer
•The pre-training objective may depend on the requirement. In the T5 model,
•In the Encoder, some random spans of the input sequence are masked with unique placeholder tokens, added to the vocabulary, known as Sentinel tokens
•This process is known as Span Corruption (see the sketch after this list)
•The Decoder tries to predict the missing (masked) tokens auto-regressively, reconstructing the spans that the Sentinel tokens replaced
•Used in applications like Translation, Summarization and Question-
answering
•T5 and BART are some examples
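The sketch below is a simplified, plain-Python illustration of span corruption with Sentinel tokens; the span positions and the `<extra_id_N>` naming (the convention T5 uses) are assumptions made for illustration, not the exact T5 pre-processing code.

```python
# Simplified sketch of T5-style span corruption: contiguous spans are replaced
# by Sentinel tokens in the encoder input, and the decoder target lists each
# Sentinel followed by the tokens it replaced.
tokens = ["The", "cute", "dog", "walks", "in", "the", "park"]
spans_to_mask = [(1, 3), (5, 6)]  # illustrative (start, end) token index ranges

inp, tgt, sentinel, prev_end = [], [], 0, 0
for start, end in spans_to_mask:
    inp += tokens[prev_end:start] + [f"<extra_id_{sentinel}>"]
    tgt += [f"<extra_id_{sentinel}>"] + tokens[start:end]
    sentinel += 1
    prev_end = end
inp += tokens[prev_end:]
tgt += [f"<extra_id_{sentinel}>"]  # final sentinel marks the end of the targets

print("Encoder input :", " ".join(inp))   # The <extra_id_0> walks in <extra_id_1> park
print("Decoder target:", " ".join(tgt))   # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```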
Encoder – Input and Embedding
•The input is the sequence of tokens (words in the case of Natural Language Processing (NLP))
•Each input token is converted to a vector using Input Embedding (Word Embedding in the case of NLP), as sketched below
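A minimal sketch of the embedding step, assuming PyTorch; the vocabulary size is illustrative, and d_model = 512 and the √d_model scaling follow the original paper.

```python
# Minimal sketch: converting token ids to embedding vectors.
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512          # illustrative vocabulary, d_model from the paper
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7, 9]])     # one sequence of 4 token ids
x = embedding(token_ids) * (d_model ** 0.5)   # scaling by sqrt(d_model), as in the paper
print(x.shape)                                # torch.Size([1, 4, 512])
```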
Encoder – Input and Embedding
[Figure slides illustrating the input and embedding step]
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
Encoder – Multi-Head Attention
•Multi-Head Attention applies multiple similar operations, each known as Single-Head Attention or simply Attention
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
•The type of attention used here is known as Self Attention, where each token attends to all the tokens in the input sequence
•For the Encoder we take Q = K = V = X
Self Attention
Source: https://jalammar.github.io/illustrated-transformer/
•The Self Attention formula is inspired by querying a data store, where Q is the query that is matched against the keys K, and V is the actual value
•Q K^T is a measure of the similarity between Q and K
•√d_k is used to normalize the scores, dividing them by the square root of the dimensionality of K
•Softmax is used to give the highest attention to the most similar keys
•Finally, the normalized similarity is used to weight V, resulting in the Attention (see the sketch below)
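Putting the formula together, here is a minimal sketch of scaled dot-product attention, assuming PyTorch; the tensor shapes are illustrative.

```python
# Minimal sketch: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import math
import torch

def attention(Q, K, V, mask=None):
    # Similarity between every query and every key: (batch, seq_len, seq_len)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # attention weights, each row sums to 1
    return weights @ V                        # weighted sum of the values

# Self Attention: Q = K = V = X, as stated above for the Encoder
X = torch.randn(1, 4, 512)                    # (batch, seq_len, d_model)
print(attention(X, X, X).shape)               # torch.Size([1, 4, 512])
```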
Encoder – Multi-Head Attention
•When Single-Head Attention is defined as,
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
•Multi-Head Attention Head is defined as,
head_i(Q, K, V) = Attention(Q W_i^Q, K W_i^K, V W_i^V)
•i.e. we can have an arbitrary number of heads, where parameter weight matrices have to be defined for Q, K, and V for each head
•Multi-Head is defined as,
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
•i.e. MultiHead is the concatenation of all the heads, multiplied by another parameter matrix W^O
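A minimal sketch of Multi-Head Attention using PyTorch's built-in nn.MultiheadAttention module (an assumption; the lecture does not prescribe an implementation). Internally the module applies the per-head projections W_i^Q, W_i^K, W_i^V, concatenates the heads, and multiplies by W^O.

```python
# Minimal sketch: Multi-Head Self Attention with d_model = 512 and h = 8 heads,
# the sizes used in the original paper.
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

X = torch.randn(1, 4, d_model)           # (batch, seq_len, d_model)
out, attn_weights = mha(X, X, X)         # Self Attention: Q = K = V = X
print(out.shape, attn_weights.shape)     # torch.Size([1, 4, 512]) torch.Size([1, 4, 4])
```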
Encoder – Add & Normalization
•The input given to the multi-head attention is added to its output as a Residual connection (remember ResNet?)
•Then the result is Layer Normalized
•Similar to Batch Norm, but instead of normalizing over the items in the batch (or minibatch), normalization happens over the values in the layer (see the sketch below)
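A minimal sketch of the Add (residual) and Layer Normalization step, assuming PyTorch and illustrative shapes.

```python
# Minimal sketch: residual connection followed by Layer Normalization.
import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)           # normalizes over the feature (layer) dimension

X = torch.randn(1, 4, d_model)               # input to the sub-layer
sublayer_out = torch.randn(1, 4, d_model)    # e.g. the multi-head attention output

out = layer_norm(X + sublayer_out)           # Add (residual) & Normalize
print(out.shape)                             # torch.Size([1, 4, 512])
```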
Decoder – Masked Multi-Head Attention
•Multi-Head Attention for the Decoder is the same as for the Encoder
•However, only the query Q is received from the previous layer
•K and V are received from the Encoder output
•Here, K and V contain the context-related information that is required to process Q, which is generated only from the decoder's input
Masking the Multi-Head Attention
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
•The model must not see the tokens on the right side of the sequence
•Therefore, the softmax output related to this attention should be zero
•For that, all the values to the right of the diagonal are replaced with minus infinity before the Softmax is applied (see the sketch below)
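A minimal sketch of building and applying the causal (look-ahead) mask, assuming PyTorch; the sequence length is illustrative.

```python
# Minimal sketch: values to the right of the diagonal are set to -inf before
# the softmax, so their attention weights become exactly zero.
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)       # raw attention scores (illustrative)

causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)   # upper-triangular entries (future tokens) are 0
```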
Training a Transformer
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
•The vocabulary has special tokens,
• <SOS> for the Start of the Sentence
• <EOS> for the End of the Sentence
•The Encoder output is given to the Decoder (as K and V) to translate its input to Italian
•The Linear layer maps the Decoder output to the vocabulary size
•The Softmax layer outputs the probability distribution over the vocabulary for every token position in a single timestep
•Cross Entropy loss is used (see the sketch below)
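A minimal sketch of the Linear-plus-Cross-Entropy training objective, assuming PyTorch; the decoder output here is a random placeholder tensor and the sizes are illustrative.

```python
# Minimal sketch: project decoder outputs to vocabulary size and compute
# Cross Entropy loss over every position in a single pass.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 10000, 512, 4
to_vocab = nn.Linear(d_model, vocab_size)                # the final Linear layer

decoder_out = torch.randn(1, seq_len, d_model)           # placeholder for the Decoder output
target_ids = torch.randint(0, vocab_size, (1, seq_len))  # target token ids

logits = to_vocab(decoder_out)                           # (1, seq_len, vocab_size)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), target_ids.reshape(-1))
loss.backward()                                          # all positions are trained in one pass
print(loss.item())
```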
Making Inferences with a Transformer
•Unlike during training, while making inferences a transformer needs one timestep to generate each token (see the sketch below)
•The reason is that the generated token has to be fed back to the model to generate the next token
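A minimal sketch of the token-by-token decoding loop; decode_step is a hypothetical stand-in for a full Transformer forward pass (it returns random logits here only so the loop runs), and the special-token ids are illustrative assumptions.

```python
# Minimal sketch: autoregressive inference, one timestep per generated token.
import torch

vocab_size, SOS, EOS, max_len = 10000, 1, 2, 20      # illustrative ids and limits

def decode_step(generated_ids):
    # Placeholder: a real Decoder would attend over the Encoder output and the
    # tokens generated so far, and return logits for the next token.
    return torch.randn(vocab_size)

generated = [SOS]
for _ in range(max_len):                             # one timestep per token
    next_id = int(torch.argmax(decode_step(generated)))
    generated.append(next_id)
    if next_id == EOS:                               # stop at End of Sentence
        break

print(generated)
```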