AIS302-Artificial Neural Networks-Spr24-lec4.pdf


AIS302: Artificial Neural Networks
Spring 24
Ghada Khoriba
Associate Prof. of Artificial Intelligence

Mechanics of Seq2seq Models With Attention
• Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning.
• Google Translate started using such a model in production in late 2016.
• These models are explained in the two pioneering papers (Sutskever et al., 2014; Cho et al., 2014).

• A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items.
• A trained model would work like this:
"The Illustrated Transformer" by Jay Alammar.

In neural machine translation, a sequence is a series of words, processed one after another. The
output is, likewise, a series of words:
"The Illustrated Transformer" by Jay Alammar.

• The model is composed of an encoder and a decoder.
• The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context).
• After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
"The Illustrated Transformer" by Jay Alammar.

The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks.
You can set the size of the context vector when you set up your model; it is basically the number of hidden units in the encoder RNN. The context vector would be of a size like 256, 512, or 1024.
"The Illustrated Transformer" by Jay Alammar.

• An RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence) and a hidden state. The word, however, needs to be represented by a vector.
• To transform a word into a vector, we turn to the class of methods called "word embedding" algorithms.
• These map words into vector spaces that capture a lot of the meaning/semantic information of the words (e.g., king - man + woman ≈ queen); see the sketch below.
"The Illustrated Transformer" by Jay Alammar.

• Since the encoder and decoder are both RNNs, at each time step one of the RNNs does some processing; it updates its hidden state based on its input and the previous inputs it has seen.
"The Illustrated Transformer" by Jay Alammar.
The decoder also maintains a hidden state that it passes from one time step to the next.

"The Illustrated Transformer" by Jay Alammar.

• The context vector turned out to be a bottleneck for these types of models.
• It made it challenging for the models to deal with long sentences.
• In 2014/2015, "attention" greatly improved the quality of machine translation systems.
• Attention allows the model to focus on the relevant parts of the input sequence as needed.
https://youtu.be/UPtG_38Oq8o?si=OTZ8kCwf0g4KrT1a

• An attention model differs from a classic sequence-to-sequence model in two main ways:
• First, the encoder passes a lot more data to the decoder. Instead of passing the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder:

Second, an attention decoder does an extra step before producing its output.
• To focus on the parts of the input that are relevant to this decoding time step, the decoder does the following:
1. Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence.
2. Give each hidden state a score.
3. Multiply each hidden state by its softmax score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.


How the attention process works:
1. The attention decoder RNN takes in the embedding of the <END> token and an initial decoder hidden state.
2. The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
3. Attention step: we use the encoder's hidden states and the h4 vector to calculate a context vector (C4) for this time step.
4. We concatenate h4 and C4 into one vector.
5. We pass this vector through a feedforward neural network (one trained jointly with the model).
6. The output of the feedforward neural network indicates the output word of this time step.
7. Repeat for the next time steps (a sketch of this loop follows below).
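The seven steps above can be compressed into a short loop. The following is a minimal NumPy sketch with invented shapes and random stand-in weights (not the lecture's actual model); it only shows where the scoring, softmax, weighted sum, concatenation, and feed-forward output fit:

```python
# Minimal NumPy sketch of the attention decoding loop described above.
import numpy as np

rng = np.random.default_rng(0)
T, H, V = 4, 8, 10                     # input length, hidden size, toy vocab size
enc_states = rng.normal(size=(T, H))   # encoder hidden states h1..hT
s = rng.normal(size=H)                 # initial decoder hidden state
y = rng.normal(size=H)                 # embedding of the <END> token

W_rnn = rng.normal(size=(2 * H, H)) * 0.1   # toy decoder RNN weights
W_out = rng.normal(size=(2 * H, V)) * 0.1   # toy feed-forward output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(3):
    # 1-2. Run the decoder RNN one step: new hidden state from [y; s].
    s = np.tanh(np.concatenate([y, s]) @ W_rnn)
    # 3. Attention step: score each encoder state against s, softmax,
    #    and take the weighted sum as the context vector C.
    scores = enc_states @ s
    weights = softmax(scores)
    context = weights @ enc_states
    # 4-6. Concatenate [s; C], pass through a feed-forward layer,
    #      and read off the most likely output word.
    logits = np.concatenate([s, context]) @ W_out
    word_id = int(np.argmax(softmax(logits)))
    print(f"step {step}: attention={np.round(weights, 2)}, word id={word_id}")
    # 7. Feed the predicted word's embedding back in (toy placeholder lookup).
    y = rng.normal(size=H)  # stands in for embedding(word_id)
```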

The <END> token fed to the decoder at the first step signifies the start of the decoding process.
The context vector is a weighted sum of the encoder's hidden states,
where the weights are determined by how relevant each part of the input
sequence is to the current word being generated in the output sequence.

• Note that the model isn't just mindlessly aligning the first word of the output with the first word of the input.
• It actually learned from the training phase how to align words in that language pair (French and English in our example).

•The encoding component is a stack of encoders.
•The decoding component is a stack of decoders of the same number.

•The encoders are all identical in structure (yet they do not share weights). Each one is broken
down into two sub-layers:

•The encoder’s inputs first flow through a self-attention layer, which helps the encoder look at
other words in the input sentence as it encodes a specific word.
•The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same
feed-forward network is independently applied to each position.
•The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

After embedding the words in our input sequence, each of
them flows through each of the two layers of the encoder.

• One key property of the Transformer is that the word in each position flows through its own path in the encoder.
• There are dependencies between these paths in the self-attention layer.
• The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Self-Attention at a High Level
Suppose the following sentence is an input sentence we want to translate:
"The animal didn't cross the street because it was too tired"
What does "it" in this sentence refer to? Is it referring to the street or to the animal?
When the model is processing the word "it", self-attention allows it to associate "it" with "animal".
As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
Self-attention is the method the Transformer uses to bake
the “understanding” of other relevant words into the one
we’re currently processing.
"The Illustrated Transformer" by Jay Alammar.

The final hidden state of the encoder (h4) encapsulates the information from the input sequence necessary for starting the decoding process; it is used to generate the initial hidden state of the decoder.

Sequence-to-Sequence with RNNs (Justin Johnson, Lecture 17, March 21, 2022)
Example: "we are eating bread" → "estamos comiendo pan [STOP]".
Input: sequence x1, ..., xT. Output: sequence y1, ..., yT'.
Encoder: ht = fW(xt, ht-1)
Decoder: st = gU(yt-1, st-1, c)
From the final hidden state we predict the initial decoder state s0 and the context vector c (often c = hT).
Problem: the input sequence is bottlenecked through a fixed-sized vector. What if T = 1000?
Idea: use a new context vector at each step of the decoder!
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014.

Sequence-to-Sequence with RNNs and Attention (Justin Johnson, Lecture 17)
From the final hidden state: initial decoder state s0.
Compute (scalar) alignment scores: et,i = fatt(st-1, hi), where fatt is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < at,i < 1, ∑i at,i = 1.
Compute the context vector as a linear combination of the encoder hidden states: ct = ∑i at,i hi.
Use the context vector in the decoder: st = gU(yt-1, st-1, ct).
Intuition: the context vector attends to the relevant part of the input sequence. "estamos" = "we are", so maybe a11 = a12 = 0.45 and a13 = a14 = 0.05.
This is all differentiable! Do not supervise the attention weights – backprop through everything.
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015.
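A minimal sketch of a single attention step with these equations, using a tiny MLP as fatt (all sizes and weights here are random stand-ins for illustration):

```python
# Sketch of Bahdanau-style attention for one decoder step.
# f_att is a tiny one-hidden-layer MLP; all weights are random stand-ins.
import numpy as np

rng = np.random.default_rng(1)
T, H = 4, 8                       # input length, hidden size
h = rng.normal(size=(T, H))       # encoder hidden states h_1..h_T
s_prev = rng.normal(size=H)       # previous decoder state s_{t-1}

W1 = rng.normal(size=(2 * H, H)) * 0.1   # f_att hidden layer
w2 = rng.normal(size=H) * 0.1            # f_att output layer (scalar score)

# e_{t,i} = f_att(s_{t-1}, h_i): score each encoder hidden state
e = np.array([np.tanh(np.concatenate([s_prev, h[i]]) @ W1) @ w2
              for i in range(T)])

# a_{t,i} = softmax(e_{t,:}): attention weights, positive and summing to 1
a = np.exp(e - e.max())
a /= a.sum()

# c_t = sum_i a_{t,i} h_i: context vector
c = a @ h

print("attention weights:", np.round(a, 3), "sum =", a.sum())
print("context vector shape:", c.shape)   # (H,)
```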

Sequence-to-Sequence with RNNs and Attention (continued)
Repeat: use s1 to compute a new context vector c2 (scores e2,1..e2,4 → softmax → weights a2,1..a2,4 → weighted sum of the encoder hidden states).
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015.

Repeat: use s1 to compute the new context vector c2, then use c2 to compute s2 and y2 ("comiendo").
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015.

Use a different context vector in each timestep of the decoder:
-Input sequence not bottlenecked through a single vector.
-At each timestep of the decoder, the context vector "looks at" different parts of the input sequence.
(Example translations in the figure: "we are eating bread" → "estamos comiendo pan [STOP]"; Arabic: "نحن نأكل الخبز", "we are eating bread".)
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015.

Sequence-to-Sequence with RNNs and Attention: visualizing the attention weights at,i.
Example: English-to-French translation.
Input: "The agreement on the European Economic Area was signed in August 1992."
Output: "L'accord sur la zone économique européenne a été signé en août 1992."
Attention figures out different word orders: diagonal attention means words correspond in order; the remaining structure reflects reordering and verb conjugation.
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015.

Image Captioning with RNNs and Attention (Justin Johnson, Lecture 17)
Use a CNN to compute a grid of features hi,j for an image. Each timestep of the decoder uses a different context vector that looks at different parts of the input image:
et,i,j = fatt(st-1, hi,j)
at,:,: = softmax(et,:,:)
ct = ∑i,j at,i,j hi,j
The decoder then generates the caption word by word (e.g., "cat sitting outside [STOP]").
Xu et al., "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015.

Self-Attention
• The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word).
• For each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that are learned during training.
https://youtu.be/YAgjfMR9R_M
https://jalammar.github.io/illustrated-transformer/

Self-Attention
• The second step in calculating self-attention is to calculate a score.
• Say we're calculating the self-attention for the first word in this example, "Thinking".
• We need to score each word of the input sentence against this word.
• The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we're scoring. If we're processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

Self-Attention
• The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64; this leads to more stable gradients; other values are possible, but this is the default), then pass the result through a softmax operation.
• Softmax normalizes the scores so they're all positive and add up to 1.
This softmax score determines how much each word will be expressed at this position. Clearly, the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.

Self-Attention
• The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
• The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
The resulting vector is one we can send along to the feed-forward neural network.
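Putting the six steps together for one position, here is a minimal NumPy sketch (toy embeddings and random projection matrices, purely for illustration; the key dimension is 64, hence the division by 8):

```python
# Self-attention for position #1, following steps 1-6 above.
import numpy as np

rng = np.random.default_rng(2)
d_model, d_k = 4, 64
x = rng.normal(size=(2, d_model))          # embeddings of "Thinking", "Machines"
W_q = rng.normal(size=(d_model, d_k)) * 0.1
W_k = rng.normal(size=(d_model, d_k)) * 0.1
W_v = rng.normal(size=(d_model, d_k)) * 0.1

q = x @ W_q                                # step 1: queries
k = x @ W_k                                #         keys
v = x @ W_v                                #         values

scores = np.array([q[0] @ k[0], q[0] @ k[1]])   # step 2: q1·k1, q1·k2
scores = scores / np.sqrt(d_k)                  # step 3: divide by 8
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # step 4: softmax
z1 = weights[0] * v[0] + weights[1] * v[1]      # steps 5-6: weight and sum values

print("softmax weights:", np.round(weights, 3))
print("z1 shape:", z1.shape)                    # output for position #1
```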


Self-Attention Layer (Justin Johnson, Lecture 17)
Inputs:
Input vectors: X (shape: NX × DX)
Key matrix: WK (shape: DX × DQ)
Value matrix: WV (shape: DX × DV)
Query matrix: WQ (shape: DX × DQ)
Computation:
Query vectors: Q = XWQ
Key vectors: K = XWK (shape: NX × DQ)
Value vectors: V = XWV (shape: NX × DV)
Similarities: E = QK^T / sqrt(DQ) (shape: NX × NX), Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (shape: NX × NX)
Output vectors: Y = AV (shape: NX × DV), Yi = ∑j Ai,j Vj
The self-attention layer works on sets of vectors. Consider permuting the input vectors: the outputs will be the same, but permuted. The self-attention layer is permutation equivariant: f(s(x)) = s(f(x)).
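A minimal sketch of this matrix form, including a check of the permutation-equivariance property (sizes and weights are toy placeholders):

```python
# Matrix-form self-attention: Y = softmax(Q K^T / sqrt(D_Q)) V
import numpy as np

rng = np.random.default_rng(3)
N_X, D_X, D_Q, D_V = 3, 8, 6, 8          # number of inputs and dimensions

W_Q = rng.normal(size=(D_X, D_Q)) * 0.1  # query matrix
W_K = rng.normal(size=(D_X, D_Q)) * 0.1  # key matrix
W_V = rng.normal(size=(D_X, D_V)) * 0.1  # value matrix

def self_attention(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # (N_X, D_Q), (N_X, D_Q), (N_X, D_V)
    E = Q @ K.T / np.sqrt(D_Q)                   # similarities, (N_X, N_X)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)            # softmax so each row of A sums to 1
    return A @ V                                 # outputs Y, (N_X, D_V)

X = rng.normal(size=(N_X, D_X))
Y = self_attention(X)

# Permutation equivariance: permuting the inputs permutes the outputs.
perm = np.array([2, 0, 1])
print(np.allclose(self_attention(X[perm]), Y[perm]))  # True
```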

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head, resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce the Q/K/V matrices.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

The Residuals
• One detail in the architecture of the encoder: each sub-layer (self-attention, FFN) in each encoder has a residual connection around it and is followed by a layer-normalization step.

•If we’re to visualize the vectors and the
layer-norm operation associated with self
attention, it would look like this:
42

If we were to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:


https://www.tensorflow.org/text/tutorials/transformer

The RNN+Attention model vs. a 1-layer Transformer:

Before Transformers and self-attention, models commonly used RNNs or CNNs to do this task:
The global self-attention layer
https://www.tensorflow.org/text/tutorials/transformer

A 1-layer Transformer, a 4-layer Transformer, and the RNN+Attention model:

(Textbook excerpt, section 12.2: dot-product self-attention)
Queries and keys must have the same dimension. However, this can differ from the dimension of the values, which is usually the same size as the input so that the representation does not change size.
Self-attention summary: the nth output is a weighted sum of the same linear transformation v = βv + Ωv x applied to all of the inputs, where these attention weights are positive and sum to one. The weights depend on a measure of similarity between input xn and the other inputs. There is no activation function, but the mechanism is nonlinear due to the dot-product and softmax operation used to compute the attention weights.
Computing attention weights:
a) Query vectors qn = βq + Ωq xn and key vectors kn = βk + Ωk xn are computed for each input xn.
b) The dot products between each query and the three keys are passed through a softmax function to form non-negative attentions that sum to one.
c) These route the value vectors via the sparse matrix.
A sparse matrix is a special case of a matrix in which the number of zero elements is much higher than the number of non-zero elements.

(Textbook excerpt, section 12.4: multi-head self-attention)
MhSa[X] = Ωc [Sa1[X]^T, Sa2[X]^T, ..., SaH[X]^T]^T     (12.12)
Multiple heads seem to be necessary to make the transformer work well. It has been speculated that they make the self-attention network more robust to bad initializations.
Self-attention is just one part of a larger transformer layer, which consists of a multi-head self-attention unit (allowing the word representations to interact with each other) followed by a fully connected network applied separately to each word.
Multi-head self-attention.
Self-attention occurs in parallel across
multiple “heads.”
Each has its own queries, keys, and values.
Here two heads are depicted, in the cyan
and orange boxes, respectively.
The outputs are vertically concatenated,
and another linear transformation Ωc is
used to recombine them.

Transformer layers
(Textbook excerpt, section 12.4: the transformer layer)
The transformer layer consists of a multi-head self-attention unit followed by a fully connected network mlp[x] that operates separately on each word. Both units are residual networks (i.e., their output is added back to the original input). In addition, it is typical to add a LayerNorm operation after both the self-attention and fully connected networks. This is similar to BatchNorm but uses statistics across the tokens within a single input sequence to perform the normalization. The complete layer can be described by the following series of operations:
X ← X + MhSa[X]
X ← LayerNorm[X]
xn ← xn + mlp[xn]   for all n in {1, ..., N}
X ← LayerNorm[X]     (12.13)
where the column vectors xn are separately taken from the full data matrix X. In a real network, the data passes through a series of these layers.
Transformers for natural language processing: a typical NLP pipeline starts with a tokenizer that splits the text into words or word fragments.
The input consists of a D × N matrix containing the D-dimensional word embeddings for each of the N input tokens.
The transformer layer consists of a series of operations:
First, there is a multi-head attention block, allowing the word embeddings to interact with one another. This forms the processing of a residual block, so the inputs are added back to the output.
Second, a LayerNorm operation is applied.
Third, there is a second residual layer where the same two-layer fully connected neural network is applied separately to each of the N word representations (columns).
Finally, LayerNorm is applied again.
The output is a matrix of the same size.
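A minimal sketch of equation (12.13) as code, with single-head attention standing in for MhSa, a toy two-layer MLP, and a simplified LayerNorm without the learned scale and offset (all for brevity; not the book's implementation):

```python
# Sketch of one transformer layer following eq. (12.13):
#   X <- X + MhSa[X];  X <- LayerNorm[X];  x_n <- x_n + mlp[x_n];  X <- LayerNorm[X]
# Tokens are rows of X; attention and MLP weights are toy stand-ins.
import numpy as np

rng = np.random.default_rng(5)
N, D = 4, 16                                   # tokens, embedding size

W_Q = rng.normal(size=(D, D)) * 0.1
W_K = rng.normal(size=(D, D)) * 0.1
W_V = rng.normal(size=(D, D)) * 0.1
W1 = rng.normal(size=(D, 4 * D)) * 0.1
W2 = rng.normal(size=(4 * D, D)) * 0.1

def self_attention(X):                         # single-head stand-in for MhSa[X]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(D)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def mlp(X):                                    # same two-layer net applied to each token
    return np.maximum(X @ W1, 0) @ W2

def layer_norm(X, eps=1e-5):                   # normalize each token across its features
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    return (X - mu) / (sigma + eps)

def transformer_layer(X):
    X = layer_norm(X + self_attention(X))      # residual + LayerNorm
    X = layer_norm(X + mlp(X))                 # residual + LayerNorm
    return X

X = rng.normal(size=(N, D))
print(transformer_layer(X).shape)              # (4, 16): same size as the input
```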

For any hidden layer h, we pass the inputs through a non-linear activation to get the output. For every neuron (activation) in a particular layer, we can force the pre-activations to have zero mean and unit standard deviation. This can be achieved by subtracting the mean from each of the input features across the mini-batch and dividing by the standard deviation.

• Batch normalization normalizes each feature independently across the mini-batch. Layer normalization normalizes each of the inputs in the batch independently across all features.
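A minimal NumPy sketch of the difference: only the normalization step is shown (no learned scale and shift); the axis over which the statistics are computed is the whole point:

```python
# Batch norm vs. layer norm on a (batch, features) activation matrix.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(loc=3.0, scale=2.0, size=(32, 8))   # toy pre-activations

# Batch norm: statistics per feature, computed across the mini-batch (axis=0).
bn = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-5)

# Layer norm: statistics per example, computed across the features (axis=1).
ln = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-5)

print(bn.mean(axis=0).round(3))   # ~0 for every feature column
print(ln.mean(axis=1).round(3))   # ~0 for every example row
```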

• In batch normalization, we use the batch statistics: the mean and standard deviation corresponding to the current mini-batch. However, when the batch size is small, the sample mean and sample standard deviation are not representative enough of the actual distribution, and the network cannot learn anything meaningful.
• As batch normalization depends on batch statistics for normalization, it is less suited for sequence models. This is because, in sequence models, we may have sequences of potentially different lengths and smaller batch sizes corresponding to longer sequences.
• Normalizing across all features, but for each of the inputs to a specific layer, removes the dependence on batches. This makes layer normalization well suited for sequence models such as transformers and the recurrent neural networks (RNNs) that were popular in the pre-transformer era.
• In layer normalization, instead of the batch statistics, we use the mean and variance corresponding to a specific input to the neurons in a particular layer, say l. This is equivalent to normalizing the output vector from layer l−1.

(Textbook excerpt, figure 11.14 notes)
Salimans & Kingma (2016) investigated normalizing the network weights rather than the activations, but this has been less empirically successful. Teye et al. (2018) introduced Monte Carlo batch normalization, which can provide meaningful estimates of uncertainty in the predictions of neural networks. A recent comparison of the properties of different normalization schemes can be found in Lubana et al. (2021).
Why BatchNorm helps: BatchNorm helps control the initial gradients in a residual network. However, the mechanism by which BatchNorm improves performance is not well understood. The stated goal of Ioffe & Szegedy (2015) was to reduce problems caused by internal covariate shift, which is the change in the distribution of inputs to a layer caused by updating preceding layers during the backpropagation update. However, Santurkar et al. (2018) provided evidence against this view by artificially inducing covariate shift and showing that networks with and without BatchNorm performed equally well.
Motivated by this, they searched for another explanation for why BatchNorm should improve performance. They showed empirically for the VGG network that adding batch normalization decreases the variation in both the loss and its gradient as we move in the gradient direction. In other words, the loss surface is both smoother and changes more slowly, which is why larger learning rates are possible. They also provide theoretical proofs for both these phenomena and show that for any parameter initialization, the distance to the nearest optimum is less for networks with batch normalization. Bjorck et al. (2018) also argue that BatchNorm improves the properties of the loss landscape and allows larger learning rates.
Other explanations of why BatchNorm improves performance include decreasing the importance of tuning the learning rate (Ioffe & Szegedy, 2015; Arora et al., 2018). Indeed, Li & Arora (2019) show that using an exponentially increasing learning rate schedule is possible with batch normalization. Ultimately, this is because batch normalization makes the network invariant to the scales of the weight matrices (see Huszár, 2019, for an intuitive visualization). Hoffer et al. (2017) identified that BatchNorm has a regularizing effect due to statistical fluctuations.
Normalization schemes:
BatchNorm modifies each channel separately but adjusts each batch member in the same way, based on statistics gathered across the batch and spatial position.
Ghost BatchNorm computes these statistics from only part of the batch to make them more variable.
LayerNorm computes statistics for each batch member separately, based on statistics gathered across the channels and spatial position. It retains a separate learned scaling factor for each channel.
GroupNorm normalizes within each group of channels and also retains a separate scale and offset parameter for each channel.
InstanceNorm normalizes within each channel separately, computing the statistics only across spatial position.
(Spatial position: meaning that the normalization is performed separately for each channel at every pixel location.)

The original Transformer diagram and a representation of a 4-layer Transformer:
https://www.tensorflow.org/text/tutorials/transformer

Encoder model example: BERT (pre-training; textbook excerpt, figure 12.10 and section 12.6.1)
Pre-training for a BERT-like encoder: the input tokens (and a special <cls> token denoting the start of the sequence) are converted to word embeddings. These embeddings are passed through a series of transformer layers, in which every token attends to every other token, to create a set of output embeddings. A small fraction of the input tokens is randomly replaced with a generic <mask> token. In pre-training, the goal is to predict the missing word from the associated output embedding: the output embeddings are passed through a softmax function, and the multiclass classification loss is used. This task has the advantage that it uses both the left and right context to predict the missing word, but the disadvantage that it does not make efficient use of data; here, seven tokens need to be processed to add two terms to the loss function.
In the pre-training stage, the network is trained using self-supervision. This allows the use of enormous amounts of data without the need for manual labels. For BERT, the self-supervision task consists of predicting missing words from sentences from a large internet corpus. During training, the maximum input length is 512 tokens, and the batch size is 256. The system is trained for a million steps, corresponding to roughly 50 epochs of the 3.3-billion-word corpus.
Predicting missing words forces the transformer network to understand some syntax. For example, it might learn that the adjective "red" is often found before nouns like "house" or "car" but never before a verb like "shout". It also allows the model to learn superficial common sense about the world. For example, after training, the model will assign a higher probability to the missing word "train" in the sentence "The <mask> pulled into the station" than it would to the word "peanut". However, the degree of "understanding" this type of model can ever have is limited.
(BERT also uses a secondary task that predicts whether two sentences were originally adjacent in the text or not, but this only marginally improves performance.)
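A minimal sketch of this masked-token objective, with a toy vocabulary and a random stand-in for the encoder (only the masking, softmax, and cross-entropy parts are meant literally):

```python
# Sketch of BERT-style masked-token prediction: mask a fraction of tokens,
# run an encoder (stubbed out here), and score the masked positions with a
# softmax over the vocabulary plus a cross-entropy loss.
import numpy as np

rng = np.random.default_rng(7)
vocab = ["<cls>", "<mask>", "the", "train", "pulled", "into", "station", "peanut"]
tok = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16

sentence = ["<cls>", "the", "train", "pulled", "into", "the", "station"]
ids = np.array([tok[w] for w in sentence])

# Randomly replace one of the (non-<cls>) tokens with <mask>.
mask_positions = rng.choice(np.arange(1, len(ids)), size=1, replace=False)
masked_ids = ids.copy()
masked_ids[mask_positions] = tok["<mask>"]

def encoder(token_ids):
    # Stand-in for the stack of transformer layers: returns one D-dim
    # output embedding per token (random here, purely for illustration).
    return rng.normal(size=(len(token_ids), D))

W_out = rng.normal(size=(D, V)) * 0.1          # output projection to the vocabulary

H = encoder(masked_ids)
logits = H[mask_positions] @ W_out             # scores for the masked positions
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Multiclass cross-entropy against the original (unmasked) tokens.
loss = -np.log(probs[np.arange(len(mask_positions)), ids[mask_positions]]).mean()
print("masked positions:", mask_positions, "loss:", round(float(loss), 3))
```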

Fine-tuning (textbook excerpt, figure 12.11 and section 12.6.2)
After pre-training, the encoder is fine-tuned using manually labeled data to solve a particular task. Usually, a linear transformation or a multi-layer perceptron (MLP) is appended to the encoder to produce whatever output is required.
In the fine-tuning stage, the model parameters are adjusted to specialize the network to a particular task. An extra layer is appended onto the transformer network to convert the output vectors to the desired output format. Examples include:
a) Text classification. In BERT, a special token known as the classification or <cls> token is placed at the start of each string during pre-training. For text classification tasks like sentiment analysis (in which the passage is labeled as having a positive or negative emotional tone), the vector associated with the <cls> token is mapped to a single number and passed through a logistic sigmoid. This contributes to a standard binary cross-entropy loss. In this sentiment classification task, the <cls> token embedding is used to predict the probability that the review is positive.
b) Word classification. In this named entity recognition problem, the embedding for each word is used to predict whether the word corresponds to a person, place, or organization, or is not an entity.
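A minimal sketch of the text-classification head in (a): map the <cls> output embedding to a single number, pass it through a logistic sigmoid, and score it with binary cross-entropy (the encoder here is a random stand-in):

```python
# Sentiment-classification head on top of a (stubbed) BERT-like encoder.
import numpy as np

rng = np.random.default_rng(8)
D = 16                                   # output embedding size

def encoder(token_ids):
    # Stand-in for the pre-trained encoder: one D-dim embedding per token,
    # with position 0 corresponding to the <cls> token.
    return rng.normal(size=(len(token_ids), D))

w, b = rng.normal(size=D) * 0.1, 0.0     # fine-tuned classification head

def predict_positive_prob(token_ids):
    cls_embedding = encoder(token_ids)[0]          # take the <cls> position
    logit = cls_embedding @ w + b                  # map to a single number
    return 1.0 / (1.0 + np.exp(-logit))            # logistic sigmoid

p = predict_positive_prob([0, 12, 7, 9])           # toy token ids
label = 1                                           # 1 = positive review
bce = -(label * np.log(p) + (1 - label) * np.log(1 - p))
print("P(positive) =", round(float(p), 3), "BCE loss =", round(float(bce), 3))
```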

Decoder model example: GPT3
(Textbook excerpt, figure 12.12 and section 12.7.4)
Training a GPT3-type decoder network: the tokens are mapped to word embeddings, with a special <start> token at the beginning of the sequence. The embeddings are passed through a series of transformer layers that use masked self-attention. Here, each position in the sentence can only attend to its own embedding and the embeddings of tokens earlier in the sequence. The goal at each position is to maximize the probability of the following ground-truth token in the sequence: at position one, we want to maximize the probability of the token "It"; at position two, we want to maximize the probability of the token "takes"; and so on. Masked self-attention ensures the system cannot cheat by looking at subsequent inputs. The autoregressive task has the advantage of making efficient use of the data, since every word contributes a term to the loss function; however, it only exploits the left context of each word.
Because of the masked self-attention, much of the earlier computation can be recycled as we generate subsequent tokens. In practice, many strategies can make the output text more coherent. For example, beam search keeps track of multiple possible sentence completions to find the overall most likely one (which is not necessarily found by greedily choosing the most likely next word at each step). Top-k sampling randomly draws the next word from only the top K most likely possibilities, to prevent the system from accidentally choosing from the long tail of low-probability tokens and leading to an unnecessary linguistic dead end.
GPT3 and few-shot learning: GPT3 applies these ideas on a massive scale. The sequence lengths are 2048 tokens long, and since multiple spans of 2048 tokens are processed at once, the total batch size is 3.2 million tokens. There are 96 transformer layers (some of which implement a sparse version of attention), each processing a word embedding of size 12288. There are 96 heads in the self-attention layers, and the value, query, and key dimension is 128. It is trained with 300 billion tokens and contains 175 billion parameters.
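A minimal sketch of the masked (causal) self-attention used here: before the softmax, scores for positions to the right of the query are set to minus infinity, so each token can only attend to itself and to earlier tokens (sizes and weights are toy stand-ins):

```python
# Masked (causal) self-attention: token i may only attend to tokens j <= i.
import numpy as np

rng = np.random.default_rng(9)
N, D = 5, 16                                  # sequence length, embedding size
W_Q = rng.normal(size=(D, D)) * 0.1
W_K = rng.normal(size=(D, D)) * 0.1
W_V = rng.normal(size=(D, D)) * 0.1

X = rng.normal(size=(N, D))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

E = Q @ K.T / np.sqrt(D)                      # similarities, (N, N)
causal_mask = np.triu(np.ones((N, N), dtype=bool), k=1)
E[causal_mask] = -np.inf                      # block attention to future positions

A = np.exp(E - E.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
Y = A @ V

print(np.round(A, 2))                         # lower-triangular attention weights
```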

Encoder-decoder model example: machine translation (textbook excerpt, figure 12.13 and section 12.8)
Encoder-decoder architecture: two sentences are passed to the system with the goal of translating the first into the second. a) The first sentence is passed through a standard encoder. b) The second sentence is passed through a decoder that uses masked self-attention but also attends to the output embeddings of the encoder using cross-attention. The loss function is the same as for the decoder model: we want to maximize the probability of the next word in the output sequence.
Translation between languages is an example of a sequence-to-sequence task. This requires an encoder (to compute a good representation of the source sentence) and a decoder (to generate the sentence in the target language). This task can be tackled using an encoder-decoder model.
Consider the example of translating from English to French. The encoder receives the sentence in English and processes it through a series of transformer layers to create an output representation for each token. During training, the decoder receives the ground-truth translation in French and passes it through a series of transformer layers that use masked self-attention and predict the following word at each position.
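A minimal sketch of the cross-attention step: queries come from the decoder states, while keys and values come from the encoder output embeddings (all sizes and weights are illustrative stand-ins):

```python
# Cross-attention: decoder queries attend to encoder keys/values.
import numpy as np

rng = np.random.default_rng(10)
N_src, N_tgt, D = 6, 4, 16              # source length, target length, model size

W_Q = rng.normal(size=(D, D)) * 0.1     # applied to decoder states
W_K = rng.normal(size=(D, D)) * 0.1     # applied to encoder outputs
W_V = rng.normal(size=(D, D)) * 0.1

enc_out = rng.normal(size=(N_src, D))   # encoder output embeddings (English)
dec_in  = rng.normal(size=(N_tgt, D))   # decoder states (French, so far)

Q = dec_in  @ W_Q                       # (N_tgt, D)
K = enc_out @ W_K                       # (N_src, D)
V = enc_out @ W_V                       # (N_src, D)

E = Q @ K.T / np.sqrt(D)                # (N_tgt, N_src)
A = np.exp(E - E.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)       # each target position attends over the source
Y = A @ V                               # (N_tgt, D): context from the source sentence

print(A.shape, Y.shape)                 # (4, 6) (4, 16)
```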

The downsides of this architecture are:
• First, the Transformer architecture processes the entire sequence of inputs at once, rather than processing them sequentially like an RNN. This means that for each output in the sequence, the model must attend to all of the inputs in the sequence, which can be computationally expensive for long sequences.
• Second, the Transformer architecture does not inherently encode positional information in the input sequence. In other words, the model does not inherently know which token in the sequence came before or after another token. To address this, the Transformer architecture includes positional embeddings, which are added to the input embeddings to provide information about the position of each token in the sequence (see the sketch below). If these embeddings are not included, the model may not be able to effectively capture the temporal or spatial relationship between the inputs.
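As one common choice, the original Transformer paper uses sinusoidal positional encodings that are simply added to the input embeddings; a minimal sketch (toy dimensions for illustration):

```python
# Sinusoidal positional encodings added to the input embeddings.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

rng = np.random.default_rng(11)
seq_len, d_model = 10, 16
token_embeddings = rng.normal(size=(seq_len, d_model))

# The model's actual input: embedding + position information.
x = token_embeddings + positional_encoding(seq_len, d_model)
print(x.shape)  # (10, 16)
```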