Deep Learning
13 Word Embeddings
Dr. Konda Reddy Mopuri
Dept. of AI, IIT Hyderabad
Jan-May 2024
Why Word Embeddings?
Terminology
Corpus: a collection of authentic text organized into a dataset
Vocabulary (V): the set of allowed words
Target: a representation for every word in V
One-hot Encoding
Representation using discrete symbols
The |V| words are encoded as binary vectors of length |V|
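A minimal sketch of one-hot encoding over a toy vocabulary (the vocabulary and words are made up for illustration):

```python
import numpy as np

# Toy vocabulary; in practice V comes from the corpus
vocab = ["dog", "cat", "machine", "learning"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a binary vector of length |V| with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("dog"))  # [1. 0. 0. 0.]
# Note: all pairwise distances are identical, e.g.
# ||one_hot("dog") - one_hot("cat")|| == ||one_hot("dog") - one_hot("machine")||
```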
One-hot encoding: Drawbacks
1. Space inefficient (e.g., 13M words in the Google 1T corpus)
2. No notion of similarity (or distance) between words:
   'Dog' is as close to 'Cat' as it is to 'Machine'
Notion of Meaning for words
1. What is a good notion of meaning for a word?
2. How do we, humans, know the meaning of a word?
Notion of Meaning for words
1. What does "silla" mean?
Notion of Meaning for words
1. Let's see how this word is used in different contexts
Notion of Meaning for words
1. Does this context help you understand the word "silla"?
2. { positioned near a window or against a wall or in the corner, used for conversing/events, can be used to relax }
Notion of Meaning for words
1. How did we do that?
2. "We searched for other words that can be used in the same contexts, found some, and concluded that 'silla' has to mean something similar to those words."
Notion of Meaning for words
1. Distributional Hypothesis: words that frequently appear in similar contexts have a similar meaning
Distributed Representations
1. The representation/meaning of a word should consider its context in the corpus
2. Use the many contexts of a word to build up a representation for it
Distributed Representations
1. A co-occurrence matrix is one way to capture this! (a small sketch follows below)
   size: (#words × #contexts), rows: words (m), cols: contexts (n)
   the word and context sets can be the same or different
2. Context can be defined as an 'h'-word neighborhood
3. Each row (column) is a vectorial representation of the word (context)
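A small sketch of building such a matrix from a toy corpus (window half-size h = 2 is assumed; here the word and context sets coincide):

```python
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

h = 2  # context window: h words on each side
X = np.zeros((len(vocab), len(vocab)))  # words and contexts share the vocabulary here

for sent in corpus:
    for t, word in enumerate(sent):
        for j in range(max(0, t - h), min(len(sent), t + h + 1)):
            if j != t:
                X[idx[word], idx[sent[j]]] += 1

print(vocab)
print(X.astype(int))  # row i counts how often each context word appears near word i
```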
Co-occurrence matrix
1. Very sparse
2. Very high-dimensional (grows with the vocabulary size)
3. Solution: dimensionality reduction (SVD)!
SVD on the Co-occurrence matrix
1. X = U Σ V^T
2. X_{m×n} = [u_1 ⋯ u_k]_{m×k} · diag(σ_1, …, σ_k)_{k×k} · [v_1^T; …; v_k^T]_{k×n}
3. X = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + … + σ_k u_k v_k^T
4. X̂ = ∑_{i=1}^{d} σ_i u_i v_i^T, with d < k, is a rank-d approximation of X
SVD on the Co-occurrence matrix
1. Before the SVD, the representations were the rows of X
2. How do we reduce the representation size with SVD?
3. W_word = U_{m×k} · Σ_{k×k}
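A minimal sketch of this reduction step (the toy counts and the truncation rank d are assumptions for illustration; np.linalg.svd provides the factors used below):

```python
import numpy as np

# Toy 4x4 co-occurrence matrix (rows/cols: dog, cat, machine, learning) -- made-up counts
vocab = ["dog", "cat", "machine", "learning"]
X = np.array([
    [0, 4, 0, 1],
    [4, 0, 0, 1],
    [0, 0, 0, 5],
    [1, 1, 5, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

d = 2                          # truncation rank, d < k
W_word = U[:, :d] * S[:d]      # rows: d-dim word vectors (U_d · Σ_d)
W_context = Vt[:d, :].T        # rows: d-dim context vectors

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words with similar co-occurrence rows end up with nearby low-dimensional vectors
for i, w in enumerate(vocab):
    print(w, [round(cos(W_word[i], W_word[j]), 2) for j in range(len(vocab))])
```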
SVD on the Co-occurrence matrix
1. W_word ∈ R^{m×k} (k ≪ |V| = m) is taken as the representation of the words
2. Fewer dimensions, but the same similarities! (one may verify that X X^T ≈ X̂ X̂^T)
3. W_context = V ∈ R^{n×k} is taken as the representation of the context words
A bit more cleverness...
1. Entries in the co-occurrence matrix can be weighted (e.g., the HAL model)
2. Better association measures can be used (e.g., PPMI)
3. ...
Count-based vs prediction-based models
1. The techniques we have seen so far rely on counts (i.e., co-occurrences of words)
2. Next, we look at prediction-based models for word embeddings
Word2Vec
1. T. Mikolov et al. (2013)
2. Two versions: predict words from the contexts (or contexts from words)
3. Continuous Bag of Words (CBoW) and Skip-gram
Word Embeddings: Skip-gram
1. Start with a huge corpus and a random initialization of the word embeddings
2. Process the text with a sliding window (one word at a time; see the sketch below):
   (1) At each step, there is a central word and context words (the other words in the window)
   (2) Given the central word, compute the probabilities of the context words
   (3) Modify the word embeddings to increase these probabilities
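A toy sketch of generating the (central word, context word) pairs with the sliding window (the corpus and window half-size m are made up):

```python
corpus = "the cute grey cat sat on the mat".split()
m = 2  # context window half-size

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:5])
# [('the', 'cute'), ('the', 'grey'), ('cute', 'the'), ('cute', 'grey'), ('cute', 'cat')]
```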
Word Embeddings: Skip-gram
[Figures from Lena Voita illustrating the sliding-window process]
Word Embeddings: Skip-gram
For each position t = 1, 2, …, T in the corpus, Skip-gram predicts the context words within an m-sized window (θ denotes the parameters to be optimized):

Likelihood:  L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t, θ)
Word Embeddings: Skip-gram
The loss is the mean negative log-likelihood (NLL):

Loss:  J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t, θ)
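To make the step explicit (standard algebra, not additional course material): taking the negative log of the likelihood turns the products into sums, giving the loss above.

```latex
J(\theta) \;=\; -\frac{1}{T}\,\log L(\theta)
         \;=\; -\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j}\mid w_t,\theta)
```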
Word Embeddings: Skip-gram
What are the parameters (θ) to be learned?
[Figure from Lena Voita]
Word Embeddings: Skip-gram
How to compute P(w_{t+j} | w_t, θ)?
[Figure from Lena Voita]
Word Embeddings: Skip-gram
[Figures from Lena Voita]
Word Embeddings: Skip-gram
Train using gradient descent, one pair at a time, i.e., (a center word, one of its context words); e.g., for the pair (cat, cute):

J_{t,j}(θ) = −log P(cute | cat)
           = −log [ exp(u_cute^T v_cat) / ∑_{w ∈ Voc} exp(u_w^T v_cat) ]
           = −u_cute^T v_cat + log ∑_{w ∈ Voc} exp(u_w^T v_cat)
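A minimal numpy sketch of this per-pair loss (random toy vectors; the u/v naming mirrors the slide, everything else is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "cute", "grey", "sat", "on"]
dim = 8

# v: central-word vectors, u: context-word vectors (both are learned)
v = {w: rng.normal(size=dim) * 0.1 for w in vocab}
u = {w: rng.normal(size=dim) * 0.1 for w in vocab}

def pair_loss(center: str, context: str) -> float:
    """J = -u_context^T v_center + log sum_w exp(u_w^T v_center)."""
    scores = np.array([u[w] @ v[center] for w in vocab])
    return -u[context] @ v[center] + np.log(np.exp(scores).sum())

print(pair_loss("cat", "cute"))  # one gradient step would then lower this value
```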
Word Embeddings: Skip-gram
[Figure from Lena Voita]
Word Embeddings: Skip-gram
1. Training is slow (for each central word, the context vectors of all vocabulary words need to be updated)
2. Negative sampling: instead of all of them, only a random sample of context words is updated
3. Training over a large corpus still provides sufficient updates for each vector
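A rough sketch of the negative-sampling variant of the per-pair loss (this follows the standard word2vec formulation with K sampled negatives; the toy vectors and sampling scheme are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["cat", "cute", "grey", "sat", "on", "mat"]
dim = 8
v = {w: rng.normal(size=dim) * 0.1 for w in vocab}  # central-word vectors
u = {w: rng.normal(size=dim) * 0.1 for w in vocab}  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context, k=3):
    """-log sigma(u_ctx . v_c) - sum over k sampled words of log sigma(-u_neg . v_c)."""
    negatives = rng.choice([w for w in vocab if w != context], size=k, replace=False)
    loss = -np.log(sigmoid(u[context] @ v[center]))
    loss -= sum(np.log(sigmoid(-(u[n] @ v[center]))) for n in negatives)
    return loss

print(neg_sampling_loss("cat", "cute"))
```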
Word Embeddings: Skip-gram
[Figure from Lena Voita]
Word Embeddings: Skip-gram
Can be viewed as a Neural Network
Word Embeddings: Skip-gram
1. W_{N×m} is W_word (used for representing the words)
2. W′_{m×N} is W_context (may be discarded after training)
3. Some works have shown that averaging the word and context vectors may be more beneficial
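A compact PyTorch sketch of this two-matrix view (a generic re-implementation, not the course's code; the vocabulary size m and embedding dimension N are placeholders):

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, m: int, N: int):
        super().__init__()
        self.W_word = nn.Embedding(m, N)      # central-word vectors (kept after training)
        self.W_context = nn.Embedding(m, N)   # context-word vectors (often discarded)

    def forward(self, center_ids: torch.Tensor) -> torch.Tensor:
        # Scores over the whole vocabulary; a softmax over them gives P(context | center)
        return self.W_word(center_ids) @ self.W_context.weight.T

model = SkipGram(m=10_000, N=300)
logits = model(torch.tensor([42]))                              # shape: (1, 10000)
loss = nn.functional.cross_entropy(logits, torch.tensor([7]))   # target: a context word id
```

Averaging `model.W_word.weight` and `model.W_context.weight` after training gives the variant mentioned in point 3.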
Bag of Words (BoW)
1. Bag of Words: the collection and frequencies of words (order is ignored)
CBoW
1. Considers the embeddings of the 'h' words before and the 'h' words after the target word
2. Adds them up (so the order is lost) to predict the target word
CBoW
1. Size of the vocabulary = m
2. Dimension of the embeddings = N
Word Embeddings: CBoW
1. The input layer W_{N×m} (embeddings of the context words) projects the context (the sum of the one-hot vectors of all the context words) into N-dim space
2. The representations of all the (2h) context words are summed, giving an N-dim context vector
Word Embeddings: CBoW
1. The next layer has a weight matrix W′_{m×N} (embeddings of the center words)
2. It projects the accumulated embedding onto the vocabulary
Word Embeddings: CBoW
1. m-way classification → (after a softmax) maximize the probability of the target word
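A PyTorch sketch of the CBoW forward pass under the same assumptions (generic re-implementation; m, N, and the example indices are placeholders):

```python
import torch
import torch.nn as nn

class CBoW(nn.Module):
    def __init__(self, m: int, N: int):
        super().__init__()
        self.W = nn.Embedding(m, N)                   # W_{N x m}: context-word embeddings
        self.W_prime = nn.Linear(N, m, bias=False)    # W'_{m x N}: projects onto the vocabulary

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, 2h) indices of the surrounding words
        h = self.W(context_ids).sum(dim=1)            # sum of context embeddings (order lost)
        return self.W_prime(h)                        # m-way scores; softmax gives P(target | context)

model = CBoW(m=10_000, N=300)
logits = model(torch.tensor([[3, 17, 99, 250]]))      # 2h = 4 context word ids
loss = nn.functional.cross_entropy(logits, torch.tensor([42]))  # target word id
```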
Word Embeddings: CBoW
1. W_{N×m} is W_context
2. W′_{m×N} is W_word
GloVe
1. GloVe: Global Vectors
2. Combines the count-based and prediction-based approaches
GloVe
1. X_ij in the co-occurrence matrix encodes the global information about words i and j:
   P(j | i) = X_ij / X_i
2. GloVe attempts to learn representations that are faithful to the co-occurrence information
GloVe
1. GloVe attempts to learn representations that are faithful to the co-occurrence information
2. Try to enforce v_i^T c_j = log P(j | i) = log X_ij − log X_i
   (v_i: central representation of word i, c_j: context representation of word j)
3. Similarly, v_j^T c_i = log P(i | j) = log X_ij − log X_j (the aim is to learn such embeddings v_i and c_i)
GloVe
1. To realize the exchange symmetry of v_i^T c_j = log P(j | i) = log X_ij − log X_i,
   we may capture log X_i as a bias b_i of the word w_i, and add an additional term b̃_j
2. v_i^T c_j + b_i + b̃_j = log X_ij
3. Since log X_i and log X_j depend only on the words i and j, they can be treated as learnable word-specific biases
GloVe
1. The learning objective becomes (a sketch follows below)
   argmin_{v_i, c_j, b_i, b̃_j}  J = ∑_{i,j} ( v_i^T c_j + b_i + b̃_j − log X_ij )^2
2. Many entries of the co-occurrence matrix are zero (noisy or less informative)
3. This suggests weighting each term in the sum
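A toy numpy sketch of this weighted least-squares objective (not the official GloVe code; the weighting function f(x) = min(1, (x/x_max)^α) with α = 0.75 follows the commonly used form and is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 5                                         # toy vocabulary size and embedding dimension
X = rng.integers(0, 10, size=(m, m)).astype(float)  # toy co-occurrence counts

v = rng.normal(size=(m, d)) * 0.1        # central vectors v_i
c = rng.normal(size=(m, d)) * 0.1        # context vectors c_j
b = np.zeros(m)                          # biases b_i
b_tilde = np.zeros(m)                    # biases b~_j

def f(x, x_max=100.0, alpha=0.75):
    """Down-weights rare co-occurrences; zero entries are skipped entirely."""
    return np.minimum(1.0, (x / x_max) ** alpha)

def glove_loss():
    total = 0.0
    for i in range(m):
        for j in range(m):
            if X[i, j] > 0:              # only non-zero entries contribute
                err = v[i] @ c[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                total += f(X[i, j]) * err ** 2
    return total

print(glove_loss())   # minimized w.r.t. v, c, b, b~ by gradient descent
```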
GloVe
[Figure from Lena Voita]
Evaluating the embeddings
1. Intrinsic: studying their internal properties (how well they capture meaning: word similarity, analogies, etc.)
2. Extrinsic: studying how well they help in performing a downstream task
Analysing the embeddings
1. Walking the semantic space
2. Structure: clusters form (nearest neighbors have a similar meaning), and there is linear structure (e.g., analogies)
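A small sketch of the two checks mentioned above, nearest neighbours and the linear (analogy) structure (the embeddings here are random stand-ins; with real pre-trained vectors, king − man + woman ≈ queen is the classic analogy example):

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "cat", "dog"]

def unit(x):
    return x / np.linalg.norm(x)

# Random stand-ins; in practice these would be loaded from trained embeddings
emb = {w: unit(rng.normal(size=50)) for w in words}

def nearest(vec, k=3, exclude=()):
    sims = {w: vec @ u for w, u in emb.items() if w not in exclude}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Nearest neighbours of a word (cosine similarity, since vectors are unit-normalised)
print(nearest(emb["cat"], exclude={"cat"}))

# Analogy via linear structure: king - man + woman ~ queen (holds for real embeddings)
query = unit(emb["king"] - emb["man"] + emb["woman"])
print(nearest(query, exclude={"king", "man", "woman"}))
```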
GloVe
[Figures from Lena Voita]