Deep learning slides on word embeddings.


About This Presentation

IITH DL


Slide Content

Deep Learning
Lecture 13: Word Embeddings
Dr. Konda Reddy Mopuri
Dept. of AI, IIT Hyderabad
Jan-May 2024

Why Word Embeddings?

Terminology
Corpus: a collection of authentic text organized into a dataset
Vocabulary (V): the set of allowed words
Target: a representation for every word in V

One-hot Encoding
Representation using discrete symbols
|V| words encoded as binary vectors of length |V|
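A minimal sketch of one-hot encoding over a toy vocabulary (the vocabulary and words here are illustrative, not from the slides):

```python
import numpy as np

# Toy vocabulary (illustrative); in practice |V| can be millions of words
vocab = ["cat", "dog", "machine", "sat", "on", "the", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a binary vector of length |V| with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

# Every pair of distinct words is equally (dis)similar: the dot product is always 0
print(one_hot("dog") @ one_hot("cat"))      # 0.0
print(one_hot("dog") @ one_hot("machine"))  # 0.0
```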

One-hot encoding: Drawbacks
1. Space inefficient (e.g., 13M words in the Google 1T corpus)
2. No notion of similarity (or distance) between words: 'Dog' is as close to 'Cat' as it is to 'Machine'

Notion of Meaning for words
1. What is a good notion of meaning for a word?
2. How do we, humans, know the meaning of a word?

Notion of Meaning for words
1. What does 'silla' mean?

Notion of Meaning for words
1. Let's see how this word is used in different contexts

Notion of Meaning for words
1. Does this context help you understand the word 'silla'?
2. { positioned near a window or against a wall or in the corner, used for conversing/events, can be used to relax }

Notion of Meaning for words
1. How did we do that?
2. "We searched for other words that can be used in the same contexts, found some, and concluded that 'silla' has to mean something similar to those words."

Notion of Meaning for words
1. Distributional Hypothesis: words that frequently appear in similar contexts have a similar meaning

Distributed Representations
1. The representation/meaning of a word should consider its context in the corpus
2. Use many contexts of a word to build up a representation for it

Distributed Representations
1. A co-occurrence matrix is a way to capture this!
   size: (#words × #words)
   rows: words (m), cols: contexts (n)
   words and contexts can be of the same or different size
2. Context can be defined as an 'h'-word neighborhood
3. Each row (column) is a vectorial representation of the word (context)

Co-occurrence matrix
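A minimal sketch of building a word-word co-occurrence matrix from a toy corpus with an h-word window (the corpus, window size, and whitespace tokenization are illustrative assumptions):

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]  # toy corpus (illustrative)
h = 2  # context window: h words on each side

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))  # rows: words, cols: context words
for sent in tokens:
    for t, word in enumerate(sent):
        for j in range(max(0, t - h), min(len(sent), t + h + 1)):
            if j != t:
                X[idx[word], idx[sent[j]]] += 1  # count co-occurrence within the window

print(vocab)
print(X)  # each row is a (sparse, high-dimensional) vector for a word
```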

Co-occurrence matrix
1. Very sparse
2. Very high-dimensional (grows with the vocabulary size)
3. Solution: dimensionality reduction (SVD)!

SVD on the Co-occurrence matrix
1. $X = U \Sigma V^T$
2. $X_{m \times n} = U_{m \times k} \, \Sigma_{k \times k} \, V^T_{k \times n}$, where $U = [u_1 \ldots u_k]$, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$, and the rows of $V^T$ are $v_1^T, \ldots, v_k^T$
3. $X = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \ldots + \sigma_k u_k v_k^T$
4. $\hat{X} = \sum_{i=1}^{d} \sigma_i u_i v_i^T$ (with $d < k$) is a rank-$d$ approximation of $X$

SVD on the Co-occurrence matrix
1. Before the SVD, representations were the rows of $X$
2. How do we reduce the representation size with SVD?
3. $W_{word} = U_{m \times k} \cdot \Sigma_{k \times k}$

SVD on the Co-occurrence matrix
1. $W_{word} \in \mathbb{R}^{m \times k}$ ($k \ll |V| = m$) is taken as the representation of the words
2. Fewer dimensions but the same similarities! (one may verify that $X X^T = \hat{X} \hat{X}^T$)
3. $W_{context} = V \in \mathbb{R}^{n \times k}$ is taken as the representation of the context words
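A minimal sketch of this dimensionality-reduction step with numpy, reusing the toy co-occurrence matrix X and index idx from the earlier sketch (the choice of d is illustrative):

```python
import numpy as np

# X: (m x n) co-occurrence matrix from the earlier sketch
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) Vt

d = 2  # keep the top-d singular directions (illustrative choice, d << |V|)
W_word = U[:, :d] * s[:d]       # word representations, shape (m, d)
W_context = Vt[:d, :].T         # context representations, shape (n, d)

X_hat = (U[:, :d] * s[:d]) @ Vt[:d, :]  # rank-d approximation of X

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# 'cat' and 'dog' appear in near-identical contexts in the toy corpus,
# so their reduced vectors end up close
print(cosine(W_word[idx["cat"]], W_word[idx["dog"]]))
```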

A bit more clever things...
1. Entries in the co-occurrence matrix can be weighted (HAL model)
2. Better association measures can be used (PPMI)
3. ...

Count-based vs prediction-based models
1. Techniques we have seen so far rely on counts (i.e., co-occurrences of words)
2. Next, we see prediction-based models for word embeddings

Word2Vec
1. T. Mikolov et al. (2013)
2. Two versions: predict words from the contexts (or contexts from words)
3. Continuous Bag of Words (CBoW) and Skip-gram

Word2Vec

Word Embeddings: Skip-gram

Word Embeddings: Skip-gram
1. Start: a huge corpus and a random initialization of the word embeddings
2. Process the text with a sliding window (one word at a time):
   1. At each step, there is a central word and context words (the other words in the window)
   2. Given the central word, compute the probabilities for the context words
   3. Modify the word embeddings to increase these probabilities
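A minimal sketch of the sliding-window step, i.e., generating (central word, context word) training pairs from a tokenized sentence (the sentence and window size are illustrative):

```python
# Generate (central word, context word) pairs with a window of m words on each side
sentence = "the cute grey cat sat on the mat".split()  # toy sentence (illustrative)
m = 2  # window half-size

pairs = []
for t, center in enumerate(sentence):
    for j in range(max(0, t - m), min(len(sentence), t + m + 1)):
        if j != t:
            pairs.append((center, sentence[j]))

print(pairs[:5])
# [('the', 'cute'), ('the', 'grey'), ('cute', 'the'), ('cute', 'grey'), ('cute', 'cat')]
```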

Word Embeddings: Skip-gram
Figures from Lena Voita

Word Embeddings: Skip-gram
For each position $t = 1, 2, \ldots, T$ in the corpus, Skip-gram predicts the context words within an $m$-sized window ($\theta$ denotes the variables to be optimized):
Likelihood: $L(\theta) = \prod_{t=1}^{T} \; \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t, \theta)$

Word Embeddings: Skip-gram
The loss is the mean NLL:
$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t, \theta)$

Word Embeddings: Skip-gram
What are the parameters ($\theta$) to be learned?
Figure from Lena Voita

Word Embeddings: Skip-gram
How to compute $P(w_{t+j} \mid w_t, \theta)$?
Figure from Lena Voita

Word Embeddings: Skip-gram
Figures from Lena Voita

Word Embeddings: Skip-gram
Train using Gradient Descent, one pair at a time, i.e., (a center word, one of its context words):
$J_{t,j}(\theta) = -\log P(\mathrm{cute} \mid \mathrm{cat}) = -\log \frac{\exp(u_{\mathrm{cute}}^T v_{\mathrm{cat}})}{\sum_{w \in Voc} \exp(u_w^T v_{\mathrm{cat}})} = -u_{\mathrm{cute}}^T v_{\mathrm{cat}} + \log \sum_{w \in Voc} \exp(u_w^T v_{\mathrm{cat}})$
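A minimal numpy sketch of this per-pair loss, assuming two randomly initialized embedding matrices (the vocabulary, embedding dimension, and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "cute", "grey", "sat", "mat"]    # toy vocabulary (illustrative)
idx = {w: i for i, w in enumerate(vocab)}
N = 8                                            # embedding dimension (illustrative)

V = rng.normal(scale=0.1, size=(len(vocab), N))  # central-word vectors v_w
U = rng.normal(scale=0.1, size=(len(vocab), N))  # context-word vectors u_w

def pair_loss(center: str, context: str) -> float:
    """-log P(context | center) with a full softmax over the vocabulary."""
    scores = U @ V[idx[center]]                  # u_w^T v_center for every word w
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return -log_probs[idx[context]]

print(pair_loss("cat", "cute"))
```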

Word Embeddings: Skip-gram
Figure from Lena Voita

Word Embeddings: Skip-gram
Training is slow (for each central word, all the context words need to be updated)
Negative sampling: not all the context words are considered, but only a random sample of them
Training over a large corpus leads to sufficient updates for each vector
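A minimal sketch of the negative-sampling variant of the per-pair loss, reusing the toy matrices V, U and index idx from the previous sketch (the number of negatives and the uniform sampling are simplifications; Word2Vec samples negatives from a smoothed unigram distribution):

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss_neg(center: str, context: str, k: int = 2) -> float:
    """Negative-sampling loss for one (center, context) pair.

    Instead of a softmax over the whole vocabulary, the true context word is
    scored against k randomly sampled 'negative' words.
    """
    v_c = V[idx[center]]
    pos = -np.log(sigmoid(U[idx[context]] @ v_c))
    # Uniform sampling for simplicity; we also do not exclude the true context word
    negatives = rng.choice(len(vocab), size=k, replace=False)
    neg = -np.sum(np.log(sigmoid(-U[negatives] @ v_c)))
    return pos + neg

print(pair_loss_neg("cat", "cute"))
```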

Word Embeddings: Skip-gram
Figure from Lena Voita

Word Embeddings: Skip-gram
Can be viewed as a Neural Network

Word Embeddings: Skip-gram
1. $W_{N \times m}$ is $W_{word}$ (used for representing the words)
2. $W'_{m \times N}$ is $W_{context}$ (may be ignored after the training)
3. Some works showed that averaging the word and context vectors may be more beneficial
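In practice one rarely trains this from scratch; a hedged usage sketch with the gensim library, not taken from the slides (assuming gensim 4.x parameter names and a toy placeholder corpus):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (placeholder data)
sentences = [
    ["the", "cute", "grey", "cat", "sat", "on", "the", "mat"],
    ["the", "cute", "brown", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects Skip-gram (sg=0 would be CBoW); negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50)

print(model.wv["cat"][:5])            # the learned word vector for 'cat'
print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space
```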

Bag of Words (BoW)
1. Bag of Words: collection and frequency of words

CBoW
1. Considers the embeddings of the 'h' words before and 'h' words after the target word
2. Adds them (order is lost) for predicting the target word

CBoW

CBoW
1. Size of the vocabulary = m
2. Dimension of the embeddings = N

Word Embeddings: CBoW
1. The input layer $W_{N \times m}$ (embeddings for the context words) projects the context (the sum of the one-hot vectors of all the context words) into $N$-dimensional space
2. The representations of all the $2h$ words in the context are summed (the result is an $N$-dimensional context vector)

Word Embeddings: CBoW
1. The next layer has a weight matrix $W'_{m \times N}$ (embeddings for the center words)
2. It projects the accumulated embedding onto the vocabulary

Word Embeddings: CBoW
1. $m$-way classification (after a softmax): maximizes the probability of the target word
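A minimal numpy sketch of a CBoW forward pass for one window, following the shapes above ($W$ is $N \times m$, $W'$ is $m \times N$; the vocabulary, context, and random initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cute", "grey", "cat", "sat", "on", "mat"]   # toy vocabulary
idx = {w: i for i, w in enumerate(vocab)}
m, N = len(vocab), 8                                          # vocab size, embedding dim

W = rng.normal(scale=0.1, size=(N, m))        # context embeddings (input layer)
W_prime = rng.normal(scale=0.1, size=(m, N))  # center-word embeddings (output layer)

context = ["the", "cute", "cat", "sat"]  # the 2h words around the held-out target "grey"
x = np.zeros(m)
for w in context:
    x[idx[w]] += 1.0                      # sum of the one-hot vectors of the context words

h_vec = W @ x                             # N-dim accumulated context vector
scores = W_prime @ h_vec                  # project onto the vocabulary (m scores)
probs = np.exp(scores) / np.sum(np.exp(scores))  # softmax -> m-way classification

print(vocab[int(np.argmax(probs))], probs.max())
```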

Word Embeddings: CBoW
1. $W_{N \times m}$ is $W_{context}$
2. $W'_{m \times N}$ is $W_{word}$

GloVe
1. GloVe: Global Vectors
2. Combines the count-based and prediction-based approaches

GloVe
1. $X_{ij}$ in the co-occurrence matrix encodes the global information about words $i$ and $j$: $p(j \mid i) = \frac{X_{ij}}{X_i}$
2. GloVe attempts to learn representations that are faithful to the co-occurrence information

GloVe
1. GloVe attempts to learn representations that are faithful to the co-occurrence information
2. Try to enforce $v_i^T c_j = \log P(j \mid i) = \log X_{ij} - \log X_i$, where $v_i$ is the central representation of word $i$ and $c_j$ is the context representation of word $j$
3. Similarly, $v_j^T c_i = \log P(i \mid j) = \log X_{ij} - \log X_j$ (the aim is to learn such embeddings $v_i$ and $c_i$)

GloVe
1. To realize the exchange symmetry of $v_i^T c_j = \log P(j \mid i) = \log X_{ij} - \log X_i$, we may capture $\log X_i$ as a bias $b_i$ of the word $w_i$, and add an additional term $\tilde{b}_j$
2. $v_i^T c_j + b_i + \tilde{b}_j = \log X_{ij}$
3. Since $\log X_i$ and $\log X_j$ depend on the words $i$ and $j$, they can be treated as word-specific (learnable) biases

GloVe
1. The learning objective becomes
   $\underset{v_i, c_j, b_i, \tilde{b}_j}{\mathrm{argmin}} \; J = \sum_{i,j} \left( v_i^T c_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

GloVe
1. The learning objective becomes
   $\underset{v_i, c_j, b_i, \tilde{b}_j}{\mathrm{argmin}} \; J = \sum_{i,j} \left( v_i^T c_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$
2. Many of the entries in the co-occurrence matrix are zeros (noisy or less informative)
3. This suggests applying a weight to each term
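A minimal sketch of the weighted GloVe objective for a single $(i, j)$ pair, using the weighting function $f(x) = \min(1, (x / x_{\max})^{\alpha})$ from the GloVe paper (the embedding values and bias values below are illustrative placeholders):

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare co-occurrences, caps frequent ones at 1."""
    return np.minimum(1.0, (x / x_max) ** alpha)

def glove_pair_loss(v_i, c_j, b_i, b_j, X_ij):
    """Weighted squared error for one (word i, context j) entry of the co-occurrence matrix."""
    if X_ij == 0:                      # zero entries contribute nothing (f(0) = 0)
        return 0.0
    err = v_i @ c_j + b_i + b_j - np.log(X_ij)
    return f(X_ij) * err ** 2

# Illustrative values
rng = np.random.default_rng(0)
v_i, c_j = rng.normal(size=8), rng.normal(size=8)
print(glove_pair_loss(v_i, c_j, 0.1, -0.2, X_ij=12.0))
```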

GloVe
Figure from Lena Voita

Evaluating the embeddings
1. Intrinsic: studying their internal properties (how well they capture meaning: word similarity, analogies, etc.)
2. Extrinsic: studying how well they perform on a downstream task

Analysing the embeddings
1. Walking the semantic space
2. Structure: words form clusters (nearest neighbours have similar meanings), and there is linear structure
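A minimal sketch of the kind of intrinsic check this refers to: nearest neighbours and the classic analogy test via vector arithmetic (the embedding matrix here is a random placeholder; in practice one would load pre-trained vectors such as Word2Vec or GloVe):

```python
import numpy as np

# E: (|V| x d) matrix of word embeddings; random values stand in for pre-trained vectors
vocab = ["king", "queen", "man", "woman", "cat", "dog"]
idx = {w: i for i, w in enumerate(vocab)}
E = np.random.default_rng(0).normal(size=(len(vocab), 50))

def nearest(vec, exclude=()):
    """Return the word whose embedding has the highest cosine similarity to vec."""
    sims = E @ vec / (np.linalg.norm(E, axis=1) * np.linalg.norm(vec) + 1e-8)
    for w in exclude:
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# Analogy test: with trained embeddings, king - man + woman tends to land near 'queen';
# with the random placeholder vectors above, the answer is of course meaningless.
query = E[idx["king"]] - E[idx["man"]] + E[idx["woman"]]
print(nearest(query, exclude=("king", "man", "woman")))
```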

GloVe
Figures from Lena Voita