Large Language Models - From RNN to BERT

ATPowr · 83 slides · Nov 18, 2023

About This Presentation

Large Language Models


Slide Content

Language Models
From RNN to BERT

LI Yanran

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

Language Model
• What is a language model
  - Estimating the joint probability over a word sequence
  - e.g. "How are you?"
  - Calculate P(w1, w2, w3, w4)
(Diagram: Word Sequence → Language Model → Probability)

Language Model
• What is a language model
  - Estimating the joint probability over a word sequence
  - e.g. "How are you?"
  - Calculate P(w1, w2, w3, w4)
  - Predicting the next word
  - e.g. "How are ________? (you)": the previous words are the context, and the blank is the word being predicted.

Estimating Language Model
• What is a language model
  - Estimating the joint probability over a word sequence
  - e.g. "How are you?"
  - Calculate P(w1, w2, w3, w4)
  - Using the Chain Rule:
    P(w1, w2, …, wn)
      = P(wn | w1, …, wn-1) P(w1, …, wn-1)
      = P(wn | w1, …, wn-1) P(wn-1 | w1, …, wn-2) P(w1, …, wn-2)
      = …
  - Take the example above:
    P(how are you ?) = P(how) P(are | how) P(you | how are) P(? | how are you)
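As a minimal sketch of this chain-rule decomposition (the function and the `cond_log_prob` callback are illustrative placeholders, not code from the slides):

```python
def sentence_log_prob(tokens, cond_log_prob):
    """Chain rule: log P(w1, ..., wn) = sum_i log P(wi | w1, ..., wi-1)."""
    total = 0.0
    for i, word in enumerate(tokens):
        # cond_log_prob(word, context) stands in for any model that
        # returns log P(word | context); it is a hypothetical callback.
        total += cond_log_prob(word, tokens[:i])
    return total

# e.g. P(how are you ?) = P(how) P(are|how) P(you|how are) P(?|how are you)
```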

Estimating Language Model
• What is a language model
  - Estimating the joint probability over a word sequence
  - e.g. "How are you?"
  - Calculate P(w1, w2, w3, w4)
  - Using the Chain Rule:
    P(w1, w2, …, wn)
      = P(wn | w1, …, wn-1) P(w1, …, wn-1)
      = P(wn | w1, …, wn-1) P(wn-1 | w1, …, wn-2) P(w1, …, wn-2)
      = …
  - Estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities.

N-gram Language Model
• Markov Assumption
  - Limited History: for k ≥ 0,
    P(wi | w1, …, wi-1) ≈ P(wi | wi-k, …, wi-1)
  - Estimating the k-th order Markov Model
  - Instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

A Bi-gram Example
(Figure) Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.
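A bigram model can be estimated directly from counts via maximum likelihood, P(w2 | w1) = count(w1, w2) / count(w1); a small sketch (the function name and toy corpus are made up for illustration):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Maximum-likelihood bigram estimates from a list of token lists."""
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]          # sentence boundary markers
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram_counts[w1] += 1
            bigram_counts[w1][w2] += 1
    return {w1: {w2: c / unigram_counts[w1] for w2, c in nexts.items()}
            for w1, nexts in bigram_counts.items()}

probs = train_bigram([["how", "are", "you", "?"]])
print(probs["how"]["are"])   # 1.0 on this one-sentence toy corpus
```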

N-gram Language Model
• Markov Assumption
  - Limited History: for k ≥ 0,
    P(wi | w1, …, wi-1) ≈ P(wi | wi-k, …, wi-1)
  - Estimating the k-th order Markov Model
  - This is problematic!
    - Zero counts
    - Tradeoff between computation and performance

Evaluating Language Model
• Traditional Intrinsic Metric: Perplexity
  - The higher the conditional probability of the word sequence, the lower the perplexity.
  - Training an LM amounts to minimizing perplexity.
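The perplexity formula itself appears only as an image in the slides; for reference, the standard definition (as in Jurafsky & Martin) is:

```latex
\mathrm{PP}(W) \;=\; P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
            \;=\; \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}}
```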

Evaluating Language Model
• Traditional Intrinsic Metric: Perplexity
• Extrinsic vs. Intrinsic
  - Extrinsic: embedding a trained LM in an application and measuring how much the application improves, i.e. evaluating the outputs of the downstream application models.
  - Intrinsic: evaluating the internal representations learned by a trained LM.
• A good LM should provide useful features!

Evaluating Language Model
• Useful features are those for language difficulties
  - Contextual Information for Ambiguity
    - e.g. "May may quit in May."
  - Dependency Information for Understanding
    - e.g. "Before was was was, was was is."
  - Background Knowledge for Understanding

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

Neural LM
• Unlimited History: Recurrent Neural Networks
  - RNNs are neural nets that can deal with sequences of variable length. They are able to do this by defining a recurrence relation over timesteps.
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.

Neural LM
• Unlimited History: Recurrent Neural Networks
  - The activation value of the hidden layer depends on the current input as well as the activation value of the hidden layer from the previous time step.
  - In this way, recurrent neural language models (RNNLMs) avoid the limited-context constraint inherent in traditional N-gram models, since the hidden state embodies information about all of the preceding words, all the way back to the beginning of the sequence.
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.
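A minimal NumPy sketch of that recurrence, in the standard Elman form h_t = g(U h_{t-1} + W x_t) (the weight names and helper functions are generic, not taken from the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One simple RNN step: h_t = tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

def run_rnn(xs, h0, W, U, b):
    """Unroll the recurrence over a variable-length input sequence."""
    h = h0
    for x_t in xs:                      # one step per token
        h = rnn_step(x_t, h, W, U, b)
    return h                            # final hidden state summarizes the whole prefix
```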

Neural LM
• Unlimited History: Recurrent Neural Networks
(Figures) Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.

RNN Variants
• Gated RNNs
  - Credit assignment issue
  - Vanishing gradient issue
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.

RNN Variants
• Gated RNNs
• Bi-directional RNNs
(Figures) Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.

Magic and Reality
• Magic: Flexible
  - Complex networks can be treated as modules that can be combined in creative ways.
  - Naturally suited for variable-length sequence processing
  - Theoretically handles long-term dependencies
• Reality: hard to train
  - Long-range dependencies are still tricky, despite gating
  - Only limited contextual information is captured
  - Loss of hierarchy
  - Sequentiality prohibits parallelization within instances
  - Training is slow

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

ELMo
• Embeddings from Language Models: ELMo
  - Learn word embeddings by building a bidirectional LM
• Proposes a new type of deep contextualised word representations (ELMo) that model:
  - Complex characteristics of word use (e.g., syntax and semantics)
  - How these uses vary across linguistic contexts (i.e., to model polysemy)

ELMo
• Embeddings from Language Models: ELMo
  - Learn word embeddings by building a bidirectional LM
• ELMo representations are:
  - Contextual: the representation of each word depends on the entire context in which it is used.
  - Deep: the word representations combine all layers of a deep pre-trained neural network.
  - Task-specific (two-stage):
    - The word representation is a linear combination of the corresponding hidden layers.
    - A downstream task learns the weighting parameters.
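A small sketch of that task-specific combination, i.e. a softmax-weighted sum of the biLM layer representations scaled by a task-learned factor (argument names are illustrative; the real ELMo implementation differs in detail):

```python
import numpy as np

def elmo_embedding(layer_states, s, gamma):
    """Combine the biLM layers for one token.

    layer_states: array of shape (L, dim), one hidden state per biLM layer.
    s:            task-learned layer scores (length L), softmax-normalized below.
    gamma:        task-learned global scale.
    """
    w = np.exp(s - np.max(s))
    w = w / w.sum()                                  # softmax over layers
    return gamma * (w[:, None] * layer_states).sum(axis=0)
```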

ELMo
(Figures) Credit: The Illustrated BERT, ELMo, and co. Jay Alammar.

Two-stage Feature Ensembling
• Feature-based Pre-training
  - ELMo can be integrated into almost all neural NLP tasks by simple concatenation to the embedding layer
Credit: Deep contextualized word representations. NAACL 2018.
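Concretely, the feature-based recipe is just concatenation at the input layer; a tiny illustration with made-up dimensions:

```python
import numpy as np

word_emb = np.random.randn(300)     # e.g. a static GloVe embedding of one token
elmo_emb = np.random.randn(1024)    # frozen ELMo vector for the same token in context
task_input = np.concatenate([word_emb, elmo_emb])   # fed to the downstream task model
print(task_input.shape)             # (1324,)
```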

ELMo Evaluation
• Feature-based Pre-training
  - ELMo can be integrated into almost all neural NLP tasks by simple concatenation to the embedding layer
  - The higher layers seemed to learn semantics, while the lower layers probably captured syntactic features
Credit: Deep contextualized word representations. NAACL 2018.

ELMo Features
• ELMo representations are:
  - Contextual
  - Deep
  - Task-specific (two-stage)
• Compared with standard RNNLMs, ELMo
  - Empirically captures contextual information more fully
  - Captures more abstract linguistic characteristics in its higher layers
  - But is still hard to parallelize

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

Transformers
• Attention is All You Need
  - The Transformer is a purely attention-based model consisting of:
    - an encoding component that is a stack of encoders, and
    - a decoding component that is a stack of the same number of decoders.
• Transformer Block
  - The encoders are identical Transformer Blocks.
  - The encoder architecture is also used in BERT.

Transformer Block
(Figures) Credit: The Illustrated Transformer. Jay Alammar.

Transformer Block
• Attention is All You Need
  - Attention: indexing/referring to contents using similarity
    - e.g. Encoder-Decoder Attention
    - Self-Attention
  - Convolution vs. Self-Attention
  - Self-Attention (similar to convolution):
    - Constant path length between any two positions
    - Variable-sized receptive field
    - Gating/multiplication enables crisp error propagation
    - Trivial to parallelize (per layer)
    - Long-distance context has "equal opportunity"
  - Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we're currently processing.

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Dot-Product Attention
  - Three Abstraction Vectors
Credit: The Illustrated Transformer. Jay Alammar.

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Dot-Product Attention
  - Three Abstraction Vectors
    - Multiplying x1 by the W^Q weight matrix produces q1, the "query" vector associated with that word.
    - We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
Credit: The Illustrated Transformer. Jay Alammar.

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Dot-Product Attention
  - Three Abstraction Vectors
  - Score
Credit: The Illustrated Transformer. Jay Alammar.

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Scaled Dot-Product Attention
  - Three Abstraction Vectors
  - Score
  - Weighted Sum

Self-Attention Layer
• Dot-Product Attention
  - Three Abstraction Vectors
  - Score
  - Weighted Sum
Credit: The Illustrated Transformer. Jay Alammar.
Credit: Dissecting BERT Part 1: The Encoder.

Self-Attention Layer
(Figure) Credit: Dissecting BERT Part 1: The Encoder.
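Putting the query/key/value projections, scoring, and weighted sum together, a minimal NumPy sketch of scaled dot-product self-attention (matrix names and sizes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # query/key/value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])             # score every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                  # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                             # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)   # (4, 8)
```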

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Scaled Dot-Product Attention
  - Three Abstraction Vectors
  - Score
  - The weighted sum cannot distinguish what information came from where
    - Convolution can: it applies a different linear transformation for each relative position.

Self-Attention Layer
• Multi-Head Self-Attention
  - Multiple attention layers (heads) in parallel (shown by different colors)
  - Each head uses different linear transformations
  - Different heads can learn different relationships
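A sketch of the multi-head variant, where each head has its own projections and the concatenated head outputs are mixed by a final linear map (the parameter layout is illustrative, not the paper's exact formulation):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, as in the previous sketch."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_self_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) tuples, one per head; Wo mixes the concatenation."""
    head_outputs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(head_outputs, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                              # 4 tokens, d_model = 8
heads = [tuple(rng.normal(size=(8, 2)) for _ in range(3)) for _ in range(4)]  # 4 heads, d_k = 2
Wo = rng.normal(size=(8, 8))                             # 4 heads * 2 dims -> back to 8
print(multi_head_self_attention(X, heads, Wo).shape)     # (4, 8)
```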

Self-Attention Layer
• Multi-Head Self-Attention
(Figures) Credit: The Illustrated Transformer. Jay Alammar.

Self-Attention Layer
(Attention visualization) Visualized using the Tensor2Tensor notebook.

Transformer Evaluation
• Results on WMT-14

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

BERT
• Bidirectional Encoder Representations from Transformers: BERT
  - BERT ≈ Encoder of the Transformer
• Combining the Two Worlds:
  - BERT ≈ ELMo + Transformer
  - ELMo's language model was bi-directional, but based on (slow) RNNs
  - The Transformer purely relies on (parallelized) attention, but only trains a forward language model
  - Why: language models use only left context or right context, but language understanding is bidirectional

BERT
• Handling the bi-directionality:
  - The Transformer's encoder is only a forward language model
  - If it were to perform bidirectional self-attention, the model would learn that the next word in the sentence is the target and would always predict it, with 100% accuracy.
  - BERT solves this problem using two "new" paradigms:
    - Masked Language Model
    - Next Sentence Prediction (Two-Sentence Tasks)

Masked Language Model
• Predict random words from within the sequence
(Figure) Credit: The Illustrated BERT. Jay Alammar.

Masked Language Model
• Predict random words from within the sequence
  - In this specific case, 15% of the words fed in as input were masked.
  - Too little masking: too expensive to train
  - Too much masking: not enough context
Credit: The Illustrated BERT. Jay Alammar.

Masked Language Model
• Predict random words from within the sequence
  - In this specific case, 15% of the words fed in as input were masked.
  - But not all tokens were masked in the same way:
    - 80% of the time, replace with [MASK]: went to the store → went to the [MASK]
    - 10% of the time, replace with a random word: went to the store → went to the running
    - 10% of the time, keep the same: went to the store → went to the store
Credit: The Illustrated BERT. Jay Alammar.
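A small sketch of that 80/10/10 masking rule (the function name and vocabulary are illustrative, not BERT's actual preprocessing code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Pick ~15% of positions as prediction targets, then apply the 80/10/10 rule."""
    masked, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = token                    # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random word
            # else: 10%: keep the original token unchanged
    return masked, targets

print(mask_tokens(["went", "to", "the", "store"], vocab=["running", "store", "dog"]))
```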

Masked Language Model
• Predict random words from within the sequence
  - In this specific case, 15% of the words fed in as input were masked.
  - But not all tokens were masked in the same way, because:
    - If the model had been trained only on predicting [MASK] tokens and then never saw this token during fine-tuning, it would have thought that there was no need to predict anything.
    - The model would have only learned a contextual representation of the [MASK] token, and this would have made it learn slowly (since only 15% of the input tokens are masked).
Credit: The Illustrated BERT. Jay Alammar.

Next Sentence Prediction
• To learn relationships between sentences
  - Predict whether Sentence B is the actual sentence that follows Sentence A, or a random one
  - BERT ≈ Encoder of the Transformer, with input specifics:
    - [CLS]
    - [SEP]
Credit: The Illustrated BERT. Jay Alammar.
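For illustration, one way the two-sentence input is typically packed (the token strings are made up; real BERT uses WordPiece subwords):

```python
# Sentence A: "how are you ?"   Sentence B: "i am fine ."
tokens      = ["[CLS]", "how", "are", "you", "?", "[SEP]", "i", "am", "fine", ".", "[SEP]"]
segment_ids = [0,       0,     0,     0,     0,   0,       1,   1,    1,      1,   1]
is_next     = 1   # label: 1 if Sentence B actually follows Sentence A, 0 if sampled at random
```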

Next Sentence Prediction
(Figure) Credit: The Illustrated BERT. Jay Alammar.

Two-Stage Fine-Tuning
• Fine-tuning for downstream tasks
  - The BERT paper shows a number of ways to use BERT for different tasks.
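For reference only (the slides do not show code), a common way to load a pre-trained BERT with a fresh classification head is the Hugging Face `transformers` library; a minimal sketch:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained encoder + a newly initialized task head (e.g. 2 labels).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The tokenizer inserts [CLS]/[SEP] and the segment ids for a sentence pair.
inputs = tokenizer("How are you?", "I am fine.", return_tensors="pt")
logits = model(**inputs).logits    # output of the task head; fine-tune with the usual loss
```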

BERT Evaluation
• GLUE Results
  - 8 GLUE classification tasks were used to assess performance.
  - BERT-base not only beat OpenAI GPT on all tasks, achieving SOTA, but improved SOTA by an impressive 5% on average. BERT-large beat BERT-base on all tasks as well.

BERT Features
• Intrinsic Representations
  - We can also use the BERT contextualized word embeddings.
(Figures) Credit: The Illustrated BERT. Jay Alammar.

BERT Features
• Intrinsic Representations
  - An important detail of BERT is the preprocessing used for the input text: [CLS], [SEP]
Credit: What Does BERT Look At? An Analysis of BERT's Attention. BlackBoxNLP 2019.

Outline
• Language Model
• Neural Language Model: RNN
• Contextual Embeddings: ELMo
• Parallelizable: Transformers
• Milestone: BERT!
• What's Next?
• How to Use BERT?
• How to Improve BERT?
• What Are the Hands-on Experiences?

Syllabus
• Language Model
• Machine Translation
• Sentiment Analysis
• Machine Comprehension
• Knowledge Graph
• Text Generation

Thank you!
Questions?
Course Registration WeChat Group